A comprehensive, production-ready implementation for training and deploying a 4.9 billion parameter transformer language model. This framework incorporates state-of-the-art architectures, optimization techniques, and training datasets used by leading AI research organizations.
- Overview
- Architecture
- Installation
- Quick Start
- Training Data
- Model Configurations
- Training Pipeline
- Optimization Techniques
- Hardware Requirements
- Advanced Usage
- Evaluation and Benchmarking
- API and Deployment
- Troubleshooting
- Contributing
- License
iLLuMinator 4.9B represents a cutting-edge approach to large language model training, incorporating the latest advances in transformer architectures, optimization algorithms, and training methodologies. The framework is designed for both research and production environments, with comprehensive support for various hardware configurations and deployment scenarios.
- Advanced Architecture: Enhanced transformer with Grouped Query Attention (GQA), Rotary Position Embedding (RoPE), SwiGLU activation, and RMSNorm
- Production-Ready: Comprehensive error handling, logging, monitoring, and checkpointing
- Memory Efficient: Advanced memory optimization techniques including gradient checkpointing and mixed precision training
- Scalable Training: Support for distributed training across multiple GPUs and nodes
- Professional Datasets: Integration with high-quality datasets used to train leading AI models
- Flexible Deployment: Multiple deployment options including API servers, interactive clients, and batch processing
The iLLuMinator 4.9B model implements a state-of-the-art transformer architecture with several key innovations:
| Component | Configuration | Description |
|---|---|---|
| Parameters | 4.9 billion | Optimized for performance/efficiency balance |
| Layers | 32 transformer blocks | Deep architecture for complex reasoning |
| Attention Heads | 32 (8 KV heads) | Grouped Query Attention for efficiency |
| Model Dimension | 4096 | Hidden state dimension |
| Feed-Forward Dimension | 14336 | Enhanced feed-forward network |
| Context Length | 4096 tokens | Extended context for long-form content |
| Vocabulary | 65536 tokens | Enhanced tokenization coverage |
- Grouped Query Attention (GQA)
  - Reduces memory usage during inference
  - Maintains model quality while improving efficiency
  - Introduced by Google Research and adopted in Meta's LLaMA models
- Rotary Position Embedding (RoPE)
  - Superior handling of positional information
  - Better extrapolation to longer sequences
  - Eliminates the need for absolute position embeddings
- SwiGLU Activation Function
  - Improved performance over traditional GELU
  - Based on "GLU Variants Improve Transformer" research
  - Used in PaLM, LLaMA, and other leading models
- RMSNorm Normalization
  - More stable training than LayerNorm
  - Faster computation and better numerical stability
  - Standard in modern large language models (see the sketch after this list)
- Memory Optimizations
  - Gradient checkpointing for reduced memory usage
  - Mixed precision training (FP16/BF16)
  - Efficient attention implementations
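To make the last two components concrete, here is a minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward block using the 4.9B dimensions. This is an illustrative sketch only; the repository's actual implementations live in models/illuminator_model.py and may differ in detail.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: silu(x W1) * (x W3), projected back by W2."""
    def __init__(self, d_model: int = 4096, d_ff: int = 14336):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # value projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))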
- Python 3.8 or higher
- CUDA 11.8+ (for GPU training)
- 16GB+ RAM (32GB+ recommended)
- 100GB+ free disk space
# Clone the repository
git clone https://github.com/Anipaleja/iLLuMinator-4.9B.git
cd iLLuMinator-4.9B
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Optional: Install additional optimizations
pip install flash-attn # For flash attention (requires compatible GPU)
pip install deepspeed # For distributed training
# Build Docker container
docker build -t illuminator-4.9b .
# Run container with GPU support
docker run --gpus all -it illuminator-4.9b
# 1. Install dependencies
pip install -r requirements.txt
# 2. Start training with default configuration
python train_comprehensive.py
# 3. Monitor training progress
tail -f training_logs/training.log
# Training with custom configuration
python train_comprehensive.py \
--model-size 4.9B \
--batch-size 2 \
--learning-rate 3e-4 \
--epochs 10 \
--save-every 1000
# Apple Silicon optimized training
python train_4.9b_apple_silicon.py
# Training with external datasets
python practical_model/train_enhanced.py --use-external-data
The iLLuMinator 4.9B model is trained on a carefully curated dataset that mirrors the high-quality data used by leading AI research organizations. Our training corpus includes:
- OpenOrca (4.5M examples)
  - High-quality instruction-following data
  - Filtered and deduplicated from the FLAN collection
  - Emphasizes reasoning and complex instruction following
- UltraChat (1.57M examples)
  - Multi-turn conversational data
  - Covers diverse topics and conversation styles
  - Includes context-aware dialogue training
- WizardLM (196K examples)
  - Evolved instruction-following examples
  - Complex reasoning tasks and multi-step problems
  - Generated through the Evol-Instruct methodology
- Stanford Alpaca (52K examples)
  - High-quality instruction-following dataset
  - Human-curated and verified examples
  - Covers a broad range of tasks and domains
- Code Datasets
  - GitHub repositories with permissive licenses
  - Stack Overflow Q&A pairs
  - Programming tutorials and documentation
  - Multiple programming languages
- Scientific Literature
  - ArXiv papers and abstracts
  - Wikipedia scientific articles
  - Educational content and textbooks
  - Technical documentation
- Web Corpus (Filtered)
  - CommonCrawl data (extensively filtered)
  - Reddit discussions (high-quality subreddits)
  - News articles and journalism
  - Reference materials and documentation
# Example of our data processing pipeline; the helper functions are
# high-level placeholders for the stages described below
def process_training_data():
    # 1. Load raw datasets
    datasets = load_multiple_sources()

    # 2. Quality filtering
    filtered_data = apply_quality_filters(datasets)

    # 3. Deduplication
    deduplicated_data = remove_duplicates(filtered_data)

    # 4. Format standardization
    formatted_data = standardize_format(deduplicated_data)

    # 5. Tokenization and chunking
    tokenized_data = tokenize_and_chunk(formatted_data)

    return tokenized_data
- Toxicity Filtering: Advanced toxicity detection and removal
- PII Removal: Comprehensive personally identifiable information filtering
- Language Detection: Multi-language support with quality thresholds
- Duplicate Detection: Advanced near-duplicate detection algorithms
- Content Verification: Automated fact-checking and consistency validation
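To make two of these stages concrete, here is a hedged sketch of what the deduplication and quality-filtering helpers from the pipeline above could look like. The function bodies and thresholds (min_chars, max_chars) are illustrative assumptions, not the framework's actual settings.

import hashlib

def remove_duplicates(examples):
    # Exact deduplication via content hashing; the near-duplicate
    # detection described above would add MinHash/LSH on top of this.
    seen, unique = set(), []
    for text in examples:
        digest = hashlib.sha256(text.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

def apply_quality_filters(examples, min_chars=32, max_chars=100_000):
    # Keep examples within a length band; the real pipeline also scores
    # toxicity, detects language, and strips PII.
    return [t for t in examples if min_chars <= len(t) <= max_chars]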
The framework supports multiple model configurations for different use cases:
| Configuration | Parameters | Use Case | Hardware Requirement |
|---|---|---|---|
| iLLuMinator-4.9B | 4.9B | Full model | 24GB+ VRAM |
| iLLuMinator-2.8B | 2.8B | Medium scale | 16GB+ VRAM |
| iLLuMinator-1.3B | 1.3B | Efficient | 8GB+ VRAM |
| iLLuMinator-Practical | 120M | Development/Testing | 4GB+ VRAM |
# Example: 4.9B configuration
config_4_9b = {
    "vocab_size": 65536,
    "d_model": 4096,
    "n_layers": 32,
    "n_heads": 32,
    "n_kv_heads": 8,
    "d_ff": 14336,
    "max_seq_length": 4096,
    "dropout": 0.0,
    "tie_embeddings": True,
    "use_rope": True,
    "use_swiglu": True,
    "use_rmsnorm": True
}
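The n_heads / n_kv_heads pair above is what drives Grouped Query Attention: 32 query heads share 8 key/value heads. A minimal sketch of that sharing follows; it is illustrative only, and the repository's actual attention module may differ.

import torch.nn.functional as F

def gqa_attention(q, k, v, n_heads=32, n_kv_heads=8):
    # q: [batch, n_heads, seq, head_dim]; k, v: [batch, n_kv_heads, seq, head_dim]
    group = n_heads // n_kv_heads          # 4 query heads per KV head
    k = k.repeat_interleave(group, dim=1)  # expand KV heads to match queries
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)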
- Data Preparation
  - Dataset loading and preprocessing
  - Tokenization and formatting
  - Data validation and quality checks
- Model Initialization
  - Architecture setup
  - Weight initialization
  - Optimizer configuration
- Training Loop
  - Forward pass computation
  - Loss calculation
  - Backward pass and optimization
  - Checkpointing and logging
- Evaluation and Validation
  - Perplexity calculation (see the sketch after this list)
  - Benchmark evaluations
  - Model quality assessment
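For reference, a hedged sketch of the perplexity calculation over a held-out set, assuming a model that returns logits of shape [batch, seq, vocab]; the repository's evaluation script may compute it differently.

import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, batches):
    total_loss, total_tokens = 0.0, 0
    for input_ids in batches:
        logits = model(input_ids)
        loss = F.cross_entropy(
            logits[:, :-1].flatten(0, 1),   # predict token t+1 from token t
            input_ids[:, 1:].flatten(),
            reduction="sum",
        )
        total_loss += loss.item()
        total_tokens += input_ids[:, 1:].numel()
    return math.exp(total_loss / total_tokens)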
# Main training scripts
train_comprehensive.py # Full training pipeline
train_4.9b_apple_silicon.py # Apple Silicon optimized
train_enhanced.py # Enhanced with validation
train_mfu.py # MFU optimized training
# Training hyperparameters
training_config = {
    "learning_rate": 3e-4,
    "weight_decay": 0.1,
    "beta1": 0.9,
    "beta2": 0.95,
    "eps": 1e-8,
    "warmup_steps": 2000,
    "max_steps": 100000,
    "gradient_clip": 1.0,
    "batch_size": 2,
    "gradient_accumulation_steps": 32,
    "save_every": 1000,
    "eval_every": 500
}
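As an illustration of how these hyperparameters map onto an optimizer and schedule, here is a minimal PyTorch sketch. It assumes AdamW with linear warmup followed by cosine decay; the actual training scripts may wire this differently.

import math
import torch

def build_optimizer_and_scheduler(model, cfg):
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=cfg["learning_rate"],
        betas=(cfg["beta1"], cfg["beta2"]),
        eps=cfg["eps"],
        weight_decay=cfg["weight_decay"],
    )
    def lr_lambda(step):
        # Linear warmup, then cosine decay to zero at max_steps.
        if step < cfg["warmup_steps"]:
            return step / max(1, cfg["warmup_steps"])
        progress = (step - cfg["warmup_steps"]) / max(
            1, cfg["max_steps"] - cfg["warmup_steps"])
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Usage: optimizer, scheduler = build_optimizer_and_scheduler(model, training_config)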
- Gradient Checkpointing
  - Trades computation for memory
  - Enables training larger models on limited hardware
  - Configurable checkpointing strategies
- Mixed Precision Training
  - Automatic Mixed Precision (AMP)
  - FP16 and BF16 support
  - Significant memory and speed improvements
- Optimizer State Sharding
  - Distributed optimizer states
  - Reduced memory footprint per GPU
  - Enabled through DeepSpeed integration
- Flash Attention
  - Memory-efficient attention computation
  - Significant speedup for long sequences
  - Automatic fallback for unsupported hardware
- Kernel Fusion
  - Fused operations where possible
  - Reduced memory bandwidth requirements
  - Custom CUDA kernels for critical operations
- Dynamic Loss Scaling
  - Prevents gradient underflow in FP16
  - Automatic scaling adjustment
  - Maintains training stability (see the AMP sketch after this list)
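Putting mixed precision, dynamic loss scaling, gradient accumulation, and gradient clipping together, here is a hedged sketch of a single training step. It assumes a CUDA device and a model that returns a scalar loss; this is not the exact loop in train_comprehensive.py.

import torch

scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling for FP16

def training_step(model, batch, optimizer, step, accum_steps=32, clip=1.0):
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(batch["input_ids"], batch["labels"])  # assumed scalar loss
        loss = loss / accum_steps                          # scale for accumulation
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.unscale_(optimizer)                         # unscale before clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        scaler.step(optimizer)                             # skipped if grads overflowed
        scaler.update()                                    # adjust the loss scale
        optimizer.zero_grad(set_to_none=True)
    return loss.detach() * accum_steps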
# Multi-GPU training setup
torchrun --nproc_per_node=4 train_comprehensive.py \
--distributed \
--backend=nccl \
--gradient-accumulation-steps=8
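A minimal sketch of the process-group setup that the torchrun command above implies; build_model is a hypothetical stand-in for the repository's model constructor.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")      # torchrun sets the rank/world-size env vars
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
model = build_model().cuda(local_rank)       # build_model: hypothetical helper
model = DDP(model, device_ids=[local_rank])  # gradients sync across the 4 processes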
After training, the model can be loaded and used for:
- Text generation
- Question answering
- Code generation
- Conversational AI
- Instruction following
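A hedged sketch of sampling from a trained model appears below. The generate helper is illustrative and assumes a model that maps token IDs to logits of shape [batch, seq, vocab]; see inference.py for the repository's actual entry point.

import torch

@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=128, temperature=0.8):
    ids = torch.tensor([tokenizer.encode(prompt)], device="cuda")
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :] / temperature  # logits for the next token
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0].tolist())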
iLLuMinator-4.9B/
├── train_4_9B_professional.py   # Main training script
├── inference.py                 # Model inference script
├── legacy/                      # Original CUDA model implementation
│   ├── illuminator_cuda.py      # 4.9B CUDA model
│   ├── illuminator_ai.py        # AI implementation
│   └── cuda_api_server.py       # API server
├── simple_illuminator.py        # Simple AI implementation
├── client.py                    # Client interface
├── chatbot_client.py            # Chatbot client
├── requirements.txt             # Dependencies
└── README.md                    # This file
- Out of Memory Errors
  # Reduce batch size or sequence length
  python train_comprehensive.py --batch-size 1 --max-seq-length 2048
  # Enable gradient checkpointing
  python train_comprehensive.py --gradient-checkpointing
- CUDA Out of Memory
  # Use mixed precision training
  python train_comprehensive.py --mixed-precision
  # Enable CPU offloading
  python train_comprehensive.py --cpu-offload
- Slow Training Speed
  # Enable flash attention
  pip install flash-attn
  python train_comprehensive.py --flash-attention
  # Use multiple GPUs
  torchrun --nproc_per_node=2 train_comprehensive.py
- Memory Usage Monitoring
  # Monitor GPU memory
  python mps-monitor.py --watch-memory
  # Memory profiling
  python train_comprehensive.py --profile-memory
- Training Speed Analysis
  # Analyze Model FLOPs Utilization
  python mfu_analyzer.py --model-path checkpoints/model.pt
# Enable debug logging
python train_comprehensive.py --debug --log-level DEBUG
iLLuMinator-4.9B/
├── README.md # This comprehensive documentation
├── requirements.txt # Python dependencies
├── LICENSE # MIT license
│
├── models/ # Model architecture definitions
│ ├── illuminator_model.py # Main 4.9B model implementation
│ ├── illuminator_practical.py # Practical 120M model
│ └── config/ # Model configuration files
│
├── training/ # Training scripts and utilities
│ ├── train_comprehensive.py # Main training script
│ ├── train_4.9b_apple_silicon.py # Apple Silicon optimized
│ ├── train_enhanced.py # Enhanced training with validation
│ ├── train_mfu.py # MFU optimized training
│ └── optimizers/ # Custom optimization modules
│
├── data/ # Data processing and loading
│ ├── dataset_integration.py # Dataset loading and processing
│ ├── tokenizer/ # Custom tokenizer implementation
│ └── datasets/ # Training datasets (created during setup)
│
├── evaluation/ # Model evaluation and benchmarking
│ ├── evaluate.py # Comprehensive evaluation script
│ ├── benchmarks/ # Standard benchmark implementations
│ └── metrics/ # Custom evaluation metrics
│
├── deployment/ # Deployment and serving
│ ├── api_server.py # REST API server
│ ├── interactive_client.py # Interactive chat client
│ ├── practical_api_server.py # Lightweight API for practical model
│ └── docker/ # Docker configurations
│
├── utils/ # Utility scripts and tools
│ ├── mps-monitor.py # Memory and performance monitoring
│ ├── mfu_analyzer.py # Model FLOPs utilization analysis
│ ├── model_comparison.py # Model architecture comparison
│ └── optimization/ # Performance optimization tools
│
├── practical_model/ # Lightweight model for development
│ ├── illuminator_practical.py # 120M parameter model
│ ├── train_practical.py # Training script for practical model
│ ├── simple_test.py # Quick testing utilities
│ └── README.md # Practical model documentation
│
├── legacy/ # Legacy implementations (deprecated)
│ ├── illuminator_cuda.py # Original CUDA implementation
│ └── cuda_api_server.py # Legacy API server
│
├── checkpoints/ # Model checkpoints (created during training)
├── logs/ # Training and evaluation logs
└── docs/ # Additional documentation
├── TRAINING_GUIDE.md # Detailed training guide
├── DEPLOYMENT_GUIDE.md # Production deployment guide
└── API_REFERENCE.md # Complete API documentation
We welcome contributions to the iLLuMinator project! Please see our contribution guidelines:
# Fork and clone the repository
git clone https://github.com/yourusername/iLLuMinator-4.9B.git
cd iLLuMinator-4.9B
# Install development dependencies
pip install -r requirements-dev.txt
# Install pre-commit hooks
pre-commit install
- Follow PEP 8 style guidelines
- Include comprehensive docstrings
- Add unit tests for new features
- Ensure backwards compatibility
- Update documentation as needed
- Create a feature branch:
git checkout -b feature-name
- Make your changes and add tests
- Run the test suite:
python -m pytest tests/
- Submit a pull request with a clear description
If you use iLLuMinator in your research, please cite:
@software{illuminator_4_9b,
  title={iLLuMinator 4.9B: Professional Language Model Training Framework},
  author={Your Name},
  year={2024},
  url={https://github.com/Anipaleja/iLLuMinator-4.9B}
}
This project is licensed under the MIT License - see the LICENSE file for details.
- Transformer architecture based on "Attention Is All You Need" (Vaswani et al.)
- Grouped Query Attention from "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (Ainslie et al.)
- RoPE implementation based on "RoFormer: Enhanced Transformer with Rotary Position Embedding" (Su et al.)
- SwiGLU activation from "GLU Variants Improve Transformer" (Shazeer)
- Training techniques inspired by PaLM, LLaMA, and GPT-4 methodologies
- Dataset processing based on LLMDataHub recommendations
For questions, issues, or feature requests:
- GitHub Issues: Submit an issue
- Discussions: Join the discussion
- Documentation: Read the docs
Status: Production Ready | Latest Version: 1.0.0 | Last Updated: January 2024