# Lab 2: Parameter Efficient Fine-Tuning (PEFT) of Large Language Models

**Course:** ID2223 / HT2025

**Students:** Sebastian Schmuelling, Ramin Darudi
The public Hugging Face Space can be found here: https://huggingface.co/spaces/schmuelling/hopsworks_chat
The models can be found here: https://huggingface.co/schmuelling

---

## Overview

This project implements Parameter Efficient Fine-Tuning (PEFT) using LoRA (Low-Rank Adaptation) to fine-tune large language models on the FineTome-100k instruction dataset. The fine-tuned models are deployed in a Retrieval-Augmented Generation (RAG) chatbot interface that enables users to query documents indexed in Hopsworks Feature Store.

### Key Features

- **PEFT Fine-Tuning**: Efficient fine-tuning using LoRA with 4-bit quantization
- **Checkpoint Management**: Automatic checkpointing to HuggingFace Hub for resumable training
- **Multiple Model Support**: Fine-tuned Llama-3.2-1B and Ministral-3-3B models (the latter was released only a few days before this lab!)
- **RAG System**: Document retrieval using Hopsworks Feature Store and FAISS
- **CPU-Optimized Inference**: GGUF format models for efficient CPU deployment
- **Interactive UI**: Gradio-based chatbot with dynamic model selection

---
## Task 1: Fine-Tune a Model and Build a UI

### 1.1 Fine-Tuning Implementation

#### Models Fine-Tuned

1. **Llama-3.2-1B-Instruct**
   - Base Model: `unsloth/Llama-3.2-1B-Instruct`
   - Fine-tuned Model: `schmuelling/Llama-3.2-1B-Instruct-finetome`
2. **Ministral-3-3B-Instruct-2512**
   - Base Model: `unsloth/Ministral-3-3B-Instruct-2512`
   - Fine-tuned Model: `schmuelling/Ministral-3-3B-Instruct-2512-finetome`

#### Fine-Tuning Process

**Dataset**: FineTome-100k (`mlabonne/FineTome-100k`)
- 100,000 instruction-following examples
- Converted from ShareGPT format to HuggingFace chat format using `standardize_sharegpt()`
- Applied model-specific chat templates (llama-3.1 for Llama, mistral for Ministral)
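The preprocessing follows the standard Unsloth recipe. A minimal sketch (assuming `tokenizer` has already been returned by `FastLanguageModel.from_pretrained`; the Ministral notebook swaps the template name for `mistral`):

```python
from datasets import load_dataset
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

# Load FineTome-100k and convert the ShareGPT-style conversations to the
# standard HuggingFace "role"/"content" message format.
dataset = load_dataset("mlabonne/FineTome-100k", split="train")
dataset = standardize_sharegpt(dataset)

# Attach the model-specific chat template ("llama-3.1" here, "mistral" for Ministral).
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

def formatting_prompts_func(examples):
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in examples["conversations"]
    ]
    return {"text": texts}

# Batched map keeps preprocessing fast on 100k examples.
dataset = dataset.map(formatting_prompts_func, batched=True)
```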
**Training Configuration**:
- **Framework**: Unsloth for memory-efficient fine-tuning
- **Quantization**: 4-bit (BitsAndBytesConfig with NF4 quantization and double quantization)
- **LoRA Configuration**:
  - Rank (r): 16
  - LoRA Alpha: 16
  - LoRA Dropout: 0
  - Target Modules: `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]`
- **Training Parameters**:
  - Max Sequence Length: 2048
  - Batch Size: 2 per device
  - Gradient Accumulation: 4 steps (effective batch size: 8)
  - Learning Rate: 2e-4
  - Learning Rate Scheduler: Linear decay
  - Warmup Steps: 20
  - Optimizer: AdamW 8-bit
  - Weight Decay: 0.001
  - Epochs: 1
  - Mixed Precision: bfloat16 on Ampere+ GPUs
  - Gradient Checkpointing: Enabled ("unsloth" mode)
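Put together, the configuration above corresponds roughly to the following Unsloth/TRL setup. This is a condensed sketch, assuming `dataset` from the preprocessing step; argument names follow the TRL version used in the Unsloth notebooks and may differ in newer releases:

```python
from unsloth import FastLanguageModel, is_bfloat16_supported
from trl import SFTTrainer
from transformers import TrainingArguments

# Load the 4-bit quantized base model.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,   # effective batch size 8
        learning_rate=2e-4,
        lr_scheduler_type="linear",
        warmup_steps=20,
        num_train_epochs=1,
        optim="adamw_8bit",
        weight_decay=0.001,
        bf16=is_bfloat16_supported(),    # bfloat16 on Ampere+ GPUs
        fp16=not is_bfloat16_supported(),
        output_dir="outputs",
    ),
)
trainer.train()
```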
**Checkpointing Strategy**:
- Automatic push to HuggingFace Hub: `schmuelling/{model_name}-checkpoint` for each model
- Resume training capability: Automatically detects and loads from checkpoint if repository exists
- Total checkpoint limit: 3 (oldest checkpoints automatically deleted)
- Checkpoint strategy: "checkpoint" (push on every save)
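The Hub checkpointing can be expressed with standard `transformers` arguments. A sketch of the pattern (the checkpoint repository name and save interval are placeholders, and `trainer` is the SFTTrainer from above):

```python
from huggingface_hub import repo_exists
from transformers import TrainingArguments

checkpoint_repo = "schmuelling/Llama-3.2-1B-Instruct-checkpoint"  # one repo per model

args = TrainingArguments(
    output_dir="outputs",
    save_steps=100,             # placeholder save interval
    save_total_limit=3,         # oldest checkpoints are deleted automatically
    push_to_hub=True,
    hub_model_id=checkpoint_repo,
    hub_strategy="checkpoint",  # push a "last-checkpoint" folder on every save
)

# Resume automatically if the checkpoint repository already exists on the Hub.
resume = repo_exists(checkpoint_repo)
trainer.train(resume_from_checkpoint="last-checkpoint" if resume else None)
```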
**Model Export**:
- **Merged 4-bit**: Merged LoRA weights into 4-bit base model for HuggingFace Transformers inference
- All models pushed to HuggingFace Hub for deployment.
- **GGUF Format**: We ran into serious dependency issues, and all attempts to convert our fine-tuned models to GGUF failed. For the Ministral model this is plausible, since it was released only a few days ago and llama.cpp support may still be incomplete; for the Llama model the failure also traced back to a dependency issue, even though that model has been available for quite a while. Because llama.cpp is the fastest inference engine for CPUs, and the free HuggingFace Spaces tier only provides a CPU, we stuck with the llama.cpp engine and used a prebuilt GGUF model published by Unsloth. We also tried plain Transformers inference, but it was far too slow.
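Loading a prebuilt Unsloth GGUF with `llama-cpp-python` looks roughly like this (repo and file names are illustrative):

```python
from llama_cpp import Llama

# Download and load a prebuilt GGUF quantization straight from the Hub.
llm = Llama.from_pretrained(
    repo_id="unsloth/Llama-3.2-1B-Instruct-GGUF",  # illustrative repo name
    filename="*Q4_K_M.gguf",                       # glob selects one quantization file
    n_ctx=2048,
    verbose=False,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is a feature store?"}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```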
### 1.2 RAG System Implementation

The RAG (Retrieval-Augmented Generation) system enables the chatbot to answer questions based on indexed documents.

**Document Indexing** (`index_content.ipynb`):
- **Document Loader**: LangChain DoclingLoader for PDF processing
- **Document**: "Building Machine Learning Systems with a Feature Store.pdf"
- **Chunking Strategy**: HybridChunker with semantic chunking
- **Embeddings Model**: `sentence-transformers/all-MiniLM-L6-v2` (384 dimensions)
- **Storage**: Hopsworks Feature Store (`book_embeddings` feature group, version 1)
- **Index**: FAISS IndexFlatIP (Inner Product with L2 normalization) for similarity search
- **Indexed Chunks**: 1,333 document chunks
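A condensed sketch of the indexing flow (loader arguments follow the langchain-docling examples, and the feature-group schema shown here is illustrative):

```python
import faiss
import hopsworks
import numpy as np
import pandas as pd
from docling.chunking import HybridChunker
from langchain_docling import DoclingLoader
from sentence_transformers import SentenceTransformer

# 1. Load and chunk the PDF with Docling's semantic HybridChunker.
loader = DoclingLoader(
    file_path="Building Machine Learning Systems with a Feature Store.pdf",
    chunker=HybridChunker(),
)
chunks = [doc.page_content for doc in loader.load()]

# 2. Embed each chunk (384-dimensional vectors).
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = embedder.encode(chunks)

# 3. Persist chunks and embeddings in the Hopsworks feature group.
project = hopsworks.login()  # reads HOPSWORKS_API_KEY
fs = project.get_feature_store()
fg = fs.get_or_create_feature_group(name="book_embeddings", version=1,
                                    primary_key=["chunk_id"])
fg.insert(pd.DataFrame({"chunk_id": range(len(chunks)),
                        "text": chunks,
                        "embedding": embeddings.tolist()}))

# 4. Build an in-memory FAISS index; inner product on L2-normalized
#    vectors is equivalent to cosine similarity.
vectors = embeddings.astype(np.float32)
faiss.normalize_L2(vectors)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)
```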
**Retrieval Process**:
1. User query is encoded into embedding vector using SentenceTransformer
2. FAISS performs cosine similarity search (L2-normalized inner product)
3. Top-k chunks retrieved (default: 10 chunks, configurable in `rag_prompt.yml`)
4. Context assembled with separator (`\n\n`) and passed to LLM
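In code, these four steps reduce to a short helper (reusing `embedder`, `index`, and `chunks` from the indexing sketch above):

```python
import faiss
import numpy as np

def retrieve_context(query: str, k: int = 10, separator: str = "\n\n") -> str:
    """Embed the query, search FAISS, and assemble the RAG context string."""
    q = embedder.encode([query]).astype(np.float32)
    faiss.normalize_L2(q)          # cosine similarity via normalized inner product
    _, idx = index.search(q, k)    # indices of the top-k most similar chunks
    return separator.join(chunks[i] for i in idx[0])
```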
**RAG Prompt Template** (`prompts/rag_prompt.yml`):
- System prompt: Defines assistant role for Hopsworks documentation
- Context injection: Retrieved document chunks inserted into prompt
- Generation parameters:
  - max_tokens: 256
  - temperature: 0.7
  - stop_sequences: `["Question:", "\n\n"]`
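At inference time the template is loaded and combined with the retrieved context roughly as follows. The YAML key names here are illustrative (see `prompts/rag_prompt.yml` for the actual schema), and `llm` is the GGUF model loaded earlier:

```python
import yaml

with open("prompts/rag_prompt.yml") as f:
    cfg = yaml.safe_load(f)  # key names below are illustrative

user_question = "How do I create a feature group in Hopsworks?"
context = retrieve_context(user_question, k=cfg.get("num_chunks", 10))

messages = [
    {"role": "system",
     "content": cfg.get("system_prompt", "You are an assistant for Hopsworks documentation.")},
    {"role": "user",
     "content": f"Context:\n{context}\n\nQuestion: {user_question}"},
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=cfg.get("max_tokens", 256),
    temperature=cfg.get("temperature", 0.7),
    stop=cfg.get("stop_sequences", ["Question:", "\n\n"]),
)
print(response["choices"][0]["message"]["content"])
```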
### 1.3 User Interface

**Gradio Application** (`app.py`):
- **Model Selection**: Dropdown menus for repository and model selection
- **Dynamic Loading**: Models loaded on-demand from HuggingFace Hub using `llama-cpp-python`
- **Chat Interface**: Streaming responses with conversation history using `gr.ChatInterface`
- **Status Display**: Real-time feedback on model loading and operations
- **Model Information**: Displays model description, repository, and file details

**Features**:
- Multiple model support from different repositories (configured in `models_config.json`)
- CPU-optimized inference using `llama-cpp-python` with GGUF models
- Streaming text generation for better UX
- Example prompts for quick testing
- Error handling and user-friendly messages
- Automatic installation of `llama-cpp-python` at runtime
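The core of the streaming chat loop can be sketched as follows (simplified; the real `app.py` also injects the retrieved context, the conversation history, and the model-selection dropdowns):

```python
import gradio as gr

def chat_fn(message, history):
    """Stream tokens from the loaded GGUF model into the chat window."""
    partial = ""
    for chunk in llm.create_chat_completion(
        messages=[{"role": "user", "content": message}],
        max_tokens=256,
        stream=True,
    ):
        delta = chunk["choices"][0]["delta"].get("content", "")
        partial += delta
        yield partial  # Gradio renders each partial string as it arrives

demo = gr.ChatInterface(
    fn=chat_fn,
    title="Hopsworks RAG Chatbot",
    examples=["What is a feature store?", "How do I create a feature group?"],
)
demo.launch()
```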
**Deployment**:
- Deployed to HuggingFace Spaces
- Environment variables configured via Space secrets (`HOPSWORKS_API_KEY`)
- Automatic model downloading on first load
- Supports GGUF format models for CPU inference
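On the Space, secrets surface as environment variables and model files are fetched lazily, roughly like this (repo and file names are illustrative):

```python
import os
from huggingface_hub import hf_hub_download

# Space secrets are exposed to the running app as environment variables.
hopsworks_api_key = os.environ["HOPSWORKS_API_KEY"]

# Download the selected GGUF file on first use; later calls hit the local cache.
model_path = hf_hub_download(
    repo_id="unsloth/Llama-3.2-1B-Instruct-GGUF",   # illustrative
    filename="Llama-3.2-1B-Instruct-Q4_K_M.gguf",   # illustrative
)
```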
---

## Task 2: Improve Pipeline Scalability and Model Performance

### 2.1 Model-Centric Improvements

#### Hyperparameter Configuration

**Learning Rate Scheduling**:
- Linear learning rate decay implemented
- Warmup steps: 20 for stable training start
- Learning rate: 2e-4 (standard for LoRA fine-tuning)

**LoRA Configuration**:
- **Rank (r=16)**: Selected for balance between model capacity and parameter efficiency
- **Alpha (16)**: Set equal to rank for optimal scaling
- **Target Modules**: Selected attention and MLP layers (`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`) for maximum impact
- **Dropout**: 0 (optimized for Unsloth)

**Training Efficiency Optimizations**:
- **Gradient Accumulation (4)**: Effective batch size of 8 with minimal memory overhead
- **Mixed Precision**: bfloat16 on Ampere+ GPUs (automatic detection)
- **Optimizer**: AdamW 8-bit for memory efficiency
- **Gradient Checkpointing**: "unsloth" mode for memory savings
### 2.2 Data-Centric Improvements

#### Dataset Used

**FineTome-100k Dataset**:
- Source: `mlabonne/FineTome-100k` from HuggingFace
- Size: 100,000 instruction-following examples
- Format: ShareGPT format (converted to HuggingFace standard format)
- Quality: High-quality, diverse instruction-following examples
- Processing: Standardized using `unsloth.chat_templates.standardize_sharegpt()`

**Data Preprocessing**:
- ShareGPT format conversion to HuggingFace chat format
- Application of model-specific chat templates
- Batched processing with `dataset.map()` for efficiency

#### Evaluation Framework

**Evaluation Script** (`evaluation/evaluate_models.py`):
- **Perplexity Calculation**: Measures model's prediction confidence
  - Lower perplexity = better model
  - Calculated on held-out test set (50 examples from FineTome-100k)
  - Compares base model vs fine-tuned model
- **Memory Efficiency Tracking**: Prints model size and parameter counts
- **Implementation**: Uses 4-bit quantization for both models during evaluation
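The perplexity comparison boils down to averaging the language-modeling loss over the held-out texts and exponentiating it. A minimal sketch (the helper name and the `test_texts` list are illustrative):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def perplexity(model_id: str, texts: list[str]) -> float:
    """Average perplexity of a 4-bit quantized model over a list of texts."""
    bnb = BitsAndBytesConfig(load_in_4bit=True,
                             bnb_4bit_quant_type="nf4",
                             bnb_4bit_use_double_quant=True)
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id,
                                                 quantization_config=bnb,
                                                 device_map="auto")
    model.eval()

    losses = []
    for text in texts:
        enc = tok(text, return_tensors="pt",
                  truncation=True, max_length=2048).to(model.device)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])  # cross-entropy loss
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))  # exp(mean NLL) = perplexity

# Compare base vs fine-tuned model on the same 50 held-out examples:
# base_ppl = perplexity("unsloth/Llama-3.2-1B-Instruct", test_texts)
# ft_ppl = perplexity("schmuelling/Llama-3.2-1B-Instruct-finetome", test_texts)
```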
**Evaluation Setup**:
- Base Model: `unsloth/Llama-3.2-1B-Instruct`
- Fine-tuned Model: `schmuelling/Llama-3.2-1B-Instruct-finetome`
- Test Set: 50 examples from FineTome-100k dataset
- Metrics: The fine-tuned model showed an average improvement of 2.3% over the base instruct model across 10 evaluation runs.
- The Ministral-3 model could not be evaluated because it is so new that it is not yet supported by the Transformers library, something we had not anticipated.
### 2.3 Pipeline Scalability Improvements

#### Training Scalability

**Checkpoint Management**:
- Automatic checkpointing to HuggingFace Hub every 100/1000 steps
- Resume from checkpoint capability (automatic detection)
- Checkpoint versioning with limit of 3 checkpoints

**Model Versioning**:
- Versioned models on HuggingFace Hub
- Multiple quantization formats for different deployment scenarios
- Separate checkpoint and final model repositories

#### RAG System Scalability

**Index Implementation**:
- FAISS IndexFlatIP for exact similarity search
- L2 normalization for cosine similarity
- Efficient retrieval with configurable top-k

**Embedding System**:
- SentenceTransformer embeddings (`all-MiniLM-L6-v2`, 384 dimensions)
- Stored in Hopsworks Feature Store for persistence
- FAISS index built in-memory for fast retrieval

**Retrieval Configuration**:
- Configurable number of retrieved chunks (default: 10)
- Configurable context separator
- Real-time retrieval and context assembly

---
### File Structure

```
rag_finetune_LLM/
├── app.py                                # Gradio UI application
├── models_config.json                    # Model configuration
├── prompts/
│   └── rag_prompt.yml                    # RAG prompt template
├── finetuning/
│   ├── Finetune_notebook_Llama.ipynb     # Llama fine-tuning
│   └── Finetune_notebook_ministral.ipynb # Ministral fine-tuning
├── evaluation/
│   └── evaluate_models.py                # Model evaluation script
├── index_content.ipynb                   # Document indexing notebook
├── requirements.txt                      # Python dependencies
├── README.md                             # HuggingFace Space config
├── README_SETUP.md                       # Setup instructions
└── LAB_DESCRIPTION.md                    # This file
```
---

## Conclusion

This project successfully demonstrates Parameter Efficient Fine-Tuning (PEFT) using LoRA on large language models, achieving memory and computational savings while maintaining model quality. The implementation includes:

- Efficient fine-tuning with checkpointing and resume capability
- Multiple model support (Llama and Ministral)
- RAG system with Hopsworks Feature Store integration
- Production-ready UI deployed on HuggingFace Spaces
- Comprehensive documentation and evaluation framework

---

**Last Updated**: December 2025