# Lab 2: Parameter Efficient Fine-Tuning (PEFT) of Large Language Models
**Course:** ID2223 / HT2025
**Students:** Sebastian Schmuelling, Ramin Darudi
The public Hugging Face Space can be found here: https://huggingface.co/spaces/schmuelling/hopsworks_chat
The models can be found here: https://huggingface.co/schmuelling
---
## Overview
This project implements Parameter Efficient Fine-Tuning (PEFT) using LoRA (Low-Rank Adaptation) to fine-tune large language models on the FineTome-100k instruction dataset. The fine-tuned models are deployed in a Retrieval-Augmented Generation (RAG) chatbot interface that enables users to query documents indexed in Hopsworks Feature Store.
### Key Features
- **PEFT Fine-Tuning**: Efficient fine-tuning using LoRA with 4-bit quantization
- **Checkpoint Management**: Automatic checkpointing to HuggingFace Hub for resumable training
- **Multiple Model Support**: Fine-tuned Llama-3.2-1B and Ministral-3-3B models (the Ministral model was released only a few days before this lab!)
- **RAG System**: Document retrieval using Hopsworks Feature Store and FAISS
- **CPU-Optimized Inference**: GGUF format models for efficient CPU deployment
- **Interactive UI**: Gradio-based chatbot with dynamic model selection
---
## Task 1: Fine-Tune a Model and Build a UI
### 1.1 Fine-Tuning Implementation
#### Models Fine-Tuned
1. **Llama-3.2-1B-Instruct**
- Base Model: `unsloth/Llama-3.2-1B-Instruct`
- Fine-tuned Model: `schmuelling/Llama-3.2-1B-Instruct-finetome`
2. **Ministral-3-3B-Instruct-2512**
- Base Model: `unsloth/Ministral-3-3B-Instruct-2512`
- Fine-tuned Model: `schmuelling/Ministral-3-3B-Instruct-2512-finetome`
#### Fine-Tuning Process
**Dataset**: FineTome-100k (`mlabonne/FineTome-100k`)
- 100,000 instruction-following examples
- Converted from ShareGPT format to HuggingFace chat format using `standardize_sharegpt()`
- Applied model-specific chat templates (llama-3.1 for Llama, mistral for Ministral); see the sketch below
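A minimal sketch of this conversion, assuming the `tokenizer` returned by Unsloth's `FastLanguageModel.from_pretrained` and the `conversations` column used by FineTome (the Llama variant is shown):

```python
from datasets import load_dataset
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

# Load FineTome-100k and normalize ShareGPT-style turns to HF "role"/"content" messages
dataset = load_dataset("mlabonne/FineTome-100k", split="train")
dataset = standardize_sharegpt(dataset)

# Attach the model-specific chat template ("llama-3.1" here; "mistral" for Ministral)
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

def format_prompts(batch):
    # Render each conversation into a single training string
    texts = [tokenizer.apply_chat_template(conv, tokenize=False)
             for conv in batch["conversations"]]
    return {"text": texts}

# Batched processing for efficiency
dataset = dataset.map(format_prompts, batched=True)
```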
**Training Configuration** (a code sketch follows this list):
- **Framework**: Unsloth for memory-efficient fine-tuning
- **Quantization**: 4-bit (BitsAndBytesConfig with NF4 quantization and double quantization)
- **LoRA Configuration**:
- Rank (r): 16
- LoRA Alpha: 16
- LoRA Dropout: 0
- Target Modules: `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]`
- **Training Parameters**:
- Max Sequence Length: 2048
- Batch Size: 2 per device
- Gradient Accumulation: 4 steps (effective batch size: 8)
- Learning Rate: 2e-4
- Learning Rate Scheduler: Linear decay
- Warmup Steps: 20
- Optimizer: AdamW 8-bit
- Weight Decay: 0.001
- Epochs: 1
- Mixed Precision: bfloat16 on Ampere+ GPUs
- Gradient Checkpointing: Enabled ("unsloth" mode)
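A minimal sketch of this setup, using the Unsloth and TRL APIs as they appear in the standard Unsloth notebooks at the time of writing (variable names are illustrative; the actual notebooks live in `finetuning/`):

```python
import torch
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Load the 4-bit quantized base model (Unsloth sets up the BitsAndBytes NF4 config)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projections
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,               # the formatted FineTome-100k dataset
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,   # effective batch size: 8
        learning_rate=2e-4,
        lr_scheduler_type="linear",
        warmup_steps=20,
        optim="adamw_8bit",
        weight_decay=0.001,
        num_train_epochs=1,
        bf16=torch.cuda.is_bf16_supported(),
        output_dir="outputs",
    ),
)
trainer.train()
```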
**Checkpointing Strategy** (see the sketch after this list):
- Automatic push to HuggingFace Hub: `schmuelling/{model_name}-checkpoint` for each model
- Resume capability: training automatically detects and resumes from the checkpoint if the repository exists
- Total checkpoint limit: 3 (oldest checkpoints automatically deleted)
- Checkpoint strategy: "checkpoint" (push on every save)
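A hedged sketch of how this Hub checkpointing can be wired through `TrainingArguments` (the save interval is illustrative; the notebooks may differ in details):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="outputs",
    save_strategy="steps",
    save_steps=100,                  # illustrative interval
    save_total_limit=3,              # keep at most 3 checkpoints, delete the oldest
    push_to_hub=True,
    hub_model_id="schmuelling/Llama-3.2-1B-Instruct-checkpoint",
    hub_strategy="checkpoint",       # re-push the latest checkpoint on every save
)
```

With `hub_strategy="checkpoint"`, the Trainer pushes the latest checkpoint to a `last-checkpoint` subfolder of the Hub repo, so resuming amounts to downloading that folder and calling `trainer.train(resume_from_checkpoint=...)`.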
**Model Export**:
- **Merged 4-bit**: Merged LoRA weights into 4-bit base model for HuggingFace Transformers inference
- All models pushed to HuggingFace Hub for deployment.
- **GGUF Format**: We ran into serious dependency issues, and all attempts to convert our models to GGUF failed. For the Ministral model this is plausible: it was released only a few days ago, so llama.cpp support may simply not have caught up yet. For the Llama model the failure also traced back to a dependency issue, even though that model has been available for quite a while. Since llama.cpp is the fastest inference engine for CPUs, and the free HuggingFace Spaces tier only provides a CPU, we stuck with the llama.cpp engine; plain Transformers inference was far too slow in our tests. As a fallback, we used a pre-converted GGUF model from Unsloth.
### 1.2 RAG System Implementation
The RAG (Retrieval-Augmented Generation) system enables the chatbot to answer questions based on indexed documents.
**Document Indexing** (`index_content.ipynb`; a code sketch follows this list):
- **Document Loader**: LangChain DoclingLoader for PDF processing
- **Document**: "Building Machine Learning Systems with a Feature Store.pdf"
- **Chunking Strategy**: HybridChunker with semantic chunking
- **Embeddings Model**: `sentence-transformers/all-MiniLM-L6-v2` (384 dimensions)
- **Storage**: Hopsworks Feature Store (`book_embeddings` feature group, version 1)
- **Index**: FAISS IndexFlatIP (Inner Product with L2 normalization) for similarity search
- **Indexed Chunks**: 1,333 document chunks
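In code, the indexing step reduces to embedding the chunks and adding them to a FAISS index. A sketch, assuming `chunks` is the list of chunk texts produced by the HybridChunker:

```python
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Encode all chunk texts into 384-dim float32 vectors
embeddings = embedder.encode(chunks, convert_to_numpy=True)

# L2-normalize in place so that inner product equals cosine similarity
faiss.normalize_L2(embeddings)

index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product search
index.add(embeddings)
```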
**Retrieval Process** (sketched in code below):
1. User query is encoded into embedding vector using SentenceTransformer
2. FAISS performs cosine similarity search (L2-normalized inner product)
3. Top-k chunks retrieved (default: 10 chunks, configurable in `rag_prompt.yml`)
4. Context assembled with separator (`\n\n`) and passed to LLM
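The retrieval step mirrors the four stages above. A sketch reusing the `embedder`, `index`, and `chunks` from the indexing sketch:

```python
def retrieve_context(query: str, k: int = 10) -> str:
    # 1-2. Embed the query and normalize for cosine similarity
    q = embedder.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(q)
    # 3. Exact top-k search over the chunk index
    scores, ids = index.search(q, k)
    # 4. Assemble the context with the configured separator
    return "\n\n".join(chunks[i] for i in ids[0])
```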
**RAG Prompt Template** (`prompts/rag_prompt.yml`; see the sketch after this list):
- System prompt: Defines assistant role for Hopsworks documentation
- Context injection: Retrieved document chunks inserted into prompt
- Generation parameters:
- max_tokens: 256
- temperature: 0.7
- stop_sequences: `["Question:", "\n\n"]`
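We have not reproduced `rag_prompt.yml` here, but loading it and wiring its parameters into a llama-cpp call could look like this. The key names are hypothetical, not the file's actual schema, and `llm` is assumed to be a `llama_cpp.Llama` instance (see the UI sketch below):

```python
import yaml

with open("prompts/rag_prompt.yml") as f:
    cfg = yaml.safe_load(f)

# Hypothetical keys: "template" plus the generation parameters listed above
question = "What is a feature group?"  # example user query
prompt = cfg["template"].format(context=retrieve_context(question),
                                question=question)
output = llm(
    prompt,
    max_tokens=cfg["max_tokens"],        # 256
    temperature=cfg["temperature"],      # 0.7
    stop=cfg["stop_sequences"],          # ["Question:", "\n\n"]
)
```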
### 1.3 User Interface
**Gradio Application** (`app.py`):
- **Model Selection**: Dropdown menus for repository and model selection
- **Dynamic Loading**: Models loaded on-demand from HuggingFace Hub using `llama-cpp-python`
- **Chat Interface**: Streaming responses with conversation history using `gr.ChatInterface`
- **Status Display**: Real-time feedback on model loading and operations
- **Model Information**: Displays model description, repository, and file details
**Features** (a minimal loading/streaming sketch follows this list):
- Multiple model support from different repositories (configured in `models_config.json`)
- CPU-optimized inference using `llama-cpp-python` with GGUF models
- Streaming text generation for better UX
- Example prompts for quick testing
- Error handling and user-friendly messages
- Automatic installation of `llama-cpp-python` at runtime
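A minimal sketch of the load-and-stream pattern with `llama-cpp-python` and `gr.ChatInterface`. The repo id and filename are illustrative; the real app reads them from `models_config.json`:

```python
import gradio as gr
from llama_cpp import Llama

# Download a GGUF file from the Hub and load it for CPU inference
llm = Llama.from_pretrained(
    repo_id="unsloth/Llama-3.2-1B-Instruct-GGUF",  # illustrative repo
    filename="*Q4_K_M.gguf",                       # glob for the quant file
    n_ctx=2048,
)

def respond(message, history):
    # With type="messages", history arrives as {"role", "content"} dicts
    messages = history + [{"role": "user", "content": message}]
    partial = ""
    for chunk in llm.create_chat_completion(messages=messages, stream=True,
                                            max_tokens=256, temperature=0.7):
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            partial += delta["content"]
            yield partial  # Gradio renders each yield as the growing reply

gr.ChatInterface(respond, type="messages").launch()
```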
**Deployment**:
- Deployed to HuggingFace Spaces
- Environment variables configured via Space secrets (`HOPSWORKS_API_KEY`)
- Automatic model downloading on first load
- Supports GGUF format models for CPU inference
---
## Task 2: Improve Pipeline Scalability and Model Performance
### 2.1 Model-Centric Improvements
#### Hyperparameter Configuration
**Learning Rate Scheduling**:
- Linear learning rate decay implemented
- Warmup steps: 20 for stable training start
- Learning rate: 2e-4 (standard for LoRA fine-tuning)
**LoRA Configuration**:
- **Rank (r=16)**: Selected for balance between model capacity and parameter efficiency
- **Alpha (16)**: Set equal to the rank so the LoRA scaling factor (alpha/r) stays at 1, a common default
- **Target Modules**: Selected attention and MLP layers (`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`) for maximum impact
- **Dropout**: 0 (optimized for Unsloth)
**Training Efficiency Optimizations**:
- **Gradient Accumulation (4)**: Effective batch size of 8 with minimal memory overhead
- **Mixed Precision**: bfloat16 on Ampere+ GPUs (automatic detection)
- **Optimizer**: AdamW 8-bit for memory efficiency
- **Gradient Checkpointing**: "unsloth" mode for memory savings
### 2.2 Data-Centric Improvements
#### Dataset Used
**FineTome-100k Dataset**:
- Source: `mlabonne/FineTome-100k` from HuggingFace
- Size: 100,000 instruction-following examples
- Format: ShareGPT format (converted to HuggingFace standard format)
- Quality: High-quality, diverse instruction-following examples
- Processing: Standardized using `unsloth.chat_templates.standardize_sharegpt()`
**Data Preprocessing**:
- ShareGPT format conversion to HuggingFace chat format
- Application of model-specific chat templates
- Batched processing with `dataset.map()` for efficiency
#### Evaluation Framework
**Evaluation Script** (`evaluation/evaluate_models.py`; the perplexity core is sketched after this list):
- **Perplexity Calculation**: Measures model's prediction confidence
- Lower perplexity = better model
- Calculated on held-out test set (50 examples from FineTome-100k)
- Compares base model vs fine-tuned model
- **Memory Efficiency Tracking**: Prints model size and parameter counts
- **Implementation**: Uses 4-bit quantization for both models during evaluation
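The core of such a perplexity computation, sketched under the assumption that both models are loaded as causal LMs (the actual script may differ in details):

```python
import math
import torch

def perplexity(model, tokenizer, texts, max_length=2048):
    total_nll, total_tokens = 0.0, 0
    model.eval()
    for text in texts:
        enc = tokenizer(text, return_tensors="pt",
                        truncation=True, max_length=max_length).to(model.device)
        with torch.no_grad():
            # With labels == input_ids, the returned loss is the mean next-token NLL
            out = model(**enc, labels=enc["input_ids"])
        n = enc["input_ids"].numel()
        total_nll += out.loss.item() * n
        total_tokens += n
    # Perplexity = exp(average negative log-likelihood per token)
    return math.exp(total_nll / total_tokens)
```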
**Evaluation Setup**:
- Base Model: `unsloth/Llama-3.2-1B-Instruct`
- Fine-tuned Model: `schmuelling/Llama-3.2-1B-Instruct-finetome`
- Test Set: 50 examples from FineTome-100k dataset
- Metrics: Averaged over 10 evaluation runs, the fine-tuned model showed a 2.3% perplexity improvement over the base instruct model.
- The Ministral-3 model could not be evaluated: it is so new that it is not yet supported by the Transformers library, something we had not taken into account.
### 2.3 Pipeline Scalability Improvements
#### Training Scalability
**Checkpoint Management**:
- Automatic checkpointing to HuggingFace Hub every 100 or 1000 steps (depending on the run)
- Resume from checkpoint capability (automatic detection)
- Checkpoint versioning with limit of 3 checkpoints
**Model Versioning**:
- Versioned models on HuggingFace Hub
- Multiple quantization formats for different deployment scenarios
- Separate checkpoint and final model repositories
#### RAG System Scalability
**Index Implementation**:
- FAISS IndexFlatIP for exact similarity search
- L2 normalization for cosine similarity
- Efficient retrieval with configurable top-k
**Embedding System** (see the sketch after this list):
- SentenceTransformer embeddings (`all-MiniLM-L6-v2`, 384 dimensions)
- Stored in Hopsworks Feature Store for persistence
- FAISS index built in-memory for fast retrieval
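A sketch of this persistence pattern with the Hopsworks Python API (the `embedding` column name is an assumption about the feature group schema):

```python
import faiss
import hopsworks
import numpy as np

project = hopsworks.login()  # reads HOPSWORKS_API_KEY from the environment
fs = project.get_feature_store()

# Read the persisted chunk embeddings back into memory
fg = fs.get_feature_group("book_embeddings", version=1)
df = fg.read()

embeddings = np.stack(df["embedding"].to_numpy()).astype("float32")
faiss.normalize_L2(embeddings)

# Rebuild the in-memory FAISS index on startup
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
```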
**Retrieval Configuration**:
- Configurable number of retrieved chunks (default: 10)
- Configurable context separator
- Real-time retrieval and context assembly
---
### File Structure
```
rag_finetune_LLM/
├── app.py # Gradio UI application
├── models_config.json # Model configuration
├── prompts/
│ └── rag_prompt.yml # RAG prompt template
├── finetuning/
│ ├── Finetune_notebook_Llama.ipynb # Llama fine-tuning
│ └── Finetune_notebook_ministral.ipynb # Ministral fine-tuning
├── evaluation/
│ └── evaluate_models.py # Model evaluation script
├── index_content.ipynb # Document indexing notebook
├── requirements.txt # Python dependencies
├── README.md # HuggingFace Space config
├── README_SETUP.md # Setup instructions
└── LAB_DESCRIPTION.md # This file
```
---
## Conclusion
This project successfully demonstrates Parameter Efficient Fine-Tuning (PEFT) using LoRA on large language models, achieving memory and computational savings while maintaining model quality. The implementation includes:
- Efficient fine-tuning with checkpointing and resume capability
- Multiple model support (Llama and Ministral)
- RAG system with Hopsworks Feature Store integration
- Production-ready UI deployed on HuggingFace Spaces
- Comprehensive documentation and evaluation framework
---
**Last Updated**: December 2025