AnseMin committed on
Commit
9e9e9ff
·
1 Parent(s): 18e6067

Enhance vector store retrieval with limited results


- Introduced a new `LimitedEnsembleRetriever` class to limit the number of results returned by the ensemble retriever to a specified count (k).
- Updated the `get_hybrid_retriever` method to return a `LimitedEnsembleRetriever` instance, ensuring that exactly k results are provided for improved performance and usability.
- Enhanced logging to reflect the creation of the limited retriever with specified weights and result limits.
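
A minimal usage sketch of the new behavior, assuming a `VectorStoreManager` instance with documents already ingested (the construction shown here is illustrative, not part of this commit):

```python
# Hypothetical usage sketch: class and method names follow the diff below,
# but the surrounding setup (manager construction, ingested documents) is assumed.
from src.rag.vector_store import VectorStoreManager

manager = VectorStoreManager()  # assumes documents were already added to the store

# get_hybrid_retriever now wraps its EnsembleRetriever in a LimitedEnsembleRetriever
retriever = manager.get_hybrid_retriever(
    k=4,                  # cap on the number of documents returned
    semantic_weight=0.7,  # weight of the vector (semantic) retriever
    keyword_weight=0.3,   # weight of the BM25 (keyword) retriever
)

# The wrapper truncates the ensemble's merged results, so at most k documents come back
docs = retriever.get_relevant_documents("What tables appear in the report?")
assert len(docs) <= 4
```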

Files changed (2)
  1. README.md +127 -324
  2. src/rag/vector_store.py +31 -5
README.md CHANGED
@@ -14,6 +14,8 @@ hf_oauth: true
14
 
15
  # Document to Markdown Converter with RAG Chat
16
 
 
 
17
  A powerful Hugging Face Space that converts various document formats to Markdown and enables intelligent chat with your documents using advanced RAG (Retrieval-Augmented Generation).
18
 
19
  <details>
@@ -21,360 +23,160 @@ A powerful Hugging Face Space that converts various document formats to Markdown
21
 
22
  <!-- Begin ToC -->
23
 
 
24
  - [System Overview](#-system-overview)
25
- - [Key Features](#-key-features)
26
- - [Document Conversion](#document-conversion)
27
- - [RAG Chat with Documents](#-rag-chat-with-documents)
28
- - [Query Ranker (NEW!)](#-query-ranker-new)
29
- - [User Interface](#user-interface)
30
- - [Supported Libraries](#supported-libraries)
31
- - [Multi-Document Processing](#-multi-document-processing)
32
- - [Environment Variables](#environment-variables)
33
- - [API Keys](#-api-keys)
34
- - [Configuration Options](#️-configuration-options)
35
- - [Docling Configuration](#-docling-configuration)
36
- - [Model Configuration](#-model-configuration)
37
- - [RAG Configuration](#-rag-configuration)
38
- - [Advanced Retrieval Configuration](#-advanced-retrieval-configuration)
39
- - [Usage Guide](#-usage-guide)
40
- - [Parser Selection](#-parser-selection)
41
- - [Document Conversion](#document-conversion-1)
42
- - [RAG Chat & Query System](#-rag-chat--query-system)
43
- - [Local Development](#local-development)
44
- - [Quick Start](#-quick-start)
45
- - [Data Management](#-data-management)
46
- - [Development Features](#-development-features)
47
- - [GOT-OCR LaTeX Processing](#-got-ocr-latex-processing)
48
- - [Credits](#credits)
49
- - [Retrieval Strategies](#-retrieval-strategies)
50
- - [Development](#-development)
51
- - [Quick Start](#quick-start)
52
- - [Key Technologies](#key-technologies)
53
 
54
  <!-- End ToC -->
55
 
56
  </details>
57
 
58
- ## 🎯 System Overview
59
 
 
60
  <div align="center">
61
- <img src="img/Overall%20System%20Workflow%20(Essential).png" alt="Overall System Workflow" width="400">
62
-
63
- *Complete workflow from document upload to intelligent RAG chat interaction*
64
  </div>
65
 
66
- ## ✨ Key Features
67
-
68
- ### Document Conversion
69
- - Convert PDFs, Office documents, images, and more to Markdown
70
- - **🆕 Multi-Document Processing**: Process up to 5 files simultaneously (20MB combined)
71
- - **5 Powerful Parsers**:
72
- - **Gemini Flash**: General Purpose + High Accuracy
73
- - **Mistral OCR**: Fastest Processing
74
- - **Docling**: Open Source
75
- - **GOT-OCR**: Document to LaTeX + Open Source
76
- - **MarkItDown**: High Accuracy CSV/XML + Open Source
77
- - **🆕 Intelligent Processing Types**:
78
- - **Combined**: Merge documents into unified content with duplicate removal
79
- - **Individual**: Separate sections per document with clear organization
80
- - **Summary**: Executive overview + detailed analysis of all documents
81
- - **Comparison**: Cross-document analysis with similarities/differences tables
82
- - Download converted documents as Markdown files
83
-
84
- ### 🤖 RAG Chat with Documents
85
- - **Chat with your converted documents** using advanced AI
86
- - **🆕 Advanced Retrieval Strategies**: Multiple search methods for optimal results
87
- - **Similarity Search**: Traditional semantic similarity using embeddings
88
- - **MMR (Maximal Marginal Relevance)**: Diverse results with reduced redundancy
89
- - **BM25 Keyword Search**: Traditional keyword-based retrieval
90
- - **Hybrid Search**: Combines semantic + keyword search for best accuracy
91
- - **Intelligent document retrieval** using vector embeddings
92
- - **🆕 Smart Content-Aware Chunking**:
93
- - **Markdown chunking** that preserves tables and code blocks
94
- - **LaTeX chunking** that preserves mathematical tables, environments, and structures
95
- - **Automatic format detection** for optimal chunking strategy
96
- - **Streaming chat responses** for real-time interaction
97
- - **Chat history management** with session persistence
98
- - **Usage limits** to prevent abuse on public spaces
99
- - **Powered by Gemini 2.5 Flash** for high-quality responses
100
- - **OpenAI embeddings** for accurate document retrieval
101
- - **🗑️ Clear All Data** button for easy data management in both local and HF Space environments
102
-
103
- ### 🔍 Query Ranker (NEW!)
104
- - **🆕 Third dedicated tab** for document search and ranking
105
- - **Interactive query search** with real-time document chunk ranking
106
- - **Multiple retrieval methods**: Similarity, MMR, BM25, and Hybrid search
107
- - **Intelligent confidence scoring**: Rank-based confidence levels (High/Medium/Low)
108
- - **Real similarity scores**: Actual ChromaDB similarity scores for similarity search
109
- - **Transparent results**: Clear display of source documents, page numbers, and chunk lengths
110
- - **Adjustable result count**: 1-10 results with responsive slider control
111
- - **Method comparison**: Test different retrieval strategies on the same query
112
- - **Modern card-based UI**: Clean, professional result display with hover effects
113
-
114
- ### User Interface
115
- - **🆕 Three-tab interface**: Document Converter + Chat + Query Ranker
116
- - **🆕 Unified File Input**: Single interface handles both single and multiple file uploads
117
- - **🆕 Dynamic Processing Options**: Multi-document processing type selector appears automatically
118
- - **🆕 Real-time Validation**: Live feedback on file count, size limits, and processing mode
119
- - **Real-time status monitoring** for RAG system with environment detection
120
- - **Auto-ingestion** of converted documents into chat system
121
- - **Enhanced status display**: Shows vector store document count, chat history files, and environment type
122
- - **Data management controls**: Clear All Data button with comprehensive feedback
123
- - **Filename preservation**: Downloaded files maintain original names (e.g., "example data.pdf" → "example data.md")
124
- - **🆕 Smart Output Naming**: Batch processing creates descriptive filenames (e.g., "Combined_3_Documents_20240125.md")
125
- - **🆕 Consistent modern styling**: All tabs share the same professional design theme
126
- - Clean, responsive UI with modern styling
127
-
128
- ## Supported Libraries
129
-
130
- **MarkItDown** ([Microsoft](https://github.com/microsoft/markitdown)): PDF, Office docs, images, audio, HTML, ZIP files, YouTube URLs, EPubs, and more.
131
-
132
- **Docling** ([IBM](https://github.com/DS4SD/docling)): Advanced PDF understanding with table structure recognition, multiple OCR engines, and layout analysis. **Supports multi-document processing** with Gemini-powered summary & comparison.
133
-
134
- **Gemini Flash** ([Google](https://deepmind.google/technologies/gemini/)): AI-powered document understanding with **advanced multi-document processing capabilities**, cross-format analysis, and intelligent content synthesis.
135
-
136
- **Mistral OCR**: High-accuracy OCR for PDFs and images with optional *Document Understanding* mode. **Supports multi-document processing** with Gemini-powered summary & comparison.
137
-
138
- ## 🚀 Multi-Document Processing
139
-
140
- <img src="img/Multi-Document%20Processing%20Types%20(Flagship%20Feature).png" alt="Multi-Document Processing Types" width="700">
141
-
142
- *Industry-leading multi-document processing with 4 intelligent processing types*
143
-
144
- ### **Key Capabilities:**
145
- - **📊 Cross-Document Analysis**: Compare and contrast information across different files
146
- - **🔄 Smart Duplicate Removal**: Intelligently merges overlapping content while preserving unique insights
147
- - **📋 Format Intelligence**: Handles mixed file types (PDF + images, Word + Excel, etc.) seamlessly
148
- - **🧠 Contextual Understanding**: Recognizes relationships and patterns across document boundaries
149
-
150
- ### **Processing Types:**
151
-
152
- - **🔗 Combined**: Merge documents into unified content with duplicate removal
153
  - **📑 Individual**: Separate sections per document with clear organization
154
  - **📈 Summary**: Executive overview + detailed analysis of all documents
155
  - **⚖️ Comparison**: Cross-document analysis with similarities/differences tables
156
 
157
- ## Environment Variables
158
-
159
- The application uses centralized configuration management. You can enhance functionality by setting these environment variables:
160
-
161
- ### 🔑 **API Keys:**
162
- - `GOOGLE_API_KEY`: Used for Gemini Flash parser, LaTeX conversion, and **RAG chat functionality**
163
- - `OPENAI_API_KEY`: Enables AI-based image descriptions in MarkItDown and **vector embeddings for RAG**
164
- - `MISTRAL_API_KEY`: For Mistral OCR parser (if available)
165
-
166
- ### ⚙️ **Configuration Options:**
167
- - `DEBUG`: Set to `true` for debug mode with verbose logging
168
- - `MAX_FILE_SIZE`: Maximum file size in bytes (default: 10MB)
169
- - `MAX_BATCH_FILES`: Maximum files for multi-document processing (default: 5)
170
- - `MAX_BATCH_SIZE`: Maximum combined size for batch processing (default: 20MB)
171
- - `TEMP_DIR`: Directory for temporary files (default: ./temp)
172
- - `TESSERACT_PATH`: Custom path to Tesseract executable
173
- - `TESSDATA_PATH`: Path to Tesseract language data
174
-
175
- ### 🔧 **Docling Configuration:**
176
- - `DOCLING_ARTIFACTS_PATH`: Path to pre-downloaded Docling models for offline use
177
- - `DOCLING_ENABLE_REMOTE_SERVICES`: Enable remote vision model services (default: false)
178
- - `DOCLING_ENABLE_TABLES`: Enable table structure recognition (default: true)
179
- - `DOCLING_ENABLE_CODE_ENRICHMENT`: Enable code block enrichment (default: false)
180
- - `DOCLING_ENABLE_FORMULA_ENRICHMENT`: Enable formula understanding (default: false)
181
- - `DOCLING_ENABLE_PICTURE_CLASSIFICATION`: Enable picture classification (default: false)
182
- - `DOCLING_GENERATE_PICTURE_IMAGES`: Generate picture images during processing (default: false)
183
- - `OMP_NUM_THREADS`: Number of CPU threads for OCR processing (default: 4)
184
-
185
- ### 🤖 **Model Configuration:**
186
- - `GEMINI_MODEL`: Gemini model to use (default: gemini-1.5-flash)
187
- - `MISTRAL_MODEL`: Mistral model to use (default: pixtral-12b-2409)
188
- - `GOT_OCR_MODEL`: GOT-OCR model to use (default: stepfun-ai/GOT-OCR2_0)
189
- - `MODEL_TEMPERATURE`: Model temperature for AI responses (default: 0.1)
190
- - `MODEL_MAX_TOKENS`: Maximum tokens for AI responses (default: 4096)
191
-
192
- ### 🧠 **RAG Configuration:**
193
- - `VECTOR_STORE_PATH`: Path for vector database storage (default: ./data/vector_store)
194
- - `CHAT_HISTORY_PATH`: Path for chat history storage (default: ./data/chat_history)
195
- - `EMBEDDING_MODEL`: OpenAI embedding model (default: text-embedding-3-small)
196
- - `CHUNK_SIZE`: Document chunk size for Markdown content (default: 1000)
197
- - `CHUNK_OVERLAP`: Overlap between chunks for Markdown (default: 200)
198
- - `LATEX_CHUNK_SIZE`: Document chunk size for LaTeX content (default: 1200)
199
- - `LATEX_CHUNK_OVERLAP`: Overlap between chunks for LaTeX (default: 150)
200
- - `MAX_MESSAGES_PER_SESSION`: Chat limit per session (default: 50)
201
- - `MAX_MESSAGES_PER_HOUR`: Chat limit per hour (default: 100)
202
- - `RETRIEVAL_K`: Number of documents to retrieve (default: 4)
203
- - `RAG_MODEL`: Model for RAG chat (default: gemini-2.5-flash)
204
- - `RAG_TEMPERATURE`: Temperature for RAG responses (default: 0.1)
205
- - `RAG_MAX_TOKENS`: Max tokens for RAG responses (default: 4096)
206
-
207
- ### 🔍 **Advanced Retrieval Configuration:**
208
- - `DEFAULT_RETRIEVAL_METHOD`: Default retrieval strategy (default: similarity)
209
- - `MMR_LAMBDA_MULT`: MMR diversity parameter (default: 0.5)
210
- - `MMR_FETCH_K`: MMR candidate document count (default: 10)
211
- - `HYBRID_SEMANTIC_WEIGHT`: Semantic search weight in hybrid mode (default: 0.7)
212
- - `HYBRID_KEYWORD_WEIGHT`: Keyword search weight in hybrid mode (default: 0.3)
213
- - `BM25_K1`: BM25 term frequency saturation parameter (default: 1.2)
214
- - `BM25_B`: BM25 field length normalization parameter (default: 0.75)
215
-
216
- ## 📖 Usage Guide
217
-
218
- ### 🎯 Parser Selection
219
-
220
- <img src="img/Parser%20Selection%20Guide%20(User-Friendly).png" alt="Parser Selection Guide" width="700">
221
 
222
- *Choose the right parser for your specific needs and document types*
 
223
 
224
- ### Document Conversion
 
225
 
226
- #### 📄 **Single Document Processing**
227
- 1. Upload a single file
228
- 2. Choose your preferred parser
229
- 3. Select an OCR method based on your chosen parser
230
- 4. Click "Convert"
231
- 5. Download the converted file (.tex for GOT-OCR, .md for others)
232
 
233
- #### 📂 **Multi-Document Processing**
234
- 1. Upload **2-5 files** (up to 20MB combined)
235
- 2. Choose processing type: Combined, Individual, Summary, or Comparison
236
- 3. Select your preferred parser
237
- 4. Click "Convert" for intelligent cross-document analysis
 
238
 
239
- ### 🤖 RAG Chat & Query System
240
 
241
- <img src="img/RAG%20Retrieval%20Strategies%20(Technical%20Highlight).png" alt="RAG Retrieval Strategies" width="700">
 
242
 
243
- *Advanced RAG system with 4 retrieval strategies for optimal document search*
 
244
 
245
- #### **Chat with Documents**
246
- 1. Choose your retrieval strategy (Similarity, MMR, BM25, or Hybrid)
247
- 2. Ask questions about your converted documents
248
- 3. Get real-time streaming responses with document context
249
-
250
- #### **Query Ranker**
251
- 1. Enter search queries to explore document chunks
252
- 2. Compare different retrieval methods
253
- 3. View confidence scores and source information
254
-
255
- ## Local Development
256
-
257
- ### 🚀 **Quick Start:**
258
- 1. Clone the repository
259
- 2. Create a `.env` file with your API keys:
260
- ```
261
- GOOGLE_API_KEY=your_gemini_api_key_here
262
- OPENAI_API_KEY=your_openai_api_key_here
263
- MISTRAL_API_KEY=your_mistral_api_key_here
264
- DEBUG=true
265
-
266
- # RAG Configuration (optional - uses defaults if not set)
267
- MAX_MESSAGES_PER_SESSION=50
268
- MAX_MESSAGES_PER_HOUR=100
269
- CHUNK_SIZE=1000
270
- ```
271
- 3. Install dependencies:
272
- ```bash
273
- pip install -r requirements.txt
274
- ```
275
- 4. Run the application:
276
- ```bash
277
- # For full environment setup (HF Spaces compatible)
278
- python app.py
279
-
280
- # For local development (faster startup)
281
- python run_app.py
282
-
283
- # For testing with clean data
284
- python run_app.py --clear-data-and-run
285
-
286
- # Show all available options
287
- python run_app.py --help
288
- ```
289
-
290
- ### 🧹 **Data Management:**
291
 
292
- **Two ways to clear data:**
 
 
 
 
293
 
294
- 1. **Command-line** (for development):
295
- - `python run_app.py --clear-data-and-run` - Clear data then start app
296
- - `python run_app.py --clear-data` - Clear data and exit
297
 
298
- 2. **In-app UI** (for users):
299
- - Go to "Chat with Documents" tab → Click "🗑️ Clear All Data" button
300
- - Automatically detects environment (local vs HF Space)
301
- - Provides detailed feedback and starts new session
302
 
303
- **What gets cleared:**
304
- - `data/chat_history/*` - All saved chat sessions
305
- - `data/vector_store/*` - All document embeddings and vector database
306
 
307
- ### 🧪 **Development Features:**
308
- - **Automatic Environment Setup**: Dependencies are checked and installed automatically
309
- - **Configuration Validation**: Startup validation reports missing API keys and configuration issues
310
- - **Enhanced Error Messages**: Detailed error reporting for debugging
311
- - **Centralized Logging**: Configurable logging levels and output formats
312
 
313
- ## 📄 GOT-OCR LaTeX Processing
 
 
 
 
314
 
315
- Markit v2 features **advanced LaTeX processing** for GOT-OCR results, providing proper mathematical and tabular content handling:
316
 
317
- ### **🎯 Key Features:**
 
 
 
318
 
319
- #### **1. Native LaTeX Output**
320
- - **No LLM conversion**: GOT-OCR returns raw LaTeX directly for maximum accuracy
321
- - **Preserves mathematical structures**: Complex formulas, tables, and equations remain intact
322
- - **.tex file output**: Save files in proper LaTeX format for external use
 
323
 
324
- #### **2. Mathpix Markdown Rendering**
325
- - **Professional display**: Uses Mathpix Markdown library (same as official GOT-OCR demo)
326
- - **Complex table support**: Renders `\begin{tabular}`, `\multirow`, `\multicolumn` properly
327
- - **Mathematical expressions**: Displays LaTeX math with proper formatting
328
- - **Base64 iframe embedding**: Secure, isolated rendering environment
329
 
330
- #### **3. RAG-Compatible LaTeX Chunking**
331
- - **LaTeX-aware chunker**: Specialized chunking preserves LaTeX structures
332
- - **Complete table preservation**: Entire `\begin{tabular}...\end{tabular}` blocks stay intact
333
- - **Environment detection**: Maintains `\begin{env}...\end{env}` pairs
334
- - **Intelligent separators**: Uses LaTeX commands (`\section`, `\title`) as break points
335
 
336
- #### **4. Enhanced Metadata**
337
- - **Content type tracking**: `content_type: "latex"` for proper handling
338
- - **Structure detection**: Identifies tables, environments, and mathematical content
339
- - **Auto-format detection**: GOT-OCR results automatically use LaTeX chunker
340
 
341
- ### **🔧 Technical Implementation:**
 
342
 
343
- ```javascript
344
- // Mathpix rendering (inspired by official GOT-OCR demo)
345
- const html = window.render(latexContent, {htmlTags: true});
346
 
347
- // LaTeX structure preservation
348
- \begin{tabular}{|l|c|c|}
349
- \hline Disability & Participants & Results \\
350
- \hline Blind & 5 & $34.5\%, n=1$ \\
351
- \end{tabular}
352
  ```
353
 
354
- ### **📊 Use Cases:**
355
- - **Research papers**: Mathematical formulas and data tables
356
- - **Scientific documents**: Complex equations and statistical data
357
- - **Financial reports**: Tabular data with calculations
358
- - **Academic content**: Mixed text, math, and structured data
 
 
 
 
359
 
360
- ## Credits
361
 
362
- - [MarkItDown](https://github.com/microsoft/markitdown) by Microsoft
363
- - [GOT-OCR](https://github.com/stepfun-ai/GOT-OCR-2.0) for image-based OCR
364
- - [Mathpix Markdown](https://github.com/Mathpix/mathpix-markdown-it) for LaTeX rendering
365
- - [Gradio](https://gradio.app/) for the UI framework
 
366
 
367
- ---
 
 
368
 
369
- **Author: Anse Min** | [GitHub](https://github.com/ansemin) | [LinkedIn](https://www.linkedin.com/in/ansemin/)
 
370
 
371
- **Project Links:**
372
- - [GitHub Repository](https://github.com/ansemin/Markit_v2)
373
- - [Hugging Face Space](https://huggingface.co/spaces/Ansemin101/Markit_v2)
 
 
374
 
 
 
 
 
375
 
376
- ## 🔍 Retrieval Strategies
377
 
 
 
 
378
  | Method | Best For | Accuracy |
379
  |--------|----------|----------|
380
  | **🎯 Similarity** | General semantic questions | Good |
@@ -382,24 +184,25 @@ const html = window.render(latexContent, {htmlTags: true});
382
  | **🔍 BM25** | Exact keyword searches | Medium |
383
  | **🔗 Hybrid** | Most queries (recommended) | **Excellent** |
384
 
385
- ## 💻 Development
 
 
 
 
386
 
387
- ### Quick Start
388
- ```bash
389
- # Clone repository
390
- git clone https://github.com/ansemin/Markit_v2
391
 
392
- # Set up environment variables
393
- cp .env.example .env
394
- # Edit .env with your API keys
395
 
396
- # Install dependencies & run
397
- pip install -r requirements.txt
398
- python app.py
399
- ```
 
400
 
401
- ### Key Technologies
402
- - **Parsers**: Gemini Flash, Mistral OCR, Docling, GOT-OCR, MarkItDown
403
- - **RAG System**: OpenAI embeddings + Chroma vector store + Gemini 2.5 Flash
404
- - **UI Framework**: Gradio with modular component architecture
405
- - **GPU Support**: ZeroGPU integration for HF Spaces
 
14
 
15
  # Document to Markdown Converter with RAG Chat
16
 
17
+ **Author: Anse Min** | [🤗 Hugging Face Space](https://huggingface.co/spaces/Ansemin101/Markit_v2) | [GitHub](https://github.com/ansemin) | [LinkedIn](https://www.linkedin.com/in/ansemin/)
18
+
19
  A powerful Hugging Face Space that converts various document formats to Markdown and enables intelligent chat with your documents using advanced RAG (Retrieval-Augmented Generation).
20
 
21
  <details>
 
23
 
24
  <!-- Begin ToC -->
25
 
26
+ - [Live Demos](#-live-demos)
27
  - [System Overview](#-system-overview)
28
+ - [Environment Setup](#-environment-setup)
29
+ - [Local Development](#-local-development)
30
+ - [Technical Details](#-technical-details)
 
 
 
 
 
31
 
32
  <!-- End ToC -->
33
 
34
  </details>
35
 
36
+ ## 🎬 Live Demos
37
 
38
+ ### 1. Multi-Document Processing (Flagship Feature)
39
  <div align="center">
40
+ <img src="GIF/Multi-Document Processing Showcase.gif" alt="Multi-Document Processing Demo" width="800">
 
 
41
  </div>
42
 
43
+ **What it does:** Process up to 5 files simultaneously (20MB combined) with 4 intelligent processing types:
44
+ - **🔗 Combined**: Merge documents with smart duplicate removal
 
 
45
  - **📑 Individual**: Separate sections per document with clear organization
46
  - **📈 Summary**: Executive overview + detailed analysis of all documents
47
  - **⚖️ Comparison**: Cross-document analysis with similarities/differences tables
48
 
49
+ **Why it matters:** Industry-leading multi-document processing that compares and contrasts information across different files, handles mixed file types seamlessly, and recognizes relationships across document boundaries.
 
 
 
 
 
50
 
51
+ <div align="center">
52
+ <img src="img/Multi-Document Processing Types (Flagship Feature).png" alt="Multi-Document Processing Types" width="700">
53
 
54
+ *Industry-leading multi-document processing with 4 intelligent processing types*
55
+ </div>
56
 
57
+ ### 2. Single Document Conversion Flow
58
+ <div align="center">
59
+ <img src="GIF/Single Document Conversion Flow.gif" alt="Single Document Conversion Demo" width="800">
60
+ </div>
 
 
61
 
62
+ **What it does:** Convert PDFs, Office documents, images, and more to Markdown using 5 powerful parsers:
63
+ - **Gemini Flash**: AI-powered understanding with high accuracy
64
+ - **Mistral OCR**: Fastest processing with document understanding
65
+ - **Docling**: Open source with advanced PDF table recognition
66
+ - **GOT-OCR**: Mathematical/scientific documents to LaTeX
67
+ - **MarkItDown**: High accuracy for CSV/XML and broad format support
68
 
69
+ **Why it matters:** Perfect table preservation creates enhanced markdown tables for superior RAG context, unlike standard PDF text extraction.
70
 
71
+ <div align="center">
72
+ <img src="img/Parser Selection Guide (User-Friendly).png" alt="Parser Selection Guide" width="700">
73
 
74
+ *Choose the right parser for your specific needs and document types*
75
+ </div>
76
 
77
+ ### 3. RAG Chat System in Action
78
+ <div align="center">
79
+ <img src="GIF/RAG Chat System in Action.gif" alt="RAG Chat System Demo" width="800">
80
+ </div>
 
 
 
 
81
 
82
+ **What it does:** Chat with your converted documents using 4 advanced retrieval strategies:
83
+ - **🎯 Similarity**: Traditional semantic similarity using embeddings
84
+ - **🔀 MMR**: Diverse results with reduced redundancy
85
+ - **🔍 BM25**: Traditional keyword-based retrieval
86
+ - **🔗 Hybrid**: Combines semantic + keyword search (recommended)
87
 
88
+ **Why it matters:** Ask for markdown tables in chat responses (impossible with standard PDF RAG), get streaming responses with document context, and easily clear data directly from the interface.
 
 
89
 
90
+ <div align="center">
91
+ <img src="img/RAG Retrieval Strategies (Technical Highlight).png" alt="RAG Retrieval Strategies" width="700">
 
 
92
 
93
+ *Advanced RAG system with 4 retrieval strategies for optimal document search*
94
+ </div>
 
95
 
96
+ ### 4. Query Ranker Analysis
97
+ <div align="center">
98
+ <img src="GIF/Query Ranker Analysis.gif" alt="Query Ranker Demo" width="800">
99
+ </div>
 
100
 
101
+ **What it does:** Interactive document search with:
102
+ - **Real-time ranking** of document chunks with confidence scores
103
+ - **Method comparison** to test different retrieval strategies
104
+ - **Adjustable results** (1-10) with responsive slider control
105
+ - **Transparent scoring** with actual ChromaDB similarity scores
106
 
107
+ **Why it matters:** Provides complete transparency into how your RAG system finds and ranks information, helping you optimize retrieval strategies.
108
 
109
+ ### 5. GOT-OCR LaTeX Processing
110
+ <div align="center">
111
+ <img src="GIF/GOT-OCR LaTeX Processing.gif" alt="GOT-OCR LaTeX Demo" width="800">
112
+ </div>
113
 
114
+ **What it does:** Advanced LaTeX processing for mathematical and scientific documents:
115
+ - **Native LaTeX output** with no LLM conversion for maximum accuracy
116
+ - **Mathpix rendering** using the same library as official GOT-OCR demo
117
+ - **RAG-compatible chunking** that preserves LaTeX structures and mathematical tables
118
+ - **Professional display** with proper mathematical formatting
119
 
120
+ **Why it matters:** Perfect for research papers, scientific documents, and academic content with complex equations and structured data.
 
 
 
 
121
 
122
+ ## 🎯 System Overview
 
 
 
 
123
 
124
+ <div align="center">
125
+ <img src="img/Overall%20System%20Workflow%20(Essential).png" alt="Overall System Workflow" width="600">
 
 
126
 
127
+ *Complete workflow from document upload to intelligent RAG chat interaction*
128
+ </div>
129
 
130
+ ## 🔧 Environment Setup
 
 
131
 
132
+ ### Required API Keys
133
+ ```bash
134
+ GOOGLE_API_KEY=your_gemini_api_key_here # For Gemini Flash parser and RAG chat
135
+ OPENAI_API_KEY=your_openai_api_key_here # For embeddings and AI descriptions
136
+ MISTRAL_API_KEY=your_mistral_api_key_here # For Mistral OCR parser (optional)
137
  ```
138
 
139
+ ### Key Configuration Options
140
+ ```bash
141
+ DEBUG=true # Enable debug logging
142
+ MAX_FILE_SIZE=10485760 # 10MB per file limit
143
+ MAX_BATCH_FILES=5 # Maximum files for multi-document processing
144
+ MAX_BATCH_SIZE=20971520 # 20MB combined limit for batch processing
145
+ CHUNK_SIZE=1000 # Document chunk size for Markdown content
146
+ RETRIEVAL_K=4 # Number of documents to retrieve for RAG
147
+ ```
148
 
149
+ ## 🚀 Local Development
150
 
151
+ ### Quick Start
152
+ ```bash
153
+ # Clone repository
154
+ git clone https://github.com/ansemin/Markit_v2
155
+ cd Markit_v2
156
 
157
+ # Create environment file
158
+ cp .env.example .env
159
+ # Edit .env with your API keys
160
 
161
+ # Install dependencies
162
+ pip install -r requirements.txt
163
 
164
+ # Run application
165
+ python app.py # Full environment setup (HF Spaces compatible)
166
+ python run_app.py # Local development (faster startup)
167
+ python run_app.py --clear-data-and-run # Testing with clean data
168
+ ```
169
 
170
+ ### Data Management
171
+ **Two ways to clear data:**
172
+ 1. **UI Method**: Chat tab → "🗑️ Clear All Data" button (works in both local and HF Space)
173
+ 2. **CLI Method**: `python run_app.py --clear-data-and-run`
174
 
175
+ **What gets cleared:** Vector store embeddings, chat history, and session data
176
 
177
+ ## 🔍 Technical Details
178
+
179
+ ### Retrieval Strategy Performance
180
  | Method | Best For | Accuracy |
181
  |--------|----------|----------|
182
  | **🎯 Similarity** | General semantic questions | Good |
 
184
  | **🔍 BM25** | Exact keyword searches | Medium |
185
  | **🔗 Hybrid** | Most queries (recommended) | **Excellent** |
186
 
187
+ ### Core Technologies
188
+ - **Parsers**: Gemini Flash, Mistral OCR, Docling, GOT-OCR, MarkItDown
189
+ - **RAG System**: OpenAI embeddings + ChromaDB vector store + Gemini 2.5 Flash
190
+ - **UI Framework**: Gradio with modular component architecture
191
+ - **GPU Support**: ZeroGPU integration for HF Spaces
192
 
193
+ ### Smart Content-Aware Chunking
194
+ - **Markdown chunking**: Preserves tables and code blocks
195
+ - **LaTeX chunking**: Preserves mathematical tables, environments, and structures
196
+ - **Automatic format detection**: Optimal chunking strategy per document type
197
 
198
+ ## Credits
 
 
199
 
200
+ - [MarkItDown](https://github.com/microsoft/markitdown) by Microsoft
201
+ - [Docling](https://github.com/DS4SD/docling) by IBM Research
202
+ - [GOT-OCR](https://github.com/stepfun-ai/GOT-OCR-2.0) by StepFun
203
+ - [Mathpix Markdown](https://github.com/Mathpix/mathpix-markdown-it) for LaTeX rendering
204
+ - [Gradio](https://gradio.app/) for the UI framework
205
 
206
+ ---
207
+
208
+ **🚀 [Try it live on Hugging Face Spaces](https://huggingface.co/spaces/Ansemin101/Markit_v2)**
 
 
src/rag/vector_store.py CHANGED
@@ -8,12 +8,35 @@ from langchain_core.documents import Document
8
  from langchain_core.vectorstores import VectorStoreRetriever
9
  from langchain_community.retrievers import BM25Retriever
10
  from langchain.retrievers import EnsembleRetriever
 
11
  from src.rag.embeddings import embedding_manager
12
  from src.core.config import config
13
  from src.core.logging_config import get_logger
14
 
15
  logger = get_logger(__name__)
16
 
 
 
 
 
17
  class VectorStoreManager:
18
  """Manages Chroma vector store for document storage and retrieval."""
19
 
@@ -215,19 +238,19 @@ class VectorStoreManager:
215
  semantic_weight: float = 0.7,
216
  keyword_weight: float = 0.3,
217
  search_type: str = "similarity",
218
- search_kwargs: Optional[Dict[str, Any]] = None) -> EnsembleRetriever:
219
  """
220
  Get a hybrid retriever that combines semantic (vector) and keyword (BM25) search.
221
 
222
  Args:
223
- k: Number of documents to return
224
  semantic_weight: Weight for semantic search (0.0 to 1.0)
225
  keyword_weight: Weight for keyword search (0.0 to 1.0)
226
  search_type: Type of semantic search ("similarity", "mmr", "similarity_score_threshold")
227
  search_kwargs: Additional search parameters for semantic retriever
228
 
229
  Returns:
230
- EnsembleRetriever object combining both approaches
231
  """
232
  try:
233
  # Normalize weights
@@ -259,8 +282,11 @@ class VectorStoreManager:
259
  weights=[semantic_weight, keyword_weight]
260
  )
261
 
262
- logger.info(f"Created hybrid retriever with weights: semantic={semantic_weight:.2f}, keyword={keyword_weight:.2f}")
263
- return ensemble_retriever
 
 
 
264
 
265
  except Exception as e:
266
  logger.error(f"Error creating hybrid retriever: {e}")
 
8
  from langchain_core.vectorstores import VectorStoreRetriever
9
  from langchain_community.retrievers import BM25Retriever
10
  from langchain.retrievers import EnsembleRetriever
11
+ from langchain_core.retrievers import BaseRetriever
12
  from src.rag.embeddings import embedding_manager
13
  from src.core.config import config
14
  from src.core.logging_config import get_logger
15
 
16
  logger = get_logger(__name__)
17
 
18
+
19
+ class LimitedEnsembleRetriever(BaseRetriever):
20
+ """Wrapper around EnsembleRetriever that limits total results to k."""
21
+
22
+ def __init__(self, ensemble_retriever: EnsembleRetriever, k: int):
23
+ super().__init__()
24
+ self.ensemble_retriever = ensemble_retriever
25
+ self.k = k
26
+
27
+ def _get_relevant_documents(self, query: str, *, run_manager=None) -> List[Document]:
28
+ """Get relevant documents, limited to k results."""
29
+ # Get all results from ensemble retriever
30
+ docs = self.ensemble_retriever.get_relevant_documents(query)
31
+ # Limit to k results
32
+ return docs[:self.k]
33
+
34
+ async def _aget_relevant_documents(self, query: str, *, run_manager=None) -> List[Document]:
35
+ """Async version of get_relevant_documents."""
36
+ docs = await self.ensemble_retriever.aget_relevant_documents(query)
37
+ return docs[:self.k]
38
+
39
+
40
  class VectorStoreManager:
41
  """Manages Chroma vector store for document storage and retrieval."""
42
 
 
238
  semantic_weight: float = 0.7,
239
  keyword_weight: float = 0.3,
240
  search_type: str = "similarity",
241
+ search_kwargs: Optional[Dict[str, Any]] = None) -> LimitedEnsembleRetriever:
242
  """
243
  Get a hybrid retriever that combines semantic (vector) and keyword (BM25) search.
244
 
245
  Args:
246
+ k: Number of documents to return (exactly k results will be returned)
247
  semantic_weight: Weight for semantic search (0.0 to 1.0)
248
  keyword_weight: Weight for keyword search (0.0 to 1.0)
249
  search_type: Type of semantic search ("similarity", "mmr", "similarity_score_threshold")
250
  search_kwargs: Additional search parameters for semantic retriever
251
 
252
  Returns:
253
+ LimitedEnsembleRetriever object that returns exactly k results
254
  """
255
  try:
256
  # Normalize weights
 
282
  weights=[semantic_weight, keyword_weight]
283
  )
284
 
285
+ # Wrap with LimitedEnsembleRetriever to ensure exactly k results
286
+ limited_retriever = LimitedEnsembleRetriever(ensemble_retriever, k)
287
+
288
+ logger.info(f"Created hybrid retriever with weights: semantic={semantic_weight:.2f}, keyword={keyword_weight:.2f}, limited to {k} results")
289
+ return limited_retriever
290
 
291
  except Exception as e:
292
  logger.error(f"Error creating hybrid retriever: {e}")