AnseMin committed
Commit 4a97b0c · 1 Parent(s): d85616d

Enhance README and parser functionality for improved document processing

- Updated README to provide a comprehensive overview of the system, including a detailed table of contents and enhanced descriptions of key features.
- Improved GeminiFlashParser to include detailed error messages for unsupported file types and added support for additional MIME types.
- Enhanced MistralOcrParser to support DOCX and PPTX file types, improving document processing capabilities.
- Introduced a new test suite for Gemini wrapper functionality to validate integration with MarkItDown and ensure robust performance.

README.md CHANGED
@@ -14,19 +14,66 @@ hf_oauth: true
14
 
15
  # Document to Markdown Converter with RAG Chat
16
 
17
- A Hugging Face Space that converts various document formats to Markdown and lets you chat with your documents using RAG (Retrieval-Augmented Generation)!
18
 
19
  ## ✨ Key Features
20
 
21
  ### Document Conversion
22
  - Convert PDFs, Office documents, images, and more to Markdown
23
  - **🆕 Multi-Document Processing**: Process up to 5 files simultaneously (20MB combined)
24
- - Multiple parser options:
25
- - MarkItDown: For comprehensive document conversion
26
- - Docling: For advanced PDF understanding with table structure recognition + **multi-document processing**
27
- - GOT-OCR: For image-based OCR with **native LaTeX output** and Mathpix rendering
28
- - Gemini Flash: For AI-powered text extraction with **advanced multi-document capabilities**
29
- - Mistral OCR: High-accuracy OCR for PDFs and images with optional *Document Understanding* mode + **multi-document processing**
30
  - **🆕 Intelligent Processing Types**:
31
  - **Combined**: Merge documents into unified content with duplicate removal
32
  - **Individual**: Separate sections per document with clear organization
@@ -90,47 +137,22 @@ A Hugging Face Space that converts various document formats to Markdown and lets
90
 
91
  ## 🚀 Multi-Document Processing
92
 
93
- ### **What makes this special?**
94
- Markit v2 introduces **industry-leading multi-document processing** with **three powerful parser options**: Gemini Flash (native multi-document AI), Mistral OCR (high-accuracy with Document Understanding), and Docling (advanced PDF analysis). All support intelligent cross-document analysis.
 
95
 
96
  ### **Key Capabilities:**
97
  - **📊 Cross-Document Analysis**: Compare and contrast information across different files
98
  - **🔄 Smart Duplicate Removal**: Intelligently merges overlapping content while preserving unique insights
99
  - **📋 Format Intelligence**: Handles mixed file types (PDF + images, Word + Excel, etc.) seamlessly
100
  - **🧠 Contextual Understanding**: Recognizes relationships and patterns across document boundaries
101
- - **⚡ Single API Call Processing**: Efficient batch processing using Gemini's native multi-document support
102
-
103
- ### **Processing Types Explained:**
104
-
105
- #### 🔗 **Combined Processing**
106
- - **Purpose**: Create one unified, cohesive document from multiple sources
107
- - **Best for**: Related documents that should be read as one complete resource
108
- - **Intelligence**: Removes redundant information while preserving all critical content
109
- - **Example**: Merge project proposal + budget + timeline into one comprehensive document
110
-
111
- #### 📑 **Individual Processing**
112
- - **Purpose**: Convert each document separately but organize them in one output
113
- - **Best for**: Different documents you want in one place for easy reference
114
- - **Intelligence**: Maintains original structure while creating clear organization
115
- - **Example**: Meeting agenda + presentation + notes → organized sections
116
-
117
- #### 📈 **Summary Processing**
118
- - **Purpose**: Executive overview + detailed analysis
119
- - **Best for**: Complex document sets needing high-level insights
120
- - **Intelligence**: Cross-document pattern recognition and key insight extraction
121
- - **Example**: Research papers → executive summary + detailed analysis of each paper
122
-
123
- #### ⚖️ **Comparison Processing**
124
- - **Purpose**: Analyze differences, similarities, and relationships
125
- - **Best for**: Multiple proposals, document versions, or conflicting sources
126
- - **Intelligence**: Creates comparison tables and identifies discrepancies/alignments
127
- - **Example**: Contract versions → side-by-side analysis with change identification
128
-
129
- ### **Technical Advantages:**
130
- - **Native Multimodal Support**: Processes text + images in same workflow
131
- - **Advanced Reasoning**: Understands context and relationships between documents
132
- - **Efficient Processing**: Single Gemini API call vs. multiple individual calls
133
- - **Format Agnostic**: Works across all supported file types seamlessly
134
 
135
  ## Environment Variables
136
 
@@ -191,82 +213,44 @@ The application uses centralized configuration management. You can enhance funct
191
  - `BM25_K1`: BM25 term frequency saturation parameter (default: 1.2)
192
  - `BM25_B`: BM25 field length normalization parameter (default: 0.75)
193
 
194
- ## Usage
195
 
196
  ### Document Conversion
197
 
198
  #### 📄 **Single Document Processing**
199
- 1. Go to the **"Document Converter"** tab
200
- 2. Upload a single file
201
- 3. Choose your preferred parser:
202
- - **"MarkItDown"** for comprehensive document conversion
203
- - **"Docling"** for advanced PDF understanding and table extraction
204
- - **"Gemini Flash"** for AI-powered text extraction
205
- 4. Select an OCR method based on your chosen parser
206
- 5. Click "Convert"
207
- 6. **For GOT-OCR**: View the LaTeX output with **Mathpix rendering** for proper mathematical and tabular display
208
- 7. **For other parsers**: View the Markdown output
209
- 8. Download the converted file (.tex for GOT-OCR, .md for others)
210
-
211
- #### 📂 **Multi-Document Processing** (NEW!)
212
- 1. Go to the **"Document Converter"** tab
213
- 2. Upload **2-5 files** (up to 20MB combined)
214
- 3. **Processing type selector appears automatically**
215
- 4. Choose your processing type:
216
- - **Combined**: Merge all documents into unified content with smart duplicate removal
217
- - **Individual**: Keep documents separate with clear section headers
218
- - **Summary**: Executive overview + detailed analysis of each document
219
- - **Comparison**: Side-by-side analysis with similarities/differences tables
220
- 5. Choose your preferred parser:
221
- - **Gemini Flash**: Best for advanced cross-document reasoning and native multi-document support
222
- - **Mistral OCR**: Great for high-accuracy OCR with Document Understanding mode
223
- - **Docling**: Excellent for PDF table structure + multi-document analysis
224
- 6. Click "Convert"
225
- 7. Get intelligent cross-document analysis and download enhanced output
226
-
227
- #### 💡 **Multi-Document Tips**
228
- - **Mixed file types work great**: Upload PDF + images, Word docs + PDFs, etc.
229
- - **Gemini Flash excels at**: Cross-document reasoning, duplicate detection, and format analysis
230
- - **Perfect for**: Comparing document versions, analyzing related reports, consolidating research
231
- - **Real-time validation**: UI shows file count, size limits, and processing mode
232
-
233
- #### 🤖 **RAG Integration**
234
- - **All converted documents are automatically added to the RAG system** for chat functionality
235
- - Multi-document processing creates richer context for chat interactions
236
-
237
- ### 🤖 Chat with Documents
238
- 1. Go to the **"Chat with Documents"** tab
239
- 2. Check the system status to ensure RAG components are ready
240
- 3. **🆕 Choose your retrieval strategy** for optimal results:
241
- - **Similarity**: Best for general semantic search
242
- - **MMR**: Best for diverse, non-repetitive results
243
- - **Hybrid**: Best overall accuracy (recommended)
244
- 4. Ask questions about your converted documents
245
- 5. Enjoy real-time streaming responses with document context
246
- 6. Use "New Session" to start fresh conversations
247
- 7. Use "🗑️ Clear All Data" to remove all documents and chat history
248
- 8. Monitor your usage limits in the status panel
249
 
250
- ### 🔍 Query Ranker (NEW!)
251
- 1. Go to the **"Query Ranker"** tab
252
- 2. Check the system status to ensure documents are loaded
253
- 3. **Enter your search query** in the search box
254
- 4. **Choose your retrieval method**:
255
- - **🎯 Similarity Search**: Semantic similarity with real scores
256
- - **🔀 MMR (Diverse)**: Diverse results with reduced redundancy
257
- - **🔍 BM25 (Keywords)**: Traditional keyword-based search
258
- - **🔗 Hybrid (Recommended)**: Best overall accuracy combining semantic + keyword
259
- 5. **Adjust result count** (1-10) using the slider
260
- 6. **Review ranked results** with confidence levels and source information
261
- 7. **Compare methods** by trying different retrieval strategies on the same query
262
- 8. Use results to understand how your documents are chunked and ranked
263
-
264
- #### 🔍 **Retrieval Strategy Guide:**
265
- - **For research papers**: Use MMR to get diverse perspectives
266
- - **For technical docs**: Use Hybrid for comprehensive coverage
267
- - **For specific facts**: Use Similarity for targeted results
268
- - **For broad topics**: Use Hybrid for balanced semantic + keyword matching
269
- - **For transparency**: Use Query Ranker to see exactly which chunks are being retrieved
 
270
 
271
  ## Local Development
272
 
@@ -389,203 +373,33 @@ const html = window.render(latexContent, {htmlTags: true});
389
  - [Hugging Face Space](https://huggingface.co/spaces/Ansemin101/Markit_v2)
390
 
391
 
392
- ## 🔍 Advanced RAG Retrieval Strategies
393
-
394
- The system supports **four different retrieval methods** for optimal document search and question answering:
395
-
396
- ### **1. 🎯 Similarity Search (Default)**
397
- - **How it works**: Semantic similarity using OpenAI embeddings
398
- - **Best for**: General questions and semantic understanding
399
- - **Use case**: "What is the main topic of this document?"
400
- - **Configuration**: `{'k': 4, 'search_type': 'similarity'}`
401
- - **Chunking**: Uses content-aware chunking (Markdown or LaTeX) for optimal structure preservation
402
-
403
- ### **2. 🔀 MMR (Maximal Marginal Relevance)**
404
- - **How it works**: Balances relevance with result diversity to reduce redundancy
405
- - **Best for**: Research questions requiring diverse perspectives
406
- - **Use case**: "What are different approaches to transformer architecture?"
407
- - **Configuration**: `{'k': 4, 'fetch_k': 10, 'lambda_mult': 0.5}`
408
- - **Benefits**: Prevents repetitive results, ensures comprehensive coverage
409
-
410
- ### **3. 🔍 BM25 Keyword Search**
411
- - **How it works**: Traditional keyword-based search with TF-IDF scoring
412
- - **Best for**: Exact term matching and specific factual queries
413
- - **Use case**: "Find mentions of 'attention mechanism' in the documents"
414
- - **Configuration**: `{'k': 4}`
415
- - **Benefits**: Excellent for technical terms and specific concepts
416
-
417
- ### **4. 🔗 Hybrid Search (Recommended)**
418
- - **How it works**: Combines semantic embeddings + keyword search using ensemble weighting
419
- - **Best for**: Most queries - provides best overall accuracy
420
- - **Use case**: Any complex question benefiting from both semantic and keyword matching
421
- - **Configuration**: `{'k': 4, 'semantic_weight': 0.7, 'keyword_weight': 0.3}`
422
- - **Benefits**: **87.5% hit rate vs 79.2% for similarity-only** (based on LangChain research)
423
-
424
- ### **🎯 Performance Comparison:**
425
- | Method | Accuracy | Diversity | Speed | Best Use Case |
426
- |--------|----------|-----------|-------|---------------|
427
- | Similarity | Good | Low | Fast | General semantic questions |
428
- | MMR | Good | High | Medium | Research requiring diverse viewpoints |
429
- | BM25 | Medium | Medium | Fast | Exact term/keyword searches |
430
- | **Hybrid** | **Excellent** | **High** | **Medium** | **Most questions (recommended)** |
431
-
432
- ### **💡 Usage Examples:**
433
-
434
- ```python
435
- # In your application code
436
- from src.rag.chat_service import rag_chat_service
437
-
438
- # Use hybrid search (recommended)
439
- response = rag_chat_service.chat_with_retrieval(
440
- "How does attention work in transformers?",
441
- retrieval_method="hybrid",
442
- retrieval_config={'k': 4, 'semantic_weight': 0.8, 'keyword_weight': 0.2}
443
- )
444
-
445
- # Use MMR for diverse research results
446
- response = rag_chat_service.chat_with_retrieval(
447
- "What are different transformer architectures?",
448
- retrieval_method="mmr",
449
- retrieval_config={'k': 3, 'fetch_k': 10, 'lambda_mult': 0.6}
450
- )
451
- ```
452
 
453
- ## Development Guide
454
 
455
- ### Project Structure
456
 
457
- ```
458
- markit_v2/
459
- ├── app.py # Main application entry point (HF Spaces compatible)
460
- ├── run_app.py # 🆕 Lightweight app launcher for local development
461
- ├── setup.sh # Setup script
462
- ├── build.sh # Build script
463
- ├── requirements.txt # Python dependencies
464
- ├── README.md # Project documentation
465
- ├── .env # Environment variables (local development)
466
- ├── .gitignore # Git ignore file
467
- ├── .gitattributes # Git attributes file
468
- ├── src/ # Source code
469
- │ ├── __init__.py # Package initialization
470
- │ ├── main.py # Application launcher
471
- │ ├── core/ # Core functionality and utilities
472
- │ │ ├── __init__.py # Package initialization
473
- │ │ ├── config.py # 🆕 Centralized configuration management (with RAG settings)
474
- │ │ ├── exceptions.py # 🆕 Custom exception hierarchy
475
- │ │ ├── logging_config.py # 🆕 Centralized logging setup
476
- │ │ ├── environment.py # 🆕 Environment setup and dependency management
477
- │ │ ├── converter.py # Document conversion orchestrator (refactored)
478
- │ │ ├── parser_factory.py # Parser factory pattern
479
- │ │ └── latex_to_markdown_converter.py # LaTeX conversion utility
480
- │ ├── services/ # Business logic layer
481
- │ │ ├── __init__.py # Package initialization
482
- │ │ ├── document_service.py # 🆕 Document processing service
483
- │ │ └── data_clearing_service.py # 🆕 Data management and clearing service
484
- │ ├── parsers/ # Parser implementations
485
- │ │ ├── __init__.py # Package initialization
486
- │ │ ├── parser_interface.py # Enhanced parser interface
487
- │ │ ├── parser_registry.py # Parser registry pattern
488
- │ │ ├── markitdown_parser.py # MarkItDown parser (updated)
489
- │ │ ├── docling_parser.py # 🆕 Docling parser with advanced PDF understanding
490
- │ │ ├── got_ocr_parser.py # GOT-OCR parser for images
491
- │ │ ├── mistral_ocr_parser.py # 🆕 Mistral OCR parser
492
- │ │ └── gemini_flash_parser.py # 🆕 Enhanced Gemini Flash parser with multi-document processing
493
- │ ├── rag/ # 🆕 RAG (Retrieval-Augmented Generation) system
494
- │ │ ├── __init__.py # Package initialization
495
- │ │ ├── embeddings.py # OpenAI embedding model management
496
- │ │ ├── chunking.py # Markdown-aware document chunking
497
- │ │ ├── vector_store.py # Chroma vector database management
498
- │ │ ├── memory.py # Chat history and session management
499
- │ │ ├── chat_service.py # RAG chat service with Gemini 2.5 Flash
500
- │ │ └── ingestion.py # Document ingestion pipeline
501
- │ └── ui/ # 🆕 Modular user interface layer
502
- │ ├── __init__.py # Package initialization
503
- │ ├── ui.py # Main UI orchestrator (~60 lines)
504
- │ ├── components/ # UI components
505
- │ │ ├── __init__.py # Package initialization
506
- │ │ ├── document_converter.py # Document converter tab (~200 lines)
507
- │ │ ├── chat_interface.py # Chat interface tab (~180 lines)
508
- │ │ └── query_ranker.py # Query ranker tab (~200 lines)
509
- │ ├── formatters/ # Content formatting utilities
510
- │ │ ├── __init__.py # Package initialization
511
- │ │ └── content_formatters.py # Markdown/LaTeX formatters (~150 lines)
512
- │ ├── styles/ # UI styling
513
- │ │ ├── __init__.py # Package initialization
514
- │ │ └── ui_styles.py # CSS styles and themes (~800 lines)
515
- │ └── utils/ # UI utility functions
516
- │ ├── __init__.py # Package initialization
517
- │ ├── file_validation.py # File validation utilities (~80 lines)
518
- │ └── threading_utils.py # Threading utilities (~40 lines)
519
- ├── documents/ # Documentation and examples (gitignored)
520
- ├── tessdata/ # Tesseract OCR data (gitignored)
521
- └── tests/ # 🆕 Test suite for Phase 1 RAG implementation
522
- ├── __init__.py # Package initialization
523
- ├── README.md # Test documentation and usage guide
524
- ├── test_implementation_structure.py # Structure validation (no API keys)
525
- ├── test_retrieval_methods.py # Full functionality testing
526
- └── test_data_usage.py # Data usage demonstration
527
  ```
528
 
529
- ### 🆕 **New Architecture Components:**
530
- - **Configuration Management**: Centralized API keys, model settings, and app configuration (`src/core/config.py`)
531
- - **Exception Hierarchy**: Proper error handling with specific exception types (`src/core/exceptions.py`)
532
- - **Service Layer**: Business logic separated from UI and core utilities (`src/services/document_service.py`)
533
- - **Data Management Service**: Comprehensive data clearing functionality (`src/services/data_clearing_service.py`)
534
- - **Environment Management**: Automated dependency checking and setup (`src/core/environment.py`)
535
- - **Enhanced Parser Interface**: Validation, metadata, and cancellation support
536
- - **Lightweight Launcher**: Quick development startup with `run_app.py`
537
- - **Centralized Logging**: Configurable logging system (`src/core/logging_config.py`)
538
- - **🆕 RAG System**: Complete RAG implementation with vector search and chat capabilities
539
- - **🆕 Query Ranker Interface**: Dedicated transparency tool for document search and ranking
540
- - **🆕 Modular UI Architecture**: Component-based UI with clear separation of concerns
541
- - **UI Components**: Individual tab components for focused functionality
542
- - **Content Formatters**: Specialized markdown and LaTeX rendering utilities
543
- - **UI Styles**: Centralized CSS styling system with responsive design
544
- - **UI Utils**: File validation and threading utilities for better code organization
545
-
546
- ### 🧠 **RAG System Architecture:**
547
- - **Embeddings Management** (`src/rag/embeddings.py`): OpenAI text-embedding-3-small integration
548
- - **🆕 Smart Content-Aware Chunking** (`src/rag/chunking.py`):
549
- - **Unified chunker** supporting both Markdown and LaTeX content
550
- - **Markdown chunking**: Preserves tables and code blocks as whole units
551
- - **LaTeX chunking**: Preserves `\begin{tabular}`, mathematical environments, and LaTeX structures
552
- - **Automatic format detection**: GOT-OCR results → LaTeX chunker, others → Markdown chunker
553
- - **Enhanced metadata**: Content type tracking and structure detection
554
- - **🆕 Advanced Vector Store** (`src/rag/vector_store.py`): Multi-strategy retrieval system with:
555
- - **Similarity Search**: Traditional semantic retrieval using embeddings
556
- - **MMR Support**: Maximal Marginal Relevance for diverse results
557
- - **BM25 Integration**: Keyword-based search with TF-IDF scoring
558
- - **Hybrid Retrieval**: Ensemble combining semantic + keyword methods
559
- - **Chroma database**: Persistent storage with deduplication
560
- - **Chat Memory** (`src/rag/memory.py`): Session management and conversation history
561
- - **🆕 Enhanced Chat Service** (`src/rag/chat_service.py`): Multi-method RAG with Gemini 2.5 Flash
562
- - **Document Ingestion** (`src/rag/ingestion.py`): Automated pipeline with intelligent duplicate handling
563
- - **Usage Limiting**: Anti-abuse measures for public deployment
564
- - **Auto-Ingestion**: Seamless integration with document conversion workflow
565
-
566
- ### 🗑️ **Data Management & Deduplication:**
567
- - **File Hash-Based Deduplication**: Uses SHA-256 hashes of original file content to prevent duplicates
568
- - **Chroma Where Filter Integration**: Persistent duplicate detection using vector store metadata queries
569
- - **Automatic Document Replacement**: When same file is uploaded again, old version is replaced with new one
570
- - **Cross-Environment Data Clearing**: Works seamlessly in both local development and HF Space environments
571
- - **Environment-Aware Path Resolution**: Automatically detects and uses correct data paths (`./data/*` vs `/tmp/data/*`)
572
- - **Comprehensive Status Reporting**: Real-time display of vector store documents, chat history files, and environment type
573
- - **Safe Clearing Operations**: Graceful error handling with detailed feedback on clearing operations
574
-
575
- ### ZeroGPU Integration Notes
576
-
577
- When developing for Hugging Face Spaces with Stateless GPU:
578
-
579
- 1. Always import the `spaces` module before any CUDA initialization
580
- 2. Place all CUDA operations inside functions decorated with `@spaces.GPU()`
581
- 3. Ensure only picklable objects are passed to GPU-decorated functions
582
- 4. Use wrapper functions to filter out unpicklable objects like thread locks
583
- 5. For advanced use cases, consider implementing fallback mechanisms for serialization errors
584
- 6. **Add `hf_oauth: true` to your Space's README.md metadata** to mitigate GPU quota limitations
585
- 7. Sign in with your Hugging Face account when using the app to utilize your personal GPU quota
586
- 8. For extensive GPU usage without quota limitations, a Hugging Face Pro subscription is required
587
-
588
- > **Note**: If you're implementing a Space with ZeroGPU on your own, you may encounter quota limitations ("GPU task aborted" errors). These can be mitigated by:
589
- > - Adding `hf_oauth: true` to your Space's metadata (as shown in this Space)
590
- > - Having users sign in with their Hugging Face accounts
591
- > - Upgrading to a Hugging Face Pro subscription for dedicated GPU resources
 
14
 
15
  # Document to Markdown Converter with RAG Chat
16
 
17
+ A powerful Hugging Face Space that converts various document formats to Markdown and enables intelligent chat with your documents using advanced RAG (Retrieval-Augmented Generation).
18
+
19
+ <details>
20
+ <summary><strong>Table of contents</strong></summary>
21
+
22
+ <!-- Begin ToC -->
23
+
24
+ - [System Overview](#-system-overview)
25
+ - [Key Features](#-key-features)
26
+ - [Document Conversion](#document-conversion)
27
+ - [RAG Chat with Documents](#-rag-chat-with-documents)
28
+ - [Query Ranker (NEW!)](#-query-ranker-new)
29
+ - [User Interface](#user-interface)
30
+ - [Supported Libraries](#supported-libraries)
31
+ - [Multi-Document Processing](#-multi-document-processing)
32
+ - [Environment Variables](#environment-variables)
33
+ - [API Keys](#-api-keys)
34
+ - [Configuration Options](#️-configuration-options)
35
+ - [Docling Configuration](#-docling-configuration)
36
+ - [Model Configuration](#-model-configuration)
37
+ - [RAG Configuration](#-rag-configuration)
38
+ - [Advanced Retrieval Configuration](#-advanced-retrieval-configuration)
39
+ - [Usage Guide](#-usage-guide)
40
+ - [Parser Selection](#-parser-selection)
41
+ - [Document Conversion](#document-conversion-1)
42
+ - [RAG Chat & Query System](#-rag-chat--query-system)
43
+ - [Local Development](#local-development)
44
+ - [Quick Start](#-quick-start)
45
+ - [Data Management](#-data-management)
46
+ - [Development Features](#-development-features)
47
+ - [GOT-OCR LaTeX Processing](#-got-ocr-latex-processing)
48
+ - [Credits](#credits)
49
+ - [Retrieval Strategies](#-retrieval-strategies)
50
+ - [Development](#-development)
51
+ - [Quick Start](#quick-start)
52
+ - [Key Technologies](#key-technologies)
53
+
54
+ <!-- End ToC -->
55
+
56
+ </details>
57
+
58
+ ## 🎯 System Overview
59
+
60
+ <div align="center">
61
+ <img src="img/Overall%20System%20Workflow%20(Essential).png" alt="Overall System Workflow" width="400">
62
+
63
+ *Complete workflow from document upload to intelligent RAG chat interaction*
64
+ </div>
65
 
66
  ## ✨ Key Features
67
 
68
  ### Document Conversion
69
  - Convert PDFs, Office documents, images, and more to Markdown
70
  - **🆕 Multi-Document Processing**: Process up to 5 files simultaneously (20MB combined)
71
+ - **5 Powerful Parsers**:
72
+ - **Gemini Flash**: General Purpose + High Accuracy
73
+ - **Mistral OCR**: Fastest Processing
74
+ - **Docling**: Open Source
75
+ - **GOT-OCR**: Document to LaTeX + Open Source
76
+ - **MarkItDown**: High Accuracy CSV/XML + Open Source
77
  - **🆕 Intelligent Processing Types**:
78
  - **Combined**: Merge documents into unified content with duplicate removal
79
  - **Individual**: Separate sections per document with clear organization
 
137
 
138
  ## 🚀 Multi-Document Processing
139
 
140
+ <img src="img/Multi-Document%20Processing%20Types%20(Flagship%20Feature).png" alt="Multi-Document Processing Types" width="700">
141
+
142
+ *Industry-leading multi-document processing with 4 intelligent processing types*
143
 
144
  ### **Key Capabilities:**
145
  - **📊 Cross-Document Analysis**: Compare and contrast information across different files
146
  - **🔄 Smart Duplicate Removal**: Intelligently merges overlapping content while preserving unique insights
147
  - **📋 Format Intelligence**: Handles mixed file types (PDF + images, Word + Excel, etc.) seamlessly
148
  - **🧠 Contextual Understanding**: Recognizes relationships and patterns across document boundaries
149
+
150
+ ### **Processing Types:**
151
+
152
+ - **🔗 Combined**: Merge documents into unified content with duplicate removal
153
+ - **📑 Individual**: Separate sections per document with clear organization
154
+ - **📈 Summary**: Executive overview + detailed analysis of all documents
155
+ - **⚖️ Comparison**: Cross-document analysis with similarities/differences tables
156
 
157
  ## Environment Variables
158
 
 
213
  - `BM25_K1`: BM25 term frequency saturation parameter (default: 1.2)
214
  - `BM25_B`: BM25 field length normalization parameter (default: 0.75)
215
 
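As a rough illustration of what `BM25_K1` and `BM25_B` control, the sketch below scores one query term against one document with the standard Okapi BM25 formula. The function name `bm25_term_score` and its arguments are hypothetical and not taken from this repository; the IDF shown is the commonly used smoothed form, and the app's retriever may compute it differently.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Okapi BM25 contribution of one query term to one document's score.

    Hypothetical helper for illustration only.
    tf: occurrences of the term in the document
    doc_len / avg_doc_len: document length and corpus average length
    n_docs / doc_freq: corpus size and number of documents containing the term
    """
    # Smoothed inverse document frequency: rarer terms weigh more.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls length normalization: b=0 ignores length, b=1 normalizes fully.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # k1 controls saturation: repeated occurrences add less and less.
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)
```

A higher `BM25_K1` lets repeated terms keep adding to the score for longer, while a higher `BM25_B` penalizes long documents more strongly.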
216
+ ## 📖 Usage Guide
217
+
218
+ ### 🎯 Parser Selection
219
+
220
+ <img src="img/Parser%20Selection%20Guide%20(User-Friendly).png" alt="Parser Selection Guide" width="700">
221
+
222
+ *Choose the right parser for your specific needs and document types*
223
 
224
  ### Document Conversion
225
 
226
  #### 📄 **Single Document Processing**
227
+ 1. Upload a single file
228
+ 2. Choose your preferred parser
229
+ 3. Select an OCR method based on your chosen parser
230
+ 4. Click "Convert"
231
+ 5. Download the converted file (.tex for GOT-OCR, .md for others)
232
 
233
+ #### 📂 **Multi-Document Processing**
234
+ 1. Upload **2-5 files** (up to 20MB combined)
235
+ 2. Choose processing type: Combined, Individual, Summary, or Comparison
236
+ 3. Select your preferred parser
237
+ 4. Click "Convert" for intelligent cross-document analysis
238
+
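The limits in step 1 could be checked before conversion with something like the sketch below. The names `validate_batch`, `MIN_FILES`, `MAX_FILES`, and `MAX_TOTAL_BYTES` are hypothetical, the repository's own validation may differ, and 20MB is assumed here to mean 20 × 1024 × 1024 bytes.

```python
from pathlib import Path
from typing import List

# Assumed limits, taken from the README text ("2-5 files", "20MB combined").
MIN_FILES = 2
MAX_FILES = 5
MAX_TOTAL_BYTES = 20 * 1024 * 1024


def validate_batch(file_paths: List[Path]) -> int:
    """Return the combined size in bytes, or raise ValueError if a limit is broken."""
    if not MIN_FILES <= len(file_paths) <= MAX_FILES:
        raise ValueError(
            f"Expected {MIN_FILES}-{MAX_FILES} files, got {len(file_paths)}"
        )
    # Sum on-disk sizes; a missing file raises FileNotFoundError from stat().
    total_bytes = sum(path.stat().st_size for path in file_paths)
    if total_bytes > MAX_TOTAL_BYTES:
        raise ValueError(
            f"Combined size {total_bytes} bytes exceeds {MAX_TOTAL_BYTES} bytes"
        )
    return total_bytes
```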
239
+ ### 🤖 RAG Chat & Query System
240
+
241
+ <img src="img/RAG%20Retrieval%20Strategies%20(Technical%20Highlight).png" alt="RAG Retrieval Strategies" width="700">
242
+
243
+ *Advanced RAG system with 4 retrieval strategies for optimal document search*
244
+
245
+ #### **Chat with Documents**
246
+ 1. Choose your retrieval strategy (Similarity, MMR, BM25, or Hybrid)
247
+ 2. Ask questions about your converted documents
248
+ 3. Get real-time streaming responses with document context
249
+
250
+ #### **Query Ranker**
251
+ 1. Enter search queries to explore document chunks
252
+ 2. Compare different retrieval methods
253
+ 3. View confidence scores and source information
254
 
255
  ## Local Development
256
 
 
373
  - [Hugging Face Space](https://huggingface.co/spaces/Ansemin101/Markit_v2)
374
 
375
 
376
+ ## 🔍 Retrieval Strategies
377
 
378
+ | Method | Best For | Accuracy |
379
+ |--------|----------|----------|
380
+ | **🎯 Similarity** | General semantic questions | Good |
381
+ | **🔀 MMR** | Diverse perspectives | Good |
382
+ | **🔍 BM25** | Exact keyword searches | Medium |
383
+ | **🔗 Hybrid** | Most queries (recommended) | **Excellent** |
384
 
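One common way to build a hybrid retriever is weighted Reciprocal Rank Fusion over the ranked lists returned by a semantic retriever and a keyword retriever. The sketch below illustrates that general technique and is not this project's implementation: `weighted_rrf`, the example weights (0.7 semantic, 0.3 keyword), and the constant `c=60` are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rrf(
    rankings: Sequence[List[str]], weights: Sequence[float], c: int = 60
) -> List[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Hypothetical helper: each list contributes weight / (c + rank) per document.
    """
    if len(rankings) != len(weights):
        raise ValueError("rankings and weights must have the same length")
    scores: Dict[str, float] = defaultdict(float)
    for ranked_ids, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Example: made-up result lists from a semantic and a keyword retriever.
semantic_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]
fused = weighted_rrf([semantic_hits, keyword_hits], weights=[0.7, 0.3])
```

Documents that appear near the top of both lists accumulate score from each, which is why a hybrid ranking can surface results that either method alone would rank lower.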
385
+ ## 💻 Development
386
 
387
+ ### Quick Start
388
+ ```bash
389
+ # Clone repository
390
+ git clone https://github.com/ansemin/Markit_v2
391
+
392
+ # Set up environment variables
393
+ cp .env.example .env
394
+ # Edit .env with your API keys
395
+
396
+ # Install dependencies & run
397
+ pip install -r requirements.txt
398
+ python app.py
399
  ```
400
 
401
+ ### Key Technologies
402
+ - **Parsers**: Gemini Flash, Mistral OCR, Docling, GOT-OCR, MarkItDown
403
+ - **RAG System**: OpenAI embeddings + Chroma vector store + Gemini 2.5 Flash
404
+ - **UI Framework**: Gradio with modular component architecture
405
+ - **GPU Support**: ZeroGPU integration for HF Spaces
 
src/parsers/gemini_flash_parser.py CHANGED
@@ -194,8 +194,17 @@ class GeminiFlashParser(DocumentParser):
194
  # Validate file types
195
  for file_path in file_paths:
196
  file_extension = file_path.suffix.lower()
197
- if self._get_mime_type(file_extension) == "application/octet-stream":
198
- raise ValueError(f"Unsupported file type: {file_path.name}")
199
 
200
  def _create_batch_contents(self, file_paths: List[Path], processing_type: str, original_filenames: Optional[List[str]] = None) -> List[Any]:
201
  """Create contents list for batch API call."""
@@ -344,6 +353,7 @@ Return only the markdown content, no other text."""
344
  ".md": "text/markdown",
345
  ".html": "text/html",
346
  ".htm": "text/html",
 
347
  ".jpg": "image/jpeg",
348
  ".jpeg": "image/jpeg",
349
  ".png": "image/png",
 
194
  # Validate file types
195
  for file_path in file_paths:
196
  file_extension = file_path.suffix.lower()
197
+ mime_type = self._get_mime_type(file_extension)
198
+ if mime_type == "application/octet-stream":
199
+ raise ValueError(f"Unsupported file type: {file_path.name}. Gemini supports: PDF, TXT, HTML, CSS, MD, CSV, XML, RTF, JS, PY, and image files.")
200
+ # Reject Office formats that map to a known MIME type but are not supported by Gemini
201
+ if mime_type in ["application/vnd.openxmlformats-officedocument.wordprocessingml.document",
202
+ "application/msword",
203
+ "application/vnd.openxmlformats-officedocument.presentationml.presentation",
204
+ "application/vnd.ms-powerpoint",
205
+ "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
206
+ "application/vnd.ms-excel"]:
207
+ raise ValueError(f"File type not supported by Gemini: {file_path.name}. Gemini supports: PDF, TXT, HTML, CSS, MD, CSV, XML, RTF, JS, PY, and image files.")
208
 
209
  def _create_batch_contents(self, file_paths: List[Path], processing_type: str, original_filenames: Optional[List[str]] = None) -> List[Any]:
210
  """Create contents list for batch API call."""
 
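Read on its own, the validation added in this hunk is a two-stage check: unknown extensions are rejected first, then Office formats that resolve to a known MIME type but are not accepted by Gemini. The standalone sketch below restates that logic outside the class. `MIME_TYPES`, `UNSUPPORTED_OFFICE_MIME_TYPES`, and `check_gemini_file` are hypothetical names, and the extension table is a small assumed subset rather than the parser's full mapping.

```python
from pathlib import Path

# Assumed subset of an extension-to-MIME table, for illustration only.
MIME_TYPES = {
    ".pdf": "application/pdf",
    ".md": "text/markdown",
    ".html": "text/html",
    ".csv": "text/csv",
    ".png": "image/png",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}

# The six Office MIME types listed in the diff.
UNSUPPORTED_OFFICE_MIME_TYPES = frozenset({
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/msword",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "application/vnd.ms-powerpoint",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.ms-excel",
})


def check_gemini_file(file_path: Path) -> str:
    """Return the MIME type if the file passes both checks, else raise ValueError."""
    mime_type = MIME_TYPES.get(file_path.suffix.lower(), "application/octet-stream")
    # Stage 1: the extension is not in the table at all.
    if mime_type == "application/octet-stream":
        raise ValueError(f"Unsupported file type: {file_path.name}")
    # Stage 2: the extension is known, but it is an Office format.
    if mime_type in UNSUPPORTED_OFFICE_MIME_TYPES:
        raise ValueError(f"File type not supported by Gemini: {file_path.name}")
    return mime_type
```

Keeping the rejected types in a named set, rather than an inline list, makes the second stage easier to extend if more formats need to be excluded later.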
353
  ".md": "text/markdown",
354
  ".html": "text/html",
355
  ".htm": "text/html",
356
+ ".csv": "text/csv",
357
  ".jpg": "image/jpeg",
358
  ".jpeg": "image/jpeg",
359
  ".png": "image/png",
src/parsers/mistral_ocr_parser.py CHANGED
@@ -111,8 +111,8 @@ class MistralOcrParser(DocumentParser):
         """Extract document content using basic OCR."""
         try:
             # Process according to file type
-            if file_extension in ['.pdf']:
-                # For PDFs, we need to upload the file to the Mistral API first
                 try:
                     # Upload the file to Mistral API
                     uploaded_pdf = client.files.upload(
@@ -137,20 +137,21 @@ class MistralOcrParser(DocumentParser):
                     )
                 except Exception as e:
                     # If file upload fails, try to use a direct URL method with base64
-                    logger.warning(f"Failed to upload PDF, trying alternate method: {str(e)}")
-                    base64_pdf = self.encode_image(file_path)
 
-                    if base64_pdf:
                         ocr_response = client.ocr.process(
                             model="mistral-ocr-latest",
                             document={
                                 "type": "document_url",
-                                "document_url": f"data:application/pdf;base64,{base64_pdf}"
                             },
                             include_image_base64=True
                         )
                     else:
-                        raise DocumentProcessingError("Failed to process PDF document")
             else:
                 # For images (jpg, png, etc.), use image_url with base64
                 base64_image = self.encode_image(file_path)
@@ -237,9 +238,9 @@ class MistralOcrParser(DocumentParser):
     def _extract_with_document_understanding(self, client, file_path, file_extension):
         """Extract and understand document content using chat completion."""
         try:
-            # For PDFs and images, we'll use Mistral's document understanding capability
-            if file_extension in ['.pdf']:
-                # Upload PDF first
                 try:
                     # Upload the file
                     uploaded_pdf = client.files.upload(
@@ -321,9 +322,13 @@ class MistralOcrParser(DocumentParser):
             raise ConversionError(f"Document understanding failed: {str(e)}")
 
     def _get_mime_type(self, file_extension: str) -> str:
-        """Get the MIME type for a file extension."""
         mime_types = {
             ".pdf": "application/pdf",
             ".jpg": "image/jpeg",
             ".jpeg": "image/jpeg",
             ".png": "image/png",
@@ -331,6 +336,8 @@ class MistralOcrParser(DocumentParser):
             ".bmp": "image/bmp",
             ".tiff": "image/tiff",
             ".tif": "image/tiff",
         }
 
         return mime_types.get(file_extension, "application/octet-stream")
@@ -383,7 +390,7 @@ class MistralOcrParser(DocumentParser):
     def _create_document_part(self, file_path: Path) -> Dict[str, Any]:
         """Return a dict representing an image_url or document_url part for Mistral chat/OCR."""
         ext = file_path.suffix.lower()
-        if ext == '.pdf':
             # upload and get signed url
             client = Mistral(api_key=config.api.mistral_api_key)
             uploaded = client.files.upload(
 
         """Extract document content using basic OCR."""
         try:
             # Process according to file type
+            if file_extension in ['.pdf', '.docx', '.pptx']:
+                # For documents (PDF, DOCX, PPTX), we need to upload the file to the Mistral API first
                 try:
                     # Upload the file to Mistral API
                     uploaded_pdf = client.files.upload(
 
                     )
                 except Exception as e:
                     # If file upload fails, try to use a direct URL method with base64
+                    logger.warning(f"Failed to upload document, trying alternate method: {str(e)}")
+                    base64_doc = self.encode_image(file_path)
 
+                    if base64_doc:
+                        mime_type = self._get_mime_type(file_extension)
                         ocr_response = client.ocr.process(
                             model="mistral-ocr-latest",
                             document={
                                 "type": "document_url",
+                                "document_url": f"data:{mime_type};base64,{base64_doc}"
                             },
                             include_image_base64=True
                         )
                     else:
+                        raise DocumentProcessingError("Failed to process document")
             else:
                 # For images (jpg, png, etc.), use image_url with base64
                 base64_image = self.encode_image(file_path)
 
     def _extract_with_document_understanding(self, client, file_path, file_extension):
         """Extract and understand document content using chat completion."""
         try:
+            # For documents and images, we'll use Mistral's document understanding capability
+            if file_extension in ['.pdf', '.docx', '.pptx']:
+                # Upload document first
                 try:
                     # Upload the file
                     uploaded_pdf = client.files.upload(
 
             raise ConversionError(f"Document understanding failed: {str(e)}")
 
     def _get_mime_type(self, file_extension: str) -> str:
+        """Get the MIME type for a file extension supported by Mistral OCR."""
         mime_types = {
+            # Document formats supported by Mistral OCR
             ".pdf": "application/pdf",
+            ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
+            ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
+            # Image formats supported by Mistral OCR
             ".jpg": "image/jpeg",
             ".jpeg": "image/jpeg",
             ".png": "image/png",
 
             ".bmp": "image/bmp",
             ".tiff": "image/tiff",
             ".tif": "image/tiff",
+            ".avif": "image/avif",
+            ".webp": "image/webp",
         }
 
         return mime_types.get(file_extension, "application/octet-stream")
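The fallback path in this parser no longer hard-codes `application/pdf`; it builds the data URL from the looked-up MIME type. A standalone sketch of that step, assuming the file is read from disk and base64-encoded with the standard library, might look like this. `MISTRAL_MIME_TYPES`, `get_mime_type`, and `build_data_url` are hypothetical names, and the table is abbreviated.

```python
import base64
from pathlib import Path
from typing import Dict

# Hypothetical, abbreviated copy of the extension table from the diff.
MISTRAL_MIME_TYPES: Dict[str, str] = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".png": "image/png",
    ".webp": "image/webp",
}


def get_mime_type(file_extension: str) -> str:
    """Look up a MIME type, falling back to the generic binary type."""
    return MISTRAL_MIME_TYPES.get(file_extension, "application/octet-stream")


def build_data_url(file_path: Path) -> str:
    """Encode a file as a data URL whose MIME type matches its extension."""
    mime_type = get_mime_type(file_path.suffix.lower())
    encoded = base64.b64encode(file_path.read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

Deriving the MIME type from the extension is what lets one fallback branch serve PDF, DOCX, and PPTX. Whether the OCR endpoint accepts each of these as a data URL is not confirmed by the diff.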
 
     def _create_document_part(self, file_path: Path) -> Dict[str, Any]:
         """Return a dict representing an image_url or document_url part for Mistral chat/OCR."""
         ext = file_path.suffix.lower()
+        if ext in ['.pdf', '.docx', '.pptx']:
             # upload and get signed url
             client = Mistral(api_key=config.api.mistral_api_key)
             uploaded = client.files.upload(
test_gemini_wrapper.py → tests/test_gemini_wrapper.py RENAMED
File without changes