Enhance README and parser functionality for improved document processing
- Updated README to provide a comprehensive overview of the system, including a detailed table of contents and enhanced descriptions of key features.
- Improved GeminiFlashParser to include detailed error messages for unsupported file types and added support for additional MIME types.
- Enhanced MistralOcrParser to support DOCX and PPTX file types, improving document processing capabilities.
- Introduced a new test suite for Gemini wrapper functionality to validate integration with MarkItDown and ensure robust performance.
- README.md +122 -308
- src/parsers/gemini_flash_parser.py +12 -2
- src/parsers/mistral_ocr_parser.py +19 -12
- test_gemini_wrapper.py → tests/test_gemini_wrapper.py +0 -0
README.md
CHANGED
|
@@ -14,19 +14,66 @@ hf_oauth: true
|
|
| 14 |
|
| 15 |
# Document to Markdown Converter with RAG Chat
|
| 16 |
|
| 17 |
-
A Hugging Face Space that converts various document formats to Markdown and
|
| 18 |
|
| 19 |
## ✨ Key Features
|
| 20 |
|
| 21 |
### Document Conversion
|
| 22 |
- Convert PDFs, Office documents, images, and more to Markdown
|
| 23 |
- **🆕 Multi-Document Processing**: Process up to 5 files simultaneously (20MB combined)
|
| 24 |
-
-
|
| 25 |
-
-
|
| 26 |
-
-
|
| 27 |
-
-
|
| 28 |
-
-
|
| 29 |
-
-
|
| 30 |
- **🆕 Intelligent Processing Types**:
|
| 31 |
- **Combined**: Merge documents into unified content with duplicate removal
|
| 32 |
- **Individual**: Separate sections per document with clear organization
|
|
@@ -90,47 +137,22 @@ A Hugging Face Space that converts various document formats to Markdown and lets
|
|
| 90 |
|
| 91 |
## 🚀 Multi-Document Processing
|
| 92 |
|
| 93 |
-
|
| 94 |
-
|
| 95 |
|
| 96 |
### **Key Capabilities:**
|
| 97 |
- **📊 Cross-Document Analysis**: Compare and contrast information across different files
|
| 98 |
- **🔄 Smart Duplicate Removal**: Intelligently merges overlapping content while preserving unique insights
|
| 99 |
- **📋 Format Intelligence**: Handles mixed file types (PDF + images, Word + Excel, etc.) seamlessly
|
| 100 |
- **🧠 Contextual Understanding**: Recognizes relationships and patterns across document boundaries
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
-
|
| 107 |
-
-
|
| 108 |
-
- **Intelligence**: Removes redundant information while preserving all critical content
|
| 109 |
-
- **Example**: Merge project proposal + budget + timeline into one comprehensive document
|
| 110 |
-
|
| 111 |
-
#### 📑 **Individual Processing**
|
| 112 |
-
- **Purpose**: Convert each document separately but organize them in one output
|
| 113 |
-
- **Best for**: Different documents you want in one place for easy reference
|
| 114 |
-
- **Intelligence**: Maintains original structure while creating clear organization
|
| 115 |
-
- **Example**: Meeting agenda + presentation + notes → organized sections
|
| 116 |
-
|
| 117 |
-
#### 📈 **Summary Processing**
|
| 118 |
-
- **Purpose**: Executive overview + detailed analysis
|
| 119 |
-
- **Best for**: Complex document sets needing high-level insights
|
| 120 |
-
- **Intelligence**: Cross-document pattern recognition and key insight extraction
|
| 121 |
-
- **Example**: Research papers → executive summary + detailed analysis of each paper
|
| 122 |
-
|
| 123 |
-
#### ⚖️ **Comparison Processing**
|
| 124 |
-
- **Purpose**: Analyze differences, similarities, and relationships
|
| 125 |
-
- **Best for**: Multiple proposals, document versions, or conflicting sources
|
| 126 |
-
- **Intelligence**: Creates comparison tables and identifies discrepancies/alignments
|
| 127 |
-
- **Example**: Contract versions → side-by-side analysis with change identification
|
| 128 |
-
|
| 129 |
-
### **Technical Advantages:**
|
| 130 |
-
- **Native Multimodal Support**: Processes text + images in same workflow
|
| 131 |
-
- **Advanced Reasoning**: Understands context and relationships between documents
|
| 132 |
-
- **Efficient Processing**: Single Gemini API call vs. multiple individual calls
|
| 133 |
-
- **Format Agnostic**: Works across all supported file types seamlessly
|
| 134 |
|
| 135 |
## Environment Variables
|
| 136 |
|
|
@@ -191,82 +213,44 @@ The application uses centralized configuration management. You can enhance funct
|
|
| 191 |
- `BM25_K1`: BM25 term frequency saturation parameter (default: 1.2)
|
| 192 |
- `BM25_B`: BM25 field length normalization parameter (default: 0.75)
|
| 193 |
|
| 194 |
-
## Usage
|
| 195 |
|
| 196 |
### Document Conversion
|
| 197 |
|
| 198 |
#### 📄 **Single Document Processing**
|
| 199 |
-
1.
|
| 200 |
-
2.
|
| 201 |
-
3.
|
| 202 |
-
|
| 203 |
-
|
| 204 |
-
- **"Gemini Flash"** for AI-powered text extraction
|
| 205 |
-
4. Select an OCR method based on your chosen parser
|
| 206 |
-
5. Click "Convert"
|
| 207 |
-
6. **For GOT-OCR**: View the LaTeX output with **Mathpix rendering** for proper mathematical and tabular display
|
| 208 |
-
7. **For other parsers**: View the Markdown output
|
| 209 |
-
8. Download the converted file (.tex for GOT-OCR, .md for others)
|
| 210 |
-
|
| 211 |
-
#### 📂 **Multi-Document Processing** (NEW!)
|
| 212 |
-
1. Go to the **"Document Converter"** tab
|
| 213 |
-
2. Upload **2-5 files** (up to 20MB combined)
|
| 214 |
-
3. **Processing type selector appears automatically**
|
| 215 |
-
4. Choose your processing type:
|
| 216 |
-
- **Combined**: Merge all documents into unified content with smart duplicate removal
|
| 217 |
-
- **Individual**: Keep documents separate with clear section headers
|
| 218 |
-
- **Summary**: Executive overview + detailed analysis of each document
|
| 219 |
-
- **Comparison**: Side-by-side analysis with similarities/differences tables
|
| 220 |
-
5. Choose your preferred parser:
|
| 221 |
-
- **Gemini Flash**: Best for advanced cross-document reasoning and native multi-document support
|
| 222 |
-
- **Mistral OCR**: Great for high-accuracy OCR with Document Understanding mode
|
| 223 |
-
- **Docling**: Excellent for PDF table structure + multi-document analysis
|
| 224 |
-
6. Click "Convert"
|
| 225 |
-
7. Get intelligent cross-document analysis and download enhanced output
|
| 226 |
-
|
| 227 |
-
#### 💡 **Multi-Document Tips**
|
| 228 |
-
- **Mixed file types work great**: Upload PDF + images, Word docs + PDFs, etc.
|
| 229 |
-
- **Gemini Flash excels at**: Cross-document reasoning, duplicate detection, and format analysis
|
| 230 |
-
- **Perfect for**: Comparing document versions, analyzing related reports, consolidating research
|
| 231 |
-
- **Real-time validation**: UI shows file count, size limits, and processing mode
|
| 232 |
-
|
| 233 |
-
#### 🤖 **RAG Integration**
|
| 234 |
-
- **All converted documents are automatically added to the RAG system** for chat functionality
|
| 235 |
-
- Multi-document processing creates richer context for chat interactions
|
| 236 |
-
|
| 237 |
-
### 🤖 Chat with Documents
|
| 238 |
-
1. Go to the **"Chat with Documents"** tab
|
| 239 |
-
2. Check the system status to ensure RAG components are ready
|
| 240 |
-
3. **🆕 Choose your retrieval strategy** for optimal results:
|
| 241 |
-
- **Similarity**: Best for general semantic search
|
| 242 |
-
- **MMR**: Best for diverse, non-repetitive results
|
| 243 |
-
- **Hybrid**: Best overall accuracy (recommended)
|
| 244 |
-
4. Ask questions about your converted documents
|
| 245 |
-
5. Enjoy real-time streaming responses with document context
|
| 246 |
-
6. Use "New Session" to start fresh conversations
|
| 247 |
-
7. Use "🗑️ Clear All Data" to remove all documents and chat history
|
| 248 |
-
8. Monitor your usage limits in the status panel
|
| 249 |
|
| 250 |
-
|
| 251 |
-
1.
|
| 252 |
-
2.
|
| 253 |
-
3.
|
| 254 |
-
4.
|
| 255 |
-
|
| 256 |
-
|
| 257 |
-
|
| 258 |
-
|
| 259 |
-
|
| 260 |
-
|
| 261 |
-
|
| 262 |
-
|
| 263 |
-
|
| 264 |
-
|
| 265 |
-
|
| 266 |
-
|
| 267 |
-
|
| 268 |
-
|
| 269 |
-
|
| 270 |
|
| 271 |
## Local Development
|
| 272 |
|
|
@@ -389,203 +373,33 @@ const html = window.render(latexContent, {htmlTags: true});
|
|
| 389 |
- [Hugging Face Space](https://huggingface.co/spaces/Ansemin101/Markit_v2)
|
| 390 |
|
| 391 |
|
| 392 |
-
## 🔍
|
| 393 |
-
|
| 394 |
-
The system supports **four different retrieval methods** for optimal document search and question answering:
|
| 395 |
-
|
| 396 |
-
### **1. 🎯 Similarity Search (Default)**
|
| 397 |
-
- **How it works**: Semantic similarity using OpenAI embeddings
|
| 398 |
-
- **Best for**: General questions and semantic understanding
|
| 399 |
-
- **Use case**: "What is the main topic of this document?"
|
| 400 |
-
- **Configuration**: `{'k': 4, 'search_type': 'similarity'}`
|
| 401 |
-
- **Chunking**: Uses content-aware chunking (Markdown or LaTeX) for optimal structure preservation
|
| 402 |
-
|
| 403 |
-
### **2. 🔀 MMR (Maximal Marginal Relevance)**
|
| 404 |
-
- **How it works**: Balances relevance with result diversity to reduce redundancy
|
| 405 |
-
- **Best for**: Research questions requiring diverse perspectives
|
| 406 |
-
- **Use case**: "What are different approaches to transformer architecture?"
|
| 407 |
-
- **Configuration**: `{'k': 4, 'fetch_k': 10, 'lambda_mult': 0.5}`
|
| 408 |
-
- **Benefits**: Prevents repetitive results, ensures comprehensive coverage
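The relevance/diversity trade-off controlled by `lambda_mult` can be illustrated with a small, dependency-free sketch. This is not the project's retrieval code (the diff does not show it); `cosine` and `mmr_select` are hypothetical names, and `doc_vecs` stands in for the `fetch_k` candidates returned by the initial similarity search.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    """Pick up to k candidate indices by Maximal Marginal Relevance.

    lambda_mult=1.0 ranks purely by relevance; lower values penalise
    candidates that resemble documents already selected.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With two near-identical candidates and one distinct candidate, a lower `lambda_mult` tends to swap the second near-duplicate for the distinct one, which is the "non-repetitive results" behaviour described above.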
|
| 409 |
-
|
| 410 |
-
### **3. 🔍 BM25 Keyword Search**
|
| 411 |
-
- **How it works**: Traditional keyword-based search with BM25 scoring (a TF-IDF-style ranking function)
|
| 412 |
-
- **Best for**: Exact term matching and specific factual queries
|
| 413 |
-
- **Use case**: "Find mentions of 'attention mechanism' in the documents"
|
| 414 |
-
- **Configuration**: `{'k': 4}`
|
| 415 |
-
- **Benefits**: Excellent for technical terms and specific concepts
|
| 416 |
-
|
| 417 |
-
### **4. 🔗 Hybrid Search (Recommended)**
|
| 418 |
-
- **How it works**: Combines semantic embeddings + keyword search using ensemble weighting
|
| 419 |
-
- **Best for**: Most queries - provides best overall accuracy
|
| 420 |
-
- **Use case**: Any complex question benefiting from both semantic and keyword matching
|
| 421 |
-
- **Configuration**: `{'k': 4, 'semantic_weight': 0.7, 'keyword_weight': 0.3}`
|
| 422 |
-
- **Benefits**: **87.5% hit rate vs 79.2% for similarity-only** (based on LangChain research)
|
| 423 |
-
|
| 424 |
-
### **🎯 Performance Comparison:**
|
| 425 |
-
| Method | Accuracy | Diversity | Speed | Best Use Case |
|
| 426 |
-
|--------|----------|-----------|-------|---------------|
|
| 427 |
-
| Similarity | Good | Low | Fast | General semantic questions |
|
| 428 |
-
| MMR | Good | High | Medium | Research requiring diverse viewpoints |
|
| 429 |
-
| BM25 | Medium | Medium | Fast | Exact term/keyword searches |
|
| 430 |
-
| **Hybrid** | **Excellent** | **High** | **Medium** | **Most questions (recommended)** |
|
| 431 |
-
|
| 432 |
-
### **💡 Usage Examples:**
|
| 433 |
-
|
| 434 |
-
```python
|
| 435 |
-
# In your application code
|
| 436 |
-
from src.rag.chat_service import rag_chat_service
|
| 437 |
-
|
| 438 |
-
# Use hybrid search (recommended)
|
| 439 |
-
response = rag_chat_service.chat_with_retrieval(
|
| 440 |
-
"How does attention work in transformers?",
|
| 441 |
-
retrieval_method="hybrid",
|
| 442 |
-
retrieval_config={'k': 4, 'semantic_weight': 0.8, 'keyword_weight': 0.2}
|
| 443 |
-
)
|
| 444 |
-
|
| 445 |
-
# Use MMR for diverse research results
|
| 446 |
-
response = rag_chat_service.chat_with_retrieval(
|
| 447 |
-
"What are different transformer architectures?",
|
| 448 |
-
retrieval_method="mmr",
|
| 449 |
-
retrieval_config={'k': 3, 'fetch_k': 10, 'lambda_mult': 0.6}
|
| 450 |
-
)
|
| 451 |
-
```
|
| 452 |
|
| 453 |
-
|
| 454 |
|
| 455 |
-
|
| 456 |
|
| 457 |
-
|
| 458 |
-
|
| 459 |
-
|
| 460 |
-
|
| 461 |
-
|
| 462 |
-
|
| 463 |
-
|
| 464 |
-
|
| 465 |
-
|
| 466 |
-
|
| 467 |
-
|
| 468 |
-
|
| 469 |
-
│ ├── __init__.py # Package initialization
|
| 470 |
-
│ ├── main.py # Application launcher
|
| 471 |
-
│ ├── core/ # Core functionality and utilities
|
| 472 |
-
│ │ ├── __init__.py # Package initialization
|
| 473 |
-
│ │ ├── config.py # 🆕 Centralized configuration management (with RAG settings)
|
| 474 |
-
│ │ ├── exceptions.py # 🆕 Custom exception hierarchy
|
| 475 |
-
│ │ ├── logging_config.py # 🆕 Centralized logging setup
|
| 476 |
-
│ │ ├── environment.py # 🆕 Environment setup and dependency management
|
| 477 |
-
│ │ ├── converter.py # Document conversion orchestrator (refactored)
|
| 478 |
-
│ │ ├── parser_factory.py # Parser factory pattern
|
| 479 |
-
│ │ └── latex_to_markdown_converter.py # LaTeX conversion utility
|
| 480 |
-
│ ├── services/ # Business logic layer
|
| 481 |
-
│ │ ├── __init__.py # Package initialization
|
| 482 |
-
│ │ ├── document_service.py # 🆕 Document processing service
|
| 483 |
-
│ │ └── data_clearing_service.py # 🆕 Data management and clearing service
|
| 484 |
-
│ ├── parsers/ # Parser implementations
|
| 485 |
-
│ │ ├── __init__.py # Package initialization
|
| 486 |
-
│ │ ├── parser_interface.py # Enhanced parser interface
|
| 487 |
-
│ │ ├── parser_registry.py # Parser registry pattern
|
| 488 |
-
│ │ ├── markitdown_parser.py # MarkItDown parser (updated)
|
| 489 |
-
│ │ ├── docling_parser.py # 🆕 Docling parser with advanced PDF understanding
|
| 490 |
-
│ │ ├── got_ocr_parser.py # GOT-OCR parser for images
|
| 491 |
-
│ │ ├── mistral_ocr_parser.py # 🆕 Mistral OCR parser
|
| 492 |
-
│ │ └── gemini_flash_parser.py # 🆕 Enhanced Gemini Flash parser with multi-document processing
|
| 493 |
-
│ ├── rag/ # 🆕 RAG (Retrieval-Augmented Generation) system
|
| 494 |
-
│ │ ├── __init__.py # Package initialization
|
| 495 |
-
│ │ ├── embeddings.py # OpenAI embedding model management
|
| 496 |
-
│ │ ├── chunking.py # Markdown-aware document chunking
|
| 497 |
-
│ │ ├── vector_store.py # Chroma vector database management
|
| 498 |
-
│ │ ├── memory.py # Chat history and session management
|
| 499 |
-
│ │ ├── chat_service.py # RAG chat service with Gemini 2.5 Flash
|
| 500 |
-
│ │ └── ingestion.py # Document ingestion pipeline
|
| 501 |
-
│ └── ui/ # 🆕 Modular user interface layer
|
| 502 |
-
│ ├── __init__.py # Package initialization
|
| 503 |
-
│ ├── ui.py # Main UI orchestrator (~60 lines)
|
| 504 |
-
│ ├── components/ # UI components
|
| 505 |
-
│ │ ├── __init__.py # Package initialization
|
| 506 |
-
│ │ ├── document_converter.py # Document converter tab (~200 lines)
|
| 507 |
-
│ │ ├── chat_interface.py # Chat interface tab (~180 lines)
|
| 508 |
-
│ │ └── query_ranker.py # Query ranker tab (~200 lines)
|
| 509 |
-
│ ├── formatters/ # Content formatting utilities
|
| 510 |
-
│ │ ├── __init__.py # Package initialization
|
| 511 |
-
│ │ └── content_formatters.py # Markdown/LaTeX formatters (~150 lines)
|
| 512 |
-
│ ├── styles/ # UI styling
|
| 513 |
-
│ │ ├── __init__.py # Package initialization
|
| 514 |
-
│ │ └── ui_styles.py # CSS styles and themes (~800 lines)
|
| 515 |
-
│ └── utils/ # UI utility functions
|
| 516 |
-
│ ├── __init__.py # Package initialization
|
| 517 |
-
│ ├── file_validation.py # File validation utilities (~80 lines)
|
| 518 |
-
│ └── threading_utils.py # Threading utilities (~40 lines)
|
| 519 |
-
├── documents/ # Documentation and examples (gitignored)
|
| 520 |
-
├── tessdata/ # Tesseract OCR data (gitignored)
|
| 521 |
-
└── tests/ # 🆕 Test suite for Phase 1 RAG implementation
|
| 522 |
-
├── __init__.py # Package initialization
|
| 523 |
-
├── README.md # Test documentation and usage guide
|
| 524 |
-
├── test_implementation_structure.py # Structure validation (no API keys)
|
| 525 |
-
├── test_retrieval_methods.py # Full functionality testing
|
| 526 |
-
└── test_data_usage.py # Data usage demonstration
|
| 527 |
```
|
| 528 |
|
| 529 |
-
###
|
| 530 |
-
- **
|
| 531 |
-
- **
|
| 532 |
-
- **
|
| 533 |
-
- **
|
| 534 |
-
- **Environment Management**: Automated dependency checking and setup (`src/core/environment.py`)
|
| 535 |
-
- **Enhanced Parser Interface**: Validation, metadata, and cancellation support
|
| 536 |
-
- **Lightweight Launcher**: Quick development startup with `run_app.py`
|
| 537 |
-
- **Centralized Logging**: Configurable logging system (`src/core/logging_config.py`)
|
| 538 |
-
- **🆕 RAG System**: Complete RAG implementation with vector search and chat capabilities
|
| 539 |
-
- **🆕 Query Ranker Interface**: Dedicated transparency tool for document search and ranking
|
| 540 |
-
- **🆕 Modular UI Architecture**: Component-based UI with clear separation of concerns
|
| 541 |
-
- **UI Components**: Individual tab components for focused functionality
|
| 542 |
-
- **Content Formatters**: Specialized markdown and LaTeX rendering utilities
|
| 543 |
-
- **UI Styles**: Centralized CSS styling system with responsive design
|
| 544 |
-
- **UI Utils**: File validation and threading utilities for better code organization
|
| 545 |
-
|
| 546 |
-
### 🧠 **RAG System Architecture:**
|
| 547 |
-
- **Embeddings Management** (`src/rag/embeddings.py`): OpenAI text-embedding-3-small integration
|
| 548 |
-
- **🆕 Smart Content-Aware Chunking** (`src/rag/chunking.py`):
|
| 549 |
-
- **Unified chunker** supporting both Markdown and LaTeX content
|
| 550 |
-
- **Markdown chunking**: Preserves tables and code blocks as whole units
|
| 551 |
-
- **LaTeX chunking**: Preserves `\begin{tabular}`, mathematical environments, and LaTeX structures
|
| 552 |
-
- **Automatic format detection**: GOT-OCR results → LaTeX chunker, others → Markdown chunker
|
| 553 |
-
- **Enhanced metadata**: Content type tracking and structure detection
|
| 554 |
-
- **🆕 Advanced Vector Store** (`src/rag/vector_store.py`): Multi-strategy retrieval system with:
|
| 555 |
-
- **Similarity Search**: Traditional semantic retrieval using embeddings
|
| 556 |
-
- **MMR Support**: Maximal Marginal Relevance for diverse results
|
| 557 |
-
- **BM25 Integration**: Keyword-based search with BM25 scoring (a TF-IDF-style ranking function)
|
| 558 |
-
- **Hybrid Retrieval**: Ensemble combining semantic + keyword methods
|
| 559 |
-
- **Chroma database**: Persistent storage with deduplication
|
| 560 |
-
- **Chat Memory** (`src/rag/memory.py`): Session management and conversation history
|
| 561 |
-
- **🆕 Enhanced Chat Service** (`src/rag/chat_service.py`): Multi-method RAG with Gemini 2.5 Flash
|
| 562 |
-
- **Document Ingestion** (`src/rag/ingestion.py`): Automated pipeline with intelligent duplicate handling
|
| 563 |
-
- **Usage Limiting**: Anti-abuse measures for public deployment
|
| 564 |
-
- **Auto-Ingestion**: Seamless integration with document conversion workflow
|
| 565 |
-
|
| 566 |
-
### 🗑️ **Data Management & Deduplication:**
|
| 567 |
-
- **File Hash-Based Deduplication**: Uses SHA-256 hashes of original file content to prevent duplicates
|
| 568 |
-
- **Chroma Where Filter Integration**: Persistent duplicate detection using vector store metadata queries
|
| 569 |
-
- **Automatic Document Replacement**: When same file is uploaded again, old version is replaced with new one
|
| 570 |
-
- **Cross-Environment Data Clearing**: Works seamlessly in both local development and HF Space environments
|
| 571 |
-
- **Environment-Aware Path Resolution**: Automatically detects and uses correct data paths (`./data/*` vs `/tmp/data/*`)
|
| 572 |
-
- **Comprehensive Status Reporting**: Real-time display of vector store documents, chat history files, and environment type
|
| 573 |
-
- **Safe Clearing Operations**: Graceful error handling with detailed feedback on clearing operations
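The hash-based deduplication listed above can be sketched as follows. This is a minimal illustration, not the code in `src/rag/ingestion.py`: `file_sha256` and `HashIndex` are hypothetical names, and the in-memory dictionary stands in for the Chroma metadata query the README mentions.

```python
import hashlib


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class HashIndex:
    """In-memory stand-in for a vector-store metadata lookup (hypothetical)."""

    def __init__(self):
        self._by_hash = {}

    def ingest(self, path):
        """Record the file; return True if its content has not been seen before.

        A repeated hash overwrites the stored entry, mirroring the
        'replace old version with new one' behaviour described above.
        """
        file_hash = file_sha256(path)
        is_new = file_hash not in self._by_hash
        self._by_hash[file_hash] = str(path)
        return is_new
```

Hashing the original file bytes rather than the converted Markdown means the same upload is recognised even if a different parser would produce different output.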
|
| 574 |
-
|
| 575 |
-
### ZeroGPU Integration Notes
|
| 576 |
-
|
| 577 |
-
When developing for Hugging Face Spaces with Stateless GPU:
|
| 578 |
-
|
| 579 |
-
1. Always import the `spaces` module before any CUDA initialization
|
| 580 |
-
2. Place all CUDA operations inside functions decorated with `@spaces.GPU()`
|
| 581 |
-
3. Ensure only picklable objects are passed to GPU-decorated functions
|
| 582 |
-
4. Use wrapper functions to filter out unpicklable objects like thread locks
|
| 583 |
-
5. For advanced use cases, consider implementing fallback mechanisms for serialization errors
|
| 584 |
-
6. **Add `hf_oauth: true` to your Space's README.md metadata** to mitigate GPU quota limitations
|
| 585 |
-
7. Sign in with your Hugging Face account when using the app to utilize your personal GPU quota
|
| 586 |
-
8. For extensive GPU usage without quota limitations, a Hugging Face Pro subscription is required
|
| 587 |
-
|
| 588 |
-
> **Note**: If you're implementing a Space with ZeroGPU on your own, you may encounter quota limitations ("GPU task aborted" errors). These can be mitigated by:
|
| 589 |
-
> - Adding `hf_oauth: true` to your Space's metadata (as shown in this Space)
|
| 590 |
-
> - Having users sign in with their Hugging Face accounts
|
| 591 |
-
> - Upgrading to a Hugging Face Pro subscription for dedicated GPU resources
|
|
|
|
| 14 |
|
| 15 |
# Document to Markdown Converter with RAG Chat
|
| 16 |
|
| 17 |
+
A powerful Hugging Face Space that converts various document formats to Markdown and enables intelligent chat with your documents using advanced RAG (Retrieval-Augmented Generation).
|
| 18 |
+
|
| 19 |
+
<details>
|
| 20 |
+
<summary><strong>Table of contents</strong></summary>
|
| 21 |
+
|
| 22 |
+
<!-- Begin ToC -->
|
| 23 |
+
|
| 24 |
+
- [System Overview](#-system-overview)
|
| 25 |
+
- [Key Features](#-key-features)
|
| 26 |
+
- [Document Conversion](#document-conversion)
|
| 27 |
+
- [RAG Chat with Documents](#-rag-chat-with-documents)
|
| 28 |
+
- [Query Ranker (NEW!)](#-query-ranker-new)
|
| 29 |
+
- [User Interface](#user-interface)
|
| 30 |
+
- [Supported Libraries](#supported-libraries)
|
| 31 |
+
- [Multi-Document Processing](#-multi-document-processing)
|
| 32 |
+
- [Environment Variables](#environment-variables)
|
| 33 |
+
- [API Keys](#-api-keys)
|
| 34 |
+
- [Configuration Options](#️-configuration-options)
|
| 35 |
+
- [Docling Configuration](#-docling-configuration)
|
| 36 |
+
- [Model Configuration](#-model-configuration)
|
| 37 |
+
- [RAG Configuration](#-rag-configuration)
|
| 38 |
+
- [Advanced Retrieval Configuration](#-advanced-retrieval-configuration)
|
| 39 |
+
- [Usage Guide](#-usage-guide)
|
| 40 |
+
- [Parser Selection](#-parser-selection)
|
| 41 |
+
- [Document Conversion](#document-conversion-1)
|
| 42 |
+
- [RAG Chat & Query System](#-rag-chat--query-system)
|
| 43 |
+
- [Local Development](#local-development)
|
| 44 |
+
- [Quick Start](#-quick-start)
|
| 45 |
+
- [Data Management](#-data-management)
|
| 46 |
+
- [Development Features](#-development-features)
|
| 47 |
+
- [GOT-OCR LaTeX Processing](#-got-ocr-latex-processing)
|
| 48 |
+
- [Credits](#credits)
|
| 49 |
+
- [Retrieval Strategies](#-retrieval-strategies)
|
| 50 |
+
- [Development](#-development)
|
| 51 |
+
- [Quick Start](#quick-start)
|
| 52 |
+
- [Key Technologies](#key-technologies)
|
| 53 |
+
|
| 54 |
+
<!-- End ToC -->
|
| 55 |
+
|
| 56 |
+
</details>
|
| 57 |
+
|
| 58 |
+
## 🎯 System Overview
|
| 59 |
+
|
| 60 |
+
<div align="center">
|
| 61 |
+
<img src="img/Overall%20System%20Workflow%20(Essential).png" alt="Overall System Workflow" width="400">
|
| 62 |
+
|
| 63 |
+
*Complete workflow from document upload to intelligent RAG chat interaction*
|
| 64 |
+
</div>
|
| 65 |
|
| 66 |
## ✨ Key Features
|
| 67 |
|
| 68 |
### Document Conversion
|
| 69 |
- Convert PDFs, Office documents, images, and more to Markdown
|
| 70 |
- **🆕 Multi-Document Processing**: Process up to 5 files simultaneously (20MB combined)
|
| 71 |
+
- **5 Powerful Parsers**:
|
| 72 |
+
- **Gemini Flash**: General Purpose + High Accuracy
|
| 73 |
+
- **Mistral OCR**: Fastest Processing
|
| 74 |
+
- **Docling**: Open Source
|
| 75 |
+
- **GOT-OCR**: Document to LaTeX + Open Source
|
| 76 |
+
- **MarkItDown**: High Accuracy CSV/XML + Open Source
|
| 77 |
- **🆕 Intelligent Processing Types**:
|
| 78 |
- **Combined**: Merge documents into unified content with duplicate removal
|
| 79 |
- **Individual**: Separate sections per document with clear organization
|
|
|
|
| 137 |
|
| 138 |
## 🚀 Multi-Document Processing
|
| 139 |
|
| 140 |
+
<img src="img/Multi-Document%20Processing%20Types%20(Flagship%20Feature).png" alt="Multi-Document Processing Types" width="700">
|
| 141 |
+
|
| 142 |
+
*Industry-leading multi-document processing with 4 intelligent processing types*
|
| 143 |
|
| 144 |
### **Key Capabilities:**
|
| 145 |
- **📊 Cross-Document Analysis**: Compare and contrast information across different files
|
| 146 |
- **🔄 Smart Duplicate Removal**: Intelligently merges overlapping content while preserving unique insights
|
| 147 |
- **📋 Format Intelligence**: Handles mixed file types (PDF + images, Word + Excel, etc.) seamlessly
|
| 148 |
- **🧠 Contextual Understanding**: Recognizes relationships and patterns across document boundaries
|
| 149 |
+
|
| 150 |
+
### **Processing Types:**
|
| 151 |
+
|
| 152 |
+
- **🔗 Combined**: Merge documents into unified content with duplicate removal
|
| 153 |
+
- **📑 Individual**: Separate sections per document with clear organization
|
| 154 |
+
- **📈 Summary**: Executive overview + detailed analysis of all documents
|
| 155 |
+
- **⚖️ Comparison**: Cross-document analysis with similarities/differences tables
|
|
| 156 |
|
| 157 |
## Environment Variables
|
| 158 |
|
|
|
|
| 213 |
- `BM25_K1`: BM25 term frequency saturation parameter (default: 1.2)
|
| 214 |
- `BM25_B`: BM25 field length normalization parameter (default: 0.75)
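To make the two BM25 parameters concrete, here is a minimal sketch of Okapi BM25 scoring over pre-tokenized documents. It is an illustration under stated assumptions, not the project's retriever: `bm25_scores` is a hypothetical helper, and the IDF variant shown is one common formulation.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    k1 controls how quickly repeated terms stop adding to the score;
    b controls how strongly long documents are penalised (0 = not at all).
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Raising `BM25_K1` lets repeated terms keep contributing for longer, while lowering `BM25_B` toward 0 reduces the penalty applied to long documents.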
|
| 215 |
|
| 216 |
+
## 📖 Usage Guide
|
| 217 |
+
|
| 218 |
+
### 🎯 Parser Selection
|
| 219 |
+
|
| 220 |
+
<img src="img/Parser%20Selection%20Guide%20(User-Friendly).png" alt="Parser Selection Guide" width="700">
|
| 221 |
+
|
| 222 |
+
*Choose the right parser for your specific needs and document types*
|
| 223 |
|
| 224 |
### Document Conversion
|
| 225 |
|
| 226 |
#### 📄 **Single Document Processing**
|
| 227 |
+
1. Upload a single file
|
| 228 |
+
2. Choose your preferred parser
|
| 229 |
+
3. Select an OCR method based on your chosen parser
|
| 230 |
+
4. Click "Convert"
|
| 231 |
+
5. Download the converted file (.tex for GOT-OCR, .md for others)
|
|
| 232 |
|
| 233 |
+
#### 📂 **Multi-Document Processing**
|
| 234 |
+
1. Upload **2-5 files** (up to 20MB combined)
|
| 235 |
+
2. Choose processing type: Combined, Individual, Summary, or Comparison
|
| 236 |
+
3. Select your preferred parser
|
| 237 |
+
4. Click "Convert" for intelligent cross-document analysis
|
| 238 |
+
|
| 239 |
+
### 🤖 RAG Chat & Query System
|
| 240 |
+
|
| 241 |
+
<img src="img/RAG%20Retrieval%20Strategies%20(Technical%20Highlight).png" alt="RAG Retrieval Strategies" width="700">
|
| 242 |
+
|
| 243 |
+
*Advanced RAG system with 4 retrieval strategies for optimal document search*
|
| 244 |
+
|
| 245 |
+
#### **Chat with Documents**
|
| 246 |
+
1. Choose your retrieval strategy (Similarity, MMR, BM25, or Hybrid)
|
| 247 |
+
2. Ask questions about your converted documents
|
| 248 |
+
3. Get real-time streaming responses with document context
|
| 249 |
+
|
| 250 |
+
#### **Query Ranker**
|
| 251 |
+
1. Enter search queries to explore document chunks
|
| 252 |
+
2. Compare different retrieval methods
|
| 253 |
+
3. View confidence scores and source information
|
| 254 |
|
| 255 |
## Local Development
|
| 256 |
|
|
|
|
| 373 |
- [Hugging Face Space](https://huggingface.co/spaces/Ansemin101/Markit_v2)
|
| 374 |
|
| 375 |
|
| 376 |
+
## 🔍 Retrieval Strategies
|
|
| 377 |
|
| 378 |
+
| Method | Best For | Accuracy |
|
| 379 |
+
|--------|----------|----------|
|
| 380 |
+
| **🎯 Similarity** | General semantic questions | Good |
|
| 381 |
+
| **🔀 MMR** | Diverse perspectives | Good |
|
| 382 |
+
| **🔍 BM25** | Exact keyword searches | Medium |
|
| 383 |
+
| **🔗 Hybrid** | Most queries (recommended) | **Excellent** |
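As a rough illustration of how the Hybrid row can combine a semantic ranking with a keyword ranking, the sketch below uses weighted reciprocal rank fusion. This is an assumption about the fusion step, not code from this repository: `weighted_rrf` and the constant `c=60` are illustrative, and the 0.7/0.3 weights in the usage note are the semantic/keyword defaults the README lists.

```python
def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each list contributes weight / (rank + c) to a document's score,
    so documents ranked highly by several retrievers rise to the top.
    """
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, `weighted_rrf([["a", "b", "c"], ["c", "a", "d"]], [0.7, 0.3])` places `a` and `c` ahead of `b` because both retrievers returned them.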
|
| 384 |
|
| 385 |
+
## 💻 Development
|
| 386 |
|
| 387 |
+
### Quick Start
|
| 388 |
+
```bash
|
| 389 |
+
# Clone repository
|
| 390 |
+
git clone https://github.com/ansemin/Markit_v2
|
| 391 |
+
|
| 392 |
+
# Set up environment variables
|
| 393 |
+
cp .env.example .env
|
| 394 |
+
# Edit .env with your API keys
|
| 395 |
+
|
| 396 |
+
# Install dependencies & run
|
| 397 |
+
pip install -r requirements.txt
|
| 398 |
+
python app.py
|
| 399 |
```
|
| 400 |
|
| 401 |
+
### Key Technologies
|
| 402 |
+
- **Parsers**: Gemini Flash, Mistral OCR, Docling, GOT-OCR, MarkItDown
|
| 403 |
+
- **RAG System**: OpenAI embeddings + Chroma vector store + Gemini 2.5 Flash
|
| 404 |
+
- **UI Framework**: Gradio with modular component architecture
|
| 405 |
+
- **GPU Support**: ZeroGPU integration for HF Spaces
|
src/parsers/gemini_flash_parser.py
CHANGED
|
@@ -194,8 +194,17 @@ class GeminiFlashParser(DocumentParser):
|
|
| 194 |
# Validate file types
|
| 195 |
for file_path in file_paths:
|
| 196 |
file_extension = file_path.suffix.lower()
|
| 197 |
-
|
| 198 |
-
|
| 199 |
|
| 200 |
def _create_batch_contents(self, file_paths: List[Path], processing_type: str, original_filenames: Optional[List[str]] = None) -> List[Any]:
|
| 201 |
"""Create contents list for batch API call."""
|
|
@@ -344,6 +353,7 @@ Return only the markdown content, no other text."""
|
|
| 344 |
".md": "text/markdown",
|
| 345 |
".html": "text/html",
|
| 346 |
".htm": "text/html",
|
|
|
|
| 347 |
".jpg": "image/jpeg",
|
| 348 |
".jpeg": "image/jpeg",
|
| 349 |
".png": "image/png",
|
|
|
|
| 194 |
# Validate file types
|
| 195 |
for file_path in file_paths:
|
| 196 |
file_extension = file_path.suffix.lower()
|
| 197 |
+
mime_type = self._get_mime_type(file_extension)
|
| 198 |
+
if mime_type == "application/octet-stream":
|
| 199 |
+
raise ValueError(f"Unsupported file type: {file_path.name}. Gemini supports: PDF, TXT, HTML, CSS, MD, CSV, XML, RTF, JS, PY, and image files.")
|
| 200 |
+
# Check if it's a supported MIME type for Gemini
|
| 201 |
+
if mime_type in ["application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
| 202 |
+
"application/msword",
|
| 203 |
+
"application/vnd.openxmlformats-officedocument.presentationml.presentation",
|
| 204 |
+
"application/vnd.ms-powerpoint",
|
| 205 |
+
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
|
| 206 |
+
"application/vnd.ms-excel"]:
|
| 207 |
+
raise ValueError(f"File type not supported by Gemini: {file_path.name}. Gemini supports: PDF, TXT, HTML, CSS, MD, CSV, XML, RTF, JS, PY, and image files.")
|
| 208 |
|
| 209 |
def _create_batch_contents(self, file_paths: List[Path], processing_type: str, original_filenames: Optional[List[str]] = None) -> List[Any]:
|
| 210 |
"""Create contents list for batch API call."""
|
|
|
|
| 353 |
".md": "text/markdown",
|
| 354 |
".html": "text/html",
|
| 355 |
".htm": "text/html",
|
| 356 |
+
".csv": "text/csv",
|
| 357 |
".jpg": "image/jpeg",
|
| 358 |
".jpeg": "image/jpeg",
|
| 359 |
".png": "image/png",
|
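The validation added in the first hunk can be sketched in isolation as follows. This is a minimal illustration, not the parser's actual code: the `MIME_TYPES` table is a hypothetical subset (the full table in `_get_mime_type` is only partly visible in the diff), and `get_mime_type` and `validate_file_types` are stand-in names for the methods on `GeminiFlashParser`.

```python
from pathlib import Path
from typing import List

# Hypothetical subset of the extension-to-MIME table; the real table in
# GeminiFlashParser._get_mime_type is only partly shown in the diff.
MIME_TYPES = {
    ".pdf": "application/pdf",
    ".md": "text/markdown",
    ".html": "text/html",
    ".htm": "text/html",
    ".csv": "text/csv",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    # Assumed entry, included only so the Office branch below is reachable.
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
}

# Office MIME types that the diff rejects explicitly.
OFFICE_MIME_TYPES = frozenset({
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/msword",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "application/vnd.ms-powerpoint",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.ms-excel",
})

SUPPORTED_HINT = (
    "Gemini supports: PDF, TXT, HTML, CSS, MD, CSV, XML, RTF, JS, PY, and image files."
)


def get_mime_type(file_extension: str) -> str:
    """Look up a MIME type, falling back to a generic binary type."""
    return MIME_TYPES.get(file_extension, "application/octet-stream")


def validate_file_types(file_paths: List[Path]) -> None:
    """Raise ValueError for the first file whose type cannot be sent to Gemini."""
    for file_path in file_paths:
        mime_type = get_mime_type(file_path.suffix.lower())
        if mime_type == "application/octet-stream":
            # Unknown extension: nothing in the table matched.
            raise ValueError(f"Unsupported file type: {file_path.name}. {SUPPORTED_HINT}")
        if mime_type in OFFICE_MIME_TYPES:
            # Known extension, but an Office format that is rejected.
            raise ValueError(f"File type not supported by Gemini: {file_path.name}. {SUPPORTED_HINT}")
```

Under these assumptions, `validate_file_types([Path("notes.md"), Path("table.csv")])` returns without error, while a `.docx` or an unknown extension raises `ValueError` with the message shown in the diff. Checking the fallback MIME type first keeps the "unknown extension" and "known but rejected" cases distinct in the error text.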
src/parsers/mistral_ocr_parser.py
CHANGED
@@ -111,8 +111,8 @@ class MistralOcrParser(DocumentParser):
         """Extract document content using basic OCR."""
         try:
             # Process according to file type
-            if file_extension in ['.pdf']:
-                # For
+            if file_extension in ['.pdf', '.docx', '.pptx']:
+                # For documents (PDF, DOCX, PPTX), we need to upload the file to the Mistral API first
                 try:
                     # Upload the file to Mistral API
                     uploaded_pdf = client.files.upload(
@@ -137,20 +137,21 @@ class MistralOcrParser(DocumentParser):
                     )
                 except Exception as e:
                     # If file upload fails, try to use a direct URL method with base64
-                    logger.warning(f"Failed to upload
-
+                    logger.warning(f"Failed to upload document, trying alternate method: {str(e)}")
+                    base64_doc = self.encode_image(file_path)
 
-                    if
+                    if base64_doc:
+                        mime_type = self._get_mime_type(file_extension)
                         ocr_response = client.ocr.process(
                             model="mistral-ocr-latest",
                             document={
                                 "type": "document_url",
-                                "document_url": f"data:
+                                "document_url": f"data:{mime_type};base64,{base64_doc}"
                             },
                             include_image_base64=True
                         )
                     else:
-                        raise DocumentProcessingError("Failed to process
+                        raise DocumentProcessingError("Failed to process document")
             else:
                 # For images (jpg, png, etc.), use image_url with base64
                 base64_image = self.encode_image(file_path)
@@ -237,9 +238,9 @@ class MistralOcrParser(DocumentParser):
     def _extract_with_document_understanding(self, client, file_path, file_extension):
         """Extract and understand document content using chat completion."""
         try:
-            # For
-            if file_extension in ['.pdf']:
-                # Upload
+            # For documents and images, we'll use Mistral's document understanding capability
+            if file_extension in ['.pdf', '.docx', '.pptx']:
+                # Upload document first
                 try:
                     # Upload the file
                     uploaded_pdf = client.files.upload(
@@ -321,9 +322,13 @@ class MistralOcrParser(DocumentParser):
             raise ConversionError(f"Document understanding failed: {str(e)}")
 
     def _get_mime_type(self, file_extension: str) -> str:
-        """Get the MIME type for a file extension."""
+        """Get the MIME type for a file extension supported by Mistral OCR."""
         mime_types = {
+            # Document formats supported by Mistral OCR
             ".pdf": "application/pdf",
+            ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
+            ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
+            # Image formats supported by Mistral OCR
             ".jpg": "image/jpeg",
             ".jpeg": "image/jpeg",
             ".png": "image/png",
@@ -331,6 +336,8 @@ class MistralOcrParser(DocumentParser):
             ".bmp": "image/bmp",
             ".tiff": "image/tiff",
             ".tif": "image/tiff",
+            ".avif": "image/avif",
+            ".webp": "image/webp",
         }
 
         return mime_types.get(file_extension, "application/octet-stream")
@@ -383,7 +390,7 @@ class MistralOcrParser(DocumentParser):
     def _create_document_part(self, file_path: Path) -> Dict[str, Any]:
         """Return a dict representing an image_url or document_url part for Mistral chat/OCR."""
         ext = file_path.suffix.lower()
-        if ext
+        if ext in ['.pdf', '.docx', '.pptx']:
             # upload and get signed url
             client = Mistral(api_key=config.api.mistral_api_key)
             uploaded = client.files.upload(
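The base64 fallback in the second hunk builds a `data:` URL from the file's MIME type and its encoded bytes. The sketch below shows that construction on its own, without calling the Mistral API. The helper names `get_mime_type`, `build_data_url`, and `build_document_payload` are hypothetical, and the MIME table is a subset of the one in the diff; only the `{"type": "document_url", "document_url": ...}` shape is taken from the changed code.

```python
import base64
from pathlib import Path

# Subset of the extension-to-MIME table shown in the diff.
MIME_TYPES = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".png": "image/png",
    ".webp": "image/webp",
}


def get_mime_type(file_extension: str) -> str:
    """Look up a MIME type, falling back to a generic binary type."""
    return MIME_TYPES.get(file_extension, "application/octet-stream")


def build_data_url(file_path: Path) -> str:
    """Encode a file as a base64 data URL with a MIME type chosen by extension."""
    encoded = base64.b64encode(file_path.read_bytes()).decode("ascii")
    mime_type = get_mime_type(file_path.suffix.lower())
    return f"data:{mime_type};base64,{encoded}"


def build_document_payload(file_path: Path) -> dict:
    """Build the document dict in the shape the diff passes to the OCR call."""
    return {"type": "document_url", "document_url": build_data_url(file_path)}
```

Deriving the MIME type from the extension, rather than hardcoding a PDF type, is what lets the same fallback path serve DOCX and PPTX files as well.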
test_gemini_wrapper.py → tests/test_gemini_wrapper.py
RENAMED
File without changes