Spaces: Ansemin101 / Markit_v2

AnseMin committed on Jun 29

Commit 4a97b0c

1 Parent(s): d85616d

Enhance README and parser functionality for improved document processing

- Updated README to provide a comprehensive overview of the system, including a detailed table of contents and enhanced descriptions of key features.
- Improved GeminiFlashParser to include detailed error messages for unsupported file types and added support for additional MIME types.
- Enhanced MistralOcrParser to support DOCX and PPTX file types, improving document processing capabilities.
- Introduced a new test suite for Gemini wrapper functionality to validate integration with MarkItDown and ensure robust performance.
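
As a rough illustration of what DOCX and PPTX support can involve on a base64 fallback path, the sketch below builds a data URL whose MIME type depends on the file extension. The helper name `build_data_url` and the `MIME_TYPES` table are hypothetical and are not taken from the parser. Whether the Mistral OCR endpoint accepts such URLs for Office formats is an assumption to verify against the Mistral documentation.

```python
import base64

# Hypothetical lookup table for illustration; the parser's real table may differ.
MIME_TYPES = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def build_data_url(data: bytes, file_extension: str) -> str:
    """Return a base64 data URL for the given bytes (hypothetical helper)."""
    # Unknown extensions fall back to the generic binary MIME type.
    mime_type = MIME_TYPES.get(file_extension.lower(), "application/octet-stream")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

Deriving the MIME type from the extension, rather than hard-coding `application/pdf`, lets one fallback path serve every document type that is uploaded the same way.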

Files changed (4)

README.md +122 -308
src/parsers/gemini_flash_parser.py +12 -2
src/parsers/mistral_ocr_parser.py +19 -12
test_gemini_wrapper.py → tests/test_gemini_wrapper.py +0 -0

README.md CHANGED

@@ -14,19 +14,66 @@ hf_oauth: true
 # Document to Markdown Converter with RAG Chat
-A Hugging Face Space that converts various document formats to Markdown and lets you chat with your documents using RAG (Retrieval-Augmented Generation)!
 ## ✨ Key Features
 ### Document Conversion
 - Convert PDFs, Office documents, images, and more to Markdown
 - **🆕 Multi-Document Processing**: Process up to 5 files simultaneously (20MB combined)
-- Multiple parser options:
-  - MarkItDown: For comprehensive document conversion
-  - Docling: For advanced PDF understanding with table structure recognition + **multi-document processing**
-  - GOT-OCR: For image-based OCR with **native LaTeX output** and Mathpix rendering
-  - Gemini Flash: For AI-powered text extraction with **advanced multi-document capabilities**
-  - Mistral OCR: High-accuracy OCR for PDFs and images with optional *Document Understanding* mode + **multi-document processing**
 - **🆕 Intelligent Processing Types**:
   - **Combined**: Merge documents into unified content with duplicate removal
   - **Individual**: Separate sections per document with clear organization
@@ -90,47 +137,22 @@ A Hugging Face Space that converts various document formats to Markdown and lets
 ## 🚀 Multi-Document Processing
-### **What makes this special?**
-Markit v2 introduces **industry-leading multi-document processing** with **three powerful parser options**: Gemini Flash (native multi-document AI), Mistral OCR (high-accuracy with Document Understanding), and Docling (advanced PDF analysis). All support intelligent cross-document analysis.
 ### **Key Capabilities:**
 - **📊 Cross-Document Analysis**: Compare and contrast information across different files
 - **🔄 Smart Duplicate Removal**: Intelligently merges overlapping content while preserving unique insights
 - **📋 Format Intelligence**: Handles mixed file types (PDF + images, Word + Excel, etc.) seamlessly
 - **🧠 Contextual Understanding**: Recognizes relationships and patterns across document boundaries
-- **⚡ Single API Call Processing**: Efficient batch processing using Gemini's native multi-document support
-### **Processing Types Explained:**
-#### 🔗 **Combined Processing**
-- **Purpose**: Create one unified, cohesive document from multiple sources
-- **Best for**: Related documents that should be read as one complete resource
-- **Intelligence**: Removes redundant information while preserving all critical content
-- **Example**: Merge project proposal + budget + timeline into one comprehensive document
-#### 📑 **Individual Processing**
-- **Purpose**: Convert each document separately but organize them in one output
-- **Best for**: Different documents you want in one place for easy reference
-- **Intelligence**: Maintains original structure while creating clear organization
-- **Example**: Meeting agenda + presentation + notes → organized sections
-#### 📈 **Summary Processing**
-- **Purpose**: Executive overview + detailed analysis
-- **Best for**: Complex document sets needing high-level insights
-- **Intelligence**: Cross-document pattern recognition and key insight extraction
-- **Example**: Research papers → executive summary + detailed analysis of each paper
-#### ⚖️ **Comparison Processing**
-- **Purpose**: Analyze differences, similarities, and relationships
-- **Best for**: Multiple proposals, document versions, or conflicting sources
-- **Intelligence**: Creates comparison tables and identifies discrepancies/alignments
-- **Example**: Contract versions → side-by-side analysis with change identification
-### **Technical Advantages:**
-- **Native Multimodal Support**: Processes text + images in same workflow
-- **Advanced Reasoning**: Understands context and relationships between documents
-- **Efficient Processing**: Single Gemini API call vs. multiple individual calls
-- **Format Agnostic**: Works across all supported file types seamlessly
 ## Environment Variables
@@ -191,82 +213,44 @@ The application uses centralized configuration management. You can enhance funct
 - `BM25_K1`: BM25 term frequency saturation parameter (default: 1.2)
 - `BM25_B`: BM25 field length normalization parameter (default: 0.75)
-## Usage
 ### Document Conversion
 #### 📄 **Single Document Processing**
-1. Go to the **"Document Converter"** tab
-2. Upload a single file
-3. Choose your preferred parser:
-   - **"MarkItDown"** for comprehensive document conversion
-   - **"Docling"** for advanced PDF understanding and table extraction
-   - **"Gemini Flash"** for AI-powered text extraction
-4. Select an OCR method based on your chosen parser
-5. Click "Convert"
-6. **For GOT-OCR**: View the LaTeX output with **Mathpix rendering** for proper mathematical and tabular display
-7. **For other parsers**: View the Markdown output
-8. Download the converted file (.tex for GOT-OCR, .md for others)
-#### 📂 **Multi-Document Processing** (NEW!)
-1. Go to the **"Document Converter"** tab
-2. Upload **2-5 files** (up to 20MB combined)
-3. **Processing type selector appears automatically**
-4. Choose your processing type:
-   - **Combined**: Merge all documents into unified content with smart duplicate removal
-   - **Individual**: Keep documents separate with clear section headers
-   - **Summary**: Executive overview + detailed analysis of each document
-   - **Comparison**: Side-by-side analysis with similarities/differences tables
-5. Choose your preferred parser:
-   - **Gemini Flash**: Best for advanced cross-document reasoning and native multi-document support
-   - **Mistral OCR**: Great for high-accuracy OCR with Document Understanding mode
-   - **Docling**: Excellent for PDF table structure + multi-document analysis
-6. Click "Convert"
-7. Get intelligent cross-document analysis and download enhanced output
-#### 💡 **Multi-Document Tips**
-- **Mixed file types work great**: Upload PDF + images, Word docs + PDFs, etc.
-- **Gemini Flash excels at**: Cross-document reasoning, duplicate detection, and format analysis
-- **Perfect for**: Comparing document versions, analyzing related reports, consolidating research
-- **Real-time validation**: UI shows file count, size limits, and processing mode
-#### 🤖 **RAG Integration**
-- **All converted documents are automatically added to the RAG system** for chat functionality
-- Multi-document processing creates richer context for chat interactions
-### 🤖 Chat with Documents
-1. Go to the **"Chat with Documents"** tab
-2. Check the system status to ensure RAG components are ready
-3. **🆕 Choose your retrieval strategy** for optimal results:
-   - **Similarity**: Best for general semantic search
-   - **MMR**: Best for diverse, non-repetitive results
-   - **Hybrid**: Best overall accuracy (recommended)
-4. Ask questions about your converted documents
-5. Enjoy real-time streaming responses with document context
-6. Use "New Session" to start fresh conversations
-7. Use "🗑️ Clear All Data" to remove all documents and chat history
-8. Monitor your usage limits in the status panel
-### 🔍 Query Ranker (NEW!)
-1. Go to the **"Query Ranker"** tab
-2. Check the system status to ensure documents are loaded
-3. **Enter your search query** in the search box
-4. **Choose your retrieval method**:
-   - **🎯 Similarity Search**: Semantic similarity with real scores
-   - **🔀 MMR (Diverse)**: Diverse results with reduced redundancy
-   - **🔍 BM25 (Keywords)**: Traditional keyword-based search
-   - **🔗 Hybrid (Recommended)**: Best overall accuracy combining semantic + keyword
-5. **Adjust result count** (1-10) using the slider
-6. **Review ranked results** with confidence levels and source information
-7. **Compare methods** by trying different retrieval strategies on the same query
-8. Use results to understand how your documents are chunked and ranked
-#### 🔍 **Retrieval Strategy Guide:**
-- **For research papers**: Use MMR to get diverse perspectives
-- **For technical docs**: Use Hybrid for comprehensive coverage
-- **For specific facts**: Use Similarity for targeted results
-- **For broad topics**: Use Hybrid for balanced semantic + keyword matching
-- **For transparency**: Use Query Ranker to see exactly which chunks are being retrieved
 ## Local Development
@@ -389,203 +373,33 @@ const html = window.render(latexContent, {htmlTags: true});
 - [Hugging Face Space](https://huggingface.co/spaces/Ansemin101/Markit_v2)
-## 🔍 Advanced RAG Retrieval Strategies
-The system supports **four different retrieval methods** for optimal document search and question answering:
-### **1. 🎯 Similarity Search (Default)**
-- **How it works**: Semantic similarity using OpenAI embeddings
-- **Best for**: General questions and semantic understanding
-- **Use case**: "What is the main topic of this document?"
-- **Configuration**: `{'k': 4, 'search_type': 'similarity'}`
-- **Chunking**: Uses content-aware chunking (Markdown or LaTeX) for optimal structure preservation
-### **2. 🔀 MMR (Maximal Marginal Relevance)**
-- **How it works**: Balances relevance with result diversity to reduce redundancy
-- **Best for**: Research questions requiring diverse perspectives
-- **Use case**: "What are different approaches to transformer architecture?"
-- **Configuration**: `{'k': 4, 'fetch_k': 10, 'lambda_mult': 0.5}`
-- **Benefits**: Prevents repetitive results, ensures comprehensive coverage
-### **3. 🔍 BM25 Keyword Search**
-- **How it works**: Traditional keyword-based search with TF-IDF scoring
-- **Best for**: Exact term matching and specific factual queries
-- **Use case**: "Find mentions of 'attention mechanism' in the documents"
-- **Configuration**: `{'k': 4}`
-- **Benefits**: Excellent for technical terms and specific concepts
-### **4. 🔗 Hybrid Search (Recommended)**
-- **How it works**: Combines semantic embeddings + keyword search using ensemble weighting
-- **Best for**: Most queries - provides best overall accuracy
-- **Use case**: Any complex question benefiting from both semantic and keyword matching
-- **Configuration**: `{'k': 4, 'semantic_weight': 0.7, 'keyword_weight': 0.3}`
-- **Benefits**: **87.5% hit rate vs 79.2% for similarity-only** (based on LangChain research)
-### **🎯 Performance Comparison:**
-| Method | Accuracy | Diversity | Speed | Best Use Case |
-|--------|----------|-----------|-------|---------------|
-| Similarity | Good | Low | Fast | General semantic questions |
-| MMR | Good | High | Medium | Research requiring diverse viewpoints |
-| BM25 | Medium | Medium | Fast | Exact term/keyword searches |
-| **Hybrid** | **Excellent** | **High** | **Medium** | **Most questions (recommended)** |
-### **💡 Usage Examples:**
-```python
-# In your application code
-from src.rag.chat_service import rag_chat_service
-# Use hybrid search (recommended)
-response = rag_chat_service.chat_with_retrieval(
-    "How does attention work in transformers?",
-    retrieval_method="hybrid",
-    retrieval_config={'k': 4, 'semantic_weight': 0.8, 'keyword_weight': 0.2}
-)
-# Use MMR for diverse research results
-response = rag_chat_service.chat_with_retrieval(
-    "What are different transformer architectures?",
-    retrieval_method="mmr",
-    retrieval_config={'k': 3, 'fetch_k': 10, 'lambda_mult': 0.6}
-)
-```
-## Development Guide
-### Project Structure
-```
-markit_v2/
-├── app.py                  # Main application entry point (HF Spaces compatible)
-├── run_app.py              # 🆕 Lightweight app launcher for local development
-├── setup.sh                # Setup script
-├── build.sh                # Build script
-├── requirements.txt        # Python dependencies
-├── README.md               # Project documentation
-├── .env                    # Environment variables (local development)
-├── .gitignore              # Git ignore file
-├── .gitattributes          # Git attributes file
-├── src/                    # Source code
-│   ├── __init__.py         # Package initialization
-│   ├── main.py             # Application launcher
-│   ├── core/               # Core functionality and utilities
-│   │   ├── __init__.py     # Package initialization
-│   │   ├── config.py       # 🆕 Centralized configuration management (with RAG settings)
-│   │   ├── exceptions.py   # 🆕 Custom exception hierarchy
-│   │   ├── logging_config.py # 🆕 Centralized logging setup
-│   │   ├── environment.py  # 🆕 Environment setup and dependency management
-│   │   ├── converter.py    # Document conversion orchestrator (refactored)
-│   │   ├── parser_factory.py # Parser factory pattern
-│   │   └── latex_to_markdown_converter.py # LaTeX conversion utility
-│   ├── services/           # Business logic layer
-│   │   ├── __init__.py     # Package initialization
-│   │   ├── document_service.py # 🆕 Document processing service
-│   │   └── data_clearing_service.py # 🆕 Data management and clearing service
-│   ├── parsers/            # Parser implementations
-│   │   ├── __init__.py     # Package initialization
-│   │   ├── parser_interface.py # Enhanced parser interface
-│   │   ├── parser_registry.py # Parser registry pattern
-│   │   ├── markitdown_parser.py # MarkItDown parser (updated)
-│   │   ├── docling_parser.py # 🆕 Docling parser with advanced PDF understanding
-│   │   ├── got_ocr_parser.py # GOT-OCR parser for images
-│   │   ├── mistral_ocr_parser.py # 🆕 Mistral OCR parser
-│   │   └── gemini_flash_parser.py # 🆕 Enhanced Gemini Flash parser with multi-document processing
-│   ├── rag/                # 🆕 RAG (Retrieval-Augmented Generation) system
-│   │   ├── __init__.py     # Package initialization
-│   │   ├── embeddings.py   # OpenAI embedding model management
-│   │   ├── chunking.py     # Markdown-aware document chunking
-│   │   ├── vector_store.py # Chroma vector database management
-│   │   ├── memory.py       # Chat history and session management
-│   │   ├── chat_service.py # RAG chat service with Gemini 2.5 Flash
-│   │   └── ingestion.py    # Document ingestion pipeline
-│   └── ui/                 # 🆕 Modular user interface layer
-│       ├── __init__.py     # Package initialization
-│       ├── ui.py           # Main UI orchestrator (~60 lines)
-│       ├── components/     # UI components
-│       │   ├── __init__.py # Package initialization
-│       │   ├── document_converter.py # Document converter tab (~200 lines)
-│       │   ├── chat_interface.py # Chat interface tab (~180 lines)
-│       │   └── query_ranker.py # Query ranker tab (~200 lines)
-│       ├── formatters/     # Content formatting utilities
-│       │   ├── __init__.py # Package initialization
-│       │   └── content_formatters.py # Markdown/LaTeX formatters (~150 lines)
-│       ├── styles/         # UI styling
-│       │   ├── __init__.py # Package initialization
-│       │   └── ui_styles.py # CSS styles and themes (~800 lines)
-│       └── utils/          # UI utility functions
-│           ├── __init__.py # Package initialization
-│           ├── file_validation.py # File validation utilities (~80 lines)
-│           └── threading_utils.py # Threading utilities (~40 lines)
-├── documents/              # Documentation and examples (gitignored)
-├── tessdata/               # Tesseract OCR data (gitignored)
-└── tests/                  # 🆕 Test suite for Phase 1 RAG implementation
-    ├── __init__.py         # Package initialization
-    ├── README.md           # Test documentation and usage guide
-    ├── test_implementation_structure.py # Structure validation (no API keys)
-    ├── test_retrieval_methods.py # Full functionality testing
-    └── test_data_usage.py  # Data usage demonstration
 ```
-### 🆕 **New Architecture Components:**
-- **Configuration Management**: Centralized API keys, model settings, and app configuration (`src/core/config.py`)
-- **Exception Hierarchy**: Proper error handling with specific exception types (`src/core/exceptions.py`)
-- **Service Layer**: Business logic separated from UI and core utilities (`src/services/document_service.py`)
-- **Data Management Service**: Comprehensive data clearing functionality (`src/services/data_clearing_service.py`)
-- **Environment Management**: Automated dependency checking and setup (`src/core/environment.py`)
-- **Enhanced Parser Interface**: Validation, metadata, and cancellation support
-- **Lightweight Launcher**: Quick development startup with `run_app.py`
-- **Centralized Logging**: Configurable logging system (`src/core/logging_config.py`)
-- **🆕 RAG System**: Complete RAG implementation with vector search and chat capabilities
-- **🆕 Query Ranker Interface**: Dedicated transparency tool for document search and ranking
-- **🆕 Modular UI Architecture**: Component-based UI with clear separation of concerns
-  - **UI Components**: Individual tab components for focused functionality
-  - **Content Formatters**: Specialized markdown and LaTeX rendering utilities
-  - **UI Styles**: Centralized CSS styling system with responsive design
-  - **UI Utils**: File validation and threading utilities for better code organization
-### 🧠 **RAG System Architecture:**
-- **Embeddings Management** (`src/rag/embeddings.py`): OpenAI text-embedding-3-small integration
-- **🆕 Smart Content-Aware Chunking** (`src/rag/chunking.py`):
-  - **Unified chunker** supporting both Markdown and LaTeX content
-  - **Markdown chunking**: Preserves tables and code blocks as whole units
-  - **LaTeX chunking**: Preserves `\begin{tabular}`, mathematical environments, and LaTeX structures
-  - **Automatic format detection**: GOT-OCR results → LaTeX chunker, others → Markdown chunker
-  - **Enhanced metadata**: Content type tracking and structure detection
-- **🆕 Advanced Vector Store** (`src/rag/vector_store.py`): Multi-strategy retrieval system with:
-  - **Similarity Search**: Traditional semantic retrieval using embeddings
-  - **MMR Support**: Maximal Marginal Relevance for diverse results
-  - **BM25 Integration**: Keyword-based search with TF-IDF scoring
-  - **Hybrid Retrieval**: Ensemble combining semantic + keyword methods
-  - **Chroma database**: Persistent storage with deduplication
-- **Chat Memory** (`src/rag/memory.py`): Session management and conversation history
-- **🆕 Enhanced Chat Service** (`src/rag/chat_service.py`): Multi-method RAG with Gemini 2.5 Flash
-- **Document Ingestion** (`src/rag/ingestion.py`): Automated pipeline with intelligent duplicate handling
-- **Usage Limiting**: Anti-abuse measures for public deployment
-- **Auto-Ingestion**: Seamless integration with document conversion workflow
-### 🗑️ **Data Management & Deduplication:**
-- **File Hash-Based Deduplication**: Uses SHA-256 hashes of original file content to prevent duplicates
-- **Chroma Where Filter Integration**: Persistent duplicate detection using vector store metadata queries
-- **Automatic Document Replacement**: When same file is uploaded again, old version is replaced with new one
-- **Cross-Environment Data Clearing**: Works seamlessly in both local development and HF Space environments
-- **Environment-Aware Path Resolution**: Automatically detects and uses correct data paths (`./data/*` vs `/tmp/data/*`)
-- **Comprehensive Status Reporting**: Real-time display of vector store documents, chat history files, and environment type
-- **Safe Clearing Operations**: Graceful error handling with detailed feedback on clearing operations
-### ZeroGPU Integration Notes
-When developing for Hugging Face Spaces with Stateless GPU:
-1. Always import the `spaces` module before any CUDA initialization
-2. Place all CUDA operations inside functions decorated with `@spaces.GPU()`
-3. Ensure only picklable objects are passed to GPU-decorated functions
-4. Use wrapper functions to filter out unpicklable objects like thread locks
-5. For advanced use cases, consider implementing fallback mechanisms for serialization errors
-6. **Add `hf_oauth: true` to your Space's README.md metadata** to mitigate GPU quota limitations
-7. Sign in with your Hugging Face account when using the app to utilize your personal GPU quota
-8. For extensive GPU usage without quota limitations, a Hugging Face Pro subscription is required
-> **Note**: If you're implementing a Space with ZeroGPU on your own, you may encounter quota limitations ("GPU task aborted" errors). These can be mitigated by:
-> - Adding `hf_oauth: true` to your Space's metadata (as shown in this Space)
-> - Having users sign in with their Hugging Face accounts
-> - Upgrading to a Hugging Face Pro subscription for dedicated GPU resources

 # Document to Markdown Converter with RAG Chat
+A powerful Hugging Face Space that converts various document formats to Markdown and enables intelligent chat with your documents using advanced RAG (Retrieval-Augmented Generation).
+<details>
+<summary><strong>Table of contents</strong></summary>
+<!-- Begin ToC -->
+- [System Overview](#-system-overview)
+- [Key Features](#-key-features)
+  - [Document Conversion](#document-conversion)
+  - [RAG Chat with Documents](#-rag-chat-with-documents)
+  - [Query Ranker (NEW!)](#-query-ranker-new)
+  - [User Interface](#user-interface)
+- [Supported Libraries](#supported-libraries)
+- [Multi-Document Processing](#-multi-document-processing)
+- [Environment Variables](#environment-variables)
+  - [API Keys](#-api-keys)
+  - [Configuration Options](#️-configuration-options)
+  - [Docling Configuration](#-docling-configuration)
+  - [Model Configuration](#-model-configuration)
+  - [RAG Configuration](#-rag-configuration)
+  - [Advanced Retrieval Configuration](#-advanced-retrieval-configuration)
+- [Usage Guide](#-usage-guide)
+  - [Parser Selection](#-parser-selection)
+  - [Document Conversion](#document-conversion-1)
+  - [RAG Chat & Query System](#-rag-chat--query-system)
+- [Local Development](#local-development)
+  - [Quick Start](#-quick-start)
+  - [Data Management](#-data-management)
+  - [Development Features](#-development-features)
+- [GOT-OCR LaTeX Processing](#-got-ocr-latex-processing)
+- [Credits](#credits)
+- [Retrieval Strategies](#-retrieval-strategies)
+- [Development](#-development)
+  - [Quick Start](#quick-start)
+  - [Key Technologies](#key-technologies)
+<!-- End ToC -->
+</details>
+## 🎯 System Overview
+<div align="center">
+<img src="img/Overall%20System%20Workflow%20(Essential).png" alt="Overall System Workflow" width="400">
+*Complete workflow from document upload to intelligent RAG chat interaction*
+</div>
 ## ✨ Key Features
 ### Document Conversion
 - Convert PDFs, Office documents, images, and more to Markdown
 - **🆕 Multi-Document Processing**: Process up to 5 files simultaneously (20MB combined)
+- **5 Powerful Parsers**:
+  - **Gemini Flash**: General Purpose + High Accuracy
+  - **Mistral OCR**: Fastest Processing
+  - **Docling**: Open Source
+  - **GOT-OCR**: Document to LaTeX + Open Source
+  - **MarkItDown**: High Accuracy CSV/XML + Open Source
 - **🆕 Intelligent Processing Types**:
   - **Combined**: Merge documents into unified content with duplicate removal
   - **Individual**: Separate sections per document with clear organization
 ## 🚀 Multi-Document Processing
+<img src="img/Multi-Document%20Processing%20Types%20(Flagship%20Feature).png" alt="Multi-Document Processing Types" width="700">
+*Industry-leading multi-document processing with 4 intelligent processing types*
 ### **Key Capabilities:**
 - **📊 Cross-Document Analysis**: Compare and contrast information across different files
 - **🔄 Smart Duplicate Removal**: Intelligently merges overlapping content while preserving unique insights
 - **📋 Format Intelligence**: Handles mixed file types (PDF + images, Word + Excel, etc.) seamlessly
 - **🧠 Contextual Understanding**: Recognizes relationships and patterns across document boundaries
+### **Processing Types:**
+- **🔗 Combined**: Merge documents into unified content with duplicate removal
+- **📑 Individual**: Separate sections per document with clear organization
+- **📈 Summary**: Executive overview + detailed analysis of all documents
+- **⚖️ Comparison**: Cross-document analysis with similarities/differences tables
 ## Environment Variables
 - `BM25_K1`: BM25 term frequency saturation parameter (default: 1.2)
 - `BM25_B`: BM25 field length normalization parameter (default: 0.75)
+## 📖 Usage Guide
+### 🎯 Parser Selection
+<img src="img/Parser%20Selection%20Guide%20(User-Friendly).png" alt="Parser Selection Guide" width="700">
+*Choose the right parser for your specific needs and document types*
 ### Document Conversion
 #### 📄 **Single Document Processing**
+1. Upload a single file
+2. Choose your preferred parser
+3. Select an OCR method based on your chosen parser
+4. Click "Convert"
+5. Download the converted file (.tex for GOT-OCR, .md for others)
+#### 📂 **Multi-Document Processing**
+1. Upload **2-5 files** (up to 20MB combined)
+2. Choose processing type: Combined, Individual, Summary, or Comparison
+3. Select your preferred parser
+4. Click "Convert" for intelligent cross-document analysis
+### 🤖 RAG Chat & Query System
+<img src="img/RAG%20Retrieval%20Strategies%20(Technical%20Highlight).png" alt="RAG Retrieval Strategies" width="700">
+*Advanced RAG system with 4 retrieval strategies for optimal document search*
+#### **Chat with Documents**
+1. Choose your retrieval strategy (Similarity, MMR, BM25, or Hybrid)
+2. Ask questions about your converted documents
+3. Get real-time streaming responses with document context
+#### **Query Ranker**
+1. Enter search queries to explore document chunks
+2. Compare different retrieval methods
+3. View confidence scores and source information
 ## Local Development
 - [Hugging Face Space](https://huggingface.co/spaces/Ansemin101/Markit_v2)
+## 🔍 Retrieval Strategies
+| Method | Best For | Accuracy |
+|--------|----------|----------|
+| **🎯 Similarity** | General semantic questions | Good |
+| **🔀 MMR** | Diverse perspectives | Good |
+| **🔍 BM25** | Exact keyword searches | Medium |
+| **🔗 Hybrid** | Most queries (recommended) | **Excellent** |
+## 💻 Development
+### Quick Start
+```bash
+# Clone repository
+git clone https://github.com/ansemin/Markit_v2
+# Set up environment variables
+cp .env.example .env
+# Edit .env with your API keys
+# Install dependencies & run
+pip install -r requirements.txt
+python app.py
 ```
+### Key Technologies
+- **Parsers**: Gemini Flash, Mistral OCR, Docling, GOT-OCR, MarkItDown
+- **RAG System**: OpenAI embeddings + Chroma vector store + Gemini 2.5 Flash
+- **UI Framework**: Gradio with modular component architecture
+- **GPU Support**: ZeroGPU integration for HF Spaces
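
The README hunks above document `BM25_K1` (term frequency saturation, default 1.2) and `BM25_B` (field length normalization, default 0.75) without showing their effect. The sketch below is the textbook Okapi BM25 term-frequency component with the IDF factor left out. It is not the repository's implementation, and the function name `bm25_term_weight` is hypothetical.

```python
def bm25_term_weight(
    tf: float,
    doc_len: float,
    avg_doc_len: float,
    k1: float = 1.2,   # mirrors the documented BM25_K1 default
    b: float = 0.75,   # mirrors the documented BM25_B default
) -> float:
    """Term-frequency part of the Okapi BM25 score (IDF omitted)."""
    # b controls how strongly longer-than-average documents are penalised.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences of a term stop adding weight;
    # the result approaches, but never reaches, k1 + 1.
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)
```

With the defaults, a term that occurs once in an average-length document gets a weight of 1.0. Further occurrences raise the weight toward the ceiling of `k1 + 1`, and a longer document lowers it.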

src/parsers/gemini_flash_parser.py CHANGED

@@ -194,8 +194,17 @@ class GeminiFlashParser(DocumentParser):
         # Validate file types
         for file_path in file_paths:
             file_extension = file_path.suffix.lower()
-            if self._get_mime_type(file_extension) == "application/octet-stream":
-                raise ValueError(f"Unsupported file type: {file_path.name}")
     def _create_batch_contents(self, file_paths: List[Path], processing_type: str, original_filenames: Optional[List[str]] = None) -> List[Any]:
         """Create contents list for batch API call."""
@@ -344,6 +353,7 @@ Return only the markdown content, no other text."""
             ".md": "text/markdown",
             ".html": "text/html",
             ".htm": "text/html",
             ".jpg": "image/jpeg",
             ".jpeg": "image/jpeg",
             ".png": "image/png",

         # Validate file types
         for file_path in file_paths:
             file_extension = file_path.suffix.lower()
+            mime_type = self._get_mime_type(file_extension)
+            if mime_type == "application/octet-stream":
+                raise ValueError(f"Unsupported file type: {file_path.name}. Gemini supports: PDF, TXT, HTML, CSS, MD, CSV, XML, RTF, JS, PY, and image files.")
+            # Check if it's a supported MIME type for Gemini
+            if mime_type in ["application/vnd.openxmlformats-officedocument.wordprocessingml.document",
+                           "application/msword",
+                           "application/vnd.openxmlformats-officedocument.presentationml.presentation",
+                           "application/vnd.ms-powerpoint",
+                           "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
+                           "application/vnd.ms-excel"]:
+                raise ValueError(f"File type not supported by Gemini: {file_path.name}. Gemini supports: PDF, TXT, HTML, CSS, MD, CSV, XML, RTF, JS, PY, and image files.")
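
The two-step check added above can be sketched as a standalone function. This is a minimal illustration, not the parser's code: `validate_file_types`, `MIME_TYPES`, `OFFICE_MIME_TYPES`, and `SUPPORTED_HINT` are hypothetical names. It also assumes the extension table maps Office extensions to their MIME types, since the second check can only fire in that case.

```python
from pathlib import Path
from typing import Iterable

# Hypothetical, abbreviated extension table (assumption: Office types are mapped).
MIME_TYPES = {
    ".pdf": "application/pdf",
    ".md": "text/markdown",
    ".csv": "text/csv",
    ".png": "image/png",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}

# MIME types that are recognised but rejected for this parser.
OFFICE_MIME_TYPES = frozenset({
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
})

SUPPORTED_HINT = (
    "Gemini supports: PDF, TXT, HTML, CSS, MD, CSV, XML, RTF, JS, PY, and image files."
)


def validate_file_types(file_paths: Iterable[Path]) -> None:
    """Raise ValueError for the first unknown or rejected file type."""
    for file_path in file_paths:
        mime_type = MIME_TYPES.get(file_path.suffix.lower(), "application/octet-stream")
        if mime_type == "application/octet-stream":
            # The extension is not in the table at all.
            raise ValueError(f"Unsupported file type: {file_path.name}. {SUPPORTED_HINT}")
        if mime_type in OFFICE_MIME_TYPES:
            # The extension is known, but the format is rejected.
            raise ValueError(
                f"File type not supported by Gemini: {file_path.name}. {SUPPORTED_HINT}"
            )
```

Keeping the rejected types in a set gives a constant-time membership test and keeps the list of exclusions in one place.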
     def _create_batch_contents(self, file_paths: List[Path], processing_type: str, original_filenames: Optional[List[str]] = None) -> List[Any]:
         """Create contents list for batch API call."""
             ".md": "text/markdown",
             ".html": "text/html",
             ".htm": "text/html",
+            ".csv": "text/csv",
             ".jpg": "image/jpeg",
             ".jpeg": "image/jpeg",
             ".png": "image/png",

src/parsers/mistral_ocr_parser.py CHANGED

@@ -111,8 +111,8 @@ class MistralOcrParser(DocumentParser):
         """Extract document content using basic OCR."""
         try:
             # Process according to file type
-            if file_extension in ['.pdf']:
-                # For PDFs, we need to upload the file to the Mistral API first
                 try:
                     # Upload the file to Mistral API
                     uploaded_pdf = client.files.upload(
@@ -137,20 +137,21 @@ class MistralOcrParser(DocumentParser):
                     )
                 except Exception as e:
                     # If file upload fails, try to use a direct URL method with base64
-                    logger.warning(f"Failed to upload PDF, trying alternate method: {str(e)}")
-                    base64_pdf = self.encode_image(file_path)
-                    if base64_pdf:
                         ocr_response = client.ocr.process(
                             model="mistral-ocr-latest",
                             document={
                                 "type": "document_url",
-                                "document_url": f"data:application/pdf;base64,{base64_pdf}"
                             },
                             include_image_base64=True
                         )
                     else:
-                        raise DocumentProcessingError("Failed to process PDF document")
             else:
                 # For images (jpg, png, etc.), use image_url with base64
                 base64_image = self.encode_image(file_path)
@@ -237,9 +238,9 @@ class MistralOcrParser(DocumentParser):
     def _extract_with_document_understanding(self, client, file_path, file_extension):
         """Extract and understand document content using chat completion."""
         try:
-            # For PDFs and images, we'll use Mistral's document understanding capability
-            if file_extension in ['.pdf']:
-                # Upload PDF first
                 try:
                     # Upload the file
                     uploaded_pdf = client.files.upload(
@@ -321,9 +322,13 @@ class MistralOcrParser(DocumentParser):
             raise ConversionError(f"Document understanding failed: {str(e)}")
     def _get_mime_type(self, file_extension: str) -> str:
-        """Get the MIME type for a file extension."""
         mime_types = {
             ".pdf": "application/pdf",
             ".jpg": "image/jpeg",
             ".jpeg": "image/jpeg",
             ".png": "image/png",
@@ -331,6 +336,8 @@ class MistralOcrParser(DocumentParser):
             ".bmp": "image/bmp",
             ".tiff": "image/tiff",
             ".tif": "image/tiff",
         }
         return mime_types.get(file_extension, "application/octet-stream")
@@ -383,7 +390,7 @@ class MistralOcrParser(DocumentParser):
     def _create_document_part(self, file_path: Path) -> Dict[str, Any]:
         """Return a dict representing an image_url or document_url part for Mistral chat/OCR."""
         ext = file_path.suffix.lower()
-        if ext == '.pdf':
             # upload and get signed url
             client = Mistral(api_key=config.api.mistral_api_key)
             uploaded = client.files.upload(

         """Extract document content using basic OCR."""
         try:
             # Process according to file type
+            if file_extension in ['.pdf', '.docx', '.pptx']:
+                # For documents (PDF, DOCX, PPTX), we need to upload the file to the Mistral API first
                 try:
                     # Upload the file to Mistral API
                     uploaded_pdf = client.files.upload(
                     )
                 except Exception as e:
                     # If file upload fails, try to use a direct URL method with base64
+                    logger.warning(f"Failed to upload document, trying alternate method: {str(e)}")
+                    base64_doc = self.encode_image(file_path)
+                    if base64_doc:
+                        mime_type = self._get_mime_type(file_extension)
                         ocr_response = client.ocr.process(
                             model="mistral-ocr-latest",
                             document={
                                 "type": "document_url",
+                                "document_url": f"data:{mime_type};base64,{base64_doc}"
                             },
                             include_image_base64=True
                         )
                     else:
+                        raise DocumentProcessingError("Failed to process document")
             else:
                 # For images (jpg, png, etc.), use image_url with base64
                 base64_image = self.encode_image(file_path)
     def _extract_with_document_understanding(self, client, file_path, file_extension):
         """Extract and understand document content using chat completion."""
         try:
+            # For documents and images, we'll use Mistral's document understanding capability
+            if file_extension in ['.pdf', '.docx', '.pptx']:
+                # Upload document first
                 try:
                     # Upload the file
                     uploaded_pdf = client.files.upload(
             raise ConversionError(f"Document understanding failed: {str(e)}")
     def _get_mime_type(self, file_extension: str) -> str:
+        """Get the MIME type for a file extension supported by Mistral OCR."""
         mime_types = {
+            # Document formats supported by Mistral OCR
             ".pdf": "application/pdf",
+            ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
+            ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
+            # Image formats supported by Mistral OCR
             ".jpg": "image/jpeg",
             ".jpeg": "image/jpeg",
             ".png": "image/png",
             ".bmp": "image/bmp",
             ".tiff": "image/tiff",
             ".tif": "image/tiff",
+            ".avif": "image/avif",
+            ".webp": "image/webp",
         }
         return mime_types.get(file_extension, "application/octet-stream")
     def _create_document_part(self, file_path: Path) -> Dict[str, Any]:
         """Return a dict representing an image_url or document_url part for Mistral chat/OCR."""
         ext = file_path.suffix.lower()
+        if ext in ['.pdf', '.docx', '.pptx']:
             # upload and get signed url
             client = Mistral(api_key=config.api.mistral_api_key)
             uploaded = client.files.upload(

test_gemini_wrapper.py → tests/test_gemini_wrapper.py RENAMED

File without changes
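
The base64 fallback in the `mistral_ocr_parser.py` change above now derives the data-URL prefix from the file extension instead of hard-coding `application/pdf`. The sketch below is a standalone illustration of that idea, not the parser's actual code. The names `MIME_TYPES`, `DOCUMENT_EXTENSIONS`, `build_data_url`, and `build_document_part` are hypothetical. The shape of the `image_url` part is an assumption, since the diff only shows the `document_url` form.

```python
import base64
from pathlib import Path

# Illustrative copy of part of the extension table from the diff (not the parser's own attribute).
MIME_TYPES = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".webp": "image/webp",
}

# Extensions the diff routes through the document path rather than the image path.
DOCUMENT_EXTENSIONS = {".pdf", ".docx", ".pptx"}


def build_data_url(file_path: Path) -> str:
    """Encode a file as a data URL whose MIME type matches its extension."""
    ext = file_path.suffix.lower()
    # Unknown extensions fall back to a generic binary type, as in _get_mime_type.
    mime_type = MIME_TYPES.get(ext, "application/octet-stream")
    encoded = base64.b64encode(file_path.read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_document_part(file_path: Path) -> dict:
    """Hypothetical helper: pick a document_url or image_url part by extension."""
    url = build_data_url(file_path)
    if file_path.suffix.lower() in DOCUMENT_EXTENSIONS:
        return {"type": "document_url", "document_url": url}
    # Assumed shape for image inputs; the diff does not show this dict.
    return {"type": "image_url", "image_url": url}
```

Lowercasing the suffix before the lookup means `report.DOCX` and `report.docx` resolve to the same MIME type, which matches the `file_path.suffix.lower()` call in `_create_document_part`.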