Refactor and enhance application structure for Markit_v2
- Introduced centralized configuration management for API keys and application settings.
- Implemented a new environment manager for dependency checks and setup.
- Refactored document processing into a dedicated service layer for improved organization.
- Enhanced error handling with custom exceptions for better clarity.
- Updated README with new configuration options and usage instructions.
- Added lightweight launcher for local development.
- Improved logging setup for better debugging and information tracking.
- Updated .gitignore to include new files and directories.
This commit lays the groundwork for a more modular and maintainable codebase, facilitating future feature additions and improvements.
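The sketch below shows how the refactored layers described above are intended to fit together. It is a minimal illustration that follows the call signatures introduced in the diffs in this commit; the input file name and the exact parser label are hypothetical.

```python
from src.core.config import config            # centralized settings and API keys
from src.core.converter import convert_file   # thin wrapper over the document service

# Report which parsers can run with the current environment/API keys.
print("Available parsers:", config.get_available_parsers())

# Convert a document; failures come back as user-friendly strings, not tracebacks.
content, download_path = convert_file(
    file_path="sample.pdf",        # hypothetical input file
    parser_name="MarkItDown",      # parser label as used by the UI
    ocr_method_name="None",
    output_format="Markdown",
)
print(content[:200])
```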
- .gitignore +13 -1
- =1.1.0 +0 -0
- README.md +111 -32
- app.py +35 -102
- run_app.py +25 -0
- src/core/config.py +123 -0
- src/core/converter.py +45 -200
- src/core/environment.py +246 -0
- src/core/exceptions.py +83 -0
- src/core/logging_config.py +83 -0
- src/main.py +2 -2
- src/parsers/got_ocr_parser.py +1 -0
- src/parsers/markitdown_parser.py +23 -11
- src/parsers/parser_interface.py +61 -2
- src/services/document_service.py +243 -0
- src/ui/ui.py +39 -11
.gitignore
CHANGED
@@ -84,4 +84,16 @@ test_gemini_parser.py
 
 # Ignore tessdata folder
 /tessdata/
-/tessdata/*
+/tessdata/*
+
+# Ignore .venv folder
+.venv/
+
+# Ignore Claude.md
+Claude.md
+
+# Ignore backup
+app_backup.py
+
+#Ignore .claude
+.claude/
=1.1.0
ADDED
File without changes
README.md
CHANGED
@@ -1,5 +1,5 @@
 ---
-title:
+title: Markit_v2
 emoji: 📄
 colorFrom: blue
 colorTo: indigo
@@ -45,10 +45,26 @@ This app integrates [Microsoft's MarkItDown](https://github.com/microsoft/markitdown)
 
 ## Environment Variables
 
-You can enhance the functionality by setting these environment variables:
+The application uses centralized configuration management. You can enhance functionality by setting these environment variables:
 
-
+### 🔑 **API Keys:**
 - `GOOGLE_API_KEY`: Used for Gemini Flash parser and LaTeX to Markdown conversion
+- `OPENAI_API_KEY`: Enables AI-based image descriptions in MarkItDown
+- `MISTRAL_API_KEY`: For Mistral OCR parser (if available)
+
+### ⚙️ **Configuration Options:**
+- `DEBUG`: Set to `true` for debug mode with verbose logging
+- `MAX_FILE_SIZE`: Maximum file size in bytes (default: 10MB)
+- `TEMP_DIR`: Directory for temporary files (default: ./temp)
+- `TESSERACT_PATH`: Custom path to Tesseract executable
+- `TESSDATA_PATH`: Path to Tesseract language data
+
+### 🤖 **Model Configuration:**
+- `GEMINI_MODEL`: Gemini model to use (default: gemini-1.5-flash)
+- `MISTRAL_MODEL`: Mistral model to use (default: pixtral-12b-2409)
+- `GOT_OCR_MODEL`: GOT-OCR model to use (default: stepfun-ai/GOT-OCR2_0)
+- `MODEL_TEMPERATURE`: Model temperature for AI responses (default: 0.1)
+- `MODEL_MAX_TOKENS`: Maximum tokens for AI responses (default: 4096)
 
 ## Usage
 
@@ -60,17 +76,34 @@ You can enhance the functionality by setting these environment variables:
 
 ## Local Development
 
+### 🚀 **Quick Start:**
 1. Clone the repository
-2. Create a `.env` file
-3. Install dependencies:
+2. Create a `.env` file with your API keys:
 ```
+GOOGLE_API_KEY=your_gemini_api_key_here
+OPENAI_API_KEY=your_openai_api_key_here
+MISTRAL_API_KEY=your_mistral_api_key_here
+DEBUG=true
+```
+3. Install dependencies:
+```bash
 pip install -r requirements.txt
 ```
 4. Run the application:
-```
+```bash
+# For full environment setup (HF Spaces compatible)
 python app.py
+
+# For local development (faster startup)
+python run_app.py
 ```
 
+### 🧪 **Development Features:**
+- **Automatic Environment Setup**: Dependencies are checked and installed automatically
+- **Configuration Validation**: Startup validation reports missing API keys and configuration issues
+- **Enhanced Error Messages**: Detailed error reporting for debugging
+- **Centralized Logging**: Configurable logging levels and output formats
+
 ## Credits
 
 - [MarkItDown](https://github.com/microsoft/markitdown) by Microsoft
@@ -94,21 +127,33 @@ Markit is a powerful tool that converts various document formats (PDF, DOCX, ima
 - **Multiple Document Formats**: Convert PDFs, Word documents, images, and other document formats
 - **Versatile Output Formats**: Export to Markdown, JSON, plain text, or document tags format
 - **Advanced Parsing Engines**:
-  - **
-  - **Docling**: Advanced document structure analysis
+  - **MarkItDown**: Comprehensive document conversion (PDFs, Office docs, images, audio, etc.)
   - **Gemini Flash**: AI-powered conversion using Google's Gemini API
   - **GOT-OCR**: State-of-the-art OCR model for images (JPG/PNG only) with plain text and formatted text options
+  - **Mistral OCR**: Advanced OCR using Mistral's Pixtral model for image-to-text conversion
 - **OCR Integration**: Extract text from images and scanned documents using Tesseract OCR
 - **Interactive UI**: User-friendly Gradio interface with page navigation for large documents
 - **AI-Powered Chat**: Interact with your documents using AI to ask questions about content
 - **ZeroGPU Support**: Optimized for Hugging Face Spaces with Stateless GPU environments
 
 ## System Architecture
-
-
-
-
-- **
+
+The application is built with a clean, layered architecture following modern software engineering principles:
+
+### 🏗️ **Core Architecture Components:**
+- **Entry Point** (`app.py`): HF Spaces-compatible application launcher with environment setup
+- **Configuration Layer** (`src/core/config.py`): Centralized configuration management with validation
+- **Service Layer** (`src/services/`): Business logic for document processing and external services
+- **Core Engine** (`src/core/`): Document conversion workflows and utilities
+- **Parser Registry** (`src/parsers/`): Extensible parser system with standardized interfaces
+- **UI Layer** (`src/ui/`): Gradio-based web interface with enhanced error handling
+
+### 🎯 **Key Architectural Features:**
+- **Separation of Concerns**: Clean boundaries between UI, business logic, and core utilities
+- **Centralized Configuration**: All settings, API keys, and validation in one place
+- **Custom Exception Hierarchy**: Proper error handling with user-friendly messages
+- **Plugin Architecture**: Easy addition of new document parsers
+- **HF Spaces Optimized**: Maintains compatibility with Hugging Face deployment requirements
 
 ## Installation
 
@@ -187,10 +232,10 @@ build:
 ### Document Conversion
 1. Upload your document using the file uploader
 2. Select a parser provider:
-   - **
-   - **Docling**: Best for complex document layouts
+   - **MarkItDown**: Best for comprehensive document conversion (supports PDFs, Office docs, images, audio, etc.)
   - **Gemini Flash**: Best for AI-powered conversions (requires API key)
   - **GOT-OCR**: Best for high-quality OCR on images (JPG/PNG only)
+   - **Mistral OCR**: Advanced OCR using Mistral's Pixtral model (requires API key)
 3. Choose an OCR option based on your selected parser:
   - **None**: No OCR processing (for documents with selectable text)
  - **Tesseract**: Basic OCR using Tesseract
@@ -206,6 +251,21 @@ build:
 6. Navigate through pages using the navigation buttons for multi-page documents
 7. Download the converted content in your selected format
 
+## Configuration & Error Handling
+
+### 🔧 **Automatic Configuration:**
+The application includes intelligent configuration management that:
+- Validates API keys and reports availability at startup
+- Checks for required dependencies and installs them automatically
+- Provides helpful warnings for missing optional components
+- Reports which parsers are available based on current configuration
+
+### 🛡️ **Enhanced Error Handling:**
+- **User-Friendly Messages**: Clear error descriptions instead of technical stack traces
+- **File Validation**: Automatic checking of file size and format compatibility
+- **Parser Availability**: Real-time detection of which parsers can be used
+- **Graceful Degradation**: Application continues working even if some parsers are unavailable
+
 ## Troubleshooting
 
 ### OCR Issues
@@ -239,38 +299,57 @@ build:
 ### Project Structure
 
 ```
-
-├── app.py                     # Main application entry point
+markit_v2/
+├── app.py                     # Main application entry point (HF Spaces compatible)
+├── run_app.py                 # 🆕 Lightweight app launcher for local development
 ├── setup.sh                   # Setup script
 ├── build.sh                   # Build script
 ├── requirements.txt           # Python dependencies
 ├── README.md                  # Project documentation
-├── .env                       # Environment variables
+├── .env                       # Environment variables (local development)
 ├── .gitignore                 # Git ignore file
 ├── .gitattributes             # Git attributes file
 ├── src/                       # Source code
│   ├── __init__.py            # Package initialization
-│   ├── main.py                #
-│   ├── core/                  # Core functionality
+│   ├── main.py                # Application launcher
+│   ├── core/                  # Core functionality and utilities
+│   │   ├── __init__.py        # Package initialization
+│   │   ├── config.py          # 🆕 Centralized configuration management
+│   │   ├── exceptions.py      # 🆕 Custom exception hierarchy
+│   │   ├── logging_config.py  # 🆕 Centralized logging setup
+│   │   ├── environment.py     # 🆕 Environment setup and dependency management
+│   │   ├── converter.py       # Document conversion orchestrator (refactored)
+│   │   ├── parser_factory.py  # Parser factory pattern
+│   │   └── latex_to_markdown_converter.py  # LaTeX conversion utility
+│   ├── services/              # Business logic layer
│   │   ├── __init__.py        # Package initialization
-│   │
-│   │   └── parser_factory.py  # Parser factory
+│   │   └── document_service.py  # 🆕 Document processing service
│   ├── parsers/               # Parser implementations
│   │   ├── __init__.py        # Package initialization
-│   │   ├── parser_interface.py  #
-│   │   ├── parser_registry.py   # Parser registry
-│   │   ├──
+│   │   ├── parser_interface.py  # Enhanced parser interface
+│   │   ├── parser_registry.py   # Parser registry pattern
+│   │   ├── markitdown_parser.py # MarkItDown parser (updated)
│   │   ├── got_ocr_parser.py    # GOT-OCR parser for images
-│   │
-│
-│
-│
-│
-
-
+│   │   ├── mistral_ocr_parser.py  # 🆕 Mistral OCR parser
+│   │   └── gemini_flash_parser.py # Gemini Flash parser
+│   └── ui/                    # User interface layer
+│       ├── __init__.py        # Package initialization
+│       └── ui.py              # Gradio UI with enhanced error handling
+├── documents/                 # Documentation and examples (gitignored)
+├── tessdata/                  # Tesseract OCR data (gitignored)
+└── tests/                     # Tests (future)
 └── __init__.py                # Package initialization
 ```
 
+### 🆕 **New Architecture Components:**
+- **Configuration Management**: Centralized API keys, model settings, and app configuration (`src/core/config.py`)
+- **Exception Hierarchy**: Proper error handling with specific exception types (`src/core/exceptions.py`)
+- **Service Layer**: Business logic separated from UI and core utilities (`src/services/document_service.py`)
+- **Environment Management**: Automated dependency checking and setup (`src/core/environment.py`)
+- **Enhanced Parser Interface**: Validation, metadata, and cancellation support
+- **Lightweight Launcher**: Quick development startup with `run_app.py`
+- **Centralized Logging**: Configurable logging system (`src/core/logging_config.py`)
+
 ### ZeroGPU Integration Notes
 
 When developing for Hugging Face Spaces with Stateless GPU:
app.py
CHANGED
@@ -1,126 +1,59 @@
 import spaces  # Must be imported before any CUDA initialization
 import sys
 import os
-import subprocess
-import shutil
 from pathlib import Path
-import logging
 
-#
-logging.getLogger("httpx").setLevel(logging.WARNING)  # Raise level to WARNING to suppress INFO logs
-logging.getLogger("urllib3").setLevel(logging.WARNING)  # Also suppress urllib3 logs which might be used
-logging.getLogger("httpcore").setLevel(logging.WARNING)  # httpcore is used by httpx
-
-# Get the current directory
+# Get the current directory and setup Python path
 current_dir = os.path.dirname(os.path.abspath(__file__))
+sys.path.append(current_dir)
 
-#
-try:
-
-
-
-
-
-
-
-
-
-
-    print("
-
-
-
-
-
-
-
-
-
-    print(
-
-
-    print("WARNING: CUDA not available. GOT-OCR performs best with GPU acceleration.")
-except ImportError:
-    print("WARNING: PyTorch not installed. Installing PyTorch...")
-    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "torch", "torchvision"], check=False)
-
-# Check if transformers is installed (needed for GOT-OCR)
-try:
-    import transformers
-    print(f"Transformers version: {transformers.__version__}")
-except ImportError:
-    print("WARNING: Transformers not installed. Installing transformers from GitHub...")
-    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "git+https://github.com/huggingface/transformers.git@main", "accelerate", "verovio"], check=False)
-
-# Check if numpy is installed with the correct version
-try:
-    import numpy as np
-    print(f"NumPy version: {np.__version__}")
-    if np.__version__ != "1.26.3":
-        print("WARNING: NumPy version mismatch. Installing exact version 1.26.3...")
-        subprocess.run([sys.executable, "-m", "pip", "install", "-q", "numpy==1.26.3"], check=False)
-except ImportError:
-    print("WARNING: NumPy not installed. Installing NumPy 1.26.3...")
-    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "numpy==1.26.3"], check=False)
-
-# Check if markitdown is installed
-try:
-    from markitdown import MarkItDown
-    print("MarkItDown is installed")
-except ImportError:
-    print("WARNING: MarkItDown not installed. Installing...")
-    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "markitdown[all]"], check=False)
+# Import environment manager after setting up path
+try:
+    from src.core.environment import environment_manager
+
+    # Perform complete environment setup
+    print("Setting up environment...")
+    setup_results = environment_manager.full_environment_setup()
+
+    # Report setup status
+    print(f"Environment setup completed with results: {len([k for k, v in setup_results.items() if v])} successful, {len([k for k, v in setup_results.items() if not v])} failed")
+
+except ImportError as e:
+    print(f"Warning: Could not import environment manager: {e}")
+    print("Falling back to basic setup...")
+
+    # Fallback to basic setup if environment manager fails
+    import subprocess
+
+    # Basic dependency checks
+    try:
+        import torch
+        print(f"PyTorch version: {torch.__version__}")
+    except ImportError:
+        print("Installing PyTorch...")
+        subprocess.run([sys.executable, "-m", "pip", "install", "-q", "torch", "torchvision"], check=False)
+
 try:
     from markitdown import MarkItDown
-    print("MarkItDown
+    print("MarkItDown is available")
 except ImportError:
-    print("
-
-# Try to load environment variables from .env file
-try:
-    from dotenv import load_dotenv
-    load_dotenv()
-    print("Loaded environment variables from .env file")
-except ImportError:
-    print("python-dotenv not installed, skipping .env file loading")
-
-# Load API keys from environment variables
-gemini_api_key = os.getenv("GOOGLE_API_KEY")
-openai_api_key = os.getenv("OPENAI_API_KEY")
-
-# Check if API keys are available and print messages
-if not gemini_api_key:
-    print("Warning: GOOGLE_API_KEY environment variable not found. Gemini Flash parser and LaTeX to Markdown conversion may not work.")
-else:
-    print(f"Found Gemini API key: {gemini_api_key[:5]}...{gemini_api_key[-5:] if len(gemini_api_key) > 10 else ''}")
-    print("Gemini API will be used for LaTeX to Markdown conversion when using GOT-OCR with Formatted Text mode")
-
-if not openai_api_key:
-    print("Warning: OPENAI_API_KEY environment variable not found. LLM-based image description in MarkItDown may not work.")
-else:
-    print(f"Found OpenAI API key: {openai_api_key[:5]}...{openai_api_key[-5:] if len(openai_api_key) > 10 else ''}")
-    print("OpenAI API will be available for LLM-based image descriptions in MarkItDown")
-
-# Add the current directory to the Python path
-sys.path.append(current_dir)
+    print("Installing MarkItDown...")
+    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "markitdown[all]"], check=False)
 
-#
+# Import main function with fallback strategies (HF Spaces compatibility)
 try:
-    # First attempt - standard import
     from src.main import main
 except ModuleNotFoundError:
     try:
-        #
+        # Fallback: adjust path and try again
         sys.path.append(os.path.join(current_dir, "src"))
         from src.main import main
     except ModuleNotFoundError:
-        #
+        # Last resort: create __init__.py if missing
         init_path = os.path.join(current_dir, "src", "__init__.py")
         if not os.path.exists(init_path):
             with open(init_path, "w") as f:
-                pass
-        # Try import again
+                pass
         from src.main import main
 
 if __name__ == "__main__":
run_app.py
ADDED
@@ -0,0 +1,25 @@
+#!/usr/bin/env python3
+"""
+Simple app launcher that skips the heavy environment setup.
+Use this for local development when dependencies are already installed.
+"""
+import sys
+import os
+
+# Get the current directory and setup Python path
+current_dir = os.path.dirname(os.path.abspath(__file__))
+sys.path.append(current_dir)
+
+# Load environment variables from .env file
+try:
+    from dotenv import load_dotenv
+    load_dotenv()
+    print("Loaded environment variables from .env file")
+except ImportError:
+    print("python-dotenv not installed, skipping .env file loading")
+
+# Import and run main directly
+from src.main import main
+
+if __name__ == "__main__":
+    main()
src/core/config.py
ADDED
@@ -0,0 +1,123 @@
+"""
+Centralized configuration management for Markit application.
+"""
+import os
+from typing import Optional, Dict, Any
+from dataclasses import dataclass
+
+
+@dataclass
+class APIConfig:
+    """Configuration for external API services."""
+    google_api_key: Optional[str] = None
+    openai_api_key: Optional[str] = None
+    mistral_api_key: Optional[str] = None
+
+    def __post_init__(self):
+        """Load API keys from environment variables."""
+        self.google_api_key = os.getenv("GOOGLE_API_KEY")
+        self.openai_api_key = os.getenv("OPENAI_API_KEY")
+        self.mistral_api_key = os.getenv("MISTRAL_API_KEY")
+
+
+@dataclass
+class OCRConfig:
+    """Configuration for OCR-related settings."""
+    tesseract_path: Optional[str] = None
+    tessdata_path: Optional[str] = None
+    default_language: str = "eng"
+
+    def __post_init__(self):
+        """Load OCR configuration from environment variables."""
+        self.tesseract_path = os.getenv("TESSERACT_PATH")
+        self.tessdata_path = os.getenv("TESSDATA_PATH", "./tessdata")
+
+
+@dataclass
+class ModelConfig:
+    """Configuration for AI model settings."""
+    gemini_model: str = "gemini-2.5-flash"
+    mistral_model: str = "pixtral-12b-2409"
+    got_ocr_model: str = "stepfun-ai/GOT-OCR2_0"
+    temperature: float = 0.1
+    max_tokens: int = 4096
+
+    def __post_init__(self):
+        """Load model configuration from environment variables."""
+        self.gemini_model = os.getenv("GEMINI_MODEL", self.gemini_model)
+        self.mistral_model = os.getenv("MISTRAL_MODEL", self.mistral_model)
+        self.got_ocr_model = os.getenv("GOT_OCR_MODEL", self.got_ocr_model)
+        self.temperature = float(os.getenv("MODEL_TEMPERATURE", self.temperature))
+        self.max_tokens = int(os.getenv("MODEL_MAX_TOKENS", self.max_tokens))
+
+
+@dataclass
+class AppConfig:
+    """Main application configuration."""
+    debug: bool = False
+    max_file_size: int = 10 * 1024 * 1024  # 10MB
+    allowed_extensions: tuple = (".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp", ".tex", ".xlsx")
+    temp_dir: str = "./temp"
+
+    def __post_init__(self):
+        """Load application configuration from environment variables."""
+        self.debug = os.getenv("DEBUG", "false").lower() == "true"
+        self.max_file_size = int(os.getenv("MAX_FILE_SIZE", self.max_file_size))
+        self.temp_dir = os.getenv("TEMP_DIR", self.temp_dir)
+
+
+class Config:
+    """Main configuration container."""
+
+    def __init__(self):
+        self.api = APIConfig()
+        self.ocr = OCRConfig()
+        self.model = ModelConfig()
+        self.app = AppConfig()
+
+    def validate(self) -> Dict[str, Any]:
+        """Validate configuration and return validation results."""
+        validation_results = {
+            "valid": True,
+            "warnings": [],
+            "errors": []
+        }
+
+        # Check API keys
+        if not self.api.google_api_key:
+            validation_results["warnings"].append("Google API key not found - Gemini parser will be unavailable")
+
+        if not self.api.mistral_api_key:
+            validation_results["warnings"].append("Mistral API key not found - Mistral parser will be unavailable")
+
+        # Check tesseract setup
+        if not self.ocr.tesseract_path and not os.path.exists("/usr/bin/tesseract"):
+            validation_results["warnings"].append("Tesseract not found in system PATH - OCR functionality may be limited")
+
+        # Check temp directory
+        try:
+            os.makedirs(self.app.temp_dir, exist_ok=True)
+        except Exception as e:
+            validation_results["errors"].append(f"Cannot create temp directory {self.app.temp_dir}: {e}")
+            validation_results["valid"] = False
+
+        return validation_results
+
+    def get_available_parsers(self) -> list:
+        """Get list of available parsers based on current configuration."""
+        available = ["markitdown"]  # Always available
+
+        if self.api.google_api_key:
+            available.append("gemini_flash")
+
+        if self.api.mistral_api_key:
+            available.append("mistral_ocr")
+
+        # GOT-OCR is available if we have GPU or can use ZeroGPU
+        available.append("got_ocr")
+
+        return available
+
+
+# Global configuration instance
+config = Config()
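As a usage illustration (a minimal sketch assuming the module path shown in the diff above), the global `config` instance can be inspected at startup:

```python
from src.core.config import config

# Validate the environment; problems surface as warnings/errors, not crashes.
result = config.validate()
for warning in result["warnings"]:
    print(f"Configuration warning: {warning}")
for error in result["errors"]:
    print(f"Configuration error: {error}")

# Settings fall back to defaults unless overridden via environment variables.
print(config.model.gemini_model)       # GEMINI_MODEL or the built-in default
print(config.app.max_file_size)        # MAX_FILE_SIZE in bytes (default 10MB)
print(config.get_available_parsers())  # parsers usable with the current API keys
```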
src/core/converter.py
CHANGED
@@ -1,55 +1,30 @@
-import tempfile
 import logging
-import
-import
-from pathlib import Path
+import threading
+from typing import Optional, Tuple
 
-
-from src.core.
+from src.core.config import config
+from src.core.exceptions import (
+    DocumentProcessingError,
+    ConversionError,
+    ConfigurationError
+)
+from src.services.document_service import DocumentService
 
 # Import all parsers to ensure they're registered
 from src import parsers
 
-#
-
-    from src.core.latex_to_markdown_converter import convert_latex_to_markdown
-    HAS_GEMINI_CONVERTER = True
-except ImportError:
-    HAS_GEMINI_CONVERTER = False
-    logging.warning("LaTeX to Markdown converter not available. Raw LaTeX will be returned for formatted text.")
+# Global document service instance
+_document_service = DocumentService()
 
-
-# This will be set by the UI when the cancel button is clicked
-conversion_cancelled = None  # Will be a threading.Event object
-# Flag to track if conversion is currently in progress
-_conversion_in_progress = False
-
-def set_cancellation_flag(flag):
+def set_cancellation_flag(flag: threading.Event) -> None:
     """Set the reference to the cancellation flag from ui.py"""
-
-    conversion_cancelled = flag
+    _document_service.set_cancellation_flag(flag)
 
-def is_conversion_in_progress():
+def is_conversion_in_progress() -> bool:
     """Check if conversion is currently in progress"""
-
-    return _conversion_in_progress
-
-def check_cancellation():
-    """Check if cancellation has been requested"""
-    if conversion_cancelled and conversion_cancelled.is_set():
-        logging.info("Cancellation detected in check_cancellation")
-        return True
-    return False
-
-def safe_delete_file(file_path):
-    """Safely delete a file with error handling"""
-    if file_path and os.path.exists(file_path):
-        try:
-            os.unlink(file_path)
-        except Exception as e:
-            logging.error(f"Error cleaning up temp file {file_path}: {e}")
+    return _document_service.is_conversion_in_progress()
 
-def convert_file(file_path, parser_name, ocr_method_name, output_format):
+def convert_file(file_path: str, parser_name: str, ocr_method_name: str, output_format: str) -> Tuple[str, Optional[str]]:
     """
     Convert a file using the specified parser and OCR method.
 
@@ -62,165 +37,35 @@ def convert_file(file_path, parser_name, ocr_method_name, output_format):
     Returns:
         tuple: (content, download_file_path)
     """
-
-
-    # Set the conversion in progress flag
-    _conversion_in_progress = True
-
-    # Temporary file paths to clean up
-    temp_input = None
-    tmp_path = None
+    if not file_path:
+        return "Please upload a file.", None
 
-    # Ensure we clean up the flag when we're done
     try:
-
-
-
-
-
-
-
-
-
-
-
-
-
-        # Copy the content of original file to temp file
-        with open(file_path, 'rb') as original:
-            # Read in smaller chunks and check for cancellation between chunks
-            chunk_size = 1024 * 1024  # 1MB chunks
-            while True:
-                # Check for cancellation frequently
-                if check_cancellation():
-                    logging.info("Cancellation detected during file copy")
-                    safe_delete_file(temp_input)
-                    return "Conversion cancelled.", None
-
-                chunk = original.read(chunk_size)
-                if not chunk:
-                    break
-                temp_file.write(chunk)
-        file_path = temp_input
-    except Exception as e:
-        safe_delete_file(temp_input)
-        return f"Error creating temporary file: {e}", None
-
-    # Check for cancellation again
-    if check_cancellation():
-        logging.info("Cancellation detected after file preparation")
-        safe_delete_file(temp_input)
-        return "Conversion cancelled.", None
-
-    content = None
-    try:
-        # Use the parser factory to parse the document
-        start = time.time()
-
-        # Pass the cancellation flag to the parser factory
-        content = ParserFactory.parse_document(
-            file_path=file_path,
-            parser_name=parser_name,
-            ocr_method_name=ocr_method_name,
-            output_format=output_format.lower(),
-            cancellation_flag=conversion_cancelled  # Pass the flag to parsers
-        )
-
-        # If content indicates cancellation, return early
-        if content == "Conversion cancelled.":
-            logging.info("Parser reported cancellation")
-            safe_delete_file(temp_input)
-            return content, None
-
-        duration = time.time() - start
-        logging.info(f"Processed in {duration:.2f} seconds.")
-
-        # Check for cancellation after processing
-        if check_cancellation():
-            logging.info("Cancellation detected after processing")
-            safe_delete_file(temp_input)
-            return "Conversion cancelled.", None
-
-        # Process LaTeX content for GOT-OCR formatted text
-        if parser_name == "GOT-OCR (jpg,png only)" and ocr_method_name == "Formatted Text" and HAS_GEMINI_CONVERTER:
-            logging.info("Converting LaTeX output to Markdown using Gemini API")
-            start_convert = time.time()
-
-            # Check for cancellation before conversion
-            if check_cancellation():
-                logging.info("Cancellation detected before LaTeX conversion")
-                safe_delete_file(temp_input)
-                return "Conversion cancelled.", None
-
-            try:
-                markdown_content = convert_latex_to_markdown(content)
-                if markdown_content:
-                    content = markdown_content
-                    logging.info(f"LaTeX conversion completed in {time.time() - start_convert:.2f} seconds")
-                else:
-                    logging.warning("LaTeX to Markdown conversion failed, using raw LaTeX output")
-            except Exception as e:
-                logging.error(f"Error converting LaTeX to Markdown: {str(e)}")
-                # Continue with the original content on error
-
-        # Check for cancellation after conversion
-        if check_cancellation():
-            logging.info("Cancellation detected after LaTeX conversion")
-            safe_delete_file(temp_input)
-            return "Conversion cancelled.", None
-
-    except Exception as e:
-        safe_delete_file(temp_input)
-        return f"Error: {e}", None
-
-    # Determine the file extension based on the output format
-    if output_format == "Markdown":
-        ext = ".md"
-    elif output_format == "JSON":
-        ext = ".json"
-    elif output_format == "Text":
-        ext = ".txt"
-    elif output_format == "Document Tags":
-        ext = ".doctags"
-    else:
-        ext = ".txt"
-
-    # Check for cancellation again
-    if check_cancellation():
-        logging.info("Cancellation detected before output file creation")
-        safe_delete_file(temp_input)
+        # Use the document service to handle conversion
+        content, output_path = _document_service.convert_document(
+            file_path=file_path,
+            parser_name=parser_name,
+            ocr_method_name=ocr_method_name,
+            output_format=output_format
+        )
+
+        return content, output_path
+
+    except ConversionError as e:
+        # Handle user-friendly conversion errors
+        if "cancelled" in str(e).lower():
             return "Conversion cancelled.", None
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-            tmp.write(content[i:i+chunk_size])
-
-        # Clean up the temporary input file
-        safe_delete_file(temp_input)
-        temp_input = None  # Mark as cleaned up
-
-        return content, tmp_path
-    except Exception as e:
-        safe_delete_file(tmp_path)
-        safe_delete_file(temp_input)
-        return f"Error: {e}", None
-    finally:
-        # Always clean up any remaining temp files
-        safe_delete_file(temp_input)
-        if check_cancellation() and tmp_path:
-            safe_delete_file(tmp_path)
-
-        # Always clear the conversion in progress flag when done
-        _conversion_in_progress = False
+        return f"Conversion failed: {e}", None
+
+    except DocumentProcessingError as e:
+        # Handle document processing errors
+        return f"Document processing error: {e}", None
+
+    except ConfigurationError as e:
+        # Handle configuration errors
+        return f"Configuration error: {e}", None
+
+    except Exception as e:
+        # Handle unexpected errors
+        logging.error(f"Unexpected error in convert_file: {e}")
+        return f"Unexpected error: {e}", None
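A minimal sketch of the cancellation wiring implied by the new interface follows; the UI side shown here is hypothetical, and only `set_cancellation_flag` and `is_conversion_in_progress` come from the module above.

```python
import threading

from src.core import converter

# The UI owns the cancellation flag and hands it to the converter module once.
cancel_event = threading.Event()
converter.set_cancellation_flag(cancel_event)

# A hypothetical "Cancel" button handler would simply set the event:
# cancel_event.set()

# The UI can poll this while a conversion runs in another thread.
print(converter.is_conversion_in_progress())
```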
src/core/environment.py
ADDED
@@ -0,0 +1,246 @@
+"""
+Environment setup and dependency management for the Markit application.
+Extracted from app.py to improve code organization while maintaining HF Spaces compatibility.
+"""
+import os
+import sys
+import subprocess
+import logging
+from typing import Dict, Optional, Tuple
+from pathlib import Path
+
+from src.core.config import config
+from src.core.logging_config import setup_logging
+
+
+class EnvironmentManager:
+    """Manages environment setup and dependency installation."""
+
+    def __init__(self):
+        self.current_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+        self.logger = logging.getLogger(__name__)
+
+    def run_setup_script(self) -> bool:
+        """Run setup.sh script if it exists."""
+        try:
+            setup_script = os.path.join(self.current_dir, "setup.sh")
+            if os.path.exists(setup_script):
+                print("Running setup.sh...")
+                subprocess.run(["bash", setup_script], check=False)
+                print("setup.sh completed")
+            return True
+        except Exception as e:
+            print(f"Error running setup.sh: {e}")
+            return False
+
+    def check_spaces_module(self) -> bool:
+        """Check and install spaces module for ZeroGPU support."""
+        try:
+            import spaces
+            print("Spaces module found for ZeroGPU support")
+            return True
+        except ImportError:
+            print("WARNING: Spaces module not found. Installing...")
+            try:
+                subprocess.run([sys.executable, "-m", "pip", "install", "-q", "spaces"], check=False)
+                return True
+            except Exception as e:
+                print(f"Error installing spaces module: {e}")
+                return False
+
+    def check_pytorch(self) -> Tuple[bool, Dict[str, str]]:
+        """Check PyTorch and CUDA availability."""
+        info = {}
+        try:
+            import torch
+            info["pytorch_version"] = torch.__version__
+            info["cuda_available"] = str(torch.cuda.is_available())
+
+            print(f"PyTorch version: {info['pytorch_version']}")
+            print(f"CUDA available: {info['cuda_available']}")
+
+            if torch.cuda.is_available():
+                info["cuda_device"] = torch.cuda.get_device_name(0)
+                info["cuda_version"] = torch.version.cuda
+                print(f"CUDA device: {info['cuda_device']}")
+                print(f"CUDA version: {info['cuda_version']}")
+            else:
+                print("WARNING: CUDA not available. GOT-OCR performs best with GPU acceleration.")
+
+            return True, info
+        except ImportError:
+            print("WARNING: PyTorch not installed. Installing PyTorch...")
+            try:
+                subprocess.run([sys.executable, "-m", "pip", "install", "-q", "torch", "torchvision"], check=False)
+                return True, info
+            except Exception as e:
+                print(f"Error installing PyTorch: {e}")
+                return False, info
+
+    def check_transformers(self) -> bool:
+        """Check and install transformers library."""
+        try:
+            import transformers
+            print(f"Transformers version: {transformers.__version__}")
+            return True
+        except ImportError:
+            print("WARNING: Transformers not installed. Installing transformers from GitHub...")
+            try:
+                subprocess.run([
+                    sys.executable, "-m", "pip", "install", "-q",
+                    "git+https://github.com/huggingface/transformers.git@main",
+                    "accelerate", "verovio"
+                ], check=False)
+                return True
+            except Exception as e:
+                print(f"Error installing transformers: {e}")
+                return False
+
+    def check_numpy(self) -> bool:
+        """Check and install correct NumPy version."""
+        try:
+            import numpy as np
+            print(f"NumPy version: {np.__version__}")
+            if np.__version__ != "1.26.3":
+                print("WARNING: NumPy version mismatch. Installing exact version 1.26.3...")
+                subprocess.run([sys.executable, "-m", "pip", "install", "-q", "numpy==1.26.3"], check=False)
+            return True
+        except ImportError:
+            print("WARNING: NumPy not installed. Installing NumPy 1.26.3...")
+            try:
+                subprocess.run([sys.executable, "-m", "pip", "install", "-q", "numpy==1.26.3"], check=False)
+                return True
+            except Exception as e:
+                print(f"Error installing NumPy: {e}")
+                return False
+
+    def check_markitdown(self) -> bool:
+        """Check and install MarkItDown library."""
+        try:
+            from markitdown import MarkItDown
+            print("MarkItDown is installed")
+            return True
+        except ImportError:
+            print("WARNING: MarkItDown not installed. Installing...")
+            try:
+                subprocess.run([sys.executable, "-m", "pip", "install", "-q", "markitdown[all]"], check=False)
+                from markitdown import MarkItDown
+                print("MarkItDown installed successfully")
+                return True
+            except ImportError:
+                print("ERROR: Failed to install MarkItDown")
+                return False
+            except Exception as e:
+                print(f"Error installing MarkItDown: {e}")
+                return False
+
+    def load_environment_variables(self) -> bool:
+        """Load environment variables from .env file."""
+        try:
+            from dotenv import load_dotenv
+            load_dotenv()
+            print("Loaded environment variables from .env file")
+            return True
+        except ImportError:
+            print("python-dotenv not installed, skipping .env file loading")
+            return False
+
+    def validate_api_keys(self) -> Dict[str, bool]:
+        """Validate and report API key availability."""
+        results = {}
+
+        # Check Gemini API key
+        gemini_key = config.api.google_api_key
+        if not gemini_key:
+            print("Warning: GOOGLE_API_KEY environment variable not found. Gemini Flash parser and LaTeX to Markdown conversion may not work.")
+            results["gemini"] = False
+        else:
+            print(f"Found Gemini API key: {gemini_key[:5]}...{gemini_key[-5:] if len(gemini_key) > 10 else ''}")
+            print("Gemini API will be used for LaTeX to Markdown conversion when using GOT-OCR with Formatted Text mode")
+            results["gemini"] = True
+
+        # Check OpenAI API key
+        openai_key = config.api.openai_api_key
+        if not openai_key:
+            print("Warning: OPENAI_API_KEY environment variable not found. LLM-based image description in MarkItDown may not work.")
+            results["openai"] = False
+        else:
+            print(f"Found OpenAI API key: {openai_key[:5]}...{openai_key[-5:] if len(openai_key) > 10 else ''}")
+            print("OpenAI API will be available for LLM-based image descriptions in MarkItDown")
+            results["openai"] = True
+
+        # Check Mistral API key
+        mistral_key = config.api.mistral_api_key
+        if mistral_key:
+            print(f"Found Mistral API key: {mistral_key[:5]}...{mistral_key[-5:] if len(mistral_key) > 10 else ''}")
+            results["mistral"] = True
+        else:
+            results["mistral"] = False
+
+        return results
+
+    def setup_python_path(self) -> None:
+        """Setup Python path for imports."""
+        if self.current_dir not in sys.path:
+            sys.path.append(self.current_dir)
+
+    def setup_logging(self) -> None:
+        """Setup centralized logging configuration."""
+        # Configure logging to suppress httpx and other noisy logs
+        logging.getLogger("httpx").setLevel(logging.WARNING)
+        logging.getLogger("urllib3").setLevel(logging.WARNING)
+        logging.getLogger("httpcore").setLevel(logging.WARNING)
+
+        # Setup our centralized logging
+        setup_logging()
+
+    def full_environment_setup(self) -> Dict[str, bool]:
+        """
+        Perform complete environment setup.
+
+        Returns:
+            Dictionary with setup results for each component
+        """
+        results = {}
+
+        # Setup logging first
+        self.setup_logging()
+
+        # Run setup script
+        results["setup_script"] = self.run_setup_script()
+
+        # Check and install dependencies
+        results["spaces_module"] = self.check_spaces_module()
+        results["pytorch"], pytorch_info = self.check_pytorch()
+        results["transformers"] = self.check_transformers()
+        results["numpy"] = self.check_numpy()
+        results["markitdown"] = self.check_markitdown()
+
+        # Load environment variables
+        results["env_vars"] = self.load_environment_variables()
+
+        # Validate API keys
+        api_keys = self.validate_api_keys()
+        results["api_keys"] = api_keys
+
+        # Setup Python path
+        self.setup_python_path()
+        results["python_path"] = True
+
+        # Validate configuration
+        validation = config.validate()
+        results["config_valid"] = validation["valid"]
+
+        if validation["warnings"]:
+            for warning in validation["warnings"]:
+                print(f"Configuration warning: {warning}")
+
+        if validation["errors"]:
+            for error in validation["errors"]:
+                print(f"Configuration error: {error}")
+
+        return results
+
+
+# Global instance
+environment_manager = EnvironmentManager()
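For illustration, a minimal sketch of driving the manager above from a script (assuming the module is importable as `src.core.environment`):

```python
from src.core.environment import environment_manager

# One call performs the setup that app.py previously did inline.
results = environment_manager.full_environment_setup()

# Each step reports success/failure; "api_keys" is a nested dict and is skipped here.
failed = [name for name, ok in results.items() if ok is False]
if failed:
    print(f"Setup steps that did not complete: {failed}")
```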
src/core/exceptions.py
ADDED
@@ -0,0 +1,83 @@
"""
Custom exception classes for the Markit application.
"""


class MarkitError(Exception):
    """Base exception class for all Markit-related errors."""
    pass


class ConfigurationError(MarkitError):
    """Raised when there's a configuration-related error."""
    pass


class ParserError(MarkitError):
    """Base exception for parser-related errors."""
    pass


class ParserNotFoundError(ParserError):
    """Raised when a requested parser is not available."""
    pass


class ParserInitializationError(ParserError):
    """Raised when a parser fails to initialize properly."""
    pass


class DocumentProcessingError(ParserError):
    """Raised when document processing fails."""
    pass


class UnsupportedFileTypeError(ParserError):
    """Raised when trying to process an unsupported file type."""
    pass


class APIError(MarkitError):
    """Base exception for API-related errors."""
    pass


class APIKeyMissingError(APIError):
    """Raised when required API key is missing."""
    pass


class APIRateLimitError(APIError):
    """Raised when API rate limit is exceeded."""
    pass


class APIQuotaExceededError(APIError):
    """Raised when API quota is exceeded."""
    pass


class FileError(MarkitError):
    """Base exception for file-related errors."""
    pass


class FileSizeLimitError(FileError):
    """Raised when file size exceeds the allowed limit."""
    pass


class FileNotFoundError(FileError):
    """Raised when a required file is not found."""
    pass


class ConversionError(MarkitError):
    """Raised when document conversion fails."""
    pass


class ValidationError(MarkitError):
    """Raised when input validation fails."""
    pass
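Because every new exception derives from `MarkitError`, callers can handle the whole family with one handler, or catch narrower classes such as `ParserError` when they need finer granularity. A tiny illustration (not part of the commit):

```python
from src.core.exceptions import MarkitError, DocumentProcessingError

try:
    raise DocumentProcessingError("demo failure")
except MarkitError as e:
    # DocumentProcessingError -> ParserError -> MarkitError, so one handler
    # at the base of the hierarchy is enough for a catch-all UI path.
    print(f"Markit error: {e}")
```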
src/core/logging_config.py
ADDED
@@ -0,0 +1,83 @@
"""
Centralized logging configuration for the Markit application.
"""
import logging
import sys
from pathlib import Path
from typing import Optional

from src.core.config import config


def setup_logging(
    level: Optional[str] = None,
    log_file: Optional[str] = None,
    format_string: Optional[str] = None
) -> None:
    """
    Setup centralized logging configuration.

    Args:
        level: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
        log_file: Optional file path for logging output
        format_string: Custom format string for log messages
    """
    # Determine logging level
    if level is None:
        level = "DEBUG" if config.app.debug else "INFO"

    # Default format string
    if format_string is None:
        format_string = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

    # Configure root logger
    root_logger = logging.getLogger()
    root_logger.setLevel(getattr(logging, level.upper()))

    # Clear existing handlers
    root_logger.handlers.clear()

    # Create formatter
    formatter = logging.Formatter(format_string)

    # Console handler
    console_handler = logging.StreamHandler(sys.stdout)
    console_handler.setLevel(getattr(logging, level.upper()))
    console_handler.setFormatter(formatter)
    root_logger.addHandler(console_handler)

    # File handler (optional)
    if log_file:
        try:
            log_path = Path(log_file)
            log_path.parent.mkdir(parents=True, exist_ok=True)

            file_handler = logging.FileHandler(log_file)
            file_handler.setLevel(getattr(logging, level.upper()))
            file_handler.setFormatter(formatter)
            root_logger.addHandler(file_handler)
        except Exception as e:
            logging.warning(f"Could not setup file logging: {e}")

    # Set specific logger levels to reduce noise
    logging.getLogger("urllib3").setLevel(logging.WARNING)
    logging.getLogger("requests").setLevel(logging.WARNING)
    logging.getLogger("gradio").setLevel(logging.WARNING)

    if not config.app.debug:
        # Reduce noise from external libraries in non-debug mode
        logging.getLogger("transformers").setLevel(logging.WARNING)
        logging.getLogger("torch").setLevel(logging.WARNING)


def get_logger(name: str) -> logging.Logger:
    """
    Get a logger with the specified name.

    Args:
        name: Logger name (typically __name__)

    Returns:
        Logger instance
    """
    return logging.getLogger(name)
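Other modules are expected to call `setup_logging()` once at startup and then obtain named loggers through `get_logger()`. A short usage sketch (the log-file path is arbitrary, chosen only for illustration):

```python
from src.core.logging_config import setup_logging, get_logger

# One-time setup; when `level` is omitted it falls back to DEBUG or INFO based on config.app.debug.
setup_logging(level="INFO", log_file="logs/markit.log")

logger = get_logger(__name__)
logger.info("Logging configured")
```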
src/main.py
CHANGED
@@ -1,10 +1,10 @@
-import parsers  # Import all parsers to ensure they're registered
+from src import parsers  # Import all parsers to ensure they're registered
 from src.ui.ui import launch_ui

 def main():
     # Launch the UI
     launch_ui(
-        server_name="
+        server_name="localhost",
         server_port=7860,
         share=False  # Explicitly disable sharing on Hugging Face
     )
src/parsers/got_ocr_parser.py
CHANGED
@@ -13,6 +13,7 @@ import tempfile
 import shutil
 from typing import Dict, List, Optional, Any, Union
 import copy
+import pickle

 from src.parsers.parser_interface import DocumentParser
 from src.parsers.parser_registry import ParserRegistry
src/parsers/markitdown_parser.py
CHANGED
@@ -1,12 +1,13 @@
 import logging
 import os
 from pathlib import Path
-from typing import Dict, List, Optional, Any, Union
+from typing import Dict, List, Optional, Any, Union, Set
 import io

 # Import the parser interface and registry
 from src.parsers.parser_interface import DocumentParser
 from src.parsers.parser_registry import ParserRegistry
+from src.core.exceptions import DocumentProcessingError, ParserError

 # Check for MarkItDown availability
 try:
@@ -27,6 +28,7 @@ class MarkItDownParser(DocumentParser):
     """

     def __init__(self):
+        super().__init__()  # Initialize the base class (including _cancellation_flag)
         self.markdown_instance = None
         # Initialize MarkItDown instance
         if HAS_MARKITDOWN:
@@ -60,34 +62,44 @@
         Returns:
             str: Markdown representation of the document
         """
+        # Validate file first
+        self.validate_file(file_path)
+
         # Check if MarkItDown is available
         if not HAS_MARKITDOWN or self.markdown_instance is None:
+            raise ParserError("MarkItDown is not available. Please install with 'pip install markitdown[all]'")

-        # Get cancellation check function from kwargs
-        check_cancellation = kwargs.get('check_cancellation', lambda: False)
-
         # Check for cancellation before starting
+        if self._check_cancellation():
+            raise DocumentProcessingError("Conversion cancelled")

         try:
             # Convert the file using the standard instance
-            result = self.markdown_instance.convert(file_path)
+            result = self.markdown_instance.convert(str(file_path))

             # Check for cancellation after processing
+            if self._check_cancellation():
+                raise DocumentProcessingError("Conversion cancelled")

             return result.text_content
         except Exception as e:
             logger.error(f"Error converting file with MarkItDown: {str(e)}")
+            raise DocumentProcessingError(f"MarkItDown conversion failed: {str(e)}")

     @classmethod
     def get_name(cls) -> str:
         return "MarkItDown (pdf, jpg, png, xlsx --best for xlsx)"

+    @classmethod
+    def get_supported_file_types(cls) -> Set[str]:
+        """Return a set of supported file extensions."""
+        return {".pdf", ".docx", ".xlsx", ".pptx", ".html", ".txt", ".md", ".json", ".xml", ".csv", ".jpg", ".jpeg", ".png"}
+
+    @classmethod
+    def is_available(cls) -> bool:
+        """Check if this parser is available."""
+        return HAS_MARKITDOWN
+
     @classmethod
     def get_supported_ocr_methods(cls) -> List[Dict[str, Any]]:
         return [
src/parsers/parser_interface.py
CHANGED
@@ -1,11 +1,26 @@
 from abc import ABC, abstractmethod
 from pathlib import Path
-from typing import Dict, List, Optional, Any, Union
+from typing import Dict, List, Optional, Any, Union, Set
+import threading
+
+from src.core.exceptions import ParserError, UnsupportedFileTypeError


 class DocumentParser(ABC):
     """Base interface for all document parsers in the system."""

+    def __init__(self):
+        """Initialize the parser."""
+        self._cancellation_flag: Optional[threading.Event] = None
+
+    def set_cancellation_flag(self, flag: Optional[threading.Event]) -> None:
+        """Set the cancellation flag for this parser."""
+        self._cancellation_flag = flag
+
+    def _check_cancellation(self) -> bool:
+        """Check if cancellation has been requested."""
+        return self._cancellation_flag is not None and self._cancellation_flag.is_set()
+
     @abstractmethod
     def parse(self, file_path: Union[str, Path], ocr_method: Optional[str] = None, **kwargs) -> str:
         """
@@ -18,6 +33,10 @@ class DocumentParser(ABC):

         Returns:
             str: The parsed content
+
+        Raises:
+            ParserError: For general parsing errors
+            UnsupportedFileTypeError: For unsupported file types
         """
         pass

@@ -44,4 +63,44 @@ class DocumentParser(ABC):
     @classmethod
     def get_description(cls) -> str:
         """Return a description of this parser"""
         return f"{cls.get_name()} document parser"
+
+    @classmethod
+    def get_supported_file_types(cls) -> Set[str]:
+        """Return a set of supported file extensions (including the dot)."""
+        return {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp"}
+
+    @classmethod
+    def is_available(cls) -> bool:
+        """Check if this parser is available with current configuration."""
+        return True
+
+    def validate_file(self, file_path: Union[str, Path]) -> None:
+        """
+        Validate that the file can be processed by this parser.
+
+        Args:
+            file_path: Path to the file to validate
+
+        Raises:
+            UnsupportedFileTypeError: If file type is not supported
+            ParserError: For other validation errors
+        """
+        path = Path(file_path)
+        if not path.exists():
+            raise ParserError(f"File not found: {file_path}")
+
+        if path.suffix.lower() not in self.get_supported_file_types():
+            raise UnsupportedFileTypeError(
+                f"File type '{path.suffix}' not supported by {self.get_name()}"
+            )
+
+    def get_metadata(self) -> Dict[str, Any]:
+        """Return metadata about this parser instance."""
+        return {
+            "name": self.get_name(),
+            "description": self.get_description(),
+            "supported_file_types": list(self.get_supported_file_types()),
+            "supported_ocr_methods": self.get_supported_ocr_methods(),
+            "available": self.is_available()
+        }
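To show how the enriched interface is meant to be consumed, here is a hypothetical subclass (not part of this commit). The empty OCR-method list and the parser name are placeholders, and real parsers would also register themselves with `ParserRegistry`, whose registration call is not shown here:

```python
from pathlib import Path
from typing import Any, Dict, List, Optional, Set, Union

from src.parsers.parser_interface import DocumentParser


class PlainTextParser(DocumentParser):
    """Hypothetical parser, used only to illustrate the new base-class hooks."""

    @classmethod
    def get_name(cls) -> str:
        return "Plain text (txt only)"

    @classmethod
    def get_supported_file_types(cls) -> Set[str]:
        return {".txt"}

    @classmethod
    def get_supported_ocr_methods(cls) -> List[Dict[str, Any]]:
        return []  # no OCR variants for plain text

    def parse(self, file_path: Union[str, Path], ocr_method: Optional[str] = None, **kwargs) -> str:
        self.validate_file(file_path)      # raises UnsupportedFileTypeError / ParserError
        if self._check_cancellation():     # honours set_cancellation_flag()
            return ""
        return Path(file_path).read_text(encoding="utf-8")
```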
src/services/document_service.py
ADDED
@@ -0,0 +1,243 @@
"""
Document processing service layer.
"""
import tempfile
import logging
import time
import os
import threading
from pathlib import Path
from typing import Optional, Tuple, Any

from src.core.config import config
from src.core.exceptions import (
    DocumentProcessingError,
    FileSizeLimitError,
    UnsupportedFileTypeError,
    ConversionError
)
from src.core.parser_factory import ParserFactory
from src.core.latex_to_markdown_converter import convert_latex_to_markdown


class DocumentService:
    """Service for handling document processing operations."""

    def __init__(self):
        self._conversion_in_progress = False
        self._cancellation_flag: Optional[threading.Event] = None

    def set_cancellation_flag(self, flag: threading.Event) -> None:
        """Set the cancellation flag for this service."""
        self._cancellation_flag = flag

    def is_conversion_in_progress(self) -> bool:
        """Check if conversion is currently in progress."""
        return self._conversion_in_progress

    def _check_cancellation(self) -> bool:
        """Check if cancellation has been requested."""
        if self._cancellation_flag and self._cancellation_flag.is_set():
            logging.info("Cancellation detected in document service")
            return True
        return False

    def _safe_delete_file(self, file_path: Optional[str]) -> None:
        """Safely delete a file with error handling."""
        if file_path and os.path.exists(file_path):
            try:
                os.unlink(file_path)
            except Exception as e:
                logging.error(f"Error cleaning up temp file {file_path}: {e}")

    def _validate_file(self, file_path: str) -> None:
        """Validate file size and type."""
        if not os.path.exists(file_path):
            raise DocumentProcessingError(f"File not found: {file_path}")

        # Check file size
        file_size = os.path.getsize(file_path)
        if file_size > config.app.max_file_size:
            raise FileSizeLimitError(
                f"File size ({file_size} bytes) exceeds maximum allowed size "
                f"({config.app.max_file_size} bytes)"
            )

        # Check file extension
        file_ext = Path(file_path).suffix.lower()
        if file_ext not in config.app.allowed_extensions:
            raise UnsupportedFileTypeError(
                f"File type '{file_ext}' is not supported. "
                f"Allowed types: {', '.join(config.app.allowed_extensions)}"
            )

    def _create_temp_file(self, original_path: str) -> str:
        """Create a temporary file with English filename."""
        original_ext = Path(original_path).suffix

        with tempfile.NamedTemporaryFile(suffix=original_ext, delete=False) as temp_file:
            temp_path = temp_file.name

            # Copy content in chunks with cancellation checks
            with open(original_path, 'rb') as original:
                chunk_size = 1024 * 1024  # 1MB chunks
                while True:
                    if self._check_cancellation():
                        self._safe_delete_file(temp_path)
                        raise ConversionError("Conversion cancelled during file copy")

                    chunk = original.read(chunk_size)
                    if not chunk:
                        break
                    temp_file.write(chunk)

        return temp_path

    def _process_latex_content(self, content: str, parser_name: str, ocr_method_name: str) -> str:
        """Process LaTeX content for GOT-OCR formatted text."""
        if (parser_name == "GOT-OCR (jpg,png only)" and
                ocr_method_name == "Formatted Text" and
                config.api.google_api_key):

            logging.info("Converting LaTeX output to Markdown using Gemini API")
            start_convert = time.time()

            if self._check_cancellation():
                raise ConversionError("Conversion cancelled before LaTeX conversion")

            try:
                markdown_content = convert_latex_to_markdown(content)
                if markdown_content:
                    logging.info(f"LaTeX conversion completed in {time.time() - start_convert:.2f} seconds")
                    return markdown_content
                else:
                    logging.warning("LaTeX to Markdown conversion failed, using raw LaTeX output")
            except Exception as e:
                logging.error(f"Error converting LaTeX to Markdown: {str(e)}")
                # Continue with original content on error

        return content

    def _create_output_file(self, content: str, output_format: str) -> str:
        """Create output file with proper extension."""
        # Determine file extension
        format_extensions = {
            "markdown": ".md",
            "json": ".json",
            "text": ".txt",
            "document tags": ".doctags"
        }
        ext = format_extensions.get(output_format.lower(), ".txt")

        if self._check_cancellation():
            raise ConversionError("Conversion cancelled before output file creation")

        # Create temporary output file
        with tempfile.NamedTemporaryFile(mode="w", suffix=ext, delete=False, encoding="utf-8") as tmp:
            tmp_path = tmp.name

            # Write in chunks with cancellation checks
            chunk_size = 10000  # characters
            for i in range(0, len(content), chunk_size):
                if self._check_cancellation():
                    self._safe_delete_file(tmp_path)
                    raise ConversionError("Conversion cancelled during output file writing")

                tmp.write(content[i:i+chunk_size])

        return tmp_path

    def convert_document(
        self,
        file_path: str,
        parser_name: str,
        ocr_method_name: str,
        output_format: str
    ) -> Tuple[str, Optional[str]]:
        """
        Convert a document using the specified parser and OCR method.

        Args:
            file_path: Path to the input file
            parser_name: Name of the parser to use
            ocr_method_name: Name of the OCR method to use
            output_format: Output format (Markdown, JSON, Text, Document Tags)

        Returns:
            Tuple of (content, output_file_path)

        Raises:
            DocumentProcessingError: For general processing errors
            FileSizeLimitError: When file is too large
            UnsupportedFileTypeError: For unsupported file types
            ConversionError: When conversion fails or is cancelled
        """
        if not file_path:
            raise DocumentProcessingError("No file provided")

        self._conversion_in_progress = True
        temp_input = None
        output_path = None

        try:
            # Validate input file
            self._validate_file(file_path)

            if self._check_cancellation():
                raise ConversionError("Conversion cancelled")

            # Create temporary file with English name
            temp_input = self._create_temp_file(file_path)

            if self._check_cancellation():
                raise ConversionError("Conversion cancelled")

            # Process document using parser factory
            start_time = time.time()
            content = ParserFactory.parse_document(
                file_path=temp_input,
                parser_name=parser_name,
                ocr_method_name=ocr_method_name,
                output_format=output_format.lower(),
                cancellation_flag=self._cancellation_flag
            )

            if content == "Conversion cancelled.":
                raise ConversionError("Conversion cancelled by parser")

            duration = time.time() - start_time
            logging.info(f"Document processed in {duration:.2f} seconds")

            if self._check_cancellation():
                raise ConversionError("Conversion cancelled")

            # Process LaTeX content if needed
            content = self._process_latex_content(content, parser_name, ocr_method_name)

            if self._check_cancellation():
                raise ConversionError("Conversion cancelled")

            # Create output file
            output_path = self._create_output_file(content, output_format)

            return content, output_path

        except (DocumentProcessingError, FileSizeLimitError, UnsupportedFileTypeError, ConversionError):
            # Re-raise our custom exceptions
            self._safe_delete_file(temp_input)
            self._safe_delete_file(output_path)
            raise
        except Exception as e:
            # Wrap unexpected exceptions
            self._safe_delete_file(temp_input)
            self._safe_delete_file(output_path)
            raise DocumentProcessingError(f"Unexpected error during conversion: {str(e)}")
        finally:
            # Clean up temp input file
            self._safe_delete_file(temp_input)

            # Clean up output file if cancelled
            if self._check_cancellation() and output_path:
                self._safe_delete_file(output_path)

            self._conversion_in_progress = False
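A rough sketch of how a caller might drive the new service, with cancellation wired through a `threading.Event` (illustrative only; the input file name and the OCR-method label below are placeholders, not values defined by this commit):

```python
import threading

from src.services.document_service import DocumentService
from src.core.exceptions import MarkitError

cancel_event = threading.Event()   # set() from another thread to abort mid-conversion

service = DocumentService()
service.set_cancellation_flag(cancel_event)

try:
    content, output_path = service.convert_document(
        file_path="sample.pdf",    # placeholder input
        parser_name="MarkItDown (pdf, jpg, png, xlsx --best for xlsx)",
        ocr_method_name="None",    # placeholder OCR-method label
        output_format="Markdown",
    )
    print(f"Wrote {output_path} ({len(content)} characters)")
except MarkitError as e:
    print(f"Conversion failed: {e}")
```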
src/ui/ui.py
CHANGED
@@ -6,19 +6,26 @@ import logging
 from pathlib import Path
 from src.core.converter import convert_file, set_cancellation_flag, is_conversion_in_progress
 from src.parsers.parser_registry import ParserRegistry
+from src.core.config import config
+from src.core.exceptions import (
+    DocumentProcessingError,
+    UnsupportedFileTypeError,
+    FileSizeLimitError,
+    ConfigurationError
+)
+from src.core.logging_config import get_logger
+
+# Use centralized logging
+logger = get_logger(__name__)

 # Import MarkItDown to check if it's available
 try:
     from markitdown import MarkItDown
     HAS_MARKITDOWN = True
+    logger.info("MarkItDown is available for use")
 except ImportError:
     HAS_MARKITDOWN = False
+    logger.warning("MarkItDown is not available")
-
-# Configure logging
-logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
-logger = logging.getLogger(__name__)

 # Add a global variable to track cancellation state
 conversion_cancelled = threading.Event()
@@ -40,12 +47,33 @@ def validate_file_for_parser(file_path, parser_name):
     """Validate if the file type is supported by the selected parser."""
     if not file_path:
         return True, ""  # No file selected yet
-            return False, "GOT-OCR only supports JPG and PNG formats."
-    return True, ""
+
+    try:
+        file_path_obj = Path(file_path)
+        file_ext = file_path_obj.suffix.lower()
+
+        # Check file size
+        if file_path_obj.exists():
+            file_size = file_path_obj.stat().st_size
+            if file_size > config.app.max_file_size:
+                size_mb = file_size / (1024 * 1024)
+                max_mb = config.app.max_file_size / (1024 * 1024)
+                return False, f"File size ({size_mb:.1f}MB) exceeds maximum allowed size ({max_mb:.1f}MB)"
+
+        # Check file extension
+        if file_ext not in config.app.allowed_extensions:
+            return False, f"File type '{file_ext}' is not supported. Allowed types: {', '.join(config.app.allowed_extensions)}"
+
+        # Parser-specific validation
+        if "GOT-OCR" in parser_name:
+            if file_ext not in ['.jpg', '.jpeg', '.png']:
+                return False, "GOT-OCR only supports JPG and PNG formats."
+
+        return True, ""

+    except Exception as e:
+        logger.error(f"Error validating file: {e}")
+        return False, f"Error validating file: {e}"

 def format_markdown_content(content):
     if not content: