Refactor and enhance application structure for Markit_v2
- Introduced centralized configuration management for API keys and application settings.
- Implemented a new environment manager for dependency checks and setup.
- Refactored document processing into a dedicated service layer for improved organization.
- Enhanced error handling with custom exceptions for better clarity.
- Updated README with new configuration options and usage instructions.
- Added lightweight launcher for local development.
- Improved logging setup for better debugging and information tracking.
- Updated .gitignore to include new files and directories.
This commit lays the groundwork for a more modular and maintainable codebase, facilitating future feature additions and improvements.
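The sketch below shows how the refactored layers described above are intended to fit together. It is a minimal illustration that follows the call signatures introduced in the diffs in this commit; the input file name and the exact parser label are hypothetical.

```python
from src.core.config import config            # centralized settings and API keys
from src.core.converter import convert_file   # thin wrapper over the document service

# Report which parsers can run with the current environment/API keys.
print("Available parsers:", config.get_available_parsers())

# Convert a document; failures come back as user-friendly strings, not tracebacks.
content, download_path = convert_file(
    file_path="sample.pdf",        # hypothetical input file
    parser_name="MarkItDown",      # parser label as used by the UI
    ocr_method_name="None",
    output_format="Markdown",
)
print(content[:200])
```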
- .gitignore +13 -1
- =1.1.0 +0 -0
- README.md +111 -32
- app.py +35 -102
- run_app.py +25 -0
- src/core/config.py +123 -0
- src/core/converter.py +45 -200
- src/core/environment.py +246 -0
- src/core/exceptions.py +83 -0
- src/core/logging_config.py +83 -0
- src/main.py +2 -2
- src/parsers/got_ocr_parser.py +1 -0
- src/parsers/markitdown_parser.py +23 -11
- src/parsers/parser_interface.py +61 -2
- src/services/document_service.py +243 -0
- src/ui/ui.py +39 -11
.gitignore
CHANGED
@@ -84,4 +84,16 @@ test_gemini_parser.py
 
 # Ignore tessdata folder
 /tessdata/
-/tessdata/*
+/tessdata/*
+
+# Ignore .venv folder
+.venv/
+
+# Ignore Claude.md
+Claude.md
+
+# Ignore backup
+app_backup.py
+
+#Ignore .claude
+.claude/
=1.1.0
ADDED
File without changes
README.md
CHANGED
@@ -1,5 +1,5 @@
 ---
-title:
+title: Markit_v2
 emoji: 📄
 colorFrom: blue
 colorTo: indigo
@@ -45,10 +45,26 @@ This app integrates [Microsoft's MarkItDown](https://github.com/microsoft/markitdown)
 
 ## Environment Variables
 
-You can enhance the functionality by setting these environment variables:
+The application uses centralized configuration management. You can enhance functionality by setting these environment variables:
 
-
+### 🔑 **API Keys:**
 - `GOOGLE_API_KEY`: Used for Gemini Flash parser and LaTeX to Markdown conversion
+- `OPENAI_API_KEY`: Enables AI-based image descriptions in MarkItDown
+- `MISTRAL_API_KEY`: For Mistral OCR parser (if available)
+
+### ⚙️ **Configuration Options:**
+- `DEBUG`: Set to `true` for debug mode with verbose logging
+- `MAX_FILE_SIZE`: Maximum file size in bytes (default: 10MB)
+- `TEMP_DIR`: Directory for temporary files (default: ./temp)
+- `TESSERACT_PATH`: Custom path to Tesseract executable
+- `TESSDATA_PATH`: Path to Tesseract language data
+
+### 🤖 **Model Configuration:**
+- `GEMINI_MODEL`: Gemini model to use (default: gemini-1.5-flash)
+- `MISTRAL_MODEL`: Mistral model to use (default: pixtral-12b-2409)
+- `GOT_OCR_MODEL`: GOT-OCR model to use (default: stepfun-ai/GOT-OCR2_0)
+- `MODEL_TEMPERATURE`: Model temperature for AI responses (default: 0.1)
+- `MODEL_MAX_TOKENS`: Maximum tokens for AI responses (default: 4096)
 
 ## Usage
 
@@ -60,17 +76,34 @@ You can enhance the functionality by setting these environment variables:
 
 ## Local Development
 
+### 🚀 **Quick Start:**
 1. Clone the repository
-2. Create a `.env` file
-3. Install dependencies:
+2. Create a `.env` file with your API keys:
 ```
+GOOGLE_API_KEY=your_gemini_api_key_here
+OPENAI_API_KEY=your_openai_api_key_here
+MISTRAL_API_KEY=your_mistral_api_key_here
+DEBUG=true
+```
+3. Install dependencies:
+```bash
 pip install -r requirements.txt
 ```
 4. Run the application:
-```
+```bash
+# For full environment setup (HF Spaces compatible)
 python app.py
+
+# For local development (faster startup)
+python run_app.py
 ```
 
+### 🧪 **Development Features:**
+- **Automatic Environment Setup**: Dependencies are checked and installed automatically
+- **Configuration Validation**: Startup validation reports missing API keys and configuration issues
+- **Enhanced Error Messages**: Detailed error reporting for debugging
+- **Centralized Logging**: Configurable logging levels and output formats
+
 ## Credits
 
 - [MarkItDown](https://github.com/microsoft/markitdown) by Microsoft
@@ -94,21 +127,33 @@ Markit is a powerful tool that converts various document formats (PDF, DOCX, ima
 - **Multiple Document Formats**: Convert PDFs, Word documents, images, and other document formats
 - **Versatile Output Formats**: Export to Markdown, JSON, plain text, or document tags format
 - **Advanced Parsing Engines**:
-  - **
-  - **Docling**: Advanced document structure analysis
+  - **MarkItDown**: Comprehensive document conversion (PDFs, Office docs, images, audio, etc.)
   - **Gemini Flash**: AI-powered conversion using Google's Gemini API
   - **GOT-OCR**: State-of-the-art OCR model for images (JPG/PNG only) with plain text and formatted text options
+  - **Mistral OCR**: Advanced OCR using Mistral's Pixtral model for image-to-text conversion
 - **OCR Integration**: Extract text from images and scanned documents using Tesseract OCR
 - **Interactive UI**: User-friendly Gradio interface with page navigation for large documents
 - **AI-Powered Chat**: Interact with your documents using AI to ask questions about content
 - **ZeroGPU Support**: Optimized for Hugging Face Spaces with Stateless GPU environments
 
 ## System Architecture
-
-
-
-
-- **
+
+The application is built with a clean, layered architecture following modern software engineering principles:
+
+### 🏗️ **Core Architecture Components:**
+- **Entry Point** (`app.py`): HF Spaces-compatible application launcher with environment setup
+- **Configuration Layer** (`src/core/config.py`): Centralized configuration management with validation
+- **Service Layer** (`src/services/`): Business logic for document processing and external services
+- **Core Engine** (`src/core/`): Document conversion workflows and utilities
+- **Parser Registry** (`src/parsers/`): Extensible parser system with standardized interfaces
+- **UI Layer** (`src/ui/`): Gradio-based web interface with enhanced error handling
+
+### 🎯 **Key Architectural Features:**
+- **Separation of Concerns**: Clean boundaries between UI, business logic, and core utilities
+- **Centralized Configuration**: All settings, API keys, and validation in one place
+- **Custom Exception Hierarchy**: Proper error handling with user-friendly messages
+- **Plugin Architecture**: Easy addition of new document parsers
+- **HF Spaces Optimized**: Maintains compatibility with Hugging Face deployment requirements
 
 ## Installation
 
@@ -187,10 +232,10 @@ build:
 ### Document Conversion
 1. Upload your document using the file uploader
 2. Select a parser provider:
-   - **
-   - **Docling**: Best for complex document layouts
+   - **MarkItDown**: Best for comprehensive document conversion (supports PDFs, Office docs, images, audio, etc.)
   - **Gemini Flash**: Best for AI-powered conversions (requires API key)
   - **GOT-OCR**: Best for high-quality OCR on images (JPG/PNG only)
+   - **Mistral OCR**: Advanced OCR using Mistral's Pixtral model (requires API key)
 3. Choose an OCR option based on your selected parser:
   - **None**: No OCR processing (for documents with selectable text)
  - **Tesseract**: Basic OCR using Tesseract
@@ -206,6 +251,21 @@ build:
 6. Navigate through pages using the navigation buttons for multi-page documents
 7. Download the converted content in your selected format
 
+## Configuration & Error Handling
+
+### 🔧 **Automatic Configuration:**
+The application includes intelligent configuration management that:
+- Validates API keys and reports availability at startup
+- Checks for required dependencies and installs them automatically
+- Provides helpful warnings for missing optional components
+- Reports which parsers are available based on current configuration
+
+### 🛡️ **Enhanced Error Handling:**
+- **User-Friendly Messages**: Clear error descriptions instead of technical stack traces
+- **File Validation**: Automatic checking of file size and format compatibility
+- **Parser Availability**: Real-time detection of which parsers can be used
+- **Graceful Degradation**: Application continues working even if some parsers are unavailable
+
 ## Troubleshooting
 
 ### OCR Issues
@@ -239,38 +299,57 @@ build:
 ### Project Structure
 
 ```
-
-├── app.py                     # Main application entry point
+markit_v2/
+├── app.py                     # Main application entry point (HF Spaces compatible)
+├── run_app.py                 # 🆕 Lightweight app launcher for local development
 ├── setup.sh                   # Setup script
 ├── build.sh                   # Build script
 ├── requirements.txt           # Python dependencies
 ├── README.md                  # Project documentation
-├── .env                       # Environment variables
+├── .env                       # Environment variables (local development)
 ├── .gitignore                 # Git ignore file
 ├── .gitattributes             # Git attributes file
 ├── src/                       # Source code
│   ├── __init__.py            # Package initialization
-│   ├── main.py                #
-│   ├── core/                  # Core functionality
+│   ├── main.py                # Application launcher
+│   ├── core/                  # Core functionality and utilities
+│   │   ├── __init__.py        # Package initialization
+│   │   ├── config.py          # 🆕 Centralized configuration management
+│   │   ├── exceptions.py      # 🆕 Custom exception hierarchy
+│   │   ├── logging_config.py  # 🆕 Centralized logging setup
+│   │   ├── environment.py     # 🆕 Environment setup and dependency management
+│   │   ├── converter.py       # Document conversion orchestrator (refactored)
+│   │   ├── parser_factory.py  # Parser factory pattern
+│   │   └── latex_to_markdown_converter.py  # LaTeX conversion utility
+│   ├── services/              # Business logic layer
│   │   ├── __init__.py        # Package initialization
-│   │
-│   │   └── parser_factory.py  # Parser factory
+│   │   └── document_service.py  # 🆕 Document processing service
│   ├── parsers/               # Parser implementations
│   │   ├── __init__.py        # Package initialization
-│   │   ├── parser_interface.py  #
-│   │   ├── parser_registry.py   # Parser registry
-│   │   ├──
+│   │   ├── parser_interface.py  # Enhanced parser interface
+│   │   ├── parser_registry.py   # Parser registry pattern
+│   │   ├── markitdown_parser.py # MarkItDown parser (updated)
│   │   ├── got_ocr_parser.py    # GOT-OCR parser for images
-│   │
-│
-│
-│
-│
-
-
+│   │   ├── mistral_ocr_parser.py  # 🆕 Mistral OCR parser
+│   │   └── gemini_flash_parser.py # Gemini Flash parser
+│   └── ui/                    # User interface layer
+│       ├── __init__.py        # Package initialization
+│       └── ui.py              # Gradio UI with enhanced error handling
+├── documents/                 # Documentation and examples (gitignored)
+├── tessdata/                  # Tesseract OCR data (gitignored)
+└── tests/                     # Tests (future)
 └── __init__.py                # Package initialization
 ```
 
+### 🆕 **New Architecture Components:**
+- **Configuration Management**: Centralized API keys, model settings, and app configuration (`src/core/config.py`)
+- **Exception Hierarchy**: Proper error handling with specific exception types (`src/core/exceptions.py`)
+- **Service Layer**: Business logic separated from UI and core utilities (`src/services/document_service.py`)
+- **Environment Management**: Automated dependency checking and setup (`src/core/environment.py`)
+- **Enhanced Parser Interface**: Validation, metadata, and cancellation support
+- **Lightweight Launcher**: Quick development startup with `run_app.py`
+- **Centralized Logging**: Configurable logging system (`src/core/logging_config.py`)
+
 ### ZeroGPU Integration Notes
 
 When developing for Hugging Face Spaces with Stateless GPU:
app.py
CHANGED
@@ -1,126 +1,59 @@
 import spaces  # Must be imported before any CUDA initialization
 import sys
 import os
-import subprocess
-import shutil
 from pathlib import Path
-import logging
 
-#
-logging.getLogger("httpx").setLevel(logging.WARNING)  # Raise level to WARNING to suppress INFO logs
-logging.getLogger("urllib3").setLevel(logging.WARNING)  # Also suppress urllib3 logs which might be used
-logging.getLogger("httpcore").setLevel(logging.WARNING)  # httpcore is used by httpx
-
-# Get the current directory
+# Get the current directory and setup Python path
 current_dir = os.path.dirname(os.path.abspath(__file__))
+sys.path.append(current_dir)
 
-#
-try:
-
-
-
-
-
-
-
-
-
-
-    print("
-
-
-
-
-
-
-
-
-
-    print(
-
-
-    print("WARNING: CUDA not available. GOT-OCR performs best with GPU acceleration.")
-except ImportError:
-    print("WARNING: PyTorch not installed. Installing PyTorch...")
-    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "torch", "torchvision"], check=False)
-
-# Check if transformers is installed (needed for GOT-OCR)
-try:
-    import transformers
-    print(f"Transformers version: {transformers.__version__}")
-except ImportError:
-    print("WARNING: Transformers not installed. Installing transformers from GitHub...")
-    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "git+https://github.com/huggingface/transformers.git@main", "accelerate", "verovio"], check=False)
-
-# Check if numpy is installed with the correct version
-try:
-    import numpy as np
-    print(f"NumPy version: {np.__version__}")
-    if np.__version__ != "1.26.3":
-        print("WARNING: NumPy version mismatch. Installing exact version 1.26.3...")
-        subprocess.run([sys.executable, "-m", "pip", "install", "-q", "numpy==1.26.3"], check=False)
-except ImportError:
-    print("WARNING: NumPy not installed. Installing NumPy 1.26.3...")
-    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "numpy==1.26.3"], check=False)
-
-# Check if markitdown is installed
-try:
-    from markitdown import MarkItDown
-    print("MarkItDown is installed")
-except ImportError:
-    print("WARNING: MarkItDown not installed. Installing...")
-    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "markitdown[all]"], check=False)
+# Import environment manager after setting up path
+try:
+    from src.core.environment import environment_manager
+
+    # Perform complete environment setup
+    print("Setting up environment...")
+    setup_results = environment_manager.full_environment_setup()
+
+    # Report setup status
+    print(f"Environment setup completed with results: {len([k for k, v in setup_results.items() if v])} successful, {len([k for k, v in setup_results.items() if not v])} failed")
+
+except ImportError as e:
+    print(f"Warning: Could not import environment manager: {e}")
+    print("Falling back to basic setup...")
+
+    # Fallback to basic setup if environment manager fails
+    import subprocess
+
+    # Basic dependency checks
+    try:
+        import torch
+        print(f"PyTorch version: {torch.__version__}")
+    except ImportError:
+        print("Installing PyTorch...")
+        subprocess.run([sys.executable, "-m", "pip", "install", "-q", "torch", "torchvision"], check=False)
+
 try:
     from markitdown import MarkItDown
-    print("MarkItDown
+    print("MarkItDown is available")
 except ImportError:
-    print("
-
-# Try to load environment variables from .env file
-try:
-    from dotenv import load_dotenv
-    load_dotenv()
-    print("Loaded environment variables from .env file")
-except ImportError:
-    print("python-dotenv not installed, skipping .env file loading")
-
-# Load API keys from environment variables
-gemini_api_key = os.getenv("GOOGLE_API_KEY")
-openai_api_key = os.getenv("OPENAI_API_KEY")
-
-# Check if API keys are available and print messages
-if not gemini_api_key:
-    print("Warning: GOOGLE_API_KEY environment variable not found. Gemini Flash parser and LaTeX to Markdown conversion may not work.")
-else:
-    print(f"Found Gemini API key: {gemini_api_key[:5]}...{gemini_api_key[-5:] if len(gemini_api_key) > 10 else ''}")
-    print("Gemini API will be used for LaTeX to Markdown conversion when using GOT-OCR with Formatted Text mode")
-
-if not openai_api_key:
-    print("Warning: OPENAI_API_KEY environment variable not found. LLM-based image description in MarkItDown may not work.")
-else:
-    print(f"Found OpenAI API key: {openai_api_key[:5]}...{openai_api_key[-5:] if len(openai_api_key) > 10 else ''}")
-    print("OpenAI API will be available for LLM-based image descriptions in MarkItDown")
-
-# Add the current directory to the Python path
-sys.path.append(current_dir)
+    print("Installing MarkItDown...")
+    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "markitdown[all]"], check=False)
 
-#
+# Import main function with fallback strategies (HF Spaces compatibility)
 try:
-    # First attempt - standard import
     from src.main import main
 except ModuleNotFoundError:
     try:
-        #
+        # Fallback: adjust path and try again
         sys.path.append(os.path.join(current_dir, "src"))
         from src.main import main
     except ModuleNotFoundError:
-        #
+        # Last resort: create __init__.py if missing
         init_path = os.path.join(current_dir, "src", "__init__.py")
         if not os.path.exists(init_path):
             with open(init_path, "w") as f:
-                pass
-        # Try import again
+                pass
         from src.main import main
 
 if __name__ == "__main__":
run_app.py
ADDED
@@ -0,0 +1,25 @@
+#!/usr/bin/env python3
+"""
+Simple app launcher that skips the heavy environment setup.
+Use this for local development when dependencies are already installed.
+"""
+import sys
+import os
+
+# Get the current directory and setup Python path
+current_dir = os.path.dirname(os.path.abspath(__file__))
+sys.path.append(current_dir)
+
+# Load environment variables from .env file
+try:
+    from dotenv import load_dotenv
+    load_dotenv()
+    print("Loaded environment variables from .env file")
+except ImportError:
+    print("python-dotenv not installed, skipping .env file loading")
+
+# Import and run main directly
+from src.main import main
+
+if __name__ == "__main__":
+    main()
src/core/config.py
ADDED
@@ -0,0 +1,123 @@
+"""
+Centralized configuration management for Markit application.
+"""
+import os
+from typing import Optional, Dict, Any
+from dataclasses import dataclass
+
+
+@dataclass
+class APIConfig:
+    """Configuration for external API services."""
+    google_api_key: Optional[str] = None
+    openai_api_key: Optional[str] = None
+    mistral_api_key: Optional[str] = None
+
+    def __post_init__(self):
+        """Load API keys from environment variables."""
+        self.google_api_key = os.getenv("GOOGLE_API_KEY")
+        self.openai_api_key = os.getenv("OPENAI_API_KEY")
+        self.mistral_api_key = os.getenv("MISTRAL_API_KEY")
+
+
+@dataclass
+class OCRConfig:
+    """Configuration for OCR-related settings."""
+    tesseract_path: Optional[str] = None
+    tessdata_path: Optional[str] = None
+    default_language: str = "eng"
+
+    def __post_init__(self):
+        """Load OCR configuration from environment variables."""
+        self.tesseract_path = os.getenv("TESSERACT_PATH")
+        self.tessdata_path = os.getenv("TESSDATA_PATH", "./tessdata")
+
+
+@dataclass
+class ModelConfig:
+    """Configuration for AI model settings."""
+    gemini_model: str = "gemini-2.5-flash"
+    mistral_model: str = "pixtral-12b-2409"
+    got_ocr_model: str = "stepfun-ai/GOT-OCR2_0"
+    temperature: float = 0.1
+    max_tokens: int = 4096
+
+    def __post_init__(self):
+        """Load model configuration from environment variables."""
+        self.gemini_model = os.getenv("GEMINI_MODEL", self.gemini_model)
+        self.mistral_model = os.getenv("MISTRAL_MODEL", self.mistral_model)
+        self.got_ocr_model = os.getenv("GOT_OCR_MODEL", self.got_ocr_model)
+        self.temperature = float(os.getenv("MODEL_TEMPERATURE", self.temperature))
+        self.max_tokens = int(os.getenv("MODEL_MAX_TOKENS", self.max_tokens))
+
+
+@dataclass
+class AppConfig:
+    """Main application configuration."""
+    debug: bool = False
+    max_file_size: int = 10 * 1024 * 1024  # 10MB
+    allowed_extensions: tuple = (".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp", ".tex", ".xlsx")
+    temp_dir: str = "./temp"
+
+    def __post_init__(self):
+        """Load application configuration from environment variables."""
+        self.debug = os.getenv("DEBUG", "false").lower() == "true"
+        self.max_file_size = int(os.getenv("MAX_FILE_SIZE", self.max_file_size))
+        self.temp_dir = os.getenv("TEMP_DIR", self.temp_dir)
+
+
+class Config:
+    """Main configuration container."""
+
+    def __init__(self):
+        self.api = APIConfig()
+        self.ocr = OCRConfig()
+        self.model = ModelConfig()
+        self.app = AppConfig()
+
+    def validate(self) -> Dict[str, Any]:
+        """Validate configuration and return validation results."""
+        validation_results = {
+            "valid": True,
+            "warnings": [],
+            "errors": []
+        }
+
+        # Check API keys
+        if not self.api.google_api_key:
+            validation_results["warnings"].append("Google API key not found - Gemini parser will be unavailable")
+
+        if not self.api.mistral_api_key:
+            validation_results["warnings"].append("Mistral API key not found - Mistral parser will be unavailable")
+
+        # Check tesseract setup
+        if not self.ocr.tesseract_path and not os.path.exists("/usr/bin/tesseract"):
+            validation_results["warnings"].append("Tesseract not found in system PATH - OCR functionality may be limited")
+
+        # Check temp directory
+        try:
+            os.makedirs(self.app.temp_dir, exist_ok=True)
+        except Exception as e:
+            validation_results["errors"].append(f"Cannot create temp directory {self.app.temp_dir}: {e}")
+            validation_results["valid"] = False
+
+        return validation_results
+
+    def get_available_parsers(self) -> list:
+        """Get list of available parsers based on current configuration."""
+        available = ["markitdown"]  # Always available
+
+        if self.api.google_api_key:
+            available.append("gemini_flash")
+
+        if self.api.mistral_api_key:
+            available.append("mistral_ocr")
+
+        # GOT-OCR is available if we have GPU or can use ZeroGPU
+        available.append("got_ocr")
+
+        return available
+
+
+# Global configuration instance
+config = Config()
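As a usage illustration (a minimal sketch assuming the module path shown in the diff above), the global `config` instance can be inspected at startup:

```python
from src.core.config import config

# Validate the environment; problems surface as warnings/errors, not crashes.
result = config.validate()
for warning in result["warnings"]:
    print(f"Configuration warning: {warning}")
for error in result["errors"]:
    print(f"Configuration error: {error}")

# Settings fall back to defaults unless overridden via environment variables.
print(config.model.gemini_model)       # GEMINI_MODEL or the built-in default
print(config.app.max_file_size)        # MAX_FILE_SIZE in bytes (default 10MB)
print(config.get_available_parsers())  # parsers usable with the current API keys
```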
src/core/converter.py
CHANGED
@@ -1,55 +1,30 @@
-import tempfile
 import logging
-import
-import
-from pathlib import Path
+import threading
+from typing import Optional, Tuple
 
-
-from src.core.
+from src.core.config import config
+from src.core.exceptions import (
+    DocumentProcessingError,
+    ConversionError,
+    ConfigurationError
+)
+from src.services.document_service import DocumentService
 
 # Import all parsers to ensure they're registered
 from src import parsers
 
-#
-
-    from src.core.latex_to_markdown_converter import convert_latex_to_markdown
-    HAS_GEMINI_CONVERTER = True
-except ImportError:
-    HAS_GEMINI_CONVERTER = False
-    logging.warning("LaTeX to Markdown converter not available. Raw LaTeX will be returned for formatted text.")
+# Global document service instance
+_document_service = DocumentService()
 
-
-# This will be set by the UI when the cancel button is clicked
-conversion_cancelled = None  # Will be a threading.Event object
-# Flag to track if conversion is currently in progress
-_conversion_in_progress = False
-
-def set_cancellation_flag(flag):
+def set_cancellation_flag(flag: threading.Event) -> None:
     """Set the reference to the cancellation flag from ui.py"""
-
-    conversion_cancelled = flag
+    _document_service.set_cancellation_flag(flag)
 
-def is_conversion_in_progress():
+def is_conversion_in_progress() -> bool:
     """Check if conversion is currently in progress"""
-
-    return _conversion_in_progress
-
-def check_cancellation():
-    """Check if cancellation has been requested"""
-    if conversion_cancelled and conversion_cancelled.is_set():
-        logging.info("Cancellation detected in check_cancellation")
-        return True
-    return False
-
-def safe_delete_file(file_path):
-    """Safely delete a file with error handling"""
-    if file_path and os.path.exists(file_path):
-        try:
-            os.unlink(file_path)
-        except Exception as e:
-            logging.error(f"Error cleaning up temp file {file_path}: {e}")
+    return _document_service.is_conversion_in_progress()
 
-def convert_file(file_path, parser_name, ocr_method_name, output_format):
+def convert_file(file_path: str, parser_name: str, ocr_method_name: str, output_format: str) -> Tuple[str, Optional[str]]:
     """
     Convert a file using the specified parser and OCR method.
 
@@ -62,165 +37,35 @@ def convert_file(file_path, parser_name, ocr_method_name, output_format):
     Returns:
         tuple: (content, download_file_path)
     """
-
-
-    # Set the conversion in progress flag
-    _conversion_in_progress = True
-
-    # Temporary file paths to clean up
-    temp_input = None
-    tmp_path = None
+    if not file_path:
+        return "Please upload a file.", None
 
-    # Ensure we clean up the flag when we're done
     try:
-
-
-
-
-
-
-
-
-
-
-
-
-
-        # Copy the content of original file to temp file
-        with open(file_path, 'rb') as original:
-            # Read in smaller chunks and check for cancellation between chunks
-            chunk_size = 1024 * 1024  # 1MB chunks
-            while True:
-                # Check for cancellation frequently
-                if check_cancellation():
-                    logging.info("Cancellation detected during file copy")
-                    safe_delete_file(temp_input)
-                    return "Conversion cancelled.", None
-
-                chunk = original.read(chunk_size)
-                if not chunk:
-                    break
-                temp_file.write(chunk)
-        file_path = temp_input
-    except Exception as e:
-        safe_delete_file(temp_input)
-        return f"Error creating temporary file: {e}", None
-
-    # Check for cancellation again
-    if check_cancellation():
-        logging.info("Cancellation detected after file preparation")
-        safe_delete_file(temp_input)
-        return "Conversion cancelled.", None
-
-    content = None
-    try:
-        # Use the parser factory to parse the document
-        start = time.time()
-
-        # Pass the cancellation flag to the parser factory
-        content = ParserFactory.parse_document(
-            file_path=file_path,
-            parser_name=parser_name,
-            ocr_method_name=ocr_method_name,
-            output_format=output_format.lower(),
-            cancellation_flag=conversion_cancelled  # Pass the flag to parsers
-        )
-
-        # If content indicates cancellation, return early
-        if content == "Conversion cancelled.":
-            logging.info("Parser reported cancellation")
-            safe_delete_file(temp_input)
-            return content, None
-
-        duration = time.time() - start
-        logging.info(f"Processed in {duration:.2f} seconds.")
-
-        # Check for cancellation after processing
-        if check_cancellation():
-            logging.info("Cancellation detected after processing")
-            safe_delete_file(temp_input)
-            return "Conversion cancelled.", None
-
-        # Process LaTeX content for GOT-OCR formatted text
-        if parser_name == "GOT-OCR (jpg,png only)" and ocr_method_name == "Formatted Text" and HAS_GEMINI_CONVERTER:
-            logging.info("Converting LaTeX output to Markdown using Gemini API")
-            start_convert = time.time()
-
-            # Check for cancellation before conversion
-            if check_cancellation():
-                logging.info("Cancellation detected before LaTeX conversion")
-                safe_delete_file(temp_input)
-                return "Conversion cancelled.", None
-
-            try:
-                markdown_content = convert_latex_to_markdown(content)
-                if markdown_content:
-                    content = markdown_content
-                    logging.info(f"LaTeX conversion completed in {time.time() - start_convert:.2f} seconds")
-                else:
-                    logging.warning("LaTeX to Markdown conversion failed, using raw LaTeX output")
-            except Exception as e:
-                logging.error(f"Error converting LaTeX to Markdown: {str(e)}")
-                # Continue with the original content on error
-
-        # Check for cancellation after conversion
-        if check_cancellation():
-            logging.info("Cancellation detected after LaTeX conversion")
-            safe_delete_file(temp_input)
-            return "Conversion cancelled.", None
-
-    except Exception as e:
-        safe_delete_file(temp_input)
-        return f"Error: {e}", None
-
-    # Determine the file extension based on the output format
-    if output_format == "Markdown":
-        ext = ".md"
-    elif output_format == "JSON":
-        ext = ".json"
-    elif output_format == "Text":
-        ext = ".txt"
-    elif output_format == "Document Tags":
-        ext = ".doctags"
-    else:
-        ext = ".txt"
-
-    # Check for cancellation again
-    if check_cancellation():
-        logging.info("Cancellation detected before output file creation")
-        safe_delete_file(temp_input)
+        # Use the document service to handle conversion
+        content, output_path = _document_service.convert_document(
+            file_path=file_path,
+            parser_name=parser_name,
+            ocr_method_name=ocr_method_name,
+            output_format=output_format
+        )
+
+        return content, output_path
+
+    except ConversionError as e:
+        # Handle user-friendly conversion errors
+        if "cancelled" in str(e).lower():
             return "Conversion cancelled.", None
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-            tmp.write(content[i:i+chunk_size])
-
-        # Clean up the temporary input file
-        safe_delete_file(temp_input)
-        temp_input = None  # Mark as cleaned up
-
-        return content, tmp_path
-    except Exception as e:
-        safe_delete_file(tmp_path)
-        safe_delete_file(temp_input)
-        return f"Error: {e}", None
-    finally:
-        # Always clean up any remaining temp files
-        safe_delete_file(temp_input)
-        if check_cancellation() and tmp_path:
-            safe_delete_file(tmp_path)
-
-        # Always clear the conversion in progress flag when done
-        _conversion_in_progress = False
+        return f"Conversion failed: {e}", None
+
+    except DocumentProcessingError as e:
+        # Handle document processing errors
+        return f"Document processing error: {e}", None
+
+    except ConfigurationError as e:
+        # Handle configuration errors
+        return f"Configuration error: {e}", None
+
+    except Exception as e:
+        # Handle unexpected errors
+        logging.error(f"Unexpected error in convert_file: {e}")
+        return f"Unexpected error: {e}", None
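A minimal sketch of the cancellation wiring implied by the new interface follows; the UI side shown here is hypothetical, and only `set_cancellation_flag` and `is_conversion_in_progress` come from the module above.

```python
import threading

from src.core import converter

# The UI owns the cancellation flag and hands it to the converter module once.
cancel_event = threading.Event()
converter.set_cancellation_flag(cancel_event)

# A hypothetical "Cancel" button handler would simply set the event:
# cancel_event.set()

# The UI can poll this while a conversion runs in another thread.
print(converter.is_conversion_in_progress())
```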
src/core/environment.py
ADDED
@@ -0,0 +1,246 @@
+"""
+Environment setup and dependency management for the Markit application.
+Extracted from app.py to improve code organization while maintaining HF Spaces compatibility.
+"""
+import os
+import sys
+import subprocess
+import logging
+from typing import Dict, Optional, Tuple
+from pathlib import Path
+
+from src.core.config import config
+from src.core.logging_config import setup_logging
+
+
+class EnvironmentManager:
+    """Manages environment setup and dependency installation."""
+
+    def __init__(self):
+        self.current_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+        self.logger = logging.getLogger(__name__)
+
+    def run_setup_script(self) -> bool:
+        """Run setup.sh script if it exists."""
+        try:
+            setup_script = os.path.join(self.current_dir, "setup.sh")
+            if os.path.exists(setup_script):
+                print("Running setup.sh...")
+                subprocess.run(["bash", setup_script], check=False)
+                print("setup.sh completed")
+            return True
+        except Exception as e:
+            print(f"Error running setup.sh: {e}")
+            return False
+
+    def check_spaces_module(self) -> bool:
+        """Check and install spaces module for ZeroGPU support."""
+        try:
+            import spaces
+            print("Spaces module found for ZeroGPU support")
+            return True
+        except ImportError:
+            print("WARNING: Spaces module not found. Installing...")
+            try:
+                subprocess.run([sys.executable, "-m", "pip", "install", "-q", "spaces"], check=False)
+                return True
+            except Exception as e:
+                print(f"Error installing spaces module: {e}")
+                return False
+
+    def check_pytorch(self) -> Tuple[bool, Dict[str, str]]:
+        """Check PyTorch and CUDA availability."""
+        info = {}
+        try:
+            import torch
+            info["pytorch_version"] = torch.__version__
+            info["cuda_available"] = str(torch.cuda.is_available())
+
+            print(f"PyTorch version: {info['pytorch_version']}")
+            print(f"CUDA available: {info['cuda_available']}")
+
+            if torch.cuda.is_available():
+                info["cuda_device"] = torch.cuda.get_device_name(0)
+                info["cuda_version"] = torch.version.cuda
+                print(f"CUDA device: {info['cuda_device']}")
+                print(f"CUDA version: {info['cuda_version']}")
+            else:
+                print("WARNING: CUDA not available. GOT-OCR performs best with GPU acceleration.")
+
+            return True, info
+        except ImportError:
+            print("WARNING: PyTorch not installed. Installing PyTorch...")
+            try:
+                subprocess.run([sys.executable, "-m", "pip", "install", "-q", "torch", "torchvision"], check=False)
+                return True, info
+            except Exception as e:
+                print(f"Error installing PyTorch: {e}")
+                return False, info
+
+    def check_transformers(self) -> bool:
+        """Check and install transformers library."""
+        try:
+            import transformers
+            print(f"Transformers version: {transformers.__version__}")
+            return True
+        except ImportError:
+            print("WARNING: Transformers not installed. Installing transformers from GitHub...")
+            try:
+                subprocess.run([
+                    sys.executable, "-m", "pip", "install", "-q",
+                    "git+https://github.com/huggingface/transformers.git@main",
+                    "accelerate", "verovio"
+                ], check=False)
+                return True
+            except Exception as e:
+                print(f"Error installing transformers: {e}")
+                return False
+
+    def check_numpy(self) -> bool:
+        """Check and install correct NumPy version."""
+        try:
+            import numpy as np
+            print(f"NumPy version: {np.__version__}")
+            if np.__version__ != "1.26.3":
+                print("WARNING: NumPy version mismatch. Installing exact version 1.26.3...")
+                subprocess.run([sys.executable, "-m", "pip", "install", "-q", "numpy==1.26.3"], check=False)
+            return True
+        except ImportError:
+            print("WARNING: NumPy not installed. Installing NumPy 1.26.3...")
+            try:
+                subprocess.run([sys.executable, "-m", "pip", "install", "-q", "numpy==1.26.3"], check=False)
+                return True
+            except Exception as e:
+                print(f"Error installing NumPy: {e}")
+                return False
+
+    def check_markitdown(self) -> bool:
+        """Check and install MarkItDown library."""
+        try:
+            from markitdown import MarkItDown
+            print("MarkItDown is installed")
+            return True
+        except ImportError:
+            print("WARNING: MarkItDown not installed. Installing...")
+            try:
+                subprocess.run([sys.executable, "-m", "pip", "install", "-q", "markitdown[all]"], check=False)
+                from markitdown import MarkItDown
+                print("MarkItDown installed successfully")
+                return True
+            except ImportError:
+                print("ERROR: Failed to install MarkItDown")
+                return False
+            except Exception as e:
+                print(f"Error installing MarkItDown: {e}")
+                return False
+
+    def load_environment_variables(self) -> bool:
+        """Load environment variables from .env file."""
+        try:
+            from dotenv import load_dotenv
+            load_dotenv()
+            print("Loaded environment variables from .env file")
+            return True
+        except ImportError:
+            print("python-dotenv not installed, skipping .env file loading")
+            return False
+
+    def validate_api_keys(self) -> Dict[str, bool]:
+        """Validate and report API key availability."""
+        results = {}
+
+        # Check Gemini API key
+        gemini_key = config.api.google_api_key
+        if not gemini_key:
+            print("Warning: GOOGLE_API_KEY environment variable not found. Gemini Flash parser and LaTeX to Markdown conversion may not work.")
+            results["gemini"] = False
+        else:
+            print(f"Found Gemini API key: {gemini_key[:5]}...{gemini_key[-5:] if len(gemini_key) > 10 else ''}")
+            print("Gemini API will be used for LaTeX to Markdown conversion when using GOT-OCR with Formatted Text mode")
+            results["gemini"] = True
+
+        # Check OpenAI API key
+        openai_key = config.api.openai_api_key
+        if not openai_key:
+            print("Warning: OPENAI_API_KEY environment variable not found. LLM-based image description in MarkItDown may not work.")
+            results["openai"] = False
+        else:
+            print(f"Found OpenAI API key: {openai_key[:5]}...{openai_key[-5:] if len(openai_key) > 10 else ''}")
+            print("OpenAI API will be available for LLM-based image descriptions in MarkItDown")
+            results["openai"] = True
+
+        # Check Mistral API key
+        mistral_key = config.api.mistral_api_key
+        if mistral_key:
+            print(f"Found Mistral API key: {mistral_key[:5]}...{mistral_key[-5:] if len(mistral_key) > 10 else ''}")
+            results["mistral"] = True
+        else:
+            results["mistral"] = False
+
+        return results
+
+    def setup_python_path(self) -> None:
+        """Setup Python path for imports."""
+        if self.current_dir not in sys.path:
+            sys.path.append(self.current_dir)
+
+    def setup_logging(self) -> None:
+        """Setup centralized logging configuration."""
+        # Configure logging to suppress httpx and other noisy logs
+        logging.getLogger("httpx").setLevel(logging.WARNING)
+        logging.getLogger("urllib3").setLevel(logging.WARNING)
+        logging.getLogger("httpcore").setLevel(logging.WARNING)
+
+        # Setup our centralized logging
+        setup_logging()
+
+    def full_environment_setup(self) -> Dict[str, bool]:
+        """
+        Perform complete environment setup.
+
+        Returns:
+            Dictionary with setup results for each component
+        """
+        results = {}
+
+        # Setup logging first
+        self.setup_logging()
+
+        # Run setup script
+        results["setup_script"] = self.run_setup_script()
+
+        # Check and install dependencies
+        results["spaces_module"] = self.check_spaces_module()
+        results["pytorch"], pytorch_info = self.check_pytorch()
+        results["transformers"] = self.check_transformers()
+        results["numpy"] = self.check_numpy()
+        results["markitdown"] = self.check_markitdown()
+
+        # Load environment variables
+        results["env_vars"] = self.load_environment_variables()
+
+        # Validate API keys
+        api_keys = self.validate_api_keys()
+        results["api_keys"] = api_keys
+
+        # Setup Python path
+        self.setup_python_path()
+        results["python_path"] = True
+
+        # Validate configuration
+        validation = config.validate()
+        results["config_valid"] = validation["valid"]
+
+        if validation["warnings"]:
+            for warning in validation["warnings"]:
+                print(f"Configuration warning: {warning}")
+
+        if validation["errors"]:
+            for error in validation["errors"]:
+                print(f"Configuration error: {error}")
+
+        return results
+
+
+# Global instance
+environment_manager = EnvironmentManager()
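For illustration, a minimal sketch of driving the manager above from a script (assuming the module is importable as `src.core.environment`):

```python
from src.core.environment import environment_manager

# One call performs the setup that app.py previously did inline.
results = environment_manager.full_environment_setup()

# Each step reports success/failure; "api_keys" is a nested dict and is skipped here.
failed = [name for name, ok in results.items() if ok is False]
if failed:
    print(f"Setup steps that did not complete: {failed}")
```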
src/core/exceptions.py
ADDED
@@ -0,0 +1,83 @@
"""
Custom exception classes for the Markit application.
"""


class MarkitError(Exception):
    """Base exception class for all Markit-related errors."""
    pass


class ConfigurationError(MarkitError):
    """Raised when there's a configuration-related error."""
    pass


class ParserError(MarkitError):
    """Base exception for parser-related errors."""
    pass


class ParserNotFoundError(ParserError):
    """Raised when a requested parser is not available."""
    pass


class ParserInitializationError(ParserError):
    """Raised when a parser fails to initialize properly."""
    pass


class DocumentProcessingError(ParserError):
    """Raised when document processing fails."""
    pass


class UnsupportedFileTypeError(ParserError):
    """Raised when trying to process an unsupported file type."""
    pass


class APIError(MarkitError):
    """Base exception for API-related errors."""
    pass


class APIKeyMissingError(APIError):
    """Raised when required API key is missing."""
    pass


class APIRateLimitError(APIError):
    """Raised when API rate limit is exceeded."""
    pass


class APIQuotaExceededError(APIError):
    """Raised when API quota is exceeded."""
    pass


class FileError(MarkitError):
    """Base exception for file-related errors."""
    pass


class FileSizeLimitError(FileError):
    """Raised when file size exceeds the allowed limit."""
    pass


class FileNotFoundError(FileError):
    """Raised when a required file is not found."""
    pass


class ConversionError(MarkitError):
    """Raised when document conversion fails."""
    pass


class ValidationError(MarkitError):
    """Raised when input validation fails."""
    pass
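Because every new exception derives from `MarkitError`, callers can handle the whole family with one handler, or catch narrower classes such as `ParserError` when they need finer granularity. A tiny illustration (not part of the commit):

```python
from src.core.exceptions import MarkitError, DocumentProcessingError

try:
    raise DocumentProcessingError("demo failure")
except MarkitError as e:
    # DocumentProcessingError -> ParserError -> MarkitError, so one handler
    # at the base of the hierarchy is enough for a catch-all UI path.
    print(f"Markit error: {e}")
```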
src/core/logging_config.py
ADDED
@@ -0,0 +1,83 @@
"""
Centralized logging configuration for the Markit application.
"""
import logging
import sys
from pathlib import Path
from typing import Optional

from src.core.config import config


def setup_logging(
    level: Optional[str] = None,
    log_file: Optional[str] = None,
    format_string: Optional[str] = None
) -> None:
    """
    Setup centralized logging configuration.

    Args:
        level: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
        log_file: Optional file path for logging output
        format_string: Custom format string for log messages
    """
    # Determine logging level
    if level is None:
        level = "DEBUG" if config.app.debug else "INFO"

    # Default format string
    if format_string is None:
        format_string = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

    # Configure root logger
    root_logger = logging.getLogger()
    root_logger.setLevel(getattr(logging, level.upper()))

    # Clear existing handlers
    root_logger.handlers.clear()

    # Create formatter
    formatter = logging.Formatter(format_string)

    # Console handler
    console_handler = logging.StreamHandler(sys.stdout)
    console_handler.setLevel(getattr(logging, level.upper()))
    console_handler.setFormatter(formatter)
    root_logger.addHandler(console_handler)

    # File handler (optional)
    if log_file:
        try:
            log_path = Path(log_file)
            log_path.parent.mkdir(parents=True, exist_ok=True)

            file_handler = logging.FileHandler(log_file)
            file_handler.setLevel(getattr(logging, level.upper()))
            file_handler.setFormatter(formatter)
            root_logger.addHandler(file_handler)
        except Exception as e:
            logging.warning(f"Could not setup file logging: {e}")

    # Set specific logger levels to reduce noise
    logging.getLogger("urllib3").setLevel(logging.WARNING)
    logging.getLogger("requests").setLevel(logging.WARNING)
    logging.getLogger("gradio").setLevel(logging.WARNING)

    if not config.app.debug:
        # Reduce noise from external libraries in non-debug mode
        logging.getLogger("transformers").setLevel(logging.WARNING)
        logging.getLogger("torch").setLevel(logging.WARNING)


def get_logger(name: str) -> logging.Logger:
    """
    Get a logger with the specified name.

    Args:
        name: Logger name (typically __name__)

    Returns:
        Logger instance
    """
    return logging.getLogger(name)
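Other modules are expected to call `setup_logging()` once at startup and then obtain named loggers through `get_logger()`. A short usage sketch (the log-file path is arbitrary, chosen only for illustration):

```python
from src.core.logging_config import setup_logging, get_logger

# One-time setup; when `level` is omitted it falls back to DEBUG or INFO based on config.app.debug.
setup_logging(level="INFO", log_file="logs/markit.log")

logger = get_logger(__name__)
logger.info("Logging configured")
```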
src/main.py
CHANGED
@@ -1,10 +1,10 @@
-import parsers  # Import all parsers to ensure they're registered
+from src import parsers  # Import all parsers to ensure they're registered
 from src.ui.ui import launch_ui

 def main():
     # Launch the UI
     launch_ui(
-        server_name="
+        server_name="localhost",
         server_port=7860,
         share=False  # Explicitly disable sharing on Hugging Face
     )
src/parsers/got_ocr_parser.py
CHANGED
@@ -13,6 +13,7 @@ import tempfile
 import shutil
 from typing import Dict, List, Optional, Any, Union
 import copy
+import pickle

 from src.parsers.parser_interface import DocumentParser
 from src.parsers.parser_registry import ParserRegistry
src/parsers/markitdown_parser.py
CHANGED
@@ -1,12 +1,13 @@
 import logging
 import os
 from pathlib import Path
-from typing import Dict, List, Optional, Any, Union
+from typing import Dict, List, Optional, Any, Union, Set
 import io

 # Import the parser interface and registry
 from src.parsers.parser_interface import DocumentParser
 from src.parsers.parser_registry import ParserRegistry
+from src.core.exceptions import DocumentProcessingError, ParserError

 # Check for MarkItDown availability
 try:
@@ -27,6 +28,7 @@ class MarkItDownParser(DocumentParser):
     """

     def __init__(self):
+        super().__init__()  # Initialize the base class (including _cancellation_flag)
         self.markdown_instance = None
         # Initialize MarkItDown instance
         if HAS_MARKITDOWN:
@@ -60,34 +62,44 @@
         Returns:
             str: Markdown representation of the document
         """
+        # Validate file first
+        self.validate_file(file_path)
+
         # Check if MarkItDown is available
         if not HAS_MARKITDOWN or self.markdown_instance is None:
+            raise ParserError("MarkItDown is not available. Please install with 'pip install markitdown[all]'")

-        # Get cancellation check function from kwargs
-        check_cancellation = kwargs.get('check_cancellation', lambda: False)
-
         # Check for cancellation before starting
+        if self._check_cancellation():
+            raise DocumentProcessingError("Conversion cancelled")

         try:
             # Convert the file using the standard instance
-            result = self.markdown_instance.convert(file_path)
+            result = self.markdown_instance.convert(str(file_path))

             # Check for cancellation after processing
+            if self._check_cancellation():
+                raise DocumentProcessingError("Conversion cancelled")

             return result.text_content
         except Exception as e:
             logger.error(f"Error converting file with MarkItDown: {str(e)}")
+            raise DocumentProcessingError(f"MarkItDown conversion failed: {str(e)}")

     @classmethod
     def get_name(cls) -> str:
         return "MarkItDown (pdf, jpg, png, xlsx --best for xlsx)"

+    @classmethod
+    def get_supported_file_types(cls) -> Set[str]:
+        """Return a set of supported file extensions."""
+        return {".pdf", ".docx", ".xlsx", ".pptx", ".html", ".txt", ".md", ".json", ".xml", ".csv", ".jpg", ".jpeg", ".png"}
+
+    @classmethod
+    def is_available(cls) -> bool:
+        """Check if this parser is available."""
+        return HAS_MARKITDOWN
+
     @classmethod
     def get_supported_ocr_methods(cls) -> List[Dict[str, Any]]:
         return [
src/parsers/parser_interface.py
CHANGED
@@ -1,11 +1,26 @@
 from abc import ABC, abstractmethod
 from pathlib import Path
-from typing import Dict, List, Optional, Any, Union
+from typing import Dict, List, Optional, Any, Union, Set
+import threading
+
+from src.core.exceptions import ParserError, UnsupportedFileTypeError


 class DocumentParser(ABC):
     """Base interface for all document parsers in the system."""

+    def __init__(self):
+        """Initialize the parser."""
+        self._cancellation_flag: Optional[threading.Event] = None
+
+    def set_cancellation_flag(self, flag: Optional[threading.Event]) -> None:
+        """Set the cancellation flag for this parser."""
+        self._cancellation_flag = flag
+
+    def _check_cancellation(self) -> bool:
+        """Check if cancellation has been requested."""
+        return self._cancellation_flag is not None and self._cancellation_flag.is_set()
+
     @abstractmethod
     def parse(self, file_path: Union[str, Path], ocr_method: Optional[str] = None, **kwargs) -> str:
         """
@@ -18,6 +33,10 @@ class DocumentParser(ABC):

         Returns:
             str: The parsed content
+
+        Raises:
+            ParserError: For general parsing errors
+            UnsupportedFileTypeError: For unsupported file types
         """
         pass

@@ -44,4 +63,44 @@ class DocumentParser(ABC):
     @classmethod
     def get_description(cls) -> str:
         """Return a description of this parser"""
         return f"{cls.get_name()} document parser"
+
+    @classmethod
+    def get_supported_file_types(cls) -> Set[str]:
+        """Return a set of supported file extensions (including the dot)."""
+        return {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp"}
+
+    @classmethod
+    def is_available(cls) -> bool:
+        """Check if this parser is available with current configuration."""
+        return True
+
+    def validate_file(self, file_path: Union[str, Path]) -> None:
+        """
+        Validate that the file can be processed by this parser.
+
+        Args:
+            file_path: Path to the file to validate
+
+        Raises:
+            UnsupportedFileTypeError: If file type is not supported
+            ParserError: For other validation errors
+        """
+        path = Path(file_path)
+        if not path.exists():
+            raise ParserError(f"File not found: {file_path}")
+
+        if path.suffix.lower() not in self.get_supported_file_types():
+            raise UnsupportedFileTypeError(
+                f"File type '{path.suffix}' not supported by {self.get_name()}"
+            )
+
+    def get_metadata(self) -> Dict[str, Any]:
+        """Return metadata about this parser instance."""
+        return {
+            "name": self.get_name(),
+            "description": self.get_description(),
+            "supported_file_types": list(self.get_supported_file_types()),
+            "supported_ocr_methods": self.get_supported_ocr_methods(),
+            "available": self.is_available()
+        }
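To show how the enriched interface is meant to be consumed, here is a hypothetical subclass (not part of this commit). The empty OCR-method list and the parser name are placeholders, and real parsers would also register themselves with `ParserRegistry`, whose registration call is not shown here:

```python
from pathlib import Path
from typing import Any, Dict, List, Optional, Set, Union

from src.parsers.parser_interface import DocumentParser


class PlainTextParser(DocumentParser):
    """Hypothetical parser, used only to illustrate the new base-class hooks."""

    @classmethod
    def get_name(cls) -> str:
        return "Plain text (txt only)"

    @classmethod
    def get_supported_file_types(cls) -> Set[str]:
        return {".txt"}

    @classmethod
    def get_supported_ocr_methods(cls) -> List[Dict[str, Any]]:
        return []  # no OCR variants for plain text

    def parse(self, file_path: Union[str, Path], ocr_method: Optional[str] = None, **kwargs) -> str:
        self.validate_file(file_path)      # raises UnsupportedFileTypeError / ParserError
        if self._check_cancellation():     # honours set_cancellation_flag()
            return ""
        return Path(file_path).read_text(encoding="utf-8")
```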
src/services/document_service.py
ADDED
@@ -0,0 +1,243 @@
"""
Document processing service layer.
"""
import tempfile
import logging
import time
import os
import threading
from pathlib import Path
from typing import Optional, Tuple, Any

from src.core.config import config
from src.core.exceptions import (
    DocumentProcessingError,
    FileSizeLimitError,
    UnsupportedFileTypeError,
    ConversionError
)
from src.core.parser_factory import ParserFactory
from src.core.latex_to_markdown_converter import convert_latex_to_markdown


class DocumentService:
    """Service for handling document processing operations."""

    def __init__(self):
        self._conversion_in_progress = False
        self._cancellation_flag: Optional[threading.Event] = None

    def set_cancellation_flag(self, flag: threading.Event) -> None:
        """Set the cancellation flag for this service."""
        self._cancellation_flag = flag

    def is_conversion_in_progress(self) -> bool:
        """Check if conversion is currently in progress."""
        return self._conversion_in_progress

    def _check_cancellation(self) -> bool:
        """Check if cancellation has been requested."""
        if self._cancellation_flag and self._cancellation_flag.is_set():
            logging.info("Cancellation detected in document service")
            return True
        return False

    def _safe_delete_file(self, file_path: Optional[str]) -> None:
        """Safely delete a file with error handling."""
        if file_path and os.path.exists(file_path):
            try:
                os.unlink(file_path)
            except Exception as e:
                logging.error(f"Error cleaning up temp file {file_path}: {e}")

    def _validate_file(self, file_path: str) -> None:
        """Validate file size and type."""
        if not os.path.exists(file_path):
            raise DocumentProcessingError(f"File not found: {file_path}")

        # Check file size
        file_size = os.path.getsize(file_path)
        if file_size > config.app.max_file_size:
            raise FileSizeLimitError(
                f"File size ({file_size} bytes) exceeds maximum allowed size "
                f"({config.app.max_file_size} bytes)"
            )

        # Check file extension
        file_ext = Path(file_path).suffix.lower()
        if file_ext not in config.app.allowed_extensions:
            raise UnsupportedFileTypeError(
                f"File type '{file_ext}' is not supported. "
                f"Allowed types: {', '.join(config.app.allowed_extensions)}"
            )

    def _create_temp_file(self, original_path: str) -> str:
        """Create a temporary file with English filename."""
        original_ext = Path(original_path).suffix

        with tempfile.NamedTemporaryFile(suffix=original_ext, delete=False) as temp_file:
            temp_path = temp_file.name

            # Copy content in chunks with cancellation checks
            with open(original_path, 'rb') as original:
                chunk_size = 1024 * 1024  # 1MB chunks
                while True:
                    if self._check_cancellation():
                        self._safe_delete_file(temp_path)
                        raise ConversionError("Conversion cancelled during file copy")

                    chunk = original.read(chunk_size)
                    if not chunk:
                        break
                    temp_file.write(chunk)

        return temp_path

    def _process_latex_content(self, content: str, parser_name: str, ocr_method_name: str) -> str:
        """Process LaTeX content for GOT-OCR formatted text."""
        if (parser_name == "GOT-OCR (jpg,png only)" and
                ocr_method_name == "Formatted Text" and
                config.api.google_api_key):

            logging.info("Converting LaTeX output to Markdown using Gemini API")
            start_convert = time.time()

            if self._check_cancellation():
                raise ConversionError("Conversion cancelled before LaTeX conversion")

            try:
                markdown_content = convert_latex_to_markdown(content)
                if markdown_content:
                    logging.info(f"LaTeX conversion completed in {time.time() - start_convert:.2f} seconds")
                    return markdown_content
                else:
                    logging.warning("LaTeX to Markdown conversion failed, using raw LaTeX output")
            except Exception as e:
                logging.error(f"Error converting LaTeX to Markdown: {str(e)}")
                # Continue with original content on error

        return content

    def _create_output_file(self, content: str, output_format: str) -> str:
        """Create output file with proper extension."""
        # Determine file extension
        format_extensions = {
            "markdown": ".md",
            "json": ".json",
            "text": ".txt",
            "document tags": ".doctags"
        }
        ext = format_extensions.get(output_format.lower(), ".txt")

        if self._check_cancellation():
            raise ConversionError("Conversion cancelled before output file creation")

        # Create temporary output file
        with tempfile.NamedTemporaryFile(mode="w", suffix=ext, delete=False, encoding="utf-8") as tmp:
            tmp_path = tmp.name

            # Write in chunks with cancellation checks
            chunk_size = 10000  # characters
            for i in range(0, len(content), chunk_size):
                if self._check_cancellation():
                    self._safe_delete_file(tmp_path)
                    raise ConversionError("Conversion cancelled during output file writing")

                tmp.write(content[i:i+chunk_size])

        return tmp_path

    def convert_document(
        self,
        file_path: str,
        parser_name: str,
        ocr_method_name: str,
        output_format: str
    ) -> Tuple[str, Optional[str]]:
        """
        Convert a document using the specified parser and OCR method.

        Args:
            file_path: Path to the input file
            parser_name: Name of the parser to use
            ocr_method_name: Name of the OCR method to use
            output_format: Output format (Markdown, JSON, Text, Document Tags)

        Returns:
            Tuple of (content, output_file_path)

        Raises:
            DocumentProcessingError: For general processing errors
            FileSizeLimitError: When file is too large
            UnsupportedFileTypeError: For unsupported file types
            ConversionError: When conversion fails or is cancelled
        """
        if not file_path:
            raise DocumentProcessingError("No file provided")

        self._conversion_in_progress = True
        temp_input = None
        output_path = None

        try:
            # Validate input file
            self._validate_file(file_path)

            if self._check_cancellation():
                raise ConversionError("Conversion cancelled")

            # Create temporary file with English name
            temp_input = self._create_temp_file(file_path)

            if self._check_cancellation():
                raise ConversionError("Conversion cancelled")

            # Process document using parser factory
            start_time = time.time()
            content = ParserFactory.parse_document(
                file_path=temp_input,
                parser_name=parser_name,
                ocr_method_name=ocr_method_name,
                output_format=output_format.lower(),
                cancellation_flag=self._cancellation_flag
            )

            if content == "Conversion cancelled.":
                raise ConversionError("Conversion cancelled by parser")

            duration = time.time() - start_time
            logging.info(f"Document processed in {duration:.2f} seconds")

            if self._check_cancellation():
                raise ConversionError("Conversion cancelled")

            # Process LaTeX content if needed
            content = self._process_latex_content(content, parser_name, ocr_method_name)

            if self._check_cancellation():
                raise ConversionError("Conversion cancelled")

            # Create output file
            output_path = self._create_output_file(content, output_format)

            return content, output_path

        except (DocumentProcessingError, FileSizeLimitError, UnsupportedFileTypeError, ConversionError):
            # Re-raise our custom exceptions
            self._safe_delete_file(temp_input)
            self._safe_delete_file(output_path)
            raise
        except Exception as e:
            # Wrap unexpected exceptions
            self._safe_delete_file(temp_input)
            self._safe_delete_file(output_path)
            raise DocumentProcessingError(f"Unexpected error during conversion: {str(e)}")
        finally:
            # Clean up temp input file
            self._safe_delete_file(temp_input)

            # Clean up output file if cancelled
            if self._check_cancellation() and output_path:
                self._safe_delete_file(output_path)

            self._conversion_in_progress = False
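A rough sketch of how a caller might drive the new service, with cancellation wired through a `threading.Event` (illustrative only; the input file name and the OCR-method label below are placeholders, not values defined by this commit):

```python
import threading

from src.services.document_service import DocumentService
from src.core.exceptions import MarkitError

cancel_event = threading.Event()   # set() from another thread to abort mid-conversion

service = DocumentService()
service.set_cancellation_flag(cancel_event)

try:
    content, output_path = service.convert_document(
        file_path="sample.pdf",    # placeholder input
        parser_name="MarkItDown (pdf, jpg, png, xlsx --best for xlsx)",
        ocr_method_name="None",    # placeholder OCR-method label
        output_format="Markdown",
    )
    print(f"Wrote {output_path} ({len(content)} characters)")
except MarkitError as e:
    print(f"Conversion failed: {e}")
```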
src/ui/ui.py
CHANGED
@@ -6,19 +6,26 @@ import logging
 from pathlib import Path
 from src.core.converter import convert_file, set_cancellation_flag, is_conversion_in_progress
 from src.parsers.parser_registry import ParserRegistry
+from src.core.config import config
+from src.core.exceptions import (
+    DocumentProcessingError,
+    UnsupportedFileTypeError,
+    FileSizeLimitError,
+    ConfigurationError
+)
+from src.core.logging_config import get_logger
+
+# Use centralized logging
+logger = get_logger(__name__)

 # Import MarkItDown to check if it's available
 try:
     from markitdown import MarkItDown
     HAS_MARKITDOWN = True
+    logger.info("MarkItDown is available for use")
 except ImportError:
     HAS_MARKITDOWN = False
+    logger.warning("MarkItDown is not available")
-
-# Configure logging
-logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
-logger = logging.getLogger(__name__)

 # Add a global variable to track cancellation state
 conversion_cancelled = threading.Event()
@@ -40,12 +47,33 @@ def validate_file_for_parser(file_path, parser_name):
     """Validate if the file type is supported by the selected parser."""
     if not file_path:
         return True, ""  # No file selected yet
-            return False, "GOT-OCR only supports JPG and PNG formats."
-    return True, ""
+
+    try:
+        file_path_obj = Path(file_path)
+        file_ext = file_path_obj.suffix.lower()
+
+        # Check file size
+        if file_path_obj.exists():
+            file_size = file_path_obj.stat().st_size
+            if file_size > config.app.max_file_size:
+                size_mb = file_size / (1024 * 1024)
+                max_mb = config.app.max_file_size / (1024 * 1024)
+                return False, f"File size ({size_mb:.1f}MB) exceeds maximum allowed size ({max_mb:.1f}MB)"
+
+        # Check file extension
+        if file_ext not in config.app.allowed_extensions:
+            return False, f"File type '{file_ext}' is not supported. Allowed types: {', '.join(config.app.allowed_extensions)}"
+
+        # Parser-specific validation
+        if "GOT-OCR" in parser_name:
+            if file_ext not in ['.jpg', '.jpeg', '.png']:
+                return False, "GOT-OCR only supports JPG and PNG formats."
+
+        return True, ""

+    except Exception as e:
+        logger.error(f"Error validating file: {e}")
+        return False, f"Error validating file: {e}"

 def format_markdown_content(content):
     if not content: