AnseMin committed
Commit a773878 · 1 Parent(s): 55627c9

Refactor and enhance application structure for Markit_v2


- Introduced centralized configuration management for API keys and application settings.
- Implemented a new environment manager for dependency checks and setup.
- Refactored document processing into a dedicated service layer for improved organization.
- Enhanced error handling with custom exceptions for better clarity.
- Updated README with new configuration options and usage instructions.
- Added lightweight launcher for local development.
- Improved logging setup for better debugging and information tracking.
- Updated .gitignore to include new files and directories.

This commit lays the groundwork for a more modular and maintainable codebase, facilitating future feature additions and improvements.

.gitignore CHANGED
@@ -84,4 +84,16 @@ test_gemini_parser.py
 
 # Ignore tessdata folder
 /tessdata/
-/tessdata/*
+/tessdata/*
+
+# Ignore .venv folder
+.venv/
+
+# Ignore Claude.md
+Claude.md
+
+# Ignore backup
+app_backup.py
+
+#Ignore .claude
+.claude/
=1.1.0 ADDED
File without changes
README.md CHANGED
@@ -1,5 +1,5 @@
 ---
-title: Markit GOT OCR
+title: Markit_v2
 emoji: 📄
 colorFrom: blue
 colorTo: indigo
@@ -45,10 +45,26 @@ This app integrates [Microsoft's MarkItDown](https://github.com/microsoft/markitdown)
 
 ## Environment Variables
 
-You can enhance the functionality by setting these environment variables:
+The application uses centralized configuration management. You can enhance functionality by setting these environment variables:
 
-- `OPENAI_API_KEY`: Enables AI-based image descriptions in MarkItDown
+### 🔑 **API Keys:**
 - `GOOGLE_API_KEY`: Used for Gemini Flash parser and LaTeX to Markdown conversion
+- `OPENAI_API_KEY`: Enables AI-based image descriptions in MarkItDown
+- `MISTRAL_API_KEY`: For Mistral OCR parser (if available)
+
+### ⚙️ **Configuration Options:**
+- `DEBUG`: Set to `true` for debug mode with verbose logging
+- `MAX_FILE_SIZE`: Maximum file size in bytes (default: 10MB)
+- `TEMP_DIR`: Directory for temporary files (default: ./temp)
+- `TESSERACT_PATH`: Custom path to Tesseract executable
+- `TESSDATA_PATH`: Path to Tesseract language data
+
+### 🤖 **Model Configuration:**
+- `GEMINI_MODEL`: Gemini model to use (default: gemini-1.5-flash)
+- `MISTRAL_MODEL`: Mistral model to use (default: pixtral-12b-2409)
+- `GOT_OCR_MODEL`: GOT-OCR model to use (default: stepfun-ai/GOT-OCR2_0)
+- `MODEL_TEMPERATURE`: Model temperature for AI responses (default: 0.1)
+- `MODEL_MAX_TOKENS`: Maximum tokens for AI responses (default: 4096)
 
 ## Usage
 
@@ -60,17 +76,34 @@ You can enhance the functionality by setting these environment variables:
 
 ## Local Development
 
+### 🚀 **Quick Start:**
 1. Clone the repository
-2. Create a `.env` file based on `.env.example`
-3. Install dependencies:
+2. Create a `.env` file with your API keys:
 ```
+GOOGLE_API_KEY=your_gemini_api_key_here
+OPENAI_API_KEY=your_openai_api_key_here
+MISTRAL_API_KEY=your_mistral_api_key_here
+DEBUG=true
+```
+3. Install dependencies:
+```bash
 pip install -r requirements.txt
 ```
 4. Run the application:
-```
+```bash
+# For full environment setup (HF Spaces compatible)
 python app.py
+
+# For local development (faster startup)
+python run_app.py
 ```
 
+### 🧪 **Development Features:**
+- **Automatic Environment Setup**: Dependencies are checked and installed automatically
+- **Configuration Validation**: Startup validation reports missing API keys and configuration issues
+- **Enhanced Error Messages**: Detailed error reporting for debugging
+- **Centralized Logging**: Configurable logging levels and output formats
+
 ## Credits
 
 - [MarkItDown](https://github.com/microsoft/markitdown) by Microsoft
@@ -94,21 +127,33 @@ Markit is a powerful tool that converts various document formats (PDF, DOCX, ima
 - **Multiple Document Formats**: Convert PDFs, Word documents, images, and other document formats
 - **Versatile Output Formats**: Export to Markdown, JSON, plain text, or document tags format
 - **Advanced Parsing Engines**:
-  - **PyPdfium**: Fast PDF parsing using the PDFium engine
-  - **Docling**: Advanced document structure analysis
+  - **MarkItDown**: Comprehensive document conversion (PDFs, Office docs, images, audio, etc.)
   - **Gemini Flash**: AI-powered conversion using Google's Gemini API
   - **GOT-OCR**: State-of-the-art OCR model for images (JPG/PNG only) with plain text and formatted text options
+  - **Mistral OCR**: Advanced OCR using Mistral's Pixtral model for image-to-text conversion
 - **OCR Integration**: Extract text from images and scanned documents using Tesseract OCR
 - **Interactive UI**: User-friendly Gradio interface with page navigation for large documents
 - **AI-Powered Chat**: Interact with your documents using AI to ask questions about content
 - **ZeroGPU Support**: Optimized for Hugging Face Spaces with Stateless GPU environments
 
 ## System Architecture
-The application is built with a modular architecture:
-- **Core Engine**: Handles document conversion and processing workflows
-- **Parser Registry**: Central registry for all document parsers
-- **UI Layer**: Gradio-based web interface
-- **Service Layer**: Handles AI chat functionality and external services integration
+
+The application is built with a clean, layered architecture following modern software engineering principles:
+
+### 🏗️ **Core Architecture Components:**
+- **Entry Point** (`app.py`): HF Spaces-compatible application launcher with environment setup
+- **Configuration Layer** (`src/core/config.py`): Centralized configuration management with validation
+- **Service Layer** (`src/services/`): Business logic for document processing and external services
+- **Core Engine** (`src/core/`): Document conversion workflows and utilities
+- **Parser Registry** (`src/parsers/`): Extensible parser system with standardized interfaces
+- **UI Layer** (`src/ui/`): Gradio-based web interface with enhanced error handling
+
+### 🎯 **Key Architectural Features:**
+- **Separation of Concerns**: Clean boundaries between UI, business logic, and core utilities
+- **Centralized Configuration**: All settings, API keys, and validation in one place
+- **Custom Exception Hierarchy**: Proper error handling with user-friendly messages
+- **Plugin Architecture**: Easy addition of new document parsers
+- **HF Spaces Optimized**: Maintains compatibility with Hugging Face deployment requirements
 
 ## Installation
 
@@ -187,10 +232,10 @@ build:
 ### Document Conversion
 1. Upload your document using the file uploader
 2. Select a parser provider:
-  - **PyPdfium**: Best for standard PDFs with selectable text
-  - **Docling**: Best for complex document layouts
+  - **MarkItDown**: Best for comprehensive document conversion (supports PDFs, Office docs, images, audio, etc.)
   - **Gemini Flash**: Best for AI-powered conversions (requires API key)
   - **GOT-OCR**: Best for high-quality OCR on images (JPG/PNG only)
+  - **Mistral OCR**: Advanced OCR using Mistral's Pixtral model (requires API key)
 3. Choose an OCR option based on your selected parser:
   - **None**: No OCR processing (for documents with selectable text)
   - **Tesseract**: Basic OCR using Tesseract
@@ -206,6 +251,21 @@ build:
 6. Navigate through pages using the navigation buttons for multi-page documents
 7. Download the converted content in your selected format
 
+## Configuration & Error Handling
+
+### 🔧 **Automatic Configuration:**
+The application includes intelligent configuration management that:
+- Validates API keys and reports availability at startup
+- Checks for required dependencies and installs them automatically
+- Provides helpful warnings for missing optional components
+- Reports which parsers are available based on current configuration
+
+### 🛡️ **Enhanced Error Handling:**
+- **User-Friendly Messages**: Clear error descriptions instead of technical stack traces
+- **File Validation**: Automatic checking of file size and format compatibility
+- **Parser Availability**: Real-time detection of which parsers can be used
+- **Graceful Degradation**: Application continues working even if some parsers are unavailable
+
 ## Troubleshooting
 
 ### OCR Issues
@@ -239,38 +299,57 @@ build:
 ### Project Structure
 
 ```
-markit/
-├── app.py                  # Main application entry point
+markit_v2/
+├── app.py                  # Main application entry point (HF Spaces compatible)
+├── run_app.py              # 🆕 Lightweight app launcher for local development
 ├── setup.sh                # Setup script
 ├── build.sh                # Build script
 ├── requirements.txt        # Python dependencies
 ├── README.md               # Project documentation
-├── .env                    # Environment variables
+├── .env                    # Environment variables (local development)
 ├── .gitignore              # Git ignore file
 ├── .gitattributes          # Git attributes file
 ├── src/                    # Source code
 │   ├── __init__.py         # Package initialization
-│   ├── main.py             # Main module
-│   ├── core/               # Core functionality
+│   ├── main.py             # Application launcher
+│   ├── core/               # Core functionality and utilities
+│   │   ├── __init__.py     # Package initialization
+│   │   ├── config.py       # 🆕 Centralized configuration management
+│   │   ├── exceptions.py   # 🆕 Custom exception hierarchy
+│   │   ├── logging_config.py # 🆕 Centralized logging setup
+│   │   ├── environment.py  # 🆕 Environment setup and dependency management
+│   │   ├── converter.py    # Document conversion orchestrator (refactored)
+│   │   ├── parser_factory.py # Parser factory pattern
+│   │   └── latex_to_markdown_converter.py # LaTeX conversion utility
+│   ├── services/           # Business logic layer
 │   │   ├── __init__.py     # Package initialization
-│   │   ├── converter.py    # Document conversion logic
-│   │   └── parser_factory.py # Parser factory
+│   │   └── document_service.py # 🆕 Document processing service
 │   ├── parsers/            # Parser implementations
 │   │   ├── __init__.py     # Package initialization
-│   │   ├── parser_interface.py # Parser interface
-│   │   ├── parser_registry.py # Parser registry
-│   │   ├── docling_parser.py # Docling parser
+│   │   ├── parser_interface.py # Enhanced parser interface
+│   │   ├── parser_registry.py # Parser registry pattern
+│   │   ├── markitdown_parser.py # MarkItDown parser (updated)
 │   │   ├── got_ocr_parser.py # GOT-OCR parser for images
-│   │   └── pypdfium_parser.py # PyPDFium parser
-├── ui/                     # User interface
-│   ├── __init__.py         # Package initialization
-│   └── ui.py               # Gradio UI implementation
-└── services/               # External services
-│   └── __init__.py         # Package initialization
-└── tests/                  # Tests
+│   │   ├── mistral_ocr_parser.py # 🆕 Mistral OCR parser
+│   │   └── gemini_flash_parser.py # Gemini Flash parser
+│   └── ui/                 # User interface layer
+│       ├── __init__.py     # Package initialization
+│       └── ui.py           # Gradio UI with enhanced error handling
+├── documents/              # Documentation and examples (gitignored)
+├── tessdata/               # Tesseract OCR data (gitignored)
+└── tests/                  # Tests (future)
     └── __init__.py         # Package initialization
 ```
 
+### 🆕 **New Architecture Components:**
+- **Configuration Management**: Centralized API keys, model settings, and app configuration (`src/core/config.py`)
+- **Exception Hierarchy**: Proper error handling with specific exception types (`src/core/exceptions.py`)
+- **Service Layer**: Business logic separated from UI and core utilities (`src/services/document_service.py`)
+- **Environment Management**: Automated dependency checking and setup (`src/core/environment.py`)
+- **Enhanced Parser Interface**: Validation, metadata, and cancellation support
+- **Lightweight Launcher**: Quick development startup with `run_app.py`
+- **Centralized Logging**: Configurable logging system (`src/core/logging_config.py`)
+
 ### ZeroGPU Integration Notes
 
 When developing for Hugging Face Spaces with Stateless GPU:
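For orientation, a minimal sketch of how the variables documented in the updated README are read by the new configuration layer (illustrative only, not part of the diff; module and attribute names come from `src/core/config.py` later in this commit):

```python
# Illustrative only: how the documented variables are consumed by the
# centralized configuration layer added in this commit.
import os

# Values must be set (e.g. via a .env file) before src.core.config is imported,
# because the global Config() instance is created at import time.
os.environ.setdefault("DEBUG", "true")
os.environ.setdefault("MAX_FILE_SIZE", str(10 * 1024 * 1024))

from src.core.config import config

print(config.app.debug)            # DEBUG
print(config.app.max_file_size)    # MAX_FILE_SIZE
print(config.model.gemini_model)   # GEMINI_MODEL
print(config.ocr.tessdata_path)    # TESSDATA_PATH (defaults to ./tessdata)
```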
app.py CHANGED
@@ -1,126 +1,59 @@
1
  import spaces # Must be imported before any CUDA initialization
2
  import sys
3
  import os
4
- import subprocess
5
- import shutil
6
  from pathlib import Path
7
- import logging
8
 
9
- # Configure logging - Add this section to suppress httpx logs
10
- logging.getLogger("httpx").setLevel(logging.WARNING) # Raise level to WARNING to suppress INFO logs
11
- logging.getLogger("urllib3").setLevel(logging.WARNING) # Also suppress urllib3 logs which might be used
12
- logging.getLogger("httpcore").setLevel(logging.WARNING) # httpcore is used by httpx
13
-
14
- # Get the current directory
15
  current_dir = os.path.dirname(os.path.abspath(__file__))
 
16
 
17
- # Run setup.sh at startup
18
- try:
19
- setup_script = os.path.join(current_dir, "setup.sh")
20
- if os.path.exists(setup_script):
21
- print("Running setup.sh...")
22
- subprocess.run(["bash", setup_script], check=False)
23
- print("setup.sh completed")
24
- except Exception as e:
25
- print(f"Error running setup.sh: {e}")
26
-
27
- # Check if spaces module is installed (needed for ZeroGPU)
28
- try:
29
- print("Spaces module found for ZeroGPU support")
30
- except ImportError:
31
- print("WARNING: Spaces module not found. Installing...")
32
- subprocess.run([sys.executable, "-m", "pip", "install", "-q", "spaces"], check=False)
33
-
34
- # Check for PyTorch and CUDA availability (needed for GOT-OCR)
35
- try:
36
- import torch
37
- print(f"PyTorch version: {torch.__version__}")
38
- print(f"CUDA available: {torch.cuda.is_available()}")
39
- if torch.cuda.is_available():
40
- print(f"CUDA device: {torch.cuda.get_device_name(0)}")
41
- print(f"CUDA version: {torch.version.cuda}")
42
- else:
43
- print("WARNING: CUDA not available. GOT-OCR performs best with GPU acceleration.")
44
- except ImportError:
45
- print("WARNING: PyTorch not installed. Installing PyTorch...")
46
- subprocess.run([sys.executable, "-m", "pip", "install", "-q", "torch", "torchvision"], check=False)
47
-
48
- # Check if transformers is installed (needed for GOT-OCR)
49
- try:
50
- import transformers
51
- print(f"Transformers version: {transformers.__version__}")
52
- except ImportError:
53
- print("WARNING: Transformers not installed. Installing transformers from GitHub...")
54
- subprocess.run([sys.executable, "-m", "pip", "install", "-q", "git+https://github.com/huggingface/transformers.git@main", "accelerate", "verovio"], check=False)
55
-
56
- # Check if numpy is installed with the correct version
57
- try:
58
- import numpy as np
59
- print(f"NumPy version: {np.__version__}")
60
- if np.__version__ != "1.26.3":
61
- print("WARNING: NumPy version mismatch. Installing exact version 1.26.3...")
62
- subprocess.run([sys.executable, "-m", "pip", "install", "-q", "numpy==1.26.3"], check=False)
63
- except ImportError:
64
- print("WARNING: NumPy not installed. Installing NumPy 1.26.3...")
65
- subprocess.run([sys.executable, "-m", "pip", "install", "-q", "numpy==1.26.3"], check=False)
66
-
67
- # Check if markitdown is installed
68
- try:
69
- from markitdown import MarkItDown
70
- print("MarkItDown is installed")
71
- except ImportError:
72
- print("WARNING: MarkItDown not installed. Installing...")
73
- subprocess.run([sys.executable, "-m", "pip", "install", "-q", "markitdown[all]"], check=False)
74
  try:
75
  from markitdown import MarkItDown
76
- print("MarkItDown installed successfully")
77
  except ImportError:
78
- print("ERROR: Failed to install MarkItDown")
79
-
80
- # Try to load environment variables from .env file
81
- try:
82
- from dotenv import load_dotenv
83
- load_dotenv()
84
- print("Loaded environment variables from .env file")
85
- except ImportError:
86
- print("python-dotenv not installed, skipping .env file loading")
87
-
88
- # Load API keys from environment variables
89
- gemini_api_key = os.getenv("GOOGLE_API_KEY")
90
- openai_api_key = os.getenv("OPENAI_API_KEY")
91
-
92
- # Check if API keys are available and print messages
93
- if not gemini_api_key:
94
- print("Warning: GOOGLE_API_KEY environment variable not found. Gemini Flash parser and LaTeX to Markdown conversion may not work.")
95
- else:
96
- print(f"Found Gemini API key: {gemini_api_key[:5]}...{gemini_api_key[-5:] if len(gemini_api_key) > 10 else ''}")
97
- print("Gemini API will be used for LaTeX to Markdown conversion when using GOT-OCR with Formatted Text mode")
98
-
99
- if not openai_api_key:
100
- print("Warning: OPENAI_API_KEY environment variable not found. LLM-based image description in MarkItDown may not work.")
101
- else:
102
- print(f"Found OpenAI API key: {openai_api_key[:5]}...{openai_api_key[-5:] if len(openai_api_key) > 10 else ''}")
103
- print("OpenAI API will be available for LLM-based image descriptions in MarkItDown")
104
-
105
- # Add the current directory to the Python path
106
- sys.path.append(current_dir)
107
 
108
- # Try different import approaches
109
  try:
110
- # First attempt - standard import
111
  from src.main import main
112
  except ModuleNotFoundError:
113
  try:
114
- # Second attempt - adjust path and try again
115
  sys.path.append(os.path.join(current_dir, "src"))
116
  from src.main import main
117
  except ModuleNotFoundError:
118
- # Third attempt - create __init__.py if it doesn't exist
119
  init_path = os.path.join(current_dir, "src", "__init__.py")
120
  if not os.path.exists(init_path):
121
  with open(init_path, "w") as f:
122
- pass # Create empty __init__.py file
123
- # Try import again
124
  from src.main import main
125
 
126
  if __name__ == "__main__":
 
1
  import spaces # Must be imported before any CUDA initialization
2
  import sys
3
  import os
 
 
4
  from pathlib import Path
 
5
 
6
+ # Get the current directory and setup Python path
 
 
 
 
 
7
  current_dir = os.path.dirname(os.path.abspath(__file__))
8
+ sys.path.append(current_dir)
9
 
10
+ # Import environment manager after setting up path
11
+ try:
12
+ from src.core.environment import environment_manager
13
+
14
+ # Perform complete environment setup
15
+ print("Setting up environment...")
16
+ setup_results = environment_manager.full_environment_setup()
17
+
18
+ # Report setup status
19
+ print(f"Environment setup completed with results: {len([k for k, v in setup_results.items() if v])} successful, {len([k for k, v in setup_results.items() if not v])} failed")
20
+
21
+ except ImportError as e:
22
+ print(f"Warning: Could not import environment manager: {e}")
23
+ print("Falling back to basic setup...")
24
+
25
+ # Fallback to basic setup if environment manager fails
26
+ import subprocess
27
+
28
+ # Basic dependency checks
29
+ try:
30
+ import torch
31
+ print(f"PyTorch version: {torch.__version__}")
32
+ except ImportError:
33
+ print("Installing PyTorch...")
34
+ subprocess.run([sys.executable, "-m", "pip", "install", "-q", "torch", "torchvision"], check=False)
35
+
 
36
  try:
37
  from markitdown import MarkItDown
38
+ print("MarkItDown is available")
39
  except ImportError:
40
+ print("Installing MarkItDown...")
41
+ subprocess.run([sys.executable, "-m", "pip", "install", "-q", "markitdown[all]"], check=False)
42
 
43
+ # Import main function with fallback strategies (HF Spaces compatibility)
44
  try:
 
45
  from src.main import main
46
  except ModuleNotFoundError:
47
  try:
48
+ # Fallback: adjust path and try again
49
  sys.path.append(os.path.join(current_dir, "src"))
50
  from src.main import main
51
  except ModuleNotFoundError:
52
+ # Last resort: create __init__.py if missing
53
  init_path = os.path.join(current_dir, "src", "__init__.py")
54
  if not os.path.exists(init_path):
55
  with open(init_path, "w") as f:
56
+ pass
 
57
  from src.main import main
58
 
59
  if __name__ == "__main__":
run_app.py ADDED
@@ -0,0 +1,25 @@
+#!/usr/bin/env python3
+"""
+Simple app launcher that skips the heavy environment setup.
+Use this for local development when dependencies are already installed.
+"""
+import sys
+import os
+
+# Get the current directory and setup Python path
+current_dir = os.path.dirname(os.path.abspath(__file__))
+sys.path.append(current_dir)
+
+# Load environment variables from .env file
+try:
+    from dotenv import load_dotenv
+    load_dotenv()
+    print("Loaded environment variables from .env file")
+except ImportError:
+    print("python-dotenv not installed, skipping .env file loading")
+
+# Import and run main directly
+from src.main import main
+
+if __name__ == "__main__":
+    main()
src/core/config.py ADDED
@@ -0,0 +1,123 @@
1
+ """
2
+ Centralized configuration management for Markit application.
3
+ """
4
+ import os
5
+ from typing import Optional, Dict, Any
6
+ from dataclasses import dataclass
7
+
8
+
9
+ @dataclass
10
+ class APIConfig:
11
+ """Configuration for external API services."""
12
+ google_api_key: Optional[str] = None
13
+ openai_api_key: Optional[str] = None
14
+ mistral_api_key: Optional[str] = None
15
+
16
+ def __post_init__(self):
17
+ """Load API keys from environment variables."""
18
+ self.google_api_key = os.getenv("GOOGLE_API_KEY")
19
+ self.openai_api_key = os.getenv("OPENAI_API_KEY")
20
+ self.mistral_api_key = os.getenv("MISTRAL_API_KEY")
21
+
22
+
23
+ @dataclass
24
+ class OCRConfig:
25
+ """Configuration for OCR-related settings."""
26
+ tesseract_path: Optional[str] = None
27
+ tessdata_path: Optional[str] = None
28
+ default_language: str = "eng"
29
+
30
+ def __post_init__(self):
31
+ """Load OCR configuration from environment variables."""
32
+ self.tesseract_path = os.getenv("TESSERACT_PATH")
33
+ self.tessdata_path = os.getenv("TESSDATA_PATH", "./tessdata")
34
+
35
+
36
+ @dataclass
37
+ class ModelConfig:
38
+ """Configuration for AI model settings."""
39
+ gemini_model: str = "gemini-2.5-flash"
40
+ mistral_model: str = "pixtral-12b-2409"
41
+ got_ocr_model: str = "stepfun-ai/GOT-OCR2_0"
42
+ temperature: float = 0.1
43
+ max_tokens: int = 4096
44
+
45
+ def __post_init__(self):
46
+ """Load model configuration from environment variables."""
47
+ self.gemini_model = os.getenv("GEMINI_MODEL", self.gemini_model)
48
+ self.mistral_model = os.getenv("MISTRAL_MODEL", self.mistral_model)
49
+ self.got_ocr_model = os.getenv("GOT_OCR_MODEL", self.got_ocr_model)
50
+ self.temperature = float(os.getenv("MODEL_TEMPERATURE", self.temperature))
51
+ self.max_tokens = int(os.getenv("MODEL_MAX_TOKENS", self.max_tokens))
52
+
53
+
54
+ @dataclass
55
+ class AppConfig:
56
+ """Main application configuration."""
57
+ debug: bool = False
58
+ max_file_size: int = 10 * 1024 * 1024 # 10MB
59
+ allowed_extensions: tuple = (".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp", ".tex", ".xlsx")
60
+ temp_dir: str = "./temp"
61
+
62
+ def __post_init__(self):
63
+ """Load application configuration from environment variables."""
64
+ self.debug = os.getenv("DEBUG", "false").lower() == "true"
65
+ self.max_file_size = int(os.getenv("MAX_FILE_SIZE", self.max_file_size))
66
+ self.temp_dir = os.getenv("TEMP_DIR", self.temp_dir)
67
+
68
+
69
+ class Config:
70
+ """Main configuration container."""
71
+
72
+ def __init__(self):
73
+ self.api = APIConfig()
74
+ self.ocr = OCRConfig()
75
+ self.model = ModelConfig()
76
+ self.app = AppConfig()
77
+
78
+ def validate(self) -> Dict[str, Any]:
79
+ """Validate configuration and return validation results."""
80
+ validation_results = {
81
+ "valid": True,
82
+ "warnings": [],
83
+ "errors": []
84
+ }
85
+
86
+ # Check API keys
87
+ if not self.api.google_api_key:
88
+ validation_results["warnings"].append("Google API key not found - Gemini parser will be unavailable")
89
+
90
+ if not self.api.mistral_api_key:
91
+ validation_results["warnings"].append("Mistral API key not found - Mistral parser will be unavailable")
92
+
93
+ # Check tesseract setup
94
+ if not self.ocr.tesseract_path and not os.path.exists("/usr/bin/tesseract"):
95
+ validation_results["warnings"].append("Tesseract not found in system PATH - OCR functionality may be limited")
96
+
97
+ # Check temp directory
98
+ try:
99
+ os.makedirs(self.app.temp_dir, exist_ok=True)
100
+ except Exception as e:
101
+ validation_results["errors"].append(f"Cannot create temp directory {self.app.temp_dir}: {e}")
102
+ validation_results["valid"] = False
103
+
104
+ return validation_results
105
+
106
+ def get_available_parsers(self) -> list:
107
+ """Get list of available parsers based on current configuration."""
108
+ available = ["markitdown"] # Always available
109
+
110
+ if self.api.google_api_key:
111
+ available.append("gemini_flash")
112
+
113
+ if self.api.mistral_api_key:
114
+ available.append("mistral_ocr")
115
+
116
+ # GOT-OCR is available if we have GPU or can use ZeroGPU
117
+ available.append("got_ocr")
118
+
119
+ return available
120
+
121
+
122
+ # Global configuration instance
123
+ config = Config()
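A small usage sketch for the module above (hypothetical caller code, not part of the commit): `validate()` reports warnings and errors, and parser availability is derived from which API keys are present.

```python
# Hypothetical startup check built on the Config class defined above.
from src.core.config import config

report = config.validate()
if not report["valid"]:
    for error in report["errors"]:
        print(f"Configuration error: {error}")
for warning in report["warnings"]:
    print(f"Configuration warning: {warning}")

# Parsers that can actually run with the current keys
print("Available parsers:", config.get_available_parsers())
# e.g. ['markitdown', 'got_ocr'] when neither GOOGLE_API_KEY nor MISTRAL_API_KEY is set
```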
src/core/converter.py CHANGED
@@ -1,55 +1,30 @@
1
- import tempfile
2
  import logging
3
- import time
4
- import os
5
- from pathlib import Path
6
 
7
- # Use relative imports instead of absolute imports
8
- from src.core.parser_factory import ParserFactory
 
 
 
 
 
9
 
10
  # Import all parsers to ensure they're registered
11
  from src import parsers
12
 
13
- # Import the LaTeX to Markdown converter
14
- try:
15
- from src.core.latex_to_markdown_converter import convert_latex_to_markdown
16
- HAS_GEMINI_CONVERTER = True
17
- except ImportError:
18
- HAS_GEMINI_CONVERTER = False
19
- logging.warning("LaTeX to Markdown converter not available. Raw LaTeX will be returned for formatted text.")
20
 
21
- # Reference to the cancellation flag from ui.py
22
- # This will be set by the UI when the cancel button is clicked
23
- conversion_cancelled = None # Will be a threading.Event object
24
- # Flag to track if conversion is currently in progress
25
- _conversion_in_progress = False
26
-
27
- def set_cancellation_flag(flag):
28
  """Set the reference to the cancellation flag from ui.py"""
29
- global conversion_cancelled
30
- conversion_cancelled = flag
31
 
32
- def is_conversion_in_progress():
33
  """Check if conversion is currently in progress"""
34
- global _conversion_in_progress
35
- return _conversion_in_progress
36
-
37
- def check_cancellation():
38
- """Check if cancellation has been requested"""
39
- if conversion_cancelled and conversion_cancelled.is_set():
40
- logging.info("Cancellation detected in check_cancellation")
41
- return True
42
- return False
43
-
44
- def safe_delete_file(file_path):
45
- """Safely delete a file with error handling"""
46
- if file_path and os.path.exists(file_path):
47
- try:
48
- os.unlink(file_path)
49
- except Exception as e:
50
- logging.error(f"Error cleaning up temp file {file_path}: {e}")
51
 
52
- def convert_file(file_path, parser_name, ocr_method_name, output_format):
53
  """
54
  Convert a file using the specified parser and OCR method.
55
 
@@ -62,165 +37,35 @@ def convert_file(file_path, parser_name, ocr_method_name, output_format):
62
  Returns:
63
  tuple: (content, download_file_path)
64
  """
65
- global conversion_cancelled, _conversion_in_progress
66
-
67
- # Set the conversion in progress flag
68
- _conversion_in_progress = True
69
-
70
- # Temporary file paths to clean up
71
- temp_input = None
72
- tmp_path = None
73
 
74
- # Ensure we clean up the flag when we're done
75
  try:
76
- if not file_path:
77
- return "Please upload a file.", None
78
-
79
- # Check for cancellation
80
- if check_cancellation():
81
- logging.info("Cancellation detected at start of convert_file")
82
- return "Conversion cancelled.", None
83
-
84
- # Create a temporary file with English filename
85
- try:
86
- original_ext = Path(file_path).suffix
87
- with tempfile.NamedTemporaryFile(suffix=original_ext, delete=False) as temp_file:
88
- temp_input = temp_file.name
89
- # Copy the content of original file to temp file
90
- with open(file_path, 'rb') as original:
91
- # Read in smaller chunks and check for cancellation between chunks
92
- chunk_size = 1024 * 1024 # 1MB chunks
93
- while True:
94
- # Check for cancellation frequently
95
- if check_cancellation():
96
- logging.info("Cancellation detected during file copy")
97
- safe_delete_file(temp_input)
98
- return "Conversion cancelled.", None
99
-
100
- chunk = original.read(chunk_size)
101
- if not chunk:
102
- break
103
- temp_file.write(chunk)
104
- file_path = temp_input
105
- except Exception as e:
106
- safe_delete_file(temp_input)
107
- return f"Error creating temporary file: {e}", None
108
-
109
- # Check for cancellation again
110
- if check_cancellation():
111
- logging.info("Cancellation detected after file preparation")
112
- safe_delete_file(temp_input)
113
- return "Conversion cancelled.", None
114
-
115
- content = None
116
- try:
117
- # Use the parser factory to parse the document
118
- start = time.time()
119
-
120
- # Pass the cancellation flag to the parser factory
121
- content = ParserFactory.parse_document(
122
- file_path=file_path,
123
- parser_name=parser_name,
124
- ocr_method_name=ocr_method_name,
125
- output_format=output_format.lower(),
126
- cancellation_flag=conversion_cancelled # Pass the flag to parsers
127
- )
128
-
129
- # If content indicates cancellation, return early
130
- if content == "Conversion cancelled.":
131
- logging.info("Parser reported cancellation")
132
- safe_delete_file(temp_input)
133
- return content, None
134
-
135
- duration = time.time() - start
136
- logging.info(f"Processed in {duration:.2f} seconds.")
137
-
138
- # Check for cancellation after processing
139
- if check_cancellation():
140
- logging.info("Cancellation detected after processing")
141
- safe_delete_file(temp_input)
142
- return "Conversion cancelled.", None
143
-
144
- # Process LaTeX content for GOT-OCR formatted text
145
- if parser_name == "GOT-OCR (jpg,png only)" and ocr_method_name == "Formatted Text" and HAS_GEMINI_CONVERTER:
146
- logging.info("Converting LaTeX output to Markdown using Gemini API")
147
- start_convert = time.time()
148
-
149
- # Check for cancellation before conversion
150
- if check_cancellation():
151
- logging.info("Cancellation detected before LaTeX conversion")
152
- safe_delete_file(temp_input)
153
- return "Conversion cancelled.", None
154
-
155
- try:
156
- markdown_content = convert_latex_to_markdown(content)
157
- if markdown_content:
158
- content = markdown_content
159
- logging.info(f"LaTeX conversion completed in {time.time() - start_convert:.2f} seconds")
160
- else:
161
- logging.warning("LaTeX to Markdown conversion failed, using raw LaTeX output")
162
- except Exception as e:
163
- logging.error(f"Error converting LaTeX to Markdown: {str(e)}")
164
- # Continue with the original content on error
165
-
166
- # Check for cancellation after conversion
167
- if check_cancellation():
168
- logging.info("Cancellation detected after LaTeX conversion")
169
- safe_delete_file(temp_input)
170
- return "Conversion cancelled.", None
171
-
172
- except Exception as e:
173
- safe_delete_file(temp_input)
174
- return f"Error: {e}", None
175
-
176
- # Determine the file extension based on the output format
177
- if output_format == "Markdown":
178
- ext = ".md"
179
- elif output_format == "JSON":
180
- ext = ".json"
181
- elif output_format == "Text":
182
- ext = ".txt"
183
- elif output_format == "Document Tags":
184
- ext = ".doctags"
185
- else:
186
- ext = ".txt"
187
-
188
- # Check for cancellation again
189
- if check_cancellation():
190
- logging.info("Cancellation detected before output file creation")
191
- safe_delete_file(temp_input)
192
  return "Conversion cancelled.", None
193
-
194
- try:
195
- # Create a temporary file for download
196
- with tempfile.NamedTemporaryFile(mode="w", suffix=ext, delete=False, encoding="utf-8") as tmp:
197
- tmp_path = tmp.name
198
- # Write in chunks and check for cancellation
199
- chunk_size = 10000 # characters
200
- for i in range(0, len(content), chunk_size):
201
- # Check for cancellation
202
- if check_cancellation():
203
- logging.info("Cancellation detected during output file writing")
204
- safe_delete_file(tmp_path)
205
- safe_delete_file(temp_input)
206
- return "Conversion cancelled.", None
207
-
208
- tmp.write(content[i:i+chunk_size])
209
-
210
- # Clean up the temporary input file
211
- safe_delete_file(temp_input)
212
- temp_input = None # Mark as cleaned up
213
-
214
- return content, tmp_path
215
- except Exception as e:
216
- safe_delete_file(tmp_path)
217
- safe_delete_file(temp_input)
218
- return f"Error: {e}", None
219
- finally:
220
- # Always clean up any remaining temp files
221
- safe_delete_file(temp_input)
222
- if check_cancellation() and tmp_path:
223
- safe_delete_file(tmp_path)
224
-
225
- # Always clear the conversion in progress flag when done
226
- _conversion_in_progress = False
 
 
1
  import logging
2
+ import threading
3
+ from typing import Optional, Tuple
 
4
 
5
+ from src.core.config import config
6
+ from src.core.exceptions import (
7
+ DocumentProcessingError,
8
+ ConversionError,
9
+ ConfigurationError
10
+ )
11
+ from src.services.document_service import DocumentService
12
 
13
  # Import all parsers to ensure they're registered
14
  from src import parsers
15
 
16
+ # Global document service instance
17
+ _document_service = DocumentService()
 
 
 
 
 
18
 
19
+ def set_cancellation_flag(flag: threading.Event) -> None:
 
 
 
 
 
 
20
  """Set the reference to the cancellation flag from ui.py"""
21
+ _document_service.set_cancellation_flag(flag)
 
22
 
23
+ def is_conversion_in_progress() -> bool:
24
  """Check if conversion is currently in progress"""
25
+ return _document_service.is_conversion_in_progress()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
26
 
27
+ def convert_file(file_path: str, parser_name: str, ocr_method_name: str, output_format: str) -> Tuple[str, Optional[str]]:
28
  """
29
  Convert a file using the specified parser and OCR method.
30
 
 
37
  Returns:
38
  tuple: (content, download_file_path)
39
  """
40
+ if not file_path:
41
+ return "Please upload a file.", None
 
 
 
 
 
 
42
 
 
43
  try:
44
+ # Use the document service to handle conversion
45
+ content, output_path = _document_service.convert_document(
46
+ file_path=file_path,
47
+ parser_name=parser_name,
48
+ ocr_method_name=ocr_method_name,
49
+ output_format=output_format
50
+ )
51
+
52
+ return content, output_path
53
+
54
+ except ConversionError as e:
55
+ # Handle user-friendly conversion errors
56
+ if "cancelled" in str(e).lower():
57
  return "Conversion cancelled.", None
58
+ return f"Conversion failed: {e}", None
59
+
60
+ except DocumentProcessingError as e:
61
+ # Handle document processing errors
62
+ return f"Document processing error: {e}", None
63
+
64
+ except ConfigurationError as e:
65
+ # Handle configuration errors
66
+ return f"Configuration error: {e}", None
67
+
68
+ except Exception as e:
69
+ # Handle unexpected errors
70
+ logging.error(f"Unexpected error in convert_file: {e}")
71
+ return f"Unexpected error: {e}", None
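For orientation, a minimal call into the refactored `convert_file()` façade above might look like this (the file name is a placeholder; the parser label comes from `MarkItDownParser.get_name()`, and errors are returned as messages rather than raised):

```python
# Illustrative use of the thin convert_file() façade defined above.
from src.core.converter import convert_file

content, download_path = convert_file(
    file_path="sample.pdf",  # placeholder input
    parser_name="MarkItDown (pdf, jpg, png, xlsx --best for xlsx)",
    ocr_method_name="None",
    output_format="Markdown",
)

if download_path is None:
    print(f"Conversion did not produce a file: {content}")
else:
    print(f"Converted output written to {download_path}")
```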
src/core/environment.py ADDED
@@ -0,0 +1,246 @@
1
+ """
2
+ Environment setup and dependency management for the Markit application.
3
+ Extracted from app.py to improve code organization while maintaining HF Spaces compatibility.
4
+ """
5
+ import os
6
+ import sys
7
+ import subprocess
8
+ import logging
9
+ from typing import Dict, Optional, Tuple
10
+ from pathlib import Path
11
+
12
+ from src.core.config import config
13
+ from src.core.logging_config import setup_logging
14
+
15
+
16
+ class EnvironmentManager:
17
+ """Manages environment setup and dependency installation."""
18
+
19
+ def __init__(self):
20
+ self.current_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
21
+ self.logger = logging.getLogger(__name__)
22
+
23
+ def run_setup_script(self) -> bool:
24
+ """Run setup.sh script if it exists."""
25
+ try:
26
+ setup_script = os.path.join(self.current_dir, "setup.sh")
27
+ if os.path.exists(setup_script):
28
+ print("Running setup.sh...")
29
+ subprocess.run(["bash", setup_script], check=False)
30
+ print("setup.sh completed")
31
+ return True
32
+ except Exception as e:
33
+ print(f"Error running setup.sh: {e}")
34
+ return False
35
+
36
+ def check_spaces_module(self) -> bool:
37
+ """Check and install spaces module for ZeroGPU support."""
38
+ try:
39
+ import spaces
40
+ print("Spaces module found for ZeroGPU support")
41
+ return True
42
+ except ImportError:
43
+ print("WARNING: Spaces module not found. Installing...")
44
+ try:
45
+ subprocess.run([sys.executable, "-m", "pip", "install", "-q", "spaces"], check=False)
46
+ return True
47
+ except Exception as e:
48
+ print(f"Error installing spaces module: {e}")
49
+ return False
50
+
51
+ def check_pytorch(self) -> Tuple[bool, Dict[str, str]]:
52
+ """Check PyTorch and CUDA availability."""
53
+ info = {}
54
+ try:
55
+ import torch
56
+ info["pytorch_version"] = torch.__version__
57
+ info["cuda_available"] = str(torch.cuda.is_available())
58
+
59
+ print(f"PyTorch version: {info['pytorch_version']}")
60
+ print(f"CUDA available: {info['cuda_available']}")
61
+
62
+ if torch.cuda.is_available():
63
+ info["cuda_device"] = torch.cuda.get_device_name(0)
64
+ info["cuda_version"] = torch.version.cuda
65
+ print(f"CUDA device: {info['cuda_device']}")
66
+ print(f"CUDA version: {info['cuda_version']}")
67
+ else:
68
+ print("WARNING: CUDA not available. GOT-OCR performs best with GPU acceleration.")
69
+
70
+ return True, info
71
+ except ImportError:
72
+ print("WARNING: PyTorch not installed. Installing PyTorch...")
73
+ try:
74
+ subprocess.run([sys.executable, "-m", "pip", "install", "-q", "torch", "torchvision"], check=False)
75
+ return True, info
76
+ except Exception as e:
77
+ print(f"Error installing PyTorch: {e}")
78
+ return False, info
79
+
80
+ def check_transformers(self) -> bool:
81
+ """Check and install transformers library."""
82
+ try:
83
+ import transformers
84
+ print(f"Transformers version: {transformers.__version__}")
85
+ return True
86
+ except ImportError:
87
+ print("WARNING: Transformers not installed. Installing transformers from GitHub...")
88
+ try:
89
+ subprocess.run([
90
+ sys.executable, "-m", "pip", "install", "-q",
91
+ "git+https://github.com/huggingface/transformers.git@main",
92
+ "accelerate", "verovio"
93
+ ], check=False)
94
+ return True
95
+ except Exception as e:
96
+ print(f"Error installing transformers: {e}")
97
+ return False
98
+
99
+ def check_numpy(self) -> bool:
100
+ """Check and install correct NumPy version."""
101
+ try:
102
+ import numpy as np
103
+ print(f"NumPy version: {np.__version__}")
104
+ if np.__version__ != "1.26.3":
105
+ print("WARNING: NumPy version mismatch. Installing exact version 1.26.3...")
106
+ subprocess.run([sys.executable, "-m", "pip", "install", "-q", "numpy==1.26.3"], check=False)
107
+ return True
108
+ except ImportError:
109
+ print("WARNING: NumPy not installed. Installing NumPy 1.26.3...")
110
+ try:
111
+ subprocess.run([sys.executable, "-m", "pip", "install", "-q", "numpy==1.26.3"], check=False)
112
+ return True
113
+ except Exception as e:
114
+ print(f"Error installing NumPy: {e}")
115
+ return False
116
+
117
+ def check_markitdown(self) -> bool:
118
+ """Check and install MarkItDown library."""
119
+ try:
120
+ from markitdown import MarkItDown
121
+ print("MarkItDown is installed")
122
+ return True
123
+ except ImportError:
124
+ print("WARNING: MarkItDown not installed. Installing...")
125
+ try:
126
+ subprocess.run([sys.executable, "-m", "pip", "install", "-q", "markitdown[all]"], check=False)
127
+ from markitdown import MarkItDown
128
+ print("MarkItDown installed successfully")
129
+ return True
130
+ except ImportError:
131
+ print("ERROR: Failed to install MarkItDown")
132
+ return False
133
+ except Exception as e:
134
+ print(f"Error installing MarkItDown: {e}")
135
+ return False
136
+
137
+ def load_environment_variables(self) -> bool:
138
+ """Load environment variables from .env file."""
139
+ try:
140
+ from dotenv import load_dotenv
141
+ load_dotenv()
142
+ print("Loaded environment variables from .env file")
143
+ return True
144
+ except ImportError:
145
+ print("python-dotenv not installed, skipping .env file loading")
146
+ return False
147
+
148
+ def validate_api_keys(self) -> Dict[str, bool]:
149
+ """Validate and report API key availability."""
150
+ results = {}
151
+
152
+ # Check Gemini API key
153
+ gemini_key = config.api.google_api_key
154
+ if not gemini_key:
155
+ print("Warning: GOOGLE_API_KEY environment variable not found. Gemini Flash parser and LaTeX to Markdown conversion may not work.")
156
+ results["gemini"] = False
157
+ else:
158
+ print(f"Found Gemini API key: {gemini_key[:5]}...{gemini_key[-5:] if len(gemini_key) > 10 else ''}")
159
+ print("Gemini API will be used for LaTeX to Markdown conversion when using GOT-OCR with Formatted Text mode")
160
+ results["gemini"] = True
161
+
162
+ # Check OpenAI API key
163
+ openai_key = config.api.openai_api_key
164
+ if not openai_key:
165
+ print("Warning: OPENAI_API_KEY environment variable not found. LLM-based image description in MarkItDown may not work.")
166
+ results["openai"] = False
167
+ else:
168
+ print(f"Found OpenAI API key: {openai_key[:5]}...{openai_key[-5:] if len(openai_key) > 10 else ''}")
169
+ print("OpenAI API will be available for LLM-based image descriptions in MarkItDown")
170
+ results["openai"] = True
171
+
172
+ # Check Mistral API key
173
+ mistral_key = config.api.mistral_api_key
174
+ if mistral_key:
175
+ print(f"Found Mistral API key: {mistral_key[:5]}...{mistral_key[-5:] if len(mistral_key) > 10 else ''}")
176
+ results["mistral"] = True
177
+ else:
178
+ results["mistral"] = False
179
+
180
+ return results
181
+
182
+ def setup_python_path(self) -> None:
183
+ """Setup Python path for imports."""
184
+ if self.current_dir not in sys.path:
185
+ sys.path.append(self.current_dir)
186
+
187
+ def setup_logging(self) -> None:
188
+ """Setup centralized logging configuration."""
189
+ # Configure logging to suppress httpx and other noisy logs
190
+ logging.getLogger("httpx").setLevel(logging.WARNING)
191
+ logging.getLogger("urllib3").setLevel(logging.WARNING)
192
+ logging.getLogger("httpcore").setLevel(logging.WARNING)
193
+
194
+ # Setup our centralized logging
195
+ setup_logging()
196
+
197
+ def full_environment_setup(self) -> Dict[str, bool]:
198
+ """
199
+ Perform complete environment setup.
200
+
201
+ Returns:
202
+ Dictionary with setup results for each component
203
+ """
204
+ results = {}
205
+
206
+ # Setup logging first
207
+ self.setup_logging()
208
+
209
+ # Run setup script
210
+ results["setup_script"] = self.run_setup_script()
211
+
212
+ # Check and install dependencies
213
+ results["spaces_module"] = self.check_spaces_module()
214
+ results["pytorch"], pytorch_info = self.check_pytorch()
215
+ results["transformers"] = self.check_transformers()
216
+ results["numpy"] = self.check_numpy()
217
+ results["markitdown"] = self.check_markitdown()
218
+
219
+ # Load environment variables
220
+ results["env_vars"] = self.load_environment_variables()
221
+
222
+ # Validate API keys
223
+ api_keys = self.validate_api_keys()
224
+ results["api_keys"] = api_keys
225
+
226
+ # Setup Python path
227
+ self.setup_python_path()
228
+ results["python_path"] = True
229
+
230
+ # Validate configuration
231
+ validation = config.validate()
232
+ results["config_valid"] = validation["valid"]
233
+
234
+ if validation["warnings"]:
235
+ for warning in validation["warnings"]:
236
+ print(f"Configuration warning: {warning}")
237
+
238
+ if validation["errors"]:
239
+ for error in validation["errors"]:
240
+ print(f"Configuration error: {error}")
241
+
242
+ return results
243
+
244
+
245
+ # Global instance
246
+ environment_manager = EnvironmentManager()
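A short sketch of how the manager above is driven; it mirrors the new `app.py`, and the result handling shown here is illustrative:

```python
# Illustrative caller for EnvironmentManager.full_environment_setup().
from src.core.environment import environment_manager

results = environment_manager.full_environment_setup()

# Most entries are booleans; "api_keys" is itself a dict of per-provider flags.
api_keys = results.pop("api_keys", {})
failed = [name for name, ok in results.items() if not ok]

print("API keys detected:", api_keys)  # e.g. {'gemini': True, 'openai': False, 'mistral': False}
if failed:
    print("Setup finished with issues in:", ", ".join(failed))
else:
    print("Environment setup completed successfully")
```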
src/core/exceptions.py ADDED
@@ -0,0 +1,83 @@
1
+ """
2
+ Custom exception classes for the Markit application.
3
+ """
4
+
5
+
6
+ class MarkitError(Exception):
7
+ """Base exception class for all Markit-related errors."""
8
+ pass
9
+
10
+
11
+ class ConfigurationError(MarkitError):
12
+ """Raised when there's a configuration-related error."""
13
+ pass
14
+
15
+
16
+ class ParserError(MarkitError):
17
+ """Base exception for parser-related errors."""
18
+ pass
19
+
20
+
21
+ class ParserNotFoundError(ParserError):
22
+ """Raised when a requested parser is not available."""
23
+ pass
24
+
25
+
26
+ class ParserInitializationError(ParserError):
27
+ """Raised when a parser fails to initialize properly."""
28
+ pass
29
+
30
+
31
+ class DocumentProcessingError(ParserError):
32
+ """Raised when document processing fails."""
33
+ pass
34
+
35
+
36
+ class UnsupportedFileTypeError(ParserError):
37
+ """Raised when trying to process an unsupported file type."""
38
+ pass
39
+
40
+
41
+ class APIError(MarkitError):
42
+ """Base exception for API-related errors."""
43
+ pass
44
+
45
+
46
+ class APIKeyMissingError(APIError):
47
+ """Raised when required API key is missing."""
48
+ pass
49
+
50
+
51
+ class APIRateLimitError(APIError):
52
+ """Raised when API rate limit is exceeded."""
53
+ pass
54
+
55
+
56
+ class APIQuotaExceededError(APIError):
57
+ """Raised when API quota is exceeded."""
58
+ pass
59
+
60
+
61
+ class FileError(MarkitError):
62
+ """Base exception for file-related errors."""
63
+ pass
64
+
65
+
66
+ class FileSizeLimitError(FileError):
67
+ """Raised when file size exceeds the allowed limit."""
68
+ pass
69
+
70
+
71
+ class FileNotFoundError(FileError):
72
+ """Raised when a required file is not found."""
73
+ pass
74
+
75
+
76
+ class ConversionError(MarkitError):
77
+ """Raised when document conversion fails."""
78
+ pass
79
+
80
+
81
+ class ValidationError(MarkitError):
82
+ """Raised when input validation fails."""
83
+ pass
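To show how the hierarchy above is meant to be consumed (the pattern mirrors the refactored `converter.py` earlier in this commit; the helper below is hypothetical):

```python
# Hypothetical call site: narrow handling per exception type, broad catch on the base class.
from src.core.exceptions import (
    DocumentProcessingError,
    MarkitError,
    UnsupportedFileTypeError,
)

def describe_failure(exc: MarkitError) -> str:
    """Turn an internal exception into a user-facing message."""
    if isinstance(exc, UnsupportedFileTypeError):
        return f"Unsupported file type: {exc}"
    if isinstance(exc, DocumentProcessingError):
        return f"Document processing error: {exc}"
    return f"Unexpected error: {exc}"

try:
    raise UnsupportedFileTypeError(".xyz files are not supported")
except MarkitError as exc:  # the base class catches every custom error
    print(describe_failure(exc))
```

Note that `FileNotFoundError` defined above shadows Python's built-in exception of the same name, so call sites should import it explicitly from `src.core.exceptions` when they mean the custom variant.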
src/core/logging_config.py ADDED
@@ -0,0 +1,83 @@
1
+ """
2
+ Centralized logging configuration for the Markit application.
3
+ """
4
+ import logging
5
+ import sys
6
+ from pathlib import Path
7
+ from typing import Optional
8
+
9
+ from src.core.config import config
10
+
11
+
12
+ def setup_logging(
13
+ level: Optional[str] = None,
14
+ log_file: Optional[str] = None,
15
+ format_string: Optional[str] = None
16
+ ) -> None:
17
+ """
18
+ Setup centralized logging configuration.
19
+
20
+ Args:
21
+ level: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
22
+ log_file: Optional file path for logging output
23
+ format_string: Custom format string for log messages
24
+ """
25
+ # Determine logging level
26
+ if level is None:
27
+ level = "DEBUG" if config.app.debug else "INFO"
28
+
29
+ # Default format string
30
+ if format_string is None:
31
+ format_string = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
32
+
33
+ # Configure root logger
34
+ root_logger = logging.getLogger()
35
+ root_logger.setLevel(getattr(logging, level.upper()))
36
+
37
+ # Clear existing handlers
38
+ root_logger.handlers.clear()
39
+
40
+ # Create formatter
41
+ formatter = logging.Formatter(format_string)
42
+
43
+ # Console handler
44
+ console_handler = logging.StreamHandler(sys.stdout)
45
+ console_handler.setLevel(getattr(logging, level.upper()))
46
+ console_handler.setFormatter(formatter)
47
+ root_logger.addHandler(console_handler)
48
+
49
+ # File handler (optional)
50
+ if log_file:
51
+ try:
52
+ log_path = Path(log_file)
53
+ log_path.parent.mkdir(parents=True, exist_ok=True)
54
+
55
+ file_handler = logging.FileHandler(log_file)
56
+ file_handler.setLevel(getattr(logging, level.upper()))
57
+ file_handler.setFormatter(formatter)
58
+ root_logger.addHandler(file_handler)
59
+ except Exception as e:
60
+ logging.warning(f"Could not setup file logging: {e}")
61
+
62
+ # Set specific logger levels to reduce noise
63
+ logging.getLogger("urllib3").setLevel(logging.WARNING)
64
+ logging.getLogger("requests").setLevel(logging.WARNING)
65
+ logging.getLogger("gradio").setLevel(logging.WARNING)
66
+
67
+ if not config.app.debug:
68
+ # Reduce noise from external libraries in non-debug mode
69
+ logging.getLogger("transformers").setLevel(logging.WARNING)
70
+ logging.getLogger("torch").setLevel(logging.WARNING)
71
+
72
+
73
+ def get_logger(name: str) -> logging.Logger:
74
+ """
75
+ Get a logger with the specified name.
76
+
77
+ Args:
78
+ name: Logger name (typically __name__)
79
+
80
+ Returns:
81
+ Logger instance
82
+ """
83
+ return logging.getLogger(name)
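A minimal usage sketch for the logging module above (the log-file path is a placeholder):

```python
# Illustrative: configure logging once at startup, then fetch module loggers.
from src.core.logging_config import setup_logging, get_logger

# Both arguments are optional; without them the level follows config.app.debug.
setup_logging(level="DEBUG", log_file="logs/markit.log")

logger = get_logger(__name__)
logger.info("Logging configured")
logger.debug("Debug output is only emitted when the level allows it")
```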
src/main.py CHANGED
@@ -1,10 +1,10 @@
-import parsers  # Import all parsers to ensure they're registered
+from src import parsers  # Import all parsers to ensure they're registered
 from src.ui.ui import launch_ui
 
 def main():
     # Launch the UI
     launch_ui(
-        server_name="0.0.0.0",
+        server_name="localhost",
         server_port=7860,
         share=False  # Explicitly disable sharing on Hugging Face
     )
src/parsers/got_ocr_parser.py CHANGED
@@ -13,6 +13,7 @@ import tempfile
 import shutil
 from typing import Dict, List, Optional, Any, Union
 import copy
+import pickle
 
 from src.parsers.parser_interface import DocumentParser
 from src.parsers.parser_registry import ParserRegistry
src/parsers/markitdown_parser.py CHANGED
@@ -1,12 +1,13 @@
1
  import logging
2
  import os
3
  from pathlib import Path
4
- from typing import Dict, List, Optional, Any, Union
5
  import io
6
 
7
  # Import the parser interface and registry
8
  from src.parsers.parser_interface import DocumentParser
9
  from src.parsers.parser_registry import ParserRegistry
 
10
 
11
  # Check for MarkItDown availability
12
  try:
@@ -27,6 +28,7 @@ class MarkItDownParser(DocumentParser):
27
  """
28
 
29
  def __init__(self):
 
30
  self.markdown_instance = None
31
  # Initialize MarkItDown instance
32
  if HAS_MARKITDOWN:
@@ -60,34 +62,44 @@ class MarkItDownParser(DocumentParser):
60
  Returns:
61
  str: Markdown representation of the document
62
  """
 
 
 
63
  # Check if MarkItDown is available
64
  if not HAS_MARKITDOWN or self.markdown_instance is None:
65
- return "Error: MarkItDown is not available. Please install with 'pip install markitdown[all]'"
66
-
67
- # Get cancellation check function from kwargs
68
- check_cancellation = kwargs.get('check_cancellation', lambda: False)
69
 
70
  # Check for cancellation before starting
71
- if check_cancellation():
72
- return "Conversion cancelled."
73
 
74
  try:
75
  # Convert the file using the standard instance
76
- result = self.markdown_instance.convert(file_path)
77
 
78
  # Check for cancellation after processing
79
- if check_cancellation():
80
- return "Conversion cancelled."
81
 
82
  return result.text_content
83
  except Exception as e:
84
  logger.error(f"Error converting file with MarkItDown: {str(e)}")
85
- return f"Error: {str(e)}"
86
 
87
  @classmethod
88
  def get_name(cls) -> str:
89
  return "MarkItDown (pdf, jpg, png, xlsx --best for xlsx)"
90
 
 
 
 
 
 
 
 
 
 
 
91
  @classmethod
92
  def get_supported_ocr_methods(cls) -> List[Dict[str, Any]]:
93
  return [
 
1
  import logging
2
  import os
3
  from pathlib import Path
4
+ from typing import Dict, List, Optional, Any, Union, Set
5
  import io
6
 
7
  # Import the parser interface and registry
8
  from src.parsers.parser_interface import DocumentParser
9
  from src.parsers.parser_registry import ParserRegistry
10
+ from src.core.exceptions import DocumentProcessingError, ParserError
11
 
12
  # Check for MarkItDown availability
13
  try:
 
28
  """
29
 
30
  def __init__(self):
31
+ super().__init__() # Initialize the base class (including _cancellation_flag)
32
  self.markdown_instance = None
33
  # Initialize MarkItDown instance
34
  if HAS_MARKITDOWN:
 
62
  Returns:
63
  str: Markdown representation of the document
64
  """
65
+ # Validate file first
66
+ self.validate_file(file_path)
67
+
68
  # Check if MarkItDown is available
69
  if not HAS_MARKITDOWN or self.markdown_instance is None:
70
+ raise ParserError("MarkItDown is not available. Please install with 'pip install markitdown[all]'")
71
 
72
  # Check for cancellation before starting
73
+ if self._check_cancellation():
74
+ raise DocumentProcessingError("Conversion cancelled")
75
 
76
  try:
77
  # Convert the file using the standard instance
78
+ result = self.markdown_instance.convert(str(file_path))
79
 
80
  # Check for cancellation after processing
81
+ if self._check_cancellation():
82
+ raise DocumentProcessingError("Conversion cancelled")
83
 
84
  return result.text_content
85
  except Exception as e:
86
  logger.error(f"Error converting file with MarkItDown: {str(e)}")
87
+ raise DocumentProcessingError(f"MarkItDown conversion failed: {str(e)}")
88
 
89
  @classmethod
90
  def get_name(cls) -> str:
91
  return "MarkItDown (pdf, jpg, png, xlsx --best for xlsx)"
92
 
93
+ @classmethod
94
+ def get_supported_file_types(cls) -> Set[str]:
95
+ """Return a set of supported file extensions."""
96
+ return {".pdf", ".docx", ".xlsx", ".pptx", ".html", ".txt", ".md", ".json", ".xml", ".csv", ".jpg", ".jpeg", ".png"}
97
+
98
+ @classmethod
99
+ def is_available(cls) -> bool:
100
+ """Check if this parser is available."""
101
+ return HAS_MARKITDOWN
102
+
103
  @classmethod
104
  def get_supported_ocr_methods(cls) -> List[Dict[str, Any]]:
105
  return [
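Since the parser now raises typed exceptions instead of returning error strings, calling code is expected to wrap `parse()` in a try/except. A rough sketch (the file name is a placeholder; the exception hierarchy is assumed from the imports above):

```
from src.core.exceptions import DocumentProcessingError, ParserError, UnsupportedFileTypeError
from src.parsers.markitdown_parser import MarkItDownParser

parser = MarkItDownParser()
try:
    markdown = parser.parse("report.xlsx")
except (ParserError, UnsupportedFileTypeError) as exc:
    # MarkItDown not installed, file missing, or extension not supported.
    print(f"Cannot parse this file: {exc}")
except DocumentProcessingError as exc:
    # Conversion failed or was cancelled mid-run.
    print(f"Conversion failed: {exc}")
else:
    print(markdown[:200])
```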
src/parsers/parser_interface.py CHANGED
@@ -1,11 +1,26 @@
1
  from abc import ABC, abstractmethod
2
  from pathlib import Path
3
- from typing import Dict, List, Optional, Any, Union
4
 
5
 
6
  class DocumentParser(ABC):
7
  """Base interface for all document parsers in the system."""
8
 
9
  @abstractmethod
10
  def parse(self, file_path: Union[str, Path], ocr_method: Optional[str] = None, **kwargs) -> str:
11
  """
@@ -18,6 +33,10 @@ class DocumentParser(ABC):
18
 
19
  Returns:
20
  str: The parsed content
21
  """
22
  pass
23
 
@@ -44,4 +63,44 @@ class DocumentParser(ABC):
44
  @classmethod
45
  def get_description(cls) -> str:
46
  """Return a description of this parser"""
47
- return f"{cls.get_name()} document parser"
1
  from abc import ABC, abstractmethod
2
  from pathlib import Path
3
+ from typing import Dict, List, Optional, Any, Union, Set
4
+ import threading
5
+
6
+ from src.core.exceptions import ParserError, UnsupportedFileTypeError
7
 
8
 
9
  class DocumentParser(ABC):
10
  """Base interface for all document parsers in the system."""
11
 
12
+ def __init__(self):
13
+ """Initialize the parser."""
14
+ self._cancellation_flag: Optional[threading.Event] = None
15
+
16
+ def set_cancellation_flag(self, flag: Optional[threading.Event]) -> None:
17
+ """Set the cancellation flag for this parser."""
18
+ self._cancellation_flag = flag
19
+
20
+ def _check_cancellation(self) -> bool:
21
+ """Check if cancellation has been requested."""
22
+ return self._cancellation_flag is not None and self._cancellation_flag.is_set()
23
+
24
  @abstractmethod
25
  def parse(self, file_path: Union[str, Path], ocr_method: Optional[str] = None, **kwargs) -> str:
26
  """
 
33
 
34
  Returns:
35
  str: The parsed content
36
+
37
+ Raises:
38
+ ParserError: For general parsing errors
39
+ UnsupportedFileTypeError: For unsupported file types
40
  """
41
  pass
42
 
 
63
  @classmethod
64
  def get_description(cls) -> str:
65
  """Return a description of this parser"""
66
+ return f"{cls.get_name()} document parser"
67
+
68
+ @classmethod
69
+ def get_supported_file_types(cls) -> Set[str]:
70
+ """Return a set of supported file extensions (including the dot)."""
71
+ return {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp"}
72
+
73
+ @classmethod
74
+ def is_available(cls) -> bool:
75
+ """Check if this parser is available with current configuration."""
76
+ return True
77
+
78
+ def validate_file(self, file_path: Union[str, Path]) -> None:
79
+ """
80
+ Validate that the file can be processed by this parser.
81
+
82
+ Args:
83
+ file_path: Path to the file to validate
84
+
85
+ Raises:
86
+ UnsupportedFileTypeError: If file type is not supported
87
+ ParserError: For other validation errors
88
+ """
89
+ path = Path(file_path)
90
+ if not path.exists():
91
+ raise ParserError(f"File not found: {file_path}")
92
+
93
+ if path.suffix.lower() not in self.get_supported_file_types():
94
+ raise UnsupportedFileTypeError(
95
+ f"File type '{path.suffix}' not supported by {self.get_name()}"
96
+ )
97
+
98
+ def get_metadata(self) -> Dict[str, Any]:
99
+ """Return metadata about this parser instance."""
100
+ return {
101
+ "name": self.get_name(),
102
+ "description": self.get_description(),
103
+ "supported_file_types": list(self.get_supported_file_types()),
104
+ "supported_ocr_methods": self.get_supported_ocr_methods(),
105
+ "available": self.is_available()
106
+ }
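As a rough illustration of what the expanded interface asks of a concrete implementation, a hypothetical minimal subclass (not part of the codebase; the real interface may declare further abstract methods beyond those visible in this hunk):

```
from pathlib import Path
from typing import Any, Dict, List, Optional, Set, Union

from src.core.exceptions import DocumentProcessingError
from src.parsers.parser_interface import DocumentParser


class PlainTextParser(DocumentParser):
    """Toy parser that reads UTF-8 text files as-is."""

    def parse(self, file_path: Union[str, Path], ocr_method: Optional[str] = None, **kwargs) -> str:
        self.validate_file(file_path)      # UnsupportedFileTypeError / ParserError on bad input
        if self._check_cancellation():     # honours the threading.Event set by the caller
            raise DocumentProcessingError("Conversion cancelled")
        return Path(file_path).read_text(encoding="utf-8")

    @classmethod
    def get_name(cls) -> str:
        return "Plain text"

    @classmethod
    def get_supported_file_types(cls) -> Set[str]:
        return {".txt", ".md"}

    @classmethod
    def get_supported_ocr_methods(cls) -> List[Dict[str, Any]]:
        return []
```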
src/services/document_service.py ADDED
@@ -0,0 +1,243 @@
1
+ """
2
+ Document processing service layer.
3
+ """
4
+ import tempfile
5
+ import logging
6
+ import time
7
+ import os
8
+ import threading
9
+ from pathlib import Path
10
+ from typing import Optional, Tuple, Any
11
+
12
+ from src.core.config import config
13
+ from src.core.exceptions import (
14
+ DocumentProcessingError,
15
+ FileSizeLimitError,
16
+ UnsupportedFileTypeError,
17
+ ConversionError
18
+ )
19
+ from src.core.parser_factory import ParserFactory
20
+ from src.core.latex_to_markdown_converter import convert_latex_to_markdown
21
+
22
+
23
+ class DocumentService:
24
+ """Service for handling document processing operations."""
25
+
26
+ def __init__(self):
27
+ self._conversion_in_progress = False
28
+ self._cancellation_flag: Optional[threading.Event] = None
29
+
30
+ def set_cancellation_flag(self, flag: threading.Event) -> None:
31
+ """Set the cancellation flag for this service."""
32
+ self._cancellation_flag = flag
33
+
34
+ def is_conversion_in_progress(self) -> bool:
35
+ """Check if conversion is currently in progress."""
36
+ return self._conversion_in_progress
37
+
38
+ def _check_cancellation(self) -> bool:
39
+ """Check if cancellation has been requested."""
40
+ if self._cancellation_flag and self._cancellation_flag.is_set():
41
+ logging.info("Cancellation detected in document service")
42
+ return True
43
+ return False
44
+
45
+ def _safe_delete_file(self, file_path: Optional[str]) -> None:
46
+ """Safely delete a file with error handling."""
47
+ if file_path and os.path.exists(file_path):
48
+ try:
49
+ os.unlink(file_path)
50
+ except Exception as e:
51
+ logging.error(f"Error cleaning up temp file {file_path}: {e}")
52
+
53
+ def _validate_file(self, file_path: str) -> None:
54
+ """Validate file size and type."""
55
+ if not os.path.exists(file_path):
56
+ raise DocumentProcessingError(f"File not found: {file_path}")
57
+
58
+ # Check file size
59
+ file_size = os.path.getsize(file_path)
60
+ if file_size > config.app.max_file_size:
61
+ raise FileSizeLimitError(
62
+ f"File size ({file_size} bytes) exceeds maximum allowed size "
63
+ f"({config.app.max_file_size} bytes)"
64
+ )
65
+
66
+ # Check file extension
67
+ file_ext = Path(file_path).suffix.lower()
68
+ if file_ext not in config.app.allowed_extensions:
69
+ raise UnsupportedFileTypeError(
70
+ f"File type '{file_ext}' is not supported. "
71
+ f"Allowed types: {', '.join(config.app.allowed_extensions)}"
72
+ )
73
+
74
+ def _create_temp_file(self, original_path: str) -> str:
75
+ """Create a temporary file with English filename."""
76
+ original_ext = Path(original_path).suffix
77
+
78
+ with tempfile.NamedTemporaryFile(suffix=original_ext, delete=False) as temp_file:
79
+ temp_path = temp_file.name
80
+
81
+ # Copy content in chunks with cancellation checks
82
+ with open(original_path, 'rb') as original:
83
+ chunk_size = 1024 * 1024 # 1MB chunks
84
+ while True:
85
+ if self._check_cancellation():
86
+ self._safe_delete_file(temp_path)
87
+ raise ConversionError("Conversion cancelled during file copy")
88
+
89
+ chunk = original.read(chunk_size)
90
+ if not chunk:
91
+ break
92
+ temp_file.write(chunk)
93
+
94
+ return temp_path
95
+
96
+ def _process_latex_content(self, content: str, parser_name: str, ocr_method_name: str) -> str:
97
+ """Process LaTeX content for GOT-OCR formatted text."""
98
+ if (parser_name == "GOT-OCR (jpg,png only)" and
99
+ ocr_method_name == "Formatted Text" and
100
+ config.api.google_api_key):
101
+
102
+ logging.info("Converting LaTeX output to Markdown using Gemini API")
103
+ start_convert = time.time()
104
+
105
+ if self._check_cancellation():
106
+ raise ConversionError("Conversion cancelled before LaTeX conversion")
107
+
108
+ try:
109
+ markdown_content = convert_latex_to_markdown(content)
110
+ if markdown_content:
111
+ logging.info(f"LaTeX conversion completed in {time.time() - start_convert:.2f} seconds")
112
+ return markdown_content
113
+ else:
114
+ logging.warning("LaTeX to Markdown conversion failed, using raw LaTeX output")
115
+ except Exception as e:
116
+ logging.error(f"Error converting LaTeX to Markdown: {str(e)}")
117
+ # Continue with original content on error
118
+
119
+ return content
120
+
121
+ def _create_output_file(self, content: str, output_format: str) -> str:
122
+ """Create output file with proper extension."""
123
+ # Determine file extension
124
+ format_extensions = {
125
+ "markdown": ".md",
126
+ "json": ".json",
127
+ "text": ".txt",
128
+ "document tags": ".doctags"
129
+ }
130
+ ext = format_extensions.get(output_format.lower(), ".txt")
131
+
132
+ if self._check_cancellation():
133
+ raise ConversionError("Conversion cancelled before output file creation")
134
+
135
+ # Create temporary output file
136
+ with tempfile.NamedTemporaryFile(mode="w", suffix=ext, delete=False, encoding="utf-8") as tmp:
137
+ tmp_path = tmp.name
138
+
139
+ # Write in chunks with cancellation checks
140
+ chunk_size = 10000 # characters
141
+ for i in range(0, len(content), chunk_size):
142
+ if self._check_cancellation():
143
+ self._safe_delete_file(tmp_path)
144
+ raise ConversionError("Conversion cancelled during output file writing")
145
+
146
+ tmp.write(content[i:i+chunk_size])
147
+
148
+ return tmp_path
149
+
150
+ def convert_document(
151
+ self,
152
+ file_path: str,
153
+ parser_name: str,
154
+ ocr_method_name: str,
155
+ output_format: str
156
+ ) -> Tuple[str, Optional[str]]:
157
+ """
158
+ Convert a document using the specified parser and OCR method.
159
+
160
+ Args:
161
+ file_path: Path to the input file
162
+ parser_name: Name of the parser to use
163
+ ocr_method_name: Name of the OCR method to use
164
+ output_format: Output format (Markdown, JSON, Text, Document Tags)
165
+
166
+ Returns:
167
+ Tuple of (content, output_file_path)
168
+
169
+ Raises:
170
+ DocumentProcessingError: For general processing errors
171
+ FileSizeLimitError: When file is too large
172
+ UnsupportedFileTypeError: For unsupported file types
173
+ ConversionError: When conversion fails or is cancelled
174
+ """
175
+ if not file_path:
176
+ raise DocumentProcessingError("No file provided")
177
+
178
+ self._conversion_in_progress = True
179
+ temp_input = None
180
+ output_path = None
181
+
182
+ try:
183
+ # Validate input file
184
+ self._validate_file(file_path)
185
+
186
+ if self._check_cancellation():
187
+ raise ConversionError("Conversion cancelled")
188
+
189
+ # Create temporary file with English name
190
+ temp_input = self._create_temp_file(file_path)
191
+
192
+ if self._check_cancellation():
193
+ raise ConversionError("Conversion cancelled")
194
+
195
+ # Process document using parser factory
196
+ start_time = time.time()
197
+ content = ParserFactory.parse_document(
198
+ file_path=temp_input,
199
+ parser_name=parser_name,
200
+ ocr_method_name=ocr_method_name,
201
+ output_format=output_format.lower(),
202
+ cancellation_flag=self._cancellation_flag
203
+ )
204
+
205
+ if content == "Conversion cancelled.":
206
+ raise ConversionError("Conversion cancelled by parser")
207
+
208
+ duration = time.time() - start_time
209
+ logging.info(f"Document processed in {duration:.2f} seconds")
210
+
211
+ if self._check_cancellation():
212
+ raise ConversionError("Conversion cancelled")
213
+
214
+ # Process LaTeX content if needed
215
+ content = self._process_latex_content(content, parser_name, ocr_method_name)
216
+
217
+ if self._check_cancellation():
218
+ raise ConversionError("Conversion cancelled")
219
+
220
+ # Create output file
221
+ output_path = self._create_output_file(content, output_format)
222
+
223
+ return content, output_path
224
+
225
+ except (DocumentProcessingError, FileSizeLimitError, UnsupportedFileTypeError, ConversionError):
226
+ # Re-raise our custom exceptions
227
+ self._safe_delete_file(temp_input)
228
+ self._safe_delete_file(output_path)
229
+ raise
230
+ except Exception as e:
231
+ # Wrap unexpected exceptions
232
+ self._safe_delete_file(temp_input)
233
+ self._safe_delete_file(output_path)
234
+ raise DocumentProcessingError(f"Unexpected error during conversion: {str(e)}")
235
+ finally:
236
+ # Clean up temp input file
237
+ self._safe_delete_file(temp_input)
238
+
239
+ # Clean up output file if cancelled
240
+ if self._check_cancellation() and output_path:
241
+ self._safe_delete_file(output_path)
242
+
243
+ self._conversion_in_progress = False
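Tying it together, a sketch of how a caller might drive the new service with a shared cancellation flag (the file path and OCR-method string are placeholders; the parser name matches the MarkItDown parser's `get_name()` above):

```
import threading

from src.core.exceptions import ConversionError, DocumentProcessingError
from src.services.document_service import DocumentService

cancel_flag = threading.Event()   # set() from another thread to abort

service = DocumentService()
service.set_cancellation_flag(cancel_flag)

try:
    content, output_path = service.convert_document(
        file_path="sample.pdf",
        parser_name="MarkItDown (pdf, jpg, png, xlsx --best for xlsx)",
        ocr_method_name="None",
        output_format="Markdown",
    )
    print(f"Wrote {len(content)} characters to {output_path}")
except (ConversionError, DocumentProcessingError) as exc:
    # FileSizeLimitError / UnsupportedFileTypeError may also propagate for bad inputs.
    print(f"Conversion did not complete: {exc}")
```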
src/ui/ui.py CHANGED
@@ -6,19 +6,26 @@ import logging
6
  from pathlib import Path
7
  from src.core.converter import convert_file, set_cancellation_flag, is_conversion_in_progress
8
  from src.parsers.parser_registry import ParserRegistry
9
 
10
  # Import MarkItDown to check if it's available
11
  try:
12
  from markitdown import MarkItDown
13
  HAS_MARKITDOWN = True
14
- logging.info("MarkItDown is available for use")
15
  except ImportError:
16
  HAS_MARKITDOWN = False
17
- logging.warning("MarkItDown is not available")
18
-
19
- # Configure logging
20
- logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
21
- logger = logging.getLogger(__name__)
22
 
23
  # Add a global variable to track cancellation state
24
  conversion_cancelled = threading.Event()
@@ -40,12 +47,33 @@ def validate_file_for_parser(file_path, parser_name):
40
  """Validate if the file type is supported by the selected parser."""
41
  if not file_path:
42
  return True, "" # No file selected yet
43
 
44
- if "GOT-OCR" in parser_name:
45
- file_ext = Path(file_path).suffix.lower()
46
- if file_ext not in ['.jpg', '.jpeg', '.png']:
47
- return False, "GOT-OCR only supports JPG and PNG formats."
48
- return True, ""
49
 
50
  def format_markdown_content(content):
51
  if not content:
 
6
  from pathlib import Path
7
  from src.core.converter import convert_file, set_cancellation_flag, is_conversion_in_progress
8
  from src.parsers.parser_registry import ParserRegistry
9
+ from src.core.config import config
10
+ from src.core.exceptions import (
11
+ DocumentProcessingError,
12
+ UnsupportedFileTypeError,
13
+ FileSizeLimitError,
14
+ ConfigurationError
15
+ )
16
+ from src.core.logging_config import get_logger
17
+
18
+ # Use centralized logging
19
+ logger = get_logger(__name__)
20
 
21
  # Import MarkItDown to check if it's available
22
  try:
23
  from markitdown import MarkItDown
24
  HAS_MARKITDOWN = True
25
+ logger.info("MarkItDown is available for use")
26
  except ImportError:
27
  HAS_MARKITDOWN = False
28
+ logger.warning("MarkItDown is not available")
29
 
30
  # Add a global variable to track cancellation state
31
  conversion_cancelled = threading.Event()
 
47
  """Validate if the file type is supported by the selected parser."""
48
  if not file_path:
49
  return True, "" # No file selected yet
50
+
51
+ try:
52
+ file_path_obj = Path(file_path)
53
+ file_ext = file_path_obj.suffix.lower()
54
+
55
+ # Check file size
56
+ if file_path_obj.exists():
57
+ file_size = file_path_obj.stat().st_size
58
+ if file_size > config.app.max_file_size:
59
+ size_mb = file_size / (1024 * 1024)
60
+ max_mb = config.app.max_file_size / (1024 * 1024)
61
+ return False, f"File size ({size_mb:.1f}MB) exceeds maximum allowed size ({max_mb:.1f}MB)"
62
+
63
+ # Check file extension
64
+ if file_ext not in config.app.allowed_extensions:
65
+ return False, f"File type '{file_ext}' is not supported. Allowed types: {', '.join(config.app.allowed_extensions)}"
66
+
67
+ # Parser-specific validation
68
+ if "GOT-OCR" in parser_name:
69
+ if file_ext not in ['.jpg', '.jpeg', '.png']:
70
+ return False, "GOT-OCR only supports JPG and PNG formats."
71
+
72
+ return True, ""
73
 
74
+ except Exception as e:
75
+ logger.error(f"Error validating file: {e}")
76
+ return False, f"Error validating file: {e}"
77
 
78
  def format_markdown_content(content):
79
  if not content: