# Final Domain Collection Research

## Summary of Findings

### Available Methods in jao-py

The `JaoPublicationToolPandasClient` class provides three domain query methods:

1. **`query_final_domain(mtu, presolved, cne, co, use_mirror)`** (Line 233)
   - Final Computation: final FB parameters following LTN
   - Published: 10:30 D-1
   - Most complete dataset (recommended for Phase 2)

2. **`query_prefinal_domain(mtu, presolved, cne, co, use_mirror)`** (Line 248)
   - Pre-Final (EarlyPub): pre-final FB parameters before LTN
   - Published: 08:00 D-1
   - Earlier publication time, but before LTN application

3. **`query_initial_domain(mtu, presolved, cne, co)`** (Line 264)
   - Initial Computation (Virgin Domain): initial flow-based parameters
   - Published: early in D-1
   - Before any adjustments

### Method Parameters

```python
def query_final_domain(
    mtu: pd.Timestamp,        # Market Time Unit (1 hour, timezone-aware)
    presolved: bool = None,   # Filter: True=binding, False=non-binding, None=ALL
    cne: str = None,          # CNEC name keyword filter (NOT EIC-based!)
    co: str = None,           # Contingency keyword filter
    use_mirror: bool = False  # Use mirror.flowbased.eu for faster bulk download
) -> pd.DataFrame
```

### Key Findings

1. **DENSE data acquisition**:
   - Set `presolved=None` to get ALL CNECs (binding + non-binding)
   - This provides the DENSE format needed for Phase 2 feature engineering

2. **Filtering limitations**:
   - ❌ No EIC-based filtering on the server side
   - ✅ Only keyword-based filters (`cne`, `co`) are available
   - **Solution**: download all CNECs, then filter locally by EIC codes

3. **Query granularity**:
   - The method queries **1 hour at a time** (mtu = Market Time Unit)
   - For 24 months this means 17,520 API calls (1 per hour)
   - Alternative: use `use_mirror=True` for whole-day downloads

4. **Mirror option** (recommended for bulk collection):
   - URL: `https://mirror.flowbased.eu/dacc/final_domain/YYYY-MM-DD`
   - Returns a full day (24 hours) as a CSV inside a ZIP file (see the direct-fetch sketch after Option A below)
   - Much faster than hourly API calls
   - Enable with `use_mirror=True` or the env var `JAO_USE_MIRROR=1`

5. **Data structure** (from `parse_final_domain()`):
   - Returns a pandas DataFrame with columns:
     - **Identifiers**: `mtu` (timestamp), `tso`, `cnec_name`, `cnec_eic`, `direction`
     - **Contingency**: `contingency_*` fields (nested structure flattened)
     - **Presolved**: flag indicating whether the CNEC is binding (`True`) or redundant (`False`)
     - **RAM breakdown**: `ram`, `fmax`, `imax`, `frm`, `fuaf`, `amr`, `lta_margin`, etc.
     - **PTDFs**: `ptdf_AT`, `ptdf_BE`, ..., `ptdf_SK` (12 Core zones)
   - Timestamps are converted to the Europe/Amsterdam timezone
   - Column names are snake_case (except PTDFs)

### Recommended Implementation for Phase 2

**Option A: Mirror-based (FASTEST)**:

```python
from pathlib import Path

import pandas as pd
import polars as pl


def collect_final_domain_sample(
    start_date: str,
    end_date: str,
    target_cnec_eics: list[str],  # 200 EIC codes from Phase 1
    output_path: Path,
) -> pl.DataFrame:
    """Collect DENSE CNEC data for specific CNECs using the mirror."""
    client = JAOClient()  # project wrapper around JaoPublicationToolPandasClient
    all_data = []
    for date in pd.date_range(start_date, end_date):
        # Query the full day (all CNECs) via the mirror
        df_day = client.query_final_domain(
            mtu=pd.Timestamp(date, tz='Europe/Amsterdam'),
            presolved=None,   # ALL CNECs (DENSE!)
            use_mirror=True,  # Fast bulk download
        )
        # Filter to target CNECs only
        df_filtered = df_day[df_day['cnec_eic'].isin(target_cnec_eics)]
        all_data.append(df_filtered)
    # Combine and save
    df_full = pd.concat(all_data)
    pl_df = pl.from_pandas(df_full)
    pl_df.write_parquet(output_path)
    return pl_df
```
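For orientation, a hypothetical invocation of the Option A collector on the 1-week test window from the Implementation Strategy below. The CSV path matches step 2 of that strategy; the `cnec_eic` column name and the output path are assumptions, not verified names:

```python
from pathlib import Path

import polars as pl

# Assumption: Phase 1 saved the EIC codes under a `cnec_eic` column
# (file path per step 2 of the Implementation Strategy below)
eic_codes = pl.read_csv("data/processed/critical_cnecs_eic_codes.csv")["cnec_eic"].to_list()

# 1-week test window from step 3 of the Implementation Strategy
df = collect_final_domain_sample(
    start_date="2025-09-23",
    end_date="2025-09-30",
    target_cnec_eics=eic_codes,
    output_path=Path("data/processed/final_domain_sample.parquet"),  # hypothetical path
)
print(df.shape)  # roughly 38,400 rows expected for 200 CNECs x 192 hours
```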
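Separately from jao-py, the mirror endpoint from Key Findings #4 can also be fetched directly, which is handy for debugging the bulk path. A minimal sketch, assuming the endpoint returns a ZIP archive holding a single CSV for the requested day (the archive layout and column names are not verified here):

```python
import io
import zipfile

import pandas as pd
import requests


def fetch_mirror_day(day: str) -> pd.DataFrame:
    """Fetch one day of final-domain records straight from the mirror."""
    url = f"https://mirror.flowbased.eu/dacc/final_domain/{day}"  # day formatted as YYYY-MM-DD
    resp = requests.get(url, timeout=120)
    resp.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
        # Assumption: a single CSV member holds all 24 hours of CNEC records
        with zf.open(zf.namelist()[0]) as member:
            return pd.read_csv(member)
```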
**Option B: Hourly API calls (SLOWER, but more granular)**:

```python
from pathlib import Path

import pandas as pd
import polars as pl


def collect_final_domain_hourly(
    start_date: str,
    end_date: str,
    target_cnec_eics: list[str],
    output_path: Path,
) -> pl.DataFrame:
    """Collect DENSE CNEC data hour by hour."""
    client = JAOClient()
    all_data = []
    for date in pd.date_range(start_date, end_date, freq='h'):
        try:
            df_hour = client.query_final_domain(
                mtu=pd.Timestamp(date, tz='Europe/Amsterdam'),
                presolved=None,  # ALL CNECs
            )
            df_filtered = df_hour[df_hour['cnec_eic'].isin(target_cnec_eics)]
            all_data.append(df_filtered)
        except NoMatchingDataError:  # raised by the client library for empty hours
            continue
    df_full = pd.concat(all_data)
    pl_df = pl.from_pandas(df_full)
    pl_df.write_parquet(output_path)
    return pl_df
```

### Data Volume Estimates

**Full download (all ~20K CNECs)**:
- 20,000 CNECs × 17,520 hours = ~350M records
- ~27 columns × 8 bytes/value ≈ 75 GB uncompressed
- With Parquet compression: ~10-20 GB

**Filtered (200 target CNECs)**:
- 200 CNECs × 17,520 hours = ~3.5M records
- ~27 columns × 8 bytes/value ≈ 750 MB uncompressed
- With Parquet compression: ~100-150 MB

### Implementation Strategy

1. **Phase 1 complete**: identify the top 200 CNECs from SPARSE data
2. **Extract EIC codes**: save to `data/processed/critical_cnecs_eic_codes.csv`
3. **Test on 1 week**: validate DENSE collection via the mirror

   ```python
   # Test: 2025-09-23 to 2025-09-30 (8 days)
   # Expected: 200 CNECs × 192 hours = 38,400 records
   ```

4. **Collect 24 months**: use the mirror for speed
5. **Validate the DENSE structure**:

   ```python
   # df is the polars DataFrame returned by the collector
   unique_cnecs = df['cnec_eic'].n_unique()
   unique_hours = df['mtu'].n_unique()
   expected = unique_cnecs * unique_hours
   actual = len(df)
   assert actual == expected, f"Not DENSE! {actual} != {expected}"
   ```

### Advantages of the Mirror Method

- ✅ Faster: 1 request/day vs 24 requests/day
- ✅ Rate-limit friendly: 730 requests vs 17,520 requests
- ✅ More reliable: fewer timeouts and connection errors
- ✅ Complete days: guarantees all 24 hours are present

### Next Steps

1. Add a `collect_final_domain_dense()` method to `collect_jao.py`
2. Test on a 1-week sample with the target EIC codes
3. Validate the DENSE structure and data quality
4. Run the 24-month collection once Phase 1 is complete
5. Use the DENSE data for Tier 1 & Tier 2 feature engineering

---

**Research completed**: 2025-11-05
**jao-py version**: 0.6.2
**Source**: `C:\Users\evgue\projects\fbmc_chronos2\.venv\Lib\site-packages\jao\jao.py`