# Final Domain Collection Research

## Summary of Findings

### Available Methods in jao-py

The `JaoPublicationToolPandasClient` class provides three domain query methods:

1. **`query_final_domain(mtu, presolved, cne, co, use_mirror)`** (Line 233)
   - Final Computation: final FB parameters following LTN
   - Published: 10:30 D-1
   - Most complete dataset (recommended for Phase 2)

2. **`query_prefinal_domain(mtu, presolved, cne, co, use_mirror)`** (Line 248)
   - Pre-Final (EarlyPub): pre-final FB parameters before LTN
   - Published: 08:00 D-1
   - Earlier publication time, but before LTN application

3. **`query_initial_domain(mtu, presolved, cne, co)`** (Line 264)
   - Initial Computation (Virgin Domain): initial flow-based parameters
   - Published: early in D-1
   - Before any adjustments

### Method Parameters

```python
def query_final_domain(
    mtu: pd.Timestamp,        # Market Time Unit (1 hour, timezone-aware)
    presolved: bool = None,   # Filter: True=binding, False=non-binding, None=ALL
    cne: str = None,          # CNEC name keyword filter (NOT EIC-based!)
    co: str = None,           # Contingency keyword filter
    use_mirror: bool = False  # Use mirror.flowbased.eu for faster bulk download
) -> pd.DataFrame
```

### Key Findings

1. **DENSE data acquisition**:
   - Set `presolved=None` to get ALL CNECs (binding + non-binding)
   - This provides the DENSE format needed for Phase 2 feature engineering

2. **Filtering limitations**:
   - ❌ No EIC-based filtering on the server side
   - ✅ Only keyword-based filters (`cne`, `co`) are available
   - **Solution**: download all CNECs, then filter locally by EIC codes

3. **Query granularity**:
   - The method queries **1 hour at a time** (mtu = Market Time Unit)
   - For 24 months this means 17,520 API calls (1 per hour)
   - Alternative: use `use_mirror=True` for whole-day downloads

4. **Mirror option** (recommended for bulk collection):
   - URL: `https://mirror.flowbased.eu/dacc/final_domain/YYYY-MM-DD`
   - Returns a full day (24 hours) as a CSV inside a ZIP file (see the direct-fetch sketch after Option A below)
   - Much faster than hourly API calls
   - Enable with `use_mirror=True` or the env var `JAO_USE_MIRROR=1`

5. **Data structure** (from `parse_final_domain()`):
   - Returns a pandas DataFrame with columns:
     - **Identifiers**: `mtu` (timestamp), `tso`, `cnec_name`, `cnec_eic`, `direction`
     - **Contingency**: `contingency_*` fields (nested structure flattened)
     - **Presolved**: flag indicating whether the CNEC is binding (`True`) or redundant (`False`)
     - **RAM breakdown**: `ram`, `fmax`, `imax`, `frm`, `fuaf`, `amr`, `lta_margin`, etc.
     - **PTDFs**: `ptdf_AT`, `ptdf_BE`, ..., `ptdf_SK` (12 Core zones)
   - Timestamps are converted to the Europe/Amsterdam timezone
   - Column names are snake_case (except PTDFs)

### Recommended Implementation for Phase 2

**Option A: Mirror-based (FASTEST)**:

```python
from pathlib import Path

import pandas as pd
import polars as pl


def collect_final_domain_sample(
    start_date: str,
    end_date: str,
    target_cnec_eics: list[str],  # 200 EIC codes from Phase 1
    output_path: Path,
) -> pl.DataFrame:
    """Collect DENSE CNEC data for specific CNECs using the mirror."""
    client = JAOClient()  # project wrapper around JaoPublicationToolPandasClient
    all_data = []
    for date in pd.date_range(start_date, end_date):
        # Query the full day (all CNECs) via the mirror
        df_day = client.query_final_domain(
            mtu=pd.Timestamp(date, tz='Europe/Amsterdam'),
            presolved=None,   # ALL CNECs (DENSE!)
            use_mirror=True,  # Fast bulk download
        )
        # Filter to target CNECs only
        df_filtered = df_day[df_day['cnec_eic'].isin(target_cnec_eics)]
        all_data.append(df_filtered)
    # Combine and save
    df_full = pd.concat(all_data)
    pl_df = pl.from_pandas(df_full)
    pl_df.write_parquet(output_path)
    return pl_df
```
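For orientation, a hypothetical invocation of the Option A collector on the 1-week test window from the Implementation Strategy below. The CSV path matches step 2 of that strategy; the `cnec_eic` column name and the output path are assumptions, not verified names:

```python
from pathlib import Path

import polars as pl

# Assumption: Phase 1 saved the EIC codes under a `cnec_eic` column
# (file path per step 2 of the Implementation Strategy below)
eic_codes = pl.read_csv("data/processed/critical_cnecs_eic_codes.csv")["cnec_eic"].to_list()

# 1-week test window from step 3 of the Implementation Strategy
df = collect_final_domain_sample(
    start_date="2025-09-23",
    end_date="2025-09-30",
    target_cnec_eics=eic_codes,
    output_path=Path("data/processed/final_domain_sample.parquet"),  # hypothetical path
)
print(df.shape)  # roughly 38,400 rows expected for 200 CNECs x 192 hours
```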
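Separately from jao-py, the mirror endpoint from Key Findings #4 can also be fetched directly, which is handy for debugging the bulk path. A minimal sketch, assuming the endpoint returns a ZIP archive holding a single CSV for the requested day (the archive layout and column names are not verified here):

```python
import io
import zipfile

import pandas as pd
import requests


def fetch_mirror_day(day: str) -> pd.DataFrame:
    """Fetch one day of final-domain records straight from the mirror."""
    url = f"https://mirror.flowbased.eu/dacc/final_domain/{day}"  # day formatted as YYYY-MM-DD
    resp = requests.get(url, timeout=120)
    resp.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
        # Assumption: a single CSV member holds all 24 hours of CNEC records
        with zf.open(zf.namelist()[0]) as member:
            return pd.read_csv(member)
```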
**Option B: Hourly API calls (SLOWER, but more granular)**:

```python
from pathlib import Path

import pandas as pd
import polars as pl


def collect_final_domain_hourly(
    start_date: str,
    end_date: str,
    target_cnec_eics: list[str],
    output_path: Path,
) -> pl.DataFrame:
    """Collect DENSE CNEC data hour by hour."""
    client = JAOClient()
    all_data = []
    for date in pd.date_range(start_date, end_date, freq='h'):
        try:
            df_hour = client.query_final_domain(
                mtu=pd.Timestamp(date, tz='Europe/Amsterdam'),
                presolved=None,  # ALL CNECs
            )
            df_filtered = df_hour[df_hour['cnec_eic'].isin(target_cnec_eics)]
            all_data.append(df_filtered)
        except NoMatchingDataError:  # raised by the client library for empty hours
            continue
    df_full = pd.concat(all_data)
    pl_df = pl.from_pandas(df_full)
    pl_df.write_parquet(output_path)
    return pl_df
```

### Data Volume Estimates

**Full download (all ~20K CNECs)**:
- 20,000 CNECs × 17,520 hours = ~350M records
- ~27 columns × 8 bytes/value ≈ 75 GB uncompressed
- With Parquet compression: ~10-20 GB

**Filtered (200 target CNECs)**:
- 200 CNECs × 17,520 hours = ~3.5M records
- ~27 columns × 8 bytes/value ≈ 750 MB uncompressed
- With Parquet compression: ~100-150 MB

### Implementation Strategy

1. **Phase 1 complete**: identify the top 200 CNECs from SPARSE data
2. **Extract EIC codes**: save to `data/processed/critical_cnecs_eic_codes.csv`
3. **Test on 1 week**: validate DENSE collection via the mirror

   ```python
   # Test: 2025-09-23 to 2025-09-30 (8 days)
   # Expected: 200 CNECs × 192 hours = 38,400 records
   ```

4. **Collect 24 months**: use the mirror for speed
5. **Validate the DENSE structure**:

   ```python
   # df is the polars DataFrame returned by the collector
   unique_cnecs = df['cnec_eic'].n_unique()
   unique_hours = df['mtu'].n_unique()
   expected = unique_cnecs * unique_hours
   actual = len(df)
   assert actual == expected, f"Not DENSE! {actual} != {expected}"
   ```

### Advantages of the Mirror Method

- ✅ Faster: 1 request/day vs 24 requests/day
- ✅ Rate-limit friendly: 730 requests vs 17,520 requests
- ✅ More reliable: fewer timeouts and connection errors
- ✅ Complete days: guarantees all 24 hours are present

### Next Steps

1. Add a `collect_final_domain_dense()` method to `collect_jao.py`
2. Test on a 1-week sample with the target EIC codes
3. Validate the DENSE structure and data quality
4. Run the 24-month collection once Phase 1 is complete
5. Use the DENSE data for Tier 1 & Tier 2 feature engineering

---

**Research completed**: 2025-11-05
**jao-py version**: 0.6.2
**Source**: `C:\Users\evgue\projects\fbmc_chronos2\.venv\Lib\site-packages\jao\jao.py`