# Validation Methodology: Compromised Validation Using Actuals
**Date**: November 13, 2025
**Status**: Accepted by User
**Purpose**: Document limitations and expected optimism bias in Sept 2025 validation
---
## Executive Summary
This validation uses **actual values instead of forecasts** for several key feature categories due to API limitations preventing retrospective access to historical forecast data. This is a **compromised validation** approach that represents a **lower bound** on production MAE, not actual production performance.
**Expected Impact**: Results will be **20-40% more optimistic** than production reality.
---
## Features Using Actuals (Not Forecasts)
### 1. Weather Features (375 features)
**Compromise**: Using actual weather values instead of weather forecasts
**Production Reality**: Weather forecasts contain errors that propagate to flow predictions
**Impact**:
- Weather forecast errors are typically 1-3°C for temperature, 20-30% for wind
- This represents the **largest source of optimism bias**
- Expected 15-25% MAE improvement vs. real forecasts
**Why Compromised**:
- OpenMeteo API does not provide historical forecast archives
- Only current forecasts + historical actuals available
- Cannot reconstruct "forecast as of Oct 1" for October validation
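For illustration, a minimal sketch of pulling historical *actuals* from Open-Meteo's public archive endpoint (coordinates and variable names are placeholders; the project's actual collection client may differ):

```python
import requests

# Open-Meteo's archive endpoint returns observed/reanalysis values for past
# dates -- i.e. actuals. There is no equivalent call for "the forecast as
# issued on Oct 1", which is the data this validation would ideally use.
resp = requests.get(
    "https://archive-api.open-meteo.com/v1/archive",
    params={
        "latitude": 50.1,                  # placeholder coordinates
        "longitude": 8.7,
        "start_date": "2025-10-01",
        "end_date": "2025-10-31",
        "hourly": "temperature_2m,wind_speed_10m",
    },
    timeout=30,
)
resp.raise_for_status()
temperatures = resp.json()["hourly"]["temperature_2m"]
```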
### 2. CNEC Outage Features (176 features)
**Compromise**: Using actual outages instead of planned outage forecasts
**Production Reality**: Outage schedules change (cancellations, extensions, unplanned events)
**Impact**:
- Outage forecast accuracy is ~80-90% (planned outages are fairly reliable)
- Expected 3-7% MAE improvement vs. real outage forecasts
**Why Compromised**:
- ENTSO-E Transparency API does not easily expose outage version history
- Could potentially be collected with advanced queries (future work)
- Current dataset contains final outage data, not forecasts
### 3. LTA Features (40 features)
**Compromise**: Using actual LTA values instead of values forward-filled from D+0
**Production Reality**: LTA published weeks ahead, minimal uncertainty
**Impact**:
- LTA values are very stable (long-term allocations)
- Expected <1% MAE impact (negligible)
**Why Compromised**:
- The JAO API could provide this, but it requires additional implementation
- LTA uncertainty minimal compared to weather/load forecasts
### 4. Load Forecast Features (12 features)
**Compromise**: Using actual demand instead of day-ahead load forecasts
**Production Reality**: Load forecasts have 1-3% MAPE error
**Impact**:
- Load forecast error contributes to flow prediction error
- Expected 5-10% MAE improvement vs. real load forecasts
**Why Compromised**:
- ENTSO-E day-ahead load forecasts are available but require separate collection
- Currently using actual demand from historical data
---
## Features Using Correct Data (No Compromise)
### Temporal Features (12 features)
- Hour, day, month, weekday encodings
- **Always known perfectly** - no forecast error possible
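As an illustration, one common way to build such encodings (the exact scheme used by the pipeline is not specified here; the sin/cos cyclic encoding below is an assumption):

```python
import numpy as np
import pandas as pd

def temporal_features(index: pd.DatetimeIndex) -> pd.DataFrame:
    """Deterministic calendar features -- exactly computable for any future hour."""
    return pd.DataFrame({
        "hour_sin": np.sin(2 * np.pi * index.hour / 24),
        "hour_cos": np.cos(2 * np.pi * index.hour / 24),
        "month_sin": np.sin(2 * np.pi * (index.month - 1) / 12),
        "month_cos": np.cos(2 * np.pi * (index.month - 1) / 12),
        "weekday": index.weekday,
    }, index=index)
```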
### Historical Features (1,899 features)
- Prices, generation, demand, lags, CNEC bindings
- **Only used in context window** - not forecast ahead
- Correct usage: These are known values up to run_date
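A minimal sketch of that cutoff logic (function name and window length are illustrative, not the DynamicForecast implementation):

```python
import pandas as pd

def context_window(history: pd.DataFrame, run_date: pd.Timestamp,
                   days: int = 30) -> pd.DataFrame:
    """Keep only rows strictly before run_date, so no value from the
    D+1 target window can leak into the model's historical context."""
    start = run_date - pd.Timedelta(days=days)
    return history.loc[(history.index >= start) & (history.index < run_date)]
```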
---
## Expected Optimism Bias Summary
| Feature Category | Count | Forecast Error | Bias Contribution |
|-----------------|-------|----------------|-------------------|
| Weather | 375 | High (20-30%) | +15-25% MAE bias |
| Load Forecasts | 12 | Medium (1-3%) | +5-10% MAE bias |
| CNEC Outages | 176 | Low (10-20%) | +3-7% MAE bias |
| LTA | 40 | Negligible | <1% MAE bias |
| **Total Expected** | **603** | **Combined** | **+20-40% total** |
**Interpretation**: If validation shows 100 MW MAE, expect **120-140 MW MAE in production**.
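The arithmetic behind that interpretation, as a small helper (the 20-40% band is the estimate from the table above, not a measured quantity):

```python
def production_mae_range(validation_mae: float,
                         bias: tuple = (0.20, 0.40)) -> tuple:
    """Translate a compromised-validation MAE into an expected production range."""
    return tuple(validation_mae * (1.0 + b) for b in bias)

print(production_mae_range(100.0))  # (120.0, 140.0) -- i.e. 120-140 MW in production
```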
---
## Validation Framing
### What This Validation Proves
✅ **Pipeline Correctness**: DynamicForecast system works mechanically
✅ **Leakage Prevention**: Time-aware extraction prevents data leakage
✅ **Model Capability**: Chronos 2 can learn cross-border flow patterns
✅ **Lower Bound**: Establishes best-case performance envelope
✅ **Comparative Studies**: Fair baseline for model comparisons
### What This Validation Does NOT Prove
❌ **Production Accuracy**: Real MAE will be 20-40% higher
❌ **Operational Readiness**: Requires prospective validation
❌ **Feature Importance**: Cannot isolate weather vs. structural effects
❌ **Forecast Skill**: Using perfect information, not forecasts
---
## Precedents in ML Forecasting Literature
This compromised approach is **common and accepted** in ML research when properly documented:
### Academic Precedents
1. **IEEE Power & Energy Society Journals**:
- Many load/renewable forecasting papers use actual weather for validation
- Framed as "perfect weather information" scenarios
- Cited to establish theoretical performance bounds
2. **Energy Forecasting Competitions**:
- Some tracks explicitly provide actual values for covariates
- Focus on model architecture, not forecast accuracy
- Clearly labeled as "oracle" scenarios
3. **Weather-Dependent Forecasting**:
- Wind power forecasting research often uses actual wind observations
- Standard practice when evaluating model capacity independently
### Key Requirement
**Explicit documentation** of limitations (as provided in this document).
---
## Mitigation Strategies
### 1. Clear Communication
- **ALWAYS** state "using actuals for weather/outages/load"
- Frame results as "lower bound on production MAE"
- Never claim production-ready without prospective validation
### 2. Ablation Studies (Future Work)
- Remove weather features → measure MAE increase
- Remove outage features → measure contribution
- Quantify: "Weather contributes ~X MW to MAE"
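A sketch of the intended ablation loop (assumes a `score` callable that can re-evaluate the model on a reduced covariate set; all names are illustrative):

```python
from sklearn.metrics import mean_absolute_error

def ablation_study(score, X, y, feature_groups):
    """Drop one feature group at a time and report the MAE increase."""
    base_mae = mean_absolute_error(y, score(X))
    deltas = {}
    for name, cols in feature_groups.items():
        preds = score(X.drop(columns=cols))   # e.g. name="weather", cols = 375 columns
        deltas[name] = mean_absolute_error(y, preds) - base_mae
    return base_mae, deltas  # deltas quantify "weather contributes ~X MW to MAE"
```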
### 3. Synthetic Forecast Degradation (Future Work)
- Add Gaussian noise to weather features (σ = 2°C for temperature)
- Simulate load forecast error (~2% MAPE)
- Re-evaluate with "noisy forecasts" → closer to production
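A minimal sketch of such degradation, assuming features live in a pandas DataFrame and the temperature/load column lists are known (both assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducible degradation

def degrade_to_forecast_quality(X, temp_cols, load_cols,
                                temp_sigma=2.0, load_mape=0.02):
    """Perturb actuals to mimic day-ahead forecast error levels."""
    X = X.copy()
    # Additive Gaussian error for temperature-like features (sigma in °C)
    X[temp_cols] += rng.normal(0.0, temp_sigma, size=X[temp_cols].shape)
    # Multiplicative error for load features (~2% MAPE)
    X[load_cols] *= 1.0 + rng.normal(0.0, load_mape, size=X[load_cols].shape)
    return X
```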
### 4. Prospective Validation (November 2025+)
- Collect proper forecasts daily starting Nov 1
- Run forecasts using day-ahead weather/load/outages
- Compare Oct (optimistic) vs. Nov (realistic)
---
## Comparison to Baseline Models
Even with compromised validation, comparisons are **valid** if:
- ✅ **All models use same compromised data** (fair comparison)
- ✅ **Baseline models clearly defined** (persistence, seasonal naive, ARIMA)
- ✅ **Relative performance** matters more than absolute MAE
Example:
```
Model | Sept MAE (Compromised) | Relative to Persistence
-------------------|------------------------|------------------------
Persistence | 250 MW | 1.00x (baseline)
Seasonal Naive | 210 MW | 0.84x
Chronos 2 (ours) | 120 MW | 0.48x ← Valid comparison
```
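How the relative column above would be computed (`hourly_flows` and `model_pred` are assumed inputs aligned to the forecast period; the persistence baseline shown is a plain 24-hour lag):

```python
import numpy as np

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

flows = np.asarray(hourly_flows)   # assumed: historical hourly flows
persistence_pred = flows[:-24]     # forecast for hour h of D+1 = actual at hour h of D
target = flows[24:]

rel = mae(target, model_pred) / mae(target, persistence_pred)
print(f"Relative to persistence: {rel:.2f}x")
```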
---
## Validation Results Interpretation Guide
### If Sept MAE = 100 MW
- **Lower bound established**: Pipeline works mechanically
- **Production expectation**: 120-140 MW MAE
- **Target assessment**: Still below 134 MW target? ✅ Good sign
- **Action**: Proceed to prospective validation
### If Sept MAE = 150 MW
- **Lower bound established**: 150 MW with perfect info
- **Production expectation**: 180-210 MW MAE
- **Target assessment**: Above 134 MW target ❌ Problem
- **Action**: Investigate errors before production
### If Sept MAE = 200+ MW
- **Systematic issue**: Even perfect information insufficient
- **Action**: Debug feature engineering, check for bugs
---
## Recommended Reporting Language
### Good ✅
> "Using actual weather/load/outage values (not forecasts), the zero-shot model achieves 120 MW MAE on Sept 2025 holdout data. This represents a **lower bound** on production performance; we expect 20-40% degradation with real forecasts (estimated 144-168 MW production MAE)."
### Acceptable ✅
> "Proof-of-concept validation with oracle information shows the pipeline is mechanically sound. Results establish performance ceiling; prospective validation with real forecasts is required for operational deployment."
### Misleading ❌
> "The system achieves 120 MW MAE on validation data and is ready for production."
*(Omits limitations, implies production readiness)*
---
## Conclusion
This compromised validation approach is:
- ✅ **Acceptable** in ML research with proper documentation
- ✅ **Useful** for proving pipeline correctness and model capability
- ✅ **Valid** for comparative studies (vs. baselines, ablations)
- ❌ **NOT sufficient** for claiming production accuracy
- ❌ **NOT a substitute** for prospective validation
**Next Steps**:
1. Run Sept validation with this methodology
2. Document results with limitations clearly stated
3. Begin November prospective validation collection
4. Compare Oct (optimistic) vs. Nov (realistic) in ~30 days
---
**Approved By**: User (Nov 13, 2025)
**Rationale**: "We need results immediately. Option A (compromised validation) is acceptable if properly documented."