# Validation Methodology: Compromised Validation Using Actuals

**Date**: November 13, 2025
**Status**: Accepted by User
**Purpose**: Document limitations and expected optimism bias in Sept 2025 validation

---

## Executive Summary

This validation uses **actual values instead of forecasts** for several key feature categories due to API limitations preventing retrospective access to historical forecast data. This is a **compromised validation** approach that represents a **lower bound** on production MAE, not actual production performance.

**Expected Impact**: Results will be **20-40% more optimistic** than production reality.

---

## Features Using Actuals (Not Forecasts)

### 1. Weather Features (375 features)
**Compromise**: Using actual weather values instead of weather forecasts
**Production Reality**: Weather forecasts contain errors that propagate to flow predictions
**Impact**:
- Weather forecast errors typically 1-3°C for temperature, 20-30% for wind
- This represents the **largest source of optimism bias**
- Expected 15-25% MAE improvement vs. real forecasts

**Why Compromised**:
- OpenMeteo API does not provide historical forecast archives
- Only current forecasts + historical actuals available
- Cannot reconstruct "forecast as of Oct 1" for October validation
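
For reference, a minimal sketch of how the historical actuals are retrieved from OpenMeteo's archive endpoint; the coordinates, dates, and variable names are illustrative, not the project's actual configuration:

```python
import requests

def fetch_weather_actuals(lat: float, lon: float, start: str, end: str) -> dict:
    """Pull historical actuals (reanalysis) from OpenMeteo's archive API.

    Note: this endpoint returns what happened, not the forecast that was
    issued ahead of time; hence the compromise described above.
    """
    resp = requests.get(
        "https://archive-api.open-meteo.com/v1/archive",
        params={
            "latitude": lat,
            "longitude": lon,
            "start_date": start,  # e.g. "2025-09-01"
            "end_date": end,      # e.g. "2025-09-30"
            "hourly": "temperature_2m,wind_speed_10m",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["hourly"]
```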

### 2. CNEC Outage Features (176 features)
**Compromise**: Using actual outages instead of planned outage forecasts
**Production Reality**: Outage schedules change (cancellations, extensions, unplanned events)
**Impact**:
- Outage forecast accuracy ~80-90% (planned outages fairly reliable)
- Expected 3-7% MAE improvement vs. real outage forecasts

**Why Compromised**:
- ENTSO-E Transparency API does not easily expose outage version history
- Could potentially collect with advanced queries (future work)
- Current dataset contains final outage data, not forecasts

### 3. LTA Features (40 features)
**Compromise**: Using actual LTA values instead of values forward-filled from D+0
**Production Reality**: LTA published weeks ahead, minimal uncertainty
**Impact**:
- LTA values are very stable (long-term allocations)
- Expected <1% MAE impact (negligible)

**Why Compromised**:
- JAO API could provide this, but requires additional implementation
- LTA uncertainty minimal compared to weather/load forecasts
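
For comparison, a production-faithful treatment would forward-fill the last LTA values known at D+0 across the horizon. A minimal pandas sketch, assuming a hypothetical one-column-per-border layout:

```python
import pandas as pd

def forward_fill_lta(lta: pd.DataFrame, run_date: pd.Timestamp,
                     horizon: pd.DatetimeIndex) -> pd.DataFrame:
    # Keep only LTA rows known on or before the run date, then carry the
    # last known values forward across the forecast horizon.
    known = lta.loc[:run_date]
    return known.reindex(known.index.union(horizon)).ffill().loc[horizon]
```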

### 4. Load Forecast Features (12 features)
**Compromise**: Using actual demand instead of day-ahead load forecasts
**Production Reality**: Load forecasts have 1-3% MAPE error
**Impact**:
- Load forecast error contributes to flow prediction error
- Expected 5-10% MAE improvement vs. real load forecasts

**Why Compromised**:
- ENTSO-E day-ahead load forecasts are available but require separate collection
- Currently using actual demand from historical data
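
If that collection is implemented, the entsoe-py client exposes the day-ahead load forecast directly. A sketch assuming that library; the API key and bidding zone are placeholders:

```python
import pandas as pd
from entsoe import EntsoePandasClient

client = EntsoePandasClient(api_key="YOUR_API_KEY")  # placeholder key

# Day-ahead load forecast for one bidding zone (zone code is illustrative).
load_forecast = client.query_load_forecast(
    "DE_LU",
    start=pd.Timestamp("2025-09-01", tz="Europe/Berlin"),
    end=pd.Timestamp("2025-10-01", tz="Europe/Berlin"),
)
```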

---

## Features Using Correct Data (No Compromise)

### Temporal Features (12 features)
- Hour, day, month, weekday encodings
- **Always known perfectly** - no forecast error possible

### Historical Features (1,899 features)
- Prices, generation, demand, lags, CNEC bindings
- **Only used in context window** - not forecast ahead
- Correct usage: These are known values up to run_date
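
A minimal sketch of that time-aware cut (function, parameter names, and the context length are hypothetical):

```python
import pandas as pd

def build_context(features: pd.DataFrame, run_date: pd.Timestamp,
                  context_hours: int = 512) -> pd.DataFrame:
    # Historical features may only come from strictly before the run date;
    # anything at or after run_date would be leakage.
    history = features.loc[features.index < run_date]
    return history.tail(context_hours)
```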

---

## Expected Optimism Bias Summary

| Feature Category | Count | Forecast Error | Bias Contribution |
|-----------------|-------|----------------|-------------------|
| Weather | 375 | High (20-30%) | +15-25% MAE bias |
| Load Forecasts | 12 | Medium (1-3%) | +5-10% MAE bias |
| CNEC Outages | 176 | Low (10-20%) | +3-7% MAE bias |
| LTA | 40 | Negligible | <1% MAE bias |
| **Total Expected** | **603** | **Combined** | **+20-40% total** |

**Interpretation**: If validation shows 100 MW MAE, expect **120-140 MW MAE in production**.
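
The same arithmetic as a small helper for applying the expected bias band to any validation MAE:

```python
def production_mae_range(validation_mae: float,
                         bias: tuple[float, float] = (0.20, 0.40)) -> tuple[float, float]:
    """Apply the expected +20-40% optimism bias to a compromised-validation MAE."""
    return validation_mae * (1 + bias[0]), validation_mae * (1 + bias[1])

print(production_mae_range(100.0))  # (120.0, 140.0)
```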

---

## Validation Framing

### What This Validation Proves
- ✅ **Pipeline Correctness**: DynamicForecast system works mechanically
- ✅ **Leakage Prevention**: Time-aware extraction prevents data leakage
- ✅ **Model Capability**: Chronos 2 can learn cross-border flow patterns
- ✅ **Lower Bound**: Establishes best-case performance envelope
- ✅ **Comparative Studies**: Fair baseline for model comparisons

### What This Validation Does NOT Prove
- ❌ **Production Accuracy**: Real MAE will be 20-40% higher
- ❌ **Operational Readiness**: Requires prospective validation
- ❌ **Feature Importance**: Cannot isolate weather vs. structural effects
- ❌ **Forecast Skill**: Using perfect information, not forecasts

---

## Precedents in ML Forecasting Literature

This compromised approach is **common and accepted** in ML research when properly documented:

### Academic Precedents
1. **IEEE Power & Energy Society Journals**:
   - Many load/renewable forecasting papers use actual weather for validation
   - Framed as "perfect weather information" scenarios
   - Cited to establish theoretical performance bounds

2. **Energy Forecasting Competitions**:
   - Some tracks explicitly provide actual values for covariates
   - Focus on model architecture, not forecast accuracy
   - Clearly labeled as "oracle" scenarios

3. **Weather-Dependent Forecasting**:
   - Wind power forecasting research often uses actual wind observations
   - Standard practice when evaluating model capacity independently

### Key Requirement
**Explicit documentation** of limitations (as provided in this document).

---

## Mitigation Strategies

### 1. Clear Communication
- **ALWAYS** state "using actuals for weather/outages/load"
- Frame results as "lower bound on production MAE"
- Never claim production-ready without prospective validation

### 2. Ablation Studies (Future Work)
- Remove weather features → measure MAE increase
- Remove outage features → measure contribution
- Quantify: "Weather contributes ~X MW to MAE"

### 3. Synthetic Forecast Degradation (Future Work)
- Add Gaussian noise to weather features (σ = 2°C for temperature)
- Simulate load forecast error (~2% MAPE)
- Re-evaluate with "noisy forecasts" → closer to production
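
A sketch of that degradation, using the noise levels listed above (array shapes and the fixed seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducible noise

def degrade_to_pseudo_forecasts(temperature_c: np.ndarray,
                                load_mw: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    # Gaussian noise with sigma = 2 degrees C on temperature, and roughly
    # 2% multiplicative error on load, mimicking day-ahead forecast error.
    noisy_temp = temperature_c + rng.normal(0.0, 2.0, size=temperature_c.shape)
    noisy_load = load_mw * (1.0 + rng.normal(0.0, 0.02, size=load_mw.shape))
    return noisy_temp, noisy_load
```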

### 4. Prospective Validation (November 2025+)
- Collect proper forecasts daily starting Nov 1
- Run forecasts using day-ahead weather/load/outages
- Compare Oct (optimistic) vs. Nov (realistic)
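
A sketch of the daily snapshotting this requires, so the November runs can replay exactly what was knowable at issue time (directory layout and payload structure are illustrative):

```python
import json
from datetime import date
from pathlib import Path

def snapshot_forecasts(forecasts: dict, out_dir: str = "forecast_snapshots") -> Path:
    # Persist today's day-ahead inputs (weather/load/outages) stamped with
    # the issue date, so later evaluation uses forecasts, not actuals.
    path = Path(out_dir) / f"{date.today().isoformat()}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(forecasts, default=str))
    return path
```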

---

## Comparison to Baseline Models

Even with compromised validation, comparisons are **valid** if:
- ✅ **All models use same compromised data** (fair comparison)
- ✅ **Baseline models clearly defined** (persistence, seasonal naive, ARIMA)
- ✅ **Relative performance** matters more than absolute MAE

Example:
```
Model              | Sept MAE (Compromised) | Relative to Persistence
-------------------|------------------------|------------------------
Persistence        | 250 MW                 | 1.00x (baseline)
Seasonal Naive     | 210 MW                 | 0.84x
Chronos 2 (ours)   | 120 MW                 | 0.48x ← Valid comparison
```
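
For concreteness, minimal implementations of the two naive baselines on hourly data (168 h = one week of seasonality):

```python
import numpy as np

def persistence_forecast(history: np.ndarray, horizon: int) -> np.ndarray:
    # Repeat the last observed value across the horizon.
    return np.full(horizon, history[-1])

def seasonal_naive_forecast(history: np.ndarray, horizon: int,
                            season: int = 168) -> np.ndarray:
    # Repeat the value observed one season earlier (same hour last week).
    return np.array([history[-season + (h % season)] for h in range(horizon)])

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(y_true - y_pred)))
```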

---

## Validation Results Interpretation Guide

### If Sept MAE = 100 MW
- **Lower bound established**: Pipeline works mechanically
- **Production expectation**: 120-140 MW MAE
- **Target assessment**: Still below 134 MW target? ✅ Good sign
- **Action**: Proceed to prospective validation

### If Sept MAE = 150 MW
- **Lower bound established**: 150 MW with perfect info
- **Production expectation**: 180-210 MW MAE
- **Target assessment**: Above 134 MW target ❌ Problem
- **Action**: Investigate errors before production

### If Sept MAE = 200+ MW
- **Systematic issue**: Even perfect information insufficient
- **Action**: Debug feature engineering, check for bugs

---

## Recommended Reporting Language

### Good ✅
> "Using actual weather/load/outage values (not forecasts), the zero-shot model achieves 120 MW MAE on Sept 2025 holdout data. This represents a **lower bound** on production performance; we expect 20-40% degradation with real forecasts (estimated 144-168 MW production MAE)."

### Acceptable ✅
> "Proof-of-concept validation with oracle information shows the pipeline is mechanically sound. Results establish performance ceiling; prospective validation with real forecasts is required for operational deployment."

### Misleading ❌
> "The system achieves 120 MW MAE on validation data and is ready for production."
*(Omits limitations, implies production readiness)*

---

## Conclusion

This compromised validation approach is:
- ✅ **Acceptable** in ML research with proper documentation
- ✅ **Useful** for proving pipeline correctness and model capability
- ✅ **Valid** for comparative studies (vs. baselines, ablations)
- ❌ **NOT sufficient** for claiming production accuracy
- ❌ **NOT a substitute** for prospective validation

**Next Steps**:
1. Run Sept validation with this methodology
2. Document results with limitations clearly stated
3. Begin November prospective validation collection
4. Compare Oct (optimistic) vs. Nov (realistic) in ~30 days

---

**Approved By**: User (Nov 13, 2025)
**Rationale**: "We need results immediately. Option A (compromised validation) is acceptable if properly documented."