
Validation Methodology: Compromised Validation Using Actuals

Date: November 13, 2025
Status: Accepted by User
Purpose: Document limitations and expected optimism bias in the Sept 2025 validation


Executive Summary

This validation uses actual values instead of forecasts for several key feature categories due to API limitations preventing retrospective access to historical forecast data. This is a compromised validation approach that represents a lower bound on production MAE, not actual production performance.

Expected Impact: Results will be 20-40% more optimistic than production reality.


Features Using Actuals (Not Forecasts)

1. Weather Features (375 features)

Compromise: Using actual weather values instead of weather forecasts
Production Reality: Weather forecasts contain errors that propagate to flow predictions
Impact:

  • Weather forecast errors typically 1-3°C for temperature, 20-30% for wind
  • This represents the largest source of optimism bias
  • Expected 15-25% MAE improvement vs. real forecasts

Why Compromised:

  • OpenMeteo API does not provide historical forecast archives
  • Only current forecasts + historical actuals available
  • Cannot reconstruct the forecast as it was issued on a past date (e.g., "forecast as of Oct 1") for the validation window

2. CNEC Outage Features (176 features)

Compromise: Using actual outages instead of planned outage forecasts
Production Reality: Outage schedules change (cancellations, extensions, unplanned events)
Impact:

  • Outage forecast accuracy ~80-90% (planned outages fairly reliable)
  • Expected 3-7% MAE improvement vs. real outage forecasts

Why Compromised:

  • ENTSO-E Transparency API does not easily expose outage version history
  • Could potentially collect with advanced queries (future work)
  • Current dataset contains final outage data, not forecasts

3. LTA Features (40 features)

Compromise: Using actual LTA values instead of values forward-filled from D+0
Production Reality: LTA is published weeks ahead, so uncertainty is minimal
Impact:

  • LTA values are very stable (long-term allocations)
  • Expected <1% MAE impact (negligible)

Why Compromised:

  • JAO API could provide this, but requires additional implementation
  • LTA uncertainty minimal compared to weather/load forecasts

4. Load Forecast Features (12 features)

Compromise: Using actual demand instead of day-ahead load forecasts
Production Reality: Load forecasts have 1-3% MAPE
Impact:

  • Load forecast error contributes to flow prediction error
  • Expected 5-10% MAE improvement vs. real load forecasts

Why Compromised:

  • ENTSO-E day-ahead load forecasts are available but require a separate collection step
  • Currently using actual demand from historical data

Features Using Correct Data (No Compromise)

Temporal Features (12 features)

  • Hour, day, month, weekday encodings
  • Always known perfectly - no forecast error possible

Historical Features (1,899 features)

  • Prices, generation, demand, lags, CNEC bindings
  • Only used in context window - not forecast ahead
  • Correct usage: These are known values up to run_date (see the sketch below)
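The leakage rule above is mechanical and worth pinning down in code. A minimal sketch, assuming features live in a timestamp-indexed pandas DataFrame; the function name and parameters are illustrative, not the actual DynamicForecast API:

```python
import pandas as pd

def split_context_and_horizon(df: pd.DataFrame,
                              run_date: pd.Timestamp,
                              horizon_hours: int = 24):
    """Split a timestamp-indexed feature frame at run_date.

    Everything at or before run_date is context (known history);
    everything after it is the forecast horizon, where only
    forecast-time covariates may appear, never observed actuals.
    """
    context = df.loc[df.index <= run_date]
    horizon_end = run_date + pd.Timedelta(hours=horizon_hours)
    horizon = df.loc[(df.index > run_date) & (df.index <= horizon_end)]
    return context, horizon
```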

Expected Optimism Bias Summary

Feature Category  | Count | Forecast Error | Bias Contribution
------------------|-------|----------------|------------------
Weather           | 375   | High (20-30%)  | +15-25% MAE bias
Load Forecasts    | 12    | Medium (1-3%)  | +5-10% MAE bias
CNEC Outages      | 176   | Low (10-20%)   | +3-7% MAE bias
LTA               | 40    | Negligible     | <1% MAE bias
Total (expected)  | 603   | Combined       | +20-40% MAE bias

Interpretation: If validation shows 100 MW MAE, expect 120-140 MW MAE in production.
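Turning a validation MAE into an expected production range is plain arithmetic; a small illustrative helper makes the bias explicit:

```python
def production_mae_range(validation_mae_mw: float,
                         bias_low: float = 0.20,
                         bias_high: float = 0.40) -> tuple[float, float]:
    """Apply the expected 20-40% optimism bias to a compromised-validation MAE."""
    return (validation_mae_mw * (1 + bias_low),
            validation_mae_mw * (1 + bias_high))

# production_mae_range(100.0) -> (120.0, 140.0), matching the interpretation above
```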


Validation Framing

What This Validation Proves

  • Pipeline Correctness: DynamicForecast system works mechanically ✅
  • Leakage Prevention: Time-aware extraction prevents data leakage ✅
  • Model Capability: Chronos 2 can learn cross-border flow patterns ✅
  • Lower Bound: Establishes best-case performance envelope ✅
  • Comparative Studies: Fair baseline for model comparisons ✅

What This Validation Does NOT Prove

  • Production Accuracy: Real MAE will be 20-40% higher ❌
  • Operational Readiness: Requires prospective validation ❌
  • Feature Importance: Cannot isolate weather vs. structural effects ❌
  • Forecast Skill: Using perfect information, not forecasts ❌


Precedents in ML Forecasting Literature

This compromised approach is common and accepted in ML research when properly documented:

Academic Precedents

  1. IEEE Power & Energy Society Journals:
    • Many load/renewable forecasting papers use actual weather for validation
    • Framed as "perfect weather information" scenarios
    • Cited to establish theoretical performance bounds
  2. Energy Forecasting Competitions:
    • Some tracks explicitly provide actual values for covariates
    • Focus on model architecture, not forecast accuracy
    • Clearly labeled as "oracle" scenarios
  3. Weather-Dependent Forecasting:
    • Wind power forecasting research often uses actual wind observations
    • Standard practice when evaluating model capacity independently

Key Requirement

Explicit documentation of limitations (as provided in this document).


Mitigation Strategies

1. Clear Communication

  • ALWAYS state "using actuals for weather/outages/load"
  • Frame results as "lower bound on production MAE"
  • Never claim production-ready without prospective validation

2. Ablation Studies (Future Work)

  • Remove weather features → measure MAE increase
  • Remove outage features → measure contribution
  • Quantify: "Weather contributes ~X MW to MAE" (sketched below)
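A minimal sketch of that ablation loop; `evaluate_mae` is a hypothetical stand-in for the project's train/evaluate routine, and the feature-group mapping is assumed to be available:

```python
from typing import Callable, Dict, List

def ablation_study(all_columns: List[str],
                   feature_groups: Dict[str, List[str]],
                   evaluate_mae: Callable[[List[str]], float]) -> Dict[str, float]:
    """Drop one feature group at a time and report the MAE increase in MW.

    evaluate_mae takes a list of feature columns and returns validation MAE.
    """
    baseline = evaluate_mae(all_columns)
    contributions = {}
    for group, columns in feature_groups.items():
        drop = set(columns)
        ablated = [c for c in all_columns if c not in drop]
        contributions[group] = evaluate_mae(ablated) - baseline
    return contributions
```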

3. Synthetic Forecast Degradation (Future Work)

  • Add Gaussian noise to weather features (σ = 2°C for temperature)
  • Simulate load forecast error (~2% MAPE)
  • Re-evaluate with "noisy forecasts" → closer to production (sketched below)
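A minimal sketch of the degradation step, assuming the features are available as numpy arrays; the noise parameters mirror the bullets above, and the function names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so degraded runs are reproducible

def degrade_temperature(temp_c: np.ndarray, sigma_c: float = 2.0) -> np.ndarray:
    """Simulate a temperature forecast: additive Gaussian error (sigma in deg C)."""
    return temp_c + rng.normal(0.0, sigma_c, size=temp_c.shape)

def degrade_load(load_mw: np.ndarray, mape: float = 0.02) -> np.ndarray:
    """Simulate a load forecast: multiplicative noise giving roughly 2% MAPE."""
    return load_mw * (1.0 + rng.normal(0.0, mape, size=load_mw.shape))
```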

4. Prospective Validation (November 2025+)

  • Collect proper forecasts daily starting Nov 1
  • Run forecasts using day-ahead weather/load/outages
  • Compare Oct (optimistic) vs. Nov (realistic); a snapshot-job sketch follows
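A minimal sketch of such a daily snapshot job; the helper and its fetcher interface are assumptions, not the existing collection code. The key design point is stamping each payload with its issue time, which is exactly what the retrospective data lacks:

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Dict

def snapshot_forecasts(fetchers: Dict[str, Callable[[], dict]],
                       out_dir: Path) -> None:
    """Persist today's day-ahead forecasts, keyed by their issue timestamp.

    fetchers maps a source name ("weather", "load", "outages") to a
    callable returning JSON-serialisable forecast data.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    issued_at = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    for source, fetch in fetchers.items():
        payload = {"issued_at": issued_at, "source": source, "data": fetch()}
        (out_dir / f"{source}_{issued_at}.json").write_text(json.dumps(payload))
```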

Comparison to Baseline Models

Even with compromised validation, comparisons are valid if:

  • All models use same compromised data (fair comparison)
  • Baseline models clearly defined (persistence, seasonal naive, ARIMA)
  • Relative performance matters more than absolute MAE

Example:

Model              | Sept MAE (Compromised) | Relative to Persistence
-------------------|------------------------|------------------------
Persistence        | 250 MW                 | 1.00x (baseline)
Seasonal Naive     | 210 MW                 | 0.84x
Chronos 2 (ours)   | 120 MW                 | 0.48x ← Valid comparison
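The relative-MAE column can be computed in a few lines of numpy. A sketch under stated assumptions: a 24 h lag for persistence (168 h would give the weekly seasonal naive), with model predictions aligned to the same hourly index as the actuals:

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(y_true - y_pred)))

def lagged_baseline(y: np.ndarray, lag_hours: int) -> np.ndarray:
    """lag=24 gives persistence; lag=168 gives a weekly seasonal naive."""
    return y[:-lag_hours]

def relative_to_persistence(y: np.ndarray, model_pred: np.ndarray,
                            lag_hours: int = 24) -> float:
    """Model MAE divided by persistence MAE on the aligned window."""
    persistence_mae = mae(y[lag_hours:], lagged_baseline(y, lag_hours))
    return mae(y[lag_hours:], model_pred[lag_hours:]) / persistence_mae
```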

Validation Results Interpretation Guide

If Sept MAE = 100 MW

  • Lower bound established: Pipeline works mechanically
  • Production expectation: 120-140 MW MAE
  • Target assessment: Still below 134 MW target? ✅ Good sign
  • Action: Proceed to prospective validation

If Sept MAE = 150 MW

  • Lower bound established: 150 MW with perfect info
  • Production expectation: 180-210 MW MAE
  • Target assessment: Above 134 MW target ❌ Problem
  • Action: Investigate errors before production

If Sept MAE = 200+ MW

  • Systematic issue: Even perfect information insufficient
  • Action: Debug feature engineering, check for bugs (see the helper sketched below)
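The decision rules above fit in one small helper; the thresholds and 134 MW target are copied from this guide, and the exact boundaries are a judgment call rather than established project policy:

```python
def interpret_validation_mae(validation_mae_mw: float,
                             target_mw: float = 134.0) -> str:
    """Map a compromised-validation MAE to the recommended next action."""
    prod_low = validation_mae_mw * 1.2   # expected production range after
    prod_high = validation_mae_mw * 1.4  # applying the 20-40% optimism bias
    if validation_mae_mw >= 200:
        return "Systematic issue: debug feature engineering and check for bugs."
    if prod_low <= target_mw:
        return (f"Good sign: expected production MAE {prod_low:.0f}-{prod_high:.0f} MW; "
                f"proceed to prospective validation.")
    return (f"Problem: expected production MAE {prod_low:.0f}-{prod_high:.0f} MW is "
            f"above the {target_mw:.0f} MW target; investigate before production.")
```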

Recommended Reporting Language

Good ✅

"Using actual weather/load/outage values (not forecasts), the zero-shot model achieves 120 MW MAE on Sept 2025 holdout data. This represents a lower bound on production performance; we expect 20-40% degradation with real forecasts (estimated 144-168 MW production MAE)."

Acceptable ✅

"Proof-of-concept validation with oracle information shows the pipeline is mechanically sound. Results establish performance ceiling; prospective validation with real forecasts is required for operational deployment."

Misleading ❌

"The system achieves 120 MW MAE on validation data and is ready for production." (Omits limitations, implies production readiness)


Conclusion

This compromised validation approach is:

  • Acceptable in ML research with proper documentation
  • Useful for proving pipeline correctness and model capability
  • Valid for comparative studies (vs. baselines, ablations)
  • NOT sufficient for claiming production accuracy
  • NOT a substitute for prospective validation

Next Steps:

  1. Run Sept validation with this methodology
  2. Document results with limitations clearly stated
  3. Begin November prospective validation collection
  4. Compare Oct (optimistic) vs. Nov (realistic) in ~30 days

Approved By: User (Nov 13, 2025)
Rationale: "We need results immediately. Option A (compromised validation) is acceptable if properly documented."