
Validation Methodology: Compromised Validation Using Actuals

Date: November 13, 2025
Status: Accepted by User
Purpose: Document limitations and expected optimism bias in the Sept 2025 validation


Executive Summary

This validation uses actual values instead of forecasts for several key feature categories due to API limitations preventing retrospective access to historical forecast data. This is a compromised validation approach that represents a lower bound on production MAE, not actual production performance.

Expected Impact: Results will be 20-40% more optimistic than production reality.


Features Using Actuals (Not Forecasts)

1. Weather Features (375 features)

Compromise: Using actual weather values instead of weather forecasts
Production Reality: Weather forecasts contain errors that propagate to flow predictions
Impact:

  • Weather forecast errors typically 1-3°C for temperature, 20-30% for wind
  • This represents the largest source of optimism bias
  • Expected 15-25% MAE improvement vs. real forecasts

Why Compromised:

  • OpenMeteo API does not provide historical forecast archives
  • Only current forecasts + historical actuals available
  • Cannot reconstruct the forecast as it was issued on a past date (e.g., "forecast as of Oct 1") for the validation window

2. CNEC Outage Features (176 features)

Compromise: Using actual outages instead of planned outage forecasts
Production Reality: Outage schedules change (cancellations, extensions, unplanned events)
Impact:

  • Outage forecast accuracy ~80-90% (planned outages fairly reliable)
  • Expected 3-7% MAE improvement vs. real outage forecasts

Why Compromised:

  • ENTSO-E Transparency API does not easily expose outage version history
  • Could potentially collect with advanced queries (future work)
  • Current dataset contains final outage data, not forecasts

3. LTA Features (40 features)

Compromise: Using actual LTA values instead of values forward-filled from D+0
Production Reality: LTA is published weeks ahead, so uncertainty is minimal
Impact:

  • LTA values are very stable (long-term allocations)
  • Expected <1% MAE impact (negligible)

Why Compromised:

  • JAO API could provide this, but requires additional implementation
  • LTA uncertainty minimal compared to weather/load forecasts

4. Load Forecast Features (12 features)

Compromise: Using actual demand instead of day-ahead load forecasts
Production Reality: Load forecasts have 1-3% MAPE
Impact:

  • Load forecast error contributes to flow prediction error
  • Expected 5-10% MAE improvement vs. real load forecasts

Why Compromised:

  • ENTSO-E day-ahead load forecasts are available but require a separate collection step
  • Currently using actual demand from historical data

Features Using Correct Data (No Compromise)

Temporal Features (12 features)

  • Hour, day, month, weekday encodings
  • Always known perfectly - no forecast error possible

Historical Features (1,899 features)

  • Prices, generation, demand, lags, CNEC bindings
  • Only used in context window - not forecast ahead
  • Correct usage: These are known values up to run_date (see the sketch below)
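The leakage rule above is mechanical and worth pinning down in code. A minimal sketch, assuming features live in a timestamp-indexed pandas DataFrame; the function name and parameters are illustrative, not the actual DynamicForecast API:

```python
import pandas as pd

def split_context_and_horizon(df: pd.DataFrame,
                              run_date: pd.Timestamp,
                              horizon_hours: int = 24):
    """Split a timestamp-indexed feature frame at run_date.

    Everything at or before run_date is context (known history);
    everything after it is the forecast horizon, where only
    forecast-time covariates may appear, never observed actuals.
    """
    context = df.loc[df.index <= run_date]
    horizon_end = run_date + pd.Timedelta(hours=horizon_hours)
    horizon = df.loc[(df.index > run_date) & (df.index <= horizon_end)]
    return context, horizon
```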

Expected Optimism Bias Summary

Feature Category  | Count | Forecast Error | Bias Contribution
------------------|-------|----------------|------------------
Weather           | 375   | High (20-30%)  | +15-25% MAE bias
Load Forecasts    | 12    | Medium (1-3%)  | +5-10% MAE bias
CNEC Outages      | 176   | Low (10-20%)   | +3-7% MAE bias
LTA               | 40    | Negligible     | <1% MAE bias
Total (expected)  | 603   | Combined       | +20-40% MAE bias

Interpretation: If validation shows 100 MW MAE, expect 120-140 MW MAE in production.
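Turning a validation MAE into an expected production range is plain arithmetic; a small illustrative helper makes the bias explicit:

```python
def production_mae_range(validation_mae_mw: float,
                         bias_low: float = 0.20,
                         bias_high: float = 0.40) -> tuple[float, float]:
    """Apply the expected 20-40% optimism bias to a compromised-validation MAE."""
    return (validation_mae_mw * (1 + bias_low),
            validation_mae_mw * (1 + bias_high))

# production_mae_range(100.0) -> (120.0, 140.0), matching the interpretation above
```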


Validation Framing

What This Validation Proves

  • Pipeline Correctness: DynamicForecast system works mechanically ✅
  • Leakage Prevention: Time-aware extraction prevents data leakage ✅
  • Model Capability: Chronos 2 can learn cross-border flow patterns ✅
  • Lower Bound: Establishes best-case performance envelope ✅
  • Comparative Studies: Fair baseline for model comparisons ✅

What This Validation Does NOT Prove

  • Production Accuracy: Real MAE will be 20-40% higher ❌
  • Operational Readiness: Requires prospective validation ❌
  • Feature Importance: Cannot isolate weather vs. structural effects ❌
  • Forecast Skill: Using perfect information, not forecasts ❌


Precedents in ML Forecasting Literature

This compromised approach is common and accepted in ML research when properly documented:

Academic Precedents

  1. IEEE Power & Energy Society Journals:
    • Many load/renewable forecasting papers use actual weather for validation
    • Framed as "perfect weather information" scenarios
    • Cited to establish theoretical performance bounds
  2. Energy Forecasting Competitions:
    • Some tracks explicitly provide actual values for covariates
    • Focus on model architecture, not forecast accuracy
    • Clearly labeled as "oracle" scenarios
  3. Weather-Dependent Forecasting:
    • Wind power forecasting research often uses actual wind observations
    • Standard practice when evaluating model capacity independently

Key Requirement

Explicit documentation of limitations (as provided in this document).


Mitigation Strategies

1. Clear Communication

  • ALWAYS state "using actuals for weather/outages/load"
  • Frame results as "lower bound on production MAE"
  • Never claim production-ready without prospective validation

2. Ablation Studies (Future Work)

  • Remove weather features → measure MAE increase
  • Remove outage features → measure contribution
  • Quantify: "Weather contributes ~X MW to MAE" (sketched below)
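A minimal sketch of that ablation loop; `evaluate_mae` is a hypothetical stand-in for the project's train/evaluate routine, and the feature-group mapping is assumed to be available:

```python
from typing import Callable, Dict, List

def ablation_study(all_columns: List[str],
                   feature_groups: Dict[str, List[str]],
                   evaluate_mae: Callable[[List[str]], float]) -> Dict[str, float]:
    """Drop one feature group at a time and report the MAE increase in MW.

    evaluate_mae takes a list of feature columns and returns validation MAE.
    """
    baseline = evaluate_mae(all_columns)
    contributions = {}
    for group, columns in feature_groups.items():
        drop = set(columns)
        ablated = [c for c in all_columns if c not in drop]
        contributions[group] = evaluate_mae(ablated) - baseline
    return contributions
```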

3. Synthetic Forecast Degradation (Future Work)

  • Add Gaussian noise to weather features (σ = 2°C for temperature)
  • Simulate load forecast error (~2% MAPE)
  • Re-evaluate with "noisy forecasts" → closer to production (sketched below)
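A minimal sketch of the degradation step, assuming the features are available as numpy arrays; the noise parameters mirror the bullets above, and the function names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so degraded runs are reproducible

def degrade_temperature(temp_c: np.ndarray, sigma_c: float = 2.0) -> np.ndarray:
    """Simulate a temperature forecast: additive Gaussian error (sigma in deg C)."""
    return temp_c + rng.normal(0.0, sigma_c, size=temp_c.shape)

def degrade_load(load_mw: np.ndarray, mape: float = 0.02) -> np.ndarray:
    """Simulate a load forecast: multiplicative noise giving roughly 2% MAPE."""
    return load_mw * (1.0 + rng.normal(0.0, mape, size=load_mw.shape))
```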

4. Prospective Validation (November 2025+)

  • Collect proper forecasts daily starting Nov 1
  • Run forecasts using day-ahead weather/load/outages
  • Compare Oct (optimistic) vs. Nov (realistic); a snapshot-job sketch follows
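A minimal sketch of such a daily snapshot job; the helper and its fetcher interface are assumptions, not the existing collection code. The key design point is stamping each payload with its issue time, which is exactly what the retrospective data lacks:

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Dict

def snapshot_forecasts(fetchers: Dict[str, Callable[[], dict]],
                       out_dir: Path) -> None:
    """Persist today's day-ahead forecasts, keyed by their issue timestamp.

    fetchers maps a source name ("weather", "load", "outages") to a
    callable returning JSON-serialisable forecast data.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    issued_at = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    for source, fetch in fetchers.items():
        payload = {"issued_at": issued_at, "source": source, "data": fetch()}
        (out_dir / f"{source}_{issued_at}.json").write_text(json.dumps(payload))
```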

Comparison to Baseline Models

Even with compromised validation, comparisons are valid if:

  • All models use same compromised data (fair comparison)
  • Baseline models clearly defined (persistence, seasonal naive, ARIMA)
  • Relative performance matters more than absolute MAE

Example:

Model              | Sept MAE (Compromised) | Relative to Persistence
-------------------|------------------------|------------------------
Persistence        | 250 MW                 | 1.00x (baseline)
Seasonal Naive     | 210 MW                 | 0.84x
Chronos 2 (ours)   | 120 MW                 | 0.48x ← Valid comparison
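The relative-MAE column can be computed in a few lines of numpy. A sketch under stated assumptions: a 24 h lag for persistence (168 h would give the weekly seasonal naive), with model predictions aligned to the same hourly index as the actuals:

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(y_true - y_pred)))

def lagged_baseline(y: np.ndarray, lag_hours: int) -> np.ndarray:
    """lag=24 gives persistence; lag=168 gives a weekly seasonal naive."""
    return y[:-lag_hours]

def relative_to_persistence(y: np.ndarray, model_pred: np.ndarray,
                            lag_hours: int = 24) -> float:
    """Model MAE divided by persistence MAE on the aligned window."""
    persistence_mae = mae(y[lag_hours:], lagged_baseline(y, lag_hours))
    return mae(y[lag_hours:], model_pred[lag_hours:]) / persistence_mae
```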

Validation Results Interpretation Guide

If Sept MAE = 100 MW

  • Lower bound established: Pipeline works mechanically
  • Production expectation: 120-140 MW MAE
  • Target assessment: Still below 134 MW target? ✅ Good sign
  • Action: Proceed to prospective validation

If Sept MAE = 150 MW

  • Lower bound established: 150 MW with perfect info
  • Production expectation: 180-210 MW MAE
  • Target assessment: Above 134 MW target ❌ Problem
  • Action: Investigate errors before production

If Sept MAE = 200+ MW

  • Systematic issue: Even perfect information insufficient
  • Action: Debug feature engineering, check for bugs (see the helper sketched below)
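The decision rules above fit in one small helper; the thresholds and 134 MW target are copied from this guide, and the exact boundaries are a judgment call rather than established project policy:

```python
def interpret_validation_mae(validation_mae_mw: float,
                             target_mw: float = 134.0) -> str:
    """Map a compromised-validation MAE to the recommended next action."""
    prod_low = validation_mae_mw * 1.2   # expected production range after
    prod_high = validation_mae_mw * 1.4  # applying the 20-40% optimism bias
    if validation_mae_mw >= 200:
        return "Systematic issue: debug feature engineering and check for bugs."
    if prod_low <= target_mw:
        return (f"Good sign: expected production MAE {prod_low:.0f}-{prod_high:.0f} MW; "
                f"proceed to prospective validation.")
    return (f"Problem: expected production MAE {prod_low:.0f}-{prod_high:.0f} MW is "
            f"above the {target_mw:.0f} MW target; investigate before production.")
```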

Recommended Reporting Language

Good ✅

"Using actual weather/load/outage values (not forecasts), the zero-shot model achieves 120 MW MAE on Sept 2025 holdout data. This represents a lower bound on production performance; we expect 20-40% degradation with real forecasts (estimated 144-168 MW production MAE)."

Acceptable ✅

"Proof-of-concept validation with oracle information shows the pipeline is mechanically sound. Results establish performance ceiling; prospective validation with real forecasts is required for operational deployment."

Misleading ❌

"The system achieves 120 MW MAE on validation data and is ready for production." (Omits limitations, implies production readiness)


Conclusion

This compromised validation approach is:

  • Acceptable in ML research with proper documentation
  • Useful for proving pipeline correctness and model capability
  • Valid for comparative studies (vs. baselines, ablations)
  • NOT sufficient for claiming production accuracy
  • NOT a substitute for prospective validation

Next Steps:

  1. Run Sept validation with this methodology
  2. Document results with limitations clearly stated
  3. Begin November prospective validation collection
  4. Compare Oct (optimistic) vs. Nov (realistic) in ~30 days

Approved By: User (Nov 13, 2025)
Rationale: "We need results immediately. Option A (compromised validation) is acceptable if properly documented."