Most ML forecasting tutorials start with the data and end with a model. This project started with a physics equation.
Germany's solar fleet generates up to 50 GW at peak. Forecasting it matters because every percentage point of error at midday costs real money on the balancing market. The standard approach is to throw weather features into a gradient boosting model and optimise RMSE. That works. But it leaves a lot of explainability on the table, and it forces the model to learn things it should already know.
The full pipeline: SMARD + Open-Meteo data → TimescaleDB → Physics layer (pvlib) + ML residual layer (XGBoost) → Calibrated P10/P50/P90 → API & Dashboard
The Residual Idea
pvlib is a Python library built by solar engineers to compute exactly how much energy a solar panel should produce given the sun's position, air mass, and atmospheric turbidity. It knows nothing about clouds. But it knows geometry, and geometry is reliable.
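To make that concrete, here is a minimal sketch of what a pvlib clear-sky computation looks like. The site coordinates and the naive scaling to fleet-level MW are illustrative assumptions, not the project's actual configuration:

```python
# Minimal clear-sky sketch with pvlib. The coordinates and the crude
# capacity scaling below are illustrative assumptions.
import pandas as pd
from pvlib.location import Location

times = pd.date_range("2024-06-01", periods=24, freq="h", tz="Europe/Berlin")
site = Location(latitude=51.0, longitude=10.0, tz="Europe/Berlin")

# Ineichen clear-sky model: irradiance from solar geometry and turbidity,
# with no knowledge of actual cloud cover
clearsky = site.get_clearsky(times, model="ineichen")

# Crude fleet proxy: scale clear-sky GHI to a ~50 GW installed fleet
physics_pred = clearsky["ghi"] / clearsky["ghi"].max() * 50_000
```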
So instead of asking XGBoost to predict solar generation directly, I asked it to predict the residual:
residual = actual_solar - physics_prediction
The physics layer handles the easy part: geometry, seasonal patterns, diurnal curve. XGBoost handles only what physics cannot see — cloud cover, curtailments, aerosols, measurement noise. The result is a smaller, faster model that learns from a harder signal.
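In code, the residual setup is only a few lines. This is a sketch under assumed names: the column names, hyperparameters, and the df_train/df_test DataFrames (hourly actuals, the pvlib prediction, and Open-Meteo weather features) are illustrative:

```python
# Residual learning sketch. Column names and hyperparameters are
# illustrative; df_train/df_test are assumed hourly DataFrames.
from xgboost import XGBRegressor

features = ["physics_pred", "cloud_cover", "temperature", "shortwave_radiation"]
X_train = df_train[features]
y_train = df_train["actual_solar"] - df_train["physics_pred"]  # the residual

model = XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05)
model.fit(X_train, y_train)

# Final forecast: physics baseline plus the learned correction
forecast = df_test["physics_pred"] + model.predict(df_test[features])
```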
R² went from 0.78 (physics alone) to 0.92 (physics + residual learner). MAE dropped by 60%, from 3,856 MW to 1,552 MW.
The most interesting result: when I inspected XGBoost's feature importances, physics_pred was the top feature by a large margin. The model is primarily amplifying and correcting the physics signal, not ignoring it. That is exactly what you want from a physics-informed design.
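Checking this takes one line on the fitted model, continuing the residual sketch above:

```python
# Continuing the residual sketch: rank features by importance
import pandas as pd

ranking = pd.Series(model.feature_importances_, index=features)
print(ranking.sort_values(ascending=False))  # physics_pred should lead
```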
Calibrated Uncertainty
A point forecast is not enough for grid operators. They need to know the range. I trained three separate quantile models (q10, q50, q90) and then applied split conformal prediction to add a distribution-free coverage guarantee.
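A sketch of that two-step procedure, assuming XGBoost 2.0+ (which added the reg:quantileerror objective) and pre-split train/calibration/test arrays; the rest follows the standard conformalized quantile regression recipe:

```python
# Conformalized quantile regression sketch. X_train/y_train, X_cal/y_cal,
# and X_test are assumed to exist; the 0.90 level follows the text.
import numpy as np
from xgboost import XGBRegressor

def fit_quantile(X, y, q):
    m = XGBRegressor(objective="reg:quantileerror", quantile_alpha=q,
                     n_estimators=300)
    m.fit(X, y)
    return m

q10 = fit_quantile(X_train, y_train, 0.10)
q90 = fit_quantile(X_train, y_train, 0.90)

# Conformity scores on the held-out calibration window
lo, hi = q10.predict(X_cal), q90.predict(X_cal)
scores = np.maximum(lo - y_cal, y_cal - hi)

# Finite-sample corrected score quantile for 90% target coverage
n = len(y_cal)
qhat = np.quantile(scores, min(np.ceil((n + 1) * 0.90) / n, 1.0),
                   method="higher")

# Widen the raw band by qhat; coverage now holds distribution-free
p10 = q10.predict(X_test) - qhat
p90 = q90.predict(X_test) + qhat
```

The guarantee rests on exchangeability between the calibration and test data, which is exactly the assumption the seasonal shift described below violates.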
The P90 interval hit 0.869 empirical coverage on the test set, close to the 0.90 target. P50 was miscalibrated (0.28 observed vs 0.50 claimed) due to a seasonal distribution shift between the calibration and test sets. This is a known limitation of split conformal when the calibration window does not represent the test distribution — documented honestly in ADR-003.
Forecasts are evaluated with CRPS (Continuous Ranked Probability Score), which rewards both calibration and sharpness simultaneously. A wide but well-calibrated interval scores worse than a tight and accurate one. Final CRPS: 514.6 MW.
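CRPS has a convenient identity: it equals twice the pinball loss integrated over all quantile levels. With only three quantiles that integral can be coarsely approximated, which is enough to sanity-check a reported score; y_test and the preds mapping are assumed names here:

```python
# Coarse CRPS estimate from a few quantile forecasts via pinball loss.
# preds is assumed to map a quantile level to an array of forecasts.
import numpy as np

def pinball(y, q, tau):
    u = y - q
    return np.mean(np.maximum(tau * u, (tau - 1) * u))

levels = [0.10, 0.50, 0.90]
crps_approx = 2 * np.mean([pinball(y_test, preds[t], t) for t in levels])
```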
The Stack
The whole system runs with docker compose up --build. Three services start in order:
- TimescaleDB: raw data storage, with 26,000+ hourly records in hypertables
- FastAPI: serves the forecasts
- Streamlit: interactive dashboard with the P10/P50/P90 chart, physics decomposition, and reliability diagram
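As an illustration of the serving side, a forecast endpoint could look like the sketch below; the route, response schema, and stub data are assumptions, not the project's actual API:

```python
# Hypothetical forecast endpoint; route and schema names are illustrative.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Solar Forecast API")

class ForecastPoint(BaseModel):
    timestamp: str
    p10_mw: float
    p50_mw: float
    p90_mw: float

@app.get("/forecast/latest", response_model=list[ForecastPoint])
def latest_forecast():
    # The real service would read from TimescaleDB; stub row here
    return [ForecastPoint(timestamp="2024-06-01T12:00:00+02:00",
                          p10_mw=28_000.0, p50_mw=34_500.0, p90_mw=41_000.0)]
```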
The project is managed as a 22-day research sprint, including 5 Architecture Decision Records, 3 weekly reports, 2 retrospectives, a risk register, and a public Kanban board — because good science needs a paper trail.
Key Takeaways
- Physics + ML beats pure ML. A principled baseline reduces the learning problem to what matters.
- The model validates itself. If physics_pred is XGBoost's top feature, you know the architecture is working as intended.
- Calibration ≠ accuracy. CRPS and reliability diagrams reveal what RMSE hides, and honest reporting of miscalibration is more valuable than hiding it.
- One command deployment. Real systems should start reliably. Docker Compose enforces this discipline.
The repo is currently private while final documentation is completed. If you'd like early access or want to discuss the methodology, reach out on LinkedIn or via the contact page.