The relationship between temperature and electricity demand is not linear. Everyone in the industry knows this in theory. In practice, most production forecasts are still built on methods that assume it is close enough.
Here is what happens. Between roughly 50 and 75 degrees Fahrenheit, demand is relatively flat: mild weather, minimal heating or cooling. Below 50, heating load starts climbing. Above 75, air conditioning ramps up. Plotted across this moderate range, the curve is roughly quadratic, a U shape centered on mild weather. But at the extremes, below 20 or above 100, the curve steepens dramatically. Every additional degree matters more, not less. Heat pumps hit their efficiency limits. Industrial curtailment patterns shift. Behavioral responses (people opening windows vs. cranking the AC) change discontinuously.
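To make that shape concrete, here is a stylized version of the curve in Python. Every breakpoint and coefficient is invented for illustration; only the shape matters.

```python
import numpy as np

def stylized_load(temp_f):
    """Stylized temperature-to-load curve. All coefficients are made up
    for illustration; only the shape matters."""
    t = np.asarray(temp_f, dtype=float)
    load = np.full(t.shape, 40.0)  # hypothetical baseline load (GW) in the comfort band

    # Roughly quadratic heating/cooling growth outside the 50-75F comfort band
    load += 0.012 * np.clip(50.0 - t, 0.0, None) ** 2
    load += 0.020 * np.clip(t - 75.0, 0.0, None) ** 2

    # Extra steepening past the extremes: heat pumps hitting efficiency
    # limits, behavioral discontinuities, shifting curtailment patterns
    load += 0.6 * np.clip(20.0 - t, 0.0, None) ** 1.5
    load += 0.8 * np.clip(t - 100.0, 0.0, None) ** 1.5
    return load
```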
Conventional forecasting methods — regression, ARIMA, basic feed-forward neural networks — are calibrated on the bulk of historical data, which is mostly normal weather. They fit the middle of the distribution well. At the tails, they underpredict demand during heat waves and often mischaracterize heating load during polar vortex events. This is exactly when forecast accuracy matters most, because these are the hours when imbalance costs spike and grid reliability is at risk.
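Here is a toy demonstration of the failure mode, reusing the stylized_load function from the sketch above: fit a single quadratic, the way a simple regression would, on a sample dominated by mild weather, then compare it to the true curve away from the middle.

```python
import numpy as np

rng = np.random.default_rng(0)

# Historical hours: mostly mild weather, extreme hours badly underrepresented
temps = np.clip(rng.normal(loc=65.0, scale=12.0, size=10_000), 0.0, 110.0)
loads = stylized_load(temps) + rng.normal(scale=0.5, size=temps.size)

# A "conventional" model: one quadratic fit across all hours
fit = np.poly1d(np.polyfit(temps, loads, deg=2))

for t in (65.0, 95.0, 105.0):
    truth = float(stylized_load(t))
    print(f"{t:5.1f}F  true {truth:6.1f}  fitted {fit(t):6.1f}")
# The single quadratic tracks the dense middle of the distribution but
# drifts at the sparse tails, where the true curve steepens.
```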
The ISO baselines published by grid operators have the same problem. They are tuned to minimize average error across all hours, which means they sacrifice tail accuracy for overall MAPE. That is a reasonable choice for a system operator reporting an aggregate metric. It is a bad outcome for a trader or operations team that needs to know what demand will be during the 50 hardest hours of the year.
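If those 50 hours are what you care about, measure them directly. A hypothetical helper, with "hardest" defined here as the highest-load hours (highest-price or highest-error definitions slot in the same way):

```python
import numpy as np

def hardest_hours_mape(actual, forecast, k=50):
    """MAPE restricted to the k highest-load hours of the evaluation period."""
    idx = np.argsort(actual)[-k:]  # indices of the k hours with the highest actual load
    ape = np.abs(forecast[idx] - actual[idx]) / actual[idx]
    return float(ape.mean() * 100.0)
```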
Deep learning architectures handle this differently. In our benchmarking study (arXiv:2602.21415), the models that performed best during extreme weather were the ones that ingested raw weather covariates — not just temperature, but humidity, wind speed, solar irradiance, and dew point. These models learn the nonlinear interactions automatically. They do not need a meteorologist to hand-engineer a “cooling degree day” feature. The architecture itself discovers that 90 degrees with 80% humidity produces a different load shape than 90 degrees with 30% humidity.
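As a minimal sketch of what ingesting raw covariates means at the input layer: this toy MLP is nothing like the benchmarked architectures, but it is about the smallest model that is even free to learn a temperature-humidity interaction.

```python
import torch
import torch.nn as nn

# Raw hourly weather covariates: temperature (F), relative humidity,
# wind speed (m/s), solar irradiance (W/m2), dew point (F)
N_FEATURES = 5

model = nn.Sequential(
    nn.Linear(N_FEATURES, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 1),  # predicted load for the hour
)

# Two 90F hours: one humid, one dry. A hand-engineered cooling-degree-day
# feature collapses them to the same value; with raw covariates the
# network is free to learn a different load response for each.
humid_hour = torch.tensor([[90.0, 0.80, 4.0, 650.0, 83.0]])
dry_hour = torch.tensor([[90.0, 0.30, 4.0, 650.0, 55.0]])
print(model(humid_hour), model(dry_hour))  # untrained weights, arbitrary outputs
```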
The ERCOT results illustrate this clearly. Over our evaluation period, Gramm reduced MAPE from 5.1% (the ISO baseline) to 1.62% — a 68.2% reduction. But the improvement was not evenly distributed across all hours. The biggest gains came during extreme weather periods: the summer heat wave hours where baseline errors spiked to 8-12% and Gramm held at 2-3%. During mild shoulder season days, the improvement was modest because the baseline was already decent.
This is the core point: if you evaluate a forecast on average MAPE alone, you miss the distribution of errors. A forecast with 2% average MAPE and 3% tail MAPE is fundamentally more useful than one with 2% average MAPE and 10% tail MAPE, even though the headline numbers are identical. The hours when forecasts fail are the hours when getting them right has the highest economic value.
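A synthetic example makes the point; all numbers here are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
actual = rng.uniform(30.0, 50.0, size=8760)  # hypothetical hourly load, GW
peak = np.argsort(actual)[-50:]              # the 50 hardest hours

# Forecast A errs ~2% everywhere; forecast B errs ~10% in the peaks and
# slightly less elsewhere, so the two headline MAPEs come out the same.
fc_a = actual * 1.02
fc_b = actual * 1.0195
fc_b[peak] = actual[peak] * 1.10

def mape(a, f):
    return float(np.mean(np.abs(f - a) / a) * 100.0)

print(f"A: overall {mape(actual, fc_a):.2f}%, 50 hardest {mape(actual[peak], fc_a[peak]):.2f}%")
print(f"B: overall {mape(actual, fc_b):.2f}%, 50 hardest {mape(actual[peak], fc_b[peak]):.2f}%")
# A: overall 2.00%, 50 hardest 2.00%
# B: overall 2.00%, 50 hardest 10.00%
```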
We designed Gramm to optimize for exactly this. The training pipeline overweights extreme weather hours. The weather data ingestion runs at sub-hourly frequency to catch rapid changes. And the accuracy page breaks out performance by condition, not just as a single aggregate number.
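Concretely, overweighting can be as simple as a weighted training loss. The sketch below is one plausible scheme, not a description of Gramm's actual pipeline:

```python
import torch

def extreme_weather_weights(temp_f, comfort_lo=50.0, comfort_hi=75.0, per_degree=0.05):
    """Loss weights that grow with distance from the comfort band: 1.0 for
    mild hours, plus 0.05 per degree outside 50-75F (made-up numbers)."""
    below = torch.clamp(comfort_lo - temp_f, min=0.0)
    above = torch.clamp(temp_f - comfort_hi, min=0.0)
    return 1.0 + per_degree * (below + above)

def weighted_mse(pred, target, temp_f):
    """MSE with extreme-weather hours counted more heavily."""
    return (extreme_weather_weights(temp_f) * (pred - target) ** 2).mean()
```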
Normal-weather forecasting is a solved problem. Extreme-weather forecasting is where the value is, and where deep learning actually earns its complexity budget.