Why grid demand forecasts fail during extreme weather

Rex Lee & S.C. Hong · February 3, 2026

The relationship between temperature and electricity demand is not linear. Everyone in the industry knows this in theory. In practice, most production forecasts are still built on methods that assume it is close enough.

Demand bends, then breaks, with temperature

Here is what happens. Between roughly 50 and 75 degrees Fahrenheit, demand is relatively flat: the weather is mild, and there is minimal heating or cooling. Below 50, heating load starts climbing. Above 75, air conditioning ramps up. Around the comfortable range, the relationship looks roughly quadratic. But at the extremes, below 20 or above 100, the curve steepens dramatically. Every additional degree matters more, not less. Heat pumps hit their efficiency limits. Industrial curtailment patterns shift. Behavioral responses (people opening windows vs. cranking AC) change discontinuously.
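The shape described above can be sketched as a toy piecewise curve. This is a minimal illustration, not a fitted model: the function name, the 40,000 MW base load, and every coefficient are invented for the example. Only the breakpoints (50, 75, 20, and 100 degrees) come from the text.

```python
def illustrative_demand_mw(temp_f: float) -> float:
    """Toy load curve in MW. Every coefficient here is hypothetical."""
    base = 40_000.0  # assumed flat load inside the 50-75 F comfort band
    if temp_f < 50:
        load = base + 8.0 * (50 - temp_f) ** 2  # heating, roughly quadratic
        if temp_f < 20:
            load += 600.0 * (20 - temp_f)  # extra steepening in extreme cold
    elif temp_f > 75:
        load = base + 12.0 * (temp_f - 75) ** 2  # cooling, roughly quadratic
        if temp_f > 100:
            load += 900.0 * (temp_f - 100)  # extra steepening in extreme heat
    else:
        load = base
    return load


def marginal_mw_per_degree(temp_f: float) -> float:
    """Change in load when the temperature rises by one more degree."""
    return illustrative_demand_mw(temp_f + 1) - illustrative_demand_mw(temp_f)


for temp in (60, 80, 90, 102):
    print(temp, marginal_mw_per_degree(temp))
```

In this sketch one extra degree costs nothing at 60, and progressively more at 80, 90, and 102, which is the "every additional degree matters more" behavior in miniature.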

Why the textbooks fail at the tails

Conventional forecasting methods (regression, ARIMA, basic feed-forward neural networks) are calibrated on the bulk of historical data, which is mostly normal weather. They fit the middle of the distribution well. At the tails, they underpredict demand during heat waves and often mischaracterize heating load during polar vortex events. This is exactly when forecast accuracy matters most, because these are the hours when imbalance costs spike and grid reliability is at risk.
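To make the tail underprediction concrete, here is a small sketch using synthetic data. It assumes a made-up "true" load curve that steepens above 100 degrees, fits an ordinary least-squares line on cooling degrees using only 60 to 95 degree history, and then asks that line for a 105 degree forecast. The curve, its coefficients, and the training range are all assumptions chosen for illustration.

```python
def true_demand_mw(temp_f):
    """Synthetic 'ground truth' load in MW; all coefficients are invented."""
    cooling = max(0.0, temp_f - 75.0)
    extreme = max(0.0, temp_f - 100.0)
    return 40_000.0 + 12.0 * cooling ** 2 + 900.0 * extreme


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Training history covers 60-95 F only, so the model never sees a 100+ F hour.
train_temps = list(range(60, 96))
train_cooling = [max(0.0, t - 75.0) for t in train_temps]
train_load = [true_demand_mw(t) for t in train_temps]
intercept, slope = fit_line(train_cooling, train_load)

heat_wave_temp = 105
predicted = intercept + slope * max(0.0, heat_wave_temp - 75.0)
actual = true_demand_mw(heat_wave_temp)
shortfall_mw = actual - predicted  # positive means the model underpredicts

print(f"predicted={predicted:.0f} MW, actual={actual:.0f} MW, "
      f"shortfall={shortfall_mw:.0f} MW")
```

Because the synthetic curve is convex and the fit is dominated by mild hours, the extrapolated line lands well below the true load at 105 degrees. Real systems are noisier, but the mechanism is the same.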

The ISO baselines published by grid operators have the same problem. They are tuned to minimize average error across all hours, which means they sacrifice tail accuracy for overall MAPE. That is a reasonable choice for a system operator reporting an aggregate metric. It is a bad outcome for a trader or operations team that needs to know what demand will be during the 50 hardest hours of the year.

What deep learning actually learns from raw weather

Deep learning architectures handle this differently. In our benchmarking study (arXiv:2601.01410), the models that performed best during extreme weather were the ones that worked with richer atmospheric inputs and learned the nonlinear interactions automatically, without a meteorologist hand-engineering features. The model itself learns that high-temperature, high-humidity hours produce a different load shape than the same temperature in dry air.

The ERCOT archive results illustrated this clearly, but they should not be read as current production evidence. The older held-out run showed a large MAPE reduction versus the ISO baseline, with the biggest gains during extreme weather periods. Later live tests did not reproduce that advantage, which is why the current accuracy page separates archive results from live scorecards.

This is the core point: if you evaluate a forecast on average MAPE alone, you miss the distribution of errors. A forecast with 2% average MAPE and 3% tail MAPE is fundamentally more useful than one with 2% average MAPE and 10% tail MAPE, even though the headline number is identical. The hours when forecasts fail are the hours when getting them right has the highest economic value.
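The comparison above can be reproduced with a few lines of arithmetic. In this sketch, "tail" is assumed to mean the k hours with the highest actual demand, which is one reasonable definition among several. The ten-hour series and both forecasts are made up so that the numbers match the 2%/3% and 2%/10% example.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, returned as a fraction (0.02 = 2%)."""
    return sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual)


def tail_mape(actual, forecast, k):
    """MAPE restricted to the k hours with the highest actual demand."""
    hardest = sorted(range(len(actual)), key=lambda i: actual[i], reverse=True)[:k]
    return mape([actual[i] for i in hardest], [forecast[i] for i in hardest])


# Eight ordinary hours and two peak hours (arbitrary units, invented data).
actual = [100.0] * 8 + [200.0] * 2
forecast_a = [98.25] * 8 + [194.0] * 2  # 1.75% off normally, 3% off at peak
forecast_b = [100.0] * 8 + [180.0] * 2  # exact normally, 10% off at peak

for name, forecast in (("A", forecast_a), ("B", forecast_b)):
    print(name, round(mape(actual, forecast), 4),
          round(tail_mape(actual, forecast, k=2), 4))
```

Both forecasts report the same 2% average, and only the tail metric separates them.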

How we built for the hours that matter

We designed Gramm to optimize for exactly this. Tail-risk hours carry more weight in our training and evaluation than they would in a vendor optimizing only for headline MAPE. The accuracy page breaks performance out by condition, not just as a single aggregate number.
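One generic way to give tail hours more weight is a weighted error metric. The sketch below is not a description of Gramm's actual training objective or evaluation code. The function names, the choice of "top k hours by actual demand" as the tail, and the weight of 5.0 are all hypothetical.

```python
def tail_weights(actual, k, tail_weight):
    """Weight 1.0 for ordinary hours, tail_weight for the k highest-demand hours."""
    hardest = set(
        sorted(range(len(actual)), key=lambda i: actual[i], reverse=True)[:k]
    )
    return [tail_weight if i in hardest else 1.0 for i in range(len(actual))]


def weighted_mape(actual, forecast, weights):
    """Weighted mean absolute percentage error, returned as a fraction."""
    total = sum(w * abs(a - f) / a for a, f, w in zip(actual, forecast, weights))
    return total / sum(weights)


# Invented data: both forecasts have the same unweighted MAPE of 2%.
actual = [100.0] * 8 + [200.0] * 2
forecast_a = [98.25] * 8 + [194.0] * 2  # small errors everywhere
forecast_b = [100.0] * 8 + [180.0] * 2  # exact normally, 10% off at peak

weights = tail_weights(actual, k=2, tail_weight=5.0)  # 5.0 is an assumption
score_a = weighted_mape(actual, forecast_a, weights)
score_b = weighted_mape(actual, forecast_b, weights)
print(round(score_a, 4), round(score_b, 4))
```

With the peak hours upweighted, the forecast that misses the peaks scores clearly worse, even though an unweighted average would rank the two as equal. The same weights could in principle be passed to a training loss, though how much to upweight is a judgment call.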

Normal-weather forecasting is a solved problem. Extreme-weather forecasting is where the value is, and where deep learning actually earns its complexity budget.
