What March 2026 taught us about forecast anomalies

In mid-March 2026 a four-day heat event swept across CAISO, PJM, and parts of ERCOT. Load spiked 25-30% above seasonal normals on the peak days. Day-ahead forecast errors across every major U.S. grid jumped to 4-5x their usual monthly average. Our own numbers moved with them. Here is what we found when we dug in, and the engineering change we shipped in response.
March alone owned a third of our 2026 error
We decompose forecast errors monthly as a matter of course. When we ran the March decomposition we found something striking: across all four of the major grids we track most closely, March 2026 accounted for 25-35% of the full calendar-year MAPE, even though March is just 8% of the hours. Every one of the ten worst forecast days across those grids fell in March 2026. Not distributed across the year. Clustered.
Per grid: CAISO's March MAPE hit 6.91% against a baseline of roughly 2.2% over the rest of the year. ERCOT hit 5.09% against ~1.55%. PJM hit 4.34% against ~1.47%. MISO hit 5.08% against ~1.46%. Remove March from the test window and the numbers snap back to the ranges we published when we first benchmarked the architecture.
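For concreteness, here is a minimal sketch of the kind of monthly decomposition described above, assuming hourly data in a pandas DataFrame with hypothetical `timestamp`, `actual_mw`, and `forecast_mw` columns (the names and layout are illustrative, not our internal tooling):

```python
import pandas as pd

def monthly_error_decomposition(df: pd.DataFrame) -> pd.DataFrame:
    """Break hourly forecast error down by calendar month.

    Assumes hypothetical columns: timestamp, actual_mw, forecast_mw.
    """
    df = df.copy()
    df["ape"] = (df["forecast_mw"] - df["actual_mw"]).abs() / df["actual_mw"]
    df["month"] = pd.to_datetime(df["timestamp"]).dt.to_period("M")

    monthly = df.groupby("month").agg(
        mape=("ape", "mean"),      # per-month MAPE
        hours=("ape", "size"),     # hourly samples in the month
        ape_sum=("ape", "sum"),    # summed absolute percentage error
    )
    # Share of the year's total absolute percentage error owned by each month:
    # this is the "March owned a third of the error" view.
    monthly["error_share"] = monthly["ape_sum"] / monthly["ape_sum"].sum()
    monthly["hours_share"] = monthly["hours"] / monthly["hours"].sum()
    return monthly.sort_values("error_share", ascending=False)
```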
Is the model broken, the data broken, or something else?
The first question in any forecast-accuracy investigation is whether the signal is genuine model behavior or an artifact. We checked both.
The data was clean. CAISO daily-mean load on March 17-21 hit 26,000-27,000 MW against a March baseline of 20,000-21,000 MW. PJM peaks crossed 105,000 MW against an 85,000 MW baseline. These loads are real: the ISOs themselves logged them, and our HRRR weather inputs showed the corresponding temperature anomaly. The ingest pipeline recorded the heat event correctly.
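A rough sketch of that sanity check, assuming hypothetical hourly columns `timestamp`, `actual_mw` (our ingested load), and `iso_published_mw` (the ISO's own logged value); all names are illustrative:

```python
import pandas as pd

def check_load_spike(df: pd.DataFrame, history: pd.DataFrame,
                     z_threshold: float = 3.0) -> pd.DataFrame:
    """Flag days whose mean load sits far above the same calendar month's
    historical baseline, then cross-check flagged days against the ISO's
    own published values.

    Assumes hypothetical columns: timestamp, actual_mw, iso_published_mw
    (current data) and timestamp, actual_mw (multi-year history).
    """
    def daily_means(frame: pd.DataFrame) -> pd.DataFrame:
        out = frame.copy()
        ts = pd.to_datetime(out["timestamp"])
        out["date"], out["month"] = ts.dt.date, ts.dt.month
        return out.groupby(["date", "month"], as_index=False)["actual_mw"].mean()

    daily, hist = daily_means(df), daily_means(history)
    baseline = hist.groupby("month")["actual_mw"].agg(
        month_mean="mean", month_std="std")
    daily = daily.join(baseline, on="month")
    daily["z"] = (daily["actual_mw"] - daily["month_mean"]) / daily["month_std"]
    flagged = daily[daily["z"] > z_threshold]

    # If our ingested values track the ISO's published values on flagged days,
    # the spike is real load, not a pipeline artifact.
    cur = df.copy()
    cur["date"] = pd.to_datetime(cur["timestamp"]).dt.date
    check = cur[cur["date"].isin(flagged["date"])]
    gap = (check["actual_mw"] - check["iso_published_mw"]).abs() / check["iso_published_mw"]
    return flagged.assign(ingest_vs_iso_gap=gap.mean())
```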
The model was correct for its training distribution. Our training data runs 2022-2025. In those four years, we can point to exactly zero precedents of mid-March with sustained +10°C anomalies on the Pacific and Atlantic coasts simultaneously. The model saw temperatures within the training distribution (the HRRR gridded temperatures were not, on their own, extreme), but the joint pattern (temperatures this high this early in the spring transition, with a load response consistent with summer AC cycling) simply was not represented in the data it trained on. The model extrapolated using its closest learned pattern, which for mid-March was "late-winter shoulder demand." It predicted accordingly. Reality did not comply.
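The precedent check itself is simple; a sketch with hypothetical daily columns `date` and `temp_anomaly_c` (degrees above the day-of-year climatological mean):

```python
import pandas as pd

def count_joint_precedents(train_daily: pd.DataFrame,
                           month: int = 3,
                           anomaly_threshold_c: float = 10.0) -> pd.Series:
    """Count training-set days matching the joint pattern that caused the
    miss: the given calendar month AND a large positive temperature anomaly.

    Assumes hypothetical columns: date, temp_anomaly_c. Zero matches in every
    training year means the model had no in-sample precedent for the load
    response in this regime.
    """
    dates = pd.to_datetime(train_daily["date"])
    mask = (dates.dt.month == month) & (train_daily["temp_anomaly_c"] >= anomaly_threshold_c)
    return dates[mask].dt.year.value_counts().sort_index()
```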
This is the textbook definition of a distribution shift. Not a bug to fix, but a regime the model had no statistical basis to anticipate.
The daily mean was right. The hourly shape wasn't.
We broke March 2026 MAPE down further by hour and by load decile. Two patterns emerged. First, the intra-day shape was wrong more than the daily mean was wrong. On peak days the model got daily-mean load within 1-4%, but hourly MAPE ran 15-23%, because the AC-cycling timing, thermal-lag tails, and duck-curve evening ramps were all pulled toward summer patterns that the model had no March precedent for calibrating against. Second, the errors were persistent. Predictions missed the elevated load in the days leading into the event, during the event, and for a day or two after it, because the AC-duty-cycle tail does not end the moment the heat breaks.
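The level-versus-shape split in that decomposition can be made explicit; a sketch, again with hypothetical hourly `timestamp`, `actual_mw`, and `forecast_mw` columns:

```python
import pandas as pd

def level_vs_shape_error(df: pd.DataFrame) -> pd.DataFrame:
    """Split each day's forecast error into the daily-mean (level) miss and
    the hourly (shape) miss. A small level error alongside a large hourly
    MAPE is the signature described above: right mean, wrong intra-day shape.

    Assumes hypothetical columns: timestamp, actual_mw, forecast_mw.
    """
    df = df.copy()
    df["date"] = pd.to_datetime(df["timestamp"]).dt.date

    def per_day(g: pd.DataFrame) -> pd.Series:
        level_ape = abs(g["forecast_mw"].mean() - g["actual_mw"].mean()) / g["actual_mw"].mean()
        hourly_mape = ((g["forecast_mw"] - g["actual_mw"]).abs() / g["actual_mw"]).mean()
        return pd.Series({
            "daily_mean_ape": level_ape,           # small on the March peak days
            "hourly_mape": hourly_mape,            # large on the same days
            "shape_gap": hourly_mape - level_ape,  # error attributable to intra-day shape
        })

    return df.groupby("date")[["actual_mw", "forecast_mw"]].apply(per_day)
```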
This diagnosis matters because it tells us where to spend engineering effort. The issue was not that the model failed to respond to weather; it did, correctly, at the aggregate level. The issue was that the model's learned hour-by-hour shape was miscalibrated for the regime. A better backbone would not fix that. Better normalization of the input signal within each forecast window would.
The fix: let the model adapt to each window
We ran a targeted experiment against this specific failure mode and shipped a small inference-time architectural change that lets the model adapt to the operating regime of each forecast window, rather than relying on statistics baked in at training time. When the regime shifts, the model shifts with it.
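We are not reproducing the exact change here, but the general idea, standardizing each input window against its own statistics at inference time and mapping the prediction back afterward (close in spirit to reversible instance normalization), can be sketched as follows; the class, method, and `model.predict` names are illustrative, not our production code:

```python
import numpy as np

class WindowRenorm:
    """Per-window renormalization: standardize each input window against its
    own statistics before the model sees it, then map the prediction back.
    Illustrative sketch, not production code.
    """

    def __init__(self, eps: float = 1e-6):
        self.eps = eps
        self.mu = None
        self.sigma = None

    def normalize(self, window: np.ndarray) -> np.ndarray:
        # Statistics come from THIS window, not from the training set, so the
        # model sees in-distribution inputs even when the regime has drifted.
        self.mu = window.mean(axis=0, keepdims=True)
        self.sigma = window.std(axis=0, keepdims=True) + self.eps
        return (window - self.mu) / self.sigma

    def denormalize(self, prediction: np.ndarray) -> np.ndarray:
        # Map the model's normalized output back to the window's own scale.
        return prediction * self.sigma + self.mu


# Usage sketch (model.predict stands in for any point-forecast model):
# renorm = WindowRenorm()
# x_norm = renorm.normalize(input_window)   # shape (lookback_hours,) for a univariate load series
# y_norm = model.predict(x_norm)
# y_hat = renorm.denormalize(y_norm)        # forecast back in MW
```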
On CAISO, the results were directionally what we hoped for. Multi-seed training showed March MAPE dropping from 6.91% to 5.82%, a 1.09 percentage-point reduction isolated to the anomaly month. The overall CAISO MAPE held steady, but its composition shifted: less concentrated in the hard month, more evenly spread across the year.
The anomaly-month cap (the ratio of worst-month MAPE to annual MAPE) for CAISO moved from 2.36 to 2.01. That matters because we use that ratio as a customer-facing commitment: customers can plan around a worst-month cap, even if the absolute MAPE in that month runs higher than in other months.
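The cap itself is nothing more than a ratio; a sketch with hypothetical inputs:

```python
def anomaly_month_cap(monthly_mape: dict, annual_mape: float) -> float:
    """Ratio of the worst month's MAPE to the annual MAPE. The CAISO figures
    above correspond to this ratio moving from 2.36 to 2.01.
    """
    return max(monthly_mape.values()) / annual_mape
```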
Why this regime won't be the last
Better architecture matters. Fresher training data matters. But for anomaly handling specifically, neither is the first-order lever. The first-order lever is keeping the model in-distribution at inference time, which means renormalizing each window against its own statistics rather than trusting statistics learned months ago to apply to a regime the training never saw.
This matters commercially because it tells customers what to expect. Gramm publishes MAPE numbers. Those numbers are honest on the test window they were computed against. When a regime shift arrives, and one always does, a good forecast provider is the one that detects it within 24 hours, root-causes it within a week, and ships an architectural response within the month. That is our 2026 cycle. The March numbers are a permanent part of the record, and so is the fix.
The March 2026 heat event was not the last time we will see an out-of-distribution load regime. Climate variability, structural demand shifts from data-center buildout, and EV-charging patterns that do not yet exist at scale are all going to push load distributions away from the training history we have. Per-window renormalization is one piece of a larger toolkit: continuous retraining on fresh windows, regime-switching ensembles that route to specialist models, and conformal calibration for probabilistic coverage are all in the pipeline. We are building for the next regime shift, not the last one.
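None of those pieces has shipped yet, but to give a flavor of what conformal calibration adds, here is a minimal split-conformal sketch with illustrative names; our eventual implementation may differ:

```python
import numpy as np

def split_conformal_interval(cal_actual: np.ndarray,
                             cal_forecast: np.ndarray,
                             new_forecast: np.ndarray,
                             alpha: float = 0.1):
    """Split-conformal intervals around point forecasts: the (1 - alpha)
    quantile of absolute residuals on a held-out calibration window becomes
    a symmetric band around new forecasts. Coverage guarantees assume
    exchangeability, which is exactly what a regime shift strains.
    """
    residuals = np.abs(cal_actual - cal_forecast)
    n = len(residuals)
    # Finite-sample corrected quantile level.
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(residuals, q_level)
    return new_forecast - q, new_forecast + q
```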
If you are building against U.S. grid load forecasts and want to discuss anomaly behavior or methodology, we publish our work openly: research papers are here and per-grid accuracy is on the accuracy benchmarks page.