Archive note: this post preserves the historical benchmark that originally motivated Gramm. Later live evaluation did not reproduce the same advantage, so these numbers should be read as a research archive, not as current production performance.

We started with CAISO. California was the natural first target because the duck curve makes forecasting interesting: solar generation creates a midday trough and a steep evening ramp that trip up simple models. Once CAISO was working, we expanded to ERCOT, then PJM, and eventually all seven major U.S. ISOs. Here is what we learned running the same architecture across very different grids.

Each grid has a distinct personality. CAISO is solar-heavy, with net load patterns that look nothing like gross load. The duck curve means you are really forecasting two things: total demand and the solar generation that offsets it. ERCOT is largely isolated electrically from the rest of the continent, which makes it volatile: there is little interconnection cushion when things go wrong. PJM is the largest grid by load, spanning 13 states with a mix of industrial, commercial, and residential demand. MISO stretches across climate zones from the Gulf Coast to the Canadian border. NYISO has the densest urban load center in the country (New York City), where a hot day means millions of window AC units turning on within the same hour. ISO-NE is heating-driven in winter, with natural gas constraints that create price spikes when gas is needed for both heating and power generation at the same time. SPP is wind-heavy, and wind variability makes the net load signal noisy.
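
To make the gap between gross and net load concrete, here is a minimal sketch. The six samples are made-up numbers for a stylized solar-heavy day, not CAISO data, and `net_load` and `max_ramp` are hypothetical helper names used only for illustration.

```python
def net_load(gross_mw, solar_mw, wind_mw=None):
    """Net load = gross demand minus solar and wind generation."""
    if wind_mw is None:
        wind_mw = [0] * len(gross_mw)
    if not (len(gross_mw) == len(solar_mw) == len(wind_mw)):
        raise ValueError("all series must have the same length")
    return [g - s - w for g, s, w in zip(gross_mw, solar_mw, wind_mw)]


def max_ramp(series):
    """Largest increase between consecutive samples."""
    return max(later - earlier for earlier, later in zip(series, series[1:]))


# Illustrative samples at 06:00, 09:00, 12:00, 15:00, 18:00, 21:00 (MW).
gross = [22000, 25000, 26000, 27000, 29000, 26000]
solar = [0, 6000, 11000, 9000, 1000, 0]

net = net_load(gross, solar)
print(net)              # midday trough, then a steep evening climb
print(max_ramp(gross))  # gross load changes gently
print(max_ramp(net))    # net load ramps much harder once solar drops off
```

In this toy example the gross series never moves more than 3,000 MW between samples, while the net series jumps 10,000 MW between 15:00 and 18:00. That evening ramp is the part of the duck curve that trips up simple models.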

One architecture, seven regional brains

We use one model architecture across all seven regions, but train per-region. The architecture and input shape stay the same; what changes is the training data and the regional weather station mappings. This is a deliberate choice. A single architecture means one codebase to maintain, one inference pipeline, one monitoring framework. Region-specific training means the model learns that CAISO and ISO-NE respond to weather differently, without forcing one shared set of weights to compromise between them.
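
One way to picture this split is a shared architecture config paired with a per-region config. The sketch below is an illustration under stated assumptions, not our actual code: the class names, field names, default values, data paths, and station identifiers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# The seven ISOs covered in this post.
REGIONS = ("CAISO", "ERCOT", "PJM", "MISO", "NYISO", "ISO-NE", "SPP")


@dataclass(frozen=True)
class ArchitectureConfig:
    """Shared by every region (field names and values are hypothetical)."""
    input_hours: int = 168
    n_features: int = 12
    hidden_size: int = 64


@dataclass(frozen=True)
class RegionConfig:
    """What varies per region: training data and weather station mapping."""
    region: str
    training_data_path: str
    weather_stations: Tuple[str, ...]


def build_training_jobs(
    arch: ArchitectureConfig, regions: List[RegionConfig]
) -> Dict[str, Tuple[ArchitectureConfig, RegionConfig]]:
    """Pair the one shared architecture with each region's own data."""
    return {cfg.region: (arch, cfg) for cfg in regions}


SHARED_ARCH = ArchitectureConfig()
region_configs = [
    RegionConfig(
        region=name,
        training_data_path=f"data/{name}.parquet",  # placeholder path
        weather_stations=(f"{name}-station-1",),    # placeholder station ID
    )
    for name in REGIONS
]
jobs = build_training_jobs(SHARED_ARCH, region_configs)
```

Each entry in `jobs` would produce its own set of weights, but every entry points at the same `ArchitectureConfig`, which is what keeps the codebase, inference pipeline, and monitoring uniform.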

Weather mattered more on some grids than others

The key finding in that archive run: weather integration matters more in some regions than others. MISO achieved a mean absolute percentage error (MAPE) of 1.74%, the lowest of any region: its large, stable industrial load creates a predictable base, and weather effects are moderated by geographic spread. PJM followed closely at 1.75%. At the other end, SPP came in at 3.30% MAPE. SPP was the hardest grid to forecast because wind generation variability directly impacts the net load signal, and wind is inherently harder to predict than temperature. The remaining regions fell in between: NYISO at 2.21%, ERCOT at 2.45%, CAISO at 2.82%, and ISO-NE at 3.18%.
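
For readers who want the metric spelled out, here is a minimal sketch of MAPE using the standard textbook definition. It is an assumption that the archive run used exactly this variant; details such as the time resolution and how zero-load intervals were handled are not covered here. The sample values are illustrative, not benchmark data.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Assumes every actual value is nonzero, which normally holds
    for grid-level load measured in MW.
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


# Illustrative values only: two forecasts are off by 2%, one is exact.
actual_mw = [100.0, 200.0, 400.0]
forecast_mw = [98.0, 204.0, 400.0]
score = mape(actual_mw, forecast_mw)
print(f"{score:.2f}%")
```

Because each error is scaled by the actual load, MAPE lets you compare a grid the size of PJM with a much smaller one on the same footing.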

Three patterns we did not expect

A few patterns emerged that we did not expect. First, grid size does not predict forecastability. PJM is by far the largest grid, but it was not the easiest to forecast. MISO was, by a narrow margin (1.74% versus 1.75% MAPE), despite being smaller. The stability of the underlying load matters more than aggregation effects. Second, regions with high renewable penetration are harder across the board, not because renewables make load harder to predict, but because the net load signal (what the grid actually needs to serve from dispatchable generation) has more variance. Third, the archive gap between Gramm and the ISO baseline was largest in regions where the ISO appeared to be using older methods. MISO and PJM showed roughly 55-60% MAPE reductions in that run, while regions with smaller baseline gaps (ISO-NE, SPP) showed reductions in the 14-30% range. Current production does not reproduce those gaps.
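
It helps to remember that a percentage reduction is measured relative to the baseline, so the same model error looks more impressive against a weaker baseline. A small sketch, assuming the usual relative-reduction formula; the inputs are illustrative values and not figures from the archive run.

```python
def relative_reduction_pct(baseline_mape, model_mape):
    """Percent reduction in MAPE relative to the baseline forecast."""
    if baseline_mape <= 0:
        raise ValueError("baseline MAPE must be positive")
    return 100.0 * (baseline_mape - model_mape) / baseline_mape


# Same hypothetical model error, two hypothetical baselines.
model = 2.0
vs_weak_baseline = relative_reduction_pct(baseline_mape=4.0, model_mape=model)
vs_strong_baseline = relative_reduction_pct(baseline_mape=2.5, model_mape=model)
print(vs_weak_baseline, vs_strong_baseline)
```

With these made-up inputs, an identical 2.0% model MAPE reads as a 50% reduction against a 4.0% baseline but only a 20% reduction against a 2.5% baseline. This is why the size of the gap says as much about the baseline as it does about the model.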

Expanding from one grid to seven forced us to build infrastructure that is region-aware at every layer: weather data ingestion routes to the right stations, model versioning is per-region, accuracy monitoring breaks out by territory, and the API lets you query by ISO code. It would have been easier to just do CAISO. But the benchmarking paper covered all seven, so the product had to as well. The live scorecard now decides which results can be marketed.
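
As a rough sketch of what "region-aware" means for accuracy monitoring, the snippet below groups forecast errors by ISO code. This is not the Gramm API or our monitoring code: the function names, the record layout, and the code normalization rule are assumptions made for illustration.

```python
from collections import defaultdict

KNOWN_REGIONS = frozenset(
    {"CAISO", "ERCOT", "PJM", "MISO", "NYISO", "ISO-NE", "SPP"}
)


def normalize_region(code):
    """Map user input such as ' caiso ' to a canonical ISO code."""
    key = code.strip().upper()
    if key not in KNOWN_REGIONS:
        raise KeyError(f"unknown ISO code: {code!r}")
    return key


def mape_by_region(records):
    """Break accuracy out per region.

    records: iterable of (iso_code, actual_mw, forecast_mw) tuples.
    Returns {iso_code: MAPE in percent}.
    """
    errors = defaultdict(list)
    for code, actual, forecast in records:
        region = normalize_region(code)
        errors[region].append(abs((actual - forecast) / actual))
    return {region: 100.0 * sum(e) / len(e) for region, e in errors.items()}


# Illustrative records, not real observations.
sample = [("caiso", 100.0, 95.0), ("CAISO", 200.0, 210.0), ("spp", 50.0, 45.0)]
breakdown = mape_by_region(sample)
print(breakdown)
```

Keying everything on one canonical ISO code is the design choice that matters here: station routing, model versions, and scorecards can then all share the same lookup.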

Benchmarking across seven U.S. grids: what we learned
