Sunki and I wrote up the results as two papers. The first (arXiv:2601.01410) was a straight benchmark: we took several model architectures and tested each one against the day-ahead forecast that each ISO actually publishes. The archive results suggested forecast-error reductions by grid, and the gains were largest when we stopped trying to pre-process weather data and just let the model figure it out. Later live tests did not reproduce the same advantage, which is why the current site separates the research archive from production evidence.
The second paper (arXiv:2602.21415) came out of something we noticed while running the first study: two models can have almost identical average error but behave completely differently during a heat wave or a cold snap. The state space models didn't fall apart under extreme weather the way some of the transformer variants did. For us that mattered more than average-case accuracy, because the extreme hours are where forecast errors actually cost money.
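To make that concrete, here is a minimal sketch of how two forecasts can tie on average error and still diverge in extreme hours. It is not the evaluation code from either paper. The function names, the temperature cutoffs, and the six-hour toy series are all made up for illustration.

```python
from statistics import mean

def mae(actual, forecast):
    """Mean absolute error over every hour in the sample."""
    return mean(abs(a - f) for a, f in zip(actual, forecast))

def extreme_hour_mae(actual, forecast, temps, lo, hi):
    """MAE over only the hours with temperature <= lo or >= hi."""
    errors = [abs(a - f) for a, f, t in zip(actual, forecast, temps)
              if t <= lo or t >= hi]
    if not errors:
        raise ValueError("no extreme hours in sample")
    return mean(errors)

# Toy data, invented for this example: six hours, two of them in a heat wave.
temps   = [20, 22, 21, 38, 39, 19]            # degrees C (hypothetical)
actual  = [100, 100, 100, 150, 150, 100]      # load (hypothetical units)
model_a = [95, 105, 95, 145, 155, 95]         # off by 5 every hour
model_b = [101, 99, 101, 137, 163, 101]       # off by 1 normally, 13 in the heat

print(mae(actual, model_a), mae(actual, model_b))            # 5 and 5
print(extreme_hour_mae(actual, model_a, temps, lo=0, hi=35),
      extreme_hour_mae(actual, model_b, temps, lo=0, hi=35)) # 5 and 13
```

Both toy models score the same overall, but the second one is more than twice as wrong in exactly the hours that matter. A real evaluation would pick the cutoffs per grid and season rather than hard-coding them.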
So we had the papers, the trained models, eval scripts, the whole research stack, and none of it was something anyone could actually use. When we started showing results to people in the industry, nobody wanted to talk about architecture comparisons. They wanted to know if they could call an endpoint. The first trading desk we spoke with spent most of the call on latency, authentication, and whether we supported CSV. Nobody asked about the benchmark numbers once they'd seen them. We didn't have an API.
We looked around. The ISOs publish their own forecasts, and those became the reference point we had to beat before making any production claim. The commercial vendors don't publish accuracy numbers, so you can't compare them to anything before you buy. The research existed. The production evidence standard still had to be built.
We incorporated in early 2026. The scope was straightforward because we already had models for all seven ISOs: extended-horizon research runs, a REST API, and JSON, CSV, and XML support, the last because energy companies still run a lot of XML pipelines. That part went fast.
Everything else nearly broke us. Weather data ingestion was a mess: multiple providers, different update schedules, and occasional silent failures where a feed just stops and you don't realize it until your 6 AM forecast run produces garbage. We wasted about two weeks debugging an issue where Open-Meteo occasionally returned stale data without changing the timestamp, so our cache thought it was fresh. Inference had to finish before the ISO's submission window closed or it was useless. Model updates had to deploy without downtime. And we needed monitoring that would catch accuracy drops before a customer noticed, which meant building a second system to watch the first system. None of this was in our research code, obviously.
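The stale-feed problem is the easiest of these to show in code. One way to guard against it, sketched here rather than copied from our production system, is to stop trusting the provider's timestamp and instead track when the forecast values themselves last changed. The class name, the `values` argument, and the one-hour threshold are all hypothetical.

```python
import hashlib
import json
import time

class FeedFreshness:
    """Tracks when a feed's values last changed, ignoring its own timestamp."""

    def __init__(self, max_unchanged_seconds):
        self.max_unchanged_seconds = max_unchanged_seconds
        self._last_digest = None
        self._last_change = None

    def observe(self, values, now=None):
        """Record one fetch; return True if the feed still looks fresh."""
        now = time.time() if now is None else now
        # Hash only the forecast values, so an unchanged or re-stamped
        # timestamp field cannot hide a payload that has stopped moving.
        encoded = json.dumps(values, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(encoded).hexdigest()
        if digest != self._last_digest:
            self._last_digest = digest
            self._last_change = now
        return (now - self._last_change) <= self.max_unchanged_seconds

# Usage (hypothetical threshold): flag a feed whose values sit still for an hour.
# checker = FeedFreshness(max_unchanged_seconds=3600)
# if not checker.observe(fetched_values):
#     ...  # fall back to another provider or hold the forecast run
```

This is a heuristic, not a guarantee: a feed can legitimately repeat itself for a while, so the threshold has to be tuned per provider and update schedule.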
We launched with all seven grids and put ISO baselines next to the numbers. The mistake was treating historical validation as enough. The current site now puts the live scorecard first, because if someone can't verify the accuracy against public data before they pay us, we've got bigger problems than marketing.
