Why we invented a new accuracy metric before building the model

Rex Lee · January 8, 2026

In manufacturing, the metric you optimize for determines the failures you get. Optimize for average yield and you ship defective parts in the tails. Optimize for zero escapes and you scrap good product.

Grid forecasting has the same asymmetry. Under-predicting demand is dangerous. Over-predicting is wasteful. A 2 GW under-forecast during a heat wave triggers emergency reserves and price spikes. A 2 GW over-forecast idles some generators.

Both show up as the same MAPE. One is a nuisance. The other is a crisis.
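To make the blind spot concrete, here is a minimal sketch in Python (the demand values are illustrative, not from the paper): two forecasts that miss by 2 GW in opposite directions score identically on MAPE, while a simple signed-bias check tells them apart.

```python
import numpy as np

# Hypothetical hourly demand (GW) and two forecasts with mirrored errors.
actual = np.array([38.0, 41.0, 45.0, 44.0])
under = actual - 2.0  # under-forecasts by 2 GW every hour (shortfall risk)
over = actual + 2.0   # over-forecasts by 2 GW every hour (wasted capacity)

def mape(y, yhat):
    """Mean absolute percentage error: blind to the sign of the miss."""
    return np.mean(np.abs((y - yhat) / y)) * 100

def signed_bias(y, yhat):
    """Mean signed error: positive means under-forecast."""
    return np.mean(y - yhat)

print(mape(actual, under), mape(actual, over))                # identical
print(signed_bias(actual, under), signed_bias(actual, over))  # +2.0 vs -2.0
```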

Nobody was measuring this distinction. So before we trained a single model, we wrote a paper. Five neural architectures. 24 months of California data. Same dataset, same split. The finding: models with nearly identical MAPE had vastly different tail-risk profiles.

One missed during heat waves. Another missed during cold snaps. A third over-predicted systematically, which looked safe but proved expensive at scale.

We proposed bias-controlled loss functions. Penalize under-prediction more heavily, but with a governor so the model does not swing too far the other way. The math borrows from statistical process control. The intuition borrows from watching fab engineers argue about spec limits.
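The paper has the exact formulation; a minimal sketch of the idea might look like the following, where `under_penalty` and `governor` are illustrative names and weights I've chosen here, not the paper's parameters.

```python
import numpy as np

def bias_controlled_loss(y, yhat, under_penalty=3.0, governor=0.5):
    """Sketch of an asymmetric loss with a bias governor.

    Under-prediction (y > yhat) is charged `under_penalty` times more
    than over-prediction, and a penalty on the squared mean bias keeps
    the model from drifting into systematic over-forecasting.
    Names and weights are illustrative, not the paper's formulation.
    """
    err = y - yhat  # positive err = under-forecast
    asym = np.where(err > 0, under_penalty * err**2, err**2)
    bias = np.mean(err)
    return np.mean(asym) + governor * bias**2

actual = np.array([40.0, 42.0, 45.0])
print(bias_controlled_loss(actual, actual - 2.0))  # under-forecast: 14.0
print(bias_controlled_loss(actual, actual + 2.0))  # over-forecast: 6.0
```

Without the governor term, the cheapest way to dodge the asymmetric penalty is to over-forecast everywhere; the bias term caps how far that drift can go.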

By the time the paper was done, we had a metric we trusted, five architectures benchmarked, and a suspicion that the right architecture was none of the five.

Next: The architecture search that found something better

Read the paper

arXiv:2601.01410 — State Space Models for Safety-Critical Energy Systems