Why we invented a new accuracy metric before building the model

Rex Lee · January 8, 2026

In manufacturing, the metric you optimize for determines the failures you get. Optimize for average yield and you ship defective parts in the tails. Optimize for zero escapes and you scrap good product.

Grid forecasting has the same asymmetry. Under-predicting demand is dangerous. Over-predicting is wasteful. A 2 GW under-forecast during a heat wave triggers emergency reserves and price spikes. A 2 GW over-forecast idles some generators.

Both show up as the same MAPE. One is a nuisance. The other is a crisis.
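To make the blind spot concrete, here is a minimal sketch in Python (the demand values are illustrative, not from the paper): two forecasts that miss by 2 GW in opposite directions score identically on MAPE, while a simple signed-bias check tells them apart.

```python
import numpy as np

# Hypothetical hourly demand (GW) and two forecasts with mirrored errors.
actual = np.array([38.0, 41.0, 45.0, 44.0])
under = actual - 2.0  # under-forecasts by 2 GW every hour (shortfall risk)
over = actual + 2.0   # over-forecasts by 2 GW every hour (wasted capacity)

def mape(y, yhat):
    """Mean absolute percentage error: blind to the sign of the miss."""
    return np.mean(np.abs((y - yhat) / y)) * 100

def signed_bias(y, yhat):
    """Mean signed error: positive means under-forecast."""
    return np.mean(y - yhat)

print(mape(actual, under), mape(actual, over))                # identical
print(signed_bias(actual, under), signed_bias(actual, over))  # +2.0 vs -2.0
```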

Nobody was measuring this distinction. So before we trained a single model, we wrote a paper. Five neural architectures. 24 months of California data. Same dataset, same split. The finding: models with nearly identical MAPE had vastly different tail-risk profiles.

One missed during heat waves. Another missed during cold snaps. A third over-predicted systematically, which looked safe but proved expensive at scale.

We proposed bias-controlled loss functions. Penalize under-prediction more heavily, but with a governor so the model does not swing too far the other way. The math borrows from statistical process control. The intuition borrows from watching fab engineers argue about spec limits.
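The paper has the exact formulation; a minimal sketch of the idea might look like the following, where `under_penalty` and `governor` are illustrative names and weights I've chosen here, not the paper's parameters.

```python
import numpy as np

def bias_controlled_loss(y, yhat, under_penalty=3.0, governor=0.5):
    """Sketch of an asymmetric loss with a bias governor.

    Under-prediction (y > yhat) is charged `under_penalty` times more
    than over-prediction, and a penalty on the squared mean bias keeps
    the model from drifting into systematic over-forecasting.
    Names and weights are illustrative, not the paper's formulation.
    """
    err = y - yhat  # positive err = under-forecast
    asym = np.where(err > 0, under_penalty * err**2, err**2)
    bias = np.mean(err)
    return np.mean(asym) + governor * bias**2

actual = np.array([40.0, 42.0, 45.0])
print(bias_controlled_loss(actual, actual - 2.0))  # under-forecast: 14.0
print(bias_controlled_loss(actual, actual + 2.0))  # over-forecast: 6.0
```

Without the governor term, the cheapest way to dodge the asymmetric penalty is to over-forecast everywhere; the bias term caps how far that drift can go.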

By the time the paper was done, we had a metric we trusted, five architectures benchmarked, and a suspicion that the right architecture was none of the five.

Next: The architecture search that found something better

Read the paper

arXiv:2601.01410 — State Space Models for Safety-Critical Energy Systems