Why Your AI Demand Forecasts Still Miss 25–30% and How Relational Models Close the Gap

The uncomfortable failure pattern in AI demand forecasting usually does not look like a bad pilot. The company has already moved beyond spreadsheet seasonality rules and may have replaced ARIMA, Prophet, or XGBoost experiments with a production forecasting system. The dashboards look modern. The model retrains. Planners receive SKU-store forecasts on schedule. Yet finance still sees 25%+ overstock on slow movers, while the commercial team is explaining why trending adjacent items went out of stock.

That is the point where “tune the model” becomes too small an answer. If the same families of misses keep appearing around substitutes, promotions, assortment changes, and constrained supply, the useful question is no longer whether the algorithm is sophisticated enough. It is what the forecasting system is structurally unable to see.

[Figure: Traditional isolated SKU-store forecasting contrasted with relational graph-based demand forecasting]

For readers still comparing the first-order jump from statistical methods to machine learning, that benchmark belongs in a separate evaluation. The harder plateau starts after the upgrade, when the deployed system still treats each SKU-store pair as a mostly self-contained time series. That is where the substitution blind spot begins.

The miss is often sitting one shelf over

A stockout is rarely a clean disappearance of demand. In a fashion retail study, Li et al. found that 51.7% of unmet demand from a stockout spilled over to adjacent SKUs, while only 28.1% was genuinely lost sales. The demand that spilled over did not vanish; it moved through the assortment in ways the original SKU-level forecast could not account for.[1]

[Figure: Substitution spillover effect from a stockout to substitute products and lost sales]

The exact percentages should not be copied blindly into grocery, electronics, or CPG. The study setting matters. But the operational consequence is familiar across categories: the forecast error does not stay inside the item that triggered it. A missing size, flavor, color, pack size, or nearby brand can inflate demand for another item, suppress measured demand for the unavailable one, and leave the planner with two distorted signals instead of one clean failure.

This is why a slow mover can look overforecasted while its neighbor is underforecasted. The problem may not be that the model failed to learn the slow mover’s seasonality. It may be that the training data recorded demand after shoppers had already substituted into or away from the item. The shelf event changed the observed sales pattern, and the forecasting system treated the result as local history.
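
The distortion can be made concrete with a minimal sketch. It assumes a single substitute that never stocks out and a fixed spillover share; the function name, the input numbers, and the use of the 51.7% figure as a spill share are illustrative assumptions, not the study's method.

```python
def observed_sales(true_demand, on_hand, spill_share, substitute_demand):
    """Return what the sales history records for an item and one substitute.

    Assumptions (illustrative only): one substitute absorbs a fixed share
    of the unmet demand, and the substitute itself never stocks out.
    """
    sold = min(true_demand, on_hand)   # recorded sales are capped by inventory
    unmet = true_demand - sold         # demand the sales history never records
    spilled = unmet * spill_share      # portion that moves to the substitute
    return {
        "item_observed": sold,
        "substitute_observed": substitute_demand + spilled,
        "unmet": unmet,
    }


# Hypothetical week: 100 units wanted, 40 on the shelf, substitute normally sells 80.
week = observed_sales(true_demand=100, on_hand=40, spill_share=0.517, substitute_demand=80)
print(week)
```

A model trained on `item_observed` and `substitute_observed` as if they were independent histories learns a demand level that is too low for the first item and too high for the second.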

Once substitution is visible, the usual model leaderboard becomes less satisfying. A more accurate SKU-local model can still be trained on a damaged view of demand. It can improve the curve fit and still miss the movement of demand between products.

Why isolated SKU-store models hit a ceiling

Most production forecasting tables are built for convenience: one row per SKU, location, and time period, plus columns for price, promotion, inventory, calendar, and lagged sales. That shape is friendly to common machine learning workflows. It is also a flattening decision. It tells the model that the primary unit of learning is a single SKU-store history, with other facts attached as features.
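
As a rough illustration of that flattening, the sketch below builds one row per SKU, store, and week with lagged sales attached. The record layout, the function name `build_flat_rows`, and the sample data are hypothetical; a real pipeline would also attach price, promotion, inventory, and calendar columns.

```python
from collections import defaultdict


def build_flat_rows(sales, n_lags=2):
    """Turn (sku, store, week, units) records into one row per SKU-store-week.

    Each row carries only that series' own lagged sales, which is the
    flattening decision described above.
    """
    series = defaultdict(dict)
    for sku, store, week, units in sales:
        series[(sku, store)][week] = units

    rows = []
    for (sku, store), by_week in series.items():
        for week in sorted(by_week):
            row = {"sku": sku, "store": store, "week": week, "units": by_week[week]}
            for lag in range(1, n_lags + 1):
                # None marks a missing lag (no history for that week)
                row[f"lag_{lag}"] = by_week.get(week - lag)
            rows.append(row)
    return rows


# Hypothetical sample: SKU A has three weeks of history, SKU B has one.
sample = [("A", "S1", 1, 10), ("A", "S1", 2, 12), ("A", "S1", 3, 9), ("B", "S1", 3, 30)]
rows = build_flat_rows(sample)
for r in rows:
    print(r)
```

Nothing in a row for SKU A refers to SKU B, even though both sit in the same store in the same week.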

Substitution does not naturally live in that shape. It is not only a property of SKU A or SKU B. It is a relationship between them, and the strength of that relationship can change with price gaps, assortment depth, shelf availability, promotion timing, brand loyalty, and store context. A shopper who substitutes from one black running shoe to another is creating a demand signal on an edge, not just a data point in either item’s time series.

Feature engineering can patch part of this. A team can add category-level sales, similarity clusters, competitor promotion flags, stockout indicators, or hand-built substitute groups. That is usually worth doing. The limitation is that the team has to decide in advance which relationships matter, summarize them into columns, and maintain those columns as products, stores, and promotions change. The more dynamic the assortment, the more this becomes a second forecasting system hidden inside preprocessing.
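
A minimal sketch of such a patch follows. It assumes a hand-maintained substitute mapping and a set of known stockout events; the SKU names, `SUBSTITUTE_GROUPS`, and the function name are hypothetical.

```python
# Hypothetical, hand-maintained mapping from a SKU to its known substitutes.
SUBSTITUTE_GROUPS = {
    "shoe_black_A": ["shoe_black_B", "shoe_black_C"],
    "shoe_black_B": ["shoe_black_A"],
}


def substitutes_out_of_stock(sku, store, week, stockouts, groups=SUBSTITUTE_GROUPS):
    """Count how many listed substitutes of `sku` were stocked out in that store-week.

    `stockouts` is a set of (sku, store, week) tuples. SKUs missing from the
    mapping silently get 0, which is exactly the maintenance risk: a new item
    has no relationships until someone adds them.
    """
    return sum((sub, store, week) in stockouts for sub in groups.get(sku, []))


# Hypothetical stockout events.
stockouts = {("shoe_black_B", "S1", 12), ("shoe_black_C", "S1", 12)}
feature_a = substitutes_out_of_stock("shoe_black_A", "S1", 12, stockouts)
feature_new = substitutes_out_of_stock("shoe_black_D", "S1", 12, stockouts)
print(feature_a, feature_new)
```

The column is useful, but its quality depends entirely on how current the mapping is.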

Promotional cannibalization has the same problem. A discount on one item does not only lift that item’s sales; it can pull demand from nearby items, change basket composition, and alter replenishment timing. Supplier constraints propagate too. If a constrained component, vendor, or distribution lane limits supply for a set of related items, the demand plan and inventory outcome can shift across multiple SKUs even if the consumer-facing demand signal started in one place.

Kumo.ai’s published materials call this a “cross-table signal gap”: important predictive information exists across related entities, but the forecasting architecture has flattened those entities into isolated rows. The term is vendor-coined, so it should not be treated as a standard diagnosis by itself. The mechanism behind it, however, is straightforward: when demand behavior travels through relationships, a model trained primarily on independent item histories is forced to infer a network from columns that were never designed to represent one.

That is also where the often-cited 25–30% gap needs careful handling. Kumo.ai attributes this range to enterprise benchmarks and reports examples including 25% overstock reduction and $2–5 million in freed working capital after switching from isolated time-series approaches to relational demand models.[2] Those figures are useful as business stakes, not as a universal constant. The stronger conclusion is narrower: when the largest misses are relational, the forecast architecture has to represent relationships before model tuning can do much more.

For a broader implementation-risk lens, see "AI Demand Forecasting Implementation: The Risks Vendors Don't Emphasize" and "AI Demand Forecasting Challenges and Readiness: What Supply Chain Leaders Need to Know Before Implementing."

What changes when demand is modeled as a connected system

A relational or graph-based forecasting design starts from a different premise. Products, stores, customers, suppliers, promotions, and time periods are not merely fields in one table. They are entities with relationships: product-to-product similarity, store-to-store geography, product-to-promotion exposure, product-to-supplier dependency, and item-to-assortment membership.

That shift changes the forecasting task. Instead of asking only, “What will this SKU-store sell next week based on its own history and attached features?” the system can ask, “What should this node’s demand look like given what is happening to connected products, stores, promotions, and constraints?” The model is not expected to rediscover the commercial structure from a flat table every time.
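
The idea of a node seeing its connected products can be sketched as a single round of mean neighbor aggregation. This is a simplified, untrained illustration in the spirit of graph message passing, with no learned weights; the feature layout (last-week units, stockout flag), the edge list, and the function name are assumptions.

```python
def aggregate_neighbors(features, edges):
    """One round of mean aggregation over an undirected product graph.

    `features` maps node -> list of floats; `edges` is a list of (a, b) pairs.
    Each node's output is its own features followed by the mean of its
    neighbors' features (zeros if it has no neighbors).
    """
    neighbors = {node: [] for node in features}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    out = {}
    for node, own in features.items():
        nbrs = neighbors[node]
        if nbrs:
            mean = [sum(features[n][i] for n in nbrs) / len(nbrs) for i in range(len(own))]
        else:
            mean = [0.0] * len(own)
        out[node] = own + mean
    return out


# Hypothetical features: [last_week_units, stockout_flag]
features = {"A": [0.0, 1.0], "B": [120.0, 0.0], "C": [50.0, 0.0]}
edges = [("A", "B")]  # A and B are substitutes; C is unconnected
enriched = aggregate_neighbors(features, edges)
print(enriched)
```

After aggregation, B's representation includes the fact that its substitute A was stocked out, while C's does not change. A trained graph model would learn how much weight that neighbor signal deserves.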

| Demand behavior | What a flat SKU-store model tends to see | What a relational model can represent |
| --- | --- | --- |
| Stockout substitution | Lower sales for the out-of-stock item and higher sales elsewhere, often treated as separate local histories | Demand movement between substitute or adjacent products |
| Promotional cannibalization | Lift on the promoted item, with weaker visibility into suppressed demand nearby | Promotion effects across related products and categories |
| Supplier constraint propagation | Availability or service-level changes attached to individual SKUs | Shared upstream dependencies affecting multiple downstream forecasts |
| Assortment changes | Broken or sparse item history after listings, delistings, or replacements | Signals transferred across related items, stores, and product families |

The important implementation distinction is where the relationships live. If substitute relationships, promotion relationships, and supplier relationships exist only as manually engineered columns, they are brittle and incomplete. If they are native parts of the data representation, the model can learn from the structure itself.

The planning impact is practical. A replenishment planner does not need a model that sounds more advanced in architecture review. They need a forecast that does not treat a stockout-driven demand transfer as two unrelated forecast errors. A finance reviewer does not need another explanation that “the model was directionally right.” They need fewer cases where capital is trapped in the item that lost demand while the substitute runs dry.

The evidence is converging, but the sources do not prove the same thing

The case for relational forecasting is strongest when the evidence is kept in its proper lanes. Independent academic evidence shows substitution is material. Graph and relational benchmarks show that connected representations can improve forecast metrics in specific settings. Vendor benchmarks translate the architecture question into enterprise outcomes. Those are related signals, not interchangeable proof.

One GraphSAGE analysis on real FMCG data reported WAPE falling from 0.86 for a naive baseline to 0.62, described as a 27% error reduction, by allowing demand signals to flow across SKU connections.[3] That result is useful because it ties improvement to a connected representation rather than another round of purely local time-series tuning. It is still one published practitioner analysis, not a cross-industry law.
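
For readers who want to reproduce the metric on their own data, a minimal sketch follows. It uses the common definition of WAPE as total absolute error divided by total absolute actuals; whether the cited analysis used exactly this variant is an assumption, and the sample numbers are made up.

```python
def wape(actuals, forecasts):
    """Weighted absolute percentage error: sum|actual - forecast| / sum|actual|."""
    total_actual = sum(abs(a) for a in actuals)
    if total_actual == 0:
        raise ValueError("WAPE is undefined when total actual demand is zero")
    return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / total_actual


def relative_reduction(baseline, improved):
    """Fractional error reduction when moving from `baseline` to `improved`."""
    return (baseline - improved) / baseline


# Made-up example series.
example_wape = wape([10, 20, 30], [12, 18, 33])

# Applied to the rounded WAPE values quoted above.
reduction = relative_reduction(0.86, 0.62)
print(round(example_wape, 4), round(reduction, 3))
```

Using the rounded figures of 0.86 and 0.62, the relative reduction works out to roughly 0.279, consistent with the reported figure of about 27%.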

GraphDeepAR, an arXiv 2024 paper involving Amazon and adidas, reported a 2.05% average financial uplift across six article subsets at adidas. The same paper reported RMSE improvements of 4.36% on retail data and 31.98% on e-commerce data.[4] The spread matters. Relational methods can produce modest or large gains depending on the dataset, target, and commercial structure. That is exactly why the right diagnostic is not “graphs are always better,” but “are our errors concentrated in relationships the current architecture cannot represent?”

The SAP SALT and KumoRFM benchmark points in the same direction from a different angle. Kumo.ai reports that its relational foundation model scored 89% versus 75% for PhD-level data scientists hand-engineering features for XGBoost, framing the 14-percentage-point gap as an architecture difference rather than a better hand-tuned feature set.[5] Because this is vendor-published material, it should be read with attention to methodology and comparison design. The relevant takeaway is not that one vendor benchmark settles the matter; it is that hand-engineering relational signals into a flat model can become the ceiling.

Kumo.ai’s separate tool guide rates cross-product substitution as having high predictive power and affecting 5–8% of SKUs weekly, promotional lift interactions as having very high predictive power and driving 20–40% of volume, and supplier constraint propagation as having high predictive power.[6] Again, those are vendor disclosures. They are most useful as a checklist for what to test in your own demand data, not as independent proof of prevalence in every business.

Practitioners were already compensating for substitution before graph models became common

Albert Heijn’s stock-out substitution work is a useful reminder that this is not an academic fashion attached to graph neural networks. The company described a Naive Bayes substitution correction deployed across more than 30,000 SKUs to adjust demand forecasting for stock-out substitution.[7] That is not an end-to-end graph model. It is evidence that practitioners had already found the same wound and built a correction around it.

In many organizations, these compensations appear as planner overrides, category-specific business rules, or preprocessing tables maintained by analysts who know the assortment better than the model does. The existence of those workarounds should not be dismissed as “manual noise.” They are often the organization’s shadow map of demand relationships.

The architectural question is whether that shadow map should remain outside the forecasting model. If substitution, cannibalization, and constraint propagation are central to the misses, then keeping them as after-the-fact corrections makes every planning cycle depend on fragile reconciliation: the model predicts one thing, the business rules adjust another, and the planner has to explain the difference.

How to diagnose whether your plateau is architectural

The practical diagnosis does not start with a model bake-off. It starts with error clustering. Pull the largest forecast misses from the last several planning cycles and tag them by business context: stockout near substitutes, promotion overlap, new or removed assortment, supplier or DC constraint, price gap change, and store cluster behavior. The question is whether the misses are random residuals or relationship-shaped failures.

If the same item is wrong everywhere at the same time, review item-level features, launch history, price, or external demand drivers.
If one item is overstocked while close substitutes are stocked out, inspect substitution and availability effects before tuning hyperparameters.
If promoted items look accurate but adjacent items repeatedly miss, measure cannibalization rather than only promotional lift.
If multiple unrelated-looking SKUs miss together after supply disruption, map shared supplier, component, lane, or distribution dependencies.
If new items, delisted items, and replacements create recurring cold-start errors, test whether product and assortment relationships are being transferred or discarded.
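
The error-clustering step can be sketched as a simple share-of-error calculation. The tag names, the record layout, and the sample misses below are hypothetical; the tagging itself still depends on planners or analysts labeling each miss with its business context.

```python
from collections import defaultdict


def error_share_by_tag(misses):
    """Return each tag's share of total absolute forecast error, largest first.

    `misses` is a list of dicts with "abs_error" and "tag" keys.
    """
    total = sum(m["abs_error"] for m in misses)
    if total == 0:
        return {}
    by_tag = defaultdict(float)
    for m in misses:
        by_tag[m["tag"]] += m["abs_error"]
    ranked = sorted(by_tag.items(), key=lambda kv: kv[1], reverse=True)
    return {tag: err / total for tag, err in ranked}


# Hypothetical tagged misses from recent planning cycles.
misses = [
    {"sku": "A", "abs_error": 120, "tag": "stockout_near_substitute"},
    {"sku": "B", "abs_error": 80, "tag": "promotion_overlap"},
    {"sku": "C", "abs_error": 60, "tag": "stockout_near_substitute"},
    {"sku": "D", "abs_error": 40, "tag": "unexplained"},
]
shares = error_share_by_tag(misses)
for tag, share in shares.items():
    print(f"{tag}: {share:.0%}")
```

If most of the error mass lands on relationship-shaped tags rather than on "unexplained", that is evidence the plateau is architectural rather than a tuning problem.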

This diagnosis also prevents a common procurement mistake. A vendor can show a stronger algorithm and still require the same flattened training table. Another vendor can support relational data structures but only for narrow feature lookup, not for learning across product, store, promotion, and supplier relationships. The evaluation criterion should be concrete: where do the relationships enter the model, how are they updated, and can the system explain which connected signals influenced a forecast?

For vendor selection, this belongs alongside accuracy, latency, integration burden, planner workflow, and governance. See "How to Evaluate and Select AI-Powered Demand Forecasting Tools."

What relational models do not solve

Relational forecasting does not erase bad inventory records, late promotion calendars, broken master data, or incentives that reward forecast gaming. It also does not remove the need to decide what counts as a valid substitute, how to represent hierarchy, or how to manage changing assortments. A graph built from stale or politically convenient relationships will simply make the wrong structure look technical.

It also does not mean every business needs a full graph neural network deployment immediately. Some organizations should first correct stockout censoring, clean product hierarchies, or formalize substitute groups. A relational foundation model may be the right endpoint; a substitution-aware preprocessing layer may be the first responsible step. The architecture should follow the error evidence.
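
As one example of a modest first step, the sketch below excludes stocked-out weeks when estimating a baseline demand level. This is plain masking, not a full demand-unconstraining method, and the function name and sample history are hypothetical.

```python
def in_stock_mean(units, in_stock):
    """Average weekly units over weeks where the item was actually available.

    Returns None when there are no in-stock weeks to learn from.
    """
    kept = [u for u, available in zip(units, in_stock) if available]
    return sum(kept) / len(kept) if kept else None


# Hypothetical history: weeks 3 and 4 were stocked out for all or part of the week.
units = [50, 52, 10, 0, 49]
in_stock = [True, True, False, False, True]

naive_mean = sum(units) / len(units)        # treats censored weeks as real demand
masked_mean = in_stock_mean(units, in_stock)
print(round(naive_mean, 1), round(masked_mean, 1))
```

The naive average understates demand because it counts weeks when customers could not buy the item. Masking removes that bias for the stocked-out item, but it does nothing about the inflated sales recorded on its substitutes, which is where a relational representation becomes relevant.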

Recent operations research continues to examine inventory management under customer substitution, which is another reason to keep the scope precise: substitution is not just a forecasting metric problem. It affects replenishment, inventory positioning, and how service levels are interpreted when customers switch products instead of walking away.[8]

The decision rule

If forecast misses are broadly distributed, local, and explainable by weak item history or poor features, better feature work and model tuning may still be the right path. If the largest misses cluster around substitutes, promotions, assortment changes, or constrained supply, the current system is probably being asked to infer a connected demand system from a flattened table.

At that point, the next evaluation is not whether the forecasting tool has a more impressive algorithm name. It is whether the architecture can represent demand as a connected system, so the signal that starts at one SKU, promotion, store, or supplier is allowed to travel to the places where the business will actually feel it.

References

[1] Managing Stockouts Under Customer Substitution, Manufacturing & Service Operations Management, 2022, https://pubsonline.informs.org/doi/10.1287/msom.2022.1135
[2] Demand Forecasting Complete Guide, Kumo.ai, https://kumo.ai/resources/learn/guide/demand-forecasting-complete-guide/
[3] Time Series Isn’t Enough: How Graph Neural Networks Change Demand Forecasting, Towards Data Science, Jan 2026, https://towardsdatascience.com/time-series-isnt-enough-how-graph-neural-networks-change-demand-forecasting/
[4] GraphDeepAR: A Probabilistic Forecasting Model Using Graph Neural Networks, arXiv, 2024, https://arxiv.org/html/2401.13096v1
[5] KumoRFM: A Relational Foundation Model, arXiv, https://arxiv.org/html/2604.12596v1
[6] Best Demand Forecasting Tools, Kumo.ai, https://kumo.ai/resources/learn/best-demand-forecasting-tools/
[7] Accounting for Stock-Out Substitution in Demand Forecasting at Scale, Albert Heijn Technology, https://blog.ah.technology/accounting-for-stock-out-substitution-in-demand-forecasting-at-scale-88d264102ee4
[8] Data-driven inventory management under customer substitution, European Journal of Operational Research, 2025, https://www.sciencedirect.com/science/article/pii/S0377221725010008