How to Evaluate AI Forecasting Tools: A Buyer's Framework for Supply Chain Leaders

When evaluating AI for forecasting, the first number to distrust is usually the one printed in a vendor deck. A polished demo can make almost any forecast look respectable if the test is built around aggregate MAPE, a forgiving backtest, and data that was cleaned to fit the story. That is not the same thing as helping planners decide what to replenish next week, what to substitute, or which stockout will ripple into the next echelon.

The more useful question is not which tool posts the best benchmark, but what relationships the model can actually see. Kumo.ai's vendor-published analysis says isolated time-series approaches miss 25–30% of demand signal because they cannot see cross-product substitution or supplier constraints, and its SAP SALT benchmark reports 89% accuracy for a relational graph model versus 75% for PhD data scientists using XGBoost and 63% for LLM+AutoML. Those figures are useful for the architecture question, but they are still vendor-published evidence, not independent proof that one named vendor wins every evaluation [1].

Split-screen illustration comparing isolated time-series columns with an interconnected demand network.

The architecture question behind the demo

A vendor can usually make one of four stories sound persuasive, but the architectures behind those stories do not behave the same way once the forecast enters a real planning process:

Isolated time-series models treat each SKU-store pair as its own island. They are often fine when demand is stable and interactions are weak, but they are structurally blind to substitution, shared supply constraints, and cascade effects.
AutoML on flat tables can automate feature search and model selection, but it still sees records as rows. If the relationship matters more than the row, the model has to be engineered around that limitation.
Integrated planning platforms combine forecasting with broader planning workflows. That is useful when the operating model is already complex, but the buyer should expect the forecast to arrive through ETL, master-data cleanup, workflow design, and change management rather than a quick API call.
Relational graph-based systems try to represent the network itself: products, promotions, customers, suppliers, and echelons. That makes them better suited to questions where one event changes the meaning of another.

That last point matters most when a stockout does not end the demand story. If Product A stocks out and Product B absorbs some of that demand, the forecast problem is not just a lower number for A. It is a connected movement inside the assortment. Kumo.ai's analysis says substitution affects 5–8% of SKUs weekly in retail and CPG environments, which is large enough to create repeated planner pain, but narrow enough that aggregate accuracy can hide it completely [1].

Diagram showing stockout-driven demand shifting from Product A to Product B.

That is why the buying team should ask a structural question about the model before it asks for a score: what does this system represent natively, and what does it force us to approximate by hand? If the answer is "each SKU-store pair in isolation," then promotion lift, substitution, and supplier constraint propagation all have to be reconstructed later with feature engineering, rules, or reconciliation logic. A relational model moves that burden into the core architecture.
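To make "approximate by hand" concrete, here is a minimal sketch of the kind of substitution rule a team might bolt onto isolated per-SKU forecasts. The product names, the 0.6 transfer share, and the function name are all hypothetical and chosen for illustration; this is not any vendor's method, and it only handles a single hop of substitution.

```python
# Hypothetical substitution edges: the share of a product's unmet demand
# that is assumed to move to each substitute. Values are illustrative only.
SUBSTITUTION = {"product_a": {"product_b": 0.6}}


def adjust_for_stockouts(base_forecast, available):
    """Shift unmet demand from stocked-out products to their substitutes.

    base_forecast: dict of sku -> units forecast for that SKU in isolation
    available:     dict of sku -> units on hand (a missing sku means unconstrained)
    """
    # Demand that cannot be served because the shelf runs out.
    unmet = {
        sku: max(0.0, demand - available.get(sku, demand))
        for sku, demand in base_forecast.items()
    }
    # Expected sales of each product are capped by its own availability.
    adjusted = {sku: demand - unmet[sku] for sku, demand in base_forecast.items()}
    # One hop only: part of the unmet demand lands on known substitutes.
    for sku, lost in unmet.items():
        for substitute, share in SUBSTITUTION.get(sku, {}).items():
            if substitute in adjusted:
                adjusted[substitute] += lost * share
    return adjusted


base = {"product_a": 100.0, "product_b": 80.0}
on_hand = {"product_a": 30.0, "product_b": 500.0}
adjusted = adjust_for_stockouts(base, on_hand)
```

With these made-up inputs, Product A is capped at its 30 available units and Product B's expected demand rises from 80 to about 122. The point of the sketch is the maintenance burden: every edge in `SUBSTITUTION` has to be discovered, estimated, and kept current by someone, which is the work a relational model is supposed to absorb.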

Deployment speed is not the same as readiness

Buyers often compare tools as if a fast pilot and a finished deployment were the same event. They are not. A relational tool can often generate first forecasts from raw tables in days, which is a real advantage when the team wants to test whether the architecture can see the right relationships. Integrated planning platforms such as o9, Anaplan, Blue Yonder, Kinaxis, and RELEX can take 6–18 months for full deployment once ETL, workflow redesign, and change management are included [2][3].

That timing difference should not be treated as a simple speed contest. A pilot that runs quickly may still fail at enterprise scale if the organization has not solved master data, process ownership, exception handling, and planner adoption. The more integrated the platform, the more the forecast becomes part of a broader operating model rather than a standalone model selection exercise.

Run the proof of concept where the forecast is actually consumed

The easiest way to reward the wrong system is to score it on the wrong level of detail. A POC should test the model where planners, inventory managers, and replenishment teams will use it: SKU-store-week, not just total category demand. It should also prevent the vendor from choosing the most flattering slice of history.

Checklist diagram of four proof-of-concept evaluation criteria for forecasting tools.

Use temporal splits, not random splits. Train on earlier periods and test on later ones so the model proves it can forecast the future, not just memorize the past.
Measure weighted MAPE at SKU-store-week granularity. Weight by business impact so a trivial item does not count the same as a high-volume or high-margin item.
Score promotional periods separately. Promo weeks behave differently from base demand, and a blended score can hide the model's weakness where lift matters most.
Handle stockouts explicitly. If the shelf was empty, the model may be penalized for demand it could not observe or praised for demand that simply shifted elsewhere.
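The four criteria above can be sketched as a small scoring harness. This is a minimal illustration under stated assumptions, not a definitive implementation: the `Row` fields, the business-impact `weight`, and the cutoff date are hypothetical; "weighted MAPE" is computed in its ratio-of-sums form (weighted absolute error divided by weighted actuals), which stays defined when individual SKU-store-weeks have zero sales; and stockout weeks are simply excluded from scoring, which is only one of several possible treatments.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Row:
    sku: str
    store: str
    week: date        # week start date
    actual: float     # observed unit sales
    forecast: float   # vendor forecast for the same SKU-store-week
    weight: float     # hypothetical business-impact weight (e.g. margin)
    promo: bool       # True if the week was promotional
    stockout: bool    # True if the shelf was empty during the week


def temporal_split(rows, cutoff):
    """Train on weeks before the cutoff, test on weeks on or after it."""
    train = [r for r in rows if r.week < cutoff]
    test = [r for r in rows if r.week >= cutoff]
    return train, test


def weighted_mape(rows):
    """Ratio-of-sums weighted error; None if there is nothing to score."""
    num = sum(r.weight * abs(r.actual - r.forecast) for r in rows)
    den = sum(r.weight * abs(r.actual) for r in rows)
    return num / den if den else None


def score_poc(test_rows):
    """Score base and promo weeks separately, leaving stockout weeks out."""
    scorable = [r for r in test_rows if not r.stockout]
    return {
        "base": weighted_mape([r for r in scorable if not r.promo]),
        "promo": weighted_mape([r for r in scorable if r.promo]),
        "excluded_stockout_rows": len(test_rows) - len(scorable),
    }


# Made-up rows purely to exercise the functions.
rows = [
    Row("A", "S1", date(2024, 1, 1), 100, 90, 1.0, False, False),
    Row("A", "S1", date(2024, 1, 15), 100, 90, 2.0, False, False),
    Row("B", "S1", date(2024, 1, 15), 50, 60, 1.0, False, False),
    Row("A", "S1", date(2024, 1, 22), 200, 150, 2.0, True, False),
    Row("B", "S1", date(2024, 1, 22), 0, 40, 1.0, False, True),
]
train, test = temporal_split(rows, cutoff=date(2024, 1, 15))
scores = score_poc(test)
```

On this toy data the base-week error works out to 12% and the promo-week error to 25%, with one stockout row left out of scoring. Reporting the count of excluded rows alongside the scores keeps the exclusion visible, so a vendor cannot quietly improve a number by reclassifying hard weeks.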

This is also where aggregate accuracy can mislead buying teams. A vendor can look strong on a rolled-up metric while still missing the exact SKUs that trigger expediting, substitutions, and replenishment disputes. If the POC does not expose those failures, it is not testing the planning problem; it is testing the reporting layer.
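A small worked example shows how offsetting errors disappear in a rolled-up metric. The two SKUs and their numbers are invented for illustration, and the function names are hypothetical.

```python
# Invented figures: one SKU over-forecast, one under-forecast by the same amount.
actuals = {"sku_a": 100.0, "sku_b": 100.0}
forecasts = {"sku_a": 140.0, "sku_b": 60.0}


def aggregate_error(actuals, forecasts):
    """Absolute percentage error after summing all SKUs into one total."""
    total_actual = sum(actuals.values())
    total_forecast = sum(forecasts[sku] for sku in actuals)
    return abs(total_actual - total_forecast) / total_actual


def item_level_error(actuals, forecasts):
    """Sum of per-SKU absolute errors divided by total actual demand."""
    abs_error = sum(abs(actuals[sku] - forecasts[sku]) for sku in actuals)
    return abs_error / sum(actuals.values())


rolled_up = aggregate_error(actuals, forecasts)
per_sku = item_level_error(actuals, forecasts)
```

Here the category total is forecast perfectly, so the rolled-up error is 0%, while the item-level error is 40%. A planner would face one overstock and one shortage that the aggregate score never shows.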

Where simpler models are enough

Graph-based forecasting is not a universal answer. A manufacturer with 200 mostly stable SKUs and little promotion activity may get plenty of value from a simpler time-series system, especially if the team needs a clean deployment more than a richer dependency graph. The architectural case gets stronger as the business complexity rises: cross-product substitution, promotion-heavy demand, multi-echelon inventory, and supplier dependencies are all signs that the model should see relationships, not just sequences.

That leaves the buyer with a cleaner decision threshold. Before asking which tool is most accurate, define which demand relationships the business needs the model to see natively, then design the POC to test exactly those relationships at the level where decisions are made. If a system only improves the score on aggregated demand, it may still be the wrong fit for the people who have to explain the forecast on Monday morning.

References

[1] Best AI Demand Forecasting Tools for Enterprise (2026), Kumo.ai
[2] Machine Learning in Demand Planning, ToolsGroup
[3] AI Demand Forecasting Explained for Supply Chain Teams, ToolsGroup
