Gartner’s 2025 survey data puts an uncomfortable number on a familiar pattern: 55% of AI supply chain projects fail to scale beyond the pilot stage.[1] That figure is not specific to demand planning; it covers AI supply chain projects more broadly. But demand planning is one of the most common places companies try to prove AI value first, so the warning lands squarely on anyone evaluating AI demand planning software.
The question is not whether AI can produce a better forecast in a controlled pilot. It often can. The harder question is why the result breaks down once the pilot leaves the sandbox and has to absorb late promotion changes, account-level negotiations, price moves, planner overrides, master-data defects, and an S&OP meeting where nobody wants to explain why the forecast ignored what sales already knew.

Most failed pilots do not collapse because the algorithm is useless. They stall because the organization tries to scale a model before the surrounding planning system is ready to support it. Three gaps show up repeatedly: weak data infrastructure, poor integration with commercial planning inputs, and change management that treats planner trust as an afterthought.
The first failure point is the least glamorous one: data readiness
The fastest way to flatter an AI forecasting pilot is to give it a cleaned-up sample, limit the scope, and keep the messiest planning exceptions outside the demo. That may be acceptable for proving a narrow modeling concept. It is a dangerous way to judge whether the system can scale.
McKinsey research on AI in supply chain found that organizations underinvesting in data infrastructure before AI deployment are three times more likely to report negative ROI.[2] That is the kind of finding that should slow down a pilot review. A model can look impressive while the total implementation economics are already deteriorating because the team is spending too much time reconciling feeds, correcting history, explaining exceptions, and rebuilding pipelines that were never designed for operational use.
Demand planning data is rarely one clean time series. It is a stack of sales history, lost sales, substitutions, product launches, phase-outs, customer hierarchies, calendar effects, promotions, pricing, constraints, and manual adjustments. If those inputs are incomplete or misaligned, the model does not merely lose a few points of accuracy. It gives planners a forecast they cannot explain, and unexplained forecasts die quickly in planning meetings.
This is why ERP and planning-system readiness should be treated as the first scaling gate, not a technical cleanup task running in the background. Before expanding a pilot, the team needs to know whether item-location history is usable, whether hierarchy changes are traceable, whether demand signals reflect what actually happened in the market, and whether the forecast engine can distinguish true demand from supply-constrained shipments. For a deeper operational checklist, see the AI demand planning implementation readiness assessment checklist.
The budget pattern matters too. McKinsey’s research points to a 20–30% data-preparation budget allocation as a success pattern for AI supply chain deployments.[2] That does not mean every company should mechanically reserve the same percentage. It does mean that treating data work as a minor pre-pilot activity is usually a sign the business has not priced the real implementation.
A practical data-readiness review should answer questions like these before the pilot is judged ready to scale:
- Can the system separate unconstrained demand from shipped volume when supply was short?
- Are promotions, pricing events, and customer-level changes available in a form the model can actually use?
- Are product transitions, substitutions, and discontinued SKUs mapped consistently enough to preserve demand history?
- Can planners see which inputs changed the forecast, or is the forecast presented as a black-box number?
- Is data ownership clear after go-live, or does every defect become an urgent planning-team workaround?
That last question is not administrative. If nobody owns the health of the demand signal, the planner becomes the repair layer. Once that happens, the AI tool may still be live, but the operating model has already failed.
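The first checklist question, separating unconstrained demand from shipped volume, can be made concrete with a small sketch. Everything here is an illustrative assumption rather than any vendor's method: the `PeriodRecord` fields, the `stockout` flag, and the crude rule of falling back to ordered units when supply was short. Real lost-sales estimation is usually more involved.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PeriodRecord:
    """One item-location period. Field names are hypothetical."""
    period: str           # e.g. "2025-W14"
    ordered_units: float  # what customers asked for
    shipped_units: float  # what actually shipped
    stockout: bool        # True if supply was short in this period


def demand_signal(rec: PeriodRecord) -> float:
    """Return the demand estimate to feed the forecast for one period.

    When supply was short, shipments understate demand, so fall back to
    ordered units as a rough proxy (an assumption, not a full
    lost-sales model). Otherwise shipments are taken as demand.
    """
    if rec.stockout:
        return max(rec.ordered_units, rec.shipped_units)
    return rec.shipped_units


def constrained_share(history: List[PeriodRecord]) -> float:
    """Share of periods in which shipments were supply-constrained."""
    if not history:
        return 0.0
    return sum(r.stockout for r in history) / len(history)
```

A high `constrained_share` for an item is a hint that its shipment history should not be treated as clean demand history before the pilot is scaled.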
The second failure point appears when commercial reality enters the forecast
The cleanest statistical forecast in the world will be challenged the moment a national account changes a promotion date, a sales team negotiates a one-time buy, marketing shifts spend into a different region, or pricing approves a discount that was not in the historical pattern. This is where many AI demand planning pilots move from promising to fragile.
Deloitte’s 2025 AI in supply chain research identifies lack of integration with commercial planning processes — including sales inputs, marketing calendars, and pricing changes — as the most common reason for AI forecasting underperformance.[3] That point deserves more attention than it usually gets. Forecast rejection often does not begin with a planner refusing innovation. It begins when the model produces a number that ignores information the business knows is real.
From the planner’s chair, the override may be rational. If a customer has pulled forward volume, if a promotion has moved, or if a price increase is about to change order behavior, the planner cannot wait for the model to rediscover that pattern after the fact. They override because the forecast is incomplete for the decision in front of them.
The problem is what happens next. Overrides break the feedback loop if they are not structured, coded, and reviewed. A planner adjustment becomes a shadow forecast. Sales keeps a separate view. Finance distrusts both. The AI forecast still exists, but the organization has returned to parallel planning — now with a more expensive tool in the middle.
Scaling requires a process design that decides which commercial signals enter the model, which ones remain human judgment, and how exceptions are reviewed. It is not enough to connect the AI engine to historical demand. The pilot has to prove that sales, marketing, pricing, and planning can feed the system in time for the forecast to matter.
This is also where ERP and planning integration become more than an IT dependency. If customer hierarchies, pricing data, promotion calendars, and product attributes sit in disconnected systems, the AI tool will either miss important signals or depend on manual uploads. Both paths create delay and ambiguity. The ERP integration readiness guide for AI demand planning goes deeper on that first gate.
A better pilot review does not ask only, “Did forecast accuracy improve?” It asks, “Did the model survive the commercial planning cycle?” That means the pilot should include real promotion changes, real customer exceptions, real pricing events, and the same review cadence the scaled process will use. Otherwise, the team has tested a model but not a planning system.
The third failure point is trust, and trust is not created by a launch deck
Change management is often discussed too late, as if adoption begins after the model is built. By then, the planners have already formed their opinion. They have seen whether the forecast explains itself, whether commercial inputs are reflected, whether exceptions are easy to review, and whether leadership blames them when the number is wrong.
Deloitte’s 2025 research found that 61% of respondents cite change management as the primary barrier to AI supply chain adoption.[3] RELEX’s 2026 State of Supply Chain survey adds the behavioral side: only 10% of leaders trust AI to make critical decisions autonomously, while 54% prefer a hybrid human-and-AI approach.[4]
That is not just resistance to automation. It is a signal about operating risk. Forecasts drive buys, production, inventory, allocation, service commitments, and working capital. If the AI recommendation is wrong, someone in the business still has to explain the miss. Until accountability changes, people will keep a hand on the wheel.
The trust loop can become self-defeating. Low trust leads to manual overrides. Unstructured overrides prevent the model from learning from actual business judgment. The model then appears less useful, which reinforces low trust. Breaking that loop requires more than training sessions. It requires exception governance, clear override reasons, visible performance review, and agreement on where the AI forecast is allowed to be the default.
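One way to make "clear override reasons" and "visible performance review" tangible is to record each override with a reason code and then check, after actuals arrive, whether overrides with that reason reduced error. The sketch below is a minimal illustration under stated assumptions: the `Override` structure and the reason codes in the usage example are hypothetical, and absolute error is just one possible yardstick.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Override:
    """A planner adjustment captured with a reason. Names are hypothetical."""
    reason_code: str         # e.g. "PROMO_MOVED", "GUT_FEEL"
    model_forecast: float    # what the model proposed
    planner_forecast: float  # what the planner changed it to
    actual: float            # what demand turned out to be


def override_value_by_reason(overrides: List[Override]) -> Dict[str, float]:
    """Total error reduction from overrides, grouped by reason code.

    For each override, compute (model absolute error) minus
    (planner absolute error). A positive total means overrides with
    that reason reduced error overall; a negative total means they
    made the forecast worse.
    """
    totals: Dict[str, float] = defaultdict(float)
    for o in overrides:
        model_err = abs(o.actual - o.model_forecast)
        planner_err = abs(o.actual - o.planner_forecast)
        totals[o.reason_code] += model_err - planner_err
    return dict(totals)


if __name__ == "__main__":
    sample = [
        Override("PROMO_MOVED", 100, 140, 150),
        Override("GUT_FEEL", 100, 130, 95),
    ]
    print(override_value_by_reason(sample))
```

Reviewed regularly, a breakdown like this shows where human judgment is adding information the model lacks and where the model could reasonably become the default.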

What the successful 45% tend to do differently
The successful minority should not be imagined as companies that simply bought better algorithms. The stronger pattern is sequencing. They make the surrounding planning environment usable before they ask AI to carry more of the decision load.
| Sequence | What gets proven | Why it matters before scale |
|---|---|---|
| 1. Data foundation | Demand history, master data, hierarchy logic, and external inputs are clean enough for repeated use | Prevents the pilot from depending on one-time data cleanup |
| 2. Commercial integration | Sales, promotion, pricing, and marketing inputs enter the forecast process in time | Reduces the need for quiet manual overrides |
| 3. Adoption path | Planners know when to accept, challenge, or override the model | Turns AI from a parallel forecast into an operating process |
| 4. Scope expansion | The model expands by category, region, product family, or decision type after the first gates hold | Avoids scaling exception chaos across the full planning network |
There is nothing flashy in that order, which is exactly why it works. Data comes first because every downstream failure gets more expensive when the inputs are brittle. Commercial integration comes next because the forecast has to reflect how demand is actually shaped. Adoption follows because planners need a system they can interrogate, not a number they are told to trust.
Vendor case studies can be useful here, as long as they are not treated as averages. An o9-published case study on AB InBev reports a 60% stockout reduction, an 11% forecast accuracy improvement, and 70–90% touchless planning adoption.[5] Those are single-enterprise, vendor-published results, not a promise that another company will see the same outcome. Their value is in showing what becomes possible when data, process integration, and operating adoption line up.
The same caution applies to Idaho Forest Group, where secondary reporting says forecasting time fell from more than 80 hours to under 15 hours.[6] Time reduction can be a powerful scaling signal, but only if the saved time comes from eliminating low-value reconciliation and manual assembly — not from suppressing necessary review. A faster bad forecast is not an operating improvement.
For readers evaluating vendor-specific claims, the point is not to dismiss case studies. It is to ask what had to be true underneath them. Were promotion inputs integrated? Were planners working from one process or parallel spreadsheets? Were overrides governed? Was touchless planning applied only where forecastability was high, or everywhere? The answers matter more than the headline metric. For a closer vendor-specific view, see the o9 Solutions demand planning module deep dive.
A hybrid statistical-plus-ML path is often the safer bridge
Not every company is ready to absorb a full AI demand planning deployment. That is not a maturity insult; it is an implementation reality. If the data foundation is incomplete, the integration map is still being built, or planner adoption is fragile, a hybrid statistical-plus-machine-learning approach can be a lower-risk entry point.
McKinsey research indicates that hybrid statistical-plus-ML approaches can deliver 85–90% of the accuracy benefit of pure-AI approaches with significantly lower cost and data-infrastructure requirements.[2] That finding should not be stretched into a claim that hybrid is always superior. It is better understood as a sequencing option: capture much of the benefit while reducing the burden on teams that still need to strengthen data, integration, and governance.
The ceiling is real. A hybrid approach may not exploit every signal, automate every exception, or support the most ambitious autonomous planning vision. But a lower ceiling can still be the right trade-off if the alternative is an expensive AI pilot that cannot survive scale. The decision should be based on readiness, not aspiration.
A useful dividing line is operational tolerance. If planners still spend much of the cycle correcting history, chasing commercial inputs, and reconciling spreadsheet versions, a hybrid architecture may provide a more stable upgrade path. If data flows are governed, commercial inputs are integrated, and exception review is disciplined, a broader AI deployment has a better chance of being absorbed. The comparison in AI demand forecasting vs. traditional methods is a helpful next reference for that architecture decision.
The pilot review should change before the pilot expands
Many pilot reviews are built around model performance: forecast accuracy, bias, error reduction, and sometimes projected ROI. Those metrics matter, but they are incomplete. A scale decision should test whether the process around the model is strong enough to keep the gains alive.
| Pilot review question | What a weak answer usually signals |
|---|---|
| Was the pilot data prepared once, or is the pipeline repeatable? | The result may depend on manual cleanup that will not scale |
| Did the forecast include real commercial inputs? | The model may be accurate only in a simplified environment |
| Were planner overrides captured with reason codes? | Human judgment may remain invisible to the learning loop |
| Did sales, marketing, finance, and supply agree on the operating cadence? | The AI forecast may become one more number in a fragmented process |
| Which decisions became touchless, and which still require review? | Automation may be applied too broadly or too vaguely |
This is where project-level failure and organizational failure meet. A company can run a technically sound pilot inside an organization that has no formal AI planning strategy, unclear data ownership, and no agreement on how human judgment should interact with model recommendations. For broader context on the organizational pattern, see why AI in supply chain fails and the five barriers to supply chain AI adoption.
A better scale gate is blunt: can the model survive contact with the real planning calendar? If the answer depends on heroic planner effort, one-off data repair, or commercial inputs arriving outside the system, the pilot is not ready. It may still be worth continuing, but it should be treated as a readiness project, not a scale deployment.
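The review questions in the table can be expressed as an explicit gate so the scale decision is recorded rather than argued from memory. This is only a sketch: the `PilotReview` field names are hypothetical labels for the five questions, and a real review would carry evidence and owners, not just booleans.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class PilotReview:
    """One yes/no answer per pilot review question. Names are hypothetical."""
    pipeline_repeatable: bool          # data pipeline runs without manual cleanup
    real_commercial_inputs: bool       # promotions, pricing, sales inputs included
    overrides_have_reason_codes: bool  # planner overrides captured with reasons
    cadence_agreed: bool               # sales, marketing, finance, supply aligned
    touchless_scope_defined: bool      # clear which decisions need no review


def failed_gates(review: PilotReview) -> List[str]:
    """Names of the review questions that were answered 'no'."""
    return [f.name for f in fields(review) if not getattr(review, f.name)]


def ready_to_scale(review: PilotReview) -> bool:
    """True only if every gate holds; otherwise treat it as a readiness project."""
    return not failed_gates(review)
```

Listing the failed gates by name, instead of returning a single pass or fail, keeps the follow-up work tied to the specific weakness the pilot exposed.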
The practical verdict
The failure pattern behind many AI demand planning pilots is preventable. Companies try to install intelligence into a planning system that has not yet earned trust: data is unstable, commercial signals arrive late or outside the tool, and planners are asked to accept recommendations without a clear operating model for challenge, override, and accountability.
The teams that scale do the dull work in the right order. They strengthen the demand signal, wire in the commercial process, define how humans and AI share the decision, and only then broaden model scope. If a new pilot is starting now, the first question is not which model to buy. It is whether the data, commercial inputs, and adoption path are ready enough for that model to survive planning reality.
References
[1] Gartner. Survey data on AI supply chain project scaling. 2025.
[2] McKinsey & Company. Research on AI in supply chain.
[3] Deloitte. AI in supply chain research. 2025.
[4] RELEX Solutions. State of Supply Chain. 2026.
[5] o9 Solutions. AB InBev case study.
[6] Stealth Agents. Idaho Forest Group forecasting case aggregation.
