How to Take an AI Pilot to Production

Most enterprise AI pilots die between demo and deployment. A system impressive enough to get stakeholder buy-in fails to ship because nobody owns the path from "it works in a notebook" to "it runs in production at scale with monitoring, observability, and a runbook."

A 2025 McKinsey survey found that 73% of enterprise AI pilots never reached production. The gap isn't capability — frontier models can handle most use cases — it's delivery. The engineering and organizational work required to take a pilot to production is systematically underestimated, under-resourced, and under-owned.

Here's the complete roadmap for closing the gap.

The Six Stages from Pilot to Production

Stage 1: Data Foundation (Weeks 1–2)

Before writing a line of LLM code, audit your data. Production AI systems fail on data quality, not model quality. Understand: Where does the data live? How messy is it? Is it labeled or unlabeled? What's the freshness requirement? Is there PII that requires special handling? What are the access controls?

This audit takes 1–2 weeks and regularly surfaces problems that would have caused production failures: data that's older than expected, fields that are inconsistently populated, PII that needs to be handled before it enters prompts, and access restrictions that require IT involvement.

Deliverable: Data inventory document + PII handling spec + access requirements.
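One way to make the PII handling spec concrete is a redaction pass that runs before any text reaches a prompt. The sketch below is a minimal illustration, not a complete detector: the three regex patterns and the `redact_pii` helper are assumptions introduced here, and real coverage (names, addresses, national IDs) needs a reviewed, broader approach.

```python
import re

# Hypothetical patterns for illustration only; a production spec would
# cover far more PII types and be reviewed by security/compliance.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact_pii(text: str) -> tuple[str, dict[str, int]]:
    """Replace each match with a placeholder and count what was found."""
    counts: dict[str, int] = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        if n:
            counts[label] = n
    return text, counts


sample = "Contact jane.doe@example.com or 555-123-4567 about SSN 123-45-6789."
redacted, found = redact_pii(sample)
print(redacted)  # Contact [EMAIL] or [PHONE] about SSN [SSN].
print(found)     # {'EMAIL': 1, 'SSN': 1, 'PHONE': 1}
```

The per-type counts double as audit input: running the same pass over a sample of each data source gives a rough picture of where PII lives before any prompt is written.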

Stage 2: Architecture Lock (Weeks 2–3)

Choose your architecture before building. Architecture switches mid-engagement are expensive — they invalidate already-written code and often require starting integration work over.

The three dominant production architectures in 2026:

RAG pipeline: Retrieval-augmented generation over your document corpus. Best for knowledge base Q&A, support, search, and any use case where citing sources matters. Requires: vector database, embedding pipeline, chunking strategy, retrieval layer.

Agent with tool use: LLM orchestrating API calls, database queries, and external actions to complete multi-step tasks. Best for automation, workflows, and tasks that require accessing multiple systems. Requires: tool definitions, orchestration framework (LangGraph, custom), retry/error handling.

Fine-tuned model: Domain-adapted base model for narrow, high-volume tasks where latency and cost matter. Best for classification, extraction, formatting tasks at scale. Requires: labeled training data, fine-tuning pipeline, serving infrastructure.

Each has different infrastructure requirements. Make the call in week 2 and commit.
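To show how the RAG components listed above fit together, here is a toy sketch of the retrieval layer. It is deliberately simplified: a bag-of-words count vector stands in for a real embedding model, an in-memory list stands in for a vector database, and the corpus, file names, and function names are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    # Stand-in for an embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical pre-chunked corpus: (source id, chunk text).
CHUNKS = [
    ("refund-policy.md", "Refunds are issued within 14 days of purchase."),
    ("shipping.md", "Standard shipping takes 3 to 5 business days."),
    ("warranty.md", "Hardware is covered by a two year warranty."),
]
# In-memory stand-in for a vector database.
INDEX = [(src, text, embed(text)) for src, text in CHUNKS]


def retrieve(query, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda item: cosine(q, item[2]), reverse=True)
    return [(src, text) for src, text, _ in ranked[:k]]


def build_prompt(query, k=2):
    """Assemble a prompt that carries source ids so answers can cite them."""
    context = "\n".join(f"[{src}] {text}" for src, text in retrieve(query, k))
    return (
        "Answer using only the sources below and cite them.\n"
        f"{context}\n\nQuestion: {query}"
    )


top = retrieve("How many days does shipping take?", k=1)
print(top[0][0])  # shipping.md
```

Keeping the source id attached to every chunk from indexing through prompt assembly is what makes citation possible later; retrofitting it after launch is usually harder.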

Stage 3: Eval Framework First (Weeks 3–4)

Before writing production code, build your evaluation framework. Define: What does "correct" output look like? How will you measure it automatically? What's the minimum acceptable performance threshold below which you will not launch?

This is counterintuitive — most teams build first and evaluate later. Building eval last means you're flying blind during development and you don't know when you've broken something. Building eval first gives you a test harness to develop against.

Minimum viable eval framework:

Labeled test set of 200–500 examples covering key use cases and edge cases
Automated harness that runs on every code change
Metrics for your use case: accuracy, relevance, groundedness, latency, cost per query
Human review queue sampling 1–5% of production outputs
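The core of such a harness can be small. The sketch below assumes an exact-match accuracy metric and a stand-in `toy_classifier` as the system under test; both are illustrative choices, and a real harness would plug in the metrics that fit the use case (relevance, groundedness, latency, cost).

```python
from dataclasses import dataclass


@dataclass
class Example:
    prompt: str
    expected: str


def run_eval(system, examples, min_accuracy):
    """Run every labeled example through `system` and gate on accuracy."""
    failures = []
    for ex in examples:
        output = system(ex.prompt)
        if output.strip().lower() != ex.expected.strip().lower():
            failures.append((ex.prompt, ex.expected, output))
    accuracy = 1 - len(failures) / len(examples)
    return {
        "accuracy": accuracy,
        "passed": accuracy >= min_accuracy,  # the launch threshold
        "failures": failures,
    }


# Hypothetical stand-in for the system under test.
def toy_classifier(prompt):
    return "billing" if "invoice" in prompt.lower() else "technical"


TEST_SET = [
    Example("My invoice is wrong", "billing"),
    Example("The app crashes on login", "technical"),
    Example("I was charged twice", "billing"),                # edge case: no keyword
    Example("Invoice PDF will not download", "technical"),    # edge case: misleading keyword
]

report = run_eval(toy_classifier, TEST_SET, min_accuracy=0.9)
print(report["accuracy"], report["passed"])  # 0.5 False
```

Wired into CI, a `passed` value of `False` would fail the build, which is what turns the minimum acceptable threshold into an enforced gate rather than a guideline.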

Stage 4: Integration and Hardening (Weeks 4–10)

This is where pilots stall. Connecting the AI system to existing enterprise infrastructure — auth systems, databases, APIs, compliance controls, observability pipelines — is consistently the longest phase.

Every integration point is a potential failure mode. Legacy systems have rate limits you didn't know about. Authentication systems have token expiration edge cases. APIs return unexpected error shapes. Network timeouts happen at scale that never appeared in testing.

Allocate 40–60% of total build time to integration, not core AI logic. Most teams underestimate this by 3x. Forward-deployed engineers (FDEs) who have done this before know to budget here first.

Integration checklist:

Auth and identity provider integration (SSO, service accounts, API keys)
Data source connections with retry logic and circuit breakers
Observability: structured logging, distributed tracing, error reporting
Rate limiting and graceful degradation for every external dependency
Compliance controls (audit logging, PII redaction, access controls)
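The "retry logic and circuit breakers" item is worth a sketch, since the two patterns interact. The version below is a minimal, single-threaded illustration under assumed defaults (three failures to open, a 30-second cool-down, exponential backoff); the class and function names are hypothetical, and a production version would add locking, jitter, and metrics.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency unavailable, failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result


def with_retry(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry with exponential backoff; never retry against an open circuit."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Usage with a hypothetical flaky dependency that succeeds on the third call.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("upstream timed out")
    return "ok"

breaker = CircuitBreaker(failure_threshold=5)
result = with_retry(lambda: breaker.call(flaky), sleep=lambda s: None)
print(result, calls["n"])  # ok 3
```

The retry wrapper handles transient blips; the breaker handles sustained outages by failing fast, which is what lets the rest of the system degrade gracefully instead of piling up blocked requests.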

Stage 5: Load Testing and Cost Modeling (Weeks 10–12)

Model what happens at 10x and 100x your pilot traffic. LLM inference costs are non-trivial — a system that costs $200/month in a controlled pilot can cost $200,000/month in production if caching, batching, and prompt optimization aren't built in.

Run load tests before launch. Production traffic patterns differ from test patterns. Concurrent request handling, queue management, and database connection pooling all behave differently at scale.

Cost model inputs:

Expected daily/monthly query volume
Average prompt length (input tokens)
Average output length (output tokens)
Model price per million tokens
Cache hit rate assumption
Batching efficiency gains
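These inputs combine into a simple back-of-the-envelope formula. The sketch below is one possible formulation, with assumptions called out: it splits the model price into separate input and output rates, treats cache hits as free, and models batching as a discount applied to a fraction of traffic. The prices in the example are placeholders, not any provider's actual rates.

```python
def monthly_cost(
    queries_per_day,
    input_tokens,
    output_tokens,
    input_price_per_m,    # price per million input tokens (placeholder)
    output_price_per_m,   # price per million output tokens (placeholder)
    cache_hit_rate=0.0,   # fraction of queries served from cache
    batch_fraction=0.0,   # fraction of traffic that can be batched
    batch_discount=0.0,   # assumed price reduction on batched traffic
    days=30,
):
    """Estimate monthly inference spend from the cost model inputs."""
    billable_queries = queries_per_day * days * (1 - cache_hit_rate)
    cost_per_query = (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000
    batching_factor = 1 - batch_fraction * batch_discount
    return billable_queries * cost_per_query * batching_factor


# Hypothetical scenario: 2,000 input and 500 output tokens per query.
pilot = monthly_cost(200, 2000, 500, 3.0, 15.0)
at_100x = monthly_cost(20_000, 2000, 500, 3.0, 15.0)
optimized = monthly_cost(
    20_000, 2000, 500, 3.0, 15.0,
    cache_hit_rate=0.4, batch_fraction=0.5, batch_discount=0.5,
)
print(round(pilot, 2), round(at_100x, 2), round(optimized, 2))  # 81.0 8100.0 3645.0
```

Running the same function at pilot volume, 10x, and 100x makes the sensitivity to cache hit rate and batching visible before launch rather than on the first invoice.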

Optimization stack: Semantic caching (cache similar queries, not just identical ones), prompt compression, tiered model routing (smaller model for simple queries, larger for complex), batch processing for non-real-time workloads.
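The difference between semantic and exact-match caching is easiest to see in code. The sketch below is a toy: Jaccard overlap of word sets stands in for embedding similarity, the 0.8 threshold is an arbitrary assumption, and `SemanticCache`, `answer`, and `fake_model` are hypothetical names. A real implementation would compare embedding vectors and also handle expiry and invalidation.

```python
import re


def _tokens(text):
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))


class SemanticCache:
    """Toy cache: word-set Jaccard overlap stands in for embedding similarity."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (token set, cached answer)

    def get(self, query):
        q = _tokens(query)
        best_score, best_answer = 0.0, None
        for tokens, cached in self.entries:
            union = q | tokens
            score = len(q & tokens) / len(union) if union else 0.0
            if score > best_score:
                best_score, best_answer = score, cached
        return best_answer if best_score >= self.threshold else None

    def put(self, query, result):
        self.entries.append((_tokens(query), result))


def answer(query, cache, call_model):
    """Return (answer, served_from_cache); call the model only on a miss."""
    cached = cache.get(query)
    if cached is not None:
        return cached, True
    result = call_model(query)
    cache.put(query, result)
    return result, False


# Hypothetical model stub that records how often it is invoked.
model_calls = []

def fake_model(query):
    model_calls.append(query)
    return f"answer #{len(model_calls)}"

cache = SemanticCache(threshold=0.8)
first = answer("What is the refund policy?", cache, fake_model)
second = answer("What is the refund policy please?", cache, fake_model)  # similar, not identical
third = answer("How long does shipping take?", cache, fake_model)
print(first, second, third, len(model_calls))
# ('answer #1', False) ('answer #1', True) ('answer #2', False) 2
```

The threshold is the main tuning knob: set it too low and users get answers to questions they did not ask, so it is worth validating against the eval test set before relying on the cache hit rate in the cost model.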

Stage 6: Handoff and Documentation (Weeks 12–14)

A production AI system with no runbooks is a system that will fail silently when the person who built it leaves. Before launch, document: the system architecture, every data dependency, the eval framework and how to run it, the monitoring dashboards and alert thresholds, the incident response playbook, and the model update procedure.

Record a walkthrough video. Schedule a live knowledge transfer session with the team that will own the system. Make sure at least two internal engineers understand the system end-to-end before the FDE exits.

The Five Most Common Pilot-to-Production Failure Points

1. No eval framework — you ship, something breaks gradually, you don't notice for weeks because you have no automated measurement.

2. Underestimated integration work — connecting to enterprise systems takes 3x longer than expected. The AI logic took 2 weeks; the integrations took 8.

3. No owner post-launch — the FDE or contractor exits, nobody on the internal team understands the system, the first incident reveals a single point of failure in human knowledge.

4. Prompt drift — the LLM provider updates their model, behavior changes subtly, your eval suite doesn't catch it because it's not running in production.

5. Cost explosion — caching and batching weren't implemented, inference cost is 10x the pilot estimate, the business case collapses at scale.
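Failure point 4 in particular can be guarded against by re-running a fixed canary set against the live system on a schedule and comparing the pass rate to a stored baseline. The sketch below assumes a simple absolute-drop rule; the 5-point tolerance, the 50-sample minimum, and the `check_drift` name are illustrative choices rather than recommended values.

```python
def check_drift(baseline_pass_rate, recent_passes, tolerance=0.05, min_samples=50):
    """Compare the recent pass rate on a fixed canary set against a baseline.

    `recent_passes` is a list of booleans, one per canary example.
    """
    n = len(recent_passes)
    if n < min_samples:
        return {"status": "insufficient_data", "samples": n}
    pass_rate = sum(recent_passes) / n
    drop = baseline_pass_rate - pass_rate
    status = "alert" if drop > tolerance else "ok"
    return {"status": status, "samples": n, "pass_rate": pass_rate, "drop": drop}


# Hypothetical numbers: baseline of 92% measured at launch.
before_update = check_drift(0.92, [True] * 91 + [False] * 9)
after_update = check_drift(0.92, [True] * 80 + [False] * 20)
print(before_update["status"], after_update["status"])  # ok alert
```

Because the canary set is fixed, a drop in pass rate points at a change in model or system behavior rather than a change in user traffic.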

Pilot-to-Production Timeline Benchmarks

| Scenario | Time to Production |
|---|---|
| FDE + clear scope + stakeholder alignment | 8–14 weeks |
| Internal team + dedicated AI engineer | 4–9 months |
| Internal team + split-attention engineers | 9–18 months |
| POC-first without production roadmap | Often never |

Organizational Prerequisites

Technical readiness is necessary but not sufficient. Before a pilot can reach production, you need:

Single owner: One person who is accountable for the system being in production and working. Not a committee.

IT/security sign-off path: Identified contacts in IT and security who will approve the system, and a timeline for their review.

Infrastructure access: Production environment access, domain credentials, API keys, database access. Getting these takes weeks if not pre-arranged.

Support plan: Who handles incidents after launch? What's the escalation path? What's the SLA for the system?

Frequently Asked Questions

How long does it take to go from pilot to production? With a dedicated FDE, 8–14 weeks for most agent/RAG systems. Without a dedicated owner, 6–18 months is typical, if it ships at all.

What's the minimum viable eval framework? A labeled test set of 200–500 examples, an automated harness that runs on every code change, key metric tracking (accuracy, latency, cost), and a human review queue for 1–5% of production outputs. This is the floor.

When should we bring in external help? If your pilot has been "almost ready for production" for more than 8 weeks, bring in external help. Internal pilots accumulate technical debt and rarely receive the dedicated attention needed to clear the final blockers.

Should we build a POC first? Only if you're genuinely uncertain about whether the AI approach will work for your use case. Most "build a POC first" decisions are actually ways to defer the harder production engineering work. Build a POC if you need to de-risk a technical assumption; don't build one if you're just postponing integration work.

How do we prevent knowledge transfer failure when the FDE exits? Start knowledge transfer in week 1, not week 14. Write documentation throughout the engagement. Require internal engineers to pair with the FDE on key components. Ensure at least two people understand the system end-to-end before the engagement ends.
