AI Prototype to Production
You launched a demo. The board loved it. Now it has to survive real users, real regulators, and a real bill. We take it the rest of the way.
95%
OF GENAI PILOTS NEVER MOVE THE P&L
30%
ABANDONED AFTER POC BY END OF 2025
26%
HAVE CAPABILITY TO GO BEYOND POC
Sources: MIT NANDA, The GenAI Divide, 2025. Gartner, July 2024. BCG, Where's the Value in AI? Oct 2024.
The demo is the easy part.
Modern models make the demo trivial. Wire up an API, paste in a prompt, drop in a few PDFs, and you've got something that wows a room.
Production is the bit that survives a hostile user, a 3am outage, a regulator's letter, and a finance director reading the OpenRouter bill. Air Canada's chatbot became a tribunal case because the public version gave the wrong answer.
If your AI handles money, advice, customer data or anything anyone notices when it breaks, you want the production half done properly.
THE PROTOTYPE
- "Looked good on the five examples we tried"
- No tests, no evals, no regression suite
- Prompts buried in the codebase
- No idea why it fails when it fails
- One model, one provider, no fallback
- Costs surprise you at the end of the month
THE PRODUCTION SYSTEM
- Graded on a real dataset every build
- Evals catch regressions before users do
- Prompts versioned, traced, reviewable
- Every call traced, logged and searchable
- Routed, cached, fallbacks wired in
- Budgets and alerts on tokens, not bills
Five things prototypes usually miss.
We've taken RAG pilots, agents, copilots and chatbots through this on-ramp. These are the gaps we usually find first.
Evals
A graded dataset of real inputs and expected behaviour. Without that, nobody can tell whether the system improved or just got better at the demo path.
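A minimal sketch of what "graded" can mean in practice. The `EvalCase` record, the substring check and the `fake_system` stand-in are all assumptions for illustration, not a specific framework's API; a real harness would call your model and use richer checks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One graded example: a real input and the behaviour we expect."""
    prompt: str
    must_contain: str  # hypothetical deterministic check: a required phrase


def run_evals(system: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose output contains the required phrase."""
    if not cases:
        raise ValueError("eval dataset is empty")
    passed = sum(
        1 for case in cases
        if case.must_contain.lower() in system(case.prompt).lower()
    )
    return passed / len(cases)


def fake_system(prompt: str) -> str:
    """Stand-in for the real model call (assumption for illustration)."""
    if "refund" in prompt.lower():
        return "Refunds are available within 30 days."
    return "Sorry, I can't help with that."


CASES = [
    EvalCase("Can I get a refund?", "30 days"),
    EvalCase("What is your refund window?", "30 days"),
    EvalCase("Do you ship to France?", "yes"),
]

score = run_evals(fake_system, CASES)
print(f"pass rate: {score:.2f}")  # 2 of 3 cases pass -> 0.67
```

Run on every build, a falling pass rate is the signal that a prompt or model change broke something outside the demo path.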
Observability
Every call traced, every prompt versioned, every failure searchable. LangSmith, Langfuse, Arize Phoenix, or wired to your existing stack. You stop guessing what went wrong.
Guardrails
Prompt injection, jailbreaks, data leaks, tool misuse. Reviewed against the OWASP Top 10 for LLM Applications (2025) and MITRE ATLAS. The Chevy Tahoe chatbot incident is why sales bots need hard boundaries.
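One kind of hard boundary, sketched under assumptions: the bot can only request tools from an allow-list, with only the arguments we expect. The tool names and the `is_allowed` helper are hypothetical. A check like this does not stop an injection attempt; it limits what a successful one can do.

```python
# Hypothetical allow-list: tool name -> argument names the bot may supply.
ALLOWED_TOOLS: dict[str, set[str]] = {
    "lookup_order": {"order_id"},
    "get_opening_hours": set(),
}


def is_allowed(tool: str, args: dict[str, object]) -> bool:
    """Accept a model-requested tool call only if the tool is on the
    allow-list and it carries no arguments beyond the expected ones."""
    expected = ALLOWED_TOOLS.get(tool)
    if expected is None:
        return False
    return set(args) <= expected


print(is_allowed("lookup_order", {"order_id": "A123"}))              # True
print(is_allowed("apply_discount", {"percent": 100}))                # False
print(is_allowed("lookup_order", {"order_id": "A123", "price": 1}))  # False
```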
Cost and latency
Model routing, caching, batching, smaller-model fallbacks. The same answer for less money and less waiting, when the evals prove it. Budgets and alerts on tokens before the bill turns up.
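A rough sketch of a token budget that warns before the limit is hit. The `TokenBudget` class, the 80% warning threshold and the numbers are assumptions for illustration; in practice the token counts would come from your provider's usage data.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBudget:
    """Tracks token spend against a limit and records alerts on the way up."""
    limit: int
    alert_ratio: float = 0.8  # assumed warning threshold
    used: int = 0
    alerts: list[str] = field(default_factory=list)

    def record(self, tokens: int) -> None:
        """Add usage and raise an alert the first time a threshold is crossed."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        before = self.used
        self.used += tokens
        warn_at = self.limit * self.alert_ratio
        if before < warn_at <= self.used:
            self.alerts.append(f"warning: {self.used}/{self.limit} tokens used")
        if before < self.limit <= self.used:
            self.alerts.append(f"limit reached: {self.used}/{self.limit} tokens used")

    def allows(self, tokens: int) -> bool:
        """True if a call of this size would stay within the limit."""
        return self.used + tokens <= self.limit


budget = TokenBudget(limit=1000)
for call_tokens in (500, 350, 200):
    budget.record(call_tokens)

print(budget.alerts)
print(budget.allows(1))
```

The first alert fires at 850 tokens, the second at 1,050, and `allows` then refuses further calls, which is the point at which a router might switch to a cheaper model or stop.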
Governance
Aligned to NIST AI RMF and ISO/IEC 42001. EU AI Act obligations and ICO guidance mapped to your system. The paperwork that turns "cool tool" into "approved to launch."
Where we come in.
A short review, a fixed-scope production sprint, then an optional retainer to keep it improving. No twelve-month consulting engagement, no "AI strategy" away days.
You keep the prototype your team built. We add the parts a production system needs. You end up with something you can run, audit, scale and defend.
BOOK A PRODUCTION REVIEW
Production review
Read access to your repo, prompts, model usage and any test data. You get a one-page report: where it'll break first, what's exposed under OWASP LLM Top 10, where the bill comes from, what governance you're missing for NIST AI RMF and ISO/IEC 42001. Fixed price.
Build the eval harness
A real dataset, real expected behaviours, automated grading on every change. LLM-as-judge where it makes sense, human review where it doesn't. After this you can answer "did our change make it better or worse" with a number, not a vibe.
Harden, instrument, productionise
Tracing and logging on every call. Prompt injection and tool-misuse guardrails. Model routing, caching, fallbacks, retries, timeouts, rate limits. PII handling reviewed for ICO guidance. Token budgets and alerts before the OpenRouter invoice arrives.
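Fallbacks and retries, as a sketch under stated assumptions: each provider is a plain function from prompt to text, timeouts are enforced inside that function, and the provider names are made up. `call_with_fallback` is a hypothetical helper, not a library call.

```python
import time
from typing import Callable, Sequence

Provider = tuple[str, Callable[[str], str]]


def call_with_fallback(
    providers: Sequence[Provider],
    prompt: str,
    retries: int = 2,
    backoff_s: float = 0.0,
) -> tuple[str, str]:
    """Try each provider in order, retrying with exponential backoff,
    and return (provider_name, response) from the first that succeeds."""
    last_error: Exception | None = None
    for name, call in providers:
        for attempt in range(retries):
            try:
                return name, call(prompt)
            except Exception as exc:  # a real system would catch narrower errors
                last_error = exc
                time.sleep(backoff_s * (2 ** attempt))
    raise RuntimeError("all providers failed") from last_error


def flaky_primary(prompt: str) -> str:
    """Stand-in for a provider that is currently down."""
    raise TimeoutError("primary timed out")


def steady_backup(prompt: str) -> str:
    """Stand-in for a smaller fallback model."""
    return f"echo: {prompt}"


result = call_with_fallback(
    [("primary", flaky_primary), ("backup", steady_backup)],
    "hi",
)
print(result)  # ('backup', 'echo: hi')
```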
Govern and hand back
An AI use register and a risk assessment mapped to the four NIST AI RMF functions: Govern, Map, Measure, Manage. EU AI Act obligations checked against your use case. Optional retainer if you want us building the next set of features alongside your team.
When the demo went live.
Four public examples of AI systems that launched before the boring production work was finished.
Air Canada chatbot.
BC Civil Resolution Tribunal held the airline liable for negligent misrepresentation by its website chatbot on bereavement fares. Air Canada's argument that the bot was a separate legal entity was rejected. CA$650.88 in damages, plus interest and fees.
Chevy Tahoe for $1.
A Chevrolet dealership used a ChatGPT-powered support widget. A user prompt-injected it into agreeing to sell a 2024 Tahoe for one dollar, ending with the line "no takesies backsies." The bot was taken offline after the screenshots spread.
NYC MyCity bot.
A Microsoft Azure-powered city chatbot for small businesses told users they could take staff tips, fire whistleblowers, and refuse cash, despite local rules. The Markup and THE CITY ran the story. The mayor kept it live while the city worked on fixes.
Cursor's "Sam".
Cursor's AI support bot invented a one-device-per-subscription policy that didn't exist. Users cancelled subscriptions and posted complaints on Hacker News and Reddit. Cursor apologised and said AI-generated support replies would be labelled.
Sources: BC Civil Resolution Tribunal (Moffatt v Air Canada, 2024 BCCRT 149). GM Authority. The Markup/THE CITY. Ars Technica.
The ones we get asked first.
Our prototype works. Why does it need any of this?
It works on the inputs you tried. Production sees inputs you didn't. Without evals you can't tell whether a prompt change made it better or worse. Without traces you can't tell why it failed last Tuesday. Without guardrails, the first determined user gets to write your support policy for you.
Do we have to throw the prototype away?
No. The prototype is the spec. We keep the UI, the workflow, the prompts that work. We add the parts a production system needs around it. Where something has to be rewritten, we tell you up front and why.
What about the EU AI Act?
If you're in the UK, the Act can still reach you when an AI system is placed on the EU market or its output is used in the EU. Prohibitions and AI literacy duties applied from 2 February 2025; general-purpose AI model obligations from 2 August 2025; most remaining rules from 2 August 2026; some high-risk system rules from 2 August 2027. We map your system to the right risk tier and tell you what you'd have to do, or stop doing, to stay compliant.
We don't have evals. How do you build them?
Real inputs from your logs (or synthetic ones if you don't have logs yet), labelled with the behaviour you actually want. Then a mix of deterministic checks, rubric grading, and LLM-as-judge where appropriate. We calibrate AI judges against human labels instead of treating them as truth. Everything runs on CI and breaks the build when it should.
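One way to calibrate a judge, sketched with made-up labels: score the same outputs with humans and with the AI judge, then measure agreement beyond chance with Cohen's kappa. The label lists below are illustrative, not real results.

```python
def cohens_kappa(human: list[bool], judge: list[bool]) -> float:
    """Cohen's kappa for two raters giving pass/fail labels to the same items."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_human = sum(human) / n  # share of items the humans passed
    p_judge = sum(judge) / n  # share of items the judge passed
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        return 1.0  # both raters gave every item the same single label
    return (observed - expected) / (1 - expected)


# Illustrative labels only: True means "output judged acceptable".
human_labels = [True, True, True, False, False, True, False, True]
judge_labels = [True, True, False, False, False, True, True, True]

kappa = cohens_kappa(human_labels, judge_labels)
print(f"kappa: {kappa:.2f}")  # 0.47
```

Raw agreement here is 75%, but kappa is about 0.47 once chance agreement is removed. A judge with low kappa needs a better rubric or more human review before its scores gate a build.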
Which observability platform should we pick?
We've used LangSmith, Langfuse, Arize Phoenix, Braintrust and Helicone. We pick on your stack, your data residency requirements, and whether self-hosting is worth the operational cost. We're not on commission for any of them.
Our OpenAI bill is out of control. Can you help?
Almost certainly. The usual fixes are prompt caching, routing easy queries to a cheaper model, batching the slow paths, and killing the agent loops that quietly call themselves seven times. The eval harness is what lets us prove the cheaper version is still good enough.
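Runaway agent loops are often the simplest of these to cap. A minimal sketch, assuming the agent is a hypothetical `step` function that returns its new state and a done flag; `run_agent` and `StepLimitExceeded` are illustrative names.

```python
from typing import Callable


class StepLimitExceeded(RuntimeError):
    """Raised when an agent uses up its step allowance without finishing."""


def run_agent(
    step: Callable[[str], tuple[str, bool]],
    task: str,
    max_steps: int = 3,
) -> tuple[str, int]:
    """Run the agent loop, returning (final_state, model_calls_made).
    Stops with an error instead of looping past max_steps."""
    state = task
    for calls in range(1, max_steps + 1):
        state, done = step(state)
        if done:
            return state, calls
    raise StepLimitExceeded(f"gave up after {max_steps} steps")


def done_on_second(state: str) -> tuple[str, bool]:
    """Stand-in agent step that finishes on its second call."""
    new_state = state + "."
    return new_state, new_state.endswith("..")


def never_done(state: str) -> tuple[str, bool]:
    """Stand-in agent step that would loop forever."""
    return state + ".", False


finished = run_agent(done_on_second, "task")
print(finished)  # ('task..', 2)

try:
    run_agent(never_done, "task", max_steps=3)
except StepLimitExceeded as exc:
    print(f"stopped: {exc}")
```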
How long does it take?
Review is days. Most productionisation runs to a fixed scope, with the time set by how much guardrail and governance work the use case needs. Heavily regulated cases (financial advice, healthcare, anything touching the EU AI Act high-risk list) take longer.
How much does it cost?
Review is fixed-fee. Productionisation is scoped against the review, priced per phase. You see the number before we touch a line of code, with the cost work separated from the security and governance work.
Get the demo into production.
Send us the repo or a screen recording of the prototype. Thirty minutes on a call and you'll have a clear answer on what would break first under real users, what we'd fix first, and which bits we'd leave alone.