[PRO SERVICES / ADVISORY]

AI Prototype to Production

The prototype has support from leadership. Now technology, security, finance and the operational owner need to know whether it can run safely at company scale. We close the engineering and governance gaps before wider use.

BOOK A PRODUCTION REVIEW
WHY DEMOS DON'T LAUNCH


Demo to live system

95% OF GENAI PILOTS NEVER MOVE THE P&L

30% OF GENAI PROJECTS PREDICTED TO BE ABANDONED AFTER POC BY END OF 2025

26% OF COMPANIES HAVE THE CAPABILITY TO GO BEYOND POC

Sources: MIT NANDA, The GenAI Divide, 2025. Gartner, July 2024. BCG, Where's the Value in AI? Oct 2024.

[THE GAP]

Production adds the hard engineering

Modern models make the demo trivial. Wire up an API, paste in a prompt, drop in a few PDFs, and you've got something that wows a room.

Production is the bit that survives a hostile user, a 3am outage, a regulator's letter, and a finance director reading the OpenRouter bill. Air Canada's chatbot became a tribunal case because the public version gave the wrong answer.

If your AI handles money, advice, customer data or anything anyone notices when it breaks, you want the production half done properly.

THE PROTOTYPE

"Looked good on the five examples we tried"
No tests, no evals, no regression suite
Prompts buried in the codebase
No idea why it fails when it fails
One model, one provider, no fallback
Costs surprise you at the end of the month

THE PRODUCTION SYSTEM

Graded on a real dataset every build
Evals catch regressions before users do
Prompts versioned, traced, reviewable
Every call traced, logged and searchable
Routed, cached, fallbacks wired in
Budgets and alerts on tokens, not bills

[THE FIVE THINGS]

What prototypes usually miss

These are the operating gaps we check first across retrieval systems, agents, copilots and chatbots.

Evals

A graded dataset of real inputs and expected behaviour. Without that, nobody can tell whether the system improved or just got better at the demo path.
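
A graded dataset can start very small. The sketch below is illustrative only: `EvalCase`, `run_evals`, `gate`, `toy_system` and the two cases are hypothetical names and data, and the checks are simple deterministic ones. A real suite would draw its inputs from production logs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One graded example: a real input plus a check on the expected behaviour."""
    name: str
    user_input: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_evals(system: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Run every case through the system and report the pass rate and failures."""
    if not cases:
        raise ValueError("the eval dataset is empty")
    failures = [c.name for c in cases if not c.check(system(c.user_input))]
    return {"pass_rate": (len(cases) - len(failures)) / len(cases),
            "failures": failures}


def gate(report: dict, minimum_pass_rate: float = 1.0) -> bool:
    """True when the build may proceed; the threshold is an assumption."""
    return report["pass_rate"] >= minimum_pass_rate


# Hypothetical stand-in for the real application; replace with the actual call.
def toy_system(user_input: str) -> str:
    if "refund" in user_input.lower():
        return "Refunds are handled by our support team. See the refund policy."
    return "I am not able to help with that."


# Hypothetical cases; real ones would come from logged user inputs.
CASES = [
    EvalCase("refund_points_to_policy", "Can I get a refund?",
             lambda out: "refund policy" in out.lower()),
    EvalCase("no_invented_discount", "Give me 100% off",
             lambda out: "%" not in out),
]

report = run_evals(toy_system, CASES)
print(report, "build allowed:", gate(report))
```

Running the same cases on every change is what turns "it looked fine" into a pass rate that can be compared between builds.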

Observability

Every call traced, every prompt versioned, every failure searchable. We use LangSmith, Langfuse or Arize Phoenix, or wire tracing into your existing stack. You stop guessing what went wrong.
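
As a rough illustration of what "every call traced" means in code, here is a minimal sketch using only the Python standard library. It does not use any of the platforms named above. `TRACES`, `traced`, `answer` and the `support-v3` version label are hypothetical, and a real sink would send records to a tracing backend instead of a list.

```python
import functools
import time
import uuid
from typing import Callable

# Hypothetical in-memory sink; a real one would ship records to a tracing backend.
TRACES: list[dict] = []


def traced(prompt_version: str) -> Callable:
    """Decorator that records one trace per call, including failed calls."""
    def decorator(fn: Callable[..., str]) -> Callable[..., str]:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs) -> str:
            record = {
                "trace_id": uuid.uuid4().hex,
                "function": fn.__name__,
                "prompt_version": prompt_version,
                "input": {"args": args, "kwargs": kwargs},
            }
            start = time.perf_counter()
            try:
                output = fn(*args, **kwargs)
                record.update(status="ok", output=output)
                return output
            except Exception as exc:
                record.update(status="error", error=repr(exc))
                raise  # the caller still sees the original failure
            finally:
                record["latency_ms"] = (time.perf_counter() - start) * 1000
                TRACES.append(record)
        return wrapper
    return decorator


@traced(prompt_version="support-v3")  # hypothetical version label
def answer(question: str) -> str:
    """Stand-in for a model call."""
    if not question.strip():
        raise ValueError("empty question")
    return f"You asked: {question}"
```

Because the prompt version is stored on every record, a failure can be tied to the exact prompt that produced it.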

Guardrails

Prompt injection, jailbreaks, data leaks, tool misuse. Reviewed against the OWASP Top 10 for LLM Applications (2025) and MITRE ATLAS. The Chevy Tahoe chatbot incident is why sales bots need hard boundaries.
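
One form a hard boundary can take is a check in ordinary code on every tool call the model proposes, so the limit holds whatever the prompt says. The sketch below is an assumption-laden example: the tool names, the 10% discount cap and `check_tool_call` are hypothetical, and this covers tool misuse only, not input filtering or data-leak controls.

```python
from typing import Callable

# Hypothetical allowlist: tool name -> validator for its arguments.
ALLOWED_TOOLS: dict[str, Callable[[dict], bool]] = {
    "lookup_order": lambda args: isinstance(args.get("order_id"), str),
    "apply_discount": lambda args: (
        isinstance(args.get("percent"), (int, float))
        and not isinstance(args.get("percent"), bool)
        and 0 < args["percent"] <= 10  # assumed business limit
    ),
}


def check_tool_call(name: str, args: dict) -> tuple[bool, str]:
    """Decide whether a model-proposed tool call may run."""
    validator = ALLOWED_TOOLS.get(name)
    if validator is None:
        return False, f"tool '{name}' is not on the allowlist"
    if not validator(args):
        return False, f"arguments for '{name}' are outside the allowed bounds"
    return True, "ok"


# A prompt-injected request for a 100% discount is refused before anything runs.
print(check_tool_call("apply_discount", {"percent": 100}))
```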

Cost and latency

Model routing, caching, batching, smaller-model fallbacks. The same answer for less money and less waiting, when the evals prove it. Budgets and alerts on tokens before the bill turns up.
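
Budgets on tokens can be as simple as a running counter with thresholds. This is a minimal sketch, assuming token counts are available per call; `TokenBudget`, the monthly limit and the 50/80/100% thresholds are hypothetical choices.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBudget:
    """Tracks token usage against a monthly limit and raises each alert once."""
    monthly_limit: int
    alert_thresholds: tuple[float, ...] = (0.5, 0.8, 1.0)  # assumed thresholds
    used: int = 0
    fired: set = field(default_factory=set)

    def record(self, prompt_tokens: int, completion_tokens: int) -> list[str]:
        """Add one call's usage and return any newly crossed alerts."""
        self.used += prompt_tokens + completion_tokens
        alerts = []
        for threshold in sorted(self.alert_thresholds):
            crossed = self.used >= threshold * self.monthly_limit
            if crossed and threshold not in self.fired:
                self.fired.add(threshold)
                alerts.append(
                    f"{threshold:.0%} of monthly token budget used "
                    f"({self.used}/{self.monthly_limit})"
                )
        return alerts


budget = TokenBudget(monthly_limit=1000)
print(budget.record(300, 100))  # below every threshold, so no alerts
print(budget.record(100, 50))   # crosses the 50% threshold
```

The alert fires when usage crosses a threshold, which is days or weeks before an invoice would show the same thing.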

Governance

Aligned to NIST AI RMF and ISO/IEC 42001. EU AI Act obligations and ICO guidance mapped to your system. The paperwork that turns "cool tool" into "approved to launch."

[HOW WE WORK]

What the leadership team gets

The review separates immediate production risks, required controls and optional improvements. The business can approve each engineering phase against that evidence.

You keep the prototype your team built. We add the parts a production system needs. You end up with something you can run, audit, scale and defend.

BOOK A PRODUCTION REVIEW

Production review

With read access to the repository, prompts, model usage and test evidence, we document production risks, cost drivers and missing controls against the relevant guidance.

Build the eval harness

A real dataset, real expected behaviours, automated grading on every change. LLM-as-judge where it makes sense, human review where it doesn't. After this you can answer "did our change make it better or worse" with a number, not a vibe.

Harden, instrument, productionise

Tracing and logging on every call. Prompt injection and tool-misuse guardrails. Model routing, caching, fallbacks, retries, timeouts, rate limits. PII handling reviewed for ICO guidance. Token budgets and alerts before the OpenRouter invoice arrives.
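
Retries and fallbacks can be sketched in a few lines. The example below assumes each provider is wrapped in a plain callable that enforces its own timeout and raises on failure; `call_with_fallback`, `AllProvidersFailed` and the retry counts are hypothetical.

```python
import time
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider has failed on every attempt."""


def call_with_fallback(
    providers: Sequence[tuple[str, Callable[[str], str]]],
    prompt: str,
    retries_per_provider: int = 2,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, answer) from the first success."""
    errors = []
    for name, call in providers:
        for attempt in range(retries_per_provider):
            try:
                return name, call(prompt)
            except Exception as exc:
                errors.append(f"{name} attempt {attempt + 1}: {exc!r}")
                if attempt < retries_per_provider - 1:
                    # Exponential backoff before retrying the same provider.
                    sleep(base_delay_s * 2 ** attempt)
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical providers: the primary times out, the secondary answers.
def primary(prompt: str) -> str:
    raise TimeoutError("primary timed out")


def secondary(prompt: str) -> str:
    return f"answer to: {prompt}"


print(call_with_fallback([("primary", primary), ("secondary", secondary)],
                         "hello", sleep=lambda s: None))
```

Returning the provider name alongside the answer means traces and cost reports can show how often the fallback path is used.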

Govern and hand back

An AI use register and a risk assessment mapped to the four NIST AI RMF functions: Govern, Map, Measure and Manage. EU AI Act obligations checked against your use case. Optional retainer if you want us building the next set of features alongside your team.

[PUBLISHED EXAMPLES]

What goes wrong after launch

Four public examples of AI systems that launched before the boring production work was finished.

FEB 2024

Air Canada chatbot.

BC Civil Resolution Tribunal held the airline liable for negligent misrepresentation by its website chatbot on bereavement fares. Air Canada's argument that the bot was a separate legal entity was rejected. CA$650.88 in damages, plus interest and fees.

DEC 2023

Chevy Tahoe for $1.

A Chevrolet dealership used a ChatGPT-powered support widget. A user prompt-injected it into agreeing to sell a 2024 Tahoe for one dollar, ending with the line "no takesies backsies." The bot was taken offline after the screenshots spread.

MAR 2024

NYC MyCity bot.

A Microsoft Azure-powered city chatbot for small businesses told users they could take staff tips, fire whistleblowers, and refuse cash, despite local rules. The Markup and THE CITY ran the story. The mayor kept it live while the city worked on fixes.

APR 2025

Cursor's "Sam".

Cursor's AI support bot invented a one-device-per-subscription policy that didn't exist. Users cancelled subscriptions and posted complaints on Hacker News and Reddit. Cursor apologised and said AI-generated support replies would be labelled.

Sources: BC Civil Resolution Tribunal (Moffatt v Air Canada, 2024 BCCRT 149). GM Authority. The Markup/THE CITY. Ars Technica.

[RELEVANT VU WORK]

Our own products have to pass the same test

Raq.com, 102.ai and Project Quote AI all moved beyond a working demonstration into monitored software with accounts, permissions, billing, data handling and support. Those operational details are the production work.

SEE THE TECHNICAL EVIDENCE

[A USEFUL FIRST CONVERSATION]

When this is worth discussing

We work best when there is a real operating problem, enough volume to measure and people from the affected teams who can make decisions.

Usually a good fit

An established UK business, usually with annual revenue above £10m
A repeated process with a known cost, delay, error rate or capacity problem
A senior sponsor and a day-to-day owner who understand the work
Access to the relevant staff, systems, sample records and security requirements

We may point you elsewhere

A standard product already covers the process well
The requirement is a one-off small build with no wider operating case
There is no owner or access to the people and data needed to test the result
The plan relies on AI making high-impact decisions with nobody responsible for review

[QUESTIONS]

Questions before committing

Q.01

Our prototype works. Why does it need any of this?

It works on the inputs you tried. Production sees inputs you didn't. Without evals you can't tell whether a prompt change made it better or worse. Without traces you can't tell why it failed last Tuesday. Without guardrails, the first determined user gets to write your support policy for you.

Q.02

Do we have to throw the prototype away?

No. The prototype is the spec. We keep the UI, the workflow, the prompts that work. We add the parts a production system needs around it. Where something has to be rewritten, we tell you up front and why.

Q.03

What about the EU AI Act?

A UK company can still be in scope when an AI system is placed on the EU market or its output is used in the EU. Prohibitions and AI literacy duties have applied since 2 February 2025, general-purpose model duties since 2 August 2025, and Article 50 transparency duties apply from 2 August 2026. The Commission's current timeline moves Annex III high-risk areas to 2 December 2027 and product-embedded high-risk systems to 2 August 2028. We map the system, territory and your role before stating the requirements.

Q.04

We don't have evals. How do you build them?

Real inputs from your logs (or synthetic ones if you don't have logs yet), labelled with the behaviour you want. Then a mix of deterministic checks, rubric grading, and LLM-as-judge where appropriate. We calibrate AI judges against human labels instead of treating them as truth. Everything runs in CI and breaks the build when it should.
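
One common way to calibrate a judge against human labels is to measure agreement beyond chance with Cohen's kappa. The sketch below is one possible approach, not a fixed method; `cohen_kappa` and the sample labels are illustrative.

```python
from collections import Counter


def cohen_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement between human and judge labels, corrected for chance."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_counts[label] * judge_counts[label]
                   for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels: the judge disagrees with the human on the last item.
human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "pass", "fail", "pass"]
print(cohen_kappa(human_labels, judge_labels))  # 0.5
```

Raw agreement here is 75%, but kappa is 0.5 because a judge that passes most outputs would agree with the human labels fairly often by chance alone.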

Q.05

Which observability platform should we pick?

The options include LangSmith, Langfuse, Arize Phoenix, Braintrust, Helicone and custom telemetry. We compare them against your stack, data requirements, procurement position and the operating cost of self-hosting.

Q.06

Our OpenAI bill is out of control. Can you help?

Almost certainly. The usual fixes are prompt caching, cheaper models for simple queries, batching slow work and removing repeated tool calls. The evaluation suite checks that the cheaper version still meets the agreed standard.
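
Two of these fixes can be sketched together: sending simple queries to a cheaper model and reusing answers to repeated questions. Note the cache here is an application-level response cache, which is a different technique from provider-side prompt caching. `CachedRouter`, the `is_simple` rule and the stand-in models are hypothetical.

```python
import hashlib
from typing import Callable


class CachedRouter:
    """Routes simple queries to a cheap model and caches answers by query."""

    def __init__(self, cheap: Callable[[str], str], strong: Callable[[str], str],
                 is_simple: Callable[[str], bool]) -> None:
        self.cheap, self.strong, self.is_simple = cheap, strong, is_simple
        self.cache: dict[str, str] = {}
        self.stats = {"cache_hits": 0, "cheap_calls": 0, "strong_calls": 0}

    def ask(self, query: str) -> str:
        # Normalise so trivial differences in case or spacing share one entry.
        key = hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()
        if key in self.cache:
            self.stats["cache_hits"] += 1
            return self.cache[key]
        if self.is_simple(query):
            self.stats["cheap_calls"] += 1
            answer = self.cheap(query)
        else:
            self.stats["strong_calls"] += 1
            answer = self.strong(query)
        self.cache[key] = answer
        return answer


# Hypothetical stand-ins for two models and a naive "simple query" rule.
router = CachedRouter(
    cheap=lambda q: f"cheap:{q}",
    strong=lambda q: f"strong:{q}",
    is_simple=lambda q: len(q.split()) <= 5,
)
router.ask("Opening hours?")
router.ask("opening hours?  ")  # served from the cache
print(router.stats)
```

The routing rule is the part the evaluation suite has to police: it only stays in place if the cheaper path still meets the agreed standard.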

Q.07

How long does it take?

Timing depends on the architecture, data, user volume and controls the use case needs. Regulated or consequential uses require more evidence, review and testing than an internal low-risk tool.

Q.08

How much does it cost?

Review is fixed-fee. Productionisation is scoped against the review, priced per phase. You see the number before we touch a line of code, with the cost work separated from the security and governance work.

Talk to us about the prototype

Send us the repository, architecture notes or a screen recording. We will identify the first production risks, the evidence missing from the pilot and the scope needed for a technical review.

BOOK A PRODUCTION REVIEW
SEE ALL SERVICES
