[PRO SERVICES / BUILD]

Build AI Agents
That Do Real Work

A production agent needs a defined job, an operational owner and clearly bounded authority over company systems. We build around those constraints, measure the result and keep people involved wherever a wrong action would matter.

BOOK AN AGENT SCOPING CALL SEE WHAT'S INSIDE ONE

Agents that finish the job

95%

OF ENTERPRISE GENAI PILOTS SHOW NO P&L IMPACT (MIT, 2025)

40%+

OF AGENTIC AI PROJECTS PREDICTED TO BE CANCELLED BY END OF 2027 (GARTNER)

Bounded authority

TOOLS, LIMITS AND APPROVALS

[THE OPERATING PROBLEM]

Production agents need more than a prompt

A prototype often proves the main path without covering permissions, exceptions, monitoring or recovery. A live agent needs those operating controls, a named owner and a safe way to stop it.

MIT NANDA's July 2025 State of AI in Business report found 95% of enterprise GenAI pilots had no measurable P&L impact. Gartner predicts more than 40% of agentic AI projects will be cancelled by the end of 2027. The agents that survive look different: scoped to one job, wired to real tools and tested against the cases that go wrong.

We build the second kind.

THE DEMO AGENT

"Answers anything" with no real job
One prompt, no tools, no memory
No evals, no idea if it got worse
Hallucinates a refund, you pay it
Nobody owns it after launch week

A WORKING AGENT

One named task, one definition of done
Tools wired in via MCP and HTTP
Evals that run on every change
Hard limits on what it can spend or send
An audit log a human can read

[WHAT'S INSIDE]

What a production agent needs

Anthropic's "Building effective agents" guide makes the case for starting with simple workflows and only reaching for autonomy when the job needs it. Whether yours ends up a workflow or an agent, these five parts have to be there.

The job

One task, written down, with a clear "done". Triage this inbox. Reconcile yesterday's payments. Qualify these leads. If you can't write it on a Post-it, it's too vague.

Tools

The handful of APIs the agent can call. Read this table. Send that email. Refund up to £50. Wired via Anthropic's Model Context Protocol where it makes sense, plain HTTP where it doesn't.

Guardrails

What the agent can do alone, what needs a human, what it must never do. OWASP's 2025 Top 10 for LLM Applications has a name for an agent given more authority than its job requires: LLM06, Excessive Agency. We enforce the limits in code rather than leaving them to the prompt.
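
One way to picture the three tiers is a default-deny lookup, sketched below. The action names and the two sets are invented for illustration, and a real policy would usually look at parameters as well as the action name.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"              # the agent may act alone
    NEEDS_HUMAN = "needs_human"  # a person must approve first
    DENY = "deny"                # never, whatever the prompt says


# Hypothetical action names; yours come from the scoped job.
AUTONOMOUS = {"read_order", "draft_reply"}
HUMAN_APPROVAL = {"send_email", "issue_refund"}


def decide(action: str) -> Decision:
    """Classify an action; anything unlisted is denied by default."""
    if action in AUTONOMOUS:
        return Decision.ALLOW
    if action in HUMAN_APPROVAL:
        return Decision.NEEDS_HUMAN
    return Decision.DENY
```

Denying by default means a new or misspelled action fails closed instead of slipping through.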

Evals

A bank of real cases with a known good outcome. Every prompt change, every model swap, every new tool: the agent has to keep passing them. No evals, no idea if you've made it worse.
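
A minimal sketch of such a case bank, assuming the agent can be called as a plain function and that an exact string match is a good enough check. `Case`, `run_evals` and `toy_agent` are hypothetical names, and real evals often need fuzzier scoring than equality.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Case:
    name: str
    input_text: str
    expected: str


def run_evals(agent: Callable[[str], str], cases: List[Case]) -> List[str]:
    """Run every case and return the names of the ones that failed."""
    return [c.name for c in cases if agent(c.input_text) != c.expected]


def toy_agent(text: str) -> str:
    """Stand-in for the real agent: routes anything mentioning an invoice."""
    return "billing" if "invoice" in text.lower() else "general"


CASES = [
    Case("invoice query", "Where is my invoice?", "billing"),
    Case("opening hours", "When are you open?", "general"),
]

failures = run_evals(toy_agent, CASES)
```

Run in CI, a non-empty `failures` list would block the prompt change, model swap or new tool until the cases pass again.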

The audit log

Every run, every tool call, every decision, written down. So when something goes sideways at 2am you can read what happened, replay it, and tell the customer exactly what their agent did.
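
As a sketch of what "written down" can mean, the code below appends one JSON object per event and reads a single run back in order. The field names (`ts`, `run_id`, `kind`, `detail`) are assumptions, not a standard, and an in-memory buffer stands in for an append-only file or log store.

```python
import io
import json
import time
from typing import Any, Dict, List, TextIO


def log_event(stream: TextIO, run_id: str, kind: str, detail: Dict[str, Any]) -> None:
    """Append one event as a single line of JSON."""
    record = {"ts": time.time(), "run_id": run_id, "kind": kind, "detail": detail}
    stream.write(json.dumps(record, sort_keys=True) + "\n")


def read_run(stream: TextIO, run_id: str) -> List[Dict[str, Any]]:
    """Return every event for one run, in the order it was written."""
    stream.seek(0)
    records = [json.loads(line) for line in stream if line.strip()]
    return [r for r in records if r["run_id"] == run_id]


# Usage: an in-memory buffer stands in for the real log file.
log = io.StringIO()
log_event(log, "run-1", "tool_call", {"tool": "read_table", "name": "orders"})
log_event(log, "run-2", "tool_call", {"tool": "send_email"})
log_event(log, "run-1", "decision", {"outcome": "queued_for_human"})
run_1 = read_run(log, "run-1")
```

One line per event keeps the log readable by a person with a text editor as well as by a replay script.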

[HOW WE WORK]

How we take it into live use

We start with a short scoping call and price the first useful agent as a defined piece of work. If it proves itself, we'll map the next jobs. We build in your stack, on your infrastructure, with your data, and you own all of it.

LangGraph, the OpenAI Agents SDK, plain Python or Laravel jobs, depending on the job. Whichever is simplest. We don't sell you a framework.

Pick the right job

We sit with you for half a day and find the task that's worth automating: repeatable, well-defined, currently eating someone's morning. We say no to the ones that aren't ready and tell you why.

Build the agent

The simplest pattern that works. Often that's a scripted workflow with one or two LLM calls, not a fully autonomous agent. We wire the tools, write the prompts, set the limits. Working version in your sandbox.
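
To make "a scripted workflow with one or two LLM calls" concrete, here is a sketch of an inbox triage step with exactly one model call between two deterministic checks. The `classify` callable is a placeholder for whichever model client is used, and the labels and `fake_classifier` are invented for the example.

```python
from typing import Callable

# Hypothetical queues; anything else the model says is treated as a failure.
ALLOWED_LABELS = {"billing", "delivery", "other"}


def triage(email_body: str, classify: Callable[[str], str]) -> dict:
    """Route one email: check input, make one model call, validate output."""
    text = email_body.strip()
    if not text:
        # Deterministic step: nothing to classify, so no model call is made.
        return {"queue": "other", "needs_human": True}

    label = classify(text).strip().lower()  # the only model call

    if label not in ALLOWED_LABELS:
        # Deterministic step: an unexpected answer goes to a person.
        return {"queue": "other", "needs_human": True}
    return {"queue": label, "needs_human": False}


def fake_classifier(text: str) -> str:
    """Stand-in for a model call so the sketch runs without a provider."""
    return "Billing" if "invoice" in text.lower() else "no idea"
```

Passing the model in as a function keeps the workflow testable with a fake and makes a later model swap a one-line change.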

Evals, guardrails, audit log

Cases with known answers. Hard limits on cost, scope and blast radius. A log a human can read. We score the agent before launch and again every time anyone changes it.

Launch, watch, iterate

Live to a small share of traffic first. We watch the log with you for the first week, fix what breaks, then ramp it up. Optional retainer for the next agent, or we hand it over and walk away.
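
One common way to send "a small share of traffic" to the agent is to hash a stable key into a bucket from 0 to 99, sketched below. The function name `in_rollout` and the choice of key (a customer or ticket ID) are assumptions for the example.

```python
import hashlib


def in_rollout(key: str, percent: int) -> bool:
    """Return True if this key falls inside the current rollout percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket, 0 to 99
    return bucket < percent
```

Because the bucket depends only on the key, the same customer gets the same path on every request, and raising the percentage only ever adds keys to the rollout.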

[CAUTIONARY TALES]

Published AI agent failures

Four well-known stories from the last couple of years. None of them happened because the model was bad. They happened because the job, the guardrails, the evals or the audit log weren't there yet.

FEB 2024

Moffatt v. Air Canada.

The chatbot told a customer that a bereavement fare could be claimed retroactively, a policy that didn't exist. The BC Civil Resolution Tribunal held Air Canada liable for the misstatement, ordered it to pay damages and rejected its argument that the chatbot was "a separate legal entity". No guardrail on what the agent could promise.

DEC 2023

A Chevy Tahoe for $1.

Chevrolet of Watsonville's dealership chatbot was prompt-injected into saying it agreed to sell a 2024 Chevy Tahoe for $1. No instruction hardening, no scope on what it could say. The bot was disabled after the screenshots spread.

JAN 2024

DPD's chatbot swore at a customer.

After a system update, DPD's "Ruby" bot called the company "the worst delivery company in the world" and wrote a self-deprecating poem. The screenshots spread on X. A regression eval should have caught it before the update went live.

MAY 2025

Klarna walked it back.

Having said its AI assistant did the work of 700 full-time customer-service agents, Klarna said it was hiring people again because support quality mattered. The right tool, pointed at too big a slice of the job.

Sources: BC Civil Resolution Tribunal (Moffatt v. Air Canada, 2024 BCCRT 149), Anthropic, Business Insider, TIME, Fortune, MIT NANDA State of AI in Business 2025, Gartner, OWASP Gen AI Security Project.

[RELEVANT VU WORK]

The job comes before the agent label

AM2PM automates assessment reminders, grading and CRM updates. Crystal runs agreed compliance checks and keeps the evidence with each deal. The useful unit is the completed business job, with people reviewing the exceptions.

SEE OUR WORK

[A USEFUL FIRST CONVERSATION]

When this is worth discussing

We work best when there is a real operating problem, enough volume to measure and people from the affected teams who can make decisions.

Usually a good fit

An established UK business, usually with annual revenue above £10m
A repeated process with a known cost, delay, error rate or capacity problem
A senior sponsor and a day-to-day owner who understand the work
Access to the relevant staff, systems, sample records and security requirements

We may point you elsewhere

A standard product already covers the process well
The requirement is a one-off small build with no wider operating case
There is no owner or access to the people and data needed to test the result
The plan relies on AI making high-impact decisions with nobody responsible for review

[QUESTIONS]

Questions before connecting the systems

Q.01

Isn't this just a chatbot?

No. A chatbot replies. An agent finishes a task. The thing we build reads a real inbox, takes a real action, writes the result to a real system, and stops when it hits its limits. Some have a chat surface, most don't.

Q.02

What kind of jobs work?

Repeatable tasks with a clear definition of done and a paper trail. Triaging and routing inbound emails. First-pass support replies. Qualifying leads against a brief. Reconciling invoices to bookings. Drafting reports from raw data. Anything where someone today follows a checklist and copies between two systems.

Q.03

What kind of jobs don't?

Anything where a bad answer can't be caught before it hurts somebody. Medical advice. Legal advice. Final-sign-off financial decisions. Anything regulated where the audit trail has to end at a named human. We'll tell you that on the first call, not three months in.

Q.04

How do you stop it doing something stupid?

Hard limits in the code, not in the prompt. The agent can refund up to a number you set. It can email these addresses, not those. It can read these tables, not write to them. Anything outside the box queues for a human. We treat the prompt as advice and the code as law.
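
A sketch of "the code as law", assuming three kinds of limit and an in-memory approval queue. Every value here is a placeholder you would set yourself: the refund cap, the email domain, the table names and the `Action` shape are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical limits; in practice these come from reviewed configuration.
MAX_REFUND_PENCE = 5_000            # whole pence avoids float rounding
ALLOWED_EMAIL_DOMAINS = {"example.com"}
READABLE_TABLES = {"orders", "customers"}

approval_queue: deque = deque()


@dataclass(frozen=True)
class Action:
    kind: str            # "refund", "email", "read" or "write"
    target: str          # order ID, email address or table name
    amount_pence: int = 0


def within_limits(action: Action) -> bool:
    """Check one proposed action against the hard limits."""
    if action.kind == "refund":
        return 0 < action.amount_pence <= MAX_REFUND_PENCE
    if action.kind == "email":
        domain = action.target.rsplit("@", 1)[-1].lower()
        return domain in ALLOWED_EMAIL_DOMAINS
    if action.kind == "read":
        return action.target in READABLE_TABLES
    return False  # writes and unknown kinds are never automatic


def submit(action: Action) -> str:
    """Run the action if it is inside the box, otherwise queue it for a human."""
    if within_limits(action):
        return "executed"
    approval_queue.append(action)
    return "queued"
```

Nothing the model writes can change these checks, because they run after the model has proposed an action and before anything is executed.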

Q.05

Which framework do you use? LangGraph? CrewAI? OpenAI Agents SDK?

We choose after defining state, checkpoints, handoffs, tracing, integrations and the skills of the team that will operate it. The answer may be an agent framework, a standard application queue or a simpler deterministic workflow.

Q.06

Where does the agent run, and who sees the data?

Your infrastructure or ours, your call. Your data stays in your stack. Model calls go via the provider you choose, with the controls you need (UK/EU residency, no-training settings, BYOK where we can). We map it to ICO guidance on AI and the NCSC's Guidelines for Secure AI System Development so a future audit isn't a surprise.

Q.07

How much does it cost?

Scoping is fixed-fee. A first production agent is priced per phase against the integrations and review requirements. The proposal includes model assumptions, usage metering and spend controls so finance can see which costs vary with volume.
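
As a rough picture of usage metering with a spend control, the sketch below charges each model call against a budget and refuses the call that would exceed it. The flat price per 1,000 tokens is a simplifying assumption, since real providers price input and output tokens separately, and `SpendMeter` is a hypothetical name.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class SpendMeter:
    budget: Decimal                 # maximum spend for the period
    price_per_1k_tokens: Decimal    # assumed flat rate, for illustration only
    spent: Decimal = Decimal("0")

    def cost_of(self, tokens: int) -> Decimal:
        """Cost of a call of the given size at the assumed flat rate."""
        return self.price_per_1k_tokens * Decimal(tokens) / Decimal(1000)

    def charge(self, tokens: int) -> Decimal:
        """Record a call, or raise if it would take spend over budget."""
        cost = self.cost_of(tokens)
        if self.spent + cost > self.budget:
            raise RuntimeError("spend limit reached")
        self.spent += cost
        return cost
```

`Decimal` is used instead of floats so the running total matches what finance would compute by hand.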

Q.08

We already tried an agent project and it died. Why is this different?

Usually it died because nobody picked a small enough job, nobody wrote evals, and the demo wowed the board but couldn't survive a Tuesday. We start narrow on purpose. One job. Real cases. If your last one ran for six months without a user, you already know the cost of skipping that.

Talk to us about the job your agent needs to do

Tell us the job, its volume, the systems involved and the cost of a wrong action. We will assess whether an agent is appropriate and where approval or conventional software should stay in control.

BOOK AN AGENT SCOPING CALL SEE ALL SERVICES
