[PRO SERVICES / BUILD]

Build AI Agents
That Do Real Work

Most agent projects never leave the demo. We pick a real job, build an agent that does it end to end, wire the evals and the audit log, and hand back something you can measure on Monday morning.

Agents that finish the job

95%

OF ENTERPRISE GENAI PILOTS SHOW NO P&L IMPACT (MIT, 2025)

40%+

AGENTIC AI PROJECTS CANCELLED BY END OF 2027 (GARTNER)

Days

TO FIRST AGENT IN PRODUCTION

[THE SHIFT]

A demo isn't an agent.

The thing that worked in a Friday afternoon prototype isn't the thing you put in front of customers. The demo gets the happy path right and ignores the other ninety. Production has to handle the other ninety, log what it did, and let you turn it off when it's wrong.

MIT NANDA's July 2025 State of AI in Business report found 95% of enterprise GenAI pilots had no measurable P&L impact. Gartner reckons 40%+ of agentic AI projects will be cancelled by the end of 2027. The agents that survive look different. Scoped to one job, wired to real tools, tested against cases that actually go wrong.

We build the second kind.

THE DEMO AGENT

  • "Answers anything" with no real job
  • One prompt, no tools, no memory
  • No evals, no idea if it got worse
  • Hallucinates a refund, you pay it
  • Nobody owns it after launch week

A WORKING AGENT

  • One named task, one definition of done
  • Tools wired in via MCP and HTTP
  • Evals that run on every change
  • Hard limits on what it can spend or send
  • An audit log a human can read

[WHAT'S INSIDE]

A production agent has five moving parts.

Anthropic's "Building effective agents" guide makes the case for starting with simple workflows and only reaching for autonomy when the job needs it. Whether yours ends up a workflow or an agent, these five parts have to be there.

01

The job

One task, written down, with a clear "done". Triage this inbox. Reconcile yesterday's payments. Qualify these leads. If you can't write it on a Post-it, it's too vague.

02

Tools

The handful of APIs the agent can actually call. Read this table. Send that email. Refund up to £50. Wired via Anthropic's Model Context Protocol where it makes sense, plain HTTP where it doesn't.

03

Guardrails

What the agent can do alone, what needs a human, what it must never do. OWASP's 2025 Top 10 for LLM Applications calls the failure mode LLM06: Excessive Agency. We don't leave it to the prompt.

04

Evals

A bank of real cases with a known good outcome. Every prompt change, every model swap, every new tool: the agent has to keep passing them. No evals, no idea if you've made it worse.
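
A minimal sketch of what a regression eval can look like, in Python. Everything in it is an assumption for illustration: `EvalCase`, `run_evals` and the toy keyword router stand in for a real agent and a real bank of cases. The point is the shape: known inputs, known good outcomes, one score you can compare before and after a change.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One real case with a known good outcome (names are hypothetical)."""
    name: str
    input_text: str
    expected: str


def run_evals(agent: Callable[[str], str], cases: list) -> dict:
    """Run every case through the agent and collect the ones that fail."""
    failures = []
    for case in cases:
        actual = agent(case.input_text)
        if actual != case.expected:
            failures.append((case.name, case.expected, actual))
    return {
        "passed": len(cases) - len(failures),
        "total": len(cases),
        "failures": failures,
    }


def toy_triage_agent(email_body: str) -> str:
    """Stand-in for a real agent: routes an email by keyword."""
    text = email_body.lower()
    if "refund" in text:
        return "billing"
    if "password" in text:
        return "it-support"
    return "general"


CASES = [
    EvalCase("refund request", "I would like a refund for order 1182", "billing"),
    EvalCase("locked out", "I forgot my password again", "it-support"),
    EvalCase("unclear", "Hello, are you open on Sunday?", "general"),
]

report = run_evals(toy_triage_agent, CASES)
print(f"{report['passed']}/{report['total']} cases passed")
```

Run something like this on every prompt change, model swap or new tool, and block the release if `failures` isn't empty.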

05

The audit log

Every run, every tool call, every decision, written down. So when something goes sideways at 2am you can read what happened, replay it, and tell the customer exactly what their agent did.
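
A minimal sketch of a readable audit log, assuming one JSON object per line and a run ID that ties a run's events together. `AuditLog`, `read_run` and the field names are hypothetical, and the in-memory buffer stands in for a file or a database table.

```python
import io
import json
import uuid
from datetime import datetime, timezone
from typing import Optional, TextIO


class AuditLog:
    """Append-only log: one JSON line per event, grouped by run ID."""

    def __init__(self, stream: TextIO, run_id: Optional[str] = None):
        self.stream = stream
        self.run_id = run_id or uuid.uuid4().hex

    def record(self, event: str, **details) -> dict:
        entry = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "run_id": self.run_id,
            "event": event,
            "details": details,
        }
        self.stream.write(json.dumps(entry, sort_keys=True) + "\n")
        return entry


def read_run(stream: TextIO, run_id: str) -> list:
    """Return every entry for one run, in the order it was written."""
    stream.seek(0)
    entries = (json.loads(line) for line in stream if line.strip())
    return [entry for entry in entries if entry["run_id"] == run_id]


# In-memory buffer so the sketch runs anywhere; a real log would be a file or table.
buffer = io.StringIO()
log = AuditLog(buffer, run_id="run-001")
log.record("tool_call", tool="lookup_order", order_id="A-1182")
log.record("decision", action="queue_for_human", reason="refund above limit")

for entry in read_run(buffer, "run-001"):
    print(entry["ts"], entry["event"], entry["details"])
```

Because each line is self-contained, reading back a run at 2am is a filter on `run_id`, not a forensic exercise.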

[HOW WE WORK]

Where we come in.

A short scoping call. A fixed-price first agent live. A roadmap if there's more to do after. We build in your stack, on your infrastructure, with your data, and you own all of it.

LangGraph, the OpenAI Agents SDK, plain Python or Laravel jobs, depending on the job. Whichever is simplest. We don't sell you a framework.

BOOK AN AGENT SCOPING CALL

01

Pick the right job

We sit with you for half a day and find the task that's worth automating: repeatable, well-defined, currently eating someone's morning. We say no to the ones that aren't ready and tell you why.

02

Build the agent

The simplest pattern that works. Often that's a scripted workflow with one or two LLM calls, not a fully autonomous agent. We wire the tools, write the prompts, set the limits. Working version in your sandbox.

03

Evals, guardrails, audit log

Cases with known answers. Hard limits on cost, scope and blast radius. A log a human can read. We score the agent before launch and again every time anyone changes it.

04

Launch, watch, iterate

Live to a small share of traffic first. We watch the log with you for the first week, fix what breaks, then ramp it up. Optional retainer for the next agent, or we hand it over and walk away.
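
One way to send a small share of traffic to the agent, sketched in Python. It assumes you have a stable key per customer or ticket; `in_rollout` and the `customer-…` keys are hypothetical. Hashing the key means the same customer always gets the same answer, and raising the percentage only ever adds people.

```python
import hashlib


def in_rollout(key: str, percent: float) -> bool:
    """Decide, deterministically, whether this key is in the rollout."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Map the key to a stable bucket from 0 to 9999 (steps of 0.01%).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


keys = [f"customer-{i}" for i in range(10_000)]
for percent in (5, 25, 100):
    share = sum(in_rollout(key, percent) for key in keys) / len(keys)
    print(f"target {percent}% -> actual {share:.1%}")
```

Ramping up is then a config change to `percent`, and turning the agent off is setting it to 0.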

[CAUTIONARY TALES]

When the agent skipped these parts.

Four well-known stories from the last couple of years. None of them happened because the model was bad. They happened because the job, the guardrails, the evals or the audit log weren't there yet.

FEB 2024

Moffatt v. Air Canada.

The chatbot invented a retroactive bereavement-fare refund policy that didn't exist. The BC tribunal ordered Air Canada to honour what the bot promised and rejected the "the chatbot is a separate legal entity" defence. No guardrail on what the agent could promise.

DEC 2023

A Chevy Tahoe for $1.

Chevrolet of Watsonville's dealership chatbot was prompt-injected into saying it agreed to sell a 2024 Chevy Tahoe for $1. No instruction hardening, no scope on what it could say. The bot was disabled after the screenshots spread.

JAN 2024

DPD's chatbot swore at a customer.

After a system update DPD's "Ruby" bot called the company "the worst delivery company in the world" and wrote a self-deprecating poem. The screenshots spread on X. A regression eval should have caught it before launch.

MAY 2025

Klarna walked it back.

Having said its AI assistant did the work of 700 full-time customer-service agents, Klarna later said it was hiring people again because support quality mattered. The right tool, pointed at too big a slice of the job.

Sources: BC Civil Resolution Tribunal (Moffatt v. Air Canada, 2024 BCCRT 149), Anthropic, Business Insider, TIME, Fortune, MIT NANDA State of AI in Business 2025, Gartner, OWASP Gen AI Security Project.

[QUESTIONS]

The ones we get asked first.

Q.01

Isn't this just a chatbot?

No. A chatbot replies. An agent finishes a task. The thing we build reads a real inbox, takes a real action, writes the result to a real system, and stops when it hits its limits. Some have a chat surface, most don't.

Q.02

What kind of jobs actually work?

Repeatable tasks with a clear definition of done and a paper trail. Triaging and routing inbound emails. First-pass support replies. Qualifying leads against a brief. Reconciling invoices to bookings. Drafting reports from raw data. Anything where someone today follows a checklist and copies between two systems.

Q.03

What kind of jobs don't?

Anything where a bad answer can't be caught before it hurts somebody. Medical advice. Legal advice. Final-sign-off financial decisions. Anything regulated where the audit trail has to end at a named human. We'll tell you that on the first call, not three months in.

Q.04

How do you stop it doing something stupid?

Hard limits in the code, not in the prompt. The agent can refund up to a number you set. It can email these addresses, not those. It can read these tables, not write to them. Anything outside the box queues for a human. We treat the prompt as advice and the code as law.
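
A minimal sketch of "code as law", in Python. The `Limits` fields, the action shapes and the `dispatch` helper are assumptions for illustration, not a fixed design. What matters is that the check runs outside the model, and anything it doesn't recognise is refused by default and queued for a human.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Limits:
    """Limits the client sets; the agent cannot change them."""
    max_refund: Decimal
    allowed_recipients: frozenset
    readable_tables: frozenset


def check_action(action: dict, limits: Limits) -> tuple:
    """Return (allowed, reason) for an action the agent proposes."""
    kind = action.get("kind")
    if kind == "refund":
        amount = Decimal(str(action["amount"]))
        if Decimal("0") < amount <= limits.max_refund:
            return True, "refund within limit"
        return False, "refund outside limit"
    if kind == "email":
        if action["to"].lower() in limits.allowed_recipients:
            return True, "recipient on allowlist"
        return False, "recipient not on allowlist"
    if kind == "read_table":
        if action["table"] in limits.readable_tables:
            return True, "table is readable"
        return False, "table is not readable"
    # Deny by default: writes, deletes and anything unrecognised.
    return False, f"action kind {kind!r} is not permitted"


def dispatch(action: dict, limits: Limits, human_queue: list) -> str:
    """Approve the action, or park it for a human with the reason attached."""
    allowed, reason = check_action(action, limits)
    if not allowed:
        human_queue.append({"action": action, "reason": reason})
        return "queued_for_human"
    return "approved"


LIMITS = Limits(
    max_refund=Decimal("50"),
    allowed_recipients=frozenset({"support@example.com"}),
    readable_tables=frozenset({"orders"}),
)
human_queue = []

print(dispatch({"kind": "refund", "amount": "20.00"}, LIMITS, human_queue))
print(dispatch({"kind": "refund", "amount": "500.00"}, LIMITS, human_queue))
print(dispatch({"kind": "write_table", "table": "orders"}, LIMITS, human_queue))
```

The first action is approved; the other two land in `human_queue` with a reason, whatever the prompt said.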

Q.05

Which framework do you use? LangGraph? CrewAI? OpenAI Agents SDK?

Whichever is simplest for the job. LangGraph when we need explicit state and checkpointing. OpenAI's Agents SDK when handoffs and tracing matter most. Plain Python or a Laravel queue when neither of those is needed. We've built agents in all three. The framework choice is the boring bit.

Q.06

Where does the agent run, and who sees the data?

Your infrastructure or ours, your call. Your data stays in your stack. Model calls go via the provider you choose, with the controls you need (UK/EU residency, no-training settings, BYOK where we can). We map it to ICO guidance on AI and the NCSC's Guidelines for Secure AI System Development so a future audit isn't a surprise.

Q.07

How much does it cost?

Scoping is fixed-fee. A first production agent is priced per phase, scoped against the integrations. We tell you the number before we start, and we run the agent's token spend on a budget you set. No surprise bills.
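
A minimal sketch of a per-run token budget, in Python. `TokenBudget`, `BudgetExceeded` and the token counts are hypothetical; in practice the counts would come from the usage figures your model provider reports after each call.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would take the run past its token budget."""


class TokenBudget:
    """Tracks token spend for one run against a hard ceiling."""

    def __init__(self, max_tokens: int):
        if max_tokens <= 0:
            raise ValueError("max_tokens must be positive")
        self.max_tokens = max_tokens
        self.used = 0

    @property
    def remaining(self) -> int:
        return self.max_tokens - self.used

    def charge(self, tokens: int) -> int:
        """Record spend and return what is left, or refuse if it won't fit."""
        if tokens < 0:
            raise ValueError("tokens must not be negative")
        if tokens > self.remaining:
            raise BudgetExceeded(
                f"needs {tokens} tokens, only {self.remaining} left"
            )
        self.used += tokens
        return self.remaining


budget = TokenBudget(max_tokens=10_000)
budget.charge(4_000)
budget.charge(5_000)
try:
    budget.charge(2_000)  # would exceed the ceiling, so the run stops here
except BudgetExceeded as exc:
    print(f"stopped: {exc}")
print(f"tokens used: {budget.used}, remaining: {budget.remaining}")
```

A refused charge leaves the totals untouched, so the run can stop cleanly and hand off to a human instead of running up the bill.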

Q.08

We already tried an agent project and it died. Why is this different?

Usually it died because nobody picked a small enough job, nobody wrote evals, and the demo wowed the board but couldn't survive a Tuesday. We start narrow on purpose. One job. Real cases. If your last one ran for six months without a user, you already know the cost of skipping that.

Vu Agency agent design session

Pick the job. We'll build the agent.

Tell us the task you'd hand to an agent if you trusted one. Thirty minutes, no slides. You'll get a clear answer on whether it's a good first agent, what it would take to build, and what to leave for a human.
