[PRO SERVICES / BUILD]

Human-in-the-Loop Automation

Let the AI handle the routine work. Keep a human on anything that can fire a customer, delete the production database or end up in tribunal. We design and build the gate.

Approve. Edit. Reject. Audit.

Gate

ROUTINE WORK PASSES. RISKY WORK PAUSES.

Art. 22A

UK GDPR, IN FORCE 5 FEB 2026

Days

FROM AGENT TO AUDITABLE WORKFLOW

[THE TRUTH]

The risk starts when the system can act.

The public failures keep landing in the same place: an AI system acts, the action reaches the outside world, the company explains it afterwards.

Replit's coding agent deleted a production database during a July 2025 experiment, then fabricated test data and claimed rollback was impossible. Cursor's support bot invented a login policy in April 2025 and triggered cancellations. Air Canada's chatbot gave wrong bereavement-fare advice and the British Columbia Civil Resolution Tribunal made the airline pay.

In each case the expensive step needed a pause, a named reviewer, and a record of the decision.

FULLY AUTONOMOUS AGENT

  • Acts first, you find out after
  • One bad prompt deletes real data
  • No record of what it did or why
  • Hallucinated policy lands in a customer inbox
  • Regulator asks who decided. Nobody did.

HUMAN-IN-THE-LOOP

  • AI proposes. The system pauses.
  • Destructive moves need a real approval
  • Every decision logged with the inputs
  • Reviewer sees the draft before the customer does
  • A named person is on record as the decider

[THE ANATOMY]

Five parts. Same five parts every time.

Whether it's a sales reply, a refund, a hiring decision or a database migration, a properly built HITL workflow runs the same five steps. Once you can see them, you can wire them in.

01

Trigger

A customer email, a webhook, a row added to a table, a Slack message. The event that starts the workflow.

02

AI proposes

A draft action, its reasoning, a confidence score, and the inputs it used. At this point the AI has written a proposal. It has not run the action.

03

Routing rule

High confidence and low blast radius? Auto-execute. Anything else gets queued. Truly weird gets escalated to a named person.

04

Human reviews

Sees the proposal, the inputs, the diff, the reasoning. Can approve, edit then approve, or reject. Not a single "yes" button on its own.

05

Execute & audit

Only after approval. Reversible where possible. Full log of inputs, model, prompt version, reviewer ID and timestamp. The bit you need when the ICO, your insurer or your board asks what happened.
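
For teams who think in code, here is a minimal sketch of steps 02 to 04: the proposal, the routing rule and the review outcome. Everything named here is an assumption for illustration, including the `Proposal` fields, the 0.9 confidence threshold, the blast-radius limit of one record, and the choice to treat an out-of-range confidence score as "truly weird". Real thresholds come out of the mapping work, not out of a default argument.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Route(Enum):
    AUTO = "auto"          # execute without review
    QUEUE = "queue"        # wait for a reviewer
    ESCALATE = "escalate"  # send to a named person


@dataclass(frozen=True)
class Proposal:
    action: str        # what the AI wants to do, e.g. "tag_record"
    draft: str         # the proposed content or command
    reasoning: str     # why the AI proposes it
    confidence: float  # expected to be between 0.0 and 1.0
    blast_radius: int  # hypothetical unit: records or customers affected
    reversible: bool


def route(p: Proposal, auto_confidence: float = 0.9, max_auto_blast: int = 1) -> Route:
    """Step 03: decide whether a proposal runs, queues or escalates."""
    # Assumption: a confidence score outside 0..1 counts as "truly weird".
    if not 0.0 <= p.confidence <= 1.0:
        return Route.ESCALATE
    # High confidence and low blast radius, and only for reversible actions.
    if p.reversible and p.confidence >= auto_confidence and p.blast_radius <= max_auto_blast:
        return Route.AUTO
    return Route.QUEUE


def review(p: Proposal, decision: str, edited_draft: Optional[str] = None) -> Optional[str]:
    """Step 04: return the text cleared for execution, or None if rejected."""
    if decision == "approve":
        return p.draft
    if decision == "edit":
        if not edited_draft:
            raise ValueError("an edit decision needs the edited draft")
        return edited_draft
    if decision == "reject":
        return None
    raise ValueError(f"unknown decision: {decision!r}")
```

Note that `review` has three outcomes, not one. An "edit then approve" path is what separates a review step from a single "yes" button.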

[WHERE THE GATE GOES]

Not on everything. On the bits that bite.

HITL on every step is slow software with an extra payroll line. The skill is knowing where to put the gate. We use four tests.

Is it reversible?

Drafting a reply is reversible. Sending it isn't. Tagging a record is reversible. Deleting it isn't. The gate goes between the reversible and the irreversible step, every time.

What's the blast radius?

One customer or every customer? One row or the whole table? A draft Slack message or an outbound payment? Bigger blast, harder gate.

Is the regulator interested?

Credit, hiring, insurance, medical, education, anything that materially affects a person's life. UK GDPR Articles 22A to 22D set rules for solely automated decisions with legal or similarly serious effects. EU AI Act Article 14 requires human oversight for high-risk AI systems. A rubber-stamp queue won't do the job.

Would a screenshot end up on X?

Anything a customer can quote back to you in public. Outbound emails, refunds, account changes, anything the brand's voice attaches to. Gate first, send second.
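
The four tests can be written down as a rule, which makes them easier to argue about. This is a sketch only: the `Step` fields, the cut-off of 100 for a "wide" blast radius, and the mapping from test results to gate levels are our illustrative assumptions, not a standard and not legal advice on what Articles 22A to 22D or Article 14 require.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    reversible: bool
    blast_radius: int       # hypothetical unit: people or records affected
    regulated: bool         # credit, hiring, insurance, medical, education...
    customer_visible: bool  # could a customer quote it back in public?


def gate_for(step: Step, wide_blast: int = 100) -> str:
    """Map the four tests to a gate level: "auto", "queue" or "escalate"."""
    irreversible_and_wide = (not step.reversible) and step.blast_radius >= wide_blast
    # Hardest gate: regulated decisions, or irreversible actions at scale.
    if step.regulated or irreversible_and_wide:
        return "escalate"
    # Any single failed test is enough to put a human in front of the step.
    if (not step.reversible) or step.customer_visible or step.blast_radius >= wide_blast:
        return "queue"
    return "auto"
```

Drafting a reply passes all four tests and runs on its own. Sending it fails two and waits in the queue.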

[HOW WE WORK]

Where we come in.

We build the gate inside your agent or workflow. Approval queues, confidence routing, audit logs, the bits the regulator asks for and the bits that stop you ringing your lawyer at 11pm.

If you've already got an agent in production, we wrap it. If you don't yet, we design it gate-first so you never have to retrofit it.

BOOK A DESIGN CALL

01

Map the workflow

We walk the actual job end to end. Every action the AI is about to take, every input it uses, every record it touches. We mark each step on the four tests: reversible, blast radius, regulated, customer-visible. Scope and price up front.

02

Design the gates

Per step: auto, queue, or escalate. Confidence thresholds with real numbers. Service levels on the human step so the queue doesn't become the bottleneck. Named owners. A written policy your team can actually read.

03

Build it in

Approval queues your team works through in Slack, email or a dedicated dashboard. Pause and resume on durable workflows so an approval can take ten seconds or ten days. LangGraph interrupts, Temporal signals, Inngest steps, whatever fits your stack. Reviewer sees the inputs, the proposal, the diff, the confidence.
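
The pause-and-resume idea is simpler than the tooling makes it look. The sketch below shows the bare pattern using only Python's standard library: persist the pending action, stop, and carry on only when an approval event arrives. It is not the LangGraph, Temporal or Inngest API, and `ApprovalStore` and its method names are hypothetical. Pass a file path instead of the in-memory default and the waiting state survives a restart, which is what lets an approval take ten days.

```python
import json
import sqlite3
import uuid
from typing import Optional


class ApprovalStore:
    """Holds proposed actions until a human decides on them."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pending ("
            "id TEXT PRIMARY KEY, payload TEXT NOT NULL, status TEXT NOT NULL)"
        )
        self.db.commit()

    def pause(self, payload: dict) -> str:
        """Persist the proposed action and return an approval ID."""
        approval_id = uuid.uuid4().hex
        self.db.execute(
            "INSERT INTO pending VALUES (?, ?, 'waiting')",
            (approval_id, json.dumps(payload)),
        )
        self.db.commit()
        return approval_id

    def resume(self, approval_id: str, decision: str) -> Optional[dict]:
        """Record the decision once. Return the payload only if approved."""
        if decision not in ("approved", "rejected"):
            raise ValueError(f"unknown decision: {decision!r}")
        # Only a row that is still waiting can be decided, so a repeated
        # click or a replayed event cannot execute the action twice.
        cur = self.db.execute(
            "UPDATE pending SET status = ? WHERE id = ? AND status = 'waiting'",
            (decision, approval_id),
        )
        self.db.commit()
        if cur.rowcount != 1:
            raise KeyError("unknown or already decided approval")
        if decision == "rejected":
            return None
        row = self.db.execute(
            "SELECT payload FROM pending WHERE id = ?", (approval_id,)
        ).fetchone()
        return json.loads(row[0])
```

The caller executes the action only when `resume` hands back a payload. A rejection returns nothing, so there is nothing to run.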

04

Wire the audit log

Inputs, model, prompt version, AI proposal, reviewer ID, decision, timestamp. Searchable, exportable, retained on your terms. The thing you hand the ICO or your insurer when they ask.
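
As a rough illustration, one audit entry might look like the record below, with the seven fields listed above and nothing else. The class name, the helper functions and the JSON Lines export format are assumptions for the sketch. Storage, retention and access control depend on your stack and your terms.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass(frozen=True)
class AuditRecord:
    inputs: Dict[str, Any]  # what the AI was given
    model: str              # model identifier
    prompt_version: str
    proposal: str           # what the AI proposed
    reviewer_id: str        # the named person on record
    decision: str           # "approve", "edit" or "reject"
    timestamp: str          # ISO 8601, UTC


def make_record(inputs: Dict[str, Any], model: str, prompt_version: str,
                proposal: str, reviewer_id: str, decision: str) -> AuditRecord:
    """Stamp the record with the current UTC time at the moment of decision."""
    now = datetime.now(timezone.utc).isoformat()
    return AuditRecord(inputs, model, prompt_version, proposal, reviewer_id, decision, now)


def to_jsonl(records: List[AuditRecord]) -> str:
    """Exportable: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


def search(records: List[AuditRecord], **filters: Any) -> List[AuditRecord]:
    """Searchable: exact match on any field, e.g. reviewer_id="u-17"."""
    return [r for r in records if all(getattr(r, k) == v for k, v in filters.items())]
```

The record is frozen on purpose. An audit entry that can be edited after the fact answers a different question from the one the regulator is asking.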

05

Watch and tune

Override rates, edit rates, queue depths, time-to-decision. Where humans always approve, we raise the auto threshold. Where they always edit, we tighten the prompt, knowledge base or policy. The split moves over time, on purpose.
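
A small sketch of how those numbers could be computed for one workflow step. The definitions are our assumptions for illustration: override rate is taken here as the share of proposals rejected, and "always" is set at 95 percent. The names `Decision`, `review_metrics` and `tuning_hint` are hypothetical.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass(frozen=True)
class Decision:
    outcome: str              # "approve", "edit" or "reject"
    seconds_to_decide: float  # time from queueing to decision


def review_metrics(decisions: List[Decision]) -> Dict[str, float]:
    """Summarise reviewer behaviour for one workflow step."""
    total = len(decisions)
    if total == 0:
        raise ValueError("no decisions to measure")
    approvals = sum(1 for d in decisions if d.outcome == "approve")
    edits = sum(1 for d in decisions if d.outcome == "edit")
    rejects = sum(1 for d in decisions if d.outcome == "reject")
    return {
        "approve_rate": approvals / total,
        "edit_rate": edits / total,
        "override_rate": rejects / total,  # assumption: override means reject
        "median_seconds": median(d.seconds_to_decide for d in decisions),
    }


def tuning_hint(metrics: Dict[str, float], always: float = 0.95) -> str:
    """Turn the metrics into a suggestion for a human to weigh up."""
    if metrics["approve_rate"] >= always:
        return "send more of this step to auto"
    if metrics["edit_rate"] >= always:
        return "tighten the prompt, knowledge base or policy"
    return "leave the gate where it is"
```

The hint is a prompt for a conversation, not an automatic change. Moving a step from queue to auto is itself a decision that wants a named owner.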

[IN THE WILD]

The cost of no gate.

Four publicly reported incidents. Same failure mode each time: the system spoke or acted before a human checked the risky part.

JUL 2025

Replit deletes prod.

During a vibe-coding experiment, Replit's agent ran destructive commands inside a declared code freeze, deleted records on 1,206 executives, then fabricated test data and claimed rollback was impossible. Replit's CEO apologised and said planning-only mode and dev/prod database separation were being added.

APR 2025

Cursor invents a policy.

An AI support bot replying under the name "Sam" told customers there was a one-device-per-subscription rule. There wasn't. Public backlash, cancellations and a cofounder apology followed.

FEB 2024

Air Canada loses in tribunal.

The chatbot gave wrong bereavement-fare refund advice. Air Canada argued, in effect, that the chatbot was a separate legal entity responsible for its own actions. The tribunal rejected that argument and awarded damages for negligent misrepresentation. Small money, useful warning.

JAN 2024

DPD's bot swears at a customer.

After a system update, DPD's support bot criticised DPD and swore when prompted by a frustrated user. TIME reported 1.3 million views on the X post before DPD disabled the AI element.

Sources: The Register, Fortune, Ars Technica, British Columbia Civil Resolution Tribunal (Moffatt v Air Canada, 2024 BCCRT 149), The Guardian, TIME.

[QUESTIONS]

The ones we get asked first.

Q.01

Doesn't this just put the human back in the bottleneck?

Only if the gate is in the wrong place. Most of the work the AI does is fine to release without review, so the queue stays short. The point is to gate the risky part, not gate everything. We measure queue depth and time-to-decision from week one and tune until the human is doing roughly the work you wanted them to.

Q.02

What about rubber-stamping? People just click approve.

It's the most common failure of HITL, and it's a design problem. We surface the inputs the AI used, the alternatives it considered, and the confidence score, so the reviewer can actually see what they're approving. We track edit rate and override rate. If a reviewer's edit rate sits at zero, that's a signal, not a win. Article 22A treats a decision as solely automated when there's no meaningful human involvement. Clicking approve without reading probably isn't meaningful.

Q.03

Do we have to use a particular stack?

No. We've wired gates into LangGraph agents, Temporal workflows, Inngest jobs, n8n flows, plain Laravel queues and bespoke Node services. The pattern is the same: durable pause, approval event, resume. We pick the lightest tool that gets you a real audit log and a real "stop" button.

Q.04

Where does the human actually approve things?

Wherever the team already lives. Slack for fast lanes, email for asynchronous approvals, a proper dashboard for higher-volume queues. We usually start in Slack to get rolling, then move to a dedicated queue once the volume justifies it.

Q.05

We're a UK SME, do we really need to care about the EU AI Act?

If you sell into the EU, or your AI output is used in the EU, you need to check whether the Act applies. Article 14 requires high-risk systems to be designed for human oversight and names automation bias as a risk to be designed against. The UK doesn't have an equivalent AI Act, but the Data (Use and Access) Act 2025 rewrote the UK GDPR rules for solely automated decisions with legal or similarly serious effects. Designing for oversight up front is cheaper than retrofitting it under pressure.

Q.06

Can the AI just learn from the human approvals?

Yes, that's the whole point. Every approval, edit and rejection is training data and a signal. Where edits cluster, the prompt or the knowledge base is wrong. Where rejections cluster, the policy is wrong. The gate isn't permanent for every step; it's the safest place to learn from in production.

Q.07

How long does it take?

Mapping and gate design are quick. The first workflow, live with a real approval queue, audit log and confidence routing, usually lands in days. After that we iterate on thresholds and queues weekly.

Q.08

How much does it cost?

Fixed price per phase, agreed before we start. Mapping is its own short phase. The first gated workflow with audit and queue is scoped against the map. We tell you the number up front and bill against it.

Vu Agency working session

Build the gate before you need it.

Send us the workflow. We'll walk it end to end and mark where the gates go. Thirty minutes, no slides.
