[PRO SERVICES / BUILD]

Human-in-the-Loop Automation

AI can prepare work and recommend an action without owning every decision. We define where staff approve, what evidence they see, how exceptions escalate and how the business can reconstruct what happened later.

BOOK A DESIGN CALL SEE THE ANATOMY

Approve. Edit. Reject. Audit.

Gate

ROUTINE WORK PASSES. RISKY WORK PAUSES.

Art. 22A

UK GDPR, IN FORCE 5 FEB 2026

Human gate

ON HIGH-IMPACT ACTIONS

[THE OPERATING PROBLEM]

The risk starts when the system can act

The public failures keep landing in the same place: an AI system acts, the action reaches the outside world, and the company explains it afterwards.

Replit's coding agent deleted a production database during a July 2025 experiment, then fabricated test data and claimed rollback was impossible. Cursor's support bot invented a login policy in April 2025 and triggered cancellations. Air Canada's chatbot gave wrong bereavement-fare advice and the British Columbia Civil Resolution Tribunal made the airline pay.

In each case the expensive step needed a pause, a named reviewer, and a record of the decision.

FULLY AUTONOMOUS AGENT

Acts first, you find out after
One bad prompt deletes real data
No record of what it did or why
Hallucinated policy lands in a customer inbox
Regulator asks who decided. Nobody did.

HUMAN-IN-THE-LOOP

AI proposes. The system pauses.
Destructive moves need a real approval
Every decision logged with the inputs
Reviewer sees the draft before the customer does
A named person is on record as the decider

[THE ANATOMY]

How the approval gate works

A sales reply, refund, hiring decision and database migration need different approval rules. Each workflow makes the proposal, routing decision, human action and final execution visible.

Trigger

A customer email, a webhook, a row added to a table, a Slack message. The event that starts the workflow.

AI proposes

A draft action, its reasoning, a confidence score, and the inputs it used. At this point the system has written a proposal. It has not run the action.

Routing rule

High confidence and low blast radius? Auto-execute. Anything else gets queued. Truly weird gets escalated to a named person.
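A minimal sketch of a routing rule like this, in Python. The field names, the thresholds and the reading of "truly weird" as very low confidence are all assumptions made for illustration; real values come out of the gate design for each workflow.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO = "auto-execute"
    QUEUE = "queue for review"
    ESCALATE = "escalate to a named person"


@dataclass(frozen=True)
class Proposal:
    action: str          # what the AI wants to do, e.g. "tag_ticket"
    reasoning: str       # the model's stated justification
    confidence: float    # 0.0 to 1.0, as reported by the proposing step
    inputs: tuple        # identifiers of the records or messages it used
    blast_radius: int    # hypothetical count of customers or rows affected


# Illustrative thresholds only (assumptions, not recommended values).
AUTO_MIN_CONFIDENCE = 0.90
AUTO_MAX_BLAST_RADIUS = 1
ESCALATE_BELOW_CONFIDENCE = 0.40


def route(p: Proposal) -> Route:
    """Decide whether a proposal runs, waits for review, or escalates."""
    # Assumption: "truly weird" is approximated here by very low confidence.
    if p.confidence < ESCALATE_BELOW_CONFIDENCE:
        return Route.ESCALATE
    # High confidence and low blast radius: safe to run without a person.
    if p.confidence >= AUTO_MIN_CONFIDENCE and p.blast_radius <= AUTO_MAX_BLAST_RADIUS:
        return Route.AUTO
    # Anything else waits in the review queue.
    return Route.QUEUE
```

The order of the checks matters: escalation is tested first, so a very uncertain proposal can never fall through to auto-execution.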

Human reviews

Sees the proposal, the inputs, the diff, the reasoning. Can approve, edit then approve, or reject. Not a single "yes" button on its own.
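The three reviewer actions can be sketched as a small function that returns only what has been cleared to run. The names `Review` and `resolve` are hypothetical, and the sketch assumes the proposal is a text draft.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Review:
    reviewer_id: str                    # the named person on record
    decision: str                       # "approve", "edit" or "reject"
    edited_draft: Optional[str] = None  # required when decision is "edit"


def resolve(draft: str, review: Review) -> Optional[str]:
    """Return the text cleared for execution, or None if nothing may run."""
    if not review.reviewer_id:
        raise ValueError("a named reviewer is required")
    if review.decision == "approve":
        return draft
    if review.decision == "edit":
        # Edit-then-approve: the reviewer's version replaces the AI draft.
        if not review.edited_draft:
            raise ValueError("an edit must include the edited draft")
        return review.edited_draft
    if review.decision == "reject":
        return None
    raise ValueError(f"unknown decision: {review.decision!r}")
```

Treating an unknown decision as an error, not as a silent approval, keeps the gate closed by default.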

Execute & audit

Only after approval. Reversible where possible. Full log of inputs, model, prompt version, reviewer ID and timestamp. The bit you need when the ICO, your insurer or your board asks what happened.

[WHERE THE GATE GOES]

Where a human should stay involved

Human-in-the-loop (HITL) review on every step is slow software with an extra payroll line. The skill is knowing where to put the gate. We use four tests.

Is it reversible?

Drafting a reply is reversible. Sending it isn't. Tagging a record is reversible. Deleting it isn't. The gate goes between the reversible and the irreversible step, every time.

What's the blast radius?

One customer or every customer? One row or the whole table? A draft Slack message or an outbound payment? Bigger blast, harder gate.

Is the regulator interested?

Credit, hiring, insurance, medical, education, anything that materially affects a person's life. UK GDPR Articles 22A to 22D set rules for solely automated decisions with legal or similarly serious effects. EU AI Act Article 14 requires human oversight for high-risk AI systems. A rubber-stamp queue won't do the job.

Would a screenshot end up on X?

Anything a customer can quote back to you in public. Outbound emails, refunds, account changes, anything the brand's voice attaches to. Gate first, send second.
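One way the four tests could be turned into a gate decision per step is sketched below. The mapping from test results to "auto", "queue" or "escalate", and the `wide_blast` cut-off of 100, are assumptions for illustration; in practice the mapping is agreed per workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    reversible: bool         # test 1: can it be undone?
    blast_radius: int        # test 2: customers or rows touched (hypothetical unit)
    regulated: bool          # test 3: materially affects a person's life
    customer_visible: bool   # test 4: could be quoted back in public


def gate_for(step: Step, wide_blast: int = 100) -> str:
    """Map the four tests to 'auto', 'queue' or 'escalate' (illustrative rules)."""
    if step.regulated:
        return "escalate"
    if not step.reversible and step.blast_radius >= wide_blast:
        return "escalate"
    if not step.reversible or step.customer_visible or step.blast_radius >= wide_blast:
        return "queue"
    return "auto"


draft = Step("draft reply", True, 1, False, False)
send = Step("send reply", False, 1, False, True)
drop = Step("drop table", False, 10_000, False, False)
```

With these rules, drafting a reply passes, sending it queues, and dropping a table escalates.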

[HOW WE WORK]

How we take it into live use

We build the gate inside your agent or workflow. Approval queues, confidence routing, audit logs, the bits the regulator asks for and the bits that stop you ringing your lawyer at 11pm.

If you've already got an agent in production, we wrap it. If you don't yet, we design it gate-first so you never have to retrofit it.

Map the workflow

We walk the actual job end to end. Every action the AI is about to take, every input it uses, every record it touches. We mark each step on the four tests: reversible, blast radius, regulated, customer-visible. Scope and price up front.

Design the gates

Per step: auto, queue, or escalate. Confidence thresholds with real numbers. Service levels on the human step so the queue doesn't become the bottleneck. Named owners. A written policy your team can read.
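A written gate policy can also live as data the system checks. The sketch below is one possible shape; the field names, step names, owners and numbers are hypothetical placeholders.

```python
from dataclasses import dataclass

MODES = {"auto", "queue", "escalate"}


@dataclass(frozen=True)
class GatePolicy:
    step: str
    mode: str                 # "auto", "queue" or "escalate"
    min_confidence: float     # confidence needed before the step may auto-execute
    review_sla_minutes: int   # service level for the human step
    owner: str                # named person accountable for this gate

    def __post_init__(self):
        # Reject incomplete policies at load time, not at decision time.
        if self.mode not in MODES:
            raise ValueError(f"unknown mode: {self.mode!r}")
        if not 0.0 <= self.min_confidence <= 1.0:
            raise ValueError("min_confidence must be between 0 and 1")
        if self.review_sla_minutes <= 0:
            raise ValueError("review_sla_minutes must be positive")
        if not self.owner.strip():
            raise ValueError("every gate needs a named owner")


def overdue(policy: GatePolicy, minutes_waiting: int) -> bool:
    """True when an item has waited longer than the agreed service level."""
    return minutes_waiting > policy.review_sla_minutes


# Example values only.
POLICIES = [
    GatePolicy("draft_reply", "auto", 0.90, 60, "A. Support-Lead"),
    GatePolicy("send_refund", "queue", 0.95, 30, "B. Finance-Owner"),
    GatePolicy("database_migration", "escalate", 1.0, 240, "C. Platform-Owner"),
]
```

Validating on construction means a gate with no owner or no service level cannot be deployed by accident.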

Build it in

Approval queues your team works through in Slack, email or a dedicated dashboard. Pause and resume on durable workflows so an approval can take ten seconds or ten days. LangGraph interrupts, Temporal signals, Inngest steps, whatever fits your stack. Reviewer sees the inputs, the proposal, the diff, the confidence.
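The pause-and-resume idea can be shown without any particular framework. The sketch below persists the pending proposal to a JSON file so the approval survives a restart; `ApprovalStore` and its methods are hypothetical names, and a production build would use the workflow engine's own durable state instead of local files.

```python
import json
import uuid
from pathlib import Path


class ApprovalStore:
    """File-backed store so a paused step survives a process restart."""

    def __init__(self, directory: Path):
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, ticket: str) -> Path:
        return self.directory / f"{ticket}.json"

    def load(self, ticket: str) -> dict:
        return json.loads(self._path(ticket).read_text())

    def _save(self, ticket: str, state: dict) -> None:
        self._path(ticket).write_text(json.dumps(state))

    def pause(self, proposal: dict) -> str:
        """Record the proposal as pending and hand back a ticket."""
        ticket = uuid.uuid4().hex
        self._save(ticket, {"proposal": proposal, "status": "pending", "reviewer_id": None})
        return ticket

    def decide(self, ticket: str, reviewer_id: str, approved: bool) -> None:
        """Store the reviewer's decision; a ticket can only be decided once."""
        state = self.load(ticket)
        if state["status"] != "pending":
            raise ValueError("ticket already decided")
        state["status"] = "approved" if approved else "rejected"
        state["reviewer_id"] = reviewer_id
        self._save(ticket, state)

    def resume(self, ticket: str, execute):
        """Run the action only if approved, and never more than once."""
        state = self.load(ticket)
        if state["status"] != "approved":
            return None
        result = execute(state["proposal"])
        state["status"] = "executed"
        self._save(ticket, state)
        return result
```

Because the state lives outside the process, the gap between `pause` and `resume` can be ten seconds or ten days. This sketch marks the ticket as executed only after the action returns, so a crash mid-action would need handling that is left out here.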

Wire the audit log

Inputs, model, prompt version, AI proposal, reviewer ID, decision, timestamp. Searchable, exportable, retained on your terms. The thing you hand the ICO or your insurer when they ask.
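As a rough illustration, the audit record can be an append-only line of JSON per decision. The field names follow the list above; the JSON Lines format and the helper names `write_record` and `search` are assumptions, not a fixed design.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from io import StringIO
from typing import TextIO


@dataclass(frozen=True)
class AuditRecord:
    inputs: list
    model: str
    prompt_version: str
    proposal: str
    reviewer_id: str
    decision: str
    timestamp: str   # ISO 8601, UTC


def write_record(log: TextIO, **fields) -> AuditRecord:
    """Append one decision to the log as a single JSON line."""
    record = AuditRecord(timestamp=datetime.now(timezone.utc).isoformat(), **fields)
    log.write(json.dumps(asdict(record), sort_keys=True) + "\n")
    return record


def search(log_text: str, **criteria) -> list:
    """Return every record whose fields match all the given criteria."""
    rows = [json.loads(line) for line in log_text.splitlines() if line.strip()]
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]


# Usage with an in-memory log; a real log would be a file or a database table.
log = StringIO()
write_record(log, inputs=["email-101"], model="example-model", prompt_version="v3",
             proposal="Refund 40 GBP", reviewer_id="reviewer-17", decision="approve")
write_record(log, inputs=["email-102"], model="example-model", prompt_version="v3",
             proposal="Close account", reviewer_id="reviewer-22", decision="reject")
approved_by_17 = search(log.getvalue(), reviewer_id="reviewer-17", decision="approve")
```

One line per decision keeps the log easy to export and to search by reviewer, decision or prompt version.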

Watch and tune

Override rates, edit rates, queue depths, time-to-decision. Where humans always approve, we let more of that work auto-execute. Where they always edit, we tighten the prompt, knowledge base or policy. The split moves over time, on purpose.
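A small sketch of how these measures might be computed from review outcomes. It assumes an "override" means a rejection, and the cut-offs (98% approvals, 50% edits, a minimum sample of 50) are placeholders, not recommendations.

```python
from collections import Counter
from statistics import median


def review_metrics(reviews):
    """reviews: list of (decision, seconds_to_decide) tuples."""
    total = len(reviews)
    if total == 0:
        raise ValueError("no reviews to measure")
    counts = Counter(decision for decision, _ in reviews)
    return {
        "approve_rate": counts["approve"] / total,
        "edit_rate": counts["edit"] / total,
        "override_rate": counts["reject"] / total,   # assumption: override = reject
        "median_seconds": median(seconds for _, seconds in reviews),
    }


def recommend(metrics, sample_size, min_sample=50):
    """Suggest a tuning move; a person still decides whether to make it."""
    if sample_size < min_sample:
        return "keep gathering evidence"
    if metrics["approve_rate"] >= 0.98:
        return "widen auto-execution"
    if metrics["edit_rate"] >= 0.50:
        return "tighten the prompt, knowledge base or policy"
    return "leave the gate where it is"
```

The minimum sample guard is there so the gate does not move on the strength of a handful of approvals.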

[PUBLISHED EXAMPLES]

Failures caused by missing approval gates

Four publicly reported incidents. Same failure mode each time: the system spoke or acted before a human checked the risky part.

JUL 2025

Replit deletes prod.

During a vibe-coding experiment, Replit's agent ran destructive commands inside a declared code freeze, deleted records on 1,206 executives, then fabricated test data and claimed rollback was impossible. Replit's CEO apologised and said planning-only mode and dev/prod database separation were being added.

APR 2025

Cursor invents a policy.

An AI support agent replying under the name "Sam" told customers there was a one-device-per-subscription rule. There wasn't. Public backlash, cancellations and a cofounder apology followed.

FEB 2024

Air Canada loses in tribunal.

The chatbot gave wrong bereavement-fare refund advice. Air Canada argued, in effect, that the chatbot was a separate legal entity responsible for its own words. The tribunal rejected that argument and awarded damages for negligent misrepresentation. Small money, useful warning.

JAN 2024

DPD's bot swears at a customer.

After a system update, DPD's support bot criticised DPD and swore when prompted by a frustrated user. TIME reported 1.3 million views on the X post before DPD disabled the AI element.

Sources: The Register, Fortune, Ars Technica, British Columbia Civil Resolution Tribunal (Moffatt v Air Canada, 2024 BCCRT 149), The Guardian, TIME.

[RELEVANT VU WORK]

Human review is part of the product

ClimateEQ uses AI to score carbon-literacy pledges and draft feedback, while reviewers can adjust the result before it reaches the learner. Crystal gives compliance reviewers the findings and evidence needed to make the final decision.

SEE OUR WORK

[A USEFUL FIRST CONVERSATION]

When this is worth discussing

We work best when there is a real operating problem, enough volume to measure and people from the affected teams who can make decisions.

Usually a good fit

An established UK business, usually with annual revenue above £10m
A repeated process with a known cost, delay, error rate or capacity problem
A senior sponsor and a day-to-day owner who understand the work
Access to the relevant staff, systems, sample records and security requirements

We may point you elsewhere

A standard product already covers the process well
The requirement is a one-off small build with no wider operating case
There is no owner or access to the people and data needed to test the result
The plan relies on AI making high-impact decisions with nobody responsible for review

[QUESTIONS]

Questions before connecting the systems

Q.01

Doesn't this just put the human back in the bottleneck?

It can if the review point is too broad or poorly staffed. We classify actions by impact, confidence and reversibility, then measure queue depth and time to decision. Low-impact work may move to sampling once the evidence supports it, while higher-impact actions keep the agreed review.

Q.02

What about rubber-stamping? People just click approve.

It's the most common failure of HITL, and it's a design problem. We surface the inputs the AI used, the alternatives it considered, and the confidence score, so the reviewer can see what they're approving. We track edit rate and override rate. A reviewer who never edits anything may be approving without enough scrutiny. Article 22A treats a decision as solely automated when there's no meaningful human involvement. Clicking approve without reading probably isn't meaningful.

Q.03

Do we have to use a particular stack?

No. The pattern can sit inside an agent framework, workflow engine, automation platform or standard application queue. We choose the least complex option that provides a durable pause, approval record, controlled resume and an effective stop mechanism.

Q.04

Where does the human approve things?

Wherever the team already lives. Slack for fast lanes, email for asynchronous approvals, a proper dashboard for higher-volume queues. We usually start in Slack to get rolling, then move to a dedicated queue once the volume justifies it.

Q.05

Do we need to consider the EU AI Act?

If you sell into the EU, or your AI output is used in the EU, you need to check whether the Act applies. Article 14 requires high-risk systems to be designed for human oversight and names automation bias as a risk to be designed against. The UK doesn't have an equivalent AI Act, but the Data (Use and Access) Act 2025 rewrote the UK GDPR rules for solely automated decisions with legal or similarly serious effects. Designing for oversight up front is cheaper than retrofitting it under pressure.

Q.06

Can the AI just learn from the human approvals?

Yes. Every approval, edit and rejection gives you useful evidence. A cluster of edits points to a prompt or knowledge-base problem; a cluster of rejections points to a policy problem. The review gate can move once the evidence shows which steps are safe to automate.

Q.07

How long does it take?

The timetable depends on the systems, decision risk and evidence the reviewer needs. We agree the gate design first, test it against representative cases and run alongside the current process before increasing automation.

Q.08

How much does it cost?

Fixed price per phase, agreed before we start. Mapping is its own short phase. The first gated workflow with audit and queue is scoped against the map. We tell you the number up front and bill against it.

Talk to us about the approval point

Send us the workflow and the decisions it can trigger. We will mark the approval points, evidence requirements, escalation path and actions the system should never take alone.

BOOK A DESIGN CALL SEE ALL SERVICES
