Protocol 17, AI testing vs human testing

Why this matters

The pain it solves

"AI will test the code" is one of the most over-claimed promises in the current AI hype cycle. The truth is more useful and less exciting: AI can test some things very well, cannot test other things at all, and lying to yourself about the difference will eventually ship a bug your customers find before you do.

This protocol gives you a clean line between what to automate and what to keep human. The line is not philosophical. It is operational and you will use it every day.

The teaching

What this actually is

The honest line on what AI can test

Three categories of test

Most thinking about testing blurs these three categories. Once you separate them, every test you ever write fits cleanly into one bucket.

Correctness, AI is excellent

Deterministic checks. Does the function return the right value? Does the API respond with the right shape? Did the migration apply cleanly? AI writes these, runs them, and catches regressions. Unit tests, integration tests, type checks. Automate everything in this category.

UX, taste, and judgment, humans only

Does the page feel right? Is the copy in the right voice? Is the loading state confusing? These depend on the goal of the experience and the taste of the team. AI can spot violations of a design system you defined; it cannot tell you the system itself is wrong.

End-to-end flow, collaborative

Can a real user complete the task? The AI assists: it simulates clicks, checks that pages load, and captures screenshots. The human delivers the verdict. The agent drives the browser with a tool like Playwright; the human asks, "Does this feel like the journey we wanted?"
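As a minimal sketch of the correctness category, the check below asks the "right shape" question of an API response. The LeadResponse type, its fields, and isLeadResponse are hypothetical names chosen for illustration; they do not come from any real API.

```typescript
// Hypothetical response shape for a lead-capture endpoint (an assumption for illustration).
interface LeadResponse {
  id: string;
  email: string;
  createdAt: string; // expected to be a parseable timestamp, e.g. ISO 8601
}

// Type guard: true only when the value has the expected fields with the expected types.
function isLeadResponse(value: unknown): value is LeadResponse {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const createdAt = v["createdAt"];
  return (
    typeof v["id"] === "string" &&
    typeof v["email"] === "string" &&
    typeof createdAt === "string" &&
    !Number.isNaN(Date.parse(createdAt))
  );
}
```

A check like this has one right answer for any input, which is what makes it safe to hand to an agent. Nothing comparable can be written for "is the copy in the right voice".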

A rule of thumb that holds 95% of the time

If the test has a deterministic right answer, an agent runs it. If it has a "well, it depends" answer, a human runs it. Memorise this. It saves the next two hundred meta-arguments about whether to automate.

The cadence

When to run which kind. Doing this consistently is more useful than any single sophisticated test.

AI on every commit

The QA agent runs the correctness suite on every push. If anything goes red, the build does not deploy.

Human on every epic close

Before an epic moves to Done in project-status.html, a human runs the e2e flow and signs off. Five minutes per epic. This catches the taste and judgment issues the AI missed.

Full regression before each launch

A planned 30-minute window before each big launch where the QA agent runs everything and a human walks the entire flow. Both kinds. Same hour. Same room.
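One way to write the cadence down is as a small table plus a gate. This is a sketch under stated assumptions: the Trigger names, the cadence table, and canProceed are hypothetical, and a real pipeline would read these results from your CI system and your sign-off record.

```typescript
// The three triggers from the cadence above (names are hypothetical).
type Trigger = "commit" | "epic-close" | "launch";

interface Plan {
  agentSuite: boolean; // the QA agent runs the automated correctness suite
  humanPass: boolean;  // a human walks the end-to-end flow and signs off
}

const cadence: Record<Trigger, Plan> = {
  "commit":     { agentSuite: true,  humanPass: false },
  "epic-close": { agentSuite: false, humanPass: true },
  "launch":     { agentSuite: true,  humanPass: true },
};

interface Results {
  failedTests: number;     // red tests reported by the agent's suite
  humanSignedOff: boolean; // true once a human has approved the flow
}

// Gate: every check the plan requires must be green before the work moves forward.
function canProceed(trigger: Trigger, results: Results): boolean {
  const plan = cadence[trigger];
  if (plan.agentSuite && results.failedTests > 0) return false;
  if (plan.humanPass && !results.humanSignedOff) return false;
  return true;
}
```

Keeping the table as data means the cadence stays readable in one place: changing what a launch requires is a one-line edit rather than a new argument.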

Try it yourself (30 minutes)

Compare AI and human testing on one feature in 30 minutes

Pick a feature you already shipped (lead capture from Protocol 07 works well). You will write an AI test suite, run a human pass, and list three things AI missed.

Step 01

Have the QA agent write a test suite

In Claude Code: "You are the QA agent. Write Playwright tests for the lead capture form: it renders, accepts valid input, rejects invalid email, inserts a row into Supabase on submit, shows the thank-you message." Let it write 5 to 10 tests.

Step 02

Run the AI suite

Run npx playwright test and watch the tests run. Fix anything red. Once everything passes, save the run as the baseline.

Step 03

Do a human pass

Open your live site. Submit a form as if you were a real lead. Pay attention. Is the focus right? Is the spacing weird on mobile? Is the thank-you copy in your voice? Is the loading state confusing?

Step 04

List three things AI missed

Write them down. "AI did not flag that the submit button lacks a loading state." "AI did not flag that the thank-you copy is generic." "AI did not flag that on mobile the form overflows."

Step 05

Decide which ones become tests

Some belong in the AI suite (the mobile overflow can become an automated visual regression). Some stay human-only (voice). Note which is which in your QA plan.
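The suite from Step 01 and the mobile check from Step 05 might look roughly like the sketch below. It assumes Playwright Test is installed and that baseURL is set in your Playwright config. The /contact path, the "Email" label, the "Submit" button name, the "Thank you" text, and the screenshot file name are all placeholders, so swap in whatever your form actually uses. The Supabase insert check is left out because verifying it needs your project's credentials.

```typescript
import { test, expect } from "@playwright/test";

// Placeholder route: assumes `baseURL` is set in playwright.config.ts.
const FORM_PATH = "/contact";

test.beforeEach(async ({ page }) => {
  await page.goto(FORM_PATH);
});

test("form renders", async ({ page }) => {
  await expect(page.getByLabel("Email")).toBeVisible();
  await expect(page.getByRole("button", { name: "Submit" })).toBeVisible();
});

test("valid input shows the thank-you message", async ({ page }) => {
  await page.getByLabel("Email").fill("lead@example.com");
  await page.getByRole("button", { name: "Submit" }).click();
  await expect(page.getByText("Thank you")).toBeVisible();
});

test("invalid email is rejected", async ({ page }) => {
  await page.getByLabel("Email").fill("not-an-email");
  await page.getByRole("button", { name: "Submit" }).click();
  // The thank-you message must not appear for bad input.
  await expect(page.getByText("Thank you")).toBeHidden();
});

// Step 05: the mobile overflow a human spotted, turned into an automated check.
test("form does not overflow on a narrow viewport", async ({ page }) => {
  await page.setViewportSize({ width: 375, height: 667 });
  const overflows = await page.evaluate(() => {
    const root = document.documentElement;
    return root.scrollWidth > root.clientWidth;
  });
  expect(overflows).toBe(false);
  // Compares against a stored baseline screenshot; the baseline is written when none exists yet.
  await expect(page).toHaveScreenshot("lead-form-mobile.png");
});
```

Run it with npx playwright test, as in Step 02. Note what is absent: nothing here can judge whether the thank-you copy sounds like you, which is why that item stays on the human list.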

Outcome

A working test suite that runs on every commit, a fresh sense of what AI catches and misses, and a written cadence: AI on commit, human on epic close, both on launch.

Official resources

Straight from the source

Playwright doc

Playwright getting started

Google article

Google testing blog, the test pyramid

Vitest doc

Vitest, fast unit tests

What you walk out with

By the end of this protocol

01 A working test suite that the QA agent runs autonomously on every commit
02 A human pass run on a real feature, with three things AI missed listed in plain sight
03 Your testing cadence written down: AI on commit, human on epic close, full regression before each launch

At the retreat

You learn it by doing it

You watch the QA agent generate and run a test suite for your contact form in five minutes. Then you do the human pass and find three things the AI suite did not catch.

Saigon, Jun 19 to 21