Where AI can self-test (correctness, regressions, completeness) and where humans still must (UX, taste, judgment calls). Knowing which is which saves hours.
"AI will test the code" is one of the most over-claimed promises in the current AI hype cycle. The truth is more useful and less exciting: AI can test some things very well, cannot test other things at all, and lying to yourself about the difference will eventually ship a bug your customers find before you do.
This protocol gives you a clean line between what to automate and what to keep human. The line is operational. You will use it every day.
Most testing thinking confuses these three. Once you separate them, every test you ever write fits cleanly into one bucket.
If the test has a deterministic right answer, an agent runs it. If it has a well-it-depends answer, a human runs it. Memorise this. It saves the next two hundred meta-arguments about whether to automate.
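The rule above can be written down as a one-line routing function. This is only a sketch: the `Check` type, the `deterministic` flag, and the sample check names are assumptions made for illustration, and deciding whether a check really is deterministic remains your call.

```typescript
// Hypothetical types and names; the sample checks echo examples from this protocol.
type Runner = "agent" | "human";

interface Check {
  name: string;
  // True when the check has one right answer that a script can assert.
  deterministic: boolean;
}

// Deterministic checks go to the agent; "it depends" checks go to a human.
function assignRunner(check: Check): Runner {
  return check.deterministic ? "agent" : "human";
}

const checks: Check[] = [
  { name: "rejects invalid email", deterministic: true },
  { name: "inserts a row on submit", deterministic: true },
  { name: "thank-you copy is in your voice", deterministic: false },
];

for (const check of checks) {
  console.log(`${check.name}: ${assignRunner(check)}`);
}
```

Keeping the flag explicit in your QA plan means the "should we automate this?" question is answered once per check, not once per meeting.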
When to run which kind. Doing this consistently is more useful than any single sophisticated test.
Pick a feature you already shipped (lead capture from Protocol 07 works well). You will write an AI test suite, run a human pass, and list three things AI missed.
In Claude Code: "You are the QA agent. Write Playwright tests for the lead capture form: it renders, accepts valid input, rejects an invalid email, inserts a row into Supabase on submit, and shows the thank-you message." Let it write 5 to 10 tests.
Run npx playwright test and watch the tests run. Fix anything red. Once everything passes, save the run as the baseline.
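One way to make the saved baseline useful is to compare later runs against it. The sketch below assumes you have reduced each run to a map from test name to status (for instance by post-processing a reporter's output); the `RunResult` shape and the `findRegressions` helper are hypothetical, not part of Playwright.

```typescript
// Hypothetical shape: one entry per test, keyed by test name.
type Status = "passed" | "failed";
type RunResult = Record<string, Status>;

// A regression is a test that passed in the baseline but does not pass now
// (this includes tests that have gone missing from the current run).
function findRegressions(baseline: RunResult, current: RunResult): string[] {
  return Object.keys(baseline).filter(
    (name) => baseline[name] === "passed" && current[name] !== "passed",
  );
}

const baseline: RunResult = {
  "form renders": "passed",
  "rejects invalid email": "passed",
  "shows thank-you message": "passed",
};

const latest: RunResult = {
  "form renders": "passed",
  "rejects invalid email": "failed",
  "shows thank-you message": "passed",
};

console.log(findRegressions(baseline, latest)); // [ 'rejects invalid email' ]
```

Treating a missing test as a regression is a deliberate choice here: a test that silently disappears from the suite is as worrying as one that turns red.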
Open your live site. Submit a form as if you were a real lead. Pay attention. Is the focus right? Is the spacing weird on mobile? Is the thank-you copy in your voice? Is the loading state confusing?
Write them down. "AI did not flag that the submit button lacks a loading state." "AI did not flag that the thank-you copy is generic." "AI did not flag that on mobile the form overflows."
Some belong in the AI suite (the mobile overflow can become an automated visual regression). Some stay human-only (voice). Note which is which in your QA plan.
A working test suite that runs on every commit, a fresh sense of what AI catches and misses, and a written cadence: AI on commit, human on epic close, both on launch.
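The written cadence can live next to the code as a small lookup table. This is a sketch under assumed names: `Trigger`, `Pass`, and `passesFor` are illustrative, and how you wire the triggers into your own workflow is up to you.

```typescript
// Hypothetical names; the mapping itself is the cadence stated above.
type Trigger = "commit" | "epic-close" | "launch";
type Pass = "ai-suite" | "human-pass";

const cadence: Record<Trigger, Pass[]> = {
  commit: ["ai-suite"],                // AI on commit
  "epic-close": ["human-pass"],        // human on epic close
  launch: ["ai-suite", "human-pass"],  // both on launch
};

// Look up which passes are due for a given trigger.
function passesFor(trigger: Trigger): Pass[] {
  return cadence[trigger];
}

console.log(passesFor("launch")); // [ 'ai-suite', 'human-pass' ]
```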
You watch the QA agent generate and run a test suite for your contact form in five minutes. Then you do the human pass and find three things the AI suite did not catch.