Why Claude Code AutoFix Can’t Fix Flaky Tests
AutoFix is great at real bugs. Flaky tests break it — and the fix loop costs more than the flake did.
Anthropic shipped Claude Code AutoFix — an agent that subscribes to your PR’s GitHub events, watches CI, and pushes commits to fix failing tests and address review comments. For real bugs, it’s genuinely good. For flaky tests, it’s a footgun.
We build a tool in this exact space (Kleore quantifies and surfaces flaky-test waste), so we watched the AutoFix launch closely. Here’s the honest read on what it does, where it breaks, and why throwing an AI agent at flakiness makes the problem more expensive, not less.
What AutoFix actually does
AutoFix is a cloud-hosted Claude Code session attached to a pull request. When CI fails or a reviewer leaves a comment, it:
- Reads the failure log or comment
- Investigates the relevant code
- Pushes a commit with an explanation
- Re-runs CI and iterates
For a deterministic failure — a real null deref, a type error, a missing import — this loop converges. Test fails → agent reads stack trace → agent fixes code → test passes. Clean.
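In sketch form, the loop is short. This is a toy rendering of the control flow only, not Anthropic's implementation; every helper below is a stub:

```python
# A minimal sketch of the fix loop, with every step stubbed out.
# The helper bodies are placeholders; only the control flow mirrors the product.

def read_failure(pr_id: str) -> str:
    # Stub: in reality, fetch and parse the CI failure log for this PR.
    return "NullPointerException at UserService.java:42"

def write_and_push_fix(failure: str) -> str:
    # Stub: in reality, the agent investigates the code and pushes a commit.
    return f"fix: handle null at {failure.split(' at ')[-1]}"

def ci_passes(pr_id: str) -> bool:
    # Stub: a deterministic failure goes green once the fix lands.
    return True

def autofix(pr_id: str, max_iters: int = 5) -> bool:
    for _ in range(max_iters):
        failure = read_failure(pr_id)
        print("pushed:", write_and_push_fix(failure))
        if ci_passes(pr_id):
            return True   # converged: real bug, real fix, green CI
    return False          # never reached for a deterministic failure

autofix("pr-123")
```

For a real bug, `ci_passes` flips to true as soon as the fix is correct and the loop exits. The failure modes below all come from one broken assumption: that a red CI implies the diff is at fault.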
Flaky tests break the loop
A flaky test, by definition, fails for reasons unrelated to the code change. Race conditions. Unstable network mocks. Order-dependent fixtures. Timezone drift. The test that failed on this PR will pass on the next run with no change at all.
AutoFix doesn’t know that. It sees a red CI, assumes the diff broke something, and starts hunting. The result is one of three failure modes:
1. The speculative fix loop
AutoFix reads the stack trace, invents a plausible cause, and pushes a “fix.” The flaky test passes on the next run — not because of the fix, but because flakes pass ~70% of the time. AutoFix declares victory. You’ve now merged a code change that was triggered by randomness, not by an actual bug.
Multiply this across a quarter and your codebase fills up with cargo-cult fixes: extra awaits, defensive null checks, retries, sleep statements, narrowed test assertions. Each one looks reasonable in isolation. Together they’re the AI version of “don’t touch this, it works.”
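The arithmetic behind that accumulation is blunt. Taking the ~70% pass rate above, and assuming (purely as an illustration) 20 flaky failures hit AutoFix in a quarter:

```python
# Back-of-envelope math. The 70% pass rate comes from the paragraph above;
# the 20 failures per quarter is an illustrative assumption.
flake_pass_rate = 0.70
flaky_failures_per_quarter = 20

# A speculative no-op "fix" gets declared successful whenever the flake
# happens to pass on the very next run:
p_fix_looks_valid = flake_pass_rate

expected_cargo_cult_commits = flaky_failures_per_quarter * p_fix_looks_valid
print(expected_cargo_cult_commits)  # 14.0 merged fixes that fixed nothing
```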
2. The infinite re-run
Sometimes the dice land the other way: AutoFix pushes a speculative fix, re-runs, and the flake fails again. So it fixes harder. Re-runs. Fails. Each iteration burns tokens, CI minutes, and your patience. The blog post that introduced AutoFix flags this risk directly: agents can “enter speculative fix loops that consume resources without resolving the underlying problem.”
A single flaky test can cost an AutoFix session 10–30 LLM calls, each touching multiple files, each pushing a commit. Your token bill and your git history both look terrible.
3. The wrong file blamed
Race conditions and shared-state bugs rarely live in the file the test exercises. AutoFix looks where the stack trace points. The actual cause — a fixture in a sibling file, a global mock that leaked, a database row left by an earlier test — is two directories away. AutoFix “fixes” the wrong thing, the symptom moves, the next PR sees a new flake.
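Here's the pattern in miniature, compressed into one runnable file. In a real suite the leak and the victim live in separate files, which is exactly why the stack trace misleads:

```python
# One file standing in for three: the shared state, the test that leaks
# into it, and the test that fails depending on run order.

CACHE = {}  # module-level shared state (imagine this lives in a sibling file)

def test_user_lookup_is_cached():
    # Leaks a stale entry into the shared cache and never cleans up.
    CACHE["user:1"] = {"name": "stale"}
    assert CACHE["user:1"]["name"] == "stale"

def fetch_user(user_id):
    return CACHE.get(f"user:{user_id}") or {"name": "fresh"}

def test_order_shows_fresh_user():
    # Passes in isolation; fails whenever the cached-lookup test ran first.
    # The stack trace points here, but the bug is the leak above.
    assert fetch_user(1)["name"] == "fresh"

# Run order decides the outcome:
test_order_shows_fresh_user()      # passes
test_user_lookup_is_cached()
try:
    test_order_shows_fresh_user()  # now raises AssertionError
except AssertionError:
    print("flake reproduced: shared state, not this test's code")
```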
The cost math is bad
| Failure type | Real bug | Flaky test |
|---|---|---|
| AutoFix iterations | 1–3 | 5–30 |
| CI minutes consumed | 10–30 | 60–300 |
| Token spend per failure | $0.50–$2 | $5–$25 |
| Outcome | Bug fixed | Symptom hidden |
The unit economics flip on flaky tests. You pay 10x more, end up with worse code, and the underlying flake is still there waiting for the next PR.
The right move: classify before you fix
Every CI failure should be sorted into one of two buckets before an agent (or a human) starts fixing it:
- Real failure — this PR’s code change broke something. Send to AutoFix. It’ll do a great job.
- Flake — this test fails on unrelated PRs too. Don’t fix it on this PR. Quarantine, log, and address the root cause separately.
AutoFix has no way to make this distinction on its own. It only sees one PR. Flake detection requires looking across many PRs, over time, and asking: does this test fail on diffs that have nothing to do with it?
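A minimal version of that cross-PR check, sketched under two assumptions: you've already pulled per-test outcomes from your CI provider, and you can tell whether a PR's diff touches code related to the test.

```python
from collections import defaultdict

# Minimal cross-PR flake check. Assumes `runs` was already extracted from
# CI history as (test_name, pr_touches_related_code, passed) tuples.

def find_flakes(runs, min_unrelated_failures=3):
    unrelated_failures = defaultdict(int)
    for test_name, touches_related_code, passed in runs:
        if not passed and not touches_related_code:
            # The diff couldn't have broken this test, yet it failed.
            unrelated_failures[test_name] += 1
    return {t for t, n in unrelated_failures.items() if n >= min_unrelated_failures}

runs = [
    ("test_checkout_total", False, False),  # failed on an unrelated docs PR
    ("test_checkout_total", False, False),
    ("test_checkout_total", False, False),
    ("test_parse_config",   True,  False),  # failed on a PR that touched it: real bug
]
print(find_flakes(runs))  # {'test_checkout_total'}
```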
This is what Kleore does
Kleore connects to your GitHub repos, scans your CI history, and ranks every flaky test by frequency and dollar cost. It’s the missing classifier in front of AutoFix:
- Test fails on a PR → check Kleore → if the test is on the flake list, skip AutoFix entirely (a sketch of this gate follows the list)
- Top flakes get triaged as their own work item, not patched into random PRs
- Engineering managers see a weekly $ number and can decide whether to invest fix-time
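Wired into CI, the gate is a few lines. Everything below is illustrative: the flake-list URL, the JSON shape, and the env var are stand-ins, not Kleore's published API.

```python
import json
import os
import sys
import urllib.request

# Pre-AutoFix gate. The endpoint, payload shape, and env var are
# illustrative stand-ins, not Kleore's actual API.

FLAKE_LIST_URL = os.environ.get("FLAKE_LIST_URL", "https://example.com/flakes.json")

def is_known_flake(test_name: str) -> bool:
    with urllib.request.urlopen(FLAKE_LIST_URL) as resp:
        flakes = set(json.load(resp))  # assumed shape: ["test_checkout_total", ...]
    return test_name in flakes

failed_test = sys.argv[1]  # supplied by the CI step that parsed the failing run
if is_known_flake(failed_test):
    print(f"{failed_test} is a known flake: quarantine it, skip AutoFix")
else:
    print(f"{failed_test} looks real: hand the PR to AutoFix")
```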
AutoFix and Kleore are complements, not competitors. AutoFix needs a flake-aware front door to be safe in a real codebase. Without one, every flaky test in your suite becomes a recurring tax on your token bill.
Stop letting AutoFix burn cycles on flakes.
Install the Kleore GitHub App. Get a ranked list of every flaky test in your repos — with dollar costs attached — in two minutes. Free to start.
Scan my repos — free
FAQ: Claude Code AutoFix and flaky tests
Can Claude Code AutoFix fix flaky tests?
No. AutoFix is designed to fix deterministic failures caused by the current PR’s code change. Flaky tests fail for reasons unrelated to the diff — race conditions, shared state, network jitter — so AutoFix either invents a cargo-cult fix that “works” because the flake passed by chance, or it loops indefinitely burning tokens and CI minutes.
Why does AutoFix loop on flaky tests?
AutoFix treats every red CI as a code bug. When a flake fails, AutoFix pushes a speculative fix and re-runs. The flake fails again for unrelated reasons, AutoFix fixes harder, and the loop repeats. Each iteration consumes LLM calls, CI minutes, and adds noisy commits to your git history.
How do I stop AutoFix from wasting cycles on flaky tests?
Classify failures before AutoFix runs. If a test has historically failed on PRs unrelated to its code, treat it as a flake — quarantine and address separately. Tools like Kleore scan your CI history across many PRs to identify flaky tests and rank them by frequency and dollar cost, so AutoFix only engages on real bugs.
Are AutoFix and Kleore competitors?
No, they’re complements. AutoFix is a per-PR fixer that needs a flake-aware front door. Kleore provides cross-PR flaky test detection and dollar-cost reporting that tells AutoFix (and your engineers) when to fix and when to quarantine.
How much does running AutoFix on a flaky test cost?
A single flaky test can drive AutoFix through 5–30 iterations, consuming 60–300 CI minutes and $5–$25 in tokens per failure — roughly 10x the cost of fixing a real bug. Multiplied across a quarter, this becomes a recurring tax on your engineering budget.
Further reading
- What Are Flaky Tests? — The primer on why tests fail without code changes.
- How Much Do Flaky Tests Actually Cost? — Compute is the smallest line item.
- How to Fix Flaky Tests in GitHub Actions — Six root causes and how to fix each one.