Why Claude Code AutoFix Can’t Fix Flaky Tests
AutoFix is great at real bugs. Flaky tests break it — and the fix loop costs more than the flake did.
Anthropic shipped Claude Code AutoFix — an agent that subscribes to your PR’s GitHub events, watches CI, and pushes commits to fix failing tests and address review comments. For real bugs, it’s genuinely good. For flaky tests, it’s a footgun.
We build a tool in this exact space (Kleore quantifies and surfaces flaky-test waste), so we watched the AutoFix launch closely. Here’s the honest read on what it does, where it breaks, and why throwing an AI agent at flakiness makes the problem more expensive, not less.
What AutoFix actually does
AutoFix is a cloud-hosted Claude Code session attached to a pull request. When CI fails or a reviewer leaves a comment, it:
- Reads the failure log or comment
- Investigates the relevant code
- Pushes a commit with an explanation
- Re-runs CI and iterates
For a deterministic failure — a real null deref, a type error, a missing import — this loop converges. Test fails → agent reads stack trace → agent fixes code → test passes. Clean.
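In sketch form, the loop is short. This is a toy rendering of the control flow only, not Anthropic's implementation; every helper below is a stub:

```python
# A minimal sketch of the fix loop, with every step stubbed out.
# The helper bodies are placeholders; only the control flow mirrors the product.

def read_failure(pr_id: str) -> str:
    # Stub: in reality, fetch and parse the CI failure log for this PR.
    return "NullPointerException at UserService.java:42"

def write_and_push_fix(failure: str) -> str:
    # Stub: in reality, the agent investigates the code and pushes a commit.
    return f"fix: handle null at {failure.split(' at ')[-1]}"

def ci_passes(pr_id: str) -> bool:
    # Stub: a deterministic failure goes green once the fix lands.
    return True

def autofix(pr_id: str, max_iters: int = 5) -> bool:
    for _ in range(max_iters):
        failure = read_failure(pr_id)
        print("pushed:", write_and_push_fix(failure))
        if ci_passes(pr_id):
            return True   # converged: real bug, real fix, green CI
    return False          # never reached for a deterministic failure

autofix("pr-123")
```

For a real bug, `ci_passes` flips to true as soon as the fix is correct and the loop exits. The failure modes below all come from one broken assumption: that a red CI implies the diff is at fault.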
Flaky tests break the loop
A flaky test, by definition, fails for reasons unrelated to the code change. Race conditions. Unstable network mocks. Order-dependent fixtures. Timezone drift. The test that failed on this PR will pass on the next run with no change at all.
AutoFix doesn’t know that. It sees a red CI, assumes the diff broke something, and starts hunting. The result is one of three failure modes:
1. The speculative fix loop
AutoFix reads the stack trace, invents a plausible cause, and pushes a “fix.” The flaky test passes on the next run — not because of the fix, but because flakes pass ~70% of the time. AutoFix declares victory. You’ve now merged a code change that was triggered by randomness, not by an actual bug.
Multiply this across a quarter and your codebase fills up with cargo-cult fixes: extra awaits, defensive null checks, retries, sleep statements, narrowed test assertions. Each one looks reasonable in isolation. Together they’re the AI version of “don’t touch this, it works.”
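The arithmetic behind that accumulation is blunt. Taking the ~70% pass rate above, and assuming (purely as an illustration) 20 flaky failures hit AutoFix in a quarter:

```python
# Back-of-envelope math. The 70% pass rate comes from the paragraph above;
# the 20 failures per quarter is an illustrative assumption.
flake_pass_rate = 0.70
flaky_failures_per_quarter = 20

# A speculative no-op "fix" gets declared successful whenever the flake
# happens to pass on the very next run:
p_fix_looks_valid = flake_pass_rate

expected_cargo_cult_commits = flaky_failures_per_quarter * p_fix_looks_valid
print(expected_cargo_cult_commits)  # 14.0 merged fixes that fixed nothing
```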
2. The infinite re-run
Sometimes the dice land the other way: AutoFix pushes a speculative fix, re-runs, and the flake fails again. So it fixes harder. Re-runs. Fails. Each iteration burns tokens, CI minutes, and your patience. The blog post that introduced AutoFix flags this risk directly: agents can “enter speculative fix loops that consume resources without resolving the underlying problem.”
A single flaky test can cost an AutoFix session 10–30 LLM calls, each touching multiple files, each pushing a commit. Your token bill and your git history both look terrible.
3. The wrong file blamed
Race conditions and shared-state bugs rarely live in the file the test exercises. AutoFix looks where the stack trace points. The actual cause — a fixture in a sibling file, a global mock that leaked, a database row left by an earlier test — is two directories away. AutoFix “fixes” the wrong thing, the symptom moves, the next PR sees a new flake.
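Here's the pattern in miniature, compressed into one runnable file. In a real suite the leak and the victim live in separate files, which is exactly why the stack trace misleads:

```python
# One file standing in for three: the shared state, the test that leaks
# into it, and the test that fails depending on run order.

CACHE = {}  # module-level shared state (imagine this lives in a sibling file)

def test_user_lookup_is_cached():
    # Leaks a stale entry into the shared cache and never cleans up.
    CACHE["user:1"] = {"name": "stale"}
    assert CACHE["user:1"]["name"] == "stale"

def fetch_user(user_id):
    return CACHE.get(f"user:{user_id}") or {"name": "fresh"}

def test_order_shows_fresh_user():
    # Passes in isolation; fails whenever the cached-lookup test ran first.
    # The stack trace points here, but the bug is the leak above.
    assert fetch_user(1)["name"] == "fresh"

# Run order decides the outcome:
test_order_shows_fresh_user()      # passes
test_user_lookup_is_cached()
try:
    test_order_shows_fresh_user()  # now raises AssertionError
except AssertionError:
    print("flake reproduced: shared state, not this test's code")
```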
The cost math is bad
| Failure type | Real bug | Flaky test |
|---|---|---|
| AutoFix iterations | 1–3 | 5–30 |
| CI minutes consumed | 10–30 | 60–300 |
| Token spend per failure | $0.50–$2 | $5–$25 |
| Outcome | Bug fixed | Symptom hidden |
The unit economics flip on flaky tests. You pay 10x more, end up with worse code, and the underlying flake is still there waiting for the next PR.
The right move: classify before you fix
Every CI failure should be sorted into one of two buckets before an agent (or a human) starts fixing it:
- Real failure — this PR’s code change broke something. Send to AutoFix. It’ll do a great job.
- Flake — this test fails on unrelated PRs too. Don’t fix it on this PR. Quarantine, log, and address the root cause separately.
AutoFix has no way to make this distinction on its own. It only sees one PR. Flake detection requires looking across many PRs, over time, and asking: does this test fail on diffs that have nothing to do with it?
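A minimal version of that cross-PR check, sketched under two assumptions: you've already pulled per-test outcomes from your CI provider, and you can tell whether a PR's diff touches code related to the test.

```python
from collections import defaultdict

# Minimal cross-PR flake check. Assumes `runs` was already extracted from
# CI history as (test_name, pr_touches_related_code, passed) tuples.

def find_flakes(runs, min_unrelated_failures=3):
    unrelated_failures = defaultdict(int)
    for test_name, touches_related_code, passed in runs:
        if not passed and not touches_related_code:
            # The diff couldn't have broken this test, yet it failed.
            unrelated_failures[test_name] += 1
    return {t for t, n in unrelated_failures.items() if n >= min_unrelated_failures}

runs = [
    ("test_checkout_total", False, False),  # failed on an unrelated docs PR
    ("test_checkout_total", False, False),
    ("test_checkout_total", False, False),
    ("test_parse_config",   True,  False),  # failed on a PR that touched it: real bug
]
print(find_flakes(runs))  # {'test_checkout_total'}
```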
This is what Kleore does
Kleore connects to your GitHub repos, scans your CI history, and ranks every flaky test by frequency and dollar cost. It’s the missing classifier in front of AutoFix:
- Test fails on a PR → check Kleore → if the test is on the flake list, skip AutoFix entirely (a sketch of this gate follows the list)
- Top flakes get triaged as their own work item, not patched into random PRs
- Engineering managers see a weekly $ number and can decide whether to invest fix-time
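Wired into CI, the gate is a few lines. Everything below is illustrative: the flake-list URL, the JSON shape, and the env var are stand-ins, not Kleore's published API.

```python
import json
import os
import sys
import urllib.request

# Pre-AutoFix gate. The endpoint, payload shape, and env var are
# illustrative stand-ins, not Kleore's actual API.

FLAKE_LIST_URL = os.environ.get("FLAKE_LIST_URL", "https://example.com/flakes.json")

def is_known_flake(test_name: str) -> bool:
    with urllib.request.urlopen(FLAKE_LIST_URL) as resp:
        flakes = set(json.load(resp))  # assumed shape: ["test_checkout_total", ...]
    return test_name in flakes

failed_test = sys.argv[1]  # supplied by the CI step that parsed the failing run
if is_known_flake(failed_test):
    print(f"{failed_test} is a known flake: quarantine it, skip AutoFix")
else:
    print(f"{failed_test} looks real: hand the PR to AutoFix")
```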
AutoFix and Kleore are complements, not competitors. AutoFix needs a flake-aware front door to be safe in a real codebase. Without one, every flaky test in your suite becomes a recurring tax on your token bill.
Stop letting AutoFix burn cycles on flakes.
Install the Kleore GitHub App. Get a ranked list of every flaky test in your repos — with dollar costs attached — in two minutes. Free to start.
Scan my repos — free
FAQ: Claude Code AutoFix and flaky tests
Can Claude Code AutoFix fix flaky tests?
No. AutoFix is designed to fix deterministic failures caused by the current PR’s code change. Flaky tests fail for reasons unrelated to the diff — race conditions, shared state, network jitter — so AutoFix either invents a cargo-cult fix that “works” because the flake passed by chance, or it loops indefinitely burning tokens and CI minutes.
Why does AutoFix loop on flaky tests?
AutoFix treats every red CI as a code bug. When a flake fails, AutoFix pushes a speculative fix and re-runs. The flake fails again for unrelated reasons, AutoFix fixes harder, and the loop repeats. Each iteration consumes LLM calls, CI minutes, and adds noisy commits to your git history.
How do I stop AutoFix from wasting cycles on flaky tests?
Classify failures before AutoFix runs. If a test has historically failed on PRs unrelated to its code, treat it as a flake — quarantine and address separately. Tools like Kleore scan your CI history across many PRs to identify flaky tests and rank them by frequency and dollar cost, so AutoFix only engages on real bugs.
Are AutoFix and Kleore competitors?
No, they’re complements. AutoFix is a per-PR fixer that needs a flake-aware front door. Kleore provides cross-PR flaky test detection and dollar-cost reporting that tells AutoFix (and your engineers) when to fix and when to quarantine.
How much does running AutoFix on a flaky test cost?
A single flaky test can drive AutoFix through 5–30 iterations, consuming 60–300 CI minutes and $5–$25 in tokens per failure — roughly 10x the cost of fixing a real bug. Multiplied across a quarter, this becomes a recurring tax on your engineering budget.
Further reading
- What Are Flaky Tests? — The primer on why tests fail without code changes.
- How Much Do Flaky Tests Actually Cost? — Compute is the smallest line item.
- How to Fix Flaky Tests in GitHub Actions — Six root causes and how to fix each one.