How We Reduced Flaky Test Rate by 40% with Smarter Test Selection

We ran SuperPlane on our own repository for the first six months while building it — eating our own cooking before anyone else did. That experience taught us more about the relationship between test selection and flaky tests than any design exercise would have. The short version: flakiness and suite over-execution are the same problem from different angles, and fixing one affects the other in ways that are not obvious until you have the data.

The Flakiness Problem Is Mostly a Volume Problem

When people talk about flaky tests, they usually mean tests that fail non-deterministically — timing issues, test isolation failures, external service dependencies, port conflicts. These are real problems that require real fixes. But there is a second category that gets less attention: tests that are technically deterministic but are being run in contexts where they are exposed to environmental conditions they were not designed to handle.

Consider an integration test that spins up a local HTTP server on port 8080. In isolation it is fine. Parallelized against 40 other tests that might also claim port 8080, it fails sometimes. The test is not flaky in the "bad code" sense — it is being run in conditions its author did not account for, at a frequency its author did not expect.
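To make the collision concrete, here is a minimal sketch using only the Python standard library. It is an illustration rather than code from our test suite, and the helper names are invented for the example. It shows a second bind to an already-claimed port failing, while two listeners that each ask the OS for a free port (port 0) get distinct ports.

```python
import socket


def bind_listener(port: int) -> socket.socket:
    """Bind a TCP listener on localhost; port 0 asks the OS for any free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock


def fixed_port_collides() -> bool:
    """Return True if a second test claiming the same fixed port fails to bind."""
    first = bind_listener(0)  # stand-in for the test that claimed the port first
    taken_port = first.getsockname()[1]
    try:
        second = bind_listener(taken_port)  # second test hard-codes the same port
    except OSError:
        return True
    else:
        second.close()
        return False
    finally:
        first.close()


def ephemeral_ports_are_distinct() -> bool:
    """Return True if two listeners that each request port 0 get different ports."""
    a, b = bind_listener(0), bind_listener(0)
    try:
        return a.getsockname()[1] != b.getsockname()[1]
    finally:
        a.close()
        b.close()
```

The more tests run concurrently with a hard-coded port, the more often some pair of them lands in the first case.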

This is a volume problem. Run 400 tests where 80 would do, and you will see more apparent flakiness than if you had run the 80. The tests did not get worse; you are exercising more interaction paths between tests, more contention for shared resources, and more race conditions in test infrastructure that was designed with smaller concurrent loads in mind.

When test selection (running the tests relevant to the diff rather than the full suite) cut the number of tests we executed per run by approximately 45%, our apparent flaky failure rate dropped from around 8.2% of CI runs to about 4.9%, a relative reduction of roughly 40%. We had not fixed a single flaky test. We had just stopped running so many of them at the same time.
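As a quick check on how those two rates relate to the 40% in the title, the arithmetic can be written out as follows. The function name is ours, chosen for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after` (0.40 means 40% lower)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Figures quoted above: 8.2% of CI runs before, 4.9% after.
drop = relative_reduction(8.2, 4.9)
print(f"{drop:.1%}")              # 40.2% relative reduction
print(f"{8.2 - 4.9:.1f} points")  # 3.3 percentage points in absolute terms
```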

Measuring Flakiness Correctly

Most teams measure flakiness as: "this test failed, we re-ran it without changing code, it passed — therefore flaky." This is correct but incomplete. It misses two important questions: how often does this test fail flakily per unit of time, and what is the blast radius (how many CI runs does it affect per day)?

A test that has a 2% flakiness rate but runs on every PR in a busy repository might block 10-15 engineers per day. A test that has a 25% flakiness rate but runs only on a weekly regression suite affects almost nobody. The 2% test is the one to fix first.

We track flakiness this way:

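def flakiness_impact_score(test_id, window_days=7):
    """Score a test by how many flaky failures it causes per day.

    get_test_runs and was_retry_successful are assumed to be defined
    elsewhere; they are not shown here.
    """
    runs = get_test_runs(test_id, days=window_days)
    if not runs:
        return 0.0

    # A run counts as flaky if it failed and a retry without code changes passed
    flaky_runs = [r for r in runs
                  if r.status == "failed"
                  and was_retry_successful(r)]

    # Raw flakiness rate
    flakiness_rate = len(flaky_runs) / len(runs)

    # Frequency weight: how often does this test run?
    runs_per_day = len(runs) / window_days

    # Impact = frequency × flakiness rate (equivalently, flaky failures per day)
    return flakiness_rate * runs_per_day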

Sorting by impact score rather than raw flakiness rate gives you a prioritized list where the tests at the top are actually the ones costing your team the most time. The ones with high raw flakiness but low frequency often end up near the bottom — worth fixing eventually, but not the next thing to work on.
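The function above relies on two helpers that are not shown. A self-contained sketch of the same calculation, with in-memory records standing in for real CI history, might look like the following. The RunRecord type, the test names, the 28-day window, and the run counts are all invented for illustration; only the 2% and 25% rates echo the earlier example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical run record; real CI data will have a different shape."""
    test_id: str
    failed: bool
    passed_on_retry: bool  # True when a retry with no code change passed


def impact_score(runs: list[RunRecord], window_days: int) -> float:
    """Flakiness rate times runs per day, i.e. flaky failures per day."""
    if not runs:
        return 0.0
    flaky = sum(1 for r in runs if r.failed and r.passed_on_retry)
    flakiness_rate = flaky / len(runs)
    runs_per_day = len(runs) / window_days
    return flakiness_rate * runs_per_day


def make_runs(test_id: str, total: int, flaky: int) -> list[RunRecord]:
    """Build `total` runs, of which the first `flaky` failed then passed on retry."""
    return [RunRecord(test_id, i < flaky, i < flaky) for i in range(total)]


WINDOW_DAYS = 28
history = {
    # 25% flaky, but only runs in a weekly regression suite (4 runs in 28 days)
    "test_weekly_regression": make_runs("test_weekly_regression", total=4, flaky=1),
    # 2% flaky, but runs on every PR (2000 runs in 28 days)
    "test_checkout_api": make_runs("test_checkout_api", total=2000, flaky=40),
}

# Highest impact first
ranked = sorted(
    history,
    key=lambda test_id: impact_score(history[test_id], WINDOW_DAYS),
    reverse=True,
)
print(ranked)
```

With these made-up numbers the 2% test ranks first, at roughly 1.4 flaky failures per day against about 0.04 for the 25% test.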

What Test Selection Does to Flakiness (and What It Does Not)

Test selection reduces flakiness indirectly, by reducing the total execution volume and therefore the concurrency load on shared test infrastructure. It does not fix the underlying non-determinism in affected tests.
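For readers who have not worked with test selection, the core idea fits in a few lines. What follows is a deliberately naive sketch and not a description of how SuperPlane selects tests. It assumes a hypothetical, precomputed map from each test file to the source files it depends on, and picks every test whose dependencies overlap the files changed in a diff. All file names are invented.

```python
def select_tests(changed_files: set[str], test_deps: dict[str, set[str]]) -> list[str]:
    """Return the tests whose known dependencies overlap the changed files."""
    return sorted(
        test for test, deps in test_deps.items()
        if deps & changed_files
    )


# Hypothetical dependency map; in practice something like coverage data
# or import analysis would have to produce it.
TEST_DEPS = {
    "tests/test_billing.py": {"app/billing.py", "app/models.py"},
    "tests/test_auth.py": {"app/auth.py", "app/models.py"},
    "tests/test_search.py": {"app/search.py"},
}

selected = select_tests({"app/billing.py"}, TEST_DEPS)
print(selected)  # ['tests/test_billing.py']
```

Every test left out of the selection is one fewer process competing for ports, database rows, and CPU time during the run, which is the indirect mechanism described above.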

This distinction matters for how you communicate the improvement. When we saw our flaky rate drop by 40%, we were initially tempted to describe it as a flakiness reduction. That is technically accurate but misleading — the tests themselves had not improved, and if we ran the full suite, we would see the old flakiness rate. The better description is: "we are running fewer unnecessary tests, which reduces the contention-driven flakiness that was masking real test health."

This also means that tests with genuine non-determinism — timing issues, external API calls, race conditions in the application logic itself — will not improve from test selection alone. They need to be fixed. But test selection exposes them more clearly because the flakiness signal is less diluted by contention-driven false positives.

The Cascade Effect: How Over-Running Creates Flakiness Debt

There is a more insidious dynamic we observed in our own CI history. Over time, teams stop treating flaky tests as something to fix. They become ambient noise. Engineers add retry: 2 to their CI YAML. The flaky tests keep accumulating failures, keep getting retried, and nobody investigates because the build eventually passes.

This creates flakiness debt. The team's effective tolerance for test unreliability rises, more tests get written with the same assumptions that created the original flakiness, and the problem compounds. After 12 months, a team that started with 3% flakiness might be at 12% — not because the engineers are worse at writing tests, but because the standard was never enforced.

Reducing execution volume resets this dynamic in a useful way. When you are running fewer tests per build, each flaky test has a higher proportional impact on build reliability. Teams start caring about them again because the signal-to-noise ratio improves. We saw this empirically: in the three months after we tightened test selection on our own pipeline, we filed and fixed more flaky test issues than in the preceding six months combined. The tests had not gotten worse; they had become more visible.

Patterns We Found in Our Own Flaky Tests

When we finally dug into the flaky tests that persisted after selection, they fell into three categories:

Shared port/address contention. Multiple tests assuming ownership of localhost ports, Redis keys, or file paths. Fix: parameterize these with random ports at test startup, use temp directories, prefix Redis keys with test run IDs.

Database state leakage. Tests modifying shared database state and not cleaning up, causing later tests in the same run to observe unexpected data. Fix: transaction rollback in test teardown, or test-specific database schemas created fresh per test.

Time-sensitive assertions. Tests that assert "this should complete within 100ms" on a heavily loaded CI runner where the runner might be scheduled out by the OS during the 100ms window. Fix: replace time-based assertions with event-based assertions, or use much more generous timeouts that only fail on genuine hangs.
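As an illustration of the third fix, here is a minimal sketch of an event-based check built on threading.Event from the Python standard library. The background job and the 5-second ceiling are assumptions made for the example. The timeout is there only to catch a genuine hang, not to measure speed.

```python
import threading
import time


def start_background_job(done: threading.Event) -> threading.Thread:
    """Stand-in for the code under test: does some work, then signals completion."""
    def work() -> None:
        time.sleep(0.05)  # simulated work; real duration varies with runner load
        done.set()

    thread = threading.Thread(target=work, daemon=True)
    thread.start()
    return thread


def job_completes(hang_timeout_s: float = 5.0) -> bool:
    """Wait for the completion signal; report failure only if it never arrives."""
    done = threading.Event()
    thread = start_background_job(done)
    # wait() returns True as soon as set() is called, or False after the timeout
    finished = done.wait(timeout=hang_timeout_s)
    thread.join(timeout=hang_timeout_s)
    return finished
```

On a fast runner this returns almost immediately, and on a heavily loaded one it simply takes longer instead of failing.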

None of these categories is novel. But seeing your actual distribution across categories is useful for prioritization — if 60% of your flaky tests are in the time-sensitivity category, you need to look at your CI runner provisioning and load as much as at the test code.

What a Healthy Flakiness Rate Looks Like

There is no universally correct flakiness target. Teams with different testing philosophies accept different thresholds. But a rough reference point from working in this space: a well-maintained test suite with selective execution should see a flaky-run rate (the share of builds that required at least one flaky test retry) below 3%. A rate above 8% is a signal that something systemic is wrong: over-running, undertested infrastructure, or accumulated debt.
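The flaky-run rate as defined above is simple to compute from build history. A minimal sketch, assuming a hypothetical Build record that stores how many flaky retries each build needed. The 3% and 8% cut-offs are the reference points from the paragraph above; the label for the band in between is our own wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Build:
    """Hypothetical build record; field names are assumptions for this sketch."""
    build_id: str
    flaky_retries: int  # number of flaky test retries the build needed


def flaky_run_rate(builds: list[Build]) -> float:
    """Share of builds that required at least one flaky test retry."""
    if not builds:
        return 0.0
    affected = sum(1 for b in builds if b.flaky_retries > 0)
    return affected / len(builds)


def classify(rate: float) -> str:
    """Map a flaky-run rate onto the rough reference bands."""
    if rate < 0.03:
        return "healthy"
    if rate > 0.08:
        return "systemic problem"
    return "needs attention"


# Made-up history: 5 of 100 builds needed a flaky retry.
builds = [Build(f"b{i}", 1 if i < 5 else 0) for i in range(100)]
rate = flaky_run_rate(builds)
print(f"{rate:.0%} -> {classify(rate)}")  # 5% -> needs attention
```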

We are not saying zero flakiness is achievable or even the right goal. Some tests exercise asynchronous systems and will always have some timing sensitivity. Some integration tests genuinely depend on external factors that cannot be fully controlled. Chasing zero flakiness at the cost of abandoning integration test coverage is the wrong tradeoff. Chasing 3-5% with the fixes above — reducing volume, fixing the high-impact tests, maintaining a discipline of not adding retries as a substitute for investigation — is achievable for most teams and makes the rest of your pipeline measurably more reliable.