Test Selection Without ML Infrastructure

When teams ask us about intelligent test selection, the conversation sometimes gets stuck in the wrong place: "We don't have an ML team. We don't have model-serving infrastructure. Does this still apply to us?"

Yes. Substantially yes. The sophisticated version of test selection uses ML models trained on dependency graphs, code embeddings, and historical correlation signals. But you don't need the sophisticated version to get most of the value. The simple version — built on historical pass/fail correlation from your existing CI run data — can produce 50-65% test reduction on typical PR-level changes, and you can implement it without any ML infrastructure at all.

This post is about the simple version: what data you need, how the selection logic works, what it can and can't do, and where the ceiling is before you'd need to reach for more complex machinery.

What "Test Selection" Actually Means

Test selection, in the context of CI, means: given a set of changed files in a commit, select a subset of your test suite that's likely to contain all tests that would fail if this change introduced a bug. Run that subset. If it passes, you have reasonable confidence the change is safe — not perfect confidence, but confidence proportional to the quality of your selection model.

The key phrase is "likely to contain." You're not guaranteeing full coverage. You're making a probabilistic bet that the selected subset has high recall (it doesn't miss real failures), even at lower precision (it may include some tests that couldn't possibly fail from this change).

The tradeoff you're making: faster feedback (smaller test suite) in exchange for slightly lower coverage on each individual run. The bet pays off when: (a) your full suite is slow enough to create real friction, (b) most changes are localized enough that a significant fraction of tests are genuinely irrelevant, and (c) your selection model has high enough recall that missed failures are rare.
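To make reduction, recall, and precision concrete, here is a minimal sketch that scores one selection against the full-suite outcome for the same change. The function name and the test names are hypothetical, and it assumes you know which tests actually failed when the whole suite ran.

```python
def selection_metrics(all_tests, selected, actually_failed):
    """Score one selection against the full-suite outcome for the same change.

    All three arguments are sets of test names (hypothetical inputs).
    """
    caught = selected & actually_failed
    return {
        # Share of the suite that selection let us skip
        "reduction": 1 - len(selected) / len(all_tests),
        # Share of real failures that were inside the selected subset
        "recall": len(caught) / len(actually_failed) if actually_failed else 1.0,
        # Share of selected tests that really failed
        "precision": len(caught) / len(selected) if selected else 1.0,
    }


all_tests = {"test_login", "test_logout", "test_refund", "test_invoice", "test_search"}
selected = {"test_login", "test_logout"}
actually_failed = {"test_login"}

metrics = selection_metrics(all_tests, selected, actually_failed)
```

In this toy run the subset skips 60% of the suite and still contains the one real failure (recall 1.0), while half of what it ran could not have failed (precision 0.5). That is the shape of the trade: precision is cheap to give up, recall is not.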

The Data You Already Have

Here's the core insight: your CI run history is a supervised dataset for test selection, and you already own it. You don't need to collect new data or instrument anything new.

Every CI run contains: (a) which files were modified in the triggering commit, (b) which tests ran, (c) which tests passed or failed. If you have 500 runs, you have 500 examples of (change set, test outcomes) pairs.

The simplest useful feature you can extract from this data: for each (test, file) pair, how often did this test fail when this file was part of the change set? This is a raw co-failure frequency. It's not sophisticated: it doesn't know about code structure, doesn't understand imports, and can't distinguish "this file imports that module" from "this file happened to be modified in the same commit as that module." But it's a starting point.
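As a rough illustration of that feature, the sketch below builds a co-failure table from run history. The run schema (dicts with "changed_files", "tests_run", and "tests_failed" sets) is an assumption made for this example; your CI system will export something different.

```python
from collections import defaultdict


def build_co_failure_table(run_history):
    """Return {(test, file): failure_rate} from past CI runs.

    Assumes each run is a dict with "changed_files", "tests_run",
    and "tests_failed" sets (a hypothetical schema).
    """
    observed = defaultdict(int)  # times the test ran while the file was changed
    failed = defaultdict(int)    # times it failed in those runs

    for run in run_history:
        for path in run["changed_files"]:
            for test in run["tests_run"]:
                observed[(test, path)] += 1
                if test in run["tests_failed"]:
                    failed[(test, path)] += 1

    return {pair: failed[pair] / count for pair, count in observed.items()}


run_history = [
    {"changed_files": {"src/auth.py"}, "tests_run": {"test_login", "test_refund"},
     "tests_failed": {"test_login"}},
    {"changed_files": {"src/auth.py"}, "tests_run": {"test_login", "test_refund"},
     "tests_failed": set()},
    {"changed_files": {"src/payments.py"}, "tests_run": {"test_login", "test_refund"},
     "tests_failed": {"test_refund"}},
]

table = build_co_failure_table(run_history)
```

With these three runs, test_login has a co-failure rate of 0.5 with src/auth.py and 0.0 with src/payments.py. Pairs that were never observed together are simply absent from the table, which is different from a rate of zero.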

The selection algorithm: for a given change set of files F, select all tests whose co-failure frequency with any file in F exceeds a threshold T. Then add all tests with little or no run history against the files in F (these are "uncertain" tests that you haven't observed in relevant contexts). Everything else is a candidate for exclusion.

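# Basic frequency-based selection (runnable sketch).
# Assumes each run in run_history is a dict with "changed_files",
# "tests_run", and "tests_failed" sets; adapt to your CI's export format.
MIN_OBSERVATIONS = 5  # illustrative value, tune for your suite

def _runs_with(test, path, run_history):
    # Runs where this file changed and this test executed
    return [r for r in run_history
            if path in r["changed_files"] and test in r["tests_run"]]

def get_history_count(test, changed_files, run_history):
    # Fewest observations against any single changed file, so one
    # brand-new file is enough to mark the test as uncertain
    return min((len(_runs_with(test, f, run_history)) for f in changed_files),
               default=0)

def compute_co_failure_rate(test, changed_files, run_history):
    # Highest per-file failure rate across the changed files
    rates = [0.0]
    for f in changed_files:
        runs = _runs_with(test, f, run_history)
        if runs:
            rates.append(sum(test in r["tests_failed"] for r in runs) / len(runs))
    return max(rates)

def select_tests(changed_files, all_tests, run_history, threshold=0.05):
    selected = set()

    for test in all_tests:
        # Include tests whose co-failure rate exceeds the threshold
        rate = compute_co_failure_rate(test, changed_files, run_history)
        if rate > threshold:
            selected.add(test)
            continue

        # Include tests with too little history against these files
        count = get_history_count(test, changed_files, run_history)
        if count < MIN_OBSERVATIONS:
            selected.add(test)

    return selected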

This is rough but useful. On a codebase where tests have reasonable locality (tests for module X tend to fail when module X changes), this approach alone can produce 40-60% test reduction while maintaining recall above 95% — meaning fewer than 1 in 20 real failures gets skipped.

Where Simple Frequency Breaks Down

Before you implement this and call it done, here are the failure modes you should know about:

Sparse history on new code. New files have no co-failure history. The "uncertain test" fallback handles this correctly — tests with no history against the changed file get included. But if you're doing a large refactor that touches many new or rarely-modified files, most of your test suite ends up in the "uncertain" bucket and you get little reduction. This is correct behavior — you genuinely don't have signal — but it means selection is least effective precisely when changes are largest.

Shared utility file thrash. If a file like utils/string_helpers.py is modified in many commits for unrelated reasons, tests across your entire codebase will show a high co-failure rate with it. That is not because they depend on the string helpers, but because the file happens to be in the change set whenever those tests fail for other reasons. The selection model will then select almost everything when string_helpers.py is in the change set, providing no benefit. Mitigation: weight the co-failure rate by the inverse document frequency of the file, so that files modified in many commits have their correlation signal discounted.
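One possible shape for that inverse-document-frequency discount is sketched below. The smoothing (adding 1 to both counts) and the normalization to a 0-1 weight are choices made for this example, not something the approach prescribes, and the run schema is the same hypothetical one with a "changed_files" set per run.

```python
import math


def file_idf(path, run_history):
    """Inverse document frequency of a file across CI runs.

    Files touched in nearly every run score close to 0; rarely
    touched files score higher.
    """
    total_runs = len(run_history)
    runs_with_file = sum(1 for run in run_history if path in run["changed_files"])
    return math.log((1 + total_runs) / (1 + runs_with_file))


def weighted_co_failure(rate, path, run_history):
    """Discount a raw co-failure rate by how widely the file is modified."""
    max_idf = math.log(1 + len(run_history))  # IDF of a never-modified file
    if max_idf == 0:
        return rate  # no history yet, nothing to discount against
    return rate * file_idf(path, run_history) / max_idf


# Ten runs: the helper file changes in nine of them, auth.py in one.
history = [{"changed_files": {"utils/string_helpers.py"}} for _ in range(9)]
history.append({"changed_files": {"src/auth.py"}})

noisy = weighted_co_failure(0.3, "utils/string_helpers.py", history)
focused = weighted_co_failure(0.3, "src/auth.py", history)
```

The same raw rate of 0.3 drops to roughly 0.012 for the constantly modified helper file and to roughly 0.21 for the rarely modified one. Because the weighting shrinks every score, a threshold tuned on raw rates would need to be re-tuned, and tests that genuinely depend on a shared file become easier to miss, so it is worth checking recall after enabling a discount like this.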

Integration test coupling. End-to-end integration tests tend to fail across many different change types because they exercise large codepaths. Their co-failure rate with almost every file is high. Selection models tend to always include integration tests, which limits the reduction possible for change sets that don't genuinely need them. You can address this by treating integration tests as a separate category with a separate selection threshold, or by running them only on merge to main rather than on PR branches.
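Both mitigations can be sketched in a few lines. The thresholds, the tests/integration/ path convention, and the function names here are assumptions for illustration; the text presents the two options as alternatives, and this example simply shows them side by side.

```python
# Hypothetical per-category thresholds: integration tests need a much
# stronger signal before they are pulled into a PR run.
THRESHOLDS = {"unit": 0.05, "integration": 0.30}


def categorize(test_path):
    """Classify a test by path; assumes integration tests live under tests/integration/."""
    return "integration" if test_path.startswith("tests/integration/") else "unit"


def select_by_category(co_failure_rates, on_main=False):
    """Pick tests from {test_path: co_failure_rate} for the current change set.

    On merge to main (on_main=True) every integration test runs regardless.
    """
    selected = set()
    for test_path, rate in co_failure_rates.items():
        category = categorize(test_path)
        if on_main and category == "integration":
            selected.add(test_path)
        elif rate > THRESHOLDS[category]:
            selected.add(test_path)
    return selected


rates = {
    "tests/unit/auth/test_login.py": 0.12,
    "tests/unit/payments/test_refund.py": 0.01,
    "tests/integration/test_checkout_flow.py": 0.15,
}
pr_selection = select_by_category(rates)
main_selection = select_by_category(rates, on_main=True)
```

On a PR branch only the auth unit test is selected, because the integration test's 0.15 rate falls below its higher bar. On main, the integration test runs regardless of its score.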

The Numbers: What to Realistically Expect

For a typical backend service codebase with a few hundred unit tests plus integration tests, here's what the frequency-based approach produces based on the pattern we observe in practice:

Single-file changes: 55-70% test reduction (most tests have no historical co-failure with the changed file)
5-10 file changes: 30-45% reduction (more files means more coverage needed)
20+ file changes: 10-20% reduction (large change sets trigger most of the suite)
Recall (fraction of real failures not missed): 93-97% on well-established codebases

We're not claiming these numbers are guaranteed. They depend heavily on how modular your codebase is and how much locality your tests have. A highly coupled codebase where every change tends to touch shared infrastructure will see lower reduction. A well-modularized codebase with clear test ownership per module will see higher reduction.

The numbers also improve as your history accumulates. At 100 runs, you have low confidence on any given frequency estimate. At 1000 runs, the estimates are more stable and the selection is more aggressive. The first few weeks of using test selection are the low-confidence phase — it's conservative by design while history builds up.
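One way to see why more history permits more aggressive exclusion: if a test has run n times against a file and never failed, the largest failure rate still consistent with that record shrinks as n grows. The sketch below uses the exact zero-failure binomial bound at 95% confidence. It is only an illustration, not part of the selection algorithm described earlier, and it assumes runs are independent, which real CI history only approximates.

```python
def zero_failure_upper_bound(observations, confidence=0.95):
    """Largest failure rate p with (1 - p) ** observations >= 1 - confidence.

    Assumes independent runs and zero observed failures.
    """
    if observations <= 0:
        return 1.0  # no data: the failure rate could be anything
    return 1 - (1 - confidence) ** (1 / observations)


bounds = {n: zero_failure_upper_bound(n) for n in (10, 100, 1000)}
```

With 10 clean observations the true failure rate could still be around 26%, far above a 0.05 threshold. At 100 observations the bound is about 3%, and at 1000 it is about 0.3%. Note that these are observations per (test, file) pair, which accumulate much more slowly than total CI runs.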

The Minimum History Threshold

Before activating test selection in your CI pipeline, you want enough run history to have seen each test fail at least a few times in different contexts. The rule of thumb we use: 300 CI runs per service, with reasonably diverse change sets (not 300 runs that all modify the same file).

If you're starting from scratch, you have a few options:

Run selection in "shadow mode" for the first few weeks: compute the selection but still run the full suite, log which tests selection would have excluded, and verify that excluded tests don't fail. This gives you calibration data without the risk of missed failures.
Set the selection threshold conservatively (a low co-failure threshold, so that more tests are included and fewer are excluded) until you've accumulated enough history to trust the model.
Use file-path-based heuristics as a prior: tests in tests/unit/auth/ are probably not relevant to changes in src/payments/, regardless of historical correlation. This structural prior fills the gap while history accumulates.
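The shadow-mode option from the list above might look something like this: replay a selector over runs where the full suite still executed, and record every failure it would have skipped. The run schema and the stand-in selector are hypothetical; in practice `select` would be whatever selection function you are calibrating.

```python
def shadow_report(runs, select):
    """Replay selection against runs where the full suite still executed.

    `runs` uses the hypothetical schema {"changed_files", "tests_run",
    "tests_failed"}; `select` is any callable mapping a set of changed
    files to the set of tests it would have run.
    """
    missed = []  # (run index, test) pairs selection would have skipped
    skipped_total = 0
    run_total = 0

    for index, run in enumerate(runs):
        selected = select(run["changed_files"])
        excluded = run["tests_run"] - selected
        skipped_total += len(excluded)
        run_total += len(run["tests_run"])
        missed.extend((index, test) for test in sorted(excluded & run["tests_failed"]))

    return {
        "reduction": skipped_total / run_total if run_total else 0.0,
        "missed_failures": missed,
    }


def path_selector(changed_files):
    # Stand-in selector: run auth tests only when an auth file changed
    if any(path.startswith("src/auth/") for path in changed_files):
        return {"test_login", "test_logout"}
    return {"test_refund"}


runs = [
    {"changed_files": {"src/auth/session.py"},
     "tests_run": {"test_login", "test_logout", "test_refund"},
     "tests_failed": {"test_login"}},
    {"changed_files": {"src/payments/refund.py"},
     "tests_run": {"test_login", "test_logout", "test_refund"},
     "tests_failed": {"test_logout"}},
]
report = shadow_report(runs, path_selector)
```

Here the stand-in selector would have skipped half the test executions but missed test_logout failing on the payments change. A non-empty missed_failures list is the signal to loosen the threshold or keep shadowing before switching selection on for real.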

When to Add Code Coverage Signals

The frequency-based approach is purely correlational. Adding code coverage data — which lines of code does each test exercise — makes the model structural. A test is now selected based on whether the changed lines are in its coverage path, not just whether it historically failed together with the changed file.

Coverage-based selection is more precise: it selects tests that could theoretically be affected by the change rather than tests that historically happened to fail alongside it. But it requires instrumentation. Collecting coverage adds 15-40% to test execution time. Managing coverage storage (you need per-test coverage, not aggregate coverage) adds infrastructure. For many teams, the frequency approach is good enough that this overhead isn't worth taking on.
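For comparison, a coverage-based selector can be sketched as a set intersection between changed lines and each test's recorded coverage. The data shapes below are hypothetical; real coverage tools store per-test coverage in their own formats, and this is not a description of any particular tool.

```python
def select_by_coverage(changed_lines, coverage_map):
    """Select tests whose recorded coverage touches any changed line.

    `changed_lines` maps file path -> set of changed line numbers.
    `coverage_map` maps test name -> {file path: set of covered lines}.
    Both shapes are hypothetical.
    """
    selected = set()
    for test, covered in coverage_map.items():
        for path, lines in changed_lines.items():
            if covered.get(path, set()) & lines:
                selected.add(test)
                break  # one overlapping file is enough to select the test
    return selected


coverage_map = {
    "test_login": {"src/auth.py": {10, 11, 12}, "src/tokens.py": {5, 6}},
    "test_refund": {"src/payments.py": {20, 21}},
}
changed_lines = {"src/tokens.py": {6}}
selected = select_by_coverage(changed_lines, coverage_map)
```

Only test_login is selected, because its coverage reaches line 6 of src/tokens.py, whether or not the two have ever failed together. One caveat: lines added by the change itself cannot appear in a stored coverage map, so a real implementation would likely need a file-level fallback for new code.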

The signal to add coverage: your frequency model has recall problems — it's missing failures that coverage-based selection would catch. In practice, this typically happens when your codebase has highly indirect dependencies (A imports B which imports C, and tests for A fail when C changes, but A and C have never been in the same change set historically). Coverage-based selection catches this; frequency-based selection misses it until you've observed it failing.

We built both approaches into SuperPlane because different codebases need different starting points. But the frequency-based path, which requires no new instrumentation, is where most teams start — and for the majority, it's where they stay.