SuperPlane v0.9: Smarter Test Selection with Rolling History

SuperPlane v0.9 shipped last week. The headline change is that our test selection model now uses a 30-day rolling history window instead of 7 days. This post explains why we made that change, what the data looked like before and after, and a few edge cases that took more work than expected.

What the 7-Day Window Was Missing

Our test selection model works by building a correlation map between code paths (files, modules, packages) and test cases, then scoring which tests are most likely to catch a failure given the current diff. The rolling history window is the lookback period we use to build that correlation map.
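As a rough illustration of the idea, the sketch below builds a tiny correlation map and ranks tests for a diff. The tuple shape of the events, the function names, and the plain summing of weights are assumptions made for the example, not SuperPlane's actual scoring logic.

```python
from collections import defaultdict

def build_correlation_map(events):
    """Sum event weights into a {code_path: {test_name: weight}} mapping.

    `events` is an iterable of (code_path, test_name, weight) tuples;
    this shape is hypothetical and chosen only for the example.
    """
    corr = defaultdict(lambda: defaultdict(float))
    for code_path, test_name, weight in events:
        corr[code_path][test_name] += weight
    return corr

def score_tests(corr, changed_paths):
    """Score each test by summing its correlation weight over the changed paths."""
    scores = defaultdict(float)
    for path in changed_paths:
        for test_name, weight in corr.get(path, {}).items():
            scores[test_name] += weight
    # Highest-scoring tests first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

events = [
    ("billing/invoice.py", "test_invoice_totals", 0.9),
    ("billing/invoice.py", "test_tax_rounding", 0.4),
    ("auth/session.py", "test_login", 0.7),
    ("billing/invoice.py", "test_invoice_totals", 0.3),
]
corr = build_correlation_map(events)
ranked = score_tests(corr, ["billing/invoice.py"])
```

For a diff that only touches `billing/invoice.py`, `test_invoice_totals` ranks first and `test_login` does not appear at all, because it has no recorded correlation with that path.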

With a 7-day window, we were seeing a pattern we started calling "Monday morning false confidence." Teams with regular sprint cycles tend to touch certain parts of the codebase in waves — heavy refactors early in a sprint, stabilization and test work at the end. If a significant refactor happened 8 days ago, our correlation map had no knowledge of it. Tests that were highly relevant to that code path were getting scored as low-priority, and our selection rate on them was dropping.

We quantified this by running a retrospective analysis: for each failed build in our dataset, we asked "would the failing test have been selected under the current model?" Under the 7-day model, we were selecting the eventual failing test in 91.4% of cases. That sounds good until you look at the 8.6% miss rate broken down by when the relevant code was last touched. Missed selections were concentrated in code paths last modified 8-21 days ago — exactly the window the 7-day model was blind to.
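A retrospective check of this kind can be sketched in a few lines. The build records and their keys below are invented for illustration, and the age buckets simply mirror the ranges discussed above.

```python
from collections import Counter

def age_bucket(days):
    """Group a code path by how long ago it was last modified."""
    if days <= 7:
        return "0-7 days"
    if days <= 21:
        return "8-21 days"
    return "22+ days"

def retrospective(failed_builds):
    """Return (selection_rate, misses_by_age_bucket) for a list of failed builds.

    Each build is a dict with hypothetical keys:
      "failing_test": the test that actually failed
      "selected": the set of tests the model chose for that build
      "days_since_path_touched": age of the last change to the relevant path
    """
    hits = 0
    misses = Counter()
    for build in failed_builds:
        if build["failing_test"] in build["selected"]:
            hits += 1
        else:
            misses[age_bucket(build["days_since_path_touched"])] += 1
    rate = hits / len(failed_builds) if failed_builds else 0.0
    return rate, misses

# Toy data, not taken from the real dataset
builds = [
    {"failing_test": "test_a", "selected": {"test_a", "test_b"}, "days_since_path_touched": 2},
    {"failing_test": "test_c", "selected": {"test_a"}, "days_since_path_touched": 12},
    {"failing_test": "test_d", "selected": {"test_d"}, "days_since_path_touched": 5},
    {"failing_test": "test_e", "selected": {"test_e"}, "days_since_path_touched": 1},
]
rate, misses = retrospective(builds)
```

On this toy input the selection rate is 0.75 and the single miss lands in the "8-21 days" bucket, which is the shape of result that pointed us at the window size.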

Why 30 Days, Not 14 or 60

We tested a range of window sizes: 7, 14, 21, 30, 45, and 60 days. The metric we optimized for was "failing test selection coverage": the fraction of failed builds whose failing test was included in the selected set.

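| Window  | Failing test selection rate | Avg tests selected per run (vs. 7-day window) | P75 pipeline duration (relative) |
|---------|-----------------------------|-----------------------------------------------|----------------------------------|
| 7 days  | 91.4%                       | baseline                                      | 1.0×                             |
| 14 days | 94.2%                       | +4%                                           | 1.04×                            |
| 21 days | 95.8%                       | +7%                                           | 1.07×                            |
| 30 days | 96.9%                       | +11%                                          | 1.11×                            |
| 45 days | 97.1%                       | +19%                                          | 1.19×                            |
| 60 days | 97.2%                       | +28%                                          | 1.28×                            |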

The coverage curve flattens sharply after 30 days. We get 96.9% at 30 days versus 97.2% at 60 days, but the selected test set grows from +11% to +28% over the 7-day baseline (17 percentage points) to gain those 0.3 points of coverage. Past 30 days, you are mostly selecting tests for code that genuinely has not changed in a month and is unlikely to regress from today's diff. The correlation signal is real but very weak, and the pipeline duration cost is not worth it.
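The trade-off is easier to see as marginal steps between adjacent window sizes. The snippet below only rearranges the numbers from the table above; the function name and tuple layout are ours for this example.

```python
# (window_days, failing-test selection rate in %, extra tests vs. 7-day baseline in %)
results = [
    (7, 91.4, 0),
    (14, 94.2, 4),
    (21, 95.8, 7),
    (30, 96.9, 11),
    (45, 97.1, 19),
    (60, 97.2, 28),
]

def marginal_gains(rows):
    """For each step to a larger window, return (window, coverage gain, extra test cost)."""
    steps = []
    for (_, prev_cov, prev_cost), (window, cov, cost) in zip(rows, rows[1:]):
        steps.append((window, round(cov - prev_cov, 1), cost - prev_cost))
    return steps

steps = marginal_gains(results)
for window, gain, cost in steps:
    print(f"{window:>2} days: +{gain} pts coverage for +{cost} pts more tests")
```

Each step up to 30 days buys between 1.1 and 2.8 points of coverage for 3 to 4 points of extra tests. The steps to 45 and 60 days buy 0.2 and 0.1 points for 8 and 9 points of extra tests.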

Implementation: How the Window Works

The correlation map is not stored as a simple day-keyed log. We maintain a weighted directed graph where edges represent "test T failed when code path P was recently modified." The rolling window determines which edges are eligible to contribute weight. With a 30-day window, an edge from 31 days ago drops out entirely; edges inside the window contribute a weight that decays exponentially with age, so more recent modifications count more.

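# Simplified edge weight calculation
import math
from datetime import datetime, timezone

def edge_weight(correlation_event, window_days=30, now=None):
    # Age in whole days; timestamps are assumed to be timezone-aware UTC
    now = now or datetime.now(timezone.utc)
    age_days = (now - correlation_event.timestamp).days
    # Events outside the rolling window contribute nothing
    if age_days > window_days:
        return 0.0
    # Exponential decay within window; the time constant is 40% of the window
    decay = math.exp(-age_days / (window_days * 0.4))
    return correlation_event.base_weight * decay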

The exponential decay within the window is important. Without it, a correlation event from 29 days ago would count as much as one from yesterday, which made the model sluggish to react to refactors. With decay, recent history dominates, but a meaningful regression correlation from three weeks ago still has a non-zero vote.
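To make the shape of the decay concrete, here is the same formula as in the edge_weight snippet above, isolated into a standalone helper (the helper name is ours) and evaluated at a few ages.

```python
import math

def decay_factor(age_days, window_days=30):
    """Decay multiplier for an event of the given age, using the formula above."""
    if age_days > window_days:
        return 0.0
    return math.exp(-age_days / (window_days * 0.4))

for age in (0, 1, 7, 21, 29, 31):
    print(f"{age:>2} days old -> {decay_factor(age):.3f}")
```

With the default 30-day window, an event from yesterday keeps about 92% of its base weight, one from three weeks ago about 17%, one from 29 days ago about 9%, and one from 31 days ago contributes nothing.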

Storage and Latency Changes

Going from 7 to 30 days of history increases the correlation graph size by roughly 4×. For most repos this is fine — we are talking about a few hundred megabytes at most for a mid-sized monorepo. For very active repositories with broad test coverage, we saw graph sizes reach 2-3 GB in our internal testing.

We addressed this with two changes:

Sparse representation for low-weight edges. Edges with a current weight below 0.02 are pruned from the in-memory graph at computation time, even if they are within the 30-day window. These edges have essentially no effect on test selection scores but were consuming memory and slowing traversal.

Incremental update instead of full rebuild. Previously, when a build completed, we rebuilt the entire correlation graph from the full window. Now we compute the delta — new edges added, aged-out edges removed, weight adjustments for existing edges — and apply that diff. Full rebuild time on a large repo was 12-18 seconds; incremental update is typically under 1 second.
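The two changes can be sketched together as follows. The flat dictionary graph and the shape of the delta are assumptions for the example and much simpler than the production data structure; only the 0.02 threshold comes from the description above.

```python
PRUNE_THRESHOLD = 0.02  # pruning threshold quoted above

def apply_delta(graph, added, removed, reweighted):
    """Apply one build's delta to the graph in place instead of rebuilding it.

    `graph` maps (code_path, test_name) -> current weight. The three delta
    arguments (new edges, aged-out edge keys, adjusted weights) are a
    hypothetical shape chosen for this example.
    """
    for key in removed:
        graph.pop(key, None)          # edges that aged out of the window
    graph.update(reweighted)          # decayed weights for existing edges
    for key, weight in added.items():
        graph[key] = graph.get(key, 0.0) + weight
    return graph

def active_edges(graph, threshold=PRUNE_THRESHOLD):
    """Return only the edges heavy enough to take part in scoring."""
    return {key: w for key, w in graph.items() if w >= threshold}

graph = {
    ("a.py", "test_a"): 0.5,
    ("b.py", "test_b"): 0.03,
    ("c.py", "test_c"): 0.2,
}
apply_delta(
    graph,
    added={("d.py", "test_d"): 0.8},
    removed=[("c.py", "test_c")],
    reweighted={("b.py", "test_b"): 0.01},
)
live = active_edges(graph)
```

After the delta, the `b.py` edge is still stored but falls below the threshold, so it is left out of `live` and never traversed during scoring.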

Edge Cases and Known Limitations

A few scenarios where the 30-day window still does not behave well:

Major refactors that move files. If a module is renamed or restructured significantly, the historical correlation data maps to paths that no longer exist. We detect file renames through Git history and remap edges where possible, but non-trivial refactors (merge of two modules, extraction of a shared library) still cause a cold-start period of a few days while the model relearns.

Monorepos with very infrequent changes to certain services. If a service is touched once every 45 days on average, our 30-day window will frequently have little or no history for it. We fall back to full test execution for those services rather than underselecting. This is the conservative failure mode — you run more tests than necessary — which we consider preferable to missing a regression.

Teams that push directly to main. Our model is calibrated on PR-based workflows where each push is a logical unit of change. Push-to-main workflows with large, infrequent commits produce noisier correlation signals. The 30-day window actually helps here compared to 7 days, but the fundamental signal quality issue remains.
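The first two cases can be illustrated with a small sketch. The graph shape, the function names, and the `min_edges` cutoff are all hypothetical; in particular, the real criterion for "too little history" is not spelled out in this post.

```python
def remap_renamed_paths(graph, renames):
    """Move edges from old paths to new ones after a detected rename.

    `graph` maps (code_path, test_name) -> weight and `renames` maps
    old_path -> new_path; both shapes are assumptions for this example.
    """
    remapped = {}
    for (path, test_name), weight in graph.items():
        key = (renames.get(path, path), test_name)
        remapped[key] = remapped.get(key, 0.0) + weight
    return remapped

def select_tests(graph, changed_paths, all_tests, min_edges=2):
    """Fall back to the full suite when any changed path has too little history."""
    selected = set()
    for path in changed_paths:
        tests = {t for (p, t) in graph if p == path}
        if len(tests) < min_edges:
            return set(all_tests)  # conservative: run everything
        selected |= tests
    return selected

all_tests = {"test_util", "test_views", "test_routes", "test_billing"}
graph = {
    ("old/util.py", "test_util"): 0.6,
    ("api/views.py", "test_views"): 0.4,
    ("api/views.py", "test_routes"): 0.3,
}
graph = remap_renamed_paths(graph, {"old/util.py": "shared/util.py"})

well_known = select_tests(graph, ["api/views.py"], all_tests)
rarely_touched = select_tests(graph, ["billing/rare.py"], all_tests)
```

Here `well_known` contains only the two tests correlated with `api/views.py`, while `rarely_touched` is the full suite because the model has no history for that path.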

What Is Coming in v0.10

The next improvement we are working on is making the window configurable per repository. A fast-moving monorepo with daily broad changes might want 21 days; a slower-moving repo with deep, infrequent changes might benefit from 45 days. Right now the window is a global setting; v0.10 will let you tune it per-project through the SuperPlane config.

We are also working on smarter cold-start handling for the file-rename case. The current fallback to full execution is safe but wasteful. We have a prototype that uses AST-level module similarity to map old correlation edges to new paths when Git rename detection fails — early results look promising, though it adds complexity to the graph update path.

The v0.9 release is available now. If you hit unexpected behavior with the extended window — particularly if you are seeing a larger-than-expected test set for a small diff — check your repository's base path configuration in the SuperPlane dashboard. A misconfigured base path can cause the model to treat many unrelated files as correlated, which amplifies the window extension effect.

Written by

Yuki Tanaka
