Zero-Config Test Optimization: What It Takes

"Connect your repo and test selection starts working" is the kind of promise that gets feature teams excited and platform engineers immediately skeptical. The platform engineers are right to be skeptical. The question is not whether test selection works — it does — but what "zero-config" actually means in practice. What does the system know at install time? What does it have to infer? What breaks silently when the inference is wrong?

This post is the explanation we wish existed when we were building SuperPlane. It covers the three main mechanisms behind zero-config test optimization: dependency graph inference, test history bootstrap, and default policy selection.

What Zero-Config Does Not Mean

Zero-config does not mean zero information. It means zero manual configuration — no YAML files enumerating test dependencies, no mapping files saying "if you change auth/token.go, run these 23 tests." The system has to derive that information automatically.

It also does not mean instant effectiveness. There is a bootstrap period — typically 5-10 build runs — during which the system runs the full test suite to collect the history it needs. If you connect SuperPlane to a repository and immediately see test selection filtering aggressively on the first run, that is a bug, not a feature. Conservative behavior during cold start is intentional.

What zero-config actually delivers is: no ongoing maintenance burden. You do not need to update a test mapping file when you refactor a module. You do not need to reconfigure test groupings when you add a new service. The inference mechanisms track changes automatically. That is the real value proposition.

Dependency Graph Inference: Two Approaches in Practice

There are two broad strategies for inferring which tests are relevant to a code change: static analysis and historical correlation. We use both, for different reasons.

Static analysis examines import graphs, module boundaries, and dependency declarations (package.json, go.mod, requirements.txt, pom.xml) to build a map of what depends on what. If auth/token.go is imported by api/middleware.go, which is imported by 14 handlers, and those handlers have 90 associated tests, static analysis will flag all 90 as potentially relevant.
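A minimal sketch of that transitive walk, assuming the import graph and the test-to-module assignment are already available as plain dictionaries. The function name, the data shapes, and the tiny example graph are illustrative, not SuperPlane's internals.

```python
from collections import deque


def affected_tests(changed, imports, tests_by_module):
    """Return tests owned by `changed` or by any module that transitively imports it."""
    # Invert the import graph: module -> set of modules that import it.
    importers = {}
    for module, deps in imports.items():
        for dep in deps:
            importers.setdefault(dep, set()).add(module)

    # Breadth-first walk up the reverse edges, starting at the changed module.
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for parent in importers.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)

    selected = set()
    for module in seen:
        selected.update(tests_by_module.get(module, ()))
    return selected


# Hypothetical graph: a scaled-down version of the auth/token example.
imports = {
    "api/middleware": {"auth/token"},
    "handlers/login": {"api/middleware"},
    "handlers/health": set(),
}
tests_by_module = {
    "auth/token": {"TestTokenExpiry"},
    "handlers/login": {"TestLoginOK", "TestLoginBadPassword"},
    "handlers/health": {"TestHealth"},
}
print(sorted(affected_tests("auth/token", imports, tests_by_module)))
```

The health handler never imports the token module, so its test is left out. Everything reachable through the middleware is flagged, which is the over-approximation that static analysis trades for safety.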

Static analysis is precise but has a coverage problem. Many test frameworks use runtime test discovery. Test files are not always explicit about which production code they exercise. Integration tests and end-to-end tests often do not import the code they test at all — they call it over HTTP or a message bus. For these categories, static import analysis produces either zero matches or overly broad matches (everything imports everything through a shared utils layer).

Historical correlation asks a different question: in the past, when code path P was modified, which tests subsequently failed? This approach catches integration test dependencies that static analysis misses. It also self-corrects — if a correlation is spurious (test T failed when P changed, but only because of a race condition in the test harness), it weakens over time as P changes without T failing.
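One simple way to express this is a co-failure ratio: of the builds that touched a module, in what fraction did a given test fail? The sketch below assumes history is available as (changed modules, failed tests) pairs per build. That shape and the scoring formula are assumptions chosen for illustration; a production model would likely add time decay and flakiness filtering.

```python
from collections import defaultdict


def correlation_scores(history):
    """Score each (module, test) pair by how often the test failed when the module changed.

    history: iterable of (changed_modules, failed_tests) pairs, one per build.
    """
    changes = defaultdict(int)      # module -> number of builds that touched it
    co_failures = defaultdict(int)  # (module, test) -> builds where both happened
    for changed_modules, failed_tests in history:
        for module in changed_modules:
            changes[module] += 1
            for test in failed_tests:
                co_failures[(module, test)] += 1
    return {
        (module, test): count / changes[module]
        for (module, test), count in co_failures.items()
    }


# Hypothetical history: billing really breaks the e2e test, search did once by accident.
history = [
    ({"billing"}, {"test_invoice_e2e"}),
    ({"billing"}, {"test_invoice_e2e"}),
    ({"search"}, {"test_invoice_e2e"}),  # one-off, e.g. a race in the harness
    ({"search"}, set()),
    ({"search"}, set()),
    ({"search"}, set()),
]
scores = correlation_scores(history)
print(scores[("billing", "test_invoice_e2e")])  # 1.0
print(scores[("search", "test_invoice_e2e")])   # 0.25
```

The spurious search edge starts at 1.0 after its first build and drops with every later search change that passes, which is the self-correcting behavior described above.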

We combine both signals. Static analysis handles cold start and covers areas where historical data is sparse. Historical correlation takes over for the bulk of selection decisions once enough history has accumulated. When the two signals conflict — static analysis says "relevant," history says "never actually failed when this changed" — we apply a conservative heuristic: select the test anyway but downweight it in the priority queue, so it runs later in the batch if time allows.
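The conflict rule can be sketched as a priority ordering. The downweight value, the function name, and the per-test history score are hypothetical; the only behavior taken from the text is that a statically relevant test with no failure history is still selected but runs after history-backed tests.

```python
import heapq


def prioritize(tests, static_relevant, history_score, downweight=0.1):
    """Return selected tests ordered so that history-backed tests run first."""
    heap = []
    for test in tests:
        score = history_score.get(test, 0.0)
        if score == 0.0:
            if test not in static_relevant:
                continue  # neither signal selects this test
            # Conflict case: static analysis says relevant, history has never seen it fail.
            score = downweight
        # heapq is a min-heap, so negate the score to pop the highest first.
        heapq.heappush(heap, (-score, test))
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


order = prioritize(
    tests=["test_a", "test_b", "test_c", "test_d"],
    static_relevant={"test_a", "test_b"},
    history_score={"test_b": 0.8, "test_c": 0.4},
)
print(order)  # ['test_b', 'test_c', 'test_a']
```

Here test_a is kept because static analysis flags it, but it sits at the back of the queue. test_d is dropped because neither signal points at it.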

How the Dependency Graph Gets Built

The graph construction happens in two passes on repository connect.

Pass one is static. We clone the repository, identify the build system (Go modules, npm/yarn, Maven/Gradle, Python setuptools, Cargo), parse dependency declarations, and walk import statements. For Go we use AST parsing. For Python we resolve imports against the installed package set. For JavaScript/TypeScript we follow the module resolution algorithm to resolve each import. The result is a module-level dependency graph rather than a file-level one. File-level tracking is too granular: a one-line refactor that moves a function between files in the same module should not invalidate the entire test correlation map.
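To make "walk import statements" concrete, here is a small sketch that uses Python's standard ast module to pull module-level edges out of a source file. It collapses every import to its top-level package and skips relative imports, which is one way to get module-level rather than file-level granularity. It is a stand-in for illustration and does not resolve imports against an installed package set.

```python
import ast


def imported_modules(source):
    """Return the top-level package names imported by a Python source string."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 is a relative import, which stays inside the current package.
            if node.level == 0 and node.module:
                found.add(node.module.split(".")[0])
    return found


# Hypothetical file contents; the source is parsed, never executed.
sample = """
import os
import auth.token
from api.middleware import require_login
from . import helpers
"""
print(sorted(imported_modules(sample)))  # ['api', 'auth', 'os']
```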

Pass two is test discovery. We run the test framework's dry-run or list mode to enumerate test IDs without executing them. The commands are pytest --collect-only -q for pytest, go test -list '.*' ./... for Go, and jest --listTests for Jest (which lists test files rather than individual test names). We then map each test ID to its owning module using the path of the test file. This gives us the initial test-to-module assignment, which we refine with historical data over subsequent runs.

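# Simplified pass-two test discovery
import json
import subprocess


def run(args, cwd):
    # Thin wrapper: run a list-mode command and return its stdout as text.
    # check=True raises if the tool itself exits non-zero.
    result = subprocess.run(args, cwd=cwd, capture_output=True, text=True, check=True)
    return result.stdout


# parse_pytest_collection, parse_go_test_list, and discover_by_convention
# are defined elsewhere; each returns a list of test IDs.
def discover_tests(repo_root, build_system):
    if build_system == "pytest":
        output = run(["pytest", "--collect-only", "-q"], cwd=repo_root)
        return parse_pytest_collection(output)
    elif build_system == "go":
        output = run(["go", "test", "-list", ".*", "./..."], cwd=repo_root)
        return parse_go_test_list(output)
    elif build_system == "jest":
        # --listTests emits test file paths; --json makes the output a JSON array
        output = run(["jest", "--listTests", "--json"], cwd=repo_root)
        return json.loads(output)
    else:
        # Fall back to file glob heuristics
        return discover_by_convention(repo_root)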
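The path-based mapping from test ID to owning module could look like the sketch below. It assumes pytest-style node IDs (file path, then "::", then the test name) and a known list of module root directories. Picking the deepest matching root is an assumption about how nested modules should be resolved.

```python
from pathlib import PurePosixPath


def owning_module(test_id, module_roots):
    """Map a pytest-style node ID to the deepest module root that contains its file."""
    test_path = PurePosixPath(test_id.split("::", 1)[0])
    best = None
    for root in module_roots:
        root_path = PurePosixPath(root)
        if root_path in test_path.parents:
            # Prefer the most specific (deepest) root when several contain the file.
            if best is None or len(root_path.parts) > len(best.parts):
                best = root_path
    return str(best) if best is not None else None


# Hypothetical repository layout.
roots = ["services", "services/auth", "services/billing"]
print(owning_module("services/auth/tests/test_token.py::test_expiry", roots))
```

Tests that fall outside every known root come back as None. Under this sketch those are the tests with no clear code-path association.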

Test History Bootstrap: The First 10 Runs

During the bootstrap period, we run the full test suite on every build but collect detailed failure and timing data. Specifically, we record:

  • Which tests failed, and on which commit
  • What files changed in the diff that preceded the failure
  • Test execution time per test (for future parallelization optimization)
  • Whether the failure was non-deterministic (test passed on retry without code change)

After 5 builds, we have enough data to activate limited test selection, typically filtering out tests with zero historical co-failure with any recently changed module. After 10 builds, the correlation model has enough signal for the full selection algorithm to operate.
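The recorded fields and the 5-build and 10-build thresholds can be sketched as a record type plus a gate on the number of recorded builds. The class, field, and mode names are hypothetical. Only the recorded data and the two thresholds come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class BuildRecord:
    commit: str
    changed_files: list
    failed_tests: set = field(default_factory=set)
    flaky_tests: set = field(default_factory=set)  # passed on retry with no code change
    durations: dict = field(default_factory=dict)  # test ID -> seconds


def selection_mode(records, limited_after=5, full_after=10):
    """Decide how aggressive selection may be, based on how much history exists."""
    builds = len(records)
    if builds >= full_after:
        return "full_selection"
    if builds >= limited_after:
        return "limited_selection"
    return "run_everything"


# Hypothetical history of four builds: still in the bootstrap period.
records = [BuildRecord(commit=f"c{i}", changed_files=["auth/token.go"]) for i in range(4)]
print(selection_mode(records))  # run_everything
```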

The bootstrap period feels like a tax. It is also the most important thing we do. Jumping to aggressive selection without a correlation baseline produces false negatives — missed failures — which are the worst outcome and the hardest to debug. We would rather run full suites for 10 builds than miss a regression on build 3.

Default Policies: What We Choose When We Do Not Know

Zero-config does not mean the system has no opinions. It has opinions; they are just defaults rather than mandatory configurations. These defaults encode our view of what conservative looks like in a test selection system.

Coverage floor. By default, we select a minimum of 20% of the test suite on every run regardless of diff size. A one-line change to a comment does not justify running only the 3 tests that share a module with the changed file. The 20% floor provides a safety net against model errors, unknown dependencies, and infrastructure tests that do not have clear code-path associations.
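A sketch of how a floor like this can be enforced after selection. The text does not say which tests fill the gap, so the ranked_fallback argument (for example, all tests ordered by historical failure rate) is an assumption, as are the names.

```python
def apply_coverage_floor(selected, all_tests, ranked_fallback, floor_percent=20):
    """Pad `selected` from `ranked_fallback` until it covers floor_percent of the suite."""
    # Integer ceiling of floor_percent% of the suite, avoiding float rounding.
    minimum = (len(all_tests) * floor_percent + 99) // 100
    result = list(selected)
    chosen = set(result)
    for test in ranked_fallback:
        if len(result) >= minimum:
            break
        if test not in chosen:
            result.append(test)
            chosen.add(test)
    return result


# Hypothetical 20-test suite where the model selected a single test.
suite = [f"t{i}" for i in range(20)]
padded = apply_coverage_floor(selected=["t3"], all_tests=suite, ranked_fallback=suite)
print(padded)  # ['t3', 't0', 't1', 't2']
```

If the model already selected more than the floor, the selection is returned unchanged.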

Full run on infrastructure changes. Changes to CI configuration files, Terraform files, Kubernetes manifests, Dockerfiles, or dependency lockfiles trigger full test suite execution. These changes affect the environment the tests run in, not the code paths they test. The correlation model has no useful signal for them.

Full run on merge to main. Pre-merge CI can use aggressive selection. Merge-to-main (or equivalent trunk) always runs the full suite. This is the gate that matters for production safety, and the cost of a longer pipeline on the main branch is justified by the confidence it provides.
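The two full-run defaults above reduce to a small predicate. The pattern list below is illustrative only: the real set of infrastructure patterns is not given in this post, and the function names are hypothetical.

```python
import posixpath
from fnmatch import fnmatchcase

# Hypothetical patterns for "infrastructure" files; adjust to the repository.
INFRA_PATTERNS = [
    ".github/workflows/*",
    ".gitlab-ci.yml",
    "*.tf",
    "k8s/*",
    "Dockerfile",
    "package-lock.json",
    "yarn.lock",
    "go.sum",
    "poetry.lock",
]


def is_infra_file(path):
    """True if the full path or the bare file name matches an infrastructure pattern."""
    name = posixpath.basename(path)
    return any(fnmatchcase(path, p) or fnmatchcase(name, p) for p in INFRA_PATTERNS)


def requires_full_run(changed_files, merging_to_trunk):
    """Full suite on trunk merges, or when any changed file is infrastructure."""
    return merging_to_trunk or any(is_infra_file(f) for f in changed_files)


print(requires_full_run(["auth/token.go"], merging_to_trunk=False))            # False
print(requires_full_run(["services/api/Dockerfile"], merging_to_trunk=False))  # True
print(requires_full_run(["auth/token.go"], merging_to_trunk=True))             # True
```

Matching on the bare file name as well as the full path catches lockfiles and Dockerfiles that live in service subdirectories.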

Flaky test quarantine. Tests that fail non-deterministically more than 15% of the time over a 7-day window are flagged. Flagged tests are still executed, but their failures are reported separately and do not block the build by default. This is controversial. We are not saying flaky tests do not matter. We are saying that a flaky test blocking every engineer's pipeline is a worse outcome than a separate "flaky test" report that the responsible team can investigate asynchronously.
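A sketch of the quarantine rule, reading "15% of the time" as "15% of runs inside the window". That reading, the tuple shape of the run log, and the function name are all assumptions.

```python
from datetime import datetime, timedelta


def quarantined(runs, now, window=timedelta(days=7), threshold=0.15):
    """Return test IDs whose flaky-failure rate inside the window exceeds the threshold.

    runs: iterable of (test_id, finished_at, was_flaky) tuples, where was_flaky
    means the test failed and then passed on retry with no code change.
    """
    cutoff = now - window
    totals, flaky = {}, {}
    for test_id, finished_at, was_flaky in runs:
        if finished_at < cutoff:
            continue  # outside the rolling window
        totals[test_id] = totals.get(test_id, 0) + 1
        flaky[test_id] = flaky.get(test_id, 0) + int(was_flaky)
    return {t for t, n in totals.items() if flaky[t] / n > threshold}


# Hypothetical run log: test_a is flaky in 2 of 10 recent runs, test_b in 1 of 10,
# and test_c was flaky only in a run that is now older than the window.
now = datetime(2024, 1, 10)
recent = now - timedelta(days=1)
runs = (
    [("test_a", recent, i < 2) for i in range(10)]
    + [("test_b", recent, i < 1) for i in range(10)]
    + [("test_c", now - timedelta(days=30), True)]
)
print(quarantined(runs, now))  # {'test_a'}
```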

What Can Go Wrong

The main failure modes we have seen in practice:

Dynamic code loading. If your application uses importlib.import_module(), require() with runtime-computed paths, or plugin systems that load modules by name, static analysis will not see those dependencies. Historical correlation will catch them eventually, but there is a blind window at the start.
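One way to surface this blind spot during the bootstrap period is to scan for dynamic import call sites and report them, so that the affected modules can be treated conservatively. The sketch below is a hypothetical audit helper for Python sources, not a description of what SuperPlane does. It flags calls to import_module() or __import__() whose first argument is not a literal.

```python
import ast


def dynamic_import_sites(source):
    """Return line numbers of import_module()/__import__() calls with non-literal targets."""
    sites = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Handles both importlib.import_module(...) and a bare import_module(...).
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if name in {"import_module", "__import__"} and node.args:
            if not isinstance(node.args[0], ast.Constant):
                sites.append(node.lineno)
    return sorted(sites)


# Hypothetical plugin loader; the source is parsed, never executed.
sample = '''import importlib
plugin = importlib.import_module("plugins." + name)
fixed = importlib.import_module("json")
'''
print(dynamic_import_sites(sample))  # [2]
```

The call on line 3 uses a literal module name, which an import-graph builder could still resolve, so only line 2 is reported.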

Shared test infrastructure. If 60% of your tests import a shared test helper module, a change to that module will cause the model to select 60% of your test suite. This is technically correct — any of those tests could fail — but it undermines the efficiency gains. The fix is to structure test helpers so they do not import production code directly, or to configure the module as an infrastructure file that triggers full runs.

Large monorepos with sparse ownership. In a monorepo where different services are developed by different teams with very different commit cadences, the correlation model can develop strong spurious correlations between rarely touched services and intermittently failing tests. We added a minimum confidence threshold for correlation edges in v0.8 to address this, but it is worth monitoring in the first two weeks after connecting a large monorepo.
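To illustrate why a confidence threshold helps here: a service that changed once, with one coincidental failure, has a co-failure ratio of 1.0 from a single observation. The sketch below prunes edges that rest on too few observations or too weak a ratio. The specific thresholds and the edge representation are assumptions and may not match what shipped in v0.8.

```python
def prune_edges(edges, min_changes=5, min_score=0.2):
    """Keep only correlation edges backed by enough observations and a strong enough ratio.

    edges: {(module, test): (co_failures, module_changes)}
    """
    kept = {}
    for key, (co_failures, module_changes) in edges.items():
        if module_changes < min_changes:
            continue  # too few changes to trust the ratio, however high it looks
        score = co_failures / module_changes
        if score >= min_score:
            kept[key] = score
    return kept


# Hypothetical edges: a rarely touched service with one coincidental failure,
# a well-supported edge, and a weak edge with plenty of data.
edges = {
    ("legacy-billing", "test_checkout"): (1, 1),
    ("auth", "test_login"): (6, 10),
    ("search", "test_login"): (1, 20),
}
print(prune_edges(edges))  # {('auth', 'test_login'): 0.6}
```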

We are not saying zero-config test optimization is a fully solved problem. We are saying the defaults are calibrated to fail safe: when in doubt, run more tests, not fewer. The goal of zero-config is to remove the maintenance burden, not to guarantee perfect selection on day one.

Written by

Yuki Tanaka