Why YAML CI Configs Break at Scale

The first CI configuration file is a reasonable piece of work. A GitHub Actions workflow for a new service, maybe 80 lines: checkout, set up language runtime, install dependencies, run tests, build container, push to registry. Readable. Maintainable by any engineer on the team. Does exactly what it says.

By the time you have 15 services, a staging environment, a production environment, three different test suites (unit, integration, e2e), a security scan step, an image signing step, and a policy of deploying on merge to main but not on PRs, the workflow files run somewhere between 400 and 900 lines each. There's duplication between services. There are subtle differences between services that nobody can explain. A change to the shared Docker registry URL requires touching 15 files. Welcome to YAML at scale.

This isn't a complaint about YAML specifically — although YAML has its own issues. It's about the structural properties of the CI configuration problem that any static declarative format struggles with. Understanding why helps you make better choices about what to do next.

The Three Properties YAML Lacks

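Abstraction. YAML is a data serialization format. It has no concept of a function, a parameter, or a reusable procedure. Anchors and aliases can repeat a block within a single file, but they cannot parameterize it or share it across files. GitHub Actions has "composite actions" and "reusable workflows" as workarounds. These help but introduce their own complexity: composite actions bundle steps only, so they cannot define jobs or choose a runner, while reusable workflows are invoked at the job level rather than as a step, must declare their inputs, outputs, and secrets explicitly, and require a separate file per reusable unit.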

The abstraction problem becomes acute when you want to express: "run the same deployment logic in every service's workflow, but let each service override the Helm chart name and the environment-specific values." In a real programming language, this is a function call with parameters. In GitHub Actions YAML, it's either a reusable workflow with a workflow_call trigger (which has significant limitations) or copy-paste with sed-style substitution (which is not a solution).
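To make the "function call with parameters" point concrete, here is a minimal sketch in Python. The function name `deploy_job`, the chart layout under `./charts/`, and the default `image.tag` override are all hypothetical; the point is only that shared deployment logic lives in one place and each service passes its own chart name and values.

```python
from typing import Any, Optional


def deploy_job(chart: str, environment: str,
               values: Optional[dict[str, str]] = None) -> dict[str, Any]:
    """Build one deploy job as plain data. All names here are hypothetical."""
    # Shared default that every service gets unless it overrides it.
    overrides = {"image.tag": "${{ github.sha }}"}
    overrides.update(values or {})
    # Sort so the rendered command is stable between runs.
    set_flags = " ".join(
        f"--set {key}={value}" for key, value in sorted(overrides.items())
    )
    return {
        "runs-on": "ubuntu-latest",
        "environment": environment,
        "steps": [
            {"uses": "actions/checkout@v4"},
            {
                "name": f"Deploy {chart}",
                "run": f"helm upgrade --install {chart} ./charts/{chart} {set_flags}",
            },
        ],
    }


# Each service is one call; the deployment logic is written once.
billing = deploy_job("billing", "production", {"replicaCount": "3"})
search = deploy_job("search", "staging")
```

Changing the shared logic (for example, adding a flag to the Helm command) is then a one-line change to `deploy_job` rather than an edit to every workflow file.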

Type checking. YAML is implicitly typed: the parser decides the type of every unquoted scalar. A step that expects the string "true" and receives the boolean true may behave differently, and the coercion rules are famously unintuitive (under YAML 1.1, an unquoted no parses as the boolean false). More importantly, there's no static analysis at definition time: CI configuration errors are runtime errors discovered when the workflow runs, not at the time the configuration is written.

For a team shipping five times a day, a misconfigured workflow discovered at CI run time rather than at configuration write time costs a feedback delay of at least 5 to 15 minutes. In aggregate across a team, this is a meaningful source of lost time. It's also the kind of error that often gets attributed to "flaky CI" rather than "configuration is wrong and we didn't know," because the error message comes from the CI runner rather than a configuration validator.
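As an illustration of what "configuration write time" validation could look like, the sketch below checks a pipeline config before anything is sent to a runner. The `DeployConfig` fields and the allowed environments are assumptions made up for this example, not a real schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployConfig:
    """Hypothetical shape of one service's deploy settings."""
    service: str
    environment: str
    timeout_minutes: int
    run_e2e: bool


ALLOWED_ENVIRONMENTS = {"staging", "production"}  # assumed for illustration
EXPECTED_TYPES = {
    "service": str,
    "environment": str,
    "timeout_minutes": int,
    "run_e2e": bool,
}


def validate(raw: dict) -> DeployConfig:
    """Reject wrong types and unknown fields before the config reaches CI."""
    errors = []
    for key, kind in EXPECTED_TYPES.items():
        if key not in raw:
            errors.append(f"missing field: {key}")
        # Exact type match: the string "true" is not accepted as a bool,
        # and a bool is not accepted where an int is expected.
        elif type(raw[key]) is not kind:
            errors.append(
                f"{key}: expected {kind.__name__}, got {type(raw[key]).__name__}"
            )
    for key in sorted(set(raw) - set(EXPECTED_TYPES)):
        errors.append(f"unknown field: {key}")
    env = raw.get("environment")
    if isinstance(env, str) and env not in ALLOWED_ENVIRONMENTS:
        errors.append(f"environment: unsupported value {env!r}")
    if errors:
        raise ValueError("; ".join(errors))
    return DeployConfig(**raw)


# The string "true" where a boolean is expected fails here, not in the runner.
try:
    validate({"service": "billing", "environment": "production",
              "timeout_minutes": 20, "run_e2e": "true"})
except ValueError as err:
    message = str(err)
```

The error names the offending field, which makes it harder to mistake for a flaky runner.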

Testability. You cannot unit test a GitHub Actions workflow. You can lint it with actionlint, which catches syntax errors and some schema violations. But you cannot write a test that says "given this set of inputs, verify that the deploy step receives the correct environment variables." The only way to verify the behavior of a YAML workflow is to run it, which requires a real or simulated CI environment.

This means that workflow changes have a high confidence cost: you have to run the full CI pipeline to know if your change is correct. For a workflow that takes 20 minutes, that's a 20-minute feedback loop on configuration changes. Teams working around this often create test branches with artificial change sets just to trigger the CI and verify the config — this is not a developer experience you would design intentionally.
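By contrast, once the deploy step is produced by ordinary code, the test described above becomes a plain assertion that runs in milliseconds. This is a sketch under assumptions: `deploy_step`, the `./scripts/deploy.sh` path, the registry host, and the `DEPLOY_KEY_<ENV>` naming rule are all hypothetical.

```python
def deploy_step(service: str, environment: str, registry: str) -> dict:
    """Build the deploy step for one service. All names are hypothetical."""
    secret_name = f"DEPLOY_KEY_{environment.upper()}"
    return {
        "name": f"Deploy {service}",
        "run": "./scripts/deploy.sh",
        "env": {
            "SERVICE": service,
            "DEPLOY_ENV": environment,
            # GitHub Actions expressions are kept as literal text here;
            # the CI platform expands them when the workflow runs.
            "IMAGE": f"{registry}/{service}:" + "${{ github.sha }}",
            "DEPLOY_KEY": "${{ secrets." + secret_name + " }}",
        },
    }


def test_production_deploy_env() -> None:
    """Given these inputs, the deploy step receives the expected variables."""
    env = deploy_step("billing", "production", "registry.example.com")["env"]
    assert env["DEPLOY_ENV"] == "production"
    assert env["IMAGE"] == "registry.example.com/billing:${{ github.sha }}"
    assert env["DEPLOY_KEY"] == "${{ secrets.DEPLOY_KEY_PRODUCTION }}"
```

A test runner such as pytest would collect `test_production_deploy_env` without needing a CI environment. It verifies what the step is configured to do, not that the deployment itself succeeds.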

Where Teams Actually Hit the Wall

The breaking point is usually not a single event — it's a slow accumulation that crosses a threshold when something goes wrong that would have been easy to prevent with better tooling.

Common patterns:

The copy-paste divergence incident. Service A's workflow has a security scan step that was added six months ago. Service B's workflow was copied from Service A before the scan step was added, and the person who copied it didn't notice. Service B has been shipping without security scanning for six months. No one noticed because there was no centralized view of which workflows have which steps. The discovery happens during a compliance review, not during development.

The secrets rotation incident. A deployment key needs to be rotated. It's referenced by name in 15 workflow files as DEPLOY_KEY_PRODUCTION. But three of those files were written by an engineer who called it PROD_DEPLOY_KEY because they thought that matched an older naming convention. When the secret is rotated in GitHub Secrets, three services suddenly can't deploy — and the failure manifests as an obscure authentication error that takes 30 minutes to connect back to the secret name mismatch.

The configuration drift incident. The platform team wants to enforce that all production deployments go through a specific approval gate. They add the approval step to the "canonical" workflow template. But eight of the 15 services were set up before the canonical template existed, and nobody updated them. The approval gate is active for seven services and absent for eight. The violation is invisible until an auditor asks "how do you ensure all production deployments are approved?"

None of these are catastrophic. All of them are preventable with different tooling. All of them happen regularly in organizations that have scaled beyond the YAML-per-service model without changing how they manage CI configuration.
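As one example of what "different tooling" might mean, the sketch below audits workflow text for the gaps in the incidents above: missing required steps and secret names nobody recognizes. It assumes each step carries an `id` and that required steps use the hypothetical ids `security-scan` and `approval-gate`; a real audit would parse the YAML properly rather than use regular expressions.

```python
import re

REQUIRED_STEPS = {"security-scan", "approval-gate"}       # hypothetical ids
KNOWN_SECRETS = {"DEPLOY_KEY_PRODUCTION", "REGISTRY_TOKEN"}  # hypothetical names

# Matches lines such as "  - id: deploy" or "    id: deploy".
STEP_ID = re.compile(r"^[ \t]*(?:-[ \t]+)?id:[ \t]*([\w-]+)[ \t]*$", re.MULTILINE)
# Matches GitHub Actions secret references such as ${{ secrets.NAME }}.
SECRET_REF = re.compile(r"\$\{\{\s*secrets\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def audit(workflows: dict[str, str]) -> dict[str, dict[str, list[str]]]:
    """Map each non-compliant service to its missing steps and unknown secrets."""
    report = {}
    for service, text in sorted(workflows.items()):
        missing = sorted(REQUIRED_STEPS - set(STEP_ID.findall(text)))
        unknown = sorted(set(SECRET_REF.findall(text)) - KNOWN_SECRETS)
        if missing or unknown:
            report[service] = {"missing_steps": missing, "unknown_secrets": unknown}
    return report


workflows = {
    "service-a": """
steps:
  - id: security-scan
    run: ./scan.sh
  - id: approval-gate
    run: ./gate.sh
  - id: deploy
    run: ./deploy.sh
    env:
      DEPLOY_KEY: ${{ secrets.DEPLOY_KEY_PRODUCTION }}
""",
    "service-b": """
steps:
  - id: deploy
    run: ./deploy.sh
    env:
      DEPLOY_KEY: ${{ secrets.PROD_DEPLOY_KEY }}
""",
}

report = audit(workflows)
```

Here `report` flags only `service-b`, listing both missing steps and the `PROD_DEPLOY_KEY` reference. Run on a schedule, a check like this surfaces the gap during development instead of during a compliance review.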

The Organizational Moment of Recognition

There's a specific moment in the lifecycle of a growing engineering organization where someone does the math. Platform engineering team, maybe three or four people, responsible for CI/CD infrastructure for 20 product services. They calculate: how much engineer time is spent maintaining CI configuration across all services?

The number they usually arrive at, when they add up the small tasks (updating a step version, adding a new required policy step, debugging a misconfigured timeout, investigating a "works on my workflow but not on yours" situation), is 15-25% of the platform team's capacity. For a four-person platform team, that's 0.6 to 1.0 full-time engineers: somewhere between a bit more than a half-time person and one full-time person whose job is managing YAML.

That's the moment teams start looking for alternatives. Not because YAML is unworkable in isolation; it works fine for any single workflow. It's because the maintenance overhead grows at least linearly with the number of services (every shared change has to be repeated once per workflow file), and there's no mechanism within the YAML model to stop it from growing.

The Alternatives and Their Tradeoffs

Three broad approaches have emerged:

Generated YAML. Use a higher-level configuration tool (Jsonnet, Dhall, CUE, or a custom Python/TypeScript generator) that produces YAML as output. You write configuration once in the higher-level language and generate the YAML for each service. This gives you abstraction, testability of the generator, and a single source of truth.

The tradeoff: generated YAML is less readable when you need to debug CI failures — you have to understand both the generator and the generated output. Also, if the generated YAML is committed to the repo, it can become stale if someone edits it directly rather than through the generator. Organizational discipline required.
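A minimal sketch of the generator approach, including a guard against the staleness problem, might look like the following. To stay within the standard library it serializes to JSON, relying on YAML 1.2 being designed as a superset of JSON; a real generator would more likely use a YAML library, and whether your CI platform accepts JSON-style workflow files is an assumption to verify. The registry URL, service names, and function names are hypothetical.

```python
import json

REGISTRY = "registry.example.com"  # hypothetical; the single place to change it
HEADER = "# GENERATED FILE. Edit the generator, not this file.\n"


def render_workflow(service: str, test_command: str) -> str:
    """Render one service's workflow from shared logic plus two parameters."""
    workflow = {
        "name": f"ci-{service}",
        "on": {"push": {"branches": ["main"]}, "pull_request": {}},
        "jobs": {
            "build": {
                "runs-on": "ubuntu-latest",
                "steps": [
                    {"uses": "actions/checkout@v4"},
                    {"name": "Test", "run": test_command},
                    {
                        "name": "Build image",
                        "run": f"docker build -t {REGISTRY}/{service}:"
                               + "${{ github.sha }} .",
                    },
                ],
            }
        },
    }
    return HEADER + json.dumps(workflow, indent=2) + "\n"


def stale_services(committed: dict[str, str], services: dict[str, str]) -> list[str]:
    """Return services whose committed file differs from the generator output."""
    return sorted(
        name for name, test_command in services.items()
        if committed.get(name) != render_workflow(name, test_command)
    )


services = {"billing": "pytest", "search": "go test ./..."}
committed = {name: render_workflow(name, cmd) for name, cmd in services.items()}
# Simulate someone editing a generated file by hand.
committed["search"] = committed["search"].replace("go test", "go vet")
drifted = stale_services(committed, services)
```

Running `stale_services` as a CI check turns "organizational discipline" into an automated failure: a hand-edited or missing generated file shows up in the returned list.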

API-driven CI. Move CI configuration out of YAML files and into a service that provides an API. Your workflow is defined programmatically, not declaratively. This is what some internal platform teams build on top of GitHub Actions' REST API — they have a service that creates/updates workflows via API rather than via committed YAML files.

The tradeoff: high implementation cost, strong dependency on the CI platform's API stability, and a non-obvious audit trail (changes to CI behavior don't show up in git history).

Abstraction layer with policy enforcement. Build or adopt a system that lets you define CI configuration at a higher abstraction level — describing what you want to happen, not how to implement it in YAML — and handles the YAML generation, policy enforcement, and cross-service consistency underneath. Product teams interact with the abstraction layer; the platform team owns the underlying implementation.

This is roughly what SuperPlane is: instead of writing YAML workflows, you describe your pipeline's requirements (test commands, deployment targets, policy constraints, canary configuration), and the system figures out the execution. The YAML is an implementation detail you don't manage directly.

We're not saying YAML is inherently bad or should be eliminated — GitHub Actions YAML is a reasonable representation for simple, stable pipelines. We're saying it's the wrong primitive for operating CI configuration at organizational scale, because it provides no mechanism for abstraction, consistency enforcement, or evolution without per-service manual work. Teams that haven't hit the wall yet will hit it. The question is whether to wait for the incident that surfaces the gap, or address the structural problem before it accumulates.

What Good Looks Like

The exit from YAML-at-scale isn't a specific technology — it's a model where CI configuration is managed as software rather than as configuration files. That means: abstraction (DRY, parameterized), version control with meaningful history, type checking or schema validation before run time, policy enforcement that's central rather than per-file, and the ability to evolve all services' CI behavior from a single change point.
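The "single change point" idea can be sketched as a policy function that every service's pipeline passes through before it is rendered. The step ids, script paths, and the rule "required steps go immediately before the deploy step" are assumptions for illustration, not a description of any particular product.

```python
from copy import deepcopy

# Hypothetical central policy: steps every pipeline must run before deploying.
REQUIRED_BEFORE_DEPLOY = [
    {"id": "security-scan", "run": "./ci/scan.sh"},
    {"id": "approval-gate", "run": "./ci/approve.sh"},
]


def apply_policy(pipeline: dict) -> dict:
    """Return a copy of the pipeline with any missing required steps inserted."""
    result = deepcopy(pipeline)  # never mutate the service's own definition
    steps = result["steps"]
    present = {step.get("id") for step in steps}
    missing = [deepcopy(s) for s in REQUIRED_BEFORE_DEPLOY if s["id"] not in present]
    # Insert before the deploy step, or at the end if there is none.
    deploy_index = next(
        (i for i, step in enumerate(steps) if step.get("id") == "deploy"),
        len(steps),
    )
    steps[deploy_index:deploy_index] = missing
    return result


search = {
    "service": "search",
    "steps": [
        {"id": "test", "run": "go test ./..."},
        {"id": "deploy", "run": "./ci/deploy.sh"},
    ],
}
enforced = apply_policy(search)
step_ids = [step["id"] for step in enforced["steps"]]
```

After the call, `step_ids` is `["test", "security-scan", "approval-gate", "deploy"]`. Adding a new required step means editing `REQUIRED_BEFORE_DEPLOY` once, and the function is idempotent, so re-running it on an already compliant pipeline changes nothing.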

Teams that have made this transition report the same outcome: the platform team stops spending 20% of their time on CI configuration maintenance, and starts spending it on capabilities that improve developer productivity. That's the trade you're trying to make. YAML-at-scale just makes it harder to execute.

Written by Darko Fabijan