Policy-as-Code for CI/CD: A Practitioner's Guide

Most CI/CD policy lives in three places: the heads of two or three platform engineers, a Confluence page that was last updated 14 months ago, and scattered shell conditionals across a dozen YAML files that nobody wants to touch. When someone asks "why did this deploy get blocked?", the answer is usually "good question, let me look at the pipeline YAML."

Policy-as-code is the practice of expressing CI/CD rules (what can deploy when, what tests are required, what approvals gate production) in machine-readable, version-controlled files that are separate from the pipeline mechanics that execute them. That separation makes the rules explicit, reviewable, testable, and auditable. A guide explaining how to actually do this did not exist in one place, so here it is.

What Policy-as-Code Is Not

Policy-as-code is not the same as infrastructure-as-code, though the concepts are related. Infrastructure-as-code (Terraform, Pulumi, CloudFormation) manages the resources your services run on. Policy-as-code manages the rules governing how code gets from a developer's branch to those resources.

It is also not the same as pipeline-as-code. Pipeline-as-code (Jenkinsfile, GitHub Actions YAML, Buildkite pipeline definitions) describes the steps that execute in a build: run tests, build image, deploy. Policy-as-code describes the conditions under which those steps are allowed to proceed: only deploy to production during business hours, require two approvals for changes to the payment service, block any deploy if the canary error rate exceeds a threshold.

The distinction matters because policies change for different reasons than pipelines. You update a pipeline when your build process changes. You update a policy when your governance requirements change — a new regulation, a post-incident change to deploy gate criteria, a new service classification. Mixing the two in the same file makes both harder to change and audit.

The Policy Model: What You Are Expressing

CI/CD policies operate on three main concerns:

Pre-deploy gates. Conditions that must be satisfied before a deploy proceeds: test coverage thresholds, required approval from specific CODEOWNERS groups, no open high-severity security scan findings, no flagged dependency vulnerabilities.

Deploy windows and restrictions. When deploys are allowed and to which environments. No production deploys on Fridays after 15:00. No deploys to production within 30 minutes of a previous deploy to the same service. Require a change management ticket number for production deploys during freeze windows.

Rollout behavior rules. How a deploy proceeds once it starts. Canary percentage schedule. Health check criteria for advancing stages. Automatic rollback conditions. Blast radius limits (only one service in a given dependency cluster at a time).

A policy-as-code system should be able to express all three categories in a consistent, human-readable format, evaluate them against runtime context (current time, test results, deployment state), and produce a structured decision: allow, block, require-approval, or auto-rollback.
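
As a rough illustration of that structured decision, the sketch below models the four outcomes and folds several rule results into one answer. The names (Decision, RuleResult, PolicyDecision, combine), the "most restrictive outcome wins" ordering, and the choice to allow when no rules apply are all assumptions made for this example, not the behavior of any particular engine.

```python
from dataclasses import dataclass
from enum import IntEnum


class Decision(IntEnum):
    """The four outcomes, ordered least to most restrictive (assumed ordering)."""
    ALLOW = 0
    REQUIRE_APPROVAL = 1
    BLOCK = 2
    AUTO_ROLLBACK = 3


@dataclass(frozen=True)
class RuleResult:
    rule_id: str
    decision: Decision
    message: str = ""


@dataclass(frozen=True)
class PolicyDecision:
    decision: Decision
    reasons: tuple[str, ...]


def combine(results: list[RuleResult]) -> PolicyDecision:
    """Most restrictive rule wins; only messages from that level are reported."""
    if not results:
        # Assumption: an empty rule set allows. A default-deny engine would differ.
        return PolicyDecision(Decision.ALLOW, ())
    worst = max(r.decision for r in results)
    if worst == Decision.ALLOW:
        return PolicyDecision(worst, ())
    reasons = tuple(
        f"{r.rule_id}: {r.message}" for r in results if r.decision == worst
    )
    return PolicyDecision(worst, reasons)
```

Returning the reasons alongside the decision is what lets the pipeline answer "why did this deploy get blocked?" without anyone reading YAML.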

Tools: OPA, Kyverno, and Custom DSLs

Open Policy Agent (OPA) is a general-purpose policy engine. It uses Rego, a purpose-built declarative query language for expressing policy. OPA works in any context where you can pass it a JSON input document, which makes it suitable for CI/CD policy evaluation. The learning curve on Rego is steep because it is a different mental model from procedural code, but for complex, cross-cutting policies with multiple inputs, it pays off.

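# OPA policy: block production deploy on Friday after 15:00 Pacific time
package superplane.deploy

# Opt in to Rego v1 syntax (if / contains). It is the default from OPA 1.0,
# and this import keeps the file valid on recent 0.x releases as well.
import rego.v1

default allow_production_deploy := false

allow_production_deploy if {
    not is_friday_afternoon
    not service_in_freeze_window
    input.tests.all_passing == true
}

# True when the service carries the "freeze-window" tag
service_in_freeze_window if "freeze-window" in input.service.tags

is_friday_afternoon if {
    # day_of_week: 5 = Friday in the input context
    input.context.day_of_week == 5
    # hour_pst: local hour (0-23) in Pacific time, supplied by the caller
    input.context.hour_pst >= 15
}

# Denial reason for engineers to see
deny_reason contains msg if {
    is_friday_afternoon
    msg := "Production deploys are blocked on Friday afternoons. Schedule for Monday or use an override with incident ticket."
}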
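
The policy above only sees what the caller puts in the input document, so the day and hour have to be computed before evaluation. A minimal sketch of that step follows; build_opa_input and FIXED_PST are hypothetical names for this example, and only the field names (context.day_of_week, context.hour_pst, service.tags, tests.all_passing) come from the policy itself.

```python
from datetime import datetime, timedelta, timezone, tzinfo

# Fixed UTC-8 offset, used only to keep examples deterministic.
# It ignores daylight saving time; see the note below the block.
FIXED_PST = timezone(timedelta(hours=-8))


def build_opa_input(
    now: datetime, local_tz: tzinfo, tags: list[str], all_passing: bool
) -> dict:
    """Build the input document read by the Friday-afternoon policy.

    isoweekday() numbers Monday as 1 and Friday as 5, which matches the
    day_of_week convention stated in the policy comment.
    """
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    local = now.astimezone(local_tz)
    return {
        "context": {"day_of_week": local.isoweekday(), "hour_pst": local.hour},
        "service": {"tags": list(tags)},
        "tests": {"all_passing": all_passing},
    }
```

In real use, passing zoneinfo.ZoneInfo("America/Los_Angeles") from the standard library as local_tz would handle daylight saving time, so that hour_pst means "hour in Pacific time" year-round rather than a fixed UTC-8 offset.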

Kyverno is a Kubernetes-native policy engine. If your deployment pipeline ends in Kubernetes and you want to enforce policies at the cluster admission level, Kyverno is purpose-built for this. Its limitation is that it is built around Kubernetes resources, so it is a poor fit for policies about the CI phase that precedes deployment.

Custom DSLs are what most teams end up with when they do not choose a general engine: a YAML schema specific to your deploy system, sometimes expressed in JSON Schema with custom extensions. These are pragmatic when the policy surface is limited and stable, but they accumulate technical debt as policy complexity grows, because you eventually need to implement a query engine yourself.

SuperPlane's policy format is a structured YAML DSL over a small expression language — simpler than Rego, sufficient for the deploy and test-gate policies that most teams need. For organizations that need cross-service policy evaluation or highly conditional rules, we expose a hook to delegate policy decisions to an OPA server.
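
The delegation hook itself is not described here, so the following sketch covers only the OPA side: posting an input document to OPA's Data API and reading back a boolean. The function names, the two-second timeout, and the fail-closed behavior are assumptions for illustration. The URL shape (POST /v1/data/<rule path> with an {"input": ...} body) is OPA's documented REST interface.

```python
import json
import urllib.request


def build_opa_request(
    base_url: str, rule_path: str, input_doc: dict
) -> urllib.request.Request:
    """Build a POST to OPA's Data API for one rule."""
    url = f"{base_url.rstrip('/')}/v1/data/{rule_path.strip('/')}"
    body = json.dumps({"input": input_doc}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_allow(response_body: bytes) -> bool:
    """OPA omits "result" when the rule is undefined; treat that as deny."""
    payload = json.loads(response_body)
    return isinstance(payload, dict) and payload.get("result") is True


def query_opa(
    base_url: str, rule_path: str, input_doc: dict, timeout_s: float = 2.0
) -> bool:
    """Fail closed: a transport error or malformed reply blocks the deploy."""
    request = build_opa_request(base_url, rule_path, input_doc)
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return parse_allow(response.read())
    except (OSError, ValueError):
        return False
```

For the policy in the previous section, rule_path would be "superplane/deploy/allow_production_deploy". Whether an unreachable policy server should block or allow a deploy is itself a policy decision worth making explicitly.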

A Practical Policy File Structure

A minimal policy-as-code implementation for a single service looks like this:

# .superplane/policy.yml

version: "1"
service: payment-api

pre_deploy_gates:
  - id: test_coverage
    check: test.coverage_pct >= 80
    message: "Coverage must be >= 80% before production deploy"
    environments: [production, staging]

  - id: security_scan
    check: security.critical_findings == 0
    message: "No critical security findings permitted"
    environments: [production]

  - id: codeowner_approval
    check: approvals.from_codeowners >= 1
    message: "At least one CODEOWNERS approval required"
    environments: [production]
    path_filter: "src/billing/**"

deploy_windows:
  production:
    - schedule: "Mon-Thu 09:00-17:00"
      timezone: "America/Los_Angeles"
    - schedule: "Fri 09:00-13:00"
      timezone: "America/Los_Angeles"
  staging:
    - schedule: "* 00:00-23:59"   # always open

rollout:
  strategy: canary
  stages:
    - traffic_pct: 5
      hold_minutes: 15
      success_criteria:
        error_rate_pct: "< 0.5"
        p99_latency_ms: "< 1200"
    - traffic_pct: 25
      hold_minutes: 20
      success_criteria:
        error_rate_pct: "< 0.5"
        p99_latency_ms: "< 1200"
    - traffic_pct: 100
  auto_rollback:
    on_criteria_failure: true
    on_anomaly_score: "> 0.8"

This file is checked into the repository alongside the application code. Every change to it goes through code review. Engineers can see exactly what the deploy rules are by looking at a single file, not by reverse-engineering pipeline YAML. Post-incident, you can look at the git history of this file and understand how the rules have changed over time.
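
To make the pre_deploy_gates section concrete, here is a small sketch of how checks such as test.coverage_pct >= 80 could be evaluated against runtime context. It assumes the file has already been parsed into Python dictionaries and supports only the "dotted.path OPERATOR number" form seen in the example. The actual expression language is not specified in this guide, the helper names are hypothetical, and path_filter is ignored for brevity.

```python
import operator
import re

_OPS = {
    ">=": operator.ge, "<=": operator.le, "==": operator.eq,
    "!=": operator.ne, ">": operator.gt, "<": operator.lt,
}
# Two-character operators are listed first so ">=" is not read as ">".
_CHECK = re.compile(
    r"^\s*([A-Za-z_][\w.]*)\s*(>=|<=|==|!=|>|<)\s*(-?\d+(?:\.\d+)?)\s*$"
)


def lookup(context: dict, dotted: str):
    """Resolve 'test.coverage_pct' against nested dicts."""
    value = context
    for part in dotted.split("."):
        value = value[part]
    return value


def check_passes(check: str, context: dict) -> bool:
    """Evaluate one check; missing or non-numeric data fails the gate."""
    match = _CHECK.match(check)
    if match is None:
        raise ValueError(f"unsupported check expression: {check!r}")
    path, op, literal = match.groups()
    try:
        return bool(_OPS[op](lookup(context, path), float(literal)))
    except (KeyError, TypeError):
        return False


def failed_gates(gates: list[dict], environment: str, context: dict) -> list[str]:
    """Return messages for gates that apply to the environment and fail."""
    return [
        gate["message"]
        for gate in gates
        if environment in gate.get("environments", [])
        and not check_passes(gate["check"], context)
    ]
```

Treating missing data as a failed gate is a deliberate choice in this sketch: a deploy with no coverage report should not pass a coverage threshold by accident.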

Testing Policies Before They Bite You

Policy-as-code without tests is just configuration you hope is correct. The policy engine should be testable with synthetic inputs so you can verify the policy behaves correctly before it runs against real deploys.

OPA has a built-in test framework, run with the opa test command. For custom DSLs, you need to write test cases separately. Either way, the pattern is the same: construct JSON inputs representing the pipeline states you want to test, run the policy engine against them, and assert the expected allow/deny/require-approval output.

# OPA test cases for the Friday policy above (run with: opa test .)
# Same package as the policy, so rules can be referenced by name.
package superplane.deploy

import rego.v1

test_allow_monday_morning if {
    allow_production_deploy with input as {
        "context": {"day_of_week": 1, "hour_pst": 10},
        "service": {"tags": []},
        "tests": {"all_passing": true}
    }
}

test_deny_friday_afternoon if {
    not allow_production_deploy with input as {
        "context": {"day_of_week": 5, "hour_pst": 16},
        "service": {"tags": []},
        "tests": {"all_passing": true}
    }
}

test_deny_failing_tests if {
    not allow_production_deploy with input as {
        "context": {"day_of_week": 2, "hour_pst": 11},
        "service": {"tags": []},
        "tests": {"all_passing": false}
    }
}

These tests run in CI. A policy change that breaks an existing test case is caught before it reaches production. A new policy check that was added without tests is visible in code review.
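
For a custom DSL the same pattern can be written as a table of synthetic inputs and expected outcomes. The sketch below uses Python's unittest. Here allow_production_deploy is a stand-in that mirrors the Friday rule purely for illustration; in practice it would be replaced by a call into whatever engine evaluates your policy files. make_input and CASES are hypothetical names.

```python
import unittest


def allow_production_deploy(doc: dict) -> bool:
    """Stand-in for the engine under test; mirrors the Friday rule only."""
    context = doc.get("context", {})
    friday_afternoon = (
        context.get("day_of_week") == 5 and context.get("hour_pst", 0) >= 15
    )
    frozen = "freeze-window" in doc.get("service", {}).get("tags", [])
    tests_pass = doc.get("tests", {}).get("all_passing") is True
    return not friday_afternoon and not frozen and tests_pass


def make_input(day: int, hour: int, tags=(), passing: bool = True) -> dict:
    """Build one synthetic pipeline state."""
    return {
        "context": {"day_of_week": day, "hour_pst": hour},
        "service": {"tags": list(tags)},
        "tests": {"all_passing": passing},
    }


# (name, input document, expected decision); boundary hours are included
# because off-by-one errors in time rules are easy to miss.
CASES = [
    ("monday morning", make_input(1, 10), True),
    ("friday hour 14", make_input(5, 14), True),
    ("friday hour 15", make_input(5, 15), False),
    ("failing tests", make_input(2, 11, passing=False), False),
    ("freeze-window tag", make_input(2, 11, tags=["freeze-window"]), False),
]


class PolicyTableTest(unittest.TestCase):
    def test_cases(self):
        for name, doc, expected in CASES:
            with self.subTest(name):
                self.assertEqual(allow_production_deploy(doc), expected)
```

Saved as a test module, this runs under python -m unittest in the same CI job that validates the policy files, and adding a policy rule means adding a row to the table.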

Organizational Realities: Who Owns Policies

Policy-as-code creates a question that is more organizational than technical: who can change the policies? There are two reasonable models.

Platform-owned global policies + team-owned service policies. The platform team maintains a set of organization-wide policy templates covering the non-negotiables (security scan required, no deploys on Fridays for PCI-scoped services, rollback thresholds). Individual service teams customize within that template for their service's specific requirements. Changes to global policies require platform team review; changes to service-level policies require the service team's own CODEOWNERS approval.

Policy-as-part-of-service-repo. Each service team owns their policy file entirely. The platform team enforces that certain policy elements must exist (a security scan gate, at minimum) via a policy linting step in CI. Teams can tune everything else. This gives teams more flexibility but requires trusting teams to maintain sensible policies — which works well when teams have strong ownership culture and less well when they do not.
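
The linting step in the second model can be quite small. The sketch below assumes the policy file has already been parsed into a dictionary and that the organization-wide minimum is a gate with id security_scan that applies to production. Both the rule and the names (REQUIRED_GATE_IDS, lint_policy) are assumptions for this example.

```python
# Assumed organization-wide minimum; a real platform team would own this list.
REQUIRED_GATE_IDS = frozenset({"security_scan"})


def lint_policy(policy: dict, required_gate_ids=REQUIRED_GATE_IDS) -> list[str]:
    """Return a list of lint errors; an empty list means the policy passes."""
    errors = []
    gates = policy.get("pre_deploy_gates") or []
    gates_by_id = {gate.get("id"): gate for gate in gates}
    for gate_id in sorted(required_gate_ids):
        gate = gates_by_id.get(gate_id)
        if gate is None:
            errors.append(f"missing required gate: {gate_id}")
        elif "production" not in gate.get("environments", []):
            errors.append(f"required gate {gate_id} must apply to production")
    return errors
```

A CI job would fail the build when the returned list is non-empty, which keeps the non-negotiables enforced while leaving thresholds and windows to the service team.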

We are not saying one model is universally better. We are saying that deciding this explicitly, before you build the tooling, saves a lot of uncomfortable conversations later about whether the authentication service team is allowed to disable the Friday deploy window for their service.

Migration: From Shell Scripts to Policy Files

Almost every team starting this migration has conditional logic scattered across their existing pipeline YAML. The migration path that has worked best in practice:

First, audit your existing pipeline files for conditional logic that implements policy. Look for patterns like if [[ "$BRANCH" == "main" ]] && [[ "$ENV" == "production" ]], manual approval steps, environment checks, and time-based gates. Document each one.

Second, translate the audit into policy file format, even before you have a policy engine to evaluate it. Just having the policies written down in one place is valuable — it is a forcing function for making the implicit explicit.

Third, implement the policy engine in evaluation-only mode first. Run the engine against your pipeline decisions and log the results, but do not block on them yet. Spend two weeks validating that the policy engine would have made the same decisions the shell scripts did. Fix discrepancies.

Fourth, enable enforcement. At this point you have high confidence the policy engine is correct, your team has seen how it works, and turning on enforcement is low-risk.
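
The evaluation-only mode in the third step can be sketched as a thin wrapper that records both decisions and always returns the legacy one. ShadowStats, record_shadow_decision, and the log field names are hypothetical. The point is only that the policy engine's answer is logged and counted, never enforced, during the validation period.

```python
import json
import logging
from dataclasses import dataclass

log = logging.getLogger("policy.shadow")


@dataclass
class ShadowStats:
    total: int = 0
    mismatches: int = 0


def record_shadow_decision(
    deploy_id: str, legacy_allowed: bool, policy_allowed: bool, stats: ShadowStats
) -> bool:
    """Count and log disagreements, then return the legacy decision unchanged."""
    stats.total += 1
    if legacy_allowed != policy_allowed:
        stats.mismatches += 1
        # Structured log line so discrepancies can be queried later.
        log.warning(json.dumps({
            "event": "policy_shadow_mismatch",
            "deploy_id": deploy_id,
            "legacy_allowed": legacy_allowed,
            "policy_allowed": policy_allowed,
        }))
    return legacy_allowed
```

A mismatch count that stays at zero across the validation window is the evidence that makes switching to enforcement a non-event.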

Doing the migration all at once — remove shell scripts, add policy engine, enforce immediately — is the way to guarantee a late-night incident when the policy has an edge case you did not think of. The gradual path is slower and worth it.