Canary Deployments: The Decision Nobody Talks About

The canary deployment pattern gets a lot of attention for its traffic-splitting part and almost none for its decision part. Blog posts walk through Istio VirtualService config, weighted Kubernetes Services, or Nginx upstream weights. The math for "send 5% of traffic to the new version" is genuinely interesting infrastructure work.

Then comes the hard question: once 5% of traffic is hitting the canary, how do you decide whether to promote it to 100% or roll it back? And when?

This is the decision nobody talks about because it's uncomfortable. It's not a systems problem with a clean answer. It's a judgment problem that most teams resolve by picking a time window, setting a p99 latency threshold, setting an error rate threshold, and promoting if the canary reaches the end of the window without crossing either threshold. That's not wrong, exactly. It's just a timer dressed up as a decision process.

The Traffic-Splitting Part (Brief)

For completeness: the mechanics of a canary deployment in a Kubernetes environment typically involve either a service mesh (Istio, Linkerd) for request-level routing, or a replica-count approach where you run 1 canary pod alongside 9 stable pods, giving you approximately 10% canary traffic if the load balancer spreads requests evenly across pods. The service mesh approach is more precise but operationally heavier. The replica-count approach is simpler, but the split can only move in whole-pod steps (1 of 10 pods is 10%, 1 of 20 is 5%) and it doesn't give you header-based routing for testing.

Neither approach is wrong. Pick the one that matches your current operational maturity. This post isn't about which traffic-splitting approach to use.

The Actual Hard Part: Defining "Healthy"

Before you can make a promote/rollback decision, you need a definition of canary health that's specific enough to act on. Most teams start with the obvious two:

Error rate < X% (compared to stable baseline)
p99 latency < Y ms (absolute threshold, or relative to stable)

These are necessary but not sufficient. The problems:

Error rate comparison needs a baseline, and your baseline is noisy. If your stable service has a 0.3% error rate because of some persistent upstream flakiness, and your canary has a 0.4% error rate, is that a regression? Statistically: maybe not, if your sample is small. If you're routing 5% of traffic to the canary and you have 1000 requests per minute total, you're evaluating the canary on 50 req/min. That's 3000 requests in an hour, so 0.4% is about 12 errors where the baseline rate predicts 9. A gap of three errors is well within sampling noise: the 95% confidence interval on the difference includes zero. You can't make a statistically defensible promote decision after one hour at 5% traffic with those numbers.
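The arithmetic behind that claim can be checked with a few lines of Python. This is a minimal sketch using the normal-approximation (Wald) interval for a difference of two proportions; the function name is ours, and the error counts are simply the 0.3% and 0.4% rates from the example applied to one hour of traffic.

```python
import math

def diff_confidence_interval(canary_errors, canary_total,
                             stable_errors, stable_total, z=1.96):
    """Wald (normal-approximation) interval for canary_rate - stable_rate."""
    p_c = canary_errors / canary_total
    p_s = stable_errors / stable_total
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_c * (1 - p_c) / canary_total
                   + p_s * (1 - p_s) / stable_total)
    diff = p_c - p_s
    return diff - z * se, diff + z * se

# One hour at 1000 req/min total with 5% routed to the canary:
# 3000 canary requests at 0.4% is 12 errors,
# 57000 stable requests at 0.3% is 171 errors.
low, high = diff_confidence_interval(12, 3000, 171, 57000)
print(f"95% interval for the difference: [{low:.4%}, {high:.4%}]")
print("includes zero:", low < 0 < high)
```

The Wald interval is crude when error counts are this small, but a more careful method would not rescue the comparison: the interval is several times wider than the 0.1 percentage point gap it is trying to measure.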

p99 latency is sensitive to outliers in small samples. p99 across 3000 requests is roughly the 30th worst request, so only about 30 requests in the tail determine it. A brief stall, such as one slow database query in a background job that delays a few dozen requests, can move this number significantly. Comparing canary p99 to stable p99 across a one-hour window at 5% traffic is comparing two unreliable estimates.
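To make the fragility concrete, here is a small sketch with made-up latencies. It uses the nearest-rank definition of a percentile, which is only one of several in common use, and the two sample hours are hypothetical.

```python
def p99_nearest_rank(latencies_ms):
    """Nearest-rank p99: the value at 1-indexed rank ceil(0.99 * n)."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    rank = (99 * n + 99) // 100  # integer form of ceil(99 * n / 100)
    return ordered[rank - 1]

# Hypothetical hour: 3000 canary requests, 10 slow ones (900 ms), rest 100 ms.
quiet_hour = [100] * 2990 + [900] * 10
# Same hour, but a brief stall makes 40 more requests slow.
stalled_hour = [100] * 2950 + [900] * 50

print(p99_nearest_rank(quiet_hour))    # 100
print(p99_nearest_rank(stalled_hour))  # 900
```

Forty extra slow requests out of 3000 is about 1.3% of the sample, yet it moves the reported p99 by a factor of nine in this toy data.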

A workable rule of thumb: run the canary long enough, or at a high enough traffic percentage, that you have at least 10,000 canary requests per comparison interval. Treat that as a floor rather than a guarantee, because detecting a small shift in an already low error rate takes more. For high-traffic services this might be 10 minutes at 5%. For lower-traffic services it might be 24 hours at 5%, at which point you're asking whether it's acceptable to run a two-version production environment for a full day before each deploy.
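The time it takes to reach a request target follows directly from total traffic and the canary percentage. A minimal sketch, with a function name and traffic figures chosen for illustration:

```python
def minutes_to_collect(target_requests, total_rpm, canary_percent):
    """Minutes until the canary has received target_requests."""
    canary_rpm = total_rpm * canary_percent / 100
    if canary_rpm <= 0:
        raise ValueError("canary must receive some traffic")
    return target_requests / canary_rpm

# A service at 20,000 req/min reaches 10,000 canary requests in 10 minutes.
print(minutes_to_collect(10_000, 20_000, 5))  # 10.0
# The 1000 req/min service from the earlier example needs 200 minutes.
print(minutes_to_collect(10_000, 1_000, 5))   # 200.0
```

Running the numbers per service, rather than picking one window for everything, is what turns the observation window from a guess into a derived value.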

The Hidden Signals Most Teams Miss

Error rate and latency are the obvious metrics, but several more discriminating signals are often ignored because they're harder to wire up:

Business-level error rates. An HTTP 200 response from your checkout service that contains a JSON body with "status": "cart_validation_failed" is an application error, not an HTTP error. Your p99 latency looks fine. Your error rate looks fine. Your canary is silently returning cart validation failures at 3x the baseline rate because a change in the coupon code parsing logic introduced a regression for a specific code format. HTTP-level metrics don't catch this. Application-level event tracking does.
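One way to surface this kind of failure is to classify responses by body as well as by status code. The sketch below is illustrative only: the set of failing statuses, the choice to count unparseable bodies as failures, and the function names are all assumptions, not part of any particular framework.

```python
import json

# Hypothetical body-level statuses that count as failures for this service.
FAILURE_STATUSES = {"cart_validation_failed"}

def is_business_failure(http_status, body_text):
    """True if the response failed at the HTTP level or the application level."""
    if http_status >= 500:
        return True
    try:
        body = json.loads(body_text)
    except json.JSONDecodeError:
        return True  # assumption: an unparseable body counts as a failure
    return isinstance(body, dict) and body.get("status") in FAILURE_STATUSES

def business_failure_rate(responses):
    """responses: list of (http_status, body_text) tuples."""
    if not responses:
        return 0.0
    failures = sum(1 for status, body in responses
                   if is_business_failure(status, body))
    return failures / len(responses)

sample = [
    (200, '{"status": "ok"}'),
    (200, '{"status": "cart_validation_failed"}'),
]
print(business_failure_rate(sample))  # 0.5, although every response is HTTP 200
```

Computed separately for canary and stable, this rate can be compared in the same way as the HTTP error rate.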

Downstream service impact. Your canary is healthy. But it's generating 40% more calls to your inventory service than the stable version, because a loop optimization was accidentally reverted. Your inventory service is fine now, but it'll be a different story when 100% of traffic is hitting the new version under peak load. Canary health metrics should include the load profile your canary is imposing on dependencies, not just its own health.
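A simple way to express this as a metric is downstream calls per inbound request, compared between canary and stable. The sketch assumes you can count both per version (for example from client-side instrumentation); the function name and the counts are hypothetical.

```python
def downstream_amplification(canary_downstream_calls, canary_requests,
                             stable_downstream_calls, stable_requests):
    """Ratio of downstream calls per inbound request, canary vs stable."""
    canary_per_request = canary_downstream_calls / canary_requests
    stable_per_request = stable_downstream_calls / stable_requests
    return canary_per_request / stable_per_request

# Hypothetical hour: the canary makes 1.4 inventory calls per request,
# the stable version makes 1.0.
ratio = downstream_amplification(4200, 3000, 57000, 57000)
print(f"canary imposes {ratio:.2f}x the downstream load per request")
```

Normalizing by request count matters here: at 5% traffic the canary's absolute call volume is tiny, so only the per-request figure reveals the regression.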

Memory and goroutine/thread growth rate. A canary that's slowly leaking memory will look healthy for the observation window and fail on Sunday morning when memory pressure triggers the OOM killer on 100% of your pods simultaneously. Runtime metrics (heap size growth rate, goroutine count trends in Go, JVM old-gen growth in Java) during the canary window can catch this pattern early.
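A growth rate can be estimated with an ordinary least-squares slope over the samples collected during the canary window. This is a rough sketch: the sample values and the 2048 MB limit are invented, and a straight-line fit ignores normal warm-up and garbage collection sawtooth patterns.

```python
def growth_rate_per_minute(samples):
    """Least-squares slope of (minute, value) samples, e.g. heap size in MB."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("samples must span more than one point in time")
    return numerator / denominator

def minutes_until_limit(samples, limit):
    """Projected minutes until the metric reaches limit, or None if not growing."""
    slope = growth_rate_per_minute(samples)
    if slope <= 0:
        return None
    latest_value = samples[-1][1]
    return (limit - latest_value) / slope

# Hypothetical heap samples (minute, MB) from a 30-minute canary window.
heap_mb = [(0, 500), (10, 520), (20, 540), (30, 560)]
print(growth_rate_per_minute(heap_mb))     # 2.0 MB per minute
print(minutes_until_limit(heap_mb, 2048))  # 744.0 minutes, about 12 hours
```

A canary that looks healthy at minute 30 but projects to hit its memory limit within a day is exactly the case that fixed error and latency thresholds miss.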

The Cron Job and Hope Pattern

What teams typically build: a script that queries Prometheus or Datadog for canary vs stable comparison, runs on a schedule, and if all thresholds pass after N minutes, triggers the promotion step. This works in the common case. The failure modes:

The time window is usually chosen based on intuition rather than statistical power calculations. Five minutes "feels safe" for a well-covered service but gives you almost no signal for a low-traffic endpoint that processes 10 requests per minute: that is 50 requests in total, of which only two or three reach a 5% canary. The same time window applied to all services is wrong for most of them.

The promotion trigger is binary (thresholds pass → promote), which means a borderline canary, say one with an error rate at 0.8x the threshold and a p99 20% higher than stable, promotes automatically because it didn't cross the threshold. A human looking at those numbers would probably want to investigate. The cron job doesn't care about "concerning but within bounds."

We're not saying automated promotion based on thresholds is wrong — it's operationally necessary if you're deploying frequently. We're saying the thresholds need to be derived from your actual traffic volume and baseline variability, not set once and forgotten. And the automated path should have a "pause for review" condition when metrics are within bounds but trending poorly.
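A "pause for review" outcome can be added by making the check return one of three states instead of two. The sketch below is one possible shape, not a recommendation for specific values: the 1.5 limit, the 0.8 warning fraction, and the "four strictly rising samples" definition of a bad trend are all placeholders.

```python
from enum import Enum

class Decision(Enum):
    PROMOTE = "promote"
    PAUSE_FOR_REVIEW = "pause_for_review"
    ROLLBACK = "rollback"

def evaluate(error_ratio_history, max_ratio=1.5, warn_fraction=0.8):
    """error_ratio_history: canary/stable error-rate ratios, oldest first."""
    if not error_ratio_history:
        raise ValueError("need at least one sample")
    latest = error_ratio_history[-1]
    if latest > max_ratio:
        return Decision.ROLLBACK
    # Within bounds, but close enough to the limit that a human should look.
    near_threshold = latest >= warn_fraction * max_ratio
    # "Trending poorly" here means each of the last four samples is worse
    # than the one before it.
    recent = error_ratio_history[-4:]
    rising = len(recent) == 4 and all(b > a for a, b in zip(recent, recent[1:]))
    if near_threshold or rising:
        return Decision.PAUSE_FOR_REVIEW
    return Decision.PROMOTE

print(evaluate([1.0, 1.0, 0.9, 1.0]))     # Decision.PROMOTE
print(evaluate([1.0, 1.05, 1.1, 1.15]))   # Decision.PAUSE_FOR_REVIEW
print(evaluate([1.0, 1.6]))               # Decision.ROLLBACK
```

The second call is the interesting one: every sample is within bounds, so a binary check would promote, but the steady climb is enough to hold the rollout for a person to look at.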

A Better Decision Framework

The promote/rollback decision is better modeled as a sequential testing problem than a fixed-window threshold check. The key insight: you don't need to wait a fixed time — you need to collect enough statistical evidence to make a confident decision, and stop early (roll back) if the evidence for regression accumulates faster than the evidence for health.

The Sequential Probability Ratio Test (SPRT) is the textbook approach here. In practice, a simpler version works: track a rolling window of the ratio of canary error rate to stable error rate. If this ratio exceeds a threshold (say, 1.5x) with statistical confidence at any point, roll back immediately. If the ratio stays below a lower bound (say, 1.1x) with statistical confidence, promote. If neither condition is met, continue running the canary.
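A minimal sketch of the simpler version follows. It is not Wald's SPRT. It puts a confidence interval on the ratio of error rates using the Katz log method and compares the interval bounds to the two thresholds. To keep it short it works on cumulative counts rather than a rolling window, and the 0.5 added to each count (to avoid taking the log of zero) is our choice.

```python
import math

def ratio_interval(canary_errors, canary_total,
                   stable_errors, stable_total, z=1.96):
    """Katz log-method interval for canary error rate / stable error rate."""
    # Add 0.5 to each count so a zero-error arm does not produce log(0).
    a, n1 = canary_errors + 0.5, canary_total + 0.5
    b, n2 = stable_errors + 0.5, stable_total + 0.5
    log_ratio = math.log((a / n1) / (b / n2))
    se = math.sqrt(1 / a - 1 / n1 + 1 / b - 1 / n2)
    return math.exp(log_ratio - z * se), math.exp(log_ratio + z * se)

def decide(canary_errors, canary_total, stable_errors, stable_total,
           rollback_ratio=1.5, promote_ratio=1.1):
    """Return 'rollback', 'promote', or 'continue' from the current counts."""
    low, high = ratio_interval(canary_errors, canary_total,
                               stable_errors, stable_total)
    if low > rollback_ratio:
        return "rollback"   # confident the ratio exceeds the upper threshold
    if high < promote_ratio:
        return "promote"    # confident the ratio is below the lower bound
    return "continue"       # not enough evidence either way

# One hour of the low-traffic example: 12/3000 vs 171/57000.
print(decide(12, 3000, 171, 57000))         # continue
# A clear regression: 3% vs 1% error rate.
print(decide(300, 10_000, 1000, 100_000))   # rollback
```

One caveat: checking the same interval repeatedly as data accumulates inflates the false-positive rate compared with a single look. A proper sequential test, or a correction for repeated looks, addresses that, which is part of why the real implementation is harder than this sketch.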

This approach adapts to your traffic volume automatically. High-traffic services reach a decision in minutes. Low-traffic services take longer, and that's correct, because less data supports fewer conclusions.

The implementation isn't trivial — it requires tracking the statistical bounds correctly as samples accumulate, not just comparing point estimates. But it produces decisions that are defensible rather than arbitrary, which matters when you're explaining a bad canary decision to your engineering team at 2 AM.

What "Rollback" Actually Means

One more thing that gets glossed over: rollback in a canary deployment is only simple if your deployment didn't include a database schema change. If the canary ran a migration — even a backward-compatible one — rolling back the application code doesn't undo the schema change. You're now running the old application code against a slightly forward-migrated schema.

Most of the time this is fine, because you're careful to make only additive migrations (new columns, new tables), not destructive ones (dropping columns, renaming fields), before a canary is fully promoted. But "most of the time" is not "always," and the cases where it matters tend to involve the most pressure to roll back quickly.

The implication: your canary deployment pipeline should require an explicit declaration of whether the change includes a schema migration, and enforce stricter evidence thresholds before promoting schema-migration canaries. A code-only change can be rolled back cleanly. A migration-included change cannot — which changes the risk calculus on the rollback side of the decision.
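In pipeline terms, this could look like a required declaration that selects the promotion policy. The sketch below is hypothetical throughout: the class names are ours, and the stricter numbers for migration changes are placeholders. Only the 10,000-request and 1.1x figures echo values used earlier in this post.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ChangeDeclaration:
    includes_schema_migration: bool

@dataclass(frozen=True)
class PromotionPolicy:
    min_canary_requests: int
    promote_ratio: float  # upper bound on canary/stable error ratio to promote

def policy_for(declaration: Optional[ChangeDeclaration]) -> PromotionPolicy:
    """Pick evidence thresholds based on the declared change type."""
    if declaration is None:
        # Refuse to guess: the author must say whether a migration is included.
        raise ValueError("deploy must declare whether it includes a migration")
    if declaration.includes_schema_migration:
        # Placeholder values: more evidence and a tighter bound, because
        # rollback will not undo the schema change.
        return PromotionPolicy(min_canary_requests=50_000, promote_ratio=1.05)
    return PromotionPolicy(min_canary_requests=10_000, promote_ratio=1.1)

print(policy_for(ChangeDeclaration(includes_schema_migration=True)))
print(policy_for(ChangeDeclaration(includes_schema_migration=False)))
```

Making the declaration mandatory, rather than defaulting to "no migration", is the point of the design: a forgotten flag should block the deploy, not silently select the looser policy.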

This is the kind of nuance that gets lost when canary deployment is treated as a pure infrastructure pattern rather than a deployment risk management decision. The traffic split is the easy part. The decision is the work.