How to Measure Platform Team Toil (and What to Do About It)

Google's SRE book defines toil as "the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows." Platform teams recognized the concept immediately, because it describes most of their week.

The problem is that toil, for platform teams, is harder to measure than for SRE teams running production services. When an SRE gets paged, there is a ticket. When a platform engineer spends 40 minutes debugging a flapping CI job for a feature team that filed a Slack message, that time disappears into ambient noise. And if you cannot measure it, you cannot make the case to reduce it — or fund the automation that would eliminate it.

This post is a practical framework for getting that measurement in place. It is not sophisticated. It does not require specialized tooling. It requires about two weeks of discipline and a shared spreadsheet.

Why Toil Is Worse on Platform Teams Than in SRE

SRE toil is largely reactive, and it leaves a trail: you get paged, you respond, you log the work. Platform toil is just as interrupt-driven, but the demand comes from other teams and most of it goes unrecorded. Three patterns dominate:

Pull requests that require manual review to unblock. Feature team opens a PR that modifies CI configuration or Kubernetes manifests. They do not know the conventions. Platform engineer reviews, comments, and goes back and forth until the change is mergeable. Sometimes this is genuinely necessary; sometimes it is a symptom of a missing golden path or a convention that nobody wrote down.

Environment provisioning requests. "Can you spin up a staging environment for this branch?" If this request is coming in more than twice a week per platform engineer, something is missing from your self-service story. Tracking it makes the pattern visible.

Pipeline fire-fighting. CI job fails mysteriously. Feature team files a Slack thread. Platform engineer investigates, finds it is a runner capacity issue or a flaky test in a shared step, fixes it, moves on. No ticket. No log. Just 90 minutes of someone's afternoon, every week.

The common thread is that none of these generate tickets automatically. They are invisible unless you explicitly instrument for them.

The Toil Log: A Minimal Starting Point

For two weeks, every member of the platform team logs interruptions and reactive work in a shared document. Project work, the stuff on the roadmap, stays out of the log. What goes in is everything that pulls you away from that planned work.

Each entry needs only four fields:

Date | Duration (minutes) | Category | Automatable? (Y/N/Partial)

2026-01-04 | 45 | Pipeline fire-fighting | Y
2026-01-04 | 20 | Environment provisioning | Y
2026-01-05 | 30 | CI config review (feature team PR) | Partial
2026-01-05 | 15 | On-call rotation change coordination | N
2026-01-06 | 60 | Flaky test debugging (shared auth step) | Y

Two weeks of this gives you a sample. It will not be statistically perfect — one unusually bad week can skew it. But it will almost certainly reveal the top two or three categories that are consuming disproportionate time, which is all you need to prioritize automation work.
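
If the log lives in a plain-text file or a spreadsheet export, the four fields map onto a small record type. The following is a minimal sketch in Python, assuming the pipe-separated layout shown above. The names `ToilEntry` and `parse_log` are illustrative and not part of any existing tool.

```python
from dataclasses import dataclass
from datetime import date

VALID_FLAGS = {"Y", "N", "Partial"}


@dataclass(frozen=True)
class ToilEntry:
    day: date
    minutes: int
    category: str
    automatable: str  # "Y", "N", or "Partial"


def parse_log(text: str) -> list[ToilEntry]:
    """Parse 'date | minutes | category | flag' lines, skipping blanks and the header."""
    entries = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 4 or parts[0] == "Date":
            continue  # blank line, header row, or malformed entry
        day, minutes, category, flag = parts
        if flag not in VALID_FLAGS:
            raise ValueError(f"unexpected automatable flag: {flag!r}")
        entries.append(ToilEntry(date.fromisoformat(day), int(minutes), category, flag))
    return entries


SAMPLE = """\
Date | Duration (minutes) | Category | Automatable? (Y/N/Partial)
2026-01-04 | 45 | Pipeline fire-fighting | Y
2026-01-04 | 20 | Environment provisioning | Y
2026-01-05 | 30 | CI config review (feature team PR) | Partial
"""

entries = parse_log(SAMPLE)
total_minutes = sum(e.minutes for e in entries)
```

Rejecting unknown flags at parse time is a deliberate choice here: a typo in the Automatable column would otherwise silently distort the later split between automatable and non-automatable hours.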

Toil Categories and Their Automation Potential

In our experience running this exercise with several platform teams, toil tends to cluster into the following categories, each with a rough automation potential:

Pipeline configuration changes: Medium automation potential. You can reduce this with better golden paths and a policy-as-code layer that validates changes automatically. You cannot fully automate it because some changes require human judgment about system-wide impact.

Environment provisioning: High automation potential. If you have a Kubernetes cluster and a Terraform module for environments, a well-written self-service endpoint can handle 80-90% of these requests. The remaining 10-20% are edge cases that you will document as you hit them.

Dependency and lockfile updates: Near-total automation potential. Dependabot, Renovate, and similar tools exist for this. If platform engineers are manually updating lockfiles, that is tooling debt, not a category of work that needs to exist.

Secret rotation: Medium automation potential. Some secrets rotate via automated tooling (Vault, AWS Secrets Manager rotation policies). Others require human coordination. Track them separately.

Pipeline fire-fighting: Partial automation potential. You can instrument pipelines to self-report anomalies and auto-retry flaky steps, which reduces fire-fighting frequency. But you cannot fully automate root cause investigation when a new failure pattern appears. Track time-to-diagnose as a metric; improvements here often come from better observability on the CI infrastructure itself, not from automating the investigation.

Coordination work: Low automation potential. On-call handoffs, cross-team scheduling, incident post-mortems. This is legitimately human work. The goal is not to automate it but to make it visible so it does not get confused with automatable work when you are arguing for headcount.
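
For the pipeline fire-fighting category, the idea of auto-retrying a flaky step while self-reporting the anomaly can be sketched as a small wrapper. This is an illustration under assumptions, not a recommendation for any particular CI system: `run_with_retry` and the `report` callback are hypothetical names, and many CI systems offer their own retry settings that would be the better place for this in practice.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(
    step: Callable[[], T],
    name: str,
    report: Callable[[str], None],
    attempts: int = 3,
    delay_seconds: float = 0.0,
) -> T:
    """Run a step, retrying on failure and reporting every failed attempt."""
    for attempt in range(1, attempts + 1):
        try:
            result = step()
        except Exception as exc:
            # Self-report the anomaly so flaky steps leave a record.
            report(f"{name}: attempt {attempt}/{attempts} failed: {exc}")
            if attempt == attempts:
                raise  # out of retries; a human needs to look at this
            time.sleep(delay_seconds)
        else:
            return result
    raise ValueError("attempts must be at least 1")


# Usage: a stand-in step that fails twice, then succeeds.
calls = {"n": 0}


def flaky_step() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("runner unavailable")
    return "ok"


anomalies: list[str] = []
outcome = run_with_retry(flaky_step, "shared-auth-step", anomalies.append)
```

The retry hides the failure from the feature team, but the `anomalies` list keeps it visible to the platform team, so a step that needs retries every day still surfaces as a pattern worth fixing.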

Turning the Log into an Argument

After two weeks, aggregate the log by category and by automatable/non-automatable. A typical result might look like:

Category | Hours/week (2-person team) | Automatable?
Pipeline fire-fighting | 4.5 | Partial
Environment provisioning | 3.0 | Yes
CI config reviews (feature teams) | 2.5 | Partial
Dependency/lockfile updates | 2.0 | Yes
Secret rotation | 1.5 | Partial
Coordination/scheduling | 2.0 | No

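That table shows 5h/week of fully automatable work. Over a year that is roughly 260 hours, or about six and a half working weeks of engineering capacity, from a two-person platform team, before counting whatever share of the 8.5h/week of partially automatable work can also be removed. That argument lands differently than "we spend too much time on toil."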
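
The aggregation itself is a few lines of code once the log is in structured form. The following is a minimal sketch, assuming entries have been reduced to `(category, minutes, automatable)` tuples collected over a known number of weeks. The function name and the sample rows are illustrative, chosen so the totals match some rows of the table above.

```python
from collections import defaultdict


def weekly_hours(entries, weeks):
    """Return ({category: hours/week}, {automatable flag: hours/week})."""
    by_category = defaultdict(float)
    by_flag = defaultdict(float)
    for category, minutes, flag in entries:
        hours_per_week = minutes / 60 / weeks
        by_category[category] += hours_per_week
        by_flag[flag] += hours_per_week
    return dict(by_category), dict(by_flag)


# Hypothetical two-week sample; a real log will have many more rows.
log = [
    ("Environment provisioning", 180, "Y"),
    ("Environment provisioning", 180, "Y"),
    ("Dependency/lockfile updates", 240, "Y"),
    ("Pipeline fire-fighting", 540, "Partial"),
    ("Coordination/scheduling", 240, "N"),
]

by_category, by_flag = weekly_hours(log, weeks=2)
for category, hours in sorted(by_category.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {hours:.1f}h/week")
print(f"Fully automatable: {by_flag['Y']:.1f}h/week")
```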

The Automation Priority Order

Not all automatable toil is equal. Before diving into whichever category irritates you most, score each category on two axes: elimination potential (how much of the category can automation eliminate) and leverage (does eliminating it unblock other platform work or directly benefit feature teams).

Environment provisioning tends to score high on both — you can eliminate most of it with a self-service layer, and doing so removes a daily friction point for feature teams trying to test against production-like environments. It is usually the right starting point.

Dependency updates score high on elimination potential but lower on leverage — they are annoying but rarely blocking. Automate them, but do not make them the center of a platform investment pitch.

Pipeline fire-fighting is high leverage, because every hour you spend on it is an hour feature teams are blocked, but it is only partial on elimination. The right investment here is better observability (CI metrics, alerting on runner capacity) rather than automation. Observability changes the nature of the work, making failures faster to diagnose, rather than eliminating it.
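
One way to make the two-axis scoring concrete is to weight each category's weekly hours by both scores and sort. The sketch below assumes scores between 0 and 1 and a multiplicative ranking rule. Neither comes from an established method, and the numbers are placeholders that echo the qualitative ratings above, with hours taken from the example table.

```python
from dataclasses import dataclass


@dataclass
class Category:
    name: str
    hours_per_week: float
    elimination: float  # share of the category automation could remove, 0..1
    leverage: float     # how much removing it unblocks others, 0..1

    @property
    def priority(self) -> float:
        # Assumed ranking rule: recoverable hours, weighted by leverage.
        return self.hours_per_week * self.elimination * self.leverage


# Placeholder scores; replace them with your team's own estimates.
candidates = [
    Category("Environment provisioning", 3.0, elimination=0.8, leverage=0.9),
    Category("Dependency/lockfile updates", 2.0, elimination=0.9, leverage=0.3),
    Category("Pipeline fire-fighting", 4.5, elimination=0.3, leverage=0.9),
]

ranked = sorted(candidates, key=lambda c: c.priority, reverse=True)
for c in ranked:
    print(f"{c.name}: {c.priority:.2f}")
```

With these placeholder scores, environment provisioning ranks first and dependency updates last, even though pipeline fire-fighting consumes the most hours. The exact numbers matter less than forcing the team to write its estimates down and argue about them.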

Keeping the Measurement Going

The two-week exercise gives you a baseline. The mistake is stopping there. Toil changes over time — you automate environment provisioning, and suddenly dependency management becomes the new top category. If you only measure at the start, you lose the ability to track whether your automation work is actually moving the needle.

We run a lightweight version of this continuously — one line per interrupt, same four fields, logged in the same shared doc. It takes about 30 seconds per entry and 15 minutes per month to aggregate. The monthly aggregate goes into our planning document alongside roadmap items, which makes the tradeoff explicit: "this month we are spending X hours on toil — here is what we could automate and what that would free up."
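
The monthly roll-up can be equally small. The following sketch assumes each logged line has been reduced to a `(date, minutes, category)` tuple. `monthly_summary` is a hypothetical helper, and the sample rows are invented to show a top category shifting from one month to the next.

```python
from collections import defaultdict
from datetime import date


def monthly_summary(entries):
    """Return {(year, month): (total_hours, top_category)} for trend tracking."""
    per_month = defaultdict(lambda: defaultdict(int))
    for day, minutes, category in entries:
        per_month[(day.year, day.month)][category] += minutes
    summary = {}
    for month, categories in sorted(per_month.items()):
        top = max(categories, key=categories.get)
        summary[month] = (sum(categories.values()) / 60, top)
    return summary


# Invented rows: provisioning dominates January, then drops in February.
log = [
    (date(2026, 1, 6), 120, "Environment provisioning"),
    (date(2026, 1, 13), 60, "Dependency/lockfile updates"),
    (date(2026, 2, 3), 30, "Environment provisioning"),
    (date(2026, 2, 10), 90, "Dependency/lockfile updates"),
]

summary = monthly_summary(log)
for (year, month), (hours, top) in summary.items():
    print(f"{year}-{month:02d}: {hours:.1f}h of toil, top category: {top}")
```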

We are not saying every platform team needs a formal toil tracking system. We are saying that if you cannot answer "what percentage of platform engineer time is toil, and which categories dominate," you are making resourcing and automation investment decisions blind. The spreadsheet costs nothing and takes two weeks. The information it generates pays for itself the first time you use it in a planning conversation.