Building an Internal Developer Platform in 2025

There is a version of "internal developer platform" that means a portal where engineers can click a button to provision a Kubernetes namespace. There is another version that means the sum total of tooling, abstractions, golden paths, and operational practices that let engineers ship code without thinking about infrastructure. The first version is easy to build and often useless. The second version is hard and matters a lot.

This post is about the second version — what it takes to build something that actually increases shipping velocity, what the key decisions are, and where most IDP efforts go wrong.

The Ownership Question: What the Platform Should Own

The most important decision you will make building an IDP is: what does the platform own, and what do product teams own? Get this wrong and you build either a platform that is too thin to be useful (product teams have to re-invent infrastructure for every new service) or too thick to be maintainable (the platform team becomes a bottleneck for every deployment decision).

A useful heuristic: the platform should own anything that is the same across most services and where divergence creates operational risk. Product teams should own anything that is specific to their service's business logic or that changes at their service's pace.

In practice, this means the platform typically owns:

Kubernetes cluster configuration, namespaces, RBAC, and network policies
Observability infrastructure: metric collection, log aggregation, distributed tracing
Secret management and rotation policies
Base container images and approved dependency lockfiles
CI/CD pipeline templates and deploy gate policies
Service catalog and dependency registry

Product teams typically own:

Application-level configuration (feature flags, service-specific env vars)
Database schema migrations
Service-level alerting thresholds and SLO definitions
Test coverage and test quality for their service

The hard cases are the things that live on the boundary: Terraform modules for service-specific infrastructure, Helm values files, service mesh configuration. For these, a useful pattern is platform-provided-but-team-maintained: the platform provides a template and a schema, product teams maintain their own instantiation of it. This keeps platform involvement in the creation step rather than every update step.
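
To make the template-plus-schema split concrete, here is a minimal sketch of the check a platform could run on a team-maintained values file. The schema fields (service_name, replicas, ingress_host) and the function name are hypothetical, chosen only for illustration; a real platform would more likely publish a JSON Schema than a hand-rolled table like this.

```python
# Hypothetical platform-owned schema: field name -> (expected type, required?)
PLATFORM_SCHEMA = {
    "service_name": (str, True),
    "replicas": (int, True),
    "ingress_host": (str, False),
}


def validate_values(values: dict, schema: dict = PLATFORM_SCHEMA) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field, (expected_type, required) in schema.items():
        if field not in values:
            if required:
                errors.append(f"missing required field: {field}")
            continue
        if not isinstance(values[field], expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(values[field]).__name__}"
            )
    # Reject fields the schema does not know about, so typos fail loudly.
    for field in values:
        if field not in schema:
            errors.append(f"unknown field: {field}")
    return errors
```

Run as a CI check on the team's repository, something like this keeps the platform team out of routine updates: teams edit their own values, and the schema is the only thing the platform has to review.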

Golden Paths: What Makes One Work

A golden path is a predefined, opinionated route for common engineering tasks — spin up a new service, add a new endpoint, deploy to production — that is easy enough to follow that it is faster than the DIY alternative. The key word is "golden," not "mandatory." A golden path that engineers are forced to use is a bottleneck. A golden path that engineers choose because it saves them time is a productivity multiplier.

Golden paths work when they reduce the number of decisions an engineer has to make without reducing their agency over outcomes. "Create a new service" as a golden path should handle: Kubernetes manifests, CI pipeline setup, observability wiring, service discovery registration, and base security policies. It should not decide: what the service does, what its API contracts are, or how it stores data.
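
As a rough sketch of that division, a "create a new service" scaffold could be described as a plan of generated files plus one deliberately empty team-owned area. The file paths and the scaffold_plan name below are assumptions for illustration, not a reference layout.

```python
# What the golden path generates. Paths and descriptions are illustrative only.
PLATFORM_ARTIFACTS = {
    "deploy/manifests.yaml": "Kubernetes manifests",
    "ci/pipeline.yaml": "CI pipeline setup",
    "observability/wiring.yaml": "observability wiring",
    "discovery/registration.yaml": "service discovery registration",
    "security/base-policy.yaml": "base security policies",
}


def scaffold_plan(service_name: str) -> dict[str, str]:
    """Map each path the scaffold would create to a one-line description."""
    if not service_name:
        raise ValueError("service_name must not be empty")
    plan = {
        f"{service_name}/{path}": f"{description} for {service_name}"
        for path, description in PLATFORM_ARTIFACTS.items()
    }
    # The team-owned area is created empty: the scaffold takes no position
    # on what the service does, its API contracts, or how it stores data.
    plan[f"{service_name}/src/"] = (
        "team-owned: business logic, API contracts, data storage"
    )
    return plan
```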

Golden paths fail when:

They are not maintained. A golden path that was built 18 months ago and has not tracked Kubernetes API changes or the move to a new secrets manager is worse than no golden path — it creates false confidence that the output is production-ready when it is not.

They assume too much homogeneity. If your backend services use Python, Go, and Java, a golden path designed for Python will be used by Python teams and ignored by everyone else. You need golden paths per language ecosystem, which multiplies maintenance cost.

They are too opinionated about internal details. Golden paths should be opinionated about the external interfaces (how the service reports metrics, how it registers with service discovery) and permissive about internal structure. If your golden path dictates folder structure for business logic, engineers will route around it.

The Abstraction Layer: How Thick Is Thick Enough?

Most IDPs end up with some kind of abstraction layer between product teams and the underlying infrastructure — a simplified interface for deploying a service that hides the Kubernetes/Helm/Terraform complexity underneath. Getting the thickness of this layer right is genuinely difficult.

Too thin: product teams need to know about pods, deployments, services, ingresses, and horizontal pod autoscalers. They spend time learning Kubernetes internals that are not relevant to their service's business logic.

Too thick: the abstraction leaks when something unusual needs to happen. A team wants to configure a sidecar for their specific use case. The abstraction does not support it. They either hack around the abstraction or file a request with the platform team and wait.

The practical answer for most teams is to abstract the common case completely and expose the underlying primitives as an escape hatch. Document clearly what bypassing the abstraction means for a team's maintenance relationship with the platform team.

# Example: thin abstraction with escape hatch
# Two separate service specs, shown here as two YAML documents

# Normal path: platform manages everything
---
service:
  name: payment-processor
  language: go
  port: 8080
  replicas:
    min: 2
    max: 10
  resources:
    preset: standard-api  # platform-defined CPU/memory profile

# Escape hatch: team provides raw Kubernetes spec
---
service:
  name: ml-inference
  kubernetes_override:
    # Raw Kubernetes deployment spec follows
    # Note: platform SLA does not cover services using kubernetes_override
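
On the platform side, resolving either form of the spec above could look roughly like the following. This is a sketch under assumptions: the spec is already parsed into a dict, the resolve_service name and the output shape are ours, and the CPU/memory values for standard-api are placeholders, since the example only names the preset.

```python
# Hypothetical preset table. The values are placeholders, not recommendations.
RESOURCE_PRESETS = {
    "standard-api": {"cpu": "500m", "memory": "512Mi"},
}


def resolve_service(spec: dict) -> dict:
    """Turn a parsed service spec into what the platform would deploy."""
    service = spec["service"]
    override = service.get("kubernetes_override")
    if override is not None:
        # Escape hatch: pass the team's raw spec through untouched and
        # record that the platform SLA no longer covers this service.
        return {
            "name": service["name"],
            "deployment": override,
            "platform_sla": False,
        }

    preset_name = service["resources"]["preset"]
    if preset_name not in RESOURCE_PRESETS:
        raise ValueError(f"unknown resource preset: {preset_name}")

    # Normal path: the platform expands the preset and owns the result.
    deployment = {
        "port": service["port"],
        "min_replicas": service["replicas"]["min"],
        "max_replicas": service["replicas"]["max"],
        "resources": RESOURCE_PRESETS[preset_name],
    }
    return {"name": service["name"], "deployment": deployment, "platform_sla": True}
```

Recording platform_sla on the resolved output, rather than leaving it implied, makes the escape hatch's cost visible in whatever catalog or dashboard consumes the result.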

Crestline Software, a growing SaaS team, implemented this pattern and saw "golden path exception" requests to their platform team drop from 8-10 per month to roughly one or two. The escape hatch was used for about 7% of services, almost all of which had legitimate needs (GPU requirements, custom sidecar configurations) rather than engineers avoiding the golden path out of preference.

Build vs Buy: The Honest Calculus

Every component of your IDP comes with a build-vs-buy decision. Here is how we think about it:

Build when the problem is genuinely specific to your engineering culture, your stack, or your deployment patterns. If your golden path for service creation is tightly coupled to how your company structures teams, handles on-call, and names services, a generic scaffolding tool will not fit without heavy customization. Build the scaffolding. Use the commodity tools under the hood.

Buy when the problem is an industry problem, not your problem. Secret rotation, log aggregation, distributed tracing, container image scanning — these are solved problems with good commercial and open-source options. Building them yourself costs platform team time that could go toward the problems that are actually specific to your context.

The false economy of "just write a script." Every platform team has a pile of scripts that started as "just a quick automation" and are now critical path with no documentation and no tests. Scripts are not builds. They are deferred technical debt. If the thing you are automating is important enough to run in production, it is important enough to build with the same quality bar you would apply to a service.

What 2025 Changes About IDP Design

A few things have shifted in the past couple of years that affect IDP design decisions:

Platform engineering tooling is more mature. Backstage, Port, Cortex, and similar developer portal products have absorbed a lot of the service catalog and golden path workflow problem that teams were hand-building three years ago. If you are starting a new IDP today, you almost certainly should not build your own portal from scratch.

GitOps has won for infrastructure state. If you are still applying Terraform manually or running infrastructure configuration scripts as part of CI jobs, you are accumulating drift risk. Flux and Argo CD for Kubernetes state, Terraform Cloud (since renamed HCP Terraform) or Atlantis for infra state: the GitOps pattern is the right answer for environment consistency.

Ephemeral environments have become tractable. Spinning up a full environment per PR was exotic and expensive two years ago. Namespace-per-PR on Kubernetes, combined with lightweight database seeding approaches, has made it achievable for most teams. If your IDP does not have a path to ephemeral PR environments, that is probably the highest-value capability gap to close.
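
One small but easy-to-get-wrong piece of namespace-per-PR is the naming itself. Kubernetes namespace names must be valid DNS labels: lowercase alphanumerics and hyphens, starting and ending with an alphanumeric, at most 63 characters. A minimal sketch, assuming a "<service>-pr-<number>" convention of our own choosing:

```python
import re

MAX_LABEL_LENGTH = 63  # DNS label limit that namespace names must respect


def pr_namespace(service: str, pr_number: int) -> str:
    """Derive a valid namespace name for one service's pull request."""
    suffix = f"-pr-{pr_number}"
    # Lowercase, then collapse anything outside [a-z0-9] into single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", service.lower()).strip("-")
    # Truncate the service part, never the PR suffix, so names stay unique.
    slug = slug[: MAX_LABEL_LENGTH - len(suffix)].rstrip("-")
    if not slug:
        raise ValueError(f"cannot derive a namespace name from {service!r}")
    return slug + suffix
```

Deriving the name deterministically from the PR also makes cleanup simple: when the PR closes, the same function tells you which namespace to delete.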

The Mistake That Kills IDP Projects

The most common failure mode we see is platform teams building infrastructure that nobody asked for, discovering it goes unused, and concluding that "developers just don't want to be helped." This misses the real problem: the platform was not built around where engineers were losing time.

Before writing a line of IDP code, spend four weeks observing. Watch engineers during on-call rotations. Sit with them during deploy failures. Look at your Slack channels and see what questions come up repeatedly. The IDP should automate the things that are visibly costing time, not the things that seem like they should cost time according to a whiteboard architecture diagram.

And once you ship something, measure adoption directly — not "the golden path exists and could be used" but "N% of new service creations used the golden path in the last month." If adoption is below 50% and the golden path is available, there is a friction problem or a trust problem that you need to investigate before building more.
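
The adoption number itself is cheap to compute once service creations are recorded somewhere. A minimal sketch, assuming you can export each new service as a (creation date, used golden path) pair; the function name and the 30-day default window are our own choices:

```python
from datetime import date, timedelta


def golden_path_adoption(services, today, window_days=30):
    """Percentage of services created in the window that used the golden path.

    services: iterable of (created_on: date, used_golden_path: bool) pairs.
    Returns None when no services were created in the window, since a
    percentage of zero creations is not meaningful.
    """
    cutoff = today - timedelta(days=window_days)
    recent = [used for created_on, used in services if cutoff <= created_on <= today]
    if not recent:
        return None
    return 100.0 * sum(recent) / len(recent)
```

Returning None instead of 0 for an empty window matters for the 50% threshold: a month with no new services should not trigger a friction investigation.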