Intellectual
Platform Engineering · 21 November 2023 · 8 min read

Enterprise Platform Governance

Platform governance sits between the platform team and consumer teams, between central standards and team autonomy, between speed and discipline. This is a practitioner view of the governance forums, decision rights, and policy boundaries that make platforms operable in regulated enterprises.

Platform governance is a different problem from API governance or integration governance, though the practices overlap. The platform team is operating shared infrastructure on behalf of consumer teams; the governance question is how the platform team and consumer teams negotiate the boundary — who decides what, when consumer teams can deviate, how shared resources are allocated, how policy is enforced without producing friction that pushes consumer teams toward shadow infrastructure.

This piece is the practitioner view on platform governance specifically, complementing the broader integration and API governance pieces. The audience is platform leads, architecture leadership, and the executives who sponsor platform engineering.

What platform governance is for

Platform governance answers a small set of recurring questions:

  • What capabilities does the platform provide, and what is excluded?
  • Which decisions are the platform team's to make? Which are consumer teams' to make?
  • How are shared resources (compute, storage, networking, secrets, namespaces) allocated?
  • How are platform changes communicated and consented to?
  • How are policy violations identified and resolved?
  • How does the platform evolve over time?

These look simple on paper. In practice, they consume a substantial portion of platform leadership attention because the answers shift over time and because consumer teams interpret them differently than the platform team intends.

The decision rights matrix

The single most consequential platform governance artefact is a written decision-rights matrix. Without it, every recurring decision becomes a discussion; with it, the discussion is constrained to the marginal cases.

A workable matrix looks something like this:

| Decision | Owns | Consults | Informs |
|---|---|---|---|
| Base image selection | Platform team | Security, application leads | All teams |
| Deployment pipeline architecture | Platform team | Application leads | All teams |
| Kubernetes cluster topology | Platform team | Architecture, security | All teams |
| Application-level Dockerfile content | Application team | Platform (review only) | — |
| Application's runtime configuration | Application team | — | — |
| Cluster upgrades and timing | Platform team | Application leads | All teams |
| New namespace requests | Platform team | — | Requesting team |
| Policy exceptions | Platform team + Security | Application lead | All teams |
| Platform capacity decisions | Platform team | Finance, architecture | All teams |
| Application capacity within team budget | Application team | — | — |

The matrix is not exhaustive; it covers the recurring decisions. Items not in the matrix follow a default — usually "platform team owns it" — and can be added when they recur often enough to deserve formalisation.
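
One way to keep the matrix usable by tooling is to hold it as data alongside the written version. The sketch below is illustrative only: it encodes a few rows from the matrix above, applies the "platform team owns it" default to anything unlisted, and counts unlisted lookups so that recurring gaps become visible. The decision keys, the `DecisionRight` type, and the `who_decides` function are hypothetical names, not part of any existing tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionRight:
    owns: str
    consults: tuple[str, ...] = ()
    informs: tuple[str, ...] = ()


# A few rows from the matrix; the keys are hypothetical identifiers.
MATRIX = {
    "base-image-selection": DecisionRight(
        owns="Platform team",
        consults=("Security", "Application leads"),
        informs=("All teams",),
    ),
    "application-runtime-configuration": DecisionRight(owns="Application team"),
    "policy-exceptions": DecisionRight(
        owns="Platform team + Security",
        consults=("Application lead",),
        informs=("All teams",),
    ),
}

# Default for decisions that are not (yet) in the matrix.
DEFAULT = DecisionRight(owns="Platform team")

# How often each unlisted decision has been looked up.
unlisted_lookups = Counter()


def who_decides(decision: str) -> DecisionRight:
    """Return the decision right for a decision, falling back to the default."""
    if decision not in MATRIX:
        unlisted_lookups[decision] += 1
        return DEFAULT
    return MATRIX[decision]
```

A decision that shows up repeatedly in `unlisted_lookups` is a candidate for a new row at the next quarterly update.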

The artefact takes two or three hours of senior-architect time to write for an established platform. Maintaining it (quarterly updates as the platform evolves) takes less. The payoff is that the team stops re-arguing the same decisions month after month.

The governance forum structure

Platforms need forums where decisions are made. Three patterns work in practice:

Platform steering forum. Senior architecture and platform leadership, monthly. Reviews capacity headroom, security findings, platform roadmap, escalations from consumer teams. Output: directional decisions, capacity commitments, priority shifts.

Platform user council. Representatives from consumer teams, monthly. Surfaces friction, requests features, validates roadmap. Output: feedback to the platform team, prioritisation input, awareness of upcoming changes.

Architecture review board (for platform-affecting changes). Senior architects, ad-hoc, for changes that affect multiple teams or change platform-level patterns. Output: explicit decisions captured as ADRs.

Smaller estates can collapse these into fewer meetings; larger estates may need more. The pattern that fails is one all-purpose meeting that tries to be steering and user-feedback and architecture review simultaneously; nothing gets the time it deserves.

The policy enforcement boundary

The boundary between "policy by document" and "policy by tooling" is where platform governance succeeds or fails.

Policy by document is the traditional approach — published standards, reviewed by humans, enforced through code review and architecture review. It works for a small number of high-impact decisions where human judgment matters; it fails as the only enforcement mechanism for high-volume routine decisions.

Policy by tooling is enforcement through automation — OPA, Kyverno, Pod Security Admission, image scanning gates, IaC linters. Routine decisions are caught at pull-request time or deployment time without requiring human review. Engineers can't ship a configuration that violates policy.

The estates that operate platforms well use both:

  • Policy-by-tooling for high-volume, well-defined rules (no privileged pods, no images with critical CVEs, no public S3 buckets, no resource requests above team budget)
  • Policy-by-document for judgment calls (new service architecture, integration pattern selection, capacity escalations)

The mistake we see most often is reliance on policy-by-document for everything. Consumer teams may find the documents but often don't read them, so the platform team catches violations during code review or, worse, in production. Tooling that catches violations earlier produces fewer escalations and less friction.
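
To make the shape of a policy-by-tooling rule concrete, here is a minimal sketch of two of the rules mentioned above (no privileged pods, no resource requests above team budget) written as a plain Python check over a Pod manifest. In a real estate these rules would normally live in OPA, Kyverno, or Pod Security Admission rather than hand-written code; the function names and the single-number CPU budget are assumptions made for illustration.

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as "500m" or "2" to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def check_pod(pod: dict, team_cpu_budget: float) -> list[str]:
    """Return a list of policy violations for a Pod manifest (empty if compliant)."""
    violations = []
    total_cpu = 0.0
    for container in pod.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")
        # Rule 1: no privileged containers.
        if container.get("securityContext", {}).get("privileged", False):
            violations.append(f"container {name}: privileged mode is not allowed")
        cpu = container.get("resources", {}).get("requests", {}).get("cpu")
        if cpu is not None:
            total_cpu += parse_cpu(str(cpu))
    # Rule 2: total CPU requests must fit the team's budget (hypothetical budget model).
    if total_cpu > team_cpu_budget:
        violations.append(
            f"CPU requests of {total_cpu:g} cores exceed the team budget of {team_cpu_budget:g}"
        )
    return violations
```

Run at pull-request time, a check like this returns its findings to the engineer who made the change, before a reviewer or a production incident has to.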

The exception process

Every well-governed platform has an exception process. Some workloads genuinely need to deviate from the standard. The discipline is making exceptions:

  • Explicit — exceptions are documented, with the rationale, the affected workload, the compensating control if any, and the expiry date if temporary
  • Approved by the right authority — usually platform lead + security lead, not approved by the team requesting the exception
  • Visible — the exception register is shared; other teams can see what has been granted and why
  • Time-bound — most exceptions are temporary, with a defined sunset; permanent exceptions are rare and require senior signoff

The estates that maintain an exception process produce a culture where the standard path is the easy path and deviation requires effort. The estates that allow informal exceptions accumulate undocumented variations that eventually become the norm.
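
The four properties above translate fairly directly into a register entry and a couple of checks. The following is a sketch under stated assumptions: the role names (`platform-lead`, `security-lead`, `senior-architect`), the field names, and the rule that an approver must not sit in the requesting team are illustrative encodings of the bullets, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical role names for the required approvers.
REQUIRED_ROLES = {"platform-lead", "security-lead"}
SENIOR_ROLE = "senior-architect"


@dataclass(frozen=True)
class Approval:
    role: str
    team: str  # the approver's own team


@dataclass(frozen=True)
class PolicyException:
    workload: str
    policy: str
    rationale: str
    requesting_team: str
    approvals: tuple[Approval, ...]
    expires: Optional[date] = None  # None means a permanent exception
    compensating_control: Optional[str] = None


def problems(exc: PolicyException) -> list[str]:
    """Return the reasons an exception entry is not acceptable (empty if fine)."""
    found = []
    if not exc.rationale.strip():
        found.append("rationale is missing")
    roles = {a.role for a in exc.approvals}
    missing = REQUIRED_ROLES - roles
    if missing:
        found.append("missing approval from: " + ", ".join(sorted(missing)))
    if any(a.team == exc.requesting_team for a in exc.approvals):
        found.append("an approver belongs to the requesting team")
    if exc.expires is None and SENIOR_ROLE not in roles:
        found.append("permanent exception lacks senior signoff")
    return found


def expired(register: list[PolicyException], today: date) -> list[PolicyException]:
    """Return time-bound exceptions whose expiry date has passed."""
    return [e for e in register if e.expires is not None and e.expires < today]
```

Running `expired` on a schedule and publishing the result is one low-effort way to keep the register visible and to retire old exceptions rather than letting them become the norm.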

Resource allocation

Shared platforms have finite resources. Compute capacity, storage, message broker partitions, secret manager allocations, observability ingestion volume — all of these have limits. Governance decides how they're allocated.

Patterns that work:

Per-team budgets. Each consumer team has a defined budget — number of namespaces, compute resources, observability ingestion volume, etc. Teams operate within their budget; requests above their budget go through the steering forum.

Showback or chargeback. The platform team reports what each team actually consumed. Showback (visibility only) typically reduces consumption by 10-20% over time. Chargeback (actual cost transfer) is more disruptive but produces stronger discipline.

Headroom buffer. The platform maintains capacity headroom (typically 30-40%) so that growth doesn't require emergency provisioning. The headroom is a deliberate cost, not an oversight.

Quota policy at the platform layer. Kubernetes ResourceQuotas, cloud account quotas, namespace limits — enforced through tooling rather than through trust.
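
As an illustration of quota enforced through tooling, the sketch below generates a Kubernetes ResourceQuota manifest from a team's budget. The manifest fields (`apiVersion: v1`, `kind: ResourceQuota`, `spec.hard` with `requests.cpu`, `requests.memory`, and `pods`) are standard Kubernetes; the function name, the `-budget` naming convention, and the example namespace are assumptions.

```python
import json


def build_resource_quota(namespace: str, cpu_cores: int, memory_gib: int, max_pods: int) -> dict:
    """Build a ResourceQuota manifest that caps a namespace at the team's budget."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-budget", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": str(cpu_cores),
                "requests.memory": f"{memory_gib}Gi",
                "pods": str(max_pods),
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical team namespace and budget; kubectl accepts JSON manifests.
    print(json.dumps(build_resource_quota("team-payments", 20, 64, 100), indent=2))
```

Generating quotas from the same source that records the per-team budget keeps the enforced limit and the agreed limit from drifting apart.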

The estates that handle resource allocation well make consumption visible, set explicit budgets, and enforce them. The estates that don't end up with unbounded consumption, which produces cost surprises and capacity crises.
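
The arithmetic behind showback and headroom is simple enough to sketch. The example below assumes a single unit of consumption (for instance CPU cores) and positive budgets; the function names, the report fields, and the use of 30% as a minimum headroom target (the lower end of the range mentioned above) are illustrative choices.

```python
MIN_HEADROOM = 0.30  # assumed target, taken from the lower end of the 30-40% range


def headroom_fraction(capacity: float, used: float) -> float:
    """Fraction of platform capacity that is still unallocated."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return (capacity - used) / capacity


def showback(budgets: dict[str, float], usage: dict[str, float]) -> list[dict]:
    """Per-team report of consumption against budget (budgets assumed positive)."""
    report = []
    for team in sorted(budgets):
        used = usage.get(team, 0.0)
        budget = budgets[team]
        report.append({
            "team": team,
            "used": used,
            "budget": budget,
            "utilisation": used / budget,
            "over_budget": used > budget,
        })
    return report
```

For example, a platform with 1,000 cores of capacity and 650 in use has (1000 - 650) / 1000 = 35% headroom, which meets the assumed 30% target. A team using 450 cores against a budget of 400 would appear in the report as over budget and would take the overage to the steering forum.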

Communicating platform changes

The platform changes over time. New capabilities, deprecations, retirements, breaking changes. How these are communicated to consumer teams determines how much trust the platform team has.

Patterns that work:

  • Roadmap visibility — what's coming in the next quarter, what's coming in the next year. Teams can plan against it.
  • Advance notice for breaking changes — deprecation timelines that are credible. Three months for routine changes, six months for substantial ones, twelve months for major ones.
  • Release notes for every release — what changed, what's deprecated, what to watch for. Distributed through whatever channel teams actually read.
  • Office hours — regular sessions where consumer teams can ask questions, surface concerns, get help. The platform team's accessibility shapes the relationship.
  • Incident communication — when something breaks, the platform team communicates promptly with consumer teams, not after the fact.

The estates that communicate well develop platform teams trusted by consumer teams. The estates that communicate poorly develop adversarial relationships where consumer teams route around the platform when they can.
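
The notice periods above (three, six, and twelve months) can be turned into a small helper that computes the earliest date a change may land once it has been announced. This is a sketch: the change-class names and function names are hypothetical, and clamping to the last day of a shorter month is one reasonable convention among several.

```python
import calendar
from datetime import date

# Notice periods from the guidance above, in months.
NOTICE_MONTHS = {"routine": 3, "substantial": 6, "major": 12}


def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of the target month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def earliest_removal(announced: date, change_class: str) -> date:
    """Earliest date a breaking change may take effect after its announcement."""
    return add_months(announced, NOTICE_MONTHS[change_class])
```

Publishing the computed date in the release notes and the roadmap gives consumer teams a single, credible deadline to plan against.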

The shadow infrastructure signal

The most reliable signal that platform governance has gone wrong is the appearance of shadow infrastructure. Consumer teams stand up their own deployments, their own observability stacks, their own secrets handling — outside the platform. Sometimes legitimately (the platform doesn't fit their workload); often as workarounds for friction the platform team isn't aware of.

When shadow infrastructure appears, ask honestly why. Three answers recur:

  • "The platform's standard pattern doesn't fit this workload." This is a legitimate reason and may warrant a platform extension.
  • "The platform is too hard to use for this case." This is a platform-team friction problem.
  • "The team didn't know the platform supported this." This is a communication failure.

The governance response is rarely to crack down on shadow infrastructure. The governance response is to address the underlying cause; once the platform genuinely serves the team's need, the shadow infrastructure becomes unnecessary.

What we recommend

For an enterprise establishing platform governance:

  1. Write the decision-rights matrix. Get senior signoff. Make it visible.
  2. Establish the three forums (steering, user council, architecture review) with cadence and output expectations.
  3. Implement policy-by-tooling for high-volume routine decisions. Reserve policy-by-document for judgment calls.
  4. Establish the exception process with explicit approval, visibility, and time-bounds.
  5. Set up resource allocation with per-team budgets and visibility.
  6. Establish communication cadence — roadmap, deprecations, release notes, office hours.

For an existing platform with governance friction:

  1. Audit the decision rights. Are they written down? Are they current?
  2. Audit the policy enforcement mix. Are too many things relying on document-policy?
  3. Audit the exception register. Are exceptions tracked? Are old exceptions retired?
  4. Audit the consumer-team relationship. Is shadow infrastructure appearing? What does it signal?

Platform governance is unglamorous, continuous work. The estates that take it seriously produce platforms that consumer teams trust, that scale operationally, and that compound value over years. The estates that treat governance as either bureaucracy or oversight produce friction that ultimately costs more than the governance was meant to prevent.
