Intellectual
Platform Engineering · 6 June 2023 · 8 min read

Platform Reliability Engineering

SRE has moved from Google-specific practice to enterprise discipline. A practitioner view of what site reliability engineering actually requires in regulated enterprise platforms — error budgets, postmortem culture, the operational disciplines that hold.

Site Reliability Engineering started as Google's specific approach to operating its services. Over the last several years the practice has been documented, generalised, and adopted across enterprise IT. By 2023 it has become a dominant operational model for cloud-native platforms in regulated industries, and its terminology (error budgets, SLOs, toil reduction, blameless postmortems) is now standard vocabulary.

The vocabulary has become standard. The actual practice often has not. Most enterprise organisations claiming SRE adoption are doing parts of the practice without the disciplines that make it work. This piece is a practitioner view of what SRE actually requires in enterprise platforms, where the value comes from, and where the practice fails when adopted superficially.

The core idea, stated honestly

SRE is operational engineering work applied to running production systems. The core ideas:

  • Reliability is engineered, not wished for. Reliability targets are measured, committed, and acted on.
  • Operations work is engineering work. The same discipline applied to software engineering applies to operating it.
  • Toil is the enemy. Repetitive manual work indicates an engineering problem to solve, not a constant to live with.
  • Failures are learning opportunities. Postmortems are about understanding what happened, not assigning blame.
  • Reliability is a feature, not a separate concern. It trades off against velocity; both sides of the tradeoff are visible and negotiated.

These ideas are old; what SRE contributed was the operational practice that makes them tractable: Service Level Objectives, error budgets, toil measurement, postmortem structure, and on-call rotation discipline.

Service Level Objectives — the most consequential primitive

The single most important SRE primitive is the Service Level Objective (SLO): a measurable commitment about the reliability of a service. "99.9% of requests in the last 30 days returned successfully" is an SLO. "99% of requests in the last 7 days returned within 500ms" is an SLO.
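
As a rough illustration, the two example SLOs above can be evaluated from a window of request records. This is a minimal sketch, not a reference implementation: the `Request` type, its field names, and the sample window are hypothetical, and a real platform would source these measurements from user-side telemetry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One user-facing request in the measurement window (hypothetical shape)."""
    ok: bool            # True if the user received a successful response
    latency_ms: float   # latency as observed from the user side


def availability(requests: list[Request]) -> float:
    """Fraction of requests in the window that returned successfully."""
    if not requests:
        raise ValueError("no requests in the window")
    return sum(r.ok for r in requests) / len(requests)


def latency_compliance(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests in the window that returned within threshold_ms."""
    if not requests:
        raise ValueError("no requests in the window")
    return sum(r.latency_ms <= threshold_ms for r in requests) / len(requests)


# Made-up window: 10,000 requests, 5 failures, 50 slow responses.
window = (
    [Request(ok=True, latency_ms=120.0)] * 9_945
    + [Request(ok=True, latency_ms=800.0)] * 50
    + [Request(ok=False, latency_ms=150.0)] * 5
)

meets_availability_slo = availability(window) >= 0.999          # 99.9% successful
meets_latency_slo = latency_compliance(window, 500.0) >= 0.99   # 99% within 500 ms
print(meets_availability_slo, meets_latency_slo)
```

Both checks reduce to the same shape: a ratio of good events to total events over a window, compared against a target.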

Good SLOs share properties:

  • Measurable from the user perspective. The metric reflects what users experience, not what the infrastructure reports. A backend that's "up" but unreachable from users is not meeting its SLO.
  • Sustainable. The team can realistically commit to this level of reliability with the resources available. Aspirational SLOs that no one believes are worse than honest ones.
  • Above-zero but below-perfect. Perfect reliability isn't achievable and isn't economically rational. The SLO captures the level of reliability that matters; anything beyond is gold-plating.
  • Tied to consequence. When the SLO is breached, something happens. Without consequence, the SLO is decoration.

The error budget is the complement of the SLO: a 99.9% success target leaves a 0.1% failure budget. Burning the budget is normal; exhausting it triggers action, typically a feature freeze until the budget recovers, with engineering effort shifted to reliability work.
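
The arithmetic is simple enough to sketch. The function names and the request counts below are hypothetical; exact fractions are used only to keep the illustration free of floating-point noise.

```python
from fractions import Fraction


def error_budget(slo_target: str) -> Fraction:
    """Share of events allowed to fail, e.g. "0.999" gives 1/1000."""
    target = Fraction(slo_target)
    if not 0 < target < 1:
        raise ValueError("SLO target must be strictly between 0 and 1")
    return 1 - target


def allowed_failures(slo_target: str, total_requests: int) -> Fraction:
    """Number of failed requests the SLO tolerates in the window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return error_budget(slo_target) * total_requests


def budget_remaining(slo_target: str, total_requests: int, failed: int) -> Fraction:
    """Share of the budget still unspent; zero or negative means exhausted."""
    return 1 - Fraction(failed) / allowed_failures(slo_target, total_requests)


# Made-up window: one million requests against a 99.9% target.
print(allowed_failures("0.999", 1_000_000))        # failures the budget allows
print(budget_remaining("0.999", 1_000_000, 250))   # share left after 250 failures

# The same budget expressed as time: 0.1% of a 30-day window, in minutes.
downtime_minutes = error_budget("0.999") * 30 * 24 * 60
print(float(downtime_minutes))
```

A negative result from `budget_remaining` means the budget is overspent, which is the condition that triggers the freeze described above.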

The discipline produces a measurable conversation about reliability. Product and engineering can debate "should we ship this risky feature or wait for the budget to recover" with data, not feelings.

Toil measurement and reduction

Toil, in SRE vocabulary, is operational work that is manual, repetitive, automatable, tactical, and without enduring value, and that scales linearly as the service grows. Engineers performing toil are not producing engineering output; they're paying an operational tax.

The SRE practice measures toil, typically as a percentage of the team's time, and sets a budget. Google's published guidance caps toil at 50% of each engineer's time; the remaining 50% is reserved for engineering work that reduces future toil. Most enterprise teams aim for a similar split but, in our experience, find themselves at 70–80% toil.
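
A minimal sketch of the measurement, assuming the team logs hours under a `"toil"` label alongside other kinds of work. The labels and the sample week are hypothetical; only the 50% cap comes from the text.

```python
TOIL_CAP = 0.50  # the 50% ceiling cited above


def toil_share(hours_by_kind: dict[str, float]) -> float:
    """Toil hours as a fraction of all logged hours."""
    total = sum(hours_by_kind.values())
    if total <= 0:
        raise ValueError("no hours logged")
    return hours_by_kind.get("toil", 0.0) / total


# Made-up week for one team: 30 hours of toil, 10 hours of engineering work.
week = {"toil": 30.0, "engineering": 10.0}

share = toil_share(week)
over_cap = share > TOIL_CAP
print(f"toil share: {share:.0%}, over cap: {over_cap}")
```

The value of the number is less in its precision than in having a trend to track from quarter to quarter.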

The intervention is to identify the highest-toil items and engineer them away:

  • The deployment that requires manual checks before approval — automate the checks
  • The recurring incident type that has a known fix — write the runbook, then automate the runbook
  • The capacity request that requires manual provisioning — self-service the capacity
  • The configuration change that requires a manual deploy — declarative configuration with automated rollout
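
One way to find the highest-toil items is to rank recurring tasks by the hours they consume each month. The sketch below assumes a hypothetical `ToilItem` record and made-up figures; the task names simply mirror the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToilItem:
    """A recurring manual task (hypothetical shape, illustrative numbers)."""
    name: str
    occurrences_per_month: int
    minutes_each: float

    @property
    def hours_per_month(self) -> float:
        return self.occurrences_per_month * self.minutes_each / 60


def highest_toil_first(items: list[ToilItem]) -> list[ToilItem]:
    """Order items by monthly cost so the most expensive is automated first."""
    return sorted(items, key=lambda item: item.hours_per_month, reverse=True)


backlog = [
    ToilItem("manual pre-deploy checks", occurrences_per_month=40, minutes_each=30),
    ToilItem("known-fix incident", occurrences_per_month=12, minutes_each=45),
    ToilItem("manual capacity provisioning", occurrences_per_month=6, minutes_each=120),
    ToilItem("manual config deploy", occurrences_per_month=25, minutes_each=15),
]

for item in highest_toil_first(backlog):
    print(f"{item.hours_per_month:6.2f} h/month  {item.name}")
```

Frequency multiplied by duration is a crude cost model; a team might also weight by interruption cost or error risk.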

The estates that take toil reduction seriously gradually shift their teams toward engineering work. Those that do not take it seriously accumulate operational debt that compounds.

Postmortem culture, blameless and rigorous

When something goes wrong, the SRE practice produces a postmortem — a structured document describing what happened, when, the contributing factors, and the remediation actions. Two characteristics distinguish good postmortems from bad ones:

Blameless. The document focuses on contributing factors and systemic causes, not on which individual made which decision. The premise: engineers act rationally within the system they're given; if the system produces an incident, the system needs improvement, not the engineer.

Rigorous. The document captures the actual sequence of events with timestamps, the decisions taken and why, the assumptions that turned out wrong, and the remediation actions with owners. The artifact is genuinely useful for future engineers reading it.
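
Both characteristics can be partly encoded in the document's structure. The sketch below is one possible shape, with hypothetical type and field names and a made-up draft: there is no field that names an individual as a cause, and a small check flags the gaps that make a postmortem less rigorous.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEvent:
    at: datetime
    description: str


@dataclass
class Action:
    description: str
    owner: str  # who follows up on the fix; not who caused the incident


@dataclass
class Postmortem:
    title: str
    timeline: list[TimelineEvent] = field(default_factory=list)
    contributing_factors: list[str] = field(default_factory=list)
    wrong_assumptions: list[str] = field(default_factory=list)
    actions: list[Action] = field(default_factory=list)


def rigour_gaps(pm: Postmortem) -> list[str]:
    """Return the structural gaps found in a draft; an empty list means none."""
    gaps = []
    if not pm.timeline:
        gaps.append("timeline is empty")
    elif any(a.at > b.at for a, b in zip(pm.timeline, pm.timeline[1:])):
        gaps.append("timeline is not in chronological order")
    if not pm.contributing_factors:
        gaps.append("no contributing factors recorded")
    if not pm.actions:
        gaps.append("no remediation actions recorded")
    for action in pm.actions:
        if not action.owner.strip():
            gaps.append(f"remediation action has no owner: {action.description}")
    return gaps


# Made-up draft with one unowned action.
draft = Postmortem(
    title="Checkout latency incident",
    timeline=[
        TimelineEvent(datetime(2023, 1, 10, 9, 2), "latency alert fired"),
        TimelineEvent(datetime(2023, 1, 10, 9, 40), "rollback completed"),
    ],
    contributing_factors=["connection pool limit not load-tested"],
    actions=[Action("add pool saturation alert", owner="")],
)
print(rigour_gaps(draft))
```

A structural check like this catches missing fields, not shallow analysis; the quality of the reasoning still depends on review.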

Postmortems are not the only learning mechanism, but they are the most institutionalised one. The estates that produce real postmortems develop a learning culture; the estates that produce template-filled documents develop an artifact-production culture.

On-call discipline

The on-call rotation is the front line. The SRE practice has documented disciplines:

  • Bounded shifts. A 12-hour or 24-hour shift is sustainable; a permanent on-call is not.
  • Compensation. On-call time is paid time. The expectation is professional response, not heroic dedication.
  • Documentation. Every alertable issue has a runbook entry. On-call engineers can resolve known patterns without paging the senior team.
  • Escalation paths. When the on-call engineer can't resolve the issue, the escalation path is clear, documented, and rehearsed.
  • Page volume limits. When a team's page volume exceeds an agreed threshold, that triggers engineering work to reduce the noise. Alert fatigue is a measurable engineering problem.
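
Two of these disciplines, runbook coverage and page volume limits, lend themselves to automated checks. The sketch below assumes hypothetical alert names, runbook paths, and a threshold of two pages per shift chosen purely for illustration; the text does not prescribe a number.

```python
def alerts_without_runbook(alerts: set[str], runbooks: dict[str, str]) -> set[str]:
    """Alert names that have no runbook entry, or only an empty one."""
    return {name for name in alerts if not runbooks.get(name, "").strip()}


def exceeds_page_threshold(pages_per_shift: list[int], max_mean_pages: float) -> bool:
    """True when the mean number of pages per shift is above the agreed limit."""
    if not pages_per_shift:
        return False
    return sum(pages_per_shift) / len(pages_per_shift) > max_mean_pages


# Made-up inventory: three alerts, one with a runbook, one with an empty entry.
alerts = {"disk_full", "cert_expiry", "queue_backlog"}
runbooks = {"disk_full": "runbooks/disk_full.md", "cert_expiry": ""}

missing = alerts_without_runbook(alerts, runbooks)
noisy = exceeds_page_threshold([1, 4, 6, 5], max_mean_pages=2.0)  # assumed limit
print(sorted(missing), noisy)
```

Run on a schedule, checks like these turn "every alert has a runbook" from an aspiration into something that fails visibly when it stops being true.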

The estates that operate well treat on-call as a sustained engineering discipline. The estates that treat on-call as a tax to be tolerated produce burnout and high turnover among the engineers who carry the on-call burden.

The reliability-velocity tradeoff, made explicit

The deepest contribution of SRE is making the reliability-velocity tradeoff explicit. Product wants velocity; reliability work slows velocity. Reliability wants stability; velocity introduces risk. Without a framework, the conflict is permanent.

SRE's framework: the error budget makes the tradeoff quantitative. When the budget has headroom, velocity dominates — risky features can ship. When the budget is exhausted, reliability dominates — the team focuses on stabilisation until the budget recovers.
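
Reduced to its simplest form, the policy is a gate on what may ship. The sketch below is one possible encoding under stated assumptions: the `Change` categories and the rule that reliability work always ships are illustrative choices, not a prescribed policy.

```python
from enum import Enum


class Change(Enum):
    FEATURE = "feature"
    RELIABILITY = "reliability"


def may_ship(change: Change, budget_remaining: float) -> bool:
    """Gate a release on the share of error budget still unspent.

    budget_remaining is a fraction of the budget: 1.0 is untouched,
    0.0 or below is exhausted.
    """
    if budget_remaining > 0:
        return True  # headroom: velocity dominates
    return change is Change.RELIABILITY  # exhausted: stabilisation work only


print(may_ship(Change.FEATURE, 0.40))       # budget has headroom
print(may_ship(Change.FEATURE, -0.10))      # budget overspent
print(may_ship(Change.RELIABILITY, -0.10))  # stabilisation work during a freeze
```

Real policies usually add nuance, such as exceptions for security fixes, but the decision still starts from the same number.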

The conversation moves from "are you taking reliability seriously" (unfalsifiable) to "what's the error budget, and how should we spend it" (concrete, actionable). The estates that adopt this framework reduce the chronic friction between product and reliability concerns.

Where enterprise SRE adoption fails

The failure patterns we see most often:

SRE in title only. A team is renamed "platform SRE"; the practices don't change. The team continues operational work without error budgets, without toil measurement, without postmortem discipline. The label is adopted; the operating model is not.

SLOs without consequence. SLOs are published; the team measures against them; nothing happens when they're breached. Without the error budget consequence, SLOs are decoration.

Postmortems that produce blame. Postmortems happen but they identify individuals as root causes. The learning culture doesn't develop; engineers become defensive about decisions; the postmortem volume drops over time.

Toil that isn't measured. The team claims "we've reduced toil" without any measurement to back it up. The toil stays the same; the team's sense of being overwhelmed continues.

On-call without runbooks. The senior engineer who knows everything is the on-call escalation point indefinitely. Their knowledge stays personal rather than becoming institutional.

These patterns are common because adopting the vocabulary is easier than adopting the practice. The vocabulary is free; the practice requires sustained operating-model commitment.

The capabilities the practice requires

Enterprise SRE adoption requires:

  • Engineering capacity in operations. SRE work is engineering, not just operating. The team needs engineers who can write code, design tooling, build automation.
  • Tooling investment. SLO measurement, error budget tracking, postmortem documentation, runbook libraries — all require tooling. Spreadsheets don't scale.
  • Cultural commitment from leadership. When error budgets are exhausted, feature work pauses. Leadership has to back this; without leadership backing, the budget consequence isn't honoured.
  • Product-side partnership. Product managers have to accept the framework. SRE without product alignment is a unilateral commitment that breaks at the first pressure point.

The estates that have these capabilities tend to adopt SRE successfully. The estates that try to bolt SRE practices onto a team that doesn't have these capabilities tend to produce the failure patterns above.

What we recommend

For an enterprise standing up SRE practice:

  1. Pick one platform, typically the most critical, and adopt the full practice: SLOs, error budgets, toil measurement, postmortem discipline, and on-call rotation.
  2. Invest in tooling. Most of the practice is enabled by tooling; without it, the discipline is informal and decays.
  3. Get leadership commitment to the error budget consequence. If feature work doesn't actually pause when the budget is exhausted, the budget isn't real.
  4. Run the practice for six months, then assess. What's improving? What's not? Adjust.
  5. Expand to a second platform once the first is stable.

For a team claiming SRE practice without the underlying disciplines:

  1. Honestly audit: how many of the disciplines are actually in place? Most teams in this state have one or two.
  2. Pick the highest-leverage missing discipline and invest. Usually it's SLOs with real consequence, or toil measurement, or postmortem rigour.
  3. Build out from there.

Site reliability engineering is not a label or a team renaming. It is a sustained operating-model investment. The estates that make the investment produce platforms that compound reliability and reduce operational pain over years. The estates that adopt the vocabulary without the practice continue to operate the way they always did, with new terminology overlaid on the existing reality.
