Intellectual
Platform Engineering | 26 January 2023 | 8 min read

Kubernetes for Enterprise Platforms

Kubernetes is the default substrate for new enterprise platforms. The operating model — not the platform choice — is where most Kubernetes rollouts in regulated enterprises succeed or fail. A practical view from delivery.

Kubernetes has become the default substrate for new enterprise platforms. Three years ago it was a credible choice; today it is the credible choice. The question is no longer whether to adopt it but how to operate it without recreating the operational debt the cloud-native pattern was meant to retire.

Most of the Kubernetes estates we audit in regulated enterprises are technically functional and operationally fragile. The cluster runs; the applications deploy; the team can ship; but the operational discipline is thinner than the workloads warrant. This piece sets out the operating model that separates Kubernetes estates that compound value from estates that accumulate technical debt at high velocity.

The managed vs self-run question

The first decision is whether to consume a managed Kubernetes service (AKS, EKS, GKE, OpenShift on-cloud) or run the control plane yourself. For nearly every enterprise we work with, the answer is managed.

The case for managed is operational. The control plane is hard to operate well. The cloud providers have invested heavily in operating control planes; your platform team has not. Outsourcing the control plane to a provider that runs millions of clusters is the right trade.

The case against managed (sovereignty, regulatory constraints, vendor lock-in concerns) is real for a small subset of enterprises. For most, it is theoretical. Nearly all of the estates we have seen running self-managed Kubernetes have hit control-plane operational pain that their teams underestimated.

The clarifying question: is your platform team going to be world-class at operating Kubernetes control planes? If not, use a managed service. Spend the engineering effort on the things that differentiate your platform — the operating model, the developer experience, the observability — not on running etcd.

Cluster topology decisions

Once a managed service is chosen, topology is the next consequential set of decisions.

Single cluster vs multi-cluster. A single cluster is simpler operationally but creates a large blast radius for failures. A multi-cluster topology isolates failure but multiplies operational work. Most enterprise estates land at two to four clusters, separated by environment (production, staging, development) and often split further by sensitivity (regulated workloads on a dedicated cluster) or geography.

Namespace strategy. Each cluster has many namespaces. The strategy that holds up: one namespace per team, with multiple environments-as-namespaces only if the team genuinely needs them. The pattern that fails: namespaces per project per environment, which produces hundreds of namespaces nobody can navigate.

Node pool segmentation. Different workload classes (memory-heavy, GPU, regulated, system) get different node pools with appropriate taints and tolerations. The pattern that fails: every workload runs on the default pool, producing noisy-neighbour effects when one workload spikes.

Network model. CNI choice (Calico, Cilium, Azure CNI, AWS VPC CNI) determines what network policies are possible. Ingress strategy (NGINX, Traefik, cloud-native ingress controllers, service mesh) determines what traffic management is possible. These are decisions to make consciously, not by default.

Multi-region. For workloads with disaster-recovery or low-latency requirements, multi-region topology adds significant operational complexity. Most workloads do not need it; the ones that do need it should adopt it deliberately, with the operating model that goes with it.
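The node pool segmentation described above depends on how taints and tolerations match. The sketch below models the matching rules in plain Python as a way to reason about which pods can land on which pool. It is an illustration, not the scheduler's implementation, and the taint key `workload-class` is a hypothetical naming convention.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Taint:
    key: str
    effect: str               # "NoSchedule", "PreferNoSchedule" or "NoExecute"
    value: str = ""

@dataclass(frozen=True)
class Toleration:
    key: str = ""             # empty key with operator "Exists" matches any taint key
    operator: str = "Equal"   # "Equal" (the default) or "Exists"
    value: str = ""
    effect: str = ""          # empty effect matches any effect

def tolerates(tol: Toleration, taint: Taint) -> bool:
    """True if this single toleration matches this single taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.operator == "Exists":
        return tol.key == "" or tol.key == taint.key
    return tol.key == taint.key and tol.value == taint.value

def can_schedule(tolerations, taints) -> bool:
    """A pod may be placed only if every hard taint on the node is tolerated.
    PreferNoSchedule is a soft preference, so it never blocks placement."""
    hard = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in hard)

# Hypothetical GPU pool taint and a workload that is allowed onto it.
gpu_taint = Taint(key="workload-class", value="gpu", effect="NoSchedule")
gpu_job = [Toleration(key="workload-class", value="gpu", effect="NoSchedule")]
web_app = []  # no tolerations: stays off the GPU pool
```

A toleration only permits a pod onto a tainted pool; it does not attract it there. Steering the workload to the pool still needs a node selector or node affinity alongside the toleration.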

The deployment story

Kubernetes deployment patterns have converged on GitOps. The model: declarative configuration lives in a Git repository; an in-cluster controller (Argo CD or Flux are the dominant choices) reconciles the cluster state with the repository state.

The operational advantages:

  • Every change to the cluster is visible in the Git history
  • Rollbacks are Git reverts, not manual interventions
  • Configuration drift is detectable (the controller surfaces "the cluster differs from the repository")
  • Multi-cluster management consolidates around a single source of truth
  • Auditability is built in — every change has an actor, a timestamp, and a rationale (commit message)
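To make the drift-detection point concrete, here is a minimal sketch of the idea: compare what the repository declares with what the cluster reports, and list the paths that differ. It is a simplified illustration, not how Argo CD or Flux implement their comparison, and the manifest fragments are hypothetical.

```python
def find_drift(desired, live, path=""):
    """Return the paths where the live object differs from the declared one.
    Fields present only in the live object (status, server defaults) are ignored."""
    if isinstance(desired, dict):
        if not isinstance(live, dict):
            return [path or "."]
        drift = []
        for key, want in desired.items():
            child = f"{path}.{key}" if path else key
            if key not in live:
                drift.append(child)
            else:
                drift.extend(find_drift(want, live[key], child))
        return drift
    if isinstance(desired, list):
        if not isinstance(live, list) or len(desired) != len(live):
            return [path or "."]
        drift = []
        for index, (want, have) in enumerate(zip(desired, live)):
            drift.extend(find_drift(want, have, f"{path}[{index}]"))
        return drift
    return [] if desired == live else [path or "."]

# Declared in Git (hypothetical fragment of a Deployment).
desired = {"spec": {"replicas": 3, "template": {"spec": {"containers": [
    {"name": "api", "image": "registry.example.com/api:1.4.2"}]}}}}

# Reported by the cluster after someone scaled the workload by hand.
live = {"spec": {"replicas": 5, "template": {"spec": {"containers": [
    {"name": "api", "image": "registry.example.com/api:1.4.2"}]}}},
    "status": {"readyReplicas": 5}}

print(find_drift(desired, live))  # ['spec.replicas']
```

The value of the pattern is less in the diff itself than in what happens next: the controller either reports the difference or reverts it, so a manual change cannot quietly become the new normal.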

The operational discipline GitOps demands:

  • Branch protection on the deployment repository — no direct pushes to main
  • Pull-request review for every change to the deployment configuration
  • Secrets management that is GitOps-compatible (Sealed Secrets, External Secrets Operator, cloud-native secrets stores)
  • Image tag discipline — pinned tags or content-addressable digests, no :latest in production
  • A working rollback path that is tested, not theoretical

Estates that adopt GitOps without these disciplines reproduce the manual deployment problems with extra steps. The pattern works when the operating model commits to it.
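Image tag discipline is one of the easier disciplines to enforce mechanically, for example as a check in the pull-request pipeline of the deployment repository. The sketch below flags image references that use :latest or carry no tag at all. The function name and the registry hostnames are hypothetical, and the parsing is deliberately simple.

```python
import re

# A full sha256 digest at the end of the reference, e.g. app@sha256:<64 hex chars>.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")

def image_pin_problem(image: str):
    """Return a reason string if the reference is not pinned, otherwise None."""
    if DIGEST_RE.search(image):
        return None  # content-addressable digest
    if "@" in image:
        return "malformed digest"
    # Look only at the last path segment so a registry port (host:5000) is not
    # mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return "no tag (resolves to :latest)"
    tag = last_segment.rsplit(":", 1)[1]
    if tag == "latest":
        return "uses :latest"
    return None

for ref in ("nginx:latest",
            "registry.example.com:5000/team/api",
            "registry.example.com/team/api:1.4.2"):
    print(ref, "->", image_pin_problem(ref))
```

A pinned tag passes this check, but tags can be moved to point at a different image after the fact; a digest cannot. Where the audit trail matters, digests are the stronger guarantee.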

Observability that operations engineers can use

A Kubernetes cluster without strong observability is a closed system. Things happen inside it; engineers can't see why. The default Kubernetes tooling (kubectl, raw events) is enough for development but inadequate for production operations.

The observability stack we see working in most enterprise Kubernetes estates:

  • Metrics: Prometheus scraping cluster and application metrics, Grafana dashboards layered on top
  • Logs: Centralised log aggregation (Loki, Elasticsearch, cloud-native equivalent) with structured logging emitted by applications
  • Traces: OpenTelemetry instrumentation in applications, traces collected to a backend (Jaeger, Tempo, cloud-native equivalent)
  • Events: Kubernetes events streamed into the observability stack so cluster behaviour is visible alongside application behaviour
  • Alerting: Alertmanager rules on the metrics, with alerts that have meaningful thresholds and actionable destinations

The estates that operate well have all five layers working together. The estates that struggle usually have metrics and logs but not traces, or have an observability stack that engineers find too hard to use during incidents.
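Structured logging is the part of this stack that application teams own directly. A minimal sketch using only the Python standard library is shown below: each log line is emitted as one JSON object so the aggregation layer can index fields instead of parsing free text. The field names `trace_id` and `order_id` are hypothetical examples of context an application might attach.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    CONTEXT_FIELDS = ("trace_id", "order_id")  # hypothetical context fields

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)

def build_logger(name, stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger

if __name__ == "__main__":
    log = build_logger("checkout", sys.stdout)
    log.info("payment authorised", extra={"trace_id": "4bf92f35", "order_id": "A-1001"})
```

Carrying the trace identifier in every log line is what lets an engineer move from a log entry to the corresponding trace during an incident, which is where the layers start working together.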

Security at the cluster level

Kubernetes security has matured significantly. The disciplines we recommend:

  • Pod security standards enforced through Pod Security Admission. The restricted policy as the default; deviations are explicit and reviewed.
  • Network policies restricting pod-to-pod traffic. Default-deny ingress is the starting posture; specific traffic patterns are allowed explicitly.
  • RBAC for cluster access, mapped to the organisation's identity system. No shared kubeconfig files; every actor authenticates as themselves.
  • Image scanning in the CI/CD pipeline (Trivy, Snyk, cloud-native scanners) with policy on critical vulnerabilities.
  • Runtime security (Falco, Sysdig Secure, cloud-native equivalents) detecting anomalous behaviour at the kernel level.
  • Secrets management through External Secrets Operator or equivalent, with a managed secrets store as the source of truth and encryption-at-rest enabled for any Kubernetes secrets that are synced into the cluster.

The estates we have audited that fail security audits usually fail on RBAC drift (people who left the company can still access the cluster), image scanning gaps (vulnerable images deployed to production), or secrets management (sensitive data in Kubernetes secrets without encryption-at-rest enabled).
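The first two disciplines can be stamped onto every new namespace by the platform tooling. The sketch below builds a namespace carrying the Pod Security Admission labels for the restricted level, plus a default-deny ingress NetworkPolicy, and prints them as JSON, which kubectl accepts as well as YAML. The helper name and the `payments` namespace are hypothetical.

```python
import json

def namespace_baseline(name: str):
    """Return the baseline objects for a new namespace: the namespace itself
    with Pod Security Admission labels, and a default-deny ingress policy."""
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {
                # Reject pods that violate the restricted standard.
                "pod-security.kubernetes.io/enforce": "restricted",
                # Also warn the client at apply time.
                "pod-security.kubernetes.io/warn": "restricted",
            },
        },
    }
    default_deny = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-ingress", "namespace": name},
        "spec": {
            "podSelector": {},           # empty selector: every pod in the namespace
            "policyTypes": ["Ingress"],  # no ingress rules listed, so none is allowed
        },
    }
    return [namespace, default_deny]

if __name__ == "__main__":
    bundle = {"apiVersion": "v1", "kind": "List", "items": namespace_baseline("payments")}
    print(json.dumps(bundle, indent=2))
```

The NetworkPolicy only has an effect if the cluster's CNI enforces network policies, which is one reason the network model is a topology decision rather than a default.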

Capacity and cost

Kubernetes makes it easier to spin up workloads, which makes it easier to spend money. Cost discipline is part of the operating model, not a separate concern.

What works:

  • Resource requests and limits on every workload. Workloads without limits can starve other workloads.
  • Horizontal Pod Autoscaler for variable-load workloads, configured against meaningful metrics (CPU, custom metrics for queue depth or request rate).
  • Cluster Autoscaler scaling node pools up and down based on pending pods.
  • Cost monitoring at namespace level, with showback or chargeback to consumer teams.
  • Right-sizing reviews quarterly. Workloads tend to be over-provisioned; review and tighten.
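To reason about how the Horizontal Pod Autoscaler will behave against a chosen metric, it helps to work through its documented scaling rule: the desired replica count is the current count multiplied by the ratio of the observed metric to the target, rounded up, with no change while the ratio stays within a tolerance (0.1 by default). The sketch below reproduces that arithmetic; the replica bounds and the sample numbers are hypothetical.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count the autoscaler would aim for, given one metric."""
    ratio = current_metric / target_metric
    # Within the tolerance band the autoscaler leaves the workload alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # The result is clamped to the bounds configured on the autoscaler.
    return max(min_replicas, min(max_replicas, wanted))

# 4 replicas averaging 90% CPU against a 60% target: scale out to 6.
print(desired_replicas(4, 90, 60))   # 6
# 4 replicas at 63% against a 60% target: within tolerance, stay at 4.
print(desired_replicas(4, 63, 60))   # 4
```

CPU utilisation in this rule is measured relative to the pod's CPU request, so an autoscaler configured on CPU is only as meaningful as the requests set on the workload.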

Estates that lack cost discipline often find their cloud bills are higher than those of the equivalent VM-based estate. The pattern works when capacity management is a continuous practice, not an annual budget exercise.
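One way to make that continuous practice routine is a pipeline check that flags workloads missing requests or limits before they reach the cluster. The sketch below walks a pod spec, represented as a plain dictionary, and lists the gaps. The function name and the sample spec are hypothetical, and init containers are left out for brevity.

```python
def missing_resources(pod_spec: dict):
    """List (container, field) pairs where a CPU or memory request or limit is absent."""
    gaps = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources", {})
        for section in ("requests", "limits"):
            for resource in ("cpu", "memory"):
                if resource not in resources.get(section, {}):
                    gaps.append((container["name"], f"{section}.{resource}"))
    return gaps

# Hypothetical pod spec: the main container lacks a CPU limit and the
# sidecar declares nothing at all.
pod_spec = {"containers": [
    {"name": "api", "resources": {
        "requests": {"cpu": "250m", "memory": "256Mi"},
        "limits": {"memory": "512Mi"}}},
    {"name": "sidecar"},
]}

for container, field in missing_resources(pod_spec):
    print(f"{container}: missing {field}")
```

The same rule can be enforced inside the cluster by an admission policy; running it in the pipeline as well gives teams the feedback at review time rather than at deploy time.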

The platform team model

The role that emerges from running Kubernetes well is the platform team. The platform team:

  • Operates the cluster (or clusters) on behalf of consumer teams
  • Maintains the deployment toolchain (GitOps controllers, CI/CD integration, secrets management)
  • Provides the observability stack as a service to consumer teams
  • Sets the standards (pod security, network policies, image scanning) and enforces them through tooling
  • Handles cluster upgrades, capacity management, and security patches
  • Is on call for cluster-level incidents

The consumer teams (the teams building applications on the platform) consume the platform through a defined contract: namespaces, deployment templates, observability access, security context. They are not responsible for operating the cluster.

This separation is the single most important operating-model decision in Kubernetes adoption. Estates that conflate platform-team work with application-team work produce confusion: application teams cannot operate the cluster, platform teams cannot keep up with application change. The clean separation produces an operating model that scales.

What we recommend

For an enterprise estate adopting Kubernetes:

  1. Use a managed Kubernetes service. The control plane is not your differentiation.
  2. Decide topology consciously — single cluster vs multi-cluster, namespace strategy, node pool segmentation, network model — before workloads start landing.
  3. Adopt GitOps as the deployment model from Day 1. Manual deployments are forbidden outside emergencies.
  4. Stand up the full observability stack (metrics, logs, traces, events, alerting) before production workloads.
  5. Enforce the security disciplines through tooling, not through policy documents.
  6. Stand up the platform team with a defined service contract to consumer teams.

For an existing Kubernetes estate showing operational pain:

  1. Audit against the six commitments above. Which are weak?
  2. Pick the weakest and invest in it. Most often it is observability or the platform-team operating model.
  3. Improve iteratively. Kubernetes operational maturity is a multi-year journey.

Kubernetes done well produces a substrate that compounds value over years. Kubernetes done badly produces a substrate that consumes operational attention without producing equivalent benefit. The technology is the same; the operating model is the difference.
