
Kubernetes in Production — What Nobody Tells You

Matthias Bruns · 7 min read
kubernetes cloud-native engineering

The Tutorial Lies By Omission

Every Kubernetes tutorial follows the same script: spin up a cluster, kubectl apply a deployment, watch your pods come up, celebrate. Maybe throw in an Ingress and an HPA for extra credit. Done — you’re running Kubernetes!

No. You’re running a demo.

I’ve been operating Kubernetes clusters in production for years now, for companies ranging from startups to enterprises. And the gap between “I deployed my first pod” and “I’m running a reliable production workload” is enormous. Here’s what nobody tells you.

Upgrade Pain Is Real

Kubernetes releases a new minor version every four months. Each version is supported for roughly 14 months. That means you’re on a treadmill — if you stop upgrading, you fall off support.

Sounds manageable? Here’s the reality:

  • API deprecations break things. That PodSecurityPolicy you rely on? Gone in 1.25. Your Ingress annotations? Half of them changed between beta and GA. Every upgrade is an audit of your entire manifest base.
  • Managed services lag. EKS, GKE, AKS — they all run their own upgrade schedules. Your cluster might be on 1.28 while the latest is 1.31. Try finding docs for your specific version when everyone’s blogging about the newest features.
  • Add-ons have their own compatibility matrices. Your CNI plugin, your Ingress controller, your cert-manager, your monitoring stack — they all need to be compatible with your K8s version. Upgrade one, and suddenly three others need updating too.

A real-world upgrade looks like this:

# What you think it looks like
kubectl upgrade cluster --version 1.30

# What it actually looks like
# 1. Read the changelog (all of it)
# 2. Check deprecated APIs with pluto/kubent
# 3. Test in staging (you have staging, right?)
# 4. Update all Helm charts for compatibility
# 5. Coordinate with every team that deploys to the cluster
# 6. Pick a maintenance window
# 7. Pray
# 8. Roll back when something breaks
# 9. Debug for 3 hours
# 10. Try again

Budget at least one full day per upgrade for a medium-complexity cluster. Per quarter.
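Steps 1–2 of that list are at least automatable. Tools like pluto and kubent inspect live clusters and Helm releases, but even a crude grep over your manifest repo catches the obvious removals before you book the maintenance window. A sketch — the manifest path and sample file are illustrative, and the pattern list is deliberately incomplete:

```shell
# Crude pre-upgrade check: flag manifests still using API versions
# that have been removed. The sample manifest below stands in for
# your real repo; extend the pattern list for your target version.
mkdir -p /tmp/manifests
cat > /tmp/manifests/psp.yaml <<'EOF'
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
EOF

# policy/v1beta1 (PodSecurityPolicy) was removed in 1.25;
# extensions/v1beta1 Ingress went away back in 1.22.
grep -rlE 'apiVersion: (policy/v1beta1|extensions/v1beta1)' /tmp/manifests
# → /tmp/manifests/psp.yaml
```

This is no substitute for pluto or kubent — those also catch deprecated-but-not-yet-removed APIs — but it is a zero-dependency first pass you can run in CI.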

Operator Fatigue

The Kubernetes ecosystem loves operators. Need a database? There’s an operator. Message queue? Operator. Certificate management? Operator. Monitoring? You guessed it.

The promise is beautiful: declare your desired state, and the operator reconciles reality to match. The problem is that every operator is its own little control plane with its own bugs, its own upgrade cycle, and its own opinions about how things should work.

I’ve seen clusters running 15+ operators. Each one:

  • Watches the API server (adding load)
  • Has its own CRDs (cluttering your API surface)
  • Needs its own RBAC (security surface area)
  • Has its own logging and error patterns
  • May conflict with other operators

When something goes wrong — and it will — you’re debugging interactions between multiple reconciliation loops. Is the database operator fighting with the backup operator? Is the Istio sidecar injector interfering with your StatefulSet rollout? Good luck grepping those logs.

My rule of thumb: Every operator you add should save more operational effort than it creates. Most don’t pass that test. A well-written Helm chart with a CronJob often beats a poorly maintained operator.
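To make that concrete: the classic case where people reach for a backup operator is a nightly database dump, and a plain CronJob covers it. A minimal sketch, assuming Postgres — the Secret name `db-credentials` and PVC name `backup-pvc` are illustrative:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-backup
spec:
  schedule: "0 3 * * *"          # nightly at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: postgres:16
              command: ["/bin/sh", "-c"]
              # DATABASE_URL is expected to come from the Secret below
              args:
                - pg_dump "$DATABASE_URL" | gzip > /backup/$(date +%F).sql.gz
              envFrom:
                - secretRef:
                    name: db-credentials   # illustrative Secret
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              persistentVolumeClaim:
                claimName: backup-pvc      # illustrative PVC
```

No CRDs, no extra controller watching the API server, and when it breaks, the failure mode is a failed Job you can `kubectl logs` — not a reconciliation loop to reverse-engineer.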

The Monitoring Gap

Here’s a dirty secret: most Kubernetes monitoring setups only tell you what’s happening inside Kubernetes. They don’t tell you what your users are experiencing.

A typical Prometheus + Grafana setup gives you:

  • Pod CPU and memory usage
  • Container restart counts
  • Node resource utilization
  • Kubernetes API server latency

What it doesn’t give you out of the box:

  • End-to-end request latency from the user’s perspective
  • Whether your DNS resolution is flaking out
  • That one node’s disk is silently corrupting data
  • That your PersistentVolume is 30ms slower than yesterday
  • That your cloud provider’s load balancer is dropping connections every 47 minutes

The metrics you need to actually operate a production system are application-level metrics and infrastructure-level metrics that go beyond what the Kubernetes API exposes. You need distributed tracing. You need synthetic monitoring. You need to care about the things between your pods and your users.

# What most people monitor
- pod_cpu_usage
- pod_memory_usage
- container_restarts

# What actually matters
- p99_request_latency
- error_rate_by_endpoint
- dns_resolution_time
- certificate_expiry_days
- persistent_volume_iops
- cross_az_network_latency
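A couple of the “what actually matters” signals translate directly into Prometheus alerting rules. A sketch, assuming your services already export a request-duration histogram (the metric name `http_request_duration_seconds` is an assumption about your instrumentation) and that cert-manager’s metrics are scraped:

```yaml
groups:
  - name: user-facing
    rules:
      - alert: HighP99Latency
        # p99 over the request-duration histogram; adjust the metric
        # name to whatever your services actually export
        expr: >
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 10m
      - alert: CertificateExpiringSoon
        # cert-manager exposes expiry as a unix timestamp
        expr: >
          (certmanager_certificate_expiration_timestamp_seconds - time())
            < 14 * 86400
        for: 1h
```

The point is not these exact thresholds — it’s that neither rule can be written from pod CPU and memory metrics alone.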

The Cost Surprise

“Kubernetes will save us money through better resource utilization!” — every pitch deck ever.

Here’s what actually happens:

Control plane costs. On EKS, you’re paying $0.10/hour just for the API server. That’s $73/month before a single pod runs. Multi-cluster? Multiply that.

Node overhead. Each node runs kubelet, kube-proxy, a CNI agent, potentially a logging agent, a monitoring agent, and whatever DaemonSets your platform team decided you need. On a t3.medium, that’s easily 30-40% of your node’s resources consumed by the platform itself.

The “we need three environments” multiplier. Dev, staging, prod — each gets its own cluster. Each cluster needs its own nodes, its own monitoring, its own ingress. Your cloud bill just tripled.

Load balancer proliferation. Every Service of type LoadBalancer creates a cloud LB. Each costs $15-25/month on AWS. I’ve seen clusters with 40+ load balancers because nobody bothered setting up a shared Ingress controller.
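The fix is one cloud LB in front of a shared Ingress controller, with host- or path-based routing behind it. A sketch, assuming an nginx ingress class; the hostname and the `orders`/`payments` Services are illustrative:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: shared-ingress
spec:
  ingressClassName: nginx   # one controller, one cloud LB for everything
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /orders
            pathType: Prefix
            backend:
              service:
                name: orders        # illustrative ClusterIP Service
                port:
                  number: 80
          - path: /payments
            pathType: Prefix
            backend:
              service:
                name: payments      # illustrative ClusterIP Service
                port:
                  number: 80
```

Forty LoadBalancer Services collapse into one LB plus forty cheap ClusterIP Services — the routing moves into config you version-control.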

Persistent volume waste. Volumes that were provisioned for deleted pods, sitting there accruing charges. PVCs that are 100GB because someone copy-pasted a manifest and forgot to adjust.
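A periodic sweep for Released volumes is cheap. On a live cluster that’s `kubectl get pv --no-headers | awk '$5 == "Released"'`; the sketch below runs the same awk filter over sample output (standing in for a cluster) so you can see what it catches:

```shell
# Sample `kubectl get pv` output; column 5 is STATUS. "Released"
# means the claim is gone but the volume -- and its bill -- remain.
cat > /tmp/pv.txt <<'EOF'
pv-001   100Gi   RWO   Retain   Bound      prod/data
pv-002   100Gi   RWO   Retain   Released   prod/old-data
pv-003   20Gi    RWO   Delete   Bound      prod/cache
EOF

# Print name and capacity of every orphaned volume
awk '$5 == "Released" {print $1, $2}' /tmp/pv.txt
# → pv-002 100Gi
```

Wire the live-cluster variant into a weekly CI job and the orphans at least get noticed before the invoice does.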

The real cost of Kubernetes isn’t the sticker price — it’s the operational overhead. You need at least one person who deeply understands the platform. That person’s salary dwarfs your cloud bill.

Networking Is Where Dreams Go To Die

Kubernetes networking is a layer cake of abstractions, and every layer can break independently:

  1. Pod-to-pod: CNI plugin handles this. Works great until it doesn’t. Debugging CNI issues requires understanding iptables, eBPF, or whatever your plugin uses under the hood.
  2. Service discovery: kube-dns/CoreDNS. Usually fine. Until you hit the ndots:5 default and every external DNS lookup takes 5x longer because it tries cluster-local suffixes first.
  3. Ingress: A whole category of “works in docs, breaks in practice.” TLS termination, path routing, header manipulation — each Ingress controller implements the spec slightly differently.
  4. Network policies: The firewall you should be using but probably aren’t. And when you do, you’ll discover that your CNI plugin’s NetworkPolicy implementation has “known limitations.”
# This innocent-looking config has bitten me multiple times
apiVersion: v1
kind: Pod
metadata:
  name: ndots-example
spec:
  containers:
    - name: app
      image: nginx:1.27   # any image; dnsConfig is the point here
  dnsConfig:
    options:
      - name: ndots
        value: "2"  # Override the default 5
  # Without this, resolving "api.stripe.com" walks the cluster
  # search domains first -- several failed lookups before the
  # real one succeeds
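On point 4 above: the usual starting move is a namespace-wide default-deny, then explicit allows. A minimal sketch — the `prod` namespace, the `app: web` label, and the `ingress-nginx` namespace are illustrative, and you should verify your CNI actually enforces what you write:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: prod            # illustrative namespace
spec:
  podSelector: {}            # applies to every pod in the namespace
  policyTypes:
    - Ingress                # all inbound traffic denied unless allowed
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress-controller
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: web               # illustrative workload label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # auto-added namespace label, available since 1.21
              kubernetes.io/metadata.name: ingress-nginx
```

Apply the deny first in staging, watch what breaks, and you get a free inventory of traffic flows nobody had documented.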

So Should You Use Kubernetes?

Yes — if you have the team, the workload, and the operational maturity for it.

Kubernetes makes sense when:

  • You’re running 10+ services that need independent scaling and deployment
  • You have a dedicated platform team (or are willing to build one)
  • You need multi-cloud or hybrid capabilities
  • Your deployment frequency justifies the complexity

Kubernetes is overkill when:

  • You have fewer than 5 services
  • Your team is smaller than 10 engineers
  • A managed PaaS (Cloud Run, App Runner, Fly.io) covers your needs
  • You’re choosing it because it’s on your CV wishlist

The honest truth: Kubernetes is infrastructure for building platforms. If you don’t need a platform, you don’t need Kubernetes. A well-configured VPS with Docker Compose and a CI/CD pipeline will outperform a poorly operated Kubernetes cluster every single time.

The Bottom Line

Kubernetes is a powerful tool. It’s also a complex one that demands respect. If you’re going to run it in production, go in with your eyes open:

  • Budget for upgrades as ongoing work, not one-time setup
  • Be selective with operators — fewer is better
  • Invest in monitoring that goes beyond pod metrics
  • Track your actual costs, not your projected savings
  • Accept that networking will surprise you

And if you’re a mid-sized company wondering whether Kubernetes is right for you — talk to someone who’s operated it, not just someone who sells it. The difference matters.
