Kubernetes has won the container orchestration war. But deploying a "Hello World" app on a tutorial cluster is a world apart from running mission-critical workloads in production. Here are the hard-earned lessons from teams that have been operating Kubernetes at scale.
Lesson 1: Don’t Run Your Own Control Plane (Unless You Must)
- Managed Kubernetes (EKS, GKE, AKS) handles the hardest parts — etcd management, API server scaling, and upgrades
- Self-managed control planes require deep expertise and 24/7 operational commitment
- The cost savings of self-hosting rarely justify the operational burden
- Exception: Air-gapped or highly regulated environments where managed services aren’t an option
Lesson 2: Resource Requests and Limits Are Not Optional
- Without resource requests, the scheduler can’t make intelligent placement decisions
- Without limits, a single runaway pod can starve every other pod on its node
- Over-provisioning wastes money; under-provisioning causes instability
- Use the Vertical Pod Autoscaler (VPA) to right-size requests based on actual usage
- Review resource utilization monthly and adjust
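As a rough illustration of the first two points, the sketch below scans a pod spec for containers that are missing CPU or memory requests or limits. The `missing_resources` helper, the container names, and the image names are hypothetical; a real cluster would more likely enforce this with an admission policy.

```python
from typing import Dict, List


def missing_resources(pod_spec: Dict) -> List[str]:
    """Return one 'container: kind.resource' entry per unset request or limit."""
    problems = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources", {})
        for kind in ("requests", "limits"):
            for resource in ("cpu", "memory"):
                if resource not in resources.get(kind, {}):
                    problems.append(f"{container['name']}: {kind}.{resource}")
    return problems


# Hypothetical pod spec, shaped like the `spec` section of a Pod manifest.
pod_spec = {
    "containers": [
        {
            "name": "api",
            "image": "example.com/api:1.0",
            "resources": {
                "requests": {"cpu": "250m", "memory": "256Mi"},
                "limits": {"memory": "512Mi"},
            },
        },
        {"name": "sidecar", "image": "example.com/sidecar:1.0"},
    ]
}

for problem in missing_resources(pod_spec):
    print(problem)
```

Whether every field must be set is a policy choice; the check only makes the gaps visible.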
Lesson 3: Observability Is Everything
- You cannot operate what you cannot see
- Minimum stack: Prometheus + Grafana for metrics, Loki or Elasticsearch for logs, Jaeger or Tempo for traces
- Alert on symptoms (error rate, latency), not causes (CPU usage)
- Invest in dashboards that operators actually use during incidents
- OpenTelemetry is becoming the standard — adopt it early
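To make "alert on symptoms" concrete, here is a minimal sketch of a paging decision driven only by user-facing signals. The function names and the thresholds (1% error rate, 500 ms p99 latency) are illustrative assumptions, not recommended values; in practice this logic would live in alerting rules rather than application code.

```python
def error_rate(errors: int, total: int) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


def should_page(
    errors: int,
    total: int,
    latency_p99_ms: float,
    max_error_rate: float = 0.01,   # assumed threshold: 1% of requests failing
    max_latency_ms: float = 500.0,  # assumed threshold: p99 above 500 ms
) -> bool:
    """Page only when users are affected; CPU usage is deliberately not an input."""
    return error_rate(errors, total) > max_error_rate or latency_p99_ms > max_latency_ms


print(should_page(errors=5, total=1000, latency_p99_ms=120.0))
print(should_page(errors=30, total=1000, latency_p99_ms=120.0))
```

High CPU on its own never triggers a page here; it becomes useful only as a diagnostic once a symptom has fired.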
Lesson 4: Networking Is the Hard Part
- Kubernetes networking is deceptively complex
- Service mesh (Istio, Linkerd) adds observability and security but also complexity
- Network policies are your firewall — use them
- DNS resolution issues are among the most common production problems
- Invest in understanding CNI plugins and how pod networking actually works
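One common starting point for network policies is a default-deny rule per namespace, with explicit allow rules layered on top. The sketch below builds such a policy as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The policy name and the `payments` namespace are hypothetical.

```python
import json


def default_deny_policy(namespace: str) -> dict:
    """Build a NetworkPolicy that selects every pod and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty podSelector matches all pods in the namespace.
            "podSelector": {},
            # Listing both types with no rules denies ingress and egress.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


print(json.dumps(default_deny_policy("payments"), indent=2))
```

Two caveats apply: the policy only takes effect if the cluster's CNI plugin enforces NetworkPolicy, and denying all egress also blocks DNS lookups until a rule allows them.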
Lesson 5: Security Cannot Be Bolted On
- Enforce the Pod Security Standards with Pod Security Admission, or use a policy engine such as OPA Gatekeeper or Kyverno
- Never run containers as root
- Scan images for vulnerabilities in CI/CD
- Rotate secrets regularly and use external secret managers (Vault, AWS Secrets Manager)
- RBAC should follow least-privilege principles — no cluster-admin for application teams
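The "never run as root" rule can be checked mechanically. The sketch below flags containers whose effective security context does not rule out UID 0, relying on the fact that container-level `securityContext` settings take precedence over pod-level ones. The `may_run_as_root` helper and the container names are hypothetical, and the check ignores init containers and the user baked into the image.

```python
from typing import Dict, List


def may_run_as_root(pod_spec: Dict) -> List[str]:
    """Return names of containers that are not prevented from running as root."""
    pod_ctx = pod_spec.get("securityContext", {})
    flagged = []
    for container in pod_spec.get("containers", []):
        ctx = container.get("securityContext", {})
        # Container-level values override pod-level values.
        non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot", False))
        run_as_user = ctx.get("runAsUser", pod_ctx.get("runAsUser"))
        if run_as_user == 0 or (run_as_user is None and not non_root):
            flagged.append(container["name"])
    return flagged


# Hypothetical pod spec: non-root by default, with two exceptions.
pod_spec = {
    "securityContext": {"runAsNonRoot": True},
    "containers": [
        {"name": "app"},
        {"name": "debug", "securityContext": {"runAsNonRoot": False}},
        {"name": "legacy", "securityContext": {"runAsUser": 0}},
    ],
}

print(may_run_as_root(pod_spec))
```

A check like this fits naturally into CI next to image scanning, so violations are caught before they reach the cluster.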
Lesson 6: Upgrades Are a Discipline
- Kubernetes releases a new minor version roughly every four months (three releases per year)
- Falling behind means compounding upgrade pain and missing security patches
- Test upgrades in staging first, always
- Have a rollback plan for every upgrade
- Automate node rolling updates to minimize disruption
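A simple way to keep upgrade debt visible is to track how many minor versions a cluster trails the latest release. The sketch below assumes the upstream project maintains the three most recent minor releases; managed providers publish their own support windows, so treat the constant as an assumption to adjust. The version strings are illustrative inputs only.

```python
import re

# Assumption: upstream maintains the three most recent minor releases.
SUPPORTED_MINOR_RELEASES = 3


def minor_version(version: str) -> int:
    """Extract the minor number from strings like 'v1.28.4' or '1.28'."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(2))


def minors_behind(cluster: str, latest: str) -> int:
    """Number of minor releases the cluster trails the latest one (same major assumed)."""
    return minor_version(latest) - minor_version(cluster)


def is_supported(cluster: str, latest: str) -> bool:
    """True while the cluster is within the assumed support window."""
    return 0 <= minors_behind(cluster, latest) < SUPPORTED_MINOR_RELEASES


print(minors_behind("v1.27.3", "v1.30.0"))
print(is_supported("v1.27.3", "v1.30.0"))
```

Exporting the result as a metric or a CI check turns "we should upgrade soon" into a number that can be alerted on.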
Lesson 7: GitOps Makes Life Easier
- Tools like ArgoCD and Flux let you manage cluster state declaratively through Git
- Every change is auditable, reversible, and reviewable
- Reduces configuration drift between environments
- Makes disaster recovery significantly easier
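The core idea behind drift detection can be sketched in a few lines: compare the state declared in Git with the state observed in the cluster and report every declared field that differs. This is a simplified illustration, not how ArgoCD or Flux compute diffs; the `find_drift` helper and the sample objects are hypothetical.

```python
from typing import List


def find_drift(desired: dict, live: dict, path: str = "") -> List[str]:
    """Report fields declared in `desired` that are missing or different in `live`."""
    drift = []
    for key, want in desired.items():
        where = f"{path}.{key}" if path else key
        if key not in live:
            drift.append(f"{where}: missing in cluster")
        elif isinstance(want, dict) and isinstance(live[key], dict):
            drift.extend(find_drift(want, live[key], where))
        elif live[key] != want:
            drift.append(f"{where}: want {want!r}, got {live[key]!r}")
    return drift


# Hypothetical desired state (from Git) and live state (from the cluster).
desired = {"spec": {"replicas": 3, "template": {"image": "example.com/api:1.4"}}}
live = {
    "spec": {"replicas": 5, "template": {"image": "example.com/api:1.4"}},
    "status": {"readyReplicas": 5},
}

print(find_drift(desired, live))
```

Fields that exist only in the live object, such as `status`, are ignored on purpose, because the cluster adds defaults and runtime state that Git never declares.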
Lesson 8: Cost Management Requires Active Effort
- Kubernetes makes it easy to over-provision
- Use tools like Kubecost, OpenCost, or cloud provider cost dashboards
- Implement namespace-level resource quotas
- Use spot/preemptible instances for stateless, fault-tolerant workloads
- Right-size nodes — fewer large nodes are often more efficient than many small ones
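The node-sizing point comes down to arithmetic: every node gives up some capacity to the kubelet, system daemons, and DaemonSet pods, so the same total core count yields less usable capacity when split across more nodes. The sketch below assumes a fixed overhead of 0.5 cores per node, which is a made-up figure for illustration; real reservations vary with provider and node size.

```python
def allocatable_cpu(
    node_count: int, cores_per_node: float, overhead_cores_per_node: float
) -> float:
    """Cores left for workloads after subtracting per-node overhead."""
    return node_count * (cores_per_node - overhead_cores_per_node)


# Two hypothetical layouts with the same 64 total cores.
many_small = allocatable_cpu(node_count=16, cores_per_node=4, overhead_cores_per_node=0.5)
few_large = allocatable_cpu(node_count=4, cores_per_node=16, overhead_cores_per_node=0.5)

print(many_small)  # 56.0
print(few_large)   # 62.0
```

The trade-off is blast radius: losing one of four large nodes removes a quarter of the cluster's capacity, so the efficient layout still needs enough nodes to absorb a failure.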
Final Thought
Kubernetes is powerful, but it’s not magic. The teams that succeed with it treat it as a platform that requires investment in tooling, training, and operational discipline — not just a deployment target.