Kubernetes in Production: Lessons Learned from Managing Clusters at Scale


VEXILO LABS Team
Jun 20, 2025 · 7 min read

Kubernetes is the de facto standard for container orchestration, but running it in production is harder than the tutorials suggest. Here are the real-world lessons from teams managing clusters at scale.

Kubernetes has won the container orchestration war. But deploying a "Hello World" app on a tutorial cluster is a world apart from running mission-critical workloads in production. Here are the hard-earned lessons from teams that have been operating Kubernetes at scale.

Lesson 1: Don’t Run Your Own Control Plane (Unless You Must)

  • Managed Kubernetes (EKS, GKE, AKS) handles the hardest parts — etcd management, API server scaling, and upgrades
  • Self-managed control planes require deep expertise and 24/7 operational commitment
  • The cost savings of self-hosting rarely justify the operational burden
  • Exception: Air-gapped or highly regulated environments where managed services aren’t an option

Lesson 2: Resource Requests and Limits Are Not Optional

  • Without resource requests, the scheduler can’t make intelligent placement decisions
  • Without limits, a single runaway pod can starve the entire node
  • Over-provisioning wastes money; under-provisioning causes instability
  • Use Vertical Pod Autoscaler (VPA) to right-size based on actual usage
  • Review resource utilization monthly and adjust
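As a sketch, a Deployment container spec with both requests and limits set might look like the following (the name, image, and values are illustrative, not recommendations):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server            # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: example.com/api:1.2.3   # hypothetical image
          resources:
            requests:
              cpu: "250m"      # what the scheduler reserves for placement
              memory: "256Mi"
            limits:
              cpu: "500m"      # CPU above this is throttled
              memory: "512Mi"  # memory above this gets the pod OOM-killed
```

Note the asymmetry: exceeding the CPU limit only throttles the container, while exceeding the memory limit kills it, which is why memory limits deserve the most careful sizing.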

Lesson 3: Observability Is Everything

  • You cannot operate what you cannot see
  • Minimum stack: Prometheus + Grafana for metrics, Loki or Elasticsearch for logs, Jaeger or Tempo for traces
  • Alert on symptoms (error rate, latency), not causes (CPU usage)
  • Invest in dashboards that operators actually use during incidents
  • OpenTelemetry is becoming the standard — adopt it early
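To make "alert on symptoms" concrete, here is a sketch of a Prometheus alerting rule (using the Prometheus Operator's PrometheusRule CRD) that pages on error rate rather than CPU; the metric name `http_requests_total` and the 5% threshold are illustrative assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: symptom-alerts
  namespace: monitoring
spec:
  groups:
    - name: api-symptoms
      rules:
        - alert: HighErrorRate
          # fraction of requests returning 5xx over the last 5 minutes
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.05
          for: 5m              # must persist before paging anyone
          labels:
            severity: page
          annotations:
            summary: "Error rate above 5% for 5 minutes"
```

A rule like this fires only when users are actually affected, whereas a CPU alert often fires on harmless load spikes.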

Lesson 4: Networking Is the Hard Part

  • Kubernetes networking is deceptively complex
  • Service mesh (Istio, Linkerd) adds observability and security but also complexity
  • Network policies are your firewall — use them
  • DNS resolution issues are the most common networking problem in production
  • Invest in understanding CNI plugins and how pod networking actually works
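A common pattern for using network policies as a firewall is default-deny plus explicit allows. A minimal sketch, assuming a `prod` namespace and `app: frontend` / `app: api` labels (all illustrative):

```yaml
# Deny all ingress to every pod in the namespace by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: prod
spec:
  podSelector: {}          # empty selector = all pods in the namespace
  policyTypes:
    - Ingress
---
# Then explicitly allow only frontend -> api traffic on port 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```

Note that policies only take effect if your CNI plugin enforces them, which is one more reason to understand the CNI layer.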

Lesson 5: Security Cannot Be Bolted On

  • Use Pod Security Standards (or OPA/Kyverno) to enforce policies
  • Never run containers as root
  • Scan images for vulnerabilities in CI/CD
  • Rotate secrets regularly and use external secret managers (Vault, AWS Secrets Manager)
  • RBAC should follow least-privilege principles — no cluster-admin for application teams
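Several of these rules can be expressed directly in the pod spec. A hedged example of a hardened `securityContext` (names and image are hypothetical; exact settings depend on your workload):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app           # illustrative
spec:
  securityContext:
    runAsNonRoot: true         # reject images that run as root
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault     # default syscall filtering
  containers:
    - name: app
      image: example.com/app:1.0   # hypothetical image
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]        # drop all Linux capabilities
```

A settings block like this is also exactly what Pod Security Standards (or an OPA/Kyverno policy) would enforce cluster-wide instead of relying on each team to remember it.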

Lesson 6: Upgrades Are a Discipline

  • Kubernetes ships three minor releases per year — roughly one every four months
  • Falling behind means compounding upgrade pain and missing security patches
  • Test upgrades in staging first, always
  • Have a rollback plan for every upgrade
  • Automate node rolling updates to minimize disruption
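One building block for low-disruption rolling updates is a PodDisruptionBudget, which caps how many replicas a node drain may evict at once. A sketch, assuming an `app: api-server` label (illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2          # node drains must leave at least 2 pods running
  selector:
    matchLabels:
      app: api-server
```

With this in place, automated node upgrades wait rather than taking a service below its availability floor.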

Lesson 7: GitOps Makes Life Easier

  • Tools like ArgoCD and Flux let you manage cluster state declaratively through Git
  • Every change is auditable, reversible, and reviewable
  • Reduces configuration drift between environments
  • Makes disaster recovery significantly easier
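With Argo CD, for example, the declarative unit is an Application resource pointing at a Git path. A minimal sketch (the repo URL, path, and namespace are hypothetical):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform-config   # hypothetical repo
    targetRevision: main
    path: apps/payments            # manifests for this app live here
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true        # delete cluster resources removed from Git
      selfHeal: true     # revert manual drift back to the Git state
```

The `selfHeal` setting is what eliminates configuration drift: any out-of-band `kubectl edit` is reverted to whatever Git says.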

Lesson 8: Cost Management Requires Active Effort

  • Kubernetes makes it easy to over-provision
  • Use tools like Kubecost, OpenCost, or cloud provider cost dashboards
  • Implement namespace-level resource quotas
  • Use spot/preemptible instances for stateless, fault-tolerant workloads
  • Right-size nodes — fewer large nodes are often more efficient than many small ones
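Namespace-level quotas are a single resource. A sketch of a ResourceQuota for a hypothetical team namespace (limits are illustrative, not sizing advice):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a          # illustrative namespace
spec:
  hard:
    requests.cpu: "20"       # total CPU the namespace may request
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
```

A quota like this turns over-provisioning from a silent cost leak into a visible scheduling error that the owning team has to resolve.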

Final Thought

Kubernetes is powerful, but it’s not magic. The teams that succeed with it treat it as a platform that requires investment in tooling, training, and operational discipline — not just a deployment target.