EKS Cluster Upgrade: v1.23 → v1.28

Blue-green EKS cluster migration from a manually managed v1.23 cluster to a Terraform-provisioned v1.28 cluster with VPC-only access, achieving under 5 minutes of user-facing impact and 100% IaC coverage.

Tech Stack
Kubernetes, AWS EKS, Terraform, Helm, ArgoCD, AWS

The production EKS cluster had been running for 2+ years. It was on Kubernetes v1.23 (deprecated), provisioned manually with no Terraform coverage, and configured with a publicly accessible API endpoint. Upgrading in place would have meant five sequential minor-version hops (v1.24 through v1.28), with no rollback path at any step.
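To make the hop count concrete, here is a minimal sketch that enumerates the minor versions an in-place upgrade would pass through. The helper name `upgrade_hops` is hypothetical and is not part of any EKS tooling.

```python
def upgrade_hops(current, target):
    """List every minor version an in-place upgrade must pass through."""
    cur_major, cur_minor = (int(p) for p in current.lstrip("v").split("."))
    tgt_major, tgt_minor = (int(p) for p in target.lstrip("v").split("."))
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("expected the same major version and a forward upgrade")
    # One entry per hop: each minor version after `current`, up to `target`.
    return [f"v{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


hops = upgrade_hops("v1.23", "v1.28")
print(len(hops), hops)  # 5 ['v1.24', 'v1.25', 'v1.26', 'v1.27', 'v1.28']
```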

Why Blue-Green

In-place cluster upgrades carry compounding risk: each version hop is irreversible once committed, so a problem that surfaces at v1.25 cannot be fixed by going back to v1.24. A blue-green approach (a new cluster alongside the old one) gives a clean slate and a tested rollback: if something is wrong, flip DNS back.

The Migration

New cluster provisioned with Terraform:

  • v1.28 from day one — no incremental hops
  • VPC-only API endpoint (no public endpoint; access through VPN/bastion only)
  • Karpenter for node provisioning (replacing Cluster Autoscaler)
  • Full IaC coverage — every node group, IRSA binding, and add-on managed in code
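One way to keep the VPC-only requirement from regressing is a small check against the Terraform plan. The sketch below assumes the plan was exported as JSON (for example with `terraform show -json`) and that the cluster is an `aws_eks_cluster` resource with a `vpc_config` block. The source does not say whether the cluster is defined directly or through a module, so the walk covers child modules too. The function names are hypothetical.

```python
def iter_resources(module):
    """Yield resources from a plan module and all of its child modules."""
    yield from module.get("resources", [])
    for child in module.get("child_modules", []):
        yield from iter_resources(child)


def publicly_exposed_clusters(plan):
    """Return addresses of EKS clusters whose API endpoint is not VPC-only."""
    root = plan.get("planned_values", {}).get("root_module", {})
    offenders = []
    for res in iter_resources(root):
        if res.get("type") != "aws_eks_cluster":
            continue
        for cfg in res.get("values", {}).get("vpc_config") or []:
            # Missing values are treated as the provider defaults:
            # public access on, private access off.
            public = cfg.get("endpoint_public_access", True)
            private = cfg.get("endpoint_private_access", False)
            if public or not private:
                offenders.append(res.get("address"))
    return offenders
```

Running this in CI against every plan and failing the build on a non-empty result would be one way to enforce the setting. Whether the original project did so is not stated.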

Service-by-service migration:

  • Updated deployment manifests for API deprecations between v1.23 and v1.28
  • Deployed services to the new cluster one at a time, validating each before moving on to the next
  • Kept the old cluster live as the active production environment until cutover
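The manifest updates can be found mechanically. The sketch below scans multi-document YAML text for API versions removed between v1.23 and v1.28. The table is a subset taken from the upstream Kubernetes deprecation guide, not the list the team actually used, and the parsing is deliberately naive: it reads only top-level `apiVersion` and `kind` lines, so templated Helm charts would need rendering first. All names are hypothetical.

```python
import re

# (apiVersion, kind) -> (release that removed it, replacement or None).
# Subset of the upstream deprecation guide; extend as needed.
REMOVED_APIS = {
    ("batch/v1beta1", "CronJob"): ("1.25", "batch/v1"),
    ("policy/v1beta1", "PodDisruptionBudget"): ("1.25", "policy/v1"),
    ("policy/v1beta1", "PodSecurityPolicy"): ("1.25", None),
    ("discovery.k8s.io/v1beta1", "EndpointSlice"): ("1.25", "discovery.k8s.io/v1"),
    ("autoscaling/v2beta1", "HorizontalPodAutoscaler"): ("1.25", "autoscaling/v2"),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): ("1.26", "autoscaling/v2"),
    ("storage.k8s.io/v1beta1", "CSIStorageCapacity"): ("1.27", "storage.k8s.io/v1"),
}

_API_VERSION = re.compile(r"^apiVersion:\s*(\S+)", re.MULTILINE)
_KIND = re.compile(r"^kind:\s*(\S+)", re.MULTILINE)


def find_removed_apis(manifest_text):
    """Return (kind, apiVersion, removed_in, replacement) for each stale object."""
    findings = []
    # Split the text into individual YAML documents on '---' separator lines.
    for doc in re.split(r"^---\s*$", manifest_text, flags=re.MULTILINE):
        api, kind = _API_VERSION.search(doc), _KIND.search(doc)
        if not (api and kind):
            continue
        key = (api.group(1).strip("'\""), kind.group(1).strip("'\""))
        if key in REMOVED_APIS:
            removed_in, replacement = REMOVED_APIS[key]
            findings.append((key[1], key[0], removed_in, replacement))
    return findings
```

A `None` replacement marks an API with no drop-in successor, such as PodSecurityPolicy, which needs a redesign rather than a version bump.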

Cutover:

  • Lowered DNS TTLs to 60 seconds 48 hours before cutover
  • Flipped DNS during a low-traffic window
  • Monitored for 2 weeks with the old cluster on standby (rollback path)
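Lowering the TTL early only helps if the previous, longer TTL has expired from resolver caches before the flip. The sketch below works through that timing. The previous TTL is not given in the source, so the 3600 seconds in the usage example is an assumed value, and the function name is hypothetical. Resolvers that ignore TTLs are out of scope.

```python
from datetime import datetime, timedelta


def plan_dns_cutover(ttl_lowered_at, old_ttl_s, new_ttl_s, cutover_at):
    """Check the TTL lead time and estimate when the DNS flip has converged."""
    # A cache filled just before the TTL change may hold the record this long.
    old_ttl_expired_at = ttl_lowered_at + timedelta(seconds=old_ttl_s)
    if cutover_at < old_ttl_expired_at:
        raise ValueError("cutover is scheduled before the old TTL has expired")
    return {
        "old_ttl_expired_at": old_ttl_expired_at,
        # After the flip, a well-behaved resolver is stale for at most new_ttl_s.
        "converged_by": cutover_at + timedelta(seconds=new_ttl_s),
    }


lowered = datetime(2024, 1, 1, 0, 0)
timing = plan_dns_cutover(lowered, 3600, 60, lowered + timedelta(hours=48))
print(timing["converged_by"])  # 2024-01-03 00:01:00
```

The same arithmetic applies to rollback: with a 60-second TTL, flipping DNS back to the old cluster should take effect for well-behaved resolvers within about a minute.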

Results

Metric | Before | After
Kubernetes version | v1.23 (deprecated) | v1.28
IaC coverage | 0% (manual) | 100% Terraform
API server exposure | Public endpoint | VPC-only
User-facing downtime | n/a | Under 5 minutes
Node cost efficiency | Cluster Autoscaler | Karpenter (~20% cost gain)

The 2-week rollback window gave the team confidence to migrate aggressively. The old cluster was never needed — but having it there changed the risk calculus entirely.