EKS Cluster Upgrade: v1.23 → v1.28

The production EKS cluster had been running for more than two years. It ran Kubernetes v1.23 (deprecated), had been provisioned manually with no Terraform coverage, and exposed a publicly accessible API endpoint. Because EKS upgrades the control plane one minor version at a time, upgrading in place would have meant five sequential version hops (to v1.24, v1.25, v1.26, v1.27, and finally v1.28), with no rollback path at any step.

Why Blue-Green

In-place cluster upgrades carry compounding risk: each version hop is irreversible once committed, so a problem discovered at v1.25 cannot be fixed by going back to v1.24. A blue-green approach (a new cluster running alongside the old one) gives a clean slate and a tested rollback: if something goes wrong, flip DNS back.
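
To make the hop count concrete, here is a minimal Python sketch that enumerates the upgrades an in-place path would require. It assumes the control plane can only move up one minor version per upgrade; the helper name upgrade_hops is illustrative and not taken from the project.

```python
def upgrade_hops(current: str, target: str) -> list[tuple[str, str]]:
    """List the one-minor-version-at-a-time upgrades from current to target.

    Assumes versions look like "1.23" and share the same major version.
    """
    cur_major, cur_minor = (int(part) for part in current.split("."))
    tgt_major, tgt_minor = (int(part) for part in target.split("."))
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("target must share the major version and not be older than current")
    return [
        (f"{cur_major}.{minor}", f"{cur_major}.{minor + 1}")
        for minor in range(cur_minor, tgt_minor)
    ]


hops = upgrade_hops("1.23", "1.28")
for source, destination in hops:
    print(f"{source} -> {destination}")
print(f"{len(hops)} irreversible in-place upgrades, versus one new cluster built at 1.28")
```

Each pair in the list is a point of no return for an in-place upgrade, which is the compounding risk the blue-green approach avoids.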

The Migration

New cluster provisioned with Terraform:

v1.28 from day one — no incremental hops
VPC-only API endpoint (no public endpoint; access through VPN/bastion only)
Karpenter for node provisioning (replacing Cluster Autoscaler)
Full IaC coverage — every node group, IRSA binding, and add-on managed in code
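
One way to verify that a cluster matches the target posture above is to inspect its description. The sketch below is illustrative rather than the project's actual tooling: it checks a dictionary shaped like the "cluster" object returned by the EKS DescribeCluster API (fields version, resourcesVpcConfig.endpointPublicAccess, and resourcesVpcConfig.endpointPrivateAccess). The function name and the sample dictionaries are hypothetical, and the old cluster's private-access setting is an assumption made only for the example.

```python
def check_cluster_posture(cluster: dict, expected_version: str = "1.28") -> list[str]:
    """Return a list of findings; an empty list means the cluster matches the target posture."""
    findings = []

    version = cluster.get("version")
    if version != expected_version:
        findings.append(f"Kubernetes version is {version}, expected {expected_version}")

    vpc_config = cluster.get("resourcesVpcConfig", {})
    # Treat a missing flag as the unsafe value so gaps in the data are reported.
    if vpc_config.get("endpointPublicAccess", True):
        findings.append("public API endpoint access is enabled")
    if not vpc_config.get("endpointPrivateAccess", False):
        findings.append("private API endpoint access is disabled")

    return findings


# Hand-written sample data (hypothetical), shaped like DescribeCluster output.
old_cluster = {
    "version": "1.23",
    "resourcesVpcConfig": {"endpointPublicAccess": True, "endpointPrivateAccess": False},
}
new_cluster = {
    "version": "1.28",
    "resourcesVpcConfig": {"endpointPublicAccess": False, "endpointPrivateAccess": True},
}

for name, cluster in (("old", old_cluster), ("new", new_cluster)):
    problems = check_cluster_posture(cluster)
    print(name, "OK" if not problems else problems)
```

With AWS credentials available, the same function could be fed the result of boto3.client("eks").describe_cluster(name=...)["cluster"] instead of the sample data.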

Service-by-service migration:

Updated manifests for API versions deprecated or removed between v1.23 and v1.28
Deployed services to the new cluster one at a time, validating each before moving on to the next
Kept the old cluster live as the active production environment until cutover
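
The manifest update step can be partly automated. The following is a rough sketch, not the project's actual tooling: it scans multi-document YAML text for apiVersion/kind pairs that Kubernetes stopped serving between v1.24 and v1.28. The lookup table is based on the upstream deprecated API migration guide and may not be exhaustive; the function name and sample manifest are hypothetical. To stay dependency-free it uses a regular expression instead of a YAML parser, so it only sees unindented top-level keys.

```python
import re

# (apiVersion, kind) -> (release that stopped serving it, replacement apiVersion or None)
REMOVED_APIS = {
    ("batch/v1beta1", "CronJob"): ("1.25", "batch/v1"),
    ("discovery.k8s.io/v1beta1", "EndpointSlice"): ("1.25", "discovery.k8s.io/v1"),
    ("events.k8s.io/v1beta1", "Event"): ("1.25", "events.k8s.io/v1"),
    ("autoscaling/v2beta1", "HorizontalPodAutoscaler"): ("1.25", "autoscaling/v2"),
    ("policy/v1beta1", "PodDisruptionBudget"): ("1.25", "policy/v1"),
    ("policy/v1beta1", "PodSecurityPolicy"): ("1.25", None),  # removed with no successor API
    ("node.k8s.io/v1beta1", "RuntimeClass"): ("1.25", "node.k8s.io/v1"),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): ("1.26", "autoscaling/v2"),
    ("flowcontrol.apiserver.k8s.io/v1beta1", "FlowSchema"):
        ("1.26", "flowcontrol.apiserver.k8s.io/v1beta3"),
    ("flowcontrol.apiserver.k8s.io/v1beta1", "PriorityLevelConfiguration"):
        ("1.26", "flowcontrol.apiserver.k8s.io/v1beta3"),
    ("storage.k8s.io/v1beta1", "CSIStorageCapacity"): ("1.27", "storage.k8s.io/v1"),
}

# Matches only unindented keys, so nested "kind:" fields (e.g. scaleTargetRef) are ignored.
_TOP_LEVEL_FIELD = re.compile(r"^(apiVersion|kind):[ \t]*['\"]?([^'\"\s#]+)", re.MULTILINE)


def find_removed_apis(manifest_text: str) -> list[dict]:
    """Return one finding per YAML document that uses a removed apiVersion/kind pair."""
    findings = []
    for document in re.split(r"^---[ \t]*$", manifest_text, flags=re.MULTILINE):
        fields = {}
        for key, value in _TOP_LEVEL_FIELD.findall(document):
            fields.setdefault(key, value)  # keep the first occurrence of each key
        pair = (fields.get("apiVersion"), fields.get("kind"))
        if pair in REMOVED_APIS:
            removed_in, replacement = REMOVED_APIS[pair]
            findings.append({
                "apiVersion": pair[0],
                "kind": pair[1],
                "removed_in": removed_in,
                "replacement": replacement,
            })
    return findings


# Hypothetical sample manifest for illustration.
sample = """\
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: nightly-report
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
---
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
"""

for finding in find_removed_apis(sample):
    print(finding)
```

A scan like this only catches API versions in static manifests. Changing the apiVersion string is sometimes not enough, since some replacement APIs also change field names or defaults, so each flagged manifest still needs review against the migration guide.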

Cutover:

Lowered DNS TTLs to 60 seconds 48 hours before cutover, so resolvers would pick up the flip (or a rollback) within about a minute
Flipped DNS during a low-traffic window
Monitored for 2 weeks with the old cluster on standby as the rollback path
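
The reason for lowering TTLs well ahead of time can be shown with a little date arithmetic. This is a minimal sketch under stated assumptions: the previous TTL of 24 hours and the dates are hypothetical (the write-up does not state them), and the model assumes resolvers honor TTLs, which not every resolver does.

```python
from datetime import datetime, timedelta


def earliest_safe_cutover(ttl_lowered_at: datetime, previous_ttl_seconds: int) -> datetime:
    """Time after which every TTL-honoring resolver has seen the lowered TTL.

    A resolver that cached the record just before the change may keep it
    for up to the previous TTL, so the new TTL is only fully in effect
    once that period has passed.
    """
    return ttl_lowered_at + timedelta(seconds=previous_ttl_seconds)


def worst_case_switch_seconds(current_ttl_seconds: int) -> int:
    """Upper bound on how long a TTL-honoring resolver keeps the old answer after a flip."""
    return current_ttl_seconds


# Hypothetical values for illustration only.
previous_ttl = 24 * 60 * 60                       # assumed old TTL: 24 hours
lowered_at = datetime(2024, 1, 1, 0, 0)           # when the TTL was set to 60 seconds
planned_cutover = lowered_at + timedelta(hours=48)

safe_from = earliest_safe_cutover(lowered_at, previous_ttl)
print("Lowered TTL fully in effect from:", safe_from)
print("Planned cutover is safe:", planned_cutover >= safe_from)
print("Margin before cutover:", planned_cutover - safe_from)
print("Worst-case flip or rollback delay (s):", worst_case_switch_seconds(60))
```

Under these assumptions the 48-hour lead leaves a full day of margin, and both the cutover and any rollback reach clients within roughly one minute.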

Results

Metric	Before	After
Kubernetes version	v1.23 (deprecated)	v1.28
IaC coverage	0% (manual)	100% Terraform
API server exposure	Public endpoint	VPC-only
User-facing downtime	N/A	Zero
Node cost efficiency	Cluster Autoscaler	Karpenter (~20% cost-efficiency gain)

Knowing the old cluster would stay on standby for two weeks gave the team confidence to migrate aggressively. The rollback was never needed, but having it available changed the risk calculus entirely.