EKS Cluster Upgrade: v1.23 → v1.28
Blue-green migration from a manually managed EKS v1.23 cluster to a Terraform-provisioned v1.28 cluster with a VPC-only API endpoint, completed with under 5 minutes of user-facing impact and 100% IaC coverage.
The production EKS cluster had been running for 2+ years. It was on Kubernetes v1.23 (deprecated), provisioned manually with no Terraform coverage, and configured with a publicly accessible API endpoint. Because EKS upgrades the control plane one minor version at a time, upgrading in place would have meant five sequential hops (v1.24 through v1.28), with no rollback path at any step.
Why Blue-Green
In-place cluster upgrades carry compounding risk: each version hop is irreversible once committed, so a problem discovered at v1.25 cannot be fixed by going back to v1.24 or v1.23. A blue-green approach (a new cluster running alongside the old one) gives a clean starting state and a tested rollback: if something is wrong, flip DNS back.
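To make the hop count concrete, here is a minimal sketch that lists the minor versions an in-place upgrade would have to pass through. The function name `upgrade_path` is hypothetical and exists only for illustration; it models nothing beyond the one-minor-version-at-a-time rule.

```python
def upgrade_path(current: str, target: str) -> list[str]:
    """Return each minor version an in-place upgrade must pass through.

    Illustrative helper (not part of any AWS tooling): it assumes the
    control plane can only move forward one minor version per upgrade.
    """
    cur_major, cur_minor = (int(part) for part in current.split("."))
    tgt_major, tgt_minor = (int(part) for part in target.split("."))
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("only forward upgrades within one major version are modeled")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


hops = upgrade_path("1.23", "1.28")
print(len(hops), hops)  # 5 ['1.24', '1.25', '1.26', '1.27', '1.28']
```

Each entry in that list is a separate, irreversible change to the live cluster, which is the risk the blue-green approach avoids.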
The Migration
New cluster provisioned with Terraform:
- v1.28 from day one — no incremental hops
- VPC-only API endpoint (no public endpoint; access through VPN/bastion only)
- Karpenter for node provisioning (replacing Cluster Autoscaler)
- Full IaC coverage — every node group, IRSA binding, and add-on managed in code
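One way to check that a cluster matches these targets is to inspect the fields EKS reports for it. The sketch below works on a plain dictionary shaped like the `cluster` object in an EKS `DescribeCluster` response (`version`, `resourcesVpcConfig.endpointPublicAccess`, `resourcesVpcConfig.endpointPrivateAccess`). The function name and the sample values are assumptions for illustration, not output from the real clusters.

```python
def cluster_findings(cluster: dict, expected_version: str = "1.28") -> list[str]:
    """List ways a cluster description deviates from the migration targets."""
    findings = []
    vpc_config = cluster.get("resourcesVpcConfig", {})
    # Treat a missing flag as the unsafe value so gaps are reported, not hidden.
    if vpc_config.get("endpointPublicAccess", True):
        findings.append("API endpoint is publicly accessible")
    if not vpc_config.get("endpointPrivateAccess", False):
        findings.append("private (VPC) endpoint access is disabled")
    if cluster.get("version") != expected_version:
        findings.append(f"version is {cluster.get('version')}, expected {expected_version}")
    return findings


# Hypothetical sample descriptions; real values would come from DescribeCluster.
old_cluster = {
    "version": "1.23",
    "resourcesVpcConfig": {"endpointPublicAccess": True, "endpointPrivateAccess": False},
}
new_cluster = {
    "version": "1.28",
    "resourcesVpcConfig": {"endpointPublicAccess": False, "endpointPrivateAccess": True},
}

for label, description in (("old", old_cluster), ("new", new_cluster)):
    print(label, cluster_findings(description))
```

A check like this could run in CI against the Terraform-managed cluster, so drift back to a public endpoint is caught early.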
Service-by-service migration:
- Updated deployment manifests for APIs deprecated or removed between v1.23 and v1.28
- Deployed services to the new cluster one at a time, validating each before moving on to the next
- Kept the old cluster serving production traffic throughout, until cutover
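The manifest update step can be partly automated by scanning for API versions that no longer exist on the target cluster. The sketch below is an assumption about how such a check might look, not the tooling used in this migration. It takes manifests that are already parsed into dictionaries, and its table covers only a subset of the APIs removed upstream in Kubernetes v1.25 through v1.27. The sample manifest names are hypothetical.

```python
# (apiVersion, kind) -> (Kubernetes version that removed it, replacement apiVersion)
REMOVED_APIS = {
    ("policy/v1beta1", "PodSecurityPolicy"): ("1.25", None),  # no direct replacement
    ("policy/v1beta1", "PodDisruptionBudget"): ("1.25", "policy/v1"),
    ("batch/v1beta1", "CronJob"): ("1.25", "batch/v1"),
    ("autoscaling/v2beta1", "HorizontalPodAutoscaler"): ("1.25", "autoscaling/v2"),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): ("1.26", "autoscaling/v2"),
    ("discovery.k8s.io/v1beta1", "EndpointSlice"): ("1.25", "discovery.k8s.io/v1"),
    ("storage.k8s.io/v1beta1", "CSIStorageCapacity"): ("1.27", "storage.k8s.io/v1"),
}


def find_removed_apis(manifests: list[dict]) -> list[tuple]:
    """Return (name, apiVersion, kind, removed_in, replacement) per offending manifest."""
    problems = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        if key in REMOVED_APIS:
            removed_in, replacement = REMOVED_APIS[key]
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            problems.append((name, key[0], key[1], removed_in, replacement))
    return problems


# Hypothetical manifests, already parsed from YAML into dictionaries.
manifests = [
    {"apiVersion": "batch/v1beta1", "kind": "CronJob", "metadata": {"name": "nightly-report"}},
    {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "web"}},
]

for name, api_version, kind, removed_in, replacement in find_removed_apis(manifests):
    print(f"{kind}/{name}: {api_version} removed in v{removed_in}, use {replacement}")
```

Running a scan like this before deploying each service keeps API removals from surfacing as failed applies on the new cluster.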
Cutover:
- Lowered DNS TTLs to 60 seconds 48 hours before cutover
- Flipped DNS during a low-traffic window
- Monitored for 2 weeks with the old cluster on standby as the rollback path
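The reasoning behind the TTL timing can be sketched as a small calculation. Lowering the TTL only helps once every cached copy of the record with the previous, longer TTL has expired, so the lead time must be at least as long as the previous TTL. After that, a DNS flip (or a rollback) should reach well-behaved resolvers within one new TTL. The previous TTL of 3600 seconds below is a hypothetical value, and `cutover_plan` is an illustrative name; resolvers that ignore TTLs are not modeled.

```python
def cutover_plan(previous_ttl_s: int, new_ttl_s: int, lead_time_s: int) -> dict:
    """Summarize whether a lowered TTL has fully taken effect by cutover time."""
    return {
        # True once every cache holding the old, longer TTL must have refreshed.
        "old_ttl_expired": lead_time_s >= previous_ttl_s,
        # Upper bound on how long a compliant resolver can serve the old target
        # after the flip; the same bound applies to flipping back.
        "max_stale_seconds": new_ttl_s,
    }


# 60-second TTL set 48 hours ahead, as in the cutover steps above.
# The 3600-second previous TTL is an assumed value for illustration.
plan = cutover_plan(previous_ttl_s=3600, new_ttl_s=60, lead_time_s=48 * 3600)
print(plan)  # {'old_ttl_expired': True, 'max_stale_seconds': 60}
```

The 48-hour lead time leaves a wide margin over any common TTL, which is why the flip and any rollback can both be expected to converge in about a minute.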
Results
| Metric | Before | After |
|---|---|---|
| Kubernetes version | v1.23 (deprecated) | v1.28 |
| IaC coverage | 0% (manual) | 100% Terraform |
| API server exposure | Public endpoint | VPC-only |
| User-facing downtime | — | Under 5 minutes |
| Node cost efficiency | Cluster Autoscaler | Karpenter (~20% cost-efficiency gain) |
The 2-week rollback window gave the team confidence to migrate aggressively. The old cluster was never needed — but having it there changed the risk calculus entirely.