
Overview
Kubernetes is one of the most widely adopted container orchestration platforms, offering immense flexibility and scalability. However, with its continuous evolution, keeping Kubernetes clusters up to date is essential for security, performance improvements, and new features. Upgrading Kubernetes, though, can cause application downtime if not handled properly. Fortunately, with Amazon Elastic Kubernetes Service (EKS) and Google Kubernetes Engine (GKE), handling upgrades with minimal or zero downtime is entirely possible. In this blog, we will explore the best practices for performing Kubernetes upgrades on EKS and GKE clusters while ensuring your applications remain available.
Why Kubernetes Upgrades Matter
Before we dive into the upgrade strategies, let’s briefly understand why upgrades are critical:
Security Patches: New versions of Kubernetes fix vulnerabilities that could otherwise be exploited.
New Features: Every release brings enhanced capabilities like better scaling, networking, and integration with newer services.
Bug Fixes: Regular upgrades ensure your Kubernetes environment remains stable and efficient.
However, upgrading your cluster without affecting your services requires careful planning and execution.
Challenges During Kubernetes Upgrades
Upgrading Kubernetes on managed services like EKS and GKE introduces several challenges:
Control Plane Updates: Both EKS and GKE handle the Kubernetes control-plane upgrade for you once you trigger it (a command sketch follows this list), but you still need to ensure that the worker nodes, which run your applications, are also upgraded without downtime.
In-Place Upgrades: Upgrading Kubernetes while maintaining application availability means upgrading components in a careful order to avoid disruptions.
Application Downtime: Even though control plane upgrades are usually seamless, nodes and workloads require attention to avoid service interruptions.
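For reference, here is a minimal sketch of triggering the managed control-plane upgrade on each provider; the cluster name, zone, and target version (1.30) are placeholders for your own values:
# EKS: ask AWS to move the control plane to the target version
aws eks update-cluster-version --name my-cluster --kubernetes-version 1.30

# GKE: upgrade the control plane only (the --master flag targets the control plane)
gcloud container clusters upgrade my-cluster --master --cluster-version 1.30 --zone us-central1-a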
Now, let’s explore the best practices for handling Kubernetes upgrades on EKS and GKE while minimizing or avoiding downtime.
Best Practices for Zero-Downtime K8S Upgrades on EKS/GKE
Use Rolling Updates for Deployments
The key to zero-downtime upgrades lies in ensuring that workloads are updated incrementally. Kubernetes supports rolling updates for Deployments, which allow you to update application pods without causing service interruptions.
Example: Rolling Update for Deployment
If you have a deployment, you can configure a rolling update strategy to update your application pods without downtime. Below is an example of a deployment definition with a rolling update strategy.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:v2
          ports:
            - containerPort: 80
This configuration ensures that at most one pod is unavailable at any given time (maxUnavailable: 1), while at most one extra pod is created above the desired replica count (maxSurge: 1), allowing the Deployment to roll out smoothly with no downtime.
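After applying a new image, you can watch the rollout and back out quickly if it misbehaves. A minimal sketch, assuming the Deployment above is named my-app in the current namespace:
# Update the image (equivalent to editing the manifest and re-applying it)
kubectl set image deployment/my-app my-app=my-app:v2

# Follow the rolling update until every replica runs the new version
kubectl rollout status deployment/my-app

# Roll back to the previous revision if the new version misbehaves
kubectl rollout undo deployment/my-app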
Leverage Node Pool Upgrades
Both EKS and GKE provide the ability to upgrade node pools independently from the control plane. By upgrading node pools sequentially, you can minimize disruptions to the application.
GKE: Use the gcloud container clusters upgrade command with --node-pool to update node pools one at a time; each node is drained before it is replaced.
EKS: Use the EKS console or the AWS CLI to update managed node groups; EKS replaces nodes in batches while maintaining the group’s desired capacity.
Example: Upgrading Node Pools in GKE
In GKE, you can upgrade the node pool with the following command:
gcloud container clusters upgrade my-cluster --node-pool my-node-pool --zone us-central1-a
GKE upgrades the pool one node at a time: each node is cordoned and drained, and its workloads are rescheduled onto nodes already running the new version. Configure Pod Disruption Budgets (covered below) so that these drains never take down more pods than your services can tolerate.
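To follow the node pool upgrade from the Kubernetes side, a simple check is to watch node versions change as GKE replaces each node (generic kubectl, not GKE-specific):
# Nodes are cordoned, drained, and recreated one by one on the new version
kubectl get nodes -o wide --watch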
Example: Upgrading Node Groups in EKS
For EKS, use the following AWS CLI command to upgrade node groups:
aws eks update-nodegroup-version --cluster-name my-cluster --nodegroup-name my-nodegroup
During the update, EKS launches nodes on the new version, then cordons and drains the old ones, so your pods are rescheduled onto upgraded nodes before the old nodes are terminated.
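The update-nodegroup-version call returns an update ID that you can poll to track progress; a sketch, with the update ID as a placeholder taken from the previous command’s output:
# Check the status of the managed node group update (e.g. InProgress, Successful, Failed)
aws eks describe-update --name <update-id> --cluster-name my-cluster --nodegroup-name my-nodegroup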
Use Pod Disruption Budgets (PDB)
Pod Disruption Budgets (PDBs) are a critical mechanism for ensuring that your applications maintain a certain level of availability during node upgrades or voluntary disruptions.
A PDB specifies the minimum number (or percentage) of pods that must remain available for an application, or, equivalently, the maximum number that may be unavailable. When a node is drained during an upgrade, the eviction respects the PDB, preventing too many pods from being evicted at once.
Example: Pod Disruption Budget for a Stateful Application
Here’s an example of a Pod Disruption Budget for a stateful application that needs at least two pods running during an upgrade:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
In this case, the eviction API will refuse to evict a pod whenever doing so would leave fewer than two my-app pods running, so node drains during the upgrade can only proceed as fast as the application tolerates.
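Once the PDB is applied, it is worth confirming that it matches the intended pods and seeing how many voluntary disruptions are currently allowed:
# Verify the budget and its currently allowed disruptions
kubectl get pdb my-app-pdb

# Inspect which pods the selector matches and any conditions blocking evictions
kubectl describe pdb my-app-pdb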
Monitor the Upgrade Process
Before and during an upgrade, it’s essential to monitor the health of the cluster and the applications running on it. Tools like Prometheus and Grafana for monitoring and Fluentd or Elasticsearch for logging can provide critical insights.
Pod Health Checks: Ensure that liveness and readiness probes are defined in your pod specifications. These probes let Kubernetes detect when a pod is unhealthy and when it is ready to receive traffic (a sketch follows this list).
Cluster Health: Tools like kubectl top or GKE’s and EKS’s built-in monitoring tools can help ensure the control plane and nodes are healthy during the upgrade.
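As a reference for the probes mentioned above, here is an illustrative containers section for the my-app pod template; the /ready and /healthz endpoints are assumptions, so substitute whatever health endpoints your application actually exposes:
containers:
  - name: my-app
    image: my-app:v2
    ports:
      - containerPort: 80
    readinessProbe:        # pod receives traffic only while this succeeds
      httpGet:
        path: /ready       # assumed endpoint
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:         # container is restarted if this keeps failing
      httpGet:
        path: /healthz     # assumed endpoint
        port: 80
      initialDelaySeconds: 15
      periodSeconds: 20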
Test Upgrades in Staging Environments
Before applying an upgrade to your production clusters, always test the upgrade in a staging environment. This allows you to simulate the upgrade process, check for potential issues, and ensure compatibility with your workloads.
EKS: Stand up a cluster with the same Kubernetes version, add-ons, and node group configuration as production, and run the upgrade there first.
GKE: Recreate your production cluster’s configuration in a separate environment or project and rehearse the upgrade there, as in the sketch below.
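A minimal sketch of standing up a throwaway staging cluster pinned to your current production version (1.29 here is a placeholder); eksctl is one common CLI option for EKS not covered above, and all names, regions, and zones are assumptions:
# EKS (via eksctl): create a small staging cluster at the production version
eksctl create cluster --name my-cluster-staging --version 1.29 --region us-east-1 --nodes 2

# GKE: create a staging cluster at the production version
gcloud container clusters create my-cluster-staging --cluster-version 1.29 --zone us-central1-a --num-nodes 2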
Conclusion
Upgrading Kubernetes doesn’t have to be a daunting task. With proper planning, rolling updates, careful node pool management, and the right monitoring tools, you can ensure that your applications continue running smoothly, even during the upgrade process.
In the world of modern cloud infrastructure, zero-downtime upgrades are not just a luxury; they are a necessity. As a Kubernetes practitioner, embracing these best practices will ensure you maintain both application reliability and operational efficiency. Kubernetes is powerful, and so are you. By mastering these upgrade strategies on EKS and GKE, you not only improve your skills but also contribute to building resilient, high-performing systems in the cloud.
So, keep learning, keep experimenting, and never hesitate to share your findings. Happy upgrading!