Context
Kubernetes cluster administrators sometimes perform maintenance operations on nodes, such as hardware swaps and kernel or Kubernetes version upgrades. Application developers, on the other hand, are interested in ensuring that their applications remain available during those maintenance operations. This is usually achieved through Disruption Budgets - here’s what the official docs have to say about them:
[They] limit the number of concurrent disruptions that your application experiences, allowing for higher availability while permitting the cluster administrator to manage the cluster’s nodes.
Practical example
To demonstrate how they’re used, let’s start with a simple Deployment of two nginx replicas:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
By default, old pods are gradually replaced by new ones following the RollingUpdate strategy. Through the maxSurge and maxUnavailable options, an application developer can control how those pods are replaced. In the example below, the number of available replicas should never drop below the deployment size itself, and at most one extra pod can be created during the rollout:
...
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
...
Finally, for the sake of the example, let’s ensure that two nginx pods never run on the same node through a podAntiAffinity rule:
...
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - nginx
            topologyKey: kubernetes.io/hostname
...
At this point, the application developer can roll out new versions of their deployment safely. However, the availability of the application is still at risk of being ‘disrupted’ by maintenance operations that affect its pods. For instance, the cluster administrator can drain all nodes in the cluster, leaving every pod of the deployment stuck in the Pending state.
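A minimal sketch of such a cluster-wide drain, assuming the worker nodes are the ones without the control-plane role (the label selector and flags may need adjusting for a real cluster):

for node in $(kubectl get nodes -l '!node-role.kubernetes.io/control-plane' -o jsonpath='{.items[*].metadata.name}'); do
  # evict all pods from the node; DaemonSet pods cannot be drained and are skipped
  kubectl drain --ignore-daemonsets "$node"
done

With every node cordoned, the evicted pods are recreated by the Deployment but have nowhere to be scheduled: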
$ kubectl get deployment
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/2     2            0           110m
$ kubectl get pods
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-855f7d96b9-8d48g   0/1     Pending   0          3m22s
nginx-deployment-855f7d96b9-bgvzb   0/1     Pending   0          3m18s
This is where Disruption Budgets are useful. The application developer might decide to create the following PodDisruptionBudget, which has similar semantics to the RollingUpdate strategy of the Deployment:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: nginx
However, setting a Disruption Budget with maxUnavailable: 0 has important implications:
If you set maxUnavailable to 0% or 0, or you set minAvailable to 100% or the number of replicas, you are requiring zero voluntary evictions. When you set zero voluntary evictions for a workload object such as ReplicaSet, then you cannot successfully drain a Node running one of those Pods. If you try to drain a Node where an unevictable Pod is running, the drain never completes. This is permitted as per the semantics of PodDisruptionBudget.
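Once this PDB is applied, its status reflects that no voluntary disruptions are allowed. One quick way to confirm this is to read the disruptionsAllowed status field, which should report 0 for this two-replica example:

$ kubectl get pdb nginx-pdb -o jsonpath='{.status.disruptionsAllowed}'
0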
Impossible node drains
As suggested by the PDB docs, if we try to drain a node where a pod protected by such a PDB is running, the drain will never complete. Take a cluster with four worker nodes as an example:
$ kubectl get nodes
NAME                 STATUS   ROLES           AGE   VERSION
kind-control-plane   Ready    control-plane   58s   v1.25.3
kind-worker          Ready    <none>          34s   v1.25.3
kind-worker2         Ready    <none>          34s   v1.25.3
kind-worker3         Ready    <none>          34s   v1.25.3
kind-worker4         Ready    <none>          33s   v1.25.3
$ kubectl get pods -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
nginx-deployment-bdccd6b79-84zhj   1/1     Running   0          41s   10.244.3.2   kind-worker    <none>           <none>
nginx-deployment-bdccd6b79-hqbgx   1/1     Running   0          41s   10.244.1.2   kind-worker3   <none>           <none>
If we try to drain one of the nodes, the process will never complete:
$ kubectl drain --ignore-daemonsets kind-worker
node/kind-worker cordoned
Warning: ignoring DaemonSet-managed Pods: kube-system/kindnet-dkp4b, kube-system/kube-proxy-c66b6
evicting pod default/nginx-deployment-bdccd6b79-84zhj
error when evicting pods/"nginx-deployment-bdccd6b79-84zhj" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/nginx-deployment-bdccd6b79-84zhj
error when evicting pods/"nginx-deployment-bdccd6b79-84zhj" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/nginx-deployment-bdccd6b79-84zhj
error when evicting pods/"nginx-deployment-bdccd6b79-84zhj" -n "default" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod default/nginx-deployment-bdccd6b79-84zhj
...
A workaround
In some situations, however, such as the one above, a cluster administrator can still drain the node successfully by issuing a rollout restart of the application that is blocking the drain:
$ kubectl rollout restart deployment nginx-deployment
deployment.apps/nginx-deployment restarted
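The full sequence might look something like this (a sketch, not a verbatim recording; it assumes the kind cluster from above, with kind-worker cordoned first so that the restarted pod lands on another node):

$ kubectl cordon kind-worker
node/kind-worker cordoned
$ kubectl rollout restart deployment nginx-deployment
deployment.apps/nginx-deployment restarted
$ kubectl rollout status deployment nginx-deployment
deployment "nginx-deployment" successfully rolled out
$ kubectl drain --ignore-daemonsets kind-worker
node/kind-worker already cordoned
...
node/kind-worker drained

Because the rolling update deletes the old pod directly rather than going through the Eviction API, the PDB does not block it, and the subsequent drain has nothing left to evict.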
This creates a dilemma: Disruption Budgets with those characteristics explicitly communicate the application’s “desire” to never be disrupted. On the other hand, the deployment’s strategy allows the application to be restarted while still respecting its RollingUpdate spec.
In practical terms, the cluster administrator can perform the necessary maintenance while respecting the application’s availability requirements. But should cluster administrators restart applications in the cluster without the application developers’ consent?
In some circumstances, for example in companies where SRE teams maintain production clusters and product teams develop the applications, such operations could be acceptable, as there might not be strict terms of service between the two groups.
Possible solutions
I discussed this situation on the Kubernetes sig-cli Slack channel, and a few folks were receptive to the idea of giving Kubernetes users a way to automatically work around impossible drains.
The drain logic lives only in kubectl, which calls the Eviction API for each pod running on the node. My first idea was to introduce a new flag to kubectl drain that would trigger a rollout restart of the blocking controllers (Deployment, StatefulSet, ReplicaSet) when the Eviction API returned a 429 response.
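As a rough sketch, such a flag might be used like this (the flag name is hypothetical and does not exist in kubectl today):

# --restart-blocking-workloads is a made-up flag illustrating the proposal
$ kubectl drain --ignore-daemonsets --restart-blocking-workloads kind-worker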
When I proposed this in the sig-cli fortnightly meeting, we concluded that the current drain behavior already matches the documented semantics, and that impossible drain situations are opportunities to educate application developers about the implications of such restrictive PDBs. For example, OpenShift’s Kubernetes Controller Manager Operator has alerts configured for those restrictive PDBs:
Standard workloads should have at least one pod more than is desired to support API-initiated eviction. Workloads that are at the minimum disruption allowed level violate this and could block node drain. This is important for node maintenance and cluster upgrades.
Nonetheless, the group suggested writing this up (this blog post) and presenting it to the sig-apps and sig-api-machinery groups. A potential proposal could be to introduce new functionality to the Eviction API, alongside a new field in the PodDisruptionBudget spec that would trigger an update of the blocking controller. For example:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  rolloutRestartAllowed: true
  maxUnavailable: 0
  selector:
    matchLabels:
      app: nginx
Here, the application developer would be explicitly granting the cluster administrator permission to perform updates to their deployments or applications.