Kubernetes Cluster is in Failed State

Vaibhav Srivastava
4 min read · Jan 7, 2023

Once a Kubernetes cluster is in a failed state, there is currently no option in the Azure Portal to revert to the last known stable state or to roll back the upgrade.

The straightforward solution you will find online is to perform the following steps from Cloud Shell, or from a terminal configured with the Azure CLI.

To return a cluster or node pool in a failed state back to a succeeded state, we can try running the az aks update or az aks nodepool update Azure CLI commands as follows to perform an empty update and trigger a reconciliation:

az aks update --resource-group {ResourceGroupName} --name {ClusterName}
az aks nodepool update --resource-group {ResourceGroupName} --cluster-name {ClusterName} --name {NodePoolName}

However, this did not work as expected, so I am writing this article to help you if you come across a similar scenario. First, here is the error I encountered when running the commands above.

So I followed the error message and raised a service request to increase the quota limit, but then I noticed it was not doing anything; the same error came back, asking me to place a new request. I then tried the following command mentioned there:

az aks nodepool update --resource-group {ResourceGroupName} --cluster-name {ClusterName} --name {agent-pool-name}

I got a similar error again, asking me to place a request to raise the CPU core and regional limits.

Because this was going in a loop and the Kubernetes cluster was not coming out of the failed state, I looked into the node pool and node status and started troubleshooting there. Finally, I found the “ReconcileVMSSAgentPoolFailed” error.
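For reference, these are the kinds of commands I was using to inspect node and node pool status while troubleshooting; the placeholder values are the same as before, and your exact commands may differ:

kubectl get nodes -o wide
az aks nodepool list --resource-group {ResourceGroupName} --cluster-name {ClusterName} -o table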

This error was helpful, so I queried the available PodDisruptionBudgets (PDBs).

To successfully drain nodes and evict running pods, ensure that any PodDisruptionBudgets (PDBs) allow at least one pod replica to be moved at a time; otherwise the drain/evict operation will fail. To check this, you can run kubectl get pdb -A and make sure ALLOWED DISRUPTIONS is 1 or higher.

Draining nodes will cause pods running on them to be evicted and recreated on the other, schedulable nodes.

To drain nodes, use the kubectl drain command.
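A minimal sketch, where {node-name} is a placeholder for the node to drain (the --delete-emptydir-data flag applies to newer kubectl versions; older versions use --delete-local-data instead):

kubectl drain {node-name} --ignore-daemonsets --delete-emptydir-data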

By now I had located the node where the problem was: the nodes inside the node pool were on two different versions of Kubernetes, and the one on the older version was not draining, which was preventing the upgrade from completing.

Delete the pods that can’t be drained. Note: if the pods were created by a Deployment or StatefulSet, they’ll be controlled by a ReplicaSet and will be recreated; otherwise, we should take a backup of the current node first. This Kubernetes cluster, on the other hand, was not created with a deployment file; it was created using Terraform.
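A rough sketch of force-deleting a stuck pod, where {pod-name} and {namespace} are placeholders (use --grace-period=0 --force only when you are sure the pod can be removed safely):

kubectl delete pod {pod-name} -n {namespace} --grace-period=0 --force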

The next step is to delete the node that is on the older version.
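Assuming {node-name} is a placeholder for the older node, this can be done with kubectl:

kubectl delete node {node-name}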

Now perform the node pool upgrade; upgrading the pool will create a new node under the same pool.
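A sketch of the node pool upgrade using the Azure CLI, where {NodePoolName} and {Version} (the target Kubernetes version) are placeholders:

az aks nodepool upgrade --resource-group {ResourceGroupName} --cluster-name {ClusterName} --name {NodePoolName} --kubernetes-version {Version}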

Once that is completed, upgrade the Kubernetes cluster.
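And similarly for the cluster itself, with {Version} again a placeholder for the target version:

az aks upgrade --resource-group {ResourceGroupName} --name {ClusterName} --kubernetes-version {Version}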

This completes the upgrade.
