Troubleshooting a Broken K8S Cluster

In the software world, broken components are a daily challenge for any software engineer, and a K8S cluster is no exception. Whether it’s a single-node or a multi-node cluster, we might sometimes find a node, or even the entire cluster, down.

Welcome to the world of troubleshooting and resolving issues :)

I was playing around with K8S clusters as part of my Kubernetes exploration and thought of sharing my hands-on learning about troubleshooting a K8S cluster in this blog.

In this blog, I’ll investigate and drill down to the problem, then fix it to get the cluster up and running again. Below is a simple K8S cluster architecture with one Master Node and two Worker Nodes. I’ve used the same architecture in my other blogs as well, as an easy and simple use case.

Fig : 1 — K8S Cluster Architecture

Let’s log in to the Master Node using its public IP address and credentials.

Run ‘ssh cloud_user@<MasterNode-Public-IP-Address>’, then use ‘kubectl get nodes’ to identify the issue.

Fig : 2 — K8S Cluster Nodes Status

You can see from Fig : 2 that the k8s-worker2 node is ‘NotReady’.
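
On a cluster with many nodes, scanning the table by eye gets tedious. As a rough sketch, the same check can be scripted. The helper name list_not_ready_nodes is my own invention for illustration (it is not part of kubectl), and it assumes the default column layout of ‘kubectl get nodes’, where STATUS is the second column.

```shell
# Hypothetical helper (not part of kubectl): print the names of nodes
# whose STATUS column does not start with "Ready".
# Assumes the default `kubectl get nodes` columns: NAME STATUS ROLES AGE VERSION.
list_not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ { print $1 }'
}
```

Running ‘list_not_ready_nodes’ on the Master Node should print only the nodes that need attention, which in this walkthrough would be k8s-worker2.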

Let’s dig deeper with ‘kubectl describe node k8s-worker2’ and look at the Conditions section of the output.
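
The describe output is long, so it can help to print only the Conditions section. The sketch below is one possible way to do that. The helper name node_conditions is hypothetical, and it assumes the usual layout of the describe output, where each section starts with an unindented header line and its content is indented.

```shell
# Hypothetical helper: print only the "Conditions:" section of
# `kubectl describe node <name>`.
# Assumes section headers are unindented and section content is indented.
node_conditions() {
  kubectl describe node "$1" | awk '
    # Start printing at the Conditions header.
    /^Conditions:/ { show = 1; print; next }
    # Stop at the next unindented line, which begins another section.
    show && /^[^ \t]/ { exit }
    show { print }
  '
}
```

For example, ‘node_conditions k8s-worker2’ prints just the conditions table. When a kubelet stops reporting, the Ready condition typically shows a status of Unknown rather than True.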

To understand the specific issue with Worker Node 2, let’s ssh into that node using its credentials and run ‘sudo journalctl -u kubelet’, which shows the kubelet logs on that node. The output is very long and opens in a pager: press ‘Shift + G’ to jump to the end, and ‘Ctrl + C’ (or ‘q’) to leave the pager.

Fig : 3 — Worker Node response on journalctl

It is clear from Fig : 3 that the kubelet is inactive and stopped, hence that node shows as ‘NotReady’ when viewed from the Master Node.
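
If you would rather skip the pager and see only the most recent log entries, journalctl can print the last few lines directly. This is a small sketch: the helper name kubelet_recent_logs and the default of 50 lines are my own choices, not something from the setup above.

```shell
# Hypothetical helper: show the last N kubelet log lines (default 50)
# without opening the pager.
kubelet_recent_logs() {
  sudo journalctl -u kubelet -n "${1:-50}" --no-pager
}
```

Calling ‘kubelet_recent_logs 20’ on the worker node prints the last 20 kubelet log lines straight to the terminal.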

On Worker Node 2, use ‘sudo systemctl status kubelet’ to get a more detailed status of the kubelet. Fig : 4 indicates that the kubelet is stopped.

Fig : 4 — Worker Node issue with Kubelet
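
For a quick one-line answer instead of the full status screen, systemctl can also be asked two narrower questions: is the service running right now, and is it set to start at boot? The sketch below combines both. The helper name kubelet_state is hypothetical, and it assumes the service unit is named kubelet, as in the commands above.

```shell
# Hypothetical helper: report whether kubelet is running now and
# whether it is set to start at boot.
kubelet_state() {
  # is-active / is-enabled exit non-zero for states such as "inactive"
  # or "disabled", so the text is captured and the exit status ignored.
  active="$(systemctl is-active kubelet || true)"
  enabled="$(systemctl is-enabled kubelet || true)"
  echo "kubelet: active=${active} enabled=${enabled}"
}
```

A result such as active=inactive tells you the service is stopped, and the enabled value tells you whether it would come back on its own after a reboot.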

How do we fix it?

We know the problem: the kubelet has stopped on Worker Node 2. To resolve it, we need to start the kubelet service with ‘sudo systemctl start kubelet’ and also enable it with ‘sudo systemctl enable kubelet’. Enabling it ensures that the kubelet starts automatically whenever the server boots, so we don’t run into the same failure again in the future.
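
Starting and enabling the kubelet can also be done in a single call, since ‘systemctl enable’ accepts a ‘--now’ option that starts the unit immediately as well. The sketch below does that and then confirms the result. The helper name enable_and_start_kubelet is my own, and it assumes the kubelet runs as a systemd unit named kubelet.

```shell
# Hypothetical helper: enable kubelet at boot and start it immediately,
# then confirm that systemd reports it as active.
enable_and_start_kubelet() {
  sudo systemctl enable --now kubelet || return 1
  if systemctl is-active --quiet kubelet; then
    echo "kubelet is active"
  else
    echo "kubelet did not start; check 'sudo journalctl -u kubelet'" >&2
    return 1
  fi
}
```

If the function reports that the kubelet did not start, the kubelet logs on that node are the next place to look.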

Now, check the status of the kubelet on the same node using ‘sudo systemctl status kubelet’. As shown in Fig : 5, the kubelet is active and running now.

Fig : 5 — Kubelet ‘running’ status now

Go back to the Master Node and run ‘kubectl get nodes’ again. You will notice that the status of Worker Node 2 is now ‘Ready’, as shown in Fig : 6.

Fig : 6 — Worker Node 2 is ‘Ready’ now

If you still see the Worker Node 2 status as ‘NotReady’, wait a few seconds and re-run the same command until you see ‘Ready’ for that node.
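
Instead of re-running the command by hand, the wait can be scripted as a small polling loop. This is only a sketch: the function name wait_for_ready and the defaults of 12 checks, 5 seconds apart, are assumptions of mine. It reads the node’s Ready condition through kubectl’s jsonpath output.

```shell
# Hypothetical helper: poll a node until its Ready condition is "True".
# Usage: wait_for_ready NODE [TRIES] [INTERVAL_SECONDS]
wait_for_ready() {
  node="$1"
  tries="${2:-12}"
  interval="${3:-5}"
  attempt=0
  while [ "$attempt" -lt "$tries" ]; do
    # Ask the API server for the status of the node's Ready condition.
    status="$(kubectl get node "$node" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')"
    if [ "$status" = "True" ]; then
      echo "$node is Ready"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "$node is still not Ready after $tries checks" >&2
  return 1
}
```

For example, ‘wait_for_ready k8s-worker2’ on the Master Node returns as soon as the node reports Ready. kubectl also ships a built-in alternative, ‘kubectl wait --for=condition=Ready node/k8s-worker2 --timeout=60s’, which blocks until the condition is met or the timeout expires.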

Congratulations! You are now able to notice a problem in the cluster, troubleshoot it, identify the cause and resolve it.

Keep Learning.
