In the software world, broken components are a daily challenge for any software engineer, and a Kubernetes (K8s) cluster is no exception. Whether it's a single-node or a multi-node cluster, sometimes we might even find the entire cluster down.
Welcome to the world of troubleshooting and resolving issues :)
I was playing around with K8s clusters as part of my Kubernetes exploration, and thought of sharing my hands-on learning about troubleshooting a K8s cluster in this blog.
In this blog, I'll investigate and drill down to the problem, then fix it to get the cluster back up and running. Below is a simple K8s cluster architecture with one Master Node and two Worker Nodes. I've referenced the same architecture in my other blogs as well, to keep the use case easy and simple.
Let's log in to the Master Node using its public IP and credentials:
ssh cloud_user@<MasterNode-Public-IP-Address>, then run 'kubectl get nodes' to identify the issue
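In practice, the two steps look like this (the IP placeholder must be replaced with your master node's actual public IP; the sample output below is illustrative, based on the node names used in this lab):

```shell
# SSH into the master node (replace the placeholder with your master's public IP)
ssh cloud_user@<MasterNode-Public-IP-Address>

# List all nodes in the cluster and their status
kubectl get nodes
# Illustrative output — worker 2 reports NotReady:
# NAME          STATUS     ROLES           AGE   VERSION
# k8s-control   Ready      control-plane   ...   ...
# k8s-worker1   Ready      <none>          ...   ...
# k8s-worker2   NotReady   <none>          ...   ...
```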
You can see from Fig. 2 that the k8s-worker2 node is 'NotReady'.
Let's dig deeper into the issue with 'kubectl describe node k8s-worker2'. Look at the Conditions section in the detailed output shown
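The describe output is long; the Conditions section is what tells you why the node is unhealthy. A quick way to jump straight to it (the grep filter is just a convenience, not required):

```shell
# Full details for the problem node
kubectl describe node k8s-worker2

# Or show only the Conditions section (header plus the next few lines)
kubectl describe node k8s-worker2 | grep -A 8 "Conditions:"
# When the kubelet is down, conditions such as Ready typically show
# status "Unknown" with a message that the kubelet stopped posting status.
```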
To understand the specific issue with Worker Node 2, let's SSH into that node using its credentials and run 'sudo journalctl -u kubelet', which gives more details about the kubelet on that node. The output can be very long; press 'Shift + G' to jump to the end, and press 'q' (or 'Ctrl + C') to exit the pager
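The steps above, as commands (the worker's IP is a placeholder you'd replace with your own):

```shell
# SSH into worker node 2 (replace with your worker's public IP)
ssh cloud_user@<WorkerNode2-Public-IP-Address>

# Inspect the kubelet's logs via the systemd journal
sudo journalctl -u kubelet
# Shift+G jumps to the end of the pager; q (or Ctrl+C) exits it.

# Alternatively, skip the pager and show only the last 50 lines:
sudo journalctl -u kubelet --no-pager -n 50
```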
It is very clear from Fig. 3 that the kubelet is inactive and stopped, hence the node's status was 'NotReady' when we looked from the Master Node
On Worker Node 2, use 'sudo systemctl status kubelet' for a more detailed status of the kubelet. Fig. 4 confirms that the kubelet is stopped
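For reference, this is the check and the line to look for in its output (the exact wording varies slightly between distributions):

```shell
# Check the kubelet service status on worker node 2
sudo systemctl status kubelet
# A stopped service shows a line like:
#   Active: inactive (dead)
# whereas a healthy kubelet shows:
#   Active: active (running)
```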
How do we fix it?
We know the problem: the kubelet has stopped on Worker Node 2. To resolve it, we need to start the kubelet service and also enable it, which ensures that whenever the server reboots, the kubelet starts automatically, so we avoid this failure again in the future
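The fix itself comes down to two systemctl commands, run on Worker Node 2:

```shell
# Start the kubelet service immediately
sudo systemctl start kubelet

# Enable it so systemd starts it automatically on every boot
sudo systemctl enable kubelet
```

Starting fixes the node now; enabling prevents the same 'NotReady' surprise after the next reboot.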
Now, check the status of the kubelet on the same node using 'sudo systemctl status kubelet'. As shown in Fig. 5, the kubelet is active and running now
Come back to the Master Node and verify with 'kubectl get nodes'; you can now see that Worker Node 2's status is 'Ready', as shown in Fig. 6
If you still see Worker Node 2 as 'NotReady', wait a couple of seconds and re-run the same command until the node shows 'Ready'
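Rather than re-running the command by hand, you can let it poll for you (run from the master node):

```shell
# One-off check
kubectl get nodes

# Or refresh automatically every 2 seconds until worker 2 flips to Ready
watch kubectl get nodes
# (kubectl get nodes -w also streams node state changes as they happen)
```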
Congratulations! You were able to spot the problem in the cluster, troubleshoot it, and identify and resolve the root cause