The story starts with a message on our workplace chat:
“Our k8s master servers are rebooting randomly every day!”
This announcement had a good side and a bad side. On the good side, our clusters handled this unfortunate event without problems, even when two out of five servers were rebooting at the same time. On the bad side, we didn’t receive an alert. We had zero information about what could possibly cause this. So we started the investigation by looking at system statistics. The interval at which those statistics were collected was too long, and as a result they hid valuable information.
We didn’t know the cause of the restarts, but we did find the following two kernel settings:
- kernel.hung_task_panic = 1
- kernel.softlockup_panic = 1
Those settings instruct the kernel to panic when it detects a hung task (one blocked in uninterruptible sleep for too long) or a soft lockup. Furthermore, we found out that the kernel automatically reboots 70 seconds after a panic because we have kernel.panic = 70 in our sysctl settings. The combination of triggering a panic and rebooting automatically prevented us from capturing a kernel crash dump.
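To make the interplay of these three settings concrete, here is a minimal sketch in Python that reads sysctl-style "key = value" lines and spells out what the kernel will do. The helper names (parse_sysctl, describe_panic_behaviour) and the SAMPLE text are made up for illustration and are not part of our tooling; the interpretation of kernel.panic (0 means no reboot, a positive value is a delay in seconds, a negative value means an immediate reboot) follows the kernel's sysctl documentation.

```python
def parse_sysctl(text):
    """Parse 'key = value' lines (the format printed by `sysctl -a`) into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def describe_panic_behaviour(settings):
    """Return human-readable notes on when the kernel panics and whether it reboots."""
    notes = []
    if settings.get("kernel.hung_task_panic") == "1":
        notes.append("panic when a hung task is detected")
    if settings.get("kernel.softlockup_panic") == "1":
        notes.append("panic when a soft lockup is detected")

    # kernel.panic: 0 = stay halted, N > 0 = reboot after N seconds, N < 0 = reboot at once.
    timeout = int(settings.get("kernel.panic", "0"))
    if timeout > 0:
        notes.append(f"reboot {timeout} seconds after a panic")
    elif timeout < 0:
        notes.append("reboot immediately after a panic")
    else:
        notes.append("no automatic reboot after a panic")
    return notes


# Hypothetical input mirroring the values mentioned in the text.
SAMPLE = """\
kernel.hung_task_panic = 1
kernel.softlockup_panic = 1
kernel.panic = 70
"""

if __name__ == "__main__":
    for note in describe_panic_behaviour(parse_sysctl(SAMPLE)):
        print(note)
```

On a real server, the same text could come from the output of `sysctl kernel.hung_task_panic kernel.softlockup_panic kernel.panic` instead of the hard-coded sample.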