Linux Kernel Bug Hunting

June 26, 2018

The story starts with a message on our workplace chat:

“Our k8s master servers are rebooting randomly every day!”

This announcement had a good and a bad side. On the good side, our clusters handled this unfortunate event without problems, even when two out of five servers were rebooting at the same time. On the bad side, we didn’t receive an alert. …We had zero information about what could possibly cause this. So we decided to start the investigation by looking at system statistics. However, the interval at which those statistics were collected was too long, and as a result the statistics hid valuable information.

We didn’t know the cause of the restarts, but we did find the following two kernel settings:

kernel.hung_task_panic = 1
kernel.softlockup_panic = 1

Those settings instruct the kernel to panic when it detects a hung task (a task that has been blocked for too long) or a soft lockup. Furthermore, we found that the kernel automatically reboots after a panic because we have kernel.panic = 70 in our sysctl settings, which tells the kernel to reboot 70 seconds after a panic. The combination of triggering a panic and rebooting automatically prevented us from capturing a kernel crash dump.
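To make the interaction between these three settings concrete, here is a minimal sketch that parses sysctl-style "key = value" text and summarizes what the kernel will do. The function names (parse_sysctl, read_live, describe) are hypothetical helpers written for this illustration, not part of any tool we used. The sketch assumes the standard mapping of sysctl names to files under /proc/sys and the documented meaning of kernel.panic (zero: no reboot, positive: reboot after that many seconds, negative: reboot immediately).

```python
from pathlib import Path

# The three settings discussed above.
WATCHED = ("kernel.hung_task_panic", "kernel.softlockup_panic", "kernel.panic")


def parse_sysctl(text):
    """Parse 'key = value' lines (sysctl.conf style) into a dict of ints."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        try:
            settings[key.strip()] = int(value.strip())
        except ValueError:
            continue  # non-integer values are ignored in this sketch
    return settings


def read_live(names=WATCHED, root="/proc/sys"):
    """Read current values from /proc/sys; entries that cannot be read are skipped."""
    lines = []
    for name in names:
        path = Path(root, *name.split("."))
        try:
            lines.append(f"{name} = {path.read_text().strip()}")
        except OSError:
            continue
    return parse_sysctl("\n".join(lines))


def describe(settings):
    """Return short notes on panic and reboot behaviour for the given settings."""
    notes = []
    if settings.get("kernel.hung_task_panic", 0) == 1:
        notes.append("panic when a hung task is detected")
    if settings.get("kernel.softlockup_panic", 0) == 1:
        notes.append("panic when a soft lockup is detected")
    delay = settings.get("kernel.panic", 0)
    if delay > 0:
        notes.append(f"reboot {delay} seconds after a panic")
    elif delay < 0:
        notes.append("reboot immediately after a panic")
    else:
        notes.append("no automatic reboot after a panic")
    return notes


if __name__ == "__main__":
    for note in describe(read_live()):
        print(note)
```

Feeding the values from our servers into describe(parse_sysctl(...)) yields all three notes at once: panic on a hung task, panic on a soft lockup, and reboot 70 seconds later. That is exactly the combination that made the machines restart before we could look at them.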
