Overview
The error 'failed to keep watchdog alive' appears in the logs alongside a NIC (network interface controller) flap or another outage on the system.
An Exinda appliance keeps a detailed log that records as much useful information as possible about its behavior, to help administrators diagnose problems and gather troubleshooting information for investigations. One issue that can occur is the NICs on the device departing from their standard behavior, either for a short period (a NIC flap) or by going into bypass on their own for an extended time. After one of these events, checking the logs for a cause is the essential first step in troubleshooting.
The following error message may appear in the log at the time of the outage:
bypassed[4486]: TID 140158059296544: [bypassd.ERR] (watchdog) failed to keep watchdog alive.
This article provides details about this issue and the means to prevent it (or resolve it if already triggered).
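When reviewing an exported log, it can help to filter for this message so that its occurrences can be lined up with the time of the outage. The following is a minimal sketch, assuming the log is available as plain text lines. The function name find_watchdog_errors is hypothetical, and the pattern is built only from the sample message shown above.

```python
import re

# Pattern taken from the sample message in this article. The process-name
# prefix, PID, and TID vary, so only the stable part of the message is matched.
WATCHDOG_ERROR = re.compile(
    r"\[bypassd\.ERR\] \(watchdog\) failed to keep watchdog alive"
)


def find_watchdog_errors(lines):
    """Return (line_number, line) pairs for lines containing the watchdog error."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if WATCHDOG_ERROR.search(line)
    ]


if __name__ == "__main__":
    sample = [
        "an unrelated log line",
        "bypassed[4486]: TID 140158059296544: [bypassd.ERR] (watchdog) "
        "failed to keep watchdog alive.",
    ]
    for number, line in find_watchdog_errors(sample):
        print(number, line)
```

Passing an open file object instead of a list works the same way, because the function only iterates over lines.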
Root Cause
The bypass process on the Exinda ('bypassd') manages the NICs and determines whether they should go into bypass mode while the system is active. It reads the intended NIC state from the Configuration > System > Networks page and continually monitors the health of the system. If the device becomes unstable, bypassd switches the NICs into bypass mode to prevent an outage.
The mechanism through which it does this is the System Watchdog. The watchdog expects a response from the system every second. If it does not receive one, it waits for a total of 8 seconds before triggering. If no response arrives in that time, it either switches the NICs into bypass or reboots the device (depending on the system settings), because the silence indicates that the system is failing to acknowledge the watchdog for one of several reasons:
- The device is too busy to handle the load on it (for example, it is under attack, the traffic load is too high, or RAM use is too high).
- The device has locked up or frozen.
- The device is in a state where it is unresponsive.
Each time bypassd is unable to contact the watchdog, it logs the error message shown above.
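The timing described in this section can be illustrated with a toy model. This is a sketch under stated assumptions, not Exinda's implementation: the class WatchdogModel and the function first_trigger_time are hypothetical, time is simulated in whole seconds, and the watchdog is assumed to trigger once 8 full seconds have passed since the last keepalive.

```python
from dataclasses import dataclass

KEEPALIVE_INTERVAL_S = 1.0  # expected keepalive period, per this article
WATCHDOG_TIMEOUT_S = 8.0    # total silence tolerated before triggering, per this article


@dataclass
class WatchdogModel:
    """Toy model of a watchdog timer (illustrative only)."""
    last_keepalive: float = 0.0
    timeout: float = WATCHDOG_TIMEOUT_S

    def keepalive(self, now: float) -> None:
        # Record the time of the most recent keepalive.
        self.last_keepalive = now

    def has_triggered(self, now: float) -> bool:
        # Assumption: the watchdog fires once the silence reaches the timeout.
        return now - self.last_keepalive >= self.timeout


def first_trigger_time(keepalive_times, horizon, step=KEEPALIVE_INTERVAL_S):
    """Step simulated time forward and return when the watchdog first
    triggers, or None if it never triggers within the horizon."""
    watchdog = WatchdogModel()
    pending = sorted(keepalive_times)
    for tick in range(int(horizon / step) + 1):
        now = tick * step
        # Deliver every keepalive that is due at or before the current time.
        while pending and pending[0] <= now:
            watchdog.keepalive(pending.pop(0))
        if watchdog.has_triggered(now):
            return now
    return None


if __name__ == "__main__":
    # Keepalives arrive at 1 s, 2 s and 3 s, then the system stalls.
    print(first_trigger_time([1, 2, 3], horizon=20))
    # Keepalives arrive every second for the whole run.
    print(first_trigger_time(range(1, 21), horizon=20))
```

In the first run the last keepalive arrives at 3 seconds, so the model triggers 8 seconds later. In the second run the keepalives never stop, so the model never triggers.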
Resolution
Preemptively rebooting the device returns the system to a fresh, clean state in which bypassd can contact the watchdog and keep it running as expected.
To keep the watchdog from timing out, ensure the following on the system:
- The number of connections on the device is within the system's specifications.
- RAM use on the device is not extremely high (90-100%), and swap is not under heavy use.
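The RAM and swap check can be sketched as a small calculation. This example assumes memory figures in the Linux /proc/meminfo text format, including the MemAvailable field. Whether an Exinda appliance exposes that file to administrators is not stated in this article, and the function names parse_meminfo and memory_report are hypothetical. The 90% threshold comes from the list above. No swap threshold is applied, because the article does not define "heavy use".

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value in kB."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_report(text, ram_threshold_pct=90.0):
    """Return RAM and swap utilisation percentages and a high-RAM flag."""
    info = parse_meminfo(text)
    total = info["MemTotal"]
    ram_used_pct = 100.0 * (total - info["MemAvailable"]) / total
    swap_total = info.get("SwapTotal", 0)
    if swap_total:
        swap_used_pct = 100.0 * (swap_total - info.get("SwapFree", 0)) / swap_total
    else:
        swap_used_pct = 0.0  # no swap configured
    return {
        "ram_used_pct": ram_used_pct,
        "swap_used_pct": swap_used_pct,
        "ram_high": ram_used_pct >= ram_threshold_pct,
    }


if __name__ == "__main__":
    # Made-up sample figures, for illustration only.
    sample = (
        "MemTotal:       1000 kB\n"
        "MemAvailable:     50 kB\n"
        "SwapTotal:       200 kB\n"
        "SwapFree:        100 kB\n"
    )
    print(memory_report(sample))
```

On a Linux host that exposes the file, the same function can be fed the contents of /proc/meminfo read as text.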
If the system gets into an unresponsive state due to a bug or other cause, restarting the device can get it back to a stable, known state to process operations.