EPYC server processors are really nice, when they work. However, lately with kernel 6.2 I started getting dreadful "CPU stuck" errors that lead to a hung system. A normal person might revert to an older kernel. Me? I decided to turn on the watchdog.
In case you don't know, a watchdog is a feature that, once turned on, requires your system to notify it every once in a while that it's still alive. If no notification is received within the given time interval, the system is assumed stuck and gets rebooted. Best of all, this is done at the hardware level, so no hanging application or CPU can prevent it.
To confuse things a bit, my Supermicro M11SDV-4CT-LN4F server seems to have two watchdog systems. One is part of the EPYC platform itself and is controlled via a BIOS setting. That one has a 5-minute interval and, no matter what, I couldn't get it working properly. I mean, I could get it running but, since there was no easy way to reset it, the system would reboot every 5 minutes regardless.
The second watchdog is part of the AST2500 chipset that handles the other IPMI functions. This one is well supported from the Linux command line using the ipmitool utility. To see its status, just ask ipmitool for that information:
ipmitool mc watchdog get
But there is no option to turn it on. However, one can always send raw commands and I was fortunate to see that somebody already did. Without getting into too much detail, the last two numbers in the string of hexadecimal values are the only thing you generally want to change: the time interval, given in 0.1 s units with the low byte first. In the example below, I decided to go for 610 seconds (6100 units, or 0x17D4, hence 0xD4 0x17).
ipmitool raw 0x06 0x24 0x04 0x01 0x00 0x00 0xD4 0x17
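If you want a different interval, the two trailing bytes can be derived rather than worked out by hand. Below is a minimal sketch of that conversion; the function name watchdog_bytes is made up for illustration and is not part of ipmitool. It assumes the countdown is a 16-bit value in 0.1 s units, sent low byte first, as described above.

```shell
#!/bin/sh
# Convert a timeout in seconds into the two countdown bytes (low byte first)
# that end the raw watchdog command. The countdown is counted in 0.1 s ticks.
# NOTE: watchdog_bytes is a hypothetical helper, not an ipmitool feature.
watchdog_bytes() {
    seconds=$1
    ticks=$((seconds * 10))
    # The field is 16 bits wide, so 6553 s is the largest whole-second value.
    if [ "$ticks" -lt 0 ] || [ "$ticks" -gt 65535 ]; then
        echo "timeout out of range: $seconds" >&2
        return 1
    fi
    printf '0x%02X 0x%02X\n' $((ticks & 0xFF)) $((ticks >> 8))
}

watchdog_bytes 610   # prints: 0xD4 0x17
```

The printed pair replaces the last two values of the raw command; everything before it stays the same.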
This will start a ticking bomb that will, if not defused within the given interval, reboot your computer. So, why did I select 10 minutes and 10 seconds? Like many things, this was completely subjective.
Well, no matter what, I wanted this watchdog not to interfere with my normal server operation. Since a normal reboot takes about 5 minutes, I wanted to have 5 minutes left on the counter even if I reboot the system myself just before the watchdog would be reset. So, if I select a 10-minute interval and reset it every 5 minutes, this gives me the 5 minutes of extra time I might need for a reboot. But why the extra 10 seconds? Well, in case I mess with my watchdog settings and miss the reset at the 5-minute mark, I wanted to leave one more chance for a reset at the 10-minute mark without having to deal with a reboot race condition.
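The reasoning above is plain subtraction, so it can be checked with a few lines of shell. This is only a sketch: watchdog_slack is a made-up name, and it assumes cron fires exactly on schedule and that the countdown was last restarted on a 5-minute boundary.

```shell
#!/bin/sh
# Seconds left on the watchdog at the moment scheduled reset number "due"
# comes around, assuming every earlier reset since the last restart was missed.
# NOTE: watchdog_slack is a hypothetical helper used only for this arithmetic.
watchdog_slack() {
    timeout=$1   # watchdog interval in seconds
    period=$2    # time between scheduled resets in seconds
    due=$3       # which scheduled reset is coming up (1 = the next one)
    echo $((timeout - period * due))
}

watchdog_slack 610 300 1   # 310: more than a 5-minute reboot needs
watchdog_slack 610 300 2   # 10: one missed reset still leaves a second chance
watchdog_slack 600 300 2   # 0: without the extra 10 s, reset and reboot race
```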
And how might one actually set up the watchdog and its reset within Linux? Well, crontab, of course. These two entries were all it took:
@reboot /usr/bin/ipmitool raw 0x06 0x24 0x04 0x01 0x00 0x00 0xD4 0x17
0,5,10,15,20,25,30,35,40,45,50,55 * * * * /usr/bin/ipmitool mc watchdog reset
This will turn on the watchdog on every boot (and yes, once the watchdog goes off, you do need to manually turn it back on), and every 5 minutes the system will reset its counter if nothing goes awry.
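One detail that may vary between BMCs: going by the IPMI specification, the Set Watchdog Timer command only loads the configuration, and the countdown starts with the first Reset Watchdog Timer. If that holds on your board, the timer would not be running until the first 5-minute cron entry fires. A minimal sketch of a boot-time helper that configures the timer and kicks it once is below. The function name arm_watchdog and the IPMITOOL override variable are made up for illustration; only the two ipmitool invocations come from the entries above.

```shell
#!/bin/sh
# Path to ipmitool; override it (e.g. IPMITOOL=echo) to preview the commands.
IPMITOOL=${IPMITOOL:-/usr/bin/ipmitool}

# NOTE: arm_watchdog is a hypothetical helper, not something ipmitool ships.
arm_watchdog() {
    # Configure the timer: SMS/OS use, hard reset on expiry, 610 s countdown.
    "$IPMITOOL" raw 0x06 0x24 0x04 0x01 0x00 0x00 0xD4 0x17 || return 1
    # Kick it once so the countdown is running before cron's first reset.
    "$IPMITOOL" mc watchdog reset
}
```

To use it, save this as a script, call arm_watchdog on its last line, and point the @reboot entry at that script instead of the raw command.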
Simple and effective.