Systemd Watchdog for Any Service

Making a basic systemd service is easy. Let's take the simplest possible application (not necessarily even designed to be a service) and look into making it work with systemd.

Our example application will be a script in /opt/test/application with the following content:

/opt/test/application
#!/bin/bash

while true; do
    date | tee /var/tmp/test.log
    sleep 1
done

Essentially, it's just a never-ending output of the current date.

To make it a service, we simply create /etc/systemd/system/test.service with a description of our application:

/etc/systemd/system/test.service
[Unit]
Description=Test service
After=network.target
StartLimitIntervalSec=0

[Service]
Type=simple
ExecStart=/opt/test/application
Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target

That's all that's needed before we can start the service:

Terminal
sudo systemctl start test

sudo systemctl status test
● test.service - Test service
Loaded: loaded (/etc/systemd/system/test.service; disabled; vendor preset: enabled)
Active: active (running)
Main PID: 5212 (service)
Tasks: 2 (limit: 4657)
CGroup: /system.slice/test.service
├─5212 /bin/bash /opt/test/application
└─5321 sleep 1

Systemd will start the application and even restart it if it fails. But what if we want it a bit smarter? What if we want a watchdog that restarts the application not only when its process fails but also when some other health check goes bad?

While systemd does support such a setup, the application generally has to be aware of it and call the watchdog function every now and then. Fortunately, even if our application doesn't do that, we can use the watchdog facilities via the systemd-notify tool.

First we need to change three things in our service definition: change the type to notify, point the executable at a wrapper script, and define the watchdog time.

In this example, if systemd receives no watchdog notification for 5 seconds, the service will be considered failed. The new service definition in /etc/systemd/system/test.service can look something like this:

/etc/systemd/system/test.service
[Unit]
Description=Test service
After=network.target
StartLimitIntervalSec=0

[Service]
Type=notify
ExecStart=/opt/test/test.sh
Restart=always
RestartSec=1
TimeoutSec=5
WatchdogSec=5

[Install]
WantedBy=multi-user.target

Those watching carefully will note we don't actually solve anything with this and that we just move all responsibility to the /opt/test/test.sh wrapper.

It's in that script that we first tell systemd when the application is ready and later, in a loop, check not only the application's PID but also any other condition (e.g. a certain curl response), calling systemd-notify if the application proves to be healthy:

/opt/test/test.sh
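#!/bin/bash

# Stop the background application when the wrapper exits.
trap 'kill $(jobs -p)' EXIT

/opt/test/application &
PID=$!

# Tell systemd that startup is complete (needed for Type=notify).
/bin/systemd-notify --ready

while true; do
    FAIL=0

    # Check that the application process still exists.
    kill -0 "$PID" || FAIL=1

    # curl http://localhost/test/ || FAIL=1

    # Ping the watchdog only while every check passes.
    if [[ $FAIL -eq 0 ]]; then /bin/systemd-notify WATCHDOG=1; fi

    sleep 1
done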
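
The commented-out curl check hints at how further conditions plug into the loop. As a minimal sketch, the checks could be factored into small helpers. The function names and the HEALTH_URL variable are illustrative assumptions, not part of the original script, and the HTTP check assumes curl is installed:

```shell
#!/bin/bash

# Hypothetical helper: succeeds while the given PID exists and can be signalled.
pid_alive() {
    kill -0 "$1" 2>/dev/null
}

# Hypothetical helper: succeeds if the URL answers with a non-error HTTP
# status within 2 seconds.
http_ok() {
    curl --fail --silent --max-time 2 --output /dev/null "$1"
}

# Combined check. HEALTH_URL (an assumed variable) is optional; when it is
# empty or unset, only the process check runs.
is_healthy() {
    pid_alive "$1" || return 1
    if [ -n "${HEALTH_URL:-}" ]; then
        http_ok "$HEALTH_URL" || return 1
    fi
    return 0
}
```

With these in place, the loop body could shrink to something like: if is_healthy "$PID"; then /bin/systemd-notify WATCHDOG=1; fi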

Starting service now gives slightly different output:

Terminal
sudo systemctl stop test

sudo systemctl start test

sudo systemctl status test
● test.service - Test service
Loaded: loaded (/etc/systemd/system/test.service; disabled; vendor preset: enabled)
Active: active (running)
Main PID: 6406 (test.sh)
Tasks: 4 (limit: 4657)
CGroup: /system.slice/test.service
├─6406 /bin/bash /opt/test/test.sh
├─6407 /bin/bash /opt/test/application
├─6557 sleep 1
└─6560 sleep 1

If we kill the application manually (e.g. sudo kill 6407), the wrapper stops sending notifications, and once the watchdog timeout passes systemd will pronounce the service dead and start it again. It will do the same if any other check fails.
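
The wrapper above pings every second regardless of the configured timeout. When WatchdogSec is set, systemd exports it to the service's environment as WATCHDOG_USEC (in microseconds), and the usual recommendation is to notify at about half that interval. Here is a small sketch of deriving the sleep time from it. The function name is an illustrative assumption, and the one-second floor is a choice made here so that a short or missing timeout doesn't turn the loop into a busy wait:

```shell
#!/bin/bash

# Hypothetical helper: prints how many whole seconds to sleep between
# watchdog pings, i.e. half of WATCHDOG_USEC, but never less than 1.
watchdog_interval_sec() {
    local usec="${WATCHDOG_USEC:-0}"
    local sec=$(( usec / 2000000 ))   # microseconds -> half interval in seconds
    if [ "$sec" -lt 1 ]; then
        sec=1
    fi
    echo "$sec"
}
```

With WatchdogSec=5 this prints 2, so the fixed sleep 1 in the loop could become sleep "$(watchdog_interval_sec)".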

While this approach is not ideal, it does allow for easy application watchdog retrofitting.

10 thoughts to “Systemd Watchdog for Any Service”

  1. There are a few improvements you can make to your script:

    1. You don’t need type=notify to use the watchdog functionality, so that can be removed.
    2. It’s better for your service to run as the same PID that systemd launched. This way systemd will know that it’s the main PID and you won’t need your trap, nor your checking with `kill -0`. You can achieve this with `exec` and running your watchdog in the background.
    3. systemd sets $WATCHDOG_USEC to say how long the watchdog interval is. It recommends notifying at half that interval. This way we don’t need to check every 1s.

    Something like this may be better:

    #!/bin/bash
    
    watchdog() {
        PID=$1
    
        while(true); do
            FAIL=0
    
            curl http://localhost/test/ || FAIL=1
    
            if [[ $FAIL -eq 0 ]]; then
                /bin/systemd-notify WATCHDOG=1;
                sleep $(($WATCHDOG_USEC / 2000000))
            else
                sleep 1
            fi
        done
    }
    
    watchdog $$ &
    
    exec /opt/test/service
    
    1. Hi William,

      Thanks for sharing. In my case, when I tried it on a Raspberry Pi 4 with Ubuntu 20.10 Server (64-bit), it complained about “permitted for main PID”. I fixed the problem by adding the PID to the notify line as follows:

      /bin/systemd-notify --pid=$PID WATCHDOG=1;

  2. On second thought: It would probably be better to leave `ExecStart` as it is and launch your watchdog in an `ExecStartPost`. You’ll also have to set `NotifyAccess=all`.

    1. Running the watchdog via ExecStartPost doesn’t work.
      Systemd just restarts the service every WatchdogSec.

    2. I have tried with this Unit

      [Unit]
      Description=communication-core-app
      After=network.target
      StartLimitIntervalSec=0
      
      [Service]
      Type=simple
      WatchdogSec=20
      NotifyAccess=all
      WorkingDirectory=/opt/app
      ExecStart=/opt/app/bin/run.sh
      ExecStartPost=/opt/app/bin/watchdog.sh
      Restart=always
      RestartSec=10
      
      [Install]
      WantedBy=multi-user.target
      
      and this code inside watchdog.sh
      #!/usr/bin/env bash
      
      watchdog() {
          while(true); do
              FAIL=0
      
              $(nc -z localhost 3822) || FAIL=1
              echo $FAIL
              if [[ $FAIL -eq 0 ]]; then
                  /bin/systemd-notify WATCHDOG=1;
                  sleep $(($WATCHDOG_USEC / 2000000))
              else
                  sleep 1
              fi
          done
      }
      
      watchdog &
      
  3. Another suggestion: have the watchdog script run systemd-notify WATCHDOG=trigger when the failure occurs.
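
A rough sketch of that suggestion, assuming a reasonably recent systemd whose notification protocol understands WATCHDOG=trigger. The function name and the NOTIFY variable are illustrative assumptions; NOTIFY exists only so the command can be swapped out when trying the function outside a service:

```shell
#!/bin/bash

# Assumed to be overridable for experimentation; defaults to the real tool.
NOTIFY="${NOTIFY:-/bin/systemd-notify}"

# Hypothetical helper: runs the given check command, then either pings the
# watchdog or asks systemd to act as if the watchdog had already expired.
report_health() {
    if "$@"; then
        "$NOTIFY" WATCHDOG=1
    else
        "$NOTIFY" WATCHDOG=trigger
    fi
}
```

Called as report_health curl --fail --silent http://localhost/test/ inside the loop, a failed check would then restart the service right away instead of waiting for the timeout to run out.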

  4. Hi,
    I have a similar problem with monitoring a certain application.
    What does the final version of these scripts look like?

  5. Unfortunately, “WATCHDOG_USEC” is only available in the process started by “ExecStart”, so it’s not possible to use a watchdog script in “ExecStartPost”.

  6. Not necessarily true; you can retrieve the WATCHDOG_USEC property like this:
    systemctl show --property WatchdogUSec --value "servicename.service"

    As for the person having issues with ExecStartPost to launch a watchdog, it seems to be working fine for me. Just be sure to background your script, and if you are having trouble communicating with systemd, you could try adding BARRIER=1 to systemd-notify to force synchronous communication, i.e. /bin/systemd-notify BARRIER=1 WATCHDOG=1.

    Also you should be able to get away with NotifyAccess=exec if you’re using ExecStartPost, which is more secure than NotifyAccess=all…
