Affichage des articles dont le libellé est service. Afficher tous les articles
Affichage des articles dont le libellé est service. Afficher tous les articles

jeudi 28 août 2025

Systemd healtcheck with side service and monotonic timer, auto-healing

What : bypass the lack of healthcheck of systemd

  • systemd service "what-service.service"
  • systemd timer  "what-service-healthcheck.timer"
    • triggers a systemd service "what-service-healthcheck.service"
       which lanches a script "
      service_health_check.sh"
    • script that :
      • curl's heal-tcheck URL "HEALTH_CHECK_URL"
      • if KO, restart the targetted service



what-service-healthcheck.timer

[Unit]
Description=Run health check every 15 seconds
[Timer]
# Wait 1 minute after boot before the first check
OnBootSec=1min
# Run the check 15 seconds after the last time it finished
OnUnitActiveSec=15s
[Install]
WantedBy=timers.target


By default the timer service will trigger the unit service with the same name, no need to specify it.

what-service-healthcheck.service.j2
[Unit]
Description=Health Check for {{ what_service }}
Requires={{ what_service }}.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/service_health_check.sh
Restart=on-failure
OnFailure={{ what_service }}


service_health_check.sh 
#!/bin/bash
# The health check endpoint
HEALTH_CHECK_URL="http://localhost:{{ running_port }}/health_check"
# Use curl to check the endpoint.
# --fail: Makes curl exit with a non-zero status code on server errors (4xx or 5xx).
# --silent: Hides the progress meter.
# --output /dev/null: Discards the response body.
if ! curl --silent --fail --max-time 2 --output /dev/null "$HEALTH_CHECK_URL"; then
echo "Health check failed for {{ service_name }}. Restarting..."
# Restart is performed on failure from healthcheck service
exit 1
fi



Adding through ansible (to do : fix indentation, blog isn't great for this)

<role>/tasks/main.yml
--- 

- name: Generate what-service systemd file
ansible.builtin.template:
src: what-service.service.j2
dest: /etc/systemd/system/what-service.service
mode: "0755"
notify: Restart what-service

 - name: Copy the health check script
ansible.builtin.copy:
src: service_health_check.sh
dest: /usr/local/bin/service_health_check.sh
owner: root
group: root
mode: '0755'
vars: 
  service_name: what-service

- name: Copy the health check systemd service file
ansible.builtin.copy:
src: what-service-healthcheck.service
dest: /etc/systemd/system/what-service-healthcheck.service
owner: root
group: root
mode: '0644'
notify: Reload systemd

- name: Copy the health check systemd timer file
ansible.builtin.copy:
src: what-service-healthcheck.timer
dest: /etc/systemd/system/what-service-healthcheck.timer
owner: root
group: root
mode: '0644'
notify: Reload systemd

- name: Enable and start the health check timer
ansible.builtin.systemd:
name: healthcheck.timer
state: started
enabled: yes
daemon_reload: yes # Ensures systemd is reloaded before starting


<role>/handlers/main.yml
---
- name: Restart what-service 
ansible.builtin.service:
name: what-service
state: restarted
daemon_reload: true

- name: Reload systemd
ansible.builtin.service:
daemon_reload: yes

- name: Restart what-service-healthcheck.timer
ansible.builtin.service:
name: what-service-healthcheck.timer
state: restarted
daemon_reload: true

mardi 19 novembre 2024

removing unwanted *typo* systemd instance'd template units

 Systemd has this great feature : 


file: foo@.service 

contains the definition, as usual, with "%i"

=> will replace anything in the service file 

for example:


systemctl start foo@bar.service

systemctl start foo@resto.service


=> will start  service foo@bar  and foo@resto


but if resto does not make a valid service, it fails and can't be easily removed from the list of apparently formerly existing services.

(cf. for example https://bbs.archlinux.org/viewtopic.php?id=286912&p=1 )


As you can see with the command : 

 systemctl list-units --type=service | grep 'foo'

  UNIT                                                                                      LOAD   ACTIVE SUB     DESCRIPTION

  foo@alpha.service                                                               loaded active running Foo for protocol alpha

● foo@lish.service                                                              masked failed failed  foo@lish.service



=> Solution (on ubuntu, but no reason it does not work elsewhere) 

 systemctl mask foo@resto.service #this creates a symlink / replaces the symlink with one pointing to /dev/null, and marks the service as failed => no restart will happen again

 systemctl reset-failed # cleans failed services