RFC Systemd Service Restart Policy

Andrew Geissler geissonator at gmail.com
Thu Sep 7 05:03:31 AEST 2017


I’ve got an old but good one this sprint,
https://github.com/openbmc/openbmc/issues/272

The point of this issue is to define our restart and recovery policy
for openbmc services.

Currently we’re using the systemd defaults, which are the following:
RestartSec=100ms
StartLimitIntervalSec=10s
StartLimitBurst=5
StartLimitAction=none

So basically if a service fails, we will restart it up to 5 times
within a 10s window, with a 100ms delay between each restart.
There is no action taken when we reach the 5 restarts, other than to
do nothing until the 10s window has expired.
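
For reference, here is a rough sketch of how those defaults would look if
written out explicitly in a unit file. The unit name and ExecStart path are
hypothetical. The section placement assumes systemd 230 or later, where the
StartLimit* settings belong in [Unit] and RestartSec belongs in [Service].
Restart=on-failure is also an assumption: systemd defaults to Restart=no, so
a service is only restarted automatically if its unit asks for it.

```ini
# example.service (hypothetical unit, for illustration only)
[Unit]
Description=Example service
# Allow at most 5 starts within any 10s window; take no action at the limit.
StartLimitIntervalSec=10s
StartLimitBurst=5
StartLimitAction=none

[Service]
# Hypothetical binary path.
ExecStart=/usr/bin/example-daemon
# Assumed restart trigger; not one of the defaults listed above.
Restart=on-failure
# Wait 100ms before each restart attempt.
RestartSec=100ms
```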

I’d like to propose a few changes for openbmc:

1.  Change the StartLimitBurst to 3
Five just seems excessive for our services in openbmc.  In all fail
scenarios I’ve seen so far (other than with phosphor-hwmon), either
restarting once does the job or restarting all 5 times does not help
and we just end up hitting the limit of 5 anyway.

2. Change the RestartSec from 100ms to 1s.
When a service hits a failure, our new debug collection service kicks
in.  When a core file is involved we’ve found that generating 5 core
files within ~500ms puts a huge strain on the BMC.  Also, if we are
going to get a fix on a restart of a service, the more time the better
(think retries on device driver scenarios).

3. Define a StartLimitAction for critical services to “reboot” the BMC
With 1 and 2 above, we could have services starting indefinitely with
no real recovery on the BMC.  Certain services are critical though,
and I believe should result in a BMC reset to try and recover.  Those
services are the following:
   o dbus.service
   o xyz.openbmc_project.ObjectMapper.service

Some services that are on the bubble for me (external interfaces):
   o phosphor-ipmi-host.service
   o phosphor-ipmi-net.service
   o dropbear@.service
   o phosphor-gevent.service
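
As an illustration of proposals 1 to 3 applied to one of the critical
services, a drop-in along these lines might work. This is only a sketch: the
drop-in file name is hypothetical, and it assumes the unit already sets a
Restart= policy and that the running systemd accepts the StartLimit* settings
in [Unit].

```ini
# /etc/systemd/system/xyz.openbmc_project.ObjectMapper.service.d/restart-policy.conf
# (hypothetical drop-in file name)
[Unit]
# Proposal 1: give up after 3 starts within the rate-limit window.
StartLimitBurst=3
# Proposal 3: reboot the BMC once the start limit is hit.
StartLimitAction=reboot

[Service]
# Proposal 2: wait 1s between restart attempts instead of 100ms.
RestartSec=1s
```

After a "systemctl daemon-reload", running
"systemctl cat xyz.openbmc_project.ObjectMapper.service" should list the
drop-in alongside the main unit file, which is a quick way to confirm it was
picked up.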

I have some maintainability concerns with trying to pick specific
services to cause a BMC reboot.  Maybe it would be better to define a
default that all services cause a BMC reboot, then pick specific
ones that would not result in a reboot?  Or maybe it’s best to never
reboot, and just let the system owners manage it?  Thoughts
appreciated.
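
On the "default for all services" idea, the burst and delay changes from 1
and 2 could probably be set once for the whole system through the manager
configuration rather than per unit. A minimal sketch, assuming a
system.conf.d drop-in with a hypothetical file name:

```ini
# /etc/systemd/system.conf.d/restart-policy.conf (hypothetical file name)
[Manager]
# System-wide default delay between restart attempts.
DefaultRestartSec=1s
# System-wide default number of starts allowed per rate-limit window.
DefaultStartLimitBurst=3
```

As far as the systemd-system.conf documentation shows, there is no matching
manager-level default for StartLimitAction, so the reboot behavior would
likely still have to be set unit by unit.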

References:
https://www.freedesktop.org/software/systemd/man/systemd.unit.html#

