RFC Systemd Service Restart Policy
Andrew Geissler
geissonator at gmail.com
Thu Sep 7 05:03:31 AEST 2017
I’ve got an old but good one this sprint:
https://github.com/openbmc/openbmc/issues/272
The point of this issue is to define our restart and recovery policy
for openbmc services.
Currently we’re using the systemd defaults, which are the following:
RestartSec=100ms
StartLimitIntervalSec=10s
StartLimitBurst=5
StartLimitAction=none
So basically, if a service fails, we will restart it up to 5 times
within a 10s window, with a 100ms delay before each restart.
No action is taken when we reach the 5 restarts, other than that no
further starts are permitted until the 10s window has expired.
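For reference, here is a rough sketch of how those defaults would look if
written out explicitly in a unit file. The unit name and binary path are
hypothetical, and I'm assuming a systemd new enough that the StartLimit*
settings live in [Unit]. Note that the restart settings only matter for
units that set Restart= to something other than the default of "no".

```ini
# example.service (hypothetical unit, for illustration only)
[Unit]
Description=Example service with the default start limits spelled out
# Allow at most 5 starts within any 10s window...
StartLimitIntervalSec=10s
StartLimitBurst=5
# ...and take no further action once that limit is hit.
StartLimitAction=none

[Service]
# Hypothetical binary path.
ExecStart=/usr/bin/example-daemon
# Assumption: the unit opts in to automatic restarts (systemd default is Restart=no).
Restart=on-failure
# Wait 100ms before each automatic restart.
RestartSec=100ms
```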
I’d like to propose a few changes for openbmc:
1. Change the StartLimitBurst to 3
Five just seems excessive for our services in openbmc. In all failure
scenarios I’ve seen so far (other than with phosphor-hwmon), either
restarting once does the job or restarting all 5 times does not help
and we just end up hitting the limit of 5 anyway.
2. Change the RestartSec from 100ms to 1s.
When a service hits a failure, our new debug collection service kicks
in. When a core file is involved we’ve found that generating 5 core
files within ~500ms puts a huge strain on the BMC. Also, if a restart
is going to fix the problem, the more time we give it the better
(think retries in device driver scenarios).
3. Define a StartLimitAction for critical services to “reboot” the BMC
With 1 and 2 above, we could have services starting indefinitely with
no real recovery on the BMC. Certain services are critical though,
and I believe their failure should result in a BMC reset to try to
recover. Those services are the following:
o dbus.service
o xyz.openbmc_project.ObjectMapper.service
Some services that are on the bubble for me (external interfaces):
o phosphor-ipmi-host.service
o phosphor-ipmi-net.service
o dropbear@.service
o phosphor-gevent.service
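A minimal sketch of how proposals 1 through 3 might be expressed, assuming
we set 1 and 2 as manager-wide defaults in system.conf and apply 3 through
a per-service drop-in. Whether openbmc would do it this way or edit each
unit directly is an open question, and the drop-in file name below is
hypothetical.

```ini
# --- File 1: /etc/systemd/system.conf (manager-wide defaults, proposals 1 and 2) ---
[Manager]
# Proposal 1: allow 3 starts per interval instead of 5.
DefaultStartLimitBurst=3
# Proposal 2: wait 1s instead of 100ms before each automatic restart.
DefaultRestartSec=1s

# --- File 2: drop-in for one critical service (proposal 3) ---
# Hypothetical path:
# /etc/systemd/system/xyz.openbmc_project.ObjectMapper.service.d/restart-policy.conf
[Unit]
# Reboot the BMC once this service hits its start limit.
StartLimitAction=reboot
```

As far as I know there is no manager-wide default for StartLimitAction, so
the reboot action would have to be set per unit like this.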
I have some maintainability concerns with trying to pick specific
services to cause a BMC reboot. Maybe it would be better to define a
default where all services cause a BMC reboot, then pick specific
ones that would not result in a reboot? Or maybe it’s best to never
reboot, and just let the system owners manage it? Thoughts
appreciated.
References:
https://www.freedesktop.org/software/systemd/man/systemd.unit.html#