RFC Systemd Service Restart Policy

Thu Sep 7 05:50:26 AEST 2017

On 06-Sep-2017 02:03 PM, Andrew Geissler wrote:
> I’ve got an old but good one this sprint,
> https://github.com/openbmc/openbmc/issues/272
> 
> The point of this issue is to define our restart and recovery policy
> for openbmc services.
> 
> Currently we’re using the systemd defaults, which are the following:
> RestartSec=100ms
> StartLimitIntervalSec=10s
> StartLimitBurst=5
> StartLimitAction=none
> 
> So basically if a service fails, we will restart it up to 5 times,
> every 10s, with a 100ms delay between each restart.
> There is no action taken when we reach the 5 restarts, other than to
> do nothing until the 10s window has expired.
> 
> I’d like to propose a few changes for openbmc:
> 
> 1.  Change the StartLimitBurst to 3
> Five just seems excessive for our services in openbmc.  In all fail
> scenarios I’ve seen so far (other than with phosphor-hwmon), either
> restarting once does the job or restarting all 5 times does not help
> and we just end up hitting the 5 limit anyway.
> 
> 2. Change the RestartSec from 100ms to 1s.
> When a service hits a failure, our new debug collection service kicks
> in.  When a core file is involved we’ve found that generating 5 core
> files within ~500ms puts a huge strain on the BMC.  Also, if we are
> going to get a fix on a restart of a service, the more time the better
> (think retries on device driver scenarios).

I think these two are pretty reasonable. We have had similar behavior 
implemented on prior generations of BMC. I like your reasoning for both 
changes.
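
For illustration, proposals 1 and 2 could be written as manager-wide 
defaults along the following lines. This is only a sketch: it assumes 
the defaults are set through the [Manager] section of system.conf and a 
systemd version that uses the StartLimitIntervalSec naming. The file 
location shown is an assumption, not something settled in this thread.

```
# /etc/systemd/system.conf (assumed location; a system.conf.d drop-in would also work)
[Manager]
# Proposal 2: wait 1s instead of 100ms before each restart
DefaultRestartSec=1s
# Unchanged: the window over which start attempts are counted
DefaultStartLimitIntervalSec=10s
# Proposal 1: allow 3 starts per window instead of 5
DefaultStartLimitBurst=3
```

Note that DefaultRestartSec only matters for units that set Restart= in 
the first place.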

> 3. Define a StartLimitAction for critical services to “reboot” the BMC
> With 1 and 2 above, we could have services starting indefinitely with
> no real recovery on the BMC.  Certain services are critical though,
> and I believe should result in a BMC reset to try and recover.  Those
> services are the following:
>    o dbus.service
>    o xyz.openbmc_project.ObjectMapper.service
> 
> Some services that are on the bubble for me (external interfaces):
>    o phosphor-ipmi-host.service
>    o phosphor-ipmi-net.service
>    o dropbear@.service
>    o phosphor-gevent.service
> 
> I have some maintainability concerns with trying to pick specific
> services to cause a BMC reboot.  Maybe it would be better to define a
> default that all services cause a BMC reboot, then pick specific
> ones that would not result in a reboot?  Or maybe it’s best to never
> reboot, and just let the system owners manage it?  Thoughts
> appreciated.

I would prefer that we have a core set of services (such as dbus and 
the mapper) whose failures are terminal faults (maybe even without 
retries), and then assume that everything else can be restarted nicely. 
If something cannot be restarted nicely, there should be a really good 
reason for that, and that service's unit file can specify something 
other than the defaults to change its behavior.

This is a Linux system; in the ideal world, it should only need to be 
restarted for firmware updates. All other faults should be recoverable. 
Ideal world aside, individual services that can only be recovered with a 
reboot can handle that case without adjusting the global default.
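
As a rough sketch of what such a per-service override might look like, 
a drop-in for the mapper could set only the action and leave the global 
defaults alone. The drop-in file name below is hypothetical, and this 
assumes a systemd version where StartLimitAction lives in the [Unit] 
section.

```
# /etc/systemd/system/xyz.openbmc_project.ObjectMapper.service.d/reboot.conf
# (hypothetical drop-in name)
[Unit]
# Reboot the BMC once this unit exhausts its start limit
StartLimitAction=reboot
```

Every other service would keep StartLimitAction=none and simply stop 
being restarted once it hits the limit.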

--Vernon

> References:
> https://www.freedesktop.org/software/systemd/man/systemd.unit.html#