How to deal with failing services in the boot targets

Fri Feb 3 17:40:37 AEDT 2017

On Wed, 2017-02-01 at 14:42 -0600, Andrew Geissler wrote:
> Finally got around to doing some testing on this, here's what I got.
> 
> My story this sprint, https://github.com/openbmc/openbmc/issues/1033,
> is focused on handling errors when things go wrong.  Specifically,
> when required services fail to execute properly during a systemd
> target execution (power on, power off).  When a fail happens, the obmc
> software needs to notify the users of the system and provide
> mechanisms for either the system to automatically retry the failed
> operation (i.e. reboot the system) or to stay in a quiesced state so
> that error data can be collected and the fail can be investigated.
> 
> Michael is working on a story that ties in with this function this
> sprint, https://github.com/openbmc/openbmc/issues/942, in which we’ll
> allow system users to enable or disable the auto reboot function on
> errors (service failure, host checkstop failure, host watchdog
> failure).  He will utilize the new target I’ll be creating in my story
> for this.
> 
> So we have two main fail scenarios:
> 
> 1. A service within a target fails
> - If the service is a oneshot type, and you put that it is required
> (not wanted) by the target then the target will fail if the service
> fails
>   - You can simply define a behavior for when the target fails using
> the “OnFailure” option (i.e. go to a new failure target if any
> required service fails)
> 
> - If the service is not a oneshot, then you cannot have it fail the
> target (the target only knows that it started successfully)
>   - You have to define a behavior for when the service fails, using
> the “OnFailure” option.
>   - The service cannot have “RemainAfterExit=yes”, otherwise the
> OnFailure action does not occur until the service is stopped (instead
> of when it fails)
> 
> 2. A failure outside of a normal systemd target/service (host watchdog
> expires, host checkstop detected)
> - The service which detects this failure is responsible for logging
> the appropriate error, and instructing systemd to go to the
> appropriate target
> 
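
A minimal sketch of the first scenario, assuming a hypothetical target
example-power-on@.target and a hypothetical oneshot service
example-check@.service. Neither name comes from the existing unit
files, the ExecStart path is made up, and the use of %i to pass the
instance through to the quiesce target is my assumption:

```ini
# example-power-on@.target (hypothetical name)
[Unit]
Description=Example power on target for instance %i
# If a required unit fails, go to the quiesce target.
OnFailure=obmc-quiesce-system@%i.target

# example-check@.service (hypothetical name)
[Unit]
Description=Example required step for instance %i

[Service]
Type=oneshot
# Hypothetical binary; a non-zero exit marks the service as failed.
ExecStart=/usr/bin/example-check

[Install]
# RequiredBy (not WantedBy) so that a failure here fails the target.
RequiredBy=example-power-on@%i.target
```
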
> The current proposal is that we create a new quiesce target.  This is
> the target that targets and services name in their “OnFailure=”
> instruction, and the target that the services in failure scenario 2
> above instruct systemd to go to when they detect a problem.  We’ll
> then have code that monitors for entry into this new quiesce target
> and handles the halt vs automatic reboot functionality.
> 
> The above info sets up some general guidelines for our targets and
> services (and some refactoring for my story this sprint)
> 
> - All targets should have an “OnFailure=obmc-quiesce-system@.target”
> - All services which are required for a target to achieve its
> function should be RequiredBy that target (not WantedBy)
> - All services should first try to be Type=oneshot so that we can just
> rely on the target fail path
> - If a service can not be “Type=oneshot”, then it needs to have an
> “OnFailure=obmc-quiesce-system@.target” and a “RemainAfterExit=no”
> - If a service can not be any of these then it’s up to the service
> application to call systemd with the obmc-quiesce-system@.target on
> failures
> 
> Thoughts/Questions?

I think this is a sensible set of suggestions. We need to document them
somewhere obvious so a) we can point people at them and b) reviewers
can refer to them when reviewing patches adding/updating systemd unit
files and targets.
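
As a starting point for that documentation, the guideline for services
that can not be oneshot might be illustrated roughly like this. The
unit name and ExecStart path are hypothetical, and %i as the instance
specifier is an assumption on my part:

```ini
# example-monitor@.service (hypothetical name)
[Unit]
Description=Example long-running service for instance %i
# The target only knows this service started, so the service has to
# name the quiesce target itself.
OnFailure=obmc-quiesce-system@%i.target

[Service]
Type=simple
# Hypothetical binary.
ExecStart=/usr/bin/example-monitor
# "no" is the systemd default; it is spelled out here because "yes"
# would delay the OnFailure action until the service is stopped.
RemainAfterExit=no
```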

Thanks for considering the problem.

Andrew