How to deal with failing services in the boot targets
Andrew Jeffery
andrew at aj.id.au
Fri Feb 3 17:40:37 AEDT 2017
On Wed, 2017-02-01 at 14:42 -0600, Andrew Geissler wrote:
> Finally got around to doing some testing on this, here's what I got.
>
> My story this sprint, https://github.com/openbmc/openbmc/issues/1033,
> is focused on handling errors when things go wrong. Specifically,
> when required services fail to execute properly during a systemd
> target execution (power on, power off). When a failure happens, the obmc
> software needs to notify the users of the system and provide
> mechanisms for either the system to automatically retry the failed
> operation (i.e. reboot the system) or to stay in a quiesced state so
> that error data can be collected and the failure can be investigated.
>
> Michael is working on a story that ties in with this function this
> sprint, https://github.com/openbmc/openbmc/issues/942, in which we’ll
> allow system users to enable or disable the auto reboot function on
> errors (service failure, host checkstop failure, host watchdog
> failure). He will utilize the new target I’ll be creating in my story
> for this.
>
> So we have two main fail scenarios:
>
> 1. A service within a target fails
> - If the service is a oneshot type, and you declare it as required
> (not wanted) by the target, then the target will fail if the service
> fails
> - You can simply define a behavior for when the target fails using
> the “OnFailure” option (i.e. go to a new failure target if any
> required service fails)
>
> - If the service is not a oneshot, then you cannot have it fail the
> target (the target only knows that it started successfully)
> - You have to define a behavior for when the service fails, using the
> “OnFailure” option.
> - The service cannot have “RemainAfterExit=yes”, otherwise the
> OnFailure action does not occur until the service is stopped (instead
> of when it fails)
>
> 2. A failure outside of a normal systemd target/service (host watchdog
> expires, host checkstop detected)
> - The service which detects this failure is responsible for logging
> the appropriate error, and instructing systemd to go to the
> appropriate target
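
A rough, untested sketch of the oneshot case from scenario 1, written as
two unit files. The names example-power-on@.target and
example-step@.service and the ExecStart path are made up for
illustration; only the quiesce target name comes from the proposal.

```ini
# --- example-power-on@.target (hypothetical target name) ---
[Unit]
Description=Example power-on target for chassis %i
# Go to the quiesce target if this target fails
OnFailure=obmc-quiesce-system@%i.target

# --- example-step@.service (hypothetical service name) ---
[Unit]
Description=Example required power-on step for chassis %i

[Service]
Type=oneshot
# Hypothetical command; a non-zero exit status fails the service
ExecStart=/usr/bin/example-step --instance %i

[Install]
# RequiredBy (not WantedBy), so that a failure here fails the target
RequiredBy=example-power-on@%i.target
```

With this layout the service itself carries no OnFailure= line; it
relies entirely on the target's failure path.
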
>
> The current proposal is that we create a new quiesce target. This is
> the target that targets/services name in their “OnFailure=”
> instruction, and the one that services detecting a problem in fail
> scenario #2 above will instruct systemd to go to. We’ll then have code
> that monitors for entry into this new quiesce target and handles the
> halt vs automatic reboot functionality.
>
> The above info sets up some general guidelines for our targets and
> services (and some refactoring for my story this sprint)
>
> - All targets should have an “OnFailure=obmc-quiesce-system@.target”
> - All services which are required for a target to achieve its
> function should be RequiredBy that target (not WantedBy)
> - All services should first try to be Type=oneshot so that we can just
> rely on the target fail path
> - If a service cannot be “Type=oneshot”, then it needs to have an
> “OnFailure=obmc-quiesce-system@.target” and a “RemainAfterExit=no”
> - If a service cannot be any of these then it’s up to the service
> application to call systemd with the obmc-quiesce-system@.target on
> failures
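
For the "cannot be oneshot" guideline, a long-running service might look
roughly like the fragment below. This is an untested sketch:
example-monitor@.service, its ExecStart path and
example-power-on@.target are hypothetical names, and Type=simple is just
one possible non-oneshot type.

```ini
# --- example-monitor@.service (hypothetical long-running service) ---
[Unit]
Description=Example long-running service for chassis %i
# The target only sees that the service started, so the service
# names the quiesce target itself
OnFailure=obmc-quiesce-system@%i.target

[Service]
Type=simple
# Hypothetical daemon; runs until stopped or until it fails
ExecStart=/usr/bin/example-monitor --instance %i
# Spelled out for reviewers; "no" is also the systemd default
RemainAfterExit=no

[Install]
RequiredBy=example-power-on@%i.target
```

Writing RemainAfterExit=no explicitly is redundant, but it makes the
guideline easy to check when reviewing a unit file.
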
>
> Thoughts/Questions?
I think this is a sensible set of suggestions. We need to document them
somewhere obvious so a) we can point people at them and b) reviewers
can refer to them when reviewing patches adding/updating systemd unit
files and targets.
Thanks for considering the problem.
Andrew