How to deal with failing services in the boot targets

Andrew Geissler geissonator at gmail.com
Thu Feb 2 07:42:02 AEDT 2017


Finally got around to doing some testing on this, here's what I got.

My story this sprint, https://github.com/openbmc/openbmc/issues/1033,
is focused on handling errors when things go wrong.  Specifically,
when required services fail to execute properly during a systemd
target execution (power on, power off).  When a fail happens, the obmc
software needs to notify the users of the system and provide
mechanisms for either the system to automatically retry the failed
operation (i.e. reboot the system) or to stay in a quiesced state so
that error data can be collected and the fail can be investigated.

Michael is working on a story that ties in with this function this
sprint, https://github.com/openbmc/openbmc/issues/942, in which we’ll
allow system users to enable or disable the auto reboot function on
errors (service failure, host checkstop failure, host watchdog
failure).  He will utilize the new target I’ll be creating in my story
for this.

So we have two main fail scenarios:

1. A service within a target fails
- If the service is Type=oneshot and is required (not wanted) by the
target, then the target will fail if the service fails
  - You can simply define a behavior for when the target fails using
the “OnFailure” option (e.g. go to a new failure target if any
required service fails)

- If the service is not a oneshot, then it cannot fail the target (the
target only knows that the service started successfully)
  - You have to define the failure behavior on the service itself,
using its “OnFailure” option.
  - The service cannot have “RemainAfterExit=yes”, otherwise the
OnFailure action does not occur until the service is stopped (instead
of when it fails)

2. A failure outside of a normal systemd target/service (host watchdog
expires, host checkstop detected)
- The service which detects this failure is responsible for logging
the appropriate error, and instructing systemd to go to the
appropriate target
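
As a rough sketch of the oneshot path in scenario 1, two unit files
might look like the following.  The unit names
(example-power-on@.target, example-power-step@.service), the binary
path, and the use of %i to pick the instance are all assumptions made
for illustration, not units that exist in the tree today:

```ini
# ---- file: example-power-on@.target (hypothetical name) ----
[Unit]
Description=Example power on target for system %i
# If this target fails, ask systemd to go to the quiesce target.
OnFailure=obmc-quiesce-system@%i.target

# ---- file: example-power-step@.service (hypothetical name) ----
[Unit]
Description=Example required power on step for system %i

[Service]
# oneshot: systemd waits for the command to exit, so a non-zero
# exit is seen as a failed start of this service.
Type=oneshot
ExecStart=/usr/bin/example-power-step

[Install]
# Required (not wanted), so a failure here fails the target.
RequiredBy=example-power-on@%i.target
```

The RequiredBy line only takes effect once the service is enabled.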

The current proposal is that we create a new quiesce target.  This is
the target that targets and services name in their “OnFailure=”
setting, and the target that the services detecting the failures in
scenario #2 above instruct systemd to go to.  We’ll then have code
that monitors for entry into this new quiesce target and handles the
halt vs automatic reboot functionality.
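
A minimal sketch of what the quiesce target itself could look like is
below.  Only the target name comes from this proposal; the description
text and the handler service (example-quiesce-handler@.service) are
hypothetical, and hooking the handler in with Wants= is just one
possible way to run code on entry into the target:

```ini
# ---- file: obmc-quiesce-system@.target (contents are assumptions) ----
[Unit]
Description=Quiesce system %i after a failure
# Hypothetical handler that would decide between staying halted
# and automatically rebooting.
Wants=example-quiesce-handler@%i.service
```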

The above info sets up some general guidelines for our targets and
services (and some refactoring for my story this sprint):

- All targets should have an “OnFailure=obmc-quiesce-system@.target”
- All services which are required for a target to achieve its
function should be RequiredBy that target (not WantedBy)
- All services should first try to be Type=oneshot so that we can just
rely on the target fail path
- If a service cannot be “Type=oneshot”, then it needs to have an
“OnFailure=obmc-quiesce-system@.target” and a “RemainAfterExit=no”
- If a service cannot be any of these then it’s up to the service
application to call systemd with the obmc-quiesce-system@.target on
failures
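
For the case where a service cannot be Type=oneshot, the guideline
could be sketched as below.  The service name, binary path, the
example-power-on@.target it attaches to, and the %i instance handling
are hypothetical and only there to make the fragment complete:

```ini
# ---- file: example-monitor@.service (hypothetical name) ----
[Unit]
Description=Example long-running service for system %i
# The target only sees that this service started, so the service
# names the quiesce target itself.
OnFailure=obmc-quiesce-system@%i.target

[Service]
Type=simple
ExecStart=/usr/bin/example-monitor
# "no" is already the systemd default; it is spelled out here to
# match the guideline.
RemainAfterExit=no

[Install]
# Hypothetical target that needs this service.
RequiredBy=example-power-on@%i.target
```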

Thoughts/Questions?
Andrew

On Thu, Jan 26, 2017 at 7:16 PM, Andrew Jeffery <andrew at aj.id.au> wrote:
> On Wed, 2017-01-25 at 15:29 -0800, Xo Wang wrote:
>> 3) Do other people also want this? To me it seems obvious that failure
>> to power on should always block starting IPL, but maybe somebody else
>> has a good reason to use weaker relationships.
>
> Sounds highly desirable to me. In an effort to better understand our
> dependencies I dumped them out with `systemd-analyze dot`. Safe to say
> I'm not much wiser having seen the graph:
>
> http://ozlabs.org/~arj/openbmc/systemd.svg
>
> (Source: http://ozlabs.org/~arj/openbmc/systemd.dot.xz )
>
> Andrew

