[Skiboot] [PATCH v2] opal/xstop: Use nvram option to enable/disable sw checkstop.

Stewart Smith stewart at linux.vnet.ibm.com
Tue Jan 16 11:22:32 AEDT 2018

Oliver <oohall at gmail.com> writes:
> On Mon, Jan 15, 2018 at 5:42 PM, Stewart Smith
> <stewart at linux.vnet.ibm.com> wrote:
>> Mahesh J Salgaonkar <mahesh at linux.vnet.ibm.com> writes:
>>> From: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
>>> Add a mechanism to enable/disable sw checkstop by looking at nvram option
>>> opal-sw-xstop=<enable/disable>.
>>> For now this patch disables the sw checkstop trigger unless explicitly
>>> enabled through nvram option 'opal-sw-xstop=enable' for p9. This will allow
>>> an opportunity to get host kernel in panic path or xmon for unrecoverable
>>> HMIs or MCE, to be able to debug the issue effectively.
>>> To enable sw checkstop in opal, issue the following command:
>>> # nvram -p ibm,skiboot --update-config opal-sw-xstop=enable
>>> NOTE: This is a workaround patch to disable sw checkstop by default to gain
>>> control in host kernel for better checkstop debugging. Once we have most of
>>> the checkstop issues stabilized/resolved, revisit this patch to enable sw
>>> checkstop by default.
>>> For p8 platform it will remain enabled by default unless explicitly disabled.
>>> To disable sw checkstop on p8, issue the following command:
>>> # nvram -p ibm,skiboot --update-config opal-sw-xstop=disable
>>> Signed-off-by: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
>>> Reviewed-by: Balbir Singh <bsingharora at gmail.com>
>>> ---
>>> Change in v2:
>>>    - Add pr_log to indicate that sw checkstop was disabled.
>>> ---
>>>  hw/xscom.c |   32 ++++++++++++++++++++++++++++++++
>>>  1 file changed, 32 insertions(+)
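
The policy described in the commit message above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code merged into hw/xscom.c: the enum and the function name sw_xstop_enabled() are hypothetical, the nvram value is passed in as a plain string (NULL when the option is unset), and treating an unrecognised value as unset is an assumption the patch description does not spell out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical processor generations; not skiboot's real enum. */
enum proc_gen_sketch { GEN_P8, GEN_P9 };

/*
 * Decide whether the software checkstop trigger should be honoured.
 * nvram_value is the value of opal-sw-xstop, or NULL when it is unset.
 */
bool sw_xstop_enabled(enum proc_gen_sketch gen, const char *nvram_value)
{
	if (nvram_value != NULL) {
		/* An explicit setting wins on either platform. */
		if (strcmp(nvram_value, "enable") == 0)
			return true;
		if (strcmp(nvram_value, "disable") == 0)
			return false;
	}

	/*
	 * Unset (or, by assumption, unrecognised): fall back to the
	 * per-platform default, enabled on p8 and disabled on p9.
	 */
	return gen == GEN_P8;
}
```

With this shape, a p9 machine with no nvram setting skips the software checkstop, while a p8 machine keeps it unless opal-sw-xstop=disable is set.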
>> All a bit umming-and-ahhing about the behaviour change... but this seems
>> to be the "easiest" for now.... and I reserve the right to change my
>> mind at any point :)
>> I think the correct solution here is to have the kernel make the
>> appropriate decision rather than having this workaround in OPAL.
>> But.. well... reality and today was checkstop heavy, so my mind kind of
>> changed :)
>> Merged to master as of 3c38214ab4f097a307058361428f9be8a239f1db though.
>> I think having the option to *disable* it is always going to be good,
>> but... well... I don't like that we end up in a situation where the
>> kernel says "everything is terrible because you told me it was terrible,
>> please reboot now" and then we ignore it.
>> The real solution is a kernel one....
> It really isn't. If we are reporting unrecoverable HMIs to the kernel
> then the kernel has every right to assume the world is on fire and
> request a shutdown. If we want the kernel to do something else then we
> need to change what OPAL reports back to the kernel. Just disabling
> the software xstop is a gross hack at best. It's not even clear that
> just disabling the xstop is sufficient to keep the host up and running
> since the kernel thread that initiated the shutdown isn't expecting to
> return...

Yeah, that would be the better place to fix things - telling the kernel
that it did a naughty rather than the machine is borked.

I'm sure we'll figure it out sometime after I stop seeing "PRD: Hardware
problem, very low chance of software cause" that's actually 100%
software problem.

Stewart Smith
OPAL Architect, IBM.
