[Skiboot] [PATCH v2] opal/xstop: Use nvram option to enable/disable sw checkstop.
bsingharora at gmail.com
Wed Jan 17 15:47:56 AEDT 2018
On Mon, Jan 15, 2018 at 12:26 PM, Oliver <oohall at gmail.com> wrote:
> On Mon, Jan 15, 2018 at 5:42 PM, Stewart Smith
> <stewart at linux.vnet.ibm.com> wrote:
>> Mahesh J Salgaonkar <mahesh at linux.vnet.ibm.com> writes:
>>> From: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
>>> Add a mechanism to enable/disable sw checkstop by looking at nvram option.
>>> For now this patch disables the sw checkstop trigger unless explicitly
>>> enabled through nvram option 'opal-sw-xstop=enable' for p9. This will allow
>>> an opportunity to get host kernel in panic path or xmon for unrecoverable
>>> HMIs or MCE, to be able to debug the issue effectively.
>>> To enable sw checkstop in opal issue following command:
>>> # nvram -p ibm,skiboot --update-config opal-sw-xstop=enable
>>> NOTE: This is a workaround patch to disable sw checkstop by default to gain
>>> control in host kernel for better checkstop debugging. Once we have most of
>>> the checkstop issues stabilized/resolved, revisit this patch to enable sw
>>> checkstop by default.
>>> For p8 platform it will remain enabled by default unless explicitly disabled.
>>> To disable sw checkstop on p8 issue following command:
>>> # nvram -p ibm,skiboot --update-config opal-sw-xstop=disable
>>> Signed-off-by: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
>>> Reviewed-by: Balbir Singh <bsingharora at gmail.com>
>>> Change in v2:
>>> - Add pr_log to indicate that sw checkstop was disabled.
>>> hw/xscom.c | 32 ++++++++++++++++++++++++++++++++
>>> 1 file changed, 32 insertions(+)
>> All a bit umming-and-ahhing about the behaviour change... but this seems
>> to be the "easiest" for now.... and I reserve the right to change my
>> mind at any point :)
>> I think the correct solution here is to have the kernel make the
>> appropriate decision rather than having this workaround in OPAL.
>> But.. well... reality and today was checkstop heavy, so my mind kind of
>> changed :)
>> Merged to master as of 3c38214ab4f097a307058361428f9be8a239f1db though.
>> I think having the option to *disable* it is always going to be good,
>> but... well... I don't like that we end up in a situation where the
>> kernel says "everything is terrible because you told me it was terrible,
>> please reboot now" and then we ignore it.
>> The real solution is a kernel one....
> It really isn't. If we are reporting unrecoverable HMIs to the kernel
> then the kernel has every right to assume the world is on fire and
> request a shutdown. If we want the kernel to do something else then we
> need to change what OPAL reports back to the kernel. Just disabling
The real issue is keeping backwards compat for p8 and allowing a
checkstop for machines that care to do so.
> the software xstop is a gross hack at best. It's not even clear that
> just disabling the xstop is sufficient to keep the host up and running
> since the kernel thread that initiated the shutdown isn't expecting to
The real goal of the patch is to log the context of what was
happening when we triggered the platform error. We could still dump
that on the console, but going back to the kernel lets us crash/xmon
and get more info before rebooting the box. Hostboot printing that we
got a software initiated checkstop (TI) is not useful, to be honest, and
we're seeing NPU2 devices cause HMIs and machine checks, so it's
useful to see the context at the time of the error.
> That said, it's a stupid debug hack so who cares.
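
As a footnote to the thread: the per-platform defaults from the patch
description (p9 off unless opal-sw-xstop=enable, p8 on unless
opal-sw-xstop=disable) can be sketched roughly as below. This is not the
code from hw/xscom.c. The enum, the function name, and the idea of
passing the nvram value in as a plain string are all assumptions made
for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical processor generations, for illustration only. */
enum proc_gen { GEN_P8, GEN_P9 };

/*
 * Decide whether the software checkstop trigger should be armed.
 * `value` is the setting of the opal-sw-xstop nvram key, or NULL
 * when the key is not set at all.
 */
static bool sw_xstop_enabled(enum proc_gen gen, const char *value)
{
	if (gen == GEN_P9) {
		/* p9: disabled unless explicitly enabled. */
		return value != NULL && strcmp(value, "enable") == 0;
	}

	/* p8: enabled unless explicitly disabled. */
	return !(value != NULL && strcmp(value, "disable") == 0);
}
```

With no nvram key set, this returns false for GEN_P9 and true for
GEN_P8, which matches the defaults stated in the patch description.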