[Skiboot] [PATCH v4] IPMI: Trigger OPAL TI in abort path.

Wed Nov 6 23:42:49 AEDT 2019

On 2019-11-05 23:23:37 Tue, Oliver O'Halloran wrote:
> On Tue, Nov 5, 2019 at 10:02 PM Mahesh Salgaonkar
> <mahesh at linux.vnet.ibm.com> wrote:
> >
> > The current assert/abort implementation for BMC-based systems invokes a CEC
> > reboot after printing the backtrace. This means that the BMC never gets
> > notified about the OPAL crash/termination. This sometimes leads to a
> > never-ending IPL loop if OPAL keeps aborting very early in the boot path.
> >
> > Trigger a software xstop (OPAL TI) to inform the BMC about the OPAL
> > termination. The BMC is capable of catching the checkstop signal and
> > facilitating the reboot (IPL) of the host.
> >
> > With the AutoReboot policy, OpenBMC handles checkstop signals and counts
> > them against the reboot counter. In cases where OPAL crashes before the
> > host reaches runtime, OpenBMC will move the system to the Quiesced state
> > after 3 or so IPL/reboot attempts so that the system can be debugged. When
> > OPAL triggers a software checkstop, it causes all the CPU threads to be
> > stopped and moved to the quiesced state. Hence OPAL doesn't need to
> > explicitly stop all CPUs before calling software xstop.
> >
> > Signed-off-by: Mahesh Salgaonkar <mahesh at linux.vnet.ibm.com>
> > ---
> > v4:
> >  - Remove the unwanted header file include leftover from v2.
> > v3:
> >  - Trigger software xstop (OPAL TI) instead of attn.
> > v2:
> >  - Always Quiesce the secondaries in abort path.
> >  - Change the attn_supported type to bool.
> 
> Patch looks fine, but how have you tested it?

I tested it by flashing a BOOTKERNEL filled with zeros in
/usr/local/share/pnor/BOOTKERNEL.

Since the assert fires before the host reaches runtime, OpenBMC reboots the
host 3 times and then goes into the Quiesced state.
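
For readers unfamiliar with the policy, the behaviour observed above can be
modelled roughly as follows. This is only an illustrative sketch, not OpenBMC
or skiboot code: every name in it (reboot_policy, policy_on_checkstop, and so
on) is hypothetical, the limit of 3 is taken from the test result above, and
refilling the counter once the host reaches runtime is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed limit: the thread only says "3 or so attempts". */
#define MAX_BOOT_ATTEMPTS 3

/* Hypothetical host states, named after the terms used in this thread. */
enum host_state { HOST_BOOTING, HOST_RUNTIME, HOST_QUIESCED };

/* Hypothetical model of the BMC-side reboot counter. */
struct reboot_policy {
	int attempts_left;
	enum host_state state;
};

static void policy_init(struct reboot_policy *p)
{
	assert(p != NULL);
	p->attempts_left = MAX_BOOT_ATTEMPTS;
	p->state = HOST_BOOTING;
}

/* Assumption: a successful boot to runtime refills the counter. */
static void policy_on_runtime(struct reboot_policy *p)
{
	assert(p != NULL);
	p->attempts_left = MAX_BOOT_ATTEMPTS;
	p->state = HOST_RUNTIME;
}

/*
 * Called when the BMC sees a checkstop (for example the software xstop
 * raised from the OPAL abort path). Returns true if the host should be
 * re-IPLed, false if it should be left quiesced for debugging.
 */
static bool policy_on_checkstop(struct reboot_policy *p)
{
	assert(p != NULL);
	if (p->attempts_left > 0) {
		p->attempts_left--;
		p->state = HOST_BOOTING;
		return true;
	}
	p->state = HOST_QUIESCED;
	return false;
}
```

In this model, an OPAL that aborts on every boot gets three re-IPLs; the
fourth checkstop leaves the host quiesced instead of looping forever.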

> 
> I'm wondering mainly because when debugging checkstops with cronus you
> usually need to run "ecmdchip cleanup" before you can read host memory
> using getmemproc and friends. I assume the BMC's dump collector is
> doing something similar since it needs to deal with "real" checkstops
> caused by hardware, but I think it would be good to verify it works
> the way we think it does. Even if it doesn't, the xstop approach is
> better since an attn raised on the 2nd chip won't be propagated to the
> first (I checked).

Yes, the behaviour is similar to that of a system checkstop.

Thanks,
-Mahesh.

-- 
Mahesh J Salgaonkar