<div dir="ltr"><div>The design template review is available here:<br></div><div><br></div><div><a href="https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/21772" target="_blank">https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/21772</a></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, May 16, 2019 at 6:31 PM Andrew Geissler <<a href="mailto:geissonator@gmail.com" target="_blank">geissonator@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Thu, May 16, 2019 at 1:36 AM Deepak Kodihalli<br>

<<a href="mailto:dkodihal@linux.vnet.ibm.com" target="_blank">dkodihal@linux.vnet.ibm.com</a>> wrote:<br>

><br>

> On 15/05/19 6:09 PM, Jayanth Othayoth wrote:<br>

> > ## Problem Description<br>

> > Issue #457:  Add support to debug unresponsive host.<br>

> ><br>

> > Scope: High-level design direction to solve this problem.<br>

> ><br>

> > ## Background and References<br>

> > There are situations at customer sites where OPAL/Linux becomes<br>

> > unresponsive, causing a system hang, and there is no way to figure out<br>

> > what went wrong with the Linux kernel or OPAL. We are looking for a way to<br>

> > trigger a dump capture on the Linux host so that we can capture the OS<br>

> > dump for post-mortem analysis.<br>

> ><br>

> > ## Proposed Design for POWER processor based systems:<br>

> > Get all host CPUs into the reset vector; Linux then has a mechanism to<br>

> > patch this into the panic-kdump path to trigger dump capture. This will<br>

> > enable us to analyze and fix customer issues where Linux hangs and the<br>

> > system becomes unresponsive.<br>

> ><br>

> > ### Redfish Schema used:<br>

> > * Reference: DSP2046 2018.3<br>

> > * The ComputerSystem 1.6.0 schema provides an action called<br>

> > “#ComputerSystem.Reset”, which is used to reset the system. The<br>

> > ResetType parameter indicates the type of reset to be<br>

> > performed. In this use case we can use the “Nmi” type.<br>

> >      * Nmi: Generate a Diagnostic Interrupt (usually an NMI on x86<br>

> > systems) to cease normal operations, perform diagnostic actions and<br>

> > typically halt the system.<br>

> > ### D-Bus:<br>

> ><br>

> > Option 1: Extend the existing D-Bus interface in the State.Host<br>

> > namespace (<br>

> > /openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/State/Host.interface.yaml<br>

> > ) to support a new RequestedHostTransition value called “Nmi”. The D-Bus<br>

> > backend can internally invoke a processor-specific target to do an Sreset<br>

> > (equivalent to an x86 NMI) and the associated actions.<br>

><br>

> I don't prefer this option, because it would mean adding host-specific<br>

> code to phosphor-state-manager, which I think has been host-agnostic until now.<br>

<br>

Yeah, this was my main concern with tying it into phosphor-state-manager.<br>

The fact Redfish put it in with their other state related commands (which<br>

are implemented by phosphor-state-manager) is the only reason I'm a little<br>

wishy-washy here. We could just create a generic systemd target "host-nmi"<br>

or something and phosphor-state-manager could just call that to abstract<br>

any of the specifics, but it still doesn't really feel like it fits to me.<br>

<br>

I think I prefer option 2, and then we can just map bmcweb to that API when<br>

the Redfish command comes in. Sounds like for ppc64 systems we can just<br>

use pdbg to issue the NMI.<br>

<br>

> So for that reason, Option 2 sounds better. There are some good<br>

> questions from Neeraj as well, so I would suggest adding this as a<br>

> design template on Gerrit to gather better feedback.<br>

><br>

> Thanks,<br>

> Deepak<br>

><br>

> > Option 2: Introduce a new D-Bus interface in the control.state namespace<br>

> > (<br>

> > /openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/Control/Host/NMI.interface.yaml)<br>

> > and implement the new D-Bus backend for the respective processor-specific<br>

> > targets.<br>

><br>

</blockquote></div>
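<div>For reference, the Redfish side of the quoted proposal (the ComputerSystem.Reset action with ResetType “Nmi”) could look roughly like the sketch below from a client's point of view. It only builds the request and does not send it. The system id "system" in the URI, the BMC host name, and the token value are assumptions for illustration and are not taken from the proposal; how bmcweb maps the action to the D-Bus interface is left out entirely.</div>
<pre>
```python
import json
import urllib.request

# Assumption: the host is exposed as the single Redfish system "system".
# Adjust the path if the BMC uses a different system id.
RESET_ACTION_PATH = "/redfish/v1/Systems/system/Actions/ComputerSystem.Reset"


def build_nmi_request(bmc_host, token):
    """Build (but do not send) a ComputerSystem.Reset request with ResetType Nmi."""
    body = json.dumps({"ResetType": "Nmi"}).encode("utf-8")
    return urllib.request.Request(
        url="https://" + bmc_host + RESET_ACTION_PATH,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Redfish session token; the value passed in is a placeholder.
            "X-Auth-Token": token,
        },
    )


if __name__ == "__main__":
    # Hypothetical host name and token, for illustration only.
    req = build_nmi_request("bmc.example.com", "hypothetical-token")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```
</pre>
<div>Passing the returned object to urllib.request.urlopen would issue the request; that step is omitted here because it needs a reachable BMC and valid credentials.</div>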