Add support to debug unresponsive host

vishwa vishwa at linux.vnet.ibm.com
Mon May 27 22:42:24 AEST 2019


I recall this topic being discussed in the past. It looks like we need
to do two things before calling SRESET. I will comment on the review.

!! Vishwa !!

On 5/27/19 12:45 PM, Jayanth Othayoth wrote:
> The design template review is available here:
>
> https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/21772
>
> On Thu, May 16, 2019 at 6:31 PM Andrew Geissler <geissonator at gmail.com> wrote:
>
>     On Thu, May 16, 2019 at 1:36 AM Deepak Kodihalli
>     <dkodihal at linux.vnet.ibm.com> wrote:
>     >
>     > On 15/05/19 6:09 PM, Jayanth Othayoth wrote:
>     > > ## Problem Description
>     > > Issue #457: Add support to debug unresponsive host.
>     > >
>     > > Scope: High-level design direction to solve this problem.
>     > >
>     > > ## Background and References
>     > > There are situations at customer sites where OPAL/Linux becomes
>     > > unresponsive, causing a system hang, and there is no way to
>     > > figure out what went wrong with the Linux kernel or OPAL. We are
>     > > looking for a way to trigger a dump capture on the Linux host so
>     > > that we can capture the OS dump for later analysis.
>     > >
>     > > ## Proposed Design for POWER processor based systems:
>     > > Get all host CPUs into the reset vector; Linux then has a
>     > > mechanism to patch this into the panic-kdump path to trigger dump
>     > > capture. This will enable us to analyze and fix customer issues
>     > > where we see a Linux hang and an unresponsive system.
>     > >
>     > > ### Redfish Schema used:
>     > > * Reference: DSP2046 2018.3
>     > > * The ComputerSystem 1.6.0 schema provides an action called
>     > > “#ComputerSystem.Reset”. This action is used to reset the system.
>     > > The ResetType parameter indicates the type of reset to be
>     > > performed. In this use case we can use the “Nmi” type.
>     > >      * Nmi: Generate a Diagnostic Interrupt (usually an NMI on x86
>     > > systems) to cease normal operations, perform diagnostic actions
>     > > and typically halt the system.
>     > > ### D-Bus:
>     > >
>     > > Option 1: Extend the existing D-Bus interface in the State.Host
>     > > namespace (
>     > > /openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/State/Host.interface.yaml
>     > > ) to support a new RequestedHostTransition value called “Nmi”.
>     > > The D-Bus backend can internally invoke a processor-specific
>     > > target to do SRESET (equivalent to x86 NMI) and associated actions.
>     >
>     > I don't prefer this option, because this would mean adding
>     > host-specific code in phosphor-state-manager, which I think until
>     > now has been host-agnostic.
>
>     Yeah, this was my main concern with tying it into
>     phosphor-state-manager.
>     The fact that Redfish put it in with their other state-related
>     commands (which are implemented by phosphor-state-manager) is the
>     only reason I'm a little wishy-washy here. We could just create a
>     generic systemd target "host-nmi" or something, and
>     phosphor-state-manager could just call that to abstract any of the
>     specifics, but it still doesn't really feel like it fits to me.
>
>     I think I prefer option 2, and then we can just map bmcweb to that
>     API when the Redfish command comes in. Sounds like for ppc64
>     systems we can just use pdbg to issue the NMI.
>
>     > So for that reason, Option 2 sounds better. There are some good
>     > questions from Neeraj as well, so I would suggest adding this as a
>     > design template on Gerrit to gather better feedback.
>     >
>     > Thanks,
>     > Deepak
>     >
>     > > Option 2: Introduce a new D-Bus interface in the control.state
>     > > namespace (
>     > > /openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/Control/Host/NMI.interface.yaml
>     > > ) and implement the new D-Bus back-end for the respective
>     > > processor-specific targets.
>     >
>
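
For reference, the Redfish side of the proposal quoted above (the
ComputerSystem.Reset action with ResetType "Nmi") could be sketched
roughly as below. This only builds the request URL and JSON body; it does
not send anything. The function name build_nmi_request, the BMC address,
and the system id "system" are assumptions made for illustration, not
something decided in this thread.

```python
import json


def build_nmi_request(bmc_base_url, system_id="system"):
    """Return (url, body) for a Redfish ComputerSystem.Reset action.

    The system id "system" and the base URL are illustrative assumptions;
    the action path and the "Nmi" ResetType come from the Redfish
    ComputerSystem schema referenced in the design template.
    """
    path = f"/redfish/v1/Systems/{system_id}/Actions/ComputerSystem.Reset"
    url = bmc_base_url.rstrip("/") + path
    # The action takes a single parameter, ResetType.
    body = json.dumps({"ResetType": "Nmi"})
    return url, body


if __name__ == "__main__":
    # Hypothetical BMC address, for demonstration only.
    url, body = build_nmi_request("https://bmc.example.com")
    print("POST", url)
    print(body)
```

With either D-Bus option, bmcweb would be the piece that receives a POST
like this and maps it onto the chosen D-Bus API.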