Add support to debug unresponsive host
vishwa
vishwa at linux.vnet.ibm.com
Mon May 27 22:42:24 AEST 2019
I kind of remember this topic being talked about in the past. Looks like
we need to do two things prior to calling SRESET. I will comment on the review.
!! Vishwa !!
On 5/27/19 12:45 PM, Jayanth Othayoth wrote:
> Design template Review is available here
>
> https://gerrit.openbmc-project.xyz/c/openbmc/docs/+/21772
>
> On Thu, May 16, 2019 at 6:31 PM Andrew Geissler <geissonator at gmail.com> wrote:
>
> On Thu, May 16, 2019 at 1:36 AM Deepak Kodihalli
> <dkodihal at linux.vnet.ibm.com> wrote:
> >
> > On 15/05/19 6:09 PM, Jayanth Othayoth wrote:
> > > ## Problem Description
> > > Issue #457: Add support to debug unresponsive host.
> > >
> > > > Scope: High-level design direction to solve this problem.
> > >
> > > ## Background and References
> > > > There are situations at customer sites where OPAL/Linux becomes
> > > > unresponsive, causing a system hang, and there is no way to figure out
> > > > what went wrong with the Linux kernel or OPAL. We are looking for a way
> > > > to trigger a dump capture on the Linux host so that we can capture the
> > > > OS dump for post analysis.
> > >
> > > ## Proposed Design for POWER processor based systems:
> > > > Get all host CPUs into the reset vector; Linux then has a mechanism to
> > > > patch this into the panic-kdump path to trigger dump capture. This will
> > > > enable us to analyze and fix customer issues where Linux hangs and the
> > > > system is unresponsive.
> > >
> > > ### Redfish Schema used:
> > > > * Reference: DSP2046 2018.3
> > > > * ComputerSystem 1.6.0 schema provides an action called
> > > > “#ComputerSystem.Reset”. This action is used to reset the system.
> > > > The ResetType parameter indicates the type of reset to be
> > > > performed. In this use case we can use the “Nmi” type.
> > > > * Nmi: Generate a Diagnostic Interrupt (usually an NMI on x86
> > > > systems) to cease normal operations, perform diagnostic actions and
> > > > typically halt the system.
> > > > ### D-Bus:
> > >
> > > > Option 1: Extend the existing D-Bus interface in the State.Host
> > > > namespace
> > > > (/openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/State/Host.interface.yaml)
> > > > to support a new RequestedHostTransition value called “Nmi”. The D-Bus
> > > > backend can internally invoke a processor-specific target to do an
> > > > SRESET (equivalent to x86 NMI) and associated actions.
> >
> > I don't prefer this option, because this would mean adding host-specific
> > code in phosphor-state-manager, which I think until now is host-agnostic.
>
> Yeah, this was my main concern with tying it into phosphor-state-manager.
> The fact Redfish put it in with their other state-related commands (which
> are implemented by phosphor-state-manager) is the only reason I'm a little
> wishy-washy here. We could just create a generic systemd target "host-nmi"
> or something and phosphor-state-manager could just call that to abstract
> any of the specifics, but it still doesn't really feel like it fits to me.
>
> I think I prefer option 2, and then we can just map bmcweb to that API when
> the Redfish command comes in. Sounds like for ppc64 systems we can just
> use pdbg to issue the NMI.
>
> > So for that reason, Option 2 sounds better. There are some good
> > questions from Neeraj as well, so I would suggest adding this as a
> > design template on Gerrit to gather better feedback.
> >
> > Thanks,
> > Deepak
> >
> > > > Option 2: Introduce a new D-Bus interface in the Control.Host
> > > > namespace
> > > > (/openbmc/phosphor-dbus-interfaces/xyz/openbmc_project/Control/Host/NMI.interface.yaml)
> > > > and implement the new D-Bus back-end for the respective
> > > > processor-specific targets.
> >
>
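For reference, the Redfish side of the proposal quoted above can be sketched as a plain HTTP request. This is only an illustration of the "#ComputerSystem.Reset" action with ResetType "Nmi" as described in the quoted design. The helper name, the BMC host name, and the system id "system" are assumptions made for the example, not part of the design; authentication and actually sending the request are left out.

```python
import json
import urllib.request


def build_nmi_reset_request(bmc_host: str, system_id: str = "system") -> urllib.request.Request:
    """Build (but do not send) a Redfish ComputerSystem.Reset request.

    bmc_host and system_id are hypothetical values for illustration;
    a real client would also need credentials or a session token.
    """
    url = (
        f"https://{bmc_host}/redfish/v1/Systems/{system_id}"
        "/Actions/ComputerSystem.Reset"
    )
    # ResetType "Nmi" asks for a diagnostic interrupt rather than a power cycle.
    body = json.dumps({"ResetType": "Nmi"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = build_nmi_reset_request("bmc.example")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

With option 2, the web server would translate such a request into a call on the new NMI D-Bus interface rather than a host state transition.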