[RFC] BMC RAS Feature

dhruvaraj S dhruvaraj at gmail.com
Sat Jul 15 19:01:07 AEST 2023


Please find below a few comments on using phosphor-debug-collector for this.

Phosphor-debug-collector employs a set of scripts for BMC dump
collection, which can be customised per processor architecture.
Architecture-specific dump collections are added as dump extensions
and activated only on systems that support them, identified by their
corresponding feature code.
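
For illustration, an architecture-specific collector can be a small
standalone executable that the dump scripts invoke. Below is a minimal
Python sketch of such a plugin; the argument convention and source path
are hypothetical, not the actual dreport plugin interface.

    #!/usr/bin/env python3
    # Hypothetical architecture-specific dump plugin: stages one
    # processor debug artifact into the dump directory given as argv[1].
    import shutil
    import sys
    from pathlib import Path

    SOURCE = Path("/var/lib/example-proc/debug.bin")  # hypothetical path

    def main() -> int:
        if len(sys.argv) != 2:
            print("usage: collect_proc_data.py <dump-dir>", file=sys.stderr)
            return 1
        dump_dir = Path(sys.argv[1])
        dump_dir.mkdir(parents=True, exist_ok=True)
        if SOURCE.exists():
            shutil.copy2(SOURCE, dump_dir / SOURCE.name)
        return 0

    if __name__ == "__main__":
        sys.exit(main())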

Data Format: The data is packaged as a basic tarball or a custom
package according to host specifications.
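
For the plain-tarball case the packaging step is straightforward; a
minimal sketch, with paths chosen only for illustration:

    # Minimal sketch: package the collected dump files into a gzipped
    # tarball. Both paths below are illustrative.
    import tarfile
    from pathlib import Path

    def package_dump(staging_dir: str, out_path: str) -> None:
        """Archive everything under staging_dir into out_path (.tar.gz)."""
        with tarfile.open(out_path, "w:gz") as tar:
            for entry in Path(staging_dir).iterdir():
                tar.add(entry, arcname=entry.name)

    # e.g. package_dump("/tmp/dump_staging", "/tmp/obmcdump.tar.gz")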

Event Triggering: The phosphor-debug-collector responds to specific
events to initiate dump creation. A core monitor watches a designated
directory and generates a BMC dump containing the core file when a new
core file appears. On IBM systems, an attention handler awaits
notifications from the processors or the host and triggers dump
creation via phosphor-debug-collector.
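
A simplified picture of that core-monitor flow: watch a directory and
ask the dump manager for a BMC dump when a new core file appears. The
sketch below polls for brevity (the in-tree monitor is event driven)
and assumes the CreateDump(a{sv}) variant of the dump-manager
interface.

    # Simplified core-monitor sketch: poll a watch directory and request
    # a BMC dump from the dump manager when a new core file appears.
    import subprocess
    import time
    from pathlib import Path

    WATCH_DIR = Path("/var/lib/systemd/coredump")  # typical core location

    def request_bmc_dump() -> None:
        # Ask the dump manager to create a BMC dump (no extra parameters).
        subprocess.run(
            ["busctl", "call",
             "xyz.openbmc_project.Dump.Manager",
             "/xyz/openbmc_project/dump/bmc",
             "xyz.openbmc_project.Dump.Create",
             "CreateDump", "a{sv}", "0"],
            check=True,
        )

    seen = set()
    while True:
        for core in WATCH_DIR.glob("core.*"):
            if core.name not in seen:
                seen.add(core.name)
                request_bmc_dump()
        time.sleep(5)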

Layered Design: The phosphor-debug-collector operates as a
processor-specific collector within the proposed layered architecture,
initiated by a platform-specific monitoring service like the
host-error-monitor. The created dumps are exposed via D-Bus, which can
then be served by bmcweb via Redfish.
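
On the Redfish side, bmcweb surfaces those D-Bus dump entries through
the manager's dump log service. A minimal sketch of triggering and
listing BMC dumps over Redfish; the BMC address and credentials are
placeholders:

    # Minimal sketch: trigger and list BMC dumps via bmcweb's Redfish
    # dump log service. Address and credentials are placeholders.
    import requests

    BMC = "https://bmc.example.com"  # placeholder
    AUTH = ("admin", "password")     # placeholder

    # Trigger collection of a new BMC ("Manager") dump.
    resp = requests.post(
        f"{BMC}/redfish/v1/Managers/bmc/LogServices/Dump"
        "/Actions/LogService.CollectDiagnosticData",
        json={"DiagnosticDataType": "Manager"},
        auth=AUTH, verify=False,
    )
    resp.raise_for_status()

    # List the dump entries that bmcweb serves from D-Bus.
    entries = requests.get(
        f"{BMC}/redfish/v1/Managers/bmc/LogServices/Dump/Entries",
        auth=AUTH, verify=False,
    ).json()
    for member in entries.get("Members", []):
        print(member["@odata.id"])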

Phosphor-debug-collector can be extended to accommodate
processor-specific parameters; this is achieved by adjusting the dump
collection scripts to match the requirements of the particular
processor.
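
One concrete extension point is the key/value map that CreateDump
accepts: a monitoring service can pass processor-specific parameters
through it to the collection scripts. The key names in this sketch are
hypothetical.

    # Sketch: pass processor-specific parameters through CreateDump's
    # a{sv} argument. The key names are hypothetical examples.
    import subprocess

    params = {
        "DumpType": "HostProcessor",  # hypothetical key/value
        "ErrorLogId": "0x90000012",   # hypothetical key/value
    }

    args = ["busctl", "call",
            "xyz.openbmc_project.Dump.Manager",
            "/xyz/openbmc_project/dump/bmc",
            "xyz.openbmc_project.Dump.Create",
            "CreateDump", "a{sv}", str(len(params))]
    for key, value in params.items():
        args += [key, "s", value]  # each value as a string variant

    subprocess.run(args, check=True)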

The phosphor-debug-collector interacts with specific applications
during the dump collection process. For example, on IBM systems, it
invokes an IBM-specific application via the dump collection script to
retrieve the dump from the host processor.
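
That hand-off is essentially a child-process invocation from the
collection script; a minimal sketch with a hypothetical tool name and
flags:

    # Sketch: invoke a vendor-specific tool to retrieve the dump from
    # the host processor and stage its output. The tool name and flags
    # are hypothetical.
    import subprocess

    def collect_host_dump(out_dir: str) -> None:
        result = subprocess.run(
            ["vendor-host-dump-tool", "--output", out_dir],  # hypothetical
            capture_output=True, text=True, timeout=600,
        )
        if result.returncode != 0:
            raise RuntimeError("host dump failed: " + result.stderr)

    collect_host_dump("/tmp/hostdump")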

On Sat, 15 Jul 2023 at 03:37, Bills, Jason M
<jason.m.bills at linux.intel.com> wrote:
>
> Sorry for missing this earlier.  Here are some of my thoughts.
>
> On 3/20/2023 11:14 PM, Supreeth Venkatesh wrote:
> >
> > #### Requirements
> >
> > 1. Collecting RAS/crashdump data shall be processor specific. Hence the
> >     use of virtual APIs to allow overrides for the processor-specific
> >     way of collecting the data.
> > 2. Crash data shall be stored in the Common Platform Error Record
> >     (CPER) format as per the UEFI specification
> >     [https://uefi.org/specs/UEFI/2.10/].
>
> Do we have to define a single output format? Could it be made to be
> flexible with the format of the collected crash data?
>
> > 3. Configuration parameters of the service shall be standard with scope
> >     for processor-specific extensions.
> >
> > #### Proposed Design
> >
> > When one or more processors register a fatal error condition, an
> > interrupt is generated to the host processor.
> >
> > The host processor in the failed state asserts the signal to indicate to
> > the BMC that a fatal hang has occurred. [APML_ALERT# in case of AMD
> > processor family]
> >
> > The BMC RAS application listens for the event [APML_ALERT# in case of
> > the AMD processor family].
>
> The host-error-monitor application provides support for listening for
> events and taking actions such as logging or triggering a crashdump,
> which may meet this requirement.
>
>
> One thought may be to break this up into various layers to allow for
> flexibility and standardization. For example:
> 1. Redfish -> provided by bmcweb which pulls from
> 2. D-Bus -> provided by a new service which looks for data stored by
> 3. processor-specific collector -> provided by separate services as
> needed and triggered by
> 4. platform-specific monitoring service -> provided by
> host-error-monitor or other service as needed.
>
> Ideally, we could make 2 a generic service.
>
> >
> > Upon detection of a fatal error event, the BMC will check the status
> > register of the host processor [implementation-defined method] to see
> > if the assertion is due to the fatal error.
> >
> > Upon a fatal error, the BMC will attempt to harvest crash data from all
> > processors [via the APML interface (mailbox) in case of the AMD
> > processor family].
> >
> > The BMC will generate a single raw crashdump record and save it in a
> > non-volatile location, /var/lib/bmc-ras.
> >
>


-- 
--------------
Dhruvaraj S

