[RFC] BMC RAS Feature

Tue Jul 25 00:29:40 AEST 2023

[AMD Official Use Only - General]

Thanks for your feedback, Jason. Sorry for the delay in my response.

1. The format can be anything. [We could use phosphor-debug-collector, which already collects different kinds of debug dumps.]
2. I agree with this path:
        i. Redfish -> provided by bmcweb which pulls from
        ii. D-Bus -> provided by a new service which looks for data stored by
        iii. processor-specific collector -> provided by separate services as needed and triggered by
        iv. platform-specific monitoring service -> provided by host-error-monitor or other service as needed.
We need a repository for the processor-specific collector.
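
As a rough illustration of this path, here is a minimal Python sketch of the four layers. It is not OpenBMC code: every class and function name is hypothetical, the layers call each other directly instead of going over D-Bus, and the collector returns canned bytes rather than talking to hardware.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ProcessorCollector(ABC):
    """Layer iii: processor-specific collector (hypothetical interface)."""

    @abstractmethod
    def collect(self) -> bytes:
        """Return raw crash data for this processor family."""


class FakeCollector(ProcessorCollector):
    """Stand-in for a vendor collector; returns canned bytes only."""

    def collect(self) -> bytes:
        return b"raw-crash-data"


class CrashDataStore:
    """Layer ii: generic service holding collected data (D-Bus in a real system)."""

    def __init__(self) -> None:
        self._records: List[bytes] = []

    def add(self, data: bytes) -> int:
        self._records.append(data)
        return len(self._records) - 1  # list index doubles as the record ID

    def get(self, record_id: int) -> bytes:
        return self._records[record_id]

    def ids(self) -> List[int]:
        return list(range(len(self._records)))


class PlatformMonitor:
    """Layer iv: platform-specific monitor that triggers collection on an event."""

    def __init__(self, collector: ProcessorCollector, store: CrashDataStore) -> None:
        self._collector = collector
        self._store = store

    def on_fatal_event(self) -> int:
        # The monitor only triggers; the collector knows how to gather the data.
        return self._store.add(self._collector.collect())


def redfish_view(store: CrashDataStore) -> List[Dict[str, object]]:
    """Layer i: what a Redfish front end might derive from the store."""
    return [{"Id": str(i), "SizeBytes": len(store.get(i))} for i in store.ids()]
```

The point of the split is that only the collector would need to change per processor family; the store and the Redfish view stay generic.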

Thanks,
Supreeth Venkatesh
System Manageability Architect  |  AMD
Server Software

-----Original Message-----
From: openbmc <openbmc-bounces+supreeth.venkatesh=amd.com at lists.ozlabs.org> On Behalf Of Bills, Jason M
Sent: Friday, July 14, 2023 5:05 PM
To: openbmc at lists.ozlabs.org
Subject: Re: [RFC] BMC RAS Feature

Sorry for missing this earlier.  Here are some of my thoughts.

On 3/20/2023 11:14 PM, Supreeth Venkatesh wrote:
>
> #### Requirements
>
> 1. Collecting RAS/Crashdump shall be processor specific. Hence the use
>     of virtual APIs to allow override for processor specific way of
>     collecting the data.
> 2. Crash data shall be stored in Common Platform Error Record
>     (CPER) format as per the UEFI specification
>     [https://uefi.org/specs/UEFI/2.10/].

Do we have to define a single output format? Could the design be made flexible about the format of the collected crash data?
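
One way to keep the format flexible might be to tag each stored record with its format and let consumers look up a decoder by tag. The Python sketch below is purely illustrative: `CrashRecord`, `register_decoder`, `describe`, and the tag strings are hypothetical names, and the toy CPER decoder checks only the 4-byte ASCII "CPER" signature at the start of the record header, nothing else.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class CrashRecord:
    fmt: str        # format tag, e.g. "CPER" or "raw" (tag values are illustrative)
    payload: bytes  # crash data exactly as the collector produced it


# Registry mapping a format tag to a function that summarizes the payload.
_DECODERS: Dict[str, Callable[[bytes], str]] = {}


def register_decoder(fmt: str, decoder: Callable[[bytes], str]) -> None:
    _DECODERS[fmt] = decoder


def describe(record: CrashRecord) -> str:
    """Use a registered decoder if one exists; otherwise treat the data as opaque."""
    decoder = _DECODERS.get(record.fmt)
    if decoder is None:
        return f"{record.fmt}: {len(record.payload)} opaque bytes"
    return decoder(record.payload)


# Toy decoder: accepts anything that starts with the "CPER" signature bytes.
register_decoder(
    "CPER",
    lambda data: "CPER record" if data[:4] == b"CPER" else "malformed CPER record",
)
```

With this shape, a collector that emits a vendor-specific format still produces a valid record; consumers that lack a decoder for that tag simply pass the bytes through.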

> 3. Configuration parameters of the service shall be standard with scope
>     for processor specific extensions.
>
> #### Proposed Design
>
> When one or more processors register a fatal error condition, an
> interrupt is generated to the host processor.
>
> The host processor in the failed state asserts the signal to indicate
> to the BMC that a fatal hang has occurred. [APML_ALERT# in case of AMD
> processor family]
>
> The BMC RAS application listens for this event [APML_ALERT# in case of
> AMD processor family].

The host-error-monitor application already supports listening for events and taking actions such as logging or triggering a crashdump, so it may meet this requirement.

One thought may be to break this up into various layers to allow for flexibility and standardization. For example:
1. Redfish -> provided by bmcweb which pulls from
2. D-Bus -> provided by a new service which looks for data stored by
3. processor-specific collector -> provided by separate services as needed and triggered by
4. platform-specific monitoring service -> provided by host-error-monitor or other service as needed.

Ideally, we could make layer 2 (the D-Bus service) a generic service.

>
> Upon detection of a fatal error event, the BMC will check the status
> register of the host processor [implementation defined method] to see
> if the assertion is due to the fatal error.
>
> Upon a fatal error, the BMC will attempt to harvest crash data from all
> processors [via the APML interface (mailbox) in case of AMD processor
> family].
>
> The BMC will generate a single raw crashdump record and save it in
> the non-volatile location /var/lib/bmc-ras.
>
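
The sequence quoted above (check the status, harvest from every processor, write one record) could be sketched roughly as follows. This is a minimal Python illustration under stated assumptions: the `Processor` interface, the `handle_alert` function, and the `crashdump.bin` file name are hypothetical, and combining the per-processor data by simple concatenation is only one guess at what "a single raw crashdump record" means. Only the `/var/lib/bmc-ras` directory comes from the proposal.

```python
from pathlib import Path
from typing import Iterable, Optional, Protocol


class Processor(Protocol):
    """Hypothetical per-processor interface; real access is implementation defined."""

    def fatal_error_asserted(self) -> bool: ...

    def harvest(self) -> bytes: ...


def handle_alert(
    processors: Iterable[Processor],
    out_dir: Path = Path("/var/lib/bmc-ras"),
) -> Optional[Path]:
    procs = list(processors)

    # Step 1: confirm the alert is really due to a fatal error.
    if not any(p.fatal_error_asserted() for p in procs):
        return None

    # Step 2: attempt to harvest from all processors; a failed read is skipped
    # so that one unresponsive processor does not lose the others' data.
    chunks = []
    for p in procs:
        try:
            chunks.append(p.harvest())
        except OSError:
            continue

    # Step 3: write one record to non-volatile storage.
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "crashdump.bin"  # hypothetical file name
    path.write_bytes(b"".join(chunks))
    return path
```

Passing `out_dir` as a parameter keeps the sketch testable against a temporary directory instead of the real location.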