[RFC] BMC RAS Feature

Bills, Jason M jason.m.bills at linux.intel.com
Sat Jul 15 08:05:17 AEST 2023


Sorry for missing this earlier.  Here are some of my thoughts.

On 3/20/2023 11:14 PM, Supreeth Venkatesh wrote:
> 
> #### Requirements
> 
> 1. Collecting RAS/Crashdump shall be processor specific. Hence the use
>     of virtual APIs to allow override for processor specific way of
>     collecting the data.
> 2. Crash data format shall be stored in common platform error record
>     (CPER) format as per UEFI specification
>     [https://uefi.org/specs/UEFI/2.10/].

Do we have to define a single output format? Could the service instead
be flexible about the format of the collected crash data?
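
To make the question concrete, here is a minimal Python sketch of a
collector whose output format is chosen at run time rather than fixed.
Every name in it (register_format, encode_crash_data, the "raw" and
"json" formats, the record fields) is hypothetical and not part of any
existing OpenBMC service; a CPER encoder would just be one more
registered format.

```python
import json
from typing import Callable, Dict

# Hypothetical registry: format name -> encoder function.
_ENCODERS: Dict[str, Callable[[dict], bytes]] = {}


def register_format(name: str):
    """Decorator that registers an encoder under a format name."""
    def wrap(func: Callable[[dict], bytes]) -> Callable[[dict], bytes]:
        _ENCODERS[name] = func
        return func
    return wrap


@register_format("raw")
def encode_raw(record: dict) -> bytes:
    # Pass the collected bytes through unchanged.
    return record["payload"]


@register_format("json")
def encode_json(record: dict) -> bytes:
    # Hex-encode the payload so the record is valid JSON.
    doc = {"socket": record["socket"], "payload": record["payload"].hex()}
    return json.dumps(doc, sort_keys=True).encode("utf-8")


def encode_crash_data(record: dict, fmt: str) -> bytes:
    """Encode one collected record in the requested format."""
    try:
        encoder = _ENCODERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported crash data format: {fmt}") from None
    return encoder(record)
```

With this shape the format becomes a configuration parameter, and a
processor-specific extension only has to register its own encoder.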

> 3. Configuration parameters of the service shall be standard with scope
>     for processor specific extensions.
> 
> #### Proposed Design
> 
> When one or more processors register a fatal error condition, an
> interrupt is generated to the host processor.
> 
> The host processor in the failed state asserts the signal to indicate to 
> the BMC that a fatal hang has occurred. [APML_ALERT# in case of AMD 
> processor family]
> 
> BMC RAS application listens on the event [APML_ALERT# in case of AMD 
> processor family ].

The host-error-monitor application already supports listening for
events and taking actions such as logging or triggering a crashdump,
so it may meet this requirement.


One thought is to break this up into layers to allow for both
flexibility and standardization. For example:
1. Redfish -> provided by bmcweb, which pulls from
2. D-Bus -> provided by a new service, which looks for data stored by
3. processor-specific collector -> provided by separate services as
   needed and triggered by
4. platform-specific monitoring service -> provided by
   host-error-monitor or another service as needed.

Ideally, layer 2 could be a generic service.
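
As a rough illustration of how the four layers could hand off to each
other, here is a small in-process Python sketch. It is only a model of
the data flow: there is no real D-Bus, Redfish, or hardware access, and
all class and function names (CrashStore, Monitor, example_collector,
redfish_view) as well as the event name are made up for this example.

```python
from typing import Callable, Dict, List


class CrashStore:
    """Layer 2 stand-in: generic service that exposes whatever was stored."""

    def __init__(self) -> None:
        self._records: List[dict] = []

    def add(self, record: dict) -> None:
        self._records.append(record)

    def list_records(self) -> List[dict]:
        # A real service would publish these as D-Bus objects for bmcweb.
        return list(self._records)


class Monitor:
    """Layer 4 stand-in: platform-specific monitor that triggers collectors."""

    def __init__(self) -> None:
        self._collectors: Dict[str, Callable[[], dict]] = {}

    def register(self, event: str, collector: Callable[[], dict]) -> None:
        self._collectors[event] = collector

    def on_event(self, event: str, store: CrashStore) -> bool:
        collector = self._collectors.get(event)
        if collector is None:
            return False  # no collector is interested in this event
        store.add(collector())
        return True


def example_collector() -> dict:
    """Layer 3 stand-in: processor-specific collection, placeholder data."""
    return {"source": "example-cpu", "payload": b"\x00\x01"}


def redfish_view(store: CrashStore) -> dict:
    """Layer 1 stand-in: what a Redfish front end might derive from layer 2."""
    return {"Members@odata.count": len(store.list_records())}
```

The point of the split is that only the collector and the monitor know
anything about the processor or platform; the store and the Redfish
view never need to change when a new processor family is added.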

> 
> Upon detection of a FATAL error event, BMC will check the status register
> of the host processor [implementation defined method] to see
> if the assertion is due to the fatal error.
> 
> Upon fatal error, BMC will attempt to harvest crash data from all
> processors [via the APML interface (mailbox) in case of AMD processor
> family].
> 
> BMC will generate a single raw crashdump record and save it in the
> non-volatile location /var/lib/bmc-ras.
> 


