[RFC] BMC RAS Feature
Bills, Jason M
jason.m.bills at linux.intel.com
Sat Jul 15 08:05:17 AEST 2023
Sorry for missing this earlier. Here are some of my thoughts.
On 3/20/2023 11:14 PM, Supreeth Venkatesh wrote:
>
> #### Requirements
>
> 1. Collecting RAS/Crashdump shall be processor specific. Hence the use
> of virtual APIs to allow override for processor specific way of
> collecting the data.
> 2. Crash data shall be stored in the Common Platform Error Record
> (CPER) format as per the UEFI specification
> [https://uefi.org/specs/UEFI/2.10/].
Do we have to define a single output format? Could it be made to be
flexible with the format of the collected crash data?
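To make the question concrete, here is a rough sketch of what I mean by a
flexible format: each processor-specific collector overrides one method and
tags its output with a format name, so the common service never needs to
understand the payload. All class and field names below (CrashRecord,
CrashCollector, data_format, and so on) are made up for illustration and are
not taken from any existing OpenBMC code.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class CrashRecord:
    """Collected crash data plus a tag naming its format (hypothetical type)."""
    data_format: str  # e.g. "CPER" or a vendor-specific name
    payload: bytes


class CrashCollector(ABC):
    """Base class; processor-specific collectors override collect()."""

    @abstractmethod
    def collect(self) -> CrashRecord:
        ...


class CperCollector(CrashCollector):
    def collect(self) -> CrashRecord:
        # Placeholder bytes; a real collector would build a CPER record here.
        return CrashRecord("CPER", b"placeholder-cper-bytes")


class VendorJsonCollector(CrashCollector):
    def collect(self) -> CrashRecord:
        # Placeholder for a vendor-defined format.
        return CrashRecord("vendor-json", b'{"placeholder": true}')


def store(collector: CrashCollector) -> CrashRecord:
    # The common service only looks at the format tag, not the payload layout.
    return collector.collect()
```

With something like this, CPER becomes one supported format rather than the
only one.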
> 3. Configuration parameters of the service shall be standard with scope
> for processor specific extensions.
>
> #### Proposed Design
>
> When one or more processors register a fatal error condition, an
> interrupt is generated to the host processor.
>
> The host processor in the failed state asserts the signal to indicate to
> the BMC that a fatal hang has occurred. [APML_ALERT# in case of AMD
> processor family]
>
> The BMC RAS application listens for the event [APML_ALERT# in case of
> AMD processor family].
The host-error-monitor application already supports listening for events
and taking action in response, such as logging or triggering a crashdump.
That may meet this requirement.
One thought may be to break this up into various layers to allow for
flexibility and standardization. For example:
1. Redfish -> provided by bmcweb which pulls from
2. D-Bus -> provided by a new service which looks for data stored by
3. processor-specific collector -> provided by separate services as
needed and triggered by
4. platform-specific monitoring service -> provided by
host-error-monitor or other service as needed.
Ideally, we could make 2 a generic service.
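As a very rough illustration of how the four layers could hand off to each
other, here is a small sketch. It uses plain in-memory objects in place of
D-Bus and Redfish, and every name in it (CrashStore, Monitor, redfish_view,
the "Id" and "Source" keys) is hypothetical rather than an existing
interface or schema.

```python
from typing import Callable, Dict, List


class CrashStore:
    """Layer 2 stand-in: what a generic D-Bus service would expose."""

    def __init__(self) -> None:
        self._entries: List[Dict[str, str]] = []

    def add(self, source: str, data: str) -> None:
        self._entries.append({"source": source, "data": data})

    def entries(self) -> List[Dict[str, str]]:
        return list(self._entries)


def make_collector(name: str, store: CrashStore) -> Callable[[], None]:
    """Layer 3: a processor-specific collector that stores what it gathers."""

    def collect() -> None:
        store.add(name, "placeholder crash data")

    return collect


class Monitor:
    """Layer 4: platform-specific monitoring that triggers the collectors."""

    def __init__(self) -> None:
        self._collectors: List[Callable[[], None]] = []

    def register(self, collector: Callable[[], None]) -> None:
        self._collectors.append(collector)

    def on_fatal_event(self) -> None:
        for collector in self._collectors:
            collector()


def redfish_view(store: CrashStore) -> List[Dict[str, str]]:
    """Layer 1: what bmcweb might present, pulled from layer 2."""
    return [
        {"Id": str(index), "Source": entry["source"]}
        for index, entry in enumerate(store.entries(), start=1)
    ]
```

Only the collector and the monitor know anything about the processor or the
platform; the store and the view stay the same for everyone.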
>
> Upon detection of a FATAL error event, BMC will check the status register
> of the host processor [implementation defined method] to see
> if the assertion is due to the fatal error.
>
> Upon a fatal error, BMC will attempt to harvest crash data from all
> processors [via the APML interface (mailbox) in case of AMD processor
> family].
>
> BMC will generate a single raw crashdump record and save it in the
> non-volatile location /var/lib/bmc-ras.
>
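For reference, my reading of the flow quoted above is roughly the following.
This is only a sketch under my own assumptions: the status check and the
per-processor harvest functions are passed in as callables because the real
mechanisms are implementation defined, and the function name handle_alert
and the file name crashdump.bin are invented for the example.

```python
from pathlib import Path
from typing import Callable, List, Optional


def handle_alert(
    is_fatal: Callable[[], bool],
    harvesters: List[Callable[[], bytes]],
    out_dir: Path,
) -> Optional[Path]:
    """Check the alert, harvest from every processor, write one raw record."""
    # Step 1: confirm the assertion really is due to a fatal error.
    if not is_fatal():
        return None

    # Step 2: attempt to harvest from all processors.
    chunks: List[bytes] = []
    for harvest in harvesters:
        try:
            chunks.append(harvest())
        except OSError:
            # A processor that cannot be read should not block the others.
            chunks.append(b"")

    # Step 3: save a single raw record in the given directory.
    out_dir.mkdir(parents=True, exist_ok=True)
    record = out_dir / "crashdump.bin"  # hypothetical file name
    record.write_bytes(b"".join(chunks))
    return record
```

In the proposal, out_dir would be /var/lib/bmc-ras. Keeping it a parameter
is just so the sketch can be exercised against a temporary directory.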