[RFC] BMC RAS Feature

Tue Mar 21 16:14:45 AEDT 2023

Thanks in advance for your input and feedback.

## Purpose

Gather feedback on the BMC RAS design so that it can be used in a processor-agnostic manner, find collaborators for refining the design and implementation, and request an OpenBMC repository [preferably bmc-ras, oob-ras, bmc-crashdump, or oob-crashdump] with Supreeth Venkatesh [Supreeth.Venkatesh at amd.com] and Abinaya Dhandapani [Abinaya.Dhandapani at amd.com] as the initial maintainers.

### BMC RAS, Crash dump

Author: Supreeth Venkatesh, Abinaya Dhandapani

Primary assignee: Supreeth Venkatesh, Abinaya Dhandapani

Other contributors:

Created: 03/20/2023

#### Problem Description

Collecting crash data at runtime in a processor-agnostic manner presents both a challenge and an opportunity to standardize.

#### Background and References

Host processors allow an external management controller (i.e., the BMC) to harvest CPU crash data over a vendor-specific interface during fatal errors [the APML interface in the case of the AMD processor family].

This feature allows more efficient real-time diagnosis of hardware failures without waiting for Boot Error Record Table (BERT) logs in the next boot cycle.

The crash data collected may be used to triage, debug, or attempt to 
reproduce the system conditions that led to the failure.

#### Requirements

 1. Collection of RAS/crashdump data is processor specific. The design
    shall therefore use virtual APIs so that each processor family can
    override how the data is collected.
 2. Crash data shall be stored in Common Platform Error Record (CPER)
    format as defined in the UEFI specification
    [https://uefi.org/specs/UEFI/2.10/].
 3. Configuration parameters of the service shall be standard, with scope
    for processor-specific extensions.
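
To make the first requirement more concrete, here is a minimal Python sketch of what an overridable collection interface could look like. It is not the proposed implementation: `CrashDataCollector`, `FakeApmlCollector`, and `collect` are hypothetical names invented for illustration, and the fake backend only stands in for a real vendor-specific one (for example, one that talks to the APML mailbox).

```python
from abc import ABC, abstractmethod
from typing import Optional


class CrashDataCollector(ABC):
    """Hypothetical processor-agnostic interface (illustrative names only)."""

    @abstractmethod
    def is_fatal_error(self) -> bool:
        """Check the processor status in an implementation-defined way."""

    @abstractmethod
    def harvest(self) -> bytes:
        """Return raw crash data gathered from the processors."""


class FakeApmlCollector(CrashDataCollector):
    """Stand-in vendor backend; a real one would use the vendor interface."""

    def __init__(self, fatal: bool, payload: bytes) -> None:
        self._fatal = fatal
        self._payload = payload

    def is_fatal_error(self) -> bool:
        return self._fatal

    def harvest(self) -> bytes:
        return self._payload


def collect(collector: CrashDataCollector) -> Optional[bytes]:
    """Generic service code that depends only on the abstract interface."""
    if not collector.is_fatal_error():
        return None
    return collector.harvest()
```

The point of the sketch is that the generic `collect` function never needs to know which processor family it is dealing with; supporting another vendor would mean adding another subclass.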

#### Proposed Design

When one or more processors register a fatal error condition, an interrupt is generated to the host processor.

The host processor in the failed state asserts a signal to indicate to the BMC that a fatal hang has occurred [APML_ALERT# in the case of the AMD processor family].

The BMC RAS application listens for this event [APML_ALERT# in the case of the AMD processor family].

Upon detecting the event, the BMC checks the status register of the host processor [implementation-defined method] to see whether the assertion is due to a fatal error.

If it is, the BMC attempts to harvest crash data from all processors [via the APML interface (mailbox) in the case of the AMD processor family].

The BMC generates a single raw crashdump record and saves it in the non-volatile location /var/lib/bmc-ras.

As per the BMC policy configuration, the BMC then initiates a system reset to recover the system from the fatal error condition.
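
The sequence above can be sketched roughly as follows. This is only an illustration under stated assumptions: `handle_alert`, `Recovery`, and the callback parameters are hypothetical names, the hardware access is injected as plain callables so the ordering of the steps can be shown without any real APML calls, and CPER encoding is left out.

```python
from enum import Enum
from typing import Callable


class Recovery(Enum):
    """Recovery policies named in this proposal (enum itself is hypothetical)."""
    NO_RESET = "no reset"
    WARM_RESET = "warm reset"
    COLD_RESET = "cold reset"


def handle_alert(
    is_fatal: Callable[[], bool],
    harvest_all: Callable[[], bytes],
    save_record: Callable[[bytes], None],
    reset_system: Callable[[Recovery], None],
    policy: Recovery,
) -> bool:
    """Run the alert-handling sequence once; return True if a record was saved."""
    # Step 1: confirm that the alert was caused by a fatal error.
    if not is_fatal():
        return False
    # Step 2: harvest crash data from all processors.
    raw = harvest_all()
    # Step 3: persist a single record (CPER encoding is out of scope here).
    save_record(raw)
    # Step 4: apply the configured recovery policy, if any.
    if policy is not Recovery.NO_RESET:
        reset_system(policy)
    return True
```

In this sketch the record is saved before any reset is requested, matching the order of the steps described above.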

The generated crashdump record will be in Common Platform Error Record 
(CPER) format as defined in the UEFI specification 
[https://uefi.org/specs/UEFI/2.10/].

The application keeps a configurable number of records, with the default set to 10. If the number of records exceeds this limit, the records are rotated.

Crashdump records saved in /var/lib/bmc-ras can be retrieved via the Redfish interface.

[Figure: format of the RAS/crash dump record]

[Figure: sample CPER file on fatal error]

_Configuring RAS config file_

A configuration file is created in /var/lib/bmc-ras, which allows the user to configure the values below.

The configuration fields below are AMD specific. However, they can be _standardized_ based on feedback.

“APML Retries” – Retry count for APML mailbox commands

“Harvest PPIN” – If enabled, harvest the PPIN and dump it into the CPER file

“Harvest Microcode” – If enabled, harvest the microcode version and dump it into the CPER file

“System Recovery” – Warm reset, cold reset, or no reset, as per the user’s requirement
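
As an illustration of how such a configuration might be loaded and validated, here is a small Python sketch. It makes several assumptions that are not part of the proposal: the file is treated as JSON, the default values are placeholders, the key `harvestMicrocode` and the string values for the recovery policy are hypothetical, and only the keys `apmlRetries`, `harvestPpin`, and `systemRecovery` are borrowed from the sample Redfish output in this document (which shows `systemRecovery` as an integer whose encoding is not specified).

```python
import json
from dataclasses import dataclass

# Hypothetical encoding of the three recovery options.
ALLOWED_RECOVERY = {"warm", "cold", "none"}


@dataclass(frozen=True)
class RasConfig:
    """Illustrative config model; defaults are placeholders, not specified values."""
    apml_retries: int = 10
    harvest_ppin: bool = True
    harvest_microcode: bool = True
    system_recovery: str = "warm"


def parse_config(text: str) -> RasConfig:
    """Parse JSON text into a RasConfig, falling back to defaults for missing keys."""
    raw = json.loads(text)
    defaults = RasConfig()
    retries = raw.get("apmlRetries", defaults.apml_retries)
    ppin = raw.get("harvestPpin", defaults.harvest_ppin)
    ucode = raw.get("harvestMicrocode", defaults.harvest_microcode)
    recovery = raw.get("systemRecovery", defaults.system_recovery)

    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(retries, bool) or not isinstance(retries, int) or retries < 0:
        raise ValueError("apmlRetries must be a non-negative integer")
    if not isinstance(ppin, bool) or not isinstance(ucode, bool):
        raise ValueError("harvest flags must be booleans")
    if recovery not in ALLOWED_RECOVERY:
        raise ValueError("systemRecovery must be one of: warm, cold, none")
    return RasConfig(retries, ppin, ucode, recovery)
```

Validating at load time means a malformed value is rejected when the configuration is read rather than surfacing later during a fatal error, when the service is least able to recover from it.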

The configuration file values can be viewed and changed via Redfish GET/SET commands.

The Redfish URI to configure the BMC config file: https://<BMC-IP>/redfish/v1/Systems/system/LogServices/Crashdump/Actions/Oem/Crashdump.Configuration

Sample redfish output:

curl -s -k -u root:0penBmc -H "Content-type: application/json" -X GET https://onyx-63dd/redfish/v1/Systems/system/LogServices/Crashdump/Actions/Oem/Crashdump.Configuration

{
   "@odata.id": "/redfish/v1/Systems/system/LogServices/Crashdump/Actions/Oem/Crashdump.Configuration",
   "@odata.type": "#LogService.v1_2_0.LogService",
   "apmlRetries": 10,
   "harvestPpin": true,
   "systemRecovery": 2,
   "uCodeVersion": true
}

_D-Bus interface for crashdump service_

The BMC RAS application is started by the systemd service com.amd.crashdump [this will be changed to a generic service name based on feedback/interest from the community].

A D-Bus interface is maintained which exposes the config file info and the CPER files currently in the system, which can be downloaded via the Redfish interface.

The service name and object path need to be renamed from OEM-specific names to generic ones so that all contributors can use the same service name and object path to pull the crash data.

#### Alternatives Considered

In-band mechanisms using System Management Mode (SMM) exist.

However, the out-of-band method to gather RAS data is processor specific.

#### Impacts

Since the crash dump data follows the Common Platform Error Record (CPER) format defined in the UEFI specification [https://uefi.org/specs/UEFI/2.10/], there is no security impact.

This implementation reduces the host processor workload by offloading the data collection process to the BMC, thereby improving overall system performance.

#### Testing

The design has been tested on AMD Genoa platforms, namely Onyx, Quartz, Ruby, and Titanite.

Further testing support is appreciated.

Thanks,

Abinaya & Supreeth