[RFC] BMC RAS Feature
Supreeth Venkatesh
supreeth.venkatesh at amd.com
Tue Mar 21 16:14:45 AEDT 2023
Thanks in advance for your inputs/feedback.
## Purpose
Gather feedback on the BMC RAS design so that it can be used in a
processor-agnostic manner, find collaborators for refining the
design/implementation,
and request an OpenBMC repository [preferably bmc-ras, oob-ras,
bmc-crashdump, or oob-crashdump] with the initial maintainers
Supreeth Venkatesh [Supreeth.Venkatesh at amd.com] and Abinaya Dhandapani
[Abinaya.Dhandapani at amd.com].
### BMC RAS, Crash dump
Author:
< Supreeth Venkatesh>
< Abinaya Dhandapani>
Primary assignee:
< Supreeth Venkatesh>
< Abinaya Dhandapani>
Other contributors:
Created:
<03/20/2023>
#### Problem Description
Collecting crash data at runtime in a processor-agnostic manner
presents both a challenge and an opportunity to standardize.
#### Background and References
Host processors allow an external management controller (i.e., the BMC)
to harvest CPU crash data over a vendor-specific interface during
fatal errors [the APML interface in the case of the AMD processor family].
This feature allows more efficient real-time diagnosis of hardware
failures without waiting for Boot Error Record Table (BERT) logs in the
next boot cycle.
The crash data collected may be used to triage, debug, or attempt to
reproduce the system conditions that led to the failure.
#### Requirements
1. Collection of RAS/crashdump data is processor specific. The design
shall therefore use virtual APIs that can be overridden with a
processor-specific way of collecting the data.
2. Crash data shall be stored in Common Platform Error Record
(CPER) format as per the UEFI specification
[https://uefi.org/specs/UEFI/2.10/].
3. Configuration parameters of the service shall be standard, with scope
for processor-specific extensions.
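
As a rough illustration of requirement 1, the sketch below shows one way an overridable collection API could look. It is not the proposed implementation: the class and method names (`CrashdumpCollector`, `is_fatal_error`, `harvest`, `FakeCollector`) are hypothetical and chosen only for this example.

```python
from abc import ABC, abstractmethod
from typing import Optional


class CrashdumpCollector(ABC):
    """Hypothetical processor-agnostic interface (names are illustrative)."""

    @abstractmethod
    def is_fatal_error(self) -> bool:
        """Processor-specific check of the host status register."""

    @abstractmethod
    def harvest(self) -> bytes:
        """Processor-specific collection of raw crash data."""

    def collect(self) -> Optional[bytes]:
        """Common flow: harvest only if the alert is due to a fatal error."""
        if not self.is_fatal_error():
            return None
        return self.harvest()


class FakeCollector(CrashdumpCollector):
    """Stand-in backend for illustration; a real one would talk to hardware."""

    def __init__(self, fatal: bool, payload: bytes) -> None:
        self._fatal = fatal
        self._payload = payload

    def is_fatal_error(self) -> bool:
        return self._fatal

    def harvest(self) -> bytes:
        return self._payload
```

A vendor backend would subclass `CrashdumpCollector` and override only the two abstract methods, leaving the common flow in `collect` shared.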
#### Proposed Design
When one or more processors register a fatal error condition, an
interrupt is generated to the host processor.
The host processor in the failed state asserts a signal to indicate to
the BMC that a fatal hang has occurred [APML_ALERT# in the case of the
AMD processor family].
The BMC RAS application listens for this event [APML_ALERT# in the case
of the AMD processor family].
Upon detection of the fatal error event, the BMC checks the status
register of the host processor [implementation-defined method] to see
whether the assertion is due to a fatal error.
If it is, the BMC attempts to harvest crash data from all
processors [via the APML interface (mailbox) in the case of the AMD
processor family].
The BMC generates a single raw crashdump record and saves it in the
non-volatile location /var/lib/bmc-ras.
As per the BMC policy configuration, the BMC then initiates a system
reset to recover the system from the fatal error condition.
The generated crashdump record is in Common Platform Error Record
(CPER) format as defined in the UEFI specification
[https://uefi.org/specs/UEFI/2.10/].
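
To make the CPER reference concrete, here is a minimal sketch that decodes only the leading fields of a CPER record header (signature, revision, section count, severity, validation bits, record length), based on the record header layout in the UEFI specification's CPER appendix. The function name `parse_cper_prefix` is hypothetical, and the sketch does not describe how the proposed application actually reads or writes records.

```python
import struct

# Leading fields of the CPER record header, little-endian and unpadded:
# SignatureStart(4s) Revision(H) SignatureEnd(I) SectionCount(H)
# ErrorSeverity(I) ValidationBits(I) RecordLength(I)
_HEADER_PREFIX = struct.Struct("<4sHIHIII")

# Error severity encodings from the CPER record header definition.
SEVERITY = {0: "recoverable", 1: "fatal", 2: "corrected", 3: "informational"}


def parse_cper_prefix(data: bytes) -> dict:
    """Decode the first 24 bytes of a CPER record (illustrative helper)."""
    if len(data) < _HEADER_PREFIX.size:
        raise ValueError("record too short for a CPER header")
    sig, rev, sig_end, sections, sev, valid, length = _HEADER_PREFIX.unpack_from(data)
    if sig != b"CPER" or sig_end != 0xFFFFFFFF:
        raise ValueError("not a CPER record")
    return {
        "revision": rev,
        "section_count": sections,
        "severity": SEVERITY.get(sev, "unknown"),
        "validation_bits": valid,
        "record_length": length,
    }
```

A check like this could be used to sanity-test a file from /var/lib/bmc-ras before handing it to a full CPER decoder.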
The application keeps a configurable number of records, with the default
set to 10. If the number of records exceeds this limit, the records are
rotated.
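
The rotation policy above could be sketched as follows. This is only an assumption about how rotation might work (oldest files removed first, by modification time); the `.cper` file extension and the function name `rotate_records` are hypothetical and not taken from the proposal.

```python
from pathlib import Path
from typing import List


def rotate_records(directory: str, max_records: int = 10) -> List[str]:
    """Delete the oldest *.cper files so at most max_records remain.

    Returns the names of the removed files, oldest first.
    """
    files = sorted(
        Path(directory).glob("*.cper"),
        key=lambda p: (p.stat().st_mtime, p.name),  # oldest first, name as tiebreak
    )
    excess = files[: max(0, len(files) - max_records)]
    for path in excess:
        path.unlink()
    return [path.name for path in excess]
```

Calling `rotate_records("/var/lib/bmc-ras")` after each new record is written would keep the directory at the default limit of 10.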
Crashdump records saved in /var/lib/bmc-ras can be retrieved
via the Redfish interface.
Format of RAS/Crash dump record below
Sample CPER file on fatal error:
_Configuring RAS config file_
A configuration file is created in /var/lib/bmc-ras, which allows the
user to configure the values below.
The configuration fields below are AMD specific. However, they can be
_standardized_ based on the feedback.
“APML Retries” – Retry count for the APML mailbox command
“Harvest PPIN” – If enabled, harvest the PPIN and dump it into the CPER file
“Harvest Microcode” – If enabled, harvest the microcode and dump it into the
CPER file
“System Recovery” – Warm reset, cold reset, or no reset, as per the user’s
requirement.
The configuration file values can be viewed and changed via Redfish
GET/SET commands.
The Redfish URI to configure the BMC config file:
https://<BMC-IP>/redfish/v1/Systems/system/LogServices/Crashdump/Actions/Oem/Crashdump.Configuration
Sample redfish output:
curl -s -k -u root:0penBmc -H "Content-type: application/json" -X GET \
  https://onyx-63dd/redfish/v1/Systems/system/LogServices/Crashdump/Actions/Oem/Crashdump.Configuration
{
  "@odata.id": "/redfish/v1/Systems/system/LogServices/Crashdump/Actions/Oem/Crashdump.Configuration",
  "@odata.type": "#LogService.v1_2_0.LogService",
  "apmlRetries": 10,
  "harvestPpin": true,
  "systemRecovery": 2,
  "uCodeVersion": true
}
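
For client-side tooling, the fields in the sample output above could be modeled roughly as below. This is a sketch only: the `RasConfig` class and `parse_config` function are hypothetical names, and `systemRecovery` is kept as a plain integer because the numeric encoding of warm/cold/no reset is not stated here.

```python
import json
from dataclasses import dataclass


@dataclass
class RasConfig:
    """Illustrative model of the configuration fields in the sample output."""
    apml_retries: int
    harvest_ppin: bool
    system_recovery: int  # encoding of warm/cold/no reset is not specified here
    ucode_version: bool


def parse_config(body: str) -> RasConfig:
    """Parse the JSON body returned by the configuration URI."""
    doc = json.loads(body)
    return RasConfig(
        apml_retries=int(doc["apmlRetries"]),
        harvest_ppin=bool(doc["harvestPpin"]),
        system_recovery=int(doc["systemRecovery"]),
        ucode_version=bool(doc["uCodeVersion"]),
    )
```

A missing key raises `KeyError`, which makes schema drift between the service and a client visible early.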
_D-Bus interface for crashdump service_
The BMC RAS application is started by the systemd service
com.amd.crashdump [this will be changed to a generic service name based
on the feedback/interest from the community].
A D-Bus interface is maintained that exposes the config file info and
the CPER files currently in the system, which can be downloaded via the
Redfish interface.
The service name and object path need to be renamed from OEM-specific
names to generic ones so that all contributors can use the same service
name and object path to pull the crash data.
#### Alternatives Considered
In-band mechanisms using System Management Mode (SMM) exist.
However, the out-of-band method of gathering RAS data is processor specific.
#### Impacts
Since crash dump data follows the Common Platform Error Record (CPER)
format as per the UEFI specification [https://uefi.org/specs/UEFI/2.10/],
there is no security impact.
This implementation reduces the host processor workload by offloading
the data collection process to the BMC, thereby improving system
performance as a whole.
#### Testing
It has been tested on AMD Genoa platforms, namely Onyx, Quartz, Ruby, and
Titanite.
Further testing support is appreciated.
Thanks,
Abinaya & Supreeth