[RFC] BMC RAS Feature
Venkatesh, Supreeth
Supreeth.Venkatesh at amd.com
Tue Jul 25 00:29:40 AEST 2023
Thanks for your feedback, Jason. Sorry for the delay in my response.
1. The format can be anything. We could use phosphor-debug-collector, which already collects different kinds of debug dumps.
2. I agree with this path:
i. Redfish -> provided by bmcweb which pulls from
ii. D-Bus -> provided by a new service which looks for data stored by
iii. processor-specific collector -> provided by separate services as needed and triggered by
iv. platform-specific monitoring service -> provided by host-error-monitor or other service as needed.
We need a repository for the processor-specific collector.
Thanks,
Supreeth Venkatesh
System Manageability Architect | AMD
Server Software
-----Original Message-----
From: openbmc <openbmc-bounces+supreeth.venkatesh=amd.com at lists.ozlabs.org> On Behalf Of Bills, Jason M
Sent: Friday, July 14, 2023 5:05 PM
To: openbmc at lists.ozlabs.org
Subject: Re: [RFC] BMC RAS Feature
Sorry for missing this earlier. Here are some of my thoughts.
On 3/20/2023 11:14 PM, Supreeth Venkatesh wrote:
>
> #### Requirements
>
> 1. Collecting RAS/Crashdump shall be processor specific. Hence the use
> of virtual APIs to allow override for processor specific way of
> collecting the data.
> 2. Crash data shall be stored in the Common Platform Error Record
> (CPER) format as per the UEFI specification
> [https://uefi.org/specs/UEFI/2.10/].
Do we have to define a single output format? Could the service be flexible about the format of the collected crash data?
> 3. Configuration parameters of the service shall be standard with scope
> for processor specific extensions.
>
> #### Proposed Design
>
> When one or more processors register a fatal error condition, an
> interrupt is generated to the host processor.
>
> The host processor in the failed state asserts the signal to indicate
> to the BMC that a fatal hang has occurred. [APML_ALERT# in case of AMD
> processor family]
>
> The BMC RAS application listens for the event [APML_ALERT# in case of
> AMD processor family].
The host-error-monitor application provides support for listening for events and taking action such as logging or triggering a crashdump that may meet this requirement.
One thought may be to break this up into various layers to allow for flexibility and standardization. For example:
1. Redfish -> provided by bmcweb which pulls from
2. D-Bus -> provided by a new service which looks for data stored by
3. processor-specific collector -> provided by separate services as needed and triggered by
4. platform-specific monitoring service -> provided by host-error-monitor or other service as needed.
Ideally, we could make 2 a generic service.
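To make the idea of a generic layer 2 a bit more concrete, here is a rough sketch in Python. It is only an illustration under stated assumptions: the sidecar ".meta.json" file, the function names store_record and list_records, and the format strings are all hypothetical and are not part of any existing OpenBMC service. The point is that the generic service only enumerates what the collectors stored and never parses the payload, so the crash data format can stay flexible.

```python
import json
import tempfile
from pathlib import Path

META_SUFFIX = ".meta.json"  # hypothetical sidecar naming, chosen for this sketch


def store_record(store_dir: Path, name: str, payload: bytes, fmt: str) -> Path:
    """Layer 3 (processor-specific collector): write the raw data plus a small
    sidecar describing its format, so layer 2 never has to parse the payload."""
    data_path = store_dir / name
    data_path.write_bytes(payload)
    meta_path = store_dir / (name + META_SUFFIX)
    meta_path.write_text(json.dumps({"format": fmt, "size": len(payload)}))
    return data_path


def list_records(store_dir: Path) -> list:
    """Layer 2 (generic service): enumerate whatever the collectors stored.
    A real service would publish these entries on D-Bus for bmcweb to expose."""
    records = []
    for meta_path in sorted(store_dir.glob("*" + META_SUFFIX)):
        meta = json.loads(meta_path.read_text())
        meta["name"] = meta_path.name[: -len(META_SUFFIX)]
        records.append(meta)
    return records


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp)
        store_record(store, "dump0.cper", b"\x00" * 16, "CPER")
        store_record(store, "dump1.bin", b"\x01" * 8, "vendor-raw")
        for rec in list_records(store):
            print(rec)
```

With this split, adding a new processor family would only mean adding a new collector that writes into the same directory; the generic service and the Redfish layer would not need to change.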
>
> Upon detection of a FATAL error event, BMC will check the status
> register of the host processor [implementation defined method] to see
> if the assertion is due to the fatal error.
>
> Upon a fatal error, BMC will attempt to harvest crash data from all
> processors [via the APML interface (mailbox) in case of AMD processor
> family].
>
> BMC will generate a single raw crashdump record and save it in the
> non-volatile location /var/lib/bmc-ras.
>
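The flow quoted above (alert, status check, harvest from all processors, one record on disk) could be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: the CrashCollector interface, FakeCollector, handle_alert, and the file name crashdump-0.bin are hypothetical names made up here, and the fake backend returns canned bytes instead of talking to APML. The abstract base class stands in for the "virtual APIs" of requirement 1.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class CrashCollector(ABC):
    """Hypothetical processor-specific interface (the overridable 'virtual APIs')."""

    @abstractmethod
    def is_fatal_error(self) -> bool:
        """Check the status register in an implementation-defined way."""

    @abstractmethod
    def harvest(self, socket: int) -> bytes:
        """Return the raw crash data for one processor socket."""


class FakeCollector(CrashCollector):
    """Stand-in for a real backend; returns canned data for illustration."""

    def __init__(self, fatal: bool):
        self.fatal = fatal

    def is_fatal_error(self) -> bool:
        return self.fatal

    def harvest(self, socket: int) -> bytes:
        return bytes([socket]) * 4


def handle_alert(collector: CrashCollector, sockets, out_dir: Path):
    """Run once per alert: confirm the fatal error, harvest every socket,
    and write a single raw record. Returns the record path, or None."""
    if not collector.is_fatal_error():
        return None
    record = b"".join(collector.harvest(s) for s in sockets)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "crashdump-0.bin"  # hypothetical file name
    path.write_bytes(record)
    return path


if __name__ == "__main__":
    # A temporary directory stands in for /var/lib/bmc-ras in this sketch.
    with tempfile.TemporaryDirectory() as tmp:
        print(handle_alert(FakeCollector(True), [0, 1], Path(tmp) / "bmc-ras"))
```

A platform-specific monitor would call something like handle_alert when the alert signal asserts, passing in whichever collector matches the processor family.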