[RFC] BMC RAS Feature
Venkatesh, Supreeth
Supreeth.Venkatesh at amd.com
Sat Jul 22 00:03:41 AEST 2023
Hi Dhanasekar,
At this time, it is supported on the EPYC Genoa family and later.
Daytona uses the EPYC Milan family, which is not supported.
Thanks,
Supreeth Venkatesh
System Manageability Architect | AMD
Server Software
From: J Dhanasekar <jdhanasekar at velankanigroup.com>
Sent: Friday, July 21, 2023 5:30 AM
To: Venkatesh, Supreeth <Supreeth.Venkatesh at amd.com>
Cc: Zane Shelley <zshelle at imap.linux.ibm.com>; Lei Yu <yulei.sh at bytedance.com>; Michael Shen <gpgpgp at google.com>; openbmc <openbmc at lists.ozlabs.org>; dhruvaraj S <dhruvaraj at gmail.com>; Brad Bishop <bradleyb at fuzziesquirrel.com>; Ed Tanous <ed at tanous.net>; Dhandapani, Abinaya <Abinaya.Dhandapani at amd.com>
Subject: Re: [RFC] BMC RAS Feature
Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.
Hi Supreeth Venkatesh,
Does this RAS feature work on the Daytona platform? I have been working on OpenBMC development for the Daytonax platform.
If this RAS feature works on the Daytona platform, I will include it in my project.
Please provide your suggestions.
Thanks,
Dhanasekar
---- On Mon, 03 Apr 2023 22:06:24 +0530 Supreeth Venkatesh <supreeth.venkatesh at amd.com> wrote ---
On 3/23/23 13:57, Zane Shelley wrote:
> On 2023-03-22 19:07, Supreeth Venkatesh wrote:
>> On 3/22/23 02:10, Lei Yu wrote:
>>>>> On Tue, 21 Mar 2023 at 20:38, Supreeth Venkatesh
>>>>> <supreeth.venkatesh at amd.com> wrote:
>>>>>
>>>>>
>>>>> On 3/21/23 05:40, Patrick Williams wrote:
>>>>> > On Tue, Mar 21, 2023 at 12:14:45AM -0500, Supreeth Venkatesh wrote:
>>>>> >
>>>>> >> #### Alternatives Considered
>>>>> >>
>>>>> >> In-band mechanisms using System Management Mode (SMM) exist.
>>>>> >>
>>>>> >> However, the out-of-band method to gather RAS data is
>>>>> >> processor-specific.
>>>>> >>
>>>>> > How does this compare with existing implementations in
>>>>> > phosphor-debug-collector?
>>>>> Thanks for your feedback. See below.
>>>>> > I believe there was some attempt to extend
>>>>> > P-D-C previously to handle Intel's crashdump behavior.
>>>>> Intel's crashdump interface uses com.intel.crashdump.
>>>>> We have implemented com.amd.crashdump based on that reference.
>>>>> However, can this be made generic?
>>>>>
>>>>> PoC below:
>>>>>
>>>>> busctl tree com.amd.crashdump
>>>>>
>>>>> └─/com
>>>>> └─/com/amd
>>>>> └─/com/amd/crashdump
>>>>> ├─/com/amd/crashdump/0
>>>>> ├─/com/amd/crashdump/1
>>>>> ├─/com/amd/crashdump/2
>>>>> ├─/com/amd/crashdump/3
>>>>> ├─/com/amd/crashdump/4
>>>>> ├─/com/amd/crashdump/5
>>>>> ├─/com/amd/crashdump/6
>>>>> ├─/com/amd/crashdump/7
>>>>> ├─/com/amd/crashdump/8
>>>>> └─/com/amd/crashdump/9
>>>>>
>>>>> > The repository
>>>>> > currently handles IBM's processors, I think, or maybe that is
>>>>> > covered by openpower-debug-collector.
>>>>> >
>>>>> > In any case, I think you should look at the existing D-Bus
>>>>> > interfaces (and associated Redfish implementation) of these
>>>>> > repositories and determine if you can use those approaches
>>>>> > (or document why not).
>>>>> I could not find an existing D-Bus interface for RAS in
>>>>> xyz/openbmc_project/.
>>>>> It would be helpful if you could point me to it.
>>>>>
>>>>>
>>>>> There is an interface for the dumps generated from the host, which
>>>>> can be used for these kinds of dumps:
>>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml
>>>>>
>>>>>
>>>>> The fault log also provides similar dumps
>>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml
>>>>>
>>>>>
>>>> Thanks, Dhruvaraj. The interface looks useful for the purpose. However,
>>>> the current bmcweb implementation references
>>>> https://github.com/openbmc/bmcweb/blob/master/redfish-core/lib/log_services.hpp
>>>>
>>>> [com.intel.crashdump]
>>>> constexpr char const* crashdumpPath = "/com/intel/crashdump";
>>>>
>>>> constexpr char const* crashdumpInterface = "com.intel.crashdump";
>>>> constexpr char const* crashdumpObject = "com.intel.crashdump";
>>>>
>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml
>>>>
>>>> or
>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml
>>>>
>>>> Is either of these interfaces exercised in the Redfish LogServices?
>>> In our practice, a plugin `tools/dreport.d/plugins.d/acddump` is added
>>> to copy the crashdump json file to the dump tarball.
>>> The crashdump tool (Intel or AMD) could trigger a dump after the
>>> crashdump is completed, and then we could get a dump entry containing
>>> the crashdump.
>> Thanks Lei Yu for your input. We are using Redfish to retrieve the
>> CPER binary file which can then be passed through a plugin/script for
>> detailed analysis.
>> In any case, irrespective of which D-Bus interface we use, we need a
>> repository which will gather data from AMD processor via APML as per
>> AMD design.
>> APML Spec: https://www.amd.com/system/files/TechDocs/57019-A0-PUB_3.00.zip
>> Can someone please help create bmc-ras or amd-debug-collector
>> repository as there are instances of openpower-debug-collector
>> repository used for Open Power systems?
>>>
>>>
>>> --
>>> BRs,
>>> Lei YU
> I am interested in possibly standardizing some of this. IBM POWER has
> several related components. openpower-hw-diags is a service that will
> listen for the hardware interrupts via a GPIO pin. When an error is
> detected, it will use openpower-libhei to query hardware registers to
> determine what happened. Based on that information openpower-hw-diags
> will generate a PEL, which is an extended log in phosphor-logging, that
> is used to tell service what to replace if necessary. Afterward,
> openpower-hw-diags will initiate openpower-debug-collector, which
> gathers a significant amount of data from the hardware for additional
> debug when necessary. I wrote openpower-libhei to be fairly agnostic. It
> uses data files (currently XML, but moving to JSON) to define register
> addresses and rules for isolation. openpower-hw-diags is fairly POWER
> specific, but I can see some parts can be made generic. Dhruv would have
> to help with openpower-debug-collector.
Thank you. Let's collaborate on standardizing some aspects of it.
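
As a starting point for that discussion, the data-driven isolation idea described above (register addresses and rules kept in data, not code) could be sketched roughly as follows. This is only an illustration under stated assumptions: IsolationRule, RegisterReader and isolate() are hypothetical names, not the openpower-libhei API, and the register accessor stands in for whatever platform-specific access method (for example APML on AMD) would actually be used.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical rule: one register address, a bit mask, and a fault label.
// In a real design these would be loaded from a data file (XML or JSON).
struct IsolationRule
{
    uint64_t address;
    uint64_t mask;
    std::string fault;
};

// Hypothetical register accessor; the real one is platform specific.
using RegisterReader = std::function<uint64_t(uint64_t address)>;

// Return the label of every rule whose masked register value is non-zero.
inline std::vector<std::string> isolate(const std::vector<IsolationRule>& rules,
                                        const RegisterReader& read)
{
    std::vector<std::string> faults;
    for (const auto& rule : rules)
    {
        if ((read(rule.address) & rule.mask) != 0)
        {
            faults.push_back(rule.fault);
        }
    }
    return faults;
}
```

Keeping the platform-specific part behind the reader callback is one way the isolation logic itself could stay vendor-neutral.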
>
> Regarding creation of a new repository, I think we'll need to have some
> more collaboration to determine the scope before creating it. It
> certainly sounds like we are doing similar things, but we need to
> determine if enough can be abstracted to make it worth our time.
I have put in a request here:
https://github.com/openbmc/technical-oversight-forum/issues/24
Please chime in.
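
On the earlier question of whether com.amd.crashdump can be made generic: one possible direction, sketched here purely as an assumption, is to derive the three names that log_services.hpp hard-codes from a vendor string. CrashdumpNames, makeCrashdumpNames() and recordPath() are hypothetical helpers and do not exist in bmcweb; only the "intel" and "amd" name patterns come from this thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical bundle of the names bmcweb currently hard-codes for Intel.
struct CrashdumpNames
{
    std::string service;   // D-Bus service name
    std::string path;      // root object path
    std::string interface; // D-Bus interface name
};

// Build names following the com.<vendor>.crashdump pattern seen in the thread.
inline CrashdumpNames makeCrashdumpNames(const std::string& vendor)
{
    assert(!vendor.empty());
    const std::string dotted = "com." + vendor + ".crashdump";
    return {dotted, "/com/" + vendor + "/crashdump", dotted};
}

// Object path of one record, matching the layout of the PoC busctl tree.
inline std::string recordPath(const CrashdumpNames& names, unsigned index)
{
    return names.path + "/" + std::to_string(index);
}
```

For example, makeCrashdumpNames("amd") would yield the service com.amd.crashdump rooted at /com/amd/crashdump. A vendor-neutral xyz.openbmc_project interface would avoid the parameter entirely, which is the alternative discussed in the thread.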