[RFC] BMC RAS Feature

Venkatesh, Supreeth Supreeth.Venkatesh at amd.com
Wed Jul 26 00:02:59 AEST 2023


[AMD Official Use Only - General]

Hi Dhanasekar,

The algorithms and steps for implementing these functionalities (SOL, PostCode, etc.) will be the same.

Thanks,
Supreeth Venkatesh
System Manageability Architect  |  AMD
Server Software

From: J Dhanasekar <jdhanasekar at velankanigroup.com>
Sent: Tuesday, July 25, 2023 8:09 AM
To: Venkatesh, Supreeth <Supreeth.Venkatesh at amd.com>
Cc: Lei Yu <yulei.sh at bytedance.com>; Michael Shen <gpgpgp at google.com>; openbmc <openbmc at lists.ozlabs.org>; dhruvaraj S <dhruvaraj at gmail.com>; Brad Bishop <bradleyb at fuzziesquirrel.com>; Ed Tanous <ed at tanous.net>; Dhandapani, Abinaya <Abinaya.Dhandapani at amd.com>
Subject: RE: [RFC] BMC RAS Feature

Caution: This message originated from an External Source. Use proper caution when opening attachments, clicking links, or responding.


Hi Supreeth,

I am working on SP5 servers too. The SP5 servers have an ASPEED 2600 chip with the BMC off the board, whereas EthanolX/DaytonaX have a 2500 with the BMC on the board.
Will the algorithms or steps for implementing functionalities (SOL, PostCode, PSU, ...) remain the same?

Thanks,
Dhanasekar




---- On Mon, 24 Jul 2023 19:44:52 +0530 Venkatesh, Supreeth <Supreeth.Venkatesh at amd.com> wrote ---



Hi Dhanasekar,

The DaytonaX and EthanolX platforms were only OpenBMC PoCs with limited functionality.
We are in the process of upstreaming new AMD CRBs with OpenBMC, which have all the functionality you mention below.
A public instance of the staging/intermediary repository, ahead of upstreaming, is here:
AMDESE/OpenBMC: OpenBMC for Genoa SP5 socket platforms <https://github.com/AMDESE/OpenBMC>

Thanks,
Supreeth Venkatesh
System Manageability Architect  |  AMD
Server Software

From: J Dhanasekar <jdhanasekar at velankanigroup.com>
Sent: Monday, July 24, 2023 8:04 AM
To: Venkatesh, Supreeth <Supreeth.Venkatesh at amd.com>
Cc: Lei Yu <yulei.sh at bytedance.com>; Zane Shelley <zshelle at imap.linux.ibm.com>; Michael Shen <gpgpgp at google.com>; openbmc <openbmc at lists.ozlabs.org>; dhruvaraj S <dhruvaraj at gmail.com>; Brad Bishop <bradleyb at fuzziesquirrel.com>; Ed Tanous <ed at tanous.net>; Dhandapani, Abinaya <Abinaya.Dhandapani at amd.com>
Subject: RE: [RFC] BMC RAS Feature


Hi Supreeth,

Thanks for the info. We had hoped that DaytonaX would be upstreamed. Unfortunately, it is not available.
We need to enable the SOL, POST code, and PSU features on Daytona. Will we get support for enabling these features? Or is there any reference implementation available for AMD boards?

Thanks,
Dhanasekar



---- On Fri, 21 Jul 2023 19:33:41 +0530 Venkatesh, Supreeth <Supreeth.Venkatesh at amd.com> wrote ---



Hi Dhanasekar,

At this time, it is supported for the EPYC Genoa family and beyond.
Daytona uses the EPYC Milan family, which is not supported.

Thanks,
Supreeth Venkatesh
System Manageability Architect  |  AMD
Server Software

From: J Dhanasekar <jdhanasekar at velankanigroup.com>
Sent: Friday, July 21, 2023 5:30 AM
To: Venkatesh, Supreeth <Supreeth.Venkatesh at amd.com>
Cc: Zane Shelley <zshelle at imap.linux.ibm.com>; Lei Yu <yulei.sh at bytedance.com>; Michael Shen <gpgpgp at google.com>; openbmc <openbmc at lists.ozlabs.org>; dhruvaraj S <dhruvaraj at gmail.com>; Brad Bishop <bradleyb at fuzziesquirrel.com>; Ed Tanous <ed at tanous.net>; Dhandapani, Abinaya <Abinaya.Dhandapani at amd.com>
Subject: Re: [RFC] BMC RAS Feature


Hi Supreeth Venkatesh,

Does this RAS feature work for the Daytona platform? I have been working on OpenBMC development for the DaytonaX platform.
If this RAS feature works for the Daytona platform, I will include it in my project.

Please provide your suggestions.

Thanks,
Dhanasekar





---- On Mon, 03 Apr 2023 22:06:24 +0530 Supreeth Venkatesh <supreeth.venkatesh at amd.com> wrote ---


On 3/23/23 13:57, Zane Shelley wrote:
>
> On 2023-03-22 19:07, Supreeth Venkatesh wrote:
>> On 3/22/23 02:10, Lei Yu wrote:
>>>
>>>>> On Tue, 21 Mar 2023 at 20:38, Supreeth Venkatesh
>>>>> <supreeth.venkatesh at amd.com> wrote:
>>>>>
>>>>>
>>>>>      On 3/21/23 05:40, Patrick Williams wrote:
>>>>>      > On Tue, Mar 21, 2023 at 12:14:45AM -0500, Supreeth Venkatesh wrote:
>>>>>      >
>>>>>      >> #### Alternatives Considered
>>>>>      >>
>>>>>      >> In-band mechanisms using System Management Mode (SMM) exist.
>>>>>      >>
>>>>>      >> However, the out-of-band method to gather RAS data is processor
>>>>>      >> specific.
>>>>>      >>
>>>>>      > How does this compare with existing implementations in
>>>>>      > phosphor-debug-collector?
>>>>>      Thanks for your feedback. See below.
>>>>>      > I believe there was some attempt to extend
>>>>>      > P-D-C previously to handle Intel's crashdump behavior.
>>>>>      Intel's crashdump interface uses com.intel.crashdump.
>>>>>      We have implemented com.amd.crashdump based on that reference.
>>>>>      However,
>>>>>      can this be made generic?
>>>>>
>>>>>      PoC below:
>>>>>
>>>>>      busctl tree com.amd.crashdump
>>>>>
>>>>>      └─/com
>>>>>         └─/com/amd
>>>>>           └─/com/amd/crashdump
>>>>>             ├─/com/amd/crashdump/0
>>>>>             ├─/com/amd/crashdump/1
>>>>>             ├─/com/amd/crashdump/2
>>>>>             ├─/com/amd/crashdump/3
>>>>>             ├─/com/amd/crashdump/4
>>>>>             ├─/com/amd/crashdump/5
>>>>>             ├─/com/amd/crashdump/6
>>>>>             ├─/com/amd/crashdump/7
>>>>>             ├─/com/amd/crashdump/8
>>>>>             └─/com/amd/crashdump/9
>>>>>
>>>>>      > The repository
>>>>>      > currently handles IBM's processors, I think, or maybe that is
>>>>>      > covered by openpower-debug-collector.
>>>>>      >
>>>>>      > In any case, I think you should look at the existing D-Bus
>>>>>      > interfaces (and associated Redfish implementation) of these
>>>>>      > repositories and determine if you can use those approaches (or
>>>>>      > document why not).
>>>>>      I could not find an existing D-Bus interface for RAS in
>>>>>      xyz/openbmc_project/.
>>>>>      It would be helpful if you could point me to it.
>>>>>
>>>>>
>>>>> There is an interface for the dumps generated from the host, which
>>>>> can be used for these kinds of dumps:
>>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml
>>>>>
>>>>>
>>>>> The fault log also provides similar dumps
>>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml
>>>>>
>>>>>
>>>> Thanks, Dhruvaraj. The interface looks useful for the purpose. However,
>>>> the current bmcweb implementation references
>>>> https://github.com/openbmc/bmcweb/blob/master/redfish-core/lib/log_services.hpp
>>>>
>>>> [com.intel.crashdump]
>>>> constexpr char const* crashdumpPath = "/com/intel/crashdump";
>>>>
>>>> constexpr char const* crashdumpInterface = "com.intel.crashdump";
>>>> constexpr char const* crashdumpObject = "com.intel.crashdump";
>>>>
>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml
>>>>
>>>> or
>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml
>>>>
>>>> Is either of these interfaces exercised in Redfish LogServices?
>>> In our practice, a plugin `tools/dreport.d/plugins.d/acddump` is added
>>> to copy the crashdump json file to the dump tarball.
>>> The crashdump tool (Intel or AMD) could trigger a dump after the
>>> crashdump is completed, and then we could get a dump entry containing
>>> the crashdump.
>> Thanks Lei Yu for your input. We are using Redfish to retrieve the
>> CPER binary file which can then be passed through a plugin/script for
>> detailed analysis.
>> In any case, irrespective of which D-Bus interface we use, we need a
>> repository which will gather data from the AMD processor via APML as per
>> AMD design.
>> APML spec: https://www.amd.com/system/files/TechDocs/57019-A0-PUB_3.00.zip
>> Can someone please help create a bmc-ras or amd-debug-collector
>> repository, given that there is precedent in the
>> openpower-debug-collector repository used for OpenPOWER systems?
>>>
>>>
>>> --
>>> BRs,
>>> Lei YU
> I am interested in possibly standardizing some of this. IBM POWER has
> several related components. openpower-hw-diags is a service that will
> listen for the hardware interrupts via a GPIO pin. When an error is
> detected, it will use openpower-libhei to query hardware registers to
> determine what happened. Based on that information openpower-hw-diags
> will generate a PEL, which is an extended log in phosphor-logging, that
> is used to tell service what to replace if necessary. Afterward,
> openpower-hw-diags will initiate openpower-debug-collector, which
> gathers a significant amount of data from the hardware for additional
> debug when necessary. I wrote openpower-libhei to be fairly agnostic. It
> uses data files (currently XML, but moving to JSON) to define register
> addresses and rules for isolation. openpower-hw-diags is fairly POWER
> specific, but I can see some parts can be made generic. Dhruv would have
> to help with openpower-debug-collector.
Thank you. Let's collaborate in standardizing some aspects of it.
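
For illustration only, the flow described above (isolate the fault from
register rules kept in a data file, create a log, then start debug
collection) could be sketched roughly as follows in Python. This is not how
openpower-hw-diags or openpower-libhei are implemented. The rule format, the
register names and addresses, and the function names isolate and
handle_attention are all hypothetical, and the hardware access and logging
steps are passed in as plain callables.

```python
import json

# Hypothetical rule file. The register names, addresses, and bit meanings
# are invented for illustration and do not describe any real processor.
RULES_JSON = """
{
  "registers": [
    {"name": "EXAMPLE_FIR0", "address": "0x1000",
     "bits": {"0": "memory_controller", "3": "core"}},
    {"name": "EXAMPLE_FIR1", "address": "0x2000",
     "bits": {"1": "io_bridge"}}
  ]
}
"""


def isolate(rules, read_register):
    """Return (register name, bit, component) for every set bit that has a rule."""
    findings = []
    for reg in rules["registers"]:
        value = read_register(int(reg["address"], 16))
        # Walk the bits in numeric order so the result is deterministic.
        for bit, component in sorted(reg["bits"].items(), key=lambda kv: int(kv[0])):
            if value & (1 << int(bit)):
                findings.append((reg["name"], int(bit), component))
    return findings


def handle_attention(rules, read_register, create_log, collect_debug_data):
    """Run the steps in order: isolate, create a log, then collect debug data."""
    findings = isolate(rules, read_register)
    if not findings:
        return []
    create_log(findings)          # stands in for creating an extended log entry
    collect_debug_data(findings)  # stands in for starting the debug collector
    return findings


if __name__ == "__main__":
    fake_hw = {0x1000: 0b1001, 0x2000: 0b0000}  # simulated register contents
    events = []
    result = handle_attention(
        json.loads(RULES_JSON),
        lambda addr: fake_hw.get(addr, 0),
        lambda f: events.append(("log", len(f))),
        lambda f: events.append(("collect", len(f))),
    )
    print(result)
    print(events)
```

Keeping the register rules in data and the hardware access behind a callable
is one way the processor-specific parts could stay separate from the generic
ordering of the steps.
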
>
> Regarding creation of a new repository, I think we'll need to have some
> more collaboration to determine the scope before creating it. It
> certainly sounds like we are doing similar things, but we need to
> determine if enough can be abstracted to make it worth our time.
I have put in a request here:
https://github.com/openbmc/technical-oversight-forum/issues/24
Please chime in.
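
On the question of whether the crashdump interface can be made generic: the
com.intel.crashdump constants quoted from bmcweb and the com.amd.crashdump
tree in the PoC differ only in the vendor element. A minimal Python sketch of
deriving those names from a vendor string follows. The helper name
crashdump_names and its return layout are hypothetical, not an existing
OpenBMC API, and the validation only approximates the D-Bus naming rules.

```python
def crashdump_names(vendor, entry_count=0):
    """Build crashdump D-Bus names for a vendor such as "intel" or "amd".

    Hypothetical helper for illustration; it only mirrors the naming
    pattern visible in the thread.
    """
    # Keep the vendor element simple: lowercase ASCII letters and digits,
    # starting with a letter.
    if not (vendor.isascii() and vendor.isalnum()
            and vendor[0].isalpha() and vendor == vendor.lower()):
        raise ValueError(f"unsupported vendor element: {vendor!r}")

    service = f"com.{vendor}.crashdump"
    path = "/" + service.replace(".", "/")
    return {
        "service": service,      # same string as crashdumpObject in the quote
        "interface": service,    # same string as crashdumpInterface
        "path": path,            # same shape as crashdumpPath
        "entries": [f"{path}/{i}" for i in range(entry_count)],
    }


if __name__ == "__main__":
    names = crashdump_names("amd", entry_count=10)
    print(names["service"])
    for entry in names["entries"]:
        print(entry)
```

With "amd" and ten entries this yields the same object paths as the busctl
tree in the PoC, /com/amd/crashdump/0 through /com/amd/crashdump/9.
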





