[RFC] BMC RAS Feature
J Dhanasekar
jdhanasekar at velankanigroup.com
Fri Jul 21 20:29:39 AEST 2023
Hi Supreeth Venkatesh,
Does this RAS feature work for the Daytona platform? I have been working on OpenBMC development for the DaytonaX platform.
If this RAS feature works for the Daytona platform, I will include it in my project.
Please provide your suggestions.
Thanks,
Dhanasekar
---- On Mon, 03 Apr 2023 22:06:24 +0530 Supreeth Venkatesh <supreeth.venkatesh at amd.com> wrote ---
On 3/23/23 13:57, Zane Shelley wrote:
>
> On 2023-03-22 19:07, Supreeth Venkatesh wrote:
>> On 3/22/23 02:10, Lei Yu wrote:
>>>
>>>>> On Tue, 21 Mar 2023 at 20:38, Supreeth Venkatesh
>>>>> <supreeth.venkatesh at amd.com> wrote:
>>>>>
>>>>>
>>>>> On 3/21/23 05:40, Patrick Williams wrote:
>>>>> > On Tue, Mar 21, 2023 at 12:14:45AM -0500, Supreeth Venkatesh
>>>>> wrote:
>>>>> >
>>>>> >> #### Alternatives Considered
>>>>> >>
>>>>> >> In-band mechanisms using System Management Mode (SMM)
>>>>> exist.
>>>>> >>
>>>>> >> However, the out-of-band method to gather RAS data is processor
>>>>> specific.
>>>>> >>
>>>>> > How does this compare with existing implementations in
>>>>> > phosphor-debug-collector?
>>>>> Thanks for your feedback. See below.
>>>>> > I believe there was some attempt to extend
>>>>> > P-D-C previously to handle Intel's crashdump behavior.
>>>>> Intel's crashdump interface uses com.intel.crashdump.
>>>>> We have implemented com.amd.crashdump based on that reference.
>>>>> However,
>>>>> can this be made generic?
>>>>>
>>>>> PoC below:
>>>>>
>>>>> busctl tree com.amd.crashdump
>>>>>
>>>>> └─/com
>>>>> └─/com/amd
>>>>> └─/com/amd/crashdump
>>>>> ├─/com/amd/crashdump/0
>>>>> ├─/com/amd/crashdump/1
>>>>> ├─/com/amd/crashdump/2
>>>>> ├─/com/amd/crashdump/3
>>>>> ├─/com/amd/crashdump/4
>>>>> ├─/com/amd/crashdump/5
>>>>> ├─/com/amd/crashdump/6
>>>>> ├─/com/amd/crashdump/7
>>>>> ├─/com/amd/crashdump/8
>>>>> └─/com/amd/crashdump/9
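>>>>>
>>>>> For illustration, each entry object could then be inspected with
>>>>> busctl. This is only a sketch and assumes the AMD entries mirror the
>>>>> Intel layout, i.e. each object carries a com.amd.crashdump interface
>>>>> with a Log property holding the record; neither name is confirmed here.
>>>>>
>>>>> # list the interfaces and properties on one entry (hypothetical layout)
>>>>> busctl introspect com.amd.crashdump /com/amd/crashdump/0
>>>>>
>>>>> # read the record itself, assuming a Log property as in com.intel.crashdump
>>>>> busctl get-property com.amd.crashdump /com/amd/crashdump/0 \
>>>>>         com.amd.crashdump Log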
>>>>>
>>>>> > The repository
>>>>> > currently handles IBM's processors, I think, or maybe that is
>>>>> covered by
>>>>> > openpower-debug-collector.
>>>>> >
>>>>> > In any case, I think you should look at the existing D-Bus
>>>>> interfaces
>>>>> > (and associated Redfish implementation) of these repositories
>>>>> and
>>>>> > determine if you can use those approaches (or document why
>>>>> not).
>>>>> I could not find an existing D-Bus interface for RAS in
>>>>> xyz/openbmc_project/.
>>>>> It would be helpful if you could point me to it.
>>>>>
>>>>>
>>>>> There is an interface for the dumps generated from the host, which
>>>>> can
>>>>> be used for these kinds of dumps
>>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml
>>>>>
>>>>>
>>>>> The fault log also provides similar dumps
>>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml
>>>>>
>>>>>
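>>>>> As a rough usage sketch of that path (service name and object path as
>>>>> used by phosphor-debug-collector's dump manager; not verified against
>>>>> the AMD flow), a fault log entry can be requested with:
>>>>>
>>>>> # ask the dump manager for a new fault log entry; the empty a{sv}
>>>>> # dictionary means no additional parameters are passed
>>>>> busctl call xyz.openbmc_project.Dump.Manager \
>>>>>         /xyz/openbmc_project/dump/faultlog \
>>>>>         xyz.openbmc_project.Dump.Create CreateDump a{sv} 0
>>>>>
>>>>> # entries then appear under /xyz/openbmc_project/dump/faultlog/entry/<id>
>>>>> # carrying the FaultLog entry interface above
>>>>>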
>>>> Thanks, Dhruvraj. The interface looks useful for this purpose. However,
>>>> the current bmcweb implementation references
>>>> https://github.com/openbmc/bmcweb/blob/master/redfish-core/lib/log_services.hpp
>>>>
>>>> [com.intel.crashdump]
>>>> constexpr char const* crashdumpPath = "/com/intel/crashdump";
>>>>
>>>> constexpr char const* crashdumpInterface = "com.intel.crashdump";
>>>> constexpr char const* crashdumpObject = "com.intel.crashdump";
>>>>
>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/System.interface.yaml
>>>>
>>>> or
>>>> https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Dump/Entry/FaultLog.interface.yaml
>>>>
>>>> Is it exercised in the Redfish LogServices implementation?
>>> In our practice, a plugin `tools/dreport.d/plugins.d/acddump` is added
>>> to copy the crashdump JSON file into the dump tarball.
>>> The crashdump tool (Intel or AMD) could trigger a dump after the
>>> crashdump is completed, and then we could get a dump entry containing
>>> the crashdump.
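>>>
>>> A minimal dreport plugin along those lines might look like the
>>> following sketch; the config numbers and the JSON path are
>>> placeholders, not the actual acddump plugin:
>>>
>>> #!/bin/bash
>>> # config: 123 25
>>> # @brief: copy the crashdump JSON file into the dump tarball
>>>
>>> . $DREPORT_INCLUDE/functions
>>>
>>> desc="crashdump json"
>>> file_name="/var/lib/crashdump/crashdump.json"  # placeholder path
>>>
>>> # add_copy_file is provided by the dreport helper functions and stages
>>> # the file into the tarball being assembled
>>> add_copy_file "$file_name" "$desc"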
>> Thanks, Lei Yu, for your input. We are using Redfish to retrieve the
>> CPER binary file, which can then be passed through a plugin/script for
>> detailed analysis.
>> In any case, irrespective of which D-Bus interface we use, we need a
>> repository that will gather data from the AMD processor via APML, as per
>> the AMD design.
>> APML
>> Spec: https://www.amd.com/system/files/TechDocs/57019-A0-PUB_3.00.zip
>> Can someone please help create a bmc-ras or amd-debug-collector
>> repository, similar to how the openpower-debug-collector repository is
>> used for OpenPOWER systems?
>>>
>>>
>>> --
>>> BRs,
>>> Lei YU
> I am interested in possibly standardizing some of this. IBM POWER has
> several related components. openpower-hw-diags is a service that will
> listen for the hardware interrupts via a GPIO pin. When an error is
> detected, it will use openpower-libhei to query hardware registers to
> determine what happened. Based on that information openpower-hw-diags
> will generate a PEL, which is an extended log in phosphor-logging, that
> is used to tell service personnel what to replace, if necessary. Afterward,
> openpower-hw-diags will initiate openpower-debug-collector, which
> gathers a significant amount of data from the hardware for additional
> debug when necessary. I wrote openpower-libhei to be fairly agnostic. It
> uses data files (currently XML, but moving to JSON) to define register
> addresses and rules for isolation. openpower-hw-diags is fairly POWER
> specific, but I can see that some parts could be made generic. Dhruv would have
> to help with openpower-debug-collector.
Thank you. Let's collaborate on standardizing some aspects of it.
>
> Regarding creation of a new repository, I think we'll need to have some
> more collaboration to determine the scope before creating it. It
> certainly sounds like we are doing similar things, but we need to
> determine if enough can be abstracted to make it worth our time.
I have put in a request here:
https://github.com/openbmc/technical-oversight-forum/issues/24
Please chime in.