Proposal for operations on isolated hardware units using Redfish logging

Ed Tanous ed at tanous.net
Fri Dec 11 04:29:17 AEDT 2020


On Thu, Dec 10, 2020 at 7:49 AM dhruvaraj S <dhruvaraj at gmail.com> wrote:
>
> Hi,
> Please find the option for operations on isolated hardware units using
> Redfish logging
>
>
> Hardware Isolation
> On systems with multiple processor units and other redundant vital resources,
> system downtime can be prevented by isolating the faulty hardware units.
> Most of the actions required to isolate the parts depend on the
> architecture and are executed in the host. But the BMC needs to support a
> few steps: providing a method for users to query the units in isolation,
> clearing isolation, isolating a suspected part, and isolating a unit when
> the host is down due to a fault in a critical unit.
> Since a user interface is needed for the above actions, I am proposing a
> method that uses the Redfish log service to carry out these actions.

Right off the bat, LogServices seems like a strange choice for this.
In your requirements, you're taking actions on the unit itself, not
logging the actions that occurred, so I'm struggling to see the design
choice here.  Can you elaborate on why LogService, something intended
for historical logging, would be appropriate for a design that
needs to accept user actions?

>
> Requirements
> When a user requests it, isolate a hardware unit.
> Get the list of all isolated resources.
> Remove the isolation of a hardware unit.
> Remove all existing isolation.
>
> Isolating a hardware unit:
> redfish >> v1 >> Systems >> system >> LogServices >> IsolatedHardware
> {
>   "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware",
>   "@odata.type": "#LogService.v1_2_0.LogService",
>   "Actions": {
>     "#LogService.CollectDiagnosticData": {
>       "target":
> "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Actions/LogService.CollectDiagnosticData"

What is this action intended to do?

>     }
>   },
>   "Description": "Isolated Hardware",
>   "Entries": {
>     "@odata.id":
> "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries"
>   },
>   "Id": "IsolatedHardware",
>   "Name": "Isolated Hardware LogService",
>   "OverWritePolicy": "WrapsWhenFull"
>
> Listing isolated hardware units.
> redfish >> v1 >> Systems >> system >> LogServices >> IsolatedHardware >> Entries
> {
>   "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries",
>   "@odata.type": "#LogEntryCollection.LogEntryCollection",
>   "Description": "Collection of Isolated Hardware Components",
>   "Members": [
>     {
>       "@odata.id":
> "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries/1",
>       "@odata.type": "#LogEntry.v1_7_0.LogEntry",
>       "Created": "2020-10-15T10:30:08+00:00",
>       "EntryType": "Event",
>       "Id": "1",
>       "Resolved": "false",

LogEntry doesn't have a "Resolved" field that I can see.

>       "Name": "Processor 1",
>       "links":  {
>                  "OriginOfCondition": {
>                         "@odata.id":
> "/redfish/v1/Systems/system/Processors/cpu1"
>                     },
>       "Severity": "Critical",
>        "SensorType" : "Processor",

SensorType doesn't really make sense in this case, as you're not
reporting errors from a sensor, but from a resource.

>
>  "AdditionalDataURI":
> "/redfish/v1/Systems/system/LogServices/EventLog/attachement/111",
>  "AdditionalDataSizeBytes": "1024"
>
>   }
>   ],
>   "Members@odata.count": 1,
>   "Name": "Isolated Hardware Entries"
>
> Users will be able to delete any entry or all the entries. If an
> isolated unit is serviced, that unit will be back in service; in
> such cases the "Resolved" property in the entry will be marked as
> "true".
> "AdditionalDataURI": This is a link to the error log associated with
> this isolation action.
> --------------
> Dhruvaraj S


I suspect overall you need to separate this into two different
resources.  One for logging things that have happened in the past,
under log service, and one for interacting directly with the system in
its current state.  The second one would likely take the form of being
able to set the Status property to something like "Disabled",
"UnavailableOffline", or something similar on your Processor
resources.
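
As a rough sketch of what that second resource could look like from a
client's side, the Python below builds (but does not send) a PATCH that
sets Status.State on the Processor resource from the example above, and
filters a list of resources down to the isolated ones.  The host name
bmc.example, the helper names, and the assumption that Status.State is
writable on a Processor are all hypothetical; an actual implementation
might expose a different property or an Action instead.

```python
import json
import urllib.request

# Hypothetical BMC address; the processor path is taken from the proposal.
BMC = "https://bmc.example"
CPU_URI = "/redfish/v1/Systems/system/Processors/cpu1"

# States treated as "isolated" in this sketch (an assumption, not a spec rule).
ISOLATED_STATES = {"Disabled", "UnavailableOffline"}


def build_isolation_request(resource_uri, isolate):
    """Build an unsent PATCH that sets Status.State on a resource."""
    state = "Disabled" if isolate else "Enabled"
    body = json.dumps({"Status": {"State": state}}).encode("utf-8")
    return urllib.request.Request(
        BMC + resource_uri,
        data=body,
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


def isolated_units(resources):
    """Return the @odata.id of each resource whose state counts as isolated."""
    return [
        r["@odata.id"]
        for r in resources
        if r.get("Status", {}).get("State") in ISOLATED_STATES
    ]
```

With this split, "list isolated units" becomes a read of current resource
state rather than a walk through log entries, and the LogService is left
to record when and why each isolation happened.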

