Proposal for operations on isolated hardware units using Redfish logging

dhruvaraj S dhruvaraj at gmail.com
Fri Dec 11 22:27:22 AEDT 2020


On Thu, Dec 10, 2020 at 10:59 PM Ed Tanous <ed at tanous.net> wrote:
>
> On Thu, Dec 10, 2020 at 7:49 AM dhruvaraj S <dhruvaraj at gmail.com> wrote:
> >
> > Hi,
> > Please find below a proposal for operations on isolated hardware units using
> > Redfish logging.
> >
> >
> > Hardware Isolation
> > On systems with multiple processor units and other redundant vital
> > resources, system downtime can be prevented by isolating the faulty
> > hardware units. Most of the actions required to isolate the parts depend
> > on the architecture and are executed in the host. But the BMC needs to
> > support a few steps: providing a method for users to query the units in
> > isolation, clearing isolation, isolating a suspected part, and isolating
> > a unit when the host is down due to a fault in a critical unit.
> > Since a user interface is needed for the above actions, I am proposing
> > to use the Redfish log service to carry them out.
>
> Right off the bat, LogServices seems like a strange choice for this.
> In your requirements, you're taking actions on the unit itself, not
> logging the actions that occurred, so I'm struggling to see the design
> choice here.  Can you elaborate why LogService, something intended to
> be for historical logging, would be appropriate for a design that
> needs to accept user action?

Apart from user-requested isolation of a hardware unit, hardware units
usually get isolated due to a past action in the system. For example, if a
processor core encounters an error while running and cannot continue in
service, it will be listed as isolated. A method is needed to show the list
of such units to the users. Since the log service is meant for showing such
logs, I think it is suitable for that. After the repair, once the unit is
back in service, this log service entry will be marked as resolved.
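
To illustrate the listing idea, here is a minimal Python sketch that filters
an entries collection down to the units still in isolation. It is only a
sketch under assumptions: the "Resolved" property is the one proposed in this
thread (not a confirmed LogEntry v1_7_0 property), the second entry (cpu2) is
made up for the example, and the helper name unresolved_units is hypothetical.

```python
import json

# Sample payload shaped like the proposed IsolatedHardware entries collection.
# "Resolved" is the property proposed in this thread (an assumption), and the
# cpu2 entry is invented purely for illustration.
SAMPLE = json.loads("""
{
  "Members": [
    {"Id": "1", "Name": "Processor 1", "Resolved": false,
     "Links": {"OriginOfCondition":
               {"@odata.id": "/redfish/v1/Systems/system/Processors/cpu1"}}},
    {"Id": "2", "Name": "Processor 2", "Resolved": true,
     "Links": {"OriginOfCondition":
               {"@odata.id": "/redfish/v1/Systems/system/Processors/cpu2"}}}
  ],
  "Members@odata.count": 2
}
""")


def unresolved_units(collection):
    """Return the OriginOfCondition URIs of entries not yet marked resolved."""
    units = []
    for entry in collection.get("Members", []):
        if entry.get("Resolved", False):
            continue  # unit was repaired and is back in service
        origin = entry.get("Links", {}).get("OriginOfCondition", {})
        units.append(origin.get("@odata.id"))
    return units


print(unresolved_units(SAMPLE))
```

A client would fetch the collection from the Entries URI and pass the decoded
JSON to the same function.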

>
> >
> > Requirements
> > Isolate a hardware unit when the user requests it.
> > Get the list of all isolated resources.
> > Remove the isolation of a hardware unit.
> > Remove all existing isolation.
> >
> > Isolating a hardware unit:
> > redfish >> v1 >> Systems >> system >> LogServices >> IsolatedHardware
> > {
> >   "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware",
> >   "@odata.type": "#LogService.v1_2_0.LogService",
> >   "Actions": {
> >     "#LogService.CollectDiagnosticData": {
> >       "target":
> > "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Actions/LogService.CollectDiagnosticData"
>
> What is this action intended to do?
>
> >     }
> >   },
> >   "Description": "Isolated Hardware",
> >   "Entries": {
> >     "@odata.id":
> > "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries"
> >   },
> >   "Id": "IsolatedHardware",
> >   "Name": "Isolated Hardware LogService",
> >   "OverWritePolicy": "WrapsWhenFull"
> >
> > Listing isolated hardware units.
> > redfish >> v1 >> Systems >> system >> LogServices >> IsolatedHardware >> Entries
> > {
> >   "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries",
> >   "@odata.type": "#LogEntryCollection.LogEntryCollection",
> >   "Description": "Collection of Isolated Hardware Components",
> >   "Members": [
> >     {
> >       "@odata.id":
> > "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries/1",
> >       "@odata.type": "#LogEntry.v1_7_0.LogEntry",
> >       "Created": "2020-10-15T10:30:08+00:00",
> >       "EntryType": "Event",
> >       "Id": "1",
> >       "Resolved": "false",
>
> LogEntry doesn't have a "Resolved" field that I can see.
>
> >       "Name": "Processor 1",
> >       "Links": {
> >         "OriginOfCondition": {
> >           "@odata.id": "/redfish/v1/Systems/system/Processors/cpu1"
> >         }
> >       },
> >       "Severity": "Critical",
> >        "SensorType" : "Processor",
>
> SensorType doesn't really make sense in this case, as you're not
> reporting errors from a sensor, but from a resource.
>
> >
> >       "AdditionalDataURI": "/redfish/v1/Systems/system/LogServices/EventLog/attachement/111",
> >       "AdditionalDataSizeBytes": 1024
> >
> >   }
> >   ],
> >   "Members@odata.count": 1,
> >   "Name": "Isolated Hardware Entries"
> >
> > Users will be able to delete any single entry or all the entries. If an
> > isolated unit is serviced, that unit will be back in service; in such
> > cases the "Resolved" property in the entry will be marked as "true".
> > "AdditionalDataURI": This is a link to the error log associated with
> > this isolation action.
> > --------------
> > Dhruvaraj S
>
>
> I suspect overall you need to separate this into two different
> resources.  One for logging things that have happened in the past,
> under log service, and one for interacting directly with the system in
> its current state.  The second one would likely take the form of being
> able to set the Status property to something like "Disabled",
> "UnavailableOffline", or something similar on your Processor
> resources.

The log service is already being used to generate the dump, which is a
user-initiated action in the log service, so I am thinking the
user-initiated isolation can also be in the same place.
But as you suggested, setting Disabled/UnavailableOffline on the list of
units is also a good option; I need to look more into that.
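
For reference, a rough Python sketch of what the Status-based alternative
might look like from a client. This is an assumption on my side, not
something the current implementation is known to accept: the Status property
is often read-only, the BMC address is a placeholder, and the helper name
build_isolation_request is hypothetical. The request is only built here, not
sent.

```python
import json
import urllib.request

BMC = "https://bmc.example"  # placeholder BMC address (assumption)


def build_isolation_request(processor_uri, state="Disabled"):
    """Build (but do not send) a PATCH that sets Status.State on a resource."""
    body = json.dumps({"Status": {"State": state}}).encode("utf-8")
    return urllib.request.Request(
        BMC + processor_uri,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


req = build_isolation_request("/redfish/v1/Systems/system/Processors/cpu1")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing state="UnavailableOffline" gives the other value mentioned above;
which value fits best would need to be decided as part of the design.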

-- 
--------------
Dhruvaraj S
