Proposal for operations on isolated hardware units using Redfish logging
dhruvaraj S
dhruvaraj at gmail.com
Fri Dec 11 22:27:22 AEDT 2020
On Thu, Dec 10, 2020 at 10:59 PM Ed Tanous <ed at tanous.net> wrote:
>
> On Thu, Dec 10, 2020 at 7:49 AM dhruvaraj S <dhruvaraj at gmail.com> wrote:
> >
> > Hi,
> > Please find the option for operations on isolated hardware units using
> > Redfish logging
> >
> >
> > Hardware Isolation
> > On systems with multiple processor units and other redundant vital resources,
> > the system downtime can be prevented by isolating the faulty hardware units.
> > Most of the actions required to isolate the parts depend on the
> > architecture and are executed in the host. But the BMC needs to support
> > a few steps: providing a method for users to query the units in
> > isolation, clearing isolation, isolating a suspected part, and isolating
> > a unit when the host is down due to a fault in a critical unit.
> > Since a user interface is needed for these actions, I am proposing a
> > method that uses the Redfish log service to carry them out.
>
> Right off the bat, LogServices seems like a strange choice for this.
> In your requirements, you're taking actions on the unit itself, not
> logging the actions that occurred, so I'm struggling to see the design
> choice here. Can you elaborate why LogService, something intended to
> be for historical logging, would be appropriate for a design that
> needs to accept user action?
Apart from user-requested isolation of a hardware unit, hardware units
usually get isolated because of a past event in the system. For example,
if a processor core encounters an error while running and cannot continue
in service, it will be listed as isolated. A method is needed to show the
list of such units to users. Since the log service exists to show such
logs, I think it is suitable for that. After the repair, once the unit is
back in service, this log service entry will be marked as resolved.
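
As a minimal sketch of the listing described above, the snippet below walks
a collection shaped like the proposed IsolatedHardware entries and reports
the units that are still isolated. It assumes "Resolved" is a boolean and
that the link object is spelled "Links"; the helper name and the second
sample entry (cpu2) are made up for illustration and are not part of the
proposal.

```python
# Sketch only: field names follow the proposal in this thread.
# Assumptions: "Resolved" is a boolean, the link object is named "Links",
# and unresolved_isolations() is a hypothetical helper name.

def unresolved_isolations(collection):
    """Return (entry Id, OriginOfCondition URI) for entries not yet resolved."""
    result = []
    for entry in collection.get("Members", []):
        if entry.get("Resolved", False):
            continue  # unit was serviced and is back in service
        origin = entry.get("Links", {}).get("OriginOfCondition", {})
        result.append((entry.get("Id"), origin.get("@odata.id")))
    return result


# Made-up sample data in the shape of the proposed entries collection.
sample = {
    "Members": [
        {
            "Id": "1",
            "Name": "Processor 1",
            "Resolved": False,
            "Links": {
                "OriginOfCondition": {
                    "@odata.id": "/redfish/v1/Systems/system/Processors/cpu1"
                }
            },
        },
        {
            "Id": "2",
            "Name": "Processor 2",
            "Resolved": True,
            "Links": {
                "OriginOfCondition": {
                    "@odata.id": "/redfish/v1/Systems/system/Processors/cpu2"
                }
            },
        },
    ]
}

print(unresolved_isolations(sample))
```

Only the first entry is reported, since the second one is marked resolved.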
>
> >
> > Requirements
> > Isolate a hardware unit when the user requests it.
> > Get the list of all isolated resources.
> > Remove the isolation of a hardware unit.
> > Remove all existing isolation.
> >
> > Isolating a hardware unit:
> > redfish >> v1 >> Systems >> system >> LogServices >> IsolatedHardware
> > {
> > "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware",
> > "@odata.type": "#LogService.v1_2_0.LogService",
> > "Actions": {
> > "#LogService.CollectDiagnosticData": {
> > "target":
> > "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Actions/LogService.CollectDiagnosticData"
>
> What is this action intended to do?
>
> > }
> > },
> > "Description": "Isolated Hardware",
> > "Entries": {
> > "@odata.id":
> > "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries"
> > },
> > "Id": "IsolatedHardware",
> > "Name": "Isolated Hardware LogService",
> > "OverWritePolicy": "WrapsWhenFull"
> >
> > Listing isolated hardware units.
> > redfish >> v1 >> Systems >> system >> LogServices >> IsolatedHardware >> Entries
> > {
> > "@odata.id": "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries",
> > "@odata.type": "#LogEntryCollection.LogEntryCollection",
> > "Description": "Collection of Isolated Hardware Components",
> > "Members": [
> > {
> > "@odata.id":
> > "/redfish/v1/Systems/system/LogServices/IsolatedHardware/Entries/1",
> > "@odata.type": "#LogEntry.v1_7_0.LogEntry",
> > "Created": "2020-10-15T10:30:08+00:00",
> > "EntryType": "Event",
> > "Id": "1",
> > "Resolved": "false",
>
> LogEntry doesn't have a "Resolved" field that I can see.
>
> > "Name": "Processor 1",
> > "links": {
> > "OriginOfCondition": {
> > "@odata.id":
> > "/redfish/v1/Systems/system/Processors/cpu1"
> > },
> > "Severity": "Critical",
> > "SensorType" : "Processor",
>
> SensorType doesn't really make sense in this case, as you're not
> reporting errors from a sensor, but from a resource.
>
> >
> > "AdditionalDataURI":
> > "/redfish/v1/Systems/system/LogServices/EventLog/attachement/111",
> > "AdditionalDataSizeBytes": "1024"
> >
> > }
> > ],
> > "Members@odata.count": 1,
> > "Name": "Isolated Hardware Entries"
> >
> > Users will be able to delete any entry or all the entries. If an
> > isolated unit is serviced, that unit will be back in service; in
> > such cases the "Resolved" property in the entry will be marked as
> > "true".
> > "AdditionalDataURI": This is a link to the error log associated with
> > this isolation action.
> > --------------
> > Dhruvaraj S
>
>
> I suspect overall you need to separate this into two different
> resources. One for logging things that have happened in the past,
> under log service, and one for interacting directly with the system in
> its current state. The second one would likely take the form of being
> able to set the Status property to something like "Disabled",
> "UnavailableOffline", or something similar on your Processor
> resources.
The log service is already used to generate dumps, which is a
user-initiated action in the log service, so I am thinking user-initiated
isolation can live in the same place as well.
But as you suggested, setting Disabled/UnavailableOffline on the list of
units is also a good option; I need to look into that more.
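
For comparison, the alternative suggested above could look roughly like the
sketch below, which only builds (and does not send) a PATCH request against
a Processor resource. Several things here are assumptions: that the value
goes into Status.State, that an implementation would accept a write to it,
the "Disabled" value, the host name bmc.example, and the helper name.

```python
import json
import urllib.request

# Sketch only. Assumptions: Status.State is the property to set, the BMC
# accepts a PATCH on it, and "https://bmc.example" is a placeholder host.
# build_isolation_request() is a hypothetical helper name.

def build_isolation_request(base_url, processor_uri, state="Disabled"):
    """Build a PATCH request asking the BMC to change a unit's state."""
    body = json.dumps({"Status": {"State": state}}).encode("utf-8")
    return urllib.request.Request(
        base_url + processor_uri,
        data=body,
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


req = build_isolation_request(
    "https://bmc.example",
    "/redfish/v1/Systems/system/Processors/cpu1",
)

# The request is only constructed here; nothing is sent on the network.
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

With this approach the current state lives on the Processor resource itself,
and the log service would only record what happened in the past.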
--
--------------
Dhruvaraj S