[Proposal] Implementing SensorEvent message registry

Amithash Prasad amithash at meta.com
Wed Aug 21 03:08:14 AEST 2024


Folks,

I am requesting feedback from the community on a direction to get our sensor services to log threshold events compatible to
ones defined in redfish message registry -- SensorEvent [7, ch2.14].
This will be based on and probably the first implementation of the recently proposed enhancements to event logging[1].

Problem Statement:

We have two primary methods of creating threshold based events.
(1) Legacy, used only by dbus-sensors, set the ThresholdAsserted signal[2] and subscribed by phosphor-sel-logger[3] Which creates redfish events/messages.
(2) New, used by others as specified in phosphor-dbus-sensors for Critical[4] and Warning[5]. Current sensor services which raise this signal:
Phosphor-hwmon[6], phosphor-nvme, and phosphor-virtual-sensor raises a different set of signal[6], using the phosphor-dbus-interface

Phosphor-sel-logger subscribes to both these signals (Legacy[2] and New[3]) in different implementations both translating them to Redfish Message IDs:
```
OpenBMC.0.1.SensorThresholdCriticalLowGoingLow
OpenBMC.0.1.SensorThresholdCriticalLowGoingHigh
OpenBMC.0.1.SensorThresholdWarningLowGoingLow
OpenBMC.0.1.SensorThresholdWarningLowGoingHigh
OpenBMC.0.1.SensorThresholdWarningHighGoingHigh
OpenBMC.0.1.SensorThresholdWarningHighGoingLow
OpenBMC.0.1.SensorThresholdCriticalHighGoingHigh
OpenBMC.0.1.SensorThresholdCriticalHighGoingLow
```
These messages export 3 args: Sensor name, Sensor Reading Value, Threshold Value.

While these provide value and have served us well, they have a main drawback:

They are not part of the redfish standard which specifies explicit messages around sensor threshold events.
Having events emitted which are specified in the redfish standard message registry have very specific advantages
for the client side who can use a consistent library.

Proposed Messages:
I plan on creating new events which are based and intended to provide Redfish SensorEvent 1.0.0[7, ch2.14]. 
```
SensorEvent.1.0.ReadingAboveLowerCriticalThreshold (Warning)
SensorEvent.1.0.ReadingAboveLowerFatalThreshold (Critical)
SensorEvent.1.0.ReadingAboveUpperCautionThreshold (Warning)
SensorEvent.1.0.ReadingAboveUpperCriticalThreshold (Critical)
SensorEvent.1.0.ReadingAboveUpperFatalThreshold (Critical)
SensorEvent.1.0.ReadingBelowLowerCautionThreshold (Warning)
SensorEvent.1.0.ReadingBelowLowerCriticalThreshold (Critical)
SensorEvent.1.0.ReadingBelowLowerFatalThreshold (Critical)
SensorEvent.1.0.ReadingBelowUpperCriticalThreshold (Warning)
SensorEvent.1.0.ReadingBelowUpperFatalThreshold (Critical)
SensorEvent.1.0.SensorReadingNormalRange (OK)
```
All except the last one of these provide 4 arguments:
string: The name or identifier of the sensor. This argument shall contain a string that identifies or describes the Sensor resource.
number: The reading of the sensor. This argument shall contain a number that equals the value of the Reading property of the
Sensor resource.
string: The reading units of measure. This argument shall contain a string that equals the value of the ReadingUnits property
of the Sensor resource. Unitless readings should use count .
number: The threshold value. This argument shall contain a number that equals the value of the Reading property within the
<threshold> property of the Sensor resource.
SensorEvent.1.0.0.SensorReadingNormalRange would have only the first 3 arguments.

Example behavior (as proposed in the spec): as the sensor value rises above nominal range, You will see 
ReadingAboveUpperCautionThreshold, then
ReadingAboveUpperCriticalThreshold then
ReadingAboveUpperFatalThreshold.

As the sensor readings start to drop, we would see the reverse order:
ReadingBelowUpperFatalThreshold (Marks ReadingAboveUpperFatalThreshold as resolved), then
ReadingBelowUpperCriticalThreshold (Marks ReadingAboveUpperCriticalThreshold as resolved) then
SensorReadingNormalRange (Marks ReadingAboveUpperCautionThreshold as resolved)

This is an imperfect match to our existing sensor events from phosphor-sel-logger:
(1) While Caution and Critical matches are 1:1 match to Warning and Critical respectively from phosphor-sel-logging,
we have multiple equivalents for Fatal (SoftShutdown/HardShutdown).
(2) Redfish adds an explicit message SensorEvent.1.0.0.SensorReadingNormalRange to indicate that the sensor is truly
back in normal operating range, with the existing messages we need to infer from pre-knowledge of thresholds.
(3) The args are similar, but OpenBMC.0.1.* events do not contain the "Units"[3] of the sensor.

Having OpenBMC implement the standard redfish messages would really help in allowing a more standard operation
from the end user which allows for better code sharing on the client side.

Implementation Proposal:

(1) Add a new phosphor-dbus-interfaces/yaml/xyz/openbmc_project/Sensor/Threshold.events.yaml
(which will define the Redfish SensorEvents.1.0 events).
This will be a downstream effort from the events enhancement design proposal[1].

This will add definitions for all of SensorEvent.1.0.* (Not just the ones provided as examples earlier),
but the first wave of implementation will focus more on threshold events.

Example snipped of the Threshold.events.yaml
```
version: 1.0.0

errors:
  - name: ReadingAboveUpperCriticalThreshold
    severity: critical
    metadata:
      - name: sensor_name
        type: string
        primary: true
      - name: reading_value
        type: double
        primary: true
      - name: units
        type: enum[Value.interface.Unit]
        primary: true
      - name: threshold_value
        type: double
        primary: true
    redfish-mapping: SensorEvent.1.0.0.ReadingAboveUpperCriticalThreshold

events:

  - name: SensorReadingNormalRange
    severity: notice
    metadata:
      - name: sensor_name
        type: string
        primary: true
      - name: reading_value
        type: float/double
        primary: true
      - name: units
        type: string/might be better to be a enum
        primary: true
    redfish-mapping: SensorEvent.1.0.0.SensorReadingNormalRange
```

(2) Have each of the sensor monitoring services (dbus-sensors[9], pldm [10] when it lands, virtual-sensors[11],
phosphor-hwmon[12]) updated to use the appropriate new message/exceptions.
We can track thresholds as they exceed and log.
Each service will log an appropriate event as the value exceeds a threshold. (Example, reading exceeding Caution
Threshold and then exceeding Critical Threshold) will log two events. Each logged event will have an appropriate
uniquely identifiable Hint. (More at [8]). The service should also keep local state of what alerts were already raised
(so it does not raise a duplicated alert).
As the reading starts moving towards nominal value, the service can query the logging service and resolve the
previously raised alerts. This is done using the Hint parameter used previously. At each point the service will
mark the local alert (Critical/Caution etc) as resolved for that sensor.
Once there are no longer any alarms. (Sensor value below Caution threshold) and a previous alert was raised.
The service can clear its local assert and log the SensorReadingNormalRange event. This would not need an
associated hint.

Keep Backward Compatibility

Let the existing services continue to use the signals as is. If there are no listeners, it's probably a no-op. We can keep the
Threshold signal and Threshold Alarms. These are also good mechanisms for other services to take actions based on
sensor thresholds since this only changes the logging behavior around these events; we should not expect additional behavior changes.

Users could use the capability planned for the changes to turn off these new events for systems already using phosphor-sel-logger.

Alternative Considered

While SensorEvent.1.0.X is the most generic and comprehensive, Redfish Message Registry[7] also specifies more sensor specific type events.
Example:
Power 1.0.0 specifies:
```
[Power|Current|Voltage][Above|Below][Upper|Lower][Caution|Critical|Fatal]Threshold and [Power|Current|Voltage]NoLongerCritical
```
Arguments are the same as SensorEvent args with the obvious omission of the “Units” argument.
Environmental 1.0.0 specifies
```
[Humidity|Temperature][Above|Below][Upper|Lower][Caution|Critical|Fatal]Threshold and TemperatureNoLongerCritical, HumidityNormal.
```
Arguments are the same as SensorEvent args with the omission of the “Units” argument.

The reason I chose not to consider these are:
(1) There are no entries for some sensor types like Airflow (CFM), Fan speed (RPM), Fan PWM. Hence for a majority of these we would
need to default to the SensorEvent.
(2) I found the Humidity events inconsistent with Temperature.
(3) The code would be less elegant with a lot of units based case switch statements splattered all over.

Thus my recommendation of sticking to the more generic SensorEvent.1.0.0.

References:
[1] https://github.com/openbmc/docs/blob/master/designs/event-logging.md
[2] https://github.com/openbmc/dbus-sensors/blob/master/src/Thresholds.cpp#L468
[3] https://github.com/openbmc/phosphor-sel-logger/blob/master/include/threshold_event_monitor.hpp#L329C27-L329C45
[4] https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Sensor/Threshold/Critical.interface.yaml
[5] https://github.com/openbmc/phosphor-dbus-interfaces/blob/master/yaml/xyz/openbmc_project/Sensor/Threshold/Warning.interface.yaml
[6] https://github.com/openbmc/phosphor-hwmon/blob/master/thresholds.hpp#L75
[7] https://www.dmtf.org/sites/default/files/standards/documents/DSP2065_2023.1.pdf
[8] https://github.com/openbmc/docs/blob/master/designs/event-logging.md#phosphor-logging
[9] https://github.com/openbmc/dbus-sensors/blob/master/src/Thresholds.cpp
[10] https://gerrit.openbmc.org/c/openbmc/pldm/+/51489/135/platform-mc/numeric_sensor.cpp#657
[11] https://github.com/openbmc/phosphor-virtual-sensor/blob/master/thresholds.hpp#L173
[12] https://github.com/openbmc/phosphor-hwmon/blob/master/thresholds.hpp

Thanks,
Amithash


More information about the openbmc mailing list