[PATCH v8 4/9] docs: misc-devices: (smpro-errmon) Add documentation
Quan Nguyen
quan at os.amperecomputing.com
Fri Apr 22 12:46:48 AEST 2022
Adds documentation for Ampere(R)'s Altra(R) SMpro errmon driver.
Signed-off-by: Thu Nguyen <thu at os.amperecomputing.com>
Signed-off-by: Quan Nguyen <quan at os.amperecomputing.com>
---
Changes in v8:
+ Update to reflect single value per sysfs [Quan]
Changes in v7:
+ None
Changes in v6:
+ First introduced in v6 [Quan]
Documentation/misc-devices/index.rst | 1 +
Documentation/misc-devices/smpro-errmon.rst | 198 ++++++++++++++++++++
2 files changed, 199 insertions(+)
create mode 100644 Documentation/misc-devices/smpro-errmon.rst
diff --git a/Documentation/misc-devices/index.rst b/Documentation/misc-devices/index.rst
index 30ac58f81901..7a6a6263cbab 100644
--- a/Documentation/misc-devices/index.rst
+++ b/Documentation/misc-devices/index.rst
@@ -26,6 +26,7 @@ fit into other categories.
lis3lv02d
max6875
pci-endpoint-test
+ smpro-errmon
spear-pcie-gadget
uacce
xilinx_sdfec
diff --git a/Documentation/misc-devices/smpro-errmon.rst b/Documentation/misc-devices/smpro-errmon.rst
new file mode 100644
index 000000000000..53599904da70
--- /dev/null
+++ b/Documentation/misc-devices/smpro-errmon.rst
@@ -0,0 +1,198 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+Kernel driver Ampere(R)'s Altra(R) SMpro errmon
+===============================================
+
+Supported chips:
+
+ * Ampere(R) Altra(R)
+
+ Prefix: 'smpro'
+
+ Reference: Altra SoC BMC Interface Specification
+
+Author: Thu Nguyen <thu at os.amperecomputing.com>
+
+Description
+-----------
+
+This driver supports hardware monitoring for Ampere(R) Altra(R) SoCs based on the
+SMpro co-processor (SMpro).
+The following SoC alert/event types are supported by the errmon driver:
+
+* Core CE/UE errors
+* Memory CE/UE errors
+* PCIe CE/UE errors
+* Other CE/UE errors
+* Internal SMpro/PMpro errors
+* VRD hot
+* VRD warn/fault
+* DIMM Hot
+
+The SMpro interface provides registers to query the status and data of the SoC
+alerts/events; this driver exports them to userspace.
+
+Usage Notes
+-----------
+
+The SMpro errmon driver creates a sysfs file for each host alert/event type.
+Example: ``error_core_ce`` reports Core CE errors.
+
+To query a host alert/event type, read the corresponding sysfs file.
+
+* If no alert/event is present, the sysfs file reads as empty.
+* If alerts/events are present, each read of the sysfs file returns the oldest one, until all of them have been read out.
+
+The format of the error lines depends on the alert/event type.
+
+1) Type 1 for Core/Memory/PCIe/Other CE/UE alert types::
+
+ <Error Type><Error SubType><Instance><Error Status><Error Address><Error Misc 0><Error Misc 1><Error Misc 2><Error Misc 3>
+
+ Where:
+ * Error Type: The hardware that caused the error, as two hex characters.
+ * Error SubType: The sub type of the error within that hardware, as two hex characters.
+ * Instance: The combination of socket, channel and slot that caused the error, as four hex characters.
+ * Error Status: The encoded error status, as eight hex characters.
+ * Error Address: The device address that caused the error, as sixteen hex characters.
+ * Error Misc 0/1/2/3: Additional information about the error. Each field is sixteen hex characters.
+
+ Example:
+ # cat error_other_ce
+ 0a020000000030e400000000000000800000020000000000000000000000000000000000000000000000000000000000
+
+ The alert buffer for this error type holds 8 alerts.
+ When the buffer has overflowed, reading overflow_other_ce returns 1; otherwise it returns 0.
+
+ Example:
+ # cat overflow_other_ce
+ 1
+
+The table below defines the values of Error Type, Sub type, Sub component and Instance:
+
+ ============  ==========  =========  ===============  ======================================
+ Error Group   Error Type  Sub type   Sub component    Instance
+ CPM           0           0          Snoop-Logic      CPM #
+ CPM           0           2          Armv8 Core 1     CPM #
+ MCU           1           1          ERR1             MCU # | SLOT << 11
+ MCU           1           2          ERR2             MCU # | SLOT << 11
+ MCU           1           3          ERR3             MCU #
+ MCU           1           4          ERR4             MCU #
+ MCU           1           5          ERR5             MCU #
+ MCU           1           6          ERR6             MCU #
+ MCU           1           7          Link Error       MCU #
+ Mesh          2           0          Cross Point      X | (Y << 5) | NS << 11
+ Mesh          2           1          Home Node(IO)    X | (Y << 5) | NS << 11
+ Mesh          2           2          Home Node(Mem)   X | (Y << 5) | NS << 11 | device << 12
+ Mesh          2           4          CCIX Node        X | (Y << 5) | NS << 11
+ 2P Link       3           0          N/A              Altra 2P Link #
+ GIC           5           0          ERR0             0
+ GIC           5           1          ERR1             0
+ GIC           5           2          ERR2             0
+ GIC           5           3          ERR3             0
+ GIC           5           4          ERR4             0
+ GIC           5           5          ERR5             0
+ GIC           5           6          ERR6             0
+ GIC           5           7          ERR7             0
+ GIC           5           8          ERR8             0
+ GIC           5           9          ERR9             0
+ GIC           5           10         ERR10            0
+ GIC           5           11         ERR11            0
+ GIC           5           12         ERR12            0
+ GIC           5           13-21      ERR13            RC # + 1
+ SMMU          6           100        TCU              RC #
+ SMMU          6           0          TBU0             RC #
+ SMMU          6           1          TBU1             RC #
+ SMMU          6           2          TBU2             RC #
+ SMMU          6           3          TBU3             RC #
+ SMMU          6           4          TBU4             RC #
+ SMMU          6           5          TBU5             RC #
+ SMMU          6           6          TBU6             RC #
+ SMMU          6           7          TBU7             RC #
+ SMMU          6           8          TBU8             RC #
+ SMMU          6           9          TBU9             RC #
+ PCIe AER      7           0          Root             RC #
+ PCIe AER      7           1          Device           RC #
+ PCIe RC       8           0          RCA HB           RC #
+ PCIe RC       8           1          RCB HB           RC #
+ PCIe RC       8           8          RASDP            RC #
+ OCM           9           0          ERR0             0
+ OCM           9           1          ERR1             0
+ OCM           9           2          ERR2             0
+ SMpro         10          0          ERR0             0
+ SMpro         10          1          ERR1             0
+ SMpro         10          2          MPA_ERR          0
+ PMpro         11          0          ERR0             0
+ PMpro         11          1          ERR1             0
+ PMpro         11          2          MPA_ERR          0
+ ============  ==========  =========  ===============  ======================================
+
+
+2) Type 2 for the Internal SMpro/PMpro alert types::
+
+ <Error Type><Error SubType><Direction><Error Location><Error Code><Error Data>
+
+ Where:
+ * Error Type: SMpro/PMpro error type, as two hex characters.
+ + 1: Warning
+ + 2: Error
+ + 4: Error with data
+ * Error SubType: SMpro/PMpro image code, as two hex characters.
+ * Direction: Direction, as two hex characters.
+ + 0: Enter
+ + 1: Exit
+ * Error Location: SMpro/PMpro module location code, as two hex characters.
+ * Error Code: SMpro/PMpro error code, as four hex characters.
+ * Error Data: Extended data, as eight hex characters.
+ All bits are 0 when the Error Type is Warning or Error.
+
+ Example:
+ # cat errors_smpro
+ 01040108003500000000
+
+3) Type 3 for the VRD hot, VRD warn/fault and DIMM hot events::
+
+ <Event Channel><Event Data>
+
+ Where:
+ * Event Channel: The event source, as two hex characters.
+ 00: VRD warn/fault
+ 01: VRD hot
+ 02: DIMM hot
+ * Event Data: Extended data, if any, as four hex characters.
+
+ Example:
+ # cat event_vrd_hot
+ 010000
+
+Sysfs entries
+-------------
+
+The following sysfs files are supported:
+
+* Ampere(R) Altra(R):
+
+Alert Types:
+
+ =================  ===============  ========================================================  ======
+ Alert Type         Sysfs name       Description                                               Format
+ Core CE Errors     errors_core_ce   Triggered by CPU when Core has a CE error                 1
+ Core UE Errors     errors_core_ue   Triggered by CPU when Core has a UE error                 1
+ Memory CE Errors   errors_mem_ce    Triggered by CPU when Memory has a CE error               1
+ Memory UE Errors   errors_mem_ue    Triggered by CPU when Memory has a UE error               1
+ PCIe CE Errors     errors_pcie_ce   Triggered by CPU when any PCIe controller has a CE error  1
+ PCIe UE Errors     errors_pcie_ue   Triggered by CPU when any PCIe controller has a UE error  1
+ Other CE Errors    errors_other_ce  Triggered by CPU on any other CE error                    1
+ Other UE Errors    errors_other_ue  Triggered by CPU on any other UE error                    1
+ SMpro Errors       errors_smpro     Triggered by CPU when the system has an SMpro error       2
+ PMpro Errors       errors_pmpro     Triggered by CPU when the system has a PMpro error        2
+ =================  ===============  ========================================================  ======
+
+Event Types:
+
+ ==============  ====================  ==========  ========================
+ Event           Sysfs name            Event Type  Sub Type
+ VRD Hot         event_vrd_hot         0           0: SoC, 1: Core, 2: DIMM
+ VRD Warn/Fault  event_vrd_warn_fault  1           0: SoC, 1: Core, 2: DIMM
+ DIMM Hot        event_dimm_hot        2           N/A (default 0)
+ ==============  ====================  ==========  ========================
--
2.35.1
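
For readers of the new document, here is a minimal userspace sketch of how the
three record formats it describes could be decoded. It is not part of this
patch or of the driver. The names split_record, drain and the *_FIELDS tables
are hypothetical, and the field widths are taken from the Type 1/2/3 format
descriptions in smpro-errmon.rst. The read_once callable stands in for one read
of a sysfs file; the full sysfs path is not given in the document, so it is
left to the caller.

```python
from typing import Callable, Dict, List, Optional, Tuple

Fields = List[Tuple[str, int]]  # (field name, width in hex characters)

# Layouts follow the Type 1, Type 2 and Type 3 format lines in the document.
TYPE1_FIELDS: Fields = [("error_type", 2), ("sub_type", 2), ("instance", 4),
                        ("status", 8), ("address", 16),
                        ("misc0", 16), ("misc1", 16),
                        ("misc2", 16), ("misc3", 16)]
TYPE2_FIELDS: Fields = [("error_type", 2), ("image_code", 2), ("direction", 2),
                        ("location", 2), ("error_code", 4), ("data", 8)]
TYPE3_FIELDS: Fields = [("channel", 2), ("data", 4)]


def split_record(line: str, fields: Fields) -> Optional[Dict[str, int]]:
    """Split one line read from sysfs into integer fields.

    Returns None for an empty read, which the document defines as
    "no alert/event present".
    """
    text = line.strip()
    if not text:
        return None
    expected = sum(width for _, width in fields)
    if len(text) != expected:
        raise ValueError(f"expected {expected} hex characters, got {len(text)}")
    record: Dict[str, int] = {}
    pos = 0
    for name, width in fields:
        # int(..., 16) raises ValueError on non-hex input.
        record[name] = int(text[pos:pos + width], 16)
        pos += width
    return record


def drain(read_once: Callable[[], str], fields: Fields) -> List[Dict[str, int]]:
    """Read until the file reads as empty; records come back oldest first."""
    records: List[Dict[str, int]] = []
    while True:
        record = split_record(read_once(), fields)
        if record is None:
            return records
        records.append(record)
```

For the Type 2 sample in the document, split_record("01040108003500000000",
TYPE2_FIELDS) yields error_type 1 (Warning), direction 1 (Exit) and error_code
0x35. A caller that cares about lost alerts would presumably also check the
matching overflow file after draining.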