[Skiboot] [PATCH 010/110] doc: prettify and expand OPAL_HANDLE_HMI2 docs

Stewart Smith stewart at linux.ibm.com
Fri May 31 16:12:11 AEST 2019


Signed-off-by: Stewart Smith <stewart at linux.ibm.com>
---
 doc/opal-api/opal-handle-hmi-98-166.rst | 187 +++++++++++++-----------
 doc/opal-api/opal-messages.rst          |   2 +
 2 files changed, 106 insertions(+), 83 deletions(-)

diff --git a/doc/opal-api/opal-handle-hmi-98-166.rst b/doc/opal-api/opal-handle-hmi-98-166.rst
index 950e0c4ef7fa..5959e5e90194 100644
--- a/doc/opal-api/opal-handle-hmi-98-166.rst
+++ b/doc/opal-api/opal-handle-hmi-98-166.rst
@@ -1,34 +1,34 @@
 Hypervisor Maintenance Interrupt (HMI)
 ======================================
 
-  Hypervisor Maintenance Interrupt usually reports error related to processor
-  recovery/checkstop, NX/NPU checkstop and Timer facility. Hypervisor then
-  takes this opportunity to analyze and recover from some of these errors.
-  Hypervisor takes assistance from OPAL layer to handle and recover from HMI.
-  After handling HMI, OPAL layer sends the summary of error report and status
-  of recovery action using HMI event. See ref: `opal-messages.rst` for HMI
-  event structure under ```OPAL_MSG_HMI_EVT``` section.
-
-  HMI is thread specific. The reason for HMI is available in a per thread
-  Hypervisor Maintenance Exception Register (HMER). A Hypervisor Maintenance
-  Exception Enable Register (HMEER) is per core. Bits from the HMER need to
-  be enabled by the corresponding bits in the HMEER in order to cause an HMI.
-
-  Several interrupt reasons are routed in parallel to each of the thread
-  specific copies. Each thread can only clear bits in its own HMER. OPAL
-  handler from each thread clears the respective bit from HMER register
-  after handling the error.
+Hypervisor Maintenance Interrupt usually reports error related to processor
+recovery/checkstop, NX/NPU checkstop and Timer facility. Hypervisor then
+takes this opportunity to analyze and recover from some of these errors.
+Hypervisor takes assistance from OPAL layer to handle and recover from HMI.
+After handling HMI, OPAL layer sends the summary of error report and status
+of recovery action using HMI event. See ref:`opal-messages` for HMI
+event structure under :ref:`OPAL_MSG_HMI_EVT` section.
+
+HMI is thread specific. The reason for HMI is available in a per thread
+Hypervisor Maintenance Exception Register (HMER). A Hypervisor Maintenance
+Exception Enable Register (HMEER) is per core. Bits from the HMER need to
+be enabled by the corresponding bits in the HMEER in order to cause an HMI.
+
+Several interrupt reasons are routed in parallel to each of the thread
+specific copies. Each thread can only clear bits in its own HMER. OPAL
+handler from each thread clears the respective bit from HMER register
+after handling the error.
 
 List of errors that causes HMI
 ==============================
 
-  - CPU Errors
+ - CPU Errors
 
    - Processor Core checkstop
    - Processor retry recovery
    - NX/NPU/CAPP checkstop.
 
-  - Timer facility Errors
+ - Timer facility Errors
 
    - ChipTOD Errors
 
@@ -36,7 +36,7 @@ List of errors that causes HMI
     - ChipTOD configuration register parity errors
     - ChiTOD topology failover
 
-   - Timebase (TB) errors
+ - Timebase (TB) errors
 
     - TB parity/residue error
     - TFMR parity and firmware control error
@@ -45,84 +45,95 @@ List of errors that causes HMI
 HMI handling
 ============
 
-   A core/NX/NPU checkstops are reported as malfunction alert (HMER bit 0).
-   OPAL handler scans through Fault Isolation Register (FIR) for each
-   core/nx/npu to detect the exact reason for checkstop and reports it back
-   to the host alongwith the disposition.
-
-   A processor recovery is reported through HMER bits 2, 3 and 11. These are
-   just an informational messages and no extra recovery is required.
-
-   Timer facility errors are reported through HMER bit 4. These are all
-   recoverable errors. The exact reason for the errors are stored in
-   Timer Facility Management Register (TFMR). Some of the Timer facility
-   errors affects TB and some of them affects TOD. TOD is a per chip
-   Time-Of-Day logic that holds the actual time value of the chip and
-   communicates with every TOD in the system to achieve synchronized
-   timer value within a system. TB is per core register (64-bit) derives its
-   value from ChipTOD at startup and then it gets periodically incremented
-   by STEP signal provided by the TOD. In a multi-socket system TODs are
-   always configured as master/backup TOD under primary/secondary
-   topology configuration respectively.
-
-   TB error generates HMI on all threads of the affected core. TB errors
-   except DEC/HDEC/PURR/SPURR parity errors, causes TB to stop running
-   making it invalid. As part of TB recovery, OPAL hmi handler synchronizes
-   with all threads, clears the TB errors and then re-sync the TB with TOD
-   value putting it back in running state.
-
-   TOD errors generates HMI on every core/thread of affected chip. The reason
-   for TOD errors are stored in TOD ERROR register (0x40030). As part of the
-   recovery OPAL hmi handler clears the TOD error and then requests new TOD
-   value from another running chipTOD in the system. Sometimes, if a primary
-   chipTOD is in error, it may need a TOD topology switch to recover from
-   error. A TOD topology switch basically makes a backup as new active master.
-
-OPAL_HANDLE_HMI and OPAL_HANDLE_HMI2
-====================================
-::
+A core/NX/NPU checkstops are reported as malfunction alert (HMER bit 0).
+OPAL handler scans through Fault Isolation Register (FIR) for each
+core/nx/npu to detect the exact reason for checkstop and reports it back
+to the host alongwith the disposition.
+
+A processor recovery is reported through HMER bits 2, 3 and 11. These are
+just an informational messages and no extra recovery is required.
+
+Timer facility errors are reported through HMER bit 4. These are all
+recoverable errors. The exact reason for the errors are stored in
+Timer Facility Management Register (TFMR). Some of the Timer facility
+errors affects TB and some of them affects TOD. TOD is a per chip
+Time-Of-Day logic that holds the actual time value of the chip and
+communicates with every TOD in the system to achieve synchronized
+timer value within a system. TB is per core register (64-bit) derives its
+value from ChipTOD at startup and then it gets periodically incremented
+by STEP signal provided by the TOD. In a multi-socket system TODs are
+always configured as master/backup TOD under primary/secondary
+topology configuration respectively.
+
+TB error generates HMI on all threads of the affected core. TB errors
+except DEC/HDEC/PURR/SPURR parity errors, causes TB to stop running
+making it invalid. As part of TB recovery, OPAL hmi handler synchronizes
+with all threads, clears the TB errors and then re-sync the TB with TOD
+value putting it back in running state.
+
+TOD errors generates HMI on every core/thread of affected chip. The reason
+for TOD errors are stored in TOD ERROR register (0x40030). As part of the
+recovery OPAL hmi handler clears the TOD error and then requests new TOD
+value from another running chipTOD in the system. Sometimes, if a primary
+chipTOD is in error, it may need a TOD topology switch to recover from
+error. A TOD topology switch basically makes a backup as new active master.
+
+.. _OPAL_HANDLE_HMI:
+
+OPAL_HANDLE_HMI
+===============
+
+.. code-block:: c
 
    #define OPAL_HANDLE_HMI	98
-   #define OPAL_HANDLE_HMI2	166
 
-``OPAL_HANDLE_HMI``
+   int64_t opal_handle_hmi(void);
 
-``OPAL_HANDLE_HMI2``
-  When OS host gets an Hypervisor Maintenance Interrupt (HMI), it must call
-  ```OPAL_HANDLE_HMI``` or ```OPAL_HANDLE_HMI2```. The ```OPAL_HANDLE_HMI```
-  is an old interface. ```OPAL_HANDLE_HMI2``` is newly introduced opal call
-  that returns direct info to Linux. It returns a 64-bit flag mask currently
-  set to provide info about which timer facilities were lost, and whether an
-  event was generated. This information will help OS to take respective
-  actions.
 
-  In case where opal hmi handler is unable to recover from TOD or TB errors,
-  it would flag ```OPAL_HMI_FLAGS_TOD_TB_FAIL``` to indicate OS that TB is
-  dead. This information then can be used by OS to make sure that the
-  functions relying on TB value (e.g. udelay()) are aware of TB not ticking.
-  This will avoid OS getting stuck or hang during its way to panic path.
+Superseded by :ref:`OPAL_HANDLE_HMI2`, meaning that :ref:`OPAL_HANDLE_HMI`
+should only be called if :ref:`OPAL_HANDLE_HMI2` is not available.
 
-OPAL_HANDLE_HMI
----------------
-Syntax: ::
+Since :ref:`OPAL_HANDLE_HMI2` has been available since the start of POWER9
+systems being supported, if you only target POWER9 and above, you can
+assume the presence of :ref:`OPAL_HANDLE_HMI2`.
 
-  int64_t opal_handle_hmi(void)
+.. _OPAL_HANDLE_HMI2:
 
 OPAL_HANDLE_HMI2
-----------------
-Syntax: ::
+================
+
+.. code-block:: c
+
+   #define OPAL_HANDLE_HMI2	166
+
+   int64_t opal_handle_hmi2(__be64 *out_flags);
+
+When OS host gets an Hypervisor Maintenance Interrupt (HMI), it must call
+:ref:`OPAL_HANDLE_HMI` or :ref:`OPAL_HANDLE_HMI2`. The :ref:`OPAL_HANDLE_HMI`
+is an old interface. :ref:`OPAL_HANDLE_HMI2` is newly introduced opal call
+that returns direct info to the OS. It returns a 64-bit flag mask currently
+set to provide info about which timer facilities were lost, and whether an
+event was generated. This information will help OS to take respective
+actions.
+
+In case where opal hmi handler is unable to recover from TOD or TB errors,
+it would flag :ref:`OPAL_HMI_FLAGS_TOD_TB_FAIL` to indicate OS that TB is
+dead. This information then can be used by OS to make sure that the
+functions relying on TB value (e.g. udelay()) are aware of TB not ticking.
+This will avoid OS getting stuck or hang during its way to panic path.
 
-  int64_t opal_handle_hmi2(__be64 *out_flags)
 
-parameters
+Parameters
 ^^^^^^^^^^
 
-  ``__be64 *out_flags``
+.. code-block:: c
 
-  Returns the 64-bit flag mask that provides info about which timer facilities
-  were lost, and whether an event was generated.
+   __be64 *out_flags;
 
-::
+Returns the 64-bit flag mask that provides info about which timer facilities
+were lost, and whether an event was generated.
+
+.. code-block:: c
 
    /* OPAL_HANDLE_HMI2 out_flags */
    enum {
@@ -132,3 +143,13 @@ parameters
         OPAL_HMI_FLAGS_TOD_TB_FAIL      = (1ull << 3), /* TOD/TB recovery failed. */
         OPAL_HMI_FLAGS_NEW_EVENT        = (1ull << 63), /* An event has been created */
    };
+
+.. _OPAL_HMI_FLAGS_TOD_TB_FAIL:
+
+OPAL_HMI_FLAGS_TOD_TB_FAIL
+  The Time of Day (TOD) / Timebase facility has failed. This is probably fatal
+  for the OS, and requires the OS to be very careful to not call any function
+  that may rely on it, usually as it heads down a `panic()` code path.
+  This code path should be :ref:`OPAL_CEC_REBOOT2` with the OPAL_REBOOT_PLATFORM_ERROR
+  option. Details of the failure are likely delivered as part of HMI events if
+  `OPAL_HMI_FLAGS_NEW_EVENT` is set.
diff --git a/doc/opal-api/opal-messages.rst b/doc/opal-api/opal-messages.rst
index e4e813aadb3b..b023622d2d26 100644
--- a/doc/opal-api/opal-messages.rst
+++ b/doc/opal-api/opal-messages.rst
@@ -65,6 +65,8 @@ or a reboot. ::
 
   params[0] = 0x01 reboot, 0x00 shutdown
 
+.. _OPAL_MSG_HMI_EVT:
+
 OPAL_MSG_HMI_EVT
 ----------------
 
-- 
2.21.0



More information about the Skiboot mailing list