[Skiboot] [PATCH v13 0/9] skiboot: OPAL support for IMC instrumentation

Madhavan Srinivasan maddy at linux.vnet.ibm.com
Mon Jun 19 17:05:38 AEST 2017

This patchset adds support for In-Memory Collection (IMC) instrumentation
services in OPAL for Power9. The IMC infrastructure consists of two kinds
of Performance Monitoring Units (PMUs): nest IMC PMUs (chip level) and
core IMC PMUs (core level).

Nest IMC PMUs are off-core but on-chip, and can be accessed via in-band
SCOMs. Programming these counters and accumulating the counter data to
memory is done by microcode running in one of the OCC engines.

Core IMC PMUs handle the per-core counters. These are initialized with
per-core PDBARs, HTM_MODE and EVENT_MASK scoms.

This patchset adds nest and core IMC instrumentation support on the
OPAL side.

"IMA_CATALOG" partition in PNOR contains multiple device tree binaries
(DTB) in a compressed form with PVR tag. So, when loading the pnor
partition, OPAL passes the system PVR as a "subid" to the load_resource
API. If a catalog dtb is found for a given pvr, it is decompressed and
linked to the main device tree.
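
The PVR-keyed lookup described above can be pictured with a small model.
This is only a sketch: the list-of-tuples layout, the function name
find_catalog_blob() and the PVR values are made up for illustration, and
exact-match comparison on the PVR is an assumption, not something this
cover letter specifies.

```python
# Hypothetical model of the IMA_CATALOG partition: a list of
# (pvr_tag, compressed_blob) entries. Layout and values are invented.
def find_catalog_blob(subpartitions, pvr):
    """Return the compressed blob tagged with `pvr`, or None if absent."""
    for tag, blob in subpartitions:
        if tag == pvr:  # assumption: the tag must match the PVR exactly
            return blob
    return None


# Made-up PVR tags and placeholder blobs, for illustration only.
catalog = [
    (0x11110000, b"blob-a"),
    (0x22220000, b"blob-b"),
]
```

A caller would pass the system PVR as the lookup key and skip the IMC
setup when None comes back.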

Commit which adds the partition to PNOR is :

The root node of an IMC catalog device tree contains nodes for the IMC
PMUs and the common events across the PMUs. Here is an excerpt from
the device tree :

/ {
        name = "";
        compatible = "ibm,opal-in-memory-counters";
        #address-cells = <0x1>;
        #size-cells = <0x1>;
        version-id = "";

        NEST_MCS: nest-mcs-events {
                #address-cells = <0x1>;
                #size-cells = <0x1>;

                event@0 {
                        event-name = "RRTO_QFULL_NO_DISP";
                        reg = <0x0 0x8>;
                        desc = "RRTO not dispatched in MCS0 due to capacity - pulses once for each time a valid RRTO op is not dispatched due to a command list full condition";
                };
                event@8 {
                        event-name = "WRTO_QFULL_NO_DISP";
                        reg = <0x8 0x8>;
                        desc = "WRTO not dispatched in MCS0 due to capacity - pulses once for each time a valid WRTO op is not dispatched due to a command list full condition";
                };
        };
        mcs0 {
                compatible = "ibm,imc-counters";
                events-prefix = "PM_MCS0_";
                unit = "";
                scale = "";
                reg = <0x118 0x8>;
                events = < &NEST_MCS >;
                type = <0x10>;
        };
        mcs1 {
                compatible = "ibm,imc-counters";
                events-prefix = "PM_MCS1_";
                unit = "";
                scale = "";
                reg = <0x198 0x8>;
                events = < &NEST_MCS >;
                type = <0x10>;
        };

        CORE_EVENTS: core-events {
                #address-cells = <0x1>;
                #size-cells = <0x1>;

                event@e0 {
                        event-name = "0THRD_NON_IDLE_PCYC";
                        reg = <0xe0 0x8>;
                        desc = "The number of processor cycles when all threads are idle";
                };
                event@120 {
                        event-name = "1THRD_NON_IDLE_PCYC";
                        reg = <0x120 0x8>;
                        desc = "The number of processor cycles when exactly one SMT thread is executing non-idle code";
                };
        };
        core {
                compatible = "ibm,imc-counters";
                events-prefix = "CPM_";
                unit = "";
                scale = "";
                reg = <0x0 0x8>;
                events = < &CORE_EVENTS >;
                type = <0x4>;
        };

        thread {
                compatible = "ibm,imc-counters";
                events-prefix = "CPM_";
                unit = "";
                scale = "";
                reg = <0x0 0x8>;
                events = < &CORE_EVENTS >;
                type = <0x1>;
        };
};

IMC Catalog DTS:
(recent suggested device node changes are not yet updated to the link)

For any IMC PMU node (mcs0, mcs1, mcs2, core, thread etc), its events
property points to the events node which gives us the event
information for that PMU.
For example, take the mcs0 PMU node from the above excerpt: its "events"
property points to the events list for this PMU, and its "events-prefix"
property helps us create the correct event name for this PMU. So, the
"RRTO_QFULL_NO_DISP" event name from "nest-mcs-events" is combined with
the "PM_MCS0_" prefix to form the event name for mcs0.

This new design of the DTS file saves a lot of space in the device
tree, since many event names are common across PMUs. For core and
thread IMC PMUs, all the event names are common.
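
To illustrate how a shared events node serves several PMUs, here is a
minimal sketch using the values from the excerpt. The dictionary layout
and the helper pmu_event_names() are hypothetical, and plain
concatenation of "events-prefix" and "event-name" is an assumption about
how a consumer would build the final name.

```python
# Events listed once, as in the "nest-mcs-events" node of the excerpt.
NEST_MCS_EVENTS = [
    {"event-name": "RRTO_QFULL_NO_DISP", "reg": (0x0, 0x8)},
    {"event-name": "WRTO_QFULL_NO_DISP", "reg": (0x8, 0x8)},
]

# Both PMUs reference the same list instead of carrying their own copy.
PMUS = {
    "mcs0": {"events-prefix": "PM_MCS0_", "events": NEST_MCS_EVENTS},
    "mcs1": {"events-prefix": "PM_MCS1_", "events": NEST_MCS_EVENTS},
}


def pmu_event_names(pmu):
    """Build per-PMU event names (assumed: prefix + event-name)."""
    return [pmu["events-prefix"] + ev["event-name"] for ev in pmu["events"]]
```

Adding another PMU with the same events only costs one small node with
its own prefix; the event list itself is not duplicated.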

Each event in the device tree contains "event-name" and "offset".
Some PMUs may also contain properties such as "scale" and "unit", which
means that all the events inside that PMU have the same "scale" and
"unit" values.

Why this design for the IMC DTS files?
The DTS files for Power9 describe the nest, core and thread IMC PMUs.
One could argue for designing the device tree so that
of_translate_address() can be used directly on the event nodes to get
the CPU address of each event. However, that approach has some issues:
 - For nest IMC, we would need to attach the device tree to each
   per-chip HOMER region node. With multiple chips, this increases
   replication.
 - For core IMC, the kernel allocates the memory for each core, so the
   base location for core IMC is not fixed. Hence, we can't use
   of_translate_address() on the core events.
 - For thread IMC, memory is allocated for each Linux process that needs
   to be monitored. Because this allocation is dynamic, it is
   particularly difficult to represent in the device tree.
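
The point about non-fixed base addresses can be made concrete with a
short sketch. It assumes that a counter's address is the run-time base
of the counter memory plus the first cell of the event's "reg" property;
the helper event_address() and the base addresses are invented for
illustration.

```python
# (offset, size) pairs taken from the "core-events" node in the excerpt.
CORE_EVENTS = {
    "0THRD_NON_IDLE_PCYC": (0xE0, 0x8),
    "1THRD_NON_IDLE_PCYC": (0x120, 0x8),
}


def event_address(base, events, name):
    """Assumed model: counter address = run-time base + event offset."""
    offset, _size = events[name]
    return base + offset


# Made-up per-core bases. They are only known once the kernel has
# allocated the memory, so they cannot be encoded in a static tree.
core_bases = {0: 0x10000, 1: 0x20000}
```

The same event offset resolves to a different address for every core,
which is why the catalog carries offsets only and leaves the base to the
consumer.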

So, from the OPAL side, we need to :
 - Find out the current processor's PVR.
 - Fetch the IMC catalog pnor partition.
 - Fetch the correct subpartition based on the current processor's PVR.
 - Decompress the blob taken from the subpartition.
 - Expand the (now uncompressed) device tree binary, fixup the phandle and
   attach it to the system's device tree, so that, it can now be discovered
   by the kernel.
 - Look at the IMC availability vector which denotes which of the nest
   PMUs are available and remove the unavailable PMU nodes from the
   device tree.
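
The last step above (pruning unavailable nest PMUs) could look roughly
like the following. This is a sketch under stated assumptions: the bit
positions in bit_of, the convention that a set bit means "available",
and the function name prune_unavailable() are all hypothetical and not
taken from the patches.

```python
def prune_unavailable(pmu_nodes, bit_of, avail_vector):
    """Drop PMU nodes whose (assumed) availability bit is clear.

    pmu_nodes:    dict mapping node name to node contents
    bit_of:       dict mapping nest PMU name to a bit position (invented)
    avail_vector: integer bitmask, bit set = PMU available (assumed)
    Nodes without an entry in bit_of (e.g. core, thread) are kept.
    """
    kept = {}
    for name, node in pmu_nodes.items():
        if name in bit_of and not (avail_vector >> bit_of[name]) & 1:
            continue  # nest PMU reported as unavailable: remove it
        kept[name] = node
    return kept


nodes = {"mcs0": {}, "mcs1": {}, "core": {}}
bit_of = {"mcs0": 0, "mcs1": 1}  # hypothetical bit assignment
```

With avail_vector = 0b01, this model keeps mcs0 and core and drops mcs1.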

Note that :
 - Since OPAL lacks an xz decompression library, an xz decompression
   library has been added from http://tukaani.org/xz/embedded.html
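
For readers who want to experiment with the decompression step outside
of OPAL, here is a small stand-in sketch. It uses Python's standard lzma
module rather than the embedded C library added by this series, and the
helper name decompress_catalog() is hypothetical. The check against
0xd00dfeed relies on the standard flattened device tree magic number.

```python
import lzma
import struct

FDT_MAGIC = 0xD00DFEED  # magic at the start of a flattened device tree


def decompress_catalog(blob):
    """Decompress an xz blob and check that it looks like a DTB."""
    dtb = lzma.decompress(blob, format=lzma.FORMAT_XZ)
    if len(dtb) < 4:
        raise ValueError("decompressed blob too short to be a DTB")
    (magic,) = struct.unpack(">I", dtb[:4])  # FDT header is big-endian
    if magic != FDT_MAGIC:
        raise ValueError("decompressed blob is not a DTB")
    return dtb
```

Validating the magic right after decompression gives an early, cheap
failure before any attempt to expand the tree.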

This Patchset does 2 things :

1) At the time of boot, it detects the IMA_CATALOG resource. Based on
   the current processor's PVR value, it fetches the appropriate
   subpartition. The blob in this subpartition is then uncompressed and the
   flattened device tree is obtained. This dtb is then expanded and then
   linked to the system's device tree under
   "/proc/device-tree/ima-counters". The node "ima-counters" is a new node
   created in this patchset. The kernel can then discover this node based
   on its compatibility field.

2) It implements opal calls to initialize, enable and disable the IMC counters
   as specified by the host kernel.
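
Linking the expanded catalog tree into the system tree (step 1) needs
the phandle fixup mentioned earlier, because phandles in the incoming
tree may collide with ones already in use. The following is a minimal
sketch of one way to do this in a single pass, assuming every incoming
phandle and every referencing property is shifted by the system tree's
last used phandle. The node layout and the name fixup_phandles() are
hypothetical; this is not the code from the patches.

```python
def fixup_phandles(incoming, last_phandle, ref_props=("events",)):
    """Shift phandles in `incoming` past `last_phandle`, in one pass.

    incoming:  list of node dicts; a node may carry "phandle" and/or
               properties that hold a phandle reference (ref_props)
    Returns the new last used phandle.
    """
    new_last = last_phandle
    for node in incoming:
        if "phandle" in node:
            node["phandle"] += last_phandle
            new_last = max(new_last, node["phandle"])
        for prop in ref_props:
            if prop in node:
                # Same offset as the target, so the reference stays valid.
                node[prop] += last_phandle
    return new_last
```

Because definitions and references move by the same offset, no lookup
table is needed, which is what makes a single pass sufficient in this
model.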

This patchset is based on the initial work for Nest Instrumentation done
by Madhavan Srinivasan, which can be found here :

Changelog :
v12 -> v13:
 - Added more documentation and code comments
 - Merged patches 8 and 9.
 - Modified _start and _stop calls to take another parameter.
 - Modified the decompress function to take memcpy-style parameters.
 - Added new helper functions for node parser
 - Added a new device tree parser node to detect and remove unknown imc type
 - Removed Acked-by from patch 1, since a change was made to fix a
   warning at doc compilation.

v11 -> v12:
 - Dropped the patch to add chip-id to reserve-memory
 - Modified the _INIT call to carry additional parameter (cpu_pir)
 - Modified the _INIT call to update scom based on input cpu_pir
 - Updated the opal api docs
 - Added test for dt_fixup function in core/test/run-device.c
 - Added new function to update base-addr and chip-id array to nest nodes
 - Updated commit messages and added more code comments

v10 -> v11:
 - Fix the return value for _INIT call in case of nest type

v9 -> v10:
 - Implemented the phandle fixup function using a single pass dt loop
 - Removed the hash functions and added more comments to the code
 - Separated imc catalog preloading from imc_init() as suggested
 - Made changes to document files
 - v8 has dropped a patch which was added in this series

v8 -> v9:
 - Changed the opal call APIs for nest and core counters.
 - Implemented the phandle fixup using a hash primitive, instead
   of a linked list.
 - Made changes in commit messages.
 - Made changes in opal-call documentations.

v7 -> v8:
 - Rebased to latest upstream
 - Made changes to commit messages

v6 -> v7:
 - libxz -- removing the hostboot header from code
 - Made changes in commit messages.

v5 -> v6:
 - Added a set of new dt_fixup_* functions to handle phandle
   in the incoming tree.
 - Removed nest_imc.h and moved nest_pmc[] to imc.c
 - Updated macro names and values as suggested
 - Fixed disable_unavailable_units() to work with incoming tree
   and not system dt.
 - Rearranged the patches so that the homer region update patch is first
 - Made changes to commit messages.

 v4 -> v5:
 - Changed the cover letter to show the new IMC DTS format (which removes
 - No visible changes in the code.

 v3 -> v4:
 Major Changes include :
 - Patchset now has support for core level IMC PMUs.

 v2 -> v3 :
 Major changes include
 - Addressed review comments from Oliver O'Halloran.
 - Renamed this infrastructure from IMA (In-Memory Accumulation) to IMC
   (In-Memory Collection), since, the name IMA conflicts with existing
   IMA (Integrity Measurement Architecture) in the linux kernel.
 - Patches 2 and 4 have been merged together (3/6).
 - Patch 3 (xz library) has been moved to Patch 2/6.

Anju T Sudhakar (2):
  skiboot: Add opal calls to init/start/stop IMC devices
  skiboot: Add documentation for IMC opal call

Hemant Kumar (2):
  skiboot: Nest IMC macro definitions
  skiboot: Add a library for xz

Madhavan Srinivasan (5):
  skiboot/doc: Add doc/imc.rst documentation
  skiboot/doc: Add devicetree binding document for IMC
  dt: Add helper function for last_phandle updates
  dt: Add phandle fixup helpers
  skiboot: Find the IMC DTB

 Makefile.main                      |    5 +-
 core/device.c                      |   42 +-
 core/fdt.c                         |    8 +-
 core/flash.c                       |    1 +
 core/init.c                        |    7 +
 core/test/run-device.c             |   35 +-
 doc/device-tree/imc.rst            |   72 +++
 doc/imc.rst                        |   54 ++
 doc/index.rst                      |    1 +
 doc/opal-api/opal-imc-counters.rst |   87 +++
 hw/Makefile.inc                    |    2 +-
 hw/imc.c                           |  622 +++++++++++++++++++
 include/chip.h                     |   11 +
 include/device.h                   |   20 +
 include/imc.h                      |  135 +++++
 include/opal-api.h                 |   12 +-
 include/platform.h                 |    1 +
 libxz/Makefile.inc                 |    7 +
 libxz/xz.h                         |  304 ++++++++++
 libxz/xz_config.h                  |  124 ++++
 libxz/xz_crc32.c                   |   59 ++
 libxz/xz_dec_lzma2.c               | 1171 ++++++++++++++++++++++++++++++++++++
 libxz/xz_dec_stream.c              |  847 ++++++++++++++++++++++++++
 libxz/xz_lzma2.h                   |  204 +++++++
 libxz/xz_private.h                 |  156 +++++
 libxz/xz_stream.h                  |   62 ++
 26 files changed, 4037 insertions(+), 12 deletions(-)
 create mode 100644 doc/device-tree/imc.rst
 create mode 100644 doc/imc.rst
 create mode 100644 doc/opal-api/opal-imc-counters.rst
 create mode 100644 hw/imc.c
 create mode 100644 include/imc.h
 create mode 100644 libxz/Makefile.inc
 create mode 100644 libxz/xz.h
 create mode 100644 libxz/xz_config.h
 create mode 100644 libxz/xz_crc32.c
 create mode 100644 libxz/xz_dec_lzma2.c
 create mode 100644 libxz/xz_dec_stream.c
 create mode 100644 libxz/xz_lzma2.h
 create mode 100644 libxz/xz_private.h
 create mode 100644 libxz/xz_stream.h
