[PATCH v2] powerpc/imc: Add documentation for IMC and trace-mode
Michael Ellerman
mpe at ellerman.id.au
Mon Oct 28 21:08:16 AEDT 2019
From: Anju T Sudhakar <anju at linux.vnet.ibm.com>
Documentation for IMC (In-Memory Collection Counters) infrastructure
and trace-mode of IMC.
Signed-off-by: Anju T Sudhakar <anju at linux.vnet.ibm.com>
[mpe: Convert to rst, minor rewording, make PMI example more concise]
Signed-off-by: Michael Ellerman <mpe at ellerman.id.au>
---
Documentation/powerpc/imc.rst | 199 ++++++++++++++++++++++++++++++++
Documentation/powerpc/index.rst | 1 +
2 files changed, 200 insertions(+)
create mode 100644 Documentation/powerpc/imc.rst
diff --git a/Documentation/powerpc/imc.rst b/Documentation/powerpc/imc.rst
new file mode 100644
index 000000000000..633bcee7dc85
--- /dev/null
+++ b/Documentation/powerpc/imc.rst
@@ -0,0 +1,199 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _imc:
+
+===================================
+IMC (In-Memory Collection Counters)
+===================================
+
+Anju T Sudhakar, 10 May 2019
+
+.. contents::
+ :depth: 3
+
+
+Basic overview
+==============
+
+IMC (In-Memory collection counters) is a hardware monitoring facility that
+collects large numbers of hardware performance events at Nest level (these are
+on-chip but off-core), Core level and Thread level.
+
+The Nest PMU counters are handled by a Nest IMC microcode which runs in the OCC
+(On-Chip Controller) complex. The microcode collects the counter data and moves
+the nest IMC counter data to memory.
+
+The Core and Thread IMC PMU counters are handled in the core. Core level PMU
+counters give us the IMC counters' data per core and thread level PMU counters
+give us the IMC counters' data per CPU thread.
+
+OPAL obtains the IMC PMU and supported events information from the IMC Catalog
+and passes on to the kernel via the device tree. The event's information
+contains:
+
+- Event name
+- Event Offset
+- Event description
+
+and possibly also:
+
+- Event scale
+- Event unit
+
+Some PMUs may have a common scale and unit values for all their supported
+events. For those cases, the scale and unit properties for those events must be
+inherited from the PMU.
+
+The event offset in the memory is where the counter data gets accumulated.
+
+IMC catalog is available at:
+ https://github.com/open-power/ima-catalog
+
+The kernel discovers the IMC counters information in the device tree at the
+`imc-counters` device node which has a compatible field
+`ibm,opal-in-memory-counters`. From the device tree, the kernel parses the PMUs
+and their event's information and register the PMU and its attributes in the
+kernel.
+
+IMC example usage
+=================
+
+.. code-block:: sh
+
+ # perf list
+ [...]
+ nest_mcs01/PM_MCS01_64B_RD_DISP_PORT01/ [Kernel PMU event]
+ nest_mcs01/PM_MCS01_64B_RD_DISP_PORT23/ [Kernel PMU event]
+ [...]
+ core_imc/CPM_0THRD_NON_IDLE_PCYC/ [Kernel PMU event]
+ core_imc/CPM_1THRD_NON_IDLE_INST/ [Kernel PMU event]
+ [...]
+ thread_imc/CPM_0THRD_NON_IDLE_PCYC/ [Kernel PMU event]
+ thread_imc/CPM_1THRD_NON_IDLE_INST/ [Kernel PMU event]
+
+To see per chip data for nest_mcs0/PM_MCS_DOWN_128B_DATA_XFER_MC0/:
+
+.. code-block:: sh
+
+ # ./perf stat -e "nest_mcs01/PM_MCS01_64B_WR_DISP_PORT01/" -a --per-socket
+
+To see non-idle instructions for core 0:
+
+.. code-block:: sh
+
+ # ./perf stat -e "core_imc/CPM_NON_IDLE_INST/" -C 0 -I 1000
+
+To see non-idle instructions for a "make":
+
+.. code-block:: sh
+
+ # ./perf stat -e "thread_imc/CPM_NON_IDLE_PCYC/" make
+
+
+IMC Trace-mode
+===============
+
+POWER9 supports two modes for IMC which are the Accumulation mode and Trace
+mode. In Accumulation mode, event counts are accumulated in system Memory.
+Hypervisor then reads the posted counts periodically or when requested. In IMC
+Trace mode, the 64 bit trace SCOM value is initialized with the event
+information. The CPMCxSEL and CPMC_LOAD in the trace SCOM, specifies the event
+to be monitored and the sampling duration. On each overflow in the CPMCxSEL,
+hardware snapshots the program counter along with event counts and writes into
+memory pointed by LDBAR.
+
+LDBAR is a 64 bit special purpose per thread register, it has bits to indicate
+whether hardware is configured for accumulation or trace mode.
+
+LDBAR Register Layout
+---------------------
+
+ +-------+----------------------+
+ | 0 | Enable/Disable |
+ +-------+----------------------+
+ | 1 | 0: Accumulation Mode |
+ | +----------------------+
+ | | 1: Trace Mode |
+ +-------+----------------------+
+ | 2:3 | Reserved |
+ +-------+----------------------+
+ | 4-6 | PB scope |
+ +-------+----------------------+
+ | 7 | Reserved |
+ +-------+----------------------+
+ | 8:50 | Counter Address |
+ +-------+----------------------+
+ | 51:63 | Reserved |
+ +-------+----------------------+
+
+TRACE_IMC_SCOM bit representation
+---------------------------------
+
+ +-------+------------+
+ | 0:1 | SAMPSEL |
+ +-------+------------+
+ | 2:33 | CPMC_LOAD |
+ +-------+------------+
+ | 34:40 | CPMC1SEL |
+ +-------+------------+
+ | 41:47 | CPMC2SEL |
+ +-------+------------+
+ | 48:50 | BUFFERSIZE |
+ +-------+------------+
+ | 51:63 | RESERVED |
+ +-------+------------+
+
+CPMC_LOAD contains the sampling duration. SAMPSEL and CPMCxSEL determines the
+event to count. BUFFERSIZE indicates the memory range. On each overflow,
+hardware snapshots the program counter along with event counts and updates the
+memory and reloads the CMPC_LOAD value for the next sampling duration. IMC
+hardware does not support exceptions, so it quietly wraps around if memory
+buffer reaches the end.
+
+*Currently the event monitored for trace-mode is fixed as cycle.*
+
+Trace IMC example usage
+=======================
+
+.. code-block:: sh
+
+ # perf list
+ [....]
+ trace_imc/trace_cycles/ [Kernel PMU event]
+
+To record an application/process with trace-imc event:
+
+.. code-block:: sh
+
+ # perf record -e trace_imc/trace_cycles/ yes > /dev/null
+ [ perf record: Woken up 1 times to write data ]
+ [ perf record: Captured and wrote 0.012 MB perf.data (21 samples) ]
+
+The `perf.data` generated, can be read using perf report.
+
+Benefits of using IMC trace-mode
+================================
+
+PMI (Performance Monitoring Interrupts) interrupt handling is avoided, since IMC
+trace mode snapshots the program counter and updates to the memory. And this
+also provide a way for the operating system to do instruction sampling in real
+time without PMI processing overhead.
+
+Performance data using `perf top` with and without trace-imc event.
+
+PMI interrupts count when `perf top` command is executed without trace-imc event.
+
+.. code-block:: sh
+
+ # grep PMI /proc/interrupts
+ PMI: 0 0 0 0 Performance monitoring interrupts
+ # ./perf top
+ ...
+ # grep PMI /proc/interrupts
+ PMI: 39735 8710 17338 17801 Performance monitoring interrupts
+ # ./perf top -e trace_imc/trace_cycles/
+ ...
+ # grep PMI /proc/interrupts
+ PMI: 39735 8710 17338 17801 Performance monitoring interrupts
+
+
+That is, the PMI interrupt counts do not increment when using the `trace_imc` event.
diff --git a/Documentation/powerpc/index.rst b/Documentation/powerpc/index.rst
index db7b6a880f52..80622a931a90 100644
--- a/Documentation/powerpc/index.rst
+++ b/Documentation/powerpc/index.rst
@@ -18,6 +18,7 @@ powerpc
elfnote
firmware-assisted-dump
hvcs
+ imc
isa-versions
mpc52xx
pci_iov_resource_on_powernv
--
2.21.0
More information about the Linuxppc-dev
mailing list