[PATCH 00/14] Add interface to expose vpa dtl counters via perf

tejas05 tejas05 at linux.ibm.com
Tue Aug 19 00:41:31 AEST 2025


On 8/15/25 14:03, Athira Rajeev wrote:

> The pseries Shared Processor Logical Partition(SPLPAR) machines can
> retrieve a log of dispatch and preempt events from the hypervisor
> using data from Disptach Trace Log(DTL) buffer. With this information,
> user can retrieve when and why each dispatch & preempt has occurred.
> The vpa-dtl PMU exposes the Virtual Processor Area(VPA) DTL counters
> via perf.
>
> - Patches 1 to 6 has powerpc PMU driver code changes to capture DTL
>    trace in perf.data. And patch 14 has documentation update.
> - Patch 7 to 13 is perf tools side code changes to enable perf
>    report/script on perf.data file
>
> Infrastructure used
> ===================
>
> The VPA DTL PMU counters do not interrupt on overflow or generate any
> PMI interrupts. Therefore, hrtimer is used to poll the DTL data. The timer
> nterval can be provided by user via sample_period field in nano seconds.
> vpa dtl pmu has one hrtimer added per vpa-dtl pmu thread. DTL (Dispatch
> Trace Log) contains information about dispatch/preempt, enqueue time etc.
> We directly copy the DTL buffer data as part of auxiliary buffer and it
> will be processed later. This will avoid time taken to create samples
> in the kernel space. The PMU driver collecting Dispatch Trace Log (DTL)
> entries makes use of AUX support in perf infrastructure. On the tools side,
> this data is made available as PERF_RECORD_AUXTRACE records.
>
> To corelate each DTL entry with other events across CPU's, an auxtrace_queue
> is created for each CPU. Each auxtrace queue has a array/list of auxtrace buffers.
> All auxtrace queues is maintained in auxtrace heap. The queues are sorted
> based on timestamp. When the different PERF_RECORD_XX records are processed,
> compare the timestamp of perf record with timestamp of top element in the
> auxtrace heap so that DTL events can be co-related with other events
> Process the auxtrace queue if the timestamp of element from heap is
> lower than timestamp from entry in perf record. Sometimes it could happen that
> one buffer is only partially processed. if the timestamp of occurrence of
> another event is more than currently processed element in the queue, it will
> move on to next perf record. So keep track of position of buffer to continue
> processing next time. Update the timestamp of the auxtrace heap with the timestamp
> of last processed entry from the auxtrace buffer.
>
> This infrastructure ensures dispatch trace log entries can be corelated
> and presented along with other events like sched.
>
> vpa-dtl PMU example usage
>
>    # ls /sys/devices/vpa_dtl/
>    events  format  perf_event_mux_interval_ms  power  subsystem  type  uevent
>
>
> To capture the DTL data using perf record:
>
>    # ./perf record -a -e sched:*,vpa_dtl/dtl_all/ -c 1000000000 sleep 1
>
> The result can be interpreted using perf report. Snippet of perf report -D:
>
>    # ./perf report -D
>
> There are different PERF_RECORD_XX records. In that records corresponding to
> auxtrace buffers includes:
>
> 1. PERF_RECORD_AUX
>     Conveys that new data is available in AUX area
>
> 2. PERF_RECORD_AUXTRACE_INFO
>     Describes offset and size of auxtrace data in the buffers
>
> 3. PERF_RECORD_AUXTRACE
>     This is the record that defines the auxtrace data which here in case of
>     vpa-dtl pmu is dispatch trace log data.
>
> Snippet from perf report -D showing the PERF_RECORD_AUXTRACE dump
>
> 0 0 0x39b10 [0x30]: PERF_RECORD_AUXTRACE size: 0x690  offset: 0  ref: 0  idx: 0  tid: -1  cpu: 0
> .
> . ... VPA DTL PMU data: size 1680 bytes, entries is 35
> .  00000000: boot_tb: 21349649546353231, tb_freq: 512000000
> .  00000030: dispatch_reason:decrementer interrupt, preempt_reason:H_CEDE, enqueue_to_dispatch_time:7064, ready_to_enqueue_time:187, waiting_to_ready_time:6611773
> .  00000060: dispatch_reason:priv doorbell, preempt_reason:H_CEDE, enqueue_to_dispatch_time:146, ready_to_enqueue_time:0, waiting_to_ready_time:15359437
> .  00000090: dispatch_reason:decrementer interrupt, preempt_reason:H_CEDE, enqueue_to_dispatch_time:4868, ready_to_enqueue_time:232, waiting_to_ready_time:5100709
> .  000000c0: dispatch_reason:priv doorbell, preempt_reason:H_CEDE, enqueue_to_dispatch_time:179, ready_to_enqueue_time:0, waiting_to_ready_time:30714243
> .  000000f0: dispatch_reason:priv doorbell, preempt_reason:H_CEDE, enqueue_to_dispatch_time:197, ready_to_enqueue_time:0, waiting_to_ready_time:15350648
> .  00000120: dispatch_reason:priv doorbell, preempt_reason:H_CEDE, enqueue_to_dispatch_time:213, ready_to_enqueue_time:0, waiting_to_ready_time:15353446
> .  00000150: dispatch_reason:priv doorbell, preempt_reason:H_CEDE, enqueue_to_dispatch_time:212, ready_to_enqueue_time:0, waiting_to_ready_time:15355126
> .  00000180: dispatch_reason:decrementer interrupt, preempt_reason:H_CEDE, enqueue_to_dispatch_time:6368, ready_to_enqueue_time:164, waiting_to_ready_time:5104665
>
> Above is representation of dtl entry of below format:
>
> struct dtl_entry {
>          u8      dispatch_reason;
>          u8      preempt_reason;
>          u16     processor_id;
>          u32     enqueue_to_dispatch_time;
>          u32     ready_to_enqueue_time;
>          u32     waiting_to_ready_time;
>          u64     timebase;
>          u64     fault_addr;
>          u64     srr0;
>          u64     srr1;
>   };
>
> First two fields represent the dispatch reason and preempt reason. The post
> procecssing of PERF_RECORD_AUXTRACE records will translate to meaninful data
> for user to consume.
>
> Visualize the dispatch trace log entries with perf report:
>
>    # ./perf record -a -e sched:*,vpa_dtl/dtl_all/ -c 1000000000 sleep 1
>    [ perf record: Woken up 1 times to write data ]
>    [ perf record: Captured and wrote 0.300 MB perf.data ]
>
>    # ./perf report
>    # Samples: 321  of event 'vpa-dtl'
>    # Event count (approx.): 321
>    #
>    # Children      Self  Command  Shared Object      Symbol
>    # ........  ........  .......  .................  ..............................
>    #
>       100.00%   100.00%  swapper  [kernel.kallsyms]  [k] plpar_hcall_norets_notrace
>
> Visualize the dispatch trace log entries with perf script:
>
>     # ./perf script
>              perf   13322 [002]   233.835807:                     sched:sched_switch: perf:13322 [120] R ==> migration/2:27 [0]
>       migration/2      27 [002]   233.835811:               sched:sched_migrate_task: comm=perf pid=13322 prio=120 orig_cpu=2 dest_cpu=3
>       migration/2      27 [002]   233.835818:               sched:sched_stat_runtime: comm=migration/2 pid=27 runtime=9214 [ns]
>       migration/2      27 [002]   233.835819:                     sched:sched_switch: migration/2:27 [0] S ==> swapper/2:0 [120]
>           swapper       0 [002]   233.835822:                                vpa-dtl: timebase: 338954486062657 dispatch_reason:decrementer_interrupt, preempt_reason:H_CEDE, enqueue_to_dispatch_time:435,                      ready_to_enqueue_time:0, waiting_to_ready_time:34775058, processor_id: 202 c0000000000f8094 plpar_hcall_norets_notrace+0x18 ([kernel.kallsyms])
>           swapper       0 [001]   233.835886:                                vpa-dtl: timebase: 338954486095398 dispatch_reason:priv_doorbell, preempt_reason:H_CEDE, enqueue_to_dispatch_time:542,                      ready_to_enqueue_time:0, waiting_to_ready_time:1245360, processor_id: 201 c0000000000f8094 plpar_hcall_norets_notrace+0x18 ([kernel.kallsyms])
>
> Aboorva Devarajan (1):
>    powerpc/time: Expose boot_tb via accessor
>
> Athira Rajeev (11):
>    powerpc/perf/vpa-dtl: Add support to setup and free aux buffer for
>      capturing DTL data
>    powerpc/perf/vpa-dtl: Add support to capture DTL data in aux buffer
>    powerpc/perf/vpa-dtl: Handle the writing of perf record when aux wake
>      up is needed
>    tools/perf: Add basic CONFIG_AUXTRACE support for VPA pmu on powerpc
>    tools/perf: process auxtrace events and display in perf report -D
>    tools/perf: Add event name as vpa-dtl of PERF_TYPE_SYNTH type to
>      present DTL samples
>    tools/perf: Allocate and setup aux buffer queue to help co-relate with
>      other events across CPU's
>    tools/perf: Process the DTL entries in queue and deliver samples
>    tools/perf: Add support for printing synth event details via default
>      callback
>    tools/perf: Enable perf script to present the DTL entries
>    powerpc/perf/vpa-dtl: Add documentation for VPA dispatch trace log PMU
>
> Kajol Jain (2):
>    powerpc/vpa_dtl: Add interface to expose vpa dtl counters via perf
>    docs: ABI: sysfs-bus-event_source-devices-vpa-dtl: Document sysfs
>      event format entries for vpa_dtl pmu
>
>   .../sysfs-bus-event_source-devices-vpa-dtl    |  25 +
>   Documentation/arch/powerpc/index.rst          |   1 +
>   Documentation/arch/powerpc/vpa-dtl.rst        | 155 ++++
>   arch/powerpc/include/asm/time.h               |   2 +
>   arch/powerpc/kernel/time.c                    |   7 +-
>   arch/powerpc/perf/Makefile                    |   2 +-
>   arch/powerpc/perf/vpa-dtl.c                   | 605 ++++++++++++++
>   tools/perf/arch/powerpc/util/Build            |   1 +
>   tools/perf/arch/powerpc/util/auxtrace.c       | 122 +++
>   tools/perf/builtin-script.c                   |  26 +
>   tools/perf/util/Build                         |   1 +
>   tools/perf/util/auxtrace.c                    |   4 +
>   tools/perf/util/auxtrace.h                    |   1 +
>   tools/perf/util/event.h                       |   1 +
>   tools/perf/util/powerpc-vpadtl.c              | 756 ++++++++++++++++++
>   tools/perf/util/powerpc-vpadtl.h              |  45 ++
>   16 files changed, 1752 insertions(+), 2 deletions(-)
>   create mode 100644 Documentation/ABI/testing/sysfs-bus-event_source-devices-vpa-dtl
>   create mode 100644 Documentation/arch/powerpc/vpa-dtl.rst
>   create mode 100644 arch/powerpc/perf/vpa-dtl.c
>   create mode 100644 tools/perf/arch/powerpc/util/auxtrace.c
>   create mode 100644 tools/perf/util/powerpc-vpadtl.c
>   create mode 100644 tools/perf/util/powerpc-vpadtl.h
>
Hi,

I have tested the above patchset on the mainline kernel [ 6.17.0-rc2], 
it is working fine. The vpa-dtl pmu is recognizable and the perf record 
report works as expected. Please add the tag below, for the entire series.

Tested-by: Tejas Manhas <tejas05 at linux.ibm.com>

Thanks & Regards,

Tejas


More information about the Linuxppc-dev mailing list