[PATCH 00/14] Add interface to expose vpa dtl counters via perf
Athira Rajeev
atrajeev at linux.ibm.com
Fri Aug 15 22:51:50 AEST 2025
> On 15 Aug 2025, at 5:47 PM, Venkat Rao Bagalkote <venkat88 at linux.ibm.com> wrote:
>
>
> On 15/08/25 2:03 pm, Athira Rajeev wrote:
>> The pseries Shared Processor Logical Partition(SPLPAR) machines can
>> retrieve a log of dispatch and preempt events from the hypervisor
>> using data from Disptach Trace Log(DTL) buffer. With this information,
>> user can retrieve when and why each dispatch & preempt has occurred.
>> The vpa-dtl PMU exposes the Virtual Processor Area(VPA) DTL counters
>> via perf.
>>
>> - Patches 1 to 6 has powerpc PMU driver code changes to capture DTL
>> trace in perf.data. And patch 14 has documentation update.
>> - Patch 7 to 13 is perf tools side code changes to enable perf
>> report/script on perf.data file
>>
>> Infrastructure used
>> ===================
>>
>> The VPA DTL PMU counters do not interrupt on overflow or generate any
>> PMI interrupts. Therefore, hrtimer is used to poll the DTL data. The timer
>> nterval can be provided by user via sample_period field in nano seconds.
>> vpa dtl pmu has one hrtimer added per vpa-dtl pmu thread. DTL (Dispatch
>> Trace Log) contains information about dispatch/preempt, enqueue time etc.
>> We directly copy the DTL buffer data as part of auxiliary buffer and it
>> will be processed later. This will avoid time taken to create samples
>> in the kernel space. The PMU driver collecting Dispatch Trace Log (DTL)
>> entries makes use of AUX support in perf infrastructure. On the tools side,
>> this data is made available as PERF_RECORD_AUXTRACE records.
>>
>> To corelate each DTL entry with other events across CPU's, an auxtrace_queue
>> is created for each CPU. Each auxtrace queue has a array/list of auxtrace buffers.
>> All auxtrace queues is maintained in auxtrace heap. The queues are sorted
>> based on timestamp. When the different PERF_RECORD_XX records are processed,
>> compare the timestamp of perf record with timestamp of top element in the
>> auxtrace heap so that DTL events can be co-related with other events
>> Process the auxtrace queue if the timestamp of element from heap is
>> lower than timestamp from entry in perf record. Sometimes it could happen that
>> one buffer is only partially processed. if the timestamp of occurrence of
>> another event is more than currently processed element in the queue, it will
>> move on to next perf record. So keep track of position of buffer to continue
>> processing next time. Update the timestamp of the auxtrace heap with the timestamp
>> of last processed entry from the auxtrace buffer.
>>
>> This infrastructure ensures dispatch trace log entries can be corelated
>> and presented along with other events like sched.
>>
>> vpa-dtl PMU example usage
>>
>> # ls /sys/devices/vpa_dtl/
>> events format perf_event_mux_interval_ms power subsystem type uevent
>>
>>
>> To capture the DTL data using perf record:
>>
>> # ./perf record -a -e sched:*,vpa_dtl/dtl_all/ -c 1000000000 sleep 1
>>
>> The result can be interpreted using perf report. Snippet of perf report -D:
>>
>> # ./perf report -D
>>
>> There are different PERF_RECORD_XX records. In that records corresponding to
>> auxtrace buffers includes:
>>
>> 1. PERF_RECORD_AUX
>> Conveys that new data is available in AUX area
>>
>> 2. PERF_RECORD_AUXTRACE_INFO
>> Describes offset and size of auxtrace data in the buffers
>>
>> 3. PERF_RECORD_AUXTRACE
>> This is the record that defines the auxtrace data which here in case of
>> vpa-dtl pmu is dispatch trace log data.
>>
>> Snippet from perf report -D showing the PERF_RECORD_AUXTRACE dump
>>
>> 0 0 0x39b10 [0x30]: PERF_RECORD_AUXTRACE size: 0x690 offset: 0 ref: 0 idx: 0 tid: -1 cpu: 0
>> .
>> . ... VPA DTL PMU data: size 1680 bytes, entries is 35
>> . 00000000: boot_tb: 21349649546353231, tb_freq: 512000000
>> . 00000030: dispatch_reason:decrementer interrupt, preempt_reason:H_CEDE, enqueue_to_dispatch_time:7064, ready_to_enqueue_time:187, waiting_to_ready_time:6611773
>> . 00000060: dispatch_reason:priv doorbell, preempt_reason:H_CEDE, enqueue_to_dispatch_time:146, ready_to_enqueue_time:0, waiting_to_ready_time:15359437
>> . 00000090: dispatch_reason:decrementer interrupt, preempt_reason:H_CEDE, enqueue_to_dispatch_time:4868, ready_to_enqueue_time:232, waiting_to_ready_time:5100709
>> . 000000c0: dispatch_reason:priv doorbell, preempt_reason:H_CEDE, enqueue_to_dispatch_time:179, ready_to_enqueue_time:0, waiting_to_ready_time:30714243
>> . 000000f0: dispatch_reason:priv doorbell, preempt_reason:H_CEDE, enqueue_to_dispatch_time:197, ready_to_enqueue_time:0, waiting_to_ready_time:15350648
>> . 00000120: dispatch_reason:priv doorbell, preempt_reason:H_CEDE, enqueue_to_dispatch_time:213, ready_to_enqueue_time:0, waiting_to_ready_time:15353446
>> . 00000150: dispatch_reason:priv doorbell, preempt_reason:H_CEDE, enqueue_to_dispatch_time:212, ready_to_enqueue_time:0, waiting_to_ready_time:15355126
>> . 00000180: dispatch_reason:decrementer interrupt, preempt_reason:H_CEDE, enqueue_to_dispatch_time:6368, ready_to_enqueue_time:164, waiting_to_ready_time:5104665
>>
>> Above is representation of dtl entry of below format:
>>
>> struct dtl_entry {
>> u8 dispatch_reason;
>> u8 preempt_reason;
>> u16 processor_id;
>> u32 enqueue_to_dispatch_time;
>> u32 ready_to_enqueue_time;
>> u32 waiting_to_ready_time;
>> u64 timebase;
>> u64 fault_addr;
>> u64 srr0;
>> u64 srr1;
>> };
>>
>> First two fields represent the dispatch reason and preempt reason. The post
>> procecssing of PERF_RECORD_AUXTRACE records will translate to meaninful data
>> for user to consume.
>>
>> Visualize the dispatch trace log entries with perf report:
>>
>> # ./perf record -a -e sched:*,vpa_dtl/dtl_all/ -c 1000000000 sleep 1
>> [ perf record: Woken up 1 times to write data ]
>> [ perf record: Captured and wrote 0.300 MB perf.data ]
>>
>> # ./perf report
>> # Samples: 321 of event 'vpa-dtl'
>> # Event count (approx.): 321
>> #
>> # Children Self Command Shared Object Symbol
>> # ........ ........ ....... ................. ..............................
>> #
>> 100.00% 100.00% swapper [kernel.kallsyms] [k] plpar_hcall_norets_notrace
>>
>> Visualize the dispatch trace log entries with perf script:
>>
>> # ./perf script
>> perf 13322 [002] 233.835807: sched:sched_switch: perf:13322 [120] R ==> migration/2:27 [0]
>> migration/2 27 [002] 233.835811: sched:sched_migrate_task: comm=perf pid=13322 prio=120 orig_cpu=2 dest_cpu=3
>> migration/2 27 [002] 233.835818: sched:sched_stat_runtime: comm=migration/2 pid=27 runtime=9214 [ns]
>> migration/2 27 [002] 233.835819: sched:sched_switch: migration/2:27 [0] S ==> swapper/2:0 [120]
>> swapper 0 [002] 233.835822: vpa-dtl: timebase: 338954486062657 dispatch_reason:decrementer_interrupt, preempt_reason:H_CEDE, enqueue_to_dispatch_time:435, ready_to_enqueue_time:0, waiting_to_ready_time:34775058, processor_id: 202 c0000000000f8094 plpar_hcall_norets_notrace+0x18 ([kernel.kallsyms])
>> swapper 0 [001] 233.835886: vpa-dtl: timebase: 338954486095398 dispatch_reason:priv_doorbell, preempt_reason:H_CEDE, enqueue_to_dispatch_time:542, ready_to_enqueue_time:0, waiting_to_ready_time:1245360, processor_id: 201 c0000000000f8094 plpar_hcall_norets_notrace+0x18 ([kernel.kallsyms])
>>
>> Aboorva Devarajan (1):
>> powerpc/time: Expose boot_tb via accessor
>>
>> Athira Rajeev (11):
>> powerpc/perf/vpa-dtl: Add support to setup and free aux buffer for
>> capturing DTL data
>> powerpc/perf/vpa-dtl: Add support to capture DTL data in aux buffer
>> powerpc/perf/vpa-dtl: Handle the writing of perf record when aux wake
>> up is needed
>> tools/perf: Add basic CONFIG_AUXTRACE support for VPA pmu on powerpc
>> tools/perf: process auxtrace events and display in perf report -D
>> tools/perf: Add event name as vpa-dtl of PERF_TYPE_SYNTH type to
>> present DTL samples
>> tools/perf: Allocate and setup aux buffer queue to help co-relate with
>> other events across CPU's
>> tools/perf: Process the DTL entries in queue and deliver samples
>> tools/perf: Add support for printing synth event details via default
>> callback
>> tools/perf: Enable perf script to present the DTL entries
>> powerpc/perf/vpa-dtl: Add documentation for VPA dispatch trace log PMU
>>
>> Kajol Jain (2):
>> powerpc/vpa_dtl: Add interface to expose vpa dtl counters via perf
>> docs: ABI: sysfs-bus-event_source-devices-vpa-dtl: Document sysfs
>> event format entries for vpa_dtl pmu
>>
>> .../sysfs-bus-event_source-devices-vpa-dtl | 25 +
>> Documentation/arch/powerpc/index.rst | 1 +
>> Documentation/arch/powerpc/vpa-dtl.rst | 155 ++++
>> arch/powerpc/include/asm/time.h | 2 +
>> arch/powerpc/kernel/time.c | 7 +-
>> arch/powerpc/perf/Makefile | 2 +-
>> arch/powerpc/perf/vpa-dtl.c | 605 ++++++++++++++
>> tools/perf/arch/powerpc/util/Build | 1 +
>> tools/perf/arch/powerpc/util/auxtrace.c | 122 +++
>> tools/perf/builtin-script.c | 26 +
>> tools/perf/util/Build | 1 +
>> tools/perf/util/auxtrace.c | 4 +
>> tools/perf/util/auxtrace.h | 1 +
>> tools/perf/util/event.h | 1 +
>> tools/perf/util/powerpc-vpadtl.c | 756 ++++++++++++++++++
>> tools/perf/util/powerpc-vpadtl.h | 45 ++
>> 16 files changed, 1752 insertions(+), 2 deletions(-)
>> create mode 100644 Documentation/ABI/testing/sysfs-bus-event_source-devices-vpa-dtl
>> create mode 100644 Documentation/arch/powerpc/vpa-dtl.rst
>> create mode 100644 arch/powerpc/perf/vpa-dtl.c
>> create mode 100644 tools/perf/arch/powerpc/util/auxtrace.c
>> create mode 100644 tools/perf/util/powerpc-vpadtl.c
>> create mode 100644 tools/perf/util/powerpc-vpadtl.h
>>
>
> Tested this patchset by applying on top of mainline kernel, and its working as expected. Hence for the entire series, please add below tag.
>
>
> Tested-by: Venkat Rao Bagalkote <venkat88 at linux.ibm.com>
>
>
> Regards,
>
> Venkat.
Thanks Venkat for checking.
Hi All,
I have CC-ed linuxppc-dev for powerpc specific patches and linux-perf-users for perf tools side related patches.
As mentioned in cover letter:
- Patches 1 to 6 has powerpc PMU driver code changes to capture trace data in perf.data, And patch 14 has documentation update.
- Patch 7 to 13 is perf tools side code changes to enable perf report/script on perf.data
Thanks
Athira
More information about the Linuxppc-dev
mailing list