path: root/tools/perf
Age | Commit message | Author
2025-02-28 | perf build: Fix in-tree build due to symbolic link | Luca Ceresoli
Building perf in-tree is broken after commit 890a1961c812 ("perf tools: Create source symlink in perf object dir") which added a 'source' symlink in the output dir pointing to the source dir.

With in-tree builds, the added 'SOURCE = ...' line is executed multiple times (I observed 2 during the build plus 2 during installation). This is a minor inefficiency, in theory not harmful because symlink creation is assumed to be idempotent. But it is not.

Considering with in-tree builds:

  srctree=/absolute/path/to/linux
  OUTPUT=/absolute/path/to/linux/tools/perf

here's what happens:

 1. ln -sf $(srctree)/tools/perf $(OUTPUT)/source
    -> creates /absolute/path/to/linux/tools/perf/source
       link to /absolute/path/to/linux/tools/perf
    => OK, that's what was intended
 2. ln -sf $(srctree)/tools/perf $(OUTPUT)/source   # same command as 1
    -> creates /absolute/path/to/linux/tools/perf/perf
       link to /absolute/path/to/linux/tools/perf
    => Not what was intended, not idempotent
 3. Now the build _should_ create the 'perf' executable, but it fails

The reason is the tricky 'ln' command line. At the first invocation 'ln' uses the 1st form:

  ln [OPTION]... [-T] TARGET LINK_NAME

and creates a link to TARGET *called LINK_NAME*.

At the second invocation $(OUTPUT)/source exists, so 'ln' uses the 3rd form:

  ln [OPTION]... TARGET... DIRECTORY

and creates a link to TARGET *called TARGET* inside DIRECTORY.

Fix by adding -n/--no-dereference to "treat LINK_NAME as a normal file if it is a symbolic link to a directory", as the manpage says.

Closes: https://lore.kernel.org/all/20241125182506.38af9907@booty/
Fixes: 890a1961c812 ("perf tools: Create source symlink in perf object dir")
Signed-off-by: Luca Ceresoli <luca.ceresoli@bootlin.com>
Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Tested-by: Charlie Jenkins <charlie@rivosinc.com>
Link: https://lore.kernel.org/r/20250124-perf-fix-intree-build-v1-1-485dd7a855e4@bootlin.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
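The three steps above can be reproduced with a small model. This is only a sketch: `ln_sf` is a hypothetical helper that imitates how `ln -sf` chooses between its "LINK_NAME" and "DIRECTORY" forms, it is not the real `ln` implementation, and the temporary directory layout stands in for an in-tree build where `srctree/tools/perf` and `OUTPUT` are the same directory.

```python
import os
import tempfile


def ln_sf(target, link_name, no_dereference=False):
    """Toy model of `ln -sf TARGET LINK_NAME` (`ln -sfn` if no_dereference)."""
    is_symlink = os.path.islink(link_name)
    # Without -n, an existing LINK_NAME that resolves to a directory selects
    # the "TARGET... DIRECTORY" form, even if LINK_NAME is itself a symlink.
    use_directory_form = os.path.isdir(link_name) and not (
        no_dereference and is_symlink
    )
    if use_directory_form:
        dest = os.path.join(link_name, os.path.basename(target))
    else:
        dest = link_name
    if os.path.lexists(dest):
        os.remove(dest)  # -f: replace an existing destination
    os.symlink(target, dest)
    return dest


def stray_link_after_two_runs(no_dereference):
    """Run the link step twice in-tree; report whether perf/perf appeared."""
    with tempfile.TemporaryDirectory() as root:
        # In-tree build: srctree/tools/perf and OUTPUT are the same directory.
        perf_dir = os.path.join(root, "tools", "perf")
        os.makedirs(perf_dir)
        for _ in range(2):
            ln_sf(perf_dir, os.path.join(perf_dir, "source"), no_dereference)
        return os.path.lexists(os.path.join(perf_dir, "perf"))
```

In this model, calling `stray_link_after_two_runs(False)` leaves a stray `perf` link where the executable should be built, while `stray_link_after_two_runs(True)` only refreshes `source`, which is the behaviour the `-n` fix relies on.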
2025-02-28 | perf arm-spe: Report error if set frequency | Leo Yan
When users set the parameter '-F' to specify a frequency for Arm SPE, the tool reports an error:

  perf record -F 1000 -e arm_spe_0// -- sleep 1
  Error:
  Invalid event (arm_spe_0//) in per-thread mode, enable system wide with '-a'.

The output is confusing and does not point users to the actual problem.

Arm SPE does not support frequency setting because it adopts a statistical sampling approach. Instead, Arm SPE supports setting a period.

This commit adds a check for the frequency setting. It reports an error and reminds users to set a period instead.

After:

  perf record -F 1000 -e arm_spe_0// -- sleep 1
  Arm SPE: Frequency is not supported. Set period with -c option or PMU parameter (-e arm_spe_0/period=NUM/).

Signed-off-by: Leo Yan <leo.yan@arm.com>
Reviewed-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250227085544.2154136-1-leo.yan@arm.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-28 | perf lock: Report owner stack in usermode | Chun-Tse Shao
This patch parses `owner_lock_stat` into a RB tree, enabling ordered reporting of owner lock statistics with stack traces. It also updates the documentation for the `-o` option in contention mode, decouples `-o` from `-t`, and issues a warning to inform users about the new behavior of `-ov`.

Example output:

  $ sudo ~/linux/tools/perf/perf lock con -abvo -Y mutex-spin -E3 perf bench sched pipe
  ...
   contended   total wait     max wait     avg wait         type   caller

         171      1.55 ms     20.26 us      9.06 us        mutex   pipe_read+0x57
                          0xffffffffac6318e7  pipe_read+0x57
                          0xffffffffac623862  vfs_read+0x332
                          0xffffffffac62434b  ksys_read+0xbb
                          0xfffffffface604b2  do_syscall_64+0x82
                          0xffffffffad00012f  entry_SYSCALL_64_after_hwframe+0x76
          36    193.71 us     15.27 us      5.38 us        mutex   pipe_write+0x50
                          0xffffffffac631ee0  pipe_write+0x50
                          0xffffffffac6241db  vfs_write+0x3bb
                          0xffffffffac6244ab  ksys_write+0xbb
                          0xfffffffface604b2  do_syscall_64+0x82
                          0xffffffffad00012f  entry_SYSCALL_64_after_hwframe+0x76
           4     51.22 us     16.47 us     12.80 us        mutex   do_epoll_wait+0x24d
                          0xffffffffac691f0d  do_epoll_wait+0x24d
                          0xffffffffac69249b  do_epoll_pwait.part.0+0xb
                          0xffffffffac693ba5  __x64_sys_epoll_pwait+0x95
                          0xfffffffface604b2  do_syscall_64+0x82
                          0xffffffffad00012f  entry_SYSCALL_64_after_hwframe+0x76

  === owner stack trace ===

           3     31.24 us     15.27 us     10.41 us        mutex   pipe_read+0x348
                          0xffffffffac631bd8  pipe_read+0x348
                          0xffffffffac623862  vfs_read+0x332
                          0xffffffffac62434b  ksys_read+0xbb
                          0xfffffffface604b2  do_syscall_64+0x82
                          0xffffffffad00012f  entry_SYSCALL_64_after_hwframe+0x76
  ...

Signed-off-by: Chun-Tse Shao <ctshao@google.com>
Tested-by: Athira Rajeev <atrajeev@linux.ibm.com>
Link: https://lore.kernel.org/r/20250227003359.732948-5-ctshao@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-28 | perf lock: Make rb_tree helper functions generic | Chun-Tse Shao
The rb_tree helper functions can be reused for parsing `owner_lock_stat` into rb tree for sorting. Signed-off-by: Chun-Tse Shao <ctshao@google.com> Tested-by: Athira Rajeev <atrajeev@linux.ibm.com> Link: https://lore.kernel.org/r/20250227003359.732948-4-ctshao@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-28 | perf lock: Retrieve owner callstack in bpf program | Chun-Tse Shao
This implements per-callstack aggregation of lock owners in addition to per-thread. The owner callstack is captured using `bpf_get_task_stack()` at `contention_begin()` and it also adds a custom stackid function for the owner stacks to be compared easily. The owner info is kept in a hash map using lock addr as a key to handle multiple waiters for the same lock. At `contention_end()`, it updates the owner lock stat based on the info that was saved at `contention_begin()`. If there are more waiters, it'd update the owner pid to itself as `contention_end()` means it gets the lock now. But it also needs to check the return value of the lock function in case task was killed by a signal or something. Signed-off-by: Chun-Tse Shao <ctshao@google.com> Tested-by: Athira Rajeev <atrajeev@linux.ibm.com> Link: https://lore.kernel.org/r/20250227003359.732948-3-ctshao@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
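The bookkeeping described above can be pictured with a toy model. This is a sketch under stated assumptions, not the BPF program: the function signatures, the `acquired` flag (standing in for the lock function's return value) and the dictionary-based maps are all illustrative, and the real code captures stacks with `bpf_get_task_stack()` rather than receiving a stack id as an argument.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class OwnerTracingData:
    pid: int        # current owner of the lock
    stack_id: int   # id of the owner's callstack
    timestamp: int  # when the current owner started being waited on
    count: int      # number of waiters


owner_data = {}  # lock address -> OwnerTracingData
owner_stat = defaultdict(lambda: {"contended": 0, "total_wait": 0})


def contention_begin(lock, owner_pid, owner_stack_id, now):
    """A waiter starts waiting on `lock`, currently held by `owner_pid`."""
    data = owner_data.get(lock)
    if data is None:
        owner_data[lock] = OwnerTracingData(owner_pid, owner_stack_id, now, 1)
    else:
        data.count += 1  # another waiter for the same lock


def contention_end(lock, waiter_pid, waiter_stack_id, now, acquired=True):
    """A waiter stops waiting; charge the wait time to the owner's stack."""
    data = owner_data.get(lock)
    if data is None:
        return
    stat = owner_stat[data.stack_id]
    stat["contended"] += 1
    stat["total_wait"] += now - data.timestamp
    data.count -= 1
    if data.count == 0:
        del owner_data[lock]
    elif acquired:
        # More waiters remain and this task now holds the lock, so it
        # becomes the owner the remaining waiters are charged against.
        data.pid = waiter_pid
        data.stack_id = waiter_stack_id
        data.timestamp = now
```

Keying `owner_data` by lock address is what lets several waiters share one owner record; the `acquired` check mirrors the note that a waiter killed by a signal must not be recorded as the new owner.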
2025-02-28 | perf lock: Add bpf maps for owner stack tracing | Chun-Tse Shao
Add a struct and a few BPF maps in order to trace the owner stack.

`struct owner_tracing_data`: Contains the owner's pid, stack id, timestamp for when the owner acquires the lock, and the count of lock waiters.
`stack_buf`: Per-CPU buffer for retrieving the owner stacktrace.
`owner_stacks`: Maps an owner stacktrace to a customized owner stack id.
`owner_data`: Maps a lock address to `struct owner_tracing_data` in the BPF program.
`owner_stat`: For reporting owner stacktraces in usermode.

Signed-off-by: Chun-Tse Shao <ctshao@google.com>
Tested-by: Athira Rajeev <atrajeev@linux.ibm.com>
Link: https://lore.kernel.org/r/20250227003359.732948-2-ctshao@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-27 | perf cpumap: Reduce cpu size from int to int16_t | Ian Rogers
Fewer than 32k logical CPUs are currently supported by perf. A cpumap is indexed by an integer (see perf_cpu_map__cpu) yielding a perf_cpu that wraps a 4-byte int for the logical CPU - the wrapping is done deliberately to avoid confusing a logical CPU with an index into a cpumap. Using a 4-byte int within the perf_cpu is larger than required so this patch reduces it to the 2-byte int16_t. For a cpumap containing 16 entries this will reduce the array size from 64 to 32 bytes. For very large servers with lots of logical CPUs the size savings will be greater. Signed-off-by: Ian Rogers <irogers@google.com> Reviewed-by: James Clark <james.clark@linaro.org> Link: https://lore.kernel.org/r/20250210191231.156294-1-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
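The size arithmetic above can be checked quickly. The sketch below uses Python's `struct` module with standard sizes as a stand-in for the C types; `cpumap_array_bytes` is a hypothetical helper and only counts the per-CPU entries, not any header of the real cpumap structure.

```python
import struct

INT32_FMT = "=i"  # standard-size 4-byte signed int, like the old perf_cpu
INT16_FMT = "=h"  # standard-size 2-byte signed int, like int16_t

# Largest value an int16_t can hold, hence the "fewer than 32k CPUs" limit.
INT16_MAX = 2**15 - 1


def cpumap_array_bytes(nr_cpus, fmt):
    """Bytes needed for nr_cpus entries of the given element format."""
    return nr_cpus * struct.calcsize(fmt)
```

For 16 entries this gives 64 bytes with the 4-byte element and 32 bytes with the 2-byte element, matching the figures in the commit message.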
2025-02-27 | perf trace: Add missing perf_tool__init() | Athira Rajeev
Perf trace on perf.data fails as below:

  ./perf trace record -- sleep 1
  ./perf trace -i perf.data
  perf: Segmentation fault
  Segmentation fault (core dumped)

The backtrace pointed to:

  ?? ()
  perf_session.process_user_event ()
  reader.read_event ()
  perf_session.process_events ()
  cmd_trace ()
  run_builtin ()
  handle_internal_command ()
  main ()

Further debugging showed that the segmentation fault happens when trying to access id_index. Code snippet:

  case PERF_RECORD_ID_INDEX:
          err = tool->id_index(session, event);

Since commit 15d4a6f41d72 ("perf tool: Remove perf_tool__fill_defaults()"), perf_tool__fill_defaults() is removed. All tools are initialized using perf_tool__init() prior to use. But in builtin-trace, perf_tool__init() is not used and hence the defaults are not initialized. Use perf_tool__init() in perf trace to handle the initialization.

Reported-by: Tejas Manhas <Tejas.Manhas1@ibm.com>
Signed-off-by: Athira Rajeev <atrajeev@linux.ibm.com>
Link: https://lore.kernel.org/r/20250225113157.28836-1-atrajeev@linux.ibm.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-26 | perf list: Document -v option deduplication feature | James Clark
-v disables deduplication of similarly suffixed PMUs so add it to the help and doc strings. Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: James Clark <james.clark@linaro.org> Link: https://lore.kernel.org/r/20250226104111.564443-4-james.clark@linaro.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-26 | perf pmu: Don't double count common sysfs and json events | James Clark
After pmu_add_cpu_aliases() is called, perf_pmu__num_events() returns an incorrect value that double counts common events and doesn't match the actual count of events in the alias list. This is because after 'cpu_aliases_added == true', the number of events returned is 'sysfs_aliases + cpu_json_aliases'. But when adding 'case EVENT_SRC_SYSFS' events, 'sysfs_aliases' and 'cpu_json_aliases' are both incremented together, failing to account that these ones overlap and only add a single item to the list. Fix it by adding another counter for overlapping events which doesn't influence 'cpu_json_aliases'. There doesn't seem to be a current issue because it's used in perf list before pmu_add_cpu_aliases() so the correct value is returned. Other uses in tests may also miss it for other reasons like only looking at uncore events. However it's marked as a fixes commit in case any new fix with new uses of perf_pmu__num_events() is backported. Fixes: d9c5f5f94c2d ("perf pmu: Count sys and cpuid JSON events separately") Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: James Clark <james.clark@linaro.org> Link: https://lore.kernel.org/r/20250226104111.564443-3-james.clark@linaro.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
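The double count can be illustrated with a toy model of the counters. This is a sketch only: `count_events`, the event names, and the counter name `cpu_common_json_aliases` are illustrative assumptions, and the real alias handling in perf does considerably more than append names to a list.

```python
def count_events(sysfs_events, json_events, fixed):
    """Return (reported event count, actual alias list length)."""
    alias_list = []              # one entry per distinct event
    sysfs_aliases = 0
    cpu_json_aliases = 0
    cpu_common_json_aliases = 0  # hypothetical name for the extra counter

    for name in sysfs_events:
        alias_list.append(name)
        sysfs_aliases += 1
        if name in json_events:
            # The event exists in both sources but adds one list item only.
            if fixed:
                cpu_common_json_aliases += 1
            else:
                cpu_json_aliases += 1  # old behaviour: counted twice

    for name in json_events:
        if name not in alias_list:
            alias_list.append(name)
            cpu_json_aliases += 1

    return sysfs_aliases + cpu_json_aliases, len(alias_list)
```

With one event present in both sources, the old accounting reports one more event than the list holds, while the separate overlap counter keeps the sum equal to the list length.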
2025-02-26 | perf pmu: Dynamically allocate tool PMU | James Clark
perf_pmus__destroy() treats all PMUs as allocated and frees them, so we can't have any static PMUs that are added to the PMU lists. Fix it by allocating the tool PMU in the same way as the others. Current users of the tool PMU already use find_pmu() and not perf_pmus__tool_pmu(), so rename the function to add 'new' to avoid it being misused in the future.

perf_pmus__fake_pmu() can remain static as it's not added to the PMU lists.

Fixes the following error:

  $ perf bench internals pmu-scan
  # Running 'internals/pmu-scan' benchmark:
  Computing performance of sysfs PMU event scan for 100 times
  munmap_chunk(): invalid pointer
  Aborted (core dumped)

Fixes: 240505b2d0ad ("perf tool_pmu: Factor tool events into their own PMU")
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: James Clark <james.clark@linaro.org>
Link: https://lore.kernel.org/r/20250226104111.564443-2-james.clark@linaro.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-26 | perf probe: Pick the correct dwarf die while adding probe points | Athira Rajeev
Perf probe on vfs_fstatat fails as below on a powerpc system:

  $ ./perf probe -nf --max-probes=512 -a 'vfs_fstatat $params'
  Segmentation fault (core dumped)

This is observed while running the perftool-testsuite_probe testcase.

While running with verbose, it's observed that the segfault happens at:

  synthesize_probe_trace_arg ()
  synthesize_probe_trace_command ()
  probe_file.add_event ()
  apply_perf_probe_events ()
  __cmd_probe ()
  cmd_probe ()
  run_builtin ()
  handle_internal_command ()
  main ()

Code in synthesize_probe_trace_arg() accesses a null value, which results in the segfault. The data structure which is null:

  struct probe_trace_arg arg->value

We are hitting a case where arg->value is null in the probe point "vfs_fstatat $params". This has been happening since commit e896474fe485 ("getname_maybe_null() - the third variant of pathname copy-in").

Before the commit, the probe point for vfs_fstatat was getting added for only one location:

  Writing event: p:probe/vfs_fstatat _text+6345404 dfd=%gpr3:s32 filename=%gpr4:x64 stat=%gpr5:x64 flags=%gpr6:s32

With this change, vfs_fstatat code is inlined at other locations in the code:

  Probe point found: __do_sys_lstat64+48
  Probe point found: __do_sys_stat64+48
  Probe point found: __do_sys_newlstat+48
  Probe point found: __do_sys_newstat+48
  Probe point found: vfs_fstatat+0

When trying to find the matching DWARF information entry (DIE) in the debuginfo, the code incorrectly picks a DIE which does not refer to vfs_fstatat.

Snippet from the DWARF entries in the vmlinux debuginfo file:
The main abstract DIE is:

  <1><4214883>: Abbrev Number: 147 (DW_TAG_subprogram)
     <4214885>   DW_AT_external    : 1
     <4214885>   DW_AT_name        : (indirect string, offset: 0x17b9f3): vfs_fstatat

With formal parameters:

  <2><4214896>: Abbrev Number: 51 (DW_TAG_formal_parameter)
     <4214897>   DW_AT_name        : dfd
  <2><42148a3>: Abbrev Number: 23 (DW_TAG_formal_parameter)
     <42148a4>   DW_AT_name        : (indirect string, offset: 0x8fda9): filename
  <2><42148b0>: Abbrev Number: 23 (DW_TAG_formal_parameter)
     <42148b1>   DW_AT_name        : (indirect string, offset: 0x16bd9c): stat
  <2><42148bd>: Abbrev Number: 23 (DW_TAG_formal_parameter)
     <42148be>   DW_AT_name        : (indirect string, offset: 0x39832b): flags

While collecting variables/parameters for a probe point, the function copy_variables_cb() also looks at DWARF debug entries based on the instruction address. Snippet:

  if (dwarf_haspc(die_mem, vf->pf->addr))
          return DIE_FIND_CB_CONTINUE;
  else
          return DIE_FIND_CB_SIBLING;

But in the case of the inlined function instance for vfs_fstatat, there are two entries which have the same instruction address as their entry point.

Instance 1: which is for vfs_fstatat and whose DW_AT_abstract_origin points to 0x4214883 (see the main abstract DIE above)

  <3><42131fa>: Abbrev Number: 59 (DW_TAG_inlined_subroutine)
     <42131fb>   DW_AT_abstract_origin: <0x4214883>
     <42131ff>   DW_AT_entry_pc    : 0xc00000000062b1e0

Instance 2: which is not for vfs_fstatat but for getname

  <5><4213270>: Abbrev Number: 39 (DW_TAG_inlined_subroutine)
     <4213271>   DW_AT_abstract_origin: <0x4215b6b>
     <4213275>   DW_AT_entry_pc    : 0xc00000000062b1e0

But copy_variables_cb() continues to add parameters from the second instance as well, based on the dwarf_haspc() check. This results in the formal parameters of getname also being appended to params. While filling in args->value for these parameters, since these args are not part of the DIE at offset "42131fa", the value will be null. These incorrect args result in a segfault when the value field is accessed.

Save the dwarf dieoffset of the actual DW_TAG_subprogram as part of "struct probe_finder". In copy_variables_cb(), include check to make sure the DW_AT_abstract_origin points to the correct entry if the dwarf_haspc() matches the instruction address. Signed-off-by: Athira Rajeev <atrajeev@linux.ibm.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20250225123042.37263-1-atrajeev@linux.ibm.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-26 | perf ftrace latency: allow to hide empty buckets | Gabriele Monaco
Especially while using several buckets, it isn't uncommon to have some of them empty and reading the histogram may be a bit more complex:

  # perf ftrace latency -a -T mutex_lock --bucket-range 5 --max-latency 200
  #   DURATION     |  COUNT | GRAPH                                  |
       0 -    5 us |  14816 | ###################################### |
       5 -   10 us |   1228 | ###                                    |
      10 -   15 us |    438 | #                                      |
      15 -   20 us |    106 |                                        |
      20 -   25 us |     21 |                                        |
      25 -   30 us |     11 |                                        |
      30 -   35 us |      1 |                                        |
      35 -   40 us |      2 |                                        |
      40 -   45 us |      4 |                                        |
      45 -   50 us |      0 |                                        |
      50 -   55 us |      1 |                                        |
      55 -   60 us |      0 |                                        |
      60 -   65 us |      1 |                                        |
      65 -   70 us |      1 |                                        |
      70 -   75 us |      1 |                                        |
      75 -   80 us |      2 |                                        |
      80 -   85 us |      0 |                                        |
      85 -   90 us |      1 |                                        |
      90 -   95 us |      0 |                                        |
      95 -  100 us |      1 |                                        |
     100 -  105 us |      0 |                                        |
     105 -  110 us |      0 |                                        |
     110 -  115 us |      0 |                                        |
     115 -  120 us |      0 |                                        |
     120 -  125 us |      1 |                                        |
     125 -  130 us |      0 |                                        |
     130 -  135 us |      0 |                                        |
     135 -  140 us |      1 |                                        |
     140 -  145 us |      0 |                                        |
     145 -  150 us |      0 |                                        |
     150 -  155 us |      0 |                                        |
     155 -  160 us |      0 |                                        |
     160 -  165 us |      0 |                                        |
     165 -  170 us |      0 |                                        |
     170 -  175 us |      0 |                                        |
     175 -  180 us |      0 |                                        |
     180 -  185 us |      0 |                                        |
     185 -  190 us |      0 |                                        |
     190 -  195 us |      0 |                                        |
     195 -  200 us |      0 |                                        |
     200 -  ... us |      2 |                                        |

Allow the optional flag --hide-empty to remove buckets with no element and produce a more compact graph. This feature could be misleading since there is no clear indication of missing buckets. For this reason it's disabled by default.

  # perf ftrace latency -a -T mutex_lock --bucket-range 5 --max-latency 200 --hide-empty
  #   DURATION     |  COUNT | GRAPH                                  |
       0 -    5 us |  14816 | ###################################### |
       5 -   10 us |   1228 | ###                                    |
      10 -   15 us |    438 | #                                      |
      15 -   20 us |    106 |                                        |
      20 -   25 us |     21 |                                        |
      25 -   30 us |     11 |                                        |
      30 -   35 us |      1 |                                        |
      35 -   40 us |      2 |                                        |
      40 -   45 us |      4 |                                        |
      50 -   55 us |      1 |                                        |
      60 -   65 us |      1 |                                        |
      65 -   70 us |      1 |                                        |
      70 -   75 us |      1 |                                        |
      75 -   80 us |      2 |                                        |
      85 -   90 us |      1 |                                        |
      95 -  100 us |      1 |                                        |
     120 -  125 us |      1 |                                        |
     135 -  140 us |      1 |                                        |
     200 -  ... us |      2 |                                        |

Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20250207080446.77630-2-gmonaco@redhat.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-26 | perf ftrace latency: variable histogram buckets | Gabriele Monaco
The max-latency value can make the histogram smaller, but not larger, we have a maximum of 22 buckets and specifying a max-latency that would require more buckets has no effect. Dynamically allocate the buckets and compute the bucket number from the max latency as (max-min) / range + 2 If the maximum is not specified, we still set the bucket number to 22 and compute the maximum accordingly. Fail if the maximum is smaller than min+range, this way we make sure we always have 3 buckets: those below min, those above max and one in the middle. Since max-latency is not available in log2 mode, always use 22 buckets. Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Link: https://lore.kernel.org/r/20250207080446.77630-1-gmonaco@redhat.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
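The bucket computation described above can be sketched as follows. `bucket_layout` is a hypothetical helper, integer division is an assumption, and so is the way the default maximum is derived (chosen here so that the same formula yields 22 buckets); only the `(max-min) / range + 2` formula, the default of 22 and the `min+range` check come from the commit message.

```python
DEFAULT_BUCKETS = 22


def bucket_layout(min_latency, bucket_range, max_latency=None):
    """Return (number of buckets, effective max latency) for linear mode."""
    if max_latency is None:
        # Assumption: pick the maximum so the formula below gives 22 buckets.
        max_latency = min_latency + (DEFAULT_BUCKETS - 2) * bucket_range
        return DEFAULT_BUCKETS, max_latency
    if max_latency < min_latency + bucket_range:
        # Guarantees at least 3 buckets: below min, one in the middle, above max.
        raise ValueError("max latency must be at least min + bucket range")
    buckets = (max_latency - min_latency) // bucket_range + 2
    return buckets, max_latency
```

For example, `bucket_layout(0, 5, 200)` asks for 42 buckets, well beyond the previous fixed limit of 22, which is the case the dynamic allocation is meant to cover.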
2025-02-26 | perf annotate-data: Handle direct use of stack pointer without fbreg | Namhyung Kim
Sometimes the compiler generates code that uses the stack pointer register without a frame pointer. As we know RSP is the stack register on x86, let's treat it the same as fbreg. But the offset would be in the opposite direction, so update the debug message accordingly.

Reported-by: Blake Jones <blakejones@google.com>
Link: https://lore.kernel.org/r/20250126210242.1181225-1-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-25 | Merge tag 'perf-tools-fixes-for-v6.14-2-2025-02-25' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools | Linus Torvalds
Pull perf tools fixes from Arnaldo Carvalho de Melo:

 - Fix tools/ quiet build Makefile infrastructure that was broken when working on tools/perf/ without testing on other tools/ living utilities.

* tag 'perf-tools-fixes-for-v6.14-2-2025-02-25' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools:
  tools: Remove redundant quiet setup
  tools: Unify top-level quiet infrastructure
2025-02-24 | perf report: Fix sample number stats for branch entry mode | Thomas Falcon
Currently, stats->nr_samples is incremented per entry in the branch stack instead of per sample taken. As a result, statistics of samples taken during perf record in --branch-filter or --branch-any mode are not correct. Call hists__inc_nr_samples() for each sample taken instead of for each entry in the branch stack.

Before:

  $ ./perf record -e cycles:u -b -c 10000000000 ./tchain_edit
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.005 MB perf.data (2 samples) ]
  $ perf report -D | tail -n 16
  Aggregated stats:
                 TOTAL events:         16
                  COMM events:          2  (12.5%)
                  EXIT events:          1  ( 6.2%)
                SAMPLE events:          2  (12.5%)
                 MMAP2 events:          2  (12.5%)
               KSYMBOL events:          1  ( 6.2%)
        FINISHED_ROUND events:          1  ( 6.2%)
              ID_INDEX events:          1  ( 6.2%)
            THREAD_MAP events:          1  ( 6.2%)
               CPU_MAP events:          1  ( 6.2%)
          EVENT_UPDATE events:          2  (12.5%)
             TIME_CONV events:          1  ( 6.2%)
         FINISHED_INIT events:          1  ( 6.2%)
  cpu_core/cycles/u stats:
                SAMPLE events:         64

After:

  $ ./perf report -D | tail -n 16
  Aggregated stats:
                 TOTAL events:         16
                  COMM events:          2  (12.5%)
                  EXIT events:          1  ( 6.2%)
                SAMPLE events:          2  (12.5%)
                 MMAP2 events:          2  (12.5%)
               KSYMBOL events:          1  ( 6.2%)
        FINISHED_ROUND events:          1  ( 6.2%)
              ID_INDEX events:          1  ( 6.2%)
            THREAD_MAP events:          1  ( 6.2%)
               CPU_MAP events:          1  ( 6.2%)
          EVENT_UPDATE events:          2  (12.5%)
             TIME_CONV events:          1  ( 6.2%)
         FINISHED_INIT events:          1  ( 6.2%)
  cpu_core/cycles/u stats:
                SAMPLE events:          2

Signed-off-by: Thomas Falcon <thomas.falcon@intel.com>
Link: https://lore.kernel.org/r/20250220045942.114965-1-thomas.falcon@intel.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
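The difference between the two counting schemes can be shown in a few lines. This is a sketch: `count_samples` is a hypothetical stand-in for where hists__inc_nr_samples() gets called, and the 32 branch entries per sample are an assumption chosen so that 2 samples produce the 64 seen in the "Before" output.

```python
def count_samples(samples, per_branch_entry):
    """Count samples; each sample is represented by its list of branch entries."""
    nr_samples = 0
    for branch_stack in samples:
        if per_branch_entry:
            # Old behaviour: one increment for every branch stack entry.
            nr_samples += len(branch_stack)
        else:
            # Fixed behaviour: one increment per sample taken.
            nr_samples += 1
    return nr_samples


# Assumption: each of the 2 samples carries 32 branch stack entries.
example_samples = [[("from", "to")] * 32 for _ in range(2)]
```

Counting `example_samples` per branch entry gives 64, while counting per sample gives 2, the same pair of numbers reported before and after the fix.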
2025-02-24 | perf machine: Reuse module path buffer | Ian Rogers
Rather than copying the path and appending the directory entry in a fresh path buffer, append to the path at the end of where it is for the recursion level. This saves a PATH_MAX buffer per recursion level and some unnecessary copying. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250222061015.303622-9-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-24 | perf hwmon_pmu: Switch event discovery to io_dir__readdir | Ian Rogers
Avoid DIR allocations when scanning sysfs by using io_dir for the readdir implementation, that allocates about 1kb on the stack. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250222061015.303622-8-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-24 | perf parse-events: Switch tracepoints to io_dir__readdir | Ian Rogers
Avoid DIR allocations when scanning sysfs by using io_dir for the readdir implementation, that allocates about 1kb on the stack. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250222061015.303622-7-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-24 | perf events: Remove scandir in thread synthesis | Ian Rogers
This avoids scandir reading the directory into allocated memory and instead allocates on the stack.

Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20250222061015.303622-6-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-24 | perf header: Switch mem topology to io_dir__readdir | Ian Rogers
Switch memory_node__read and build_mem_topology from opendir/readdir to io_dir__readdir, with smaller stack allocations. Reduces peak memory consumption of perf record by 10kb. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250222061015.303622-5-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-24 | perf pmu: Switch to io_dir__readdir | Ian Rogers
Avoid DIR allocations when scanning sysfs by using io_dir for the readdir implementation, that allocates about 1kb on the stack. Acked-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250222061015.303622-4-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-24 | perf maps: Switch modules tree walk to io_dir__readdir | Ian Rogers
Compared to glibc's opendir/readdir this lowers the max RSS of perf record by 1.8MB on a Debian machine. Acked-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250222061015.303622-3-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-20 | perf parse-events: Tidy name token matching | Ian Rogers
Prior to commit 70c90e4a6b2f ("perf parse-events: Avoid scanning PMUs before parsing") names (generally event names) excluded hyphen (minus) symbols as the formation of legacy names with hyphens was handled in the yacc code. That commit allowed hyphens, supposedly making name_minus unnecessary. However, changing name_minus to name has issues in the term config tokens, as name then ends up having priority over numbers, and name allows matching numbers since commit 5ceb57990bf4 ("perf parse: Allow tracepoint names to start with digits"). It is also permissible for a name to match with a colon (':') in it when it's in a config term list. To address this, rename name_minus to term_name, make the pattern match names except for the colon, and add number matching into the config term region with a higher priority than name matching. This addresses an inconsistency and allows greater matching for names inside of term lists, for example, they may start with a number.

Rename name_tag to quoted_name and update comments and helper functions to avoid str detecting quoted strings, which was already done by the lexer.

Signed-off-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20250109175401.161340-1-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-19 | perf tools: Improve startup time by reducing unnecessary stat() calls | Krzysztof Łopatowski
When testing perf trace on NixOS, I noticed significant startup delays:
- `ls`: ~2ms
- `strace ls`: ~10ms
- `perf trace ls`: ~550ms

Profiling showed that 51% of the time is spent reading files, 26% in loading BPF programs, and 11% in `newfstatat`.

This patch optimizes module path exploration by avoiding `stat()` calls unless necessary. For filesystems that do not implement `d_type` (DT_UNKNOWN), it falls back to the old behavior. See `readdir(3)` for details.

This reduces `perf trace ls` time to ~500ms.

A more thorough startup optimization based on command parameters would be ideal, but that is a larger effort.

Signed-off-by: Krzysztof Łopatowski <krzysztof.m.lopatowski@gmail.com>
Acked-by: Howard Chu <howardchu95@gmail.com>
Link: https://lore.kernel.org/r/20250206113314.335376-2-krzysztof.m.lopatowski@gmail.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
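The same idea, trusting the directory entry type and calling `stat()` only when the type is unknown, is available in Python through `os.scandir()`. The sketch below is an analogy rather than the perf code: `list_subdirs` is a hypothetical helper, and the fallback to a system call for entries of unknown type happens inside `DirEntry.is_dir()` instead of being written out explicitly.

```python
import os


def list_subdirs(path):
    """Return the names of the subdirectories of path, sorted."""
    subdirs = []
    with os.scandir(path) as entries:
        for entry in entries:
            # is_dir() uses the type reported by the directory listing when
            # the OS provides it and only needs a system call otherwise.
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.name)
    return sorted(subdirs)
```

A recursive module walk built on this pattern avoids one `stat()` per entry on filesystems that fill in the entry type, which is where the saving in the commit message comes from.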
2025-02-19 | perf report: Fix input reload/switch with symbol sort key | Dmitry Vyukov
Currently the code checks that there is no "ipc" in the sort order and adds an ipc string. This will always error out on the second pass after input reload/switch, since the sort order already contains "ipc". Do the ipc check/fixup only on the first pass.

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://lore.kernel.org/r/20250108063628.215577-1-dvyukov@google.com
Fixes: ec6ae74fe8f0 ("perf report: Display average IPC and IPC coverage per symbol")
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-19 | perf report: Support switching data w/ and w/o callchains | Namhyung Kim
The symbol_conf.use_callchain should be reset when switching to new data file, otherwise report__setup_sample_type() will show an error message that it enabled callchains but no callchain data. The function also will turn on the callchains if the data has PERF_SAMPLE_CALLCHAIN so I think it's ok to reset symbol_conf.use_callchain here. Link: https://lore.kernel.org/r/20250211060745.294289-2-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-19 | perf report: Switch data file correctly in TUI | Namhyung Kim
The 's' key is used to switch to a new data file and load the data in the same window. switch_data_file() shows a popup menu to select which data file the user wants and updates the 'input_name' global variable.

But cmd_report() didn't update data.path using the new 'input_name' and kept using the old file. This is a fairly old bug and I assume people don't use this feature much. :)

Link: https://lore.kernel.org/r/20250211060745.294289-1-namhyung@kernel.org
Closes: https://lore.kernel.org/linux-perf-users/89e678bc-f0af-4929-a8a6-a2666f1294a4@linaro.org
Fixes: f5fc14124c5cefdd ("perf tools: Add data object to handle perf data file")
Reported-by: James Clark <james.clark@linaro.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-19 | perf tools: Fix up some comments and code to properly use the event_source bus | Greg Kroah-Hartman
In sysfs, the perf events are all located in /sys/bus/event_source/devices/ but some places ended up hard-coding the location to be at the root of /sys/devices/ which could be very risky as you do not exactly know what type of device you are accessing in sysfs at that location. So fix this all up by properly pointing everything at the bus device list instead of the root of the sysfs devices/ tree. Cc: stable <stable@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lore.kernel.org/r/2025021955-implant-excavator-179d@gregkh Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-19 | perf list: Also append PMU name in verbose mode | James Clark
When listing in verbose mode, the long description is used but the PMU name isn't appended. There doesn't seem to be a reason to exclude it when asking for more information, so use the same print block for both long and short descriptions. Before: $ perf list -v ... inst_retired [Instruction architecturally executed] After: $ perf list -v ... inst_retired [Instruction architecturally executed. Unit: armv8_cortex_a57] Signed-off-by: James Clark <james.clark@linaro.org> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250219151622.1097289-1-james.clark@linaro.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-19 | perf vendor events arm64: Fix incorrect CPU_CYCLE in metrics expr | Yangyu Chen
Some existing metrics for Neoverse N3 and V3 expressions use CPU_CYCLE to represent the number of cycles, but this is incorrect. The correct event to use is CPU_CYCLES. I encountered this issue while working on a patch to add pmu events for Cortex A720 and A520 by reusing the existing patch for Neoverse N3 and V3 by James Clark [1] and my check script [2] reported this issue. [1] https://lore.kernel.org/lkml/20250122163504.2061472-1-james.clark@linaro.org/ [2] https://github.com/cyyself/arm-pmu-check Signed-off-by: Yangyu Chen <cyy@cyyself.name> Reviewed-by: James Clark <james.clark@linaro.org> Link: https://lore.kernel.org/r/tencent_D4ED18476ADCE818E31084C60E3E72C14907@qq.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-18 | perf script: Fix hangup in offline flamegraph report | Namhyung Kim
A recent change in the flamegraph script fixed an issue with live mode but it created another for offline mode. It needs to pass "-" to -i option to read from stdin in the live mode. Actually there's a logic to pass the option in the perf script code, but the script was written with "-- $@" which prevented the option to go to the perf script. So the previous commit added the hard-coded "-i -" to the report command. But it's a problem for the offline mode which expects input from a file and now it's stuck on reading from stdin. Let's remove the "-i - --" part and let it pass the options properly to perf script. Closes: https://lore.kernel.org/linux-perf-users/c41e4b04-e1fd-45ab-80b0-ec2ac6e94310@linux.ibm.com Fixes: 23e0a63c6dd3f69c ("perf script: force stdin for flamegraph in live mode") Reported-by: Thomas Richter <tmricht@linux.ibm.com> Tested-by: Thomas Richter <tmricht@linux.ibm.com> Cc: Anubhav Shelat <ashelat@redhat.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-18  perf hist: Shrink struct hist_entry size  Dmitry Vyukov
Reorder the struct fields by size to reduce padding, and reduce struct simd_flags size from 8 bytes to 1 byte. This reduces struct hist_entry size by 8 bytes (592->584) and leaves a single, more usable 6-byte padding hole. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lore.kernel.org/r/7c1cb1c8f9901e945162701ba7269d0f9c70be89.1739437531.git.dvyukov@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
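The effect of ordering fields by size can be illustrated with a small sketch. The two structures below are hypothetical and do not reproduce the real struct hist_entry layout; they only show how interleaving small and large fields creates padding that reordering removes. Exact sizes depend on the platform ABI, so no specific byte counts are claimed.

```python
import ctypes


class Interleaved(ctypes.Structure):
    # Hypothetical layout: two 1-byte fields separated by an 8-byte field,
    # so the compiler has to pad after each small field.
    _fields_ = [
        ("flag_a", ctypes.c_uint8),
        ("period", ctypes.c_uint64),
        ("flag_b", ctypes.c_uint8),
    ]


class Reordered(ctypes.Structure):
    # Same fields, largest first, so both small fields share one tail slot.
    _fields_ = [
        ("period", ctypes.c_uint64),
        ("flag_a", ctypes.c_uint8),
        ("flag_b", ctypes.c_uint8),
    ]


def padding_saved():
    """Return how many bytes the reordered layout saves on this platform."""
    return ctypes.sizeof(Interleaved) - ctypes.sizeof(Reordered)


if __name__ == "__main__":
    print("interleaved:", ctypes.sizeof(Interleaved))
    print("reordered:  ", ctypes.sizeof(Reordered))
    print("saved:      ", padding_saved())
```

On the real structure, a tool such as pahole is the usual way to inspect holes; the sketch only demonstrates the principle.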
2025-02-18  perf test: Add tests for latency and parallelism profiling  Dmitry Vyukov
Ensure basic operation of latency/parallelism profiling and that main latency/parallelism record/report invocations don't fail/crash. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lore.kernel.org/r/c129c8f02f328f68e1e9ef2cdc582f8a9786a97d.1739437531.git.dvyukov@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-18  perf report: Add latency and parallelism profiling documentation  Dmitry Vyukov
Describe latency and parallelism profiling, the related flags, and how they differ from CPU-consumption-centric profiling, which is currently the only supported mode. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lore.kernel.org/r/a13f270ed33cedb03ce9ebf9ddbd064854ca0f19.1739437531.git.dvyukov@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-18  perf report: Add --latency flag  Dmitry Vyukov
Add a record/report --latency flag that captures and shows latency-centric profiles rather than the default CPU-consumption-centric profiles. For latency profiles, record captures context switch events and report shows Latency as the first column. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lore.kernel.org/r/e9640464bcbc47dde2cb557003f421052ebc9eec.1739437531.git.dvyukov@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-18  perf report: Add latency output field  Dmitry Vyukov
Latency output field is similar to overhead, but represents overhead for latency rather than CPU consumption. It's re-scaled from overhead by dividing weight by the current parallelism level at the time of the sample. It effectively models profiling with 1 sample taken per unit of wall-clock time rather than unit of CPU time. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lore.kernel.org/r/b6269518758c2166e6ffdc2f0e24cfdecc8ef9c1.1739437531.git.dvyukov@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
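The rescaling described above can be sketched as follows. This is an illustration of the idea only, not perf's implementation: the function names, the sample tuples, and the way totals are normalized are all assumptions made for the example.

```python
def latency_weight(weight, parallelism):
    """Rescale a CPU-time sample weight into a wall-clock (latency) weight."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    return weight / parallelism


def overheads(samples):
    """Return {symbol: (cpu_fraction, latency_fraction)}.

    `samples` is a list of hypothetical (symbol, weight, parallelism) tuples.
    """
    cpu_total = sum(w for _, w, _ in samples)
    lat_total = sum(latency_weight(w, p) for _, w, p in samples)
    result = {}
    for sym, w, p in samples:
        cpu, lat = result.get(sym, (0.0, 0.0))
        result[sym] = (cpu + w / cpu_total,
                       lat + latency_weight(w, p) / lat_total)
    return result


# Hypothetical workload: a serial phase on one thread, then the same
# wall-clock duration spent on four threads running in parallel.
EXAMPLE = [("serial_init", 100, 1), ("parallel_work", 400, 4)]

if __name__ == "__main__":
    for sym, (cpu, lat) in overheads(EXAMPLE).items():
        print(f"{sym}: overhead={cpu:.0%} latency={lat:.0%}")
```

In this made-up workload the parallel phase dominates CPU consumption, while both phases contribute equally to wall-clock time, which is the distinction the latency field is meant to expose.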
2025-02-18  perf report: Add parallelism filter  Dmitry Vyukov
Add parallelism filter that can be used to look at specific parallelism levels only. The format is the same as cpu lists. For example:

  Only single-threaded samples: --parallelism=1
  Low parallelism only: --parallelism=1-4
  High parallelism only: --parallelism=64-128

Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lore.kernel.org/r/e61348985ff0a6a14b07c39e880edbd60a8f8635.1739437531.git.dvyukov@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
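A minimal sketch of how a cpu-list style specification such as those in the examples could be interpreted. The parser and the sample tuples are hypothetical and written only for illustration; perf's own parsing code is not shown here, and the comma-separated form is assumed from the usual cpu list syntax.

```python
def parse_level_list(spec):
    """Parse a cpu-list style string such as "1", "1-4" or "1-2,8" into a set."""
    levels = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty element in {spec!r}")
        if "-" in part:
            low_text, high_text = part.split("-", 1)
            low, high = int(low_text), int(high_text)
            if low > high:
                raise ValueError(f"descending range {part!r}")
            levels.update(range(low, high + 1))
        else:
            levels.add(int(part))
    return levels


def filter_samples(samples, spec):
    """Keep only hypothetical (symbol, parallelism) samples whose level matches."""
    allowed = parse_level_list(spec)
    return [s for s in samples if s[1] in allowed]


if __name__ == "__main__":
    samples = [("init", 1), ("work", 4), ("work", 64)]
    print(filter_samples(samples, "1-4"))
```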
2025-02-18  perf report: Switch filtered from u8 to u16  Dmitry Vyukov
All u8 bits are already taken, so adding one more filter leads to an unpleasant failure mode: the code compiles without warnings, but the last filters silently don't work. Add a typedef and switch to u16. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lore.kernel.org/r/32b4ce1731126c88a2d9e191dc87e39ae4651cb7.1739437531.git.dvyukov@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
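The silent failure can be reproduced with fixed-width integers. The sketch below uses ctypes to mimic u8 and u16 storage; the bit position 8 stands in for a hypothetical ninth filter and is not taken from the perf sources.

```python
import ctypes

NINTH_FILTER_BIT = 8  # hypothetical: one past the last bit that fits in a u8


def set_filter(storage_type, bit):
    """Set one filter bit in a field of the given width and return what is stored."""
    field = storage_type(0)
    # ctypes integer types do no overflow checking, so bits that do not fit
    # are dropped silently, much like an implicit narrowing store in C.
    field.value |= 1 << bit
    return field.value


if __name__ == "__main__":
    print("u8 :", set_filter(ctypes.c_uint8, NINTH_FILTER_BIT))
    print("u16:", set_filter(ctypes.c_uint16, NINTH_FILTER_BIT))
```

With 8-bit storage the new bit is lost and the field still reads as "not filtered", whereas 16-bit storage keeps it.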
2025-02-18  tools: Unify top-level quiet infrastructure  Charlie Jenkins
Commit f2868b1a66d4f40f ("perf tools: Expose quiet/verbose variables in Makefile.perf") moved the quiet infrastructure out of tools/build/Makefile.build and into the top-level Makefile.perf file so that the quiet infrastructure could be used throughout perf and not just in Makefile.build. Extract out the quiet infrastructure into Makefile.include so that it can be leveraged outside of perf. Fixes: f2868b1a66d4f40f ("perf tools: Expose quiet/verbose variables in Makefile.perf") Reviewed-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Charlie Jenkins <charlie@rivosinc.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Benjamin Tissoires <bentiss@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: Hao Luo <haoluo@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: KP Singh <kpsingh@kernel.org> Cc: Lukasz Luba <lukasz.luba@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Mykola Lysenko <mykolal@fb.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Monnet <qmo@kernel.org> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Stanislav Fomichev <sdf@google.com> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Yonghong Song <yonghong.song@linux.dev> Cc: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20250213-quiet_tools-v3-1-07de4482a581@rivosinc.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-02-17  perf report: Add parallelism sort key  Dmitry Vyukov
Show parallelism level in profiles if requested by user. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lore.kernel.org/r/7f7bb87cbaa51bf1fb008a0d68b687423ce4bad4.1739437531.git.dvyukov@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-17  perf report: Add machine parallelism  Dmitry Vyukov
Add calculation of the current parallelism level (number of threads actively running on CPUs). The parallelism level can be shown in reports on its own and is also used to calculate latency overheads. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lore.kernel.org/r/0f8c1b8eb12619029e31b3d5c0346f4616a5aeda.1739437531.git.dvyukov@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
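One way to picture "number of threads actively running on CPUs" is as a counter driven by switch-in and switch-out events. The sketch below is an assumption-laden illustration, not the code added by this commit: the event tuples and the "in"/"out" labels are invented for the example.

```python
def parallelism_timeline(events):
    """Return [(timestamp, level)] from hypothetical (timestamp, kind) events.

    `kind` is "in" when a thread starts running on a CPU and "out" when it stops.
    """
    level = 0
    timeline = []
    for timestamp, kind in sorted(events):
        if kind == "in":
            level += 1
        elif kind == "out":
            # Clamp at zero in case the trace starts mid-execution.
            level = max(level - 1, 0)
        else:
            raise ValueError(f"unknown event kind {kind!r}")
        timeline.append((timestamp, level))
    return timeline


if __name__ == "__main__":
    trace = [(0, "in"), (1, "in"), (2, "out"), (3, "out")]
    print(parallelism_timeline(trace))
```

A sample taken at a given time would then be attributed the level in effect at that timestamp.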
2025-02-14  perf tools: Fix compile error on sample->user_regs  Namhyung Kim
The field was recently changed to be allocated dynamically, but some arch-dependent code was not updated to use perf_sample__user_regs(). Fixes: dc6d2bc2d893a878 ("perf sample: Make user_regs and intr_regs optional") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250214191641.756664-1-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-14  perf tools: Fix compilation error on arm64  Leo Yan
Since commit dc6d2bc2d893 ("perf sample: Make user_regs and intr_regs optional"), building for Arm64 reports errors:

  arch/arm64/util/unwind-libdw.c: In function ‘libdw__arch_set_initial_registers’:
  arch/arm64/util/unwind-libdw.c:11:32: error: initialization of ‘struct regs_dump *’ from incompatible pointer type ‘struct regs_dump **’ [-Werror=incompatible-pointer-types]
     11 | struct regs_dump *user_regs = &ui->sample->user_regs;
        |                               ^
  cc1: all warnings being treated as errors
  make[6]: *** [/home/niayan01/linux/tools/build/Makefile.build:85: arch/arm64/util/unwind-libdw.o] Error 1
  make[5]: *** [/home/niayan01/linux/tools/build/Makefile.build:138: util] Error 2
  arch/arm64/tests/dwarf-unwind.c: In function ‘test__arch_unwind_sample’:
  arch/arm64/tests/dwarf-unwind.c:48:27: error: initialization of ‘struct regs_dump *’ from incompatible pointer type ‘struct regs_dump **’ [-Werror=incompatible-pointer-types]
     48 | struct regs_dump *regs = &sample->user_regs;
        |                          ^

To fix the issue, use the helper perf_sample__user_regs() to retrieve the user_regs. Fixes: dc6d2bc2d893 ("perf sample: Make user_regs and intr_regs optional") Signed-off-by: Leo Yan <leo.yan@arm.com> Reviewed-by: James Clark <james.clark@linaro.org> Link: https://lore.kernel.org/r/20250214111025.14478-1-leo.yan@arm.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-12  perf sample: Make user_regs and intr_regs optional  Ian Rogers
The struct regs_dump contains 512 bytes of cache_regs, meaning the two values in perf_sample contribute 1088 bytes of its total 1384 bytes. Initializing this much memory has a cost, reported by Tavian Barnes <tavianator@tavianator.com> as about 2.5% when running `perf script --itrace=i0`: https://lore.kernel.org/lkml/d841b97b3ad2ca8bcab07e4293375fb7c32dfce7.1736618095.git.tavianator@tavianator.com/ Adrian Hunter <adrian.hunter@intel.com> replied that the zero initialization was necessary and couldn't simply be removed. This patch aims to strike a middle ground: it still zeroes the perf_sample but removes 79% of its size by making user_regs and intr_regs optional pointers to zalloc-ed memory. To support the allocation, accessors are created for user_regs and intr_regs. To support correct cleanup, perf_sample__init and perf_sample__exit functions are created and added throughout the code base. Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250113194345.1537821-1-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
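The 79% figure follows directly from the two byte counts quoted in the message. A quick check, using only those numbers (the constant and function names are made up for the example):

```python
REGS_BYTES = 1088    # bytes contributed by user_regs and intr_regs, per the message
SAMPLE_BYTES = 1384  # total perf_sample size, per the message


def regs_share(regs_bytes=REGS_BYTES, sample_bytes=SAMPLE_BYTES):
    """Fraction of the sample occupied by the two embedded register dumps."""
    return regs_bytes / sample_bytes


if __name__ == "__main__":
    print(f"{regs_share():.1%} of the struct is register dump storage")
```

The calculation ignores the pointers that replace the embedded dumps, so it describes the share attributed to the dumps rather than the exact size of the new struct.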
2025-02-12  perf test stat_all_metrics: Ensure missing events fail test  Ian Rogers
Issue reported by Thomas Falcon and diagnosed by Kan Liang here: https://lore.kernel.org/lkml/d44036481022c27d83ce0faf8c7f77042baedb34.camel@intel.com/ Metrics with missing events can be erroneously skipped if they contain FP, AMX or PMM events. Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Kan Liang <kan.liang@linux.intel.com> Tested-by: Thomas Falcon <thomas.falcon@intel.com> Link: https://lore.kernel.org/r/20250211213031.114209-25-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-12  perf vendor events: Update Tigerlake events/metrics  Ian Rogers
Update events from v1.16 to v1.17. Update TMA metrics from 4.8 to 5.02. Bring in the event updates v1.17: https://github.com/intel/perfmon/commit/e1d5ac3412450bf049301cb26206d03c41066b83 The TMA 5.02 addition is from (with subsequent fixes): https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782 Co-developed-by: Caleb Biggers <caleb.biggers@intel.com> Signed-off-by: Caleb Biggers <caleb.biggers@intel.com> Acked-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Thomas Falcon <thomas.falcon@intel.com> Link: https://lore.kernel.org/r/20250211213031.114209-24-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-12  perf vendor events: Update SkylakeX events/metrics  Ian Rogers
Update events from v1.35 to v1.36. Update TMA metrics from 4.8 to 5.02. Bring in the event updates v1.36: https://github.com/intel/perfmon/commit/f6801e5c145406f355f40e1746f836eaa1426cf9 The TMA 5.02 addition is from (with subsequent fixes): https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782 Co-developed-by: Caleb Biggers <caleb.biggers@intel.com> Signed-off-by: Caleb Biggers <caleb.biggers@intel.com> Acked-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Thomas Falcon <thomas.falcon@intel.com> Link: https://lore.kernel.org/r/20250211213031.114209-23-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
2025-02-12  perf vendor events: Update Skylake metrics  Ian Rogers
Update TMA metrics from 4.8 to 5.02. The TMA 5.02 addition is from (with subsequent fixes): https://github.com/intel/perfmon/commit/1d72913b2d938781fb28f3cc3507aaec5c22d782 Co-developed-by: Caleb Biggers <caleb.biggers@intel.com> Signed-off-by: Caleb Biggers <caleb.biggers@intel.com> Acked-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Thomas Falcon <thomas.falcon@intel.com> Link: https://lore.kernel.org/r/20250211213031.114209-22-irogers@google.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>