linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2025-05-08	kselftest/arm64: fp-ptrace: Fix expected FPMR value when PSTATE.SM is changed	Mark Rutland
	The fp-ptrace test suite expects that FPMR is set to zero when PSTATE.SM is changed via ptrace, but ptrace has never altered FPMR in this way, and the test logic erroneously relies upon (and has concealed) a bug where task_fpsimd_load() would unexpectedly and non-deterministically clobber FPMR. Using ptrace, FPMR can only be altered by writing to the NT_ARM_FPMR regset. The value of PSTATE.SM can be altered by writing to the NT_ARM_SVE or NT_ARM_SSVE regsets, and/or by changing the SME vector length (when writing to the NT_ARM_SVE, NT_ARM_SSVE, or NT_ARM_ZA regsets), but none of these writes will change the value of FPMR. The task_fpsimd_load() bug was introduced with the initial FPMR support in commit: 203f2b95a882 ("arm64/fpsimd: Support FEAT_FPMR") The incorrect FPMR test code was introduced in commit: 7dbd26d0b22d ("kselftest/arm64: Add FPMR coverage to fp-ptrace") Subsequently, the task_fpsimd_load() bug was fixed in commit: e5fa85fce08b ("arm64/fpsimd: Don't corrupt FPMR when streaming mode changes") ... whereupon the fp-ptrace FPMR tests started failing reliably, e.g. \| # # Mismatch in saved FPMR: 915058000 != 0 \| # not ok 25 SVE write, SVE 64->64, SME 64/0->64/1 Fix this by changing the test to expect that FPMR is NOT changed when PSTATE.SM is changed via ptrace, matching the extant behaviour. I've chosen to update the test code rather than modifying ptrace to zero FPMR when PSTATE.SM changes. Not zeroing FPMR is simpler overall, and allows the NT_ARM_FPMR regset to be handled independently from other regsets, leaving less scope for error. Fixes: 7dbd26d0b22d ("kselftest/arm64: Add FPMR coverage to fp-ptrace") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Spickett <david.spickett@arm.com> Cc: Luis Machado <luis.machado@arm.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250508132644.1395904-22-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2025-05-08	KVM: selftests: Add a test for x86's fastops emulation	Sean Christopherson
	Add a test to verify KVM's fastops emulation via forced emulation. KVM's so called "fastop" infrastructure executes the to-be-emulated instruction directly on hardware instead of manually emulating the instruction in software, using various shenanigans to glue together the emulator context and CPU state, e.g. to get RFLAGS fed into the instruction and back out for the emulator. Add testcases for all instructions that are low hanging fruit. While the primary goal of the selftest is to validate the glue code, a secondary goal is to ensure "emulation" matches hardware exactly, including for arithmetic flags that are architecturally undefined. While arithmetic flags may be architecturally undefined, their behavior is deterministic for a given CPU (likely a given uarch, and possibly even an entire family or class of CPUs). I.e. KVM has effectively been emulating underlying hardware behavior for years. Link: https://lore.kernel.org/r/20250506011250.1089254-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-07	selftests/mm: fix a build failure on powerpc	Nysal Jan K.A.
	The compiler is unaware of the size of code generated by the ".rept" assembler directive. This results in the compiler emitting branch instructions where the offset to branch to exceeds the maximum allowed value, resulting in build failures like the following: CC protection_keys /tmp/ccypKWAE.s: Assembler messages: /tmp/ccypKWAE.s:2073: Error: operand out of range (0x0000000000020158 is not between 0xffffffffffff8000 and 0x0000000000007ffc) /tmp/ccypKWAE.s:2509: Error: operand out of range (0x0000000000020130 is not between 0xffffffffffff8000 and 0x0000000000007ffc) Fix the issue by manually adding nop instructions using the preprocessor. Link: https://lkml.kernel.org/r/20250428131937.641989-2-nysal@linux.ibm.com Fixes: 46036188ea1f ("selftests/mm: build with -O2") Reported-by: Madhavan Srinivasan <maddy@linux.ibm.com> Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Reviewed-by: Donet Tom <donettom@linux.ibm.com> Tested-by: Donet Tom <donettom@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-07	selftests/mm: fix build break when compiling pkey_util.c	Madhavan Srinivasan
	Commit 50910acd6f615 ("selftests/mm: use sys_pkey helpers consistently") added a pkey_util.c to refactor some of the protection_keys functions accessible by other tests. But this broken the build in powerpc in two ways, pkey-powerpc.h: In function `arch_is_powervm': pkey-powerpc.h:73:21: error: storage size of `buf' isn't known 73 \| struct stat buf; \| ^~~ pkey-powerpc.h:75:14: error: implicit declaration of function `stat'; did you mean `strcat'? [-Wimplicit-function-declaration] 75 \| if ((stat("/sys/firmware/devicetree/base/ibm,partition-name", &buf) == 0) && \| ^~~~ \| strcat Since pkey_util.c includes pkeys-helper.h, which in turn includes pkeys-powerpc.h, stat.h including is missing for "struct stat". This is fixed by adding "sys/stat.h" in pkeys-powerpc.h Secondly, pkey-powerpc.h:55:18: warning: format `%llx' expects argument of type `long long unsigned int', but argument 3 has type `u64' {aka `long unsigned int'} [-Wformat=] 55 \| dprintf4("%s() changing %016llx to %016llx\n", \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 56 \| __func__, __read_pkey_reg(), pkey_reg); \| ~~~~~~~~~~~~~~~~~ \| \| \| u64 {aka long unsigned int} pkey-helpers.h:63:32: note: in definition of macro `dprintf_level' 63 \| sigsafe_printf(args); \ \| ^~~~ These format specifier related warning are removed by adding "__SANE_USERSPACE_TYPES__" to pkeys_utils.c. Link: https://lkml.kernel.org/r/20250428131937.641989-1-nysal@linux.ibm.com Fixes: 50910acd6f61 ("selftests/mm: use sys_pkey helpers consistently") Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-07	tools/testing/selftests: fix guard region test tmpfs assumption	Lorenzo Stoakes
	The current implementation of the guard region tests assume that /tmp is mounted as tmpfs, that is shmem. This isn't always the case, and at least one instance of a spurious test failure has been reported as a result. This assumption is unsafe, rushed and silly - and easily remedied by simply using memfd, so do so. We also have to fixup the readonly_file test to explicitly only be applicable to file-backed cases. Link: https://lkml.kernel.org/r/20250425162436.564002-1-lorenzo.stoakes@oracle.com Fixes: 272f37d3e99a ("tools/selftests: expand all guard region tests to file-backed") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Ryan Roberts <ryan.roberts@arm.com> Closes: https://lore.kernel.org/linux-mm/a2d2766b-0ab4-437b-951a-8595a7506fe9@arm.com/ Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-07	selftests/mm: compaction_test: support platform with huge mount of memory	Feng Tang
	When running mm selftest to verify mm patches, 'compaction_test' case failed on an x86 server with 1TB memory. And the root cause is that it has too much free memory than what the test supports. The test case tries to allocate 100000 huge pages, which is about 200 GB for that x86 server, and when it succeeds, it expects it's large than 1/3 of 80% of the free memory in system. This logic only works for platform with 750 GB ( 200 / (1/3) / 80% ) or less free memory, and may raise false alarm for others. Fix it by changing the fixed page number to self-adjustable number according to the real number of free memory. Link: https://lkml.kernel.org/r/20250423103645.2758-1-feng.tang@linux.alibaba.com Fixes: bd67d5c15cc1 ("Test compaction of mlocked memory") Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Acked-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@inux.alibaba.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Sri Jayaramappa <sjayaram@akamai.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-07	tools: ynl-gen: move the count into a presence struct too	Jakub Kicinski
	While we reshuffle the presence members, move the counts as well. Previously array count members would have been place directly in the struct, so: struct family_op_req { struct { u32 a:1; u32 b:1; } _present; struct { u32 bin; } _len; u32 a; u64 b; const unsigned char bin; u32 n_multi; << count u32 multi; << objects }; Since len has been moved to its own presence struct move the count as well: struct family_op_req { struct { u32 a:1; u32 b:1; } _present; struct { u32 bin; } _len; struct { u32 multi; << count } _count; u32 a; u64 b; const unsigned char bin; u32 multi; << objects }; This improves the consistency and allows us to remove some hacks in the codegen. Unlike for len there is no known name collision with the existing scheme. Link: https://patch.msgid.link/20250505165208.248049-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-07	tools: ynl-gen: split presence metadata	Jakub Kicinski
	Each YNL struct contains the data and a sub-struct indicating which fields are valid. Something like: struct family_op_req { struct { u32 a:1; u32 b:1; u32 bin_len; } _present; u32 a; u64 b; const unsigned char bin; }; Note that the bin object 'bin' has a length stored, and that length has a _len suffix added to the field name. This breaks if there is a explicit field called bin_len, which is the case for some TC actions. Move the length fields out of the _present struct, create a new struct called _len: struct family_op_req { struct { u32 a:1; u32 b:1; } _present; struct { u32 bin; } _len; u32 a; u64 b; const unsigned char bin; }; This should prevent name collisions and help with the packing of the struct. Unfortunately this is a breaking change, but hopefully the migration isn't too painful. Link: https://patch.msgid.link/20250505165208.248049-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-07	tools: ynl-gen: rename basic presence from 'bit' to 'present'	Jakub Kicinski
	Internal change to the code gen. Rename how we indicate a type has a single bit presence from using a 'bit' string to 'present'. This is a noop in terms of generated code but will make next breaking change easier. Reviewed-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250505165208.248049-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-07	bpf: Clarify handling of mark and tstamp by redirect_peer	Paul Chaignon
	When switching network namespaces with the bpf_redirect_peer helper, the skb->mark and skb->tstamp fields are not zeroed out like they can be on a typical netns switch. This patch clarifies that in the helper description. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/ccc86af26d43c5c0b776bcba2601b7479c0d46d0.1746460653.git.paul.chaignon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-07	rtla: Define _GNU_SOURCE in timerlat_bpf.c	Tomas Glozar
	Newer versions of glibc include a definition of struct sched_attr in bits/sched.h (included through sched.h which is included by rtla). Commit 0eecee340672 ("tools/rtla: fix collision with glibc sched_attr/sched_set_attr") has modified the definition of struct sched_attr in utils.h, so that it is only applied with older versions of glibc that do not define it, in order to prevent build failure. The definition in bits/sched.h depends on _GNU_SOURCE. timerlat_bpf.c does not define _GNU_SOURCE, making it fall back to the definition in utils.h. The latter has two fields less, leading to shifted offsets of struct timerlat_params in timerlat_bpf_init. Because of the shift, timerlat_bpf_init incorrectly reads params->entries as 0 for timerlat-hist and disables the creation of histogram maps, causing breakage in BPF sample collection mode: $ rtla timerlat hist -d 1s Error pulling BPF data Fix the issue by also defining _GNU_SOURCE in timerlat_bpf.c. Cc: John Kacur <jkacur@redhat.com> Cc: Luis Goncalves <lgoncalv@redhat.com> Link: https://lore.kernel.org/20250430144651.621766-1-tglozar@redhat.com Fixes: e34293ddcebd ("rtla/timerlat: Add BPF skeleton to collect samples") Signed-off-by: Tomas Glozar <tglozar@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-07	rtla: Define __NR_sched_setattr for LoongArch	Tiezhu Yang
	When executing "make -C tools/tracing/rtla" on LoongArch, there exists the following error: src/utils.c:237:24: error: '__NR_sched_setattr' undeclared Just define __NR_sched_setattr for LoongArch if not exist. Link: https://lore.kernel.org/20250422074917.25771-1-yangtiezhu@loongson.cn Reported-by: Haiyong Sun <sunhaiyong@loongson.cn> Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-07	rtla: Set distinctive exit value for failed tests	Costa Shulyupin
	A test is considered failed when a sample trace exceeds the threshold. Failed tests return the same exit code as passed tests, requiring test frameworks to determine the result by searching for "hit stop tracing" in the output. Assign a distinct exit code for failed tests to enable the use of shell expressions and seamless integration with testing frameworks without the need to parse output. Add enum type for return value. Update `make check`. Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: John Kacur <jkacur@redhat.com> Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com> Cc: Eder Zulian <ezulian@redhat.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Jan Stancek <jstancek@redhat.com> Link: https://lore.kernel.org/20250417185757.2194541-1-costa.shul@redhat.com Signed-off-by: Costa Shulyupin <costa.shul@redhat.com> Reviewed-by: Tomas Glozar <tglozar@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-07	Merge remote-tracking branch 'torvalds/master' into perf-tools-next	Arnaldo Carvalho de Melo
	To pick up fixes from the latest perf-tools pull request from Namhyung and get perf-tools-next in line with thinngs in other areas it uses, like tools/lib/bpf, etc. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-06	tools: ynl-gen: allow noncontiguous enums	Jiri Pirko
	in case the enum has holes, instead of hard stop, generate a validation callback to check valid enum values. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20250505114513.53370-2-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-06	io_uring/zcrx: selftests: fix setting ntuple rule into rss	David Wei
	Fix ethtool syntax for setting ntuple rule into rss. It should be `context' instead of `action'. Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250503043007.857215-1-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-06	cxl/test: Address missing MODULE_DESCRIPTION warnings for cxl_test	Dave Jiang
	Add MODULE_DESCRIPTION() to address the following warnings: WARNING: modpost: missing MODULE_DESCRIPTION() in test/cxl_test.o WARNING: modpost: missing MODULE_DESCRIPTION() in test/cxl_mock.o WARNING: modpost: missing MODULE_DESCRIPTION() in test/cxl_mock_mem.o [dj: s/CXL test/cxl_test:/ per djbw's comment] Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Link: https://patch.msgid.link/20250429235953.4175408-1-dave.jiang@intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com>
2025-05-07	objtool/rust: add one more `noreturn` Rust function for Rust 1.87.0	Miguel Ojeda
	Starting with Rust 1.87.0 (expected 2025-05-15), `objtool` may report: rust/core.o: warning: objtool: _R..._4core9panicking9panic_fmt() falls through to next function _R..._4core9panicking18panic_nounwind_fmt() rust/core.o: warning: objtool: _R..._4core9panicking18panic_nounwind_fmt() falls through to next function _R..._4core9panicking5panic() The reason is that `rust_begin_unwind` is now mangled: _R..._7___rustc17rust_begin_unwind Thus add the mangled one to the list so that `objtool` knows it is actually `noreturn`. See commit 56d680dd23c3 ("objtool/rust: list `noreturn` Rust functions") for more details. Alternatively, we could remove the fixed one in `noreturn.h` and relax this test to cover both, but it seems best to be strict as long as we can. Cc: stable@vger.kernel.org # Needed in 6.12.y and later (Rust is pinned in older LTSs). Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Link: https://lore.kernel.org/r/20250502140237.1659624-2-ojeda@kernel.org Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2025-05-06	bpftool: Fix regression of "bpftool cgroup tree" EINVAL on older kernels	YiFei Zhu
	If cgroup_has_attached_progs queries an attach type not supported by the running kernel, due to the kernel being older than the bpftool build, it would encounter an -EINVAL from BPF_PROG_QUERY syscall. Prior to commit 98b303c9bf05 ("bpftool: Query only cgroup-related attach types"), this EINVAL would be ignored by the function, allowing the function to only consider supported attach types. The commit changed so that, instead of querying all attach types, only attach types from the array `cgroup_attach_types` is queried. The assumption is that because these are only cgroup attach types, they should all be supported. Unfortunately this assumption may be false when the kernel is older than the bpftool build, where the attach types queried by bpftool is not yet implemented in the kernel. This would result in errors such as: $ bpftool cgroup tree CgroupPath ID AttachType AttachFlags Name Error: can't query bpf programs attached to /sys/fs/cgroup: Invalid argument This patch restores the logic of ignoring EINVAL from prior to that patch. Fixes: 98b303c9bf05 ("bpftool: Query only cgroup-related attach types") Reported-by: Sagarika Sharma <sharmasagarika@google.com> Reported-by: Minh-Anh Nguyen <minhanhdn@google.com> Signed-off-by: YiFei Zhu <zhuyifei@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Quentin Monnet <qmo@kernel.org> Link: https://lore.kernel.org/bpf/20250428211536.1651456-1-zhuyifei@google.com
2025-05-06	selftests/bpf: Add test for bpf_list_{front,back}	Martin KaFai Lau
	This patch adds the "list_peek" test to use the new bpf_list_{front,back} kfunc. The test_{front,back}* tests ensure that the return value is a non_own_ref node pointer and requires the spinlock to be held. Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> # check non_own_ref marking Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250506015857.817950-9-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-06	selftests/bpf: Add tests for bpf_rbtree_{root,left,right}	Martin KaFai Lau
	This patch has a much simplified rbtree usage from the kernel sch_fq qdisc. It has a "struct node_data" which can be added to two different rbtrees which are ordered by different keys. The test first populates both rbtrees. Then search for a lookup_key from the "groot0" rbtree. Once the lookup_key is found, that node refcount is taken. The node is then removed from another "groot1" rbtree. While searching the lookup_key, the test will also try to remove all rbnodes in the path leading to the lookup_key. The test_{root,left,right}_spinlock_true tests ensure that the return value of the bpf_rbtree functions is a non_own_ref node pointer. This is done by forcing an verifier error by calling a helper bpf_jiffies64() while holding the spinlock. The tests then check for the verifier message "call bpf_rbtree...R0=rcu_ptr_or_null_node..." The other test_{root,left,right}_spinlock_false tests ensure that they must be called with spinlock held. Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> # Check non_own_ref marking Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250506015857.817950-6-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-06	bpf: Allow refcounted bpf_rb_node used in bpf_rbtree_{remove,left,right}	Martin KaFai Lau
	The bpf_rbtree_{remove,left,right} requires the root's lock to be held. They also check the node_internal->owner is still owned by that root before proceeding, so it is safe to allow refcounted bpf_rb_node pointer to be used in these kfuncs. In a bpf fq implementation which is much closer to the kernel fq, https://lore.kernel.org/bpf/20250418224652.105998-13-martin.lau@linux.dev/, a networking flow (allocated by bpf_obj_new) can be added to two different rbtrees. There are cases that the flow is searched from one rbtree, held the refcount of the flow, and then removed from another rbtree: struct fq_flow { struct bpf_rb_node fq_node; struct bpf_rb_node rate_node; struct bpf_refcount refcount; unsigned long sk_long; }; int bpf_fq_enqueue(...) { /* ... / bpf_spin_lock(&root->lock); while (can_loop) { / ... / if (!p) break; gc_f = bpf_rb_entry(p, struct fq_flow, fq_node); if (gc_f->sk_long == sk_long) { f = bpf_refcount_acquire(gc_f); break; } / ... */ } bpf_spin_unlock(&root->lock); if (f) { bpf_spin_lock(&q->lock); bpf_rbtree_remove(&q->delayed, &f->rate_node); bpf_spin_unlock(&q->lock); } } bpf_rbtree_{left,right} do not need this change but are relaxed together with bpf_rbtree_remove instead of adding extra verifier logic to exclude these kfuncs. To avoid bi-sect failure, this patch also changes the selftests together. The "rbtree_api_remove_unadded_node" is not expecting verifier's error. The test now expects bpf_rbtree_remove(&groot, &m->node) to return NULL. The test uses __retval(0) to ensure this NULL return value. Some of the "only take non-owning..." failure messages are changed also. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250506015857.817950-5-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-06	arm64: tools: Resync sysreg.h	Marc Zyngier
	Perform a bulk resync of tools/arch/arm64/include/asm/sysreg.h. Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06	tools/arch/x86: Move the <asm/amd-ibs.h> header to <asm/amd/ibs.h>	Ingo Molnar
	Synchronize with what we did with the kernel side header in: 3846389c03a8 ("x86/platform/amd: Move the <asm/amd-ibs.h> header to <asm/amd/ibs.h>") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: linux-kernel@vger.kernel.org
2025-05-06	Merge tag 'nf-next-25-05-06' of ↵	Paolo Abeni
	git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Apparently, nf_conntrack_bridge changes the way in which fragments are handled, dealing to packet drop. From Huajian Yang. 2) Add a selftest to stress the conntrack subsystem, from Florian Westphal. 3) nft_quota depletion is off-by-one byte, Zhongqiu Duan. 4) Rewrites the procfs to read the conntrack table to speed it up, from Florian Westphal. 5) Two patches to prevent overflow in nft_pipapo lookup table and to clamp the maximum bucket size. 6) Update nft_fib selftest to check for loopback packet bypass. From Florian Westphal. netfilter pull request 25-05-06 * tag 'nf-next-25-05-06' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: selftests: netfilter: nft_fib.sh: check lo packets bypass fib lookup netfilter: nft_set_pipapo: clamp maximum map bucket size to INT_MAX netfilter: nft_set_pipapo: prevent overflow in lookup table allocation netfilter: nf_conntrack: speed up reads from nf_conntrack proc file netfilter: nft_quota: match correctly when the quota just depleted selftests: netfilter: add conntrack stress test netfilter: bridge: Move specific fragmented packet to slow_path instead of dropping it ==================== Link: https://patch.msgid.link/20250505234151.228057-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-06	x86/insn: Stop decoding i64 instructions in x86-64 mode at opcode	Masami Hiramatsu (Google)
	In commit 2e044911be75 ("x86/traps: Decode 0xEA instructions as #UD") FineIBT starts using 0xEA as an invalid instruction like UD2. But insn decoder always returns the length of "0xea" instruction as 7 because it does not check the (i64) superscript. The x86 instruction decoder should also decode 0xEA on x86-64 as a one-byte invalid instruction by decoding the "(i64)" superscript tag. This stops decoding instruction which has (i64) but does not have (o64) superscript in 64-bit mode at opcode and skips other fields. With this change, insn_decoder_test says 0xea is 1 byte length if x86-64 (-y option means 64-bit): $ printf "0:\tea\t\n" \| insn_decoder_test -y -v insn_decoder_test: success: Decoded and checked 1 instructions Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/174580490000.388420.5225447607417115496.stgit@devnote2
2025-05-06	x86/insn: Fix opcode map (!REX2) superscript tags	Masami Hiramatsu (Google)
	Commit: 159039af8c07 ("x86/insn: x86/insn: Add support for REX2 prefix to the instruction decoder opcode map") added (!REX2) superscript with a space, but the correct format requires ',' for concatination with other superscript tags. Add ',' to generate correct insn attribute tables. I confirmed with following command: arch/x86/lib/x86-opcode-map.txt \| grep e8 \| head -n 1 [0xe8] = INAT_MAKE_IMM(INAT_IMM_VWORD32) \| INAT_FORCE64 \| INAT_NO_REX2, Fixes: 159039af8c07 ("x86/insn: x86/insn: Add support for REX2 prefix to the instruction decoder opcode map") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/174580489027.388420.15539375184727726142.stgit@devnote2
2025-05-06	Merge tag 'v6.15-rc4' into x86/asm, to pick up fixes	Ingo Molnar
	Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-06	selftests: net: exit cleanly on SIGTERM / timeout	Jakub Kicinski
	ksft runner sends 2 SIGTERMs in a row if a test runs out of time. Handle this in a similar way we handle SIGINT - cleanup and stop running further tests. Because we get 2 signals we need a bit of logic to ignore the subsequent one, they come immediately one after the other (due to commit 9616cb34b08e ("kselftest/runner.sh: Propagate SIGTERM to runner child")). This change makes sure we run cleanup (scheduled defer()s) and also print a stack trace on SIGTERM, which doesn't happen by default. Tests occasionally hang in NIPA and it's impossible to tell what they are waiting from or doing. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250503011856.46308-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-06	KVM: arm64: selftests: Add test for SVE host corruption	Mark Brown
	Until recently, the kernel could unexpectedly discard SVE state for a period after a KVM_RUN ioctl, when the guest did not execute any FPSIMD/SVE/SME instructions. We fixed that issue in commit: fbc7e61195e2 ("KVM: arm64: Unconditionally save+flush host FPSIMD/SVE/SME state") Add a test which tries to provoke that issue by manipulating SVE state before/after running a guest which does not execute any FPSIMD/SVE/SME instructions. The test executes a handful of iterations to miminize the risk that the issue is masked by preemption. Signed-off--by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Mark Brown <broonie@kernel.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/r/20250417-kvm-selftest-sve-signal-v1-1-6330c2f3da0c@kernel.org [maz: Restored MR's SoB, fixed commit message according to MR's write-up] Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-05-06	tools/x86/kcpuid: Update bitfields to x86-cpuid-db v2.4	Ahmed S. Darwish
	Update kcpuid's CSV file to version 2.4, as generated by x86-cpuid-db. Summary of the v2.4 changes: * Mark CPUID(0x80000001) EDX:23 bit, 'e_mmx', as not exclusive to Transmeta since it is supported by AMD as well. Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: x86-cpuid@lists.linux.dev Link: https://gitlab.com/x86-cpuid.org/x86-cpuid-db/-/blob/v2.4/CHANGELOG.rst Link: https://lore.kernel.org/r/20250506050437.10264-2-darwi@linutronix.de
2025-05-06	Merge tag 'v6.15-rc5' into x86/cpu, to resolve conflicts	Ingo Molnar
	Conflicts: tools/arch/x86/include/asm/cpufeatures.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-05	tools: ynl-gen: validate 0 len strings from kernel	David Wei
	Strings from the kernel are guaranteed to be null terminated and ynl_attr_validate() checks for this. But it doesn't check if the string has a len of 0, which would cause problems when trying to access data[len - 1]. Fix this by checking that len is positive. Signed-off-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250503043050.861238-1-dw@davidwei.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-05	selftests: drv: net: add version indicator	Mohsin Bashir
	Currently, the test result does not differentiate between the cases when either one of the address families are configured or if both the address families are configured. Ideally, the result should report if a particular case was skipped. ./drivers/net/ping.py TAP version 13 1..7 ok 1 ping.test_default_v4 # SKIP Test requires IPv4 connectivity ok 2 ping.test_default_v6 ok 3 ping.test_xdp_generic_sb ok 4 ping.test_xdp_generic_mb ok 5 ping.test_xdp_native_sb ok 6 ping.test_xdp_native_mb ok 7 ping.test_xdp_offload # SKIP device does not support offloaded XDP Totals: pass:5 fail:0 xfail:0 xpass:0 skip:2 error:0 Fixes: 75cc19c8ff89 ("selftests: drv-net: add xdp cases for ping.py") Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com> Reviewed-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250503013518.1722913-4-mohsin.bashr@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-05	selftests: drv: net: avoid skipping tests	Mohsin Bashir
	On a system with either of the ipv4 or ipv6 information missing, tests are currently skipped. Ideally, the test should run as long as at least one address family is present. This patch make test run whenever possible. Before: ./drivers/net/ping.py TAP version 13 1..6 ok 1 ping.test_default # SKIP Test requires IPv4 connectivity ok 2 ping.test_xdp_generic_sb # SKIP Test requires IPv4 connectivity ok 3 ping.test_xdp_generic_mb # SKIP Test requires IPv4 connectivity ok 4 ping.test_xdp_native_sb # SKIP Test requires IPv4 connectivity ok 5 ping.test_xdp_native_mb # SKIP Test requires IPv4 connectivity ok 6 ping.test_xdp_offload # SKIP device does not support offloaded XDP Totals: pass:0 fail:0 xfail:0 xpass:0 skip:6 error:0 After: ./drivers/net/ping.py TAP version 13 1..6 ok 1 ping.test_default ok 2 ping.test_xdp_generic_sb ok 3 ping.test_xdp_generic_mb ok 4 ping.test_xdp_native_sb ok 5 ping.test_xdp_native_mb ok 6 ping.test_xdp_offload # SKIP device does not support offloaded XDP Totals: pass:5 fail:0 xfail:0 xpass:0 skip:1 error:0 Fixes: 75cc19c8ff89 ("selftests: drv-net: add xdp cases for ping.py") Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com> Link: https://patch.msgid.link/20250503013518.1722913-3-mohsin.bashr@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-05	selftests: drv: net: fix test failure on ipv6 sys	Mohsin Bashir
	The `get_interface_info` call has ip version hard-coded which leads to failures on an IPV6 system. The NetDrvEnv class already gathers information about remote interface, so instead of fixing the local implementation switch to using cfg.remote_ifname. Before: ./drivers/net/ping.py Traceback (most recent call last): File "/new_tests/./drivers/net/ping.py", line 217, in <module> main() File "/new_tests/./drivers/net/ping.py", line 204, in main get_interface_info(cfg) File "/new_tests/./drivers/net/ping.py", line 128, in get_interface_info raise KsftFailEx('Can not get remote interface') net.lib.py.ksft.KsftFailEx: Can not get remote interface After: ./drivers/net/ping.py TAP version 13 1..6 ok 1 ping.test_default # SKIP Test requires IPv4 connectivity ok 2 ping.test_xdp_generic_sb # SKIP Test requires IPv4 connectivity ok 3 ping.test_xdp_generic_mb # SKIP Test requires IPv4 connectivity ok 4 ping.test_xdp_native_sb # SKIP Test requires IPv4 connectivity ok 5 ping.test_xdp_native_mb # SKIP Test requires IPv4 connectivity ok 6 ping.test_xdp_offload # SKIP device does not support offloaded XDP Totals: pass:0 fail:0 xfail:0 xpass:0 skip:6 error:0 Fixes: 75cc19c8ff89 ("selftests: drv-net: add xdp cases for ping.py") Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com> Reviewed-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250503013518.1722913-2-mohsin.bashr@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-05	selftests: Add IPv6 link-local address generation tests for GRE devices.	Guillaume Nault
	GRE devices have their special code for IPv6 link-local address generation that has been the source of several regressions in the past. Add selftest to check that all gre, ip6gre, gretap and ip6gretap get an IPv6 link-link local address in accordance with the net.ipv6.conf.<dev>.addr_gen_mode sysctl. Note: This patch was originally applied as commit 6f50175ccad4 ("selftests: Add IPv6 link-local address generation tests for GRE devices."). However, it was then reverted by commit 355d940f4d5a ("Revert "selftests: Add IPv6 link-local address generation tests for GRE devices."") because the commit it depended on was going to be reverted. Now that the situation is resolved, we can add this selftest again (no changes since original patch, appart from context update in tools/testing/selftests/net/Makefile). Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/2c3a5733cb3a6e3119504361a9b9f89fda570a2d.1746225214.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-05	perf test: Add direct off-cpu tests	Howard Chu
	Since we added --off-cpu-thresh, add tests for when a sample's off-cpu time is above the threshold, and when it's below the threshold. Note that the basic test performed in test_offcpu_basic() collects a direct sample now, since sleep 1 has duration of 1000ms, higher than the default value of --off-cpu-thresh of 500ms, resulting in a direct sample. An example: $ sudo perf test offcpu 124: perf record offcpu profiling tests : Ok $ Committer testing: root@number:~# perf test offcpu 126: perf record offcpu profiling tests : Ok root@number:~# perf test -v offcpu 126: perf record offcpu profiling tests : Ok root@number:~# perf test -vv offcpu 126: perf record offcpu profiling tests: --- start --- test child forked, pid 1410791 Checking off-cpu privilege Basic off-cpu test Basic off-cpu test [Success] Child task off-cpu test Child task off-cpu test [Success] Threshold test (above threshold) Threshold test (above threshold) [Success] Threshold test (below threshold) Threshold test (below threshold) [Success] ---- end(0) ---- 126: perf record offcpu profiling tests : Ok root@number:~# Suggested-by: Namhyung Kim <namhyung@kernel.org> Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Howard Chu <howardchu95@gmail.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Gautam Menghani <gautam@linux.ibm.com> Tested-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250501022809.449767-11-howardchu95@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-05	perf record --off-cpu: Add --off-cpu-thresh option	Howard Chu
	Specify the threshold for dumping offcpu samples with --off-cpu-thresh, the unit is milliseconds. Default value is 500ms. Example: perf record --off-cpu --off-cpu-thresh 824 The example above collects direct off-cpu samples where the off-cpu time is longer than 824ms. Committer testing: After commenting out the end off-cpu dump to have just the ones that are added right after the task is scheduled back, and using a threshould of 1000ms, we see some periods (the 5th column, just before "offcpu-time" in the 'perf script' output) that are over 1000.000.000 nanoseconds: root@number:~# perf record --off-cpu --off-cpu-thresh 10000 ^C[ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 3.902 MB perf.data (34335 samples) ] root@number:~# perf script <SNIP> Isolated Web Co 59932 [028] 63839.594437: 1000049427 offcpu-time: 7fe63c7976c2 __syscall_cancel_arch_end+0x0 (/usr/lib64/libc.so.6) 7fe63c78c04c __futex_abstimed_wait_common+0x7c (/usr/lib64/libc.so.6) 7fe63c78e928 pthread_cond_timedwait@@GLIBC_2.3.2+0x178 (/usr/lib64/libc.so.6) 5599974a9fe7 mozilla::detail::ConditionVariableImpl::wait_for(mozilla::detail::MutexImpl&, mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> const&)+0xe7 (/usr/lib64/fir> 100000000 [unknown] ([unknown]) swapper 0 [025] 63839.594459: 195724 cycles:P: ffffffffac328270 read_tsc+0x0 ([kernel.kallsyms]) Isolated Web Co 59932 [010] 63839.594466: 1000055278 offcpu-time: 7fe63c7976c2 __syscall_cancel_arch_end+0x0 (/usr/lib64/libc.so.6) 7fe63c78ba24 __syscall_cancel+0x14 (/usr/lib64/libc.so.6) 7fe63c804c4e __poll+0x1e (/usr/lib64/libc.so.6) 7fe633b0d1b8 PollWrapper(_GPollFD, unsigned int, int) [clone .lto_priv.0]+0xf8 (/usr/lib64/firefox/libxul.so) 10000002c [unknown] ([unknown]) swapper 0 [027] 63839.594475: 134433 cycles:P: ffffffffad4c45d9 irqentry_enter+0x19 ([kernel.kallsyms]) swapper 0 [028] 63839.594499: 215838 cycles:P: ffffffffac39199a switch_mm_irqs_off+0x10a ([kernel.kallsyms]) MediaPD~oder #1 1407676 [027] 63839.594514: 134433 cycles:P: 7f982ef5e69f dct_IV(int, int, int)+0x24f (/usr/lib64/libfdk-aac.so.2.0.0) swapper 0 [024] 63839.594524: 267411 cycles:P: ffffffffad4c6ee6 poll_idle+0x56 ([kernel.kallsyms]) MediaSu~sor #75 1093827 [026] 63839.594555: 332652 cycles:P: 55be753ad030 moz_xmalloc+0x200 (/usr/lib64/firefox/firefox) swapper 0 [027] 63839.594616: 160548 cycles:P: ffffffffad144840 menu_select+0x570 ([kernel.kallsyms]) Isolated Web Co 14019 [027] 63839.595120: 1000050178 offcpu-time: 7fc9537cc6c2 __syscall_cancel_arch_end+0x0 (/usr/lib64/libc.so.6) 7fc9537c104c __futex_abstimed_wait_common+0x7c (/usr/lib64/libc.so.6) 7fc9537c3928 pthread_cond_timedwait@@GLIBC_2.3.2+0x178 (/usr/lib64/libc.so.6) 7fc95372a3c8 pt_TimedWait+0xb8 (/usr/lib64/libnspr4.so) 7fc95372a8d8 PR_WaitCondVar+0x68 (/usr/lib64/libnspr4.so) 7fc94afb1f7c WatchdogMain(void)+0xac (/usr/lib64/firefox/libxul.so) 7fc947498660 [unknown] ([unknown]) 7fc9535fce88 [unknown] ([unknown]) 7fc94b620e60 WatchdogManager::~WatchdogManager()+0x0 (/usr/lib64/firefox/libxul.so) fff8548387f8b48 [unknown] ([unknown]) swapper 0 [003] 63839.595712: 212948 cycles:P: ffffffffacd5b865 acpi_os_read_port+0x55 ([kernel.kallsyms]) <SNIP> Suggested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Suggested-by: Ian Rogers <irogers@google.com> Suggested-by: Namhyung Kim <namhyung@kernel.org> Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Howard Chu <howardchu95@gmail.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Gautam Menghani <gautam@linux.ibm.com> Tested-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20241108204137.2444151-2-howardchu95@gmail.com Link: https://lore.kernel.org/r/20250501022809.449767-10-howardchu95@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-05	perf record --off-cpu: Dump the remaining PERF_SAMPLE_ in sample_type from ↵	Howard Chu
	BPF's stack trace map Dump the remaining PERF_SAMPLE_ data, as if it is dumping a direct sample. Put the stack trace, tid, off-cpu time and cgroup id into the raw_data section, just like a direct off-cpu sample coming from BPF's bpf_perf_event_output(). This ensures that evsel__parse_sample() correctly parses both direct samples and accumulated samples. Suggested-by: Namhyung Kim <namhyung@kernel.org> Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Howard Chu <howardchu95@gmail.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Gautam Menghani <gautam@linux.ibm.com> Tested-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20241108204137.2444151-10-howardchu95@gmail.com Link: https://lore.kernel.org/r/20250501022809.449767-9-howardchu95@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-05	perf script: Display off-cpu samples correctly	Howard Chu
	No PERF_SAMPLE_CALLCHAIN in sample_type, but 'perf script' needs to display a callchain, have to specify manually. Also, prefer displaying a callchain: gvfs-afc-volume 2267 [001] 3829232.955656: 1001115340 offcpu-time: 77f05292603f __pselect+0xbf (/usr/lib/x86_64-linux-gnu/libc.so.6) 77f052a1801c [unknown] (/usr/lib/x86_64-linux-gnu/libusbmuxd-2.0.so.6.0.0) 77f052a18d45 [unknown] (/usr/lib/x86_64-linux-gnu/libusbmuxd-2.0.so.6.0.0) 77f05289ca94 start_thread+0x384 (/usr/lib/x86_64-linux-gnu/libc.so.6) 77f052929c3c clone3+0x2c (/usr/lib/x86_64-linux-gnu/libc.so.6) to a raw binary BPF output: BPF output: 0000: dd 08 00 00 db 08 00 00 <DD>...<DB>... 0008: cc ce ab 3b 00 00 00 00 <CC>Ϋ;.... 0010: 06 00 00 00 00 00 00 00 ........ 0018: 00 fe ff ff ff ff ff ff .<FE><FF><FF><FF><FF><FF><FF> 0020: 3f 60 92 52 f0 77 00 00 ?`.R<F0>w.. 0028: 1c 80 a1 52 f0 77 00 00 ..<A1>R<F0>w.. 0030: 45 8d a1 52 f0 77 00 00 E.<A1>R<F0>w.. 0038: 94 ca 89 52 f0 77 00 00 .<CA>.R<F0>w.. 0040: 3c 9c 92 52 f0 77 00 00 <..R<F0>w.. 0048: 00 00 00 00 00 00 00 00 ........ 0050: 00 00 00 00 .... Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Howard Chu <howardchu95@gmail.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Gautam Menghani <gautam@linux.ibm.com> Tested-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20241108204137.2444151-9-howardchu95@gmail.com Link: https://lore.kernel.org/r/20250501022809.449767-8-howardchu95@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-05	perf record --off-cpu: Disable perf_event's callchain collection	Howard Chu
	There is a check in evsel.c that does this: if (evsel__is_offcpu_event(evsel)) evsel->core.attr.sample_type &= OFFCPU_SAMPLE_TYPES; This along with: #define OFFCPU_SAMPLE_TYPES (PERF_SAMPLE_IDENTIFIER \| PERF_SAMPLE_IP \| \ PERF_SAMPLE_TID \| PERF_SAMPLE_TIME \| \ PERF_SAMPLE_ID \| PERF_SAMPLE_CPU \| \ PERF_SAMPLE_PERIOD \| PERF_SAMPLE_CALLCHAIN \| \ PERF_SAMPLE_CGROUP) will tell perf_event to collect callchain. We don't need the callchain from perf_event when collecting off-cpu samples, because it's prev's callchain, not next's callchain. (perf_event) (task_storage) (needed) prev next \| \| ---sched_switch----> Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Howard Chu <howardchu95@gmail.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Gautam Menghani <gautam@linux.ibm.com> Tested-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20241108204137.2444151-8-howardchu95@gmail.com Link: https://lore.kernel.org/r/20250501022809.449767-7-howardchu95@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-05	perf evsel: Assemble off-cpu samples	Howard Chu
	Use the data in bpf-output samples, to assemble off-cpu samples. In evsel__is_offcpu_event(), check if sample_type is PERF_SAMPLE_RAW to support off-cpu sample data created by an older version of perf. Testing compatibility on off-cpu samples collected by perf before this patch series: See below, the sample_type still uses PERF_SAMPLE_CALLCHAIN $ perf script --header -i ./perf.data.ptn \| grep "event : name = offcpu-time" # event : name = offcpu-time, , id = { 237917, 237918, 237919, 237920 }, type = 1 (software), size = 136, config = 0xa (PERF_COUNT_SW_BPF_OUTPUT), { sample_period, sample_freq } = 1, sample_type = IP\|TID\|TIME\|CALLCHAIN\|CPU\|PERIOD\|IDENTIFIER, read_format = ID\|LOST, disabled = 1, freq = 1, sample_id_all = 1 The output is correct. $ perf script -i ./perf.data.ptn \| grep offcpu-time gmain 2173 [000] 18446744069.414584: 100102015 offcpu-time: NetworkManager 901 [000] 18446744069.414584: 5603579 offcpu-time: Web Content 1183550 [000] 18446744069.414584: 46278 offcpu-time: gnome-control-c 2200559 [000] 18446744069.414584: 11998247014 offcpu-time: <SNIP> $ And after this patch series: $ perf script --header -i ./perf.data.off-cpu-v9 \| grep "event : name = offcpu-time" # event : name = offcpu-time, , id = { 237959, 237960, 237961, 237962 }, type = 1 (software), size = 136, config = 0xa (PERF_COUNT_SW_BPF_OUTPUT), { sample_period, sample_freq } = 1, sample_type = IP\|TID\|TIME\|CPU\|PERIOD\|RAW\|IDENTIFIER, read_format = ID\|LOST, disabled = 1, freq = 1, sample_id_all = 1 $ ./perf script -i ./perf.data.off-cpu-v9 \| grep offcpu-time gnome-shell 1875 [001] 4789616.361225: 100097057 offcpu-time: gnome-shell 1875 [001] 4789616.461419: 100107463 offcpu-time: firefox 2206821 [002] 4789616.475690: 255257245 offcpu-time: $ Committer testing: The command to record those samples: root@number:~# perf record --off-cpu -a sleep 1 [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 2.092 MB perf.data (1552 samples) ] root@number:~# Then, before this patch series, the sample_type for the "offcpu-time" event is: root@number:~# perf evlist -v \| grep offcpu-time offcpu-time: type: 1 (PERF_TYPE_SOFTWARE), size: 136, config: 0xa (PERF_COUNT_SW_BPF_OUTPUT), { sample_period, sample_freq }: 1, sample_type: IP\|TID\|TIME\|CALLCHAIN\|CPU\|PERIOD\|IDENTIFIER, read_format: ID\|LOST, disabled: 1, freq: 1, sample_id_all: 1 root@number:~# And after it, after recording it again: root@number:~# perf record --off-cpu -a sleep 1 ; perf evlist -v \| grep offcpu-time [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 2.151 MB perf.data (2843 samples) ] offcpu-time: type: 1 (PERF_TYPE_SOFTWARE), size: 136, config: 0xa (PERF_COUNT_SW_BPF_OUTPUT), { sample_period, sample_freq }: 1, sample_type: IP\|TID\|TIME\|CPU\|PERIOD\|IDENTIFIER, read_format: ID\|LOST, disabled: 1, sample_id_all: 1 root@number:~# Suggested-by: Namhyung Kim <namhyung@kernel.org> Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Howard Chu <howardchu95@gmail.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Gautam Menghani <gautam@linux.ibm.com> Tested-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20241108204137.2444151-7-howardchu95@gmail.com Link: https://lore.kernel.org/r/20250501022809.449767-6-howardchu95@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-05	perf record --off-cpu: Dump off-cpu samples in BPF	Howard Chu
	Collect tid, period, callchain, and cgroup id and dump them when off-cpu time threshold is reached. We don't collect the off-cpu time twice (the delta), it's either in direct samples, or accumulated samples that are dumped at the end of perf.data. Suggested-by: Namhyung Kim <namhyung@kernel.org> Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Howard Chu <howardchu95@gmail.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Gautam Menghani <gautam@linux.ibm.com> Tested-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20241108204137.2444151-6-howardchu95@gmail.com Link: https://lore.kernel.org/r/20250501022809.449767-5-howardchu95@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-05	perf record --off-cpu: Preparation of off-cpu BPF program	Howard Chu
	Set the perf_event map in BPF for dumping off-cpu samples, and set the offcpu_thresh to specify the threshold. Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Howard Chu <howardchu95@gmail.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Gautam Menghani <gautam@linux.ibm.com> Tested-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20241108204137.2444151-5-howardchu95@gmail.com Link: https://lore.kernel.org/r/20250501022809.449767-4-howardchu95@gmail.com [ Added some missing iteration variables to off_cpu_config() and fixed up a manually edited patch hunk line boundary line ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-05	perf record --off-cpu: Parse off-cpu event	Howard Chu
	Parse the off-cpu event using parse_event(), as bpf-output. Call evlist__enable_evsel() on off-cpu event. This fixes the inability to collect direct off-cpu samples on a workload, as reported by Arnaldo Carvalho de Melo <acme@redhat.com>. The reason being, workload sets enable_on_exec instead of calling evlist__enable(), but off-cpu event does not attach to an executable and execve won't be called, so the fds from perf_event_open() are not enabled. no-inherit should be set to 1, here's the reason: We update the BPF perf_event map for direct off-cpu sample dumping (in following patches), it executes as follows: bpf_map_update_value() bpf_fd_array_map_update_elem() perf_event_fd_array_get_ptr() perf_event_read_local() In perf_event_read_local(), there is: int perf_event_read_local(struct perf_event event, u64 value, u64 enabled, u64 running) { ... /* * It must not be an event with inherit set, we cannot read * all child counters from atomic context. */ if (event->attr.inherit) { ret = -EOPNOTSUPP; goto out; } Which means no-inherit has to be true for updating the BPF perf_event map. Moreover, for bpf-output events, we primarily want a system-wide event instead of a per-task event. The reason is that in BPF's bpf_perf_event_output(), BPF uses the CPU index to retrieve the perf_event file descriptor it outputs to. Making a bpf-output event system-wide naturally satisfies this requirement by mapping CPU appropriately. Suggested-by: Namhyung Kim <namhyung@kernel.org> Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Howard Chu <howardchu95@gmail.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Gautam Menghani <gautam@linux.ibm.com> Tested-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20241108204137.2444151-4-howardchu95@gmail.com Link: https://lore.kernel.org/r/20250501022809.449767-3-howardchu95@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-05	perf evsel: Expose evsel__is_offcpu_event() for future use	Howard Chu
	Expose evsel__is_offcpu_event() so it can be used in off_cpu_config(), evsel__parse_sample() and 'perf script'. Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Howard Chu <howardchu95@gmail.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Gautam Menghani <gautam@linux.ibm.com> Tested-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20241108204137.2444151-3-howardchu95@gmail.com Link: https://lore.kernel.org/r/20250501022809.449767-2-howardchu95@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-05	selftests: iou-zcrx: Clean up build warnings for error format	Haiyue Wang
	Clean up two build warnings: [1] iou-zcrx.c: In function ‘process_recvzc’: iou-zcrx.c:263:37: warning: too many arguments for format [-Wformat-extra-args] 263 \| error(1, 0, "payload mismatch at ", i); \| ^~~~~~~~~~~~~~~~~~~~~~ [2] Use "%zd" for ssize_t type as better iou-zcrx.c: In function ‘run_client’: iou-zcrx.c:357:47: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘ssize_t’ {aka ‘long int’} [-Wformat=] 357 \| error(1, 0, "send(): %d", sent); \| ~^ ~~~~ \| \| \| \| int ssize_t {aka long int} \| %ld Signed-off-by: Haiyue Wang <haiyuewa@163.com> Reviewed-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250502175136.1122-1-haiyuewa@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-05	selftests: mptcp: add chk_sublfow in diag.sh	Gang Yan
	This patch aims to add chk_dump_subflow in diag.sh. The subflow's info can be obtained through "ss -tin", then use the 'mptcp_diag' to verify the token in subflow_info. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/524 Co-developed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Gang Yan <yangang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250502-net-next-mptcp-sft-inc-cover-v1-7-68eec95898fb@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-05	selftests: mptcp: add helpers to get subflow_info	Gang Yan
	This patch adds 'get_subflow_info' in 'mptcp_diag', which can check whether a TCP connection is an MPTCP subflow based on the "INET_ULP_INFO_MPTCP" with tcp_diag method. The helper 'print_subflow_info' in 'mptcp_diag' can print the subflow_filed of an MPTCP subflow for further checking the 'subflow_info' through inet_diag method. The example of the whole output should be: $ ./mptcp_diag -s "127.0.0.1:10000 127.0.0.1:38984" 127.0.0.1:10000 -> 127.0.0.1:38984 It's a mptcp subflow, the subflow info: flags:Mec token:0000(id:0)/4278e77e(id:0) seq:9288466187236176036 \ sfseq:1 ssnoff:2317083055 maplen:215 Co-developed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Gang Yan <yangang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250502-net-next-mptcp-sft-inc-cover-v1-6-68eec95898fb@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>