path: root/kernel
2024-12-09  futex: fix user access on powerpc  [Linus Torvalds]
The powerpc user access code is special, and unlike other architectures distinguishes between user access for reading and writing. And commit 43a43faf5376 ("futex: improve user space accesses") messed that up. It went undetected elsewhere, but caused ppc32 to fail early during boot, because the user access had been started with user_read_access_begin(), but then finished off with just a plain "user_access_end()". Note that the address-masking user access helpers don't even have that read-vs-write distinction, so if powerpc ever wants to do address masking tricks, we'll have to do some extra work for it. [ Make sure to also do it for the EFAULT case, as pointed out by Christophe Leroy ] Reported-by: Andreas Schwab <schwab@linux-m68k.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Link: https://lore.kernel.org/all/87bjxl6b0i.fsf@igel.home/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
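As a rough kernel-style sketch of the pairing rule the fix restores (illustrative only, not the actual futex code; the helper name is made up): a window opened with user_read_access_begin() has to be closed with user_read_access_end() on every exit path, including the -EFAULT one, rather than with a plain user_access_end().

  static int read_user_word(u32 __user *uaddr, u32 *val)
  {
          if (!user_read_access_begin(uaddr, sizeof(*uaddr)))
                  return -EFAULT;

          unsafe_get_user(*val, uaddr, Efault);

          user_read_access_end();         /* matches the _read_ begin */
          return 0;

  Efault:
          user_read_access_end();         /* required on the fault path too */
          return -EFAULT;
  }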
2024-12-09  uprobes: Guard against kmemdup() failing in dup_return_instance()  [Andrii Nakryiko]
If kmemdup() failed to alloc memory, don't proceed with extra_consumers copy. Fixes: e62f2d492728 ("uprobes: Simplify session consumer tracking") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20241206183436.968068-1-andrii@kernel.org
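The guard pattern, shown as a small self-contained C sketch with malloc()-based stand-ins (none of these names are the kernel's): the copy is only performed once the duplicated buffer is known to exist.

  #include <stdlib.h>
  #include <string.h>

  /* Stand-in for the kernel's kmemdup(): duplicate a buffer or return NULL. */
  static void *dup_buf(const void *src, size_t len)
  {
          void *p = malloc(len);

          if (p)
                  memcpy(p, src, len);
          return p;
  }

  struct instance {
          size_t n_extra;
          int *extra;             /* only valid if the duplication succeeded */
  };

  static int dup_instance(struct instance *dst, const struct instance *src)
  {
          *dst = *src;
          dst->extra = NULL;

          if (!src->n_extra)
                  return 0;

          dst->extra = dup_buf(src->extra, src->n_extra * sizeof(*src->extra));
          if (!dst->extra)
                  return -1;      /* allocation failed: don't proceed with the copy */
          return 0;
  }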
2024-12-09  perf/core: Export perf_exclude_event()  [Namhyung Kim]
While at it, rename the same function in s390 cpum_sf PMU. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Ravi Bangoria <ravi.bangoria@amd.com> Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Acked-by: Thomas Richter <tmricht@linux.ibm.com> Link: https://lore.kernel.org/r/20241203180441.1634709-2-namhyung@kernel.org
2024-12-09  uprobes: Reuse return_instances between multiple uretprobes within task  [Andrii Nakryiko]
Instead of constantly allocating and freeing very short-lived struct return_instance, reuse it as much as possible within the current task. For that, store a linked list of reusable return_instances within current->utask.

The only complication is that ri_timer() might still be processing such a return_instance. So while the main uretprobe processing logic might already be done with a return_instance and would be OK to immediately reuse it for the next uretprobe instance, it's not correct to unconditionally reuse it just like that.

Instead we make sure that ri_timer() can't possibly be processing it by using a seqcount_t, with ri_timer() being "a writer", while free_ret_instance() is "a reader". If, after we unlink a return instance from the utask->return_instances list, we know that ri_timer() hasn't gotten to processing utask->return_instances yet, then we can be sure that immediate return_instance reuse is OK, and so we put it onto utask->ri_pool for future (potentially, almost immediate) reuse.

This change shows improvements both in single-CPU performance (by avoiding the relatively expensive kmalloc/free combo) and in terms of multi-CPU scalability, where you can see that per-CPU throughput doesn't decline as steeply with an increased number of CPUs (the decline was previously attributed to kmalloc()/free() through profiling):

  BASELINE (latest perf/core)
  ===========================
  uretprobe-nop ( 1 cpus):    1.898 ± 0.002M/s  (  1.898M/s/cpu)
  uretprobe-nop ( 2 cpus):    3.574 ± 0.011M/s  (  1.787M/s/cpu)
  uretprobe-nop ( 3 cpus):    5.279 ± 0.066M/s  (  1.760M/s/cpu)
  uretprobe-nop ( 4 cpus):    6.824 ± 0.047M/s  (  1.706M/s/cpu)
  uretprobe-nop ( 5 cpus):    8.339 ± 0.060M/s  (  1.668M/s/cpu)
  uretprobe-nop ( 6 cpus):    9.812 ± 0.047M/s  (  1.635M/s/cpu)
  uretprobe-nop ( 7 cpus):   11.030 ± 0.048M/s  (  1.576M/s/cpu)
  uretprobe-nop ( 8 cpus):   12.453 ± 0.126M/s  (  1.557M/s/cpu)
  uretprobe-nop (10 cpus):   14.838 ± 0.044M/s  (  1.484M/s/cpu)
  uretprobe-nop (12 cpus):   17.092 ± 0.115M/s  (  1.424M/s/cpu)
  uretprobe-nop (14 cpus):   19.576 ± 0.022M/s  (  1.398M/s/cpu)
  uretprobe-nop (16 cpus):   22.264 ± 0.015M/s  (  1.391M/s/cpu)
  uretprobe-nop (24 cpus):   33.534 ± 0.078M/s  (  1.397M/s/cpu)
  uretprobe-nop (32 cpus):   43.262 ± 0.127M/s  (  1.352M/s/cpu)
  uretprobe-nop (40 cpus):   53.252 ± 0.080M/s  (  1.331M/s/cpu)
  uretprobe-nop (48 cpus):   55.778 ± 0.045M/s  (  1.162M/s/cpu)
  uretprobe-nop (56 cpus):   56.850 ± 0.227M/s  (  1.015M/s/cpu)
  uretprobe-nop (64 cpus):   62.005 ± 0.077M/s  (  0.969M/s/cpu)
  uretprobe-nop (72 cpus):   66.445 ± 0.236M/s  (  0.923M/s/cpu)
  uretprobe-nop (80 cpus):   68.353 ± 0.180M/s  (  0.854M/s/cpu)

  THIS PATCHSET (on top of latest perf/core)
  ==========================================
  uretprobe-nop ( 1 cpus):    2.253 ± 0.004M/s  (  2.253M/s/cpu)
  uretprobe-nop ( 2 cpus):    4.281 ± 0.003M/s  (  2.140M/s/cpu)
  uretprobe-nop ( 3 cpus):    6.389 ± 0.027M/s  (  2.130M/s/cpu)
  uretprobe-nop ( 4 cpus):    8.328 ± 0.005M/s  (  2.082M/s/cpu)
  uretprobe-nop ( 5 cpus):   10.353 ± 0.001M/s  (  2.071M/s/cpu)
  uretprobe-nop ( 6 cpus):   12.513 ± 0.010M/s  (  2.086M/s/cpu)
  uretprobe-nop ( 7 cpus):   14.525 ± 0.017M/s  (  2.075M/s/cpu)
  uretprobe-nop ( 8 cpus):   15.633 ± 0.013M/s  (  1.954M/s/cpu)
  uretprobe-nop (10 cpus):   19.532 ± 0.011M/s  (  1.953M/s/cpu)
  uretprobe-nop (12 cpus):   21.405 ± 0.009M/s  (  1.784M/s/cpu)
  uretprobe-nop (14 cpus):   24.857 ± 0.020M/s  (  1.776M/s/cpu)
  uretprobe-nop (16 cpus):   26.466 ± 0.018M/s  (  1.654M/s/cpu)
  uretprobe-nop (24 cpus):   40.513 ± 0.222M/s  (  1.688M/s/cpu)
  uretprobe-nop (32 cpus):   54.180 ± 0.074M/s  (  1.693M/s/cpu)
  uretprobe-nop (40 cpus):   66.100 ± 0.082M/s  (  1.652M/s/cpu)
  uretprobe-nop (48 cpus):   70.544 ± 0.068M/s  (  1.470M/s/cpu)
  uretprobe-nop (56 cpus):   74.494 ± 0.055M/s  (  1.330M/s/cpu)
  uretprobe-nop (64 cpus):   79.317 ± 0.029M/s  (  1.239M/s/cpu)
  uretprobe-nop (72 cpus):   84.875 ± 0.020M/s  (  1.179M/s/cpu)
  uretprobe-nop (80 cpus):   92.318 ± 0.224M/s  (  1.154M/s/cpu)

For reference, with uprobe-nop we hit the following throughput:

  uprobe-nop    (80 cpus):  143.485 ± 0.035M/s  (  1.794M/s/cpu)

So now uretprobe stays a bit closer to that performance.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20241206002417.3295533-5-andrii@kernel.org
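The lock-free handshake can be pictured with a small, self-contained user-space analogue. This is only a sketch of the seqcount idea under stated assumptions (C11 atomics stand in for the kernel's seqcount_t; the struct and function names are made up, not the uprobes code): the timer path bumps a counter to odd while it walks the list and back to even when done, and the free path recycles an instance immediately only if the counter was even before the unlink and unchanged after it; otherwise it falls back to deferred freeing.

  #include <stdatomic.h>
  #include <stdbool.h>

  struct utask_like {
          _Atomic unsigned int seq;       /* even: timer not walking the list */
  };

  /* "Writer": the timer path brackets its list walk with an odd/even window. */
  static void timer_walk(struct utask_like *ut)
  {
          atomic_fetch_add_explicit(&ut->seq, 1, memory_order_acq_rel); /* odd  */
          /* ... walk the return_instances list ... */
          atomic_fetch_add_explicit(&ut->seq, 1, memory_order_acq_rel); /* even */
  }

  /* "Reader": after unlinking an instance, decide if immediate reuse is safe. */
  static bool reuse_is_safe(struct utask_like *ut, unsigned int seq_before_unlink)
  {
          unsigned int now = atomic_load_explicit(&ut->seq, memory_order_acquire);

          /* Safe only if the walker was idle before the unlink and has not
           * run since; otherwise the free must be deferred. */
          return !(seq_before_unlink & 1) && seq_before_unlink == now;
  }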
2024-12-09  uprobes: Ensure return_instance is detached from the list before freeing  [Andrii Nakryiko]
Ensure that by the time we call free_ret_instance() to clean up an instance of struct return_instance it isn't reachable from utask->return_instances anymore. free_ret_instance() is called in a few different situations, all but one of which already are fine w.r.t. return_instance visibility: - uprobe_free_utask() guarantees that ri_timer() won't be called (through timer_delete_sync() call), and so there is no need to unlink anything, because entire utask is being freed; - uprobe_handle_trampoline() is already unlinking to-be-freed return_instance with rcu_assign_pointer() before calling free_ret_instance(). Only cleanup_return_instances() violates this property, which so far is not causing problems due to RCU-delayed freeing of return_instance, which we'll change in the next patch. So make sure we unlink return_instance before passing it into free_ret_instance(), as otherwise reuse will be unsafe. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20241206002417.3295533-4-andrii@kernel.org
2024-12-09  uprobes: Decouple return_instance list traversal and freeing  [Andrii Nakryiko]
free_ret_instance() has two unrelated responsibilities: actually cleaning up return_instance's resources and freeing memory, and also helping with utask->return_instances list traversal by returning the next alive pointer. There is no reason why these two aspects have to be mixed together, so turn free_ret_instance() into void-returning function and make callers do list traversal on their own. We'll use this simplification in the next patch that will guarantee that to-be-freed return_instance isn't reachable from utask->return_instances list. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20241206002417.3295533-3-andrii@kernel.org
2024-12-09  uprobes: Simplify session consumer tracking  [Andrii Nakryiko]
In practice, each return_instance will typically contain either zero or one return_consumer, depending on whether it has any uprobe session consumer attached or not. It's highly unlikely that more than one uprobe session consumer will be attached to any given uprobe, so there is no need to optimize for that case. But the way we currently do memory allocation and accounting is by pre-allocating the space for 4 session consumers in a contiguous block of memory next to the struct return_instance fixed part. This is unnecessarily wasteful.

This patch changes this to keep struct return_instance fixed-sized with one pre-allocated return_consumer, while (in a highly unlikely scenario) allowing for more session consumers in a separate dynamically allocated and reallocated array. We also simplify accounting a bit by not maintaining a separate temporary capacity for the consumers array and, instead, relying on krealloc() to be a no-op if the underlying memory can accommodate a slightly bigger allocation (but again, it's a very uncommon scenario to even have to do this reallocation).

All this gets rid of ri_size(), simplifies push_consumer() and removes the confusing ri->consumers_cnt re-assignment, while keeping the singular preallocated-consumer logic contained within a few simple preexisting helpers.

Having a fixed-sized struct return_instance simplifies and speeds up the return_instance reuse that we ultimately add later in this patch set; see the follow-up patches.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20241206002417.3295533-2-andrii@kernel.org
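As a rough self-contained illustration of the memory layout described here (plain C with realloc() standing in for krealloc(); the type and field names are stand-ins, not the kernel's): one consumer lives inside the fixed-size struct, and only the unlikely extra consumers go into a separately grown side array.

  #include <stdlib.h>

  struct consumer {
          unsigned long long cookie;
          int id;
  };

  struct ret_instance {
          struct consumer first;          /* common case: zero or one consumer */
          struct consumer *extra;         /* rare overflow, grown with realloc() */
          int n_consumers;                /* total, including 'first' when used */
  };

  /* Assumes *ri was zero-initialized before the first push. */
  static int push_consumer(struct ret_instance *ri, struct consumer c)
  {
          if (ri->n_consumers == 0) {
                  ri->first = c;          /* use the preallocated slot */
          } else {
                  /* Unlikely path: grow the side array by one element. */
                  struct consumer *p = realloc(ri->extra,
                                               ri->n_consumers * sizeof(*p));
                  if (!p)
                          return -1;
                  ri->extra = p;
                  ri->extra[ri->n_consumers - 1] = c;
          }
          ri->n_consumers++;
          return 0;
  }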
2024-12-09  sched/fair: Untangle NEXT_BUDDY and pick_next_task()  [Peter Zijlstra]
There are 3 sites using set_next_buddy() and only one is conditional on NEXT_BUDDY; the other two sites are unconditional, namely:

 - yield_to_task()
 - the cgroup dequeue / pick optimization

However, having NEXT_BUDDY control both the wakeup-preemption and the picking side of things means it's nearly useless.

Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241129101541.GA33464@noisy.programming.kicks-ass.net
2024-12-09  sched/fair: Mark m*_vruntime() with __maybe_unused  [Andy Shevchenko]
When max_vruntime() is unused, it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y:

  kernel/sched/fair.c:526:19: error: unused function 'max_vruntime' [-Werror,-Wunused-function]
    526 | static inline u64 max_vruntime(u64 max_vruntime, u64 vruntime)
        |                   ^~~~~~~~~~~~

Fix this by marking them with __maybe_unused (all cases for the sake of symmetry). See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build").

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20241202173546.634433-1-andriy.shevchenko@linux.intel.com
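For reference, the same effect can be demonstrated in a standalone translation unit with the compiler attribute that the kernel's __maybe_unused macro wraps; the helper below is a made-up stand-in, not the actual fair-class code.

  /* The kernel defines __maybe_unused as __attribute__((__unused__)). */
  #define __maybe_unused __attribute__((__unused__))

  /* May end up unreferenced in some configurations; the attribute keeps
   * -Wunused-function quiet. */
  static inline __maybe_unused unsigned long long
  max_u64(unsigned long long a, unsigned long long b)
  {
          return a > b ? a : b;
  }

  int main(void)
  {
          return 0;
  }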
2024-12-09  sched/fair: Fix variable declaration position  [Vincent Guittot]
Move the variable declaration to the beginning of the function. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-12-vincent.guittot@linaro.org
2024-12-09  sched/fair: Do not try to migrate delayed dequeue task  [Vincent Guittot]
Migrating a delayed dequeued task doesn't help in balancing the number of runnable tasks in the system. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-11-vincent.guittot@linaro.org
2024-12-09  sched/fair: Rename cfs_rq.nr_running into nr_queued  [Vincent Guittot]
Rename cfs_rq.nr_running into cfs_rq.nr_queued which better reflects the reality as the value includes both the ready to run tasks and the delayed dequeue tasks. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-10-vincent.guittot@linaro.org
2024-12-09  sched/fair: Remove unused cfs_rq.idle_nr_running  [Vincent Guittot]
cfs_rq.idle_nr_running field is not used anywhere so we can remove the useless associated computation. Last user went in commit 5e963f2bd465 ("sched/fair: Commit to EEVDF"). Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-9-vincent.guittot@linaro.org
2024-12-09  sched/fair: Rename cfs_rq.idle_h_nr_running into h_nr_idle  [Vincent Guittot]
Use the same naming convention as the other h_nr_* fields and rename idle_h_nr_running into h_nr_idle. The "running" part is no longer correct, as the count also includes delayed-dequeue tasks. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-8-vincent.guittot@linaro.org
2024-12-09  sched/fair: Remove unused cfs_rq.h_nr_delayed  [Vincent Guittot]
h_nr_delayed is not used anymore. We now have:

 - h_nr_runnable, which tracks tasks ready to run
 - h_nr_queued, which tracks enqueued tasks, either ready to run or delayed dequeue

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-7-vincent.guittot@linaro.org
2024-12-09  sched/fair: Use the new cfs_rq.h_nr_runnable  [Vincent Guittot]
Use the new h_nr_runnable that tracks only queued and runnable tasks in the statistics that are used to balance the system:

 - PELT runnable_avg
 - deciding if a group is overloaded or has spare capacity
 - numa stats
 - reduced capacity management
 - load balance
 - nohz kick

Note that rq->nr_running still counts the delayed-dequeue tasks, as delayed dequeue is a fair-class feature that is meaningless at the core level.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-6-vincent.guittot@linaro.org
2024-12-09  sched/fair: Add new cfs_rq.h_nr_runnable  [Vincent Guittot]
With the delayed dequeue feature, a sleeping sched_entity remains queued in the rq until its lag has elapsed. As a result, it also stays visible in the statistics that are used to balance the system, and in particular in the field cfs.h_nr_queued when the sched_entity is associated with a task. Create a new h_nr_runnable that tracks only queued and runnable tasks. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-5-vincent.guittot@linaro.org
2024-12-09  sched/fair: Rename h_nr_running into h_nr_queued  [Vincent Guittot]
With the delayed dequeue feature, a sleeping sched_entity remains queued in the rq until its lag has elapsed, but it can't run. Rename h_nr_running into h_nr_queued to reflect this new behavior. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-4-vincent.guittot@linaro.org
2024-12-09  Merge branch 'sched/urgent'  [Peter Zijlstra]
Sync with urgent bits as a base for further work. Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2024-12-09  sched/eevdf: More PELT vs DELAYED_DEQUEUE  [Peter Zijlstra]
Vincent and Dietmar noted that while commit fc1892becd56 ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE") fixes the entity runnable stats, it does not adjust the cfs_rq runnable stats, which are based off of h_nr_running. Track h_nr_delayed such that we can discount those and adjust the signal. Fixes: fc1892becd56 ("sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE") Closes: https://lore.kernel.org/lkml/a9a45193-d0c6-4ba2-a822-464ad30b550e@arm.com/ Closes: https://lore.kernel.org/lkml/CAKfTPtCNUvWE_GX5LyvTF-WdxUT=ZgvZZv-4t=eWntg5uOFqiQ@mail.gmail.com/ [ Fixes checkpatch warnings and rebased ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reported-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reported-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20241202174606.4074512-3-vincent.guittot@linaro.org
2024-12-09  sched/fair: Fix sched_can_stop_tick() for fair tasks  [Vincent Guittot]
We can't stop the tick of a rq if there are at least 2 tasks enqueued in the whole hierarchy and not only at the root cfs rq. rq->cfs.nr_running tracks the number of sched_entity at one level whereas rq->cfs.h_nr_running tracks all queued tasks in the hierarchy. Fixes: 11cc374f4643b ("sched_ext: Simplify scx_can_stop_tick() invocation in sched_can_stop_tick()") Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/20241202174606.4074512-2-vincent.guittot@linaro.org
2024-12-09  sched/fair: Fix NEXT_BUDDY  [K Prateek Nayak]
Adam reports that enabling NEXT_BUDDY instantly triggers a WARN in pick_next_entity(). Moving clear_buddies() up before the delayed dequeue bits ensures no ->next buddy becomes delayed. Further, ensure no new ->next buddy ever starts out delayed. Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Reported-by: Adam Li <adamli@os.amperecomputing.com> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Adam Li <adamli@os.amperecomputing.com> Link: https://lkml.kernel.org/r/670a0d54-e398-4b1f-8a6e-90784e2fdf89@amd.com
2024-12-09  livepatch: Add stack_order sysfs attribute  [Wardenjohn]
Add "stack_order" sysfs attribute which holds the order in which a live patch module was loaded into the system. A user can then determine an active live patched version of a function.

  cat /sys/kernel/livepatch/livepatch_1/stack_order -> 1
  means that livepatch_1 is the first live patch applied

  cat /sys/kernel/livepatch/livepatch_module/stack_order -> N
  means that livepatch_module is the Nth live patch applied

Suggested-by: Petr Mladek <pmladek@suse.com> Suggested-by: Miroslav Benes <mbenes@suse.cz> Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Wardenjohn <zhangwarden@gmail.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Miroslav Benes <mbenes@suse.cz> Link: https://lore.kernel.org/r/20241008014856.3729-2-zhangwarden@gmail.com [pmladek@suse.com: Updated kernel version and date in the ABI documentation.] Signed-off-by: Petr Mladek <pmladek@suse.com>
2024-12-08  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  [Alexei Starovoitov]
Cross-merge bpf fixes after downstream PR.

Trivial conflict:
  tools/testing/selftests/bpf/prog_tests/verifier.c

Adjacent changes in:
  Auto-merging kernel/bpf/verifier.c
  Auto-merging samples/bpf/Makefile
  Auto-merging tools/testing/selftests/bpf/.gitignore
  Auto-merging tools/testing/selftests/bpf/Makefile
  Auto-merging tools/testing/selftests/bpf/prog_tests/verifier.c

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-08  Merge tag 'irq_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds]
Pull irq fixes from Borislav Petkov:

 - Fix a /proc/interrupts formatting regression

 - Have the BCM2836 interrupt controller enter power management states properly

 - Other fixlets

* tag 'irq_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/stm32mp-exti: CONFIG_STM32MP_EXTI should not default to y when compile-testing
  genirq/proc: Add missing space separator back
  irqchip/bcm2836: Enable SKIP_SET_WAKE and MASK_ON_SUSPEND
  irqchip/gic-v3: Fix irq_complete_ack() comment
2024-12-08  Merge tag 'timers_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds]
Pull timer fix from Borislav Petkov:

 - Handle the case where clocksources with small counter width can, in conjunction with overly long idle sleeps, falsely trigger the negative motion detection of clocksources

* tag 'timers_urgent_for_v6.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource: Make negative motion detection more robust
2024-12-08  Merge tag 'mm-hotfixes-stable-2024-12-07-22-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  [Linus Torvalds]
Pull misc fixes from Andrew Morton:
 "24 hotfixes. 17 are cc:stable. 15 are MM and 9 are non-MM.

  The usual bunch of singletons - please see the relevant changelogs for details"

* tag 'mm-hotfixes-stable-2024-12-07-22-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (24 commits)
  iio: magnetometer: yas530: use signed integer type for clamp limits
  sched/numa: fix memory leak due to the overwritten vma->numab_state
  mm/damon: fix order of arguments in damos_before_apply tracepoint
  lib: stackinit: hide never-taken branch from compiler
  mm/filemap: don't call folio_test_locked() without a reference in next_uptodate_folio()
  scatterlist: fix incorrect func name in kernel-doc
  mm: correct typo in MMAP_STATE() macro
  mm: respect mmap hint address when aligning for THP
  mm: memcg: declare do_memsw_account inline
  mm/codetag: swap tags when migrate pages
  ocfs2: update seq_file index in ocfs2_dlm_seq_next
  stackdepot: fix stack_depot_save_flags() in NMI context
  mm: open-code page_folio() in dump_page()
  mm: open-code PageTail in folio_flags() and const_folio_flags()
  mm: fix vrealloc()'s KASAN poisoning logic
  Revert "readahead: properly shorten readahead when falling back to do_page_cache_ra()"
  selftests/damon: add _damon_sysfs.py to TEST_FILES
  selftest: hugetlb_dio: fix test naming
  ocfs2: free inode when ocfs2_get_init_inode() fails
  nilfs2: fix potential out-of-bounds memory access in nilfs_find_entry()
  ...
2024-12-08  tracing/eprobe: Fix to release eprobe when failed to add dyn_event  [Masami Hiramatsu (Google)]
Fix eprobe event to unregister event call and release eprobe when it fails to add dynamic event correctly. Link: https://lore.kernel.org/all/173289886698.73724.1959899350183686006.stgit@devnote2/ Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-12-06  Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  [Linus Torvalds]
Pull bpf fixes from Daniel Borkmann:

 - Fix several issues for BPF LPM trie map which were found by syzbot and during addition of new test cases (Hou Tao)

 - Fix a missing process_iter_arg register type check in the BPF verifier (Kumar Kartikeya Dwivedi, Tao Lyu)

 - Fix several correctness gaps in the BPF verifier when interacting with the BPF stack without CAP_PERFMON (Kumar Kartikeya Dwivedi, Eduard Zingerman, Tao Lyu)

 - Fix OOB BPF map writes when deleting elements for the case of xsk map as well as devmap (Maciej Fijalkowski)

 - Fix xsk sockets to always clear DMA mapping information when unmapping the pool (Larysa Zaremba)

 - Fix sk_mem_uncharge logic in tcp_bpf_sendmsg to only uncharge after sent bytes have been finalized (Zijian Zhang)

 - Fix BPF sockmap with vsocks which was missing a queue check in poll and sockmap cleanup on close (Michal Luczaj)

 - Fix tools infra to override makefile ARCH variable if defined but empty, which addresses cross-building tools. (Björn Töpel)

 - Fix two resolve_btfids build warnings on unresolved bpf_lsm symbols (Thomas Weißschuh)

 - Fix a NULL pointer dereference in bpftool (Amir Mohammadi)

 - Fix BPF selftests to check for CONFIG_PREEMPTION instead of CONFIG_PREEMPT (Sebastian Andrzej Siewior)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (31 commits)
  selftests/bpf: Add more test cases for LPM trie
  selftests/bpf: Move test_lpm_map.c to map_tests
  bpf: Use raw_spinlock_t for LPM trie
  bpf: Switch to bpf mem allocator for LPM trie
  bpf: Fix exact match conditions in trie_get_next_key()
  bpf: Handle in-place update for full LPM trie correctly
  bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie
  bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem
  bpf: Remove unnecessary check when updating LPM trie
  selftests/bpf: Add test for narrow spill into 64-bit spilled scalar
  selftests/bpf: Add test for reading from STACK_INVALID slots
  selftests/bpf: Introduce __caps_unpriv annotation for tests
  bpf: Fix narrow scalar spill onto 64-bit spilled scalar slots
  bpf: Don't mark STACK_INVALID as STACK_MISC in mark_stack_slot_misc
  samples/bpf: Remove unnecessary -I flags from libbpf EXTRA_CFLAGS
  bpf: Zero index arg error string for dynptr and iter
  selftests/bpf: Add tests for iter arg check
  bpf: Ensure reg is PTR_TO_STACK in process_iter_arg
  tools: Override makefile ARCH variable if defined, but empty
  selftests/bpf: Add apply_bytes test to test_txmsg_redir_wait_sndmem in test_sockmap
  ...
2024-12-06  bpf: Use raw_spinlock_t for LPM trie  [Hou Tao]
After switching from kmalloc() to the bpf memory allocator, there will be no blocking operation during the update of an LPM trie. Therefore, change trie->lock from spinlock_t to raw_spinlock_t to make the LPM trie usable in atomic context, even on RT kernels.

The max value of prefixlen is 2048. Therefore, update or deletion operations will find the target after at most 2048 comparisons. In a test case constructed to update an element after 2048 comparisons on an 8-CPU VM, the average and maximal times for such an update operation are about 210us and 900us.

Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-8-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06  bpf: Switch to bpf mem allocator for LPM trie  [Hou Tao]
Multiple syzbot warnings have been reported. These warnings are mainly about the lock order between trie->lock and kmalloc()'s internal lock. See report [1] as an example:

  ======================================================
  WARNING: possible circular locking dependency detected
  6.10.0-rc7-syzkaller-00003-g4376e966ecb7 #0 Not tainted
  ------------------------------------------------------
  syz.3.2069/15008 is trying to acquire lock:
  ffff88801544e6d8 (&n->list_lock){-.-.}-{2:2}, at: get_partial_node ...

  but task is already holding lock:
  ffff88802dcc89f8 (&trie->lock){-.-.}-{2:2}, at: trie_update_elem ...

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (&trie->lock){-.-.}-{2:2}:
         __raw_spin_lock_irqsave
         _raw_spin_lock_irqsave+0x3a/0x60
         trie_delete_elem+0xb0/0x820
         ___bpf_prog_run+0x3e51/0xabd0
         __bpf_prog_run32+0xc1/0x100
         bpf_dispatcher_nop_func
         ......
         bpf_trace_run2+0x231/0x590
         __bpf_trace_contention_end+0xca/0x110
         trace_contention_end.constprop.0+0xea/0x170
         __pv_queued_spin_lock_slowpath+0x28e/0xcc0
         pv_queued_spin_lock_slowpath
         queued_spin_lock_slowpath
         queued_spin_lock
         do_raw_spin_lock+0x210/0x2c0
         __raw_spin_lock_irqsave
         _raw_spin_lock_irqsave+0x42/0x60
         __put_partials+0xc3/0x170
         qlink_free
         qlist_free_all+0x4e/0x140
         kasan_quarantine_reduce+0x192/0x1e0
         __kasan_slab_alloc+0x69/0x90
         kasan_slab_alloc
         slab_post_alloc_hook
         slab_alloc_node
         kmem_cache_alloc_node_noprof+0x153/0x310
         __alloc_skb+0x2b1/0x380
         ......

  -> #0 (&n->list_lock){-.-.}-{2:2}:
         check_prev_add
         check_prevs_add
         validate_chain
         __lock_acquire+0x2478/0x3b30
         lock_acquire
         lock_acquire+0x1b1/0x560
         __raw_spin_lock_irqsave
         _raw_spin_lock_irqsave+0x3a/0x60
         get_partial_node.part.0+0x20/0x350
         get_partial_node
         get_partial
         ___slab_alloc+0x65b/0x1870
         __slab_alloc.constprop.0+0x56/0xb0
         __slab_alloc_node
         slab_alloc_node
         __do_kmalloc_node
         __kmalloc_node_noprof+0x35c/0x440
         kmalloc_node_noprof
         bpf_map_kmalloc_node+0x98/0x4a0
         lpm_trie_node_alloc
         trie_update_elem+0x1ef/0xe00
         bpf_map_update_value+0x2c1/0x6c0
         map_update_elem+0x623/0x910
         __sys_bpf+0x90c/0x49a0
         ...

  other info that might help us debug this:

  Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    lock(&trie->lock);
                                 lock(&n->list_lock);
                                 lock(&trie->lock);
    lock(&n->list_lock);

   *** DEADLOCK ***

[1]: https://syzkaller.appspot.com/bug?extid=9045c0a3d5a7f1b119f7

A bpf program attached to trace_contention_end() triggers after acquiring &n->list_lock. The program invokes trie_delete_elem(), which then acquires trie->lock. However, it is possible that another process is invoking trie_update_elem(). trie_update_elem() will acquire trie->lock first, then invoke kmalloc_node(). kmalloc_node() may invoke get_partial_node() and try to acquire &n->list_lock (not necessarily the same lock object). Therefore, lockdep warns about the circular locking dependency.

Invoking kmalloc() before acquiring trie->lock could fix the warning. However, since BPF programs can be invoked from any context (e.g., through kprobe/tracepoint/fentry), there may still be lock ordering problems for internal locks in kmalloc() or trie->lock itself.

To eliminate these potential lock ordering problems with kmalloc()'s internal locks, replace kmalloc()/kfree()/kfree_rcu() with equivalent BPF memory allocator APIs that can be invoked in any context. The lock ordering problems with trie->lock (e.g., reentrance) will be handled separately.

Three aspects of this change require explanation:

1. Intermediate and leaf nodes are allocated from the same allocator. Since the value size of the LPM trie is usually small, using a single allocator reduces the memory overhead of the BPF memory allocator.

2. Leaf nodes are allocated before disabling IRQs. This handles cases where leaf_size is large (e.g., > 4KB - 8) and updates require intermediate node allocation. If leaf nodes were allocated in the IRQ-disabled region, the free objects in the BPF memory allocator would not be refilled in time and the intermediate node allocation may fail.

3. Paired migrate_{disable|enable}() calls for node alloc and free. The BPF memory allocator uses per-CPU structs internally; these paired calls are necessary to guarantee correctness.

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-7-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06  bpf: Fix exact match conditions in trie_get_next_key()  [Hou Tao]
trie_get_next_key() uses node->prefixlen == key->prefixlen to identify an exact match. However, this is incorrect: when the target key doesn't fully match the found node (e.g., node->prefixlen != matchlen), these two nodes may still have the same prefixlen. It returns the expected result when the passed key exists in the trie, but when a recently-deleted key or a nonexistent key is passed to trie_get_next_key(), it may skip keys and return an incorrect result.

Fix it by using node->prefixlen == matchlen to identify exact matches. When the condition is true after the search, it also implies that node->prefixlen equals key->prefixlen; otherwise, the search would return NULL instead.

Fixes: b471f2f1de8b ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map") Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-6-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06  bpf: Handle in-place update for full LPM trie correctly  [Hou Tao]
When a LPM trie is full, in-place updates of existing elements incorrectly return -ENOSPC. Fix this by deferring the check of trie->n_entries. For new insertions, n_entries must not exceed max_entries. However, in-place updates are allowed even when the trie is full. Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation") Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
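A tiny self-contained sketch of the rule being implemented (illustrative names, not the kernel's trie code): the capacity limit applies only when the update would insert a new key, while replacing an existing key is allowed even when the map is full.

  #include <stdbool.h>
  #include <errno.h>

  struct tiny_map {
          unsigned long n_entries;
          unsigned long max_entries;
  };

  /* Replacing an existing key is always fine; inserting a new key is only
   * fine while there is room left. */
  static int check_capacity(struct tiny_map *m, bool key_exists)
  {
          if (key_exists)
                  return 0;               /* in-place update, no new entry */
          if (m->n_entries >= m->max_entries)
                  return -ENOSPC;         /* truly full, reject the insert */
          m->n_entries++;
          return 0;
  }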
2024-12-06  bpf: Handle BPF_EXIST and BPF_NOEXIST for LPM trie  [Hou Tao]
Add the currently missing handling for the BPF_EXIST and BPF_NOEXIST flags. These flags can be specified by users and are relevant since LPM trie supports exact matches during update. Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation") Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
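A minimal self-contained sketch of the flag semantics being added; the BPF_* values mirror the UAPI meaning (0/1/2 for BPF_ANY/BPF_NOEXIST/BPF_EXIST), while the helper itself is made up for illustration and is not the kernel's update path.

  #include <stdbool.h>
  #include <errno.h>

  enum { BPF_ANY = 0, BPF_NOEXIST = 1, BPF_EXIST = 2 };

  /* Validate an update against the user-supplied flags, given whether an
   * exact match for the key is already present in the trie. */
  static int check_update_flags(unsigned int flags, bool exact_match)
  {
          if (flags == BPF_NOEXIST && exact_match)
                  return -EEXIST;         /* create-only requested, key exists */
          if (flags == BPF_EXIST && !exact_match)
                  return -ENOENT;         /* replace-only requested, key missing */
          return 0;
  }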
2024-12-06  bpf: Remove unnecessary kfree(im_node) in lpm_trie_update_elem  [Hou Tao]
There is no need to call kfree(im_node) when updating element fails, because im_node must be NULL. Remove the unnecessary kfree() for im_node. Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-06  bpf: Remove unnecessary check when updating LPM trie  [Hou Tao]
When "node->prefixlen == matchlen" is true, it means that the node is fully matched. If "node->prefixlen == key->prefixlen" is false, it means the prefix length of key is greater than the prefix length of node, otherwise, matchlen will not be equal with node->prefixlen. However, it also implies that the prefix length of node must be less than max_prefixlen. Therefore, "node->prefixlen == trie->max_prefixlen" will always be false when the check of "node->prefixlen == key->prefixlen" returns false. Remove this unnecessary comparison. Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241206110622.1161752-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-05  sched/numa: fix memory leak due to the overwritten vma->numab_state  [Adrian Huang]
[Problem Description]
When running the hackbench program of LTP, the following memory leak is reported by kmemleak.

  # /opt/ltp/testcases/bin/hackbench 20 thread 1000
  Running with 20*40 (== 800) tasks.

  # dmesg | grep kmemleak
  ...
  kmemleak: 480 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
  kmemleak: 665 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

  # cat /sys/kernel/debug/kmemleak
  unreferenced object 0xffff888cd8ca2c40 (size 64):
    comm "hackbench", pid 17142, jiffies 4299780315
    hex dump (first 32 bytes):
      ac 74 49 00 01 00 00 00 4c 84 49 00 01 00 00 00  .tI.....L.I.....
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    backtrace (crc bff18fd4):
      [<ffffffff81419a89>] __kmalloc_cache_noprof+0x2f9/0x3f0
      [<ffffffff8113f715>] task_numa_work+0x725/0xa00
      [<ffffffff8110f878>] task_work_run+0x58/0x90
      [<ffffffff81ddd9f8>] syscall_exit_to_user_mode+0x1c8/0x1e0
      [<ffffffff81dd78d5>] do_syscall_64+0x85/0x150
      [<ffffffff81e0012b>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
  ...

This issue can be consistently reproduced on three different servers:
  * a 448-core server
  * a 256-core server
  * a 192-core server

[Root Cause]
Since multiple threads are created by the hackbench program (along with the command argument 'thread'), a shared vma might be accessed by two or more cores simultaneously. When two or more cores observe that vma->numab_state is NULL at the same time, vma->numab_state will be overwritten.

Although the current code ensures that only one thread scans the VMAs in a single 'numa_scan_period', there might be a chance for another thread to enter in the next 'numa_scan_period' while the numab_state allocation has not yet completed [1].

Note that the command `/opt/ltp/testcases/bin/hackbench 50 process 1000` cannot reproduce the issue. It is verified with 200+ test runs.

[Solution]
Use the cmpxchg atomic operation to ensure that only one thread executes the vma->numab_state assignment.

[1] https://lore.kernel.org/lkml/1794be3c-358c-4cdc-a43d-a1f841d91ef7@amd.com/

Link: https://lkml.kernel.org/r/20241113102146.2384-1-ahuang12@lenovo.com Fixes: ef6a22b70f6d ("sched/numa: apply the scan delay to every new vma") Signed-off-by: Adrian Huang <ahuang12@lenovo.com> Reported-by: Jiwei Sun <sunjw10@lenovo.com> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Ben Segall <bsegall@google.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
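The publish-once pattern the fix relies on can be sketched as a self-contained user-space analogue, with the GCC/Clang __atomic builtins standing in for the kernel's cmpxchg(); the type and field names are stand-ins, not the scheduler's.

  #include <stdlib.h>

  struct numab_state { int seq; };

  struct vma_like {
          struct numab_state *numab_state;        /* published at most once */
  };

  /* Allocate and publish vma->numab_state exactly once, even if several
   * threads race here: only the winning compare-and-swap keeps its allocation. */
  static int attach_numab_state(struct vma_like *vma)
  {
          struct numab_state *ns, *expected = NULL;

          if (__atomic_load_n(&vma->numab_state, __ATOMIC_ACQUIRE))
                  return 0;                       /* already set up */

          ns = calloc(1, sizeof(*ns));
          if (!ns)
                  return -1;

          if (!__atomic_compare_exchange_n(&vma->numab_state, &expected, ns,
                                           0, __ATOMIC_RELEASE, __ATOMIC_RELAXED))
                  free(ns);                       /* lost the race: no leak, no overwrite */

          return 0;
  }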
2024-12-05  bpf, xdp: constify some bpf_prog * function arguments  [Alexander Lobakin]
In lots of places, bpf_prog pointer is used only for tracing or other stuff that doesn't modify the structure itself. Same for net_device. Address at least some of them and add `const` attributes there. The object code didn't change, but that may prevent unwanted data modifications and also allow more helpers to have const arguments. Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05  audit: fix suffixed '/' filename matching  [Ricardo Robaina]
When the user specifies a directory to delete with the suffix '/', the audit record fails to collect the filename, resulting in the following logs:

  type=PATH msg=audit(10/30/2024 14:11:17.796:6304) : item=2 name=(null)
  type=PATH msg=audit(10/30/2024 14:11:17.796:6304) : item=1 name=(null)

It happens because the values of the variables dname and n->name->name in __audit_inode_child() differ only by the suffix '/'. This commit treats this corner case by handling the pathname's trailing slashes in audit_compare_dname_path().

Steps to reproduce the issue:

  # auditctl -w /tmp
  $ mkdir /tmp/foo
  $ rm -r /tmp/foo/
  # ausearch -i | grep PATH | tail -3

The first version of this patch was based on a GitHub patch/PR by user @hqh2010 [1].

Link: https://github.com/linux-audit/audit-kernel/pull/148 [1] Suggested-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Ricardo Robaina <rrobaina@redhat.com> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> [PM: subject tweak, trim old metadata] Signed-off-by: Paul Moore <paul@paul-moore.com>
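The normalization can be illustrated with a small self-contained helper (purely illustrative, not the audit code): ignore trailing '/' characters before comparing the last path component against the dentry name.

  #include <stdbool.h>
  #include <string.h>

  /* Compare the last component of 'path' against 'dname', tolerating any
   * trailing slashes on 'path' (e.g. "/tmp/foo/" still matches "foo"). */
  static bool last_component_matches(const char *path, const char *dname)
  {
          size_t len = strlen(path);
          size_t start;

          while (len > 1 && path[len - 1] == '/')         /* drop trailing slashes */
                  len--;

          start = len;
          while (start > 0 && path[start - 1] != '/')     /* find last component */
                  start--;

          return strlen(dname) == len - start &&
                 memcmp(path + start, dname, len - start) == 0;
  }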
2024-12-05  Merge tag 'audit-pr-20241205' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit  [Linus Torvalds]
Pull audit build problem workaround from Paul Moore:
 "A minor audit patch that shuffles some code slightly to workaround a GCC bug affecting a number of people.

  The GCC folks have been able to reproduce the problem and are discussing solutions (see the bug report link in the commit), but since the workaround is trivial let's do that in the kernel so we can unblock people who are hitting this"

* tag 'audit-pr-20241205' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: workaround a GCC bug triggered by task comm changes
2024-12-05  Merge tag 'trace-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  [Linus Torvalds]
Pull tracing fixes from Steven Rostedt:

 - Fix trace histogram sort function cmp_entries_dup()
   The sort function cmp_entries_dup() returns either 1 or 0, and not -1 if parameter "a" is less than "b" by memcmp().

 - Fix archs that call trace_hardirqs_off() without RCU watching
   Both x86 and arm64 no longer call any tracepoints with RCU not watching. It was assumed that it was safe to get rid of the trace_*_rcuidle() version of the tracepoint calls. This was needed to get rid of the SRCU protection and be able to implement features like faultable tracepoints and add rust tracepoints.
   Unfortunately, there were a few architectures that still relied on that logic. There's only one file that has tracepoints that are called without RCU watching. Add macro logic around the tracepoints so that architectures that do not have CONFIG_ARCH_WANTS_NO_INSTR defined will check if the code is in the idle path (the only place RCU isn't watching), and enable RCU around calling the tracepoint, but only do it if the tracepoint is enabled.

* tag 'trace-v6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix archs that still call tracepoints without RCU watching
  tracing: Fix cmp_entries_dup() to respect sort() comparison rules
2024-12-05  clocksource: Make negative motion detection more robust  [Thomas Gleixner]
Guenter reported boot stalls on an emulated ARM 32-bit platform, which has a 24-bit wide clocksource.

It turns out that the calculated maximal idle time, which limits idle sleeps to prevent clocksource wrap-arounds, is close to the point where the negative motion detection triggers:

  max_idle_ns:                    597268854 ns
  negative motion tripping point: 671088640 ns

If the idle wakeup is delayed beyond that point, the clocksource advances far enough to trigger the negative motion detection. This prevents the clock from advancing, and in the worst case the system stalls completely if the consecutive sleeps based on the stale clock are delayed as well.

Cure this by calculating a more robust cut-off value for negative motion, which covers 87.5% of the actual clocksource counter width. Compare the delta against this value to catch negative motion. This is specifically for clock sources with a small counter width, as their wrap-around time is close to the half counter width. For clock sources with wide counters this is not a problem, because the maximum idle time is far from the half counter width due to the math overflow protection constraints.

For the case at hand this results in a tripping point of 1174405120ns.

Note that this cannot prevent issues when the delay exceeds the 87.5% margin, but that's not different from the previous unchecked version which allowed arbitrary time jumps. Systems with small counter width are prone to invalid results, but this problem is unlikely to be seen on real hardware. If such a system completely stalls for more than half a second, then there are other more urgent problems than the counter wrapping around.

Fixes: c163e40af9b2 ("timekeeping: Always check for negative motion") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/all/8734j5ul4x.ffs@tglx Closes: https://lore.kernel.org/all/387b120b-d68a-45e8-b6ab-768cd95d11c2@roeck-us.net
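The numbers quoted above can be reproduced with a tiny self-contained program. The 80 ns per counter tick is an assumption inferred from the quoted half-width value (671088640 ns for 2^23 ticks of the 24-bit counter), not a documented constant of the platform.

  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
          const unsigned int width = 24;          /* counter bits on the reported platform */
          const uint64_t ns_per_tick = 80;        /* assumed: 671088640 ns / 2^23 ticks */
          const uint64_t full = (uint64_t)1 << width;

          /* Old tripping point: half of the counter width. */
          printf("half width: %llu ns\n",
                 (unsigned long long)(full / 2 * ns_per_tick));         /* 671088640 */

          /* New cut-off: 7/8 (87.5 percent) of the counter width. */
          printf("7/8 width : %llu ns\n",
                 (unsigned long long)(full / 8 * 7 * ns_per_tick));     /* 1174405120 */
          return 0;
  }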
2024-12-05  tracing: Fix archs that still call tracepoints without RCU watching  [Steven Rostedt]
Tracepoints require having RCU "watching" as they use RCU to do updates to the tracepoints. There are some cases that would call a tracepoint when RCU was not "watching". This was usually in the idle path where RCU has "shutdown". For the few locations that had tracepoints without RCU watching, there was a trace_*_rcuidle() variant that could be used. This used SRCU for protection.

There are tracepoints that trace when interrupts and preemption are enabled and disabled. In some architectures, these tracepoints are called in a path where RCU is not watching. When x86 and arm64 removed these locations, it was incorrectly assumed that it would be safe to remove the trace_*_rcuidle() variant and also remove the SRCU logic, as it made the code more complex and harder to implement new tracepoint features (like faultable tracepoints and tracepoints in rust).

Instead of bringing back trace_*_rcuidle(), which would not be trivial to do as new code has already been added depending on its removal, add a workaround to the one file that still requires it (trace_preemptirq.c). If the architecture does not define CONFIG_ARCH_WANTS_NO_INSTR, then check if the code is in the idle path, and if so, call ct_irq_enter/exit() which will enable RCU around the tracepoint.

Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/20241204100414.4d3e06d0@gandalf.local.home Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Fixes: 48bcda684823 ("tracing: Remove definition of trace_*_rcuidle()") Closes: https://lore.kernel.org/all/bddb02de-957a-4df5-8e77-829f55728ea2@roeck-us.net/ Acked-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Tested-by: Madhavan Srinivasan <maddy@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-12-05  smp/scf: Evaluate local cond_func() before IPI side-effects  [Mathieu Desnoyers]
In smp_call_function_many_cond(), the local cond_func() is evaluated after triggering the remote CPU IPIs. If cond_func() depends on loading shared state updated by other CPU's IPI handlers func(), then triggering execution of remote CPUs IPI before evaluating cond_func() may have unexpected consequences. One example scenario is evaluating a jiffies delay in cond_func(), which is updated by func() in the IPI handlers. This situation can prevent execution of periodic cleanup code on the local CPU. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Rik van Riel <riel@surriel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20241203163558.3455535-1-mathieu.desnoyers@efficios.com
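A stripped-down, self-contained illustration of the ordering in plain C (all names are made up; this is not the smp_call_function_many_cond() implementation): the local predicate is sampled before any remote CPU is poked, so the remote handlers' side effects cannot influence the local decision.

  #include <stdbool.h>

  typedef bool (*cond_fn)(int cpu, void *info);
  typedef void (*work_fn)(void *info);

  static void call_many_cond(int this_cpu, int nr_cpus,
                             cond_fn cond, work_fn func, void *info)
  {
          /* Sample the local condition first, before any remote side effects. */
          bool run_local = cond(this_cpu, info);

          for (int cpu = 0; cpu < nr_cpus; cpu++) {
                  if (cpu == this_cpu)
                          continue;
                  if (cond(cpu, info)) {
                          /* send the IPI to 'cpu' here (omitted in this sketch) */
                  }
          }

          if (run_local)
                  func(info);     /* local invocation based on the early sample */
  }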
2024-12-05  PM: sleep: autosleep: don't include 'pm_wakeup.h' directly  [Wolfram Sang]
The header clearly states that it does not want to be included directly, only via 'device.h'. 'platform_device.h' works equally well. Remove the direct inclusion. Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Link: https://patch.msgid.link/20241118072917.3853-16-wsa+renesas@sang-engineering.com [ rjw: Subject edit ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-12-04  audit: workaround a GCC bug triggered by task comm changes  [Yafang shao]
A build failure has been reported with the following details:

  In file included from include/linux/string.h:390,
                   from include/linux/bitmap.h:13,
                   from include/linux/cpumask.h:12,
                   from include/linux/smp.h:13,
                   from include/linux/lockdep.h:14,
                   from include/linux/spinlock.h:63,
                   from include/linux/wait.h:9,
                   from include/linux/wait_bit.h:8,
                   from include/linux/fs.h:6,
                   from kernel/auditsc.c:37:
  In function 'sized_strscpy',
      inlined from '__audit_ptrace' at kernel/auditsc.c:2732:2:
  >> include/linux/fortify-string.h:293:17: error: call to '__write_overflow' declared with attribute error: detected write beyond size of object (1st parameter)
    293 |     __write_overflow();
        |     ^~~~~~~~~~~~~~~~~~
  In function 'sized_strscpy',
      inlined from 'audit_signal_info_syscall' at kernel/auditsc.c:2759:3:
  >> include/linux/fortify-string.h:293:17: error: call to '__write_overflow' declared with attribute error: detected write beyond size of object (1st parameter)
    293 |     __write_overflow();
        |     ^~~~~~~~~~~~~~~~~~

The issue appears to be a GCC bug, though the root cause remains unclear at this time. For now, let's implement a workaround. A bug report has also been filed with GCC [0].

Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117912 [0]

Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202410171420.1V00ICVG-lkp@intel.com/ Reported-by: Steven Rostedt (Google) <rostedt@goodmis.org> Closes: https://lore.kernel.org/all/20241128182435.57a1ea6f@gandalf.local.home/ Reported-by: Zhuo, Qiuxu <qiuxu.zhuo@intel.com> Closes: https://lore.kernel.org/all/CY8PR11MB71348E568DBDA576F17DAFF389362@CY8PR11MB7134.namprd11.prod.outlook.com/ Originally-by: Kees Cook <kees@kernel.org> Link: https://lore.kernel.org/linux-hardening/202410171059.C2C395030@keescook/ Signed-off-by: Yafang shao <laoar.shao@gmail.com> Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org> Tested-by: Yafang Shao <laoar.shao@gmail.com> [PM: subject tweak, description line wrapping] Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-12-04  sched_ext: Use the NUMA scheduling domain for NUMA optimizations  [Andrea Righi]
Rely on the NUMA scheduling domain topology, instead of accessing NUMA topology information directly. There is basically no functional change, but in this way we ensure consistent use of the same topology information determined by the scheduling subsystem. Fixes: f6ce6b949304 ("sched_ext: Do not enable LLC/NUMA optimizations when domains overlap") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2024-12-04  lsm: replace context+len with lsm_context  [Casey Schaufler]
Replace the (secctx,seclen) pointer pair with a single lsm_context pointer to allow return of the LSM identifier along with the context and context length. This allows security_release_secctx() to know how to release the context. Callers have been modified to use or save the returned data from the new structure. security_secid_to_secctx() and security_lsmproc_to_secctx() will now return the length value on success instead of 0. Cc: netdev@vger.kernel.org Cc: audit@vger.kernel.org Cc: netfilter-devel@vger.kernel.org Cc: Todd Kjos <tkjos@google.com> Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> [PM: subject tweak, kdoc fix, signedness fix from Dan Carpenter] Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-12-04  bpf: Fix narrow scalar spill onto 64-bit spilled scalar slots  [Tao Lyu]
When CAP_PERFMON and CAP_SYS_ADMIN (allow_ptr_leaks) are disabled, the verifier aims to reject partial overwrite on an 8-byte stack slot that contains a spilled pointer. However, in such a scenario, it rejects all partial stack overwrites as long as the targeted stack slot is a spilled register, because it does not check if the stack slot is a spilled pointer. Incomplete checks will result in the rejection of valid programs, which spill narrower scalar values onto scalar slots, as shown below.

  0: R1=ctx() R10=fp0
  ; asm volatile ( @ repro.bpf.c:679
  0: (7a) *(u64 *)(r10 -8) = 1          ; R10=fp0 fp-8_w=1
  1: (62) *(u32 *)(r10 -8) = 1
  attempt to corrupt spilled pointer on stack
  processed 2 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0.

Fix this by expanding the check to not consider spilled scalar registers when rejecting the write into the stack.

Previous discussion on this patch is at link [0].

  [0]: https://lore.kernel.org/bpf/20240403202409.2615469-1-tao.lyu@epfl.ch

Fixes: ab125ed3ec1c ("bpf: fix check for attempt to corrupt spilled pointer") Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Tao Lyu <tao.lyu@epfl.ch> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241204044757.1483141-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-12-04  bpf: Don't mark STACK_INVALID as STACK_MISC in mark_stack_slot_misc  [Kumar Kartikeya Dwivedi]
Inside mark_stack_slot_misc, we should not upgrade STACK_INVALID to STACK_MISC when allow_ptr_leaks is false, since invalid contents shouldn't be read unless the program has the relevant capabilities. The relaxation only makes sense when env->allow_ptr_leaks is true. However, such conversion in privileged mode becomes unnecessary, as invalid slots can be read without being upgraded to STACK_MISC. Currently, the condition is inverted (i.e. checking for true instead of false), simply remove it to restore correct behavior. Fixes: eaf18febd6eb ("bpf: preserve STACK_ZERO slots on partial reg spills") Acked-by: Andrii Nakryiko <andrii@kernel.org> Reported-by: Tao Lyu <tao.lyu@epfl.ch> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241204044757.1483141-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>