path: root/kernel
Age | Commit message | Author
2025-01-20bpf: Free element after unlock in __htab_map_lookup_and_delete_elem()Hou Tao
The freeing of special fields in a map value may acquire a spin-lock (e.g., the freeing of bpf_timer); however, the lookup_and_delete_elem procedure already holds a raw-spin-lock, which violates the lockdep rule. The running context of __htab_map_lookup_and_delete_elem() has already disabled migration. Therefore, it is OK to invoke free_htab_elem() after unlocking the bucket lock. Fix the potential problem by freeing the element after unlocking the bucket lock in __htab_map_lookup_and_delete_elem(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250117101816.2101857-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
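For illustration only: a minimal user-space C sketch of the lock-ordering pattern this fix applies, with made-up types and names (a pthread mutex standing in for the bucket's raw spin lock). The element is unlinked under the lock, while its possibly lock-taking cleanup and the free happen after the lock is dropped, so the cleanup lock never nests inside the bucket lock.

  #include <pthread.h>
  #include <stdlib.h>

  struct elem {
          struct elem *next;
          void (*free_fields)(struct elem *);     /* may take its own lock */
  };

  struct bucket {
          pthread_mutex_t lock;
          struct elem *head;
  };

  static void lookup_and_delete(struct bucket *b)
  {
          struct elem *e;

          pthread_mutex_lock(&b->lock);
          e = b->head;
          if (e)
                  b->head = e->next;              /* only unlink under the lock */
          pthread_mutex_unlock(&b->lock);

          if (e) {
                  e->free_fields(e);              /* safe: bucket lock already dropped */
                  free(e);
          }
  }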
2025-01-20bpf: Bail out early in __htab_map_lookup_and_delete_elem()Hou Tao
Use a goto statement to bail out early when the target element is not found, instead of using a large else branch to handle the more likely case. This change doesn't affect functionality and simply makes the code cleaner. Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org> Link: https://lore.kernel.org/r/20250117101816.2101857-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-20bpf: Free special fields after unlock in htab_lru_map_delete_node()Hou Tao
When bpf_timer is used in an LRU hash map, calling check_and_free_fields() in htab_lru_map_delete_node() will invoke bpf_timer_cancel_and_free() to free the bpf_timer. If the timer is running on other CPUs, hrtimer_cancel() will invoke hrtimer_cancel_wait_running() to spin on the current CPU to wait for the completion of the hrtimer callback. The deletion has already acquired a raw-spin-lock (the bucket lock), so to reduce the time the bucket lock is held, move the invocation of check_and_free_fields() out of the bucket lock. However, because htab_lru_map_delete_node() is invoked with the LRU raw spin lock being held, the freeing of special fields still happens in a locked scope. Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org> Link: https://lore.kernel.org/r/20250117101816.2101857-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-20Merge branch 'for-6.14-cpu_sync-fixup' into for-linusPetr Mladek
2025-01-19Merge tag 'timers_urgent_for_v6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Borislav Petkov: - Reset hrtimers correctly when a CPU hotplug state traversal happens "half-ways" and leaves hrtimers not (re-)initialized properly - Annotate accesses to a timer group's ignore flag to prevent KCSAN from raising data_race warnings - Make sure timer group initialization is visible to timer tree walkers and avoid a hypothetical race - Fix another race between CPU hotplug and idle entry/exit where timers on a fully idle system are getting ignored - Fix a case where an ignored signal is still being handled which it shouldn't be * tag 'timers_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimers: Handle CPU state correctly on hotplug timers/migration: Annotate accesses to ignore flag timers/migration: Enforce group initialization visibility to tree walkers timers/migration: Fix another race between hotplug and idle entry/exit signal/posixtimers: Handle ignore/blocked sequences correctly
2025-01-19Merge tag 'sched_urgent_for_v6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Do not adjust the weight of empty group entities and avoid scheduling artifacts - Avoid scheduling lag by computing lag properly and thus address an EEVDF entity placement issue * tag 'sched_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE sched/fair: Fix EEVDF entity placement bug causing scheduling lag
2025-01-19padata: avoid UAF for reorder_workChen Ridong
Although the previous patch can avoid ps and ps UAF for _do_serial, it cannot avoid a potential UAF issue for reorder_work. This issue can happen just as below: crypto_request crypto_request crypto_del_alg padata_do_serial ... padata_reorder // processes all remaining // requests then breaks while (1) { if (!padata) break; ... } padata_do_serial // new request added list_add // sees the new request queue_work(reorder_work) padata_reorder queue_work_on(squeue->work) ... <kworker context> padata_serial_worker // completes new request, // no more outstanding // requests crypto_del_alg // free pd <kworker context> invoke_padata_reorder // UAF of pd To avoid the UAF for 'reorder_work', take a 'pd' reference before putting 'reorder_work' onto the 'serial_wq' and drop the 'pd' reference only after the 'serial_wq' work has finished. Fixes: bbefa1dd6a6d ("crypto: pcrypt - Avoid deadlock by using per-instance padata queues") Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-01-19padata: fix UAF in padata_reorderChen Ridong
A bug was found when running the LTP test: BUG: KASAN: slab-use-after-free in padata_find_next+0x29/0x1a0 Read of size 4 at addr ffff88bbfe003524 by task kworker/u113:2/3039206 CPU: 0 PID: 3039206 Comm: kworker/u113:2 Kdump: loaded Not tainted 6.6.0+ Workqueue: pdecrypt_parallel padata_parallel_worker Call Trace: <TASK> dump_stack_lvl+0x32/0x50 print_address_description.constprop.0+0x6b/0x3d0 print_report+0xdd/0x2c0 kasan_report+0xa5/0xd0 padata_find_next+0x29/0x1a0 padata_reorder+0x131/0x220 padata_parallel_worker+0x3d/0xc0 process_one_work+0x2ec/0x5a0 If 'mdelay(10)' is added before calling 'padata_find_next' in the 'padata_reorder' function, this issue can be reproduced easily with the LTP test (pcrypt_aead01). This can be explained as below: pcrypt_aead_encrypt ... padata_do_parallel refcount_inc(&pd->refcnt); // add refcnt ... padata_do_serial padata_reorder // pd while (1) { padata_find_next(pd, true); // using pd queue_work_on ... padata_serial_worker crypto_del_alg padata_put_pd_cnt // sub refcnt padata_free_shell padata_put_pd(ps->pd); // pd is freed // loop again, but pd is freed // call padata_find_next, UAF } In the padata_reorder function, when it loops in the 'while' loop, if the alg is deleted, the refcnt may be decreased to 0 before entering 'padata_find_next', which leads to UAF. As mentioned in [1], do_serial is supposed to be called with BHs disabled and always happens under RCU protection. To address this issue, add synchronize_rcu() in 'padata_free_shell' to wait for all _do_serial calls to finish. [1] https://lore.kernel.org/all/20221028160401.cccypv4euxikusiq@parnassus.localdomain/ [2] https://lore.kernel.org/linux-kernel/jfjz5d7zwbytztackem7ibzalm5lnxldi2eofeiczqmqs2m7o6@fq426cwnjtkm/ Fixes: b128a3040935 ("padata: allocate workqueue internally") Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Qu Zicheng <quzicheng@huawei.com> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-01-19padata: add pd get/put refcnt helperChen Ridong
Add helpers for pd to get/put the refcnt to make the code concise. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
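For illustration only: a plain-C sketch of the get/put helper pattern. The helper names mirror the ones referenced in the commits above, but the bodies are a user-space stand-in, not the kernel implementation. With helpers like these, the reorder_work fix becomes "get before queueing the work, put when the work finishes", so 'pd' cannot be freed while the work is still pending.

  #include <stdatomic.h>
  #include <stdlib.h>

  struct parallel_data {
          atomic_int refcnt;
          /* ... */
  };

  static void padata_get_pd(struct parallel_data *pd)
  {
          atomic_fetch_add(&pd->refcnt, 1);
  }

  static void padata_put_pd_cnt(struct parallel_data *pd, int cnt)
  {
          /* fetch_sub() returns the old value: old == cnt means it just hit zero */
          if (atomic_fetch_sub(&pd->refcnt, cnt) == cnt)
                  free(pd);
  }

  static void padata_put_pd(struct parallel_data *pd)
  {
          padata_put_pd_cnt(pd, 1);
  }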
2025-01-16ftrace: Implement :mod: cache filtering on kernel command lineSteven Rostedt
Module functions can be set to set_ftrace_filter before the module is loaded. # echo :mod:snd_hda_intel > set_ftrace_filter This will enable all the functions for the module snd_hda_intel. If that module is not loaded, it is "cached" in the trace array so that when the module is loaded, its functions will be traced. But this is not implemented in the kernel command line. That's because the kernel command line filtering is added very early in boot up as it needs to be done before boot time function tracing can start, which is also available very early in boot up. The code used by the "set_ftrace_filter" file cannot be used that early as it depends on some other initialization to occur first. But some of the functions can. Implement the ":mod:" feature of "set_ftrace_filter" in the kernel command line parsing. Now function tracing on just a single module that is loaded at boot up can be done. Adding: ftrace=function ftrace_filter=:mod:snd_hda_intel to the kernel command line will only enable the snd_hda_intel module functions when the module is loaded, and it will start tracing. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250116175832.34e39779@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-16tracing: Adopt __free() and guard() for trace_fprobe.cMasami Hiramatsu (Google)
Adopt __free() and guard() for trace_fprobe.c to remove gotos. Link: https://lore.kernel.org/173708043449.319651.12242878905778792182.stgit@mhiramat.roam.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-16bpf: verifier: Support eliding map lookup nullnessDaniel Xu
This commit allows progs to elide a null check on statically known map lookup keys. In other words, if the verifier can statically prove that the lookup will be in-bounds, allow the prog to drop the null check. This is useful for two reasons: 1. Large numbers of nullness checks (especially when they cannot fail) unnecessarily push the prog towards BPF_COMPLEXITY_LIMIT_JMP_SEQ. 2. It forms a tighter contract between programmer and verifier. For (1), bpftrace is starting to make heavier use of percpu scratch maps. As a result, for user scripts with a large number of unrolled loops, we are starting to hit jump complexity verification errors. These percpu lookups cannot fail anyway, as we only use static key values. Eliding nullness probably results in less work for the verifier as well. For (2), percpu scratch maps are often used as a larger stack, as the current stack is limited to 512 bytes. In these situations, it is desirable for the programmer to express: "this lookup should never fail, and if it does, it means I messed up the code". By omitting the null check, the programmer can "ask" the verifier to double check the logic. Tests also have to be updated in sync with these changes, as the verifier is more efficient with this change. Notably, iters.c tests had to be changed to use a map type that still requires null checks, as it's exercising verifier tracking logic w.r.t. iterators. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/68f3ea96ff3809a87e502a11a4bd30177fc5823e.1736886479.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
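For illustration only: a hedged BPF C sketch (assuming libbpf's bpf_helpers.h and a recent clang) of the usage pattern this enables, where the lookup key is a compile-time constant that is in-bounds for a per-CPU array. Portable programs should keep the NULL check, since older verifiers still require it.

  // SPDX-License-Identifier: GPL-2.0
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
          __uint(max_entries, 1);
          __type(key, __u32);
          __type(value, __u64);
  } scratch SEC(".maps");

  SEC("tracepoint/syscalls/sys_enter_getpid")
  int count_getpid(void *ctx)
  {
          __u32 key = 0;                          /* statically known, in-bounds */
          __u64 *val = bpf_map_lookup_elem(&scratch, &key);

          if (!val)                               /* elidable with this change; keep for older kernels */
                  return 0;
          *val += 1;
          return 0;
  }

  char _license[] SEC("license") = "GPL";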
2025-01-16bpf: verifier: Refactor helper access type trackingDaniel Xu
Previously, the verifier was treating all PTR_TO_STACK registers passed to a helper call as potentially written to by the helper. However, all calls to check_stack_range_initialized() already have precise access type information available. Rather than treat ACCESS_HELPER as a proxy for BPF_WRITE, pass enum bpf_access_type to check_stack_range_initialized() to more precisely track helper arguments. One benefit from this precision is that registers tracked as valid spills and passed as a read-only helper argument remain tracked after the call, rather than being marked STACK_MISC afterwards. An additional benefit is that the verifier logs are also more precise. For this particular error, users will enjoy a slightly clearer message. See included selftest updates for examples. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/ff885c0e5859e0cd12077c3148ff0754cad4f7ed.1736886479.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-16bpf: verifier: Add missing newline on verbose() callDaniel Xu
The print was missing a newline. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/59cbe18367b159cd470dc6d5c652524c1dc2b984.1736886479.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-16Merge back earlier cpufreq material for 6.14Rafael J. Wysocki
2025-01-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.13-rc8). Conflicts: drivers/net/ethernet/realtek/r8169_main.c 1f691a1fc4be ("r8169: remove redundant hwmon support") 152d00a91396 ("r8169: simplify setting hwmon attribute visibility") https://lore.kernel.org/20250115122152.760b4e8d@canb.auug.org.au Adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt.c 152f4da05aee ("bnxt_en: add support for rx-copybreak ethtool command") f0aa6a37a3db ("eth: bnxt: always recalculate features after XDP clearing, fix null-deref") drivers/net/ethernet/intel/ice/ice_type.h 50327223a8bb ("ice: add lock to protect low latency interface") dc26548d729e ("ice: Fix quad registers read on E825") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16tracing: Cache ":mod:" events for modules not loaded yetSteven Rostedt
When the :mod: command is written into /sys/kernel/tracing/set_event (or that file within an instance), if the module specified after the ":mod:" is not yet loaded, it will store that string internally. When the module is loaded, it will enable the events as if the module was loaded when the string was written into the set_event file. This can also be useful to enable events that are in the init section of the module, as the events are enabled before the init section is executed. This also works on the kernel command line: trace_event=:mod:<module> Will enable the events for <module> when it is loaded. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20250116143533.514730995@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-16tracing: Add :mod: command to enabled module eventsSteven Rostedt
Add a :mod: command to enable only events from a given module from the set_events file. echo '*:mod:<module>' > set_events Or echo ':mod:<module>' > set_events Will enable all events for that module. Specific events can also be enabled via: echo '<event>:mod:<module>' > set_events Or echo '<system>:<event>:mod:<module>' > set_events Or echo '*:<event>:mod:<module>' > set_events The ":mod:" keyword is consistent with the function tracing filter to enable functions from a given module. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20250116143533.214496360@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-16timers/migration: Simplify top level detection on group setupFrederic Weisbecker
Having a single group on a given level is enough to know this is the top level, because a root has to have at least two children, unless that root is the only group and the children are actual CPUs. Simplify the test in tmigr_setup_groups() accordingly. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250114231507.21672-5-frederic@kernel.org
2025-01-16hrtimers: Handle CPU state correctly on hotplugKoichiro Den
Consider a scenario where a CPU transitions from CPUHP_ONLINE to halfway through a CPU hotunplug down to CPUHP_HRTIMERS_PREPARE, and then back to CPUHP_ONLINE: Since hrtimers_prepare_cpu() does not run, cpu_base.hres_active remains set to 1 throughout. However, during a CPU unplug operation, the tick and the clockevents are shut down at CPUHP_AP_TICK_DYING. On return to the online state, for instance CFS incorrectly assumes that the hrtick is already active, and the chance of the clockevent device to transition to oneshot mode is also lost forever for the CPU, unless it goes back to a lower state than CPUHP_HRTIMERS_PREPARE once. This round-trip reveals another issue; cpu_base.online is not set to 1 after the transition, which appears as a WARN_ON_ONCE in enqueue_hrtimer(). Aside of that, the bulk of the per CPU state is not reset either, which means there are dangling pointers in the worst case. Address this by adding a corresponding startup() callback, which resets the stale per CPU state and sets the online flag. [ tglx: Make the new callback unconditionally available, remove the online modification in the prepare() callback and clear the remaining state in the starting callback instead of the prepare callback ] Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") Signed-off-by: Koichiro Den <koichiro.den@canonical.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20241220134421.3809834-1-koichiro.den@canonical.com
2025-01-16timers/migration: Annotate accesses to ignore flagFrederic Weisbecker
The group's ignore flag is: _ read under the group's lock (idle entry, remote expiry) _ turned on/off under the group's lock (idle entry, remote expiry) _ turned on locklessly on idle exit When idle entry or remote expiry clear the "ignore" flag of a group, the operation must be synchronized against other concurrent idle entry or remote expiry to make sure the related group timer is never missed. To enforce this synchronization, both "ignore" clear and read are performed under the group lock. On the contrary, whether idle entry or remote expiry manage to observe the "ignore" flag turned on by a CPU exiting idle is a matter of optimization. If that flag set is missed or cleared concurrently, the worst outcome is a migrator wasting time remotely handling a "ghost" timer. This is why the ignore flag can be set locklessly. Unfortunately, the related lockless accesses are bare and miss appropriate annotations. KCSAN rightfully complains: BUG: KCSAN: data-race in __tmigr_cpu_activate / print_report write to 0xffff88842fc28004 of 1 bytes by task 0 on cpu 0: __tmigr_cpu_activate tmigr_cpu_activate timer_clear_idle tick_nohz_restart_sched_tick tick_nohz_idle_exit do_idle cpu_startup_entry kernel_init do_initcalls clear_bss reserve_bios_regions common_startup_64 read to 0xffff88842fc28004 of 1 bytes by task 0 on cpu 1: print_report kcsan_report_known_origin kcsan_setup_watchpoint tmigr_next_groupevt tmigr_update_events tmigr_inactive_up __walk_groups+0x50/0x77 walk_groups __tmigr_cpu_deactivate tmigr_cpu_deactivate __get_next_timer_interrupt timer_base_try_to_set_idle tick_nohz_stop_tick tick_nohz_idle_stop_tick cpuidle_idle_call do_idle Although the relevant accesses could be marked as data_race(), the "ignore" flag being read several times within the same tmigr_update_events() function is confusing and error prone. Prefer reading it once in that function and make use of similar/paired accesses elsewhere with appropriate comments when necessary. Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250114231507.21672-4-frederic@kernel.org Closes: https://lore.kernel.org/oe-lkp/202501031612.62e0c498-lkp@intel.com
2025-01-16timers/migration: Enforce group initialization visibility to tree walkersFrederic Weisbecker
Commit 2522c84db513 ("timers/migration: Fix another race between hotplug and idle entry/exit") fixed yet another race between idle exit and CPU hotplug up leading to a wrong "0" value migrator assigned to the top level. However there is yet another situation that remains unhandled: [GRP0:0] migrator = TMIGR_NONE active = NONE groupmask = 1 / \ \ 0 1 2..7 idle idle idle 0) The system is fully idle. [GRP0:0] migrator = CPU 0 active = CPU 0 groupmask = 1 / \ \ 0 1 2..7 active idle idle 1) CPU 0 is activating. It has done the cmpxchg on the top's ->migr_state but it hasn't yet returned to __walk_groups(). [GRP0:0] migrator = CPU 0 active = CPU 0, CPU 1 groupmask = 1 / \ \ 0 1 2..7 active active idle 2) CPU 1 is activating. CPU 0 stays the migrator (still stuck in __walk_groups(), delayed by #VMEXIT for example). [GRP1:0] migrator = TMIGR_NONE active = NONE groupmask = 1 / \ [GRP0:0] [GRP0:1] migrator = CPU 0 migrator = TMIGR_NONE active = CPU 0, CPU1 active = NONE groupmask = 1 groupmask = 2 / \ \ 0 1 2..7 8 active active idle !online 3) CPU 8 is preparing to boot. CPUHP_TMIGR_PREPARE is being ran by CPU 1 which has created the GRP0:1 and the new top GRP1:0 connected to GRP0:1 and GRP0:0. CPU 1 hasn't yet propagated its activation up to GRP1:0. [GRP1:0] migrator = GRP0:0 active = GRP0:0 groupmask = 1 / \ [GRP0:0] [GRP0:1] migrator = CPU 0 migrator = TMIGR_NONE active = CPU 0, CPU1 active = NONE groupmask = 1 groupmask = 2 / \ \ 0 1 2..7 8 active active idle !online 4) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups() returning from tmigr_cpu_active(). The new top GRP1:0 is visible and fetched and the pre-initialized groupmask of GRP0:0 is also visible. As a result tmigr_active_up() is called to GRP1:0 with GRP0:0 as active and migrator. CPU 0 is returning to __walk_groups() but suffers again a #VMEXIT. [GRP1:0] migrator = GRP0:0 active = GRP0:0 groupmask = 1 / \ [GRP0:0] [GRP0:1] migrator = CPU 0 migrator = TMIGR_NONE active = CPU 0, CPU1 active = NONE groupmask = 1 groupmask = 2 / \ \ 0 1 2..7 8 active active idle !online 5) CPU 1 propagates its activation of GRP0:0 to GRP1:0. This has no effect since CPU 0 did it already. [GRP1:0] migrator = GRP0:0 active = GRP0:0, GRP0:1 groupmask = 1 / \ [GRP0:0] [GRP0:1] migrator = CPU 0 migrator = CPU 8 active = CPU 0, CPU1 active = CPU 8 groupmask = 1 groupmask = 2 / \ \ \ 0 1 2..7 8 active active idle active 6) CPU 1 links CPU 8 to its group. CPU 8 boots and goes through CPUHP_AP_TMIGR_ONLINE which propagates activation. [GRP2:0] migrator = TMIGR_NONE active = NONE groupmask = 1 / \ [GRP1:0] [GRP1:1] migrator = GRP0:0 migrator = TMIGR_NONE active = GRP0:0, GRP0:1 active = NONE groupmask = 1 groupmask = 2 / \ [GRP0:0] [GRP0:1] [GRP0:2] migrator = CPU 0 migrator = CPU 8 migrator = TMIGR_NONE active = CPU 0, CPU1 active = CPU 8 active = NONE groupmask = 1 groupmask = 2 groupmask = 0 / \ \ \ 0 1 2..7 8 64 active active idle active !online 7) CPU 64 is booting. CPUHP_TMIGR_PREPARE is being ran by CPU 1 which has created the GRP1:1, GRP0:2 and the new top GRP2:0 connected to GRP1:1 and GRP1:0. CPU 1 hasn't yet propagated its activation up to GRP2:0. [GRP2:0] migrator = 0 (!!!) 
active = NONE groupmask = 1 / \ [GRP1:0] [GRP1:1] migrator = GRP0:0 migrator = TMIGR_NONE active = GRP0:0, GRP0:1 active = NONE groupmask = 1 groupmask = 2 / \ [GRP0:0] [GRP0:1] [GRP0:2] migrator = CPU 0 migrator = CPU 8 migrator = TMIGR_NONE active = CPU 0, CPU1 active = CPU 8 active = NONE groupmask = 1 groupmask = 2 groupmask = 0 / \ \ \ 0 1 2..7 8 64 active active idle active !online 8) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups() returning from tmigr_cpu_active(). The new top GRP2:0 is visible and fetched but the pre-initialized groupmask of GRP1:0 is not because no ordering made its initialization visible. As a result tmigr_active_up() may be called to GRP2:0 with a "0" child's groumask. Leaving the timers ignored for ever when the system is fully idle. The race is highly theoretical and perhaps impossible in practice but the groupmask of the child is not the only concern here as the whole initialization of the child is not guaranteed to be visible to any tree walker racing against hotplug (idle entry/exit, remote handling, etc...). Although the current code layout seem to be resilient to such hazards, this doesn't tell much about the future. Fix this with enforcing address dependency between group initialization and the write/read to the group's parent's pointer. Fortunately that doesn't involve any barrier addition in the fast paths. Fixes: 10a0e6f3d3db ("timers/migration: Move hierarchy setup into cpuhotplug prepare callback") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20250114231507.21672-3-frederic@kernel.org
2025-01-16timers/migration: Fix another race between hotplug and idle entry/exitFrederic Weisbecker
Commit 10a0e6f3d3db ("timers/migration: Move hierarchy setup into cpuhotplug prepare callback") fixed a race between idle exit and CPU hotplug up leading to a wrong "0" value migrator assigned to the top level. However there is still a situation that remains unhandled: [GRP0:0] migrator = TMIGR_NONE active = NONE groupmask = 0 / \ \ 0 1 2..7 idle idle idle 0) The system is fully idle. [GRP0:0] migrator = CPU 0 active = CPU 0 groupmask = 0 / \ \ 0 1 2..7 active idle idle 1) CPU 0 is activating. It has done the cmpxchg on the top's ->migr_state but it hasn't yet returned to __walk_groups(). [GRP0:0] migrator = CPU 0 active = CPU 0, CPU 1 groupmask = 0 / \ \ 0 1 2..7 active active idle 2) CPU 1 is activating. CPU 0 stays the migrator (still stuck in __walk_groups(), delayed by #VMEXIT for example). [GRP1:0] migrator = TMIGR_NONE active = NONE groupmask = 0 / \ [GRP0:0] [GRP0:1] migrator = CPU 0 migrator = TMIGR_NONE active = CPU 0, CPU1 active = NONE groupmask = 2 groupmask = 1 / \ \ 0 1 2..7 8 active active idle !online 3) CPU 8 is preparing to boot. CPUHP_TMIGR_PREPARE is being ran by CPU 1 which has created the GRP0:1 and the new top GRP1:0 connected to GRP0:1 and GRP0:0. The groupmask of GRP0:0 is now 2. CPU 1 hasn't yet propagated its activation up to GRP1:0. [GRP1:0] migrator = 0 (!!!) active = NONE groupmask = 0 / \ [GRP0:0] [GRP0:1] migrator = CPU 0 migrator = TMIGR_NONE active = CPU 0, CPU1 active = NONE groupmask = 2 groupmask = 1 / \ \ 0 1 2..7 8 active active idle !online 4) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups() returning from tmigr_cpu_active(). The new top GRP1:0 is visible and fetched but the freshly updated groupmask of GRP0:0 may not be visible due to lack of ordering! As a result tmigr_active_up() is called to GRP0:0 with a child's groupmask of "0". This buggy "0" groupmask then becomes the migrator for GRP1:0 forever. As a result, timers on a fully idle system get ignored. One possible fix would be to define TMIGR_NONE as "0" so that such a race would have no effect. And after all TMIGR_NONE doesn't need to be anything else. However this would leave an uncomfortable state machine where gears happen not to break by chance but are vulnerable to future modifications. Keep TMIGR_NONE as is instead and pre-initialize to "1" the groupmask of any newly created top level. This groupmask is guaranteed to be visible upon fetching the corresponding group for the 1st time: _ By the upcoming CPU thanks to CPU hotplug synchronization between the control CPU (BP) and the booting one (AP). _ By the control CPU since the groupmask and parent pointers are initialized locally. _ By all CPUs belonging to the same group than the control CPU because they must wait for it to ever become idle before needing to walk to the new top. The cmpcxhg() on ->migr_state then makes sure its groupmask is visible. With this pre-initialization, it is guaranteed that if a future top level is linked to an old one, it is walked through with a valid groupmask. Fixes: 10a0e6f3d3db ("timers/migration: Move hierarchy setup into cpuhotplug prepare callback") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20250114231507.21672-2-frederic@kernel.org
2025-01-16genirq/generic_chip: Export irq_gc_mask_disable_and_ack_set()Dr. David Alan Gilbert
The recent conversion of brcmstb_l2_mask_and_ack() to irq_gc_mask_disable_and_ack_set() missed that the driver can be built as a module, but the generic function is not exported. Add the missing export. [ tglx: Converted it to a fix ] Fixes: dd1f17a9faf5 ("irqchip/irq-brcmstb-l2: Replace brcmstb_l2_mask_and_ack() by generic function") Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250116005920.626822-1-linux@treblig.org
2025-01-16timers: Optimize get_timer_[this_]cpu_base()Zhongqiu Han
If a timer is deferrable and NO_HZ_COMMON is enabled, get_timer_cpu_base() and get_timer_this_cpu_base() invoke per_cpu_ptr() and this_cpu_ptr() twice. While this seems to be cheap, get_timer_cpu_base() can be called in a loop in lock_timer_base(). Optimize the functions by updating the base index for deferrable timers and retrieving the actual base pointer once. In both cases the resulting assembly code of those helpers becomes smaller, which results in a ~30% execution time reduction for a lock_timer_base() micro benchmark. Signed-off-by: Zhongqiu Han <quic_zhonhan@quicinc.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20241231150115.1978342-1-quic_zhonhan@quicinc.com
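For illustration only: a user-space C sketch of the optimization's shape with made-up names, selecting the base index first and then doing a single lookup instead of one per candidate base.

  #define EXAMPLE_NR_CPUS    8
  #define EXAMPLE_DEFERRABLE 0x1u

  enum { BASE_STD, BASE_DEF, NR_BASES };

  struct timer_base { int dummy; };

  static struct timer_base bases[EXAMPLE_NR_CPUS][NR_BASES];

  static struct timer_base *get_timer_cpu_base(unsigned int tflags, unsigned int cpu)
  {
          unsigned int index = BASE_STD;

          if (tflags & EXAMPLE_DEFERRABLE)        /* deferrable timers get their own base */
                  index = BASE_DEF;

          return &bases[cpu][index];              /* one lookup instead of two */
  }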
2025-01-15bpf: Send signals asynchronously if !preemptiblePuranjay Mohan
BPF programs can execute in all kinds of contexts and when a program running in a non-preemptible context uses the bpf_send_signal() kfunc, it will cause issues because this kfunc can sleep. Change `irqs_disabled()` to `!preemptible()`. Reported-by: syzbot+97da3d7e0112d59971de@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67486b09.050a0220.253251.0084.GAE@google.com/ Fixes: 1bc7896e9ef4 ("bpf: Fix deadlock with rq_lock in bpf_send_signal()") Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250115103647.38487-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-15genirq/timings: Add kernel-doc for a function parameterRandy Dunlap
Add the description for @now to eliminate a kernel-doc warning. timings.c:537: warning: Function parameter or struct member 'now' not described in 'irq_timings_next_event' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250111062954.910657-1-rdunlap@infradead.org
2025-01-15genirq: Remove IRQ_MOVE_PCNTXT and related codeThomas Gleixner
Now that x86 is converted over to use the IRQCHIP_MOVE_DEFERRED flags, remove IRQ*_MOVE_PCNTXT and related code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241210103335.626707225@linutronix.de
2025-01-15timekeeping: Remove unused ktime_get_fast_timestamps()Dr. David Alan Gilbert
ktime_get_fast_timestamps() was added in 2020 by commit e2d977c9f1ab ("timekeeping: Provide multi-timestamp accessor to NMI safe timekeeper") but has remained unused. Remove it. [ tglx: Fold the inline as David suggested in the submission ] Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250112160132.450209-1-linux@treblig.org
2025-01-15timer/migration: Fix kernel-doc warnings for union tmigr_stateRandy Dunlap
Use the correct kernel-doc notation for nested structs/unions to eliminate warnings: timer_migration.h:119: warning: Incorrect use of kernel-doc format: * struct - split state of tmigr_group timer_migration.h:134: warning: Function parameter or struct member 'active' not described in 'tmigr_state' timer_migration.h:134: warning: Function parameter or struct member 'migrator' not described in 'tmigr_state' timer_migration.h:134: warning: Function parameter or struct member 'seq' not described in 'tmigr_state' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250111063156.910903-1-rdunlap@infradead.org
2025-01-15tick/broadcast: Add kernel-doc for function parametersRandy Dunlap
Add kernel-doc comments for two parameters to eliminate kernel-doc warnings: tick-broadcast.c:1026: warning: Function parameter or struct member 'bc' not described in 'tick_broadcast_setup_oneshot' tick-broadcast.c:1026: warning: Function parameter or struct member 'from_periodic' not described in 'tick_broadcast_setup_oneshot' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250111063148.910887-1-rdunlap@infradead.org
2025-01-15hrtimers: Update the return type of enqueue_hrtimer()Richard Clark
The return type should be 'bool' instead of 'int' according to the calling context in the kernel, and its internal implementation, i.e. : return timerqueue_add(); which is a bool-return function. [ tglx: Adjust function arguments ] Signed-off-by: Richard Clark <richard.xnu.clark@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/Z2ppT7me13dtxm1a@MBC02GN1V4Q05P
2025-01-15clocksource/wdtest: Print time values for short udelay(1)Paul E. McKenney
When a pair of clocksource reads separated by a udelay(1) claim less than a full microsecond of elapsed time, print the measured delay as part of the splat. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/717a2ddf-a80f-490b-aa3a-4e4b74fa56ca@paulmck-laptop
2025-01-15posix-timers: Fix typo in __lock_timer()Zhu Jun
The word 'accross' is wrong, so fix it. Signed-off-by: Zhu Jun <zhujun2@cmss.chinamobile.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241204080907.11989-1-zhujun2@cmss.chinamobile.com
2025-01-15signal/posixtimers: Handle ignore/blocked sequences correctlyThomas Gleixner
syzbot triggered the warning in posixtimer_send_sigqueue(), which warns about a non-ignored signal being already queued on the ignored list. The warning is actually bogus, as the following sequence causes this: signal($SIG, SIGIGN); timer_settime(...); // arm periodic timer timer fires, signal is ignored and queued on ignored list sigprocmask(SIG_BLOCK, ...); // block the signal timer_settime(...); // re-arm periodic timer timer fires, signal is not ignored because it is blocked ---> Warning triggers as signal is on the ignored list Ideally timer_settime() could remove the signal, but that's racy and incomplete vs. other scenarios and requires a full reevaluation of the pending signal list. Instead of adding more complexity, handle it gracefully by removing the warning and requeueing the signal to the pending list. That's correct versus: 1) sig[timed]wait() as that does not check for SIGIGN and only relies on dequeue_signal() -> posixtimers_deliver_signal() to check whether the pending signal is still valid. 2) Unblocking of the signal. - If the unblocking happens before SIGIGN is replaced by a signal handler, then the timer is rearmed in dequeue_signal(), but get_signal() will ignore it. The next timer expiry will move it back to the ignored list. - If SIGIGN was replaced before unblocking, then the signal will be delivered and a subsequent expiry will queue a signal on the pending list again. There is a related scenario to trigger the complementary warning in the signal ignored path, which does not expect the signal to be on the pending list when it is ignored. That can be triggered even before the above change via: task1 task2 signal($SIG, SIGIGN); sigprocmask(SIG_BLOCK, ...); timer_create(); // Signal target is task2 timer_settime(...); // arm periodic timer timer fires, signal is not ignored because it is blocked and queued on the pending list of task2 syscall() // Sets the pending flag sigprocmask(SIG_UNBLOCK, ...); -> preemption, task2 cannot dequeue the signal timer_settime(...); // re-arm periodic timer timer fires, signal is ignored ---> Warning triggers as signal is on task2's pending list and the thread group is not exiting Consequently, remove that warning too and just keep the signal on the pending list. The following attempt to deliver the signal on return to user space of task2 will ignore the signal and a subsequent expiry will bring it back to the ignored list, if it did not get blocked or un-ignored before that. Fixes: df7a996b4dab ("signal: Queue ignored posixtimers on ignore list") Reported-by: syzbot+3c2e3cc60665d71de2f7@syzkaller.appspotmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/87ikqhcnjn.ffs@tglx
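For illustration only: a self-contained POSIX C sketch of the triggering sequence described above, with SIGRTMIN standing in for $SIG (link with -lrt on older glibc). It merely exercises the ignore-then-block transition; whether the removed warning fired depended on timing.

  #include <signal.h>
  #include <time.h>
  #include <unistd.h>

  int main(void)
  {
          timer_t tid;
          struct sigevent sev = { .sigev_notify = SIGEV_SIGNAL, .sigev_signo = SIGRTMIN };
          struct itimerspec its = {
                  .it_value    = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 },
                  .it_interval = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 },
          };
          sigset_t set;

          signal(SIGRTMIN, SIG_IGN);              /* signal($SIG, SIGIGN) */
          timer_create(CLOCK_MONOTONIC, &sev, &tid);
          timer_settime(tid, 0, &its, NULL);      /* arm periodic timer */
          usleep(50 * 1000);                      /* let it fire while ignored */

          sigemptyset(&set);
          sigaddset(&set, SIGRTMIN);
          sigprocmask(SIG_BLOCK, &set, NULL);     /* block the signal */
          timer_settime(tid, 0, &its, NULL);      /* re-arm periodic timer */
          usleep(50 * 1000);                      /* fires while blocked */

          timer_delete(tid);
          return 0;
  }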
2025-01-15genirq: Provide IRQCHIP_MOVE_DEFERREDThomas Gleixner
The logic of GENERIC_PENDING_IRQ is backwards for historical reasons. Most interrupt controllers allow to move the interrupt from arbitrary contexts. If GENERIC_PENDING_IRQ is enabled by an architecture to support a chip, which requires the affinity change to happen in interrupt context, all other chips have to be marked with IRQF_MOVE_PCNTXT. That's tedious and there is no real good reason for the extra flags in the irq descriptor and the irq data status fields. In fact the decision whether interrupts can be moved in arbitrary context or not is a property of the interrupt chip. To simplify adoption for RISC-V provide a new mechanism which is enabled via a config switch and allows to add a flag to irq_chip::flags to request that interrupt affinity changes are deferred. Setting the top level chip of an interrupt evaluates the flag and maps it into the existing logic. The config switch and the various PCNTXT flags are temporary until x86 is converted over to this scheme. This intermediate step also allows trivial backporting of the mechanism to plug the affinity change race of various RISC-V interrupt controllers. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241210103335.500314436@linutronix.de
2025-01-15genirq: Remove handle_enforce_irqctx() wrapperThomas Gleixner
Now that it is unconditionally available, remove the wrapper. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241210101811.561078243@linutronix.de
2025-01-15genirq: Make handle_enforce_irqctx() unconditionally availableThomas Gleixner
Commit 1b57d91b969c ("irqchip/gic-v2, v3: Prevent SW resends entirely") set the flag which enforces interrupt handling in interrupt context and prevents software based resends for ARM GIC v2/v3. But it missed that the helper function which checks the flag was hidden behind CONFIG_GENERIC_PENDING_IRQ, which is not set by ARM[64]. Make the helper unconditionally available so that the enforcement actually works. Fixes: 1b57d91b969c ("irqchip/gic-v2, v3: Prevent SW resends entirely") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241210101811.497716609@linutronix.de
2025-01-15cgroup/dmem: Fix parameters documentationMaxime Ripard
During the dmem cgroup development, the parameters of dmem_cgroup_state_evict_valuable() and dmem_cgroup_try_charge() were changed, but the documentation wasn't adjusted accordingly. This results in a documentation build warning. Adjust the documentation to reflect what the final function parameters are. Fixes: b168ed458dde ("kernel/cgroup: Add "dmem" memory accounting cgroup") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/r/20250113160334.1f09f881@canb.auug.org.au/ Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Simona Vetter <simona.vetter@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20250113092608.1349287-2-mripard@kernel.org Signed-off-by: Maxime Ripard <mripard@kernel.org>
2025-01-15kernel/cgroup: Remove the unused variable climitJiapeng Chong
Variable climit is not effectively used, so delete it. kernel/cgroup/dmem.c:302:23: warning: variable ‘climit’ set but not used. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=13512 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/20250114062804.5092-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Maxime Ripard <mripard@kernel.org>
2025-01-14PM: sleep: Allow configuring the DPM watchdog to warn earlier than panicDouglas Anderson
Allow configuring the DPM watchdog to warn about slow suspend/resume functions without causing a system panic(). This allows you to set the DPM_WATCHDOG_WARNING_TIMEOUT to something like 5 or 10 seconds to get warnings about slow suspend/resume functions that eventually succeed. Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Tomasz Figa <tfiga@chromium.org> Link: https://patch.msgid.link/20250109125957.v2.1.I4554f931b8da97948f308ecc651b124338ee9603@changeid [ rjw: Subject edit ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-01-14PM: sleep: convert comment from kernel-doc to plain commentRandy Dunlap
Modify a non-kernel-doc comment to begin with /* instead of /** so that it does not cause a kernel-doc warning. power.h:114: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst * Auxiliary structure used for reading the snapshot image data and power.h:114: warning: missing initial short description on line: * Auxiliary structure used for reading the snapshot image data and Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Pavel Machek <pavel@ucw.cz> Link: https://patch.msgid.link/20250111063107.910825-1-rdunlap@infradead.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-01-14tracing: Print lazy preemption modelShrikanth Hegde
Print lazy preemption model in ftrace header when latency-format=1. # cat /sys/kernel/debug/sched/preempt none voluntary full (lazy) Without patch: latency: 0 us, #232946/232946, CPU#40 | (M:unknown VP:0, KP:0, SP:0 HP:0 #P:80) ^^^^^^^ With Patch: latency: 0 us, #1897938/25566788, CPU#16 | (M:lazy VP:0, KP:0, SP:0 HP:0 #P:80) ^^^^ Now that lazy preemption is part of the kernel, make sure the tracing infrastructure reflects that. Link: https://lore.kernel.org/20250103093647.575919-1-sshegde@linux.ibm.com Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-14tracing: Fix irqsoff and wakeup latency tracers when using function graphSteven Rostedt
The function graph tracer has become generic so that kretprobes and BPF can use it along with function graph tracing itself. Some of the infrastructure was specific for function graph tracing such as recording the calltime and return time of the functions. Calling the clock code on a high volume function does add overhead. The calculation of the calltime was removed from the generic code and placed into the function graph tracer itself so that the other users did not incur this overhead as they did not need that timestamp. The calltime field was still kept in the generic return entry structure and the function graph return entry callback filled it as that structure was passed to other code. But this broke both irqsoff and wakeup latency tracer as they still depended on the trace structure containing the calltime when the option display-graph is set as it used some of those same functions that the function graph tracer used. But now the calltime was not set and was just zero. This caused the calculation of the function time to be the absolute value of the return timestamp and not the length of the function. # cd /sys/kernel/tracing # echo 1 > options/display-graph # echo irqsoff > current_tracer The tracers went from: # REL TIME CPU TASK/PID |||| DURATION FUNCTION CALLS # | | | | |||| | | | | | | 0 us | 4) <idle>-0 | d..1. | 0.000 us | irqentry_enter(); 3 us | 4) <idle>-0 | d..2. | | irq_enter_rcu() { 4 us | 4) <idle>-0 | d..2. | 0.431 us | preempt_count_add(); 5 us | 4) <idle>-0 | d.h2. | | tick_irq_enter() { 5 us | 4) <idle>-0 | d.h2. | 0.433 us | tick_check_oneshot_broadcast_this_cpu(); 6 us | 4) <idle>-0 | d.h2. | 2.426 us | ktime_get(); 9 us | 4) <idle>-0 | d.h2. | | tick_nohz_stop_idle() { 10 us | 4) <idle>-0 | d.h2. | 0.398 us | nr_iowait_cpu(); 11 us | 4) <idle>-0 | d.h1. | 1.903 us | } 11 us | 4) <idle>-0 | d.h2. | | tick_do_update_jiffies64() { 12 us | 4) <idle>-0 | d.h2. | | _raw_spin_lock() { 12 us | 4) <idle>-0 | d.h2. | 0.360 us | preempt_count_add(); 13 us | 4) <idle>-0 | d.h3. | 0.354 us | do_raw_spin_lock(); 14 us | 4) <idle>-0 | d.h2. | 2.207 us | } 15 us | 4) <idle>-0 | d.h3. | 0.428 us | calc_global_load(); 16 us | 4) <idle>-0 | d.h3. | | _raw_spin_unlock() { 16 us | 4) <idle>-0 | d.h3. | 0.380 us | do_raw_spin_unlock(); 17 us | 4) <idle>-0 | d.h3. | 0.334 us | preempt_count_sub(); 18 us | 4) <idle>-0 | d.h1. | 1.768 us | } 18 us | 4) <idle>-0 | d.h2. | | update_wall_time() { [..] To: # REL TIME CPU TASK/PID |||| DURATION FUNCTION CALLS # | | | | |||| | | | | | | 0 us | 5) <idle>-0 | d.s2. | 0.000 us | _raw_spin_lock_irqsave(); 0 us | 5) <idle>-0 | d.s3. | 312159583 us | preempt_count_add(); 2 us | 5) <idle>-0 | d.s4. | 312159585 us | do_raw_spin_lock(); 3 us | 5) <idle>-0 | d.s4. | | _raw_spin_unlock() { 3 us | 5) <idle>-0 | d.s4. | 312159586 us | do_raw_spin_unlock(); 4 us | 5) <idle>-0 | d.s4. | 312159587 us | preempt_count_sub(); 4 us | 5) <idle>-0 | d.s2. | 312159587 us | } 5 us | 5) <idle>-0 | d.s3. | | _raw_spin_lock() { 5 us | 5) <idle>-0 | d.s3. | 312159588 us | preempt_count_add(); 6 us | 5) <idle>-0 | d.s4. | 312159589 us | do_raw_spin_lock(); 7 us | 5) <idle>-0 | d.s3. | 312159590 us | } 8 us | 5) <idle>-0 | d.s4. | 312159591 us | calc_wheel_index(); 9 us | 5) <idle>-0 | d.s4. | | enqueue_timer() { 9 us | 5) <idle>-0 | d.s4. | | wake_up_nohz_cpu() { 11 us | 5) <idle>-0 | d.s4. | | native_smp_send_reschedule() { 11 us | 5) <idle>-0 | d.s4. | 312171987 us | default_send_IPI_single_phys(); 12408 us | 5) <idle>-0 | d.s3. 
| 312171990 us | } 12408 us | 5) <idle>-0 | d.s3. | 312171991 us | } 12409 us | 5) <idle>-0 | d.s3. | 312171991 us | } Where the calculation of the time for each function was the return time minus zero and not the time of when the function returned. Have these tracers also save the calltime in the fgraph data section and retrieve it again on the return to get the correct timings again. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/20250113183124.61767419@gandalf.local.home Fixes: f1f36e22bee9 ("ftrace: Have calltime be saved in the fgraph storage") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-14kexec_core: Add and update comments regarding the KEXEC_JUMP flowRafael J. Wysocki
The KEXEC_JUMP flow is analogous to hibernation flows occurring before and after creating an image and before and after jumping from the restore kernel to the image one, which is why it uses the same device callbacks as those hibernation flows. Add comments explaining that to the code in question and update an existing comment in it which appears a bit out of context. No functional changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250109140757.2841269-8-dwmw2@infradead.org
2025-01-13mm: convert mm_lock_seq to a proper seqcountSuren Baghdasaryan
Convert mm_lock_seq to be seqcount_t and change all mmap_write_lock variants to increment it, in-line with the usual seqcount usage pattern. This lets us check whether the mmap_lock is write-locked by checking mm_lock_seq.sequence counter (odd=locked, even=unlocked). This will be used when implementing mmap_lock speculation functions. As a result vm_lock_seq is also change to be unsigned to match the type of mm_lock_seq.sequence. Link: https://lkml.kernel.org/r/20241122174416.1367052-2-surenb@google.com Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
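For illustration only: a plain-C sketch of the odd/even convention described above (not the kernel's seqcount_t API). The write-lock path bumps the sequence once on acquire and once on release, so "write-locked" is just a parity test on the counter.

  #include <stdatomic.h>
  #include <stdbool.h>

  struct mm_like {
          atomic_uint mm_lock_seq;
  };

  static void mmap_write_lock_acquired(struct mm_like *mm)
  {
          atomic_fetch_add_explicit(&mm->mm_lock_seq, 1, memory_order_acquire); /* even -> odd */
  }

  static void mmap_write_lock_released(struct mm_like *mm)
  {
          atomic_fetch_add_explicit(&mm->mm_lock_seq, 1, memory_order_release); /* odd -> even */
  }

  static bool mmap_is_write_locked(struct mm_like *mm)
  {
          return atomic_load_explicit(&mm->mm_lock_seq, memory_order_relaxed) & 1;
  }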
2025-01-13lazy tlb: fix hotplug exit race with MMU_LAZY_TLB_SHOOTDOWNNicholas Piggin
CPU unplug first calls __cpu_disable(), and that's where powerpc calls cleanup_cpu_mmu_context(), which clears this CPU from mm_cpumask() of all mms in the system. However this CPU may still be using a lazy tlb mm, and its mm_cpumask bit will be cleared from it. The CPU does not switch away from the lazy tlb mm until arch_cpu_idle_dead() calls idle_task_exit(). If that user mm exits in this window, it will not be subject to the lazy tlb mm shootdown and may be freed while in use as a lazy mm by the CPU that is being unplugged. cleanup_cpu_mmu_context() could be moved later, but it looks better to move the lazy tlb mm switching earlier. The problem with doing the lazy mm switching in idle_task_exit() is explained in commit bf2c59fce4074 ("sched/core: Fix illegal RCU from offline CPUs"), which added a wart to switch away from the mm but leave it set in active_mm to be cleaned up later. So instead, switch away from the lazy tlb mm at sched_cpu_wait_empty(), which is the last hotplug state before teardown (CPUHP_AP_SCHED_WAIT_EMPTY). This CPU will never switch to a user thread from this point, so it has no chance to pick up a new lazy tlb mm. This removes the lazy tlb mm handling wart in CPU unplug. With this, idle_task_exit() is not needed anymore and can be cleaned up. This leaves the prototype alone, to be cleaned after this change. herton: took the suggestions from https://lore.kernel.org/all/87jzvyprsw.ffs@tglx/ and made adjustments on the initial patch proposed by Nicholas. Link: https://lkml.kernel.org/r/20230524060455.147699-1-npiggin@gmail.com Link: https://lore.kernel.org/all/20230525205253.E2FAEC433EF@smtp.kernel.org/ Link: https://lkml.kernel.org/r/20241104142318.3295663-1-herton@redhat.com Fixes: 2655421ae69f ("lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Herton R. Krzesinski <herton@redhat.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13kasan: make kasan_record_aux_stack_noalloc() the default behaviourPeter Zijlstra
kasan_record_aux_stack_noalloc() was introduced to record a stack trace without allocating memory in the process. It has been added to callers which were invoked while a raw_spinlock_t was held. More and more callers were identified and changed over time. Is it a good thing to have this while functions try their best to do a locklessly setup? The only downside of having kasan_record_aux_stack() not allocate any memory is that we end up without a stacktrace if stackdepot runs out of memory and at the same stacktrace was not recorded before To quote Marco Elver from https://lore.kernel.org/all/CANpmjNPmQYJ7pv1N3cuU8cP18u7PP_uoZD8YxwZd4jtbof9nVQ@mail.gmail.com/ | I'd be in favor, it simplifies things. And stack depot should be | able to replenish its pool sufficiently in the "non-aux" cases | i.e. regular allocations. Worst case we fail to record some | aux stacks, but I think that's only really bad if there's a bug | around one of these allocations. In general the probabilities | of this being a regression are extremely small [...] Make the kasan_record_aux_stack_noalloc() behaviour default as kasan_record_aux_stack(). [bigeasy@linutronix.de: dressed the diff as patch] Link: https://lkml.kernel.org/r/20241122155451.Mb2pmeyJ@linutronix.de Fixes: 7cb3007ce2da ("kasan: generic: introduce kasan_record_aux_stack_noalloc()") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reported-by: syzbot+39f85d612b7c20d8db48@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67275485.050a0220.3c8d68.0a37.GAE@google.com Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Waiman Long <longman@redhat.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Ben Segall <bsegall@google.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: <kasan-dev@googlegroups.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: syzkaller-bugs@googlegroups.com Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-13ring-buffer: Make reading page consistent with the code logicJeongjun Park
In the loop of __rb_map_vma(), the 's' variable is calculated from the same logic as nr_pages, and they both come from nr_subbufs. But the relationship is not obvious and there's a WARN_ON_ONCE() around the 's' variable to make sure it never becomes equal to nr_subbufs within the loop. If that happens, then the code is buggy and needs to be fixed. The 'page' variable is calculated from cpu_buffer->subbuf_ids[s] which is an array of 'nr_subbufs' entries. If the code becomes buggy and 's' becomes equal to or greater than 'nr_subbufs' then this will be an out-of-bounds hit before the WARN_ON() is triggered and the code exits safely. Make the 'page' initialization consistent with the code logic and assign it after the out of bounds check. Link: https://lore.kernel.org/20250110162612.13983-1-aha310510@gmail.com Signed-off-by: Jeongjun Park <aha310510@gmail.com> [ sdr: rewrote change log ] Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
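For illustration only: a minimal C sketch of the ordering the patch enforces, where the index is bounds-checked first and the array is dereferenced only afterwards, so a buggy index cannot cause an out-of-bounds read before the warning fires.

  #include <stdio.h>
  #include <stddef.h>

  static void *pick_subbuf(void **subbuf_ids, size_t nr_subbufs, size_t s)
  {
          if (s >= nr_subbufs) {
                  fprintf(stderr, "subbuf index %zu out of range\n", s); /* WARN_ON_ONCE() stand-in */
                  return NULL;
          }
          return subbuf_ids[s];   /* safe: only reached after the check */
  }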
2025-01-13ring-buffer: Check for empty ring-buffer with rb_num_of_entries()Vincent Donnefort
Currently there are two ways of identifying an empty ring-buffer: one relying on the current status of the commit / reader page (rb_per_cpu_empty()) and the other on the write and read counters (rb_num_of_entries(), used in rb_get_reader_page()). Check for an empty ring-buffer with rb_num_of_entries(). This intends to ease the later introduction of ring-buffer writers which are out of the kernel's control and for which the only information available is through the meta-page counters. Link: https://lore.kernel.org/20250108114536.627715-2-vdonnefort@google.com Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
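For illustration only: a plain-C sketch of the counter-based emptiness test, with illustrative field names. The number of readable entries is "written minus read", so the buffer is empty exactly when that difference is zero.

  #include <stdbool.h>

  struct rb_counters {
          unsigned long entries_written;
          unsigned long entries_read;
  };

  static unsigned long rb_num_of_entries(const struct rb_counters *c)
  {
          return c->entries_written - c->entries_read;
  }

  static bool rb_empty(const struct rb_counters *c)
  {
          return rb_num_of_entries(c) == 0;
  }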