path: root/kernel
2025-01-21  sched/fair: Fix inaccurate h_nr_runnable accounting with delayed dequeue  (K Prateek Nayak)
set_delayed() adjusts cfs_rq->h_nr_runnable for the hierarchy when an entity is delayed, irrespective of whether the entity corresponds to a task or a cfs_rq. Consider the following scenario:

            root
           /    \
          A      B      (*) delayed since B is no longer eligible on root
          |      |
        Task0  Task1    <--- dequeue_task_fair() - task blocks

When Task1 blocks (dequeue_entity() for the task's se returns true), dequeue_entities() will continue adjusting cfs_rq->h_nr_* for the hierarchy of Task1. However, when the sched_entity corresponding to cfs_rq B is delayed, set_delayed() will adjust the h_nr_runnable for the hierarchy too, leading to both dequeue_entity() and set_delayed() decrementing h_nr_runnable for the dequeue of the same task.

A SCHED_WARN_ON() to inspect h_nr_runnable after its update in dequeue_entities(), like below:

    cfs_rq->h_nr_runnable -= h_nr_runnable;
    SCHED_WARN_ON(((int) cfs_rq->h_nr_runnable) < 0);

is consistently tripped when running wakeup-intensive workloads like hackbench in a cgroup.

This error is self-correcting since cfs_rq are per-cpu and cannot migrate. The entity is either picked for full dequeue or is requeued when a task wakes up below it. Both those paths call clear_delayed(), which again increments h_nr_runnable of the hierarchy without considering whether the entity corresponds to a task or not. h_nr_runnable will eventually reflect the correct value; however, in the interim, the incorrect values can still influence PELT calculation, which uses se->runnable_weight or cfs_rq->h_nr_runnable.

Since only delayed tasks take the early return path in dequeue_entities() and enqueue_task_fair(), adjust h_nr_runnable in {set,clear}_delayed() only when a task is delayed, as this path skips the h_nr_* update loops and returns early. For entities corresponding to a cfs_rq, the h_nr_* update loop in the caller will do the right thing.

Fixes: 76f2f783294d ("sched/eevdf: More PELT vs DELAYED_DEQUEUE")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
Link: https://lkml.kernel.org/r/20250117105852.23908-1-kprateek.nayak@amd.com
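A rough sketch of the shape of the fix described above (illustrative only, not the exact patch): the hierarchy adjustment in set_delayed() bails out early unless the delayed entity is a task, mirroring the early-return path in dequeue_entities().

    static void set_delayed(struct sched_entity *se)
    {
            se->sched_delayed = 1;

            /* Only tasks take the early-return path in dequeue_entities(),
             * so only adjust the hierarchy's h_nr_runnable for tasks. */
            if (!entity_is_task(se))
                    return;

            for_each_sched_entity(se) {
                    struct cfs_rq *cfs_rq = cfs_rq_of(se);

                    cfs_rq->h_nr_runnable--;
                    if (cfs_rq_throttled(cfs_rq))
                            break;
            }
    }

clear_delayed() gets the symmetric treatment, incrementing h_nr_runnable only for task entities.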
2025-01-21  rseq: Fix rseq unregistration regression  (Mathieu Desnoyers)
A logic inversion in rseq_reset_rseq_cpu_node_id() causes the rseq unregistration to fail when rseq_validate_ro_fields() succeeds rather than the opposite. This affects both CONFIG_DEBUG_RSEQ=y and CONFIG_DEBUG_RSEQ=n. Fixes: 7d5265ffcd8b ("rseq: Validate read-only fields under DEBUG_RSEQ config") Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250116205956.836074-1-mathieu.desnoyers@efficios.com
2025-01-20  Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux  (Linus Torvalds)
Pull block updates from Jens Axboe: - NVMe pull requests via Keith: - Target support for PCI-Endpoint transport (Damien) - TCP IO queue spreading fixes (Sagi, Chaitanya) - Target handling for "limited retry" flags (Guixen) - Poll type fix (Yongsoo) - Xarray storage error handling (Keisuke) - Host memory buffer free size fix on error (Francis) - MD pull requests via Song: - Reintroduce md-linear (Yu Kuai) - md-bitmap refactor and fix (Yu Kuai) - Replace kmap_atomic with kmap_local_page (David Reaver) - Quite a few queue freeze and debugfs deadlock fixes Ming introduced lockdep support for this in the 6.13 kernel, and it has (unsurprisingly) uncovered quite a few issues - Use const attributes for IO schedulers - Remove bio ioprio wrappers - Fixes for stacked device atomic write support - Refactor queue affinity helpers, in preparation for better supporting isolated CPUs - Cleanups of loop O_DIRECT handling - Cleanup of BLK_MQ_F_* flags - Add rotational support for null_blk - Various fixes and cleanups * tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits) block: Don't trim an atomic write block: Add common atomic writes enable flag md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add() block: limit disk max sectors to (LLONG_MAX >> 9) block: Change blk_stack_atomic_writes_limits() unit_min check block: Ensure start sector is aligned for stacking atomic writes blk-mq: Move more error handling into blk_mq_submit_bio() block: Reorder the request allocation code in blk_mq_submit_bio() nvme: fix bogus kzalloc() return check in nvme_init_effects_log() md/md-bitmap: move bitmap_{start, end}write to md upper layer md/raid5: implement pers->bitmap_sector() md: add a new callback pers->bitmap_sector() md/md-bitmap: remove the last parameter for bimtap_ops->endwrite() md/md-bitmap: factor behind write counters out from bitmap_{start/end}write() md: Replace deprecated kmap_atomic() with kmap_local_page() md: reintroduce md-linear partitions: ldm: remove the initial kernel-doc notation blk-cgroup: rwstat: fix kernel-doc warnings in header file blk-cgroup: fix kernel-doc warnings in header file nbd: fix partial sending ...
2025-01-20  tracing: Rename update_cache() to update_mod_cache()  (Steven Rostedt)
The static function in trace_events.c called update_cache() is too generic and conflicts with the function defined in arch/openrisc/include/asm/pgtable.h Rename it to update_mod_cache() to make it less generic. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250120172756.4ecfb43f@batman.local.home Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202501210550.Ufrj5CRn-lkp@intel.com/ Fixes: b355247df104e ("tracing: Cache ":mod:" events for modules not loaded yet") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-20  Merge tag 'execve-v6.14-rc1' of ↵  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve updates from Kees Cook: - fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case (Tycho Andersen, Kees Cook) - binfmt_misc: Fix comment typos (Christophe JAILLET) - move empty argv[0] warning closer to actual logic (Nir Lichtman) - remove legacy custom binfmt modules autoloading (Nir Lichtman) - Make sure set_task_comm() always NUL-terminates - binfmt_flat: Fix integer overflow bug on 32 bit systems (Dan Carpenter) - coredump: Do not lock when copying "comm" - MAINTAINERS: add auxvec.h and set myself as maintainer * tag 'execve-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: binfmt_flat: Fix integer overflow bug on 32 bit systems selftests/exec: add a test for execveat()'s comm exec: fix up /proc/pid/comm in the execveat(AT_EMPTY_PATH) case exec: Make sure task->comm is always NUL-terminated exec: remove legacy custom binfmt modules autoloading exec: move warning of null argv to be next to the relevant code fs: binfmt: Fix a typo MAINTAINERS: exec: Mark Kees as maintainer MAINTAINERS: exec: Add auxvec.h UAPI coredump: Do not lock during 'comm' reporting
2025-01-20  Merge branch 'pm-cpufreq'  (Rafael J. Wysocki)
Merge cpufreq updates for 6.14: - Use str_enable_disable()-like helpers in cpufreq (Krzysztof Kozlowski). - Extend the Apple cpufreq driver to support more SoCs (Hector Martin, Nick Chan). - Add new cpufreq driver for Airoha SoCs (Christian Marangi). - Fix using cpufreq-dt as module (Andreas Kemnade). - Minor fixes for Sparc, SCMI, and Qcom cpufreq drivers (Ethan Carter Edwards, Sibi Sankar, Manivannan Sadhasivam). - Fix the maximum supported frequency computation in the ACPI cpufreq driver to avoid relying on unfounded assumptions (Gautham Shenoy). - Fix an amd-pstate driver regression with preferred core rankings not being used (Mario Limonciello). - Fix a precision issue with frequency calculation in the amd-pstate driver (Naresh Solanki). - Add ftrace event to the amd-pstate driver for active mode (Mario Limonciello). - Set default EPP policy on Ryzen processors in amd-pstate (Mario Limonciello). - Clean up the amd-pstate cpufreq driver and optimize it to increase code reuse (Mario Limonciello, Dhananjay Ugwekar). - Use CPPC to get scaling factors between HWP performance levels and frequency in the intel_pstate driver and make it stop using a built -in scaling factor for the Arrow Lake processor (Rafael Wysocki). - Make intel_pstate initialize epp_policy to CPUFREQ_POLICY_UNKNOWN for consistency with CPU offline (Christian Loehle). - Fix superfluous updates caused by need_freq_update in the schedutil cpufreq governor (Sultan Alsawaf). * pm-cpufreq: (40 commits) cpufreq: Use str_enable_disable()-like helpers cpufreq: airoha: Add EN7581 CPUFreq SMCCC driver cpufreq: ACPI: Fix max-frequency computation cpufreq/amd-pstate: Refactor max frequency calculation cpufreq/amd-pstate: Fix prefcore rankings cpufreq: sparc: change kzalloc to kcalloc cpufreq: qcom: Implement clk_ops::determine_rate() for qcom_cpufreq* clocks cpufreq: qcom: Fix qcom_cpufreq_hw_recalc_rate() to query LUT if LMh IRQ is not available cpufreq: apple-soc: Add Apple A7-A8X SoC cpufreq support cpufreq: apple-soc: Set fallback transition latency to APPLE_DVFS_TRANSITION_TIMEOUT cpufreq: apple-soc: Increase cluster switch timeout to 400us cpufreq: apple-soc: Use 32-bit read for status register cpufreq: apple-soc: Allow per-SoC configuration of APPLE_DVFS_CMD_PS1 cpufreq: apple-soc: Drop setting the PS2 field on M2+ dt-bindings: cpufreq: apple,cluster-cpufreq: Add A7-A11, T2 compatibles dt-bindings: cpufreq: Document support for Airoha EN7581 CPUFreq cpufreq: fix using cpufreq-dt as module cpufreq: scmi: Register for limit change notifications cpufreq: schedutil: Fix superfluous updates caused by need_freq_update cpufreq: intel_pstate: Use CPUFREQ_POLICY_UNKNOWN ...
2025-01-20  Merge tag 'kernel-6.14-rc1.pid' of ↵  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull pid_max namespacing update from Christian Brauner:

 "The pid_max sysctl is a global value. For a long time the default value has been 65535 and during the pidfd discussions Linus proposed to bump pid_max by default. Based on this discussion systemd started bumping pid_max to 2^22. So all new systems now run with a very high pid_max limit with some distros having also backported that change.

  The decision to bump pid_max is obviously correct. It just doesn't make a lot of sense nowadays to enforce such a low pid number. There's sufficient tooling available to select specific processes without typing really large pid numbers.

  In any case, there are workloads that have expectations about how large pid numbers they accept. Either for historical reasons or architectural reasons. One concrete example is the 32-bit version of Android's bionic libc which requires pid numbers less than 65536. There are workloads where it is run in a 32-bit container on a 64-bit kernel. If the host has a pid_max value greater than 65535 the libc will abort thread creation because of size assumptions of pthread_mutex_t.

  That's a fairly specific use-case however, in general specific workloads that are moved into containers running on a host with a new kernel and a new systemd can run into issues with large pid_max values. Obviously making assumptions about the size of the allocated pid is suboptimal but we have userspace that does it.

  Of course, giving containers the ability to restrict the number of processes in their respective pid namespace independent of the global limit through pid_max is something desirable in itself and comes in handy in general. Independent of motivating use-cases the existence of pid namespaces makes this also a good semantic extension and there have been prior proposals pushing in a similar direction.

  The trick here is to minimize the risk of regressions which I think is doable. The fact that pid namespaces are hierarchical will help us here.

  What we mostly care about is that when the host sets a low pid_max limit, say (crazy number) 100, no descendant pid namespace can allocate a higher pid number in its namespace. Since pid allocation is hierarchical this can be ensured by checking each pid allocation against the pid namespace's pid_max limit. This means if the allocation in the descendant pid namespace succeeds, the ancestor pid namespace can reject it. If the ancestor pid namespace has a higher limit than the descendant pid namespace the descendant pid namespace will reject the pid allocation. The ancestor pid namespace will obviously not care about this.

  All in all this means pid_max continues to enforce a system wide limit on the number of processes but allows pid namespaces sufficient leeway in handling workloads with assumptions about pid values and allows containers to restrict the number of processes in a pid namespace through the pid_max interface"

* tag 'kernel-6.14-rc1.pid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  tests/pid_namespace: add pid_max tests
  pid: allow pid_max to be set per pid namespace
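A conceptual sketch of the hierarchical check described above (loosely modeled on alloc_pid(); simplified and not the actual patch): a task receives a pid number at every namespace level, and the number allocated at each level is bounded by that namespace's own pid_max, so a strict ancestor limit rejects the allocation even when the child namespace would have allowed it.

    struct pid_namespace *tmp = ns;
    int pid_min = 1;
    int i, nr;

    for (i = ns->level; i >= 0; i--) {
            /* tmp->pid_max is the per-namespace limit introduced by this series */
            nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
                                  READ_ONCE(tmp->pid_max), GFP_ATOMIC);
            if (nr < 0)
                    break;          /* some level, possibly an ancestor, said no */
            tmp = tmp->parent;
    }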
2025-01-20  Merge branches 'pm-sleep', 'pm-cpuidle' and 'pm-em'  (Rafael J. Wysocki)
Merge updates related to system sleep, a cpuidle update and an Energy Model handling code update for 6.14-rc1: - Allow configuring the system suspend-resume (DPM) watchdog to warn earlier than panic (Douglas Anderson). - Implement devm_device_init_wakeup() helper and introduce a device- managed variant of dev_pm_set_wake_irq() (Joe Hattori, Peng Fan). - Remove direct inclusions of 'pm_wakeup.h' which should be only included via 'device.h' (Wolfram Sang). - Clean up two comments in the core system-wide PM code (Rafael Wysocki, Randy Dunlap). - Add Clearwater Forest processor support to the intel_idle cpuidle driver (Artem Bityutskiy). - Move sched domains rebuild function from the schedutil cpufreq governor to the Energy Model handling code (Rafael Wysocki). * pm-sleep: PM: sleep: wakeirq: Introduce device-managed variant of dev_pm_set_wake_irq() PM: sleep: Allow configuring the DPM watchdog to warn earlier than panic PM: sleep: convert comment from kernel-doc to plain comment PM: wakeup: implement devm_device_init_wakeup() helper PM: sleep: sysfs: don't include 'pm_wakeup.h' directly PM: sleep: autosleep: don't include 'pm_wakeup.h' directly PM: sleep: Update stale comment in device_resume() * pm-cpuidle: intel_idle: add Clearwater Forest SoC support * pm-em: PM: EM: Move sched domains rebuild function from schedutil to EM
2025-01-20  Merge tag 'kernel-6.14-rc1.cred' of ↵  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull cred refcount updates from Christian Brauner: "For the v6.13 cycle we switched overlayfs to a variant of override_creds() that doesn't take an extra reference. To this end the {override,revert}_creds_light() helpers were introduced. This generalizes the idea behind {override,revert}_creds_light() to the {override,revert}_creds() helpers. Afterwards overriding and reverting credentials is reference count free unless the caller explicitly takes a reference. All callers have been appropriately ported" * tag 'kernel-6.14-rc1.cred' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits) cred: fold get_new_cred_many() into get_cred_many() cred: remove unused get_new_cred() nfsd: avoid pointless cred reference count bump cachefiles: avoid pointless cred reference count bump dns_resolver: avoid pointless cred reference count bump trace: avoid pointless cred reference count bump cgroup: avoid pointless cred reference count bump acct: avoid pointless reference count bump io_uring: avoid pointless cred reference count bump smb: avoid pointless cred reference count bump cifs: avoid pointless cred reference count bump cifs: avoid pointless cred reference count bump ovl: avoid pointless cred reference count bump open: avoid pointless cred reference count bump nfsfh: avoid pointless cred reference count bump nfs/nfs4recover: avoid pointless cred reference count bump nfs/nfs4idmap: avoid pointless reference count bump nfs/localio: avoid pointless cred reference count bumps coredump: avoid pointless cred reference count bump binfmt_misc: avoid pointless cred reference count bump ...
2025-01-20  tracing: Fix #if CONFIG_MODULES to #ifdef CONFIG_MODULES  (Steven Rostedt)
A typo was introduced when adding the ":mod:" command that did a "#if CONFIG_MODULES" instead of a "#ifdef CONFIG_MODULES". Fix it. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/20250120125745.4ac90ca6@gandalf.local.home Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202501190121.E2CIJuUj-lkp@intel.com/ Fixes: b355247df104e ("tracing: Cache ":mod:" events for modules not loaded yet") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-20  Merge tag 'vfs-6.14-rc1.pidfs' of ↵  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull pidfs updates from Christian Brauner: - Rework inode number allocation Recently we received a patchset that aims to enable file handle encoding and decoding via name_to_handle_at(2) and open_by_handle_at(2). A crucical step in the patch series is how to go from inode number to struct pid without leaking information into unprivileged contexts. The issue is that in order to find a struct pid the pid number in the initial pid namespace must be encoded into the file handle via name_to_handle_at(2). This can be used by containers using a separate pid namespace to learn what the pid number of a given process in the initial pid namespace is. While this is a weak information leak it could be used in various exploits and in general is an ugly wart in the design. To solve this problem a new way is needed to lookup a struct pid based on the inode number allocated for that struct pid. The other part is to remove the custom inode number allocation on 32bit systems that is also an ugly wart that should go away. Allocate unique identifiers for struct pid by simply incrementing a 64 bit counter and insert each struct pid into the rbtree so it can be looked up to decode file handles avoiding to leak actual pids across pid namespaces in file handles. On both 64 bit and 32 bit the same 64 bit identifier is used to lookup struct pid in the rbtree. On 64 bit the unique identifier for struct pid simply becomes the inode number. Comparing two pidfds continues to be as simple as comparing inode numbers. On 32 bit the 64 bit number assigned to struct pid is split into two 32 bit numbers. The lower 32 bits are used as the inode number and the upper 32 bits are used as the inode generation number. Whenever a wraparound happens on 32 bit the 64 bit number will be incremented by 2 so inode numbering starts at 2 again. When a wraparound happens on 32 bit multiple pidfds with the same inode number are likely to exist. This isn't a problem since before pidfs pidfds used the anonymous inode meaning all pidfds had the same inode number. On 32 bit sserspace can thus reconstruct the 64 bit identifier by retrieving both the inode number and the inode generation number to compare, or use file handles. This gives the same guarantees on both 32 bit and 64 bit. - Implement file handle support This is based on custom export operation methods which allows pidfs to implement permission checking and opening of pidfs file handles cleanly without hacking around in the core file handle code too much. - Support bind-mounts Allow bind-mounting pidfds. Similar to nsfs let's allow bind-mounts for pidfds. This allows pidfds to be safely recovered and checked for process recycling. Instead of checking d_ops for both nsfs and pidfs we could in a follow-up patch add a flag argument to struct dentry_operations that functions similar to file_operations->fop_flags. * tag 'vfs-6.14-rc1.pidfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests: add pidfd bind-mount tests pidfs: allow bind-mounts pidfs: lookup pid through rbtree selftests/pidfd: add pidfs file handle selftests pidfs: check for valid ioctl commands pidfs: implement file handle support exportfs: add permission method fhandle: pull CAP_DAC_READ_SEARCH check into may_decode_fh() exportfs: add open method fhandle: simplify error handling pseudofs: add support for export_ops pidfs: support FS_IOC_GETVERSION pidfs: remove 32bit inode number handling pidfs: rework inode number allocation
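A small sketch of the 32-bit scheme described above (illustrative; the counter name is made up): the low 32 bits of the 64-bit identifier serve as the inode number and the high 32 bits as the inode generation, which userspace can combine again to compare pidfds.

    /* pidfs_id_counter is a hypothetical name for the global 64-bit counter */
    u64 id  = atomic64_inc_return(&pidfs_id_counter);
    u32 ino = lower_32_bits(id);    /* i_ino on 32-bit kernels */
    u32 gen = upper_32_bits(id);    /* i_generation, exposed via FS_IOC_GETVERSION */

On 64-bit kernels the full 64-bit id is used directly as the inode number, so comparing inode numbers remains sufficient there.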
2025-01-20  bpf: Remove 'may_goto 0' instruction in opt_remove_nops()  (Yonghong Song)
Since 'may_goto 0' insns are actually no-op, let us remove them. Otherwise, verifier will generate code like /* r10 - 8 stores the implicit loop count */ r11 = *(u64 *)(r10 -8) if r11 == 0x0 goto pc+2 r11 -= 1 *(u64 *)(r10 -8) = r11 which is the pure overhead. The following code patterns (from the previous commit) are also handled: may_goto 2 may_goto 1 may_goto 0 With this commit, the above three 'may_goto' insns are all eliminated. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250118192029.2124584-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-20  bpf: Allow 'may_goto 0' instruction in verifier  (Yonghong Song)
Commit 011832b97b31 ("bpf: Introduce may_goto instruction") added support for may_goto insn. The 'may_goto 0' insn is disallowed since the insn is equivalent to a nop as both branch will go to the next insn. But it is possible that compiler transformation may generate 'may_goto 0' insn. Emil Tsalapatis from Meta reported such a case which caused verification failure. For example, for the following code, int i, tmp[3]; for (i = 0; i < 3 && can_loop; i++) tmp[i] = 0; ... clang 20 may generate code like may_goto 2; may_goto 1; may_goto 0; r1 = 0; /* tmp[0] = 0; */ r2 = 0; /* tmp[1] = 0; */ r3 = 0; /* tmp[2] = 0; */ Let us permit 'may_goto 0' insn to avoid verification failure for codes like the above. Reported-by: Emil Tsalapatis <etsal@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20250118192024.2124059-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-20  Merge tag 'vfs-6.14-rc1.misc' of ↵  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Support caching symlink lengths in inodes The size is stored in a new union utilizing the same space as i_devices, thus avoiding growing the struct or taking up any more space When utilized it dodges strlen() in vfs_readlink(), giving about 1.5% speed up when issuing readlink on /initrd.img on ext4 - Add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag If a file system supports uncached buffered IO, it may set FOP_DONTCACHE and enable support for RWF_DONTCACHE. If RWF_DONTCACHE is attempted without the file system supporting it, it'll get errored with -EOPNOTSUPP - Enable VBOXGUEST and VBOXSF_FS on ARM64 Now that VirtualBox is able to run as a host on arm64 (e.g. the Apple M3 processors) we can enable VBOXSF_FS (and in turn VBOXGUEST) for this architecture. Tested with various runs of bonnie++ and dbench on an Apple MacBook Pro with the latest Virtualbox 7.1.4 r165100 installed Cleanups: - Delay sysctl_nr_open check in expand_files() - Use kernel-doc includes in fiemap docbook - Use page->private instead of page->index in watch_queue - Use a consume fence in mnt_idmap() as it's heavily used in link_path_walk() - Replace magic number 7 with ARRAY_SIZE() in fc_log - Sort out a stale comment about races between fd alloc and dup2() - Fix return type of do_mount() from long to int - Various cosmetic cleanups for the lockref code Fixes: - Annotate spinning as unlikely() in __read_seqcount_begin The annotation already used to be there, but got lost in commit 52ac39e5db51 ("seqlock: seqcount_t: Implement all read APIs as statement expressions") - Fix proc_handler for sysctl_nr_open - Flush delayed work in delayed fput() - Fix grammar and spelling in propagate_umount() - Fix ESP not readable during coredump In /proc/PID/stat, there is the kstkesp field which is the stack pointer of a thread. While the thread is active, this field reads zero. But during a coredump, it should have a valid value However, at the moment, kstkesp is zero even during coredump - Don't wake up the writer if the pipe is still full - Fix unbalanced user_access_end() in select code" * tag 'vfs-6.14-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (28 commits) gfs2: use lockref_init for qd_lockref erofs: use lockref_init for pcl->lockref dcache: use lockref_init for d_lockref lockref: add a lockref_init helper lockref: drop superfluous externs lockref: use bool for false/true returns lockref: improve the lockref_get_not_zero description lockref: remove lockref_put_not_zero fs: Fix return type of do_mount() from long to int select: Fix unbalanced user_access_end() vbox: Enable VBOXGUEST and VBOXSF_FS on ARM64 pipe_read: don't wake up the writer if the pipe is still full selftests: coredump: Add stackdump test fs/proc: do_task_stat: Fix ESP not readable during coredump fs: add RWF_DONTCACHE iocb and FOP_DONTCACHE file_operations flag fs: sort out a stale comment about races between fd alloc and dup2 fs: Fix grammar and spelling in propagate_umount() fs: fc_log replace magic number 7 with ARRAY_SIZE() fs: use a consume fence in mnt_idmap() file: flush delayed work in delayed fput() ...
2025-01-20  bpf: Cancel the running bpf_timer through kworker for PREEMPT_RT  (Hou Tao)
During the update procedure, when overwrite element in a pre-allocated htab, the freeing of old_element is protected by the bucket lock. The reason why the bucket lock is necessary is that the old_element has already been stashed in htab->extra_elems after alloc_htab_elem() returns. If freeing the old_element after the bucket lock is unlocked, the stashed element may be reused by concurrent update procedure and the freeing of old_element will run concurrently with the reuse of the old_element. However, the invocation of check_and_free_fields() may acquire a spin-lock which violates the lockdep rule because its caller has already held a raw-spin-lock (bucket lock). The following warning will be reported when such race happens: BUG: scheduling while atomic: test_progs/676/0x00000003 3 locks held by test_progs/676: #0: ffffffff864b0240 (rcu_read_lock_trace){....}-{0:0}, at: bpf_prog_test_run_syscall+0x2c0/0x830 #1: ffff88810e961188 (&htab->lockdep_key){....}-{2:2}, at: htab_map_update_elem+0x306/0x1500 #2: ffff8881f4eac1b8 (&base->softirq_expiry_lock){....}-{2:2}, at: hrtimer_cancel_wait_running+0xe9/0x1b0 Modules linked in: bpf_testmod(O) Preemption disabled at: [<ffffffff817837a3>] htab_map_update_elem+0x293/0x1500 CPU: 0 UID: 0 PID: 676 Comm: test_progs Tainted: G ... 6.12.0+ #11 Tainted: [W]=WARN, [O]=OOT_MODULE Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)... Call Trace: <TASK> dump_stack_lvl+0x57/0x70 dump_stack+0x10/0x20 __schedule_bug+0x120/0x170 __schedule+0x300c/0x4800 schedule_rtlock+0x37/0x60 rtlock_slowlock_locked+0x6d9/0x54c0 rt_spin_lock+0x168/0x230 hrtimer_cancel_wait_running+0xe9/0x1b0 hrtimer_cancel+0x24/0x30 bpf_timer_delete_work+0x1d/0x40 bpf_timer_cancel_and_free+0x5e/0x80 bpf_obj_free_fields+0x262/0x4a0 check_and_free_fields+0x1d0/0x280 htab_map_update_elem+0x7fc/0x1500 bpf_prog_9f90bc20768e0cb9_overwrite_cb+0x3f/0x43 bpf_prog_ea601c4649694dbd_overwrite_timer+0x5d/0x7e bpf_prog_test_run_syscall+0x322/0x830 __sys_bpf+0x135d/0x3ca0 __x64_sys_bpf+0x75/0xb0 x64_sys_call+0x1b5/0xa10 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 ... </TASK> It seems feasible to break the reuse and refill of per-cpu extra_elems into two independent parts: reuse the per-cpu extra_elems with bucket lock being held and refill the old_element as per-cpu extra_elems after the bucket lock is unlocked. However, it will make the concurrent overwrite procedures on the same CPU return unexpected -E2BIG error when the map is full. Therefore, the patch fixes the lock problem by breaking the cancelling of bpf_timer into two steps for PREEMPT_RT: 1) use hrtimer_try_to_cancel() and check its return value 2) if the timer is running, use hrtimer_cancel() through a kworker to cancel it again Considering that the current implementation of hrtimer_cancel() will try to acquire a being held softirq_expiry_lock when the current timer is running, these steps above are reasonable. However, it also has downside. When the timer is running, the cancelling of the timer is delayed when releasing the last map uref. The delay is also fixable (e.g., break the cancelling of bpf timer into two parts: one part in locked scope, another one in unlocked scope), it can be revised later if necessary. It is a bit hard to decide the right fix tag. One reason is that the problem depends on PREEMPT_RT which is enabled in v6.12. 
Considering the softirq_expiry_lock lock exists since v5.4 and bpf_timer is introduced in v5.15, the bpf_timer commit is used in the fixes tag and an extra depends-on tag is added to state the dependency on PREEMPT_RT. Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.") Depends-on: v6.12+ with PREEMPT_RT enabled Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Closes: https://lore.kernel.org/bpf/20241106084527.4gPrMnHt@linutronix.de Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org> Link: https://lore.kernel.org/r/20250117101816.2101857-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
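The two-step cancel described above, in rough form (a sketch; the work field and workqueue choice are assumptions, not the exact patch):

    if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
            /* Step 1: non-blocking attempt while the bucket lock is held. */
            if (hrtimer_try_to_cancel(&t->timer) >= 0)
                    return;
            /* Step 2: the callback is running; let a kworker do the blocking
             * hrtimer_cancel() outside of the raw spinlock. */
            queue_work(system_unbound_wq, &t->cb.delete_work);  /* field name assumed */
    } else {
            hrtimer_cancel(&t->timer);
    }

hrtimer_try_to_cancel() returns a negative value only when the callback is currently executing, which is exactly the case that must not spin under a raw spinlock on PREEMPT_RT.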
2025-01-20  bpf: Free element after unlock in __htab_map_lookup_and_delete_elem()  (Hou Tao)
The freeing of special fields in map value may acquire a spin-lock (e.g., the freeing of bpf_timer), however, the lookup_and_delete_elem procedure has already held a raw-spin-lock, which violates the lockdep rule. The running context of __htab_map_lookup_and_delete_elem() has already disabled the migration. Therefore, it is OK to invoke free_htab_elem() after unlocking the bucket lock. Fix the potential problem by freeing element after unlocking bucket lock in __htab_map_lookup_and_delete_elem(). Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250117101816.2101857-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-20  bpf: Bail out early in __htab_map_lookup_and_delete_elem()  (Hou Tao)
Use goto statement to bail out early when the target element is not found, instead of using a large else branch to handle the more likely case. This change doesn't affect functionality and simply make the code cleaner. Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org> Link: https://lore.kernel.org/r/20250117101816.2101857-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-20  bpf: Free special fields after unlock in htab_lru_map_delete_node()  (Hou Tao)
When bpf_timer is used in LRU hash map, calling check_and_free_fields() in htab_lru_map_delete_node() will invoke bpf_timer_cancel_and_free() to free the bpf_timer. If the timer is running on other CPUs, hrtimer_cancel() will invoke hrtimer_cancel_wait_running() to spin on current CPU to wait for the completion of the hrtimer callback. Considering that the deletion has already acquired a raw-spin-lock (bucket lock). To reduce the time holding the bucket lock, move the invocation of check_and_free_fields() out of bucket lock. However, because htab_lru_map_delete_node() is invoked with LRU raw spin lock being held, the freeing of special fields still happens in a locked scope. Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org> Link: https://lore.kernel.org/r/20250117101816.2101857-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-20  Merge branch 'for-6.14-cpu_sync-fixup' into for-linus  (Petr Mladek)
2025-01-19  Merge tag 'timers_urgent_for_v6.13' of ↵  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Borislav Petkov: - Reset hrtimers correctly when a CPU hotplug state traversal happens "half-ways" and leaves hrtimers not (re-)initialized properly - Annotate accesses to a timer group's ignore flag to prevent KCSAN from raising data_race warnings - Make sure timer group initialization is visible to timer tree walkers and avoid a hypothetical race - Fix another race between CPU hotplug and idle entry/exit where timers on a fully idle system are getting ignored - Fix a case where an ignored signal is still being handled which it shouldn't be * tag 'timers_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimers: Handle CPU state correctly on hotplug timers/migration: Annotate accesses to ignore flag timers/migration: Enforce group initialization visibility to tree walkers timers/migration: Fix another race between hotplug and idle entry/exit signal/posixtimers: Handle ignore/blocked sequences correctly
2025-01-19  Merge tag 'sched_urgent_for_v6.13' of ↵  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Borislav Petkov: - Do not adjust the weight of empty group entities and avoid scheduling artifacts - Avoid scheduling lag by computing lag properly and thus address an EEVDF entity placement issue * tag 'sched_urgent_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Fix update_cfs_group() vs DELAY_DEQUEUE sched/fair: Fix EEVDF entity placement bug causing scheduling lag
2025-01-19  padata: avoid UAF for reorder_work  (Chen Ridong)
Although the previous patch can avoid ps and ps UAF for _do_serial, it can not avoid potential UAF issue for reorder_work. This issue can happen just as below: crypto_request crypto_request crypto_del_alg padata_do_serial ... padata_reorder // processes all remaining // requests then breaks while (1) { if (!padata) break; ... } padata_do_serial // new request added list_add // sees the new request queue_work(reorder_work) padata_reorder queue_work_on(squeue->work) ... <kworker context> padata_serial_worker // completes new request, // no more outstanding // requests crypto_del_alg // free pd <kworker context> invoke_padata_reorder // UAF of pd To avoid UAF for 'reorder_work', get 'pd' ref before put 'reorder_work' into the 'serial_wq' and put 'pd' ref until the 'serial_wq' finish. Fixes: bbefa1dd6a6d ("crypto: pcrypt - Avoid deadlock by using per-instance padata queues") Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
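Roughly, the fix described above takes the following shape (a sketch; it uses the padata_get_pd()/padata_put_pd() helpers added by the "padata: add pd get/put refcnt helper" entry below):

    /* padata_reorder(), before handing off to the serial workqueue: */
    padata_get_pd(pd);                              /* pin pd for the queued work */
    if (!queue_work(pinst->serial_wq, &pd->reorder_work))
            padata_put_pd(pd);                      /* work was already pending */

    /* invoke_padata_reorder(), running from serial_wq: */
    padata_reorder(pd);
    padata_put_pd(pd);                              /* drop the reference taken above */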
2025-01-19  padata: fix UAF in padata_reorder  (Chen Ridong)
A bug was found when run ltp test: BUG: KASAN: slab-use-after-free in padata_find_next+0x29/0x1a0 Read of size 4 at addr ffff88bbfe003524 by task kworker/u113:2/3039206 CPU: 0 PID: 3039206 Comm: kworker/u113:2 Kdump: loaded Not tainted 6.6.0+ Workqueue: pdecrypt_parallel padata_parallel_worker Call Trace: <TASK> dump_stack_lvl+0x32/0x50 print_address_description.constprop.0+0x6b/0x3d0 print_report+0xdd/0x2c0 kasan_report+0xa5/0xd0 padata_find_next+0x29/0x1a0 padata_reorder+0x131/0x220 padata_parallel_worker+0x3d/0xc0 process_one_work+0x2ec/0x5a0 If 'mdelay(10)' is added before calling 'padata_find_next' in the 'padata_reorder' function, this issue could be reproduced easily with ltp test (pcrypt_aead01). This can be explained as bellow: pcrypt_aead_encrypt ... padata_do_parallel refcount_inc(&pd->refcnt); // add refcnt ... padata_do_serial padata_reorder // pd while (1) { padata_find_next(pd, true); // using pd queue_work_on ... padata_serial_worker crypto_del_alg padata_put_pd_cnt // sub refcnt padata_free_shell padata_put_pd(ps->pd); // pd is freed // loop again, but pd is freed // call padata_find_next, UAF } In the padata_reorder function, when it loops in 'while', if the alg is deleted, the refcnt may be decreased to 0 before entering 'padata_find_next', which leads to UAF. As mentioned in [1], do_serial is supposed to be called with BHs disabled and always happen under RCU protection, to address this issue, add synchronize_rcu() in 'padata_free_shell' wait for all _do_serial calls to finish. [1] https://lore.kernel.org/all/20221028160401.cccypv4euxikusiq@parnassus.localdomain/ [2] https://lore.kernel.org/linux-kernel/jfjz5d7zwbytztackem7ibzalm5lnxldi2eofeiczqmqs2m7o6@fq426cwnjtkm/ Fixes: b128a3040935 ("padata: allocate workqueue internally") Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Qu Zicheng <quzicheng@huawei.com> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-01-19  padata: add pd get/put refcnt helper  (Chen Ridong)
Add helpers to get/put the pd refcount to make the code more concise.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
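A plausible shape for these helpers (a sketch based on the refcnt usage visible in the two fixes above, not necessarily the exact patch):

    static void padata_put_pd_cnt(struct parallel_data *pd, int cnt)
    {
            if (refcount_sub_and_test(cnt, &pd->refcnt))
                    padata_free_pd(pd);
    }

    static void padata_put_pd(struct parallel_data *pd)
    {
            padata_put_pd_cnt(pd, 1);
    }

    static void padata_get_pd(struct parallel_data *pd)
    {
            refcount_inc(&pd->refcnt);
    }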
2025-01-16  ftrace: Implement :mod: cache filtering on kernel command line  (Steven Rostedt)
Module functions can be set to set_ftrace_filter before the module is loaded.

  # echo :mod:snd_hda_intel > set_ftrace_filter

This will enable all the functions for the module snd_hda_intel. If that module is not loaded, it is "cached" in the trace array so that when the module is loaded, its functions will be traced.

But this is not implemented in the kernel command line. That's because the kernel command line filtering is added very early in boot up, as it needs to be done before boot time function tracing can start, which is also available very early in boot up. The code used by the "set_ftrace_filter" file can not be used that early as it depends on some other initialization to occur first. But some of the functions can.

Implement the ":mod:" feature of "set_ftrace_filter" in the kernel command line parsing. Now function tracing on just a single module that is loaded at boot up can be done.

Adding:

  ftrace=function ftrace_filter=:mod:snd_hda_intel

to the kernel command line will only enable the snd_hda_intel module functions when the module is loaded, and it will start tracing.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250116175832.34e39779@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-16  tracing: Adopt __free() and guard() for trace_fprobe.c  (Masami Hiramatsu (Google))
Adopt __free() and guard() for trace_fprobe.c to remove gotos. Link: https://lore.kernel.org/173708043449.319651.12242878905778792182.stgit@mhiramat.roam.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
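For readers unfamiliar with the idiom, here is a generic example of what __free() and guard() from <linux/cleanup.h> replace (illustrative only, not the trace_fprobe.c code; example_mutex and do_work are made-up names):

    #include <linux/cleanup.h>
    #include <linux/mutex.h>
    #include <linux/slab.h>

    static DEFINE_MUTEX(example_mutex);
    static int do_work(char *buf);              /* made-up consumer */

    static int example(void)
    {
            char *buf __free(kfree) = kzalloc(64, GFP_KERNEL);

            if (!buf)
                    return -ENOMEM;             /* kfree(NULL) on exit is a no-op */

            guard(mutex)(&example_mutex);       /* unlocked automatically on return */
            return do_work(buf);                /* no goto-based error unwinding needed */
    }

The allocation and the lock are released on every return path, which is what allows the gotos to be removed.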
2025-01-16  bpf: verifier: Support eliding map lookup nullness  (Daniel Xu)
This commit allows progs to elide a null check on statically known map lookup keys. In other words, if the verifier can statically prove that the lookup will be in-bounds, allow the prog to drop the null check.

This is useful for two reasons:

1. Large numbers of nullness checks (especially when they cannot fail) unnecessarily push a prog towards BPF_COMPLEXITY_LIMIT_JMP_SEQ.
2. It forms a tighter contract between programmer and verifier.

For (1), bpftrace is starting to make heavier use of percpu scratch maps. As a result, for user scripts with a large number of unrolled loops, we are starting to hit jump complexity verification errors. These percpu lookups cannot fail anyways, as we only use static key values. Eliding nullness probably results in less work for the verifier as well.

For (2), percpu scratch maps are often used as a larger stack, as the current stack is limited to 512 bytes. In these situations, it is desirable for the programmer to express: "this lookup should never fail, and if it does, it means I messed up the code". By omitting the null check, the programmer can "ask" the verifier to double check the logic.

Tests also have to be updated in sync with these changes, as the verifier is more efficient with this change. Notably, iters.c tests had to be changed to use a map type that still requires null checks, as it's exercising verifier tracking logic w.r.t iterators.

Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
Link: https://lore.kernel.org/r/68f3ea96ff3809a87e502a11a4bd30177fc5823e.1736886479.git.dxu@dxuuu.xyz
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
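An example of the pattern that benefits (a sketch in libbpf-style BPF C; struct scratch, the map name and the attach point are made up): the key is the constant 0 into a single-entry per-CPU array, so a verifier with this change can prove the lookup is in-bounds and track the returned pointer as non-NULL.

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    struct scratch { __u64 buf[64]; };

    struct {
            __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
            __uint(max_entries, 1);
            __type(key, __u32);
            __type(value, struct scratch);
    } scratch_map SEC(".maps");

    SEC("tracepoint/syscalls/sys_enter_openat")
    int handle(void *ctx)
    {
            __u32 key = 0;  /* statically known, always in bounds */
            struct scratch *s = bpf_map_lookup_elem(&scratch_map, &key);

            /* On verifiers with this change the NULL check can be elided;
             * older verifiers still require: if (!s) return 0; */
            s->buf[0] = 1;
            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";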
2025-01-16  bpf: verifier: Refactor helper access type tracking  (Daniel Xu)
Previously, the verifier was treating all PTR_TO_STACK registers passed to a helper call as potentially written to by the helper. However, all calls to check_stack_range_initialized() already have precise access type information available. Rather than treat ACCESS_HELPER as a proxy for BPF_WRITE, pass enum bpf_access_type to check_stack_range_initialized() to more precisely track helper arguments. One benefit from this precision is that registers tracked as valid spills and passed as a read-only helper argument remain tracked after the call. Rather than being marked STACK_MISC afterwards. An additional benefit is the verifier logs are also more precise. For this particular error, users will enjoy a slightly clearer message. See included selftest updates for examples. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/ff885c0e5859e0cd12077c3148ff0754cad4f7ed.1736886479.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-16  bpf: verifier: Add missing newline on verbose() call  (Daniel Xu)
The print was missing a newline. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/59cbe18367b159cd470dc6d5c652524c1dc2b984.1736886479.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-16  Merge back earlier cpufreq material for 6.14  (Rafael J. Wysocki)
2025-01-16  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)
Cross-merge networking fixes after downstream PR (net-6.13-rc8). Conflicts: drivers/net/ethernet/realtek/r8169_main.c 1f691a1fc4be ("r8169: remove redundant hwmon support") 152d00a91396 ("r8169: simplify setting hwmon attribute visibility") https://lore.kernel.org/20250115122152.760b4e8d@canb.auug.org.au Adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt.c 152f4da05aee ("bnxt_en: add support for rx-copybreak ethtool command") f0aa6a37a3db ("eth: bnxt: always recalculate features after XDP clearing, fix null-deref") drivers/net/ethernet/intel/ice/ice_type.h 50327223a8bb ("ice: add lock to protect low latency interface") dc26548d729e ("ice: Fix quad registers read on E825") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16  tracing: Cache ":mod:" events for modules not loaded yet  (Steven Rostedt)
When the :mod: command is written into /sys/kernel/tracing/set_event (or that file within an instance), if the module specified after the ":mod:" is not yet loaded, it will store that string internally. When the module is loaded, it will enable the events as if the module was loaded when the string was written into the set_event file. This can also be useful to enable events that are in the init section of the module, as the events are enabled before the init section is executed. This also works on the kernel command line: trace_event=:mod:<module> Will enable the events for <module> when it is loaded. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20250116143533.514730995@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-16  tracing: Add :mod: command to enable module events  (Steven Rostedt)
Add a :mod: command to enable only events from a given module from the set_events file. echo '*:mod:<module>' > set_events Or echo ':mod:<module>' > set_events Will enable all events for that module. Specific events can also be enabled via: echo '<event>:mod:<module>' > set_events Or echo '<system>:<event>:mod:<module>' > set_events Or echo '*:<event>:mod:<module>' > set_events The ":mod:" keyword is consistent with the function tracing filter to enable functions from a given module. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20250116143533.214496360@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-01-16  timers/migration: Simplify top level detection on group setup  (Frederic Weisbecker)
Having a single group on a given level is enough to know this is the top level, because a root has to have at least two children, unless that root is the only group and the children are actual CPUs. Simplify the test in tmigr_setup_groups() accordingly. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250114231507.21672-5-frederic@kernel.org
2025-01-16  hrtimers: Handle CPU state correctly on hotplug  (Koichiro Den)
Consider a scenario where a CPU transitions from CPUHP_ONLINE to halfway through a CPU hotunplug down to CPUHP_HRTIMERS_PREPARE, and then back to CPUHP_ONLINE: Since hrtimers_prepare_cpu() does not run, cpu_base.hres_active remains set to 1 throughout. However, during a CPU unplug operation, the tick and the clockevents are shut down at CPUHP_AP_TICK_DYING. On return to the online state, for instance CFS incorrectly assumes that the hrtick is already active, and the chance of the clockevent device to transition to oneshot mode is also lost forever for the CPU, unless it goes back to a lower state than CPUHP_HRTIMERS_PREPARE once. This round-trip reveals another issue; cpu_base.online is not set to 1 after the transition, which appears as a WARN_ON_ONCE in enqueue_hrtimer(). Aside of that, the bulk of the per CPU state is not reset either, which means there are dangling pointers in the worst case. Address this by adding a corresponding startup() callback, which resets the stale per CPU state and sets the online flag. [ tglx: Make the new callback unconditionally available, remove the online modification in the prepare() callback and clear the remaining state in the starting callback instead of the prepare callback ] Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier") Signed-off-by: Koichiro Den <koichiro.den@canonical.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20241220134421.3809834-1-koichiro.den@canonical.com
2025-01-16  timers/migration: Annotate accesses to ignore flag  (Frederic Weisbecker)
The group's ignore flag is: _ read under the group's lock (idle entry, remote expiry) _ turned on/off under the group's lock (idle entry, remote expiry) _ turned on locklessly on idle exit When idle entry or remote expiry clear the "ignore" flag of a group, the operation must be synchronized against other concurrent idle entry or remote expiry to make sure the related group timer is never missed. To enforce this synchronization, both "ignore" clear and read are performed under the group lock. On the contrary, whether idle entry or remote expiry manage to observe the "ignore" flag turned on by a CPU exiting idle is a matter of optimization. If that flag set is missed or cleared concurrently, the worst outcome is a migrator wasting time remotely handling a "ghost" timer. This is why the ignore flag can be set locklessly. Unfortunately, the related lockless accesses are bare and miss appropriate annotations. KCSAN rightfully complains: BUG: KCSAN: data-race in __tmigr_cpu_activate / print_report write to 0xffff88842fc28004 of 1 bytes by task 0 on cpu 0: __tmigr_cpu_activate tmigr_cpu_activate timer_clear_idle tick_nohz_restart_sched_tick tick_nohz_idle_exit do_idle cpu_startup_entry kernel_init do_initcalls clear_bss reserve_bios_regions common_startup_64 read to 0xffff88842fc28004 of 1 bytes by task 0 on cpu 1: print_report kcsan_report_known_origin kcsan_setup_watchpoint tmigr_next_groupevt tmigr_update_events tmigr_inactive_up __walk_groups+0x50/0x77 walk_groups __tmigr_cpu_deactivate tmigr_cpu_deactivate __get_next_timer_interrupt timer_base_try_to_set_idle tick_nohz_stop_tick tick_nohz_idle_stop_tick cpuidle_idle_call do_idle Although the relevant accesses could be marked as data_race(), the "ignore" flag being read several times within the same tmigr_update_events() function is confusing and error prone. Prefer reading it once in that function and make use of similar/paired accesses elsewhere with appropriate comments when necessary. Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250114231507.21672-4-frederic@kernel.org Closes: https://lore.kernel.org/oe-lkp/202501031612.62e0c498-lkp@intel.com
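The kind of annotation involved, in schematic form (not the exact patch): the lockless setter and the lockless reader are marked so KCSAN knows the race is intentional, and the flag is read once into a local instead of being re-read throughout tmigr_update_events().

    /* idle exit: lockless set, group lock not held */
    WRITE_ONCE(group->ignore, true);

    /* idle entry / remote expiry: read once, then use the snapshot */
    bool ignore = READ_ONCE(child->ignore);

    if (!ignore) {
            /* handle the child's group event */
    }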
2025-01-16  timers/migration: Enforce group initialization visibility to tree walkers  (Frederic Weisbecker)
Commit 2522c84db513 ("timers/migration: Fix another race between hotplug and idle entry/exit") fixed yet another race between idle exit and CPU hotplug up leading to a wrong "0" value migrator assigned to the top level. However there is yet another situation that remains unhandled: [GRP0:0] migrator = TMIGR_NONE active = NONE groupmask = 1 / \ \ 0 1 2..7 idle idle idle 0) The system is fully idle. [GRP0:0] migrator = CPU 0 active = CPU 0 groupmask = 1 / \ \ 0 1 2..7 active idle idle 1) CPU 0 is activating. It has done the cmpxchg on the top's ->migr_state but it hasn't yet returned to __walk_groups(). [GRP0:0] migrator = CPU 0 active = CPU 0, CPU 1 groupmask = 1 / \ \ 0 1 2..7 active active idle 2) CPU 1 is activating. CPU 0 stays the migrator (still stuck in __walk_groups(), delayed by #VMEXIT for example). [GRP1:0] migrator = TMIGR_NONE active = NONE groupmask = 1 / \ [GRP0:0] [GRP0:1] migrator = CPU 0 migrator = TMIGR_NONE active = CPU 0, CPU1 active = NONE groupmask = 1 groupmask = 2 / \ \ 0 1 2..7 8 active active idle !online 3) CPU 8 is preparing to boot. CPUHP_TMIGR_PREPARE is being ran by CPU 1 which has created the GRP0:1 and the new top GRP1:0 connected to GRP0:1 and GRP0:0. CPU 1 hasn't yet propagated its activation up to GRP1:0. [GRP1:0] migrator = GRP0:0 active = GRP0:0 groupmask = 1 / \ [GRP0:0] [GRP0:1] migrator = CPU 0 migrator = TMIGR_NONE active = CPU 0, CPU1 active = NONE groupmask = 1 groupmask = 2 / \ \ 0 1 2..7 8 active active idle !online 4) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups() returning from tmigr_cpu_active(). The new top GRP1:0 is visible and fetched and the pre-initialized groupmask of GRP0:0 is also visible. As a result tmigr_active_up() is called to GRP1:0 with GRP0:0 as active and migrator. CPU 0 is returning to __walk_groups() but suffers again a #VMEXIT. [GRP1:0] migrator = GRP0:0 active = GRP0:0 groupmask = 1 / \ [GRP0:0] [GRP0:1] migrator = CPU 0 migrator = TMIGR_NONE active = CPU 0, CPU1 active = NONE groupmask = 1 groupmask = 2 / \ \ 0 1 2..7 8 active active idle !online 5) CPU 1 propagates its activation of GRP0:0 to GRP1:0. This has no effect since CPU 0 did it already. [GRP1:0] migrator = GRP0:0 active = GRP0:0, GRP0:1 groupmask = 1 / \ [GRP0:0] [GRP0:1] migrator = CPU 0 migrator = CPU 8 active = CPU 0, CPU1 active = CPU 8 groupmask = 1 groupmask = 2 / \ \ \ 0 1 2..7 8 active active idle active 6) CPU 1 links CPU 8 to its group. CPU 8 boots and goes through CPUHP_AP_TMIGR_ONLINE which propagates activation. [GRP2:0] migrator = TMIGR_NONE active = NONE groupmask = 1 / \ [GRP1:0] [GRP1:1] migrator = GRP0:0 migrator = TMIGR_NONE active = GRP0:0, GRP0:1 active = NONE groupmask = 1 groupmask = 2 / \ [GRP0:0] [GRP0:1] [GRP0:2] migrator = CPU 0 migrator = CPU 8 migrator = TMIGR_NONE active = CPU 0, CPU1 active = CPU 8 active = NONE groupmask = 1 groupmask = 2 groupmask = 0 / \ \ \ 0 1 2..7 8 64 active active idle active !online 7) CPU 64 is booting. CPUHP_TMIGR_PREPARE is being ran by CPU 1 which has created the GRP1:1, GRP0:2 and the new top GRP2:0 connected to GRP1:1 and GRP1:0. CPU 1 hasn't yet propagated its activation up to GRP2:0. [GRP2:0] migrator = 0 (!!!) 
active = NONE groupmask = 1 / \ [GRP1:0] [GRP1:1] migrator = GRP0:0 migrator = TMIGR_NONE active = GRP0:0, GRP0:1 active = NONE groupmask = 1 groupmask = 2 / \ [GRP0:0] [GRP0:1] [GRP0:2] migrator = CPU 0 migrator = CPU 8 migrator = TMIGR_NONE active = CPU 0, CPU1 active = CPU 8 active = NONE groupmask = 1 groupmask = 2 groupmask = 0 / \ \ \ 0 1 2..7 8 64 active active idle active !online 8) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups() returning from tmigr_cpu_active(). The new top GRP2:0 is visible and fetched but the pre-initialized groupmask of GRP1:0 is not because no ordering made its initialization visible. As a result tmigr_active_up() may be called to GRP2:0 with a "0" child's groumask. Leaving the timers ignored for ever when the system is fully idle. The race is highly theoretical and perhaps impossible in practice but the groupmask of the child is not the only concern here as the whole initialization of the child is not guaranteed to be visible to any tree walker racing against hotplug (idle entry/exit, remote handling, etc...). Although the current code layout seem to be resilient to such hazards, this doesn't tell much about the future. Fix this with enforcing address dependency between group initialization and the write/read to the group's parent's pointer. Fortunately that doesn't involve any barrier addition in the fast paths. Fixes: 10a0e6f3d3db ("timers/migration: Move hierarchy setup into cpuhotplug prepare callback") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20250114231507.21672-3-frederic@kernel.org
2025-01-16  timers/migration: Fix another race between hotplug and idle entry/exit  (Frederic Weisbecker)
Commit 10a0e6f3d3db ("timers/migration: Move hierarchy setup into cpuhotplug prepare callback") fixed a race between idle exit and CPU hotplug up leading to a wrong "0" value migrator assigned to the top level. However there is still a situation that remains unhandled: [GRP0:0] migrator = TMIGR_NONE active = NONE groupmask = 0 / \ \ 0 1 2..7 idle idle idle 0) The system is fully idle. [GRP0:0] migrator = CPU 0 active = CPU 0 groupmask = 0 / \ \ 0 1 2..7 active idle idle 1) CPU 0 is activating. It has done the cmpxchg on the top's ->migr_state but it hasn't yet returned to __walk_groups(). [GRP0:0] migrator = CPU 0 active = CPU 0, CPU 1 groupmask = 0 / \ \ 0 1 2..7 active active idle 2) CPU 1 is activating. CPU 0 stays the migrator (still stuck in __walk_groups(), delayed by #VMEXIT for example). [GRP1:0] migrator = TMIGR_NONE active = NONE groupmask = 0 / \ [GRP0:0] [GRP0:1] migrator = CPU 0 migrator = TMIGR_NONE active = CPU 0, CPU1 active = NONE groupmask = 2 groupmask = 1 / \ \ 0 1 2..7 8 active active idle !online 3) CPU 8 is preparing to boot. CPUHP_TMIGR_PREPARE is being ran by CPU 1 which has created the GRP0:1 and the new top GRP1:0 connected to GRP0:1 and GRP0:0. The groupmask of GRP0:0 is now 2. CPU 1 hasn't yet propagated its activation up to GRP1:0. [GRP1:0] migrator = 0 (!!!) active = NONE groupmask = 0 / \ [GRP0:0] [GRP0:1] migrator = CPU 0 migrator = TMIGR_NONE active = CPU 0, CPU1 active = NONE groupmask = 2 groupmask = 1 / \ \ 0 1 2..7 8 active active idle !online 4) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups() returning from tmigr_cpu_active(). The new top GRP1:0 is visible and fetched but the freshly updated groupmask of GRP0:0 may not be visible due to lack of ordering! As a result tmigr_active_up() is called to GRP0:0 with a child's groupmask of "0". This buggy "0" groupmask then becomes the migrator for GRP1:0 forever. As a result, timers on a fully idle system get ignored. One possible fix would be to define TMIGR_NONE as "0" so that such a race would have no effect. And after all TMIGR_NONE doesn't need to be anything else. However this would leave an uncomfortable state machine where gears happen not to break by chance but are vulnerable to future modifications. Keep TMIGR_NONE as is instead and pre-initialize to "1" the groupmask of any newly created top level. This groupmask is guaranteed to be visible upon fetching the corresponding group for the 1st time: _ By the upcoming CPU thanks to CPU hotplug synchronization between the control CPU (BP) and the booting one (AP). _ By the control CPU since the groupmask and parent pointers are initialized locally. _ By all CPUs belonging to the same group than the control CPU because they must wait for it to ever become idle before needing to walk to the new top. The cmpcxhg() on ->migr_state then makes sure its groupmask is visible. With this pre-initialization, it is guaranteed that if a future top level is linked to an old one, it is walked through with a valid groupmask. Fixes: 10a0e6f3d3db ("timers/migration: Move hierarchy setup into cpuhotplug prepare callback") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20250114231507.21672-2-frederic@kernel.org
2025-01-16genirq/generic_chip: Export irq_gc_mask_disable_and_ack_set()Dr. David Alan Gilbert
The recent conversion of brcmstb_l2_mask_and_ack() to irq_gc_mask_disable_and_ack_set() missed that the driver can be built as a module, but the generic function is not exported. Add the missing export. [ tglx: Converted it to a fix ] Fixes: dd1f17a9faf5 ("irqchip/irq-brcmstb-l2: Replace brcmstb_l2_mask_and_ack() by generic function") Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250116005920.626822-1-linux@treblig.org
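A hedged sketch of what such a fix amounts to: adding the export next to the function definition so that modular irqchip drivers can link against it. The GPL-only variant is assumed here, matching the other generic-chip helpers; the exact form is not verified from this log.

    /* Placed right below the irq_gc_mask_disable_and_ack_set() definition. */
    EXPORT_SYMBOL_GPL(irq_gc_mask_disable_and_ack_set);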
2025-01-16timers: Optimize get_timer_[this_]cpu_base()Zhongqiu Han
If a timer is deferrable and NO_HZ_COMMON is enabled, get_timer_cpu_base() and get_timer_this_cpu_base() invoke per_cpu_ptr() and this_cpu_ptr() twice. While this seems to be cheap, get_timer_cpu_base() can be called in a loop in lock_timer_base(). Optimize the functions by updating the base index for deferrable timers and retrieving the actual base pointer once. In both cases the resulting assembly code of those helpers becomes smaller, which results in a ~30% execution time reduction for a lock_timer_base() micro benchmark. Signed-off-by: Zhongqiu Han <quic_zhonhan@quicinc.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20241231150115.1978342-1-quic_zhonhan@quicinc.com
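A hedged sketch of the shape of the optimization (simplified, not the exact kernel code; the this_cpu_ptr() variant follows the same pattern): select the base index first, then resolve the per-CPU pointer once.

    static inline struct timer_base *get_timer_cpu_base(u32 tflags, u32 cpu)
    {
            int index = BASE_LOCAL;

            /* Deferrable timers live in a separate base when NO_HZ_COMMON is set. */
            if (IS_ENABLED(CONFIG_NO_HZ_COMMON) && (tflags & TIMER_DEFERRABLE))
                    index = BASE_DEF;

            /* Single per-CPU lookup instead of two. */
            return per_cpu_ptr(&timer_bases[index], cpu);
    }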
2025-01-15bpf: Send signals asynchronously if !preemptiblePuranjay Mohan
BPF programs can execute in all kinds of contexts and when a program running in a non-preemptible context uses the bpf_send_signal() kfunc, it will cause issues because this kfunc can sleep. Change `irqs_disabled()` to `!preemptible()`. Reported-by: syzbot+97da3d7e0112d59971de@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67486b09.050a0220.253251.0084.GAE@google.com/ Fixes: 1bc7896e9ef4 ("bpf: Fix deadlock with rq_lock in bpf_send_signal()") Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250115103647.38487-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
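A hedged sketch of the guard change described above (the irq_work plumbing is assumed and abbreviated; only the context check is the point, and queue_signal_irq_work() is a hypothetical helper, not a real kernel symbol):

    /*
     * Hedged sketch, not the actual kernel/trace/bpf_trace.c code:
     * irqs_disabled() misses preempt-disabled sections, while
     * !preemptible() covers every context that cannot sleep, so signal
     * delivery is deferred via irq_work in all of them.
     */
    static int send_signal_sketch(int sig, enum pid_type type)
    {
            if (!preemptible())
                    return queue_signal_irq_work(sig, type); /* hypothetical */

            return group_send_sig_info(sig, SEND_SIG_PRIV, current, type);
    }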
2025-01-15genirq/timings: Add kernel-doc for a function parameterRandy Dunlap
Add the description for @now to eliminate a kernel-doc warning:

timings.c:537: warning: Function parameter or struct member 'now' not described in 'irq_timings_next_event'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250111062954.910657-1-rdunlap@infradead.org
2025-01-15genirq: Remove IRQ_MOVE_PCNTXT and related codeThomas Gleixner
Now that x86 is converted over to use the IRQCHIP_MOVE_DEFERRED flag, remove IRQ*_MOVE_PCNTXT and related code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241210103335.626707225@linutronix.de
2025-01-15timekeeping: Remove unused ktime_get_fast_timestamps()Dr. David Alan Gilbert
ktime_get_fast_timestamps() was added in 2020 by commit e2d977c9f1ab ("timekeeping: Provide multi-timestamp accessor to NMI safe timekeeper") but has remained unused. Remove it. [ tglx: Fold the inline as David suggested in the submission ] Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250112160132.450209-1-linux@treblig.org
2025-01-15timer/migration: Fix kernel-doc warnings for union tmigr_stateRandy Dunlap
Use the correct kernel-doc notation for nested structs/unions to eliminate warnings:

timer_migration.h:119: warning: Incorrect use of kernel-doc format: * struct - split state of tmigr_group
timer_migration.h:134: warning: Function parameter or struct member 'active' not described in 'tmigr_state'
timer_migration.h:134: warning: Function parameter or struct member 'migrator' not described in 'tmigr_state'
timer_migration.h:134: warning: Function parameter or struct member 'seq' not described in 'tmigr_state'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250111063156.910903-1-rdunlap@infradead.org
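For context, kernel-doc documents nested struct/union members with dotted names on the top-level comment. A generic, made-up example of the notation (not the actual tmigr_state declaration):

    /**
     * union sample_state - example of kernel-doc nested member notation
     * @state:         Combined state word
     * @split:         Split view of the same word
     * @split.active:  Bitfield of active children
     * @split.seq:     Sequence counter
     */
    union sample_state {
            u32 state;
            struct {
                    u16 active;
                    u16 seq;
            } split;
    };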
2025-01-15tick/broadcast: Add kernel-doc for function parametersRandy Dunlap
Add kernel-doc comments for two parameters to eliminate kernel-doc warnings:

tick-broadcast.c:1026: warning: Function parameter or struct member 'bc' not described in 'tick_broadcast_setup_oneshot'
tick-broadcast.c:1026: warning: Function parameter or struct member 'from_periodic' not described in 'tick_broadcast_setup_oneshot'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250111063148.910887-1-rdunlap@infradead.org
2025-01-15hrtimers: Update the return type of enqueue_hrtimer()Richard Clark
The return type should be 'bool' instead of 'int' to match both the calling context in the kernel and its internal implementation, i.e. 'return timerqueue_add();', where timerqueue_add() is a bool-returning function. [ tglx: Adjust function arguments ] Signed-off-by: Richard Clark <richard.xnu.clark@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/Z2ppT7me13dtxm1a@MBC02GN1V4Q05P
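A hedged sketch of the resulting shape (body trimmed to the relevant return, not the full kernel function):

    static bool enqueue_hrtimer(struct hrtimer *timer,
                                struct hrtimer_clock_base *base,
                                enum hrtimer_mode mode)
    {
            /* ... bookkeeping unchanged ... */

            /*
             * timerqueue_add() returns true when this timer became the
             * first-expiring node, so 'bool' matches directly.
             */
            return timerqueue_add(&base->active, &timer->node);
    }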
2025-01-15clocksource/wdtest: Print time values for short udelay(1)Paul E. McKenney
When a pair of clocksource reads separated by a udelay(1) claim less than a full microsecond of elapsed time, print the measured delay as part of the splat. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/717a2ddf-a80f-490b-aa3a-4e4b74fa56ca@paulmck-laptop
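A hedged sketch of the idea (not the actual clocksource-wdtest code; names and threshold are simplified): convert the two raw reads around udelay(1) to nanoseconds and include the measured value in the warning.

    static void check_udelay(struct clocksource *cs)
    {
            u64 begin, end;
            s64 ns;

            begin = cs->read(cs);
            udelay(1);
            end = cs->read(cs);

            ns = clocksource_cyc2ns((end - begin) & cs->mask,
                                    cs->mult, cs->shift);
            if (ns < NSEC_PER_USEC)
                    pr_warn("%s: udelay(1) advanced by only %lld ns\n",
                            cs->name, ns);
    }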
2025-01-15posix-timers: Fix typo in __lock_timer()Zhu Jun
The word 'accross' is wrong, so fix it. Signed-off-by: Zhu Jun <zhujun2@cmss.chinamobile.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241204080907.11989-1-zhujun2@cmss.chinamobile.com
2025-01-15signal/posixtimers: Handle ignore/blocked sequences correctlyThomas Gleixner
syzbot triggered the warning in posixtimer_send_sigqueue(), which warns about a non-ignored signal being already queued on the ignored list.

The warning is actually bogus, as the following sequence causes it:

  signal($SIG, SIG_IGN);
  timer_settime(...);            // arm periodic timer

  timer fires, signal is ignored and queued on the ignored list

  sigprocmask(SIG_BLOCK, ...);   // block the signal
  timer_settime(...);            // re-arm periodic timer

  timer fires, signal is not ignored because it is blocked
  ---> Warning triggers as the signal is on the ignored list

Ideally timer_settime() could remove the signal, but that's racy and incomplete vs. other scenarios and requires a full reevaluation of the pending signal list.

Instead of adding more complexity, handle it gracefully by removing the warning and requeueing the signal to the pending list. That's correct versus:

1) sig[timed]wait(), as that does not check for SIG_IGN and only relies on dequeue_signal() -> posixtimers_deliver_signal() to check whether the pending signal is still valid.

2) Unblocking of the signal.

   - If the unblocking happens before SIG_IGN is replaced by a signal handler, then the timer is rearmed in dequeue_signal(), but get_signal() will ignore it. The next timer expiry will move it back to the ignored list.

   - If SIG_IGN was replaced before unblocking, then the signal will be delivered and a subsequent expiry will queue a signal on the pending list again.

There is a related scenario which triggers the complementary warning in the signal-ignored path, which does not expect the signal to be on the pending list when it is ignored. That can be triggered even before the above change via:

  task1                               task2
  signal($SIG, SIG_IGN);
                                      sigprocmask(SIG_BLOCK, ...);
  timer_create();                     // Signal target is task2
  timer_settime(...);                 // arm periodic timer

  timer fires, signal is not ignored because it is blocked
  and queued on the pending list of task2

                                      syscall()
                                      // Sets the pending flag
                                      sigprocmask(SIG_UNBLOCK, ...);
                                      -> preemption, task2 cannot dequeue
                                         the signal

  timer_settime(...);                 // re-arm periodic timer

  timer fires, signal is ignored
  ---> Warning triggers as the signal is on task2's pending list and
       the thread group is not exiting

Consequently, remove that warning too and just keep the signal on the pending list. The following attempt to deliver the signal on return to user space of task2 will ignore the signal, and a subsequent expiry will bring it back to the ignored list, if it did not get blocked or un-ignored before that.

Fixes: df7a996b4dab ("signal: Queue ignored posixtimers on ignore list")
Reported-by: syzbot+3c2e3cc60665d71de2f7@syzkaller.appspotmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/87ikqhcnjn.ffs@tglx
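For readers who want to see the first sequence end to end, a rough userspace sketch is below. It is illustrative only: the signal number, delays and lack of error handling are arbitrary choices, and it is not a reproducer taken from the syzbot report.

    #include <signal.h>
    #include <time.h>

    int main(void)
    {
            struct itimerspec its = {
                    .it_value    = { .tv_nsec = 1000000 },
                    .it_interval = { .tv_nsec = 1000000 },
            };
            struct timespec wait = { .tv_nsec = 5000000 };
            sigset_t set;
            timer_t tid;

            signal(SIGALRM, SIG_IGN);                   /* signal($SIG, SIG_IGN) */
            timer_create(CLOCK_MONOTONIC, NULL, &tid);  /* NULL sigevent: SIGALRM */
            timer_settime(tid, 0, &its, NULL);          /* arm periodic timer */
            nanosleep(&wait, NULL);                     /* expiry: ignored, queued on ignored list */

            sigemptyset(&set);
            sigaddset(&set, SIGALRM);
            sigprocmask(SIG_BLOCK, &set, NULL);         /* block the signal */
            timer_settime(tid, 0, &its, NULL);          /* re-arm periodic timer */
            nanosleep(&wait, NULL);                     /* expiry: not ignored because blocked */

            return 0;
    }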