From ddc204b517e60ae64db34f9832dc41dafa77c751 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Tue, 8 Feb 2022 11:39:12 -0500 Subject: copy_process(): Move fd_install() out of sighand->siglock critical section I was made aware of the following lockdep splat: [ 2516.308763] ===================================================== [ 2516.309085] WARNING: HARDIRQ-safe -> HARDIRQ-unsafe lock order detected [ 2516.309433] 5.14.0-51.el9.aarch64+debug #1 Not tainted [ 2516.309703] ----------------------------------------------------- [ 2516.310149] stress-ng/153663 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire: [ 2516.310512] ffff0000e422b198 (&newf->file_lock){+.+.}-{2:2}, at: fd_install+0x368/0x4f0 [ 2516.310944] and this task is already holding: [ 2516.311248] ffff0000c08140d8 (&sighand->siglock){-.-.}-{2:2}, at: copy_process+0x1e2c/0x3e80 [ 2516.311804] which would create a new lock dependency: [ 2516.312066] (&sighand->siglock){-.-.}-{2:2} -> (&newf->file_lock){+.+.}-{2:2} [ 2516.312446] but this new dependency connects a HARDIRQ-irq-safe lock: [ 2516.312983] (&sighand->siglock){-.-.}-{2:2} : [ 2516.330700] Possible interrupt unsafe locking scenario: [ 2516.331075] CPU0 CPU1 [ 2516.331328] ---- ---- [ 2516.331580] lock(&newf->file_lock); [ 2516.331790] local_irq_disable(); [ 2516.332231] lock(&sighand->siglock); [ 2516.332579] lock(&newf->file_lock); [ 2516.332922] [ 2516.333069] lock(&sighand->siglock); [ 2516.333291] *** DEADLOCK *** [ 2516.389845] stack backtrace: [ 2516.390101] CPU: 3 PID: 153663 Comm: stress-ng Kdump: loaded Not tainted 5.14.0-51.el9.aarch64+debug #1 [ 2516.390756] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [ 2516.391155] Call trace: [ 2516.391302] dump_backtrace+0x0/0x3e0 [ 2516.391518] show_stack+0x24/0x30 [ 2516.391717] dump_stack_lvl+0x9c/0xd8 [ 2516.391938] dump_stack+0x1c/0x38 [ 2516.392247] print_bad_irq_dependency+0x620/0x710 [ 2516.392525] check_irq_usage+0x4fc/0x86c [ 2516.392756] check_prev_add+0x180/0x1d90 [ 2516.392988] validate_chain+0x8e0/0xee0 [ 2516.393215] __lock_acquire+0x97c/0x1e40 [ 2516.393449] lock_acquire.part.0+0x240/0x570 [ 2516.393814] lock_acquire+0x90/0xb4 [ 2516.394021] _raw_spin_lock+0xe8/0x154 [ 2516.394244] fd_install+0x368/0x4f0 [ 2516.394451] copy_process+0x1f5c/0x3e80 [ 2516.394678] kernel_clone+0x134/0x660 [ 2516.394895] __do_sys_clone3+0x130/0x1f4 [ 2516.395128] __arm64_sys_clone3+0x5c/0x7c [ 2516.395478] invoke_syscall.constprop.0+0x78/0x1f0 [ 2516.395762] el0_svc_common.constprop.0+0x22c/0x2c4 [ 2516.396050] do_el0_svc+0xb0/0x10c [ 2516.396252] el0_svc+0x24/0x34 [ 2516.396436] el0t_64_sync_handler+0xa4/0x12c [ 2516.396688] el0t_64_sync+0x198/0x19c [ 2517.491197] NET: Registered PF_ATMPVC protocol family [ 2517.491524] NET: Registered PF_ATMSVC protocol family [ 2591.991877] sched: RT throttling activated One way to solve this problem is to move the fd_install() call out of the sighand->siglock critical section. Before commit 6fd2fe494b17 ("copy_process(): don't use ksys_close() on cleanups"), the pidfd installation was done without holding both the task_list lock and the sighand->siglock. Obviously, holding these two locks are not really needed to protect the fd_install() call. So move the fd_install() call down to after the releases of both locks. Link: https://lore.kernel.org/r/20220208163912.1084752-1-longman@redhat.com Fixes: 6fd2fe494b17 ("copy_process(): don't use ksys_close() on cleanups") Reviewed-by: "Eric W. Biederman" Signed-off-by: Waiman Long Signed-off-by: Christian Brauner --- kernel/fork.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) (limited to 'kernel/fork.c') diff --git a/kernel/fork.c b/kernel/fork.c index d75a528f7b21..007af7fb47c7 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2323,10 +2323,6 @@ static __latent_entropy struct task_struct *copy_process( goto bad_fork_cancel_cgroup; } - /* past the last point of failure */ - if (pidfile) - fd_install(pidfd, pidfile); - init_task_pid_links(p); if (likely(p->pid)) { ptrace_init_task(p, (clone_flags & CLONE_PTRACE) || trace); @@ -2375,6 +2371,9 @@ static __latent_entropy struct task_struct *copy_process( syscall_tracepoint_update(p); write_unlock_irq(&tasklist_lock); + if (pidfile) + fd_install(pidfd, pidfile); + proc_fork_connector(p); sched_post_fork(p, args); cgroup_post_fork(p, args); -- cgit From 8f2f9c4d82f24f172ae439e5035fc1e0e4c229dd Mon Sep 17 00:00:00 2001 From: "Eric W. Biederman" Date: Wed, 9 Feb 2022 20:03:19 -0600 Subject: ucounts: Enforce RLIMIT_NPROC not RLIMIT_NPROC+1 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Michal Koutný wrote: > It was reported that v5.14 behaves differently when enforcing > RLIMIT_NPROC limit, namely, it allows one more task than previously. > This is consequence of the commit 21d1c5e386bc ("Reimplement > RLIMIT_NPROC on top of ucounts") that missed the sharpness of > equality in the forking path. This can be fixed either by fixing the test or by moving the increment to be before the test. Fix it my moving copy_creds which contains the increment before is_ucounts_overlimit. In the case of CLONE_NEWUSER the ucounts in the task_cred changes. The function is_ucounts_overlimit needs to use the final version of the ucounts for the new process. Which means moving the is_ucounts_overlimit test after copy_creds is necessary. Both the test in fork and the test in set_user were semantically changed when the code moved to ucounts. The change of the test in fork was bad because it was before the increment. The test in set_user was wrong and the change to ucounts fixed it. So this fix only restores the old behavior in one lcation not two. Link: https://lkml.kernel.org/r/20220204181144.24462-1-mkoutny@suse.com Link: https://lkml.kernel.org/r/20220216155832.680775-2-ebiederm@xmission.com Cc: stable@vger.kernel.org Reported-by: Michal Koutný Reviewed-by: Michal Koutný Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") Signed-off-by: "Eric W. Biederman" --- kernel/fork.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'kernel/fork.c') diff --git a/kernel/fork.c b/kernel/fork.c index d75a528f7b21..17d8a8c85e3b 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2021,18 +2021,18 @@ static __latent_entropy struct task_struct *copy_process( #ifdef CONFIG_PROVE_LOCKING DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled); #endif + retval = copy_creds(p, clone_flags); + if (retval < 0) + goto bad_fork_free; + retval = -EAGAIN; if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) { if (p->real_cred->user != INIT_USER && !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN)) - goto bad_fork_free; + goto bad_fork_cleanup_count; } current->flags &= ~PF_NPROC_EXCEEDED; - retval = copy_creds(p, clone_flags); - if (retval < 0) - goto bad_fork_free; - /* * If multiple threads are within copy_process(), then this check * triggers too late. This doesn't hurt, the check is only there -- cgit From b1e8206582f9d680cff7d04828708c8b6ab32957 Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Mon, 14 Feb 2022 10:16:57 +0100 Subject: sched: Fix yet more sched_fork() races Where commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") fixed a fork race vs cgroup, it opened up a race vs syscalls by not placing the task on the runqueue before it gets exposed through the pidhash. Commit 13765de8148f ("sched/fair: Fix fault in reweight_entity") is trying to fix a single instance of this, instead fix the whole class of issues, effectively reverting this commit. Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") Reported-by: Linus Torvalds Signed-off-by: Peter Zijlstra (Intel) Tested-by: Tadeusz Struk Tested-by: Zhang Qiao Tested-by: Dietmar Eggemann Link: https://lkml.kernel.org/r/YgoeCbwj5mbCR0qA@hirez.programming.kicks-ass.net --- kernel/fork.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) (limited to 'kernel/fork.c') diff --git a/kernel/fork.c b/kernel/fork.c index d75a528f7b21..c607d238fc23 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2266,6 +2266,17 @@ static __latent_entropy struct task_struct *copy_process( if (retval) goto bad_fork_put_pidfd; + /* + * Now that the cgroups are pinned, re-clone the parent cgroup and put + * the new task on the correct runqueue. All this *before* the task + * becomes visible. + * + * This isn't part of ->can_fork() because while the re-cloning is + * cgroup specific, it unconditionally needs to place the task on a + * runqueue. + */ + sched_cgroup_fork(p, args); + /* * From this point on we must avoid any synchronous user-space * communication until we take the tasklist-lock. In particular, we do @@ -2376,7 +2387,7 @@ static __latent_entropy struct task_struct *copy_process( write_unlock_irq(&tasklist_lock); proc_fork_connector(p); - sched_post_fork(p, args); + sched_post_fork(p); cgroup_post_fork(p, args); perf_event_fork(p); -- cgit From 5c26f6ac9416b63d093e29c30e79b3297e425472 Mon Sep 17 00:00:00 2001 From: Suren Baghdasaryan Date: Fri, 4 Mar 2022 20:28:51 -0800 Subject: mm: refactor vm_area_struct::anon_vma_name usage code Avoid mixing strings and their anon_vma_name referenced pointers by using struct anon_vma_name whenever possible. This simplifies the code and allows easier sharing of anon_vma_name structures when they represent the same name. [surenb@google.com: fix comment] Link: https://lkml.kernel.org/r/20220223153613.835563-1-surenb@google.com Link: https://lkml.kernel.org/r/20220224231834.1481408-1-surenb@google.com Signed-off-by: Suren Baghdasaryan Suggested-by: Matthew Wilcox Suggested-by: Michal Hocko Acked-by: Michal Hocko Cc: Colin Cross Cc: Sumit Semwal Cc: Dave Hansen Cc: Kees Cook Cc: "Kirill A. Shutemov" Cc: Vlastimil Babka Cc: Johannes Weiner Cc: "Eric W. Biederman" Cc: Christian Brauner Cc: Alexey Gladkov Cc: Sasha Levin Cc: Chris Hyser Cc: Davidlohr Bueso Cc: Peter Collingbourne Cc: Xiaofeng Cao Cc: David Hildenbrand Cc: Cyrill Gorcunov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'kernel/fork.c') diff --git a/kernel/fork.c b/kernel/fork.c index a024bf6254df..f1e89007f228 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -366,14 +366,14 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) *new = data_race(*orig); INIT_LIST_HEAD(&new->anon_vma_chain); new->vm_next = new->vm_prev = NULL; - dup_vma_anon_name(orig, new); + dup_anon_vma_name(orig, new); } return new; } void vm_area_free(struct vm_area_struct *vma) { - free_vma_anon_name(vma); + free_anon_vma_name(vma); kmem_cache_free(vm_area_cachep, vma); } -- cgit