path: root/fs/namespace.c
Age | Commit message | Author
8 days | userns and mnt_idmap leak in open_tree_attr(2) | Al Viro
Once want_mount_setattr() has returned a positive, it does require finish_mount_kattr() to release ->mnt_userns. Failing do_mount_setattr() does not change that. As the result, we can end up leaking userns and possibly mnt_idmap as well. Fixes: c4a16820d901 ("fs: add open_tree_attr()") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 days | attach_recursive_mnt(): do not lock the covering tree when sliding something under it | Al Viro
If we are propagating across the userns boundary, we need to lock the mounts added there. However, in case when something has already been mounted there and we end up sliding a new tree under that, the stuff that had been there before should not get locked. IOW, lock_mnt_tree() should be called before we reparent the preexisting tree on top of what we are adding. Fixes: 3bd045cc9c4b ("separate copying and locking mount tree on cross-userns copies") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
9 days | replace collect_mounts()/drop_collected_mounts() with a safer variant | Al Viro
collect_mounts() has several problems - one can't iterate over the results directly, so it has to be done with callback passed to iterate_mounts(); it has an oopsable race with d_invalidate(); it creates temporary clones of mounts invisibly for sync umount (IOW, you can have non-lazy umount succeed leaving filesystem not mounted anywhere and yet still busy).

A saner approach is to give caller an array of struct path that would pin every mount in a subtree, without cloning any mounts.

* collect_mounts()/drop_collected_mounts()/iterate_mounts() is gone
* collect_paths(where, preallocated, size) gives either ERR_PTR(-E...) or a pointer to array of struct path, one for each chunk of tree visible under 'where' (i.e. the first element is a copy of where, followed by (mount,root) for everything mounted under it - the same set collect_mounts() would give). Unlike collect_mounts(), the mounts are *not* cloned - we just get pinning references to the roots of subtrees in the caller's namespace. Array is terminated by {NULL, NULL} struct path. If it fits into preallocated array (on-stack, normally), that's where it goes; otherwise it's allocated by kmalloc_array(). Passing 0 as size means that 'preallocated' is ignored (and expected to be NULL).
* drop_collected_paths(paths, preallocated) is given the array returned by an earlier call of collect_paths() and the preallocated array passed to that call. All mount/dentry references are dropped and array is kfree'd if it's not equal to 'preallocated'.
* instead of iterate_mounts(), users should just iterate over array of struct path - nothing exotic is needed for that.

Existing users (all in audit_tree.c) are converted.

[folded a fix for braino reported by Venkat Rao Bagalkote <venkat88@linux.ibm.com>]

Fixes: 80b5dce8c59b0 ("vfs: Add a function to lazily unmount all mounts from any dentry")
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
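The calling convention described above (use a caller-supplied array when the result fits, fall back to the heap otherwise, terminate with an all-NULL entry) can be modelled in userspace. This is only a sketch of the pattern, not the kernel implementation: `struct demo_path`, `demo_collect()`, `demo_drop()` and `demo_count()` are hypothetical names, and allocation failure is reported as NULL where the kernel API returns ERR_PTR().

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct path (mount + dentry). */
struct demo_path {
    const char *mnt;
    const char *dentry;
};

/*
 * Copy n entries plus a {NULL, NULL} terminator into 'prealloc' if
 * n + 1 slots fit into 'size'; otherwise allocate the array on the heap.
 * size == 0 means 'prealloc' is ignored. Returns NULL on allocation
 * failure (assumption: the kernel API returns ERR_PTR() instead).
 */
static struct demo_path *demo_collect(const struct demo_path *src, size_t n,
                                      struct demo_path *prealloc, size_t size)
{
    struct demo_path *res = prealloc;

    if (n + 1 > size) {
        res = calloc(n + 1, sizeof(*res));
        if (!res)
            return NULL;
    }
    memcpy(res, src, n * sizeof(*res));
    res[n] = (struct demo_path){ NULL, NULL };
    return res;
}

/* Free the array only if it is not the caller's preallocated one. */
static void demo_drop(struct demo_path *paths, struct demo_path *prealloc)
{
    if (paths != prealloc)
        free(paths);
}

/* Plain iteration up to the terminator; no callback is needed. */
static size_t demo_count(const struct demo_path *paths)
{
    size_t n = 0;

    for (const struct demo_path *p = paths; p->mnt; p++)
        n++;
    return n;
}
```

A caller would normally pass a small on-stack array and hand the same array back to `demo_drop()`, so the heap is touched only for unusually large subtrees.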
2025-06-08 | Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs | Linus Torvalds
Pull mount fixes from Al Viro:
"Various mount-related bugfixes:

 - split the do_move_mount() checks in subtree-of-our-ns and entire-anon cases and adapt detached mount propagation selftest for mount_setattr
 - allow clone_private_mount() for a path on real rootfs
 - fix a race in call of has_locked_children()
 - fix move_mount propagation graph breakage by MOVE_MOUNT_SET_GROUP
 - make sure clone_private_mnt() caller has CAP_SYS_ADMIN in the right userns
 - avoid false negatives in path_overmount()
 - don't leak MNT_LOCKED from parent to child in finish_automount()
 - do_change_type(): refuse to operate on unmounted/not ours mounts"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  do_change_type(): refuse to operate on unmounted/not ours mounts
  clone_private_mnt(): make sure that caller has CAP_SYS_ADMIN in the right userns
  selftests/mount_setattr: adapt detached mount propagation test
  do_move_mount(): split the checks in subtree-of-our-ns and entire-anon cases
  fs: allow clone_private_mount() for a path on real rootfs
  fix propagation graph breakage by MOVE_MOUNT_SET_GROUP move_mount(2)
  finish_automount(): don't leak MNT_LOCKED from parent to child
  path_overmount(): avoid false negatives
  fs/fhandle.c: fix a race in call of has_locked_children()
2025-06-07 | do_change_type(): refuse to operate on unmounted/not ours mounts | Al Viro
Ensure that propagation settings can only be changed for mounts located in the caller's mount namespace. This change aligns permission checking with the rest of mount(2). Reviewed-by: Christian Brauner <brauner@kernel.org> Fixes: 07b20889e305 ("beginning of the shared-subtree proper") Reported-by: "Orlando, Noah" <Noah.Orlando@deshaw.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-07 | clone_private_mnt(): make sure that caller has CAP_SYS_ADMIN in the right userns | Al Viro
What we want to verify there is that clone won't expose something hidden by a mount we wouldn't be able to undo. "Wouldn't be able to undo" may be a result of MNT_LOCKED on a child, but it may also come from lacking admin rights in the userns of the namespace mount belongs to. clone_private_mnt() checks the former, but not the latter. There's a number of rather confusing CAP_SYS_ADMIN checks in various userns during the mount, especially with the new mount API; they serve different purposes and in case of clone_private_mnt() they usually, but not always, end up covering the missing check mentioned above. Reviewed-by: Christian Brauner <brauner@kernel.org> Reported-by: "Orlando, Noah" <Noah.Orlando@deshaw.com> Fixes: 427215d85e8d ("ovl: prevent private clone if bind mount is not allowed") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-07 | do_move_mount(): split the checks in subtree-of-our-ns and entire-anon cases | Al Viro
... and fix the breakage in anon-to-anon case. There are two cases acceptable for do_move_mount() and mixing checks for those is making things hard to follow.

One case is move of a subtree in caller's namespace.
* source and destination must be in caller's namespace
* source must be detachable from parent

Another is moving the entire anon namespace elsewhere
* source must be the root of anon namespace
* target must either be in caller's namespace or in a suitable anon namespace (see may_use_mount() for details).
* target must not be in the same namespace as source.

It's really easier to follow if tests are *not* mixed together...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Fixes: 3b5260d12b1f ("Don't propagate mounts into detached trees")
Reported-by: Allison Karlitskaya <lis@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
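The two acceptable cases listed above can be written as two separate predicates, which is the point of the commit. The following is a simplified userspace model, not the kernel code: `struct demo_ns`, `struct demo_mnt` and the `demo_*` functions are hypothetical, and the `anon_target_ok` flag stands in for whatever may_use_mount() actually checks.

```c
#include <stdbool.h>

/* Hypothetical model of a mount namespace: only "anonymous or not". */
struct demo_ns {
    bool anon;
};

/* Hypothetical model of a mount. */
struct demo_mnt {
    const struct demo_ns *ns;   /* namespace the mount lives in */
    bool ns_root;               /* is it the root of that namespace? */
    bool detachable;            /* can it be detached from its parent? */
};

/* Case 1: move of a subtree inside the caller's namespace. */
static bool demo_may_move_subtree(const struct demo_ns *caller,
                                  const struct demo_mnt *from,
                                  const struct demo_mnt *to)
{
    if (from->ns != caller || to->ns != caller)
        return false;
    return from->detachable;
}

/* Case 2: move of an entire anon namespace somewhere else. */
static bool demo_may_move_anon(const struct demo_ns *caller,
                               const struct demo_mnt *from,
                               const struct demo_mnt *to,
                               bool anon_target_ok)
{
    if (!from->ns->anon || !from->ns_root)
        return false;           /* source must be the root of an anon ns */
    if (to->ns == from->ns)
        return false;           /* target must be in a different ns */
    if (to->ns == caller)
        return true;
    return to->ns->anon && anon_target_ok;
}

/* Pick exactly one set of checks instead of mixing them. */
static bool demo_may_move(const struct demo_ns *caller,
                          const struct demo_mnt *from,
                          const struct demo_mnt *to,
                          bool anon_target_ok)
{
    if (from->ns == caller)
        return demo_may_move_subtree(caller, from, to);
    return demo_may_move_anon(caller, from, to, anon_target_ok);
}
```

Dispatching on where the source lives keeps each branch short enough to compare directly with the bullet list in the commit message.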
2025-06-07 | fs: allow clone_private_mount() for a path on real rootfs | KONDO KAZUMA(近藤 和真)
Mounting overlayfs with a directory on real rootfs (initramfs) as upperdir has failed with following message since commit db04662e2f4f ("fs: allow detached mounts in clone_private_mount()"). [ 4.080134] overlayfs: failed to clone upperpath Overlayfs mount uses clone_private_mount() to create internal mount for the underlying layers. The commit made clone_private_mount() reject real rootfs because it does not have a parent mount and is in the initial mount namespace, that is not an anonymous mount namespace. This issue can be fixed by modifying the permission check of clone_private_mount() following [1]. Reviewed-by: Christian Brauner <brauner@kernel.org> Fixes: db04662e2f4f ("fs: allow detached mounts in clone_private_mount()") Link: https://lore.kernel.org/all/20250514190252.GQ2023217@ZenIV/ [1] Link: https://lore.kernel.org/all/20250506194849.GT2023217@ZenIV/ Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kazuma Kondo <kazuma-kondo@nec.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-07 | fix propagation graph breakage by MOVE_MOUNT_SET_GROUP move_mount(2) | Al Viro
9ffb14ef61ba "move_mount: allow to add a mount into an existing group" breaks assertions on ->mnt_share/->mnt_slave. For once, the data structures in question are actually documented.

Documentation/filesystem/sharedsubtree.rst:
        All vfsmounts in a peer group have the same ->mnt_master. If it is non-NULL, they form a contiguous (ordered) segment of slave list.

do_set_group() puts a mount into the same place in propagation graph as the old one. As the result, if old mount gets events from somewhere and is not a pure event sink, new one needs to be placed next to the old one in the slave list the old one's on. If it is a pure event sink, we only need to make sure the new one doesn't end up in the middle of some peer group.

"move_mount: allow to add a mount into an existing group" ends up putting the new one in the beginning of list; that's definitely not going to be in the middle of anything, so that's fine for case when old is not marked shared. In case when old one _is_ marked shared (i.e. is not a pure event sink), that breaks the assumptions of propagation graph iterators.

Put the new mount next to the old one on the list - that does the right thing in "old is marked shared" case and is just as correct as the current behaviour if old is not marked shared (kudos to Pavel for pointing that out - my original suggested fix changed behaviour in the "not marked" case, which complicated things for no good reason).

Reviewed-by: Christian Brauner <brauner@kernel.org>
Fixes: 9ffb14ef61ba ("move_mount: allow to add a mount into an existing group")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
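The contiguity rule quoted from the documentation can be illustrated with a toy slave list in which each entry carries only a peer-group id. This is an assumption-laden model (an array instead of the kernel's linked lists; `struct demo_slave_list`, `demo_insert_at()` and `demo_contiguous()` are hypothetical) that shows why inserting at the head can split a peer group while inserting next to the old mount cannot.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX 8

/* Toy slave list: each slot holds the peer-group id of one mount. */
struct demo_slave_list {
    int group[DEMO_MAX];
    size_t n;
};

/* Insert a mount of the given peer group at index 'pos'. */
static bool demo_insert_at(struct demo_slave_list *l, size_t pos, int group)
{
    if (l->n >= DEMO_MAX || pos > l->n)
        return false;
    memmove(&l->group[pos + 1], &l->group[pos],
            (l->n - pos) * sizeof(l->group[0]));
    l->group[pos] = group;
    l->n++;
    return true;
}

/* True if every peer group occupies one contiguous segment of the list. */
static bool demo_contiguous(const struct demo_slave_list *l)
{
    for (size_t i = 0; i < l->n; i++)
        for (size_t j = i + 1; j < l->n; j++)
            if (l->group[j] == l->group[i] &&
                l->group[j - 1] != l->group[i])
                return false;
    return true;
}
```

With a list of groups `2, 1, 1, 3` and an old mount from group 1 at index 1, adding a new group-1 mount at index 0 leaves a group-2 entry between peers, whereas adding it at index 2 keeps the segment intact.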
2025-06-07 | path_overmount(): avoid false negatives | Al Viro
Holding namespace_sem is enough to make sure that result remains valid. It is *not* enough to avoid false negatives from __lookup_mnt(). Mounts can be unhashed outside of namespace_sem (stuck children getting detached on final mntput() of lazy-umounted mount) and having an unrelated mount removed from the hash chain while we traverse it may end up with false negative from __lookup_mnt(). We need to sample and recheck the seqlock component of mount_lock... Bug predates the introduction of path_overmount() - it had come from the code in finish_automount() that got abstracted into that helper. Reviewed-by: Christian Brauner <brauner@kernel.org> Fixes: 26df6034fdb2 ("fix automount/automount race properly") Fixes: 6ac392815628 ("fs: allow to mount beneath top mount") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
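The "sample and recheck" idea can be sketched with a minimal sequence counter built on C11 atomics. This is a userspace illustration of the general seqlock read pattern, not the kernel's mount_lock API; all `demo_*` names are hypothetical, and `demo_racy_lookup()` exists only to simulate a concurrent unhash during the first traversal.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal sequence counter: odd while a writer is active. */
struct demo_seq {
    atomic_uint seq;
};

static unsigned demo_read_begin(struct demo_seq *s)
{
    unsigned v;

    /* Wait until no writer is in progress, then remember the value. */
    while ((v = atomic_load_explicit(&s->seq, memory_order_acquire)) & 1)
        ;
    return v;
}

static bool demo_read_retry(struct demo_seq *s, unsigned start)
{
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&s->seq, memory_order_relaxed) != start;
}

static void demo_write_begin(struct demo_seq *s) { atomic_fetch_add(&s->seq, 1); }
static void demo_write_end(struct demo_seq *s)   { atomic_fetch_add(&s->seq, 1); }

typedef bool (*demo_lookup_fn)(void *arg);

/* Repeat the lookup until it ran without any writer interfering. */
static bool demo_stable_lookup(struct demo_seq *s, demo_lookup_fn fn, void *arg)
{
    unsigned start;
    bool found;

    do {
        start = demo_read_begin(s);
        found = fn(arg);
    } while (demo_read_retry(s, start));
    return found;
}

/* Test helper: the first call races with a "writer" and misses the entry. */
struct demo_state {
    struct demo_seq *seq;
    int calls;
};

static bool demo_racy_lookup(void *arg)
{
    struct demo_state *st = arg;

    if (st->calls++ == 0) {
        demo_write_begin(st->seq);   /* simulated concurrent unhash */
        demo_write_end(st->seq);
        return false;                /* false negative */
    }
    return true;
}
```

The first pass returns a false negative, but because the counter changed underneath it the loop retries and the second pass reports the entry.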
2025-06-07 | fs/fhandle.c: fix a race in call of has_locked_children() | Al Viro
may_decode_fh() is calling has_locked_children() while holding no locks. That's an oopsable race... The rest of the callers are safe since they are holding namespace_sem and are guaranteed a positive refcount on the mount in question. Rename the current has_locked_children() to __has_locked_children(), make it static and switch the fs/namespace.c users to it. Make has_locked_children() a wrapper for __has_locked_children(), calling the latter under read_seqlock_excl(&mount_lock). Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Fixes: 620c266f3949 ("fhandle: relax open_by_handle_at() permission checks") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
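The rename-and-wrap approach described above follows a common pattern: keep an unlocked helper for callers that already hold a suitable lock, and expose a wrapper that takes the lock itself. The sketch below models it in userspace with a trivial spinlock; the `demo_*` names are hypothetical and the child-list walk is far simpler than what the kernel helper inspects.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a mount with a singly linked child list. */
struct demo_mnt {
    bool locked;                /* models MNT_LOCKED */
    struct demo_mnt *next;      /* next sibling */
    struct demo_mnt *children;  /* first child */
};

/* Toy spinlock standing in for mount_lock. */
static atomic_flag demo_mount_lock = ATOMIC_FLAG_INIT;

static void demo_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&demo_mount_lock,
                                             memory_order_acquire))
        ;
}

static void demo_unlock(void)
{
    atomic_flag_clear_explicit(&demo_mount_lock, memory_order_release);
}

/* For callers that already hold a lock keeping the child list stable. */
static bool demo_has_locked_children_nolock(const struct demo_mnt *m)
{
    for (const struct demo_mnt *c = m->children; c; c = c->next)
        if (c->locked)
            return true;
    return false;
}

/* Safe entry point for callers that hold no locks at all. */
static bool demo_has_locked_children(const struct demo_mnt *m)
{
    bool res;

    demo_lock();
    res = demo_has_locked_children_nolock(m);
    demo_unlock();
    return res;
}
```

Keeping the unlocked variant file-local (static) is what prevents outside callers from reaching it without the lock.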
2025-05-30 | Merge tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs | Linus Torvalds
Pull automount updates from Al Viro:
"Automount wart removal

A bunch of odd boilerplate gone from instances - the reason for those was the need to protect the yet-to-be-attached mount from mark_mounts_for_expiry() deciding to take it out. But that's easy to detect and take care of in mark_mounts_for_expiry() itself; no need to have every instance simulate mount being busy by grabbing an extra reference to it, with finish_automount() undoing that once it attaches that mount.

Should've done it that way from the very beginning... This is a flagday change, thankfully there are very few instances.

vfs_submount() is gone - its sole remaining user (trace_automount) had been switched to saner primitives"

* tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kill vfs_submount()
  saner calling conventions for ->d_automount()
2025-05-30 | Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs | Linus Torvalds
Pull mount propagation fix from Al Viro: "6.15 allowed mount propagation to destinations in detached trees; unfortunately, that breaks existing userland, so the old behaviour needs to be restored. It's not exactly a revert - the original behaviour had a bug, where existence of detached tree might disrupt propagation between locations not in detached trees. Thankfully, userland did not depend upon that bug, so we want to keep the fix" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: Don't propagate mounts into detached trees
2025-05-26 | Don't propagate mounts into detached trees | Al Viro
All versions up to 6.14 did not propagate mount events into detached tree. Shortly after 6.14 a merge of vfs-6.15-rc1.mount.namespace (130e696aa68b) has changed that. Unfortunately, that has caused userland regressions (reported in https://lore.kernel.org/all/CAOYeF9WQhFDe+BGW=Dp5fK8oRy5AgZ6zokVyTj1Wp4EUiYgt4w@mail.gmail.com/) Straight revert wouldn't be an option - in particular, the variant in 6.14 had a bug that got fixed in d1ddc6f1d9f0 ("fix IS_MNT_PROPAGATING uses") and we don't want to bring the bug back. This is a modification of manual revert posted by Christian, with changes needed to avoid reintroducing the breakage in scenario described in d1ddc6f1d9f0. Cc: stable@vger.kernel.org Reported-by: Allison Karlitskaya <lis@redhat.com> Tested-by: Allison Karlitskaya <lis@redhat.com> Acked-by: Christian Brauner <brauner@kernel.org> Co-developed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-26 | Merge tag 'vfs-6.16-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs | Linus Torvalds
Pull vfs mount updates from Christian Brauner:
"This contains minor mount updates for this cycle:

 - mnt->mnt_devname can never be NULL so simplify the code handling that case
 - Add a comment about concurrent changes during statmount() and listmount()
 - Update the STATMOUNT_SUPPORTED macro
 - Convert mount flags to an enum"

* tag 'vfs-6.16-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  statmount: update STATMOUNT_SUPPORTED macro
  fs: convert mount flags to enum
  ->mnt_devname is never NULL
  mount: add a comment about concurrent changes with statmount()/listmount()
2025-05-23 | statmount: update STATMOUNT_SUPPORTED macro | Dmitry V. Levin
According to commit 8f6116b5b77b ("statmount: add a new supported_mask field"), STATMOUNT_SUPPORTED macro shall be updated whenever a new flag is added. Fixes: 7a54947e727b ("Merge patch series "fs: allow changing idmappings"") Signed-off-by: "Dmitry V. Levin" <ldv@strace.io> Link: https://lore.kernel.org/20250511224953.GA17849@strace.io Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-23 | ->mnt_devname is never NULL | Al Viro
Not since 8f2918898eb5 "new helpers: vfs_create_mount(), fc_mount()" back in 2018. Get rid of the dead checks... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/20250421033509.GV2023217@ZenIV Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-23 | mount: add a comment about concurrent changes with statmount()/listmount() | Christian Brauner
Add some comments in there highlighting a few non-obvious assumptions. Link: https://lore.kernel.org/20250416-zerknirschen-aluminium-14a55639076f@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09 | fix IS_MNT_PROPAGATING uses | Al Viro
propagate_mnt() does not attach anything to mounts created during propagate_mnt() itself. What's more, anything on ->mnt_slave_list of such new mount must also be new, so we don't need to even look there.

When move_mount() had been introduced, we've got an additional class of mounts to skip - if we are moving from anon namespace, we do not want to propagate to mounts we are moving (i.e. all mounts in that anon namespace).

Unfortunately, the part about "everything on their ->mnt_slave_list will also be ignorable" is not true - if we have propagation graph
        A -> B -> C
and do OPEN_TREE_CLONE open_tree() of B, we get
        A -> [B <-> B'] -> C
as propagation graph, where B' is a clone of B in our detached tree. Making B private will result in
        A -> B' -> C
C still gets propagation from A, as it would after making B private if we hadn't done that open_tree(), but now the propagation goes through B'. Trying to move_mount() our detached tree on subdirectory in A should have
        * moved B' on that subdirectory in A
        * skipped the corresponding subdirectory in B' itself
        * copied B' on the corresponding subdirectory in C.
As it is, the logic in propagation_next() and friends ends up skipping propagation into C, since it doesn't consider anything downstream of B'.

IOW, walking the propagation graph should only skip the ->mnt_slave_list of new mounts; the only places where the check for "in that one anon namespace" are applicable are propagate_one() (where we should treat that as the same kind of thing as "mountpoint we are looking at is not visible in the mount we are looking at") and propagation_would_overmount(). The latter is better dealt with in the caller (can_move_mount_beneath()); on the first call of propagation_would_overmount() the test is always false, on the second it is always true in "move from anon namespace" case and always false in "move within our namespace" one, so it's easier to just use check_mnt() before bothering with the second call and be done with that.

Fixes: 064fe6e233e8 ("mount: handle mount propagation for detached mount trees")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-09 | do_move_mount(): don't leak MNTNS_PROPAGATING on failures | Al Viro
As it is, a failed move_mount(2) from anon namespace breaks all further propagation into that namespace, including normal mounts in non-anon namespaces that would otherwise propagate there. Fixes: 064fe6e233e8 ("mount: handle mount propagation for detached mount trees") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-09 | do_umount(): add missing barrier before refcount checks in sync case | Al Viro
do_umount() analogue of the race fixed in 119e1ef80ecf "fix __legitimize_mnt()/mntput() race". Here we want to make sure that if __legitimize_mnt() doesn't notice our lock_mount_hash(), we will notice their refcount increment. Harder to hit than mntput_no_expire() one, fortunately, and consequences are milder (sync umount acting like umount -l on a rare race with RCU pathwalk hitting at just the wrong time instead of use-after-free galore mntput_no_expire() counterpart used to be hit). Still a bug... Fixes: 48a066e72d97 ("RCU'd vfsmounts") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-09 | __legitimize_mnt(): check for MNT_SYNC_UMOUNT should be under mount_lock | Al Viro
... or we risk stealing final mntput from sync umount - raising mnt_count after umount(2) has verified that victim is not busy, but before it has set MNT_SYNC_UMOUNT; in that case __legitimize_mnt() doesn't see that it's safe to quietly undo mnt_count increment and leaves dropping the reference to caller, where it'll be a full-blown mntput(). Check under mount_lock is needed; leaving the current one done before taking that makes no sense - it's nowhere near common enough to bother with. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-06 | kill vfs_submount() | Al Viro
The last remaining user of vfs_submount() (tracefs) is easy to convert to fs_context_for_submount(); do that and bury that thing, along with SB_SUBMOUNT Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-05-05 | saner calling conventions for ->d_automount() | Al Viro
Currently the calling conventions for ->d_automount() instances have an odd wart - returned new mount to be attached is expected to have refcount 2. That kludge is intended to make sure that mark_mounts_for_expiry() called before we get around to attaching that new mount to the tree won't decide to take it out. finish_automount() drops the extra reference after it's done with attaching mount to the tree - or drops the reference twice in case of error. ->d_automount() instances have rather counterintuitive boilerplate in them. There's a much simpler approach: have mark_mounts_for_expiry() skip the mounts that are yet to be mounted. And to hell with grabbing/dropping those extra references. Makes for simpler correctness analysis, at that... Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Acked-by: David Howells <dhowells@redhat.com> Tested-by: David Howells <dhowells@redhat.com> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-04-23 | fix a couple of races in MNT_TREE_BENEATH handling by do_move_mount() | Al Viro
Normally do_lock_mount(path, _) is locking a mountpoint pinned by *path and at the time when matching unlock_mount() unlocks that location it is still pinned by the same thing.

Unfortunately, for 'beneath' case it's no longer that simple - the object being locked is not the one *path points to. It's the mountpoint of path->mnt. The thing is, without sufficient locking ->mnt_parent may change under us and none of the locks are held at that point. The rules are
        * mount_lock stabilizes m->mnt_parent for any mount m.
        * namespace_sem stabilizes m->mnt_parent, provided that m is mounted.
        * if either of the above holds and refcount of m is positive, we are guaranteed the same for refcount of m->mnt_parent.

namespace_sem nests inside inode_lock(), so do_lock_mount() has to take inode_lock() before grabbing namespace_sem. It does recheck that path->mnt is still mounted in the same place after getting namespace_sem, and it does take care to pin the dentry. It is needed, since otherwise we might end up with racing mount --move (or umount) happening while we were getting locks; in that case dentry would no longer be a mountpoint and could've been evicted on memory pressure along with its inode - not something you want when grabbing lock on that inode.

However, pinning a dentry is not enough - the matching mount is also pinned only by the fact that path->mnt is mounted on top of it and at that point we are not holding any locks whatsoever, so the same kind of races could end up with all references to that mount gone just as we are about to enter inode_lock(). If that happens, we are left with filesystem being shut down while we are holding a dentry reference on it; results are not pretty.

What we need to do is grab both dentry and mount at the same time; that makes inode_lock() safe *and* avoids the problem with fs getting shut down under us. After taking namespace_sem we verify that path->mnt is still mounted (which stabilizes its ->mnt_parent) and check that it's still mounted at the same place. From that point on to the matching namespace_unlock() we are guaranteed that mount/dentry pair we'd grabbed are also pinned by being the mountpoint of path->mnt, so we can quietly drop both the dentry reference (as the current code does) and mnt one - it's OK to do under namespace_sem, since we are not dropping the final refs.

That solves the problem on do_lock_mount() side; unlock_mount() also has one, since dentry is guaranteed to stay pinned only until the namespace_unlock(). That's easy to fix - just have inode_unlock() done earlier, while it's still pinned by mp->m_dentry.

Fixes: 6ac392815628 "fs: allow to mount beneath top mount" # v6.5+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-11 | fs: use namespace_{lock,unlock} in dissolve_on_fput() | Jan Stancek
In commit b73ec10a4587 ("fs: add fastpath for dissolve_on_fput()"), the namespace_{lock,unlock} has been replaced with scoped_guard using the namespace_sem. This however now also skips processing of 'unmounted' list in namespace_unlock(), and mount is not (immediately) cleaned up.

For example, this causes LTP move_mount02 to fail:
    ...
    move_mount02.c:80: TPASS: invalid-from-fd: move_mount() failed as expected: EBADF (9)
    move_mount02.c:80: TPASS: invalid-from-path: move_mount() failed as expected: ENOENT (2)
    move_mount02.c:80: TPASS: invalid-to-fd: move_mount() failed as expected: EBADF (9)
    move_mount02.c:80: TPASS: invalid-to-path: move_mount() failed as expected: ENOENT (2)
    move_mount02.c:80: TPASS: invalid-flags: move_mount() failed as expected: EINVAL (22)
    tst_test.c:1833: TINFO: === Testing on ext3 ===
    tst_test.c:1170: TINFO: Formatting /dev/loop0 with ext3 opts='' extra opts=''
    mke2fs 1.47.2 (1-Jan-2025)
    /dev/loop0 is apparently in use by the system; will not make a filesystem here!
    tst_test.c:1170: TBROK: mkfs.ext3 failed with exit code 1

The test makes a number of move_mount() calls but these are all designed to fail with specific errno. Even after the test, 'losetup -d' can't detach loop device.

Define a new guard for dissolve_on_fput, that will use namespace_{lock,unlock}.

Fixes: b73ec10a4587 ("fs: add fastpath for dissolve_on_fput()")
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Link: https://lore.kernel.org/cad2f042b886bf0ced3d8e3aff120ec5e0125d61.1744297468.git.jstancek@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-11 | mount: ensure we don't pointlessly walk the mount tree | Christian Brauner
This logic got broken recently. Add it back. Fixes: 474f7825d533 ("fs: add copy_mount_setattr() helper") Link: https://lore.kernel.org/20250409-sektflaschen-gecko-27c021fbd222@brauner Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-03 | fs: actually hold the namespace semaphore | Christian Brauner
Don't use a scoped guard that only protects the next statement. Use a regular guard to make sure that the namespace semaphore is held across the whole function. Signed-off-by: Christian Brauner <brauner@kernel.org> Reported-by: Leon Romanovsky <leon@kernel.org> Link: https://lore.kernel.org/all/20250401170715.GA112019@unreal/ Fixes: db04662e2f4f ("fs: allow detached mounts in clone_private_mount()") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
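The difference between the two guard forms can be reproduced in userspace with the GCC/Clang `cleanup` attribute. The macros below are hypothetical models (`DEMO_GUARD`, `DEMO_SCOPED_GUARD`), not the kernel's guard()/scoped_guard() definitions, and a plain counter stands in for the semaphore; the sketch only shows that a brace-less scoped guard covers a single statement.

```c
/* Counter standing in for the namespace semaphore: > 0 means "held". */
static int demo_depth;

static void demo_lock(void)
{
    demo_depth++;
}

/* Cleanup handler; the argument is the guard variable and is unused. */
static void demo_unlock(int *unused)
{
    (void)unused;
    demo_depth--;
}

/* Regular guard: held until the end of the enclosing block. */
#define DEMO_GUARD() \
    int demo_guard_var __attribute__((cleanup(demo_unlock), unused)) = \
        (demo_lock(), 0)

/* Scoped guard: held only for the statement (or block) that follows. */
#define DEMO_SCOPED_GUARD() \
    for (int demo_once = (demo_lock(), 1), \
             demo_tok __attribute__((cleanup(demo_unlock), unused)) = 0; \
         demo_once; demo_once = 0)

/* Bug pattern: only the first statement runs with the lock held. */
static int demo_buggy(void)
{
    int first = 0, second = 0;

    DEMO_SCOPED_GUARD()
        first = demo_depth;     /* lock held here */
    second = demo_depth;        /* lock already dropped */
    return first * 10 + second;
}

/* Fixed pattern: the lock is held across the whole function body. */
static int demo_fixed(void)
{
    int first, second;

    DEMO_GUARD();
    first = demo_depth;
    second = demo_depth;
    return first * 10 + second;
}
```

`demo_buggy()` observes the lock as held for the first sample only, while `demo_fixed()` observes it for both; in either case the counter is back to zero once the function returns.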
2025-03-28 | fs: namespace: Avoid -Wflex-array-member-not-at-end warning | Gustavo A. R. Silva
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are getting ready to enable it, globally. Move the conflicting declaration to the end of the structure. Notice that `struct statmount` is a flexible structure --a structure that contains a flexible-array member. Fix the following warning: fs/namespace.c:5329:26: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Signed-off-by: "Gustavo A. R. Silva" <gustavoars@kernel.org> Link: https://lore.kernel.org/r/Z-SZKNdCiAkVJvqm@kspp Signed-off-by: Christian Brauner <brauner@kernel.org>
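The layout issue behind this warning can be shown with two small structures. The names below (`struct demo_statmount`, `struct demo_kstat`) are hypothetical and much smaller than the real kernel types; nesting a flexible structure inside another one is itself a GNU C extension, so this is a sketch for GCC or Clang rather than strictly conforming C.

```c
#include <stddef.h>

/* A flexible structure: its last member is a flexible array. */
struct demo_statmount {
    unsigned int size;
    char str[];
};

/*
 * Problematic layout (left as a comment so the sketch builds cleanly):
 *
 *     struct demo_kstat_bad {
 *         struct demo_statmount sm;   // flexible struct in the middle
 *         int tail;                   // shares its address with sm.str[0]
 *     };
 *
 * Fixed layout: the flexible structure is the last member, so nothing in
 * the enclosing structure can overlap with sm.str[].
 */
struct demo_kstat {
    int tail;
    struct demo_statmount sm;
};
```

After the move, the flexible array starts at or beyond the end of every other member, which is what the compiler warning is trying to enforce.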
2025-03-24 | Merge tag 'vfs-6.15-rc1.mount.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs | Linus Torvalds
Pull vfs mount namespace updates from Christian Brauner:
"This expands the ability of anonymous mount namespaces:

- Creating detached mounts from detached mounts

Currently, detached mounts can only be created from attached mounts. This limitation prevents various use-cases. For example, the ability to mount a subdirectory without ever having to make the whole filesystem visible first.

The current permission model is:

(1) Check that the caller is privileged over the owning user namespace of its current mount namespace.

(2) Check that the caller is located in the mount namespace of the mount it wants to create a detached copy of.

While it is not strictly necessary to do it this way it is consistently applied in the new mount api. This model will also be used when allowing the creation of a detached mount from another detached mount.

The (1) requirement can simply be met by performing the same check as for the non-detached case, i.e., verify that the caller is privileged over its current mount namespace.

To meet the (2) requirement it must be possible to infer the origin mount namespace that the anonymous mount namespace of the detached mount was created from.

The origin mount namespace of an anonymous mount is the mount namespace that the mounts that were copied into the anonymous mount namespace originate from.

In order to check the origin mount namespace of an anonymous mount namespace the sequence number of the original mount namespace is recorded in the anonymous mount namespace.

With this in place it is possible to perform an equivalent check (2') to (2). The origin mount namespace of the anonymous mount namespace must be the same as the caller's mount namespace. To establish this the sequence number of the caller's mount namespace and the origin sequence number of the anonymous mount namespace are compared.
The caller is always located in a non-anonymous mount namespace since anonymous mount namespaces cannot be setns()ed into. The caller's mount namespace will thus always have a valid sequence number.

The owning namespace of any mount namespace, anonymous or non-anonymous, can never change. A mount attached to a non-anonymous mount namespace can never change mount namespace.

If the sequence number of the non-anonymous mount namespace and the origin sequence number of the anonymous mount namespace match, the owning namespaces must match as well.

Hence, the capability check on the owning namespace of the caller's mount namespace ensures that the caller has the ability to copy the mount tree.

- Allow mounting detached mounts on detached mounts

Currently, detached mounts can only be mounted onto attached mounts. This limitation makes it impossible to assemble a new private rootfs and move it into place. Instead, a detached tree must be created, attached, then mounted open and then either moved or detached again. Lift this restriction.

In order to allow mounting detached mounts onto other detached mounts the same permission model used for creating detached mounts from detached mounts can be used (cf. above).

Allowing to mount detached mounts onto detached mounts leaves three cases to consider:

(1) The source mount is an attached mount and the target mount is a detached mount. This would be equivalent to moving a mount between different mount namespaces. A caller could move an attached mount to a detached mount. The detached mount can now be freely attached to any mount namespace. This changes the current delegation model significantly for no good reason. So this will fail.

(2) Anonymous mount namespaces are always attached fully, i.e., it is not possible to only attach a subtree of an anonymous mount namespace. This simplifies the implementation and reasoning.
Consequently, if the anonymous mount namespace of the source detached mount and the target detached mount are identical, the mount request will fail.

(3) The source mount's anonymous mount namespace is different from the target mount's anonymous mount namespace. In this case the source anonymous mount namespace of the source mount tree must be freed after its mounts have been moved to the target anonymous mount namespace. The source anonymous mount namespace must be empty afterwards.

By allowing to mount detached mounts onto detached mounts a caller may do the following:

    fd_tree1 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE)
    fd_tree2 = open_tree(-EBADF, "/tmp", OPEN_TREE_CLONE)

fd_tree1 and fd_tree2 refer to two different detached mount trees that belong to two different anonymous mount namespaces.

It is important to note that fd_tree1 and fd_tree2 both refer to the root of their respective anonymous mount namespaces.

By allowing to mount detached mounts onto detached mounts the caller may now do:

    move_mount(fd_tree1, "", fd_tree2, "", MOVE_MOUNT_F_EMPTY_PATH | MOVE_MOUNT_T_EMPTY_PATH)

This will cause the detached mount referred to by fd_tree1 to be mounted on top of the detached mount referred to by fd_tree2.

Thus, the detached mount fd_tree1 is moved from its separate anonymous mount namespace into fd_tree2's anonymous mount namespace.

It also means that while fd_tree2 continues to refer to the root of its respective anonymous mount namespace fd_tree1 doesn't anymore.

This has the consequence that only fd_tree2 can be moved to another anonymous or non-anonymous mount namespace. Moving fd_tree1 will now fail as fd_tree1 doesn't refer to the root of an anonymous mount namespace anymore.

Now fd_tree1 and fd_tree2 refer to separate detached mount trees referring to the same anonymous mount namespace. This is conceptually fine.
The new mount api does allow for this to happen already via: mount -t tmpfs tmpfs /mnt mkdir -p /mnt/A mount -t tmpfs tmpfs /mnt/A fd_tree3 = open_tree(-EBADF, "/mnt", OPEN_TREE_CLONE | AT_RECURSIVE) fd_tree4 = open_tree(-EBADF, "/mnt/A", 0) Both fd_tree3 and fd_tree4 refer to two different detached mount trees but both detached mount trees refer to the same anonymous mount namespace. And as with fd_tree1 and fd_tree2, only fd_tree3 may be moved to another mount namespace as fd_tree3 refers to the root of the anonymous mount namespace while fd_tree4 doesn't. However, there's an important difference between the fd_tree3/fd_tree4 and the fd_tree1/fd_tree2 example. Closing fd_tree4 and releasing the respective struct file will have no further effect on fd_tree3's detached mount tree. However, closing fd_tree3 will cause the mount tree and the respective anonymous mount namespace to be destroyed, causing the detached mount tree of fd_tree4 to be invalid for further mounting. By allowing detached mounts to be mounted on detached mounts as in the fd_tree1/fd_tree2 example both struct files will affect each other. Both fd_tree1 and fd_tree2 refer to struct files that have FMODE_NEED_UNMOUNT set. To handle this we use the fact that @fd_tree1 will have a parent mount once it has been attached to @fd_tree2. When dissolve_on_fput() is called the mount that has been passed in will refer to the root of the anonymous mount namespace. If it doesn't it would mean that mounts are leaked. So before allowing detached mounts to be mounted onto detached mounts this would be a bug. Now that detached mounts can be mounted onto detached mounts it just means that the mount has been attached to another anonymous mount namespace and thus dissolve_on_fput() must not unmount the mount tree or free the anonymous mount namespace as the file referring to the root of the namespace hasn't been closed yet. 
If it had been closed already it would be obvious because the mount namespace would be NULL, i.e., the @fd_tree1 would have already been unmounted. If @fd_tree1 hasn't been unmounted yet and has a parent mount it is safe to skip any cleanup as closing @fd_tree2 will take care of all cleanup operations. - Allow mount propagation for detached mount trees In commit ee2e3f50629f ("mount: fix mounting of detached mounts onto targets that reside on shared mounts") I fixed a bug where propagating the source mount tree of an anonymous mount namespace into a target mount tree of a non-anonymous mount namespace could be used to trigger an integer overflow in the non-anonymous mount namespace causing any new mounts to fail. The cause of this was that the propagation algorithm was unable to recognize mounts from the source mount tree that were already propagated into the target mount tree and then reappeared as propagation targets when walking the destination propagation mount tree. When fixing this I disabled mount propagation into anonymous mount namespaces. Make it possible for anonymous mount namespaces to receive mount propagation events correctly. This is also a correctness issue now that we allow mounting detached mount trees onto detached mount trees. 
Mark the source anonymous mount namespace with MNTNS_PROPAGATING indicating that all mounts belonging to this mount namespace are currently in the process of being propagated and make the propagation algorithm discard those if they appear as propagation targets" * tag 'vfs-6.15-rc1.mount.namespace' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (21 commits) selftests: test subdirectory mounting selftests: add test for detached mount tree propagation fs: namespace: fix uninitialized variable use mount: handle mount propagation for detached mount trees fs: allow creating detached mounts from fsmount() file descriptors selftests: seventh test for mounting detached mounts onto detached mounts selftests: sixth test for mounting detached mounts onto detached mounts selftests: fifth test for mounting detached mounts onto detached mounts selftests: fourth test for mounting detached mounts onto detached mounts selftests: third test for mounting detached mounts onto detached mounts selftests: second test for mounting detached mounts onto detached mounts selftests: first test for mounting detached mounts onto detached mounts fs: mount detached mounts onto detached mounts fs: support getname_maybe_null() in move_mount() selftests: create detached mounts from detached mounts fs: create detached mounts from detached mounts fs: add may_copy_tree() fs: add fastpath for dissolve_on_fput() fs: add assert for move_mount() fs: add mnt_ns_empty() helper ...
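The cleanup decision described above for dissolve_on_fput() can be modelled in userspace as a small decision function. This is only a sketch of the reasoning in the pull message, not the kernel implementation: the demo_* types, field names and enum values are hypothetical stand-ins, and locking is ignored entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are NOT the kernel's structures. */
struct demo_mnt_ns {
	bool anonymous;              /* true for an anonymous mount namespace */
};

struct demo_mount {
	struct demo_mount *parent;   /* NULL while the mount is a namespace root */
	struct demo_mnt_ns *ns;      /* NULL once the mount has been unmounted */
};

enum demo_fput_action {
	DEMO_NOTHING_TO_DO,          /* already unmounted, or attached to a regular namespace */
	DEMO_SKIP_HAS_PARENT,        /* mounted onto another detached tree; that tree's file cleans up */
	DEMO_DISSOLVE,               /* root of an anonymous namespace: unmount and free it */
};

/* Decide what closing a file with FMODE_NEED_UNMOUNT would have to do. */
static enum demo_fput_action demo_dissolve_on_fput(const struct demo_mount *m)
{
	assert(m != NULL);

	if (m->ns == NULL)           /* the tree was already unmounted */
		return DEMO_NOTHING_TO_DO;
	if (!m->ns->anonymous)       /* attached to the filesystem hierarchy */
		return DEMO_NOTHING_TO_DO;
	if (m->parent != NULL)       /* no longer the root of its anonymous namespace */
		return DEMO_SKIP_HAS_PARENT;
	return DEMO_DISSOLVE;
}
```

In the fd_tree1/fd_tree2 example, fd_tree1 gains a parent once it is moved onto fd_tree2, so the model returns DEMO_SKIP_HAS_PARENT for it and DEMO_DISSOLVE for fd_tree2.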
2025-03-24Merge tag 'vfs-6.15-rc1.mount' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs mount updates from Christian Brauner: - Mount notifications The day has come where we finally provide a new api to listen for mount topology changes outside of /proc/<pid>/mountinfo. A mount namespace file descriptor can be supplied and registered with fanotify to listen for mount topology changes. Currently notifications for mount, umount and moving mounts are generated. The generated notification record contains the unique mount id of the mount. The listmount() and statmount() api can be used to query detailed information about the mount using the received unique mount id. This allows userspace to figure out exactly how the mount topology changed without having to generate diffs of /proc/<pid>/mountinfo in userspace. - Support O_PATH file descriptors with FSCONFIG_SET_FD in the new mount api - Support detached mounts in overlayfs Since last cycle we support specifying overlayfs layers via file descriptors. However, we don't allow detached mounts which means userspace cannot use file descriptors received via open_tree(OPEN_TREE_CLONE) and fsmount() directly. They have to attach them to a mount namespace via move_mount() first. This is cumbersome and means they have to undo mounts via umount(). Allow them to directly use detached mounts. - Allow retrieving idmappings with statmount Currently it isn't possible to figure out what idmapping has been attached to an idmapped mount. Add an extension to statmount() which allows reading the idmapping from the mount. - Allow creating idmapped mounts from mounts that are already idmapped So far it isn't possible to allow the creation of idmapped mounts from already idmapped mounts as this has significant lifetime implications. Make the creation of idmapped mounts atomic by allowing struct mount_attr to be passed together with the open_tree_attr() system call, which solves these issues without complicating VFS lookup in any way. 
The system call has in general the benefit that creating a detached mount and applying mount attributes to it becomes an atomic operation for userspace. - Add a way to query statmount() for supported options Allow userspace to query which mount information can be retrieved through statmount(). - Allow superblock owners to force unmount * tag 'vfs-6.15-rc1.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (21 commits) umount: Allow superblock owners to force umount selftests: add tests for mount notification selinux: add FILE__WATCH_MOUNTNS samples/vfs: fix printf format string for size_t fs: allow changing idmappings fs: add kflags member to struct mount_kattr fs: add open_tree_attr() fs: add copy_mount_setattr() helper fs: add vfs_open_tree() helper statmount: add a new supported_mask field samples/vfs: add STATMOUNT_MNT_{G,U}IDMAP selftests: add tests for using detached mount with overlayfs samples/vfs: check whether flag was raised statmount: allow to retrieve idmappings uidgid: add map_id_range_up() fs: allow detached mounts in clone_private_mount() selftests/overlayfs: test specifying layers as O_PATH file descriptors fs: support O_PATH fds with FSCONFIG_SET_FD vfs: add notifications for mount attach and detach fanotify: notify on mount attach and detach ...
2025-03-19umount: Allow superblock owners to force umountTrond Myklebust
Loosen the permission check on forced umount to allow users holding CAP_SYS_ADMIN privileges in namespaces that are privileged with respect to the userns that originally mounted the filesystem. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Link: https://lore.kernel.org/r/12f212d4ef983714d065a6bb372fbb378753bf4c.1742315194.git.trond.myklebust@hammerspace.com Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04mount: handle mount propagation for detached mount treesChristian Brauner
In commit ee2e3f50629f ("mount: fix mounting of detached mounts onto targets that reside on shared mounts") I fixed a bug where propagating the source mount tree of an anonymous mount namespace into a target mount tree of a non-anonymous mount namespace could be used to trigger an integer overflow in the non-anonymous mount namespace causing any new mounts to fail. The cause of this was that the propagation algorithm was unable to recognize mounts from the source mount tree that were already propagated into the target mount tree and then reappeared as propagation targets when walking the destination propagation mount tree. When fixing this I disabled mount propagation into anonymous mount namespaces. Make it possible for anonymous mount namespaces to receive mount propagation events correctly. This is also a correctness issue now that we allow mounting detached mount trees onto detached mount trees. Mark the source anonymous mount namespace with MNTNS_PROPAGATING indicating that all mounts belonging to this mount namespace are currently in the process of being propagated and make the propagation algorithm discard those if they appear as propagation targets. Link: https://lore.kernel.org/r/20250225-work-mount-propagation-v1-1-e6e3724500eb@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
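The marking idea can be illustrated with a userspace model: a namespace carries a "currently propagating" flag, and any candidate target that belongs to such a namespace is dropped. This is a sketch under assumptions, not the kernel's propagation code: the demo_* types, the flag value and the array-based target list are all made up for illustration; only the name MNTNS_PROPAGATING comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag value; only the name mirrors MNTNS_PROPAGATING. */
#define DEMO_MNTNS_PROPAGATING 0x1u

/* Hypothetical userspace stand-ins for a mount namespace and a mount. */
struct demo_ns {
	unsigned int flags;
};

struct demo_mount {
	const struct demo_ns *ns;
};

/*
 * Compact the candidate list in place, discarding every mount whose
 * namespace is marked as being propagated right now. Returns the number
 * of targets that remain at the front of the array.
 */
static size_t demo_discard_propagating(const struct demo_mount **targets, size_t n)
{
	size_t kept = 0;

	assert(targets != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (targets[i]->ns->flags & DEMO_MNTNS_PROPAGATING)
			continue;    /* part of the source tree itself: skip it */
		targets[kept++] = targets[i];
	}
	return kept;
}
```

The point of the flag in this model is that membership in the source tree is decided by a single namespace-level check rather than by comparing each target against every source mount.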
2025-03-04fs: namespace: fix uninitialized variable useArnd Bergmann
clang correctly notices that the 'uflags' variable initialization only happens in some cases: fs/namespace.c:4622:6: error: variable 'uflags' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized] 4622 | if (flags & MOVE_MOUNT_F_EMPTY_PATH) uflags = AT_EMPTY_PATH; | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/namespace.c:4623:48: note: uninitialized use occurs here 4623 | from_name = getname_maybe_null(from_pathname, uflags); | ^~~~~~ fs/namespace.c:4622:2: note: remove the 'if' if its condition is always true 4622 | if (flags & MOVE_MOUNT_F_EMPTY_PATH) uflags = AT_EMPTY_PATH; | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Fixes: b1e9423d65e3 ("fs: support getname_maybe_null() in move_mount()") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20250226081201.1876195-1-arnd@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04fs: allow creating detached mounts from fsmount() file descriptorsChristian Brauner
The previous patch series only enabled the creation of detached mounts from detached mounts that were created via open_tree(). In such cases we know that the origin sequence number for the newly created anonymous mount namespace will be set to the sequence number of the mount namespace the source mount belonged to. But fsmount() creates an anonymous mount namespace that does not have an origin mount namespace as the anonymous mount namespace was derived from a filesystem context created via fsopen(). Account for this case and allow the creation of detached mounts from mounts created via fsmount(). Consequently, any such detached mount created from an fsmount() mount will also have a zero origin sequence number. This allows mounting subdirectories without ever having to expose the filesystem to a non-anonymous mount namespace: fd_context = sys_fsopen("tmpfs", 0); sys_fsconfig(fd_context, FSCONFIG_CMD_CREATE, NULL, NULL, 0); fd_tmpfs = sys_fsmount(fd_context, 0, 0); mkdirat(fd_tmpfs, "subdir", 0755); fd_tree = sys_open_tree(fd_tmpfs, "subdir", OPEN_TREE_CLONE); sys_move_mount(fd_tree, "", -EBADF, "/mnt", MOVE_MOUNT_F_EMPTY_PATH); Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04fs: mount detached mounts onto detached mountsChristian Brauner
Currently, detached mounts can only be mounted onto attached mounts. This limitation makes it impossible to assemble a new private rootfs and move it into place. That's an extremely powerful concept for container and service workloads that we should support. Right now, a detached tree must be created, attached, then it can gain additional mounts and then it can either be moved (if it doesn't reside under a shared mount) or a detached mount created again. Lift this restriction. In order to allow mounting detached mounts onto other detached mounts the same permission model used for creating detached mounts from detached mounts can be used: (1) Check that the caller is privileged over the owning user namespace of its current mount namespace. (2) Check that the caller is located in the mount namespace of the mount it wants to create a detached copy of. The origin mount namespace of the anonymous mount namespace must be the same as the caller's mount namespace. To establish this the sequence number of the caller's mount namespace and the origin sequence number of the anonymous mount namespace are compared. The caller is always located in a non-anonymous mount namespace since anonymous mount namespaces cannot be setns()ed into. The caller's mount namespace will thus always have a valid sequence number. The owning namespace of any mount namespace, anonymous or non-anonymous, can never change. A mount attached to a non-anonymous mount namespace can never change mount namespace. If the sequence number of the non-anonymous mount namespace and the origin sequence number of the anonymous mount namespace match, the owning namespaces must match as well. Hence, the capability check on the owning namespace of the caller's mount namespace ensures that the caller has the ability to attach the mount tree. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-9-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04fs: support getname_maybe_null() in move_mount()Christian Brauner
Allow move_mount() to work with NULL path arguments. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-8-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04fs: create detached mounts from detached mountsChristian Brauner
Add the ability to create detached mounts from detached mounts. Currently, detached mounts can only be created from attached mounts. This limitation prevents various use-cases. For example, the ability to mount a subdirectory without ever having to make the whole filesystem visible first. The current permission model for the OPEN_TREE_CLONE flag of the open_tree() system call is: (1) Check that the caller is privileged over the owning user namespace of its current mount namespace. (2) Check that the caller is located in the mount namespace of the mount it wants to create a detached copy of. While it is not strictly necessary to do it this way it is consistently applied in the new mount api. This model will also be used when allowing the creation of a detached mount from another detached mount. The (1) requirement can simply be met by performing the same check as for the non-detached case, i.e., verify that the caller is privileged over its current mount namespace. To meet the (2) requirement it must be possible to infer the origin mount namespace that the anonymous mount namespace of the detached mount was created from. The origin mount namespace of an anonymous mount is the mount namespace that the mounts that were copied into the anonymous mount namespace originate from. The origin mount namespace of the anonymous mount namespace must be the same as the caller's mount namespace. To establish this the sequence number of the caller's mount namespace and the origin sequence number of the anonymous mount namespace are compared. The caller is always located in a non-anonymous mount namespace since anonymous mount namespaces cannot be setns()ed into. The caller's mount namespace will thus always have a valid sequence number. The owning namespace of any mount namespace, anonymous or non-anonymous, can never change. A mount attached to a non-anonymous mount namespace can never change mount namespace. 
If the sequence number of the non-anonymous mount namespace and the origin sequence number of the anonymous mount namespace match, the owning namespaces must match as well. Hence, the capability check on the owning namespace of the caller's mount namespace ensures that the caller has the ability to copy the mount tree. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-6-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
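The two-step permission model above can be modelled in userspace as follows. This is a sketch only: the demo_* struct, its field names and the boolean standing in for the capability check are assumptions made for illustration, and the real may_copy_tree() helper may check more than this. The zero origin sequence number later introduced for fsmount() mounts is deliberately not handled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for a mount namespace. */
struct demo_mnt_ns {
	uint64_t seq;         /* sequence number; valid for non-anonymous namespaces */
	uint64_t seq_origin;  /* for anonymous namespaces: seq of the namespace copied from */
	bool anonymous;
};

/*
 * Model of the check for OPEN_TREE_CLONE:
 * (1) the caller must be privileged over the owning user namespace of its
 *     own mount namespace (passed in as a precomputed boolean here);
 * (2) the source mount must live in the caller's mount namespace, or in an
 *     anonymous namespace whose origin is the caller's mount namespace.
 */
static bool demo_may_copy_tree(bool caller_privileged,
			       const struct demo_mnt_ns *caller_ns,
			       const struct demo_mnt_ns *mnt_ns)
{
	assert(caller_ns != NULL && mnt_ns != NULL);
	assert(!caller_ns->anonymous);   /* callers cannot setns() into anonymous namespaces */

	if (!caller_privileged)
		return false;
	if (mnt_ns == caller_ns)         /* attached mount in the caller's namespace */
		return true;
	if (!mnt_ns->anonymous)          /* attached mount in a foreign namespace */
		return false;
	return mnt_ns->seq_origin == caller_ns->seq;
}
```

Because the owning user namespace of a mount namespace never changes, matching sequence numbers imply matching owners, which is why the single capability check in step (1) is sufficient in this model.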
2025-03-04fs: add may_copy_tree()Christian Brauner
Add a helper that verifies whether a caller may copy a given mount tree. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-5-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04fs: add fastpath for dissolve_on_fput()Christian Brauner
Instead of acquiring the namespace semaphore and the mount lock every time we close a file with FMODE_NEED_UNMOUNT set, add a fastpath that checks whether we need to at all. Most of the time the caller will have attached the mount to the filesystem hierarchy and there's nothing to do. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-4-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04fs: add assert for move_mount()Christian Brauner
After we've attached a detached mount tree the anonymous mount namespace must be empty. Add an assert and make this assumption explicit. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-3-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04fs: add mnt_ns_empty() helperChristian Brauner
Add a helper that checks whether a given mount namespace is empty instead of open-coding the specific data structure check. This will also be used in follow-up patches. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-2-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-04fs: record sequence number of origin mount namespaceChristian Brauner
Store the sequence number of the mount namespace the anonymous mount namespace has been created from. This information will be used in follow-up patches. Link: https://lore.kernel.org/r/20250221-brauner-open_tree-v1-1-dbcfcb98c676@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-12Merge patch series "fs: allow changing idmappings"Christian Brauner
Christian Brauner <brauner@kernel.org> says: Currently, it isn't possible to change the idmapping of an idmapped mount. This is becoming an obstacle for various use-cases. /* idmapped home directories with systemd-homed */ On newer systems /home can be an idmapped mount such that each file on disk is owned by 65536 and a subfolder exists for foreign id ranges such as containers. For example, a home directory might look like this (using an arbitrary folder as an example): user1@localhost:~/data/mount-idmapped$ ls -al /data/ total 16 drwxrwxrwx 1 65536 65536 36 Jan 27 12:15 . drwxrwxr-x 1 root root 184 Jan 27 12:06 .. -rw-r--r-- 1 65536 65536 0 Jan 27 12:07 aaa -rw-r--r-- 1 65536 65536 0 Jan 27 12:07 bbb -rw-r--r-- 1 65536 65536 0 Jan 27 12:07 cc drwxr-xr-x 1 2147352576 2147352576 0 Jan 27 19:06 containers When logging in, home is mounted as an idmapped mount with the following idmappings: 65536:$(id -u):1 // uid mapping 65536:$(id -g):1 // gid mapping 2147352576:2147352576:65536 // uid mapping 2147352576:2147352576:65536 // gid mapping So for a user with uid/gid 1000 an idmapped /home would look like this: user1@localhost:~/data/mount-idmapped$ ls -aln /mnt/ total 16 drwxrwxrwx 1 1000 1000 36 Jan 27 12:15 . drwxrwxr-x 1 0 0 184 Jan 27 12:06 .. -rw-r--r-- 1 1000 1000 0 Jan 27 12:07 aaa -rw-r--r-- 1 1000 1000 0 Jan 27 12:07 bbb -rw-r--r-- 1 1000 1000 0 Jan 27 12:07 cc drwxr-xr-x 1 2147352576 2147352576 0 Jan 27 19:06 containers In other words, 65536 is mapped to the user's uid/gid and the range 2147352576 up to 2147352576 + 65536 is an identity mapping for containers. When a container is started a transient uid/gid range is allocated outside of both mappings of the idmapped mount. For example, the container might get the idmapping: $ cat /proc/1742611/uid_map 0 537985024 65536 This container will be allowed to write to disk within the allocated foreign id range 2147352576 to 2147352576 + 65536. 
To do this an idmapped mount must be created from an already idmapped mount such that: - The mappings for the user's uid/gid must be dropped, i.e., the following mappings are removed: 65536:$(id -u):1 // uid mapping 65536:$(id -g):1 // gid mapping - A mapping for the transient uid/gid range to the foreign uid/gid range is added: 2147352576:537985024:65536 In combination this will mean that the container will write to disk within the foreign id range 2147352576 to 2147352576 + 65536. /* nested containers */ When the outer container makes use of idmapped mounts it isn't possible to create an idmapped mount for the inner container with a different idmapping from the outer container's idmapped mount. There are other usecases and the two above just serve as an illustration of the problem. This patchset makes it possible to create a new idmapped mount from an already idmapped mount. It aims to adhere to current performance constraints and requirements: - Idmapped mounts aim to have near zero performance implications for path lookup. That is why no reference counting, locking or any other mechanism can be required that would impact performance. This works by ensuring that a regular mount transitions to an idmapped mount once, going from a static nop_mnt_idmap mapping to a non-static idmapping. - The idmapping of a mount cannot change anymore for the lifetime of the mount afterwards. This not only avoids UAF issues, it also avoids pitfalls such as generating non-matching uid/gid values. Changing idmappings could be solved by: - Idmappings could simply be reference counted (above the simple reference count when sharing them across multiple mounts). This would require pairing mnt_idmap_get() with mnt_idmap_put() which would end up being sprinkled everywhere into the VFS and some filesystems that access idmappings directly. It wouldn't just be quite ugly and introduce new complexity, it would also have a noticeable performance impact. - Idmappings could gain RCU protection. 
This would help the LOOKUP_RCU case and avoids taking reference counts under RCU. When not under LOOKUP_RCU reference counts need to be acquired on each idmapping. This would require pairing mnt_idmap_get() with mnt_idmap_put() which would end up being sprinkled everywhere into the VFS and some filesystems that access idmappings directly. This would have the same downsides as mentioned earlier. - The earlier solutions work by updating the mnt->mnt_idmap pointer with the new idmapping. Instead of this it would be possible to change the idmapping itself to avoid UAF issues. To do this a sequence counter would have to be added to struct mount. When retrieving the idmapping to generate uid/gid values the sequence counter would need to be sampled and the generation of the uid/gid would spin until the update of the idmap is finished. This has problems as well but the biggest issue will be that this can lead to inconsistent permission checking and inconsistent uid/gid pairs even more than this is already possible today. Specifically, during creation it could happen that: idmap = mnt_idmap(mnt); inode_permission(idmap, ...); may_create(idmap); // create file with uid/gid based on @idmap in between the permission checking and the generation of the uid/gid value the idmapping could change leading to the permission checking and uid/gid value that is actually used to create a file on disk being out of sync. Similarly if two values are generated like: idmap = mnt_idmap(mnt) vfsgid = make_vfsgid(idmap); // idmapping gets updated concurrently vfsuid = make_vfsuid(idmap); @vfsgid and @vfsuid could be out of sync if the idmapping was changed in between. The generation of vfsgid/vfsuid could span a lot of codelines so to guard against this a sequence count would have to be passed around. The performance impact of this solution is less clear but very likely not zero. - Using SRCU similar to fanotify that can sleep. 
I find that not just ugly; it would also have memory consumption implications. /* solution */ So, to avoid all of these pitfalls creating an idmapped mount from an already idmapped mount will be done atomically, i.e., a new detached mount is created and a new set of mount properties applied to it without it ever having been exposed to userspace at all. This could be done in two ways. One option is to add a new open_tree() flag, OPEN_TREE_CLEAR_IDMAP, that clears the old idmapping and returns a mount that isn't idmapped. It would then be possible to set mount attributes on it again, including creating an idmapped mount. This has the consequence that a file descriptor must exist in userspace that doesn't have any idmapping applied, and it will thus never work in unprivileged scenarios, as a container would be able to remove the idmapping of the mount it has been given. That should be avoided. Instead, we add open_tree_attr() which works just like open_tree() but takes an optional struct mount_attr parameter. This is useful beyond idmappings as it fills a gap where a mount never exists in userspace without the necessary mount properties applied. This is particularly useful for mount options such as MOUNT_ATTR_{RDONLY,NOSUID,NODEV,NOEXEC}. 
To create a new idmapped mount the following works: // Create a first idmapped mount struct mount_attr attr = { .attr_set = MOUNT_ATTR_IDMAP, .userns_fd = fd_userns }; fd_tree = open_tree_attr(-EBADF, "/", OPEN_TREE_CLONE, &attr, sizeof(attr)); move_mount(fd_tree, "", -EBADF, "/mnt", MOVE_MOUNT_F_EMPTY_PATH); // Create a second idmapped mount from the first idmapped mount attr.attr_set = MOUNT_ATTR_IDMAP; attr.userns_fd = fd_userns2; fd_tree2 = open_tree_attr(-EBADF, "/mnt", OPEN_TREE_CLONE, &attr, sizeof(attr)); // Create a second non-idmapped mount from the first idmapped mount: memset(&attr, 0, sizeof(attr)); attr.attr_clr = MOUNT_ATTR_IDMAP; fd_tree2 = open_tree_attr(-EBADF, "/mnt", OPEN_TREE_CLONE, &attr, sizeof(attr)); * patches from https://lore.kernel.org/r/20250128-work-mnt_idmap-update-v2-v1-0-c25feb0d2eb3@kernel.org: fs: allow changing idmappings fs: add kflags member to struct mount_kattr fs: add open_tree_attr() fs: add copy_mount_setattr() helper fs: add vfs_open_tree() helper Link: https://lore.kernel.org/r/20250128-work-mnt_idmap-update-v2-v1-0-c25feb0d2eb3@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
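The arithmetic behind the systemd-homed example in the series description can be checked with a small userspace model of an idmapping as a list of extents. This is a sketch, not kernel code: the demo_* names are hypothetical, and the triples are read as on-disk id, id seen through the mount, and range length, which is how the description uses them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One "disk:mount:count" triple from the description (hypothetical type). */
struct demo_idmap_extent {
	uint32_t disk_first;   /* first id as stored on disk */
	uint32_t mount_first;  /* first id as seen through the idmapped mount */
	uint32_t count;        /* number of consecutive ids covered */
};

/*
 * Translate an on-disk id through the mount's idmapping.
 * Returns false if no extent covers the id (the id is unmapped).
 */
static bool demo_map_disk_to_mount(const struct demo_idmap_extent *map, size_t n,
				   uint32_t disk_id, uint32_t *out)
{
	assert(out != NULL);
	for (size_t i = 0; i < n; i++) {
		const struct demo_idmap_extent *e = &map[i];

		if (disk_id >= e->disk_first && disk_id - e->disk_first < e->count) {
			*out = e->mount_first + (disk_id - e->disk_first);
			return true;
		}
	}
	return false;
}
```

With the login mapping {65536, 1000, 1} plus {2147352576, 2147352576, 65536}, on-disk id 65536 appears as 1000 and the foreign range appears unchanged. With the container mapping {2147352576, 537985024, 65536}, the foreign range lands in the container's transient range and the user's own files are no longer mapped.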
2025-02-12fs: allow changing idmappingsChristian Brauner
This patchset makes it possible to create a new idmapped mount from an already idmapped mount and to clear idmappings. // Create a first idmapped mount struct mount_attr attr = { .attr_set = MOUNT_ATTR_IDMAP, .userns_fd = fd_userns }; fd_tree = open_tree_attr(-EBADF, "/", OPEN_TREE_CLONE, &attr, sizeof(attr)); move_mount(fd_tree, "", -EBADF, "/mnt", MOVE_MOUNT_F_EMPTY_PATH); // Create a second idmapped mount from the first idmapped mount attr.attr_set = MOUNT_ATTR_IDMAP; attr.userns_fd = fd_userns2; fd_tree2 = open_tree_attr(-EBADF, "/mnt", OPEN_TREE_CLONE, &attr, sizeof(attr)); // Create a second non-idmapped mount from the first idmapped mount: memset(&attr, 0, sizeof(attr)); attr.attr_clr = MOUNT_ATTR_IDMAP; fd_tree2 = open_tree_attr(-EBADF, "/mnt", OPEN_TREE_CLONE, &attr, sizeof(attr)); Link: https://lore.kernel.org/r/20250128-work-mnt_idmap-update-v2-v1-5-c25feb0d2eb3@kernel.org Reviewed-by: "Seth Forshee (DigitalOcean)" <sforshee@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-12fs: add kflags member to struct mount_kattrChristian Brauner
Instead of using a boolean use a flag so we can add new flags in following patches. Link: https://lore.kernel.org/r/20250128-work-mnt_idmap-update-v2-v1-4-c25feb0d2eb3@kernel.org Reviewed-by: "Seth Forshee (DigitalOcean)" <sforshee@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-12fs: add open_tree_attr()Christian Brauner
Add open_tree_attr() which allows atomically creating a detached mount tree and setting mount options on it. If OPEN_TREE_CLONE is used this will allow the creation of a detached mount with a new set of mount options without it ever being exposed to userspace without that set of mount options applied. Link: https://lore.kernel.org/r/20250128-work-mnt_idmap-update-v2-v1-3-c25feb0d2eb3@kernel.org Reviewed-by: "Seth Forshee (DigitalOcean)" <sforshee@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-12fs: add copy_mount_setattr() helperChristian Brauner
Split out copy_mount_setattr() from mount_setattr() so we can use it in later patches. Link: https://lore.kernel.org/r/20250128-work-mnt_idmap-update-v2-v1-2-c25feb0d2eb3@kernel.org Reviewed-by: "Seth Forshee (DigitalOcean)" <sforshee@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-12fs: add vfs_open_tree() helperChristian Brauner
Split out vfs_open_tree() from open_tree() so we can use it in later patches. Link: https://lore.kernel.org/r/20250128-work-mnt_idmap-update-v2-v1-1-c25feb0d2eb3@kernel.org Reviewed-by: "Seth Forshee (DigitalOcean)" <sforshee@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-12statmount: add a new supported_mask fieldJeff Layton
Some of the fields in the statmount() reply can be optional. If the kernel has nothing to emit in that field, then it doesn't set the flag in the reply. This presents a problem: There is currently no way to know what mask flags the kernel supports since you can't always count on them being in the reply. Add a new STATMOUNT_SUPPORTED_MASK flag and field that the kernel can set in the reply. Userland can use this to determine if the fields it requires from the kernel are supported. This also gives us a way to deprecate fields in the future, if that should become necessary. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20250206-statmount-v2-1-6ae70a21c2ab@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
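The userland side of this check is plain bitmask arithmetic, sketched below. The DEMO_FIELD_* bits are hypothetical placeholders rather than real STATMOUNT_* values, and the supported mask is passed in directly instead of being read from an actual statmount() reply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical request bits; real code would use the STATMOUNT_* flags. */
#define DEMO_FIELD_A 0x1ull
#define DEMO_FIELD_B 0x2ull
#define DEMO_FIELD_C 0x4ull

/* True if every field the caller requires is advertised as supported. */
static bool demo_fields_supported(uint64_t supported_mask, uint64_t required)
{
	return (supported_mask & required) == required;
}

/* The subset of required fields that the kernel does not advertise. */
static uint64_t demo_missing_fields(uint64_t supported_mask, uint64_t required)
{
	return required & ~supported_mask;
}
```

This separates "the kernel does not know this field" from "the kernel knows the field but had nothing to emit for this mount", which the reply's ordinary mask alone cannot distinguish.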