path: root/Documentation/filesystems
AgeCommit message (Collapse)Author
2021-05-11erofs: update documentation about data compressionGao Xiang
Add more description about (NON)HEAD lclusters, and the new big pcluster feature. Link: Reviewed-by: Chao Yu <> Signed-off-by: Gao Xiang <>
2021-05-11erofs: fix broken illustration in documentationGao Xiang
Illustration was broken after ReST conversion by accident. (checked by 'make SPHINXDIRS="filesystems" htmldocs') Link: Fixes: e66d8631ddb3 ("docs: filesystems: convert erofs.txt to ReST") Reviewed-by: Chao Yu <> Cc: Mauro Carvalho Chehab <> Signed-off-by: Gao Xiang <>
2021-05-04Merge tag 'f2fs-for-5.13-rc1' of ↵Linus Torvalds
git:// Pull f2fs updates from Jaegeuk Kim: "In this round, we added a new mount option, "checkpoint_merge", which introduces a kernel thread dealing with the f2fs checkpoints. Once we start to manage the IO priority along with blk-cgroup, the checkpoint operation can be processed in a lower priority under the process context. Since the checkpoint holds all the filesystem operations, we give a higher priority to the checkpoint thread all the time. Enhancements: - introduce gc_merge mount option to introduce a checkpoint thread - improve to run discard thread efficiently - allow modular compression algorithms - expose # of overprivision segments to sysfs - expose runtime compression stat to sysfs Bug fixes: - fix OOB memory access by the node id lookup - avoid touching checkpointed data in the checkpoint-disabled mode - fix the resizing flow to avoid kernel panic and race conditions - fix block allocation issues on pinned files - address some swapfile issues - fix hugtask problem and kernel panic during atomic write operations - don't start checkpoint thread in RO And, we've cleaned up some kernel coding style and build warnings. In addition, we fixed some minor race conditions and error handling routines" * tag 'f2fs-for-5.13-rc1' of git:// (48 commits) f2fs: drop inplace IO if fs status is abnormal f2fs: compress: remove unneed check condition f2fs: clean up left deprecated IO trace codes f2fs: avoid using native allocate_segment_by_default() f2fs: remove unnecessary struct declaration f2fs: fix to avoid NULL pointer dereference f2fs: avoid duplicated codes for cleanup f2fs: document: add description about compressed space handling f2fs: clean up build warnings f2fs: fix the periodic wakeups of discard thread f2fs: fix to avoid accessing invalid fio in f2fs_allocate_data_block() f2fs: fix to avoid GC/mmap race with f2fs_truncate() f2fs: set checkpoint_merge by default f2fs: Fix a hungtask problem in atomic write f2fs: fix to restrict mount condition on readonly block device f2fs: introduce gc_merge mount option f2fs: fix to cover __allocate_new_section() with curseg_lock f2fs: fix wrong alloc_type in f2fs_do_replace_block f2fs: delete empty compress.h f2fs: fix a typo in inode.c ...
2021-04-30Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git:// Pull ext4 updates from Ted Ts'o: "New features for ext4 this cycle include support for encrypted casefold, ensure that deleted file names are cleared in directory blocks by zeroing directory entries when they are unlinked or moved as part of a hash tree node split. We also improve the block allocator's performance on a freshly mounted file system by prefetching block bitmaps. There are also the usual cleanups and bug fixes, including fixing a page cache invalidation race when there is mixed buffered and direct I/O and the block size is less than page size, and allow the dax flag to be set and cleared on inline directories" * tag 'ext4_for_linus' of git:// (32 commits) ext4: wipe ext4_dir_entry2 upon file deletion ext4: Fix occasional generic/418 failure fs: fix reporting supported extra file attributes for statx() ext4: allow the dax flag to be set and cleared on inline directories ext4: fix debug format string warning ext4: fix trailing whitespace ext4: fix various seppling typos ext4: fix error return code in ext4_fc_perform_commit() ext4: annotate data race in jbd2_journal_dirty_metadata() ext4: annotate data race in start_this_handle() ext4: fix ext4_error_err save negative errno into superblock ext4: fix error code in ext4_commit_super ext4: always panic when errors=panic is specified ext4: delete redundant uptodate check for buffer ext4: do not set SB_ACTIVE in ext4_orphan_cleanup() ext4: make prefetch_block_bitmaps default ext4: add proc files to monitor new structures ext4: improve cr 0 / cr 1 group scanning ext4: add MB_NUM_ORDERS macro ext4: add mballoc stats proc file ...
2021-04-30Merge tag 'ovl-update-5.13' of ↵Linus Torvalds
git:// Pull overlayfs update from Miklos Szeredi: - Fix a regression introduced in 5.2 that resulted in valid overlayfs mounts being rejected with ELOOP (Too many levels of symbolic links) - Fix bugs found by various tools - Miscellaneous improvements and cleanups * tag 'ovl-update-5.13' of git:// ovl: add debug print to ovl_do_getxattr() ovl: invalidate readdir cache on changes to dir with origin ovl: allow upperdir inside lowerdir ovl: show "userxattr" in the mount data ovl: trivial typo fixes in the file inode.c ovl: fix misspellings using codespell tool ovl: do not copy attr several times ovl: remove ovl_map_dev_ino() return value ovl: fix error for ovl_fill_super() ovl: fix missing revert_creds() on error path ovl: fix leaked dentry ovl: restrict lower null uuid for "xino=auto" ovl: check that upperdir path is not on a read-only mount ovl: plumb through flush method
2021-04-28Merge tag 'for-5.13/drivers-2021-04-27' of git:// Torvalds
Pull block driver updates from Jens Axboe: - MD changes via Song: - raid5 POWER fix - raid1 failure fix - UAF fix for md cluster - mddev_find_or_alloc() clean up - Fix NULL pointer deref with external bitmap - Performance improvement for raid10 discard requests - Fix missing information of /proc/mdstat - rsxx const qualifier removal (Arnd) - Expose allocated brd pages (Calvin) - rnbd via Gioh Kim: - Change maintainer - Change domain address of maintainers' email - Add polling IO mode and document update - Fix memory leak and some bug detected by static code analysis tools - Code refactoring - Series of floppy cleanups/fixes (Denis) - s390 dasd fixes (Julian) - kerneldoc fixes (Lee) - null_blk double free (Lv) - null_blk virtual boundary addition (Max) - Remove xsysace driver (Michal) - umem driver removal (Davidlohr) - ataflop fixes (Dan) - Revalidate disk removal (Christoph) - Bounce buffer cleanups (Christoph) - Mark lightnvm as deprecated (Christoph) - mtip32xx init cleanups (Shixin) - Various fixes (Tian, Gustavo, Coly, Yang, Zhang, Zhiqiang) * tag 'for-5.13/drivers-2021-04-27' of git:// (143 commits) async_xor: increase src_offs when dropping destination page drivers/block/null_blk/main: Fix a double free in null_init. md/raid1: properly indicate failure when ending a failed write request md-cluster: fix use-after-free issue when removing rdev nvme: introduce generic per-namespace chardev nvme: cleanup nvme_configure_apst nvme: do not try to reconfigure APST when the controller is not live nvme: add 'kato' sysfs attribute nvme: sanitize KATO setting nvmet: avoid queuing keep-alive timer if it is disabled brd: expose number of allocated pages in debugfs ataflop: fix off by one in ataflop_probe() ataflop: potential out of bounds in do_format() drbd: Fix fall-through warnings for Clang block/rnbd: Use strscpy instead of strlcpy block/rnbd-clt-sysfs: Remove copy buffer overlap in rnbd_clt_get_path_name block/rnbd-clt: Remove max_segment_size block/rnbd-clt: Generate kobject_uevent when the rnbd device state changes block/rnbd-srv: Remove unused arguments of rnbd_srv_rdma_ev Documentation/ABI/rnbd-clt: Add description for nr_poll_queues ...
2021-04-27Merge tag 'netfs-lib-20210426' of ↵Linus Torvalds
git:// Pull network filesystem helper library updates from David Howells: "Here's a set of patches for 5.13 to begin the process of overhauling the local caching API for network filesystems. This set consists of two parts: (1) Add a helper library to handle the new VM readahead interface. This is intended to be used unconditionally by the filesystem (whether or not caching is enabled) and provides a common framework for doing caching, transparent huge pages and, in the future, possibly fscrypt and read bandwidth maximisation. It also allows the netfs and the cache to align, expand and slice up a read request from the VM in various ways; the netfs need only provide a function to read a stretch of data to the pagecache and the helper takes care of the rest. (2) Add an alternative fscache/cachfiles I/O API that uses the kiocb facility to do async DIO to transfer data to/from the netfs's pages, rather than using readpage with wait queue snooping on one side and vfs_write() on the other. It also uses less memory, since it doesn't do buffered I/O on the backing file. Note that this uses SEEK_HOLE/SEEK_DATA to locate the data available to be read from the cache. Whilst this is an improvement from the bmap interface, it still has a problem with regard to a modern extent-based filesystem inserting or removing bridging blocks of zeros. Fixing that requires a much greater overhaul. This is a step towards overhauling the fscache API. The change is opt-in on the part of the network filesystem. A netfs should not try to mix the old and the new API because of conflicting ways of handling pages and the PG_fscache page flag and because it would be mixing DIO with buffered I/O. Further, the helper library can't be used with the old API. This does not change any of the fscache cookie handling APIs or the way invalidation is done at this time. In the near term, I intend to deprecate and remove the old I/O API (fscache_allocate_page{,s}(), fscache_read_or_alloc_page{,s}(), fscache_write_page() and fscache_uncache_page()) and eventually replace most of fscache/cachefiles with something simpler and easier to follow. This patchset contains the following parts: - Some helper patches, including provision of an ITER_XARRAY iov iterator and a function to do readahead expansion. - Patches to add the netfs helper library. - A patch to add the fscache/cachefiles kiocb API. - A pair of patches to fix some review issues in the ITER_XARRAY and read helpers as spotted by Al and Willy. Jeff Layton has patches to add support in Ceph for this that he intends for this merge window. I have a set of patches to support AFS that I will post a separate pull request for. With this, AFS without a cache passes all expected xfstests; with a cache, there's an extra failure, but that's also there before these patches. Fixing that probably requires a greater overhaul. Ceph also passes the expected tests. I also have patches in a separate branch to tidy up the handling of PG_fscache/PG_private_2 and their contribution to page refcounting in the core kernel here, but I haven't included them in this set and will route them separately" Link: * tag 'netfs-lib-20210426' of git:// netfs: Miscellaneous fixes iov_iter: Four fixes for ITER_XARRAY fscache, cachefiles: Add alternate API to use kiocb for read/write to cache netfs: Add a tracepoint to log failures that would be otherwise unseen netfs: Define an interface to talk to a cache netfs: Add write_begin helper netfs: Gather stats netfs: Add tracepoints netfs: Provide readahead and readpage netfs helpers netfs, mm: Add set/end/wait_on_page_fscache() aliases netfs, mm: Move PG_fscache helper funcs to linux/netfs.h netfs: Documentation for helper library netfs: Make a netfs helper module mm: Implement readahead_control pageset expansion mm/readahead: Handle ractl nr_pages being modified fs: Document file_ra_state mm/filemap: Pass the file_ra_state in the ractl mm: Add set/end/wait functions for PG_private_2 iov_iter: Add ITER_XARRAY
2021-04-27Merge branch 'miklos.fileattr' of ↵Linus Torvalds
git:// Pull fileattr conversion updates from Miklos Szeredi via Al Viro: "This splits the handling of FS_IOC_[GS]ETFLAGS from ->ioctl() into a separate method. The interface is reasonably uniform across the filesystems that support it and gives nice boilerplate removal" * 'miklos.fileattr' of git:// (23 commits) ovl: remove unneeded ioctls fuse: convert to fileattr fuse: add internal open/release helpers fuse: unsigned open flags fuse: move ioctl to separate source file vfs: remove unused ioctl helpers ubifs: convert to fileattr reiserfs: convert to fileattr ocfs2: convert to fileattr nilfs2: convert to fileattr jfs: convert to fileattr hfsplus: convert to fileattr efivars: convert to fileattr xfs: convert to fileattr orangefs: convert to fileattr gfs2: convert to fileattr f2fs: convert to fileattr ext4: convert to fileattr ext2: convert to fileattr btrfs: convert to fileattr ...
2021-04-23netfs: Documentation for helper libraryDavid Howells
Add interface documentation for the netfs helper library. Signed-off-by: David Howells <> cc: cc: cc: cc: cc: cc: cc: cc: Link: # v4 Link: # v5 Link: # v6
2021-04-13f2fs: document: add description about compressed space handlingChao Yu
User or developer may still be confused about why f2fs doesn't expose compressed space to userspace, add description about compressed space handling policy into f2fs documentation. Signed-off-by: Chao Yu <> Signed-off-by: Jaegeuk Kim <>
2021-04-12vfs: add fileattr opsMiklos Szeredi
There's a substantial amount of boilerplate in filesystems handling FS_IOC_[GS]ETFLAGS/ FS_IOC_FS[GS]ETXATTR ioctls. Also due to userspace buffers being involved in the ioctl API this is difficult to stack, as shown by overlayfs issues related to these ioctls. Introduce a new internal API named "fileattr" (fsxattr can be confused with xattr, xflags is inappropriate, since this is more than just flags). There's significant overlap between flags and xflags and this API handles the conversions automatically, so filesystems may choose which one to use. In ->fileattr_get() a hint is provided to the filesystem whether flags or xattr are being requested by userspace, but in this series this hint is ignored by all filesystems, since generating all the attributes is cheap. If a filesystem doesn't implemement the fileattr API, just fall back to f_op->ioctl(). When all filesystems are converted, the fallback can be removed. 32bit compat ioctls are now handled by the generic code as well. Signed-off-by: Miklos Szeredi <>
2021-04-12ovl: restrict lower null uuid for "xino=auto"Amir Goldstein
Commit a888db310195 ("ovl: fix regression with re-formatted lower squashfs") attempted to fix a regression with existing setups that use a practice that we are trying to discourage. The discourage part was described this way in the commit message: "To avoid the reported regression while still allowing the new features with single lower squashfs, do not allow decoding origin with lower null uuid unless user opted-in to one of the new features that require following the lower inode of non-dir upper (index, xino, metacopy)." The three mentioned features are disabled by default in Kconfig, so it was assumed that if they are enabled, the user opted-in for them. Apparently, distros started to configure CONFIG_OVERLAY_FS_XINO_AUTO=y some time ago, so users upgrading their kernels can still be affected by said regression even though they never opted-in for any new feature. To fix this, treat "xino=on" as "user opted-in", but not "xino=auto". Since we are changing the behavior of "xino=auto" to no longer follow to lower origin with null uuid, take this one step further and disable xino in that corner case. To be consistent, disable xino also in cases of lower fs without file handle support and upper fs without xattr support. Update documentation w.r.t the new "xino=auto" behavior and fix the out dated bits of documentation regarding "xino" and regarding offline modifications to lower layers. Link: Signed-off-by: Amir Goldstein <> Signed-off-by: Miklos Szeredi <>
2021-04-05ext4: handle casefolding with encryptionDaniel Rosenberg
This adds support for encryption with casefolding. Since the name on disk is case preserving, and also encrypted, we can no longer just recompute the hash on the fly. Additionally, to avoid leaking extra information from the hash of the unencrypted name, we use siphash via an fscrypt v2 policy. The hash is stored at the end of the directory entry for all entries inside of an encrypted and casefolded directory apart from those that deal with '.' and '..'. This way, the change is backwards compatible with existing ext4 filesystems. [ Changed to advertise this feature via the file: /sys/fs/ext4/features/encrypted_casefold -- TYT ] Signed-off-by: Daniel Rosenberg <> Reviewed-by: Andreas Dilger <> Link: Signed-off-by: Theodore Ts'o <>
2021-03-31Documentation: filesystems api-summary: add namespace.cRandy Dunlap
Add fs/namespace.c to the filesystems api-summary docbook. Signed-off-by: Randy Dunlap <> Cc: Alexander Viro <> Cc: Jonathan Corbet <> Cc: Link: Signed-off-by: Jonathan Corbet <>
2021-03-30f2fs: introduce gc_merge mount optionChao Yu
In this patch, we will add two new mount options: "gc_merge" and "nogc_merge", when background_gc is on, "gc_merge" option can be set to let background GC thread to handle foreground GC requests, it can eliminate the sluggish issue caused by slow foreground GC operation when GC is triggered from a process with limited I/O and CPU resources. Original idea is from Xiang. Signed-off-by: Gao Xiang <> Signed-off-by: Chao Yu <> Signed-off-by: Jaegeuk Kim <>
2021-03-29block: remove the revalidate_disk methodChristoph Hellwig
No implementations left. Signed-off-by: Christoph Hellwig <> Link: Signed-off-by: Jens Axboe <>
2021-03-25docs: filesystems: Fix a mundane typoBhaskar Chowdhury
s/provisoned/provisioned/ Signed-off-by: Bhaskar Chowdhury <> Acked-by: Randy Dunlap <> Acked-by: OGAWA Hirofumi <> Link: Signed-off-by: Jonathan Corbet <>
2021-03-08docs: filesystem: Update smaps vm flag list to latestPeter Xu
We've missed a few documentation when adding new VM_* flags. Add the missing pieces so they'll be in sync now. Signed-off-by: Peter Xu <> Link: Signed-off-by: Jonathan Corbet <>
2021-03-06Docs: add fs/eventpoll to docbooksRandy Dunlap
Add fs/eventpoll.c to the filesystem api-summary book. Signed-off-by: Randy Dunlap <> Cc: Jonathan Corbet <> Cc: Cc: Andrew Morton <> Cc: Alexander Viro <> Link: Signed-off-by: Jonathan Corbet <>
2021-02-27Merge branch 'work.misc' of ↵Linus Torvalds
git:// Pull misc vfs updates from Al Viro: "Assorted stuff pile - no common topic here" * 'work.misc' of git:// whack-a-mole: don't open-code iminor/imajor 9p: fix misuse of sscanf() in v9fs_stat2inode() audit_alloc_mark(): don't open-code ERR_CAST() fs/inode.c: make inode_init_always() initialize i_ino to 0 vfs: don't unnecessarily clone write access for writable fds
2021-02-26Merge tag 'docs-5.12-2' of git:// Torvalds
Pull documentation fixes from Jonathan Corbet: "A handful of late-arriving documentation fixes, nothing all that notable" * tag 'docs-5.12-2' of git:// docs: proc.rst: fix indentation warning Documentation: cgroup-v2: fix path to example BPF program docs: powerpc: Fix tables in syscall64-abi.rst Documentation: features: refresh feature list Documentation: features: remove c6x references docs: ABI: testing: ima_policy: Fixed missing bracket Fix unaesthetic indentation scripts: kernel-doc: fix array element capture in pointer-to-func parsing doc: use KCFLAGS instead of EXTRA_CFLAGS to pass flags from command line Documentation: proc.rst: add more about the 6 fields in loadavg
2021-02-26seq_file: document how per-entry resources are managed.NeilBrown
Patch series "Fix some seq_file users that were recently broken". A recent change to seq_file broke some users which were using seq_file in a non-"standard" way ... though the "standard" isn't documented, so they can be excused. The result is a possible leak - of memory in one case, of references to a 'transport' in the other. These three patches: 1/ document and explain the problem 2/ fix the problem user in x86 3/ fix the problem user in net/sctp This patch (of 3): Users of seq_file will sometimes find it convenient to take a resource, such as a lock or memory allocation, in the ->start or ->next operations. These are per-entry resources, distinct from per-session resources which are taken in ->start and released in ->stop. The preferred management of these is release the resource on the subsequent call to ->next or ->stop. However prior to Commit 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface") it happened that ->show would always be called after ->start or ->next, and a few users chose to release the resource in ->show. This is no longer reliable. Since the mentioned commit, ->next will always come after a successful ->show (to ensure m->index is updated correctly), so the original ordering cannot be maintained. This patch updates the documentation to clearly state the required behaviour. Other patches will fix the few problematic users. [ fix typo, per Willy] Link: Link: Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface") Signed-off-by: NeilBrown <> Cc: Xin Long <> Cc: Alexander Viro <> Cc: Jonathan Corbet <> Cc: Ingo Molnar <> Cc: Dave Hansen <> Cc: Andy Lutomirski <> Cc: Peter Zijlstra <> Cc: Vlad Yasevich <> Cc: Neil Horman <> Cc: Marcelo Ricardo Leitner <> Cc: "David S. Miller" <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2021-02-25docs: proc.rst: fix indentation warningRandy Dunlap
Fix indentation snafu in proc.rst as reported by Stephen. next-20210219/Documentation/filesystems/proc.rst:697: WARNING: Unexpected indentation. Fixes: 93ea4a0b8fce ("Documentation: proc.rst: add more about the 6 fields in loadavg") Reported-by: Stephen Rothwell <> Signed-off-by: Randy Dunlap <> Cc: Jonathan Corbet <> Cc: Link: Signed-off-by: Jonathan Corbet <>
2021-02-23Merge tag 'idmapped-mounts-v5.12' of ↵Linus Torvalds
git:// Pull idmapped mounts from Christian Brauner: "This introduces idmapped mounts which has been in the making for some time. Simply put, different mounts can expose the same file or directory with different ownership. This initial implementation comes with ports for fat, ext4 and with Christoph's port for xfs with more filesystems being actively worked on by independent people and maintainers. Idmapping mounts handle a wide range of long standing use-cases. Here are just a few: - Idmapped mounts make it possible to easily share files between multiple users or multiple machines especially in complex scenarios. For example, idmapped mounts will be used in the implementation of portable home directories in systemd-homed.service(8) where they allow users to move their home directory to an external storage device and use it on multiple computers where they are assigned different uids and gids. This effectively makes it possible to assign random uids and gids at login time. - It is possible to share files from the host with unprivileged containers without having to change ownership permanently through chown(2). - It is possible to idmap a container's rootfs and without having to mangle every file. For example, Chromebooks use it to share the user's Download folder with their unprivileged containers in their Linux subsystem. - It is possible to share files between containers with non-overlapping idmappings. - Filesystem that lack a proper concept of ownership such as fat can use idmapped mounts to implement discretionary access (DAC) permission checking. - They allow users to efficiently changing ownership on a per-mount basis without having to (recursively) chown(2) all files. In contrast to chown (2) changing ownership of large sets of files is instantenous with idmapped mounts. This is especially useful when ownership of a whole root filesystem of a virtual machine or container is changed. With idmapped mounts a single syscall mount_setattr syscall will be sufficient to change the ownership of all files. - Idmapped mounts always take the current ownership into account as idmappings specify what a given uid or gid is supposed to be mapped to. This contrasts with the chown(2) syscall which cannot by itself take the current ownership of the files it changes into account. It simply changes the ownership to the specified uid and gid. This is especially problematic when recursively chown(2)ing a large set of files which is commong with the aforementioned portable home directory and container and vm scenario. - Idmapped mounts allow to change ownership locally, restricting it to specific mounts, and temporarily as the ownership changes only apply as long as the mount exists. Several userspace projects have either already put up patches and pull-requests for this feature or will do so should you decide to pull this: - systemd: In a wide variety of scenarios but especially right away in their implementation of portable home directories. - container runtimes: containerd, runC, LXD:To share data between host and unprivileged containers, unprivileged and privileged containers, etc. The pull request for idmapped mounts support in containerd, the default Kubernetes runtime is already up for quite a while now: - The virtio-fs developers and several users have expressed interest in using this feature with virtual machines once virtio-fs is ported. - ChromeOS: Sharing host-directories with unprivileged containers. I've tightly synced with all those projects and all of those listed here have also expressed their need/desire for this feature on the mailing list. For more info on how people use this there's a bunch of talks about this too. Here's just two recent ones: This comes with an extensive xfstests suite covering both ext4 and xfs: It covers truncation, creation, opening, xattrs, vfscaps, setid execution, setgid inheritance and more both with idmapped and non-idmapped mounts. It already helped to discover an unrelated xfs setgid inheritance bug which has since been fixed in mainline. It will be sent for inclusion with the xfstests project should you decide to merge this. In order to support per-mount idmappings vfsmounts are marked with user namespaces. The idmapping of the user namespace will be used to map the ids of vfs objects when they are accessed through that mount. By default all vfsmounts are marked with the initial user namespace. The initial user namespace is used to indicate that a mount is not idmapped. All operations behave as before and this is verified in the testsuite. Based on prior discussions we want to attach the whole user namespace and not just a dedicated idmapping struct. This allows us to reuse all the helpers that already exist for dealing with idmappings instead of introducing a whole new range of helpers. In addition, if we decide in the future that we are confident enough to enable unprivileged users to setup idmapped mounts the permission checking can take into account whether the caller is privileged in the user namespace the mount is currently marked with. The user namespace the mount will be marked with can be specified by passing a file descriptor refering to the user namespace as an argument to the new mount_setattr() syscall together with the new MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern of extensibility. The following conditions must be met in order to create an idmapped mount: - The caller must currently have the CAP_SYS_ADMIN capability in the user namespace the underlying filesystem has been mounted in. - The underlying filesystem must support idmapped mounts. - The mount must not already be idmapped. This also implies that the idmapping of a mount cannot be altered once it has been idmapped. - The mount must be a detached/anonymous mount, i.e. it must have been created by calling open_tree() with the OPEN_TREE_CLONE flag and it must not already have been visible in the filesystem. The last two points guarantee easier semantics for userspace and the kernel and make the implementation significantly simpler. By default vfsmounts are marked with the initial user namespace and no behavioral or performance changes are observed. The manpage with a detailed description can be found here: In order to support idmapped mounts, filesystems need to be changed and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The patches to convert individual filesystem are not very large or complicated overall as can be seen from the included fat, ext4, and xfs ports. Patches for other filesystems are actively worked on and will be sent out separately. The xfstestsuite can be used to verify that port has been done correctly. The mount_setattr() syscall is motivated independent of the idmapped mounts patches and it's been around since July 2019. One of the most valuable features of the new mount api is the ability to perform mounts based on file descriptors only. Together with the lookup restrictions available in the openat2() RESOLVE_* flag namespace which we added in v5.6 this is the first time we are close to hardened and race-free (e.g. symlinks) mounting and path resolution. While userspace has started porting to the new mount api to mount proper filesystems and create new bind-mounts it is currently not possible to change mount options of an already existing bind mount in the new mount api since the mount_setattr() syscall is missing. With the addition of the mount_setattr() syscall we remove this last restriction and userspace can now fully port to the new mount api, covering every use-case the old mount api could. We also add the crucial ability to recursively change mount options for a whole mount tree, both removing and adding mount options at the same time. This syscall has been requested multiple times by various people and projects. There is a simple tool available at that allows to create idmapped mounts so people can play with this patch series. I'll add support for the regular mount binary should you decide to pull this in the following weeks: Here's an example to a simple idmapped mount of another user's home directory: u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt u1001@f2-vm:/$ ls -al /home/ubuntu/ total 28 drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 . drwxr-xr-x 4 root root 4096 Oct 28 04:00 .. -rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history -rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout -rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc -rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile -rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful -rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo u1001@f2-vm:/$ ls -al /mnt/ total 28 drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 . drwxr-xr-x 29 root root 4096 Oct 28 22:01 .. -rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history -rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout -rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc -rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile -rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful -rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo u1001@f2-vm:/$ touch /mnt/my-file u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file u1001@f2-vm:/$ ls -al /mnt/my-file -rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file u1001@f2-vm:/$ ls -al /home/ubuntu/my-file -rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file u1001@f2-vm:/$ getfacl /mnt/my-file getfacl: Removing leading '/' from absolute path names # file: mnt/my-file # owner: u1001 # group: u1001 user::rw- user:u1001:rwx group::rw- mask::rwx other::r-- u1001@f2-vm:/$ getfacl /home/ubuntu/my-file getfacl: Removing leading '/' from absolute path names # file: home/ubuntu/my-file # owner: ubuntu # group: ubuntu user::rw- user:ubuntu:rwx group::rw- mask::rwx other::r--" * tag 'idmapped-mounts-v5.12' of git:// (41 commits) xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl xfs: support idmapped mounts ext4: support idmapped mounts fat: handle idmapped mounts tests: add mount_setattr() selftests fs: introduce MOUNT_ATTR_IDMAP fs: add mount_setattr() fs: add attr_flags_to_mnt_flags helper fs: split out functions to hold writers namespace: only take read lock in do_reconfigure_mnt() mount: make {lock,unlock}_mount_hash() static namespace: take lock_mount_hash() directly when changing flags nfs: do not export idmapped mounts overlayfs: do not mount on top of idmapped mounts ecryptfs: do not mount on top of idmapped mounts ima: handle idmapped mounts apparmor: handle idmapped mounts fs: make helpers idmap mount aware exec: handle idmapped mounts would_dump: handle idmapped mounts ...
2021-02-22Merge tag 'lazytime_for_v5.12-rc1' of ↵Linus Torvalds
git:// Pull lazytime updates from Jan Kara: "Cleanups of the lazytime handling in the writeback code making rules for calling ->dirty_inode() filesystem handlers saner" * tag 'lazytime_for_v5.12-rc1' of git:// ext4: simplify i_state checks in __ext4_update_other_inode_time() gfs2: don't worry about I_DIRTY_TIME in gfs2_fsync() fs: improve comments for writeback_single_inode() fs: drop redundant check from __writeback_single_inode() fs: clean up __mark_inode_dirty() a bit fs: pass only I_DIRTY_INODE flags to ->dirty_inode fs: don't call ->dirty_inode for lazytime timestamp updates fat: only specify I_DIRTY_TIME when needed in fat_update_time() fs: only specify I_DIRTY_TIME when needed in generic_update_time() fs: correctly document the inode dirty flags
2021-02-22Documentation: proc.rst: add more about the 6 fields in loadavgRandy Dunlap
Address Jon's feedback on the previous patch by adding info about field separators in the /proc/loadavg file. Signed-off-by: Randy Dunlap <> Link: Signed-off-by: Jonathan Corbet <>
2021-02-22Merge tag 'docs-5.12' of git:// Torvalds
Pull documentation updates from Jonathan Corbet: "It has been a relatively quiet cycle in docsland. - As promised, the minimum Sphinx version to build the docs is now 1.7, and we have dropped support for Python 2 entirely. That allowed the removal of a bunch of compatibility code. - A set of treewide warning fixups from Mauro that I applied after it became clear nobody else was going to deal with them. - The automarkup mechanism can now create cross-references from relative paths to RST files. - More translations, typo fixes, and warning fixes" * tag 'docs-5.12' of git:// (75 commits) docs: kernel-hacking: be more civil docs: Remove the Microsoft rhetoric Documentation/admin-guide: kernel-parameters: Update nohlt section doc/admin-guide: fix spelling mistake: "perfomance" -> "performance" docs: Document cross-referencing using relative path docs: Enable usage of relative paths to docs on automarkup docs: thermal: fix spelling mistakes Documentation: admin-guide: Update kvm/xen config option docs: Make syscalls' helpers naming consistent coding-style.rst: Avoid comma statements Documentation: /proc/loadavg: add 3 more field descriptions Documentation/submitting-patches: Add blurb about backtraces in commit messages Docs: drop Python 2 support Move our minimum Sphinx version to 1.7 Documentation: input: define ABS_PRESSURE/ABS_MT_PRESSURE resolution as grams scripts/kernel-doc: add internal hyperlink to DOC: sections Update Documentation/admin-guide/sysctl/fs.rst docs: Update DTB format references docs: zh_CN: add iio index.rst translation docs/zh_CN: add iio ep93xx_adc.rst translation ...
2021-02-21Merge tag 'for-5.12/block-2021-02-17' of git:// Torvalds
Pull core block updates from Jens Axboe: "Another nice round of removing more code than what is added, mostly due to Christoph's relentless pursuit of tech debt removal/cleanups. This pull request contains: - Two series of BFQ improvements (Paolo, Jan, Jia) - Block iov_iter improvements (Pavel) - bsg error path fix (Pan) - blk-mq scheduler improvements (Jan) - -EBUSY discard fix (Jan) - bvec allocation improvements (Ming, Christoph) - bio allocation and init improvements (Christoph) - Store bdev pointer in bio instead of gendisk + partno (Christoph) - Block trace point cleanups (Christoph) - hard read-only vs read-only split (Christoph) - Block based swap cleanups (Christoph) - Zoned write granularity support (Damien) - Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)" * tag 'for-5.12/block-2021-02-17' of git:// (104 commits) mm: simplify swapdev_block sd_zbc: clear zone resources for non-zoned case block: introduce blk_queue_clear_zone_settings() zonefs: use zone write granularity as block size block: introduce zone_write_granularity limit block: use blk_queue_set_zoned in add_partition() nullb: use blk_queue_set_zoned() to setup zoned devices nvme: cleanup zone information initialization block: document zone_append_max_bytes attribute block: use bi_max_vecs to find the bvec pool md/raid10: remove dead code in reshape_request block: mark the bio as cloned in bio_iov_bvec_set block: set BIO_NO_PAGE_REF in bio_iov_bvec_set block: remove a layer of indentation in bio_iov_iter_get_pages block: turn the nr_iovecs argument to bio_alloc* into an unsigned short block: remove the 1 and 4 vec bvec_slabs entries block: streamline bvec_alloc block: factor out a bvec_alloc_gfp helper block: move struct biovec_slab to bio.c block: reuse BIO_INLINE_VECS for integrity bvecs ...
2021-02-21Merge tag 'fsverity-for-linus' of ↵Linus Torvalds
git:// Pull fsverity updates from Eric Biggers: "Add an ioctl which allows reading fs-verity metadata from a file. This is useful when a file with fs-verity enabled needs to be served somewhere, and the other end wants to do its own fs-verity compatible verification of the file. See the commit messages for details. This new ioctl has been tested using new xfstests I've written for it" * tag 'fsverity-for-linus' of git:// fs-verity: support reading signature with ioctl fs-verity: support reading descriptor with ioctl fs-verity: support reading Merkle tree with ioctl fs-verity: add FS_IOC_READ_VERITY_METADATA ioctl fs-verity: don't pass whole descriptor to fsverity_verify_signature() fs-verity: factor out fsverity_get_descriptor()
2021-02-21Merge tag 'f2fs-for-5.12-rc1' of ↵Linus Torvalds
git:// Pull f2fs updates from Jaegeuk Kim: "We've added two major features: 1) compression level and 2) checkpoint_merge, in this round. Compression level expands 'compress_algorithm' mount option to accept parameter as format of <algorithm>:<level>, by this way, it gives a way to allow user to do more specified config on lz4 and zstd compression level, then f2fs compression can provide higher compress ratio. checkpoint_merge creates a kernel daemon and makes it to merge concurrent checkpoint requests as much as possible to eliminate redundant checkpoint issues. Plus, we can eliminate the sluggish issue caused by slow checkpoint operation when the checkpoint is done in a process context in a cgroup having low i/o budget and cpu shares. Enhancements: - add compress level for lz4 and zstd in mount option - checkpoint_merge mount option - deprecate f2fs_trace_io Bug fixes: - flush data when enabling checkpoint back - handle corner cases of mount options - missing ACL update and lock for I_LINKABLE flag - attach FIEMAP_EXTENT_MERGED in f2fs_fiemap - fix potential deadlock in compression flow - fix wrong submit_io condition As usual, we've cleaned up many code flows and fixed minor bugs" * tag 'f2fs-for-5.12-rc1' of git:// (32 commits) Documentation: f2fs: fix typo s/automaic/automatic f2fs: give a warning only for readonly partition f2fs: don't grab superblock freeze for flush/ckpt thread f2fs: add ckpt_thread_ioprio sysfs node f2fs: introduce checkpoint_merge mount option f2fs: relocate inline conversion from mmap() to mkwrite() f2fs: fix a wrong condition in __submit_bio f2fs: remove unnecessary initialization in xattr.c f2fs: fix to avoid inconsistent quota data f2fs: flush data when enabling checkpoint back f2fs: deprecate f2fs_trace_io f2fs: Remove readahead collision detection f2fs: remove unused stat_{inc, dec}_atomic_write f2fs: introduce sb_status sysfs node f2fs: fix to use per-inode maxbytes f2fs: compress: fix potential deadlock libfs: unexport generic_ci_d_compare() and generic_ci_d_hash() f2fs: fix to set/clear I_LINKABLE under i_lock f2fs: fix null page reference in redirty_blocks f2fs: clean up post-read processing ...
2021-02-16Documentation: f2fs: fix typo s/automaic/automaticEd Tsai
Fix typo in f2fs documentation. Signed-off-by: Ed Tsai <> Reviewed-by: Chao Yu <> Signed-off-by: Jaegeuk Kim <>
2021-02-07fs-verity: support reading signature with ioctlEric Biggers
Add support for FS_VERITY_METADATA_TYPE_SIGNATURE to FS_IOC_READ_VERITY_METADATA. This allows a userspace server program to retrieve the built-in signature (if present) of a verity file for serving to a client which implements fs-verity compatible verification. See the patch which introduced FS_IOC_READ_VERITY_METADATA for more details. The ability for userspace to read the built-in signatures is also useful because it allows a system that is using the in-kernel signature verification to migrate to userspace signature verification. This has been tested using a new xfstest which calls this ioctl via a new subcommand for the 'fsverity' program from fsverity-utils. Link: Reviewed-by: Victor Hsieh <> Reviewed-by: Jaegeuk Kim <> Reviewed-by: Chao Yu <> Signed-off-by: Eric Biggers <>
2021-02-07fs-verity: support reading descriptor with ioctlEric Biggers
Add support for FS_VERITY_METADATA_TYPE_DESCRIPTOR to FS_IOC_READ_VERITY_METADATA. This allows a userspace server program to retrieve the fs-verity descriptor of a file for serving to a client which implements fs-verity compatible verification. See the patch which introduced FS_IOC_READ_VERITY_METADATA for more details. "fs-verity descriptor" here means only the part that userspace cares about because it is hashed to produce the file digest. It doesn't include the signature which ext4 and f2fs append to the fsverity_descriptor struct when storing it on-disk, since that way of storing the signature is an implementation detail. The next patch adds a separate metadata_type value for retrieving the signature separately. This has been tested using a new xfstest which calls this ioctl via a new subcommand for the 'fsverity' program from fsverity-utils. Link: Reviewed-by: Victor Hsieh <> Reviewed-by: Jaegeuk Kim <> Reviewed-by: Chao Yu <> Signed-off-by: Eric Biggers <>
2021-02-07fs-verity: support reading Merkle tree with ioctlEric Biggers
Add support for FS_VERITY_METADATA_TYPE_MERKLE_TREE to FS_IOC_READ_VERITY_METADATA. This allows a userspace server program to retrieve the Merkle tree of a verity file for serving to a client which implements fs-verity compatible verification. See the patch which introduced FS_IOC_READ_VERITY_METADATA for more details. This has been tested using a new xfstest which calls this ioctl via a new subcommand for the 'fsverity' program from fsverity-utils. Link: Reviewed-by: Victor Hsieh <> Reviewed-by: Jaegeuk Kim <> Reviewed-by: Chao Yu <> Signed-off-by: Eric Biggers <>
2021-02-07fs-verity: add FS_IOC_READ_VERITY_METADATA ioctlEric Biggers
Add an ioctl FS_IOC_READ_VERITY_METADATA which will allow reading verity metadata from a file that has fs-verity enabled, including: - The Merkle tree - The fsverity_descriptor (not including the signature if present) - The built-in signature, if present This ioctl has similar semantics to pread(). It is passed the type of metadata to read (one of the above three), and a buffer, offset, and size. It returns the number of bytes read or an error. Separate patches will add support for each of the above metadata types. This patch just adds the ioctl itself. This ioctl doesn't make any assumption about where the metadata is stored on-disk. It does assume the metadata is in a stable format, but that's basically already the case: - The Merkle tree and fsverity_descriptor are defined by how fs-verity file digests are computed; see the "File digest computation" section of Documentation/filesystems/fsverity.rst. Technically, the way in which the levels of the tree are ordered relative to each other wasn't previously specified, but it's logical to put the root level first. - The built-in signature is the value passed to FS_IOC_ENABLE_VERITY. This ioctl is useful because it allows writing a server program that takes a verity file and serves it to a client program, such that the client can do its own fs-verity compatible verification of the file. This only makes sense if the client doesn't trust the server and if the server needs to provide the storage for the client. More concretely, there is interest in using this ability in Android to export APK files (which are protected by fs-verity) to "protected VMs". This would use Protected KVM (, which provides an isolated execution environment without having to trust the traditional "host". A "guest" VM can boot from a signed image and perform specific tasks in a minimum trusted environment using files that have fs-verity enabled on the host, without trusting the host or requiring that the guest has its own trusted storage. Technically, it would be possible to duplicate the metadata and store it in separate files for serving. However, that would be less efficient and would require extra care in userspace to maintain file consistency. In addition to the above, the ability to read the built-in signatures is useful because it allows a system that is using the in-kernel signature verification to migrate to userspace signature verification. Link: Reviewed-by: Victor Hsieh <> Acked-by: Jaegeuk Kim <> Reviewed-by: Chao Yu <> Signed-off-by: Eric Biggers <>
2021-02-04Documentation: /proc/loadavg: add 3 more field descriptionsRandy Dunlap
Update contents of /proc/loadavg: add 3 more fields. Signed-off-by: Randy Dunlap <> Cc: Link: Signed-off-by: Jonathan Corbet <>
2021-02-03f2fs: introduce checkpoint_merge mount optionDaeho Jeong
We've added a new mount options, "checkpoint_merge" and "nocheckpoint_merge", which creates a kernel daemon and makes it to merge concurrent checkpoint requests as much as possible to eliminate redundant checkpoint issues. Plus, we can eliminate the sluggish issue caused by slow checkpoint operation when the checkpoint is done in a process context in a cgroup having low i/o budget and cpu shares. To make this do better, we set the default i/o priority of the kernel daemon to "3", to give one higher priority than other kernel threads. The below verification result explains this. The basic idea has come from [Verification] Android Pixel Device(ARM64, 7GB RAM, 256GB UFS) Create two I/O cgroups (fg w/ weight 100, bg w/ wight 20) Set "strict_guarantees" to "1" in BFQ tunables In "fg" cgroup, - thread A => trigger 1000 checkpoint operations "for i in `seq 1 1000`; do touch test_dir1/file; fsync test_dir1; done" - thread B => gererating async. I/O "fio --rw=write --numjobs=1 --bs=128k --runtime=3600 --time_based=1 --filename=test_img --name=test" In "bg" cgroup, - thread C => trigger repeated checkpoint operations "echo $$ > /dev/blkio/bg/tasks; while true; do touch test_dir2/file; fsync test_dir2; done" We've measured thread A's execution time. [ w/o patch ] Elapsed Time: Avg. 68 seconds [ w/ patch ] Elapsed Time: Avg. 48 seconds Reported-by: kernel test robot <> Reported-by: Dan Carpenter <> [Jaegeuk Kim: fix the return value in f2fs_start_ckpt_thread, reported by Dan] Signed-off-by: Daeho Jeong <> Signed-off-by: Sungjong Seo <> Reviewed-by: Chao Yu <> Signed-off-by: Jaegeuk Kim <>
2021-01-28ovl: implement volatile-specific fsync error behaviourSargun Dhillon
Overlayfs's volatile option allows the user to bypass all forced sync calls to the upperdir filesystem. This comes at the cost of safety. We can never ensure that the user's data is intact, but we can make a best effort to expose whether or not the data is likely to be in a bad state. The best way to handle this in the time being is that if an overlayfs's upperdir experiences an error after a volatile mount occurs, that error will be returned on fsync, fdatasync, sync, and syncfs. This is contradictory to the traditional behaviour of VFS which fails the call once, and only raises an error if a subsequent fsync error has occurred, and been raised by the filesystem. One awkward aspect of the patch is that we have to manually set the superblock's errseq_t after the sync_fs callback as opposed to just returning an error from syncfs. This is because the call chain looks something like this: sys_syncfs -> sync_filesystem -> __sync_filesystem -> /* The return value is ignored here sb->s_op->sync_fs(sb) _sync_blockdev /* Where the VFS fetches the error to raise to userspace */ errseq_check_and_advance Because of this we call errseq_set every time the sync_fs callback occurs. Due to the nature of this seen / unseen dichotomy, if the upperdir is an inconsistent state at the initial mount time, overlayfs will refuse to mount, as overlayfs cannot get a snapshot of the upperdir's errseq that will increment on error until the user calls syncfs. Signed-off-by: Sargun Dhillon <> Suggested-by: Amir Goldstein <> Reviewed-by: Amir Goldstein <> Fixes: c86243b090bc ("ovl: provide a mount option "volatile"") Cc: Reviewed-by: Vivek Goyal <> Reviewed-by: Jeff Layton <> Signed-off-by: Miklos Szeredi <>
2021-01-27f2fs: compress: support compress levelChao Yu
Expand 'compress_algorithm' mount option to accept parameter as format of <algorithm>:<level>, by this way, it gives a way to allow user to do more specified config on lz4 and zstd compression level, then f2fs compression can provide higher compress ratio. In order to set compress level for lz4 algorithm, it needs to set CONFIG_LZ4HC_COMPRESS and CONFIG_F2FS_FS_LZ4HC config to enable lz4hc compress algorithm. CR and performance number on lz4/lz4hc algorithm: dd if=enwik9 of=compressed_file conv=fsync Original blocks: 244382 lz4 lz4hc-9 compressed blocks 170647 163270 compress ratio 69.8% 66.8% speed 16.4207 s, 60.9 MB/s 26.7299 s, 37.4 MB/s compress ratio = after / before Signed-off-by: Chao Yu <> Signed-off-by: Jaegeuk Kim <>
2021-01-27f2fs: remove FAULT_ALLOC_BIOChristoph Hellwig
Sleeping bio allocations do not fail, which means that injecting an error into sleeping bio allocations is a little silly. Signed-off-by: Christoph Hellwig <> Reviewed-by: Johannes Thumshirn <> Reviewed-by: Chaitanya Kulkarni <> Acked-by: Damien Le Moal <> Signed-off-by: Jens Axboe <>
2021-01-25bio: don't copy bvec for direct IOPavel Begunkov
The block layer spends quite a while in blkdev_direct_IO() to copy and initialise bio's bvec. However, if we've already got a bvec in the input iterator it might be reused in some cases, i.e. when new ITER_BVEC_FLAG_FIXED flag is set. Simple tests show considerable performance boost, and it also reduces memory footprint. Suggested-by: Matthew Wilcox <> Reviewed-by: Christoph Hellwig <> Signed-off-by: Pavel Begunkov <> Reviewed-by: Ming Lei <> Signed-off-by: Jens Axboe <>
2021-01-25bvec/iter: disallow zero-length segment bvecsPavel Begunkov
zero-length bvec segments are allowed in general, but not handled by bio and down the block layer so filtered out. This inconsistency may be confusing and prevent from optimisations. As zero-length segments are useless and places that were generating them are patched, declare them not allowed. Reviewed-by: Christoph Hellwig <> Signed-off-by: Pavel Begunkov <> Reviewed-by: Ming Lei <> Signed-off-by: Jens Axboe <>
2021-01-24fs: make helpers idmap mount awareChristian Brauner
Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: Cc: Christoph Hellwig <> Cc: David Howells <> Cc: Al Viro <> Cc: Reviewed-by: Christoph Hellwig <> Signed-off-by: Christian Brauner <>
2021-01-24acl: handle idmapped mountsChristian Brauner
The posix acl permission checking helpers determine whether a caller is privileged over an inode according to the acls associated with the inode. Add helpers that make it possible to handle acls on idmapped mounts. The vfs and the filesystems targeted by this first iteration make use of posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to translate basic posix access and default permissions such as the ACL_USER and ACL_GROUP type according to the initial user namespace (or the superblock's user namespace) to and from the caller's current user namespace. Adapt these two helpers to handle idmapped mounts whereby we either map from or into the mount's user namespace depending on in which direction we're translating. Similarly, cap_convert_nscap() is used by the vfs to translate user namespace and non-user namespace aware filesystem capabilities from the superblock's user namespace to the caller's user namespace. Enable it to handle idmapped mounts by accounting for the mount's user namespace. In addition the fileystems targeted in the first iteration of this patch series make use of the posix_acl_chmod() and, posix_acl_update_mode() helpers. Both helpers perform permission checks on the target inode. Let them handle idmapped mounts. These two helpers are called when posix acls are set by the respective filesystems to handle this case we extend the ->set() method to take an additional user namespace argument to pass the mount's user namespace down. Link: Cc: Christoph Hellwig <> Cc: David Howells <> Cc: Al Viro <> Cc: Reviewed-by: Christoph Hellwig <> Signed-off-by: Christian Brauner <>
2021-01-21AFS: Documentation: fix a few typos in afs.rstRandy Dunlap
Fix typos (punctuation, grammar, spelling) in afs.rst. Signed-off-by: Randy Dunlap <> Cc: David Howells <> Cc: Cc: Jonathan Corbet <> Cc: Link: Signed-off-by: Jonathan Corbet <>
2021-01-13fs: pass only I_DIRTY_INODE flags to ->dirty_inodeEric Biggers
->dirty_inode is now only called when I_DIRTY_INODE (I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) is set. However it may still be passed other dirty flags at the same time, provided that these other flags happened to be passed to __mark_inode_dirty() at the same time as I_DIRTY_INODE. This doesn't make sense because there is no reason for filesystems to care about these extra flags. Nor are filesystems notified about all updates to these other flags. Therefore, mask the flags before passing them to ->dirty_inode. Also properly document ->dirty_inode in vfs.rst. Link: Reviewed-by: Christoph Hellwig <> Reviewed-by: Jan Kara <> Signed-off-by: Eric Biggers <> Signed-off-by: Jan Kara <>
2021-01-11docs: Include ext4 documentation via filesystems/Jonathan Neuschäfer
The documentation for other filesystems is already included via filesystems/index.rst. Include ext4 in the same way and remove it from the top-level table of contents. Signed-off-by: Jonathan Neuschäfer <> Link: Signed-off-by: Jonathan Corbet <>
2021-01-11Documentation/dax: Update description of DAX policy changingHao Li
After commit 77573fa310d9 ("fs: Kill DCACHE_DONTCACHE dentry even if DCACHE_REFERENCED is set"), changes to DAX policy will take effect as soon as all references to this file are gone. Update the documentation accordingly. Signed-off-by: Hao Li <> Reviewed-by: Ira Weiny <> Link: Signed-off-by: Jonathan Corbet <>
2021-01-11docs: filesystems: vfs: Correct the struct nameLiao Pingfang
The struct name should be file_system_type instead of file_system_operations. Signed-off-by: Liao Pingfang <> Link: Signed-off-by: Jonathan Corbet <>
2021-01-04vfs: don't unnecessarily clone write access for writable fdsEric Biggers
There's no need for mnt_want_write_file() to increment mnt_writers when the file is already open for writing, provided that mnt_drop_write_file() is changed to conditionally decrement it. We seem to have ended up in the current situation because mnt_want_write_file() used to be paired with mnt_drop_write(), due to mnt_drop_write_file() not having been added yet. So originally mnt_want_write_file() had to always increment mnt_writers. But later mnt_drop_write_file() was added, and all callers of mnt_want_write_file() were paired with it. This makes the compatibility between mnt_want_write_file() and mnt_drop_write() no longer necessary. Therefore, make __mnt_want_write_file() and __mnt_drop_write_file() skip incrementing mnt_writers on files already open for writing. This removes the only caller of mnt_clone_write(), so remove that too. Signed-off-by: Eric Biggers <> Signed-off-by: Al Viro <>