Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/debugfs.rst | 19
-rw-r--r-- | Documentation/filesystems/ext4/atomic_writes.rst | 225
-rw-r--r-- | Documentation/filesystems/ext4/overview.rst | 1
-rw-r--r-- | Documentation/filesystems/f2fs.rst | 52
-rw-r--r-- | Documentation/filesystems/fuse-passthrough.rst | 133
-rw-r--r-- | Documentation/filesystems/index.rst | 1
-rw-r--r-- | Documentation/filesystems/netfs_library.rst | 5
-rw-r--r-- | Documentation/filesystems/porting.rst | 6
-rw-r--r-- | Documentation/filesystems/relay.rst | 10
-rw-r--r-- | Documentation/filesystems/vfs.rst | 4
10 files changed, 400 insertions, 56 deletions
diff --git a/Documentation/filesystems/debugfs.rst b/Documentation/filesystems/debugfs.rst
index 610f718ef8b5..55f807293924 100644
--- a/Documentation/filesystems/debugfs.rst
+++ b/Documentation/filesystems/debugfs.rst
@@ -229,22 +229,15 @@ module is unloaded without explicitly removing debugfs entries, the result
 will be a lot of stale pointers and no end of highly antisocial behavior.
 So all debugfs users - at least those which can be built as modules - must
 be prepared to remove all files and directories they create there. A file
-can be removed with::
+or directory can be removed with::

     void debugfs_remove(struct dentry *dentry);

 The dentry value can be NULL or an error value, in which case nothing will
-be removed.
-
-Once upon a time, debugfs users were required to remember the dentry
-pointer for every debugfs file they created so that all files could be
-cleaned up. We live in more civilized times now, though, and debugfs users
-can call::
-
-    void debugfs_remove_recursive(struct dentry *dentry);
-
-If this function is passed a pointer for the dentry corresponding to the
-top-level directory, the entire hierarchy below that directory will be
-removed.
+be removed. Note that this function will recursively remove all files and
+directories underneath it. Previously, debugfs_remove_recursive() was used
+to perform that task, but this function is now just an alias to
+debugfs_remove(). debugfs_remove_recursive() should be considered
+deprecated.

 .. [1] http://lwn.net/Articles/309298/
diff --git a/Documentation/filesystems/ext4/atomic_writes.rst b/Documentation/filesystems/ext4/atomic_writes.rst
new file mode 100644
index 000000000000..f65767df3620
--- /dev/null
+++ b/Documentation/filesystems/ext4/atomic_writes.rst
@@ -0,0 +1,225 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. _atomic_writes:
+
+Atomic Block Writes
+-------------------------
+
+Introduction
+~~~~~~~~~~~~
+
+Atomic (untorn) block writes ensure that either the entire write is committed
+to disk or none of it is. This prevents "torn writes" during power loss or
+system crashes. The ext4 filesystem supports atomic writes (only with Direct
+I/O) on regular files with extents, provided the underlying storage device
+supports hardware atomic writes. This is supported in the following two ways:
+
+1. **Single-fsblock Atomic Writes**:
+   EXT4's supports atomic write operations with a single filesystem block since
+   v6.13. In this the atomic write unit minimum and maximum sizes are both set
+   to filesystem blocksize.
+   e.g. doing atomic write of 16KB with 16KB filesystem blocksize on 64KB
+   pagesize system is possible.
+
+2. **Multi-fsblock Atomic Writes with Bigalloc**:
+   EXT4 now also supports atomic writes spanning multiple filesystem blocks
+   using a feature known as bigalloc. The atomic write unit's minimum and
+   maximum sizes are determined by the filesystem block size and cluster size,
+   based on the underlying device’s supported atomic write unit limits.
+
+Requirements
+~~~~~~~~~~~~
+
+Basic requirements for atomic writes in ext4:
+
+ 1. The extents feature must be enabled (default for ext4)
+ 2. The underlying block device must support atomic writes
+ 3. For single-fsblock atomic writes:
+
+    1. A filesystem with appropriate block size (up to the page size)
+ 4. For multi-fsblock atomic writes:
+
+    1. The bigalloc feature must be enabled
+    2. The cluster size must be appropriately configured
+
+NOTE: EXT4 does not support software or COW based atomic write, which means
+atomic writes on ext4 are only supported if underlying storage device supports
+it.
+
+Multi-fsblock Implementation Details
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The bigalloc feature changes ext4 to allocate in units of multiple filesystem
+blocks, also known as clusters. With bigalloc each bit within block bitmap
+represents cluster (power of 2 number of blocks) rather than individual
+filesystem blocks.
+EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the
+following constraints. The minimum atomic write size is the larger of the fs
+block size and the minimum hardware atomic write unit; and the maximum atomic
+write size is smaller of the bigalloc cluster size and the maximum hardware
+atomic write unit. Bigalloc ensures that all allocations are aligned to the
+cluster size, which satisfies the LBA alignment requirements of the hardware
+device if the start of the partition/logical volume is itself aligned correctly.
+
+Here is the block allocation strategy in bigalloc for atomic writes:
+
+ * For regions with fully mapped extents, no additional work is needed
+ * For append writes, a new mapped extent is allocated
+ * For regions that are entirely holes, unwritten extent is created
+ * For large unwritten extents, the extent gets split into two unwritten
+   extents of appropriate requested size
+ * For mixed mapping regions (combinations of holes, unwritten extents, or
+   mapped extents), ext4_map_blocks() is called in a loop with
+   EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous
+   mapped extent by writing zeroes to it and converting any unwritten extents to
+   written, if found within the range.
+
+Note: Writing on a single contiguous underlying extent, whether mapped or
+unwritten, is not inherently problematic. However, writing to a mixed mapping
+region (i.e. one containing a combination of mapped and unwritten extents)
+must be avoided when performing atomic writes.
+
+The reason is that, atomic writes when issued via pwritev2() with the RWF_ATOMIC
+flag, requires that either all data is written or none at all. In the event of
+a system crash or unexpected power loss during the write operation, the affected
+region (when later read) must reflect either the complete old data or the
+complete new data, but never a mix of both.
+
+To enforce this guarantee, we ensure that the write target is backed by
+a single, contiguous extent before any data is written. This is critical because
+ext4 defers the conversion of unwritten extents to written extents until the I/O
+completion path (typically in ->end_io()). If a write is allowed to proceed over
+a mixed mapping region (with mapped and unwritten extents) and a failure occurs
+mid-write, the system could observe partially updated regions after reboot, i.e.
+new data over mapped areas, and stale (old) data over unwritten extents that
+were never marked written. This violates the atomicity and/or torn write
+prevention guarantee.
+
+To prevent such torn writes, ext4 proactively allocates a single contiguous
+extent for the entire requested region in ``ext4_iomap_alloc`` via
+``ext4_map_blocks_atomic()``. EXT4 also force commits the current journalling
+transaction in case if allocation is done over mixed mapping. This ensures any
+pending metadata updates (like unwritten to written extents conversion) in this
+range are in consistent state with the file data blocks, before performing the
+actual write I/O. If the commit fails, the whole I/O must be aborted to prevent
+from any possible torn writes.
+Only after this step, the actual data write operation is performed by the iomap.
+
+Handling Split Extents Across Leaf Blocks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There can be a special edge case where we have logically and physically
+contiguous extents stored in separate leaf nodes of the on-disk extent tree.
+This occurs because on-disk extent tree merges only happens within the leaf
+blocks except for a case where we have 2-level tree which can get merged and
+collapsed entirely into the inode.
+If such a layout exists and, in the worst case, the extent status cache entries
+are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
+a single contiguous extent for these split leaf extents.
+
+To address this edge case, a new get block flag
+``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS flag`` is added to enhance the
+``ext4_map_query_blocks()`` lookup behavior.
+
+This new get block flag allows ``ext4_map_blocks()`` to first check if there is
+an entry in the extent status cache for the full range.
+If not present, it consults the on-disk extent tree using
+``ext4_map_query_blocks()``.
+If the located extent is at the end of a leaf node, it probes the next logical
+block (lblk) to detect a contiguous extent in the adjacent leaf.
+
+For now only one additional leaf block is queried to maintain efficiency, as
+atomic writes are typically constrained to small sizes
+(e.g. [blocksize, clustersize]).
+
+
+Handling Journal transactions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To support multi-fsblock atomic writes, we ensure enough journal credits are
+reserved during:
+
+ 1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there
+    could be a mixed mapping for the underlying requested range. If yes, then we
+    reserve credits of up to ``m_len``, assuming every alternate block can be
+    an unwritten extent followed by a hole.
+
+ 2. During ``->end_io()`` call, we make sure a single transaction is started for
+    doing unwritten-to-written conversion. The loop for conversion is mainly
+    only required to handle a split extent across leaf blocks.
+
+How to
+------
+
+Creating Filesystems with Atomic Write Support
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First check the atomic write units supported by block device.
+See :ref:`atomic_write_bdev_support` for more details.
+
+For single-fsblock atomic writes with a larger block size
+(on systems with block size < page size):
+
+.. code-block:: bash
+
+    # Create an ext4 filesystem with a 16KB block size
+    # (requires page size >= 16KB)
+    mkfs.ext4 -b 16384 /dev/device
+
+For multi-fsblock atomic writes with bigalloc:
+
+.. code-block:: bash
+
+    # Create an ext4 filesystem with bigalloc and 64KB cluster size
+    mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device
+
+Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
+and ``-O bigalloc`` enables the bigalloc feature.
+
+Application Interface
+~~~~~~~~~~~~~~~~~~~~~
+
+Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
+to perform atomic writes:
+
+.. code-block:: c
+
+    pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);
+
+The write must be aligned to the filesystem's block size and not exceed the
+filesystem's maximum atomic write unit size.
+See ``generic_atomic_write_valid()`` for more details.
+
+``statx()`` system call with ``STATX_WRITE_ATOMIC`` flag can provides following
+details:
+
+ * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
+ * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
+ * ``stx_atomic_write_segments_max``: Upper limit for segments. The number of
+   separate memory buffers that can be gathered into a write operation
+   (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one.
+
+The STATX_ATTR_WRITE_ATOMIC flag in ``statx->attributes`` is set if atomic
+writes are supported.
+
+.. _atomic_write_bdev_support:
+
+Hardware Support
+----------------
+
+The underlying storage device must support atomic write operations.
+Modern NVMe and SCSI devices often provide this capability.
+The Linux kernel exposes this information through sysfs:
+
+* ``/sys/block/<device>/queue/atomic_write_unit_min`` - Minimum atomic write size
+* ``/sys/block/<device>/queue/atomic_write_unit_max`` - Maximum atomic write size
+
+Nonzero values for these attributes indicate that the device supports
+atomic writes.
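Reviewer note: the limits described in the new atomic_writes.rst can be illustrated with a small userspace sketch. It combines the filesystem block size, the bigalloc cluster size and the two sysfs values into an effective atomic write range, using the rule from "Multi-fsblock Implementation Details" (minimum is the larger of block size and hardware minimum, maximum is the smaller of cluster size and hardware maximum) and treating zero sysfs values as "no support". The names ``atomic_limits``, ``parse_sysfs_value()`` and ``compute_atomic_limits()`` are hypothetical helpers invented for this illustration; they are not part of ext4 or of any kernel interface.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical container for the effective atomic write range, in bytes. */
struct atomic_limits {
	uint64_t unit_min;
	uint64_t unit_max;
	int supported;	/* nonzero if a usable range exists */
};

/*
 * Parse the text read from a sysfs attribute such as
 * /sys/block/<device>/queue/atomic_write_unit_min (for example "4096\n").
 * Returns 0 if the text does not start with a decimal number.
 */
static uint64_t parse_sysfs_value(const char *text)
{
	return strtoull(text, NULL, 10);
}

/*
 * Combine filesystem geometry with the hardware limits:
 *   min = larger of block size and hardware minimum
 *   max = smaller of cluster size and hardware maximum
 * Assumption: for a filesystem without bigalloc, pass
 * cluster_size == block_size.
 */
static struct atomic_limits compute_atomic_limits(uint64_t block_size,
						  uint64_t cluster_size,
						  uint64_t hw_min,
						  uint64_t hw_max)
{
	struct atomic_limits l = { 0, 0, 0 };

	assert(block_size != 0 && cluster_size >= block_size);

	/* Zero sysfs values mean the device reports no atomic write support. */
	if (hw_min == 0 || hw_max == 0)
		return l;

	l.unit_min = block_size > hw_min ? block_size : hw_min;
	l.unit_max = cluster_size < hw_max ? cluster_size : hw_max;
	l.supported = l.unit_min <= l.unit_max;
	return l;
}
```

With a 4096-byte block size, 65536-byte clusters and a device reporting 512 and 16384, this sketch yields a range of 4096 to 16384 bytes. An application would normally read the authoritative values from ``statx()`` rather than recompute them.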
+
+See Also
+--------
+
+* :doc:`bigalloc` - Documentation on the bigalloc feature
+* :doc:`allocators` - Documentation on block allocation in ext4
+* Support for atomic block writes in 6.13:
+  https://lwn.net/Articles/1009298/
diff --git a/Documentation/filesystems/ext4/overview.rst b/Documentation/filesystems/ext4/overview.rst
index 0fad6eda6e15..9d4054c17ecb 100644
--- a/Documentation/filesystems/ext4/overview.rst
+++ b/Documentation/filesystems/ext4/overview.rst
@@ -25,3 +25,4 @@ order.
 .. include:: inlinedata.rst
 .. include:: eainode.rst
 .. include:: verity.rst
+.. include:: atomic_writes.rst
diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst
index e15c4275862a..440e4ae74e44 100644
--- a/Documentation/filesystems/f2fs.rst
+++ b/Documentation/filesystems/f2fs.rst
@@ -182,32 +182,34 @@ fault_type=%d Support configuring fault injection type, should be
                          enabled with fault_injection option, fault type value
                          is shown below, it supports single or combined type.

-                         ===========================      ===========
+                         ===========================      ==========
                          Type_Name                        Type_Value
-                         ===========================      ===========
-                         FAULT_KMALLOC                    0x000000001
-                         FAULT_KVMALLOC                   0x000000002
-                         FAULT_PAGE_ALLOC                 0x000000004
-                         FAULT_PAGE_GET                   0x000000008
-                         FAULT_ALLOC_BIO                  0x000000010 (obsolete)
-                         FAULT_ALLOC_NID                  0x000000020
-                         FAULT_ORPHAN                     0x000000040
-                         FAULT_BLOCK                      0x000000080
-                         FAULT_DIR_DEPTH                  0x000000100
-                         FAULT_EVICT_INODE                0x000000200
-                         FAULT_TRUNCATE                   0x000000400
-                         FAULT_READ_IO                    0x000000800
-                         FAULT_CHECKPOINT                 0x000001000
-                         FAULT_DISCARD                    0x000002000
-                         FAULT_WRITE_IO                   0x000004000
-                         FAULT_SLAB_ALLOC                 0x000008000
-                         FAULT_DQUOT_INIT                 0x000010000
-                         FAULT_LOCK_OP                    0x000020000
-                         FAULT_BLKADDR_VALIDITY           0x000040000
-                         FAULT_BLKADDR_CONSISTENCE        0x000080000
-                         FAULT_NO_SEGMENT                 0x000100000
-                         FAULT_INCONSISTENT_FOOTER        0x000200000
-                         ===========================      ===========
+                         ===========================      ==========
+                         FAULT_KMALLOC                    0x00000001
+                         FAULT_KVMALLOC                   0x00000002
+                         FAULT_PAGE_ALLOC                 0x00000004
+                         FAULT_PAGE_GET                   0x00000008
+                         FAULT_ALLOC_BIO                  0x00000010 (obsolete)
+                         FAULT_ALLOC_NID                  0x00000020
+                         FAULT_ORPHAN                     0x00000040
+                         FAULT_BLOCK                      0x00000080
+                         FAULT_DIR_DEPTH                  0x00000100
+                         FAULT_EVICT_INODE                0x00000200
+                         FAULT_TRUNCATE                   0x00000400
+                         FAULT_READ_IO                    0x00000800
+                         FAULT_CHECKPOINT                 0x00001000
+                         FAULT_DISCARD                    0x00002000
+                         FAULT_WRITE_IO                   0x00004000
+                         FAULT_SLAB_ALLOC                 0x00008000
+                         FAULT_DQUOT_INIT                 0x00010000
+                         FAULT_LOCK_OP                    0x00020000
+                         FAULT_BLKADDR_VALIDITY           0x00040000
+                         FAULT_BLKADDR_CONSISTENCE        0x00080000
+                         FAULT_NO_SEGMENT                 0x00100000
+                         FAULT_INCONSISTENT_FOOTER        0x00200000
+                         FAULT_TIMEOUT                    0x00400000 (1000ms)
+                         FAULT_VMALLOC                    0x00800000
+                         ===========================      ==========
 mode=%s                  Control block allocation mode which supports "adaptive"
                          and "lfs". In "lfs" mode, there should be no random
                          writes towards main area.
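Reviewer note: the "single or combined type" wording in the ``fault_type`` description means the table values are bit flags that can be ORed together. The sketch below shows this with a few values copied from the updated table. The macros are defined locally for illustration and only mirror the Type_Name column, and ``fault_type_has()`` and ``format_fault_type()`` are hypothetical helpers, not f2fs interfaces. Printing the mask in decimal is an assumption based on the ``%d`` in the option name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Values copied from the fault_type table; local definitions for illustration. */
#define FAULT_KMALLOC     0x00000001u
#define FAULT_PAGE_ALLOC  0x00000004u
#define FAULT_CHECKPOINT  0x00001000u
#define FAULT_TIMEOUT     0x00400000u

/* Return nonzero if every bit of 'flag' is set in 'mask'. */
static int fault_type_has(unsigned int mask, unsigned int flag)
{
	return (mask & flag) == flag;
}

/*
 * Write "fault_type=<decimal mask>" into buf.
 * Returns snprintf()'s result, i.e. the length of the formatted string.
 */
static int format_fault_type(char *buf, size_t len, unsigned int mask)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "fault_type=%u", mask);
}
```

For example, ``FAULT_KMALLOC | FAULT_CHECKPOINT`` gives the mask 0x1001, which ``format_fault_type()`` renders as ``fault_type=4097``.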
diff --git a/Documentation/filesystems/fuse-passthrough.rst b/Documentation/filesystems/fuse-passthrough.rst
new file mode 100644
index 000000000000..2b0e7c2da54a
--- /dev/null
+++ b/Documentation/filesystems/fuse-passthrough.rst
@@ -0,0 +1,133 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================
+FUSE Passthrough
+================
+
+Introduction
+============
+
+FUSE (Filesystem in Userspace) passthrough is a feature designed to improve the
+performance of FUSE filesystems for I/O operations. Typically, FUSE operations
+involve communication between the kernel and a userspace FUSE daemon, which can
+incur overhead. Passthrough allows certain operations on a FUSE file to bypass
+the userspace daemon and be executed directly by the kernel on an underlying
+"backing file".
+
+This is achieved by the FUSE daemon registering a file descriptor (pointing to
+the backing file on a lower filesystem) with the FUSE kernel module. The kernel
+then receives an identifier (``backing_id``) for this registered backing file.
+When a FUSE file is subsequently opened, the FUSE daemon can, in its response to
+the ``OPEN`` request, include this ``backing_id`` and set the
+``FOPEN_PASSTHROUGH`` flag. This establishes a direct link for specific
+operations.
+
+Currently, passthrough is supported for operations like ``read(2)``/``write(2)``
+(via ``read_iter``/``write_iter``), ``splice(2)``, and ``mmap(2)``.
+
+Enabling Passthrough
+====================
+
+To use FUSE passthrough:
+
+  1. The FUSE filesystem must be compiled with ``CONFIG_FUSE_PASSTHROUGH``
+     enabled.
+  2. The FUSE daemon, during the ``FUSE_INIT`` handshake, must negotiate the
+     ``FUSE_PASSTHROUGH`` capability and specify its desired
+     ``max_stack_depth``.
+  3. The (privileged) FUSE daemon uses the ``FUSE_DEV_IOC_BACKING_OPEN`` ioctl
+     on its connection file descriptor (e.g., ``/dev/fuse``) to register a
+     backing file descriptor and obtain a ``backing_id``.
+  4. When handling an ``OPEN`` or ``CREATE`` request for a FUSE file, the daemon
+     replies with the ``FOPEN_PASSTHROUGH`` flag set in
+     ``fuse_open_out::open_flags`` and provides the corresponding ``backing_id``
+     in ``fuse_open_out::backing_id``.
+  5. The FUSE daemon should eventually call ``FUSE_DEV_IOC_BACKING_CLOSE`` with
+     the ``backing_id`` to release the kernel's reference to the backing file
+     when it's no longer needed for passthrough setups.
+
+Privilege Requirements
+======================
+
+Setting up passthrough functionality currently requires the FUSE daemon to
+possess the ``CAP_SYS_ADMIN`` capability. This requirement stems from several
+security and resource management considerations that are actively being
+discussed and worked on. The primary reasons for this restriction are detailed
+below.
+
+Resource Accounting and Visibility
+----------------------------------
+
+The core mechanism for passthrough involves the FUSE daemon opening a file
+descriptor to a backing file and registering it with the FUSE kernel module via
+the ``FUSE_DEV_IOC_BACKING_OPEN`` ioctl. This ioctl returns a ``backing_id``
+associated with a kernel-internal ``struct fuse_backing`` object, which holds a
+reference to the backing ``struct file``.
+
+A significant concern arises because the FUSE daemon can close its own file
+descriptor to the backing file after registration. The kernel, however, will
+still hold a reference to the ``struct file`` via the ``struct fuse_backing``
+object as long as it's associated with a ``backing_id`` (or subsequently, with
+an open FUSE file in passthrough mode).
+
+This behavior leads to two main issues for unprivileged FUSE daemons:
+
+  1. **Invisibility to lsof and other inspection tools**: Once the FUSE
+     daemon closes its file descriptor, the open backing file held by the kernel
+     becomes "hidden." Standard tools like ``lsof``, which typically inspect
+     process file descriptor tables, would not be able to identify that this
+     file is still open by the system on behalf of the FUSE filesystem. This
+     makes it difficult for system administrators to track resource usage or
+     debug issues related to open files (e.g., preventing unmounts).
+
+  2. **Bypassing RLIMIT_NOFILE**: The FUSE daemon process is subject to
+     resource limits, including the maximum number of open file descriptors
+     (``RLIMIT_NOFILE``). If an unprivileged daemon could register backing files
+     and then close its own FDs, it could potentially cause the kernel to hold
+     an unlimited number of open ``struct file`` references without these being
+     accounted against the daemon's ``RLIMIT_NOFILE``. This could lead to a
+     denial-of-service (DoS) by exhausting system-wide file resources.
+
+The ``CAP_SYS_ADMIN`` requirement acts as a safeguard against these issues,
+restricting this powerful capability to trusted processes.
+
+**NOTE**: ``io_uring`` solves this similar issue by exposing its "fixed files",
+which are visible via ``fdinfo`` and accounted under the registering user's
+``RLIMIT_NOFILE``.
+
+Filesystem Stacking and Shutdown Loops
+--------------------------------------
+
+Another concern relates to the potential for creating complex and problematic
+filesystem stacking scenarios if unprivileged users could set up passthrough.
+A FUSE passthrough filesystem might use a backing file that resides:
+
+  * On the *same* FUSE filesystem.
+  * On another filesystem (like OverlayFS) which itself might have an upper or
+    lower layer that is a FUSE filesystem.
+
+These configurations could create dependency loops, particularly during
+filesystem shutdown or unmount sequences, leading to deadlocks or system
+instability. This is conceptually similar to the risks associated with the
+``LOOP_SET_FD`` ioctl, which also requires ``CAP_SYS_ADMIN``.
+
+To mitigate this, FUSE passthrough already incorporates checks based on
+filesystem stacking depth (``sb->s_stack_depth`` and ``fc->max_stack_depth``).
+For example, during the ``FUSE_INIT`` handshake, the FUSE daemon can negotiate
+the ``max_stack_depth`` it supports. When a backing file is registered via
+``FUSE_DEV_IOC_BACKING_OPEN``, the kernel checks if the backing file's
+filesystem stack depth is within the allowed limit.
+
+The ``CAP_SYS_ADMIN`` requirement provides an additional layer of security,
+ensuring that only privileged users can create these potentially complex
+stacking arrangements.
+
+General Security Posture
+------------------------
+
+As a general principle for new kernel features that allow userspace to instruct
+the kernel to perform direct operations on its behalf based on user-provided
+file descriptors, starting with a higher privilege requirement (like
+``CAP_SYS_ADMIN``) is a conservative and common security practice. This allows
+the feature to be used and tested while further security implications are
+evaluated and addressed.
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 32618512a965..11a599387266 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -99,6 +99,7 @@ Documentation for filesystem implementations.
    fuse
    fuse-io
    fuse-io-uring
+   fuse-passthrough
    inotify
    isofs
    nilfs2
diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst
index 939b4b624fad..ddd799df6ce3 100644
--- a/Documentation/filesystems/netfs_library.rst
+++ b/Documentation/filesystems/netfs_library.rst
@@ -712,11 +712,6 @@ handle falling back from one source type to another. The members are:
      at a boundary with the filesystem structure (e.g. at the end of a Ceph
      object). It tells netfslib not to retile subrequests across it.

- * ``NETFS_SREQ_SEEK_DATA_READ``
-
-   This is a hint from netfslib to the cache that it might want to try
-   skipping ahead to the next data (ie. using SEEK_DATA).
-
  * ``error``

    This is for the filesystem to store result of the subrequest. It should be
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 3111ef5592f3..3616d7161dab 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1243,3 +1243,9 @@ arguments in the opposite order but is otherwise identical.

 Using try_lookup_noperm() will require linux/namei.h to be included.

+---
+
+**mandatory**
+
+Calling conventions for ->d_automount() have changed; we should *not* grab
+an extra reference to new mount - it should be returned with refcount 1.
diff --git a/Documentation/filesystems/relay.rst b/Documentation/filesystems/relay.rst
index 46447dbc75ad..301ff4c6e6c6 100644
--- a/Documentation/filesystems/relay.rst
+++ b/Documentation/filesystems/relay.rst
@@ -301,16 +301,6 @@ user-defined data with a channel, and is immediately available
 (including in create_buf_file()) via chan->private_data or
 buf->chan->private_data.

-Buffer-only channels
---------------------
-
-These channels have no files associated and can be created with
-relay_open(NULL, NULL, ...). Such channels are useful in scenarios such
-as when doing early tracing in the kernel, before the VFS is up. In these
-cases, one may open a buffer-only channel and then call
-relay_late_setup_files() when the kernel is ready to handle files,
-to expose the buffered data to the userspace.
-
 Channel 'modes'
 ---------------

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index bf051c7da6b8..fd32a9a17bfb 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -1390,9 +1390,7 @@ defined:

 	If a vfsmount is returned, the caller will attempt to mount it
 	on the mountpoint and will remove the vfsmount from its
-	expiration list in the case of failure. The vfsmount should be
-	returned with 2 refs on it to prevent automatic expiration - the
-	caller will clean up the additional ref.
+	expiration list in the case of failure.

 	This function is only used if DCACHE_NEED_AUTOMOUNT is set on
 	the dentry. This is set by __d_instantiate() if S_AUTOMOUNT is