summaryrefslogtreecommitdiff
path: root/fs/xfs/xfs_file.c
AgeCommit message (Collapse)Author
2025-03-18Merge branch 'xfs-6.15-zoned_devices' into XFS-for-linus-6.15-mergeCarlos Maiolino
Merge Zoned allocator for XFS. Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-13Revert "xfs: add pre-content fsnotify hook for DAX faults"Amir Goldstein
This reverts commit 7f4796a46571ced5d3d5b0942e1bfea1eedaaecd. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250312073852.2123409-4-amir73il@gmail.com
2025-03-03xfs: implement direct writes to zoned RT devicesChristoph Hellwig
Direct writes to zoned RT devices are extremely simple. After taking the block reservation before acquiring the iolock, the iomap direct I/O calls into ->iomap_begin which will return a "fake" iomap for the entire requested range. The actual block allocation is then done from the submit_io handler using code shared with the buffered I/O path. The iomap_dio_ops set the bio_set to the (iomap) ioend one and initialize the embedded ioend, which allows reusing the existing ioend based buffered I/O completion path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: implement buffered writes to zoned RT devicesChristoph Hellwig
Implement buffered writes including page faults and block zeroing for zoned RT devices. Buffered writes to zoned RT devices are split into three phases: 1) a reservation for the worst case data block usage is taken before acquiring the iolock. When there are enough free blocks but not enough available one, garbage collection is kicked off to free the space before continuing with the write. If there isn't enough freeable space, the block reservation is reduced and a short write will happen as expected by normal Linux write semantics. 2) with the iolock held, the generic iomap buffered write code is called, which through the iomap_begin operation usually just inserts delalloc extents for the range in a single iteration. Only for overwrites of existing data that are not block aligned, or zeroing operations the existing extent mapping is read to fill out the srcmap and to figure out if zeroing is required. 3) the ->map_blocks callback to the generic iomap writeback code calls into the zoned space allocator to actually allocate on-disk space for the range before kicking of the writeback. Note that because all writes are out of place, truncate or hole punches that are not aligned to block size boundaries need to allocate space. For block zeroing from truncate, ->setattr is called with the iolock (aka i_rwsem) already held, so a hacky deviation from the above scheme is needed. In this case the space reservations is called with the iolock held, but is required not to block and can dip into the reserved block pool. This can lead to -ENOSPC when truncating a file, which is unfortunate. But fixing the calling conventions in the VFS is probably much easier with code requiring it already in mainline. Similarly because all writes are out place, the zoned allocator can't support unwritten extents and thus the FALLOC_FL_ALLOCATE_RANGE range mode of fallocate. Other fallocate modes that would reserved space but don't need to to provide proper semantics do work but do not reserve space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: don't call xfs_can_free_eofblocks from ->release for zoned inodesChristoph Hellwig
Zoned file systems require out of place writes and thus can't support post-EOF speculative preallocations. Avoid the pointless ilock critical section to find out that none can be freed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: allow internal RT devices for zoned modeChristoph Hellwig
Allow creating an RT subvolume on the same device as the main data device. This is mostly used for SMR HDDs where the conventional zones are used for the data device and the sequential write required zones for the zoned RT section. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-03-03xfs: refine the unaligned check for always COW inodes in xfs_file_dio_writeChristoph Hellwig
For always COW inodes we also must check the alignment of each individual iovec segment, as they could end up with different I/Os due to the way bio_iov_iter_get_pages works, and we'd then overwrite an already written block. The existing always_cow sysctl based code doesn't catch this because nothing enforces that blocks aren't rewritten, but for zoned XFS on sequential write required zones this is a hard error. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
2025-02-27xfs: flag as supporting FOP_DONTCACHEJens Axboe
Read side was already fully supported, and with the write side appropriately punted to the worker queue, all that's needed now is setting FOP_DONTCACHE in the file_operations structure to enable full support for read and write uncached IO. This provides similar benefits to using RWF_DONTCACHE with reads. Testing buffered writes on 32 files: writing bs 65536, uncached 0 1s: 196035MB/sec 2s: 132308MB/sec 3s: 132438MB/sec 4s: 116528MB/sec 5s: 103898MB/sec 6s: 108893MB/sec 7s: 99678MB/sec 8s: 106545MB/sec 9s: 106826MB/sec 10s: 101544MB/sec 11s: 111044MB/sec 12s: 124257MB/sec 13s: 116031MB/sec 14s: 114540MB/sec 15s: 115011MB/sec 16s: 115260MB/sec 17s: 116068MB/sec 18s: 116096MB/sec where it's quite obvious where the page cache filled, and performance dropped from to about half of where it started, settling in at around 115GB/sec. Meanwhile, 32 kswapds were running full steam trying to reclaim pages. Running the same test with uncached buffered writes: writing bs 65536, uncached 1 1s: 198974MB/sec 2s: 189618MB/sec 3s: 193601MB/sec 4s: 188582MB/sec 5s: 193487MB/sec 6s: 188341MB/sec 7s: 194325MB/sec 8s: 188114MB/sec 9s: 192740MB/sec 10s: 189206MB/sec 11s: 193442MB/sec 12s: 189659MB/sec 13s: 191732MB/sec 14s: 190701MB/sec 15s: 191789MB/sec 16s: 191259MB/sec 17s: 190613MB/sec 18s: 191951MB/sec and the behavior is fully predictable, performing the same throughout even after the page cache would otherwise have fully filled with dirty data. It's also about 65% faster, and using half the CPU of the system compared to the normal buffered write. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20250204184047.356762-3-axboe@kernel.dk Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-06iomap: pass private data to iomap_page_mkwriteChristoph Hellwig
Allow the file system to pass private data which can be used by the iomap_begin and iomap_end methods through the private pointer in the iomap_iter structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250206064035.2323428-10-hch@lst.de Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-01-23Merge tag 'fsnotify_hsm_for_v6.14-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify pre-content notification support from Jan Kara: "This introduces a new fsnotify event (FS_PRE_ACCESS) that gets generated before a file contents is accessed. The event is synchronous so if there is listener for this event, the kernel waits for reply. On success the execution continues as usual, on failure we propagate the error to userspace. This allows userspace to fill in file content on demand from slow storage. The context in which the events are generated has been picked so that we don't hold any locks and thus there's no risk of a deadlock for the userspace handler. The new pre-content event is available only for users with global CAP_SYS_ADMIN capability (similarly to other parts of fanotify functionality) and it is an administrator responsibility to make sure the userspace event handler doesn't do stupid stuff that can DoS the system. Based on your feedback from the last submission, fsnotify code has been improved and now file->f_mode encodes whether pre-content event needs to be generated for the file so the fast path when nobody wants pre-content event for the file just grows the additional file->f_mode check. As a bonus this also removes the checks whether the old FS_ACCESS event needs to be generated from the fast path. Also the place where the event is generated during page fault has been moved so now filemap_fault() generates the event if and only if there is no uptodate folio in the page cache. Also we have dropped FS_PRE_MODIFY event as current real-world users of the pre-content functionality don't really use it so let's start with the minimal useful feature set" * tag 'fsnotify_hsm_for_v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (21 commits) fanotify: Fix crash in fanotify_init(2) fs: don't block write during exec on pre-content watched files fs: enable pre-content events on supported file systems ext4: add pre-content fsnotify hook for DAX faults btrfs: disable defrag on pre-content watched files xfs: add pre-content fsnotify hook for DAX faults fsnotify: generate pre-content permission event on page fault mm: don't allow huge faults for files with pre content watches fanotify: disable readahead if we have pre-content watches fanotify: allow to set errno in FAN_DENY permission response fanotify: report file range info with pre-content events fanotify: introduce FAN_PRE_ACCESS permission event fsnotify: generate pre-content permission event on truncate fsnotify: pass optional file access range in pre-content event fsnotify: introduce pre-content permission events fanotify: reserve event bit of deprecated FAN_DIR_MODIFY fanotify: rename a misnamed constant fanotify: don't skip extra event info if no info_mode is set fsnotify: check if file is actually being watched for pre-content events on open fsnotify: opt-in for permission events at file open time ...
2024-12-12xfs: don't drop errno values when we fail to ficlone the entire rangeDarrick J. Wong
Way back when we first implemented FICLONE for XFS, life was simple -- either the the entire remapping completed, or something happened and we had to return an errno explaining what happened. Neither of those ioctls support returning partial results, so it's all or nothing. Then things got complicated when copy_file_range came along, because it actually can return the number of bytes copied, so commit 3f68c1f562f1e4 tried to make it so that we could return a partial result if the REMAP_FILE_CAN_SHORTEN flag is set. This is also how FIDEDUPERANGE can indicate that the kernel performed a partial deduplication. Unfortunately, the logic is wrong if an error stops the remapping and CAN_SHORTEN is not set. Because those callers cannot return partial results, it is an error for ->remap_file_range to return a positive quantity that is less than the @len passed in. Implementations really should be returning a negative errno in this case, because that's what btrfs (which introduced FICLONE{,RANGE}) did. Therefore, ->remap_range implementations cannot silently drop an errno that they might have when the number of bytes remapped is less than the number of bytes requested and CAN_SHORTEN is not set. Found by running generic/562 on a 64k fsblock filesystem and wondering why it reported corrupt files. Cc: <stable@vger.kernel.org> # v4.20 Fixes: 3fc9f5e409319e ("xfs: remove xfs_reflink_remap_range") Really-Fixes: 3f68c1f562f1e4 ("xfs: support returning partial reflink results") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-12-11xfs: add pre-content fsnotify hook for DAX faultsJosef Bacik
xfs has it's own handling for DAX faults, so we need to add the pre-content fsnotify hook for this case. Other faults go through filemap_fault so they're handled properly there. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/9eccdf59a65b72f0a1a5e2f2b9bff8eda2d4f2d9.1731684329.git.josef@toxicpanda.com
2024-11-21Merge tag 'xfs-6.13-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Carlos Maiolino: "The bulk of this pull request is a major rework that Darrick and Christoph have been doing on XFS's real-time volume, coupled with a few features to support this rework. It does also includes some bug fixes. - convert perag to use xarrays - create a new generic allocation group structure - add metadata inode dir trees - create in-core rt allocation groups - shard the RT section into allocation groups - persist quota options with the enw metadata dir tree - enable quota for RT volumes - enable metadata directory trees - some bugfixes" * tag 'xfs-6.13-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (146 commits) xfs: port ondisk structure checks from xfs/122 to the kernel xfs: separate space btree structures in xfs_ondisk.h xfs: convert struct typedefs in xfs_ondisk.h xfs: enable metadata directory feature xfs: enable realtime quota again xfs: update sb field checks when metadir is turned on xfs: reserve quota for realtime files correctly xfs: create quota preallocation watermarks for realtime quota xfs: report realtime block quota limits on realtime directories xfs: persist quota flags with metadir xfs: advertise realtime quota support in the xqm stat files xfs: scrub quota file metapaths xfs: fix chown with rt quota xfs: use metadir for quota inodes xfs: refactor xfs_qm_destroy_quotainos xfs: use rtgroup busy extent list for FITRIM xfs: implement busy extent tracking for rtgroups xfs: port the perag discard code to handle generic groups xfs: move the min and max group block numbers to xfs_group xfs: adjust min_block usage in xfs_verify_agbno ...
2024-11-18Merge tag 'vfs-6.13.untorn.writes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs untorn write support from Christian Brauner: "An atomic write is a write issed with torn-write protection. This means for a power failure or any hardware failure all or none of the data from the write will be stored, never a mix of old and new data. This work is already supported for block devices. If a block device is opened with O_DIRECT and the block device supports atomic write, then FMODE_CAN_ATOMIC_WRITE is added to the file of the opened block device. This contains the work to expand atomic write support to filesystems, specifically ext4 and XFS. Currently, only support for writing exactly one filesystem block atomically is added. Since it's now possible to have filesystem block size > page size for XFS, it's possible to write 4K+ blocks atomically on x86" * tag 'vfs-6.13.untorn.writes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: iomap: drop an obsolete comment in iomap_dio_bio_iter ext4: Do not fallback to buffered-io for DIO atomic write ext4: Support setting FMODE_CAN_ATOMIC_WRITE ext4: Check for atomic writes support in write iter ext4: Add statx support for atomic writes xfs: Support setting FMODE_CAN_ATOMIC_WRITE xfs: Validate atomic writes xfs: Support atomic write for statx fs: iomap: Atomic write support fs: Export generic_atomic_write_valid() block: Add bdev atomic write limits helpers fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid() block/fs: Pass an iocb to generic_atomic_write_valid()
2024-11-05xfs: remove xfs_page_mkwrite_iomap_opsChristoph Hellwig
Shared the regular buffered write iomap_ops with the page fault path and just check for the IOMAP_FAULT flag to skip delalloc punching. This keeps the delalloc punching checks in one place, and will make it easier to convert iomap to an iter model where the begin and end handlers are merged into a single callback. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-05xfs: remove __xfs_filemap_faultChristoph Hellwig
xfs_filemap_huge_fault only ever serves DAX faults, so hard code the call to xfs_dax_read_fault and open code __xfs_filemap_fault in the only remaining caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-05xfs: split write fault handling out of __xfs_filemap_faultChristoph Hellwig
Only two of the callers of __xfs_filemap_fault every handle read faults. Split the write_fault handling out of __xfs_filemap_fault so that all callers call that directly either conditionally or unconditionally and only leave the read fault handling in __xfs_filemap_fault. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-05xfs: split the page fault trace eventChristoph Hellwig
Split the xfs_filemap_fault trace event into separate ones for read and write faults and move them into the applicable locations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-11-04xfs: Support setting FMODE_CAN_ATOMIC_WRITEJohn Garry
Set FMODE_CAN_ATOMIC_WRITE flag if we can atomic write for that inode. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: John Garry <john.g.garry@oracle.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> #On ppc64
2024-11-04xfs: Validate atomic writesJohn Garry
Validate that an atomic write adheres to length/offset rules. Currently we can only write a single FS block. For an IOCB with IOCB_ATOMIC set to get as far as xfs_file_write_iter(), FMODE_CAN_ATOMIC_WRITE will need to be set for the file; for this, ATOMICWRITES flags would also need to be set for the inode. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-10-15xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eofChristoph Hellwig
xfs_file_write_zero_eof is the only caller of xfs_zero_range that does not take XFS_MMAPLOCK_EXCL (aka the invalidate lock). Currently that is actually the right thing, as an error in the iomap zeroing code will also take the invalidate_lock to clean up, but to fix that deadlock we need a consistent locking pattern first. The only extra thing that XFS_MMAPLOCK_EXCL will lock out are read pagefaults, which isn't really needed here, but also not actively harmful. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-15xfs: factor out a xfs_file_write_zero_eof helperChristoph Hellwig
Split a helper from xfs_file_write_checks that just deal with the post-EOF zeroing to keep the code readable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-09-20Merge tag 'vfs-6.12.blocksize' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs blocksize updates from Christian Brauner: "This contains the vfs infrastructure as well as the xfs bits to enable support for block sizes (bs) larger than page sizes (ps) plus a few fixes to related infrastructure. There has been efforts over the last 16 years to enable enable Large Block Sizes (LBS), that is block sizes in filesystems where bs > page size. Through these efforts we have learned that one of the main blockers to supporting bs > ps in filesystems has been a way to allocate pages that are at least the filesystem block size on the page cache where bs > ps. Thanks to various previous efforts it is possible to support bs > ps in XFS with only a few changes in XFS itself. Most changes are to the page cache to support minimum order folio support for the target block size on the filesystem. A motivation for Large Block Sizes today is to support high-capacity (large amount of Terabytes) QLC SSDs where the internal Indirection Unit (IU) are typically greater than 4k to help reduce DRAM and so in turn cost and space. In practice this then allows different architectures to use a base page size of 4k while still enabling support for block sizes aligned to the larger IUs by relying on high order folios on the page cache when needed. It also allows to take advantage of the drive's support for atomics larger than 4k with buffered IO support in Linux. As described this year at LSFMM, supporting large atomics greater than 4k enables databases to remove the need to rely on their own journaling, so they can disable double buffered writes, which is a feature different cloud providers are already enabling through custom storage solutions" * tag 'vfs-6.12.blocksize' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (22 commits) Documentation: iomap: fix a typo iomap: remove the iomap_file_buffered_write_punch_delalloc return value iomap: pass the iomap to the punch callback iomap: pass flags to iomap_file_buffered_write_punch_delalloc iomap: improve shared block detection in iomap_unshare_iter iomap: handle a post-direct I/O invalidate race in iomap_write_delalloc_release docs:filesystems: fix spelling and grammar mistakes in iomap design page filemap: fix htmldoc warning for mapping_align_index() iomap: make zero range flush conditional on unwritten mappings iomap: fix handling of dirty folios over unwritten extents iomap: add a private argument for iomap_file_buffered_write iomap: remove set_memor_ro() on zero page xfs: enable block size larger than page size support xfs: make the calculation generic in xfs_sb_validate_fsb_count() xfs: expose block size in stat xfs: use kvmalloc for xattr buffers iomap: fix iomap_dio_zero() for fs bs > system page size filemap: cap PTE range to be created to allowed zero fill in folio_map_range() mm: split a folio in minimum folio order chunks readahead: allocate folios with mapping_min_order in readahead ...
2024-09-19Merge tag 'xfs-6.12-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Chandan Babu: "New code: - Introduce new ioctls to exchange contents of two files. The first ioctl does the preparation work to exchange the contents of two files while the second ioctl performs the actual exchange if the target file has not been changed since a given sampling point. Fixes: - Fix bugs associated with calculating the maximum range of realtime extents to scan for free space. - Copy keys instead of records when resizing the incore BMBT root block. - Do not report FITRIMming more bytes than possibly exist in the filesystem. - Modify xfs_fs.h to prevent C++ compilation errors. - Do not over eagerly free post-EOF speculative preallocation. - Ensure st_blocks never goes to zero during COW writes Cleanups/refactors: - Use Xarray to hold per-AG data instead of a Radix tree. - Cleanups to: - realtime bitmap - inode allocator - quota - inode rooted btree code" * tag 'xfs-6.12-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (61 commits) xfs: ensure st_blocks never goes to zero during COW writes xfs: use xas_for_each_marked in xfs_reclaim_inodes_count xfs: convert perag lookup to xarray xfs: simplify tagged perag iteration xfs: move the tagged perag lookup helpers to xfs_icache.c xfs: use kfree_rcu_mightsleep to free the perag structures xfs: use LIST_HEAD() to simplify code xfs: Remove duplicate xfs_trans_priv.h header xfs: remove unnecessary check xfs: Use xfs set and clear mp state helpers xfs: reclaim speculative preallocations for append only files xfs: simplify extent lookup in xfs_can_free_eofblocks xfs: check XFS_EOFBLOCKS_RELEASED earlier in xfs_release_eofblocks xfs: only free posteof blocks on first close xfs: don't free post-EOF blocks on read close xfs: skip all of xfs_file_release when shut down xfs: don't bother returning errors from xfs_file_release xfs: refactor f_op->release handling xfs: remove the i_mode check in xfs_release xfs: standardize the btree maxrecs function parameters ...
2024-09-03iomap: add a private argument for iomap_file_buffered_writeJosef Bacik
In order to switch fuse over to using iomap for buffered writes we need to be able to have the struct file for the original write, in case we have to read in the page to make it uptodate. Handle this by using the existing private field in the iomap_iter, and add the argument to iomap_file_buffered_write. This will allow us to pass the file in through the iomap buffered write path, and is flexible for any other file systems needs. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/7f55c7c32275004ba00cddf862d970e6e633f750.1724755651.git.josef@toxicpanda.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-03xfs: reclaim speculative preallocations for append only filesChristoph Hellwig
The XFS XFS_DIFLAG_APPEND maps to the VFS S_APPEND flag, which forbids writes that don't append at the current EOF. But the commit originally adding XFS_DIFLAG_APPEND support (commit a23321e766d in xfs xfs-import repository) also checked it to skip releasing speculative preallocations, which doesn't make any sense. Another commit (dd9f438e3290 in the xfs-import repository) later extended that flag to also report these speculation preallocations which should not exist in getbmap. Remove these checks as nothing XFS_DIFLAG_APPEND implies that preallocations beyond EOF should exist, but explicitly check for XFS_DIFLAG_APPEND in xfs_file_release to bypass the algorithm that discard preallocations on the first close as append only files aren't expected to be written to only once. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03xfs: check XFS_EOFBLOCKS_RELEASED earlier in xfs_release_eofblocksChristoph Hellwig
If the XFS_EOFBLOCKS_RELEASED flag is set, we are not going to free the eofblocks, so don't bother locking the inode or performing the checks in xfs_can_free_eofblocks. Also switch to a test_and_set operation once the iolock has been acquire so that only the caller that sets it actually frees the post-EOF blocks. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03xfs: only free posteof blocks on first closeDarrick J. Wong
Certain workloads fragment files on XFS very badly, such as a software package that creates a number of threads, each of which repeatedly run the sequence: open a file, perform a synchronous write, and close the file, which defeats the speculative preallocation mechanism. We work around this problem by only deleting posteof blocks the /first/ time a file is closed to preserve the behavior that unpacking a tarball lays out files one after the other with no gaps. Signed-off-by: Darrick J. Wong <djwong@kernel.org> [hch: rebased, updated comment, renamed the flag] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03xfs: don't free post-EOF blocks on read closeDave Chinner
When we have a workload that does open/read/close in parallel with other allocation, the file becomes rapidly fragmented. This is due to close() calling xfs_file_release() and removing the speculative preallocation beyond EOF. Add a check for a writable context to xfs_file_release to skip the post-EOF block freeing (an the similarly pointless flushing on truncate down). Before: Test 1: sync write fragmentation counts /mnt/scratch/file.0: 919 /mnt/scratch/file.1: 916 /mnt/scratch/file.2: 919 /mnt/scratch/file.3: 920 /mnt/scratch/file.4: 920 /mnt/scratch/file.5: 921 /mnt/scratch/file.6: 916 /mnt/scratch/file.7: 918 After: Test 1: sync write fragmentation counts /mnt/scratch/file.0: 24 /mnt/scratch/file.1: 24 /mnt/scratch/file.2: 11 /mnt/scratch/file.3: 24 /mnt/scratch/file.4: 3 /mnt/scratch/file.5: 24 /mnt/scratch/file.6: 24 /mnt/scratch/file.7: 23 Signed-off-by: Dave Chinner <dchinner@redhat.com> [darrick: wordsmithing, fix commit message] Signed-off-by: Darrick J. Wong <djwong@kernel.org> [hch: ported to the new ->release code structure] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03xfs: skip all of xfs_file_release when shut downChristoph Hellwig
There is no point in trying to free post-EOF blocks when the file system is shutdown, as it will just error out ASAP. Instead return instantly when xfs_file_release is called on a shut down file system. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03xfs: don't bother returning errors from xfs_file_releaseChristoph Hellwig
While ->release returns int, the only caller ignores the return value. As we're only doing cleanup work there isn't much of a point in return a value to start with, so just document the situation instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-09-03xfs: refactor f_op->release handlingChristoph Hellwig
Currently f_op->release is split in not very obvious ways. Fix that by folding xfs_release into xfs_file_release. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-08-28xfs: refactor xfs_file_fallocateChristoph Hellwig
Refactor xfs_file_fallocate into separate helpers for each mode, two factors for i_size handling and a single switch statement over the supported modes. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240827065123.1762168-7-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-28xfs: move the xfs_is_always_cow_inode check into xfs_alloc_file_spaceChristoph Hellwig
Move the xfs_is_always_cow_inode check from the caller into xfs_alloc_file_space to prepare for refactoring of xfs_file_fallocate. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240827065123.1762168-6-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-28xfs: call xfs_flush_unmap_range from xfs_free_file_spaceChristoph Hellwig
Call xfs_flush_unmap_range from xfs_free_file_space so that xfs_file_fallocate doesn't have to predict which mode will call it. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240827065123.1762168-5-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-01xfs: fold xfs_ilock_for_write_fault into xfs_write_faultChristoph Hellwig
Now that the page fault handler has been refactored, the only caller of xfs_ilock_for_write_fault is simple enough and calls it unconditionally. Fold the logic and expand the comments explaining it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-01xfs: always take XFS_MMAPLOCK shared in xfs_dax_read_faultChristoph Hellwig
After the previous refactoring, xfs_dax_fault is now never used for write faults, so don't bother with the xfs_ilock_for_write_fault logic to protect against writes when remapping is in progress. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-01xfs: refactor __xfs_filemap_faultChristoph Hellwig
Split the write fault and DAX fault handling into separate helpers so that the main fault handler is easier to follow. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-01xfs: simplify xfs_dax_faultChristoph Hellwig
Replace the separate stub with an IS_ENABLED check, and take the call to dax_finish_sync_fault into xfs_dax_fault instead of leaving it in the caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-07-01xfs: cleanup xfs_ilock_iocb_for_writeChristoph Hellwig
Move the relock path out of the straight line and add a comment explaining why it exists. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-05-20Merge tag 'xfs-6.10-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Chandan Babu: "Online repair feature continues to be expanded. Also, we now support delayed allocation for realtime devices which have an extent size that is equal to filesystem's block size. New code: - Introduce Parent Pointer extended attribute for inodes - Bring back delalloc support for realtime devices which have an extent size that is equal to filesystem's block size - Improve performance of log incompat feature handling Online Repair: - Implement atomic file content exchanges i.e. exchange ranges of bytes between two files atomically - Create temporary files to repair file-based metadata. This uses atomic file content exchange facility to swap file fork mappings between the temporary file and the metadata inode - Allow callers of directory/xattr code to set an explicit owner number to be written into the header fields of any new blocks that are created. This is required to avoid walking every block of the new structure and modify their ownership during online repair - Repair more data structures: - Extended attributes - Inode unlinked state - Directories - Symbolic links - AGI's unlinked inode list - Parent pointers - Move Orphan files to lost and found directory - Fixes for Inode repair functionality - Introduce a new sub-AG FITRIM implementation to reduce the duration for which the AGF lock is held - Updates for the design documentation - Use Parent Pointers to assist in checking directories, parent pointers, extended attributes, and link counts Fixes: - Prevent userspace from reading invalid file data due to incorrect. updation of file size when performing a non-atomic clone operation - Minor fixes to online repair - Fix confusing return values from xfs_bmapi_write() - Fix an out of bounds access due to incorrect h_size during log recovery - Defer upgrading the extent counters in xfs_reflink_end_cow_extent() until we know we are going to modify the extent mapping - Remove racy access to if_bytes check in xfs_reflink_end_cow_extent() - Fix sparse warnings Cleanups: - Hold inode locks on all files involved in a rename until the completion of the operation. This is in preparation for the parent pointers patchset where parent pointers are applied in a separate chained update from the actual directory update - Compile out v4 support when disabled - Cleanup xfs_extent_busy_clear() - Remove unused flags and fields from struct xfs_da_args - Remove definitions of unused functions - Improve extended attribute validation - Add higher level directory operations helpers to remove duplication of code - Cleanup quota (un)reservation interfaces" * tag 'xfs-6.10-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (221 commits) xfs: simplify iext overflow checking and upgrade xfs: remove a racy if_bytes check in xfs_reflink_end_cow_extent xfs: upgrade the extent counters in xfs_reflink_end_cow_extent later xfs: xfs_quota_unreserve_blkres can't fail xfs: consolidate the xfs_quota_reserve_blkres definitions xfs: clean up buffer allocation in xlog_do_recovery_pass xfs: fix log recovery buffer allocation for the legacy h_size fixup xfs: widen flags argument to the xfs_iflags_* helpers xfs: minor cleanups of xfs_attr3_rmt_blocks xfs: create a helper to compute the blockcount of a max sized remote value xfs: turn XFS_ATTR3_RMT_BUF_SPACE into a function xfs: use unsigned ints for non-negative quantities in xfs_attr_remote.c xfs: do not allocate the entire delalloc extent in xfs_bmapi_write xfs: fix xfs_bmap_add_extent_delay_real for partial conversions xfs: remove the xfs_iext_peek_prev_extent call in xfs_bmapi_allocate xfs: pass the actual offset and len to allocate to xfs_bmapi_allocate xfs: don't open code XFS_FILBLKS_MIN in xfs_bmapi_write xfs: lift a xfs_valid_startblock into xfs_bmapi_allocate xfs: remove the unusued tmp_logflags variable in xfs_bmapi_allocate xfs: fix error returns from xfs_bmapi_write ...
2024-04-24xfs: don't call xfs_file_open from xfs_dir_openChristoph Hellwig
Directories do not support direct I/O and thus no non-blocking direct I/O either. Open code the shutdown check and call to generic_file_open instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240423124608.537794-4-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-24xfs: drop fop_flags for directoriesChristoph Hellwig
Directories have non of the capabilities, so drop the flags. Note that the current state is harmless as no one actually checks for the flags either. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240423124608.537794-3-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-24xfs: fix overly long line in the file_operationsChristoph Hellwig
Re-wrap the newly added fop_flags fields to not go over 80 characters. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240423124608.537794-2-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-15xfs: refactor non-power-of-two alignment checksDarrick J. Wong
Create a helper function that can compute if a 64-bit number is an integer multiple of a 32-bit number, where the 32-bit number is not required to be an even power of two. This is needed for some new code for the realtime device, where we can set 37k allocation units and then have to remap them. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: create a new helper to return a file's allocation unitDarrick J. Wong
Create a new helper function to calculate the fundamental allocation unit (i.e. the smallest unit of space we can allocate) of a file. Things are going to get hairy with range-exchange on the realtime device, so prepare for this now. Remove the static attribute from xfs_is_falloc_aligned since the next patch will need it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: declare xfs_file.c symbols in xfs_file.hDarrick J. Wong
Move the two public symbols in xfs_file.c to xfs_file.h. We're about to add more public symbols in that source file, so let's finally create the header file. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: move inode lease breaking functions to xfs_inode.cDarrick J. Wong
The lease breaking functions operate at the scope of the entire VFS inode, not subranges of a file. Move them to xfs_inode.c since they're already declared in xfs_inode.h. This cleanup moves us closer to having xfs_FOO.h declare only the symbols in xfs_FOO.c. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-07fs: claw back a few FMODE_* bitsChristian Brauner
There's a bunch of flags that are purely based on what the file operations support while also never being conditionally set or unset. IOW, they're not subject to change for individual files. Imho, such flags don't need to live in f_mode they might as well live in the fops structs itself. And the fops struct already has that lonely mmap_supported_flags member. We might as well turn that into a generic fop_flags member and move a few flags from FMODE_* space into FOP_* space. That gets us four FMODE_* bits back and the ability for new static flags that are about file ops to not have to live in FMODE_* space but in their own FOP_* space. It's not the most beautiful thing ever but it gets the job done. Yes, there'll be an additional pointer chase but hopefully that won't matter for these flags. I suspect there's a few more we can move into there and that we can also redirect a bunch of new flag suggestions that follow this pattern into the fop_flags field instead of f_mode. Link: https://lore.kernel.org/r/20240328-gewendet-spargel-aa60a030ef74@brauner Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-19xfs: Replace xfs_isilocked with xfs_assert_ilockedMatthew Wilcox (Oracle)
To use the new rwsem_assert_held()/rwsem_assert_held_write(), we can't use the existing ASSERT macro. Add a new xfs_assert_ilocked() and convert all the callers. Fix an apparent bug in xfs_isilocked(): If the caller specifies XFS_IOLOCK_EXCL | XFS_ILOCK_EXCL, xfs_assert_ilocked() will check both the IOLOCK and the ILOCK are held for write. xfs_isilocked() only checked that the ILOCK was held for write. xfs_assert_ilocked() is always on, even if DEBUG or XFS_WARN aren't defined. It's a cheap check, so I don't think it's worth defining it away. Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>