summaryrefslogtreecommitdiff
path: root/fs/xfs
AgeCommit message (Collapse)Author
2024-04-15xfs: support preallocating and copying content into temporary filesDarrick J. Wong
Create the routines we need to preallocate space in a temporary ondisk file and then copy the contents of an xfile into the tempfile. The upcoming rtsummary repair feature will construct the contents of a realtime summary file in memory, after which it will want to copy all that into the ondisk temporary file before atomically committing the new rtsummary contents. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: add the ability to reap entire inode forksDarrick J. Wong
In preparation for supporting repair of indexed file-based metadata (such as realtime bitmaps, directories, and extended attribute data), add a function to reap the old blocks after a metadata repair finishes. IOWs, this is an elaborate bunmapi call that deals with crosslinked blocks by unmapping them without freeing them, and also scans for incore buffers to invalidate. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: refactor live buffer invalidation for repairsDarrick J. Wong
In an upcoming patch, we will need to be able to look for xfs_buf objects caching file-based metadata blocks without needing to walk the (possibly corrupt) structures to find all the buffers. Repair already has most of the code needed to scan the buffer cache, so hoist these utility functions. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: create temporary files and directories for online repairDarrick J. Wong
Teach the online repair code how to create temporary files or directories. These temporary files can be used to stage reconstructed information until we're ready to perform an atomic extent swap to commit the new metadata. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: hide private inodes from bulkstat and handle functionsDarrick J. Wong
We're about to start adding functionality that uses internal inodes that are private to XFS. What this means is that userspace should never be able to access any information about these files, and should not be able to open these files by handle. To prevent users from ever finding the file or mis-interactions with the security apparatus, set S_PRIVATE on the inode. Don't allow bulkstat, open-by-handle, or linking of S_PRIVATE files into the directory tree. This should keep private inodes actually private. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: enable logged file mapping exchange featureDarrick J. Wong
Add the XFS_SB_FEAT_INCOMPAT_EXCHRANGE feature to the set of features that we will permit when mounting a filesystem. This turns on support for the file range exchange feature. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: capture inode generation numbers in the ondisk exchmaps log itemDarrick J. Wong
Per some very late review comments, capture the generation numbers of both inodes involved in a file content exchange operation so that we don't accidentally target files with have been reallocated. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: support non-power-of-two rtextsize with exchange-rangeDarrick J. Wong
The generic exchange-range alignment checks use (fast) bitmasking operations to perform block alignment checks on the exchange parameters. Unfortunately, bitmasks require that the alignment size be a power of two. This isn't true for realtime devices with a non-power-of-two extent size, so we have to copy-pasta the generic checks using long division for this to work properly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: make file range exchange support realtime filesDarrick J. Wong
Now that bmap items support the realtime device, we can add the necessary pieces to the file range exchange code to support exchanging mappings. All we really need to do here is adjust the blockcount upwards to the end of the rt extent and remove the inode checks. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: condense symbolic links after a mapping exchange operationDarrick J. Wong
The previous commit added a new file mapping exchange flag that enables us to perform post-exchange processing on file2 once we're done exchanging the extent mappings. Now add this ability for symlinks. This isn't used anywhere right now, but we need to have the basic ondisk flags in place so that a future online symlink repair feature can salvage the remote target in a temporary link and exchange the data fork mappings when ready. If one file is in extents format and the other is inline, we will have to promote both to extents format to perform the exchange. After the exchange, we can try to condense the fixed symlink down to inline format if possible. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: condense directories after a mapping exchange operationDarrick J. Wong
The previous commit added a new file mapping exchange flag that enables us to perform post-swap processing on file2 once we're done exchanging extent mappings. Now add this ability for directories. This isn't used anywhere right now, but we need to have the basic ondisk flags in place so that a future online directory repair feature can create salvaged dirents in a temporary directory and exchange the data fork mappings when ready. If one file is in extents format and the other is inline, we will have to promote both to extents format to perform the exchange. After the exchange, we can try to condense the fixed directory down to inline format if possible. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: condense extended attributes after a mapping exchange operationDarrick J. Wong
Add a new file mapping exchange flag that enables us to perform post-exchange processing on file2 once we're done exchanging the extent mappings. If we were swapping mappings between extended attribute forks, we want to be able to convert file2's attr fork from block to inline format. (This implies that all fork contents are exchanged.) This isn't used anywhere right now, but we need to have the basic ondisk flags in place so that a future online xattr repair feature can create salvaged attrs in a temporary file and exchange the attr fork mappings when ready. If one file is in extents format and the other is inline, we will have to promote both to extents format to perform the exchange. After the exchange, we can try to condense the fixed file's attr fork back down to inline format if possible. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: add error injection to test file mapping exchange recoveryDarrick J. Wong
Add an errortag so that we can test recovery of exchmaps log items. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: bind together the front and back ends of the file range exchange codeDarrick J. Wong
So far, we've constructed the front end of the file range exchange code that does all the checking; and the back end of the file mapping exchange code that actually does the work. Glue these two pieces together so that we can turn on the functionality. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: create deferred log items for file mapping exchangesDarrick J. Wong
Now that we've created the skeleton of a log intent item to track and restart file mapping exchange operations, add the upper level logic to commit intent items and turn them into concrete work recorded in the log. This builds on the existing bmap update intent items that have been around for a while now. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: introduce a file mapping exchange log intent itemDarrick J. Wong
Introduce a new intent log item to handle exchanging mappings between the forks of two files. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: create a incompat flag for atomic file mapping exchangesDarrick J. Wong
Create a incompat flag so that we only attempt to process file mapping exchange log items if the filesystem supports it, and a geometry flag to advertise support if it's present. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: introduce new file range exchange ioctlDarrick J. Wong
Introduce a new ioctl to handle exchanging ranges of bytes between files. The goal here is to perform the exchange atomically with respect to applications -- either they see the file contents before the exchange or they see that A-B is now B-A, even if the kernel crashes. My original goal with all this code was to make it so that online repair can build a replacement directory or xattr structure in a temporary file and commit the repair by atomically exchanging all the data blocks between the two files. However, I needed a way to test this mechanism thoroughly, so I've been evolving an ioctl interface since then. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: constify xfs_bmap_is_written_extentDarrick J. Wong
This predicate doesn't modify the structure that's being passed in, so we can mark it const. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: refactor non-power-of-two alignment checksDarrick J. Wong
Create a helper function that can compute if a 64-bit number is an integer multiple of a 32-bit number, where the 32-bit number is not required to be an even power of two. This is needed for some new code for the realtime device, where we can set 37k allocation units and then have to remap them. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: hoist multi-fsb allocation unit detection to a helperDarrick J. Wong
Replace the open-coded logic to decide if a file has a multi-fsb allocation unit to a helper to make the code easier to read. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: create a new helper to return a file's allocation unitDarrick J. Wong
Create a new helper function to calculate the fundamental allocation unit (i.e. the smallest unit of space we can allocate) of a file. Things are going to get hairy with range-exchange on the realtime device, so prepare for this now. Remove the static attribute from xfs_is_falloc_aligned since the next patch will need it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: declare xfs_file.c symbols in xfs_file.hDarrick J. Wong
Move the two public symbols in xfs_file.c to xfs_file.h. We're about to add more public symbols in that source file, so let's finally create the header file. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: move xfs_iops.c declarations out of xfs_inode.hDarrick J. Wong
Similarly, move declarations of public symbols of xfs_iops.c from xfs_inode.h to xfs_iops.h. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: move inode lease breaking functions to xfs_inode.cDarrick J. Wong
The lease breaking functions operate at the scope of the entire VFS inode, not subranges of a file. Move them to xfs_inode.c since they're already declared in xfs_inode.h. This cleanup moves us closer to having xfs_FOO.h declare only the symbols in xfs_FOO.c. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: only clear log incompat flags at clean unmountDarrick J. Wong
While reviewing the online fsck patchset, someone spied the xfs_swapext_can_use_without_log_assistance function and wondered why we go through this inverted-bitmask dance to avoid setting the XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT feature. (The same principles apply to the logged extended attribute update feature bit in the since-merged LARP series.) The reason for this dance is that xfs_add_incompat_log_feature is an expensive operation -- it forces the log, pushes the AIL, and then if nobody's beaten us to it, sets the feature bit and issues a synchronous write of the primary superblock. That could be a one-time cost amortized over the life of the filesystem, but the log quiesce and cover operations call xfs_clear_incompat_log_features to remove feature bits opportunistically. On a moderately loaded filesystem this leads to us cycling those bits on and off over and over, which hurts performance. Why do we clear the log incompat bits? Back in ~2020 I think Dave and I had a conversation on IRC[2] about what the log incompat bits represent. IIRC in that conversation we decided that the log incompat bits protect unrecovered log items so that old kernels won't try to recover them and barf. Since a clean log has no protected log items, we could clear the bits at cover/quiesce time. As Dave Chinner pointed out in the thread, clearing log incompat bits at unmount time has positive effects for golden root disk image generator setups, since the generator could be running a newer kernel than what gets written to the golden image -- if there are log incompat fields set in the golden image that was generated by a newer kernel/OS image builder then the provisioning host cannot mount the filesystem even though the log is clean and recovery is unnecessary to mount the filesystem. Given that it's expensive to set log incompat bits, we really only want to do that once per bit per mount. Therefore, I propose that we only clear log incompat bits as part of writing a clean unmount record. Do this by adding an operational state flag to the xfs mount that guards whether or not the feature bit clearing can actually take place. This eliminates the l_incompat_users rwsem that we use to protect a log cleaning operation from clearing a feature bit that a frontend thread is trying to set -- this lock adds another way to fail w.r.t. locking. For the swapext series, I shard that into multiple locks just to work around the lockdep complaints, and that's fugly. Link: https://lore.kernel.org/linux-xfs/20240131230043.GA6180@frogsfrogsfrogs/ Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2024-04-15xfs: fix error bailout in xrep_abt_build_new_treesDarrick J. Wong
Dan Carpenter reports: "Commit 4bdfd7d15747 ("xfs: repair free space btrees") from Dec 15, 2023 (linux-next), leads to the following Smatch static checker warning: fs/xfs/scrub/alloc_repair.c:781 xrep_abt_build_new_trees() warn: missing unwind goto?" That's a bug, so let's fix it. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: 4bdfd7d15747 ("xfs: repair free space btrees") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: fix potential AGI <-> ILOCK ABBA deadlock in ↵Darrick J. Wong
xrep_dinode_findmode_walk_directory xfs/399 found the following deadlock when fuzzing core.mode = ones: /proc/20506/task/20558/stack : [<0>] xfs_ilock+0xa0/0x240 [xfs] [<0>] xfs_ilock_data_map_shared+0x1b/0x20 [xfs] [<0>] xrep_dinode_findmode_walk_directory+0x69/0xe0 [xfs] [<0>] xrep_dinode_find_mode+0x103/0x2a0 [xfs] [<0>] xrep_dinode_mode+0x7c/0x120 [xfs] [<0>] xrep_dinode_core+0xed/0x2b0 [xfs] [<0>] xrep_dinode_problems+0x10/0x80 [xfs] [<0>] xrep_inode+0x6c/0xc0 [xfs] [<0>] xrep_attempt+0x64/0x1d0 [xfs] [<0>] xfs_scrub_metadata+0x365/0x840 [xfs] [<0>] xfs_scrubv_metadata+0x282/0x430 [xfs] [<0>] xfs_ioc_scrubv_metadata+0x149/0x1a0 [xfs] [<0>] xfs_file_ioctl+0xc68/0x1780 [xfs] /proc/20506/task/20559/stack : [<0>] xfs_buf_lock+0x3b/0x110 [xfs] [<0>] xfs_buf_find_lock+0x66/0x1c0 [xfs] [<0>] xfs_buf_get_map+0x208/0xc00 [xfs] [<0>] xfs_buf_read_map+0x5d/0x2c0 [xfs] [<0>] xfs_trans_read_buf_map+0x1b0/0x4c0 [xfs] [<0>] xfs_read_agi+0xbd/0x190 [xfs] [<0>] xfs_ialloc_read_agi+0x47/0x160 [xfs] [<0>] xfs_imap_lookup+0x69/0x1f0 [xfs] [<0>] xfs_imap+0x1fc/0x3d0 [xfs] [<0>] xfs_iget+0x357/0xd50 [xfs] [<0>] xchk_dir_actor+0x16e/0x330 [xfs] [<0>] xchk_dir_walk_block+0x164/0x1e0 [xfs] [<0>] xchk_dir_walk+0x13a/0x190 [xfs] [<0>] xchk_directory+0x1a2/0x2b0 [xfs] [<0>] xfs_scrub_metadata+0x2f4/0x840 [xfs] [<0>] xfs_scrubv_metadata+0x282/0x430 [xfs] [<0>] xfs_ioc_scrubv_metadata+0x149/0x1a0 [xfs] [<0>] xfs_file_ioctl+0xc68/0x1780 [xfs] Thread 20558 holds an AGI buffer and is trying to grab the ILOCK of the root directory. Thread 20559 holds the root directory ILOCK and is trying to grab the AGI of an inode that is one of the root directory's children. The AGI held by 20558 is the same buffer that 20559 is trying to acquire. In other words, this is an ABBA deadlock. In general, the lock order is ILOCK and then AGI -- rename does this while preparing for an operation involving whiteouts or renaming files out of existence; and unlink does this when moving an inode to the unlinked list. The only place where we do it in the opposite order is on the child during an icreate, but at that point the child is marked INEW and is not visible to other threads. Work around this deadlock by replacing the blocking ilock attempt with a nonblocking loop that aborts after 30 seconds. Relax for a jiffy after a failed lock attempt. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: fix an AGI lock acquisition ordering problem in xrep_dinode_findmodeDarrick J. Wong
While reviewing the next patch which fixes an ABBA deadlock between the AGI and a directory ILOCK, someone asked a question about why we're holding the AGI in the first place. The reason for that is to quiesce the inode structures for that AG while we do a repair. I then realized that the xrep_dinode_findmode invokes xchk_iscan_iter, which walks the inobts (and hence the AGIs) to find all the inodes. This itself is also an ABBA vector, since the damaged inode could be in AG 5, which we hold while we scan AG 0 for directories. 5 -> 0 is not allowed. To address this, modify the iscan to allow trylock of the AGI buffer using the flags argument to xfs_ialloc_read_agi that the previous patch added. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-15xfs: pass xfs_buf lookup flags to xfs_*read_agiDarrick J. Wong
Allow callers to pass buffer lookup flags to xfs_read_agi and xfs_ialloc_read_agi. This will be used in the next patch to fix a deadlock in the online fsck inode scanner. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-04-07fs: claw back a few FMODE_* bitsChristian Brauner
There's a bunch of flags that are purely based on what the file operations support while also never being conditionally set or unset. IOW, they're not subject to change for individual files. Imho, such flags don't need to live in f_mode they might as well live in the fops structs itself. And the fops struct already has that lonely mmap_supported_flags member. We might as well turn that into a generic fop_flags member and move a few flags from FMODE_* space into FOP_* space. That gets us four FMODE_* bits back and the ability for new static flags that are about file ops to not have to live in FMODE_* space but in their own FOP_* space. It's not the most beautiful thing ever but it gets the job done. Yes, there'll be an additional pointer chase but hopefully that won't matter for these flags. I suspect there's a few more we can move into there and that we can also redirect a bunch of new flag suggestions that follow this pattern into the fop_flags field instead of f_mode. Link: https://lore.kernel.org/r/20240328-gewendet-spargel-aa60a030ef74@brauner Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-06Merge tag 'xfs-6.9-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fix from Chandan Babu: - Allow creating new links to special files which were not associated with a project quota * tag 'xfs-6.9-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: allow cross-linking special files without project quota
2024-04-05Merge tag 'vfs-6.9-rc3.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "This contains a few small fixes. This comes with some delay because I wanted to wait on people running their reproducers and the Easter Holidays meant that those replies came in a little later than usual: - Fix handling of preventing writes to mounted block devices. Since last kernel we allow to prevent writing to mounted block devices provided CONFIG_BLK_DEV_WRITE_MOUNTED isn't set and the block device is opened with restricted writes. When we switched to opening block devices as files we altered the mechanism by which we recognize when a block device has been opened with write restrictions. The detection logic assumed that only read-write mounted filesystems would apply write restrictions to their block devices from other openers. That of course is not true since it also makes sense to apply write restrictions for filesystems that are read-only. Fix the detection logic using an FMODE_* bit. We still have a few left since we freed up a couple a while ago. I also picked up a patch to free up four additional FMODE_* bits scheduled for the next merge window. - Fix counting the number of writers to a block device. This just changes the logic to be consistent. - Fix a bug in aio causing a NULL pointer derefernce after we implemented batched processing in aio. - Finally, add the changes we discussed that allows to yield block devices early even though file closing itself is deferred. This also allows us to remove two holder operations to get and release the holder to align lifetime of file and holder of the block device" * tag 'vfs-6.9-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: aio: Fix null ptr deref in aio_complete() wakeup fs,block: yield devices early block: count BLK_OPEN_RESTRICT_WRITES openers block: handle BLK_OPEN_RESTRICT_WRITES correctly
2024-04-01xfs: allow cross-linking special files without project quotaAndrey Albershteyn
There's an issue that if special files is created before quota project is enabled, then it's not possible to link this file. This works fine for normal files. This happens because xfs_quota skips special files (no ioctls to set necessary flags). The check for having the same project ID for source and destination then fails as source file doesn't have any ID. mkfs.xfs -f /dev/sda mount -o prjquota /dev/sda /mnt/test mkdir /mnt/test/foo mkfifo /mnt/test/foo/fifo1 xfs_quota -xc "project -sp /mnt/test/foo 9" /mnt/test > Setting up project 9 (path /mnt/test/foo)... > xfs_quota: skipping special file /mnt/test/foo/fifo1 > Processed 1 (/etc/projects and cmdline) paths for project 9 with recursion depth infinite (-1). ln /mnt/test/foo/fifo1 /mnt/test/foo/fifo1_link > ln: failed to create hard link '/mnt/test/testdir/fifo1_link' => '/mnt/test/testdir/fifo1': Invalid cross-device link mkfifo /mnt/test/foo/fifo2 ln /mnt/test/foo/fifo2 /mnt/test/foo/fifo2_link Fix this by allowing linking of special files to the project quota if special files doesn't have any ID set (ID = 0). Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-03-27fs,block: yield devices earlyChristian Brauner
Currently a device is only really released once the umount returns to userspace due to how file closing works. That ultimately could cause an old umount assumption to be violated that concurrent umount and mount don't fail. So an exclusively held device with a temporary holder should be yielded before the filesystem is gone. Add a helper that allows callers to do that. This also allows us to remove the two holder ops that Linus wasn't excited about. Link: https://lore.kernel.org/r/20240326-vfs-bdev-end_holder-v1-1-20af85202918@kernel.org Fixes: f3a608827d1f ("bdev: open block device as files") # mainline only Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-03-25xfs: don't use current->journal_infoDave Chinner
syzbot reported an ext4 panic during a page fault where found a journal handle when it didn't expect to find one. The structure it tripped over had a value of 'TRAN' in the first entry in the structure, and that indicates it tripped over a struct xfs_trans instead of a jbd2 handle. The reason for this is that the page fault was taken during a copy-out to a user buffer from an xfs bulkstat operation. XFS uses an "empty" transaction context for bulkstat to do automated metadata buffer cleanup, and so the transaction context is valid across the copyout of the bulkstat info into the user buffer. We are using empty transaction contexts like this in XFS to reduce the risk of failing to release objects we reference during the operation, especially during error handling. Hence we really need to ensure that we can take page faults from these contexts without leaving landmines for the code processing the page fault to trip over. However, this same behaviour could happen from any other filesystem that triggers a page fault or any other exception that is handled on-stack from within a task context that has current->journal_info set. Having a page fault from some other filesystem bounce into XFS where we have to run a transaction isn't a bug at all, but the usage of current->journal_info means that this could result corruption of the outer task's journal_info structure. The problem is purely that we now have two different contexts that now think they own current->journal_info. IOWs, no filesystem can allow page faults or on-stack exceptions while current->journal_info is set by the filesystem because the exception processing might use current->journal_info itself. If we end up with nested XFS transactions whilst holding an empty transaction, then it isn't an issue as the outer transaction does not hold a log reservation. If we ignore the current->journal_info usage, then the only problem that might occur is a deadlock if the exception tries to take the same locks the upper context holds. That, however, is not a problem that setting current->journal_info would solve, so it's largely an irrelevant concern here. IOWs, we really only use current->journal_info for a warning check in xfs_vm_writepages() to ensure we aren't doing writeback from a transaction context. Writeback might need to do allocation, so it can need to run transactions itself. Hence it's a debug check to warn us that we've done something silly, and largely it is not all that useful. So let's just remove all the use of current->journal_info in XFS and get rid of all the potential issues from nested contexts where current->journal_info might get misused by another filesystem context. Reported-by: syzbot+cdee56dbcdf0096ef605@syzkaller.appspotmail.com Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Mark Tinguely <mark.tinguely@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-03-25xfs: allow sunit mount option to repair bad primary sb stripe valuesDave Chinner
If a filesystem has a busted stripe alignment configuration on disk (e.g. because broken RAID firmware told mkfs that swidth was smaller than sunit), then the filesystem will refuse to mount due to the stripe validation failing. This failure is triggering during distro upgrades from old kernels lacking this check to newer kernels with this check, and currently the only way to fix it is with offline xfs_db surgery. This runtime validity checking occurs when we read the superblock for the first time and causes the mount to fail immediately. This prevents the rewrite of stripe unit/width via mount options that occurs later in the mount process. Hence there is no way to recover this situation without resorting to offline xfs_db rewrite of the values. However, we parse the mount options long before we read the superblock, and we know if the mount has been asked to re-write the stripe alignment configuration when we are reading the superblock and verifying it for the first time. Hence we can conditionally ignore stripe verification failures if the mount options specified will correct the issue. We validate that the new stripe unit/width are valid before we overwrite the superblock values, so we can ignore the invalid config at verification and fail the mount later if the new values are not valid. This, at least, gives users the chance of correcting the issue after a kernel upgrade without having to resort to xfs-db hacks. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-03-22Merge tag 'xfs-6.9-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fixes from Chandan Babu: - Fix invalid pointer dereference by initializing xmbuf before tracepoint function is invoked - Use memalloc_nofs_save() when inserting into quota radix tree * tag 'xfs-6.9-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: quota radix tree allocations need to be NOFS on insert xfs: fix dev_t usage in xmbuf tracepoints
2024-03-15xfs: quota radix tree allocations need to be NOFS on insertDave Chinner
In converting the XFS code from GFP_NOFS to scoped contexts, we converted the quota radix tree to GFP_KERNEL. Unfortunately, it was not clearly documented that this set was because there is a dependency on the quotainfo->qi_tree_lock being taken in memory reclaim to remove dquots from the radix tree. In hindsight this is obvious, but the radix tree allocations on insert are not immediately obvious, and we avoid this for the inode cache radix trees by using preloading and hence completely avoiding the radix tree node allocation under tree lock constraints. Hence there are a few solutions here. The first is to reinstate GFP_NOFS for the radix tree and add a comment explaining why GFP_NOFS is used. The second is to use memalloc_nofs_save() on the radix tree insert context, which makes it obvious that the radix tree insert runs under GFP_NOFS constraints. The third option is to simply replace the radix tree and it's lock with an xarray which can do memory allocation safely in an insert context. The first is OK, but not really the direction we want to head. The second is my preferred short term solution. The third - converting XFS radix trees to xarray - is the longer term solution. Hence to fix the regression here, we take option 2 as it moves us in the direction we want to head with memory allocation and GFP_NOFS removal. Reported-by: syzbot+8fdff861a781522bda4d@syzkaller.appspotmail.com Reported-by: syzbot+d247769793ec169e4bf9@syzkaller.appspotmail.com Fixes: 94a69db2367e ("xfs: use __GFP_NOLOCKDEP instead of GFP_NOFS") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-03-15xfs: fix dev_t usage in xmbuf tracepointsDarrick J. Wong
Fix some inconsistencies in the xmbuf tracepoints -- they should be reporting the major/minor of the filesystem that they're associated with, so that we have some clue on whose behalf the xmbuf was created. Fix the xmbuf_free tracepoint to report the same. Don't call the trace function until the xmbuf is fully initialized. Fixes: 5076a6040ca1 ("xfs: support in-memory buffer cache target") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-03-13Merge tag 'xfs-6.9-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Chandan Babu: - Online repair updates: - More ondisk structures being repaired: - Inode's mode field by trying to obtain file type value from the a directory entry - Quota counters - Link counts of inodes - FS summary counters - Support for in-memory btrees has been added to support repair of rmap btrees - Misc changes: - Report corruption of metadata to the health tracking subsystem - Enable indirect health reporting when resources are scarce - Reduce memory usage while repairing refcount btree - Extend "Bmap update" intent item to support atomic extent swapping on the realtime device - Extend "Bmap update" intent item to support extended attribute fork and unwritten extents - Code cleanups: - Bmap log intent - Btree block pointer checking - Btree readahead - Buffer target - Symbolic link code - Remove mrlock wrapper around the rwsem - Convert all the GFP_NOFS flag usages to use the scoped memalloc_nofs_save() API instead of direct calls with the GFP_NOFS - Refactor and simplify xfile abstraction. Lower level APIs in shmem.c are required to be exported in order to achieve this - Skip checking alignment constraints for inode chunk allocations when block size is larger than inode chunk size - Do not submit delwri buffers collected during log recovery when an error has been encountered - Fix SEEK_HOLE/DATA for file regions which have active COW extents - Fix lock order inversion when executing error handling path during shrinking a filesystem - Remove duplicate ifdefs * tag 'xfs-6.9-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (183 commits) xfs: shrink failure needs to hold AGI buffer mm/shmem.c: Use new form of *@param in kernel-doc kernel-doc: Add unary operator * to $type_param_ref xfs: use kvfree() in xlog_cil_free_logvec() xfs: xfs_btree_bload_prep_block() should use __GFP_NOFAIL xfs: fix scrub stats file permissions xfs: fix log recovery erroring out on refcount recovery failure xfs: move symlink target write function to libxfs xfs: move remote symlink target read function to libxfs xfs: move xfs_symlink_remote.c declarations to xfs_symlink_remote.h xfs: xfs_bmap_finish_one should map unwritten extents properly xfs: support deferred bmap updates on the attr fork xfs: support recovering bmap intent items targetting realtime extents xfs: add a realtime flag to the bmap update log redo items xfs: add a xattr_entry helper xfs: fix xfs_bunmapi to allow unmapping of partial rt extents xfs: move xfs_bmap_defer_add to xfs_bmap_item.c xfs: reuse xfs_bmap_update_cancel_item xfs: add a bi_entry helper xfs: remove xfs_trans_set_bmap_flags ...
2024-03-12mm, slab: remove last vestiges of SLAB_MEM_SPREADLinus Torvalds
Yes, yes, I know the slab people were planning on going slow and letting every subsystem fight this thing on their own. But let's just rip off the band-aid and get it over and done with. I don't want to see a number of unnecessary pull requests just to get rid of a flag that no longer has any meaning. This was mainly done with a couple of 'sed' scripts and then some manual cleanup of the end result. Link: https://lore.kernel.org/all/CAHk-=wji0u+OOtmAOD-5JV3SXcRJF___k_+8XNKmak0yd5vW1Q@mail.gmail.com/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-03-11Merge tag 'vfs-6.9.uuid' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs uuid updates from Christian Brauner: "This adds two new ioctl()s for getting the filesystem uuid and retrieving the sysfs path based on the path of a mounted filesystem. Getting the filesystem uuid has been implemented in filesystem specific code for a while it's now lifted as a generic ioctl" * tag 'vfs-6.9.uuid' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: xfs: add support for FS_IOC_GETFSSYSFSPATH fs: add FS_IOC_GETFSSYSFSPATH fat: Hook up sb->s_uuid fs: FS_IOC_GETUUID ovl: convert to super_set_uuid() fs: super_set_uuid()
2024-03-11Merge tag 'vfs-6.9.super' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull block handle updates from Christian Brauner: "Last cycle we changed opening of block devices, and opening a block device would return a bdev_handle. This allowed us to implement support for restricting and forbidding writes to mounted block devices. It was accompanied by converting and adding helpers to operate on bdev_handles instead of plain block devices. That was already a good step forward but ultimately it isn't necessary to have special purpose helpers for opening block devices internally that return a bdev_handle. Fundamentally, opening a block device internally should just be equivalent to opening files. So now all internal opens of block devices return files just as a userspace open would. Instead of introducing a separate indirection into bdev_open_by_*() via struct bdev_handle bdev_file_open_by_*() is made to just return a struct file. Opening and closing a block device just becomes equivalent to opening and closing a file. This all works well because internally we already have a pseudo fs for block devices and so opening block devices is simple. There's a few places where we needed to be careful such as during boot when the kernel is supposed to mount the rootfs directly without init doing it. Here we need to take care to ensure that we flush out any asynchronous file close. That's what we already do for opening, unpacking, and closing the initramfs. So nothing new here. The equivalence of opening and closing block devices to regular files is a win in and of itself. But it also has various other advantages. We can remove struct bdev_handle completely. Various low-level helpers are now private to the block layer. Other helpers were simply removable completely. A follow-up series that is already reviewed build on this and makes it possible to remove bdev->bd_inode and allows various clean ups of the buffer head code as well. All places where we stashed a bdev_handle now just stash a file and use simple accessors to get to the actual block device which was already the case for bdev_handle" * tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) block: remove bdev_handle completely block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access bdev: remove bdev pointer from struct bdev_handle bdev: make struct bdev_handle private to the block layer bdev: make bdev_{release, open_by_dev}() private to block layer bdev: remove bdev_open_by_path() reiserfs: port block device access to file ocfs2: port block device access to file nfs: port block device access to files jfs: port block device access to file f2fs: port block device access to files ext4: port block device access to file erofs: port device access to file btrfs: port device access to file bcachefs: port block device access to file target: port block device access to file s390: port block device access to file nvme: port block device access to file block2mtd: port device access to files bcache: port block device access to files ...
2024-03-11Merge tag 'vfs-6.9.iomap' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull iomap updates from Christian Brauner: - Restore read-write hints in struct bio through the bi_write_hint member for the sake of UFS devices in mobile applications. This can result in up to 40% lower write amplification in UFS devices. The patch series that builds on this will be coming in via the SCSI maintainers (Bart) - Overhaul the iomap writeback code. Afterwards ->map_blocks() is able to map multiple blocks at once as long as they're in the same folio. This reduces CPU usage for buffered write workloads on e.g., xfs on systems with lots of cores (Christoph) - Record processed bytes in iomap_iter() trace event (Kassey) - Extend iomap_writepage_map() trace event after Christoph's ->map_block() changes to map mutliple blocks at once (Zhang) * tag 'vfs-6.9.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits) iomap: Add processed for iomap_iter iomap: add pos and dirty_len into trace_iomap_writepage_map block, fs: Restore the per-bio/request data lifetime fields fs: Propagate write hints to the struct block_device inode fs: Move enum rw_hint into a new header file fs: Split fcntl_rw_hint() fs: Verify write lifetime constants at compile time fs: Fix rw_hint validation iomap: pass the length of the dirty region to ->map_blocks iomap: map multiple blocks at a time iomap: submit ioends immediately iomap: factor out a iomap_writepage_map_block helper iomap: only call mapping_set_error once for each failed bio iomap: don't chain bios iomap: move the iomap_sector sector calculation out of iomap_add_to_ioend iomap: clean up the iomap_alloc_ioend calling convention iomap: move all remaining per-folio logic into iomap_writepage_map iomap: factor out a iomap_writepage_handle_eof helper iomap: move the PF_MEMALLOC check to iomap_writepages iomap: move the io_folios field out of struct iomap_ioend ...
2024-03-07xfs: shrink failure needs to hold AGI bufferDave Chinner
Chandan reported a AGI/AGF lock order hang on xfs/168 during recent testing. The cause of the problem was the task running xfs_growfs to shrink the filesystem. A failure occurred trying to remove the free space from the btrees that the shrink would make disappear, and that meant it ran the error handling for a partial failure. This error path involves restoring the per-ag block reservations, and that requires calculating the amount of space needed to be reserved for the free inode btree. The growfs operation hung here: [18679.536829] down+0x71/0xa0 [18679.537657] xfs_buf_lock+0xa4/0x290 [xfs] [18679.538731] xfs_buf_find_lock+0xf7/0x4d0 [xfs] [18679.539920] xfs_buf_lookup.constprop.0+0x289/0x500 [xfs] [18679.542628] xfs_buf_get_map+0x2b3/0xe40 [xfs] [18679.547076] xfs_buf_read_map+0xbb/0x900 [xfs] [18679.562616] xfs_trans_read_buf_map+0x449/0xb10 [xfs] [18679.569778] xfs_read_agi+0x1cd/0x500 [xfs] [18679.573126] xfs_ialloc_read_agi+0xc2/0x5b0 [xfs] [18679.578708] xfs_finobt_calc_reserves+0xe7/0x4d0 [xfs] [18679.582480] xfs_ag_resv_init+0x2c5/0x490 [xfs] [18679.586023] xfs_ag_shrink_space+0x736/0xd30 [xfs] [18679.590730] xfs_growfs_data_private.isra.0+0x55e/0x990 [xfs] [18679.599764] xfs_growfs_data+0x2f1/0x410 [xfs] [18679.602212] xfs_file_ioctl+0xd1e/0x1370 [xfs] trying to get the AGI lock. The AGI lock was held by a fstress task trying to do an inode allocation, and it was waiting on the AGF lock to allocate a new inode chunk on disk. Hence deadlock. The fix for this is for the growfs code to hold the AGI over the transaction roll it does in the error path. It already holds the AGF locked across this, and that is what causes the lock order inversion in the xfs_ag_resv_init() call. Reported-by: Chandan Babu R <chandanbabu@kernel.org> Fixes: 46141dc891f7 ("xfs: introduce xfs_ag_shrink_space()") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-28xfs: use kvfree() in xlog_cil_free_logvec()Dave Chinner
The xfs_log_vec items are allocated by xlog_kvmalloc(), and so need to be freed with kvfree. This was missed when coverting from the kmem_free() API. Fixes: 49292576136f ("xfs: convert kmem_free() for kvmalloc users to kvfree()") Reported-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-28xfs: xfs_btree_bload_prep_block() should use __GFP_NOFAILDave Chinner
This was missed in the conversion from KM* flags. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: 10634530f7ba ("xfs: convert kmem_zalloc() to kzalloc()") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-27xfs: drop experimental warning for FSDAXShiyang Ruan
FSDAX and reflink can work together now, let's drop this warning. Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2024-02-26xfs: fix scrub stats file permissionsDarrick J. Wong
When the kernel is in lockdown mode, debugfs will only show files that are world-readable and cannot be written, mmaped, or used with ioctl. That more or less describes the scrub stats file, except that the permissions are wrong -- they should be 0444, not 0644. You can't write the stats file, so the 0200 makes no sense. Meanwhile, the clear_stats file is only writable, but it got mode 0400 instead of 0200, which would make more sense. Fix both files so that they make sense. Fixes: d7a74cad8f451 ("xfs: track usage statistics of online fsck") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>