summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2025-05-15btrfs: use clear_extent_bits() at chunk_map_device_clear_bits()Filipe Manana
Instead of using __clear_extent_bit() we can use clear_extent_bits() since we pass a NULL value for the cached and changeset arguments. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: use clear_extent_bit() at try_release_extent_state()Filipe Manana
Instead of using __clear_extent_bit() we can use clear_extent_bit() since we pass a NULL value for the changeset argument. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: fix the file offset calculation inside btrfs_decompress_buf2page()Qu Wenruo
[BUG WITH EXPERIMENTAL LARGE FOLIOS] When testing the experimental large data folio support with compression, there are several ASSERT()s triggered from btrfs_decompress_buf2page() when running fsstress with compress=zstd mount option: - ASSERT(copy_len) from btrfs_decompress_buf2page() - VM_BUG_ON(offset + len > PAGE_SIZE) from memcpy_to_page() [CAUSE] Inside btrfs_decompress_buf2page(), we need to grab the file offset from the current bvec.bv_page, to check if we even need to copy data into the bio. And since we're using single page bvec, and no large folio, every page inside the folio should have its index properly setup. But when large folios are involved, only the first page (aka, the head page) of a large folio has its index properly initialized. The other pages inside the large folio will not have their indexes properly initialized. Thus the page_offset() call inside btrfs_decompress_buf2page() will result garbage, and completely screw up the @copy_len calculation. [FIX] Instead of using page->index directly, go with page_pgoff(), which can handle non-head pages correctly. So introduce a helper, file_offset_from_bvec(), to get the file offset from a single page bio_vec, so the copy_len calculation can be done correctly. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: use rb_entry_safe() where possible to simplify codeDavid Sterba
Simplify conditionally reading an rb_entry(), there's the rb_entry_safe() helper that checks the node pointer for NULL so we don't have to write it explicitly. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: pass a pointer to get_range_bits() to cache first search resultFilipe Manana
Allow get_range_bits() to take an extent state pointer to pointer argument so that we can cache the first extent state record in the target range, so that a caller can use it for subsequent operations without doing a full tree search. Currently the only user is try_release_extent_state(), which then does a call to __clear_extent_bit() which can use such a cached state record. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: allow folios to be released while ordered extent is finishingFilipe Manana
When the release_folio callback (from struct address_space_operations) is invoked we don't allow the folio to be released if its range is currently locked in the inode's io_tree, as it may indicate the folio may be needed by the task that locked the range. However if the range is locked because an ordered extent is finishing, then we can safely allow the folio to be released because ordered extent completion doesn't need to use the folio at all. When we are under memory pressure, the kernel starts writeback of dirty pages (folios) with the goal of releasing the pages from the page cache after writeback completes, however this often is not possible on btrfs because: * Once the writeback completes we queue the ordered extent completion; * Once the ordered extent completion starts, we lock the range in the inode's io_tree (at btrfs_finish_one_ordered()); * If the release_folio callback is called while the folio's range is locked in the inode's io_tree, we don't allow the folio to be released, so the kernel has to try to release memory elsewhere, which may result in triggering more writeback or releasing other pages from the page cache which may be more useful to have around for applications. In contrast, when the release_folio callback is invoked after writeback finishes and before ordered extent completion starts or locks the range, we allow the folio to be released, as well as when the release_folio callback is invoked after ordered extent completion unlocks the range. Improve on this by detecting if the range is locked for ordered extent completion and if it is, allow the folio to be released. This detection is achieved by adding a new extent flag in the io_tree that is set when the range is locked during ordered extent completion. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: update comment for try_release_extent_state()Filipe Manana
Drop reference to pages from the comment since the function is fully folio aware and works regardless of how many pages are in the folio. Also while at it, capitalize the first word and make it more explicit that release_folio is a callback from struct address_space_operations. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: prepare btrfs_punch_hole_lock_range() for large data foliosQu Wenruo
The function btrfs_punch_hole_lock_range() needs to make sure there is no other folio in the range, thus it goes with filemap_range_has_page(), which works pretty fine. But if we have large folios, under the following case filemap_range_has_page() will always return true, forcing btrfs_punch_hole_lock_range() to do a very time consuming busy loop: start end | | |//|//|//|//| | | | | | | | |//|//| \ / \ / Folio A Folio B In the above case, folio A and B contain our start/end indexes, and there are no other folios in the range. Thus we do not need to retry inside btrfs_punch_hole_lock_range(). To prepare for large data folios, introduce a helper, check_range_has_page(), which will: - Shrink the search range towards page boundaries If the rounded down end (exclusive, otherwise it can underflow when @end is inside the folio at file offset 0) is no larger than the rounded up start, it means the range contains no other pages other than the ones covering @start and @end. Can return false directly in that case. - Grab all the folios inside the range - Skip any large folios that cover the start and end indexes - If any other folios are found return true - Otherwise return false This new helper is going to handle both large folios and regular ones. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: prepare btrfs_buffered_write() for large data foliosQu Wenruo
This involves the following modifications: - Set the order flags for __filemap_get_folio() inside prepare_one_folio() This will allow __filemap_get_folio() to create a large folio if the address space supports it. - Limit the initial @write_bytes inside copy_one_range() If the largest folio boundary splits the initial write range, there is no way we can write beyond the largest folio boundary. This is done by a simple helper calc_write_bytes(). - Release exceeding reserved space if the folio is smaller than expected Which is doing the same handling when short copy happens. All the preparations should not change the behavior when the largest folio order is 0. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: refactor how we handle reserved space inside copy_one_range()Qu Wenruo
There are several things not ideal in copy_one_range(): - Unnecessary temporary variables * block_offset * reserve_bytes * dirty_blocks * num_blocks * release_bytes These are utilized to handle short-copy cases. - Inconsistent handling of btrfs_delalloc_release_extents() There is a hidden behavior that, after reserving metadata for X bytes of data write, we have to call btrfs_delalloc_release_extents() with X once and only once. Calling btrfs_delalloc_release_extents(X - 4K) and btrfs_delalloc_release_extents(4K) will cause outstanding extents accounting to go wrong. This is because the outstanding extents mechanism is not designed to handle shrinking of reserved space. Improve above situations by: - Use a single @reserved_start and @reserved_len pair Now we reserve space for the initial range, and if a short copy happened and we need to shrink the reserved space, we can easily calculate the new length, and update @reserved_len. - Introduce helpers to shrink reserved data and metadata space This is done by two new helpers, shrink_reserved_space() and btrfs_delalloc_shrink_extents(). The later will do a better calculation if we need to modify the outstanding extents, and the first one will be utilized inside copy_one_range(). - Manually unlock, release reserved space and return if no byte is copied Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: remove EXTENT_UPTODATE io tree flagFilipe Manana
The EXTENT_UPTODATE io tree flag is now used only to mark ranges in the fs_info->excluded_extents as used by super blocks and not available for extent allocation (to prevent adding those ranges as free space in the in memory space caches). As we can use any flag for that purpose, and we are using EXTENT_DIRTY for the pinned extents io tree for example, remove the EXTENT_UPTODATE flag and use instead EXTENT_DIRTY for the excluded extents io tree. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: stop searching for EXTENT_DIRTY bit in the excluded extents io treeFilipe Manana
At btrfs_add_new_free_space() we keep searching for ranges in the excluded extents io tree that have the EXTENT_DIRTY bit set, however we never ever set that bit for ranges in that tree. That is a leftover from when that function used the global freed extents trees (fs_info->freed_extents[2]), where we used both the EXTENT_DIRTY and EXTENT_UPTODATE bits, but those trees are gone with commit fe119a6eeb67 ("btrfs: switch to per-transaction pinned extents"), which introduced the fs_info->excluded_extents io tree, where only EXTENT_UPTODATE is set. So remove the EXTENT_DIRTY bit search at btrfs_add_new_free_space(). Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: remove leftover EXTENT_UPTODATE clear from an inode's io_treeFilipe Manana
After commit 52b029f42751 ("btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O path") we never set EXTENT_UPTODATE in an inode's io_tree anymore, but we still have some code attempting to clear that bit from an inode's io_tree. Remove that code as it doesn't do anything anymore. The sole use of the EXTENT_UPTODATE bit is for the excluded extents io_tree (fs_info->excluded_extents), which is used to track the locations of super blocks, so that their ranges are never marked as free, making them unavailable for extent allocation. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: fix fsync of files with no hard links not persisting deletionFilipe Manana
If we fsync a file (or directory) that has no more hard links, because while a process had a file descriptor open on it, the file's last hard link was removed and then the process did an fsync against the file descriptor, after a power failure or crash the file still exists after replaying the log. This behaviour is incorrect since once an inode has no more hard links it's not accessible anymore and we insert an orphan item into its subvolume's tree so that the deletion of all its items is not missed in case of a power failure or crash. So after log replay the file shouldn't exist anymore, which is also the behaviour on ext4, xfs, f2fs and other filesystems. Fix this by not ignoring inodes with zero hard links at btrfs_log_inode_parent() and by committing an inode's delayed inode when we are not doing a fast fsync (either BTRFS_INODE_COPY_EVERYTHING or BTRFS_INODE_NEEDS_FULL_SYNC is set in the inode's runtime flags). This last step is necessary because when removing the last hard link we don't delete the corresponding ref (or extref) item, instead we record the change in the inode's delayed inode with the BTRFS_DELAYED_NODE_DEL_IREF flag, so that when the delayed inode is committed we delete the ref/extref item from the inode's subvolume tree - otherwise the logging code will log the last hard link and therefore upon log replay the inode is not deleted. The base code for a fstests test case that reproduces this bug is the following: . ./common/dmflakey _require_scratch _require_dm_target flakey _require_mknod _scratch_mkfs >>$seqres.full 2>&1 || _fail "mkfs failed" _require_metadata_journaling $SCRATCH_DEV _init_flakey _mount_flakey touch $SCRATCH_MNT/foo # Commit the current transaction and persist the file. _scratch_sync # A fifo to communicate with a background xfs_io process that will # fsync the file after we deleted its hard link while it's open by # xfs_io. mkfifo $SCRATCH_MNT/fifo tail -f $SCRATCH_MNT/fifo | \ $XFS_IO_PROG $SCRATCH_MNT/foo >>$seqres.full & XFS_IO_PID=$! # Give some time for the xfs_io process to open a file descriptor for # the file. sleep 1 # Now while the file is open by the xfs_io process, delete its only # hard link. rm -f $SCRATCH_MNT/foo # Now that it has no more hard links, make the xfs_io process fsync it. echo "fsync" > $SCRATCH_MNT/fifo # Terminate the xfs_io process so that we can unmount. echo "quit" > $SCRATCH_MNT/fifo wait $XFS_IO_PID unset XFS_IO_PID # Simulate a power failure and then mount again the filesystem to # replay the journal/log. _flakey_drop_and_remount # We don't expect the file to exist anymore, since it was fsynced when # it had no more hard links. [ -f $SCRATCH_MNT/foo ] && echo "file foo still exists" _unmount_flakey # success, all done echo "Silence is golden" status=0 exit A test case for fstests will be submitted soon. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: fix typo in space info explanationMark Harmstone
There's an explanation of how space info works at the top of fs/btrfs/space-info.c, which makes reference to a variable called bytes_may_reserve. There's nothing called that in the code, and wasn't at time the comment was written; as far I can tell this is a typo, and it should actually be bytes_may_use. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Mark Harmstone <maharmstone@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: remove unused flag EXTENT_BUFFER_IN_TREEDaniel Vacek
This flag is set after inserting the eb to the buffer tree and cleared on it's removal. It was added in commit 34b41acec1ccc0 ("Btrfs: use a bit to track if we're in the radix tree") and wanted to make use of it, faa2dbf004e89e ("Btrfs: add sanity tests for new qgroup accounting code"). Both are 10+ years old, we can remove the flag. Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: remove unused flag EXTENT_BUFFER_CORRUPTDaniel Vacek
This flag is no longer being used. It was added by commit a826d6dcb32d ("Btrfs: check items for correctness as we search") but it's no longer being used after commit f26c92386028 ("btrfs: remove reada infrastructure"). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: remove unused flag EXTENT_BUFFER_READAHEADDaniel Vacek
This flag is no longer being used. It was added by commit ab0fff03055d ("btrfs: add READAHEAD extent buffer flag") and used in commits: 79fb65a1f6d9 ("Btrfs: don't call readahead hook until we have read the entire eb") 78e62c02abb9 ("btrfs: Remove extent_io_ops::readpage_io_failed_hook") 371cdc0700c7 ("btrfs: introduce subpage metadata validation check") Finally all the code using it was removed by commit f26c92386028 ("btrfs: remove reada infrastructure"). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: remove unused flag EXTENT_BUFFER_READ_ERRDaniel Vacek
This flag was added by commit 656f30dba7ab ("Btrfs: be aware of btree inode write errors to avoid fs corruption") but it stopped being used after commit 046b562b20a5 ("btrfs: use a separate end_io handler for read_extent_buffer"). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: factor out the main loop of btrfs_buffered_write() into a helperQu Wenruo
Inside the main loop of btrfs_buffered_write() we are doing a lot of heavy lifting inside a while() loop. This makes it pretty hard to read, factor out the content into a helper, copy_one_range() to do the work. This has no functional change, but with some minor variable renames, e.g. rename all "sector" into "block". Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: factor out space reservation code from btrfs_buffered_write()Qu Wenruo
Inside the main loop of btrfs_buffered_write(), we have a complex data and metadata space reservation code, which tries to reserve space for a COW write, if failed then fallback to check if we can do a NOCOW write. Factor out that part of code into a dedicated helper, reserve_space(), to make the main loop a little easier to read. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: cleanup the reserved space inside loop of btrfs_buffered_write()Qu Wenruo
Inside the main loop of btrfs_buffered_write(), if something wrong happened, there is a out-of-loop cleanup path to release the reserved space. This behavior saves some code lines, but makes it much harder to read, as we need to check release_bytes to make sure when we need to do the cleanup. Factor out the cleanup part into a helper, release_reserved_space(), to do the cleanup inside the main loop, so that we can move @release_bytes inside the loop. This will make later refactoring of the main loop much easier. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: remove force_page_uptodate variable from btrfs_buffered_write()Qu Wenruo
Commit c87c299776e4 ("btrfs: make buffered write to copy one page a time") changed how the variable @force_page_uptodate was updated. Before that commit the variable was only initialized to false at the beginning of the function, and after hitting a short copy, the next retry on the same folio would force the folio to be read from the disk. But after the commit, the variable is always initialized to false at the beginning of the loop's scope, causing prepare_one_folio() never to get a true value passed in. The change in behavior is not a huge deal, it only makes a difference on how we handle short copies: Old: Allow the buffer to be split The first short copy will be rejected, that's the same for both cases. But for the next retry, we require the folio to be read from disk. Then even if we hit a short copy again, since the folio is already uptodate, we do not need to handle partial uptodate range, and can continue, marking the short copied range as dirty and continue. This will split the buffer write into the folio as two buffered writes. New: Do not allow the buffer to be split The first short copy will be rejected, that's the same for both cases. For the next retry, we do nothing special, thus if the short copy happened again, we reject it again, until either the short copy is gone, or we failed to fault in the buffer. This will mean the buffer write into the folio will either fail or succeed, no splitting will happen. To me, either solution is fine, but the new one makes it simpler and requires no special handling, so I prefer that solution. And since @force_page_uptodate is always false when passed into prepare_one_folio(), we can just remove the variable. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15btrfs: move block perfect compression out of experimental featuresQu Wenruo
Commit 1d2fbb7f1f9e ("btrfs: allow compression even if the range is not page aligned") introduced the block perfect compression for block size < page size cases. Before that commit, if the fs block size is smaller than page size (aka subpage cases), compressed write is only enabled if the dirty range is fully page aligned. This block perfect compression support was introduced in v6.13, and has been tested for two kernel releases. I believe it's time to move it out of experimental features so that we can get more tests in the real world. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-14Merge tag 'for-6.15-rc6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix potential endless loop when discarding a block group when disabling discard - reinstate message when setting a large value of mount option 'commit' - fix a folio leak when async extent submission fails * tag 'for-6.15-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: add back warning for mount option commit values exceeding 300 btrfs: fix folio leak in submit_one_async_extent() btrfs: fix discard worker infinite loop after disabling discard
2025-05-12btrfs: add back warning for mount option commit values exceeding 300Kyoji Ogasawara
The Btrfs documentation states that if the commit value is greater than 300 a warning should be issued. The warning was accidentally lost in the new mount API update. Fixes: 6941823cc878 ("btrfs: remove old mount API code") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Kyoji Ogasawara <sawara04.o@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-12btrfs: fix folio leak in submit_one_async_extent()Boris Burkov
If btrfs_reserve_extent() fails while submitting an async_extent for a compressed write, then we fail to call free_async_extent_pages() on the async_extent and leak its folios. A likely cause for such a failure would be btrfs_reserve_extent() failing to find a large enough contiguous free extent for the compressed extent. I was able to reproduce this by: 1. mount with compress-force=zstd:3 2. fallocating most of a filesystem to a big file 3. fragmenting the remaining free space 4. trying to copy in a file which zstd would generate large compressed extents for (vmlinux worked well for this) Step 4. hits the memory leak and can be repeated ad nauseam to eventually exhaust the system memory. Fix this by detecting the case where we fallback to uncompressed submission for a compressed async_extent and ensuring that we call free_async_extent_pages(). Fixes: 131a821a243f ("btrfs: fallback if compressed IO fails for ENOSPC") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Co-developed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-12btrfs: fix discard worker infinite loop after disabling discardFilipe Manana
If the discard worker is running and there's currently only one block group, that block group is a data block group, it's in the unused block groups discard list and is being used (it got an extent allocated from it after becoming unused), the worker can end up in an infinite loop if a transaction abort happens or the async discard is disabled (during remount or unmount for example). This happens like this: 1) Task A, the discard worker, is at peek_discard_list() and find_next_block_group() returns block group X; 2) Block group X is in the unused block groups discard list (its discard index is BTRFS_DISCARD_INDEX_UNUSED) since at some point in the past it become an unused block group and was added to that list, but then later it got an extent allocated from it, so its ->used counter is not zero anymore; 3) The current transaction is aborted by task B and we end up at __btrfs_handle_fs_error() in the transaction abort path, where we call btrfs_discard_stop(), which clears BTRFS_FS_DISCARD_RUNNING from fs_info, and then at __btrfs_handle_fs_error() we set the fs to RO mode (setting SB_RDONLY in the super block's s_flags field); 4) Task A calls __add_to_discard_list() with the goal of moving the block group from the unused block groups discard list into another discard list, but at __add_to_discard_list() we end up doing nothing because btrfs_run_discard_work() returns false, since the super block has SB_RDONLY set in its flags and BTRFS_FS_DISCARD_RUNNING is not set anymore in fs_info->flags. So block group X remains in the unused block groups discard list; 5) Task A then does a goto into the 'again' label, calls find_next_block_group() again we gets block group X again. Then it repeats the previous steps over and over since there are not other block groups in the discard lists and block group X is never moved out of the unused block groups discard list since btrfs_run_discard_work() keeps returning false and therefore __add_to_discard_list() doesn't move block group X out of that discard list. When this happens we can get a soft lockup report like this: [71.957] watchdog: BUG: soft lockup - CPU#0 stuck for 27s! [kworker/u4:3:97] [71.957] Modules linked in: xfs af_packet rfkill (...) [71.957] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G W 6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa [71.957] Tainted: [W]=WARN [71.957] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014 [71.957] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs] [71.957] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs] [71.957] Code: c1 01 48 83 (...) [71.957] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246 [71.957] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000 [71.957] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad [71.957] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080 [71.957] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860 [71.957] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000 [71.957] FS: 0000000000000000(0000) GS:ffff89707bc00000(0000) knlGS:0000000000000000 [71.957] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [71.957] CR2: 00005605bcc8d2f0 CR3: 000000010376a001 CR4: 0000000000770ef0 [71.957] PKRU: 55555554 [71.957] Call Trace: [71.957] <TASK> [71.957] process_one_work+0x17e/0x330 [71.957] worker_thread+0x2ce/0x3f0 [71.957] ? __pfx_worker_thread+0x10/0x10 [71.957] kthread+0xef/0x220 [71.957] ? __pfx_kthread+0x10/0x10 [71.957] ret_from_fork+0x34/0x50 [71.957] ? __pfx_kthread+0x10/0x10 [71.957] ret_from_fork_asm+0x1a/0x30 [71.957] </TASK> [71.957] Kernel panic - not syncing: softlockup: hung tasks [71.987] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G W L 6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa [71.989] Tainted: [W]=WARN, [L]=SOFTLOCKUP [71.989] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014 [71.991] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs] [71.992] Call Trace: [71.993] <IRQ> [71.994] dump_stack_lvl+0x5a/0x80 [71.994] panic+0x10b/0x2da [71.995] watchdog_timer_fn.cold+0x9a/0xa1 [71.996] ? __pfx_watchdog_timer_fn+0x10/0x10 [71.997] __hrtimer_run_queues+0x132/0x2a0 [71.997] hrtimer_interrupt+0xff/0x230 [71.998] __sysvec_apic_timer_interrupt+0x55/0x100 [71.999] sysvec_apic_timer_interrupt+0x6c/0x90 [72.000] </IRQ> [72.000] <TASK> [72.001] asm_sysvec_apic_timer_interrupt+0x1a/0x20 [72.002] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs] [72.002] Code: c1 01 48 83 (...) [72.005] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246 [72.006] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000 [72.006] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad [72.007] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080 [72.008] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860 [72.009] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000 [72.010] ? btrfs_discard_workfn+0x51/0x400 [btrfs 23b01089228eb964071fb7ca156eee8cd3bf996f] [72.011] process_one_work+0x17e/0x330 [72.012] worker_thread+0x2ce/0x3f0 [72.013] ? __pfx_worker_thread+0x10/0x10 [72.014] kthread+0xef/0x220 [72.014] ? __pfx_kthread+0x10/0x10 [72.015] ret_from_fork+0x34/0x50 [72.015] ? __pfx_kthread+0x10/0x10 [72.016] ret_from_fork_asm+0x1a/0x30 [72.017] </TASK> [72.017] Kernel Offset: 0x15000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [72.019] Rebooting in 90 seconds.. So fix this by making sure we move a block group out of the unused block groups discard list when calling __add_to_discard_list(). Fixes: 2bee7eb8bb81 ("btrfs: discard one region at a time in async discard") Link: https://bugzilla.suse.com/show_bug.cgi?id=1242012 CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-07btrfs: use bdev_rw_virt in scrub_one_superChristoph Hellwig
Replace the code building a bio from a kernel direct map address and submitting it synchronously with the bdev_rw_virt helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Link: https://lore.kernel.org/r/20250507120451.4000627-19-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06Merge tag 'for-6.15-rc5-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - revert device path canonicalization, this does not work as intended with namespaces and is not reliable in all setups - fix crash in scrub when checksum tree is not valid, e.g. when mounted with rescue=ignoredatacsums - fix crash when tracepoint btrfs_prelim_ref_insert is enabled - other minor fixups: - open code folio_index(), meant to be used in MM code - use matching type for sizeof in compression allocation * tag 'for-6.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: open code folio_index() in btree_clear_folio_dirty_tag() Revert "btrfs: canonicalize the device path before adding it" btrfs: avoid NULL pointer dereference if no valid csum tree btrfs: handle empty eb->folios in num_extent_folios() btrfs: correct the order of prelim_ref arguments in btrfs__prelim_ref btrfs: compression: adjust cb->compressed_folios allocation type
2025-05-02btrfs: open code folio_index() in btree_clear_folio_dirty_tag()Kairui Song
The folio_index() helper is only needed for mixed usage of page cache and swap cache, for pure page cache usage, the caller can just use folio->index instead. It can't be a swap cache folio here. Swap mapping may only call into fs through 'swap_rw' but btrfs does not use that method for swap. Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-02Revert "btrfs: canonicalize the device path before adding it"Qu Wenruo
This reverts commit 7e06de7c83a746e58d4701e013182af133395188. Commit 7e06de7c83a7 ("btrfs: canonicalize the device path before adding it") tries to make btrfs to use "/dev/mapper/*" name first, then any filename inside "/dev/" as the device path. This is mostly fine when there is only the root namespace involved, but when multiple namespace are involved, things can easily go wrong for the d_path() usage. As d_path() returns a file path that is namespace dependent, the resulted string may not make any sense in another namespace. Furthermore, the "/dev/" prefix checks itself is not reliable, one can still make a valid initramfs without devtmpfs, and fill all needed device nodes manually. Overall the userspace has all its might to pass whatever device path for mount, and we are not going to win the war trying to cover every corner case. So just revert that commit, and do no extra d_path() based file path sanity check. CC: stable@vger.kernel.org # 6.12+ Link: https://lore.kernel.org/linux-fsdevel/20250115185608.GA2223535@zen.localdomain/ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-02btrfs: avoid NULL pointer dereference if no valid csum treeQu Wenruo
[BUG] When trying read-only scrub on a btrfs with rescue=idatacsums mount option, it will crash with the following call trace: BUG: kernel NULL pointer dereference, address: 0000000000000208 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page CPU: 1 UID: 0 PID: 835 Comm: btrfs Tainted: G O 6.15.0-rc3-custom+ #236 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022 RIP: 0010:btrfs_lookup_csums_bitmap+0x49/0x480 [btrfs] Call Trace: <TASK> scrub_find_fill_first_stripe+0x35b/0x3d0 [btrfs] scrub_simple_mirror+0x175/0x290 [btrfs] scrub_stripe+0x5f7/0x6f0 [btrfs] scrub_chunk+0x9a/0x150 [btrfs] scrub_enumerate_chunks+0x333/0x660 [btrfs] btrfs_scrub_dev+0x23e/0x600 [btrfs] btrfs_ioctl+0x1dcf/0x2f80 [btrfs] __x64_sys_ioctl+0x97/0xc0 do_syscall_64+0x4f/0x120 entry_SYSCALL_64_after_hwframe+0x76/0x7e [CAUSE] Mount option "rescue=idatacsums" will completely skip loading the csum tree, so that any data read will not find any data csum thus we will ignore data checksum verification. Normally call sites utilizing csum tree will check the fs state flag NO_DATA_CSUMS bit, but unfortunately scrub does not check that bit at all. This results in scrub to call btrfs_search_slot() on a NULL pointer and triggered above crash. [FIX] Check both extent and csum tree root before doing any tree search. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-02btrfs: handle empty eb->folios in num_extent_folios()Boris Burkov
num_extent_folios() unconditionally calls folio_order() on eb->folios[0]. If that is NULL this will be a segfault. It is reasonable for it to return 0 as the number of folios in the eb when the first entry is NULL, so do that instead. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-02btrfs: compression: adjust cb->compressed_folios allocation typeKees Cook
In preparation for making the kmalloc() family of allocators type aware, we need to make sure that the returned type from the allocation matches the type of the variable being assigned. (Before, the allocator would always return "void *", which can be implicitly cast to any pointer type.) The assigned type is "struct folio **" but the returned type will be "struct page **". These are the same allocation size (pointer size), but the types don't match. Adjust the allocation type to match the assignment. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Kees Cook <kees@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-30Merge tag 'for-6.15-rc4-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix potential inode leak in iget() after memory allocation failure - in subpage mode, fix extent buffer bitmap iteration when writing out dirty sectors - fix range calculation when falling back to COW for a NOCOW file * tag 'for-6.15-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: adjust subpage bit start based on sectorsize btrfs: fix the inode leak in btrfs_iget() btrfs: fix COW handling in run_delalloc_nocow()
2025-04-23btrfs: adjust subpage bit start based on sectorsizeJosef Bacik
When running machines with 64k page size and a 16k nodesize we started seeing tree log corruption in production. This turned out to be because we were not writing out dirty blocks sometimes, so this in fact affects all metadata writes. When writing out a subpage EB we scan the subpage bitmap for a dirty range. If the range isn't dirty we do bit_start++; to move onto the next bit. The problem is the bitmap is based on the number of sectors that an EB has. So in this case, we have a 64k pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4 bits for every node. With a 64k page size we end up with 4 nodes per page. To make this easier this is how everything looks [0 16k 32k 48k ] logical address [0 4 8 12 ] radix tree offset [ 64k page ] folio [ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ] extent buffers [ | | | | | | | | | | | | | | | | ] bitmap Now we use all of our addressing based on fs_info->sectorsize_bits, so as you can see the above our 16k eb->start turns into radix entry 4. When we find a dirty range for our eb, we correctly do bit_start += sectors_per_node, because if we start at bit 0, the next bit for the next eb is 4, to correspond to eb->start 16k. However if our range is clean, we will do bit_start++, which will now put us offset from our radix tree entries. In our case, assume that the first time we check the bitmap the block is not dirty, we increment bit_start so now it == 1, and then we loop around and check again. This time it is dirty, and we go to find that start using the following equation start = folio_start + bit_start * fs_info->sectorsize; so in the case above, eb->start 0 is now dirty, and we calculate start as 0 + 1 * fs_info->sectorsize = 4096 4096 >> 12 = 1 Now we're looking up the radix tree for 1, and we won't find an eb. What's worse is now we're using bit_start == 1, so we do bit_start += sectors_per_node, which is now 5. If that eb is dirty we will run into the same thing, we will look at an offset that is not populated in the radix tree, and now we're skipping the writeout of dirty extent buffers. The best fix for this is to not use sectorsize_bits to address nodes, but that's a larger change. Since this is a fs corruption problem fix it simply by always using sectors_per_node to increment the start bit. Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-23btrfs: fix the inode leak in btrfs_iget()Penglei Jiang
[BUG] There is a bug report that a syzbot reproducer can lead to the following busy inode at unmount time: BTRFS info (device loop1): last unmount of filesystem 1680000e-3c1e-4c46-84b6-56bd3909af50 VFS: Busy inodes after unmount of loop1 (btrfs) ------------[ cut here ]------------ kernel BUG at fs/super.c:650! Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI CPU: 0 UID: 0 PID: 48168 Comm: syz-executor Not tainted 6.15.0-rc2-00471-g119009db2674 #2 PREEMPT(full) Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:generic_shutdown_super+0x2e9/0x390 fs/super.c:650 Call Trace: <TASK> kill_anon_super+0x3a/0x60 fs/super.c:1237 btrfs_kill_super+0x3b/0x50 fs/btrfs/super.c:2099 deactivate_locked_super+0xbe/0x1a0 fs/super.c:473 deactivate_super fs/super.c:506 [inline] deactivate_super+0xe2/0x100 fs/super.c:502 cleanup_mnt+0x21f/0x440 fs/namespace.c:1435 task_work_run+0x14d/0x240 kernel/task_work.c:227 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] exit_to_user_mode_loop kernel/entry/common.c:114 [inline] exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] syscall_exit_to_user_mode+0x269/0x290 kernel/entry/common.c:218 do_syscall_64+0xd4/0x250 arch/x86/entry/syscall_64.c:100 entry_SYSCALL_64_after_hwframe+0x77/0x7f </TASK> [CAUSE] When btrfs_alloc_path() failed, btrfs_iget() directly returned without releasing the inode already allocated by btrfs_iget_locked(). This results the above busy inode and trigger the kernel BUG. [FIX] Fix it by calling iget_failed() if btrfs_alloc_path() failed. If we hit error inside btrfs_read_locked_inode(), it will properly call iget_failed(), so nothing to worry about. Although the iget_failed() cleanup inside btrfs_read_locked_inode() is a break of the normal error handling scheme, let's fix the obvious bug and backport first, then rework the error handling later. Reported-by: Penglei Jiang <superman.xpt@gmail.com> Link: https://lore.kernel.org/linux-btrfs/20250421102425.44431-1-superman.xpt@gmail.com/ Fixes: 7c855e16ab72 ("btrfs: remove conditional path allocation in btrfs_read_locked_inode()") CC: stable@vger.kernel.org # 6.13+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Penglei Jiang <superman.xpt@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-23btrfs: fix COW handling in run_delalloc_nocow()Dave Chen
In run_delalloc_nocow(), when the found btrfs_key's offset > cur_offset, it indicates a gap between the current processing region and the next file extent. The original code would directly jump to the "must_cow" label, which increments the slot and forces a fallback to COW. This behavior might skip an extent item and result in an overestimated COW fallback range. This patch modifies the logic so that when a gap is detected: - If no COW range is already being recorded (cow_start is unset), cow_start is set to cur_offset. - cur_offset is then advanced to the beginning of the next extent. - Instead of jumping to "must_cow", control flows directly to "next_slot" so that the same extent item can be reexamined properly. The change ensures that we accurately account for the extent gap and avoid accidentally extending the range that needs to fallback to COW. CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Dave Chen <davechen@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-22Merge tag 'for-6.15-rc3-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - subpage mode fixes: - access correct object (folio) when looking up bit offset - fix assertion condition for number of blocks per folio - fix upper boundary of locking range in hole punch - zoned fixes: - fix potential deadlock caught by lockdep when zone reporting and device freeze run in parallel - fix zone write pointer mismatch and NULL pointer dereference when metadata are converted from DUP to RAID1 - fix error handling when reloc inode creation fails - in tree-checker, unify error code for header level check - block layer: add helpers to read zone capacity * tag 'for-6.15-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: skip reporting zone for new block group block: introduce zone capacity helper btrfs: tree-checker: adjust error code for header level check btrfs: fix invalid inode pointer after failure to create reloc inode btrfs: zoned: return EIO on RAID1 block group write pointer mismatch btrfs: fix the ASSERT() inside GET_SUBPAGE_BITMAP() btrfs: avoid page_lockend underflow in btrfs_punch_hole_lock_range() btrfs: subpage: access correct object when reading bitmap start in subpage_calc_start_bit()
2025-04-17Merge tag 'for-6.15-rc2-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - handle encoded read ioctl returning EAGAIN so it does not mistakenly free the work structure - escape subvolume path in mount option list so it cannot be wrongly parsed when the path contains "," - remove folio size assertions when writing super block to device with enabled large folios * tag 'for-6.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: remove folio order ASSERT()s in super block writeback path btrfs: correctly escape subvol in btrfs_show_options() btrfs: ioctl: don't free iov when btrfs_encoded_read() returns -EAGAIN
2025-04-17btrfs: zoned: skip reporting zone for new block groupNaohiro Aota
There is a potential deadlock if we do report zones in an IO context, detailed in below lockdep report. When one process do a report zones and another process freezes the block device, the report zones side cannot allocate a tag because the freeze is already started. This can thus result in new block group creation to hang forever, blocking the write path. Thankfully, a new block group should be created on empty zones. So, reporting the zones is not necessary and we can set the write pointer = 0 and load the zone capacity from the block layer using bdev_zone_capacity() helper. ====================================================== WARNING: possible circular locking dependency detected 6.14.0-rc1 #252 Not tainted ------------------------------------------------------ modprobe/1110 is trying to acquire lock: ffff888100ac83e0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0x38f/0xb60 but task is already holding lock: ffff8881205b6f20 (&q->q_usage_counter(queue)#16){++++}-{0:0}, at: sd_remove+0x85/0x130 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&q->q_usage_counter(queue)#16){++++}-{0:0}: blk_queue_enter+0x3d9/0x500 blk_mq_alloc_request+0x47d/0x8e0 scsi_execute_cmd+0x14f/0xb80 sd_zbc_do_report_zones+0x1c1/0x470 sd_zbc_report_zones+0x362/0xd60 blkdev_report_zones+0x1b1/0x2e0 btrfs_get_dev_zones+0x215/0x7e0 [btrfs] btrfs_load_block_group_zone_info+0x6d2/0x2c10 [btrfs] btrfs_make_block_group+0x36b/0x870 [btrfs] btrfs_create_chunk+0x147d/0x2320 [btrfs] btrfs_chunk_alloc+0x2ce/0xcf0 [btrfs] start_transaction+0xce6/0x1620 [btrfs] btrfs_uuid_scan_kthread+0x4ee/0x5b0 [btrfs] kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #2 (&fs_info->dev_replace.rwsem){++++}-{4:4}: down_read+0x9b/0x470 btrfs_map_block+0x2ce/0x2ce0 [btrfs] btrfs_submit_chunk+0x2d4/0x16c0 [btrfs] btrfs_submit_bbio+0x16/0x30 [btrfs] btree_write_cache_pages+0xb5a/0xf90 [btrfs] do_writepages+0x17f/0x7b0 __writeback_single_inode+0x114/0xb00 writeback_sb_inodes+0x52b/0xe00 wb_writeback+0x1a7/0x800 wb_workfn+0x12a/0xbd0 process_one_work+0x85a/0x1460 worker_thread+0x5e2/0xfc0 kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #1 (&fs_info->zoned_meta_io_lock){+.+.}-{4:4}: __mutex_lock+0x1aa/0x1360 btree_write_cache_pages+0x252/0xf90 [btrfs] do_writepages+0x17f/0x7b0 __writeback_single_inode+0x114/0xb00 writeback_sb_inodes+0x52b/0xe00 wb_writeback+0x1a7/0x800 wb_workfn+0x12a/0xbd0 process_one_work+0x85a/0x1460 worker_thread+0x5e2/0xfc0 kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}: __lock_acquire+0x2f52/0x5ea0 lock_acquire+0x1b1/0x540 __flush_work+0x3ac/0xb60 wb_shutdown+0x15b/0x1f0 bdi_unregister+0x172/0x5b0 del_gendisk+0x841/0xa20 sd_remove+0x85/0x130 device_release_driver_internal+0x368/0x520 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 __scsi_remove_device+0x272/0x340 scsi_forget_host+0xf7/0x170 scsi_remove_host+0xd2/0x2a0 sdebug_driver_remove+0x52/0x2f0 [scsi_debug] device_release_driver_internal+0x368/0x520 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 device_unregister+0x13/0xa0 sdebug_do_remove_host+0x1fb/0x290 [scsi_debug] scsi_debug_exit+0x17/0x70 [scsi_debug] __do_sys_delete_module.isra.0+0x321/0x520 do_syscall_64+0x93/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: (work_completion)(&(&wb->dwork)->work) --> &fs_info->dev_replace.rwsem --> &q->q_usage_counter(queue)#16 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&q->q_usage_counter(queue)#16); lock(&fs_info->dev_replace.rwsem); lock(&q->q_usage_counter(queue)#16); lock((work_completion)(&(&wb->dwork)->work)); *** DEADLOCK *** 5 locks held by modprobe/1110: #0: ffff88811f7bc108 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520 #1: ffff8881022ee0e0 (&shost->scan_mutex){+.+.}-{4:4}, at: scsi_remove_host+0x20/0x2a0 #2: ffff88811b4c4378 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520 #3: ffff8881205b6f20 (&q->q_usage_counter(queue)#16){++++}-{0:0}, at: sd_remove+0x85/0x130 #4: ffffffffa3284360 (rcu_read_lock){....}-{1:3}, at: __flush_work+0xda/0xb60 stack backtrace: CPU: 0 UID: 0 PID: 1110 Comm: modprobe Not tainted 6.14.0-rc1 #252 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x6a/0x90 print_circular_bug.cold+0x1e0/0x274 check_noncircular+0x306/0x3f0 ? __pfx_check_noncircular+0x10/0x10 ? mark_lock+0xf5/0x1650 ? __pfx_check_irq_usage+0x10/0x10 ? lockdep_lock+0xca/0x1c0 ? __pfx_lockdep_lock+0x10/0x10 __lock_acquire+0x2f52/0x5ea0 ? __pfx___lock_acquire+0x10/0x10 ? __pfx_mark_lock+0x10/0x10 lock_acquire+0x1b1/0x540 ? __flush_work+0x38f/0xb60 ? __pfx_lock_acquire+0x10/0x10 ? __pfx_lock_release+0x10/0x10 ? mark_held_locks+0x94/0xe0 ? __flush_work+0x38f/0xb60 __flush_work+0x3ac/0xb60 ? __flush_work+0x38f/0xb60 ? __pfx_mark_lock+0x10/0x10 ? __pfx___flush_work+0x10/0x10 ? __pfx_wq_barrier_func+0x10/0x10 ? __pfx___might_resched+0x10/0x10 ? mark_held_locks+0x94/0xe0 wb_shutdown+0x15b/0x1f0 bdi_unregister+0x172/0x5b0 ? __pfx_bdi_unregister+0x10/0x10 ? up_write+0x1ba/0x510 del_gendisk+0x841/0xa20 ? __pfx_del_gendisk+0x10/0x10 ? _raw_spin_unlock_irqrestore+0x35/0x60 ? __pm_runtime_resume+0x79/0x110 sd_remove+0x85/0x130 device_release_driver_internal+0x368/0x520 ? kobject_put+0x5d/0x4a0 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 ? __pfx_device_del+0x10/0x10 __scsi_remove_device+0x272/0x340 scsi_forget_host+0xf7/0x170 scsi_remove_host+0xd2/0x2a0 sdebug_driver_remove+0x52/0x2f0 [scsi_debug] ? kernfs_remove_by_name_ns+0xc0/0xf0 device_release_driver_internal+0x368/0x520 ? kobject_put+0x5d/0x4a0 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 ? __pfx_device_del+0x10/0x10 ? __pfx___mutex_unlock_slowpath+0x10/0x10 device_unregister+0x13/0xa0 sdebug_do_remove_host+0x1fb/0x290 [scsi_debug] scsi_debug_exit+0x17/0x70 [scsi_debug] __do_sys_delete_module.isra.0+0x321/0x520 ? __pfx___do_sys_delete_module.isra.0+0x10/0x10 ? __pfx_slab_free_after_rcu_debug+0x10/0x10 ? kasan_save_stack+0x2c/0x50 ? kasan_record_aux_stack+0xa3/0xb0 ? __call_rcu_common.constprop.0+0xc4/0xfb0 ? kmem_cache_free+0x3a0/0x590 ? __x64_sys_close+0x78/0xd0 do_syscall_64+0x93/0x180 ? lock_is_held_type+0xd5/0x130 ? __call_rcu_common.constprop.0+0x3c0/0xfb0 ? lockdep_hardirqs_on+0x78/0x100 ? __call_rcu_common.constprop.0+0x3c0/0xfb0 ? __pfx___call_rcu_common.constprop.0+0x10/0x10 ? kmem_cache_free+0x3a0/0x590 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 ? __pfx___x64_sys_openat+0x10/0x10 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f436712b68b RSP: 002b:00007ffe9f1a8658 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 00005559b367fd80 RCX: 00007f436712b68b RDX: 0000000000000000 RSI: 0000000000000800 RDI: 00005559b367fde8 RBP: 00007ffe9f1a8680 R08: 1999999999999999 R09: 0000000000000000 R10: 00007f43671a5fe0 R11: 0000000000000206 R12: 0000000000000000 R13: 00007ffe9f1a86b0 R14: 0000000000000000 R15: 0000000000000000 </TASK> Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> CC: <stable@vger.kernel.org> # 6.13+ Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-17btrfs: tree-checker: adjust error code for header level checkDavid Sterba
The whole tree checker returns EUCLEAN, except the one check in btrfs_verify_level_key(). This was inherited from the function that was moved from disk-io.c in 2cac5af16537 ("btrfs: move btrfs_verify_level_key into tree-checker.c") but this should be unified with the rest. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-17btrfs: fix invalid inode pointer after failure to create reloc inodeFilipe Manana
If we have a failure at create_reloc_inode(), under the 'out' label we assign an error pointer to the 'inode' variable and then return a weird pointer because we return the expression "&inode->vfs_inode": static noinline_for_stack struct inode *create_reloc_inode( const struct btrfs_block_group *group) { (...) out: (...) if (ret) { if (inode) iput(&inode->vfs_inode); inode = ERR_PTR(ret); } return &inode->vfs_inode; } This can make us return a pointer that is not an error pointer and make the caller proceed as if an error didn't happen and later result in an invalid memory access when dereferencing the inode pointer. Syzbot reported reported such a case with the following stack trace: R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 431bde82d7b634db R15: 00007ffc55de5790 </TASK> BTRFS info (device loop0): relocating block group 6881280 flags data|metadata Oops: general protection fault, probably for non-canonical address 0xdffffc0000000045: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000228-0x000000000000022f] CPU: 0 UID: 0 PID: 5332 Comm: syz-executor215 Not tainted 6.14.0-syzkaller-13423-ga8662bcd2ff1 #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:relocate_file_extent_cluster+0xe7/0x1750 fs/btrfs/relocation.c:2971 Code: 00 74 08 (...) RSP: 0018:ffffc9000d3375e0 EFLAGS: 00010203 RAX: 0000000000000045 RBX: 000000000000022c RCX: ffff888000562440 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880452db000 RBP: ffffc9000d337870 R08: ffffffff84089251 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000 R13: ffffffff9368a020 R14: 0000000000000394 R15: ffff8880452db000 FS: 000055558bc7b380(0000) GS:ffff88808c596000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055a7a192e740 CR3: 0000000036e2e000 CR4: 0000000000352ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> relocate_block_group+0xa1e/0xd50 fs/btrfs/relocation.c:3657 btrfs_relocate_block_group+0x777/0xd80 fs/btrfs/relocation.c:4011 btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3511 __btrfs_balance+0x1a93/0x25e0 fs/btrfs/volumes.c:4292 btrfs_balance+0xbde/0x10c0 fs/btrfs/volumes.c:4669 btrfs_ioctl_balance+0x3f5/0x660 fs/btrfs/ioctl.c:3586 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl+0xf1/0x160 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fb4ef537dd9 Code: 28 00 00 (...) RSP: 002b:00007ffc55de5728 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007ffc55de5750 RCX: 00007fb4ef537dd9 RDX: 0000200000000440 RSI: 00000000c4009420 RDI: 0000000000000003 RBP: 0000000000000002 R08: 00007ffc55de54c6 R09: 00007ffc55de5770 R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 431bde82d7b634db R15: 00007ffc55de5790 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- RIP: 0010:relocate_file_extent_cluster+0xe7/0x1750 fs/btrfs/relocation.c:2971 Code: 00 74 08 (...) RSP: 0018:ffffc9000d3375e0 EFLAGS: 00010203 RAX: 0000000000000045 RBX: 000000000000022c RCX: ffff888000562440 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880452db000 RBP: ffffc9000d337870 R08: ffffffff84089251 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000 R13: ffffffff9368a020 R14: 0000000000000394 R15: ffff8880452db000 FS: 000055558bc7b380(0000) GS:ffff88808c596000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055a7a192e740 CR3: 0000000036e2e000 CR4: 0000000000352ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 ---------------- Code disassembly (best guess): 0: 00 74 08 48 add %dh,0x48(%rax,%rcx,1) 4: 89 df mov %ebx,%edi 6: e8 f8 36 24 fe call 0xfe243703 b: 48 89 9c 24 30 01 00 mov %rbx,0x130(%rsp) 12: 00 13: 4c 89 74 24 28 mov %r14,0x28(%rsp) 18: 4d 8b 76 10 mov 0x10(%r14),%r14 1c: 49 8d 9e 98 fe ff ff lea -0x168(%r14),%rbx 23: 48 89 d8 mov %rbx,%rax 26: 48 c1 e8 03 shr $0x3,%rax * 2a: 42 80 3c 20 00 cmpb $0x0,(%rax,%r12,1) <-- trapping instruction 2f: 74 08 je 0x39 31: 48 89 df mov %rbx,%rdi 34: e8 ca 36 24 fe call 0xfe243703 39: 4c 8b 3b mov (%rbx),%r15 3c: 48 rex.W 3d: 8b .byte 0x8b 3e: 44 rex.R 3f: 24 .byte 0x24 So fix this by returning the error immediately. Reported-by: syzbot+7481815bb47ef3e702e2@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/67f14ee9.050a0220.0a13.023e.GAE@google.com/ Fixes: b204e5c7d4dc ("btrfs: make btrfs_iget() return a btrfs inode instead") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-17btrfs: zoned: return EIO on RAID1 block group write pointer mismatchJohannes Thumshirn
There was a bug report about a NULL pointer dereference in __btrfs_add_free_space_zoned() that ultimately happens because a conversion from the default metadata profile DUP to a RAID1 profile on two disks. The stack trace has the following signature: BTRFS error (device sdc): zoned: write pointer offset mismatch of zones in raid1 profile BUG: kernel NULL pointer dereference, address: 0000000000000058 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI RIP: 0010:__btrfs_add_free_space_zoned.isra.0+0x61/0x1a0 RSP: 0018:ffffa236b6f3f6d0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff96c8132f3400 RCX: 0000000000000001 RDX: 0000000010000000 RSI: 0000000000000000 RDI: ffff96c8132f3410 RBP: 0000000010000000 R08: 0000000000000003 R09: 0000000000000000 R10: 0000000000000000 R11: 00000000ffffffff R12: 0000000000000000 R13: ffff96c758f65a40 R14: 0000000000000001 R15: 000011aac0000000 FS: 00007fdab1cb2900(0000) GS:ffff96e60ca00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000058 CR3: 00000001a05ae000 CR4: 0000000000350ef0 Call Trace: <TASK> ? __die_body.cold+0x19/0x27 ? page_fault_oops+0x15c/0x2f0 ? exc_page_fault+0x7e/0x180 ? asm_exc_page_fault+0x26/0x30 ? __btrfs_add_free_space_zoned.isra.0+0x61/0x1a0 btrfs_add_free_space_async_trimmed+0x34/0x40 btrfs_add_new_free_space+0x107/0x120 btrfs_make_block_group+0x104/0x2b0 btrfs_create_chunk+0x977/0xf20 btrfs_chunk_alloc+0x174/0x510 ? srso_return_thunk+0x5/0x5f btrfs_inc_block_group_ro+0x1b1/0x230 btrfs_relocate_block_group+0x9e/0x410 btrfs_relocate_chunk+0x3f/0x130 btrfs_balance+0x8ac/0x12b0 ? srso_return_thunk+0x5/0x5f ? srso_return_thunk+0x5/0x5f ? __kmalloc_cache_noprof+0x14c/0x3e0 btrfs_ioctl+0x2686/0x2a80 ? srso_return_thunk+0x5/0x5f ? ioctl_has_perm.constprop.0.isra.0+0xd2/0x120 __x64_sys_ioctl+0x97/0xc0 do_syscall_64+0x82/0x160 ? srso_return_thunk+0x5/0x5f ? __memcg_slab_free_hook+0x11a/0x170 ? srso_return_thunk+0x5/0x5f ? kmem_cache_free+0x3f0/0x450 ? srso_return_thunk+0x5/0x5f ? srso_return_thunk+0x5/0x5f ? syscall_exit_to_user_mode+0x10/0x210 ? srso_return_thunk+0x5/0x5f ? do_syscall_64+0x8e/0x160 ? sysfs_emit+0xaf/0xc0 ? srso_return_thunk+0x5/0x5f ? srso_return_thunk+0x5/0x5f ? seq_read_iter+0x207/0x460 ? srso_return_thunk+0x5/0x5f ? vfs_read+0x29c/0x370 ? srso_return_thunk+0x5/0x5f ? srso_return_thunk+0x5/0x5f ? syscall_exit_to_user_mode+0x10/0x210 ? srso_return_thunk+0x5/0x5f ? do_syscall_64+0x8e/0x160 ? srso_return_thunk+0x5/0x5f ? exc_page_fault+0x7e/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fdab1e0ca6d RSP: 002b:00007ffeb2b60c80 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fdab1e0ca6d RDX: 00007ffeb2b60d80 RSI: 00000000c4009420 RDI: 0000000000000003 RBP: 00007ffeb2b60cd0 R08: 0000000000000000 R09: 0000000000000013 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffeb2b6343b R14: 00007ffeb2b60d80 R15: 0000000000000001 </TASK> CR2: 0000000000000058 ---[ end trace 0000000000000000 ]--- The 1st line is the most interesting here: BTRFS error (device sdc): zoned: write pointer offset mismatch of zones in raid1 profile When a RAID1 block-group is created and a write pointer mismatch between the disks in the RAID set is detected, btrfs sets the alloc_offset to the length of the block group marking it as full. Afterwards the code expects that a balance operation will evacuate the data in this block-group and repair the problems. But before this is possible, the new space of this block-group will be accounted in the free space cache. But in __btrfs_add_free_space_zoned() it is being checked if it is a initial creation of a block group and if not a reclaim decision will be made. But the decision if a block-group's free space accounting is done for an initial creation depends on if the size of the added free space is the whole length of the block-group and the allocation offset is 0. But as btrfs_load_block_group_zone_info() sets the allocation offset to the zone capacity (i.e. marking the block-group as full) this initial decision is not met, and the space_info pointer in the 'struct btrfs_block_group' has not yet been assigned. Fail creation of the block group and rely on manual user intervention to re-balance the filesystem. Afterwards the filesystem can be unmounted, mounted in degraded mode and the missing device can be removed after a full balance of the filesystem. Reported-by: 西木野羰基 <yanqiyu01@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAB_b4sBhDe3tscz=duVyhc9hNE+gu=B8CrgLO152uMyanR8BEA@mail.gmail.com/ Fixes: b1934cd60695 ("btrfs: zoned: handle broken write pointer on zones") Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-17btrfs: fix the ASSERT() inside GET_SUBPAGE_BITMAP()Qu Wenruo
After enabling large data folios for tests, I hit the ASSERT() inside GET_SUBPAGE_BITMAP() where blocks_per_folio matches BITS_PER_LONG. The ASSERT() itself is only based on the original subpage fs block size, where we have at most 16 blocks per page, thus "ASSERT(blocks_per_folio < BITS_PER_LONG)". However the experimental large data folio support will set the max folio order according to the BITS_PER_LONG, so we can have a case where a large folio contains exactly BITS_PER_LONG blocks. So the ASSERT() is too strict, change it to "ASSERT(blocks_per_folio <= BITS_PER_LONG)" to avoid the false alert. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-17btrfs: avoid page_lockend underflow in btrfs_punch_hole_lock_range()Qu Wenruo
[BUG] When running btrfs/004 with 4K fs block size and 64K page size, sometimes fsstress workload can take 100% CPU for a while, but not long enough to trigger a 120s hang warning. [CAUSE] When such 100% CPU usage happens, btrfs_punch_hole_lock_range() is always in the call trace. One example when this problem happens, the function btrfs_punch_hole_lock_range() got the following parameters: lock_start = 4096, lockend = 20469 Then we calculate @page_lockstart by rounding up lock_start to page boundary, which is 64K (page size is 64K). For @page_lockend, we round down the value towards page boundary, which result 0. Then since we need to pass an inclusive end to filemap_range_has_page(), we subtract 1 from the rounded down value, resulting in (u64)-1. In the above case, the range is inside the same page, and we do not even need to call filemap_range_has_page(), not to mention to call it with (u64)-1 at the end. This behavior will cause btrfs_punch_hole_lock_range() to busy loop waiting for irrelevant range to have its pages dropped. [FIX] Calculate @page_lockend by just rounding down @lockend, without decreasing the value by one. So @page_lockend will no longer overflow. Then exit early if @page_lockend is no larger than @page_lockstart. As it means either the range is inside the same page, or the two pages are adjacent already. Finally only decrease @page_lockend when calling filemap_range_has_page(). Fixes: 0528476b6ac7 ("btrfs: fix the filemap_range_has_page() call in btrfs_punch_hole_lock_range()") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-17btrfs: subpage: access correct object when reading bitmap start in ↵Qu Wenruo
subpage_calc_start_bit() Inside the macro, subpage_calc_start_bit(), we need to calculate the offset to the beginning of the folio. But we're using offset_in_page(), on systems with 4K page size and 4K fs block size, this means we will always return offset 0 for a large folio, causing all kinds of errors. Fix it by using offset_in_folio() instead. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-08Merge tag 'crc-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull CRC cleanups from Eric Biggers: "Finish cleaning up the CRC kconfig options by removing the remaining unnecessary prompts and an unnecessary 'default y', removing CONFIG_LIBCRC32C, and documenting all the CRC library options" * tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: lib/crc: remove CONFIG_LIBCRC32C lib/crc: document all the CRC library kconfig options lib/crc: remove unnecessary prompt for CONFIG_CRC_ITU_T lib/crc: remove unnecessary prompt for CONFIG_CRC_T10DIF lib/crc: remove unnecessary prompt for CONFIG_CRC16 lib/crc: remove unnecessary prompt for CONFIG_CRC_CCITT lib/crc: remove unnecessary prompt for CONFIG_CRC32 and drop 'default y'
2025-04-07VFS: improve interface for lookup_one functionsNeilBrown
The family of functions: lookup_one() lookup_one_unlocked() lookup_one_positive_unlocked() appear designed to be used by external clients of the filesystem rather than by filesystems acting on themselves as the lookup_one_len family are used. They are used by: btrfs/ioctl - which is a user-space interface rather than an internal activity exportfs - i.e. from nfsd or the open_by_handle_at interface overlayfs - at access the underlying filesystems smb/server - for file service They should be used by nfsd (more than just the exportfs path) and cachefs but aren't. It would help if the documentation didn't claim they should "not be called by generic code". Also the path component name is passed as "name" and "len" which are (confusingly?) separate by the "base". In some cases the len in simply "strlen" and so passing a qstr using QSTR() would make the calling clearer. Other callers do pass separate name and len which are stored in a struct. Sometimes these are already stored in a qstr, other times it easily could be. So this patch changes these three functions to receive a 'struct qstr *', and improves the documentation. QSTR_LEN() is added to make it easy to pass a QSTR containing a known len. [brauner@kernel.org: take a struct qstr pointer] Signed-off-by: NeilBrown <neil@brown.name> Link: https://lore.kernel.org/r/20250319031545.2999807-2-neil@brown.name Signed-off-by: Christian Brauner <brauner@kernel.org>