The unselect_delayed_ref_head() helper at extent-tree.c doesn't really
belong in that file, as it's a delayed refs specific detail, and therefore
should be in delayed-ref.c. Further, its inverse, btrfs_select_ref_head(),
is in delayed-ref.c, so it only makes sense to have it there too.
So move unselect_delayed_ref_head() into delayed-ref.c and rename it to
btrfs_unselect_ref_head() so that its name closely matches its inverse
(btrfs_select_ref_head()).
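For reference, a sketch of the resulting declaration in delayed-ref.h
(the exact parameter list is assumed to mirror its inverse):

    /* sketch only: signature assumed */
    void btrfs_unselect_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
                                 struct btrfs_delayed_ref_head *head);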
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of selecting and locking a delayed ref head in two steps outside
of delayed-ref.c, leaking low level details such as locking, move the
logic entirely to delayed-ref.c under btrfs_select_ref_head(), reducing
code and making things simpler for the caller.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function only returns 0, meaning it was able to lock the delayed ref
head, or -EAGAIN in case it wasn't able to lock it. So simplify this and
use a boolean return type instead, returning true if it was able to lock
and false otherwise.
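A sketch of the resulting calling pattern (the helper name is assumed for
illustration, as the collapsed subject doesn't show it):

    /* before: returns 0 on success or -EAGAIN */
    if (btrfs_delayed_ref_lock(delayed_refs, head))
            continue;

    /* after: returns true if the head was locked */
    if (!btrfs_delayed_ref_lock(delayed_refs, head))
            continue;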
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The atomic counter 'num_entries' is not used anymore: we increment and
decrement it but never read it for any logic.
Its last use was removed with commit 61a56a992fcf ("btrfs: delayed refs
pre-flushing should only run the heads we have"). So remove it.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of open coding it, use the find_first_ref_head() helper at
btrfs_destroy_delayed_refs(). This avoids duplicating the logic,
especially with the upcoming changes in subsequent patches.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When destroying delayed refs during a transaction abort, we have open
coded the removal of a delayed ref, which is also done by the static
helper function drop_delayed_ref(). So remove that duplicated code and
use drop_delayed_ref() instead.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The fs_info parameter is redundant because it can be extracted from the
transaction given as another parameter. So remove it and use the fs_info
accessible from the transaction.
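In practice this is the usual kernel pattern of deriving it from the
transaction handle:

    /* instead of a separate fs_info parameter */
    struct btrfs_fs_info *fs_info = trans->fs_info;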
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The fs_info parameter is redundant because it can be extracted from the
transaction given as another parameter. So remove it and use the fs_info
accessible from the transaction.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_destroy_delayed_refs() is better suited to delayed-ref.c since it's
about delayed refs and contains logic to iterate over them (using the red
black tree, doing all the locking, freeing, etc), so move it from
disk-io.c, which is pretty big, into delayed-ref.c, hiding implementation
details of how delayed refs are tracked and managed. This also
facilitates the next patches in the series.
This change moves the code between files but also does the following
simple cleanups:
1) Rename the 'cache' variable to 'bg', since it's a block group
(the 'cache' naming comes from the old days when the block group
structure was named 'btrfs_block_group_cache');
2) Move the 'ref' variable declaration to the scope of the inner
while loop, since it's not used outside that loop.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
At btrfs_destroy_delayed_refs() it's unexpected to not find the block
group to which a delayed reference's extent belongs, so we have this
BUG_ON(), not just because it's highly unexpected but also because we
don't know what to do there.
Since we are in the transaction abort path, there's nothing we can do
other than proceed and cleanup all used resources we can. So remove
the BUG_ON() and deal with a missing block group by logging an error
message and continuing to cleanup all we can related to the current
delayed ref head and moving to other delayed refs.
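A sketch of the new handling (message text illustrative; the block
group's counters are only touched when it is found):

    bg = btrfs_lookup_block_group(fs_info, head->bytenr);
    if (!bg) {
            /* was BUG_ON(!bg): log an error and keep cleaning up */
            btrfs_err(fs_info,
                      "block group for delayed ref head bytenr %llu not found",
                      head->bytenr);
    }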
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When inserting an extent backref, in order to check whether refs other
than inline refs are used, we always set the path's keep_locks flag for
the tree search, which increases the lock contention on the extent tree.
We do not need the parent node every time to determine whether normal
refs are used; it is only needed when the extent item is the last item
in a leaf.
Therefore, change it to first search with keep_locks=0. If the extent
item happens to be the last item in the leaf, redo the search with
keep_locks=1, reducing lock contention.
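A sketch of the two-step search (the btrfs helpers are real, the
surrounding details are illustrative):

    path->keep_locks = 0;
    ret = btrfs_search_slot(trans, root, &key, path, ins_len, 1);
    if (ret < 0)
            goto out;
    if (path->slots[0] + 1 >= btrfs_header_nritems(path->nodes[0])) {
            /* extent item is the last item in its leaf: redo the
             * search keeping locks so the next leaf can be checked */
            btrfs_release_path(path);
            path->keep_locks = 1;
            ret = btrfs_search_slot(trans, root, &key, path, ins_len, 1);
    }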
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Implement self-tests for partial deletion of RAID stripe-tree entries.
These two new tests cover both the deletion of the front of a RAID
stripe-tree stripe extent as well as truncation of an item to make it
smaller.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In our CI system, the RAID stripe tree configuration sometimes fails with
the following ASSERT():
assertion failed: found_start >= start && found_end <= end, in fs/btrfs/raid-stripe-tree.c:64
This assertion triggers because, for the initial design of RAID
stripe-tree, I had the "one ordered-extent equals one bio" rule of zoned
btrfs in mind.
But for a RAID stripe-tree based system that is hosted not on a zoned
storage device but on a regular device, this rule doesn't apply.
So in case the range we want to delete starts in the middle of the
previous item, grab the item and "truncate" its length. That is, clone
the item, subtract the deleted portion from the key's offset, delete the
old item and insert the new one.
In case the range to delete ends in the middle of an item, we have to
adjust both the item's key as well as the stripe extents and then
re-insert the modified clone into the tree after deleting the old stripe
extent.
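In pseudo-code, the tail-truncation case looks roughly like this (a
sketch, not the exact implementation; in the stripe tree the key offset
holds the stripe extent's length):

    /* the delete range starts at 'start', inside this item */
    btrfs_item_key_to_cpu(leaf, &key, slot);
    diff = (key.objectid + key.offset) - start;
    key.offset -= diff;          /* "truncate" the item's length */
    /* clone the stripe extent, delete the old item and insert
     * the clone under the adjusted key */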
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When fgp_flags and gfp_flags are zero, use filemap_get_folio(A, B)
instead of __filemap_get_folio(A, B, 0, 0), as there is no need for the
extra 0, 0 arguments.
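For example (mapping and index stand for whatever the call site already
uses):

    /* before */
    folio = __filemap_get_folio(mapping, index, 0, 0);
    /* after: equivalent, filemap_get_folio() passes the 0, 0 itself */
    folio = filemap_get_folio(mapping, index);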
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The buffered write path is still heavily utilizing the page interface.
Since we have converted it to do page-by-page copying, it's much easier
to convert all involved functions to the folio interface. This involves:
- btrfs_copy_from_user()
- btrfs_drop_folio()
- prepare_uptodate_page()
- prepare_one_page()
- lock_and_cleanup_extent_if_need()
- btrfs_dirty_page()
All functions are changed to accept a folio parameter, and if the word
"page" is in the function name, that is changed to "folio" too.
The function btrfs_dirty_page() is exported for the v1 space cache, so
convert the v1 cache call site to hand a folio to the new interface.
And there is a small enhancement for prepare_one_folio(): instead of
manually waiting for folio writeback, let __filemap_get_folio() handle
that by using FGP_WRITEBEGIN, which implies
(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE).
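A sketch of that enhancement (error handling abbreviated):

    /* FGP_STABLE makes __filemap_get_folio() wait for writeback,
     * so no manual folio_wait_writeback() is needed */
    folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
                                mapping_gfp_mask(mapping));
    if (IS_ERR(folio))
            return PTR_ERR(folio);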
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently btrfs_buffered_write() is preparing multiple pages at a time,
allowing better performance.
But the current trend is to support larger folios as an optimization,
instead of implementing our own multi-page optimization.
This is inspired by generic_perform_write(), which copies one folio at
a time.
Such a change will prepare us to migrate to implementing the
write_begin() and write_end() callbacks, and make every involved
function a little easier.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_do_encoded_write() was converted to use folios in commit
400b172b8cdc, but we're still allocating based on sizeof(struct page *)
rather than sizeof(struct folio *). There's no functional change.
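That is, something like (allocation call illustrative):

    /* before */
    folios = kvcalloc(nr_folios, sizeof(struct page *), GFP_KERNEL);
    /* after: the pointer sizes are equal, hence no functional change */
    folios = kvcalloc(nr_folios, sizeof(struct folio *), GFP_KERNEL);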
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Remove hard-coded strings by using the str_yes_no() and str_no_yes()
helper functions.
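For example (the message is illustrative; the helpers live in
linux/string_choices.h):

    /* before */
    btrfs_info(fs_info, "rescan running: %s", running ? "yes" : "no");
    /* after */
    btrfs_info(fs_info, "rescan running: %s", str_yes_no(running));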
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since there are no users of the reader locks, rename the writer locks to
a more generic name by removing the "_writer" part from the name.
And also rename btrfs_subpage::writer into btrfs_subpage::locked.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since commit d7172f52e993 ("btrfs: use per-buffer locking for
extent_buffer reading"), metadata reads no longer rely on the subpage
reader locking.
This means we do not need to maintain a different metadata/data split
for locking, so we can convert the existing reader lock users by:
- add_ra_bio_pages()
Convert to btrfs_folio_set_writer_lock()
- end_folio_read()
Convert to btrfs_folio_end_writer_lock()
- begin_folio_read()
Convert to btrfs_folio_set_writer_lock()
- folio_range_has_eb()
Remove the subpage->readers checks, since it is always 0.
- Remove btrfs_subpage_start_reader() and btrfs_subpage_end_reader()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This function is not really suitable to lock a folio, as it lacks the
proper mapping checks; thus the locked folio may not even belong to
btrfs.
For that reason its last user, inside lock_delalloc_folios(), has
already been removed, so we can remove this function.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If you follow the seed/sprout wiki, it suggests the following workflow:
btrfstune -S 1 seed_dev
mount seed_dev mnt
btrfs device add sprout_dev
mount -o remount,rw mnt
The first mount mounts the FS readonly, which results in not setting
BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
somewhat surprisingly clears the readonly bit on the sb (though the
mount is still practically readonly, from the user's perspective...).
Finally, the remount checks the readonly bit on the sb against the flag
and sees no change, so it does not run the code intended to run on
ro->rw transitions, leaving BTRFS_FS_OPEN unset.
As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
does no work. This results in leaking deleted snapshots until we run out
of space.
I propose fixing it at the first departure from what feels reasonable:
when we clear the readonly bit on the sb during device add.
A new fstest I have written reproduces the bug and confirms the fix.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
It's redundant to have the 'gen' variable since we already have the same
value in the local btrfs_tree_parent_check structure. So remove it and
instead use the structure's field.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
It's pointless to initialize the has_first_key field of the stack local
btrfs_tree_parent_check structure at read_block_for_search() and at
btrfs_qgroup_trace_subtree() since all fields not explicitly initialized
are zeroed out. In the case of the first function it's a bit odd because
we are assigning 0 to a field of type bool; however it's not incorrect,
since 0 is converted to false.
Just remove the explicit initializations due to their redundancy.
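This relies on standard C initialization semantics, e.g. (field values
illustrative):

    struct btrfs_tree_parent_check check = {
            .level = parent_level,
            .transid = gen,
            /* .has_first_key omitted: members without an explicit
             * initializer are zeroed, and (bool)0 is false */
    };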
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The only caller of btrfs_verify_level_key() is read_block_for_search()
and it's passing 3 arguments to it that can be extracted from its
on-stack variable of type struct btrfs_tree_parent_check.
So change btrfs_verify_level_key() to accept an argument of type
struct btrfs_tree_parent_check instead of the level, first key and
parent transid arguments.
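A sketch of the resulting prototype (assumed; the check structure
already carries the level, first key and parent transid):

    int btrfs_verify_level_key(struct extent_buffer *eb,
                               const struct btrfs_tree_parent_check *check);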
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The level parameter passed to read_block_for_search() always matches the
level of the extent buffer passed in the "eb_ret" parameter, which we are
also extracting into the "parent_level" local variable.
So remove the level parameter and instead use the "parent_level" variable
which in fact has a better name (it's the level of the parent node from
which we are reading a child node/leaf).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Now that the extent map shrinker can only be run by a single task and runs
asynchronously as a work queue job, enable it as it can no longer cause
stalls on tasks allocating memory and entering the extent map shrinker
through the fs shrinker (implemented by btrfs_free_cached_objects()).
This is crucial to prevent exhaustion of memory due to unbounded extent
map creation, primarily with direct IO but also for buffered IO on files
with holes. This problem, for the direct IO case, was first reported in
the Link tag below. That report was added as a Link tag to the first
patch that introduced the extent map shrinker, commit 956a17d9d050
("btrfs: add a shrinker for extent maps"); however the Link tag somehow
disappeared from the committed patch (though it was included in the
patch submitted to the mailing list), so it is added below for future
reference.
Link: https://lore.kernel.org/linux-btrfs/13f94633dcf04d29aaf1f0a43d42c55e@amazon.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The names for the members of struct btrfs_fs_info related to the extent
map shrinker are a bit too long, so rename them to be shorter by replacing
the "extent_map_" prefix with the "em_" prefix.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Now that the extent map shrinker can only be run by a single task (as a
work queue item) there is no need to keep the progress of the shrinker
protected by a spinlock and passing the progress to trace events as
parameters. So remove the lock and simplify the arguments for the trace
events.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently the extent map shrinker is run synchronously for kswapd tasks
that end up calling the fs shrinker (fs/super.c:super_cache_scan()).
This has some disadvantages and for some heavy workloads with memory
pressure it can cause some delays and stalls that make a machine
unresponsive for some periods. This happens because:
1) We can have several kswapd tasks on machines with multiple NUMA zones,
and running the extent map shrinker concurrently can cause high
contention on some spin locks, namely the spin locks that protect
the radix tree that tracks roots, the per root xarray that tracks
open inodes and the list of delayed iputs. This not only delays the
shrinker but also causes high CPU consumption and makes the task
running the shrinker monopolize a core, resulting in the symptoms
of an unresponsive system. This was noted in previous commits such as
commit ae1e766f623f ("btrfs: only run the extent map shrinker from
kswapd tasks");
2) The extent map shrinker's iteration over inodes can often be slow, even
after changing the data structure that tracks open inodes for a root
from a red black tree (up to kernel 6.10) to an xarray (kernel 6.10+).
While the transition to the xarray made things a bit faster, it's
still somewhat slow - for example in a test scenario with 10000 inodes
that have no extent maps loaded, the extent map shrinker took between
5ms and 8ms, using a release (non-debug) kernel. Iterating over the
extent maps of an inode can also be slow if we have an inode with many
thousands of extent maps, since we use a red black tree to track and
search extent maps. So having the extent map shrinker run synchronously
adds extra delay for other things a kswapd task does.
So make the extent map shrinker run asynchronously as a job for the
system unbound workqueue, just like what we do for data and metadata
space reclaim jobs.
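A sketch of the deferral from the fs shrinker callback (the work item
field name is assumed, following the em_ prefix used elsewhere in the
series):

    /* in btrfs_free_cached_objects(): don't shrink in the calling
     * (possibly kswapd) task, defer to the unbound workqueue */
    queue_work(system_unbound_wq, &fs_info->em_shrinker_work);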
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Move the common code to remove an extent map from its inode's tree into a
helper function and use it, reducing duplicated code.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When searching the btree, if an eb (extent buffer) cache miss occurs, we
change to using the eb read lock and release all previous locks
(including the parent lock) to reduce lock contention.
If an eb cache miss occurs in a leaf and IO must be executed, before
this change we released locks only from level 2 and up, and we read the
leaf's content from disk while holding a lock on its parent (level 1),
causing unnecessary lock contention on the parent. After this change we
release locks from level 1 and up, but we lock level 0 and read the
leaf's content from disk.
Because we have prepared the check parameters and hold the eb's read
lock, we can ensure that no race will occur during the check and cause
unexpected errors.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The compression heuristic pass does not need a level, so we can drop the
parameter.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Cascaded removal of the fs_info parameter, which is not needed in
several functions.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function got split in commit 6ab6ebb76042d3 ("btrfs: split
alloc_log_tree()") and since then the transaction parameter has been
unused.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The only caller passes NULL, so we can drop the parameter. This has been
the case since the new mount option parser done in commit
3bb17a25bcb09a ("btrfs: add get_tree callback for new mount API").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since the new mount option parser in commit ad21f15b0f79 ("btrfs:
switch to the new mount API") we don't pass the options like that
anymore.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The parameter was added in commit 8ff8466d29efc2 ("btrfs: support
subpage for extent buffer page release") but hasn't been used since, so
we can drop it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The mask parameter used for allocations got unified to GFP_NOFS and
removed from relevant functions in 1d1268004430 ("btrfs: drop gfp from
parameter extent state helpers").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The parameter duplicates what can be effectively obtained from
wc->refs[level - 1] and this is what's actually used inside. Added in
commit 2b73c7e761c4 ("btrfs: unify logic to decide if we need to walk
down into a node during snapshot delete").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The parameter 'from' has never been used since commit b8d8e1fd570a
("btrfs: introduce btrfs_write_check()"); this is for the buffered write
path. Direct IO write needs it, so it was probably kept as an interface
thing, but we can drop it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The file_offset parameter of btrfs_encoded_read_regular_fill_pages()
used to be passed to the encoded read struct but was removed in commit
b665affe93d8 ("btrfs: remove unused members from struct
btrfs_encoded_read_private").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We don't need offset for inline extents, they always start from 0.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We don't need the inode pointer to read an inline extent; it's all
accessible from the path pointer.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We don't need the user-passed parameter; rescan is a filesystem
operation, so fs_info is sufficient.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The LZO compression has only one level, we don't need to pass the
parameter.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The caller replace_path() runs under a transaction, but we don't need it
in btrfs_qgroup_add_swapped_blocks().
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We don't need fs_info here, everything is reachable from qgroup.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The parameter 'map' used to be passed to scrub_extent() until commit
e02ee89baa66c4 ("btrfs: scrub: switch scrub_simple_mirror() to
scrub_stripe infrastructure"), where the scrub implementation was
completely reworked.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The parameter is unused and we can reach sctx from scrub stripe if
needed.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|