summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2025-03-18btrfs: explicitly ref count block_group on new_bgs listBoris Burkov
All other users of the bg_list list_head increment the refcount when adding to a list and decrement it when deleting from the list. Just for the sake of uniformity and to try to avoid refcounting bugs, do it for this list as well. This does not fix any known ref-counting bug, as the reference belongs to a single task (trans_handle is not shared and this represents trans_handle->new_bgs linkage) and will not lose its original refcount while that thread is running. And BLOCK_GROUP_FLAG_NEW protects against ref-counting errors "moving" the block group to the unused list without taking a ref. With that said, I still believe it is simpler to just hold the extra ref count for this list user as well. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: make btrfs_discard_workfn() block_group ref explicitBoris Burkov
Currently, the async discard machinery owns a ref to the block_group when the block_group is queued on a discard list. However, to handle races with discard cancellation and the discard workfn, we have a specific logic to detect that the block_group is *currently* running in the workfn, to protect the workfn's usage amidst cancellation. As far as I can tell, this doesn't have any overt bugs (though finish_discard_pass() and remove_from_discard_list() racing can have a surprising outcome for the caller of remove_from_discard_list() in that it is again added at the end). But it is needlessly complicated to rely on locking and the nullity of discard_ctl->block_group. Simplify this significantly by just taking a refcount while we are in the workfn and unconditionally drop it in both the remove and workfn paths, regardless of if they race. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: harden block_group::bg_list against list_del() racesBoris Burkov
As far as I can tell, these calls of list_del_init() on bg_list cannot run concurrently with btrfs_mark_bg_unused() or btrfs_mark_bg_to_reclaim(), as they are in transaction error paths and situations where the block group is readonly. However, if there is any chance at all of racing with mark_bg_unused(), or a different future user of bg_list, better to be safe than sorry. Otherwise we risk the following interleaving (bg_list refcount in parens) T1 (some random op) T2 (btrfs_mark_bg_unused) !list_empty(&bg->bg_list); (1) list_del_init(&bg->bg_list); (1) list_move_tail (1) btrfs_put_block_group (0) btrfs_delete_unused_bgs bg = list_first_entry list_del_init(&bg->bg_list); btrfs_put_block_group(bg); (-1) Ultimately, this results in a broken ref count that hits zero one deref early and the real final deref underflows the refcount, resulting in a WARNING. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: fix block group refcount race in btrfs_create_pending_block_groups()Boris Burkov
Block group creation is done in two phases, which results in a slightly unintuitive property: a block group can be allocated/deallocated from after btrfs_make_block_group() adds it to the space_info with btrfs_add_bg_to_space_info(), but before creation is completely completed in btrfs_create_pending_block_groups(). As a result, it is possible for a block group to go unused and have 'btrfs_mark_bg_unused' called on it concurrently with 'btrfs_create_pending_block_groups'. This causes a number of issues, which were fixed with the block group flag 'BLOCK_GROUP_FLAG_NEW'. However, this fix is not quite complete. Since it does not use the unused_bg_lock, it is possible for the following race to occur: btrfs_create_pending_block_groups btrfs_mark_bg_unused if list_empty // false list_del_init clear_bit else if (test_bit) // true list_move_tail And we get into the exact same broken ref count and invalid new_bgs state for transaction cleanup that BLOCK_GROUP_FLAG_NEW was designed to prevent. The broken refcount aspect will result in a warning like: [1272.943527] refcount_t: underflow; use-after-free. [1272.943967] WARNING: CPU: 1 PID: 61 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 [1272.944731] Modules linked in: btrfs virtio_net xor zstd_compress raid6_pq null_blk [last unloaded: btrfs] [1272.945550] CPU: 1 UID: 0 PID: 61 Comm: kworker/u32:1 Kdump: loaded Tainted: G W 6.14.0-rc5+ #108 [1272.946368] Tainted: [W]=WARN [1272.946585] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 [1272.947273] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs] [1272.947788] RIP: 0010:refcount_warn_saturate+0xba/0x110 [1272.949532] RSP: 0018:ffffbf1200247df0 EFLAGS: 00010282 [1272.949901] RAX: 0000000000000000 RBX: ffffa14b00e3f800 RCX: 0000000000000000 [1272.950437] RDX: 0000000000000000 RSI: ffffbf1200247c78 RDI: 00000000ffffdfff [1272.950986] RBP: ffffa14b00dc2860 R08: 00000000ffffdfff R09: ffffffff90526268 [1272.951512] R10: ffffffff904762c0 R11: 0000000063666572 R12: ffffa14b00dc28c0 [1272.952024] R13: 0000000000000000 R14: ffffa14b00dc2868 R15: 000001285dcd12c0 [1272.952850] FS: 0000000000000000(0000) GS:ffffa14d33c40000(0000) knlGS:0000000000000000 [1272.953458] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1272.953931] CR2: 00007f838cbda000 CR3: 000000010104e000 CR4: 00000000000006f0 [1272.954474] Call Trace: [1272.954655] <TASK> [1272.954812] ? refcount_warn_saturate+0xba/0x110 [1272.955173] ? __warn.cold+0x93/0xd7 [1272.955487] ? refcount_warn_saturate+0xba/0x110 [1272.955816] ? report_bug+0xe7/0x120 [1272.956103] ? handle_bug+0x53/0x90 [1272.956424] ? exc_invalid_op+0x13/0x60 [1272.956700] ? asm_exc_invalid_op+0x16/0x20 [1272.957011] ? refcount_warn_saturate+0xba/0x110 [1272.957399] btrfs_discard_cancel_work.cold+0x26/0x2b [btrfs] [1272.957853] btrfs_put_block_group.cold+0x5d/0x8e [btrfs] [1272.958289] btrfs_discard_workfn+0x194/0x380 [btrfs] [1272.958729] process_one_work+0x130/0x290 [1272.959026] worker_thread+0x2ea/0x420 [1272.959335] ? __pfx_worker_thread+0x10/0x10 [1272.959644] kthread+0xd7/0x1c0 [1272.959872] ? __pfx_kthread+0x10/0x10 [1272.960172] ret_from_fork+0x30/0x50 [1272.960474] ? __pfx_kthread+0x10/0x10 [1272.960745] ret_from_fork_asm+0x1a/0x30 [1272.961035] </TASK> [1272.961238] ---[ end trace 0000000000000000 ]--- Though we have seen them in the async discard workfn as well. It is most likely to happen after a relocation finishes which cancels discard, tears down the block group, etc. Fix this fully by taking the lock around the list_del_init + clear_bit so that the two are done atomically. Fixes: 0657b20c5a76 ("btrfs: fix use-after-free of new block group that became unused") Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: remove unnecessary fs_info argument from btrfs_add_block_group_cache()Filipe Manana
The fs_info can be taken from the given block group, so there is no need to pass it as an argument. Also rename the local variable from 'info' to 'fs_info' which is more widely used, more clear and to be more consistent. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: remove unnecessary fs_info argument from delete_block_group_cache()Filipe Manana
The fs_info can be taken from the given block group, so there is no need to pass it as an argument. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: remove unnecessary fs_info argument from create_reloc_inode()Filipe Manana
The fs_info can be taken from the given block group, so there is no need to pass it as an argument. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: make btrfs_iget_path() return a btrfs inode insteadFilipe Manana
It's an internal function and btrfs_iget() is now returning a btrfs inode, so change btrfs_iget_path() to also return a btrfs inode instead of a VFS inode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: make btrfs_iget() return a btrfs inode insteadFilipe Manana
It's an internal function and most of the time the callers are doing a lot of BTRFS_I() calls on the returned VFS inode to get the btrfs inode, so change the return type to struct btrfs_inode instead. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: pass a btrfs_inode to fixup_inode_link_count()Filipe Manana
fixup_inode_link_count() mostly wants to use a btrfs_inode, plus it's an internal function so it should take btrfs_inode instead of a VFS inode. Change the argument type to btrfs_inode, avoiding several BTRFS_I() calls too. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: return a btrfs_inode from read_one_inode()Filipe Manana
All callers of read_one_inode() are mostly interested in the btrfs_inode structure rather than the VFS inode, so make read_one_inode() return the btrfs_inode instead, avoiding lots of BTRFS_I() calls. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: return a btrfs_inode from btrfs_iget_logging()Filipe Manana
All callers of btrfs_iget_logging() are interested in the btrfs_inode structure rather than the VFS inode, so make btrfs_iget_logging() return the btrfs_inode instead, avoiding lots of BTRFS_I() calls. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: avoid linker error in btrfs_find_create_tree_block()Mark Harmstone
The inline function btrfs_is_testing() is hardcoded to return 0 if CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set. Currently we're relying on the compiler optimizing out the call to alloc_test_extent_buffer() in btrfs_find_create_tree_block(), as it's not been defined (it's behind an #ifdef). Add a stub version of alloc_test_extent_buffer() to avoid linker errors on non-standard optimization levels. This problem was seen on GCC 14 with -O0 and is helps to see symbols that would be otherwise optimized out. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <maharmstone@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: run btrfs_error_commit_super() earlyQu Wenruo
[BUG] Even after all the error fixes related the "ASSERT(list_empty(&fs_info->delayed_iputs));" in close_ctree(), I can still hit it reliably with my experimental 2K block size. [CAUSE] In my case, all the error is triggered after the fs is already in error status. I find the following call trace to be the cause of race: Main thread | endio_write_workers ---------------------------------------------+--------------------------- close_ctree() | |- btrfs_error_commit_super() | | |- btrfs_cleanup_transaction() | | | |- btrfs_destroy_all_ordered_extents() | | | |- btrfs_wait_ordered_roots() | | |- btrfs_run_delayed_iputs() | | | btrfs_finish_ordered_io() | | |- btrfs_put_ordered_extent() | | |- btrfs_add_delayed_iput() |- ASSERT(list_empty(delayed_iputs)) | !!! Triggered !!! The root cause is that, btrfs_wait_ordered_roots() only wait for ordered extents to finish their IOs, not to wait for them to finish and removed. [FIX] Since btrfs_error_commit_super() will flush and wait for all ordered extents, it should be executed early, before we start flushing the workqueues. And since btrfs_error_commit_super() now runs early, there is no need to run btrfs_run_delayed_iputs() inside it, so just remove the btrfs_run_delayed_iputs() call from btrfs_error_commit_super(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: defrag: extend ioctl to accept compression levelsDaniel Vacek
The zstd and zlib compression types support setting compression levels. Enhance the defrag interface to specify the levels as well. For zstd the negative (realtime) levels are also accepted. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Daniel Vacek <neelx@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: send: simplify return logic from send_encoded_extent()Filipe Manana
The 'out' label is pointless as we don't have anything to cleanup anymore (we used to have an inode to iput), so remove it and make error paths directly return an error. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: send: remove unnecessary inode lookup at send_encoded_inline_extent()Filipe Manana
We are doing a lookup of the inode but we don't use it at all. So just remove this pointless lookup. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: avoid unnecessary bio dereference at run_one_async_done()Filipe Manana
We have dereferenced the async_submit_bio structure and extracted the bio pointer into a local variable, so there's no need to dereference it again when calling btrfs_bio_end_io(). Just use "bio->bi_status" instead of the longer expression "async->bbio->bio.bi_status". Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: fix non-empty delayed iputs list on unmount due to async workersFilipe Manana
At close_ctree() after we have ran delayed iputs either explicitly through calling btrfs_run_delayed_iputs() or later during the call to btrfs_commit_super() or btrfs_error_commit_super(), we assert that the delayed iputs list is empty. We have (another) race where this assertion might fail because we have queued an async write into the fs_info->workers workqueue. Here's how it happens: 1) We are submitting a data bio for an inode that is not the data relocation inode, so we call btrfs_wq_submit_bio(); 2) btrfs_wq_submit_bio() submits a work for the fs_info->workers queue that will run run_one_async_done(); 3) We enter close_ctree(), flush several work queues except fs_info->workers, explicitly run delayed iputs with a call to btrfs_run_delayed_iputs() and then again shortly after by calling btrfs_commit_super() or btrfs_error_commit_super(), which also run delayed iputs; 4) run_one_async_done() is executed in the work queue, and because there was an IO error (bio->bi_status is not 0) it calls btrfs_bio_end_io(), which drops the final reference on the associated ordered extent by calling btrfs_put_ordered_extent() - and that adds a delayed iput for the inode; 5) At close_ctree() we find that after stopping the cleaner and transaction kthreads the delayed iputs list is not empty, failing the following assertion: ASSERT(list_empty(&fs_info->delayed_iputs)); Fix this by flushing the fs_info->workers workqueue before running delayed iputs at close_ctree(). David reported this when running generic/648, which exercises IO error paths by using the DM error table. Reported-by: David Sterba <dsterba@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: reject out-of-band dirty folios during writebackQu Wenruo
[OUT-OF-BAND DIRTY FOLIOS] An out-of-band folio means the folio is marked dirty but without notifying the filesystem. This can lead to various problems, not limited to: - No folio::private to track per block status - No proper space reserved for such a dirty folio [HISTORY IN BTRFS] This used to be a problem related to get_user_page(), but with the introduction of pin_user_pages*(), we should no longer hit such case anymore. In btrfs, we have a long history of catching such out-of-band dirty folios by: - Mark the folio ordered during delayed allocation - Check the folio ordered flag during writeback If the folio has no ordered flag, it means it doesn't go through delayed allocation, thus it's definitely an out-of-band one. If we got one, we go through COW fixup, which will re-dirty the folio with proper handling in another workqueue. [PROBLEMS OF COW-FIXUP] Such workaround is a blockage for us to migrate to iomap (it requires extra flags to trace if a folio is dirtied by the fs or not) and I'd argue it's not data checksum safe, since if a folio can be marked dirty without informing the fs, the content can also change at any time. But with the introduction of pin_user_pages*() during v5.8 merge window, such out-of-band dirty folio such be treated as a bug. Ext4 has treated such case by warning and erroring out even before pin_user_pages*(). Furthermore, there are already proofs that such folio ordered flag tracking can be screwed up by incorrect error handling, check the commit messages of the following commits: 06f364284794 ("btrfs: do proper folio cleanup when cow_file_range() failed") c2b47df81c8e ("btrfs: do proper folio cleanup when run_delalloc_nocow() failed") [FIXES] Unlike btrfs, ext4 and xfs (iomap) never bother handling such out-of-band dirty folios. - Ext4 just warns and errors out - Iomap always follows the folio/block dirty flags And there is nothing really COW specific, xfs also supports COW too. Here we take one step towards ext4 by doing warning and erroring out. But since the cow fixup thing is introduced from the beginning, we keep the old behavior for non-experimental builds, and only do the new warning for experimental builds before we're 100% sure and remove cow fixup. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: return a literal instead of a variable in btrfs_init_dev_replace()Dan Carpenter
This is just a small clean up, it doesn't change how the code works. Originally this code had a goto so we needed to set "ret = 0;" but now it returns directly and so we can simplify it a bit by doing a "return 0;" and removing the assignment. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: move btrfs_cleanup_bio() code into its single callerFilipe Manana
The btrfs_cleanup_bio() helper is trivial and has a single caller, there's no point in having a dedicated helper function. So get rid of it and move its code into the caller (btrfs_bio_end_io()). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: move __btrfs_bio_end_io() code into its single callerFilipe Manana
The __btrfs_bio_end_io() helper is trivial and has a single caller, so there's no point in having a dedicated helper function. Further the double underscore prefix in the name is discouraged. So get rid of it and move its code into the caller (btrfs_bio_end_io()). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: fix non-empty delayed iputs list on unmount due to compressed write ↵Filipe Manana
workers At close_ctree() after we have ran delayed iputs either through explicitly calling btrfs_run_delayed_iputs() or later during the call to btrfs_commit_super() or btrfs_error_commit_super(), we assert that the delayed iputs list is empty. When we have compressed writes this assertion may fail because delayed iputs may have been added to the list after we last ran delayed iputs. This happens like this: 1) We have a compressed write bio executing; 2) We enter close_ctree() and flush the fs_info->endio_write_workers queue which is the queue used for running ordered extent completion; 3) The compressed write bio finishes and enters btrfs_finish_compressed_write_work(), where it calls btrfs_finish_ordered_extent() which in turn calls btrfs_queue_ordered_fn(), which queues a work item in the fs_info->endio_write_workers queue that we have flushed before; 4) At close_ctree() we proceed, run all existing delayed iputs and call btrfs_commit_super() (which also runs delayed iputs), but before we run the following assertion below: ASSERT(list_empty(&fs_info->delayed_iputs)) A delayed iput is added by the step below... 5) The ordered extent completion job queued in step 3 runs and results in creating a delayed iput when dropping the last reference of the ordered extent (a call to btrfs_put_ordered_extent() made from btrfs_finish_one_ordered()); 6) At this point the delayed iputs list is not empty, so the assertion at close_ctree() fails. Fix this by flushing the fs_info->compressed_write_workers queue at close_ctree() before flushing the fs_info->endio_write_workers queue, respecting the queue dependency as the later is responsible for the execution of ordered extent completion. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: unify inode variable namingDavid Sterba
Rename binode to inode in local variables or parameters so it's more unified with the rest of the code. Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: pass struct to btrfs_ioctl_subvol_getflags()David Sterba
Pass a struct btrfs_inode to btrfs_ioctl_subvol_getflags() as it's an internal interface, allowing to remove some use of BTRFS_I. Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: simplify local variables in btrfs_ioctl_resize()David Sterba
Remove some redundant variables and assignments, move variable declarations to their closest scope. Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: pass struct btrfs_inode to btrfs_sync_inode_flags_to_i_flags()David Sterba
Pass a struct btrfs_inode to btrfs_sync_inode_flags_to_i_flags() as it's an internal interface. Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: pass root pointers to search tree ioctl helpersDavid Sterba
The search tree ioctl use btrfs_root so change that from btrfs_inode pointers so we don't have to do the conversion. Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: pass btrfs_root pointers to send ioctl parametersDavid Sterba
The ioctl switch btrfs_ioctl() provides several parameter types for convenience so we don't have to do the conversion in the callbacks. Pass root pointers to the send related functions. Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: parameter constification in ioctl.cDavid Sterba
Add const to function parameters that are not changed. Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: allow debug builds to accept 2K block sizeQu Wenruo
Currently we only support two block sizes, 4K and PAGE_SIZE. This means on the most common architecture x86_64, we have no way to test subpage block size. And that's exactly I have an aarch64 machine dedicated for subpage tests. But this is still a hurdle for a lot of btrfs developers, and to improve the test coverage mostly on x86_64, here we enable debug builds to accept 2K block size. This involves: - Introduce a dedicated minimal block size macro BTRFS_MIN_BLOCKSIZE, which depends on if CONFIG_BTRFS_DEBUG is set. If so it's 2K, otherwise it's 4K as usual. - Allow 4K, PAGE_SIZE and BTRFS_MIN_BLOCKSIZE as block size - Update subpage block size checks to be based on BTRFS_MIN_BLOCKSIZE - Export the new supported blocksize through sysfs interfaces As most of the subpage support is already pretty mature, there is no extra work needed to support the extra 2K block size. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: properly limit inline data extent according to block sizeQu Wenruo
Btrfs utilizes inline data extent for the following cases: - Regular small files - Symlinks And "btrfs check" detects any file extents that are too large as an error. It's not a problem for 4K block size, but for the incoming smaller block sizes (2K), it can cause problems due to bad limits: - Non-compressed inline data extents We do not allow a non-compressed inline data extent to be as large as block size. - Symlinks Currently the only real limit on symlinks are 4K, which can be larger than 2K block size. These will result btrfs-check to report too large file extents. Fix it by adding proper size checks for the above cases. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: remove the subpage related warning messageQu Wenruo
Since the initial enablement of block size < page size support for btrfs in v5.15, we have hit several milestones for block size < page size (subpage) support: - RAID56 subpage support In v5.19 - Refactored scrub support to support subpage better In v6.4 - Block perfect (previously requires page aligned ranges) compressed write In v6.13 - Various error handling fixes involving subpage In v6.14 Finally the only missing feature is the pretty simple and harmless inlined data extent creation, just added in previous patches. Now btrfs has all of its features ready for both regular and subpage cases, there is no reason to output a warning about the experimental subpage support, and we can finally remove it now. Acked-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: allow inline data extents creation if block size < page sizeQu Wenruo
Previously inline data extents creation was disabled if the block size (previously called sector size) is smaller than the page size, for the following reasons: - Possible mixed inline and regular data extents However this is also the same if the block size matches the page size, thus we do not treat mixed inline and regular extents as an error. And the chance to cause mixed inline and regular data extents are not even increased, it has the same requirement (compressed inline data extent covering the whole first block, followed by regular extents). - Inability to handle async/inline delalloc range for block size < page size cases This is already fixed since commit 1d2fbb7f1f9e ("btrfs: allow compression even if the range is not page aligned"). This was the major technical obstacle, but it's not anymore. With that removed, we can enable inline data extents creation no matter the block size nor the page size, allowing btrfs to have the same capacity for all block sizes. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: allow buffered write to avoid full page read if it's block alignedQu Wenruo
[BUG] Since the support of block size (sector size) < page size for btrfs, test case generic/563 fails with 4K block size and 64K page size: --- tests/generic/563.out 2024-04-25 18:13:45.178550333 +0930 +++ /home/adam/xfstests-dev/results//generic/563.out.bad 2024-09-30 09:09:16.155312379 +0930 @@ -3,7 +3,8 @@ read is in range write is in range write -> read/write -read is in range +read has value of 8388608 +read is NOT in range -33792 .. 33792 write is in range ... [CAUSE] The test case creates a 8MiB file, then does buffered write into the 8MiB using 4K block size, to overwrite the whole file. On 4K page sized systems, since the write range covers the full block and page, btrfs will not bother reading the page, just like what XFS and EXT4 do. But on 64K page sized systems, although the 4K sized write is still block aligned, it's not page aligned anymore, thus btrfs will read the full page, which will be accounted by cgroup and fail the test. As the test case itself expects such 4K block aligned write should not trigger any read. Such expected behavior is an optimization to reduce folio reads when possible, and unfortunately btrfs does not implement such optimization. [FIX] To skip the full page read, we need to do the following modification: - Do not trigger full page read as long as the buffered write is block aligned This is pretty simple by modifying the check inside prepare_uptodate_page(). - Skip already uptodate blocks during full page read Or we can lead to the following data corruption: 0 32K 64K |///////| | Where the file range [0, 32K) is dirtied by buffered write, the remaining range [32K, 64K) is not. When reading the full page, since [0,32K) is only dirtied but not written back, there is no data extent map for it, but a hole covering [0, 64k). If we continue reading the full page range [0, 64K), the dirtied range will be filled with 0 (since there is only a hole covering the whole range). This causes the dirtied range to get lost. With this optimization, btrfs can pass generic/563 even if the page size is larger than fs block size. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: make btrfs_do_readpage() to do block-by-block readQu Wenruo
Currently if btrfs has its block size (the older sector size) smaller than the page size, btrfs_do_readpage() will handle the range extent by extent, this is good for performance as it doesn't need to re-lookup the same extent map again and again. (Although get_extent_map() already does extra cached em check, thus the optimization is not that obvious.) This is totally fine and is a valid optimization, but it has an assumption that there is no partial uptodate range in the page. Meanwhile there is an incoming feature, requiring btrfs to skip the full page read if a buffered write range covers a full block but not a full page. In that case, we can have a page that is partially uptodate, and the current per-extent lookup cannot handle such case. So here we change btrfs_do_readpage() to do block-by-block read, this simplifies the following things: - Remove the need for @iosize variable Because we just use sectorsize as our increment. - Remove @pg_offset, and calculate it inside the loop when needed It's just offset_in_folio(). - Use a for() loop instead of a while() loop This will slightly reduce the read performance for subpage cases, but for the future where we need to skip already uptodate blocks, it should still be worth. For block size == page size, this brings no performance change. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: introduce a read path dedicated extent lock helperQu Wenruo
Currently we're using btrfs_lock_and_flush_ordered_range() for both btrfs_read_folio() and btrfs_readahead(), but it has one critical problem for future subpage optimizations: - It will call btrfs_start_ordered_extent() to writeback the involved folios But remember we're calling btrfs_lock_and_flush_ordered_range() at read paths, meaning the folio is already locked by read path. If we really trigger writeback for those already locked folios, this will lead to a deadlock and writeback cannot get the folio lock. Such dead lock is prevented by the fact that btrfs always keeps a dirty folio also uptodate, by either dirtying all blocks of the folio, or by reading the whole folio before dirtying. To prepare for the incoming patch which allows btrfs to skip full folio read if the buffered write is block aligned, we have to start by solving the possible deadlock first. Instead of blindly calling btrfs_start_ordered_extent(), introduce a new helper, which is smarter in the following ways: - Only wait and flush the ordered extent if * The folio doesn't even have private bit set * Part of the blocks of the ordered extent are not uptodate This can happen by: * The folio writeback finished, then got invalidated. There are a lot of reasons that a folio can get invalidated, from memory pressure to direct IO (which invalidates all folios of the range). But OE not yet finished. We have to wait for the ordered extent, as the OE may contain to-be-inserted data checksum. Without waiting, our read can fail due to the missing checksum. But either way, the OE should not need any extra flush inside the locked folio range. - Skip the ordered extent completely if * All the blocks are dirty This happens when OE creation is caused by a folio writeback whose file offset is before our folio. E.g. 16K page size and 4K block size 0 8K 16K 24K 32K |//////////////||///////| | The writeback of folio 0 created an OE for range [0, 24K), but since folio 16K is not fully uptodate, a read is triggered for folio 16K. The writeback will never happen (we're holding the folio lock for read), nor will the OE finish. Thus we must skip the range. * All the blocks are uptodate This happens when the writeback finished, but OE not yet finished. Since the blocks are already uptodate, we can skip the OE range. The new helper lock_extents_for_read() will do a loop for the target range by: 1) Lock the full range 2) If there is no ordered extent in the remaining range, exit 3) If there is an ordered extent that we can skip Skip to the end of the OE, and continue checking We do not trigger writeback nor wait for the OE. 4) If there is an ordered extent that we cannot skip Unlock the whole extent range and start the ordered extent. And also update btrfs_start_ordered_extent() to add two more parameters: @nowriteback_start and @nowriteback_len, to prevent triggering flush for a certain range. This will allow us to handle the following case properly in the future: 16K page size, 4K btrfs block size: 0 4K 8K 12K 16K 20K 24K 28K 32K |/////////////////////////////||////////////////| | | |<-------------------- OE 2 ------------------->| |< OE 1 >| The folio has been written back before, thus we have an OE at [28K, 32K). Although the OE 1 finished its IO, the OE is not yet removed from IO tree. The folio got invalidated after writeback completed and before the ordered extent finished. And [16K, 24K) range is dirty and uptodate, caused by a block aligned buffered write (and future enhancements allowing btrfs to skip full folio read for such case). But writeback for folio 0 has began, thus it generated OE 2, covering range [0, 24K). Since the full folio 16K is not uptodate, if we want to read the folio, the existing btrfs_lock_and_flush_ordered_range() will dead lock, by: btrfs_read_folio() | Folio 16K is already locked |- btrfs_lock_and_flush_ordered_range() |- btrfs_start_ordered_extent() for range [16K, 24K) |- filemap_fdatawrite_range() for range [16K, 24K) |- extent_write_cache_pages() folio_lock() on folio 16K, deadlock. But now we will have the following sequence: btrfs_read_folio() | Folio 16K is already locked |- lock_extents_for_read() |- can_skip_ordered_extent() for range [16K, 24K) | Returned true, the range [16K, 24K) will be skipped. |- can_skip_ordered_extent() for range [28K, 32K) | Returned false. |- btrfs_start_ordered_extent() for range [28K, 32K) with [16K, 32K) as no writeback range No writeback for folio 16K will be triggered. And there will be no more possible deadlock on the same folio. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: fix the qgroup data free range for inline data extentsQu Wenruo
Inside function __cow_file_range_inline() since the inlined data no longer take any data space, we need to free up the reserved space. However the code is still using the old page size == sector size assumption, and will not handle subpage case well. Thankfully it is not going to cause any problems because we have two extra safe nets: - Inline data extents creation is disabled for sector size < page size cases for now But it won't stay that for long. - btrfs_qgroup_free_data() will only clear ranges which have been already reserved So even if we pass a range larger than what we need, it should still be fine, especially there is only reserved space for a single block at file offset 0 of an inline data extent. But just for the sake of consistency, fix the call site to use sectorsize instead of page size. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: prevent inline data extents read from touching blocks beyond its rangeQu Wenruo
Currently reading an inline data extent will zero out the remaining range in the page. This is not yet causing problems even for block size < page size (subpage) cases because: 1) An inline data extent always starts at file offset 0 Meaning at page read, we always read the inline extent first, before any other blocks in the page. Then later blocks are properly read out and re-fill the zeroed out ranges. 2) Currently btrfs will read out the whole page if a buffered write is not page aligned So a page is either fully uptodate at buffered write time (covers the whole page), or we will read out the whole page first. Meaning there is nothing to lose for such an inline extent read. But it's still not ideal: - We're zeroing out the page twice Once done by read_inline_extent()/uncompress_inline(), once done by btrfs_do_readpage() for ranges beyond i_size. - We're touching blocks that don't belong to the inline extent In the incoming patches, we can have a partial uptodate folio, of which some dirty blocks can exist while the page is not fully uptodate: The page size is 16K and block size is 4K: 0 4K 8K 12K 16K | | |/////////| | And range [8K, 12K) is dirtied by a buffered write, the remaining blocks are not uptodate. If range [0, 4K) contains an inline data extent, and we try to read the whole page, the current behavior will overwrite range [8K, 12K) with zero and cause data loss. So to make the behavior more consistent and in preparation for future changes, limit the inline data extents read to only zero out the range inside the first block, not the whole page. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: sysfs: accept size suffixes for read policy valuesAnand Jain
We now parse human-friendly size values (e.g. '1G', '2M') when setting read policies. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: use BTRFS_PATH_AUTO_FREE in load_free_space_tree()David Sterba
This is the trivial pattern for path auto free, initialize at the beginning and free at the end with simple goto -> return conversions. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: use BTRFS_PATH_AUTO_FREE in clear_free_space_tree()David Sterba
This is the trivial pattern for path auto free, initialize at the beginning and free at the end with simple goto -> return conversions. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: use BTRFS_PATH_AUTO_FREE in populate_free_space_tree()David Sterba
This is the trivial pattern for path auto free, initialize at the beginning and free at the end with simple goto -> return conversions. This applies to both path and path2. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_remove_free_space_inode()David Sterba
This is the trivial pattern for path auto free, initialize at the beginning and free at the end with simple goto -> return conversions. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_lookup_bio_sums()David Sterba
This is the trivial pattern for path auto free, initialize at the beginning and free at the end. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: use BTRFS_PATH_AUTO_FREE in run_delayed_extent_op()David Sterba
This is the trivial pattern for path auto free, initialize at the beginning and free at the end with simple goto -> return conversions. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_lookup_extent_info()David Sterba
This is the trivial pattern for path auto free, initialize at the beginning and free at the end with simple goto -> return conversions. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_get_name()David Sterba
This is the trivial pattern for path auto free, initialize at the beginning and free at the end with some return simplifications. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: use BTRFS_PATH_AUTO_FREE in btrfs_init_root_free_objectid()David Sterba
This is the trivial pattern for path auto free, initialize at the beginning and free at the end with simple goto -> return conversions. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>