path: root/fs/btrfs
2025-05-15  btrfs: handle aligned EOF truncation correctly for subpage cases  (Qu Wenruo)

[BUG]
For the following fsx -e 1 run, btrfs still fails the run on 64K page size
with 4K fs block size:

    READ BAD DATA: offset = 0x26b3a, size = 0xfafa, fname = /mnt/btrfs/junk
    OFFSET      GOOD    BAD     RANGE
    0x26b3a     0x0000  0x15b4  0x0
    operation# (mod 256) for the bad data may be 21
    [...]
    LOG DUMP (28 total operations):
    1(  1 mod 256): SKIPPED (no operation)
    2(  2 mod 256): SKIPPED (no operation)
    3(  3 mod 256): SKIPPED (no operation)
    4(  4 mod 256): SKIPPED (no operation)
    5(  5 mod 256): WRITE    0x1ea90 thru 0x285e0 (0x9b51 bytes) HOLE
    6(  6 mod 256): ZERO     0x1b1a8 thru 0x20bd4 (0x5a2d bytes)
    7(  7 mod 256): FALLOC   0x22b1a thru 0x272fa (0x47e0 bytes) INTERIOR
    8(  8 mod 256): WRITE    0x741d thru 0x13522 (0xc106 bytes)
    9(  9 mod 256): MAPWRITE 0x73ee thru 0xdeeb (0x6afe bytes)
    10( 10 mod 256): FALLOC   0xb719 thru 0xb994 (0x27b bytes) INTERIOR
    11( 11 mod 256): COPY 0x15ed8 thru 0x18be1 (0x2d0a bytes) to 0x25f6e thru 0x28c77
    12( 12 mod 256): ZERO     0x1615e thru 0x1770e (0x15b1 bytes)
    13( 13 mod 256): SKIPPED (no operation)
    14( 14 mod 256): DEDUPE 0x20000 thru 0x27fff (0x8000 bytes) to 0x1000 thru 0x8fff
    15( 15 mod 256): SKIPPED (no operation)
    16( 16 mod 256): CLONE 0xa000 thru 0xffff (0x6000 bytes) to 0x36000 thru 0x3bfff
    17( 17 mod 256): ZERO     0x14adc thru 0x1b78a (0x6caf bytes)
    18( 18 mod 256): TRUNCATE DOWN from 0x3c000 to 0x1e2e3 ******WWWW
    19( 19 mod 256): CLONE 0x4000 thru 0x11fff (0xe000 bytes) to 0x16000 thru 0x23fff
    20( 20 mod 256): FALLOC   0x311e1 thru 0x3681b (0x563a bytes) PAST_EOF
    21( 21 mod 256): FALLOC   0x351c5 thru 0x40000 (0xae3b bytes) EXTENDING
    22( 22 mod 256): WRITE    0x920 thru 0x7e51 (0x7532 bytes)
    23( 23 mod 256): COPY 0x2b58 thru 0xc508 (0x99b1 bytes) to 0x117b1 thru 0x1b161
    24( 24 mod 256): TRUNCATE DOWN from 0x40000 to 0x3c9a5
    25( 25 mod 256): SKIPPED (no operation)
    26( 26 mod 256): MAPWRITE 0x25020 thru 0x26b06 (0x1ae7 bytes)
    27( 27 mod 256): SKIPPED (no operation)
    28( 28 mod 256): READ     0x26b3a thru 0x36633 (0xfafa bytes) ***RRRR***

[CAUSE]
The involved operations are:

    fallocating to largest ever: 0x40000
    21 pollute_eof 0x24000 thru 0x2ffff (0xc000 bytes)
    21 falloc      from 0x351c5 to 0x40000 (0xae3b bytes)
    28 read        0x26b3a thru 0x36633 (0xfafa bytes)

At operation #21 a pollute_eof is done, by a memory mapped write into the
range [0x24000, 0x2ffff). At this stage the inode size is 0x24000, which
is block aligned.

Then the fallocate happens, and since it is expanding the inode, it will
call btrfs_truncate_block() to truncate any unaligned range. But since
the inode size is already block aligned, btrfs_truncate_block() does
nothing and exits.

However, remember that the folio at 0x20000 already has part of its range
polluted. Although that range will never be written back to disk, it
still affects the page cache, resulting in the later operation #28
reading out the polluted value.

[FIX]
Instead of exiting btrfs_truncate_block() early when the range is already
block aligned, do extra folio zeroing if the fs block size is smaller
than the page size and we're truncating beyond EOF.

This addresses exactly the above case, where a memory mapped write can
still leave some garbage beyond EOF.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: handle unaligned EOF truncation correctly for subpage cases  (Qu Wenruo)

[BUG]
The following fsx sequence will fail on btrfs with 64K page size and 4K
fs block size:

    #fsx -d -e 1 -N 4 $mnt/junk -S 36386
    READ BAD DATA: offset = 0xe9ba, size = 0x6dd5, fname = /mnt/btrfs/junk
    OFFSET      GOOD    BAD     RANGE
    0xe9ba      0x0000  0x03ac  0x0
    operation# (mod 256) for the bad data may be 3
    ...
    LOG DUMP (4 total operations):
    1( 1 mod 256): WRITE    0x6c62 thru 0x1147d (0xa81c bytes) HOLE ***WWWW
    2( 2 mod 256): TRUNCATE DOWN from 0x1147e to 0x5448 ******WWWW
    3( 3 mod 256): ZERO     0x1c7aa thru 0x28fe2 (0xc839 bytes)
    4( 4 mod 256): MAPREAD  0xe9ba thru 0x1578e (0x6dd5 bytes) ***RRRR***

[CAUSE]
Only 2 operations are really involved in this case:

    3 pollute_eof 0x5448 thru 0xffff (0xabb8 bytes)
    3 zero        from 0x1c7aa to 0x28fe3 (0xc839 bytes)
    4 mapread     0xe9ba thru 0x1578e (0x6dd5 bytes)

At operation 3, fsx pollutes beyond EOF: it mmap()s the file and writes
into the mmap()ed range beyond EOF. Such a write fills the range beyond
EOF, but it will never reach disk, as ranges beyond EOF are not marked
dirty nor uptodate.

Then we zero_range for [0x1c7aa, 0x28fe3], and since the range is beyond
our isize (which was 0x5448), we should zero out any range beyond EOF
(0x5448).

During btrfs_zero_range(), we call btrfs_truncate_block() to dirty the
unaligned head block. But that function only really zeroes out the block
at [0x5000, 0x5fff]; it doesn't bother with any range other than that
block, since those ranges will not be marked dirty nor written back.

So the range [0x6000, 0xffff] is still polluted, and the later mapread()
will return the poisoned value.

[FIX]
Enhance btrfs_truncate_block() by:

- Pass a @start/@end pair to indicate the full truncation range

  This is to handle the following truncation case:
  Page size is 64K, fs block size is 4K, truncate range is [6K, 60K]:

    0                       32K                       64K
    |    |//////////////////////////////////////|    |
         6K                                     60K

  The range is not aligned for its head block, so we need to call
  btrfs_truncate_block() with @from = 6K, @front = 0, @len = 0.

  But with that information we only know to zero the range [6K, 8K); if
  we zeroed out the range [6K, 64K), the last block would also be
  zeroed, causing data loss.

  So here we need the full range we're truncating, so that we can avoid
  over-truncation.

- Rename @from to @offset

  As the parameter is now only utilized to locate a block, it no longer
  carries the old @from meaning well.

- Remove the @front parameter

  With the full truncate range passed in, we can determine whether
  @offset is in the head or the tail block.

- Skip truncation if @offset is in neither the head nor the tail block

  The call site in hole punching unconditionally calls
  btrfs_truncate_block() without even checking whether the range is
  aligned. If @offset is in neither the head nor the tail block, we can
  safely ignore it.

- Skip truncation if the range inside the target block is already aligned

- Make btrfs_truncate_block() zero all blocks beyond EOF

  Since we have the original range, we know exactly whether we're doing
  truncation beyond EOF (the @end will be (u64)-1). If we're truncating
  beyond EOF, enlarge the truncation range to the folio end, to address
  the possibly polluted ranges.

  Otherwise still keep the zero range inside the block; we can have
  large data folios soon, and always truncating every block inside the
  same folio can be costly for large folios.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
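To make the clamping logic above concrete, here is a minimal userspace sketch of the head/tail decision, assuming 4K blocks inside a 64K folio; the function and constant names are illustrative stand-ins, not the btrfs code:

    /*
     * Hedged sketch, not the kernel code: decide what part of a block to
     * zero for a truncation range [start, end], assuming 4K blocks in a
     * 64K folio. All names here are illustrative.
     */
    #include <stdio.h>
    #include <stdint.h>

    #define BLOCKSIZE  4096ULL
    #define FOLIOSIZE  65536ULL

    static void zero_for_block(uint64_t start, uint64_t end, uint64_t offset)
    {
        uint64_t block_start = offset & ~(BLOCKSIZE - 1);
        uint64_t block_end = block_start + BLOCKSIZE - 1;
        uint64_t zero_start, zero_end;

        /* Offset in neither the head nor the tail block: nothing to do. */
        if (offset / BLOCKSIZE != start / BLOCKSIZE &&
            offset / BLOCKSIZE != end / BLOCKSIZE) {
            printf("skip: not a head/tail block\n");
            return;
        }

        /* Clamp the zeroing to the truncation range inside this block. */
        zero_start = start > block_start ? start : block_start;
        zero_end = end < block_end ? end : block_end;

        /*
         * Truncating beyond EOF (@end == (u64)-1): widen the zeroing to
         * the folio end so mmap pollution past EOF is cleared as well.
         */
        if (end == UINT64_MAX)
            zero_end = (block_start & ~(FOLIOSIZE - 1)) + FOLIOSIZE - 1;

        printf("zero [0x%llx, 0x%llx]\n",
               (unsigned long long)zero_start, (unsigned long long)zero_end);
    }

    int main(void)
    {
        /* Truncate range [6K, 60K]: the head block zeroes only [6K, 8K). */
        zero_for_block(6 * 1024, 60 * 1024 - 1, 6 * 1024);
        /* Truncation beyond EOF at 0x5448 zeroes through the folio end. */
        zero_for_block(0x5448, UINT64_MAX, 0x5448);
        return 0;
    }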
2025-05-15  btrfs: fix broken drop_caches on extent buffer folios  (Boris Burkov)

The (correct) commit e41c81d0d30e ("mm/truncate: Replace page_mapped()
call in invalidate_inode_page()") replaced the page_mapped(page) check
with a refcount check. However, this refcount check does not work as
expected with drop_caches for btrfs's metadata pages.

Btrfs has a per-sb metadata inode with cached pages, and when not in
active use by btrfs, they have a refcount of 3: one from the initial
call to alloc_pages(), one (nr_pages == 1) from filemap_add_folio(),
and one from folio_attach_private(). We would expect such pages to get
dropped by drop_caches. However, drop_caches calls into
mapping_evict_folio() via mapping_try_invalidate(), which gets a
reference on the folio with find_lock_entries(). As a result, these
pages have a refcount of 4 and fail this check.

For what it's worth, such pages do get reclaimed under memory pressure,
so I would say that while this behavior is surprising, it is not really
dangerously broken.

When I asked the mm folks about the expected refcount in this case, I
was told that the correct thing to do is to donate the refcount from the
original allocation to the page cache after inserting it.

Therefore, attempt to fix this by adding a put_folio() to the critical
spot in alloc_extent_buffer() where we are sure that we have really
allocated and attached new pages. We must also adjust
folio_detach_private() to properly handle being the last reference to
the folio and not do a use-after-free after folio_detach_private().

extent_buffers allocated by clone_extent_buffer() and
alloc_dummy_extent_buffer() are unmapped, so this transfer of ownership
from allocation to insertion in the mapping does not apply to them.
However, we can still folio_put() them safely once they are finished
being allocated and have called folio_attach_private().

Finally, removing the generic put_folio() for the allocation from
btrfs_detach_extent_buffer_folios() means we need to be careful to do
the appropriate put_folio() in the allocation failure paths in
alloc_extent_buffer(), clone_extent_buffer() and
alloc_dummy_extent_buffer().

Link: https://lore.kernel.org/linux-mm/ZrwhTXKzgDnCK76Z@casper.infradead.org/
Tested-by: Klara Modin <klarasmodin@gmail.com>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
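The refcount arithmetic in that description can be mirrored with a toy counter; this is a sketch of the accounting only, not real struct folio handling (the real helpers are folio_get()/folio_put()):

    #include <stdio.h>

    /* Toy counter mirroring the commit message's accounting. */
    int main(void)
    {
        int refcount = 0;

        refcount++;  /* alloc_pages(): allocation reference       */
        refcount++;  /* filemap_add_folio(): page cache reference */
        refcount++;  /* folio_attach_private(): private reference */
        printf("idle refcount before the fix: %d\n", refcount);  /* 3 */

        refcount--;  /* the fix: donate the allocation ref to the cache */
        printf("idle refcount after the fix:  %d\n", refcount);  /* 2 */

        /*
         * drop_caches' find_lock_entries() adds one transient reference,
         * so the eviction check now sees 3 instead of the failing 4.
         */
        printf("during drop_caches: %d\n", refcount + 1);
        return 0;
    }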
2025-05-15  btrfs: use verbose assert at peek_discard_list()  (Filipe Manana)

We now have a verbose variant of ASSERT() so that we can print the value
of the block group's discard_index. So use it for better problem
analysis in case the assertion is triggered.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: scrub: aggregate small bitmaps into a larger one  (Qu Wenruo)

Currently we have several small bitmaps inside scrub_stripe:

- extent_sector_bitmap
- error_bitmap
- io_error_bitmap
- csum_error_bitmap
- meta_error_bitmap
- meta_gen_error_bitmap

All those bitmaps are at most 16 bits long, but unsigned long is either
32 or (more commonly) 64 bits. This means we're wasting 1/2 or 3/4 of
the space for each bitmap, and since we can have 128 scrub_stripes for
each device, such wasted space adds up quickly.

Instead of using a single unsigned long for each bitmap, aggregate them
into a larger bitmap, just like what we're doing for subpage support.
This shaves 24 bytes from each scrub_stripe structure on x86_64 systems.

This needs a lot of macros converting direct bitmap/bit operations into
our scrub_stripe specific helpers, but all those helpers are very small
and can be inlined. So overall the overhead shouldn't be that huge, and
we save quite some memory space.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
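The packing idea can be illustrated with the following standalone sketch. It assumes at most 16 blocks per stripe, and the helper names are made up for illustration; they only mimic the role of the commit's scrub_stripe specific helpers:

    #include <stdio.h>

    #define BLOCKS_PER_STRIPE 16   /* assumption: <= 16 bits per bitmap */

    enum stripe_bitmap {
        BITMAP_EXTENT_SECTOR,
        BITMAP_ERROR,
        BITMAP_IO_ERROR,
        BITMAP_CSUM_ERROR,
        BITMAP_META_ERROR,
        BITMAP_META_GEN_ERROR,
        NR_STRIPE_BITMAPS,
    };

    struct stripe_bitmaps {
        /* One packed area instead of six separate unsigned longs. */
        unsigned long bits[(NR_STRIPE_BITMAPS * BLOCKS_PER_STRIPE +
                            8 * sizeof(unsigned long) - 1) /
                           (8 * sizeof(unsigned long))];
    };

    /* Map (bitmap, block) to a bit number in the packed area. */
    static unsigned int bit_nr(enum stripe_bitmap map, unsigned int block)
    {
        return map * BLOCKS_PER_STRIPE + block;
    }

    static void stripe_set_bit(struct stripe_bitmaps *s,
                               enum stripe_bitmap map, unsigned int block)
    {
        unsigned int nr = bit_nr(map, block);

        s->bits[nr / (8 * sizeof(unsigned long))] |=
            1UL << (nr % (8 * sizeof(unsigned long)));
    }

    static int stripe_test_bit(struct stripe_bitmaps *s,
                               enum stripe_bitmap map, unsigned int block)
    {
        unsigned int nr = bit_nr(map, block);

        return !!(s->bits[nr / (8 * sizeof(unsigned long))] &
                  (1UL << (nr % (8 * sizeof(unsigned long)))));
    }

    int main(void)
    {
        struct stripe_bitmaps s = {0};

        stripe_set_bit(&s, BITMAP_CSUM_ERROR, 3);
        printf("csum error on block 3: %d\n",
               stripe_test_bit(&s, BITMAP_CSUM_ERROR, 3));
        printf("io error on block 3:   %d\n",
               stripe_test_bit(&s, BITMAP_IO_ERROR, 3));
        return 0;
    }

Six 16-bit bitmaps pack into 96 bits, i.e. two unsigned longs on 64-bit, instead of six.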
2025-05-15  btrfs: scrub: fix a wrong error type when metadata bytenr mismatches  (Qu Wenruo)

When the bytenr doesn't match for a metadata tree block, we report it as
a csum error, which is incorrect; it should be reported as a metadata
error instead.

Fixes: a3ddbaebc7c9 ("btrfs: scrub: introduce a helper to verify one metadata block")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: defrag: use list_last_entry() at defrag_collect_targets()  (Filipe Manana)

Instead of using list_entry() against the list's prev entry, use
list_last_entry(), which removes the need to know that the last member
is accessed through the prev list pointer, and the naming makes it
easier to reason about what we are doing.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: simplify csum list release at btrfs_put_ordered_extent()  (Filipe Manana)

Instead of extracting each element by grabbing the list's first member
in a local list_head variable, then extracting the csum with
list_entry() and iterating with a while loop checking for list
emptiness, use the iteration helper list_for_each_entry_safe(). This
also removes the need to delete elements from the list with list_del(),
since the ordered extent is freed immediately after.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: simplify extracting delayed node at btrfs_first_prepared_delayed_node()  (Filipe Manana)

Instead of grabbing the next pointer from the list and then doing a
list_entry() call, we can simply use list_first_entry(), removing the
need for a list_head variable. Also there's no need to check if the list
is empty before attempting to extract the first element; we can use
list_first_entry_or_null(), removing the need for a special if statement
and the 'out' label.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: simplify extracting delayed node at btrfs_first_delayed_node()  (Filipe Manana)

Instead of grabbing the next pointer from the list and then doing a
list_entry() call, we can simply use list_first_entry(), removing the
need for a list_head variable. Also there's no need to check if the list
is empty before attempting to extract the first element; we can use
list_first_entry_or_null(), removing the need for a special if statement
and the 'out' label.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: raid56: use list_last_entry() at cache_rbio()  (Filipe Manana)

Instead of using list_entry() against the list's prev entry, use
list_last_entry(), which removes the need to know that the last member
is accessed through the prev list pointer, and the naming makes it
easier to reason about what we are doing.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
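Several of the entries above replace open-coded list manipulation with <linux/list.h> helpers. The standalone sketch below re-implements just enough of those helpers to show the shape of the change; it is a userspace illustration, not the kernel headers:

    #include <stdio.h>
    #include <stddef.h>

    struct list_head { struct list_head *next, *prev; };

    #define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))
    #define list_entry(ptr, type, member) container_of(ptr, type, member)
    #define list_first_entry(head, type, member) \
        list_entry((head)->next, type, member)
    #define list_last_entry(head, type, member) \
        list_entry((head)->prev, type, member)
    #define list_first_entry_or_null(head, type, member) \
        ((head)->next != (head) ? list_first_entry(head, type, member) : NULL)

    struct item { int val; struct list_head list; };

    static void list_add_tail(struct list_head *new, struct list_head *head)
    {
        new->prev = head->prev;
        new->next = head;
        head->prev->next = new;
        head->prev = new;
    }

    int main(void)
    {
        struct list_head head = { &head, &head };
        struct item a = { 1 }, b = { 2 };

        /* Empty list: the _or_null variant folds the emptiness check in,
         * replacing the open-coded "if (list_empty()) goto out" dance. */
        printf("empty -> %p\n",
               (void *)list_first_entry_or_null(&head, struct item, list));

        list_add_tail(&a.list, &head);
        list_add_tail(&b.list, &head);

        /* Instead of list_entry(head.prev, ...), name the intent: */
        printf("first: %d, last: %d\n",
               list_first_entry(&head, struct item, list)->val,
               list_last_entry(&head, struct item, list)->val);
        return 0;
    }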
2025-05-15  btrfs: simplify cow only root list extraction during transaction commit  (Filipe Manana)

There's no need to keep a local variable to extract the first member of
the list and then do a list_entry() call; we can use list_first_entry()
instead, removing the need for the temporary variable and extracting the
first element in a single step.

Also, there's no need to do a list_del_init() followed by
list_add_tail(); instead we can use list_move_tail(). We are in the
transaction commit critical section, where we don't need to worry about
concurrency, which is why we don't take any locks and can use
list_move_tail() (we assert early at commit_cowonly_roots() that we are
in the critical section, that is, that the transaction's state is
TRANS_STATE_COMMIT_DOING).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: simplify getting and extracting previous transaction at clean_pinned_extents()  (Filipe Manana)

Instead of detecting if there is a previous transaction by comparing the
current transaction's list prev member to the head of the transaction
list (fs_info->trans_list), use the list_is_first() helper, which
contains that logic, and the naming makes sense since a new transaction
is always added to the end of the list fs_info->trans_list with
list_add_tail().

We are also extracting the previous transaction with list_last_entry()
against the transaction, which is correct but confusing, because that
function is usually meant to be used against a pointer to the start of a
list and not a member of a list. It is easier to reason by either
calling list_first_entry() against the list fs_info->trans_list, since
we can never have more than two transactions in the list, or by calling
list_prev_entry() against the transaction. So change that to use the
latter method.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: simplify getting and extracting previous transaction during commit  (Filipe Manana)

Instead of detecting if there is a previous transaction by comparing the
current transaction's list prev member to the head of the transaction
list (fs_info->trans_list), use the list_is_first() helper, which
contains that logic, and the naming makes sense since a new transaction
is always added to the end of the list fs_info->trans_list with
list_add_tail().

And instead of extracting the previous transaction with the more generic
list_entry() helper against the current transaction's list prev member,
use the more specific list_prev_entry() helper, which makes it clear
what we are doing and is shorter.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: move transaction aborts to the error site in add_to_free_space_tree()  (David Sterba)

Transaction aborts should be done next to the place the error happens,
which was not done in add_to_free_space_tree().

Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: move transaction aborts to the error site in remove_from_free_space_tree()  (David Sterba)

Transaction aborts should be done next to the place the error happens,
which was not done in remove_from_free_space_tree().

Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: move transaction aborts to the error site in convert_free_space_to_extents()  (David Sterba)

Transaction aborts should be done next to the place the error happens,
which was not done in convert_free_space_to_extents(). The DEBUG_WARN()
is removed because we get the abort message.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: move transaction aborts to the error site in convert_free_space_to_bitmaps()  (David Sterba)

Transaction aborts should be done next to the place the error happens,
which was not done in convert_free_space_to_bitmaps(). The DEBUG_WARN()
is removed because we get the abort message.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: scrub: move error reporting members to stack  (Qu Wenruo)

Currently the following members of scrub_stripe are only utilized for
error reporting:

- init_error_bitmap
- init_nr_io_errors
- init_nr_csum_errors
- init_nr_meta_errors
- init_nr_meta_gen_errors

There is no need to put all those members into scrub_stripe; they take
24 bytes for each stripe, and we have 128 stripes for each device.

Instead introduce a structure, scrub_error_records, and move all the
above members into it. Allocate such a structure on the stack inside
scrub_stripe_read_repair_worker(); since that function is called from a
workqueue context, we have more than enough stack space for just 24
bytes.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: scrub: update device stats when an error is detected  (Qu Wenruo)

[BUG]
Since the migration to the new scrub_stripe interface, scrub no longer
updates the device stats when hitting an error, no matter if it's a read
or a checksum mismatch error. E.g.:

    BTRFS info (device dm-2): scrub: started on devid 1
    BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
    BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
    BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
    BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
    BTRFS info (device dm-2): scrub: finished on devid 1 with status: 0

Note there is no line showing the device stats error update.

[CAUSE]
In the migration to the new scrub_stripe interface, we no longer call
btrfs_dev_stat_inc_and_print().

[FIX]
- Introduce a new bitmap for metadata generation errors

  * A new bitmap, @meta_gen_error_bitmap, is introduced to record which
    blocks have metadata generation mismatch errors.

  * A new counter for that bitmap, @init_nr_meta_gen_errors, is also
    introduced to store the number of generation mismatch errors found
    during the initial read. This is for the error reporting at
    scrub_stripe_report_errors().

  * A new dedicated error message for unrepaired generation mismatches.

  * Update @meta_gen_error_bitmap if a transid mismatch is hit.

- Add btrfs_dev_stat_inc_and_print() calls to the following call sites:

  * scrub_stripe_report_errors()

  * scrub_write_endio()
    This is only for the write errors.

This means there is a minor behavior change in the timing of the device
stats error messages: since we concentrate the error messages at
scrub_stripe_report_errors(), the device stats error messages will all
show up in one go, after the detailed scrub error messages:

    BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
    BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
    BTRFS error (device dm-2): unable to fixup (regular) error at logical 13631488 on dev /dev/mapper/test-scratch1 physical 13631488
    BTRFS warning (device dm-2): checksum error at logical 13631488 on dev /dev/mapper/test-scratch1, physical 13631488, root 5, inode 257, offset 0, length 4096, links 1 (path: file)
    BTRFS error (device dm-2): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
    BTRFS error (device dm-2): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0

Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: add support for reclaiming from sub-space space_info  (Naohiro Aota)

Modify btrfs_async_{data,metadata}_reclaim() to run the reclaim process
on the sub-spaces as well.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: add block reserve for treelog  (Naohiro Aota)

We need a dedicated block_rsv for the tree-log, because a block_rsv
backs tree node allocations in btrfs_alloc_tree_block(). Currently, the
tree-log tree uses fs_info->empty_block_rsv, which is shared across
trees and points to the normal metadata space_info. Instead, add a
dedicated block_rsv that can use the dedicated sub-space_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: use proper data space_info for zoned mode  (Naohiro Aota)

Now that we have a data sub-space for the zoned mode, tweak some
space_info functions to use the proper space_info for a file.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: tweak extent/chunk allocation for space_info sub-space  (Naohiro Aota)

Make the extent allocator and the chunk allocator aware of the
sub-space. They now use the BTRFS_SUB_GROUP_DATA_RELOC sub-space for the
data relocation block group, and the BTRFS_SUB_GROUP_TREELOG sub-space
for the metadata tree-log block group. The allocator also needs to check
that the space_info is the right one when a block group candidate is
given, and a new block group should now belong to the specified
space_info.

Now that block_group->space_info is always set before
btrfs_add_bg_to_space_info(), we no longer need to "find" the
space_info, so rename the variable to reflect that as well.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: introduce tree-log sub-space_info  (Naohiro Aota)

Introduce the tree-log sub-space_info, a sub-space of the metadata
space_info dedicated to tree-log node allocation.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: introduce btrfs_space_info sub-group  (Naohiro Aota)

Current code assumes we have only one space_info for each block group
type (DATA, METADATA, and SYSTEM). We sometimes need multiple space
infos to manage special block groups.

One example is handling the data relocation block group for the zoned
mode. That block group is dedicated to writing relocated data, and we
cannot allocate any regular extent from it, which is implemented in the
zoned extent allocator. This block group still belongs to the normal
data space_info, so when all the normal data block groups are full and
there is some free space in the dedicated block group, the space_info
appears to have some free space while it cannot allocate normal extents
anymore. That results in a strange ENOSPC error. We need a space_info
for the relocation data block group to represent the situation properly.

Add basic infrastructure for having a "sub-group" of a space_info:
creation and removal. A sub-group space_info belongs to one of the
primary space_infos and has the same flags as its parent.

This commit first introduces the relocation data sub-space_info, and the
next commit will introduce the tree-log sub-space_info. In the future,
it could be useful to implement tiered storage for btrfs, e.g. by
implementing a sub-group space_info for block groups residing on fast
storage.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
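As a rough illustration of the shape described above, a sub-group space_info can be modeled as sharing its parent's flags and being reachable from the parent. The field names below are guesses for illustration only, not the actual btrfs definitions; only the two sub-group names come from the commit messages:

    #include <stdio.h>

    /* Illustrative model only; the real struct btrfs_space_info differs. */
    enum space_info_sub_group {
        SUB_GROUP_DATA_RELOC,   /* mirrors BTRFS_SUB_GROUP_DATA_RELOC */
        SUB_GROUP_TREELOG,      /* mirrors BTRFS_SUB_GROUP_TREELOG    */
        NR_SUB_GROUPS,
    };

    struct space_info {
        unsigned long long flags;   /* DATA / METADATA / SYSTEM        */
        struct space_info *parent;  /* NULL for a primary space_info   */
        struct space_info *sub_group[NR_SUB_GROUPS];
    };

    int main(void)
    {
        struct space_info data = { .flags = 1 /* DATA */ };
        /* A sub-group inherits its parent's flags. */
        struct space_info reloc = { .flags = data.flags, .parent = &data };

        data.sub_group[SUB_GROUP_DATA_RELOC] = &reloc;
        printf("sub-group shares parent flags: %d\n",
               data.sub_group[SUB_GROUP_DATA_RELOC]->flags == data.flags);
        return 0;
    }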
2025-05-15  btrfs: add space_info parameter for block group creation  (Naohiro Aota)

Add a struct btrfs_space_info parameter to btrfs_make_block_group(), its
related functions and the related struct. The passed space_info will
host the new block group.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: add space_info argument to btrfs_chunk_alloc()  (Naohiro Aota)

Take a btrfs_space_info argument in btrfs_chunk_alloc(). The new block
group will belong to that space_info.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: factor out check_removing_space_info() from btrfs_free_block_groups()  (Naohiro Aota)

Factor out check_removing_space_info() from btrfs_free_block_groups().
It sanity checks a to-be-removed space_info. There is no functional
change.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: factor out do_async_reclaim_{data,metadata}_space()  (Naohiro Aota)

Factor out the main part of btrfs_async_reclaim_data_space() into
do_async_reclaim_data_space(), so it can take the data space_info it is
working on as a parameter. Do the same for metadata. There is no
functional change.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: factor out init_space_info() from create_space_info()  (Naohiro Aota)

Factor out the initialization of the space_info struct, which is used in
a later patch. There is no functional change.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: pass struct btrfs_inode to btrfs_free_reserved_data_space_noquota()  (Naohiro Aota)

As with the last patch, pass struct btrfs_inode to the function and let
it distinguish which data space it is working on in a later patch. There
is no functional change with this commit.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: pass btrfs_space_info to btrfs_reserve_data_bytes()  (Naohiro Aota)

Pass struct btrfs_space_info to btrfs_reserve_data_bytes() to allow
reserving the data from multiple data space_info candidates. This is a
preparation for the following commits and there is no functional change.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: make extent unpinning more efficient when committing transaction  (Filipe Manana)

At btrfs_finish_extent_commit() we have a loop that keeps finding an
extent range to unpin in the transaction's pinned_extents io tree,
caches the extent state and then passes that cached extent state to
btrfs_clear_extent_dirty(), which will free that extent state since we
clear the only bit it can have set. So on each loop iteration we do a
full io tree search, and the cached state is used only to avoid having a
tree search done by btrfs_clear_extent_dirty().

During the lifetime of a transaction we can pin many thousands of
extents, resulting in a large and deep rb tree that backs the io tree.
For example, for the following fs_mark run on a 12-core box:

    $ cat test.sh
    #!/bin/bash

    DEV=/dev/nullb0
    MNT=/mnt/nullb0
    FILES=100000
    THREADS=$(nproc --all)

    echo "performance" | \
        tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

    mkfs.btrfs -f $DEV
    mount $DEV $MNT

    OPTS="-S 0 -L 8 -n $FILES -s 0 -t $THREADS -k"
    for ((i = 1; i <= $THREADS; i++)); do
        OPTS="$OPTS -d $MNT/d$i"
    done

    fs_mark $OPTS

    umount $MNT

a histogram of the number of ranges (elements) in the pinned extents io
tree of a transaction was the following:

    Count: 76
    Range: 5440.000 - 51088.000; Mean: 27354.368; Median: 28312.000; Stddev: 9800.767
    Percentiles: 90th: 40486.000; 95th: 43322.000; 99th: 51088.000
       5440.000 -  6805.809:  1 ###
       6805.809 - 10652.034:  1 ###
      10652.034 - 13326.178:  3 ########
      13326.178 - 16671.590:  8 ######################
      16671.590 - 20856.773:  7 ####################
      20856.773 - 26092.528: 13 ####################################
      26092.528 - 32642.571: 19 #####################################################
      32642.571 - 40836.818: 17 ###############################################
      40836.818 - 51088.000:  7 ####################

We can improve on this by grabbing the next state before calling
btrfs_clear_extent_dirty(), avoiding a full tree search on the next
iteration, which always has O(log n) complexity, while grabbing the next
element (the rb_next() rbtree operation) is in the worst case O(log n)
too, but very often much less than that, making it more efficient.

Here follow histograms of the execution times, in nanoseconds, of
btrfs_finish_extent_commit() before and after applying this patch and
all the other patches in the same patchset.

Before patchset:

    Count: 32
    Range: 3925691.000 - 269990635.000; Mean: 133814526.906; Median: 122758052.000; Stddev: 65776550.375
    Percentiles: 90th: 228672087.000; 95th: 265187000.000; 99th: 269990635.000
        3925691.000 -   5993208.660:  1 ####
        5993208.660 -  75878537.656:  4 ##################
       75878537.656 - 115840974.514: 12 #####################################################
      115840974.514 - 176850157.761:  6 ###########################
      176850157.761 - 269990635.000:  9 ########################################

After patchset:

    Count: 32
    Range: 1849393.000 - 231491064.000; Mean: 126978584.625; Median: 123732897.000; Stddev: 58007821.806
    Percentiles: 90th: 203055491.000; 95th: 219952699.000; 99th: 231491064.000
        1849393.000 -   2997642.092:  1 ####
        2997642.092 -  88111637.071:  9 #####################################
       88111637.071 - 142818264.414:  9 #####################################
      142818264.414 - 231491064.000: 13 #####################################################

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
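The core of the iteration change is the pattern in this sketch: fetch the successor before destroying the current element, instead of restarting the search from the root every iteration. A toy linked list stands in for the io tree's rb tree; all names are illustrative:

    #include <stdio.h>
    #include <stdlib.h>

    struct state { unsigned long start, end; struct state *next; };

    static void unpin_all(struct state *head)
    {
        struct state *cur = head;

        while (cur) {
            /* next_state() equivalent: grabbed before freeing cur. */
            struct state *next = cur->next;

            printf("unpin [%lu, %lu]\n", cur->start, cur->end);
            free(cur);    /* clearing the only bit frees the state   */
            cur = next;   /* no full tree search to find this one    */
        }
    }

    int main(void)
    {
        struct state *head = NULL, **tail = &head;

        for (unsigned long i = 0; i < 3; i++) {
            struct state *s = malloc(sizeof(*s));

            s->start = i * 4096;
            s->end = s->start + 4095;
            s->next = NULL;
            *tail = s;
            tail = &s->next;
        }
        unpin_all(head);
        return 0;
    }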
2025-05-15  btrfs: remove variable to track trimmed bytes at btrfs_finish_extent_commit()  (Filipe Manana)

We don't need to keep track of discarded (trimmed) bytes at
btrfs_finish_extent_commit(), but we are declaring a local variable for
that and passing a reference to it to the btrfs_discard_extent() calls
when we are processing deleted block groups. So instead pass NULL to
btrfs_discard_extent() and remove that variable.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: don't BUG_ON() when unpinning extents during transaction commit  (Filipe Manana)

In the final phase of a transaction commit, when unpinning extents at
btrfs_finish_extent_commit(), there's no need to BUG_ON() if we fail to
unpin an extent range. All that can happen is that we fail to return the
extent range to the in-memory free space cache, meaning no future space
allocations can reuse that extent range while the fs is mounted.

So instead return the error to the caller and make it abort the
transaction, so that the error is noticed and we prevent mysteriously
leaking space. We keep track of the first error we get while unpinning
an extent range and keep trying to unpin all the following extent
ranges, so that we attempt to do all discards. The transaction abort
will deal with all resource cleanups.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: remove unnecessary NULL checks before freeing extent state  (Filipe Manana)

When manipulating extent bits for an io tree range, we have this
pattern:

    if (prealloc)
        btrfs_free_extent_state(prealloc);

but this is not needed nowadays, since btrfs_free_extent_state() ignores
a NULL pointer argument, following the common pattern of kernel and
btrfs freeing functions, as well as libc and other user space libraries.
So remove the NULL checks, reducing source code and object size.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: avoid re-searching tree when setting bits in an extent range  (Filipe Manana)

When setting bits for an extent range (set_extent_bit()), if the current
extent state record starts after the target range, we always jump to the
'search_again' label, which causes a full tree search for the next state
if the current state ends before the target range. Unless we need to
reschedule, we can just grab the next state and process it, avoiding a
full tree search even if that next state is not contiguous, as we'll
allocate and insert a new prealloc state if needed.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: avoid repeated extent state processing when setting extent bits  (Filipe Manana)

When setting bits for an extent range, if we find an extent state whose
start offset is greater than the current start offset, we insert a new
extent state to cover the gap, with its end offset computed and stored
in the @this_end local variable, and after the insertion we update the
current start offset to @this_end + 1.

However, if the insert_state() call resulted in an extent state merge,
then the end offset of the merged extent may be greater than @this_end,
and if that's the case, since we jump to the 'search_again' label, we'll
do a full tree search that will leave us in the same extent state. This
is harmless but wastes time by doing a pointless tree search and extent
state processing.

So improve on this by updating the current start offset to the end
offset of the inserted state plus 1. This also removes the use of the
@this_end variable and directly sets the value in the prealloc extent
state, to avoid any confusion and misuse in the future.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: simplify last record detection at set_extent_bit()  (Filipe Manana)

There's no need to compare the current extent state's end offset to
(u64)-1 to check if we have the last possible record, nor to check,
after updating the start offset to the end offset of the current record
plus one, whether we are still inside the target range. Instead we can
simplify and exit if the current extent state ends at or after the
target range, and then remove the check for the (u64)-1 as well as the
check to see if the updated start offset (to last_end + 1) is still
inside the target range.

Besides the simplification, this also avoids searching for the next
extent state record (through next_state()) when the current extent state
record ends at the same offset as our target range, which is pointless
and only wastes time iterating through the rb tree.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: exit after state split error at set_extent_bit()  (Filipe Manana)

If split_state() returned an error we call extent_io_tree_panic(), which
will trigger a BUG() call. However if CONFIG_BUG is disabled, which is
an uncommon and exotic scenario, then we fall through and hit a
use-after-free when calling set_state_bits(), since the extent state
record which the local variable 'prealloc' points to was freed by
split_state().

So jump to the label 'out' after calling extent_io_tree_panic() and set
the 'prealloc' pointer to NULL, since split_state() has already freed it
when it hit an error.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: exit after state insertion failure at set_extent_bit()  (Filipe Manana)

If insert_state() failed, it returns an error pointer and we call
extent_io_tree_panic(), which will trigger a BUG() call. However if
CONFIG_BUG is disabled, which is an uncommon and exotic scenario, then
we fall through and call cache_state(), which will dereference the error
pointer, resulting in an invalid memory access.

So jump to the 'out' label after calling extent_io_tree_panic(); it also
makes the code clearer, besides dealing with the exotic scenario where
CONFIG_BUG is disabled.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
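The two "exit after ... at set_extent_bit()" fixes above reduce to the control-flow pattern sketched below, using stand-in names for extent_io_tree_panic() and cache_state(); this is an illustration of the hazard, not the btrfs code:

    #include <stdio.h>

    /* With CONFIG_BUG=n, the kernel's BUG() can be a no-op and execution
     * continues; this stand-in returns for the same reason. */
    static void tree_panic(const char *msg)
    {
        fprintf(stderr, "panic: %s\n", msg);
    }

    static int set_bits(int fail)
    {
        int value = 42;
        int *state = fail ? NULL : &value;  /* NULL models the bad pointer */

        if (!state) {
            tree_panic("insert failed");
            goto out;  /* the fix: never fall through to the dereference */
        }

        printf("cached state = %d\n", *state);  /* cache_state() stand-in */
    out:
        return state ? 0 : -1;
    }

    int main(void)
    {
        set_bits(0);                         /* success path              */
        return set_bits(1) == -1 ? 0 : 1;    /* failure path exits early  */
    }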
2025-05-15  btrfs: simplify last record detection at btrfs_convert_extent_bit()  (Filipe Manana)

There's no need to compare the current extent state's end offset to
(u64)-1 to check if we have the last possible record, nor to check,
after updating the start offset to the end offset of the current record
plus one, whether we are still inside the target range. Instead we can
simplify and exit if the current extent state ends at or after the
target range, and then remove the check for the (u64)-1 as well as the
check to see if the updated start offset (to last_end + 1) is still
inside the target range.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: avoid re-searching tree when converting bits in an extent range  (Filipe Manana)

When converting bits for an extent range (btrfs_convert_extent_bit()),
if the current extent state record starts after the target range, we
always jump to the 'search_again' label, which causes a full tree search
for the next state if the current state ends before the target range.
Unless we need to reschedule, we can just grab the next state and
process it, avoiding a full tree search even if that next state is not
contiguous, as we'll allocate and insert a new prealloc state if needed.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: avoid repeated extent state processing when converting extent bits  (Filipe Manana)

When converting bits for an extent range, if we find an extent state
whose start offset is greater than the current start offset, we insert a
new extent state to cover the gap, with its end offset computed and
stored in the @this_end local variable, and after the insertion we
update the current start offset to @this_end + 1.

However, if the insert_state() call resulted in an extent state merge,
then the end offset of the merged extent may be greater than @this_end,
and if that's the case, since we jump to the 'search_again' label, we'll
do a full tree search that will leave us in the same extent state. This
is harmless but wastes time by doing a pointless tree search and extent
state processing.

So improve on this by updating the current start offset to the end
offset of the inserted state plus 1. This also removes the use of the
@this_end variable and directly sets the value in the prealloc extent
state, to avoid any confusion and misuse in the future.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: avoid unnecessary next node searches when clearing bits from extent range  (Filipe Manana)

When clearing bits for a range in an io tree, at clear_state_bit(), we
always go search for the next node (through next_state() -> rb_next())
and return it. However if the current extent state record ends at or
after the target range passed to btrfs_clear_extent_bit_changeset() or
btrfs_convert_extent_bit(), we are just wasting time finding that next
node, since we won't use it in those functions.

Improve on this by skipping the next node search if the current node
ends at or after the target range.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: exit after state insertion failure at btrfs_convert_extent_bit()  (Filipe Manana)

If insert_state() failed, it returns an error pointer and we call
extent_io_tree_panic(), which will trigger a BUG() call. However if
CONFIG_BUG is disabled, which is an uncommon and exotic scenario, then
we fall through and call cache_state(), which will dereference the error
pointer, resulting in an invalid memory access.

So jump to the 'out' label after calling extent_io_tree_panic(); it also
makes the code clearer, besides dealing with the exotic scenario where
CONFIG_BUG is disabled.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: exit after state split error at btrfs_convert_extent_bit()  (Filipe Manana)

If split_state() returned an error we call extent_io_tree_panic(), which
will trigger a BUG() call. However if CONFIG_BUG is disabled, which is
an uncommon and exotic scenario, then we fall through and hit a
use-after-free when calling set_state_bits(), since the extent state
record which the local variable 'prealloc' points to was freed by
split_state().

So jump to the label 'out' after calling extent_io_tree_panic() and set
the 'prealloc' pointer to NULL, since split_state() has already freed it
when it hit an error.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: remove duplicate error check at btrfs_convert_extent_bit()  (Filipe Manana)

There's no need to check if split_state() returned an error twice;
instead unify into a single if statement after setting 'prealloc' to
NULL, because on error split_state() frees the 'prealloc' extent state
record.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: simplify last record detection at btrfs_clear_extent_bit_changeset()  (Filipe Manana)

Instead of checking for an end offset of (u64)-1 (U64_MAX) for the
current extent state's end, and then checking after updating the current
start offset if it's now beyond the range's end offset, we can simply
stop if the current extent state's end is greater than or equal to our
range's end offset. This helps remove one comparison under the 'next'
label and allows removing the if statement that checks if the start
offset is greater than the end offset under the 'search_again' label.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>