path: root/fs/btrfs/raid-stripe-tree.c
Age | Commit message | Author
2025-01-14 | btrfs: don't use btrfs_set_item_key_safe on RAID stripe-extents | Johannes Thumshirn

Don't use btrfs_set_item_key_safe() to modify the keys in the RAID stripe-tree, as this can lead to corruption of the tree, which is caught by the checks in btrfs_set_item_key_safe():

  BTRFS info (device nvme1n1): leaf 49168384 gen 15 total ptrs 194 free space 8329 owner 12
  BTRFS info (device nvme1n1): refs 2 lock_owner 1030 current 1030
   [ snip ]
   item 105 key (354549760 230 20480) itemoff 14587 itemsize 16
     stride 0 devid 5 physical 67502080
   item 106 key (354631680 230 4096) itemoff 14571 itemsize 16
     stride 0 devid 1 physical 88559616
   item 107 key (354631680 230 32768) itemoff 14555 itemsize 16
     stride 0 devid 1 physical 88555520
   item 108 key (354717696 230 28672) itemoff 14539 itemsize 16
     stride 0 devid 2 physical 67604480
   [ snip ]
  BTRFS critical (device nvme1n1): slot 106 key (354631680 230 32768) new key (354635776 230 4096)
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.c:2602!
  Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
  CPU: 1 UID: 0 PID: 1055 Comm: fsstress Not tainted 6.13.0-rc1+ #1464
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
  RIP: 0010:btrfs_set_item_key_safe+0xf7/0x270
  Code: <snip>
  RSP: 0018:ffffc90001337ab0 EFLAGS: 00010287
  RAX: 0000000000000000 RBX: ffff8881115fd000 RCX: 0000000000000000
  RDX: 0000000000000001 RSI: 0000000000000001 RDI: 00000000ffffffff
  RBP: ffff888110ed6f50 R08: 00000000ffffefff R09: ffffffff8244c500
  R10: 00000000ffffefff R11: 00000000ffffffff R12: ffff888100586000
  R13: 00000000000000c9 R14: ffffc90001337b1f R15: ffff888110f23b58
  FS: 00007f7d75c72740(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fa811652c60 CR3: 0000000111398001 CR4: 0000000000370eb0
  Call Trace:
   <TASK>
   ? __die_body.cold+0x14/0x1a
   ? die+0x2e/0x50
   ? do_trap+0xca/0x110
   ? do_error_trap+0x65/0x80
   ? btrfs_set_item_key_safe+0xf7/0x270
   ? exc_invalid_op+0x50/0x70
   ? btrfs_set_item_key_safe+0xf7/0x270
   ? asm_exc_invalid_op+0x1a/0x20
   ? btrfs_set_item_key_safe+0xf7/0x270
   btrfs_partially_delete_raid_extent+0xc4/0xe0
   btrfs_delete_raid_extent+0x227/0x240
   __btrfs_free_extent.isra.0+0x57f/0x9c0
   ? exc_coproc_segment_overrun+0x40/0x40
   __btrfs_run_delayed_refs+0x2fa/0xe80
   btrfs_run_delayed_refs+0x81/0xe0
   btrfs_commit_transaction+0x2dd/0xbe0
   ? preempt_count_add+0x52/0xb0
   btrfs_sync_file+0x375/0x4c0
   do_fsync+0x39/0x70
   __x64_sys_fsync+0x13/0x20
   do_syscall_64+0x54/0x110
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f7d7550ef90
  Code: <snip>
  RSP: 002b:00007ffd70237248 EFLAGS: 00000202 ORIG_RAX: 000000000000004a
  RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f7d7550ef90
  RDX: 000000000000013a RSI: 000000000040eb28 RDI: 0000000000000004
  RBP: 000000000000001b R08: 0000000000000078 R09: 00007ffd7023725c
  R10: 00007f7d75400390 R11: 0000000000000202 R12: 028f5c28f5c28f5c
  R13: 8f5c28f5c28f5c29 R14: 000000000040b520 R15: 00007f7d75c726c8
   </TASK>

While the root cause of the tree order corruption isn't clear, using btrfs_duplicate_item() to copy the item and then adjusting both the key and the per-device physical addresses is a safe way to counter this problem.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
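The kind of ordering check that fires in the trace above can be modelled in a few lines of userspace Python. This is an illustrative sketch, not kernel code: keys are plain `(objectid, type, offset)` tuples, `key_change_keeps_order` is a hypothetical helper, and the example leaf is made up. It assumes the check compares the new key against both neighbouring slots, which is consistent with the "slot ... key ... new key" message but is not copied from the kernel source.

```python
RAID_STRIDE_KEY = 230  # key type shown in the leaf dump above


def key_change_keeps_order(keys, slot, new_key):
    """Return True if overwriting keys[slot] with new_key keeps the leaf sorted.

    keys is a sorted list of (objectid, type, offset) tuples. Python compares
    tuples element by element, which gives a lexicographic key order.
    """
    if slot > 0 and keys[slot - 1] >= new_key:
        return False  # new key would not sort after its left neighbour
    if slot + 1 < len(keys) and keys[slot + 1] <= new_key:
        return False  # new key would not sort before its right neighbour
    return True


# Made-up leaf with three stripe extents.
leaf = [
    (4096, RAID_STRIDE_KEY, 4096),
    (8192, RAID_STRIDE_KEY, 8192),
    (16384, RAID_STRIDE_KEY, 4096),
]

# Trimming the front of the middle extent keeps it between its neighbours.
ok = key_change_keeps_order(leaf, 1, (12288, RAID_STRIDE_KEY, 4096))

# A key that moves past the right neighbour breaks the sort order.
bad = key_change_keeps_order(leaf, 1, (20480, RAID_STRIDE_KEY, 4096))
```

An in-place key change is only valid while the new key stays strictly between its neighbours. Inserting a modified copy and deleting the old item avoids relying on that precondition.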
2025-01-14 | btrfs: implement hole punching for RAID stripe extents | Johannes Thumshirn

If the stripe extent we want to delete starts before the range we want to delete and ends after the range we want to delete, we're punching a hole in the stripe extent:

  |--- RAID Stripe Extent ---|
  | keep |--- drop ---| keep |

This means we need to a) truncate the existing item and b) create a second item for the remaining range.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
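The two steps can be sketched as simple range arithmetic. The Python model below is a sketch under stated assumptions, not the kernel implementation: `StripeExtent` and `punch_hole` are hypothetical names, and each stride's physical address is assumed to advance one-to-one with the logical address (as it would when every device holds a full copy).

```python
from dataclasses import dataclass


@dataclass
class StripeExtent:
    start: int     # logical start address
    length: int    # length in bytes
    strides: list  # [(devid, physical), ...]


def punch_hole(extent, hole_start, hole_len):
    """Split extent around [hole_start, hole_start + hole_len)."""
    hole_end = hole_start + hole_len
    end = extent.start + extent.length
    if not (extent.start < hole_start and hole_end < end):
        raise ValueError("hole must lie strictly inside the extent")

    # a) Truncate the existing item: same start, shorter length.
    front = StripeExtent(extent.start, hole_start - extent.start,
                         list(extent.strides))

    # b) Create a second item for the range after the hole.
    # ASSUMPTION: physical addresses shift by the same amount as the
    # logical start.
    shift = hole_end - extent.start
    tail = StripeExtent(hole_end, end - hole_end,
                        [(devid, phys + shift) for devid, phys in extent.strides])
    return front, tail


# Made-up extent covering logical [1000, 1100) on two devices.
ext = StripeExtent(1000, 100, [(1, 5000), (2, 9000)])
front, tail = punch_hole(ext, 1030, 20)
```

Here `front` covers [1000, 1030) with unchanged physical addresses, and `tail` covers [1050, 1100) with each physical address moved forward by 50.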
2025-01-14 | btrfs: fix deletion of a range spanning parts two RAID stripe extents | Johannes Thumshirn

When a user requests the deletion of a range that spans multiple stripe extents and btrfs_search_slot() returns the second RAID stripe extent, we need to pick the previous item and truncate it. If there's still a range left to delete, move on to the next item.

The following diagram illustrates the operation:

  |--- RAID Stripe Extent ---||--- RAID Stripe Extent ---|
  |--- keep ---|--- drop ---|

While at it, comment the trivial case of a whole item delete as well.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
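The truncate-then-continue step can be illustrated with plain integers. This is a minimal sketch assuming extents are `(start, length)` pairs; `truncate_tail` is a hypothetical helper and the addresses are made up.

```python
def truncate_tail(found_start, found_len, del_start):
    """Keep [found_start, del_start) and report how many bytes were dropped."""
    found_end = found_start + found_len
    if not found_start < del_start < found_end:
        raise ValueError("delete range must start inside the extent")
    kept_len = del_start - found_start
    dropped = found_end - del_start
    return (found_start, kept_len), dropped


# Two adjacent made-up extents: [0, 100) and [100, 200).
first = (0, 100)

# Delete [60, 200): the range starts in the middle of the first extent.
start, length = 60, 140

kept, dropped = truncate_tail(first[0], first[1], start)

# Advance the remaining delete range past the truncated extent.
start += dropped
length -= dropped
```

After the truncation the first extent is `(0, 60)` and the remaining delete range is exactly the second extent, `[100, 200)`, which is then the whole-item delete case.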
2025-01-14 | btrfs: fix tail delete of RAID stripe-extents | Johannes Thumshirn

Fix tail delete of RAID stripe-extents when there is still a range left to delete after the tail delete of the extent.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14 | btrfs: fix front delete range calculation for RAID stripe extents | Johannes Thumshirn

When deleting the front of a RAID stripe-extent, the delete code miscalculates by how much to pad the front of the remaining extent part. Fix the calculation so we always get the sizes we expect.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
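A front delete comes down to one difference that is applied to the key and to every stride. The sketch below shows that arithmetic under assumptions: `delete_front` is a hypothetical helper, the numbers are made up, and physical addresses are assumed to move in step with the logical start. It is not the calculation from the patch itself.

```python
def delete_front(found_start, found_len, strides, del_end):
    """Drop [found_start, del_end) from an extent and return what remains."""
    found_end = found_start + found_len
    if not found_start < del_end < found_end:
        raise ValueError("delete range must end inside the extent")

    # Number of bytes removed from the front of the extent.
    diff = del_end - found_start

    new_start = del_end
    new_len = found_len - diff  # always > 0 because del_end < found_end
    # ASSUMPTION: each stride's physical address shifts by the same amount.
    new_strides = [(devid, phys + diff) for devid, phys in strides]
    return new_start, new_len, new_strides


# Made-up extent [1000, 1100); delete its first 40 bytes.
remaining = delete_front(1000, 100, [(1, 5000), (2, 9000)], 1040)
```

The remaining extent starts at 1040, is 60 bytes long, and both physical addresses have advanced by 40.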
2025-01-14 | btrfs: assert RAID stripe-extent length is always greater than 0 | Johannes Thumshirn

When modifying a RAID stripe-extent, ASSERT() that the length of the new RAID stripe-extent is always greater than 0.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-14 | btrfs: don't try to delete RAID stripe-extents if we don't need to | Johannes Thumshirn

Even if the RAID stripe-tree is not enabled in the filesystem, do_free_extent_accounting() still calls into btrfs_delete_raid_extent().

Check if the extent in question is on a block-group that has a profile which is used by RAID stripe-tree before attempting to delete a stripe extent. Return early if it doesn't, otherwise we're doing an unnecessary search.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
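The early return can be pictured as a cheap flag test done before any tree search. The sketch below is a userspace model with loudly stated assumptions: the flag bit positions follow the btrfs on-disk block-group flags, but `RST_PROFILES` and `needs_stripe_tree_delete` are hypothetical, and both the set of profiles and the data-only restriction are assumptions, not taken from the patch.

```python
# Block-group flag bits (btrfs on-disk format).
BLOCK_GROUP_DATA = 1 << 0
BLOCK_GROUP_METADATA = 1 << 2
BLOCK_GROUP_RAID0 = 1 << 3
BLOCK_GROUP_RAID1 = 1 << 4
BLOCK_GROUP_RAID5 = 1 << 7

# ASSUMPTION: illustrative set of profiles served by the stripe tree.
# The real set is decided by the kernel and may differ.
RST_PROFILES = BLOCK_GROUP_RAID0 | BLOCK_GROUP_RAID1


def needs_stripe_tree_delete(rst_enabled, bg_flags):
    """Return True only if a stripe-tree search is worth doing."""
    if not rst_enabled:
        return False  # feature not enabled on this filesystem
    if not bg_flags & BLOCK_GROUP_DATA:
        return False  # ASSUMPTION: only data block groups use the tree
    return bool(bg_flags & RST_PROFILES)
```

A caller would test this first and return 0 immediately when it is false, so that filesystems without the feature never pay for the search.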
2025-01-13 | btrfs: raid-stripe-tree: remove unnecessary call to btrfs_mark_buffer_dirty() | Filipe Manana

The call to btrfs_mark_buffer_dirty() at update_raid_extent_item() is not necessary, as we have a path set up for writing with btrfs_search_slot() having a 'cow' argument set to 1. This just makes the code more verbose and confusing, adds a little extra overhead, and increases the module's text size, so remove it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13 | btrfs: remove unused variable length in btrfs_insert_one_raid_extent() | Johannes Thumshirn

Remove the variable length in btrfs_insert_one_raid_extent() as it is unused.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 | btrfs: implement partial deletion of RAID stripe extents | Johannes Thumshirn

In our CI system, the RAID stripe tree configuration sometimes fails with the following ASSERT():

  assertion failed: found_start >= start && found_end <= end, in fs/btrfs/raid-stripe-tree.c:64

This assertion triggers because, for the initial design of RAID stripe-tree, I had the "one ordered-extent equals one bio" rule of zoned btrfs in mind.

But for a RAID stripe-tree based system that is hosted on a regular device rather than on a zoned storage device, this rule doesn't apply.

So in case the range we want to delete starts in the middle of the previous item, grab the item and "truncate" its length. That is, clone the item, subtract the deleted portion from the key's offset, delete the old item and insert the new one.

In case the range to delete ends in the middle of an item, we have to adjust both the item's key as well as the stripe extents and then re-insert the modified clone into the tree after deleting the old stripe extent.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 | btrfs: return ENODATA in case RST lookup fails | Johannes Thumshirn

In case a lookup in the RAID stripe-tree fails, return ENODATA instead of ENOENT to better distinguish stripe-tree lookups from other code paths where we return ENOENT.

Suggested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 | btrfs: tests: add selftests for raid-stripe-tree | Johannes Thumshirn

Add first stash of very basic self tests for the RAID stripe-tree. More test cases will follow exercising the tree.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 | btrfs: change RST lookup error message level to debug | Johannes Thumshirn

Now that RAID stripe-tree lookup failures are not treated as a fatal issue any more, change the RAID stripe-tree lookup error message to debug level.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 | btrfs: rename btrfs_io_stripe::is_scrub to rst_search_commit_root | Johannes Thumshirn

Rename 'btrfs_io_stripe::is_scrub' to 'rst_search_commit_root'. While 'is_scrub' describes the state of the io_stripe (it is a stripe submitted by scrub), it does not describe the purpose, namely looking at the commit root when searching RAID stripe-tree entries.

Renaming the field to rst_search_commit_root describes this purpose.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 | btrfs: don't dump stripe-tree on lookup error | Johannes Thumshirn

This just creates unnecessary noise and doesn't provide any insights into debugging RAID stripe-tree related issues.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 | btrfs: update stripe_extent delete loop assumptions | Johannes Thumshirn

btrfs_delete_raid_extent() was written under the assumption that its call chain always passes a (start, length) tuple that matches a single extent. However, btrfs_delete_raid_extent() is called by do_free_extent_accounting(), which in turn is called by __btrfs_free_extent().

This call chain passes in a start address and a length that can possibly match multiple on-disk extents.

To make this possible, we have to adjust the start and length of each btree node lookup, to not delete beyond the requested range.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
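The adjusted loop can be modelled as follows. This is a simplified sketch, not the kernel loop: the tree is a plain dict from logical start to length, `delete_raid_extents` is a hypothetical name, and only whole-extent deletions are handled (partial overlaps simply stop the loop here).

```python
def delete_raid_extents(tree, start, length):
    """Delete every extent fully inside [start, start + length).

    tree maps logical start -> extent length. Returns the deleted extents.
    """
    end = start + length
    deleted = []
    while length > 0:
        found_len = tree.get(start)
        if found_len is None:
            break  # no stripe extent at this logical address
        if start + found_len > end:
            break  # deleting it would go beyond the requested range
        del tree[start]
        deleted.append((start, found_len))
        # Adjust the next lookup so the loop never leaves the range.
        start += found_len
        length -= found_len
    return deleted


# Made-up tree with three back-to-back extents.
tree = {0: 4096, 4096: 8192, 12288: 4096}

# One request that covers the first two on-disk extents.
deleted = delete_raid_extents(tree, 0, 12288)
```

The request removes `(0, 4096)` and `(4096, 8192)` and leaves the extent at 12288 untouched.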
2024-09-10 | btrfs: update stripe extents for existing logical addresses | Johannes Thumshirn

Update a stripe extent in case of an already existing logical address, but with different physical addresses and/or device id, instead of bailing out with EEXIST.

This can happen, e.g., in case of a device replace operation, where data extents get rewritten to a new disk.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
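The insert-then-update fallback can be sketched in userspace Python. The names `insert_stripe_extent` and `insert_or_update_stripe_extent` are hypothetical, the tree is a plain dict, and negative errno return values are used only to mirror kernel conventions.

```python
import errno


def insert_stripe_extent(tree, key, strides):
    """Plain insert: refuse to overwrite, mimicking an -EEXIST return."""
    if key in tree:
        return -errno.EEXIST
    tree[key] = list(strides)
    return 0


def insert_or_update_stripe_extent(tree, key, strides):
    """Insert a stripe extent, or overwrite the strides if the key exists."""
    ret = insert_stripe_extent(tree, key, strides)
    if ret == -errno.EEXIST:
        # Same logical address, new device/physical mapping: update in place.
        tree[key] = list(strides)
        ret = 0
    return ret


tree = {}
key = (4096, 230, 4096)  # made-up (logical, key type, length)

first = insert_or_update_stripe_extent(tree, key, [(1, 1000)])
# E.g. after a device replace the data lives on another device.
second = insert_or_update_stripe_extent(tree, key, [(3, 2000)])
```

Both calls succeed, and the second one replaces the stored mapping rather than failing.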
2024-07-11 | btrfs: remove raid-stripe-tree encoding field from stripe_extent | Johannes Thumshirn

Remove the encoding field from 'struct btrfs_stripe_extent'. It was originally intended to encode the RAID type as well as whether we're a data or a parity stripe.

But the RAID type can be inferred from the block-group, and the data vs. parity differentiation is easier to do by adding a new key type for parity stripes in the RAID stripe tree.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04 | btrfs: remove unused included headers | David Sterba

With help of neovim, LSP and clangd we can identify header files that are not actually needed to be included in the .c files. This is focused only on removal (with minor fixups), further cleanups are possible but will require doing the header files properly with forward declarations, minimized includes and include-what-you-use care.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-03 | btrfs: directly return 0 on no error code in btrfs_insert_raid_extent() | Dan Carpenter

It's more obvious to return a literal zero instead of "return ret;". Plus Smatch complains that ret could be uninitialized if the ordered_extent->bioc_list list is empty and this silences that warning.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 | btrfs: tracepoints: add events for raid stripe tree | Johannes Thumshirn

Add trace events for raid-stripe-tree operations.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 | btrfs: scrub: implement raid stripe tree support | Johannes Thumshirn

A filesystem that uses the raid stripe tree for logical to physical address translation can't use the regular scrub path, that reads all stripes and then checks if a sector is unused afterwards.

When using the raid stripe tree, this will result in lookup errors, as the stripe tree doesn't know the requested logical addresses.

In case we're scrubbing a filesystem which uses the RAID stripe tree for multi-device logical to physical address translation, perform an extra block mapping step to get the real on-disk stripe length from the stripe tree when scrubbing the sectors.

This prevents a double completion of the btrfs_bio caused by splitting the underlying bio and ultimately a use-after-free.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 | btrfs: lookup physical address from stripe extent | Johannes Thumshirn

Look up the physical address from the raid stripe tree when a read is attempted on a RAID volume formatted with the raid stripe tree.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 | btrfs: delete stripe extent on extent deletion | Johannes Thumshirn

As each stripe extent is tied to an extent item, delete the stripe extent once the corresponding extent item is deleted.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 | btrfs: add support for inserting raid stripe extents | Johannes Thumshirn

Add support for inserting stripe extents into the raid stripe tree on completion of every write that needs an extra logical-to-physical translation when using RAID.

Inserting the stripe extents happens after the data I/O has completed. This is done to a) support zone-append and b) rule out the possibility of a RAID-write-hole.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>