Age | Commit message (Collapse) | Author |
|
There's no need to keep a local variable to extract the first member of
the list and then do a list_entry() call, we can use list_first_entry()
instead, removing the need for the temporary variable and extracting the
first element in a single step.
Also, there's no need to do a list_del_init() followed by list_add_tail(),
instead we can use list_move_tail(). We are in transaction commit critical
section where we don't need to worry about concurrency and that's why we
don't take any locks and can use list_move_tail() (we do assert early at
commit_cowonly_roots() that we are in the critical section, that the
transaction's state is TRANS_STATE_COMMIT_DOING).
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of detecting if there is a previous transaction by comparing the
current transaction's list prev member to the head of the transaction
list (fs_info->trans_list), use the list_is_first() helper which contains
that logic and the naming makes sense since a new transaction is always
added to the end of the list fs_info->trans_list with list_add_tail().
And instead of extracting the previous transaction with the more generic
list_entry() helper against the current transaction's list prev member,
use the more specific list_prev_entry() helper, which makes it clear what
we are doing and is shorter.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Take a btrfs_space_info argument in btrfs_chunk_alloc(). New block group
will belong to that space_info.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In the final phase of a transaction commit, when unpinning extents at
btrfs_finish_extent_commit(), there's no need to BUG_ON() if we fail to
unpin an extent range. All that can happen is that we fail to return the
extent range to the in-memory free space cache, meaning no future space
allocations can reuse that extent range while the fs is mounted.
So instead return the error to the caller and make it abort the
transaction, so that the error is noticed and prevent misteriously leaking
space. We keep track of the first error we get while unpinning an extent
range and keep trying to unpin all the following extent ranges, so that
we attempt to do all discards. The transaction abort will deal with all
resource cleanups.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we have this ugly back and forth with the btree writeback
where we find the folio, find the eb associated with that folio, and
then attempt to writeback. This results in two different paths for
subpage ebs and >= page size ebs.
Clean this up by adding our own infrastructure around looking up tagged
ebs and writing the ebs out directly. This allows us to unify the
subpage and >= pagesize IO paths, resulting in a much cleaner writeback
path for extent buffers.
I ran this through fsperf on a VM with 8 CPUs and 16GiB of RAM. I used
smallfiles100k, but reduced the files to 1k to make it run faster, the
results are as follows, with the statistically significant improvements
marked with *, there were no regressions. fsperf was run with -n 10 for
both runs, so the baseline is the average 10 runs and the test is the
average of 10 runs.
smallfiles100k results
metric baseline current stdev diff
================================================================================
avg_commit_ms 68.58 58.44 3.35 -14.79% *
commits 270.60 254.70 16.24 -5.88%
dev_read_iops 48 48 0 0.00%
dev_read_kbytes 1044 1044 0 0.00%
dev_write_iops 866117.90 850028.10 14292.20 -1.86%
dev_write_kbytes 10939976.40 10605701.20 351330.32 -3.06%
elapsed 49.30 33 1.64 -33.06% *
end_state_mount_ns 41251498.80 35773220.70 2531205.32 -13.28% *
end_state_umount_ns 1.90e+09 1.50e+09 14186226.85 -21.38% *
max_commit_ms 139 111.60 9.72 -19.71% *
sys_cpu 4.90 3.86 0.88 -21.29%
write_bw_bytes 42935768.20 64318451.10 1609415.05 49.80% *
write_clat_ns_mean 366431.69 243202.60 14161.98 -33.63% *
write_clat_ns_p50 49203.20 20992 264.40 -57.34% *
write_clat_ns_p99 827392 653721.60 65904.74 -20.99% *
write_io_kbytes 2035940 2035940 0 0.00%
write_iops 10482.37 15702.75 392.92 49.80% *
write_lat_ns_max 1.01e+08 90516129 3910102.06 -10.29% *
write_lat_ns_mean 366556.19 243308.48 14154.51 -33.62% *
As you can see we get about a 33% decrease runtime, with a 50%
throughput increase, which is pretty significant.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Old code has a lot of int for bool return values, bool is recommended
and done in new code. Convert the trivial cases that do simple 0/false
and 1/true. Functions comment are updated if needed.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Rename the remaning exported functions that don't have a 'btrfs_' prefix.
By convention exported functions should have such prefix to make it clear
they are btrfs specific and to avoid collisions with functions from
elsewhere in the kernel.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is an exported function so it should have a 'btrfs_' prefix by
convention, to make it clear it's btrfs specific and to avoid collisions
with functions from elsewhere in the kernel.
Rename the function to add 'btrfs_' prefix to it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel.
So add a 'btrfs_' prefix to their name to make it clear they are from
btrfs.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel.
So add a 'btrfs_' prefix to their name to make it clear they are from
btrfs.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. One of them has a
double underscore prefix which is also discouraged.
So remove double underscore prefix where applicable and add a 'btrfs_'
prefix to their name to make it clear they are from btrfs.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
All other users of the bg_list list_head increment the refcount when
adding to a list and decrement it when deleting from the list. Just for
the sake of uniformity and to try to avoid refcounting bugs, do it for
this list as well.
This does not fix any known ref-counting bug, as the reference belongs
to a single task (trans_handle is not shared and this represents
trans_handle->new_bgs linkage) and will not lose its original refcount
while that thread is running. And BLOCK_GROUP_FLAG_NEW protects against
ref-counting errors "moving" the block group to the unused list without
taking a ref.
With that said, I still believe it is simpler to just hold the extra ref
count for this list user as well.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
As far as I can tell, these calls of list_del_init() on bg_list cannot
run concurrently with btrfs_mark_bg_unused() or btrfs_mark_bg_to_reclaim(),
as they are in transaction error paths and situations where the block
group is readonly.
However, if there is any chance at all of racing with mark_bg_unused(),
or a different future user of bg_list, better to be safe than sorry.
Otherwise we risk the following interleaving (bg_list refcount in parens)
T1 (some random op) T2 (btrfs_mark_bg_unused)
!list_empty(&bg->bg_list); (1)
list_del_init(&bg->bg_list); (1)
list_move_tail (1)
btrfs_put_block_group (0)
btrfs_delete_unused_bgs
bg = list_first_entry
list_del_init(&bg->bg_list);
btrfs_put_block_group(bg); (-1)
Ultimately, this results in a broken ref count that hits zero one deref
early and the real final deref underflows the refcount, resulting in a WARNING.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Use a struct btrfs_inode in create_pending_snapshot() as it's an
internal helper, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The btrfs_key is defined as objectid/type/offset and the keys are also
printed like that. For better readability, update all key
initializations to match this order.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When we are trying to join the current transaction and if it's aborted,
we read its 'aborted' field after unlocking fs_info->trans_lock and
without holding any extra reference count on it. This means that a
concurrent task that is aborting the transaction may free the transaction
before we read its 'aborted' field, leading to a use-after-free.
Fix this by reading the 'aborted' field while holding fs_info->trans_lock
since any freeing task must first acquire that lock and set
fs_info->running_transaction to NULL before freeing the transaction.
This was reported by syzbot and Dmitry with the following stack traces
from KASAN:
==================================================================
BUG: KASAN: slab-use-after-free in join_transaction+0xd9b/0xda0 fs/btrfs/transaction.c:278
Read of size 4 at addr ffff888011839024 by task kworker/u4:9/1128
CPU: 0 UID: 0 PID: 1128 Comm: kworker/u4:9 Not tainted 6.13.0-rc7-syzkaller-00019-gc45323b7560e #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
Workqueue: events_unbound btrfs_async_reclaim_data_space
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0x169/0x550 mm/kasan/report.c:489
kasan_report+0x143/0x180 mm/kasan/report.c:602
join_transaction+0xd9b/0xda0 fs/btrfs/transaction.c:278
start_transaction+0xaf8/0x1670 fs/btrfs/transaction.c:697
flush_space+0x448/0xcf0 fs/btrfs/space-info.c:803
btrfs_async_reclaim_data_space+0x159/0x510 fs/btrfs/space-info.c:1321
process_one_work kernel/workqueue.c:3236 [inline]
process_scheduled_works+0xa66/0x1840 kernel/workqueue.c:3317
worker_thread+0x870/0xd30 kernel/workqueue.c:3398
kthread+0x2f0/0x390 kernel/kthread.c:389
ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
</TASK>
Allocated by task 5315:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
__kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:394
kasan_kmalloc include/linux/kasan.h:260 [inline]
__kmalloc_cache_noprof+0x243/0x390 mm/slub.c:4329
kmalloc_noprof include/linux/slab.h:901 [inline]
join_transaction+0x144/0xda0 fs/btrfs/transaction.c:308
start_transaction+0xaf8/0x1670 fs/btrfs/transaction.c:697
btrfs_create_common+0x1b2/0x2e0 fs/btrfs/inode.c:6572
lookup_open fs/namei.c:3649 [inline]
open_last_lookups fs/namei.c:3748 [inline]
path_openat+0x1c03/0x3590 fs/namei.c:3984
do_filp_open+0x27f/0x4e0 fs/namei.c:4014
do_sys_openat2+0x13e/0x1d0 fs/open.c:1402
do_sys_open fs/open.c:1417 [inline]
__do_sys_creat fs/open.c:1495 [inline]
__se_sys_creat fs/open.c:1489 [inline]
__x64_sys_creat+0x123/0x170 fs/open.c:1489
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
Freed by task 5336:
kasan_save_stack mm/kasan/common.c:47 [inline]
kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:582
poison_slab_object mm/kasan/common.c:247 [inline]
__kasan_slab_free+0x59/0x70 mm/kasan/common.c:264
kasan_slab_free include/linux/kasan.h:233 [inline]
slab_free_hook mm/slub.c:2353 [inline]
slab_free mm/slub.c:4613 [inline]
kfree+0x196/0x430 mm/slub.c:4761
cleanup_transaction fs/btrfs/transaction.c:2063 [inline]
btrfs_commit_transaction+0x2c97/0x3720 fs/btrfs/transaction.c:2598
insert_balance_item+0x1284/0x20b0 fs/btrfs/volumes.c:3757
btrfs_balance+0x992/0x10c0 fs/btrfs/volumes.c:4633
btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
The buggy address belongs to the object at ffff888011839000
which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 36 bytes inside of
freed 2048-byte region [ffff888011839000, ffff888011839800)
The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11838
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
page_type: f5(slab)
raw: 00fff00000000040 ffff88801ac42000 ffffea0000493400 dead000000000002
raw: 0000000000000000 0000000000080008 00000001f5000000 0000000000000000
head: 00fff00000000040 ffff88801ac42000 ffffea0000493400 dead000000000002
head: 0000000000000000 0000000000080008 00000001f5000000 0000000000000000
head: 00fff00000000003 ffffea0000460e01 ffffffffffffffff 0000000000000000
head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 57, tgid 57 (kworker/0:2), ts 67248182943, free_ts 67229742023
set_page_owner include/linux/page_owner.h:32 [inline]
post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1558
prep_new_page mm/page_alloc.c:1566 [inline]
get_page_from_freelist+0x365c/0x37a0 mm/page_alloc.c:3476
__alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4753
alloc_pages_mpol_noprof+0x3e1/0x780 mm/mempolicy.c:2269
alloc_slab_page+0x6a/0x110 mm/slub.c:2423
allocate_slab+0x5a/0x2b0 mm/slub.c:2589
new_slab mm/slub.c:2642 [inline]
___slab_alloc+0xc27/0x14a0 mm/slub.c:3830
__slab_alloc+0x58/0xa0 mm/slub.c:3920
__slab_alloc_node mm/slub.c:3995 [inline]
slab_alloc_node mm/slub.c:4156 [inline]
__do_kmalloc_node mm/slub.c:4297 [inline]
__kmalloc_node_track_caller_noprof+0x2e9/0x4c0 mm/slub.c:4317
kmalloc_reserve+0x111/0x2a0 net/core/skbuff.c:609
__alloc_skb+0x1f3/0x440 net/core/skbuff.c:678
alloc_skb include/linux/skbuff.h:1323 [inline]
alloc_skb_with_frags+0xc3/0x820 net/core/skbuff.c:6612
sock_alloc_send_pskb+0x91a/0xa60 net/core/sock.c:2884
sock_alloc_send_skb include/net/sock.h:1803 [inline]
mld_newpack+0x1c3/0xaf0 net/ipv6/mcast.c:1747
add_grhead net/ipv6/mcast.c:1850 [inline]
add_grec+0x1492/0x19a0 net/ipv6/mcast.c:1988
mld_send_cr net/ipv6/mcast.c:2114 [inline]
mld_ifc_work+0x691/0xd90 net/ipv6/mcast.c:2651
page last free pid 5300 tgid 5300 stack trace:
reset_page_owner include/linux/page_owner.h:25 [inline]
free_pages_prepare mm/page_alloc.c:1127 [inline]
free_unref_page+0xd3f/0x1010 mm/page_alloc.c:2659
__slab_free+0x2c2/0x380 mm/slub.c:4524
qlink_free mm/kasan/quarantine.c:163 [inline]
qlist_free_all+0x9a/0x140 mm/kasan/quarantine.c:179
kasan_quarantine_reduce+0x14f/0x170 mm/kasan/quarantine.c:286
__kasan_slab_alloc+0x23/0x80 mm/kasan/common.c:329
kasan_slab_alloc include/linux/kasan.h:250 [inline]
slab_post_alloc_hook mm/slub.c:4119 [inline]
slab_alloc_node mm/slub.c:4168 [inline]
__do_kmalloc_node mm/slub.c:4297 [inline]
__kmalloc_noprof+0x236/0x4c0 mm/slub.c:4310
kmalloc_noprof include/linux/slab.h:905 [inline]
kzalloc_noprof include/linux/slab.h:1037 [inline]
fib_create_info+0xc14/0x25b0 net/ipv4/fib_semantics.c:1435
fib_table_insert+0x1f6/0x1f20 net/ipv4/fib_trie.c:1231
fib_magic+0x3d8/0x620 net/ipv4/fib_frontend.c:1112
fib_add_ifaddr+0x40c/0x5e0 net/ipv4/fib_frontend.c:1156
fib_netdev_event+0x375/0x490 net/ipv4/fib_frontend.c:1494
notifier_call_chain+0x1a5/0x3f0 kernel/notifier.c:85
__dev_notify_flags+0x207/0x400
dev_change_flags+0xf0/0x1a0 net/core/dev.c:9045
do_setlink+0xc90/0x4210 net/core/rtnetlink.c:3109
rtnl_changelink net/core/rtnetlink.c:3723 [inline]
__rtnl_newlink net/core/rtnetlink.c:3875 [inline]
rtnl_newlink+0x1bb6/0x2210 net/core/rtnetlink.c:4012
Memory state around the buggy address:
ffff888011838f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888011838f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff888011839000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888011839080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888011839100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
Reported-by: syzbot+45212e9d87a98c3f5b42@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/678e7da5.050a0220.303755.007c.GAE@google.com/
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://lore.kernel.org/linux-btrfs/CACT4Y+ZFBdo7pT8L2AzM=vegZwjp-wNkVJZQf0Ta3vZqtExaSw@mail.gmail.com/
Fixes: 871383be592b ("btrfs: add missing unlocks to transaction abort paths")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since commit e1e577aafe41 ("btrfs: store fs_info in space_info"), we have
the fs_info in a space_info. So, we can drop fs_info argument from
btrfs_update_space_info_*. There is no behavior change.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we use a red black tree (rb-tree) to track the delayed ref
heads (in struct btrfs_delayed_ref_root::href_root). This however is not
very efficient when the number of delayed ref heads is large (and it's
very common to be at least in the order of thousands) since rb-trees are
binary trees. For example for 10K delayed ref heads, the tree has a depth
of 13. Besides that, inserting into the tree requires navigating through
it and pulling useless cache lines in the process since the red black tree
nodes are embedded within the delayed ref head structure - on the other
hand, by being embedded, it requires no extra memory allocations.
We can improve this by using an xarray instead which has a much higher
branching factor than a red black tree (binary balanced tree) and is more
cache friendly and behaves like a resizable array, with a much better
search and insertion complexity than a red black tree. This only has one
small disadvantage which is that insertion will sometimes require
allocating memory for the xarray - which may fail (not that often since
it uses a kmem_cache) - but on the other hand we can reduce the delayed
ref head structure size by 24 bytes (from 152 down to 128 bytes) after
removing the embedded red black tree node, meaning than we can now fit
32 delayed ref heads per 4K page instead of 26, and that gain compensates
for the occasional memory allocations needed for the xarray nodes. We
also end up using only 2 cache lines instead of 3 per delayed ref head.
Running the following fs_mark test showed some improvements:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
MOUNT_OPTIONS="-o ssd"
FILES=100000
THREADS=$(nproc --all)
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $DEV
mount $MOUNT_OPTIONS $DEV $MNT
OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k"
for ((i = 1; i <= $THREADS; i++)); do
OPTS="$OPTS -d $MNT/d$i"
done
fs_mark $OPTS
umount $MNT
Before this patch:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 171845.7 12253839
16 2400000 0 230898.7 12308254
23 3600000 0 212292.9 12467768
30 4800000 0 195737.8 12627554
46 6000000 0 171055.2 12783329
After this patch:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 173835.0 12246131
16 2400000 0 233537.8 12271746
23 3600000 0 220398.7 12307737
30 4800000 0 204483.6 12392318
40 6000000 0 182923.3 12771843
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The atomic counter 'num_entries' is not used anymore, we increment it
and decrement it but then we don't ever read it to use for any logic.
Its last use was removed with commit 61a56a992fcf ("btrfs: delayed refs
pre-flushing should only run the heads we have"). So remove it.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The fs_info parameter is redundant because it can be extracted from the
transaction given as another parameter. So remove it and use the fs_info
accessible from the transaction.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Use xarray to track dirty extents to reduce the size of the struct
btrfs_qgroup_extent_record from 64 bytes to 40 bytes. The xarray is
more cache line friendly, it also reduces the complexity of insertion
and search code compared to rb tree.
Another change introduced is about error handling. Before this patch,
the result of btrfs_qgroup_trace_extent_nolock() is always a success. In
this patch, because of this function calls the function xa_store() which
has the possibility to fail, so mark qgroup as inconsistent if error
happened and then free preallocated memory. Also we preallocate memory
before spin_lock(), if memory preallcation failed, error handling is the
same the existing code.
Suggested-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Junchao Sun <sunjunchao2870@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
KCSAN complains about a data race when accessing the last_trans field of a
root:
[ 199.553628] BUG: KCSAN: data-race in btrfs_record_root_in_trans [btrfs] / record_root_in_trans [btrfs]
[ 199.555186] read to 0x000000008801e308 of 8 bytes by task 2812 on cpu 1:
[ 199.555210] btrfs_record_root_in_trans+0x9a/0x128 [btrfs]
[ 199.555999] start_transaction+0x154/0xcd8 [btrfs]
[ 199.556780] btrfs_join_transaction+0x44/0x60 [btrfs]
[ 199.557559] btrfs_dirty_inode+0x9c/0x140 [btrfs]
[ 199.558339] btrfs_update_time+0x8c/0xb0 [btrfs]
[ 199.559123] touch_atime+0x16c/0x1e0
[ 199.559151] pipe_read+0x6a8/0x7d0
[ 199.559179] vfs_read+0x466/0x498
[ 199.559204] ksys_read+0x108/0x150
[ 199.559230] __s390x_sys_read+0x68/0x88
[ 199.559257] do_syscall+0x1c6/0x210
[ 199.559286] __do_syscall+0xc8/0xf0
[ 199.559318] system_call+0x70/0x98
[ 199.559431] write to 0x000000008801e308 of 8 bytes by task 2808 on cpu 0:
[ 199.559464] record_root_in_trans+0x196/0x228 [btrfs]
[ 199.560236] btrfs_record_root_in_trans+0xfe/0x128 [btrfs]
[ 199.561097] start_transaction+0x154/0xcd8 [btrfs]
[ 199.561927] btrfs_join_transaction+0x44/0x60 [btrfs]
[ 199.562700] btrfs_dirty_inode+0x9c/0x140 [btrfs]
[ 199.563493] btrfs_update_time+0x8c/0xb0 [btrfs]
[ 199.564277] file_update_time+0xb8/0xf0
[ 199.564301] pipe_write+0x8ac/0xab8
[ 199.564326] vfs_write+0x33c/0x588
[ 199.564349] ksys_write+0x108/0x150
[ 199.564372] __s390x_sys_write+0x68/0x88
[ 199.564397] do_syscall+0x1c6/0x210
[ 199.564424] __do_syscall+0xc8/0xf0
[ 199.564452] system_call+0x70/0x98
This is because we update and read last_trans concurrently without any
type of synchronization. This should be generally harmless and in the
worst case it can make us do extra locking (btrfs_record_root_in_trans())
trigger some warnings at ctree.c or do extra work during relocation - this
would probably only happen in case of load or store tearing.
So fix this by always reading and updating the field using READ_ONCE()
and WRITE_ONCE(), this silences KCSAN and prevents load and store tearing.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The structure is internal so we should use struct btrfs_inode for that.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have several places that attach to the current transaction with
btrfs_attach_transaction_barrier() and then commit the transaction if
there is one. Add a helper and use it to deduplicate this pattern.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The range is specified only in two ways, we can simplify the case for
the whole filesystem range as a NULL block group parameter.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Rename the function's local return variables err and werr to ret.
Also, align the variable declarations with the other declarations in
the function for better function space alignment.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Rename the function's local variable werr and err to ret.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In the function btrfs_write_marked_extents() and in __btrfs_wait_marked_extents()
return the actual error if when filemap_fdata<write|wait>_range() fails.
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
A comment from Filipe on one of my previous cleanups brought my
attention to a new helper we have for getting the root id of a root,
which makes it easier to read in the code.
The changes where made with the following Coccinelle semantic patch:
// <smpl>
@@
expression E,E1;
@@
(
E->root_key.objectid = E1
|
- E->root_key.objectid
+ btrfs_root_id(E)
)
// </smpl>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Unify naming of return value to the preferred way.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When creating a snapshot we first check with btrfs_lookup_dir_item() if
there is a name collision in the parent directory and then return an error
if there's a collision. Then later on when trying to insert a dir item for
the snapshot we BUG_ON() if the return value is -EEXIST or -EOVERFLOW:
static noinline int create_pending_snapshot(...)
{
(...)
/* check if there is a file/dir which has the same name. */
dir_item = btrfs_lookup_dir_item(...);
(...)
ret = btrfs_insert_dir_item(...);
/* We have check then name at the beginning, so it is impossible. */
BUG_ON(ret == -EEXIST || ret == -EOVERFLOW);
if (ret) {
btrfs_abort_transaction(trans, ret);
goto fail;
}
(...)
}
It's impossible to get the -EEXIST because we previously checked for a
potential collision with btrfs_lookup_dir_item() and we know that after
that no one could have added a colliding name because at this point the
transaction is in its critical section, state TRANS_STATE_COMMIT_DOING,
so no one can join this transaction to add a colliding name and neither
can anyone start a new transaction to do that.
As for the -EOVERFLOW, that can't happen as long as we have the extended
references feature enabled, which is a mkfs default for many years now.
In either case, the BUG_ON() is excessive as we can properly deal with
any error and can abort the transaction and jump to the 'fail' label,
in which case we'll also get the useful stack trace (just like a BUG_ON())
from the abort if the error is either -EEXIST or -EOVERFLOW.
So remove the BUG_ON().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
It is possible to clear a root's IN_TRANS tag from the radix tree, but
not clear its PERTRANS, if there is some error in between. Eliminate
that possibility by moving the free up to where we clear the tag.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The transaction is only able to free PERTRANS reservations for a root
once that root has been recorded with the TRANS tag on the roots radix
tree. Therefore, until we are sure that this root will get tagged, it
isn't safe to convert. Generally, this is not an issue as *some*
transaction will likely tag the root before long and this reservation
will get freed in that transaction, but technically it could stick
around until unmount and result in a warning about leaked metadata
reservation space.
This path is most exercised by running the generic/269 fstest with
CONFIG_BTRFS_DEBUG.
Fixes: a6496849671a ("btrfs: fix start transaction qgroup rsv double free")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1, so it became a dead flag since the commit
16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the
series[1] went on to mark it obsolete to avoid confusion for users.
Here we can just remove all its users, which has no functional change.
[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Use the KMEM_CACHE() macro instead of kmem_cache_create() to simplify
the creation of SLAB caches when the default values are used.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function btrfs_transaction_in_commit() is no longer used, its last
use was removed in commit 11aeb97b45ad ("btrfs: don't arbitrarily slow
down delalloc if we're committing"), so just remove it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
With help of neovim, LSP and clangd we can identify header files that
are not actually needed to be included in the .c files. This is focused
only on removal (with minor fixups), further cleanups are possible but
will require doing the header files properly with forward declarations,
minimized includes and include-what-you-use care.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When creating a snapshot we may do a double free of an anonymous device
in case there's an error committing the transaction. The second free may
result in freeing an anonymous device number that was allocated by some
other subsystem in the kernel or another btrfs filesystem.
The steps that lead to this:
1) At ioctl.c:create_snapshot() we allocate an anonymous device number
and assign it to pending_snapshot->anon_dev;
2) Then we call btrfs_commit_transaction() and end up at
transaction.c:create_pending_snapshot();
3) There we call btrfs_get_new_fs_root() and pass it the anonymous device
number stored in pending_snapshot->anon_dev;
4) btrfs_get_new_fs_root() frees that anonymous device number because
btrfs_lookup_fs_root() returned a root - someone else did a lookup
of the new root already, which could some task doing backref walking;
5) After that some error happens in the transaction commit path, and at
ioctl.c:create_snapshot() we jump to the 'fail' label, and after
that we free again the same anonymous device number, which in the
meanwhile may have been reallocated somewhere else, because
pending_snapshot->anon_dev still has the same value as in step 1.
Recently syzbot ran into this and reported the following trace:
------------[ cut here ]------------
ida_free called for id=51 which is not allocated.
WARNING: CPU: 1 PID: 31038 at lib/idr.c:525 ida_free+0x370/0x420 lib/idr.c:525
Modules linked in:
CPU: 1 PID: 31038 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00410-gc02197fc9076 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024
RIP: 0010:ida_free+0x370/0x420 lib/idr.c:525
Code: 10 42 80 3c 28 (...)
RSP: 0018:ffffc90015a67300 EFLAGS: 00010246
RAX: be5130472f5dd000 RBX: 0000000000000033 RCX: 0000000000040000
RDX: ffffc90009a7a000 RSI: 000000000003ffff RDI: 0000000000040000
RBP: ffffc90015a673f0 R08: ffffffff81577992 R09: 1ffff92002b4cdb4
R10: dffffc0000000000 R11: fffff52002b4cdb5 R12: 0000000000000246
R13: dffffc0000000000 R14: ffffffff8e256b80 R15: 0000000000000246
FS: 00007fca3f4b46c0(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f167a17b978 CR3: 000000001ed26000 CR4: 0000000000350ef0
Call Trace:
<TASK>
btrfs_get_root_ref+0xa48/0xaf0 fs/btrfs/disk-io.c:1346
create_pending_snapshot+0xff2/0x2bc0 fs/btrfs/transaction.c:1837
create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1931
btrfs_commit_transaction+0xf1c/0x3740 fs/btrfs/transaction.c:2404
create_snapshot+0x507/0x880 fs/btrfs/ioctl.c:848
btrfs_mksubvol+0x5d0/0x750 fs/btrfs/ioctl.c:998
btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1044
__btrfs_ioctl_snap_create+0x387/0x4b0 fs/btrfs/ioctl.c:1306
btrfs_ioctl_snap_create_v2+0x1ca/0x400 fs/btrfs/ioctl.c:1393
btrfs_ioctl+0xa74/0xd40
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:871 [inline]
__se_sys_ioctl+0xfe/0x170 fs/ioctl.c:857
do_syscall_64+0xfb/0x240
entry_SYSCALL_64_after_hwframe+0x6f/0x77
RIP: 0033:0x7fca3e67dda9
Code: 28 00 00 00 (...)
RSP: 002b:00007fca3f4b40c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fca3e7abf80 RCX: 00007fca3e67dda9
RDX: 00000000200005c0 RSI: 0000000050009417 RDI: 0000000000000003
RBP: 00007fca3e6ca47a R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 000000000000000b R14: 00007fca3e7abf80 R15: 00007fff6bf95658
</TASK>
Where we get an explicit message where we attempt to free an anonymous
device number that is not currently allocated. It happens in a different
code path from the example below, at btrfs_get_root_ref(), so this change
may not fix the case triggered by syzbot.
To fix at least the code path from the example above, change
btrfs_get_root_ref() and its callers to receive a dev_t pointer argument
for the anonymous device number, so that in case it frees the number, it
also resets it to 0, so that up in the call chain we don't attempt to do
the double free.
CC: stable@vger.kernel.org # 5.10+
Link: https://lore.kernel.org/linux-btrfs/000000000000f673a1061202f630@google.com/
Fixes: e03ee2fe873e ("btrfs: do not ASSERT() if the newly created subvolume already got read")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since commit 28270e25c69a ("btrfs: always reserve space for delayed refs
when starting transaction") we started not only to reserve metadata space
for the delayed refs a caller of btrfs_start_transaction() might generate
but also to try to fully refill the delayed refs block reserve, because
there are several case where we generate delayed refs and haven't reserved
space for them, relying on the global block reserve. Relying too much on
the global block reserve is not always safe, and can result in hitting
-ENOSPC during transaction commits or worst, in rare cases, being unable
to mount a filesystem that needs to do orphan cleanup or anything that
requires modifying the filesystem during mount, and has no more
unallocated space and the metadata space is nearly full. This was
explained in detail in that commit's change log.
However the gap between the reserved amount and the size of the delayed
refs block reserve can be huge, so attempting to reserve space for such
a gap can result in allocating many metadata block groups that end up
not being used. After a recent patch, with the subject:
"btrfs: add new unused block groups to the list of unused block groups"
We started to add new block groups that are unused to the list of unused
block groups, to avoid having them around for a very long time in case
they are never used, because a block group is only added to the list of
unused block groups when we deallocate the last extent or when mounting
the filesystem and the block group has 0 bytes used. This is not a problem
introduced by the commit mentioned earlier, it always existed as our
metadata space reservations are, most of the time, pessimistic and end up
not using all the space they reserved, so we can occasionally end up with
one or two unused metadata block groups for a long period. However after
that commit mentioned earlier, we are just more pessimistic in the
metadata space reservations when starting a transaction and therefore the
issue is more likely to happen.
This however is not always enough because we might create unused metadata
block groups when reserving metadata space at a high rate if there's
always a gap in the delayed refs block reserve and the cleaner kthread
isn't triggered often enough or is busy with other work (running delayed
iputs, cleaning deleted roots, etc), not to mention the block group's
allocated space is only usable for a new block group after the transaction
used to remove it is committed.
A user reported that he's getting a lot of allocated metadata block groups
but the usage percentage of metadata space was very low compared to the
total allocated space, specially after running a series of block group
relocations.
So for now stop trying to refill the gap in the delayed refs block reserve
and reserve space only for the delayed refs we are expected to generate
when starting a transaction.
CC: stable@vger.kernel.org # 6.7+
Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
Link: https://lore.kernel.org/linux-btrfs/9cdbf0ca9cdda1b4c84e15e548af7d7f9f926382.camel@intelfx.name/
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H6802ayLHUJFztzZAVzBLJAGdFx=6FHNNy87+obZXXZpQ@mail.gmail.com/
Tested-by: Ivan Shapovalov <intelfx@intelfx.name>
Reported-by: Heddxh <g311571057@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAE93xANEby6RezOD=zcofENYZOT-wpYygJyauyUAZkLv6XVFOA@mail.gmail.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Some fixes to quota accounting code, mostly around error handling and
correctness:
- free reserves on various error paths, after IO errors or
transaction abort
- don't clear reserved range at the folio release time, it'll be
properly cleared after final write
- fix integer overflow due to int used when passing around size of
freed reservations
- fix a regression in squota accounting that missed some cases with
delayed refs"
* tag 'for-6.7-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: ensure releasing squota reserve on head refs
btrfs: don't clear qgroup reserved bit in release_folio
btrfs: free qgroup pertrans reserve on transaction abort
btrfs: fix qgroup_free_reserved_data int overflow
btrfs: free qgroup reserve when ORDERED_IOERR is set
|
|
If we abort a transaction, we never run the code that frees the pertrans
qgroup reservation. This results in warnings on unmount as that
reservation has been leaked. The leak isn't a huge issue since the fs is
read-only, but it's better to clean it up when we know we can/should. Do
it during the cleanup_transaction step of aborting.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few fixes and message updates:
- for simple quotas, handle the case when a snapshot is created and
the target qgroup already exists
- fix a warning when file descriptor given to send ioctl is not
writable
- fix off-by-one condition when checking chunk maps
- free pages when page array allocation fails during compression
read, other cases were handled
- fix memory leak on error handling path in ref-verify debugging
feature
- copy missing struct member 'version' in 64/32bit compat send ioctl
- tree-checker verifies inline backref ordering
- print messages to syslog on first mount and last unmount
- update error messages when reading chunk maps"
* tag 'for-6.7-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: send: ensure send_fd is writable
btrfs: free the allocated memory if btrfs_alloc_page_array() fails
btrfs: fix 64bit compat send ioctl arguments not initializing version member
btrfs: make error messages more clear when getting a chunk map
btrfs: fix off-by-one when checking chunk map includes logical address
btrfs: ref-verify: fix memory leaks in btrfs_ref_tree_mod()
btrfs: add dmesg output for first mount and last unmount of a filesystem
btrfs: do not abort transaction if there is already an existing qgroup
btrfs: tree-checker: add type and sequence check for inline backrefs
|
|
[BUG]
Syzbot reported a regression that after commit 6ed05643ddb1 ("btrfs:
create qgroup earlier in snapshot creation") we can trigger transaction
abort during snapshot creation:
BTRFS: Transaction aborted (error -17)
WARNING: CPU: 0 PID: 5057 at fs/btrfs/transaction.c:1778 create_pending_snapshot+0x25f4/0x2b70 fs/btrfs/transaction.c:1778
Modules linked in:
CPU: 0 PID: 5057 Comm: syz-executor225 Not tainted 6.6.0-syzkaller-15365-g305230142ae0 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023
RIP: 0010:create_pending_snapshot+0x25f4/0x2b70 fs/btrfs/transaction.c:1778
Call Trace:
<TASK>
create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1967
btrfs_commit_transaction+0xf1c/0x3730 fs/btrfs/transaction.c:2440
create_snapshot+0x4a5/0x7e0 fs/btrfs/ioctl.c:845
btrfs_mksubvol+0x5d0/0x750 fs/btrfs/ioctl.c:995
btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1041
__btrfs_ioctl_snap_create+0x344/0x460 fs/btrfs/ioctl.c:1294
btrfs_ioctl_snap_create+0x13c/0x190 fs/btrfs/ioctl.c:1321
btrfs_ioctl+0xbbf/0xd40
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:871 [inline]
__se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
RIP: 0033:0x7f2f791127b9
</TASK>
[CAUSE]
The error number is -EEXIST, which can happen for qgroup if there is
already an existing qgroup and then we're trying to create a snapshot
for it.
[FIX]
In that case, we can continue creating the snapshot, although it may
lead to qgroup inconsistency, it's not so critical to abort the current
transaction.
So in this case, we can just ignore the non-critical errors, mostly -EEXIST
(there is already a qgroup).
Reported-by: syzbot+4d81015bc10889fd12ea@syzkaller.appspotmail.com
Fixes: 6ed05643ddb1 ("btrfs: create qgroup earlier in snapshot creation")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"New features:
- raid-stripe-tree
New tree for logical file extent mapping where the physical mapping
may not match on multiple devices. This is now used in zoned mode
to implement RAID0/RAID1* profiles, but can be used in non-zoned
mode as well. The support for RAID56 is in development and will
eventually fix the problems with the current implementation. This
is a backward incompatible feature and has to be enabled at mkfs
time.
- simple quota accounting (squota)
A simplified mode of qgroup that accounts all space on the initial
extent owners (a subvolume), the snapshots are then cheap to create
and delete. The deletion of snapshots in fully accounting qgroups
is a known CPU/IO performance bottleneck.
The squota is not suitable for the general use case but works well
for containers where the original subvolume exists for the whole
time. This is a backward incompatible feature as it needs extending
some structures, but can be enabled on an existing filesystem.
- temporary filesystem fsid (temp_fsid)
The fsid identifies a filesystem and is hard coded in the
structures, which disallows mounting the same fsid found on
different devices.
For a single device filesystem this is not strictly necessary, a
new temporary fsid can be generated on mount e.g. after a device is
cloned. This will be used by Steam Deck for root partition A/B
testing, or can be used for VM root images.
Other user visible changes:
- filesystems with partially finished metadata_uuid conversion cannot
be mounted anymore and the uuid fixup has to be done by btrfs-progs
(btrfstune).
Performance improvements:
- reduce reservations for checksum deletions (with enabled free space
tree by factor of 4), on a sample workload on file with many
extents the deletion time decreased by 12%
- make extent state merges more efficient during insertions, reduce
rb-tree iterations (run time of critical functions reduced by 5%)
Core changes:
- the integrity check functionality has been removed, this was a
debugging feature and removal does not affect other integrity
checks like checksums or tree-checker
- space reservation changes:
- more efficient delayed ref reservations, this avoids building up
too much work or overusing or exhausting the global block
reserve in some situations
- move delayed refs reservation to the transaction start time,
this prevents some ENOSPC corner cases related to exhaustion of
global reserve
- improvements in reducing excessive reservations for block group
items
- adjust overcommit logic in near full situations, account for one
more chunk to eventually allocate metadata chunk, this is mostly
relevant for small filesystems (<10GiB)
- single device filesystems are scanned but not registered (except
seed devices), this allows temp_fsid to work
- qgroup iterations do not need GFP_ATOMIC allocations anymore
- cleanups, refactoring, reduced data structure size, function
parameter simplifications, error handling fixes"
* tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (156 commits)
btrfs: open code timespec64 in struct btrfs_inode
btrfs: remove redundant log root tree index assignment during log sync
btrfs: remove redundant initialization of variable dirty in btrfs_update_time()
btrfs: sysfs: show temp_fsid feature
btrfs: disable the device add feature for temp-fsid
btrfs: disable the seed feature for temp-fsid
btrfs: update comment for temp-fsid, fsid, and metadata_uuid
btrfs: remove pointless empty log context list check when syncing log
btrfs: update comment for struct btrfs_inode::lock
btrfs: remove pointless barrier from btrfs_sync_file()
btrfs: add and use helpers for reading and writing last_trans_committed
btrfs: add and use helpers for reading and writing fs_info->generation
btrfs: add and use helpers for reading and writing log_transid
btrfs: add and use helpers for reading and writing last_log_commit
btrfs: support cloned-device mount capability
btrfs: add helper function find_fsid_by_disk
btrfs: stop reserving excessive space for block group item insertions
btrfs: stop reserving excessive space for block group item updates
btrfs: reorder btrfs_inode to fill gaps
btrfs: open code btrfs_ordered_inode_tree in btrfs_inode
...
|
|
Convert to using the new inode timestamp accessor functions.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/r/20231004185347.80880-21-jlayton@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Currently the last_trans_committed field of struct btrfs_fs_info is
modified and read without any locking or other protection. For example
early in the fsync path, skip_inode_logging() is called which reads
fs_info->last_trans_committed, but at the same time we can have a
transaction commit completing and updating that field.
In the case of an fsync this is harmless and any data race should be
rare and at most cause an unnecessary logging of an inode.
To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the last_trans_committed field of struct btrfs_fs_info using
READ_ONCE() and WRITE_ONCE(), and use these helpers everywhere.
[1] https://lwn.net/Articles/793253/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently the generation field of struct btrfs_fs_info is always modified
while holding fs_info->trans_lock locked. Most readers will access this
field without taking that lock but while holding a transaction handle,
which is safe to do due to the transaction life cycle.
However there are other readers that are neither holding the lock nor
holding a transaction handle open:
1) When reading an inode from disk, at btrfs_read_locked_inode();
2) When reading the generation to expose it to sysfs, at
btrfs_generation_show();
3) Early in the fsync path, at skip_inode_logging();
4) When creating a hole at btrfs_cont_expand(), during write paths,
truncate and reflinking;
5) In the fs_info ioctl (btrfs_ioctl_fs_info());
6) While mounting the filesystem, in the open_ctree() path. In these
cases it's safe to directly read fs_info->generation as no one
can concurrently start a transaction and update fs_info->generation.
In case of the fsync path, races here should be harmless, and in the worst
case they may cause a fsync to log an inode when it's not really needed,
so nothing bad from a functional perspective. In the other cases it's not
so clear if functional problems may arise, though in case 1 rare things
like a load/store tearing [1] may cause the BTRFS_INODE_NEEDS_FULL_SYNC
flag not being set on an inode and therefore result in incorrect logging
later on in case a fsync call is made.
To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the generation field of struct btrfs_fs_info using READ_ONCE() and
WRITE_ONCE(), and use these helpers where needed.
[1] https://lwn.net/Articles/793253/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Space for block group item insertions, necessary after allocating a new
block group, is reserved in the delayed refs block reserve. Currently we
do this by incrementing the transaction handle's delayed_ref_updates
counter and then calling btrfs_update_delayed_refs_rsv(), which will
increase the size of the delayed refs block reserve by an amount that
corresponds to the same amount we use for delayed refs, given by
btrfs_calc_delayed_ref_bytes().
That is an excessive amount because it corresponds to the amount of space
needed to insert one item in a btree (btrfs_calc_insert_metadata_size())
times 2 when the free space tree feature is enabled. All we need is an
amount as given by btrfs_calc_insert_metadata_size(), since we only need to
insert a block group item in the extent tree (or block group tree if this
feature is enabled). By using btrfs_calc_insert_metadata_size() we will
need to reserve 2 times less space when using the free space tree, putting
less pressure on space reservation.
So use helpers to reserve and release space for block group item
insertions that use btrfs_calc_insert_metadata_size() for calculation of
the space.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The btrfs_defrag_root() function does not really belong in the
transaction.{c,h} module and as we have a defrag.{c,h} nowadays,
move it to there instead. This also allows to stop exporting
btrfs_defrag_leaves(), so we can make it static.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename info to fs_info for consistency ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The root argument for btrfs_update_inode_fallback() always matches the
root of the given inode, so remove the root argument and get it from the
inode argument.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|