path: root/fs/bcachefs
Age  Commit message  Author
2025-05-21bcachefs: alloc_request no longer on stackKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21bcachefs: alloc_request.ptrs2Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21bcachefs: alloc_request.caKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21bcachefs: alloc_request.countersKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21bcachefs: alloc_request.usageKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21bcachefs: alloc_request: deallocate_extra_replicas()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21bcachefs: new_stripe_alloc_buckets() takes alloc_requestKent Overstreet
More stack usage improvements: instead of creating a new alloc_request (currently on the stack), save/restore just the fields we need to reuse. This is a bit tricky, because we're doing a normal alloc_foreground.c allocation, which calls into ec.c to get a stripe, which then does more normal allocations - some of the fields get reused, and used differently. So we have to save and restore them - but the stack usage improvements will be well worth it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
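The save/restore idea described above can be illustrated with a minimal sketch. Everything here is hypothetical: `struct req_sketch`, its two fields, and `nested_alloc_sketch()` are stand-ins invented for illustration and are not the actual bcachefs code. The sketch only shows how a nested allocation can borrow the caller's request struct instead of putting a second one on the stack.

```c
#include <assert.h>

/* Hypothetical request state; the real struct alloc_request differs. */
struct req_sketch {
	unsigned nr_replicas;	/* copies wanted by the current allocation */
	unsigned nr_effective;	/* copies obtained so far */
};

/*
 * Run a nested allocation that reuses the caller's request struct.
 * Only the fields the nested allocation overwrites are saved, and they
 * are restored before returning, so the outer allocation can continue.
 */
static unsigned nested_alloc_sketch(struct req_sketch *req, unsigned want)
{
	unsigned saved_replicas  = req->nr_replicas;
	unsigned saved_effective = req->nr_effective;
	unsigned got;

	assert(want > 0);

	req->nr_replicas  = want;
	req->nr_effective = 0;
	while (req->nr_effective < req->nr_replicas)
		req->nr_effective++;	/* stand-in for allocating one bucket */
	got = req->nr_effective;

	/* Restore the outer allocation's view of the shared fields. */
	req->nr_replicas  = saved_replicas;
	req->nr_effective = saved_effective;
	return got;
}
```

The cost of this approach is that every reused field has to be remembered explicitly, which is the "bit tricky" part the commit message mentions.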
2025-05-21bcachefs: bch2_ec_stripe_head_get() takes alloc_requestKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21bcachefs: bch2_bucket_alloc_trans() takes alloc_requestKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21bcachefs: alloc_request.data_typeKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21bcachefs: struct alloc_requestKent Overstreet
Add a struct for common state for satisfying an on disk allocation, instead of passing the same long list of items to every function. This will help with stack usage, performance, and perhaps enable some code cleanups. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
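As a rough illustration of the pattern, the sketch below contrasts a helper that takes loose arguments with one that takes a single pointer to shared request state. The struct name, its fields, and the helper functions are hypothetical and chosen only for this example; they are not the fields of the real struct alloc_request.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical common state for one allocation (not the real layout). */
struct alloc_req_sketch {
	unsigned nr_replicas;	/* copies wanted */
	unsigned nr_effective;	/* copies allocated so far */
	unsigned data_type;	/* kind of data being written */
	bool	 have_cache;	/* whether a cached copy was already picked */
};

/* Before: each helper repeats the same argument list. */
static bool alloc_done_args(unsigned nr_replicas, unsigned nr_effective)
{
	return nr_effective >= nr_replicas;
}

/* After: helpers take one pointer to the shared request state. */
static bool alloc_done_req(const struct alloc_req_sketch *req)
{
	return req->nr_effective >= req->nr_replicas;
}

/* Record that one more replica has been allocated. */
static void alloc_one_req(struct alloc_req_sketch *req)
{
	assert(req->nr_effective < req->nr_replicas);
	req->nr_effective++;
}
```

Passing one pointer instead of many arguments means fewer values are copied into each callee's frame, which is where the stack usage benefit mentioned in the message would come from.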
2025-05-21bcachefs: trace bch2_trans_kmalloc()Kent Overstreet
We're occasionally seeing the WARN_ON() for bump allocator usage exceeding BTREE_TRANS_MEM_MAX; add some tracing so we can see what's going on. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21bcachefs: replace memcpy with memcpy_and_pad for jset_entry_log->d buffRoxana Nicolescu
Previously this was done by zeroing the source buffer and then copying the bytes into the destination buffer. The same result can be achieved with memcpy_and_pad(), which zeroes only the part of the destination buffer that extends past the source. This is already used in the same way in journal_transaction_name(). Moreover, the source buffer was zeroed twice, first in __bch2_fs_log_msg() and then in bch2_trans_log_msg(), and this method may also require allocating extra memory for the source buffer. In conclusion, memcpy_and_pad() is better even though the result is the same: it is consistent with what journal_transaction_name() already does, and it avoids code duplication and allocating extra memory. Signed-off-by: Roxana Nicolescu <nicolescu.roxana@protonmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
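For readers unfamiliar with the helper, here is a userspace sketch of the copy-then-pad behaviour. `memcpy_and_pad_sketch()` is a stand-in written for this example; it mirrors the argument order of the kernel's memcpy_and_pad() (destination, destination length, source, byte count, pad byte) but is not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy `count` bytes from src into dest. If dest is larger than count,
 * fill the remainder of dest with `pad`; otherwise copy only dest_len
 * bytes. The source buffer is never modified or pre-zeroed.
 */
static void memcpy_and_pad_sketch(void *dest, size_t dest_len,
				  const void *src, size_t count, int pad)
{
	if (dest_len > count) {
		memcpy(dest, src, count);
		memset((char *)dest + count, pad, dest_len - count);
	} else {
		memcpy(dest, src, dest_len);
	}
}
```

Because only the destination tail is padded, no temporary zeroed copy of the source is needed, which is the saving the commit message describes.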
2025-05-21bcachefs: replace strncpy() with memcpy_and_pad in journal_transaction_nameRoxana Nicolescu
strncpy() is now deprecated. The destination buffer is not required to be NUL-terminated, but we also want to zero out the rest of the buffer, as is already done in other places. Link: https://github.com/KSPP/linux/issues/90 Signed-off-by: Roxana Nicolescu <nicolescu.roxana@protonmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21bcachefs: Rebalance now skips poisoned extentsKent Overstreet
Let's not move poisoned extents unnecessarily, since we can't guard against introducing more bitrot. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21bcachefs: Data move can read from poisoned extentsKent Overstreet
Now, if an extent is poisoned we can move it even if there was a checksum error. We'll have to give it a new checksum, but the poison bit means that userspace will still see the appropriate error when they try to read it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21bcachefs: Poison extents that can't be read due to checksum errorsKent Overstreet
Copygc needs to be able to move extents that have bitrotted. We don't want to delete them - in the future we'll have an API for "read me the data even if there's checksum errors", and in general we don't want to delete anything unless the user asks us to. That will require writing it with a new checksum, which means we can't forget that there was a checksum error so we return the correct error to userspace. Rebalance also wants to skip bad extents; we can now use the poison flag for that. This is currently disabled by default, as we want read fua support so that we can distinguish between transient and permanent errors from the device. It may be enabled with the module parameter: poison_extents_on_checksum_error Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21bcachefs: Be precise about bch_io_failuresKent Overstreet
If the extent we're reading from changes, due to being overwritten or moved (possibly partially) - we need to reset bch_io_failures so that we don't accidentally mark a new extent as poisoned prematurely. This means we have to separately track (in the retry path) the extent we previously read from. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21bcachefs: bch2_subvolume_wait_for_pagecache_and_delete() cleanupKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21bcachefs: Check for casefolded dirents in non casefolded dirsKent Overstreet
Check for mismatches between casefold dirents and casefold directories. A mismatch will cause lookups to fail, as we'll be doing the lookup with the casefolded name, which won't match the non-casefolded dirent, and vice versa. Reported-by: Christopher Snowhill <chris@kode54.net> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21bcachefs: Fix bch2_dirent_create_snapshot() for casefoldingKent Overstreet
bch2_dirent_create_snapshot(), used in fsck, neglected to create a casefolded dirent. Just move this into dirent_create_key(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-21bcachefs: Fix casefold opt via xattr interfaceKent Overstreet
Changing the casefold option requires extra checks/work - factor out a helper from bch2_fileattr_set() for the xattr code to use. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-19bcachefs: mkwrite() now only dirties one pageKent Overstreet
Don't dirty the whole folio - fixes write amplification with applications doing mmaped writes. https://www.reddit.com/r/bcachefs/comments/1klzcg1/incredible_amounts_of_write_amplification_when/ Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-18bcachefs: fix extent_has_stripe_ptr()Kent Overstreet
This wasn't checking indirect extents. Fixes: https://github.com/koverstreet/bcachefs/issues/887 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-17bcachefs: Fix bch2_btree_path_traverse_cached() when paths reallocedKent Overstreet
btree_key_cache_fill() will allocate and traverse another path (for the underlying btree), so we can't hold pointers to paths across a call - we have to pass indices. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-14bcachefs: fix wrong arg to fsck_err()Kent Overstreet
fsck_err() needs the btree transaction passed to it if there is one - so that it can unlock/relock around prompting userspace for fixing the error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-14bcachefs: Fix missing commit in backpointer to missing targetKent Overstreet
Fsck wants to do transaction commits from an outer context; it may have other repair to do (i.e. duplicate backpointers). But when calling backpointer_not_found() from runtime code, i.e. runtime self healing, we should be doing the commit - the outer context expects to just be doing lookups. This fixes bugs where we get stuck spinning, reported as "RCU lock hold time warnings". Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-14bcachefs: Fix accidental O(n^2) in fiemapKent Overstreet
Since bch2_seek_pagecache_data() searches for dirty data, we only want to call it for holes in the extents btree - otherwise we have an accidental O(n^2), as we repeatedly search the same range. Reported-by: Marcin Mirosław <marcin@mejor.pl> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-14bcachefs: Fix set_should_be_locked() call in peek_slot()Kent Overstreet
set_should_be_locked() needs to be called before peek_key_cache(), which traverses other paths and may do a trans unlock/relock. This fixes an assertion pop in path_peek_slot(), when the path we're using is unexpectedly not uptodate. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-14bcachefs: Fix self deadlockAlan Huang
Before invoking bch2_accounting_mem_mod_locked in bch2_gc_accounting_done, we have already taken mark_lock for write; bch2_accounting_mem_insert then tries to lock mark_lock again. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-14bcachefs: Don't set btree nodes as accessed on fillKent Overstreet
Prevent jobs that do lots of scanning (i.e. evacuate, scrub) from causing OOMs. The shrinker code seems to be having issues when it doesn't do any freeing because it's just flipping off the accessed bit - and the accessed bit shouldn't be set on first use anyways. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-14bcachefs: Fix livelock in journal_entry_open()Kent Overstreet
When the journal is low on space, we might do discards from journal_res_get() -> journal_entry_open(). Make sure we set j->can_discard correctly, so that if we're low on space but not because discards aren't keeping up we don't livelock. Fixes: 8e4d28036c29 ("bcachefs: Don't aggressively discard the journal") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-14bcachefs: Fix broken btree_path lock invariants in next_node()Kent Overstreet
This fixes btree locking assert pops users were seeing during evacuate: https://github.com/koverstreet/bcachefs/issues/878

May 09 22:45:02 sharon kernel: bcachefs (68116e25-fa2d-4c6f-86c7-e8b431d792ae): bch2_btree_insert_node(): node not locked at level 1
May 09 22:45:02 sharon kernel: bch2_btree_node_rewrite [bcachefs]: watermark=btree no_check_rw alloc l=0-1 mode=none nodes_written=0 cl.remaining=2 journal_seq=0
May 09 22:45:02 sharon kernel: path: idx 1 ref 1:0 S B btree=alloc level=0 pos 0:3699637:0 0:3698012:1-0:3699637:0 bch2_move_btree.isra.0+0x1db/0x490 [bcachefs] uptodate 0 locks_want 2
May 09 22:45:02 sharon kernel: l=0 locks intent seq 4 node ffff8bd700c93600
May 09 22:45:02 sharon kernel: l=1 locks unlocked seq 1712 node ffff8bd6fd5e7a00
May 09 22:45:02 sharon kernel: l=2 locks unlocked seq 2295 node ffff8bd6cc725400
May 09 22:45:02 sharon kernel: l=3 locks unlocked seq 0 node 0000000000000000

Evacuate walks btree nodes with bch2_btree_iter_next_node() and rewrites them, bch2_btree_update_start() upgrades the path to take intent locks as far as it needs to. But next_node() does low level unlock/relock calls on individual nodes, and didn't handle the case where a path is supposed to be holding multiple intent locks. If a path has locks_want > 1, it needs to be either holding locks on all the btree nodes (at each level) requested, or none of them. Fix this with a bch2_btree_path_downgrade(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-14bcachefs: Don't strip rebalance_opts from indirect extentsKent Overstreet
Fix bch2_bkey_clear_needs_rebalance(): indirect extents are never supposed to have bch_extent_rebalance stripped off, because that's how we get the IO path options when we don't have the original inode it belonged to. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-12crypto: lib/chacha - add strongly-typed state zeroizationEric Biggers
Now that the ChaCha state matrix is strongly-typed, add a helper function chacha_zeroize_state() which zeroizes it. Then convert all applicable callers to use it instead of direct memzero_explicit. No functional changes. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-05-12crypto: lib/chacha - strongly type the ChaCha stateEric Biggers
The ChaCha state matrix is 16 32-bit words. Currently it is represented in the code as a raw u32 array, or even just a pointer to u32. This weak typing is error-prone. Instead, introduce struct chacha_state:

    struct chacha_state {
            u32 x[16];
    };

Convert all ChaCha and HChaCha functions to use struct chacha_state. No functional changes. Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
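The benefit of the wrapper struct can be shown with a small userspace sketch. The names `chacha_state_sketch` and `state_zeroize_sketch()` are invented for this example, and the volatile store loop is only a stand-in for the kernel's memzero_explicit(); none of this is the kernel code itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the strongly-typed state: exactly 16 words. */
struct chacha_state_sketch {
	uint32_t x[16];
};

/*
 * Zeroize the whole state. Taking a struct pointer means the compiler
 * rejects a bare uint32_t pointer or an array of the wrong length,
 * which a `uint32_t *state` parameter would silently accept.
 */
static void state_zeroize_sketch(struct chacha_state_sketch *state)
{
	volatile uint32_t *p = state->x;	/* volatile: keep the stores */
	size_t i;

	assert(state != NULL);
	for (i = 0; i < sizeof(state->x) / sizeof(state->x[0]); i++)
		p[i] = 0;
}
```

With the struct in place, the size of the state travels with the type, so helpers like this one do not need a separate length argument.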
2025-05-11sort.h: hoist cmp_int() into generic header fileFedor Pchelkin
Deduplicate the same functionality implemented in several places by moving the cmp_int() helper macro into linux/sort.h. The macro performs a three-way comparison of the arguments mostly useful in different sorting strategies and algorithms. Link: https://lkml.kernel.org/r/20250427201451.900730-1-pchelkin@ispras.ru Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Suggested-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Kent Overstreet <kent.overstreet@linux.dev> Acked-by: Coly Li <colyli@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Carlos Maiolino <cem@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Coly Li <colyli@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
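A three-way comparison of this kind is typically written as the difference of two boolean tests. The sketch below is a userspace illustration with hypothetical names (`cmp_int_sketch`, `cmp_u64_sketch`); the exact macro body in linux/sort.h may differ in detail.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Three-way compare: yields -1, 0 or 1. Unlike `l - r`, it cannot
 * overflow or be truncated when the operands are wider than int.
 */
#define cmp_int_sketch(l, r)	(((l) > (r)) - ((l) < (r)))

/* qsort() comparator for unsigned 64-bit values. */
static int cmp_u64_sketch(const void *a, const void *b)
{
	unsigned long long l = *(const unsigned long long *)a;
	unsigned long long r = *(const unsigned long long *)b;

	return cmp_int_sketch(l, r);
}
```

A comparator built this way can be passed straight to qsort(), for example `qsort(v, n, sizeof(v[0]), cmp_u64_sketch)`.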
2025-05-08treewide, timers: Rename destroy_timer_on_stack() as timer_destroy_on_stack()Ingo Molnar
Move this API to the canonical timer_*() namespace. Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250507175338.672442-10-mingo@kernel.org
2025-05-07bcachefs: Don't aggressively discard the journalKent Overstreet
We frequently use 'bcachefs list_journal -a' for debugging, as it provides a record of all btree transactions, and a history of what happened. But it's not so useful if we immediately discard journal buckets right after they're no longer dirty. This tweaks journal reclaim to only discard when we're low on space, keeping the journal mostly un-discarded. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07bcachefs: Ensure superblock gets written when we go EROKent Overstreet
When we go emergency read-only, make sure we do a final write_super() to persist counters and error counts - this can be critical for piecing together what fsck was doing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07bcachefs: Filter out harmless EROFS error messagesKent Overstreet
These just indicate that we're shutting down. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07bcachefs: journal_shutdown is EROFS, not EIOKent Overstreet
We often filter out EROFS errors to avoid log spew after an emergency shutdown - journal_shutdown is just another emergency shutdown error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05bcachefs: Call bch2_fs_start before getting vfs superblockKent Overstreet
This reverts 1fdbe0b184c8 ("bcachefs: Make sure c->vfs_sb is set before starting fs"), which switched up bch2_fs_get_tree() so that we got a superblock before calling bch2_fs_start, so that c->vfs_sb would always be initialized while the filesystem was active. This turned out not to be necessary, because blk_holder_ops were implemented using our own locking, not vfs locking. And this had the side effect of creating a super_block and doing our full recovery (including potentially fsck) before setting SB_BORN, which causes things like sync calls to hang until our recovery is finished. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05bcachefs: fix hung task timeout in journal readKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05bcachefs: Add missing barriers before wake_up_bit()Kent Overstreet
wake_up() doesn't require a barrier - but wake_up_bit() does. This only affected non x86, and primarily led to lost wakeups after btree node reads. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-05bcachefs: Ensure proper write alignmentKent Overstreet
There was a buggy version of bcachefs-tools which picked misaligned bucket sizes when formatting, and we're also about to do dynamic block sizes - which will allow picking logical block size or physical block size of the device per-write, allowing for better compression ratios at the cost of slightly worse write performance (i.e. forcing the device to do RMW or extra buffering). To account for this, tweak bch2_alloc_sectors_start() to properly align open_buckets to the blocksize of the write we're about to do. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
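The arithmetic behind this kind of alignment can be sketched as rounding the usable space down to a whole number of blocks. This is only an illustration of the idea: `usable_sectors_sketch()` and its parameters are hypothetical, and the actual change to bch2_alloc_sectors_start() may be structured differently.

```c
#include <assert.h>

/*
 * Given the sectors free in an open bucket and the block size (in
 * sectors) of the write about to be issued, return how many sectors
 * can be used so that the write ends on a block boundary.
 */
static unsigned usable_sectors_sketch(unsigned sectors_free,
				      unsigned block_sectors)
{
	assert(block_sectors != 0);
	return sectors_free - (sectors_free % block_sectors);
}
```

With a misaligned bucket size, the trailing partial block is simply left unused rather than producing a write smaller than one block.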
2025-05-05bcachefs: Improve want_cached_ptr()Kent Overstreet
If promote target isn't set, rebalance should still leave a cached copy on the faster device. Fall back to foreground_target if it's set, or allow a cached copy on any device if neither are set. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
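The fallback order described above can be written as a small decision function. This is a simplified sketch under stated assumptions: targets are modelled as plain integers with 0 meaning "not set", and a device matches a target only by equality. The real want_cached_ptr() works on bcachefs target and device structures, so the names and types here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether a cached copy may stay on the device identified by
 * dev_target. Assumption for this sketch: 0 means "target not set".
 */
static bool want_cached_ptr_sketch(unsigned promote_target,
				   unsigned foreground_target,
				   unsigned dev_target)
{
	/* Prefer promote_target, fall back to foreground_target. */
	unsigned target = promote_target ? promote_target : foreground_target;

	/* Neither target is set: allow a cached copy on any device. */
	if (!target)
		return true;

	return dev_target == target;
}
```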
2025-05-04bcachefs: thread_with_stdio: fix spinning instead of exitingKent Overstreet
bch2_stdio_redirect_vprintf() was missing a check for stdio->done, i.e. exiting. This caused the thread attempting to print to spin, and since it was being called from the kthread run by thread_with_stdio, the userspace side hung as well. Change it to return -EPIPE - i.e. writing to a pipe that's been closed. Reported-by: Jan Solanti <jhs@psonet.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-01bcachefs: Remove incorrect __counted_by annotationAlan Huang
This actually reverts 86e92eeeb237 ("bcachefs: Annotate struct bch_xattr with __counted_by()"). After the x_name, there is a value. According to the discussion[1], __counted_by assumes that the flexible array member contains exactly the number of elements specified. Users have now come across a false positive detection of an out-of-bounds write caused by the __counted_by here[2], so revert that. [1] https://lore.kernel.org/lkml/Zv8VDKWN1GzLRT-_@archlinux/T/#m0ce9541c5070146320efd4f928cc1ff8de69e9b2 [2] https://privatebin.net/?a0d4e97d590d71e1#9bLmp2Kb5NU6X6cZEucchDcu88HzUQwHUah8okKPReEt Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
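The layout problem can be seen in a small userspace sketch. The struct and helpers below are hypothetical and simplified (the real struct bch_xattr has more fields); the point is only that the flexible array holds the name followed by the value, so a bounds annotation based on the name length alone would flag legitimate writes of the value.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in: the array holds name bytes, then value bytes. */
struct xattr_sketch {
	uint8_t  name_len;
	uint16_t val_len;
	uint8_t  name[];	/* name_len + val_len bytes in total */
};

static struct xattr_sketch *xattr_new_sketch(const char *name,
					     const void *val, uint16_t val_len)
{
	size_t name_len = strlen(name);
	struct xattr_sketch *x;

	if (name_len > UINT8_MAX)
		return NULL;
	x = malloc(sizeof(*x) + name_len + val_len);
	if (!x)
		return NULL;

	x->name_len = (uint8_t)name_len;
	x->val_len  = val_len;
	memcpy(x->name, name, name_len);
	/*
	 * This write starts at index name_len. If the array were
	 * annotated as counted by name_len, a bounds checker would
	 * treat it as out of bounds even though the allocation is
	 * large enough.
	 */
	memcpy(x->name + name_len, val, val_len);
	return x;
}

/* The value lives immediately after the name. */
static const uint8_t *xattr_val_sketch(const struct xattr_sketch *x)
{
	return x->name + x->name_len;
}
```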
2025-05-01bcachefs: add missing sched_annotate_sleep()Kent Overstreet
00594 ------------[ cut here ]------------
00594 do not call blocking ops when !TASK_RUNNING; state=2 set at [<000000003e51ef4a>] prepare_to_wait_event+0x5c/0x1c0
00594 WARNING: CPU: 12 PID: 1117 at kernel/sched/core.c:8741 __might_sleep+0x74/0x88
00594 Modules linked in:
00594 CPU: 12 UID: 0 PID: 1117 Comm: umount Not tainted 6.15.0-rc4-ktest-g3a72e369412d #21845 PREEMPT
00594 Hardware name: linux,dummy-virt (DT)
00594 pstate: 60001005 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
00594 pc : __might_sleep+0x74/0x88
00594 lr : __might_sleep+0x74/0x88
00594 sp : ffffff80c8d67a90
00594 x29: ffffff80c8d67a90 x28: ffffff80f5903500 x27: 0000000000000000
00594 x26: 0000000000000000 x25: ffffff80cf5002a0 x24: ffffffc087dad000
00594 x23: ffffff80c8d67b40 x22: 0000000000000000 x21: 0000000000000000
00594 x20: 0000000000000242 x19: ffffffc080b92020 x18: 00000000ffffffff
00594 x17: 30303c5b20746120 x16: 74657320323d6574 x15: 617473203b474e49
00594 x14: 0000000000000001 x13: 00000000000c0000 x12: ffffff80facc0000
00594 x11: 0000000000000001 x10: 0000000000000001 x9 : ffffffc0800b0774
00594 x8 : c0000000fffbffff x7 : ffffffc087dac670 x6 : 00000000015fffa8
00594 x5 : ffffff80facbffa8 x4 : ffffff80fbd30b90 x3 : 0000000000000000
00594 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffffff80f5903500
00594 Call trace:
00594 __might_sleep+0x74/0x88 (P)
00594 __mutex_lock+0x64/0x8d8
00594 mutex_lock_nested+0x28/0x38
00594 bch2_fs_ec_flush+0xf8/0x128
00594 __bch2_fs_read_only+0x54/0x1d8
00594 bch2_fs_read_only+0x3e0/0x438
00594 __bch2_fs_stop+0x5c/0x250
00594 bch2_put_super+0x18/0x28
00594 generic_shutdown_super+0x6c/0x140
00594 bch2_kill_sb+0x1c/0x38
00594 deactivate_locked_super+0x54/0xd0
00594 deactivate_super+0x70/0x90
00594 cleanup_mnt+0xec/0x188
00594 __cleanup_mnt+0x18/0x28
00594 task_work_run+0x90/0xd8
00594 do_notify_resume+0x138/0x148
00594 el0_svc+0x9c/0xa0
00594 el0t_64_sync_handler+0x104/0x130
00594 el0t_64_sync+0x154/0x158
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>