summaryrefslogtreecommitdiff
path: root/fs/bcachefs
AgeCommit message (Collapse)Author
2025-03-14bcachefs: Use max() to improve gen_after()Thorsten Blum
Use max() to simplify gen_after() and improve its readability. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Remove unnecessary byte allocationThorsten Blum
The extra byte is not used - remove it. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: We no longer read stripes into memory at startupKent Overstreet
And the stripes heap gets deleted. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: trace_stripe_createKent Overstreet
Add a simple tracepoint for stripe creation, we'll want to expand this later. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: get_existing_stripe() uses new stripe lruKent Overstreet
Convert to the new persistent stripe LRU. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: ec_stripe_delete() uses new stripe lruKent Overstreet
Convert to the new persistent stripe LRU. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: journal write path commentKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Kick devices out after too many write IO errorsKent Overstreet
We're improving our handling of write errors - we shouldn't write degraded data just because a write failed once, we should retry it (on other devices, if possible). But for this to work, we need to kick devices out when they're only returning errors - otherwise those retries will loop infinitely. This adds a configurable timeout - if writes are failing for too long, we'll set that device read-only. In the future we should also implement more tracking and another knob for an "allowed error rate", so that we can kick out drives that are acting "unhealthy". Another thing we'll want is a mechanism (likely in userspace) for bringing a device back in after a transient error - perhaps a cable was jiggled, or there was a controller reset. After transient errors we also need a mechanism to walk (from the journal) recent btree updates that weren't flushed to that device and treat them as "degraded", since unflushed data may well not have been written. Out of scope for this patch, but becoming relevant. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Change BCH_MEMBER_STATE_failed semanticsKent Overstreet
Previously, we woudn't try to read at all from a failed device - that doesn't make much sense, the device may be unhealthy (perhaps taking longer than it should to service reads), but if it's our only option we should still try to read from it. Now, bch2_bkey_pick_read_device() will pick failed devices only if there are no non-failed replicas to read from. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: bch2_dev_get_ioref() may now sleepKent Overstreet
The next patch implementing freezing will change bch2_dev_get_ioref() to sleep if a device is currently frozen. Add an annotation and fix the journal code accordingly. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Fix btree_node_scan io_ref handlingKent Overstreet
This was completely fubar; it's now simplified a bit as well. Note that for_each_online_member() takes and releases io_refs as it iterates, so we need to release that if we break. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Implement blk_holder_opsKent Overstreet
We can't use the standard fs_holder_ops because they're meant for single device filesystems - fs_bdev_mark_dead() in particular - and they assume that the blk_holder is the super_block, which also doesn't work for a multi device filesystem. These generally follow the standard fs_holder_ops; the locking/refcounting is a bit simplified because c->ro_ref suffices, and bch2_fs_bdev_mark_dead() is not necessarily shutting down the entire filesystem. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Make sure c->vfs_sb is set before starting fsKent Overstreet
This is necessary for the new blk_holder_ops, which want the vfs super_block available for synchronization. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Stash a pointer to the filesystem for blk_holder_opsKent Overstreet
Note that we open block devices before we allocate bch_fs, but once attached to a filesystem they will be closed before the bch_fs is torn down - so stashing a pointer without a refcount looks incorrect but it's not. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Finish bch2_account_io_completion() conversionsKent Overstreet
More prep work for automatically kicking devices out after too many IO errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: bch2_account_io_completion()Kent Overstreet
We need to start accounting successes for every IO, not just failures, so introduce a unified hook for io completion accounting and convert io_read.c. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Fix read path io_ref handlingKent Overstreet
We were using our device pointer after we'd released our ref to it. Unlikely to be a race that's practical to hit, since actually removing a member device is a whole process besides just taking it offline, but - needs to be fixed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: data_update now checks for extents that can't be movedKent Overstreet
If a device is ro or failed, we might not have anywhere to move a replica. Check for this early, before doing the read and attempting to write. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: give bch2_write_super() a proper error codeKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: bcachefs_metadata_version_extent_flagsKent Overstreet
This implements a new extent field bitflags that apply to the whole extent. There's been a couple things we've wanted this for in the past, but the immediate need is extent poisoning, to solve a rebalance issue. Unknown extent fields can't be parsed (we won't known their size, so we can't advance to the next field), so this is an incompat feature, and using it prevents the filesystem from being mounted by old versions. This also adds the BCH_EXTENT_poisoned flag; this indicates that the data is known to be bad (i.e. there was a checksum error, and we had to write a new checksum) and reads will return errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: bch2_request_incompat_feature() now returns error codeKent Overstreet
For future usage, we'll want a dedicated error code for better debugging. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Fix error type in bch2_alloc_v3_validate()Thorsten Blum
Use error type alloc_v3_unpack_error in bch2_alloc_v3_validate(). Fixes: b65db750e2bb ("bcachefs: Enumerate fsck errors") Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: BCH_SB_FEATURES_ALL includes BCH_FEATURE_incompat_verison_fieldKent Overstreet
These features are set on format and incompat upgarde. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: sysfs internal/trigger_btree_updatesKent Overstreet
Add a debug knob to manually trigger the btree updates worker. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: bcachefs_metadata_version_casefoldingJoshua Ashton
This patch implements support for case-insensitive file name lookups in bcachefs. The implementation uses the same UTF-8 lowering and normalization that ext4 and f2fs is using. More information is provided in Documentation/bcachefs/casefolding.rst Compatibility notes: This uses the new versioning scheme for incompatible features where an incompatible feature is tied to a version number: the superblock says "we may use incompat features up to x" and "incompat features up to x are in use", disallowing mounting by previous versions. Additionally, and old style incompat feature bit is used, so that kernels without utf8 casefolding support know if casefolding specifically is in use and they're allowed to mount. Signed-off-by: Joshua Ashton <joshua@froggi.es> Cc: André Almeida <andrealmeid@igalia.com> Cc: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Split out dirent alloc and name initializationJoshua Ashton
Splits out the code that allocates the dirent and initializes the name to make things easier to implement casefolding in a future commit. Cc: André Almeida <andrealmeid@igalia.com> Cc: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Joshua Ashton <joshua@froggi.es> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Kill dirent_occupied_size() in create pathKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Kill dirent_occupied_size() in rename pathKent Overstreet
Cc: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: bcachefs_metadata_version_stripe_lruKent Overstreet
Add a persistent LRU for stripes, ordered by "number of empty blocks", i.e. order in which we wish to reuse them. This will replace the in-memory stripes heap, so we can kill off reading stripes into memory at startup. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: bcachefs_metadata_version_stripe_backpointersKent Overstreet
Stripes now have backpointers. This is needed for proper scrub - stripe checksums need to be verified, separately from extents within the stripe, since a block may not be full of live extents but it's still needed for reconstruct. And this will be needed for (efficient) evacuate/repair paths. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Advance bch_alloc.oldest_gen if no stale pointersKent Overstreet
Now that we've got cached backpointers and aren't leaving around stale pointers on bucket invalidation, we no longer need the periodic (rare) gc_gens - which recalculates each bucket's oldest gen to avoid wraparound. We can't delete that code because we've got to support existing filesystems that will still have stale pointers, but this gets rid of another scalability limit. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Invalidate cached data by backpointersKent Overstreet
If we don't leave stale pointers around, we won't have to deal with bucket gen wraparound. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: bcachefs_metadata_version_cached_backpointersKent Overstreet
Cached pointers now have backpointers. This means that we'll be able to kill cached pointers in the bucket_invalidate path, when invalidating/reusing buckets containing cached data, instead of leaving them around to be cleaned up by gc_gens garbago collection - which requires a full metadata scan. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: rework bch2_trans_commit_run_triggers()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Better trigger orderingKent Overstreet
Transactional triggers need to run in a defined ordering, which is not quite the same as btree ID integer comparison. Previously this was handled in a hacky way in bch2_trans_commit_run_triggers(), since it was only the alloc btree that needed special handling, but upcoming stripe btree changes are going to require more ordering changes - so, define that ordering. Next patch will change the transaction commit path to use it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: bch2_trigger_stripe_ptr() no longer uses ec_stripes_heap_lockKent Overstreet
Introduce per-entry locks, like with struct bucket - the stripes heap is going away. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Rework bch2_check_lru_key()Kent Overstreet
It's now easier to add new LRU types. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: decouple bch2_lru_check_set() from alloc btreeKent Overstreet
Pass in the backpointer explicitly, instead of assuming 'referring_k' is an alloc key and calculating it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: s/BCH_LRU_FRAGMENTATION_START/BCH_LRU_BUCKET_FRAGMENTATION/Kent Overstreet
FRAGMENTATION_START was incorrect, there's currently only one fragmentation LRU (at the end of the reserved bits for LRU type), and we're getting ready to add a stripe fragmentation lru - so give it a better name. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: bch2_lru_change() checks for no-opKent Overstreet
Minor cleanup, no reason for the caller to have to this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: minor journal errcode cleanupKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: bch2_write_op_error() now prints info about data updateKent Overstreet
A user has been seeing the "error verifying existing checksum while rewriting existing data (memory corruption?)" error. This generally indicates a hardware issue (and that may be the case here), but it might also indicate a bug, in which case we need more information to look for patterns. Reported-by: Roland Vet <vet.roland@protonmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: metadata_target is not an inode optionKent Overstreet
This option only applies filesystem wide. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: eytzinger1_{next,prev} cleanupAndreas Gruenbacher
The eytzinger code was previously relying on the following wrap-around properties and their "eytzinger0" equivalents: eytzinger1_prev(0, size) == eytzinger1_last(size) eytzinger1_next(0, size) == eytzinger1_first(size) However, these properties are no longer relied upon and no longer necessary, so remove the corresponding asserts and forbid the use of eytzinger1_prev(0, size) and eytzinger1_next(0, size). This allows to further simplify the code in eytzinger1_next() and eytzinger1_prev(): where the left shifting happens, eytzinger1_next() is trying to move i to the lowest child on the left, which is equivalent to doubling i until the next doubling would cause it to be greater than size. This is implemented by shifting i to the left so that the most significant bits align and then shifting i to the right by one if the result is greater than size. Likewise, eytzinger1_prev() is trying to move to the lowest child on the right; the same applies here. The 1-offset in (size - 1) in eytzinger1_next() isn't needed at all, but the equivalent offset in eytzinger1_prev() is surprisingly needed to preserve the 'eytzinger1_prev(0, size) == eytzinger1_last(size)' property. However, since we no longer support that property, we can get rid of these offsets as well. This saves one addition in each function and makes the code less confusing. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: convert eytzinger sort to be 1-based (2)Andreas Gruenbacher
In this second step, transform the eytzinger indexes i, j, and k in eytzinger1_sort_r() from 0-based to 1-based. This step looks a bit messy, but the resulting code is slightly better. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: convert eytzinger sort to be 1-based (1)Andreas Gruenbacher
In this first step, convert the eytzinger sort functions to use 1-based primitives. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: convert eytzinger0_find to be 1-basedAndreas Gruenbacher
Several of the algorithms on eytzinger trees are implemented in terms of the eytzinger0 primitives. However, those algorithms can just as easily be expressed in terms of the eytzinger1 primitives, and that leads to better and easier to understand code. Start by converting eytzinger0_find(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: Add eytzinger0_find self testAndreas Gruenbacher
Function eytzinger0_find() isn't currently covered, so add a self test. We can rely on eytzinger0_find_le() here because it is being tested independently. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: add eytzinger0_find_ge self testAndreas Gruenbacher
Add an eytzinger0_find_ge() self test similar to eytzinger0_find_gt(). Note that this test requires eytzinger0_find_ge() to return the first matching element in the array in case of duplicates. To prevent bisection errors, we only add this test after strenghening the original implementation (see the previous commit). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-03-14bcachefs: implement eytzinger0_find_ge directlyAndreas Gruenbacher
Implement eytzinger0_find_ge() directly instead of implementing it in terms of eytzinger0_find_le() and adjusting the result. This turns eytzinger0_find_ge() into a minimum search, so when there are duplicate elements, the result of eytzinger0_find_ge() will now always point at the first matching element. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>