Age | Commit message (Collapse) | Author |
|
This started showing up more when we started logging the error being
corrected in the journal - but __bch2_fsck_err() could return
transaction restarts before that.
Setting BCH_FS_error incorrectly causes recovery passes to not be
cleared, among other issues.
Fixes: b43f72492768 ("bcachefs: Log fsck errors in the journal")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Additional fix on top of
f54b2a80d0df bcachefs: Fix misaligned bucket check in journal space calculations
Make sure that when we calculate space for the next entry it's not
misaligned: we need to round_down() to filesystem block size in multiple
places (next entry size calculation as well as total space available).
Reported-by: Ondřej Kraus <neverberlerfellerer@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
if (!(in_recovery && (flags & RUN_RECOVERY_PASS_nopersistent)))
should have been
if (!in_recovery && !(flags & RUN_RECOVERY_PASS_nopersistent)))
But the !in_recovery part was also wrong: the assumption is that if
we're in recovery we'll just rewind and run the recovery pass
immediately, but we're not able to do so if we've already gone RW and
the pass must be run before we go RW. In that case, we need to schedule
it in the superblock so it can be run on the next mount attempt.
Scheduling it persistently is fine, because it'll be cleared in the
superblock immediately when the pass completes successfully.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Since we're accessing btree_trans objects owned by another thread, we
need to guard against using pointers to freed key cache entries: we need
our own srcu read lock, and we should skip a btree_trans if it didn't
hold the srcu lock (and thus it might have pointers to freed key cache
entries).
00693 Mem abort info:
00693 ESR = 0x0000000096000005
00693 EC = 0x25: DABT (current EL), IL = 32 bits
00693 SET = 0, FnV = 0
00693 EA = 0, S1PTW = 0
00693 FSC = 0x05: level 1 translation fault
00693 Data abort info:
00693 ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
00693 CM = 0, WnR = 0, TnD = 0, TagAccess = 0
00693 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
00693 user pgtable: 4k pages, 39-bit VAs, pgdp=000000012e650000
00693 [000000008fb96218] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
00693 Internal error: Oops: 0000000096000005 [#1] SMP
00693 Modules linked in:
00693 CPU: 0 UID: 0 PID: 4307 Comm: cat Not tainted 6.16.0-rc2-ktest-g9e15af94fd86 #27578 NONE
00693 Hardware name: linux,dummy-virt (DT)
00693 pstate: 60001005 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
00693 pc : six_lock_counts+0x20/0xe8
00693 lr : bch2_btree_bkey_cached_common_to_text+0x38/0x130
00693 sp : ffffff80ca98bb60
00693 x29: ffffff80ca98bb60 x28: 000000008fb96200 x27: 0000000000000007
00693 x26: ffffff80eafd06b8 x25: 0000000000000000 x24: ffffffc080d75a60
00693 x23: ffffff80eafd0000 x22: ffffffc080bdfcc0 x21: ffffff80eafd0210
00693 x20: ffffff80c192ff08 x19: 000000008fb96200 x18: 00000000ffffffff
00693 x17: 0000000000000000 x16: 0000000000000000 x15: 00000000ffffffff
00693 x14: 0000000000000000 x13: ffffff80ceb5a29a x12: 20796220646c6568
00693 x11: 72205d3e303c5b20 x10: 0000000000000020 x9 : ffffffc0805fb6b0
00693 x8 : 0000000000000020 x7 : 0000000000000000 x6 : 0000000000000020
00693 x5 : ffffff80ceb5a29c x4 : 0000000000000001 x3 : 000000000000029c
00693 x2 : 0000000000000000 x1 : ffffff80ef66c000 x0 : 000000008fb96200
00693 Call trace:
00693 six_lock_counts+0x20/0xe8 (P)
00693 bch2_btree_bkey_cached_common_to_text+0x38/0x130
00693 bch2_btree_trans_to_text+0x260/0x2a8
00693 bch2_btree_transactions_read+0xac/0x1e8
00693 full_proxy_read+0x74/0xd8
00693 vfs_read+0x90/0x300
00693 ksys_read+0x6c/0x108
00693 __arm64_sys_read+0x20/0x30
00693 invoke_syscall.constprop.0+0x54/0xe8
00693 do_el0_svc+0x44/0xc8
00693 el0_svc+0x18/0x58
00693 el0t_64_sync_handler+0x104/0x130
00693 el0t_64_sync+0x154/0x158
00693 Code: 910003fd f9423c22 f90017e2 d2800002 (f9400c01)
00693 ---[ end trace 0000000000000000 ]---
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Fix btree node read retries after validate errors:
__btree_err() is the wrong place to flag a topology error: that is done
by btree_lost_data().
Additionally, some calls to bch2_bkey_pick_read_device() were not
updated in the 6.16 rework for improved log messages; we were failing to
signal that we still had a retry.
Cc: Nikita Ofitserov <himikof@gmail.com>
Cc: Alan Huang <mmpgouride@gmail.com>
Reported-and-tested-by: Edoardo Codeglia <bcachefs@404.blue>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Previously, btree node scan used the btree node cache to check if btree
nodes were readable, but this is subject to interference from threads
scanning different devices trying to read the same node - and more
critically, nodes that we already attempted and failed to read before
kicking off scan.
Instead, we now allocate a 'struct btree' that does not live in the
btree node cache, and call bch2_btree_node_read_done() directly.
Cc: Nikita Ofitserov <himikof@gmail.com>
Reviewed-by: Nikita Ofitserov <himikof@gmail.com>
Reported-and-tested-by: Edoardo Codeglia <bcachefs@404.blue>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
btree node scan needs to not use the btree node cache: that causes
interference from prior failed reads and parallel workers.
Instead we need to allocate btree nodes that don't live in the btree
cache, so that we can call bch2_btree_node_read_done() directly.
This patch tweaks the low level helpers so they don't touch the btree
cache lists.
Cc: Nikita Ofitserov <himikof@gmail.com>
Reviewed-by: Nikita Ofitserov <himikof@gmail.com>
Reported-and-tested-by: Edoardo Codeglia <bcachefs@404.blue>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The fix for when we should increase tree depth in journal replay was
entirely bogus.
We should only increase the tree depth in journal replay when recovery
from btree node scan, and then only for keys found by btree node scan.
This needs additional work - we should be shooting down existing
interior node pointers when recovery from scan, they shouldn't be
showing up here.
Fixes: b47a82ff4772 ("bcachefs: Only run 'increase_depth' for keys from btree node csan")
Cc: Alan Huang <mmpgouride@gmail.com>
Reported-by: syzbot+8deb6ff4415db67a9f18@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This wasn't updated when we added tracking for btree validate errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add a new version of fpunch for operating on a snapshot ID, not a
subvolume - and use it for "extent past end of inode" repair.
Previously, repair would try to delete everything at once, but deleting
too many extents at once can overflow the btree_trans bump allocator, as
well as causing other problems - the new helper properly uses
bch2_extent_trim_atomic().
Reported-and-tested-by: Edoardo Codeglia <bcachefs@404.blue>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add an option for completely disabling casefolding on a filesystem, as a
workaround for overlayfs.
This should only be needed as a temporary workaround, until the
overlayfs fix arrives.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Don't mark btree nodes for rewrites, if they are or would be degraded,
if journal replay hasn't finished, to avoid a deadlock.
This is because btree node rewrites generate more updates for the
interior updates (alloc, backpointers), and if those updates touch
new nodes and generate more rewrites - we can only have so many interior
btree updates in flight before we deadlock on open_buckets.
The biggest cause is that we don't use the btree write buffer (for
the backpointer updates - this needs some real thought on locking in
order to fix.
The problem with this workaround (not doing the rewrite for degraded
nodes in journal replay) is that those degraded nodes persist, and we
don't want that (this is a real bug when a btree node write completes
with fewer replicas than we wanted and leaves a degraded node due to
device _removal_, i.e. the device went away mid write).
It's less of a bug here, but still a problem because we don't yet
have a way of tracking degraded data - we another index (all
extents/btree nodes, by replicas entry) in order to fix properly
(re-replicate degraded data at the earliest possible time).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Reported-by: syzbot+cc7567f096079cb4146f@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Checking for invalid IDs was introduced in 9e7cfb35e266 ("bcachefs: Check for invalid btree IDs")
to prevent an invalid shift later, but since 141526548052 ("bcachefs: Bad btree roots are now autofix")
which made btree_root_bkey_invalid autofix, the fsck_err_on call didn't
do anything.
We can mark this err type (invalid_btree_id) autofix as well, so it gets
handled.
Reported-by: syzbot+029d1989099aa5ae3e89@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=029d1989099aa5ae3e89
Fixes: 141526548052 ("bcachefs: Bad btree roots are now autofix")
Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Fix a 6.16 regression from the recovery pass rework, which introduced a
bug where calling bch2_run_explicit_recovery_pass() would only return
the error code to rewind recovery for the first call that scheduled that
recovery pass.
If the error code from the first call was swallowed (because it was
called by an asynchronous codepath), subsequent calls would go "ok, this
pass is already marked as needing to run" and return 0.
Fixing this ensures that check_topology bails out to run btree_node_scan
before doing any repair.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Previously, calling bch2_btree_has_scanned_nodes() when btree node
scan hadn't actually run would erroniously return false - causing us to
think a btree was entirely gone.
This fixes a 6.16 regression from moving the scheduling of btree node
scan out of bch2_btree_lost_data() (fixing the bug where we'd schedule
it persistently in the superblock) and only scheduling it when
check_toploogy() is asking for scanned btree nodes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Autofix is specified in btree_gc.c if it's not an important btree.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
wait_on_allocator() emits debug info when we hang trying to allocate.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Reported-by: syzbot+d540192e763531d307ff@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Before calling bch2_indirect_extent_missing_error(), we have to
calculate the missing range, which is the intersection of the reflink
pointer and the non-indirect-extent we found.
The calculation didn't take into account that the returned extent may
span the iter position, leading to an infinite loop when we
(unnecessarily) resized the extent we were returning to one that didn't
extend past the offset we were looking up.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Make sure we return a standard error code.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
For now we only have one key type in these btrees, but forward
compatibility means we do have to check.
Reported-by: syzbot+b4cb4a6988aced0cec4b@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This fixes exceeding the bump allocator limit when the allocator finds
many buckets that need repair - they're repaired asynchronously, which
means that every error logged a message in the bump allocator, without
committing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Don't change buf->size on error - this would usually be a transaction
restart, but it could also be -ENOMEM - when we've exceeded the bump
allocator max).
Fixes: 247abee6ae6d ("bcachefs: btree_trans_subbuf")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The normal fsck code doesn't call key_visible_in_snapshot() with an
empty list of snapshot IDs seen (the current snapshot ID will always be
on the list), but str_hash_repair_key() ->
bch2_get_snapshot_overwrites() can, and that's totally fine as long as
we check for it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
check_directory_structure runs after check_dirents, so it expects that
it won't see any inodes with missing backpointers - normally.
But online fsck can't run check_dirents yet, or the user might only be
running a specific pass, so we need to be careful that this isn't an
error. If an inode is unreachable, that's handled by a separate pass.
Also, add a new 'bch2_inode_has_backpointer()' helper, since we were
doing this inconsistently.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
btree node scrub was sometimes failing to rewrite nodes with errors;
bch2_btree_node_rewrite() can return a transaction restart and we
weren't checking - the lockrestart_do() needs to wrap the entire
operation.
And there's a better helper it should've been using,
bch2_btree_node_rewrite_key(), which makes all this more convenient.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We can only pass negative error codes to bch2_err_str(); if it's a
positive integer it's not an error and we trip an assert.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
A path exists in a particular snapshot: we should do the pathwalk in the
snapshot ID of the inode we started from, _not_ change snapshot ID as we
walk inodes and dirents.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We can easily go from inode number -> path now, which makes for more
useful log messages.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Log the inode's new path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
When we find a directory connectivity problem, we should do the repair
in the oldest snapshot that has the issue - so that we don't end up
duplicating work or making a real mess of things.
Oldest snapshot IDs have the highest integer value, so - just walk
inodes in reverse order.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
bch_subvolume.fs_path_parent needs to be updated as well, it should
match inode.bi_parent_subvol.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The dirent will be in a different snapshot if the inode is a subvolume
root.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The bch2_subvolume_get_snapshot() call needs to happen before the dirent
lookup - the dirent is in the parent subvolume.
Also, check for loops.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
kthread creation checks for pending signals, which is _very_ annoying if
we have to do a long recovery and don't go rw until we've done
significant work.
Check if we'll be going rw and pre-allocate kthreads/workqueues.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Print out more info when we find a key (extent, dirent, xattr) for a
missing inode - was there a good inode in an older snapshot, full(ish)
list of keys for that missing inode, so we can make better decisions on
how to repair.
If it looks like it should've been deleted, autofix it. If we ever hit
the non-autofix cases, we'll want to write more repair code (possibly
reconstituting the inode).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
After cd3cdb1ef706 ("Single err message for btree node reads"),
all errors caused __btree_err to return -BCH_ERR_fsck_fix no matter what
the actual error type was if the recovery pass was scanning for btree
nodes. This lead to the code continuing despite things like bad node
formats when they earlier would have caused a jump to fsck_err, because
btree_err only jumps when the return from __btree_err does not match
fsck_fix. Ultimately this lead to undefined behavior by attempting to
unpack a key based on an invalid format.
Make only errors of type -BCH_ERR_btree_node_read_err_fixable cause
__btree_err to return -BCH_ERR_fsck_fix when scanning for btree nodes.
Reported-by: syzbot+cfd994b9cdf00446fd54@syzkaller.appspotmail.com
Fixes: cd3cdb1ef706 ("bcachefs: Single err message for btree node reads")
Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
btree_interior_update_pool has not been initialized before the
filesystem becomes read-write, thus mempool_alloc in bch2_btree_update_start
will trigger pool->alloc NULL pointer dereference in mempool_alloc_noprof
Reported-by: syzbot+2f3859bd28f20fa682e6@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
In syzbot's crash, the bset's u64s is larger than the btree node.
Reported-by: syzbot+bfaeaa8e26281970158d@syzkaller.appspotmail.com
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|