Age | Commit message (Collapse) | Author |
|
Print more information out about moving contexts - fold in the output of
the redundant bch2_data_jobs_to_text(), and also include information
relevant to whether move_data() should be blocked.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
When doing updates early in recovery, before we can go RW, we still want
to check that keys are valid at commit time - this moves key invalid
checking to before the "btree updates to journal" path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
If fsck finds a key that needs work done, the primary example being an
unlinked inode that needs to be deleted, and the key is in an internal
snapshot node, we have a bit of a conundrum.
The conundrum is that internal snapshot nodes are shared, and we in
general do updates in internal snapshot nodes because there may be
overwrites in some snapshots and not others, and this may affect other
keys referenced by this key (i.e. extents).
For example, we might be seeing an unlinked inode in an internal
snapshot node, but then in one child snapshot the inode might have been
reattached and might not be unlinked. Deleting the inode in the internal
snapshot node would be wrong, because then we'll delete all the extents
that the child snapshot references.
But if an unlinked inode does not have any overwrites in child
snapshots, we're fine: the inode is overwrritten in all child snapshots,
so we can do the deletion at the point of comonality in the snapshot
tree, i.e. the node where we found it.
This patch adds a new helper, bch2_propagate_key_to_snapshot_leaves(),
to handle the case where we need a to update a key that does have
overwrites in child snapshots: we copy the key to leaf snapshot nodes,
and then rewind fsck and process the needed updates there.
With this, fsck can now always correctly handle unlinked inodes found in
internal snapshot nodes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
After deleteing snapshots, we may be left with a snapshot tree where
some nodes only have one child, and we have a linear chain.
Interior snapshot nodes are never used directly (i.e. they never have
subvolumes that point to them), they are only referered to by child
snapshot nodes - hence, they are redundant.
The existing code talks about redundant snapshot nodes as forming and
equivalence class; i.e. nodes for which snapshot_t->equiv is equal. In a
given equivalence class, we only ever need a single key at a given
position - i.e. multiple versions with different snapshot fields are
redundant.
The existing snapshot cleanup code deletes these redundant keys, but not
redundant nodes. It turns out this is buggy, because we assume that
after snapshot deletion finishes we should only have a single key per
equivalence class, but the btree update path doesn't preserve this -
overwriting keys in old snapshots doesn't check for the equivalence
class being equal, and thus we can end up with duplicate keys in the
same equivalence class and fsck complaining about snapshot deletion not
having run correctly.
The equivalence class notion has been leaking out of the core snapshots
code and into too much other code, i.e. fsck, so this patch takes a
different approach: snapshot deletion now moves keys to the node in an
equivalence class being kept (the leafiest node) and then deletes the
redundant nodes in the equivalance class.
Some work has to be done to correctly delete interior snapshot nodes;
snapshot node depth and skiplist fields for descendent nodes have to be
fixed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The is_ancestor bitmap is at optimization for bch2_snapshot_is_ancestor;
once we get sufficiently close to the ancestor ID we're searching for we
test a bitmap.
But initialization of the is_ancestor bitmap was broken; we do it by
using bch2_snapshot_parent(), but we call that on nodes that haven't
been initialized yet with bch2_mark_snapshot().
Fix this by adding a separate loop in bch2_snapshots_read() for
initializing the is_ancestor bitmap, and also add some new debug asserts
for checking this sort of breakage in the future.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
In the bch2_mount() error path, we were calling
deactivate_locked_super(), which calls ->kill_sb(), which in our case
was calling bch2_fs_free() without __bch2_fs_stop().
This changes bch2_mount() to just call bch2_fs_stop() directly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
In https://github.com/koverstreet/bcachefs/issues/450, we're seeing
unexplained btree_path_relock_fail events - according to the information
currently in the tracepoint, it appears the relock should be succeeding.
This adds lock counts to the tracepoint to help track it down.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This fixes https://github.com/koverstreet/bcachefs-tools/issues/159
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
subvolume.c has gotten a bit large, this splits out a separate file just
for managing snapshot trees - BTREE_ID_snapshots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
In __bch2_buffered_write, if we fail to write to an entire !uptodate
folio, we have to back out the write, bail out and retry.
But we were missing an iov_iter_revert() call, so the data written to
the folio was lost and the rest of the write shifted to the wrong
offset.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
The folio_hole_offset() helper returns a mix of bool and int types.
The latter is to support a possible -EAGAIN error code when using
nonblocking locks. This is not only confusing, but the only caller
also essentially ignores errors outside of stopping the range
iteration. This means an -EAGAIN error can't return directly from
folio_hole_offset() and may be lost via bch2_clamp_data_hole().
Fix up the error handling and make it more readable.
__filemap_get_folio() returns -ENOENT instead of NULL when no folio
exists, so reuse the same error code in folio_hole_offset(). Fix up
bch2_seek_pagecache_hole() to return the current offset on -ENOENT,
but otherwise return unexpected error code up to the caller.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
For extents, we increase the number of bits of the size field to allow
extents to get bigger due to merging - but this code didn't check for
overflow.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
- There was no need for a retry loop in bch2_extent_fallocate(); if we
have to retry we may be overwriting something different and we need
to return an error and let the caller retry.
- The bch2_alloc_sectors_start() error path was wrong, and wasn't
running our cleanup at the end of the function
This also fixes a very rare open bucket leak due to the missing cleanup.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This fixes a bug in the cycle detector, bch2_check_for_deadlock() - we
have to make sure the node pointers in the btree paths array are set to
something not-garbage before another thread may see them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This fixes the device removal tests, which have been failing at random
due to the fact that when we're running the .key_invalid checks in the
write path the key may actually no longer exist - we might be racing
with the keys being deleted.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
To ensure we aren't shooting ourselves in the foot after merge for
potentially doing future revisions for dirent or for storing multiple
names for casefolding, limit this to 512 for now.
Previously this define was linked to the max size a d_name in
bch_dirent could be.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Avoids doing a full strnlen for getting the length of the name of a
dirent entry.
Given the fact that the name of dirents is stored at the end of the
bkey's value, and we know the length of that in u64s, we can find the
last u64 and figure out how many NUL bytes are at the end of the string.
On little endian systems this ends up being the leading zeros of the
last u64, whereas on big endian systems this ends up being the trailing
zeros of the last u64.
We can take that value in bits and divide it by 8 to get the number of
NUL bytes at the end.
There is no endian-fixup or other compatibility here as this is string
data interpreted as a u64.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
A nice cleanup that avoids a bunch of open-coding name/string usage
around dirent usage.
Will be used by casefolding impl in future commits.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We're hunting for an open_bucket leak, add an assertion to help track it
down: also, we can't use the bch_fs after dropping our write ref to it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Six locks do lock handoff via the wakeup path: the thread doing the
wakeup also takes the lock on behalf of the waiter, which means the
waiter only has to look at its waitlist entry, and doesn't have to touch
the lock cacheline while another thread is using it.
Linus noticed that this needs a real barrier, which this patch fixes.
Also add a comment for the should_sleep_fn() error path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: linux-bcachefs@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This will be used when we need to re-hash a directory tree when setting
flags.
It is not possible to have concurrent btree_trans on a thread.
Signed-off-by: Joshua Ashton <joshua@froggi.es>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Now we also print the open_buckets owned by each write_point - this is
to help with debugging a shutdown hang.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We were failing to upgrade to the latest compatible version - whoops.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This fixes the replicas_write_errors test: the patch
bcachefs: mark journal replicas before journal write submission
partially fixed replicas marking for the journal, but it broke the case
where one replica failed - this patch re-adds marking after the journal
write completes, when we know how many replicas succeeded.
Additionally, we do not consider it a fsck error when the very last
journal entry is not correctly marked, since there is an inherent race
there.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Split out a new file from recovery.c for managing the list of keys we
read from the journal: before journal replay finishes the btree iterator
code needs to be able to iterate over and return keys from the journal
as well, so there's a fair bit of code here.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Pull code for bch_sb_field_clean out into its own file.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Split out a new file for bch_sb_field_members - we'll likely want to
move more code here in the future.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We now have
btree_trans_commit.c
btree_update.c
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
fs-io.c is too big - time for some reorganization
- fs-dio.c: direct io
- fs-pagecache.c: pagecache data structures (bch_folio), utility code
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
On old kernels, kmalloc() may return an allocation that's not naturally
aligned - this resulted in a bug where we allocated a bio with not
enough biovecs. Fix this by using buf_pages().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
This fixes a bug where we were already passing bkey_invalid_flags
around, but treating the parameter as just read/write - so the compat
code wasn't being run correctly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Awhile back, we changed bkey_format generation to ensure that the packed
representation could never represent fields larger than the unpacked
representation.
This was to ensure that bkey_packed_successor() always gave a sensible
result, but in the current code bkey_packed_successor() is only used in
a debug assertion - not for anything important.
This kills the requirement that we've gotten rid of those weird bkey
formats, and instead changes the assertion to check if we're dealing
with an old weird bkey format.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
our debug mode assertions in bkey.c haven't been getting run, whoops
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Add error messages when we fail to lookup an inode, and also add a few
missing bch2_err_class() calls.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We've observed significant lock thrashing on fstests generic/083 in
fallocate, due to dropping and retaking btree locks when checking the
pagecache for data.
This adds a nonblocking mode to bch2_clamp_data_hole(), where we only
use folio_trylock(), and can thus be used safely while btree locks are
held - thus we only have to drop btree locks as a fallback, on actual
lock contention.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|