| Age | Commit message | Author |
|
When btrfs_del_items() empties a leaf, it deletes the leaf unless it's
the root node. For the root leaf case, the code used to reset its level
to 0 via btrfs_set_header_level(). This is redundant as leaf nodes
always have level == 0.
Remove the unnecessary level assignment and invert the conditional to
handle only the non-root leaf deletion. The root leaf is correctly left
as-is.
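Roughly, the remaining empty-leaf handling looks like the following
(an illustrative sketch, not the exact diff):
    if (nritems == 0) {
            /* The root leaf is left as-is, non-root empty leaves get deleted. */
            if (leaf != root->node) {
                    btrfs_clear_buffer_dirty(trans, leaf);
                    ret = btrfs_del_leaf(trans, root, path, leaf);
                    if (ret < 0)
                            return ret;
            }
    }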
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
After releasing the path in btrfs_next_old_leaf(), we need to re-check
the leaf because a balance operation may have added items or removed the
last item. The original code handled this with two separate conditional
blocks, the second marked with a lengthy comment explaining a "missed
case".
Merge these two blocks into a single logical structure that handles both
scenarios more clearly.
Also update the comment to be more concise and accurate, incorporating the
explanation directly into the main block rather than a separate annotation.
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of incrementing the refcount on the 'left' node when it's referenced
by the path, simply transfer ownership to the path and set 'left' to NULL.
This eliminates:
- Unnecessary refcount increment/decrement operations
- Redundant conditional checks for left node cleanup
The path now consistently owns the reference to the left node whenever it
uses it.
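Conceptually the change is (a simplified sketch; locking and the rest of the
cleanup are omitted):
    /* Hand the reference we already hold over to the path. */
    path->nodes[level] = left;
    left = NULL;    /* the path owns the reference now */

    /* Later cleanup can stay unconditional, free_extent_buffer(NULL) is a no-op. */
    free_extent_buffer(left);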
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The balance_level() function is overly long and contains a cold code path
that handles promoting a child node to root when the root has only one item.
This code has distinct logic that is clearer and more maintainable when
isolated in its own function.
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The following functions are introduced as a middle step for bs > ps
support:
- rbio_stripe_step_paddr()
- rbio_pstripe_step_paddr()
- rbio_qstripe_step_paddr()
- sector_step_paddr_in_rbio()
The "_step" infix was needed because there was already an existing function
without the infix, with a different parameter list.
Now that the existing functions have been cleaned up, there is no need to
keep the "_step" infix, so remove it completely.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The support code for bs > ps is complete, enable it and update
assertions.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function finish_parity_scrub() assumes each fs block can be mapped by
one page, blocking bs > ps support for raid56.
Prepare it for bs > ps cases by:
- Introduce a helper, verify_one_parity_step()
Since the P/Q generation is always done in a vertical stripe, we have
to handle the range step by step.
- Only clear the rbio->dbitmap if all steps of an fs block match
- Remove rbio_stripe_paddr() and sector_paddr_in_rbio() helpers
Now we either use the paddrs version for checksum, or the step version
for P/Q generation/recovery.
- Make alloc_rbio_essential_pages() handle bs > ps cases
Since for bs > ps cases one fs block needs multiple pages, the
existing simple check against rbio->stripe_pages[] is not enough.
Extract a dedicated helper, alloc_rbio_sector_pages(), out of the
existing alloc_rbio_essential_pages(), which is still based on sector
numbers.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function rbio_add_io_paddr() assumes each fs block can be mapped by
one page, blocking bs > ps support for raid56.
Prepare it for bs > ps cases by:
- Introduce a helper bio_add_paddrs()
Previously we only needed to add a single page to a bio for a fs block,
but now we need to add multiple pages, which means the addition can fail
halfway. In that case we need to properly revert the bio (only its size
though).
- Rename rbio_add_io_paddr() to rbio_add_io_paddrs()
And change the @paddr parameter to @paddrs[].
- Change all callers to use the updated rbio_add_io_paddrs()
The @paddrs pointer used by the new function can be obtained from the
sector_paddrs_in_rbio() and rbio_stripe_paddrs() helpers.
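A hypothetical sketch of the bio_add_paddrs() idea (the parameter list and
types are assumptions, not the exact implementation):
    static bool bio_add_paddrs(struct bio *bio, const phys_addr_t *paddrs,
                               unsigned int nr_steps, unsigned int step_size)
    {
            const unsigned int orig_size = bio->bi_iter.bi_size;

            for (unsigned int i = 0; i < nr_steps; i++) {
                    struct page *page = pfn_to_page(PHYS_PFN(paddrs[i]));

                    if (bio_add_page(bio, page, step_size,
                                     offset_in_page(paddrs[i])) != step_size) {
                            /* Failed halfway, revert the bio size only. */
                            bio->bi_iter.bi_size = orig_size;
                            return false;
                    }
            }
            return true;
    }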
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function steal_rbio() assumes each fs block can be mapped by
one page, blocking bs > ps support for raid56.
Prepare it for bs > ps cases by:
- Introduce two helpers to calculate the sector number
Previously we assumed one page contains at least one fs block, and thus
could use something like "sectors_per_page = PAGE_SIZE / sectorsize;",
but with bs > ps support that number becomes 0.
Instead introduce two helpers:
* page_nr_to_sector_nr()
Returns the sector number of the first sector covered by the page.
* page_nr_to_num_sectors()
Return how many sectors are covered by the page.
And use the returned values for bitmap operations instead of the
open-coded "PAGE_SIZE / sectorsize".
Those helpers also have extra ASSERT()s to catch weird numbers.
- Use above helpers
The involved functions are:
* steal_rbio_page()
* is_data_stripe_page()
* full_page_sectors_uptodate()
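Roughly, the math behind the two helpers (hypothetical bodies; the real ones
carry the extra ASSERT()s mentioned above):
    static unsigned int page_nr_to_sector_nr(struct btrfs_raid_bio *rbio,
                                             unsigned int page_nr)
    {
            /* First sector whose data lands inside this stripe page. */
            return (page_nr << PAGE_SHIFT) / rbio->bioc->fs_info->sectorsize;
    }

    static unsigned int page_nr_to_num_sectors(struct btrfs_raid_bio *rbio,
                                               unsigned int page_nr)
    {
            const u32 sectorsize = rbio->bioc->fs_info->sectorsize;

            /* For bs > ps one page is fully covered by a single fs block. */
            if (sectorsize >= PAGE_SIZE)
                    return 1;
            return PAGE_SIZE / sectorsize;
    }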
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function set_bio_pages_uptodate() assumes each fs block can be mapped by
one page, blocking bs > ps support for raid56.
Prepare it for bs > ps cases by:
- Update find_stripe_sector_nr() to check only the first step paddr
We don't need to check each paddr, as the bios are still aligned to fs
block size, thus checking the first step is enough.
- Use step size to iterate the bio
This means we only need to find the sector number for the first step
of each fs block, and skip the remaining part.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function verify_bio_data_sectors() assumes each fs block can be mapped by
one page, blocking bs > ps support for raid56.
Prepare it for bs > ps cases by:
- Make get_bio_sector_nr() consider bs > ps cases
The function is utilized to calculate the sector number of a device
bio submitted by the btrfs raid56 layer.
- Assemble a local paddrs[] for checksum calculation
- Open code btrfs_check_block_csum()
btrfs_check_block_csum() only supports fs blocks backed by large
folios.
But for raid56 we can have fs blocks backed by multiple non-contiguous
pages, e.g. direct IO, encoded read/write/send.
So instead of using btrfs_check_block_csum(), open code it to use
btrfs_calculate_block_csum_pages().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function verify_one_sector() assumes each fs block can be mapped by
one page, blocking bs > ps support for raid56.
Prepare it for bs > ps cases by:
- Introduce helpers to get a paddrs pointer
Thankfully all the higher layer bios should still be aligned to fs
block size, thus a fs block should still be fully covered by the bio.
Introduce sector_paddrs_in_rbio() and rbio_stripe_paddrs(), which will
return a paddrs pointer inside btrfs_raid_bio::bio_paddrs[] or
stripe_paddrs[].
The pointer can be directly passed to
btrfs_calculate_block_csum_pages() to verify the checksum.
- Open code btrfs_check_block_csum()
btrfs_check_block_csum() only supports fs blocks backed by large
folios.
But for raid56 we can have fs blocks backed by multiple non-contiguous
pages, e.g. direct IO, encoded read/write/send.
So instead of using btrfs_check_block_csum(), open code it to use
btrfs_calculate_block_csum_pages().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently recover_vertical() assumes that every fs block can be mapped
by one page, this is blocking bs > ps support for raid56.
Prepare recover_vertical() to support bs > ps cases by:
- Introduce recover_vertical_step() helper
Which will recover a full step (min(PAGE_SIZE, sectorsize)).
Now recover_vertical() will do the error check for the specified
sector, do the recovery step by step, then do the sector verification.
- Fix a spelling error of get_rbio_vertical_errors()
The old name has a typo: "veritical".
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Unlike btrfs_calculate_block_csum_pages(), we cannot handle multiple
pages at the same time for P/Q generation.
So here we introduce a new @step_nr, and various helpers to grab the
sub-block page from the rbio, and generate the P/Q stripe page by page.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since we cannot ensure that all bios from the higher layer are backed by
large folios (e.g. direct IO, encoded read/write/send), we need the
ability to locate a sub-block (i.e. a page) inside a full stripe.
So the existing @stripe_nr + @sector_nr combination is not enough to
locate such a page for bs > ps cases.
Introduce a new parameter, @step_nr, to locate the page of a larger fs
block. The naming follows the convention used elsewhere inside btrfs,
where one step is min(sectorsize, PAGE_SIZE).
It's still a preparation, only touching the following aspects:
- btrfs_dump_rbio()
To show the new @sector_nsteps member.
- btrfs_raid_bio::sector_nsteps
Recording how many steps there are inside a fs block.
- Enlarge btrfs_raid_bio::*_paddrs[] size
To take @sector_nsteps into consideration.
- index_one_bio()
- index_stripe_sectors()
- memcpy_from_bio_to_stripe()
- cache_rbio_pages()
- need_read_stripe_sectors()
Those functions iterate *_paddrs[] and need to take sector_nsteps
into consideration.
- Rename rbio_stripe_sector_index() to rbio_sector_index()
The "stripe" part is not that helpful.
Also add an extra ASSERT() before returning the result.
- Add a new rbio_paddr_index() helper
This will take the extra @step_nr into consideration.
- The comments of btrfs_raid_bio
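An illustrative sketch of the new index math (not the exact code):
    static unsigned int rbio_paddr_index(struct btrfs_raid_bio *rbio,
                                         unsigned int stripe_nr,
                                         unsigned int sector_nr,
                                         unsigned int step_nr)
    {
            ASSERT(step_nr < rbio->sector_nsteps);
            return rbio_sector_index(rbio, stripe_nr, sector_nr) *
                   rbio->sector_nsteps + step_nr;
    }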
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The structure needs to track both the pages from higher layer bio and
internal pages, thus it can be a little complex to grasp.
Add an overview of the structure, especially how we track different
pages from higher layer bios and internal ones, to save some time for
future developers.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
When a scrub fails immediately without any bytes scrubbed, the returned
btrfs_scrub_progress::last_physical will always be 0, even if a non-zero
@start was passed into btrfs_scrub_dev() for resume cases.
This resets the progress and makes a later scrub resume start from the
beginning.
[CAUSE]
The function btrfs_scrub_dev() accepts a @progress parameter to copy its
updated progress to the caller, there are cases where we either don't
touch progress::last_physical at all or copy 0 into last_physical:
- last_physical not updated at all
If some error happened before scrubbing any super block or chunk, we
will not copy the progress, leaving @last_physical untouched.
E.g. failure to allocate @sctx, scrubbing a missing device, or a scrub
already running, and so on.
None of those cases touch @progress at all, leaving last_physical
untouched, and thus 0 in most cases.
- Error out before scrubbing any bytes
In these cases we allocated @sctx, and sctx->stat.last_physical is all
zero (initialized by kvzalloc()).
Unfortunately some critical error happened during
scrub_enumerate_chunks() or scrub_supers() before any stripe was really
scrubbed.
In that case, although we copy sctx->stat back to @progress, since no
bytes were really scrubbed, last_physical is overwritten with 0.
[FIX]
Make sure the parameter @progress always has its @last_physical member
updated to @start parameter inside btrfs_scrub_dev().
At the very beginning of the function, set @progress->last_physical to
@start, so that even if we error out without doing progress copying,
last_physical is still at @start.
Then after @sctx is allocated, set sctx->stat.last_physical to @start;
this makes sure that even if no bytes were scrubbed, @last_physical is
not left as zero at the progress copying stage.
This should resolve the resume progress reset problem.
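In other words, the fix boils down to two assignments (a fragment-level
sketch, variable names as described above):
    /* At the very beginning of btrfs_scrub_dev(). */
    progress->last_physical = start;

    /* And once @sctx has been allocated. */
    sctx->stat.last_physical = start;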
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Move the 'retry_uncached' and 'hint' fields close to the other boolean
fields so that we remove a hole from the structure and reduce its size
from 136 bytes down to 128 bytes. Currently this structure is only
allocated in the stack of btrfs_reserve_extent().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The struct find_free_extent_ctl uses an int for the 'delalloc' field but
it's always used as a boolean, and its value is passed to
several functions to signal if we are dealing with delalloc. The same goes
for the 'is_data' argument from btrfs_reserve_extent(). So change the type
from int to bool and move the field definition in the find_free_extent_ctl
structure so that it's close to other bool fields and reduces the size of
the structure from 144 down to 136 bytes (at the moment it's only declared
in the stack of btrfs_reserve_extent(), never allocated otherwise).
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Many fields of struct btrfs_path are used as booleans but their type is
an unsigned int (of 1 bit width to save space). Change the type to
bool keeping the :1 suffix so that they combine with the previous u8
fields in order to save space. This makes the code more clear by using
explicit true/false and more in line with the preferred style, preserving
the size of the structure.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's no need to update the local variable 'check_skip' to false inside
the critical section delimited by the lock of the current node, so do it
after unlocking the node.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If we try to push an item count from the right leaf that is greater than
the number of items in the leaf, we just emit a warning. This should
never happen but if it does we get an underflow in the new number of
items in the right leaf and chaos follows from it. So replace the warning
with proper error handling, by aborting the transaction and returning
-EUCLEAN, and proper logging by using btrfs_crit() instead of WARN(),
which gives us proper formatting and information about the filesystem.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The 'right' variable points to path->nodes[0] and path->nodes[0] is never
changed, but some places use 'right' while others refer to path->nodes[0].
Update all sites to use 'right' as not only is it shorter, it's also easier
to reason about, since it means the right leaf and avoids any confusion with
the sibling left leaf.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have already called btrfs_clear_buffer_dirty() against the left leaf in
the code above:
    btrfs_set_header_nritems(left, left_nritems);
    if (left_nritems)
            btrfs_mark_buffer_dirty(trans, left);
    else
            btrfs_clear_buffer_dirty(trans, left);
So remove the second check for a 0 number of items in the left leaf and
the second call to btrfs_clear_buffer_dirty() against the left leaf.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The 'left' variable points to path->nodes[0] and path->nodes[0] is never
changed, but some places use 'left' while others refer to path->nodes[0].
Update all sites to use 'left' as not only is it shorter, it's also easier
to reason about, since it means the left leaf and avoids any confusion with
the sibling right leaf.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
It's not expected to get a data size greater than the leaf's free space,
which would lead to a leaf dump and BUG(), so tag the if statement's
expression as unlikely, hinting the compiler to potentially generate
better code.
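I.e. the annotated check looks roughly like this (variable names are
illustrative):
    if (unlikely(btrfs_leaf_free_space(leaf) < data_size)) {
            btrfs_print_leaf(leaf);
            BUG();
    }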
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The call to btrfs_del_leaf() can only return an error (negative value) or
zero (success). If we didn't get an error then 'ret' is zero, so it's
pointless to set it to zero again.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
If the call to btrfs_del_leaf() fails we return without decrementing the
extra ref we took on the leaf, therefore leaking it. Fix this by ensuring
we drop the ref count before returning the error.
Fixes: 751a27615dda ("btrfs: do not BUG_ON() on tree mod log failures at btrfs_del_ptr()")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Commit 2c25716dcc25 ("btrfs: zlib: fix and simplify the inline extent
decompression") renamed the 'start_byte' parameter to 'dest_pgoff' in
btrfs_decompress(). The remaining 'start_byte' references are
inconsistent with the actual implementation and may cause confusion for
developers.
Ensure consistency between function declaration and implementation.
Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have support for an optional format string to be printed in ASSERT()
(added in 19468a623a9109 ("btrfs: enhance ASSERT() to take optional format
string")). It's not yet used everywhere it could be, so add it to a few
more files.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since the read verification and read repair now all support bs > ps
without large folios, we can enable encoded read/write/send.
Now we can relax the alignment in assert_bbio_alignment() to
min(blocksize, PAGE_SIZE).
But also add the extra blocksize based alignment check for the logical
and length of the bbio.
There is a pitfall in btrfs_add_compress_bio_folios(), which relies on
the folios passed in meeting the minimal folio order.
But now we can pass in regular page sized folios, so update it to check
each folio's size instead of using the minimal folio size.
This allows btrfs_add_compress_bio_folios() to even handle a folio array
with different sizes; thankfully we don't yet need to handle such a crazy
situation.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The current read verification also relies on large folios to support
bs > ps cases, but that introduced quite some limits.
To enhance read-repair to support bs > ps without large folios:
- Make btrfs_data_csum_ok() accept an array of paddrs
The paddrs[] can then be passed directly into
btrfs_calculate_block_csum_pages().
- Make repair_one_sector() accept an array of paddrs
So that it can submit a repair bio backed by regular pages, not only
large folios.
This requires us to allocate more slots at bio allocation time though.
Also since the caller may have only partially advanced the saved_iter
for bs > ps cases, we cannot directly trust the logical bytenr from
saved_iter (it can be unaligned), thus a manual round down is necessary
for the logical bytenr.
- Make btrfs_check_read_bio() build an array of paddrs
The tricky part is that we can only call btrfs_data_csum_ok() after
all involved pages are assembled.
This means at the call time of btrfs_check_read_bio(), our offset
inside the bio is already at the end of the fs block.
Thus we must re-calculate @bio_offset for btrfs_data_csum_ok() and
repair_one_sector().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently btrfs_repair_io_failure() only accepts a single @paddr
parameter, and for bs > ps cases it's required that @paddr is backed by
a large folio.
That assumption has quite some limitations, preventing us from utilizing
true zero-copy direct-io and encoded read/writes.
To address the problem, enhance btrfs_repair_io_failure() by:
- Accept an array of paddrs, up to 64K / PAGE_SIZE entries
This acts somewhat like a bio_vec, but with very limited entries, as the
function is only utilized to repair one fs data block, or a tree block.
Both have an upper size limit (BTRFS_MAX_BLOCK_SIZE, i.e. 64K), so we
don't need the full bio_vec thing to handle it.
- Allocate a bio with multiple slots
Previously even for bs > ps cases, we only passed in a contiguous
physical address range, thus a single slot was enough.
But not anymore, so we have to allocate a bio structure, rather than
using the on-stack one.
- Use on-stack memory to allocate @paddrs array
It's at most 16 pages (4K page size, 64K block size), taking up at
most 128 bytes.
I think the on-stack cost is still acceptable.
- Add one extra check to make sure the repair bio is exactly one block
- Utilize btrfs_repair_io_failure() to submit a single bio for metadata
This should improve the read-repair performance for metadata, as now
we submit a node sized bio and then wait, rather than submitting each
block of the metadata and waiting for each submitted block.
- Add one extra parameter indicating the step
This is due to the fact that the metadata step can be as large as
nodesize, instead of sectorsize.
So we need a way to distinguish metadata and data repair.
- Reduce the width of the @length parameter of btrfs_repair_io_failure()
Since we only call btrfs_repair_io_failure() on a single data or
metadata block, u64 is overkill.
Use u32 instead and add an extra ASSERT() to make sure the length
never exceeds BTRFS_MAX_BLOCK_SIZE.
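A hypothetical sketch of the on-stack array and the new length check:
    /* At most 64K worth of pages to describe one block (16 entries with 4K pages). */
    phys_addr_t paddrs[BTRFS_MAX_BLOCK_SIZE >> PAGE_SHIFT];

    ASSERT(length <= BTRFS_MAX_BLOCK_SIZE);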
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
For bs > ps cases, all folios passed into btrfs_csum_one_bio() are
ensured to be large folios. But that requirement excludes
features like direct IO and encoded writes.
To support bs > ps without large folios, enhance btrfs_csum_one_bio()
by:
- Split btrfs_calculate_block_csum() into two versions
* btrfs_calculate_block_csum_folio()
For call sites where a fs block is always backed by a large folio.
This will do extra checks on the folio size, build a paddrs[] array,
and pass it into the newer btrfs_calculate_block_csum_pages()
helper.
For now btrfs_check_block_csum() is still using this version.
* btrfs_calculate_block_csum_pages()
For call sites that may hit a fs block backed by noncontiguous pages.
The pages are represented by paddrs[] array, which includes the
offset inside the page.
This function will do the proper sub-block handling.
- Make btrfs_csum_one_bio() use btrfs_calculate_block_csum_pages()
This means we will need to build a local paddrs[] array, and after
filling a fs block, do the checksum calculation.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
It's not used anywhere outside space-info.c so move it from space-info.h
into space-info.c.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Move the CSUM_FMT* definitions to fs.h, where BTRFS_KEY_FMT is, and add
the prefix for consistency.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Trivial pattern for the auto freeing where there are no operations
between btrfs_free_path() and the function's return.
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since the sector_ptr structure now only contains a single paddr, there
is no need to keep that structure.
Instead use a phys_addr_t array for the bio and stripe pointers.
This means several helpers also need to be changed to accept a paddr
instead of a sector_ptr pointer.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The uptodate boolean member can be extracted into a bitmap, which will
save us some space (1 bit per sector instead of a full byte).
Furthermore we do not need to record the uptodate bitmap for bio
sectors, as if bio_sectors[].paddr is valid it means there is a bio and
the sector will be uptodate.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We can use paddr -1 as an indicator for an unset/uninitialized paddr.
We cannot use paddr 0: unlike virtual address 0, which is never mapped
and thus always triggers a page fault, physical address 0 may be a valid
page.
So here we follow swiotlb to use (paddr)-1 as a special indicator for
invalid/unset physical address.
Even if the PFN may still be valid, our usage of the physical address
should always be aligned to fs block size (or page size for bs > ps
cases), thus such -1 paddr should never be a valid one.
With this special -1 paddr, we can get rid of the has_paddr member and
save 1 byte in the sector_ptr structure.
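A sketch of the convention (the macro and helper names here are
hypothetical):
    #define INVALID_PADDR   ((phys_addr_t)-1)

    static inline bool paddr_is_valid(phys_addr_t paddr)
    {
            return paddr != INVALID_PADDR;
    }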
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In btrfs_compr_pool_scan(), use LIST_HEAD() to declare and initialize
the 'remove' list_head in one step instead of using INIT_LIST_HEAD()
separately.
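I.e.:
    /* Before */
    struct list_head remove;
    INIT_LIST_HEAD(&remove);

    /* After */
    LIST_HEAD(remove);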
Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function scrub_raid56_parity_stripe() handles the parity stripe
in the following steps:
- Scrub each data stripe
And make sure everything is fine in each data stripe
- Cache the data stripe into the raid bio
- Use the cached raid bio to scrub the target parity stripe
Extract the last two steps into a new helper,
scrub_raid56_cached_parity(), as a cleanup and make the error handling
more straightforward.
With the following minor cleanups:
- Use an on-stack bio structure
The bio is always empty, thus we do not need any bio vectors nor the
block device. So there is no need to allocate a bio, the on-stack
one is more than enough.
- Remove the unnecessary btrfs_put_bioc() call if btrfs_map_block()
failed
If btrfs_map_block() fails, @bioc_ret will not be touched, thus
there is no need to call btrfs_put_bioc() in this case.
- Use a proper out: tag to do the cleanup
Now the error cleanup is much shorter and simpler, just
btrfs_bio_counter_dec() and bio_uninit().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Unlike queue_scrub_stripe() which uses the global sctx->extent_path and
sctx->csum_path which are always released at the end of scrub_stripe(),
scrub_raid56_parity_stripe() uses local extent_path and csum_path, as
that function is going to handle the full stripe, whose bytenr may be
smaller than the bytenr in the global sctx paths.
However the cleanup of the local extent/csum paths only happens after
we have successfully submitted an rbio.
There are several error routes where we don't release those two paths:
- scrub_find_fill_first_stripe() errored out at csum tree search
In that case extent_path is still valid, and that function itself will
not release the extent_path passed in.
And the function returns directly without releasing either path.
- The full stripe is empty
- Some blocks failed to be recovered
- btrfs_map_block() failed
- raid56_parity_alloc_scrub_rbio() failed
The function returns directly without releasing either path.
Fix it by covering the btrfs_release_path() calls under the out: label.
This is just a hot fix; in the long run we will switch to scope-based
auto freeing for both local paths.
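So all the error routes now funnel through roughly:
    out:
            btrfs_release_path(&extent_path);
            btrfs_release_path(&csum_path);
            return ret;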
Fixes: 1dc4888e725d ("btrfs: scrub: avoid unnecessary extent tree search preparing stripes")
Fixes: 3c771c194402 ("btrfs: scrub: avoid unnecessary csum tree search preparing stripes")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
There is a report that memory allocation failed for btrfs_bio::csum
during a large read:
b2sum: page allocation failure: order:4, mode:0x40c40(GFP_NOFS|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
CPU: 0 UID: 0 PID: 416120 Comm: b2sum Tainted: G W 6.17.0 #1 NONE
Tainted: [W]=WARN
Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT)
Call trace:
show_stack+0x18/0x30 (C)
dump_stack_lvl+0x5c/0x7c
dump_stack+0x18/0x24
warn_alloc+0xec/0x184
__alloc_pages_slowpath.constprop.0+0x21c/0x730
__alloc_frozen_pages_noprof+0x230/0x260
___kmalloc_large_node+0xd4/0xf0
__kmalloc_noprof+0x1c8/0x260
btrfs_lookup_bio_sums+0x214/0x278
btrfs_submit_chunk+0xf0/0x3c0
btrfs_submit_bbio+0x2c/0x4c
submit_one_bio+0x50/0xac
submit_extent_folio+0x13c/0x340
btrfs_do_readpage+0x4b0/0x7a0
btrfs_readahead+0x184/0x254
read_pages+0x58/0x260
page_cache_ra_unbounded+0x170/0x24c
page_cache_ra_order+0x360/0x3bc
page_cache_async_ra+0x1a4/0x1d4
filemap_readahead.isra.0+0x44/0x74
filemap_get_pages+0x2b4/0x3b4
filemap_read+0xc4/0x3bc
btrfs_file_read_iter+0x70/0x7c
vfs_read+0x1ec/0x2c0
ksys_read+0x4c/0xe0
__arm64_sys_read+0x18/0x24
el0_svc_common.constprop.0+0x5c/0x130
do_el0_svc+0x1c/0x30
el0_svc+0x30/0xa0
el0t_64_sync_handler+0xa0/0xe4
el0t_64_sync+0x198/0x19c
[CAUSE]
Btrfs needs to allocate memory for btrfs_bio::csum for large reads, so
that we can later verify the contents of the read.
However nowadays a read bio can easily go beyond BIO_MAX_VECS *
PAGE_SIZE (which is 1M for 4K page size), due to multi-page bvecs,
where one bvec can cover more than one page as long as the pages are
physically adjacent.
This will become more common when the large folio support is moved out
of experimental features.
In the above case, a read larger than 4MiB with SHA256 checksums (32
bytes for each 4K block) will be able to trigger an order 4 allocation.
The order 4 is larger than PAGE_ALLOC_COSTLY_ORDER (3), thus without
extra flags such allocation will not retry.
And if the system has a very small amount of memory (e.g. an RPI4 with
a low memory spec) or a VM with a small amount of vRAM, or the memory is
heavily fragmented, such an allocation can fail and cause the above warning.
[FIX]
Although btrfs handles the memory allocation failure correctly, we do
not really need physically contiguous memory just to store our
checksums.
In fact btrfs_csum_one_bio() is already using kvzalloc() to reduce the
memory pressure.
So follow suit and use kvcalloc() for btrfs_bio::csum.
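Roughly, the allocation becomes the following (local variable names are
illustrative; the matching kfree() of the separately allocated array becomes
kvfree()):
    bbio->csum = kvcalloc(nblocks, fs_info->csum_size, GFP_NOFS);
    if (!bbio->csum)
            return BLK_STS_RESOURCE;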
Reported-by: Calvin Owens <calvin@wbinvd.org>
Link: https://lore.kernel.org/linux-btrfs/20251105180054.511528-1-calvin@wbinvd.org/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The current definition of ASSERT(cond) as (void)(cond) is redundant,
since these checks have no side effects and don't affect code logic.
However, some checks contain READ_ONCE() or other compiler-unfriendly
constructs. For example, the ASSERT(list_empty(...)) in
btrfs_add_delalloc_inode() was compiled to a redundant mov instruction
due to this issue.
Define ASSERT as BUILD_BUG_ON_INVALID for !CONFIG_BTRFS_ASSERT builds,
which uses the sizeof(cond) trick. Also mark full_page_sectors_uptodate()
as __maybe_unused to suppress the "unneeded declaration" warning (it's
still needed at compile time).
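I.e. for !CONFIG_BTRFS_ASSERT the definition becomes roughly:
    #define ASSERT(cond, args...)   BUILD_BUG_ON_INVALID(cond)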
Signed-off-by: Gladyshev Ilya <foxido@foxido.dev>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[ENHANCEMENT]
Btrfs currently calculates data checksums then submits the bio.
But after commit 968f19c5b1b7 ("btrfs: always fallback to buffered write
if the inode requires checksum"), any writes with data checksum will
fallback to buffered IO, meaning the content will not change during
writeback.
This means we're safe to calculate the data checksum and submit the bio
in parallel, and only need the following new behavior:
- Wait for the csum generation to finish before calling btrfs_bio::end_io()
Otherwise this can lead to a use-after-free in the csum generation worker.
- Save the current bi_iter for csum_one_bio()
As the submission part can advance btrfs_bio::bio.bi_iter, without the
saved copy csum_one_bio() may get an empty bi_iter and not generate any
checksums.
Unfortunately this means we have to increase the size of btrfs_bio by
16 bytes, but this is still acceptable.
As usual, such a new feature is hidden behind the experimental flag.
[THEORETICAL ANALYSIS]
Consider the following theoretical hardware performance numbers, which
should be more or less close to modern mainstream hardware:
Memory bandwidth: 50GiB/s
CRC32C bandwidth: 45GiB/s
SSD bandwidth: 8GiB/s
Then the write bandwidth with data checksum before the patch is:
1 / (1/50 + 1/45 + 1/8) = 5.98 GiB/s
After the patch, the bandwidth is:
1 / (1/50 + max(1/45, 1/8)) = 6.90 GiB/s
The difference is a 15.32% improvement.
[REAL WORLD BENCHMARK]
I'm using a Zen5 (HX 370) as the host, the VM has 4GiB memory, 10 vCPUs, the
storage is backed by a PCIe gen3 x4 NVMe.
The test is a direct IO write, with 1MiB block size, writing 7GiB of data
into a btrfs mount with data checksum, thus the direct write will fall
back to a buffered one:
Vanilla Datasum: 1619.97 MiB/s
Patched Datasum: 1792.26 MiB/s
Diff +10.6 %
In my case, the bottleneck is the storage, thus the improvement does not
reach the theoretical one, but it is still an observable improvement.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We used the IRQ version of the spinlock for ordered_tree_lock, as
btrfs_finish_ordered_extent() can be called in end_bbio_data_write(),
which was in IRQ context.
However since we're moving all the btrfs_bio::end_io() calls into task
context, there is no more need to support IRQ context thus we can relax
to regular spin_lock()/spin_unlock() for btrfs_inode::ordered_tree_lock.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The reason why end_bbio_compressed_write() queues work onto the
compressed_write_workers wq is the end_compressed_writeback() call, as it
grabs all the involved folios and clears their writeback flags, which may
sleep.
However, now that we always run btrfs_bio::end_io() in task context, there
is no need to queue the work anymore.
Just remove btrfs_fs_info::compressed_write_workers and
compressed_bio::write_end_work.
There is a comment about the works queued into compressed_write_workers;
now change that to flushing the endio wq instead, which is responsible for
handling all data endio functions.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BACKGROUND]
Btrfs has a lot of different bi_end_io functions, to handle different
raid profiles. But they introduced a lot of different contexts for
btrfs_bio::end_io() calls:
- Simple read bios
Run in task context, backed by either endio_meta_workers or
endio_workers.
- Simple write bios
Run in IRQ context.
- RAID56 write or rebuild bios
Run in task context, backed by rmw_workers.
- Mirrored write bios
Run in IRQ context.
This is inconsistent, and contributes to the number of workqueues used
in btrfs.
[ENHANCEMENT]
Make all the above bios call their btrfs_bio::end_io() in task context,
backed by either endio_meta_workers for metadata, or endio_workers for
data.
For simple write bios, merge the handling into simple_end_io_work().
For mirrored write bios, it will be a little more complex, since either
the original or the cloned bio can run the final btrfs_bio::end_io().
Here we make sure the cloned bios are using btrfs_bioset, to reuse the
end_io_work, and run both the original and cloned work inside the workqueue.
Add extra ASSERT()s to make sure btrfs_bio_end_io() is running in task
context.
This not only unifies the context for btrfs_bio::end_io() functions, but
also opens a new door for further btrfs_bio::end_io() related cleanups.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently there is only one caller which doesn't populate
btrfs_bio::inode, and that's scrub.
The idea is scrub doesn't want any automatic csum verification nor
read-repair, as everything will be handled by scrub itself.
However that behavior is really no different from the metadata inode, thus
we can reuse the btree_inode as btrfs_bio::inode for scrub.
The only exception is in btrfs_submit_chunk() where if a bbio is from
scrub or data reloc inode, we set rst_search_commit_root to true.
This means we still need a way to distinguish scrub from metadata, but
that can be done by a new flag inside btrfs_bio.
Now that btrfs_bio::inode is a mandatory parameter, we can extract the
fs_info from that inode and thus remove btrfs_bio::fs_info, saving 8 bytes
in the btrfs_bio structure.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|