The __btrfs_bio_end_io() helper is trivial and has a single caller, so
there's no point in having a dedicated helper function. Further, the double
underscore prefix in the name is discouraged. So get rid of the helper and move
its code into the caller (btrfs_bio_end_io()).
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
At close_ctree(), after we have run delayed iputs either through explicitly
calling btrfs_run_delayed_iputs() or later during the call to
btrfs_commit_super() or btrfs_error_commit_super(), we assert that the
delayed iputs list is empty.
When we have compressed writes this assertion may fail because delayed
iputs may have been added to the list after we last ran delayed iputs.
This happens like this:
1) We have a compressed write bio executing;
2) We enter close_ctree() and flush the fs_info->endio_write_workers
queue which is the queue used for running ordered extent completion;
3) The compressed write bio finishes and enters
btrfs_finish_compressed_write_work(), where it calls
btrfs_finish_ordered_extent() which in turn calls
btrfs_queue_ordered_fn(), which queues a work item in the
fs_info->endio_write_workers queue that we have flushed before;
4) At close_ctree() we proceed, run all existing delayed iputs and
call btrfs_commit_super() (which also runs delayed iputs), but before
we reach the assertion below:
ASSERT(list_empty(&fs_info->delayed_iputs))
a delayed iput is added by the next step;
5) The ordered extent completion job queued in step 3 runs and results in
creating a delayed iput when dropping the last reference of the ordered
extent (a call to btrfs_put_ordered_extent() made from
btrfs_finish_one_ordered());
6) At this point the delayed iputs list is not empty, so the assertion at
close_ctree() fails.
Fix this by flushing the fs_info->compressed_write_workers queue at
close_ctree() before flushing the fs_info->endio_write_workers queue,
respecting the queue dependency as the latter is responsible for the
execution of ordered extent completion.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
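To illustrate the resulting ordering, here is a minimal sketch of the
relevant close_ctree() steps (simplified, not the exact kernel code; it
assumes the existing btrfs_flush_workqueue() helper):

static void flush_write_queues_sketch(struct btrfs_fs_info *fs_info)
{
        /*
         * Compressed write completion queues ordered extent completion
         * work on endio_write_workers, so flush it first...
         */
        btrfs_flush_workqueue(fs_info->compressed_write_workers);
        /* ...and only then flush the ordered extent completion queue. */
        btrfs_flush_workqueue(fs_info->endio_write_workers);
        /* Delayed iputs created by those completions now exist... */
        btrfs_run_delayed_iputs(fs_info);
        /* ...so the assertion can no longer trip for this reason. */
        ASSERT(list_empty(&fs_info->delayed_iputs));
}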
|
|
Rename binode to inode in local variables or parameters so it's more
unified with the rest of the code.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Pass a struct btrfs_inode to btrfs_ioctl_subvol_getflags() as it's an
internal interface, allowing us to remove some uses of BTRFS_I().
Signed-off-by: David Sterba <dsterba@suse.com>
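For illustration, a simplified dispatch (not the actual btrfs_ioctl()
code; the helper's new signature follows the description above):

static long subvol_getflags_dispatch(struct inode *inode, unsigned int cmd,
                                     void __user *argp)
{
        switch (cmd) {
        case BTRFS_IOC_SUBVOL_GETFLAGS:
                /* Convert once here; the helper takes struct btrfs_inode. */
                return btrfs_ioctl_subvol_getflags(BTRFS_I(inode), argp);
        default:
                return -ENOTTY;
        }
}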
|
|
Remove some redundant variables and assignments, move variable
declarations to their closest scope.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Pass a struct btrfs_inode to btrfs_sync_inode_flags_to_i_flags() as it's
an internal interface.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The search tree ioctls use btrfs_root, so pass that instead of btrfs_inode
pointers so we don't have to do the conversion.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The ioctl switch btrfs_ioctl() provides several parameter types for
convenience so we don't have to do the conversion in the callbacks.
Pass root pointers to the send related functions.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Add const to function parameters that are not changed.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we only support two block sizes, 4K and PAGE_SIZE.
This means that on the most common architecture, x86_64, we have no way to
test subpage block sizes. And that's exactly why I have an aarch64 machine
dedicated to subpage tests.
But this is still a hurdle for a lot of btrfs developers, and to improve
the test coverage, mostly on x86_64, here we enable debug builds to
accept a 2K block size.
This involves:
- Introduce a dedicated minimal block size macro
BTRFS_MIN_BLOCKSIZE, which depends on if CONFIG_BTRFS_DEBUG is set.
If so it's 2K, otherwise it's 4K as usual.
- Allow 4K, PAGE_SIZE and BTRFS_MIN_BLOCKSIZE as block sizes
- Update subpage block size checks to be based on BTRFS_MIN_BLOCKSIZE
- Export the new supported blocksize through sysfs interfaces
As most of the subpage support is already pretty mature, there is no
extra work needed to support the extra 2K block size.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
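A sketch of the idea (the macro name follows the commit message; the
exact kernel definition and the validity check helper are illustrative):

#ifdef CONFIG_BTRFS_DEBUG
#define BTRFS_MIN_BLOCKSIZE     SZ_2K
#else
#define BTRFS_MIN_BLOCKSIZE     SZ_4K
#endif

/* Accept 4K, the page size and the (debug only) minimum block size. */
static bool blocksize_supported(u32 blocksize)
{
        return blocksize == SZ_4K || blocksize == PAGE_SIZE ||
               blocksize == BTRFS_MIN_BLOCKSIZE;
}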
|
|
Btrfs utilizes inline data extent for the following cases:
- Regular small files
- Symlinks
And "btrfs check" detects any file extents that are too large as an
error.
It's not a problem for 4K block size, but for the incoming smaller
block sizes (2K), it can cause problems due to bad limits:
- Non-compressed inline data extents
We do not allow a non-compressed inline data extent to be as large as
block size.
- Symlinks
Currently the only real limit on symlinks is 4K, which can be larger
than a 2K block size.
These will cause "btrfs check" to report too large file extents.
Fix it by adding proper size checks for the above cases.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
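Illustrative versions of the two checks (the helper names and exact call
sites are assumptions; only the limits come from the description above):

/* Non-compressed inline data must not be as large as one block. */
static bool inline_data_size_ok(const struct btrfs_fs_info *fs_info,
                                u64 size, bool compressed)
{
        if (!compressed && size >= fs_info->sectorsize)
                return false;
        return true;
}

/* Symlink targets are stored inline, so cap them at the block size too. */
static bool symlink_size_ok(const struct btrfs_fs_info *fs_info, size_t len)
{
        return len < fs_info->sectorsize;
}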
|
|
Since the initial enablement of block size < page size support for
btrfs in v5.15, we have hit several milestones for block size < page
size (subpage) support:
- RAID56 subpage support
In v5.19
- Refactored scrub support to support subpage better
In v6.4
- Block perfect (previously requiring page aligned ranges) compressed write
In v6.13
- Various error handling fixes involving subpage
In v6.14
Finally the only missing feature is the pretty simple and harmless
inline data extent creation, just added in the previous patches.
Now that btrfs has all of its features ready for both regular and subpage
cases, there is no reason to output a warning about the experimental
subpage support, and we can finally remove it.
Acked-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Previously inline data extent creation was disabled if the block size
(previously called sector size) is smaller than the page size, for the
following reasons:
- Possible mixed inline and regular data extents
However this can also happen if the block size matches the page size,
thus we do not treat mixed inline and regular extents as an error.
And the chance of causing mixed inline and regular data extents is not
even increased; it has the same requirement (a compressed inline data
extent covering the whole first block, followed by regular extents).
- Inability to handle async/inline delalloc range for block size < page
size cases
This is already fixed since commit 1d2fbb7f1f9e ("btrfs: allow
compression even if the range is not page aligned").
This was the major technical obstacle, but it's not anymore.
With that removed, we can enable inline data extent creation regardless
of the block size or the page size, allowing btrfs to have the same
capability for all block sizes.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
Since the support of block size (sector size) < page size for btrfs,
test case generic/563 fails with 4K block size and 64K page size:
--- tests/generic/563.out 2024-04-25 18:13:45.178550333 +0930
+++ /home/adam/xfstests-dev/results//generic/563.out.bad 2024-09-30 09:09:16.155312379 +0930
@@ -3,7 +3,8 @@
read is in range
write is in range
write -> read/write
-read is in range
+read has value of 8388608
+read is NOT in range -33792 .. 33792
write is in range
...
[CAUSE]
The test case creates an 8MiB file, then does buffered writes into the
8MiB range using a 4K block size, to overwrite the whole file.
On 4K page sized systems, since the write range covers a full block and
page, btrfs will not bother reading the page, just like what XFS and EXT4
do.
But on 64K page sized systems, although the 4K sized write is still block
aligned, it's not page aligned anymore, thus btrfs will read the full
page, which will be accounted by the cgroup and fail the test, as the
test case itself expects that such 4K block aligned writes should not
trigger any reads.
Such expected behavior is an optimization to reduce folio reads when
possible, and unfortunately btrfs does not implement that optimization.
[FIX]
To skip the full page read, we need to do the following modification:
- Do not trigger full page read as long as the buffered write is block
aligned
This is done by simply modifying the check inside
prepare_uptodate_page().
- Skip already uptodate blocks during full page read
Otherwise we can end up with the following data corruption:
0 32K 64K
|///////| |
Where the file range [0, 32K) is dirtied by buffered write, the
remaining range [32K, 64K) is not.
When reading the full page, since [0,32K) is only dirtied but not
written back, there is no data extent map for it, but a hole covering
[0, 64K).
If we continue reading the full page range [0, 64K), the dirtied range
will be filled with 0 (since there is only a hole covering the whole
range).
This causes the dirtied range to get lost.
With this optimization, btrfs can pass generic/563 even if the page size
is larger than fs block size.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
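A minimal sketch of the first part of the fix (the helper name is
hypothetical; the real change adjusts the check done in
prepare_uptodate_page()):

/*
 * Only read the folio before a buffered write if the write range is not
 * block aligned; block aligned writes overwrite whole blocks anyway.
 */
static bool write_range_needs_read(const struct btrfs_fs_info *fs_info,
                                   u64 pos, u64 len)
{
        return !IS_ALIGNED(pos, fs_info->sectorsize) ||
               !IS_ALIGNED(pos + len, fs_info->sectorsize);
}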
|
|
Currently if btrfs has its block size (formerly called sector size) smaller
than the page size, btrfs_do_readpage() will handle the range extent by
extent, which is good for performance as it doesn't need to look up the
same extent map again and again.
(Although get_extent_map() already does an extra cached em check, thus
the optimization is not that obvious.)
This is totally fine and is a valid optimization, but it has an
assumption that there is no partial uptodate range in the page.
Meanwhile there is an incoming feature that requires btrfs to skip the full
page read if a buffered write range covers a full block but not a full
page.
In that case, we can have a page that is partially uptodate, and the
current per-extent lookup cannot handle such a case.
So here we change btrfs_do_readpage() to do block-by-block reads, which
simplifies the following:
- Remove the need for the @iosize variable
Because we just use sectorsize as our increment.
- Remove @pg_offset, and calculate it inside the loop when needed
It's just offset_in_folio().
- Use a for() loop instead of a while() loop
This will slightly reduce the read performance for subpage cases, but for
the future where we need to skip already uptodate blocks, it should still
be worth it.
For block size == page size, this brings no performance change.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
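The rough shape of the block-by-block loop (illustrative only;
block_uptodate() and read_one_block() are hypothetical stand-ins for the
real per-block handling):

static void readpage_block_by_block(struct btrfs_inode *inode,
                                    struct folio *folio)
{
        const u32 blocksize = inode->root->fs_info->sectorsize;
        const u64 start = folio_pos(folio);
        const u64 end = start + folio_size(folio);

        for (u64 cur = start; cur < end; cur += blocksize) {
                size_t pg_offset = offset_in_folio(folio, cur);

                /* Future optimization: skip blocks that are already uptodate. */
                if (block_uptodate(folio, cur, blocksize))
                        continue;
                read_one_block(inode, folio, cur, pg_offset, blocksize);
        }
}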
|
|
Currently we're using btrfs_lock_and_flush_ordered_range() for both
btrfs_read_folio() and btrfs_readahead(), but it has one critical
problem for future subpage optimizations:
- It will call btrfs_start_ordered_extent() to write back the involved
folios
But remember we're calling btrfs_lock_and_flush_ordered_range() at
read paths, meaning the folio is already locked by the read path.
If we really trigger writeback for those already locked folios, this
will lead to a deadlock, as writeback cannot get the folio lock.
Such a deadlock is prevented by the fact that btrfs always keeps a
dirty folio also uptodate, by either dirtying all blocks of the folio,
or by reading the whole folio before dirtying.
To prepare for the incoming patch which allows btrfs to skip full folio
read if the buffered write is block aligned, we have to start by solving
the possible deadlock first.
Instead of blindly calling btrfs_start_ordered_extent(), introduce a
new helper, which is smarter in the following ways:
- Only wait and flush the ordered extent if
* The folio doesn't even have the private bit set
* Part of the blocks of the ordered extent are not uptodate
This can happen when:
* The folio writeback finished, then got invalidated.
There are a lot of reasons that a folio can get invalidated,
from memory pressure to direct IO (which invalidates all folios
of the range).
But the OE has not finished yet.
We have to wait for the ordered extent, as the OE may contain
to-be-inserted data checksums.
Without waiting, our read can fail due to the missing checksum.
But either way, the OE should not need any extra flush inside the
locked folio range.
- Skip the ordered extent completely if
* All the blocks are dirty
This happens when OE creation is caused by a folio writeback whose
file offset is before our folio.
E.g. 16K page size and 4K block size
0 8K 16K 24K 32K
|//////////////||///////| |
The writeback of folio 0 created an OE for range [0, 24K), but since
folio 16K is not fully uptodate, a read is triggered for folio 16K.
The writeback will never happen (we're holding the folio lock for
read), nor will the OE finish.
Thus we must skip the range.
* All the blocks are uptodate
This happens when the writeback finished but the OE has not finished yet.
Since the blocks are already uptodate, we can skip the OE range.
The new helper lock_extents_for_read() will loop over the target range:
1) Lock the full range
2) If there is no ordered extent in the remaining range, exit
3) If there is an ordered extent that we can skip
Skip to the end of the OE, and continue checking
We do not trigger writeback nor wait for the OE.
4) If there is an ordered extent that we cannot skip
Unlock the whole extent range and start the ordered extent.
And also update btrfs_start_ordered_extent() to add two more parameters:
@nowriteback_start and @nowriteback_len, to prevent triggering flush for
a certain range.
This will allow us to handle the following case properly in the future:
16K page size, 4K btrfs block size:
0 4K 8K 12K 16K 20K 24K 28K 32K
|/////////////////////////////||////////////////| | |
|<-------------------- OE 2 ------------------->| |< OE 1 >|
The folio has been written back before, thus we have an OE at
[28K, 32K).
Although OE 1 finished its IO, the OE is not yet removed from the IO
tree.
The folio got invalidated after writeback completed and before the
ordered extent finished.
And [16K, 24K) range is dirty and uptodate, caused by a block aligned
buffered write (and future enhancements allowing btrfs to skip full
folio read for such case).
But writeback for folio 0 has begun, thus it generated OE 2, covering
range [0, 24K).
Since the full folio 16K is not uptodate, if we want to read the folio,
the existing btrfs_lock_and_flush_ordered_range() will deadlock:
btrfs_read_folio()
| Folio 16K is already locked
|- btrfs_lock_and_flush_ordered_range()
   |- btrfs_start_ordered_extent() for range [16K, 24K)
      |- filemap_fdatawrite_range() for range [16K, 24K)
         |- extent_write_cache_pages()
            folio_lock() on folio 16K, deadlock.
But now we will have the following sequence:
btrfs_read_folio()
| Folio 16K is already locked
|- lock_extents_for_read()
   |- can_skip_ordered_extent() for range [16K, 24K)
   |  Returned true, the range [16K, 24K) will be skipped.
   |- can_skip_ordered_extent() for range [28K, 32K)
   |  Returned false.
   |- btrfs_start_ordered_extent() for range [28K, 32K) with
      [16K, 32K) as no writeback range
No writeback for folio 16K will be triggered.
And there will be no more possible deadlock on the same folio.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
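A skeleton of the loop, using the helper names given above
(can_skip_ordered_extent() and the extra btrfs_start_ordered_extent()
parameters); the locking calls and signatures are simplified assumptions,
not the exact kernel code:

static void lock_extents_for_read_sketch(struct btrfs_inode *inode,
                                         u64 start, u64 end,
                                         struct extent_state **cached)
{
        u64 cur = start;

        /* 1) Lock the full range. */
        lock_extent(&inode->io_tree, start, end, cached);
        while (cur < end) {
                struct btrfs_ordered_extent *ordered;

                ordered = btrfs_lookup_ordered_range(inode, cur, end + 1 - cur);
                /* 2) No ordered extent left in the remaining range, exit. */
                if (!ordered)
                        break;
                /* 3) Skippable OE: jump past it, no writeback, no waiting. */
                if (can_skip_ordered_extent(inode, ordered)) {
                        cur = ordered->file_offset + ordered->num_bytes;
                        btrfs_put_ordered_extent(ordered);
                        continue;
                }
                /*
                 * 4) Otherwise unlock the whole range and wait for the OE,
                 *    telling it not to flush the locked folio range, then
                 *    retry from the beginning.
                 */
                unlock_extent(&inode->io_tree, start, end, cached);
                btrfs_start_ordered_extent(ordered, start, end + 1 - start);
                btrfs_put_ordered_extent(ordered);
                lock_extent(&inode->io_tree, start, end, cached);
                cur = start;
        }
}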
|
|
Inside __cow_file_range_inline(), since the inlined data no longer takes
any data space, we need to free up the reserved space.
However the code is still using the old page size == sector size
assumption, and will not handle the subpage case well.
Thankfully it is not going to cause any problems because we have two extra
safety nets:
- Inline data extents creation is disabled for sector size < page size
cases for now
But it won't stay that way for long.
- btrfs_qgroup_free_data() will only clear ranges which have been already
reserved
So even if we pass a range larger than what we need, it should still
be fine, especially since there is only reserved space for a single block at
file offset 0 of an inline data extent.
But just for the sake of consistency, fix the call site to use
sectorsize instead of page size.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently reading an inline data extent will zero out the remaining
range in the page.
This is not yet causing problems even for block size < page size
(subpage) cases because:
1) An inline data extent always starts at file offset 0
Meaning at page read, we always read the inline extent first, before
any other blocks in the page. Then later blocks are properly read out
and re-fill the zeroed out ranges.
2) Currently btrfs will read out the whole page if a buffered write is
not page aligned
So a page is either fully uptodate at buffered write time (covers the
whole page), or we will read out the whole page first.
Meaning there is nothing to lose for such an inline extent read.
But it's still not ideal:
- We're zeroing out the page twice
Once done by read_inline_extent()/uncompress_inline(), once done by
btrfs_do_readpage() for ranges beyond i_size.
- We're touching blocks that don't belong to the inline extent
In the incoming patches, we can have a partially uptodate folio, in
which some dirty blocks can exist while the folio is not fully uptodate:
The page size is 16K and block size is 4K:
0 4K 8K 12K 16K
| | |/////////| |
And range [8K, 12K) is dirtied by a buffered write, while the remaining
blocks are not uptodate.
If range [0, 4K) contains an inline data extent, and we try to read
the whole page, the current behavior will overwrite range [8K, 12K)
with zero and cause data loss.
So to make the behavior more consistent and in preparation for future
changes, limit the inline data extents read to only zero out the range
inside the first block, not the whole page.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
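An illustrative sketch of the new behavior (names are assumptions): after
copying datal bytes of inline data into the first block, zero only up to
the block boundary rather than the folio boundary:

static void zero_inline_tail(const struct btrfs_fs_info *fs_info,
                             struct folio *folio, size_t datal)
{
        const u32 blocksize = fs_info->sectorsize;

        /* Blocks beyond the first one are left to the normal read path. */
        if (datal < blocksize)
                folio_zero_range(folio, datal, blocksize - datal);
}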
|
|
We now parse human-friendly size values (e.g. '1G', '2M') when setting
read policies.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
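A minimal sketch of the parsing, based on the kernel's memparse() helper
(the real sysfs handler does more validation than shown here):

static int parse_policy_value(const char *buf, u64 *ret_value)
{
        char *endptr;
        u64 value;

        /* memparse() understands K/M/G/... suffixes, e.g. "1G" or "2M". */
        value = memparse(buf, &endptr);
        if (endptr == buf)
                return -EINVAL;
        *ret_value = value;
        return 0;
}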
|
|
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
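For reference, the pattern looks roughly like this, built on the scope
based cleanup helpers from <linux/cleanup.h>; treat the exact macro
definition and the example function as a sketch rather than the real
btrfs code:

/* Declared once, e.g. in ctree.h: */
DEFINE_FREE(btrfs_free_path, struct btrfs_path *, btrfs_free_path(_T))
#define BTRFS_PATH_AUTO_FREE(path_name) \
        struct btrfs_path *path_name __free(btrfs_free_path) = NULL

/* Usage: plain returns replace the "goto out; btrfs_free_path()" dance. */
static int lookup_item(struct btrfs_root *root, const struct btrfs_key *key)
{
        BTRFS_PATH_AUTO_FREE(path);

        path = btrfs_alloc_path();
        if (!path)
                return -ENOMEM;
        return btrfs_search_slot(NULL, root, key, path, 0, 0);
}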
|
|
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
This applies to both path and path2.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with some return simplifications.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This is the trivial pattern for path auto free, initialize at the
beginning and free at the end with simple goto -> return conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The most trivial pattern for the auto freeing when the variable is
declared with the macro and the final btrfs_free_path() is removed.
There are almost no goto -> return conversions and there's no other
function cleanup.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
As the helper num_extent_folios() is now __pure, we can use it in a for
loop without explicitly storing its value in a variable; the compiler
will do this for us.
The effect on btrfs.ko is -200 bytes and there are stack space savings
too:
btrfs_clone_extent_buffer -8 (32 -> 24)
btrfs_clear_buffer_dirty -8 (48 -> 40)
clear_extent_buffer_uptodate -8 (40 -> 32)
set_extent_buffer_dirty -8 (32 -> 24)
write_one_eb -8 (88 -> 80)
set_extent_buffer_uptodate -8 (40 -> 32)
read_extent_buffer_pages_nowait -16 (64 -> 48)
find_extent_buffer -8 (32 -> 24)
Signed-off-by: David Sterba <dsterba@suse.com>
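For illustration (the loop body is a stand-in), the pattern relies on the
helper being declared __pure (see the next entry below), so the call in
the loop condition can be hoisted by the compiler:

/* static __pure int num_extent_folios(const struct extent_buffer *eb); */

static void dirty_eb_folios(struct extent_buffer *eb)
{
        for (int i = 0; i < num_extent_folios(eb); i++)
                folio_mark_dirty(eb->folios[i]);
}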
|
|
The functions qualify for the pure attribute as they always return the
same value for the same argument (in the given scope). This allows the
compiler to optimize the calls and cache the value.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Unlike the folio helpers for data, the ones for metadata always take the
extent buffer start and length, so they can be simplified to take only
the eb. The fs_info can be obtained from the eb too so it can be dropped
as a parameter.
Added in patch "btrfs: use metadata specific helpers to simplify extent
buffer helpers".
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We are considering the used bytes counter of a block group as the amount
to update the space info's reclaim bytes counter after relocating the
block group, but this value alone is often not enough. This is because we
may have a reserved extent (or more) and in that case its size is
reflected in the reserved counter of the block group - the size of the
extent is only transferred from the reserved counter to the used counter
of the block group when the delayed ref for the extent is run - typically
when committing the transaction (or when flushing delayed refs due to
ENOSPC on space reservation). Such call chain for data extents is:
btrfs_run_delayed_refs_for_head()
  run_one_delayed_ref()
    run_delayed_data_ref()
      alloc_reserved_file_extent()
        alloc_reserved_extent()
          btrfs_update_block_group()
            -> transfers the extent size from the reserved
               counter to the used counter
For metadata extents:
btrfs_run_delayed_refs_for_head()
  run_one_delayed_ref()
    run_delayed_tree_ref()
      alloc_reserved_tree_block()
        alloc_reserved_extent()
          btrfs_update_block_group()
            -> transfers the extent size from the reserved
               counter to the used counter
Since relocation flushes delalloc, waits for ordered extent completion
and commits the current transaction before doing the actual relocation
work, the correct amount of reclaimed space is therefore the sum of the
"used" and "reserved" counters of the block group before we call
btrfs_relocate_chunk() at btrfs_reclaim_bgs_work().
So fix this by taking the "reserved" counter into consideration.
Fixes: 243192b67649 ("btrfs: report reclaim stats in sysfs")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
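A sketch of the accounting (simplified from btrfs_reclaim_bgs_work(); the
helper itself is illustrative): sample both counters inside the block
group's spinlock and treat their sum as the amount to be reclaimed:

static u64 bg_reclaimable_bytes(struct btrfs_block_group *bg)
{
        u64 reclaimable;

        spin_lock(&bg->lock);
        /* Reserved extents move to "used" only when their delayed refs run. */
        reclaimable = bg->used + bg->reserved;
        spin_unlock(&bg->lock);

        return reclaimable;
}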
|
|
At btrfs_reclaim_bgs_work(), we are reading the used bytes counter of the
block group twice while not holding the block group's spinlock. This can
result in races, reported by KCSAN and similar tools, since a concurrent
task can be updating that counter while at btrfs_update_block_group().
So avoid these races by grabbing the counter in a critical section
delimited by the block group's spinlock after setting the block group to
RO mode. This also avoids using two different values of the counter in
case it changes in between each read. This silences KCSAN and is required
for the next patch in the series too.
Fixes: 243192b67649 ("btrfs: report reclaim stats in sysfs")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
At btrfs_reclaim_bgs_work(), we are grabbing a block group's zone unusable
bytes while not under the protection of the block group's spinlock, so
this can trigger race reports from KCSAN (or similar tools) since that
field is typically updated while holding the lock, such as at
__btrfs_add_free_space_zoned() for example.
Fix this by grabbing the zone unusable bytes while we are still in the
critical section holding the block group's spinlock, which is right above
where we are currently grabbing it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
After the previous patch removed nodesize from the parameters,
__alloc_dummy_extent_buffer() and alloc_dummy_extent_buffer() are
identical so we can drop one.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
All callers pass a valid fs_info so we can read the nodesize from that
instead of passing it as parameter.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's no longer any need for the 'out' label as there are no resources
to cleanup anymore in case of an error and we can directly return if
begin_cmd() fails.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Whenever we issue a command we allocate a path and then compute it. For
the current inode this is not necessary since we have one preallocated
and computed in the send context structure, so we can use it instead
and avoid allocating and freeing a path.
For example if we have 100 extents to send (100 write commands) for a
file, we are allocating and freeing paths 100 times.
So improve on this by avoiding path allocation and freeing whenever a
command is for the current inode by using the current inode's path
stored in the send context structure.
A test was run before applying this patch and the previous one in the
series:
"btrfs: send: keep the current inode's path cached"
The test script is the following:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
DIR="$MNT/one/two/three/four"
FILE="$DIR/foobar"
mkdir -p $DIR
# Create some empty files to get a deeper btree and therefore make
# path computations slower.
for ((i = 1; i <= 30000; i++)); do
    echo -n > "$DIR/filler_$i"
done
for ((i = 0; i < 10000; i += 2)); do
    offset=$(( i * 4096 ))
    xfs_io -f -c "pwrite -S 0xab $offset 4K" $FILE > /dev/null
done
btrfs subvolume snapshot -r $MNT $MNT/snap
start=$(date +%s%N)
btrfs send -f /dev/null $MNT/snap
end=$(date +%s%N)
echo -e "\nsend took $(( (end - start) / 1000000 )) milliseconds"
umount $MNT
Result before applying the 2 patches: 1121 milliseconds
Result after applying the 2 patches: 815 milliseconds (-31.6%)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
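A sketch of the fast path (the cur_inode_path field and the helper name
are illustrative; fs_path_alloc(), fs_path_free() and get_cur_path()
follow the existing send.c helpers):

static struct fs_path *path_for_ino(struct send_ctx *sctx, u64 ino, u64 gen)
{
        struct fs_path *path;
        int ret;

        /* Fast path: reuse the cached path of the inode being processed. */
        if (ino == sctx->cur_ino && sctx->cur_inode_path)
                return sctx->cur_inode_path;

        path = fs_path_alloc();
        if (!path)
                return ERR_PTR(-ENOMEM);
        ret = get_cur_path(sctx, ino, gen, path);
        if (ret < 0) {
                fs_path_free(path);
                return ERR_PTR(ret);
        }
        return path;
}

Callers then free the returned path only when it is not the cached one,
and the cache is invalidated or updated after unlinks and renames, as
described in the entry below.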
|
|
Whenever we need to send a command for the current inode, like sending
writes, xattr updates, truncates, utimes, etc, we compute the inode's
path each time, which implies doing some memory allocations and traversing
the inode hierarchy to extract the name of the inode and each ancestor
directory, and that implies doing lookups in the subvolume tree amongst
other operations.
Most of the time, by far, the current inode's path doesn't change while
we are processing it (like if we need to issue 100 write commands, the
path remains the same and it's pointless to compute it 100 times).
To avoid this keep the current inode's path cached in the send context
and invalidate it or update it whenever it's needed (after unlinks or
renames).
A performance test, and its results, is mentioned in the next patch in
the series (subject: "btrfs: send: avoid path allocation for the current
inode when issuing commands").
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's no need for the 'out' label as there are no resources to cleanup
in case of an error and we can directly return if begin_cmd() fails.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|