summaryrefslogtreecommitdiff
path: root/fs/btrfs/inode.c
AgeCommit message (Collapse)Author
2023-10-18btrfs: convert to new timestamp accessorsJeff Layton
Convert to using the new inode timestamp accessor functions. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20231004185347.80880-21-jlayton@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-12btrfs: open code timespec64 in struct btrfs_inodeDavid Sterba
The type of timespec64::tv_nsec is 'unsigned long', while we have only u32 for on-disk and in-memory. This wastes a few bytes in btrfs_inode. Add separate members for sec and nsec with the corresponding type width. This creates a 4 byte hole in btrfs_inode which can be utilized in the future. Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: remove redundant initialization of variable dirty in btrfs_update_time()Colin Ian King
The variable dirty is initialized with a value that is never read, it is being re-assigned later on. Remove the redundant initialization. Cleans up clang scan build warning: fs/btrfs/inode.c:5965:7: warning: Value stored to 'dirty' during its initialization is never read [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: add and use helpers for reading and writing fs_info->generationFilipe Manana
Currently the generation field of struct btrfs_fs_info is always modified while holding fs_info->trans_lock locked. Most readers will access this field without taking that lock but while holding a transaction handle, which is safe to do due to the transaction life cycle. However there are other readers that are neither holding the lock nor holding a transaction handle open: 1) When reading an inode from disk, at btrfs_read_locked_inode(); 2) When reading the generation to expose it to sysfs, at btrfs_generation_show(); 3) Early in the fsync path, at skip_inode_logging(); 4) When creating a hole at btrfs_cont_expand(), during write paths, truncate and reflinking; 5) In the fs_info ioctl (btrfs_ioctl_fs_info()); 6) While mounting the filesystem, in the open_ctree() path. In these cases it's safe to directly read fs_info->generation as no one can concurrently start a transaction and update fs_info->generation. In case of the fsync path, races here should be harmless, and in the worst case they may cause a fsync to log an inode when it's not really needed, so nothing bad from a functional perspective. In the other cases it's not so clear if functional problems may arise, though in case 1 rare things like a load/store tearing [1] may cause the BTRFS_INODE_NEEDS_FULL_SYNC flag not being set on an inode and therefore result in incorrect logging later on in case a fsync call is made. To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the generation field of struct btrfs_fs_info using READ_ONCE() and WRITE_ONCE(), and use these helpers where needed. [1] https://lwn.net/Articles/793253/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: open code btrfs_ordered_inode_tree in btrfs_inodeDavid Sterba
The structure btrfs_ordered_inode_tree is used only in one place, in btrfs_inode. The structure itself has a 4 byte hole which is wasted space. Move the btrfs_ordered_inode_tree members to btrfs_inode with a common prefix 'ordered_tree_' where the hole can be utilized and shrink inode size. Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: change test_range_bit to scan the whole rangeDavid Sterba
The semantics of test_range_bit() with filled == 0 is now in it's own helper so test_range_bit will check the whole range unconditionally. The detection logic is flipped and assumes success by default and catches exceptions. Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: add specific helper for range bit test existsDavid Sterba
The existing helper test_range_bit works in two ways, checks if the whole range contains all the bits, or stop on the first occurrence. By adding a specific helper for the latter case, the inner loop can be simplified and contains fewer conditionals, making it a bit faster. There's no caller that uses the cached state pointer so this reduces the argument count further. Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: remove redundant root argument from maybe_insert_hole()Filipe Manana
The root argument for maybe_insert_hole() always matches the root of the given inode, so remove the root argument and get it from the inode argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: remove redundant root argument from btrfs_delayed_update_inode()Filipe Manana
The root argument for btrfs_delayed_update_inode() always matches the root of the given inode, so remove the root argument and get it from the inode argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: remove redundant root argument from btrfs_update_inode_item()Filipe Manana
The root argument for btrfs_update_inode_item() always matches the root of the given inode, so remove the root argument and get it from the inode argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: remove redundant root argument from btrfs_update_inode()Filipe Manana
The root argument for btrfs_update_inode() always matches the root of the given inode, so remove the root argument and get it from the inode argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: remove redundant root argument from btrfs_update_inode_fallback()Filipe Manana
The root argument for btrfs_update_inode_fallback() always matches the root of the given inode, so remove the root argument and get it from the inode argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: remove noinline from btrfs_update_inode()Filipe Manana
The noinline attribute of btrfs_update_inode() is pointless as the function is exported and widely used, so remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: simplify error check condition at btrfs_dirty_inode()Filipe Manana
The following condition at btrfs_dirty_inode() is redundant: if (ret && (ret == -ENOSPC || ret == -EDQUOT)) The first check for a non-zero 'ret' value is pointless, we can simplify this to simply: if (ret == -ENOSPC || ret == -EDQUOT) Not only this makes it easier to read, it also slightly reduces the text size of the btrfs kernel module: $ size fs/btrfs/btrfs.ko.before text data bss dec hex filename 1641400 168265 16864 1826529 1bdee1 fs/btrfs/btrfs.ko.before $ size fs/btrfs/btrfs.ko.after text data bss dec hex filename 1641224 168181 16864 1826269 1bdddd fs/btrfs/btrfs.ko.after Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: merge ordered work callbacks in btrfs_work into oneDavid Sterba
There are two callbacks defined in btrfs_work but only two actually make use of them, otherwise there are NULLs. We can get rid of the freeing callback making it a special case of the normal work. This reduces the size of btrfs_work by 8 bytes, final layout: struct btrfs_work { btrfs_func_t func; /* 0 8 */ btrfs_ordered_func_t ordered_func; /* 8 8 */ struct work_struct normal_work; /* 16 32 */ struct list_head ordered_list; /* 48 16 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct btrfs_workqueue * wq; /* 64 8 */ long unsigned int flags; /* 72 8 */ /* size: 80, cachelines: 2, members: 6 */ /* last cacheline: 16 bytes */ }; This in turn reduces size of other structures (on a release config): - async_chunk 160 -> 152 - async_submit_bio 152 -> 144 - btrfs_async_delayed_work 104 -> 96 - btrfs_caching_control 176 -> 168 - btrfs_delalloc_work 144 -> 136 - btrfs_fs_info 3608 -> 3600 - btrfs_ordered_extent 440 -> 424 - btrfs_writepage_fixup 104 -> 96 Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: add support for inserting raid stripe extentsJohannes Thumshirn
Add support for inserting stripe extents into the raid stripe tree on completion of every write that needs an extra logical-to-physical translation when using RAID. Inserting the stripe extents happens after the data I/O has completed, this is done to a) support zone-append and b) rule out the possibility of a RAID-write-hole. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: abort transaction on generation mismatch when marking eb as dirtyFilipe Manana
When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(), we check if its generation matches the running transaction and if not we just print a warning. Such mismatch is an indicator that something really went wrong and only printing a warning message (and stack trace) is not enough to prevent a corruption. Allowing a transaction to commit with such an extent buffer will trigger an error if we ever try to read it from disk due to a generation mismatch with its parent generation. So abort the current transaction with -EUCLEAN if we notice a generation mismatch. For this we need to pass a transaction handle to btrfs_mark_buffer_dirty() which is always available except in test code, in which case we can pass NULL since it operates on dummy extent buffers and all test roots have a single node/leaf (root node at level 0). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: reformat remaining kdoc style commentsDavid Sterba
Function name in the comment does not bring much value to code not exposed as API and we don't stick to the kdoc format anymore. Update formatting of parameter descriptions. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-20Merge tag 'for-6.6-rc2-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more followup fixes to the directory listing. People have noticed different behaviour compared to other filesystems after changes in 6.5. This is now unified to more "logical" and expected behaviour while still within POSIX. And a few more fixes for stable. - change behaviour of readdir()/rewinddir() when new directory entries are created after opendir(), properly tracking the last entry - fix race in readdir when multiple threads can set the last entry index for a directory Additionally: - use exclusive lock when direct io might need to drop privs and call notify_change() - don't clear uptodate bit on page after an error, this may lead to a deadlock in subpage mode - fix waiting pattern when multiple readers block on Merkle tree data, switch to folios" * tag 'for-6.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix race between reading a directory and adding entries to it btrfs: refresh dir last index during a rewinddir(3) call btrfs: set last dir index to the current last index when opening dir btrfs: don't clear uptodate on write errors btrfs: file_remove_privs needs an exclusive lock in direct io write btrfs: convert btrfs_read_merkle_tree_page() to use a folio
2023-09-14btrfs: fix race between reading a directory and adding entries to itFilipe Manana
When opening a directory (opendir(3)) or rewinding it (rewinddir(3)), we are not holding the directory's inode locked, and this can result in later attempting to add two entries to the directory with the same index number, resulting in a transaction abort, with -EEXIST (-17), when inserting the second delayed dir index. This results in a trace like the following: Sep 11 22:34:59 myhostname kernel: BTRFS error (device dm-3): err add delayed dir index item(name: cockroach-stderr.log) into the insertion tree of the delayed node(root id: 5, inode id: 4539217, errno: -17) Sep 11 22:34:59 myhostname kernel: ------------[ cut here ]------------ Sep 11 22:34:59 myhostname kernel: kernel BUG at fs/btrfs/delayed-inode.c:1504! Sep 11 22:34:59 myhostname kernel: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI Sep 11 22:34:59 myhostname kernel: CPU: 0 PID: 7159 Comm: cockroach Not tainted 6.4.15-200.fc38.x86_64 #1 Sep 11 22:34:59 myhostname kernel: Hardware name: ASUS ESC500 G3/P9D WS, BIOS 2402 06/27/2018 Sep 11 22:34:59 myhostname kernel: RIP: 0010:btrfs_insert_delayed_dir_index+0x1da/0x260 Sep 11 22:34:59 myhostname kernel: Code: eb dd 48 (...) Sep 11 22:34:59 myhostname kernel: RSP: 0000:ffffa9980e0fbb28 EFLAGS: 00010282 Sep 11 22:34:59 myhostname kernel: RAX: 0000000000000000 RBX: ffff8b10b8f4a3c0 RCX: 0000000000000000 Sep 11 22:34:59 myhostname kernel: RDX: 0000000000000000 RSI: ffff8b177ec21540 RDI: ffff8b177ec21540 Sep 11 22:34:59 myhostname kernel: RBP: ffff8b110cf80888 R08: 0000000000000000 R09: ffffa9980e0fb938 Sep 11 22:34:59 myhostname kernel: R10: 0000000000000003 R11: ffffffff86146508 R12: 0000000000000014 Sep 11 22:34:59 myhostname kernel: R13: ffff8b1131ae5b40 R14: ffff8b10b8f4a418 R15: 00000000ffffffef Sep 11 22:34:59 myhostname kernel: FS: 00007fb14a7fe6c0(0000) GS:ffff8b177ec00000(0000) knlGS:0000000000000000 Sep 11 22:34:59 myhostname kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Sep 11 22:34:59 myhostname kernel: CR2: 000000c00143d000 CR3: 00000001b3b4e002 CR4: 00000000001706f0 Sep 11 22:34:59 myhostname kernel: Call Trace: Sep 11 22:34:59 myhostname kernel: <TASK> Sep 11 22:34:59 myhostname kernel: ? die+0x36/0x90 Sep 11 22:34:59 myhostname kernel: ? do_trap+0xda/0x100 Sep 11 22:34:59 myhostname kernel: ? btrfs_insert_delayed_dir_index+0x1da/0x260 Sep 11 22:34:59 myhostname kernel: ? do_error_trap+0x6a/0x90 Sep 11 22:34:59 myhostname kernel: ? btrfs_insert_delayed_dir_index+0x1da/0x260 Sep 11 22:34:59 myhostname kernel: ? exc_invalid_op+0x50/0x70 Sep 11 22:34:59 myhostname kernel: ? btrfs_insert_delayed_dir_index+0x1da/0x260 Sep 11 22:34:59 myhostname kernel: ? asm_exc_invalid_op+0x1a/0x20 Sep 11 22:34:59 myhostname kernel: ? btrfs_insert_delayed_dir_index+0x1da/0x260 Sep 11 22:34:59 myhostname kernel: ? btrfs_insert_delayed_dir_index+0x1da/0x260 Sep 11 22:34:59 myhostname kernel: btrfs_insert_dir_item+0x200/0x280 Sep 11 22:34:59 myhostname kernel: btrfs_add_link+0xab/0x4f0 Sep 11 22:34:59 myhostname kernel: ? ktime_get_real_ts64+0x47/0xe0 Sep 11 22:34:59 myhostname kernel: btrfs_create_new_inode+0x7cd/0xa80 Sep 11 22:34:59 myhostname kernel: btrfs_symlink+0x190/0x4d0 Sep 11 22:34:59 myhostname kernel: ? schedule+0x5e/0xd0 Sep 11 22:34:59 myhostname kernel: ? __d_lookup+0x7e/0xc0 Sep 11 22:34:59 myhostname kernel: vfs_symlink+0x148/0x1e0 Sep 11 22:34:59 myhostname kernel: do_symlinkat+0x130/0x140 Sep 11 22:34:59 myhostname kernel: __x64_sys_symlinkat+0x3d/0x50 Sep 11 22:34:59 myhostname kernel: do_syscall_64+0x5d/0x90 Sep 11 22:34:59 myhostname kernel: ? syscall_exit_to_user_mode+0x2b/0x40 Sep 11 22:34:59 myhostname kernel: ? do_syscall_64+0x6c/0x90 Sep 11 22:34:59 myhostname kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc The race leading to the problem happens like this: 1) Directory inode X is loaded into memory, its ->index_cnt field is initialized to (u64)-1 (at btrfs_alloc_inode()); 2) Task A is adding a new file to directory X, holding its vfs inode lock, and calls btrfs_set_inode_index() to get an index number for the entry. Because the inode's index_cnt field is set to (u64)-1 it calls btrfs_inode_delayed_dir_index_count() which fails because no dir index entries were added yet to the delayed inode and then it calls btrfs_set_inode_index_count(). This functions finds the last dir index key and then sets index_cnt to that index value + 1. It found that the last index key has an offset of 100. However before it assigns a value of 101 to index_cnt... 3) Task B calls opendir(3), ending up at btrfs_opendir(), where the VFS lock for inode X is not taken, so it calls btrfs_get_dir_last_index() and sees index_cnt still with a value of (u64)-1. Because of that it calls btrfs_inode_delayed_dir_index_count() which fails since no dir index entries were added to the delayed inode yet, and then it also calls btrfs_set_inode_index_count(). This also finds that the last index key has an offset of 100, and before it assigns the value 101 to the index_cnt field of inode X... 4) Task A assigns a value of 101 to index_cnt. And then the code flow goes to btrfs_set_inode_index() where it increments index_cnt from 101 to 102. Task A then creates a delayed dir index entry with a sequence number of 101 and adds it to the delayed inode; 5) Task B assigns 101 to the index_cnt field of inode X; 6) At some later point when someone tries to add a new entry to the directory, btrfs_set_inode_index() will return 101 again and shortly after an attempt to add another delayed dir index key with index number 101 will fail with -EEXIST resulting in a transaction abort. Fix this by locking the inode at btrfs_get_dir_last_index(), which is only only used when opening a directory or attempting to lseek on it. Reported-by: ken <ken@bllue.org> Link: https://lore.kernel.org/linux-btrfs/CAE6xmH+Lp=Q=E61bU+v9eWX8gYfLvu6jLYxjxjFpo3zHVPR0EQ@mail.gmail.com/ Reported-by: syzbot+d13490c82ad5353c779d@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/ Fixes: 9b378f6ad48c ("btrfs: fix infinite directory reads") CC: stable@vger.kernel.org # 6.5+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-14btrfs: refresh dir last index during a rewinddir(3) callFilipe Manana
When opening a directory we find what's the index of its last entry and then store it in the directory's file handle private data (struct btrfs_file_private::last_index), so that in the case new directory entries are added to a directory after an opendir(3) call we don't end up in an infinite loop (see commit 9b378f6ad48c ("btrfs: fix infinite directory reads")) when calling readdir(3). However once rewinddir(3) is called, POSIX states [1] that any new directory entries added after the previous opendir(3) call, must be returned by subsequent calls to readdir(3): "The rewinddir() function shall reset the position of the directory stream to which dirp refers to the beginning of the directory. It shall also cause the directory stream to refer to the current state of the corresponding directory, as a call to opendir() would have done." We currently don't refresh the last_index field of the struct btrfs_file_private associated to the directory, so after a rewinddir(3) we are not returning any new entries added after the opendir(3) call. Fix this by finding the current last index of the directory when llseek is called against the directory. This can be reproduced by the following C program provided by Ian Johnson: #include <dirent.h> #include <stdio.h> int main(void) { DIR *dir = opendir("test"); FILE *file; file = fopen("test/1", "w"); fwrite("1", 1, 1, file); fclose(file); file = fopen("test/2", "w"); fwrite("2", 1, 1, file); fclose(file); rewinddir(dir); struct dirent *entry; while ((entry = readdir(dir))) { printf("%s\n", entry->d_name); } closedir(dir); return 0; } Reported-by: Ian Johnson <ian@ianjohnson.dev> Link: https://lore.kernel.org/linux-btrfs/YR1P0S.NGASEG570GJ8@ianjohnson.dev/ Fixes: 9b378f6ad48c ("btrfs: fix infinite directory reads") CC: stable@vger.kernel.org # 6.5+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-14btrfs: set last dir index to the current last index when opening dirFilipe Manana
When opening a directory for reading it, we set the last index where we stop iteration to the value in struct btrfs_inode::index_cnt. That value does not match the index of the most recently added directory entry but it's instead the index number that will be assigned the next directory entry. This means that if after the call to opendir(3) new directory entries are added, a readdir(3) call will return the first new directory entry. This is fine because POSIX says the following [1]: "If a file is removed from or added to the directory after the most recent call to opendir() or rewinddir(), whether a subsequent call to readdir() returns an entry for that file is unspecified." For example for the test script from commit 9b378f6ad48c ("btrfs: fix infinite directory reads"), where we have 2000 files in a directory, ext4 doesn't return any new directory entry after opendir(3), while xfs returns the first 13 new directory entries added after the opendir(3) call. If we move to a shorter example with an empty directory when opendir(3) is called, and 2 files added to the directory after the opendir(3) call, then readdir(3) on btrfs will return the first file, ext4 and xfs return the 2 files (but in a different order). A test program for this, reported by Ian Johnson, is the following: #include <dirent.h> #include <stdio.h> int main(void) { DIR *dir = opendir("test"); FILE *file; file = fopen("test/1", "w"); fwrite("1", 1, 1, file); fclose(file); file = fopen("test/2", "w"); fwrite("2", 1, 1, file); fclose(file); struct dirent *entry; while ((entry = readdir(dir))) { printf("%s\n", entry->d_name); } closedir(dir); return 0; } To make this less odd, change the behaviour to never return new entries that were added after the opendir(3) call. This is done by setting the last_index field of the struct btrfs_file_private attached to the directory's file handle with a value matching btrfs_inode::index_cnt minus 1, since that value always matches the index of the next new directory entry and not the index of the most recently added entry. [1] https://pubs.opengroup.org/onlinepubs/007904875/functions/readdir_r.html Link: https://lore.kernel.org/linux-btrfs/YR1P0S.NGASEG570GJ8@ianjohnson.dev/ CC: stable@vger.kernel.org # 6.5+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-13btrfs: don't clear uptodate on write errorsJosef Bacik
We have been consistently seeing hangs with generic/648 in our subpage GitHub CI setup. This is a classic deadlock, we are calling btrfs_read_folio() on a folio, which requires holding the folio lock on the folio, and then finding a ordered extent that overlaps that range and calling btrfs_start_ordered_extent(), which then tries to write out the dirty page, which requires taking the folio lock and then we deadlock. The hang happens because we're writing to range [1271750656, 1271767040), page index [77621, 77622], and page 77621 is !Uptodate. It is also Dirty, so we call btrfs_read_folio() for 77621 and which does btrfs_lock_and_flush_ordered_range() for that range, and we find an ordered extent which is [1271644160, 1271746560), page index [77615, 77621]. The page indexes overlap, but the actual bytes don't overlap. We're holding the page lock for 77621, then call btrfs_lock_and_flush_ordered_range() which tries to flush the dirty page, and tries to lock 77621 again and then we deadlock. The byte ranges do not overlap, but with subpage support if we clear uptodate on any portion of the page we mark the entire thing as not uptodate. We have been clearing page uptodate on write errors, but no other file system does this, and is in fact incorrect. This doesn't hurt us in the !subpage case because we can't end up with overlapped ranges that don't also overlap on the page. Fix this by not clearing uptodate when we have a write error. The only thing we should be doing in this case is setting the mapping error and carrying on. This makes it so we would no longer call btrfs_read_folio() on the page as it's uptodate and eliminates the deadlock. With this patch we're now able to make it through a full fstests run on our subpage blocksize VMs. Note for stable backports: this probably goes beyond 6.1 but the code has been cleaned up and clearing the uptodate bit must be verified on each version independently. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-28Merge tag 'for-6.6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "No new features, the bulk of the changes are fixes, refactoring and cleanups. The notable fix is the scrub performance restoration after rewrite in 6.4, though still only partial. Fixes: - scrub performance drop due to rewrite in 6.4 partially restored: - do IO grouping by blg_plug/blk_unplug again - avoid unnecessary tree searches when processing stripes, in extent and checksum trees - the drop is noticeable on fast PCIe devices, -66% and restored to -33% of the original - backports to 6.4 planned - handle more corner cases of transaction commit during orphan cleanup or delayed ref processing - use correct fsid/metadata_uuid when validating super block - copy directory permissions and time when creating a stub subvolume Core: - debugging feature integrity checker deprecated, to be removed in 6.7 - in zoned mode, zones are activated just before the write, making error handling easier, now the overcommit mechanism can be enabled again which improves performance by avoiding more frequent flushing - v0 extent handling completely removed, deprecated long time ago - error handling improvements - tests: - extent buffer bitmap tests - pinned extent splitting tests - cleanups and refactoring: - compression writeback - extent buffer bitmap - space flushing, ENOSPC handling" * tag 'for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (110 commits) btrfs: zoned: skip splitting and logical rewriting on pre-alloc write btrfs: tests: test invalid splitting when skipping pinned drop extent_map btrfs: tests: add a test for btrfs_add_extent_mapping btrfs: tests: add extent_map tests for dropping with odd layouts btrfs: scrub: move write back of repaired sectors to scrub_stripe_read_repair_worker() btrfs: scrub: don't go ordered workqueue for dev-replace btrfs: scrub: fix grouping of read IO btrfs: scrub: avoid unnecessary csum tree search preparing stripes btrfs: scrub: avoid unnecessary extent tree search preparing stripes btrfs: copy dir permission and time when creating a stub subvolume btrfs: remove pointless empty list check when reading delayed dir indexes btrfs: drop redundant check to use fs_devices::metadata_uuid btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_super btrfs: use the correct superblock to compare fsid in btrfs_validate_super btrfs: simplify memcpy either of metadata_uuid or fsid btrfs: add a helper to read the superblock metadata_uuid btrfs: remove v0 extent handling btrfs: output extra debug info if we failed to find an inline backref btrfs: move the !zoned assert into run_delalloc_cow btrfs: consolidate the error handling in run_delalloc_nocow ...
2023-08-28Merge tag 'v6.6-vfs.ctime' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs timestamp updates from Christian Brauner: "This adds VFS support for multi-grain timestamps and converts tmpfs, xfs, ext4, and btrfs to use them. This carries acks from all relevant filesystems. The VFS always uses coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot of metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g., backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. This introduces fine-grained timestamps that are used when they are actively queried. This uses the 31st bit of the ctime tv_nsec field to indicate that something has queried the inode for the mtime or ctime. When this flag is set, on the next mtime or ctime update, the kernel will fetch a fine-grained timestamp instead of the usual coarse-grained one. As POSIX generally mandates that when the mtime changes, the ctime must also change the kernel always stores normalized ctime values, so only the first 30 bits of the tv_nsec field are ever used. Filesytems can opt into this behavior by setting the FS_MGTIME flag in the fstype. Filesystems that don't set this flag will continue to use coarse-grained timestamps. Various preparatory changes, fixes and cleanups are included: - Fixup all relevant places where POSIX requires updating ctime together with mtime. This is a wide-range of places and all maintainers provided necessary Acks. - Add new accessors for inode->i_ctime directly and change all callers to rely on them. Plain accesses to inode->i_ctime are now gone and it is accordingly rename to inode->__i_ctime and commented as requiring accessors. - Extend generic_fillattr() to pass in a request mask mirroring in a sense the statx() uapi. This allows callers to pass in a request mask to only get a subset of attributes filled in. - Rework timestamp updates so it's possible to drop the @now parameter the update_time() inode operation and associated helpers. - Add inode_update_timestamps() and convert all filesystems to it removing a bunch of open-coding" * tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits) btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps tmpfs: add support for multigrain timestamps fs: add infrastructure for multigrain timestamps fs: drop the timespec64 argument from update_time xfs: have xfs_vn_update_time gets its own timestamp fat: make fat_update_time get its own timestamp fat: remove i_version handling from fat_update_time ubifs: have ubifs_update_time use inode_update_timestamps btrfs: have it use inode_update_timestamps fs: drop the timespec64 arg from generic_update_time fs: pass the request_mask to generic_fillattr fs: remove silly warning from current_time gfs2: fix timestamp handling on quota inodes fs: rename i_ctime field to __i_ctime selinux: convert to ctime accessor functions security: convert to ctime accessor functions apparmor: convert to ctime accessor functions sunrpc: convert to ctime accessor functions ...
2023-08-21btrfs: copy dir permission and time when creating a stub subvolumeLee Trager
btrfs supports creating nested subvolumes however snapshots are not recursive. When a snapshot is taken of a volume which contains a subvolume the subvolume is replaced with a stub subvolume which has the same name and uses inode number 2[1]. The stub subvolume kept the directory name but did not set the time or permissions of the stub subvolume. This resulted in all time information being the current time and ownership defaulting to root. When subvolumes and snapshots are created using unshare this results in a snapshot directory the user created but has no permissions for. Test case: [vmuser@archvm ~]# sudo -i [root@archvm ~]# mkdir -p /mnt/btrfs/test [root@archvm ~]# chown vmuser:users /mnt/btrfs/test/ [root@archvm ~]# exit logout [vmuser@archvm ~]$ cd /mnt/btrfs/test [vmuser@archvm test]$ unshare --user --keep-caps --map-auto --map-root-user [root@archvm test]# btrfs subvolume create subvolume Create subvolume './subvolume' [root@archvm test]# btrfs subvolume create subvolume/subsubvolume Create subvolume 'subvolume/subsubvolume' [root@archvm test]# btrfs subvolume snapshot subvolume snapshot Create a snapshot of 'subvolume' in './snapshot' [root@archvm test]# exit logout [vmuser@archvm test]$ tree -ug [vmuser users ] . ├── [vmuser users ] snapshot │   └── [vmuser users ] subsubvolume <-- Without patch perm is root:root └── [vmuser users ] subvolume └── [vmuser users ] subsubvolume 5 directories, 0 files [1] https://btrfs.readthedocs.io/en/latest/btrfs-subvolume.html#nested-subvolumes Signed-off-by: Lee Trager <lee@trager.us> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: move the !zoned assert into run_delalloc_cowChristoph Hellwig
Having the assert in the actual helper documents the pre-conditions much better than having it in the caller, so move it. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: consolidate the error handling in run_delalloc_nocowChristoph Hellwig
Share the calls to extent_clear_unlock_delalloc for btrfs_path allocation failure handling and the normal exit path. This relies on btrfs_free_path ignoring a NULL pointer, and the initialization of cur_offset to start at the beginning of the function. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: cleanup the COW fallback logic in run_delalloc_nocowChristoph Hellwig
Use the block group pointer used to track the outstanding NOCOW writes as a boolean to remove the duplicate nocow variable, and keep it contained in the main loop to simplify the logic. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: fix error handling when in a COW window in run_delalloc_nocowChristoph Hellwig
When run_delalloc_nocow has cow_start set to a value other than (u64)-1, it has delayed COW writeback pending behind cur_offset. When an error occurs in such a window, the range going back to cow_start and not just cur_offset needs to be unlocked, but only two error cases handle this correctly Move the code to handle unlock the COW range to the common error handling label and document the logic. To make things even more complicated, cow_file_range as called by fallback_to_cow will unlock the range it is operating on when it fails as well, so we need to reset cow_start right after caling fallback_to_cow instead of only when it succeeded. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: use LIST_HEAD() to initialize the list_headRuan Jinjie
Use LIST_HEAD() to initialize the list_head instead of open-coding it. Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: return real error when orphan cleanup fails due to a transaction abortFilipe Manana
During mount we will call btrfs_orphan_cleanup() to remove any inodes that were previously deleted (have a link count of 0) but for which we were not able before to remove their items from the subvolume tree. The removal of the items will happen by triggering eviction, when we do the final iput() on them at btrfs_orphan_cleanup(), which will end in the loop at btrfs_evict_inode() that truncates inode items. In a dire situation we may have a transaction abort due to -ENOSPC when attempting to truncate the inode items, and in that case the orphan item (key type BTRFS_ORPHAN_ITEM_KEY) will remain in the subvolume tree and when we hit the next iteration of the while loop at btrfs_orphan_cleanup() we will find the same orphan item as before, and then we will return -EINVAL from btrfs_orphan_cleanup() through the following if statement: if (found_key.offset == last_objectid) { btrfs_err(fs_info, "Error removing orphan entry, stopping orphan cleanup"); ret = -EINVAL; goto out; } This makes the mount operation fail with -EINVAL, when it should have been -ENOSPC. This is confusing because -EINVAL might lead a user into thinking it provided invalid mount options for example. An example where this happens: $ mount test.img /mnt mount: /mnt: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error. $ dmesg [ 2542.356934] BTRFS: device fsid 977fff75-1181-4d2b-a739-384fa710d16e devid 1 transid 47409973 /dev/loop0 scanned by mount (4459) [ 2542.357451] BTRFS info (device loop0): using crc32c (crc32c-intel) checksum algorithm [ 2542.357461] BTRFS info (device loop0): disk space caching is enabled [ 2542.742287] BTRFS info (device loop0): auto enabling async discard [ 2542.764554] BTRFS info (device loop0): checking UUID tree [ 2551.743065] ------------[ cut here ]------------ [ 2551.743068] BTRFS: Transaction aborted (error -28) [ 2551.743149] WARNING: CPU: 7 PID: 215 at fs/btrfs/block-group.c:3494 btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs] [ 2551.743311] Modules linked in: btrfs blake2b_generic (...) [ 2551.743353] CPU: 7 PID: 215 Comm: kworker/u24:5 Not tainted 6.4.0-rc6-btrfs-next-134+ #1 [ 2551.743356] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014 [ 2551.743357] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] [ 2551.743405] RIP: 0010:btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs] [ 2551.743449] Code: 8b 43 0c (...) [ 2551.743451] RSP: 0018:ffff982c005a7c40 EFLAGS: 00010286 [ 2551.743452] RAX: 0000000000000000 RBX: ffff88fc6e44b400 RCX: 0000000000000000 [ 2551.743453] RDX: 0000000000000002 RSI: ffffffff8dff0878 RDI: 00000000ffffffff [ 2551.743454] RBP: ffff88fc51817208 R08: 0000000000000000 R09: ffff982c005a7ae0 [ 2551.743455] R10: 0000000000000001 R11: 0000000000000001 R12: ffff88fc43d2e570 [ 2551.743456] R13: ffff88fc43d2e400 R14: ffff88fc8fb08ee0 R15: ffff88fc6e44b530 [ 2551.743457] FS: 0000000000000000(0000) GS:ffff89035fbc0000(0000) knlGS:0000000000000000 [ 2551.743458] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2551.743459] CR2: 00007fa8cdf2f6f4 CR3: 0000000124850003 CR4: 0000000000370ee0 [ 2551.743462] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 2551.743463] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 2551.743464] Call Trace: [ 2551.743472] <TASK> [ 2551.743474] ? __warn+0x80/0x130 [ 2551.743478] ? btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs] [ 2551.743520] ? report_bug+0x1f4/0x200 [ 2551.743523] ? handle_bug+0x42/0x70 [ 2551.743526] ? exc_invalid_op+0x14/0x70 [ 2551.743528] ? asm_exc_invalid_op+0x16/0x20 [ 2551.743532] ? btrfs_write_dirty_block_groups+0x397/0x3d0 [btrfs] [ 2551.743574] ? _raw_spin_unlock+0x15/0x30 [ 2551.743576] ? btrfs_run_delayed_refs+0x1bd/0x200 [btrfs] [ 2551.743609] commit_cowonly_roots+0x1e9/0x260 [btrfs] [ 2551.743652] btrfs_commit_transaction+0x42e/0xfa0 [btrfs] [ 2551.743693] ? __pfx_autoremove_wake_function+0x10/0x10 [ 2551.743697] flush_space+0xf1/0x5d0 [btrfs] [ 2551.743743] ? _raw_spin_unlock+0x15/0x30 [ 2551.743745] ? finish_task_switch+0x91/0x2a0 [ 2551.743748] ? _raw_spin_unlock+0x15/0x30 [ 2551.743750] ? btrfs_get_alloc_profile+0xc9/0x1f0 [btrfs] [ 2551.743793] btrfs_async_reclaim_metadata_space+0xe1/0x230 [btrfs] [ 2551.743837] process_one_work+0x1d9/0x3e0 [ 2551.743844] worker_thread+0x4a/0x3b0 [ 2551.743847] ? __pfx_worker_thread+0x10/0x10 [ 2551.743849] kthread+0xee/0x120 [ 2551.743852] ? __pfx_kthread+0x10/0x10 [ 2551.743854] ret_from_fork+0x29/0x50 [ 2551.743860] </TASK> [ 2551.743861] ---[ end trace 0000000000000000 ]--- [ 2551.743863] BTRFS info (device loop0: state A): dumping space info: [ 2551.743866] BTRFS info (device loop0: state A): space_info DATA has 126976 free, is full [ 2551.743868] BTRFS info (device loop0: state A): space_info total=13458472960, used=13458137088, pinned=143360, reserved=0, may_use=0, readonly=65536 zone_unusable=0 [ 2551.743870] BTRFS info (device loop0: state A): space_info METADATA has -51625984 free, is full [ 2551.743872] BTRFS info (device loop0: state A): space_info total=771751936, used=770146304, pinned=1605632, reserved=0, may_use=51625984, readonly=0 zone_unusable=0 [ 2551.743874] BTRFS info (device loop0: state A): space_info SYSTEM has 14663680 free, is not full [ 2551.743875] BTRFS info (device loop0: state A): space_info total=14680064, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0 [ 2551.743877] BTRFS info (device loop0: state A): global_block_rsv: size 53231616 reserved 51544064 [ 2551.743878] BTRFS info (device loop0: state A): trans_block_rsv: size 0 reserved 0 [ 2551.743879] BTRFS info (device loop0: state A): chunk_block_rsv: size 0 reserved 0 [ 2551.743880] BTRFS info (device loop0: state A): delayed_block_rsv: size 0 reserved 0 [ 2551.743881] BTRFS info (device loop0: state A): delayed_refs_rsv: size 786432 reserved 0 [ 2551.743886] BTRFS: error (device loop0: state A) in btrfs_write_dirty_block_groups:3494: errno=-28 No space left [ 2551.743911] BTRFS info (device loop0: state EA): forced readonly [ 2551.743951] BTRFS warning (device loop0: state EA): could not allocate space for delete; will truncate on mount [ 2551.743962] BTRFS error (device loop0: state EA): Error removing orphan entry, stopping orphan cleanup [ 2551.743973] BTRFS warning (device loop0: state EA): Skipping commit of aborted transaction. [ 2551.743989] BTRFS error (device loop0: state EA): could not do orphan cleanup -22 So make the btrfs_orphan_cleanup() return the value of BTRFS_FS_ERROR(), if it's set, and -EINVAL otherwise. For that same example, after this change, the mount operation fails with -ENOSPC: $ mount test.img /mnt mount: /mnt: mount(2) system call failed: No space left on device. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: fix zoned handling in submit_uncompressed_rangeChristoph Hellwig
For zoned file systems we need to use run_delalloc_zoned to submit writeback, as we need to write out partial allocations when running into zone active limits. submit_uncompressed_range currently always calls cow_file_range to allocate blocks and thus misses the active zone limits handling. Fix this by passing the pages_dirty argument to run_delalloc_zoned and always using it from submit_uncompressed_range as it does the right thing for zoned and non-zoned file systems. To account for the fact that run_delalloc_zoned is now also used for non-zoned file systems rename it to run_delalloc_cow, and add comment describing it. Fixes: 42c011000963 ("btrfs: zoned: introduce dedicated data write path for zoned filesystems") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: don't redirty locked_page in run_delalloc_zonedChristoph Hellwig
extent_write_locked_range currently expects that either all or no pages are dirty when it is called. Bur run_delalloc_zoned is called directly in the writepages path, and has the dirty bit cleared only for locked_page and which the extent_write_cache_pages currently operates. It currently works around this by redirtying locked_page, but that is a bit inefficient and cumbersome. Pass a locked_page argument to run_delalloc_zoned so that clearing the dirty bit can be skipped on just that page. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: refactor the zoned device handling in cow_file_rangeChristoph Hellwig
Handling of the done_offset to cow_file_range is a bit confusing, as it is not updated at all when the function succeeds, and the -EAGAIN status is used bother for the case where we need to wait for a zone finish and the one where the allocation was partially successful. Change the calling convention so that done_offset is always updated, and 0 is returned if some allocation was successful (partial allocation can still only happen for zoned devices), and waiting for a zone finish is done internally in cow_file_range instead of the caller. Also write a comment explaining the logic. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: don't redirty pages in compress_file_rangeChristoph Hellwig
compress_file_range needs to clear the dirty bit before handing off work to the compression worker threads to prevent processes coming in through mmap and changing the file contents while the compression is accessing the data (See commit 4adaa611020f ("Btrfs: fix race between mmap writes and compression"). But when compress_file_range decides to not compress the data, it falls back to submit_uncompressed_range which uses extent_write_locked_range to write the uncompressed data. extent_write_locked_range currently expects all pages to be marked dirty so that it can clear the dirty bit itself, and thus compress_file_range has to redirty the page range. Redirtying the page range is rather inefficient and also pointless, so instead pass a pages_dirty parameter to extent_write_locked_range and skip the redirty game entirely. Note that compress_file_range was even redirtying the locked_page twice given that extent_range_clear_dirty_for_io already redirties all pages in the range, which must include locked_page if there is one. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: share the code to free the page array in compress_file_rangeChristoph Hellwig
compress_file_range has two code blocks to free the page array for the compressed data. Share the code using a goto label. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: use a separate label for the incompressible case in compress_file_rangeChristoph Hellwig
compress_file_range can fail to compress either because of resource or alignment constraints or because the data is incompressible. In the latter case the inode is marked so that compression isn't tried again. Currently that check is based on the condition that the pages array has been allocated which is rather cryptic. Use a separate label to clearly distinguish this case. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: further simplify the compress or not logic in compress_file_rangeChristoph Hellwig
Currently the logic whether to compress or not in compress_file_range is a bit convoluted because it tries to share code for creating inline extents for the compressible [1] path and the bail to uncompressed path. But the latter isn't needed at all, because cow_file_range as called by submit_uncompressed_range will already create inline extents as needed, so there is no need to have special handling for it if we can live with the fact that it will be called a bit later in the ->ordered_func of the workqueue instead of right now. [1] there is undocumented logic that creates an uncompressed inline extent outside of the shall not compress logic if total_in is too small. This logic isn't explained in comments or any commit log I could find, so I've preserved it. Documentation explaining it would be appreciated if anyone understands this code. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: streamline compress_file_rangeChristoph Hellwig
Reorder compress_file_range so that the main compression flow happens straight line and not in branches. To do this ensure that pages is always zeroed before a page allocation happens, which allows the cleanup_and_bail_uncompressed label to clean up the page allocations as needed. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: merge submit_compressed_extents and async_cow_submitChristoph Hellwig
The code in submit_compressed_extents just loops over the async_extents, and doesn't need to be conditional on an inode being present, as there won't be any async_extent in the list if we created and inline extent. Merge the two functions to simplify the logic. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: merge async_cow_start and compress_file_rangeChristoph Hellwig
There is no good reason to have the simple async_cow_start wrapper, merge the argument conversion into the main compress_file_range function. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: don't clear async_chunk->inode in async_cow_startChristoph Hellwig
Now that the ->inode check isn't needed in submit_compressed_extents any more, there is no reason to clear the field early. Always keep the inode around until the work item is finished and remove the special casing, and the counting of compressed extents in compress_file_range. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: clean up the check for uncompressed ranges in submit_one_async_extentChristoph Hellwig
Instead of checking for a NULL !pages and explaining this with a cryptic comment, just check the compression type for BTRFS_COMPRESS_NONE to make the check self-explanatory. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: reduce the number of arguments to btrfs_run_delalloc_rangeChristoph Hellwig
Instead of a separate page_started argument that tells the callers that btrfs_run_delalloc_range already started writeback by itself, overload the return value with a positive 1 in additio to 0 and a negative error code to indicate that is has already started writeback, and remove the nr_written argument as that caller can calculate it directly based on the range, and in fact already does so for the case where writeback wasn't started yet. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: remove the return value from submit_uncompressed_rangeChristoph Hellwig
The return value from submit_uncompressed_range is ignored, and that's fine because the error reporting happens through the mapping and ordered_extent. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: reduce debug spam from submit_compressed_extentsChristoph Hellwig
Move the printk that is supposed to help to debug failures in submit_one_async_extent into submit_one_async_extent and make it coniditonal on actually having an error condition instead of spamming the log unconditionally. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: remove end_extent_writepageChristoph Hellwig
end_extent_writepage is a small helper that combines a call to btrfs_mark_ordered_io_finished with conditional error-only calls to btrfs_page_clear_uptodate and mapping_set_error with a somewhat unfortunate calling convention that passes and inclusive end instead of the len expected by the underlying functions. Remove end_extent_writepage and open code it in the 4 callers. Out of those two already are error-only and thus don't need the extra conditional, and one already has the mapping_set_error, so a duplicate call can be avoided. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: remove btrfs_writepage_endio_finish_orderedChristoph Hellwig
btrfs_writepage_endio_finish_ordered is a small wrapper around btrfs_mark_ordered_io_finished that just changs the argument passing slightly, and adds a tracepoint. Move the tracpoint to btrfs_mark_ordered_io_finished, which means it now also covers the error handling in btrfs_cleanup_ordered_extent and switch all callers to just call btrfs_mark_ordered_io_finished directly. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: don't create inline extents in fallback_to_cowChristoph Hellwig
For NOCOW files, run_delalloc_nocow can still fall back to COW allocations when required and calls to fallback_to_cow helper for that. For such an allocation we can have multiple ordered_extents for existing extents that NOCOW overwrites and new allocations that fallback_to_cow creates. If one of the new extents is an inline extent, the writepages could would have to avoid normal page writeback for them as indicated by the page_started return argument, which run_delalloc_nocow can't return. Fix this by never creating inline extents from fallback_to_cow. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>