summaryrefslogtreecommitdiff
path: root/fs/btrfs/delayed-inode.c
AgeCommit message (Collapse)Author
2025-03-18btrfs: do trivial BTRFS_PATH_AUTO_FREE conversionsDavid Sterba
The most trivial pattern for the auto freeing when the variable is declared with the macro and the final btrfs_free_path() is removed. There are almost none goto -> return conversions and there's no other function cleanup. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: pass struct btrfs_inode to btrfs_fill_inode()David Sterba
Pass a struct btrfs_inode to btrfs_fill_inode() as it's an internal interface, allowing to remove some use of BTRFS_I. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18btrfs: pass struct btrfs_inode to fill_stack_inode_item()David Sterba
Pass a struct btrfs_inode to fill_stack_inode_item() as it's an internal interface, allowing to remove some use of BTRFS_I. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: drop unused parameter fs_info to btrfs_delete_delayed_insertion_item()David Sterba
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: delayed-inode: remove unnecessary call to btrfs_mark_buffer_dirty()Filipe Manana
The call to btrfs_mark_buffer_dirty() at __btrfs_update_delayed_inode() is not necessary as we have a path setup for writing with btrfs_search_slot() having a 'cow' argument set to 1. This just makes the code more verbose, confusing and add a little extra overhead and well as increase the module's text size, so remove it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: update __btrfs_add_delayed_item() to use rb helperRoger L. Beckermeyer III
Update __btrfs_add_delayed_item() to use rb_find_add_cached(). Signed-off-by: Roger L. Beckermeyer III <beckerlee3@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: pass a btrfs_inode to btrfs_readdir_get_delayed_items()David Sterba
Pass a struct btrfs_inode to btrfs_readdir_get_delayed_items() as it's an internal interface, allowing to remove some use of BTRFS_I. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: pass a btrfs_inode to btrfs_readdir_put_delayed_items()David Sterba
Pass a struct btrfs_inode to btrfs_readdir_put_delayed_items() as it's an internal interface, allowing to remove some use of BTRFS_I. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: constify pointer parameters where applicableDavid Sterba
We can add const to many parameters, this is for clarity and minor addition to safety. There are some minor effects, in the assembly code and .ko measured on release config. This patch does not cover all possible conversions. Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: unify index_cnt and csum_bytes from struct btrfs_inodeFilipe Manana
The index_cnt field of struct btrfs_inode is used only for two purposes: 1) To store the index for the next entry added to a directory; 2) For the data relocation inode to track the logical start address of the block group currently being relocated. For the relocation case we use index_cnt because it's not used for anything else in the relocation use case - we could have used other fields that are not used by relocation such as defrag_bytes, last_unlink_trans or last_reflink_trans for example (among others). Since the csum_bytes field is not used for directories, do the following changes: 1) Put index_cnt and csum_bytes in a union, and index_cnt is only initialized when the inode is a directory. The csum_bytes is only accessed in IO paths for regular files, so we're fine here; 2) Use the defrag_bytes field for relocation, since the data relocation inode is never used for defrag purposes. And to make the naming better, alias it to reloc_block_group_start by using a union. This reduces the size of struct btrfs_inode by 8 bytes in a release kernel, from 1056 bytes down to 1048 bytes. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: remove inode_lock from struct btrfs_root and use xarray locksFilipe Manana
Currently we use the spinlock inode_lock from struct btrfs_root to serialize access to two different data structures: 1) The delayed inodes xarray (struct btrfs_root::delayed_nodes); 2) The inodes xarray (struct btrfs_root::inodes). Instead of using our own lock, we can use the spinlock that is part of the xarray implementation, by using the xa_lock() and xa_unlock() APIs and using the xarray APIs with the double underscore prefix that don't take the xarray locks and assume the caller is using xa_lock() and xa_unlock(). So remove the spinlock inode_lock from struct btrfs_root and use the corresponding xarray locks. This brings 2 benefits: 1) We reduce the size of struct btrfs_root, from 1336 bytes down to 1328 bytes on a 64 bits release kernel config; 2) We reduce lock contention by not using anymore the same lock for changing two different and unrelated xarrays. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-07btrfs: change root->root_key.objectid to btrfs_root_id()Josef Bacik
A comment from Filipe on one of my previous cleanups brought my attention to a new helper we have for getting the root id of a root, which makes it easier to read in the code. The changes where made with the following Coccinelle semantic patch: // <smpl> @@ expression E,E1; @@ ( E->root_key.objectid = E1 | - E->root_key.objectid + btrfs_root_id(E) ) // </smpl> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-02btrfs: record delayed inode root in transactionBoris Burkov
When running delayed inode updates, we do not record the inode's root in the transaction, but we do allocate PREALLOC and thus converted PERTRANS space for it. To be sure we free that PERTRANS meta rsv, we must ensure that we record the root in the transaction. Fixes: 4f5427ccce5d ("btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-05btrfs: remove SLAB_MEM_SPREAD flag useChengming Zhou
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was removed as of v6.8-rc1, so it became a dead flag since the commit 16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the series[1] went on to mark it obsolete to avoid confusion for users. Here we can just remove all its users, which has no functional change. [1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: use KMEM_CACHE() to create btrfs_delayed_node cacheKunwu Chan
Use the KMEM_CACHE() macro instead of kmem_cache_create() to simplify the creation of SLAB caches when the default values are used. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: uninline btrfs_init_delayed_root()David Sterba
This is a simple initializer and not on any hot path, it does not need to be static inline. Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: change BUG_ON to assertion when checking for delayed_node rootDavid Sterba
The pointer to root is initialized in btrfs_init_delayed_node(), no need to check for it again. Change the BUG_ON to assertion. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-04btrfs: delayed-inode: drop pointless BUG_ON in __btrfs_remove_delayed_item()David Sterba
There's a BUG_ON checking for a valid pointer of fs_info::delayed_root but it is valid since init_mount_fs_info() and has the same lifetime as fs_info. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: switch btrfs_root::delayed_nodes_tree to xarray from radix-treeDavid Sterba
The radix-tree has been superseded by the xarray (https://lwn.net/Articles/745073), this patch converts the btrfs_root::delayed_nodes, the APIs are used in a simple way. First idea is to do xa_insert() but this would require GFP_ATOMIC allocation which we want to avoid if possible. The preload mechanism of radix-tree can be emulated within the xarray API. - xa_reserve() with GFP_NOFS outside of the lock, the reserved entry is inserted atomically at most once - xa_store() under a lock, in case something races in we can detect that and xa_load() returns a valid pointer All uses of xa_load() must check for a valid pointer in case they manage to get between the xa_reserve() and xa_store(), this is handled in btrfs_get_delayed_node(). Otherwise the functionality is equivalent, xarray implements the radix-tree and there should be no performance difference. The patch continues the efforts started in 253bf57555e451 ("btrfs: turn delayed_nodes_tree into an XArray") and fixes the problems with locking and GFP flags 088aea3b97e0ae ("Revert "btrfs: turn delayed_nodes_tree into an XArray""). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: do not utilize goto to implement delayed inode ref deletionQu Wenruo
[PROBLEM] The function __btrfs_update_delayed_inode() is doing something not meeting the code standard of today: path->slots[0]++ if (path->slots[0] >= btrfs_header_nritems(leaf)) goto search; again: if (!is_the_target_inode_ref()) goto out; ret = btrfs_delete_item(); /* Some cleanup. */ return ret; search: ret = search_for_the_last_inode_ref(); goto again; With the tag named "again", it's pretty common to think it's a loop, but the truth is, we only need to do the search once, to locate the last (also the first, since there should only be one INODE_REF or INODE_EXTREF now) ref of the inode. [FIX] Instead of the weird jumps, just do them in a stream-lined fashion. This removes those weird labels, and add extra comments on why we can do the different searches. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-30Merge tag 'for-6.7-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "New features: - raid-stripe-tree New tree for logical file extent mapping where the physical mapping may not match on multiple devices. This is now used in zoned mode to implement RAID0/RAID1* profiles, but can be used in non-zoned mode as well. The support for RAID56 is in development and will eventually fix the problems with the current implementation. This is a backward incompatible feature and has to be enabled at mkfs time. - simple quota accounting (squota) A simplified mode of qgroup that accounts all space on the initial extent owners (a subvolume), the snapshots are then cheap to create and delete. The deletion of snapshots in fully accounting qgroups is a known CPU/IO performance bottleneck. The squota is not suitable for the general use case but works well for containers where the original subvolume exists for the whole time. This is a backward incompatible feature as it needs extending some structures, but can be enabled on an existing filesystem. - temporary filesystem fsid (temp_fsid) The fsid identifies a filesystem and is hard coded in the structures, which disallows mounting the same fsid found on different devices. For a single device filesystem this is not strictly necessary, a new temporary fsid can be generated on mount e.g. after a device is cloned. This will be used by Steam Deck for root partition A/B testing, or can be used for VM root images. Other user visible changes: - filesystems with partially finished metadata_uuid conversion cannot be mounted anymore and the uuid fixup has to be done by btrfs-progs (btrfstune). Performance improvements: - reduce reservations for checksum deletions (with enabled free space tree by factor of 4), on a sample workload on file with many extents the deletion time decreased by 12% - make extent state merges more efficient during insertions, reduce rb-tree iterations (run time of critical functions reduced by 5%) Core changes: - the integrity check functionality has been removed, this was a debugging feature and removal does not affect other integrity checks like checksums or tree-checker - space reservation changes: - more efficient delayed ref reservations, this avoids building up too much work or overusing or exhausting the global block reserve in some situations - move delayed refs reservation to the transaction start time, this prevents some ENOSPC corner cases related to exhaustion of global reserve - improvements in reducing excessive reservations for block group items - adjust overcommit logic in near full situations, account for one more chunk to eventually allocate metadata chunk, this is mostly relevant for small filesystems (<10GiB) - single device filesystems are scanned but not registered (except seed devices), this allows temp_fsid to work - qgroup iterations do not need GFP_ATOMIC allocations anymore - cleanups, refactoring, reduced data structure size, function parameter simplifications, error handling fixes" * tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (156 commits) btrfs: open code timespec64 in struct btrfs_inode btrfs: remove redundant log root tree index assignment during log sync btrfs: remove redundant initialization of variable dirty in btrfs_update_time() btrfs: sysfs: show temp_fsid feature btrfs: disable the device add feature for temp-fsid btrfs: disable the seed feature for temp-fsid btrfs: update comment for temp-fsid, fsid, and metadata_uuid btrfs: remove pointless empty log context list check when syncing log btrfs: update comment for struct btrfs_inode::lock btrfs: remove pointless barrier from btrfs_sync_file() btrfs: add and use helpers for reading and writing last_trans_committed btrfs: add and use helpers for reading and writing fs_info->generation btrfs: add and use helpers for reading and writing log_transid btrfs: add and use helpers for reading and writing last_log_commit btrfs: support cloned-device mount capability btrfs: add helper function find_fsid_by_disk btrfs: stop reserving excessive space for block group item insertions btrfs: stop reserving excessive space for block group item updates btrfs: reorder btrfs_inode to fill gaps btrfs: open code btrfs_ordered_inode_tree in btrfs_inode ...
2023-10-30Merge tag 'vfs-6.7.ctime' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs inode time accessor updates from Christian Brauner: "This finishes the conversion of all inode time fields to accessor functions as discussed on list. Changing timestamps manually as we used to do before is error prone. Using accessors function makes this robust. It does not contain the switch of the time fields to discrete 64 bit integers to replace struct timespec and free up space in struct inode. But after this, the switch can be trivially made and the patch should only affect the vfs if we decide to do it" * tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (86 commits) fs: rename inode i_atime and i_mtime fields security: convert to new timestamp accessors selinux: convert to new timestamp accessors apparmor: convert to new timestamp accessors sunrpc: convert to new timestamp accessors mm: convert to new timestamp accessors bpf: convert to new timestamp accessors ipc: convert to new timestamp accessors linux: convert to new timestamp accessors zonefs: convert to new timestamp accessors xfs: convert to new timestamp accessors vboxsf: convert to new timestamp accessors ufs: convert to new timestamp accessors udf: convert to new timestamp accessors ubifs: convert to new timestamp accessors tracefs: convert to new timestamp accessors sysv: convert to new timestamp accessors squashfs: convert to new timestamp accessors server: convert to new timestamp accessors client: convert to new timestamp accessors ...
2023-10-18btrfs: convert to new timestamp accessorsJeff Layton
Convert to using the new inode timestamp accessor functions. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20231004185347.80880-21-jlayton@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-12btrfs: open code timespec64 in struct btrfs_inodeDavid Sterba
The type of timespec64::tv_nsec is 'unsigned long', while we have only u32 for on-disk and in-memory. This wastes a few bytes in btrfs_inode. Add separate members for sec and nsec with the corresponding type width. This creates a 4 byte hole in btrfs_inode which can be utilized in the future. Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: remove redundant root argument from btrfs_delayed_update_inode()Filipe Manana
The root argument for btrfs_delayed_update_inode() always matches the root of the given inode, so remove the root argument and get it from the inode argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: merge ordered work callbacks in btrfs_work into oneDavid Sterba
There are two callbacks defined in btrfs_work but only two actually make use of them, otherwise there are NULLs. We can get rid of the freeing callback making it a special case of the normal work. This reduces the size of btrfs_work by 8 bytes, final layout: struct btrfs_work { btrfs_func_t func; /* 0 8 */ btrfs_ordered_func_t ordered_func; /* 8 8 */ struct work_struct normal_work; /* 16 32 */ struct list_head ordered_list; /* 48 16 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct btrfs_workqueue * wq; /* 64 8 */ long unsigned int flags; /* 72 8 */ /* size: 80, cachelines: 2, members: 6 */ /* last cacheline: 16 bytes */ }; This in turn reduces size of other structures (on a release config): - async_chunk 160 -> 152 - async_submit_bio 152 -> 144 - btrfs_async_delayed_work 104 -> 96 - btrfs_caching_control 176 -> 168 - btrfs_delalloc_work 144 -> 136 - btrfs_fs_info 3608 -> 3600 - btrfs_ordered_extent 440 -> 424 - btrfs_writepage_fixup 104 -> 96 Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: abort transaction on generation mismatch when marking eb as dirtyFilipe Manana
When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(), we check if its generation matches the running transaction and if not we just print a warning. Such mismatch is an indicator that something really went wrong and only printing a warning message (and stack trace) is not enough to prevent a corruption. Allowing a transaction to commit with such an extent buffer will trigger an error if we ever try to read it from disk due to a generation mismatch with its parent generation. So abort the current transaction with -EUCLEAN if we notice a generation mismatch. For this we need to pass a transaction handle to btrfs_mark_buffer_dirty() which is always available except in test code, in which case we can pass NULL since it operates on dummy extent buffers and all test roots have a single node/leaf (root node at level 0). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: reformat remaining kdoc style commentsDavid Sterba
Function name in the comment does not bring much value to code not exposed as API and we don't stick to the kdoc format anymore. Update formatting of parameter descriptions. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: update comment for reservation of metadata space for delayed itemsFilipe Manana
The second comment at btrfs_delayed_item_reserve_metadata() refers to a field named "index_items_size" of a delayed inode, however that field does not exists - it existed in a previous patch version, but then it split into the fields "curr_index_batch_size" and "index_item_leaves" in the final patch version that was picked. So update the comment. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-11Merge tag 'for-6.6-rc5-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A revert of recent mount option parsing fix, this breaks mounts with security options. The second patch is a flexible array annotation" * tag 'for-6.6-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: add __counted_by for struct btrfs_delayed_item and use struct_size() Revert "btrfs: reject unknown mount options early"
2023-10-11btrfs: add __counted_by for struct btrfs_delayed_item and use struct_size()Gustavo A. R. Silva
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). While there, use struct_size() helper, instead of the open-coded version, to calculate the size for the allocation of the whole flexible structure, including of course, the flexible-array member. This code was found with the help of Coccinelle, and audited and fixed manually. Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-12Merge tag 'for-6.6-rc1-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - several fixes for handling directory item (inserting, removing, iteration, error handling) - fix transaction commit stalls when auto relocation is running and blocks other tasks that want to commit - fix a build error when DEBUG is enabled - fix lockdep warning in inode number lookup ioctl - fix race when finishing block group creation - remove link to obsolete wiki in several files * tag 'for-6.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: MAINTAINERS: remove links to obsolete btrfs.wiki.kernel.org btrfs: assert delayed node locked when removing delayed item btrfs: remove BUG() after failure to insert delayed dir index item btrfs: improve error message after failure to add delayed dir index item btrfs: fix a compilation error if DEBUG is defined in btree_dirty_folio btrfs: check for BTRFS_FS_ERROR in pending ordered assert btrfs: fix lockdep splat and potential deadlock after failure running delayed items btrfs: do not block starts waiting on previous transaction commit btrfs: release path before inode lookup during the ino lookup ioctl btrfs: fix race between finishing block group creation and its item update
2023-09-08btrfs: assert delayed node locked when removing delayed itemFilipe Manana
When removing a delayed item, or releasing which will remove it as well, we will modify one of the delayed node's rbtrees and item counter if the delayed item is in one of the rbtrees. This require having the delayed node's mutex locked, otherwise we will race with other tasks modifying the rbtrees and the counter. This is motivated by a previous version of another patch actually calling btrfs_release_delayed_item() after unlocking the delayed node's mutex and against a delayed item that is in a rbtree. So assert at __btrfs_remove_delayed_item() that the delayed node's mutex is locked. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-08btrfs: remove BUG() after failure to insert delayed dir index itemFilipe Manana
Instead of calling BUG() when we fail to insert a delayed dir index item into the delayed node's tree, we can just release all the resources we have allocated/acquired before and return the error to the caller. This is fine because all existing call chains undo anything they have done before calling btrfs_insert_delayed_dir_index() or BUG_ON (when creating pending snapshots in the transaction commit path). So remove the BUG() call and do proper error handling. This relates to a syzbot report linked below, but does not fix it because it only prevents hitting a BUG(), it does not fix the issue where somehow we attempt to use twice the same index number for different index items. Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/ CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-08btrfs: improve error message after failure to add delayed dir index itemFilipe Manana
If we fail to add a delayed dir index item because there's already another item with the same index number, we print an error message (and then BUG). However that message isn't very helpful to debug anything because we don't know what's the index number and what are the values of index counters in the inode and its delayed inode (index_cnt fields of struct btrfs_inode and struct btrfs_delayed_node). So update the error message to include the index number and counters. We actually had a recent case where this issue was hit by a syzbot report (see the link below). Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-08btrfs: fix lockdep splat and potential deadlock after failure running ↵Filipe Manana
delayed items When running delayed items we are holding a delayed node's mutex and then we will attempt to modify a subvolume btree to insert/update/delete the delayed items. However if have an error during the insertions for example, btrfs_insert_delayed_items() may return with a path that has locked extent buffers (a leaf at the very least), and then we attempt to release the delayed node at __btrfs_run_delayed_items(), which requires taking the delayed node's mutex, causing an ABBA type of deadlock. This was reported by syzbot and the lockdep splat is the following: WARNING: possible circular locking dependency detected 6.5.0-rc7-syzkaller-00024-g93f5de5f648d #0 Not tainted ------------------------------------------------------ syz-executor.2/13257 is trying to acquire lock: ffff88801835c0c0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node+0x9a/0xaa0 fs/btrfs/delayed-inode.c:256 but task is already holding lock: ffff88802a5ab8e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x3c/0x2a0 fs/btrfs/locking.c:198 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (btrfs-tree-00){++++}-{3:3}: __lock_release kernel/locking/lockdep.c:5475 [inline] lock_release+0x36f/0x9d0 kernel/locking/lockdep.c:5781 up_write+0x79/0x580 kernel/locking/rwsem.c:1625 btrfs_tree_unlock_rw fs/btrfs/locking.h:189 [inline] btrfs_unlock_up_safe+0x179/0x3b0 fs/btrfs/locking.c:239 search_leaf fs/btrfs/ctree.c:1986 [inline] btrfs_search_slot+0x2511/0x2f80 fs/btrfs/ctree.c:2230 btrfs_insert_empty_items+0x9c/0x180 fs/btrfs/ctree.c:4376 btrfs_insert_delayed_item fs/btrfs/delayed-inode.c:746 [inline] btrfs_insert_delayed_items fs/btrfs/delayed-inode.c:824 [inline] __btrfs_commit_inode_delayed_items+0xd24/0x2410 fs/btrfs/delayed-inode.c:1111 __btrfs_run_delayed_items+0x1db/0x430 fs/btrfs/delayed-inode.c:1153 flush_space+0x269/0xe70 fs/btrfs/space-info.c:723 btrfs_async_reclaim_metadata_space+0x106/0x350 fs/btrfs/space-info.c:1078 process_one_work+0x92c/0x12c0 kernel/workqueue.c:2600 worker_thread+0xa63/0x1210 kernel/workqueue.c:2751 kthread+0x2b8/0x350 kernel/kthread.c:389 ret_from_fork+0x2e/0x60 arch/x86/kernel/process.c:145 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304 -> #0 (&delayed_node->mutex){+.+.}-{3:3}: check_prev_add kernel/locking/lockdep.c:3142 [inline] check_prevs_add kernel/locking/lockdep.c:3261 [inline] validate_chain kernel/locking/lockdep.c:3876 [inline] __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144 lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761 __mutex_lock_common+0x1d8/0x2530 kernel/locking/mutex.c:603 __mutex_lock kernel/locking/mutex.c:747 [inline] mutex_lock_nested+0x1b/0x20 kernel/locking/mutex.c:799 __btrfs_release_delayed_node+0x9a/0xaa0 fs/btrfs/delayed-inode.c:256 btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline] __btrfs_run_delayed_items+0x2b5/0x430 fs/btrfs/delayed-inode.c:1156 btrfs_commit_transaction+0x859/0x2ff0 fs/btrfs/transaction.c:2276 btrfs_sync_file+0xf56/0x1330 fs/btrfs/file.c:1988 vfs_fsync_range fs/sync.c:188 [inline] vfs_fsync fs/sync.c:202 [inline] do_fsync fs/sync.c:212 [inline] __do_sys_fsync fs/sync.c:220 [inline] __se_sys_fsync fs/sync.c:218 [inline] __x64_sys_fsync+0x196/0x1e0 fs/sync.c:218 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-tree-00); lock(&delayed_node->mutex); lock(btrfs-tree-00); lock(&delayed_node->mutex); *** DEADLOCK *** 3 locks held by syz-executor.2/13257: #0: ffff88802c1ee370 (btrfs_trans_num_writers){++++}-{0:0}, at: spin_unlock include/linux/spinlock.h:391 [inline] #0: ffff88802c1ee370 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0xb87/0xe00 fs/btrfs/transaction.c:287 #1: ffff88802c1ee398 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0xbb2/0xe00 fs/btrfs/transaction.c:288 #2: ffff88802a5ab8e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x3c/0x2a0 fs/btrfs/locking.c:198 stack backtrace: CPU: 0 PID: 13257 Comm: syz-executor.2 Not tainted 6.5.0-rc7-syzkaller-00024-g93f5de5f648d #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106 check_noncircular+0x375/0x4a0 kernel/locking/lockdep.c:2195 check_prev_add kernel/locking/lockdep.c:3142 [inline] check_prevs_add kernel/locking/lockdep.c:3261 [inline] validate_chain kernel/locking/lockdep.c:3876 [inline] __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144 lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761 __mutex_lock_common+0x1d8/0x2530 kernel/locking/mutex.c:603 __mutex_lock kernel/locking/mutex.c:747 [inline] mutex_lock_nested+0x1b/0x20 kernel/locking/mutex.c:799 __btrfs_release_delayed_node+0x9a/0xaa0 fs/btrfs/delayed-inode.c:256 btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline] __btrfs_run_delayed_items+0x2b5/0x430 fs/btrfs/delayed-inode.c:1156 btrfs_commit_transaction+0x859/0x2ff0 fs/btrfs/transaction.c:2276 btrfs_sync_file+0xf56/0x1330 fs/btrfs/file.c:1988 vfs_fsync_range fs/sync.c:188 [inline] vfs_fsync fs/sync.c:202 [inline] do_fsync fs/sync.c:212 [inline] __do_sys_fsync fs/sync.c:220 [inline] __se_sys_fsync fs/sync.c:218 [inline] __x64_sys_fsync+0x196/0x1e0 fs/sync.c:218 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f3ad047cae9 Code: 28 00 00 00 75 (...) RSP: 002b:00007f3ad12510c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a RAX: ffffffffffffffda RBX: 00007f3ad059bf80 RCX: 00007f3ad047cae9 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005 RBP: 00007f3ad04c847a R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000000b R14: 00007f3ad059bf80 R15: 00007ffe56af92f8 </TASK> ------------[ cut here ]------------ Fix this by releasing the path before releasing the delayed node in the error path at __btrfs_run_delayed_items(). Reported-by: syzbot+a379155f07c134ea9879@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/000000000000abba27060403b5bd@google.com/ CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-28Merge tag 'for-6.6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "No new features, the bulk of the changes are fixes, refactoring and cleanups. The notable fix is the scrub performance restoration after rewrite in 6.4, though still only partial. Fixes: - scrub performance drop due to rewrite in 6.4 partially restored: - do IO grouping by blg_plug/blk_unplug again - avoid unnecessary tree searches when processing stripes, in extent and checksum trees - the drop is noticeable on fast PCIe devices, -66% and restored to -33% of the original - backports to 6.4 planned - handle more corner cases of transaction commit during orphan cleanup or delayed ref processing - use correct fsid/metadata_uuid when validating super block - copy directory permissions and time when creating a stub subvolume Core: - debugging feature integrity checker deprecated, to be removed in 6.7 - in zoned mode, zones are activated just before the write, making error handling easier, now the overcommit mechanism can be enabled again which improves performance by avoiding more frequent flushing - v0 extent handling completely removed, deprecated long time ago - error handling improvements - tests: - extent buffer bitmap tests - pinned extent splitting tests - cleanups and refactoring: - compression writeback - extent buffer bitmap - space flushing, ENOSPC handling" * tag 'for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (110 commits) btrfs: zoned: skip splitting and logical rewriting on pre-alloc write btrfs: tests: test invalid splitting when skipping pinned drop extent_map btrfs: tests: add a test for btrfs_add_extent_mapping btrfs: tests: add extent_map tests for dropping with odd layouts btrfs: scrub: move write back of repaired sectors to scrub_stripe_read_repair_worker() btrfs: scrub: don't go ordered workqueue for dev-replace btrfs: scrub: fix grouping of read IO btrfs: scrub: avoid unnecessary csum tree search preparing stripes btrfs: scrub: avoid unnecessary extent tree search preparing stripes btrfs: copy dir permission and time when creating a stub subvolume btrfs: remove pointless empty list check when reading delayed dir indexes btrfs: drop redundant check to use fs_devices::metadata_uuid btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_super btrfs: use the correct superblock to compare fsid in btrfs_validate_super btrfs: simplify memcpy either of metadata_uuid or fsid btrfs: add a helper to read the superblock metadata_uuid btrfs: remove v0 extent handling btrfs: output extra debug info if we failed to find an inline backref btrfs: move the !zoned assert into run_delalloc_cow btrfs: consolidate the error handling in run_delalloc_nocow ...
2023-08-28Merge tag 'v6.6-vfs.ctime' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs timestamp updates from Christian Brauner: "This adds VFS support for multi-grain timestamps and converts tmpfs, xfs, ext4, and btrfs to use them. This carries acks from all relevant filesystems. The VFS always uses coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot of metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g., backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. This introduces fine-grained timestamps that are used when they are actively queried. This uses the 31st bit of the ctime tv_nsec field to indicate that something has queried the inode for the mtime or ctime. When this flag is set, on the next mtime or ctime update, the kernel will fetch a fine-grained timestamp instead of the usual coarse-grained one. As POSIX generally mandates that when the mtime changes, the ctime must also change the kernel always stores normalized ctime values, so only the first 30 bits of the tv_nsec field are ever used. Filesytems can opt into this behavior by setting the FS_MGTIME flag in the fstype. Filesystems that don't set this flag will continue to use coarse-grained timestamps. Various preparatory changes, fixes and cleanups are included: - Fixup all relevant places where POSIX requires updating ctime together with mtime. This is a wide-range of places and all maintainers provided necessary Acks. - Add new accessors for inode->i_ctime directly and change all callers to rely on them. Plain accesses to inode->i_ctime are now gone and it is accordingly rename to inode->__i_ctime and commented as requiring accessors. - Extend generic_fillattr() to pass in a request mask mirroring in a sense the statx() uapi. This allows callers to pass in a request mask to only get a subset of attributes filled in. - Rework timestamp updates so it's possible to drop the @now parameter the update_time() inode operation and associated helpers. - Add inode_update_timestamps() and convert all filesystems to it removing a bunch of open-coding" * tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits) btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps tmpfs: add support for multigrain timestamps fs: add infrastructure for multigrain timestamps fs: drop the timespec64 argument from update_time xfs: have xfs_vn_update_time gets its own timestamp fat: make fat_update_time get its own timestamp fat: remove i_version handling from fat_update_time ubifs: have ubifs_update_time use inode_update_timestamps btrfs: have it use inode_update_timestamps fs: drop the timespec64 arg from generic_update_time fs: pass the request_mask to generic_fillattr fs: remove silly warning from current_time gfs2: fix timestamp handling on quota inodes fs: rename i_ctime field to __i_ctime selinux: convert to ctime accessor functions security: convert to ctime accessor functions apparmor: convert to ctime accessor functions sunrpc: convert to ctime accessor functions ...
2023-08-21btrfs: remove pointless empty list check when reading delayed dir indexesFilipe Manana
At btrfs_readdir_delayed_dir_index(), called when reading a directory, we have this check for an empty list to return immediately, but it's not needed since list_for_each_entry_safe(), called immediately after, is prepared to deal with an empty list, it simply does nothing. So remove the empty list check. Besides shorter source code, it also slightly reduces the binary text size: Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1609408 167269 16864 1793541 1b5e05 fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1609392 167269 16864 1793525 1b5df5 fs/btrfs/btrfs.ko Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-14btrfs: fix infinite directory readsFilipe Manana
The readdir implementation currently processes always up to the last index it finds. This however can result in an infinite loop if the directory has a large number of entries such that they won't all fit in the given buffer passed to the readdir callback, that is, dir_emit() returns a non-zero value. Because in that case readdir() will be called again and if in the meanwhile new directory entries were added and we still can't put all the remaining entries in the buffer, we keep repeating this over and over. The following C program and test script reproduce the problem: $ cat /mnt/readdir_prog.c #include <sys/types.h> #include <dirent.h> #include <stdio.h> int main(int argc, char *argv[]) { DIR *dir = opendir("."); struct dirent *dd; while ((dd = readdir(dir))) { printf("%s\n", dd->d_name); rename(dd->d_name, "TEMPFILE"); rename("TEMPFILE", dd->d_name); } closedir(dir); } $ gcc -o /mnt/readdir_prog /mnt/readdir_prog.c $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV &> /dev/null #mkfs.xfs -f $DEV &> /dev/null #mkfs.ext4 -F $DEV &> /dev/null mount $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= 2000; i++)); do echo -n > $MNT/testdir/file_$i done cd $MNT/testdir /mnt/readdir_prog cd /mnt umount $MNT This behaviour is surprising to applications and it's unlike ext4, xfs, tmpfs, vfat and other filesystems, which always finish. In this case where new entries were added due to renames, some file names may be reported more than once, but this varies according to each filesystem - for example ext4 never reported the same file more than once while xfs reports the first 13 file names twice. So change our readdir implementation to track the last index number when opendir() is called and then make readdir() never process beyond that index number. This gives the same behaviour as ext4. Reported-by: Rob Landley <rob@landley.net> Link: https://lore.kernel.org/linux-btrfs/2c8c55ec-04c6-e0dc-9c5c-8c7924778c35@landley.net/ Link: https://bugzilla.kernel.org/show_bug.cgi?id=217681 CC: stable@vger.kernel.org # 6.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-13btrfs: convert to ctime accessor functionsJeff Layton
In later patches, we're going to change how the inode's ctime field is used. Switch to using accessor functions instead of raw accesses of inode->i_ctime. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230705190309.579783-27-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-03-06btrfs: handle btrfs_del_item errors in __btrfs_update_delayed_inodevoid0red
Even if the slot is already read out, we may still need to re-balance the tree, thus it can cause error in that btrfs_del_item() call and we need to handle it properly. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: void0red <void0red@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: pass btrfs_inode to btrfs_inode_unlockDavid Sterba
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: pass btrfs_inode to btrfs_inode_lockDavid Sterba
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move file-item prototypes into their own headerJosef Bacik
Move these prototypes out of ctree.h and into file-item.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: extend btrfs_dir_item type to store encryption statusOmar Sandoval
For directories with encrypted files/filenames, we need to store a flag indicating this fact. There's no room in other fields, so we'll need to borrow a bit from dir_type. Since it's now a combination of type and flags, we rename it to dir_flags to reflect its new usage. The new flag, FT_ENCRYPTED, indicates a directory containing encrypted data, which is orthogonal to file type; therefore, add the new flag, and make conversion from directory type to file type strip the flag. As the file types almost never change we can afford to use the bits. Actual usage will be guarded behind an incompat bit, this patch only adds the support for later use by fscrypt. Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move accessor helpers into accessors.hJosef Bacik
This is a large patch, but because they're all macros it's impossible to split up. Simply copy all of the item accessors in ctree.h and paste them in accessors.h, and then update any files to include the header so everything compiles. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments, style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move BTRFS_FS_STATE* definitions and helpers to fs.hJosef Bacik
We're going to use fs.h to hold fs wide related helpers and definitions, move the FS_STATE enum and related helpers to fs.h, and then update all files that need these definitions to include fs.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move the printk helpers out of ctree.hJosef Bacik
We have a bunch of printk helpers that are in ctree.h. These have nothing to do with ctree.c, so move them into their own header. Subsequent patches will cleanup the printk helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move flush related definitions to space-info.hJosef Bacik
This code is used in space-info.c, move the definitions to space-info.h. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>