git.armlinux.org.uk/linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
13 days	Merge tag 'mm-stable-2025-10-01-19-00' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "mm, swap: improve cluster scan strategy" from Kairui Song improves performance and reduces the failure rate of swap cluster allocation - "support large align and nid in Rust allocators" from Vitaly Wool permits Rust allocators to set NUMA node and large alignment when perforning slub and vmalloc reallocs - "mm/damon/vaddr: support stat-purpose DAMOS" from Yueyang Pan extend DAMOS_STAT's handling of the DAMON operations sets for virtual address spaces for ops-level DAMOS filters - "execute PROCMAP_QUERY ioctl under per-vma lock" from Suren Baghdasaryan reduces mmap_lock contention during reads of /proc/pid/maps - "mm/mincore: minor clean up for swap cache checking" from Kairui Song performs some cleanup in the swap code - "mm: vm_normal_page() improvements" from David Hildenbrand provides code cleanup in the pagemap code - "add persistent huge zero folio support" from Pankaj Raghav provides a block layer speedup by optionalls making the huge_zero_pagepersistent, instead of releasing it when its refcount falls to zero - "kho: fixes and cleanups" from Mike Rapoport adds a few touchups to the recently added Kexec Handover feature - "mm: make mm->flags a bitmap and 64-bit on all arches" from Lorenzo Stoakes turns mm_struct.flags into a bitmap. To end the constant struggle with space shortage on 32-bit conflicting with 64-bit's needs - "mm/swapfile.c and swap.h cleanup" from Chris Li cleans up some swap code - "selftests/mm: Fix false positives and skip unsupported tests" from Donet Tom fixes a few things in our selftests code - "prctl: extend PR_SET_THP_DISABLE to only provide THPs when advised" from David Hildenbrand "allows individual processes to opt-out of THP=always into THP=madvise, without affecting other workloads on the system". It's a long story - the [1/N] changelog spells out the considerations - "Add and use memdesc_flags_t" from Matthew Wilcox gets us started on the memdesc project. Please see https://kernelnewbies.org/MatthewWilcox/Memdescs and https://blogs.oracle.com/linux/post/introducing-memdesc - "Tiny optimization for large read operations" from Chi Zhiling improves the efficiency of the pagecache read path - "Better split_huge_page_test result check" from Zi Yan improves our folio splitting selftest code - "test that rmap behaves as expected" from Wei Yang adds some rmap selftests - "remove write_cache_pages()" from Christoph Hellwig removes that function and converts its two remaining callers - "selftests/mm: uffd-stress fixes" from Dev Jain fixes some UFFD selftests issues - "introduce kernel file mapped folios" from Boris Burkov introduces the concept of "kernel file pages". Using these permits btrfs to account its metadata pages to the root cgroup, rather than to the cgroups of random inappropriate tasks - "mm/pageblock: improve readability of some pageblock handling" from Wei Yang provides some readability improvements to the page allocator code - "mm/damon: support ARM32 with LPAE" from SeongJae Park teaches DAMON to understand arm32 highmem - "tools: testing: Use existing atomic.h for vma/maple tests" from Brendan Jackman performs some code cleanups and deduplication under tools/testing/ - "maple_tree: Fix testing for 32bit compiles" from Liam Howlett fixes a couple of 32-bit issues in tools/testing/radix-tree.c - "kasan: unify kasan_enabled() and remove arch-specific implementations" from Sabyrzhan Tasbolatov moves KASAN arch-specific initialization code into a common arch-neutral implementation - "mm: remove zpool" from Johannes Weiner removes zspool - an indirection layer which now only redirects to a single thing (zsmalloc) - "mm: task_stack: Stack handling cleanups" from Pasha Tatashin makes a couple of cleanups in the fork code - "mm: remove nth_page()" from David Hildenbrand makes rather a lot of adjustments at various nth_page() callsites, eventually permitting the removal of that undesirable helper function - "introduce kasan.write_only option in hw-tags" from Yeoreum Yun creates a KASAN read-only mode for ARM, using that architecture's memory tagging feature. It is felt that a read-only mode KASAN is suitable for use in production systems rather than debug-only - "mm: hugetlb: cleanup hugetlb folio allocation" from Kefeng Wang does some tidying in the hugetlb folio allocation code - "mm: establish const-correctness for pointer parameters" from Max Kellermann makes quite a number of the MM API functions more accurate about the constness of their arguments. This was getting in the way of subsystems (in this case CEPH) when they attempt to improving their own const/non-const accuracy - "Cleanup free_pages() misuse" from Vishal Moola fixes a number of code sites which were confused over when to use free_pages() vs __free_pages() - "Add Rust abstraction for Maple Trees" from Alice Ryhl makes the mapletree code accessible to Rust. Required by nouveau and by its forthcoming successor: the new Rust Nova driver - "selftests/mm: split_huge_page_test: split_pte_mapped_thp improvements" from David Hildenbrand adds a fix and some cleanups to the thp selftesting code - "mm, swap: introduce swap table as swap cache (phase I)" from Chris Li and Kairui Song is the first step along the path to implementing "swap tables" - a new approach to swap allocation and state tracking which is expected to yield speed and space improvements. This patchset itself yields a 5-20% performance benefit in some situations - "Some ptdesc cleanups" from Matthew Wilcox utilizes the new memdesc layer to clean up the ptdesc code a little - "Fix va_high_addr_switch.sh test failure" from Chunyu Hu fixes some issues in our 5-level pagetable selftesting code - "Minor fixes for memory allocation profiling" from Suren Baghdasaryan addresses a couple of minor issues in relatively new memory allocation profiling feature - "Small cleanups" from Matthew Wilcox has a few cleanups in preparation for more memdesc work - "mm/damon: add addr_unit for DAMON_LRU_SORT and DAMON_RECLAIM" from Quanmin Yan makes some changes to DAMON in furtherance of supporting arm highmem - "selftests/mm: Add -Wunreachable-code and fix warnings" from Muhammad Anjum adds that compiler check to selftests code and fixes the fallout, by removing dead code - "Improvements to Victim Process Thawing and OOM Reaper Traversal Order" from zhongjinji makes a number of improvements in the OOM killer: mainly thawing a more appropriate group of victim threads so they can release resources - "mm/damon: misc fixups and improvements for 6.18" from SeongJae Park is a bunch of small and unrelated fixups for DAMON - "mm/damon: define and use DAMON initialization check function" from SeongJae Park implement reliability and maintainability improvements to a recently-added bug fix - "mm/damon/stat: expose auto-tuned intervals and non-idle ages" from SeongJae Park provides additional transparency to userspace clients of the DAMON_STAT information - "Expand scope of khugepaged anonymous collapse" from Dev Jain removes some constraints on khubepaged's collapsing of anon VMAs. It also increases the success rate of MADV_COLLAPSE against an anon vma - "mm: do not assume file == vma->vm_file in compat_vma_mmap_prepare()" from Lorenzo Stoakes moves us further towards removal of file_operations.mmap(). This patchset concentrates upon clearing up the treatment of stacked filesystems - "mm: Improve mlock tracking for large folios" from Kiryl Shutsemau provides some fixes and improvements to mlock's tracking of large folios. /proc/meminfo's "Mlocked" field became more accurate - "mm/ksm: Fix incorrect accounting of KSM counters during fork" from Donet Tom fixes several user-visible KSM stats inaccuracies across forks and adds selftest code to verify these counters - "mm_slot: fix the usage of mm_slot_entry" from Wei Yang addresses some potential but presently benign issues in KSM's mm_slot handling tag 'mm-stable-2025-10-01-19-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (372 commits) mm: swap: check for stable address space before operating on the VMA mm: convert folio_page() back to a macro mm/khugepaged: use start_addr/addr for improved readability hugetlbfs: skip VMAs without shareable locks in hugetlb_vmdelete_list alloc_tag: fix boot failure due to NULL pointer dereference mm: silence data-race in update_hiwater_rss mm/memory-failure: don't select MEMORY_ISOLATION mm/khugepaged: remove definition of struct khugepaged_mm_slot mm/ksm: get mm_slot by mm_slot_entry() when slot is !NULL hugetlb: increase number of reserving hugepages via cmdline selftests/mm: add fork inheritance test for ksm_merging_pages counter mm/ksm: fix incorrect KSM counter handling in mm_struct during fork drivers/base/node: fix double free in register_one_node() mm: remove PMD alignment constraint in execmem_vmalloc() mm/memory_hotplug: fix typo 'esecially' -> 'especially' mm/rmap: improve mlock tracking for large folios mm/filemap: map entire large folio faultaround mm/fault: try to map the entire file folio in finish_fault() mm/rmap: mlock large folios in try_to_unmap_one() mm/rmap: fix a mlock race condition in folio_referenced_one() ...
13 days	Merge tag 'for-6.18/io_uring-20250929' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring updates from Jens Axboe: - Store ring provided buffers locally for the users, rather than stuff them into struct io_kiocb. These types of buffers must always be fully consumed or recycled in the current context, and leaving them in struct io_kiocb is hence not a good ideas as that struct has a vastly different life time. Basically just an architecture cleanup that can help prevent issues with ring provided buffers in the future. - Support for mixed CQE sizes in the same ring. Before this change, a CQ ring either used the default 16b CQEs, or it was setup with 32b CQE using IORING_SETUP_CQE32. For use cases where a few 32b CQEs were needed, this caused everything else to use big CQEs. This is wasteful both in terms of memory usage, but also memory bandwidth for the posted CQEs. With IORING_SETUP_CQE_MIXED, applications may use request types that post both normal 16b and big 32b CQEs on the same ring. - Add helpers for async data management, to make it harder for opcode handlers to mess it up. - Add support for multishot for uring_cmd, which ublk can use. This helps improve efficiency, by providing a persistent request type that can trigger multiple CQEs. - Add initial support for ring feature querying. We had basic support for probe operations, but the API isn't great. Rather than expand that, add support for QUERY which is easily expandable and can cover a lot more cases than the existing probe support. This will help applications get a better idea of what operations are supported on a given host. - zcrx improvements from Pavel: - Improve refill entry alignment for better caching - Various cleanups, especially around deduplicating normal memory vs dmabuf setup. - Generalisation of the niov size (Patch 12). It's still hard coded to PAGE_SIZE on init, but will let the user to specify the rx buffer length on setup. - Syscall / synchronous bufer return. It'll be used as a slow fallback path for returning buffers when the refill queue is full. Useful for tolerating slight queue size misconfiguration or with inconsistent load. - Accounting more memory to cgroups. - Additional independent cleanups that will also be useful for mutli-area support. - Various fixes and cleanups * tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits) io_uring/cmd: drop unused res2 param from io_uring_cmd_done() io_uring: fix nvme's 32b cqes on mixed cq io_uring/query: cap number of queries io_uring/query: prevent infinite loops io_uring/zcrx: account niov arrays to cgroup io_uring/zcrx: allow synchronous buffer return io_uring/zcrx: introduce io_parse_rqe() io_uring/zcrx: don't adjust free cache space io_uring/zcrx: use guards for the refill lock io_uring/zcrx: reduce netmem scope in refill io_uring/zcrx: protect netdev with pp_lock io_uring/zcrx: rename dma lock io_uring/zcrx: make niov size variable io_uring/zcrx: set sgt for umem area io_uring/zcrx: remove dmabuf_offset io_uring/zcrx: deduplicate area mapping io_uring/zcrx: pass ifq to io_zcrx_alloc_fallback() io_uring/zcrx: check all niovs filled with dma addresses io_uring/zcrx: move area reg checks into io_import_area io_uring/zcrx: don't pass slot to io_zcrx_create_area ...
2025-09-30	Merge tag 'for-6.18-tag' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "There are no new features, the changes are in the core code, notably tree-log error handling and reporting improvements, and initial support for block size > page size. Performance improvements: - search data checksums in the commit root (previous transaction) to avoid locking contention, this improves parallelism of read heavy/low write workloads, and also reduces transaction commit time; on real and reproducer workload the sync time went from minutes to tens of seconds (workload and numbers are in the changelog) Core: - tree-log updates: - error handling improvements, transaction aborts - add new error state 'O' (printed in status messages) when log replay fails and is aborted - reduced number of btrfs_path allocations when traversing the tree - 'block size > page size' support - basic implementation with limitations, under experimental build - limitations: no direct io, raid56, encoded read (standalone and in send ioctl), encoded write - preparatory work for compression, removing implicit assumptions of page and block sizes - compression workspaces are now per-filesystem, we cannot assume common block size for work memory among different filesystems - tree-checker now verifies INODE_EXTREF item (which is implementing hardlinks) - tree leaf pretty printer updates, there were missing data from items, keys/items - move config option CONFIG_BTRFS_REF_VERIFY to CONFIG_BTRFS_DEBUG, it's a debugging feature and not needed to be enabled separately - more struct btrfs_path auto free updates - use ref_tracker API for tracking delayed inodes, enabled by mount option 'ref_verify', allowing to better pinpoint leaking references - in zoned mode, avoid selecting data relocation zoned for ordinary data block groups - updated and enhanced error messages - lots of cleanups and refactoring" * tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (113 commits) btrfs: use smp_mb__after_atomic() when forcing COW in create_pending_snapshot() btrfs: add unlikely annotations to branches leading to transaction abort btrfs: add unlikely annotations to branches leading to EIO btrfs: add unlikely annotations to branches leading to EUCLEAN btrfs: more trivial BTRFS_PATH_AUTO_FREE conversions btrfs: zoned: don't fail mount needlessly due to too many active zones btrfs: use kmalloc_array() for open-coded arithmetic in kmalloc() btrfs: enable experimental bs > ps support btrfs: add extra ASSERT()s to catch unaligned bios btrfs: fix symbolic link reading when bs > ps btrfs: prepare scrub to support bs > ps cases btrfs: prepare zlib to support bs > ps cases btrfs: prepare lzo to support bs > ps cases btrfs: prepare zstd to support bs > ps cases btrfs: prepare compression folio alloc/free for bs > ps cases btrfs: fix the incorrect max_bytes value for find_lock_delalloc_range() btrfs: remove pointless key offset setup in create_pending_snapshot() btrfs: annotate btrfs_is_testing() as unlikely and make it return bool btrfs: make the rule checking more readable for should_cow_block() btrfs: simplify inline extent end calculation at replay_one_extent() ...
2025-09-29	Merge tag 'vfs-6.18-rc1.workqueue' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs workqueue updates from Christian Brauner: "This contains various workqueue changes affecting the filesystem layer. Currently if a user enqueue a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() use WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work() that is using system_wq and queue_work(), that makes use again of WORK_CPU_UNBOUND. This replaces the use of system_wq and system_unbound_wq. system_wq is a per-CPU workqueue which isn't very obvious from the name and system_unbound_wq is to be used when locality is not required. So this renames system_wq to system_percpu_wq, and system_unbound_wq to system_dfl_wq. This also adds a new WQ_PERCPU flag to allow the fs subsystem users to explicitly request the use of per-CPU behavior. Both WQ_UNBOUND and WQ_PERCPU flags coexist for one release cycle to allow callers to transition their calls. WQ_UNBOUND will be removed in a next release cycle" * tag 'vfs-6.18-rc1.workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: WQ_PERCPU added to alloc_workqueue users fs: replace use of system_wq with system_percpu_wq fs: replace use of system_unbound_wq with system_dfl_wq
2025-09-29	Merge tag 'vfs-6.18-rc1.inode' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs inode updates from Christian Brauner: "This contains a series I originally wrote and that Eric brought over the finish line. It moves out the i_crypt_info and i_verity_info pointers out of 'struct inode' and into the fs-specific part of the inode. So now the few filesytems that actually make use of this pay the price in their own private inode storage instead of forcing it upon every user of struct inode. The pointer for the crypt and verity info is simply found by storing an offset to its address in struct fsverity_operations and struct fscrypt_operations. This shrinks struct inode by 16 bytes. I hope to move a lot more out of it in the future so that struct inode becomes really just about very core stuff that we need, much like struct dentry and struct file, instead of the dumping ground it has become over the years. On top of this are a various changes associated with the ongoing inode lifetime handling rework that multiple people are pushing forward: - Stop accessing inode->i_count directly in f2fs and gfs2. They simply should use the __iget() and iput() helpers - Make the i_state flags an enum - Rework the iput() logic Currently, if we are the last iput, and we have the I_DIRTY_TIME bit set, we will grab a reference on the inode again and then mark it dirty and then redo the put. This is to make sure we delay the time update for as long as possible We can rework this logic to simply dec i_count if it is not 1, and if it is do the time update while still holding the i_count reference Then we can replace the atomic_dec_and_lock with locking the ->i_lock and doing atomic_dec_and_test, since we did the atomic_add_unless above - Add an icount_read() helper and convert everyone that accesses inode->i_count directly for this purpose to use the helper - Expand dump_inode() to dump more information about an inode helping in debugging - Add some might_sleep() annotations to iput() and associated helpers" * tag 'vfs-6.18-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: add might_sleep() annotation to iput() and more fs: expand dump_inode() inode: fix whitespace issues fs: add an icount_read helper fs: rework iput logic fs: make the i_state flags an enum fs: stop accessing ->i_count directly in f2fs and gfs2 fsverity: check IS_VERITY() in fsverity_cleanup_inode() fs: remove inode::i_verity_info btrfs: move verity info pointer to fs-specific part of inode f2fs: move verity info pointer to fs-specific part of inode ext4: move verity info pointer to fs-specific part of inode fsverity: add support for info in fs-specific part of inode fs: remove inode::i_crypt_info ceph: move crypt info pointer to fs-specific part of inode ubifs: move crypt info pointer to fs-specific part of inode f2fs: move crypt info pointer to fs-specific part of inode ext4: move crypt info pointer to fs-specific part of inode fscrypt: add support for info in fs-specific part of inode fscrypt: replace raw loads of info pointer with helper function
2025-09-29	Merge tag 'vfs-6.18-rc1.misc' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual selections of misc updates for this cycle. Features: - Add "initramfs_options" parameter to set initramfs mount options. This allows to add specific mount options to the rootfs to e.g., limit the memory size - Add RWF_NOSIGNAL flag for pwritev2() Add RWF_NOSIGNAL flag for pwritev2. This flag prevents the SIGPIPE signal from being raised when writing on disconnected pipes or sockets. The flag is handled directly by the pipe filesystem and converted to the existing MSG_NOSIGNAL flag for sockets - Allow to pass pid namespace as procfs mount option Ever since the introduction of pid namespaces, procfs has had very implicit behaviour surrounding them (the pidns used by a procfs mount is auto-selected based on the mounting process's active pidns, and the pidns itself is basically hidden once the mount has been constructed) This implicit behaviour has historically meant that userspace was required to do some special dances in order to configure the pidns of a procfs mount as desired. Examples include: * In order to bypass the mnt_too_revealing() check, Kubernetes creates a procfs mount from an empty pidns so that user namespaced containers can be nested (without this, the nested containers would fail to mount procfs) But this requires forking off a helper process because you cannot just one-shot this using mount(2) * Container runtimes in general need to fork into a container before configuring its mounts, which can lead to security issues in the case of shared-pidns containers (a privileged process in the pidns can interact with your container runtime process) While SUID_DUMP_DISABLE and user namespaces make this less of an issue, the strict need for this due to a minor uAPI wart is kind of unfortunate Things would be much easier if there was a way for userspace to just specify the pidns they want. So this pull request contains changes to implement a new "pidns" argument which can be set using fsconfig(2): fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd); fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0); or classic mount(2) / mount(8): // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid"); Cleanups: - Remove the last references to EXPORT_OP_ASYNC_LOCK - Make file_remove_privs_flags() static - Remove redundant __GFP_NOWARN when GFP_NOWAIT is used - Use try_cmpxchg() in start_dir_add() - Use try_cmpxchg() in sb_init_done_wq() - Replace offsetof() with struct_size() in ioctl_file_dedupe_range() - Remove vfs_ioctl() export - Replace rwlock() with spinlock in epoll code as rwlock causes priority inversion on preempt rt kernels - Make ns_entries in fs/proc/namespaces const - Use a switch() statement() in init_special_inode() just like we do in may_open() - Use struct_size() in dir_add() in the initramfs code - Use str_plural() in rd_load_image() - Replace strcpy() with strscpy() in find_link() - Rename generic_delete_inode() to inode_just_drop() and generic_drop_inode() to inode_generic_drop() - Remove unused arguments from fcntl_{g,s}et_rw_hint() Fixes: - Document @name parameter for name_contains_dotdot() helper - Fix spelling mistake - Always return zero from replace_fd() instead of the file descriptor number - Limit the size for copy_file_range() in compat mode to prevent a signed overflow - Fix debugfs mount options not being applied - Verify the inode mode when loading it from disk in minixfs - Verify the inode mode when loading it from disk in cramfs - Don't trigger automounts with RESOLVE_NO_XDEV If openat2() was called with RESOLVE_NO_XDEV it didn't traverse through automounts, but could still trigger them - Add FL_RECLAIM flag to show_fl_flags() macro so it appears in tracepoints - Fix unused variable warning in rd_load_image() on s390 - Make INITRAMFS_PRESERVE_MTIME depend on BLK_DEV_INITRD - Use ns_capable_noaudit() when determining net sysctl permissions - Don't call path_put() under namespace semaphore in listmount() and statmount()" * tag 'vfs-6.18-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (38 commits) fcntl: trim arguments listmount: don't call path_put() under namespace semaphore statmount: don't call path_put() under namespace semaphore pid: use ns_capable_noaudit() when determining net sysctl permissions fs: rename generic_delete_inode() and generic_drop_inode() init: INITRAMFS_PRESERVE_MTIME should depend on BLK_DEV_INITRD initramfs: Replace strcpy() with strscpy() in find_link() initrd: Use str_plural() in rd_load_image() initramfs: Use struct_size() helper to improve dir_add() initrd: Fix unused variable warning in rd_load_image() on s390 fs: use the switch statement in init_special_inode() fs/proc/namespaces: make ns_entries const filelock: add FL_RECLAIM to show_fl_flags() macro eventpoll: Replace rwlock with spinlock selftests/proc: add tests for new pidns APIs procfs: add "pidns" mount option pidns: move is-ancestor logic to helper openat2: don't trigger automounts with RESOLVE_NO_XDEV namei: move cross-device check to __traverse_mounts namei: remove LOOKUP_NO_XDEV check from handle_mounts ...
2025-09-24	Merge tag 'for-6.17-rc7-tag' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "One more regression fix for a problem in zoned mode: mounting would fail if the number of open and active zones reached a common limit that didn't use to be checked" * tag 'for-6.17-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: don't fail mount needlessly due to too many active zones
2025-09-23	btrfs: zoned: don't fail mount needlessly due to too many active zones	Johannes Thumshirn
	Previously BTRFS did not look at a device's reported max_open_zones limit, but starting with commit 04147d8394e8 ("btrfs: zoned: limit active zones to max_open_zones"), zoned BTRFS limited the number of concurrently used block-groups to the number of max_open_zones a device reported, if it hadn't already reported a number of max_active_zones. Starting with commit 04147d8394e8 the number of open zones is treated the same way as active zones. But this leads to mount failures on filesystems which have been used before 04147d8394e8 because too many zones are in an open state. Ignore the new limitations on these filesystems, so zones can be finished or evacuated. Reported-by: Yuwei Han <hrx@bupt.moe> Link: https://lore.kernel.org/all/2F48A90AF7DDF380+1790bcfd-cb6f-456b-870d-7982f21b5eae@bupt.moe/ Fixes: 04147d8394e8 ("btrfs: zoned: limit active zones to max_open_zones") Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: use smp_mb__after_atomic() when forcing COW in create_pending_snapshot()	Filipe Manana
	After setting the BTRFS_ROOT_FORCE_COW flag on the root we are doing a full write barrier, smp_wmb(), but we don't need to, all we need is a smp_mb__after_atomic(). The use of the smp_wmb() is from the old days when we didn't use a bit and used instead an int field in the root to signal if cow is forced. After the int field was changed to a bit in the root's state (flags field), we forgot to update the memory barrier in create_pending_snapshot() to smp_mb__after_atomic(), but we did the change in commit_fs_roots() after clearing BTRFS_ROOT_FORCE_COW. That happened in commit 27cdeb7096b8 ("Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root"). On the reader side, in should_cow_block(), we also use the counterpart smp_mb__before_atomic() which generates further confusion. So change the smp_wmb() to smp_mb__after_atomic(). In fact we don't even need any barrier at all since create_pending_snapshot() is called in the critical section of a transaction commit and therefore no one can concurrently join/attach the transaction, or start a new one, until the transaction is unblocked. By the time someone starts a new transaction and enters should_cow_block(), a lot of implicit memory barriers already took place by having acquired several locks such as fs_info->trans_lock and extent buffer locks on the root node at least. Nevertlheless, for consistency use smp_mb__after_atomic() after setting the force cow bit in create_pending_snapshot(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: add unlikely annotations to branches leading to transaction abort	David Sterba
	The unlikely() annotation is a static prediction hint that compiler may use to reorder code out of hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen. Transaction abort is one such error, the btrfs_abort_transaction() inlines code to check the state and print a warning, this ought to be out of the hot path. The most common pattern is when transaction abort is called after checking a return value and the control flow leads to a quick return. In other cases it may not be necessary to add unlikely() e.g. when the function returns anyway or the control flow is not changed noticeably. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: add unlikely annotations to branches leading to EIO	David Sterba
	The unlikely() annotation is a static prediction hint that compiler may use to reorder code out of hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen, where EIO is one of them. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: add unlikely annotations to branches leading to EUCLEAN	David Sterba
	The unlikely() annotation is a static prediction hint that compiler may use to reorder code out of hot path. We use it elsewhere (namely tree-checker.c) for error branches that almost never happen, where EUCLEAN (a corruption) is one of them. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: more trivial BTRFS_PATH_AUTO_FREE conversions	Sun YangKai
	Trivial pattern for the auto freeing with goto -> return conversions if possible. The following cases are considered trivial in this patch: 1. Cases where there are no operations between btrfs_free_path() and the function returns. 2. Cases where only simple cleanup operations (such as kfree(), kvfree(), clear_bit(), and fs_path_free()) are present between btrfs_free_path() and the function return. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: zoned: don't fail mount needlessly due to too many active zones	Johannes Thumshirn
	Previously BTRFS did not look at a device's reported max_open_zones limit, but starting with commit 04147d8394e8 ("btrfs: zoned: limit active zones to max_open_zones"), zoned BTRFS limited the number of concurrently used block-groups to the number of max_open_zones a device reported, if it hadn't already reported a number of max_active_zones. Starting with commit 04147d8394e8 the number of open zones is treated the same way as active zones. But this leads to mount failures on filesystems which have been used before 04147d8394e8 because too many zones are in an open state. Ignore the new limitations on these filesystems, so zones can be finished or evacuated. Reported-by: Yuwei Han <hrx@bupt.moe> Link: https://lore.kernel.org/all/2F48A90AF7DDF380+1790bcfd-cb6f-456b-870d-7982f21b5eae@bupt.moe/ Fixes: 04147d8394e8 ("btrfs: zoned: limit active zones to max_open_zones") Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: use kmalloc_array() for open-coded arithmetic in kmalloc()	Miquel Sabaté Solà
	As pointed out in the documentation, calling 'kmalloc' with open-coded arithmetic can lead to unfortunate overflows and this particular way of using it has been deprecated. Instead, it's preferred to use 'kmalloc_array' in cases where it might apply so an overflow check is performed. Note this is an API cleanup and is not fixing any overflows because in all cases the multipliers are bounded small numbers derived from number of items in leaves/nodes. Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: enable experimental bs > ps support	Qu Wenruo
	With all the preparation patches, we're able to finally enable btrfs block size (sector size) larger than page size support and give it a full fstests run. And obviously this new feature is hidden behind experimental flags, and should not be considered as a core feature yet as btrfs' default block size is still 4K. But this is still a feature that will shine in the future where 16K block sized device are widely adopted. For now there are some features explicitly disabled: - Direct IO This is the most complex part to support, the root reason is we can not control the pages of iov iter passed in. User space programs can only ensure the virtual addresses are contiguous, but have no control on their physical addresses. Our bs > ps support heavily relies on large folios, and direct IO memory can easily break it. So direct IO is disabled and will always fall back to buffered IO. - RAID56 In theory we can convert RAID56 to use large folios, but it will need to be converted back to page based if we want to support direct IO in the future. So just reject it for now. - Encoded send - Encoded read Both are utilizing btrfs_encoded_read_regular_fill_pages(), and send is utilizing vmallocated memory. Unfortunately for vmallocated memory we can not guarantee the minimal folio order. For send, it will just always fallback to regular writes, which reads from page cache and will follow the existing folio order requirement. - Encoded write Encoded write itself is allocating pages by themselves, and we can easily change it to follow the minimal order. But since encoded read is already disabled, there is no need to only enable encoded write. Finally just like what we did for bs < ps support in the past, add a warning message for bs > ps mounts. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: add extra ASSERT()s to catch unaligned bios	Qu Wenruo
	Btrfs uses btrfs_bio to handle read/write of logical address, for the incoming bs > ps support, btrfs has extra requirements: - One folio must contain at least one fs block - No fs block can cross folio boundaries This requirement is not hard to maintain, thanks to the address space's minimal folio order. But not all btrfs bios are generated through address space, e.g. compression and scrub. To catch possible unaligned bios, introduce a helper, assert_bbio_alginment(), for each btrfs_bio in btrfs_submit_bbio(). This will check the following things: - bv_offset is aligned to block size - bv_len is aligned to block size With a btrfs bio passing above checks, unless it's empty it will ensure the requirements for bs > ps support. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: fix symbolic link reading when bs > ps	Qu Wenruo
	[BUG DURING BS > PS TEST] When running the following script on a btrfs whose block size is larger than page size, e.g. 8K block size and 4K page size, it will trigger a kernel BUG: # mkfs.btrfs -s 8k $dev # mount $dev $mnt # mkdir $mnt/dir # ln -s dir $mnt/link # ls $mnt/link The call trace looks like this: BTRFS warning (device dm-2): support for block size 8192 with page size 4096 is experimental, some features may be missing BTRFS info (device dm-2): checking UUID tree BTRFS info (device dm-2): enabling ssd optimizations BTRFS info (device dm-2): enabling free space tree ------------[ cut here ]------------ kernel BUG at /home/adam/linux/include/linux/highmem.h:275! Oops: invalid opcode: 0000 [#1] SMP CPU: 8 UID: 0 PID: 667 Comm: ls Tainted: G OE 6.17.0-rc4-custom+ #283 PREEMPT(full) Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022 RIP: 0010:zero_user_segments.constprop.0+0xdc/0xe0 [btrfs] Call Trace: <TASK> btrfs_get_extent.cold+0x85/0x101 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f] btrfs_do_readpage+0x244/0x750 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f] btrfs_read_folio+0x9c/0x100 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f] filemap_read_folio+0x37/0xe0 do_read_cache_folio+0x94/0x3e0 __page_get_link.isra.0+0x20/0x90 page_get_link+0x16/0x40 step_into+0x69b/0x830 path_lookupat+0xa7/0x170 filename_lookup+0xf7/0x200 ? set_ptes.isra.0+0x36/0x70 vfs_statx+0x7a/0x160 do_statx+0x63/0xa0 __x64_sys_statx+0x90/0xe0 do_syscall_64+0x82/0xae0 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> Please note bs > ps support is still under development and the enablement patch is not even in btrfs development branch. [CAUSE] Btrfs reuses its data folio read path to handle symbolic links, as the symbolic link target is stored as an inline data extent. But for newly created inodes, btrfs only set the minimal order if the target inode is a regular file. Thus for above newly created symbolic link, it doesn't properly respect the minimal folio order, and triggered the above crash. [FIX] Call btrfs_set_inode_mapping_order() unconditionally inside btrfs_create_new_inode(). For symbolic links this will fix the crash as now the folio will meet the minimal order. For regular files this brings no change. For directory/bdev/char and all the other types of inodes, they won't go through the data read path, thus no effect either. Fixes: cc38d178ff33 ("btrfs: enable large data folio support under CONFIG_BTRFS_EXPERIMENTAL") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: prepare scrub to support bs > ps cases	Qu Wenruo
	This involves: - Migrate scrub_stripe::pages[] to folios[] - Use btrfs_alloc_folio_array() and folio_put() to alloc above array. - Migrate scrub_stripe_get_kaddr() and scrub_stripe_get_paddr() to use folio interfaces - Migrate raid56_parity_cache_data_pages() to raid56_parity_cache_data_folios() Since scrub is the only caller still using pages. This helper will copy the folio array contents into rbio::stripe_pages, with sector uptodate flags updated. And a new ASSERT() to make sure bs > ps cases will not hit this path. Since most scrub code is based on kaddr/paddr, the migration itself is pretty straightforward. And since we're here, also move the loop to set the stripe_sectors[].uptodate out of the copy loop. As we always mark all the sectors as uptodate for the data stripe, it's easier to do in one go, other than doing it inside the copy loop. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: prepare zlib to support bs > ps cases	Qu Wenruo
	This involves converting the following functions to use correct folio sizes/shifts: - zlib_compress_folios() - zlib_decompress_bio() There is a special handling for s390 hardware acceleration. With bs > ps cases, we can go with 16K block size on s390 (which uses fixed 4K page size). In that case we do not need to do the buffer copy as our folio is large enough for hardware acceleration. So factor out the s390 specific and folio size check into a helper, need_special_buffer(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: prepare lzo to support bs > ps cases	Qu Wenruo
	This involves converting the following functions to use correct folio sizes/shifts: - copy_compress_data_to_page() - lzo_compress_folios() - lzo_decompress_bio() Just like zstd, lzo has some extra incorrect usage of kmap_local_folio() that the offset is always 0. This will not handle HIGHMEM large folios correctly, but those cases are already rejected explicitly so it should not cause problems when bs > ps support is enabled. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: prepare zstd to support bs > ps cases	Qu Wenruo
	This involves converting the following functions to use proper folio sizes/shifts: - zstd_compress_folios() - zstd_decompress_bio() The function zstd_decompress() is already using block size correctly without using page size, thus it needs no modification. And since zstd compression is calling kmap_local_folio(), the existing code cannot handle large folios with HIGHMEM, as kmap_local_folio() requires us to handle one page range each time. I do not really think it's worth to spend time on some feature that will be deprecated eventually. So here just add an extra explicit rejection for bs > ps with HIGHMEM feature enabled kernels. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: prepare compression folio alloc/free for bs > ps cases	Qu Wenruo
	This includes the following preparation for bs > ps cases: - Always alloc/free the folio directly if bs > ps This adds a new @fs_info parameter for btrfs_alloc_compr_folio(), thus affecting all compression algorithms. For btrfs_free_compr_folio() it needs no parameter for now, as we can use the folio size to skip the caching part. For now the change is just to passing a @fs_info into the function, all the folio size assumption is still based on page size. - Properly zero the last folio in compress_file_range() Since the compressed folios can be larger than a page, we need to properly zero the whole folio. - Use correct folio size for btrfs_add_compressed_bio_folios() Instead of page size, use the correct folio size. - Use correct folio size/shift for btrfs_compress_filemap_get_folio() As we are not only using simple page sized folios anymore. - Use correct folio size for btrfs_decompress() There is an ASSERT() making sure the decompressed range is no larger than a page, which will be triggered for bs > ps cases. - Skip readahead for compressed pages Similar to subpage cases. - Make btrfs_alloc_folio_array() to accept a new @order parameter - Add a helper to calculate the minimal folio size All those changes should not affect the existing bs <= ps handling. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: fix the incorrect max_bytes value for find_lock_delalloc_range()	Qu Wenruo
	[BUG] With my local branch to enable bs > ps support for btrfs, sometimes I hit the following ASSERT() inside submit_one_sector(): ASSERT(block_start != EXTENT_MAP_HOLE); Please note that it's not yet possible to hit this ASSERT() in the wild yet, as it requires btrfs bs > ps support, which is not even in the development branch. But on the other hand, there is also a very low chance to hit above ASSERT() with bs < ps cases, so this is an existing bug affect not only the incoming bs > ps support but also the existing bs < ps support. [CAUSE] Firstly that ASSERT() means we're trying to submit a dirty block but without a real extent map nor ordered extent map backing it. Furthermore with extra debugging, the folio triggering such ASSERT() is always larger than the fs block size in my bs > ps case. (8K block size, 4K page size) After some more debugging, the ASSERT() is trigger by the following sequence: extent_writepage() \| We got a 32K folio (4 fs blocks) at file offset 0, and the fs block \| size is 8K, page size is 4K. \| And there is another 8K folio at file offset 32K, which is also \| dirty. \| So the filemap layout looks like the following: \| \| "\|\|" is the filio boundary in the filemap. \| "//\| is the dirty range. \| \| 0 8K 16K 24K 32K 40K \| \|////////\| \|//////////////////////\|\|////////\| \| \|- writepage_delalloc() \| \|- find_lock_delalloc_range() for [0, 8K) \| \| Now range [0, 8K) is properly locked. \| \| \| \|- find_lock_delalloc_range() for [16K, 40K) \| \| \|- btrfs_find_delalloc_range() returned range [16K, 40K) \| \| \|- lock_delalloc_folios() locked folio 0 successfully \| \| \| \| \| \| The filemap range [32K, 40K) got dropped from filemap. \| \| \| \| \| \|- lock_delalloc_folios() failed with -EAGAIN on folio 32K \| \| \| As the folio at 32K is dropped. \| \| \| \| \| \|- loops = 1; \| \| \|- max_bytes = PAGE_SIZE; \| \| \|- goto again; \| \| \| This will re-do the lookup for dirty delalloc ranges. \| \| \| \| \| \|- btrfs_find_delalloc_range() called with @max_bytes == 4K \| \| \| This is smaller than block size, so \| \| \| btrfs_find_delalloc_range() is unable to return any range. \| \| \- return false; \| \| \| \- Now only range [0, 8K) has an OE for it, but for dirty range \| [16K, 32K) it's dirty without an OE. \| This breaks the assumption that writepage_delalloc() will find \| and lock all dirty ranges inside the folio. \| \|- extent_writepage_io() \|- submit_one_sector() for [0, 8K) \| Succeeded \| \|- submit_one_sector() for [16K, 24K) Triggering the ASSERT(), as there is no OE, and the original extent map is a hole. Please note that, this also exposed the same problem for bs < ps support. E.g. with 64K page size and 4K block size. If we failed to lock a folio, and falls back into the "loops = 1;" branch, we will re-do the search using 64K as max_bytes. Which may fail again to lock the next folio, and exit early without handling all dirty blocks inside the folio. [FIX] Instead of using the fixed size PAGE_SIZE as @max_bytes, use @sectorsize, so that we are ensured to find and lock any remaining blocks inside the folio. And since we're here, add an extra ASSERT() to before calling btrfs_find_delalloc_range() to make sure the @max_bytes is at least no smaller than a block to avoid false negative. Cc: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: remove pointless key offset setup in create_pending_snapshot()	Filipe Manana
	There's no point in setting the key's offset to (u64)-1 since we never use it before setting it to the current transaction's ID. So remove the assignment of (u64)-1 to the key's offset and move the remainder of the key initialization close to where it's used. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: annotate btrfs_is_testing() as unlikely and make it return bool	Filipe Manana
	We can annotate btrfs_is_testing() as unlikely since that's the most expected scenario and it's desirable for the compiler to optimize for the case we are not running the self tests. So add the annotation to btrfs_is_testing() and while at it also make it return bool instead of int. Also make two of the existing callers use btrfs_is_testing() directly instead of storing its result in a local variable. On x86_64 with Debian's gcc 14.2.0-19 this resulted in a very tiny object code reduction. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1913263 161567 15592 2090422 1fe5b6 fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1913257 161567 15592 2090416 1fe5b0 fs/btrfs/btrfs.ko Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: make the rule checking more readable for should_cow_block()	Filipe Manana
	It's quite hard and unreadable the way the rule checks are organized in should_cow_block(). We have a single if statement that returns 0 (false) and it checks several conditions, with one them being a negated compound condition which is particularly hard to reason immediately. Improve on this by using multiple if statements, each checking a single condition and returning immediately. Also change the return type from an integer to a boolean, since all we need is to return true or false. At least on x86_64 with Debian's gcc 14.2.0-19, this also reduces the object code size by 64 bytes. Before this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1913327 161567 15592 2090486 1fe5f6 fs/btrfs/btrfs.ko After this change: $ size fs/btrfs/btrfs.ko text data bss dec hex filename 1913263 161567 15592 2090422 1fe5b6 fs/btrfs/btrfs.ko Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: simplify inline extent end calculation at replay_one_extent()	Filipe Manana
	There is no need to store the extent's ram_bytes in two variables, further more one of them, named 'size', is used only for the extent's end offset calculation. So remove the 'size' variable and use 'nbytes' only. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: fix comment about nbytes increase at replay_one_extent()	Filipe Manana
	The comment is wrong about the part where it says a prealloc extent does not contribute to an inode's nbytes - it does. Only holes don't contribute and that's what we are checking for, as prealloc extents always have a disk_bytenr different from 0. So fix the comment and re-organize the code to not set nbytes twice and set it to the extent item's number of bytes only if it doesn't represent a hole - in case it's a hole we have already initialized nbytes to 0 when we declared it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: return any hit error from extent_writepage_io()	Qu Wenruo
	Since the support of bs < ps support, extent_writepage_io() will submit multiple blocks inside the folio. But if we hit error submitting one sector, but the next sector can still be submitted successfully, the function extent_writepage_io() will still return 0. This will make btrfs to silently ignore the error without setting error flag for the filemap. Fix it by recording the first error hit, and always return that value. Fixes: 8bf334beb349 ("btrfs: fix double accounting race when extent_writepage_io() failed") Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: mark leaf space and overflow checks as unlikely on insert and extension	Filipe Manana
	We have several sanity checks when inserting or extending items in a btree that verify we didn't overflow the leaf or access a slot beyond the last one. These are cases that are never expected to be hit so mark them as unlikely, allowing the compiler to potentially generate better code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: mark as unlikely not uptodate extent buffer checks when navigating btrees	Filipe Manana
	We expect that after attempting to read an extent buffer we had no errors therefore the extent buffer is up to date, so mark the checks for a not up to date extent buffer as unlikely and allow the compiler to pontentially generate better code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: mark extent buffer alignment checks as unlikely	Filipe Manana
	We are not expecting to ever fail the extent buffer alignment checks, so mark them as unlikely to allow the compiler to potentially generate more optimized code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: store and use node size in local variable in check_eb_alignment()	Filipe Manana
	Instead of dereferencing fs_info every time we need to access the node size, store in a local variable to make the code less verbose and avoid a line split too. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: print-tree: print key types as human readable strings	Filipe Manana
	Looking at a leaf dump from the kernel's print-tree implementation is not so friendly to analyze since key types are printed as numbers. Improve on this by printing key types as strings that are a diminutive of the macro names for key types, just like we do in btrfs-progs. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: print-tree: move code for processing file extent item into helper	Filipe Manana
	The code for processing file extent items is quite large and it's better to have it in a dedicated helper rather than in a huge switch statement, just like we do in btrfs-progs. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: print-tree: print compression type for file extent items	Filipe Manana
	We are not printing anything about the compression type, so add that useful information in the same format as btrfs-progs. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: print-tree: print correct inline extent data size	Filipe Manana
	We are advertising the ram_bytes of an inline extent as its data size, but that is not true for compressed extents. The ram_bytes corresponds to the uncompressed data size while the data size (compressed data) is given by btrfs_file_extent_inline_item_len(). So fix this and print both values in the same format as in btrfs-progs. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: print-tree: print range information for extent csum items	Filipe Manana
	Currently we don't print anything for extent csum items other than the generic line with the key, item offset and item size. While one can still determine the range the extent csum covers by doing a few simple computations, it makes it more time consuming to analyse a leaf dump. So add a line that prints information about the range covered by the checksum using the same format as btrfs-progs. This is useful when debugging log tree issues since we log extent csum items for new extents. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: print-tree: print information about dir log items	Filipe Manana
	We currently don't print information about dir log items (other than the key, item offset and item size), which is useful to look at when debugging problems with a log tree. So print their specific information (currently they only have an end index number) in a format similar to btrfs-progs. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: print-tree: print information about inode extref items	Filipe Manana
	Currently we ignore inode extref items, we just print their key, item offset in the leaf and their size, no information about their content like the index number, parent inode, name length and name. Improve on this by printing the index, parent and name length in the same format as btrfs-progs. Note that we don't print the name, as that would require some processing and escaping like we do in btrfs-progs, and that could expose sensitive information for some users in case they share their dmesg/syslog and it contains a leaf dump. So for now leave names out. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: print-tree: print information about inode ref items	Filipe Manana
	Currently we ignore inode ref items, we just print their key, item offset in the leaf and their size, no information about their content like the index number, name length and name. Improve on this by printing the index and name length in the same format as btrfs-progs. Note that we don't print the name, as that would require some processing and escaping like we do in btrfs-progs, and that could expose sensitive information for some users in case they share their dmesg/syslog and it contains a leaf dump. So for now leave names out. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: print-tree: print dir items for dir index and xattr keys too	Filipe Manana
	Currently we only print the dir items for BTRFS_DIR_ITEM_KEY keys, but we also have dir items for BTRFS_DIR_INDEX_KEY and BTRFS_XATTR_ITEM_KEY keys too. So print them for those keys too. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: print-tree: print more information about dir items	Filipe Manana
	Currently we only print the object id component of the location key from a dir item and the flags. We are missing the whole key, transid and the name and data lengths. We are also ignoring the fact that we can have multiple dir item objects encoded in a single item for a BTRFS_DIR_ITEM_KEY key, so what we print is only for the first item. Improve on this by iterating on all dir items and print the missing information. This is done with the same format as in btrfs-progs, what we miss is printing the names and data since not only that would require some processing and escaping like in btrfs-progs, but it would also reveal information that may be sensitive and users may not want to share that in case that get a leaf dumped in dmesg. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: print-tree: print missing fields for inode items	Filipe Manana
	We are not dumping a lot of fields for an inode item which are useful for debugging whenever we dump a leaf (log replay failure for example), so add them and make it as close as possible to the print tree implementation in btrfs-progs (things like converting timespecs to human readable dates and converting flags to strings are missing since they are not so practical to do in the kernel). Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: tree-checker: add inode extref checks	Qu Wenruo
	Like inode refs, inode extrefs have a variable length name, which means we have to do a proper check to make sure no header nor name can exceed the item limits. The check itself is very similar to check_inode_ref(), just a different structure (btrfs_inode_extref vs btrfs_inode_ref). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: send: index backref cache by node number instead of by sector number	Filipe Manana
	We now have a nodesize_bits member in fs_info so we can index an extent buffer in the backref cache by node number instead of by sector number. While this allows for a denser index space with the possibility of using less maple tree nodes, in practice it's unlikely to hit such benefits since we currently limit the maximum number of keys in the cache to 128, so unless all extent buffers are contiguous we are unlikely to see a memory usage reduction in the backing maple tree due to fewer nodes. Nevertheless it doesn't cost anything to index by node number and it's more logical. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: dump detailed info and specific messages on log replay failures	Filipe Manana
	Currently debugging log replay failures can be harder than needed, since all we do now is abort a transaction, which gives us a line number, a stack trace and an error code. But that is most of the times not enough to give some clue about what went wrong. So add a new helper to abort log replay and provide contextual information: 1) Dump the current leaf of the log tree being processed and print the slot we are currently at and the key at that slot; 2) Dump the current subvolume tree leaf if we have any; 3) Print the current stage of log replay; 4) Print the id of the subvolume root associated with the log tree we are currently processing (as we can have multiple); 5) Print some error message to mention what we were trying to do when we got an error. Replace all transaction abort calls (btrfs_abort_transaction()) with the new helper btrfs_abort_log_replay(), which besides dumping all that extra information, it also aborts the current transaction. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: abort transaction if we fail to update inode in log replay dir fixup	Filipe Manana
	If we fail to update the inode at link_to_fixup_dir(), we don't abort the transaction and propagate the error up the call chain, which makes it hard to pinpoint the error to the inode update. So abort the transaction if the inode update call fails, so that if it happens we known immediately. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23	btrfs: abort transaction if we fail to find dir item during log replay	Filipe Manana
	At __add_inode_ref() if we get an error when trying to lookup a dir item we don't abort the transaction and propagate the error up the call chain, so that somewhere else up in the call chain the transaction is aborted. This however makes it hard to know that the failure comes from looking up a dir item, so add a transaction abort in case we fail there, so that we immediately pinpoint where the problem comes from during log replay. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>