path: root/fs
2025-03-20  pidfs: improve multi-threaded exec and premature thread-group leader exit polling  (Christian Brauner)

This is another attempt trying to make pidfd polling for multi-threaded exec and premature thread-group leader exit consistent.

A quick recap of these two cases:

(1) During a multi-threaded exec by a subthread, i.e., a non-thread-group leader thread, all other threads in the thread-group, including the thread-group leader, are killed and the struct pid of the thread-group leader is taken over by the subthread that called exec. IOW, two tasks change their TIDs.

(2) A premature thread-group leader exit means that the thread-group leader exited before all of the other subthreads in the thread-group have exited.

Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD. Any caller that holds a PIDFD_THREAD pidfd to the current thread-group leader may or may not see an exit notification on the file descriptor depending on when poll is performed. If the poll is performed before the exec of the subthread has concluded, an exit notification is generated for the old thread-group leader. If the poll is performed after the exec of the subthread has concluded, no exit notification is generated for the old thread-group leader.

The correct behavior would be to simply not generate an exit notification on the struct pid of a subthread exec, because the struct pid is taken over by the subthread and thus remains alive. But this is difficult to handle because a thread-group may exit prematurely as mentioned in (2). In that case an exit notification is reliably generated, but the subthreads may continue to run for an indeterminate amount of time and thus also may exec at some point. So far there was no way to distinguish between (1) and (2) internally.

This tiny series tries to address this problem by discarding PIDFD_THREAD notifications on premature thread-group leader exit. If that works correctly, then no exit notifications are generated for a PIDFD_THREAD pidfd for a thread-group leader until all subthreads have been reaped. If a subthread should exec afterwards, no exit notification will be generated until that task exits or it creates subthreads and repeats the cycle.

Co-Developed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250320-work-pidfs-thread_group-v4-1-da678ce805bf@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
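For readers unfamiliar with the interface being fixed, here is a minimal, hedged userspace sketch (not taken from the patch) of polling a PIDFD_THREAD pidfd. It assumes Linux 6.9+ for PIDFD_THREAD and uses the raw syscall in case libc lacks a wrapper; after this series, POLLIN for a thread-group leader fires only once all subthreads have been reaped.

    /* Hedged illustration only. */
    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef PIDFD_THREAD
    #define PIDFD_THREAD O_EXCL    /* mirrors <linux/pidfd.h> */
    #endif

    int main(int argc, char **argv)
    {
            pid_t tid;
            int pidfd;
            struct pollfd pfd;

            if (argc < 2)
                    return 1;
            tid = (pid_t)atoi(argv[1]);
            pidfd = syscall(SYS_pidfd_open, tid, PIDFD_THREAD);
            if (pidfd < 0)
                    return 1;
            pfd.fd = pidfd;
            pfd.events = POLLIN;
            if (poll(&pfd, 1, -1) != 1)
                    return 1;
            /* POLLIN: the thread-level exit notification fired */
            printf("tid %d: exit notification\n", tid);
            return 0;
    }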
2025-03-20  fs: sort out fd allocation vs dup2 race commentary, take 2  (Mateusz Guzik)
fd_install() has a questionable comment above it. While it correctly points out a possible race against dup2(), it states:

> We need to detect this and fput() the struct file we are about to
> overwrite in this case.
>
> It should never happen - if we allow dup2() do it, _really_ bad things
> will follow.

I have difficulty parsing the above. The first sentence would suggest fd_install() tries to detect and recover from the race (it does not), the next one claims the race needs to be dealt with (it is, by dup2()).

Given that fd_install() does not suffer the burden, this patch removes the above and instead expands on the race in the dup2() commentary. While here tidy up the docs around fd_install().

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20250320102637.1924183-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-20  iomap: rework IOMAP atomic flags  (John Garry)
Flag IOMAP_ATOMIC_SW is not really required. The idea of having this flag is that the FS ->iomap_begin callback could check if this flag is set to decide whether to do a SW (FS-based) atomic write. But the FS can set which ->iomap_begin callback it wants when deciding to do a FS-based atomic write.

Furthermore, it was thought that IOMAP_ATOMIC_HW is not a proper name, as the block driver can use SW methods to emulate an atomic write. So change back to IOMAP_ATOMIC.

The ->iomap_begin callback still needs to indicate to the iomap core that REQ_ATOMIC needs to be set, so add IOMAP_F_ATOMIC_BIO for that.

These changes were suggested by Christoph Hellwig and Dave Chinner.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20250320120250.4087011-4-john.g.garry@oracle.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
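A hedged sketch of the resulting filesystem-side contract: apart from the IOMAP_ATOMIC and IOMAP_F_ATOMIC_BIO names taken from the description above, the function and mapping details are illustrative, not the actual patch.

    /* Hedged sketch: an ->iomap_begin callback opting in to a HW atomic
     * bio; only the two flag names come from the patch description. */
    static int example_iomap_begin(struct inode *inode, loff_t pos,
                                   loff_t length, unsigned flags,
                                   struct iomap *iomap, struct iomap *srcmap)
    {
            /* ... map [pos, pos + length) as usual ... */

            if (flags & IOMAP_ATOMIC)
                    iomap->flags |= IOMAP_F_ATOMIC_BIO; /* core adds REQ_ATOMIC */

            return 0;
    }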
2025-03-20  iomap: comment on atomic write checks in iomap_dio_bio_iter()  (John Garry)
Help explain the code. Also clarify the comment for bio size check. Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20250320120250.4087011-3-john.g.garry@oracle.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-20  iomap: inline iomap_dio_bio_opflags()  (John Garry)
It is neater to build blk_opf_t fully in one place, so inline iomap_dio_bio_opflags() in iomap_dio_bio_iter(). Also tidy up the logic for dealing with IOMAP_DIO_CALLER_COMP and, more generally, separate the logic dealing with flags associated with reads from that for writes.

Originally-from: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20250320120250.4087011-2-john.g.garry@oracle.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-20  libfs: Fix duplicate directory entry in offset_dir_lookup  (Yongjian Sun)
There is an issue in the kernel: in tmpfs, when using the "ls" command to list the contents of a directory with a large number of files, glibc performs the getdents call in multiple rounds. If a concurrent unlink occurs between these getdents calls, it may lead to duplicate directory entries in the ls output.

One possible reproduction scenario is as follows. Create 1026 files, then execute ls and rm concurrently:

    for i in {1..1026}; do
        echo "This is file $i" > /tmp/dir/file$i
    done

    ls /tmp/dir                          rm /tmp/dir/file4
    ->getdents(file1026-file5)
                                         ->unlink(file4)
    ->getdents(file5,file3,file2,file1)

The second getdents call is expected to return file3 through file1, but instead it returns an extra file5.

The root cause of this problem is in the offset_dir_lookup function. It uses mas_find to determine the starting position for the current getdents call. Since mas_find locates the first position that is greater than or equal to mas->index, when file4 is deleted it ends up returning file5. It can be fixed by replacing mas_find with mas_find_rev, which finds the first position that is less than or equal to mas->index.

Fixes: b9b588f22a0c ("libfs: Use d_children list to iterate simple_offset directories")
Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com>
Link: https://lore.kernel.org/r/20250320034417.555810-1-sunyongjian@huaweicloud.com
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
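To see why the cursor semantics matter, here is a hedged, self-contained maple-tree fragment (not the libfs code): with entries at indices 3 and 5, and the cursor parked at the just-erased index 4, the two primitives resume on different sides.

    /* Hedged fragment; assumes a maple tree populated at indices 3 and 5,
     * with the entry at index 4 just erased. */
    void *entry;
    MA_STATE(mas, &tree, 4, 4);

    entry = mas_find(&mas, ULONG_MAX);   /* first entry >= 4: the one at 5 */

    mas_set(&mas, 4);
    entry = mas_find_rev(&mas, 0);       /* first entry <= 4: the one at 3 */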
2025-03-20  fs: call inode_sb_list_add() outside of inode hash lock  (Mateusz Guzik)
As both locks are highly contended during significant inode churn, holding the inode hash lock while waiting for the sb list lock exacerbates the problem.

Why moving it out is safe: the inode at hand still has I_NEW set, and anyone who finds it through legitimate means waits for the bit to clear, by which time inode_sb_list_add() is guaranteed to have finished.

This significantly drops hash lock contention for me when stating 20 separate trees in parallel, each with 1000 directories * 1000 files. However, no speed-up was observed, as contention increased on the other locks, notably the dentry LRU. Even so, removal of the lock ordering will help make this faster later.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20250320004643.1903287-1-mjguzik@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
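The shape of the change, as a hedged sketch (simplified; the real hash-insertion paths carry more state): the sb-list insertion moves out from under the hash lock, and the still-set I_NEW bit is what keeps concurrent lookups from observing a half-added inode.

    /* Hedged sketch: hash insertion and sb-list insertion no longer nest. */
    spin_lock(&inode_hash_lock);
    hlist_add_head_rcu(&inode->i_hash, head);   /* inode still has I_NEW */
    spin_unlock(&inode_hash_lock);

    inode_sb_list_add(inode);   /* takes sb->s_inode_list_lock separately */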
2025-03-20  fs: tidy up do_sys_openat2() with likely/unlikely  (Mateusz Guzik)
Otherwise gcc 13 generates conditional forward jumps (i.e., a branch mispredict by default) for the common case of build_open_flags() being successful.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20250320092331.1921700-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
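The pattern in a hedged, heavily simplified sketch (the real function does more work on the common path):

    /* Hedged sketch: annotating the rare outcome keeps the success path
     * as straight-line fall-through code. */
    static long do_sys_openat2(int dfd, const char __user *filename,
                               struct open_how *how)
    {
            struct open_flags op;
            int err = build_open_flags(how, &op);

            if (unlikely(err))
                    return err;     /* rare: laid out off the hot path */

            /* ... common path: getname(), fd allocation, file open ... */
            return 0;
    }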
2025-03-20  fs: reduce work in fdget_pos()  (Mateusz Guzik)
1. Predict that the file was found.
2. Explicitly compare the ref to "one", ignoring the dead zone.

The latter arguably improves the behavior to begin with. Suppose the count turned bad: the previously used ref routine is going to check for it and return 0, indicating the count does not necessitate taking ->f_pos_lock. But there very well may be several users, i.e., not paying for special-casing the dead zone improves semantics.

While here, spell out each condition in a dedicated if statement. This has no effect on generated code.

Sizes are as follows (in bytes; gcc 13, x86-64):

    stock:        321
    likely():     298
    likely()+ref: 280

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20250319215801.1870660-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
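A hedged sketch of the resulting fast path; the slow-path helper name is illustrative, and file_count() stands in for whatever ref read the patch actually uses. A count of exactly one means the current task is the sole user, so ->f_pos_lock can be skipped; anything else, including a dead-zone value, goes slow.

    /* Hedged sketch only. */
    if (likely(file)) {
            if (likely(file_count(file) == 1))
                    return file;    /* sole owner: no ->f_pos_lock needed */
    }
    return __fdget_pos_slow(fd);    /* illustrative slow-path helper */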
2025-03-19  sched/debug: Make CONFIG_SCHED_DEBUG functionality unconditional  (Ingo Molnar)
All the big Linux distros enable CONFIG_SCHED_DEBUG, because the various features it provides help not just with kernel development, but with system administration and user-space software development as well. Reflect this reality and enable this functionality unconditionally. Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250317104257.3496611-4-mingo@kernel.org
2025-03-19  pidfs: ensure that PIDFD_INFO_EXIT is available  (Christian Brauner)
When we currently create a pidfd we check that the task hasn't been reaped right before we create the pidfd. But it is of course possible that by the time we return the pidfd to userspace the task has already been reaped, since we don't check again after having created a dentry for it.

This was fine until now because that race was meaningless. But now that we provide PIDFD_INFO_EXIT it is a problem, because it is possible that the kernel returns a reaped pidfd and whether PIDFD_INFO_EXIT information is available depends on the race, i.e., on whether the task gets reaped before or after a dentry has been attached to struct pid.

Make this consistent and only return pidfds for reaped tasks if PIDFD_INFO_EXIT information is available. This is done by performing another check whether the task has been reaped right after we attached a dentry to struct pid.

Since pidfs_exit() is called before struct pid's task linkage is removed, the case where the task got reaped but a dentry was already attached to struct pid, and exit information was recorded and published, can be handled correctly. In that case we do return a pidfd for a reaped task like we would've before.

Link: https://lore.kernel.org/r/20250316-kabel-fehden-66bdb6a83436@brauner
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
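For context, a hedged userspace fragment showing what this guarantee buys a caller; it assumes Linux 6.13+ uapi headers and elides error handling.

    /* Hedged sketch: query exit info via PIDFD_GET_INFO. With this
     * change, a pidfd handed out for an already-reaped task always
     * carries the exit info. */
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/pidfd.h>

    static void print_exit_info(int pidfd)
    {
            struct pidfd_info info = { .mask = PIDFD_INFO_EXIT };

            if (ioctl(pidfd, PIDFD_GET_INFO, &info) == 0 &&
                (info.mask & PIDFD_INFO_EXIT))
                    printf("raw exit code: %d\n", info.exit_code);
    }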
2025-03-19  fs: predict not reaching the limit in alloc_empty_file()  (Mateusz Guzik)
Eliminates a jump over a call to capable() in the common case. By default the limit is not even set, in which case the check can't fail to begin with. It remains unset at least on Debian and Ubuntu. For these cases this can probably become a static branch instead; in the meantime, tidy it up.

I note that the check being separate from the bump makes the entire thing racy.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20250319124923.1838719-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-19  iomap: fix inline data on buffered read  (Gao Xiang)
Previously, iomap_readpage_iter() returning 0 would break out of the loops of iomap_readahead_iter(), which is what iomap_read_inline_data() relies on. However, commit d9dc477ff6a2 ("iomap: advance the iter directly on buffered read") changes this behavior without calling iomap_iter_advance(), which causes EROFS to get stuck in iomap_readpage_iter(). It seems iomap_iter_advance() cannot be called in iomap_read_inline_data() because of the iomap_write_begin() path, so handle this in iomap_readpage_iter() instead. Reported-and-tested-by: Bo Liu <liubo03@inspur.com> Fixes: d9dc477ff6a2 ("iomap: advance the iter directly on buffered read") Cc: Brian Foster <bfoster@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250319085125.4039368-1-hsiangkao@linux.alibaba.com Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-19  fuse: fix possible deadlock if rings are never initialized  (Luis Henriques)
When mounting a user-space filesystem using io_uring, the initialization of the rings is done separately in the server side. If for some reason (e.g. a server bug) this step is not performed it will be impossible to unmount the filesystem if there are already requests waiting. This issue is easily reproduced with the libfuse passthrough_ll example, if the queue depth is set to '0' and a request is queued before trying to unmount the filesystem. When trying to force the unmount, fuse_abort_conn() will try to wake up all tasks waiting in fc->blocked_waitq, but because the rings were never initialized, fuse_uring_ready() will never return 'true'. Fixes: 3393ff964e0f ("fuse: block request allocation until io-uring init is complete") Signed-off-by: Luis Henriques <luis@igalia.com> Link: https://lore.kernel.org/r/20250306111218.13734-1-luis@igalia.com Acked-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-19  netfs: Fix netfs_unbuffered_read() to return ssize_t rather than int  (David Howells)
Fix netfs_unbuffered_read() to return an ssize_t rather than an int as netfs_wait_for_read() returns ssize_t and this gets implicitly truncated. Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20250314164201.1993231-5-dhowells@redhat.com Acked-by: "Paulo Alcantara (Red Hat)" <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: Viacheslav Dubeyko <slava@dubeyko.com> cc: Alex Markuze <amarkuze@redhat.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
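The bug class in miniature, as a hedged userspace demonstration (assuming LP64, where int is 32 bits and ssize_t is 64); wait_for_read() here merely stands in for netfs_wait_for_read().

    /* A >2 GiB byte count survives the ssize_t path but is silently
     * truncated through an int return, which is what the fix prevents. */
    #include <stdio.h>
    #include <sys/types.h>

    static ssize_t wait_for_read(void)
    {
            return 5368709120;      /* 5 GiB */
    }

    static int read_broken(void)    { return wait_for_read(); } /* truncates */
    static ssize_t read_fixed(void) { return wait_for_read(); }

    int main(void)
    {
            /* prints "int: 1073741824, ssize_t: 5368709120" on LP64 */
            printf("int: %d, ssize_t: %zd\n", read_broken(), read_fixed());
            return 0;
    }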
2025-03-19  netfs: Fix rolling_buffer_load_from_ra() to not clear mark bits  (David Howells)
rolling_buffer_load_from_ra() looms large in the perf report because it loops around doing an atomic clear for each of the three mark bits per folio. However, this is both inefficient (it would be better to build a mask and atomically AND them out) and unnecessary as they shouldn't be set. Fix this by removing the loop. Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20250314164201.1993231-4-dhowells@redhat.com Acked-by: "Paulo Alcantara (Red Hat)" <pc@manguebit.com> cc: Jeff Layton <jlayton@kernel.org> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: netfs@lists.linux.dev cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-19  netfs: Call `invalidate_cache` only if implemented  (Max Kellermann)
Many filesystems such as NFS and Ceph do not implement the `invalidate_cache` method. On those filesystems, if writing to the cache (`NETFS_WRITE_TO_CACHE`) fails for some reason, the kernel crashes like this:

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor instruction fetch in kernel mode
    #PF: error_code(0x0010) - not-present page
    PGD 0 P4D 0
    Oops: Oops: 0010 [#1] SMP PTI
    CPU: 9 UID: 0 PID: 3380 Comm: kworker/u193:11 Not tainted 6.13.3-cm4all1-hp #437
    Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/17/2018
    Workqueue: events_unbound netfs_write_collection_worker
    RIP: 0010:0x0
    Code: Unable to access opcode bytes at 0xffffffffffffffd6.
    RSP: 0018:ffff9b86e2ca7dc0 EFLAGS: 00010202
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 7fffffffffffffff
    RDX: 0000000000000001 RSI: ffff89259d576a18 RDI: ffff89259d576900
    RBP: ffff89259d5769b0 R08: ffff9b86e2ca7d28 R09: 0000000000000002
    R10: ffff89258ceaca80 R11: 0000000000000001 R12: 0000000000000020
    R13: ffff893d158b9338 R14: ffff89259d576900 R15: ffff89259d5769b0
    FS:  0000000000000000(0000) GS:ffff893c9fa40000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffffffffffffd6 CR3: 000000054442e003 CR4: 00000000001706f0
    Call Trace:
     <TASK>
     ? __die+0x1f/0x60
     ? page_fault_oops+0x15c/0x460
     ? try_to_wake_up+0x2d2/0x530
     ? exc_page_fault+0x5e/0x100
     ? asm_exc_page_fault+0x22/0x30
     netfs_write_collection_worker+0xe9f/0x12b0
     ? xs_poll_check_readable+0x3f/0x80
     ? xs_stream_data_receive_workfn+0x8d/0x110
     process_one_work+0x134/0x2d0
     worker_thread+0x299/0x3a0
     ? __pfx_worker_thread+0x10/0x10
     kthread+0xba/0xe0
     ? __pfx_kthread+0x10/0x10
     ret_from_fork+0x30/0x50
     ? __pfx_kthread+0x10/0x10
     ret_from_fork_asm+0x1a/0x30
     </TASK>
    Modules linked in:
    CR2: 0000000000000000

This patch adds the missing `NULL` check.

Fixes: 0e0f2dfe880f ("netfs: Dispatch write requests to process a writeback slice")
Fixes: 288ace2f57c9 ("netfs: New writeback implementation")
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20250314164201.1993231-3-dhowells@redhat.com
Acked-by: "Paulo Alcantara (Red Hat)" <pc@manguebit.com>
cc: netfs@lists.linux.dev
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
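The fix's shape, as a hedged sketch; the request variable name is illustrative, only the optional `invalidate_cache` op comes from the text above.

    /* Hedged sketch: ->invalidate_cache is optional, so guard the call
     * instead of jumping through a NULL pointer. */
    if (wreq->netfs_ops->invalidate_cache)
            wreq->netfs_ops->invalidate_cache(wreq);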
2025-03-19  netfs: Fix collection of results during pause when collection offloaded  (David Howells)
A netfs read request can run in one of two modes: for synchronous reads, the app thread does the collection of results, and for asynchronous reads, this is offloaded to a worker thread. This is controlled by the NETFS_RREQ_OFFLOAD_COLLECTION flag.

Now, if a subrequest incurs an error, the NETFS_RREQ_PAUSE flag is set to stop the issuing loop temporarily from issuing more subrequests until a retry is successful or the request is abandoned.

When the issuing loop sees NETFS_RREQ_PAUSE, it jumps to netfs_wait_for_pause(), which will wait for the PAUSE flag to be cleared - and whilst it is waiting, it will call out to the collector as more results accrue... But this is the wrong thing to do if OFFLOAD_COLLECTION is set, as we can then end up with both the app thread and the work item collecting results simultaneously.

This manifests itself occasionally when running the generic/323 xfstest against multichannel cifs as an oops that's a bit random but frequently involves io_submit() (the test does lots of simultaneous async DIO reads).

Fix this by only doing the collection in netfs_wait_for_pause() if the NETFS_RREQ_OFFLOAD_COLLECTION flag is not set.

Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
Reported-by: Steve French <stfrench@microsoft.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20250314164201.1993231-2-dhowells@redhat.com
Acked-by: "Paulo Alcantara (Red Hat)" <pc@manguebit.com>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
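A hedged sketch of the fix's shape; the callee name is illustrative, only the flag comes from the description above.

    /* Hedged sketch: while paused, collect inline only when collection
     * has not been offloaded to the worker. */
    if (!test_bit(NETFS_RREQ_OFFLOAD_COLLECTION, &rreq->flags))
            netfs_read_collection(rreq);    /* illustrative callee */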
2025-03-19  fs: load the ->i_sb pointer once in inode_sb_list_{add,del}  (Mateusz Guzik)
While this may sound like a pedantic clean up, it does in fact impact code generation -- the patched add routine is slightly smaller. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250319004635.1820589-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
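The pattern in miniature, as a hedged and simplified sketch of the add routine: hoisting the ->i_sb load into a local guarantees a single load and shortens the addressing in the body.

    /* Hedged, simplified sketch of inode_sb_list_add() after the change. */
    static void inode_sb_list_add(struct inode *inode)
    {
            struct super_block *sb = inode->i_sb;   /* load ->i_sb once */

            spin_lock(&sb->s_inode_list_lock);
            list_add(&inode->i_sb_list, &sb->s_inodes);
            spin_unlock(&sb->s_inode_list_lock);
    }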
2025-03-19  fuse: fix uring race condition for null dereference of fc  (Joanne Koong)
There is a race condition leading to a kernel crash from a null dereference when attempting to access fc->lock in fuse_uring_create_queue(). fc may be NULL in the case where another thread is creating the uring in fuse_uring_create() and has set fc->ring but has not yet set ring->fc when fuse_uring_create_queue() reads ring->fc. There is another race condition as well where in fuse_uring_register(), ring->nr_queues may still be 0 and not yet set to the new value when we compare qid against it.

This fix sets fc->ring only after ring->fc and ring->nr_queues have been set, which guarantees that ring->fc is a proper pointer when any queues are created and that ring->nr_queues reflects the right number of queues if ring is not NULL. We must use smp_store_release() and smp_load_acquire() semantics to ensure the ordering remains correct, with fc->ring assigned only after ring->fc and ring->nr_queues have been assigned.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/r/20250318003028.3330599-1-joannelkoong@gmail.com
Fixes: 24fe962c86f5 ("fuse: {io-uring} Handle SQEs - register commands")
Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
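The publication pattern the fix relies on, as a hedged two-sided sketch; field names follow the description above, surrounding code elided.

    /* Writer side (ring creation): initialize everything, publish last.
     * The release store makes all prior initialization visible before
     * the pointer is published. */
    ring->fc = fc;
    ring->nr_queues = nr_queues;
    smp_store_release(&fc->ring, ring);

    /* Reader side (queue creation/registration): the acquire load pairs
     * with the release store above. */
    struct fuse_ring *ring = smp_load_acquire(&fc->ring);
    if (!ring)
            return -ENODEV;
    /* ring->fc and ring->nr_queues are now guaranteed to be set */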
2025-03-19  afs: Fix afs_atcell_get_link() to check if ws_cell is unset first  (David Howells)
Fix afs_atcell_get_link() to check if the workstation cell is unset before doing the RCU pathwalk bit where we dereference that. Fixes: 823869e1e616 ("afs: Fix afs_atcell_get_link() to handle RCU pathwalk") Reported-by: syzbot+76a6f18e3af82e84f264@syzkaller.appspotmail.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/2481796.1742296819@warthog.procyon.org.uk Tested-by: syzbot+76a6f18e3af82e84f264@syzkaller.appspotmail.com cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-19  umount: Allow superblock owners to force umount  (Trond Myklebust)
Loosen the permission check on forced umount to allow users holding CAP_SYS_ADMIN privileges in namespaces that are privileged with respect to the userns that originally mounted the filesystem. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Link: https://lore.kernel.org/r/12f212d4ef983714d065a6bb372fbb378753bf4c.1742315194.git.trond.myklebust@hammerspace.com Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-18  orangefs: Bufmap deadcoding  (Dr. David Alan Gilbert)
orangefs_bufmap_shift_query() last use was removed in 2018 by commit 9f8fd53cd055 ("orangefs: revamp block sizes") orangefs_bufmap_page_fill() last use was removed in 2021 by commit 0c4b7cadd1ad ("Orangef: implement orangefs_readahead.") Remove them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2025-03-18  btrfs: zoned: fix zone finishing with missing devices  (Johannes Thumshirn)
If do_zone_finish() is called with a filesystem that has missing devices (e.g. a RAID file system mounted in degraded mode) it is accessing the btrfs_device::zone_info pointer, which will not be set if the device in question is missing. Check if the device is present (by checking if it has a valid block device pointer associated) and if not, skip zone finishing for it. Fixes: 4dcbb8ab31c1 ("btrfs: zoned: make zone finishing multi stripe capable") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: zoned: fix zone activation with missing devices  (Johannes Thumshirn)
If btrfs_zone_activate() is called with a filesystem that has missing devices (e.g. a RAID file system mounted in degraded mode) it is accessing the btrfs_device::zone_info pointer, which will not be set if the device in question is missing. Check if the device is present (by checking if it has a valid block device pointer associated) and if not, skip zone activation for it. Fixes: f9a912a3c45f ("btrfs: zoned: make zone activation multi stripe capable") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
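Both this and the preceding zone-finishing fix share the same guard; a hedged sketch of its shape inside the per-stripe loop (structure names follow btrfs, details are illustrative):

    /* Hedged sketch: a missing device has no block device attached, so
     * it has no zone_info either; skip it rather than dereference NULL. */
    for (i = 0; i < map->num_stripes; i++) {
            struct btrfs_device *device = map->stripes[i].dev;

            if (!device->bdev)
                    continue;   /* missing device: nothing to touch */

            /* ... per-device zone activation or finishing ... */
    }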
2025-03-18  btrfs: remove end_no_trans label from btrfs_log_inode_parent()  (Filipe Manana)
It's a pointless label as we don't have to do anything under it other than return from the function. So remove it and directly return from the function where we used to goto. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: simplify condition for logging new dentries at btrfs_log_inode_parent()  (Filipe Manana)
There's no point in checking if the inode is a directory, as ctx->log_new_dentries is only set in case we are logging a directory, down the call chain of btrfs_log_inode(). So remove that check, making the logic simpler, and while at it add a comment about why we use a local variable to track if we later need to log new dentries.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: remove redundant else statement from btrfs_log_inode_parent()  (Filipe Manana)
If we don't need to log new directory dentries, there's no point in having an else branch just to set 'ret' to zero, as it's already zero because every time it gets a non-zero value we jump into one of the exit labels. So remove it, which reduces source code size and the module text size.

Before this change:

    $ size fs/btrfs/btrfs.ko
       text    data     bss     dec     hex filename
    1813855  163737   16920 1994512  1e6f10 fs/btrfs/btrfs.ko

After this change:

    $ size fs/btrfs/btrfs.ko
       text    data     bss     dec     hex filename
    1813807  163737   16920 1994464  1e6ee0 fs/btrfs/btrfs.ko

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: use memcmp_extent_buffer() at replay_one_extent()  (Filipe Manana)
Instead of using memcmp(), which requires copying both file extent items from each extent buffer into a local buffer, use memcmp_extent_buffer() so that we only need to copy one of the file extent items and directly use the extent buffer of the other file extent item for the comparison. This reduces code size, saves one memory copy and reduces stack usage. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
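A hedged sketch of the pattern (the offsets and the early return are illustrative): only one item is copied out of its extent buffer; the other is compared in place.

    /* Hedged sketch: one copy plus an in-place compare instead of two
     * copies into local buffers. */
    struct btrfs_file_extent_item item;

    read_extent_buffer(src_eb, &item, src_offset, sizeof(item));
    if (memcmp_extent_buffer(dst_eb, &item, dst_offset, sizeof(item)) == 0)
            return 0;   /* identical extents: nothing to replay */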
2025-03-18  btrfs: update outdated comment for overwrite_item()  (Filipe Manana)
The function is exclusively used for log replay since commit 3eb423442483 ("btrfs: remove outdated logic from overwrite_item() and add assertion"), so update the comment so that it doesn't say it can be used for logging. Also some minor rewording for clarity and while at it reformat the affected text so that it fits closer to the 80 characters limit for comments. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: use variables to store extent buffer and slot at overwrite_item()  (Filipe Manana)
Instead of referring to path->nodes[0] and path->slots[0] multiple times, which is verbose and confusing since we have an 'eb' and 'slot' variables as well, introduce local variables 'dst_eb' to point to path->nodes[0] and 'dst_slot' to have path->slots[0], reducing verbosity and making it more obvious about which extent buffer and slot we are referring to. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: avoid unnecessary memory allocation and copy at overwrite_item()  (Filipe Manana)
There's no need to allocate memory and copy from both the destination and source extent buffers to compare if the items are equal; we can instead use memcmp_extent_buffer(), which allows us to do only one memory allocation and copy instead of two.

So use memcmp_extent_buffer() instead of memcmp(), allowing us to avoid one memory allocation (which can fail or be slow under heavy memory pressure), avoid the memory copying, and reduce the code.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: don't clobber ret in btrfs_validate_super()  (Mark Harmstone)
Commit 2a9bb78cfd36 ("btrfs: validate system chunk array at btrfs_validate_super()") introduces a call to validate_sys_chunk_array() in btrfs_validate_super(), which clobbers the value of ret set earlier. This has the effect of negating the validity checks done earlier, making it so btrfs could potentially try to mount invalid filesystems. Fixes: 2a9bb78cfd36 ("btrfs: validate system chunk array at btrfs_validate_super()") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <maharmstone@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: prepare btrfs_page_mkwrite() for large folios  (Qu Wenruo)
This changes the assumption that the folio is always page sized. (Although the ASSERT() for folio order is still kept as-is). Just replace the PAGE_SIZE with folio_size(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: prepare extent_io.c for future large folio support  (Qu Wenruo)
When we're handling folios from filemap, we can no longer assume all folios are page sized. Thus for call sites assuming the folio is page sized, change the PAGE_SIZE usage to folio_size() instead. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: prepare btrfs_launder_folio() for large folios support  (Qu Wenruo)
That function is only calling btrfs_qgroup_free_data(), which doesn't care about the size of the folio. Just replace the fixed PAGE_SIZE with folio_size(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: replace PAGE_SIZE with folio_size for subpage.[ch]  (Qu Wenruo)
Since we can no longer assume all data filemap folios are page sized, use proper folio_size() calls to determine the folio size, as a preparation for future large data filemap folios. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: add a size parameter to btrfs_alloc_subpage()  (Qu Wenruo)
Since we can no longer assume page sized folio for data filemap folios, allow btrfs_alloc_subpage() to accept a new parameter, @fsize, indicating the folio size. This doesn't follow the regular behavior of passing a folio directly, because this function is shared by both data and metadata folios, and for metadata folios we have extra allocation policy to ensure no large folios whose sizes are larger than nodesize (unless it's page sized). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: subpage: make btrfs_is_subpage() check against a folio  (Qu Wenruo)
To support large data folios, we can no longer assume every filemap folio is page sized. So the btrfs_is_subpage() check must be done against a folio.

Thankfully for metadata folios, we have full control and ensure a large folio will not be larger than nodesize, so btrfs_meta_is_subpage() doesn't need this change.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: add extra warning if delayed iput is added when it's not allowed  (Qu Wenruo)
Since I have triggered the ASSERT() on the delayed iput too many times, now is the time to add some extra debug warnings for delayed iputs.

All delayed iputs should be queued after all ordered extents finish their IO and all involved workqueues are flushed. Thus after the btrfs_run_delayed_iputs() inside close_ctree(), there should be no more delayed iputs added.

So introduce a new flag, BTRFS_FS_STATE_NO_DELAYED_IPUT, set after the above mentioned point. All btrfs_add_delayed_iput() calls will then check that flag and emit a WARN_ON_ONCE().

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: avoid redundant path slot assignment in btrfs_search_forward()  (Sun YangKai)
Move path slot assignment before the condition check to prevent duplicate assignment. Previously, the slot was set both inside and after the 'slot >= nritems' block with no change in its value, which is unnecessary. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: remove unnecessary btrfs_key local variable in btrfs_search_forward()  (Sun YangKai)
The 'found_key' variable was only used to temporarily store the found key before copying it to 'min_key' at the end of the function when returning success. Eliminate the 'found_key' variable, and directly store the key into 'min_key' at the exact loop exit points where ret=0 is set, maintaining identical functionality. Signed-off-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: simplify the return value handling in search_ioctl()  (Sun YangKai)
Move the assignment of -EFAULT to within the error condition check of fault_in_subpage_writeable(). The previous placement outside the condition could lead to the error value being overwritten by subsequent assignments and caused unnecessary assignments.

Simplify the loop exit logic by removing the redundant goto. The original code used 'goto err' to bypass post-loop processing after handling errors from btrfs_search_forward(). However, the loop's termination naturally falls through to the post-loop section, which already handles 'ret' values. Replacing 'goto err' with 'break' eliminates redundant control flow, consolidates error handling, and makes the loop's exit conditions explicit.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: tests: fix chunk map leak after failure to add it to the tree  (Filipe Manana)
If we fail to add the chunk map to the fs mapping tree we exit test_rmap_block() without freeing the chunk map. Fix this by adding a call to btrfs_free_chunk_map() before exiting the test function if the call to btrfs_add_chunk_map() failed. Fixes: 7dc66abb5a47 ("btrfs: use a dedicated data structure for chunk maps") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: codify pattern for adding block_group to bg_list  (Boris Burkov)
Similar to mark_bg_unused() and mark_bg_to_reclaim(), we have a few places that use bg_list with refcounting, mostly for retrying failures to reclaim/delete unused. These have custom logic for handling locking and refcounting the bg_list properly, but they actually all want to do the same thing, so pull that logic out into a helper. Unfortunately, mark_bg_unused() does still need the NEW flag to avoid prematurely marking stuff unused (even if refcount is fine, we don't want to mess with bg creation), so it cannot use the new helper. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: explicitly ref count block_group on new_bgs list  (Boris Burkov)
All other users of the bg_list list_head increment the refcount when adding to a list and decrement it when deleting from the list. Just for the sake of uniformity and to try to avoid refcounting bugs, do it for this list as well. This does not fix any known ref-counting bug, as the reference belongs to a single task (trans_handle is not shared and this represents trans_handle->new_bgs linkage) and will not lose its original refcount while that thread is running. And BLOCK_GROUP_FLAG_NEW protects against ref-counting errors "moving" the block group to the unused list without taking a ref. With that said, I still believe it is simpler to just hold the extra ref count for this list user as well. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: make btrfs_discard_workfn() block_group ref explicit  (Boris Burkov)
Currently, the async discard machinery owns a ref to the block_group when the block_group is queued on a discard list. However, to handle races with discard cancellation and the discard workfn, we have a specific logic to detect that the block_group is *currently* running in the workfn, to protect the workfn's usage amidst cancellation. As far as I can tell, this doesn't have any overt bugs (though finish_discard_pass() and remove_from_discard_list() racing can have a surprising outcome for the caller of remove_from_discard_list() in that it is again added at the end). But it is needlessly complicated to rely on locking and the nullity of discard_ctl->block_group. Simplify this significantly by just taking a refcount while we are in the workfn and unconditionally drop it in both the remove and workfn paths, regardless of if they race. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: harden block_group::bg_list against list_del() races  (Boris Burkov)
As far as I can tell, these calls of list_del_init() on bg_list cannot run concurrently with btrfs_mark_bg_unused() or btrfs_mark_bg_to_reclaim(), as they are in transaction error paths and situations where the block group is readonly. However, if there is any chance at all of racing with mark_bg_unused(), or a different future user of bg_list, better to be safe than sorry.

Otherwise we risk the following interleaving (bg_list refcount in parens):

    T1 (some random op)                 T2 (btrfs_mark_bg_unused)
                                        !list_empty(&bg->bg_list); (1)
    list_del_init(&bg->bg_list); (1)
                                        list_move_tail (1)
    btrfs_put_block_group (0)
                                        btrfs_delete_unused_bgs
                                            bg = list_first_entry
                                            list_del_init(&bg->bg_list);
                                            btrfs_put_block_group(bg); (-1)

Ultimately, this results in a broken ref count that hits zero one deref early, and the real final deref underflows the refcount, resulting in a WARNING.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: fix block group refcount race in btrfs_create_pending_block_groups()  (Boris Burkov)
Block group creation is done in two phases, which results in a slightly unintuitive property: a block group can be allocated/deallocated from after btrfs_make_block_group() adds it to the space_info with btrfs_add_bg_to_space_info(), but before creation is completely completed in btrfs_create_pending_block_groups(). As a result, it is possible for a block group to go unused and have 'btrfs_mark_bg_unused' called on it concurrently with 'btrfs_create_pending_block_groups'. This causes a number of issues, which were fixed with the block group flag 'BLOCK_GROUP_FLAG_NEW'.

However, this fix is not quite complete. Since it does not use the unused_bg_lock, it is possible for the following race to occur:

    btrfs_create_pending_block_groups      btrfs_mark_bg_unused
                                           if list_empty // false
        list_del_init
        clear_bit
                                           else if (test_bit) // true
                                           list_move_tail

And we get into the exact same broken ref count and invalid new_bgs state for transaction cleanup that BLOCK_GROUP_FLAG_NEW was designed to prevent.

The broken refcount aspect will result in a warning like:

    [1272.943527] refcount_t: underflow; use-after-free.
    [1272.943967] WARNING: CPU: 1 PID: 61 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
    [1272.944731] Modules linked in: btrfs virtio_net xor zstd_compress raid6_pq null_blk [last unloaded: btrfs]
    [1272.945550] CPU: 1 UID: 0 PID: 61 Comm: kworker/u32:1 Kdump: loaded Tainted: G W 6.14.0-rc5+ #108
    [1272.946368] Tainted: [W]=WARN
    [1272.946585] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
    [1272.947273] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
    [1272.947788] RIP: 0010:refcount_warn_saturate+0xba/0x110
    [1272.949532] RSP: 0018:ffffbf1200247df0 EFLAGS: 00010282
    [1272.949901] RAX: 0000000000000000 RBX: ffffa14b00e3f800 RCX: 0000000000000000
    [1272.950437] RDX: 0000000000000000 RSI: ffffbf1200247c78 RDI: 00000000ffffdfff
    [1272.950986] RBP: ffffa14b00dc2860 R08: 00000000ffffdfff R09: ffffffff90526268
    [1272.951512] R10: ffffffff904762c0 R11: 0000000063666572 R12: ffffa14b00dc28c0
    [1272.952024] R13: 0000000000000000 R14: ffffa14b00dc2868 R15: 000001285dcd12c0
    [1272.952850] FS:  0000000000000000(0000) GS:ffffa14d33c40000(0000) knlGS:0000000000000000
    [1272.953458] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [1272.953931] CR2: 00007f838cbda000 CR3: 000000010104e000 CR4: 00000000000006f0
    [1272.954474] Call Trace:
    [1272.954655]  <TASK>
    [1272.954812]  ? refcount_warn_saturate+0xba/0x110
    [1272.955173]  ? __warn.cold+0x93/0xd7
    [1272.955487]  ? refcount_warn_saturate+0xba/0x110
    [1272.955816]  ? report_bug+0xe7/0x120
    [1272.956103]  ? handle_bug+0x53/0x90
    [1272.956424]  ? exc_invalid_op+0x13/0x60
    [1272.956700]  ? asm_exc_invalid_op+0x16/0x20
    [1272.957011]  ? refcount_warn_saturate+0xba/0x110
    [1272.957399]  btrfs_discard_cancel_work.cold+0x26/0x2b [btrfs]
    [1272.957853]  btrfs_put_block_group.cold+0x5d/0x8e [btrfs]
    [1272.958289]  btrfs_discard_workfn+0x194/0x380 [btrfs]
    [1272.958729]  process_one_work+0x130/0x290
    [1272.959026]  worker_thread+0x2ea/0x420
    [1272.959335]  ? __pfx_worker_thread+0x10/0x10
    [1272.959644]  kthread+0xd7/0x1c0
    [1272.959872]  ? __pfx_kthread+0x10/0x10
    [1272.960172]  ret_from_fork+0x30/0x50
    [1272.960474]  ? __pfx_kthread+0x10/0x10
    [1272.960745]  ret_from_fork_asm+0x1a/0x30
    [1272.961035]  </TASK>
    [1272.961238] ---[ end trace 0000000000000000 ]---

Though we have seen them in the async discard workfn as well. It is most likely to happen after a relocation finishes which cancels discard, tears down the block group, etc.

Fix this fully by taking the lock around the list_del_init + clear_bit so that the two are done atomically.

Fixes: 0657b20c5a76 ("btrfs: fix use-after-free of new block group that became unused")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: remove unnecessary fs_info argument from btrfs_add_block_group_cache()  (Filipe Manana)
The fs_info can be taken from the given block group, so there is no need to pass it as an argument. Also rename the local variable from 'info' to 'fs_info' which is more widely used, more clear and to be more consistent. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>