summaryrefslogtreecommitdiff
path: root/fs/xfs
AgeCommit message (Collapse)Author
2025-06-30xfs: add FALLOC_FL_ALLOCATE_RANGE to supported flags maskYouling Tang
Add FALLOC_FL_ALLOCATE_RANGE to the set of supported fallocate flags in XFS_FALLOC_FL_SUPPORTED. This change improves code clarity and maintains by explicitly showing this flag in the supported flags mask. Note that since FALLOC_FL_ALLOCATE_RANGE is defined as 0x00, this addition has no functional modifications. Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-27xfs: fix unmount hang with unflushable inodes stuck in the AILDave Chinner
Unmount of a shutdown filesystem can hang with stale inode cluster buffers in the AIL like so: [95964.140623] Call Trace: [95964.144641] __schedule+0x699/0xb70 [95964.154003] schedule+0x64/0xd0 [95964.156851] xfs_ail_push_all_sync+0x9b/0xf0 [95964.164816] xfs_unmount_flush_inodes+0x41/0x70 [95964.168698] xfs_unmountfs+0x7f/0x170 [95964.171846] xfs_fs_put_super+0x3b/0x90 [95964.175216] generic_shutdown_super+0x77/0x160 [95964.178060] kill_block_super+0x1b/0x40 [95964.180553] xfs_kill_sb+0x12/0x30 [95964.182796] deactivate_locked_super+0x38/0x100 [95964.185735] deactivate_super+0x41/0x50 [95964.188245] cleanup_mnt+0x9f/0x160 [95964.190519] __cleanup_mnt+0x12/0x20 [95964.192899] task_work_run+0x89/0xb0 [95964.195221] resume_user_mode_work+0x4f/0x60 [95964.197931] syscall_exit_to_user_mode+0x76/0xb0 [95964.201003] do_syscall_64+0x74/0x130 $ pstree -N mnt |grep umount |-check-parallel---nsexec---run_test.sh---753---umount It always seems to be generic/753 that triggers this, and repeating a quick group test run triggers it every 10-15 iterations. Hence it generally triggers once up every 30-40 minutes of test time. just running generic/753 by itself or concurrently with a limited group of tests doesn't reproduce this issue at all. Tracing on a hung system shows the AIL repeating every 50ms a log force followed by an attempt to push pinned, aborted inodes from the AIL (trimmed for brevity): xfs_log_force: lsn 0x1c caller xfsaild+0x18e xfs_log_force: lsn 0x0 caller xlog_cil_flush+0xbd xfs_log_force: lsn 0x1c caller xfs_log_force+0x77 xfs_ail_pinned: lip 0xffff88826014afa0 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED xfs_ail_pinned: lip 0xffff88814000a708 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED xfs_ail_pinned: lip 0xffff88810b850c80 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED xfs_ail_pinned: lip 0xffff88810b850af0 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED xfs_ail_pinned: lip 0xffff888165cf0a28 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED xfs_ail_pinned: lip 0xffff88810b850bb8 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED .... The inode log items are marked as aborted, which means that either: a) a transaction commit has occurred, seen an error or shutdown, and called xfs_trans_free_items() to abort the items. This should happen before any pinning of log items occurs. or b) a dirty transaction has been cancelled. This should also happen before any pinning of log items occurs. or c) AIL insertion at journal IO completion is marked as aborted. In this case, the log item is pinned by the CIL until journal IO completes and hence needs to be unpinned. This is then done after the ->iop_committed() callback is run, so the pin count should be balanced correctly. Yet none of these seemed to be occurring. Further tracing indicated this: d) Shutdown during CIL pushing resulting in log item completion being called from checkpoint abort processing. Items are unpinned and released without serialisation against each other, journal IO completion or transaction commit completion. In this case, we may still have a transaction commit in flight that holds a reference to a xfs_buf_log_item (BLI) after CIL insertion. e.g. a synchronous transaction will flush the CIL before the transaction is torn down. The concurrent CIL push then aborts insertion it and drops the commit/AIL reference to the BLI. This can leave the transaction commit context with the last reference to the BLI which is dropped here: xfs_trans_free_items() ->iop_release xfs_buf_item_release xfs_buf_item_put if (XFS_LI_ABORTED) xfs_trans_ail_delete xfs_buf_item_relse() Unlike the journal completion ->iop_unpin path, this path does not run stale buffer completion process when it drops the last reference, hence leaving the stale inodes attached to the buffer sitting the AIL. There are no other references to those inodes, so there is no other mechanism to remove them from the AIL. Hence unmount hangs. The buffer lock context for stale buffers is passed to the last BLI reference. This is normally the last BLI unpin on journal IO completion. The unpin then processes the stale buffer completion and releases the buffer lock. However, if the final unpin from journal IO completion (or CIL push abort) does not hold the last reference to the BLI, there -must- still be a transaction context that references the BLI, and so that context must perform the stale buffer completion processing before the buffer is unlocked and the BLI torn down. The fix for this is to rework the xfs_buf_item_relse() path to run stale buffer completion processing if it drops the last reference to the BLI. We still hold the buffer locked, so the buffer owner and lock context is the same as if we passed the BLI and buffer to the ->iop_unpin() context to finish stale process on journal commit. However, we have to be careful here. In a shutdown state, we can be freeing dirty BLIs from xfs_buf_item_put() via xfs_trans_brelse() and xfs_trans_bdetach(). The existing code handles this case by considering shutdown state as "aborted", but in doing so largely masks the failure to clean up stale BLI state from the xfs_buf_item_relse() path. i.e regardless of the shutdown state and whether the item is in the AIL, we must finish the stale buffer cleanup if we are are dropping the last BLI reference from the ->iop_relse path in transaction commit context. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-27xfs: factor out stale buffer item completionDave Chinner
The stale buffer item completion handling is currently only done from BLI unpinning. We need to perform this function from where-ever the last reference to the BLI is dropped, so first we need to factor this code out into a helper. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-27xfs: rearrange code in xfs_buf_item.cDave Chinner
The code to initialise, release and free items is all the way down the bottom of the file. Upcoming fixes need to these functions earlier in the file, so move them to the top. There is one code change in this move - the parameter to xfs_buf_item_relse() is changed from the xfs_buf to the xfs_buf_log_item - the thing that the function is releasing. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-27xfs: add tracepoints for stale pinned inode state debugDave Chinner
I needed more insight into how stale inodes were getting stuck on the AIL after a forced shutdown when running fsstress. These are the tracepoints I added for that purpose. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-27xfs: avoid dquot buffer pin deadlockDave Chinner
On shutdown when quotas are enabled, the shutdown can deadlock trying to unpin the dquot buffer buf_log_item like so: [ 3319.483590] task:kworker/20:0H state:D stack:14360 pid:1962230 tgid:1962230 ppid:2 task_flags:0x4208060 flags:0x00004000 [ 3319.493966] Workqueue: xfs-log/dm-6 xlog_ioend_work [ 3319.498458] Call Trace: [ 3319.500800] <TASK> [ 3319.502809] __schedule+0x699/0xb70 [ 3319.512672] schedule+0x64/0xd0 [ 3319.515573] schedule_timeout+0x30/0xf0 [ 3319.528125] __down_common+0xc3/0x200 [ 3319.531488] __down+0x1d/0x30 [ 3319.534186] down+0x48/0x50 [ 3319.540501] xfs_buf_lock+0x3d/0xe0 [ 3319.543609] xfs_buf_item_unpin+0x85/0x1b0 [ 3319.547248] xlog_cil_committed+0x289/0x570 [ 3319.571411] xlog_cil_process_committed+0x6d/0x90 [ 3319.575590] xlog_state_shutdown_callbacks+0x52/0x110 [ 3319.580017] xlog_force_shutdown+0x169/0x1a0 [ 3319.583780] xlog_ioend_work+0x7c/0xb0 [ 3319.587049] process_scheduled_works+0x1d6/0x400 [ 3319.591127] worker_thread+0x202/0x2e0 [ 3319.594452] kthread+0x20c/0x240 The CIL push has seen the deadlock, so it has aborted the push and is running CIL checkpoint completion to abort all the items in the checkpoint. This calls ->iop_unpin(remove = true) to clean up the log items in the checkpoint. When a buffer log item is unpined like this, it needs to lock the buffer to run io completion to correctly fail the buffer and run all the required completions to fail attached log items as well. In this case, the attempt to lock the buffer on unpin is hanging because the buffer is already locked. I suspected a leaked XFS_BLI_HOLD state because of XFS_BLI_STALE handling changes I was testing, so I went looking for pin events on HOLD buffers and unpin events on locked buffer. That isolated this one buffer with these two events: xfs_buf_item_pin: dev 251:6 daddr 0xa910 bbcount 0x2 hold 2 pincount 0 lock 0 flags DONE|KMEM recur 0 refcount 1 bliflags HOLD|DIRTY|LOGGED liflags DIRTY .... xfs_buf_item_unpin: dev 251:6 daddr 0xa910 bbcount 0x2 hold 4 pincount 1 lock 0 flags DONE|KMEM recur 0 refcount 1 bliflags DIRTY liflags ABORTED Firstly, bbcount = 0x2, which means it is not a single sector structure. That rules out every xfs_trans_bhold() case except one: dquot buffers. Then hung task dumping gave this trace: [ 3197.312078] task:fsync-tester state:D stack:12080 pid:2051125 tgid:2051125 ppid:1643233 task_flags:0x400000 flags:0x00004002 [ 3197.323007] Call Trace: [ 3197.325581] <TASK> [ 3197.327727] __schedule+0x699/0xb70 [ 3197.334582] schedule+0x64/0xd0 [ 3197.337672] schedule_timeout+0x30/0xf0 [ 3197.350139] wait_for_completion+0xbd/0x180 [ 3197.354235] __flush_workqueue+0xef/0x4e0 [ 3197.362229] xlog_cil_force_seq+0xa0/0x300 [ 3197.374447] xfs_log_force+0x77/0x230 [ 3197.378015] xfs_qm_dqunpin_wait+0x49/0xf0 [ 3197.382010] xfs_qm_dqflush+0x55/0x460 [ 3197.385663] xfs_qm_dquot_isolate+0x29e/0x4d0 [ 3197.389977] __list_lru_walk_one+0x141/0x220 [ 3197.398867] list_lru_walk_one+0x10/0x20 [ 3197.402713] xfs_qm_shrink_scan+0x6a/0x100 [ 3197.406699] do_shrink_slab+0x18a/0x350 [ 3197.410512] shrink_slab+0xf7/0x430 [ 3197.413967] drop_slab+0x97/0xf0 [ 3197.417121] drop_caches_sysctl_handler+0x59/0xc0 [ 3197.421654] proc_sys_call_handler+0x18b/0x280 [ 3197.426050] proc_sys_write+0x13/0x20 [ 3197.429750] vfs_write+0x2b8/0x3e0 [ 3197.438532] ksys_write+0x7e/0xf0 [ 3197.441742] __x64_sys_write+0x1b/0x30 [ 3197.445363] x64_sys_call+0x2c72/0x2f60 [ 3197.449044] do_syscall_64+0x6c/0x140 [ 3197.456341] entry_SYSCALL_64_after_hwframe+0x76/0x7e Yup, another test run by check-parallel is running drop_caches concurrently and the dquot shrinker for the hung filesystem is running. That's trying to flush a dirty dquot from reclaim context, and it waiting on a log force to complete. xfs_qm_dqflush is called with the dquot buffer held locked, and so we've called xfs_log_force() with that buffer locked. Now the log force is waiting for a workqueue flush to complete, and that workqueue flush is waiting of CIL checkpoint processing to finish. The CIL checkpoint processing is aborting all the log items it has, and that requires locking aborted buffers to cancel them. Now, normally this isn't a problem if we are issuing a log force to unpin an object, because the ->iop_unpin() method wakes pin waiters first. That results in the pin waiter finishing off whatever it was doing, dropping the lock and then xfs_buf_item_unpin() can lock the buffer and fail it. However, xfs_qm_dqflush() is waiting on the -dquot- unpin event, not the dquot buffer unpin event, and so it never gets woken and so does not drop the buffer lock. Inodes do not have this problem, as they can only be written from one spot (->iop_push) whilst dquots can be written from multiple places (memory reclaim, ->iop_push, xfs_dq_dqpurge, and quotacheck). The reason that the dquot buffer has an attached buffer log item is that it has been recently allocated. Initialisation of the dquot buffer logs the buffer directly, thereby pinning it in memory. We then modify the dquot in a separate operation, and have memory reclaim racing with a shutdown and we trigger this deadlock. check-parallel reproduces this reliably on 1kB FSB filesystems with quota enabled because it does all of these things concurrently without having to explicitly write tests to exercise these corner case conditions. xfs_qm_dquot_logitem_push() doesn't have this deadlock because it checks if the dquot is pinned before locking the dquot buffer and skipping it if it is pinned. This means the xfs_qm_dqunpin_wait() log force in xfs_qm_dqflush() never triggers and we unlock the buffer safely allowing a concurrent shutdown to fail the buffer appropriately. xfs_qm_dqpurge() could have this problem as it is called from quotacheck and we might have allocated dquot buffers when recording the quota updates. This can be fixed by calling xfs_qm_dqunpin_wait() before we lock the dquot buffer. Because we hold the dquot locked, nothing will be able to add to the pin count between the unpin_wait and the dqflush callout, so this now makes xfs_qm_dqpurge() safe against this race. xfs_qm_dquot_isolate() can also be fixed this same way but, quite frankly, we shouldn't be doing IO in memory reclaim context. If the dquot is pinned or dirty, simply rotate it and let memory reclaim come back to it later, same as we do for inodes. This then gets rid of the nasty issue in xfs_qm_flush_one() where quotacheck writeback races with memory reclaim flushing the dquots. We can lift xfs_qm_dqunpin_wait() up into this code, then get rid of the "can't get the dqflush lock" buffer write to cycle the dqlfush lock and enable it to be flushed again. checking if the dquot is pinned and returning -EAGAIN so that the dquot walk will revisit the dquot again later. Finally, with xfs_qm_dqunpin_wait() lifted into all the callers, we can remove it from the xfs_qm_dqflush() code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-27xfs: catch stale AGF/AGF metadataDave Chinner
There is a race condition that can trigger in dmflakey fstests that can result in asserts in xfs_ialloc_read_agi() and xfs_alloc_read_agf() firing. The asserts look like this: XFS: Assertion failed: pag->pagf_freeblks == be32_to_cpu(agf->agf_freeblks), file: fs/xfs/libxfs/xfs_alloc.c, line: 3440 ..... Call Trace: <TASK> xfs_alloc_read_agf+0x2ad/0x3a0 xfs_alloc_fix_freelist+0x280/0x720 xfs_alloc_vextent_prepare_ag+0x42/0x120 xfs_alloc_vextent_iterate_ags+0x67/0x260 xfs_alloc_vextent_start_ag+0xe4/0x1c0 xfs_bmapi_allocate+0x6fe/0xc90 xfs_bmapi_convert_delalloc+0x338/0x560 xfs_map_blocks+0x354/0x580 iomap_writepages+0x52b/0xa70 xfs_vm_writepages+0xd7/0x100 do_writepages+0xe1/0x2c0 __writeback_single_inode+0x44/0x340 writeback_sb_inodes+0x2d0/0x570 __writeback_inodes_wb+0x9c/0xf0 wb_writeback+0x139/0x2d0 wb_workfn+0x23e/0x4c0 process_scheduled_works+0x1d4/0x400 worker_thread+0x234/0x2e0 kthread+0x147/0x170 ret_from_fork+0x3e/0x50 ret_from_fork_asm+0x1a/0x30 I've seen the AGI variant from scrub running on the filesysetm after unmount failed due to systemd interference: XFS: Assertion failed: pag->pagi_freecount == be32_to_cpu(agi->agi_freecount) || xfs_is_shutdown(pag->pag_mount), file: fs/xfs/libxfs/xfs_ialloc.c, line: 2804 ..... Call Trace: <TASK> xfs_ialloc_read_agi+0xee/0x150 xchk_perag_drain_and_lock+0x7d/0x240 xchk_ag_init+0x34/0x90 xchk_inode_xref+0x7b/0x220 xchk_inode+0x14d/0x180 xfs_scrub_metadata+0x2e2/0x510 xfs_ioc_scrub_metadata+0x62/0xb0 xfs_file_ioctl+0x446/0xbf0 __se_sys_ioctl+0x6f/0xc0 __x64_sys_ioctl+0x1d/0x30 x64_sys_call+0x1879/0x2ee0 do_syscall_64+0x68/0x130 ? exc_page_fault+0x62/0xc0 entry_SYSCALL_64_after_hwframe+0x76/0x7e Essentially, it is the same problem. When _flakey_drop_and_remount() loads the drop-writes table, it makes all writes silently fail. Writes are reported to the fs as completed successfully, but they are not issued to the backing store. The filesystem sees the successful write completion and marks the metadata buffer clean and removes it from the AIL. If this happens at the same time as memory pressure is occuring, the now-clean AGF and/or AGI buffers can be reclaimed from memory. Shortly afterwards, but before _flakey_drop_and_remount() runs unmount, background writeback is kicked and it tries to allocate blocks for the dirty pages in memory. This then tries to access the AGF buffer we just turfed out of memory. It's not found, so it gets read in from disk. This is all fine, except for the fact that the last writeback of the AGF did not actually reach disk. The AGF on disk is stale compared to the in-memory state held by the perag, and so they don't match and the assert fires. Then other operations on that inode hang because the task was killed whilst holding inode locks. e.g: Workqueue: xfs-conv/dm-12 xfs_end_io Call Trace: <TASK> __schedule+0x650/0xb10 schedule+0x6d/0xf0 schedule_preempt_disabled+0x15/0x30 rwsem_down_write_slowpath+0x31a/0x5f0 down_write+0x43/0x60 xfs_ilock+0x1a8/0x210 xfs_trans_alloc_inode+0x9c/0x240 xfs_iomap_write_unwritten+0xe3/0x300 xfs_end_ioend+0x90/0x130 xfs_end_io+0xce/0x100 process_scheduled_works+0x1d4/0x400 worker_thread+0x234/0x2e0 kthread+0x147/0x170 ret_from_fork+0x3e/0x50 ret_from_fork_asm+0x1a/0x30 </TASK> and it's all down hill from there. Memory pressure is one way to trigger this, another is to run "echo 3 > /proc/sys/vm/drop_caches" randomly while tests are running. Regardless of how it is triggered, this effectively takes down the system once umount hangs because it's holding a sb->s_umount lock exclusive and now every sync(1) call gets stuck on it. Fix this by replacing the asserts with a corruption detection check and a shutdown. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-27xfs: xfs_ifree_cluster vs xfs_iflush_shutdown_abort deadlockDave Chinner
Lock order of xfs_ifree_cluster() is cluster buffer -> try ILOCK -> IFLUSHING, except for the last inode in the cluster that is triggering the free. In that case, the lock order is ILOCK -> cluster buffer -> IFLUSHING. xfs_iflush_cluster() uses cluster buffer -> try ILOCK -> IFLUSHING, so this can safely run concurrently with xfs_ifree_cluster(). xfs_inode_item_precommit() uses ILOCK -> cluster buffer, but this cannot race with xfs_ifree_cluster() so being in a different order will not trigger a deadlock. xfs_reclaim_inode() during a filesystem shutdown uses ILOCK -> IFLUSHING -> cluster buffer via xfs_iflush_shutdown_abort(), and this deadlocks against xfs_ifree_cluster() like so: sysrq: Show Blocked State task:kworker/10:37 state:D stack:12560 pid:276182 tgid:276182 ppid:2 flags:0x00004000 Workqueue: xfs-inodegc/dm-3 xfs_inodegc_worker Call Trace: <TASK> __schedule+0x650/0xb10 schedule+0x6d/0xf0 schedule_timeout+0x8b/0x180 schedule_timeout_uninterruptible+0x1e/0x30 xfs_ifree+0x326/0x730 xfs_inactive_ifree+0xcb/0x230 xfs_inactive+0x2c8/0x380 xfs_inodegc_worker+0xaa/0x180 process_scheduled_works+0x1d4/0x400 worker_thread+0x234/0x2e0 kthread+0x147/0x170 ret_from_fork+0x3e/0x50 ret_from_fork_asm+0x1a/0x30 </TASK> task:fsync-tester state:D stack:12160 pid:2255943 tgid:2255943 ppid:3988702 flags:0x00004006 Call Trace: <TASK> __schedule+0x650/0xb10 schedule+0x6d/0xf0 schedule_timeout+0x31/0x180 __down_common+0xbe/0x1f0 __down+0x1d/0x30 down+0x48/0x50 xfs_buf_lock+0x3d/0xe0 xfs_iflush_shutdown_abort+0x51/0x1e0 xfs_icwalk_ag+0x386/0x690 xfs_reclaim_inodes_nr+0x114/0x160 xfs_fs_free_cached_objects+0x19/0x20 super_cache_scan+0x17b/0x1a0 do_shrink_slab+0x180/0x350 shrink_slab+0xf8/0x430 drop_slab+0x97/0xf0 drop_caches_sysctl_handler+0x59/0xc0 proc_sys_call_handler+0x189/0x280 proc_sys_write+0x13/0x20 vfs_write+0x33d/0x3f0 ksys_write+0x7c/0xf0 __x64_sys_write+0x1b/0x30 x64_sys_call+0x271d/0x2ee0 do_syscall_64+0x68/0x130 entry_SYSCALL_64_after_hwframe+0x76/0x7e We can't change the lock order of xfs_ifree_cluster() - XFS_ISTALE and XFS_IFLUSHING are serialised through to journal IO completion by the cluster buffer lock being held. There's quite a few asserts in the code that check that XFS_ISTALE does not occur out of sync with buffer locking (e.g. in xfs_iflush_cluster). There's also a dependency on the inode log item being removed from the buffer before XFS_IFLUSHING is cleared, also with asserts that trigger on this. Further, we don't have a requirement for the inode to be locked when completing or aborting inode flushing because all the inode state updates are serialised by holding the cluster buffer lock across the IO to completion. We can't check for XFS_IRECLAIM in xfs_ifree_mark_inode_stale() and skip the inode, because there is no guarantee that the inode will be reclaimed. Hence it *must* be marked XFS_ISTALE regardless of whether reclaim is preparing to free that inode. Similarly, we can't check for IFLUSHING before locking the inode because that would result in dirty inodes not being marked with ISTALE in the event of racing with XFS_IRECLAIM. Hence we have to address this issue from the xfs_reclaim_inode() side. It is clear that we cannot hold the inode locked here when calling xfs_iflush_shutdown_abort() because it is the inode->buffer lock order that causes the deadlock against xfs_ifree_cluster(). Hence we need to drop the ILOCK before aborting the inode in the shutdown case. Once we've aborted the inode, we can grab the ILOCK again and then immediately reclaim it as it is now guaranteed to be clean. Note that dropping the ILOCK in xfs_reclaim_inode() means that it can now be locked by xfs_ifree_mark_inode_stale() and seen whilst in this state. This is safe because we have left the XFS_IFLUSHING flag on the inode and so xfs_ifree_mark_inode_stale() will simply set XFS_ISTALE and move to the next inode. An ASSERT check in this path needs to be tweaked to take into account this new shutdown interaction. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-16xfs: actually use the xfs_growfs_check_rtgeom tracepointDarrick J. Wong
We created a new tracepoint but forgot to put it in. Fix that. Cc: rostedt@goodmis.org Cc: stable@vger.kernel.org # v6.14 Fixes: 59a57acbce282d ("xfs: check that the rtrmapbt maxlevels doesn't increase when growing fs") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reported-by: Steven Rostedt <rostedt@goodmis.org> Closes: https://lore.kernel.org/all/20250612131021.114e6ec8@batman.local.home/ Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-16xfs: Improve error handling in xfs_mru_cache_create()Markus Elfring
Simplify error handling in this function implementation. * Delete unnecessary pointer checks and variable assignments. * Omit a redundant function call. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-16xfs: move xfs_submit_zoned_bio a bitChristoph Hellwig
Commit f3e2e53823b9 ("xfs: add inode to zone caching for data placement") add the new code right between xfs_submit_zoned_bio and xfs_zone_alloc_and_submit which implement the main zoned write path. Move xfs_submit_zoned_bio down to keep it together again. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-16xfs: use xfs_readonly_buftarg in xfs_remount_rwChristoph Hellwig
Use xfs_readonly_buftarg instead of open coding it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-16xfs: remove NULL pointer checks in xfs_mru_cache_insertChristoph Hellwig
Remove the check for a NULL mru or mru->list in xfs_mru_cache_insert as this API misused lead to a direct NULL pointer dereference on first use and is not user triggerable. As a smatch run by Dan points out with the recent cleanup it would otherwise try to free the object we just determined to be NULL for this impossible to reach case. Fixes: 70b95cb86513 ("xfs: free the item in xfs_mru_cache_insert on failure") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-16xfs: check for shutdown before going to sleep in xfs_select_zoneChristoph Hellwig
Ensure the file system hasn't been shut down before waiting for a free zone to become available, because that won't happen on a shut down file system. Without this processes can occasionally get stuck in the allocator wait loop when racing with a file system shutdown. This sporadically happens when running generic/388 or generic/475. Fixes: 4e4d52075577 ("xfs: add the zoned space allocator") Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-06-02Merge tag 'vfs-6.16-rc2.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix the AT_HANDLE_CONNECTABLE option so filesystems that don't know how to decode a connected non-dir dentry fail the request - Use repr(transparent) to ensure identical layout between the C and Rust implementation of struct file - Add a missing xas_pause() into the dax code employing wait_entry_unlocked_exclusive() - Fix FOP_DONTCACHE which we disabled for v6.15. A folio could get redirtied and/or scheduled for writeback after the initial dropbehind test. Change the test accordingly to handle these cases so we can re-enable FOP_DONTCACHE again * tag 'vfs-6.16-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: exportfs: require ->fh_to_parent() to encode connectable file handles rust: file: improve safety comments rust: file: mark `LocalFile` as `repr(transparent)` fs/dax: Fix "don't skip locked entries when scanning entries" iomap: don't lose folio dropbehind state for overwrites mm/filemap: unify dropbehind flag testing and clearing mm/filemap: unify read/write dropbehind naming Revert "Disable FOP_DONTCACHE for now due to bugs" mm/filemap: use filemap_end_dropbehind() for read invalidation mm/filemap: gate dropbehind invalidate on folio !dirty && !writeback
2025-05-31Merge tag 'mm-nonmm-stable-2025-05-31-15-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "hung_task: extend blocking task stacktrace dump to semaphore" from Lance Yang enhances the hung task detector. The detector presently dumps the blocking tasks's stack when it is blocked on a mutex. Lance's series extends this to semaphores - "nilfs2: improve sanity checks in dirty state propagation" from Wentao Liang addresses a couple of minor flaws in nilfs2 - "scripts/gdb: Fixes related to lx_per_cpu()" from Illia Ostapyshyn fixes a couple of issues in the gdb scripts - "Support kdump with LUKS encryption by reusing LUKS volume keys" from Coiby Xu addresses a usability problem with kdump. When the dump device is LUKS-encrypted, the kdump kernel may not have the keys to the encrypted filesystem. A full writeup of this is in the series [0/N] cover letter - "sysfs: add counters for lockups and stalls" from Max Kellermann adds /sys/kernel/hardlockup_count and /sys/kernel/hardlockup_count and /sys/kernel/rcu_stall_count - "fork: Page operation cleanups in the fork code" from Pasha Tatashin implements a number of code cleanups in fork.c - "scripts/gdb/symbols: determine KASLR offset on s390 during early boot" from Ilya Leoshkevich fixes some s390 issues in the gdb scripts * tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (67 commits) llist: make llist_add_batch() a static inline delayacct: remove redundant code and adjust indentation squashfs: add optional full compressed block caching crash_dump, nvme: select CONFIGFS_FS as built-in scripts/gdb/symbols: determine KASLR offset on s390 during early boot scripts/gdb/symbols: factor out pagination_off() scripts/gdb/symbols: factor out get_vmlinux() kernel/panic.c: format kernel-doc comments mailmap: update and consolidate Casey Connolly's name and email nilfs2: remove wbc->for_reclaim handling fork: define a local GFP_VMAP_STACK fork: check charging success before zeroing stack fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks code fork: clean-up ifdef logic around stack allocation kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count x86/crash: make the page that stores the dm crypt keys inaccessible x86/crash: pass dm crypt keys to kdump kernel Revert "x86/mm: Remove unused __set_memory_prot()" crash_dump: retrieve dm crypt keys in kdump kernel ...
2025-05-28iomap: don't lose folio dropbehind state for overwritesJens Axboe
DONTCACHE I/O must have the completion punted to a workqueue, just like what is done for unwritten extents, as the completion needs task context to perform the invalidation of the folio(s). However, if writeback is started off filemap_fdatawrite_range() off generic_sync() and it's an overwrite, then the DONTCACHE marking gets lost as iomap_add_to_ioend() don't look at the folio being added and no further state is passed down to help it know that this is a dropbehind/DONTCACHE write. Check if the folio being added is marked as dropbehind, and set IOMAP_IOEND_DONTCACHE if that is the case. Then XFS can factor this into the decision making of completion context in xfs_submit_ioend(). Additionally include this ioend flag in the NOMERGE flags, to avoid mixing it with unrelated IO. Since this is the 3rd flag that will cause XFS to punt the completion to a workqueue, add a helper so that each one of them can get appropriately commented. This fixes extra page cache being instantiated when the write performed is an overwrite, rather than newly instantiated blocks. Fixes: b2cd5ae693a3 ("iomap: make buffered writes work with RWF_DONTCACHE") Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/5153f6e8-274d-4546-bf55-30a5018e0d03@kernel.dk Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-26Merge tag 'xfs-merge-6.16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Carlos Maiolino: - Atomic writes for XFS - Remove experimental warnings for pNFS, scrub and parent pointers * tag 'xfs-merge-6.16' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits) xfs: add inode to zone caching for data placement xfs: free the item in xfs_mru_cache_insert on failure xfs: remove the EXPERIMENTAL warning for pNFS xfs: remove some EXPERIMENTAL warnings xfs: Remove deprecated xfs_bufd sysctl parameters xfs: stop using set_blocksize xfs: allow sysadmins to specify a maximum atomic write limit at mount time xfs: update atomic write limits xfs: add xfs_calc_atomic_write_unit_max() xfs: add xfs_file_dio_write_atomic() xfs: commit CoW-based atomic writes atomically xfs: add large atomic writes checks in xfs_direct_write_iomap_begin() xfs: add xfs_atomic_write_cow_iomap_begin() xfs: refine atomic write size check in xfs_file_write_iter() xfs: refactor xfs_reflink_end_cow_extent() xfs: allow block allocator to take an alignment hint xfs: ignore HW which cannot atomic write a single block xfs: add helpers to compute transaction reservation for finishing intent items xfs: add helpers to compute log item overhead xfs: separate out setting buftarg atomic writes limits ...
2025-05-26Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - ublk updates: - Add support for updating the size of a ublk instance - Zero-copy improvements - Auto-registering of buffers for zero-copy - Series simplifying and improving GET_DATA and request lookup - Series adding quiesce support - Lots of selftests additions - Various cleanups - NVMe updates via Christoph: - add per-node DMA pools and use them for PRP/SGL allocations (Caleb Sander Mateos, Keith Busch) - nvme-fcloop refcounting fixes (Daniel Wagner) - support delayed removal of the multipath node and optionally support the multipath node for private namespaces (Nilay Shroff) - support shared CQs in the PCI endpoint target code (Wilfred Mallawa) - support admin-queue only authentication (Hannes Reinecke) - use the crc32c library instead of the crypto API (Eric Biggers) - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes Reinecke, Leon Romanovsky, Gustavo A. R. Silva) - MD updates via Yu: - Fix that normal IO can be starved by sync IO, found by mkfs on newly created large raid5, with some clean up patches for bdev inflight counters - Clean up brd, getting rid of atomic kmaps and bvec poking - Add loop driver specifically for zoned IO testing - Eliminate blk-rq-qos calls with a static key, if not enabled - Improve hctx locking for when a plug has IO for multiple queues pending - Remove block layer bouncing support, which in turn means we can remove the per-node bounce stat as well - Improve blk-throttle support - Improve delay support for blk-throttle - Improve brd discard support - Unify IO scheduler switching. This should also fix a bunch of lockdep warnings we've been seeing, after enabling lockdep support for queue freezing/unfreezeing - Add support for block write streams via FDP (flexible data placement) on NVMe - Add a bunch of block helpers, facilitating the removal of a bunch of duplicated boilerplate code - Remove obsolete BLK_MQ pci and virtio Kconfig options - Add atomic/untorn write support to blktrace - Various little cleanups and fixes * tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits) selftests: ublk: add test for UBLK_F_QUIESCE ublk: add feature UBLK_F_QUIESCE selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE traceevent/block: Add REQ_ATOMIC flag to block trace events ublk: run auto buf unregisgering in same io_ring_ctx with registering io_uring: add helper io_uring_cmd_ctx_handle() ublk: remove io argument from ublk_auto_buf_reg_fallback() ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch() selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK selftests: ublk: support UBLK_F_AUTO_BUF_REG ublk: support UBLK_AUTO_BUF_REG_FALLBACK ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG ublk: prepare for supporting to register request buffer automatically ublk: convert to refcount_t selftests: ublk: make IO & device removal test more stressful nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk nvme: introduce multipath_always_on module param nvme-multipath: introduce delayed removal of the multipath head node nvme-pci: derive and better document max segments limits nvme-pci: use struct_size for allocation struct nvme_dev ...
2025-05-26Merge tag 'vfs-6.16-rc1.super' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs freezing updates from Christian Brauner: "This contains various filesystem freezing related work for this cycle: - Allow the power subsystem to support filesystem freeze for suspend and hibernate. Now all the pieces are in place to actually allow the power subsystem to freeze/thaw filesystems during suspend/resume. Filesystems are only frozen and thawed if the power subsystem does actually own the freeze. If the filesystem is already frozen by the time we've frozen all userspace processes we don't care to freeze it again. That's userspace's job once the process resumes. We only actually freeze filesystems if we absolutely have to and we ignore other failures to freeze. We could bubble up errors and fail suspend/resume if the error isn't EBUSY (aka it's already frozen) but I don't think that this is worth it. Filesystem freezing during suspend/resume is best-effort. If the user has 500 ext4 filesystems mounted and 4 fail to freeze for whatever reason then we simply skip them. What we have now is already a big improvement and let's see how we fare with it before making our lives even harder (and uglier) than we have to. - Allow efivars to support freeze and thaw Allow efivarfs to partake to resync variable state during system hibernation and suspend. Add freeze/thaw support. This is a pretty straightforward implementation. We simply add regular freeze/thaw support for both userspace and the kernel. efivars is the first pseudofilesystem that adds support for filesystem freezing and thawing. The simplicity comes from the fact that we simply always resync variable state after efivarfs has been frozen. It doesn't matter whether that's because of suspend, userspace initiated freeze or hibernation. Efivars is simple enough that it doesn't matter that we walk all dentries. There are no directories and there aren't insane amounts of entries and both freeze/thaw are already heavy-handed operations. If userspace initiated a freeze/thaw cycle they would need CAP_SYS_ADMIN in the initial user namespace (as that's where efivarfs is mounted) so it can't be triggered by random userspace. IOW, we really really don't care" * tag 'vfs-6.16-rc1.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: f2fs: fix freezing filesystem during resize kernfs: add warning about implementing freeze/thaw efivarfs: support freeze/thaw power: freeze filesystems during suspend/resume libfs: export find_next_child() super: add filesystem freezing helpers for suspend and hibernate gfs2: pass through holder from the VFS for freeze/thaw super: use common iterator (Part 2) super: use a common iterator (Part 1) super: skip dying superblocks early super: simplify user_get_super() super: remove pointless s_root checks fs: allow all writers to be frozen locking/percpu-rwsem: add freezable alternative to down_read
2025-05-26Merge tag 'vfs-6.16-rc1.async.dir' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs directory lookup updates from Christian Brauner: "This contains cleanups for the lookup_one*() family of helpers. We expose a set of functions with names containing "lookup_one_len" and others without the "_len". This difference has nothing to do with "len". It's rater a historical accident that can be confusing. The functions without "_len" take a "mnt_idmap" pointer. This is found in the "vfsmount" and that is an important question when choosing which to use: do you have a vfsmount, or are you "inside" the filesystem. A related question is "is permission checking relevant here?". nfsd and cachefiles *do* have a vfsmount but *don't* use the non-_len functions. They pass nop_mnt_idmap and refuse to work on filesystems which have any other idmap. This work changes nfsd and cachefile to use the lookup_one family of functions and to explictily pass &nop_mnt_idmap which is consistent with all other vfs interfaces used where &nop_mnt_idmap is explicitly passed. The remaining uses of the "_one" functions do not require permission checks so these are renamed to be "_noperm" and the permission checking is removed. This series also changes these lookup function to take a qstr instead of separate name and len. In many cases this simplifies the call" * tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: VFS: change lookup_one_common and lookup_noperm_common to take a qstr Use try_lookup_noperm() instead of d_hash_and_lookup() outside of VFS VFS: rename lookup_one_len family to lookup_noperm and remove permission check cachefiles: Use lookup_one() rather than lookup_one_len() nfsd: Use lookup_one() rather than lookup_one_len() VFS: improve interface for lookup_one functions
2025-05-14xfs: add inode to zone caching for data placementHans Holmberg
Placing data from the same file in the same zone is a great heuristic for reducing write amplification and we do this already - but only for sequential writes. To support placing data in the same way for random writes, reuse the xfs mru cache to map inodes to open zones on first write. If a mapping is present, use the open zone for data placement for this file until the zone is full. Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14xfs: free the item in xfs_mru_cache_insert on failureChristoph Hellwig
Call the provided free_func when xfs_mru_cache_insert as that's what the callers need to do anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14xfs: Fix comment on xfs_trans_ail_update_bulk()Carlos Maiolino
This function doesn't take the AIL lock, but should be called with AIL lock held. Also (hopefuly) simplify the comment. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandanbabu@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14xfs: Fix a comment on xfs_ail_deleteCarlos Maiolino
It doesn't return anything. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Chandan Babu R <chandanbabu@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14xfs: Fail remount with noattr2 on a v5 with v4 enabledNirjhar Roy (IBM)
Bug: When we compile the kernel with CONFIG_XFS_SUPPORT_V4=y, remount with "-o remount,noattr2" on a v5 XFS does not fail explicitly. Reproduction: mkfs.xfs -f /dev/loop0 mount /dev/loop0 /mnt/scratch mount -o remount,noattr2 /dev/loop0 /mnt/scratch However, with CONFIG_XFS_SUPPORT_V4=n, the remount correctly fails explicitly. This is because the way the following 2 functions are defined: static inline bool xfs_has_attr2 (struct xfs_mount *mp) { return !IS_ENABLED(CONFIG_XFS_SUPPORT_V4) || (mp->m_features & XFS_FEAT_ATTR2); } static inline bool xfs_has_noattr2 (const struct xfs_mount *mp) { return mp->m_features & XFS_FEAT_NOATTR2; } xfs_has_attr2() returns true when CONFIG_XFS_SUPPORT_V4=n and hence, the following if condition in xfs_fs_validate_params() succeeds and returns -EINVAL: /* * We have not read the superblock at this point, so only the attr2 * mount option can set the attr2 feature by this stage. */ if (xfs_has_attr2(mp) && xfs_has_noattr2(mp)) { xfs_warn(mp, "attr2 and noattr2 cannot both be specified."); return -EINVAL; } With CONFIG_XFS_SUPPORT_V4=y, xfs_has_attr2() always return false and hence no error is returned. Fix: Check if the existing mount has crc enabled(i.e, of type v5 and has attr2 enabled) and the remount has noattr2, if yes, return -EINVAL. I have tested xfs/{189,539} in fstests with v4 and v5 XFS with both CONFIG_XFS_SUPPORT_V4=y/n and they both behave as expected. This patch also fixes remount from noattr2 -> attr2 (on a v4 xfs). Related discussion in [1] [1] https://lore.kernel.org/all/Z65o6nWxT00MaUrW@dread.disaster.area/ Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14xfs: fix zoned GC data corruption due to wrong bv_offsetChristoph Hellwig
xfs_zone_gc_write_chunk writes out the data buffer read in earlier using the same bio, and currenly looks at bv_offset for the offset into the scratch folio for that. But commit 26064d3e2b4d ("block: fix adding folio to bio") changed how bv_page and bv_offset are calculated for adding larger folios, breaking this fragile logic. Switch to extracting the full physical address from the old bio_vec, and calculate the offset into the folio from that instead. This fixes data corruption during garbage collection with heavy rockdsb workloads. Thanks to Hans for tracking down the culprit commit during long bisection sessions. Fixes: 26064d3e2b4d ("block: fix adding folio to bio") Fixes: 080d01c41d44 ("xfs: implement zoned garbage collection") Reported-by: Hans Holmberg <Hans.Holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <Hans.Holmberg@wdc.com> Tested-by: Hans Holmberg <Hans.Holmberg@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14xfs: free up mp->m_free[0].count in error caseWengang Wang
In xfs_init_percpu_counters(), memory for mp->m_free[0].count wasn't freed in error case. Free it up in this patch. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Fixes: 712bae96631852 ("xfs: generalize the freespace and reserved blocks handling") Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14xfs: remove the EXPERIMENTAL warning for pNFSChristoph Hellwig
The pNFS layout support has been around for 10 years without major issues, drop the EXPERIMENTAL warning. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14xfs: remove some EXPERIMENTAL warningsDarrick J. Wong
Online fsck was finished a year ago, in Linux 6.10. The exchange-range syscall and parent pointers were merged in the same cycle. None of these have encountered any serious errors in the year that they've been in the kernel (or the many many years they've been under development) so let's drop the shouty warnings. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14Merge branch 'atomic_writes-6.16' into xfs-6.16-mergeCarlos Maiolino
Required update due to conflict with patch: xfs: stop using set_blocksize Conflicts: fs/xfs/xfs_buf.c Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14xfs: Remove deprecated xfs_bufd sysctl parametersZizhi Wo
Commit 64af7a6ea5a4 ("xfs: remove deprecated sysctls") removed the deprecated xfsbufd-related sysctl interface, but forgot to delete the corresponding parameters: "xfs_buf_timer" and "xfs_buf_age". This patch removes those parameters and makes no other changes. Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14xfs: stop using set_blocksizeDarrick J. Wong
XFS has its own buffer cache for metadata that uses submit_bio, which means that it no longer uses the block device pagecache for anything. Create a more lightweight helper that runs the blocksize checks and flushes dirty data and use that instead. No more truncating the pagecache because XFS does not use it or care about it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-11sort.h: hoist cmp_int() into generic header fileFedor Pchelkin
Deduplicate the same functionality implemented in several places by moving the cmp_int() helper macro into linux/sort.h. The macro performs a three-way comparison of the arguments mostly useful in different sorting strategies and algorithms. Link: https://lkml.kernel.org/r/20250427201451.900730-1-pchelkin@ispras.ru Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Suggested-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Kent Overstreet <kent.overstreet@linux.dev> Acked-by: Coly Li <colyli@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Carlos Maiolino <cem@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Coly Li <colyli@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-09super: add filesystem freezing helpers for suspend and hibernateChristian Brauner
Allow the power subsystem to support filesystem freeze for suspend and hibernate. For some kernel subsystems it is paramount that they are guaranteed that they are the owner of the freeze to avoid any risk of deadlocks. This is the case for the power subsystem. Enable it to recognize whether it did actually freeze the filesystem. If userspace has 10 filesystems and suspend/hibernate manges to freeze 5 and then fails on the 6th for whatever odd reason (current or future) then power needs to undo the freeze of the first 5 filesystems. It can't just walk the list again because while it's unlikely that a new filesystem got added in the meantime it still cannot tell which filesystems the power subsystem actually managed to get a freeze reference count on that needs to be dropped during thaw. There's various ways out of this ugliness. For example, record the filesystems the power subsystem managed to freeze on a temporary list in the callbacks and then walk that list backwards during thaw to undo the freezing or make sure that the power subsystem just actually exclusively freezes things it can freeze and marking such filesystems as being owned by power for the duration of the suspend or resume cycle. I opted for the latter as that seemed the clean thing to do even if it means more code changes. If hibernation races with filesystem freezing (e.g. DM reconfiguration), then hibernation need not freeze a filesystem because it's already frozen but userspace may thaw the filesystem before hibernation actually happens. If the race happens the other way around, DM reconfiguration may unexpectedly fail with EBUSY. So allow FREEZE_EXCL to nest with other holders. An exclusive freezer cannot be undone by any of the other concurrent freezers. Link: https://lore.kernel.org/r/20250329-work-freeze-v2-6-a47af37ecc3d@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-09Merge tag 'atomic-writes-6.16_2025-05-07' of ↵Carlos Maiolino
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into atomic_writes large atomic writes for xfs [v12.1] Currently atomic write support for xfs is limited to writing a single block as we have no way to guarantee alignment and that the write covers a single extent. This series introduces a method to issue atomic writes via a software-based method. The software-based method is used as a fallback for when attempting to issue an atomic write over misaligned or multiple extents. For xfs, this support is based on reflink CoW support. The basic idea of this CoW method is to alloc a range in the CoW fork, write the data, and atomically update the mapping. Initial mysql performance testing has shown this method to perform ok. However, there we are only using 16K atomic writes (and 4K block size), so typically - and thankfully - this software fallback method won't be used often. For other FSes which want large atomics writes and don't support CoW, I think that they can follow the example in [0]. Catherine is currently working on further xfstests for this feature, which we hope to share soon. About 17/17, maybe it can be omitted as there is no strong demand to have it included. Based on bfecc4091e07 (xfs/next-rc, xfs/for-next) xfs: allow ro mounts if rtdev or logdev are read-only [0] https://lore.kernel.org/linux-xfs/20250102140411.14617-1-john.g.garry@oracle.com/ Differences to v12: - add more review tags Differences to v11: - split "xfs: ignore ..." patch - inline sync_blockdev() in xfs_alloc_buftarg() (Christoph) - fix xfs_calc_rtgroup_awu_max() for 0 block count (Darrick) - Add RB tag from Christoph (thanks!) Differences to v10: - add "xfs: only call xfs_setsize_buftarg once ..." by Darrick - symbol renames in "xfs: ignore HW which cannot..." by Darrick Differences to v9: - rework "ignore HW which cannot .." patch by Darrick - Ensure power-of-2 max always for unit min/max when no HW support With a bit of luck, this should all go splendidly. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
2025-05-07xfs: allow sysadmins to specify a maximum atomic write limit at mount timeDarrick J. Wong
Introduce a mount option to allow sysadmins to specify the maximum size of an atomic write. If the filesystem can work with the supplied value, that becomes the new guaranteed maximum. The value mustn't be too big for the existing filesystem geometry (max write size, max AG/rtgroup size). We dynamically recompute the tr_atomic_write transaction reservation based on the given block size, check that the current log size isn't less than the new minimum log size constraints, and set a new maximum. The actual software atomic write max is still computed based off of tr_atomic_ioend the same way it has for the past few commits. Note also that xfs_calc_atomic_write_log_geometry is non-static because mkfs will need that. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: update atomic write limitsJohn Garry
Update the limits returned from xfs_get_atomic_write_{min, max, max_opt)(). No reflink support always means no CoW-based atomic writes. For updating xfs_get_atomic_write_min(), we support blocksize only and that depends on HW or reflink support. For updating xfs_get_atomic_write_max(), for no reflink, we are limited to blocksize but only if HW support. Otherwise we are limited to combined limit in mp->m_atomic_write_unit_max. For updating xfs_get_atomic_write_max_opt(), ultimately we are limited by the bdev atomic write limit. If xfs_get_atomic_write_max() does not report > 1x blocksize, then just continue to report 0 as before. Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: update comments in the helper functions] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: add xfs_calc_atomic_write_unit_max()John Garry
Now that CoW-based atomic writes are supported, update the max size of an atomic write for the data device. The limit of a CoW-based atomic write will be the limit of the number of logitems which can fit into a single transaction. In addition, the max atomic write size needs to be aligned to the agsize. Limit the size of atomic writes to the greatest power-of-two factor of the agsize so that allocations for an atomic write will always be aligned compatibly with the alignment requirements of the storage. Function xfs_atomic_write_logitems() is added to find the limit the number of log items which can fit in a single transaction. Amend the max atomic write computation to create a new transaction reservation type, and compute the maximum size of an atomic write completion (in fsblocks) based on this new transaction reservation. Initially, tr_atomic_write is a clone of tr_itruncate, which provides a reasonable level of parallelism. In the next patch, we'll add a mount option so that sysadmins can configure their own limits. [djwong: use a new reservation type for atomic write ioends, refactor group limit calculations] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> [jpg: rounddown power-of-2 always] Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: add xfs_file_dio_write_atomic()John Garry
Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes. Now HW offload will not be required to support atomic writes and is an optional feature. CoW-based atomic writes will be supported with out-of-places write and atomic extent remapping. Either mode of operation may be used for an atomic write, depending on the circumstances. The preferred method is HW offload as it will be faster. If HW offload is not available then we always use the CoW-based method. If HW offload is available but not possible to use, then again we use the CoW-based method. If available, HW offload would not be possible for the write length exceeding the HW offload limit, the write spanning multiple extents, unaligned disk blocks, etc. Apart from the write exceeding the HW offload limit, other conditions for HW offload usage can only be detected in the iomap handling for the write. As such, we use a fallback method to issue the write if we detect in the ->iomap_begin() handler that HW offload is not possible. Special code -ENOPROTOOPT is returned from ->iomap_begin() to inform that HW offload is not possible. In other words, atomic writes are supported on any filesystem that can perform out of place write remapping atomically (i.e. reflink) up to some fairly large size. If the conditions are right (a single correctly aligned overwrite mapping) then the filesystem will use any available hardware support to avoid the filesystem metadata updates. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: commit CoW-based atomic writes atomicallyJohn Garry
When completing a CoW-based write, each extent range mapping update is covered by a separate transaction. For a CoW-based atomic write, all mappings must be changed at once, so change to use a single transaction. Note that there is a limit on the amount of log intent items which can be fit into a single transaction, but this is being ignored for now since the count of items for a typical atomic write would be much less than is typically supported. A typical atomic write would be expected to be 64KB or less, which means only 16 possible extents unmaps, which is quite small. Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: add tr_atomic_ioend] Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: add large atomic writes checks in xfs_direct_write_iomap_begin()John Garry
For when large atomic writes (> 1x FS block) are supported, there will be various occasions when HW offload may not be possible. Such instances include: - unaligned extent mapping wrt write length - extent mappings which do not cover the full write, e.g. the write spans sparse or mixed-mapping extents - the write length is greater than HW offload can support - no hardware support at all In those cases, we need to fallback to the CoW-based atomic write mode. For this, report special code -ENOPROTOOPT to inform the caller that HW offload-based method is not possible. In addition to the occasions mentioned, if the write covers an unallocated range, we again judge that we need to rely on the CoW-based method when we would need to allocate anything more than 1x block. This is because if we allocate less blocks that is required for the write, then again HW offload-based method would not be possible. So we are taking a pessimistic approach to writes covering unallocated space. Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: various cleanups] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: add xfs_atomic_write_cow_iomap_begin()John Garry
For CoW-based atomic writes, reuse the infrastructure for reflink CoW fork support. Add ->iomap_begin() callback xfs_atomic_write_cow_iomap_begin() to create staging mappings in the CoW fork for atomic write updates. The general steps in the function are as follows: - find extent mapping in the CoW fork for the FS block range being written - if part or full extent is found, proceed to process found extent - if no extent found, map in new blocks to the CoW fork - convert unwritten blocks in extent if required - update iomap extent mapping and return The bulk of this function is quite similar to the processing in xfs_reflink_allocate_cow(), where we try to find an extent mapping; if none exists, then allocate a new extent in the CoW fork, convert unwritten blocks, and return a mapping. Performance testing has shown the XFS_ILOCK_EXCL locking to be quite a bottleneck, so this is an area which could be optimised in future. Christoph Hellwig contributed almost all of the code in xfs_atomic_write_cow_iomap_begin(). Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: add a new xfs_can_sw_atomic_write to convey intent better] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: refine atomic write size check in xfs_file_write_iter()John Garry
Currently the size of atomic write allowed is fixed at the blocksize. To start to lift this restriction, partly refactor xfs_report_atomic_write() to into helpers - xfs_get_atomic_write_{min, max}() - and use those helpers to find the per-inode atomic write limits and check according to that. Also add xfs_get_atomic_write_max_opt() to return the optimal limit, and just return 0 since large atomics aren't supported yet. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: refactor xfs_reflink_end_cow_extent()John Garry
Refactor xfs_reflink_end_cow_extent() into separate parts which process the CoW range and commit the transaction. This refactoring will be used in future for when it is required to commit a range of extents as a single transaction, similar to how it was done pre-commit d6f215f359637. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: allow block allocator to take an alignment hintJohn Garry
Add a BMAPI flag to provide a hint to the block allocator to align extents according to the extszhint. This will be useful for atomic writes to ensure that we are not being allocated extents which are not suitable (for atomic writes). Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: ignore HW which cannot atomic write a single blockDarrick J. Wong
Currently only HW which can write at least 1x block is supported. For supporting atomic writes > 1x block, a CoW-based method will also be used and this will not be resticted to using HW which can write >= 1x block. However for deciding if HW-based atomic writes can be used, we need to start adding checks for write length < HW min, which complicates the code. Indeed, a statx field similar to unit_max_opt should also be added for this minimum, which is undesirable. HW which can only write > 1x blocks would be uncommon and quite weird, so let's just not support it. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2025-05-07xfs: add helpers to compute transaction reservation for finishing intent itemsDarrick J. Wong
In the transaction reservation code, hoist the logic that computes the reservation needed to finish one log intent item into separate helper functions. These will be used in subsequent patches to estimate the number of blocks that an online repair can commit to reaping in the same transaction as the change committing the new data structure. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: add helpers to compute log item overheadDarrick J. Wong
Add selected helpers to estimate the transaction reservation required to write various log intent and buffer items to the log. These helpers will be used by the online repair code for more precise estimations of how much work can be done in a single transaction. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: separate out setting buftarg atomic writes limitsDarrick J. Wong
Separate out setting buftarg atomic writes limits into a dedicated function, xfs_configure_buftarg_atomic_writes(), to keep the specific functionality self-contained. For naming consistency, rename xfs_setsize_buftarg() -> xfs_configure_buftarg(). Signed-off-by: Darrick J. Wong <djwong@kernel.org> [jpg: separate out from patch "xfs: ignore HW which ..."] Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>