summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2025-05-14xfs: add inode to zone caching for data placementHans Holmberg
Placing data from the same file in the same zone is a great heuristic for reducing write amplification and we do this already - but only for sequential writes. To support placing data in the same way for random writes, reuse the xfs mru cache to map inodes to open zones on first write. If a mapping is present, use the open zone for data placement for this file until the zone is full. Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14xfs: free the item in xfs_mru_cache_insert on failureChristoph Hellwig
Call the provided free_func when xfs_mru_cache_insert as that's what the callers need to do anyway. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14Merge tag 'execve-v6.15-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve fix from Kees Cook: "This fixes a corner case for ASLR-disabled static-PIE brk collision with vdso allocations: - binfmt_elf: Move brk for static PIE even if ASLR disabled" * tag 'execve-v6.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: binfmt_elf: Move brk for static PIE even if ASLR disabled
2025-05-14ext4: introduce ext4_check_map_extents_env() debug helperZhang Yi
Loading and modifying the extents tree and extent status tree without holding the inode's i_rwsem or the mapping's invalidate_lock is not permitted, except during the I/O writeback. Add a new debug helper ext4_check_map_extents_env(), it will verify whether the extent loading/modifying context is safe. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250423085257.122685-8-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-14ext4: factor out is_special_ino()Zhang Yi
Factor out the helper is_special_ino() to facilitate the checking of special inodes in the subsequent patches. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250423085257.122685-7-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-14ext4: prevent stale extent cache entries caused by concurrent get es_cacheZhang Yi
The EXT4_IOC_GET_ES_CACHE and EXT4_IOC_PRECACHE_EXTENTS currently invokes ext4_ext_precache() to preload the extent cache without holding the inode's i_rwsem. This can result in stale extent cache entries when competing with operations such as ext4_collapse_range() which calls ext4_ext_remove_space() or ext4_ext_shift_extents(). The problem arises when ext4_ext_remove_space() temporarily releases i_data_sem due to insufficient journal credits. During this interval, a concurrent EXT4_IOC_GET_ES_CACHE or EXT4_IOC_PRECACHE_EXTENTS may cache extent entries that are about to be deleted. As a result, these cached entries become stale and inconsistent with the actual extents. Loading the extents cache without holding the inode's i_rwsem or the mapping's invalidate_lock is not permitted besides during the writeback. Fix this by holding the i_rwsem during EXT4_IOC_GET_ES_CACHE and EXT4_IOC_PRECACHE_EXTENTS. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250423085257.122685-6-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-14ext4: prevent stale extent cache entries caused by concurrent fiemapZhang Yi
The ext4_fiemap() currently invokes ext4_ext_precache() and iomap_fiemap() to preload the extent cache and query mapping information without holding the inode's i_rwsem. This can result in stale extent cache entries when competing with operations such as ext4_collapse_range() which calls ext4_ext_remove_space() or ext4_ext_shift_extents(). The problem arises when ext4_ext_remove_space() temporarily releases i_data_sem due to insufficient journal credits. During this interval, a concurrent ext4_fiemap() may cache extent entries that are about to be deleted. As a result, these cached entries become stale and inconsistent with the actual extents. Loading the extents cache without holding the inode's i_rwsem or the mapping's invalidate_lock is not permitted besides during the writeback. Fix this by holding the i_rwsem in ext4_fiemap(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250423085257.122685-5-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-14ext4: prevent stale extent cache entries caused by concurrent I/O writebackZhang Yi
Currently, in the I/O writeback path, ext4_map_blocks() may attempt to cache additional unrelated extents in the extent status tree without holding the inode's i_rwsem and the mapping's invalidate_lock. This can lead to stale extent status entries remaining in certain scenarios, potentially causing data corruption. For example, when performing a collapse range in ext4_collapse_range(), it clears the extent cache and dirty pages before removing blocks and shifting extents. It also holds the i_data_sem during these two operations. However, both ext4_ext_remove_space() and ext4_ext_shift_extents() may briefly release the i_data_sem if journal credits are insufficient (ext4_datasem_ensure_credits()). If another writeback process writes dirty pages from other regions during this interval, it may cache extents that are about to be modified. Unless ext4_collapse_range() explicitly clears the extent cache again, these cached entries can become stale and inconsistent with the actual extents. 0 a n b c m | | | | | | [www][wwwwww][wwwwwwww]...[wwwww][wwww]... | | N M Assume that block a is dirty. The collapse range operation is removing data from n to m and drops i_data_sem immediately after removing the extent from b to c. At the same time, a concurrent writeback begins to write back block a; it will reloads the extent from [n, b) into the extent status tree since it does not hold the i_rwsem or the invalidate_lock. After the collapse range operation, it left the stale extent [n, b), which points logical block n to N, but the actual physical block of n should be M. Similarly, both ext4_insert_range() and ext4_truncate() have the same problem. ext4_punch_hole() survived since it re-add a hole extent entry after removing space since commit 9f1118223aa0 ("ext4: add a hole extent entry in cache after punch"). In most cases, during dirty page writeback, the block mapping information is likely to be found in the extent cache, making it less necessary to search for physical extents. Consequently, loading unrelated extent caches during writeback appears to be ineffective. Therefore, fix this by adds EXT4_EX_NOCACHE in the writeback path to prevent caching of unrelated extents, eliminating this potential source of corruption. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250423085257.122685-4-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-14ext4: generalize EXT4_GET_BLOCKS_IO_SUBMIT flag usageZhang Yi
Currently, the EXT4_GET_BLOCKS_IO_SUBMIT flag is only used during data writeback to indicate that in ordered mode, the journal commit thread should skip re-submitting data and simply wait for I/O completion. To prepare for later patches that need to detect I/O submission context in ext4_map_blocks(), generalizes the meaning of EXT4_GET_BLOCKS_IO_SUBMIT. This flag will be set during: 1) data I/O writeback, 2) I/O completion extents conversion, 3) journal performing commit in fast_commit. This change doesn't affect current usage of this flag and provides a clear way to identify I/O submission context. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250423085257.122685-3-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-14ext4: ext4: unify EXT4_EX_NOCACHE|NOFAIL flags in ext4_ext_remove_space()Zhang Yi
When removing space, we should use EXT4_EX_NOCACHE because we don't need to cache extents, and we should also use EXT4_EX_NOFAIL to prevent metadata inconsistencies that may arise from memory allocation failures. While ext4_ext_remove_space() already uses these two flags in most places, they are missing in ext4_ext_search_right() and read_extent_tree_block() calls. Unify the flags to ensure consistent behavior throughout the extent removal process. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://patch.msgid.link/20250423085257.122685-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-14xfs: Fix comment on xfs_trans_ail_update_bulk()Carlos Maiolino
This function doesn't take the AIL lock, but should be called with AIL lock held. Also (hopefuly) simplify the comment. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandanbabu@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14xfs: Fix a comment on xfs_ail_deleteCarlos Maiolino
It doesn't return anything. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Chandan Babu R <chandanbabu@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14xfs: Fail remount with noattr2 on a v5 with v4 enabledNirjhar Roy (IBM)
Bug: When we compile the kernel with CONFIG_XFS_SUPPORT_V4=y, remount with "-o remount,noattr2" on a v5 XFS does not fail explicitly. Reproduction: mkfs.xfs -f /dev/loop0 mount /dev/loop0 /mnt/scratch mount -o remount,noattr2 /dev/loop0 /mnt/scratch However, with CONFIG_XFS_SUPPORT_V4=n, the remount correctly fails explicitly. This is because the way the following 2 functions are defined: static inline bool xfs_has_attr2 (struct xfs_mount *mp) { return !IS_ENABLED(CONFIG_XFS_SUPPORT_V4) || (mp->m_features & XFS_FEAT_ATTR2); } static inline bool xfs_has_noattr2 (const struct xfs_mount *mp) { return mp->m_features & XFS_FEAT_NOATTR2; } xfs_has_attr2() returns true when CONFIG_XFS_SUPPORT_V4=n and hence, the following if condition in xfs_fs_validate_params() succeeds and returns -EINVAL: /* * We have not read the superblock at this point, so only the attr2 * mount option can set the attr2 feature by this stage. */ if (xfs_has_attr2(mp) && xfs_has_noattr2(mp)) { xfs_warn(mp, "attr2 and noattr2 cannot both be specified."); return -EINVAL; } With CONFIG_XFS_SUPPORT_V4=y, xfs_has_attr2() always return false and hence no error is returned. Fix: Check if the existing mount has crc enabled(i.e, of type v5 and has attr2 enabled) and the remount has noattr2, if yes, return -EINVAL. I have tested xfs/{189,539} in fstests with v4 and v5 XFS with both CONFIG_XFS_SUPPORT_V4=y/n and they both behave as expected. This patch also fixes remount from noattr2 -> attr2 (on a v4 xfs). Related discussion in [1] [1] https://lore.kernel.org/all/Z65o6nWxT00MaUrW@dread.disaster.area/ Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14xfs: fix zoned GC data corruption due to wrong bv_offsetChristoph Hellwig
xfs_zone_gc_write_chunk writes out the data buffer read in earlier using the same bio, and currenly looks at bv_offset for the offset into the scratch folio for that. But commit 26064d3e2b4d ("block: fix adding folio to bio") changed how bv_page and bv_offset are calculated for adding larger folios, breaking this fragile logic. Switch to extracting the full physical address from the old bio_vec, and calculate the offset into the folio from that instead. This fixes data corruption during garbage collection with heavy rockdsb workloads. Thanks to Hans for tracking down the culprit commit during long bisection sessions. Fixes: 26064d3e2b4d ("block: fix adding folio to bio") Fixes: 080d01c41d44 ("xfs: implement zoned garbage collection") Reported-by: Hans Holmberg <Hans.Holmberg@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hans Holmberg <Hans.Holmberg@wdc.com> Tested-by: Hans Holmberg <Hans.Holmberg@wdc.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14xfs: free up mp->m_free[0].count in error caseWengang Wang
In xfs_init_percpu_counters(), memory for mp->m_free[0].count wasn't freed in error case. Free it up in this patch. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Fixes: 712bae96631852 ("xfs: generalize the freespace and reserved blocks handling") Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14xfs: remove the EXPERIMENTAL warning for pNFSChristoph Hellwig
The pNFS layout support has been around for 10 years without major issues, drop the EXPERIMENTAL warning. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14xfs: remove some EXPERIMENTAL warningsDarrick J. Wong
Online fsck was finished a year ago, in Linux 6.10. The exchange-range syscall and parent pointers were merged in the same cycle. None of these have encountered any serious errors in the year that they've been in the kernel (or the many many years they've been under development) so let's drop the shouty warnings. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14Merge branch 'atomic_writes-6.16' into xfs-6.16-mergeCarlos Maiolino
Required update due to conflict with patch: xfs: stop using set_blocksize Conflicts: fs/xfs/xfs_buf.c Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14xfs: Remove deprecated xfs_bufd sysctl parametersZizhi Wo
Commit 64af7a6ea5a4 ("xfs: remove deprecated sysctls") removed the deprecated xfsbufd-related sysctl interface, but forgot to delete the corresponding parameters: "xfs_buf_timer" and "xfs_buf_age". This patch removes those parameters and makes no other changes. Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14xfs: stop using set_blocksizeDarrick J. Wong
XFS has its own buffer cache for metadata that uses submit_bio, which means that it no longer uses the block device pagecache for anything. Create a more lightweight helper that runs the blocksize checks and flushes dirty data and use that instead. No more truncating the pagecache because XFS does not use it or care about it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-05-14fanotify: Drop use of flex array in fanotify_fhJan Kara
We use flex array 'buf' in fanotify_fh to contain the file handle content. However the buffer is not a proper flex array. It has a preconfigured fixed size. Furthermore if fixed size is not big enough, we use external separately allocated buffer. Hence don't pretend buf is a flex array since we have to use special accessors anyway and instead just modify the accessors to do the pointer math without flex array. This fixes warnings that flex array is not the last struct member in fanotify_fid_event or fanotify_error_event. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Link: https://patch.msgid.link/20250513131745.2808-2-jack@suse.cz
2025-05-13ext4: inline: fix len overflow in ext4_prepare_inline_dataThadeu Lima de Souza Cascardo
When running the following code on an ext4 filesystem with inline_data feature enabled, it will lead to the bug below. fd = open("file1", O_RDWR | O_CREAT | O_TRUNC, 0666); ftruncate(fd, 30); pwrite(fd, "a", 1, (1UL << 40) + 5UL); That happens because write_begin will succeed as when ext4_generic_write_inline_data calls ext4_prepare_inline_data, pos + len will be truncated, leading to ext4_prepare_inline_data parameter to be 6 instead of 0x10000000006. Then, later when write_end is called, we hit: BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); at ext4_write_inline_data. Fix it by using a loff_t type for the len parameter in ext4_prepare_inline_data instead of an unsigned int. [ 44.545164] ------------[ cut here ]------------ [ 44.545530] kernel BUG at fs/ext4/inline.c:240! [ 44.545834] Oops: invalid opcode: 0000 [#1] SMP NOPTI [ 44.546172] CPU: 3 UID: 0 PID: 343 Comm: test Not tainted 6.15.0-rc2-00003-g9080916f4863 #45 PREEMPT(full) 112853fcebfdb93254270a7959841d2c6aa2c8bb [ 44.546523] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 44.546523] RIP: 0010:ext4_write_inline_data+0xfe/0x100 [ 44.546523] Code: 3c 0e 48 83 c7 48 48 89 de 5b 41 5c 41 5d 41 5e 41 5f 5d e9 e4 fa 43 01 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc 0f 0b <0f> 0b 0f 1f 44 00 00 55 41 57 41 56 41 55 41 54 53 48 83 ec 20 49 [ 44.546523] RSP: 0018:ffffb342008b79a8 EFLAGS: 00010216 [ 44.546523] RAX: 0000000000000001 RBX: ffff9329c579c000 RCX: 0000010000000006 [ 44.546523] RDX: 000000000000003c RSI: ffffb342008b79f0 RDI: ffff9329c158e738 [ 44.546523] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000 [ 44.546523] R10: 00007ffffffff000 R11: ffffffff9bd0d910 R12: 0000006210000000 [ 44.546523] R13: fffffc7e4015e700 R14: 0000010000000005 R15: ffff9329c158e738 [ 44.546523] FS: 00007f4299934740(0000) GS:ffff932a60179000(0000) knlGS:0000000000000000 [ 44.546523] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 44.546523] CR2: 00007f4299a1ec90 CR3: 0000000002886002 CR4: 0000000000770eb0 [ 44.546523] PKRU: 55555554 [ 44.546523] Call Trace: [ 44.546523] <TASK> [ 44.546523] ext4_write_inline_data_end+0x126/0x2d0 [ 44.546523] generic_perform_write+0x17e/0x270 [ 44.546523] ext4_buffered_write_iter+0xc8/0x170 [ 44.546523] vfs_write+0x2be/0x3e0 [ 44.546523] __x64_sys_pwrite64+0x6d/0xc0 [ 44.546523] do_syscall_64+0x6a/0xf0 [ 44.546523] ? __wake_up+0x89/0xb0 [ 44.546523] ? xas_find+0x72/0x1c0 [ 44.546523] ? next_uptodate_folio+0x317/0x330 [ 44.546523] ? set_pte_range+0x1a6/0x270 [ 44.546523] ? filemap_map_pages+0x6ee/0x840 [ 44.546523] ? ext4_setattr+0x2fa/0x750 [ 44.546523] ? do_pte_missing+0x128/0xf70 [ 44.546523] ? security_inode_post_setattr+0x3e/0xd0 [ 44.546523] ? ___pte_offset_map+0x19/0x100 [ 44.546523] ? handle_mm_fault+0x721/0xa10 [ 44.546523] ? do_user_addr_fault+0x197/0x730 [ 44.546523] ? do_syscall_64+0x76/0xf0 [ 44.546523] ? arch_exit_to_user_mode_prepare+0x1e/0x60 [ 44.546523] ? irqentry_exit_to_user_mode+0x79/0x90 [ 44.546523] entry_SYSCALL_64_after_hwframe+0x55/0x5d [ 44.546523] RIP: 0033:0x7f42999c6687 [ 44.546523] Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00 83 e2 39 83 fa 08 75 de e8 23 ff ff ff [ 44.546523] RSP: 002b:00007ffeae4a7930 EFLAGS: 00000202 ORIG_RAX: 0000000000000012 [ 44.546523] RAX: ffffffffffffffda RBX: 00007f4299934740 RCX: 00007f42999c6687 [ 44.546523] RDX: 0000000000000001 RSI: 000055ea6149200f RDI: 0000000000000003 [ 44.546523] RBP: 00007ffeae4a79a0 R08: 0000000000000000 R09: 0000000000000000 [ 44.546523] R10: 0000010000000005 R11: 0000000000000202 R12: 0000000000000000 [ 44.546523] R13: 00007ffeae4a7ac8 R14: 00007f4299b86000 R15: 000055ea61493dd8 [ 44.546523] </TASK> [ 44.546523] Modules linked in: [ 44.568501] ---[ end trace 0000000000000000 ]--- [ 44.568889] RIP: 0010:ext4_write_inline_data+0xfe/0x100 [ 44.569328] Code: 3c 0e 48 83 c7 48 48 89 de 5b 41 5c 41 5d 41 5e 41 5f 5d e9 e4 fa 43 01 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc 0f 0b <0f> 0b 0f 1f 44 00 00 55 41 57 41 56 41 55 41 54 53 48 83 ec 20 49 [ 44.570931] RSP: 0018:ffffb342008b79a8 EFLAGS: 00010216 [ 44.571356] RAX: 0000000000000001 RBX: ffff9329c579c000 RCX: 0000010000000006 [ 44.571959] RDX: 000000000000003c RSI: ffffb342008b79f0 RDI: ffff9329c158e738 [ 44.572571] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000 [ 44.573148] R10: 00007ffffffff000 R11: ffffffff9bd0d910 R12: 0000006210000000 [ 44.573748] R13: fffffc7e4015e700 R14: 0000010000000005 R15: ffff9329c158e738 [ 44.574335] FS: 00007f4299934740(0000) GS:ffff932a60179000(0000) knlGS:0000000000000000 [ 44.575027] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 44.575520] CR2: 00007f4299a1ec90 CR3: 0000000002886002 CR4: 0000000000770eb0 [ 44.576112] PKRU: 55555554 [ 44.576338] Kernel panic - not syncing: Fatal exception [ 44.576517] Kernel Offset: 0x1a600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) Reported-by: syzbot+fe2a25dae02a207717a0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=fe2a25dae02a207717a0 Fixes: f19d5870cbf7 ("ext4: add normal write support for inline data") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Cc: stable@vger.kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://patch.msgid.link/20250415-ext4-prepare-inline-overflow-v1-1-f4c13d900967@igalia.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-05-13f2fs: don't over-report free space or inodes in statvfsChao Yu
This fixes an analogus bug that was fixed in modern filesystems: a) xfs in commit 4b8d867ca6e2 ("xfs: don't over-report free space or inodes in statvfs") b) ext4 in commit f87d3af74193 ("ext4: don't over-report free space or inodes in statvfs") where statfs can report misleading / incorrect information where project quota is enabled, and the free space is less than the remaining quota. This commit will resolve a test failure in generic/762 which tests for this bug. generic/762 - output mismatch (see /share/git/fstests/results//generic/762.out.bad) --- tests/generic/762.out 2025-04-15 10:21:53.371067071 +0800 +++ /share/git/fstests/results//generic/762.out.bad 2025-05-13 16:13:37.000000000 +0800 @@ -6,8 +6,10 @@ root blocks2 is in range dir blocks2 is in range root bavail2 is in range -dir bavail2 is in range +dir bavail2 has value of 1539066 +dir bavail2 is NOT in range 304734.87 .. 310891.13 root blocks3 is in range ... (Run 'diff -u /share/git/fstests/tests/generic/762.out /share/git/fstests/results//generic/762.out.bad' to see the entire diff) HINT: You _MAY_ be missing kernel fix: XXXXXXXXXXXXXX xfs: don't over-report free space or inodes in statvfs Cc: stable@kernel.org Fixes: ddc34e328d06 ("f2fs: introduce f2fs_statfs_project") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-12f2fs: drop usage of folio_indexKairui Song
folio_index is only needed for mixed usage of page cache and swap cache, for pure page cache usage, the caller can just use folio->index instead. It can't be a swap cache folio here. Swap mapping may only call into fs through `swap_rw` but f2fs does not use that method for swap. Link: https://lkml.kernel.org/r/20250430181052.55698-4-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Chao Yu <chao@kernel.org> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joanne Koong <joannelkoong@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Qu Wenruo <wqu@suse.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12fuse: drop usage of folio_indexKairui Song
Patch series "mm, swap: clean up swap cache mapping helper", v3. This series removes usage of folio_index usage in fs/, and remove swap cache checking in folio_contains. Currently, the swap cache is already no longer directly exposed to fs, and swap cache will be more different from page cache. Clean up the helpers first to simplify the code and eliminate the helpers used for resolving circular header dependency issue between filemap and swap headers, and prepare for further changes. This patch (of 6): folio_index is only needed for mixed usage of page cache and swap cache, for pure page cache usage, the caller can just use folio->index instead. It can't be a swap cache folio here. Swap mapping may only call into fs through `swap_rw` but fuse does not use that method for SWAP. Link: https://lkml.kernel.org/r/20250430181052.55698-1-ryncsn@gmail.com Link: https://lkml.kernel.org/r/20250430181052.55698-2-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Joanne Koong <joannelkoong@gmail.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterba@suse.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Qu Wenruo <wqu@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: abstract initial stack setup to mm subsystemLorenzo Stoakes
There are peculiarities within the kernel where what is very clearly mm code is performed elsewhere arbitrarily. This violates separation of concerns and makes it harder to refactor code to make changes to how fundamental initialisation and operation of mm logic is performed. One such case is the creation of the VMA containing the initial stack upon execve()'ing a new process. This is currently performed in __bprm_mm_init() in fs/exec.c. Abstract this operation to create_init_stack_vma(). This allows us to limit use of vma allocation and free code to fork and mm only. We previously did the same for the step at which we relocate the initial stack VMA downwards via relocate_vma_down(), now we move the initial VMA establishment too. Take the opportunity to also move insert_vm_struct() to mm/vma.c as it's no longer needed anywhere outside of mm. Link: https://lkml.kernel.org/r/118c950ef7a8dd19ab20a23a68c3603751acd30e.1745853549.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12mm: establish mm/vma_exec.c for shared exec/mm VMA functionalityLorenzo Stoakes
Patch series "move all VMA allocation, freeing and duplication logic to mm", v3. Currently VMA allocation, freeing and duplication exist in kernel/fork.c, which is a violation of separation of concerns, and leaves these functions exposed to the rest of the kernel when they are in fact internal implementation details. Resolve this by moving this logic to mm, and making it internal to vma.c, vma.h. This also allows us, in future, to provide userland testing around this functionality. We additionally abstract dup_mmap() to mm, being careful to ensure kernel/fork.c acceses this via the mm internal header so it is not exposed elsewhere in the kernel. As part of this change, also abstract initial stack allocation performed in __bprm_mm_init() out of fs code into mm via the create_init_stack_vma(), as this code uses vm_area_alloc() and vm_area_free(). In order to do so sensibly, we introduce a new mm/vma_exec.c file, which contains the code that is shared by mm and exec. This file is added to both memory mapping and exec sections in MAINTAINERS so both sets of maintainers can maintain oversight. As part of this change, we also move relocate_vma_down() to mm/vma_exec.c so all shared mm/exec functionality is kept in one place. We add code shared between nommu and mmu-enabled configurations in order to share VMA allocation, freeing and duplication code correctly while also keeping these functions available in userland VMA testing. This is achieved by adding a mm/vma_init.c file which is also compiled by the userland tests. This patch (of 4): There is functionality that overlaps the exec and memory mapping subsystems. While it properly belongs in mm, it is important that exec maintainers maintain oversight of this functionality correctly. We can establish both goals by adding a new mm/vma_exec.c file which contains these 'glue' functions, and have fs/exec.c import them. As a part of this change, to ensure that proper oversight is achieved, add the file to both the MEMORY MAPPING and EXEC & BINFMT API, ELF sections. scripts/get_maintainer.pl can correctly handle files in multiple entries and this neatly handles the cross-over. [akpm@linux-foundation.org: fix comment typo] Link: https://lkml.kernel.org/r/80f0d0c6-0b68-47f9-ab78-0ab7f74677fc@lucifer.local Link: https://lkml.kernel.org/r/cover.1745853549.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/91f2cee8f17d65214a9d83abb7011aa15f1ea690.1745853549.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12jfs: implement migrate_folio for jfs_metapage_aopsShivank Garg
Add the missing migrate_folio operation to jfs_metapage_aops to fix warnings during memory compaction. These warnings were introduced by commit 7ee3647243e5 ("migrate: Remove call to ->writepage") which added explicit warnings when filesystems don't implement migrate_folio. System reports following warnings: jfs_metapage_aops does not implement migrate_folio WARNING: CPU: 0 PID: 6870 at mm/migrate.c:955 fallback_migrate_folio mm/migrate.c:953 [inline] WARNING: CPU: 0 PID: 6870 at mm/migrate.c:955 move_to_new_folio+0x70e/0x840 mm/migrate.c:1007 Implement metapage_migrate_folio() which handles both single and multiple metapages per page configurations. [shivankg@amd.com: change comment style] Link: https://lkml.kernel.org/r/1967593d-8084-4a4a-b384-35d5adc54eb4@amd.com [akpm@linux-foundation.org: fix build] [shivankg@amd.com: remove redundant NULL check in __metapage_migrate_folio()] Link: https://lkml.kernel.org/r/a67db238-0ca6-4725-abb2-dc092de87e1b@amd.com Link: https://lkml.kernel.org/r/20250430100150.279751-3-shivankg@amd.com Fixes: 35474d52c605 ("jfs: Convert metapage_writepage to metapage_write_folio") Signed-off-by: Shivank Garg <shivankg@amd.com> Reported-by: syzbot+8bb6fd945af4e0ad9299@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67faff52.050a0220.379d84.001b.GAE@google.com Tested-by: syzbot+8bb6fd945af4e0ad9299@syzkaller.appspotmail.com Cc: Alistair Popple <apopple@nvidia.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Zi Yan <ziy@nvidia.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: kernel test robot <lkp@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12btrfs: add back warning for mount option commit values exceeding 300Kyoji Ogasawara
The Btrfs documentation states that if the commit value is greater than 300 a warning should be issued. The warning was accidentally lost in the new mount API update. Fixes: 6941823cc878 ("btrfs: remove old mount API code") CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Kyoji Ogasawara <sawara04.o@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-12btrfs: fix folio leak in submit_one_async_extent()Boris Burkov
If btrfs_reserve_extent() fails while submitting an async_extent for a compressed write, then we fail to call free_async_extent_pages() on the async_extent and leak its folios. A likely cause for such a failure would be btrfs_reserve_extent() failing to find a large enough contiguous free extent for the compressed extent. I was able to reproduce this by: 1. mount with compress-force=zstd:3 2. fallocating most of a filesystem to a big file 3. fragmenting the remaining free space 4. trying to copy in a file which zstd would generate large compressed extents for (vmlinux worked well for this) Step 4. hits the memory leak and can be repeated ad nauseam to eventually exhaust the system memory. Fix this by detecting the case where we fallback to uncompressed submission for a compressed async_extent and ensuring that we call free_async_extent_pages(). Fixes: 131a821a243f ("btrfs: fallback if compressed IO fails for ENOSPC") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Co-developed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-12btrfs: fix discard worker infinite loop after disabling discardFilipe Manana
If the discard worker is running and there's currently only one block group, that block group is a data block group, it's in the unused block groups discard list and is being used (it got an extent allocated from it after becoming unused), the worker can end up in an infinite loop if a transaction abort happens or the async discard is disabled (during remount or unmount for example). This happens like this: 1) Task A, the discard worker, is at peek_discard_list() and find_next_block_group() returns block group X; 2) Block group X is in the unused block groups discard list (its discard index is BTRFS_DISCARD_INDEX_UNUSED) since at some point in the past it become an unused block group and was added to that list, but then later it got an extent allocated from it, so its ->used counter is not zero anymore; 3) The current transaction is aborted by task B and we end up at __btrfs_handle_fs_error() in the transaction abort path, where we call btrfs_discard_stop(), which clears BTRFS_FS_DISCARD_RUNNING from fs_info, and then at __btrfs_handle_fs_error() we set the fs to RO mode (setting SB_RDONLY in the super block's s_flags field); 4) Task A calls __add_to_discard_list() with the goal of moving the block group from the unused block groups discard list into another discard list, but at __add_to_discard_list() we end up doing nothing because btrfs_run_discard_work() returns false, since the super block has SB_RDONLY set in its flags and BTRFS_FS_DISCARD_RUNNING is not set anymore in fs_info->flags. So block group X remains in the unused block groups discard list; 5) Task A then does a goto into the 'again' label, calls find_next_block_group() again we gets block group X again. Then it repeats the previous steps over and over since there are not other block groups in the discard lists and block group X is never moved out of the unused block groups discard list since btrfs_run_discard_work() keeps returning false and therefore __add_to_discard_list() doesn't move block group X out of that discard list. When this happens we can get a soft lockup report like this: [71.957] watchdog: BUG: soft lockup - CPU#0 stuck for 27s! [kworker/u4:3:97] [71.957] Modules linked in: xfs af_packet rfkill (...) [71.957] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G W 6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa [71.957] Tainted: [W]=WARN [71.957] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014 [71.957] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs] [71.957] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs] [71.957] Code: c1 01 48 83 (...) [71.957] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246 [71.957] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000 [71.957] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad [71.957] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080 [71.957] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860 [71.957] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000 [71.957] FS: 0000000000000000(0000) GS:ffff89707bc00000(0000) knlGS:0000000000000000 [71.957] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [71.957] CR2: 00005605bcc8d2f0 CR3: 000000010376a001 CR4: 0000000000770ef0 [71.957] PKRU: 55555554 [71.957] Call Trace: [71.957] <TASK> [71.957] process_one_work+0x17e/0x330 [71.957] worker_thread+0x2ce/0x3f0 [71.957] ? __pfx_worker_thread+0x10/0x10 [71.957] kthread+0xef/0x220 [71.957] ? __pfx_kthread+0x10/0x10 [71.957] ret_from_fork+0x34/0x50 [71.957] ? __pfx_kthread+0x10/0x10 [71.957] ret_from_fork_asm+0x1a/0x30 [71.957] </TASK> [71.957] Kernel panic - not syncing: softlockup: hung tasks [71.987] CPU: 0 UID: 0 PID: 97 Comm: kworker/u4:3 Tainted: G W L 6.14.2-1-default #1 openSUSE Tumbleweed 968795ef2b1407352128b466fe887416c33af6fa [71.989] Tainted: [W]=WARN, [L]=SOFTLOCKUP [71.989] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014 [71.991] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs] [71.992] Call Trace: [71.993] <IRQ> [71.994] dump_stack_lvl+0x5a/0x80 [71.994] panic+0x10b/0x2da [71.995] watchdog_timer_fn.cold+0x9a/0xa1 [71.996] ? __pfx_watchdog_timer_fn+0x10/0x10 [71.997] __hrtimer_run_queues+0x132/0x2a0 [71.997] hrtimer_interrupt+0xff/0x230 [71.998] __sysvec_apic_timer_interrupt+0x55/0x100 [71.999] sysvec_apic_timer_interrupt+0x6c/0x90 [72.000] </IRQ> [72.000] <TASK> [72.001] asm_sysvec_apic_timer_interrupt+0x1a/0x20 [72.002] RIP: 0010:btrfs_discard_workfn+0xc4/0x400 [btrfs] [72.002] Code: c1 01 48 83 (...) [72.005] RSP: 0018:ffffafaec03efe08 EFLAGS: 00000246 [72.006] RAX: ffff897045500000 RBX: ffff8970413ed8d0 RCX: 0000000000000000 [72.006] RDX: 0000000000000001 RSI: ffff8970413ed8d0 RDI: 0000000a8f1272ad [72.007] RBP: 0000000a9d61c60e R08: ffff897045500140 R09: 8080808080808080 [72.008] R10: ffff897040276800 R11: fefefefefefefeff R12: ffff8970413ed860 [72.009] R13: ffff897045500000 R14: ffff8970413ed868 R15: 0000000000000000 [72.010] ? btrfs_discard_workfn+0x51/0x400 [btrfs 23b01089228eb964071fb7ca156eee8cd3bf996f] [72.011] process_one_work+0x17e/0x330 [72.012] worker_thread+0x2ce/0x3f0 [72.013] ? __pfx_worker_thread+0x10/0x10 [72.014] kthread+0xef/0x220 [72.014] ? __pfx_kthread+0x10/0x10 [72.015] ret_from_fork+0x34/0x50 [72.015] ? __pfx_kthread+0x10/0x10 [72.016] ret_from_fork_asm+0x1a/0x30 [72.017] </TASK> [72.017] Kernel Offset: 0x15000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [72.019] Rebooting in 90 seconds.. So fix this by making sure we move a block group out of the unused block groups discard list when calling __add_to_discard_list(). Fixes: 2bee7eb8bb81 ("btrfs: discard one region at a time in async discard") Link: https://bugzilla.suse.com/show_bug.cgi?id=1242012 CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-12Merge tag 'udf_for_v6.15-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull UDF fix from Jan Kara: "Fix a bug in UDF inode eviction leading to spewing pointless error messages" * tag 'udf_for_v6.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: Make sure i_lenExtents is uptodate on inode eviction
2025-05-12Merge tag 'vfs-6.15-rc7.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Ensure that simple_xattr_list() always includes security.* xattrs - Fix eventpoll busy loop optimization when combined with timeouts - Disable swapon() for devices with block sizes greater than page sizes - Don't call errseq_set() twice during mark_buffer_write_io_error(). Just use mapping_set_error() which takes care to not deference unconditionally * tag 'vfs-6.15-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: Remove redundant errseq_set call in mark_buffer_write_io_error. swapfile: disable swapon for bs > ps devices fs/eventpoll: fix endless busy loop after timeout has expired fs/xattr.c: fix simple_xattr_list to always include security.* xattrs
2025-05-12Merge 6.15-rc6 into driver-core-nextGreg Kroah-Hartman
We need the driver core fix in here as well for testing Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-12crypto: lib/chacha - add strongly-typed state zeroizationEric Biggers
Now that the ChaCha state matrix is strongly-typed, add a helper function chacha_zeroize_state() which zeroizes it. Then convert all applicable callers to use it instead of direct memzero_explicit. No functional changes. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-05-12crypto: lib/chacha - strongly type the ChaCha stateEric Biggers
The ChaCha state matrix is 16 32-bit words. Currently it is represented in the code as a raw u32 array, or even just a pointer to u32. This weak typing is error-prone. Instead, introduce struct chacha_state: struct chacha_state { u32 x[16]; }; Convert all ChaCha and HChaCha functions to use struct chacha_state. No functional changes. Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-05-11nilfs2: do not propagate ENOENT error from nilfs_btree_propagate()Ryusuke Konishi
In preparation for writing logs, in nilfs_btree_propagate(), which makes parent and ancestor node blocks dirty starting from a modified data block or b-tree node block, if the starting block does not belong to the b-tree, i.e. is isolated, nilfs_btree_do_lookup() called within the function fails with -ENOENT. In this case, even though -ENOENT is an internal code, it is propagated to the log writer via nilfs_bmap_propagate() and may be erroneously returned to system calls such as fsync(). Fix this issue by changing the error code to -EINVAL in this case, and having the bmap layer detect metadata corruption and convert the error code appropriately. Link: https://lkml.kernel.org/r/20250428173808.6452-3-konishi.ryusuke@gmail.com Fixes: 1f5abe7e7dbc ("nilfs2: replace BUG_ON and BUG calls triggerable from ioctl") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Wentao Liang <vulab@iscas.ac.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11nilfs2: add pointer check for nilfs_direct_propagate()Wentao Liang
Patch series "nilfs2: improve sanity checks in dirty state propagation". This fixes one missed check for block mapping anomalies and one improper return of an error code during a preparation step for log writing, thereby improving checking for filesystem corruption on writeback. This patch (of 2): In nilfs_direct_propagate(), the printer get from nilfs_direct_get_ptr() need to be checked to ensure it is not an invalid pointer. If the pointer value obtained by nilfs_direct_get_ptr() is NILFS_BMAP_INVALID_PTR, means that the metadata (in this case, i_bmap in the nilfs_inode_info struct) that should point to the data block at the buffer head of the argument is corrupted and the data block is orphaned, meaning that the file system has lost consistency. Add a value check and return -EINVAL when it is an invalid pointer. Link: https://lkml.kernel.org/r/20250428173808.6452-1-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/20250428173808.6452-2-konishi.ryusuke@gmail.com Fixes: 36a580eb489f ("nilfs2: direct block mapping") Signed-off-by: Wentao Liang <vulab@iscas.ac.cn> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11sort.h: hoist cmp_int() into generic header fileFedor Pchelkin
Deduplicate the same functionality implemented in several places by moving the cmp_int() helper macro into linux/sort.h. The macro performs a three-way comparison of the arguments mostly useful in different sorting strategies and algorithms. Link: https://lkml.kernel.org/r/20250427201451.900730-1-pchelkin@ispras.ru Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Suggested-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Kent Overstreet <kent.overstreet@linux.dev> Acked-by: Coly Li <colyli@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Carlos Maiolino <cem@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Coly Li <colyli@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11ocfs2: remove unnecessary NULL check before unregister_sysctl_table()Chen Ni
unregister_sysctl_table() checks for NULL pointers internally. Remove unneeded NULL check here. Link: https://lkml.kernel.org/r/20250422073051.1334310-1-nichen@iscas.ac.cn Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11ocfs2: fix possible memory leak in ocfs2_finish_quota_recoveryMurad Masimov
If ocfs2_finish_quota_recovery() exits due to an error before passing all rc_list elements to ocfs2_recover_local_quota_file() then it can lead to a memory leak as rc_list may still contain elements that have to be freed. Release all memory allocated by ocfs2_add_recovery_chunk() using ocfs2_free_quota_recovery() instead of kfree(). Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Link: https://lkml.kernel.org/r/20250402065628.706359-2-m.masimov@mt-integration.ru Fixes: 2205363dce74 ("ocfs2: Implement quota recovery") Signed-off-by: Murad Masimov <m.masimov@mt-integration.ru> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11ocfs2: simplify return statement in ocfs2_filecheck_attr_store()Thorsten Blum
Don't negate 'ret' and simplify the return statement. No functional changes intended. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11ocfs2: o2net_idle_timer: Rename del_timer_sync in commentWangYuli
Commit 8fa7292fee5c ("treewide: Switch/rename to timer_delete[_sync]()") switched del_timer_sync to timer_delete_sync, but did not modify the comment for o2net_idle_timer(). Now fix it. Link: https://lkml.kernel.org/r/BDDB1E4E2876C36C+20250411102610.165946-1-wangyuli@uniontech.com Signed-off-by: WangYuli <wangyuli@uniontech.com> Acked-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11Squashfs: check return result of sb_min_blocksizePhillip Lougher
Syzkaller reports an "UBSAN: shift-out-of-bounds in squashfs_bio_read" bug. Syzkaller forks multiple processes which after mounting the Squashfs filesystem, issues an ioctl("/dev/loop0", LOOP_SET_BLOCK_SIZE, 0x8000). Now if this ioctl occurs at the same time another process is in the process of mounting a Squashfs filesystem on /dev/loop0, the failure occurs. When this happens the following code in squashfs_fill_super() fails. ---- msblk->devblksize = sb_min_blocksize(sb, SQUASHFS_DEVBLK_SIZE); msblk->devblksize_log2 = ffz(~msblk->devblksize); ---- sb_min_blocksize() returns 0, which means msblk->devblksize is set to 0. As a result, ffz(~msblk->devblksize) returns 64, and msblk->devblksize_log2 is set to 64. This subsequently causes the UBSAN: shift-out-of-bounds in fs/squashfs/block.c:195:36 shift exponent 64 is too large for 64-bit type 'u64' (aka 'unsigned long long') This commit adds a check for a 0 return by sb_min_blocksize(). Link: https://lkml.kernel.org/r/20250409024747.876480-1-phillip@squashfs.org.uk Fixes: 0aa666190509 ("Squashfs: super block operations") Reported-by: syzbot+65761fc25a137b9c8c6e@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67f0dd7a.050a0220.0a13.0230.GAE@google.com/ Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11proc: fix the issue of proc_mem_open returning NULLPenglei Jiang
proc_mem_open() can return an errno, NULL, or mm_struct*. If it fails to acquire mm, it returns NULL, but the caller does not check for the case when the return value is NULL. The following conditions lead to failure in acquiring mm: - The task is a kernel thread (PF_KTHREAD) - The task is exiting (PF_EXITING) Changes: - Add documentation comments for the return value of proc_mem_open(). - Add checks in the caller to return -ESRCH when proc_mem_open() returns NULL. Link: https://lkml.kernel.org/r/20250404063357.78891-1-superman.xpt@gmail.com Reported-by: syzbot+f9238a0a31f9b5603fef@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000f52642060d4e3750@google.com Signed-off-by: Penglei Jiang <superman.xpt@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Adrian Ratiu <adrian.ratiu@collabora.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Felix Moessbauer <felix.moessbauer@siemens.com> Cc: Jeff layton <jlayton@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11fs/proc/page: refactor to reduce code duplicationLiu Ye
kpageflags_read() and kpagecgroup_read() are quite similar to kpagecount_read(). Refactor common code into a helper function to reduce code duplication. Link: https://lkml.kernel.org/r/20250318063226.223284-1-liuyerd@163.com Signed-off-by: Liu Ye <liuye@kylinos.cn> Acked-by: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Svetly Todorov <svetly.todorov@memverge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regionsAndrei Vagin
Patch series "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions", v2. Introduce the PAGE_IS_GUARD flag in the PAGEMAP_SCAN ioctl to expose information about guard regions. This allows userspace tools, such as CRIU, to detect and handle guard regions. Currently, CRIU utilizes PAGEMAP_SCAN as a more efficient alternative to parsing /proc/pid/pagemap. Without this change, guard regions are incorrectly reported as swap-anon regions, leading CRIU to attempt dumping them and subsequently failing. The series includes updates to the documentation and selftests to reflect the new functionality. This patch (of 3): Introduce the PAGE_IS_GUARD flag in the PAGEMAP_SCAN ioctl to expose information about guard regions. This allows userspace tools, such as CRIU, to detect and handle guard regions. Link: https://lkml.kernel.org/r/20250324065328.107678-1-avagin@google.com Link: https://lkml.kernel.org/r/20250324065328.107678-2-avagin@google.com Signed-off-by: Andrei Vagin <avagin@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11mm: add folio_mk_pmd()Matthew Wilcox (Oracle)
Removes five conversions from folio to page. Also removes both callers of mk_pmd() that aren't part of mk_huge_pmd(), getting us a step closer to removing the confusion between mk_pmd(), mk_huge_pmd() and pmd_mkhuge(). Link: https://lkml.kernel.org/r/20250402181709.2386022-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Muchun Song <muchun.song@linux.dev> Cc: Richard Weinberger <richard@nod.at> Cc: <x86@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11nfsd: remove legacy dprintks from GETATTR and STATFS codepathsJeff Layton
Observability here is now covered by static tracepoints. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-05-11nfsd: remove legacy READDIR dprintksJeff Layton
Observability here is now covered by static tracepoints. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>