path: root/fs
Age  Commit message  Author
2025-11-26  gfs2: Clean up properly during a withdraw  Andreas Gruenbacher
During a withdraw, we don't want to write out any more data than we have to, so in do_xmote(), skip the ->go_sync() glock operation. We still want to keep calling ->go_inval() to discard any cached data or metadata, whether clean or dirty. We do still allow glocks to transition into state LM_ST_UNLOCKED. This has the desired side effect of calling ->go_inval() and invalidating the glock caches. Function gfs2_withdraw_glocks() is already used for dequeuing any left-over waiters. We still want that to happen, but additionally, we want all glocks to be unlocked. Finally, we change function do_promote() to refuse any further promotions. This commit cleans up the leftovers of commit 86934198eefa ("gfs2: Clear flags when withdraw prevents xmote"). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26  gfs2: Rename gfs2_{gl_dq_holders => withdraw_glocks}  Andreas Gruenbacher
Rename function gfs2_gl_dq_holders() to gfs2_withdraw_glocks(). This function will soon be used for more than just dequeuing holders. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26  Revert "gfs2: fix infinite loop when checking ail item count before go_inval"  Andreas Gruenbacher
The current withdraw code duplicates the journal recovery code gfs2 already has for dealing with node failures, and it does so poorly. That code was added because when releasing a lockspace, we didn't have a way to indicate that the lockspace needs recovery. We now do have this feature, so the current withdraw code can be removed almost entirely. This is one of several steps towards that. Reverts commit 33dbd1e41a1d ("gfs2: fix infinite loop when checking ail item count before go_inval"). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26  Revert "gfs2: Allow some glocks to be used during withdraw"  Andreas Gruenbacher
The current withdraw code duplicates the journal recovery code gfs2 already has for dealing with node failures, and it does so poorly. That code was added because when releasing a lockspace, we didn't have a way to indicate that the lockspace needs recovery. We now do have this feature, so the current withdraw code can be removed almost entirely. This is one of several steps towards that. Reverts commit a72d2401f54b ("gfs2: Allow some glocks to be used during withdraw"). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26  Revert "gfs2: Check for log write errors before telling dlm to unlock"  Andreas Gruenbacher
The current withdraw code duplicates the journal recovery code gfs2 already has for dealing with node failures, and it does so poorly. That code was added because when releasing a lockspace, we didn't have a way to indicate that the lockspace needs recovery. We now do have this feature, so the current withdraw code can be removed almost entirely. This is one of several steps towards that. Reverts the rest of d93ae386ef3d ("gfs2: Check for log write errors before telling dlm to unlock"). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26  Revert "gfs2: fix a deadlock on withdraw-during-mount"  Andreas Gruenbacher
The current withdraw code duplicates the journal recovery code gfs2 already has for dealing with node failures, and it does so poorly. That code was added because when releasing a lockspace, we didn't have a way to indicate that the lockspace needs recovery. We now do have this feature, so the current withdraw code can be removed almost entirely. This is one of several steps towards that. Reverts commit 865cc3e9cc0b ("gfs2: fix a deadlock on withdraw-during-mount"). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26  Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (6/6)  Andreas Gruenbacher
The current withdraw code duplicates the journal recovery code gfs2 already has for dealing with node failures, and it does so poorly. That code was added because when releasing a lockspace, we didn't have a way to indicate that the lockspace needs recovery. We now do have this feature, so the current withdraw code can be removed almost entirely. This is one of several steps towards that. Reverts parts of commit 601ef0d52e96 ("gfs2: Force withdraw to replay journals and wait for it to finish"). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26  Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (5/6)  Andreas Gruenbacher
The current withdraw code duplicates the journal recovery code gfs2 already has for dealing with node failures, and it does so poorly. That code was added because when releasing a lockspace, we didn't have a way to indicate that the lockspace needs recovery. We now do have this feature, so the current withdraw code can be removed almost entirely. This is one of several steps towards that. Reverts parts of commit 601ef0d52e96 ("gfs2: Force withdraw to replay journals and wait for it to finish"). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26  Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (4/6)  Andreas Gruenbacher
The current withdraw code duplicates the journal recovery code gfs2 already has for dealing with node failures, and it does so poorly. That code was added because when releasing a lockspace, we didn't have a way to indicate that the lockspace needs recovery. We now do have this feature, so the current withdraw code can be removed almost entirely. This is one of several steps towards that. Reverts parts of commit 601ef0d52e96 ("gfs2: Force withdraw to replay journals and wait for it to finish"). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26  Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (3/6)  Andreas Gruenbacher
The current withdraw code duplicates the journal recovery code gfs2 already has for dealing with node failures, and it does so poorly. That code was added because when releasing a lockspace, we didn't have a way to indicate that the lockspace needs recovery. We now do have this feature, so the current withdraw code can be removed almost entirely. This is one of several steps towards that. Reverts parts of commit 601ef0d52e96 ("gfs2: Force withdraw to replay journals and wait for it to finish"). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26  Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (2/6)  Andreas Gruenbacher
The current withdraw code duplicates the journal recovery code gfs2 already has for dealing with node failures, and it does so poorly. That code was added because when releasing a lockspace, we didn't have a way to indicate that the lockspace needs recovery. We now do have this feature, so the current withdraw code can be removed almost entirely. This is one of several steps towards that. Reverts parts of commit 601ef0d52e96 ("gfs2: Force withdraw to replay journals and wait for it to finish"). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26  Revert "gfs2: Force withdraw to replay journals and wait for it to finish" (1/6)  Andreas Gruenbacher
The current withdraw code duplicates the journal recovery code gfs2 already has for dealing with node failures, and it does so poorly. That code was added because when releasing a lockspace, we didn't have a way to indicate that the lockspace needs recovery. We now do have this feature, so the current withdraw code can be removed almost entirely. This is one of several steps towards that. Reverts parts of commit 601ef0d52e96 ("gfs2: Force withdraw to replay journals and wait for it to finish"). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26  Revert "gfs2: don't stop reads while withdraw in progress"  Andreas Gruenbacher
The current withdraw code duplicates the journal recovery code gfs2 already has for dealing with node failures, and it does so poorly. That code was added because when releasing a lockspace, we didn't have a way to indicate that the lockspace needs recovery. We now do have this feature, so the current withdraw code can be removed almost entirely. This is one of several steps towards that. The withdrawing node has no role in recovering from the withdraw anymore, so it also no longer needs to read metadata blocks after a withdraw. We now only need to set a single bit in gfs2_withdraw(), so switch from try_cmpxchg() to test_and_set_bit(). Reverts commit 8cc67f704f4b ("gfs2: don't stop reads while withdraw in progress"). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
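The move from try_cmpxchg() to test_and_set_bit() fits the case where a single bit records the state and the caller only needs to learn whether it was the first to set it. A minimal userspace sketch of that pattern with C11 atomics, assuming a hypothetical `WITHDRAWN_BIT` and helper `mark_withdrawn()` (neither is taken from gfs2):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit standing in for a "withdrawn" state flag. */
#define WITHDRAWN_BIT 0

/*
 * Atomically set the bit and report whether this caller was the first
 * to do so. atomic_fetch_or() returns the previous value, which carries
 * the same information the kernel's test_and_set_bit() returns.
 */
static bool mark_withdrawn(atomic_ulong *flags)
{
	unsigned long mask = 1UL << WITHDRAWN_BIT;
	unsigned long old = atomic_fetch_or(flags, mask);

	return !(old & mask);	/* true only for the first caller */
}
```

Only the first caller sees `true`, so only that caller would go on to start the withdraw; later callers can return immediately.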
2025-11-26  gfs2: Rename LM_FLAG_{NOEXP -> RECOVER}  Andreas Gruenbacher
GFS sets the LM_FLAG_NOEXP flag on locking requests it makes during journal recovery, so rename the flag to LM_FLAG_RECOVER for improved code readability. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26  gfs2: Kill gfs2_io_error_bh_wd  Andreas Gruenbacher
All callers of gfs2_io_error_bh() call gfs2_withdraw() as well, so change gfs2_io_error_bh() to call gfs2_withdraw() directly. This also brings it in line with other similar error reporting functions. With that, gfs2_io_error_bh() is the same as gfs2_io_error_bh_wd(), so remove the latter. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26  gfs2: Withdraw immediately on log write errors  Andreas Gruenbacher
Now that gfs2_withdraw() is asynchronous, immediately withdraw when a log write error is detected. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26  gfs2: Rename gfs2_{withdrawing_or_ => }withdrawn  Andreas Gruenbacher
With delayed withdraws and the SDF_WITHDRAWING flag gone, we can now rename gfs2_withdrawing_or_withdrawn() back to gfs2_withdrawn(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26  gfs2: Get rid of delayed withdraws  Andreas Gruenbacher
Now that gfs2_withdraw() is asynchronous, it can be called in any context and there is no more need for gfs2_withdraw_delayed() or for turning delayed withdraws into actual withdraws. Remove the now-obsolete code. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26  gfs2: Asynchronous withdraw  Andreas Gruenbacher
So far, withdraws are carried out in the context of the calling task. When another task tries to withdraw while a withdraw is already underway, that task blocks as well. Change that to carry out withdraws asynchronously in workqueue context and don't block the task triggering the withdraw anymore. Fixes: syzbot+6b156e132970e550194c@syzkaller.appspotmail.com Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26  gfs2: Add clean argument to lm_unmount hook  Andreas Gruenbacher
Add a 'clean' argument to ->lm_unmount() that indicates whether the filesystem is clean or needs recovery. Set clean to true for normal unmounts, and to false for withdraws. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26  gfs2: Clean up quotad timeout handling  Andreas Gruenbacher
Instead of tracking the remaining time, track the deadline of each of the timeouts. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
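Tracking a deadline rather than the remaining time means each timeout stores one absolute value that is compared against the current time, instead of a counter that has to be decremented after every sleep. A minimal sketch of that idea, assuming a hypothetical `struct periodic` and plain integer time without wraparound handling (this is not the gfs2_quotad code):

```c
#include <stdint.h>

/* Hypothetical periodic timeout tracked by its absolute deadline. */
struct periodic {
	uint64_t deadline;	/* absolute time of the next run */
	uint64_t interval;	/* period between runs */
};

/* Returns 1 and re-arms the timeout if the deadline has passed. */
static int periodic_expired(struct periodic *p, uint64_t now)
{
	if (now < p->deadline)
		return 0;
	p->deadline = now + p->interval;
	return 1;
}

/* Time to sleep until the earlier of two deadlines (0 if already due). */
static uint64_t next_sleep(const struct periodic *a,
			   const struct periodic *b, uint64_t now)
{
	uint64_t earliest = a->deadline < b->deadline ? a->deadline : b->deadline;

	return earliest > now ? earliest - now : 0;
}
```

With two timeouts, the sleep time falls out of the two deadlines directly, and waking up early or late does not skew either period.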
2025-11-26  gfs2: Fix "gfs2: Switch to wait_event in gfs2_quotad"  Andreas Gruenbacher
Commit e4a8b5481c59a ("gfs2: Switch to wait_event in gfs2_quotad") broke cyclic statfs syncing, so the numbers reported by "df" could easily get completely out of sync with reality. Fix this by reverting part of commit e4a8b5481c59a for now. A follow-up commit will clean this code up later. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26  gfs2: Minor cosmetic remote delete cleanups  Andreas Gruenbacher
Rename gfs2_try_evict() to gfs2_try_to_evict(). The GIF_DEFER_DELETE flag has been superseded by the GLF_DEFER_DELETE flag, so fix a left-over comment. Add a clarifying comment to delete_work_func(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26  gfs2: fix remote evict for read-only filesystems  Andreas Gruenbacher
When a node tries to delete an inode, it first requests exclusive access to the iopen glock. This triggers demote requests on all remote nodes currently holding the iopen glock. To satisfy those requests, the remote nodes evict the inode in question, or they poke the corresponding inode glock to signal that the inode is still in active use. This behavior doesn't depend on whether or not a filesystem is read-only, so remove the incorrect read-only check. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-11-26  libceph: drop started parameter of __ceph_open_session()  Ilya Dryomov
With the previous commit revamping the timeout handling, started isn't used anymore. It could be taken into account by adjusting the initial value of the timeout, but there is little point as both callers capture the timestamp shortly before calling __ceph_open_session() -- the only thing of note that happens in the interim is taking client->mount_mutex and that isn't expected to take multiple seconds. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
2025-11-26  ext4: align max orphan file size with e2fsprogs limit  Baokun Li
Kernel commit 0a6ce20c1564 ("ext4: verify orphan file size is not too big") limits the maximum supported orphan file size to 8 << 20 bytes. However, e2fsprogs sets the orphan file size to 32–512 filesystem blocks when creating a filesystem. With a 64k block size, formatting an ext4 filesystem larger than 32G produces an orphan file bigger than the kernel allows, so the mount prints an error and fails:
EXT4-fs (vdb): orphan file too big: 8650752
EXT4-fs (vdb): mount failed
To prevent this issue and allow previously created 64KB filesystems to mount, we update the maximum allowed orphan file size in the kernel to 512 filesystem blocks. Fixes: 0a6ce20c1564 ("ext4: verify orphan file size is not too big") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251120134233.2994147-1-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
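The numbers in the error message can be checked by hand: 8650752 bytes is 132 blocks of 65536 bytes, which exceeds 8 << 20 = 8388608 bytes but is well under 512 blocks. A small sketch of the two limits follows; the helper names and the exact comparison operators are assumptions for illustration, not the kernel's code:

```c
#include <stdbool.h>
#include <stdint.h>

#define OLD_MAX_BYTES	(8u << 20)	/* old limit from the commit text */
#define MAX_BLOCKS	512u		/* new limit from the commit text */

/* Old rule: a fixed byte limit regardless of block size. */
static bool old_limit_ok(uint64_t size_bytes)
{
	return size_bytes <= OLD_MAX_BYTES;
}

/* New rule: the limit scales with the filesystem block size. */
static bool new_limit_ok(uint64_t size_bytes, uint32_t block_size)
{
	return size_bytes <= (uint64_t)MAX_BLOCKS * block_size;
}
```

For the reported filesystem, `old_limit_ok(8650752)` is false while `new_limit_ok(8650752, 65536)` is true.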
2025-11-26  fs/ext4: fix typo in comment  Haodong Tian
Correct 'metdata' -> 'metadata' in comment. Signed-off-by: Haodong Tian <tianhd25@mails.tsinghua.edu.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Message-ID: <20251112155916.3007639-1-tianhd25@mails.tsinghua.edu.cn> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-26  ext4: correct the comments place for EXT4_EXT_MAY_ZEROOUT  Yang Erkun
Move the comments just before we set EXT4_EXT_MAY_ZEROOUT in ext4_split_convert_extents. Signed-off-by: Yang Erkun <yangerkun@huawei.com> Message-ID: <20251112084538.1658232-4-yangerkun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-26  ext4: cleanup for ext4_map_blocks  Yang Erkun
A return value from ext4_map_create_blocks() that reports created blocks means blocks really were created; this cannot happen with m_flags lacking EXT4_MAP_UNWRITTEN and EXT4_MAP_MAPPED. Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Yang Erkun <yangerkun@huawei.com> Message-ID: <20251112084538.1658232-3-yangerkun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-26  ext4: rename EXT4_GET_BLOCKS_PRE_IO  Yang Erkun
This flag has been generalized: it is used to split an unwritten extent when we do dio or dioread_nolock writeback, and to avoid merging new extents that were created by an extent split. Update some related comments too. Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Yang Erkun <yangerkun@huawei.com> Message-ID: <20251112084538.1658232-2-yangerkun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-26  ext4: improve integrity checking in __mb_check_buddy by enhancing order-0 validation  Yongjian Sun
When the MB_CHECK_ASSERT macro is enabled, we found that the current validation logic in __mb_check_buddy has a gap in detecting certain invalid buddy states, particularly related to order-0 (bitmap) bits. The original logic consists of three steps:
1. Validates higher-order buddies: if a higher-order bit is set, at most one of the two corresponding lower-order bits may be free; if a higher-order bit is clear, both lower-order bits must be allocated (and their bitmap bits must be 0).
2. For any set bit in order-0, ensures all corresponding higher-order bits are not free.
3. Verifies that all preallocated blocks (pa) in the group have pa_pstart within bounds and their bitmap bits marked as allocated.
However, this approach fails to properly validate cases where order-0 bits are incorrectly cleared (0), allowing some invalid configurations to pass:

             corrupt               integral
order 3         1                     1
order 2     1       1             1       1
order 1   1   1   1   1         1   1   1   1
order 0  0 0 1 1 1 1 1 1       1 1 1 1 1 1 1 1

In the corrupt case (left), two adjacent blocks are free at order-0 but the higher-order state is inconsistent with that; the right side shows a correct state. The root cause is insufficient validation of order-0 zero bits. To fix this and improve completeness without significant performance cost, we refine the logic:
1. Maintain the top-down higher-order validation, but no longer check the cases where the higher-order bit is 0, as this case will be covered in step 2.
2. Enhance order-0 checking by examining pairs of bits:
   - If either bit in a pair is set (1), all corresponding higher-order bits must not be free.
   - If both bits are clear (0), then exactly one of the corresponding higher-order bits must be free.
3. Keep the preallocation (pa) validation unchanged.
This change closes the validation gap, ensuring illegal buddy states involving order-0 are correctly detected, while removing redundant checks and maintaining efficiency.
Fixes: c9de560ded61f ("ext4: Add multi block allocator for ext4") Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251106060614.631382-3-sunyongjian@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
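The pair rule from step 2 of the refined logic can be sketched on a toy structure. The layout below (one byte per bit, one array per order, 1 meaning "not free") is an assumption chosen for readability and is not how the kernel stores buddy bitmaps; `toy_buddy` and `order0_pairs_ok()` are hypothetical names:

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_ORDER 3	/* orders 0..3, as in the diagram */
#define NBLOCKS   8	/* order-0 bits per example */

/* Toy layout: bits[k][i] is bit i of order k; 1 means "not free". */
struct toy_buddy {
	unsigned char bits[MAX_ORDER + 1][NBLOCKS];
};

/* Check every order-0 pair against the higher-order bits covering it. */
static bool order0_pairs_ok(const struct toy_buddy *b)
{
	for (size_t pair = 0; pair < NBLOCKS / 2; pair++) {
		bool any_set = b->bits[0][2 * pair] || b->bits[0][2 * pair + 1];
		int free_above = 0;

		/* Order k >= 1 covers this pair with bit (pair >> (k - 1)). */
		for (int k = 1; k <= MAX_ORDER; k++)
			if (!b->bits[k][pair >> (k - 1)])
				free_above++;

		if (any_set && free_above != 0)
			return false;	/* used block under a free buddy */
		if (!any_set && free_above != 1)
			return false;	/* free pair needs exactly one free buddy */
	}
	return true;
}
```

Applied to the "corrupt" column of the diagram, the first pair is free at order-0 but no higher-order bit is free, so the check fails; clearing the covering order-1 bit makes the state consistent.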
2025-11-26  ext4: fix incorrect group number assertion in mb_check_buddy  Yongjian Sun
When the MB_CHECK_ASSERT macro is enabled, an assertion failure can occur in __mb_check_buddy when checking preallocated blocks (pa) in a block group:
Assertion failure in mb_free_blocks() : "groupnr == e4b->bd_group"
This happens when a pa at the very end of a block group (e.g., pa_pstart=32765, pa_len=3 in a group of 32768 blocks) becomes exhausted - its pa_pstart is advanced by pa_len to 32768, which lies in the next block group. If this exhausted pa (with pa_len == 0) is still in the bb_prealloc_list during the buddy check, the assertion incorrectly flags it as belonging to the wrong group. A possible sequence is as follows:
ext4_mb_new_blocks
  ext4_mb_release_context
    pa->pa_pstart += EXT4_C2B(sbi, ac->ac_b_ex.fe_len)
    pa->pa_len -= ac->ac_b_ex.fe_len
__mb_check_buddy
  for each pa in group
    ext4_get_group_no_and_offset
    MB_CHECK_ASSERT(groupnr == e4b->bd_group)
To fix this, we modify the check to skip block group validation for exhausted preallocations (where pa_len == 0). Such entries are in a transitional state and will be removed from the list soon, so they should not trigger an assertion. This change prevents the false positive while maintaining the integrity of the checks for active allocations. Fixes: c9de560ded61f ("ext4: Add multi block allocator for ext4") Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251106060614.631382-2-sunyongjian@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
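The example from the commit text (pa_pstart=32765, pa_len=3, 32768 blocks per group) can be replayed in a few lines. This is a simplified sketch that assumes the first data block is block 0 and a cluster ratio of 1; `toy_pa`, `group_of()`, `pa_use()` and `pa_group_ok()` are hypothetical names:

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCKS_PER_GROUP 32768u	/* group size used in the commit's example */

/* Toy preallocation; field names mirror the commit text. */
struct toy_pa {
	uint64_t pa_pstart;	/* first remaining physical block */
	uint32_t pa_len;	/* remaining length, 0 once exhausted */
};

/* Group number of a block, assuming the first data block is block 0. */
static uint32_t group_of(uint64_t block)
{
	return (uint32_t)(block / BLOCKS_PER_GROUP);
}

/* Consume len blocks from the preallocation. */
static void pa_use(struct toy_pa *pa, uint32_t len)
{
	pa->pa_pstart += len;
	pa->pa_len -= len;
}

/* Group check that skips exhausted preallocations. */
static bool pa_group_ok(const struct toy_pa *pa, uint32_t group)
{
	if (pa->pa_len == 0)
		return true;	/* transitional entry, about to be removed */
	return group_of(pa->pa_pstart) == group;
}
```

After `pa_use()` consumes the last three blocks, `pa_pstart` is 32768 and `group_of()` reports group 1, which is why an unconditional group comparison would fire for an entry that still sits on group 0's list.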
2025-11-26  ext4: add i_data_sem protection in ext4_destroy_inline_data_nolock()  Alexey Nepomnyashih
Fix a race between inline data destruction and block mapping. The function ext4_destroy_inline_data_nolock() changes the inode data layout by clearing EXT4_INODE_INLINE_DATA and setting EXT4_INODE_EXTENTS. At the same time, another thread may execute ext4_map_blocks(), which tests EXT4_INODE_EXTENTS to decide whether to call ext4_ext_map_blocks() or ext4_ind_map_blocks(). Without i_data_sem protection, ext4_ind_map_blocks() may receive an inode with the EXT4_INODE_EXTENTS flag set and trigger an assertion:
kernel BUG at fs/ext4/indirect.c:546!
EXT4-fs (loop2): unmounting filesystem.
invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
RIP: 0010:ext4_ind_map_blocks.cold+0x2b/0x5a fs/ext4/indirect.c:546
Call Trace:
 <TASK>
 ext4_map_blocks+0xb9b/0x16f0 fs/ext4/inode.c:681
 _ext4_get_block+0x242/0x590 fs/ext4/inode.c:822
 ext4_block_write_begin+0x48b/0x12c0 fs/ext4/inode.c:1124
 ext4_write_begin+0x598/0xef0 fs/ext4/inode.c:1255
 ext4_da_write_begin+0x21e/0x9c0 fs/ext4/inode.c:3000
 generic_perform_write+0x259/0x5d0 mm/filemap.c:3846
 ext4_buffered_write_iter+0x15b/0x470 fs/ext4/file.c:285
 ext4_file_write_iter+0x8e0/0x17f0 fs/ext4/file.c:679
 call_write_iter include/linux/fs.h:2271 [inline]
 do_iter_readv_writev+0x212/0x3c0 fs/read_write.c:735
 do_iter_write+0x186/0x710 fs/read_write.c:861
 vfs_iter_write+0x70/0xa0 fs/read_write.c:902
 iter_file_splice_write+0x73b/0xc90 fs/splice.c:685
 do_splice_from fs/splice.c:763 [inline]
 direct_splice_actor+0x10f/0x170 fs/splice.c:950
 splice_direct_to_actor+0x33a/0xa10 fs/splice.c:896
 do_splice_direct+0x1a9/0x280 fs/splice.c:1002
 do_sendfile+0xb13/0x12c0 fs/read_write.c:1255
 __do_sys_sendfile64 fs/read_write.c:1323 [inline]
 __se_sys_sendfile64 fs/read_write.c:1309 [inline]
 __x64_sys_sendfile64+0x1cf/0x210 fs/read_write.c:1309
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81
 entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Fixes: c755e251357a ("ext4: fix deadlock between inline_data and ext4_expand_extra_isize_ea()") Cc: stable@vger.kernel.org # v4.11+ Signed-off-by: Alexey Nepomnyashih <sdl@nppct.ru> Message-ID: <20251104093326.697381-1-sdl@nppct.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-26  ext4: clear i_state_flags when alloc inode  Haibo Chen
i_state_flags is used on 32-bit architectures and needs to be cleared when an inode is allocated. The issue was found when unmounting ext4: the inode was sometimes accidentally tracked as an orphan, which caused ext4 to dump error messages. Fixes: acf943e9768e ("ext4: fix checks for orphan inodes") Signed-off-by: Haibo Chen <haibo.chen@nxp.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251104-ext4-v1-1-73691a0800f9@nxp.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2025-11-26  jbd2: fix the inconsistency between checksum and data in memory for journal sb  Ye Bin
Copying the file system while it is mounted as read-only results in a mount failure:
[~]# mkfs.ext4 -F /dev/sdc
[~]# mount /dev/sdc -o ro /mnt/test
[~]# dd if=/dev/sdc of=/dev/sda bs=1M
[~]# mount /dev/sda /mnt/test1
[ 1094.849826] JBD2: journal checksum error
[ 1094.850927] EXT4-fs (sda): Could not load journal inode
mount: mount /dev/sda on /mnt/test1 failed: Bad message
The process described above is just an abstracted way I came up with to reproduce the issue. In the actual scenario, the file system was mounted read-only and then copied while it was still mounted, and mounting the copy failed. The user intended to verify the data or use it as a backup, and this action was performed during a version upgrade. The issue may happen as follows:
ext4_fill_super
 set_journal_csum_feature_set(sb)
  if (ext4_has_metadata_csum(sb))
   incompat = JBD2_FEATURE_INCOMPAT_CSUM_V3;
  if (test_opt(sb, JOURNAL_CHECKSUM))
   jbd2_journal_set_features(sbi->s_journal, compat, 0, incompat);
    lock_buffer(journal->j_sb_buffer);
    sb->s_feature_incompat |= cpu_to_be32(incompat);
    //The data in the journal sb was modified, but the checksum was not updated, so the data remaining in memory has a mismatch between the data and the checksum.
    unlock_buffer(journal->j_sb_buffer);
In this case, the journal sb copied over is in a state where the checksum and data are inconsistent, so mounting fails. To solve the above issue, update the checksum in memory after modifying the journal sb. Fixes: 4fd5ea43bc11 ("jbd2: checksum journal superblock") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251103010123.3753631-1-yebin@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
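The failure mode is general: whenever checksummed data is modified in memory without recomputing the stored checksum, any copy taken in between fails verification. A toy sketch of the buggy and fixed patterns, assuming a hypothetical `toy_sb` structure and a placeholder checksum function that is not the real journal checksum:

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy superblock: one feature word plus a checksum covering it. */
struct toy_sb {
	uint32_t feature_incompat;
	uint32_t checksum;
};

/* Placeholder checksum for illustration only. */
static uint32_t toy_csum(const struct toy_sb *sb)
{
	return sb->feature_incompat * 2654435761u + 1u;
}

static bool toy_sb_valid(const struct toy_sb *sb)
{
	return sb->checksum == toy_csum(sb);
}

/* Buggy pattern: the data changes but the stored checksum does not. */
static void set_features_stale(struct toy_sb *sb, uint32_t incompat)
{
	sb->feature_incompat |= incompat;
}

/* Fixed pattern: recompute the checksum right after the change. */
static void set_features(struct toy_sb *sb, uint32_t incompat)
{
	sb->feature_incompat |= incompat;
	sb->checksum = toy_csum(sb);
}
```

A reader that snapshots the structure after `set_features_stale()` sees data and checksum that no longer match, which corresponds to the "journal checksum error" in the reproduction.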
2025-11-26  ext4: check if mount_opts is NUL-terminated in ext4_ioctl_set_tune_sb()  Fedor Pchelkin
params.mount_opts may arrive as a string that is not NUL-terminated, although userspace is expected to pass a NUL-terminated string. Add an extra check to ensure this holds true. Note that the code that follows uses strscpy_pad(), so this check only serves to properly inform the user that incorrect data was provided. Found by Linux Verification Center (linuxtesting.org). Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251101160430.222297-2-pchelkin@ispras.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2025-11-26  ext4: fix string copying in parse_apply_sb_mount_options()  Fedor Pchelkin
strscpy_pad() can't be used to copy a non-NUL-term string into a NUL-term string of possibly bigger size. Commit 0efc5990bca5 ("string.h: Introduce memtostr() and memtostr_pad()") provides additional information in that regard. So if this happens, the following warning is observed: strnlen: detected buffer overflow: 65 byte read of buffer size 64 WARNING: CPU: 0 PID: 28655 at lib/string_helpers.c:1032 __fortify_report+0x96/0xc0 lib/string_helpers.c:1032 Modules linked in: CPU: 0 UID: 0 PID: 28655 Comm: syz-executor.3 Not tainted 6.12.54-syzkaller-00144-g5f0270f1ba00 #0 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:__fortify_report+0x96/0xc0 lib/string_helpers.c:1032 Call Trace: <TASK> __fortify_panic+0x1f/0x30 lib/string_helpers.c:1039 strnlen include/linux/fortify-string.h:235 [inline] sized_strscpy include/linux/fortify-string.h:309 [inline] parse_apply_sb_mount_options fs/ext4/super.c:2504 [inline] __ext4_fill_super fs/ext4/super.c:5261 [inline] ext4_fill_super+0x3c35/0xad00 fs/ext4/super.c:5706 get_tree_bdev_flags+0x387/0x620 fs/super.c:1636 vfs_get_tree+0x93/0x380 fs/super.c:1814 do_new_mount fs/namespace.c:3553 [inline] path_mount+0x6ae/0x1f70 fs/namespace.c:3880 do_mount fs/namespace.c:3893 [inline] __do_sys_mount fs/namespace.c:4103 [inline] __se_sys_mount fs/namespace.c:4080 [inline] __x64_sys_mount+0x280/0x300 fs/namespace.c:4080 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x64/0x140 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e Since userspace is expected to provide s_mount_opts field to be at most 63 characters long with the ending byte being NUL-term, use a 64-byte buffer which matches the size of s_mount_opts, so that strscpy_pad() does its job properly. Return with error if the user still managed to provide a non-NUL-term string here. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. 
Fixes: 8ecb790ea8c3 ("ext4: avoid potential buffer over-read in parse_apply_sb_mount_options()") Cc: stable@vger.kernel.org Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251101160430.222297-1-pchelkin@ispras.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
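Checking that a fixed-size field holds a NUL-terminated string amounts to looking for a NUL byte within the field's bounds. A minimal userspace sketch with standard C, assuming a 64-byte field as described in the text; `MOUNT_OPTS_LEN` and `is_nul_terminated()` are hypothetical names, not kernel identifiers:

```c
#include <stdbool.h>
#include <string.h>

#define MOUNT_OPTS_LEN 64	/* field size described in the commit text */

/*
 * Returns true if a NUL byte occurs within the first MOUNT_OPTS_LEN
 * bytes. memchr() never reads past the given length, so this is safe
 * even when the buffer has no terminator at all.
 */
static bool is_nul_terminated(const char *buf)
{
	return memchr(buf, '\0', MOUNT_OPTS_LEN) != NULL;
}
```

A caller would reject the input with an error when this returns false, and only then copy the string into a terminated buffer.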
2025-11-26  jbd2: store more accurate errno in superblock when possible  Wengang Wang
When jbd2_journal_abort() is called, the provided error code is stored in the journal superblock. Some existing calls hard-code -EIO even when the actual failure is not I/O related. This patch updates those calls to pass more accurate error codes, allowing the superblock to record the true cause of failure. This helps improve diagnostics and debugging clarity when analyzing journal aborts. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Message-ID: <20251031210501.7337-1-wen.gang.wang@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-26  jbd2: avoid bug_on in jbd2_journal_get_create_access() when file system corrupted  Ye Bin
There is an issue when the file system is corrupted:
------------[ cut here ]------------
kernel BUG at fs/jbd2/transaction.c:1289!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 5 UID: 0 PID: 2031 Comm: mkdir Not tainted 6.18.0-rc1-next
RIP: 0010:jbd2_journal_get_create_access+0x3b6/0x4d0
RSP: 0018:ffff888117aafa30 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff88811a86b000 RCX: ffffffff89a63534
RDX: 1ffff110200ec602 RSI: 0000000000000004 RDI: ffff888100763010
RBP: ffff888100763000 R08: 0000000000000001 R09: ffff888100763028
R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000000
R13: ffff88812c432000 R14: ffff88812c608000 R15: ffff888120bfc000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f91d6970c99 CR3: 00000001159c4000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 __ext4_journal_get_create_access+0x42/0x170
 ext4_getblk+0x319/0x6f0
 ext4_bread+0x11/0x100
 ext4_append+0x1e6/0x4a0
 ext4_init_new_dir+0x145/0x1d0
 ext4_mkdir+0x326/0x920
 vfs_mkdir+0x45c/0x740
 do_mkdirat+0x234/0x2f0
 __x64_sys_mkdir+0xd6/0x120
 do_syscall_64+0x5f/0xfa0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
We hit the above issue in errors=continue mode in combination with storage failures, after many inconsistencies had accumulated in the file system data. When file system data is inconsistent, for example when the block bitmap bit of a referenced block is not set, a block that is being committed can be allocated and used again. As a result, the condition checked by J_ASSERT() is not satisfied, which triggers the BUG_ON. Of course, it is entirely possible to construct a problematic image that triggers this BUG_ON through specific operations. In fact, I have constructed such an image and easily reproduced this issue. Therefore, J_ASSERT() holds true only under ideal conditions and may not be satisfied in exceptional scenarios. Using J_ASSERT() directly in abnormal situations would cause the system to crash, which is clearly not what we want.
So here we directly trigger a JBD abort instead of immediately invoking BUG_ON. Fixes: 470decc613ab ("[PATCH] jbd2: initial copy of files from jbd") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251025072657.307851-1-yebin@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2025-11-26  kernfs: fix memory leak of kernfs_iattrs in __kernfs_new_node  Will Rosenberg
There exists a memory leak of kernfs_iattrs contained as an element of kernfs_node allocated in __kernfs_new_node(). __kernfs_setattr() allocates kernfs_iattrs as a sub-object, and the LSM security check incorrectly errors out and does not free the kernfs_iattrs sub-object. Make an additional error out case that properly frees kernfs_iattrs if security_kernfs_init_security() fails. Fixes: e19dfdc83b60 ("kernfs: initialize security of newly created nodes") Co-developed-by: Oliver Rosenberg <olrose55@gmail.com> Signed-off-by: Oliver Rosenberg <olrose55@gmail.com> Signed-off-by: Will Rosenberg <whrosenb@asu.edu> Link: https://patch.msgid.link/20251125151332.2010687-1-whrosenb@asu.edu Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-26  fs/kernfs: raise sb->maxbytes to MAX_LFS_FILESIZE  Jane Chu
On an ARM64 A1 system, it's possible to have physical memory span up to the 64T boundary, like below $ lsmem -b -r -n -o range,size 0x0000000080000000-0x00000000bfffffff 1073741824 0x0000080000000000-0x000008007fffffff 2147483648 0x00000800c0000000-0x0000087fffffffff 546534588416 0x0000400000000000-0x00004000bfffffff 3221225472 0x0000400100000000-0x0000407fffffffff 545460846592 So it's time to extend /sys/kernel/mm/page_idle/bitmap to be able to account for >2G number of pages, by raising the kernfs file size limit. Signed-off-by: Jane Chu <jane.chu@oracle.com> Link: https://patch.msgid.link/20251111202606.1505437-1-jane.chu@oracle.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
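A rough worked check of why the limit matters for the address ranges listed above. It assumes 4 KiB pages and a bitmap with one bit per page frame; the previous limit of 2^31 - 1 bytes is also an assumption here (the generic non-LFS default), and `bitmap_bytes()` is a hypothetical helper:

```c
#include <stdint.h>

/* Assumed previous file size limit: 2^31 - 1 bytes. */
#define ASSUMED_OLD_LIMIT 0x7fffffffULL

/* Bytes of a one-bit-per-page bitmap covering addresses [0, end). */
static uint64_t bitmap_bytes(uint64_t end, uint64_t page_size)
{
	uint64_t pages = end / page_size;

	return (pages + 7) / 8;
}
```

The highest range in the listing ends at 0x0000407fffffffff, so `end` is 0x408000000000. With 4 KiB pages that is 0x408000000 page frames and a bitmap of 0x81000000 bytes (2164260864), which is just above the assumed old limit of 2147483647 bytes.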
2025-11-26sysfs: attribute_group: enable const variants of is_visible()Thomas Weißschuh
When constifying instances of struct attribute, for consistency the corresponding .is_visible() callback should be adapted, too. Introduce a temporary transition mechanism until all callbacks are converted. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://patch.msgid.link/20251029-sysfs-const-attr-prep-v5-4-ea7d745acff4@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-11-26fs: inline step_into() and walk_component()Mateusz Guzik
The primary consumer is link_path_walk(), calling walk_component() every time which in turn calls step_into(). Inlining these saves the overhead of 2 function calls per path component, along with allowing the compiler to do a better job optimizing them in place.

step_into() had absolutely atrocious assembly to facilitate the slowpath. In order to lessen the burden at the callsite all the hard work is moved into step_into_slowpath() and instead an inline-able fastpath is implemented for rcu-walk. The new fastpath is a stripped down step_into() RCU handling with a d_managed() check from handle_mounts().

Benchmarked as follows on Sapphire Rapids:
1. the "before" was a kernel with not-yet-merged optimizations (notably elision of calls to security_inode_permission() and marking ext4 inodes as not having acls as applicable)
2. "after" is the same + the prep patch + this patch
3. benchmark consists of issuing 205 calls to access(2) in a loop with pathnames lifted out of gcc and the linker building real code, most of which have several path components and 118 of which fail with -ENOENT.

Result in terms of ops/s:
before: 21619
after: 22536 (+4%)

profile before:
20.25% [kernel] [k] __d_lookup_rcu
10.54% [kernel] [k] link_path_walk
10.22% [kernel] [k] entry_SYSCALL_64
6.50% libc.so.6 [.] __GI___access
6.35% [kernel] [k] strncpy_from_user
4.87% [kernel] [k] step_into
3.68% [kernel] [k] kmem_cache_alloc_noprof
2.88% [kernel] [k] walk_component
2.86% [kernel] [k] kmem_cache_free
2.14% [kernel] [k] set_root
2.08% [kernel] [k] lookup_fast

after:
23.38% [kernel] [k] __d_lookup_rcu
11.27% [kernel] [k] entry_SYSCALL_64
10.89% [kernel] [k] link_path_walk
7.00% libc.so.6 [.] __GI___access
6.88% [kernel] [k] strncpy_from_user
3.50% [kernel] [k] kmem_cache_alloc_noprof
2.01% [kernel] [k] kmem_cache_free
2.00% [kernel] [k] set_root
1.99% [kernel] [k] lookup_fast
1.81% [kernel] [k] do_syscall_64
1.69% [kernel] [k] entry_SYSCALL_64_safe_stack

While walk_component() and step_into() of course disappear from the profile, link_path_walk() barely gets more overhead despite the inlining, thanks to the fast path added and while completing more walks per second. I did not investigate why overhead grew a lot on __d_lookup_rcu().

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251120003803.2979978-2-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
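The fastpath/slowpath split described in this commit can be illustrated in isolation. The sketch below is not the fs/namei.c code: `struct entry`, `step()`, `step_slowpath()` and the `slow_calls` counter are hypothetical, and `__builtin_expect` and `__attribute__((noinline))` are GCC/Clang extensions. It only shows the shape of the technique, with a small inline function covering the common case and a separate out-of-line function holding the rare, register-hungry work.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a directory entry reached during a walk. */
struct entry {
	bool is_symlink;
	bool is_mountpoint;
	int  target;
};

int slow_calls;	/* counts how often the out-of-line path runs */

/* Rare cases live out of line so they do not bloat every call site. */
__attribute__((noinline)) int step_slowpath(const struct entry *e)
{
	slow_calls++;
	/* Symlink and mount handling would go here in a real walker. */
	return e->target;
}

/* Common case: a plain entry, handled without a function call. */
static inline int step(const struct entry *e)
{
	if (__builtin_expect(!e->is_symlink && !e->is_mountpoint, 1))
		return e->target;

	return step_slowpath(e);
}
```

A caller that loops over path components invokes `step()` for each one; only entries flagged as symlinks or mount points pay for the call into `step_slowpath()`.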
2025-11-26fs: tidy up step_into() & friends before inliningMateusz Guzik
Symlink handling is already marked as unlikely and pushing out some of it into pick_link() reduces register spillage on entry to step_into() with gcc 14.2. The compiler needed additional convincing that handle_mounts() is unlikely to fail. At the same time neither clang nor gcc could be convinced to tail-call into pick_link(). While pick_link() takes the address of a stack-based object as an argument (which definitely prevents the optimization), splitting it into a separate <dentry, mount> tuple did not help. The issue persists even when compiled without stack protector. As such nothing was done about this for the time being, to avoid growing the diff. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://patch.msgid.link/20251120003803.2979978-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-26orangefs: use inode_update_timestamps directlyChristoph Hellwig
Orangefs has no i_version handling and __orangefs_setattr already explicitly marks the inode dirty. So instead of using the flags return value from generic_update_time, just call the lower level inode_update_timestamps helper directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251120064859.2911749-7-hch@lst.de Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-26btrfs: fix the comment on btrfs_update_timeChristoph Hellwig
Since commit e41f941a2311 ("Btrfs: move over to use ->update_time") this is not a copy of the high-level file_update_time helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251120064859.2911749-6-hch@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-26btrfs: use vfs_utimes to update file timestampsChristoph Hellwig
Btrfs updates the device node timestamps for block device special files when it stops using the device. Commit 8f96a5bfa150 ("btrfs: update the bdev time directly when closing") switched that update from the correct layering to directly calling the low-level helper on the bdev inode. This is wrong and got fixed in commit 54fde91f52f5 ("btrfs: update device path inode time instead of bd_inode") by updating the file system inode instead of the bdev inode, but this kept the incorrect bypassing of the VFS interfaces and file system ->update_times method. Fix this by using the proper vfs_utimes interface. Fixes: 8f96a5bfa150 ("btrfs: update the bdev time directly when closing") Fixes: 54fde91f52f5 ("btrfs: update device path inode time instead of bd_inode") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251120064859.2911749-5-hch@lst.de Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-26fs: export vfs_utimesChristoph Hellwig
This will be used to replace an incorrect direct call into generic_update_time in btrfs. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251120064859.2911749-4-hch@lst.de Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-26fs: lift the FMODE_NOCMTIME check into file_update_time_flagsChristoph Hellwig
FMODE_NOCMTIME used to be just a hack for the legacy XFS handle-based "invisible I/O", but commit e5e9b24ab8fa ("nfsd: freeze c/mtime updates with outstanding WRITE_ATTRS delegation") started using it from generic callers. I'm not sure other file systems are actually ready for this in general, so the above commit should get a closer look, but for it to make any sense, file_update_time needs to respect the flag. Lift the check from file_modified_flags to file_update_time so that users of file_update_time inherit the behavior and so that all the checks are done in one place. Fixes: e5e9b24ab8fa ("nfsd: freeze c/mtime updates with outstanding WRITE_ATTRS delegation") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251120064859.2911749-3-hch@lst.de Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-26fs: refactor file timestamp update logicChristoph Hellwig
Currently the two high-level APIs use two helper functions to implement almost all of the logic. Refactor the two helpers and the common logic into a new file_update_time_flags routine that is passed the iocb flags, or 0 in the case of file_update_time, so that the entire logic is contained in a single function and can be easily understood and modified. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251120064859.2911749-2-hch@lst.de Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
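The structure that results from this refactoring, together with the FMODE_NOCMTIME change listed above it, can be sketched in userspace as two thin entry points that funnel into one routine holding every check. This is an illustration under assumptions, not the fs/inode.c implementation. `struct file` here is a toy type, `MODE_NOCMTIME` and `FLAG_NOWAIT` are hypothetical stand-ins for FMODE_NOCMTIME and the iocb no-wait flag, and returning -EAGAIN for a no-wait caller is modeled behavior.

```c
#include <errno.h>

#define MODE_NOCMTIME 0x1u	/* hypothetical: file opted out of c/mtime updates */
#define FLAG_NOWAIT   0x1u	/* hypothetical: caller must not block */

/* Toy file object: an open mode and a single timestamp. */
struct file {
	unsigned int mode;
	long mtime;
};

/* All checks live here, so both entry points behave the same way. */
int update_time_flags(struct file *f, long now, unsigned int flags)
{
	if (f->mode & MODE_NOCMTIME)
		return 0;		/* respected for every caller */

	if (f->mtime == now)
		return 0;		/* nothing to update */

	if (flags & FLAG_NOWAIT)
		return -EAGAIN;		/* an update is needed but may block */

	f->mtime = now;
	return 0;
}

/* Entry point without flags: passes 0. */
int update_time(struct file *f, long now)
{
	return update_time_flags(f, now, 0);
}

/* Entry point that forwards the caller's flags. */
int modified_flags(struct file *f, long now, unsigned int flags)
{
	return update_time_flags(f, now, flags);
}
```

Because the opt-out check sits in the shared routine, a caller of `update_time()` gets the same treatment as a caller of `modified_flags()`, which is the point of lifting the check.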