summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-11-28ext4: enable block size larger than page sizeBaokun Li
Since block device (See commit 3c20917120ce ("block/bdev: enable large folio support for large logical block sizes")) and page cache (See commit ab95d23bab220ef8 ("filemap: allocate mapping_min_order folios in the page cache")) has the ability to have a minimum order when allocating folio, and ext4 has supported large folio in commit 7ac67301e82f ("ext4: enable large folio for regular file"), now add support for block_size > PAGE_SIZE in ext4. set_blocksize() -> bdev_validate_blocksize() already validates the block size, so ext4_load_super() does not need to perform additional checks. Here we only need to add the FS_LBS bit to fs_flags. In addition, block sizes larger than the page size are currently supported only when CONFIG_TRANSPARENT_HUGEPAGE is enabled. To make this explicit, a blocksize_gt_pagesize entry has been added under /sys/fs/ext4/feature/, indicating whether bs > ps is supported. This allows mke2fs to check the interface and determine whether a warning should be issued when formatting a filesystem with block size larger than the page size. Suggested-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-25-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: add checks for large folio incompatibilities when BS > PSBaokun Li
Supporting a block size greater than the page size (BS > PS) requires support for large folios. However, several features (e.g., encrypt) do not yet support large folios. To prevent conflicts, this patch adds checks at mount time to prohibit these features from being used when BS > PS. Since these features cannot be changed on remount, there is no need to check on remount. This patch adds s_max_folio_order, initialized during mount according to filesystem features and mount options. If s_max_folio_order is 0, large folios are disabled. With this in place, ext4_set_inode_mapping_order() can be simplified by checking s_max_folio_order, avoiding redundant checks. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-24-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: support verifying data from large folios with fs-verityBaokun Li
Eric Biggers already added support for verifying data from large folios several years ago in commit 5d0f0e57ed90 ("fsverity: support verifying data from large folios"). With ext4 now supporting large block sizes, the fs-verity tests `kvm-xfstests -c ext4/64k -g verity -x encrypt` pass without issues. Therefore, remove the restriction and allow large folios to be enabled together with fs-verity. Cc: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-23-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: make data=journal support large block sizeBaokun Li
Currently, ext4_set_inode_mapping_order() does not set max folio order for files with the data journalling flag. For files that already have large folios enabled, ext4_inode_journal_mode() ignores the data journalling flag once max folio order is set. This is not because data journalling cannot work with large folios, but because credit estimates will go through the roof if there are too many blocks per folio. Since the real constraint is blocks-per-folio, to support data=journal under LBS, we now set max folio order to be equal to min folio order for files with the journalling flag. When LBS is disabled, the max folio order remains unset as before. Therefore, before ext4_change_inode_journal_flag() switches the journalling mode, we call truncate_pagecache() to drop all page cache for that inode, and filemap_write_and_wait() is called unconditionally. After that, once the journalling mode has been switched, we can safely reset the inode mapping order, and the mapping_large_folio_support() check in ext4_inode_journal_mode() can be removed. Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-22-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: support large block size in __ext4_block_zero_page_range()Zhihao Cheng
Use the EXT4_PG_TO_LBLK() macro to convert folio indexes to blocks to avoid negative left shifts after supporting blocksize greater than PAGE_SIZE. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-21-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: support large block size in mpage_prepare_extent_to_map()Baokun Li
Use the EXT4_PG_TO_LBLK/EXT4_LBLK_TO_PG macros to complete the conversion between folio indexes and blocks to avoid negative left/right shifts after supporting blocksize greater than PAGE_SIZE. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-20-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: support large block size in mpage_map_and_submit_buffers()Baokun Li
Use the EXT4_PG_TO_LBLK/EXT4_LBLK_TO_PG macros to complete the conversion between folio indexes and blocks to avoid negative left/right shifts after supporting blocksize greater than PAGE_SIZE. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-19-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: support large block size in ext4_block_write_begin()Baokun Li
Use the EXT4_PG_TO_LBLK() macro to convert folio indexes to blocks to avoid negative left shifts after supporting blocksize greater than PAGE_SIZE. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-18-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: support large block size in ext4_mpage_readpages()Baokun Li
Use the EXT4_PG_TO_LBLK() macro to convert folio indexes to blocks to avoid negative left shifts after supporting blocksize greater than PAGE_SIZE. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-17-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: rename 'page' references to 'folio' in multi-block allocatorZhihao Cheng
The ext4 multi-block allocator now fully supports folio objects. Update all variable names, function names, and comments to replace legacy 'page' terminology with 'folio', improving clarity and consistency. No functional changes. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-16-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: prepare buddy cache inode for BS > PS with large foliosBaokun Li
We use EXT4_BAD_INO for the buddy cache inode number. This inode is not accessed via __ext4_new_inode() or __ext4_iget(), meaning ext4_set_inode_mapping_order() is not called to set its folio order range. However, future block size greater than page size support requires this inode to support large folios, and the buddy cache code already handles BS > PS. Therefore, ext4_set_inode_mapping_order() is now explicitly called for this specific inode to set its folio order range. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-15-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: support large block size in ext4_mb_init_cache()Baokun Li
Currently, ext4_mb_init_cache() uses blocks_per_page to calculate the folio index and offset. However, when blocksize is larger than PAGE_SIZE, blocks_per_page becomes zero, leading to a potential division-by-zero bug. Since we now have the folio, we know its exact size. This allows us to convert {blocks, groups}_per_page to {blocks, groups}_per_folio, thus supporting block sizes greater than page size. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-14-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: support large block size in ext4_mb_get_buddy_page_lock()Baokun Li
Currently, ext4_mb_get_buddy_page_lock() uses blocks_per_page to calculate folio index and offset. However, when blocksize is larger than PAGE_SIZE, blocks_per_page becomes zero, leading to a potential division-by-zero bug. To support BS > PS, use bytes to compute folio index and offset within folio to get rid of blocks_per_page. Also, since ext4_mb_get_buddy_page_lock() already fully supports folio, rename it to ext4_mb_get_buddy_folio_lock(). Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-13-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: support large block size in ext4_mb_load_buddy_gfp()Baokun Li
Currently, ext4_mb_load_buddy_gfp() uses blocks_per_page to calculate the folio index and offset. However, when blocksize is larger than PAGE_SIZE, blocks_per_page becomes zero, leading to a potential division-by-zero bug. To support BS > PS, use bytes to compute folio index and offset within folio to get rid of blocks_per_page. Also, if buddy and bitmap land in the same folio, we get that folio’s ref instead of looking it up again before updating the buddy. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-12-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: add EXT4_LBLK_TO_PG and EXT4_PG_TO_LBLK for block/page conversionBaokun Li
As BS > PS support is coming, all block number to page index (and vice-versa) conversions must now go via bytes. Added EXT4_LBLK_TO_PG() and EXT4_PG_TO_LBLK() macros to simplify these conversions and handle both BS <= PS and BS > PS scenarios cleanly. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-11-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: add EXT4_LBLK_TO_B macro for logical block to bytes conversionBaokun Li
No functional changes. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-10-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: support large block size in ext4_readdir()Baokun Li
In ext4_readdir(), page_cache_sync_readahead() is used to readahead mapped physical blocks. With LBS support, this can lead to a negative right shift. To fix this, the page index is now calculated by first converting the physical block number (pblk) to a file position (pos) before converting it to a page index. Also, the correct number of pages to readahead is now passed. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-9-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: support large block size in ext4_calculate_overhead()Baokun Li
ext4_calculate_overhead() used a single page for its bitmap buffer, which worked fine when PAGE_SIZE >= block size. However, with block size greater than page size (BS > PS) support, the bitmap can exceed a single page. To address this, we now use kvmalloc() to allocate memory of the filesystem block size, to properly support BS > PS. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-8-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: introduce s_min_folio_order for future BS > PS supportBaokun Li
This commit introduces the s_min_folio_order field to the ext4_sb_info structure. This field will store the minimum folio order required by the current filesystem, laying groundwork for future support of block sizes greater than PAGE_SIZE. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-7-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: enable DIOREAD_NOLOCK by default for BS > PS as wellBaokun Li
The dioread_nolock related processes already support large folio, so dioread_nolock is enabled by default regardless of whether the blocksize is less than, equal to, or greater than PAGE_SIZE. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-6-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: make ext4_punch_hole() support large block sizeBaokun Li
When preparing for bs > ps support, clean up unnecessary PAGE_SIZE references in ext4_punch_hole(). Previously, when a hole extended beyond i_size, we aligned the hole end upwards to PAGE_SIZE to handle partial folio invalidation. Now that truncate_inode_pages_range() already handles partial folio invalidation correctly, this alignment is no longer required. However, to save pointless tail block zeroing, we still keep rounding up to the block size here. In addition, as Honza pointed out, when the hole end equals i_size, it should also be rounded up to the block size. This patch fixes that as well. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-5-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: remove PAGE_SIZE checks for rec_len conversionBaokun Li
Previously, ext4_rec_len_(to|from)_disk only performed complex rec_len conversions when PAGE_SIZE >= 65536 to reduce complexity. However, we are soon to support file system block sizes greater than page size, which makes these conditional checks unnecessary. Thus, these checks are now removed. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-4-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: remove page offset calculation in ext4_block_truncate_page()Baokun Li
For bs <= ps scenarios, calculating the offset within the block is sufficient. For bs > ps, an initial page offset calculation can lead to incorrect behavior. Thus this redundant calculation has been removed. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-3-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-28ext4: remove page offset calculation in ext4_block_zero_page_range()Zhihao Cheng
For bs <= ps scenarios, calculating the offset within the block is sufficient. For bs > ps, an initial page offset calculation can lead to incorrect behavior. Thus this redundant calculation has been removed. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Message-ID: <20251121090654.631996-2-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-26ext4: align max orphan file size with e2fsprogs limitBaokun Li
Kernel commit 0a6ce20c1564 ("ext4: verify orphan file size is not too big") limits the maximum supported orphan file size to 8 << 20. However, in e2fsprogs, the orphan file size is set to 32–512 filesystem blocks when creating a filesystem. With 64k block size, formatting an ext4 fs >32G gives an orphan file bigger than the kernel allows, so mount prints an error and fails: EXT4-fs (vdb): orphan file too big: 8650752 EXT4-fs (vdb): mount failed To prevent this issue and allow previously created 64KB filesystems to mount, we updates the maximum allowed orphan file size in the kernel to 512 filesystem blocks. Fixes: 0a6ce20c1564 ("ext4: verify orphan file size is not too big") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251120134233.2994147-1-libaokun@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2025-11-26Documentation: ext4: Document casefold and encrypt flagsDaniel Tang
Based on ext4(5) and fs/ext4/ext4.h. For INCOMPAT_ENCRYPT, it's possible to create a new filesystem with that flag without creating any encrypted inodes. ext4(5) says it adds "support" but doesn't say whether anything's actually present like COMPAT_RESIZE_INODE does. Signed-off-by: Daniel Tang <danielzgtg.opensource@gmail.com> Message-ID: <4506189.9SDvczpPoe@daniel-desktop3> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-26fs/ext4: fix typo in commentHaodong Tian
Correct 'metdata' -> 'metadata' in comment. Signed-off-by: Haodong Tian <tianhd25@mails.tsinghua.edu.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Message-ID: <20251112155916.3007639-1-tianhd25@mails.tsinghua.edu.cn> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-26ext4: correct the comments place for EXT4_EXT_MAY_ZEROOUTYang Erkun
Move the comments just before we set EXT4_EXT_MAY_ZEROOUT in ext4_split_convert_extents. Signed-off-by: Yang Erkun <yangerkun@huawei.com> Message-ID: <20251112084538.1658232-4-yangerkun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-26ext4: cleanup for ext4_map_blocksYang Erkun
Retval from ext4_map_create_blocks means we really create some blocks, cannot happened with m_flags without EXT4_MAP_UNWRITTEN and EXT4_MAP_MAPPED. Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Yang Erkun <yangerkun@huawei.com> Message-ID: <20251112084538.1658232-3-yangerkun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-26ext4: rename EXT4_GET_BLOCKS_PRE_IOYang Erkun
This flag has been generalized to split an unwritten extent when we do dio or dioread_nolock writeback, or to avoid merge new extents which was created by extents split. Update some related comments too. Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Yang Erkun <yangerkun@huawei.com> Message-ID: <20251112084538.1658232-2-yangerkun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-26ext4: improve integrity checking in __mb_check_buddy by enhancing order-0 ↵Yongjian Sun
validation When the MB_CHECK_ASSERT macro is enabled, we found that the current validation logic in __mb_check_buddy has a gap in detecting certain invalid buddy states, particularly related to order-0 (bitmap) bits. The original logic consists of three steps: 1. Validates higher-order buddies: if a higher-order bit is set, at most one of the two corresponding lower-order bits may be free; if a higher-order bit is clear, both lower-order bits must be allocated (and their bitmap bits must be 0). 2. For any set bit in order-0, ensures all corresponding higher-order bits are not free. 3. Verifies that all preallocated blocks (pa) in the group have pa_pstart within bounds and their bitmap bits marked as allocated. However, this approach fails to properly validate cases where order-0 bits are incorrectly cleared (0), allowing some invalid configurations to pass: corrupt integral order 3 1 1 order 2 1 1 1 1 order 1 1 1 1 1 1 1 1 1 order 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Here we get two adjacent free blocks at order-0 with inconsistent higher-order state, and the right one shows the correct scenario. The root cause is insufficient validation of order-0 zero bits. To fix this and improve completeness without significant performance cost, we refine the logic: 1. Maintain the top-down higher-order validation, but we no longer check the cases where the higher-order bit is 0, as this case will be covered in step 2. 2. Enhance order-0 checking by examining pairs of bits: - If either bit in a pair is set (1), all corresponding higher-order bits must not be free. - If both bits are clear (0), then exactly one of the corresponding higher-order bits must be free 3. Keep the preallocation (pa) validation unchanged. This change closes the validation gap, ensuring illegal buddy states involving order-0 are correctly detected, while removing redundant checks and maintaining efficiency. Fixes: c9de560ded61f ("ext4: Add multi block allocator for ext4") Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251106060614.631382-3-sunyongjian@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-26ext4: fix incorrect group number assertion in mb_check_buddyYongjian Sun
When the MB_CHECK_ASSERT macro is enabled, an assertion failure can occur in __mb_check_buddy when checking preallocated blocks (pa) in a block group: Assertion failure in mb_free_blocks() : "groupnr == e4b->bd_group" This happens when a pa at the very end of a block group (e.g., pa_pstart=32765, pa_len=3 in a group of 32768 blocks) becomes exhausted - its pa_pstart is advanced by pa_len to 32768, which lies in the next block group. If this exhausted pa (with pa_len == 0) is still in the bb_prealloc_list during the buddy check, the assertion incorrectly flags it as belonging to the wrong group. A possible sequence is as follows: ext4_mb_new_blocks ext4_mb_release_context pa->pa_pstart += EXT4_C2B(sbi, ac->ac_b_ex.fe_len) pa->pa_len -= ac->ac_b_ex.fe_len __mb_check_buddy for each pa in group ext4_get_group_no_and_offset MB_CHECK_ASSERT(groupnr == e4b->bd_group) To fix this, we modify the check to skip block group validation for exhausted preallocations (where pa_len == 0). Such entries are in a transitional state and will be removed from the list soon, so they should not trigger an assertion. This change prevents the false positive while maintaining the integrity of the checks for active allocations. Fixes: c9de560ded61f ("ext4: Add multi block allocator for ext4") Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251106060614.631382-2-sunyongjian@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2025-11-26ext4: add i_data_sem protection in ext4_destroy_inline_data_nolock()Alexey Nepomnyashih
Fix a race between inline data destruction and block mapping. The function ext4_destroy_inline_data_nolock() changes the inode data layout by clearing EXT4_INODE_INLINE_DATA and setting EXT4_INODE_EXTENTS. At the same time, another thread may execute ext4_map_blocks(), which tests EXT4_INODE_EXTENTS to decide whether to call ext4_ext_map_blocks() or ext4_ind_map_blocks(). Without i_data_sem protection, ext4_ind_map_blocks() may receive inode with EXT4_INODE_EXTENTS flag and triggering assert. kernel BUG at fs/ext4/indirect.c:546! EXT4-fs (loop2): unmounting filesystem. invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 RIP: 0010:ext4_ind_map_blocks.cold+0x2b/0x5a fs/ext4/indirect.c:546 Call Trace: <TASK> ext4_map_blocks+0xb9b/0x16f0 fs/ext4/inode.c:681 _ext4_get_block+0x242/0x590 fs/ext4/inode.c:822 ext4_block_write_begin+0x48b/0x12c0 fs/ext4/inode.c:1124 ext4_write_begin+0x598/0xef0 fs/ext4/inode.c:1255 ext4_da_write_begin+0x21e/0x9c0 fs/ext4/inode.c:3000 generic_perform_write+0x259/0x5d0 mm/filemap.c:3846 ext4_buffered_write_iter+0x15b/0x470 fs/ext4/file.c:285 ext4_file_write_iter+0x8e0/0x17f0 fs/ext4/file.c:679 call_write_iter include/linux/fs.h:2271 [inline] do_iter_readv_writev+0x212/0x3c0 fs/read_write.c:735 do_iter_write+0x186/0x710 fs/read_write.c:861 vfs_iter_write+0x70/0xa0 fs/read_write.c:902 iter_file_splice_write+0x73b/0xc90 fs/splice.c:685 do_splice_from fs/splice.c:763 [inline] direct_splice_actor+0x10f/0x170 fs/splice.c:950 splice_direct_to_actor+0x33a/0xa10 fs/splice.c:896 do_splice_direct+0x1a9/0x280 fs/splice.c:1002 do_sendfile+0xb13/0x12c0 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1cf/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Fixes: c755e251357a ("ext4: fix deadlock between inline_data and ext4_expand_extra_isize_ea()") Cc: stable@vger.kernel.org # v4.11+ Signed-off-by: Alexey Nepomnyashih <sdl@nppct.ru> Message-ID: <20251104093326.697381-1-sdl@nppct.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-26ext4: clear i_state_flags when alloc inodeHaibo Chen
i_state_flags used on 32-bit archs, need to clear this flag when alloc inode. Find this issue when umount ext4, sometimes track the inode as orphan accidently, cause ext4 mesg dump. Fixes: acf943e9768e ("ext4: fix checks for orphan inodes") Signed-off-by: Haibo Chen <haibo.chen@nxp.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251104-ext4-v1-1-73691a0800f9@nxp.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2025-11-26jbd2: fix the inconsistency between checksum and data in memory for journal sbYe Bin
Copying the file system while it is mounted as read-only results in a mount failure: [~]# mkfs.ext4 -F /dev/sdc [~]# mount /dev/sdc -o ro /mnt/test [~]# dd if=/dev/sdc of=/dev/sda bs=1M [~]# mount /dev/sda /mnt/test1 [ 1094.849826] JBD2: journal checksum error [ 1094.850927] EXT4-fs (sda): Could not load journal inode mount: mount /dev/sda on /mnt/test1 failed: Bad message The process described above is just an abstracted way I came up with to reproduce the issue. In the actual scenario, the file system was mounted read-only and then copied while it was still mounted. It was found that the mount operation failed. The user intended to verify the data or use it as a backup, and this action was performed during a version upgrade. Above issue may happen as follows: ext4_fill_super set_journal_csum_feature_set(sb) if (ext4_has_metadata_csum(sb)) incompat = JBD2_FEATURE_INCOMPAT_CSUM_V3; if (test_opt(sb, JOURNAL_CHECKSUM) jbd2_journal_set_features(sbi->s_journal, compat, 0, incompat); lock_buffer(journal->j_sb_buffer); sb->s_feature_incompat |= cpu_to_be32(incompat); //The data in the journal sb was modified, but the checksum was not updated, so the data remaining in memory has a mismatch between the data and the checksum. unlock_buffer(journal->j_sb_buffer); In this case, the journal sb copied over is in a state where the checksum and data are inconsistent, so mounting fails. To solve the above issue, update the checksum in memory after modifying the journal sb. Fixes: 4fd5ea43bc11 ("jbd2: checksum journal superblock") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251103010123.3753631-1-yebin@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2025-11-26ext4: check if mount_opts is NUL-terminated in ext4_ioctl_set_tune_sb()Fedor Pchelkin
params.mount_opts may come as potentially non-NUL-term string. Userspace is expected to pass a NUL-term string. Add an extra check to ensure this holds true. Note that further code utilizes strscpy_pad() so this is just for proper informing the user of incorrect data being provided. Found by Linux Verification Center (linuxtesting.org). Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251101160430.222297-2-pchelkin@ispras.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2025-11-26ext4: fix string copying in parse_apply_sb_mount_options()Fedor Pchelkin
strscpy_pad() can't be used to copy a non-NUL-term string into a NUL-term string of possibly bigger size. Commit 0efc5990bca5 ("string.h: Introduce memtostr() and memtostr_pad()") provides additional information in that regard. So if this happens, the following warning is observed: strnlen: detected buffer overflow: 65 byte read of buffer size 64 WARNING: CPU: 0 PID: 28655 at lib/string_helpers.c:1032 __fortify_report+0x96/0xc0 lib/string_helpers.c:1032 Modules linked in: CPU: 0 UID: 0 PID: 28655 Comm: syz-executor.3 Not tainted 6.12.54-syzkaller-00144-g5f0270f1ba00 #0 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:__fortify_report+0x96/0xc0 lib/string_helpers.c:1032 Call Trace: <TASK> __fortify_panic+0x1f/0x30 lib/string_helpers.c:1039 strnlen include/linux/fortify-string.h:235 [inline] sized_strscpy include/linux/fortify-string.h:309 [inline] parse_apply_sb_mount_options fs/ext4/super.c:2504 [inline] __ext4_fill_super fs/ext4/super.c:5261 [inline] ext4_fill_super+0x3c35/0xad00 fs/ext4/super.c:5706 get_tree_bdev_flags+0x387/0x620 fs/super.c:1636 vfs_get_tree+0x93/0x380 fs/super.c:1814 do_new_mount fs/namespace.c:3553 [inline] path_mount+0x6ae/0x1f70 fs/namespace.c:3880 do_mount fs/namespace.c:3893 [inline] __do_sys_mount fs/namespace.c:4103 [inline] __se_sys_mount fs/namespace.c:4080 [inline] __x64_sys_mount+0x280/0x300 fs/namespace.c:4080 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x64/0x140 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e Since userspace is expected to provide s_mount_opts field to be at most 63 characters long with the ending byte being NUL-term, use a 64-byte buffer which matches the size of s_mount_opts, so that strscpy_pad() does its job properly. Return with error if the user still managed to provide a non-NUL-term string here. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: 8ecb790ea8c3 ("ext4: avoid potential buffer over-read in parse_apply_sb_mount_options()") Cc: stable@vger.kernel.org Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251101160430.222297-1-pchelkin@ispras.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-26jbd2: store more accurate errno in superblock when possibleWengang Wang
When jbd2_journal_abort() is called, the provided error code is stored in the journal superblock. Some existing calls hard-code -EIO even when the actual failure is not I/O related. This patch updates those calls to pass more accurate error codes, allowing the superblock to record the true cause of failure. This helps improve diagnostics and debugging clarity when analyzing journal aborts. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Message-ID: <20251031210501.7337-1-wen.gang.wang@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-26jbd2: avoid bug_on in jbd2_journal_get_create_access() when file system ↵Ye Bin
corrupted There's issue when file system corrupted: ------------[ cut here ]------------ kernel BUG at fs/jbd2/transaction.c:1289! Oops: invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 5 UID: 0 PID: 2031 Comm: mkdir Not tainted 6.18.0-rc1-next RIP: 0010:jbd2_journal_get_create_access+0x3b6/0x4d0 RSP: 0018:ffff888117aafa30 EFLAGS: 00010202 RAX: 0000000000000000 RBX: ffff88811a86b000 RCX: ffffffff89a63534 RDX: 1ffff110200ec602 RSI: 0000000000000004 RDI: ffff888100763010 RBP: ffff888100763000 R08: 0000000000000001 R09: ffff888100763028 R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000000 R13: ffff88812c432000 R14: ffff88812c608000 R15: ffff888120bfc000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f91d6970c99 CR3: 00000001159c4000 CR4: 00000000000006f0 Call Trace: <TASK> __ext4_journal_get_create_access+0x42/0x170 ext4_getblk+0x319/0x6f0 ext4_bread+0x11/0x100 ext4_append+0x1e6/0x4a0 ext4_init_new_dir+0x145/0x1d0 ext4_mkdir+0x326/0x920 vfs_mkdir+0x45c/0x740 do_mkdirat+0x234/0x2f0 __x64_sys_mkdir+0xd6/0x120 do_syscall_64+0x5f/0xfa0 entry_SYSCALL_64_after_hwframe+0x76/0x7e The above issue occurs with us in errors=continue mode when accompanied by storage failures. There have been many inconsistencies in the file system data. In the case of file system data inconsistency, for example, if the block bitmap of a referenced block is not set, it can lead to the situation where a block being committed is allocated and used again. As a result, the following condition will not be satisfied then trigger BUG_ON. Of course, it is entirely possible to construct a problematic image that can trigger this BUG_ON through specific operations. In fact, I have constructed such an image and easily reproduced this issue. Therefore, J_ASSERT() holds true only under ideal conditions, but it may not necessarily be satisfied in exceptional scenarios. Using J_ASSERT() directly in abnormal situations would cause the system to crash, which is clearly not what we want. So here we directly trigger a JBD abort instead of immediately invoking BUG_ON. Fixes: 470decc613ab ("[PATCH] jbd2: initial copy of files from jbd") Signed-off-by: Ye Bin <yebin10@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251025072657.307851-1-yebin@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2025-11-13jbd2: use a weaker annotation in journal handlingByungchul Park
jbd2 journal handling code doesn't want jbd2_might_wait_for_commit() to be placed between start_this_handle() and stop_this_handle(). So it marks the region with rwsem_acquire_read() and rwsem_release(). However, the annotation is too strong for that purpose. We don't have to use more than try lock annotation for that. rwsem_acquire_read() implies: 1. might be a waiter on contention of the lock. 2. enter to the critical section of the lock. All we need in here is to act 2, not 1. So trylock version of annotation is sufficient for that purpose. Now that dept partially relies on lockdep annotaions, dept interpets rwsem_acquire_read() as a potential wait and might report a deadlock by the wait. Replace it with trylock version of annotation. Signed-off-by: Byungchul Park <byungchul@sk.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org Message-ID: <20251024073940.1063-1-byungchul@sk.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-13jbd2: use a per-journal lock_class_key for jbd2_trans_commit_keyTetsuo Handa
syzbot is reporting possibility of deadlock due to sharing lock_class_key for jbd2_handle across ext4 and ocfs2. But this is a false positive, for one disk partition can't have two filesystems at the same time. Reported-by: syzbot+6e493c165d26d6fcbf72@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=6e493c165d26d6fcbf72 Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: syzbot+6e493c165d26d6fcbf72@syzkaller.appspotmail.com Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <987110fc-5470-457a-a218-d286a09dd82f@I-love.SAKURA.ne.jp> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
2025-11-13ext4: xattr: fix null pointer deref in ext4_raw_inode()Karina Yankevich
If ext4_get_inode_loc() fails (e.g. if it returns -EFSCORRUPTED), iloc.bh will remain set to NULL. Since ext4_xattr_inode_dec_ref_all() lacks error checking, this will lead to a null pointer dereference in ext4_raw_inode(), called right after ext4_get_inode_loc(). Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: c8e008b60492 ("ext4: ignore xattrs past end") Cc: stable@kernel.org Signed-off-by: Karina Yankevich <k.yankevich@omp.ru> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Reviewed-by: Baokun Li <libaokun1@huawei.com> Message-ID: <20251022093253.3546296-1-k.yankevich@omp.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-13ext4: refresh inline data size before write operationsDeepanshu Kartikey
The cached ei->i_inline_size can become stale between the initial size check and when ext4_update_inline_data()/ext4_create_inline_data() use it. Although ext4_get_max_inline_size() reads the correct value at the time of the check, concurrent xattr operations can modify i_inline_size before ext4_write_lock_xattr() is acquired. This causes ext4_update_inline_data() and ext4_create_inline_data() to work with stale capacity values, leading to a BUG_ON() crash in ext4_write_inline_data(): kernel BUG at fs/ext4/inline.c:1331! BUG_ON(pos + len > EXT4_I(inode)->i_inline_size); The race window: 1. ext4_get_max_inline_size() reads i_inline_size = 60 (correct) 2. Size check passes for 50-byte write 3. [Another thread adds xattr, i_inline_size changes to 40] 4. ext4_write_lock_xattr() acquires lock 5. ext4_update_inline_data() uses stale i_inline_size = 60 6. Attempts to write 50 bytes but only 40 bytes actually available 7. BUG_ON() triggers Fix this by recalculating i_inline_size via ext4_find_inline_data_nolock() immediately after acquiring xattr_sem. This ensures ext4_update_inline_data() and ext4_create_inline_data() work with current values that are protected from concurrent modifications. This is similar to commit a54c4613dac1 ("ext4: fix race writing to an inline_data file while its xattrs are changing") which fixed i_inline_off staleness. This patch addresses the related i_inline_size staleness issue. Reported-by: syzbot+f3185be57d7e8dda32b8@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=f3185be57d7e8dda32b8 Cc: stable@kernel.org Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Message-ID: <20251020060936.474314-1-kartikey406@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-06ext4: add two trace points for moving extentsZhang Yi
To facilitate tracking the length, type, and outcome of the move extent, add a trace point at both the entry and exit of mext_move_extent(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251013015128.499308-13-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-06ext4: add large folios support for moving extentsZhang Yi
Pass the moving extent length into mext_folio_double_lock() so that it can acquire a higher-order folio if the length exceeds PAGE_SIZE. This can speed up extent moving when the extent is larger than one page. Additionally, remove the unnecessary comments from mext_folio_double_lock(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251013015128.499308-12-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-06ext4: switch to using the new extent movement methodZhang Yi
Now that we have mext_move_extent(), we can switch to this new interface and deprecate move_extent_per_page(). First, after acquiring the i_rwsem, we can directly use ext4_map_blocks() to obtain a contiguous extent from the original inode as the extent to be moved. It can and it's safe to get mapping information from the extent status tree without needing to access the ondisk extent tree, because ext4_move_extent() will check the sequence cookie under the folio lock. Then, after populating the mext_data structure, we call ext4_move_extent() to move the extent. Finally, the length of the extent will be adjusted in mext.orig_map.m_len and the actual length moved is returned through m_len. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251013015128.499308-11-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-06ext4: introduce mext_move_extent()Zhang Yi
When moving extents, the current move_extent_per_page() process can only move extents of length PAGE_SIZE at a time, which is highly inefficient, especially when the fragmentation of the file is not particularly severe, this will result in a large number of unnecessary extent split and merge operations. Moreover, since the ext4 file system now supports large folios, using PAGE_SIZE as the processing unit is no longer practical. Therefore, introduce a new move extents method, mext_move_extent(). It moves one extent of the origin inode at a time, but not exceeding the size of a folio. The parameters for the move are passed through the new mext_data data structure, which includes the origin inode, donor inode, the mapping extent of the origin inode to be moved, and the starting offset of the donor inode. The move process is similar to move_extent_per_page() and can be categorized into three types: MEXT_SKIP_EXTENT, MEXT_MOVE_EXTENT, and MEXT_COPY_DATA. MEXT_SKIP_EXTENT indicates that the corresponding area of the donor file is a hole, meaning no actual space is allocated, so the move is skipped. MEXT_MOVE_EXTENT indicates that the corresponding areas of both the origin and donor files are unwritten, so no data needs to be copied; only the extents are swapped. MEXT_COPY_DATA indicates that the corresponding areas of both the origin and donor files contain data, so data must be copied. The data copying is performed in three steps: first, the data from the original location is read into the page cache; then, the extents are swapped, and the page cache is rebuilt to reflect the index of the physical blocks; finally, the dirty page cache is marked and written back to ensure that the data is written to disk before the metadata is persisted. One important point to note is that the folio lock and i_data_sem are held only during the moving process. Therefore, before moving an extent, it is necessary to check whether the sequence cookie of the area to be moved has changed while holding the folio lock. If a change is detected, it indicates that concurrent write-back operations may have occurred during this period, and the type of the extent to be moved can no longer be considered reliable. For example, it may have changed from unwritten to written. In such cases, return -ESTALE, and the calling function should reacquire the move extent of the original file and retry the movement. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251013015128.499308-10-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-06ext4: rename mext_page_mkuptodate() to mext_folio_mkuptodate()Zhang Yi
mext_page_mkuptodate() no longer works on a single page, so rename it to mext_folio_mkuptodate(). Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251013015128.499308-9-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-06ext4: refactor mext_check_arguments()Zhang Yi
When moving extents, mext_check_validity() performs some basic file system and file checks. However, some essential checks need to be performed after acquiring the i_rwsem are still scattered in mext_check_arguments(). Move those checks into mext_check_validity() and make it executes entirely under the i_rwsem to make the checks clearer. Furthermore, rename mext_check_arguments() to mext_check_adjust_range(), as it only performs checks and length adjustments on the move extent range. Finally, also change the print message for the non-existent file check to be consistent with other unsupported checks. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251013015128.499308-8-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2025-11-06ext4: add mext_check_validity() to do basic checkZhang Yi
Currently, the basic validation checks during the move extent operation are scattered across __ext4_ioctl() and ext4_move_extents(), which makes the code somewhat disorganized. Introduce a new helper, mext_check_validity(), to handle these checks. This change involves only code relocation without any logical modifications. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Message-ID: <20251013015128.499308-7-yi.zhang@huaweicloud.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>