| Age | Commit message (Collapse) | Author |
|
During a withdraw, we don't want to write out any more data than we have
to, so in do_xmote(), skip the ->go_sync() glock operation. We still
want to keep calling ->go_inval() to discard any cached data or
metadata, whether clean or dirty.
We do still allow glocks to transition into state LM_ST_UNLOCKED. This
has the desired side effect of calling ->go_inval() and invalidating the
glock caches.
Function gfs2_withdraw_glocks() is already used for dequeuing any
left-over waiters. We still want that to happen, but additionally, we
want all glocks to be unlocked.
Finally, we change function do_promote() to refuse any further
promotions.
This commit cleans up the leftovers of commit 86934198eefa ("gfs2: Clear
flags when withdraw prevents xmote").
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Rename function gfs2_gl_dq_holders() to gfs2_withdraw_glocks(). This
function will soon be used for more than just dequeuing holders.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
The current withdraw code duplicates the journal recovery code gfs2
already has for dealing with node failures, and it does so poorly. That
code was added because when releasing a lockspace, we didn't have a way
to indicate that the lockspace needs recovery. We now do have this
feature, so the current withdraw code can be removed almost entirely.
This is one of several steps towards that.
Reverts commit 33dbd1e41a1d ("gfs2: fix infinite loop when checking ail
item count before go_inval").
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
The current withdraw code duplicates the journal recovery code gfs2
already has for dealing with node failures, and it does so poorly. That
code was added because when releasing a lockspace, we didn't have a way
to indicate that the lockspace needs recovery. We now do have this
feature, so the current withdraw code can be removed almost entirely.
This is one of several steps towards that.
Reverts commit a72d2401f54b ("gfs2: Allow some glocks to be used during
withdraw").
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
The current withdraw code duplicates the journal recovery code gfs2
already has for dealing with node failures, and it does so poorly. That
code was added because when releasing a lockspace, we didn't have a way
to indicate that the lockspace needs recovery. We now do have this
feature, so the current withdraw code can be removed almost entirely.
This is one of several steps towards that.
Reverts the rest of d93ae386ef3d ("gfs2: Check for log write errors
before telling dlm to unlock").
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
The current withdraw code duplicates the journal recovery code gfs2
already has for dealing with node failures, and it does so poorly. That
code was added because when releasing a lockspace, we didn't have a way
to indicate that the lockspace needs recovery. We now do have this
feature, so the current withdraw code can be removed almost entirely.
This is one of several steps towards that.
Reverts commit 865cc3e9cc0b ("gfs2: fix a deadlock on
withdraw-during-mount").
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
The current withdraw code duplicates the journal recovery code gfs2
already has for dealing with node failures, and it does so poorly. That
code was added because when releasing a lockspace, we didn't have a way
to indicate that the lockspace needs recovery. We now do have this
feature, so the current withdraw code can be removed almost entirely.
This is one of several steps towards that.
Reverts parts of commit 601ef0d52e96 ("gfs2: Force withdraw to replay
journals and wait for it to finish").
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
The current withdraw code duplicates the journal recovery code gfs2
already has for dealing with node failures, and it does so poorly. That
code was added because when releasing a lockspace, we didn't have a way
to indicate that the lockspace needs recovery. We now do have this
feature, so the current withdraw code can be removed almost entirely.
This is one of several steps towards that.
Reverts parts of commit 601ef0d52e96 ("gfs2: Force withdraw to replay
journals and wait for it to finish").
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
The current withdraw code duplicates the journal recovery code gfs2
already has for dealing with node failures, and it does so poorly. That
code was added because when releasing a lockspace, we didn't have a way
to indicate that the lockspace needs recovery. We now do have this
feature, so the current withdraw code can be removed almost entirely.
This is one of several steps towards that.
Reverts parts of commit 601ef0d52e96 ("gfs2: Force withdraw to replay
journals and wait for it to finish").
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
The current withdraw code duplicates the journal recovery code gfs2
already has for dealing with node failures, and it does so poorly. That
code was added because when releasing a lockspace, we didn't have a way
to indicate that the lockspace needs recovery. We now do have this
feature, so the current withdraw code can be removed almost entirely.
This is one of several steps towards that.
Reverts parts of commit 601ef0d52e96 ("gfs2: Force withdraw to replay
journals and wait for it to finish").
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
The current withdraw code duplicates the journal recovery code gfs2
already has for dealing with node failures, and it does so poorly. That
code was added because when releasing a lockspace, we didn't have a way
to indicate that the lockspace needs recovery. We now do have this
feature, so the current withdraw code can be removed almost entirely.
This is one of several steps towards that.
Reverts parts of commit 601ef0d52e96 ("gfs2: Force withdraw to replay
journals and wait for it to finish").
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
The current withdraw code duplicates the journal recovery code gfs2
already has for dealing with node failures, and it does so poorly. That
code was added because when releasing a lockspace, we didn't have a way
to indicate that the lockspace needs recovery. We now do have this
feature, so the current withdraw code can be removed almost entirely.
This is one of several steps towards that.
Reverts parts of commit 601ef0d52e96 ("gfs2: Force withdraw to replay
journals and wait for it to finish").
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
The current withdraw code duplicates the journal recovery code gfs2
already has for dealing with node failures, and it does so poorly. That
code was added because when releasing a lockspace, we didn't have a way
to indicate that the lockspace needs recovery. We now do have this
feature, so the current withdraw code can be removed almost entirely.
This is one of several steps towards that.
The withdrawing node has no role in recovering from the withdraw
anymore, so it also no longer needs to read metadata blocks after a
withdraw.
We now only need to set a single bit in gfs2_withdraw(), so switch from
try_cmpxchg() to test_and_set_bit().
Reverts commit 8cc67f704f4b ("gfs2: don't stop reads while withdraw in
progress").
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
GFS sets the LM_FLAG_NOEXP flag on locking requests it makes during
journal recovery, so rename the flag to LM_FLAG_RECOVER for improved
code readability.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
All callers of gfs2_io_error_bh() call gfs2_withdraw() as well, so
change gfs2_io_error_bh() to call gfs2_withdraw() directly. This also
brings it in line with other similar error reporting functions.
With that, gfs2_io_error_bh() is the same as gfs2_io_error_bh_wd(), so
remove the latter.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Now that gfs2_withdraw() is asynchronous, immediately withdraw when
a log write error is detected.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
With delayed withdraws and the SDF_WITHDRAWING flag gone, we can now
rename gfs2_withdrawing_or_withdrawn() back to gfs2_withdrawn().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Now that gfs2_withdraw() is asynchronous, is can be called in any
context and there is no more need for gfs2_withdraw_delayed() or for
turning delayed withdraws into actual withdraws. Remove the
now-obsolete code.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
So far, withdraws are carried out in the context of the calling task.
When another task tries to withdraw while a withdraw is already
underway, that task blocks as well. Change that to carry out withdraws
asynchronously in workqueue context and don't block the task triggering
the withdraw anymore.
Fixes: syzbot+6b156e132970e550194c@syzkaller.appspotmail.com
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Add a 'clean' argument to ->lm_unmount() that indicates whether the
filesystem is clean or needs recovery. Set clean to true for normal
unmounts, and to false for withdraws.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Instead of tracking the remaining time, track the deadline of each of
the timeouts.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Commit e4a8b5481c59a ("gfs2: Switch to wait_event in gfs2_quotad") broke
cyclic statfs syncing, so the numbers reported by "df" could easily get
completely out of sync with reality. Fix this by reverting part of
commit e4a8b5481c59a for now.
A follow-up commit will clean this code up later.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Rename gfs2_try_evict() to gfs2_try_to_evict(). The GIF_DEFER_DELETE
flag has been superceded by the GLF_DEFER_DELETE flag, so fix a
left-over comment. Add a clarifying comment to delete_work_func().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
When a node tries to delete an inode, it first requests exclusive access
to the iopen glock. This triggers demote requests on all remote nodes
currently holding the iopen glock. To satisfy those requests, the
remote nodes evict the inode in question, or they poke the corresponding
inode glock to signal that the inode is still in active use.
This behavior doesn't depend on whether or not a filesystem is
read-only, so remove the incorrect read-only check.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
With the previous commit revamping the timeout handling, started isn't
used anymore. It could be taken into account by adjusting the initial
value of the timeout, but there is little point as both callers capture
the timestamp shortly before calling __ceph_open_session() -- the only
thing of note that happens in the interim is taking client->mount_mutex
and that isn't expected to take multiple seconds.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
|
|
Kernel commit 0a6ce20c1564 ("ext4: verify orphan file size is not too big")
limits the maximum supported orphan file size to 8 << 20.
However, in e2fsprogs, the orphan file size is set to 32–512 filesystem
blocks when creating a filesystem.
With 64k block size, formatting an ext4 fs >32G gives an orphan file bigger
than the kernel allows, so mount prints an error and fails:
EXT4-fs (vdb): orphan file too big: 8650752
EXT4-fs (vdb): mount failed
To prevent this issue and allow previously created 64KB filesystems to
mount, we updates the maximum allowed orphan file size in the kernel to
512 filesystem blocks.
Fixes: 0a6ce20c1564 ("ext4: verify orphan file size is not too big")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20251120134233.2994147-1-libaokun@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
Correct 'metdata' -> 'metadata' in comment.
Signed-off-by: Haodong Tian <tianhd25@mails.tsinghua.edu.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Message-ID: <20251112155916.3007639-1-tianhd25@mails.tsinghua.edu.cn>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Move the comments just before we set EXT4_EXT_MAY_ZEROOUT in
ext4_split_convert_extents.
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Message-ID: <20251112084538.1658232-4-yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Retval from ext4_map_create_blocks means we really create some blocks,
cannot happened with m_flags without EXT4_MAP_UNWRITTEN and
EXT4_MAP_MAPPED.
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Message-ID: <20251112084538.1658232-3-yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This flag has been generalized to split an unwritten extent when we do
dio or dioread_nolock writeback, or to avoid merge new extents which was
created by extents split. Update some related comments too.
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Yang Erkun <yangerkun@huawei.com>
Message-ID: <20251112084538.1658232-2-yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
validation
When the MB_CHECK_ASSERT macro is enabled, we found that the
current validation logic in __mb_check_buddy has a gap in
detecting certain invalid buddy states, particularly related
to order-0 (bitmap) bits.
The original logic consists of three steps:
1. Validates higher-order buddies: if a higher-order bit is
set, at most one of the two corresponding lower-order bits
may be free; if a higher-order bit is clear, both lower-order
bits must be allocated (and their bitmap bits must be 0).
2. For any set bit in order-0, ensures all corresponding
higher-order bits are not free.
3. Verifies that all preallocated blocks (pa) in the group
have pa_pstart within bounds and their bitmap bits marked as
allocated.
However, this approach fails to properly validate cases where
order-0 bits are incorrectly cleared (0), allowing some invalid
configurations to pass:
corrupt integral
order 3 1 1
order 2 1 1 1 1
order 1 1 1 1 1 1 1 1 1
order 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Here we get two adjacent free blocks at order-0 with inconsistent
higher-order state, and the right one shows the correct scenario.
The root cause is insufficient validation of order-0 zero bits.
To fix this and improve completeness without significant performance
cost, we refine the logic:
1. Maintain the top-down higher-order validation, but we no longer
check the cases where the higher-order bit is 0, as this case will
be covered in step 2.
2. Enhance order-0 checking by examining pairs of bits:
- If either bit in a pair is set (1), all corresponding
higher-order bits must not be free.
- If both bits are clear (0), then exactly one of the
corresponding higher-order bits must be free
3. Keep the preallocation (pa) validation unchanged.
This change closes the validation gap, ensuring illegal buddy states
involving order-0 are correctly detected, while removing redundant
checks and maintaining efficiency.
Fixes: c9de560ded61f ("ext4: Add multi block allocator for ext4")
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20251106060614.631382-3-sunyongjian@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When the MB_CHECK_ASSERT macro is enabled, an assertion failure can
occur in __mb_check_buddy when checking preallocated blocks (pa) in
a block group:
Assertion failure in mb_free_blocks() : "groupnr == e4b->bd_group"
This happens when a pa at the very end of a block group (e.g.,
pa_pstart=32765, pa_len=3 in a group of 32768 blocks) becomes
exhausted - its pa_pstart is advanced by pa_len to 32768, which
lies in the next block group. If this exhausted pa (with pa_len == 0)
is still in the bb_prealloc_list during the buddy check, the assertion
incorrectly flags it as belonging to the wrong group. A possible
sequence is as follows:
ext4_mb_new_blocks
ext4_mb_release_context
pa->pa_pstart += EXT4_C2B(sbi, ac->ac_b_ex.fe_len)
pa->pa_len -= ac->ac_b_ex.fe_len
__mb_check_buddy
for each pa in group
ext4_get_group_no_and_offset
MB_CHECK_ASSERT(groupnr == e4b->bd_group)
To fix this, we modify the check to skip block group validation for
exhausted preallocations (where pa_len == 0). Such entries are in a
transitional state and will be removed from the list soon, so they
should not trigger an assertion. This change prevents the false
positive while maintaining the integrity of the checks for active
allocations.
Fixes: c9de560ded61f ("ext4: Add multi block allocator for ext4")
Signed-off-by: Yongjian Sun <sunyongjian1@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20251106060614.631382-2-sunyongjian@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
Fix a race between inline data destruction and block mapping.
The function ext4_destroy_inline_data_nolock() changes the inode data
layout by clearing EXT4_INODE_INLINE_DATA and setting EXT4_INODE_EXTENTS.
At the same time, another thread may execute ext4_map_blocks(), which
tests EXT4_INODE_EXTENTS to decide whether to call ext4_ext_map_blocks()
or ext4_ind_map_blocks().
Without i_data_sem protection, ext4_ind_map_blocks() may receive inode
with EXT4_INODE_EXTENTS flag and triggering assert.
kernel BUG at fs/ext4/indirect.c:546!
EXT4-fs (loop2): unmounting filesystem.
invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
RIP: 0010:ext4_ind_map_blocks.cold+0x2b/0x5a fs/ext4/indirect.c:546
Call Trace:
<TASK>
ext4_map_blocks+0xb9b/0x16f0 fs/ext4/inode.c:681
_ext4_get_block+0x242/0x590 fs/ext4/inode.c:822
ext4_block_write_begin+0x48b/0x12c0 fs/ext4/inode.c:1124
ext4_write_begin+0x598/0xef0 fs/ext4/inode.c:1255
ext4_da_write_begin+0x21e/0x9c0 fs/ext4/inode.c:3000
generic_perform_write+0x259/0x5d0 mm/filemap.c:3846
ext4_buffered_write_iter+0x15b/0x470 fs/ext4/file.c:285
ext4_file_write_iter+0x8e0/0x17f0 fs/ext4/file.c:679
call_write_iter include/linux/fs.h:2271 [inline]
do_iter_readv_writev+0x212/0x3c0 fs/read_write.c:735
do_iter_write+0x186/0x710 fs/read_write.c:861
vfs_iter_write+0x70/0xa0 fs/read_write.c:902
iter_file_splice_write+0x73b/0xc90 fs/splice.c:685
do_splice_from fs/splice.c:763 [inline]
direct_splice_actor+0x10f/0x170 fs/splice.c:950
splice_direct_to_actor+0x33a/0xa10 fs/splice.c:896
do_splice_direct+0x1a9/0x280 fs/splice.c:1002
do_sendfile+0xb13/0x12c0 fs/read_write.c:1255
__do_sys_sendfile64 fs/read_write.c:1323 [inline]
__se_sys_sendfile64 fs/read_write.c:1309 [inline]
__x64_sys_sendfile64+0x1cf/0x210 fs/read_write.c:1309
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Fixes: c755e251357a ("ext4: fix deadlock between inline_data and ext4_expand_extra_isize_ea()")
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Alexey Nepomnyashih <sdl@nppct.ru>
Message-ID: <20251104093326.697381-1-sdl@nppct.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
i_state_flags used on 32-bit archs, need to clear this flag when
alloc inode.
Find this issue when umount ext4, sometimes track the inode as orphan
accidently, cause ext4 mesg dump.
Fixes: acf943e9768e ("ext4: fix checks for orphan inodes")
Signed-off-by: Haibo Chen <haibo.chen@nxp.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20251104-ext4-v1-1-73691a0800f9@nxp.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
Copying the file system while it is mounted as read-only results in
a mount failure:
[~]# mkfs.ext4 -F /dev/sdc
[~]# mount /dev/sdc -o ro /mnt/test
[~]# dd if=/dev/sdc of=/dev/sda bs=1M
[~]# mount /dev/sda /mnt/test1
[ 1094.849826] JBD2: journal checksum error
[ 1094.850927] EXT4-fs (sda): Could not load journal inode
mount: mount /dev/sda on /mnt/test1 failed: Bad message
The process described above is just an abstracted way I came up with to
reproduce the issue. In the actual scenario, the file system was mounted
read-only and then copied while it was still mounted. It was found that
the mount operation failed. The user intended to verify the data or use
it as a backup, and this action was performed during a version upgrade.
Above issue may happen as follows:
ext4_fill_super
set_journal_csum_feature_set(sb)
if (ext4_has_metadata_csum(sb))
incompat = JBD2_FEATURE_INCOMPAT_CSUM_V3;
if (test_opt(sb, JOURNAL_CHECKSUM)
jbd2_journal_set_features(sbi->s_journal, compat, 0, incompat);
lock_buffer(journal->j_sb_buffer);
sb->s_feature_incompat |= cpu_to_be32(incompat);
//The data in the journal sb was modified, but the checksum was not
updated, so the data remaining in memory has a mismatch between the
data and the checksum.
unlock_buffer(journal->j_sb_buffer);
In this case, the journal sb copied over is in a state where the checksum
and data are inconsistent, so mounting fails.
To solve the above issue, update the checksum in memory after modifying
the journal sb.
Fixes: 4fd5ea43bc11 ("jbd2: checksum journal superblock")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20251103010123.3753631-1-yebin@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
params.mount_opts may come as potentially non-NUL-term string. Userspace
is expected to pass a NUL-term string. Add an extra check to ensure this
holds true. Note that further code utilizes strscpy_pad() so this is just
for proper informing the user of incorrect data being provided.
Found by Linux Verification Center (linuxtesting.org).
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20251101160430.222297-2-pchelkin@ispras.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
strscpy_pad() can't be used to copy a non-NUL-term string into a NUL-term
string of possibly bigger size. Commit 0efc5990bca5 ("string.h: Introduce
memtostr() and memtostr_pad()") provides additional information in that
regard. So if this happens, the following warning is observed:
strnlen: detected buffer overflow: 65 byte read of buffer size 64
WARNING: CPU: 0 PID: 28655 at lib/string_helpers.c:1032 __fortify_report+0x96/0xc0 lib/string_helpers.c:1032
Modules linked in:
CPU: 0 UID: 0 PID: 28655 Comm: syz-executor.3 Not tainted 6.12.54-syzkaller-00144-g5f0270f1ba00 #0
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:__fortify_report+0x96/0xc0 lib/string_helpers.c:1032
Call Trace:
<TASK>
__fortify_panic+0x1f/0x30 lib/string_helpers.c:1039
strnlen include/linux/fortify-string.h:235 [inline]
sized_strscpy include/linux/fortify-string.h:309 [inline]
parse_apply_sb_mount_options fs/ext4/super.c:2504 [inline]
__ext4_fill_super fs/ext4/super.c:5261 [inline]
ext4_fill_super+0x3c35/0xad00 fs/ext4/super.c:5706
get_tree_bdev_flags+0x387/0x620 fs/super.c:1636
vfs_get_tree+0x93/0x380 fs/super.c:1814
do_new_mount fs/namespace.c:3553 [inline]
path_mount+0x6ae/0x1f70 fs/namespace.c:3880
do_mount fs/namespace.c:3893 [inline]
__do_sys_mount fs/namespace.c:4103 [inline]
__se_sys_mount fs/namespace.c:4080 [inline]
__x64_sys_mount+0x280/0x300 fs/namespace.c:4080
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x64/0x140 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Since userspace is expected to provide s_mount_opts field to be at most 63
characters long with the ending byte being NUL-term, use a 64-byte buffer
which matches the size of s_mount_opts, so that strscpy_pad() does its job
properly. Return with error if the user still managed to provide a
non-NUL-term string here.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Fixes: 8ecb790ea8c3 ("ext4: avoid potential buffer over-read in parse_apply_sb_mount_options()")
Cc: stable@vger.kernel.org
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20251101160430.222297-1-pchelkin@ispras.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When jbd2_journal_abort() is called, the provided error code is stored
in the journal superblock. Some existing calls hard-code -EIO even when
the actual failure is not I/O related.
This patch updates those calls to pass more accurate error codes,
allowing the superblock to record the true cause of failure. This helps
improve diagnostics and debugging clarity when analyzing journal aborts.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Message-ID: <20251031210501.7337-1-wen.gang.wang@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
corrupted
There's issue when file system corrupted:
------------[ cut here ]------------
kernel BUG at fs/jbd2/transaction.c:1289!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 5 UID: 0 PID: 2031 Comm: mkdir Not tainted 6.18.0-rc1-next
RIP: 0010:jbd2_journal_get_create_access+0x3b6/0x4d0
RSP: 0018:ffff888117aafa30 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff88811a86b000 RCX: ffffffff89a63534
RDX: 1ffff110200ec602 RSI: 0000000000000004 RDI: ffff888100763010
RBP: ffff888100763000 R08: 0000000000000001 R09: ffff888100763028
R10: 0000000000000003 R11: 0000000000000000 R12: 0000000000000000
R13: ffff88812c432000 R14: ffff88812c608000 R15: ffff888120bfc000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f91d6970c99 CR3: 00000001159c4000 CR4: 00000000000006f0
Call Trace:
<TASK>
__ext4_journal_get_create_access+0x42/0x170
ext4_getblk+0x319/0x6f0
ext4_bread+0x11/0x100
ext4_append+0x1e6/0x4a0
ext4_init_new_dir+0x145/0x1d0
ext4_mkdir+0x326/0x920
vfs_mkdir+0x45c/0x740
do_mkdirat+0x234/0x2f0
__x64_sys_mkdir+0xd6/0x120
do_syscall_64+0x5f/0xfa0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
The above issue occurs with us in errors=continue mode when accompanied by
storage failures. There have been many inconsistencies in the file system
data.
In the case of file system data inconsistency, for example, if the block
bitmap of a referenced block is not set, it can lead to the situation where
a block being committed is allocated and used again. As a result, the
following condition will not be satisfied then trigger BUG_ON. Of course,
it is entirely possible to construct a problematic image that can trigger
this BUG_ON through specific operations. In fact, I have constructed such
an image and easily reproduced this issue.
Therefore, J_ASSERT() holds true only under ideal conditions, but it may
not necessarily be satisfied in exceptional scenarios. Using J_ASSERT()
directly in abnormal situations would cause the system to crash, which is
clearly not what we want. So here we directly trigger a JBD abort instead
of immediately invoking BUG_ON.
Fixes: 470decc613ab ("[PATCH] jbd2: initial copy of files from jbd")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-ID: <20251025072657.307851-1-yebin@huaweicloud.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|
There exists a memory leak of kernfs_iattrs contained as an element
of kernfs_node allocated in __kernfs_new_node(). __kernfs_setattr()
allocates kernfs_iattrs as a sub-object, and the LSM security check
incorrectly errors out and does not free the kernfs_iattrs sub-object.
Make an additional error out case that properly frees kernfs_iattrs if
security_kernfs_init_security() fails.
Fixes: e19dfdc83b60 ("kernfs: initialize security of newly created nodes")
Co-developed-by: Oliver Rosenberg <olrose55@gmail.com>
Signed-off-by: Oliver Rosenberg <olrose55@gmail.com>
Signed-off-by: Will Rosenberg <whrosenb@asu.edu>
Link: https://patch.msgid.link/20251125151332.2010687-1-whrosenb@asu.edu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
On an ARM64 A1 system, it's possible to have physical memory span
up to the 64T boundary, like below
$ lsmem -b -r -n -o range,size
0x0000000080000000-0x00000000bfffffff 1073741824
0x0000080000000000-0x000008007fffffff 2147483648
0x00000800c0000000-0x0000087fffffffff 546534588416
0x0000400000000000-0x00004000bfffffff 3221225472
0x0000400100000000-0x0000407fffffffff 545460846592
So it's time to extend /sys/kernel/mm/page_idle/bitmap to be able
to account for >2G number of pages, by raising the kernfs file size
limit.
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Link: https://patch.msgid.link/20251111202606.1505437-1-jane.chu@oracle.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
When constifying instances of struct attribute, for consistency the
corresponding .is_visible() callback should be adapted, too.
Introduce a temporary transition mechanism until all callbacks are
converted.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://patch.msgid.link/20251029-sysfs-const-attr-prep-v5-4-ea7d745acff4@weissschuh.net
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The primary consumer is link_path_walk(), calling walk_component() every
time which in turn calls step_into().
Inlining these saves overhead of 2 function calls per path component,
along with allowing the compiler to do better job optimizing them in place.
step_into() had absolutely atrocious assembly to facilitate the
slowpath. In order to lessen the burden at the callsite all the hard
work is moved into step_into_slowpath() and instead an inline-able
fastpath is implemented for rcu-walk.
The new fastpath is a stripped down step_into() RCU handling with a
d_managed() check from handle_mounts().
Benchmarked as follows on Sapphire Rapids:
1. the "before" was a kernel with not-yet-merged optimizations (notably
elision of calls to security_inode_permission() and marking ext4
inodes as not having acls as applicable)
2. "after" is the same + the prep patch + this patch
3. benchmark consists of issuing 205 calls to access(2) in a loop with
pathnames lifted out of gcc and the linker building real code, most
of which have several path components and 118 of which fail with
-ENOENT.
Result in terms of ops/s:
before: 21619
after: 22536 (+4%)
profile before:
20.25% [kernel] [k] __d_lookup_rcu
10.54% [kernel] [k] link_path_walk
10.22% [kernel] [k] entry_SYSCALL_64
6.50% libc.so.6 [.] __GI___access
6.35% [kernel] [k] strncpy_from_user
4.87% [kernel] [k] step_into
3.68% [kernel] [k] kmem_cache_alloc_noprof
2.88% [kernel] [k] walk_component
2.86% [kernel] [k] kmem_cache_free
2.14% [kernel] [k] set_root
2.08% [kernel] [k] lookup_fast
after:
23.38% [kernel] [k] __d_lookup_rcu
11.27% [kernel] [k] entry_SYSCALL_64
10.89% [kernel] [k] link_path_walk
7.00% libc.so.6 [.] __GI___access
6.88% [kernel] [k] strncpy_from_user
3.50% [kernel] [k] kmem_cache_alloc_noprof
2.01% [kernel] [k] kmem_cache_free
2.00% [kernel] [k] set_root
1.99% [kernel] [k] lookup_fast
1.81% [kernel] [k] do_syscall_64
1.69% [kernel] [k] entry_SYSCALL_64_safe_stack
While walk_component() and step_into() of course disappear from the
profile, the link_path_walk() barely gets more overhead despite the
inlining thanks to the fast path added and while completing more walks
per second.
I did not investigate why overhead grew a lot on __d_lookup_rcu().
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251120003803.2979978-2-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Symlink handling is already marked as unlikely and pushing out some of
it into pick_link() reduces register spillage on entry to step_into()
with gcc 14.2.
The compiler needed additional convincing that handle_mounts() is
unlikely to fail.
At the same time neither clang nor gcc could be convinced to tail-call
into pick_link().
While pick_link() takes an address of stack-based object as an argument
(which definitely prevents the optimization), splitting it into separate
<dentry, mount> tuple did not help. The issue persists even when
compiled without stack protector. As such nothing was done about this
for the time being to not grow the diff.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://patch.msgid.link/20251120003803.2979978-1-mjguzik@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Orangefs has no i_version handling and __orangefs_setattr already
explicitly marks the inode dirty. So instead of the using
the flags return value from generic_update_time, just call the
lower level inode_update_timestamps helper directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-7-hch@lst.de
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Since commit e41f941a2311 ("Btrfs: move over to use ->update_time") this
is not a copy of the high-level file_update_time helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-6-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Btrfs updates the device node timestamps for block device special files
when it stop using the device.
Commit 8f96a5bfa150 ("btrfs: update the bdev time directly when closing")
switch that update from the correct layering to directly call the
low-level helper on the bdev inode. This is wrong and got fixed in
commit 54fde91f52f5 ("btrfs: update device path inode time instead of
bd_inode") by updating the file system inode instead of the bdev inode,
but this kept the incorrect bypassing of the VFS interfaces and file
system ->update_times method. Fix this by using the propet vfs_utimes
interface.
Fixes: 8f96a5bfa150 ("btrfs: update the bdev time directly when closing")
Fixes: 54fde91f52f5 ("btrfs: update device path inode time instead of bd_inode")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-5-hch@lst.de
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
This will be used to replace an incorrect direct call into
generic_update_time in btrfs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-4-hch@lst.de
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
FMODE_NOCMTIME used to be just a hack for the legacy XFS handle-based
"invisible I/O", but commit e5e9b24ab8fa ("nfsd: freeze c/mtime updates
with outstanding WRITE_ATTRS delegation") started using it from
generic callers.
I'm not sure other file systems are actually read for this in general,
so the above commit should get a closer look, but for it to make any
sense, file_update_time needs to respect the flag.
Lift the check from file_modified_flags to file_update_time so that
users of file_update_time inherit the behavior and so that all the
checks are done in one place.
Fixes: e5e9b24ab8fa ("nfsd: freeze c/mtime updates with outstanding WRITE_ATTRS delegation")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-3-hch@lst.de
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Currently the two high-level APIs use two helper functions to implement
almost all of the logic. Refactor the two helpers and the common logic
into a new file_update_time_flags routine that gets the iocb flags or
0 in case of file_update_time passed so that the entire logic is
contained in a single function and can be easily understood and modified.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20251120064859.2911749-2-hch@lst.de
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|