Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs timestamp updates from Christian Brauner:
"This adds VFS support for multi-grain timestamps and converts tmpfs,
xfs, ext4, and btrfs to use them. This carries acks from all relevant
filesystems.
The VFS always uses coarse-grained timestamps when updating the ctime
and mtime after a change. This has the benefit of allowing filesystems
to optimize away a lot of metadata updates, down to around 1 per
jiffy, even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide to invalidate the cache.
Even with NFSv4, a lot of exported filesystems don't properly support
a change attribute and are subject to the same problems with timestamp
granularity. Other applications have similar issues with timestamps
(e.g., backup applications).
If we were to always use fine-grained timestamps, that would improve
the situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.
This introduces fine-grained timestamps that are used when they are
actively queried.
This uses the 31st bit of the ctime tv_nsec field to indicate that
something has queried the inode for the mtime or ctime. When this flag
is set, on the next mtime or ctime update, the kernel will fetch a
fine-grained timestamp instead of the usual coarse-grained one.
As POSIX generally mandates that when the mtime changes, the ctime
must also change the kernel always stores normalized ctime values, so
only the first 30 bits of the tv_nsec field are ever used.
Filesytems can opt into this behavior by setting the FS_MGTIME flag in
the fstype. Filesystems that don't set this flag will continue to use
coarse-grained timestamps.
Various preparatory changes, fixes and cleanups are included:
- Fixup all relevant places where POSIX requires updating ctime
together with mtime. This is a wide-range of places and all
maintainers provided necessary Acks.
- Add new accessors for inode->i_ctime directly and change all
callers to rely on them. Plain accesses to inode->i_ctime are now
gone and it is accordingly rename to inode->__i_ctime and commented
as requiring accessors.
- Extend generic_fillattr() to pass in a request mask mirroring in a
sense the statx() uapi. This allows callers to pass in a request
mask to only get a subset of attributes filled in.
- Rework timestamp updates so it's possible to drop the @now
parameter the update_time() inode operation and associated helpers.
- Add inode_update_timestamps() and convert all filesystems to it
removing a bunch of open-coding"
* tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits)
btrfs: convert to multigrain timestamps
ext4: switch to multigrain timestamps
xfs: switch to multigrain timestamps
tmpfs: add support for multigrain timestamps
fs: add infrastructure for multigrain timestamps
fs: drop the timespec64 argument from update_time
xfs: have xfs_vn_update_time gets its own timestamp
fat: make fat_update_time get its own timestamp
fat: remove i_version handling from fat_update_time
ubifs: have ubifs_update_time use inode_update_timestamps
btrfs: have it use inode_update_timestamps
fs: drop the timespec64 arg from generic_update_time
fs: pass the request_mask to generic_fillattr
fs: remove silly warning from current_time
gfs2: fix timestamp handling on quota inodes
fs: rename i_ctime field to __i_ctime
selinux: convert to ctime accessor functions
security: convert to ctime accessor functions
apparmor: convert to ctime accessor functions
sunrpc: convert to ctime accessor functions
...
|
|
Yikebaer reported an issue:
==================================================================
BUG: KASAN: slab-use-after-free in ext4_es_insert_extent+0xc68/0xcb0
fs/ext4/extents_status.c:894
Read of size 4 at addr ffff888112ecc1a4 by task syz-executor/8438
CPU: 1 PID: 8438 Comm: syz-executor Not tainted 6.5.0-rc5 #1
Call Trace:
[...]
kasan_report+0xba/0xf0 mm/kasan/report.c:588
ext4_es_insert_extent+0xc68/0xcb0 fs/ext4/extents_status.c:894
ext4_map_blocks+0x92a/0x16f0 fs/ext4/inode.c:680
ext4_alloc_file_blocks.isra.0+0x2df/0xb70 fs/ext4/extents.c:4462
ext4_zero_range fs/ext4/extents.c:4622 [inline]
ext4_fallocate+0x251c/0x3ce0 fs/ext4/extents.c:4721
[...]
Allocated by task 8438:
[...]
kmem_cache_zalloc include/linux/slab.h:693 [inline]
__es_alloc_extent fs/ext4/extents_status.c:469 [inline]
ext4_es_insert_extent+0x672/0xcb0 fs/ext4/extents_status.c:873
ext4_map_blocks+0x92a/0x16f0 fs/ext4/inode.c:680
ext4_alloc_file_blocks.isra.0+0x2df/0xb70 fs/ext4/extents.c:4462
ext4_zero_range fs/ext4/extents.c:4622 [inline]
ext4_fallocate+0x251c/0x3ce0 fs/ext4/extents.c:4721
[...]
Freed by task 8438:
[...]
kmem_cache_free+0xec/0x490 mm/slub.c:3823
ext4_es_try_to_merge_right fs/ext4/extents_status.c:593 [inline]
__es_insert_extent+0x9f4/0x1440 fs/ext4/extents_status.c:802
ext4_es_insert_extent+0x2ca/0xcb0 fs/ext4/extents_status.c:882
ext4_map_blocks+0x92a/0x16f0 fs/ext4/inode.c:680
ext4_alloc_file_blocks.isra.0+0x2df/0xb70 fs/ext4/extents.c:4462
ext4_zero_range fs/ext4/extents.c:4622 [inline]
ext4_fallocate+0x251c/0x3ce0 fs/ext4/extents.c:4721
[...]
==================================================================
The flow of issue triggering is as follows:
1. remove es
raw es es removed es1
|-------------------| -> |----|.......|------|
2. insert es
es insert es1 merge with es es1 merge with es and free es1
|----|.......|------| -> |------------|------| -> |-------------------|
es merges with newes, then merges with es1, frees es1, then determines
if es1->es_len is 0 and triggers a UAF.
The code flow is as follows:
ext4_es_insert_extent
es1 = __es_alloc_extent(true);
es2 = __es_alloc_extent(true);
__es_remove_extent(inode, lblk, end, NULL, es1)
__es_insert_extent(inode, &newes, es1) ---> insert es1 to es tree
__es_insert_extent(inode, &newes, es2)
ext4_es_try_to_merge_right
ext4_es_free_extent(inode, es1) ---> es1 is freed
if (es1 && !es1->es_len)
// Trigger UAF by determining if es1 is used.
We determine whether es1 or es2 is used immediately after calling
__es_remove_extent() or __es_insert_extent() to avoid triggering a
UAF if es1 or es2 is freed.
Reported-by: Yikebaer Aizezi <yikebaer61@gmail.com>
Closes: https://lore.kernel.org/lkml/CALcu4raD4h9coiyEBL4Bm0zjDwxC2CyPiTwsP3zFuhot6y9Beg@mail.gmail.com
Fixes: 2a69c450083d ("ext4: using nofail preallocation in ext4_es_insert_extent()")
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230815070808.3377171-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Now that ext4 does not allow inodes with the casefold flag to be
instantiated when unsupported, it's unnecessary to repeatedly check for
support later on during random filesystem operations.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230814182903.37267-3-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
It is invalid for the casefold inode flag to be set without the casefold
superblock feature flag also being set. e2fsck already considers this
case to be invalid and handles it by offering to clear the casefold flag
on the inode. __ext4_iget() also already considered this to be invalid,
sort of, but it only got so far as logging an error message; it didn't
actually reject the inode. Make it reject the inode so that other code
doesn't have to handle this case. This matches what f2fs does.
Note: we could check 's_encoding != NULL' instead of
ext4_has_feature_casefold(). This would make the check robust against
the casefold feature being enabled by userspace writing to the page
cache of the mounted block device. However, it's unsolvable in general
for filesystems to be robust against concurrent writes to the page cache
of the mounted block device. Though this very particular scenario
involving the casefold feature is solvable, we should not pretend that
we can support this model, so let's just check the casefold feature.
tune2fs already forbids enabling casefold on a mounted filesystem.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230814182903.37267-2-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Use LIST_HEAD() to initialize the list_head instead of open-coding it.
Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
Link: https://lore.kernel.org/r/20230812071839.3481909-1-ruanjinjie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
In the delalloc append write scenario, if inode's i_size is extended due
to buffer write, there are delalloc writes pending in the range up to
i_size, and no need to touch i_disksize since writeback will push
i_disksize up to i_size eventually. Offers significant performance
improvement in high-frequency append write scenarios.
I conducted tests in my 32-core environment by launching 32 concurrent
threads to append write to the same file. Each write operation had a
length of 1024 bytes and was repeated 100000 times. Without using this
patch, the test was completed in 7705 ms. However, with this patch, the
test was completed in 5066 ms, resulting in a performance improvement of
34%.
Moreover, in test scenarios of Kafka version 2.6.2, using packet size of
2K, with this patch resulted in a 10% performance improvement.
Signed-off-by: Liu Song <liusong@linux.alibaba.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230810154333.84921-1-liusong@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The most common use that s_error_work will get scheduled is now the
periodic update of the superblock. So rename it to s_sb_upd_work.
Also rename the function flush_stashed_error_work() to
update_super_work().
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
This patch introduces a mechanism to periodically check and update
the superblock within the ext4 file system. The main purpose of this
patch is to keep the disk superblock up to date. The update will be
performed if more than one hour has passed since the last update, and
if more than 16MB of data have been written to disk.
This check and update is performed within the ext4_journal_commit_callback
function, ensuring that the superblock is written while the disk is
active, rather than based on a timer that may trigger during disk idle
periods.
Discussion https://www.spinics.net/lists/linux-ext4/msg85865.html
Signed-off-by: Vitaliy Kuznetsov <vk.en.mail@gmail.com>
Link: https://lore.kernel.org/r/20230810143852.40228-1-vk.en.mail@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The commit referenced below opened up concurrent unaligned dio under
shared locking for pure overwrites. In doing so, it enabled use of
the IOMAP_DIO_OVERWRITE_ONLY flag and added a warning on unexpected
-EAGAIN returns as an extra precaution, since ext4 does not retry
writes in such cases. The flag itself is advisory in this case since
ext4 checks for unaligned I/Os and uses appropriate locking up
front, rather than on a retry in response to -EAGAIN.
As it turns out, the warning check is susceptible to false positives
because there are scenarios where -EAGAIN can be expected from lower
layers without necessarily having IOCB_NOWAIT set on the iocb. For
example, one instance of the warning has been seen where io_uring
sets IOCB_HIPRI, which in turn results in REQ_POLLED|REQ_NOWAIT on
the bio. This results in -EAGAIN if the block layer is unable to
allocate a request, etc. [Note that there is an outstanding patch to
untangle REQ_POLLED and REQ_NOWAIT such that the latter relies on
IOCB_NOWAIT, which would also address this instance of the warning.]
Another instance of the warning has been reproduced by syzbot. A dio
write is interrupted down in __get_user_pages_locked() waiting on
the mm lock and returns -EAGAIN up the stack. If the iomap dio
iteration layer has made no progress on the write to this point,
-EAGAIN returns up to the filesystem and triggers the warning.
This use of the overwrite flag in ext4 is precautionary and
half-baked. I.e., ext4 doesn't actually implement overwrite checking
in the iomap callbacks when the flag is set, so the only extra
verification it provides are i_size checks in the generic iomap dio
layer. Combined with the tendency for false positives, the added
verification is not worth the extra trouble. Remove the flag,
associated warning, and update the comments to document when
concurrent unaligned dio writes are allowed and why said flag is not
used.
Cc: stable@kernel.org
Reported-by: syzbot+5050ad0fb47527b1808a@syzkaller.appspotmail.com
Reported-by: Pengfei Xu <pengfei.xu@intel.com>
Fixes: 310ee0902b8d ("ext4: allow concurrent unaligned dio overwrites")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230810165559.946222-1-bfoster@redhat.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When setup_system_zone, flex_bg is not initialized so it is always 1.
Use a new helper function, ext4_num_base_meta_blocks() which does not
depend on sbi->s_log_groups_per_flex being initialized.
[ Squashed two patches in the Link URL's below together into a single
commit, which is simpler to review/understand. Also fix checkpatch
warnings. --TYT ]
Cc: stable@kernel.org
Signed-off-by: Wang Jianjian <wangjianjian0@foxmail.com>
Link: https://lore.kernel.org/r/tencent_21AF0D446A9916ED5C51492CC6C9A0A77B05@qq.com
Link: https://lore.kernel.org/r/tencent_D744D1450CC169AEA77FCF0A64719909ED05@qq.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
These functions do not have its function implementation.
So those function declaration is useless. Remove these
Signed-off-by: Cai Xinchen <caixinchen1@huawei.com>
Link: https://lore.kernel.org/r/20230802030025.173148-1-caixinchen1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
clang's static analysis warning: fs/ext4/mballoc.c
line 4178, column 6, Branch condition evaluates to a garbage value.
err is uninitialized and will be judged when 'len <= 0' or
it first enters the loop while the condition "!ext4_sb_block_valid()"
is true. Although this can't make problems now, it's better to
correct it.
Signed-off-by: Su Hui <suhui@nfschina.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20230725043310.1227621-1-suhui@nfschina.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Signed-off-by: Lu Hongfei <luhongfei@vivo.com>
Link: https://lore.kernel.org/r/20230707115907.26637-1-luhongfei@vivo.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The return value type of i_blocksize() is 'unsigned int', so the
type of blocksize has been modified from 'int' to 'unsigned int'
to ensure data type consistency.
Signed-off-by: Lu Hongfei <luhongfei@vivo.com>
Link: https://lore.kernel.org/r/20230707105516.9156-1-luhongfei@vivo.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Running generic/475(filesystem consistent tests after power cut) could
easily trigger unattached inode error while doing fsck:
Unattached zero-length inode 39405. Clear? no
Unattached inode 39405
Connect to /lost+found? no
Above inconsistence is caused by following process:
P1 P2
ext4_create
inode = ext4_new_inode_start_handle // itable records nlink=1
ext4_add_nondir
err = ext4_add_entry // ENOSPC
ext4_append
ext4_bread
ext4_getblk
ext4_map_blocks // returns ENOSPC
drop_nlink(inode) // won't be updated into disk inode
ext4_orphan_add(handle, inode)
ext4_orphan_file_add
ext4_journal_stop(handle)
jbd2_journal_commit_transaction // commit success
>> power cut <<
ext4_fill_super
ext4_load_and_init_journal // itable records nlink=1
ext4_orphan_cleanup
ext4_process_orphan
if (inode->i_nlink) // true, inode won't be deleted
Then, allocated inode will be reserved on disk and corresponds to no
dentries, so e2fsck reports 'unattached inode' problem.
The problem won't happen if orphan file feature is disabled, because
ext4_orphan_add() will update disk inode in orphan list mode. There
are several places not updating disk inode while putting inode into
orphan area, such as ext4_add_nondir(), ext4_symlink() and whiteout
in ext4_rename(). Fix it by updating inode into disk in all error
branches of these places.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217605
Fixes: 02f310fcf47f ("ext4: Speedup ext4 orphan inode handling")
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230628132011.650383-1-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
ext4_get_journal() and ext4_get_dev_journal() return NULL if they failed
to init journal, making them return proper error value instead, also
rename them to ext4_open_{inode,dev}_journal().
[ Folded fix to ext4_calculate_overhead() to check for an ERR_PTR
instead of NULL. ]
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230811063610.2980059-13-yi.zhang@huaweicloud.com
Reported-by: syzbot+b3123e6d9842e526de39@syzkaller.appspotmail.com
Link: https://lore.kernel.org/r/20230826011029.2023140-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Remove the unnecessary encoding of page order into an enum and pass the
page order directly. That lets us get rid of pe_order().
The switch constructs have to be changed to if/else constructs to prevent
GCC from warning on builds with 3-level page tables where PMD_ORDER and
PUD_ORDER have the same value.
If you are looking at this commit because your driver stopped compiling,
look at the previous commit as well and audit your driver to be sure it
doesn't depend on mmap_lock being held in its ->huge_fault method.
[willy@infradead.org: use "order %u" to match the (non dev_t) style]
Link: https://lkml.kernel.org/r/ZOUYekbtTv+n8hYf@casper.infradead.org
Link: https://lkml.kernel.org/r/20230818202335.2739663-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "New page table range API", v6.
This patchset changes the API used by the MM to set up page table entries.
The four APIs are:
set_ptes(mm, addr, ptep, pte, nr)
update_mmu_cache_range(vma, addr, ptep, nr)
flush_dcache_folio(folio)
flush_icache_pages(vma, page, nr)
flush_dcache_folio() isn't technically new, but no architecture
implemented it, so I've done that for them. The old APIs remain around
but are mostly implemented by calling the new interfaces.
The new APIs are based around setting up N page table entries at once.
The N entries belong to the same PMD, the same folio and the same VMA, so
ptep++ is a legitimate operation, and locking is taken care of for you.
Some architectures can do a better job of it than just a loop, but I have
hesitated to make too deep a change to architectures I don't understand
well.
One thing I have changed in every architecture is that PG_arch_1 is now a
per-folio bit instead of a per-page bit when used for dcache clean/dirty
tracking. This was something that would have to happen eventually, and it
makes sense to do it now rather than iterate over every page involved in a
cache flush and figure out if it needs to happen.
The point of all this is better performance, and Fengwei Yin has measured
improvement on x86. I suspect you'll see improvement on your architecture
too. Try the new will-it-scale test mentioned here:
https://lore.kernel.org/linux-mm/20230206140639.538867-5-fengwei.yin@intel.com/
You'll need to run it on an XFS filesystem and have
CONFIG_TRANSPARENT_HUGEPAGE set.
This patchset is the basis for much of the anonymous large folio work
being done by Ryan, so it's received quite a lot of testing over the last
few months.
This patch (of 38):
Determine if a value lies within a range more efficiently (subtraction +
comparison vs two comparisons and an AND). It also has useful (under some
circumstances) behaviour if the range exceeds the maximum value of the
type. Convert all the conflicting definitions of in_range() within the
kernel; some can use the generic definition while others need their own
definition.
Link: https://lkml.kernel.org/r/20230802151406.3735276-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20230802151406.3735276-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Factor out a new helper form ext4_get_dev_journal() to get external
journal bdev and check validation of this device, drop ext4_blkdev_get()
helper, and also remove duplicate check of journal feature. It makes
ext4_get_dev_journal() more clear than before.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230811063610.2980059-12-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Current jbd2_journal_init_{dev,inode} return NULL if some error
happens, make them to pass out proper error return value.
[ Fix from Yang Yingliang folded in. ]
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230811063610.2980059-11-yi.zhang@huaweicloud.com
Link: https://lore.kernel.org/r/20230822030018.644419-1-yangyingliang@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Patch series "mm, netfs, fscache: Stop read optimisation when folio
removed from pagecache", v7.
This fixes an optimisation in fscache whereby we don't read from the cache
for a particular file until we know that there's data there that we don't
have in the pagecache. The problem is that I'm no longer using PG_fscache
(aka PG_private_2) to indicate that the page is cached and so I don't get
a notification when a cached page is dropped from the pagecache.
The first patch merges some folio_has_private() and
filemap_release_folio() pairs and introduces a helper,
folio_needs_release(), to indicate if a release is required.
The second patch is the actual fix. Following Willy's suggestions[1], it
adds an AS_RELEASE_ALWAYS flag to an address_space that will make
filemap_release_folio() always call ->release_folio(), even if
PG_private/PG_private_2 aren't set. folio_needs_release() is altered to
add a check for this.
This patch (of 2):
Make filemap_release_folio() check folio_has_private(). Then, in most
cases, where a call to folio_has_private() is immediately followed by a
call to filemap_release_folio(), we can get rid of the test in the pair.
There are a couple of sites in mm/vscan.c that this can't so easily be
done. In shrink_folio_list(), there are actually three cases (something
different is done for incompletely invalidated buffers), but
filemap_release_folio() elides two of them.
In shrink_active_list(), we don't have have the folio lock yet, so the
check allows us to avoid locking the page unnecessarily.
A wrapper function to check if a folio needs release is provided for those
places that still need to do it in the mm/ directory. This will acquire
additional parts to the condition in a future patch.
After this, the only remaining caller of folio_has_private() outside of
mm/ is a check in fuse.
Link: https://lkml.kernel.org/r/20230628104852.3391651-1-dhowells@redhat.com
Link: https://lkml.kernel.org/r/20230628104852.3391651-2-dhowells@redhat.com
Reported-by: Rohith Surabattula <rohiths.msft@gmail.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steve French <sfrench@samba.org>
Cc: Shyam Prasad N <nspmangalore@gmail.com>
Cc: Rohith Surabattula <rohiths.msft@gmail.com>
Cc: Dave Wysochanski <dwysocha@redhat.com>
Cc: Dominique Martinet <asmadeus@codewreck.org>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Xiubo Li <xiubli@redhat.com>
Cc: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
block_commit_write() always returns 0, this patch changes it to return
void.
Link: https://lkml.kernel.org/r/20230626055518.842392-3-beanhuo@iokpp.de
Signed-off-by: Bean Huo <beanhuo@micron.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Luís Henriques <ocfs2-devel@oss.oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using
the existing helper folio_next_index().
Link: https://lkml.kernel.org/r/20230627174349.491803-1-sidhartha.kumar@oracle.com
Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use the generic fs_holder_ops to shut down the file system when the
log device goes away instead of duplicating the logic.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230802154131.2221419-11-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Just like get_tree_bdev needs to drop s_umount when opening the main
device, we need to do the same for the ext4 log device to avoid a
potential lock order reversal with s_unmount for the mark_dead path.
It might be preferable to just drop s_umount over ->fill_super entirely,
but that will require a fairly massive audit first, so we'll do the easy
version here first.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Message-Id: <20230802154131.2221419-10-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.
For ext4, we only need to enable the FS_MGTIME flag.
Acked-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230807-mgctime-v7-12-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Check for sb->s_type which is the right place to look at the file system
type, not the holder, which is just an implementation detail in the VFS
helpers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Message-Id: <20230802154131.2221419-6-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
blkdev_put must not be called under sb->s_umount to avoid a lock order
reversal with disk->open_mutex. Move closing the external journal device
into ->kill_sb to archive that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Message-Id: <20230809220545.1308228-9-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
generic_fillattr just fills in the entire stat struct indiscriminately
today, copying data from the inode. There is at least one attribute
(STATX_CHANGE_COOKIE) that can have side effects when it is reported,
and we're looking at adding more with the addition of multigrain
timestamps.
Add a request_mask argument to generic_fillattr and have most callers
just pass in the value that is passed to getattr. Have other callers
(e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
STATX_CHANGE_COOKIE into generic_fillattr.
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
bdev->bd_super is unused now, remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Message-Id: <20230807112625.652089-5-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
__ext4_journal_get_write_access already has a super_block available,
and there is no need to go from that to the bdev to go back to the
owning super_block.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Message-Id: <20230807112625.652089-3-hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Using CR_BEST_AVAIL_LEN only make sense for regular files, as for
non-regular files we never normalize the allocation request length i.e.
goal len is same as original length (ac_g_ex.fe_len == ac_o_ex.fe_len).
Hence there is no scope of trimming the goal length to make it
satisfy original request len. Thus this patch avoids using
CR_BEST_AVAIL_LEN criteria for non-regular files request.
Cc: stable@kernel.org
Fixes: 33122aa930f1 ("ext4: Add allocation criteria 1.5 (CR1_5)")
Reported-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Tested-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/2a694c748ff8b8c4b416995a24f06f07b55047a8.1689516047.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
If the filename casefolding fails, we'll be leaking memory from the
fscrypt_name struct, namely from the 'crypto_buf.name' member.
Make sure we free it in the error path on both ext4_fname_setup_filename()
and ext4_fname_prepare_lookup() functions.
Cc: stable@kernel.org
Fixes: 1ae98e295fa2 ("ext4: optimize match for casefolded encrypted dirs")
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20230803091713.13239-1-lhenriques@suse.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
We named criteria with CR_XXX, correct stale comment to criteria with
raw number.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230801143204.2284343-11-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Return good group when it's found in loop to remove futher check if good
group is found after loop.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230801143204.2284343-10-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Return good group when it's found in loop to remove futher check if good
group is found after loop.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230801143204.2284343-9-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Remove ext4_set_bit_atomic and ext4_clear_bit_atomic which are defined but not
used.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230801143204.2284343-8-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Replace the traditional ternary conditional operator with with max()/min()
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230801143204.2284343-7-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The return at end of void function is unnecessary, just remove it.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230801143204.2284343-6-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Use intuitive is_power_of_2 helper in ext4_mb_regular_allocator.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230801143204.2284343-5-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Return good group when it's found in loop to remove unnecessary NULL
initialization of grp and futher check if good group is found after loop.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230801143204.2284343-4-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
ngroups is ext4_group_t (unsigned int) while next_linear_group treat it
in int. If ngroups is bigger than max number described by int, it will
be treat as a negative number. Then "return group + 1 >= ngroups ? 0 :
group + 1;" may keep returning 0.
Switch int to ext4_group_t in next_linear_group to fix the overflow.
Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230801143204.2284343-3-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Group corruption check will access memory of grp and will trigger kernel
crash if grp is NULL. So do NULL check before corruption check.
Fixes: 5354b2af3406 ("ext4: allow ext4_get_group_info() to fail")
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230801143204.2284343-2-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Replace CR_FAST with ext4_mb_cr_expensive() inline function for better
readability. This function returns true if the criteria is one of the
expensive/slower ones where lots of disk IO/prefetching is acceptable.
No functional changes are intended in this patch.
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230630085927.140137-1-ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Add a new config option that controls building the buffer_head code, and
select it from all file systems and stacking drivers that need it.
For the block device nodes and alternative iomap based buffered I/O path
is provided when buffer_head support is not enabled, and iomap needs a
a small tweak to define the IOMAP_F_BUFFER_HEAD flag to 0 to not call
into the buffer_head code when it doesn't exist.
Otherwise this is just Kconfig and ifdef changes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230801172201.1923299-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
block_page_mkwrite_return is neither block nor mkwrite specific, and
should not be under CONFIG_BLOCK. Move it to mm.h and rename it to
vmf_fs_error.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230801172201.1923299-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The multi-mount protection kthread checks for read-only filesystem and
aborts in that case. The remount code actually handles stopping of the
kthread on remount so the only purpose of the check is in case of
emergency remount read-only. Replace the check for read-only filesystem
with a check for shutdown filesystem as running MMP on such is risky
anyway and it makes ordering of things during remount simpler.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230616165109.21695-11-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
JBD2 code will quickly return without doing anything when there's
nothing to commit so there's no point in the read-only check in
ext4_force_commit(). Just drop it.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230616165109.21695-10-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
We should not have dirty inodes on read-only filesystem. Also silently
bailing without writing anything would be a problem when we enable
quotas during remount while the filesystem is read-only. So drop the
read-only check.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230616165109.21695-9-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
We better should not be initializing inode tables on read-only
filesystem. The following transaction start will warn us and make the
function bail anyway so drop the pointless check.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230616165109.21695-8-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|