summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2025-01-27Merge tag 'nfsd-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds
Pull nfsd updates from Chuck Lever: "Jeff Layton contributed an implementation of NFSv4.2+ attribute delegation, as described here: https://www.ietf.org/archive/id/draft-ietf-nfsv4-delstid-08.html This interoperates with similar functionality introduced into the Linux NFS client in v6.11. An attribute delegation permits an NFS client to manage a file's mtime, rather than flushing dirty data to the NFS server so that the file's mtime reflects the last write, which is considerably slower. Neil Brown contributed dynamic NFSv4.1 session slot table resizing. This facility enables NFSD to increase or decrease the number of slots per NFS session depending on server memory availability. More session slots means greater parallelism. Chuck Lever fixed a long-standing latent bug where NFSv4 COMPOUND encoding screws up when crossing a page boundary in the encoding buffer. This is a zero-day bug, but hitting it is rare and depends on the NFS client implementation. The Linux NFS client does not happen to trigger this issue. A variety of bug fixes and other incremental improvements fill out the list of commits in this release. Great thanks to all contributors, reviewers, testers, and bug reporters who participated during this development cycle" * tag 'nfsd-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (42 commits) sunrpc: Remove gss_{de,en}crypt_xdr_buf deadcode sunrpc: Remove gss_generic_token deadcode sunrpc: Remove unused xprt_iter_get_xprt Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages" nfsd: implement OPEN_ARGS_SHARE_ACCESS_WANT_OPEN_XOR_DELEGATION nfsd: handle delegated timestamps in SETATTR nfsd: add support for delegated timestamps nfsd: rework NFS4_SHARE_WANT_* flag handling nfsd: add support for FATTR4_OPEN_ARGUMENTS nfsd: prepare delegation code for handing out *_ATTRS_DELEG delegations nfsd: rename NFS4_SHARE_WANT_* constants to OPEN4_SHARE_ACCESS_WANT_* nfsd: switch to autogenerated definitions for open_delegation_type4 nfs_common: make include/linux/nfs4.h include generated nfs4_1.h nfsd: fix handling of delegated change attr in CB_GETATTR SUNRPC: Document validity guarantees of the pointer returned by reserve_space NFSD: Insulate nfsd4_encode_fattr4() from page boundaries in the encode buffer NFSD: Insulate nfsd4_encode_secinfo() from page boundaries in the encode buffer NFSD: Refactor nfsd4_do_encode_secinfo() again NFSD: Insulate nfsd4_encode_readlink() from page boundaries in the encode buffer NFSD: Insulate nfsd4_encode_read_plus_data() from page boundaries in the encode buffer ...
2025-01-27add a string-to-qstr constructorAl Viro
Quite a few places want to build a struct qstr by given string; it would be convenient to have a primitive doing that, rather than open-coding it via QSTR_INIT(). The closest approximation was in bcachefs, but that expands to initializer list - {.len = strlen(string), .name = string}. It would be more useful to have it as compound literal - (struct qstr){.len = strlen(string), .name = string}. Unlike initializer list it's a valid expression. What's more, it's a valid lvalue - it's an equivalent of anonymous local variable with such initializer, so the things like path->dentry = d_alloc_pseudo(mnt->mnt_sb, &QSTR(name)); are valid. It can also be used as initializer, with identical effect - struct qstr x = (struct qstr){.name = s, .len = strlen(s)}; is equivalent to struct qstr anon_variable = {.name = s, .len = strlen(s)}; struct qstr x = anon_variable; // anon_variable is never used after that point and any even remotely sane compiler will manage to collapse that into struct qstr x = {.name = s, .len = strlen(s)}; What compound literals can't be used for is initialization of global variables, but those are covered by QSTR_INIT(). This commit lifts definition(s) of QSTR() into linux/dcache.h, converts it to compound literal (all bcachefs users are fine with that) and converts assorted open-coded instances to using that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-279p: fix ->rename_sem exclusionAl Viro
9p wants to be able to build a path from given dentry to fs root and keep it valid over a blocking operation. ->s_vfs_rename_mutex would be a natural candidate, but there are places where we need that and where we have no way to tell if ->s_vfs_rename_mutex is already held deeper in callchain. Moreover, it's only held for cross-directory renames; name changes within the same directory happen without it. Solution: * have d_move() done in ->rename() rather than in its caller * maintain a 9p-private rwsem (per-filesystem) * hold it exclusive over the relevant part of ->rename() * hold it shared over the places where we want the path. That almost works. FS_RENAME_DOES_D_MOVE is enough to put all d_move() and d_exchange() calls under filesystem's control. However, there's also __d_unalias(), which isn't covered by any of that. If ->lookup() hits a directory inode with preexisting dentry elsewhere (due to e.g. rename done on server behind our back), d_splice_alias() called by ->lookup() will move/rename that alias. Add a couple of optional methods, so that __d_unalias() would do if alias->d_op->d_unalias_trylock != NULL if (!alias->d_op->d_unalias_trylock(alias)) fail (resulting in -ESTALE from lookup) __d_move(...) if alias->d_op->d_unalias_unlock != NULL alias->d_unalias_unlock(alias) where it currently does __d_move(). 9p instances do down_write_trylock() and up_write() of ->rename_mutex. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27orangefs_d_revalidate(): use stable parent inode and name passed by callerAl Viro
->d_name use is a UAF if the userland side of things can be slowed down by attacker. Tested-by: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27ocfs2_dentry_revalidate(): use stable parent inode and name passed by callerAl Viro
theoretically, ->d_name use in there is a UAF, but only if you are messing with tracepoints... Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27nfs: fix ->d_revalidate() UAF on ->d_name accessesAl Viro
Pass the stable name all the way down to ->rpc_ops->lookup() instances. Note that passing &dentry->d_name is safe in e.g. nfs_lookup() - it *is* stable there, as it is in ->create() et.al. dget_parent() in nfs_instantiate() should be redundant - it'd better be stable there; if it's not, we have more trouble, since ->d_name would also be unsafe in such case. nfs_submount() and nfs4_submount() may or may not require fixes - if they ever get moved on server with fhandle preserved, we are in trouble there... UAF window is fairly narrow here and exfiltration requires the ability to watch the traffic. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27nfs{,4}_lookup_validate(): use stable parent inode passed by callerAl Viro
we can't kill __nfs_lookup_revalidate() completely, but ->d_parent boilerplate in it is gone Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27gfs2_drevalidate(): use stable parent inode and name passed by callerAl Viro
No need to mess with dget_parent() for the former; for the latter we really should not rely upon ->d_name.name remaining stable. Theoretically a UAF, but it's hard to exfiltrate the information... Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27fuse_dentry_revalidate(): use stable parent inode and name passed by callerAl Viro
No need to mess with dget_parent() for the former; for the latter we really should not rely upon ->d_name.name remaining stable - it's a real-life UAF. Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27vfat_revalidate{,_ci}(): use stable parent inode passed by callerAl Viro
Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27exfat_d_revalidate(): use stable parent inode passed by callerAl Viro
... no need to bother with ->d_lock and ->d_parent->d_inode. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27fscrypt_d_revalidate(): use stable parent inode passed by callerAl Viro
The only thing it's using is parent directory inode and we are already given a stable reference to that - no need to bother with boilerplate. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27ceph_d_revalidate(): propagate stable name down into request encodingAl Viro
Currently get_fscrypt_altname() requires ->r_dentry->d_name to be stable and it gets that in almost all cases. The only exception is ->d_revalidate(), where we have a stable name, but it's passed separately - dentry->d_name is not stable there. Propagate it down to get_fscrypt_altname() as a new field of struct ceph_mds_request - ->r_dname, to be used instead ->r_dentry->d_name when non-NULL. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27ceph_d_revalidate(): use stable parent inode passed by callerAl Viro
No need to mess with the boilerplate for obtaining what we already have. Note that ceph is one of the "will want a path from filesystem root if we want to talk to server" cases, so the name of the last component is of little use - it is passed to fscrypt_d_revalidate() and it's used to deal with (also crypt-related) case in request marshalling, when encrypted name turns out to be too long. The former is not a problem, but the latter is racy; that part will be handled in the next commit. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27afs_d_revalidate(): use stable name and parent inode passed by callerAl Viro
No need to bother with boilerplate for obtaining the latter and for the former we really should not count upon ->d_name.name remaining stable under us. Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27Pass parent directory inode and expected name to ->d_revalidate()Al Viro
->d_revalidate() often needs to access dentry parent and name; that has to be done carefully, since the locking environment varies from caller to caller. We are not guaranteed that dentry in question will not be moved right under us - not unless the filesystem is such that nothing on it ever gets renamed. It can be dealt with, but that results in boilerplate code that isn't even needed - the callers normally have just found the dentry via dcache lookup and want to verify that it's in the right place; they already have the values of ->d_parent and ->d_name stable. There is a couple of exceptions (overlayfs and, to less extent, ecryptfs), but for the majority of calls that song and dance is not needed at all. It's easier to make ecryptfs and overlayfs find and pass those values if there's a ->d_revalidate() instance to be called, rather than doing that in the instances. This commit only changes the calling conventions; making use of supplied values is left to followups. NOTE: some instances need more than just the parent - things like CIFS may need to build an entire path from filesystem root, so they need more precautions than the usual boilerplate. This series doesn't do anything to that need - these filesystems have to keep their locking mechanisms (rename_lock loops, use of dentry_path_raw(), private rwsem a-la v9fs). One thing to keep in mind when using name is that name->name will normally point into the pathname being resolved; the filename in question occupies name->len bytes starting at name->name, and there is NUL somewhere after it, but it the next byte might very well be '/' rather than '\0'. Do not ignore name->len. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27generic_ci_d_compare(): use shortname_storageAl Viro
... and check the "name might be unstable" predicate the right way. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27ext4 fast_commit: make use of name_snapshot primitivesAl Viro
... rather than open-coding them. As a bonus, that avoids the pointless work with extra allocations, etc. for long names. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27dissolve external_name.u into separate membersAl Viro
... and document the constraints on the layout. Kept separate from the previous commit to keep the noise separate from actual changes. The reason for explicit __aligned() on ->name[] rather than relying upon the alignment of the previous field is that the previous iteration of that commit tried to save 4 bytes on 64bit by eliminating a hole in there, which broke the assumptions in dentry_string_cmp(). Better spell it out and avoid the temptation for the future... Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-01-27Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio updates from Michael Tsirkin: "A small number of improvements all over the place: - vdpa/octeon support for multiple interrupts - virtio-pci support for error recovery - vp_vdpa support for notification with data - vhost/net fix to set num_buffers for spec compliance - virtio-mem now works with kdump on s390 And small cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (23 commits) virtio_blk: Add support for transport error recovery virtio_pci: Add support for PCIe Function Level Reset vhost/net: Set num_buffers for virtio 1.0 vdpa/octeon_ep: read vendor-specific PCI capability virtio-pci: define type and header for PCI vendor data vdpa/octeon_ep: handle device config change events vdpa/octeon_ep: enable support for multiple interrupts per device vdpa: solidrun: Replace deprecated PCI functions s390/kdump: virtio-mem kdump support (CONFIG_PROC_VMCORE_DEVICE_RAM) virtio-mem: support CONFIG_PROC_VMCORE_DEVICE_RAM virtio-mem: remember usable region size virtio-mem: mark device ready before registering callbacks in kdump mode fs/proc/vmcore: introduce PROC_VMCORE_DEVICE_RAM to detect device RAM ranges in 2nd kernel fs/proc/vmcore: factor out freeing a list of vmcore ranges fs/proc/vmcore: factor out allocating a vmcore range and adding it to a list fs/proc/vmcore: move vmcore definitions out of kcore.h fs/proc/vmcore: prefix all pr_* with "vmcore:" fs/proc/vmcore: disallow vmcore modifications while the vmcore is open fs/proc/vmcore: replace vmcoredd_mutex by vmcore_mutex fs/proc/vmcore: convert vmcore_cb_lock into vmcore_mutex ...
2025-01-27fuse: prevent disabling io-uring on active connectionsBernd Schubert
The enable_uring module parameter allows administrators to enable/disable io-uring support for FUSE at runtime. However, disabling io-uring while connections already have it enabled can lead to an inconsistent state. Fix this by keeping io-uring enabled on connections that were already using it, even if the module parameter is later disabled. This ensures active FUSE mounts continue to function correctly. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27fuse: enable fuse-over-io-uringBernd Schubert
All required parts are handled now, fuse-io-uring can be enabled. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27fuse: block request allocation until io-uring init is completeBernd Schubert
Avoid races and block request allocation until io-uring queues are ready. This is a especially important for background requests, as bg request completion might cause lock order inversion of the typical queue->lock and then fc->bg_lock fuse_request_end spin_lock(&fc->bg_lock); flush_bg_queue fuse_send_one fuse_uring_queue_fuse_req spin_lock(&queue->lock); Signed-off-by: Bernd Schubert <bernd@bsbernd.com> Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27fuse: {io-uring} Prevent mount point hang on fuse-server terminationBernd Schubert
When the fuse-server terminates while the fuse-client or kernel still has queued URING_CMDs, these commands retain references to the struct file used by the fuse connection. This prevents fuse_dev_release() from being invoked, resulting in a hung mount point. This patch addresses the issue by making queued URING_CMDs cancelable, allowing fuse_dev_release() to proceed as expected and preventing the mount point from hanging. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27fuse: Allow to queue bg requests through io-uringBernd Schubert
This prepares queueing and sending background requests through io-uring. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27fuse: Allow to queue fg requests through io-uringBernd Schubert
This prepares queueing and sending foreground requests through io-uring. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27fuse: {io-uring} Make fuse_dev_queue_{interrupt,forget} non-staticBernd Schubert
These functions are also needed by fuse-over-io-uring. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27fuse: {io-uring} Handle teardown of ring entriesBernd Schubert
On teardown struct file_operations::uring_cmd requests need to be completed by calling io_uring_cmd_done(). Not completing all ring entries would result in busy io-uring tasks giving warning messages in intervals and unreleased struct file. Additionally the fuse connection and with that the ring can only get released when all io-uring commands are completed. Completion is done with ring entries that are a) in waiting state for new fuse requests - io_uring_cmd_done is needed b) already in userspace - io_uring_cmd_done through teardown is not needed, the request can just get released. If fuse server is still active and commits such a ring entry, fuse_uring_cmd() already checks if the connection is active and then complete the io-uring itself with -ENOTCONN. I.e. special handling is not needed. This scheme is basically represented by the ring entry state FRRS_WAIT and FRRS_USERSPACE. Entries in state: - FRRS_INIT: No action needed, do not contribute to ring->queue_refs yet - All other states: Are currently processed by other tasks, async teardown is needed and it has to wait for the two states above. It could be also solved without an async teardown task, but would require additional if conditions in hot code paths. Also in my personal opinion the code looks cleaner with async teardown. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27fuse: Add io-uring sqe commit and fetch supportBernd Schubert
This adds support for fuse request completion through ring SQEs (FUSE_URING_CMD_COMMIT_AND_FETCH handling). After committing the ring entry it becomes available for new fuse requests. Handling of requests through the ring (SQE/CQE handling) is complete now. Fuse request data are copied through the mmaped ring buffer, there is no support for any zero copy yet. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> # io_uring Reviewed-by: Luis Henriques <luis@igalia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2025-01-27ceph: exchange hardcoded value on NAME_MAXViacheslav Dubeyko
Initially, ceph_fs_debugfs_init() had temporary name buffer with hardcoded length of 80 symbols. Then, it was hardcoded again for 100 symbols. Finally, it makes sense to exchange hardcoded value on properly defined constant and 255 symbols should be enough for any name case. Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Reviewed-by: Patrick Donnelly <pdonnell@ibm.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2025-01-27ceph: streamline request head structures in MDS clientLiang Jie
The existence of the ceph_mds_request_head_old structure in the MDS client code is no longer required due to improvements in handling different MDS request header versions. This patch removes the now redundant ceph_mds_request_head_old structure and replaces its usage with the flexible and extensible ceph_mds_request_head structure. Changes include: - Modification of find_legacy_request_head to directly cast the pointer to ceph_mds_request_head_legacy without going through the old structure. - Update sizeof calculations in create_request_message to use offsetofend for consistency and future-proofing, rather than referencing the old structure. - Use of the structured ceph_mds_request_head directly instead of the old one. Additionally, this consolidation normalizes the handling of request_head_version v1 to align with versions v2 and v3, leading to a more consistent and maintainable codebase. These changes simplify the codebase and reduce potential confusion stemming from the existence of an obsolete structure. Signed-off-by: Liang Jie <liangjie@lixiang.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2025-01-27virtio-mem: support CONFIG_PROC_VMCORE_DEVICE_RAMDavid Hildenbrand
Let's implement the get_device_ram() vmcore callback, so architectures that select NEED_PROC_VMCORE_NEED_DEVICE_RAM, like s390 soon, can include that memory in a crash dump. Merge ranges, and process ranges that might contain a mixture of plugged and unplugged, to reduce the total number of ranges. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-12-david@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27fs/proc/vmcore: introduce PROC_VMCORE_DEVICE_RAM to detect device RAM ranges ↵David Hildenbrand
in 2nd kernel s390 allocates+prepares the elfcore hdr in the dump (2nd) kernel, not in the crashed kernel. RAM provided by memory devices such as virtio-mem can only be detected using the device driver; when vmcore_init() is called, these device drivers are usually not loaded yet, or the devices did not get probed yet. Consequently, on s390 these RAM ranges will not be included in the crash dump, which makes the dump partially corrupt and is unfortunate. Instead of deferring the vmcore_init() call, to an (unclear?) later point, let's reuse the vmcore_cb infrastructure to obtain device RAM ranges as the device drivers probe the device and get access to this information. Then, we'll add these ranges to the vmcore, adding more PT_LOAD entries and updating the offsets+vmcore size. Use a separate Kconfig option to be set by an architecture to include this code only if the arch really needs it. Further, we'll make the config depend on the relevant drivers (i.e., virtio_mem) once they implement support (next). The alternative of having a PROVIDE_PROC_VMCORE_DEVICE_RAM config option was dropped for now for simplicity. The current target use case is s390, which only creates an elf64 elfcore, so focusing on elf64 is sufficient. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-9-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27fs/proc/vmcore: factor out freeing a list of vmcore rangesDavid Hildenbrand
Let's factor it out into include/linux/crash_dump.h, from where we can use it also outside of vmcore.c later. Acked-by: Baoquan He <bhe@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-8-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27fs/proc/vmcore: factor out allocating a vmcore range and adding it to a listDavid Hildenbrand
Let's factor it out into include/linux/crash_dump.h, from where we can use it also outside of vmcore.c later. Acked-by: Baoquan He <bhe@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-7-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27fs/proc/vmcore: move vmcore definitions out of kcore.hDavid Hildenbrand
These vmcore defines are not related to /proc/kcore, move them out. We'll move "struct vmcoredd_node" to vmcore.c, because it is only used internally. While "struct vmcore" is only used internally for now, we're planning on using it from inline functions in crash_dump.h next, so move it to crash_dump.h. While at it, rename "struct vmcore" to "struct vmcore_range", which is a more suitable name and will make the usage of it outside of vmcore.c clearer. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-6-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27fs/proc/vmcore: prefix all pr_* with "vmcore:"David Hildenbrand
Let's use "vmcore: " as a prefix, converting the single "Kdump: vmcore not initialized" one to effectively be "vmcore: not initialized". Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-5-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27fs/proc/vmcore: disallow vmcore modifications while the vmcore is openDavid Hildenbrand
The vmcoredd_update_size() call and its effects (size/offset changes) are currently completely unsynchronized, and will cause trouble when performed concurrently, or when done while someone is already reading the vmcore. Let's protect all vmcore modifications by the vmcore_mutex, disallow vmcore modifications while the vmcore is open, and warn on vmcore modifications after the vmcore was already opened once: modifications while the vmcore is open are unsafe, and modifications after the vmcore was opened indicates trouble. Properly synchronize against concurrent opening of the vmcore. No need to grab the mutex during mmap()/read(): after we opened the vmcore, modifications are impossible. It's worth noting that modifications after the vmcore was opened are completely unexpected, so failing if open, and warning if already opened (+closed again) is good enough. This change not only handles concurrent adding of device dumps + concurrent reading of the vmcore properly, it also prepares for other mechanisms that will modify the vmcore. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-4-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27fs/proc/vmcore: replace vmcoredd_mutex by vmcore_mutexDavid Hildenbrand
Now that we have a mutex that synchronizes against opening of the vmcore, let's use that one to replace vmcoredd_mutex: there is no need to have two separate ones. This is a preparation for properly preventing vmcore modifications after the vmcore was opened. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-3-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27fs/proc/vmcore: convert vmcore_cb_lock into vmcore_mutexDavid Hildenbrand
We want to protect vmcore modifications from concurrent opening of the vmcore, and also serialize vmcore modification. (a) We can currently modify the vmcore after it was opened. This can happen if a vmcoredd is added after the vmcore module was initialized and already opened by user space. We want to fix that and prepare for new code wanting to serialize against concurrent opening. (b) To handle it cleanly we need to protect the modifications against concurrent opening. As the modifications end up allocating memory and can sleep, we cannot rely on the spinlock. Let's convert the spinlock into a mutex to prepare for further changes. Signed-off-by: David Hildenbrand <david@redhat.com> Message-Id: <20241204125444.1734652-2-david@redhat.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-01-27xfs: Add error handling for xfs_reflink_cancel_cow_rangeWentao Liang
In xfs_inactive(), xfs_reflink_cancel_cow_range() is called without error handling, risking unnoticed failures and inconsistent behavior compared to other parts of the code. Fix this issue by adding an error handling for the xfs_reflink_cancel_cow_range(), improving code robustness. Fixes: 6231848c3aa5 ("xfs: check for cow blocks before trying to clear them") Cc: stable@vger.kernel.org # v4.17 Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Wentao Liang <vulab@iscas.ac.cn> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-27xfs: Propagate errors from xfs_reflink_cancel_cow_range in ↵Wentao Liang
xfs_dax_write_iomap_end In xfs_dax_write_iomap_end(), directly return the result of xfs_reflink_cancel_cow_range() when !written, ensuring proper error propagation and improving code robustness. Fixes: ea6c49b784f0 ("xfs: support CoW in fsdax mode") Cc: stable@vger.kernel.org # v6.0 Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Wentao Liang <vulab@iscas.ac.cn> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-01-26cifs: Change translation of STATUS_NOT_A_REPARSE_POINT to -ENODATAPali Rohár
STATUS_NOT_A_REPARSE_POINT indicates that object does not have reparse point buffer attached, for example returned by FSCTL_GET_REPARSE_POINT. Currently STATUS_NOT_A_REPARSE_POINT is translated to -EIO. Change it to -ENODATA which better describe the situation when no reparse point is set. Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-26bcachefs: Improve trace_move_extent_finishKent Overstreet
We're currently debugging issues with rebalance, where it's not making progress as quickly as it should be (or sometimes not at all). Add the full data_update to the move_extent_finish tracepoint, so we can check that the replicas we wrote match what we were supposed to do. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-26bcachefs: Fix trace_copygcKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-26bcachefs: Journal writes are now IOPRIO_CLASS_RTKent Overstreet
System performance is particularly sensitive to journal write latency, the number of outstanding journal writes is bounded and we can't issue journal flushes until other journal writes have completed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-01-26Merge tag 'mm-stable-2025-01-26-14-59' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "The various patchsets are summarized below. Plus of course many indivudual patches which are described in their changelogs. - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c - "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated: https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/ Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED) - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated - "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation - "Cleanup for memfd_create()" from Isaac Manjarres does that thing - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests" * tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) mm/compaction: fix UBSAN shift-out-of-bounds warning s390/mm: add missing ctor/dtor on page table upgrade kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags() tools: add VM_WARN_ON_VMG definition mm/damon/core: use str_high_low() helper in damos_wmark_wait_us() seqlock: add missing parameter documentation for raw_seqcount_try_begin() mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh mm/page_alloc: remove the incorrect and misleading comment zram: remove zcomp_stream_put() from write_incompressible_page() mm: separate move/undo parts from migrate_pages_batch() mm/kfence: use str_write_read() helper in get_access_type() selftests/mm/mkdirty: fix memory leak in test_uffdio_copy() kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags() selftests/mm: virtual_address_range: avoid reading from VM_IO mappings selftests/mm: vm_util: split up /proc/self/smaps parsing selftests/mm: virtual_address_range: unmap chunks after validation selftests/mm: virtual_address_range: mmap() without PROT_WRITE selftests/memfd/memfd_test: fix possible NULL pointer dereference mm: add FGP_DONTCACHE folio creation flag mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue ...
2025-01-26Merge tag 'mm-nonmm-stable-2025-01-24-23-16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "Mainly individually changelogged singleton patches. The patch series in this pull are: - "lib min_heap: Improve min_heap safety, testing, and documentation" from Kuan-Wei Chiu provides various tightenings to the min_heap library code - "xarray: extract __xa_cmpxchg_raw" from Tamir Duberstein preforms some cleanup and Rust preparation in the xarray library code - "Update reference to include/asm-<arch>" from Geert Uytterhoeven fixes pathnames in some code comments - "Converge on using secs_to_jiffies()" from Easwar Hariharan uses the new secs_to_jiffies() in various places where that is appropriate - "ocfs2, dlmfs: convert to the new mount API" from Eric Sandeen switches two filesystems to the new mount API - "Convert ocfs2 to use folios" from Matthew Wilcox does that - "Remove get_task_comm() and print task comm directly" from Yafang Shao removes now-unneeded calls to get_task_comm() in various places - "squashfs: reduce memory usage and update docs" from Phillip Lougher implements some memory savings in squashfs and performs some maintainability work - "lib: clarify comparison function requirements" from Kuan-Wei Chiu tightens the sort code's behaviour and adds some maintenance work - "nilfs2: protect busy buffer heads from being force-cleared" from Ryusuke Konishi fixes an issues in nlifs when the fs is presented with a corrupted image - "nilfs2: fix kernel-doc comments for function return values" from Ryusuke Konishi fixes some nilfs kerneldoc - "nilfs2: fix issues with rename operations" from Ryusuke Konishi addresses some nilfs BUG_ONs which syzbot was able to trigger - "minmax.h: Cleanups and minor optimisations" from David Laight does some maintenance work on the min/max library code - "Fixes and cleanups to xarray" from Kemeng Shi does maintenance work on the xarray library code" * tag 'mm-nonmm-stable-2025-01-24-23-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (131 commits) ocfs2: use str_yes_no() and str_no_yes() helper functions include/linux/lz4.h: add some missing macros Xarray: use xa_mark_t in xas_squash_marks() to keep code consistent Xarray: remove repeat check in xas_squash_marks() Xarray: distinguish large entries correctly in xas_split_alloc() Xarray: move forward index correctly in xas_pause() Xarray: do not return sibling entries from xas_find_marked() ipc/util.c: complete the kernel-doc function descriptions gcov: clang: use correct function param names latencytop: use correct kernel-doc format for func params minmax.h: remove some #defines that are only expanded once minmax.h: simplify the variants of clamp() minmax.h: move all the clamp() definitions after the min/max() ones minmax.h: use BUILD_BUG_ON_MSG() for the lo < hi test in clamp() minmax.h: reduce the #define expansion of min(), max() and clamp() minmax.h: update some comments minmax.h: add whitespace around operators and after commas nilfs2: do not update mtime of renamed directory that is not moved nilfs2: handle errors that nilfs_prepare_chunk() may return CREDITS: fix spelling mistake ...
2025-01-25ksm: add ksm involvement information for each processxu xin
In /proc/<pid>/ksm_stat, add two extra ksm involvement items including KSM_mergeable and KSM_merge_any. It helps administrators to better know the system's KSM behavior at process level. ksm_merge_any: yes/no whether the process'mm is added by prctl() into the candidate list of KSM or not, and fully enabled at process level. ksm_mergeable: yes/no whether any VMAs of the process'mm are currently applicable to KSM. Purpose ======= These two items are just to improve the observability of KSM at process level, so that users can know if a certain process has enabled KSM. For example, if without these two items, when we look at /proc/<pid>/ksm_stat and there's no merging pages found, We are not sure whether it is because KSM was not enabled or because KSM did not successfully merge any pages. Although "mg" in /proc/<pid>/smaps indicate VM_MERGEABLE, it's opaque and not very obvious for non professionals. [akpm@linux-foundation.org: wording tweaks, per David and akpm] Link: https://lkml.kernel.org/r/20250110174034304QOb8eDoqtFkp3_t8mqnqc@zte.com.cn Signed-off-by: xu xin <xu.xin16@zte.com.cn> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Mario Casquero <mcasquer@redhat.com> Cc: Wang Yaxin <wang.yaxin@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-01-25mm, swap: clean up device availability checkKairui Song
Remove highest_bit and lowest_bit. After the HDD allocation path has been removed, the only purpose of these two fields is to determine whether the device is full or not, which can instead be determined by checking the inuse_pages. Link: https://lkml.kernel.org/r/20250113175732.48099-6-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chis Li <chrisl@kernel.org> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickens <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>