summaryrefslogtreecommitdiff
path: root/fs/file_table.c
AgeCommit message (Collapse)Author
2025-03-24Merge tag 'vfs-6.15-rc1.file' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs file handling updates from Christian Brauner: "This contains performance improvements for struct file's new refcount mechanism and various other performance work: - The stock kernel transitioning the file to no refs held penalizes the caller with an extra atomic to block any increments. For cases where the file is highly likely to be going away this is easily avoidable. Add file_ref_put_close() to better handle the common case where closing a file descriptor also operates on the last reference and build fput_close_sync() and fput_close() on top of it. This brings about 1% performance improvement by eliding one atomic in the common case. - Predict no error in close() since the vast majority of the time system call returns 0. - Reduce the work done in fdget_pos() by predicting that the file was found and by explicitly comparing the reference count to one and ignoring the dead zone" * tag 'vfs-6.15-rc1.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: reduce work in fdget_pos() fs: use fput_close() in path_openat() fs: use fput_close() in filp_close() fs: use fput_close_sync() in close() file: add fput and file_ref_put routines optimized for use when closing a fd fs: predict no error in close()
2025-03-24Merge tag 'vfs-6.15-rc1.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "Features: - Add CONFIG_DEBUG_VFS infrastucture: - Catch invalid modes in open - Use the new debug macros in inode_set_cached_link() - Use debug-only asserts around fd allocation and install - Place f_ref to 3rd cache line in struct file to resolve false sharing Cleanups: - Start using anon_inode_getfile_fmode() helper in various places - Don't take f_lock during SEEK_CUR if exclusion is guaranteed by f_pos_lock - Add unlikely() to kcmp() - Remove legacy ->remount_fs method from ecryptfs after port to the new mount api - Remove invalidate_inodes() in favour of evict_inodes() - Simplify ep_busy_loopER by removing unused argument - Avoid mmap sem relocks when coredumping with many missing pages - Inline getname() - Inline new_inode_pseudo() and de-staticize alloc_inode() - Dodge an atomic in putname if ref == 1 - Consistently deref the files table with rcu_dereference_raw() - Dedup handling of struct filename init and refcounts bumps - Use wq_has_sleeper() in end_dir_add() - Drop the lock trip around I_NEW wake up in evict() - Load the ->i_sb pointer once in inode_sb_list_{add,del} - Predict not reaching the limit in alloc_empty_file() - Tidy up do_sys_openat2() with likely/unlikely - Call inode_sb_list_add() outside of inode hash lock - Sort out fd allocation vs dup2 race commentary - Turn page_offset() into a wrapper around folio_pos() - Remove locking in exportfs around ->get_parent() call - try_lookup_one_len() does not need any locks in autofs - Fix return type of several functions from long to int in open - Fix return type of several functions from long to int in ioctls Fixes: - Fix watch queue accounting mismatch" * tag 'vfs-6.15-rc1.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits) fs: sort out fd allocation vs dup2 race commentary, take 2 fs: call inode_sb_list_add() outside of inode hash lock fs: tidy up do_sys_openat2() with likely/unlikely fs: predict not reaching the limit in alloc_empty_file() fs: load the ->i_sb pointer once in inode_sb_list_{add,del} fs: drop the lock trip around I_NEW wake up in evict() fs: use wq_has_sleeper() in end_dir_add() VFS/autofs: try_lookup_one_len() does not need any locks fs: dedup handling of struct filename init and refcounts bumps fs: consistently deref the files table with rcu_dereference_raw() exportfs: remove locking around ->get_parent() call. fs: use debug-only asserts around fd allocation and install fs: dodge an atomic in putname if ref == 1 vfs: Remove invalidate_inodes() ecryptfs: remove NULL remount_fs from super_operations watch_queue: fix pipe accounting mismatch fs: place f_ref to 3rd cache line in struct file to resolve false sharing epoll: simplify ep_busy_loop by removing always 0 argument fs: Turn page_offset() into a wrapper around folio_pos() kcmp: improve performance adding an unlikely hint to task comparisons ...
2025-03-19fs: predict not reaching the limit in alloc_empty_file()Mateusz Guzik
Eliminates a jump over a call to capable() in the common case. By default the limit is not even set, in which case the check can't even fail to begin with. It remains unset at least on Debian and Ubuntu. For this cases this can probably become a static branch instead. In the meantime tidy it up. I note the check separate from the bump makes the entire thing racy. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250319124923.1838719-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-05file: add fput and file_ref_put routines optimized for use when closing a fdMateusz Guzik
Vast majority of the time closing a file descriptor also operates on the last reference, where a regular fput usage will result in 2 atomics. This can be changed to only suffer 1. See commentary above file_ref_put_close() for more information. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20250305123644.554845-2-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-07fsnotify: disable pre-content and permission events by defaultAmir Goldstein
After introducing pre-content events, we had a regression related to disabling huge faults on files that should never have pre-content events enabled. This happened because the default f_mode of allocated files (0) does not disable pre-content events. Pre-content events are disabled in file_set_fsnotify_mode_by_watchers() but internal files may not get to call this helper. Initialize f_mode to disable permission and pre-content events for all files and if needed they will be enabled for the callers of file_set_fsnotify_mode_by_watchers(). Fixes: 20bf82a898b6 ("mm: don't allow huge faults for files with pre content watches") Reported-by: Alex Williamson <alex.williamson@redhat.com> Closes: https://lore.kernel.org/linux-fsdevel/20250131121703.1e4d00a7.alex.williamson@redhat.com/ Tested-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20250203223205.861346-4-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-07fsnotify: disable notification by default for all pseudo filesAmir Goldstein
Most pseudo files are not applicable for fsnotify events at all, let alone to the new pre-content events. Disable notifications to all files allocated with alloc_file_pseudo() and enable legacy inotify events for the specific cases of pipe and socket, which have known users of inotify events. Pre-content events are also kept disabled for sockets and pipes. Fixes: 20bf82a898b6 ("mm: don't allow huge faults for files with pre content watches") Reported-by: Alex Williamson <alex.williamson@redhat.com> Closes: https://lore.kernel.org/linux-fsdevel/20250131121703.1e4d00a7.alex.williamson@redhat.com/ Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/linux-fsdevel/CAHk-=wi2pThSVY=zhO=ZKxViBj5QCRX-=AS2+rVknQgJnHXDFg@mail.gmail.com/ Tested-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20250203223205.861346-3-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-02-01Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull misc vfs cleanups from Al Viro: "Two unrelated patches - one is a removal of long-obsolete include in overlayfs (it used to need fs/internal.h, but the extern it wanted has been moved back to include/linux/namei.h) and another introduces convenience helper constructing struct qstr by a NUL-terminated string" * tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: add a string-to-qstr constructor fs/overlayfs/namei.c: get rid of include ../internal.h
2025-01-28treewide: const qualify ctl_tables where applicableJoel Granados
Add the const qualifier to all the ctl_tables in the tree except for watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls, loadpin_sysctl_table and the ones calling register_net_sysctl (./net, drivers/inifiniband dirs). These are special cases as they use a registration function with a non-const qualified ctl_table argument or modify the arrays before passing them on to the registration function. Constifying ctl_table structs will prevent the modification of proc_handler function pointers as the arrays would reside in .rodata. This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide: constify the ctl_table argument of proc_handlers") constified all the proc_handlers. Created this by running an spatch followed by a sed command: Spatch: virtual patch @ depends on !(file in "net") disable optional_qualifier @ identifier table_name != { watchdog_hardlockup_sysctl, iwcm_ctl_table, ucma_ctl_table, memory_allocation_profiling_sysctls, loadpin_sysctl_table }; @@ + const struct ctl_table table_name [] = { ... }; sed: sed --in-place \ -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \ kernel/utsname_sysctl.c Reviewed-by: Song Liu <song@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/ Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Corey Minyard <cminyard@mvista.com> Acked-by: Wei Liu <wei.liu@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Acked-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-01-27add a string-to-qstr constructorAl Viro
Quite a few places want to build a struct qstr by given string; it would be convenient to have a primitive doing that, rather than open-coding it via QSTR_INIT(). The closest approximation was in bcachefs, but that expands to initializer list - {.len = strlen(string), .name = string}. It would be more useful to have it as compound literal - (struct qstr){.len = strlen(string), .name = string}. Unlike initializer list it's a valid expression. What's more, it's a valid lvalue - it's an equivalent of anonymous local variable with such initializer, so the things like path->dentry = d_alloc_pseudo(mnt->mnt_sb, &QSTR(name)); are valid. It can also be used as initializer, with identical effect - struct qstr x = (struct qstr){.name = s, .len = strlen(s)}; is equivalent to struct qstr anon_variable = {.name = s, .len = strlen(s)}; struct qstr x = anon_variable; // anon_variable is never used after that point and any even remotely sane compiler will manage to collapse that into struct qstr x = {.name = s, .len = strlen(s)}; What compound literals can't be used for is initialization of global variables, but those are covered by QSTR_INIT(). This commit lifts definition(s) of QSTR() into linux/dcache.h, converts it to compound literal (all bcachefs users are fine with that) and converts assorted open-coded instances to using that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-12-22file: flush delayed work in delayed fput()shao mingyin
The fput() of file rcS might not have completed causing issues when executing the file. rcS is opened in do_populate_rootfs before executed. At the end of do_populate_rootfs() flush_delayed_fput() is called. Now do_populate_rootfs() assumes that all fput()s caused by do_populate_rootfs() have completed. But flush_delayed_fput() can only ensure that fput() on the current delayed_fput_list has finished. Any file that has been removed from delayed_fput_list asynchronously in the meantime might not have completed causing the exec to fail. do_populate_rootfs delayed_fput_list delayed_fput execve fput() a fput() a->b fput() a->b->rcS __fput(a) fput() c fput() c->d __fput(b) flush_delayed_fput __fput(c) __fput(d) __fput(b) __fput(b) execve(rcS) Ensure that all delayed work is done by calling flush_delayed_work() in flush_delayed_fput() explicitly. Signed-off-by: Chen Lin <chen.lin5@zte.com.cn> Signed-off-by: Shao Mingyin <shao.mingyin@zte.com.cn> Link: https://lore.kernel.org/r/20241023135850067m3w2R0UXESiVCYz_wdAoT@zte.com.cn Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: Yang Tao <yang.tao172@zte.com.cn> Cc: Xu Xin <xu.xin16@zte.com.cn> [brauner: rewrite commit message] Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-22fs: fix proc_handler for sysctl_nr_openJinliang Zheng
Use proc_douintvec_minmax() instead of proc_dointvec_minmax() to handle sysctl_nr_open, because its data type is unsigned int, not int. Fixes: 9b80a184eaad ("fs/file: more unsigned file descriptors") Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Link: https://lore.kernel.org/r/20241124034636.325337-1-alexjlzheng@tencent.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-30Merge branch 'work.fdtable' into vfs.fileChristian Brauner
Bring in the fdtable changes for this cycle. Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-30fs: port files to file_refChristian Brauner
Port files to rely on file_ref reference to improve scaling and gain overflow protection. - We continue to WARN during get_file() in case a file that is already marked dead is revived as get_file() is only valid if the caller already holds a reference to the file. This hasn't changed just the check changes. - The semantics for epoll and ttm's dmabuf usage have changed. Both epoll and ttm synchronize with __fput() to prevent the underlying file from beeing freed. (1) epoll Explaining epoll is straightforward using a simple diagram. Essentially, the mutex of the epoll instance needs to be taken in both __fput() and around epi_fget() preventing the file from being freed while it is polled or preventing the file from being resurrected. CPU1 CPU2 fput(file) -> __fput(file) -> eventpoll_release(file) -> eventpoll_release_file(file) mutex_lock(&ep->mtx) epi_item_poll() -> epi_fget() -> file_ref_get(file) mutex_unlock(&ep->mtx) mutex_lock(&ep->mtx); __ep_remove() mutex_unlock(&ep->mtx); -> kmem_cache_free(file) (2) ttm dmabuf This explanation is a bit more involved. A regular dmabuf file stashed the dmabuf in file->private_data and the file in dmabuf->file: file->private_data = dmabuf; dmabuf->file = file; The generic release method of a dmabuf file handles file specific things: f_op->release::dma_buf_file_release() while the generic dentry release method of a dmabuf handles dmabuf freeing including driver specific things: dentry->d_release::dma_buf_release() During ttm dmabuf initialization in ttm_object_device_init() the ttm driver copies the provided struct dma_buf_ops into a private location: struct ttm_object_device { spinlock_t object_lock; struct dma_buf_ops ops; void (*dmabuf_release)(struct dma_buf *dma_buf); struct idr idr; }; ttm_object_device_init(const struct dma_buf_ops *ops) { // copy original dma_buf_ops in private location tdev->ops = *ops; // stash the release method of the original struct dma_buf_ops tdev->dmabuf_release = tdev->ops.release; // override the release method in the copy of the struct dma_buf_ops // with ttm's own dmabuf release method tdev->ops.release = ttm_prime_dmabuf_release; } When a new dmabuf is created the struct dma_buf_ops with the overriden release method set to ttm_prime_dmabuf_release is passed in exp_info.ops: DEFINE_DMA_BUF_EXPORT_INFO(exp_info); exp_info.ops = &tdev->ops; exp_info.size = prime->size; exp_info.flags = flags; exp_info.priv = prime; The call to dma_buf_export() then sets mutex_lock_interruptible(&prime->mutex); dma_buf = dma_buf_export(&exp_info) { dmabuf->ops = exp_info->ops; } mutex_unlock(&prime->mutex); which creates a new dmabuf file and then install a file descriptor to it in the callers file descriptor table: ret = dma_buf_fd(dma_buf, flags); When that dmabuf file is closed we now get: fput(file) -> __fput(file) -> f_op->release::dma_buf_file_release() -> dput() -> d_op->d_release::dma_buf_release() -> dmabuf->ops->release::ttm_prime_dmabuf_release() mutex_lock(&prime->mutex); if (prime->dma_buf == dma_buf) prime->dma_buf = NULL; mutex_unlock(&prime->mutex); Where we can see that prime->dma_buf is set to NULL. So when we have the following diagram: CPU1 CPU2 fput(file) -> __fput(file) -> f_op->release::dma_buf_file_release() -> dput() -> d_op->d_release::dma_buf_release() -> dmabuf->ops->release::ttm_prime_dmabuf_release() ttm_prime_handle_to_fd() mutex_lock_interruptible(&prime->mutex) dma_buf = prime->dma_buf dma_buf && get_dma_buf_unless_doomed(dma_buf) -> file_ref_get(dma_buf->file) mutex_unlock(&prime->mutex); mutex_lock(&prime->mutex); if (prime->dma_buf == dma_buf) prime->dma_buf = NULL; mutex_unlock(&prime->mutex); -> kmem_cache_free(file) The logic of the mechanism is the same as for epoll: sync with __fput() preventing the file from being freed. Here the synchronization happens through the ttm instance's prime->mutex. Basically, the lifetime of the dma_buf and the file are tighly coupled. Both (1) and (2) used to call atomic_inc_not_zero() to check whether the file has already been marked dead and then refuse to revive it. This is only safe because both (1) and (2) sync with __fput() and thus prevent kmem_cache_free() on the file being called and thus prevent the file from being immediately recycled due to SLAB_TYPESAFE_BY_RCU. Both (1) and (2) have been ported from atomic_inc_not_zero() to file_ref_get(). That means a file that is already in the process of being marked as FILE_REF_DEAD: file_ref_put() cnt = atomic_long_dec_return() -> __file_ref_put(cnt) if (cnt == FIlE_REF_NOREF) atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD) can be revived again: CPU1 CPU2 file_ref_put() cnt = atomic_long_dec_return() -> __file_ref_put(cnt) if (cnt == FIlE_REF_NOREF) file_ref_get() // Brings reference back to FILE_REF_ONEREF atomic_long_add_negative() atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD) This is fine and inherent to the file_ref_get()/file_ref_put() semantics. For both (1) and (2) this is safe because __fput() is prevented from making progress if file_ref_get() fails due to the aforementioned synchronization mechanisms. Two cases need to be considered that affect both (1) epoll and (2) ttm dmabuf: (i) fput()'s file_ref_put() and marks the file as FILE_REF_NOREF but before that fput() can mark the file as FILE_REF_DEAD someone manages to sneak in a file_ref_get() and brings the refcount back from FILE_REF_NOREF to FILE_REF_ONEREF. In that case the original fput() doesn't call __fput(). For epoll the poll will finish and for ttm dmabuf the file can be used again. For ttm dambuf this is actually an advantage because it avoids immediately allocating a new dmabuf object. CPU1 CPU2 file_ref_put() cnt = atomic_long_dec_return() -> __file_ref_put(cnt) if (cnt == FIlE_REF_NOREF) file_ref_get() // Brings reference back to FILE_REF_ONEREF atomic_long_add_negative() atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD) (ii) fput()'s file_ref_put() marks the file FILE_REF_NOREF and also suceeds in actually marking it FILE_REF_DEAD and then calls into __fput() to free the file. When either (1) or (2) call file_ref_get() they fail as atomic_long_add_negative() will return true. At the same time, both (1) and (2) all file_ref_get() under mutexes that __fput() must also acquire preventing kmem_cache_free() from freeing the file. So while this might be treated as a change in semantics for (1) and (2) it really isn't. It if should end up causing issues this can be fixed by adding a helper that does something like: long cnt = atomic_long_read(&ref->refcnt); do { if (cnt < 0) return false; } while (!atomic_long_try_cmpxchg(&ref->refcnt, &cnt, cnt + 1)); return true; which would block FILE_REF_NOREF to FILE_REF_ONEREF transitions. - Jann correctly pointed out that kmem_cache_zalloc() cannot be used anymore once files have been ported to file_ref_t. The kmem_cache_zalloc() call will memset() the whole struct file to zero when it is reallocated. This will also set file->f_ref to zero which mens that a concurrent file_ref_get() can return true: CPU1 CPU2 __get_file_rcu() rcu_dereference_raw() close() [frees file] alloc_empty_file() kmem_cache_zalloc() [reallocates same file] memset(..., 0, ...) file_ref_get() [increments 0->1, returns true] init_file() file_ref_init(..., 1) [sets to 0] rcu_dereference_raw() fput() file_ref_put() [decrements 0->FILE_REF_NOREF, frees file] [UAF] causing a concurrent __get_file_rcu() call to acquire a reference to the file that is about to be reallocated and immediately freeing it on realizing that it has been recycled. This causes a UAF for the task that reallocated/recycled the file. This is prevented by switching from kmem_cache_zalloc() to kmem_cache_alloc() and initializing the fields manually. With file->f_ref initialized last. Note that a memset() also isn't guaranteed to atomically update an unsigned long so it's theoretically possible to see torn and therefore bogus counter values. Link: https://lore.kernel.org/r/20241007-brauner-file-rcuref-v2-3-387e24dc9163@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-08fs: protect backing files with rcuChristian Brauner
Currently backing files are not under any form of rcu protection. Switching to file_ref requires rcu protection and so does the speculative vma lookup. Switch backing files to the same rcu slab type as regular files. There should be no additional magic required as the lifetime of a backing file is always tied to a regular file. Link: https://lore.kernel.org/r/20241007-brauner-file-rcuref-v2-1-387e24dc9163@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-07remove pointless includes of <linux/fdtable.h>Al Viro
some of those used to be needed, some had been cargo-culted for no reason... Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-09-18Merge tag 'slab-for-6.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: "This time it's mostly refactoring and improving APIs for slab users in the kernel, along with some debugging improvements. - kmem_cache_create() refactoring (Christian Brauner) Over the years have been growing new parameters to kmem_cache_create() where most of them are needed only for a small number of caches - most recently the rcu_freeptr_offset parameter. To avoid adding new parameters to kmem_cache_create() and adjusting all its callers, or creating new wrappers such as kmem_cache_create_rcu(), we can now pass extra parameters using the new struct kmem_cache_args. Not explicitly initialized fields default to values interpreted as unused. kmem_cache_create() is for now a wrapper that works both with the new form: kmem_cache_create(name, object_size, args, flags) and the legacy form: kmem_cache_create(name, object_size, align, flags, ctor) - kmem_cache_destroy() waits for kfree_rcu()'s in flight (Vlastimil Babka, Uladislau Rezki) Since SLOB removal, kfree() is allowed for freeing objects allocated by kmem_cache_create(). By extension kfree_rcu() as allowed as well, which can allow converting simple call_rcu() callbacks that only do kmem_cache_free(), as there was never a kmem_cache_free_rcu() variant. However, for caches that can be destroyed e.g. on module removal, the cache owners knew to issue rcu_barrier() first to wait for the pending call_rcu()'s, and this is not sufficient for pending kfree_rcu()'s due to its internal batching optimizations. Ulad has provided a new kvfree_rcu_barrier() and to make the usage less error-prone, kmem_cache_destroy() calls it. Additionally, destroying SLAB_TYPESAFE_BY_RCU caches now again issues rcu_barrier() synchronously instead of using an async work, because the past motivation for async work no longer applies. Users of custom call_rcu() callbacks should however keep calling rcu_barrier() before cache destruction. - Debugging use-after-free in SLAB_TYPESAFE_BY_RCU caches (Jann Horn) Currently, KASAN cannot catch UAFs in such caches as it is legal to access them within a grace period, and we only track the grace period when trying to free the underlying slab page. The new CONFIG_SLUB_RCU_DEBUG option changes the freeing of individual object to be RCU-delayed, after which KASAN can poison them. - Delayed memcg charging (Shakeel Butt) In some cases, the memcg is uknown at allocation time, such as receiving network packets in softirq context. With kmem_cache_charge() these may be now charged later when the user and its memcg is known. - Misc fixes and improvements (Pedro Falcato, Axel Rasmussen, Christoph Lameter, Yan Zhen, Peng Fan, Xavier)" * tag 'slab-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (34 commits) mm, slab: restore kerneldoc for kmem_cache_create() io_uring: port to struct kmem_cache_args slab: make __kmem_cache_create() static inline slab: make kmem_cache_create_usercopy() static inline slab: remove kmem_cache_create_rcu() file: port to struct kmem_cache_args slab: create kmem_cache_create() compatibility layer slab: port KMEM_CACHE_USERCOPY() to struct kmem_cache_args slab: port KMEM_CACHE() to struct kmem_cache_args slab: remove rcu_freeptr_offset from struct kmem_cache slab: pass struct kmem_cache_args to do_kmem_cache_create() slab: pull kmem_cache_open() into do_kmem_cache_create() slab: pass struct kmem_cache_args to create_cache() slab: port kmem_cache_create_usercopy() to struct kmem_cache_args slab: port kmem_cache_create_rcu() to struct kmem_cache_args slab: port kmem_cache_create() to struct kmem_cache_args slab: add struct kmem_cache_args slab: s/__kmem_cache_create/do_kmem_cache_create/g memcg: add charging of already allocated slab objects mm/slab: Optimize the code logic in find_mergeable() ...
2024-09-16Merge tag 'vfs-6.12.file' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs file updates from Christian Brauner: "This is the work to cleanup and shrink struct file significantly. Right now, (focusing on x86) struct file is 232 bytes. After this series struct file will be 184 bytes aka 3 cacheline and a spare 8 bytes for future extensions at the end of the struct. With struct file being as ubiquitous as it is this should make a difference for file heavy workloads and allow further optimizations in the future. - struct fown_struct was embedded into struct file letting it take up 32 bytes in total when really it shouldn't even be embedded in struct file in the first place. Instead, actual users of struct fown_struct now allocate the struct on demand. This frees up 24 bytes. - Move struct file_ra_state into the union containg the cleanup hooks and move f_iocb_flags out of the union. This closes a 4 byte hole we created earlier and brings struct file to 192 bytes. Which means struct file is 3 cachelines and we managed to shrink it by 40 bytes. - Reorder struct file so that nothing crosses a cacheline. I suspect that in the future we will end up reordering some members to mitigate false sharing issues or just because someone does actually provide really good perf data. - Shrinking struct file to 192 bytes is only part of the work. Files use a slab that is SLAB_TYPESAFE_BY_RCU and when a kmem cache is created with SLAB_TYPESAFE_BY_RCU the free pointer must be located outside of the object because the cache doesn't know what part of the memory can safely be overwritten as it may be needed to prevent object recycling. That has the consequence that SLAB_TYPESAFE_BY_RCU may end up adding a new cacheline. So this also contains work to add a new kmem_cache_create_rcu() function that allows the caller to specify an offset where the freelist pointer is supposed to be placed. Thus avoiding the implicit addition of a fourth cacheline. - And finally this removes the f_version member in struct file. The f_version member isn't particularly well-defined. It is mainly used as a cookie to detect concurrent seeks when iterating directories. But it is also abused by some subsystems for completely unrelated things. It is mostly a directory and filesystem specific thing that doesn't really need to live in struct file and with its wonky semantics it really lacks a specific function. For pipes, f_version is (ab)used to defer poll notifications until a write has happened. And struct pipe_inode_info is used by multiple struct files in their ->private_data so there's no chance of pushing that down into file->private_data without introducing another pointer indirection. But pipes don't rely on f_pos_lock so this adds a union into struct file encompassing f_pos_lock and a pipe specific f_pipe member that pipes can use. This union of course can be extended to other file types and is similar to what we do in struct inode already" * tag 'vfs-6.12.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (26 commits) fs: remove f_version pipe: use f_pipe fs: add f_pipe ubifs: store cookie in private data ufs: store cookie in private data udf: store cookie in private data proc: store cookie in private data ocfs2: store cookie in private data input: remove f_version abuse ext4: store cookie in private data ext2: store cookie in private data affs: store cookie in private data fs: add generic_llseek_cookie() fs: use must_set_pos() fs: add must_set_pos() fs: add vfs_setpos_cookie() s390: remove unused f_version ceph: remove unused f_version adi: remove unused f_version mm: Removed @freeptr_offset to prevent doc warning ...
2024-09-12fs: add f_pipeChristian Brauner
Only regular files with FMODE_ATOMIC_POS and directories need f_pos_lock. Place a new f_pipe member in a union with f_pos_lock that they can use and make them stop abusing f_version in follow-up patches. Link: https://lore.kernel.org/r/20240830-vfs-file-f_version-v1-18-6d3e4816aa7b@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-10file: port to struct kmem_cache_argsChristian Brauner
Port filp_cache to struct kmem_cache_args. Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-08-29fs: use kmem_cache_create_rcu()Christian Brauner
Switch to the new kmem_cache_create_rcu() helper which allows us to use a custom free pointer offset avoiding the need to have an external free pointer which would grow struct file behind our backs. Link: https://lore.kernel.org/r/20240828-work-kmem_cache-rcu-v3-3-5460bc1f09f6@kernel.org Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-28file: reclaim 24 bytes from f_ownerChristian Brauner
We do embedd struct fown_struct into struct file letting it take up 32 bytes in total. We could tweak struct fown_struct to be more compact but really it shouldn't even be embedded in struct file in the first place. Instead, actual users of struct fown_struct should allocate the struct on demand. This frees up 24 bytes in struct file. That will have some potentially user-visible changes for the ownership fcntl()s. Some of them can now fail due to allocation failures. Practically, that probably will almost never happen as the allocations are small and they only happen once per file. The fown_struct is used during kill_fasync() which is used by e.g., pipes to generate a SIGIO signal. Sending of such signals is conditional on userspace having set an owner for the file using one of the F_OWNER fcntl()s. Such users will be unaffected if struct fown_struct is allocated during the fcntl() call. There are a few subsystems that call __f_setown() expecting file->f_owner to be allocated: (1) tun devices file->f_op->fasync::tun_chr_fasync() -> __f_setown() There are no callers of tun_chr_fasync(). (2) tty devices file->f_op->fasync::tty_fasync() -> __tty_fasync() -> __f_setown() tty_fasync() has no additional callers but __tty_fasync() has. Note that __tty_fasync() only calls __f_setown() if the @on argument is true. It's called from: file->f_op->release::tty_release() -> tty_release() -> __tty_fasync() -> __f_setown() tty_release() calls __tty_fasync() with @on false => __f_setown() is never called from tty_release(). => All callers of tty_release() are safe as well. file->f_op->release::tty_open() -> tty_release() -> __tty_fasync() -> __f_setown() __tty_hangup() calls __tty_fasync() with @on false => __f_setown() is never called from tty_release(). => All callers of __tty_hangup() are safe as well. From the callchains it's obvious that (1) and (2) end up getting called via file->f_op->fasync(). That can happen either through the F_SETFL fcntl() with the FASYNC flag raised or via the FIOASYNC ioctl(). If FASYNC is requested and the file isn't already FASYNC then file->f_op->fasync() is called with @on true which ends up causing both (1) and (2) to call __f_setown(). (1) and (2) are the only subsystems that call __f_setown() from the file->f_op->fasync() handler. So both (1) and (2) have been updated to allocate a struct fown_struct prior to calling fasync_helper() to register with the fasync infrastructure. That's safe as they both call fasync_helper() which also does allocations if @on is true. The other interesting case are file leases: (3) file leases lease_manager_ops->lm_setup::lease_setup() -> __f_setown() Which in turn is called from: generic_add_lease() -> lease_manager_ops->lm_setup::lease_setup() -> __f_setown() So here again we can simply make generic_add_lease() allocate struct fown_struct prior to the lease_manager_ops->lm_setup::lease_setup() which happens under a spinlock. With that the two remaining subsystems that call __f_setown() are: (4) dnotify (5) sockets Both have their own custom ioctls to set struct fown_struct and both have been converted to allocate a struct fown_struct on demand from their respective ioctls. Interactions with O_PATH are fine as well e.g., when opening a /dev/tty as O_PATH then no file->f_op->open() happens thus no file->f_owner is allocated. That's fine as no file operation will be set for those and the device has never been opened. fcntl()s called on such things will just allocate a ->f_owner on demand. Although I have zero idea why'd you care about f_owner on an O_PATH fd. Link: https://lore.kernel.org/r/20240813-work-f_owner-v2-1-4e9343a79f9f@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-19Fixed: fs: file_table_c: Missing blank line warnings and struct declaration ↵Mohit0404
improved Fixed- WARNING: Missing a blank line after declarations WARNING: Missing a blank line after declarations Declaration format: improved struct file declaration format Signed-off-by: Mohit0404 <mohitpawar@mitaoe.ac.in> Link: https://lore.kernel.org/r/20240727072134.130962-2-mohitpawar@mitaoe.ac.in Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-07-24sysctl: treewide: constify the ctl_table argument of proc_handlersJoel Granados
const qualify the struct ctl_table argument in the proc_handler function signatures. This is a prerequisite to moving the static ctl_table structs into .rodata data which will ensure that proc_handler function pointers cannot be modified. This patch has been generated by the following coccinelle script: ``` virtual patch @r1@ identifier ctl, write, buffer, lenp, ppos; identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)"; @@ int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int write, void *buffer, size_t *lenp, loff_t *ppos); @r2@ identifier func, ctl, write, buffer, lenp, ppos; @@ int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int write, void *buffer, size_t *lenp, loff_t *ppos) { ... } @r3@ identifier func; @@ int func( - struct ctl_table * + const struct ctl_table * ,int , void *, size_t *, loff_t *); @r4@ identifier func, ctl; @@ int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int , void *, size_t *, loff_t *); @r5@ identifier func, write, buffer, lenp, ppos; @@ int func( - struct ctl_table * + const struct ctl_table * ,int write, void *buffer, size_t *lenp, loff_t *ppos); ``` * Code formatting was adjusted in xfs_sysctl.c to comply with code conventions. The xfs_stats_clear_proc_handler, xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where adjusted. * The ctl_table argument in proc_watchdog_common was const qualified. This is called from a proc_handler itself and is calling back into another proc_handler, making it necessary to change it as part of the proc_handler migration. Co-developed-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Co-developed-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-03-12Merge tag 'lsm-pr-20240312' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull lsm updates from Paul Moore: - Promote IMA/EVM to a proper LSM This is the bulk of the diffstat, and the source of all the changes in the VFS code. Prior to the start of the LSM stacking work it was important that IMA/EVM were separate from the rest of the LSMs, complete with their own hooks, infrastructure, etc. as it was the only way to enable IMA/EVM at the same time as a LSM. However, now that the bulk of the LSM infrastructure supports multiple simultaneous LSMs, we can simplify things greatly by bringing IMA/EVM into the LSM infrastructure as proper LSMs. This is something I've wanted to see happen for quite some time and Roberto was kind enough to put in the work to make it happen. - Use the LSM hook default values to simplify the call_int_hook() macro Previously the call_int_hook() macro required callers to supply a default return value, despite a default value being specified when the LSM hook was defined. This simplifies the macro by using the defined default return value which makes life easier for callers and should also reduce the number of return value bugs in the future (we've had a few pop up recently, hence this work). - Use the KMEM_CACHE() macro instead of kmem_cache_create() The guidance appears to be to use the KMEM_CACHE() macro when possible and there is no reason why we can't use the macro, so let's use it. - Fix a number of comment typos in the LSM hook comment blocks Not much to say here, we fixed some questionable grammar decisions in the LSM hook comment blocks. * tag 'lsm-pr-20240312' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (28 commits) cred: Use KMEM_CACHE() instead of kmem_cache_create() lsm: use default hook return value in call_int_hook() lsm: fix typos in security/security.c comment headers integrity: Remove LSM ima: Make it independent from 'integrity' LSM evm: Make it independent from 'integrity' LSM evm: Move to LSM infrastructure ima: Move IMA-Appraisal to LSM infrastructure ima: Move to LSM infrastructure integrity: Move integrity_kernel_module_request() to IMA security: Introduce key_post_create_or_update hook security: Introduce inode_post_remove_acl hook security: Introduce inode_post_set_acl hook security: Introduce inode_post_create_tmpfile hook security: Introduce path_post_mknod hook security: Introduce file_release hook security: Introduce file_post_open hook security: Introduce inode_post_removexattr hook security: Introduce inode_post_setattr hook security: Align inode_setattr hook definition with EVM ...
2024-02-25file: add alloc_file_pseudo_noaccount()Christian Brauner
When we open block devices as files we want to make sure to not charge them against the open file limit of the caller as that can cause spurious failures. Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-1-adbd023e19cc@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-25file: prepare for new helperChristian Brauner
In order to add a helper to open files that aren't accounted split alloc_file() and parts of alloc_file_pseudo() into helpers. One to prepare a path, another one to setup the file. Suggested-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240129160241.GA2793@lst.de Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-15ima: Move to LSM infrastructureRoberto Sassu
Move hardcoded IMA function calls (not appraisal-specific functions) from various places in the kernel to the LSM infrastructure, by introducing a new LSM named 'ima' (at the end of the LSM list and always enabled like 'integrity'). Having IMA before EVM in the Makefile is sufficient to preserve the relative order of the new 'ima' LSM in respect to the upcoming 'evm' LSM, and thus the order of IMA and EVM function calls as when they were hardcoded. Make moved functions as static (except ima_post_key_create_or_update(), which is not in ima_main.c), and register them as implementation of the respective hooks in the new function init_ima_lsm(). Select CONFIG_SECURITY_PATH, to ensure that the path-based LSM hook path_post_mknod is always available and ima_post_path_mknod() is always executed to mark files as new, as before the move. A slight difference is that IMA and EVM functions registered for the inode_post_setattr, inode_post_removexattr, path_post_mknod, inode_post_create_tmpfile, inode_post_set_acl and inode_post_remove_acl won't be executed for private inodes. Since those inodes are supposed to be fs-internal, they should not be of interest to IMA or EVM. The S_PRIVATE flag is used for anonymous inodes, hugetlbfs, reiserfs xattrs, XFS scrub and kernel-internal tmpfs files. Conditionally register ima_post_key_create_or_update() if CONFIG_IMA_MEASURE_ASYMMETRIC_KEYS is enabled. Also, conditionally register ima_kernel_module_request() if CONFIG_INTEGRITY_ASYMMETRIC_KEYS is enabled. Finally, add the LSM_ID_IMA case in lsm_list_modules_test.c. Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Acked-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Stefan Berger <stefanb@linux.ibm.com> Reviewed-by: Mimi Zohar <zohar@linux.ibm.com> Acked-by: Mimi Zohar <zohar@linux.ibm.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-02-15security: Introduce file_release hookRoberto Sassu
In preparation for moving IMA and EVM to the LSM infrastructure, introduce the file_release hook. IMA calculates at file close the new digest of the file content and writes it to security.ima, so that appraisal at next file access succeeds. The new hook cannot return an error and cannot cause the operation to be reverted. Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Stefan Berger <stefanb@linux.ibm.com> Reviewed-by: Mimi Zohar <zohar@linux.ibm.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-01-11Merge tag 'pull-dcache' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull dcache updates from Al Viro: "Change of locking rules for __dentry_kill(), regularized refcounting rules in that area, assorted cleanups and removal of weird corner cases (e.g. now ->d_iput() on child is always called before the parent might hit __dentry_kill(), etc)" * tag 'pull-dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits) dcache: remove unnecessary NULL check in dget_dlock() kill DCACHE_MAY_FREE __d_unalias() doesn't use inode argument d_alloc_parallel(): in-lookup hash insertion doesn't need an RCU variant get rid of DCACHE_GENOCIDE d_genocide(): move the extern into fs/internal.h simple_fill_super(): don't bother with d_genocide() on failure nsfs: use d_make_root() d_alloc_pseudo(): move setting ->d_op there from the (sole) caller kill d_instantate_anon(), fold __d_instantiate_anon() into remaining caller retain_dentry(): introduce a trimmed-down lockless variant __dentry_kill(): new locking scheme d_prune_aliases(): use a shrink list switch select_collect{,2}() to use of to_shrink_list() to_shrink_list(): call only if refcount is 0 fold dentry_kill() into dput() don't try to cut corners in shrink_lock_dentry() fold the call of retain_dentry() into fast_dput() Call retain_dentry() with refcount 0 dentry_kill(): don't bother with retain_dentry() on slow path ...
2024-01-10Merge tag 'sysctl-6.8-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "To help make the move of sysctls out of kernel/sysctl.c not incur a size penalty sysctl has been changed to allow us to not require the sentinel, the final empty element on the sysctl array. Joel Granados has been doing all this work. In the v6.6 kernel we got the major infrastructure changes required to support this. For v6.7 we had all arch/ and drivers/ modified to remove the sentinel. For v6.8-rc1 we get a few more updates for fs/ directory only. The kernel/ directory is left but we'll save that for v6.9-rc1 as those patches are still being reviewed. After that we then can expect also the removal of the no longer needed check for procname == NULL. Let us recap the purpose of this work: - this helps reduce the overall build time size of the kernel and run time memory consumed by the kernel by about ~64 bytes per array - the extra 64-byte penalty is no longer inncurred now when we move sysctls out from kernel/sysctl.c to their own files Thomas Weißschuh also sent a few cleanups, for v6.9-rc1 we expect to see further work by Thomas Weißschuh with the constificatin of the struct ctl_table. Due to Joel Granados's work, and to help bring in new blood, I have suggested for him to become a maintainer and he's accepted. So for v6.9-rc1 I look forward to seeing him sent you a pull request for further sysctl changes. This also removes Iurii Zaikin as a maintainer as he has moved on to other projects and has had no time to help at all" * tag 'sysctl-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: sysctl: remove struct ctl_path sysctl: delete unused define SYSCTL_PERM_EMPTY_DIR coda: Remove the now superfluous sentinel elements from ctl_table array sysctl: Remove the now superfluous sentinel elements from ctl_table array fs: Remove the now superfluous sentinel elements from ctl_table array cachefiles: Remove the now superfluous sentinel element from ctl_table array sysclt: Clarify the results of selftest run sysctl: Add a selftest for handling empty dirs sysctl: Fix out of bounds access for empty sysctl registers MAINTAINERS: Add Joel Granados as co-maintainer for proc sysctl MAINTAINERS: remove Iurii Zaikin from proc sysctl
2023-12-28fs: Remove the now superfluous sentinel elements from ctl_table arrayJoel Granados
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel elements ctl_table struct. Special attention was placed in making sure that an empty directory for fs/verity was created when CONFIG_FS_VERITY_BUILTIN_SIGNATURES is not defined. In this case we use the register sysctl call that expects a size. Signed-off-by: Joel Granados <j.granados@samsung.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-12-12fs: replace f_rcuhead with f_task_workChristian Brauner
The naming is actively misleading since we switched to SLAB_TYPESAFE_BY_RCU. rcu_head is #define callback_head. Use callback_head directly and rename f_rcuhead to f_task_work. Add comments in there to explain what it's used for. Link: https://lore.kernel.org/r/20231130-vfs-files-fixes-v1-3-e73ca6f4ea83@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-12file: massage cleanup of files that failed to openAl Viro
A file that has never gotten FMODE_OPENED will never have RCU-accessed references, its final fput() is equivalent to file_free() and if it doesn't have FMODE_BACKING either, it can be done from any context and won't need task_work treatment. Now that we have SLAB_TYPESAFE_BY_RCU we can simplify this and have other callers benefit. All of that can be achieved easier is to make fput() recoginze that case and call file_free() directly. No need to introduce a special primitive for that. It also allowed things like failing dentry_open() could benefit from that as well. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> [Christian Brauner <brauner@kernel.org>: massage commit message] Link: https://lore.kernel.org/r/20231126020834.GC38156@ZenIV Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-25d_alloc_pseudo(): move setting ->d_op there from the (sole) callerAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-11-02Merge tag 'mm-nonmm-stable-2023-11-02-14-08' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "As usual, lots of singleton and doubleton patches all over the tree and there's little I can say which isn't in the individual changelogs. The lengthier patch series are - 'kdump: use generic functions to simplify crashkernel reservation in arch', from Baoquan He. This is mainly cleanups and consolidation of the 'crashkernel=' kernel parameter handling - After much discussion, David Laight's 'minmax: Relax type checks in min() and max()' is here. Hopefully reduces some typecasting and the use of min_t() and max_t() - A group of patches from Oleg Nesterov which clean up and slightly fix our handling of reads from /proc/PID/task/... and which remove task_struct.thread_group" * tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (64 commits) scripts/gdb/vmalloc: disable on no-MMU scripts/gdb: fix usage of MOD_TEXT not defined when CONFIG_MODULES=n .mailmap: add address mapping for Tomeu Vizoso mailmap: update email address for Claudiu Beznea tools/testing/selftests/mm/run_vmtests.sh: lower the ptrace permissions .mailmap: map Benjamin Poirier's address scripts/gdb: add lx_current support for riscv ocfs2: fix a spelling typo in comment proc: test ProtectionKey in proc-empty-vm test proc: fix proc-empty-vm test with vsyscall fs/proc/base.c: remove unneeded semicolon do_io_accounting: use sig->stats_lock do_io_accounting: use __for_each_thread() ocfs2: replace BUG_ON() at ocfs2_num_free_extents() with ocfs2_error() ocfs2: fix a typo in a comment scripts/show_delta: add __main__ judgement before main code treewide: mark stuff as __ro_after_init fs: ocfs2: check status values proc: test /proc/${pid}/statm compiler.h: move __is_constexpr() to compiler.h ...
2023-10-19fs: store real path instead of fake path in backing file f_pathAmir Goldstein
A backing file struct stores two path's, one "real" path that is referring to f_inode and one "fake" path, which should be displayed to users in /proc/<pid>/maps. There is a lot more potential code that needs to know the "real" path, then code that needs to know the "fake" path. Instead of code having to request the "real" path with file_real_path(), store the "real" path in f_path and require code that needs to know the "fake" path request it with file_user_path(). Replace the file_real_path() helper with a simple const accessor f_path(). After this change, file_dentry() is not expected to observe any files with overlayfs f_path and real f_inode, so the call to ->d_real() should not be needed. Leave the ->d_real() call for now and add an assertion in ovl_d_real() to catch if we made wrong assumptions. Suggested-by: Miklos Szeredi <miklos@szeredi.hu> Link: https://lore.kernel.org/r/CAJfpegtt48eXhhjDFA1ojcHPNKj3Go6joryCPtEFAKpocyBsnw@mail.gmail.com/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231009153712.1566422-4-amir73il@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-19backing file: free directlyChristian Brauner
Backing files as used by overlayfs are never installed into file descriptor tables and are explicitly documented as such. They aren't subject to rcu access conditions like regular files are. Their lifetime is bound to the lifetime of the overlayfs file, i.e., they're stashed in ovl_file->private_data and go away otherwise. If they're set as vma->vm_file - which is their main purpose - then they're subject to regular refcount rules and vma->vm_file can't be installed into an fdtable after having been set. All in all I don't see any need for rcu delay here. So free it directly. This all hinges on such hybrid beasts to never actually be installed into fdtables which - as mentioned before - is not allowed. So add an explicit WARN_ON_ONCE() so we catch any case where someone is suddenly trying to install one of those things into a file descriptor table so we can have a nice long chat with them. Link: https://lore.kernel.org/r/20231005-sakralbau-wappnen-f5c31755ed70@brauner Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-19file: convert to SLAB_TYPESAFE_BY_RCUChristian Brauner
In recent discussions around some performance improvements in the file handling area we discussed switching the file cache to rely on SLAB_TYPESAFE_BY_RCU which allows us to get rid of call_rcu() based freeing for files completely. This is a pretty sensitive change overall but it might actually be worth doing. The main downside is the subtlety. The other one is that we should really wait for Jann's patch to land that enables KASAN to handle SLAB_TYPESAFE_BY_RCU UAFs. Currently it doesn't but a patch for this exists. With SLAB_TYPESAFE_BY_RCU objects may be freed and reused multiple times which requires a few changes. So it isn't sufficient anymore to just acquire a reference to the file in question under rcu using atomic_long_inc_not_zero() since the file might have already been recycled and someone else might have bumped the reference. In other words, callers might see reference count bumps from newer users. For this reason it is necessary to verify that the pointer is the same before and after the reference count increment. This pattern can be seen in get_file_rcu() and __files_get_rcu(). In addition, it isn't possible to access or check fields in struct file without first aqcuiring a reference on it. Not doing that was always very dodgy and it was only usable for non-pointer data in struct file. With SLAB_TYPESAFE_BY_RCU it is necessary that callers first acquire a reference under rcu or they must hold the files_lock of the fdtable. Failing to do either one of this is a bug. Thanks to Jann for pointing out that we need to ensure memory ordering between reallocations and pointer check by ensuring that all subsequent loads have a dependency on the second load in get_file_rcu() and providing a fixup that was folded into this patch. Cc: Jann Horn <jannh@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-19vfs: shave work on failed file openMateusz Guzik
Failed opens (mostly ENOENT) legitimately happen a lot, for example here are stats from stracing kernel build for few seconds (strace -fc make): % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ------------------ 0.76 0.076233 5 15040 3688 openat (this is tons of header files tried in different paths) In the common case of there being nothing to close (only the file object to free) there is a lot of overhead which can be avoided. This is most notably delegation of freeing to task_work, which comes with an enormous cost (see 021a160abf62 ("fs: use __fput_sync in close(2)" for an example). Benchmarked with will-it-scale with a custom testcase based on tests/open1.c, stuffed into tests/openneg.c: [snip] while (1) { int fd = open("/tmp/nonexistent", O_RDONLY); assert(fd == -1); (*iterations)++; } [/snip] Sapphire Rapids, openneg_processes -t 1 (ops/s): before: 1950013 after: 2914973 (+49%) file refcount is checked as a safety belt against buggy consumers with an atomic cmpxchg. Technically it is not necessary, but it happens to not be measurable due to several other atomics which immediately follow. Optmizing them away to make this atomic into a problem is left as an exercise for the reader. v2: - unexport fput_badopen and move to fs/internal.h - handle the refcount with cmpxchg, adjust commentary accordingly - tweak the commit message Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-18treewide: mark stuff as __ro_after_initAlexey Dobriyan
__read_mostly predates __ro_after_init. Many variables which are marked __read_mostly should have been __ro_after_init from day 1. Also, mark some stuff as "const" and "__init" while I'm at it. [akpm@linux-foundation.org: revert sysctl_nr_open_min, sysctl_nr_open_max changes due to arm warning] [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/4f6bb9c0-abba-4ee4-a7aa-89265e886817@p183 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-08fs: use __fput_sync in close(2)Linus Torvalds
close(2) is a special case which guarantees a shallow kernel stack, making delegation to task_work machinery unnecessary. Said delegation is problematic as it involves atomic ops and interrupt masking trips, none of which are cheap on x86-64. Forcing close(2) to do it looks like an oversight in the original work. Moreover presence of CONFIG_RSEQ adds an additional overhead as fput() -> task_work_add(..., TWA_RESUME) -> set_notify_resume() makes the thread returning to userspace land in resume_user_mode_work(), where rseq_handle_notify_resume takes a SMAP round-trip if rseq is enabled for the thread (and it is by default with contemporary glibc). Sample result when benchmarking open1_processes -t 1 from will-it-scale (that's an open + close loop) + tmpfs on /tmp, running on the Sapphire Rapid CPU (ops/s): stock+RSEQ: 1329857 stock-RSEQ: 1421667 (+7%) patched: 1523521 (+14.5% / +7%) (with / without rseq) Patched result is the same regardless of rseq as the codepath is avoided. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-07-02fs: move cleanup from init_file() into its callersAmir Goldstein
The use of file_free_rcu() in init_file() to free the struct that was allocated by the caller was hacky and we got what we deserved. Let init_file() and its callers take care of cleaning up each after their own allocated resources on error. Fixes: 62d53c4a1dfe ("fs: use backing_file container for internal files with "fake" f_path") # mainline only Reported-and-tested-by: syzbot+ada42aab05cf51b00e98@syzkaller.appspotmail.com Signed-off-by: Amir Goldstein <amir73il@gmail.com> Message-Id: <20230701171134.239409-1-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19fs: use backing_file container for internal files with "fake" f_pathAmir Goldstein
Overlayfs uses open_with_fake_path() to allocate internal kernel files, with a "fake" path - whose f_path is not on the same fs as f_inode. Allocate a container struct backing_file for those internal files, that is used to hold the "fake" ovl path along with the real path. backing_file_real_path() can be used to access the stored real path. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Message-Id: <20230615112229.2143178-5-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19fs: move kmem_cache_zalloc() into alloc_empty_file*() helpersAmir Goldstein
Use a common helper init_file() instead of __alloc_file() for alloc_empty_file*() helpers and improrve the documentation. This is needed for a follow up patch that allocates a backing_file container. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Message-Id: <20230615112229.2143178-4-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-01-11filelock: move file locking definitions to separate header fileJeff Layton
The file locking definitions have lived in fs.h since the dawn of time, but they are only used by a small subset of the source files that include it. Move the file locking definitions to a new header file, and add the appropriate #include directives to the source files that need them. By doing this we trim down fs.h a bit and limit the amount of rebuilding that has to be done when we make changes to the file locking APIs. Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Howells <dhowells@redhat.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Acked-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Steve French <stfrench@microsoft.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org>
2022-08-16locks: fix TOCTOU race when granting write leaseAmir Goldstein
Thread A trying to acquire a write lease checks the value of i_readcount and i_writecount in check_conflicting_open() to verify that its own fd is the only fd referencing the file. Thread B trying to open the file for read will call break_lease() in do_dentry_open() before incrementing i_readcount, which leaves a small window where thread A can acquire the write lease and then thread B completes the open of the file for read without breaking the write lease that was acquired by thread A. Fix this race by incrementing i_readcount before checking for existing leases, same as the case with i_writecount. Use a helper put_file_access() to decrement i_readcount or i_writecount in do_dentry_open() and __fput(). Fixes: 387e3746d01c ("locks: eliminate false positive conflicts for write lease") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-03Merge tag 'pull-work.iov_iter-base' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs iov_iter updates from Al Viro: "Part 1 - isolated cleanups and optimizations. One of the goals is to reduce the overhead of using ->read_iter() and ->write_iter() instead of ->read()/->write(). new_sync_{read,write}() has a surprising amount of overhead, in particular inside iocb_flags(). That's the explanation for the beginning of the series is in this pile; it's not directly iov_iter-related, but it's a part of the same work..." * tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: first_iovec_segment(): just return address iov_iter: massage calling conventions for first_{iovec,bvec}_segment() iov_iter: first_{iovec,bvec}_segment() - simplify a bit iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment() iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT iov_iter_bvec_advance(): don't bother with bvec_iter copy_page_{to,from}_iter(): switch iovec variants to generic keep iocb_flags() result cached in struct file iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC struct file: use anonymous union member for rcuhead and llist btrfs: use IOMAP_DIO_NOSYNC teach iomap_dio_rw() to suppress dsync No need of likely/unlikely on calls of check_copy_size()
2022-07-16fs: remove no_llseekJason A. Donenfeld
Now that all callers of ->llseek are going through vfs_llseek(), we don't gain anything by keeping no_llseek around. Nothing actually calls it and setting ->llseek to no_lseek is completely equivalent to leaving it NULL. Longer term (== by the end of merge window) we want to remove all such intializations. To simplify the merge window this commit does *not* touch initializers - it only defines no_llseek as NULL (and simplifies the tests on file opening). At -rc1 we'll need do a mechanical removal of no_llseek - git grep -l -w no_llseek | grep -v porting.rst | while read i; do sed -i '/\<no_llseek\>/d' $i done would do it. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-07-16fs: clear or set FMODE_LSEEK based on llseek functionJason A. Donenfeld
Pipe-like behaviour on llseek(2) (i.e. unconditionally failing with -ESPIPE) can be expresses in 3 ways: 1) ->llseek set to NULL in file_operations 2) ->llseek set to no_llseek in file_operations 3) FMODE_LSEEK *not* set in ->f_mode. Enforce (3) in cases (1) and (2); that will allow to simplify the checks and eventually get rid of no_llseek boilerplate. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-06-10keep iocb_flags() result cached in struct fileAl Viro
* calculate at the time we set FMODE_OPENED (do_dentry_open() for normal opens, alloc_file() for pipe()/socket()/etc.) * update when handling F_SETFL * keep in a new field - file->f_iocb_flags; since that thing is needed only before the refcount reaches zero, we can put it into the same anon union where ->f_rcuhead and ->f_llist live - those are used only after refcount reaches zero. Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>