summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2024-10-29mm/kasan: Don't store metadata inside kmalloc object when ↵Feng Tang
slub_debug_orig_size is on For a kmalloc object, when both kasan and slub redzone sanity check are enabled, they could both manipulate its data space like storing kasan free meta data and setting up kmalloc redzone, and may affect accuracy of that object's 'orig_size'. As an accurate 'orig_size' will be needed by some function like krealloc() soon, save kasan's free meta data in slub's metadata area instead of inside object when 'orig_size' is enabled. This will make it easier to maintain/understand the code. Size wise, when these two options are both enabled, the slub meta data space is already huge, and this just slightly increase the overall size. Signed-off-by: Feng Tang <feng.tang@intel.com> Acked-by: Andrey Konovalov <andreyknvl@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-10-29mm: krealloc: Fix MTE false alarm in __do_kreallocQun-Wei Lin
This patch addresses an issue introduced by commit 1a83a716ec233 ("mm: krealloc: consider spare memory for __GFP_ZERO") which causes MTE (Memory Tagging Extension) to falsely report a slab-out-of-bounds error. The problem occurs when zeroing out spare memory in __do_krealloc. The original code only considered software-based KASAN and did not account for MTE. It does not reset the KASAN tag before calling memset, leading to a mismatch between the pointer tag and the memory tag, resulting in a false positive. Example of the error: ================================================================== swapper/0: BUG: KASAN: slab-out-of-bounds in __memset+0x84/0x188 swapper/0: Write at addr f4ffff8005f0fdf0 by task swapper/0/1 swapper/0: Pointer tag: [f4], memory tag: [fe] swapper/0: swapper/0: CPU: 4 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.12. swapper/0: Hardware name: MT6991(ENG) (DT) swapper/0: Call trace: swapper/0: dump_backtrace+0xfc/0x17c swapper/0: show_stack+0x18/0x28 swapper/0: dump_stack_lvl+0x40/0xa0 swapper/0: print_report+0x1b8/0x71c swapper/0: kasan_report+0xec/0x14c swapper/0: __do_kernel_fault+0x60/0x29c swapper/0: do_bad_area+0x30/0xdc swapper/0: do_tag_check_fault+0x20/0x34 swapper/0: do_mem_abort+0x58/0x104 swapper/0: el1_abort+0x3c/0x5c swapper/0: el1h_64_sync_handler+0x80/0xcc swapper/0: el1h_64_sync+0x68/0x6c swapper/0: __memset+0x84/0x188 swapper/0: btf_populate_kfunc_set+0x280/0x3d8 swapper/0: __register_btf_kfunc_id_set+0x43c/0x468 swapper/0: register_btf_kfunc_id_set+0x48/0x60 swapper/0: register_nf_nat_bpf+0x1c/0x40 swapper/0: nf_nat_init+0xc0/0x128 swapper/0: do_one_initcall+0x184/0x464 swapper/0: do_initcall_level+0xdc/0x1b0 swapper/0: do_initcalls+0x70/0xc0 swapper/0: do_basic_setup+0x1c/0x28 swapper/0: kernel_init_freeable+0x144/0x1b8 swapper/0: kernel_init+0x20/0x1a8 swapper/0: ret_from_fork+0x10/0x20 ================================================================== Fixes: 1a83a716ec233 ("mm: krealloc: consider spare memory for __GFP_ZERO") Signed-off-by: Qun-Wei Lin <qun-wei.lin@mediatek.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-10-28mm: avoid unconditional one-tick sleep when swapcache_prepare failsBarry Song
Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") introduced an unconditional one-tick sleep when `swapcache_prepare()` fails, which has led to reports of UI stuttering on latency-sensitive Android devices. To address this, we can use a waitqueue to wake up tasks that fail `swapcache_prepare()` sooner, instead of always sleeping for a full tick. While tasks may occasionally be woken by an unrelated `do_swap_page()`, this method is preferable to two scenarios: rapid re-entry into page faults, which can cause livelocks, and multiple millisecond sleeps, which visibly degrade user experience. Oven's testing shows that a single waitqueue resolves the UI stuttering issue. If a 'thundering herd' problem becomes apparent later, a waitqueue hash similar to `folio_wait_table[PAGE_WAIT_TABLE_SIZE]` for page bit locks can be introduced. [v-songbaohua@oppo.com: wake_up only when swapcache_wq waitqueue is active] Link: https://lkml.kernel.org/r/20241008130807.40833-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240926211936.75373-1-21cnbao@gmail.com Fixes: 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reported-by: Oven Liyang <liyangouwen1@oppo.com> Tested-by: Oven Liyang <liyangouwen1@oppo.com> Cc: Kairui Song <kasong@tencent.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Yu Zhao <yuzhao@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: SeongJae Park <sj@kernel.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28mm: split critical region in remap_file_pages() and invoke LSMs in betweenKirill A. Shutemov
Commit ea7e2d5e49c0 ("mm: call the security_mmap_file() LSM hook in remap_file_pages()") fixed a security issue, it added an LSM check when trying to remap file pages, so that LSMs have the opportunity to evaluate such action like for other memory operations such as mmap() and mprotect(). However, that commit called security_mmap_file() inside the mmap_lock lock, while the other calls do it before taking the lock, after commit 8b3ec6814c83 ("take security_mmap_file() outside of ->mmap_sem"). This caused lock inversion issue with IMA which was taking the mmap_lock and i_mutex lock in the opposite way when the remap_file_pages() system call was called. Solve the issue by splitting the critical region in remap_file_pages() in two regions: the first takes a read lock of mmap_lock, retrieves the VMA and the file descriptor associated, and calculates the 'prot' and 'flags' variables; the second takes a write lock on mmap_lock, checks that the VMA flags and the VMA file descriptor are the same as the ones obtained in the first critical region (otherwise the system call fails), and calls do_mmap(). In between, after releasing the read lock and before taking the write lock, call security_mmap_file(), and solve the lock inversion issue. Link: https://lkml.kernel.org/r/20241018161415.3845146-1-roberto.sassu@huaweicloud.com Fixes: ea7e2d5e49c0 ("mm: call the security_mmap_file() LSM hook in remap_file_pages()") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com> Reported-by: syzbot+1cd571a672400ef3a930@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-security-module/66f7b10e.050a0220.46d20.0036.GAE@google.com/ Tested-by: Roberto Sassu <roberto.sassu@huawei.com> Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com> Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Paul Moore <paul@paul-moore.com> Tested-by: syzbot+1cd571a672400ef3a930@syzkaller.appspotmail.com Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Dmitry Kasatkin <dmitry.kasatkin@gmail.com> Cc: Eric Snowberg <eric.snowberg@oracle.com> Cc: James Morris <jmorris@namei.org> Cc: Mimi Zohar <zohar@linux.ibm.com> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Shu Han <ebpqwerty472123@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28mm/vma: add expand-only VMA merge mode and optimise do_brk_flags()Lorenzo Stoakes
Patch series "introduce VMA merge mode to improve brk() performance". A ~5% performance regression was discovered on the aim9.brk_test.ops_per_sec by the linux kernel test bot [0]. In the past to satisfy brk() performance we duplicated VMA expansion code and special-cased do_brk_flags(). This is however horrid and undoes work to abstract this logic, so in resolving the issue I have endeavoured to avoid this. Investigating further I was able to observe that the use of a vma_iter_next_range() and vma_prev() pair, causing an unnecessary maple tree walk. In addition there is work that we do that is simply unnecessary for brk(). Therefore, add a special VMA merge mode VMG_FLAG_JUST_EXPAND to avoid doing any of this - it assumes the VMA iterator is pointing at the previous VMA and which skips logic that brk() does not require. This mostly eliminates the performance regression reducing it to ~2% which is in the realm of noise. In addition, the will-it-scale test brk2, written to be more representative of real-world brk() usage, shows a modest performance improvement - which gives me confidence that we are not meaningfully regressing real workloads here. This series includes a test asserting that the 'just expand' mode works as expected. With many thanks to Oliver Sang for helping with performance testing of candidate patch sets! [0]:https://lore.kernel.org/linux-mm/202409301043.629bea78-oliver.sang@intel.com This patch (of 2): We know in advance that do_brk_flags() wants only to perform a VMA expansion (if the prior VMA is compatible), and that we assume no mergeable VMA follows it. These are the semantics of this function prior to the recent rewrite of the VMA merging logic, however we are now doing more work than necessary - positioning the VMA iterator at the prior VMA and performing tasks that are not required. Add a new field to the vmg struct to permit merge flags and add a new merge flag VMG_FLAG_JUST_EXPAND which implies this behaviour, and have do_brk_flags() use this. This fixes a reported performance regression in a brk() benchmarking suite. Link: https://lkml.kernel.org/r/cover.1729174352.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/4e65d4395e5841c5acf8470dbcb714016364fd39.1729174352.git.lorenzo.stoakes@oracle.com Fixes: cacded5e42b9 ("mm: avoid using vma_merge() for new VMAs") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/linux-mm/202409301043.629bea78-oliver.sang@intel.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28mm: numa_clear_kernel_node_hotplug: Add NUMA_NO_NODE check for node idNobuhiro Iwamatsu
The acquired memory blocks for reserved may include blocks outside of memory management. In this case, the nid variable is set to NUMA_NO_NODE (-1), so an error occurs in node_set(). This adds a check using numa_valid_node() to numa_clear_kernel_node_hotplug() that skips node_set() when nid is set to NUMA_NO_NODE. Link: https://lkml.kernel.org/r/1729070461-13576-1-git-send-email-nobuhiro1.iwamatsu@toshiba.co.jp Fixes: 87482708210f ("mm: introduce numa_memblks") Signed-off-by: Nobuhiro Iwamatsu <nobuhiro1.iwamatsu@toshiba.co.jp> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Suggested-by: Yuji Ishikawa <yuji2.ishikawa@toshiba.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28mm: shmem: fix data-race in shmem_getattr()Jeongjun Park
I got the following KCSAN report during syzbot testing: ================================================================== BUG: KCSAN: data-race in generic_fillattr / inode_set_ctime_current write to 0xffff888102eb3260 of 4 bytes by task 6565 on cpu 1: inode_set_ctime_to_ts include/linux/fs.h:1638 [inline] inode_set_ctime_current+0x169/0x1d0 fs/inode.c:2626 shmem_mknod+0x117/0x180 mm/shmem.c:3443 shmem_create+0x34/0x40 mm/shmem.c:3497 lookup_open fs/namei.c:3578 [inline] open_last_lookups fs/namei.c:3647 [inline] path_openat+0xdbc/0x1f00 fs/namei.c:3883 do_filp_open+0xf7/0x200 fs/namei.c:3913 do_sys_openat2+0xab/0x120 fs/open.c:1416 do_sys_open fs/open.c:1431 [inline] __do_sys_openat fs/open.c:1447 [inline] __se_sys_openat fs/open.c:1442 [inline] __x64_sys_openat+0xf3/0x120 fs/open.c:1442 x64_sys_call+0x1025/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:258 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x54/0x120 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e read to 0xffff888102eb3260 of 4 bytes by task 3498 on cpu 0: inode_get_ctime_nsec include/linux/fs.h:1623 [inline] inode_get_ctime include/linux/fs.h:1629 [inline] generic_fillattr+0x1dd/0x2f0 fs/stat.c:62 shmem_getattr+0x17b/0x200 mm/shmem.c:1157 vfs_getattr_nosec fs/stat.c:166 [inline] vfs_getattr+0x19b/0x1e0 fs/stat.c:207 vfs_statx_path fs/stat.c:251 [inline] vfs_statx+0x134/0x2f0 fs/stat.c:315 vfs_fstatat+0xec/0x110 fs/stat.c:341 __do_sys_newfstatat fs/stat.c:505 [inline] __se_sys_newfstatat+0x58/0x260 fs/stat.c:499 __x64_sys_newfstatat+0x55/0x70 fs/stat.c:499 x64_sys_call+0x141f/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:263 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x54/0x120 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e value changed: 0x2755ae53 -> 0x27ee44d3 Reported by Kernel Concurrency Sanitizer on: CPU: 0 UID: 0 PID: 3498 Comm: udevd Not tainted 6.11.0-rc6-syzkaller-00326-gd1f2d51b711a-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024 ================================================================== When calling generic_fillattr(), if you don't hold read lock, data-race will occur in inode member variables, which can cause unexpected behavior. Since there is no special protection when shmem_getattr() calls generic_fillattr(), data-race occurs by functions such as shmem_unlink() or shmem_mknod(). This can cause unexpected results, so commenting it out is not enough. Therefore, when calling generic_fillattr() from shmem_getattr(), it is appropriate to protect the inode using inode_lock_shared() and inode_unlock_shared() to prevent data-race. Link: https://lkml.kernel.org/r/20240909123558.70229-1-aha310510@gmail.com Fixes: 44a30220bc0a ("shmem: recalculate file inode when fstat") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Reported-by: syzbot <syzkaller@googlegroup.com> Cc: Hugh Dickins <hughd@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28mm: mark mas allocation in vms_abort_munmap_vmas as __GFP_NOFAILJann Horn
vms_abort_munmap_vmas() is a recovery path where, on entry, some VMAs have already been torn down halfway (in a way we can't undo) but are still present in the maple tree. At this point, we *must* remove the VMAs from the VMA tree, otherwise we get UAF. Because removing VMA tree nodes can require memory allocation, the existing code has an error path which tries to handle this by reattaching the VMAs; but that can't be done safely. A nicer way to fix it would probably be to preallocate enough maple tree nodes for the removal before the point of no return, or something like that; but for now, fix it the easy and kinda ugly way, by marking this allocation __GFP_NOFAIL. Link: https://lkml.kernel.org/r/20241016-fix-munmap-abort-v1-1-601c94b2240d@google.com Fixes: 4f87153e82c4 ("mm: change failure of MAP_FIXED to restoring the gap on failure") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28resource: remove dependency on SPARSEMEM from GET_FREE_REGIONHuang Ying
We want to use the functions (get_free_mem_region()) configured via GET_FREE_REGION in resource kunit tests. However, GET_FREE_REGION depends on SPARSEMEM now. This makes resource kunit tests cannot be built on some architectures lacking SPARSEMEM, or causes config warning as follows, WARNING: unmet direct dependencies detected for GET_FREE_REGION Depends on [n]: SPARSEMEM [=n] Selected by [y]: - RESOURCE_KUNIT_TEST [=y] && RUNTIME_TESTING_MENU [=y] && KUNIT [=y] When get_free_mem_region() was introduced the only consumers were those looking to pass the address range to memremap_pages(). That address range needed to be mindful of the maximum addressable platform physical address which at the time only SPARSMEM defined via MAX_PHYSMEM_BITS. Given that memremap_pages() also depended on SPARSEMEM via ZONE_DEVICE, it was easier to just depend on that definition than invent a general MAX_PHYSMEM_BITS concept outside of SPARSEMEM. Turns out that decision was buggy and did not account for KASAN consumption of physical address space. That problem was resolved recently with commit ea72ce5da228 ("x86/kaslr: Expose and use the end of the physical memory address space"), and GET_FREE_REGION dropped its MAX_PHYSMEM_BITS dependency. Then commit 99185c10d5d9 ("resource, kunit: add test case for region_intersects()"), went ahead and fixed up the only remaining dependency on SPARSEMEM which was usage of the PA_SECTION_SHIFT macro for setting the default alignment. A PAGE_SIZE fallback is fine in the SPARSEMEM=n case. With those build dependencies gone GET_FREE_REGION no longer depends on SPARSEMEM. So, the patch removes dependency on SPARSEMEM from GET_FREE_REGION to fix the build issues. Link: https://lkml.kernel.org/r/20241016014730.339369-1-ying.huang@intel.com Link: https://lore.kernel.org/lkml/20240922225041.603186-1-linux@roeck-us.net/ Link: https://lkml.kernel.org/r/20241015051554.294734-1-ying.huang@intel.com Fixes: 99185c10d5d9 ("resource, kunit: add test case for region_intersects()") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Co-developed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Nathan Chancellor <nathan@kernel.org> # build Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jonathan Cameron <jonathan.cameron@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28mm/mmap: fix race in mmap_region() with ftruncate()Liam R. Howlett
Avoiding the zeroing of the vma tree in mmap_region() introduced a race with truncate in the page table walk. To avoid any races, create a hole in the rmap during the operation by clearing the pagetable entries earlier under the mmap write lock and (critically) before the new vma is installed into the vma tree. The result is that the old vma(s) are left in the vma tree, but free_pgtables() removes them from the rmap and clears the ptes while holding the necessary locks. This change extends the fix required for hugetblfs and the call_mmap() function by moving the cleanup higher in the function and running it unconditionally. Link: https://lkml.kernel.org/r/20241016013455.2241533-1-Liam.Howlett@oracle.com Fixes: f8d112a4e657 ("mm/mmap: avoid zeroing vma tree in mmap_region()") Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reported-by: Jann Horn <jannh@google.com> Closes: https://lore.kernel.org/all/CAG48ez0ZpGzxi=-5O_uGQ0xKXOmbjeQ0LjZsRJ1Qtf2X5eOr1w@mail.gmail.com/ Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28mm/page_alloc: let GFP_ATOMIC order-0 allocs access highatomic reservesMatt Fleming
Under memory pressure it's possible for GFP_ATOMIC order-0 allocations to fail even though free pages are available in the highatomic reserves. GFP_ATOMIC allocations cannot trigger unreserve_highatomic_pageblock() since it's only run from reclaim. Given that such allocations will pass the watermarks in __zone_watermark_unusable_free(), it makes sense to fallback to highatomic reserves the same way that ALLOC_OOM can. This fixes order-0 page allocation failures observed on Cloudflare's fleet when handling network packets: kswapd1: page allocation failure: order:0, mode:0x820(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0-7 CPU: 10 PID: 696 Comm: kswapd1 Kdump: loaded Tainted: G O 6.6.43-CUSTOM #1 Hardware name: MACHINE Call Trace: <IRQ> dump_stack_lvl+0x3c/0x50 warn_alloc+0x13a/0x1c0 __alloc_pages_slowpath.constprop.0+0xc9d/0xd10 __alloc_pages+0x327/0x340 __napi_alloc_skb+0x16d/0x1f0 bnxt_rx_page_skb+0x96/0x1b0 [bnxt_en] bnxt_rx_pkt+0x201/0x15e0 [bnxt_en] __bnxt_poll_work+0x156/0x2b0 [bnxt_en] bnxt_poll+0xd9/0x1c0 [bnxt_en] __napi_poll+0x2b/0x1b0 bpf_trampoline_6442524138+0x7d/0x1000 __napi_poll+0x5/0x1b0 net_rx_action+0x342/0x740 handle_softirqs+0xcf/0x2b0 irq_exit_rcu+0x6c/0x90 sysvec_apic_timer_interrupt+0x72/0x90 </IRQ> [mfleming@cloudflare.com: update comment] Link: https://lkml.kernel.org/r/20241015125158.3597702-1-matt@readmodwrite.com Link: https://lkml.kernel.org/r/20241011120737.3300370-1-matt@readmodwrite.com Link: https://lore.kernel.org/all/CAGis_TWzSu=P7QJmjD58WWiu3zjMTVKSzdOwWE8ORaGytzWJwQ@mail.gmail.com/ Fixes: 1d91df85f399 ("mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs") Signed-off-by: Matt Fleming <mfleming@cloudflare.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28mm/pagewalk: fix usage of pmd_leaf()/pud_leaf() without present checkDavid Hildenbrand
pmd_leaf()/pud_leaf() only implies a pmd_present()/pud_present() check on some architectures. We really should check for pmd_present()/pud_present() first. This should explain the report we got on ppc64 (which has CONFIG_PGTABLE_HAS_HUGE_LEAVES set in the config) that triggered: VM_WARN_ON_ONCE(pmd_leaf(pmdp_get_lockless(pmdp))); Likely we had a PMD migration entry for which pmd_leaf() did not trigger. We raced with restoring the PMD migration entry, and suddenly saw a pmd_leaf(). In this case, pte_offset_map_lock() saved us from more trouble, because it rechecks the PMD value, but we would not have processed the migration entry -- which is not too bad because the only user of FW_MIGRATION is KSM for unsharing, and KSM only applies to small folios. Further, we shouldn't re-read the PMD/PUD value for our warning, the primary purpose of the VM_WARN_ON_ONCE() is to find spurious use of pmd_leaf()/pud_leaf() without CONFIG_PGTABLE_HAS_HUGE_LEAVES. As a side note, we are currently not implementing FW_MIGRATION support for PUD migration entries, which likely should exist due to hugetlb. Add a TODO so this won't fall through the cracks if more FW_MIGRATION users get added. Was able to write a quick reproducer and verify that the issue no longer triggers with this fix. https://gitlab.com/davidhildenbrand/scratchspace/-/blob/main/reproducers/move-pages-pmd-leaf.c Without this fix after a couple of seconds in a VM with 2 NUMA nodes: [ 54.333753] ------------[ cut here ]------------ [ 54.334901] WARNING: CPU: 20 PID: 1704 at mm/pagewalk.c:815 folio_walk_start+0x48f/0x6e0 [ 54.336455] Modules linked in: ... [ 54.345009] CPU: 20 UID: 0 PID: 1704 Comm: move-pages-pmd- Not tainted 6.12.0-rc2+ #81 [ 54.346529] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014 [ 54.348191] RIP: 0010:folio_walk_start+0x48f/0x6e0 [ 54.349134] Code: b5 ad 48 8d 35 00 00 00 00 e8 6d 59 d7 ff e8 08 74 da ff e9 9c fe ff ff 4c 8b 7c 24 08 4c 89 ff e8 26 2b be 00 e9 8a fe ff ff <0f> 0b e9 ec fe ff ff f7 c2 ff 0f 00 00 0f 85 81 fe ff ff 48 8b 02 [ 54.352660] RSP: 0018:ffffb7e4c430bc78 EFLAGS: 00010282 [ 54.353679] RAX: 80000002a3e008e7 RBX: ffff9946039aa580 RCX: ffff994380000000 [ 54.355056] RDX: ffff994606aec000 RSI: 00007f004b000000 RDI: 0000000000000000 [ 54.356440] RBP: 00007f004b000000 R08: 0000000000000591 R09: 0000000000000001 [ 54.357820] R10: 0000000000000200 R11: 0000000000000001 R12: ffffb7e4c430bd10 [ 54.359198] R13: ffff994606aec2c0 R14: 0000000000000002 R15: ffff994604a89b00 [ 54.360564] FS: 00007f004ae006c0(0000) GS:ffff9947f7400000(0000) knlGS:0000000000000000 [ 54.362111] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 54.363242] CR2: 00007f004adffe58 CR3: 0000000281e12005 CR4: 0000000000770ef0 [ 54.364615] PKRU: 55555554 [ 54.365153] Call Trace: [ 54.365646] <TASK> [ 54.366073] ? __warn.cold+0xb7/0x14d [ 54.366796] ? folio_walk_start+0x48f/0x6e0 [ 54.367628] ? report_bug+0xff/0x140 [ 54.368324] ? handle_bug+0x58/0x90 [ 54.369019] ? exc_invalid_op+0x17/0x70 [ 54.369771] ? asm_exc_invalid_op+0x1a/0x20 [ 54.370606] ? folio_walk_start+0x48f/0x6e0 [ 54.371415] ? folio_walk_start+0x9e/0x6e0 [ 54.372227] do_pages_move+0x1c5/0x680 [ 54.372972] kernel_move_pages+0x1a1/0x2b0 [ 54.373804] __x64_sys_move_pages+0x25/0x30 Link: https://lkml.kernel.org/r/20241015111236.1290921-1-david@redhat.com Fixes: aa39ca6940f1 ("mm/pagewalk: introduce folio_walk_start() + folio_walk_end()") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: syzbot+7d917f67c05066cec295@syzkaller.appspotmail.com Closes: https://lkml.kernel.org/r/670d3248.050a0220.3e960.0064.GAE@google.com Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28mm/gup: Add folio_add_pins()Steve Sistare
Export a function that adds pins to an already-pinned huge-page folio. This allows any range of small pages within the folio to be unpinned later. For example, pages pinned via memfd_pin_folios and modified by folio_add_pins could be unpinned via unpin_user_page(s). Link: https://patch.msgid.link/r/1729861919-234514-2-git-send-email-steven.sistare@oracle.com Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Suggested-by: David Hildenbrand <david@redhat.com> Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-10-28tmpfs: Expose filesystem features via sysfsAndré Almeida
Expose filesystem features through sysfs, so userspace can query if tmpfs support casefold. This follows the same setup as defined by ext4 and f2fs to expose casefold support to userspace. Signed-off-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20241021-tonyk-tmpfs-v8-8-f443d5814194@igalia.com Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-28tmpfs: Add flag FS_CASEFOLD_FL support for tmpfs dirsAndré Almeida
Enable setting flag FS_CASEFOLD_FL for tmpfs directories, when tmpfs is mounted with casefold support. A special check is need for this flag, since it can't be set for non-empty directories. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be> Signed-off-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20241021-tonyk-tmpfs-v8-7-f443d5814194@igalia.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-28tmpfs: Add casefold lookup supportAndré Almeida
Enable casefold lookup in tmpfs, based on the encoding defined by userspace. That means that instead of comparing byte per byte a file name, it compares to a case-insensitive equivalent of the Unicode string. * Dcache handling There's a special need when dealing with case-insensitive dentries. First of all, we currently invalidated every negative casefold dentries. That happens because currently VFS code has no proper support to deal with that, giving that it could incorrectly reuse a previous filename for a new file that has a casefold match. For instance, this could happen: $ mkdir DIR $ rm -r DIR $ mkdir dir $ ls DIR/ And would be perceived as inconsistency from userspace point of view, because even that we match files in a case-insensitive manner, we still honor whatever is the initial filename. Along with that, tmpfs stores only the first equivalent name dentry used in the dcache, preventing duplications of dentries in the dcache. The d_compare() version for casefold files uses a normalized string, so the filename under lookup will be compared to another normalized string for the existing file, achieving a casefolded lookup. * Enabling casefold via mount options Most filesystems have their data stored in disk, so casefold option need to be enabled when building a filesystem on a device (via mkfs). However, as tmpfs is a RAM backed filesystem, there's no disk information and thus no mkfs to store information about casefold. For tmpfs, create casefold options for mounting. Userspace can then enable casefold support for a mount point using: $ mount -t tmpfs -o casefold=utf8-12.1.0 fs_name mount_dir/ Userspace must set what Unicode standard is aiming to. The available options depends on what the kernel Unicode subsystem supports. And for strict encoding: $ mount -t tmpfs -o casefold=utf8-12.1.0,strict_encoding fs_name mount_dir/ Strict encoding means that tmpfs will refuse to create invalid UTF-8 sequences. When this option is not enabled, any invalid sequence will be treated as an opaque byte sequence, ignoring the encoding thus not being able to be looked up in a case-insensitive way. * Check for casefold dirs on simple_lookup() On simple_lookup(), do not create dentries for casefold directories. Currently, VFS does not support case-insensitive negative dentries and can create inconsistencies in the filesystem. Prevent such dentries to being created in the first place. Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: André Almeida <andrealmeid@igalia.com> Link: https://lore.kernel.org/r/20241021-tonyk-tmpfs-v8-6-f443d5814194@igalia.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-28memblock: uniformly initialize all reserved pages to MIGRATE_MOVABLEHua Su
Currently when CONFIG_DEFERRED_STRUCT_PAGE_INIT is not set, the reserved pages are initialized to MIGRATE_MOVABLE by default in memmap_init. Reserved memory mainly store the metadata of struct page. When HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=Y and hugepages are allocated, the HVO will remap the vmemmap virtual address range to the page which vmemmap_reuse is mapped to. The pages previously mapping the range will be freed to the buddy system. Before this patch: when CONFIG_DEFERRED_STRUCT_PAGE_INIT is not set, the freed memory was placed on the Movable list; When CONFIG_DEFERRED_STRUCT_PAGE_INIT=Y, the freed memory was placed on the Unmovable list. After this patch, the freed memory is placed on the Movable list regardless of whether CONFIG_DEFERRED_STRUCT_PAGE_INIT is set. Eg: Tested on a virtual machine(1000GB): Intel(R) Xeon(R) Platinum 8358P CPU After vm start: echo 500000 > /proc/sys/vm/nr_hugepages cat /proc/meminfo | grep -i huge HugePages_Total: 500000 HugePages_Free: 500000 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB Hugetlb: 1024000000 kB cat /proc/pagetypeinfo before: Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 … Node 0, zone Normal, type Unmovable 51 2 1 28 53 35 35 43 40 69 3852 Node 0, zone Normal, type Movable 6485 4610 666 202 200 185 208 87 54 2 240 Node 0, zone Normal, type Reclaimable 2 2 1 23 13 1 2 1 0 1 0 Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0 Unmovable ≈ 15GB after: Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 … Node 0, zone Normal, type Unmovable 0 1 1 0 0 0 0 1 1 1 0 Node 0, zone Normal, type Movable 1563 4107 1119 189 256 368 286 132 109 4 3841 Node 0, zone Normal, type Reclaimable 2 2 1 23 13 1 2 1 0 1 0 Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0 Signed-off-by: Hua Su <suhua.tanke@gmail.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Link: https://lore.kernel.org/r/20241021051151.4664-1-suhua.tanke@gmail.com Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2024-10-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfAlexei Starovoitov
Cross-merge bpf fixes after downstream PR. No conflicts. Adjacent changes in: include/linux/bpf.h include/uapi/linux/bpf.h kernel/bpf/btf.c kernel/bpf/helpers.c kernel/bpf/syscall.c kernel/bpf/verifier.c kernel/trace/bpf_trace.c mm/slab_common.c tools/include/uapi/linux/bpf.h tools/testing/selftests/bpf/Makefile Link: https://lore.kernel.org/all/20241024215724.60017-1-daniel@iogearbox.net/ Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-22mm/page-writeback.c: Fix comment of wb_domain_writeout_add()Tang Yizhou
__bdi_writeout_inc() has undergone multiple renamings, but the comment within the function body have not been updated accordingly. Update it to reflect the latest wb_domain_writeout_add(). Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Link: https://lore.kernel.org/r/20241009151728.300477-3-yizhou.tang@shopee.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-22mm/page-writeback.c: Update comment for BANDWIDTH_INTERVALTang Yizhou
The name of the BANDWIDTH_INTERVAL macro is misleading, as it is not only used in the bandwidth update functions wb_update_bandwidth() and __wb_update_bandwidth(), but also in the dirty limit update function domain_update_dirty_limit(). Currently, we haven't found an ideal name, so update the comment only. Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Link: https://lore.kernel.org/r/20241009151728.300477-2-yizhou.tang@shopee.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-21LoongArch: Set initial pte entry with PAGE_GLOBAL for kernel spaceBibo Mao
There are two pages in one TLB entry on LoongArch system. For kernel space, it requires both two pte entries (buddies) with PAGE_GLOBAL bit set, otherwise HW treats it as non-global tlb, there will be potential problems if tlb entry for kernel space is not global. Such as fail to flush kernel tlb with the function local_flush_tlb_kernel_range() which supposed only flush tlb with global bit. Kernel address space areas include percpu, vmalloc, vmemmap, fixmap and kasan areas. For these areas both two consecutive page table entries should be enabled with PAGE_GLOBAL bit. So with function set_pte() and pte_clear(), pte buddy entry is checked and set besides its own pte entry. However it is not atomic operation to set both two pte entries, there is problem with test_vmalloc test case. So function kernel_pte_init() is added to init a pte table when it is created for kernel address space, and the default initial pte value is PAGE_GLOBAL rather than zero at beginning. Then only its own pte entry need update with function set_pte() and pte_clear(), nothing to do with the pte buddy entry. Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-10-18mm: fix follow_pfnmap API lockdep assertLinus Torvalds
The lockdep asserts for the new follow_pfnmap() API "knows" that a pfnmap always has a vma->vm_file, since that's the only way to create such a mapping. And that's actually true for all the normal cases. But not for the mmap failure case, where the incomplete mapping is torn down and we have cleared vma->vm_file because the failure occured before the file was linked to the vma. So this codepath does actually need to check for vm_file being NULL. Reported-by: Jann Horn <jannh@google.com> Fixes: 6da8e9634bb7 ("mm: new follow_pfnmap API") Cc: Peter Xu <peterx@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-10-18mm: Use str_on_off() helper function in report_meminit()Thorsten Blum
Remove hard-coded strings by using the helper function str_on_off(). Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://lore.kernel.org/r/20241018103150.96824-2-thorsten.blum@linux.dev Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2024-10-17mm/mglru: only clear kswapd_failures if reclaimableWei Xu
lru_gen_shrink_node() unconditionally clears kswapd_failures, which can prevent kswapd from sleeping and cause 100% kswapd cpu usage even when kswapd repeatedly fails to make progress in reclaim. Only clear kswap_failures in lru_gen_shrink_node() if reclaim makes some progress, similar to shrink_node(). I happened to run into this problem in one of my tests recently. It requires a combination of several conditions: The allocator needs to allocate a right amount of pages such that it can wake up kswapd without itself being OOM killed; there is no memory for kswapd to reclaim (My test disables swap and cleans page cache first); no other process frees enough memory at the same time. Link: https://lkml.kernel.org/r/20241014221211.832591-1-weixugc@google.com Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") Signed-off-by: Wei Xu <weixugc@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Jan Alexander Steffens <heftig@archlinux.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-17mm/swapfile: skip HugeTLB pages for unuse_vmaLiu Shixin
I got a bad pud error and lost a 1GB HugeTLB when calling swapoff. The problem can be reproduced by the following steps: 1. Allocate an anonymous 1GB HugeTLB and some other anonymous memory. 2. Swapout the above anonymous memory. 3. run swapoff and we will get a bad pud error in kernel message: mm/pgtable-generic.c:42: bad pud 00000000743d215d(84000001400000e7) We can tell that pud_clear_bad is called by pud_none_or_clear_bad in unuse_pud_range() by ftrace. And therefore the HugeTLB pages will never be freed because we lost it from page table. We can skip HugeTLB pages for unuse_vma to fix it. Link: https://lkml.kernel.org/r/20241015014521.570237-1-liushixin2@huawei.com Fixes: 0fe6e20b9c4c ("hugetlb, rmap: add reverse mapping for hugepage") Signed-off-by: Liu Shixin <liushixin2@huawei.com> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-17mm: swap: prevent possible data-race in __try_to_reclaim_swapJeongjun Park
A report [1] was uploaded from syzbot. In the previous commit 862590ac3708 ("mm: swap: allow cache reclaim to skip slot cache"), the __try_to_reclaim_swap() function reads offset and folio->entry from folio without folio_lock protection. In the currently reported KCSAN log, it is assumed that the actual data-race will not occur because the calltrace that does WRITE already obtains the folio_lock and then writes. However, the existing __try_to_reclaim_swap() function was already implemented to perform reads under folio_lock protection [1], and there is a risk of a data-race occurring through a function other than the one shown in the KCSAN log. Therefore, I think it is appropriate to change read operations for folio to be performed under folio_lock. [1] ================================================================== BUG: KCSAN: data-race in __delete_from_swap_cache / __try_to_reclaim_swap write to 0xffffea0004c90328 of 8 bytes by task 5186 on cpu 0: __delete_from_swap_cache+0x1f0/0x290 mm/swap_state.c:163 delete_from_swap_cache+0x72/0xe0 mm/swap_state.c:243 folio_free_swap+0x1d8/0x1f0 mm/swapfile.c:1850 free_swap_cache mm/swap_state.c:293 [inline] free_pages_and_swap_cache+0x1fc/0x410 mm/swap_state.c:325 __tlb_batch_free_encoded_pages mm/mmu_gather.c:136 [inline] tlb_batch_pages_flush mm/mmu_gather.c:149 [inline] tlb_flush_mmu_free mm/mmu_gather.c:366 [inline] tlb_flush_mmu+0x2cf/0x440 mm/mmu_gather.c:373 zap_pte_range mm/memory.c:1700 [inline] zap_pmd_range mm/memory.c:1739 [inline] zap_pud_range mm/memory.c:1768 [inline] zap_p4d_range mm/memory.c:1789 [inline] unmap_page_range+0x1f3c/0x22d0 mm/memory.c:1810 unmap_single_vma+0x142/0x1d0 mm/memory.c:1856 unmap_vmas+0x18d/0x2b0 mm/memory.c:1900 exit_mmap+0x18a/0x690 mm/mmap.c:1864 __mmput+0x28/0x1b0 kernel/fork.c:1347 mmput+0x4c/0x60 kernel/fork.c:1369 exit_mm+0xe4/0x190 kernel/exit.c:571 do_exit+0x55e/0x17f0 kernel/exit.c:926 do_group_exit+0x102/0x150 kernel/exit.c:1088 get_signal+0xf2a/0x1070 kernel/signal.c:2917 arch_do_signal_or_restart+0x95/0x4b0 arch/x86/kernel/signal.c:337 exit_to_user_mode_loop kernel/entry/common.c:111 [inline] exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline] __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] syscall_exit_to_user_mode+0x59/0x130 kernel/entry/common.c:218 do_syscall_64+0xd6/0x1c0 arch/x86/entry/common.c:89 entry_SYSCALL_64_after_hwframe+0x77/0x7f read to 0xffffea0004c90328 of 8 bytes by task 5189 on cpu 1: __try_to_reclaim_swap+0x9d/0x510 mm/swapfile.c:198 free_swap_and_cache_nr+0x45d/0x8a0 mm/swapfile.c:1915 zap_pte_range mm/memory.c:1656 [inline] zap_pmd_range mm/memory.c:1739 [inline] zap_pud_range mm/memory.c:1768 [inline] zap_p4d_range mm/memory.c:1789 [inline] unmap_page_range+0xcf8/0x22d0 mm/memory.c:1810 unmap_single_vma+0x142/0x1d0 mm/memory.c:1856 unmap_vmas+0x18d/0x2b0 mm/memory.c:1900 exit_mmap+0x18a/0x690 mm/mmap.c:1864 __mmput+0x28/0x1b0 kernel/fork.c:1347 mmput+0x4c/0x60 kernel/fork.c:1369 exit_mm+0xe4/0x190 kernel/exit.c:571 do_exit+0x55e/0x17f0 kernel/exit.c:926 __do_sys_exit kernel/exit.c:1055 [inline] __se_sys_exit kernel/exit.c:1053 [inline] __x64_sys_exit+0x1f/0x20 kernel/exit.c:1053 x64_sys_call+0x2d46/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:61 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xc9/0x1c0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f value changed: 0x0000000000000242 -> 0x0000000000000000 Link: https://lkml.kernel.org/r/20241007070623.23340-1-aha310510@gmail.com Reported-by: syzbot+fa43f1b63e3aa6f66329@syzkaller.appspotmail.com Fixes: 862590ac3708 ("mm: swap: allow cache reclaim to skip slot cache") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-17mm: khugepaged: fix the incorrect statistics when collapsing large file foliosBaolin Wang
Khugepaged already supports collapsing file large folios (including shmem mTHP) by commit 7de856ffd007 ("mm: khugepaged: support shmem mTHP collapse"), and the control parameters in khugepaged: 'khugepaged_max_ptes_swap' and 'khugepaged_max_ptes_none', still compare based on PTE granularity to determine whether a file collapse is needed. However, the statistics for 'present' and 'swap' in hpage_collapse_scan_file() do not take into account the large folios, which may lead to incorrect judgments regarding the khugepaged_max_ptes_swap/none parameters, resulting in unnecessary file collapses. To fix this issue, take into account the large folios' statistics for 'present' and 'swap' variables in the hpage_collapse_scan_file(). Link: https://lkml.kernel.org/r/c76305d96d12d030a1a346b50503d148364246d2.1728901391.git.baolin.wang@linux.alibaba.com Fixes: 7de856ffd007 ("mm: khugepaged: support shmem mTHP collapse") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-17mm: don't install PMD mappings when THPs are disabled by the hw/process/vmaDavid Hildenbrand
We (or rather, readahead logic :) ) might be allocating a THP in the pagecache and then try mapping it into a process that explicitly disabled THP: we might end up installing PMD mappings. This is a problem for s390x KVM, which explicitly remaps all PMD-mapped THPs to be PTE-mapped in s390_enable_sie()->thp_split_mm(), before starting the VM. For example, starting a VM backed on a file system with large folios supported makes the VM crash when the VM tries accessing such a mapping using KVM. Is it also a problem when the HW disabled THP using TRANSPARENT_HUGEPAGE_UNSUPPORTED? At least on x86 this would be the case without X86_FEATURE_PSE. In the future, we might be able to do better on s390x and only disallow PMD mappings -- what s390x and likely TRANSPARENT_HUGEPAGE_UNSUPPORTED really wants. For now, fix it by essentially performing the same check as would be done in __thp_vma_allowable_orders() or in shmem code, where this works as expected, and disallow PMD mappings, making us fallback to PTE mappings. Link: https://lkml.kernel.org/r/20241011102445.934409-3-david@redhat.com Fixes: 793917d997df ("mm/readahead: Add large folio readahead") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Leo Fu <bfu@redhat.com> Tested-by: Thomas Huth <thuth@redhat.com> Cc: Thomas Huth <thuth@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-17mm: huge_memory: add vma_thp_disabled() and thp_disabled_by_hw()Kefeng Wang
Patch series "mm: don't install PMD mappings when THPs are disabled by the hw/process/vma". During testing, it was found that we can get PMD mappings in processes where THP (and more precisely, PMD mappings) are supposed to be disabled. While it works as expected for anon+shmem, the pagecache is the problematic bit. For s390 KVM this currently means that a VM backed by a file located on filesystem with large folio support can crash when KVM tries accessing the problematic page, because the readahead logic might decide to use a PMD-sized THP and faulting it into the page tables will install a PMD mapping, something that s390 KVM cannot tolerate. This might also be a problem with HW that does not support PMD mappings, but I did not try reproducing it. Fix it by respecting the ways to disable THPs when deciding whether we can install a PMD mapping. khugepaged should already be taking care of not collapsing if THPs are effectively disabled for the hw/process/vma. This patch (of 2): Add vma_thp_disabled() and thp_disabled_by_hw() helpers to be shared by shmem_allowable_huge_orders() and __thp_vma_allowable_orders(). [david@redhat.com: rename to vma_thp_disabled(), split out thp_disabled_by_hw() ] Link: https://lkml.kernel.org/r/20241011102445.934409-2-david@redhat.com Fixes: 793917d997df ("mm/readahead: Add large folio readahead") Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Leo Fu <bfu@redhat.com> Tested-by: Thomas Huth <thuth@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Boqiao Fu <bfu@redhat.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-17mm: khugepaged: fix the arguments order in khugepaged_collapse_file trace pointYang Shi
The "addr" and "is_shmem" arguments have different order in TP_PROTO and TP_ARGS. This resulted in the incorrect trace result: text-hugepage-644429 [276] 392092.878683: mm_khugepaged_collapse_file: mm=0xffff20025d52c440, hpage_pfn=0x200678c00, index=512, addr=1, is_shmem=0, filename=text-hugepage, nr=512, result=failed The value of "addr" is wrong because it was treated as bool value, the type of is_shmem. Fix the order in TP_PROTO to keep "addr" is before "is_shmem" since the original patch review suggested this order to achieve best packing. And use "lx" for "addr" instead of "ld" in TP_printk because address is typically shown in hex. After the fix, the trace result looks correct: text-hugepage-7291 [004] 128.627251: mm_khugepaged_collapse_file: mm=0xffff0001328f9500, hpage_pfn=0x20016ea00, index=512, addr=0x400000, is_shmem=0, filename=text-hugepage, nr=512, result=failed Link: https://lkml.kernel.org/r/20241012011702.1084846-1-yang@os.amperecomputing.com Fixes: 4c9473e87e75 ("mm/khugepaged: add tracepoint to collapse_file()") Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Cc: Gautam Menghani <gautammenghani201@gmail.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: <stable@vger.kernel.org> [6.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-17mm/damon/tests/sysfs-kunit.h: fix memory leak in damon_sysfs_test_add_targets()Jinjie Ruan
The sysfs_target->regions allocated in damon_sysfs_regions_alloc() is not freed in damon_sysfs_test_add_targets(), which cause the following memory leak, free it to fix it. unreferenced object 0xffffff80c2a8db80 (size 96): comm "kunit_try_catch", pid 187, jiffies 4294894363 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc 0): [<0000000001e3714d>] kmemleak_alloc+0x34/0x40 [<000000008e6835c1>] __kmalloc_cache_noprof+0x26c/0x2f4 [<000000001286d9f8>] damon_sysfs_test_add_targets+0x1cc/0x738 [<0000000032ef8f77>] kunit_try_run_case+0x13c/0x3ac [<00000000f3edea23>] kunit_generic_run_threadfn_adapter+0x80/0xec [<00000000adf936cf>] kthread+0x2e8/0x374 [<0000000041bb1628>] ret_from_fork+0x10/0x20 Link: https://lkml.kernel.org/r/20241010125323.3127187-1-ruanjinjie@huawei.com Fixes: b8ee5575f763 ("mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()") Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-17mm: remove unused stub for can_swapin_thp()Andy Shevchenko
When can_swapin_thp() is unused, it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y: mm/memory.c:4184:20: error: unused function 'can_swapin_thp' [-Werror,-Wunused-function] Fix this by removing the unused stub. See also commit 6863f5643dd7 ("kbuild: allow Clang to find unused static inline functions for W=1 build"). Link: https://lkml.kernel.org/r/20241008191329.2332346-1-andriy.shevchenko@linux.intel.com Fixes: 242d12c98174 ("mm: support large folios swap-in for sync io devices") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Barry Song <baohua@kernel.org> Cc: Bill Wendling <morbo@google.com> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: Justin Stitt <justinstitt@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-17mm/mremap: fix move_normal_pmd/retract_page_tables raceJann Horn
In mremap(), move_page_tables() looks at the type of the PMD entry and the specified address range to figure out by which method the next chunk of page table entries should be moved. At that point, the mmap_lock is held in write mode, but no rmap locks are held yet. For PMD entries that point to page tables and are fully covered by the source address range, move_pgt_entry(NORMAL_PMD, ...) is called, which first takes rmap locks, then does move_normal_pmd(). move_normal_pmd() takes the necessary page table locks at source and destination, then moves an entire page table from the source to the destination. The problem is: The rmap locks, which protect against concurrent page table removal by retract_page_tables() in the THP code, are only taken after the PMD entry has been read and it has been decided how to move it. So we can race as follows (with two processes that have mappings of the same tmpfs file that is stored on a tmpfs mount with huge=advise); note that process A accesses page tables through the MM while process B does it through the file rmap: process A process B ========= ========= mremap mremap_to move_vma move_page_tables get_old_pmd alloc_new_pmd *** PREEMPT *** madvise(MADV_COLLAPSE) do_madvise madvise_walk_vmas madvise_vma_behavior madvise_collapse hpage_collapse_scan_file collapse_file retract_page_tables i_mmap_lock_read(mapping) pmdp_collapse_flush i_mmap_unlock_read(mapping) move_pgt_entry(NORMAL_PMD, ...) take_rmap_locks move_normal_pmd drop_rmap_locks When this happens, move_normal_pmd() can end up creating bogus PMD entries in the line `pmd_populate(mm, new_pmd, pmd_pgtable(pmd))`. The effect depends on arch-specific and machine-specific details; on x86, you can end up with physical page 0 mapped as a page table, which is likely exploitable for user->kernel privilege escalation. Fix the race by letting process B recheck that the PMD still points to a page table after the rmap locks have been taken. Otherwise, we bail and let the caller fall back to the PTE-level copying path, which will then bail immediately at the pmd_none() check. Bug reachability: Reaching this bug requires that you can create shmem/file THP mappings - anonymous THP uses different code that doesn't zap stuff under rmap locks. File THP is gated on an experimental config flag (CONFIG_READ_ONLY_THP_FOR_FS), so on normal distro kernels you need shmem THP to hit this bug. As far as I know, getting shmem THP normally requires that you can mount your own tmpfs with the right mount flags, which would require creating your own user+mount namespace; though I don't know if some distros maybe enable shmem THP by default or something like that. Bug impact: This issue can likely be used for user->kernel privilege escalation when it is reachable. Link: https://lkml.kernel.org/r/20241007-move_normal_pmd-vs-collapse-fix-2-v1-1-5ead9631f2ea@google.com Fixes: 1d65b771bc08 ("mm/khugepaged: retract_page_tables() without mmap or vma lock") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: David Hildenbrand <david@redhat.com> Co-developed-by: David Hildenbrand <david@redhat.com> Closes: https://project-zero.issues.chromium.org/371047675 Acked-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-17mm/mmap: correct error handling in mmap_region()Lorenzo Stoakes
Commit f8d112a4e657 ("mm/mmap: avoid zeroing vma tree in mmap_region()") changed how error handling is performed in mmap_region(). The error value defaults to -ENOMEM, but then gets reassigned immediately to the result of vms_gather_munmap_vmas() if we are performing a MAP_FIXED mapping over existing VMAs (and thus unmapping them). This overwrites the error value, potentially clearing it. After this, we invoke may_expand_vm() and possibly vm_area_alloc(), and check to see if they failed. If they do so, then we perform error-handling logic, but importantly, we do NOT update the error code. This means that, if vms_gather_munmap_vmas() succeeds, but one of these calls does not, the function will return indicating no error, but rather an address value of zero, which is entirely incorrect. Correct this and avoid future confusion by strictly setting error on each and every occasion we jump to the error handling logic, and set the error code immediately prior to doing so. This way we can see at a glance that the error code is always correct. Many thanks to Vegard Nossum who spotted this issue in discussion around this problem. Link: https://lkml.kernel.org/r/20241002073932.13482-1-lorenzo.stoakes@oracle.com Fixes: f8d112a4e657 ("mm/mmap: avoid zeroing vma tree in mmap_region()") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: Vegard Nossum <vegard.nossum@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-16mm/bpf: Add bpf_get_kmem_cache() kfuncNamhyung Kim
The bpf_get_kmem_cache() is to get a slab cache information from a virtual address like virt_to_cache(). If the address is a pointer to a slab object, it'd return a valid kmem_cache pointer, otherwise NULL is returned. It doesn't grab a reference count of the kmem_cache so the caller is responsible to manage the access. The returned point is marked as PTR_UNTRUSTED. The intended use case for now is to symbolize locks in slab objects from the lock contention tracepoints. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> (mm/*) Acked-by: Vlastimil Babka <vbabka@suse.cz> #mm/slab Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20241010232505.1339892-3-namhyung@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-16mm/damon/core: Use generic upper bound recommondation for usleep_range()Anna-Maria Behnsen
The upper bound for usleep_range_idle() was taken from the outdated documentation. As a recommondation for the upper bound of usleep_range() depends on HZ configuration it is not possible to hard code it. Use the define "USLEEP_RANGE_UPPER_BOUND" instead. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://lore.kernel.org/all/20241014-devel-anna-maria-b4-timers-flseep-v3-8-dc8b907cb62f@linutronix.de
2024-10-16timers: Rename usleep_idle_range() to usleep_range_idle()Anna-Maria Behnsen
usleep_idle_range() is a variant of usleep_range(). Both are using usleep_range_state() as a base. To be able to find all the related functions in one go, rename it usleep_idle_range() to usleep_range_idle(). No functional change. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Link: https://lore.kernel.org/all/20241014-devel-anna-maria-b4-timers-flseep-v3-4-dc8b907cb62f@linutronix.de
2024-10-15rust: treewide: switch to the kernel `Vec` typeDanilo Krummrich
Now that we got the kernel `Vec` in place, convert all existing `Vec` users to make use of it. Reviewed-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Benno Lossin <benno.lossin@proton.me> Reviewed-by: Gary Guo <gary@garyguo.net> Signed-off-by: Danilo Krummrich <dakr@kernel.org> Link: https://lore.kernel.org/r/20241004154149.93856-20-dakr@kernel.org [ Converted `kasan_test_rust.rs` too, as discussed. - Miguel ] Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2024-10-10Merge patch series "timekeeping/fs: multigrain timestamp redux"Christian Brauner
Jeff Layton <jlayton@kernel.org> says: The VFS has always used coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide when to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. What we need is a way to only use fine-grained timestamps when they are being actively queried. Use the (unused) top bit in inode->i_ctime_nsec as a flag that indicates whether the current timestamps have been queried via stat() or the like. When it's set, we allow the kernel to use a fine-grained timestamp iff it's necessary to make the ctime show a different value. This solves the problem of being able to distinguish the timestamp between updates, but introduces a new problem: it's now possible for a file being changed to get a fine-grained timestamp. A file that is altered just a bit later can then get a coarse-grained one that appears older than the earlier fine-grained time. This violates timestamp ordering guarantees. To remedy this, keep a global monotonic atomic64_t value that acts as a timestamp floor. When we go to stamp a file, we first get the latter of the current floor value and the current coarse-grained time. If the inode ctime hasn't been queried then we just attempt to stamp it with that value. If it has been queried, then first see whether the current coarse time is later than the existing ctime. If it is, then we accept that value. If it isn't, then we get a fine-grained time and try to swap that into the global floor. Whether that succeeds or fails, we take the resulting floor time, convert it to realtime and try to swap that into the ctime. We take the result of the ctime swap whether it succeeds or fails, since either is just as valid. Filesystems can opt into this by setting the FS_MGTIME fstype flag. Others should be unaffected (other than being subject to the same floor value as multigrain filesystems). * patches from https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org: tmpfs: add support for multigrain timestamps btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps Documentation: add a new file documenting multigrain timestamps fs: add percpu counters for significant multigrain timestamp events fs: tracepoints around multigrain timestamp events fs: handle delegated timestamps in setattr_copy_mgtime fs: have setattr_copy handle multigrain timestamps appropriately fs: add infrastructure for multigrain timestamps Link: https://lore.kernel.org/r/20241002-mgtime-v10-0-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-10tmpfs: add support for multigrain timestampsJeff Layton
Enable multigrain timestamps, which should ensure that there is an apparent change to the timestamp whenever it has been written after being actively observed via getattr. tmpfs only requires the FS_MGTIME flag. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20241002-mgtime-v10-12-d1c4717f5284@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-09mm: zswap: delete comments for "value" member of 'struct zswap_entry'.Kanchana P Sridhar
Made a minor edit in the comments for 'struct zswap_entry' to delete the description of the 'value' member that was deleted in commit 20a5532ffa53 ("mm: remove code to handle same filled pages"). Link: https://lkml.kernel.org/r/20241002194213.30041-1-kanchana.p.sridhar@intel.com Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Fixes: 20a5532ffa53 ("mm: remove code to handle same filled pages") Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Usama Arif <usamaarif642@gmail.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Huang Ying <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Wajdi Feghali <wajdi.k.feghali@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09secretmem: disable memfd_secret() if arch cannot set direct mapPatrick Roy
Return -ENOSYS from memfd_secret() syscall if !can_set_direct_map(). This is the case for example on some arm64 configurations, where marking 4k PTEs in the direct map not present can only be done if the direct map is set up at 4k granularity in the first place (as ARM's break-before-make semantics do not easily allow breaking apart large/gigantic pages). More precisely, on arm64 systems with !can_set_direct_map(), set_direct_map_invalid_noflush() is a no-op, however it returns success (0) instead of an error. This means that memfd_secret will seemingly "work" (e.g. syscall succeeds, you can mmap the fd and fault in pages), but it does not actually achieve its goal of removing its memory from the direct map. Note that with this patch, memfd_secret() will start erroring on systems where can_set_direct_map() returns false (arm64 with CONFIG_RODATA_FULL_DEFAULT_ENABLED=n, CONFIG_DEBUG_PAGEALLOC=n and CONFIG_KFENCE=n), but that still seems better than the current silent failure. Since CONFIG_RODATA_FULL_DEFAULT_ENABLED defaults to 'y', most arm64 systems actually have a working memfd_secret() and aren't be affected. From going through the iterations of the original memfd_secret patch series, it seems that disabling the syscall in these scenarios was the intended behavior [1] (preferred over having set_direct_map_invalid_noflush return an error as that would result in SIGBUSes at page-fault time), however the check for it got dropped between v16 [2] and v17 [3], when secretmem moved away from CMA allocations. [1]: https://lore.kernel.org/lkml/20201124164930.GK8537@kernel.org/ [2]: https://lore.kernel.org/lkml/20210121122723.3446-11-rppt@kernel.org/#t [3]: https://lore.kernel.org/lkml/20201125092208.12544-10-rppt@kernel.org/ Link: https://lkml.kernel.org/r/20241001080056.784735-1-roypat@amazon.co.uk Fixes: 1507f51255c9 ("mm: introduce memfd_secret system call to create "secret" memory areas") Signed-off-by: Patrick Roy <roypat@amazon.co.uk> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: David Hildenbrand <david@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09mm/huge_memory: check pmd_special() only after pmd_present()David Hildenbrand
We should only check for pmd_special() after we made sure that we have a present PMD. For example, if we have a migration PMD, pmd_special() might indicate that we have a special PMD although we really don't. This fixes confusing migration entries as PFN mappings, and not doing what we are supposed to do in the "is_swap_pmd()" case further down in the function -- including messing up COW, page table handling and accounting. Link: https://lkml.kernel.org/r/20240926154234.2247217-1-david@redhat.com Fixes: bc02afbd4d73 ("mm/fork: accept huge pfnmap entries") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: syzbot+bf2c35fa302ebe3c7471@syzkaller.appspotmail.com Closes: https://lore.kernel.org/lkml/66f15c8d.050a0220.c23dd.000f.GAE@google.com/ Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-07rust: enable `clippy::undocumented_unsafe_blocks` lintMiguel Ojeda
Checking that we are not missing any `// SAFETY` comments in our `unsafe` blocks is something we have wanted to do for a long time, as well as cleaning up the remaining cases that were not documented [1]. Back when Rust for Linux started, this was something that could have been done via a script, like Rust's `tidy`. Soon after, in Rust 1.58.0, Clippy implemented the `undocumented_unsafe_blocks` lint [2]. Even though the lint has a few false positives, e.g. in some cases where attributes appear between the comment and the `unsafe` block [3], there are workarounds and the lint seems quite usable already. Thus enable the lint now. We still have a few cases to clean up, so just allow those for the moment by writing a `TODO` comment -- some of those may be good candidates for new contributors. Link: https://github.com/Rust-for-Linux/linux/issues/351 [1] Link: https://rust-lang.github.io/rust-clippy/master/#/undocumented_unsafe_blocks [2] Link: https://github.com/rust-lang/rust-clippy/issues/13189 [3] Reviewed-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Trevor Gross <tmgross@umich.edu> Tested-by: Gary Guo <gary@garyguo.net> Reviewed-by: Gary Guo <gary@garyguo.net> Link: https://lore.kernel.org/r/20240904204347.168520-5-ojeda@kernel.org Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
2024-10-04mm: Introduce ARCH_HAS_USER_SHADOW_STACKMark Brown
Since multiple architectures have support for shadow stacks and we need to select support for this feature in several places in the generic code provide a generic config option that the architectures can select. Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Deepak Gupta <debug@rivosinc.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Kees Cook <kees@kernel.org> Tested-by: Kees Cook <kees@kernel.org> Acked-by: Shuah Khan <skhan@linuxfoundation.org> Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org> Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20241001-arm64-gcs-v13-1-222b78d87eee@kernel.org Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-04migrate: Remove references to Private2Matthew Wilcox (Oracle)
These comments are now stale; rewrite them. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20241002040111.1023018-7-willy@infradead.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-04fs: Move clearing of mappedtodisk to buffer.cMatthew Wilcox (Oracle)
The mappedtodisk flag is only meaningful for buffer head based filesystems. It should not be cleared for other filesystems. This allows us to reuse the mappedtodisk flag to have other meanings in filesystems that do not use buffer heads. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20241002040111.1023018-2-willy@infradead.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-02mm, slab: suppress warnings in test_leak_destroy kunit testVlastimil Babka
The test_leak_destroy kunit test intends to test the detection of stray objects in kmem_cache_destroy(), which normally produces a warning. The other slab kunit tests suppress the warnings in the kunit test context, so suppress warnings and related printk output in this test as well. Automated test running environments then don't need to learn to filter the warnings. Also rename the test's kmem_cache, the name was wrongly copy-pasted from test_kfree_rcu. Fixes: 4e1c44b3db79 ("kunit, slub: add test_kfree_rcu() and test_leak_destroy()") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202408251723.42f3d902-oliver.sang@intel.com Reported-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Closes: https://lore.kernel.org/all/CAB=+i9RHHbfSkmUuLshXGY_ifEZg9vCZi3fqr99+kmmnpDus7Q@mail.gmail.com/ Reported-by: Guenter Roeck <linux@roeck-us.net> Closes: https://lore.kernel.org/all/6fcb1252-7990-4f0d-8027-5e83f0fb9409@roeck-us.net/ Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-10-02filemap: filemap_read() should check that the offset is positive or zeroTrond Myklebust
We do check that the read offset is less than the filesystem limit, however for good measure we should also check that it is positive or zero, and return EINVAL if that is not the case. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Link: https://lore.kernel.org/r/482ee0b8a30b62324adb9f7c551a99926f037393.1726257832.git.trond.myklebust@hammerspace.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-01mm, slab: fix use of SLAB_SUPPORTS_SYSFS in kmem_cache_release()Nilay Shroff
The fix implemented in commit 4ec10268ed98 ("mm, slab: unlink slabinfo, sysfs and debugfs immediately") caused a subtle side effect due to which while destroying the kmem cache, the code path would never get into sysfs_slab_release() function even though SLAB_SUPPORTS_SYSFS is defined and slab state is FULL. Due to this side effect, we would never release kobject defined for kmem cache and leak the associated memory. The issue here's with the use of __is_defined() macro in kmem_cache_ release(). The __is_defined() macro expands to __take_second_arg( arg1_or_junk 1, 0). If "arg1_or_junk" is defined to 1 then it expands to __take_second_arg(0, 1, 0) and returns 1. If "arg1_or_junk" is NOT defined to any value then it expands to __take_second_arg(... 1, 0) and returns 0. In this particular issue, SLAB_SUPPORTS_SYSFS is defined without any associated value and that causes __is_defined(SLAB_SUPPORTS_SYSFS) to always evaluate to 0 and hence it would never invoke sysfs_slab_release(). This patch helps fix this issue by defining SLAB_SUPPORTS_SYSFS to 1. Fixes: 4ec10268ed98 ("mm, slab: unlink slabinfo, sysfs and debugfs immediately") Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/CAHj4cs9YCCcfmdxN43-9H3HnTYQsRtTYw1Kzq-L468GfLKAENA@mail.gmail.com/ Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>