path: root/mm
2025-03-16mm/damon/core: initialize damos->walk_completed in damon_new_scheme()SeongJae Park
The function for allocating and initializing a 'struct damos' object, damon_new_scheme(), does not initialize the damos->walk_completed field. Only damos_walk_complete() sets the field. Hence the field will eventually be set and used correctly from the second damos_walk() call for the scheme, but the first damos_walk() could mistakenly not walk on the regions. A common usage of DAMOS for taking an access pattern snapshot is installing a monitoring-purpose DAMOS scheme, doing damos_walk() to retrieve the snapshot, and then removing the scheme. The DAMON user-space tool (damo) also gets runtime snapshots this way, so the problem can happen continuously in such use cases. Initialize the field properly in the allocation function. Link: https://lkml.kernel.org/r/20250228174450.41472-1-sj@kernel.org Fixes: bf0eaba0ff9c ("mm/damon/core: implement damos_walk()") Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
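A minimal sketch of the described fix; the surrounding allocation code is elided and the exact placement is assumed from the description rather than taken from the actual diff:

    struct damos *damon_new_scheme(/* access pattern, action, quota, ... */)
    {
        struct damos *scheme = kmalloc(sizeof(*scheme), GFP_KERNEL);

        if (!scheme)
            return NULL;
        /* ... existing initialization of pattern, action, quota, stats ... */
        scheme->walk_completed = false;  /* the missing initialization */
        return scheme;
    }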
2025-03-16mm/damon: respect core layer filters' allowance decision on ops layerSeongJae Park
Filtering decisions are made in filter evaluation order. Once a decision is made by a filter, filters scheduled to be evaluated after it should just respect that decision. This is the intended and documented behavior. Since core layer-handled filters are evaluated before operations layer-handled filters, decisions made in the core layer should be respected by the ops layer. In the case of reject filters, the decision is respected, since core layer-rejected regions are not passed to the ops layer. But in the case of allow filters, ops layer filters don't know whether a region was passed to them because it was allowed by core filters or just because it didn't match any core layer filter. The current, wrong implementation assumes it was because no core filter matched. As a result, the decision is not respected. Pass the missing information to the ops layer using a new field in 'struct damos', and make the ops layer filters respect it. Link: https://lkml.kernel.org/r/20250228175336.42781-1-sj@kernel.org Fixes: 491fee286e56 ("mm/damon/core: support damos_filter->allow") Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
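A rough sketch of the approach described above; the field name and the check location are assumptions for illustration, not the actual diff:

    struct damos {
        /* ... existing fields ... */
        bool core_filters_allowed;  /* set when a core-layer allow filter matched the region */
    };

    /* ops-layer filter evaluation, assumed shape: */
    if (scheme->core_filters_allowed)
        return true;    /* the core layer already decided to allow; don't re-evaluate */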
2025-03-16filemap: move prefaulting out of hot write pathDave Hansen
There is a generic anti-pattern that shows up in the VFS and several filesystems where the hot write paths touch userspace twice when they could get away with doing it once. Dave Chinner suggested that they should all be fixed up[1]. I agree[2]. But, the series to do that fixup spans a bunch of filesystems and a lot of people. This patch fixes common code that absolutely everyone uses. It has measurable performance benefits[3]. I think this patch can go in and not be held up by the others. I will post them separately to their separate maintainers for consideration. But, honestly, I'm not going to lose any sleep if the maintainers don't pick those up. 1. https://lore.kernel.org/all/Z5f-x278Z3wTIugL@dread.disaster.area/ 2. https://lore.kernel.org/all/20250129181749.C229F6F3@davehans-spike.ostc.intel.com/ 3. https://lore.kernel.org/all/202502121529.d62a409e-lkp@intel.com/ This patch: There is a bit of a sordid history here. I originally wrote 998ef75ddb57 ("fs: do not prefault sys_write() user buffer pages") to fix a performance issue that showed up on early SMAP hardware. But that was reverted with 00a3d660cbac because it exposed an underlying filesystem bug. This is a reimplementation of the original commit along with some simplification and comment improvements. The basic problem is that the generic write path has two userspace accesses: one to prefault the write source buffer and then another to perform the actual write. On x86, this means an extra STAC/CLAC pair. These are relatively expensive instructions because they function as barriers. Keep the prefaulting behavior but move it into the slow path that gets run when the write did not make any progress. This avoids livelocks that can happen when the write's source and destination target the same folio. Contrary to the existing comments, the fault-in does not prevent deadlocks. That's accomplished by using an "atomic" usercopy that disables page faults. The end result is that the generic write fast path now touches userspace once instead of twice. 0day has shown some improvements on a couple of microbenchmarks: https://lore.kernel.org/all/202502121529.d62a409e-lkp@intel.com/ Link: https://lkml.kernel.org/r/20250228203722.CAEB63AC@davehans-spike.ostc.intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/yxyuijjfd6yknryji2q64j3keq2ygw6ca6fs5jwyolklzvo45s@4u63qqqyosy2/ Cc: Ted Ts'o <tytso@mit.edu> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
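A rough sketch of the resulting control flow in the generic write loop; variable names and the surrounding bookkeeping are assumptions, not the exact code in mm/filemap.c:

    do {
        /* ... grab a folio and prepare the write as before ... */
        copied = copy_folio_from_iter_atomic(folio, offset, bytes, i);
        /* ... commit the write, update pos/written ... */

        if (unlikely(copied == 0)) {
            /*
             * Slow path only: the pagefault-disabled usercopy made no
             * progress, so prefault the source buffer here instead of
             * before every fast-path iteration.
             */
            if (fault_in_iov_iter_readable(i, bytes) == bytes) {
                status = -EFAULT;
                break;
            }
        }
    } while (iov_iter_count(i));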
2025-03-13Revert "fanotify: disable readahead if we have pre-content watches"Amir Goldstein
This reverts commit fac84846a28c0950d4433118b3dffd44306df62d. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250312073852.2123409-7-amir73il@gmail.com
2025-03-13Revert "mm: don't allow huge faults for files with pre content watches"Amir Goldstein
This reverts commit 20bf82a898b65c129af76deb96a1b415d3098a28. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250312073852.2123409-6-amir73il@gmail.com
2025-03-13Revert "fsnotify: generate pre-content permission event on page fault"Amir Goldstein
This reverts commit 8392bc2ff8c8bf7c4c5e6dfa71ccd893a3c046f6. In the use case of buffered write whose input buffer is mmapped file on a filesystem with a pre-content mark, the prefaulting of the buffer can happen under the filesystem freeze protection (obtained in vfs_write()) which breaks assumptions of pre-content hook and introduces potential deadlock of HSM handler in userspace with filesystem freezing. Now that we have pre-content hooks at file mmap() time, disable the pre-content event hooks on page fault to avoid the potential deadlock. Reported-by: syzbot+7229071b47908b19d5b7@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-fsdevel/7ehxrhbvehlrjwvrduoxsao5k3x4aw275patsb3krkwuq573yv@o2hskrfawbnc/ Fixes: 8392bc2ff8c8 ("fsnotify: generate pre-content permission event on page fault") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250312073852.2123409-5-amir73il@gmail.com
2025-03-12fsnotify: add pre-content hooks on mmap()Amir Goldstein
Pre-content hooks in page faults introduce a potential deadlock of the HSM handler in userspace with filesystem freezing. The requirement for pre-content events is that, for every accessed file range, an event covering at least this range is generated at least once before the file data is accessed. In preparation for disabling pre-content event hooks on page faults, add pre-content hooks to the mmap() variants for the entire mmapped range, so the HSM can fill in content when the user requests to map a portion of the file. Note that the exec() variant also calls vm_mmap_pgoff() internally to map code sections, so pre-content hooks are also generated in that case. Link: https://lore.kernel.org/linux-fsdevel/7ehxrhbvehlrjwvrduoxsao5k3x4aw275patsb3krkwuq573yv@o2hskrfawbnc/ Suggested-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20250312073852.2123409-2-amir73il@gmail.com
2025-03-11mm: Fix a build breakage in memcontrol-v1.cTejun Heo
While adding a deprecation message, fd4fd0a869e9 ("mm: Add transformation message for per-memcg swappiness") missed the semicolon after the new pr_info_once() statement causing build breakage when CONFIG_MEMCG_V1 is enabled. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Fixes: fd4fd0a869e9 ("mm: Add transformation message for per-memcg swappiness") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202503120710.guZkJx0h-lkp@intel.com/
2025-03-12Backmerge tag 'v6.14-rc6' into drm-nextDave Airlie
This is a backmerge from Linux 6.14-rc6, needed for the nova PR. Signed-off-by: Dave Airlie <airlied@redhat.com>
2025-03-11mm: Add transformation message for per-memcg swappinessMichal Koutný
The concept of per-memcg swappiness has never landed well in memcg for cgroup v2. Add a message for users who use it on the v1 hierarchy. Decreased swappiness transforms to memory.swap.max=0, whereas increased swappiness transforms into an active memory.reclaim operation. Link: https://lore.kernel.org/r/1577252208-32419-1-git-send-email-teawater@gmail.com/ Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-03-08Merge tag 'mm-hotfixes-stable-2025-03-08-16-27' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "33 hotfixes. 24 are cc:stable and the remainder address post-6.13 issues or aren't considered necessary for -stable kernels. 26 are for MM and 7 are for non-MM. - "mm: memory_failure: unmap poisoned folio during migrate properly" from Ma Wupeng fixes a couple of two year old bugs involving the migration of hwpoisoned folios. - "selftests/damon: three fixes for false results" from SeongJae Park fixes three one year old bugs in the SAMON selftest code. The remainder are singletons and doubletons. Please see the individual changelogs for details" * tag 'mm-hotfixes-stable-2025-03-08-16-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (33 commits) mm/page_alloc: fix uninitialized variable rapidio: add check for rio_add_net() in rio_scan_alloc_net() rapidio: fix an API misues when rio_add_net() fails MAINTAINERS: .mailmap: update Sumit Garg's email address Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone" mm: fix finish_fault() handling for large folios mm: don't skip arch_sync_kernel_mappings() in error paths mm: shmem: remove unnecessary warning in shmem_writepage() userfaultfd: fix PTE unmapping stack-allocated PTE copies userfaultfd: do not block on locking a large folio with raised refcount mm: zswap: use ATOMIC_LONG_INIT to initialize zswap_stored_pages mm: shmem: fix potential data corruption during shmem swapin mm: fix kernel BUG when userfaultfd_move encounters swapcache selftests/damon/damon_nr_regions: sort collected regiosn before checking with min/max boundaries selftests/damon/damon_nr_regions: set ops update for merge results check to 100ms selftests/damon/damos_quota: make real expectation of quota exceeds include/linux/log2.h: mark is_power_of_2() with __always_inline NFS: fix nfs_release_folio() to not deadlock via kcompactd writeback mm, swap: avoid BUG_ON in relocate_cluster() mm: swap: use correct step in loop to wait all clusters in wait_for_allocation() ...
2025-03-08Merge branch 'locking/urgent' into locking/core, to pick up locking fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-07Merge tag 'slab-for-6.14-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fix from Vlastimil Babka: - Stable fix for kmem_cache_destroy() called from a WQ_MEM_RECLAIM workqueue causing a warning due to the new kvfree_rcu_barrier() (Uladzislau Rezki) * tag 'slab-for-6.14-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slab/kvfree_rcu: Switch to WQ_MEM_RECLAIM wq
2025-03-06fs/pipe: add simpler helpers for common casesLinus Torvalds
The fix to atomically read the pipe head and tail state when not holding the pipe mutex has caused a number of headaches due to the size change of the involved types. It turns out that we don't have _that_ many places that access these fields directly and were affected, but we have more than we strictly should have, because our low-level helper functions have been designed to have intimate knowledge of how the pipes work. And as a result, that random noise of direct 'pipe->head' and 'pipe->tail' accesses makes it harder to pinpoint any actual potential problem spots remaining. For example, we didn't have a "is the pipe full" helper function, but instead had a "given these pipe buffer indexes and this pipe size, is the pipe full". That's because some low-level pipe code does actually want that much more complicated interface. But most other places literally just want a "is the pipe full" helper, and not having it meant that those places ended up being unnecessarily much too aware of this all. It would have been much better if only the very core pipe code that cared had been the one aware of this all. So let's fix it - better late than never. This just introduces the trivial wrappers for "is this pipe full or empty" and to get how many pipe buffers are used, so that instead of writing if (pipe_full(pipe->head, pipe->tail, pipe->max_usage)) the places that literally just want to know if a pipe is full can just say if (pipe_is_full(pipe)) instead. The existing trivial cases were converted with a 'sed' script. This cuts down on the places that access pipe->head and pipe->tail directly outside of the pipe code (and core splice code) quite a lot. The splice code in particular still revels in doing the direct low-level accesses, and the fuse fuse_dev_splice_write() code also seems a bit unnecessarily eager to go very low-level, but it's at least a bit better than it used to be. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
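A minimal sketch of the wrappers the message describes; the exact prototypes and the name of the buffers-in-use helper are assumptions:

    static inline bool pipe_is_full(const struct pipe_inode_info *pipe)
    {
        return pipe_full(pipe->head, pipe->tail, pipe->max_usage);
    }

    static inline bool pipe_is_empty(const struct pipe_inode_info *pipe)
    {
        return pipe_empty(pipe->head, pipe->tail);
    }

    static inline unsigned int pipe_buf_usage(const struct pipe_inode_info *pipe)
    {
        return pipe_occupancy(pipe->head, pipe->tail);
    }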
2025-03-06mm/migrate: Trylock device page in do_swap_pageMatthew Brost
Avoid multiple CPU page faults to the same device page racing by trying to lock the page in do_swap_page before taking an extra reference to the page. This prevents scenarios where multiple CPU page faults each take an extra reference to a device page, which could abort migration in folio_migrate_mapping. With the device page being locked in do_swap_page, the migrate_vma_* functions need to be updated to avoid locking the fault_page argument. Prior to this change, a livelock scenario could occur in Xe's (Intel GPU DRM driver) SVM implementation if enough threads faulted the same device page. v3: - Put page after unlocking page (Alistair) - Warn on splitting a THP which is the fault page (Alistair) - Warn on dst page == fault page (Alistair) v6: - Add more verbose comment around THP (Alistair) v7: - Fix migrate_device_finalize alignment (Checkpatch) Cc: Alistair Popple <apopple@nvidia.com> Cc: Philip Yang <Philip.Yang@amd.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Andrew Morton <akpm@linux-foundation.org> Suggested-by: Simona Vetter <simona.vetter@ffwll.ch> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Tested-by: Alistair Popple <apopple@nvidia.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250306012657.3505757-4-matthew.brost@intel.com
2025-03-06mm/migrate: Add migrate_device_pfnsMatthew Brost
Add migrate_device_pfns(), which prepares an array of pre-populated device pages for migration. This is needed for eviction of a known set of non-contiguous device pages to CPU pages, which is a common case for SVM in DRM drivers using TTM. v2: - s/migrate_device_vma_range/migrate_device_prepopulated_range - Drop extra mmu invalidation (Vetter) v3: - s/migrate_device_prepopulated_range/migrate_device_pfns (Alistair) - Use helper to lock device pages (Alistair) - Update commit message with why this is required (Alistair) Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Gwan-gyeong Mun <gwan-gyeong.mun@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20250306012657.3505757-3-matthew.brost@intel.com
2025-03-06slub: Handle freelist cycle in on_freelist()Lilith Gkini
on_freelist() doesn't have a way to handle the edge case of a full freelist that doesn't end in NULL and instead has another valid pointer into the slab, as a result of a use-after-free or anything similar. This case won't get caught by check_valid_pointer(), and it will result in nr incrementing to `slab->objects + 1`, corrupting the slab->inuse entry later in the code by setting it to -1. Add an if check to detect that case, report it, and handle the freelist and slab appropriately, as is the standard process in these situations. Furthermore, change the return type of the function from int to bool as per coding style guidelines. Also move the `break;` line inside the `if (object) {` branch to make it more obvious that the code breaks out of the while loop there. Signed-off-by: Lilith Persefoni Gkini <lilithgkini@proton.me> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-03-05mm/page_alloc: fix uninitialized variableHao Zhang
The variable "compact_result" is not initialized in function __alloc_pages_slowpath(). It causes should_compact_retry() to use an uninitialized value. Initialize variable "compact_result" with the value COMPACT_SKIPPED. BUG: KMSAN: uninit-value in __alloc_pages_slowpath+0xee8/0x16c0 mm/page_alloc.c:4416 __alloc_pages_slowpath+0xee8/0x16c0 mm/page_alloc.c:4416 __alloc_frozen_pages_noprof+0xa4c/0xe00 mm/page_alloc.c:4752 alloc_pages_mpol+0x4cd/0x890 mm/mempolicy.c:2270 alloc_frozen_pages_noprof mm/mempolicy.c:2341 [inline] alloc_pages_noprof mm/mempolicy.c:2361 [inline] folio_alloc_noprof+0x1dc/0x350 mm/mempolicy.c:2371 filemap_alloc_folio_noprof+0xa6/0x440 mm/filemap.c:1019 __filemap_get_folio+0xb9a/0x1840 mm/filemap.c:1970 grow_dev_folio fs/buffer.c:1039 [inline] grow_buffers fs/buffer.c:1105 [inline] __getblk_slow fs/buffer.c:1131 [inline] bdev_getblk+0x2c9/0xab0 fs/buffer.c:1431 getblk_unmovable include/linux/buffer_head.h:369 [inline] ext4_getblk+0x3b7/0xe50 fs/ext4/inode.c:864 ext4_bread_batch+0x9f/0x7d0 fs/ext4/inode.c:933 __ext4_find_entry+0x1ebb/0x36c0 fs/ext4/namei.c:1627 ext4_lookup_entry fs/ext4/namei.c:1729 [inline] ext4_lookup+0x189/0xb40 fs/ext4/namei.c:1797 __lookup_slow+0x538/0x710 fs/namei.c:1793 lookup_slow+0x6a/0xd0 fs/namei.c:1810 walk_component fs/namei.c:2114 [inline] link_path_walk+0xf29/0x1420 fs/namei.c:2479 path_openat+0x30f/0x6250 fs/namei.c:3985 do_filp_open+0x268/0x600 fs/namei.c:4016 do_sys_openat2+0x1bf/0x2f0 fs/open.c:1428 do_sys_open fs/open.c:1443 [inline] __do_sys_openat fs/open.c:1459 [inline] __se_sys_openat fs/open.c:1454 [inline] __x64_sys_openat+0x2a1/0x310 fs/open.c:1454 x64_sys_call+0x36f5/0x3c30 arch/x86/include/generated/asm/syscalls_64.h:258 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Local variable compact_result created at: __alloc_pages_slowpath+0x66/0x16c0 mm/page_alloc.c:4218 __alloc_frozen_pages_noprof+0xa4c/0xe00 mm/page_alloc.c:4752 Link: https://lkml.kernel.org/r/tencent_ED1032321D6510B145CDBA8CBA0093178E09@qq.com Reported-by: syzbot+0cfd5e38e96a5596f2b6@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=0cfd5e38e96a5596f2b6 Signed-off-by: Hao Zhang <zhanghao1@kylinos.cn> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
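The fix is the one-line initialization named above; a minimal sketch of the declaration in __alloc_pages_slowpath():

    enum compact_result compact_result = COMPACT_SKIPPED;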
2025-03-05Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] ↵Gabriel Krisman Bertazi
for empty zone" Commit 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone") removes the protection of lower zones from allocations targeting memory-less high zones. This had an unintended impact on the pattern of reclaims because it makes the high-zone-targeted allocation more likely to succeed in lower zones, which adds pressure to said zones. I.e., the following corresponding checks in zone_watermark_ok/zone_watermark_fast are less likely to trigger: if (free_pages <= min + z->lowmem_reserve[highest_zoneidx]) return false; As a result, we are observing an increase in reclaim and kswapd scans, due to the increased pressure. This was initially observed as increased latency in filesystem operations when benchmarking with fio on a machine with some memory-less zones, but it has since been associated with increased contention in locks related to memory reclaim. Reverting this patch recovered the original performance on that machine. The original commit was introduced as a clarification of the /proc/zoneinfo output, so it doesn't seem there are use cases depending on it, making the revert a simple solution. For reference, I collected vmstat with and without this patch on a freshly booted system running intensive randread io from an nvme for 5 minutes. I got: rpm-6.12.0-slfo.1.2 -> pgscan_kswapd 5629543865 Patched -> pgscan_kswapd 33580844 33M scans is similar to what we had in kernels predating this patch. These numbers are fairly representative of the workload on this machine, as measured in several runs. So we are talking about a two-orders-of-magnitude increase. Link: https://lkml.kernel.org/r/20250226032258.234099-1-krisman@suse.de Fixes: 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Baoquan He <bhe@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05mm: fix finish_fault() handling for large foliosBrian Geffon
When handling faults for anon shmem finish_fault() will attempt to install ptes for the entire folio. Unfortunately if it encounters a single non-pte_none entry in that range it will bail, even if the pte that triggered the fault is still pte_none. When this situation happens the fault will be retried endlessly never making forward progress. This patch fixes this behavior and if it detects that a pte in the range is not pte_none it will fall back to setting a single pte. [bgeffon@google.com: tweak whitespace] Link: https://lkml.kernel.org/r/20250227133236.1296853-1-bgeffon@google.com Link: https://lkml.kernel.org/r/20250226162341.915535-1-bgeffon@google.com Fixes: 43e027e41423 ("mm: memory: extend finish_fault() to support large folio") Signed-off-by: Brian Geffon <bgeffon@google.com> Suggested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reported-by: Marek Maslanka <mmaslanka@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickens <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
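A rough sketch of the fallback described above; the helper name and the exact bookkeeping are assumptions for illustration:

    /* in finish_fault(), before installing PTEs for the whole folio: */
    if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
        /* some PTE in the range is already populated: map only the faulting page */
        nr_pages = 1;
        page = vmf->page;
    }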
2025-03-05mm: don't skip arch_sync_kernel_mappings() in error pathsRyan Roberts
Fix callers that previously skipped calling arch_sync_kernel_mappings() if an error occurred during a pgtable update. The call is still required to sync any pgtable updates that may have occurred prior to hitting the error condition. These are theoretical bugs discovered during code review. Link: https://lkml.kernel.org/r/20250226121610.2401743-1-ryan.roberts@arm.com Fixes: 2ba3e6947aed ("mm/vmalloc: track which page-table levels were modified") Fixes: 0c95cba49255 ("mm: apply_to_pte_range warn and fail if a large pte is encountered") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Christop Hellwig <hch@infradead.org> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05mm: shmem: remove unnecessary warning in shmem_writepage()Ricardo Cañuelo Navarro
Although the scenario where shmem_writepage() is called with info->flags & VM_LOCKED is unlikely to happen, it's still possible, as evidenced by syzbot [1]. However, the warning in this case isn't necessary because the situation is already handled correctly [2]. [2] https://lore.kernel.org/lkml/8afe1f7f-31a2-4fc0-1fbd-f9ba8a116fe3@google.com/ Link: https://lkml.kernel.org/r/20250226-20250221-warning-in-shmem_writepage-v1-1-5ad19420e17e@igalia.com Fixes: 9a976f0c847b ("shmem: skip page split if we're not reclaiming") Signed-off-by: Ricardo Cañuelo Navarro <rcn@igalia.com> Reported-by: Pengfei Xu <pengfei.xu@intel.com> Closes: https://lore.kernel.org/lkml/ZZ9PShXjKJkVelNm@xpf.sh.intel.com/ [1] Suggested-by: Hugh Dickins <hughd@google.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Florent Revest <revest@chromium.org> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Florent Revest <revest@chromium.org> Cc: Luis Chamberalin <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05userfaultfd: fix PTE unmapping stack-allocated PTE copiesSuren Baghdasaryan
Current implementation of move_pages_pte() copies source and destination PTEs in order to detect concurrent changes to PTEs involved in the move. However these copies are also used to unmap the PTEs, which will fail if CONFIG_HIGHPTE is enabled because the copies are allocated on the stack. Fix this by using the actual PTEs which were kmap()ed. Link: https://lkml.kernel.org/r/20250226185510.2732648-3-surenb@google.com Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: Peter Xu <peterx@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05userfaultfd: do not block on locking a large folio with raised refcountSuren Baghdasaryan
Lokesh recently raised an issue about UFFDIO_MOVE getting into a deadlock state when it goes into split_folio() with raised folio refcount. split_folio() expects the reference count to be exactly mapcount + num_pages_in_folio + 1 (see can_split_folio()) and fails with EAGAIN otherwise. If multiple processes are trying to move the same large folio, they raise the refcount (all tasks succeed in that) then one of them succeeds in locking the folio, while others will block in folio_lock() while keeping the refcount raised. The winner of this race will proceed with calling split_folio() and will fail returning EAGAIN to the caller and unlocking the folio. The next competing process will get the folio locked and will go through the same flow. In the meantime the original winner will be retried and will block in folio_lock(), getting into the queue of waiting processes only to repeat the same path. All this results in a livelock. An easy fix would be to avoid waiting for the folio lock while holding folio refcount, similar to madvise_free_huge_pmd() where folio lock is acquired before raising the folio refcount. Since we lock and take a refcount of the folio while holding the PTE lock, changing the order of these operations should not break anything. Modify move_pages_pte() to try locking the folio first and if that fails and the folio is large then return EAGAIN without touching the folio refcount. If the folio is single-page then split_folio() is not called, so we don't have this issue. Lokesh has a reproducer [1] and I verified that this change fixes the issue. [1] https://github.com/lokeshgidra/uffd_move_ioctl_deadlock [akpm@linux-foundation.org: reflow comment to 80 cols, s/end/end up/] Link: https://lkml.kernel.org/r/20250226185510.2732648-2-surenb@google.com Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: Lokesh Gidra <lokeshgidra@google.com> Reviewed-by: Peter Xu <peterx@redhat.com> Acked-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
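A sketch of the reordering described above, reduced to the large-folio case; function-local details are assumptions, not the actual diff:

    /* try the folio lock before the refcount is raised */
    if (!folio_trylock(folio) && folio_test_large(folio))
        return -EAGAIN;  /* caller retries later; small folios keep the existing path */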
2025-03-05mm: zswap: use ATOMIC_LONG_INIT to initialize zswap_stored_pagesSun YangKai
This is currently the only atomic_long_t variable initialized by ATOMIC_INIT macro found in the kernel by using `grep -r atomic_long_t | grep ATOMIC_INIT` This was introduced in 6e1fa555ec77, in which we modified the type of zswap_stored_pages to atomic_long_t, but didn't change the initialization. Link: https://lkml.kernel.org/r/20250226153253.19179-1-sunk67188@gmail.com Fixes: 6e1fa555ec77 ("mm: zswap: modify zswap_stored_pages to be atomic_long_t") Signed-off-by: Sun YangKai <sunk67188@gmail.com> Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
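The fix swaps the initializer macro as described; a minimal sketch, with linkage and neighbouring declarations elided:

    atomic_long_t zswap_stored_pages = ATOMIC_LONG_INIT(0);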
2025-03-05mm: shmem: fix potential data corruption during shmem swapinBaolin Wang
Alex and Kairui reported some issues (system hang or data corruption) when swapping out or swapping in large shmem folios. This is especially easy to reproduce when the tmpfs is mounted with the 'huge=within_size' parameter. Thanks to Kairui's reproducer, the issue can be easily replicated. The root cause of the problem is that swap readahead may asynchronously swap in order 0 folios into the swap cache, while the shmem mapping can still store large swap entries. Then an order 0 folio is inserted into the shmem mapping without splitting the large swap entry, which overwrites the original large swap entry, leading to data corruption. When getting a folio from the swap cache, we should split the large swap entry stored in the shmem mapping if the orders do not match, to fix this issue. Link: https://lkml.kernel.org/r/2fe47c557e74e9df5fe2437ccdc6c9115fa1bf70.1740476943.git.baolin.wang@linux.alibaba.com Fixes: 809bc86517cc ("mm: shmem: support large folio swap out") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reported-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca> Reported-by: Kairui Song <ryncsn@gmail.com> Closes: https://lore.kernel.org/all/1738717785.im3r5g2vxc.none@localhost/ Tested-by: Kairui Song <kasong@tencent.com> Cc: David Hildenbrand <david@redhat.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcow <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05mm: fix kernel BUG when userfaultfd_move encounters swapcacheBarry Song
userfaultfd_move() checks whether the PTE entry is present or a swap entry. - If the PTE entry is present, move_present_pte() handles folio migration by setting: src_folio->index = linear_page_index(dst_vma, dst_addr); - If the PTE entry is a swap entry, move_swap_pte() simply copies the PTE to the new dst_addr. This approach is incorrect because, even if the PTE is a swap entry, it can still reference a folio that remains in the swap cache. This creates a race window between steps 2 and 4. 1. add_to_swap: The folio is added to the swapcache. 2. try_to_unmap: PTEs are converted to swap entries. 3. pageout: The folio is written back. 4. Swapcache is cleared. If userfaultfd_move() occurs in the window between steps 2 and 4, after the swap PTE has been moved to the destination, accessing the destination triggers do_swap_page(), which may locate the folio in the swapcache. However, since the folio's index has not been updated to match the destination VMA, do_swap_page() will detect a mismatch. This can result in two critical issues depending on the system configuration. If KSM is disabled, both small and large folios can trigger a BUG during the add_rmap operation due to: page_pgoff(folio, page) != linear_page_index(vma, address) [ 13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c [ 13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0 [ 13.337716] memcg:ffff00000405f000 [ 13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff) [ 13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361 [ 13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000 [ 13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361 [ 13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000 [ 13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001 [ 13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000 [ 13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address)) [ 13.340190] ------------[ cut here ]------------ [ 13.340316] kernel BUG at mm/rmap.c:1380! 
[ 13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP [ 13.340969] Modules linked in: [ 13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299 [ 13.341470] Hardware name: linux,dummy-virt (DT) [ 13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 13.341815] pc : __page_check_anon_rmap+0xa0/0xb0 [ 13.341920] lr : __page_check_anon_rmap+0xa0/0xb0 [ 13.342018] sp : ffff80008752bb20 [ 13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001 [ 13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001 [ 13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00 [ 13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff [ 13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f [ 13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0 [ 13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40 [ 13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8 [ 13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000 [ 13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f [ 13.343876] Call trace: [ 13.344045] __page_check_anon_rmap+0xa0/0xb0 (P) [ 13.344234] folio_add_anon_rmap_ptes+0x22c/0x320 [ 13.344333] do_swap_page+0x1060/0x1400 [ 13.344417] __handle_mm_fault+0x61c/0xbc8 [ 13.344504] handle_mm_fault+0xd8/0x2e8 [ 13.344586] do_page_fault+0x20c/0x770 [ 13.344673] do_translation_fault+0xb4/0xf0 [ 13.344759] do_mem_abort+0x48/0xa0 [ 13.344842] el0_da+0x58/0x130 [ 13.344914] el0t_64_sync_handler+0xc4/0x138 [ 13.345002] el0t_64_sync+0x1ac/0x1b0 [ 13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000) [ 13.345504] ---[ end trace 0000000000000000 ]--- [ 13.345715] note: a.out[107] exited with irqs disabled [ 13.345954] note: a.out[107] exited with preempt_count 2 If KSM is enabled, Peter Xu also discovered that do_swap_page() may trigger an unexpected CoW operation for small folios because ksm_might_need_to_copy() allocates a new folio when the folio index does not match linear_page_index(vma, addr). This patch also checks the swapcache when handling swap entries. If a match is found in the swapcache, it processes it similarly to a present PTE. However, there are some differences. For example, the folio is no longer exclusive because folio_try_share_anon_rmap_pte() is performed during unmapping. Furthermore, in the case of swapcache, the folio has already been unmapped, eliminating the risk of concurrent rmap walks and removing the need to acquire src_folio's anon_vma or lock. Note that for large folios, in the swapcache handling path, we directly return -EBUSY since split_folio() will return -EBUSY regardless if the folio is under writeback or unmapped. This is not an urgent issue, so a follow-up patch may address it separately. 
[v-songbaohua@oppo.com: minor cleanup according to Peter Xu] Link: https://lkml.kernel.org/r/20250226024411.47092-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20250226001400.9129-1-21cnbao@gmail.com Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Nicolas Geoffray <ngeoffray@google.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: ZhangPeng <zhangpeng362@huawei.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05NFS: fix nfs_release_folio() to not deadlock via kcompactd writebackMike Snitzer
Add a PF_KCOMPACTD flag and a current_is_kcompactd() helper to check for it, so nfs_release_folio() can skip calling nfs_wb_folio() from kcompactd. Otherwise NFS can deadlock waiting for kcompactd-induced writeback, which recurses back to NFS (which triggers writeback to NFSD via NFS loopback mount on the same host, NFSD blocks waiting for XFS's call to __filemap_get_folio): 6070.550357] INFO: task kcompactd0:58 blocked for more than 4435 seconds. {--- [58] "kcompactd0" [<0>] folio_wait_bit+0xe8/0x200 [<0>] folio_wait_writeback+0x2b/0x80 [<0>] nfs_wb_folio+0x80/0x1b0 [nfs] [<0>] nfs_release_folio+0x68/0x130 [nfs] [<0>] split_huge_page_to_list_to_order+0x362/0x840 [<0>] migrate_pages_batch+0x43d/0xb90 [<0>] migrate_pages_sync+0x9a/0x240 [<0>] migrate_pages+0x93c/0x9f0 [<0>] compact_zone+0x8e2/0x1030 [<0>] compact_node+0xdb/0x120 [<0>] kcompactd+0x121/0x2e0 [<0>] kthread+0xcf/0x100 [<0>] ret_from_fork+0x31/0x40 [<0>] ret_from_fork_asm+0x1a/0x30 ---} [akpm@linux-foundation.org: fix build] Link: https://lkml.kernel.org/r/20250225022002.26141-1-snitzer@kernel.org Fixes: 96780ca55e3c ("NFS: fix up nfs_release_folio() to try to release the page") Signed-off-by: Mike Snitzer <snitzer@kernel.org> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
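A minimal sketch of the helper and its use as described above; where PF_KCOMPACTD is set and the exact nfs_release_folio() check are assumptions:

    static inline bool current_is_kcompactd(void)
    {
        return current->flags & PF_KCOMPACTD;   /* flag set by the kcompactd kthread */
    }

    /* in nfs_release_folio(): don't start writeback on behalf of kcompactd */
    if (current_is_kcompactd())
        return false;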
2025-03-05mm, swap: avoid BUG_ON in relocate_cluster()Kemeng Shi
If allocation is racy with swapoff, we may call free_cluster for cluster already in free list and trigger BUG_ON() as following: Allocation Swapoff cluster_alloc_swap_entry ... /* may get a free cluster with offset */ offset = xxx; if (offset) ci = lock_cluster(si, offset); ... del_from_avail_list(p, true); si->flags &= ~SWP_WRITEOK; alloc_swap_scan_cluster(si, ci, ...) ... /* failed to alloc entry from free entry */ if (!cluster_alloc_range(...)) break; ... /* add back a free cluster */ relocate_cluster(si, ci); if (!ci->count) free_cluster(si, ci); VM_BUG_ON(ci->flags == CLUSTER_FLAG_FREE); To prevent the BUG_ON(), call free_cluster() for free cluster to move the cluster to tail of list. Check cluster is not free before calling free_cluster() in relocate_cluster() to avoid BUG_ON(). Link: https://lkml.kernel.org/r/20250222160850.505274-4-shikemeng@huaweicloud.com Fixes: 3b644773eefd ("mm, swap: reduce contention on device lock") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
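A sketch of the guard described above, based on the flag named in the quoted VM_BUG_ON; its placement inside relocate_cluster() is an assumption:

    /* a racing swapoff may already have moved this cluster to the free list */
    if (!ci->count && ci->flags != CLUSTER_FLAG_FREE)
        free_cluster(si, ci);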
2025-03-05mm: swap: use correct step in loop to wait all clusters in wait_for_allocation()Kemeng Shi
Use correct step in loop to wait all clusters in wait_for_allocation(). If we miss some cluster in wait_for_allocation(), use after free may occur as follows: shmem_writepage swapoff folio_alloc_swap get_swap_pages scan_swap_map_slots cluster_alloc_swap_entry alloc_swap_scan_cluster cluster_alloc_range /* SWP_WRITEOK is valid */ if (!(si->flags & SWP_WRITEOK)) ... del_from_avail_list(p, true); ... /* miss the cluster in shmem_writepage */ wait_for_allocation() ... try_to_unuse() memset(si->swap_map + start, usage, nr_pages); swap_range_alloc(si, nr_pages); ci->count += nr_pages; /* return a valid entry */ ... exit_swap_address_space(p->type); ... ... add_to_swap_cache /* dereference swap_address_space(entry) which is NULL */ xas_lock_irq(&xas); Link: https://lkml.kernel.org/r/20250222160850.505274-3-shikemeng@huaweicloud.com Fixes: 9a0ddeb79880 ("mm, swap: hold a reference during scan and cleanup flag usage") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
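A sketch of the loop shape implied by the subject line; the cluster-sized stride and the lock-based wait are assumptions, not the actual code:

    /* step over the range one cluster at a time, not one slot at a time */
    for (offset = 0; offset < end; offset += SWAPFILE_CLUSTER) {
        ci = lock_cluster(si, offset);  /* waits out any in-flight allocation here */
        unlock_cluster(ci);
    }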
2025-03-05mm: swap: add back full cluster when no entry is reclaimedKemeng Shi
If no swap cache is reclaimed, a cluster taken off the full_clusters list will not be put on any list, and we can't reclaim HAS_CACHE slots efficiently. Do relocate_cluster() for such a cluster to avoid this inefficiency. Link: https://lkml.kernel.org/r/20250224113910.522439-1-shikemeng@huaweicloud.com Fixes: 3b644773eefd ("mm, swap: reduce contention on device lock") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Kairui Song <kasong@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05mm: abort vma_modify() on merge out of memory failureLorenzo Stoakes
The remainder of vma_modify() relies upon the vmg state remaining pristine after a merge attempt. Usually this is the case, however in the one edge case scenario of a merge attempt failing not due to the specified range being unmergeable, but rather due to an out of memory error arising when attempting to commit the merge, this assumption becomes untrue. This results in vmg->start, end being modified, and thus the proceeding attempts to split the VMA will be done with invalid start/end values. Thankfully, it is likely practically impossible for us to hit this in reality, as it would require a maple tree node pre-allocation failure that would likely never happen due to it being 'too small to fail', i.e. the kernel would simply keep retrying reclaim until it succeeded. However, this scenario remains theoretically possible, and what we are doing here is wrong so we must correct it. The safest option is, when this scenario occurs, to simply give up the operation. If we cannot allocate memory to merge, then we cannot allocate memory to split either (perhaps moreso!). Any scenario where this would be happening would be under very extreme (likely fatal) memory pressure, so it's best we give up early. So there is no doubt it is appropriate to simply bail out in this scenario. However, in general we must if at all possible never assume VMG state is stable after a merge attempt, since merge operations update VMG fields. As a result, additionally also make this clear by storing start, end in local variables. The issue was reported originally by syzkaller, and by Brad Spengler (via an off-list discussion), and in both instances it manifested as a triggering of the assert: VM_WARN_ON_VMG(start >= end, vmg); In vma_merge_existing_range(). It seems at least one scenario in which this is occurring is one in which the merge being attempted is due to an madvise() across multiple VMAs which looks like this: start end |<------>| |----------|------| | vma | next | |----------|------| When madvise_walk_vmas() is invoked, we first find vma in the above (determining prev to be equal to vma as we are offset into vma), and then enter the loop. We determine the end of vma that forms part of the range we are madvise()'ing by setting 'tmp' to this value: /* Here vma->vm_start <= start < (end|vma->vm_end) */ tmp = vma->vm_end; We then invoke the madvise() operation via visit(), letting prev get updated to point to vma as part of the operation: /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */ error = visit(vma, &prev, start, tmp, arg); Where the visit() function pointer in this instance is madvise_vma_behavior(). As observed in syzkaller reports, it is ultimately madvise_update_vma() that is invoked, calling vma_modify_flags_name() and vma_modify() in turn. Then, in vma_modify(), we attempt the merge: merged = vma_merge_existing_range(vmg); if (merged) return merged; We invoke this with vmg->start, end set to start, tmp as such: start tmp |<--->| |----------|------| | vma | next | |----------|------| We find ourselves in the merge right scenario, but the one in which we cannot remove the middle (we are offset into vma). Here we have a special case where vmg->start, end get set to perhaps unintuitive values - we intended to shrink the middle VMA and expand the next. This means vmg->start, end are set to... vma->vm_start, start. Now the commit_merge() fails, and vmg->start, end are left like this. 
This means we return to the rest of vma_modify() with vmg->start, end (here denoted as start', end') set as: start' end' |<-->| |----------|------| | vma | next | |----------|------| So we now erroneously try to split accordingly. This is where the unfortunate stuff begins. We start with: /* Split any preceding portion of the VMA. */ if (vma->vm_start < vmg->start) { ... } This doesn't trigger as we are no longer offset into vma at the start. But then we invoke: /* Split any trailing portion of the VMA. */ if (vma->vm_end > vmg->end) { ... } Which does get invoked. This leaves us with: start' end' |<-->| |----|-----|------| | vma| new | next | |----|-----|------| We then return ultimately to madvise_walk_vmas(). Here 'new' is unknown, and putting back the values known in this function we are faced with: start tmp end | | | |----|-----|------| | vma| new | next | |----|-----|------| prev Then: start = tmp; So: start end | | |----|-----|------| | vma| new | next | |----|-----|------| prev The following code does not cause anything to happen: if (prev && start < prev->vm_end) start = prev->vm_end; if (start >= end) break; And then we invoke: if (prev) vma = find_vma(mm, prev->vm_end); Which is where a problem occurs - we don't know about 'new' so we essentially look for the vma after prev, which is new, whereas we actually intended to discover next! So we end up with: start end | | |----|-----|------| |prev| vma | next | |----|-----|------| And we have successfully bypassed all of the checks madvise_walk_vmas() has to ensure early exit should we end up moving out of range. We loop around, and hit: /* Here vma->vm_start <= start < (end|vma->vm_end) */ tmp = vma->vm_end; Oh dear. Now we have: tmp start end | | |----|-----|------| |prev| vma | next | |----|-----|------| We then invoke: /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */ error = visit(vma, &prev, start, tmp, arg); Where start == tmp. That is, a zero range. This is not good. We invoke visit() which is madvise_vma_behavior() which does not check the range (for good reason, it assumes all checks have been done before it was called), which in turn finally calls madvise_update_vma(). The madvise_update_vma() function calls vma_modify_flags_name() in turn, which ultimately invokes vma_modify() with... start == end. vma_modify() calls vma_merge_existing_range() and finally we hit: VM_WARN_ON_VMG(start >= end, vmg); Which triggers, as start == end. While it might be useful to add some CONFIG_DEBUG_VM asserts in these instances to catch this kind of error, since we have just eliminated any possibility of that happening, we will add such asserts separately as to reduce churn and aid backporting. Link: https://lkml.kernel.org/r/20250222161952.41957-1-lorenzo.stoakes@oracle.com Fixes: 2f1c6611b0a8 ("mm: introduce vma_merge_struct and abstract vma_merge(),vma_modify()") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Brad Spengler <brad.spengler@opensrcsec.com> Reported-by: Brad Spengler <brad.spengler@opensrcsec.com> Reported-by: syzbot+46423ed8fa1f1148c6e4@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-mm/6774c98f.050a0220.25abdd.0991.GAE@google.com/ Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-05mm/hugetlb: wait for hugetlb folios to be freedGe Yang
Since the introduction of commit c77c0a8ac4c52 ("mm/hugetlb: defer freeing of huge pages if in non-task context"), which supports deferring the freeing of hugetlb pages, the allocation of contiguous memory through cma_alloc() may fail probabilistically. In the CMA allocation process, if it is found that the CMA area is occupied by in-use hugetlb folios, these in-use hugetlb folios need to be migrated to another location. When there are no available hugetlb folios in the free hugetlb pool during the migration of in-use hugetlb folios, new folios are allocated from the buddy system. A temporary state is set on the newly allocated folio. Upon completion of the hugetlb folio migration, the temporary state is transferred from the new folios to the old folios. Normally, when the old folios with the temporary state are freed, it is directly released back to the buddy system. However, due to the deferred freeing of hugetlb pages, the PageBuddy() check fails, ultimately leading to the failure of cma_alloc(). Here is a simplified call trace illustrating the process: cma_alloc() ->__alloc_contig_migrate_range() // Migrate in-use hugetlb folios ->unmap_and_move_huge_page() ->folio_putback_hugetlb() // Free old folios ->test_pages_isolated() ->__test_page_isolated_in_pageblock() ->PageBuddy(page) // Check if the page is in buddy To resolve this issue, we have implemented a function named wait_for_freed_hugetlb_folios(). This function ensures that the hugetlb folios are properly released back to the buddy system after their migration is completed. By invoking wait_for_freed_hugetlb_folios() before calling PageBuddy(), we ensure that PageBuddy() will succeed. Link: https://lkml.kernel.org/r/1739936804-18199-1-git-send-email-yangge1116@126.com Fixes: c77c0a8ac4c5 ("mm/hugetlb: defer freeing of huge pages if in non-task context") Signed-off-by: Ge Yang <yangge1116@126.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
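A sketch of the described wait, assuming the deferred-freeing machinery (hpage_freelist / free_hpage_work) added by commit c77c0a8ac4c5; the exact body is an assumption:

    void wait_for_freed_hugetlb_folios(void)
    {
        if (llist_empty(&hpage_freelist))
            return;
        /* flush the deferred free work so the folios reach the buddy allocator */
        flush_work(&free_hpage_work);
    }

The isolation check described above would call this before testing PageBuddy().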
2025-03-05mm: fix possible NULL pointer dereference in __swap_duplicategao xu
Add a NULL check on the return value of swp_swap_info in __swap_duplicate to prevent crashes caused by a NULL pointer dereference. The reason why swp_swap_info() returns NULL is unclear; it may be due to CPU cache issues or DDR bit flips. The probability of this issue is very small - it has been observed to occur approximately 1 in 500,000 times per week. The stack info we encountered is as follows: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058 [RB/E]rb_sreason_str_set: sreason_str set null_pointer Mem abort info: ESR = 0x0000000096000005 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x05: level 1 translation fault Data abort info: ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 user pgtable: 4k pages, 39-bit VAs, pgdp=00000008a80e5000 [0000000000000058] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000 Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP Skip md ftrace buffer dump for: 0x1609e0 ... pc : swap_duplicate+0x44/0x164 lr : copy_page_range+0x508/0x1e78 sp : ffffffc0f2a699e0 x29: ffffffc0f2a699e0 x28: ffffff8a5b28d388 x27: ffffff8b06603388 x26: ffffffdf7291fe70 x25: 0000000000000006 x24: 0000000000100073 x23: 00000000002d2d2f x22: 0000000000000008 x21: 0000000000000000 x20: 00000000002d2d2f x19: 18000000002d2d2f x18: ffffffdf726faec0 x17: 0000000000000000 x16: 0010000000000001 x15: 0040000000000001 x14: 0400000000000001 x13: ff7ffffffffffb7f x12: ffeffffffffffbff x11: ffffff8a5c7e1898 x10: 0000000000000018 x9 : 0000000000000006 x8 : 1800000000000000 x7 : 0000000000000000 x6 : ffffff8057c01f10 x5 : 000000000000a318 x4 : 0000000000000000 x3 : 0000000000000000 x2 : 0000006daf200000 x1 : 0000000000000001 x0 : 18000000002d2d2f Call trace: swap_duplicate+0x44/0x164 copy_page_range+0x508/0x1e78 copy_process+0x1278/0x21cc kernel_clone+0x90/0x438 __arm64_sys_clone+0x5c/0x8c invoke_syscall+0x58/0x110 do_el0_svc+0x8c/0xe0 el0_svc+0x38/0x9c el0t_64_sync_handler+0x44/0xec el0t_64_sync+0x1a8/0x1ac Code: 9139c35a 71006f3f 54000568 f8797b55 (f9402ea8) ---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Oops: Fatal exception SMP: stopping secondary CPUs The patch seems to only provide a workaround, but there are no more effective software solutions to handle the bit-flip problem. This patch will change the issue from a system crash to a process exception, thereby reducing the impact on the entire machine. akpm: this is probably a kernel bug, but this patch keeps the system running and doesn't reduce that bug's debuggability. Link: https://lkml.kernel.org/r/e223b0e6ba2f4924984b1917cc717bd5@honor.com Signed-off-by: gao xu <gaoxu2@honor.com> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
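A minimal sketch of the described check in __swap_duplicate(); whether a warning is emitted and the exact error code are assumptions:

    si = swp_swap_info(entry);
    if (WARN_ON_ONCE(!si))
        return -EINVAL;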
2025-03-05dma: kmsan: export kmsan_handle_dma() for modulesSebastian Andrzej Siewior
kmsan_handle_dma() is used by virtio_ring() which can be built as a module. kmsan_handle_dma() needs to be exported otherwise building the virtio_ring fails. Export kmsan_handle_dma for modules. Link: https://lkml.kernel.org/r/20250218091411.MMS3wBN9@linutronix.de Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202502150634.qjxwSeJR-lkp@intel.com/ Fixes: 7ade4f10779c ("dma: kmsan: unpoison DMA mappings") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Macro Elver <elver@google.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
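The fix amounts to exporting the existing symbol; a minimal sketch, with the choice between the plain and _GPL export macro being an assumption:

    /* placed right after the kmsan_handle_dma() definition */
    EXPORT_SYMBOL(kmsan_handle_dma);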
2025-03-05hwpoison, memory_hotplug: lock folio before unmap hwpoisoned folioMa Wupeng
Commit b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined") added page poison checks in do_migrate_range in order to make offlining hwpoisoned pages possible by introducing isolate_lru_page and try_to_unmap for hwpoisoned pages. However, the folio lock must be held before calling try_to_unmap. Add it to fix this problem. The following warning is produced if the folio is not locked during unmap: ------------[ cut here ]------------ kernel BUG at ./include/linux/swapops.h:400! Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP Modules linked in: CPU: 4 UID: 0 PID: 411 Comm: bash Tainted: G W 6.13.0-rc1-00016-g3c434c7ee82a-dirty #41 Tainted: [W]=WARN Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : try_to_unmap_one+0xb08/0xd3c lr : try_to_unmap_one+0x3dc/0xd3c Call trace: try_to_unmap_one+0xb08/0xd3c (P) try_to_unmap_one+0x3dc/0xd3c (L) rmap_walk_anon+0xdc/0x1f8 rmap_walk+0x3c/0x58 try_to_unmap+0x88/0x90 unmap_poisoned_folio+0x30/0xa8 do_migrate_range+0x4a0/0x568 offline_pages+0x5a4/0x670 memory_block_action+0x17c/0x374 memory_subsys_offline+0x3c/0x78 device_offline+0xa4/0xd0 state_store+0x8c/0xf0 dev_attr_store+0x18/0x2c sysfs_kf_write+0x44/0x54 kernfs_fop_write_iter+0x118/0x1a8 vfs_write+0x3a8/0x4bc ksys_write+0x6c/0xf8 __arm64_sys_write+0x1c/0x28 invoke_syscall+0x44/0x100 el0_svc_common.constprop.0+0x40/0xe0 do_el0_svc+0x1c/0x28 el0_svc+0x30/0xd0 el0t_64_sync_handler+0xc8/0xcc el0t_64_sync+0x198/0x19c Code: f9407be0 b5fff320 d4210000 17ffff97 (d4210000) ---[ end trace 0000000000000000 ]--- Link: https://lkml.kernel.org/r/20250217014329.3610326-4-mawupeng1@huawei.com Fixes: b15c87263a69 ("hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined") Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
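A sketch of the locking order described above; the argument list of unmap_poisoned_folio() is assumed:

    /* in do_migrate_range(), for a hwpoisoned folio */
    folio_lock(folio);
    unmap_poisoned_folio(folio, pfn, false);
    folio_unlock(folio);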
2025-03-05mm: memory-hotplug: check folio ref count first in do_migrate_rangeMa Wupeng
If a folio has an elevated reference count, folio_try_get() will acquire it, the necessary operations will be performed, and the reference will then be released. In the case of a poisoned folio without an elevated reference count (which is unlikely for memory-failure), folio_try_get() will simply skip it. Therefore, move the folio_try_get() call, which checks for and acquires this reference count, to the beginning. Link: https://lkml.kernel.org/r/20250217014329.3610326-3-mawupeng1@huawei.com Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
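Roughly, the reordered loop body looks like the sketch below; the control flow is heavily simplified and the helpers are illustrative rather than a quote of the upstream diff:

  for (pfn = start_pfn; pfn < end_pfn; pfn++) {
          struct folio *folio = page_folio(pfn_to_page(pfn));

          /* Take the reference first; an unreferenced folio is skipped. */
          if (!folio_try_get(folio))
                  continue;

          if (folio_test_hwpoison(folio)) {
                  /* isolate/unmap the poisoned folio under this reference */
          }

          /* ... otherwise queue the folio for migration ... */
          folio_put(folio);
  }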
2025-03-05mm: memory-failure: update ttu flag inside unmap_poisoned_folioMa Wupeng
Patch series "mm: memory_failure: unmap poisoned folio during migrate properly", v3. Fix two bugs during folio migration if the folio is poisoned. This patch (of 3): Commit 6da6b1d4a7df ("mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON") introduce TTU_HWPOISON to replace TTU_IGNORE_HWPOISON in order to stop send SIGBUS signal when accessing an error page after a memory error on a clean folio. However during page migration, anon folio must be set with TTU_HWPOISON during unmap_*(). For pagecache we need some policy just like the one in hwpoison_user_mappings to set this flag. So move this policy from hwpoison_user_mappings to unmap_poisoned_folio to handle this warning properly. Warning will be produced during unamp poison folio with the following log: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 365 at mm/rmap.c:1847 try_to_unmap_one+0x8fc/0xd3c Modules linked in: CPU: 1 UID: 0 PID: 365 Comm: bash Tainted: G W 6.13.0-rc1-00018-gacdb4bbda7ab #42 Tainted: [W]=WARN Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015 pstate: 20400005 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : try_to_unmap_one+0x8fc/0xd3c lr : try_to_unmap_one+0x3dc/0xd3c Call trace: try_to_unmap_one+0x8fc/0xd3c (P) try_to_unmap_one+0x3dc/0xd3c (L) rmap_walk_anon+0xdc/0x1f8 rmap_walk+0x3c/0x58 try_to_unmap+0x88/0x90 unmap_poisoned_folio+0x30/0xa8 do_migrate_range+0x4a0/0x568 offline_pages+0x5a4/0x670 memory_block_action+0x17c/0x374 memory_subsys_offline+0x3c/0x78 device_offline+0xa4/0xd0 state_store+0x8c/0xf0 dev_attr_store+0x18/0x2c sysfs_kf_write+0x44/0x54 kernfs_fop_write_iter+0x118/0x1a8 vfs_write+0x3a8/0x4bc ksys_write+0x6c/0xf8 __arm64_sys_write+0x1c/0x28 invoke_syscall+0x44/0x100 el0_svc_common.constprop.0+0x40/0xe0 do_el0_svc+0x1c/0x28 el0_svc+0x30/0xd0 el0t_64_sync_handler+0xc8/0xcc el0t_64_sync+0x198/0x19c ---[ end trace 0000000000000000 ]--- [mawupeng1@huawei.com: unmap_poisoned_folio(): remove shadowed local `mapping', per Miaohe] Link: https://lkml.kernel.org/r/20250219060653.3849083-1-mawupeng1@huawei.com Link: https://lkml.kernel.org/r/20250217014329.3610326-1-mawupeng1@huawei.com Link: https://lkml.kernel.org/r/20250217014329.3610326-2-mawupeng1@huawei.com Fixes: 6da6b1d4a7df ("mm/hwpoison: convert TTU_IGNORE_HWPOISON to TTU_HWPOISON") Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Ma Wupeng <mawupeng1@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-04mm: Remove wait_on_page_locked()Matthew Wilcox (Oracle)
This compatibility wrapper has no callers left, so remove it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-03-04mm: Remove grab_cache_page_write_begin()Matthew Wilcox (Oracle)
All callers have now been converted to use folios, so remove this compatibility wrapper. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
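Converted callers typically open-code the folio lookup instead, along these lines (a sketch; error-handling style varies per filesystem):

  struct folio *folio;

  /* FGP_WRITEBEGIN bundles FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE. */
  folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
                              mapping_gfp_mask(mapping));
  if (IS_ERR(folio))
          return PTR_ERR(folio);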
2025-03-04mm: Remove wait_for_stable_page()Matthew Wilcox (Oracle)
The last caller has been converted to call folio_wait_stable(), so we can remove this wrapper. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
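For reference, the folio-based call that replaces the removed wrapper (sketch):

  struct folio *folio = page_folio(page);

  /* Wait for writeback if the backing device requires stable pages. */
  folio_wait_stable(folio);       /* was: wait_for_stable_page(page) */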
2025-03-04Merge branch 'x86/cpu' into x86/asm, to pick up dependent commitsIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-04Merge branch 'x86/urgent' into x86/cpu, to pick up dependent commitsIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-04slab: Mark large folios for debugging purposesMatthew Wilcox (Oracle)
If a user calls p = kmalloc(1024); kfree(p); kfree(p); and 'p' was the only object in the slab, we may free the slab after the first call to kfree(). If we do, we clear PGTY_slab and the second call to kfree() will call free_large_kmalloc(). That will leave a trace in the logs ("object pointer: 0x%p"), but otherwise proceed to free the memory, which is likely to corrupt the page allocator's metadata. Allocate a new page type for large kmalloc and mark the memory with it while it's allocated. That lets us detect this double-free and return without harming any data structures. Reported-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
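A sketch of how the new page type catches the second kfree(); the folio_*_large_kmalloc() helper names below follow the usual page-type naming convention but are assumptions, not necessarily the exact upstream identifiers:

  /* At allocation time (sketch): tag the folio while it is live. */
  folio_set_large_kmalloc(folio);

  /* In free_large_kmalloc() (sketch): detect a stale or double free. */
  static void free_large_kmalloc(struct folio *folio, void *object)
  {
          if (WARN_ONCE(!folio_test_large_kmalloc(folio),
                        "object pointer: 0x%p\n", object))
                  return;         /* not a live large kmalloc: bail out */

          folio_clear_large_kmalloc(folio);
          /* ... uncharge and free the pages as before ... */
  }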
2025-03-04mm, slab: cleanup slab_bug() parametersVlastimil Babka
slab_err() has variadic printf arguments, but instead of passing them to slab_bug() it does a vsnprintf() to a buffer and passes "%s", buf. To allow passing them directly, turn slab_bug() into __slab_bug() with a va_list parameter, and make slab_bug() a wrapper with fmt, ... parameters. Then slab_err() can call __slab_bug() without the intermediate buffer. Also constify fmt everywhere, which simplifies object_err()'s call to slab_bug(). Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
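The resulting shape is the standard variadic/va_list wrapper split, roughly as below (argument lists abbreviated; the real functions also take the slab and more context):

  static void __slab_bug(struct kmem_cache *s, const char *fmt, va_list args)
  {
          struct va_format vaf = { .fmt = fmt, .va = &args };

          pr_err("BUG %s: %pV\n", s ? s->name : "<unknown>", &vaf);
  }

  static void slab_bug(struct kmem_cache *s, const char *fmt, ...)
  {
          va_list args;

          va_start(args, fmt);
          __slab_bug(s, fmt, args);
          va_end(args);
  }

  /* slab_err() can now forward its own va_list, with no vsnprintf() buffer. */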
2025-03-04mm: slub: call WARN() when detecting a slab corruptionHyesoo Yu
If a slab object is corrupted or an error occurs in its internal validation, continuing after restoration may cause other side effects. At that point it is difficult to debug, because the problem occurred in the past. It is useful to use WARN() to catch errors at the point of issue, because WARN() can trigger a panic for system debugging when panic_on_warn is enabled. WARN() is added where the error is detected, in slab_err() and object_err(). It makes sense to only do the WARN() after printing the logs. slab_err() is split into __slab_err(), which calls WARN() and is called after the logs are printed. Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
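The ordering matters on panic_on_warn systems: all diagnostics must reach the log before the WARN() fires. A sketch of that shape, with every helper other than WARN() being illustrative:

  static void __slab_err(struct slab *slab)
  {
          /* Diagnostics first, so a panic_on_warn panic still captures them. */
          print_slab_info(slab);                  /* illustrative helper */
          add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);

          WARN_ON(1);     /* may escalate to panic if panic_on_warn is set */
  }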
2025-03-04mm: slub: Print the broken data before restoring themHyesoo Yu
Previously, the restore occurred after printing the object in SLUB. After commit 47d911b02cbe ("slab: make check_object() more consistent"), the bytes are printed after the restore. The information about the bytes before the restore is highly valuable for debugging purposes. For instance, in the event of a cache issue, it displays the byte patterns broken down into 64-byte units. Without this information, we can only speculate on how the object was broken. Hence the corrupted regions should be printed prior to the restoration. However, if an object is broken in multiple places, the same log may be output multiple times. Therefore the SLUB report is emitted only once, to prevent redundant printing, by passing a parameter that indicates whether an error has been reported previously. Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
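Illustrative only (not the upstream signatures): the checker dumps the raw bytes, restores them afterwards, and a caller-provided flag keeps the per-object report from being printed more than once:

  static bool report_and_restore(u8 *fault, u8 *end, u8 value,
                                 bool *object_reported)
  {
          if (!*object_reported) {
                  pr_err("corrupt bytes %p-%p, expected 0x%02x\n",
                         fault, end - 1, value);
                  *object_reported = true;        /* report the object once */
          }
          memset(fault, value, end - fault);      /* restore *after* printing */
          return false;
  }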
2025-03-04slab: Achieve better kmalloc caches randomization in kvmallocGONG Ruiqi
As revealed by this writeup[1], due to the fact that __kmalloc_node (now renamed to __kmalloc_node_noprof) is an exported symbol and will never get inlined, using it in kvmalloc_node (now __kvmalloc_node_noprof) makes the _RET_IP_ inside always point to the same address: upper_caller kvmalloc kvmalloc_node kvmalloc_node_noprof __kvmalloc_node_noprof <-- all macros all the way down here __kmalloc_node_noprof __do_kmalloc_node(.., _RET_IP_) ... <-- _RET_IP_ points to That means every kmalloc invoked via kvmalloc uses the same seed for cache randomization (CONFIG_RANDOM_KMALLOC_CACHES), which makes this hardening non-functional. The root cause of this problem, IMHO, is that _RET_IP_ alone cannot identify the actual allocation site when kmalloc is called inside non-inlined wrappers or helper functions. And I believe there could be similar cases in other functions. Nevertheless, I haven't thought of any good solution for this. So for now let's solve this specific case first. In __kvmalloc_node_noprof, replace the call to __kmalloc_node_noprof with a direct call to __do_kmalloc_node, so that _RET_IP_ takes the return address of kvmalloc and differentiates each kvmalloc invocation: upper_caller kvmalloc kvmalloc_node kvmalloc_node_noprof __kvmalloc_node_noprof <-- all macros all the way down here __do_kmalloc_node(.., _RET_IP_) ... <-- _RET_IP_ points to Thanks to Tamás Koczka for the report and discussion! Link: https://github.com/google/security-research/blob/908d59b573960dc0b90adda6f16f7017aca08609/pocs/linux/kernelctf/CVE-2024-27397_mitigation/docs/exploit.md?plain=1#L259 [1] Reported-by: Tamás Koczka <poprdi@google.com> Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
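A sketch of the resulting __kvmalloc_node_noprof() fast path; the parameter list and the bucket argument of __do_kmalloc_node() are simplified assumptions, the point being only that the inlinable internal helper is now called with _RET_IP_ directly:

  void *__kvmalloc_node_noprof(size_t size, gfp_t flags, int node)
  {
          void *ret;

          /*
           * was: __kmalloc_node_noprof(...) -- an exported, never-inlined
           * symbol, so _RET_IP_ inside it was always the same address and
           * every kvmalloc() shared one kmalloc-cache randomization seed.
           */
          ret = __do_kmalloc_node(size, NULL,
                                  kmalloc_gfp_adjust(flags, size),
                                  node, _RET_IP_);        /* simplified args */
          if (ret || size <= PAGE_SIZE)
                  return ret;

          /* ... otherwise fall back to vmalloc, as before ... */
          return NULL;    /* placeholder for the vmalloc fallback */
  }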
2025-03-04slab: Adjust placement of __kvmalloc_node_noprofGONG Ruiqi
Move __kvmalloc_node_noprof (as well as kvfree*, kvrealloc_noprof and kmalloc_gfp_adjust for consistency) into mm/slub.c so that it can directly invoke __do_kmalloc_node, which is needed for the next patch. No functional changes intended. Signed-off-by: GONG Ruiqi <gongruiqi1@huawei.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2025-03-04mm/slab: simplify SLAB_* flag handlingKevin Brodsky
SLUB is the only remaining allocator. We can therefore get rid of the logic for allocator-specific flags: * Merge SLAB_CACHE_FLAGS into SLAB_CORE_FLAGS. * Remove CACHE_CREATE_MASK and instead mask out SLAB_DEBUG_FLAGS if !CONFIG_SLUB_DEBUG. SLAB_DEBUG_FLAGS is now defined unconditionally (no impact on existing code, which ignores it if !CONFIG_SLUB_DEBUG). * Define SLAB_FLAGS_PERMITTED in terms of SLAB_CORE_FLAGS and SLAB_DEBUG_FLAGS (no functional change). While at it, also remove misleading comments that suggest that multiple allocators are available. Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
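The end state sketched below captures the idea; the real flag lists in mm/slab.h are longer, and the helper wrapping the mask-out is named purely for illustration:

  /* Defined unconditionally now; ignored by code built without CONFIG_SLUB_DEBUG. */
  #define SLAB_DEBUG_FLAGS  (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER)

  /* Former SLAB_CACHE_FLAGS content is folded into SLAB_CORE_FLAGS. */
  #define SLAB_CORE_FLAGS   (SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT)

  #define SLAB_FLAGS_PERMITTED  (SLAB_CORE_FLAGS | SLAB_DEBUG_FLAGS)

  /* Replaces CACHE_CREATE_MASK at cache-creation time (illustrative helper). */
  static inline slab_flags_t filter_cache_flags(slab_flags_t flags)
  {
          if (!IS_ENABLED(CONFIG_SLUB_DEBUG))
                  flags &= ~SLAB_DEBUG_FLAGS;
          return flags;
  }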