summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2024-08-30mm: Fix filemap_invalidate_inode() to use invalidate_inode_pages2_range()David Howells
Fix filemap_invalidate_inode() to use invalidate_inode_pages2_range() rather than truncate_inode_pages_range(). The latter clears the invalidated bit of a partial pages rather than discarding it entirely. This causes copy_file_range() to fail on cifs because the partial pages at either end of the destination range aren't evicted and reread, but rather just partly cleared. This causes generic/075 and generic/112 xfstests to fail. Fixes: 74e797d79cf1 ("mm: Provide a means of invalidation without using launder_folio") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20240828210249.1078637-5-dhowells@redhat.com cc: Matthew Wilcox <willy@infradead.org> cc: Miklos Szeredi <miklos@szeredi.hu> cc: Trond Myklebust <trond.myklebust@hammerspace.com> cc: Christoph Hellwig <hch@lst.de> cc: Andrew Morton <akpm@linux-foundation.org> cc: Alexander Viro <viro@zeniv.linux.org.uk> cc: Christian Brauner <brauner@kernel.org> cc: Jeff Layton <jlayton@kernel.org> cc: linux-mm@kvack.org cc: linux-fsdevel@vger.kernel.org cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: ceph-devel@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: linux-nfs@vger.kernel.org cc: devel@lists.orangefs.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-30fs: move FMODE_UNSIGNED_OFFSET to fop_flagsChristian Brauner
This is another flag that is statically set and doesn't need to use up an FMODE_* bit. Move it to ->fop_flags and free up another FMODE_* bit. (1) mem_open() used from proc_mem_operations (2) adi_open() used from adi_fops (3) drm_open_helper(): (3.1) accel_open() used from DRM_ACCEL_FOPS (3.2) drm_open() used from (3.2.1) amdgpu_driver_kms_fops (3.2.2) psb_gem_fops (3.2.3) i915_driver_fops (3.2.4) nouveau_driver_fops (3.2.5) panthor_drm_driver_fops (3.2.6) radeon_driver_kms_fops (3.2.7) tegra_drm_fops (3.2.8) vmwgfx_driver_fops (3.2.9) xe_driver_fops (3.2.10) DRM_GEM_FOPS (3.2.11) DEFINE_DRM_GEM_DMA_FOPS (4) struct memdev sets fmode flags based on type of device opened. For devices using struct mem_fops unsigned offset is used. Mark all these file operations as FOP_UNSIGNED_OFFSET and add asserts into the open helper to ensure that the flag is always set. Link: https://lore.kernel.org/r/20240809-work-fop_unsigned-v1-1-658e054d893e@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-29mm: add kmem_cache_create_rcu()Christian Brauner
When a kmem cache is created with SLAB_TYPESAFE_BY_RCU the free pointer must be located outside of the object because we don't know what part of the memory can safely be overwritten as it may be needed to prevent object recycling. That has the consequence that SLAB_TYPESAFE_BY_RCU may end up adding a new cacheline. This is the case for e.g., struct file. After having it shrunk down by 40 bytes and having it fit in three cachelines we still have SLAB_TYPESAFE_BY_RCU adding a fourth cacheline because it needs to accommodate the free pointer. Add a new kmem_cache_create_rcu() function that allows the caller to specify an offset where the free pointer is supposed to be placed. Link: https://lore.kernel.org/r/20240828-work-kmem_cache-rcu-v3-2-5460bc1f09f6@kernel.org Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-29mm: remove unused argument from create_cache()Christian Brauner
That root_cache argument is unused so remove it. Link: https://lore.kernel.org/r/20240828-work-kmem_cache-rcu-v3-1-5460bc1f09f6@kernel.org Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-27mm, slab: use kmem_cache_free() to free from kmem_buckets_cacheYan Zhen
In kmem_buckets_create(), the kmem_buckets object is allocated by kmem_cache_alloc() from kmem_buckets_cache, but in the failure case, it's freed by kfree(). This is not wrong, but using kmem_cache_free() is the more common pattern, so use it. Signed-off-by: Yan Zhen <yanzhen@vivo.com> Reviewed-by: Christoph Lameter <cl@linux.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-08-27slub: Introduce CONFIG_SLUB_RCU_DEBUGJann Horn
Currently, KASAN is unable to catch use-after-free in SLAB_TYPESAFE_BY_RCU slabs because use-after-free is allowed within the RCU grace period by design. Add a SLUB debugging feature which RCU-delays every individual kmem_cache_free() before either actually freeing the object or handing it off to KASAN, and change KASAN to poison freed objects as normal when this option is enabled. For now I've configured Kconfig.debug to default-enable this feature in the KASAN GENERIC and SW_TAGS modes; I'm not enabling it by default in HW_TAGS mode because I'm not sure if it might have unwanted performance degradation effects there. Note that this is mostly useful with KASAN in the quarantine-based GENERIC mode; SLAB_TYPESAFE_BY_RCU slabs are basically always also slabs with a ->ctor, and KASAN's assign_tag() currently has to assign fixed tags for those, reducing the effectiveness of SW_TAGS/HW_TAGS mode. (A possible future extension of this work would be to also let SLUB call the ->ctor() on every allocation instead of only when the slab page is allocated; then tag-based modes would be able to assign new tags on every reallocation.) Tested-by: syzbot+263726e59eab6b442723@syzkaller.appspotmail.com Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Acked-by: Marco Elver <elver@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> #slab Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-08-27kasan: catch invalid free before SLUB reinitializes the objectJann Horn
Currently, when KASAN is combined with init-on-free behavior, the initialization happens before KASAN's "invalid free" checks. More importantly, a subsequent commit will want to RCU-delay the actual SLUB freeing of an object, and we'd like KASAN to still validate synchronously that freeing the object is permitted. (Otherwise this change will make the existing testcase kmem_cache_invalid_free fail.) So add a new KASAN hook that allows KASAN to pre-validate a kmem_cache_free() operation before SLUB actually starts modifying the object or its metadata. Inside KASAN, this: - moves checks from poison_slab_object() into check_slab_allocation() - moves kasan_arch_is_ready() up into callers of poison_slab_object() - removes "ip" argument of poison_slab_object() and __kasan_slab_free() (since those functions no longer do any reporting) Acked-by: Vlastimil Babka <vbabka@suse.cz> #slub Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-08-27mm, slab: call kvfree_rcu_barrier() from kmem_cache_destroy()Vlastimil Babka
We would like to replace call_rcu() users with kfree_rcu() where the existing callback is just a kmem_cache_free(). However this causes issues when the cache can be destroyed (such as due to module unload). Currently such modules should be issuing rcu_barrier() before kmem_cache_destroy() to have their call_rcu() callbacks processed first. This barrier is however not sufficient for kfree_rcu() in flight due to the batching introduced by a35d16905efc ("rcu: Add basic support for kfree_rcu() batching"). This is not a problem for kmalloc caches which are never destroyed, but since removing SLOB, kfree_rcu() is allowed also for any other cache, that might be destroyed. In order not to complicate the API, put the responsibility for handling outstanding kfree_rcu() in kmem_cache_destroy() itself. Use the newly introduced kvfree_rcu_barrier() to wait before destroying the cache. This is similar to how we issue rcu_barrier() for SLAB_TYPESAFE_BY_RCU caches, but has to be done earlier, as the latter only needs to wait for the empty slab pages to finish freeing, and not objects from the slab. Users of call_rcu() with arbitrary callbacks should still issue rcu_barrier() before destroying the cache and unloading the module, as kvfree_rcu_barrier() is not a superset of rcu_barrier() and the callbacks may be invoking module code or performing other actions that are necessary for a successful unload. Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-08-27mm, slab: reintroduce rcu_barrier() into kmem_cache_destroy()Vlastimil Babka
There used to be a rcu_barrier() for SLAB_TYPESAFE_BY_RCU caches in kmem_cache_destroy() until commit 657dc2f97220 ("slab: remove synchronous rcu_barrier() call in memcg cache release path") moved it to an asynchronous work that finishes the destroying of such caches. The motivation for that commit was the MEMCG_KMEM integration that at the time created and removed clones of the global slab caches together with their cgroups, and blocking cgroups removal was unwelcome. The implementation later changed to per-object memcg tracking using a single cache, so there should be no more need for a fast non-blocking kmem_cache_destroy(), which is typically only done when a module is unloaded etc. Going back to synchronous barrier has the following advantages: - simpler implementation - it's easier to test the result of kmem_cache_destroy() in a kunit test Thus effectively revert commit 657dc2f97220. It is not a 1:1 revert as the code has changed since. The main part is that kmem_cache_release(s) is always called from kmem_cache_destroy(), but for SLAB_TYPESAFE_BY_RCU caches there's a rcu_barrier() first. Suggested-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Jann Horn <jannh@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-08-27mm, slab: move kfence_shutdown_cache() outside slab_mutexVlastimil Babka
kfence_shutdown_cache() is called under slab_mutex when the cache is destroyed synchronously, and outside slab_mutex during the delayed destruction of SLAB_TYPESAFE_BY_RCU caches. It seems it should always be safe to call it outside of slab_mutex so we can just move the call to kmem_cache_release(), which is called outside. Reviewed-by: Jann Horn <jannh@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-08-27mm, slab: unlink slabinfo, sysfs and debugfs immediatelyVlastimil Babka
kmem_cache_destroy() includes removing the associated sysfs and debugfs directories, and the cache from the list of caches that appears in /proc/slabinfo. Currently this might not happen immediately when: - the cache is SLAB_TYPESAFE_BY_RCU and the cleanup is delayed, including the directores removal - __kmem_cache_shutdown() fails due to outstanding objects - the directories remain indefinitely When a cache is recreated with the same name, such as due to module unload followed by a load, the directories will fail to be recreated for the new instance of the cache due to the old directories being present. The cache will also appear twice in /proc/slabinfo. While we want to convert the SLAB_TYPESAFE_BY_RCU cleanup to be synchronous again, the second point remains. So let's fix this first and have the directories and slabinfo removed immediately in kmem_cache_destroy() and regardless of __kmem_cache_shutdown() success. This should not make debugging harder if __kmem_cache_shutdown() fails, because a detailed report of outstanding objects is printed into dmesg already due to the failure. Also simplify kmem_cache_release() sysfs handling by using __is_defined(SLAB_SUPPORTS_SYSFS). Note the resulting code in kmem_cache_destroy() is a bit ugly but will be further simplified - this is in order to make small bisectable steps. Reviewed-by: Jann Horn <jannh@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-08-27mm, slab: dissolve shutdown_cache() into its callerVlastimil Babka
There's only one caller of shutdown_cache() so move the code into it. Preparatory patch for further changes, no functional change. Reviewed-by: Jann Horn <jannh@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-08-26Reenable NUMA policy support in the slab allocatorChristoph Lameter
Revert commit 8014c46ad991 ("slub: use alloc_pages_node() in alloc_slab_page()"). The patch disabled the numa policy support in the slab allocator. It did not consider that alloc_pages() uses memory policies but alloc_pages_node() does not. As a result of this patch slab memory allocations are no longer spread via interleave policy across all available NUMA nodes on bootup. Instead all slab memory is allocated close to the boot processor. This leads to an imbalance of memory accesses on NUMA systems. Also applications using MPOL_INTERLEAVE as a memory policy will no longer spread slab allocations over all nodes in the interleave set but allocate memory locally. This may also result in unbalanced allocations on a single numa node. SLUB does not apply memory policies to individual object allocations. However, it relies on the page allocators support of memory policies through alloc_pages() to do the NUMA memory allocations on a per folio or page level. SLUB also applies memory policies when retrieving partial allocated slab pages from the partial list. Fixes: 8014c46ad991 ("slub: use alloc_pages_node() in alloc_slab_page()") Signed-off-by: Christoph Lameter <cl@gentwo.org> Reviewed-by: Yang Shi <shy828301@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-08-26mm, slub: print CPU id (and its node) on slab OOMAxel Rasmussen
Depending on how remote_node_defrag_ratio is configured, allocations can end up in this path as a result of the local node being OOM, despite the allocation overall being unconstrained (node == -1). When we print a warning, printing the current CPU makes that situation more clear (i.e., you can immediately see which node's OOM status matters for the allocation at hand). Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-08-26slab: Warn on duplicate cache names when DEBUG_VM=yPedro Falcato
Duplicate slab cache names can create havoc for userspace tooling that expects slab cache names to be unique [1]. This is a reasonable expectation. Sadly, too many duplicate name problems are out there in the wild, so simply warn instead of pr_err() + failing the sanity check. [ vbabka@suse.cz: change to WARN_ON(), see the discussion at [2] ] Link: https://lore.kernel.org/linux-fsdevel/2d1d053da1cafb3e7940c4f25952da4f0af34e38.1722293276.git.osandov@fb.com/ [1] Link: https://lore.kernel.org/all/20240807090746.2146479-1-pedro.falcato@gmail.com/ [2] Signed-off-by: Pedro Falcato <pedro.falcato@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-08-24mm: Fix missing folio invalidation calls during truncationDavid Howells
When AS_RELEASE_ALWAYS is set on a mapping, the ->release_folio() and ->invalidate_folio() calls should be invoked even if PG_private and PG_private_2 aren't set. This is used by netfslib to keep track of the point above which reads can be skipped in favour of just zeroing pagecache locally. There are a couple of places in truncation in which invalidation is only called when folio_has_private() is true. Fix these to check folio_needs_release() instead. Without this, the generic/075 and generic/112 xfstests (both fsx-based tests) fail with minimum folio size patches applied[1]. Fixes: b4fa966f03b7 ("mm, netfs, fscache: stop read optimisation when folio removed from pagecache") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20240815090849.972355-1-kernel@pankajraghav.com/ [1] Link: https://lore.kernel.org/r/20240823200819.532106-2-dhowells@redhat.com Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> cc: Matthew Wilcox (Oracle) <willy@infradead.org> cc: Pankaj Raghav <p.raghav@samsung.com> cc: Jeff Layton <jlayton@kernel.org> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: netfs@lists.linux.dev cc: linux-mm@kvack.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-23readahead: allocate folios with mapping_min_order in readaheadPankaj Raghav
page_cache_ra_unbounded() was allocating single pages (0 order folios) if there was no folio found in an index. Allocate mapping_min_order folios as we need to guarantee the minimum order if it is set. page_cache_ra_order() tries to allocate folio to the page cache with a higher order if the index aligns with that order. Modify it so that the order does not go below the mapping_min_order requirement of the page cache. This function will do the right thing even if the new_order passed is less than the mapping_min_order. When adding new folios to the page cache we must also ensure the index used is aligned to the mapping_min_order as the page cache requires the index to be aligned to the order of the folio. readahead_expand() is called from readahead aops to extend the range of the readahead so this function can assume ractl->_index to be aligned with min_order. Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Co-developed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240822135018.1931258-4-kernel@pankajraghav.com Tested-by: David Howells <dhowells@redhat.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-23filemap: allocate mapping_min_order folios in the page cachePankaj Raghav
filemap_create_folio() and do_read_cache_folio() were always allocating folio of order 0. __filemap_get_folio was trying to allocate higher order folios when fgp_flags had higher order hint set but it will default to order 0 folio if higher order memory allocation fails. Supporting mapping_min_order implies that we guarantee each folio in the page cache has at least an order of mapping_min_order. When adding new folios to the page cache we must also ensure the index used is aligned to the mapping_min_order as the page cache requires the index to be aligned to the order of the folio. Co-developed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20240822135018.1931258-3-kernel@pankajraghav.com Tested-by: David Howells <dhowells@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-23fs: Allow fine-grained control of folio sizesMatthew Wilcox (Oracle)
We need filesystems to be able to communicate acceptable folio sizes to the pagecache for a variety of uses (e.g. large block sizes). Support a range of folio sizes between order-0 and order-31. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Co-developed-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20240822135018.1931258-2-kernel@pankajraghav.com Tested-by: David Howells <dhowells@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-20x86/kaslr: Expose and use the end of the physical memory address spaceThomas Gleixner
iounmap() on x86 occasionally fails to unmap because the provided valid ioremap address is not below high_memory. It turned out that this happens due to KASLR. KASLR uses the full address space between PAGE_OFFSET and vaddr_end to randomize the starting points of the direct map, vmalloc and vmemmap regions. It thereby limits the size of the direct map by using the installed memory size plus an extra configurable margin for hot-plug memory. This limitation is done to gain more randomization space because otherwise only the holes between the direct map, vmalloc, vmemmap and vaddr_end would be usable for randomizing. The limited direct map size is not exposed to the rest of the kernel, so the memory hot-plug and resource management related code paths still operate under the assumption that the available address space can be determined with MAX_PHYSMEM_BITS. request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1 downwards. That means the first allocation happens past the end of the direct map and if unlucky this address is in the vmalloc space, which causes high_memory to become greater than VMALLOC_START and consequently causes iounmap() to fail for valid ioremap addresses. MAX_PHYSMEM_BITS cannot be changed for that because the randomization does not align with address bit boundaries and there are other places which actually require to know the maximum number of address bits. All remaining usage sites of MAX_PHYSMEM_BITS have been analyzed and found to be correct. Cure this by exposing the end of the direct map via PHYSMEM_END and use that for the memory hot-plug and resource management related places instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END maps to a variable which is initialized by the KASLR initialization and otherwise it is based on MAX_PHYSMEM_BITS as before. To prevent future hickups add a check into add_pages() to catch callers trying to add memory above PHYSMEM_END. Fixes: 0483e1fa6e09 ("x86/mm: Implement ASLR for kernel memory regions") Reported-by: Max Ramanouski <max8rr8@gmail.com> Reported-by: Alistair Popple <apopple@nvidia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-By: Max Ramanouski <max8rr8@gmail.com> Tested-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Kees Cook <kees@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/87ed6soy3z.ffs@tglx
2024-08-17Merge tag 'mm-hotfixes-stable-2024-08-17-19-34' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "16 hotfixes. All except one are for MM. 10 of these are cc:stable and the others pertain to post-6.10 issues. As usual with these merges, singletons and doubletons all over the place, no identifiable-by-me theme. Please see the lovingly curated changelogs to get the skinny" * tag 'mm-hotfixes-stable-2024-08-17-19-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/migrate: fix deadlock in migrate_pages_batch() on large folios alloc_tag: mark pages reserved during CMA activation as not tagged alloc_tag: introduce clear_page_tag_ref() helper function crash: fix riscv64 crash memory reserve dead loop selftests: memfd_secret: don't build memfd_secret test on unsupported arches mm: fix endless reclaim on machines with unaccepted memory selftests/mm: compaction_test: fix off by one in check_compaction() mm/numa: no task_numa_fault() call if PMD is changed mm/numa: no task_numa_fault() call if PTE is changed mm/vmalloc: fix page mapping if vm_area_alloc_pages() with high order fallback to order 0 mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpu mm: don't account memmap per-node mm: add system wide stats items category mm: don't account memmap on failure mm/hugetlb: fix hugetlb vs. core-mm PT locking mseal: fix is_madv_discard()
2024-08-16Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull memcg-v1 fix from Al Viro: "memcg_write_event_control() oops fix" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: memcg_write_event_control(): fix a user-triggerable oops
2024-08-15mm/migrate: fix deadlock in migrate_pages_batch() on large foliosGao Xiang
Currently, migrate_pages_batch() can lock multiple locked folios with an arbitrary order. Although folio_trylock() is used to avoid deadlock as commit 2ef7dbb26990 ("migrate_pages: try migrate in batch asynchronously firstly") mentioned, it seems try_split_folio() is still missing. It was found by compaction stress test when I explicitly enable EROFS compressed files to use large folios, which case I cannot reproduce with the same workload if large folio support is off (current mainline). Typically, filesystem reads (with locked file-backed folios) could use another bdev/meta inode to load some other I/Os (e.g. inode extent metadata or caching compressed data), so the locking order will be: file-backed folios (A) bdev/meta folios (B) The following calltrace shows the deadlock: Thread 1 takes (B) lock and tries to take folio (A) lock Thread 2 takes (A) lock and tries to take folio (B) lock [Thread 1] INFO: task stress:1824 blocked for more than 30 seconds. Tainted: G OE 6.10.0-rc7+ #6 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:stress state:D stack:0 pid:1824 tgid:1824 ppid:1822 flags:0x0000000c Call trace: __switch_to+0xec/0x138 __schedule+0x43c/0xcb0 schedule+0x54/0x198 io_schedule+0x44/0x70 folio_wait_bit_common+0x184/0x3f8 <-- folio mapping ffff00036d69cb18 index 996 (**) __folio_lock+0x24/0x38 migrate_pages_batch+0x77c/0xea0 // try_split_folio (mm/migrate.c:1486:2) // migrate_pages_batch (mm/migrate.c:1734:16) <--- LIST_HEAD(unmap_folios) has .. folio mapping 0xffff0000d184f1d8 index 1711; (*) folio mapping 0xffff0000d184f1d8 index 1712; .. migrate_pages+0xb28/0xe90 compact_zone+0xa08/0x10f0 compact_node+0x9c/0x180 sysctl_compaction_handler+0x8c/0x118 proc_sys_call_handler+0x1a8/0x280 proc_sys_write+0x1c/0x30 vfs_write+0x240/0x380 ksys_write+0x78/0x118 __arm64_sys_write+0x24/0x38 invoke_syscall+0x78/0x108 el0_svc_common.constprop.0+0x48/0xf0 do_el0_svc+0x24/0x38 el0_svc+0x3c/0x148 el0t_64_sync_handler+0x100/0x130 el0t_64_sync+0x190/0x198 [Thread 2] INFO: task stress:1825 blocked for more than 30 seconds. Tainted: G OE 6.10.0-rc7+ #6 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:stress state:D stack:0 pid:1825 tgid:1825 ppid:1822 flags:0x0000000c Call trace: __switch_to+0xec/0x138 __schedule+0x43c/0xcb0 schedule+0x54/0x198 io_schedule+0x44/0x70 folio_wait_bit_common+0x184/0x3f8 <-- folio = 0xfffffdffc6b503c0 (mapping == 0xffff0000d184f1d8 index == 1711) (*) __folio_lock+0x24/0x38 z_erofs_runqueue+0x384/0x9c0 [erofs] z_erofs_readahead+0x21c/0x350 [erofs] <-- folio mapping 0xffff00036d69cb18 range from [992, 1024] (**) read_pages+0x74/0x328 page_cache_ra_order+0x26c/0x348 ondemand_readahead+0x1c0/0x3a0 page_cache_sync_ra+0x9c/0xc0 filemap_get_pages+0xc4/0x708 filemap_read+0x104/0x3a8 generic_file_read_iter+0x4c/0x150 vfs_read+0x27c/0x330 ksys_pread64+0x84/0xd0 __arm64_sys_pread64+0x28/0x40 invoke_syscall+0x78/0x108 el0_svc_common.constprop.0+0x48/0xf0 do_el0_svc+0x24/0x38 el0_svc+0x3c/0x148 el0t_64_sync_handler+0x100/0x130 el0t_64_sync+0x190/0x198 Link: https://lkml.kernel.org/r/20240729021306.398286-1-hsiangkao@linux.alibaba.com Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-15alloc_tag: mark pages reserved during CMA activation as not taggedSuren Baghdasaryan
During CMA activation, pages in CMA area are prepared and then freed without being allocated. This triggers warnings when memory allocation debug config (CONFIG_MEM_ALLOC_PROFILING_DEBUG) is enabled. Fix this by marking these pages not tagged before freeing them. Link: https://lkml.kernel.org/r/20240813150758.855881-2-surenb@google.com Fixes: d224eb0287fb ("codetag: debug: mark codetags for reserved pages as empty") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [6.10] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-15alloc_tag: introduce clear_page_tag_ref() helper functionSuren Baghdasaryan
In several cases we are freeing pages which were not allocated using common page allocators. For such cases, in order to keep allocation accounting correct, we should clear the page tag to indicate that the page being freed is expected to not have a valid allocation tag. Introduce clear_page_tag_ref() helper function to be used for this. Link: https://lkml.kernel.org/r/20240813150758.855881-1-surenb@google.com Fixes: d224eb0287fb ("codetag: debug: mark codetags for reserved pages as empty") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [6.10] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-15mm: fix endless reclaim on machines with unaccepted memoryKirill A. Shutemov
Unaccepted memory is considered unusable free memory, which is not counted as free on the zone watermark check. This causes get_page_from_freelist() to accept more memory to hit the high watermark, but it creates problems in the reclaim path. The reclaim path encounters a failed zone watermark check and attempts to reclaim memory. This is usually successful, but if there is little or no reclaimable memory, it can result in endless reclaim with little to no progress. This can occur early in the boot process, just after start of the init process when the only reclaimable memory is the page cache of the init executable and its libraries. Make unaccepted memory free from watermark check point of view. This way unaccepted memory will never be the trigger of memory reclaim. Accept more memory in the get_page_from_freelist() if needed. Link: https://lkml.kernel.org/r/20240809114854.3745464-2-kirill.shutemov@linux.intel.com Fixes: dcdfdd40fa82 ("mm: Add support for unaccepted memory") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Jianxiong Gao <jxgao@google.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Jianxiong Gao <jxgao@google.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [6.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-15mm/numa: no task_numa_fault() call if PMD is changedZi Yan
When handling a numa page fault, task_numa_fault() should be called by a process that restores the page table of the faulted folio to avoid duplicated stats counting. Commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault handling") restructured do_huge_pmd_numa_page() and did not avoid task_numa_fault() call in the second page table check after a numa migration failure. Fix it by making all !pmd_same() return immediately. This issue can cause task_numa_fault() being called more than necessary and lead to unexpected numa balancing results (It is hard to tell whether the issue will cause positive or negative performance impact due to duplicated numa fault counting). Link: https://lkml.kernel.org/r/20240809145906.1513458-3-ziy@nvidia.com Fixes: c5b5a3dd2c1f ("mm: thp: refactor NUMA fault handling") Reported-by: "Huang, Ying" <ying.huang@intel.com> Closes: https://lore.kernel.org/linux-mm/87zfqfw0yw.fsf@yhuang6-desk2.ccr.corp.intel.com/ Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-15mm/numa: no task_numa_fault() call if PTE is changedZi Yan
When handling a numa page fault, task_numa_fault() should be called by a process that restores the page table of the faulted folio to avoid duplicated stats counting. Commit b99a342d4f11 ("NUMA balancing: reduce TLB flush via delaying mapping on hint page fault") restructured do_numa_page() and did not avoid task_numa_fault() call in the second page table check after a numa migration failure. Fix it by making all !pte_same() return immediately. This issue can cause task_numa_fault() being called more than necessary and lead to unexpected numa balancing results (It is hard to tell whether the issue will cause positive or negative performance impact due to duplicated numa fault counting). Link: https://lkml.kernel.org/r/20240809145906.1513458-2-ziy@nvidia.com Fixes: b99a342d4f11 ("NUMA balancing: reduce TLB flush via delaying mapping on hint page fault") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: "Huang, Ying" <ying.huang@intel.com> Closes: https://lore.kernel.org/linux-mm/87zfqfw0yw.fsf@yhuang6-desk2.ccr.corp.intel.com/ Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-15mm/vmalloc: fix page mapping if vm_area_alloc_pages() with high order ↵Hailong Liu
fallback to order 0 The __vmap_pages_range_noflush() assumes its argument pages** contains pages with the same page shift. However, since commit e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations"), if gfp_flags includes __GFP_NOFAIL with high order in vm_area_alloc_pages() and page allocation failed for high order, the pages** may contain two different page shifts (high order and order-0). This could lead __vmap_pages_range_noflush() to perform incorrect mappings, potentially resulting in memory corruption. Users might encounter this as follows (vmap_allow_huge = true, 2M is for PMD_SIZE): kvmalloc(2M, __GFP_NOFAIL|GFP_X) __vmalloc_node_range_noprof(vm_flags=VM_ALLOW_HUGE_VMAP) vm_area_alloc_pages(order=9) ---> order-9 allocation failed and fallback to order-0 vmap_pages_range() vmap_pages_range_noflush() __vmap_pages_range_noflush(page_shift = 21) ----> wrong mapping happens We can remove the fallback code because if a high-order allocation fails, __vmalloc_node_range_noprof() will retry with order-0. Therefore, it is unnecessary to fallback to order-0 here. Therefore, fix this by removing the fallback code. Link: https://lkml.kernel.org/r/20240808122019.3361-1-hailong.liu@oppo.com Fixes: e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations") Signed-off-by: Hailong Liu <hailong.liu@oppo.com> Reported-by: Tangquan Zheng <zhengtangquan@oppo.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Barry Song <baohua@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-15mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpuWaiman Long
The memory_failure_cpu structure is a per-cpu structure. Access to its content requires the use of get_cpu_var() to lock in the current CPU and disable preemption. The use of a regular spinlock_t for locking purpose is fine for a non-RT kernel. Since the integration of RT spinlock support into the v5.15 kernel, a spinlock_t in a RT kernel becomes a sleeping lock and taking a sleeping lock in a preemption disabled context is illegal resulting in the following kind of warning. [12135.732244] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 [12135.732248] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 270076, name: kworker/0:0 [12135.732252] preempt_count: 1, expected: 0 [12135.732255] RCU nest depth: 2, expected: 2 : [12135.732420] Hardware name: Dell Inc. PowerEdge R640/0HG0J8, BIOS 2.10.2 02/24/2021 [12135.732423] Workqueue: kacpi_notify acpi_os_execute_deferred [12135.732433] Call Trace: [12135.732436] <TASK> [12135.732450] dump_stack_lvl+0x57/0x81 [12135.732461] __might_resched.cold+0xf4/0x12f [12135.732479] rt_spin_lock+0x4c/0x100 [12135.732491] memory_failure_queue+0x40/0xe0 [12135.732503] ghes_do_memory_failure+0x53/0x390 [12135.732516] ghes_do_proc.constprop.0+0x229/0x3e0 [12135.732575] ghes_proc+0xf9/0x1a0 [12135.732591] ghes_notify_hed+0x6a/0x150 [12135.732602] notifier_call_chain+0x43/0xb0 [12135.732626] blocking_notifier_call_chain+0x43/0x60 [12135.732637] acpi_ev_notify_dispatch+0x47/0x70 [12135.732648] acpi_os_execute_deferred+0x13/0x20 [12135.732654] process_one_work+0x41f/0x500 [12135.732695] worker_thread+0x192/0x360 [12135.732715] kthread+0x111/0x140 [12135.732733] ret_from_fork+0x29/0x50 [12135.732779] </TASK> Fix it by using a raw_spinlock_t for locking instead. Also move the pr_err() out of the lock critical section and after put_cpu_ptr() to avoid indeterminate latency and the possibility of sleep with this call. [longman@redhat.com: don't hold percpu ref across pr_err(), per Miaohe] Link: https://lkml.kernel.org/r/20240807181130.1122660-1-longman@redhat.com Link: https://lkml.kernel.org/r/20240806164107.1044956-1-longman@redhat.com Fixes: 0f383b6dc96e ("locking/spinlock: Provide RT variant") Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-15mm: don't account memmap per-nodePasha Tatashin
Fix invalid access to pgdat during hot-remove operation: ndctl users reported a GPF when trying to destroy a namespace: $ ndctl destroy-namespace all -r all -f Segmentation fault dmesg: Oops: general protection fault, probably for non-canonical address 0xdffffc0000005650: 0000 [#1] PREEMPT SMP KASAN PTI KASAN: probably user-memory-access in range [0x000000000002b280-0x000000000002b287] CPU: 26 UID: 0 PID: 1868 Comm: ndctl Not tainted 6.11.0-rc1 #1 Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS 2.20.1 09/13/2023 RIP: 0010:mod_node_page_state+0x2a/0x110 cxl-test users report a GPF when trying to unload the test module: $ modrpobe -r cxl-test dmesg BUG: unable to handle page fault for address: 0000000000004200 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] PREEMPT SMP PTI CPU: 0 UID: 0 PID: 1076 Comm: modprobe Tainted: G O N 6.11.0-rc1 #197 Tainted: [O]=OOT_MODULE, [N]=TEST Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/15 RIP: 0010:mod_node_page_state+0x6/0x90 Currently, when memory is hot-plugged or hot-removed the accounting is done based on the assumption that memmap is allocated from the same node as the hot-plugged/hot-removed memory, which is not always the case. In addition, there are challenges with keeping the node id of the memory that is being remove to the time when memmap accounting is actually performed: since this is done after remove_pfn_range_from_zone(), and also after remove_memory_block_devices(). Meaning that we cannot use pgdat nor walking though memblocks to get the nid. Given all of that, account the memmap overhead system wide instead. For this we are going to be using global atomic counters, but given that memmap size is rarely modified, and normally is only modified either during early boot when there is only one CPU, or under a hotplug global mutex lock, therefore there is no need for per-cpu optimizations. Also, while we are here rename nr_memmap to nr_memmap_pages, and nr_memmap_boot to nr_memmap_boot_pages to be self explanatory that the units are in page count. [pasha.tatashin@soleen.com: address a few nits from David Hildenbrand] Link: https://lkml.kernel.org/r/20240809191020.1142142-4-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20240809191020.1142142-4-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20240808213437.682006-4-pasha.tatashin@soleen.com Fixes: 15995a352474 ("mm: report per-page metadata information") Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/linux-cxl/CAHj4cs9Ax1=CoJkgBGP_+sNu6-6=6v=_L-ZBZY0bVLD3wUWZQg@mail.gmail.com Reported-by: Alison Schofield <alison.schofield@intel.com> Closes: https://lore.kernel.org/linux-mm/Zq0tPd2h6alFz8XF@aschofie-mobl2/#t Tested-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Fan Ni <fan.ni@samsung.com> Cc: Joel Granados <j.granados@samsung.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zhijian <lizhijian@fujitsu.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-15mm: add system wide stats items categoryPasha Tatashin
/proc/vmstat contains events and stats, events can only grow, but stats can grow and shrink. vmstat has the following: ------------------------- NR_VM_ZONE_STAT_ITEMS: per-zone stats NR_VM_NUMA_EVENT_ITEMS: per-numa events NR_VM_NODE_STAT_ITEMS: per-numa stats NR_VM_WRITEBACK_STAT_ITEMS: system-wide background-writeback and dirty-throttling tresholds. NR_VM_EVENT_ITEMS: system-wide events ------------------------- Rename NR_VM_WRITEBACK_STAT_ITEMS to NR_VM_STAT_ITEMS, to track the system-wide stats, we are going to add per-page metadata stats to this category in the next patch. Also delete unused writeback_stat_name(). Link: https://lkml.kernel.org/r/20240809191020.1142142-2-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20240808213437.682006-3-pasha.tatashin@soleen.com Fixes: 15995a352474 ("mm: report per-page metadata information") Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Suggested-by: Yosry Ahmed <yosryahmed@google.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Joel Granados <j.granados@samsung.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zhijian <lizhijian@fujitsu.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yi Zhang <yi.zhang@redhat.com> Cc: Fan Ni <fan.ni@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-15mm: don't account memmap on failurePasha Tatashin
Patch series "Fixes for memmap accounting", v4. Memmap accounting provides us with observability of how much memory is used for per-page metadata: i.e. "struct page"'s and "struct page_ext". It also provides with information of how much was allocated using boot allocator (i.e. not part of MemTotal), and how much was allocated using buddy allocated (i.e. part of MemTotal). This small series fixes a few problems that were discovered with the original patch. This patch (of 3): When we fail to allocate the mmemmap in alloc_vmemmap_page_list(), do not account any already-allocated pages: we're going to free all them before we return from the function. Link: https://lkml.kernel.org/r/20240809191020.1142142-1-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20240808213437.682006-1-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20240808213437.682006-2-pasha.tatashin@soleen.com Fixes: 15995a352474 ("mm: report per-page metadata information") Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Fan Ni <fan.ni@samsung.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Rientjes <rientjes@google.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Joel Granados <j.granados@samsung.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zhijian <lizhijian@fujitsu.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-15mseal: fix is_madv_discard()Pedro Falcato
is_madv_discard did its check wrong. MADV_ flags are not bitwise, they're normal sequential numbers. So, for instance: behavior & (/* ... */ | MADV_REMOVE) tagged both MADV_REMOVE and MADV_RANDOM (bit 0 set) as discard operations. As a result the kernel could erroneously block certain madvises (e.g MADV_RANDOM or MADV_HUGEPAGE) on sealed VMAs due to them sharing bits with blocked MADV operations (e.g REMOVE or WIPEONFORK). This is obviously incorrect, so use a switch statement instead. Link: https://lkml.kernel.org/r/20240807173336.2523757-1-pedro.falcato@gmail.com Link: https://lkml.kernel.org/r/20240807173336.2523757-2-pedro.falcato@gmail.com Fixes: 8be7258aad44 ("mseal: add mseal syscall") Signed-off-by: Pedro Falcato <pedro.falcato@gmail.com> Tested-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Jeff Xu <jeffxu@chromium.org> Cc: Kees Cook <kees@kernel.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-14binfmt_elf: mseal address zeroJeff Xu
In load_elf_binary as part of the execve(), when the current task’s personality has MMAP_PAGE_ZERO set, the kernel allocates one page at address 0. According to the comment: /* Why this, you ask??? Well SVr4 maps page 0 as read-only, and some applications "depend" upon this behavior. Since we do not have the power to recompile these, we emulate the SVr4 behavior. Sigh. */ At one point, Linus suggested removing this [1]. Code search in debian didn't see much use of MMAP_PAGE_ZERO [2], it exists in util and test (rr). Sealing this is probably safe, the comment doesn't say the app ever wanting to change the mapping to rwx. Sealing also ensures that never happens. If there is a complaint, we can make this configurable. Link: https://lore.kernel.org/lkml/CAHk-=whVa=nm_GW=NVfPHqcxDbWt4JjjK1YWb0cLjO4ZSGyiDA@mail.gmail.com/ [1] Link: https://codesearch.debian.net/search?q=MMAP_PAGE_ZERO&literal=1&perpkg=1&page=1 [2] Signed-off-by: Jeff Xu <jeffxu@chromium.org> Link: https://lore.kernel.org/r/20240806214931.2198172-2-jeffxu@google.com Signed-off-by: Kees Cook <kees@kernel.org>
2024-08-12introduce fd_file(), convert all accessors to it.Al Viro
For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to new layout; accesses to struct fd::file are too many for that. This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()). NOTE: mass conversion to fd_empty(), tempting as it might be, is a bad idea; better do that piecewise in commit that convert from fdget...() to CLASS(...). [conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c caught by git; fs/stat.c one got caught by git grep] [fs/xattr.c conflict] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-08-12memcg_write_event_control(): fix a user-triggerable oopsAl Viro
we are *not* guaranteed that anything past the terminating NUL is mapped (let alone initialized with anything sane). Fixes: 0dea116876ee ("cgroup: implement eventfd-based generic API for notifications") Cc: stable@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-08-11mm/memblock: introduce a new helper memblock_estimated_nr_free_pages()Wei Yang
During bootup, system may need the number of free pages in the whole system to do some calculation before all pages are freed to buddy system. Usually this number is get from totalram_pages(). Since we plan to move the free pages accounting in __free_pages_core(), this value may not represent total free pages at the early stage, especially when CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled. Instead of using raw memblock api, let's introduce a new helper for user to get the estimated number of free pages from memblock point of view. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> CC: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20240808001415.6298-1-richard.weiyang@gmail.com Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2024-08-08Merge tag 'mm-hotfixes-stable-2024-08-07-18-32' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Nine hotfixes. Five are cc:stable, the others either pertain to post-6.10 material or aren't considered necessary for earlier kernels. Five are MM and four are non-MM. No identifiable theme here - please see the individual changelogs" * tag 'mm-hotfixes-stable-2024-08-07-18-32' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: padata: Fix possible divide-by-0 panic in padata_mt_helper() mailmap: update entry for David Heidelberg memcg: protect concurrent access to mem_cgroup_idr mm: shmem: fix incorrect aligned index when checking conflicts mm: shmem: avoid allocating huge pages larger than MAX_PAGECACHE_ORDER for shmem mm: list_lru: fix UAF for memory cgroup kcov: properly check for softirq context MAINTAINERS: Update LTP members and web selftests: mm: add s390 to ARCH check
2024-08-07memcg: protect concurrent access to mem_cgroup_idrShakeel Butt
Commit 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs") decoupled the memcg IDs from the CSS ID space to fix the cgroup creation failures. It introduced IDR to maintain the memcg ID space. The IDR depends on external synchronization mechanisms for modifications. For the mem_cgroup_idr, the idr_alloc() and idr_replace() happen within css callback and thus are protected through cgroup_mutex from concurrent modifications. However idr_remove() for mem_cgroup_idr was not protected against concurrency and can be run concurrently for different memcgs when they hit their refcnt to zero. Fix that. We have been seeing list_lru based kernel crashes at a low frequency in our fleet for a long time. These crashes were in different part of list_lru code including list_lru_add(), list_lru_del() and reparenting code. Upon further inspection, it looked like for a given object (dentry and inode), the super_block's list_lru didn't have list_lru_one for the memcg of that object. The initial suspicions were either the object is not allocated through kmem_cache_alloc_lru() or somehow memcg_list_lru_alloc() failed to allocate list_lru_one() for a memcg but returned success. No evidence were found for these cases. Looking more deeply, we started seeing situations where valid memcg's id is not present in mem_cgroup_idr and in some cases multiple valid memcgs have same id and mem_cgroup_idr is pointing to one of them. So, the most reasonable explanation is that these situations can happen due to race between multiple idr_remove() calls or race between idr_alloc()/idr_replace() and idr_remove(). These races are causing multiple memcgs to acquire the same ID and then offlining of one of them would cleanup list_lrus on the system for all of them. Later access from other memcgs to the list_lru cause crashes due to missing list_lru_one. Link: https://lkml.kernel.org/r/20240802235822.1830976-1-shakeel.butt@linux.dev Fixes: 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-07mm: shmem: fix incorrect aligned index when checking conflictsBaolin Wang
In the shmem_suitable_orders() function, xa_find() is used to check for conflicts in the pagecache to select suitable huge orders. However, when checking each huge order in every loop, the aligned index is calculated from the previous iteration, which may cause suitable huge orders to be missed. We should use the original index each time in the loop to calculate a new aligned index for checking conflicts to avoid this issue. Link: https://lkml.kernel.org/r/07433b0f16a152bffb8cee34934a5c040e8e2ad6.1722404078.git.baolin.wang@linux.alibaba.com Fixes: e7a2ab7b3bb5 ("mm: shmem: add mTHP support for anonymous shmem") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Barry Song <baohua@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-07mm: shmem: avoid allocating huge pages larger than MAX_PAGECACHE_ORDER for shmemBaolin Wang
Similar to commit d659b715e94ac ("mm/huge_memory: avoid PMD-size page cache if needed"), ARM64 can support 512MB PMD-sized THP when the base page size is 64KB, which is larger than the maximum supported page cache size MAX_PAGECACHE_ORDER. This is not expected. To fix this issue, use THP_ORDERS_ALL_FILE_DEFAULT for shmem to filter allowable huge orders. [baolin.wang@linux.alibaba.com: remove comment, per Barry] Link: https://lkml.kernel.org/r/c55d7ef7-78aa-4ed6-b897-c3e03a3f3ab7@linux.alibaba.com [wangkefeng.wang@huawei.com: remove local `orders'] Link: https://lkml.kernel.org/r/87769ae8-b6c6-4454-925d-1864364af9c8@huawei.com Link: https://lkml.kernel.org/r/117121665254442c3c7f585248296495e5e2b45c.1722404078.git.baolin.wang@linux.alibaba.com Fixes: e7a2ab7b3bb5 ("mm: shmem: add mTHP support for anonymous shmem") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Barry Song <baohua@kernel.org> Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-07mm: list_lru: fix UAF for memory cgroupMuchun Song
The mem_cgroup_from_slab_obj() is supposed to be called under rcu lock or cgroup_mutex or others which could prevent returned memcg from being freed. Fix it by adding missing rcu read lock. Found by code inspection. [songmuchun@bytedance.com: only grab rcu lock when necessary, per Vlastimil] Link: https://lkml.kernel.org/r/20240801024603.1865-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20240718083607.42068-1-songmuchun@bytedance.com Fixes: 0a97c01cd20b ("list_lru: allow explicit memcg and NUMA node selection") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-08-07sched/rt: Rename realtime_{prio, task}() to rt_or_dl_{prio, task}()Qais Yousef
Some find the name realtime overloaded. Use rt_or_dl() as an alternative, hopefully better, name. Suggested-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20240610192018.1567075-4-qyousef@layalina.io
2024-08-07sched/rt: Clean up usage of rt_task()Qais Yousef
rt_task() checks if a task has RT priority. But depends on your dictionary, this could mean it belongs to RT class, or is a 'realtime' task, which includes RT and DL classes. Since this has caused some confusion already on discussion [1], it seemed a clean up is due. I define the usage of rt_task() to be tasks that belong to RT class. Make sure that it returns true only for RT class and audit the users and replace the ones required the old behavior with the new realtime_task() which returns true for RT and DL classes. Introduce similar realtime_prio() to create similar distinction to rt_prio() and update the users that required the old behavior to use the new function. Move MAX_DL_PRIO to prio.h so it can be used in the new definitions. Document the functions to make it more obvious what is the difference between them. PI-boosted tasks is a factor that must be taken into account when choosing which function to use. Rename task_is_realtime() to realtime_task_policy() as the old name is confusing against the new realtime_task(). No functional changes were intended. [1] https://lore.kernel.org/lkml/20240506100509.GL40213@noisy.programming.kicks-ass.net/ Signed-off-by: Qais Yousef <qyousef@layalina.io> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: "Steven Rostedt (Google)" <rostedt@goodmis.org> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20240610192018.1567075-2-qyousef@layalina.io
2024-08-07fs: Convert aops->write_begin to take a folioMatthew Wilcox (Oracle)
Convert all callers from working on a page to working on one page of a folio (support for working on an entire folio can come later). Removes a lot of folio->page->folio conversions. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-07fs: Convert aops->write_end to take a folioMatthew Wilcox (Oracle)
Most callers have a folio, and most implementations operate on a folio, so remove the conversion from folio->page->folio to fit through this interface. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-05Merge tag 'slab-fixes-for-6.11-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fix from Vlastimil Babka: "Since v6.8 we've had a subtle breakage in SLUB with KFENCE enabled, that can cause a crash. It hasn't been found earlier due to quite specific conditions necessary (OOM during kmem_cache_alloc_bulk())" * tag 'slab-fixes-for-6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm, slub: do not call do_slab_free for kfence object
2024-07-30mm, slub: do not call do_slab_free for kfence objectRik van Riel
In 782f8906f805 the freeing of kfence objects was moved from deep inside do_slab_free to the wrapper functions outside. This is a nice change, but unfortunately it missed one spot in __kmem_cache_free_bulk. This results in a crash like this: BUG skbuff_head_cache (Tainted: G S B E ): Padding overwritten. 0xffff88907fea0f00-0xffff88907fea0fff @offset=3840 slab_err (mm/slub.c:1129) free_to_partial_list (mm/slub.c:? mm/slub.c:4036) slab_pad_check (mm/slub.c:864 mm/slub.c:1290) check_slab (mm/slub.c:?) free_to_partial_list (mm/slub.c:3171 mm/slub.c:4036) kmem_cache_alloc_bulk (mm/slub.c:? mm/slub.c:4495 mm/slub.c:4586 mm/slub.c:4635) napi_build_skb (net/core/skbuff.c:348 net/core/skbuff.c:527 net/core/skbuff.c:549) All the other callers to do_slab_free appear to be ok. Add a kfence_free check in __kmem_cache_free_bulk to avoid the crash. Reported-by: Chris Mason <clm@meta.com> Fixes: 782f8906f805 ("mm/slub: free KFENCE objects in slab_free_hook()") Cc: stable@kernel.org Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2024-07-28minmax: make generic MIN() and MAX() macros available everywhereLinus Torvalds
This just standardizes the use of MIN() and MAX() macros, with the very traditional semantics. The goal is to use these for C constant expressions and for top-level / static initializers, and so be able to simplify the min()/max() macros. These macro names were used by various kernel code - they are very traditional, after all - and all such users have been fixed up, with a few different approaches: - trivial duplicated macro definitions have been removed Note that 'trivial' here means that it's obviously kernel code that already included all the major kernel headers, and thus gets the new generic MIN/MAX macros automatically. - non-trivial duplicated macro definitions are guarded with #ifndef This is the "yes, they define their own versions, but no, the include situation is not entirely obvious, and maybe they don't get the generic version automatically" case. - strange use case #1 A couple of drivers decided that the way they want to describe their versioning is with #define MAJ 1 #define MIN 2 #define DRV_VERSION __stringify(MAJ) "." __stringify(MIN) which adds zero value and I just did my Alexander the Great impersonation, and rewrote that pointless Gordian knot as #define DRV_VERSION "1.2" instead. - strange use case #2 A couple of drivers thought that it's a good idea to have a random 'MIN' or 'MAX' define for a value or index into a table, rather than the traditional macro that takes arguments. These values were re-written as C enum's instead. The new function-line macros only expand when followed by an open parenthesis, and thus don't clash with enum use. Happily, there weren't really all that many of these cases, and a lot of users already had the pattern of using '#ifndef' guarding (or in one case just using '#undef MIN') before defining their own private version that does the same thing. I left such cases alone. Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>