| Age | Commit message | Author |
|
On ARM32 with HIGHMEM/HIGHPTE, break_ksm_pmd_entry() triggers a BUG during
KSM unmerging because pte_unmap_unlock() is passed a pointer that may be
beyond the mapped PTE page.
The issue occurs when the PTE iteration loop completes without finding a
KSM page. After the loop, 'ptep' has been incremented past the last PTE
entry. On ARM32 LPAE with 512 PTEs per page (512 * 8 = 4096 bytes), this
means ptep points to the next page, outside the kmap'd region.
When pte_unmap_unlock(ptep, ptl) calls kunmap_local(ptep), it unmaps the
wrong page address, leaving the original kmap slot still mapped. The next
kmap_local then finds this slot unexpectedly occupied:
WARNING: mm/highmem.c:622 kunmap_local_indexed (address mismatch)
kernel BUG at mm/highmem.c:564 __kmap_local_pfn_prot (slot not empty)
Fix this by passing start_ptep to pte_unmap_unlock(), which always points
within the originally mapped PTE page.
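The pointer arithmetic behind this can be shown in a small user-space sketch. It is not the kernel code: toy_pte_t, in_mapped_page() and pointer_to_unmap() are hypothetical names, and a plain array stands in for the kmap'd PTE page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PTRS_PER_PTE 512   /* ARM32 LPAE: 512 entries * 8 bytes = 4096 bytes */

typedef uint64_t toy_pte_t;   /* hypothetical stand-in for pte_t */

/* True if p lies inside the one page that was mapped for the walk. */
bool in_mapped_page(const toy_pte_t *page, const toy_pte_t *p)
{
    return p >= page && p < page + PTRS_PER_PTE;
}

/*
 * Scan the table for a nonzero entry (playing the role of a KSM page) and
 * return the pointer that would be handed to the unmap call afterwards.
 */
const toy_pte_t *pointer_to_unmap(const toy_pte_t *page, bool use_start_ptep)
{
    const toy_pte_t *start_ptep = page;
    const toy_pte_t *ptep = start_ptep;
    size_t i;

    for (i = 0; i < PTRS_PER_PTE; i++, ptep++) {
        if (*ptep != 0)
            break;              /* hit: ptep still points into the page */
    }
    /* If nothing was found, ptep is now one past the last entry. */
    assert(ptep <= page + PTRS_PER_PTE);

    return use_start_ptep ? start_ptep : ptep;
}
```

With an all-zero table the iterator ends up at page + PTRS_PER_PTE, which is outside the mapped page, while the saved start pointer is always inside it.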
Reproducer: Run LTP ksm03 test on ARM32 with HIGHMEM enabled. The test
triggers KSM merging followed by unmerging (writing 0 then 2 to
/sys/kernel/mm/ksm/run), which exercises break_ksm_pmd_entry().
Link: https://lkml.kernel.org/r/20251220202926.318366-1-sashal@kernel.org
Fixes: 5d4939fc2258 ("ksm: perform a range-walk in break_ksm")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Assisted-by: claude-opus-4-5-20251101
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Cc: xu xin <xu.xin16@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The page_owner_stack_fops->open() callback invokes seq_open_private(),
therefore its corresponding ->release() callback must call
seq_release_private(). Otherwise it will cause a memory leak of struct
stack_print_ctx.
Link: https://lkml.kernel.org/r/20251219074232.136482-1-ranxiaokai627@163.com
Fixes: 765973a09803 ("mm,page_owner: display all stacks and their count")
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Marco Elver <elver@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This patch addresses a warning that I discovered while working on famfs,
which is an fs-dax file system that virtually always does PMD faults (next
famfs patch series coming after the holidays).
However, XFS also does PMD faults in fs-dax mode, and it also triggers the
warning. It takes some effort to get XFS to do a PMD fault, but
instructions to reproduce it are below.
The VM_WARN_ON_ONCE(folio_test_large(folio)) check in
free_zone_device_folio() incorrectly triggers for MEMORY_DEVICE_FS_DAX
when PMD (2MB) mappings are used.
FS-DAX legitimately creates large file-backed folios when handling PMD
faults. This is a core feature of FS-DAX that provides significant
performance benefits by mapping 2MB regions directly to persistent memory.
When these mappings are unmapped, the large folios are freed through
free_zone_device_folio(), which triggers the spurious warning.
The warning was introduced by the commit that added support for large zone
device private folios. However, that commit did not account for FS-DAX
file-backed folios, which have always supported large (PMD-sized)
mappings.
The check distinguishes between anonymous folios (which clear
AnonExclusive flags for each sub-page) and file-backed folios. For
file-backed folios, it assumes large folios are unexpected - but this
assumption is incorrect for FS-DAX.
The fix is to exempt MEMORY_DEVICE_FS_DAX from the large folio warning,
allowing FS-DAX to continue using PMD mappings without triggering false
warnings.
Link: https://lkml.kernel.org/r/20251219123717.39330-1-john@groves.net
Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios")
Signed-off-by: John Groves <john@groves.net>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Gregory Price <gourry@gourry.net>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 2783088ef24e ("mm/page_alloc: prevent reporting pcp->batch = 0")
moved the error handling (0-handling) of zone_batchsize from its callers
to inside the function. However, the commit left out the error handling
for the NOMMU case, leading to deadlocks on NOMMU systems.
For NOMMU systems, return 1 instead of 0 for zone_batchsize, which
restores the previous deadlock-free behavior.
There is no functional difference expected with this patch before commit
2783088ef24e, other than the pr_debug in zone_pcp_init now printing out 1
instead of 0 for zones in NOMMU systems. Not only is this a pr_debug, the
difference is purely semantic anyways.
Link: https://lkml.kernel.org/r/20251218083200.2435789-1-joshua.hahnjy@gmail.com
Fixes: 2783088ef24e ("mm/page_alloc: prevent reporting pcp->batch = 0")
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Reported-by: Daniel Palmer <daniel@thingy.jp>
Closes: https://lore.kernel.org/linux-mm/CAFr9PX=_HaM3_xPtTiBn5Gw5-0xcRpawpJ02NStfdr0khF2k7g@mail.gmail.com/
Reported-by: Guenter Roeck <linux@roeck-us.net>
Closes: https://lore.kernel.org/all/42143500-c380-41fe-815c-696c17241506@roeck-us.net/
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Daniel Palmer <daniel@thingy.jp>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: SeongJae Park <sj@kernel.org>
Tested-by: Hajime Tazaki <thehajime@gmail.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The commit bc8e51c05ad5 ("mm: memcg: dump memcg protection info on oom or
alloc failures") added functionality to dump memcg protections on OOM or
allocation failures. It uses K() macro to dump the information and passes
bytes to the macro. However, the macro takes a number of pages instead of
bytes. It is defined as:
#define K(x) ((x) << (PAGE_SHIFT-10))
Let's fix this.
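A worked example of the unit mismatch, as a user-space sketch. PAGE_SHIFT = 12 (4 KiB pages) is an assumption, the helper names are hypothetical, and the actual patch may do the conversion differently.

```c
#include <assert.h>

#define PAGE_SHIFT 12                     /* assumption: 4 KiB pages */
#define K(x) ((x) << (PAGE_SHIFT - 10))   /* pages -> KiB, as quoted above */

/* Intended use of K(): convert a page count to KiB. */
unsigned long pages_to_kib(unsigned long nr_pages)
{
    return K(nr_pages);
}

/* A value that is already in bytes only needs a shift by 10. */
unsigned long bytes_to_kib(unsigned long bytes)
{
    return bytes >> 10;
}

/* The mistake described above: feeding a byte count to K(). */
unsigned long bytes_through_K(unsigned long bytes)
{
    return K(bytes);
}
```

For 1 MiB (256 pages of 4 KiB), the first two helpers both give 1024 KiB, while the third gives 4194304, which is too large by a factor of PAGE_SIZE.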
Link: https://lkml.kernel.org/r/20251216212054.484079-1-shakeel.butt@linux.dev
Fixes: bc8e51c05ad5 ("mm: memcg: dump memcg protection info on oom or alloc failures")
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reported-by: Chris Mason <clm@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The memory failure handling implementation for the PFNMAP memory with no
struct pages is faulty. The VA of the mapping is determined based on the
PFN. It should instead be based on the file mapping offset.
At the occurrence of poison, the memory_failure_pfn is triggered on the
poisoned PFN. Introduce a callback function that allows mm to translate
the PFN to the corresponding file page offset. The kernel module using
the registration API must implement the callback function and provide the
translation. The translated value is then used to determine the VA
information and to send SIGBUS to the usermode process mapped to the
poisoned PFN.
The callback is also useful for the driver to be notified of the poisoned
PFN, which may then track it.
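The translation chain can be sketched in user space as follows. All types and names here (toy_vma, toy_region, toy_pfn_to_pgoff, toy_vma_address) are hypothetical, and the linear PFN layout is an assumption; a real driver callback may translate in any way it needs to.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumption: 4 KiB pages */

struct toy_vma {
    uint64_t vm_start;   /* first user VA of the mapping */
    uint64_t vm_end;     /* one past the last user VA */
    uint64_t vm_pgoff;   /* file offset of vm_start, in pages */
};

struct toy_region {
    uint64_t first_pfn;     /* first PFN backing the device region */
    uint64_t first_pgoff;   /* file page offset where that PFN is exposed */
};

/* Hypothetical driver callback: poisoned PFN -> file page offset. */
uint64_t toy_pfn_to_pgoff(const struct toy_region *r, uint64_t pfn)
{
    return r->first_pgoff + (pfn - r->first_pfn);
}

/* File page offset -> user VA inside this VMA, or 0 if not covered. */
uint64_t toy_vma_address(const struct toy_vma *vma, uint64_t pgoff)
{
    uint64_t va;

    if (pgoff < vma->vm_pgoff)
        return 0;
    va = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
    return va < vma->vm_end ? va : 0;
}
```

The VA depends on where the file offset falls relative to vm_pgoff, not on the PFN value itself, which is why the PFN cannot be used directly.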
Link: https://lkml.kernel.org/r/20251211070603.338701-2-ankita@nvidia.com
Fixes: 2ec41967189c ("mm: handle poisoning of pfn without struct pages")
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Matthew R. Ochs <mochs@nvidia.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Neo Jia <cjia@nvidia.com>
Cc: Vikram Sethi <vsethi@nvidia.com>
Cc: Yishai Hadas <yishaih@nvidia.com>
Cc: Zhi Wang <zhiw@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
If the PTE page table lock is acquired by pte_offset_map_lock(), the lock
must be released via pte_unmap_unlock().
However, in damos_va_migrate_pmd_entry(), if damos_va_filter_out() returns
true, it immediately returns without releasing the lock.
This fixes the issue by not stopping page table traversal when
damos_va_filter_out() returns true and ensuring that the lock is released.
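The shape of the bug and of the fix, as a user-space sketch. The lock is a simple counter and every name is hypothetical; it only illustrates the control flow, not the DAMON code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_lock { int depth; };

void toy_lock_acquire(struct toy_lock *l) { l->depth++; }

void toy_lock_release(struct toy_lock *l)
{
    assert(l->depth > 0);
    l->depth--;
}

/* Hypothetical filter: negative entries are filtered out. */
bool toy_filter_out(int entry) { return entry < 0; }

/* Buggy shape: the early return leaves the lock held. */
int walk_buggy(struct toy_lock *l, const int *entries, size_t n)
{
    int handled = 0;

    toy_lock_acquire(l);
    for (size_t i = 0; i < n; i++) {
        if (toy_filter_out(entries[i]))
            return handled;          /* lock is still held here */
        handled++;
    }
    toy_lock_release(l);
    return handled;
}

/* Fixed shape: skip the entry, keep walking, unlock on the single exit. */
int walk_fixed(struct toy_lock *l, const int *entries, size_t n)
{
    int handled = 0;

    toy_lock_acquire(l);
    for (size_t i = 0; i < n; i++) {
        if (toy_filter_out(entries[i]))
            continue;
        handled++;
    }
    toy_lock_release(l);
    return handled;
}
```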
Link: https://lkml.kernel.org/r/20251209151034.77221-1-akinobu.mita@gmail.com
Fixes: 09efc56a3b1c ("mm/damon/vaddr: consistently use only pmd_entry for damos_migrate")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When a page is freed it coalesces with a buddy into a higher order page
while possible. When the buddy page migrate type differs, it is expected
to be updated to match the one of the page being freed.
However, only the first pageblock of the buddy page is updated, while the
rest of the pageblocks are left unchanged.
That causes warnings in later expand() and other code paths (like below),
since an inconsistency between migration type of the list containing the
page and the page-owned pageblocks migration types is introduced.
[ 308.986589] ------------[ cut here ]------------
[ 308.987227] page type is 0, passed migratetype is 1 (nr=256)
[ 308.987275] WARNING: CPU: 1 PID: 5224 at mm/page_alloc.c:812 expand+0x23c/0x270
[ 308.987293] Modules linked in: algif_hash(E) af_alg(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nf_tables(E) s390_trng(E) vfio_ccw(E) mdev(E) vfio_iommu_type1(E) vfio(E) sch_fq_codel(E) drm(E) i2c_core(E) drm_panel_orientation_quirks(E) loop(E) nfnetlink(E) vsock_loopback(E) vmw_vsock_virtio_transport_common(E) vsock(E) ctcm(E) fsm(E) diag288_wdt(E) watchdog(E) zfcp(E) scsi_transport_fc(E) ghash_s390(E) prng(E) aes_s390(E) des_generic(E) des_s390(E) libdes(E) sha3_512_s390(E) sha3_256_s390(E) sha_common(E) paes_s390(E) crypto_engine(E) pkey_cca(E) pkey_ep11(E) zcrypt(E) rng_core(E) pkey_pckmo(E) pkey(E) autofs4(E)
[ 308.987439] Unloaded tainted modules: hmac_s390(E):2
[ 308.987650] CPU: 1 UID: 0 PID: 5224 Comm: mempig_verify Kdump: loaded Tainted: G E 6.18.0-gcc-bpf-debug #431 PREEMPT
[ 308.987657] Tainted: [E]=UNSIGNED_MODULE
[ 308.987661] Hardware name: IBM 3906 M04 704 (z/VM 7.3.0)
[ 308.987666] Krnl PSW : 0404f00180000000 00000349976fa600 (expand+0x240/0x270)
[ 308.987676] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3
[ 308.987682] Krnl GPRS: 0000034980000004 0000000000000005 0000000000000030 000003499a0e6d88
[ 308.987688] 0000000000000005 0000034980000005 000002be803ac000 0000023efe6c8300
[ 308.987692] 0000000000000008 0000034998d57290 000002be00000100 0000023e00000008
[ 308.987696] 0000000000000000 0000000000000000 00000349976fa5fc 000002c99b1eb6f0
[ 308.987708] Krnl Code: 00000349976fa5f0: c020008a02f2 larl %r2,000003499883abd4
00000349976fa5f6: c0e5ffe3f4b5 brasl %r14,0000034997378f60
#00000349976fa5fc: af000000 mc 0,0
>00000349976fa600: a7f4ff4c brc 15,00000349976fa498
00000349976fa604: b9040026 lgr %r2,%r6
00000349976fa608: c0300088317f larl %r3,0000034998800906
00000349976fa60e: c0e5fffdb6e1 brasl %r14,00000349976b13d0
00000349976fa614: af000000 mc 0,0
[ 308.987734] Call Trace:
[ 308.987738] [<00000349976fa600>] expand+0x240/0x270
[ 308.987744] ([<00000349976fa5fc>] expand+0x23c/0x270)
[ 308.987749] [<00000349976ff95e>] rmqueue_bulk+0x71e/0x940
[ 308.987754] [<00000349976ffd7e>] __rmqueue_pcplist+0x1fe/0x2a0
[ 308.987759] [<0000034997700966>] rmqueue.isra.0+0xb46/0xf40
[ 308.987763] [<0000034997703ec8>] get_page_from_freelist+0x198/0x8d0
[ 308.987768] [<0000034997706fa8>] __alloc_frozen_pages_noprof+0x198/0x400
[ 308.987774] [<00000349977536f8>] alloc_pages_mpol+0xb8/0x220
[ 308.987781] [<0000034997753bf6>] folio_alloc_mpol_noprof+0x26/0xc0
[ 308.987786] [<0000034997753e4c>] vma_alloc_folio_noprof+0x6c/0xa0
[ 308.987791] [<0000034997775b22>] vma_alloc_anon_folio_pmd+0x42/0x240
[ 308.987799] [<000003499777bfea>] __do_huge_pmd_anonymous_page+0x3a/0x210
[ 308.987804] [<00000349976cb08e>] __handle_mm_fault+0x4de/0x500
[ 308.987809] [<00000349976cb14c>] handle_mm_fault+0x9c/0x3a0
[ 308.987813] [<000003499734d70e>] do_exception+0x1de/0x540
[ 308.987822] [<0000034998387390>] __do_pgm_check+0x130/0x220
[ 308.987830] [<000003499839a934>] pgm_check_handler+0x114/0x160
[ 308.987838] 3 locks held by mempig_verify/5224:
[ 308.987842] #0: 0000023ea44c1e08 (vm_lock){++++}-{0:0}, at: lock_vma_under_rcu+0xb2/0x2a0
[ 308.987859] #1: 0000023ee4d41b18 (&pcp->lock){+.+.}-{2:2}, at: rmqueue.isra.0+0xad6/0xf40
[ 308.987871] #2: 0000023efe6c8998 (&zone->lock){..-.}-{2:2}, at: rmqueue_bulk+0x5a/0x940
[ 308.987886] Last Breaking-Event-Address:
[ 308.987890] [<0000034997379096>] __warn_printk+0x136/0x140
[ 308.987897] irq event stamp: 52330356
[ 308.987901] hardirqs last enabled at (52330355): [<000003499838742e>] __do_pgm_check+0x1ce/0x220
[ 308.987907] hardirqs last disabled at (52330356): [<000003499839932e>] _raw_spin_lock_irqsave+0x9e/0xe0
[ 308.987913] softirqs last enabled at (52329882): [<0000034997383786>] handle_softirqs+0x2c6/0x530
[ 308.987922] softirqs last disabled at (52329859): [<0000034997382f86>] __irq_exit_rcu+0x126/0x140
[ 308.987929] ---[ end trace 0000000000000000 ]---
[ 308.987936] ------------[ cut here ]------------
[ 308.987940] page type is 0, passed migratetype is 1 (nr=256)
[ 308.987951] WARNING: CPU: 1 PID: 5224 at mm/page_alloc.c:860 __del_page_from_free_list+0x1be/0x1e0
[ 308.987960] Modules linked in: algif_hash(E) af_alg(E) nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nf_tables(E) s390_trng(E) vfio_ccw(E) mdev(E) vfio_iommu_type1(E) vfio(E) sch_fq_codel(E) drm(E) i2c_core(E) drm_panel_orientation_quirks(E) loop(E) nfnetlink(E) vsock_loopback(E) vmw_vsock_virtio_transport_common(E) vsock(E) ctcm(E) fsm(E) diag288_wdt(E) watchdog(E) zfcp(E) scsi_transport_fc(E) ghash_s390(E) prng(E) aes_s390(E) des_generic(E) des_s390(E) libdes(E) sha3_512_s390(E) sha3_256_s390(E) sha_common(E) paes_s390(E) crypto_engine(E) pkey_cca(E) pkey_ep11(E) zcrypt(E) rng_core(E) pkey_pckmo(E) pkey(E) autofs4(E)
[ 308.988070] Unloaded tainted modules: hmac_s390(E):2
[ 308.988087] CPU: 1 UID: 0 PID: 5224 Comm: mempig_verify Kdump: loaded Tainted: G W E 6.18.0-gcc-bpf-debug #431 PREEMPT
[ 308.988095] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE
[ 308.988100] Hardware name: IBM 3906 M04 704 (z/VM 7.3.0)
[ 308.988105] Krnl PSW : 0404f00180000000 00000349976f9e32 (__del_page_from_free_list+0x1c2/0x1e0)
[ 308.988118] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3
[ 308.988127] Krnl GPRS: 0000034980000004 0000000000000005 0000000000000030 000003499a0e6d88
[ 308.988133] 0000000000000005 0000034980000005 0000034998d57290 0000023efe6c8300
[ 308.988139] 0000000000000001 0000000000000008 000002be00000100 000002be803ac000
[ 308.988144] 0000000000000000 0000000000000001 00000349976f9e2e 000002c99b1eb728
[ 308.988153] Krnl Code: 00000349976f9e22: c020008a06d9 larl %r2,000003499883abd4
00000349976f9e28: c0e5ffe3f89c brasl %r14,0000034997378f60
#00000349976f9e2e: af000000 mc 0,0
>00000349976f9e32: a7f4ff4e brc 15,00000349976f9cce
00000349976f9e36: b904002b lgr %r2,%r11
00000349976f9e3a: c030008a06e7 larl %r3,000003499883ac08
00000349976f9e40: c0e5fffdbac8 brasl %r14,00000349976b13d0
00000349976f9e46: af000000 mc 0,0
[ 308.988184] Call Trace:
[ 308.988188] [<00000349976f9e32>] __del_page_from_free_list+0x1c2/0x1e0
[ 308.988195] ([<00000349976f9e2e>] __del_page_from_free_list+0x1be/0x1e0)
[ 308.988202] [<00000349976ff946>] rmqueue_bulk+0x706/0x940
[ 308.988208] [<00000349976ffd7e>] __rmqueue_pcplist+0x1fe/0x2a0
[ 308.988214] [<0000034997700966>] rmqueue.isra.0+0xb46/0xf40
[ 308.988221] [<0000034997703ec8>] get_page_from_freelist+0x198/0x8d0
[ 308.988227] [<0000034997706fa8>] __alloc_frozen_pages_noprof+0x198/0x400
[ 308.988233] [<00000349977536f8>] alloc_pages_mpol+0xb8/0x220
[ 308.988240] [<0000034997753bf6>] folio_alloc_mpol_noprof+0x26/0xc0
[ 308.988247] [<0000034997753e4c>] vma_alloc_folio_noprof+0x6c/0xa0
[ 308.988253] [<0000034997775b22>] vma_alloc_anon_folio_pmd+0x42/0x240
[ 308.988260] [<000003499777bfea>] __do_huge_pmd_anonymous_page+0x3a/0x210
[ 308.988267] [<00000349976cb08e>] __handle_mm_fault+0x4de/0x500
[ 308.988273] [<00000349976cb14c>] handle_mm_fault+0x9c/0x3a0
[ 308.988279] [<000003499734d70e>] do_exception+0x1de/0x540
[ 308.988286] [<0000034998387390>] __do_pgm_check+0x130/0x220
[ 308.988293] [<000003499839a934>] pgm_check_handler+0x114/0x160
[ 308.988300] 3 locks held by mempig_verify/5224:
[ 308.988305] #0: 0000023ea44c1e08 (vm_lock){++++}-{0:0}, at: lock_vma_under_rcu+0xb2/0x2a0
[ 308.988322] #1: 0000023ee4d41b18 (&pcp->lock){+.+.}-{2:2}, at: rmqueue.isra.0+0xad6/0xf40
[ 308.988334] #2: 0000023efe6c8998 (&zone->lock){..-.}-{2:2}, at: rmqueue_bulk+0x5a/0x940
[ 308.988346] Last Breaking-Event-Address:
[ 308.988350] [<0000034997379096>] __warn_printk+0x136/0x140
[ 308.988356] irq event stamp: 52330356
[ 308.988360] hardirqs last enabled at (52330355): [<000003499838742e>] __do_pgm_check+0x1ce/0x220
[ 308.988366] hardirqs last disabled at (52330356): [<000003499839932e>] _raw_spin_lock_irqsave+0x9e/0xe0
[ 308.988373] softirqs last enabled at (52329882): [<0000034997383786>] handle_softirqs+0x2c6/0x530
[ 308.988380] softirqs last disabled at (52329859): [<0000034997382f86>] __irq_exit_rcu+0x126/0x140
[ 308.988388] ---[ end trace 0000000000000000 ]---
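The inconsistency behind these warnings can be illustrated with a user-space sketch. The pageblock size of 256 pages and all names are assumptions made for the example; the array stands in for the per-pageblock migratetype storage.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_PAGEBLOCK_NR_PAGES 256UL   /* assumption for this sketch */

enum toy_migratetype { TOY_UNMOVABLE = 0, TOY_MOVABLE = 1 };

/* Buggy shape: retag only the pageblock containing the first page. */
void toy_set_first_block(enum toy_migratetype *mt, unsigned long pfn,
                         enum toy_migratetype new_mt)
{
    mt[pfn / TOY_PAGEBLOCK_NR_PAGES] = new_mt;
}

/* Fixed shape: retag every pageblock covered by a page of this order. */
void toy_set_all_blocks(enum toy_migratetype *mt, unsigned long pfn,
                        unsigned int order, enum toy_migratetype new_mt)
{
    unsigned long end = pfn + (1UL << order);

    for (; pfn < end; pfn += TOY_PAGEBLOCK_NR_PAGES)
        mt[pfn / TOY_PAGEBLOCK_NR_PAGES] = new_mt;
}

/* True if all pageblocks under the page carry the expected type. */
bool toy_blocks_match(const enum toy_migratetype *mt, unsigned long pfn,
                      unsigned int order, enum toy_migratetype expected)
{
    unsigned long end = pfn + (1UL << order);

    for (; pfn < end; pfn += TOY_PAGEBLOCK_NR_PAGES)
        if (mt[pfn / TOY_PAGEBLOCK_NR_PAGES] != expected)
            return false;
    return true;
}
```

For an order-9 buddy starting at PFN 512 (two pageblocks in this model), the first helper leaves the second pageblock with the old type, which is the kind of mismatch that the checks in expand() and __del_page_from_free_list() report.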
Link: https://lkml.kernel.org/r/20251215081002.3353900A9c-agordeev@linux.ibm.com
Link: https://lkml.kernel.org/r/20251212151457.3898073Add-agordeev@linux.ibm.com
Fixes: e6cf9e1c4cde ("mm: page_alloc: fix up block types when merging compatible blocks")
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Closes: https://lore.kernel.org/linux-mm/87wmalyktd.fsf@linux.ibm.com/
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Cc: Marc Hartmayer <mhartmay@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
A KASAN tag mismatch, possibly causing a kernel panic, can be observed on
systems with a tag-based KASAN enabled and with multiple NUMA nodes. It
was reported on arm64 and reproduced on x86. It can be explained in the
following points:
1. There can be more than one virtual memory chunk.
2. Chunk's base address has a tag.
3. The base address points at the first chunk and thus inherits
the tag of the first chunk.
4. The subsequent chunks will be accessed with the tag from the
first chunk.
5. Thus, the subsequent chunks need to have their tag set to
match that of the first chunk.
Use the new vmalloc flag that disables random tag assignment in
__kasan_unpoison_vmalloc() - pass the same random tag to all the
vm_structs by tagging the pointers before they go inside
__kasan_unpoison_vmalloc(). Assigning a common tag resolves the pcpu
chunk address mismatch.
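Points 1-5 can be modelled in user space as below. The sketch assumes an arm64-style tag in the top byte of a 64-bit pointer (x86 places its tag bits differently), and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_SHIFT 56   /* assumption: tag lives in the top byte of the pointer */
#define TAG_MASK  (0xffULL << TAG_SHIFT)

uint64_t toy_set_tag(uint64_t addr, uint8_t tag)
{
    return (addr & ~TAG_MASK) | ((uint64_t)tag << TAG_SHIFT);
}

uint8_t toy_get_tag(uint64_t ptr)
{
    return (uint8_t)(ptr >> TAG_SHIFT);
}

/* Derive an address from the chunk base; the base's tag bits ride along. */
uint64_t toy_chunk_addr(uint64_t tagged_base, uint64_t offset)
{
    return tagged_base + offset;
}

/* An access passes only if the pointer tag equals the area's memory tag. */
int toy_access_ok(uint64_t ptr, uint8_t area_tag)
{
    return toy_get_tag(ptr) == area_tag;
}
```

If the base carries tag 0xa1 and a later area was unpoisoned with its own random tag, say 0xb2, an address derived from the base fails the check in that area; giving every area the tag 0xa1 makes it pass.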
[akpm@linux-foundation.org: use WARN_ON_ONCE(), per Andrey]
Link: https://lkml.kernel.org/r/CA+fCnZeuGdKSEm11oGT6FS71_vGq1vjq-xY36kxVdFvwmag2ZQ@mail.gmail.com
[maciej.wieczor-retman@intel.com: remove unneeded pr_warn()]
Link: https://lkml.kernel.org/r/919897daaaa3c982a27762a2ee038769ad033991.1764945396.git.m.wieczorretman@pm.me
Link: https://lkml.kernel.org/r/873821114a9f722ffb5d6702b94782e902883fdf.1764874575.git.m.wieczorretman@pm.me
Fixes: 1d96320f8d53 ("kasan, vmalloc: add vmalloc tagging for SW_TAGS")
Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Cc: Kees Cook <kees@kernel.org>
Cc: Marco Elver <elver@google.com>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: <stable@vger.kernel.org> [6.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
A KASAN tag mismatch, possibly causing a kernel panic, can be observed
on systems with a tag-based KASAN enabled and with multiple NUMA nodes.
It was reported on arm64 and reproduced on x86. It can be explained in
the following points:
1. There can be more than one virtual memory chunk.
2. Chunk's base address has a tag.
3. The base address points at the first chunk and thus inherits
the tag of the first chunk.
4. The subsequent chunks will be accessed with the tag from the
first chunk.
5. Thus, the subsequent chunks need to have their tag set to
match that of the first chunk.
Refactor code by reusing __kasan_unpoison_vmalloc in a new helper in
preparation for the actual fix.
Link: https://lkml.kernel.org/r/eb61d93b907e262eefcaa130261a08bcb6c5ce51.1764874575.git.m.wieczorretman@pm.me
Fixes: 1d96320f8d53 ("kasan, vmalloc: add vmalloc tagging for SW_TAGS")
Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Cc: Kees Cook <kees@kernel.org>
Cc: Marco Elver <elver@google.com>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: <stable@vger.kernel.org> [6.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "kasan: vmalloc: Fixes for the percpu allocator and
vrealloc", v3.
Patches fix two issues related to KASAN and vmalloc.
The first one, a KASAN tag mismatch, possibly resulting in a kernel panic,
can be observed on systems with a tag-based KASAN enabled and with
multiple NUMA nodes. Initially it was only noticed on x86 [1] but later a
similar issue was also reported on arm64 [2].
Specifically the problem is related to how vm_structs interact with
pcpu_chunks - both when they are allocated, assigned and when pcpu_chunk
addresses are derived.
When vm_structs are allocated they are unpoisoned, each with a different
random tag, if vmalloc support is enabled along with the KASAN mode. Later,
when the first pcpu chunk is allocated, it gets its 'base_addr' field set to
the first allocated vm_struct. With that it inherits that vm_struct's
tag.
When pcpu_chunk addresses are later derived (by pcpu_chunk_addr(), for
example in pcpu_alloc_noprof()) the base_addr field is used and offsets
are added to it. If the initial conditions are satisfied then some of the
offsets will point into memory allocated with a different vm_struct. So
while the lower bits will get accurately derived the tag bits in the top
of the pointer won't match the shadow memory contents.
The solution (proposed at v2 of the x86 KASAN series [3]) is to unpoison
the vm_structs with the same tag when allocating them for the per cpu
allocator (in pcpu_get_vm_areas()).
The second one reported by syzkaller [4] is related to vrealloc and
happens because of random tag generation when unpoisoning memory without
allocating new pages. This breaks shadow memory tracking and needs to
reuse the existing tag instead of generating a new one. At the same time
an inconsistency in used flags is corrected.
This patch (of 3):
Syzkaller reported a memory out-of-bounds bug [4]. This patch fixes two
issues:
1. In vrealloc the KASAN_VMALLOC_VM_ALLOC flag is missing when
unpoisoning the extended region. This flag is required to correctly
associate the allocation with KASAN's vmalloc tracking.
Note: In contrast, vzalloc (via __vmalloc_node_range_noprof)
explicitly sets KASAN_VMALLOC_VM_ALLOC and calls
kasan_unpoison_vmalloc() with it. vrealloc must behave consistently --
especially when reusing existing vmalloc regions -- to ensure KASAN can
track allocations correctly.
2. When vrealloc reuses an existing vmalloc region (without allocating
new pages) KASAN generates a new tag, which breaks tag-based memory
access tracking.
Introduce KASAN_VMALLOC_KEEP_TAG, a new KASAN flag that allows reusing the
tag already attached to the pointer, ensuring consistent tag behavior
during reallocation.
Pass KASAN_VMALLOC_KEEP_TAG and KASAN_VMALLOC_VM_ALLOC to the
kasan_unpoison_vmalloc inside vrealloc_node_align_noprof().
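The effect of a keep-tag flag can be sketched in user space. This is a toy model, not the KASAN implementation: TOY_KEEP_TAG stands in for KASAN_VMALLOC_KEEP_TAG, the tag is assumed to sit in the top byte of the pointer, and fresh_tag represents whatever a random tag generator would have picked.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_SHIFT      56
#define TOY_KEEP_TAG   0x1u   /* hypothetical stand-in for KASAN_VMALLOC_KEEP_TAG */

uint8_t toy_get_tag(uint64_t ptr)
{
    return (uint8_t)(ptr >> TAG_SHIFT);
}

/*
 * Toy unpoison of a region that is being grown in place. Returns the tag
 * written to the (simulated) shadow memory for the extended part.
 */
uint8_t toy_unpoison_tag(uint64_t ptr, unsigned int flags, uint8_t fresh_tag)
{
    if (flags & TOY_KEEP_TAG)
        return toy_get_tag(ptr);   /* reuse the tag the caller already holds */
    return fresh_tag;              /* new tag: the old pointer no longer matches */
}

/* An access is fine only when pointer tag and shadow tag agree. */
int toy_access_ok(uint64_t ptr, uint8_t shadow_tag)
{
    return toy_get_tag(ptr) == shadow_tag;
}
```

Because the caller keeps using its existing pointer after growing in place, only the keep-tag path leaves pointer and shadow in agreement.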
Link: https://lkml.kernel.org/r/cover.1765978969.git.m.wieczorretman@pm.me
Link: https://lkml.kernel.org/r/38dece0a4074c43e48150d1e242f8242c73bf1a5.1764874575.git.m.wieczorretman@pm.me
Link: https://lore.kernel.org/all/e7e04692866d02e6d3b32bb43b998e5d17092ba4.1738686764.git.maciej.wieczor-retman@intel.com/ [1]
Link: https://lore.kernel.org/all/aMUrW1Znp1GEj7St@MiWiFi-R3L-srv/ [2]
Link: https://lore.kernel.org/all/CAPAsAGxDRv_uFeMYu9TwhBVWHCCtkSxoWY4xmFB_vowMbi8raw@mail.gmail.com/ [3]
Link: https://syzkaller.appspot.com/bug?extid=997752115a851cb0cf36 [4]
Fixes: a0309faf1cb0 ("mm: vmalloc: support more granular vrealloc() sizing")
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Co-developed-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com>
Reported-by: syzbot+997752115a851cb0cf36@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68e243a2.050a0220.1696c6.007d.GAE@google.com/T/
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Marco Elver <elver@google.com>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fix from Vlastimil Babka:
- A stable fix for a missing tag reset that can happen in
kfree_nolock() with KASAN+SLUB_TINY configs (Deepanshu Kartikey)
* tag 'slab-for-6.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm/slub: reset KASAN tag in defer_free() before accessing freed memory
|
|
Pull shmem rename fixes from Al Viro:
"A couple of shmem rename fixes - recent regression from tree-in-dcache
series and older breakage from stable directory offsets stuff"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
shmem: fix recovery on rename failures
shmem_whiteout(): fix regression from tree-in-dcache series
|
|
maple_tree insertions can fail if we are seriously short on memory;
simple_offset_rename() does not recover well if it runs into that.
The same goes for simple_offset_rename_exchange().
Moreover, shmem_whiteout() expects that if it succeeds, the caller will
progress to d_move(), i.e. that shmem_rename2() won't fail past the
successful call of shmem_whiteout().
Not hard to fix, fortunately - mtree_store() can't fail if the index we
are trying to store into is already present in the tree as a singleton.
For simple_offset_rename_exchange() that's enough - we just need to be
careful about the order of operations.
For simple_offset_rename() the solution is to preinsert the target into the
tree for new_dir; the rest can be done without any potentially failing
operations.
That preinsertion has to be done in shmem_rename2() rather than in
simple_offset_rename() itself - otherwise we'd need to deal with the
possibility of failure after successful shmem_whiteout().
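The preinsertion idea can be sketched with a toy fixed-capacity store in user space. Nothing here is the maple tree API: toy_tree, toy_store() and toy_rename() are hypothetical, and running out of slots plays the role of an allocation failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CAPACITY 2   /* tiny on purpose, so "out of memory" is easy to hit */

struct toy_tree {
    long index[TOY_CAPACITY];
    const char *entry[TOY_CAPACITY];
    size_t used;
};

/* Store entry at index; 0 on success, -1 if a new slot was needed and none was free. */
int toy_store(struct toy_tree *t, long index, const char *entry)
{
    size_t i;

    for (i = 0; i < t->used; i++) {
        if (t->index[i] == index) {
            t->entry[i] = entry;    /* overwrite in place: cannot fail */
            return 0;
        }
    }
    if (t->used == TOY_CAPACITY)
        return -1;                  /* the only failure mode */
    t->index[t->used] = index;
    t->entry[t->used] = entry;
    t->used++;
    return 0;
}

/*
 * Two-phase update: reserve the target slot first (may fail, nothing has
 * changed yet), then pass the point of no return, then fill the reserved
 * slot (cannot fail, because the index is already present).
 */
int toy_rename(struct toy_tree *new_dir, long index, const char *name,
               bool *committed)
{
    if (toy_store(new_dir, index, "placeholder") != 0)
        return -1;                  /* safe to bail out here */

    *committed = true;              /* stands in for the point of no return */

    return toy_store(new_dir, index, name);
}
```

All failure is front-loaded into the reservation step, so nothing after the commit point needs an error path.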
Fixes: a2e459555c5f ("shmem: stable directory offsets")
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Now that shmem_mknod() hashes the new dentry, d_rehash() in
shmem_whiteout() should be removed.
X-paperbag: brown
Reported-by: Hugh Dickins <hughd@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Tested-by: Hugh Dickins <hughd@google.com>
Fixes: 2313598222f9 ("convert ramfs and tmpfs")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc updates from Andrew Morton:
"There are no significant series in this small merge. Please see the
individual changelogs for details"
[ Editor's note: it's mainly ocfs2 and a couple of random fixes ]
* tag 'mm-nonmm-stable-2025-12-11-11-47' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: memfd_luo: add CONFIG_SHMEM dependency
mm: shmem: avoid build warning for CONFIG_SHMEM=n
ocfs2: fix memory leak in ocfs2_merge_rec_left()
ocfs2: invalidate inode if i_mode is zero after block read
ocfs2: avoid -Wflex-array-member-not-at-end warning
ocfs2: convert remaining read-only checks to ocfs2_emergency_state
ocfs2: add ocfs2_emergency_state helper and apply to setattr
checkpatch: add uninitialized pointer with __free attribute check
args: fix documentation to reflect the correct numbers
ocfs2: fix kernel BUG in ocfs2_find_victim_chain
liveupdate: luo_core: fix redundant bound check in luo_ioctl()
ocfs2: validate inline xattr size and entry count in ocfs2_xattr_ibody_list
fs/fat: remove unnecessary wrapper fat_max_cache()
ocfs2: replace deprecated strcpy with strscpy
ocfs2: check tl_used after reading it from trancate log inode
liveupdate: luo_file: don't use invalid list iterator
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:
- "powerpc/pseries/cmm: two smaller fixes" (David Hildenbrand)
fixes a couple of minor things in ppc land
- "Improve folio split related functions" (Zi Yan)
some cleanups and minorish fixes in the folio splitting code
* tag 'mm-stable-2025-12-11-11-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/damon/tests/core-kunit: avoid damos_test_commit stack warning
mm: vmscan: correct nr_requested tracing in scan_folios
MAINTAINERS: add idr core-api doc file to XARRAY
mm/hugetlb: fix incorrect error return from hugetlb_reserve_pages()
mm: fix CONFIG_STACK_GROWSUP typo in mm.h
mm/huge_memory: fix folio split stats counting
mm/huge_memory: make min_order_for_split() always return an order
mm/huge_memory: replace can_split_folio() with direct refcount calculation
mm/huge_memory: change folio_split_supported() to folio_check_splittable()
mm/sparse: fix sparse_vmemmap_init_nid_early definition without CONFIG_SPARSEMEM
powerpc/pseries/cmm: adjust BALLOON_MIGRATE when migrating pages
powerpc/pseries/cmm: call balloon_devinfo_init() also without CONFIG_BALLOON_COMPACTION
|
|
The newly added 'flags' variable is unused and causes a warning if
CONFIG_SHMEM is disabled, since the shmem_acct_size() macro it is passed
into does nothing:
mm/shmem.c: In function '__shmem_file_setup':
mm/shmem.c:5816:23: error: unused variable 'flags' [-Werror=unused-variable]
5816 | unsigned long flags = (vm_flags & VM_NORESERVE) ? SHMEM_F_NORESERVE : 0;
| ^~~~~
Replace the two macros with equivalent inline functions to get the
argument checking.
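Why an inline function avoids the warning can be seen in a small sketch. The names and flag bits are hypothetical; only the macro-versus-function contrast is the point.

```c
#include <assert.h>

/*
 * Macro stub: the preprocessor drops both arguments, so a variable that is
 * only ever passed to it looks unused to the compiler:
 *
 *     unsigned long flags = ...;
 *     acct_size_macro(flags, size);   // expands to 0, 'flags' never read
 */
#define acct_size_macro(flags, size) 0

/* Inline stub: arguments are evaluated and type-checked, then ignored. */
static inline int acct_size_inline(unsigned long flags, long size)
{
    (void)flags;
    (void)size;
    return 0;
}

int setup_with_inline(unsigned long vm_flags)
{
    /* Hypothetical flag bits, chosen only for the example. */
    unsigned long flags = (vm_flags & 0x1UL) ? 0x2UL : 0;

    return acct_size_inline(flags, 4096);   /* 'flags' is used: no warning */
}
```

The inline version compiles to the same no-op, but the compiler still sees the argument as a use of the variable and checks its type.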
Link: https://lkml.kernel.org/r/20251204102905.1048000-1-arnd@kernel.org
Fixes: 6ff1610ced56 ("mm: shmem: use SHMEM_F_* flags instead of VM_* flags")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: guoweikang <guoweikang.kernel@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
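The unused-variable warning described above comes from a function-like macro discarding its arguments during preprocessing. A minimal user-space sketch of the difference follows; acct_size_stub_macro(), acct_size_stub_inline(), file_setup_sketch() and the bit values are hypothetical stand-ins, not the kernel's definitions:

```c
#include <assert.h>

/* Hypothetical stand-in for a macro stub: its arguments disappear during
 * preprocessing, so a variable passed only to it is never referenced. */
#define acct_size_stub_macro(flags, size) 0

/* Hypothetical inline replacement: arguments are evaluated and type-checked. */
static inline int acct_size_stub_inline(unsigned long flags, long size)
{
	(void)flags;            /* deliberately unused in the stubbed-out build */
	assert(size >= 0);
	return 0;
}

int file_setup_sketch(unsigned long vm_flags, long size)
{
	/* Bit values are illustrative, not VM_NORESERVE or SHMEM_F_NORESERVE. */
	unsigned long flags = (vm_flags & 0x1UL) ? 0x2UL : 0UL;

	/* With acct_size_stub_macro(flags, size) here, 'flags' would be unused. */
	return acct_size_stub_inline(flags, size);
}
```

Because the inline function takes 'flags' as a real argument, the compiler treats the variable as used in every configuration.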
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fix from Vlastimil Babka:
- A stable fix for performance regression in tests that perform
kmem_cache_destroy() a lot, due to unnecessarily wide scope of
kvfree_rcu_barrier() (Harry Yoo)
* tag 'slab-for-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction
|
|
When CONFIG_SLUB_TINY is enabled, kfree_nolock() calls kasan_slab_free()
before defer_free(). On ARM64 with MTE (Memory Tagging Extension),
kasan_slab_free() poisons the memory and changes the tag from the
original (e.g., 0xf3) to a poison tag (0xfe).
When defer_free() then tries to write to the freed object to build the
deferred free list via llist_add(), the pointer still has the old tag,
causing a tag mismatch and triggering a KASAN use-after-free report:
BUG: KASAN: slab-use-after-free in defer_free+0x3c/0xbc mm/slub.c:6537
Write at addr f3f000000854f020 by task kworker/u8:6/983
Pointer tag: [f3], memory tag: [fe]
Fix this by calling kasan_reset_tag() before accessing the freed memory.
This is safe because defer_free() is part of the allocator itself and is
expected to manipulate freed memory for bookkeeping purposes.
Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Cc: stable@vger.kernel.org
Reported-by: syzbot+7a25305a76d872abcfa1@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7a25305a76d872abcfa1
Tested-by: syzbot+7a25305a76d872abcfa1@syzkaller.appspotmail.com
Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Link: https://patch.msgid.link/20251210022024.3255826-1-kartikey406@gmail.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
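The tag mismatch in the report above can be modelled in user space. This is a simplified sketch, assuming the tag occupies the top byte of a 64-bit pointer and that 0xff acts as a match-all tag; ptr_tag(), reset_tag_sketch() and access_allowed() are hypothetical helpers, not the KASAN API:

```c
#include <assert.h>
#include <stdint.h>

#define TAG_SHIFT      56
#define TAG_MATCH_ALL  0xffu   /* assumed match-all tag value */

/* Extract the tag stored in the top byte of the pointer value. */
static inline uint8_t ptr_tag(uint64_t ptr)
{
	return (uint8_t)(ptr >> TAG_SHIFT);
}

/* Model of resetting a tag: force the top byte to the match-all value. */
static inline uint64_t reset_tag_sketch(uint64_t ptr)
{
	return ptr | ((uint64_t)TAG_MATCH_ALL << TAG_SHIFT);
}

/* An access passes if the pointer tag is match-all or equals the memory tag. */
static inline int access_allowed(uint64_t ptr, uint8_t mem_tag)
{
	uint8_t tag = ptr_tag(ptr);

	return tag == TAG_MATCH_ALL || tag == mem_tag;
}

/* Replays the values from the report: pointer tag f3, memory tag fe. */
void tag_mismatch_demo(void)
{
	uint64_t freed = 0xf3f000000854f020ULL;

	assert(!access_allowed(freed, 0xfe));                  /* stale tag faults */
	assert(access_allowed(reset_tag_sketch(freed), 0xfe)); /* reset tag passes */
}
```

In this model the address bits are untouched; only the tag byte changes, which is why the allocator can still reach the same object after the reset.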
|
|
The newly added damos_test_commit() constructs multiple large structures
on the stack, which exceeds the warning limit in some cases:
In file included from mm/damon/core.c:2941:
mm/damon/tests/core-kunit.h: In function 'damos_test_commit':
mm/damon/tests/core-kunit.h:965:1: error: the frame size of 1520 bytes is larger than 1280 bytes [-Werror=frame-larger-than=]
Split this function up into two separate ones that are called
sequentially, so they can occupy the same stack slots.
Link: https://lkml.kernel.org/r/20251204100403.1034980-1-arnd@kernel.org
Fixes: 299a88f6ec13 ("mm/damon/tests/core-kunit: add damos_commit() test")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Quanmin Yan <yanquanmin1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
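The stack-frame reduction described above relies on sequential calls reusing the same stack space. A rough sketch, assuming a GCC or Clang toolchain for the noinline attribute; struct big_obj, check_first(), check_second() and run_checks() are hypothetical names, and whether the actual patch marks its helpers noinline is not stated in the message:

```c
#include <assert.h>
#include <string.h>

struct big_obj { char buf[600]; };   /* hypothetical large on-stack object */

/* noinline keeps each helper's frame separate from the caller's frame. */
static __attribute__((noinline)) int check_first(void)
{
	struct big_obj a;

	memset(&a, 1, sizeof(a));
	return a.buf[0] + a.buf[sizeof(a.buf) - 1];
}

static __attribute__((noinline)) int check_second(void)
{
	struct big_obj b;

	memset(&b, 2, sizeof(b));
	return b.buf[0] + b.buf[sizeof(b.buf) - 1];
}

int run_checks(void)
{
	/* The helpers run one after the other, so their frames never coexist. */
	int first = check_first();
	int second = check_second();

	assert(first == 2 && second == 4);
	return first + second;
}
```

Had both objects lived in run_checks() itself, its frame would need room for both at once.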
|
|
When enabling vmscan tracing, it is observed that nr_requested is always
4096, which is confusing.
mm_vmscan_lru_isolate: classzone=3 order=0 nr_requested=4096 ...
mm_vmscan_lru_isolate: classzone=3 order=0 nr_requested=4096 ...
mm_vmscan_lru_isolate: classzone=3 order=0 nr_requested=4096 ...
mm_vmscan_lru_isolate: classzone=3 order=0 nr_requested=4096 ...
mm_vmscan_lru_isolate: classzone=3 order=0 nr_requested=4096 ...
mm_vmscan_lru_isolate: classzone=3 order=0 nr_requested=4096 ...
mm_vmscan_lru_isolate: classzone=3 order=0 nr_requested=4096 ...
This is because it prints MAX_LRU_BATCH, which is meaningless as it's a
constant. To fix this, modify it to print the capped value.
Link: https://lkml.kernel.org/r/20251204122355.1822919-1-chenridong@huaweicloud.com
Fixes: 8c2214fc9a47 ("mm: multi-gen LRU: reuse some legacy trace events")
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Jaewon Kim <jaewon31.kim@samsung.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Lu Jialin <lujialin4@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
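The difference between tracing the constant and tracing the capped value can be sketched as below. MAX_LRU_BATCH_SKETCH is set to 4096 only because that is the value seen in the trace output above; capped_request() is a hypothetical helper, and the exact expression used in scan_folios() is an assumption:

```c
#include <assert.h>

#define MAX_LRU_BATCH_SKETCH 4096L   /* value taken from the trace lines above */

/* Return the number of folios actually requested: the caller's target,
 * limited to the batch size. */
long capped_request(long nr_to_scan)
{
	assert(nr_to_scan >= 0);
	return nr_to_scan < MAX_LRU_BATCH_SKETCH ? nr_to_scan : MAX_LRU_BATCH_SKETCH;
}
```

Tracing capped_request(32) would report 32, while tracing the constant always reports 4096 regardless of the request.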
|
|
The function hugetlb_reserve_pages() returns the number of pages added
to the reservation map on success and a negative error code on failure
(e.g. -EINVAL, -ENOMEM). However, in some error paths, it may return -1
directly.
For example, a failure at:
if (hugetlb_acct_memory(h, gbl_reserve) < 0)
goto out_put_pages;
results in returning -1 (since add = -1), which may be misinterpreted
in userspace as -EPERM.
Fix this by explicitly capturing and propagating the return values from
helper functions, and using -EINVAL for all other failure cases.
Link: https://lkml.kernel.org/r/20251125171350.86441-1-skolothumtho@nvidia.com
Fixes: 986f5f2b4be3 ("mm/hugetlb: make hugetlb_reserve_pages() return nr of entries updated")
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew R. Ochs <mochs@nvidia.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nicolin Chen <nicolinc@nvidia.com>
Cc: Vivek Kasireddy <vivek.kasireddy@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
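The confusion between a bare -1 and -EPERM can be shown with a small user-space model, assuming a platform where EPERM is 1 (as on Linux). acct_memory_sketch(), reserve_buggy() and reserve_fixed() are hypothetical and only mimic the shape of the error path:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical helper that fails with a real errno for large requests. */
static inline int acct_memory_sketch(long pages)
{
	return pages > 100 ? -ENOMEM : 0;
}

/* Buggy shape: the helper's error is dropped and a placeholder -1 escapes. */
long reserve_buggy(long pages)
{
	long add = -1;

	assert(pages >= 0);
	if (acct_memory_sketch(pages) < 0)
		return add;         /* numerically identical to -EPERM */
	return pages;
}

/* Fixed shape: capture the helper's return value and propagate it. */
long reserve_fixed(long pages)
{
	int err;

	assert(pages >= 0);
	err = acct_memory_sketch(pages);
	if (err < 0)
		return err;
	return pages;
}
```

A caller that decodes the buggy result as an errno would report "operation not permitted" for what is really an out-of-memory condition.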
|
|
The "return <error code>" statements for error checks at the beginning of
__folio_split() skip necessary count_vm_event() and count_mthp_stat() at
the end of the function. Fix these by replacing them with "ret = <error
code>; goto out;".
Link: https://lkml.kernel.org/r/20251126210618.1971206-5-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
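The "ret = <error code>; goto out;" pattern can be sketched in user space as follows. struct split_stats and split_sketch() are hypothetical; the two counters stand in for the event counting done at the end of the real function:

```c
#include <assert.h>
#include <errno.h>

struct split_stats {
	unsigned long succeeded;
	unsigned long failed;
};

int split_sketch(struct split_stats *st, int input_is_valid)
{
	int ret;

	assert(st);
	if (!input_is_valid) {
		ret = -EINVAL;
		goto out;       /* an early "return -EINVAL" here would skip the counters */
	}

	ret = 0;                /* the real splitting work would happen here */
out:
	if (ret)
		st->failed++;
	else
		st->succeeded++;
	return ret;
}
```

With a single exit label, every path through the function updates the statistics exactly once.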
|
|
min_order_for_split() returns -EBUSY when the folio is truncated and
cannot be split. In commit 77008e1b2ef7 ("mm/huge_memory: do not change
split_huge_page*() target order silently"), memory_failure() does not
handle it and passes -EBUSY to try_to_split_thp_page() directly.
try_to_split_thp_page() returns -EINVAL since -EBUSY becomes 0xfffffff0 as
new_order is unsigned int in __folio_split() and this large new_order is
rejected as an invalid input. The code does not cause a bug.
soft_offline_in_use_page() also uses min_order_for_split() but it always
passes 0 as new_order for split.
Fix it by making min_order_for_split() always return an order. When the
given folio is truncated, namely folio->mapping == NULL, return 0 and let
a subsequent split function handle the situation and return -EBUSY.
Add kernel-doc to min_order_for_split() to clarify its use.
Link: https://lkml.kernel.org/r/20251126210618.1971206-4-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
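The 0xfffffff0 value mentioned above follows from converting a negative errno to unsigned int. A small sketch, assuming EBUSY is 16 (as on Linux) and a 32-bit unsigned int; as_new_order(), new_order_is_valid() and MAX_ORDER_SKETCH are hypothetical:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical upper bound; the real limit depends on the folio being split. */
#define MAX_ORDER_SKETCH 9u

/* What an "unsigned int new_order" parameter sees when handed a negative errno. */
unsigned int as_new_order(int order_or_errno)
{
	assert(sizeof(unsigned int) == 4);
	return (unsigned int)order_or_errno;
}

/* A huge converted value fails the range check and is rejected as invalid. */
int new_order_is_valid(unsigned int new_order)
{
	return new_order < MAX_ORDER_SKETCH;
}
```

Under those assumptions -16 wraps to 2^32 - 16, which is 0xfffffff0, so the range check rejects it and the caller sees an invalid-input error instead of the original -EBUSY.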
|
|
can_split_folio() is just a refcount comparison, making sure only the
split caller holds an extra pin. Open code it with
folio_expected_ref_count() != folio_ref_count() - 1. For the extra_pins
used by folio_ref_freeze(), add folio_cache_ref_count() to calculate it.
Also replace folio_expected_ref_count() with folio_cache_ref_count() used
by folio_ref_unfreeze(), since they are returning the same values when a
folio is frozen and folio_cache_ref_count() does not have unnecessary
folio_mapcount() in its implementation.
Link: https://lkml.kernel.org/r/20251126210618.1971206-3-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Balbir Singh <balbirs@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
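The refcount comparison described above can be modelled with two plain integers. struct folio_refs_sketch and only_caller_holds_extra_pin() are hypothetical; the fields stand in for the results of folio_ref_count() and folio_expected_ref_count():

```c
#include <assert.h>

struct folio_refs_sketch {
	int ref_count;      /* stand-in for folio_ref_count() */
	int expected_refs;  /* stand-in for folio_expected_ref_count() */
};

/* The split may proceed only if the single reference beyond the expected
 * ones is the pin held by the split caller itself. */
int only_caller_holds_extra_pin(const struct folio_refs_sketch *f)
{
	assert(f);
	return f->expected_refs == f->ref_count - 1;
}
```

Any additional pin held by someone else makes the two sides differ, and the split is refused.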
|
|
Patch series "Improve folio split related functions", v4.
This patchset improves several folio split related functions to avoid
future misuse. The changes are:
1. Consolidated folio splittable checks by moving truncated folio check,
huge zero folio check, and writeback folio check into
folio_split_supported(). Changed the function return type. Renamed it
to folio_check_splittable() for clarification.
2. Replaced can_split_folio() with open coded folio_expected_ref_count()
and folio_ref_count() and introduced folio_cache_ref_count().
3. Changed min_order_for_split() to always return an order.
4. Fixed folio split stats counting.
Motivation
==========
This is based on Wei's observation[1] and solves several potential
issues:
1. Dereferencing NULL folio->mapping in try_folio_split_to_order() if it
is called on truncated folios.
2. Not handling the negative return value of min_order_for_split() in
mm/memory-failure.c
There is no bug in the current code.
This patch (of 4):
folio_split_supported() used in try_folio_split_to_order() requires
folio->mapping to be non NULL, but current try_folio_split_to_order() does
not check it. There is no issue in the current code, since
try_folio_split_to_order() is only used in truncate_inode_partial_folio(),
where folio->mapping is not NULL.
To prevent future misuse, move folio->mapping NULL check (i.e., folio is
truncated) into folio_split_supported(). Since folio->mapping NULL check
returns -EBUSY and folio_split_supported() == false means -EINVAL, change
folio_split_supported() return type from bool to int and return error
numbers accordingly. Rename folio_split_supported() to
folio_check_splittable() to match the return type change.
While at it, move is_huge_zero_folio() check and folio_test_writeback()
check into folio_check_splittable() and add kernel-doc.
Remove all warnings inside folio_check_splittable() and give warnings
in __folio_split() instead, so that bool warns parameter can be removed.
Link: https://lkml.kernel.org/r/20251126210618.1971206-1-ziy@nvidia.com
Link: https://lkml.kernel.org/r/20251126210618.1971206-2-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Balbir Singh <balbirs@nvidia.com>
Acked-by: David Hildenbrand (Red Hat) <david@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
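The bool-to-int return type change can be sketched as below. struct folio_state_sketch and check_splittable_sketch() are hypothetical; only the two error codes named in the message above are modelled:

```c
#include <assert.h>
#include <errno.h>

struct folio_state_sketch {
	int has_mapping;       /* 0 models folio->mapping == NULL (truncated) */
	int split_supported;   /* 0 models an unsupported split request */
};

/* Returns 0 when the folio can be split, or a distinct error otherwise,
 * which a bool return value could not express. */
int check_splittable_sketch(const struct folio_state_sketch *f)
{
	assert(f);
	if (!f->has_mapping)
		return -EBUSY;
	if (!f->split_supported)
		return -EINVAL;
	return 0;
}
```

Callers can then propagate the returned value directly instead of translating "false" into an error code of their own.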
|
|
Currently, kvfree_rcu_barrier() flushes RCU sheaves across all slab
caches when a cache is destroyed. This is unnecessary; only the RCU
sheaves belonging to the cache being destroyed need to be flushed.
As suggested by Vlastimil Babka, introduce a weaker form of
kvfree_rcu_barrier() that operates on a specific slab cache.
Factor out flush_rcu_sheaves_on_cache() from flush_all_rcu_sheaves() and
call it from flush_all_rcu_sheaves() and kvfree_rcu_barrier_on_cache().
Call kvfree_rcu_barrier_on_cache() instead of kvfree_rcu_barrier() on
cache destruction.
The performance benefit is evaluated on a 12 core 24 threads AMD Ryzen
5900X machine (1 socket), by loading slub_kunit module.
Before:
Total calls: 19
Average latency (us): 18127
Total time (us): 344414
After:
Total calls: 19
Average latency (us): 10066
Total time (us): 191264
Two performance regressions have been reported:
- stress module loader test's runtime increases by 50-60% (Daniel)
- internal graphics test's runtime on Tegra234 increases by 35% (Jon)
They are fixed by this change.
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Fixes: ec66e0d59952 ("slab: add sheaf support for batching kfree_rcu() operations")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-mm/1bda09da-93be-4737-aef0-d47f8c5c9301@suse.cz
Reported-and-tested-by: Daniel Gomez <da.gomez@samsung.com>
Closes: https://lore.kernel.org/linux-mm/0406562e-2066-4cf8-9902-b2b0616dd742@kernel.org
Reported-and-tested-by: Jon Hunter <jonathanh@nvidia.com>
Closes: https://lore.kernel.org/linux-mm/e988eff6-1287-425e-a06c-805af5bbf262@nvidia.com
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
Link: https://patch.msgid.link/20251207154148.117723-1-harry.yoo@oracle.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
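The before/after figures quoted above are consistent with each other, as a short calculation shows. average_us() and reduction_percent() are hypothetical helpers that use truncating integer division:

```c
#include <assert.h>

/* Average latency in microseconds, truncated toward zero. */
unsigned long average_us(unsigned long total_us, unsigned long calls)
{
	assert(calls != 0);
	return total_us / calls;
}

/* Whole-percent reduction in total time from 'before' to 'after'. */
unsigned long reduction_percent(unsigned long before, unsigned long after)
{
	assert(before != 0 && after <= before);
	return (before - after) * 100 / before;
}
```

For example, average_us(344414, 19) gives 18127 and average_us(191264, 19) gives 10066, matching the reported averages, and reduction_percent(344414, 191264) gives 44.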
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock
Pull memblock update from Mike Rapoport:
"Introduce a 'check_pages' boot parameter to decouple simple checks for
page state on allocation and free from CONFIG_DEBUG_VM.
This allows enabling page checking without building kernel with
CONFIG_DEBUG_VM or forcing init_on_{alloc, free} or other heavier
mechanisms"
* tag 'memblock-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
mm/mm_init: Introduce a boot parameter for check_pages
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- "panic: sys_info: Refactor and fix a potential issue" (Andy Shevchenko)
fixes a build issue and does some cleanup in lib/sys_info.c
- "Implement mul_u64_u64_div_u64_roundup()" (David Laight)
enhances the 64-bit math code on behalf of a PWM driver and beefs up
the test module for these library functions
- "scripts/gdb/symbols: make BPF debug info available to GDB" (Ilya Leoshkevich)
makes BPF symbol names, sizes, and line numbers available to the GDB
debugger
- "Enable hung_task and lockup cases to dump system info on demand" (Feng Tang)
adds a sysctl which can be used to cause additional info dumping when
the hung-task and lockup detectors fire
- "lib/base64: add generic encoder/decoder, migrate users" (Kuan-Wei Chiu)
adds a general base64 encoder/decoder to lib/ and migrates several
users away from their private implementations
- "rbtree: inline rb_first() and rb_last()" (Eric Dumazet)
makes TCP a little faster
- "liveupdate: Rework KHO for in-kernel users" (Pasha Tatashin)
reworks the KEXEC Handover interfaces in preparation for Live Update
Orchestrator (LUO), and possibly for other future clients
- "kho: simplify state machine and enable dynamic updates" (Pasha Tatashin)
increases the flexibility of KEXEC Handover. Also preparation for LUO
- "Live Update Orchestrator" (Pasha Tatashin)
is a major new feature targeted at cloud environments. Quoting the
cover letter:
This series introduces the Live Update Orchestrator, a kernel
subsystem designed to facilitate live kernel updates using a
kexec-based reboot. This capability is critical for cloud
environments, allowing hypervisors to be updated with minimal
downtime for running virtual machines. LUO achieves this by
preserving the state of selected resources, such as memory,
devices and their dependencies, across the kernel transition.
As a key feature, this series includes support for preserving
memfd file descriptors, which allows critical in-memory data, such
as guest RAM or any other large memory region, to be maintained in
RAM across the kexec reboot.
Mike Rapoport merits a mention here, for his extensive review and
testing work.
- "kexec: reorganize kexec and kdump sysfs" (Sourabh Jain)
moves the kexec and kdump sysfs entries from /sys/kernel/ to
/sys/kernel/kexec/ and adds back-compatibility symlinks which can
hopefully be removed one day
- "kho: fixes for vmalloc restoration" (Mike Rapoport)
fixes a BUG which was being hit during KHO restoration of vmalloc()
regions
* tag 'mm-nonmm-stable-2025-12-06-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (139 commits)
calibrate: update header inclusion
Reinstate "resource: avoid unnecessary lookups in find_next_iomem_res()"
vmcoreinfo: track and log recoverable hardware errors
kho: fix restoring of contiguous ranges of order-0 pages
kho: kho_restore_vmalloc: fix initialization of pages array
MAINTAINERS: TPM DEVICE DRIVER: update the W-tag
init: replace simple_strtoul with kstrtoul to improve lpj_setup
KHO: fix boot failure due to kmemleak access to non-PRESENT pages
Documentation/ABI: new kexec and kdump sysfs interface
Documentation/ABI: mark old kexec sysfs deprecated
kexec: move sysfs entries to /sys/kernel/kexec
test_kho: always print restore status
kho: free chunks using free_page() instead of kfree()
selftests/liveupdate: add kexec test for multiple and empty sessions
selftests/liveupdate: add simple kexec-based selftest for LUO
selftests/liveupdate: add userspace API selftests
docs: add documentation for memfd preservation via LUO
mm: memfd_luo: allow preserving memfd
liveupdate: luo_file: add private argument to store runtime state
mm: shmem: export some functions to internal.h
...
|
|
Pull KVM updates from Paolo Bonzini:
"ARM:
- Support for userspace handling of synchronous external aborts
(SEAs), allowing the VMM to potentially handle the abort in a
non-fatal manner
- Large rework of the VGIC's list register handling with the goal of
supporting more active/pending IRQs than available list registers
in hardware. In addition, the VGIC now supports EOImode==1 style
deactivations for IRQs which may occur on a separate vCPU than the
one that acked the IRQ
- Support for FEAT_XNX (user / privileged execute permissions) and
FEAT_HAF (hardware update to the Access Flag) in the software page
table walkers and shadow MMU
- Allow page table destruction to reschedule, fixing long
need_resched latencies observed when destroying a large VM
- Minor fixes to KVM and selftests
Loongarch:
- Get VM PMU capability from HW GCFG register
- Add AVEC basic support
- Use 64-bit register definition for EIOINTC
- Add KVM timer test cases for tools/selftests
RISC-V:
- SBI message passing (MPXY) support for KVM guest
- Give a new, more specific error subcode for the case when in-kernel
AIA virtualization fails to allocate IMSIC VS-file
- Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually
in small chunks
- Fix guest page fault within HLV* instructions
- Flush VS-stage TLB after VCPU migration for Andes cores
s390:
- Always allocate ESCA (Extended System Control Area), instead of
starting with the basic SCA and converting to ESCA with the
addition of the 65th vCPU. The price is increased number of exits
(and worse performance) on z10 and earlier processor; ESCA was
introduced by z114/z196 in 2010
- VIRT_XFER_TO_GUEST_WORK support
- Operation exception forwarding support
- Cleanups
x86:
- Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO
SPTE caching is disabled, as there can't be any relevant SPTEs to
zap
- Relocate a misplaced export
- Fix an async #PF bug where KVM would clear the completion queue
when the guest transitioned in and out of paging mode, e.g. when
handling an SMI and then returning to paged mode via RSM
- Leave KVM's user-return notifier registered even when disabling
virtualization, as long as kvm.ko is loaded. On reboot/shutdown,
keeping the notifier registered is ok; the kernel does not use the
MSRs and the callback will run cleanly and restore host MSRs if the
CPU manages to return to userspace before the system goes down
- Use the checked version of {get,put}_user()
- Fix a long-lurking bug where KVM's lack of catch-up logic for
periodic APIC timers can result in a hard lockup in the host
- Revert the periodic kvmclock sync logic now that KVM doesn't use a
clocksource that's subject to NTP corrections
- Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the
latter behind CONFIG_CPU_MITIGATIONS
- Context switch XCR0, XSS, and PKRU outside of the entry/exit fast
path; the only reason they were handled in the fast path was to
paper over a bug in the core #MC code, and that has long since been
fixed
- Add emulator support for AVX MOV instructions, to play nice with
emulated devices whose guest drivers like to access PCI BARs with
large multi-byte instructions
x86 (AMD):
- Fix a few missing "VMCB dirty" bugs
- Fix the worst of KVM's lack of EFER.LMSLE emulation
- Add AVIC support for addressing 4k vCPUs in x2AVIC mode
- Fix incorrect handling of selective CR0 writes when checking
intercepts during emulation of L2 instructions
- Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32]
on VMRUN and #VMEXIT
- Fix a bug where KVM corrupts the guest code stream when re-injecting
a soft interrupt if the guest patched the underlying code after the
VM-Exit, e.g. when Linux patches code with a temporary INT3
- Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits
to userspace, and extend KVM "support" to all policy bits that
don't require any actual support from KVM
x86 (Intel):
- Use the root role from kvm_mmu_page to construct EPTPs instead of
the current vCPU state, partly as worthwhile cleanup, but mostly to
pave the way for tracking per-root TLB flushes, and elide EPT
flushes on pCPU migration if the root is clean from a previous
flush
- Add a few missing nested consistency checks
- Rip out support for doing "early" consistency checks via hardware
as the functionality hasn't been used in years and is no longer
useful in general; replace it with an off-by-default module param
to WARN if hardware fails a check that KVM does not perform
- Fix a currently-benign bug where KVM would drop the guest's
SPEC_CTRL[63:32] on VM-Enter
- Misc cleanups
- Overhaul the TDX code to address systemic races where KVM (acting
on behalf of userspace) could inadvertently trigger lock contention
in the TDX-Module; KVM was either working around these in weird,
ugly ways, or was simply oblivious to them (though even Yan's
devilish selftests could only break individual VMs, not the host
kernel)
- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a
TDX vCPU, if creating said vCPU failed partway through
- Fix a few sparse warnings (bad annotation, 0 != NULL)
- Use struct_size() to simplify copying TDX capabilities to userspace
- Fix a bug where TDX would effectively corrupt user-return MSR
values if the TDX Module rejects VP.ENTER and thus doesn't clobber
host MSRs as expected
Selftests:
- Fix a math goof in mmu_stress_test when running on a single-CPU
system/VM
- Forcefully override ARCH from x86_64 to x86 to play nice with
specifying ARCH=x86_64 on the command line
- Extend a bunch of nested VMX to validate nested SVM as well
- Add support for LA57 in the core VM_MODE_xxx macro, and add a test
to verify KVM can save/restore nested VMX state when L1 is using
5-level paging, but L2 is not
- Clean up the guest paging code in anticipation of sharing the core
logic for nested EPT and nested NPT
guest_memfd:
- Add NUMA mempolicy support for guest_memfd, and clean up a variety
of rough edges in guest_memfd along the way
- Define a CLASS to automatically handle get+put when grabbing a
guest_memfd from a memslot to make it harder to leak references
- Enhance KVM selftests to make it easier to develop and debug
selftests like those added for guest_memfd NUMA support, e.g. where
test and/or KVM bugs often result in hard-to-debug SIGBUS errors
- Misc cleanups
Generic:
- Use the recently-added WQ_PERCPU when creating the per-CPU
workqueue for irqfd cleanup
- Fix a goof in the dirty ring documentation
- Fix choice of target for directed yield across different calls to
kvm_vcpu_on_spin(); the function was always starting from the first
vCPU instead of continuing the round-robin search"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (260 commits)
KVM: arm64: at: Update AF on software walk only if VM has FEAT_HAFDBS
KVM: arm64: at: Use correct HA bit in TCR_EL2 when regime is EL2
KVM: arm64: Document KVM_PGTABLE_PROT_{UX,PX}
KVM: arm64: Fix spelling mistake "Unexpeced" -> "Unexpected"
KVM: arm64: Add break to default case in kvm_pgtable_stage2_pte_prot()
KVM: arm64: Add endian casting to kvm_swap_s[12]_desc()
KVM: arm64: Fix compilation when CONFIG_ARM64_USE_LSE_ATOMICS=n
KVM: arm64: selftests: Add test for AT emulation
KVM: arm64: nv: Expose hardware access flag management to NV guests
KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW
KVM: arm64: Implement HW access flag management in stage-1 SW PTW
KVM: arm64: Propagate PTW errors up to AT emulation
KVM: arm64: Add helper for swapping guest descriptor
KVM: arm64: nv: Use pgtable definitions in stage-2 walk
KVM: arm64: Handle endianness in read helper for emulated PTW
KVM: arm64: nv: Stop passing vCPU through void ptr in S2 PTW
KVM: arm64: Call helper for reading descriptors directly
KVM: arm64: nv: Advertise support for FEAT_XNX
KVM: arm64: Teach ptdump about FEAT_XNX permissions
KVM: s390: Use generic VIRT_XFER_TO_GUEST_WORK functions
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull persistent dentry infrastructure and conversion from Al Viro:
"Some filesystems use a kinda-sorta controlled dentry refcount leak to
pin dentries of created objects in dcache (and undo it when removing
those). A reference is grabbed and not released, but it's not actually
_stored_ anywhere.
That works, but it's hard to follow and verify; among other things, we
have no way to tell _which_ of the increments is intended to be an
unpaired one. Worse, on removal we need to decide whether the
reference had already been dropped, which can be non-trivial if that
removal is on umount and we need to figure out if this dentry is
pinned due to e.g. unlink() not done. Usually that is handled by using
kill_litter_super() as ->kill_sb(), but there are open-coded special
cases of the same (consider e.g. /proc/self).
Things get simpler if we introduce a new dentry flag
(DCACHE_PERSISTENT) marking those "leaked" dentries. Having it set
claims responsibility for +1 in refcount.
The end result this series is aiming for:
- get these unbalanced dget() and dput() replaced with new primitives
that would, in addition to adjusting refcount, set and clear
persistency flag.
- instead of having kill_litter_super() mess with removing the
remaining "leaked" references (e.g. for all tmpfs files that hadn't
been removed prior to umount), have the regular
shrink_dcache_for_umount() strip DCACHE_PERSISTENT of all dentries,
dropping the corresponding reference if it had been set. After that
kill_litter_super() becomes an equivalent of kill_anon_super().
Doing that in a single step is not feasible - it would affect too many
places in too many filesystems. It has to be split into a series.
This work has really started early in 2024; quite a few preliminary
pieces have already gone into mainline. This chunk is finally getting
to the meat of that stuff - infrastructure and most of the conversions
to it.
Some pieces are still sitting in the local branches, but the bulk of
that stuff is here"
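The idea of a flag that "claims responsibility for +1 in refcount" can be modelled in user space. This is a sketch under stated assumptions: struct dentry_sketch, PERSISTENT_FLAG_SKETCH, make_persistent_sketch() and make_discardable_sketch() are hypothetical and do not reproduce the kernel primitives or their locking:

```c
#include <assert.h>

#define PERSISTENT_FLAG_SKETCH 0x1u   /* stand-in for DCACHE_PERSISTENT */

struct dentry_sketch {
	int refcount;
	unsigned int flags;
};

/* Pin the dentry: take one reference and record that the flag owns it. */
void make_persistent_sketch(struct dentry_sketch *d)
{
	assert(d);
	if (!(d->flags & PERSISTENT_FLAG_SKETCH)) {
		d->flags |= PERSISTENT_FLAG_SKETCH;
		d->refcount++;
	}
}

/* Unpin the dentry: drop the flag's reference only if the flag is set,
 * so the pairing is unambiguous even at unmount time. */
void make_discardable_sketch(struct dentry_sketch *d)
{
	assert(d);
	if (d->flags & PERSISTENT_FLAG_SKETCH) {
		d->flags &= ~PERSISTENT_FLAG_SKETCH;
		d->refcount--;
	}
}
```

Because the flag records whether the extra reference is still held, teardown code no longer has to guess whether a given dentry was already unpinned.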
* tag 'pull-persistency' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
d_make_discardable(): warn if given a non-persistent dentry
kill securityfs_recursive_remove()
convert securityfs
get rid of kill_litter_super()
convert rust_binderfs
convert nfsctl
convert rpc_pipefs
convert hypfs
hypfs: swich hypfs_create_u64() to returning int
hypfs: switch hypfs_create_str() to returning int
hypfs: don't pin dentries twice
convert gadgetfs
gadgetfs: switch to simple_remove_by_name()
convert functionfs
functionfs: switch to simple_remove_by_name()
functionfs: fix the open/removal races
functionfs: need to cancel ->reset_work in ->kill_sb()
functionfs: don't bother with ffs->ref in ffs_data_{opened,closed}()
functionfs: don't abuse ffs_data_closed() on fs shutdown
convert selinuxfs
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"__vmalloc()/kvmalloc() and no-block support" (Uladzislau Rezki)
Rework the vmalloc() code to support non-blocking allocations
(GFP_ATOMIC, GFP_NOWAIT)
"ksm: fix exec/fork inheritance" (xu xin)
Fix a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not
inherited across fork/exec
"mm/zswap: misc cleanup of code and documentations" (SeongJae Park)
Some light maintenance work on the zswap code
"mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" (Mauricio Faria de Oliveira)
Enhance the /sys/kernel/debug/page_owner debug feature by adding
unique identifiers to differentiate the various stack traces so
that userspace monitoring tools can better match stack traces over
time
"mm/page_alloc: pcp->batch cleanups" (Joshua Hahn)
Minor alterations to the page allocator's per-cpu-pages feature
"Improve UFFDIO_MOVE scalability by removing anon_vma lock" (Lokesh Gidra)
Address a scalability issue in userfaultfd's UFFDIO_MOVE operation
"kasan: cleanups for kasan_enabled() checks" (Sabyrzhan Tasbolatov)
"drivers/base/node: fold node register and unregister functions" (Donet Tom)
Clean up the NUMA node handling code a little
"mm: some optimizations for prot numa" (Kefeng Wang)
Cleanups and small optimizations to the NUMA allocation hinting
code
"mm/page_alloc: Batch callers of free_pcppages_bulk" (Joshua Hahn)
Address long lock hold times at boot on large machines. These were
causing (harmless) softlockup warnings
"optimize the logic for handling dirty file folios during reclaim" (Baolin Wang)
Remove some now-unnecessary work from page reclaim
"mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" (SeongJae Park)
Enhance the DAMOS auto-tuning feature
"mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" (Quanmin Yan)
Fix DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace
configuration
"expand mmap_prepare functionality, port more users" (Lorenzo Stoakes)
Enhance the new(ish) file_operations.mmap_prepare() method and port
additional callsites from the old ->mmap() over to ->mmap_prepare()
"Fix stale IOTLB entries for kernel address space" (Lu Baolu)
Fix a bug (and possible security issue on non-x86) in the IOMMU
code. In some situations the IOMMU could be left hanging onto a
stale kernel pagetable entry
"mm/huge_memory: cleanup __split_unmapped_folio()" (Wei Yang)
Clean up and optimize the folio splitting code
"mm, swap: misc cleanup and bugfix" (Kairui Song)
Some cleanups and a minor fix in the swap discard code
"mm/damon: misc documentation fixups" (SeongJae Park)
"mm/damon: support pin-point targets removal" (SeongJae Park)
Permit userspace to remove a specific monitoring target in the
middle of the current targets list
"mm: MISC follow-up patches for linux/pgalloc.h" (Harry Yoo)
A couple of cleanups related to mm header file inclusion
"mm/swapfile.c: select swap devices of default priority round robin" (Baoquan He)
Improve the selection of swap devices for NUMA machines
"mm: Convert memory block states (MEM_*) macros to enums" (Israel Batista)
Change the memory block labels from macros to enums so they will
appear in kernel debug info
"ksm: perform a range-walk to jump over holes in break_ksm" (Pedro Demarchi Gomes)
Address an inefficiency when KSM unmerges an address range
"mm/damon/tests: fix memory bugs in kunit tests" (SeongJae Park)
Fix leaks and unhandled malloc() failures in DAMON userspace unit
tests
"some cleanups for pageout()" (Baolin Wang)
Clean up a couple of minor things in the page scanner's
writeback-for-eviction code
"mm/hugetlb: refactor sysfs/sysctl interfaces" (Hui Zhu)
Move hugetlb's sysfs/sysctl handling code into a new file
"introduce VM_MAYBE_GUARD and make it sticky" (Lorenzo Stoakes)
Make the VMA guard regions available in /proc/pid/smaps and
improve the mergeability of guarded VMAs
"mm: perform guard region install/remove under VMA lock" (Lorenzo Stoakes)
Reduce mmap lock contention for callers performing VMA guard region
operations
"vma_start_write_killable" (Matthew Wilcox)
Start work on permitting applications to be killed when they are
waiting on a read_lock on the VMA lock
"mm/damon/tests: add more tests for online parameters commit" (SeongJae Park)
Add additional userspace testing of DAMON's "commit" feature
"mm/damon: misc cleanups" (SeongJae Park)
"make VM_SOFTDIRTY a sticky VMA flag" (Lorenzo Stoakes)
Address the possible loss of a VMA's VM_SOFTDIRTY flag when that
VMA is merged with another
"mm: support device-private THP" (Balbir Singh)
Introduce support for Transparent Huge Page (THP) migration in zone
device-private memory
"Optimize folio split in memory failure" (Zi Yan)
"mm/huge_memory: Define split_type and consolidate split support checks" (Wei Yang)
Some more cleanups in the folio splitting code
"mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" (Lorenzo Stoakes)
Clean up our handling of pagetable leaf entries by introducing the
concept of 'software leaf entries', of type softleaf_t
"reparent the THP split queue" (Muchun Song)
Reparent the THP split queue to its parent memcg. This is in
preparation for addressing the long-standing "dying memcg" problem,
wherein dead memcgs linger for too long, consuming memory
resources
"unify PMD scan results and remove redundant cleanup" (Wei Yang)
A little cleanup in the hugepage collapse code
"zram: introduce writeback bio batching" (Sergey Senozhatsky)
Improve zram writeback efficiency by introducing batched bio
writeback support
"memcg: cleanup the memcg stats interfaces" (Shakeel Butt)
Clean up our handling of the interrupt safety of some memcg stats
"make vmalloc gfp flags usage more apparent" (Vishal Moola)
Clean up vmalloc's handling of incoming GFP flags
"mm: Add soft-dirty and uffd-wp support for RISC-V" (Chunyan Zhang)
Teach soft dirty and userfaultfd write protect tracking to use
RISC-V's Svrsw60t59b extension
"mm: swap: small fixes and comment cleanups" (Youngjun Park)
Fix a small bug and clean up some of the swap code
"initial work on making VMA flags a bitmap" (Lorenzo Stoakes)
Start work on converting the vma struct's flags to a bitmap, so we
stop running out of them, especially on 32-bit
"mm/swapfile: fix and cleanup swap list iterations" (Youngjun Park)
Address a possible bug in the swap discard code and clean things
up a little
[ This merge also reverts commit ebb9aeb980e5 ("vfio/nvgrace-gpu:
register device memory for poison handling") because it looks
broken to me, I've asked for clarification - Linus ]
* tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
mm: fix vma_start_write_killable() signal handling
mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate
mm/swapfile: fix list iteration when next node is removed during discard
fs/proc/task_mmu.c: fix make_uffd_wp_huge_pte() huge pte handling
mm/kfence: add reboot notifier to disable KFENCE on shutdown
memcg: remove inc/dec_lruvec_kmem_state helpers
selftests/mm/uffd: initialize char variable to Null
mm: fix DEBUG_RODATA_TEST indentation in Kconfig
mm: introduce VMA flags bitmap type
tools/testing/vma: eliminate dependency on vma->__vm_flags
mm: simplify and rename mm flags function for clarity
mm: declare VMA flags by bit
zram: fix a spelling mistake
mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity
mm/vmscan: skip increasing kswapd_failures when reclaim was boosted
pagemap: update BUDDY flag documentation
mm: swap: remove scan_swap_map_slots() references from comments
mm: swap: change swap_alloc_slow() to void
mm, swap: remove redundant comment for read_swap_cache_async
mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational
...
|
|
Pull VFIO updates from Alex Williamson:
- Move libvfio selftest artifacts in preparation of more tightly
coupled integration with KVM selftests (David Matlack)
- Fix comment typo in mtty driver (Chu Guangqing)
- Support for new hardware revision in the hisi_acc vfio-pci variant
driver where the migration registers can now be accessed via the PF.
When enabled for this support, the full BAR can be exposed to the
user (Longfang Liu)
- Fix vfio cdev support for VF token passing, using the correct size
for the kernel structure, thereby actually allowing userspace to
provide a non-zero UUID token. Also set the match token callback for
the hisi_acc, fixing VF token support for this vfio-pci variant
driver (Raghavendra Rao Ananta)
- Introduce internal callbacks on vfio devices to simplify and
consolidate duplicate code for generating VFIO_DEVICE_GET_REGION_INFO
data, removing various ioctl intercepts with a more structured
solution (Jason Gunthorpe)
- Introduce dma-buf support for vfio-pci devices, allowing MMIO regions
to be exposed through dma-buf objects with lifecycle managed through
move operations. This enables low-level interactions such as
vfio-pci based SPDK drivers interacting directly with dma-buf capable
RDMA devices to enable peer-to-peer operations. IOMMUFD is also now
able to build upon this support to fill a long standing feature gap
versus the legacy vfio type1 IOMMU backend with an implementation of
P2P support for VM use cases that better manages the lifecycle of the
P2P mapping (Leon Romanovsky, Jason Gunthorpe, Vivek Kasireddy)
- Convert eventfd triggering for error and request signals to use RCU
mechanisms in order to avoid a 3-way lockdep reported deadlock issue
(Alex Williamson)
- Fix a 32-bit overflow introduced via dma-buf support manifesting with
large DMA buffers (Alex Mastro)
- Convert nvgrace-gpu vfio-pci variant driver to insert mappings on
fault rather than at mmap time. This conversion serves to make use
of huge PFNMAPs, to avoid corrected RAS events during reset by now
being subject to vfio-pci-core's use of unmap_mapping_range(), and
to enable a device readiness test after reset (Ankit Agrawal)
- Refactoring of vfio selftests to support multi-device tests and split
code to provide better separation between IOMMU and device objects.
This work also enables a new test suite addition to measure parallel
device initialization latency (David Matlack)
* tag 'vfio-v6.19-rc1' of https://github.com/awilliam/linux-vfio: (65 commits)
vfio: selftests: Add vfio_pci_device_init_perf_test
vfio: selftests: Eliminate INVALID_IOVA
vfio: selftests: Split libvfio.h into separate header files
vfio: selftests: Move vfio_selftests_*() helpers into libvfio.c
vfio: selftests: Rename vfio_util.h to libvfio.h
vfio: selftests: Stop passing device for IOMMU operations
vfio: selftests: Move IOVA allocator into iova_allocator.c
vfio: selftests: Move IOMMU library code into iommu.c
vfio: selftests: Rename struct vfio_dma_region to dma_region
vfio: selftests: Upgrade driver logging to dev_err()
vfio: selftests: Prefix logs with device BDF where relevant
vfio: selftests: Eliminate overly chatty logging
vfio: selftests: Support multiple devices in the same container/iommufd
vfio: selftests: Introduce struct iommu
vfio: selftests: Rename struct vfio_iommu_mode to iommu_mode
vfio: selftests: Allow passing multiple BDFs on the command line
vfio: selftests: Split run.sh into separate scripts
vfio: selftests: Move run.sh into scripts directory
vfio/nvgrace-gpu: wait for the GPU mem to be ready
vfio/nvgrace-gpu: Inform devmem unmapped after reset
...
|
|
Use-after-free and double-free bugs can be very difficult to track down.
The kernel is good at tracking these and preventing bad pages from being
used/created through simple checks gated behind "check_pages_enabled".
Currently, the only ways to enable this flag are by building with
CONFIG_DEBUG_VM, or as a side effect of other checks such as
init_on_{alloc, free}, page_poisoning, or debug_pagealloc among others.
These solutions are powerful, but may often be too coarse in balancing
the performance vs. safety that a user may want, particularly in
latency-sensitive production environments.
Introduce a new boot parameter "check_pages", which enables page checking
with no other side effects. It takes kstrbool-able inputs as an argument
(i.e. 0/1, true/false, on/off, ...), as accepted by kstrtobool(). This patch is backwards-compatible;
setting CONFIG_DEBUG_VM still enables page checking.
Acked-by: SeongJae Park <sj@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Link: https://patch.msgid.link/20251201180739.2330474-1-joshua.hahnjy@gmail.com
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
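As an illustration of the kind of values the message lists (0/1, true/false, on/off), the following is a minimal userspace sketch that imitates kstrtobool()-style parsing. It is not the kernel implementation; the helper name parse_bool_param() is hypothetical and only the leading characters of the input are examined.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace imitation of kstrtobool()-style parsing (hypothetical helper,
 * not kernel code). Returns 0 and stores the result on success, -1 when
 * the input is not recognised.
 */
static int parse_bool_param(const char *s, bool *res)
{
	if (!s)
		return -1;

	switch (s[0]) {
	case 'y': case 'Y': case 't': case 'T': case '1':
		*res = true;
		return 0;
	case 'n': case 'N': case 'f': case 'F': case '0':
		*res = false;
		return 0;
	case 'o': case 'O':
		/* "on" and "off" are told apart by the second character */
		switch (s[1]) {
		case 'n': case 'N':
			*res = true;
			return 0;
		case 'f': case 'F':
			*res = false;
			return 0;
		}
		break;
	}
	return -1;
}
```

Under these rules, a kernel command line containing check_pages=on or check_pages=1 would presumably enable the checks, and check_pages=off or check_pages=0 would leave them disabled.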
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab updates from Vlastimil Babka:
- mempool_alloc_bulk() support for upcoming users in the block layer
that need to allocate multiple objects at once with the mempool's
guaranteed progress semantics, which is not achievable by
allocating single objects in a loop. Along with refactoring and
various improvements (Christoph Hellwig)
- Preparations for the upcoming separation of struct slab from struct
page, mostly by removing the struct folio layer, as the purpose of
struct folio has shifted since it became used in slab code (Matthew
Wilcox)
- Modernisation of slab's boot param API usage, which removes some
unexpected parsing corner cases (Petr Tesarik)
- Refactoring of freelist_aba_t (now struct freelist_counters) and
associated functions for double cmpxchg, enabled by -fms-extensions
(Vlastimil Babka)
- Cleanups and improvements related to sheaves caching layer, that were
part of the full conversion to sheaves, which is planned for the next
release (Vlastimil Babka)
* tag 'slab-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (42 commits)
slab: Remove unnecessary call to compound_head() in alloc_from_pcs()
mempool: clarify behavior of mempool_alloc_preallocated()
mempool: drop the file name in the top of file comment
mempool: de-typedef
mempool: remove mempool_{init,create}_kvmalloc_pool
mempool: legitimize the io_schedule_timeout in mempool_alloc_from_pool
mempool: add mempool_{alloc,free}_bulk
mempool: factor out a mempool_alloc_from_pool helper
slab: Remove references to folios from virt_to_slab()
kasan: Remove references to folio in __kasan_mempool_poison_object()
memcg: Convert mem_cgroup_from_obj_folio() to mem_cgroup_from_obj_slab()
mempool: factor out a mempool_adjust_gfp helper
mempool: add error injection support
mempool: improve kerneldoc comments
mm: improve kerneldoc comments for __alloc_pages_bulk
fault-inject: make enum fault_flags available unconditionally
usercopy: Remove folio references from check_heap_object()
slab: Remove folio references from kfree_nolock()
slab: Remove folio references from kfree_rcu_sheaf()
slab: Remove folio references from build_detached_freelist()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull selinux updates from Paul Moore:
- Improve the granularity of SELinux labeling for memfd files
Currently when creating a memfd file, SELinux treats it the same as
any other tmpfs, or hugetlbfs, file. While simple, the drawback is
that it is not possible to differentiate between memfd and tmpfs
files.
This adds a call to the security_inode_init_security_anon() LSM hook
and wires up SELinux to provide a set of memfd specific access
controls, including the ability to control the execution of memfds.
As usual, the commit message has more information.
- Improve the SELinux AVC lookup performance
Adopt MurmurHash3 for the SELinux AVC hash function instead of the
custom hash function currently used. MurmurHash3 is already used for
the SELinux access vector table so the impact to the code is minimal,
and performance tests have shown improvements in both hash
distribution and latency.
See the commit message for the performance measurements.
- Introduce a Kconfig option for the SELinux AVC bucket/slot size
While we have the ability to grow the number of AVC hash buckets
today, the size of the buckets (slot size) is fixed at 512. This pull
request makes that slot size configurable at build time through a new
Kconfig knob, CONFIG_SECURITY_SELINUX_AVC_HASH_BITS.
* tag 'selinux-pr-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: improve bucket distribution uniformity of avc_hash()
selinux: Move avtab_hash() to a shared location for future reuse
selinux: Introduce a new config to make avc cache slot size adjustable
memfd,selinux: call security_inode_init_security_anon()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"These are the arm64 updates for 6.19.
The biggest part is the Arm MPAM driver under drivers/resctrl/.
There's a patch touching mm/ to handle spurious faults for huge pmd
(similar to the pte version). The corresponding arm64 part allows us
to avoid the TLB maintenance if a (huge) page is reused after a write
fault. There's EFI refactoring to allow runtime services with
preemption enabled and the rest is the usual perf/PMU updates and
several cleanups/typos.
Summary:
Core features:
- Basic Arm MPAM (Memory system resource Partitioning And Monitoring)
driver under drivers/resctrl/ which makes use of the fs/resctrl/ API
Perf and PMU:
- Avoid cycle counter on multi-threaded CPUs
- Extend CSPMU device probing and add additional filtering support
for NVIDIA implementations
- Add support for the PMUs on the NoC S3 interconnect
- Add additional compatible strings for new Cortex and C1 CPUs
- Add support for data source filtering to the SPE driver
- Add support for i.MX8QM and "DB" PMU in the imx PMU driver
Memory management:
- Avoid broadcast TLBI if page reused in write fault
- Elide TLB invalidation if the old PTE was not valid
- Drop redundant cpu_set_*_tcr_t0sz() macros
- Propagate pgtable_alloc() errors outside of __create_pgd_mapping()
- Propagate return value from __change_memory_common()
ACPI and EFI:
- Call EFI runtime services without disabling preemption
- Remove unused ACPI function
Miscellaneous:
- ptrace support to disable streaming on SME-only systems
- Improve sysreg generation to include a 'Prefix' descriptor
- Replace __ASSEMBLY__ with __ASSEMBLER__
- Align register dumps in the kselftest zt-test
- Remove some no longer used macros/functions
- Various spelling corrections"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (94 commits)
arm64/mm: Document why linear map split failure upon vm_reset_perms is not problematic
arm64/pageattr: Propagate return value from __change_memory_common
arm64/sysreg: Remove unused define ARM64_FEATURE_FIELD_BITS
KVM: arm64: selftests: Consider all 7 possible levels of cache
KVM: arm64: selftests: Remove ARM64_FEATURE_FIELD_BITS and its last user
arm64: atomics: lse: Remove unused parameters from ATOMIC_FETCH_OP_AND macros
Documentation/arm64: Fix the typo of register names
ACPI: GTDT: Get rid of acpi_arch_timer_mem_init()
perf: arm_spe: Add support for filtering on data source
perf: Add perf_event_attr::config4
perf/imx_ddr: Add support for PMU in DB (system interconnects)
perf/imx_ddr: Get and enable optional clks
perf/imx_ddr: Move ida_alloc() from ddr_perf_init() to ddr_perf_probe()
dt-bindings: perf: fsl-imx-ddr: Add compatible string for i.MX8QM, i.MX8QXP and i.MX8DXL
arm64: remove duplicate ARCH_HAS_MEM_ENCRYPT
arm64: mm: use untagged address to calculate page index
MAINTAINERS: new entry for MPAM Driver
arm_mpam: Add kunit tests for props_mismatch()
arm_mpam: Add kunit test for bitmap reset
arm_mpam: Add helper to reset saved mbwu state
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Heiko Carstens:
- Provide a new interface for dynamic configuration and deconfiguration
of hotplug memory, with or without memmap_on_memory
support. This makes the way memory hotplug is handled on s390 much
more similar to other architectures
- Remove compat support. There shouldn't be any compat user space
around anymore, therefore get rid of a lot of code which also doesn't
need to be tested anymore
- Add stackprotector support. GCC 16 will get new compiler options,
which allow to generate code required for kernel stackprotector
support
- Merge pai_crypto and pai_ext PMU drivers into a new driver. This
removes a lot of duplicated code. The new driver is also extendable
and allows to support new PMUs
- Add driver override support for AP queues
- Rework and extend zcrypt and AP trace events to allow for tracing of
crypto requests
- Support block sizes larger than 65535 bytes for CCW tape devices
- Since the rework of the virtual kernel address space the module area
and the kernel image are within the same 4GB area. This eliminates
the need of weak per cpu variables. Get rid of
ARCH_MODULE_NEEDS_WEAK_PER_CPU
- Various other small improvements and fixes
* tag 's390-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (92 commits)
watchdog: diag288_wdt: Remove KMSG_COMPONENT macro
s390/entry: Use lay instead of aghik
s390/vdso: Get rid of -m64 flag handling
s390/vdso: Rename vdso64 to vdso
s390: Rename head64.S to head.S
s390/vdso: Use common STABS_DEBUG and DWARF_DEBUG macros
s390: Add stackprotector support
s390/modules: Simplify module_finalize() slightly
s390: Remove KMSG_COMPONENT macro
s390/percpu: Get rid of ARCH_MODULE_NEEDS_WEAK_PER_CPU
s390/ap: Restrict driver_override versus apmask and aqmask use
s390/ap: Rename mutex ap_perms_mutex to ap_attr_mutex
s390/ap: Support driver_override for AP queue devices
s390/ap: Use all-bits-one apmask/aqmask for vfio in_use() checks
s390/debug: Update description of resize operation
s390/syscalls: Switch to generic system call table generation
s390/syscalls: Remove system call table pointer from thread_struct
s390/uapi: Remove 31 bit support from uapi header files
s390: Remove compat support
tools: Remove s390 compat support
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fd prepare updates from Christian Brauner:
"This adds the FD_ADD() and FD_PREPARE() primitive. They simplify the
common pattern of get_unused_fd_flags() + create file + fd_install()
that is used extensively throughout the kernel and currently requires
cumbersome cleanup paths.
FD_ADD() - For simple cases where a file is installed immediately:
fd = FD_ADD(O_CLOEXEC, vfio_device_open_file(device));
if (fd < 0)
vfio_device_put_registration(device);
return fd;
FD_PREPARE() - For cases requiring access to the fd or file, or
additional work before publishing:
FD_PREPARE(fdf, O_CLOEXEC, sync_file->file);
if (fdf.err) {
fput(sync_file->file);
return fdf.err;
}
data.fence = fd_prepare_fd(fdf);
if (copy_to_user((void __user *)arg, &data, sizeof(data)))
return -EFAULT;
return fd_publish(fdf);
The primitives are centered around struct fd_prepare. FD_PREPARE()
encapsulates all allocation and cleanup logic and must be followed by
a call to fd_publish() which associates the fd with the file and
installs it into the caller's fdtable. If fd_publish() isn't called,
both are deallocated automatically. FD_ADD() is a shorthand that does
fd_publish() immediately and never exposes the struct to the caller.
I've implemented this in a way that it's compatible with the cleanup
infrastructure while also being usable separately. IOW, it's centered
around struct fd_prepare which is aliased to class_fd_prepare_t and so
we can make use of all the basic guard infrastructure"
* tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits)
io_uring: convert io_create_mock_file() to FD_PREPARE()
file: convert replace_fd() to FD_PREPARE()
vfio: convert vfio_group_ioctl_get_device_fd() to FD_ADD()
tty: convert ptm_open_peer() to FD_ADD()
ntsync: convert ntsync_obj_get_fd() to FD_PREPARE()
media: convert media_request_alloc() to FD_PREPARE()
hv: convert mshv_ioctl_create_partition() to FD_ADD()
gpio: convert linehandle_create() to FD_PREPARE()
pseries: port papr_rtas_setup_file_interface() to FD_ADD()
pseries: convert papr_platform_dump_create_handle() to FD_ADD()
spufs: convert spufs_gang_open() to FD_PREPARE()
papr-hvpipe: convert papr_hvpipe_dev_create_handle() to FD_PREPARE()
spufs: convert spufs_context_open() to FD_PREPARE()
net/socket: convert __sys_accept4_file() to FD_ADD()
net/socket: convert sock_map_fd() to FD_ADD()
net/kcm: convert kcm_ioctl() to FD_PREPARE()
net/handshake: convert handshake_nl_accept_doit() to FD_PREPARE()
secretmem: convert memfd_secret() to FD_ADD()
memfd: convert memfd_create() to FD_ADD()
bpf: convert bpf_token_create() to FD_PREPARE()
...
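The "deallocated automatically unless published" behaviour described above can be approximated in userspace with the GCC/Clang cleanup attribute. The sketch below is only an analogy under that assumption: slot_used[], struct slot_prepare, slot_reserve(), slot_publish() and open_thing() are hypothetical names, not kernel API, and the real primitives also manage the struct file.

```c
#include <stdbool.h>

#define NSLOTS 4
static bool slot_used[NSLOTS];	/* stands in for the fd table */

struct slot_prepare {
	int slot;		/* reserved slot, or -1 if none was free */
	bool published;		/* set once the slot is handed to the caller */
};

/* Reserve the first free slot without making it visible yet. */
static struct slot_prepare slot_reserve(void)
{
	struct slot_prepare p = { .slot = -1, .published = false };

	for (int i = 0; i < NSLOTS; i++) {
		if (!slot_used[i]) {
			slot_used[i] = true;
			p.slot = i;
			break;
		}
	}
	return p;
}

/* Runs when the variable goes out of scope: undo an unpublished reservation. */
static void slot_prepare_cleanup(struct slot_prepare *p)
{
	if (p->slot >= 0 && !p->published)
		slot_used[p->slot] = false;
}

/* Commit the reservation; the cleanup handler then leaves the slot alone. */
static int slot_publish(struct slot_prepare *p)
{
	p->published = true;
	return p->slot;
}

/* Every early return releases the slot without an explicit error path. */
static int open_thing(bool fail_midway)
{
	struct slot_prepare p
		__attribute__((cleanup(slot_prepare_cleanup))) = slot_reserve();

	if (p.slot < 0)
		return p.slot;
	if (fail_midway)
		return -1;	/* cleanup releases the reserved slot */
	return slot_publish(&p);
}
```

The point of the pattern is that the failure branch needs no matching "put" call: leaving the scope without publishing is enough to roll the reservation back.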
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull folio updates from Christian Brauner:
"Add a new folio_next_pos() helper function that returns the file
position of the first byte after the current folio. This is a common
operation in filesystems when needing to know the end of the current
folio.
The helper is lifted from btrfs which already had its own version, and
is now used across multiple filesystems and subsystems:
- btrfs
- buffer
- ext4
- f2fs
- gfs2
- iomap
- netfs
- xfs
- mm
This fixes a long-standing bug in ocfs2 on 32-bit systems with files
larger than 2GiB. Presumably this is not a common configuration, but
the fix is backported anyway. The other filesystems did not have bugs,
they were just mildly inefficient.
This also introduces uoff_t as the unsigned version of loff_t. A recent
commit inadvertently changed a comparison from being unsigned (on
64-bit systems) to being signed (which it had always been on 32-bit
systems), leading to sporadic fstests failures.
Generally file sizes are restricted to being a signed integer, but in
places where -1 is passed to indicate "up to the end of the file", it
is convenient to have an unsigned type to ensure comparisons are
always unsigned regardless of architecture"
* tag 'vfs-6.19-rc1.folio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Add uoff_t
mm: Use folio_next_pos()
xfs: Use folio_next_pos()
netfs: Use folio_next_pos()
iomap: Use folio_next_pos()
gfs2: Use folio_next_pos()
f2fs: Use folio_next_pos()
ext4: Use folio_next_pos()
buffer: Use folio_next_pos()
btrfs: Use folio_next_pos()
filemap: Add folio_next_pos()
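Two of the points above can be sketched in userspace C. The first pair of functions shows one way an end-of-folio position can go wrong when the shift is done in 32 bits; the second pair shows why an unsigned comparison matters when -1 means "up to the end of the file". All names here are hypothetical, 4 KiB pages are assumed, and this is not claimed to be the exact ocfs2 or fstests failure.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/*
 * Shift performed in 32 bits, as arithmetic on a 32-bit "unsigned long"
 * would be: the byte position wraps once it reaches 4 GiB.
 */
static int64_t next_pos_narrow(uint32_t index, uint32_t nr_pages)
{
	uint32_t end = (index + nr_pages) << SKETCH_PAGE_SHIFT;

	return (int64_t)end;
}

/* Widen to 64 bits before shifting, so large files are handled. */
static int64_t next_pos_wide(uint32_t index, uint32_t nr_pages)
{
	return ((int64_t)index + nr_pages) << SKETCH_PAGE_SHIFT;
}

/* Signed comparison: an end of -1 is smaller than every valid position. */
static bool before_end_signed(int64_t pos, int64_t end)
{
	return pos < end;
}

/* Unsigned comparison: an end of -1 becomes the largest possible value. */
static bool before_end_unsigned(int64_t pos, int64_t end)
{
	return (uint64_t)pos < (uint64_t)end;
}
```

With a folio whose last page ends exactly at 4 GiB, the narrow variant wraps to 0 while the wide variant yields 4294967296. Likewise, a position of 4096 compared against an end of -1 is "past the end" when compared as signed and "before the end" when compared as unsigned.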
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull writeback updates from Christian Brauner:
"Features:
- Allow file systems to increase the minimum writeback chunk size.
The relatively low minimal writeback size of 4MiB means that
written back inodes on rotational media are switched a lot. Besides
introducing additional seeks, this also can lead to extreme file
fragmentation on zoned devices when a lot of files are cached
relative to the available writeback bandwidth.
This adds a superblock field that allows the file system to
override the default size, and sets it to the zone size for zoned
XFS.
- Add logging for slow writeback when it exceeds
sysctl_hung_task_timeout_secs. This helps identify tasks waiting
for a long time and pinpoint potential issues. Recording the
starting jiffies is also useful when debugging a crashed vmcore.
- Wake up waiting tasks when finishing the writeback of a chunk
Cleanups:
- filemap_* writeback interface cleanups.
Adding filemap_fdatawrite_wbc ended up being a mistake, as all but
the original btrfs caller should be using better high level
interfaces instead.
This series removes all these low-level interfaces, switches btrfs
to a more specific interface, and cleans up other too low-level
interfaces. With this the writeback_control that is passed to the
writeback code is only initialized in three places.
- Remove __filemap_fdatawrite, __filemap_fdatawrite_range, and
filemap_fdatawrite_wbc
- Add filemap_flush_nr helper for btrfs
- Push struct writeback_control into start_delalloc_inodes in btrfs
- Rename filemap_fdatawrite_range_kick to filemap_flush_range
- Stop opencoding filemap_fdatawrite_range in 9p, ocfs2, and mm
- Make wbc_to_tag() inline and use it in fs"
* tag 'vfs-6.19-rc1.writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
fs: Make wbc_to_tag() inline and use it in fs.
xfs: set s_min_writeback_pages for zoned file systems
writeback: allow the file system to override MIN_WRITEBACK_PAGES
writeback: cleanup writeback_chunk_size
mm: rename filemap_fdatawrite_range_kick to filemap_flush_range
mm: remove __filemap_fdatawrite_range
mm: remove filemap_fdatawrite_wbc
mm: remove __filemap_fdatawrite
mm,btrfs: add a filemap_flush_nr helper
btrfs: push struct writeback_control into start_delalloc_inodes
btrfs: use the local tmp_inode variable in start_delalloc_inodes
ocfs2: don't opencode filemap_fdatawrite_range in ocfs2_journal_submit_inode_data_buffers
9p: don't opencode filemap_fdatawrite_range in v9fs_mmap_vm_close
mm: don't opencode filemap_fdatawrite_range in filemap_invalidate_inode
writeback: Add logging for slow writeback (exceeds sysctl_hung_task_timeout_secs)
writeback: Wake up waiting tasks when finishing the writeback of a chunk.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs inode updates from Christian Brauner:
"Features:
- Hide inode->i_state behind accessors. Open-coded accesses prevent
asserting they are done correctly. One obvious aspect is locking,
but significantly more can be checked. For example it can be
detected when the code is clearing flags which are already missing,
or is setting flags when it is illegal (e.g., I_FREEING when
->i_count > 0)
- Provide accessors for ->i_state, converts all filesystems using
coccinelle and manual conversions (btrfs, ceph, smb, f2fs, gfs2,
overlayfs, nilfs2, xfs), and makes plain ->i_state access fail to
compile
- Rework I_NEW handling to operate without fences, simplifying the
code after the accessor infrastructure is in place
Cleanups:
- Move wait_on_inode() from writeback.h to fs.h
- Spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
for clarity
- Cosmetic fixes to LRU handling
- Push list presence check into inode_io_list_del()
- Touch up predicts in __d_lookup_rcu()
- ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
- Assert on ->i_count in iput_final()
- Assert ->i_lock held in __iget()
Fixes:
- Add missing fences to I_NEW handling"
* tag 'vfs-6.19-rc1.inode' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (22 commits)
dcache: touch up predicts in __d_lookup_rcu()
fs: push list presence check into inode_io_list_del()
fs: cosmetic fixes to lru handling
fs: rework I_NEW handling to operate without fences
fs: make plain ->i_state access fail to compile
xfs: use the new ->i_state accessors
nilfs2: use the new ->i_state accessors
overlayfs: use the new ->i_state accessors
gfs2: use the new ->i_state accessors
f2fs: use the new ->i_state accessors
smb: use the new ->i_state accessors
ceph: use the new ->i_state accessors
btrfs: use the new ->i_state accessors
Manual conversion to use ->i_state accessors of all places not covered by coccinelle
Coccinelle-based conversion to use ->i_state accessors
fs: provide accessors for ->i_state
fs: spell out fenced ->i_state accesses with explicit smp_wmb/smp_rmb
fs: move wait_on_inode() from writeback.h to fs.h
fs: add missing fences to I_NEW handling
ocfs2: retire ocfs2_drop_inode() and I_WILL_FREE usage
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull iomap updates from Christian Brauner:
"FUSE iomap Support for Buffered Reads:
This adds iomap support for FUSE buffered reads and readahead. This
enables granular uptodate tracking with large folios so only
non-uptodate portions need to be read. Also fixes a race condition
with large folios + writeback cache that could cause data corruption
on partial writes followed by reads.
- Refactored iomap read/readahead bio logic into helpers
- Added caller-provided callbacks for read operations
- Moved buffered IO bio logic into new file
- FUSE now uses iomap for read_folio and readahead
Zero Range Folio Batch Support:
Add folio batch support for iomap_zero_range() to handle dirty
folios over unwritten mappings. Fix raciness issues where dirty data
could be lost during zero range operations.
- filemap_get_folios_tag_range() helper for dirty folio lookup
- Optional zero range dirty folio processing
- XFS fills dirty folios on zero range of unwritten mappings
- Removed old partial EOF zeroing optimization
DIO Write Completions from Interrupt Context:
Restore pre-iomap behavior where pure overwrite completions run
inline rather than being deferred to workqueue. Reduces context
switches for high-performance workloads like ScyllaDB.
- Removed unused IOCB_DIO_CALLER_COMP code
- Error completions always run in user context (fixes zonefs)
- Reworked REQ_FUA selection logic
- Inverted IOMAP_DIO_INLINE_COMP to IOMAP_DIO_OFFLOAD_COMP
Buffered IO Cleanups:
Some performance and code clarity improvements:
- Replace manual bitmap scanning with find_next_bit()
- Simplify read skip logic for writes
- Optimize pending async writeback accounting
- Better variable naming
- Documentation for iomap_finish_folio_write() requirements
Misaligned Vectors for Zoned XFS:
Enables sub-block aligned vectors in XFS always-COW mode for zoned
devices via new IOMAP_DIO_FSBLOCK_ALIGNED flag.
Bug Fixes:
- Allocate s_dio_done_wq for async reads (fixes syzbot report after
error completion changes)
- Fix iomap_read_end() for already uptodate folios (regression fix)"
* tag 'vfs-6.19-rc1.iomap' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (40 commits)
iomap: allocate s_dio_done_wq for async reads as well
iomap: fix iomap_read_end() for already uptodate folios
iomap: invert the polarity of IOMAP_DIO_INLINE_COMP
iomap: support write completions from interrupt context
iomap: rework REQ_FUA selection
iomap: always run error completions in user context
fs, iomap: remove IOCB_DIO_CALLER_COMP
iomap: use find_next_bit() for uptodate bitmap scanning
iomap: use find_next_bit() for dirty bitmap scanning
iomap: simplify when reads can be skipped for writes
iomap: simplify ->read_folio_range() error handling for reads
iomap: optimize pending async writeback accounting
docs: document iomap writeback's iomap_finish_folio_write() requirement
iomap: account for unaligned end offsets when truncating read range
iomap: rename bytes_pending/bytes_accounted to bytes_submitted/bytes_not_submitted
xfs: support sub-block aligned vectors in always COW mode
iomap: add IOMAP_DIO_FSBLOCK_ALIGNED flag
xfs: error tag to force zeroing on debug kernels
iomap: remove old partial eof zeroing optimization
xfs: fill dirty folios on zero range of unwritten mappings
...
|
|
If we get a signal, we need to restore the vm_refcnt. We don't think that
the refcount can actually be decremented to zero here as it requires the
VMA to be detached, and vma_mark_detached() uses TASK_UNINTERRUPTIBLE.
However, that's a bit subtle, so handle it as if the refcount was zero at
the start of this function.
Link: https://lkml.kernel.org/r/20251128040100.3022561-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: syzbot+5b19bad23ac7f44bf8b8@syzkaller.appspotmail.com
Fixes: 2197bb60f890 ("mm: add vma_start_write_killable()")
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The loop breaks immediately after finding the first swap device and
never modifies the list. Replace plist_for_each_entry_safe() with
plist_for_each_entry() and remove the unused next variable.
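As an illustration only, the difference between the two iterator styles can be sketched in userspace C. The names below (struct node, for_each_node, for_each_node_safe, first_id, demo_first_id) are hypothetical stand-ins and not the kernel's plist API; the sketch assumes a plain circular doubly linked list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a swap device on a circular list; not kernel code. */
struct node {
	struct node *prev, *next;
	int id;
};

/* Plain iteration: only the cursor is tracked. */
#define for_each_node(pos, head) \
	for ((pos) = (head)->next; (pos) != (head); (pos) = (pos)->next)

/* "Safe" iteration: caches the successor so the body may unlink pos. */
#define for_each_node_safe(pos, n, head) \
	for ((pos) = (head)->next, (n) = (pos)->next; (pos) != (head); \
	     (pos) = (n), (n) = (pos)->next)

static void list_init(struct node *head)
{
	head->prev = head->next = head;
	head->id = -1;
}

static void list_add_tail(struct node *item, struct node *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/* Returns the id of the first entry, or -1 for an empty list. */
static int first_id(struct node *head)
{
	struct node *pos;
	int id = -1;

	for_each_node(pos, head) {
		id = pos->id;
		break;	/* leaves on the first entry; nothing is unlinked */
	}
	return id;
}

/* Builds a two-entry list and checks that both iterator styles agree. */
static int demo_first_id(void)
{
	struct node head, a = { .id = 10 }, b = { .id = 20 };
	struct node *pos, *n;
	int safe_id = -1;

	list_init(&head);
	list_add_tail(&a, &head);
	list_add_tail(&b, &head);

	for_each_node_safe(pos, n, &head) {
		safe_id = pos->id;
		break;	/* the cached successor n is never needed */
	}
	assert(safe_id == first_id(&head));
	return first_id(&head);
}
```

Because the loop body exits on the first entry and never unlinks it, the cached successor in the "safe" form is dead weight, which is the point of the cleanup.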
Link: https://lkml.kernel.org/r/20251127100303.783198-3-youngjun.park@lge.com
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Barry Song <baohua@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/swapfile: fix and cleanup swap list iterations", v2.
This series fixes a potential list iteration issue in swap_sync_discard()
when devices are removed, and includes a cleanup for
__folio_throttle_swaprate().
This patch (of 2):
When the next node is removed from the plist (e.g. by swapoff),
plist_del() makes the node point to itself, causing the iteration to loop
on the same entry indefinitely.
Add a plist_node_empty() check to detect this case and restart iteration,
allowing swap_sync_discard() to continue processing remaining swap devices
that still have pending discard entries.
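The failure mode and the restart can be sketched in userspace C. This is a simplified model under stated assumptions, not the kernel code: struct dev, dev_del_init(), dev_unlinked() and process_all() are hypothetical names, and the concurrent swapoff is simulated by unlinking a "victim" node from inside the loop body.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a swap device; not the kernel's plist_node. */
struct dev {
	struct dev *prev, *next;
	int pending;	/* pending discard work, cleared once processed */
};

static void dev_list_init(struct dev *head)
{
	head->prev = head->next = head;
	head->pending = 0;
}

static void dev_add_tail(struct dev *d, struct dev *head)
{
	d->prev = head->prev;
	d->next = head;
	head->prev->next = d;
	head->prev = d;
}

/* Unlink and re-initialise, so the removed node points to itself. */
static void dev_del_init(struct dev *d)
{
	d->prev->next = d->next;
	d->next->prev = d->prev;
	d->prev = d->next = d;
}

static bool dev_unlinked(const struct dev *d)
{
	return d->next == d;
}

/*
 * Process every device that still has pending work. 'victim' simulates a
 * concurrent removal: it is unlinked while the first device is handled.
 * Returns the number of devices processed.
 */
static int process_all(struct dev *head, struct dev *victim)
{
	struct dev *pos, *next;
	int processed = 0;

restart:
	for (pos = head->next, next = pos->next; pos != head;
	     pos = next, next = pos->next) {
		if (pos->pending) {
			pos->pending = 0;
			processed++;
		}
		if (victim) {	/* simulated swapoff of another device */
			dev_del_init(victim);
			victim = NULL;
		}
		/* Following an unlinked successor would spin on it forever. */
		if (dev_unlinked(next))
			goto restart;
	}
	return processed;
}

/* Three pending devices; the middle one is removed during the walk. */
static int demo_restart(void)
{
	struct dev head, a = { .pending = 1 }, b = { .pending = 1 },
		   c = { .pending = 1 };
	int n;

	dev_list_init(&head);
	dev_add_tail(&a, &head);
	dev_add_tail(&b, &head);
	dev_add_tail(&c, &head);

	n = process_all(&head, &b);
	assert(dev_unlinked(&b));
	assert(!a.pending && !c.pending);
	return n;
}
```

Without the dev_unlinked() check, the walk in this model would step onto the removed node, whose successor is itself, and never reach the list head again. Restarting is cheap here because entries that were already handled have no pending work left.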
Additionally, switch from swap_avail_lock/swap_avail_head to
swap_lock/swap_active_head so that iteration is only affected by swapoff
operations rather than frequent availability changes, reducing exceptional
condition checks and lock contention.
Link: https://lkml.kernel.org/r/20251127100303.783198-1-youngjun.park@lge.com
Link: https://lkml.kernel.org/r/20251127100303.783198-2-youngjun.park@lge.com
Fixes: 686ea517f471 ("mm, swap: do not perform synchronous discard during allocation")
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
Suggested-by: Kairui Song <kasong@tencent.com>
Acked-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
During system shutdown, KFENCE can cause IPI synchronization issues if it
remains active through the reboot process. To prevent this, register a
reboot notifier that disables KFENCE and cancels any pending timer work
early in the shutdown sequence.
This is only necessary when CONFIG_KFENCE_STATIC_KEYS is enabled, as this
configuration sends IPIs that can interfere with shutdown. Without static
keys, no IPIs are generated and KFENCE can safely remain active.
The notifier uses maximum priority (INT_MAX) to ensure KFENCE shuts down
before other subsystems that might still depend on stable memory
allocation behavior.
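The ordering argument can be illustrated with a small userspace model of a priority-sorted notifier chain. All names here (struct notifier, chain_register(), chain_call(), sampler_reboot(), other_reboot()) are hypothetical and only mimic the idea that higher-priority callbacks run first; this is not the kernel's notifier API.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical userspace model of a priority-ordered notifier chain. */
struct notifier {
	int (*call)(struct notifier *nb);
	int priority;
	struct notifier *next;
};

/* Insert so that higher priorities come first in the chain. */
static void chain_register(struct notifier **head, struct notifier *nb)
{
	while (*head && (*head)->priority >= nb->priority)
		head = &(*head)->next;
	nb->next = *head;
	*head = nb;
}

static void chain_call(struct notifier *head)
{
	for (; head; head = head->next)
		head->call(head);
}

static int sampler_enabled = 1;		/* stands in for the KFENCE enable flag */
static int sampler_seen_by_other = -1;	/* what a later callback observed */

static int sampler_reboot(struct notifier *nb)
{
	(void)nb;
	sampler_enabled = 0;	/* also the place to cancel pending timer work */
	return 0;
}

static int other_reboot(struct notifier *nb)
{
	(void)nb;
	sampler_seen_by_other = sampler_enabled;
	return 0;
}

/* Returns the sampler state seen by the default-priority callback. */
static int demo_reboot_order(void)
{
	struct notifier other = { .call = other_reboot, .priority = 0 };
	struct notifier sampler = { .call = sampler_reboot, .priority = INT_MAX };
	struct notifier *chain = NULL;

	sampler_enabled = 1;
	sampler_seen_by_other = -1;

	chain_register(&chain, &other);
	chain_register(&chain, &sampler);	/* registered later, still runs first */
	chain_call(chain);
	return sampler_seen_by_other;
}
```

In this model the INT_MAX callback runs before the default-priority one regardless of registration order, so every later callback already sees the sampler disabled.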
This fixes a late kexec CSD lockup[1] when kfence is trying to IPI a CPU
that is busy in an IRQ-disabled context printing characters to the console.
Link: https://lkml.kernel.org/r/20251127-kfence-v2-1-daeccb5ef9aa@debian.org
Link: https://lkml.kernel.org/r/20251126-kfence-v1-1-5a6e1d7c681c@debian.org
Link: https://lore.kernel.org/all/sqwajvt7utnt463tzxgwu2yctyn5m6bjwrslsnupfexeml6hkd@v6sqmpbu3vvu/ [1]
Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure")
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The dec_lruvec_kmem_state helper is unused by any caller and can be safely
removed. Meanwhile, the inc_lruvec_kmem_state helper is only referenced
by shadow_lru_isolate, so retaining these two helpers is unnecessary. This
patch removes both helper functions to eliminate redundant code.
Link: https://lkml.kernel.org/r/20251126020435.1511637-1-chenridong@huaweicloud.com
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Acked-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Lu Jialin <lujialin4@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Most of the DEBUG_RODATA_TEST section is indented by four spaces instead
of the customary single TAB.
Link: https://lkml.kernel.org/r/74f39b1bffc6ed802088cb3e7d17b4c82330e8b3.1764058676.git.geert@linux-m68k.org
Fixes: 2959a5f726f6 ("mm: add arch-independent testcases for RODATA")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Jinbum Park <jinb.park7@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|