path: root/mm/ksm.c
2021-05-14ksm: revert "use GET_KSM_PAGE_NOLOCK to get ksm page in remove_rmap_item_from_tree()"Hugh Dickins
This reverts commit 3e96b6a2e9ad929a3230a22f4d64a74671a0720b. General Protection Fault in rmap_walk_ksm() under memory pressure: remove_rmap_item_from_tree() needs to take the page lock, of course. Link: Signed-off-by: Hugh Dickins <> Cc: Miaohe Lin <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2021-05-07mm: fix typos in commentsIngo Molnar
Fix ~94 single-word typos in locking code comments, plus a few very obvious grammar mistakes. Link: Link: Signed-off-by: Ingo Molnar <> Reviewed-by: Matthew Wilcox (Oracle) <> Reviewed-by: Randy Dunlap <> Cc: Bhaskar Chowdhury <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2021-05-07drivers/char: remove /dev/kmem for goodDavid Hildenbrand
Patch series "drivers/char: remove /dev/kmem for good". Exploring /dev/kmem and /dev/mem in the context of memory hot(un)plug and memory ballooning, I started questioning the existence of /dev/kmem. Comparing it with the /proc/kcore implementation, it does not seem to be able to deal with things like a) Pages unmapped from the direct mapping (e.g., to be used by secretmem) -> kern_addr_valid(). virt_addr_valid() is not sufficient. b) Special cases like gart aperture memory that is not to be touched -> mem_pfn_is_ram() Unless I am missing something, it's at least broken in some cases and might fault/crash the machine. Its existence has been questioned before, in 2005 and 2010 [1]; after ~11 additional years, it might make sense to revive the discussion. CONFIG_DEVKMEM is only enabled in a single defconfig (on purpose or by mistake?). All distributions disable it: in Ubuntu it has been disabled for more than 10 years, in Debian since 2.6.31, in Fedora at least starting with FC3, in RHEL starting with RHEL4, in SUSE starting from 15sp2, and OpenSUSE has it disabled as well. 1) /dev/kmem was popular for rootkits [2] before it got disabled basically everywhere. Ubuntu documents [3] "There is no modern user of /dev/kmem any more beyond attackers using it to load kernel rootkits.". RHEL documents in a BZ [5] "it served no practical purpose other than to serve as a potential security problem or to enable binary module drivers to access structures/functions they shouldn't be touching". 2) /proc/kcore is a decent interface for a controlled way to read kernel memory for debugging purposes (it will need some extensions to deal with memory offlining/unplug, memory ballooning, and poisoned pages, though). 3) It might be useful for corner case debugging [1]. KDB/KGDB might be a better fit, especially for writing random memory; it's harder to shoot yourself in the foot. 
4) "Kernel Memory Editor" [4] hasn't seen any updates since 2000 and seems to be incompatible with 64bit [1]. For educational purposes, /proc/kcore might be used to monitor value updates -- or older kernels can be used. 5) It's broken on arm64, and therefore, completely disabled there. Looks like it's essentially unused and has been replaced by better suited interfaces for individual tasks (/proc/kcore, KDB/KGDB). Let's just remove it. [1] [2] [3] [4] [5] Link: Link: Signed-off-by: David Hildenbrand <> Acked-by: Michal Hocko <> Acked-by: Kees Cook <> Cc: Linus Torvalds <> Cc: Greg Kroah-Hartman <> Cc: "Alexander A. Klimov" <> Cc: Alexander Viro <> Cc: Alexandre Belloni <> Cc: Andrew Lunn <> Cc: Andrey Zhizhikin <> Cc: Arnd Bergmann <> Cc: Benjamin Herrenschmidt <> Cc: Brian Cain <> Cc: Christian Borntraeger <> Cc: Christophe Leroy <> Cc: Chris Zankel <> Cc: Corentin Labbe <> Cc: "David S. Miller" <> Cc: "Eric W. Biederman" <> Cc: Geert Uytterhoeven <> Cc: Gerald Schaefer <> Cc: Greentime Hu <> Cc: Gregory Clement <> Cc: Heiko Carstens <> Cc: Helge Deller <> Cc: Hillf Danton <> Cc: huang ying <> Cc: Ingo Molnar <> Cc: Ivan Kokshaysky <> Cc: "James E.J. 
Bottomley" <> Cc: James Troup <> Cc: Jiaxun Yang <> Cc: Jonas Bonn <> Cc: Jonathan Corbet <> Cc: Kairui Song <> Cc: Krzysztof Kozlowski <> Cc: Kuninori Morimoto <> Cc: Liviu Dudau <> Cc: Lorenzo Pieralisi <> Cc: Luc Van Oostenryck <> Cc: Luis Chamberlain <> Cc: Matthew Wilcox <> Cc: Matt Turner <> Cc: Max Filippov <> Cc: Michael Ellerman <> Cc: Mike Rapoport <> Cc: Mikulas Patocka <> Cc: Minchan Kim <> Cc: Niklas Schnelle <> Cc: Oleksiy Avramchenko <> Cc: Cc: Palmer Dabbelt <> Cc: Paul Mackerras <> Cc: "Pavel Machek (CIP)" <> Cc: Pavel Machek <> Cc: "Peter Zijlstra (Intel)" <> Cc: Pierre Morel <> Cc: Randy Dunlap <> Cc: Richard Henderson <> Cc: Rich Felker <> Cc: Robert Richter <> Cc: Rob Herring <> Cc: Russell King <> Cc: Sam Ravnborg <> Cc: Sebastian Andrzej Siewior <> Cc: Sebastian Hesselbarth <> Cc: Cc: Stafford Horne <> Cc: Stefan Kristiansson <> Cc: Steven Rostedt <> Cc: Sudeep Holla <> Cc: Theodore Dubois <> Cc: Thomas Bogendoerfer <> Cc: Thomas Gleixner <> Cc: Vasily Gorbik <> Cc: Viresh Kumar <> Cc: William Cohen <> Cc: Xiaoming Ni <> Cc: Yoshinori Sato <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2021-05-05mm/ksm: remove unused parameter from remove_trailing_rmap_items()Chengyang Fan
Since commit 6514d511dbe5 ("ksm: singly-linked rmap_list") was merged, remove_trailing_rmap_items() doesn't use the 'mm_slot' parameter. So remove it, and update caller accordingly. Link: Signed-off-by: Chengyang Fan <> Reviewed-by: David Hildenbrand <> Cc: Hugh Dickins <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2021-05-05ksm: fix potential missing rmap_item for stable_nodeMiaohe Lin
When removing rmap_item from the stable tree, the STABLE_FLAG of rmap_item is cleared with head reserved. So the following scenario might happen: For a ksm page with rmap_item1: cmp_and_merge_page stable_node->head = &migrate_nodes; remove_rmap_item_from_tree, but head still equals stable_node; try_to_merge_with_ksm_page failed; return; For the same ksm page with rmap_item2, stable node migration succeeds this time. The stable_node->head does not equal &migrate_nodes now. For the ksm page with rmap_item1 again: cmp_and_merge_page stable_node->head != &migrate_nodes && rmap_item->head == stable_node return; We would miss the rmap_item for the stable_node, which might result in a failed rmap_walk_ksm(). Fix this by setting rmap_item->head to NULL when the rmap_item is removed from the stable tree. Link: Fixes: 4146d2d673e8 ("ksm: make !merge_across_nodes migration safe") Signed-off-by: Miaohe Lin <> Cc: Hugh Dickins <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2021-05-05ksm: remove dedicated macro KSM_FLAG_MASKMiaohe Lin
The macro KSM_FLAG_MASK is used in rmap_walk_ksm() only. So we can replace ~KSM_FLAG_MASK with PAGE_MASK, removing this dedicated macro and making the code more consistent, because PAGE_MASK is used elsewhere in this file. Link: Signed-off-by: Miaohe Lin <> Cc: Hugh Dickins <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
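A small userspace sketch of why the two masks are interchangeable here. The flag values mirror those in mm/ksm.c (SEQNR_MASK, UNSTABLE_FLAG, STABLE_FLAG packed into the low bits of rmap_item->address); PAGE_SHIFT is assumed to be 12. Since the flag bits all lie below PAGE_SHIFT and the underlying address is page aligned, masking with either ~KSM_FLAG_MASK or PAGE_MASK recovers the same address:

```c
#include <assert.h>

/* Assumed 4K pages; flag constants as in mm/ksm.c. */
#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PAGE_MASK	(~(PAGE_SIZE - 1))

#define SEQNR_MASK	0x0ffUL	/* low bits of address hold a scan seqnr */
#define UNSTABLE_FLAG	0x100UL
#define STABLE_FLAG	0x200UL
#define KSM_FLAG_MASK	(SEQNR_MASK | UNSTABLE_FLAG | STABLE_FLAG)

/* Before the patch: strip only the KSM flag bits. */
static unsigned long addr_old(unsigned long address)
{
	return address & ~KSM_FLAG_MASK;
}

/* After the patch: strip everything below the page boundary. */
static unsigned long addr_new(unsigned long address)
{
	return address & PAGE_MASK;
}
```

Bits 10-11 differ between the two masks, but they are always zero in a page-aligned address, so the results agree.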
2021-05-05ksm: use GET_KSM_PAGE_NOLOCK to get ksm page in remove_rmap_item_from_tree()Miaohe Lin
It's unnecessary to lock the page when getting the ksm page if we're going to remove the rmap item, as page migration is irrelevant in this case. Use GET_KSM_PAGE_NOLOCK instead to save some page lock cycles. Link: Signed-off-by: Miaohe Lin <> Cc: Hugh Dickins <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2021-05-05ksm: remove redundant VM_BUG_ON_PAGE() on stable_tree_search()Miaohe Lin
Patch series "Cleanup and fixup for ksm". This series contains cleanups to remove an unnecessary VM_BUG_ON_PAGE and the dedicated macro KSM_FLAG_MASK. It also fixes a potential missing rmap_item for stable_node which would result in a failed rmap_walk_ksm(). More details can be found in the respective changelogs. This patch (of 4): The same VM_BUG_ON_PAGE() check is already done in the callee. Remove the extra one in the caller to simplify the code slightly. Link: Link: Signed-off-by: Miaohe Lin <> Cc: Hugh Dickins <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2020-12-15mm: cleanup kstrto*() usageAlexey Dobriyan
Range checks can be folded into the proper conversion function; kstrto*() variants exist for all arithmetic types. Link: Signed-off-by: Alexey Dobriyan <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
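The pattern being cleaned up: code that parsed into a wide type and then open-coded a range check can instead call the kstrto*() helper for the exact target type. The function below is a userspace model of the kernel's kstrtouint() (built on strtoull, which the kernel helper is not), just to show where the folded-in check lives:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Userspace model of kstrtouint(): parse a string into an unsigned int,
 * rejecting trailing garbage and out-of-range values. In the kernel the
 * caller no longer needs a separate "if (val > UINT_MAX)" after parsing
 * into an unsigned long. */
static int kstrtouint_model(const char *s, unsigned int base, unsigned int *res)
{
	char *end;
	unsigned long long v;

	errno = 0;
	v = strtoull(s, &end, base);
	if (errno || end == s || *end != '\0')
		return -EINVAL;		/* not a clean number */
	if (v > UINT_MAX)
		return -ERANGE;		/* the check callers used to open-code */
	*res = (unsigned int)v;
	return 0;
}
```

With the check inside the helper, sysfs store handlers shrink to a single call plus an error return.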
2020-12-15mm: use sysfs_emit for struct kobject * usesJoe Perches
Patch series "mm: Convert sysfs sprintf family to sysfs_emit", v2. Use the new sysfs_emit family and not the sprintf family. This patch (of 5): Use the sysfs_emit function instead of the sprintf family. Done with cocci script as in commit 3c6bff3cf988 ("RDMA: Convert sysfs kobject * show functions to use sysfs_emit()") Link: Link: Signed-off-by: Joe Perches <> Cc: Mike Kravetz <> Cc: Hugh Dickins <> Cc: Christoph Lameter <> Cc: Pekka Enberg <> Cc: David Rientjes <> Cc: Joonsoo Kim <> Cc: Matthew Wilcox <> Cc: Greg Kroah-Hartman <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
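A sketch of the conversion this series performs. sysfs_emit() behaves like a printf bounded by the one-page sysfs buffer; the model below approximates it with vsnprintf (the real kernel helper also warns if the buffer is not page aligned, which is omitted here). The show() function is a hypothetical example in the shape of the ksm sysfs attributes, not a copy of the kernel code:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096	/* sysfs hands show() a single page */

/* Userspace model of sysfs_emit(): formatted output capped at
 * PAGE_SIZE, so a show() method cannot overrun the sysfs buffer the
 * way an unchecked sprintf() could. */
static int sysfs_emit_model(char *buf, const char *fmt, ...)
{
	va_list args;
	int len;

	va_start(args, fmt);
	len = vsnprintf(buf, PAGE_SIZE, fmt, args);
	va_end(args);
	return len < PAGE_SIZE ? len : PAGE_SIZE - 1;
}

/* Typical before/after: sprintf(buf, "%lu\n", val) becomes
 * sysfs_emit(buf, "%lu\n", val). */
static int pages_shared_show(char *buf, unsigned long ksm_pages_shared)
{
	return sysfs_emit_model(buf, "%lu\n", ksm_pages_shared);
}
```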
2020-10-15docs: get rid of :c:type explicit declarations for structsMauro Carvalho Chehab
The :c:type:`foo` markup only works properly with structs before Sphinx 3.x. On Sphinx 3.x, structs should now be declared using .. c:struct and referenced via the :c:struct tag. As we now have a macro that automatically converts struct foo into cross-references, let's get rid of the explicit declarations, solving several warnings when building docs with Sphinx 3.x. Reviewed-by: André Almeida <> # blk-mq.rst Reviewed-by: Takashi Iwai <> # sound Reviewed-by: Mike Rapoport <> Reviewed-by: Greg Kroah-Hartman <> Signed-off-by: Mauro Carvalho Chehab <>
2020-09-19ksm: reinstate memcg charge on copied pagesHugh Dickins
Patch series "mm: fixes to past from future testing". Here's a set of independent fixes against 5.9-rc2: prompted by testing Alex Shi's "warning on !memcg" and lru_lock series, but I think fit for 5.9 - though maybe only the first for stable. This patch (of 5): In 5.8 some instances of memcg charging in do_swap_page() and unuse_pte() were removed, on the understanding that swap cache is now already charged at those points; but a case was missed, when ksm_might_need_to_copy() has decided it must allocate a substitute page: such pages were never charged. Fix it inside ksm_might_need_to_copy(). This was discovered by Alex Shi's prospective commit "mm/memcg: warning on !memcg after readahead page charged". But there is another surprise: this also fixes some rarer uncharged PageAnon cases, when KSM is configured in, but has never been activated. ksm_might_need_to_copy()'s anon_vma->root and linear_page_index() check sometimes catches a case which would need to have been copied if KSM were turned on. Or that's my optimistic interpretation (of my own old code), but it leaves some doubt as to whether everything is working as intended there - might it hint at rare anon ptes which rmap cannot find? A question not easily answered: put in the fix for missed memcg charges. Cc: Matthew Wilcox <> Fixes: 4c6355b25e8b ("mm: memcontrol: charge swapin pages on instantiation") Signed-off-by: Hugh Dickins <> Signed-off-by: Andrew Morton <> Reviewed-by: Shakeel Butt <> Acked-by: Johannes Weiner <> Cc: Alex Shi <> Cc: Michal Hocko <> Cc: Mike Kravetz <> Cc: Qian Cai <> Cc: <> [5.8] Link: Link: Signed-off-by: Linus Torvalds <>
2020-09-04Merge branch 'simplify-do_wp_page'Linus Torvalds
Merge emailed patches from Peter Xu: "This is a small series that I picked up from Linus's suggestion to simplify cow handling (and also make it more strict) by checking against page refcounts rather than mapcounts. This makes uffd-wp work again (verified by running upmapsort)" Note: this is horrendously bad timing, and making this kind of fundamental vm change after -rc3 is not at all how things should work. The saving grace is that it really is a nice simplification: 8 files changed, 29 insertions(+), 120 deletions(-) The reason for the bad timing is that it turns out that commit 17839856fd58 ("gup: document and work around 'COW can break either way' issue") broke not just UFFD functionality (as Peter noticed), but Mikulas Patocka also reports that it caused issues for strace when running in a DAX environment with ext4 on a persistent memory setup. And we can't just revert that commit without re-introducing the original issue that is a potential security hole, so making COW stricter (and in the process much simpler) is a step to then undoing the forced COW that broke other uses. Link: * emailed patches from Peter Xu <>: mm: Add PGREUSE counter mm/gup: Remove enfornced COW mechanism mm/ksm: Remove reuse_ksm_page() mm: do_wp_page() simplification
2020-09-04mm/ksm: Remove reuse_ksm_page()Peter Xu
Remove the function as the last reference has gone away with the do_wp_page() changes. Signed-off-by: Peter Xu <> Signed-off-by: Linus Torvalds <>
2020-08-24Revert "powerpc/64s: Remove PROT_SAO support"Shawn Anastasio
This reverts commit 5c9fa16e8abd342ce04dc830c1ebb2a03abf6c05. Since PROT_SAO can still be useful for certain classes of software, reintroduce it. Concerns about guest migration for LPARs using SAO will be addressed next. Signed-off-by: Shawn Anastasio <> Signed-off-by: Michael Ellerman <> Link:
2020-08-12mm: do page fault accounting in handle_mm_faultPeter Xu
Patch series "mm: Page fault accounting cleanups", v5. This is v5 of the pf accounting cleanup series. It originates from Gerald Schaefer's report on an issue a week ago regarding incorrect page fault accounting for retried page faults after commit 4064b9827063 ("mm: allow VM_FAULT_RETRY for multiple times"): What this series did: - Correct page fault accounting: we do accounting for a page fault (no matter whether it's from #PF handling, or gup, or anything else) only with the one that completed the fault. For example, page fault retries should not be counted in page fault counters. Same for the perf events. - Unify definition of PERF_COUNT_SW_PAGE_FAULTS: currently this perf event is used in an ad-hoc way across different archs. Case (1): for many archs it's done at the entry of a page fault handler, so that it will also cover e.g. erroneous faults. Case (2): for some other archs, it is only accounted when the page fault is resolved successfully. Case (3): there are still quite a few archs that have not enabled this perf event. Since this series will touch nearly all the archs, we unify this perf event to always follow case (1), which is the one that makes the most sense. And since we moved the accounting into handle_mm_fault, the other two MAJ/MIN perf events are well taken care of naturally. - Unify definition of "major faults": the definition of "major fault" is slightly changed when used in accounting (not VM_FAULT_MAJOR). More information in patch 1. - Always account the page fault onto the one that triggered the page fault. This does not matter much for #PF handling, but mostly for gup. More information on this in patch 25. Patchset layout: Patch 1: Introduced the accounting in handle_mm_fault(), not enabled. Patch 2-23: Enable the new accounting for arch #PF handlers one by one. Patch 24: Enable the new accounting for the remaining outliers (gup, iommu, etc.) 
Patch 25: Cleanup GUP task_struct pointer since it's not needed any more This patch (of 25): This is a preparation patch to move page fault accounting into the general code in handle_mm_fault(). This includes both the per task flt_maj/flt_min counters, and the major/minor page fault perf events. To do this, the pt_regs pointer is passed into handle_mm_fault(). PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault handlers. So far, all the pt_regs pointers passed into handle_mm_fault() are NULL, which means this patch should have no intended functional change. Suggested-by: Linus Torvalds <> Signed-off-by: Peter Xu <> Signed-off-by: Andrew Morton <> Cc: Albert Ou <> Cc: Alexander Gordeev <> Cc: Andy Lutomirski <> Cc: Benjamin Herrenschmidt <> Cc: Borislav Petkov <> Cc: Brian Cain <> Cc: Catalin Marinas <> Cc: Christian Borntraeger <> Cc: Chris Zankel <> Cc: Dave Hansen <> Cc: David S. Miller <> Cc: Geert Uytterhoeven <> Cc: Gerald Schaefer <> Cc: Greentime Hu <> Cc: Guo Ren <> Cc: Heiko Carstens <> Cc: Helge Deller <> Cc: H. Peter Anvin <> Cc: Ingo Molnar <> Cc: Ivan Kokshaysky <> Cc: James E.J. Bottomley <> Cc: John Hubbard <> Cc: Jonas Bonn <> Cc: Ley Foon Tan <> Cc: "Luck, Tony" <> Cc: Matt Turner <> Cc: Max Filippov <> Cc: Michael Ellerman <> Cc: Michal Simek <> Cc: Nick Hu <> Cc: Palmer Dabbelt <> Cc: Paul Mackerras <> Cc: Paul Walmsley <> Cc: Pekka Enberg <> Cc: Peter Zijlstra <> Cc: Richard Henderson <> Cc: Rich Felker <> Cc: Russell King <> Cc: Stafford Horne <> Cc: Stefan Kristiansson <> Cc: Thomas Bogendoerfer <> Cc: Thomas Gleixner <> Cc: Vasily Gorbik <> Cc: Vincent Chen <> Cc: Vineet Gupta <> Cc: Will Deacon <> Cc: Yoshinori Sato <> Link: Link: Signed-off-by: Linus Torvalds <>
2020-08-07Merge tag 'powerpc-5.9-1' of ↵Linus Torvalds
git:// Pull powerpc updates from Michael Ellerman: - Add support for (optionally) using queued spinlocks & rwlocks. - Support for a new faster system call ABI using the scv instruction on Power9 or later. - Drop support for the PROT_SAO mmap/mprotect flag as it will be unsupported on Power10 and future processors, leaving us with no way to implement the functionality it requests. This risks breaking userspace, though we believe it is unused in practice. - A bug fix for, and then the removal of, our custom stack expansion checking. We now allow stack expansion up to the rlimit, like other architectures. - Remove the remnants of our (previously disabled) topology update code, which tried to react to NUMA layout changes on virtualised systems, but was prone to crashes and other problems. - Add PMU support for Power10 CPUs. - A change to our signal trampoline so that we don't unbalance the link stack (branch return predictor) in the signal delivery path. - Lots of other cleanups, refactorings, smaller features and so on as usual. Thanks to: Abhishek Goel, Alastair D'Silva, Alexander A. Klimov, Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Anton Blanchard, Arnd Bergmann, Athira Rajeev, Balamuruhan S, Bharata B Rao, Bill Wendling, Bin Meng, Cédric Le Goater, Chris Packham, Christophe Leroy, Christoph Hellwig, Daniel Axtens, Dan Williams, David Lamparter, Desnes A. Nunes do Rosario, Erhard F., Finn Thain, Frederic Barrat, Ganesh Goudar, Gautham R. Shenoy, Geoff Levand, Greg Kurz, Gustavo A. R. Silva, Hari Bathini, Harish, Imre Kaloz, Joel Stanley, Joe Perches, John Crispin, Jordan Niethe, Kajol Jain, Kamalesh Babulal, Kees Cook, Laurent Dufour, Leonardo Bras, Li RongQing, Madhavan Srinivasan, Mahesh Salgaonkar, Mark Cave-Ayland, Michal Suchanek, Milton Miller, Mimi Zohar, Murilo Opsfelder Araujo, Nathan Chancellor, Nathan Lynch, Naveen N. 
Rao, Nayna Jain, Nicholas Piggin, Oliver O'Halloran, Palmer Dabbelt, Pedro Miraglia Franco de Carvalho, Philippe Bergheaud, Pingfan Liu, Pratik Rajesh Sampat, Qian Cai, Qinglang Miao, Randy Dunlap, Ravi Bangoria, Sachin Sant, Sam Bobroff, Sandipan Das, Santosh Sivaraj, Satheesh Rajendran, Shirisha Ganta, Sourabh Jain, Srikar Dronamraju, Stan Johnson, Stephen Rothwell, Thadeu Lima de Souza Cascardo, Thiago Jung Bauermann, Tom Lane, Vaibhav Jain, Vladis Dronov, Wei Yongjun, Wen Xiong, YueHaibing. * tag 'powerpc-5.9-1' of git:// (337 commits) selftests/powerpc: Fix pkey syscall redefinitions powerpc: Fix circular dependency between percpu.h and mmu.h powerpc/powernv/sriov: Fix use of uninitialised variable selftests/powerpc: Skip vmx/vsx/tar/etc tests on older CPUs powerpc/40x: Fix assembler warning about r0 powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric powerpc/papr_scm: Fetch nvdimm performance stats from PHYP cpuidle: pseries: Fixup exit latency for CEDE(0) cpuidle: pseries: Add function to parse extended CEDE records cpuidle: pseries: Set the latency-hint before entering CEDE selftests/powerpc: Fix online CPU selection powerpc/perf: Consolidate perf_callchain_user_[64|32]() powerpc/pseries/hotplug-cpu: Remove double free in error path powerpc/pseries/mobility: Add pr_debug() for device tree changes powerpc/pseries/mobility: Set pr_fmt() powerpc/cacheinfo: Warn if cache object chain becomes unordered powerpc/cacheinfo: Improve diagnostics about malformed cache lists powerpc/cacheinfo: Use name@unit instead of full DT path in debug messages powerpc/cacheinfo: Set pr_fmt() powerpc: fix function annotations to avoid section mismatch warnings with gcc-10 ...
2020-07-22powerpc/64s: Remove PROT_SAO supportNicholas Piggin
ISA v3.1 does not support the SAO storage control attribute required to implement PROT_SAO. PROT_SAO was used by specialised system software (Lx86) that has been discontinued for about 7 years, and is not thought to be used elsewhere, so removal should not cause problems. We would rather remove it than keep support for older processors, because live migrating guest partitions to newer processors may not be possible if SAO is in use (or worse, allowed with silent races). - PROT_SAO stays in the uapi header so code using it would still build. - arch_validate_prot() is removed; the generic version rejects PROT_SAO, so applications would get a failure at mmap() time. Signed-off-by: Nicholas Piggin <> [mpe: Drop KVM change for the time being] Signed-off-by: Michael Ellerman <> Link:
2020-07-16treewide: Remove uninitialized_var() usageKees Cook
Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \ xargs perl -pi -e \ 's/\buninitialized_var\(([^\)]+)\)/\1/g; s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] [2] [3] [4] Reviewed-by: Leon Romanovsky <> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <> # IB Acked-by: Kalle Valo <> # wireless drivers Reviewed-by: Chao Yu <> # erofs Signed-off-by: Kees Cook <>
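For reference, the removed macro simply made a variable its own initializer, which silenced "maybe-uninitialized" warnings without initializing anything. The function below is a hypothetical example of the after-the-cleanup shape, where the variable gets a real initializer instead:

```c
#include <assert.h>

/* The old, now-removed definition (shown for reference, not used):
 * uninitialized_var(x) expanded to "x = x", reading the variable before
 * any assignment -- which is exactly how it could paper over real bugs. */
#define uninitialized_var(x) x = x

/* After the cleanup: either leave the declaration plain (if every path
 * really assigns it) or give it an honest initial value. */
static int pick(int have_value, int value)
{
	int ret = 0;	/* real initializer instead of uninitialized_var(ret) */

	if (have_value)
		ret = value;
	return ret;
}
```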
2020-06-09mmap locking API: convert mmap_sem commentsMichel Lespinasse
Convert comments that reference mmap_sem to reference mmap_lock instead. [ fix up linux-next leftovers] [ s/lockaphore/lock/, per Vlastimil] [ more linux-next fixups, per Michel] Signed-off-by: Michel Lespinasse <> Signed-off-by: Andrew Morton <> Reviewed-by: Vlastimil Babka <> Reviewed-by: Daniel Jordan <> Cc: Davidlohr Bueso <> Cc: David Rientjes <> Cc: Hugh Dickins <> Cc: Jason Gunthorpe <> Cc: Jerome Glisse <> Cc: John Hubbard <> Cc: Laurent Dufour <> Cc: Liam Howlett <> Cc: Matthew Wilcox <> Cc: Peter Zijlstra <> Cc: Ying Han <> Link: Signed-off-by: Linus Torvalds <>
2020-06-09mmap locking API: convert mmap_sem API commentsMichel Lespinasse
Convert comments that reference old mmap_sem APIs to reference corresponding new mmap locking APIs instead. Signed-off-by: Michel Lespinasse <> Signed-off-by: Andrew Morton <> Reviewed-by: Vlastimil Babka <> Reviewed-by: Davidlohr Bueso <> Reviewed-by: Daniel Jordan <> Cc: David Rientjes <> Cc: Hugh Dickins <> Cc: Jason Gunthorpe <> Cc: Jerome Glisse <> Cc: John Hubbard <> Cc: Laurent Dufour <> Cc: Liam Howlett <> Cc: Matthew Wilcox <> Cc: Peter Zijlstra <> Cc: Ying Han <> Link: Signed-off-by: Linus Torvalds <>
2020-06-09mmap locking API: use coccinelle to convert mmap_sem rwsem call sitesMichel Lespinasse
This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule: // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir . @@ expression mm; @@ ( -init_rwsem +mmap_init_lock | -down_write +mmap_write_lock | -down_write_killable +mmap_write_lock_killable | -down_write_trylock +mmap_write_trylock | -up_write +mmap_write_unlock | -downgrade_write +mmap_write_downgrade | -down_read +mmap_read_lock | -down_read_killable +mmap_read_lock_killable | -down_read_trylock +mmap_read_trylock | -up_read +mmap_read_unlock ) -(&mm->mmap_sem) +(mm) Signed-off-by: Michel Lespinasse <> Signed-off-by: Andrew Morton <> Reviewed-by: Daniel Jordan <> Reviewed-by: Laurent Dufour <> Reviewed-by: Vlastimil Babka <> Cc: Davidlohr Bueso <> Cc: David Rientjes <> Cc: Hugh Dickins <> Cc: Jason Gunthorpe <> Cc: Jerome Glisse <> Cc: John Hubbard <> Cc: Liam Howlett <> Cc: Matthew Wilcox <> Cc: Peter Zijlstra <> Cc: Ying Han <> Link: Signed-off-by: Linus Torvalds <>
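The names introduced by that coccinelle rule can be sketched as below. This is a single-threaded userspace model using counters in place of the rwsem, only to show the API shape the call sites end up with: they now say what is locked (the mm's mapping state) rather than reaching into a bare mmap_sem field.

```c
#include <assert.h>

/* Toy stand-in for the kernel's mm_struct; a reader count and a writer
 * flag model the rwsem's state for a single-threaded demonstration. */
struct mm_struct {
	int readers;
	int writers;
};

static void mmap_init_lock(struct mm_struct *mm)
{
	mm->readers = 0;
	mm->writers = 0;
}

static int mmap_read_trylock(struct mm_struct *mm)
{
	if (mm->writers)
		return 0;	/* readers are excluded by a writer */
	mm->readers++;
	return 1;
}

static void mmap_read_unlock(struct mm_struct *mm)
{
	mm->readers--;
}

static int mmap_write_trylock(struct mm_struct *mm)
{
	if (mm->readers || mm->writers)
		return 0;	/* a writer needs exclusive access */
	mm->writers = 1;
	return 1;
}

static void mmap_write_unlock(struct mm_struct *mm)
{
	mm->writers = 0;
}

static void mmap_write_downgrade(struct mm_struct *mm)
{
	mm->writers = 0;	/* exclusive hold becomes a shared hold */
	mm->readers++;
}
```

In the kernel these wrappers sit over the real rwsem (and would block rather than fail); the point of the conversion is that every call site now goes through this one named interface.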
2020-06-04mm: ksm: fix a typo in comment "alreaady"->"already"Ethon Paul
There is a typo in comment, fix it. Signed-off-by: Ethon Paul <> Signed-off-by: Andrew Morton <> Reviewed-by: Andrew Morton <> Reviewed-by: Ralph Campbell <> Link: Signed-off-by: Linus Torvalds <>
2020-04-21mm/ksm: fix NULL pointer dereference when KSM zero page is enabledMuchun Song
find_mergeable_vma() can return NULL. In this case, it leads to a crash when we access vm_mm(its offset is 0x40) later in write_protect_page. And this case did happen on our server. The following call trace is captured in kernel 4.19 with the following patch applied and KSM zero page enabled on our server. commit e86c59b1b12d ("mm/ksm: improve deduplication of zero pages with colouring") So add a vma check to fix it. BUG: unable to handle kernel NULL pointer dereference at 0000000000000040 Oops: 0000 [#1] SMP NOPTI CPU: 9 PID: 510 Comm: ksmd Kdump: loaded Tainted: G OE 4.19.36.bsk.9-amd64 #4.19.36.bsk.9 RIP: try_to_merge_one_page+0xc7/0x760 Code: 24 58 65 48 33 34 25 28 00 00 00 89 e8 0f 85 a3 06 00 00 48 83 c4 60 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 46 08 a8 01 75 b8 <49> 8b 44 24 40 4c 8d 7c 24 20 b9 07 00 00 00 4c 89 e6 4c 89 ff 48 RSP: 0018:ffffadbdd9fffdb0 EFLAGS: 00010246 RAX: ffffda83ffd4be08 RBX: ffffda83ffd4be40 RCX: 0000002c6e800000 RDX: 0000000000000000 RSI: ffffda83ffd4be40 RDI: 0000000000000000 RBP: ffffa11939f02ec0 R08: 0000000094e1a447 R09: 00000000abe76577 R10: 0000000000000962 R11: 0000000000004e6a R12: 0000000000000000 R13: ffffda83b1e06380 R14: ffffa18f31f072c0 R15: ffffda83ffd4be40 FS: 0000000000000000(0000) GS:ffffa0da43b80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000040 CR3: 0000002c77c0a003 CR4: 00000000007626e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: ksm_scan_thread+0x115e/0x1960 kthread+0xf5/0x130 ret_from_fork+0x1f/0x30 [ if the vma is out of date, just exit] Link: [ add the conventional braces, replace /** with /*] Fixes: e86c59b1b12d ("mm/ksm: improve deduplication of zero pages with colouring") Co-developed-by: Xiongchun Duan <> Signed-off-by: Muchun Song <> Signed-off-by: Andrew Morton <> Reviewed-by: David Hildenbrand <> Reviewed-by: Kirill Tkhai <> Cc: Hugh 
Dickins <> Cc: Yang Shi <> Cc: Claudio Imbrenda <> Cc: Markus Elfring <> Cc: <> Link: Link: Signed-off-by: Linus Torvalds <>
2020-04-07mm: use fallthrough;Joe Perches
Convert the various /* fallthrough */ comments to the pseudo-keyword fallthrough; Done via script: Signed-off-by: Joe Perches <> Signed-off-by: Andrew Morton <> Reviewed-by: Gustavo A. R. Silva <> Link: Signed-off-by: Linus Torvalds <>
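The conversion looks like the sketch below. The guarded macro approximates the kernel's fallthrough pseudo-keyword (defined via the compiler's fallthrough statement attribute, with a no-op fallback); the switch is a hypothetical example of a deliberate fall-through that the old comment style used to document:

```c
#include <assert.h>

/* Approximation of the kernel's fallthrough pseudo-keyword: use the
 * compiler's statement attribute when available, otherwise a no-op. */
#if defined(__has_attribute)
# if __has_attribute(__fallthrough__)
#  define fallthrough __attribute__((__fallthrough__))
# endif
#endif
#ifndef fallthrough
# define fallthrough do {} while (0)
#endif

/* Deliberate fall-through: stage 2 also performs the stage 1 work. */
static int accumulate(int stage)
{
	int total = 0;

	switch (stage) {
	case 2:
		total += 20;
		fallthrough;	/* previously a "fall through" comment */
	case 1:
		total += 1;
		break;
	default:
		break;
	}
	return total;
}
```

Unlike a comment, the pseudo-keyword is visible to the compiler, so -Wimplicit-fallthrough can flag the unintentional cases while staying quiet here.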
2020-04-07mm/ksm.c: update get_user_pages() argument in commentLi Chen
This updates get_user_pages()'s argument in ksm_test_exit()'s comment Signed-off-by: Li Chen <> Signed-off-by: Andrew Morton <> Reviewed-by: Andrew Morton <> Link: Signed-off-by: Linus Torvalds <>
2019-12-04Merge tag 'for-linus' of git:// Torvalds
Pull more KVM updates from Paolo Bonzini: - PPC secure guest support - small x86 cleanup - fix for an x86-specific out-of-bounds write on a ioctl (not guest triggerable, data not attacker-controlled) * tag 'for-linus' of git:// kvm: vmx: Stop wasting a page for guest_msrs KVM: x86: fix out-of-bounds write in KVM_GET_EMULATED_CPUID (CVE-2019-19332) Documentation: kvm: Fix mention to number of ioctls classes powerpc: Ultravisor: Add PPC_UV config option KVM: PPC: Book3S HV: Support reset of secure guest KVM: PPC: Book3S HV: Handle memory plug/unplug to secure VM KVM: PPC: Book3S HV: Radix changes for secure guest KVM: PPC: Book3S HV: Shared pages support for secure guests KVM: PPC: Book3S HV: Support for running secure guests mm: ksm: Export ksm_madvise() KVM x86: Move kvm cpuid support out of svm
2019-11-28mm: ksm: Export ksm_madvise()Bharata B Rao
On PEF-enabled POWER platforms that support running of secure guests, secure pages of the guest are represented by device private pages in the host. Such pages needn't participate in KSM merging. This is achieved by using the ksm_madvise() call, which needs to be exported since KVM PPC can be a kernel module. Signed-off-by: Bharata B Rao <> Acked-by: Hugh Dickins <> Cc: Andrea Arcangeli <> Signed-off-by: Paul Mackerras <>
2019-11-22mm/ksm.c: don't WARN if page is still mapped in remove_stable_node()Andrey Ryabinin
It's possible to hit the WARN_ON_ONCE(page_mapped(page)) in remove_stable_node() when it races with __mmput() and squeezes in between ksm_exit() and exit_mmap(). WARNING: CPU: 0 PID: 3295 at mm/ksm.c:888 remove_stable_node+0x10c/0x150 Call Trace: remove_all_stable_nodes+0x12b/0x330 run_store+0x4ef/0x7b0 kernfs_fop_write+0x200/0x420 vfs_write+0x154/0x450 ksys_write+0xf9/0x1d0 do_syscall_64+0x99/0x510 entry_SYSCALL_64_after_hwframe+0x49/0xbe Remove the warning as there is nothing scary going on. Link: Fixes: cbf86cfe04a6 ("ksm: remove old stable nodes more thoroughly") Signed-off-by: Andrey Ryabinin <> Acked-by: Hugh Dickins <> Cc: Andrea Arcangeli <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2019-09-24mm: move memcmp_pages() and pages_identical()Song Liu
Patch series "THP aware uprobe", v13. This patchset makes uprobe aware of THPs. Currently, when uprobe is attached to text on a THP, the page is split by FOLL_SPLIT. As a result, uprobe eliminates the performance benefit of THP. This set makes uprobe THP-aware. Instead of FOLL_SPLIT, we introduce FOLL_SPLIT_PMD, which only splits the PMD for uprobe. After all uprobes within the THP are removed, the PTE-mapped pages are regrouped as a huge PMD. This set (plus a few THP patches) is also available at This patch (of 6): Move memcmp_pages() to mm/util.c and pages_identical() to mm.h, so that we can use them in other files. Link: Signed-off-by: Song Liu <> Acked-by: Kirill A. Shutemov <> Reviewed-by: Oleg Nesterov <> Cc: Johannes Weiner <> Cc: Matthew Wilcox <> Cc: William Kucharski <> Cc: Srikar Dronamraju <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 482Thomas Gleixner
Based on 1 normalized pattern(s): this work is licensed under the terms of the gnu gpl version 2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 48 file(s). Signed-off-by: Thomas Gleixner <> Reviewed-by: Allison Randal <> Reviewed-by: Enrico Weigelt <> Cc: Link: Signed-off-by: Greg Kroah-Hartman <>
2019-05-14mm/mmu_notifier: use correct mmu_notifier events for each invalidationJérôme Glisse
This updates each existing invalidation to use the correct mmu notifier event that represents what is happening to the CPU page table. See the patch which introduced the events for the rationale behind this. Link: Signed-off-by: Jérôme Glisse <> Reviewed-by: Ralph Campbell <> Reviewed-by: Ira Weiny <> Cc: Christian König <> Cc: Joonas Lahtinen <> Cc: Jani Nikula <> Cc: Rodrigo Vivi <> Cc: Jan Kara <> Cc: Andrea Arcangeli <> Cc: Peter Xu <> Cc: Felix Kuehling <> Cc: Jason Gunthorpe <> Cc: Ross Zwisler <> Cc: Dan Williams <> Cc: Paolo Bonzini <> Cc: Radim Krcmar <> Cc: Michal Hocko <> Cc: Christian Koenig <> Cc: John Hubbard <> Cc: Arnd Bergmann <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2019-05-14mm/mmu_notifier: contextual information for event triggering invalidationJérôme Glisse
CPU page table updates can happen for many reasons, not only as a result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as a result of kernel activities (memory compression, reclaim, migration, ...). Users of the mmu notifier API track changes to the CPU page table and take specific action on them. The current API only provides the range of virtual addresses affected by the change, not why the change is happening. This patchset does the initial mechanical conversion of all the places that call mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP event as well as the vma if it is known (most invalidations happen against a given vma). Passing down the vma allows the users of mmu notifier to inspect the new vma page protection. MMU_NOTIFY_UNMAP is always the safe default, as users of mmu notifier should assume that every mapping for the range is going away when that event happens. A later patch converts the mm call paths to use a more appropriate event for each call. This is done as 2 patches so that no call site is forgotten, especially as it uses the following coccinelle patch: %<---------------------------------------------------------------------- @@ identifier I1, I2, I3, I4; @@ static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1, +enum mmu_notifier_event event, +unsigned flags, +struct vm_area_struct *vma, struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... } @@ @@ -#define mmu_notifier_range_init(range, mm, start, end) +#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end) @@ expression E1, E3, E4; identifier I1; @@ <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, I1, I1->vm_mm, E3, E4) ...> @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(..., struct vm_area_struct *VMA, ...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN, VMA; @@ FN(...) { struct vm_area_struct *VMA; <...
mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4) ...> } @@ expression E1, E2, E3, E4; identifier FN; @@ FN(...) { <... mmu_notifier_range_init(E1, +MMU_NOTIFY_UNMAP, 0, NULL, E2, E3, E4) ...> } ---------------------------------------------------------------------->% Applied with: spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place spatch --sp-file mmu-notifier.spatch --dir mm --in-place Link: Signed-off-by: Jérôme Glisse <> Reviewed-by: Ralph Campbell <> Reviewed-by: Ira Weiny <> Cc: Christian König <> Cc: Joonas Lahtinen <> Cc: Jani Nikula <> Cc: Rodrigo Vivi <> Cc: Jan Kara <> Cc: Andrea Arcangeli <> Cc: Peter Xu <> Cc: Felix Kuehling <> Cc: Jason Gunthorpe <> Cc: Ross Zwisler <> Cc: Dan Williams <> Cc: Paolo Bonzini <> Cc: Radim Krcmar <> Cc: Michal Hocko <> Cc: Christian Koenig <> Cc: John Hubbard <> Cc: Arnd Bergmann <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2019-03-05mm: ksm: do not block on page lock when searching stable treeYang Shi
ksmd needs to search the stable tree to look for a suitable KSM page, but the KSM page might be locked for a while due to, e.g., a KSM page rmap walk. Basically it is not a big deal since commit 2c653d0ee2ae ("ksm: introduce ksm_max_page_sharing per page deduplication limit"), since max_page_sharing limits the number of shared KSM pages. But it is still not worth waiting for the lock; the page can be skipped and merged in the next scan, avoiding a potential stall if its content is still intact. Introduce a trylock mode to get_ksm_page() so it does not block on the page lock, like what try_to_merge_one_page() does. And, define the three possible operations (nolock, lock and trylock) as an enum type to avoid stacking up bools and make the code more readable. Return -EBUSY if trylock fails, since NULL means no suitable KSM page was found, which is a valid case. With the default max_page_sharing setting (256), there is almost no observed change comparing lock vs trylock. However, with ksm02 of LTP, which sets max_page_sharing to 786432, a reduced ksmd full scan time can be observed. With the lock version, ksmd may take 10s - 11s to run two full scans; with the trylock version, ksmd may take 8s - 11s to run two full scans. And, the numbers of pages_sharing and pages_to_scan stay the same. Basically, this change has no harm. [ fix BUG_ON()] Link: Link: Signed-off-by: Yang Shi <> Signed-off-by: Hugh Dickins <> Suggested-by: John Hubbard <> Reviewed-by: Kirill Tkhai <> Cc: Hugh Dickins <> Cc: Andrea Arcangeli <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2019-03-05mm: reuse only-pte-mapped KSM page in do_wp_page()Kirill Tkhai
Add an optimization for KSM pages, almost in the same way that we have for ordinary anonymous pages. If there is a write fault in a page which is mapped by only one pte, and it is not related to the swap cache, the page may be reused without copying its content. [ Note that we do not consider PageSwapCache() pages, at least for now, since we don't want to complicate __get_ksm_page(), which has a nice optimization based on this (for the migration case). Currently it spins on PageSwapCache() pages, waiting for their counters to be unfrozen (i.e., for the migration to finish). But we don't want to make it also spin on swap cache pages which we try to reuse, since the probability of reusing them is not very high. So, for now we do not consider PageSwapCache() pages at all. ] So in reuse_ksm_page() we check for 1) PageSwapCache() and 2) page_stable_node(), to skip a page which KSM is currently trying to link to the stable tree. Then we do page_ref_freeze() to prohibit KSM from merging one more page into the page we are reusing. After that, nobody can refer to the reused page: KSM skips !PageSwapCache() pages with zero refcount, and the protection against all other participants is the same as for reused ordinary anon pages: pte lock, page lock and mmap_sem. [ replace BUG_ON()s with WARN_ON()s] Link: Signed-off-by: Kirill Tkhai <> Reviewed-by: Yang Shi <> Cc: "Kirill A. Shutemov" <> Cc: Hugh Dickins <> Cc: Andrea Arcangeli <> Cc: Christian Koenig <> Cc: Claudio Imbrenda <> Cc: Rik van Riel <> Cc: Huang Ying <> Cc: Minchan Kim <> Cc: Kirill Tkhai <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2019-03-05mm: replace all open encodings for NUMA_NO_NODEAnshuman Khandual
Patch series "Replace all open encodings for NUMA_NO_NODE", v3. All these places for replacement were found by running the following grep patterns on the entire kernel code. Please let me know if this might have missed some instances. This might also have replaced some false positives. I will appreciate suggestions, inputs and review. 1. git grep "nid == -1" 2. git grep "node == -1" 3. git grep "nid = -1" 4. git grep "node = -1" This patch (of 2): At present there are multiple places where an invalid node number is encoded as -1. Even though it is implicitly understood, it is always better to have macros there. Replace these open encodings for an invalid node number with the global macro NUMA_NO_NODE. This helps remove NUMA-related assumptions like 'invalid node' from various places, redirecting them to a common definition. Link: Signed-off-by: Anshuman Khandual <> Reviewed-by: David Hildenbrand <> Acked-by: Jeff Kirsher <> [ixgbe] Acked-by: Jens Axboe <> [mtip32xx] Acked-by: Vinod Koul <> [dmaengine.c] Acked-by: Michael Ellerman <> [powerpc] Acked-by: Doug Ledford <> [drivers/infiniband] Cc: Joseph Qi <> Cc: Hans Verkuil <> Cc: Stephen Rothwell <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2018-12-28ksm: react on changing "sleep_millisecs" parameter fasterKirill Tkhai
The ksm thread unconditionally sleeps in ksm_scan_thread() after each iteration: schedule_timeout_interruptible( msecs_to_jiffies(ksm_thread_sleep_millisecs)) The timeout is configured in /sys/kernel/mm/ksm/sleep_millisecs. If a user writes a big value by mistake, and the thread enters schedule_timeout_interruptible(), it's not possible to cancel the sleep by writing a new smaller value; the thread just sleeps till the timeout expires. The patch fixes the problem by waking the thread each time the value is updated. This may also be useful for debug purposes, and for userspace daemons which change the sleep_millisecs value depending on system load. Link: Signed-off-by: Kirill Tkhai <> Acked-by: Cyrill Gorcunov <> Reviewed-by: Andrew Morton <> Cc: Michal Hocko <> Cc: Hugh Dickins <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2018-12-28mm/mmu_notifier: use structure for invalidate_range_start/end calls v2Jérôme Glisse
To avoid having to change many call sites every time we want to add a parameter, use a structure to group all parameters for the mmu_notifier invalidate_range_start/end calls. No functional changes with this patch. [ coding style fixes] Link: Signed-off-by: Jérôme Glisse <> Acked-by: Christian König <> Acked-by: Jan Kara <> Cc: Matthew Wilcox <> Cc: Ross Zwisler <> Cc: Dan Williams <> Cc: Paolo Bonzini <> Cc: Radim Krcmar <> Cc: Michal Hocko <> Cc: Felix Kuehling <> Cc: Ralph Campbell <> Cc: John Hubbard <> From: Jérôme Glisse <> Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3 fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n Link: Signed-off-by: Jérôme Glisse <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2018-12-28ksm: replace jhash2 with xxhashTimofey Titovets
Replace jhash2 with xxhash. Perf numbers: Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz ksm: crc32c hash() 12081 MB/s ksm: xxh64 hash() 8770 MB/s ksm: xxh32 hash() 4529 MB/s ksm: jhash2 hash() 1569 MB/s Sioh Lee did some testing: crc32c_intel: 1084.10ns crc32c (no hardware acceleration): 7012.51ns xxhash32: 2227.75ns xxhash64: 1413.16ns jhash2: 5128.30ns As jhash2 will always be slower (for data sizes like PAGE_SIZE), don't use it in ksm at all. Use only xxhash for now, because to use crc32c the cryptoapi must be initialized first, which requires a tricky solution to work well in all situations. Link: Signed-off-by: Timofey Titovets <> Signed-off-by: leesioh <> Reviewed-by: Pavel Tatashin <> Reviewed-by: Mike Rapoport <> Reviewed-by: Andrew Morton <> Cc: Andrea Arcangeli <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2018-08-22include/linux/compiler*.h: make compiler-*.h mutually exclusiveNick Desaulniers
Commit cafa0010cd51 ("Raise the minimum required gcc version to 4.6") recently exposed a brittle part of the build for supporting non-gcc compilers. Both Clang and ICC define __GNUC__, __GNUC_MINOR__, and __GNUC_PATCHLEVEL__ for quick compatibility with code bases that haven't added compiler specific checks for __clang__ or __INTEL_COMPILER. This is brittle, as they happened to get compatibility by posing as a certain version of GCC. This broke when upgrading the minimal version of GCC required to build the kernel, to a version above what ICC and Clang claim to be. Rather than always including compiler-gcc.h then undefining or redefining macros in compiler-intel.h or compiler-clang.h, let's separate out the compiler specific macro definitions into mutually exclusive headers, do more proper compiler detection, and keep shared definitions in compiler_types.h. Fixes: cafa0010cd51 ("Raise the minimum required gcc version to 4.6") Reported-by: Masahiro Yamada <> Suggested-by: Eli Friedman <> Suggested-by: Joe Perches <> Signed-off-by: Nick Desaulniers <> Signed-off-by: Linus Torvalds <>
2018-08-22mm: fix page_freeze_refs and page_unfreeze_refs in commentsJiang Biao
page_freeze_refs/page_unfreeze_refs have already been replaced by page_ref_freeze/page_ref_unfreeze, but the comments were not updated accordingly. Link: Signed-off-by: Jiang Biao <> Acked-by: Michal Hocko <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2018-08-17mm: convert return type of handle_mm_fault() caller to vm_fault_tSouptick Joarder
Use the new return type vm_fault_t for the fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t") In this patch all the callers of handle_mm_fault() are changed to return the vm_fault_t type. Link: Signed-off-by: Souptick Joarder <> Cc: Matthew Wilcox <> Cc: Richard Henderson <> Cc: Tony Luck <> Cc: Matt Turner <> Cc: Vineet Gupta <> Cc: Russell King <> Cc: Catalin Marinas <> Cc: Will Deacon <> Cc: Richard Kuo <> Cc: Geert Uytterhoeven <> Cc: Michal Simek <> Cc: James Hogan <> Cc: Ley Foon Tan <> Cc: Jonas Bonn <> Cc: James E.J. Bottomley <> Cc: Benjamin Herrenschmidt <> Cc: Palmer Dabbelt <> Cc: Yoshinori Sato <> Cc: David S. Miller <> Cc: Richard Weinberger <> Cc: Guan Xuetao <> Cc: Thomas Gleixner <> Cc: "H. Peter Anvin" <> Cc: "Levin, Alexander (Sasha Levin)" <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2018-08-17dax: remove VM_MIXEDMAP for fsdax and device daxDave Jiang
This patch is reworked from an earlier patch that Dan has posted: VM_MIXEDMAP is used by dax to tell mm paths like vm_normal_page() that the memory page it is dealing with is not typical memory from the linear map. The get_user_pages_fast() path, since it does not resolve the vma, is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we use that as a VM_MIXEDMAP replacement in some locations. In the cases where there is no pte to consult we fall back to using vma_is_dax() to detect the VM_MIXEDMAP special case. Now that we have explicit driver pfn_t-flag opt-in/opt-out for get_user_pages() support for DAX we can stop setting VM_MIXEDMAP. This also means we no longer need to worry about safely manipulating vm_flags in a future where we support dynamically changing the dax mode of a file. DAX should also now be supported with madvise_behavior(), vma_merge(), and copy_page_range(). This patch has been tested against the ndctl unit tests. It has also been tested against xfstests commit: 625515d using fake pmem created by memmap, and no additional issues have been observed. Link: Signed-off-by: Dave Jiang <> Acked-by: Dan Williams <> Cc: Jan Kara <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2018-06-15mm/ksm.c: ignore STABLE_FLAG of rmap_item->address in rmap_walk_ksm()Jia He
On our armv8a server (QDF2400), I noticed lots of WARN_ONs caused by rmap_item->address not being PAGE_SIZE aligned under memory pressure tests (start 20 guests and run memhog in the host). WARNING: CPU: 4 PID: 4641 at virt/kvm/arm/mmu.c:1826 kvm_age_hva_handler+0xc0/0xc8 CPU: 4 PID: 4641 Comm: memhog Tainted: G W 4.17.0-rc3+ #8 Call trace: kvm_age_hva_handler+0xc0/0xc8 handle_hva_to_gpa+0xa8/0xe0 kvm_age_hva+0x4c/0xe8 kvm_mmu_notifier_clear_flush_young+0x54/0x98 __mmu_notifier_clear_flush_young+0x6c/0xa0 page_referenced_one+0x154/0x1d8 rmap_walk_ksm+0x12c/0x1d0 rmap_walk+0x94/0xa0 page_referenced+0x194/0x1b0 shrink_page_list+0x674/0xc28 shrink_inactive_list+0x26c/0x5b8 shrink_node_memcg+0x35c/0x620 shrink_node+0x100/0x430 do_try_to_free_pages+0xe0/0x3a8 try_to_free_pages+0xe4/0x230 __alloc_pages_nodemask+0x564/0xdc0 alloc_pages_vma+0x90/0x228 do_anonymous_page+0xc8/0x4d0 __handle_mm_fault+0x4a0/0x508 handle_mm_fault+0xf8/0x1b0 do_page_fault+0x218/0x4b8 do_translation_fault+0x90/0xa0 do_mem_abort+0x68/0xf0 el0_da+0x24/0x28 In rmap_walk_ksm, the rmap_item->address might still have the STABLE_FLAG set, so the start and end in handle_hva_to_gpa might not be PAGE_SIZE aligned. Thus it will cause exceptions in handle_hva_to_gpa on arm64. This patch fixes it by ignoring (not removing) the low bits of the address when doing rmap_walk_ksm. IMO, it should be backported to the stable tree. The storm of WARN_ONs is very easy for me to reproduce.
More than that, I watched a panic (not reproducible) as follows: page:ffff7fe003742d80 count:-4871 mapcount:-2126053375 mapping: (null) index:0x0 flags: 0x1fffc00000000000() raw: 1fffc00000000000 0000000000000000 0000000000000000 ffffecf981470000 raw: dead000000000100 dead000000000200 ffff8017c001c000 0000000000000000 page dumped because: nonzero _refcount CPU: 29 PID: 18323 Comm: qemu-kvm Tainted: G W 4.14.15-5.hxt.aarch64 #1 Hardware name: <snip for confidential issues> Call trace: dump_backtrace+0x0/0x22c show_stack+0x24/0x2c dump_stack+0x8c/0xb0 bad_page+0xf4/0x154 free_pages_check_bad+0x90/0x9c free_pcppages_bulk+0x464/0x518 free_hot_cold_page+0x22c/0x300 __put_page+0x54/0x60 unmap_stage2_range+0x170/0x2b4 kvm_unmap_hva_handler+0x30/0x40 handle_hva_to_gpa+0xb0/0xec kvm_unmap_hva_range+0x5c/0xd0 I even injected a fault on purpose in kvm_unmap_hva_range by setting size=size-0x200; the call trace is similar to the above. So I think the panic is caused by the same root cause as the WARN_ONs. Andrea said: : It looks a straightforward safe fix, on x86 hva_to_gfn_memslot would : zap those bits and hide the misalignment caused by the low metadata : bits being erroneously left set in the address, but the arm code : notices when that's the last page in the memslot and the hva_end is : getting aligned and the size is below one page. : : I think the problem triggers in the addr += PAGE_SIZE of : unmap_stage2_ptes that never matches end because end is aligned but : addr is not. : : } while (pte++, addr += PAGE_SIZE, addr != end); : : x86 again only works on hva_start/hva_end after converting it to : gfn_start/end and that being in pfn units the bits are zapped before : they risk to cause trouble. Jia He said: : I've tested it myself on an arm64 server (QDF2400, 46 cpus, 96G mem). Without : this patch, the WARN_ON is very easy to reproduce.
After this patch, I : have run the same benchmark for a whole day without any WARN_ONs. Link: Signed-off-by: Jia He <> Reviewed-by: Andrea Arcangeli <> Tested-by: Jia He <> Cc: Suzuki K Poulose <> Cc: Minchan Kim <> Cc: Claudio Imbrenda <> Cc: Arvind Yadav <> Cc: Mike Rapoport <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2018-06-07mm/ksm: move [set_]page_stable_node from ksm.h to ksm.cMike Rapoport
page_stable_node() and set_page_stable_node() are only used in mm/ksm.c, and there is no point in keeping them in include/linux/ksm.h. [ fix SYSFS=n build] Link: Signed-off-by: Mike Rapoport <> Reviewed-by: Andrew Morton <> Cc: Andrea Arcangeli <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2018-04-27mm/ksm: docs: extend overview comment and make it "DOC:"Mike Rapoport
The existing comment provides a good overview of KSM implementation. Let's update it to reflect recent additions of "chain" and "dup" variants of the stable tree nodes and mark it as "DOC:" for inclusion into the KSM documentation. Signed-off-by: Mike Rapoport <> Signed-off-by: Jonathan Corbet <>
2018-04-16Merge branch 'mm-rst' into docs-nextJonathan Corbet
Mike Rapoport says: These patches convert files in Documentation/vm to ReST format, add an initial index and link it to the top level documentation. There are no content changes in the documentation, except a few spelling fixes. The relatively large diffstat stems from the indentation and paragraph wrapping changes. I've tried to keep the formatting as consistent as possible, but I could have missed some places that needed markup and added some markup where it was not necessary. [jc: significant conflicts in vm/hmm.rst]
2018-04-16docs/vm: rename documentation files to .rstMike Rapoport
Signed-off-by: Mike Rapoport <> Signed-off-by: Jonathan Corbet <>
2018-04-11mm/ksm.c: fix inconsistent accounting of zero pagesClaudio Imbrenda
When using KSM with use_zero_pages, we replace anonymous pages containing only zeroes with actual zero pages, which are not anonymous. We need to do proper accounting of the mm counters, otherwise we will get wrong values in /proc and a BUG message in dmesg when tearing down the mm. Link: Fixes: e86c59b1b1 ("mm/ksm: improve deduplication of zero pages with colouring") Signed-off-by: Claudio Imbrenda <> Reviewed-by: Andrew Morton <> Cc: Andrea Arcangeli <> Cc: Minchan Kim <> Cc: Kirill A. Shutemov <> Cc: Hugh Dickins <> Cc: Christian Borntraeger <> Cc: Gerald Schaefer <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2018-04-05mm/ksm: fix interaction with THPClaudio Imbrenda
This patch fixes a corner case for KSM. When two pages belong or belonged to the same transparent hugepage, and they should be merged, KSM fails to split the page, and therefore no merging happens. This bug can be reproduced by: * making sure ksm is running (disabling ksmtuned if necessary) * enabling transparent hugepages * allocating a THP-aligned 1-THP-sized buffer e.g. on amd64: posix_memalign(&p, 1<<21, 1<<21) * filling it with the same values e.g. memset(p, 42, 1<<21) * performing madvise to make it mergeable e.g. madvise(p, 1<<21, MADV_MERGEABLE) * waiting for KSM to perform a few scans The expected outcome is that all the pages get merged (1 shared and the rest sharing); the actual outcome is that no pages get merged (1 unshared and the rest volatile). The reason for this behaviour is that we increase the reference count once for each of the two pages we want to merge, but if they belong to the same hugepage (or compound page), the reference counter used in both cases is the one of the head of the compound page. This means that split_huge_page will find a reference counter value that is too high and will fail. This patch solves the problem by testing whether the two pages to merge belong to the same hugepage when attempting to merge them. If so, the hugepage is split safely. This means that the hugepage is not split if not necessary. Link: Signed-off-by: Claudio Imbrenda <> Co-authored-by: Gerald Schaefer <> Reviewed-by: Andrew Morton <> Cc: Andrea Arcangeli <> Cc: Minchan Kim <> Cc: Kirill A. Shutemov <> Cc: Hugh Dickins <> Cc: Christian Borntraeger <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>