summaryrefslogtreecommitdiff
path: root/arch/riscv/mm
AgeCommit message (Collapse)Author
2023-06-24riscv/mm: Convert to using lock_mm_and_find_vma()Ben Hutchings
Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-06-20riscv: mm: try VMA lock-based page fault handling firstJisheng Zhang
Attempt VMA lock-based page fault handling first, and fall back to the existing mmap_lock-based handling if that fails. A simple running the ebizzy benchmark on Lichee Pi 4A shows that PER_VMA_LOCK can improve the ebizzy benchmark by about 32.68%. In theory, the more CPUs, the bigger improvement, but I don't have any HW platform which has more than 4 CPUs. This is the riscv variant of "x86/mm: try VMA lock-based page fault handling first". Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Reviewed-by: Guo Ren <guoren@kernel.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Link: https://lore.kernel.org/r/20230523165942.2630-1-jszhang@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-06-19riscv: mm: Pre-allocate PGD entries for vmalloc/modules areaBjörn Töpel
The RISC-V port requires that kernel PGD entries are to be synchronized between MMs. This is done via the vmalloc_fault() function, that simply copies the PGD entries from init_mm to the faulting one. Historically, faulting in PGD entries have been a source for both bugs [1], and poor performance. One way to get rid of vmalloc faults is by pre-allocating the PGD entries. Pre-allocating the entries potientially wastes 64 * 4K (65 on SV39). The pre-allocation function is pulled from Jörg Rödel's x86 work, with the addition of 3-level page tables (PMD allocations). The pmd_alloc() function needs the ptlock cache to be initialized (when split page locks is enabled), so the pre-allocation is done in a RISC-V specific pgtable_cache_init() implementation. Pre-allocate the kernel PGD entries for the vmalloc/modules area, but only for 64b platforms. Link: https://lore.kernel.org/lkml/20200508144043.13893-1-joro@8bytes.org/ # [1] Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20230531093817.665799-1-bjorn@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-06-19riscv/hugetlb: pte_alloc_huge() pte_offset_huge()Hugh Dickins
pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits that: to keep balance in future, use the recently added pte_alloc_huge() instead; with pte_offset_huge() a better name for pte_offset_kernel(). Link: https://lkml.kernel.org/r/291f20-5947-9f5f-ec7f-96a18df336d9@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Acked-by: Palmer Dabbelt <palmer@rivosync.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Claudio Imbrenda <imbrenda@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Greg Ungerer <gerg@linux-m68k.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: John David Anglin <dave.anglin@bell.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-14riscv: mm: stub extable related functions/macros for !MMUJisheng Zhang
extable relies on the MMU to work properly, so it's useless to include __ex_table sections and build extable related functions for !MMU case. Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Link: https://lore.kernel.org/r/20230509152641.805-1-jszhang@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-06-07riscv: Check the virtual alignment before choosing a map sizeAlexandre Ghiti
We used to only check the alignment of the physical address to decide which mapping would fit for a certain region of the linear mapping, but it is not enough since the virtual address must also be aligned, so check that too. Fixes: 3335068f8721 ("riscv: Use PUD/P4D/PGD pages for the linear mapping") Reported-by: Song Shuai <songshuaishuai@tinylab.org> Link: https://lore.kernel.org/linux-riscv/tencent_7C3B580B47C1B17C16488EC1@qq.com/ Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20230607125851.63370-1-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-06-07riscv: Fix kfence now that the linear mapping can be backed by PUD/P4D/PGDAlexandre Ghiti
RISC-V Kfence implementation used to rely on the fact the linear mapping was backed by at most PMD hugepages, which is not true anymore since commit 3335068f8721 ("riscv: Use PUD/P4D/PGD pages for the linear mapping"). Instead of splitting PUD/P4D/PGD mappings afterwards, directly map the kfence pool region using PTE mappings by allocating this region before setup_vm_final(). Reported-by: syzbot+a74d57bddabbedd75135@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=a74d57bddabbedd75135 Fixes: 3335068f8721 ("riscv: Use PUD/P4D/PGD pages for the linear mapping") Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20230606130444.25090-1-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-06-07riscv: mm: Ensure prot of VM_WRITE and VM_EXEC must be readableHsieh-Tseng Shen
Commit 8aeb7b17f04e ("RISC-V: Make mmap() with PROT_WRITE imply PROT_READ") allows riscv to use mmap with PROT_WRITE only, and meanwhile mmap with w+x is also permitted. However, when userspace tries to access this page with PROT_WRITE|PROT_EXEC, which causes infinite loop at load page fault as well as it triggers soft lockup. According to riscv privileged spec, "Writable pages must also be marked readable". The fix to drop the `PAGE_COPY_READ_EXEC` and then `PAGE_COPY_EXEC` would be just used instead. This aligns the other arches (i.e arm64) for protection_map. Fixes: 8aeb7b17f04e ("RISC-V: Make mmap() with PROT_WRITE imply PROT_READ") Signed-off-by: Hsieh-Tseng Shen <woodrow.shen@sifive.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20230425102828.1616812-1-woodrow.shen@sifive.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-06-01riscv: Implement missing huge_ptep_getAlexandre Ghiti
huge_ptep_get must be reimplemented in order to go through all the PTEs of a NAPOT region: this is needed because the HW can update the A/D bits of any of the PTE that constitutes the NAPOT region. Fixes: 82a1a1f3bfb6 ("riscv: mm: support Svnapot in hugetlb page") Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Link: https://lore.kernel.org/r/20230428120120.21620-2-alexghiti@rivosinc.com Cc: stable@vger.kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-06-01riscv: Fix huge_ptep_set_wrprotect when PTE is a NAPOTAlexandre Ghiti
We need to avoid inconsistencies across the PTEs that form a NAPOT region, so when we write protect such a region, we should clear and flush all the PTEs to make sure that any of those PTEs is not cached which would result in such inconsistencies (arm64 does the same). Fixes: 82a1a1f3bfb6 ("riscv: mm: support Svnapot in hugetlb page") Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Link: https://lore.kernel.org/r/20230428120120.21620-1-alexghiti@rivosinc.com Cc: stable@vger.kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-05-29riscv: mm: init: Pass a pointer to virt_to_page()Linus Walleij
Functions that work on a pointer to virtual memory such as virt_to_pfn() and users of that function such as virt_to_page() are supposed to pass a pointer to virtual memory, ideally a (void *) or other pointer. However since many architectures implement virt_to_pfn() as a macro, this function becomes polymorphic and accepts both a (unsigned long) and a (void *). Fix this in the RISCV mm init code, so we can implement a strongly typed virt_to_pfn(). Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2023-05-24riscv: Fix unused variable warning when BUILTIN_DTB is setAlexandre Ghiti
commit ef69d2559fe9 ("riscv: Move early dtb mapping into the fixmap region") wrongly moved the #ifndef CONFIG_BUILTIN_DTB surrounding the pa variable definition in create_fdt_early_page_table(), so move it back to its right place to quiet the following warning: ../arch/riscv/mm/init.c: In function ‘create_fdt_early_page_table’: ../arch/riscv/mm/init.c:925:12: warning: unused variable ‘pa’ [-Wunused-variable] 925 | uintptr_t pa = dtb_pa & ~(PMD_SIZE - 1); Fixes: ef69d2559fe9 ("riscv: Move early dtb mapping into the fixmap region") Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20230519131311.391960-1-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-29riscv: mm: remove redundant parameter of create_fdt_early_page_tableSong Shuai
create_fdt_early_page_table() explicitly uses early_pg_dir for 32-bit fdt mapping and the pgdir parameter is redundant here. So remove it and its caller. Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Signed-off-by: Song Shuai <suagrfillet@gmail.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Fixes: ef69d2559fe9 ("riscv: Move early dtb mapping into the fixmap region") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230426100009.685435-1-suagrfillet@gmail.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-29Merge patch series "RISC-V Hibernation Support"Palmer Dabbelt
Sia Jee Heng <jeeheng.sia@starfivetech.com> says: This series adds RISC-V Hibernation/suspend to disk support. Low level Arch functions were created to support hibernation. swsusp_arch_suspend() relies code from __cpu_suspend_enter() to write cpu state onto the stack, then calling swsusp_save() to save the memory image. Arch specific hibernation header is implemented and is utilized by the arch_hibernation_header_restore() and arch_hibernation_header_save() functions. The arch specific hibernation header consists of satp, hartid, and the cpu_resume address. The kernel built version is also need to be saved into the hibernation image header to making sure only the same kernel is restore when resume. swsusp_arch_resume() creates a temporary page table that covering only the linear map. It copies the restore code to a 'safe' page, then start to restore the memory image. Once completed, it restores the original kernel's page table. It then calls into __hibernate_cpu_resume() to restore the CPU context. Finally, it follows the normal hibernation path back to the hibernation core. To enable hibernation/suspend to disk into RISCV, the below config need to be enabled: - CONFIG_HIBERNATION - CONFIG_ARCH_HIBERNATION_HEADER - CONFIG_ARCH_HIBERNATION_POSSIBLE At high-level, this series includes the following changes: 1) Change suspend_save_csrs() and suspend_restore_csrs() to public function as these functions are common to suspend/hibernation. (patch 1) 2) Refactor the common code in the __cpu_resume_enter() function and __hibernate_cpu_resume() function. The common code are used by hibernation and suspend. (patch 2) 3) Enhance kernel_page_present() function to support huge page. (patch 3) 4) Add arch/riscv low level functions to support hibernation/suspend to disk. (patch 4) * b4-shazam-merge: RISC-V: Add arch functions to support hibernation/suspend-to-disk RISC-V: mm: Enable huge page support to kernel_page_present() function RISC-V: Factor out common code of __cpu_resume_enter() RISC-V: Change suspend_save_csrs and suspend_restore_csrs to public function Link: https://lore.kernel.org/r/20230330064321.1008373-1-jeeheng.sia@starfivetech.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-29RISC-V: mm: Enable huge page support to kernel_page_present() functionSia Jee Heng
Currently kernel_page_present() function doesn't support huge page detection causes the function to mistakenly return false to the hibernation core. Add huge page detection to the function to solve the problem. Fixes: 9e953cda5cdf ("riscv: Introduce huge page support for 32/64bit kernel") Signed-off-by: Sia Jee Heng <jeeheng.sia@starfivetech.com> Reviewed-by: Ley Foon Tan <leyfoon.tan@starfivetech.com> Reviewed-by: Mason Huo <mason.huo@starfivetech.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20230330064321.1008373-4-jeeheng.sia@starfivetech.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-28Merge tag 'riscv-for-linus-6.4-mw1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Palmer Dabbelt: - Support for runtime detection of the Svnapot extension - Support for Zicboz when clearing pages - We've moved to GENERIC_ENTRY - Support for !MMU on rv32 systems - The linear region is now mapped via huge pages - Support for building relocatable kernels - Support for the hwprobe interface - Various fixes and cleanups throughout the tree * tag 'riscv-for-linus-6.4-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (57 commits) RISC-V: hwprobe: Explicity check for -1 in vdso init RISC-V: hwprobe: There can only be one first riscv: Allow to downgrade paging mode from the command line dt-bindings: riscv: add sv57 mmu-type RISC-V: hwprobe: Remove __init on probe_vendor_features() riscv: Use --emit-relocs in order to move .rela.dyn in init riscv: Check relocations at compile time powerpc: Move script to check relocations at compile time in scripts/ riscv: Introduce CONFIG_RELOCATABLE riscv: Move .rela.dyn outside of init to avoid empty relocations riscv: Prepare EFI header for relocatable kernels riscv: Unconditionnally select KASAN_VMALLOC if KASAN riscv: Fix ptdump when KASAN is enabled riscv: Fix EFI stub usage of KASAN instrumented strcmp function riscv: Move DTB_EARLY_BASE_VA to the kernel address space riscv: Rework kasan population functions riscv: Split early and final KASAN population functions riscv: Use PUD/P4D/PGD pages for the linear mapping riscv: Move the linear mapping creation in its own function riscv: Get rid of riscv_pfn_base variable ...
2023-04-26riscv: Allow to downgrade paging mode from the command lineAlexandre Ghiti
Add 2 early command line parameters that allow to downgrade satp mode (using the same naming as x86): - "no5lvl": use a 4-level page table (down from sv57 to sv48) - "no4lvl": use a 3-level page table (down from sv57/sv48 to sv39) Note that going through the device tree to get the kernel command line works with ACPI too since the efi stub creates a device tree anyway with the command line. In KASAN kernels, we can't use the libfdt that early in the boot process since we are not ready to execute instrumented functions. So instead of using the "generic" libfdt, we compile our own versions of those functions that are not instrumented and that are prefixed so that they do not conflict with the generic ones. We also need the non-instrumented versions of the string functions and the prefixed versions of memcpy/memmove. This is largely inspired by commit aacd149b6238 ("arm64: head: avoid relocating the kernel twice for KASLR") from which I removed compilation flags that were not relevant to RISC-V at the moment (LTO, SCS). Also note that we have to link with -z norelro to avoid ld.lld to throw a warning with the new .got sections, like in commit 311bea3cb9ee ("arm64: link with -z norelro for LLD or aarch64-elf"). Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Tested-by: Björn Töpel <bjorn@rivosinc.com> Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20230424092313.178699-2-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-25Merge tag 'irq-core-2023-04-24' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull interrupt updates from Thomas Gleixner: "Core: - Add tracepoints for tasklet callbacks which makes it possible to analyze individual tasklet functions instead of guess working from the overall duration of tasklet processing - Ensure that secondary interrupt threads have their affinity adjusted correctly Drivers: - A large rework of the RISC-V IPI management to prepare for a new RISC-V interrupt architecture - Small fixes and enhancements all over the place - Removal of support for various obsolete hardware platforms and the related code" * tag 'irq-core-2023-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits) irqchip/st: Remove stih415/stih416 and stid127 platforms support irqchip/gic-v3: Add Rockchip 3588001 erratum workaround genirq: Update affinity of secondary threads softirq: Add trace points for tasklet entry/exit irqchip/loongson-pch-pic: Fix pch_pic_acpi_init calling irqchip/loongson-pch-pic: Fix registration of syscore_ops irqchip/loongson-eiointc: Fix registration of syscore_ops irqchip/loongson-eiointc: Fix incorrect use of acpi_get_vec_parent irqchip/loongson-eiointc: Fix returned value on parsing MADT irqchip/riscv-intc: Add empty irq_eoi() for chained irq handlers RISC-V: Use IPIs for remote icache flush when possible RISC-V: Use IPIs for remote TLB flush when possible RISC-V: Allow marking IPIs as suitable for remote FENCEs RISC-V: Treat IPIs as normal Linux IRQs irqchip/riscv-intc: Allow drivers to directly discover INTC hwnode RISC-V: Clear SIP bit only when using SBI IPI operations irqchip/irq-sifive-plic: Add syscore callbacks for hibernation irqchip: Use of_property_read_bool() for boolean properties irqchip/bcm-6345-l1: Request memory region irqchip/gicv3: Workaround for NVIDIA erratum T241-FABRIC-4 ...
2023-04-19Merge patch series "Introduce 64b relocatable kernel"Palmer Dabbelt
Alexandre Ghiti <alexghiti@rivosinc.com> says: After multiple attempts, this patchset is now based on the fact that the 64b kernel mapping was moved outside the linear mapping. The first patch allows to build relocatable kernels but is not selected by default. That patch is a requirement for KASLR. The second and third patches take advantage of an already existing powerpc script that checks relocations at compile-time, and uses it for riscv. * b4-shazam-merge: riscv: Use --emit-relocs in order to move .rela.dyn in init riscv: Check relocations at compile time powerpc: Move script to check relocations at compile time in scripts/ riscv: Introduce CONFIG_RELOCATABLE riscv: Move .rela.dyn outside of init to avoid empty relocations riscv: Prepare EFI header for relocatable kernels Link: https://lore.kernel.org/r/20230329045329.64565-1-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-19riscv: Introduce CONFIG_RELOCATABLEAlexandre Ghiti
This config allows to compile 64b kernel as PIE and to relocate it at any virtual address at runtime: this paves the way to KASLR. Runtime relocation is possible since relocation metadata are embedded into the kernel. Note that relocating at runtime introduces an overhead even if the kernel is loaded at the same address it was linked at and that the compiler options are those used in arm64 which uses the same RELA relocation format. Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20230329045329.64565-4-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-19Merge patch series "RISC-V kasan rework"Palmer Dabbelt
Alexandre Ghiti <alexghiti@rivosinc.com> says: As described in patch 2, our current kasan implementation is intricate, so I tried to simplify the implementation and mimic what arm64/x86 are doing. In addition it fixes UEFI bootflow with a kasan kernel and kasan inline instrumentation: all kasan configurations were tested on a large ubuntu kernel with success with KASAN_KUNIT_TEST and KASAN_MODULE_TEST. inline ubuntu config + uefi: sv39: OK sv48: OK sv57: OK outline ubuntu config + uefi: sv39: OK sv48: OK sv57: OK Actually 1 test always fails with KASAN_KUNIT_TEST that I have to check: KASAN failure expected in "set_bit(nr, addr)", but none occurrred Note that Palmer recently proposed to remove COMMAND_LINE_SIZE from the userspace abi https://lore.kernel.org/lkml/20221211061358.28035-1-palmer@rivosinc.com/T/ so that we can finally increase the command line to fit all kasan kernel parameters. All of this should hopefully fix the syzkaller riscv build that has been failing for a few months now, any test is appreciated and if I can help in any way, please ask. * b4-shazam-merge: riscv: Unconditionnally select KASAN_VMALLOC if KASAN riscv: Fix ptdump when KASAN is enabled riscv: Fix EFI stub usage of KASAN instrumented strcmp function riscv: Move DTB_EARLY_BASE_VA to the kernel address space riscv: Rework kasan population functions riscv: Split early and final KASAN population functions Link: https://lore.kernel.org/r/20230203075232.274282-1-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-19riscv: Fix ptdump when KASAN is enabledAlexandre Ghiti
The KASAN shadow region was moved next to the kernel mapping but the ptdump code was not updated and it appears to break the dump of the kernel page table, so fix this by moving the KASAN shadow region in ptdump. Fixes: f7ae02333d13 ("riscv: Move KASAN mapping next to the kernel mapping") Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Tested-by: Björn Töpel <bjorn@rivosinc.com> Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20230203075232.274282-6-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-19riscv: Move DTB_EARLY_BASE_VA to the kernel address spaceAlexandre Ghiti
The early virtual address should lie in the kernel address space for inline kasan instrumentation to succeed, otherwise kasan tries to dereference an address that does not exist in the address space (since kasan only maps *kernel* address space, not the userspace). Simply use the very first address of the kernel address space for the early fdt mapping. It allowed an Ubuntu kernel to boot successfully with inline instrumentation. Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20230203075232.274282-4-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-19riscv: Rework kasan population functionsAlexandre Ghiti
Our previous kasan population implementation used to have the final kasan shadow region mapped with kasan_early_shadow_page, because we did not clean the early mapping and then we had to populate the kasan region "in-place" which made the code cumbersome. So now we clear the early mapping, establish a temporary mapping while we populate the kasan shadow region with just the kernel regions that will be used. This new version uses the "generic" way of going through a page table that may be folded at runtime (avoid the XXX_next macros). It was tested with outline instrumentation on an Ubuntu kernel configuration successfully. Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20230203075232.274282-3-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-19riscv: Split early and final KASAN population functionsAlexandre Ghiti
This is a preliminary work that allows to make the code more understandable. Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20230203075232.274282-2-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-18Merge patch series "riscv: Use PUD/P4D/PGD pages for the linear mapping"Palmer Dabbelt
Alexandre Ghiti <alexghiti@rivosinc.com> says: This patchset intends to improve tlb utilization by using hugepages for the linear mapping. As reported by Anup in v6, when STRICT_KERNEL_RWX is enabled, we must take care of isolating the kernel text and rodata so that they are not mapped with a PUD mapping which would then assign wrong permissions to the whole region: it is achieved the same way as arm64 by using the memblock nomap API which isolates those regions and re-merge them afterwards thus avoiding any issue with the system resources tree creation. arch/riscv/include/asm/page.h | 19 ++++++- arch/riscv/mm/init.c | 102 ++++++++++++++++++++++++++-------- arch/riscv/mm/physaddr.c | 16 ++++++ drivers/of/fdt.c | 11 ++-- 4 files changed, 118 insertions(+), 30 deletions(-) * b4-shazam-merge: riscv: Use PUD/P4D/PGD pages for the linear mapping riscv: Move the linear mapping creation in its own function riscv: Get rid of riscv_pfn_base variable Link: https://lore.kernel.org/r/20230324155421.271544-1-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-18riscv: Use PUD/P4D/PGD pages for the linear mappingAlexandre Ghiti
During the early page table creation, we used to set the mapping for PAGE_OFFSET to the kernel load address: but the kernel load address is always offseted by PMD_SIZE which makes it impossible to use PUD/P4D/PGD pages as this physical address is not aligned on PUD/P4D/PGD size (whereas PAGE_OFFSET is). But actually we don't have to establish this mapping (ie set va_pa_offset) that early in the boot process because: - first, setup_vm installs a temporary kernel mapping and among other things, discovers the system memory, - then, setup_vm_final creates the final kernel mapping and takes advantage of the discovered system memory to create the linear mapping. During the first phase, we don't know the start of the system memory and then until the second phase is finished, we can't use the linear mapping at all and phys_to_virt/virt_to_phys translations must not be used because it would result in a different translation from the 'real' one once the final mapping is installed. So here we simply delay the initialization of va_pa_offset to after the system memory discovery. But to make sure noone uses the linear mapping before, we add some guard in the DEBUG_VIRTUAL config. Finally we can use PUD/P4D/PGD hugepages when possible, which will result in a better TLB utilization. Note that: - this does not apply to rv32 as the kernel mapping lies in the linear mapping. - we rely on the firmware to protect itself using PMP. Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Acked-by: Rob Herring <robh@kernel.org> # DT bits Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Anup Patel <anup@brainfault.org> Tested-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20230324155421.271544-4-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-18riscv: Move the linear mapping creation in its own functionAlexandre Ghiti
No change intended, it just splits the linear mapping creation from setup_vm_final: this prepares for upcoming additions to the linear mapping creation. Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Anup Patel <anup@brainfault.org> Tested-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20230324155421.271544-3-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-18riscv: Get rid of riscv_pfn_base variableAlexandre Ghiti
Use directly phys_ram_base instead, riscv_pfn_base is just the pfn of the address contained in phys_ram_base. Even if there is no functional change intended in this patch, actually setting phys_ram_base that early changes the behaviour of kernel_mapping_pa_to_va during the early boot: phys_ram_base used to be zero before this patch and now it is set to the physical start address of the kernel. But it does not break the conversion of a kernel physical address into a virtual address since kernel_mapping_pa_to_va should only be used on kernel physical addresses, i.e. addresses greater than the physical start address of the kernel. Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Anup Patel <anup@brainfault.org> Tested-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20230324155421.271544-2-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-13riscv: No need to relocate the dtb as it lies in the fixmap regionAlexandre Ghiti
We used to access the dtb via its linear mapping address but now that the dtb early mapping was moved in the fixmap region, we can keep using this address since it is present in swapper_pg_dir, and remove the dtb relocation. Note that the relocation was wrong anyway since early_memremap() is restricted to 256K whereas the maximum fdt size is 2MB. Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Tested-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20230329081932.79831-4-alexghiti@rivosinc.com Cc: stable@vger.kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-13riscv: Move early dtb mapping into the fixmap regionAlexandre Ghiti
riscv establishes 2 virtual mappings: - early_pg_dir maps the kernel which allows to discover the system memory - swapper_pg_dir installs the final mapping (linear mapping included) We used to map the dtb in early_pg_dir using DTB_EARLY_BASE_VA, and this mapping was not carried over in swapper_pg_dir. It happens that early_init_fdt_scan_reserved_mem() must be called before swapper_pg_dir is setup otherwise we could allocate reserved memory defined in the dtb. And this function initializes reserved_mem variable with addresses that lie in the early_pg_dir dtb mapping: when those addresses are reused with swapper_pg_dir, this mapping does not exist and then we trap. The previous "fix" was incorrect as early_init_fdt_scan_reserved_mem() must be called before swapper_pg_dir is set up otherwise we could allocate in reserved memory defined in the dtb. So move the dtb mapping in the fixmap region which is established in early_pg_dir and handed over to swapper_pg_dir. Fixes: 922b0375fc93 ("riscv: Fix memblock reservation for device tree blob") Fixes: 8f3a2b4a96dc ("RISC-V: Move DT mapping outof fixmap") Fixes: 50e63dd8ed92 ("riscv: fix reserved memory setup") Reported-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/all/f8e67f82-103d-156c-deb0-d6d6e2756f5e@microchip.com/ Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Tested-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20230329081932.79831-2-alexghiti@rivosinc.com Cc: stable@vger.kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-04-08RISC-V: Use IPIs for remote icache flush when possibleAnup Patel
If we have specialized interrupt controller (such as AIA IMSIC) which allows supervisor mode to directly inject IPIs without any assistance from M-mode or HS-mode then using such specialized interrupt controller, we can do remote icache flushe directly from supervisor mode instead of using the SBI RFENCE calls. This patch extends remote icache flush functions to use supervisor mode IPIs whenever direct supervisor mode IPIs.are supported by interrupt controller. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230328035223.1480939-7-apatel@ventanamicro.com
2023-04-08RISC-V: Use IPIs for remote TLB flush when possibleAnup Patel
If we have specialized interrupt controller (such as AIA IMSIC) which allows supervisor mode to directly inject IPIs without any assistance from M-mode or HS-mode then using such specialized interrupt controller, we can do remote TLB flushes directly from supervisor mode instead of using the SBI RFENCE calls. This patch extends remote TLB flush functions to use supervisor mode IPIs whenever direct supervisor mode IPIs.are supported by interrupt controller. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230328035223.1480939-6-apatel@ventanamicro.com
2023-03-24Merge patch series "riscv: Add GENERIC_ENTRY support"Palmer Dabbelt
guoren@kernel.org <guoren@kernel.org> says: From: Guo Ren <guoren@linux.alibaba.com> The patches convert riscv to use the generic entry infrastructure from kernel/entry/*. Some optimization for entry.S with new .macro and merge ret_from_kernel_thread into ret_from_fork. * b4-shazam-merge: riscv: entry: Consolidate general regs saving/restoring riscv: entry: Consolidate ret_from_kernel_thread into ret_from_fork riscv: entry: Remove extra level wrappers of trace_hardirqs_{on,off} riscv: entry: Convert to generic entry riscv: entry: Add noinstr to prevent instrumentation inserted riscv: ptrace: Remove duplicate operation Link: https://lore.kernel.org/r/20230222033021.983168-1-guoren@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-03-23riscv: entry: Convert to generic entryGuo Ren
This patch converts riscv to use the generic entry infrastructure from kernel/entry/*. The generic entry makes maintainers' work easier and codes more elegant. Here are the changes: - More clear entry.S with handle_exception and ret_from_exception - Get rid of complex custom signal implementation - Move syscall procedure from assembly to C, which is much more readable. - Connect ret_from_fork & ret_from_kernel_thread to generic entry. - Wrap with irqentry_enter/exit and syscall_enter/exit_from_user_mode - Use the standard preemption code instead of custom Suggested-by: Huacai Chen <chenhuacai@kernel.org> Reviewed-by: Björn Töpel <bjorn@rivosinc.com> Tested-by: Yipeng Zou <zouyipeng@huawei.com> Tested-by: Jisheng Zhang <jszhang@kernel.org> Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org> Cc: Ben Hutchings <ben@decadent.org.uk> Link: https://lore.kernel.org/r/20230222033021.983168-5-guoren@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-03-21riscv: mm: Fix incorrect ASID argument when flushing TLBDylan Jhong
Currently, we pass the CONTEXTID instead of the ASID to the TLB flush function. We should only take the ASID field to prevent from touching the reserved bit field. Fixes: 3f1e782998cd ("riscv: add ASID-based tlbflushing methods") Signed-off-by: Dylan Jhong <dylan@andestech.com> Reviewed-by: Sergey Matyukevich <sergey.matyukevich@syntacore.com> Link: https://lore.kernel.org/r/20230313034906.2401730-1-dylan@andestech.com Cc: stable@vger.kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-03-15Merge patch series "RISC-V: Apply Zicboz to clear_page"Palmer Dabbelt
Andrew Jones <ajones@ventanamicro.com> says: When the Zicboz extension is available we can more rapidly zero naturally aligned Zicboz block sized chunks of memory. As pages are always page aligned and are larger than any Zicboz block size will be, then clear_page() appears to be a good candidate for the extension. While cycle count and energy consumption should also be considered, we can be pretty certain that implementing clear_page() with the Zicboz extension is a win by comparing the new dynamic instruction count with its current count[1]. Doing so we see that the new count is just over a quarter of the old count (see patch6's commit message for more details). For those of you who reviewed v1[2], you may be looking for the memset() patches. As pointed out in v1, and a couple follow-up emails, it's not clear that patching memset() is a win yet. When I get a chance to test on real hardware with a comprehensive benchmark collection then I can post the memset() patches separately (assuming the benchmarks show it's worthwhile). * b4-shazam-merge: RISC-V: KVM: Expose Zicboz to the guest RISC-V: KVM: Provide UAPI for Zicboz block size RISC-V: Use Zicboz in clear_page when available RISC-V: cpufeatures: Put the upper 16 bits of patch ID to work RISC-V: Add Zicboz detection and block size parsing dt-bindings: riscv: Document cboz-block-size RISC-V: Factor out body of riscv_init_cbom_blocksize loop RISC-V: alternatives: Support patching multiple insns in assembly Link: https://lore.kernel.org/r/20230224162631.405473-1-ajones@ventanamicro.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-03-14RISC-V: Add Zicboz detection and block size parsingAndrew Jones
Parse "riscv,cboz-block-size" from the DT by piggybacking on Zicbom's riscv_init_cbom_blocksize(). Additionally check the DT for the presence of the "zicboz" extension and, when it's present, validate the parsed cboz block size as we do Zicbom's cbom block size with riscv_isa_extension_check(). Signed-off-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Heiko Stuebner <heiko@sntech.de> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20230224162631.405473-5-ajones@ventanamicro.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-03-14RISC-V: Factor out body of riscv_init_cbom_blocksize loopAndrew Jones
Refactor riscv_init_cbom_blocksize() to prepare for it to be used for both cbom block size and cboz block size. Signed-off-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Heiko Stuebner <heiko@sntech.de> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20230224162631.405473-3-ajones@ventanamicro.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-03-14RISC-V: mm: Support huge page in vmalloc_fault()Dylan Jhong
Since RISC-V supports ioremap() with huge page (pud/pmd) mapping, However, vmalloc_fault() assumes that the vmalloc range is limited to pte mappings. To complete the vmalloc_fault() function by adding huge page support. Fixes: 310f541a027b ("riscv: Enable HAVE_ARCH_HUGE_VMAP for 64BIT") Cc: stable@vger.kernel.org Signed-off-by: Dylan Jhong <dylan@andestech.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20230310075021.3919290-1-dylan@andestech.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-03-09Merge patch series "riscv, mm: detect svnapot cpu support at runtime"Palmer Dabbelt
Qinglin Pan <panqinglin00@gmail.com> says: Svnapot is a RISC-V extension for marking contiguous 4K pages as a non-4K page. This patch set is for using Svnapot in hugetlb fs and huge vmap. This patchset adds a Kconfig item for using Svnapot in "Platform type"->"SVNAPOT extension support". Its default value is on, and people can set it off if they don't allow kernel to detect Svnapot hardware support and leverage it. Tested on: - qemu rv64 with "Svnapot support" off and svnapot=true. - qemu rv64 with "Svnapot support" on and svnapot=true. - qemu rv64 with "Svnapot support" off and svnapot=false. - qemu rv64 with "Svnapot support" on and svnapot=false. * b4-shazam-merge: riscv: mm: support Svnapot in huge vmap riscv: mm: support Svnapot in hugetlb page riscv: mm: modify pte format for Svnapot Link: https://lore.kernel.org/r/20230209131647.17245-1-panqinglin00@gmail.com [Palmer: fix up the feature ordering in the merge] Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-03-09Merge patch series "riscv: asid: switch to alternative way to fix stale TLB ↵Palmer Dabbelt
entries" Sergey Matyukevich <geomatsi@gmail.com> says: Some time ago two different patches have been posted to fix stale TLB entries that caused applications crashes. The patch [0] suggested 'aggregating' mm_cpumask, i.e. current cpu is not cleared for the switched-out task in switch_mm function. For additional explanations see the commit message by Guo Ren. The same approach is used by arc architecture, so another good comment is for switch_mm in arch/arc/include/asm/mmu_context.h. The patch [1] attempted to reduce the number of TLB flushes by deferring (and possibly avoiding) them for CPUs not running the task. Patch [1] has been merged. However we already have two bug reports from different vendors. So apparently something is missing in the approach suggested in [1]. In both cases the patch [0] fixed the issue. This patch series reverts [1] and replaces it by [0]. [0] https://lore.kernel.org/linux-riscv/20221111075902.798571-1-guoren@kernel.org/ [1] https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@gmail.com/ * b4-shazam-merge: riscv: asid: Fixup stale TLB entry cause application crash Revert "riscv: mm: notify remote harts about mmu cache updates" Link: https://lore.kernel.org/r/20230226150137.1919750-1-geomatsi@gmail.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-03-09riscv: asid: Fixup stale TLB entry cause application crashGuo Ren
After use_asid_allocator is enabled, the userspace application will crash by stale TLB entries. Because only using cpumask_clear_cpu without local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh. Then set_mm_asid would cause the user space application to get a stale value by stale TLB entry, but set_mm_noasid is okay. Here is the symptom of the bug: unhandled signal 11 code 0x1 (coredump) 0x0000003fd6d22524 <+4>: auipc s0,0x70 0x0000003fd6d22528 <+8>: ld s0,-148(s0) # 0x3fd6d92490 => 0x0000003fd6d2252c <+12>: ld a5,0(s0) (gdb) i r s0 s0 0x8082ed1cc3198b21 0x8082ed1cc3198b21 (gdb) x /2x 0x3fd6d92490 0x3fd6d92490: 0xd80ac8a8 0x0000003f The core dump file shows that register s0 is wrong, but the value in memory is correct. Because 'ld s0, -148(s0)' used a stale mapping entry in TLB and got a wrong result from an incorrect physical address. When the task ran on CPU0, which loaded/speculative-loaded the value of address(0x3fd6d92490), then the first version of the mapping entry was PTWed into CPU0's TLB. When the task switched from CPU0 to CPU1 (No local_tlb_flush_all here by asid), it happened to write a value on the address (0x3fd6d92490). It caused do_page_fault -> wp_page_copy -> ptep_clear_flush -> ptep_get_and_clear & flush_tlb_page. The flush_tlb_page used mm_cpumask(mm) to determine which CPUs need TLB flush, but CPU0 had cleared the CPU0's mm_cpumask in the previous switch_mm. So we only flushed the CPU1 TLB and set the second version mapping of the PTE. When the task switched from CPU1 to CPU0 again, CPU0 still used a stale TLB mapping entry which contained a wrong target physical address. It raised a bug when the task happened to read that value. CPU0 CPU1 - switch 'task' in - read addr (Fill stale mapping entry into TLB) - switch 'task' out (no tlb_flush) - switch 'task' in (no tlb_flush) - write addr cause pagefault do_page_fault() (change to new addr mapping) wp_page_copy() ptep_clear_flush() ptep_get_and_clear() & flush_tlb_page() write new value into addr - switch 'task' out (no tlb_flush) - switch 'task' in (no tlb_flush) - read addr again (Use stale mapping entry in TLB) get wrong value from old phyical addr, BUG! The solution is to keep all CPUs' footmarks of cpumask(mm) in switch_mm, which could guarantee to invalidate all stale TLB entries during TLB flush. Fixes: 65d4b9c53017 ("RISC-V: Implement ASID allocator") Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org> Tested-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Tested-by: Zong Li <zong.li@sifive.com> Tested-by: Sergey Matyukevich <sergey.matyukevich@syntacore.com> Cc: Anup Patel <apatel@ventanamicro.com> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: stable@vger.kernel.org Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Link: https://lore.kernel.org/r/20230226150137.1919750-3-geomatsi@gmail.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-03-09Revert "riscv: mm: notify remote harts about mmu cache updates"Sergey Matyukevich
This reverts the remaining bits of commit 4bd1d80efb5a ("riscv: mm: notify remote harts harts about mmu cache updates"). According to bug reports, suggested approach to fix stale TLB entries is not sufficient. It needs to be replaced by a more robust solution. Fixes: 4bd1d80efb5a ("riscv: mm: notify remote harts about mmu cache updates") Reported-by: Zong Li <zong.li@sifive.com> Reported-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com> Signed-off-by: Sergey Matyukevich <sergey.matyukevich@syntacore.com> Cc: stable@vger.kernel.org Reviewed-by: Guo Ren <guoren@kernel.org> Link: https://lore.kernel.org/r/20230226150137.1919750-2-geomatsi@gmail.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-03-07riscv: mm: support Svnapot in hugetlb pageQinglin Pan
Svnapot can be used to support 64KB hugetlb page, so it can become a new option when using hugetlbfs. Add a basic implementation of hugetlb page, and support 64KB as a size in it by using Svnapot. For test, boot kernel with command line contains "default_hugepagesz=64K hugepagesz=64K hugepages=20" and run a simple test like this: tools/testing/selftests/vm/map_hugetlb 1 16 And it should be passed. Signed-off-by: Qinglin Pan <panqinglin00@gmail.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Link: https://lore.kernel.org/r/20230209131647.17245-3-panqinglin00@gmail.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-03-05Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull VM_FAULT_RETRY fixes from Al Viro: "Some of the page fault handlers do not deal with the following case correctly: - handle_mm_fault() has returned VM_FAULT_RETRY - there is a pending fatal signal - fault had happened in kernel mode Correct action in such case is not "return unconditionally" - fatal signals are handled only upon return to userland and something like copy_to_user() would end up retrying the faulting instruction and triggering the same fault again and again. What we need to do in such case is to make the caller to treat that as failed uaccess attempt - handle exception if there is an exception handler for faulting instruction or oops if there isn't one. Over the years some architectures had been fixed and now are handling that case properly; some still do not. This series should fix the remaining ones. Status: - m68k, riscv, hexagon, parisc: tested/acked by maintainers. - alpha, sparc32, sparc64: tested locally - bug has been reproduced on the unpatched kernel and verified to be fixed by this series. - ia64, microblaze, nios2, openrisc: build, but otherwise completely untested" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: openrisc: fix livelock in uaccess nios2: fix livelock in uaccess microblaze: fix livelock in uaccess ia64: fix livelock in uaccess sparc: fix livelock in uaccess alpha: fix livelock in uaccess parisc: fix livelock in uaccess hexagon: fix livelock in uaccess riscv: fix livelock in uaccess m68k: fix livelock in uaccess
2023-03-02riscv: fix livelock in uaccessAl Viro
riscv equivalent of 26178ec11ef3 "x86: mm: consolidate VM_FAULT_RETRY handling" If e.g. get_user() triggers a page fault and a fatal signal is caught, we might end up with handle_mm_fault() returning VM_FAULT_RETRY and not doing anything to page tables. In such case we must *not* return to the faulting insn - that would repeat the entire thing without making any progress; what we need instead is to treat that as failed (user) memory access. Tested-by: Björn Töpel <bjorn@kernel.org> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-02-25Merge tag 'riscv-for-linus-6.3-mw1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Palmer Dabbelt: "There's a bunch of fixes/cleanups throughout the tree as usual, but we also have a handful of new features: - Various improvements to the extension detection and alternative patching infrastructure - Zbb-optimized string routines - Support for cpu-capacity in the RISC-V DT bindings - Zicbom no longer depends on toolchain support - Some performance and code size improvements to ftrace - Support for ARCH_WANT_LD_ORPHAN_WARN - Oops now contain the faulting instruction" * tag 'riscv-for-linus-6.3-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (67 commits) RISC-V: add a spin_shadow_stack declaration riscv: mm: hugetlb: Enable ARCH_WANT_HUGETLB_PAGE_OPTIMIZE_VMEMMAP riscv: Add header include guards to insn.h riscv: alternative: proceed one more instruction for auipc/jalr pair riscv: Avoid enabling interrupts in die() riscv, mm: Perform BPF exhandler fixup on page fault RISC-V: take text_mutex during alternative patching riscv: hwcap: Don't alphabetize ISA extension IDs RISC-V: fix ordering of Zbb extension riscv: jump_label: Fixup unaligned arch_static_branch function RISC-V: Only provide the single-letter extensions in HWCAP riscv: mm: fix regression due to update_mmu_cache change scripts/decodecode: Add support for RISC-V riscv: Add instruction dump to RISC-V splats riscv: select ARCH_WANT_LD_ORPHAN_WARN for !XIP_KERNEL riscv: vmlinux.lds.S: explicitly catch .init.bss sections from EFI stub riscv: vmlinux.lds.S: explicitly catch .riscv.attributes sections riscv: vmlinux.lds.S: explicitly catch .rela.dyn symbols riscv: lds: define RUNTIME_DISCARD_EXIT RISC-V: move some stray __RISCV_INSN_FUNCS definitions from kprobes ...
2023-02-21riscv, mm: Perform BPF exhandler fixup on page faultBjörn Töpel
Commit 21855cac82d3 ("riscv/mm: Prevent kernel module to access user memory without uaccess routines") added early exits/deaths for page faults stemming from accesses to user-space without using proper uaccess routines (where sstatus.SUM is set). Unfortunatly, this is too strict for some BPF programs, which relies on BPF exhandler fixups. These BPF programs loads "BTF pointers". A BTF pointers could either be a valid kernel pointer or NULL, but not a userspace address. Resolve the problem by calling the fixup handler in the early exit path. Fixes: 21855cac82d3 ("riscv/mm: Prevent kernel module to access user memory without uaccess routines") Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Link: https://lore.kernel.org/r/20230214162515.184827-1-bjorn@kernel.org Cc: stable@vger.kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2023-02-09riscv: Fixup race condition on PG_dcache_clean in flush_icache_pteGuo Ren
In commit 588a513d3425 ("arm64: Fix race condition on PG_dcache_clean in __sync_icache_dcache()"), we found RISC-V has the same issue as the previous arm64. The previous implementation didn't guarantee the correct sequence of operations, which means flush_icache_all() hasn't been called when the PG_dcache_clean was set. That would cause a risk of page synchronization. Fixes: 08f051eda33b ("RISC-V: Flush I$ when making a dirty page executable") Signed-off-by: Guo Ren <guoren@linux.alibaba.com> Signed-off-by: Guo Ren <guoren@kernel.org> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20230127035306.1819561-1-guoren@kernel.org Cc: stable@vger.kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>