summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-10-30Merge branch kvm-arm64/stage2-vhe-load into kvmarm/nextOliver Upton
* kvm-arm64/stage2-vhe-load: : Setup stage-2 MMU from vcpu_load() for VHE : : Unlike nVHE, there is no need to switch the stage-2 MMU around on guest : entry/exit in VHE mode as the host is running at EL2. Despite this KVM : reloads the stage-2 on every guest entry, which is needless. : : This series moves the setup of the stage-2 MMU context to vcpu_load() : when running in VHE mode. This is likely to be a win across the board, : but also allows us to remove an ISB on the guest entry path for systems : with one of the speculative AT errata. KVM: arm64: Move VTCR_EL2 into struct s2_mmu KVM: arm64: Load the stage-2 MMU context in kvm_vcpu_load_vhe() KVM: arm64: Rename helpers for VHE vCPU load/put KVM: arm64: Reload stage-2 for VMID change on VHE KVM: arm64: Restore the stage-2 context in VHE's __tlb_switch_to_host() KVM: arm64: Don't zero VTTBR in __tlb_switch_to_host() Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-30Merge branch kvm-arm64/nv-trap-fixes into kvmarm/nextOliver Upton
* kvm-arm64/nv-trap-fixes: : NV trap forwarding fixes, courtesy Miguel Luis and Marc Zyngier : : - Explicitly define the effects of HCR_EL2.NV on EL2 sysregs in the : NV trap encoding : : - Make EL2 registers that access AArch32 guest state UNDEF or RAZ/WI : where appropriate for NV guests KVM: arm64: Handle AArch32 SPSR_{irq,abt,und,fiq} as RAZ/WI KVM: arm64: Do not let a L1 hypervisor access the *32_EL2 sysregs KVM: arm64: Refine _EL2 system register list that require trap reinjection arm64: Add missing _EL2 encodings arm64: Add missing _EL12 encodings Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-30Merge branch kvm-arm64/smccc-filter-cleanups into kvmarm/nextOliver Upton
* kvm-arm64/smccc-filter-cleanups: : Cleanup the management of KVM's SMCCC maple tree : : Avoid the cost of maintaining the SMCCC filter maple tree if userspace : hasn't writen a rule to the filter. While at it, rip out the now : unnecessary VM flag to indicate whether or not the SMCCC filter was : configured. KVM: arm64: Use mtree_empty() to determine if SMCCC filter configured KVM: arm64: Only insert reserved ranges when SMCCC filter is used KVM: arm64: Add a predicate for testing if SMCCC filter is configured Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-30Merge branch kvm-arm64/pmevtyper-filter into kvmarm/nextOliver Upton
* kvm-arm64/pmevtyper-filter: : Fixes to KVM's handling of the PMUv3 exception level filtering bits : : - NSH (count at EL2) and M (count at EL3) should be stateful when the : respective EL is advertised in the ID registers but have no effect on : event counting. : : - NSU and NSK modify the event filtering of EL0 and EL1, respectively. : Though the kernel may not use these bits, other KVM guests might. : Implement these bits exactly as written in the pseudocode if EL3 is : advertised. KVM: arm64: Add PMU event filter bits required if EL3 is implemented KVM: arm64: Make PMEVTYPER<n>_EL0.NSH RES0 if EL2 isn't advertised Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-30Merge branch kvm-arm64/feature-flag-refactor into kvmarm/nextOliver Upton
* kvm-arm64/feature-flag-refactor: : vCPU feature flag cleanup : : Clean up KVM's handling of vCPU feature flags to get rid of the : vCPU-scoped bitmaps and remove failure paths from kvm_reset_vcpu(). KVM: arm64: Get rid of vCPU-scoped feature bitmap KVM: arm64: Remove unused return value from kvm_reset_vcpu() KVM: arm64: Hoist NV+SVE check into KVM_ARM_VCPU_INIT ioctl handler KVM: arm64: Prevent NV feature flag on systems w/o nested virt KVM: arm64: Hoist PAuth checks into KVM_ARM_VCPU_INIT ioctl KVM: arm64: Hoist SVE check into KVM_ARM_VCPU_INIT ioctl handler KVM: arm64: Hoist PMUv3 check into KVM_ARM_VCPU_INIT ioctl handler KVM: arm64: Add generic check for system-supported vCPU features Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-30Merge branch kvm-arm64/misc into kvmarm/nextOliver Upton
* kvm-arm64/misc: : Miscellaneous updates : : - Put an upper bound on the number of I-cache invalidations by : cacheline to avoid soft lockups : : - Get rid of bogus refererence count transfer for THP mappings : : - Do a local TLB invalidation on permission fault race : : - Fixes for page_fault_test KVM selftest : : - Add a tracepoint for detecting MMIO instructions unsupported by KVM KVM: arm64: Add tracepoint for MMIO accesses where ISV==0 KVM: arm64: selftest: Perform ISB before reading PAR_EL1 KVM: arm64: selftest: Add the missing .guest_prepare() KVM: arm64: Always invalidate TLB for stage-2 permission faults KVM: arm64: Do not transfer page refcount for THP adjustment KVM: arm64: Avoid soft lockups due to I-cache maintenance arm64: tlbflush: Rename MAX_TLBI_OPS KVM: arm64: Don't use kerneldoc comment for arm64_check_features() Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-30KVM: arm64: Add tracepoint for MMIO accesses where ISV==0Oliver Upton
It is a pretty well known fact that KVM does not support MMIO emulation without valid instruction syndrome information (ESR_EL2.ISV == 0). The current kvm_pr_unimpl() is pretty useless, as it contains zero context to relate the event to a vCPU. Replace it with a precise tracepoint that dumps the relevant context so the user can make sense of what the guest is doing. Acked-by: Zenghui Yu <yuzenghui@huawei.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231026205306.3045075-1-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-30KVM: arm64: selftest: Perform ISB before reading PAR_EL1Zenghui Yu
It looks like a mistake to issue ISB *after* reading PAR_EL1, we should instead perform it between the AT instruction and the reads of PAR_EL1. As according to DDI0487J.a IJTYVP, "When an address translation instruction is executed, explicit synchronization is required to guarantee the result is visible to subsequent direct reads of PAR_EL1." Otherwise all guest_at testcases fail on my box with ==== Test Assertion Failure ==== aarch64/page_fault_test.c:142: par & 1 == 0 pid=1355864 tid=1355864 errno=4 - Interrupted system call 1 0x0000000000402853: vcpu_run_loop at page_fault_test.c:681 2 0x0000000000402cdb: run_test at page_fault_test.c:730 3 0x0000000000403897: for_each_guest_mode at guest_modes.c:100 4 0x00000000004019f3: for_each_test_and_guest_mode at page_fault_test.c:1105 5 (inlined by) main at page_fault_test.c:1131 6 0x0000ffffb153c03b: ?? ??:0 7 0x0000ffffb153c113: ?? ??:0 8 0x0000000000401aaf: _start at ??:? 0x1 != 0x0 (par & 1 != 0) Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231007124043.626-2-yuzenghui@huawei.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-30KVM: arm64: selftest: Add the missing .guest_prepare()Zenghui Yu
Running page_fault_test on a Cortex A72 fails with Test: ro_memslot_no_syndrome_guest_cas Testing guest mode: PA-bits:40, VA-bits:48, 4K pages Testing memory backing src type: anonymous ==== Test Assertion Failure ==== aarch64/page_fault_test.c:117: guest_check_lse() pid=1944087 tid=1944087 errno=4 - Interrupted system call 1 0x00000000004028b3: vcpu_run_loop at page_fault_test.c:682 2 0x0000000000402d93: run_test at page_fault_test.c:731 3 0x0000000000403957: for_each_guest_mode at guest_modes.c:100 4 0x00000000004019f3: for_each_test_and_guest_mode at page_fault_test.c:1108 5 (inlined by) main at page_fault_test.c:1134 6 0x0000ffff868e503b: ?? ??:0 7 0x0000ffff868e5113: ?? ??:0 8 0x0000000000401aaf: _start at ??:? guest_check_lse() because we don't have a guest_prepare stage to check the presence of FEAT_LSE and skip the related guest_cas testing, and we end-up failing in GUEST_ASSERT(guest_check_lse()). Add the missing .guest_prepare() where it's indeed required. Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231007124043.626-1-yuzenghui@huawei.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-30KVM: arm64: Always invalidate TLB for stage-2 permission faultsOliver Upton
It is possible for multiple vCPUs to fault on the same IPA and attempt to resolve the fault. One of the page table walks will actually update the PTE and the rest will return -EAGAIN per our race detection scheme. KVM elides the TLB invalidation on the racing threads as the return value is nonzero. Before commit a12ab1378a88 ("KVM: arm64: Use local TLBI on permission relaxation") KVM always used broadcast TLB invalidations when handling permission faults, which had the convenient property of making the stage-2 updates visible to all CPUs in the system. However now we do a local invalidation, and TLBI elision leads to the vCPU thread faulting again on the stale entry. Remember that the architecture permits the TLB to cache translations that precipitate a permission fault. Invalidate the TLB entry responsible for the permission fault if the stage-2 descriptor has been relaxed, regardless of which thread actually did the job. Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230922223229.1608155-1-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-25KVM: arm64: Handle AArch32 SPSR_{irq,abt,und,fiq} as RAZ/WIMarc Zyngier
When trapping accesses from a NV guest that tries to access SPSR_{irq,abt,und,fiq}, make sure we handle them as RAZ/WI, as if AArch32 wasn't implemented. This involves a bit of repainting to make the visibility handler more generic. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231023095444.1587322-6-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-25KVM: arm64: Do not let a L1 hypervisor access the *32_EL2 sysregsMarc Zyngier
DBGVCR32_EL2, DACR32_EL2, IFSR32_EL2 and FPEXC32_EL2 are required to UNDEF when AArch32 isn't implemented, which is definitely the case when running NV. Given that this is the only case where these registers can trap, unconditionally inject an UNDEF exception. Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20231023095444.1587322-5-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-25KVM: arm64: Refine _EL2 system register list that require trap reinjectionMiguel Luis
Implement a fine grained approach in the _EL2 sysreg range instead of the current wide cast trap. This ensures that we don't mistakenly inject the wrong exception into the guest. [maz: commit message massaging, dropped secure and AArch32 registers from the list] Fixes: d0fc0a2519a6 ("KVM: arm64: nv: Add trap forwarding for HCR_EL2") Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Miguel Luis <miguel.luis@oracle.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231023095444.1587322-4-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-25arm64: Add missing _EL2 encodingsMiguel Luis
Some _EL2 encodings are missing. Add them. Signed-off-by: Miguel Luis <miguel.luis@oracle.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> [maz: dropped secure encodings] Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231023095444.1587322-3-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-25arm64: Add missing _EL12 encodingsMiguel Luis
Some _EL12 encodings are missing. Add them. Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Miguel Luis <miguel.luis@oracle.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231023095444.1587322-2-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-24KVM: arm64: Add PMU event filter bits required if EL3 is implementedOliver Upton
Suzuki noticed that KVM's PMU emulation is oblivious to the NSU and NSK event filter bits. On systems that have EL3 these bits modify the filter behavior in non-secure EL0 and EL1, respectively. Even though the kernel doesn't use these bits, it is entirely possible some other guest OS does. Additionally, it would appear that these and the M bit are required by the architecture if EL3 is implemented. Allow the EL3 event filter bits to be set if EL3 is advertised in the guest's ID register. Implement the behavior of NSU and NSK according to the pseudocode, and entirely ignore the M bit for perf event creation. Reported-by: Suzuki K Poulose <suzuki.poulose@arm.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20231019185618.3442949-3-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-24KVM: arm64: Make PMEVTYPER<n>_EL0.NSH RES0 if EL2 isn't advertisedOliver Upton
The NSH bit, which filters event counting at EL2, is required by the architecture if an implementation has EL2. Even though KVM doesn't support nested virt yet, it makes no effort to hide the existence of EL2 from the ID registers. Userspace can, however, change the value of PFR0 to hide EL2. Align KVM's sysreg emulation with the architecture and make NSH RES0 if EL2 isn't advertised. Keep in mind the bit is ignored when constructing the backing perf event. While at it, build the event type mask using explicit field definitions instead of relying on ARMV8_PMU_EVTYPE_MASK. KVM probably should've been doing this in the first place, as it avoids changes to the aforementioned mask affecting sysreg emulation. Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/20231019185618.3442949-2-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-23KVM: arm64: Move VTCR_EL2 into struct s2_mmuMarc Zyngier
We currently have a global VTCR_EL2 value for each guest, even if the guest uses NV. This implies that the guest's own S2 must fit in the host's. This is odd, for multiple reasons: - the PARange values and the number of IPA bits don't necessarily match: you can have 33 bits of IPA space, and yet you can only describe 32 or 36 bits of PARange - When userspace set the IPA space, it creates a contract with the kernel saying "this is the IPA space I'm prepared to handle". At no point does it constraint the guest's own IPA space as long as the guest doesn't try to use a [I]PA outside of the IPA space set by userspace - We don't even try to hide the value of ID_AA64MMFR0_EL1.PARange. And then there is the consequence of the above: if a guest tries to create a S2 that has for input address something that is larger than the IPA space defined by the host, we inject a fatal exception. This is no good. For all intent and purposes, a guest should be able to have the S2 it really wants, as long as the *output* address of that S2 isn't outside of the IPA space. For that, we need to have a per-s2_mmu VTCR_EL2 setting, which allows us to represent the full PARange. Move the vctr field into the s2_mmu structure, which has no impact whatsoever, except for NV. Note that once we are able to override ID_AA64MMFR0_EL1.PARange from userspace, we'll also be able to restrict the size of the shadow S2 that NV uses. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231012205108.3937270-1-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-20KVM: arm64: Load the stage-2 MMU context in kvm_vcpu_load_vhe()Oliver Upton
To date the VHE code has aggressively reloaded the stage-2 MMU context on every guest entry, despite the fact that this isn't necessary. This was probably done for consistency with the nVHE code, which needs to switch in/out the stage-2 MMU context as both the host and guest run at EL1. Hoist __load_stage2() into kvm_vcpu_load_vhe(), thus avoiding a reload on every guest entry/exit. This is likely to be beneficial to systems with one of the speculative AT errata, as there is now one fewer context synchronization event on the guest entry path. Additionally, it is possible that implementations have hitched correctness mitigations on writes to VTTBR_EL2, which are now elided on guest re-entry. Note that __tlb_switch_to_guest() is deliberately left untouched as it can be called outside the context of a running vCPU. Link: https://lore.kernel.org/r/20231018233212.2888027-6-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-20KVM: arm64: Rename helpers for VHE vCPU load/putOliver Upton
The names for the helpers we expose to the 'generic' KVM code are a bit imprecise; we switch the EL0 + EL1 sysreg context and setup trap controls that do not need to change for every guest entry/exit. Rename + shuffle things around a bit in preparation for loading the stage-2 MMU context on vcpu_load(). Link: https://lore.kernel.org/r/20231018233212.2888027-5-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-20KVM: arm64: Reload stage-2 for VMID change on VHEMarc Zyngier
Naturally, a change to the VMID for an MMU implies a new value for VTTBR. Reload on VMID change in anticipation of loading stage-2 on vcpu_load() instead of every guest entry. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231018233212.2888027-4-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-20KVM: arm64: Restore the stage-2 context in VHE's __tlb_switch_to_host()Marc Zyngier
An MMU notifier could cause us to clobber the stage-2 context loaded on a CPU when we switch to another VM's context to invalidate. This isn't an issue right now as the stage-2 context gets reloaded on every guest entry, but is disastrous when moving __load_stage2() into the vcpu_load() path. Restore the previous stage-2 context on the way out of a TLB invalidation if we installed something else. Deliberately do this after TGE=1 is synchronized to keep things safe in light of the speculative AT errata. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231018233212.2888027-3-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-20KVM: arm64: Don't zero VTTBR in __tlb_switch_to_host()Oliver Upton
HCR_EL2.TGE=0 is sufficient to disable stage-2 translation, so there's no need to explicitly zero VTTBR_EL2. Link: https://lore.kernel.org/r/20231018233212.2888027-2-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-05KVM: arm64: Use mtree_empty() to determine if SMCCC filter configuredOliver Upton
The smccc_filter maple tree is only populated if userspace attempted to configure it. Use the state of the maple tree to determine if the filter has been configured, eliminating the VM flag. Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231004234947.207507-4-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-05KVM: arm64: Only insert reserved ranges when SMCCC filter is usedOliver Upton
The reserved ranges are only useful for preventing userspace from adding a rule that intersects with functions we must handle in KVM. If userspace never writes to the SMCCC filter than this is all just wasted work/memory. Insert reserved ranges on the first call to KVM_ARM_VM_SMCCC_FILTER. Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231004234947.207507-3-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-10-05KVM: arm64: Add a predicate for testing if SMCCC filter is configuredOliver Upton
Eventually we can drop the VM flag, move around the existing implementation for now. Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231004234947.207507-2-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-09-30KVM: arm64: Do not transfer page refcount for THP adjustmentVincent Donnefort
GUP affects a refcount common to all pages forming the THP. There is therefore no need to move the refcount from a tail to the head page. Under the hood it decrements and increments the same counter. Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230928173205.2826598-2-vdonnefort@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-09-24Linux 6.6-rc3Linus Torvalds
2023-09-24Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "ARM: - Fix EL2 Stage-1 MMIO mappings where a random address was used - Fix SMCCC function number comparison when the SVE hint is set RISC-V: - Fix KVM_GET_REG_LIST API for ISA_EXT registers - Fix reading ISA_EXT register of a missing extension - Fix ISA_EXT register handling in get-reg-list test - Fix filtering of AIA registers in get-reg-list test x86: - Fixes for TSC_AUX virtualization - Stop zapping page tables asynchronously, since we don't zap them as often as before" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: SVM: Do not use user return MSR support for virtualized TSC_AUX KVM: SVM: Fix TSC_AUX virtualization setup KVM: SVM: INTERCEPT_RDTSCP is never intercepted anyway KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously KVM: x86/mmu: Do not filter address spaces in for_each_tdp_mmu_root_yield_safe() KVM: x86/mmu: Open code leaf invalidation from mmu_notifier KVM: riscv: selftests: Selectively filter-out AIA registers KVM: riscv: selftests: Fix ISA_EXT register handling in get-reg-list RISC-V: KVM: Fix riscv_vcpu_get_isa_ext_single() for missing extensions RISC-V: KVM: Fix KVM_GET_REG_LIST API for ISA_EXT registers KVM: selftests: Assert that vasprintf() is successful KVM: arm64: nvhe: Ignore SVE hint in SMCCC function ID KVM: arm64: Properly return allocated EL2 VA from hyp_alloc_private_va_range()
2023-09-24Merge tag 'trace-v6.6-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix the "bytes" output of the per_cpu stat file The tracefs/per_cpu/cpu*/stats "bytes" was giving bogus values as the accounting was not accurate. It is suppose to show how many used bytes are still in the ring buffer, but even when the ring buffer was empty it would still show there were bytes used. - Fix a bug in eventfs where reading a dynamic event directory (open) and then creating a dynamic event that goes into that diretory screws up the accounting. On close, the newly created event dentry will get a "dput" without ever having a "dget" done for it. The fix is to allocate an array on dir open to save what dentries were actually "dget" on, and what ones to "dput" on close. * tag 'trace-v6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: eventfs: Remember what dentries were created on dir open ring-buffer: Fix bytes info in per_cpu buffer stats
2023-09-24Merge tag 'cxl-fixes-6.6-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl Pull cxl fixes from Dan Williams: "A collection of regression fixes, bug fixes, and some small cleanups to the Compute Express Link code. The regressions arrived in the v6.5 dev cycle and missed the v6.6 merge window due to my personal absences this cycle. The most important fixes are for scenarios where the CXL subsystem fails to parse valid region configurations established by platform firmware. This is important because agreement between OS and BIOS on the CXL configuration is fundamental to implementing "OS native" error handling, i.e. address translation and component failure identification. Other important fixes are a driver load error when the BIOS lets the Linux PCI core handle AER events, but not CXL memory errors. The other fixex might have end user impact, but for now are only known to trigger in our test/emulation environment. Summary: - Fix multiple scenarios where platform firmware defined regions fail to be assembled by the CXL core. - Fix a spurious driver-load failure on platforms that enable OS native AER, but not OS native CXL error handling. - Fix a regression detecting "poison" commands when "security" commands are also defined. - Fix a cxl_test regression with the move to centralize CXL port register enumeration in the CXL core. - Miscellaneous small fixes and cleanups" * tag 'cxl-fixes-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: cxl/acpi: Annotate struct cxl_cxims_data with __counted_by cxl/port: Fix cxl_test register enumeration regression cxl/region: Refactor granularity select in cxl_port_setup_targets() cxl/region: Match auto-discovered region decoders by HPA range cxl/mbox: Fix CEL logic for poison and security commands cxl/pci: Replace host_bridge->native_aer with pcie_aer_is_native() PCI/AER: Export pcie_aer_is_native() cxl/pci: Fix appropriate checking for _OSC while handling CXL RAS registers
2023-09-23Merge tag 'gpio-fixes-for-v6.6-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux Pull gpio fixes from Bartosz Golaszewski: - fix an invalid usage of __free(kfree) leading to kfreeing an ERR_PTR() - fix an irq domain leak in gpio-tb10x - MAINTAINERS update * tag 'gpio-fixes-for-v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: gpio: sim: fix an invalid __free() usage gpio: tb10x: Fix an error handling path in tb10x_gpio_probe() MAINTAINERS: gpio-regmap: make myself a maintainer of it
2023-09-23Merge tag 'mm-hotfixes-stable-2023-09-23-10-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "13 hotfixes, 10 of which pertain to post-6.5 issues. The other three are cc:stable" * tag 'mm-hotfixes-stable-2023-09-23-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: proc: nommu: fix empty /proc/<pid>/maps filemap: add filemap_map_order0_folio() to handle order0 folio proc: nommu: /proc/<pid>/maps: release mmap read lock mm: memcontrol: fix GFP_NOFS recursion in memory.high enforcement pidfd: prevent a kernel-doc warning argv_split: fix kernel-doc warnings scatterlist: add missing function params to kernel-doc selftests/proc: fixup proc-empty-vm test after KSM changes revert "scripts/gdb/symbols: add specific ko module load command" selftests: link libasan statically for tests with -fsanitize=address task_work: add kerneldoc annotation for 'data' argument mm: page_alloc: fix CMA and HIGHATOMIC landing on the wrong buddy list sh: mm: re-add lost __ref to ioremap_prot() to fix modpost warning
2023-09-23Merge tag '6.6-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fixes from Steve French: "Six smb3 client fixes, including three for stable, from the SMB plugfest (testing event) this week: - Reparse point handling fix (found when investigating dir enumeration when fifo in dir) - Fix excessive thread creation for dir lease cleanup - UAF fix in negotiate path - remove duplicate error message mapping and fix confusing warning message - add dynamic trace point to improve debugging RDMA connection attempts" * tag '6.6-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb3: fix confusing debug message smb: client: handle STATUS_IO_REPARSE_TAG_NOT_HANDLED smb3: remove duplicate error mapping cifs: Fix UAF in cifs_demultiplex_thread() smb3: do not start laundromat thread when dir leases disabled smb3: Add dynamic trace points for RDMA (smbdirect) reconnect
2023-09-23Merge tag 'i2c-for-6.6-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fixes from Wolfram Sang: "A set of I2C driver fixes. Mostly fixing resource leaks or sanity checks" * tag 'i2c-for-6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: xiic: Correct return value check for xiic_reinit() i2c: mux: gpio: Add missing fwnode_handle_put() i2c: mux: demux-pinctrl: check the return value of devm_kstrdup() i2c: designware: fix __i2c_dw_disable() in case master is holding SCL low i2c: i801: unregister tco_pdev in i801_probe() error path
2023-09-23mfd: cs42l43: Use correct macro for new-style PM runtime opsCharles Keepax
The code was accidentally mixing new and old style macros, update the macros used to remove an unused function warning whilst building with no PM enabled in the config. Fixes: ace6d1448138 ("mfd: cs42l43: Add support for cs42l43 core driver") Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Link: https://lore.kernel.org/all/20230822114914.340359-1-ckeepax@opensource.cirrus.com/ Reviewed-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Lee Jones <lee@kernel.org> Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2023-09-23Merge tag 'loongarch-fixes-6.6-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson Pull LoongArch fixes from Huacai Chen: "Fix lockdep, fix a boot failure, fix some build warnings, fix document links, and some cleanups" * tag 'loongarch-fixes-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson: docs/zh_CN/LoongArch: Update the links of ABI docs/LoongArch: Update the links of ABI LoongArch: Don't inline kasan_mem_to_shadow()/kasan_shadow_to_mem() kasan: Cleanup the __HAVE_ARCH_SHADOW_MAP usage LoongArch: Set all reserved memblocks on Node#0 at initialization LoongArch: Remove dead code in relocate_new_kernel LoongArch: Use _UL() and _ULL() LoongArch: Fix some build warnings with W=1 LoongArch: Fix lockdep static memory detection
2023-09-23Merge tag 's390-6.6-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Vasily Gorbik: - Fix potential string buffer overflow in hypervisor user-defined certificates handling - Update defconfigs * tag 's390-6.6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/cert_store: fix string length handling s390: update defconfigs
2023-09-23Merge tag 'iomap-6.6-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull iomap fixes from Darrick Wong: - Return EIO on bad inputs to iomap_to_bh instead of BUGging, to deal less poorly with block device io racing with block device resizing - Fix a stale page data exposure bug introduced in 6.6-rc1 when unsharing a file range that is not in the page cache * tag 'iomap-6.6-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: convert iomap_unshare_iter to use large folios iomap: don't skip reading in !uptodate folios when unsharing a range iomap: handle error conditions more gracefully in iomap_to_bh
2023-09-23Merge tag 'kvm-riscv-fixes-6.6-1' of https://github.com/kvm-riscv/linux into ↵Paolo Bonzini
HEAD KVM/riscv fixes for 6.6, take #1 - Fix KVM_GET_REG_LIST API for ISA_EXT registers - Fix reading ISA_EXT register of a missing extension - Fix ISA_EXT register handling in get-reg-list test - Fix filtering of AIA registers in get-reg-list test
2023-09-23KVM: SVM: Do not use user return MSR support for virtualized TSC_AUXTom Lendacky
When the TSC_AUX MSR is virtualized, the TSC_AUX value is swap type "B" within the VMSA. This means that the guest value is loaded on VMRUN and the host value is restored from the host save area on #VMEXIT. Since the value is restored on #VMEXIT, the KVM user return MSR support for TSC_AUX can be replaced by populating the host save area with the current host value of TSC_AUX. And, since TSC_AUX is not changed by Linux post-boot, the host save area can be set once in svm_hardware_enable(). This eliminates the two WRMSR instructions associated with the user return MSR support. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Message-Id: <d381de38eb0ab6c9c93dda8503b72b72546053d7.1694811272.git.thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-23KVM: SVM: Fix TSC_AUX virtualization setupTom Lendacky
The checks for virtualizing TSC_AUX occur during the vCPU reset processing path. However, at the time of initial vCPU reset processing, when the vCPU is first created, not all of the guest CPUID information has been set. In this case the RDTSCP and RDPID feature support for the guest is not in place and so TSC_AUX virtualization is not established. This continues for each vCPU created for the guest. On the first boot of an AP, vCPU reset processing is executed as a result of an APIC INIT event, this time with all of the guest CPUID information set, resulting in TSC_AUX virtualization being enabled, but only for the APs. The BSP always sees a TSC_AUX value of 0 which probably went unnoticed because, at least for Linux, the BSP TSC_AUX value is 0. Move the TSC_AUX virtualization enablement out of the init_vmcb() path and into the vcpu_after_set_cpuid() path to allow for proper initialization of the support after the guest CPUID information has been set. With the TSC_AUX virtualization support now in the vcpu_set_after_cpuid() path, the intercepts must be either cleared or set based on the guest CPUID input. Fixes: 296d5a17e793 ("KVM: SEV-ES: Use V_TSC_AUX if available instead of RDTSC/MSR_TSC_AUX intercepts") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Message-Id: <4137fbcb9008951ab5f0befa74a0399d2cce809a.1694811272.git.thomas.lendacky@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-23KVM: SVM: INTERCEPT_RDTSCP is never intercepted anywayPaolo Bonzini
svm_recalc_instruction_intercepts() is always called at least once before the vCPU is started, so the setting or clearing of the RDTSCP intercept can be dropped from the TSC_AUX virtualization support. Extracted from a patch by Tom Lendacky. Cc: stable@vger.kernel.org Fixes: 296d5a17e793 ("KVM: SEV-ES: Use V_TSC_AUX if available instead of RDTSC/MSR_TSC_AUX intercepts") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-23KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronouslySean Christopherson
Stop zapping invalidate TDP MMU roots via work queue now that KVM preserves TDP MMU roots until they are explicitly invalidated. Zapping roots asynchronously was effectively a workaround to avoid stalling a vCPU for an extended during if a vCPU unloaded a root, which at the time happened whenever the guest toggled CR0.WP (a frequent operation for some guest kernels). While a clever hack, zapping roots via an unbound worker had subtle, unintended consequences on host scheduling, especially when zapping multiple roots, e.g. as part of a memslot. Because the work of zapping a root is no longer bound to the task that initiated the zap, things like the CPU affinity and priority of the original task get lost. Losing the affinity and priority can be especially problematic if unbound workqueues aren't affined to a small number of CPUs, as zapping multiple roots can cause KVM to heavily utilize the majority of CPUs in the system, *beyond* the CPUs KVM is already using to run vCPUs. When deleting a memslot via KVM_SET_USER_MEMORY_REGION, the async root zap can result in KVM occupying all logical CPUs for ~8ms, and result in high priority tasks not being scheduled in in a timely manner. In v5.15, which doesn't preserve unloaded roots, the issues were even more noticeable as KVM would zap roots more frequently and could occupy all CPUs for 50ms+. Consuming all CPUs for an extended duration can lead to significant jitter throughout the system, e.g. on ChromeOS with virtio-gpu, deleting memslots is a semi-frequent operation as memslots are deleted and recreated with different host virtual addresses to react to host GPU drivers allocating and freeing GPU blobs. On ChromeOS, the jitter manifests as audio blips during games due to the audio server's tasks not getting scheduled in promptly, despite the tasks having a high realtime priority. Deleting memslots isn't exactly a fast path and should be avoided when possible, and ChromeOS is working towards utilizing MAP_FIXED to avoid the memslot shenanigans, but KVM is squarely in the wrong. Not to mention that removing the async zapping eliminates a non-trivial amount of complexity. Note, one of the subtle behaviors hidden behind the async zapping is that KVM would zap invalidated roots only once (ignoring partial zaps from things like mmu_notifier events). Preserve this behavior by adding a flag to identify roots that are scheduled to be zapped versus roots that have already been zapped but not yet freed. Add a comment calling out why kvm_tdp_mmu_invalidate_all_roots() can encounter invalid roots, as it's not at all obvious why zapping invalidated roots shouldn't simply zap all invalid roots. Reported-by: Pattara Teerapong <pteerapong@google.com> Cc: David Stevens <stevensd@google.com> Cc: Yiwei Zhang<zzyiwei@google.com> Cc: Paul Hsia <paulhsia@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230916003916.2545000-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-23KVM: x86/mmu: Do not filter address spaces in for_each_tdp_mmu_root_yield_safe()Paolo Bonzini
All callers except the MMU notifier want to process all address spaces. Remove the address space ID argument of for_each_tdp_mmu_root_yield_safe() and switch the MMU notifier to use __for_each_tdp_mmu_root_yield_safe(). Extracted out of a patch by Sean Christopherson <seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-22Merge tag 'hardening-v6.6-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening fixes from Kees Cook: - Fix UAPI stddef.h to avoid C++-ism (Alexey Dobriyan) - Fix harmless UAPI stddef.h header guard endif (Alexey Dobriyan) * tag 'hardening-v6.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: uapi: stddef.h: Fix __DECLARE_FLEX_ARRAY for C++ uapi: stddef.h: Fix header guard location
2023-09-22Merge tag 'xfs-6.6-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fixes from Chandan Babu: - Fix an integer overflow bug when processing an fsmap call - Fix crash due to CPU hot remove event racing with filesystem mount operation - During read-only mount, XFS does not allow the contents of the log to be recovered when there are one or more unrecognized rcompat features in the primary superblock, since the log might have intent items which the kernel does not know how to process - During recovery of log intent items, XFS now reserves log space sufficient for one cycle of a permanent transaction to execute. Otherwise, this could lead to livelocks due to non-availability of log space - On an fs which has an ondisk unlinked inode list, trying to delete a file or allocating an O_TMPFILE file can cause the fs to the shutdown if the first inode in the ondisk inode list is not present in the inode cache. The bug is solved by explicitly loading the first inode in the ondisk unlinked inode list into the inode cache if it is not already cached A similar problem arises when the uncached inode is present in the middle of the ondisk unlinked inode list. This second bug is triggered when executing operations like quotacheck and bulkstat. In this case, XFS now reads in the entire ondisk unlinked inode list - Enable LARP mode only on recent v5 filesystems - Fix a out of bounds memory access in scrub - Fix a performance bug when locating the tail of the log during mounting a filesystem * tag 'xfs-6.6-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: use roundup_pow_of_two instead of ffs during xlog_find_tail xfs: only call xchk_stats_merge after validating scrub inputs xfs: require a relatively recent V5 filesystem for LARP mode xfs: make inode unlinked bucket recovery work with quotacheck xfs: load uncached unlinked inodes into memory on demand xfs: reserve less log space when recovering log intent items xfs: fix log recovery when unknown rocompat bits are set xfs: reload entire unlinked bucket lists xfs: allow inode inactivation during a ro mount log recovery xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list xfs: remove CPU hotplug infrastructure xfs: remove the all-mounts list xfs: use per-mount cpumask to track nonempty percpu inodegc lists xfs: fix an agbno overflow in __xfs_getfsmap_datadev xfs: fix per-cpu CIL structure aggregation racing with dying cpus xfs: fix select in config XFS_ONLINE_SCRUB_STATS
2023-09-22cxl/acpi: Annotate struct cxl_cxims_data with __counted_byKees Cook
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct cxl_cxims_data. Additionally, since the element count member must be set before accessing the annotated flexible array member, move its initialization earlier. [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jonathan Cameron <jonathan.cameron@huawei.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Alison Schofield <alison.schofield@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: linux-cxl@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Link: https://lore.kernel.org/r/20230922175319.work.096-kees@kernel.org Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2023-09-22cxl/port: Fix cxl_test register enumeration regressionDan Williams
The cxl_test unit test environment models a CXL topology for sysfs/user-ABI regression testing. It uses interface mocking via the "--wrap=" linker option to redirect cxl_core routines that parse hardware registers with versions that just publish objects, like devm_cxl_enumerate_decoders(). Starting with: Commit 19ab69a60e3b ("cxl/port: Store the port's Component Register mappings in struct cxl_port") ...port register enumeration is moved into devm_cxl_add_port(). This conflicts with the "cxl_test avoids emulating registers stance" so either the port code needs to be refactored (too violent), or modified so that register enumeration is skipped on "fake" cxl_test ports (annoying, but straightforward). This conflict has happened previously and the "check for platform device" workaround to avoid instrusive refactoring was deployed in those scenarios. In general, refactoring should only benefit production code, test code needs to remain minimally instrusive to the greatest extent possible. This was missed previously because it may sometimes just cause warning messages to be emitted, but it can also cause test failures. The backport to -stable is only nice to have for clean cxl_test runs. Fixes: 19ab69a60e3b ("cxl/port: Store the port's Component Register mappings in struct cxl_port") Cc: stable@vger.kernel.org Reported-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Dave Jiang <dave.jiang@intel.com> Tested-by: Dave Jiang <dave.jiang@intel.com> Link: https://lore.kernel.org/r/169476525052.1013896.6235102957693675187.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2023-09-22eventfs: Remember what dentries were created on dir openSteven Rostedt (Google)
Using the following code with libtracefs: int dfd; // create the directory events/kprobes/kp1 tracefs_kprobe_raw(NULL, "kp1", "schedule_timeout", "time=$arg1"); // Open the kprobes directory dfd = tracefs_instance_file_open(NULL, "events/kprobes", O_RDONLY); // Do a lookup of the kprobes/kp1 directory (by looking at enable) tracefs_file_exists(NULL, "events/kprobes/kp1/enable"); // Now create a new entry in the kprobes directory tracefs_kprobe_raw(NULL, "kp2", "schedule_hrtimeout", "expires=$arg1"); // Do another lookup to create the dentries tracefs_file_exists(NULL, "events/kprobes/kp2/enable")) // Close the directory close(dfd); What happened above, the first open (dfd) will call dcache_dir_open_wrapper() that will create the dentries and up their ref counts. Now the creation of "kp2" will add another dentry within the kprobes directory. Upon the close of dfd, eventfs_release() will now do a dput for all the entries in kprobes. But this is where the problem lies. The open only upped the dentry of kp1 and not kp2. Now the close is decrementing both kp1 and kp2, which causes kp2 to get a negative count. Doing a "trace-cmd reset" which deletes all the kprobes cause the kernel to crash! (due to the messed up accounting of the ref counts). To solve this, save all the dentries that are opened in the dcache_dir_open_wrapper() into an array, and use this array to know what dentries to do a dput on in eventfs_release(). Since the dcache_dir_open_wrapper() calls dcache_dir_open() which uses the file->private_data, we need to also add a wrapper around dcache_readdir() that uses the cursor assigned to the file->private_data. This is because the dentries need to also be saved in the file->private_data. To do this create the structure: struct dentry_list { void *cursor; struct dentry **dentries; }; Which will hold both the cursor and the dentries. Some shuffling around is needed to make sure that dcache_dir_open() and dcache_readdir() only see the cursor. Link: https://lore.kernel.org/linux-trace-kernel/20230919211804.230edf1e@gandalf.local.home/ Link: https://lore.kernel.org/linux-trace-kernel/20230922163446.1431d4fa@gandalf.local.home Cc: Mark Rutland <mark.rutland@arm.com> Cc: Ajay Kaher <akaher@vmware.com> Fixes: 63940449555e7 ("eventfs: Implement eventfs lookup, read, open functions") Reported-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>