summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-02-25KVM: x86/mmu: clear MMIO cache when unloading the MMUPaolo Bonzini
For cleanliness, do not leave a stale GVA in the cache after all the roots are cleared. In practice, kvm_mmu_load will go through kvm_mmu_sync_roots if paging is on, and will not use vcpu_match_mmio_gva at all if paging is off. However, leaving data in the cache might cause bugs in the future. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: x86/mmu: Always use current mmu's role when loading new PGDPaolo Bonzini
Since the guest PGD is now loaded after the MMU has been set up completely, the desired role for a cache hit is simply the current mmu_role. There is no need to compute it again, so __kvm_mmu_new_pgd can be folded in kvm_mmu_new_pgd. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: x86/mmu: load new PGD after the shadow MMU is initializedPaolo Bonzini
Now that __kvm_mmu_new_pgd does not look at the MMU's root_level and shadow_root_level anymore, pull the PGD load after the initialization of the shadow MMUs. Besides being more intuitive, this enables future simplifications and optimizations because it's not necessary anymore to compute the role outside kvm_init_mmu. In particular, kvm_mmu_reset_context was not attempting to use a cached PGD to avoid having to figure out the new role. With this change, it could follow what nested_{vmx,svm}_load_cr3 are doing, and avoid unloading all the cached roots. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: x86/mmu: look for a cached PGD when going from 32-bit to 64-bitPaolo Bonzini
Right now, PGD caching avoids placing a PAE root in the cache by using the old value of mmu->root_level and mmu->shadow_root_level; it does not look for a cached PGD if the old root is a PAE one, and then frees it using kvm_mmu_free_roots. Change the logic instead to free the uncacheable root early. This way, __kvm_new_mmu_pgd is able to look up the cache when going from 32-bit to 64-bit (if there is a hit, the invalid root becomes the least recently used). An example of this is nested virtualization with shadow paging, when a 64-bit L1 runs a 32-bit L2. As a side effect (which is actually the reason why this patch was written), PGD caching does not use the old value of mmu->root_level and mmu->shadow_root_level anymore. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: x86/mmu: do not pass vcpu to root freeing functionsPaolo Bonzini
These functions only operate on a given MMU, of which there is more than one in a vCPU (we care about two, because the third does not have any roots and is only used to walk guest page tables). They do need a struct kvm in order to lock the mmu_lock, but they do not needed anything else in the struct kvm_vcpu. So, pass the vcpu->kvm directly to them. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: x86/mmu: do not consult levels when freeing rootsPaolo Bonzini
Right now, PGD caching requires a complicated dance of first computing the MMU role and passing it to __kvm_mmu_new_pgd(), and then separately calling kvm_init_mmu(). Part of this is due to kvm_mmu_free_roots using mmu->root_level and mmu->shadow_root_level to distinguish whether the page table uses a single root or 4 PAE roots. Because kvm_init_mmu() can overwrite mmu->root_level, kvm_mmu_free_roots() must be called before kvm_init_mmu(). However, even after kvm_init_mmu() there is a way to detect whether the page table may hold PAE roots, as root.hpa isn't backed by a shadow when it points at PAE roots. Using this method results in simpler code, and is one less obstacle in moving all calls to __kvm_mmu_new_pgd() after the MMU has been initialized. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: x86: use struct kvm_mmu_root_info for mmu->rootPaolo Bonzini
The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info. Use the struct to have more consistency between mmu->root and mmu->prev_roots. The patch is entirely search and replace except for cached_root_available, which does not need a temporary struct kvm_mmu_root_info anymore. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: x86/mmu: avoid NULL-pointer dereference on page freeing bugsPaolo Bonzini
WARN and bail if KVM attempts to free a root that isn't backed by a shadow page. KVM allocates a bare page for "special" roots, e.g. when using PAE paging or shadowing 2/3/4-level page tables with 4/5-level, and so root_hpa will be valid but won't be backed by a shadow page. It's all too easy to blindly call mmu_free_root_page() on root_hpa, be nice and WARN instead of crashing KVM and possibly the kernel. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: x86: do not deliver asynchronous page faults if CR0.PG=0Paolo Bonzini
Enabling async page faults is nonsensical if paging is disabled, but it is allowed because CR0.PG=0 does not clear the async page fault MSR. Just ignore them and only use the artificial halt state, similar to what happens in guest mode if async #PF vmexits are disabled. Given the increasingly complex logic, and the nicer code if the new "if" is placed last, opportunistically change the "||" into a chain of "if (...) return false" statements. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: x86: Reinitialize context if host userspace toggles EFER.LMEPaolo Bonzini
While the guest runs, EFER.LME cannot change unless CR0.PG is clear, and therefore EFER.NX is the only bit that can affect the MMU role. However, set_efer accepts a host-initiated change to EFER.LME even with CR0.PG=1. In that case, the MMU has to be reset. Fixes: 11988499e62b ("KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes") Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: selftests: Verify disabling PMU virtualization via KVM_CAP_CONFIG_PMUDavid Dunn
On a VM with PMU disabled via KVM_CAP_PMU_CONFIG, the PMU should not be usable by the guest. Signed-off-by: David Dunn <daviddunn@google.com> Message-Id: <20220223225743.2703915-4-daviddunn@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: selftests: Carve out helper to create "default" VM without vCPUsDavid Dunn
Carve out portion of vm_create_default so that selftests can modify a "default" VM prior to creating vcpus. Signed-off-by: David Dunn <daviddunn@google.com> Message-Id: <20220223225743.2703915-3-daviddunn@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: x86: Provide per VM capability for disabling PMU virtualizationDavid Dunn
Add a new capability, KVM_CAP_PMU_CAPABILITY, that takes a bitmask of settings/features to allow userspace to configure PMU virtualization on a per-VM basis. For now, support a single flag, KVM_PMU_CAP_DISABLE, to allow disabling PMU virtualization for a VM even when KVM is configured with enable_pmu=true a module level. To keep KVM simple, disallow changing VM's PMU configuration after vCPUs have been created. Signed-off-by: David Dunn <daviddunn@google.com> Message-Id: <20220223225743.2703915-2-daviddunn@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: x86: Fix pointer mistmatch warning when patching RET0 static callsSean Christopherson
Cast kvm_x86_ops.func to 'void *' when updating KVM static calls that are conditionally patched to __static_call_return0(). clang complains about using mismatching pointers in the ternary operator, which breaks the build when compiling with CONFIG_KVM_WERROR=y. >> arch/x86/include/asm/kvm-x86-ops.h:82:1: warning: pointer type mismatch ('bool (*)(struct kvm_vcpu *)' and 'void *') [-Wpointer-type-mismatch] Fixes: 5be2226f417d ("KVM: x86: allow defining return-0 static calls") Reported-by: Like Xu <like.xu.linux@gmail.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: David Dunn <daviddunn@google.com> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Message-Id: <20220223162355.3174907-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: Move VM's worker kthreads back to the original cgroup before exiting.Vipin Sharma
VM worker kthreads can linger in the VM process's cgroup for sometime after KVM terminates the VM process. KVM terminates the worker kthreads by calling kthread_stop() which waits on the 'exited' completion, triggered by exit_mm(), via mm_release(), in do_exit() during the kthread's exit. However, these kthreads are removed from the cgroup using the cgroup_exit() which happens after the exit_mm(). Therefore, A VM process can terminate in between the exit_mm() and cgroup_exit() calls, leaving only worker kthreads in the cgroup. Moving worker kthreads back to the original cgroup (kthreadd_task's cgroup) makes sure that the cgroup is empty as soon as the main VM process is terminated. Signed-off-by: Vipin Sharma <vipinsh@google.com> Suggested-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220222054848.563321-1-vipinsh@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: VMX: Remove scratch 'cpu' variable that shadows an identical scratch varPeng Hao
From: Peng Hao <flyingpeng@tencent.com> Remove a redundant 'cpu' declaration from inside an if-statement that that shadows an identical declaration at function scope. Both variables are used as scratch variables in for_each_*_cpu() loops, thus there's no harm in sharing a variable. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peng Hao <flyingpeng@tencent.com> Message-Id: <20220222103954.70062-1-flyingpeng@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25kvm: vmx: Fix typos comment in __loaded_vmcs_clear()Peng Hao
Fix a comment documenting the memory barrier related to clearing a loaded_vmcs; loaded_vmcs tracks the host CPU the VMCS is loaded on via the field 'cpu', it doesn't have a 'vcpu' field. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peng Hao <flyingpeng@tencent.com> Message-Id: <20220222104029.70129-1-flyingpeng@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: nVMX: Make setup/unsetup under the same conditionsPeng Hao
Make sure nested_vmx_hardware_setup/unsetup() are called in pairs under the same conditions. Calling nested_vmx_hardware_unsetup() when nested is false "works" right now because it only calls free_page() on zero- initialized pointers, but it's possible that more code will be added to nested_vmx_hardware_unsetup() in the future. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peng Hao <flyingpeng@tencent.com> Message-Id: <20220222104054.70286-1-flyingpeng@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25Merge branch 'kvm-hv-xmm-hypercall-fixes' into HEADPaolo Bonzini
The fixes for 5.17 conflict with cleanups made in the same area earlier in the 5.18 development cycle.
2022-02-25KVM: selftests: aarch64: Skip tests if we can't create a vgic-v3Mark Brown
The arch_timer and vgic_irq kselftests assume that they can create a vgic-v3, using the library function vgic_v3_setup() which aborts with a test failure if it is not possible to do so. Since vgic-v3 can only be instantiated on systems where the host has GICv3 this leads to false positives on older systems where that is not the case. Fix this by changing vgic_v3_setup() to return an error if the vgic can't be instantiated and have the callers skip if this happens. We could also exit flagging a skip in vgic_v3_setup() but this would prevent future test cases conditionally deciding which GIC to use or generally doing more complex output. Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Andrew Jones <drjones@redhat.com> Tested-by: Ricardo Koller <ricarkol@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220223131624.1830351-1-broonie@kernel.org
2022-02-25KVM: x86: hyper-v: HVCALL_SEND_IPI_EX is an XMM fast hypercallVitaly Kuznetsov
It has been proven on practice that at least Windows Server 2019 tries using HVCALL_SEND_IPI_EX in 'XMM fast' mode when it has more than 64 vCPUs and it needs to send an IPI to a vCPU > 63. Similarly to other XMM Fast hypercalls (HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE}{,_EX}), this information is missing in TLFS as of 6.0b. Currently, KVM returns an error (HV_STATUS_INVALID_HYPERCALL_INPUT) and Windows crashes. Note, HVCALL_SEND_IPI is a 'standard' fast hypercall (not 'XMM fast') as all its parameters fit into RDX:R8 and this is handled by KVM correctly. Cc: stable@vger.kernel.org # 5.14.x: 3244867af8c0: KVM: x86: Ignore sparse banks size for an "all CPUs", non-sparse IPI req Cc: stable@vger.kernel.org # 5.14.x Fixes: d8f5537a8816 ("KVM: hyper-v: Advertise support for fast XMM hypercalls") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220222154642.684285-5-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: x86: hyper-v: Fix the maximum number of sparse banks for XMM fast TLB ↵Vitaly Kuznetsov
flush hypercalls When TLB flush hypercalls (HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE}_EX are issued in 'XMM fast' mode, the maximum number of allowed sparse_banks is not 'HV_HYPERCALL_MAX_XMM_REGISTERS - 1' (5) but twice as many (10) as each XMM register is 128 bit long and can hold two 64 bit long banks. Cc: stable@vger.kernel.org # 5.14.x Fixes: 5974565bc26d ("KVM: x86: kvm_hv_flush_tlb use inputs from XMM registers") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220222154642.684285-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: x86: hyper-v: Drop redundant 'ex' parameter from kvm_hv_flush_tlb()Vitaly Kuznetsov
'struct kvm_hv_hcall' has all the required information already, there's no need to pass 'ex' additionally. No functional change intended. Cc: stable@vger.kernel.org # 5.14.x Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220222154642.684285-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25KVM: x86: hyper-v: Drop redundant 'ex' parameter from kvm_hv_send_ipi()Vitaly Kuznetsov
'struct kvm_hv_hcall' has all the required information already, there's no need to pass 'ex' additionally. No functional change intended. Cc: stable@vger.kernel.org # 5.14.x Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220222154642.684285-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25Revert "KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()"Sean Christopherson
Revert back to refreshing vmcs.HOST_CR3 immediately prior to VM-Enter. The PCID (ASID) part of CR3 can be bumped without KVM being scheduled out, as the kernel will switch CR3 during __text_poke(), e.g. in response to a static key toggling. If switch_mm_irqs_off() chooses a new ASID for the mm associate with KVM, KVM will do VM-Enter => VM-Exit with a stale vmcs.HOST_CR3. Add a comment to explain why KVM must wait until VM-Enter is imminent to refresh vmcs.HOST_CR3. The following splat was captured by stashing vmcs.HOST_CR3 in kvm_vcpu and adding a WARN in load_new_mm_cr3() to fire if a new ASID is being loaded for the KVM-associated mm while KVM has a "running" vCPU: static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush) { struct kvm_vcpu *vcpu = kvm_get_running_vcpu(); ... WARN(vcpu && (vcpu->cr3 & GENMASK(11, 0)) != (new_mm_cr3 & GENMASK(11, 0)) && (vcpu->cr3 & PHYSICAL_PAGE_MASK) == (new_mm_cr3 & PHYSICAL_PAGE_MASK), "KVM is hosed, loading CR3 = %lx, vmcs.HOST_CR3 = %lx", new_mm_cr3, vcpu->cr3); } ------------[ cut here ]------------ KVM is hosed, loading CR3 = 8000000105393004, vmcs.HOST_CR3 = 105393003 WARNING: CPU: 4 PID: 20717 at arch/x86/mm/tlb.c:291 load_new_mm_cr3+0x82/0xe0 Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel CPU: 4 PID: 20717 Comm: stable Tainted: G W 5.17.0-rc3+ #747 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:load_new_mm_cr3+0x82/0xe0 RSP: 0018:ffffc9000489fa98 EFLAGS: 00010082 RAX: 0000000000000000 RBX: 8000000105393004 RCX: 0000000000000027 RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277d1b788 RBP: 0000000000000004 R08: ffff888277d1b780 R09: ffffc9000489f8b8 R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000 R13: ffff88810678a800 R14: 0000000000000004 R15: 0000000000000c33 FS: 00007fa9f0e72700(0000) GS:ffff888277d00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 00000001001b5003 CR4: 0000000000172ea0 Call Trace: <TASK> switch_mm_irqs_off+0x1cb/0x460 __text_poke+0x308/0x3e0 text_poke_bp_batch+0x168/0x220 text_poke_finish+0x1b/0x30 arch_jump_label_transform_apply+0x18/0x30 static_key_slow_inc_cpuslocked+0x7c/0x90 static_key_slow_inc+0x16/0x20 kvm_lapic_set_base+0x116/0x190 kvm_set_apic_base+0xa5/0xe0 kvm_set_msr_common+0x2f4/0xf60 vmx_set_msr+0x355/0xe70 [kvm_intel] kvm_set_msr_ignored_check+0x91/0x230 kvm_emulate_wrmsr+0x36/0x120 vmx_handle_exit+0x609/0x6c0 [kvm_intel] kvm_arch_vcpu_ioctl_run+0x146f/0x1b80 kvm_vcpu_ioctl+0x279/0x690 __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> ---[ end trace 0000000000000000 ]--- This reverts commit 15ad9762d69fd8e40a4a51828c1d6b0c1b8fbea0. Fixes: 15ad9762d69f ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()") Reported-by: Wanpeng Li <kernellwp@gmail.com> Cc: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Acked-by: Lai Jiangshan <jiangshanlai@gmail.com> Message-Id: <20220224191917.3508476-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-25Revert "KVM: VMX: Save HOST_CR3 in vmx_set_host_fs_gs()"Sean Christopherson
Undo a nested VMX fix as a step toward reverting the commit it fixed, 15ad9762d69f ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()"), as the underlying premise that "host CR3 in the vcpu thread can only be changed when scheduling" is wrong. This reverts commit a9f2705ec84449e3b8d70c804766f8e97e23080d. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220224191917.3508476-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-24KVM: x86: nSVM: disallow userspace setting of MSR_AMD64_TSC_RATIO to non ↵Maxim Levitsky
default value when tsc scaling disabled If nested tsc scaling is disabled, MSR_AMD64_TSC_RATIO should never have non default value. Due to way nested tsc scaling support was implmented in qemu, it would set this msr to 0 when nested tsc scaling was disabled. Ignore that value for now, as it causes no harm. Fixes: 5228eb96a487 ("KVM: x86: nSVM: implement nested TSC scaling") Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220223115649.319134-1-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-24KVM: x86/mmu: make apf token non-zero to fix bugLiang Zhang
In current async pagefault logic, when a page is ready, KVM relies on kvm_arch_can_dequeue_async_page_present() to determine whether to deliver a READY event to the Guest. This function test token value of struct kvm_vcpu_pv_apf_data, which must be reset to zero by Guest kernel when a READY event is finished by Guest. If value is zero meaning that a READY event is done, so the KVM can deliver another. But the kvm_arch_setup_async_pf() may produce a valid token with zero value, which is confused with previous mention and may lead the loss of this READY event. This bug may cause task blocked forever in Guest: INFO: task stress:7532 blocked for more than 1254 seconds. Not tainted 5.10.0 #16 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:stress state:D stack: 0 pid: 7532 ppid: 1409 flags:0x00000080 Call Trace: __schedule+0x1e7/0x650 schedule+0x46/0xb0 kvm_async_pf_task_wait_schedule+0xad/0xe0 ? exit_to_user_mode_prepare+0x60/0x70 __kvm_handle_async_pf+0x4f/0xb0 ? asm_exc_page_fault+0x8/0x30 exc_page_fault+0x6f/0x110 ? asm_exc_page_fault+0x8/0x30 asm_exc_page_fault+0x1e/0x30 RIP: 0033:0x402d00 RSP: 002b:00007ffd31912500 EFLAGS: 00010206 RAX: 0000000000071000 RBX: ffffffffffffffff RCX: 00000000021a32b0 RDX: 000000000007d011 RSI: 000000000007d000 RDI: 00000000021262b0 RBP: 00000000021262b0 R08: 0000000000000003 R09: 0000000000000086 R10: 00000000000000eb R11: 00007fefbdf2baa0 R12: 0000000000000000 R13: 0000000000000002 R14: 000000000007d000 R15: 0000000000001000 Signed-off-by: Liang Zhang <zhangliang5@huawei.com> Message-Id: <20220222031239.1076682-1-zhangliang5@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-24Merge branch 'kvm-ppc-cap-210' into kvm-next-5.18Paolo Bonzini
2022-02-22Merge tag 'kvm-s390-next-5.18-1' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD KVM: s390: Changes for 5.18 part1 - add Claudio as Maintainer - first step to do proper storage key checking - testcase for missing memop check
2022-02-22Merge branch 'kvm-ppc-cap-210' into kvm-masterPaolo Bonzini
By request of Nick Piggin: > Patch 3 requires a KVM_CAP_PPC number allocated. QEMU maintainers are > happy with it (link in changelog) just waiting on KVM upstreaming. Do > you have objections to the series going to ppc/kvm tree first, or > another option is you could take patch 3 alone first (it's relatively > independent of the other 2) and ppc/kvm gets it from you?
2022-02-22KVM: PPC: reserve capability 210 for KVM_CAP_PPC_AIL_MODE_3Nicholas Piggin
Add KVM_CAP_PPC_AIL_MODE_3 to advertise the capability to set the AIL resource mode to 3 with the H_SET_MODE hypercall. This capability differs between processor types and KVM types (PR, HV, Nested HV), and affects guest-visible behaviour. QEMU will implement a cap-ail-mode-3 to control this behaviour[1], and use the KVM CAP if available to determine KVM support[2]. Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-22KVM: s390: Add missing vm MEM_OP size checkJanis Schoetterl-Glausch
Check that size is not zero, preventing the following warning: WARNING: CPU: 0 PID: 9692 at mm/vmalloc.c:3059 __vmalloc_node_range+0x528/0x648 Modules linked in: CPU: 0 PID: 9692 Comm: memop Not tainted 5.17.0-rc3-e4+ #80 Hardware name: IBM 8561 T01 701 (LPAR) Krnl PSW : 0704c00180000000 0000000082dc584c (__vmalloc_node_range+0x52c/0x648) R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3 Krnl GPRS: 0000000000000083 ffffffffffffffff 0000000000000000 0000000000000001 0000038000000000 000003ff80000000 0000000000000cc0 000000008ebb8000 0000000087a8a700 000000004040aeb1 000003ffd9f7dec8 000000008ebb8000 000000009d9b8000 000000000102a1b4 00000380035afb68 00000380035afaa8 Krnl Code: 0000000082dc583e: d028a7f4ff80 trtr 2036(41,%r10),3968(%r15) 0000000082dc5844: af000000 mc 0,0 #0000000082dc5848: af000000 mc 0,0 >0000000082dc584c: a7d90000 lghi %r13,0 0000000082dc5850: b904002d lgr %r2,%r13 0000000082dc5854: eb6ff1080004 lmg %r6,%r15,264(%r15) 0000000082dc585a: 07fe bcr 15,%r14 0000000082dc585c: 47000700 bc 0,1792 Call Trace: [<0000000082dc584c>] __vmalloc_node_range+0x52c/0x648 [<0000000082dc5b62>] vmalloc+0x5a/0x68 [<000003ff8067f4ca>] kvm_arch_vm_ioctl+0x2da/0x2a30 [kvm] [<000003ff806705bc>] kvm_vm_ioctl+0x4ec/0x978 [kvm] [<0000000082e562fe>] __s390x_sys_ioctl+0xbe/0x100 [<000000008360a9bc>] __do_syscall+0x1d4/0x200 [<0000000083618bd2>] system_call+0x82/0xb0 Last Breaking-Event-Address: [<0000000082dc5348>] __vmalloc_node_range+0x28/0x648 Other than the warning, there is no ill effect from the missing check, the condition is detected by subsequent code and causes a return with ENOMEM. Fixes: ef11c9463ae0 (KVM: s390: Add vm IOCTL for key checked guest absolute memory access) Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com> Link: https://lore.kernel.org/r/20220221163237.4122868-1-scgl@linux.ibm.com Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
2022-02-22KVM: s390: Clarify key argument for MEM_OP in api docsJanis Schoetterl-Glausch
Clarify that the key argument represents the access key, not the whole storage key. Signed-off-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com> Link: https://lore.kernel.org/r/20220221143657.3712481-1-scgl@linux.ibm.com Fixes: 5e35d0eb472b ("KVM: s390: Update api documentation for memop ioctl") Signed-off-by: Christian Borntraeger <borntraeger@linux.ibm.com>
2022-02-18KVM: x86/mmu: Remove MMU auditingSean Christopherson
Remove mmu_audit.c and all its collateral, the auditing code has suffered severe bitrot, ironically partly due to shadow paging being more stable and thus not benefiting as much from auditing, but mostly due to TDP supplanting shadow paging for non-nested guests and shadowing of nested TDP not heavily stressing the logic that is being audited. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-18KVM: x86: allow defining return-0 static callsPaolo Bonzini
A few vendor callbacks are only used by VMX, but they return an integer or bool value. Introduce KVM_X86_OP_OPTIONAL_RET0 for them: if a func is NULL in struct kvm_x86_ops, it will be changed to __static_call_return0 when updating static calls. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-18KVM: x86: make several APIC virtualization callbacks optionalPaolo Bonzini
All their invocations are conditional on vcpu->arch.apicv_active, meaning that they need not be implemented by vendor code: even though at the moment both vendors implement APIC virtualization, all of them can be optional. In fact SVM does not need many of them, and their implementation can be deleted now. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-18KVM: x86: warn on incorrectly NULL members of kvm_x86_opsPaolo Bonzini
Use the newly corrected KVM_X86_OP annotations to warn about possible NULL pointer dereferences as soon as the vendor module is loaded. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-18KVM: x86: remove KVM_X86_OP_NULL and mark optional kvm_x86_opsPaolo Bonzini
The original use of KVM_X86_OP_NULL, which was to mark calls that do not follow a specific naming convention, is not in use anymore. Instead, let's mark calls that are optional because they are always invoked within conditionals or with static_call_cond. Those that are _not_, i.e. those that are defined with KVM_X86_OP, must be defined by both vendor modules or some kind of NULL pointer dereference is bound to happen at runtime. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-18KVM: x86: use static_call_cond for optional callbacksPaolo Bonzini
SVM implements neither update_emulated_instruction nor set_apic_access_page_addr. Remove an "if" by calling them with static_call_cond(). Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-18KVM: x86: return 1 unconditionally for availability of KVM_CAP_VAPICPaolo Bonzini
The two ioctls used to implement userspace-accelerated TPR, KVM_TPR_ACCESS_REPORTING and KVM_SET_VAPIC_ADDR, are available even if hardware-accelerated TPR can be used. So there is no reason not to report KVM_CAP_VAPIC. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-18selftests: KVM: allow sev_migrate_tests on machines without SEV-ESPaolo Bonzini
I managed to get hold of a machine that has SEV but not SEV-ES, and sev_migrate_tests fails because sev_vm_create(true) returns ENOTTY. Fix this, and while at it also return KSFT_SKIP on machines that do not have SEV at all, instead of returning 0. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-18KVM: SEV: Allow SEV intra-host migration of VM with mirrorsPeter Gonda
For SEV-ES VMs with mirrors to be intra-host migrated they need to be able to migrate with the mirror. This is due to that fact that all VMSAs need to be added into the VM with LAUNCH_UPDATE_VMSA before lAUNCH_FINISH. Allowing migration with mirrors allows users of SEV-ES to keep the mirror VMs VMSAs during migration. Adds a list of mirror VMs for the original VM iterate through during its migration. During the iteration the owner pointers can be updated from the source to the destination. This fixes the ASID leaking issue which caused the blocking of migration of VMs with mirrors. Signed-off-by: Peter Gonda <pgonda@google.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Marc Orr <marcorr@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Message-Id: <20220211193634.3183388-1-pgonda@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-18x86/kvm: Don't use pv tlb/ipi/sched_yield if on 1 vCPUWanpeng Li
Inspired by commit 3553ae5690a (x86/kvm: Don't use pvqspinlock code if only 1 vCPU), on a VM with only 1 vCPU, there is no need to enable pv tlb/ipi/sched_yield and we can save the memory for __pv_cpu_mask. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1645171838-2855-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-18x86/kvm: Fix compilation warning in non-x86_64 buildsLeonardo Bras
On non-x86_64 builds, helpers gtod_is_based_on_tsc() and kvm_guest_supported_xfd() are defined but never used. Because these are static inline but are in a .c file, some compilers do warn for them with -Wunused-function, which becomes an error if -Werror is present. Add #ifdef so they are only defined in x86_64 builds. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Leonardo Bras <leobras@redhat.com> Message-Id: <20220218034100.115702-1-leobras@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-17x86/kvm/fpu: Remove kvm_vcpu_arch.guest_supported_xcr0Leonardo Bras
kvm_vcpu_arch currently contains the guest supported features in both guest_supported_xcr0 and guest_fpu.fpstate->user_xfeatures field. Currently both fields are set to the same value in kvm_vcpu_after_set_cpuid() and are not changed anywhere else after that. Since it's not good to keep duplicated data, remove guest_supported_xcr0. To keep the code more readable, introduce kvm_guest_supported_xcr() and kvm_guest_supported_xfd() to replace the previous usages of guest_supported_xcr0. Signed-off-by: Leonardo Bras <leobras@redhat.com> Message-Id: <20220217053028.96432-3-leobras@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-17x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0Leonardo Bras
During host/guest switch (like in kvm_arch_vcpu_ioctl_run()), the kernel swaps the fpu between host/guest contexts, by using fpu_swap_kvm_fpstate(). When xsave feature is available, the fpu swap is done by: - xsave(s) instruction, with guest's fpstate->xfeatures as mask, is used to store the current state of the fpu registers to a buffer. - xrstor(s) instruction, with (fpu_kernel_cfg.max_features & XFEATURE_MASK_FPSTATE) as mask, is used to put the buffer into fpu regs. For xsave(s) the mask is used to limit what parts of the fpu regs will be copied to the buffer. Likewise on xrstor(s), the mask is used to limit what parts of the fpu regs will be changed. The mask for xsave(s), the guest's fpstate->xfeatures, is defined on kvm_arch_vcpu_create(), which (in summary) sets it to all features supported by the cpu which are enabled on kernel config. This means that xsave(s) will save to guest buffer all the fpu regs contents the cpu has enabled when the guest is paused, even if they are not used. This would not be an issue, if xrstor(s) would also do that. xrstor(s)'s mask for host/guest swap is basically every valid feature contained in kernel config, except XFEATURE_MASK_PKRU. Accordingto kernel src, it is instead switched in switch_to() and flush_thread(). Then, the following happens with a host supporting PKRU starts a guest that does not support it: 1 - Host has XFEATURE_MASK_PKRU set. 1st switch to guest, 2 - xsave(s) fpu regs to host fpustate (buffer has XFEATURE_MASK_PKRU) 3 - xrstor(s) guest fpustate to fpu regs (fpu regs have XFEATURE_MASK_PKRU) 4 - guest runs, then switch back to host, 5 - xsave(s) fpu regs to guest fpstate (buffer now have XFEATURE_MASK_PKRU) 6 - xrstor(s) host fpstate to fpu regs. 7 - kvm_vcpu_ioctl_x86_get_xsave() copy guest fpstate to userspace (with XFEATURE_MASK_PKRU, which should not be supported by guest vcpu) On 5, even though the guest does not support PKRU, it does have the flag set on guest fpstate, which is transferred to userspace via vcpu ioctl KVM_GET_XSAVE. This becomes a problem when the user decides on migrating the above guest to another machine that does not support PKRU: the new host restores guest's fpu regs to as they were before (xrstor(s)), but since the new host don't support PKRU, a general-protection exception ocurs in xrstor(s) and that crashes the guest. This can be solved by making the guest's fpstate->user_xfeatures hold a copy of guest_supported_xcr0. This way, on 7 the only flags copied to userspace will be the ones compatible to guest requirements, and thus there will be no issue during migration. As a bonus, it will also fail if userspace tries to set fpu features (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest configuration. Such features will never be returned by KVM_GET_XSAVE or KVM_GET_XSAVE2. Also, since kvm_vcpu_after_set_cpuid() now sets fpstate->user_xfeatures, there is not need to set it in kvm_check_cpuid(). So, change fpstate_realloc() so it does not touch fpstate->user_xfeatures if a non-NULL guest_fpu is passed, which is the case when kvm_check_cpuid() calls it. Signed-off-by: Leonardo Bras <leobras@redhat.com> Message-Id: <20220217053028.96432-2-leobras@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-17kvm: x86: Disable KVM_HC_CLOCK_PAIRING if tsc is in always catchup modeAnton Romanov
If vcpu has tsc_always_catchup set each request updates pvclock data. KVM_HC_CLOCK_PAIRING consumers such as ptp_kvm_x86 rely on tsc read on host's side and do hypercall inside pvclock_read_retry loop leading to infinite loop in such situation. v3: Removed warn Changed return code to KVM_EFAULT v2: Added warn Signed-off-by: Anton Romanov <romanton@google.com> Message-Id: <20220216182653.506850-1-romanton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-17KVM: Fix lockdep false negative during host resumeWanpeng Li
I saw the below splatting after the host suspended and resumed. WARNING: CPU: 0 PID: 2943 at kvm/arch/x86/kvm/../../../virt/kvm/kvm_main.c:5531 kvm_resume+0x2c/0x30 [kvm] CPU: 0 PID: 2943 Comm: step_after_susp Tainted: G W IOE 5.17.0-rc3+ #4 RIP: 0010:kvm_resume+0x2c/0x30 [kvm] Call Trace: <TASK> syscore_resume+0x90/0x340 suspend_devices_and_enter+0xaee/0xe90 pm_suspend.cold+0x36b/0x3c2 state_store+0x82/0xf0 kernfs_fop_write_iter+0x1b6/0x260 new_sync_write+0x258/0x370 vfs_write+0x33f/0x510 ksys_write+0xc9/0x160 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae lockdep_is_held() can return -1 when lockdep is disabled which triggers this warning. Let's use lockdep_assert_not_held() which can detect incorrect calls while holding a lock and it also avoids false negatives when lockdep is disabled. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1644920142-81249-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-02-17KVM: x86: Add KVM_CAP_ENABLE_CAP to x86Aaron Lewis
Follow the precedent set by other architectures that support the VCPU ioctl, KVM_ENABLE_CAP, and advertise the VM extension, KVM_CAP_ENABLE_CAP. This way, userspace can ensure that KVM_ENABLE_CAP is available on a vcpu before using it. Fixes: 5c919412fe61 ("kvm/x86: Hyper-V synthetic interrupt controller") Signed-off-by: Aaron Lewis <aaronlewis@google.com> Message-Id: <20220214212950.1776943-1-aaronlewis@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>