diff options
Diffstat (limited to 'Documentation/virt')
| -rw-r--r-- | Documentation/virt/hyperv/coco.rst | 139 | ||||
| -rw-r--r-- | Documentation/virt/kvm/api.rst | 90 | ||||
| -rw-r--r-- | Documentation/virt/kvm/devices/arm-vgic-v3.rst | 3 | ||||
| -rw-r--r-- | Documentation/virt/kvm/x86/errata.rst | 9 |
4 files changed, 231 insertions, 10 deletions
diff --git a/Documentation/virt/hyperv/coco.rst b/Documentation/virt/hyperv/coco.rst index c15d6fe34b4e..3231e51444da 100644 --- a/Documentation/virt/hyperv/coco.rst +++ b/Documentation/virt/hyperv/coco.rst @@ -178,7 +178,7 @@ These Hyper-V and VMBus memory pages are marked as decrypted: * VMBus monitor pages -* Synthetic interrupt controller (synic) related pages (unless supplied by +* Synthetic interrupt controller (SynIC) related pages (unless supplied by the paravisor) * Per-cpu hypercall input and output pages (unless running with a paravisor) @@ -232,6 +232,143 @@ with arguments explicitly describing the access. See _hv_pcifront_read_config() and _hv_pcifront_write_config() and the "use_calls" flag indicating to use hypercalls. +Confidential VMBus +------------------ +The confidential VMBus enables the confidential guest not to interact with +the untrusted host partition and the untrusted hypervisor. Instead, the guest +relies on the trusted paravisor to communicate with the devices processing +sensitive data. The hardware (SNP or TDX) encrypts the guest memory and the +register state while measuring the paravisor image using the platform security +processor to ensure trusted and confidential computing. + +Confidential VMBus provides a secure communication channel between the guest +and the paravisor, ensuring that sensitive data is protected from hypervisor- +level access through memory encryption and register state isolation. + +Confidential VMBus is an extension of Confidential Computing (CoCo) VMs +(a.k.a. "Isolated" VMs in Hyper-V terminology). Without Confidential VMBus, +guest VMBus device drivers (the "VSC"s in VMBus terminology) communicate +with VMBus servers (the VSPs) running on the Hyper-V host. The +communication must be through memory that has been decrypted so the +host can access it. With Confidential VMBus, one or more of the VSPs reside +in the trusted paravisor layer in the guest VM. Since the paravisor layer also +operates in encrypted memory, the memory used for communication with +such VSPs does not need to be decrypted and thereby exposed to the +Hyper-V host. The paravisor is responsible for communicating securely +with the Hyper-V host as necessary. + +The data is transferred directly between the VM and a vPCI device (a.k.a. +a PCI pass-thru device, see :doc:`vpci`) that is directly assigned to VTL2 +and that supports encrypted memory. In such a case, neither the host partition +nor the hypervisor has any access to the data. The guest needs to establish +a VMBus connection only with the paravisor for the channels that process +sensitive data, and the paravisor abstracts the details of communicating +with the specific devices away providing the guest with the well-established +VSP (Virtual Service Provider) interface that has had support in the Hyper-V +drivers for a decade. + +In the case the device does not support encrypted memory, the paravisor +provides bounce-buffering, and although the data is not encrypted, the backing +pages aren't mapped into the host partition through SLAT. While not impossible, +it becomes much more difficult for the host partition to exfiltrate the data +than it would be with a conventional VMBus connection where the host partition +has direct access to the memory used for communication. + +Here is the data flow for a conventional VMBus connection (`C` stands for the +client or VSC, `S` for the server or VSP, the `DEVICE` is a physical one, might +be with multiple virtual functions):: + + +---- GUEST ----+ +----- DEVICE ----+ +----- HOST -----+ + | | | | | | + | | | | | | + | | | ========== | + | | | | | | + | | | | | | + | | | | | | + +----- C -------+ +-----------------+ +------- S ------+ + || || + || || + +------||------------------ VMBus --------------------------||------+ + | Interrupts, MMIO | + +-------------------------------------------------------------------+ + +and the Confidential VMBus connection:: + + +---- GUEST --------------- VTL0 ------+ +-- DEVICE --+ + | | | | + | +- PARAVISOR --------- VTL2 -----+ | | | + | | +-- VMBus Relay ------+ ====+================ | + | | | Interrupts, MMIO | | | | | + | | +-------- S ----------+ | | +------------+ + | | || | | + | +---------+ || | | + | | Linux | || OpenHCL | | + | | kernel | || | | + | +---- C --+-----||---------------+ | + | || || | + +-------++------- C -------------------+ +------------+ + || | HOST | + || +---- S -----+ + +-------||----------------- VMBus ---------------------------||-----+ + | Interrupts, MMIO | + +-------------------------------------------------------------------+ + +An implementation of the VMBus relay that offers the Confidential VMBus +channels is available in the OpenVMM project as a part of the OpenHCL +paravisor. Please refer to + + * https://openvmm.dev/, and + * https://github.com/microsoft/openvmm + +for more information about the OpenHCL paravisor. + +A guest that is running with a paravisor must determine at runtime if +Confidential VMBus is supported by the current paravisor. The x86_64-specific +approach relies on the CPUID Virtualization Stack leaf; the ARM64 implementation +is expected to support the Confidential VMBus unconditionally when running +ARM CCA guests. + +Confidential VMBus is a characteristic of the VMBus connection as a whole, +and of each VMBus channel that is created. When a Confidential VMBus +connection is established, the paravisor provides the guest the message-passing +path that is used for VMBus device creation and deletion, and it provides a +per-CPU synthetic interrupt controller (SynIC) just like the SynIC that is +offered by the Hyper-V host. Each VMBus device that is offered to the guest +indicates the degree to which it participates in Confidential VMBus. The offer +indicates if the device uses encrypted ring buffers, and if the device uses +encrypted memory for DMA that is done outside the ring buffer. These settings +may be different for different devices using the same Confidential VMBus +connection. + +Although these settings are separate, in practice it'll always be encrypted +ring buffer only, or both encrypted ring buffer and external data. If a channel +is offered by the paravisor with confidential VMBus, the ring buffer can always +be encrypted since it's strictly for communication between the VTL2 paravisor +and the VTL0 guest. However, other memory regions are often used for e.g. DMA, +so they need to be accessible by the underlying hardware, and must be +unencrypted (unless the device supports encrypted memory). Currently, there are +not any VSPs in OpenHCL that support encrypted external memory, but future +versions are expected to enable this capability. + +Because some devices on a Confidential VMBus may require decrypted ring buffers +and DMA transfers, the guest must interact with two SynICs -- the one provided +by the paravisor and the one provided by the Hyper-V host when Confidential +VMBus is not offered. Interrupts are always signaled by the paravisor SynIC, +but the guest must check for messages and for channel interrupts on both SynICs. + +In the case of a confidential VMBus, regular SynIC access by the guest is +intercepted by the paravisor (this includes various MSRs such as the SIMP and +SIEFP, as well as hypercalls like HvPostMessage and HvSignalEvent). If the +guest actually wants to communicate with the hypervisor, it has to use special +mechanisms (GHCB page on SNP, or tdcall on TDX). Messages can be of either +kind: with confidential VMBus, messages use the paravisor SynIC, and if the +guest chose to communicate directly to the hypervisor, they use the hypervisor +SynIC. For interrupt signaling, some channels may be running on the host +(non-confidential, using the VMBus relay) and use the hypervisor SynIC, and +some on the paravisor and use its SynIC. The RelIDs are coordinated by the +OpenHCL VMBus server and are guaranteed to be unique regardless of whether +the channel originated on the host or the paravisor. + load_unaligned_zeropad() ------------------------ When transitioning memory between encrypted and decrypted, the caller of diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 6ae24c5ca559..01a3abef8abb 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -1229,6 +1229,9 @@ It is not possible to read back a pending external abort (injected via KVM_SET_VCPU_EVENTS or otherwise) because such an exception is always delivered directly to the virtual CPU). +Calling this ioctl on a vCPU that hasn't been initialized will return +-ENOEXEC. + :: struct kvm_vcpu_events { @@ -1309,6 +1312,8 @@ exceptions by manipulating individual registers using the KVM_SET_ONE_REG API. See KVM_GET_VCPU_EVENTS for the data structure. +Calling this ioctl on a vCPU that hasn't been initialized will return +-ENOEXEC. 4.33 KVM_GET_DEBUGREGS ---------------------- @@ -6432,9 +6437,18 @@ most one mapping per page, i.e. binding multiple memory regions to a single guest_memfd range is not allowed (any number of memory regions can be bound to a single guest_memfd file, but the bound ranges must not overlap). -When the capability KVM_CAP_GUEST_MEMFD_MMAP is supported, the 'flags' field -supports GUEST_MEMFD_FLAG_MMAP. Setting this flag on guest_memfd creation -enables mmap() and faulting of guest_memfd memory to host userspace. +The capability KVM_CAP_GUEST_MEMFD_FLAGS enumerates the `flags` that can be +specified via KVM_CREATE_GUEST_MEMFD. Currently defined flags: + + ============================ ================================================ + GUEST_MEMFD_FLAG_MMAP Enable using mmap() on the guest_memfd file + descriptor. + GUEST_MEMFD_FLAG_INIT_SHARED Make all memory in the file shared during + KVM_CREATE_GUEST_MEMFD (memory files created + without INIT_SHARED will be marked private). + Shared memory can be faulted into host userspace + page tables. Private memory cannot. + ============================ ================================================ When the KVM MMU performs a PFN lookup to service a guest fault and the backing guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be @@ -7272,6 +7286,41 @@ exit, even without calls to ``KVM_ENABLE_CAP`` or similar. In this case, it will enter with output fields already valid; in the common case, the ``unknown.ret`` field of the union will be ``TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED``. Userspace need not do anything if it does not wish to support a TDVMCALL. + +:: + + /* KVM_EXIT_ARM_SEA */ + struct { + #define KVM_EXIT_ARM_SEA_FLAG_GPA_VALID (1ULL << 0) + __u64 flags; + __u64 esr; + __u64 gva; + __u64 gpa; + } arm_sea; + +Used on arm64 systems. When the VM capability ``KVM_CAP_ARM_SEA_TO_USER`` is +enabled, a KVM exits to userspace if a guest access causes a synchronous +external abort (SEA) and the host APEI fails to handle the SEA. + +``esr`` is set to a sanitized value of ESR_EL2 from the exception taken to KVM, +consisting of the following fields: + + - ``ESR_EL2.EC`` + - ``ESR_EL2.IL`` + - ``ESR_EL2.FnV`` + - ``ESR_EL2.EA`` + - ``ESR_EL2.CM`` + - ``ESR_EL2.WNR`` + - ``ESR_EL2.FSC`` + - ``ESR_EL2.SET`` (when FEAT_RAS is implemented for the VM) + +``gva`` is set to the value of FAR_EL2 from the exception taken to KVM when +``ESR_EL2.FnV == 0``. Otherwise, the value of ``gva`` is unknown. + +``gpa`` is set to the faulting IPA from the exception taken to KVM when +the ``KVM_EXIT_ARM_SEA_FLAG_GPA_VALID`` flag is set. Otherwise, the value of +``gpa`` is unknown. + :: /* Fix the size of the union. */ @@ -7806,7 +7855,7 @@ where 0xff represents CPUs 0-7 in cluster 0. :Architectures: s390 :Parameters: none -With this capability enabled, all illegal instructions 0x0000 (2 bytes) will +With this capability enabled, the illegal instruction 0x0000 (2 bytes) will be intercepted and forwarded to user space. User space can use this mechanism e.g. to realize 2-byte software breakpoints. The kernel will not inject an operating exception for these instructions, user space has @@ -8014,7 +8063,7 @@ will be initialized to 1 when created. This also improves performance because dirty logging can be enabled gradually in small chunks on the first call to KVM_CLEAR_DIRTY_LOG. KVM_DIRTY_LOG_INITIALLY_SET depends on KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE (it is also only available on -x86 and arm64 for now). +x86, arm64 and riscv for now). KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 was previously available under the name KVM_CAP_MANUAL_DIRTY_LOG_PROTECT, but the implementation had bugs that make @@ -8510,7 +8559,7 @@ Therefore, the ioctl must be called *before* reading the content of the dirty pages. The dirty ring can get full. When it happens, the KVM_RUN of the -vcpu will return with exit reason KVM_EXIT_DIRTY_LOG_FULL. +vcpu will return with exit reason KVM_EXIT_DIRTY_RING_FULL. The dirty ring interface has a major difference comparing to the KVM_GET_DIRTY_LOG interface in that, when reading the dirty ring from @@ -8678,7 +8727,7 @@ given VM. When this capability is enabled, KVM resets the VCPU when setting MP_STATE_INIT_RECEIVED through IOCTL. The original MP_STATE is preserved. -7.43 KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED +7.44 KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED ------------------------------------------- :Architectures: arm64 @@ -8689,6 +8738,33 @@ This capability indicate to the userspace whether a PFNMAP memory region can be safely mapped as cacheable. This relies on the presence of force write back (FWB) feature support on the hardware. +7.45 KVM_CAP_ARM_SEA_TO_USER +---------------------------- + +:Architecture: arm64 +:Target: VM +:Parameters: none +:Returns: 0 on success, -EINVAL if unsupported. + +When this capability is enabled, KVM may exit to userspace for SEAs taken to +EL2 resulting from a guest access. See ``KVM_EXIT_ARM_SEA`` for more +information. + +7.46 KVM_CAP_S390_USER_OPEREXEC +------------------------------- + +:Architectures: s390 +:Parameters: none + +When this capability is enabled KVM forwards all operation exceptions +that it doesn't handle itself to user space. This also includes the +0x0000 instructions managed by KVM_CAP_S390_USER_INSTR0. This is +helpful if user space wants to emulate instructions which are not +(yet) implemented in hardware. + +This capability can be enabled dynamically even if VCPUs were already +created and are running. + 8. Other capabilities. ====================== diff --git a/Documentation/virt/kvm/devices/arm-vgic-v3.rst b/Documentation/virt/kvm/devices/arm-vgic-v3.rst index ff02102f7141..5395ee66fc32 100644 --- a/Documentation/virt/kvm/devices/arm-vgic-v3.rst +++ b/Documentation/virt/kvm/devices/arm-vgic-v3.rst @@ -13,7 +13,8 @@ will act as the VM interrupt controller, requiring emulated user-space devices to inject interrupts to the VGIC instead of directly to CPUs. It is not possible to create both a GICv3 and GICv2 on the same VM. -Creating a guest GICv3 device requires a host GICv3 as well. +Creating a guest GICv3 device requires a host GICv3 host, or a GICv5 host with +support for FEAT_GCIE_LEGACY. Groups: diff --git a/Documentation/virt/kvm/x86/errata.rst b/Documentation/virt/kvm/x86/errata.rst index 37c79362a48f..a9cf0e004651 100644 --- a/Documentation/virt/kvm/x86/errata.rst +++ b/Documentation/virt/kvm/x86/errata.rst @@ -48,7 +48,14 @@ versus "has_error_code", i.e. KVM's ABI follows AMD behavior. Nested virtualization features ------------------------------ -TBD +On AMD CPUs, when GIF is cleared, #DB exceptions or traps due to a breakpoint +register match are ignored and discarded by the CPU. The CPU relies on the VMM +to fully virtualize this behavior, even when vGIF is enabled for the guest +(i.e. vGIF=0 does not cause the CPU to drop #DBs when the guest is running). +KVM does not virtualize this behavior as the complexity is unjustified given +the rarity of the use case. One way to handle this would be for KVM to +intercept the #DB, temporarily disable the breakpoint, single-step over the +instruction, then re-enable the breakpoint. x2APIC ------ |
