summaryrefslogtreecommitdiff
path: root/Documentation/virt/kvm
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/virt/kvm')
-rw-r--r--Documentation/virt/kvm/api.rst356
-rw-r--r--Documentation/virt/kvm/cpuid.rst7
-rw-r--r--Documentation/virt/kvm/hypercalls.rst21
-rw-r--r--Documentation/virt/kvm/locking.rst5
-rw-r--r--Documentation/virt/kvm/mmu.rst7
-rw-r--r--Documentation/virt/kvm/msr.rst13
6 files changed, 401 insertions, 8 deletions
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 7fcb2fd38f42..b9ddce5638f5 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -688,9 +688,14 @@ MSRs that have been set successfully.
Defines the vcpu responses to the cpuid instruction. Applications
should use the KVM_SET_CPUID2 ioctl if available.
-Note, when this IOCTL fails, KVM gives no guarantees that previous valid CPUID
-configuration (if there is) is not corrupted. Userspace can get a copy of the
-resulting CPUID configuration through KVM_GET_CPUID2 in case.
+Caveat emptor:
+ - If this IOCTL fails, KVM gives no guarantees that previous valid CPUID
+ configuration (if there is) is not corrupted. Userspace can get a copy
+ of the resulting CPUID configuration through KVM_GET_CPUID2 in case.
+ - Using KVM_SET_CPUID{,2} after KVM_RUN, i.e. changing the guest vCPU model
+ after running the guest, may cause guest instability.
+ - Using heterogeneous CPUID configurations, modulo APIC IDs, topology, etc...
+ may cause guest instability.
::
@@ -5034,6 +5039,260 @@ see KVM_XEN_VCPU_SET_ATTR above.
The KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST type may not be used
with the KVM_XEN_VCPU_GET_ATTR ioctl.
+4.130 KVM_ARM_MTE_COPY_TAGS
+---------------------------
+
+:Capability: KVM_CAP_ARM_MTE
+:Architectures: arm64
+:Type: vm ioctl
+:Parameters: struct kvm_arm_copy_mte_tags
+:Returns: number of bytes copied, < 0 on error (-EINVAL for incorrect
+ arguments, -EFAULT if memory cannot be accessed).
+
+::
+
+ struct kvm_arm_copy_mte_tags {
+ __u64 guest_ipa;
+ __u64 length;
+ void __user *addr;
+ __u64 flags;
+ __u64 reserved[2];
+ };
+
+Copies Memory Tagging Extension (MTE) tags to/from guest tag memory. The
+``guest_ipa`` and ``length`` fields must be ``PAGE_SIZE`` aligned. The ``addr``
+field must point to a buffer which the tags will be copied to or from.
+
+``flags`` specifies the direction of copy, either ``KVM_ARM_TAGS_TO_GUEST`` or
+``KVM_ARM_TAGS_FROM_GUEST``.
+
+The size of the buffer to store the tags is ``(length / 16)`` bytes
+(granules in MTE are 16 bytes long). Each byte contains a single tag
+value. This matches the format of ``PTRACE_PEEKMTETAGS`` and
+``PTRACE_POKEMTETAGS``.
+
+If an error occurs before any data is copied then a negative error code is
+returned. If some tags have been copied before an error occurs then the number
+of bytes successfully copied is returned. If the call completes successfully
+then ``length`` is returned.
+
+4.131 KVM_GET_SREGS2
+------------------
+
+:Capability: KVM_CAP_SREGS2
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_sregs2 (out)
+:Returns: 0 on success, -1 on error
+
+Reads special registers from the vcpu.
+This ioctl (when supported) replaces the KVM_GET_SREGS.
+
+::
+
+struct kvm_sregs2 {
+ /* out (KVM_GET_SREGS2) / in (KVM_SET_SREGS2) */
+ struct kvm_segment cs, ds, es, fs, gs, ss;
+ struct kvm_segment tr, ldt;
+ struct kvm_dtable gdt, idt;
+ __u64 cr0, cr2, cr3, cr4, cr8;
+ __u64 efer;
+ __u64 apic_base;
+ __u64 flags;
+ __u64 pdptrs[4];
+};
+
+flags values for ``kvm_sregs2``:
+
+``KVM_SREGS2_FLAGS_PDPTRS_VALID``
+
+ Indicates thats the struct contain valid PDPTR values.
+
+
+4.132 KVM_SET_SREGS2
+------------------
+
+:Capability: KVM_CAP_SREGS2
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_sregs2 (in)
+:Returns: 0 on success, -1 on error
+
+Writes special registers into the vcpu.
+See KVM_GET_SREGS2 for the data structures.
+This ioctl (when supported) replaces the KVM_SET_SREGS.
+
+4.133 KVM_GET_STATS_FD
+----------------------
+
+:Capability: KVM_CAP_STATS_BINARY_FD
+:Architectures: all
+:Type: vm ioctl, vcpu ioctl
+:Parameters: none
+:Returns: statistics file descriptor on success, < 0 on error
+
+Errors:
+
+ ====== ======================================================
+ ENOMEM if the fd could not be created due to lack of memory
+ EMFILE if the number of opened files exceeds the limit
+ ====== ======================================================
+
+The returned file descriptor can be used to read VM/vCPU statistics data in
+binary format. The data in the file descriptor consists of four blocks
+organized as follows:
+
++-------------+
+| Header |
++-------------+
+| id string |
++-------------+
+| Descriptors |
++-------------+
+| Stats Data |
++-------------+
+
+Apart from the header starting at offset 0, please be aware that it is
+not guaranteed that the four blocks are adjacent or in the above order;
+the offsets of the id, descriptors and data blocks are found in the
+header. However, all four blocks are aligned to 64 bit offsets in the
+file and they do not overlap.
+
+All blocks except the data block are immutable. Userspace can read them
+only one time after retrieving the file descriptor, and then use ``pread`` or
+``lseek`` to read the statistics repeatedly.
+
+All data is in system endianness.
+
+The format of the header is as follows::
+
+ struct kvm_stats_header {
+ __u32 flags;
+ __u32 name_size;
+ __u32 num_desc;
+ __u32 id_offset;
+ __u32 desc_offset;
+ __u32 data_offset;
+ };
+
+The ``flags`` field is not used at the moment. It is always read as 0.
+
+The ``name_size`` field is the size (in byte) of the statistics name string
+(including trailing '\0') which is contained in the "id string" block and
+appended at the end of every descriptor.
+
+The ``num_desc`` field is the number of descriptors that are included in the
+descriptor block. (The actual number of values in the data block may be
+larger, since each descriptor may comprise more than one value).
+
+The ``id_offset`` field is the offset of the id string from the start of the
+file indicated by the file descriptor. It is a multiple of 8.
+
+The ``desc_offset`` field is the offset of the Descriptors block from the start
+of the file indicated by the file descriptor. It is a multiple of 8.
+
+The ``data_offset`` field is the offset of the Stats Data block from the start
+of the file indicated by the file descriptor. It is a multiple of 8.
+
+The id string block contains a string which identifies the file descriptor on
+which KVM_GET_STATS_FD was invoked. The size of the block, including the
+trailing ``'\0'``, is indicated by the ``name_size`` field in the header.
+
+The descriptors block is only needed to be read once for the lifetime of the
+file descriptor contains a sequence of ``struct kvm_stats_desc``, each followed
+by a string of size ``name_size``.
+
+ #define KVM_STATS_TYPE_SHIFT 0
+ #define KVM_STATS_TYPE_MASK (0xF << KVM_STATS_TYPE_SHIFT)
+ #define KVM_STATS_TYPE_CUMULATIVE (0x0 << KVM_STATS_TYPE_SHIFT)
+ #define KVM_STATS_TYPE_INSTANT (0x1 << KVM_STATS_TYPE_SHIFT)
+ #define KVM_STATS_TYPE_PEAK (0x2 << KVM_STATS_TYPE_SHIFT)
+
+ #define KVM_STATS_UNIT_SHIFT 4
+ #define KVM_STATS_UNIT_MASK (0xF << KVM_STATS_UNIT_SHIFT)
+ #define KVM_STATS_UNIT_NONE (0x0 << KVM_STATS_UNIT_SHIFT)
+ #define KVM_STATS_UNIT_BYTES (0x1 << KVM_STATS_UNIT_SHIFT)
+ #define KVM_STATS_UNIT_SECONDS (0x2 << KVM_STATS_UNIT_SHIFT)
+ #define KVM_STATS_UNIT_CYCLES (0x3 << KVM_STATS_UNIT_SHIFT)
+
+ #define KVM_STATS_BASE_SHIFT 8
+ #define KVM_STATS_BASE_MASK (0xF << KVM_STATS_BASE_SHIFT)
+ #define KVM_STATS_BASE_POW10 (0x0 << KVM_STATS_BASE_SHIFT)
+ #define KVM_STATS_BASE_POW2 (0x1 << KVM_STATS_BASE_SHIFT)
+
+ struct kvm_stats_desc {
+ __u32 flags;
+ __s16 exponent;
+ __u16 size;
+ __u32 offset;
+ __u32 unused;
+ char name[];
+ };
+
+The ``flags`` field contains the type and unit of the statistics data described
+by this descriptor. Its endianness is CPU native.
+The following flags are supported:
+
+Bits 0-3 of ``flags`` encode the type:
+ * ``KVM_STATS_TYPE_CUMULATIVE``
+ The statistics data is cumulative. The value of data can only be increased.
+ Most of the counters used in KVM are of this type.
+ The corresponding ``size`` field for this type is always 1.
+ All cumulative statistics data are read/write.
+ * ``KVM_STATS_TYPE_INSTANT``
+ The statistics data is instantaneous. Its value can be increased or
+ decreased. This type is usually used as a measurement of some resources,
+ like the number of dirty pages, the number of large pages, etc.
+ All instant statistics are read only.
+ The corresponding ``size`` field for this type is always 1.
+ * ``KVM_STATS_TYPE_PEAK``
+ The statistics data is peak. The value of data can only be increased, and
+ represents a peak value for a measurement, for example the maximum number
+ of items in a hash table bucket, the longest time waited and so on.
+ The corresponding ``size`` field for this type is always 1.
+
+Bits 4-7 of ``flags`` encode the unit:
+ * ``KVM_STATS_UNIT_NONE``
+ There is no unit for the value of statistics data. This usually means that
+ the value is a simple counter of an event.
+ * ``KVM_STATS_UNIT_BYTES``
+ It indicates that the statistics data is used to measure memory size, in the
+ unit of Byte, KiByte, MiByte, GiByte, etc. The unit of the data is
+ determined by the ``exponent`` field in the descriptor.
+ * ``KVM_STATS_UNIT_SECONDS``
+ It indicates that the statistics data is used to measure time or latency.
+ * ``KVM_STATS_UNIT_CYCLES``
+ It indicates that the statistics data is used to measure CPU clock cycles.
+
+Bits 8-11 of ``flags``, together with ``exponent``, encode the scale of the
+unit:
+ * ``KVM_STATS_BASE_POW10``
+ The scale is based on power of 10. It is used for measurement of time and
+ CPU clock cycles. For example, an exponent of -9 can be used with
+ ``KVM_STATS_UNIT_SECONDS`` to express that the unit is nanoseconds.
+ * ``KVM_STATS_BASE_POW2``
+ The scale is based on power of 2. It is used for measurement of memory size.
+ For example, an exponent of 20 can be used with ``KVM_STATS_UNIT_BYTES`` to
+ express that the unit is MiB.
+
+The ``size`` field is the number of values of this statistics data. Its
+value is usually 1 for most of simple statistics. 1 means it contains an
+unsigned 64bit data.
+
+The ``offset`` field is the offset from the start of Data Block to the start of
+the corresponding statistics data.
+
+The ``unused`` field is reserved for future support for other types of
+statistics data, like log/linear histogram. Its value is always 0 for the types
+defined above.
+
+The ``name`` field is the name string of the statistics data. The name string
+starts at the end of ``struct kvm_stats_desc``. The maximum length including
+the trailing ``'\0'``, is indicated by ``name_size`` in the header.
+
+The Stats Data block contains an array of 64-bit values in the same order
+as the descriptors in Descriptors block.
+
5. The kvm_run structure
========================
@@ -6323,6 +6582,7 @@ KVM_RUN_BUS_LOCK flag is used to distinguish between them.
This capability can be used to check / enable 2nd DAWR feature provided
by POWER10 processor.
+
7.24 KVM_CAP_VM_COPY_ENC_CONTEXT_FROM
-------------------------------------
@@ -6362,6 +6622,66 @@ default.
See Documentation/x86/sgx/2.Kernel-internals.rst for more details.
+7.26 KVM_CAP_PPC_RPT_INVALIDATE
+-------------------------------
+
+:Capability: KVM_CAP_PPC_RPT_INVALIDATE
+:Architectures: ppc
+:Type: vm
+
+This capability indicates that the kernel is capable of handling
+H_RPT_INVALIDATE hcall.
+
+In order to enable the use of H_RPT_INVALIDATE in the guest,
+user space might have to advertise it for the guest. For example,
+IBM pSeries (sPAPR) guest starts using it if "hcall-rpt-invalidate" is
+present in the "ibm,hypertas-functions" device-tree property.
+
+This capability is enabled for hypervisors on platforms like POWER9
+that support radix MMU.
+
+7.27 KVM_CAP_EXIT_ON_EMULATION_FAILURE
+--------------------------------------
+
+:Architectures: x86
+:Parameters: args[0] whether the feature should be enabled or not
+
+When this capability is enabled, an emulation failure will result in an exit
+to userspace with KVM_INTERNAL_ERROR (except when the emulator was invoked
+to handle a VMware backdoor instruction). Furthermore, KVM will now provide up
+to 15 instruction bytes for any exit to userspace resulting from an emulation
+failure. When these exits to userspace occur use the emulation_failure struct
+instead of the internal struct. They both have the same layout, but the
+emulation_failure struct matches the content better. It also explicitly
+defines the 'flags' field which is used to describe the fields in the struct
+that are valid (ie: if KVM_INTERNAL_ERROR_EMULATION_FLAG_INSTRUCTION_BYTES is
+set in the 'flags' field then both 'insn_size' and 'insn_bytes' have valid data
+in them.)
+
+7.28 KVM_CAP_ARM_MTE
+--------------------
+
+:Architectures: arm64
+:Parameters: none
+
+This capability indicates that KVM (and the hardware) supports exposing the
+Memory Tagging Extensions (MTE) to the guest. It must also be enabled by the
+VMM before creating any VCPUs to allow the guest access. Note that MTE is only
+available to a guest running in AArch64 mode and enabling this capability will
+cause attempts to create AArch32 VCPUs to fail.
+
+When enabled the guest is able to access tags associated with any memory given
+to the guest. KVM will ensure that the tags are maintained during swap or
+hibernation of the host; however the VMM needs to manually save/restore the
+tags as appropriate if the VM is migrated.
+
+When this capability is enabled all memory in memslots must be mapped as
+not-shareable (no MAP_SHARED), attempts to create a memslot with a
+MAP_SHARED mmap will result in an -EINVAL return.
+
+When enabled the VMM may make use of the ``KVM_ARM_MTE_COPY_TAGS`` ioctl to
+perform a bulk copy of tags to/from the guest.
+
8. Other capabilities.
======================
@@ -6891,3 +7211,33 @@ This capability is always enabled.
This capability indicates that the KVM virtual PTP service is
supported in the host. A VMM can check whether the service is
available to the guest on migration.
+
+8.33 KVM_CAP_HYPERV_ENFORCE_CPUID
+-----------------------------
+
+Architectures: x86
+
+When enabled, KVM will disable emulated Hyper-V features provided to the
+guest according to the bits Hyper-V CPUID feature leaves. Otherwise, all
+currently implmented Hyper-V features are provided unconditionally when
+Hyper-V identification is set in the HYPERV_CPUID_INTERFACE (0x40000001)
+leaf.
+
+8.34 KVM_CAP_EXIT_HYPERCALL
+---------------------------
+
+:Capability: KVM_CAP_EXIT_HYPERCALL
+:Architectures: x86
+:Type: vm
+
+This capability, if enabled, will cause KVM to exit to userspace
+with KVM_EXIT_HYPERCALL exit reason to process some hypercalls.
+
+Calling KVM_CHECK_EXTENSION for this capability will return a bitmask
+of hypercalls that can be configured to exit to userspace.
+Right now, the only such hypercall is KVM_HC_MAP_GPA_RANGE.
+
+The argument to KVM_ENABLE_CAP is also a bitmask, and must be a subset
+of the result of KVM_CHECK_EXTENSION. KVM will forward to userspace
+the hypercalls whose corresponding bit is in the argument, and return
+ENOSYS for the others.
diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
index cf62162d4be2..bda3e3e737d7 100644
--- a/Documentation/virt/kvm/cpuid.rst
+++ b/Documentation/virt/kvm/cpuid.rst
@@ -96,6 +96,13 @@ KVM_FEATURE_MSI_EXT_DEST_ID 15 guest checks this feature bit
before using extended destination
ID bits in MSI address bits 11-5.
+KVM_FEATURE_HC_MAP_GPA_RANGE 16 guest checks this feature bit before
+ using the map gpa range hypercall
+ to notify the page state change
+
+KVM_FEATURE_MIGRATION_CONTROL 17 guest checks this feature bit before
+ using MSR_KVM_MIGRATION_CONTROL
+
KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24 host will warn if no guest-side
per-cpu warps are expected in
kvmclock
diff --git a/Documentation/virt/kvm/hypercalls.rst b/Documentation/virt/kvm/hypercalls.rst
index ed4fddd364ea..e56fa8b9cfca 100644
--- a/Documentation/virt/kvm/hypercalls.rst
+++ b/Documentation/virt/kvm/hypercalls.rst
@@ -169,3 +169,24 @@ a0: destination APIC ID
:Usage example: When sending a call-function IPI-many to vCPUs, yield if
any of the IPI target vCPUs was preempted.
+
+8. KVM_HC_MAP_GPA_RANGE
+-------------------------
+:Architecture: x86
+:Status: active
+:Purpose: Request KVM to map a GPA range with the specified attributes.
+
+a0: the guest physical address of the start page
+a1: the number of (4kb) pages (must be contiguous in GPA space)
+a2: attributes
+
+ Where 'attributes' :
+ * bits 3:0 - preferred page size encoding 0 = 4kb, 1 = 2mb, 2 = 1gb, etc...
+ * bit 4 - plaintext = 0, encrypted = 1
+ * bits 63:5 - reserved (must be zero)
+
+**Implementation note**: this hypercall is implemented in userspace via
+the KVM_CAP_EXIT_HYPERCALL capability. Userspace must enable that capability
+before advertising KVM_FEATURE_HC_MAP_GPA_RANGE in the guest CPUID. In
+addition, if the guest supports KVM_FEATURE_MIGRATION_CONTROL, userspace
+must also set up an MSR filter to process writes to MSR_KVM_MIGRATION_CONTROL.
diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index 1fc860c007a3..35eca377543d 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -16,6 +16,11 @@ The acquisition orders for mutexes are as follows:
- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
them together is quite rare.
+- Unlike kvm->slots_lock, kvm->slots_arch_lock is released before
+ synchronize_srcu(&kvm->srcu). Therefore kvm->slots_arch_lock
+ can be taken inside a kvm->srcu read-side critical section,
+ while kvm->slots_lock cannot.
+
On x86:
- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock
diff --git a/Documentation/virt/kvm/mmu.rst b/Documentation/virt/kvm/mmu.rst
index 20d85daed395..f60f5488e121 100644
--- a/Documentation/virt/kvm/mmu.rst
+++ b/Documentation/virt/kvm/mmu.rst
@@ -180,8 +180,8 @@ Shadow pages contain the following information:
role.gpte_is_8_bytes:
Reflects the size of the guest PTE for which the page is valid, i.e. '1'
if 64-bit gptes are in use, '0' if 32-bit gptes are in use.
- role.nxe:
- Contains the value of efer.nxe for which the page is valid.
+ role.efer_nx:
+ Contains the value of efer.nx for which the page is valid.
role.cr0_wp:
Contains the value of cr0.wp for which the page is valid.
role.smep_andnot_wp:
@@ -192,9 +192,6 @@ Shadow pages contain the following information:
Contains the value of cr4.smap && !cr0.wp for which the page is valid
(pages for which this is true are different from other pages; see the
treatment of cr0.wp=0 below).
- role.ept_sp:
- This is a virtual flag to denote a shadowed nested EPT page. ept_sp
- is true if "cr0_wp && smap_andnot_wp", an otherwise invalid combination.
role.smm:
Is 1 if the page is valid in system management mode. This field
determines which of the kvm_memslots array was used to build this
diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
index e37a14c323d2..9315fc385fb0 100644
--- a/Documentation/virt/kvm/msr.rst
+++ b/Documentation/virt/kvm/msr.rst
@@ -376,3 +376,16 @@ data:
write '1' to bit 0 of the MSR, this causes the host to re-scan its queue
and check if there are more notifications pending. The MSR is available
if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
+
+MSR_KVM_MIGRATION_CONTROL:
+ 0x4b564d08
+
+data:
+ This MSR is available if KVM_FEATURE_MIGRATION_CONTROL is present in
+ CPUID. Bit 0 represents whether live migration of the guest is allowed.
+
+ When a guest is started, bit 0 will be 0 if the guest has encrypted
+ memory and 1 if the guest does not have encrypted memory. If the
+ guest is communicating page encryption status to the host using the
+ ``KVM_HC_MAP_GPA_RANGE`` hypercall, it can set bit 0 in this MSR to
+ allow live migration of the guest.