diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2021-11-02 11:24:14 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2021-11-02 11:24:14 -0700 |
commit | d7e0a795bf37a13554c80cfc5ba97abedf53f391 (patch) | |
tree | 26f107fbe530b1bd0912a748b808cbe476bfbf49 /Documentation | |
parent | 44261f8e287d1b02a2e4bfbd7399fb8d37d1ee24 (diff) | |
parent | 52cf891d8dbd7592261fa30f373410b97f22b76c (diff) |
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"ARM:
- More progress on the protected VM front, now with the full fixed
feature set as well as the limitation of some hypercalls after
initialisation.
- Cleanup of the RAZ/WI sysreg handling, which was pointlessly
complicated
- Fixes for the vgic placement in the IPA space, together with a
bunch of selftests
- More memcg accounting of the memory allocated on behalf of a guest
- Timer and vgic selftests
- Workarounds for the Apple M1 broken vgic implementation
- KConfig cleanups
- New kvmarm.mode=none option, for those who really dislike us
RISC-V:
- New KVM port.
x86:
- New API to control TSC offset from userspace
- TSC scaling for nested hypervisors on SVM
- Switch masterclock protection from raw_spin_lock to seqcount
- Clean up function prototypes in the page fault code and avoid
repeated memslot lookups
- Convey the exit reason to userspace on emulation failure
- Configure time between NX page recovery iterations
- Expose Predictive Store Forwarding Disable CPUID leaf
- Allocate page tracking data structures lazily (if the i915 KVM-GT
functionality is not compiled in)
- Cleanups, fixes and optimizations for the shadow MMU code
s390:
- SIGP Fixes
- initial preparations for lazy destroy of secure VMs
- storage key improvements/fixes
- Log the guest CPNC
Starting from this release, KVM-PPC patches will come from Michael
Ellerman's PPC tree"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (227 commits)
RISC-V: KVM: fix boolreturn.cocci warnings
RISC-V: KVM: remove unneeded semicolon
RISC-V: KVM: Fix GPA passed to __kvm_riscv_hfence_gvma_xyz() functions
RISC-V: KVM: Factor-out FP virtualization into separate sources
KVM: s390: add debug statement for diag 318 CPNC data
KVM: s390: pv: properly handle page flags for protected guests
KVM: s390: Fix handle_sske page fault handling
KVM: x86: SGX must obey the KVM_INTERNAL_ERROR_EMULATION protocol
KVM: x86: On emulation failure, convey the exit reason, etc. to userspace
KVM: x86: Get exit_reason as part of kvm_x86_ops.get_exit_info
KVM: x86: Clarify the kvm_run.emulation_failure structure layout
KVM: s390: Add a routine for setting userspace CPU state
KVM: s390: Simplify SIGP Set Arch handling
KVM: s390: pv: avoid stalls when making pages secure
KVM: s390: pv: avoid stalls for kvm_s390_pv_init_vm
KVM: s390: pv: avoid double free of sida page
KVM: s390: pv: add macros for UVC CC values
s390/mm: optimize reset_guest_reference_bit()
s390/mm: optimize set_guest_storage_key()
s390/mm: no need for pte_alloc_map_lock() if we know the pmd is present
...
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/admin-guide/kernel-parameters.txt | 15 | ||||
-rw-r--r-- | Documentation/virt/kvm/api.rst | 241 | ||||
-rw-r--r-- | Documentation/virt/kvm/devices/vcpu.rst | 70 | ||||
-rw-r--r-- | Documentation/virt/kvm/devices/xics.rst | 2 | ||||
-rw-r--r-- | Documentation/virt/kvm/devices/xive.rst | 2 |
5 files changed, 308 insertions, 22 deletions
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index fb76a64a2168..6aea495b0970 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2353,7 +2353,14 @@ [KVM] Controls how many 4KiB pages are periodically zapped back to huge pages. 0 disables the recovery, otherwise if the value is N KVM will zap 1/Nth of the 4KiB pages every - minute. The default is 60. + period (see below). The default is 60. + + kvm.nx_huge_pages_recovery_period_ms= + [KVM] Controls the time period at which KVM zaps 4KiB pages + back to huge pages. If the value is a non-zero N, KVM will + zap a portion (see ratio above) of the pages every N msecs. + If the value is 0 (the default), KVM will pick a period based + on the ratio, such that a page is zapped after 1 hour on average. kvm-amd.nested= [KVM,AMD] Allow nested virtualization in KVM/SVM. Default is 1 (enabled) @@ -2365,6 +2372,8 @@ kvm-arm.mode= [KVM,ARM] Select one of KVM/arm64's modes of operation. + none: Forcefully disable KVM. + nvhe: Standard nVHE-based mode, without support for protected guests. @@ -2372,7 +2381,9 @@ state is kept private from the host. Not valid if the kernel is running in EL2. - Defaults to VHE/nVHE based on hardware support. + Defaults to VHE/nVHE based on hardware support. Setting + mode to "protected" will disable kexec and hibernation + for the host. kvm-arm.vgic_v3_group0_trap= [KVM,ARM] Trap guest accesses to GICv3 group-0 diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index a6729c8cf063..3b093d6dbe22 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -532,7 +532,7 @@ translation mode. ------------------ :Capability: basic -:Architectures: x86, ppc, mips +:Architectures: x86, ppc, mips, riscv :Type: vcpu ioctl :Parameters: struct kvm_interrupt (in) :Returns: 0 on success, negative on failure. @@ -601,6 +601,23 @@ interrupt number dequeues the interrupt. This is an asynchronous vcpu ioctl and can be invoked from any thread. +RISC-V: +^^^^^^^ + +Queues an external interrupt to be injected into the virutal CPU. This ioctl +is overloaded with 2 different irq values: + +a) KVM_INTERRUPT_SET + + This sets external interrupt for a virtual CPU and it will receive + once it is ready. + +b) KVM_INTERRUPT_UNSET + + This clears pending external interrupt for a virtual CPU. + +This is an asynchronous vcpu ioctl and can be invoked from any thread. + 4.17 KVM_DEBUG_GUEST -------------------- @@ -993,20 +1010,37 @@ such as migration. When KVM_CAP_ADJUST_CLOCK is passed to KVM_CHECK_EXTENSION, it returns the set of bits that KVM can return in struct kvm_clock_data's flag member. -The only flag defined now is KVM_CLOCK_TSC_STABLE. If set, the returned -value is the exact kvmclock value seen by all VCPUs at the instant -when KVM_GET_CLOCK was called. If clear, the returned value is simply -CLOCK_MONOTONIC plus a constant offset; the offset can be modified -with KVM_SET_CLOCK. KVM will try to make all VCPUs follow this clock, -but the exact value read by each VCPU could differ, because the host -TSC is not stable. +The following flags are defined: + +KVM_CLOCK_TSC_STABLE + If set, the returned value is the exact kvmclock + value seen by all VCPUs at the instant when KVM_GET_CLOCK was called. + If clear, the returned value is simply CLOCK_MONOTONIC plus a constant + offset; the offset can be modified with KVM_SET_CLOCK. KVM will try + to make all VCPUs follow this clock, but the exact value read by each + VCPU could differ, because the host TSC is not stable. + +KVM_CLOCK_REALTIME + If set, the `realtime` field in the kvm_clock_data + structure is populated with the value of the host's real time + clocksource at the instant when KVM_GET_CLOCK was called. If clear, + the `realtime` field does not contain a value. + +KVM_CLOCK_HOST_TSC + If set, the `host_tsc` field in the kvm_clock_data + structure is populated with the value of the host's timestamp counter (TSC) + at the instant when KVM_GET_CLOCK was called. If clear, the `host_tsc` field + does not contain a value. :: struct kvm_clock_data { __u64 clock; /* kvmclock current value */ __u32 flags; - __u32 pad[9]; + __u32 pad0; + __u64 realtime; + __u64 host_tsc; + __u32 pad[4]; }; @@ -1023,12 +1057,25 @@ Sets the current timestamp of kvmclock to the value specified in its parameter. In conjunction with KVM_GET_CLOCK, it is used to ensure monotonicity on scenarios such as migration. +The following flags can be passed: + +KVM_CLOCK_REALTIME + If set, KVM will compare the value of the `realtime` field + with the value of the host's real time clocksource at the instant when + KVM_SET_CLOCK was called. The difference in elapsed time is added to the final + kvmclock value that will be provided to guests. + +Other flags returned by ``KVM_GET_CLOCK`` are accepted but ignored. + :: struct kvm_clock_data { __u64 clock; /* kvmclock current value */ __u32 flags; - __u32 pad[9]; + __u32 pad0; + __u64 realtime; + __u64 host_tsc; + __u32 pad[4]; }; @@ -1399,7 +1446,7 @@ for vm-wide capabilities. --------------------- :Capability: KVM_CAP_MP_STATE -:Architectures: x86, s390, arm, arm64 +:Architectures: x86, s390, arm, arm64, riscv :Type: vcpu ioctl :Parameters: struct kvm_mp_state (out) :Returns: 0 on success; -1 on error @@ -1416,7 +1463,8 @@ uniprocessor guests). Possible values are: ========================== =============================================== - KVM_MP_STATE_RUNNABLE the vcpu is currently running [x86,arm/arm64] + KVM_MP_STATE_RUNNABLE the vcpu is currently running + [x86,arm/arm64,riscv] KVM_MP_STATE_UNINITIALIZED the vcpu is an application processor (AP) which has not yet received an INIT signal [x86] KVM_MP_STATE_INIT_RECEIVED the vcpu has received an INIT signal, and is @@ -1425,7 +1473,7 @@ Possible values are: is waiting for an interrupt [x86] KVM_MP_STATE_SIPI_RECEIVED the vcpu has just received a SIPI (vector accessible via KVM_GET_VCPU_EVENTS) [x86] - KVM_MP_STATE_STOPPED the vcpu is stopped [s390,arm/arm64] + KVM_MP_STATE_STOPPED the vcpu is stopped [s390,arm/arm64,riscv] KVM_MP_STATE_CHECK_STOP the vcpu is in a special error state [s390] KVM_MP_STATE_OPERATING the vcpu is operating (running or halted) [s390] @@ -1437,8 +1485,8 @@ On x86, this ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel irqchip, the multiprocessing state must be maintained by userspace on these architectures. -For arm/arm64: -^^^^^^^^^^^^^^ +For arm/arm64/riscv: +^^^^^^^^^^^^^^^^^^^^ The only states that are valid are KVM_MP_STATE_STOPPED and KVM_MP_STATE_RUNNABLE which reflect if the vcpu is paused or not. @@ -1447,7 +1495,7 @@ KVM_MP_STATE_RUNNABLE which reflect if the vcpu is paused or not. --------------------- :Capability: KVM_CAP_MP_STATE -:Architectures: x86, s390, arm, arm64 +:Architectures: x86, s390, arm, arm64, riscv :Type: vcpu ioctl :Parameters: struct kvm_mp_state (in) :Returns: 0 on success; -1 on error @@ -1459,8 +1507,8 @@ On x86, this ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel irqchip, the multiprocessing state must be maintained by userspace on these architectures. -For arm/arm64: -^^^^^^^^^^^^^^ +For arm/arm64/riscv: +^^^^^^^^^^^^^^^^^^^^ The only states that are valid are KVM_MP_STATE_STOPPED and KVM_MP_STATE_RUNNABLE which reflect if the vcpu should be paused or not. @@ -2577,6 +2625,144 @@ following id bit patterns:: 0x7020 0000 0003 02 <0:3> <reg:5> +RISC-V registers are mapped using the lower 32 bits. The upper 8 bits of +that is the register group type. + +RISC-V config registers are meant for configuring a Guest VCPU and it has +the following id bit patterns:: + + 0x8020 0000 01 <index into the kvm_riscv_config struct:24> (32bit Host) + 0x8030 0000 01 <index into the kvm_riscv_config struct:24> (64bit Host) + +Following are the RISC-V config registers: + +======================= ========= ============================================= + Encoding Register Description +======================= ========= ============================================= + 0x80x0 0000 0100 0000 isa ISA feature bitmap of Guest VCPU +======================= ========= ============================================= + +The isa config register can be read anytime but can only be written before +a Guest VCPU runs. It will have ISA feature bits matching underlying host +set by default. + +RISC-V core registers represent the general excution state of a Guest VCPU +and it has the following id bit patterns:: + + 0x8020 0000 02 <index into the kvm_riscv_core struct:24> (32bit Host) + 0x8030 0000 02 <index into the kvm_riscv_core struct:24> (64bit Host) + +Following are the RISC-V core registers: + +======================= ========= ============================================= + Encoding Register Description +======================= ========= ============================================= + 0x80x0 0000 0200 0000 regs.pc Program counter + 0x80x0 0000 0200 0001 regs.ra Return address + 0x80x0 0000 0200 0002 regs.sp Stack pointer + 0x80x0 0000 0200 0003 regs.gp Global pointer + 0x80x0 0000 0200 0004 regs.tp Task pointer + 0x80x0 0000 0200 0005 regs.t0 Caller saved register 0 + 0x80x0 0000 0200 0006 regs.t1 Caller saved register 1 + 0x80x0 0000 0200 0007 regs.t2 Caller saved register 2 + 0x80x0 0000 0200 0008 regs.s0 Callee saved register 0 + 0x80x0 0000 0200 0009 regs.s1 Callee saved register 1 + 0x80x0 0000 0200 000a regs.a0 Function argument (or return value) 0 + 0x80x0 0000 0200 000b regs.a1 Function argument (or return value) 1 + 0x80x0 0000 0200 000c regs.a2 Function argument 2 + 0x80x0 0000 0200 000d regs.a3 Function argument 3 + 0x80x0 0000 0200 000e regs.a4 Function argument 4 + 0x80x0 0000 0200 000f regs.a5 Function argument 5 + 0x80x0 0000 0200 0010 regs.a6 Function argument 6 + 0x80x0 0000 0200 0011 regs.a7 Function argument 7 + 0x80x0 0000 0200 0012 regs.s2 Callee saved register 2 + 0x80x0 0000 0200 0013 regs.s3 Callee saved register 3 + 0x80x0 0000 0200 0014 regs.s4 Callee saved register 4 + 0x80x0 0000 0200 0015 regs.s5 Callee saved register 5 + 0x80x0 0000 0200 0016 regs.s6 Callee saved register 6 + 0x80x0 0000 0200 0017 regs.s7 Callee saved register 7 + 0x80x0 0000 0200 0018 regs.s8 Callee saved register 8 + 0x80x0 0000 0200 0019 regs.s9 Callee saved register 9 + 0x80x0 0000 0200 001a regs.s10 Callee saved register 10 + 0x80x0 0000 0200 001b regs.s11 Callee saved register 11 + 0x80x0 0000 0200 001c regs.t3 Caller saved register 3 + 0x80x0 0000 0200 001d regs.t4 Caller saved register 4 + 0x80x0 0000 0200 001e regs.t5 Caller saved register 5 + 0x80x0 0000 0200 001f regs.t6 Caller saved register 6 + 0x80x0 0000 0200 0020 mode Privilege mode (1 = S-mode or 0 = U-mode) +======================= ========= ============================================= + +RISC-V csr registers represent the supervisor mode control/status registers +of a Guest VCPU and it has the following id bit patterns:: + + 0x8020 0000 03 <index into the kvm_riscv_csr struct:24> (32bit Host) + 0x8030 0000 03 <index into the kvm_riscv_csr struct:24> (64bit Host) + +Following are the RISC-V csr registers: + +======================= ========= ============================================= + Encoding Register Description +======================= ========= ============================================= + 0x80x0 0000 0300 0000 sstatus Supervisor status + 0x80x0 0000 0300 0001 sie Supervisor interrupt enable + 0x80x0 0000 0300 0002 stvec Supervisor trap vector base + 0x80x0 0000 0300 0003 sscratch Supervisor scratch register + 0x80x0 0000 0300 0004 sepc Supervisor exception program counter + 0x80x0 0000 0300 0005 scause Supervisor trap cause + 0x80x0 0000 0300 0006 stval Supervisor bad address or instruction + 0x80x0 0000 0300 0007 sip Supervisor interrupt pending + 0x80x0 0000 0300 0008 satp Supervisor address translation and protection +======================= ========= ============================================= + +RISC-V timer registers represent the timer state of a Guest VCPU and it has +the following id bit patterns:: + + 0x8030 0000 04 <index into the kvm_riscv_timer struct:24> + +Following are the RISC-V timer registers: + +======================= ========= ============================================= + Encoding Register Description +======================= ========= ============================================= + 0x8030 0000 0400 0000 frequency Time base frequency (read-only) + 0x8030 0000 0400 0001 time Time value visible to Guest + 0x8030 0000 0400 0002 compare Time compare programmed by Guest + 0x8030 0000 0400 0003 state Time compare state (1 = ON or 0 = OFF) +======================= ========= ============================================= + +RISC-V F-extension registers represent the single precision floating point +state of a Guest VCPU and it has the following id bit patterns:: + + 0x8020 0000 05 <index into the __riscv_f_ext_state struct:24> + +Following are the RISC-V F-extension registers: + +======================= ========= ============================================= + Encoding Register Description +======================= ========= ============================================= + 0x8020 0000 0500 0000 f[0] Floating point register 0 + ... + 0x8020 0000 0500 001f f[31] Floating point register 31 + 0x8020 0000 0500 0020 fcsr Floating point control and status register +======================= ========= ============================================= + +RISC-V D-extension registers represent the double precision floating point +state of a Guest VCPU and it has the following id bit patterns:: + + 0x8020 0000 06 <index into the __riscv_d_ext_state struct:24> (fcsr) + 0x8030 0000 06 <index into the __riscv_d_ext_state struct:24> (non-fcsr) + +Following are the RISC-V D-extension registers: + +======================= ========= ============================================= + Encoding Register Description +======================= ========= ============================================= + 0x8030 0000 0600 0000 f[0] Floating point register 0 + ... + 0x8030 0000 0600 001f f[31] Floating point register 31 + 0x8020 0000 0600 0020 fcsr Floating point control and status register +======================= ========= ============================================= + 4.69 KVM_GET_ONE_REG -------------------- @@ -5850,6 +6036,25 @@ Valid values for 'type' are: :: + /* KVM_EXIT_RISCV_SBI */ + struct { + unsigned long extension_id; + unsigned long function_id; + unsigned long args[6]; + unsigned long ret[2]; + } riscv_sbi; +If exit reason is KVM_EXIT_RISCV_SBI then it indicates that the VCPU has +done a SBI call which is not handled by KVM RISC-V kernel module. The details +of the SBI call are available in 'riscv_sbi' member of kvm_run structure. The +'extension_id' field of 'riscv_sbi' represents SBI extension ID whereas the +'function_id' field represents function ID of given SBI extension. The 'args' +array field of 'riscv_sbi' represents parameters for the SBI call and 'ret' +array field represents return values. The userspace should update the return +values of SBI call before resuming the VCPU. For more details on RISC-V SBI +spec refer, https://github.com/riscv/riscv-sbi-doc. + +:: + /* Fix the size of the union. */ char padding[256]; }; diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst index 2acec3b9ef65..60a29972d3f1 100644 --- a/Documentation/virt/kvm/devices/vcpu.rst +++ b/Documentation/virt/kvm/devices/vcpu.rst @@ -161,3 +161,73 @@ Specifies the base address of the stolen time structure for this VCPU. The base address must be 64 byte aligned and exist within a valid guest memory region. See Documentation/virt/kvm/arm/pvtime.rst for more information including the layout of the stolen time structure. + +4. GROUP: KVM_VCPU_TSC_CTRL +=========================== + +:Architectures: x86 + +4.1 ATTRIBUTE: KVM_VCPU_TSC_OFFSET + +:Parameters: 64-bit unsigned TSC offset + +Returns: + + ======= ====================================== + -EFAULT Error reading/writing the provided + parameter address. + -ENXIO Attribute not supported + ======= ====================================== + +Specifies the guest's TSC offset relative to the host's TSC. The guest's +TSC is then derived by the following equation: + + guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET + +This attribute is useful to adjust the guest's TSC on live migration, +so that the TSC counts the time during which the VM was paused. The +following describes a possible algorithm to use for this purpose. + +From the source VMM process: + +1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_src), + kvmclock nanoseconds (guest_src), and host CLOCK_REALTIME nanoseconds + (host_src). + +2. Read the KVM_VCPU_TSC_OFFSET attribute for every vCPU to record the + guest TSC offset (ofs_src[i]). + +3. Invoke the KVM_GET_TSC_KHZ ioctl to record the frequency of the + guest's TSC (freq). + +From the destination VMM process: + +4. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from + kvmclock (guest_src) and CLOCK_REALTIME (host_src) in their respective + fields. Ensure that the KVM_CLOCK_REALTIME flag is set in the provided + structure. + + KVM will advance the VM's kvmclock to account for elapsed time since + recording the clock values. Note that this will cause problems in + the guest (e.g., timeouts) unless CLOCK_REALTIME is synchronized + between the source and destination, and a reasonably short time passes + between the source pausing the VMs and the destination executing + steps 4-7. + +5. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_dest) and + kvmclock nanoseconds (guest_dest). + +6. Adjust the guest TSC offsets for every vCPU to account for (1) time + elapsed since recording state and (2) difference in TSCs between the + source and destination machine: + + ofs_dst[i] = ofs_src[i] - + (guest_src - guest_dest) * freq + + (tsc_src - tsc_dest) + + ("ofs[i] + tsc - guest * freq" is the guest TSC value corresponding to + a time of 0 in kvmclock. The above formula ensures that it is the + same on the destination as it was on the source). + +7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the + respective value derived in the previous step. diff --git a/Documentation/virt/kvm/devices/xics.rst b/Documentation/virt/kvm/devices/xics.rst index 2d6927e0b776..bf32c77174ab 100644 --- a/Documentation/virt/kvm/devices/xics.rst +++ b/Documentation/virt/kvm/devices/xics.rst @@ -22,7 +22,7 @@ Groups: Errors: ======= ========================================== - -EINVAL Value greater than KVM_MAX_VCPU_ID. + -EINVAL Value greater than KVM_MAX_VCPU_IDS. -EFAULT Invalid user pointer for attr->addr. -EBUSY A vcpu is already connected to the device. ======= ========================================== diff --git a/Documentation/virt/kvm/devices/xive.rst b/Documentation/virt/kvm/devices/xive.rst index 8bdf3dc38f01..8b5e7b40bdf8 100644 --- a/Documentation/virt/kvm/devices/xive.rst +++ b/Documentation/virt/kvm/devices/xive.rst @@ -91,7 +91,7 @@ the legacy interrupt mode, referred as XICS (POWER7/8). Errors: ======= ========================================== - -EINVAL Value greater than KVM_MAX_VCPU_ID. + -EINVAL Value greater than KVM_MAX_VCPU_IDS. -EFAULT Invalid user pointer for attr->addr. -EBUSY A vCPU is already connected to the device. ======= ========================================== |