Age | Commit message (Collapse) | Author |
|
Pull kvm updates from Paolo Bonzini:
"Generic:
- Use memdup_array_user() to harden against overflow.
- Unconditionally advertise KVM_CAP_DEVICE_CTRL for all
architectures.
- Clean up Kconfigs that all KVM architectures were selecting
- New functionality around "guest_memfd", a new userspace API that
creates an anonymous file and returns a file descriptor that refers
to it. guest_memfd files are bound to their owning virtual machine,
cannot be mapped, read, or written by userspace, and cannot be
resized. guest_memfd files do however support PUNCH_HOLE, which can
be used to switch a memory area between guest_memfd and regular
anonymous memory.
- New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify
per-page attributes for a given page of guest memory; right now the
only attribute is whether the guest expects to access memory via
guest_memfd or not, which in Confidential SVMs backed by SEV-SNP,
TDX or ARM64 pKVM is checked by firmware or hypervisor that
guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in
the case of pKVM).
x86:
- Support for "software-protected VMs" that can use the new
guest_memfd and page attributes infrastructure. This is mostly
useful for testing, since there is no pKVM-like infrastructure to
provide a meaningfully reduced TCB.
- Fix a relatively benign off-by-one error when splitting huge pages
during CLEAR_DIRTY_LOG.
- Fix a bug where KVM could incorrectly test-and-clear dirty bits in
non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with
a non-huge SPTE.
- Use more generic lockdep assertions in paths that don't actually
care about whether the caller is a reader or a writer.
- let Xen guests opt out of having PV clock reported as "based on a
stable TSC", because some of them don't expect the "TSC stable" bit
(added to the pvclock ABI by KVM, but never set by Xen) to be set.
- Revert a bogus, made-up nested SVM consistency check for
TLB_CONTROL.
- Advertise flush-by-ASID support for nSVM unconditionally, as KVM
always flushes on nested transitions, i.e. always satisfies flush
requests. This allows running bleeding edge versions of VMware
Workstation on top of KVM.
- Sanity check that the CPU supports flush-by-ASID when enabling SEV
support.
- On AMD machines with vNMI, always rely on hardware instead of
intercepting IRET in some cases to detect unmasking of NMIs
- Support for virtualizing Linear Address Masking (LAM)
- Fix a variety of vPMU bugs where KVM fail to stop/reset counters
and other state prior to refreshing the vPMU model.
- Fix a double-overflow PMU bug by tracking emulated counter events
using a dedicated field instead of snapshotting the "previous"
counter. If the hardware PMC count triggers overflow that is
recognized in the same VM-Exit that KVM manually bumps an event
count, KVM would pend PMIs for both the hardware-triggered overflow
and for KVM-triggered overflow.
- Turn off KVM_WERROR by default for all configs so that it's not
inadvertantly enabled by non-KVM developers, which can be
problematic for subsystems that require no regressions for W=1
builds.
- Advertise all of the host-supported CPUID bits that enumerate
IA32_SPEC_CTRL "features".
- Don't force a masterclock update when a vCPU synchronizes to the
current TSC generation, as updating the masterclock can cause
kvmclock's time to "jump" unexpectedly, e.g. when userspace
hotplugs a pre-created vCPU.
- Use RIP-relative address to read kvm_rebooting in the VM-Enter
fault paths, partly as a super minor optimization, but mostly to
make KVM play nice with position independent executable builds.
- Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
CONFIG_HYPERV as a minor optimization, and to self-document the
code.
- Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV
"emulation" at build time.
ARM64:
- LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base
granule sizes. Branch shared with the arm64 tree.
- Large Fine-Grained Trap rework, bringing some sanity to the
feature, although there is more to come. This comes with a prefix
branch shared with the arm64 tree.
- Some additional Nested Virtualization groundwork, mostly
introducing the NV2 VNCR support and retargetting the NV support to
that version of the architecture.
- A small set of vgic fixes and associated cleanups.
Loongarch:
- Optimization for memslot hugepage checking
- Cleanup and fix some HW/SW timer issues
- Add LSX/LASX (128bit/256bit SIMD) support
RISC-V:
- KVM_GET_REG_LIST improvement for vector registers
- Generate ISA extension reg_list using macros in get-reg-list
selftest
- Support for reporting steal time along with selftest
s390:
- Bugfixes
Selftests:
- Fix an annoying goof where the NX hugepage test prints out garbage
instead of the magic token needed to run the test.
- Fix build errors when a header is delete/moved due to a missing
flag in the Makefile.
- Detect if KVM bugged/killed a selftest's VM and print out a helpful
message instead of complaining that a random ioctl() failed.
- Annotate the guest printf/assert helpers with __printf(), and fix
the various bugs that were lurking due to lack of said annotation"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits)
x86/kvm: Do not try to disable kvmclock if it was not enabled
KVM: x86: add missing "depends on KVM"
KVM: fix direction of dependency on MMU notifiers
KVM: introduce CONFIG_KVM_COMMON
KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd
KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache
RISC-V: KVM: selftests: Add get-reg-list test for STA registers
RISC-V: KVM: selftests: Add steal_time test support
RISC-V: KVM: selftests: Add guest_sbi_probe_extension
RISC-V: KVM: selftests: Move sbi_ecall to processor.c
RISC-V: KVM: Implement SBI STA extension
RISC-V: KVM: Add support for SBI STA registers
RISC-V: KVM: Add support for SBI extension registers
RISC-V: KVM: Add SBI STA info to vcpu_arch
RISC-V: KVM: Add steal-update vcpu request
RISC-V: KVM: Add SBI STA extension skeleton
RISC-V: paravirt: Implement steal-time support
RISC-V: Add SBI STA extension definitions
RISC-V: paravirt: Add skeleton for pv-time support
RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
- Change global variables to local
- Add missing kernel-doc function parameter descriptions
- Remove unused parameter from a macro
- Remove obsolete Kconfig entry
- Fix comments
- Fix typos, mostly scripted, manually reviewed
and a micro-optimization got misplaced as a cleanup:
- Micro-optimize the asm code in secondary_startup_64_no_verify()
* tag 'x86-cleanups-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
arch/x86: Fix typos
x86/head_64: Use TESTB instead of TESTL in secondary_startup_64_no_verify()
x86/docs: Remove reference to syscall trampoline in PTI
x86/Kconfig: Remove obsolete config X86_32_SMP
x86/io: Remove the unused 'bw' parameter from the BUILDIO() macro
x86/mtrr: Document missing function parameters in kernel-doc
x86/setup: Make relocated_ramdisk a local variable of relocate_initrd()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.
Features:
- Add Jan Kara as VFS reviewer
- Show correct device and inode numbers in proc/<pid>/maps for vma
files on stacked filesystems. This is now easily doable thanks to
the backing file work from the last cycles. This comes with
selftests
Cleanups:
- Remove a redundant might_sleep() from wait_on_inode()
- Initialize pointer with NULL, not 0
- Clarify comment on access_override_creds()
- Rework and simplify eventfd_signal() and eventfd_signal_mask()
helpers
- Process aio completions in batches to avoid needless wakeups
- Completely decouple struct mnt_idmap from namespaces. We now only
keep the actual idmapping around and don't stash references to
namespaces
- Reformat maintainer entries to indicate that a given subsystem
belongs to fs/
- Simplify fput() for files that were never opened
- Get rid of various pointless file helpers
- Rename various file helpers
- Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from
last cycle
- Make relatime_need_update() return bool
- Use GFP_KERNEL instead of GFP_USER when allocating superblocks
- Replace deprecated ida_simple_*() calls with their current ida_*()
counterparts
Fixes:
- Fix comments on user namespace id mapping helpers. They aren't
kernel doc comments so they shouldn't be using /**
- s/Retuns/Returns/g in various places
- Add missing parameter documentation on can_move_mount_beneath()
- Rename i_mapping->private_data to i_mapping->i_private_data
- Fix a false-positive lockdep warning in pipe_write() for watch
queues
- Improve __fget_files_rcu() code generation to improve performance
- Only notify writer that pipe resizing has finished after setting
pipe->max_usage otherwise writers are never notified that the pipe
has been resized and hang
- Fix some kernel docs in hfsplus
- s/passs/pass/g in various places
- Fix kernel docs in ntfs
- Fix kcalloc() arguments order reported by gcc 14
- Fix uninitialized value in reiserfs"
* tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits)
reiserfs: fix uninit-value in comp_keys
watch_queue: fix kcalloc() arguments order
ntfs: dir.c: fix kernel-doc function parameter warnings
fs: fix doc comment typo fs tree wide
selftests/overlayfs: verify device and inode numbers in /proc/pid/maps
fs/proc: show correct device and inode numbers in /proc/pid/maps
eventfd: Remove usage of the deprecated ida_simple_xx() API
fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation
fs/hfsplus: wrapper.c: fix kernel-doc warnings
fs: add Jan Kara as reviewer
fs/inode: Make relatime_need_update return bool
pipe: wakeup wr_wait after setting max_usage
file: remove __receive_fd()
file: stop exposing receive_fd_user()
fs: replace f_rcuhead with f_task_work
file: remove pointless wrapper
file: s/close_fd_get_file()/file_close_fd()/g
Improve __fget_files_rcu() code generation (and thus __fget_light())
file: massage cleanup of files that failed to open
fs/pipe: Fix lockdep false-positive in watchqueue pipe_write()
...
|
|
KVM x86 MMU changes for 6.8:
- Fix a relatively benign off-by-one error when splitting huge pages during
CLEAR_DIRTY_LOG.
- Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf
TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE.
- Relax the TDP MMU's lockdep assertions related to holding mmu_lock for read
versus write so that KVM doesn't pass "bool shared" all over the place just
to have precise assertions in paths that don't actually care about whether
the caller is a reader or a writer.
|
|
KVM Xen change for 6.8:
To workaround Xen guests that don't expect Xen PV clocks to be marked as being
based on a stable TSC, add a Xen config knob to allow userspace to opt out of
KVM setting the "TSC stable" bit in Xen PV clocks. Note, the "TSC stable" bit
was added to the PVCLOCK ABI by KVM without an ack from Xen, i.e. KVM isn't
entirely blameless for the buggy guest behavior.
|
|
KVM SVM changes for 6.8:
- Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL.
- Advertise flush-by-ASID support for nSVM unconditionally, as KVM always
flushes on nested transitions, i.e. always satisfies flush requests. This
allows running bleeding edge versions of VMware Workstation on top of KVM.
- Sanity check that the CPU supports flush-by-ASID when enabling SEV support.
- Fix a benign NMI virtualization bug where KVM would unnecessarily intercept
IRET when manually injecting an NMI, e.g. when KVM pends an NMI and injects
a second, "simultaneous" NMI.
|
|
KVM x86 support for virtualizing Linear Address Masking (LAM)
Add KVM support for Linear Address Masking (LAM). LAM tweaks the canonicality
checks for most virtual address usage in 64-bit mode, such that only the most
significant bit of the untranslated address bits must match the polarity of the
last translated address bit. This allows software to use ignored, untranslated
address bits for metadata, e.g. to efficiently tag pointers for address
sanitization.
LAM can be enabled separately for user pointers and supervisor pointers, and
for userspace LAM can be select between 48-bit and 57-bit masking
- 48-bit LAM: metadata bits 62:48, i.e. LAM width of 15.
- 57-bit LAM: metadata bits 62:57, i.e. LAM width of 6.
For user pointers, LAM enabling utilizes two previously-reserved high bits from
CR3 (similar to how PCID_NOFLUSH uses bit 63): LAM_U48 and LAM_U57, bits 62 and
61 respectively. Note, if LAM_57 is set, LAM_U48 is ignored, i.e.:
- CR3.LAM_U48=0 && CR3.LAM_U57=0 == LAM disabled for user pointers
- CR3.LAM_U48=1 && CR3.LAM_U57=0 == LAM-48 enabled for user pointers
- CR3.LAM_U48=x && CR3.LAM_U57=1 == LAM-57 enabled for user pointers
For supervisor pointers, LAM is controlled by a single bit, CR4.LAM_SUP, with
the 48-bit versus 57-bit LAM behavior following the current paging mode, i.e.:
- CR4.LAM_SUP=0 && CR4.LA57=x == LAM disabled for supervisor pointers
- CR4.LAM_SUP=1 && CR4.LA57=0 == LAM-48 enabled for supervisor pointers
- CR4.LAM_SUP=1 && CR4.LA57=1 == LAM-57 enabled for supervisor pointers
The modified LAM canonicality checks:
- LAM_S48 : [ 1 ][ metadata ][ 1 ]
63 47
- LAM_U48 : [ 0 ][ metadata ][ 0 ]
63 47
- LAM_S57 : [ 1 ][ metadata ][ 1 ]
63 56
- LAM_U57 + 5-lvl paging : [ 0 ][ metadata ][ 0 ]
63 56
- LAM_U57 + 4-lvl paging : [ 0 ][ metadata ][ 0...0 ]
63 56..47
The bulk of KVM support for LAM is to emulate LAM's modified canonicality
checks. The approach taken by KVM is to "fill" the metadata bits using the
highest bit of the translated address, e.g. for LAM-48, bit 47 is sign-extended
to bits 62:48. The most significant bit, 63, is *not* modified, i.e. its value
from the raw, untagged virtual address is kept for the canonicality check. This
untagging allows
Aside from emulating LAM's canonical checks behavior, LAM has the usual KVM
touchpoints for selectable features: enumeration (CPUID.7.1:EAX.LAM[bit 26],
enabling via CR3 and CR4 bits, etc.
|
|
KVM x86 PMU changes for 6.8:
- Fix a variety of bugs where KVM fail to stop/reset counters and other state
prior to refreshing the vPMU model.
- Fix a double-overflow PMU bug by tracking emulated counter events using a
dedicated field instead of snapshotting the "previous" counter. If the
hardware PMC count triggers overflow that is recognized in the same VM-Exit
that KVM manually bumps an event count, KVM would pend PMIs for both the
hardware-triggered overflow and for KVM-triggered overflow.
|
|
KVM x86 misc changes for 6.8:
- Turn off KVM_WERROR by default for all configs so that it's not
inadvertantly enabled by non-KVM developers, which can be problematic for
subsystems that require no regressions for W=1 builds.
- Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL
"features".
- Don't force a masterclock update when a vCPU synchronizes to the current TSC
generation, as updating the masterclock can cause kvmclock's time to "jump"
unexpectedly, e.g. when userspace hotplugs a pre-created vCPU.
- Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths,
partly as a super minor optimization, but mostly to make KVM play nice with
position independent executable builds.
|
|
KVM x86 Hyper-V changes for 6.8:
- Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on
CONFIG_HYPERV as a minor optimization, and to self-document the code.
- Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation"
at build time.
|
|
Common KVM changes for 6.8:
- Use memdup_array_user() to harden against overflow.
- Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures.
|
|
Support for KVM software-protected VMs should not be configurable,
if KVM is not available at all.
Fixes: 89ea60c2c7b5 ("KVM: x86: Add support for "protected VMs" that can utilize private memory")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
CONFIG_HAVE_KVM is currently used by some architectures to either
enabled the KVM config proper, or to enable host-side code that is
not part of the KVM module. However, CONFIG_KVM's "select" statement
in virt/kvm/Kconfig corresponds to a third meaning, namely to
enable common Kconfigs required by all architectures that support
KVM.
These three meanings can be replaced respectively by an
architecture-specific Kconfig, by IS_ENABLED(CONFIG_KVM), or by
a new Kconfig symbol that is in turn selected by the
architecture-specific "config KVM".
Start by introducing such a new Kconfig symbol, CONFIG_KVM_COMMON.
Unlike CONFIG_HAVE_KVM, it is selected by CONFIG_KVM, not by
architecture code, and it brings in all dependencies of common
KVM code. In particular, INTERVAL_TREE was missing in loongarch
and riscv, so that is another thing that is fixed.
Fixes: 8132d887a702 ("KVM: remove CONFIG_HAVE_KVM_EVENTFD", 2023-12-08)
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Closes: https://lore.kernel.org/all/44907c6b-c5bd-4e4a-a921-e4d3825539d8@infradead.org/
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Fix typos, most reported by "codespell arch/x86". Only touches comments,
no code changes.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Link: https://lore.kernel.org/r/20240103004011.1758650-1-helgaas@kernel.org
|
|
KVM/riscv changes for 6.8 part #1
- KVM_GET_REG_LIST improvement for vector registers
- Generate ISA extension reg_list using macros in get-reg-list selftest
- Steal time account support along with selftest
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
LoongArch KVM changes for v6.8
1. Optimization for memslot hugepage checking.
2. Cleanup and fix some HW/SW timer issues.
3. Add LSX/LASX (128bit/256bit SIMD) support.
|
|
When intercepts are enabled for MSR_IA32_XSS, the host will swap in/out
the guest-defined values while context-switching to/from guest mode.
However, in the case of SEV-ES, vcpu->arch.guest_state_protected is set,
so the guest-defined value is effectively ignored when switching to
guest mode with the understanding that the VMSA will handle swapping
in/out this register state.
However, SVM is still configured to intercept these accesses for SEV-ES
guests, so the values in the initial MSR_IA32_XSS are effectively
read-only, and a guest will experience undefined behavior if it actually
tries to write to this MSR. Fortunately, only CET/shadowstack makes use
of this register on SEV-ES-capable systems currently, which isn't yet
widely used, but this may become more of an issue in the future.
Additionally, enabling intercepts of MSR_IA32_XSS results in #VC
exceptions in the guest in certain paths that can lead to unexpected #VC
nesting levels. One example is SEV-SNP guests when handling #VC
exceptions for CPUID instructions involving leaf 0xD, subleaf 0x1, since
they will access MSR_IA32_XSS as part of servicing the CPUID #VC, then
generate another #VC when accessing MSR_IA32_XSS, which can lead to
guest crashes if an NMI occurs at that point in time. Running perf on a
guest while it is issuing such a sequence is one example where these can
be problematic.
Address this by disabling intercepts of MSR_IA32_XSS for SEV-ES guests
if the host/guest configuration allows it. If the host/guest
configuration doesn't allow for MSR_IA32_XSS, leave it intercepted so
that it can be caught by the existing checks in
kvm_{set,get}_msr_common() if the guest still attempts to access it.
Fixes: 376c6d285017 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading")
Cc: Alexey Kardashevskiy <aik@amd.com>
Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-Id: <20231016132819.1002933-4-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
All platforms with a kernel irqchip have support for irqfd. Unify the
two configuration items so that userspace can expect to use irqfd to
inject interrupts into the irqchip.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
virt/kvm/eventfd.c is compiled unconditionally, meaning that the ioeventfds
member of struct kvm is accessed unconditionally. CONFIG_HAVE_KVM_EVENTFD
therefore must be defined for KVM common code to compile successfully,
remove it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
In general, activating long mode involves setting the EFER_LME bit in
the EFER register and then enabling the X86_CR0_PG bit in the CR0
register. At this point, the EFER_LMA bit will be set automatically by
hardware.
In the case of SVM/SEV guests where writes to CR0 are intercepted, it's
necessary for the host to set EFER_LMA on behalf of the guest since
hardware does not see the actual CR0 write.
In the case of SEV-ES guests where writes to CR0 are trapped instead of
intercepted, the hardware *does* see/record the write to CR0 before
exiting and passing the value on to the host, so as part of enabling
SEV-ES support commit f1c6366e3043 ("KVM: SVM: Add required changes to
support intercepts under SEV-ES") dropped special handling of the
EFER_LMA bit with the understanding that it would be set automatically.
However, since the guest never explicitly sets the EFER_LMA bit, the
host never becomes aware that it has been set. This becomes problematic
when userspace tries to get/set the EFER values via
KVM_GET_SREGS/KVM_SET_SREGS, since the EFER contents tracked by the host
will be missing the EFER_LMA bit, and when userspace attempts to pass
the EFER value back via KVM_SET_SREGS it will fail a sanity check that
asserts that EFER_LMA should always be set when X86_CR0_PG and EFER_LME
are set.
Fix this by always inferring the value of EFER_LMA based on X86_CR0_PG
and EFER_LME, regardless of whether or not SEV-ES is enabled.
Fixes: f1c6366e3043 ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
Reported-by: Peter Gonda <pgonda@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210507165947.2502412-2-seanjc@google.com>
[A two year old patch that was revived after we noticed the failure in
KVM_SET_SREGS and a similar patch was posted by Michael Roth. This is
Sean's patch, but with Michael's more complete commit message. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
kvm-master
KVM fixes for 6.7-rcN:
- When checking if a _running_ vCPU is "in-kernel", i.e. running at CPL0,
get the CPL directly instead of relying on preempted_in_kernel, which
is valid if and only if the vCPU was preempted, i.e. NOT running.
- Set .owner for various KVM file_operations so that files refcount the
KVM module until KVM is done executing _all_ code, including the last
few instructions of kvm_put_kvm(). And then revert the misguided
attempt to rely on "struct kvm" refcounts to pin KVM-the-module.
- Fix a benign "return void" that was recently introduced.
|
|
Unless explicitly told to do so (by passing 'clocksource=tsc' and
'tsc=stable:socket', and then jumping through some hoops concerning
potential CPU hotplug) Xen will never use TSC as its clocksource.
Hence, by default, a Xen guest will not see PVCLOCK_TSC_STABLE_BIT set
in either the primary or secondary pvclock memory areas. This has
led to bugs in some guest kernels which only become evident if
PVCLOCK_TSC_STABLE_BIT *is* set in the pvclocks. Hence, to support
such guests, give the VMM a new Xen HVM config flag to tell KVM to
forcibly clear the bit in the Xen pvclocks.
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lore.kernel.org/r/20231102162128.2353459-1-paul@xen.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
'struct hv_vmcb_enlightenments' in VMCB only make sense when either
CONFIG_KVM_HYPERV or CONFIG_HYPERV is enabled.
No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-17-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
'hv_evmcs_vmptr'/'hv_evmcs_map'/'hv_evmcs' fields in 'struct nested_vmx'
should not be used when !CONFIG_KVM_HYPERV, hide them when
!CONFIG_KVM_HYPERV.
No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-16-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
There's a number of 'vmx->nested.hv_evmcs' accesses in nested.c, introduce
'nested_vmx_evmcs()' accessor to hide them all in !CONFIG_KVM_HYPERV case.
No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-15-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
In order to get rid of raw 'vmx->nested.hv_evmcs_vmptr' accesses when
!CONFIG_KVM_HYPERV, introduce a pair of helpers:
nested_vmx_is_evmptr12_valid() to check that eVMPTR points to a valid
address.
nested_vmx_is_evmptr12_valid() to check that eVMPTR either points to a
valid address or is in 'pending' port-migration state (meaning it is
supposed to be valid but the exact address wasn't acquired from guest's
memory yet).
No functional change intended.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-14-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Hyper-V emulation in KVM is a fairly big chunk and in some cases it may be
desirable to not compile it in to reduce module sizes as well as the attack
surface. Introduce CONFIG_KVM_HYPERV option to make it possible.
Note, there's room for further nVMX/nSVM code optimizations when
!CONFIG_KVM_HYPERV, this will be done in follow-up patches.
Reorganize Makefile a bit so all CONFIG_HYPERV and CONFIG_KVM_HYPERV files
are grouped together.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-13-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
In preparation for making Hyper-V emulation optional, move Hyper-V specific
guest_cpuid_has_evmcs() to hyperv.h.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-12-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
To avoid overloading handle_vmclear() with Hyper-V specific details and to
prepare the code to making Hyper-V emulation optional, create a dedicated
nested_evmcs_handle_vmclear() helper.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-9-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
As a preparation to making Hyper-V emulation optional, introduce a helper
to handle pending KVM_REQ_HV_TLB_FLUSH requests.
No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-8-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Some Enlightened VMCS related code is needed both by Hyper-V on KVM and
KVM on Hyper-V. As a preparation to making Hyper-V emulation optional,
create dedicated 'hyperv_evmcs.{ch}' files which are used by both.
No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-7-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
As a preparation to making Hyper-V emulation optional, create a dedicated
kvm_hv_synic_has_vector() helper to avoid extra ifdefs in lapic.c.
No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-6-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
As a preparation to making Hyper-V emulation optional, create a dedicated
kvm_hv_synic_auto_eoi_set() helper to avoid extra ifdefs in lapic.c
No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-5-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
hyperv.{ch} is currently a mix of stuff which is needed by both Hyper-V on
KVM and KVM on Hyper-V. As a preparation to making Hyper-V emulation
optional, put KVM-on-Hyper-V specific code into dedicated files.
No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-4-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Hyper-V partition assist page is used when KVM runs on top of Hyper-V and
is not used for Windows/Hyper-V guests on KVM, this means that 'hv_pa_pg'
placement in 'struct kvm_hv' is unfortunate. As a preparation to making
Hyper-V emulation optional, move 'hv_pa_pg' to 'struct kvm_arch' and put it
under CONFIG_HYPERV.
While on it, introduce hv_get_partition_assist_page() helper to allocate
partition assist page. Move the comment explaining why we use a single page
for all vCPUs from VMX and expand it a bit.
No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-3-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Saving a few bytes of memory per KVM VM is certainly great but what's more
important is the ability to see where the code accesses Xen emulation
context while CONFIG_KVM_XEN is not enabled. Currently, kvm_cpu_get_extint()
is the only such place and it is harmless: kvm_xen_has_interrupt() always
returns '0' when !CONFIG_KVM_XEN.
No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20231205103630.1391318-2-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The requested info will be stored in 'guest_xsave->region' referenced by
the incoming pointer "struct kvm_xsave *guest_xsave", thus there is no need
to explicitly use return void expression for a void function "static void
kvm_vcpu_ioctl_x86_get_xsave(...)". The issue is caught with [-Wpedantic].
Fixes: 2d287ec65e79 ("x86/fpu: Allow caller to constrain xfeatures when copying to uabi buffer")
Signed-off-by: Like Xu <likexu@tencent.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20231007064019.17472-1-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Set .owner for all KVM-owned filed types so that the KVM module is pinned
until any files with callbacks back into KVM are completely freed. Using
"struct kvm" as a proxy for the module, i.e. keeping KVM-the-module alive
while there are active VMs, doesn't provide full protection.
Userspace can invoke delete_module() the instant the last reference to KVM
is put. If KVM itself puts the last reference, e.g. via kvm_destroy_vm(),
then it's possible for KVM to be preempted and deleted/unloaded before KVM
fully exits, e.g. when the task running kvm_destroy_vm() is scheduled back
in, it will jump to a code page that is no longer mapped.
Note, file types that can call into sub-module code, e.g. kvm-intel.ko or
kvm-amd.ko on x86, must use the module pointer passed to kvm_init(), not
THIS_MODULE (which points at kvm.ko). KVM assumes that if /dev/kvm is
reachable, e.g. VMs are active, then the vendor module is loaded.
To reduce the probability of forgetting to set .owner entirely, use
THIS_MODULE for stats files where KVM does not call back into vendor code.
This reverts commit 70375c2d8fa3fb9b0b59207a9c5df1e2e1205c10, and fixes
several other file types that have been buggy since their introduction.
Fixes: 70375c2d8fa3 ("Revert "KVM: set owner of cpu and vm file operations"")
Fixes: 3bcd0662d66f ("KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/all/20231010003746.GN800259@ZenIV
Link: https://lore.kernel.org/r/20231018204624.1905300-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Fix the comment about what can and cannot happen when mmu_unsync_pages_lock
is not help. The comment correctly mentions "clearing sp->unsync", but then
it talks about unsync going from 0 to 1.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231125083400.1399197-5-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
It is cheap to take tdp_mmu_pages_lock in all write-side critical sections.
We already do it all the time when zapping with read_lock(), so it is not
a problem to do it from the kvm_tdp_mmu_zap_all() path (aka
kvm_arch_flush_shadow_all(), aka VM destruction and MMU notifier release).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231125083400.1399197-4-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The "bool shared" argument is more or less unnecessary in the
for_each_*_tdp_mmu_root_yield_safe() macros. Many users check for
the lock before calling it; all of them either call small functions
that do the check, or end up calling tdp_mmu_set_spte_atomic() and
tdp_mmu_iter_set_spte(). Add a few assertions to make up for the
lost check in for_each_*_tdp_mmu_root_yield_safe(), but even this
is probably overkill and mostly for documentation reasons.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231125083400.1399197-3-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Neither tdp_mmu_next_root nor kvm_tdp_mmu_put_root need to know
if the lock is taken for read or write. Either way, protection
is achieved via RCU and tdp_mmu_pages_lock. Remove the argument
and just assert that the lock is taken.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231125083400.1399197-2-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Re-check that the given SPTE is still a leaf and present SPTE after a
failed cmpxchg in clear_dirty_gfn_range(). clear_dirty_gfn_range()
intends to only operate on present leaf SPTEs, but that could change
after a failed cmpxchg.
A check for present was added in commit 3354ef5a592d ("KVM: x86/mmu:
Check for present SPTE when clearing dirty bit in TDP MMU") but the
check for leaf is still buried in tdp_root_for_each_leaf_pte() and does
not get rechecked on retry.
Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20231027172640.2335197-3-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Fix an off-by-1 error when passing in the range of pages to
kvm_mmu_try_split_huge_pages() during CLEAR_DIRTY_LOG. Specifically, end
is the last page that needs to be split (inclusive) so pass in `end + 1`
since kvm_mmu_try_split_huge_pages() expects the `end` to be
non-inclusive.
At worst this will cause a huge page to be write-protected instead of
eagerly split, which is purely a performance issue, not a correctness
issue. But even that is unlikely as it would require userspace pass in a
bitmap where the last page is the only 4K page on a huge page that needs
to be split.
Reported-by: Vipin Sharma <vipinsh@google.com>
Fixes: cb00a70bd4b7 ("KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG")
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20231027172640.2335197-2-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
cpuid.c utilizes vmemdup_user() and array_size() to copy two userspace
arrays. This, currently, does not check for an overflow.
Use the new wrapper vmemdup_array_user() to copy the arrays more safely,
as vmemdup_user() doesn't check for overflow.
Note, KVM explicitly checks the number of entries before duplicating the
array, i.e. adding the overflow check should be a glorified nop.
Suggested-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Link: https://lore.kernel.org/r/20231102181526.43279-2-pstanner@redhat.com
[sean: call out that KVM pre-checks the number of entries]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Explicitly track emulated counter events instead of using the common
counter value that's shared with the hardware counter owned by perf.
Bumping the common counter requires snapshotting the pre-increment value
in order to detect overflow from emulation, and the snapshot approach is
inherently flawed.
Snapshotting the previous counter at every increment assumes that there is
at most one emulated counter event per emulated instruction (or rather,
between checks for KVM_REQ_PMU). That's mostly holds true today because
KVM only emulates (branch) instructions retired, but the approach will
fall apart if KVM ever supports event types that don't have a 1:1
relationship with instructions.
And KVM already has a relevant bug, as handle_invalid_guest_state()
emulates multiple instructions without checking KVM_REQ_PMU, i.e. could
miss an overflow event due to clobbering pmc->prev_counter. Not checking
KVM_REQ_PMU is problematic in both cases, but at least with the emulated
counter approach, the resulting behavior is delayed overflow detection,
as opposed to completely lost detection.
Tracking the emulated count fixes another bug where the snapshot approach
can signal spurious overflow due to incorporating both the emulated count
and perf's count in the check, i.e. if overflow is detected by perf, then
KVM's emulation will also incorrectly signal overflow. Add a comment in
the related code to call out the need to process emulated events *after*
pausing the perf event (big kudos to Mingwei for figuring out that
particular wrinkle).
Cc: Mingwei Zhang <mizhang@google.com>
Cc: Roman Kagan <rkagan@amazon.de>
Cc: Jim Mattson <jmattson@google.com>
Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20231103230541.352265-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Update a PMC's sample period in pmc_write_counter() to deduplicate code
across all callers of pmc_write_counter(). Opportunistically move
pmc_write_counter() into pmc.c now that it's doing more work. WRMSR isn't
such a hot path that an extra CALL+RET pair will be problematic, and the
order of function definitions needs to be changed anyways, i.e. now is a
convenient time to eat the churn.
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20231103230541.352265-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Remove code that unnecessarily clears event_count and need_cleanup in
kvm_pmu_init(), the entire kvm_pmu is zeroed just a few lines earlier.
Vendor code doesn't set event_count or need_cleanup during .init(), and
if either VMX or SVM did set those fields it would be a flagrant bug.
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20231103230541.352265-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop kvm_vcpu_reset()'s call to kvm_pmu_reset(), the call is performed
only for RESET, which is really just the same thing as vCPU creation,
and kvm_arch_vcpu_create() *just* called kvm_pmu_init(), i.e. there can't
possibly be any work to do.
Unlike Intel, AMD's amd_pmu_refresh() does fill all_valid_pmc_idx even if
guest CPUID is empty, but everything that is at all dynamic is guaranteed
to be '0'/NULL, e.g. it should be impossible for KVM to have already
created a perf event.
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20231103230541.352265-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Stop all counters and release all perf events before refreshing the vPMU,
i.e. before reconfiguring the vPMU to respond to changes in the vCPU
model.
Clear need_cleanup in kvm_pmu_reset() as well so that KVM doesn't
prematurely stop counters, e.g. if KVM enters the guest and enables
counters before the vCPU is scheduled out.
Cc: stable@vger.kernel.org
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20231103230541.352265-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|