Age | Commit message (Collapse) | Author |
|
Commit 3bcbc20942db ("selftests/rseq: Play nice with binaries statically
linked against glibc 2.35+") which is now in Linus' tree introduced uses
of __weak but did nothing to ensure that a definition is provided for it
resulting in build failures for the rseq tests:
rseq.c:41:1: error: unknown type name '__weak'
__weak ptrdiff_t __rseq_offset;
^
rseq.c:41:17: error: expected ';' after top level declarator
__weak ptrdiff_t __rseq_offset;
^
;
rseq.c:42:1: error: unknown type name '__weak'
__weak unsigned int __rseq_size;
^
rseq.c:43:1: error: unknown type name '__weak'
__weak unsigned int __rseq_flags;
Fix this by using the definition from tools/include compiler.h.
Fixes: 3bcbc20942db ("selftests/rseq: Play nice with binaries statically linked against glibc 2.35+")
Signed-off-by: Mark Brown <broonie@kernel.org>
Message-Id: <20230804-kselftest-rseq-build-v1-1-015830b66aa9@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.5, part #2
- Fixes for the configuration of SVE/SME traps when hVHE mode is in use
- Allow use of pKVM on systems with FF-A implementations that are v1.0
compatible
- Request/release percpu IRQs (arch timer, vGIC maintenance) correctly
when pKVM is in use
- Fix function prototype after __kvm_host_psci_cpu_entry() rename
- Skip to the next instruction when emulating writes to TCR_EL1 on
AmpereOne systems
|
|
To avoid possible time-of-check/time-of-use issues, the GHCB should
almost never be accessed outside dump_ghcb, sev_es_sync_to_ghcb
and sev_es_sync_from_ghcb. The only legitimate uses are to set the
exitinfo fields and to find the address of the scratch area embedded
in the ghcb. Accessing ghcb_usage also goes through svm->sev_es.ghcb
in sev_es_validate_vmgexit(), but that is because anyway the value is
not used.
Removing a shortcut variable that contains the value of svm->sev_es.ghcb
makes these cases a bit more verbose, but it limits the chance of someone
reading the ghcb by mistake.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
A KVM guest using SEV-ES or SEV-SNP with multiple vCPUs can trigger
a double fetch race condition vulnerability and invoke the VMGEXIT
handler recursively.
sev_handle_vmgexit() maps the GHCB page using kvm_vcpu_map() and then
fetches the exit code using ghcb_get_sw_exit_code(). Soon after,
sev_es_validate_vmgexit() fetches the exit code again. Since the GHCB
page is shared with the guest, the guest is able to quickly swap the
values with another vCPU and hence bypass the validation. One vmexit code
that can be rejected by sev_es_validate_vmgexit() is SVM_EXIT_VMGEXIT;
if sev_handle_vmgexit() observes it in the second fetch, the call
to svm_invoke_exit_handler() will invoke sev_handle_vmgexit() again
recursively.
To avoid the race, always fetch the GHCB data from the places where
sev_es_sync_from_ghcb stores it.
Exploiting recursions on linux kernel has been proven feasible
in the past, but the impact is mitigated by stack guard pages
(CONFIG_VMAP_STACK). Still, if an attacker manages to call the handler
multiple times, they can theoretically trigger a stack overflow and
cause a denial-of-service, or potentially guest-to-host escape in kernel
configurations without stack guard pages.
Note that winning the race reliably in every iteration is very tricky
due to the very tight window of the fetches; depending on the compiler
settings, they are often consecutive because of optimization and inlining.
Tested by booting an SEV-ES RHEL9 guest.
Fixes: CVE-2023-4155
Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Cc: stable@vger.kernel.org
Reported-by: Andy Nguyen <theflow@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Validation of the GHCB is susceptible to time-of-check/time-of-use vulnerabilities.
To avoid them, we would like to always snapshot the fields that are read in
sev_es_validate_vmgexit(), and not use the GHCB anymore after it returns.
This means:
- invoking sev_es_sync_from_ghcb() before any GHCB access, including before
sev_es_validate_vmgexit()
- snapshotting all fields including the valid bitmap and the sw_scratch field,
which are currently not caching anywhere.
The valid bitmap is the first thing to be copied out of the GHCB; then,
further accesses will use the copy in svm->sev_es.
Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add coverage to x86's set_sregs_test to verify KVM rejects vendor-agnostic
illegal CR0 values, i.e. CR0 values whose legality doesn't depend on the
current VMX mode. KVM historically has neglected to reject bad CR0s from
userspace, i.e. would happily accept a completely bogus CR0 via
KVM_SET_SREGS{2}.
Punt VMX specific subtests to future work, as they would require quite a
bit more effort, and KVM gets coverage for CR0 checks in general through
other means, e.g. KVM-Unit-Tests.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230613203037.1968489-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Stuff CR0 and/or CR4 to be compliant with a restricted guest if and only
if KVM itself is not configured to utilize unrestricted guests, i.e. don't
stuff CR0/CR4 for a restricted L2 that is running as the guest of an
unrestricted L1. Any attempt to VM-Enter a restricted guest with invalid
CR0/CR4 values should fail, i.e. in a nested scenario, KVM (as L0) should
never observe a restricted L2 with incompatible CR0/CR4, since nested
VM-Enter from L1 should have failed.
And if KVM does observe an active, restricted L2 with incompatible state,
e.g. due to a KVM bug, fudging CR0/CR4 instead of letting VM-Enter fail
does more harm than good, as KVM will often neglect to undo the side
effects, e.g. won't clear rmode.vm86_active on nested VM-Exit, and thus
the damage can easily spill over to L1. On the other hand, letting
VM-Enter fail due to bad guest state is more likely to contain the damage
to L2 as KVM relies on hardware to perform most guest state consistency
checks, i.e. KVM needs to be able to reflect a failed nested VM-Enter into
L1 irrespective of (un)restricted guest behavior.
Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Fixes: bddd82d19e2e ("KVM: nVMX: KVM needs to unset "unrestricted guest" VM-execution control in vmcs02 if vmcs12 doesn't set it")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230613203037.1968489-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Reject KVM_SET_SREGS{2} with -EINVAL if the incoming CR0 is invalid,
e.g. due to setting bits 63:32, illegal combinations, or to a value that
isn't allowed in VMX (non-)root mode. The VMX checks in particular are
"fun" as failure to disallow Real Mode for an L2 that is configured with
unrestricted guest disabled, when KVM itself has unrestricted guest
enabled, will result in KVM forcing VM86 mode to virtual Real Mode for
L2, but then fail to unwind the related metadata when synthesizing a
nested VM-Exit back to L1 (which has unrestricted guest enabled).
Opportunistically fix a benign typo in the prototype for is_valid_cr4().
Cc: stable@vger.kernel.org
Reported-by: syzbot+5feef0b9ee9c8e9e5689@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000f316b705fdf6e2b4@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230613203037.1968489-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Remove coccinelle's recommendation to use DEFINE_DEBUGFS_ATTRIBUTE()
instead of DEFINE_SIMPLE_ATTRIBUTE(). Regardless of whether or not the
"significant overhead" incurred by debugfs_create_file() is actually
meaningful, warnings from the script have led to a rash of low-quality
patches that have sowed confusion and consumed maintainer time for little
to no benefit. There have been no less than four attempts to "fix" KVM,
and a quick search on lore shows that KVM is not alone.
This reverts commit 5103068eaca290f890a30aae70085fac44cecaf6.
Link: https://lore.kernel.org/all/87tu2nbnz3.fsf@mpe.ellerman.id.au
Link: https://lore.kernel.org/all/c0b98151-16b6-6d8f-1765-0f7d46682d60@redhat.com
Link: https://lkml.kernel.org/r/20230706072954.4881-1-duminjie%40vivo.com
Link: https://lore.kernel.org/all/Y2FsbufV00jbyF0B@google.com
Link: https://lore.kernel.org/all/Y2ENJJ1YiSg5oHiy@orome
Link: https://lore.kernel.org/all/7560b350e7b23786ce712118a9a504356ff1cca4.camel@kernel.org
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230726202920.507756-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Verify that VM and vCPU binary stats files are usable even after userspace
has put its last direct reference to the VM. This is a regression test
for a UAF bug where KVM didn't gift the stats files a reference to the VM.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230711230131.648752-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Expand the binary stats test to verify that a stats fd can be dup()'d
and read, to (very) roughly simulate userspace passing around the file.
Adding the dup() test is primarily an intermediate step towards verifying
that userspace can read VM/vCPU stats before _and_ after userspace closes
its copy of the VM fd; the dup() test itself is only mildly interesting.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230711230131.648752-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Verify that KVM doesn't artificially limit KVM_GET_STATS_FD to a single
file per VM/vCPU. There's no known use case for getting multiple stats
fds, but it should work, and more importantly creating multiple files will
make it easier to test that KVM correct manages VM refcounts for stats
files.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230711230131.648752-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Explicitly free the all-encompassing vcpus array in the binary stats test
so that the test is consistent with respect to freeing all dynamically
allocated resources (versus letting them be freed on exit).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230711230131.648752-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Move the stats fd cleanup code into stats_test() and drop the
superfluous vm_stats_test() and vcpu_stats_test() helpers in order to
decouple creation of the stats file from consuming/testing the file
(deduping code is a bonus). This will make it easier to test various
edge cases related to stats, e.g. that userspace can dup() a stats fd,
that userspace can have multiple stats files for a singleVM/vCPU, etc.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230711230131.648752-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Use pread() with an explicit offset when reading the header and the header
name for a binary stats fd so that the common helper and the binary stats
test don't subtly rely on the file effectively being untouched, e.g. to
allow multiple reads of the header, name, etc.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230711230131.648752-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Grab a reference to KVM prior to installing VM and vCPU stats file
descriptors to ensure the underlying VM and vCPU objects are not freed
until the last reference to any and all stats fds are dropped.
Note, the stats paths manually invoke fd_install() and so don't need to
grab a reference before creating the file.
Fixes: ce55c049459c ("KVM: stats: Support binary stats retrieval for a VCPU")
Fixes: fcfe1baeddbf ("KVM: stats: Support binary stats retrieval for a VM")
Reported-by: Zheng Zhang <zheng.zhang@email.ucr.edu>
Closes: https://lore.kernel.org/all/CAC_GQSr3xzZaeZt85k_RCBd5kfiOve8qXo7a81Cq53LuVQ5r=Q@mail.gmail.com
Cc: stable@vger.kernel.org
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Message-Id: <20230711230131.648752-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
To allow running rseq and KVM's rseq selftests as statically linked
binaries, initialize the various "trampoline" pointers to point directly
at the expect glibc symbols, and skip the dlysm() lookups if the rseq
size is non-zero, i.e. the binary is statically linked *and* the libc
registered its own rseq.
Define weak versions of the symbols so as not to break linking against
libc versions that don't support rseq in any capacity.
The KVM selftests in particular are often statically linked so that they
can be run on targets with very limited runtime environments, i.e. test
machines.
Fixes: 233e667e1ae3 ("selftests/rseq: Uplift rseq selftests for compatibility with glibc-2.35")
Cc: Aaron Lewis <aaronlewis@google.com>
Cc: kvm@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230721223352.2333911-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Now that handle_fastpath_set_msr_irqoff() acquires kvm->srcu, i.e. allows
dereferencing memslots during WRMSR emulation, drop the requirement that
"next RIP" is valid. In hindsight, acquiring kvm->srcu would have been a
better fix than avoiding the pastpath, but at the time it was thought that
accessing SRCU-protected data in the fastpath was a one-off edge case.
This reverts commit 5c30e8101e8d5d020b1d7119117889756a6ed713.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230721224337.2335137-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Temporarily acquire kvm->srcu for read when potentially emulating WRMSR in
the VM-Exit fastpath handler, as several of the common helpers used during
emulation expect the caller to provide SRCU protection. E.g. if the guest
is counting instructions retired, KVM will query the PMU event filter when
stepping over the WRMSR.
dump_stack+0x85/0xdf
lockdep_rcu_suspicious+0x109/0x120
pmc_event_is_allowed+0x165/0x170
kvm_pmu_trigger_event+0xa5/0x190
handle_fastpath_set_msr_irqoff+0xca/0x1e0
svm_vcpu_run+0x5c3/0x7b0 [kvm_amd]
vcpu_enter_guest+0x2108/0x2580
Alternatively, check_pmu_event_filter() could acquire kvm->srcu, but this
isn't the first bug of this nature, e.g. see commit 5c30e8101e8d ("KVM:
SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid"). Providing
protection for the entirety of WRMSR emulation will allow reverting the
aforementioned commit, and will avoid having to play whack-a-mole when new
uses of SRCU-protected structures are inevitably added in common emulation
helpers.
Fixes: dfdeda67ea2d ("KVM: x86/pmu: Prevent the PMU from counting disallowed events")
Reported-by: Greg Thelen <gthelen@google.com>
Reported-by: Aaron Lewis <aaronlewis@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230721224337.2335137-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Use vmread_error() to report VM-Fail on VMREAD for the "asm goto" case,
now that trampoline case has yet another wrapper around vmread_error() to
play nice with instrumentation.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230721235637.2345403-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Mark vmread_error_trampoline() as noinstr, and add a second trampoline
for the CONFIG_CC_HAS_ASM_GOTO_OUTPUT=n case to enable instrumentation
when handling VM-Fail on VMREAD. VMREAD is used in various noinstr
flows, e.g. immediately after VM-Exit, and objtool rightly complains that
the call to the error trampoline leaves a no-instrumentation section
without annotating that it's safe to do so.
vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0xc9:
call to vmread_error_trampoline() leaves .noinstr.text section
Note, strictly speaking, enabling instrumentation in the VM-Fail path
isn't exactly safe, but if VMREAD fails the kernel/system is likely hosed
anyways, and logging that there is a fatal error is more important than
*maybe* encountering slightly unsafe instrumentation.
Reported-by: Su Hui <suhui@nfschina.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230721235637.2345403-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
As was attempted commit 14717e203186 ("kvm: Conditionally register IRQ
bypass consumer"): "if we don't support a mechanism for bypassing IRQs,
don't register as a consumer. Initially this applied to AMD processors,
but when AVIC support was implemented for assigned devices,
kvm_arch_has_irq_bypass() was always returning true.
We can still skip registering the consumer where enable_apicv
or posted-interrupts capability is unsupported or globally disabled.
This eliminates meaningless dev_info()s when the connect fails
between producer and consumer", such as on Linux hosts where enable_apicv
or posted-interrupts capability is unsupported or globally disabled.
Cc: Alex Williamson <alex.williamson@redhat.com>
Reported-by: Yong He <alexyonghe@tencent.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217379
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20230724111236.76570-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The pid_table of ipiv is the persistent memory allocated by
per-vcpu, which should be counted into the memory cgroup.
Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Message-Id: <CAPm50aLxCQ3TQP2Lhc0PX3y00iTRg+mniLBqNDOC=t9CLxMwwA@mail.gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The code was blindly assuming that kvm_cpu_get_interrupt never returns -1
when there is a pending interrupt.
While this should be true, a bug in KVM can still cause this.
If -1 is returned, the code before this patch was converting it to 0xFF,
and 0xFF interrupt was injected to the guest, which results in an issue
which was hard to debug.
Add WARN_ON_ONCE to catch this case and skip the injection
if this happens again.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20230726135945.260841-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
When the APICv is inhibited, the irr_pending optimization is used.
Therefore, when kvm_apic_update_irr sets bits in the IRR,
it must set irr_pending to true as well.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20230726135945.260841-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
If APICv is inhibited, then IPIs from peer vCPUs are done by
atomically setting bits in IRR.
This means, that when __kvm_apic_update_irr copies PIR to IRR,
it has to modify IRR atomically as well.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20230726135945.260841-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Whelp, this is embarrassing. Since commit 082fdfd13841 ("KVM: arm64:
Prevent guests from enabling HA/HD on Ampere1") KVM traps writes to
TCR_EL1 on AmpereOne to work around an erratum in the unadvertised
HAFDBS implementation, preventing the guest from enabling the feature.
Unfortunately, I failed virtualization 101 when working on that change,
and forgot to advance PC after instruction emulation.
Do the right thing and skip the MSR instruction after emulating the
write.
Fixes: 082fdfd13841 ("KVM: arm64: Prevent guests from enabling HA/HD on Ampere1")
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230728000824.3848025-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
The kvm_host_psci_cpu_entry() function was renamed in order to add a wrapper around
it, but the prototype did not change, so now the missing-prototype warning came
back in W=1 builds:
arch/arm64/kvm/hyp/nvhe/psci-relay.c:203:28: error: no previous prototype for function '__kvm_host_psci_cpu_entry' [-Werror,-Wmissing-prototypes]
asmlinkage void __noreturn __kvm_host_psci_cpu_entry(bool is_cpu_on)
Fixes: dcf89d1111995 ("KVM: arm64: Add missing BTI instructions")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230724121850.1386668-1-arnd@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Ensure that SME traps are disabled for (h)VHE when getting the
reset value for the architectural feature control register.
Fixes: 75c76ab5a641 ("KVM: arm64: Rework CPTR_EL2 programming for HVHE configuration")
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230724123829.2929609-9-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Ensure that SVE traps are disabled for hVHE, if the FPSIMD state
isn't owned by the guest, when getting the reset value for the
architectural feature control register.
Fixes: 75c76ab5a641 ("KVM: arm64: Rework CPTR_EL2 programming for HVHE configuration")
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230724123829.2929609-8-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Instead of writing directly to cptr_el2, use the helper that
selects which feature trap register to write to based on the KVM
mode.
Fixes: 75c76ab5a641 ("KVM: arm64: Rework CPTR_EL2 programming for HVHE configuration")
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230724123829.2929609-7-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Factor out the code that decides whether to write to the feature
trap registers, CPTR_EL2 or CPACR_EL1, based on the KVM mode,
i.e., (h)VHE or nVHE.
This function will be used in the subsequent patch.
No functional change intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230724123829.2929609-6-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Ensure that SME traps are disabled for (h)VHE when setting up
EL2, as they are for nVHE.
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230724123829.2929609-5-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Use the architectural feature trap/control register that
corresponds to the current KVM mode, i.e., CPTR_EL2 or CPACR_EL1,
when setting up SVE feature traps.
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230724123829.2929609-4-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
The code for checking whether the kernel is in (h)VHE mode is
repeated, and will be needed again in future patches. Factor it
out in a macro.
No functional change intended.
No change in emitted assembly code intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/kvmarm/20230724123829.2929609-3-tabba@google.com/
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Swapping the ring buffer for snapshotting (for things like irqsoff)
can crash if the ring buffer is being resized. Disable swapping when
this happens. The missed swap will be reported to the tracer
- Report error if the histogram fails to be created due to an error in
adding a histogram variable, in event_hist_trigger_parse()
- Remove unused declaration of tracing_map_set_field_descr()
* tag 'trace-v6.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/histograms: Return an error if we fail to add histogram to hist_vars list
ring-buffer: Do not swap cpu_buffer during resize process
tracing: Remove unused extern declaration tracing_map_set_field_descr()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Fix stale help text in gconfig
- Support *.S files in compile_commands.json
- Flatten KBUILD_CFLAGS
- Fix external module builds with Rust so that temporary files are
created in the modules directories instead of the kernel tree
* tag 'kbuild-fixes-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: rust: avoid creating temporary files
kbuild: flatten KBUILD_CFLAGS
gen_compile_commands: add assembly files to compilation database
kconfig: gconfig: correct program name in help text
kconfig: gconfig: drop the Show Debug Info help text
|
|
`rustc` outputs by default the temporary files (i.e. the ones saved
by `-Csave-temps`, such as `*.rcgu*` files) in the current working
directory when `-o` and `--out-dir` are not given (even if
`--emit=x=path` is given, i.e. it does not use those for temporaries).
Since out-of-tree modules are compiled from the `linux` tree,
`rustc` then tries to create them there, which may not be accessible.
Thus pass `--out-dir` explicitly, even if it is just for the temporary
files.
Similarly, do so for Rust host programs too.
Reported-by: Raphael Nestler <raphael.nestler@gmail.com>
Closes: https://github.com/Rust-for-Linux/linux/issues/1015
Reported-by: Andrea Righi <andrea.righi@canonical.com>
Tested-by: Raphael Nestler <raphael.nestler@gmail.com> # non-hostprogs
Tested-by: Andrea Righi <andrea.righi@canonical.com> # non-hostprogs
Fixes: 295d8398c67e ("kbuild: specify output names separately for each emission type from rustc")
Cc: stable@vger.kernel.org
Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Tested-by: Martin Rodriguez Reboredo <yakoyoku@gmail.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Avoid pKVM finalization if KVM initialization fails
- Add missing BTI instructions in the hypervisor, fixing an early
boot failure on BTI systems
- Handle MMU notifiers correctly for non hugepage-aligned memslots
- Work around a bug in the architecture where hypervisor timer
controls have UNKNOWN behavior under nested virt
- Disable preemption in kvm_arch_hardware_enable(), fixing a kernel
BUG in cpu hotplug resulting from per-CPU accessor sanity checking
- Make WFI emulation on GICv4 systems robust w.r.t. preemption,
consistently requesting a doorbell interrupt on vcpu_put()
- Uphold RES0 sysreg behavior when emulating older PMU versions
- Avoid macro expansion when initializing PMU register names,
ensuring the tracepoints pretty-print the sysreg
s390:
- Two fixes for asynchronous destroy
x86 fixes will come early next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: s390: pv: fix index value of replaced ASCE
KVM: s390: pv: simplify shutdown and fix race
KVM: arm64: Fix the name of sys_reg_desc related to PMU
KVM: arm64: Correctly handle RES0 bits PMEVTYPER<n>_EL0.evtCount
KVM: arm64: vgic-v4: Make the doorbell request robust w.r.t preemption
KVM: arm64: Add missing BTI instructions
KVM: arm64: Correctly handle page aging notifiers for unaligned memslot
KVM: arm64: Disable preemption in kvm_arch_hardware_enable()
KVM: arm64: Handle kvm_arm_init failure correctly in finalize_pkvm
KVM: arm64: timers: Use CNTHCTL_EL2 when setting non-CNTKCTL_EL1 bits
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Bug and regression fixes for 6.5-rc3 for ext4's mballoc and jbd2's
checkpoint code"
* tag 'ext4_for_linus-6.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix rbtree traversal bug in ext4_mb_use_preallocated
ext4: fix off by one issue in ext4_mb_choose_next_group_best_avail()
ext4: correct inline offset when handling xattrs in inode body
jbd2: remove __journal_try_to_free_buffer()
jbd2: fix a race when checking checkpoint buffer busy
jbd2: Fix wrongly judgement for buffer head removing while doing checkpoint
jbd2: remove journal_clean_one_cp_list()
jbd2: remove t_checkpoint_io_list
jbd2: recheck chechpointing non-dirty buffer
|
|
git://git.samba.org/sfrench/cifs-2.6
Pull smb client fix from Steve French:
"Add minor debugging improvement.
The change improves ability to read a network trace to debug problems
on encrypted connections which are very common (e.g. using wireshark
or tcpdump).
That works today with tools like 'smbinfo keys /mnt/file' but requires
passing in a filename on the mount (see e.g. [1]), but it often makes
more sense to just pass in the mount point path (ie a directory not a
filename).
So this fix was needed to debug some types of problems (an obvious
example is on an encrypted connection failing operations on an empty
share or with no files in the root of the directory) - so you can
simply pass in the 'smbinfo keys <mntpoint>' and get the information
that wireshark needs"
Link: https://wiki.samba.org/index.php/Wireshark_Decryption [1]
* tag '6.5-rc2-smb3-client-fixes-ver2' of git://git.samba.org/sfrench/cifs-2.6:
cifs: update internal module version number for cifs.ko
cifs: allow dumping keys for directories too
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
Two fixes for asynchronous destroy
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 6.5, part #1
- Avoid pKVM finalization if KVM initialization fails
- Add missing BTI instructions in the hypervisor, fixing an early boot
failure on BTI systems
- Handle MMU notifiers correctly for non hugepage-aligned memslots
- Work around a bug in the architecture where hypervisor timer controls
have UNKNOWN behavior under nested virt.
- Disable preemption in kvm_arch_hardware_enable(), fixing a kernel BUG
in cpu hotplug resulting from per-CPU accessor sanity checking.
- Make WFI emulation on GICv4 systems robust w.r.t. preemption,
consistently requesting a doorbell interrupt on vcpu_put()
- Uphold RES0 sysreg behavior when emulating older PMU versions
- Avoid macro expansion when initializing PMU register names, ensuring
the tracepoints pretty-print the sysreg.
|
|
list
Commit 6018b585e8c6 ("tracing/histograms: Add histograms to hist_vars if
they have referenced variables") added a check to fail histogram creation
if save_hist_vars() failed to add histogram to hist_vars list. But the
commit failed to set ret to failed return code before jumping to
unregister histogram, fix it.
Link: https://lore.kernel.org/linux-trace-kernel/20230714203341.51396-1-mkhalfella@purestorage.com
Cc: stable@vger.kernel.org
Fixes: 6018b585e8c6 ("tracing/histograms: Add histograms to hist_vars if they have referenced variables")
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
When ring_buffer_swap_cpu was called during resize process,
the cpu buffer was swapped in the middle, resulting in incorrect state.
Continuing to run in the wrong state will result in oops.
This issue can be easily reproduced using the following two scripts:
/tmp # cat test1.sh
//#! /bin/sh
for i in `seq 0 100000`
do
echo 2000 > /sys/kernel/debug/tracing/buffer_size_kb
sleep 0.5
echo 5000 > /sys/kernel/debug/tracing/buffer_size_kb
sleep 0.5
done
/tmp # cat test2.sh
//#! /bin/sh
for i in `seq 0 100000`
do
echo irqsoff > /sys/kernel/debug/tracing/current_tracer
sleep 1
echo nop > /sys/kernel/debug/tracing/current_tracer
sleep 1
done
/tmp # ./test1.sh &
/tmp # ./test2.sh &
A typical oops log is as follows, sometimes with other different oops logs.
[ 231.711293] WARNING: CPU: 0 PID: 9 at kernel/trace/ring_buffer.c:2026 rb_update_pages+0x378/0x3f8
[ 231.713375] Modules linked in:
[ 231.714735] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G W 6.5.0-rc1-00276-g20edcec23f92 #15
[ 231.716750] Hardware name: linux,dummy-virt (DT)
[ 231.718152] Workqueue: events update_pages_handler
[ 231.719714] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 231.721171] pc : rb_update_pages+0x378/0x3f8
[ 231.722212] lr : rb_update_pages+0x25c/0x3f8
[ 231.723248] sp : ffff800082b9bd50
[ 231.724169] x29: ffff800082b9bd50 x28: ffff8000825f7000 x27: 0000000000000000
[ 231.726102] x26: 0000000000000001 x25: fffffffffffff010 x24: 0000000000000ff0
[ 231.728122] x23: ffff0000c3a0b600 x22: ffff0000c3a0b5c0 x21: fffffffffffffe0a
[ 231.730203] x20: ffff0000c3a0b600 x19: ffff0000c0102400 x18: 0000000000000000
[ 231.732329] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffffe7aa8510
[ 231.734212] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000002
[ 231.736291] x11: ffff8000826998a8 x10: ffff800082b9baf0 x9 : ffff800081137558
[ 231.738195] x8 : fffffc00030e82c8 x7 : 0000000000000000 x6 : 0000000000000001
[ 231.740192] x5 : ffff0000ffbafe00 x4 : 0000000000000000 x3 : 0000000000000000
[ 231.742118] x2 : 00000000000006aa x1 : 0000000000000001 x0 : ffff0000c0007208
[ 231.744196] Call trace:
[ 231.744892] rb_update_pages+0x378/0x3f8
[ 231.745893] update_pages_handler+0x1c/0x38
[ 231.746893] process_one_work+0x1f0/0x468
[ 231.747852] worker_thread+0x54/0x410
[ 231.748737] kthread+0x124/0x138
[ 231.749549] ret_from_fork+0x10/0x20
[ 231.750434] ---[ end trace 0000000000000000 ]---
[ 233.720486] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 233.721696] Mem abort info:
[ 233.721935] ESR = 0x0000000096000004
[ 233.722283] EC = 0x25: DABT (current EL), IL = 32 bits
[ 233.722596] SET = 0, FnV = 0
[ 233.722805] EA = 0, S1PTW = 0
[ 233.723026] FSC = 0x04: level 0 translation fault
[ 233.723458] Data abort info:
[ 233.723734] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[ 233.724176] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 233.724589] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 233.725075] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000104943000
[ 233.725592] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[ 233.726231] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[ 233.726720] Modules linked in:
[ 233.727007] CPU: 0 PID: 9 Comm: kworker/0:1 Tainted: G W 6.5.0-rc1-00276-g20edcec23f92 #15
[ 233.727777] Hardware name: linux,dummy-virt (DT)
[ 233.728225] Workqueue: events update_pages_handler
[ 233.728655] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 233.729054] pc : rb_update_pages+0x1a8/0x3f8
[ 233.729334] lr : rb_update_pages+0x154/0x3f8
[ 233.729592] sp : ffff800082b9bd50
[ 233.729792] x29: ffff800082b9bd50 x28: ffff8000825f7000 x27: 0000000000000000
[ 233.730220] x26: 0000000000000000 x25: ffff800082a8b840 x24: ffff0000c0102418
[ 233.730653] x23: 0000000000000000 x22: fffffc000304c880 x21: 0000000000000003
[ 233.731105] x20: 00000000000001f4 x19: ffff0000c0102400 x18: ffff800082fcbc58
[ 233.731727] x17: 0000000000000000 x16: 0000000000000001 x15: 0000000000000001
[ 233.732282] x14: ffff8000825fe0c8 x13: 0000000000000001 x12: 0000000000000000
[ 233.732709] x11: ffff8000826998a8 x10: 0000000000000ae0 x9 : ffff8000801b760c
[ 233.733148] x8 : fefefefefefefeff x7 : 0000000000000018 x6 : ffff0000c03298c0
[ 233.733553] x5 : 0000000000000002 x4 : 0000000000000000 x3 : 0000000000000000
[ 233.733972] x2 : ffff0000c3a0b600 x1 : 0000000000000000 x0 : 0000000000000000
[ 233.734418] Call trace:
[ 233.734593] rb_update_pages+0x1a8/0x3f8
[ 233.734853] update_pages_handler+0x1c/0x38
[ 233.735148] process_one_work+0x1f0/0x468
[ 233.735525] worker_thread+0x54/0x410
[ 233.735852] kthread+0x124/0x138
[ 233.736064] ret_from_fork+0x10/0x20
[ 233.736387] Code: 92400000 910006b5 aa000021 aa0303f7 (f9400060)
[ 233.736959] ---[ end trace 0000000000000000 ]---
After analysis, the seq of the error is as follows [1-5]:
int ring_buffer_resize(struct trace_buffer *buffer, unsigned long size,
int cpu_id)
{
for_each_buffer_cpu(buffer, cpu) {
cpu_buffer = buffer->buffers[cpu];
//1. get cpu_buffer, aka cpu_buffer(A)
...
...
schedule_work_on(cpu,
&cpu_buffer->update_pages_work);
//2. 'update_pages_work' is queue on 'cpu', cpu_buffer(A) is passed to
// update_pages_handler, do the update process, set 'update_done' in
// complete(&cpu_buffer->update_done) and to wakeup resize process.
//---->
//3. Just at this moment, ring_buffer_swap_cpu is triggered,
//cpu_buffer(A) be swaped to cpu_buffer(B), the max_buffer.
//ring_buffer_swap_cpu is called as the 'Call trace' below.
Call trace:
dump_backtrace+0x0/0x2f8
show_stack+0x18/0x28
dump_stack+0x12c/0x188
ring_buffer_swap_cpu+0x2f8/0x328
update_max_tr_single+0x180/0x210
check_critical_timing+0x2b4/0x2c8
tracer_hardirqs_on+0x1c0/0x200
trace_hardirqs_on+0xec/0x378
el0_svc_common+0x64/0x260
do_el0_svc+0x90/0xf8
el0_svc+0x20/0x30
el0_sync_handler+0xb0/0xb8
el0_sync+0x180/0x1c0
//<----
/* wait for all the updates to complete */
for_each_buffer_cpu(buffer, cpu) {
cpu_buffer = buffer->buffers[cpu];
//4. get cpu_buffer, cpu_buffer(B) is used in the following process,
//the state of cpu_buffer(A) and cpu_buffer(B) is totally wrong.
//for example, cpu_buffer(A)->update_done will leave be set 1, and will
//not 'wait_for_completion' at the next resize round.
if (!cpu_buffer->nr_pages_to_update)
continue;
if (cpu_online(cpu))
wait_for_completion(&cpu_buffer->update_done);
cpu_buffer->nr_pages_to_update = 0;
}
...
}
//5. the state of cpu_buffer(A) and cpu_buffer(B) is totally wrong,
//Continuing to run in the wrong state, then oops occurs.
Link: https://lore.kernel.org/linux-trace-kernel/202307191558478409990@zte.com.cn
Signed-off-by: Chen Lin <chen.lin5@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Since commit 08d43a5fa063 ("tracing: Add lock-free tracing_map"),
this is never used, so can be removed.
Link: https://lore.kernel.org/linux-trace-kernel/20230722032123.24664-1-yuehaibing@huawei.com
Cc: <mhiramat@kernel.org>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Make it slightly easier to see which compiler options are added and
removed (and not worry about column limit too!).
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Nicolas Schier <n.schier@avm.de>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
Like C source files, tooling can find it useful to have the assembly
source file compilation recorded.
The .S extension appears to used across all architectures.
Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
Reviewed-by: Fangrui Song <maskray@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
|
|
During allocations, while looking for preallocations(PA) in the per
inode rbtree, we can't do a direct traversal of the tree because
ext4_mb_discard_group_preallocation() can paralelly mark the pa deleted
and that can cause direct traversal to skip some entries. This was
leading to a BUG_ON() being hit [1] when we missed a PA that could satisfy
our request and ultimately tried to create a new PA that would overlap
with the missed one.
To makes sure we handle that case while still keeping the performance of
the rbtree, we make use of the fact that the only pa that could possibly
overlap the original goal start is the one that satisfies the below
conditions:
1. It must have it's logical start immediately to the left of
(ie less than) original logical start.
2. It must not be deleted
To find this pa we use the following traversal method:
1. Descend into the rbtree normally to find the immediate neighboring
PA. Here we keep descending irrespective of if the PA is deleted or if
it overlaps with our request etc. The goal is to find an immediately
adjacent PA.
2. If the found PA is on right of original goal, use rb_prev() to find
the left adjacent PA.
3. Check if this PA is deleted and keep moving left with rb_prev() until
a non deleted PA is found.
4. This is the PA we are looking for. Now we can check if it can satisfy
the original request and proceed accordingly.
This approach also takes care of having deleted PAs in the tree.
(While we are at it, also fix a possible overflow bug in calculating the
end of a PA)
[1] https://lore.kernel.org/linux-ext4/CA+G9fYv2FRpLqBZf34ZinR8bU2_ZRAUOjKAD3+tKRFaEQHtt8Q@mail.gmail.com/
Cc: stable@kernel.org # 6.4
Fixes: 3872778664e3 ("ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list")
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reviewed-by: Ritesh Harjani (IBM) ritesh.list@gmail.com
Tested-by: Ritesh Harjani (IBM) ritesh.list@gmail.com
Link: https://lore.kernel.org/r/edd2efda6a83e6343c5ace9deea44813e71dbe20.1690045963.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|