Age | Commit message (Collapse) | Author |
|
Clean up the existing export namespace code along the same lines of
commit 33def8498fdd ("treewide: Convert macro and uses of __section(foo)
to __section("foo")") and for the same reason, it is not desired for the
namespace argument to be a macro expansion itself.
Scripted using
git grep -l -e MODULE_IMPORT_NS -e EXPORT_SYMBOL_NS | while read file;
do
awk -i inplace '
/^#define EXPORT_SYMBOL_NS/ {
gsub(/__stringify\(ns\)/, "ns");
print;
next;
}
/^#define MODULE_IMPORT_NS/ {
gsub(/__stringify\(ns\)/, "ns");
print;
next;
}
/MODULE_IMPORT_NS/ {
$0 = gensub(/MODULE_IMPORT_NS\(([^)]*)\)/, "MODULE_IMPORT_NS(\"\\1\")", "g");
}
/EXPORT_SYMBOL_NS/ {
if ($0 ~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+),/) {
if ($0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/ &&
$0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(\)/ &&
$0 !~ /^my/) {
getline line;
gsub(/[[:space:]]*\\$/, "");
gsub(/[[:space:]]/, "", line);
$0 = $0 " " line;
}
$0 = gensub(/(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/,
"\\1(\\2, \"\\3\")", "g");
}
}
{ print }' $file;
done
Requested-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://mail.google.com/mail/u/2/#inbox/FMfcgzQXKWgMmjdFwwdsfgxzKpVHWPlc
Acked-by: Greg KH <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The continual trickle of small conversion patches is grating on me, and
is really not helping. Just get rid of the 'remove_new' member
function, which is just an alias for the plain 'remove', and had a
comment to that effect:
/*
* .remove_new() is a relic from a prototype conversion of .remove().
* New drivers are supposed to implement .remove(). Once all drivers are
* converted to not use .remove_new any more, it will be dropped.
*/
This was just a tree-wide 'sed' script that replaced '.remove_new' with
'.remove', with some care taken to turn a subsequent tab into two tabs
to make things line up.
I did do some minimal manual whitespace adjustment for places that used
spaces to line things up.
Then I just removed the old (sic) .remove_new member function, and this
is the end result. No more unnecessary conversion noise.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull VFIO updates from Alex Williamson:
- Constify an unmodified structure used in linking vfio and kvm
(Christophe JAILLET)
- Add ID for an additional hardware SKU supported by the nvgrace-gpu
vfio-pci variant driver (Ankit Agrawal)
- Fix incorrect signed cast in QAT vfio-pci variant driver, negating
test in check_add_overflow(), though still caught by later tests
(Giovanni Cabiddu)
- Additional debugfs attributes exposed in hisi_acc vfio-pci variant
driver for migration debugging (Longfang Liu)
- Migration support is added to the virtio vfio-pci variant driver,
becoming the primary feature of the driver while retaining emulation
of virtio legacy support as a secondary option (Yishai Hadas)
- Fixes to a few unwind flows in the mlx5 vfio-pci driver discovered
through reviews of the virtio variant driver (Yishai Hadas)
- Fix an unlikely issue where a PCI device exposed to userspace with an
unknown capability at the base of the extended capability chain can
overflow an array index (Avihai Horon)
* tag 'vfio-v6.13-rc1' of https://github.com/awilliam/linux-vfio:
vfio/pci: Properly hide first-in-list PCIe extended capability
vfio/mlx5: Fix unwind flows in mlx5vf_pci_save/resume_device_data()
vfio/mlx5: Fix an unwind issue in mlx5vf_add_migration_pages()
vfio/virtio: Enable live migration once VIRTIO_PCI was configured
vfio/virtio: Add PRE_COPY support for live migration
vfio/virtio: Add support for the basic live migration functionality
virtio-pci: Introduce APIs to execute device parts admin commands
virtio: Manage device and driver capabilities via the admin commands
virtio: Extend the admin command to include the result size
virtio_pci: Introduce device parts access commands
Documentation: add debugfs description for hisi migration
hisi_acc_vfio_pci: register debugfs for hisilicon migration driver
hisi_acc_vfio_pci: create subfunction for data reading
hisi_acc_vfio_pci: extract public functions for container_of
vfio/qat: fix overflow check in qat_vf_resume_write()
vfio/nvgrace-gpu: Add a new GH200 SKU to the devid table
kvm/vfio: Constify struct kvm_device_ops
|
|
There are cases where a PCIe extended capability should be hidden from
the user. For example, an unknown capability (i.e., capability with ID
greater than PCI_EXT_CAP_ID_MAX) or a capability that is intentionally
chosen to be hidden from the user.
Hiding a capability is done by virtualizing and modifying the 'Next
Capability Offset' field of the previous capability so it points to the
capability after the one that should be hidden.
The special case where the first capability in the list should be hidden
is handled differently because there is no previous capability that can
be modified. In this case, the capability ID and version are zeroed
while leaving the next pointer intact. This hides the capability and
leaves an anchor for the rest of the capability list.
However, today, hiding the first capability in the list is not done
properly if the capability is unknown, as struct
vfio_pci_core_device->pci_config_map is set to the capability ID during
initialization but the capability ID is not properly checked later when
used in vfio_config_do_rw(). This leads to the following warning [1] and
to an out-of-bounds access to ecap_perms array.
Fix it by checking cap_id in vfio_config_do_rw(), and if it is greater
than PCI_EXT_CAP_ID_MAX, use an alternative struct perm_bits for direct
read only access instead of the ecap_perms array.
Note that this is safe since the above is the only case where cap_id can
exceed PCI_EXT_CAP_ID_MAX (except for the special capabilities, which
are already checked before).
[1]
WARNING: CPU: 118 PID: 5329 at drivers/vfio/pci/vfio_pci_config.c:1900 vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
CPU: 118 UID: 0 PID: 5329 Comm: simx-qemu-syste Not tainted 6.12.0+ #1
(snip)
Call Trace:
<TASK>
? show_regs+0x69/0x80
? __warn+0x8d/0x140
? vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
? report_bug+0x18f/0x1a0
? handle_bug+0x63/0xa0
? exc_invalid_op+0x19/0x70
? asm_exc_invalid_op+0x1b/0x20
? vfio_pci_config_rw+0x395/0x430 [vfio_pci_core]
? vfio_pci_config_rw+0x244/0x430 [vfio_pci_core]
vfio_pci_rw+0x101/0x1b0 [vfio_pci_core]
vfio_pci_core_read+0x1d/0x30 [vfio_pci_core]
vfio_device_fops_read+0x27/0x40 [vfio]
vfs_read+0xbd/0x340
? vfio_device_fops_unl_ioctl+0xbb/0x740 [vfio]
? __rseq_handle_notify_resume+0xa4/0x4b0
__x64_sys_pread64+0x96/0xc0
x64_sys_call+0x1c3d/0x20d0
do_syscall_64+0x4d/0x120
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver")
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20241124142739.21698-1-avihaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
"Several new features and uAPI for iommufd:
- IOMMU_IOAS_MAP_FILE allows passing in a file descriptor as the
backing memory for an iommu mapping. To date VFIO/iommufd have used
VMA's and pin_user_pages(), this now allows using memfds and
memfd_pin_folios(). Notably this creates a pure folio path from the
memfd to the iommu page table where memory is never broken down to
PAGE_SIZE.
- IOMMU_IOAS_CHANGE_PROCESS moves the pinned page accounting between
two processes. Combined with the above this allows iommufd to
support a VMM re-start using exec() where something like qemu would
exec() a new version of itself and fd pass the memfds/iommufd/etc
to the new process. The memfd allows DMA access to the memory to
continue while the new process is getting setup, and the
CHANGE_PROCESS updates all the accounting.
- Support for fault reporting to userspace on non-PRI HW, such as ARM
stall-mode embedded devices.
- IOMMU_VIOMMU_ALLOC introduces the concept of a HW/driver backed
virtual iommu. This will be used by VMMs to access hardware
features that are contained with in a VM. The first use is to
inform the kernel of the virtual SID to physical SID mapping when
issuing SID based invalidation on ARM. Further uses will tie HW
features that are directly accessed by the VM, such as invalidation
queue assignment and others.
- IOMMU_VDEVICE_ALLOC informs the kernel about the mapping of virtual
device to physical device within a VIOMMU. Minimially this is used
to translate VM issued cache invalidation commands from virtual to
physical device IDs.
- Enhancements to IOMMU_HWPT_INVALIDATE and IOMMU_HWPT_ALLOC to work
with the VIOMMU
- ARM SMMuv3 support for nested translation. Using the VIOMMU and
VDEVICE the driver can model this HW's behavior for nested
translation. This includes a shared branch from Will"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (51 commits)
iommu/arm-smmu-v3: Import IOMMUFD module namespace
iommufd: IOMMU_IOAS_CHANGE_PROCESS selftest
iommufd: Add IOMMU_IOAS_CHANGE_PROCESS
iommufd: Lock all IOAS objects
iommufd: Export do_update_pinned
iommu/arm-smmu-v3: Support IOMMU_HWPT_INVALIDATE using a VIOMMU object
iommu/arm-smmu-v3: Allow ATS for IOMMU_DOMAIN_NESTED
iommu/arm-smmu-v3: Use S2FWB for NESTED domains
iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
iommu/arm-smmu-v3: Support IOMMU_VIOMMU_ALLOC
Documentation: userspace-api: iommufd: Update vDEVICE
iommufd/selftest: Add vIOMMU coverage for IOMMU_HWPT_INVALIDATE ioctl
iommufd/selftest: Add IOMMU_TEST_OP_DEV_CHECK_CACHE test command
iommufd/selftest: Add mock_viommu_cache_invalidate
iommufd/viommu: Add iommufd_viommu_find_dev helper
iommu: Add iommu_copy_struct_from_full_user_array helper
iommufd: Allow hwpt_id to carry viommu_id for IOMMU_HWPT_INVALIDATE
iommu/viommu: Add cache_invalidate to iommufd_viommu_ops
iommufd/selftest: Add IOMMU_VDEVICE_ALLOC test coverage
iommufd/viommu: Add IOMMUFD_OBJ_VDEVICE and IOMMU_VDEVICE_ALLOC ioctl
...
|
|
Fix unwind flows in mlx5vf_pci_save_device_data() and
mlx5vf_pci_resume_device_data() to avoid freeing the migf pointer at the
'end' label, as this will be handled by fput(migf->filp) through
mlx5vf_release_file().
To ensure mlx5vf_release_file() functions correctly, move the
initialization of migf fields (such as migf->lock) to occur before any
potential unwind flow, as these fields may be accessed within
mlx5vf_release_file().
Fixes: 9945a67ea4b3 ("vfio/mlx5: Refactor PD usage")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241114095318.16556-3-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Fix an unwind issue in mlx5vf_add_migration_pages().
If a set of pages is allocated but fails to be added to the SG table,
they need to be freed to prevent a memory leak.
Any pages successfully added to the SG table will be freed as part of
mlx5vf_free_data_buffer().
Fixes: 6fadb021266d ("vfio/mlx5: Implement vfio_pci driver for mlx5 devices")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20241114095318.16556-2-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Now that the driver supports live migration, only the legacy IO
functionality depends on config VIRTIO_PCI_ADMIN_LEGACY.
As part of that we introduce a bool configuration option as a sub menu
under the driver's main live migration feature named
VIRTIO_VFIO_PCI_ADMIN_LEGACY, to control the legacy IO functionality.
This will let users configuring the kernel, know which features from the
description might be available in the resulting driver.
As of that, move the legacy IO into a separate file to be compiled only
once CONFIG_VIRTIO_VFIO_PCI_ADMIN_LEGACY was configured and let the live
migration depends only on VIRTIO_PCI.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20241113115200.209269-8-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Add PRE_COPY support for live migration.
This functionality may reduce the downtime upon STOP_COPY as of letting
the target machine to get some 'initial data' from the source once the
machine is still in its RUNNING state and let it prepares itself
pre-ahead to get the final STOP_COPY data.
As the Virtio specification does not support reading partial or
incremental device contexts. This means that during the PRE_COPY state,
the vfio-virtio driver reads the full device state.
As the device state can be changed and the benefit is highest when the
pre copy data closely matches the final data we read it in a rate
limiter mode.
This means we avoid reading new data from the device for a specified
time interval after the last read.
With PRE_COPY enabled, we observed a downtime reduction of approximately
70-75% in various scenarios compared to when PRE_COPY was disabled,
while keeping the total migration time nearly the same.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20241113115200.209269-7-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Add support for basic live migration functionality in VFIO over
virtio-net devices, aligned with the virtio device specification 1.4.
This includes the following VFIO features:
VFIO_MIGRATION_STOP_COPY, VFIO_MIGRATION_P2P.
The implementation registers with the VFIO subsystem using vfio_pci_core
and then incorporates the virtio-specific logic for the migration
process.
The migration follows the definitions in uapi/vfio.h and leverages the
virtio VF-to-PF admin queue command channel for execution device parts
related commands.
Additional Notes:
-----------------
The kernel protocol between the source and target devices contains a
header with metadata, including record size, tag, and flags.
The record size allows the target to recognize and read a complete image
from the source before passing the device part data. This adheres to the
virtio device specification, which mandates that partial device parts
cannot be supplied.
The tag and flags serve as placeholders for future extensions of the
kernel protocol between the source and target, ensuring backward and
forward compatibility.
Both the source and target comply with the virtio device specification
by using a device part object with a unique ID as part of the migration
process. Since this resource is limited to a maximum of 255, its
lifecycle is confined to periods with an active live migration flow.
According to the virtio specification, a device has only two modes:
RUNNING and STOPPED. As a result, certain VFIO transitions (i.e.,
RUNNING_P2P->STOP, STOP->RUNNING_P2P) are treated as no-ops. When
transitioning to RUNNING_P2P, the device state is set to STOP, and it
will remain STOPPED until the transition out of RUNNING_P2P->RUNNING, at
which point it returns to RUNNING. During transition to STOP, the virtio
device only stops initiating outgoing requests(e.g. DMA, MSIx, etc.) but
still must accept incoming operations.
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://lore.kernel.org/r/20241113115200.209269-6-yishaih@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
On the debugfs framework of VFIO, if the CONFIG_VFIO_DEBUGFS macro is
enabled, the debug function is registered for the live migration driver
of the HiSilicon accelerator device.
After registering the HiSilicon accelerator device on the debugfs
framework of live migration of vfio, a directory file "hisi_acc"
of debugfs is created, and then three debug function files are
created in this directory:
vfio
|
+---<dev_name1>
| +---migration
| +--state
| +--hisi_acc
| +--dev_data
| +--migf_data
| +--cmd_state
|
+---<dev_name2>
+---migration
+--state
+--hisi_acc
+--dev_data
+--migf_data
+--cmd_state
dev_data file: read device data that needs to be migrated from the
current device in real time
migf_data file: read the migration data of the last live migration
from the current driver.
cmd_state: used to get the cmd channel state for the device.
+----------------+ +--------------+ +---------------+
| migration dev | | src dev | | dst dev |
+-------+--------+ +------+-------+ +-------+-------+
| | |
| +------v-------+ +-------v-------+
| | saving_migf | | resuming_migf |
read | | file | | file |
| +------+-------+ +-------+-------+
| | copy |
| +------------+----------+
| |
+-------v--------+ +-------v--------+
| data buffer | | debug_migf |
+-------+--------+ +-------+--------+
| |
cat | cat |
+-------v--------+ +-------v--------+
| dev_data | | migf_data |
+----------------+ +----------------+
When accessing debugfs, user can obtain the most recent status data
of the device through the "dev_data" file. It can read recent
complete status data of the device. If the current device is being
migrated, it will wait for it to complete.
The data for the last completed migration function will be stored
in debug_migf. Users can read it via "migf_data".
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20241112073322.54550-4-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
This patch generates the code for the operation of reading data from
the device into a sub-function.
Then, it can be called during the device status data saving phase of
the live migration process and the device status data reading function
in debugfs.
Thereby reducing the redundant code of the driver.
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20241112073322.54550-3-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
In the current driver, vdev is obtained from struct
hisi_acc_vf_core_device through the container_of function.
This method is used in many places in the driver. In order to
reduce this repetitive operation, It was extracted into
a public function.
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Link: https://lore.kernel.org/r/20241112073322.54550-2-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
This control causes the ARM SMMU drivers to choose a stage 2
implementation for the IO pagetable (vs the stage 1 usual default),
however this choice has no significant visible impact to the VFIO
user. Further qemu never implemented this and no other userspace user is
known.
The original description in commit f5c9ecebaf2a ("vfio/iommu_type1: add
new VFIO_TYPE1_NESTING_IOMMU IOMMU type") suggested this was to "provide
SMMU translation services to the guest operating system" however the rest
of the API to set the guest table pointer for the stage 1 and manage
invalidation was never completed, or at least never upstreamed, rendering
this part useless dead code.
Upstream has now settled on iommufd as the uAPI for controlling nested
translation. Choosing the stage 2 implementation should be done by through
the IOMMU_HWPT_ALLOC_NEST_PARENT flag during domain allocation.
Remove VFIO_TYPE1_NESTING_IOMMU and everything under it including the
enable_nesting iommu_domain_op.
Just in-case there is some userspace using this continue to treat
requesting it as a NOP, but do not advertise support any more.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Reviewed-by: Donald Dutile <ddutile@redhat.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v4-9e99b76f3518+3a8-smmuv3_nesting_jgg@nvidia.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
in all of those failure exits prior to fdget() are plain returns and
the only thing done after fdput() is (on failure exits) a kfree(),
which can be done before fdput() just fine.
NOTE: in acrn_irqfd_assign() 'fail:' failure exit is wrong for
eventfd_ctx_fileget() failure (we only want fdput() there) and once
we stop doing that, it doesn't need to check if eventfd is NULL or
ERR_PTR(...) there.
NOTE: in privcmd we move fdget() up before the allocation - more
to the point, before the copy_from_user() attempt.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
all failure exits prior to fdget() leave the scope, all matching fdput()
are immediately followed by leaving the scope.
[xfs_ioc_commit_range() chunk moved here as well]
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
The unsigned variable `size_t len` is cast to the signed type `loff_t`
when passed to the function check_add_overflow(). This function considers
the type of the destination, which is of type loff_t (signed),
potentially leading to an overflow. This issue is similar to the one
described in the link below.
Remove the cast.
Note that even if check_add_overflow() is bypassed, by setting `len` to
a value that is greater than LONG_MAX (which is considered as a negative
value after the cast), the function copy_from_user(), invoked a few lines
later, will not perform any copy and return `len` as (len > INT_MAX)
causing qat_vf_resume_write() to fail with -EFAULT.
Fixes: bb208810b1ab ("vfio/qat: Add vfio_pci driver for Intel QAT SR-IOV VF devices")
CC: stable@vger.kernel.org # 6.10+
Link: https://lore.kernel.org/all/138bd2e2-ede8-4bcc-aa7b-f3d9de167a37@moroto.mountain
Reported-by: Zijie Zhao <zzjas98@gmail.com>
Signed-off-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Reviewed-by: Xin Zeng <xin.zeng@intel.com>
Link: https://lore.kernel.org/r/20241021123843.42979-1-giovanni.cabiddu@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
NVIDIA is planning to productize a new Grace Hopper superchip
SKU with device ID 0x2348.
Add the SKU devid to nvgrace_gpu_vfio_pci_table.
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20241013075216.19229-1-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
no_llseek had been defined to NULL two years ago, in commit 868941b14441
("fs: remove no_llseek")
To quote that commit,
At -rc1 we'll need do a mechanical removal of no_llseek -
git grep -l -w no_llseek | grep -v porting.rst | while read i; do
sed -i '/\<no_llseek\>/d' $i
done
would do it.
Unfortunately, that hadn't been done. Linus, could you do that now, so
that we could finally put that thing to rest? All instances are of the
form
.llseek = no_llseek,
so it's obviously safe.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull VFIO updates from Alex Williamson:
"Just a few cleanups this cycle:
- Remove several unused structure and function declarations, and
unused variables (Dr. David Alan Gilbert, Yue Haibing, Zhang Zekun)
- Constify unmodified structure in mdev (Hongbo Li)
- Convert to unsigned type to catch overflow with less fanfare than
passing a negative value to kcalloc() (Dan Carpenter)"
* tag 'vfio-v6.12-rc1' of https://github.com/awilliam/linux-vfio:
vfio/pci: clean up a type in vfio_pci_ioctl_pci_hot_reset_groups()
vfio/mdev: Constify struct kobj_type
vfio: mdev: Remove unused function declarations
vfio/fsl-mc: Remove unused variable 'hwirq'
vfio/pci: Remove unused struct 'vfio_pci_mmap_vma'
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull 'struct fd' updates from Al Viro:
"Just the 'struct fd' layout change, with conversion to accessor
helpers"
* tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
add struct fd constructors, get rid of __to_fd()
struct fd: representation change
introduce fd_file(), convert all accessors to it.
|
|
With the addition of pfnmap support in vmf_insert_pfn_{pmd,pud}() we can
take advantage of PMD and PUD faults to PCI BAR mmaps and create more
efficient mappings. PCI BARs are always a power of two and will typically
get at least PMD alignment without userspace even trying. Userspace
alignment for PUD mappings is also not too difficult.
Consolidate faults through a single handler with a new wrapper for
standard single page faults. The pre-faulting behavior of commit
d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault") is removed
in this refactoring since huge_fault will cover the bulk of the faults and
results in more efficient page table usage. We also want to avoid that
pre-faulted single page mappings preempt huge page mappings.
Link: https://lkml.kernel.org/r/20240826204353.2228736-20-peterx@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Use the new API that can understand huge pfn mappings.
Link: https://lkml.kernel.org/r/20240826204353.2228736-14-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The "array_count" value comes from the copy_from_user() in
vfio_pci_ioctl_pci_hot_reset(). If the user passes a value larger than
INT_MAX then we'll pass a negative value to kcalloc() which triggers an
allocation failure and a stack trace.
It's better to make the type unsigned so that if (array_count > count)
returns -EINVAL instead.
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/262ada03-d848-4369-9c37-81edeeed2da2@stanley.mountain
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
This 'struct kobj_type' is not modified. It is only used in
kobject_init_and_add() which takes a 'const struct kobj_type *ktype'
parameter.
Constifying this structure and moving it to a read-only section,
and this can increase over all security.
```
[Before]
text data bss dec hex filename
2372 600 0 2972 b9c drivers/vfio/mdev/mdev_sysfs.o
[After]
text data bss dec hex filename
2436 568 0 3004 bbc drivers/vfio/mdev/mdev_sysfs.o
```
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20240904011837.2010444-1-lihongbo22@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
The definition of mdev_bus_register() and mdev_bus_unregister() have been
removed since commit 6c7f98b334a3 ("vfio/mdev: Remove vfio_mdev.c"). So,
let's remove the unused declarations.
Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240812120823.10968-1-zhangzekun11@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Commit 7447d911af69 ("vfio/fsl-mc: Block calling interrupt handler without trigger")
left this variable unused, so remove it.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20240730141133.525771-1-yuehaibing@huawei.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
'vfio_pci_mmap_vma' has been unused since
commit aac6db75a9fc ("vfio/pci: Use unmap_mapping_range()")
Remove it.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240727160307.1000476-1-linux@treblig.org
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
For any changes of struct fd representation we need to
turn existing accesses to fields into calls of wrappers.
Accesses to struct fd::flags are very few (3 in linux/file.h,
1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in
explicit initializers).
Those can be dealt with in the commit converting to
new layout; accesses to struct fd::file are too many for that.
This commit converts (almost) all of f.file to
fd_file(f). It's not entirely mechanical ('file' is used as
a member name more than just in struct fd) and it does not
even attempt to distinguish the uses in pointer context from
those in boolean context; the latter will be eventually turned
into a separate helper (fd_empty()).
NOTE: mass conversion to fd_empty(), tempting as it
might be, is a bad idea; better do that piecewise in commit
that convert from fdget...() to CLASS(...).
[conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c
caught by git; fs/stat.c one got caught by git grep]
[fs/xattr.c conflict]
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the big set of driver core changes for 6.11-rc1.
Lots of stuff in here, with not a huge diffstat, but apis are evolving
which required lots of files to be touched. Highlights of the changes
in here are:
- platform remove callback api final fixups (Uwe took many releases
to get here, finally!)
- Rust bindings for basic firmware apis and initial driver-core
interactions.
It's not all that useful for a "write a whole driver in rust" type
of thing, but the firmware bindings do help out the phy rust
drivers, and the driver core bindings give a solid base on which
others can start their work.
There is still a long way to go here before we have a multitude of
rust drivers being added, but it's a great first step.
- driver core const api changes.
This reached across all bus types, and there are some fix-ups for
some not-common bus types that linux-next and 0-day testing shook
out.
This work is being done to help make the rust bindings more safe,
as well as the C code, moving toward the end-goal of allowing us to
put driver structures into read-only memory. We aren't there yet,
but are getting closer.
- minor devres cleanups and fixes found by code inspection
- arch_topology minor changes
- other minor driver core cleanups
All of these have been in linux-next for a very long time with no
reported problems"
* tag 'driver-core-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (55 commits)
ARM: sa1100: make match function take a const pointer
sysfs/cpu: Make crash_hotplug attribute world-readable
dio: Have dio_bus_match() callback take a const *
zorro: make match function take a const pointer
driver core: module: make module_[add|remove]_driver take a const *
driver core: make driver_find_device() take a const *
driver core: make driver_[create|remove]_file take a const *
firmware_loader: fix soundness issue in `request_internal`
firmware_loader: annotate doctests as `no_run`
devres: Correct code style for functions that return a pointer type
devres: Initialize an uninitialized struct member
devres: Fix memory leakage caused by driver API devm_free_percpu()
devres: Fix devm_krealloc() wasting memory
driver core: platform: Switch to use kmemdup_array()
driver core: have match() callback in struct bus_type take a const *
MAINTAINERS: add Rust device abstractions to DRIVER CORE
device: rust: improve safety comments
MAINTAINERS: add Danilo as FIRMWARE LOADER maintainer
MAINTAINERS: add Rust FW abstractions to FIRMWARE LOADER
firmware: rust: improve safety comments
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Remove support for 40x CPUs & platforms
- Add support to the 64-bit BPF JIT for cpu v4 instructions
- Fix PCI hotplug driver crash on powernv
- Fix doorbell emulation for KVM on PAPR guests (nestedv2)
- Fix KVM nested guest handling of some less used SPRs
- Online NUMA nodes with no CPU/memory if they have a PCI device
attached
- Reduce memory overhead of enabling kfence on 64-bit Radix MMU kernels
- Reimplement the iommu table_group_ops for pseries for VFIO SPAPR TCE
Thanks to: Anjali K, Artem Savkov, Athira Rajeev, Breno Leitao, Brian
King, Celeste Liu, Christophe Leroy, Esben Haabendal, Gaurav Batra,
Gautam Menghani, Haren Myneni, Hari Bathini, Jeff Johnson, Krishna
Kumar, Krzysztof Kozlowski, Nathan Lynch, Nicholas Piggin, Nick Bowler,
Nilay Shroff, Rob Herring (Arm), Shawn Anastasio, Shivaprasad G Bhat,
Sourabh Jain, Srikar Dronamraju, Timothy Pearson, Uwe Kleine-König, and
Vaibhav Jain.
* tag 'powerpc-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (57 commits)
Documentation/powerpc: Mention 40x is removed
powerpc: Remove 40x leftovers
macintosh/therm_windtunnel: fix module unload.
powerpc: Check only single values are passed to CPU/MMU feature checks
powerpc/xmon: Fix disassembly CPU feature checks
powerpc: Drop clang workaround for builtin constant checks
powerpc64/bpf: jit support for signed division and modulo
powerpc64/bpf: jit support for sign extended mov
powerpc64/bpf: jit support for sign extended load
powerpc64/bpf: jit support for unconditional byte swap
powerpc64/bpf: jit support for 32bit offset jmp instruction
powerpc/pci: Hotplug driver bridge support
pci/hotplug/pnv_php: Fix hotplug driver crash on Powernv
powerpc/configs: Update defconfig with now user-visible CONFIG_FSL_IFC
powerpc: add missing MODULE_DESCRIPTION() macros
macintosh/mac_hid: add MODULE_DESCRIPTION()
KVM: PPC: add missing MODULE_DESCRIPTION() macros
powerpc/kexec: Use of_property_read_reg()
powerpc/64s/radix/kfence: map __kfence_pool at page granularity
powerpc/pseries/iommu: Define spapr_tce_table_group_ops only with CONFIG_IOMMU_API
...
|
|
Pull VFIO updates from Alex Williamson:
- Add support for 8-byte accesses when using read/write through the
device regions. This fills a gap for userspace drivers that might
not be able to use access through mmap to perform native register
width accesses (Gerd Bayer)
- Add missing MODULE_DESCRIPTION to vfio-mdev sample drivers and
replace a non-standard MODULE_INFO usage (Jeff Johnson)
* tag 'vfio-v6.11-rc1' of https://github.com/awilliam/linux-vfio:
vfio-mdev: add missing MODULE_DESCRIPTION() macros
vfio/pci: Fix typo in macro to declare accessors
vfio/pci: Support 8-byte PCI loads and stores
vfio/pci: Extract duplicated code into macro
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Will Deacon:
"Core:
- Support for the "ats-supported" device-tree property
- Removal of the 'ops' field from 'struct iommu_fwspec'
- Introduction of iommu_paging_domain_alloc() and partial conversion
of existing users
- Introduce 'struct iommu_attach_handle' and provide corresponding
IOMMU interfaces which will be used by the IOMMUFD subsystem
- Remove stale documentation
- Add missing MODULE_DESCRIPTION() macro
- Misc cleanups
Allwinner Sun50i:
- Ensure bypass mode is disabled on H616 SoCs
- Ensure page-tables are allocated below 4GiB for the 32-bit
page-table walker
- Add new device-tree compatible strings
AMD Vi:
- Use try_cmpxchg64() instead of cmpxchg64() when updating pte
Arm SMMUv2:
- Print much more useful information on context faults
- Fix Qualcomm TBU probing when CONFIG_ARM_SMMU_QCOM_DEBUG=n
- Add new Qualcomm device-tree bindings
Arm SMMUv3:
- Support for hardware update of access/dirty bits and reporting via
IOMMUFD
- More driver rework from Jason, this time updating the PASID/SVA
support to prepare for full IOMMUFD support
- Add missing MODULE_DESCRIPTION() macro
- Minor fixes and cleanups
NVIDIA Tegra:
- Fix for benign fwspec initialisation issue exposed by rework on the
core branch
Intel VT-d:
- Use try_cmpxchg64() instead of cmpxchg64() when updating pte
- Use READ_ONCE() to read volatile descriptor status
- Remove support for handling Execute-Requested requests
- Avoid calling iommu_domain_alloc()
- Minor fixes and refactoring
Qualcomm MSM:
- Updates to the device-tree bindings"
* tag 'iommu-updates-v6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (72 commits)
iommu/tegra-smmu: Pass correct fwnode to iommu_fwspec_init()
iommu/vt-d: Fix identity map bounds in si_domain_init()
iommu: Move IOMMU_DIRTY_NO_CLEAR define
dt-bindings: iommu: Convert msm,iommu-v0 to yaml
iommu/vt-d: Fix aligned pages in calculate_psi_aligned_address()
iommu/vt-d: Limit max address mask to MAX_AGAW_PFN_WIDTH
docs: iommu: Remove outdated Documentation/userspace-api/iommu.rst
arm64: dts: fvp: Enable PCIe ATS for Base RevC FVP
iommu/of: Support ats-supported device-tree property
dt-bindings: PCI: generic: Add ats-supported property
iommu: Remove iommu_fwspec ops
OF: Simplify of_iommu_configure()
ACPI: Retire acpi_iommu_fwspec_ops()
iommu: Resolve fwspec ops automatically
iommu/mediatek-v1: Clean up redundant fwspec checks
RDMA/usnic: Use iommu_paging_domain_alloc()
wifi: ath11k: Use iommu_paging_domain_alloc()
wifi: ath10k: Use iommu_paging_domain_alloc()
drm/msm: Use iommu_paging_domain_alloc()
vhost-vdpa: Use iommu_paging_domain_alloc()
...
|
|
The count variable is used without initialization, it results in mistakes
in the device counting and crashes the userspace if the get hot reset info
path is triggered.
Fixes: f6944d4a0b87 ("vfio/pci: Collect hot-reset devices to local buffer")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219010
Reported-by: Žilvinas Žaltiena <zaltys@natrix.lt>
Cc: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20240710004150.319105-1-yi.l.liu@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Replace iommu_domain_alloc() with iommu_paging_domain_alloc().
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20240610085555.88197-4-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
|
|
In the match() callback, the struct device_driver * should not be
changed, so change the function callback to be a const *. This is one
step of many towards making the driver core safe to have struct
device_driver in read-only memory.
Because the match() callback is in all busses, all busses are modified
to handle this properly. This does entail switching some container_of()
calls to container_of_const() to properly handle the constant *.
For some busses, like PCI and USB and HV, the const * is cast away in
the match callback as those busses do want to modify those structures at
this point in time (they have a local lock in the driver structure.)
That will have to be changed in the future if they wish to have their
struct device * in read-only-memory.
Cc: Rafael J. Wysocki <rafael@kernel.org>
Reviewed-by: Alex Elder <elder@kernel.org>
Acked-by: Sumit Garg <sumit.garg@linaro.org>
Link: https://lore.kernel.org/r/2024070136-wrongdoer-busily-01e8@gregkh
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The PAPR expects the TCE table to have no entries at the time of
unset window(i.e. remove-pe). The TCE clear right now is done
before freeing the iommu table. On pSeries, the unset window
makes those entries inaccessible to the OS and the H_PUT/GET calls
fail on them with H_CONSTRAINED.
On PowerNV, this has no side effect as the TCE clear can be done
before the DMA window removal as well.
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/171923273535.1397.1236742071894414895.stgit@linux.ibm.com
|
|
Many PCI adapters can benefit or even require full 64bit read
and write access to their registers. In order to enable work on
user-space drivers for these devices add two new variations
vfio_pci_core_io{read|write}64 of the existing access methods
when the architecture supports 64-bit ioreads and iowrites.
Signed-off-by: Ben Segal <bpsegal@us.ibm.com>
Co-developed-by: Gerd Bayer <gbayer@linux.ibm.com>
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
Link: https://lore.kernel.org/r/20240619115847.1344875-3-gbayer@linux.ibm.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
vfio_pci_core_do_io_rw() repeats the same code for multiple access
widths. Factor this out into a macro
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com>
Link: https://lore.kernel.org/r/20240619115847.1344875-2-gbayer@linux.ibm.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
In order to improve performance of typical scenarios we can try to insert
the entire vma on fault. This accelerates typical cases, such as when
the MMIO region is DMA mapped by QEMU. The vfio_iommu_type1 driver will
fault in the entire DMA mapped range through fixup_user_fault().
In synthetic testing, this improves the time required to walk a PCI BAR
mapping from userspace by roughly 1/3rd.
This is likely an interim solution until vmf_insert_pfn_{pmd,pud}() gain
support for pfnmaps.
Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/all/Zl6XdUkt%2FzMMGOLF@yzhao56-desk.sh.intel.com/
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20240607035213.2054226-1-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
With the vfio device fd tied to the address space of the pseudo fs
inode, we can use the mm to track all vmas that might be mmap'ing
device BARs, which removes our vma_list and all the complicated lock
ordering necessary to manually zap each related vma.
Note that we can no longer store the pfn in vm_pgoff if we want to use
unmap_mapping_range() to zap a selective portion of the device fd
corresponding to BAR mappings.
This also converts our mmap fault handler to use vmf_insert_pfn()
because we no longer have a vma_list to avoid the concurrency problem
with io_remap_pfn_range(). The goal is to eventually use the vm_ops
huge_fault handler to avoid the additional faulting overhead, but
vmf_insert_pfn_{pmd,pud}() need to learn about pfnmaps first.
Also, Jason notes that a race exists between unmap_mapping_range() and
the fops mmap callback if we were to call io_remap_pfn_range() to
populate the vma on mmap. Specifically, mmap_region() does call_mmap()
before it does vma_link_file() which gives a window where the vma is
populated but invisible to unmap_mapping_range().
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240530045236.1005864-3-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
By linking all the device fds we provide to userspace to an
address space through a new pseudo fs, we can use tools like
unmap_mapping_range() to zap all vmas associated with a device.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240530045236.1005864-2-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Pull vfio updates from Alex Williamson:
- The vfio fsl-mc bus driver has become orphaned. We'll consider
removing it in future releases if a new maintainer isn't found (Alex
Williamson)
- Improved usage of opaque data in vfio-pci INTx handling, avoiding
lookups of the eventfd through the interrupt and irqfd runtime paths
(Alex Williamson)
- Resolve an error path memory leak introduced in vfio-pci interrupt
code (Ye Bin)
- Addition of interrupt support for vfio devices exposed on the CDX
bus, including a new MSI allocation helper and export of existing
helpers for MSI alloc and free (Nipun Gupta)
- A new vfio-pci variant driver supporting migration of Intel QAT VF
devices for the GEN4 PFs (Xin Zeng & Yahui Cao)
- Resolve a possibly circular locking dependency in vfio-pci by
avoiding copy_to_user() from a PCI bus walk callback (Alex
Williamson)
- Trivial docs update to remove a duplicate semicolon (Foryun Ma)
* tag 'vfio-v6.10-rc1' of https://github.com/awilliam/linux-vfio:
vfio/pci: Restore zero affected bus reset devices warning
vfio: remove an extra semicolon
vfio/pci: Collect hot-reset devices to local buffer
vfio/qat: Add vfio_pci driver for Intel QAT SR-IOV VF devices
vfio/cdx: add interrupt support
genirq/msi: Add MSI allocation helper and export MSI functions
vfio/pci: fix potential memory leak in vfio_intx_enable()
vfio/pci: Pass eventfd context object through irqfd
vfio/pci: Pass eventfd context to IRQ handler
MAINTAINERS: Orphan vfio fsl-mc bus driver
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull mm updates from Andrew Morton:
"The usual shower of singleton fixes and minor series all over MM,
documented (hopefully adequately) in the respective changelogs.
Notable series include:
- Lucas Stach has provided some page-mapping cleanup/consolidation/
maintainability work in the series "mm/treewide: Remove pXd_huge()
API".
- In the series "Allow migrate on protnone reference with
MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's
MPOL_PREFERRED_MANY mode, yielding almost doubled performance in
one test.
- In their series "Memory allocation profiling" Kent Overstreet and
Suren Baghdasaryan have contributed a means of determining (via
/proc/allocinfo) whereabouts in the kernel memory is being
allocated: number of calls and amount of memory.
- Matthew Wilcox has provided the series "Various significant MM
patches" which does a number of rather unrelated things, but in
largely similar code sites.
- In his series "mm: page_alloc: freelist migratetype hygiene"
Johannes Weiner has fixed the page allocator's handling of
migratetype requests, with resulting improvements in compaction
efficiency.
- In the series "make the hugetlb migration strategy consistent"
Baolin Wang has fixed a hugetlb migration issue, which should
improve hugetlb allocation reliability.
- Liu Shixin has hit an I/O meltdown caused by readahead in a
memory-tight memcg. Addressed in the series "Fix I/O high when
memory almost met memcg limit".
- In the series "mm/filemap: optimize folio adding and splitting"
Kairui Song has optimized pagecache insertion, yielding ~10%
performance improvement in one test.
- Baoquan He has cleaned up and consolidated the early zone
initialization code in the series "mm/mm_init.c: refactor
free_area_init_core()".
- Baoquan has also redone some MM initializatio code in the series
"mm/init: minor clean up and improvement".
- MM helper cleanups from Christoph Hellwig in his series "remove
follow_pfn".
- More cleanups from Matthew Wilcox in the series "Various
page->flags cleanups".
- Vlastimil Babka has contributed maintainability improvements in the
series "memcg_kmem hooks refactoring".
- More folio conversions and cleanups in Matthew Wilcox's series:
"Convert huge_zero_page to huge_zero_folio"
"khugepaged folio conversions"
"Remove page_idle and page_young wrappers"
"Use folio APIs in procfs"
"Clean up __folio_put()"
"Some cleanups for memory-failure"
"Remove page_mapping()"
"More folio compat code removal"
- David Hildenbrand chipped in with "fs/proc/task_mmu: convert
hugetlb functions to work on folis".
- Code consolidation and cleanup work related to GUP's handling of
hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2".
- Rick Edgecombe has developed some fixes to stack guard gaps in the
series "Cover a guard gap corner case".
- Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the
series "mm/ksm: fix ksm exec support for prctl".
- Baolin Wang has implemented NUMA balancing for multi-size THPs.
This is a simple first-cut implementation for now. The series is
"support multi-size THP numa balancing".
- Cleanups to vma handling helper functions from Matthew Wilcox in
the series "Unify vma_address and vma_pgoff_address".
- Some selftests maintenance work from Dev Jain in the series
"selftests/mm: mremap_test: Optimizations and style fixes".
- Improvements to the swapping of multi-size THPs from Ryan Roberts
in the series "Swap-out mTHP without splitting".
- Kefeng Wang has significantly optimized the handling of arm64's
permission page faults in the series
"arch/mm/fault: accelerate pagefault when badaccess"
"mm: remove arch's private VM_FAULT_BADMAP/BADACCESS"
- GUP cleanups from David Hildenbrand in "mm/gup: consistently call
it GUP-fast".
- hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault
path to use struct vm_fault".
- selftests build fixes from John Hubbard in the series "Fix
selftests/mm build without requiring "make headers"".
- Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the
series "Improved Memory Tier Creation for CPUless NUMA Nodes".
Fixes the initialization code so that migration between different
memory types works as intended.
- David Hildenbrand has improved follow_pte() and fixed an errant
driver in the series "mm: follow_pte() improvements and acrn
follow_pte() fixes".
- David also did some cleanup work on large folio mapcounts in his
series "mm: mapcount for large folios + page_mapcount() cleanups".
- Folio conversions in KSM in Alex Shi's series "transfer page to
folio in KSM".
- Barry Song has added some sysfs stats for monitoring multi-size
THP's in the series "mm: add per-order mTHP alloc and swpout
counters".
- Some zswap cleanups from Yosry Ahmed in the series "zswap
same-filled and limit checking cleanups".
- Matthew Wilcox has been looking at buffer_head code and found the
documentation to be lacking. The series is "Improve buffer head
documentation".
- Multi-size THPs get more work, this time from Lance Yang. His
series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free"
optimizes the freeing of these things.
- Kemeng Shi has added more userspace-visible writeback
instrumentation in the series "Improve visibility of writeback".
- Kemeng Shi then sent some maintenance work on top in the series
"Fix and cleanups to page-writeback".
- Matthew Wilcox reduces mmap_lock traffic in the anon vma code in
the series "Improve anon_vma scalability for anon VMAs". Intel's
test bot reported an improbable 3x improvement in one test.
- SeongJae Park adds some DAMON feature work in the series
"mm/damon: add a DAMOS filter type for page granularity access recheck"
"selftests/damon: add DAMOS quota goal test"
- Also some maintenance work in the series
"mm/damon/paddr: simplify page level access re-check for pageout"
"mm/damon: misc fixes and improvements"
- David Hildenbrand has disabled some known-to-fail selftests ni the
series "selftests: mm: cow: flag vmsplice() hugetlb tests as
XFAIL".
- memcg metadata storage optimizations from Shakeel Butt in "memcg:
reduce memory consumption by memcg stats".
- DAX fixes and maintenance work from Vishal Verma in the series
"dax/bus.c: Fixups for dax-bus locking""
* tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits)
memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order
selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime
mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp
mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault
selftests: cgroup: add tests to verify the zswap writeback path
mm: memcg: make alloc_mem_cgroup_per_node_info() return bool
mm/damon/core: fix return value from damos_wmark_metric_value
mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED
selftests: cgroup: remove redundant enabling of memory controller
Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree
Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT
Docs/mm/damon/design: use a list for supported filters
Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command
Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file
selftests/damon: classify tests for functionalities and regressions
selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None'
selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts
selftests/damon/_damon_sysfs: check errors from nr_schemes file reads
mm/damon/core: initialize ->esz_bp from damos_quota_init_priv()
selftests/damon: add a test for DAMOS quota goal
...
|
|
Pull ARM updates from Russell King:
- Updates to AMBA bus subsystem to drop .owner struct device_driver
initialisations, moving that to code instead.
- Add LPAE privileged-access-never support
- Add support for Clang CFI
- clkdev: report over-sized device or connection strings
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux: (36 commits)
ARM: 9398/1: Fix userspace enter on LPAE with CC_OPTIMIZE_FOR_SIZE=y
clkdev: report over-sized strings when creating clkdev entries
ARM: 9393/1: mm: Use conditionals for CFI branches
ARM: 9392/2: Support CLANG CFI
ARM: 9391/2: hw_breakpoint: Handle CFI breakpoints
ARM: 9390/2: lib: Annotate loop delay instructions for CFI
ARM: 9389/2: mm: Define prototypes for all per-processor calls
ARM: 9388/2: mm: Type-annotate all per-processor assembly routines
ARM: 9387/2: mm: Rewrite cacheflush vtables in CFI safe C
ARM: 9386/2: mm: Use symbol alias for cache functions
ARM: 9385/2: mm: Type-annotate all cache assembly routines
ARM: 9384/2: mm: Make tlbflush routines CFI safe
ARM: 9382/1: ftrace: Define ftrace_stub_graph
ARM: 9358/2: Implement PAN for LPAE by TTBR0 page table walks disablement
ARM: 9357/2: Reduce the number of #ifdef CONFIG_CPU_SW_DOMAIN_PAN
ARM: 9356/2: Move asm statements accessing TTBCR into C functions
ARM: 9355/2: Add TTBCR_* definitions to pgtable-3level-hwdef.h
ARM: 9379/1: coresight: tpda: drop owner assignment
ARM: 9378/1: coresight: etm4x: drop owner assignment
ARM: 9377/1: hwrng: nomadik: drop owner assignment
...
|
|
Yi notes relative to commit f6944d4a0b87 ("vfio/pci: Collect hot-reset
devices to local buffer") that we previously tested the resulting
device count with a WARN_ON, which was removed when we switched to
the in-loop user copy in commit b56b7aabcf3c ("vfio/pci: Copy hot-reset
device info to userspace in the devices loop"). Finding no devices in
the bus/slot would be an unexpected condition, so let's restore the
warning and trigger a -ERANGE error here as success with no devices
would be an unexpected result to userspace as well.
Suggested-by: Yi Liu <yi.l.liu@intel.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20240516174831.2257970-1-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Due to an erratum with the SPR_DSA and SPR_IAX devices, it is not secure to assign
these devices to virtual machines. Add the PCI IDs of these devices to the VFIO
denylist to ensure that this is handled appropriately by the VFIO subsystem.
The SPR_DSA and SPR_IAX devices are on-SOC devices for the Sapphire Rapids
(and related) family of products that perform data movement and compression.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
|
|
Lockdep reports the below circular locking dependency issue. The
mmap_lock acquisition while holding pci_bus_sem is due to the use of
copy_to_user() from within a pci_walk_bus() callback.
Building the devices array directly into the user buffer is only for
convenience. Instead we can allocate a local buffer for the array,
bounded by the number of devices on the bus/slot, fill the device
information into this local buffer, then copy it into the user buffer
outside the bus walk callback.
======================================================
WARNING: possible circular locking dependency detected
6.9.0-rc5+ #39 Not tainted
------------------------------------------------------
CPU 0/KVM/4113 is trying to acquire lock:
ffff99a609ee18a8 (&vdev->vma_lock){+.+.}-{4:4}, at: vfio_pci_mmap_fault+0x35/0x1a0 [vfio_pci_core]
but task is already holding lock:
ffff99a243a052a0 (&mm->mmap_lock){++++}-{4:4}, at: vaddr_get_pfns+0x3f/0x170 [vfio_iommu_type1]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&mm->mmap_lock){++++}-{4:4}:
__lock_acquire+0x4e4/0xb90
lock_acquire+0xbc/0x2d0
__might_fault+0x5c/0x80
_copy_to_user+0x1e/0x60
vfio_pci_fill_devs+0x9f/0x130 [vfio_pci_core]
vfio_pci_walk_wrapper+0x45/0x60 [vfio_pci_core]
__pci_walk_bus+0x6b/0xb0
vfio_pci_ioctl_get_pci_hot_reset_info+0x10b/0x1d0 [vfio_pci_core]
vfio_pci_core_ioctl+0x1cb/0x400 [vfio_pci_core]
vfio_device_fops_unl_ioctl+0x7e/0x140 [vfio]
__x64_sys_ioctl+0x8a/0xc0
do_syscall_64+0x8d/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #2 (pci_bus_sem){++++}-{4:4}:
__lock_acquire+0x4e4/0xb90
lock_acquire+0xbc/0x2d0
down_read+0x3e/0x160
pci_bridge_wait_for_secondary_bus.part.0+0x33/0x2d0
pci_reset_bus+0xdd/0x160
vfio_pci_dev_set_hot_reset+0x256/0x270 [vfio_pci_core]
vfio_pci_ioctl_pci_hot_reset_groups+0x1a3/0x280 [vfio_pci_core]
vfio_pci_core_ioctl+0x3b5/0x400 [vfio_pci_core]
vfio_device_fops_unl_ioctl+0x7e/0x140 [vfio]
__x64_sys_ioctl+0x8a/0xc0
do_syscall_64+0x8d/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #1 (&vdev->memory_lock){+.+.}-{4:4}:
__lock_acquire+0x4e4/0xb90
lock_acquire+0xbc/0x2d0
down_write+0x3b/0xc0
vfio_pci_zap_and_down_write_memory_lock+0x1c/0x30 [vfio_pci_core]
vfio_basic_config_write+0x281/0x340 [vfio_pci_core]
vfio_config_do_rw+0x1fa/0x300 [vfio_pci_core]
vfio_pci_config_rw+0x75/0xe50 [vfio_pci_core]
vfio_pci_rw+0xea/0x1a0 [vfio_pci_core]
vfs_write+0xea/0x520
__x64_sys_pwrite64+0x90/0xc0
do_syscall_64+0x8d/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #0 (&vdev->vma_lock){+.+.}-{4:4}:
check_prev_add+0xeb/0xcc0
validate_chain+0x465/0x530
__lock_acquire+0x4e4/0xb90
lock_acquire+0xbc/0x2d0
__mutex_lock+0x97/0xde0
vfio_pci_mmap_fault+0x35/0x1a0 [vfio_pci_core]
__do_fault+0x31/0x160
do_pte_missing+0x65/0x3b0
__handle_mm_fault+0x303/0x720
handle_mm_fault+0x10f/0x460
fixup_user_fault+0x7f/0x1f0
follow_fault_pfn+0x66/0x1c0 [vfio_iommu_type1]
vaddr_get_pfns+0xf2/0x170 [vfio_iommu_type1]
vfio_pin_pages_remote+0x348/0x4e0 [vfio_iommu_type1]
vfio_pin_map_dma+0xd2/0x330 [vfio_iommu_type1]
vfio_dma_do_map+0x2c0/0x440 [vfio_iommu_type1]
vfio_iommu_type1_ioctl+0xc5/0x1d0 [vfio_iommu_type1]
__x64_sys_ioctl+0x8a/0xc0
do_syscall_64+0x8d/0x170
entry_SYSCALL_64_after_hwframe+0x76/0x7e
other info that might help us debug this:
Chain exists of:
&vdev->vma_lock --> pci_bus_sem --> &mm->mmap_lock
Possible unsafe locking scenario:
block dm-0: the capability attribute has been deprecated.
CPU0 CPU1
---- ----
rlock(&mm->mmap_lock);
lock(pci_bus_sem);
lock(&mm->mmap_lock);
lock(&vdev->vma_lock);
*** DEADLOCK ***
2 locks held by CPU 0/KVM/4113:
#0: ffff99a25f294888 (&iommu->lock#2){+.+.}-{4:4}, at: vfio_dma_do_map+0x60/0x440 [vfio_iommu_type1]
#1: ffff99a243a052a0 (&mm->mmap_lock){++++}-{4:4}, at: vaddr_get_pfns+0x3f/0x170 [vfio_iommu_type1]
stack backtrace:
CPU: 1 PID: 4113 Comm: CPU 0/KVM Not tainted 6.9.0-rc5+ #39
Hardware name: Dell Inc. PowerEdge T640/04WYPY, BIOS 2.15.1 06/16/2022
Call Trace:
<TASK>
dump_stack_lvl+0x64/0xa0
check_noncircular+0x131/0x150
check_prev_add+0xeb/0xcc0
? add_chain_cache+0x10a/0x2f0
? __lock_acquire+0x4e4/0xb90
validate_chain+0x465/0x530
__lock_acquire+0x4e4/0xb90
lock_acquire+0xbc/0x2d0
? vfio_pci_mmap_fault+0x35/0x1a0 [vfio_pci_core]
? lock_is_held_type+0x9a/0x110
__mutex_lock+0x97/0xde0
? vfio_pci_mmap_fault+0x35/0x1a0 [vfio_pci_core]
? lock_acquire+0xbc/0x2d0
? vfio_pci_mmap_fault+0x35/0x1a0 [vfio_pci_core]
? find_held_lock+0x2b/0x80
? vfio_pci_mmap_fault+0x35/0x1a0 [vfio_pci_core]
vfio_pci_mmap_fault+0x35/0x1a0 [vfio_pci_core]
__do_fault+0x31/0x160
do_pte_missing+0x65/0x3b0
__handle_mm_fault+0x303/0x720
handle_mm_fault+0x10f/0x460
fixup_user_fault+0x7f/0x1f0
follow_fault_pfn+0x66/0x1c0 [vfio_iommu_type1]
vaddr_get_pfns+0xf2/0x170 [vfio_iommu_type1]
vfio_pin_pages_remote+0x348/0x4e0 [vfio_iommu_type1]
vfio_pin_map_dma+0xd2/0x330 [vfio_iommu_type1]
vfio_dma_do_map+0x2c0/0x440 [vfio_iommu_type1]
vfio_iommu_type1_ioctl+0xc5/0x1d0 [vfio_iommu_type1]
__x64_sys_ioctl+0x8a/0xc0
do_syscall_64+0x8d/0x170
? rcu_core+0x8d/0x250
? __lock_release+0x5e/0x160
? rcu_core+0x8d/0x250
? lock_release+0x5f/0x120
? sched_clock+0xc/0x30
? sched_clock_cpu+0xb/0x190
? irqtime_account_irq+0x40/0xc0
? __local_bh_enable+0x54/0x60
? __do_softirq+0x315/0x3ca
? lockdep_hardirqs_on_prepare.part.0+0x97/0x140
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f8300d0357b
Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 68 0f 00 f7 d8 64 89 01 48
RSP: 002b:00007f82ef3fb948 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8300d0357b
RDX: 00007f82ef3fb990 RSI: 0000000000003b71 RDI: 0000000000000023
RBP: 00007f82ef3fb9c0 R08: 0000000000000000 R09: 0000561b7e0bcac2
R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
R13: 0000000200000000 R14: 0000381800000000 R15: 0000000000000000
</TASK>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20240503143138.3562116-1-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
... and centralize the VM_IO/VM_PFNMAP sanity check in there. We'll
now also perform these sanity checks for direct follow_pte()
invocations.
For generic_access_phys(), we might now check multiple times: nothing to
worry about, really.
Link: https://lkml.kernel.org/r/20240410155527.474777-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Sean Christopherson <seanjc@google.com> [KVM]
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Fei Li <fei1.li@intel.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Yonghua Huang <yonghua.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add vfio pci variant driver for Intel QAT SR-IOV VF devices. This driver
registers to the vfio subsystem through the interfaces exposed by the
subsystem. It follows the live migration protocol v2 defined in
uapi/linux/vfio.h and interacts with Intel QAT PF driver through a set
of interfaces defined in qat/qat_mig_dev.h to support live migration of
Intel QAT VF devices.
This version only covers migration for Intel QAT GEN4 VF devices.
Co-developed-by: Yahui Cao <yahui.cao@intel.com>
Signed-off-by: Yahui Cao <yahui.cao@intel.com>
Signed-off-by: Xin Zeng <xin.zeng@intel.com>
Reviewed-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20240426064051.2859652-1-xin.zeng@intel.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|