Lu Baolu says:
====================
This series implements the functionality of delivering IO page faults to
user space through the IOMMUFD framework. One feasible use case is
nested translation. Nested translation is a hardware feature that supports
two-stage translation tables for IOMMU. The second-stage translation table
is managed by the host VMM, while the first-stage translation table is
owned by user space. This allows user space to control the IOMMU mappings
for its devices.
When an IO page fault occurs on the first-stage translation table, the
IOMMU hardware can deliver the page fault to user space through the
IOMMUFD framework. User space can then handle the page fault and respond
to the device top-down through the IOMMUFD. This allows user space to
implement its own IO page fault handling policies.
User space applications that are capable of handling IO page faults should
allocate a fault object, and bind the fault object to any domain whose
generated faults they are willing to handle. On a successful return
of fault object allocation, the user can retrieve and respond to page
faults by reading or writing to the file descriptor (FD) returned.
The iommu selftest framework has been updated to test the IO page fault
delivery and response functionality.
====================
* iommufd_pri:
iommufd/selftest: Add coverage for IOPF test
iommufd/selftest: Add IOPF support for mock device
iommufd: Associate fault object with iommufd_hw_pgtable
iommufd: Fault-capable hwpt attach/detach/replace
iommufd: Add iommufd fault object
iommufd: Add fault and response message definitions
iommu: Extend domain attach group with handle support
iommu: Add attach handle to struct iopf_group
iommu: Remove sva handle list
iommu: Introduce domain attachment handle
Link: https://lore.kernel.org/all/20240702063444.105814-1-baolu.lu@linux.intel.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
When allocating a user iommufd_hw_pagetable, the user space is allowed to
associate a fault object with the hw_pagetable by specifying the fault
object ID in the page table allocation data and setting the
IOMMU_HWPT_FAULT_ID_VALID flag bit.
On a successful return of hwpt allocation, the user can retrieve and
respond to page faults by reading and writing the file interface of the
fault object.
Once a fault object has been associated with a hwpt, the hwpt is
iopf-capable, indicated by a non-NULL hwpt->fault. Attaching,
detaching, or replacing an iopf-capable hwpt on an RID or PASID
differs from doing so with one that is not iopf-capable.
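As a rough sketch of the resulting flow (ioctl, struct, and field names as
defined by this series; the fault-object allocation command name and the
dev_id/parent_id values are assumptions, error handling omitted):

/* allocate a fault object, then a hwpt associated with it */
struct iommu_fault_alloc fault = { .size = sizeof(fault) };
ioctl(iommufd, IOMMU_FAULT_QUEUE_ALLOC, &fault);

struct iommu_hwpt_alloc alloc = {
        .size = sizeof(alloc),
        .flags = IOMMU_HWPT_FAULT_ID_VALID,
        .dev_id = dev_id,                 /* assumed: bound device */
        .pt_id = parent_id,               /* assumed: parent hwpt/ioas */
        .fault_id = fault.out_fault_id,
};
ioctl(iommufd, IOMMU_HWPT_ALLOC, &alloc);
/* faults now arrive on fault.out_fault_fd; responses are written back */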
Link: https://lore.kernel.org/r/20240702063444.105814-9-baolu.lu@linux.intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
An iommufd fault object provides an interface for delivering I/O page
faults to user space. These objects are created and destroyed by user
space, and they can be associated with or dissociated from hardware page
table objects during page table allocation or destruction.
User space interacts with the fault object through a file interface. This
interface offers a straightforward and efficient way for user space to
handle page faults. It allows user space to read fault messages
sequentially and respond to them by writing to the same file. The file
interface supports reading messages in poll mode, so it's recommended that
user space applications use io_uring to enhance read and write efficiency.
A fault object can be associated with any iopf-capable iommufd_hw_pgtable
during the pgtable's allocation. All I/O page faults triggered by devices
when accessing the I/O addresses of an iommufd_hw_pgtable are routed
through the fault object to user space. Similarly, user space's responses
to these page faults are routed back to the iommu device driver through
the same fault object.
Link: https://lore.kernel.org/r/20240702063444.105814-7-baolu.lu@linux.intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
iommu_hwpt_pgfaults represent fault messages that the userspace can
retrieve. Multiple iommu_hwpt_pgfaults might be put in an iopf group,
with the IOMMU_PGFAULT_FLAGS_LAST_PAGE flag set only for the last
iommu_hwpt_pgfault.
An iommu_hwpt_page_response is a response message that the userspace
should send to the kernel after finishing handling a group of fault
messages. The @dev_id, @pasid, and @grpid fields in the message
identify an outstanding iopf group for a device. The @cookie field,
which matches the cookie field of the last fault in the group, will
be used by the kernel to look up the pending message.
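A minimal userspace sketch of this read/respond cycle (struct, field, and
flag names from this patch; handle_fault() is a stand-in for application
policy, and the exact read()/write() record sizes are assumptions):

struct iommu_hwpt_pgfault pgfault;
struct iommu_hwpt_page_response resp = {};

/* read faults until the last one in the iopf group */
do {
        read(fault_fd, &pgfault, sizeof(pgfault));
        handle_fault(&pgfault);           /* application policy */
} while (!(pgfault.flags & IOMMU_PGFAULT_FLAGS_LAST_PAGE));

/* identify the outstanding group and answer it */
resp.dev_id = pgfault.dev_id;
resp.pasid  = pgfault.pasid;
resp.grpid  = pgfault.grpid;
resp.cookie = pgfault.cookie;             /* cookie of the last fault */
write(fault_fd, &resp, sizeof(resp));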
Link: https://lore.kernel.org/r/20240702063444.105814-6-baolu.lu@linux.intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-07-08
The following pull-request contains BPF updates for your *net-next* tree.
We've added 102 non-merge commits during the last 28 day(s) which contain
a total of 127 files changed, 4606 insertions(+), 980 deletions(-).
The main changes are:
1) Support resilient split BTF which cuts down on duplication and makes BTF
as compact as possible wrt BTF from modules, from Alan Maguire & Eduard Zingerman.
2) Add support for dumping kfunc prototypes from BTF which enables both detecting
as well as dumping compilable prototypes for kfuncs, from Daniel Xu.
3) Batch of s390x BPF JIT improvements to add support for BPF arena and to implement
support for BPF exceptions, from Ilya Leoshkevich.
4) Batch of riscv64 BPF JIT improvements in particular to add 12-argument support
for BPF trampolines and to utilize bpf_prog_pack for the latter, from Pu Lehui.
5) Extend BPF test infrastructure to add a CHECKSUM_COMPLETE validation option
for skbs and add coverage along with it, from Vadim Fedorenko.
6) Inline bpf_get_current_task/_btf() helpers in the arm64 BPF JIT which gives
a small 1% performance improvement in micro-benchmarks, from Puranjay Mohan.
7) Extend the BPF verifier to track the delta between linked registers in order
to better deal with recent LLVM code optimizations, from Alexei Starovoitov.
8) Fix bpf_wq_set_callback_impl() kfunc signature where the third argument should
have been a pointer to the map value, from Benjamin Tissoires.
9) Extend BPF selftests to add regular expression support for test output matching
and adjust some of the selftests when compiled under gcc, from Cupertino Miranda.
10) Simplify task_file_seq_get_next() and remove an unnecessary loop which always
iterates exactly once anyway, from Dan Carpenter.
11) Add the capability to offload the netfilter flowtable in XDP layer through
kfuncs, from Florian Westphal & Lorenzo Bianconi.
12) Various cleanups in networking helpers in BPF selftests to shave off a few
lines of open-coded functions on client/server handling, from Geliang Tang.
13) Properly propagate prog->aux->tail_call_reachable out of BPF verifier, so
that x86 JIT does not need to implement detection, from Leon Hwang.
14) Fix BPF verifier to add a missing check_func_arg_reg_off() to prevent an
out-of-bounds memory access for dynpointers, from Matt Bobrowski.
15) Fix bpf_session_cookie() kfunc to return __u64 instead of long pointer as
it might lead to problems on 32-bit archs, from Jiri Olsa.
16) Enhance traffic validation and dynamic batch size support in xsk selftests,
from Tushar Vyavahare.
bpf-next-for-netdev
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (102 commits)
selftests/bpf: DENYLIST.aarch64: Remove fexit_sleep
selftests/bpf: amend for wrong bpf_wq_set_callback_impl signature
bpf: helpers: fix bpf_wq_set_callback_impl signature
libbpf: Add NULL checks to bpf_object__{prev_map,next_map}
selftests/bpf: Remove exceptions tests from DENYLIST.s390x
s390/bpf: Implement exceptions
s390/bpf: Change seen_reg to a mask
bpf: Remove unnecessary loop in task_file_seq_get_next()
riscv, bpf: Optimize stack usage of trampoline
bpf, devmap: Add .map_alloc_check
selftests/bpf: Remove arena tests from DENYLIST.s390x
selftests/bpf: Add UAF tests for arena atomics
selftests/bpf: Introduce __arena_global
s390/bpf: Support arena atomics
s390/bpf: Enable arena
s390/bpf: Support address space cast instruction
s390/bpf: Support BPF_PROBE_MEM32
s390/bpf: Land on the next JITed instruction after exception
s390/bpf: Introduce pre- and post- probe functions
s390/bpf: Get rid of get_probe_mem_regno()
...
====================
Link: https://patch.msgid.link/20240708221438.10974-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
The prerequisite for MLO support in cfg80211/mac80211 is that all the links
participating in MLO must be from the same wiphy/ieee80211_hw. To meet this
expectation, some drivers may need to group multiple discrete hardware units,
each acting as a link in MLO, under a single wiphy.
With this change, the supported frequencies and interface combinations of each
individual radio are reported to user space. This allows user space to figure
out which combinations of channels can be used concurrently.
Even for non-MLO devices, this improves support for devices capable of
running on multiple channels at the same time.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://patch.msgid.link/18a88f9ce82b1c9f7c12f1672430eaf2bb0be295.1720514221.git-series.nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
|
|
If the server supports the NFSv4.2 protocol extension to optimise away
returning a stateid when it returns a delegation, then we cache that
information in another capability flag.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
Add the attribute delegation XDR definitions from the spec.
Signed-off-by: Tom Haynes <loghyr@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Lance Shelton <lance.shelton@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
|
This patch expands the status information provided by ethtool for PSE c33
with available power limit and available power limit ranges. It also adds
a call to pse_ethtool_set_pw_limit() to configure the PSE control power
limit.
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20240704-feature_poe_power_cap-v6-5-320003204264@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This update expands the status information provided by ethtool for PSE c33.
It includes details such as the detected class, current power delivered,
and extended state information.
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20240704-feature_poe_power_cap-v6-1-320003204264@bootlin.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When a packet sample is observed, the sampling rate that was used is
important to estimate the real frequency of such event.
Store the probability of the parent sample action in the skb's cb area
and use it in the psample action to pass it down to the psample module.
Reviewed-by: Aaron Conole <aconole@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Link: https://patch.msgid.link/20240704085710.353845-7-amorenoz@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add support for a new action: psample.
This action accepts a u32 group id and a variable-length cookie and uses
the psample multicast group to make the packet available for
observability.
The maximum length of the user-defined cookie is set to 16, same as
tc_cookie, to discourage using cookies that will not be offloadable.
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Reviewed-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Link: https://patch.msgid.link/20240704085710.353845-6-amorenoz@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Although not explicitly documented in the psample module itself, the
definition of PSAMPLE_ATTR_SAMPLE_RATE seems inherited from act_sample.
Quoting tc-sample(8):
"RATE of 100 will lead to an average of one sampled packet out of every
100 observed."
With these semantics, the rates that we can express with an unsigned
32-bit number are very unevenly distributed and concentrated towards
"sampling few packets".
For example, we can express a probability of 2.32E-8% but we
cannot express anything between 100% and 50%.
For sampling applications that are capable of sampling a decent
amount of packets, these sampling rate semantics are not very useful.
Add a new flag to the uAPI indicating that the sampling rate is
expressed as a scaled probability, that is:
- 0 is 0% probability, no packets get sampled.
- U32_MAX is 100% probability, all packets get sampled.
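For illustration, a conventional 1-in-N rate maps onto the new
representation roughly as follows (hypothetical helper, not part of the
patch):

/* convert a tc-sample style rate (one in every N packets) into the
 * scaled-probability form where U32_MAX means 100%
 */
static __u32 rate_to_scaled_prob(__u32 rate)
{
        return rate ? (__u32)(0xffffffffULL / rate) : 0;
}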
Reviewed-by: Aaron Conole <aconole@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Link: https://patch.msgid.link/20240704085710.353845-5-amorenoz@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a user cookie to the sample metadata so that sample emitters can
provide more contextual information to samples.
If present, send the user cookie in a new attribute:
PSAMPLE_ATTR_USER_COOKIE.
Reviewed-by: Michal Kubiak <michal.kubiak@intel.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Link: https://patch.msgid.link/20240704085710.353845-2-amorenoz@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The exynos-next pull is based on a newer -rc than drm-next. Hence,
backmerge first to make sure the unrelated conflicts we accumulated
don't end up randomly in the exynos merge pull, but are separated out.
Conflicts are all benign: Adjacent changes in amdgpu and fbdev-dma
code, and cherry-pick conflict in xe.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/phy/aquantia/aquantia.h
219343755eae ("net: phy: aquantia: add missing include guards")
61578f679378 ("net: phy: aquantia: add support for PHY LEDs")
drivers/net/ethernet/wangxun/libwx/wx_hw.c
bd07a9817846 ("net: txgbe: remove separate irq request for MSI and INTx")
b501d261a5b3 ("net: txgbe: add FDIR ATR support")
https://lore.kernel.org/all/20240703112936.483c1975@canb.auug.org.au/
include/linux/mlx5/mlx5_ifc.h
048a403648fc ("net/mlx5: IFC updates for changing max EQs")
99be56171fa9 ("net/mlx5e: SHAMPO, Re-enable HW-GRO")
https://lore.kernel.org/all/20240701133951.6926b2e3@canb.auug.org.au/
Adjacent changes:
drivers/net/wireless/intel/iwlwifi/mvm/mac80211.c
4130c67cd123 ("wifi: iwlwifi: mvm: check vif for NULL/ERR_PTR before dereference")
3f3126515fbe ("wifi: iwlwifi: mvm: add mvm-specific guard")
include/net/mac80211.h
816c6bec09ed ("wifi: mac80211: fix BSS_CHANGED_UNSOL_BCAST_PROBE_RESP")
5a009b42e041 ("wifi: mac80211: track changes in AP's TPE")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
A new PEBS data source format is introduced for the p-core of Lunar
Lake. The data source field is extended to 8 bits with new encodings.
A new layout is introduced into the union intel_x86_pebs_dse.
Introduce the lnl_latency_data() to parse the new format.
Enlarge the pebs_data_source[] accordingly to include new encodings.
Only the mem load and the mem store events can generate the data source.
Introduce INTEL_HYBRID_LDLAT_CONSTRAINT and
INTEL_HYBRID_STLAT_CONSTRAINT to mark them.
Add two new bits for the new cache-related data src, L2_MHB and MSC.
L2_MHB is short for L2 Miss Handling Buffer; it is similar to the
LFB (Line Fill Buffer), but tracks L2 cache misses.
MSC stands for the memory-side cache.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lkml.kernel.org/r/20240626143545.480761-6-kan.liang@linux.intel.com
|
|
To prevent conflicts with other ioctl numbers, and to allow strace to have
an idea of what is happening, move the ioctl for the trace buffer mapping
from _IO("T", 0x1) to the reserved range "R" 0x20 - 0x2F.
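Assuming the define used by the mapping code (the macro name here is an
assumption), the reservation looks roughly like:

#define TRACE_MMAP_IOCTL_GET_READER	_IO('R', 0x20)
/* 'R' 0x20 - 0x2F is reserved for trace buffer mapping ioctls */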
Link: https://lore.kernel.org/linux-trace-kernel/20240630105322.GA17573@altlinux.org/
Link: https://lore.kernel.org/linux-trace-kernel/20240630213626.GA23566@altlinux.org/
Cc: Jonathan Corbet <corbet@lwn.net>
Fixes: cf9f0f7c4c5bb ("tracing: Allow user-space mapping of the ring-buffer")
Link: https://lore.kernel.org/20240702153354.367861db@rorschach.local.home
Reported-by: "Dmitry V. Levin" <ldv@strace.io>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
With external time travel, a LOT of messages can end up
being exchanged on the socket, taking a significant
amount of time just to do that.
Add a new shared memory optimisation to that, where a
number of changes are made:
- the controller sends a client ID and a shared memory FD
(and a logging FD we don't use) in the ACK message to
the initial START
- the shared memory holds the current time and the
free_until value, so that there's no need to exchange
messages for that
- if the client that's running has shared memory support,
any client (the running one included) can request the
next time it wants to run inside the shared memory,
rather than sending a message, by also updating the
free_until value
- when shared memory is enabled, RUN/WAIT messages no
longer have an ACK, further cutting down on messages
Together, this can reduce the number of messages very
significantly, and reduce overall test/simulation run time.
Co-developed-by: Mordechay Goodstein <mordechay.goodstein@intel.com>
Signed-off-by: Mordechay Goodstein <mordechay.goodstein@intel.com>
Link: https://patch.msgid.link/20240702192118.6ad0a083f574.Ie41206c8ce4507fe26b991937f47e86c24ca7a31@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Add a message type to the time-travel protocol to broadcast
a small (64-bit) value to all participants in a simulation.
The main use case is to have an identical message come to
all participants in a simulation, e.g. to separate out logs
for different tests running in a single simulation.
Down in the guts of time_travel_handle_message() we can't
use printk(), or even printk_deferred(), so just store
the message and print it at the start of the userspace()
function.
Unfortunately this means that other prints in the kernel
can actually bypass the message, but in most cases where
this is used, for example to separate test logs, userspace
will be involved. Also, even if we could use
printk_deferred(), we'd still need to flush it out in the
userspace() function since otherwise userspace messages
might cross it.
As a result, this is a reasonable compromise: there's no
need to have any core changes and it solves the main use
case we have for it.
Signed-off-by: Mordechay Goodstein <mordechay.goodstein@intel.com>
Link: https://patch.msgid.link/20240702192118.c4093bc5b15e.I2ca8d006b67feeb866ac2017af7b741c9e06445a@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next into main
Pablo Neira Ayuso says:
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter/IPVS updates for net-next:
Patch #1 to #11 to shrink memory consumption for transaction objects:
struct nft_trans_chain { /* size: 120 (-32), cachelines: 2, members: 10 */
struct nft_trans_elem { /* size: 72 (-40), cachelines: 2, members: 4 */
struct nft_trans_flowtable { /* size: 80 (-48), cachelines: 2, members: 5 */
struct nft_trans_obj { /* size: 72 (-40), cachelines: 2, members: 4 */
struct nft_trans_rule { /* size: 80 (-32), cachelines: 2, members: 6 */
struct nft_trans_set { /* size: 96 (-24), cachelines: 2, members: 8 */
struct nft_trans_table { /* size: 56 (-40), cachelines: 1, members: 2 */
struct nft_trans_elem can now be allocated from kmalloc-96 instead of
kmalloc-128 slab.
Series from Florian Westphal. For the record, I have mangled patch #1
to add nft_trans_container_*() and use it for every transaction object.
I have also added BUILD_BUG_ON to ensure struct nft_trans always comes
at the beginning of the container transaction object. And a few minor
cleanups; any new bugs are of my own.
Patch #12 simplifies the check for SCTP GSO in IPVS, from Ismael Luceno.
Patch #13 ensures the nf_conncount key length remains within the u32 bound, from Yunjian Wang.
Patch #14 removes unnecessary check for CTA_TIMEOUT_L3PROTO when setting
default conntrack timeouts via nfnetlink_cttimeout API, from
Lin Ma.
Patch #15 updates NFT_SECMARK_CTX_MAXLEN to 4096; SELinux may use
secctx names larger than the existing 256-byte limit.
Patch #16 adds a selftest to exercise nfnetlink_queue listeners leaving
nfnetlink_queue, from Florian Westphal.
Patch #17 increases hitcount from 255 to 65535 in xt_recent, from Phil Sutter.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a protocol spec for tcp_metrics, so that it's accessible via YNL.
Useful at the very least for testing fixes.
In this episode of "10,000 ways to complicate netlink" the metric
nest has defines which are off by 1. iproute2 does:
struct rtattr *m[TCP_METRIC_MAX + 1 + 1];
parse_rtattr_nested(m, TCP_METRIC_MAX + 1, a);
for (i = 0; i < TCP_METRIC_MAX + 1; i++) {
        // ...
        attr = m[i + 1];
This is too weird to support in YNL, add a new set of defines
with _correct_ values to the official kernel header.
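Illustratively, defines that bake the +1 shift in would let YNL index
attributes directly (a sketch; the exact names added to the header may
differ):

/* attribute type carried in the nest = enum tcp_metric_index + 1 */
#define TCP_METRICS_A_METRICS_RTT	(TCP_METRIC_RTT + 1)
#define TCP_METRICS_A_METRICS_RTTVAR	(TCP_METRIC_RTTVAR + 1)
#define TCP_METRICS_A_METRICS_CWND	(TCP_METRIC_CWND + 1)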
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
tcp_metrics' header lacks the customary _UAPI in the header guard.
This makes YNL build rules work less seamlessly.
We can easily fix that on YNL side, but this could also be
problematic if we ever needed to create a kernel-only tcp_metrics.h.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add the necessary infrastructure to the IIO core to support a new
optional DMABUF based interface.
With this new interface, DMABUF objects (externally created) can be
attached to an IIO buffer, and subsequently used for data transfer.
A userspace application can then use this interface to share DMABUF
objects between several interfaces, allowing it to transfer data in a
zero-copy fashion, for instance between IIO and the USB stack.
The userspace application can also memory-map the DMABUF objects, and
access the sample data directly. The advantage of doing this vs. the
read() interface is that it avoids an extra copy of the data between the
kernel and userspace. This is particularly useful for high-speed
devices which produce several megabytes or even gigabytes of data per
second.
As part of the interface, 3 new IOCTLs have been added:
IIO_BUFFER_DMABUF_ATTACH_IOCTL(int fd):
Attach the DMABUF object identified by the given file descriptor to the
buffer.
IIO_BUFFER_DMABUF_DETACH_IOCTL(int fd):
Detach the DMABUF object identified by the given file descriptor from
the buffer. Note that closing the IIO buffer's file descriptor will
automatically detach all previously attached DMABUF objects.
IIO_BUFFER_DMABUF_ENQUEUE_IOCTL(struct iio_dmabuf *):
Request a data transfer to/from the given DMABUF object. Its file
descriptor, as well as the transfer size and flags are provided in the
"iio_dmabuf" structure.
These three IOCTLs have to be performed on the IIO buffer's file
descriptor, obtained using the IIO_BUFFER_GET_FD_IOCTL() ioctl.
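Putting the three IOCTLs together, a transfer might look like the sketch
below (struct layout per this series; the bytes_used field name, variable
setup, and error handling are assumptions/omissions):

int buf_fd = 0;                      /* in: buffer index, out: buffer FD */
ioctl(iio_fd, IIO_BUFFER_GET_FD_IOCTL, &buf_fd);

ioctl(buf_fd, IIO_BUFFER_DMABUF_ATTACH_IOCTL, &dmabuf_fd);

struct iio_dmabuf xfer = {
        .fd = dmabuf_fd,             /* the attached DMABUF */
        .bytes_used = len,           /* transfer size */
};
ioctl(buf_fd, IIO_BUFFER_DMABUF_ENQUEUE_IOCTL, &xfer);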
Signed-off-by: Paul Cercueil <paul@crapouillou.net>
Co-developed-by: Nuno Sa <nuno.sa@analog.com>
Signed-off-by: Nuno Sa <nuno.sa@analog.com>
Link: https://patch.msgid.link/20240620122726.41232-4-paul@crapouillou.net
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
|
|
statmount() can export arbitrary strings, so utilize the __spare1 slot
for a mnt_opts string pointer, and then support asking for and setting
the mount options during statmount(). This calls into the helper for
showing mount options, which already uses a seq_file, so fits in nicely
with our existing mechanism for exporting strings via statmount().
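A sketch of the caller side (raw syscall wrapper and the exact flag name
assumed; the string is packed into the buffer after the fixed struct):

struct mnt_id_req req = {
        .size = sizeof(req),
        .mnt_id = mnt_id,
        .param = STATMOUNT_MNT_OPTS,      /* request the options string */
};
statmount(&req, st, bufsize, 0);          /* st: struct statmount buffer */
printf("options: %s\n", st->str + st->mnt_opts);  /* offset into ->str */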
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/3aa6bf8bd5d0a21df9ebd63813af8ab532c18276.1719257716.git.josef@toxicpanda.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
[brauner: only call sb->s_op->show_options()]
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
CMIS compliant modules such as QSFP-DD might be running a firmware that
can be updated in a vendor-neutral way by exchanging messages between
the host and the module as described in section 7.3.1 of revision 5.2 of
the CMIS standard.
Add a pair of new ethtool messages that allow:
* User space to trigger firmware update of transceiver modules
* The kernel to notify user space about the progress of the process
The user interface is designed to be asynchronous in order to avoid
RTNL being held for too long and to allow several modules to be
updated simultaneously. The interface is designed with CMIS compliant
modules in mind, but kept generic enough to accommodate future use
cases, if these arise.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
For users that hold a reference to a pidfd, procfs might not even be
available, nor is it desirable to parse through procfs just for the sake
of getting namespace file descriptors for a process.
Make it possible to directly retrieve namespace file descriptors from a
pidfd. Pidfds already can be used with setns() to change a set of
namespaces atomically.
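As a sketch (the ioctl name is an assumption for illustration), entering
a pidfd's network namespace becomes:

int pidfd = syscall(SYS_pidfd_open, pid, 0);
/* namespace FD straight from the pidfd, no procfs involved */
int netns_fd = ioctl(pidfd, PIDFD_GET_NET_NAMESPACE, 0);
setns(netns_fd, CLONE_NEWNET);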
Link: https://lore.kernel.org/r/20240627-work-pidfs-v1-4-7e9ab6cc3bb1@kernel.org
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In order to utilize the listmount() and statmount() extensions that
allow us to call them on different namespaces we need a way to get the
mnt namespace id from user space. Add an ioctl to nsfs that will allow
us to extract the mnt namespace id in order to make these new extensions
usable.
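A sketch of the intended usage (the ioctl name is an assumption):

__u64 mnt_ns_id;
int nsfd = open("/proc/self/ns/mnt", O_RDONLY);
ioctl(nsfd, NS_GET_MNTNS_ID, &mnt_ns_id);
/* mnt_ns_id can now be passed to the listmount()/statmount() extensions */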
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/180449959d5a756af7306d6bda55f41b9d53e3cb.1719243756.git.josef@toxicpanda.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Expand struct mnt_id_req to add an optional mnt_ns_id field. When this
field is populated, listmount() will be performed on the specified mount
namespace, provided the calling application has CAP_SYS_ADMIN in its
user namespace and the mount namespace is a child of the current
namespace.
Co-developed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/49930bdce29a8367a213eb14c1e68e7e49284f86.1719243756.git.josef@toxicpanda.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
In order to allow users to iterate through child mount namespaces via
listmount we need a way for them to know the ns id for a given mount.
Add a new field to statmount called mnt_ns_id which will carry the ns id
for the given mount entry.
Co-developed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/6dabf437331fb7415d886f7c64b21cb2a50b1c66.1719243756.git.josef@toxicpanda.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
util-linux is about to implement listmount() and statmount() support.
Karel requested the ability to scan the mount table in backwards order
because that's what libmount currently does in order to get the latest
mount first. We currently don't support this in listmount(). Add a new
LISTMOUNT_REVERSE flag to allow listing mounts in reverse order. For
example, listing all child mounts of /sys without LISTMOUNT_REVERSE
gives:
/sys/kernel/security @ mnt_id: 4294968369
/sys/fs/cgroup @ mnt_id: 4294968370
/sys/firmware/efi/efivars @ mnt_id: 4294968371
/sys/fs/bpf @ mnt_id: 4294968372
/sys/kernel/tracing @ mnt_id: 4294968373
/sys/kernel/debug @ mnt_id: 4294968374
/sys/fs/fuse/connections @ mnt_id: 4294968375
/sys/kernel/config @ mnt_id: 4294968376
whereas with LISTMOUNT_REVERSE it gives:
/sys/kernel/config @ mnt_id: 4294968376
/sys/fs/fuse/connections @ mnt_id: 4294968375
/sys/kernel/debug @ mnt_id: 4294968374
/sys/kernel/tracing @ mnt_id: 4294968373
/sys/fs/bpf @ mnt_id: 4294968372
/sys/firmware/efi/efivars @ mnt_id: 4294968371
/sys/fs/cgroup @ mnt_id: 4294968370
/sys/kernel/security @ mnt_id: 4294968369
Link: https://lore.kernel.org/r/20240607-vfs-listmount-reverse-v1-4-7877a2bfa5e5@kernel.org
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Johannes Berg says:
====================
Highlights this time are:
- cfg80211/nl80211:
* improvements for 6 GHz regulatory flexibility
- mac80211:
* use generic netdev stats
* multi-link improvements/fixes
- brcmfmac:
* MFP support (to enable WPA3)
- wilc1000:
* suspend/resume improvements
- iwlwifi:
* remove support for older FW for new devices
* fast resume (keeping the device configured)
- wl18xx:
* support newer firmware versions
* tag 'wireless-next-2024-06-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (100 commits)
wifi: brcmfmac: of: Support interrupts-extended
wifi: brcmsmac: advertise MFP_CAPABLE to enable WPA3
net: rfkill: Correct return value in invalid parameter case
wifi: mac80211: fix NULL dereference at band check in starting tx ba session
wifi: iwlwifi: mvm: fix rs.h kernel-doc
wifi: iwlwifi: fw: api: datapath: fix kernel-doc
wifi: iwlwifi: fix remaining mistagged kernel-doc comments
wifi: iwlwifi: fix prototype mismatch kernel-doc warnings
wifi: iwlwifi: fix kernel-doc in iwl-fh.h
wifi: iwlwifi: fix kernel-doc in iwl-trans.h
wifi: iwlwifi: pcie: fix kernel-doc
wifi: iwlwifi: dvm: fix kernel-doc warnings
wifi: iwlwifi: mvm: don't log error for failed UATS table read
wifi: iwlwifi: trans: make bad state warnings
wifi: iwlwifi: fw: api: fix some kernel-doc
wifi: iwlwifi: mvm: remove init_dbg module parameter
wifi: iwlwifi: update the BA notification API
wifi: iwlwifi: mvm: always unblock EMLSR on ROC end
wifi: iwlwifi: mvm: use IWL_FW_CHECK for link ID check
wifi: iwlwifi: mvm: don't flush BSSes on restart with MLD API
...
====================
Link: https://patch.msgid.link/20240627114135.28507-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add Raspberry Pi compressed RAW Bayer formats.
The compression algorithm description is provided by Nick Hollinghurst
<nick.hollinghurst@raspberrypi.com> from Raspberry Pi.
Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
Reviewed-by: Naushir Patuck <naush@raspberrypi.com>
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
|
|
Add format description for the PiSP Back End configuration parameter
buffer.
Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
Reviewed-by: Naushir Patuck <naush@raspberrypi.com>
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
|
|
Add the Raspberry Pi PiSP Back End uAPI header.
The header defines the data type used to configure the PiSP Back End
ISP.
The detailed description of the types and of the ISP configuration
procedure is available at
https://datasheets.raspberrypi.com/camera/raspberry-pi-image-signal-processor-specification.pdf
Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
|
|
Add BGR48 and RGB48 16-bit per component image formats.
Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
Reviewed-by: Kieran Bingham <kieran.bingham@ideasonboard.com>
Reviewed-by: Naushir Patuck <naush@raspberrypi.com>
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
|
|
Add the ability to send out RFC-3948 NAT keepalives from the xfrm stack.
To use it, userspace sets an XFRM_NAT_KEEPALIVE_INTERVAL integer property
when creating XFRM outbound states, which denotes the number of seconds
between keepalive messages.
Keepalive messages are sent from a per net delayed work which iterates over
the xfrm states. The logic is guarded by the xfrm state spinlock due to the
xfrm state walk iterator.
Possible future enhancements:
- Adding counters to keep track of sent keepalives.
- deduplicate NAT keepalives between states sharing the same nat keepalive
parameters.
- provisioning hardware offloads for devices capable of implementing this.
- revise xfrm state list to use an rcu list in order to avoid running this
under spinlock.
Suggested-by: Paul Wouters <paul.wouters@aiven.io>
Tested-by: Paul Wouters <paul.wouters@aiven.io>
Tested-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
The NetDIM library, currently leveraged by an array of NICs, delivers
excellent acceleration benefits. Nevertheless, NICs vary significantly
in their dim profile list prerequisites.
Specifically, virtio-net backends may present diverse sw or hw device
implementations, making a one-size-fits-all parameter list impractical.
On Alibaba Cloud, the virtio DPU's performance under the default DIM
profile falls short of expectations, partly due to a mismatch in
parameter configuration.
I also noticed that ice/idpf/ena and other NICs have customized
profile lists or placed some restrictions on dim capabilities.
Motivated by this, I tried adding new params for "ethtool -C" that provide
per-device control to modify and access a device's interrupt parameters.
Usage
========
The target NIC is named ethx.
Assume that ethx only declares support for rx profile setting
(with DIM_PROFILE_RX flag set in profile_flags) and supports modification
of usec and pkt fields.
1. Query the currently customized list of the device
$ ethtool -c ethx
...
rx-profile:
{.usec = 1, .pkts = 256, .comps = n/a,},
{.usec = 8, .pkts = 256, .comps = n/a,},
{.usec = 64, .pkts = 256, .comps = n/a,},
{.usec = 128, .pkts = 256, .comps = n/a,},
{.usec = 256, .pkts = 256, .comps = n/a,}
tx-profile: n/a
2. Tune
$ ethtool -C ethx rx-profile 1,1,n_2,n,n_3,3,n_4,4,n_n,5,n
"n" means do not modify this field.
$ ethtool -c ethx
...
rx-profile:
{.usec = 1, .pkts = 1, .comps = n/a,},
{.usec = 2, .pkts = 256, .comps = n/a,},
{.usec = 3, .pkts = 3, .comps = n/a,},
{.usec = 4, .pkts = 4, .comps = n/a,},
{.usec = 256, .pkts = 5, .comps = n/a,}
tx-profile: n/a
3. Hint
If the device does not support some type of customized dim profiles,
the corresponding "n/a" will be displayed.
If the "n/a" field is being modified, -EOPNOTSUPP will be reported.
Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20240621101353.107425-4-hengqi@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The secmark context is artificially limited to 256 bytes; raise it to 4 Kbytes.
Fixes: fb961945457f ("netfilter: nf_tables: add SECMARK support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Add ioctl()s to translate pids between pid namespaces.
LXCFS is a tiny fuse filesystem used to virtualize various aspects of
procfs. LXCFS is run on the host. The files and directories it creates
can be bind-mounted by e.g. a container at startup and mounted over the
various procfs files the container wishes to have virtualized. When e.g.
a read request for uptime is received, LXCFS will receive the pid of the
reader. In order to virtualize the corresponding read, LXCFS needs to
know the pid of the init process of the reader's pid namespace. In order
to do this, LXCFS first needs to fork() two helper processes. The first
helper process calls setns() to enter the reader's pid namespace. The second helper
process is needed to create a process that is a proper member of the pid
namespace. The second helper process then creates a ucred message with
ucred.pid set to 1 and sends it back to LXCFS. The kernel will translate
the ucred.pid field to the corresponding pid number in LXCFS's pid
namespace. This way LXCFS can learn the init pid number of the reader's
pid namespace and can go on to virtualize. Since these two forks() are
costly LXCFS maintains an init pid cache that caches a given pid for a
fixed amount of time. The cache is pruned during new read requests.
However, even with the cache the hit of the two forks() is significant
when a very large number of containers are running. With this simple
patch we add an ns ioctl that lets a caller retrieve the init pid nr of
a pid namespace through its pid namespace fd. This significantly
improves performance with a very simple change.
Support translation of pids and tgids. Other concepts can be added but
there are no obvious users for this right now.
To protect against races pidfds can be used to check whether the process
is still valid. If needed, this can also be extended to work on pidfds
directly.
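Under those semantics, the LXCFS case above reduces to a sketch like this
(ioctl names are assumptions; <reader> stands for the reader's pid):

int nsfd = open("/proc/<reader>/ns/pid", O_RDONLY);
/* pid 1 in that pid namespace, translated into our view */
pid_t init_nr = ioctl(nsfd, NS_GET_PID_FROM_PIDNS, 1);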
Link: https://lore.kernel.org/r/20240619-work-ns_ioctl-v1-1-7c0097e6bb6b@kernel.org
Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
https://gitlab.freedesktop.org/drm/misc/kernel into drm-next
drm-misc-next for 6.10:
UAPI Changes:
Cross-subsystem Changes:
- dma-buf: Warn when reserving 0 fence slots, internal API
enhancements for heaps
Core Changes:
Driver Changes:
- atmel-hlcdc: Support XLCDC in sam9x7
- msm: Validate registers XML description against schema in CI
- v3d: Fix build warning
- bridges:
- analogix_dp: Various improvements
- panels:
- New panel: WL-355608-A8
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <mripard@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240606-vivid-amphibian-jackrabbit-40b1d1@houat
|
|
Extend the statx system call to return additional info for atomic write
support for a file.
Helper function generic_fill_statx_atomic_writes() can be used by FSes to
fill in the relevant statx fields. For now atomic_write_segments_max will
always be 1, otherwise some rules would need to be imposed on iovec length
and alignment, which we don't want now.
Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
jpg: relocate bdev support to another patch
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-5-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
An atomic write is a write issued with torn-write protection, meaning
that in the event of a power failure or any other hardware failure, all or
none of the data from the write will be stored, but never a mix of old and
new data.
Userspace may add flag RWF_ATOMIC to pwritev2() to indicate that the
write is to be issued with torn-write prevention, according to special
alignment and length rules.
For any syscall interface utilizing struct iocb, add IOCB_ATOMIC to the
iocb->ki_flags field to indicate the same.
A call to statx will give the relevant atomic write info for a file:
- atomic_write_unit_min
- atomic_write_unit_max
- atomic_write_segments_max
Both min and max values must be a power-of-2.
Applications can avail of the atomic write feature by ensuring that the total
length of a write is a power-of-2 in size and also sized between
atomic_write_unit_min and atomic_write_unit_max, inclusive. Applications
must ensure that the write is at a naturally-aligned offset in the file
wrt the total write length. The value in atomic_write_segments_max
indicates the upper limit for IOV_ITER iovcnt.
Add file mode flag FMODE_CAN_ATOMIC_WRITE, so files which do not have the
flag set will have RWF_ATOMIC rejected and not just ignored.
Add a type argument to kiocb_set_rw_flags() to allow reads which have
RWF_ATOMIC set to be rejected.
Helper function generic_atomic_write_valid() can be used by FSes to verify
compliant writes. There we check that the iov_iter type is ubuf, which
implies iovcnt==1 for pwritev2(), an initial restriction for
atomic_write_segments_max. Initially the only user will be bdev file
operations write handler. We will rely on the block BIO submission path to
ensure write sizes are compliant for the bdev, so we don't need to check
atomic writes sizes yet.
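A compliant write then looks like the sketch below (buf, len, and offset
must respect the statx-reported limits described above):

struct iovec iov = { .iov_base = buf, .iov_len = len };
/* len: power-of-2 within [atomic_write_unit_min, atomic_write_unit_max];
 * offset: naturally aligned wrt len; iovcnt limited by
 * atomic_write_segments_max (initially 1, i.e. ubuf-style writes)
 */
ssize_t ret = pwritev2(fd, &iov, 1, offset, RWF_ATOMIC);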
Signed-off-by: Prasad Singamsetty <prasad.singamsetty@oracle.com>
jpg: merge into single patch and much rewrite
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20240620125359.2684798-4-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
With the new ISO 15765-2:2024 release the former documentation and comments
have to be reworked. This patch removes the ISO specification version/date
where possible.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Acked-by: Francesco Valla <valla.francesco@gmail.com>
Link: https://lore.kernel.org/all/20240420194746.4885-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
|
|
IORING_OP_LISTEN provides the semantic of listen(2) via io_uring. While
this is an essentially synchronous system call, the main point is to
enable a network path to execute fully with io_uring registered and
descriptorless files.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20240614163047.31581-4-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
IORING_OP_BIND provides the semantic of bind(2) via io_uring. While
this is an essentially synchronous system call, the main point is to
enable a network path to execute fully with io_uring registered and
descriptorless files.
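Combined with IORING_OP_LISTEN above and direct descriptors, the whole
setup path can be chained; a sketch using liburing-style prep helpers
(helper names for the new ops are assumptions; ring and addr set up
elsewhere):

struct io_uring_sqe *sqe;

sqe = io_uring_get_sqe(&ring);          /* direct (descriptorless) socket */
io_uring_prep_socket_direct(sqe, AF_INET, SOCK_STREAM, 0, 0, 0);
sqe->flags |= IOSQE_IO_LINK;

sqe = io_uring_get_sqe(&ring);          /* IORING_OP_BIND on fixed file 0 */
io_uring_prep_bind(sqe, 0, (struct sockaddr *)&addr, sizeof(addr));
sqe->flags |= IOSQE_FIXED_FILE | IOSQE_IO_LINK;

sqe = io_uring_get_sqe(&ring);          /* IORING_OP_LISTEN */
io_uring_prep_listen(sqe, 0, SOMAXCONN);
sqe->flags |= IOSQE_FIXED_FILE;

io_uring_submit(&ring);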
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20240614163047.31581-3-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Implement a new scheduler class sched_ext (SCX), which allows scheduling
policies to be implemented as BPF programs to achieve the following:
1. Ease of experimentation and exploration: Enabling rapid iteration of new
scheduling policies.
2. Customization: Building application-specific schedulers which implement
policies that are not applicable to general-purpose schedulers.
3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling
policies in production environments.
sched_ext leverages BPF’s struct_ops feature to define a structure which
exports function callbacks and flags to BPF programs that wish to implement
scheduling policies. The struct_ops structure exported by sched_ext is
struct sched_ext_ops, and is conceptually similar to struct sched_class. The
role of sched_ext is to map the complex sched_class callbacks to the more
simple and ergonomic struct sched_ext_ops callbacks.
For more detailed discussion on the motivations and overview, please refer
to the cover letter.
Later patches will also add several example schedulers and documentation.
This patch implements the minimum core framework to enable implementation of
BPF schedulers. Subsequent patches will gradually add functionalities
including safety guarantee mechanisms, nohz and cgroup support.
include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on
top, each operation should be self-explanatory. The following are worth
noting:
- Both "sched_ext" and its shorthand "scx" are used. If the identifier
already has "sched" in it, "ext" is used; otherwise, "scx".
- In sched_ext_ops, only .name is mandatory. Every operation is optional and
if omitted a simple but functional default behavior is provided.
- A new policy constant SCHED_EXT is added and a task can select sched_ext
by invoking sched_setscheduler(2) with the new policy constant. However,
if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL
and the task is scheduled by CFS. When the BPF scheduler is loaded, all
tasks which have the SCHED_EXT policy are switched to sched_ext.
- To bridge the workflow imbalance between the scheduler core and
sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch
queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and
one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for
convenience and need not be used by a scheduler that doesn't require it.
SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting
the next task on the CPU. The BPF scheduler can manage an arbitrary number
of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().
- sched_ext guarantees system integrity no matter what the BPF scheduler
does. To enable this, each task's ownership is tracked through
p->scx.ops_state and all tasks are put on scx_tasks list. The disable path
can always recover and revert all tasks back to CFS. See p->scx.ops_state
and scx_tasks.
- A task is not tied to its rq while enqueued. This decouples CPU selection
from queueing and allows sharing a scheduling queue across an arbitrary
subset of CPUs. This adds some complexities as a task may need to be
bounced between rq's right before it starts executing. See
dispatch_to_local_dsq() and move_task_to_local_dsq().
- One complication that arises from the above weak association between task
and rq is that synchronizing with dequeue() gets complicated as dequeue()
may happen anytime while the task is enqueued and the dispatch path might
need to release the rq lock to transfer the task. Solving this requires a
bit of complexity. See the logic around p->scx.sticky_cpu and
p->scx.ops_qseq.
- Both enable and disable paths are a bit complicated. The enable path
switches all tasks without blocking to avoid issues which can arise from
partially switched states (e.g. the switching task itself being starved).
The disable path can't trust the BPF scheduler at all, so it also has to
guarantee forward progress without blocking. See scx_ops_enable() and
scx_ops_disable_workfn().
- When sched_ext is disabled, static_branches are used to shut down the
entry points from hot paths.
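For orientation, a minimal BPF scheduler under this framework can be tiny;
a sketch using the kfuncs and constants described above (SCX_SLICE_DFL and
the struct_ops boilerplate are assumptions of this sketch, not part of the
text above):

void BPF_STRUCT_OPS(minimal_enqueue, struct task_struct *p, u64 enq_flags)
{
        /* put every task on the global FIFO; sched_ext pulls tasks into
         * the local per-CPU dsq from there
         */
        scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

SEC(".struct_ops.link")
struct sched_ext_ops minimal_ops = {
        .enqueue = (void *)minimal_enqueue,
        .name    = "minimal",        /* .name is the only mandatory field */
};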
v7: - scx_ops_bypass() was incorrectly and unnecessarily trying to grab
scx_ops_enable_mutex which can lead to deadlocks in the disable path.
Fixed.
- Fixed TASK_DEAD handling bug in scx_ops_enable() path which could lead
to use-after-free.
- Consolidated per-cpu variable usages and other cleanups.
v6: - SCX_NR_ONLINE_OPS replaced with SCX_OPI_*_BEGIN/END so that multiple
groups can be expressed. Later CPU hotplug operations are put into
their own group.
- SCX_OPS_DISABLING state is replaced with the new bypass mechanism
which allows temporarily putting the system into simple FIFO
scheduling mode bypassing the BPF scheduler. In addition to the shut
down path, this will also be used to isolate the BPF scheduler across
PM events. Enabling and disabling the bypass mode requires iterating
all runnable tasks. rq->scx.runnable_list addition is moved from the
later watchdog patch.
- ops.prep_enable() is replaced with ops.init_task() and
ops.enable/disable() are now called whenever the task enters and
leaves sched_ext instead of when the task becomes schedulable on
sched_ext and stops being so. A new operation - ops.exit_task() - is
called when the task stops being schedulable on sched_ext.
- scx_bpf_dispatch() can now be called from ops.select_cpu() too. This
removes the need for communicating local dispatch decision made by
ops.select_cpu() to ops.enqueue() via per-task storage.
SCX_KF_SELECT_CPU is added to support the change.
- SCX_TASK_ENQ_LOCAL which told the BPF scheduler that
scx_select_cpu_dfl() wants the task to be dispatched to the local DSQ
was removed. Instead, scx_bpf_select_cpu_dfl() now dispatches directly
if it finds a suitable idle CPU. If such behavior is not desired,
users can use scx_bpf_select_cpu_dfl() which returns the verdict in a
bool out param.
- scx_select_cpu_dfl() was mishandling WAKE_SYNC and could end up
queueing many tasks on a local DSQ which makes tasks to execute in
order while other CPUs stay idle which made some hackbench numbers
really bad. Fixed.
- The current state of sched_ext can now be monitored through files
under /sys/sched_ext instead of /sys/kernel/debug/sched/ext. This is
to enable monitoring on kernels which don't enable debugfs.
- sched_ext wasn't telling BPF that ops.dispatch()'s @prev argument may
be NULL and a BPF scheduler which derefs the pointer without checking
could crash the kernel. Tell BPF. This is currently a bit ugly. A
better way to annotate this is expected in the future.
- scx_exit_info updated to carry pointers to message buffers instead of
embedding them directly. This decouples buffer sizes from API so that
they can be changed without breaking compatibility.
- exit_code added to scx_exit_info. This is used to indicate different
exit conditions on non-error exits and will be used to handle e.g. CPU
hotplugs.
- The patch "sched_ext: Allow BPF schedulers to switch all eligible
tasks into sched_ext" is folded in and the interface is changed so
that partial switching is indicated with a new ops flag
%SCX_OPS_SWITCH_PARTIAL. This makes scx_bpf_switch_all() unnecessary
and in turn SCX_KF_INIT. ops.init() is now called with
SCX_KF_SLEEPABLE.
- Code reorganized so that only the parts necessary to integrate with
the rest of the kernel are in the header files.
- Changes to reflect the BPF and other kernel changes including the
addition of bpf_sched_ext_ops.cfi_stubs.
v5: - To accommodate 32bit configs, p->scx.ops_state is now atomic_long_t
instead of atomic64_t and scx_dsp_buf_ent.qseq which uses
load_acquire/store_release is now unsigned long instead of u64.
- Fix the bug where bpf_scx_btf_struct_access() was allowing write
access to arbitrary fields.
- Distinguish kfuncs which can be called from any sched_ext ops and from
anywhere. e.g. scx_bpf_pick_idle_cpu() can now be called only from
sched_ext ops.
- Rename "type" to "kind" in scx_exit_info to make it easier to use on
languages in which "type" is a reserved keyword.
- Since cff9b2332ab7 ("kernel/sched: Modify initial boot task idle
setup"), PF_IDLE is not set on idle tasks which haven't been online
yet which made scx_task_iter_next_filtered() include those idle tasks
in iterations leading to oopses. Update scx_task_iter_next_filtered()
to directly test p->sched_class against idle_sched_class instead of
using is_idle_task() which tests PF_IDLE.
- Other updates to match upstream changes such as adding const to
set_cpumask() param and renaming check_preempt_curr() to
wakeup_preempt().
v4: - SCHED_CHANGE_BLOCK replaced with the previous
sched_deq_and_put_task()/sched_enq_and_set_task() pair. This is
because upstream is adopting a different generic cleanup mechanism.
Once that lands, the code will be adapted accordingly.
- task_on_scx() used to test whether a task should be switched into SCX,
which is confusing. Renamed to task_should_scx(). task_on_scx() now
tests whether a task is currently on SCX.
- scx_has_idle_cpus is barely used anymore and replaced with direct
check on the idle cpumask.
- SCX_PICK_IDLE_CORE added and scx_pick_idle_cpu() improved to prefer
fully idle cores.
- ops.enable() now sees up-to-date p->scx.weight value.
- ttwu_queue path is disabled for tasks on SCX to avoid confusing BPF
schedulers expecting ->select_cpu() call.
- Use cpu_smt_mask() instead of topology_sibling_cpumask() like the rest
of the scheduler.
v3: - ops.set_weight() added to allow BPF schedulers to track weight changes
without polling p->scx.weight.
- move_task_to_local_dsq() was losing SCX-specific enq_flags when
enqueueing the task on the target dsq because it goes through
activate_task() which loses the upper 32bit of the flags. Carry the
flags through rq->scx.extra_enq_flags.
- scx_bpf_dispatch(), scx_bpf_pick_idle_cpu(), scx_bpf_task_running()
and scx_bpf_task_cpu() now use the new KF_RCU instead of
KF_TRUSTED_ARGS to make it easier for BPF schedulers to call them.
- The kfunc helper access control mechanism implemented through
sched_ext_entity.kf_mask is improved. Now SCX_CALL_OP*() is always
used when invoking scx_ops operations.
v2: - balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is
called from put_prev_taks_scx() and pick_next_task_scx() as necessary.
To determine whether balance_scx() should be called from
put_prev_task_scx(), SCX_TASK_DEQD_FOR_SLEEP flag is added. See the
comment in put_prev_task_scx() for details.
- sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced
with SCHED_CHANGE_BLOCK().
- Unused all_dsqs list removed. This was a left-over from previous
iterations.
- p->scx.kf_mask is added to track and enforce which kfunc helpers are
allowed. Also, init/exit sequences are updated to make some kfuncs
always safe to call regardless of the current BPF scheduler state.
Combined, this should make all the kfuncs safe.
- BPF now supports sleepable struct_ops operations. Hacky workaround
removed and operations and kfunc helpers are tagged appropriately.
- BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask()
and friends are added so that BPF schedulers can use the idle masks
with the generic helpers. This replaces the hacky kfunc helpers added
by a separate patch in V1.
- CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is
enabled. This restriction will be removed by a later patch which adds
core-sched support.
- Add MAINTAINERS entries and other misc changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Co-authored-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
Cc: Andrea Righi <andrea.righi@canonical.com>
|
|
Ensure that any new KVM code that references immediate_exit gets extra
scrutiny by renaming it to immediate_exit__unsafe in kernel code.
All fields in struct kvm_run are subject to TOCTOU races since they are
mapped into userspace, which may be malicious or buggy. To protect KVM,
introduces a new macro that appends __unsafe to select field names in
struct kvm_run, hinting to developers and reviewers that accessing such
fields must be done carefully.
Apply the new macro to immediate_exit, since userspace can make
immediate_exit inconsistent with vcpu->wants_to_run, i.e. accessing
immediate_exit directly could lead to unexpected bugs in the future.
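The pattern is roughly the following (illustrative macro; the actual
spelling in the patch may differ):

/* make direct accesses to TOCTOU-prone kvm_run fields stand out */
#define SUFFIX_UNSAFE_FIELD(name) name##__unsafe

struct kvm_run {
        /* ... */
        __u8 SUFFIX_UNSAFE_FIELD(immediate_exit); /* immediate_exit__unsafe */
        /* ... */
};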
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240503181734.1467938-3-dmatlack@google.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
This patch allows creating smc sockets via AF_INET,
similar to the following code:
/* create v4 smc sock */
v4 = socket(AF_INET, SOCK_STREAM, IPPROTO_SMC);
/* create v6 smc sock */
v6 = socket(AF_INET6, SOCK_STREAM, IPPROTO_SMC);
There are several reasons why we believe it is appropriate here:
1. For smc sockets, it actually uses IPv4 (AF_INET) or IPv6 (AF_INET6)
addresses. There is no AF_SMC address at all.
2. Creating the smc socket in the AF_INET(6) path allows us to reuse
the infrastructure of the AF_INET(6) path, such as common ebpf hooks.
Otherwise, smc would have to implement it again in the AF_SMC path.
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>
Tested-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|