diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2023-06-28 16:43:10 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2023-06-28 16:43:10 -0700 |
commit | 3a8a670eeeaa40d87bd38a587438952741980c18 (patch) | |
tree | d5546d311271503eadf75b45d87e12720e72899f /Documentation | |
parent | 6a8cbd9253abc1bd0df4d60c4c24fa555190376d (diff) | |
parent | ae230642190a51b85656d6da2df744d534d59544 (diff) |
Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking changes from Jakub Kicinski:
"WiFi 7 and sendpage changes are the biggest pieces of work for this
release. The latter will definitely require fixes but I think that we
got it to a reasonable point.
Core:
- Rework the sendpage & splice implementations
Instead of feeding data into sockets page by page extend sendmsg
handlers to support taking a reference on the data, controlled by a
new flag called MSG_SPLICE_PAGES
Rework the handling of unexpected-end-of-file to invoke an
additional callback instead of trying to predict what the right
combination of MORE/NOTLAST flags is
Remove the MSG_SENDPAGE_NOTLAST flag completely
- Implement SCM_PIDFD, a new type of CMSG type analogous to
SCM_CREDENTIALS, but it contains pidfd instead of plain pid
- Enable socket busy polling with CONFIG_RT
- Improve reliability and efficiency of reporting for ref_tracker
- Auto-generate a user space C library for various Netlink families
Protocols:
- Allow TCP to shrink the advertised window when necessary, prevent
sk_rcvbuf auto-tuning from growing the window all the way up to
tcp_rmem[2]
- Use per-VMA locking for "page-flipping" TCP receive zerocopy
- Prepare TCP for device-to-device data transfers, by making sure
that payloads are always attached to skbs as page frags
- Make the backoff time for the first N TCP SYN retransmissions
linear. Exponential backoff is unnecessarily conservative
- Create a new MPTCP getsockopt to retrieve all info
(MPTCP_FULL_INFO)
- Avoid waking up applications using TLS sockets until we have a full
record
- Allow using kernel memory for protocol ioctl callbacks, paving the
way to issuing ioctls over io_uring
- Add nolocalbypass option to VxLAN, forcing packets to be fully
encapsulated even if they are destined for a local IP address
- Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure
in-kernel ECMP implementation (e.g. Open vSwitch) select the same
link for all packets. Support L4 symmetric hashing in Open vSwitch
- PPPoE: make number of hash bits configurable
- Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client
(ipconfig)
- Add layer 2 miss indication and filtering, allowing higher layers
(e.g. ACL filters) to make forwarding decisions based on whether
packet matched forwarding state in lower devices (bridge)
- Support matching on Connectivity Fault Management (CFM) packets
- Hide the "link becomes ready" IPv6 messages by demoting their
printk level to debug
- HSR: don't enable promiscuous mode if device offloads the proto
- Support active scanning in IEEE 802.15.4
- Continue work on Multi-Link Operation for WiFi 7
BPF:
- Add precision propagation for subprogs and callbacks. This allows
maintaining verification efficiency when subprograms are used, or
in fact passing the verifier at all for complex programs,
especially those using open-coded iterators
- Improve BPF's {g,s}setsockopt() length handling. Previously BPF
assumed the length is always equal to the amount of written data.
But some protos allow passing a NULL buffer to discover what the
output buffer *should* be, without writing anything
- Accept dynptr memory as memory arguments passed to helpers
- Add routing table ID to bpf_fib_lookup BPF helper
- Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands
- Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark
maps as read-only)
- Show target_{obj,btf}_id in tracing link fdinfo
- Addition of several new kfuncs (most of the names are
self-explanatory):
- Add a set of new dynptr kfuncs: bpf_dynptr_adjust(),
bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size()
and bpf_dynptr_clone().
- bpf_task_under_cgroup()
- bpf_sock_destroy() - force closing sockets
- bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs
Netfilter:
- Relax set/map validation checks in nf_tables. Allow checking
presence of an entry in a map without using the value
- Increase ip_vs_conn_tab_bits range for 64BIT builds
- Allow updating size of a set
- Improve NAT tuple selection when connection is closing
Driver API:
- Integrate netdev with LED subsystem, to allow configuring HW
"offloaded" blinking of LEDs based on link state and activity
(i.e. packets coming in and out)
- Support configuring rate selection pins of SFP modules
- Factor Clause 73 auto-negotiation code out of the drivers, provide
common helper routines
- Add more fool-proof helpers for managing lifetime of MDIO devices
associated with the PCS layer
- Allow drivers to report advanced statistics related to Time Aware
scheduler offload (taprio)
- Allow opting out of VF statistics in link dump, to allow more VFs
to fit into the message
- Split devlink instance and devlink port operations
New hardware / drivers:
- Ethernet:
- Synopsys EMAC4 IP support (stmmac)
- Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches
- Marvell 88E6250 7 port switches
- Microchip LAN8650/1 Rev.B0 PHYs
- MediaTek MT7981/MT7988 built-in 1GE PHY driver
- WiFi:
- Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps
- Realtek RTL8723DS (SDIO variant)
- Realtek RTL8851BE
- CAN:
- Fintek F81604
Drivers:
- Ethernet NICs:
- Intel (100G, ice):
- support dynamic interrupt allocation
- use meta data match instead of VF MAC addr on slow-path
- nVidia/Mellanox:
- extend link aggregation to handle 4, rather than just 2 ports
- spawn sub-functions without any features by default
- OcteonTX2:
- support HTB (Tx scheduling/QoS) offload
- make RSS hash generation configurable
- support selecting Rx queue using TC filters
- Wangxun (ngbe/txgbe):
- add basic Tx/Rx packet offloads
- add phylink support (SFP/PCS control)
- Freescale/NXP (enetc):
- report TAPRIO packet statistics
- Solarflare/AMD:
- support matching on IP ToS and UDP source port of outer
header
- VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6
- add devlink dev info support for EF10
- Virtual NICs:
- Microsoft vNIC:
- size the Rx indirection table based on requested
configuration
- support VLAN tagging
- Amazon vNIC:
- try to reuse Rx buffers if not fully consumed, useful for ARM
servers running with 16kB pages
- Google vNIC:
- support TCP segmentation of >64kB frames
- Ethernet embedded switches:
- Marvell (mv88e6xxx):
- enable USXGMII (88E6191X)
- Microchip:
- lan966x: add support for Egress Stage 0 ACL engine
- lan966x: support mapping packet priority to internal switch
priority (based on PCP or DSCP)
- Ethernet PHYs:
- Broadcom PHYs:
- support for Wake-on-LAN for BCM54210E/B50212E
- report LPI counter
- Microsemi PHYs: support RGMII delay configuration (VSC85xx)
- Micrel PHYs: receive timestamp in the frame (LAN8841)
- Realtek PHYs: support optional external PHY clock
- Altera TSE PCS: merge the driver into Lynx PCS which it is a
variant of
- CAN: Kvaser PCIEcan:
- support packet timestamping
- WiFi:
- Intel (iwlwifi):
- major update for new firmware and Multi-Link Operation (MLO)
- configuration rework to drop test devices and split the
different families
- support for segmented PNVM images and power tables
- new vendor entries for PPAG (platform antenna gain) feature
- Qualcomm 802.11ax (ath11k):
- Multiple Basic Service Set Identifier (MBSSID) and Enhanced
MBSSID Advertisement (EMA) support in AP mode
- support factory test mode
- RealTek (rtw89):
- add RSSI based antenna diversity
- support U-NII-4 channels on 5 GHz band
- RealTek (rtl8xxxu):
- AP mode support for 8188f
- support USB RX aggregation for the newer chips"
* tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits)
net: scm: introduce and use scm_recv_unix helper
af_unix: Skip SCM_PIDFD if scm->pid is NULL.
net: lan743x: Simplify comparison
netlink: Add __sock_i_ino() for __netlink_diag_dump().
net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses
Revert "af_unix: Call scm_recv() only after scm_set_cred()."
phylink: ReST-ify the phylink_pcs_neg_mode() kdoc
libceph: Partially revert changes to support MSG_SPLICE_PAGES
net: phy: mscc: fix packet loss due to RGMII delays
net: mana: use vmalloc_array and vcalloc
net: enetc: use vmalloc_array and vcalloc
ionic: use vmalloc_array and vcalloc
pds_core: use vmalloc_array and vcalloc
gve: use vmalloc_array and vcalloc
octeon_ep: use vmalloc_array and vcalloc
net: usb: qmi_wwan: add u-blox 0x1312 composition
perf trace: fix MSG_SPLICE_PAGES build error
ipvlan: Fix return value of ipvlan_queue_xmit()
netfilter: nf_tables: fix underflow in chain reference counter
netfilter: nf_tables: unbind non-anonymous set if rule construction fails
...
Diffstat (limited to 'Documentation')
56 files changed, 2211 insertions, 227 deletions
diff --git a/Documentation/ABI/testing/sysfs-class-led-trigger-netdev b/Documentation/ABI/testing/sysfs-class-led-trigger-netdev index 646540950e38..78b62a23b14a 100644 --- a/Documentation/ABI/testing/sysfs-class-led-trigger-netdev +++ b/Documentation/ABI/testing/sysfs-class-led-trigger-netdev @@ -13,6 +13,11 @@ Description: Specifies the duration of the LED blink in milliseconds. Defaults to 50 ms. + With hw_control ON, the interval value MUST be set to the + default value and cannot be changed. + Trying to set any value in this specific mode will return + an EINVAL error. + What: /sys/class/leds/<led>/link Date: Dec 2017 KernelVersion: 4.16 @@ -39,6 +44,9 @@ Description: If set to 1, the LED will blink for the milliseconds specified in interval to signal transmission. + With hw_control ON, the blink interval is controlled by hardware + and won't reflect the value set in interval. + What: /sys/class/leds/<led>/rx Date: Dec 2017 KernelVersion: 4.16 @@ -50,3 +58,84 @@ Description: If set to 1, the LED will blink for the milliseconds specified in interval to signal reception. + + With hw_control ON, the blink interval is controlled by hardware + and won't reflect the value set in interval. + +What: /sys/class/leds/<led>/hw_control +Date: Jun 2023 +KernelVersion: 6.5 +Contact: linux-leds@vger.kernel.org +Description: + Communicate whether the LED trigger modes are driven by hardware + or software fallback is used. + + If 0, the LED is using software fallback to blink. + + If 1, the LED is using hardware control to blink and signal the + requested modes. + +What: /sys/class/leds/<led>/link_10 +Date: Jun 2023 +KernelVersion: 6.5 +Contact: linux-leds@vger.kernel.org +Description: + Signal the link speed state of 10Mbps of the named network device. + + If set to 0 (default), the LED's normal state is off. + + If set to 1, the LED's normal state reflects the link state + speed of 10MBps of the named network device. + Setting this value also immediately changes the LED state. + +What: /sys/class/leds/<led>/link_100 +Date: Jun 2023 +KernelVersion: 6.5 +Contact: linux-leds@vger.kernel.org +Description: + Signal the link speed state of 100Mbps of the named network device. + + If set to 0 (default), the LED's normal state is off. + + If set to 1, the LED's normal state reflects the link state + speed of 100Mbps of the named network device. + Setting this value also immediately changes the LED state. + +What: /sys/class/leds/<led>/link_1000 +Date: Jun 2023 +KernelVersion: 6.5 +Contact: linux-leds@vger.kernel.org +Description: + Signal the link speed state of 1000Mbps of the named network device. + + If set to 0 (default), the LED's normal state is off. + + If set to 1, the LED's normal state reflects the link state + speed of 1000Mbps of the named network device. + Setting this value also immediately changes the LED state. + +What: /sys/class/leds/<led>/half_duplex +Date: Jun 2023 +KernelVersion: 6.5 +Contact: linux-leds@vger.kernel.org +Description: + Signal the link half duplex state of the named network device. + + If set to 0 (default), the LED's normal state is off. + + If set to 1, the LED's normal state reflects the link half + duplex state of the named network device. + Setting this value also immediately changes the LED state. + +What: /sys/class/leds/<led>/full_duplex +Date: Jun 2023 +KernelVersion: 6.5 +Contact: linux-leds@vger.kernel.org +Description: + Signal the link full duplex state of the named network device. + + If set to 0 (default), the LED's normal state is off. + + If set to 1, the LED's normal state reflects the link full + duplex state of the named network device. + Setting this value also immediately changes the LED state. diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst index 466c560b0c30..4877563241f3 100644 --- a/Documentation/admin-guide/sysctl/net.rst +++ b/Documentation/admin-guide/sysctl/net.rst @@ -386,8 +386,8 @@ Default : 0 (for compatibility reasons) txrehash -------- -Controls default hash rethink behaviour on listening socket when SO_TXREHASH -option is set to SOCK_TXREHASH_DEFAULT (i. e. not overridden by setsockopt). +Controls default hash rethink behaviour on socket when SO_TXREHASH option is set +to SOCK_TXREHASH_DEFAULT (i. e. not overridden by setsockopt). If set to 1 (default), hash rethink is performed on listening socket. If set to 0, hash rethink is not performed. diff --git a/Documentation/bpf/bpf_iterators.rst b/Documentation/bpf/bpf_iterators.rst index 6d7770793fab..07433915aa41 100644 --- a/Documentation/bpf/bpf_iterators.rst +++ b/Documentation/bpf/bpf_iterators.rst @@ -238,11 +238,8 @@ The following is the breakdown for each field in struct ``bpf_iter_reg``. that the kernel function cond_resched() is called to avoid other kernel subsystem (e.g., rcu) misbehaving. * - seq_info - - Specifies certain action requests in the kernel BPF iterator - infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means - that the kernel function cond_resched() is called to avoid other kernel - subsystem (e.g., rcu) misbehaving. - + - Specifies the set of seq operations for the BPF iterator and helpers to + initialize/free the private data for the corresponding ``seq_file``. `Click here <https://lore.kernel.org/bpf/20210212183107.50963-2-songliubraving@fb.com/>`_ diff --git a/Documentation/bpf/cpumasks.rst b/Documentation/bpf/cpumasks.rst index 41efd8874eeb..3139c7c02e79 100644 --- a/Documentation/bpf/cpumasks.rst +++ b/Documentation/bpf/cpumasks.rst @@ -351,14 +351,15 @@ In addition to the above kfuncs, there is also a set of read-only kfuncs that can be used to query the contents of cpumasks. .. kernel-doc:: kernel/bpf/cpumask.c - :identifiers: bpf_cpumask_first bpf_cpumask_first_zero bpf_cpumask_test_cpu + :identifiers: bpf_cpumask_first bpf_cpumask_first_zero bpf_cpumask_first_and + bpf_cpumask_test_cpu .. kernel-doc:: kernel/bpf/cpumask.c :identifiers: bpf_cpumask_equal bpf_cpumask_intersects bpf_cpumask_subset bpf_cpumask_empty bpf_cpumask_full .. kernel-doc:: kernel/bpf/cpumask.c - :identifiers: bpf_cpumask_any bpf_cpumask_any_and + :identifiers: bpf_cpumask_any_distribute bpf_cpumask_any_and_distribute ---- diff --git a/Documentation/bpf/instruction-set.rst b/Documentation/bpf/instruction-set.rst index 492980ece1ab..6644842cd3ea 100644 --- a/Documentation/bpf/instruction-set.rst +++ b/Documentation/bpf/instruction-set.rst @@ -163,13 +163,13 @@ BPF_MUL 0x20 dst \*= src BPF_DIV 0x30 dst = (src != 0) ? (dst / src) : 0 BPF_OR 0x40 dst \|= src BPF_AND 0x50 dst &= src -BPF_LSH 0x60 dst <<= src -BPF_RSH 0x70 dst >>= src +BPF_LSH 0x60 dst <<= (src & mask) +BPF_RSH 0x70 dst >>= (src & mask) BPF_NEG 0x80 dst = ~src BPF_MOD 0x90 dst = (src != 0) ? (dst % src) : dst BPF_XOR 0xa0 dst ^= src BPF_MOV 0xb0 dst = src -BPF_ARSH 0xc0 sign extending shift right +BPF_ARSH 0xc0 sign extending dst >>= (src & mask) BPF_END 0xd0 byte swap operations (see `Byte swap instructions`_ below) ======== ===== ========================================================== @@ -204,6 +204,9 @@ for ``BPF_ALU64``, 'imm' is first sign extended to 64 bits and the result interpreted as an unsigned 64-bit value. There are no instructions for signed division or modulo. +Shift operations use a mask of 0x3F (63) for 64-bit operations and 0x1F (31) +for 32-bit operations. + Byte swap instructions ~~~~~~~~~~~~~~~~~~~~~~ diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst index ea2516374d92..0d2647fb358d 100644 --- a/Documentation/bpf/kfuncs.rst +++ b/Documentation/bpf/kfuncs.rst @@ -100,7 +100,7 @@ Hence, whenever a constant scalar argument is accepted by a kfunc which is not a size parameter, and the value of the constant matters for program safety, __k suffix should be used. -2.2.2 __uninit Annotation +2.2.3 __uninit Annotation ------------------------- This annotation is used to indicate that the argument will be treated as @@ -117,6 +117,27 @@ Here, the dynptr will be treated as an uninitialized dynptr. Without this annotation, the verifier will reject the program if the dynptr passed in is not initialized. +2.2.4 __opt Annotation +------------------------- + +This annotation is used to indicate that the buffer associated with an __sz or __szk +argument may be null. If the function is passed a nullptr in place of the buffer, +the verifier will not check that length is appropriate for the buffer. The kfunc is +responsible for checking if this buffer is null before using it. + +An example is given below:: + + __bpf_kfunc void *bpf_dynptr_slice(..., void *buffer__opt, u32 buffer__szk) + { + ... + } + +Here, the buffer may be null. If buffer is not null, it at least of size buffer_szk. +Either way, the returned buffer is either NULL, or of size buffer_szk. Without this +annotation, the verifier will reject the program if a null pointer is passed in with +a nonzero size. + + .. _BPF_kfunc_nodef: 2.3 Using an existing kernel function @@ -206,23 +227,49 @@ absolutely no ABI stability guarantees. As mentioned above, a nested pointer obtained from walking a trusted pointer is no longer trusted, with one exception. If a struct type has a field that is -guaranteed to be valid as long as its parent pointer is trusted, the -``BTF_TYPE_SAFE_NESTED`` macro can be used to express that to the verifier as -follows: +guaranteed to be valid (trusted or rcu, as in KF_RCU description below) as long +as its parent pointer is valid, the following macros can be used to express +that to the verifier: + +* ``BTF_TYPE_SAFE_TRUSTED`` +* ``BTF_TYPE_SAFE_RCU`` +* ``BTF_TYPE_SAFE_RCU_OR_NULL`` + +For example, + +.. code-block:: c + + BTF_TYPE_SAFE_TRUSTED(struct socket) { + struct sock *sk; + }; + +or .. code-block:: c - BTF_TYPE_SAFE_NESTED(struct task_struct) { + BTF_TYPE_SAFE_RCU(struct task_struct) { const cpumask_t *cpus_ptr; + struct css_set __rcu *cgroups; + struct task_struct __rcu *real_parent; + struct task_struct *group_leader; }; In other words, you must: -1. Wrap the trusted pointer type in the ``BTF_TYPE_SAFE_NESTED`` macro. +1. Wrap the valid pointer type in a ``BTF_TYPE_SAFE_*`` macro. -2. Specify the type and name of the trusted nested field. This field must match +2. Specify the type and name of the valid nested field. This field must match the field in the original type definition exactly. +A new type declared by a ``BTF_TYPE_SAFE_*`` macro also needs to be emitted so +that it appears in BTF. For example, ``BTF_TYPE_SAFE_TRUSTED(struct socket)`` +is emitted in the ``type_is_trusted()`` function as follows: + +.. code-block:: c + + BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED(struct socket)); + + 2.4.5 KF_SLEEPABLE flag ----------------------- diff --git a/Documentation/bpf/llvm_reloc.rst b/Documentation/bpf/llvm_reloc.rst index ca8957d5b671..e4a777a6a3a2 100644 --- a/Documentation/bpf/llvm_reloc.rst +++ b/Documentation/bpf/llvm_reloc.rst @@ -48,7 +48,7 @@ the code with ``llvm-objdump -dr test.o``:: 14: 0f 10 00 00 00 00 00 00 r0 += r1 15: 95 00 00 00 00 00 00 00 exit -There are four relations in the above for four ``LD_imm64`` instructions. +There are four relocations in the above for four ``LD_imm64`` instructions. The following ``llvm-readelf -r test.o`` shows the binary values of the four relocations:: @@ -79,14 +79,16 @@ The following is the symbol table with ``llvm-readelf -s test.o``:: The 6th entry is global variable ``g1`` with value 0. Similarly, the second relocation is at ``.text`` offset ``0x18``, instruction 3, -for global variable ``g2`` which has a symbol value 4, the offset -from the start of ``.data`` section. - -The third and fourth relocations refers to static variables ``l1`` -and ``l2``. From ``.rel.text`` section above, it is not clear -which symbols they really refers to as they both refers to +has a type of ``R_BPF_64_64`` and refers to entry 7 in the symbol table. +The second relocation resolves to global variable ``g2`` which has a symbol +value 4. The symbol value represents the offset from the start of ``.data`` +section where the initial value of the global variable ``g2`` is stored. + +The third and fourth relocations refer to static variables ``l1`` +and ``l2``. From the ``.rel.text`` section above, it is not clear +to which symbols they really refer as they both refer to symbol table entry 4, symbol ``sec``, which has ``STT_SECTION`` type -and represents a section. So for static variable or function, +and represents a section. So for a static variable or function, the section offset is written to the original insn buffer, which is called ``A`` (addend). Looking at above insn ``7`` and ``11``, they have section offset ``8`` and ``12``. diff --git a/Documentation/bpf/map_hash.rst b/Documentation/bpf/map_hash.rst index 8669426264c6..d2343952f2cb 100644 --- a/Documentation/bpf/map_hash.rst +++ b/Documentation/bpf/map_hash.rst @@ -1,5 +1,6 @@ .. SPDX-License-Identifier: GPL-2.0-only .. Copyright (C) 2022 Red Hat, Inc. +.. Copyright (C) 2022-2023 Isovalent, Inc. =============================================== BPF_MAP_TYPE_HASH, with PERCPU and LRU Variants @@ -29,7 +30,16 @@ will automatically evict the least recently used entries when the hash table reaches capacity. An LRU hash maintains an internal LRU list that is used to select elements for eviction. This internal LRU list is shared across CPUs but it is possible to request a per CPU LRU list with -the ``BPF_F_NO_COMMON_LRU`` flag when calling ``bpf_map_create``. +the ``BPF_F_NO_COMMON_LRU`` flag when calling ``bpf_map_create``. The +following table outlines the properties of LRU maps depending on the a +map type and the flags used to create the map. + +======================== ========================= ================================ +Flag ``BPF_MAP_TYPE_LRU_HASH`` ``BPF_MAP_TYPE_LRU_PERCPU_HASH`` +======================== ========================= ================================ +**BPF_F_NO_COMMON_LRU** Per-CPU LRU, global map Per-CPU LRU, per-cpu map +**!BPF_F_NO_COMMON_LRU** Global LRU, global map Global LRU, per-cpu map +======================== ========================= ================================ Usage ===== @@ -206,3 +216,44 @@ Userspace walking the map elements from the map declared above: cur_key = &next_key; } } + +Internals +========= + +This section of the document is targeted at Linux developers and describes +aspects of the map implementations that are not considered stable ABI. The +following details are subject to change in future versions of the kernel. + +``BPF_MAP_TYPE_LRU_HASH`` and variants +-------------------------------------- + +Updating elements in LRU maps may trigger eviction behaviour when the capacity +of the map is reached. There are various steps that the update algorithm +attempts in order to enforce the LRU property which have increasing impacts on +other CPUs involved in the following operation attempts: + +- Attempt to use CPU-local state to batch operations +- Attempt to fetch free nodes from global lists +- Attempt to pull any node from a global list and remove it from the hashmap +- Attempt to pull any node from any CPU's list and remove it from the hashmap + +This algorithm is described visually in the following diagram. See the +description in commit 3a08c2fd7634 ("bpf: LRU List") for a full explanation of +the corresponding operations: + +.. kernel-figure:: map_lru_hash_update.dot + :alt: Diagram outlining the LRU eviction steps taken during map update. + + LRU hash eviction during map update for ``BPF_MAP_TYPE_LRU_HASH`` and + variants. See the dot file source for kernel function name code references. + +Map updates start from the oval in the top right "begin ``bpf_map_update()``" +and progress through the graph towards the bottom where the result may be +either a successful update or a failure with various error codes. The key in +the top right provides indicators for which locks may be involved in specific +operations. This is intended as a visual hint for reasoning about how map +contention may impact update operations, though the map type and flags may +impact the actual contention on those locks, based on the logic described in +the table above. For instance, if the map is created with type +``BPF_MAP_TYPE_LRU_PERCPU_HASH`` and flags ``BPF_F_NO_COMMON_LRU`` then all map +properties would be per-cpu. diff --git a/Documentation/bpf/map_lru_hash_update.dot b/Documentation/bpf/map_lru_hash_update.dot new file mode 100644 index 000000000000..a0fee349d29c --- /dev/null +++ b/Documentation/bpf/map_lru_hash_update.dot @@ -0,0 +1,172 @@ +// SPDX-License-Identifier: GPL-2.0-only +// Copyright (C) 2022-2023 Isovalent, Inc. +digraph { + node [colorscheme=accent4,style=filled] # Apply colorscheme to all nodes + graph [splines=ortho, nodesep=1] + + subgraph cluster_key { + label = "Key\n(locks held during operation)"; + rankdir = TB; + + remote_lock [shape=rectangle,fillcolor=4,label="remote CPU LRU lock"] + hash_lock [shape=rectangle,fillcolor=3,label="hashtab lock"] + lru_lock [shape=rectangle,fillcolor=2,label="LRU lock"] + local_lock [shape=rectangle,fillcolor=1,label="local CPU LRU lock"] + no_lock [shape=rectangle,label="no locks held"] + } + + begin [shape=oval,label="begin\nbpf_map_update()"] + + // Nodes below with an 'fn_' prefix are roughly labeled by the C function + // names that initiate the corresponding logic in kernel/bpf/bpf_lru_list.c. + // Number suffixes and errno suffixes handle subsections of the corresponding + // logic in the function as of the writing of this dot. + + // cf. __local_list_pop_free() / bpf_percpu_lru_pop_free() + local_freelist_check [shape=diamond,fillcolor=1, + label="Local freelist\nnode available?"]; + use_local_node [shape=rectangle, + label="Use node owned\nby this CPU"] + + // cf. bpf_lru_pop_free() + common_lru_check [shape=diamond, + label="Map created with\ncommon LRU?\n(!BPF_F_NO_COMMON_LRU)"]; + + fn_bpf_lru_list_pop_free_to_local [shape=rectangle,fillcolor=2, + label="Flush local pending, + Rotate Global list, move + LOCAL_FREE_TARGET + from global -> local"] + // Also corresponds to: + // fn__local_list_flush() + // fn_bpf_lru_list_rotate() + fn___bpf_lru_node_move_to_free[shape=diamond,fillcolor=2, + label="Able to free\nLOCAL_FREE_TARGET\nnodes?"] + + fn___bpf_lru_list_shrink_inactive [shape=rectangle,fillcolor=3, + label="Shrink inactive list + up to remaining + LOCAL_FREE_TARGET + (global LRU -> local)"] + fn___bpf_lru_list_shrink [shape=diamond,fillcolor=2, + label="> 0 entries in\nlocal free list?"] + fn___bpf_lru_list_shrink2 [shape=rectangle,fillcolor=2, + label="Steal one node from + inactive, or if empty, + from active global list"] + fn___bpf_lru_list_shrink3 [shape=rectangle,fillcolor=3, + label="Try to remove\nnode from hashtab"] + + local_freelist_check2 [shape=diamond,label="Htab removal\nsuccessful?"] + common_lru_check2 [shape=diamond, + label="Map created with\ncommon LRU?\n(!BPF_F_NO_COMMON_LRU)"]; + + subgraph cluster_remote_lock { + label = "Iterate through CPUs\n(start from current)"; + style = dashed; + rankdir=LR; + + local_freelist_check5 [shape=diamond,fillcolor=4, + label="Steal a node from\nper-cpu freelist?"] + local_freelist_check6 [shape=rectangle,fillcolor=4, + label="Steal a node from + (1) Unreferenced pending, or + (2) Any pending node"] + local_freelist_check7 [shape=rectangle,fillcolor=3, + label="Try to remove\nnode from hashtab"] + fn_htab_lru_map_update_elem [shape=diamond, + label="Stole node\nfrom remote\nCPU?"] + fn_htab_lru_map_update_elem2 [shape=diamond,label="Iterated\nall CPUs?"] + // Also corresponds to: + // use_local_node() + // fn__local_list_pop_pending() + } + + fn_bpf_lru_list_pop_free_to_local2 [shape=rectangle, + label="Use node that was\nnot recently referenced"] + local_freelist_check4 [shape=rectangle, + label="Use node that was\nactively referenced\nin global list"] + fn_htab_lru_map_update_elem_ENOMEM [shape=oval,label="return -ENOMEM"] + fn_htab_lru_map_update_elem3 [shape=rectangle, + label="Use node that was\nactively referenced\nin (another?) CPU's cache"] + fn_htab_lru_map_update_elem4 [shape=rectangle,fillcolor=3, + label="Update hashmap\nwith new element"] + fn_htab_lru_map_update_elem5 [shape=oval,label="return 0"] + fn_htab_lru_map_update_elem_EBUSY [shape=oval,label="return -EBUSY"] + fn_htab_lru_map_update_elem_EEXIST [shape=oval,label="return -EEXIST"] + fn_htab_lru_map_update_elem_ENOENT [shape=oval,label="return -ENOENT"] + + begin -> local_freelist_check + local_freelist_check -> use_local_node [xlabel="Y"] + local_freelist_check -> common_lru_check [xlabel="N"] + common_lru_check -> fn_bpf_lru_list_pop_free_to_local [xlabel="Y"] + common_lru_check -> fn___bpf_lru_list_shrink_inactive [xlabel="N"] + fn_bpf_lru_list_pop_free_to_local -> fn___bpf_lru_node_move_to_free + fn___bpf_lru_node_move_to_free -> + fn_bpf_lru_list_pop_free_to_local2 [xlabel="Y"] + fn___bpf_lru_node_move_to_free -> + fn___bpf_lru_list_shrink_inactive [xlabel="N"] + fn___bpf_lru_list_shrink_inactive -> fn___bpf_lru_list_shrink + fn___bpf_lru_list_shrink -> fn_bpf_lru_list_pop_free_to_local2 [xlabel = "Y"] + fn___bpf_lru_list_shrink -> fn___bpf_lru_list_shrink2 [xlabel="N"] + fn___bpf_lru_list_shrink2 -> fn___bpf_lru_list_shrink3 + fn___bpf_lru_list_shrink3 -> local_freelist_check2 + local_freelist_check2 -> local_freelist_check4 [xlabel = "Y"] + local_freelist_check2 -> common_lru_check2 [xlabel = "N"] + common_lru_check2 -> local_freelist_check5 [xlabel = "Y"] + common_lru_check2 -> fn_htab_lru_map_update_elem_ENOMEM [xlabel = "N"] + local_freelist_check5 -> fn_htab_lru_map_update_elem [xlabel = "Y"] + local_freelist_check5 -> local_freelist_check6 [xlabel = "N"] + local_freelist_check6 -> local_freelist_check7 + local_freelist_check7 -> fn_htab_lru_map_update_elem + + fn_htab_lru_map_update_elem -> fn_htab_lru_map_update_elem3 [xlabel = "Y"] + fn_htab_lru_map_update_elem -> fn_htab_lru_map_update_elem2 [xlabel = "N"] + fn_htab_lru_map_update_elem2 -> + fn_htab_lru_map_update_elem_ENOMEM [xlabel = "Y"] + fn_htab_lru_map_update_elem2 -> local_freelist_check5 [xlabel = "N"] + fn_htab_lru_map_update_elem3 -> fn_htab_lru_map_update_elem4 + + use_local_node -> fn_htab_lru_map_update_elem4 + fn_bpf_lru_list_pop_free_to_local2 -> fn_htab_lru_map_update_elem4 + local_freelist_check4 -> fn_htab_lru_map_update_elem4 + + fn_htab_lru_map_update_elem4 -> fn_htab_lru_map_update_elem5 [headlabel="Success"] + fn_htab_lru_map_update_elem4 -> + fn_htab_lru_map_update_elem_EBUSY [xlabel="Hashtab lock failed"] + fn_htab_lru_map_update_elem4 -> + fn_htab_lru_map_update_elem_EEXIST [xlabel="BPF_EXIST set and\nkey already exists"] + fn_htab_lru_map_update_elem4 -> + fn_htab_lru_map_update_elem_ENOENT [headlabel="BPF_NOEXIST set\nand no such entry"] + + // Create invisible pad nodes to line up various nodes + pad0 [style=invis] + pad1 [style=invis] + pad2 [style=invis] + pad3 [style=invis] + pad4 [style=invis] + + // Line up the key with the top of the graph + no_lock -> local_lock [style=invis] + local_lock -> lru_lock [style=invis] + lru_lock -> hash_lock [style=invis] + hash_lock -> remote_lock [style=invis] + remote_lock -> local_freelist_check5 [style=invis] + remote_lock -> fn___bpf_lru_list_shrink [style=invis] + + // Line up return code nodes at the bottom of the graph + fn_htab_lru_map_update_elem -> pad0 [style=invis] + pad0 -> pad1 [style=invis] + pad1 -> pad2 [style=invis] + //pad2-> fn_htab_lru_map_update_elem_ENOMEM [style=invis] + fn_htab_lru_map_update_elem4 -> pad3 [style=invis] + pad3 -> fn_htab_lru_map_update_elem5 [style=invis] + pad3 -> fn_htab_lru_map_update_elem_EBUSY [style=invis] + pad3 -> fn_htab_lru_map_update_elem_EEXIST [style=invis] + pad3 -> fn_htab_lru_map_update_elem_ENOENT [style=invis] + + // Reduce diagram width by forcing some nodes to appear above others + local_freelist_check4 -> fn_htab_lru_map_update_elem3 [style=invis] + common_lru_check2 -> pad4 [style=invis] + pad4 -> local_freelist_check5 [style=invis] +} diff --git a/Documentation/bpf/map_sockmap.rst b/Documentation/bpf/map_sockmap.rst index cc92047c6630..2d630686a00b 100644 --- a/Documentation/bpf/map_sockmap.rst +++ b/Documentation/bpf/map_sockmap.rst @@ -240,11 +240,11 @@ offsets into ``msg``, respectively. If a program of type ``BPF_PROG_TYPE_SK_MSG`` is run on a ``msg`` it can only parse data that the (``data``, ``data_end``) pointers have already consumed. For ``sendmsg()`` hooks this is likely the first scatterlist element. But for -calls relying on the ``sendpage`` handler (e.g., ``sendfile()``) this will be -the range (**0**, **0**) because the data is shared with user space and by -default the objective is to avoid allowing user space to modify data while (or -after) BPF verdict is being decided. This helper can be used to pull in data -and to set the start and end pointers to given values. Data will be copied if +calls relying on MSG_SPLICE_PAGES (e.g., ``sendfile()``) this will be the +range (**0**, **0**) because the data is shared with user space and by default +the objective is to avoid allowing user space to modify data while (or after) +BPF verdict is being decided. This helper can be used to pull in data and to +set the start and end pointers to given values. Data will be copied if necessary (i.e., if data was not linear and if start and end pointers do not point to the same chunk). diff --git a/Documentation/bpf/prog_cgroup_sockopt.rst b/Documentation/bpf/prog_cgroup_sockopt.rst index 172f957204bf..1226a94af07a 100644 --- a/Documentation/bpf/prog_cgroup_sockopt.rst +++ b/Documentation/bpf/prog_cgroup_sockopt.rst @@ -98,10 +98,65 @@ can access only the first ``PAGE_SIZE`` of that data. So it has to options: indicates that the kernel should use BPF's trimmed ``optval``. When the BPF program returns with the ``optlen`` greater than -``PAGE_SIZE``, the userspace will receive ``EFAULT`` errno. +``PAGE_SIZE``, the userspace will receive original kernel +buffers without any modifications that the BPF program might have +applied. Example ======= +Recommended way to handle BPF programs is as follows: + +.. code-block:: c + + SEC("cgroup/getsockopt") + int getsockopt(struct bpf_sockopt *ctx) + { + /* Custom socket option. */ + if (ctx->level == MY_SOL && ctx->optname == MY_OPTNAME) { + ctx->retval = 0; + optval[0] = ...; + ctx->optlen = 1; + return 1; + } + + /* Modify kernel's socket option. */ + if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) { + ctx->retval = 0; + optval[0] = ...; + ctx->optlen = 1; + return 1; + } + + /* optval larger than PAGE_SIZE use kernel's buffer. */ + if (ctx->optlen > PAGE_SIZE) + ctx->optlen = 0; + + return 1; + } + + SEC("cgroup/setsockopt") + int setsockopt(struct bpf_sockopt *ctx) + { + /* Custom socket option. */ + if (ctx->level == MY_SOL && ctx->optname == MY_OPTNAME) { + /* do something */ + ctx->optlen = -1; + return 1; + } + + /* Modify kernel's socket option. */ + if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) { + optval[0] = ...; + return 1; + } + + /* optval larger than PAGE_SIZE use kernel's buffer. */ + if (ctx->optlen > PAGE_SIZE) + ctx->optlen = 0; + + return 1; + } + See ``tools/testing/selftests/bpf/progs/sockopt_sk.c`` for an example of BPF program that handles socket options. diff --git a/Documentation/devicetree/bindings/net/allwinner,sun7i-a20-gmac.yaml b/Documentation/devicetree/bindings/net/allwinner,sun7i-a20-gmac.yaml index 3bd912ed7c7e..23e92be33ac8 100644 --- a/Documentation/devicetree/bindings/net/allwinner,sun7i-a20-gmac.yaml +++ b/Documentation/devicetree/bindings/net/allwinner,sun7i-a20-gmac.yaml @@ -7,7 +7,7 @@ $schema: http://devicetree.org/meta-schemas/core.yaml# title: Allwinner A20 GMAC allOf: - - $ref: "snps,dwmac.yaml#" + - $ref: snps,dwmac.yaml# maintainers: - Chen-Yu Tsai <wens@csie.org> diff --git a/Documentation/devicetree/bindings/net/allwinner,sun8i-a83t-emac.yaml b/Documentation/devicetree/bindings/net/allwinner,sun8i-a83t-emac.yaml index 47bc2057e629..4bfac9186886 100644 --- a/Documentation/devicetree/bindings/net/allwinner,sun8i-a83t-emac.yaml +++ b/Documentation/devicetree/bindings/net/allwinner,sun8i-a83t-emac.yaml @@ -63,7 +63,7 @@ required: - syscon allOf: - - $ref: "snps,dwmac.yaml#" + - $ref: snps,dwmac.yaml# - if: properties: compatible: diff --git a/Documentation/devicetree/bindings/net/altr,tse.yaml b/Documentation/devicetree/bindings/net/altr,tse.yaml index 9d02af468906..f5d3b70af07a 100644 --- a/Documentation/devicetree/bindings/net/altr,tse.yaml +++ b/Documentation/devicetree/bindings/net/altr,tse.yaml @@ -72,8 +72,8 @@ allOf: compatible: contains: enum: - - const: altr,tse-1.0 - - const: ALTR,tse-1.0 + - altr,tse-1.0 + - ALTR,tse-1.0 then: properties: reg: diff --git a/Documentation/devicetree/bindings/net/amlogic,meson-dwmac.yaml b/Documentation/devicetree/bindings/net/amlogic,meson-dwmac.yaml index a2c51a84efa5..ee7a65b528cd 100644 --- a/Documentation/devicetree/bindings/net/amlogic,meson-dwmac.yaml +++ b/Documentation/devicetree/bindings/net/amlogic,meson-dwmac.yaml @@ -27,7 +27,7 @@ select: - compatible allOf: - - $ref: "snps,dwmac.yaml#" + - $ref: snps,dwmac.yaml# - if: properties: compatible: diff --git a/Documentation/devicetree/bindings/net/bluetooth/qualcomm-bluetooth.yaml b/Documentation/devicetree/bindings/net/bluetooth/qualcomm-bluetooth.yaml index 68f78b90d23a..604985c8068e 100644 --- a/Documentation/devicetree/bindings/net/bluetooth/qualcomm-bluetooth.yaml +++ b/Documentation/devicetree/bindings/net/bluetooth/qualcomm-bluetooth.yaml @@ -50,6 +50,9 @@ properties: vddch0-supply: description: VDD_CH0 supply regulator handle + vddch1-supply: + description: VDD_CH1 supply regulator handle + vddaon-supply: description: VDD_AON supply regulator handle diff --git a/Documentation/devicetree/bindings/net/brcm,bcmgenet.yaml b/Documentation/devicetree/bindings/net/brcm,bcmgenet.yaml index 0e5e5db32faf..7c90a4390531 100644 --- a/Documentation/devicetree/bindings/net/brcm,bcmgenet.yaml +++ b/Documentation/devicetree/bindings/net/brcm,bcmgenet.yaml @@ -55,7 +55,7 @@ properties: patternProperties: "^mdio@[0-9a-f]+$": type: object - $ref: "brcm,unimac-mdio.yaml" + $ref: brcm,unimac-mdio.yaml description: GENET internal UniMAC MDIO bus diff --git a/Documentation/devicetree/bindings/net/cdns,macb.yaml b/Documentation/devicetree/bindings/net/cdns,macb.yaml index bef5e0f895be..bf8894a0257e 100644 --- a/Documentation/devicetree/bindings/net/cdns,macb.yaml +++ b/Documentation/devicetree/bindings/net/cdns,macb.yaml @@ -109,6 +109,16 @@ properties: power-domains: maxItems: 1 + cdns,rx-watermark: + $ref: /schemas/types.yaml#/definitions/uint32 + description: + When the receive partial store and forward mode is activated, + the receiver will only begin to forward the packet to the external + AHB or AXI slave when enough packet data is stored in the SRAM packet buffer. + rx-watermark corresponds to the number of SRAM buffer locations, + that need to be filled, before the forwarding process is activated. + Width of the SRAM is platform dependent, and can be 4, 8 or 16 bytes. + '#address-cells': const: 1 @@ -166,6 +176,7 @@ examples: compatible = "cdns,macb"; reg = <0xfffc4000 0x4000>; interrupts = <21>; + cdns,rx-watermark = <0x44>; phy-mode = "rmii"; local-mac-address = [3a 0e 03 04 05 06]; clock-names = "pclk", "hclk", "tx_clk"; diff --git a/Documentation/devicetree/bindings/net/dsa/marvell.txt b/Documentation/devicetree/bindings/net/dsa/marvell.txt index 2363b412410c..33726134f5c9 100644 --- a/Documentation/devicetree/bindings/net/dsa/marvell.txt +++ b/Documentation/devicetree/bindings/net/dsa/marvell.txt @@ -20,7 +20,7 @@ which is at a different MDIO base address in different switch families. 6171, 6172, 6175, 6176, 6185, 6240, 6320, 6321, 6341, 6350, 6351, 6352 - "marvell,mv88e6190" : Switch has base address 0x00. Use with models: - 6190, 6190X, 6191, 6290, 6390, 6390X + 6163, 6190, 6190X, 6191, 6290, 6390, 6390X - "marvell,mv88e6250" : Switch has base address 0x08 or 0x18. Use with model: 6220, 6250 diff --git a/Documentation/devicetree/bindings/net/dsa/nxp,sja1105.yaml b/Documentation/devicetree/bindings/net/dsa/nxp,sja1105.yaml index 9a64ed658745..4d5f5cc6d031 100644 --- a/Documentation/devicetree/bindings/net/dsa/nxp,sja1105.yaml +++ b/Documentation/devicetree/bindings/net/dsa/nxp,sja1105.yaml @@ -12,10 +12,6 @@ description: cs_sck_delay of 500ns. Ensuring that this SPI timing requirement is observed depends on the SPI bus master driver. -allOf: - - $ref: dsa.yaml#/$defs/ethernet-ports - - $ref: /schemas/spi/spi-peripheral-props.yaml# - maintainers: - Vladimir Oltean <vladimir.oltean@nxp.com> @@ -36,6 +32,9 @@ properties: reg: maxItems: 1 + spi-cpha: true + spi-cpol: true + # Optional container node for the 2 internal MDIO buses of the SJA1110 # (one for the internal 100base-T1 PHYs and the other for the single # 100base-TX PHY). The "reg" property does not have physical significance. @@ -109,6 +108,30 @@ $defs: 1860, 1880, 1900, 1920, 1940, 1960, 1980, 2000, 2020, 2040, 2060, 2080, 2100, 2120, 2140, 2160, 2180, 2200, 2220, 2240, 2260] +allOf: + - $ref: dsa.yaml#/$defs/ethernet-ports + - $ref: /schemas/spi/spi-peripheral-props.yaml# + - if: + properties: + compatible: + enum: + - nxp,sja1105e + - nxp,sja1105p + - nxp,sja1105q + - nxp,sja1105r + - nxp,sja1105s + - nxp,sja1105t + then: + properties: + spi-cpol: false + required: + - spi-cpha + else: + properties: + spi-cpha: false + required: + - spi-cpol + unevaluatedProperties: false examples: @@ -120,6 +143,7 @@ examples: ethernet-switch@1 { reg = <0x1>; compatible = "nxp,sja1105t"; + spi-cpha; ethernet-ports { #address-cells = <1>; diff --git a/Documentation/devicetree/bindings/net/ethernet-phy.yaml b/Documentation/devicetree/bindings/net/ethernet-phy.yaml index 4f574532ee13..c1241c8a3b77 100644 --- a/Documentation/devicetree/bindings/net/ethernet-phy.yaml +++ b/Documentation/devicetree/bindings/net/ethernet-phy.yaml @@ -93,6 +93,12 @@ properties: the turn around line low at end of the control phase of the MDIO transaction. + clocks: + maxItems: 1 + description: + External clock connected to the PHY. If not specified it is assumed + that the PHY uses a fixed crystal or an internal oscillator. + enet-phy-lane-swap: $ref: /schemas/types.yaml#/definitions/flag description: diff --git a/Documentation/devicetree/bindings/net/intel,dwmac-plat.yaml b/Documentation/devicetree/bindings/net/intel,dwmac-plat.yaml index d23fa3771210..42a0bc94312c 100644 --- a/Documentation/devicetree/bindings/net/intel,dwmac-plat.yaml +++ b/Documentation/devicetree/bindings/net/intel,dwmac-plat.yaml @@ -19,7 +19,7 @@ select: - compatible allOf: - - $ref: "snps,dwmac.yaml#" + - $ref: snps,dwmac.yaml# properties: compatible: diff --git a/Documentation/devicetree/bindings/net/maxlinear,gpy2xx.yaml b/Documentation/devicetree/bindings/net/maxlinear,gpy2xx.yaml index d71fa9de2b64..8a3713abd1ca 100644 --- a/Documentation/devicetree/bindings/net/maxlinear,gpy2xx.yaml +++ b/Documentation/devicetree/bindings/net/maxlinear,gpy2xx.yaml @@ -17,11 +17,12 @@ properties: maxlinear,use-broken-interrupts: description: | Interrupts are broken on some GPY2xx PHYs in that they keep the - interrupt line asserted even after the interrupt status register is - cleared. Thus it is blocking the interrupt line which is usually bad - for shared lines. By default interrupts are disabled for this PHY and - polling mode is used. If one can live with the consequences, this - property can be used to enable interrupt handling. + interrupt line asserted for a random amount of time even after the + interrupt status register is cleared. Thus it is blocking the + interrupt line which is usually bad for shared lines. By default, + interrupts are disabled for this PHY and polling mode is used. If one + can live with the consequences, this property can be used to enable + interrupt handling. Affected PHYs (as far as known) are GPY215B and GPY215C. type: boolean diff --git a/Documentation/devicetree/bindings/net/mediatek-dwmac.yaml b/Documentation/devicetree/bindings/net/mediatek-dwmac.yaml index 0fa2132fa4f4..08d74ca0769c 100644 --- a/Documentation/devicetree/bindings/net/mediatek-dwmac.yaml +++ b/Documentation/devicetree/bindings/net/mediatek-dwmac.yaml @@ -25,7 +25,7 @@ select: - compatible allOf: - - $ref: "snps,dwmac.yaml#" + - $ref: snps,dwmac.yaml# properties: compatible: diff --git a/Documentation/devicetree/bindings/net/micrel,ks8851.yaml b/Documentation/devicetree/bindings/net/micrel,ks8851.yaml index b44d83554ef5..b726c6e14633 100644 --- a/Documentation/devicetree/bindings/net/micrel,ks8851.yaml +++ b/Documentation/devicetree/bindings/net/micrel,ks8851.yaml @@ -44,13 +44,13 @@ required: allOf: - $ref: ethernet-controller.yaml# - - $ref: /schemas/memory-controllers/mc-peripheral-props.yaml# - if: properties: compatible: contains: const: micrel,ks8851 then: + $ref: /schemas/spi/spi-peripheral-props.yaml# properties: reg: maxItems: 1 @@ -60,6 +60,7 @@ allOf: contains: const: micrel,ks8851-mll then: + $ref: /schemas/memory-controllers/mc-peripheral-props.yaml# properties: reg: minItems: 2 diff --git a/Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml b/Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml index 63409cbff5ad..4c01cae7c93a 100644 --- a/Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml +++ b/Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml @@ -24,7 +24,7 @@ select: - compatible allOf: - - $ref: "snps,dwmac.yaml#" + - $ref: snps,dwmac.yaml# properties: compatible: diff --git a/Documentation/devicetree/bindings/net/pse-pd/pse-controller.yaml b/Documentation/devicetree/bindings/net/pse-pd/pse-controller.yaml index b110abb42597..2d382faca0e6 100644 --- a/Documentation/devicetree/bindings/net/pse-pd/pse-controller.yaml +++ b/Documentation/devicetree/bindings/net/pse-pd/pse-controller.yaml @@ -16,7 +16,7 @@ maintainers: properties: $nodename: - pattern: "^ethernet-pse(@.*)?$" + pattern: "^ethernet-pse(@.*|-([0-9]|[1-9][0-9]+))?$" "#pse-cells": description: diff --git a/Documentation/devicetree/bindings/net/qcom,ethqos.yaml b/Documentation/devicetree/bindings/net/qcom,ethqos.yaml index 60a38044fb19..7bdb412a0185 100644 --- a/Documentation/devicetree/bindings/net/qcom,ethqos.yaml +++ b/Documentation/devicetree/bindings/net/qcom,ethqos.yaml @@ -20,6 +20,7 @@ properties: compatible: enum: - qcom,qcs404-ethqos + - qcom,sa8775p-ethqos - qcom,sc8280xp-ethqos - qcom,sm8150-ethqos @@ -32,11 +33,13 @@ properties: - const: rgmii interrupts: + minItems: 1 items: - description: Combined signal for various interrupt events - description: The interrupt that occurs when Rx exits the LPI state interrupt-names: + minItems: 1 items: - const: macirq - const: eth_lpi @@ -49,11 +52,18 @@ properties: - const: stmmaceth - const: pclk - const: ptp_ref - - const: rgmii + - enum: + - rgmii + - phyaux iommus: maxItems: 1 + phys: true + + phy-names: + const: serdes + required: - compatible - clocks diff --git a/Documentation/devicetree/bindings/net/rockchip-dwmac.yaml b/Documentation/devicetree/bindings/net/rockchip-dwmac.yaml index 2a21bbe02892..176ea5f90251 100644 --- a/Documentation/devicetree/bindings/net/rockchip-dwmac.yaml +++ b/Documentation/devicetree/bindings/net/rockchip-dwmac.yaml @@ -32,7 +32,7 @@ select: - compatible allOf: - - $ref: "snps,dwmac.yaml#" + - $ref: snps,dwmac.yaml# properties: compatible: diff --git a/Documentation/devicetree/bindings/net/snps,dwmac.yaml b/Documentation/devicetree/bindings/net/snps,dwmac.yaml index 363b3e3ea3a6..ddf9522a5dc2 100644 --- a/Documentation/devicetree/bindings/net/snps,dwmac.yaml +++ b/Documentation/devicetree/bindings/net/snps,dwmac.yaml @@ -67,6 +67,7 @@ properties: - loongson,ls2k-dwmac - loongson,ls7a-dwmac - qcom,qcs404-ethqos + - qcom,sa8775p-ethqos - qcom,sc8280xp-ethqos - qcom,sm8150-ethqos - renesas,r9a06g032-gmac @@ -582,6 +583,7 @@ allOf: - ingenic,x1600-mac - ingenic,x1830-mac - ingenic,x2000-mac + - qcom,sa8775p-ethqos - qcom,sc8280xp-ethqos - snps,dwmac-3.50a - snps,dwmac-4.10a @@ -638,6 +640,7 @@ allOf: - ingenic,x1830-mac - ingenic,x2000-mac - qcom,qcs404-ethqos + - qcom,sa8775p-ethqos - qcom,sc8280xp-ethqos - qcom,sm8150-ethqos - snps,dwmac-4.00 diff --git a/Documentation/devicetree/bindings/net/ti,k3-am654-cpsw-nuss.yaml b/Documentation/devicetree/bindings/net/ti,k3-am654-cpsw-nuss.yaml index 395a4650e285..c9c25132d154 100644 --- a/Documentation/devicetree/bindings/net/ti,k3-am654-cpsw-nuss.yaml +++ b/Documentation/devicetree/bindings/net/ti,k3-am654-cpsw-nuss.yaml @@ -168,14 +168,14 @@ properties: patternProperties: "^mdio@[0-9a-f]+$": type: object - $ref: "ti,davinci-mdio.yaml#" + $ref: ti,davinci-mdio.yaml# description: CPSW MDIO bus. "^cpts@[0-9a-f]+": type: object - $ref: "ti,k3-am654-cpts.yaml#" + $ref: ti,k3-am654-cpts.yaml# description: CPSW Common Platform Time Sync (CPTS) module. diff --git a/Documentation/devicetree/bindings/net/toshiba,visconti-dwmac.yaml b/Documentation/devicetree/bindings/net/toshiba,visconti-dwmac.yaml index 474fa8bcf302..052f636158b3 100644 --- a/Documentation/devicetree/bindings/net/toshiba,visconti-dwmac.yaml +++ b/Documentation/devicetree/bindings/net/toshiba,visconti-dwmac.yaml @@ -19,7 +19,7 @@ select: - compatible allOf: - - $ref: "snps,dwmac.yaml#" + - $ref: snps,dwmac.yaml# properties: compatible: diff --git a/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.yaml b/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.yaml index c85ed330426d..7758a55dd328 100644 --- a/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.yaml +++ b/Documentation/devicetree/bindings/net/wireless/qcom,ath10k.yaml @@ -84,6 +84,8 @@ properties: required: - iommus + ieee80211-freq-limit: true + qcom,ath10k-calibration-data: $ref: /schemas/types.yaml#/definitions/uint8-array description: @@ -164,6 +166,7 @@ required: additionalProperties: false allOf: + - $ref: ieee80211.yaml# - if: properties: compatible: @@ -355,4 +358,5 @@ examples: "msi14", "msi15", "legacy"; + ieee80211-freq-limit = <5470000 5875000>; }; diff --git a/Documentation/devicetree/bindings/net/xilinx_axienet.txt b/Documentation/devicetree/bindings/net/xilinx_axienet.txt deleted file mode 100644 index 80e505a2fda1..000000000000 --- a/Documentation/devicetree/bindings/net/xilinx_axienet.txt +++ /dev/null @@ -1,101 +0,0 @@ -XILINX AXI ETHERNET Device Tree Bindings --------------------------------------------------------- - -Also called AXI 1G/2.5G Ethernet Subsystem, the xilinx axi ethernet IP core -provides connectivity to an external ethernet PHY supporting different -interfaces: MII, GMII, RGMII, SGMII, 1000BaseX. It also includes two -segments of memory for buffering TX and RX, as well as the capability of -offloading TX/RX checksum calculation off the processor. - -Management configuration is done through the AXI interface, while payload is -sent and received through means of an AXI DMA controller. This driver -includes the DMA driver code, so this driver is incompatible with AXI DMA -driver. - -For more details about mdio please refer phy.txt file in the same directory. - -Required properties: -- compatible : Must be one of "xlnx,axi-ethernet-1.00.a", - "xlnx,axi-ethernet-1.01.a", "xlnx,axi-ethernet-2.01.a" -- reg : Address and length of the IO space, as well as the address - and length of the AXI DMA controller IO space, unless - axistream-connected is specified, in which case the reg - attribute of the node referenced by it is used. -- interrupts : Should be a list of 2 or 3 interrupts: TX DMA, RX DMA, - and optionally Ethernet core. If axistream-connected is - specified, the TX/RX DMA interrupts should be on that node - instead, and only the Ethernet core interrupt is optionally - specified here. -- phy-handle : Should point to the external phy device if exists. Pointing - this to the PCS/PMA PHY is deprecated and should be avoided. - See ethernet.txt file in the same directory. -- xlnx,rxmem : Set to allocated memory buffer for Rx/Tx in the hardware - -Optional properties: -- phy-mode : See ethernet.txt -- xlnx,phy-type : Deprecated, do not use, but still accepted in preference - to phy-mode. -- xlnx,txcsum : 0 or empty for disabling TX checksum offload, - 1 to enable partial TX checksum offload, - 2 to enable full TX checksum offload -- xlnx,rxcsum : Same values as xlnx,txcsum but for RX checksum offload -- xlnx,switch-x-sgmii : Boolean to indicate the Ethernet core is configured to - support both 1000BaseX and SGMII modes. If set, the phy-mode - should be set to match the mode selected on core reset (i.e. - by the basex_or_sgmii core input line). -- clock-names: Tuple listing input clock names. Possible clocks: - s_axi_lite_clk: Clock for AXI register slave interface - axis_clk: AXI4-Stream clock for TXD RXD TXC and RXS interfaces - ref_clk: Ethernet reference clock, used by signal delay - primitives and transceivers - mgt_clk: MGT reference clock (used by optional internal - PCS/PMA PHY) - - Note that if s_axi_lite_clk is not specified by name, the - first clock of any name is used for this. If that is also not - specified, the clock rate is auto-detected from the CPU clock - (but only on platforms where this is possible). New device - trees should specify all applicable clocks by name - the - fallbacks to an unnamed clock or to CPU clock are only for - backward compatibility. -- clocks: Phandles to input clocks matching clock-names. Refer to common - clock bindings. -- axistream-connected: Reference to another node which contains the resources - for the AXI DMA controller used by this device. - If this is specified, the DMA-related resources from that - device (DMA registers and DMA TX/RX interrupts) rather - than this one will be used. - - mdio : Child node for MDIO bus. Must be defined if PHY access is - required through the core's MDIO interface (i.e. always, - unless the PHY is accessed through a different bus). - Non-standard MDIO bus frequency is supported via - "clock-frequency", see mdio.yaml. - - - pcs-handle: Phandle to the internal PCS/PMA PHY in SGMII or 1000Base-X - modes, where "pcs-handle" should be used to point - to the PCS/PMA PHY, and "phy-handle" should point to an - external PHY if exists. - -Example: - axi_ethernet_eth: ethernet@40c00000 { - compatible = "xlnx,axi-ethernet-1.00.a"; - device_type = "network"; - interrupt-parent = <µblaze_0_axi_intc>; - interrupts = <2 0 1>; - clock-names = "s_axi_lite_clk", "axis_clk", "ref_clk", "mgt_clk"; - clocks = <&axi_clk>, <&axi_clk>, <&pl_enet_ref_clk>, <&mgt_clk>; - phy-mode = "mii"; - reg = <0x40c00000 0x40000 0x50c00000 0x40000>; - xlnx,rxcsum = <0x2>; - xlnx,rxmem = <0x800>; - xlnx,txcsum = <0x2>; - phy-handle = <&phy0>; - axi_ethernetlite_0_mdio: mdio { - #address-cells = <1>; - #size-cells = <0>; - phy0: phy@0 { - device_type = "ethernet-phy"; - reg = <1>; - }; - }; - }; diff --git a/Documentation/devicetree/bindings/net/xlnx,axi-ethernet.yaml b/Documentation/devicetree/bindings/net/xlnx,axi-ethernet.yaml new file mode 100644 index 000000000000..1d33d80af11c --- /dev/null +++ b/Documentation/devicetree/bindings/net/xlnx,axi-ethernet.yaml @@ -0,0 +1,183 @@ +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/net/xlnx,axi-ethernet.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: AXI 1G/2.5G Ethernet Subsystem + +description: | + Also called AXI 1G/2.5G Ethernet Subsystem, the xilinx axi ethernet IP core + provides connectivity to an external ethernet PHY supporting different + interfaces: MII, GMII, RGMII, SGMII, 1000BaseX. It also includes two + segments of memory for buffering TX and RX, as well as the capability of + offloading TX/RX checksum calculation off the processor. + + Management configuration is done through the AXI interface, while payload is + sent and received through means of an AXI DMA controller. This driver + includes the DMA driver code, so this driver is incompatible with AXI DMA + driver. + +maintainers: + - Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com> + +properties: + compatible: + enum: + - xlnx,axi-ethernet-1.00.a + - xlnx,axi-ethernet-1.01.a + - xlnx,axi-ethernet-2.01.a + + reg: + description: + Address and length of the IO space, as well as the address + and length of the AXI DMA controller IO space, unless + axistream-connected is specified, in which case the reg + attribute of the node referenced by it is used. + maxItems: 2 + + interrupts: + items: + - description: Ethernet core interrupt + - description: Tx DMA interrupt + - description: Rx DMA interrupt + description: + Ethernet core interrupt is optional. If axistream-connected property is + present DMA node should contains TX/RX DMA interrupts else DMA interrupt + resources are mentioned on ethernet node. + minItems: 1 + + phy-handle: true + + xlnx,rxmem: + description: + Set to allocated memory buffer for Rx/Tx in the hardware. + $ref: /schemas/types.yaml#/definitions/uint32 + + phy-mode: + enum: + - mii + - gmii + - rgmii + - sgmii + - 1000BaseX + + xlnx,phy-type: + description: + Do not use, but still accepted in preference to phy-mode. + deprecated: true + $ref: /schemas/types.yaml#/definitions/uint32 + + xlnx,txcsum: + description: + TX checksum offload. 0 or empty for disabling TX checksum offload, + 1 to enable partial TX checksum offload and 2 to enable full TX + checksum offload. + $ref: /schemas/types.yaml#/definitions/uint32 + enum: [0, 1, 2] + + xlnx,rxcsum: + description: + RX checksum offload. 0 or empty for disabling RX checksum offload, + 1 to enable partial RX checksum offload and 2 to enable full RX + checksum offload. + $ref: /schemas/types.yaml#/definitions/uint32 + enum: [0, 1, 2] + + xlnx,switch-x-sgmii: + type: boolean + description: + Indicate the Ethernet core is configured to support both 1000BaseX and + SGMII modes. If set, the phy-mode should be set to match the mode + selected on core reset (i.e. by the basex_or_sgmii core input line). + + clocks: + items: + - description: Clock for AXI register slave interface. + - description: AXI4-Stream clock for TXD RXD TXC and RXS interfaces. + - description: Ethernet reference clock, used by signal delay primitives + and transceivers. + - description: MGT reference clock (used by optional internal PCS/PMA PHY) + + clock-names: + items: + - const: s_axi_lite_clk + - const: axis_clk + - const: ref_clk + - const: mgt_clk + + axistream-connected: + $ref: /schemas/types.yaml#/definitions/phandle + description: Phandle of AXI DMA controller which contains the resources + used by this device. If this is specified, the DMA-related resources + from that device (DMA registers and DMA TX/RX interrupts) rather than + this one will be used. + + mdio: + type: object + + pcs-handle: + description: Phandle to the internal PCS/PMA PHY in SGMII or 1000Base-X + modes, where "pcs-handle" should be used to point to the PCS/PMA PHY, + and "phy-handle" should point to an external PHY if exists. + maxItems: 1 + +required: + - compatible + - interrupts + - reg + - xlnx,rxmem + - phy-handle + +allOf: + - $ref: /schemas/net/ethernet-controller.yaml# + +additionalProperties: false + +examples: + - | + axi_ethernet_eth: ethernet@40c00000 { + compatible = "xlnx,axi-ethernet-1.00.a"; + interrupts = <2 0 1>; + clock-names = "s_axi_lite_clk", "axis_clk", "ref_clk", "mgt_clk"; + clocks = <&axi_clk>, <&axi_clk>, <&pl_enet_ref_clk>, <&mgt_clk>; + phy-mode = "mii"; + reg = <0x40c00000 0x40000>,<0x50c00000 0x40000>; + xlnx,rxcsum = <0x2>; + xlnx,rxmem = <0x800>; + xlnx,txcsum = <0x2>; + phy-handle = <&phy0>; + + mdio { + #address-cells = <1>; + #size-cells = <0>; + phy0: ethernet-phy@1 { + device_type = "ethernet-phy"; + reg = <1>; + }; + }; + }; + + - | + axi_ethernet_eth1: ethernet@40000000 { + compatible = "xlnx,axi-ethernet-1.00.a"; + interrupts = <0>; + clock-names = "s_axi_lite_clk", "axis_clk", "ref_clk", "mgt_clk"; + clocks = <&axi_clk>, <&axi_clk>, <&pl_enet_ref_clk>, <&mgt_clk>; + phy-mode = "mii"; + reg = <0x00 0x40000000 0x00 0x40000>; + xlnx,rxcsum = <0x2>; + xlnx,rxmem = <0x800>; + xlnx,txcsum = <0x2>; + phy-handle = <&phy1>; + axistream-connected = <&dma>; + + mdio { + #address-cells = <1>; + #size-cells = <0>; + phy1: ethernet-phy@1 { + device_type = "ethernet-phy"; + reg = <1>; + }; + }; + }; diff --git a/Documentation/driver-api/ptp.rst b/Documentation/driver-api/ptp.rst index 664838ae7776..5e033c3b11b3 100644 --- a/Documentation/driver-api/ptp.rst +++ b/Documentation/driver-api/ptp.rst @@ -73,6 +73,22 @@ Writing clock drivers class driver, since the lock may also be needed by the clock driver's interrupt service routine. +PTP hardware clock requirements for '.adjphase' +----------------------------------------------- + + The 'struct ptp_clock_info' interface has a '.adjphase' function. + This function has a set of requirements from the PHC in order to be + implemented. + + * The PHC implements a servo algorithm internally that is used to + correct the offset passed in the '.adjphase' call. + * When other PTP adjustment functions are called, the PHC servo + algorithm is disabled. + + **NOTE:** '.adjphase' is not a simple time adjustment functionality + that 'jumps' the PHC clock time based on the provided offset. It + should correct the offset provided using an internal algorithm. + Supported hardware ================== @@ -106,3 +122,16 @@ Supported hardware - LPF settings (bandwidth, phase limiting, automatic holdover, physical layer assist (per ITU-T G.8273.2)) - Programmable output PTP clocks, any frequency up to 1GHz (to other PHY/MAC time stampers, refclk to ASSPs/SoCs/FPGAs) - Lock to GNSS input, automatic switching between GNSS and user-space PHC control (optional) + + * NVIDIA Mellanox + + - GPIO + - Certain variants of ConnectX-6 Dx and later products support one + GPIO which can time stamp external triggers and one GPIO to produce + periodic signals. + - Certain variants of ConnectX-5 and older products support one GPIO, + configured to either time stamp external triggers or produce + periodic signals. + - PHC instances + - All ConnectX devices have a free-running counter + - ConnectX-6 Dx and later devices have a UTC format counter diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index aa1a233b0fa8..ed148919e11a 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -521,8 +521,6 @@ prototypes:: int (*fsync) (struct file *, loff_t start, loff_t end, int datasync); int (*fasync) (int, struct file *, int); int (*lock) (struct file *, int, struct file_lock *); - ssize_t (*sendpage) (struct file *, struct page *, int, size_t, - loff_t *, int); unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); int (*check_flags)(int); diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst index 769be5230210..cb2a97e49872 100644 --- a/Documentation/filesystems/vfs.rst +++ b/Documentation/filesystems/vfs.rst @@ -1086,7 +1086,6 @@ This describes how the VFS can manipulate an open file. As of kernel int (*fsync) (struct file *, loff_t, loff_t, int datasync); int (*fasync) (int, struct file *, int); int (*lock) (struct file *, int, struct file_lock *); - ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); int (*check_flags)(int); int (*flock) (struct file *, int, struct file_lock *); diff --git a/Documentation/leds/leds-class.rst b/Documentation/leds/leds-class.rst index cd155ead8703..5db620ed27aa 100644 --- a/Documentation/leds/leds-class.rst +++ b/Documentation/leds/leds-class.rst @@ -169,6 +169,87 @@ Setting the brightness to zero with brightness_set() callback function should completely turn off the LED and cancel the previously programmed hardware blinking function, if any. +Hardware driven LEDs +==================== + +Some LEDs can be programmed to be driven by hardware. This is not +limited to blink but also to turn off or on autonomously. +To support this feature, a LED needs to implement various additional +ops and needs to declare specific support for the supported triggers. + +With hw control we refer to the LED driven by hardware. + +LED driver must define the following value to support hw control: + + - hw_control_trigger: + unique trigger name supported by the LED in hw control + mode. + +LED driver must implement the following API to support hw control: + - hw_control_is_supported: + check if the flags passed by the supported trigger can + be parsed and activate hw control on the LED. + + Return 0 if the passed flags mask is supported and + can be set with hw_control_set(). + + If the passed flags mask is not supported -EOPNOTSUPP + must be returned, the LED trigger will use software + fallback in this case. + + Return a negative error in case of any other error like + device not ready or timeouts. + + - hw_control_set: + activate hw control. LED driver will use the provided + flags passed from the supported trigger, parse them to + a set of mode and setup the LED to be driven by hardware + following the requested modes. + + Set LED_OFF via the brightness_set to deactivate hw control. + + Return 0 on success, a negative error number on failing to + apply flags. + + - hw_control_get: + get active modes from a LED already in hw control, parse + them and set in flags the current active flags for the + supported trigger. + + Return 0 on success, a negative error number on failing + parsing the initial mode. + Error from this function is NOT FATAL as the device may + be in a not supported initial state by the attached LED + trigger. + + - hw_control_get_device: + return the device associated with the LED driver in + hw control. A trigger might use this to match the + returned device from this function with a configured + device for the trigger as the source for blinking + events and correctly enable hw control. + (example a netdev trigger configured to blink for a + particular dev match the returned dev from get_device + to set hw control) + + Returns a pointer to a struct device or NULL if nothing + is currently attached. + +LED driver can activate additional modes by default to workaround the +impossibility of supporting each different mode on the supported trigger. +Examples are hardcoding the blink speed to a set interval, enable special +feature like bypassing blink if some requirements are not met. + +A trigger should first check if the hw control API are supported by the LED +driver and check if the trigger is supported to verify if hw control is possible, +use hw_control_is_supported to check if the flags are supported and only at +the end use hw_control_set to activate hw control. + +A trigger can use hw_control_get to check if a LED is already in hw control +and init their flags. + +When the LED is in hw control, no software blink is possible and doing so +will effectively disable hw control. Known Issues ============ diff --git a/Documentation/netlink/genetlink-c.yaml b/Documentation/netlink/genetlink-c.yaml index 8e8c17b0a6c6..57d1c1c4918f 100644 --- a/Documentation/netlink/genetlink-c.yaml +++ b/Documentation/netlink/genetlink-c.yaml @@ -195,6 +195,16 @@ properties: description: Max length for a string or a binary attribute. $ref: '#/$defs/len-or-define' sub-type: *attr-type + display-hint: &display-hint + description: | + Optional format indicator that is intended only for choosing + the right formatting mechanism when displaying values of this + type. + enum: [ hex, mac, fddi, ipv4, ipv6, uuid ] + # Start genetlink-c + name-prefix: + type: string + # End genetlink-c # Make sure name-prefix does not appear in subsets (subsets inherit naming) dependencies: diff --git a/Documentation/netlink/genetlink-legacy.yaml b/Documentation/netlink/genetlink-legacy.yaml index b33541a51d6b..43b769c98fb2 100644 --- a/Documentation/netlink/genetlink-legacy.yaml +++ b/Documentation/netlink/genetlink-legacy.yaml @@ -119,9 +119,24 @@ properties: name: type: string type: - enum: [ u8, u16, u32, u64, s8, s16, s32, s64, string ] + description: The netlink attribute type + enum: [ u8, u16, u32, u64, s8, s16, s32, s64, string, binary ] len: $ref: '#/$defs/len-or-define' + byte-order: + enum: [ little-endian, big-endian ] + doc: + description: Documentation for the struct member attribute. + type: string + enum: + description: Name of the enum type used for the attribute. + type: string + display-hint: &display-hint + description: | + Optional format indicator that is intended only for choosing + the right formatting mechanism when displaying values of this + type. + enum: [ hex, mac, fddi, ipv4, ipv6, uuid ] # End genetlink-legacy attribute-sets: @@ -171,6 +186,7 @@ properties: name: type: string type: &attr-type + description: The netlink attribute type enum: [ unused, pad, flag, binary, u8, u16, u32, u64, s32, s64, string, nest, array-nest, nest-type-value ] doc: @@ -218,6 +234,11 @@ properties: description: Max length for a string or a binary attribute. $ref: '#/$defs/len-or-define' sub-type: *attr-type + display-hint: *display-hint + # Start genetlink-c + name-prefix: + type: string + # End genetlink-c # Start genetlink-legacy struct: description: Name of the struct type used for the attribute. diff --git a/Documentation/netlink/genetlink.yaml b/Documentation/netlink/genetlink.yaml index d8b2cdeba058..1cbb448d2f1c 100644 --- a/Documentation/netlink/genetlink.yaml +++ b/Documentation/netlink/genetlink.yaml @@ -168,6 +168,12 @@ properties: description: Max length for a string or a binary attribute. $ref: '#/$defs/len-or-define' sub-type: *attr-type + display-hint: &display-hint + description: | + Optional format indicator that is intended only for choosing + the right formatting mechanism when displaying values of this + type. + enum: [ hex, mac, fddi, ipv4, ipv6, uuid ] # Make sure name-prefix does not appear in subsets (subsets inherit naming) dependencies: diff --git a/Documentation/netlink/specs/devlink.yaml b/Documentation/netlink/specs/devlink.yaml index 90641668232e..5d46ca966979 100644 --- a/Documentation/netlink/specs/devlink.yaml +++ b/Documentation/netlink/specs/devlink.yaml @@ -9,6 +9,7 @@ doc: Partial family for Devlink. attribute-sets: - name: devlink + name-prefix: devlink-attr- attributes: - name: bus-name @@ -95,10 +96,12 @@ attribute-sets: - name: reload-action-info type: nest + multi-attr: true nested-attributes: dl-reload-act-info - name: reload-action-stats type: nest + multi-attr: true nested-attributes: dl-reload-act-stats - name: dl-dev-stats @@ -196,3 +199,8 @@ operations: attributes: - bus-name - dev-name + - info-driver-name + - info-serial-number + - info-version-fixed + - info-version-running + - info-version-stored diff --git a/Documentation/netlink/specs/ethtool.yaml b/Documentation/netlink/specs/ethtool.yaml index 4846345bade4..837b565577ca 100644 --- a/Documentation/netlink/specs/ethtool.yaml +++ b/Documentation/netlink/specs/ethtool.yaml @@ -9,8 +9,13 @@ doc: Partial family for Ethtool Netlink. definitions: - name: udp-tunnel-type + enum-name: type: enum entries: [ vxlan, geneve, vxlan-gpe ] + - + name: stringset + type: enum + entries: [] attribute-sets: - @@ -497,7 +502,7 @@ attribute-sets: attributes: - name: pad - type: u32 + type: pad - name: tx-frames type: u64 @@ -577,7 +582,7 @@ attribute-sets: name: phc-index type: u32 - - name: cable-test-ntf-nest-result + name: cable-result attributes: - name: pair @@ -586,7 +591,7 @@ attribute-sets: name: code type: u8 - - name: cable-test-ntf-nest-fault-length + name: cable-fault-length attributes: - name: pair @@ -595,16 +600,16 @@ attribute-sets: name: cm type: u32 - - name: cable-test-ntf-nest + name: cable-nest attributes: - name: result type: nest - nested-attributes: cable-test-ntf-nest-result + nested-attributes: cable-result - name: fault-length type: nest - nested-attributes: cable-test-ntf-nest-fault-length + nested-attributes: cable-fault-length - name: cable-test attributes: @@ -612,13 +617,20 @@ attribute-sets: name: header type: nest nested-attributes: header + - + name: cable-test-ntf + attributes: + - + name: header + type: nest + nested-attributes: header - name: status type: u8 - name: nest type: nest - nested-attributes: cable-test-ntf-nest + nested-attributes: cable-nest - name: cable-test-tdr-cfg attributes: @@ -632,9 +644,23 @@ attribute-sets: name: step type: u32 - - name: pari + name: pair type: u8 - + name: cable-test-tdr-ntf + attributes: + - + name: header + type: nest + nested-attributes: header + - + name: status + type: u8 + - + name: nest + type: nest + nested-attributes: cable-nest + - name: cable-test-tdr attributes: - @@ -646,7 +672,7 @@ attribute-sets: type: nest nested-attributes: cable-test-tdr-cfg - - name: tunnel-info-udp-entry + name: tunnel-udp-entry attributes: - name: port @@ -657,7 +683,7 @@ attribute-sets: type: u32 enum: udp-tunnel-type - - name: tunnel-info-udp-table + name: tunnel-udp-table attributes: - name: size @@ -667,9 +693,17 @@ attribute-sets: type: nest nested-attributes: bitset - - name: udp-ports + name: entry type: nest - nested-attributes: tunnel-info-udp-entry + multi-attr: true + nested-attributes: tunnel-udp-entry + - + name: tunnel-udp + attributes: + - + name: table + type: nest + nested-attributes: tunnel-udp-table - name: tunnel-info attributes: @@ -680,13 +714,13 @@ attribute-sets: - name: udp-ports type: nest - nested-attributes: tunnel-info-udp-table + nested-attributes: tunnel-udp - name: fec-stat attributes: - name: pad - type: u8 + type: pad - name: corrected type: binary @@ -750,7 +784,7 @@ attribute-sets: attributes: - name: pad - type: u32 + type: pad - name: id type: u32 @@ -759,16 +793,29 @@ attribute-sets: type: u32 - name: stat - type: nest - nested-attributes: u64 + type: u64 + type-value: [ id ] - name: hist-rx type: nest - nested-attributes: u64 + nested-attributes: stats-grp-hist - name: hist-tx type: nest - nested-attributes: u64 + nested-attributes: stats-grp-hist + - + name: hist-bkt-low + type: u32 + - + name: hist-bkt-hi + type: u32 + - + name: hist-val + type: u64 + - + name: stats-grp-hist + subset-of: stats-grp + attributes: - name: hist-bkt-low type: u32 @@ -783,7 +830,7 @@ attribute-sets: attributes: - name: pad - type: u32 + type: pad - name: header type: nest @@ -836,12 +883,15 @@ attribute-sets: - name: admin-state type: u32 + name-prefix: ethtool-a-podl-pse- - name: admin-control type: u32 + name-prefix: ethtool-a-podl-pse- - name: pw-d-status type: u32 + name-prefix: ethtool-a-podl-pse- - name: rss attributes: @@ -895,6 +945,7 @@ attribute-sets: operations: enum-model: directional + name-prefix: ethtool-msg- list: - name: strset-get @@ -1348,10 +1399,16 @@ operations: request: attributes: - header - reply: - attributes: - - header - - cable-test-ntf-nest + - + name: cable-test-ntf + doc: Cable test notification. + + attribute-set: cable-test-ntf + + event: + attributes: + - header + - status - name: cable-test-tdr-act doc: Cable test TDR. @@ -1362,10 +1419,17 @@ operations: request: attributes: - header - reply: - attributes: - - header - - cable-test-tdr-cfg + - + name: cable-test-tdr-ntf + doc: Cable test TDR notification. + + attribute-set: cable-test-tdr-ntf + + event: + attributes: + - header + - status + - nest - name: tunnel-info-get doc: Get tsinfo params. diff --git a/Documentation/netlink/specs/ovs_datapath.yaml b/Documentation/netlink/specs/ovs_datapath.yaml index 6d71db8c4416..f709c26c3e92 100644 --- a/Documentation/netlink/specs/ovs_datapath.yaml +++ b/Documentation/netlink/specs/ovs_datapath.yaml @@ -3,6 +3,7 @@ name: ovs_datapath version: 2 protocol: genetlink-legacy +uapi-header: linux/openvswitch.h doc: OVS datapath configuration over generic netlink. @@ -18,6 +19,7 @@ definitions: - name: user-features type: flags + name-prefix: ovs-dp-f- entries: - name: unaligned @@ -33,35 +35,37 @@ definitions: doc: Allow per-cpu dispatch of upcalls - name: datapath-stats + enum-name: ovs-dp-stats type: struct members: - - name: hit + name: n-hit type: u64 - - name: missed + name: n-missed type: u64 - - name: lost + name: n-lost type: u64 - - name: flows + name: n-flows type: u64 - name: megaflow-stats + enum-name: ovs-dp-megaflow-stats type: struct members: - - name: mask-hit + name: n-mask-hit type: u64 - - name: masks + name: n-masks type: u32 - name: padding type: u32 - - name: cache-hits + name: n-cache-hit type: u64 - name: pad1 @@ -70,6 +74,8 @@ definitions: attribute-sets: - name: datapath + name-prefix: ovs-dp-attr- + enum-name: ovs-datapath-attrs attributes: - name: name @@ -101,12 +107,16 @@ attribute-sets: name: per-cpu-pids type: binary sub-type: u32 + - + name: ifindex + type: u32 operations: fixed-header: ovs-header + name-prefix: ovs-dp-cmd- list: - - name: dp-get + name: get doc: Get / dump OVS data path configuration and state value: 3 attribute-set: datapath @@ -125,7 +135,7 @@ operations: - per-cpu-pids dump: *dp-get-op - - name: dp-new + name: new doc: Create new OVS data path value: 1 attribute-set: datapath @@ -137,7 +147,7 @@ operations: - upcall-pid - user-features - - name: dp-del + name: del doc: Delete existing OVS data path value: 2 attribute-set: datapath diff --git a/Documentation/netlink/specs/ovs_flow.yaml b/Documentation/netlink/specs/ovs_flow.yaml new file mode 100644 index 000000000000..109ca1f57b6c --- /dev/null +++ b/Documentation/netlink/specs/ovs_flow.yaml @@ -0,0 +1,980 @@ +# SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) + +name: ovs_flow +version: 1 +protocol: genetlink-legacy +uapi-header: linux/openvswitch.h + +doc: + OVS flow configuration over generic netlink. + +definitions: + - + name: ovs-header + type: struct + doc: | + Header for OVS Generic Netlink messages. + members: + - + name: dp-ifindex + type: u32 + doc: | + ifindex of local port for datapath (0 to make a request not specific + to a datapath). + - + name: ovs-flow-stats + type: struct + members: + - + name: n-packets + type: u64 + doc: Number of matched packets. + - + name: n-bytes + type: u64 + doc: Number of matched bytes. + - + name: ovs-key-ethernet + type: struct + members: + - + name: eth-src + type: binary + len: 6 + display-hint: mac + - + name: eth-dst + type: binary + len: 6 + display-hint: mac + - + name: ovs-key-mpls + type: struct + members: + - + name: mpls-lse + type: u32 + byte-order: big-endian + - + name: ovs-key-ipv4 + type: struct + members: + - + name: ipv4-src + type: u32 + byte-order: big-endian + display-hint: ipv4 + - + name: ipv4-dst + type: u32 + byte-order: big-endian + display-hint: ipv4 + - + name: ipv4-proto + type: u8 + - + name: ipv4-tos + type: u8 + - + name: ipv4-ttl + type: u8 + - + name: ipv4-frag + type: u8 + enum: ovs-frag-type + - + name: ovs-key-ipv6 + type: struct + members: + - + name: ipv6-src + type: binary + len: 16 + byte-order: big-endian + display-hint: ipv6 + - + name: ipv6-dst + type: binary + len: 16 + byte-order: big-endian + display-hint: ipv6 + - + name: ipv6-label + type: u32 + byte-order: big-endian + - + name: ipv6-proto + type: u8 + - + name: ipv6-tclass + type: u8 + - + name: ipv6-hlimit + type: u8 + - + name: ipv6-frag + type: u8 + - + name: ovs-key-ipv6-exthdrs + type: struct + members: + - + name: hdrs + type: u16 + - + name: ovs-frag-type + name-prefix: ovs-frag-type- + type: enum + entries: + - + name: none + doc: Packet is not a fragment. + - + name: first + doc: Packet is a fragment with offset 0. + - + name: later + doc: Packet is a fragment with nonzero offset. + - + name: any + value: 255 + - + name: ovs-key-tcp + type: struct + members: + - + name: tcp-src + type: u16 + byte-order: big-endian + - + name: tcp-dst + type: u16 + byte-order: big-endian + - + name: ovs-key-udp + type: struct + members: + - + name: udp-src + type: u16 + byte-order: big-endian + - + name: udp-dst + type: u16 + byte-order: big-endian + - + name: ovs-key-sctp + type: struct + members: + - + name: sctp-src + type: u16 + byte-order: big-endian + - + name: sctp-dst + type: u16 + byte-order: big-endian + - + name: ovs-key-icmp + type: struct + members: + - + name: icmp-type + type: u8 + - + name: icmp-code + type: u8 + - + name: ovs-key-arp + type: struct + members: + - + name: arp-sip + type: u32 + byte-order: big-endian + - + name: arp-tip + type: u32 + byte-order: big-endian + - + name: arp-op + type: u16 + byte-order: big-endian + - + name: arp-sha + type: binary + len: 6 + display-hint: mac + - + name: arp-tha + type: binary + len: 6 + display-hint: mac + - + name: ovs-key-nd + type: struct + members: + - + name: nd_target + type: binary + len: 16 + byte-order: big-endian + - + name: nd-sll + type: binary + len: 6 + display-hint: mac + - + name: nd-tll + type: binary + len: 6 + display-hint: mac + - + name: ovs-key-ct-tuple-ipv4 + type: struct + members: + - + name: ipv4-src + type: u32 + byte-order: big-endian + - + name: ipv4-dst + type: u32 + byte-order: big-endian + - + name: src-port + type: u16 + byte-order: big-endian + - + name: dst-port + type: u16 + byte-order: big-endian + - + name: ipv4-proto + type: u8 + - + name: ovs-action-push-vlan + type: struct + members: + - + name: vlan_tpid + type: u16 + byte-order: big-endian + doc: Tag protocol identifier (TPID) to push. + - + name: vlan_tci + type: u16 + byte-order: big-endian + doc: Tag control identifier (TCI) to push. + - + name: ovs-ufid-flags + name-prefix: ovs-ufid-f- + type: flags + entries: + - omit-key + - omit-mask + - omit-actions + - + name: ovs-action-hash + type: struct + members: + - + name: hash-alg + type: u32 + doc: Algorithm used to compute hash prior to recirculation. + - + name: hash-basis + type: u32 + doc: Basis used for computing hash. + - + name: ovs-hash-alg + type: enum + doc: | + Data path hash algorithm for computing Datapath hash. The algorithm type only specifies + the fields in a flow will be used as part of the hash. Each datapath is free to use its + own hash algorithm. The hash value will be opaque to the user space daemon. + entries: + - ovs-hash-alg-l4 + + - + name: ovs-action-push-mpls + type: struct + members: + - + name: mpls-lse + type: u32 + byte-order: big-endian + doc: | + MPLS label stack entry to push + - + name: mpls-ethertype + type: u32 + byte-order: big-endian + doc: | + Ethertype to set in the encapsulating ethernet frame. The only values + ethertype should ever be given are ETH_P_MPLS_UC and ETH_P_MPLS_MC, + indicating MPLS unicast or multicast. Other are rejected. + - + name: ovs-action-add-mpls + type: struct + members: + - + name: mpls-lse + type: u32 + byte-order: big-endian + doc: | + MPLS label stack entry to push + - + name: mpls-ethertype + type: u32 + byte-order: big-endian + doc: | + Ethertype to set in the encapsulating ethernet frame. The only values + ethertype should ever be given are ETH_P_MPLS_UC and ETH_P_MPLS_MC, + indicating MPLS unicast or multicast. Other are rejected. + - + name: tun-flags + type: u16 + doc: | + MPLS tunnel attributes. + - + name: ct-state-flags + type: flags + name-prefix: ovs-cs-f- + entries: + - + name: new + doc: Beginning of a new connection. + - + name: established + doc: Part of an existing connenction + - + name: related + doc: Related to an existing connection. + - + name: reply-dir + doc: Flow is in the reply direction. + - + name: invalid + doc: Could not track the connection. + - + name: tracked + doc: Conntrack has occurred. + - + name: src-nat + doc: Packet's source address/port was mangled by NAT. + - + name: dst-nat + doc: Packet's destination address/port was mangled by NAT. + +attribute-sets: + - + name: flow-attrs + enum-name: ovs-flow-attr + name-prefix: ovs-flow-attr- + attributes: + - + name: key + type: nest + nested-attributes: key-attrs + doc: | + Nested attributes specifying the flow key. Always present in + notifications. Required for all requests (except dumps). + - + name: actions + type: nest + nested-attributes: action-attrs + doc: | + Nested attributes specifying the actions to take for packets that + match the key. Always present in notifications. Required for + OVS_FLOW_CMD_NEW requests, optional for OVS_FLOW_CMD_SET requests. An + OVS_FLOW_CMD_SET without OVS_FLOW_ATTR_ACTIONS will not modify the + actions. To clear the actions, an OVS_FLOW_ATTR_ACTIONS without any + nested attributes must be given. + - + name: stats + type: binary + struct: ovs-flow-stats + doc: | + Statistics for this flow. Present in notifications if the stats would + be nonzero. Ignored in requests. + - + name: tcp-flags + type: u8 + doc: | + An 8-bit value giving the ORed value of all of the TCP flags seen on + packets in this flow. Only present in notifications for TCP flows, and + only if it would be nonzero. Ignored in requests. + - + name: used + type: u64 + doc: | + A 64-bit integer giving the time, in milliseconds on the system + monotonic clock, at which a packet was last processed for this + flow. Only present in notifications if a packet has been processed for + this flow. Ignored in requests. + - + name: clear + type: flag + doc: | + If present in a OVS_FLOW_CMD_SET request, clears the last-used time, + accumulated TCP flags, and statistics for this flow. Otherwise + ignored in requests. Never present in notifications. + - + name: mask + type: nest + nested-attributes: key-attrs + doc: | + Nested attributes specifying the mask bits for wildcarded flow + match. Mask bit value '1' specifies exact match with corresponding + flow key bit, while mask bit value '0' specifies a wildcarded + match. Omitting attribute is treated as wildcarding all corresponding + fields. Optional for all requests. If not present, all flow key bits + are exact match bits. + - + name: probe + type: binary + doc: | + Flow operation is a feature probe, error logging should be suppressed. + - + name: ufid + type: binary + doc: | + A value between 1-16 octets specifying a unique identifier for the + flow. Causes the flow to be indexed by this value rather than the + value of the OVS_FLOW_ATTR_KEY attribute. Optional for all + requests. Present in notifications if the flow was created with this + attribute. + display-hint: uuid + - + name: ufid-flags + type: u32 + enum: ovs-ufid-flags + doc: | + A 32-bit value of ORed flags that provide alternative semantics for + flow installation and retrieval. Optional for all requests. + - + name: pad + type: binary + + - + name: key-attrs + enum-name: ovs-key-attr + name-prefix: ovs-key-attr- + attributes: + - + name: encap + type: nest + nested-attributes: key-attrs + - + name: priority + type: u32 + - + name: in-port + type: u32 + - + name: ethernet + type: binary + struct: ovs-key-ethernet + doc: struct ovs_key_ethernet + - + name: vlan + type: u16 + byte-order: big-endian + - + name: ethertype + type: u16 + byte-order: big-endian + - + name: ipv4 + type: binary + struct: ovs-key-ipv4 + - + name: ipv6 + type: binary + struct: ovs-key-ipv6 + doc: struct ovs_key_ipv6 + - + name: tcp + type: binary + struct: ovs-key-tcp + - + name: udp + type: binary + struct: ovs-key-udp + - + name: icmp + type: binary + struct: ovs-key-icmp + - + name: icmpv6 + type: binary + struct: ovs-key-icmp + - + name: arp + type: binary + struct: ovs-key-arp + doc: struct ovs_key_arp + - + name: nd + type: binary + struct: ovs-key-nd + doc: struct ovs_key_nd + - + name: skb-mark + type: u32 + - + name: tunnel + type: nest + nested-attributes: tunnel-key-attrs + - + name: sctp + type: binary + struct: ovs-key-sctp + - + name: tcp-flags + type: u16 + byte-order: big-endian + - + name: dp-hash + type: u32 + doc: Value 0 indicates the hash is not computed by the datapath. + - + name: recirc-id + type: u32 + - + name: mpls + type: binary + struct: ovs-key-mpls + - + name: ct-state + type: u32 + enum: ct-state-flags + enum-as-flags: true + - + name: ct-zone + type: u16 + doc: connection tracking zone + - + name: ct-mark + type: u32 + doc: connection tracking mark + - + name: ct-labels + type: binary + display-hint: hex + doc: 16-octet connection tracking label + - + name: ct-orig-tuple-ipv4 + type: binary + struct: ovs-key-ct-tuple-ipv4 + - + name: ct-orig-tuple-ipv6 + type: binary + doc: struct ovs_key_ct_tuple_ipv6 + - + name: nsh + type: nest + nested-attributes: ovs-nsh-key-attrs + - + name: packet-type + type: u32 + byte-order: big-endian + doc: Should not be sent to the kernel + - + name: nd-extensions + type: binary + doc: Should not be sent to the kernel + - + name: tunnel-info + type: binary + doc: struct ip_tunnel_info + - + name: ipv6-exthdrs + type: binary + struct: ovs-key-ipv6-exthdrs + doc: struct ovs_key_ipv6_exthdr + - + name: action-attrs + enum-name: ovs-action-attr + name-prefix: ovs-action-attr- + attributes: + - + name: output + type: u32 + doc: ovs port number in datapath + - + name: userspace + type: nest + nested-attributes: userspace-attrs + - + name: set + type: nest + nested-attributes: key-attrs + doc: Replaces the contents of an existing header. The single nested attribute specifies a header to modify and its value. + - + name: push-vlan + type: binary + struct: ovs-action-push-vlan + doc: Push a new outermost 802.1Q or 802.1ad header onto the packet. + - + name: pop-vlan + type: flag + doc: Pop the outermost 802.1Q or 802.1ad header from the packet. + - + name: sample + type: nest + nested-attributes: sample-attrs + doc: | + Probabilistically executes actions, as specified in the nested attributes. + - + name: recirc + type: u32 + doc: recirc id + - + name: hash + type: binary + struct: ovs-action-hash + - + name: push-mpls + type: binary + struct: ovs-action-push-mpls + doc: | + Push a new MPLS label stack entry onto the top of the packets MPLS + label stack. Set the ethertype of the encapsulating frame to either + ETH_P_MPLS_UC or ETH_P_MPLS_MC to indicate the new packet contents. + - + name: pop-mpls + type: u16 + byte-order: big-endian + doc: ethertype + - + name: set-masked + type: nest + nested-attributes: key-attrs + doc: | + Replaces the contents of an existing header. A nested attribute + specifies a header to modify, its value, and a mask. For every bit set + in the mask, the corresponding bit value is copied from the value to + the packet header field, rest of the bits are left unchanged. The + non-masked value bits must be passed in as zeroes. Masking is not + supported for the OVS_KEY_ATTR_TUNNEL attribute. + - + name: ct + type: nest + nested-attributes: ct-attrs + doc: | + Track the connection. Populate the conntrack-related entries + in the flow key. + - + name: trunc + type: u32 + doc: struct ovs_action_trunc is a u32 max length + - + name: push-eth + type: binary + doc: struct ovs_action_push_eth + - + name: pop-eth + type: flag + - + name: ct-clear + type: flag + - + name: push-nsh + type: nest + nested-attributes: ovs-nsh-key-attrs + doc: | + Push NSH header to the packet. + - + name: pop-nsh + type: flag + doc: | + Pop the outermost NSH header off the packet. + - + name: meter + type: u32 + doc: | + Run packet through a meter, which may drop the packet, or modify the + packet (e.g., change the DSCP field) + - + name: clone + type: nest + nested-attributes: action-attrs + doc: | + Make a copy of the packet and execute a list of actions without + affecting the original packet and key. + - + name: check-pkt-len + type: nest + nested-attributes: check-pkt-len-attrs + doc: | + Check the packet length and execute a set of actions if greater than + the specified packet length, else execute another set of actions. + - + name: add-mpls + type: binary + struct: ovs-action-add-mpls + doc: | + Push a new MPLS label stack entry at the start of the packet or at the + start of the l3 header depending on the value of l3 tunnel flag in the + tun_flags field of this OVS_ACTION_ATTR_ADD_MPLS argument. + - + name: dec-ttl + type: nest + nested-attributes: dec-ttl-attrs + - + name: tunnel-key-attrs + enum-name: ovs-tunnel-key-attr + name-prefix: ovs-tunnel-key-attr- + attributes: + - + name: id + type: u64 + byte-order: big-endian + value: 0 + - + name: ipv4-src + type: u32 + byte-order: big-endian + - + name: ipv4-dst + type: u32 + byte-order: big-endian + - + name: tos + type: u8 + - + name: ttl + type: u8 + - + name: dont-fragment + type: flag + - + name: csum + type: flag + - + name: oam + type: flag + - + name: geneve-opts + type: binary + sub-type: u32 + - + name: tp-src + type: u16 + byte-order: big-endian + - + name: tp-dst + type: u16 + byte-order: big-endian + - + name: vxlan-opts + type: nest + nested-attributes: vxlan-ext-attrs + - + name: ipv6-src + type: binary + doc: | + struct in6_addr source IPv6 address + - + name: ipv6-dst + type: binary + doc: | + struct in6_addr destination IPv6 address + - + name: pad + type: binary + - + name: erspan-opts + type: binary + doc: | + struct erspan_metadata + - + name: ipv4-info-bridge + type: flag + - + name: check-pkt-len-attrs + enum-name: ovs-check-pkt-len-attr + name-prefix: ovs-check-pkt-len-attr- + attributes: + - + name: pkt-len + type: u16 + - + name: actions-if-greater + type: nest + nested-attributes: action-attrs + - + name: actions-if-less-equal + type: nest + nested-attributes: action-attrs + - + name: sample-attrs + enum-name: ovs-sample-attr + name-prefix: ovs-sample-attr- + attributes: + - + name: probability + type: u32 + - + name: actions + type: nest + nested-attributes: action-attrs + - + name: userspace-attrs + enum-name: ovs-userspace-attr + name-prefix: ovs-userspace-attr- + attributes: + - + name: pid + type: u32 + - + name: userdata + type: binary + - + name: egress-tun-port + type: u32 + - + name: actions + type: flag + - + name: ovs-nsh-key-attrs + enum-name: ovs-nsh-key-attr + name-prefix: ovs-nsh-key-attr- + attributes: + - + name: base + type: binary + - + name: md1 + type: binary + - + name: md2 + type: binary + - + name: ct-attrs + enum-name: ovs-ct-attr + name-prefix: ovs-ct-attr- + attributes: + - + name: commit + type: flag + - + name: zone + type: u16 + - + name: mark + type: binary + - + name: labels + type: binary + - + name: helper + type: string + - + name: nat + type: nest + nested-attributes: nat-attrs + - + name: force-commit + type: flag + - + name: eventmask + type: u32 + - + name: timeout + type: string + - + name: nat-attrs + enum-name: ovs-nat-attr + name-prefix: ovs-nat-attr- + attributes: + - + name: src + type: flag + - + name: dst + type: flag + - + name: ip-min + type: binary + - + name: ip-max + type: binary + - + name: proto-min + type: u16 + - + name: proto-max + type: u16 + - + name: persistent + type: flag + - + name: proto-hash + type: flag + - + name: proto-random + type: flag + - + name: dec-ttl-attrs + enum-name: ovs-dec-ttl-attr + name-prefix: ovs-dec-ttl-attr- + attributes: + - + name: action + type: nest + nested-attributes: action-attrs + - + name: vxlan-ext-attrs + enum-name: ovs-vxlan-ext- + name-prefix: ovs-vxlan-ext- + attributes: + - + name: gbp + type: u32 + +operations: + name-prefix: ovs-flow-cmd- + fixed-header: ovs-header + list: + - + name: get + doc: Get / dump OVS flow configuration and state + value: 3 + attribute-set: flow-attrs + do: &flow-get-op + request: + attributes: + - dp-ifindex + - key + - ufid + - ufid-flags + reply: + attributes: + - dp-ifindex + - key + - ufid + - mask + - stats + - actions + dump: *flow-get-op + - + name: new + doc: Create OVS flow configuration in a data path + value: 1 + attribute-set: flow-attrs + do: + request: + attributes: + - dp-ifindex + - key + - ufid + - mask + - actions + +mcast-groups: + list: + - + name: ovs_flow diff --git a/Documentation/netlink/specs/ovs_vport.yaml b/Documentation/netlink/specs/ovs_vport.yaml index 8e55622ddf11..17336455bec1 100644 --- a/Documentation/netlink/specs/ovs_vport.yaml +++ b/Documentation/netlink/specs/ovs_vport.yaml @@ -3,6 +3,7 @@ name: ovs_vport version: 2 protocol: genetlink-legacy +uapi-header: linux/openvswitch.h doc: OVS vport configuration over generic netlink. @@ -18,10 +19,13 @@ definitions: - name: vport-type type: enum + enum-name: ovs-vport-type + name-prefix: ovs-vport-type- entries: [ unspec, netdev, internal, gre, vxlan, geneve ] - name: vport-stats type: struct + enum-name: ovs-vport-stats members: - name: rx-packets @@ -51,6 +55,8 @@ definitions: attribute-sets: - name: vport-options + enum-name: ovs-vport-options + name-prefix: ovs-tunnel-attr- attributes: - name: dst-port @@ -60,6 +66,8 @@ attribute-sets: type: u32 - name: upcall-stats + enum-name: ovs-vport-upcall-attr + name-prefix: ovs-vport-upcall-attr- attributes: - name: success @@ -70,6 +78,8 @@ attribute-sets: type: u64 - name: vport + name-prefix: ovs-vport-attr- + enum-name: ovs-vport-attr attributes: - name: port-no @@ -108,9 +118,10 @@ attribute-sets: nested-attributes: upcall-stats operations: + name-prefix: ovs-vport-cmd- list: - - name: vport-get + name: get doc: Get / dump OVS vport configuration and state value: 3 attribute-set: vport diff --git a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst index 8bcb173e0353..5eaa3ab6c73e 100644 --- a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst +++ b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst @@ -38,6 +38,7 @@ debug logs. Some of the ENA devices support a working mode called Low-latency Queue (LLQ), which saves several more microseconds. + ENA Source Code Directory Structure =================================== @@ -205,6 +206,8 @@ Adaptive coalescing can be switched on/off through `ethtool(8)`'s More information about Adaptive Interrupt Moderation (DIM) can be found in Documentation/networking/net_dim.rst +.. _`RX copybreak`: + RX copybreak ============ The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK @@ -315,3 +318,34 @@ Rx - The new SKB is updated with the necessary information (protocol, checksum hw verify result, etc), and then passed to the network stack, using the NAPI interface function :code:`napi_gro_receive()`. + +Dynamic RX Buffers (DRB) +------------------------ + +Each RX descriptor in the RX ring is a single memory page (which is either 4KB +or 16KB long depending on system's configurations). +To reduce the memory allocations required when dealing with a high rate of small +packets, the driver tries to reuse the remaining RX descriptor's space if more +than 2KB of this page remain unused. + +A simple example of this mechanism is the following sequence of events: + +:: + + 1. Driver allocates page-sized RX buffer and passes it to hardware + +----------------------+ + |4KB RX Buffer | + +----------------------+ + + 2. A 300Bytes packet is received on this buffer + + 3. The driver increases the ref count on this page and returns it back to + HW as an RX buffer of size 4KB - 300Bytes = 3796 Bytes + +----+--------------------+ + |****|3796 Bytes RX Buffer| + +----+--------------------+ + +This mechanism isn't used when an XDP program is loaded, or when the +RX packet is less than rx_copybreak bytes (in which case the packet is +copied out of the RX buffer into the linear part of a new skb allocated +for it and the RX buffer remains the same size, see `RX copybreak`_). diff --git a/Documentation/networking/device_drivers/ethernet/intel/ice.rst b/Documentation/networking/device_drivers/ethernet/intel/ice.rst index 69695e5511f4..e4d065c55ea8 100644 --- a/Documentation/networking/device_drivers/ethernet/intel/ice.rst +++ b/Documentation/networking/device_drivers/ethernet/intel/ice.rst @@ -84,24 +84,6 @@ Once the VM shuts down, or otherwise releases the VF, the command will complete. -Important notes for SR-IOV and Link Aggregation ------------------------------------------------ -Link Aggregation is mutually exclusive with SR-IOV. - -- If Link Aggregation is active, SR-IOV VFs cannot be created on the PF. -- If SR-IOV is active, you cannot set up Link Aggregation on the interface. - -Bridging and MACVLAN are also affected by this. If you wish to use bridging or -MACVLAN with SR-IOV, you must set up bridging or MACVLAN before enabling -SR-IOV. If you are using bridging or MACVLAN in conjunction with SR-IOV, and -you want to remove the interface from the bridge or MACVLAN, you must follow -these steps: - -1. Destroy SR-IOV VFs if they exist -2. Remove the interface from the bridge or MACVLAN -3. Recreate SRIOV VFs as needed - - Additional Features and Configurations ====================================== diff --git a/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst index 5ba9015336e2..bfd233cfac35 100644 --- a/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst +++ b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst @@ -13,6 +13,7 @@ Contents - `Drivers`_ - `Basic packet flow`_ - `Devlink health reporters`_ +- `Quality of service`_ Overview ======== @@ -287,3 +288,47 @@ For example:: NIX_AF_ERR: NIX Error Interrupt Reg : 64 Rx on unmapped PF_FUNC + + +Quality of service +================== + + +Hardware algorithms used in scheduling +-------------------------------------- + +octeontx2 silicon and CN10K transmit interface consists of five transmit levels +starting from SMQ/MDQ, TL4 to TL1. Each packet will traverse MDQ, TL4 to TL1 +levels. Each level contains an array of queues to support scheduling and shaping. +The hardware uses the below algorithms depending on the priority of scheduler queues. +once the usercreates tc classes with different priorities, the driver configures +schedulers allocated to the class with specified priority along with rate-limiting +configuration. + +1. Strict Priority + + - Once packets are submitted to MDQ, hardware picks all active MDQs having different priority + using strict priority. + +2. Round Robin + + - Active MDQs having the same priority level are chosen using round robin. + + +Setup HTB offload +----------------- + +1. Enable HW TC offload on the interface:: + + # ethtool -K <interface> hw-tc-offload on + +2. Crate htb root:: + + # tc qdisc add dev <interface> clsact + # tc qdisc replace dev <interface> root handle 1: htb offload + +3. Create tc classes with different priorities:: + + # tc class add dev <interface> parent 1: classid 1:1 htb rate 10Gbit prio 1 + + # tc class add dev <interface> parent 1: classid 1:2 htb rate 10Gbit prio 7 diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst index 6b2d1fe74ecf..a395df9c2751 100644 --- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst @@ -797,6 +797,16 @@ Counters on the NIC port that is connected to a eSwitch. RoCE/UD/RC traffic) [#accel]_. - Acceleration + * - `vport_loopback_packets` + - Unicast, multicast and broadcast packets that were loop-back (received + and transmitted), IB/Eth [#accel]_. + - Acceleration + + * - `vport_loopback_bytes` + - Unicast, multicast and broadcast bytes that were loop-back (received + and transmitted), IB/Eth [#accel]_. + - Acceleration + * - `rx_steer_missed_packets` - Number of packets that was received by the NIC, however was discarded because it did not match any flow in the NIC flow table. diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst index 3354ca3608ee..a4edf908b707 100644 --- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst @@ -290,6 +290,13 @@ Description of the vnic counters: - nic_receive_steering_discard number of packets that completed RX flow steering but were discarded due to a mismatch in flow table. +- generated_pkt_steering_fail + number of packets generated by the VNIC experiencing unexpected steering + failure (at any point in steering flow). +- handled_pkt_steering_fail + number of packets handled by the VNIC experiencing unexpected steering + failure (at any point in steering flow owned by the VNIC, including the FDB + for the eswitch owner). User commands examples: diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst index 01deedb71597..6e3f5ee8b0d0 100644 --- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst @@ -45,6 +45,28 @@ Following bridge VLAN functions are supported by mlx5: Subfunction =========== +Subfunction which are spawned over the E-switch are created only with devlink +device, and by default all the SF auxiliary devices are disabled. +This will allow user to configure the SF before the SF have been fully probed, +which will save time. + +Usage example: + +- Create SF:: + + $ devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 11 + $ devlink port function set pci/0000:08:00.0/32768 hw_addr 00:00:00:00:00:11 state active + +- Enable ETH auxiliary device:: + + $ devlink dev param set auxiliary/mlx5_core.sf.1 name enable_eth value true cmode driverinit + +- Now, in order to fully probe the SF, use devlink reload:: + + $ devlink dev reload auxiliary/mlx5_core.sf.1 + +mlx5 supports ETH,rdma and vdpa (vnet) auxiliary devices devlink params (see :ref:`Documentation/networking/devlink/devlink-params.rst <devlink_params_generic>`). + mlx5 supports subfunction management using devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface. A subfunction has its own function capabilities and its own resources. This diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 80b8f73a0244..4a010a7cde7f 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -881,9 +881,10 @@ tcp_fastopen_key - list of comma separated 32-digit hexadecimal INTEGERs tcp_syn_retries - INTEGER Number of times initial SYNs for an active TCP connection attempt will be retransmitted. Should not be higher than 127. Default value - is 6, which corresponds to 63seconds till the last retransmission - with the current initial RTO of 1second. With this the final timeout - for an active TCP connection attempt will happen after 127seconds. + is 6, which corresponds to 67seconds (with tcp_syn_linear_timeouts = 4) + till the last retransmission with the current initial RTO of 1second. + With this the final timeout for an active TCP connection attempt + will happen after 131seconds. tcp_timestamps - INTEGER Enable timestamps as defined in RFC1323. @@ -946,6 +947,16 @@ tcp_pacing_ca_ratio - INTEGER Default: 120 +tcp_syn_linear_timeouts - INTEGER + The number of times for an active TCP connection to retransmit SYNs with + a linear backoff timeout before defaulting to an exponential backoff + timeout. This has no effect on SYNACK at the passive TCP side. + + With an initial RTO of 1 and tcp_syn_linear_timeouts = 4 we would + expect SYN RTOs to be: 1, 1, 1, 1, 1, 2, 4, ... (4 linear timeouts, + and the first exponential backoff using 2^0 * initial_RTO). + Default: 4 + tcp_tso_win_divisor - INTEGER This allows control over what percentage of the congestion window can be consumed by a single TSO frame. @@ -970,6 +981,21 @@ tcp_tw_reuse - INTEGER tcp_window_scaling - BOOLEAN Enable window scaling as defined in RFC1323. +tcp_shrink_window - BOOLEAN + This changes how the TCP receive window is calculated. + + RFC 7323, section 2.4, says there are instances when a retracted + window can be offered, and that TCP implementations MUST ensure + that they handle a shrinking window, as specified in RFC 1122. + + - 0 - Disabled. The window is never shrunk. + - 1 - Enabled. The window is shrunk when necessary to remain within + the memory limit set by autotuning (sk_rcvbuf). + This only occurs if a non-zero receive window + scaling factor is also in effect. + + Default: 0 + tcp_wmem - vector of 3 INTEGERs: min, default, max min: Amount of memory reserved for send buffers for TCP sockets. Each TCP socket has rights to use it due to fact of its birth. diff --git a/Documentation/networking/scaling.rst b/Documentation/networking/scaling.rst index 3d435caa3ef2..92c9fb46d6a2 100644 --- a/Documentation/networking/scaling.rst +++ b/Documentation/networking/scaling.rst @@ -269,8 +269,8 @@ a single application thread handles flows with many different flow hashes. rps_sock_flow_table is a global flow table that contains the *desired* CPU for flows: the CPU that is currently processing the flow in userspace. Each table value is a CPU index that is updated during calls to recvmsg -and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage() -and tcp_splice_read()). +and sendmsg (specifically, inet_recvmsg(), inet_sendmsg() and +tcp_splice_read()). When the scheduler moves a thread to a new CPU while it has outstanding receive packets on the old CPU, packets may arrive out of order. To diff --git a/Documentation/userspace-api/netlink/intro-specs.rst b/Documentation/userspace-api/netlink/intro-specs.rst index a3b847eafff7..bada89699455 100644 --- a/Documentation/userspace-api/netlink/intro-specs.rst +++ b/Documentation/userspace-api/netlink/intro-specs.rst @@ -78,3 +78,82 @@ to see other examples. The code generation itself is performed by ``tools/net/ynl/ynl-gen-c.py`` but it takes a few arguments so calling it directly for each file quickly becomes tedious. + +YNL lib +======= + +``tools/net/ynl/lib/`` contains an implementation of a C library +(based on libmnl) which integrates with code generated by +``tools/net/ynl/ynl-gen-c.py`` to create easy to use netlink wrappers. + +YNL basics +---------- + +The YNL library consists of two parts - the generic code (functions +prefix by ``ynl_``) and per-family auto-generated code (prefixed +with the name of the family). + +To create a YNL socket call ynl_sock_create() passing the family +struct (family structs are exported by the auto-generated code). +ynl_sock_destroy() closes the socket. + +YNL requests +------------ + +Steps for issuing YNL requests are best explained on an example. +All the functions and types in this example come from the auto-generated +code (for the netdev family in this case): + +.. code-block:: c + + // 0. Request and response pointers + struct netdev_dev_get_req *req; + struct netdev_dev_get_rsp *d; + + // 1. Allocate a request + req = netdev_dev_get_req_alloc(); + // 2. Set request parameters (as needed) + netdev_dev_get_req_set_ifindex(req, ifindex); + + // 3. Issues the request + d = netdev_dev_get(ys, req); + // 4. Free the request arguments + netdev_dev_get_req_free(req); + // 5. Error check (the return value from step 3) + if (!d) { + // 6. Print the YNL-generated error + fprintf(stderr, "YNL: %s\n", ys->err.msg); + return -1; + } + + // ... do stuff with the response @d + + // 7. Free response + netdev_dev_get_rsp_free(d); + +YNL dumps +--------- + +Performing dumps follows similar pattern as requests. +Dumps return a list of objects terminated by a special marker, +or NULL on error. Use ``ynl_dump_foreach()`` to iterate over +the result. + +YNL notifications +----------------- + +YNL lib supports using the same socket for notifications and +requests. In case notifications arrive during processing of a request +they are queued internally and can be retrieved at a later time. + +To subscribed to notifications use ``ynl_subscribe()``. +The notifications have to be read out from the socket, +``ynl_socket_get_fd()`` returns the underlying socket fd which can +be plugged into appropriate asynchronous IO API like ``poll``, +or ``select``. + +Notifications can be retrieved using ``ynl_ntf_dequeue()`` and have +to be freed using ``ynl_ntf_free()``. Since we don't know the notification +type upfront the notifications are returned as ``struct ynl_ntf_base_type *`` +and user is expected to cast them to the appropriate full type based +on the ``cmd`` member. |