Diffstat (limited to 'Documentation/networking')
-rw-r--r-- Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst | 3
-rw-r--r-- Documentation/networking/device_drivers/ethernet/amd/pds_core.rst | 139
-rw-r--r-- Documentation/networking/device_drivers/ethernet/index.rst | 2
-rw-r--r-- Documentation/networking/device_drivers/ethernet/intel/e100.rst | 7
-rw-r--r-- Documentation/networking/device_drivers/ethernet/intel/e1000.rst | 9
-rw-r--r-- Documentation/networking/device_drivers/ethernet/intel/e1000e.rst | 7
-rw-r--r-- Documentation/networking/device_drivers/ethernet/intel/fm10k.rst | 7
-rw-r--r-- Documentation/networking/device_drivers/ethernet/intel/i40e.rst | 11
-rw-r--r-- Documentation/networking/device_drivers/ethernet/intel/iavf.rst | 7
-rw-r--r-- Documentation/networking/device_drivers/ethernet/intel/ice.rst | 9
-rw-r--r-- Documentation/networking/device_drivers/ethernet/intel/igb.rst | 7
-rw-r--r-- Documentation/networking/device_drivers/ethernet/intel/igbvf.rst | 7
-rw-r--r-- Documentation/networking/device_drivers/ethernet/intel/ixgb.rst | 468
-rw-r--r-- Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst | 7
-rw-r--r-- Documentation/networking/device_drivers/ethernet/intel/ixgbevf.rst | 7
-rw-r--r-- Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst | 26
-rw-r--r-- Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst | 68
-rw-r--r-- Documentation/networking/devlink/mlx5.rst | 12
-rw-r--r-- Documentation/networking/driver.rst | 156
-rw-r--r-- Documentation/networking/ethtool-netlink.rst | 51
-rw-r--r-- Documentation/networking/index.rst | 2
-rw-r--r-- Documentation/networking/ip-sysctl.rst | 7
-rw-r--r-- Documentation/networking/napi.rst | 254
-rw-r--r-- Documentation/networking/page_pool.rst | 1
-rw-r--r-- Documentation/networking/rxrpc.rst | 17
-rw-r--r-- Documentation/networking/tls-handshake.rst | 217
26 files changed, 859 insertions, 649 deletions
diff --git a/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst b/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst
index 1a4fc6607582..1661d13174d5 100644
--- a/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst
+++ b/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst
@@ -229,8 +229,7 @@ frames for a while. This has a potential to avoid the costly round of
enabling interrupts, handling an incoming IRQ in ISR, re-enabling the
softirq and switching context back to softirq.
-More detailed documentation of NAPI may be found on the pages of Linux
-Foundation `<https://wiki.linuxfoundation.org/networking/napi>`_.
+See :ref:`Documentation/networking/napi.rst <napi>` for more information.
Integrating the core to Xilinx Zynq
-----------------------------------
diff --git a/Documentation/networking/device_drivers/ethernet/amd/pds_core.rst b/Documentation/networking/device_drivers/ethernet/amd/pds_core.rst
new file mode 100644
index 000000000000..9e8a16c44102
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/amd/pds_core.rst
@@ -0,0 +1,139 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
+========================================================
+Linux Driver for the AMD/Pensando(R) DSC adapter family
+========================================================
+
+Copyright(c) 2023 Advanced Micro Devices, Inc
+
+Identifying the Adapter
+=======================
+
+To find if one or more AMD/Pensando PCI Core devices are installed on the
+host, check for the PCI devices::
+
+ # lspci -d 1dd8:100c
+ b5:00.0 Processing accelerators: Pensando Systems Device 100c
+ b6:00.0 Processing accelerators: Pensando Systems Device 100c
+
+If such devices are listed as above, then the pds_core.ko driver should find
+and configure them for use. There should be log entries in the kernel
+messages such as these::
+
+ $ dmesg | grep pds_core
+ pds_core 0000:b5:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
+ pds_core 0000:b5:00.0: FW: 1.60.0-73
+ pds_core 0000:b6:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
+ pds_core 0000:b6:00.0: FW: 1.60.0-73
+
+Driver and firmware version information can be gathered with devlink::
+
+ $ devlink dev info pci/0000:b5:00.0
+ pci/0000:b5:00.0:
+ driver pds_core
+ serial_number FLM18420073
+ versions:
+ fixed:
+ asic.id 0x0
+ asic.rev 0x0
+ running:
+ fw 1.51.0-73
+ stored:
+ fw.goldfw 1.15.9-C-22
+ fw.mainfwa 1.60.0-73
+ fw.mainfwb 1.60.0-57
+
+Info versions
+=============
+
+The ``pds_core`` driver reports the following versions:
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``fw``
+ - running
+ - Version of firmware running on the device
+ * - ``fw.goldfw``
+ - stored
+ - Version of firmware stored in the goldfw slot
+ * - ``fw.mainfwa``
+ - stored
+ - Version of firmware stored in the mainfwa slot
+ * - ``fw.mainfwb``
+ - stored
+ - Version of firmware stored in the mainfwb slot
+ * - ``asic.id``
+ - fixed
+ - The ASIC type for this device
+ * - ``asic.rev``
+ - fixed
+ - The revision of the ASIC for this device
+
+Parameters
+==========
+
+The ``pds_core`` driver implements the following generic
+parameters for controlling the functionality to be made available
+as auxiliary_bus devices.
+
+.. list-table:: Generic parameters implemented
+ :widths: 5 5 8 82
+
+ * - Name
+ - Mode
+ - Type
+ - Description
+ * - ``enable_vnet``
+ - runtime
+ - Boolean
+ - Enables vDPA functionality through an auxiliary_bus device
+
+Firmware Management
+===================
+
+The ``flash`` command can update the DSC firmware. The downloaded firmware
+will be saved into either firmware bank 1 or bank 2, whichever is not
+currently in use, and that bank will be used for the next boot::
+
+ # devlink dev flash pci/0000:b5:00.0 \
+ file pensando/dsc_fw_1.63.0-22.tar
+
+Health Reporters
+================
+
+The driver supports a devlink health reporter for FW status::
+
+ # devlink health show pci/0000:2b:00.0 reporter fw
+ pci/0000:2b:00.0:
+ reporter fw
+ state healthy error 0 recover 0
+ # devlink health diagnose pci/0000:2b:00.0 reporter fw
+ Status: healthy State: 1 Generation: 0 Recoveries: 0
+
+Enabling the driver
+===================
+
+The driver is enabled via the standard kernel configuration system,
+using the make command::
+
+ make oldconfig/menuconfig/etc.
+
+The driver is located in the menu structure at:
+
+ -> Device Drivers
+ -> Network device support (NETDEVICES [=y])
+ -> Ethernet driver support
+ -> AMD devices
+ -> AMD/Pensando Ethernet PDS_CORE Support
+
+Support
+=======
+
+For general Linux networking support, please use the netdev mailing
+list, which is monitored by AMD/Pensando personnel::
+
+ netdev@vger.kernel.org
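
Note on the new pds_core.rst above (not part of the patch): the ``devlink dev info`` listing it documents groups versions under ``fixed``, ``running`` and ``stored``. The following is a minimal Python sketch of how a script might turn that plain-text listing into a dictionary, for example to compare the running firmware against the stored slots. The function name ``parse_devlink_info`` and the indentation of the sample text are assumptions made for illustration; the sample values are copied from the example in the new file, and the parser keys off the section names rather than the indentation.

```python
"""Parse plain-text `devlink dev info` output into a nested dict (sketch)."""

# Version groups shown in the documented example output.
SECTIONS = {"fixed", "running", "stored"}

# Sample taken from the pds_core.rst example; indentation is assumed.
SAMPLE = """\
pci/0000:b5:00.0:
  driver pds_core
  serial_number FLM18420073
  versions:
      fixed:
        asic.id 0x0
        asic.rev 0x0
      running:
        fw 1.51.0-73
      stored:
        fw.goldfw 1.15.9-C-22
        fw.mainfwa 1.60.0-73
        fw.mainfwb 1.60.0-57
"""


def parse_devlink_info(text):
    """Return {'device', 'driver', ..., 'versions': {group: {name: value}}}."""
    info = {"versions": {}}
    section = None  # current version group, or None for top-level fields
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            name = line[:-1]
            if name in SECTIONS:
                section = name
                info["versions"][section] = {}
            elif name != "versions":
                # The first header line carries the device handle.
                info["device"] = name
            continue
        # Every remaining line has the form "<key> <value>".
        key, _, value = line.partition(" ")
        if section is None:
            info[key] = value.strip()
        else:
            info["versions"][section][key] = value.strip()
    return info


if __name__ == "__main__":
    parsed = parse_devlink_info(SAMPLE)
    for slot, version in sorted(parsed["versions"]["stored"].items()):
        print(slot, version)
```

On a real system the text would come from running ``devlink dev info`` against the device rather than from the embedded sample; the actual output layout may differ between iproute2 versions, so treat the parser as a starting point only.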
diff --git a/Documentation/networking/device_drivers/ethernet/index.rst b/Documentation/networking/device_drivers/ethernet/index.rst
index 392969ac88ad..417ca514a4d0 100644
--- a/Documentation/networking/device_drivers/ethernet/index.rst
+++ b/Documentation/networking/device_drivers/ethernet/index.rst
@@ -14,6 +14,7 @@ Contents:
3com/vortex
amazon/ena
altera/altera_tse
+ amd/pds_core
aquantia/atlantic
chelsio/cxgb
cirrus/cs89x0
@@ -31,7 +32,6 @@ Contents:
intel/fm10k
intel/igb
intel/igbvf
- intel/ixgb
intel/ixgbe
intel/ixgbevf
intel/i40e
diff --git a/Documentation/networking/device_drivers/ethernet/intel/e100.rst b/Documentation/networking/device_drivers/ethernet/intel/e100.rst
index 3d4a9ba21946..5dee1b53e977 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/e100.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/e100.rst
@@ -151,8 +151,7 @@ NAPI
NAPI (Rx polling mode) is supported in the e100 driver.
-See https://wiki.linuxfoundation.org/networking/napi for more
-information on NAPI.
+See :ref:`Documentation/networking/napi.rst <napi>` for more information.
Multiple Interfaces on Same Ethernet Broadcast Network
------------------------------------------------------
@@ -181,8 +180,6 @@ Support
For general information, go to the Intel support website at:
https://www.intel.com/support/
-or the Intel Wired Networking project hosted by Sourceforge at:
-http://sourceforge.net/projects/e1000
If an issue is identified with the released source code on a supported kernel
with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/e1000.rst b/Documentation/networking/device_drivers/ethernet/intel/e1000.rst
index 4aaae0f7d6ba..52a7fb9ce8d9 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/e1000.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/e1000.rst
@@ -451,13 +451,8 @@ Support
=======
For general information, go to the Intel support website at:
-
- http://support.intel.com
-
-or the Intel Wired Networking project hosted by Sourceforge at:
-
- http://sourceforge.net/projects/e1000
+http://support.intel.com
If an issue is identified with the released source code on the supported
kernel with a supported adapter, email the specific information related
-to the issue to e1000-devel@lists.sf.net
+to the issue to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/e1000e.rst b/Documentation/networking/device_drivers/ethernet/intel/e1000e.rst
index f49cd370e7bf..d8f810afdd49 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/e1000e.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/e1000e.rst
@@ -371,13 +371,8 @@ NOTE: Wake on LAN is only supported on port A for the following devices:
Support
=======
For general information, go to the Intel support website at:
-
https://www.intel.com/support/
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
If an issue is identified with the released source code on a supported kernel
with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/fm10k.rst b/Documentation/networking/device_drivers/ethernet/intel/fm10k.rst
index 9258ef6f515c..396a2c8c3db1 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/fm10k.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/fm10k.rst
@@ -130,13 +130,8 @@ the Intel Ethernet Controller XL710.
Support
=======
For general information, go to the Intel support website at:
-
https://www.intel.com/support/
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
If an issue is identified with the released source code on a supported kernel
with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/i40e.rst b/Documentation/networking/device_drivers/ethernet/intel/i40e.rst
index ac35bd472bdc..4fbaa1a2d674 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/i40e.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/i40e.rst
@@ -399,8 +399,8 @@ operate only in full duplex and only at their native speed.
NAPI
----
NAPI (Rx polling mode) is supported in the i40e driver.
-For more information on NAPI, see
-https://wiki.linuxfoundation.org/networking/napi
+
+See :ref:`Documentation/networking/napi.rst <napi>` for more information.
Flow Control
------------
@@ -759,13 +759,8 @@ enabled when setting up DCB on your switch.
Support
=======
For general information, go to the Intel support website at:
-
https://www.intel.com/support/
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
If an issue is identified with the released source code on a supported kernel
with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/iavf.rst b/Documentation/networking/device_drivers/ethernet/intel/iavf.rst
index 151af0a8da9c..eb926c3bd4cd 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/iavf.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/iavf.rst
@@ -319,13 +319,8 @@ This is caused by the way the Linux kernel reports this stressed condition.
Support
=======
For general information, go to the Intel support website at:
-
https://support.intel.com
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
If an issue is identified with the released source code on the supported kernel
with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/ice.rst b/Documentation/networking/device_drivers/ethernet/intel/ice.rst
index 5efea4dd1251..69695e5511f4 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/ice.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/ice.rst
@@ -817,10 +817,10 @@ NOTE:
NAPI
----
+
This driver supports NAPI (Rx polling mode).
-For more information on NAPI, see
-https://wiki.linuxfoundation.org/networking/napi
+See :ref:`Documentation/networking/napi.rst <napi>` for more information.
MACVLAN
-------
@@ -1026,12 +1026,9 @@ Support
For general information, go to the Intel support website at:
https://www.intel.com/support/
-or the Intel Wired Networking project hosted by Sourceforge at:
-https://sourceforge.net/projects/e1000
-
If an issue is identified with the released source code on a supported kernel
with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
Trademarks
diff --git a/Documentation/networking/device_drivers/ethernet/intel/igb.rst b/Documentation/networking/device_drivers/ethernet/intel/igb.rst
index d46289e182cf..fbd590b6a0d6 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/igb.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/igb.rst
@@ -201,13 +201,8 @@ NOTE: This feature is exclusive to i210 models.
Support
=======
For general information, go to the Intel support website at:
-
https://www.intel.com/support/
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
If an issue is identified with the released source code on a supported kernel
with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/igbvf.rst b/Documentation/networking/device_drivers/ethernet/intel/igbvf.rst
index 40fa210c5e14..11a9017f3069 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/igbvf.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/igbvf.rst
@@ -53,13 +53,8 @@ https://www.kernel.org/pub/software/network/ethtool/
Support
=======
For general information, go to the Intel support website at:
-
https://www.intel.com/support/
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
If an issue is identified with the released source code on a supported kernel
with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/ixgb.rst b/Documentation/networking/device_drivers/ethernet/intel/ixgb.rst
deleted file mode 100644
index c6a233e68ad6..000000000000
--- a/Documentation/networking/device_drivers/ethernet/intel/ixgb.rst
+++ /dev/null
@@ -1,468 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0+
-
-=====================================================================
-Linux Base Driver for 10 Gigabit Intel(R) Ethernet Network Connection
-=====================================================================
-
-October 1, 2018
-
-
-Contents
-========
-
-- In This Release
-- Identifying Your Adapter
-- Command Line Parameters
-- Improving Performance
-- Additional Configurations
-- Known Issues/Troubleshooting
-- Support
-
-
-
-In This Release
-===============
-
-This file describes the ixgb Linux Base Driver for the 10 Gigabit Intel(R)
-Network Connection. This driver includes support for Itanium(R)2-based
-systems.
-
-For questions related to hardware requirements, refer to the documentation
-supplied with your 10 Gigabit adapter. All hardware requirements listed apply
-to use with Linux.
-
-The following features are available in this kernel:
- - Native VLANs
- - Channel Bonding (teaming)
- - SNMP
-
-Channel Bonding documentation can be found in the Linux kernel source:
-/Documentation/networking/bonding.rst
-
-The driver information previously displayed in the /proc filesystem is not
-supported in this release. Alternatively, you can use ethtool (version 1.6
-or later), lspci, and iproute2 to obtain the same information.
-
-Instructions on updating ethtool can be found in the section "Additional
-Configurations" later in this document.
-
-
-Identifying Your Adapter
-========================
-
-The following Intel network adapters are compatible with the drivers in this
-release:
-
-+------------+------------------------------+----------------------------------+
-| Controller | Adapter Name | Physical Layer |
-+============+==============================+==================================+
-| 82597EX | Intel(R) PRO/10GbE LR/SR/CX4 | - 10G Base-LR (fiber) |
-| | Server Adapters | - 10G Base-SR (fiber) |
-| | | - 10G Base-CX4 (copper) |
-+------------+------------------------------+----------------------------------+
-
-For more information on how to identify your adapter, go to the Adapter &
-Driver ID Guide at:
-
- https://support.intel.com
-
-
-Command Line Parameters
-=======================
-
-If the driver is built as a module, the following optional parameters are
-used by entering them on the command line with the modprobe command using
-this syntax::
-
- modprobe ixgb [<option>=<VAL1>,<VAL2>,...]
-
-For example, with two 10GbE PCI adapters, entering::
-
- modprobe ixgb TxDescriptors=80,128
-
-loads the ixgb driver with 80 TX resources for the first adapter and 128 TX
-resources for the second adapter.
-
-The default value for each parameter is generally the recommended setting,
-unless otherwise noted.
-
-Copybreak
----------
-:Valid Range: 0-XXXX
-:Default Value: 256
-
- This is the maximum size of packet that is copied to a new buffer on
- receive.
-
-Debug
------
-:Valid Range: 0-16 (0=none,...,16=all)
-:Default Value: 0
-
- This parameter adjusts the level of debug messages displayed in the
- system logs.
-
-FlowControl
------------
-:Valid Range: 0-3 (0=none, 1=Rx only, 2=Tx only, 3=Rx&Tx)
-:Default Value: 1 if no EEPROM, otherwise read from EEPROM
-
- This parameter controls the automatic generation(Tx) and response(Rx) to
- Ethernet PAUSE frames. There are hardware bugs associated with enabling
- Tx flow control so beware.
-
-RxDescriptors
--------------
-:Valid Range: 64-4096
-:Default Value: 1024
-
- This value is the number of receive descriptors allocated by the driver.
- Increasing this value allows the driver to buffer more incoming packets.
- Each descriptor is 16 bytes. A receive buffer is also allocated for
- each descriptor and can be either 2048, 4056, 8192, or 16384 bytes,
- depending on the MTU setting. When the MTU size is 1500 or less, the
- receive buffer size is 2048 bytes. When the MTU is greater than 1500 the
- receive buffer size will be either 4056, 8192, or 16384 bytes. The
- maximum MTU size is 16114.
-
-TxDescriptors
--------------
-:Valid Range: 64-4096
-:Default Value: 256
-
- This value is the number of transmit descriptors allocated by the driver.
- Increasing this value allows the driver to queue more transmits. Each
- descriptor is 16 bytes.
-
-RxIntDelay
-----------
-:Valid Range: 0-65535 (0=off)
-:Default Value: 72
-
- This value delays the generation of receive interrupts in units of
- 0.8192 microseconds. Receive interrupt reduction can improve CPU
- efficiency if properly tuned for specific network traffic. Increasing
- this value adds extra latency to frame reception and can end up
- decreasing the throughput of TCP traffic. If the system is reporting
- dropped receives, this value may be set too high, causing the driver to
- run out of available receive descriptors.
-
-TxIntDelay
-----------
-:Valid Range: 0-65535 (0=off)
-:Default Value: 32
-
- This value delays the generation of transmit interrupts in units of
- 0.8192 microseconds. Transmit interrupt reduction can improve CPU
- efficiency if properly tuned for specific network traffic. Increasing
- this value adds extra latency to frame transmission and can end up
- decreasing the throughput of TCP traffic. If this value is set too high,
- it will cause the driver to run out of available transmit descriptors.
-
-XsumRX
-------
-:Valid Range: 0-1
-:Default Value: 1
-
- A value of '1' indicates that the driver should enable IP checksum
- offload for received packets (both UDP and TCP) to the adapter hardware.
-
-RxFCHighThresh
---------------
-:Valid Range: 1,536-262,136 (0x600 - 0x3FFF8, 8 byte granularity)
-:Default Value: 196,608 (0x30000)
-
- Receive Flow control high threshold (when we send a pause frame)
-
-RxFCLowThresh
--------------
-:Valid Range: 64-262,136 (0x40 - 0x3FFF8, 8 byte granularity)
-:Default Value: 163,840 (0x28000)
-
- Receive Flow control low threshold (when we send a resume frame)
-
-FCReqTimeout
-------------
-:Valid Range: 1-65535
-:Default Value: 65535
-
- Flow control request timeout (how long to pause the link partner's tx)
-
-IntDelayEnable
---------------
-:Value Range: 0,1
-:Default Value: 1
-
- Interrupt Delay, 0 disables transmit interrupt delay and 1 enables it.
-
-
-Improving Performance
-=====================
-
-With the 10 Gigabit server adapters, the default Linux configuration will
-very likely limit the total available throughput artificially. There is a set
-of configuration changes that, when applied together, will increase the ability
-of Linux to transmit and receive data. The following enhancements were
-originally acquired from settings published at https://www.spec.org/web99/ for
-various submitted results using Linux.
-
-NOTE:
- These changes are only suggestions, and serve as a starting point for
- tuning your network performance.
-
-The changes are made in three major ways, listed in order of greatest effect:
-
-- Use ip link to modify the mtu (maximum transmission unit) and the txqueuelen
- parameter.
-- Use sysctl to modify /proc parameters (essentially kernel tuning)
-- Use setpci to modify the MMRBC field in PCI-X configuration space to increase
- transmit burst lengths on the bus.
-
-NOTE:
- setpci modifies the adapter's configuration registers to allow it to read
- up to 4k bytes at a time (for transmits). However, for some systems the
- behavior after modifying this register may be undefined (possibly errors of
- some kind). A power-cycle, hard reset or explicitly setting the e6 register
- back to 22 (setpci -d 8086:1a48 e6.b=22) may be required to get back to a
- stable configuration.
-
-- COPY these lines and paste them into ixgb_perf.sh:
-
-::
-
- #!/bin/bash
- echo "configuring network performance , edit this file to change the interface
- or device ID of 10GbE card"
- # set mmrbc to 4k reads, modify only Intel 10GbE device IDs
- # replace 1a48 with appropriate 10GbE device's ID installed on the system,
- # if needed.
- setpci -d 8086:1a48 e6.b=2e
- # set the MTU (max transmission unit) - it requires your switch and clients
- # to change as well.
- # set the txqueuelen
- # your ixgb adapter should be loaded as eth1 for this to work, change if needed
- ip li set dev eth1 mtu 9000 txqueuelen 1000 up
- # call the sysctl utility to modify /proc/sys entries
- sysctl -p ./sysctl_ixgb.conf
-
-- COPY these lines and paste them into sysctl_ixgb.conf:
-
-::
-
- # some of the defaults may be different for your kernel
- # call this file with sysctl -p <this file>
- # these are just suggested values that worked well to increase throughput in
- # several network benchmark tests, your mileage may vary
-
- ### IPV4 specific settings
- # turn TCP timestamp support off, default 1, reduces CPU use
- net.ipv4.tcp_timestamps = 0
- # turn SACK support off, default on
- # on systems with a VERY fast bus -> memory interface this is the big gainer
- net.ipv4.tcp_sack = 0
- # set min/default/max TCP read buffer, default 4096 87380 174760
- net.ipv4.tcp_rmem = 10000000 10000000 10000000
- # set min/pressure/max TCP write buffer, default 4096 16384 131072
- net.ipv4.tcp_wmem = 10000000 10000000 10000000
- # set min/pressure/max TCP buffer space, default 31744 32256 32768
- net.ipv4.tcp_mem = 10000000 10000000 10000000
-
- ### CORE settings (mostly for socket and UDP effect)
- # set maximum receive socket buffer size, default 131071
- net.core.rmem_max = 524287
- # set maximum send socket buffer size, default 131071
- net.core.wmem_max = 524287
- # set default receive socket buffer size, default 65535
- net.core.rmem_default = 524287
- # set default send socket buffer size, default 65535
- net.core.wmem_default = 524287
- # set maximum amount of option memory buffers, default 10240
- net.core.optmem_max = 524287
- # set number of unprocessed input packets before kernel starts dropping them; default 300
- net.core.netdev_max_backlog = 300000
-
-Edit the ixgb_perf.sh script if necessary to change eth1 to whatever interface
-your ixgb driver is using and/or replace '1a48' with appropriate 10GbE device's
-ID installed on the system.
-
-NOTE:
- Unless these scripts are added to the boot process, these changes will
- only last only until the next system reboot.
-
-
-Resolving Slow UDP Traffic
---------------------------
-If your server does not seem to be able to receive UDP traffic as fast as it
-can receive TCP traffic, it could be because Linux, by default, does not set
-the network stack buffers as large as they need to be to support high UDP
-transfer rates. One way to alleviate this problem is to allow more memory to
-be used by the IP stack to store incoming data.
-
-For instance, use the commands::
-
- sysctl -w net.core.rmem_max=262143
-
-and::
-
- sysctl -w net.core.rmem_default=262143
-
-to increase the read buffer memory max and default to 262143 (256k - 1) from
-defaults of max=131071 (128k - 1) and default=65535 (64k - 1). These variables
-will increase the amount of memory used by the network stack for receives, and
-can be increased significantly more if necessary for your application.
-
-
-Additional Configurations
-=========================
-
-Configuring the Driver on Different Distributions
--------------------------------------------------
-Configuring a network driver to load properly when the system is started is
-distribution dependent. Typically, the configuration process involves adding
-an alias line to /etc/modprobe.conf as well as editing other system startup
-scripts and/or configuration files. Many popular Linux distributions ship
-with tools to make these changes for you. To learn the proper way to
-configure a network device for your system, refer to your distribution
-documentation. If during this process you are asked for the driver or module
-name, the name for the Linux Base Driver for the Intel 10GbE Family of
-Adapters is ixgb.
-
-Viewing Link Messages
----------------------
-Link messages will not be displayed to the console if the distribution is
-restricting system messages. In order to see network driver link messages on
-your console, set dmesg to eight by entering the following::
-
- dmesg -n 8
-
-NOTE: This setting is not saved across reboots.
-
-Jumbo Frames
-------------
-The driver supports Jumbo Frames for all adapters. Jumbo Frames support is
-enabled by changing the MTU to a value larger than the default of 1500.
-The maximum value for the MTU is 16114. Use the ip command to
-increase the MTU size. For example::
-
- ip li set dev ethx mtu 9000
-
-The maximum MTU setting for Jumbo Frames is 16114. This value coincides
-with the maximum Jumbo Frames size of 16128.
-
-Ethtool
--------
-The driver utilizes the ethtool interface for driver configuration and
-diagnostics, as well as displaying statistical information. The ethtool
-version 1.6 or later is required for this functionality.
-
-The latest release of ethtool can be found from
-https://www.kernel.org/pub/software/network/ethtool/
-
-NOTE:
- The ethtool version 1.6 only supports a limited set of ethtool options.
- Support for a more complete ethtool feature set can be enabled by
- upgrading to the latest version.
-
-NAPI
-----
-NAPI (Rx polling mode) is supported in the ixgb driver.
-
-See https://wiki.linuxfoundation.org/networking/napi for more information on
-NAPI.
-
-
-Known Issues/Troubleshooting
-============================
-
-NOTE:
- After installing the driver, if your Intel Network Connection is not
- working, verify in the "In This Release" section of the readme that you have
- installed the correct driver.
-
-Cable Interoperability Issue with Fujitsu XENPAK Module in SmartBits Chassis
-----------------------------------------------------------------------------
-Excessive CRC errors may be observed if the Intel(R) PRO/10GbE CX4
-Server adapter is connected to a Fujitsu XENPAK CX4 module in a SmartBits
-chassis using 15 m/24AWG cable assemblies manufactured by Fujitsu or Leoni.
-The CRC errors may be received either by the Intel(R) PRO/10GbE CX4
-Server adapter or the SmartBits. If this situation occurs using a different
-cable assembly may resolve the issue.
-
-Cable Interoperability Issues with HP Procurve 3400cl Switch Port
------------------------------------------------------------------
-Excessive CRC errors may be observed if the Intel(R) PRO/10GbE CX4 Server
-adapter is connected to an HP Procurve 3400cl switch port using short cables
-(1 m or shorter). If this situation occurs, using a longer cable may resolve
-the issue.
-
-Excessive CRC errors may be observed using Fujitsu 24AWG cable assemblies that
-Are 10 m or longer or where using a Leoni 15 m/24AWG cable assembly. The CRC
-errors may be received either by the CX4 Server adapter or at the switch. If
-this situation occurs, using a different cable assembly may resolve the issue.
-
-Jumbo Frames System Requirement
--------------------------------
-Memory allocation failures have been observed on Linux systems with 64 MB
-of RAM or less that are running Jumbo Frames. If you are using Jumbo
-Frames, your system may require more than the advertised minimum
-requirement of 64 MB of system memory.
-
-Performance Degradation with Jumbo Frames
------------------------------------------
-Degradation in throughput performance may be observed in some Jumbo frames
-environments. If this is observed, increasing the application's socket buffer
-size and/or increasing the /proc/sys/net/ipv4/tcp_*mem entry values may help.
-See the specific application manual and /usr/src/linux*/Documentation/
-networking/ip-sysctl.txt for more details.
-
-Allocating Rx Buffers when Using Jumbo Frames
----------------------------------------------
-Allocating Rx buffers when using Jumbo Frames on 2.6.x kernels may fail if
-the available memory is heavily fragmented. This issue may be seen with PCI-X
-adapters or with packet split disabled. This can be reduced or eliminated
-by changing the amount of available memory for receive buffer allocation, by
-increasing /proc/sys/vm/min_free_kbytes.
-
-Multiple Interfaces on Same Ethernet Broadcast Network
-------------------------------------------------------
-Due to the default ARP behavior on Linux, it is not possible to have
-one system on two IP networks in the same Ethernet broadcast domain
-(non-partitioned switch) behave as expected. All Ethernet interfaces
-will respond to IP traffic for any IP address assigned to the system.
-This results in unbalanced receive traffic.
-
-If you have multiple interfaces in a server, do either of the following:
-
- - Turn on ARP filtering by entering::
-
- echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
-
- - Install the interfaces in separate broadcast domains - either in
- different switches or in a switch partitioned to VLANs.
-
-UDP Stress Test Dropped Packet Issue
---------------------------------------
-Under small packets UDP stress test with 10GbE driver, the Linux system
-may drop UDP packets due to the fullness of socket buffers. You may want
-to change the driver's Flow Control variables to the minimum value for
-controlling packet reception.
-
-Tx Hangs Possible Under Stress
-------------------------------
-Under stress conditions, if TX hangs occur, turning off TSO
-"ethtool -K eth0 tso off" may resolve the problem.
-
-
-Support
-=======
-For general information, go to the Intel support website at:
-
-https://www.intel.com/support/
-
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
-If an issue is identified with the released source code on a supported kernel
-with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net
diff --git a/Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst b/Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst
index 0a233b17c664..1e5f16993f69 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst
@@ -545,13 +545,8 @@ on the Intel Ethernet Controller XL710.
Support
=======
For general information, go to the Intel support website at:
-
https://www.intel.com/support/
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
If an issue is identified with the released source code on a supported kernel
with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/ixgbevf.rst b/Documentation/networking/device_drivers/ethernet/intel/ixgbevf.rst
index 76bbde736f21..08dc0d368a48 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/ixgbevf.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/ixgbevf.rst
@@ -55,13 +55,8 @@ VLANs: There is a limit of a total of 64 shared VLANs to 1 or more VFs.
Support
=======
For general information, go to the Intel support website at:
-
https://www.intel.com/support/
-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
If an issue is identified with the released source code on a supported kernel
with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
index 4cd8e869762b..6b2d1fe74ecf 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
@@ -346,32 +346,6 @@ the software port.
- The number of receive packets with CQE compression on ring i [#accel]_.
- Acceleration
- * - `rx[i]_cache_reuse`
- - The number of events of successful reuse of a page from a driver's
- internal page cache.
- - Acceleration
-
- * - `rx[i]_cache_full`
- - The number of events of full internal page cache where driver can't put a
- page back to the cache for recycling (page will be freed).
- - Acceleration
-
- * - `rx[i]_cache_empty`
- - The number of events where cache was empty - no page to give. Driver
- shall allocate new page.
- - Acceleration
-
- * - `rx[i]_cache_busy`
- - The number of events where cache head was busy and cannot be recycled.
- Driver allocated new page.
- - Acceleration
-
- * - `rx[i]_cache_waive`
- - The number of cache evacuation. This can occur due to page move to
- another NUMA node or page was pfmemalloc-ed and should be freed as soon
- as possible.
- - Acceleration
-
* - `rx[i]_arfs_err`
- Number of flow rules that failed to be added to the flow table.
- Error
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst
index 9b5c40ba7f0d..3a7a714cc08f 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst
@@ -122,6 +122,41 @@ users try to enable them.
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
+hairpin_num_queues: Number of hairpin queues
+--------------------------------------------
+We refer to a TC NIC rule that involves forwarding as "hairpin".
+
+Hairpin queues are the mlx5 hardware-specific implementation of hardware
+forwarding for such packets.
+
+- Show the number of hairpin queues::
+
+ $ devlink dev param show pci/0000:06:00.0 name hairpin_num_queues
+ pci/0000:06:00.0:
+ name hairpin_num_queues type driver-specific
+ values:
+ cmode driverinit value 2
+
+- Change the number of hairpin queues::
+
+ $ devlink dev param set pci/0000:06:00.0 name hairpin_num_queues value 4 cmode driverinit
+
+hairpin_queue_size: Size of the hairpin queues
+----------------------------------------------
+Control the size of the hairpin queues.
+
+- Show the size of the hairpin queues::
+
+ $ devlink dev param show pci/0000:06:00.0 name hairpin_queue_size
+ pci/0000:06:00.0:
+ name hairpin_queue_size type driver-specific
+ values:
+ cmode driverinit value 1024
+
+- Change the size (in packets) of the hairpin queues::
+
+ $ devlink dev param set pci/0000:06:00.0 name hairpin_queue_size value 512 cmode driverinit
+
Health reporters
================
@@ -222,3 +257,36 @@ User commands examples:
$ devlink health dump show pci/0000:82:00.1 reporter fw_fatal
NOTE: This command can run only on PF.
+
+vnic reporter
+-------------
+The vnic reporter implements only the `diagnose` callback.
+It is responsible for querying the vnic diagnostic counters from firmware and
+displaying them in real time.
+
+Description of the vnic counters:
+total_q_under_processor_handle: number of queues in an error state due to
+an async error or errored command.
+send_queue_priority_update_flow: number of QP/SQ priority/SL update
+events.
+cq_overrun: number of times CQ entered an error state due to an
+overflow.
+async_eq_overrun: number of times an EQ mapped to async events was
+overrun.
+comp_eq_overrun: number of times an EQ mapped to completion events was
+overrun.
+quota_exceeded_command: number of commands issued and failed due to quota
+exceeded.
+invalid_command: number of commands issued and failed due to any reason
+other than quota exceeded.
+nic_receive_steering_discard: number of packets that completed RX flow
+steering but were discarded due to a mismatch in flow table.
+
+User commands examples:
+- Diagnose PF/VF vnic counters
+ $ devlink health diagnose pci/0000:82:00.1 reporter vnic
+- Diagnose representor vnic counters (performed by supplying devlink port of the
+ representor, which can be obtained via devlink port command)
+ $ devlink health diagnose pci/0000:82:00.1/65537 reporter vnic
+
+NOTE: This command can run over all interfaces such as PF/VF and representor ports.
diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst
index 3321117cf605..202798d6501e 100644
--- a/Documentation/networking/devlink/mlx5.rst
+++ b/Documentation/networking/devlink/mlx5.rst
@@ -72,6 +72,18 @@ parameters.
Default: disabled
+ * - ``hairpin_num_queues``
+ - u32
+ - driverinit
+ - We refer to a TC NIC rule that involves forwarding as "hairpin".
+ Hairpin queues are the mlx5 hardware-specific implementation of hardware
+ forwarding for such packets.
+
+ Control the number of hairpin queues.
+ * - ``hairpin_queue_size``
+ - u32
+ - driverinit
+ - Control the size (in packets) of the hairpin queues.
The ``mlx5`` driver supports reloading via ``DEVLINK_CMD_RELOAD``
diff --git a/Documentation/networking/driver.rst b/Documentation/networking/driver.rst
index 64f7236ff10b..4f5dfa9c022e 100644
--- a/Documentation/networking/driver.rst
+++ b/Documentation/networking/driver.rst
@@ -4,94 +4,124 @@
Softnet Driver Issues
=====================
-Transmit path guidelines:
+Probing guidelines
+==================
-1) The ndo_start_xmit method must not return NETDEV_TX_BUSY under
- any normal circumstances. It is considered a hard error unless
- there is no way your device can tell ahead of time when its
- transmit function will become busy.
+Address validation
+------------------
- Instead it must maintain the queue properly. For example,
- for a driver implementing scatter-gather this means::
+Any hardware layer address you obtain for your device should
+be verified. For example, for Ethernet check it with
+is_valid_ether_addr() from linux/etherdevice.h.
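The check itself is small enough to model in userspace. The sketch below is not the kernel helper: ``model_is_valid_ether_addr()`` is a hypothetical stand-in that assumes the usual rule that a usable station address is neither multicast (low bit of the first octet set, which also covers broadcast) nor all zeros.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of an Ethernet address sanity check.
 * Assumption: an address is acceptable when it is not multicast
 * (bit 0 of the first octet clear) and not all zeros.
 */
static bool model_is_valid_ether_addr(const uint8_t addr[6])
{
	bool multicast = (addr[0] & 0x01) != 0;
	bool zero = true;

	for (int i = 0; i < 6; i++) {
		if (addr[i] != 0)
			zero = false;
	}

	return !multicast && !zero;
}
```

A driver would typically run the real helper on the address read from the device at probe time and fall back to another source, such as a randomly generated address, when the check fails.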
+
+Close/stop guidelines
+=====================
+
+Quiescence
+----------
+
+After the ndo_stop routine has been called, the hardware must
+not receive or transmit any data. All in-flight packets must
+be aborted. If necessary, poll or wait for completion of
+any reset commands.
+
+Auto-close
+----------
+
+The ndo_stop routine will be called by unregister_netdevice
+if the device is still UP.
+
+Transmit path guidelines
+========================
+
+Stop queues in advance
+----------------------
+
+The ndo_start_xmit method must not return NETDEV_TX_BUSY under
+any normal circumstances. It is considered a hard error unless
+there is no way your device can tell ahead of time when its
+transmit function will become busy.
+
+Instead it must maintain the queue properly. For example,
+for a driver implementing scatter-gather this means:
+
+.. code-block:: c
+
+ static u32 drv_tx_avail(struct drv_ring *dr)
+ {
+ u32 used = READ_ONCE(dr->prod) - READ_ONCE(dr->cons);
+
+ return dr->tx_ring_size - (used & dr->tx_ring_mask);
+ }
static netdev_tx_t drv_hard_start_xmit(struct sk_buff *skb,
struct net_device *dev)
{
struct drv *dp = netdev_priv(dev);
+ struct netdev_queue *txq;
+ struct drv_ring *dr;
+ int idx;
- lock_tx(dp);
- ...
- /* This is a hard error log it. */
- if (TX_BUFFS_AVAIL(dp) <= (skb_shinfo(skb)->nr_frags + 1)) {
+ idx = skb_get_queue_mapping(skb);
+ dr = dp->tx_rings[idx];
+ txq = netdev_get_tx_queue(dev, idx);
+
+ //...
+ /* This should be a very rare race - log it. */
+ if (drv_tx_avail(dr) <= skb_shinfo(skb)->nr_frags + 1) {
netif_stop_queue(dev);
- unlock_tx(dp);
- printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n",
- dev->name);
+ netdev_warn(dev, "Tx Ring full when queue awake!\n");
return NETDEV_TX_BUSY;
}
- ... queue packet to card ...
- ... update tx consumer index ...
-
- if (TX_BUFFS_AVAIL(dp) <= (MAX_SKB_FRAGS + 1))
- netif_stop_queue(dev);
-
- ...
- unlock_tx(dp);
- ...
- return NETDEV_TX_OK;
- }
-
- And then at the end of your TX reclamation event handling::
+ //... queue packet to card ...
- if (netif_queue_stopped(dp->dev) &&
- TX_BUFFS_AVAIL(dp) > (MAX_SKB_FRAGS + 1))
- netif_wake_queue(dp->dev);
+ netdev_tx_sent_queue(txq, skb->len);
- For a non-scatter-gather supporting card, the three tests simply become::
+ //... update tx producer index using WRITE_ONCE() ...
- /* This is a hard error log it. */
- if (TX_BUFFS_AVAIL(dp) <= 0)
+ if (!netif_txq_maybe_stop(txq, drv_tx_avail(dr),
+ MAX_SKB_FRAGS + 1, 2 * MAX_SKB_FRAGS))
+ dr->stats.stopped++;
- and::
+ //...
+ return NETDEV_TX_OK;
+ }
- if (TX_BUFFS_AVAIL(dp) == 0)
+And then at the end of your TX reclamation event handling:
- and::
+.. code-block:: c
- if (netif_queue_stopped(dp->dev) &&
- TX_BUFFS_AVAIL(dp) > 0)
- netif_wake_queue(dp->dev);
+ //... update tx consumer index using WRITE_ONCE() ...
-2) An ndo_start_xmit method must not modify the shared parts of a
- cloned SKB.
+ netif_txq_completed_wake(txq, cmpl_pkts, cmpl_bytes,
+ drv_tx_avail(dr), 2 * MAX_SKB_FRAGS);
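The index arithmetic behind ``drv_tx_avail()`` can be tried outside the kernel. The following is a minimal userspace sketch, not driver code: ``struct model_ring``, ``model_tx_avail()`` and ``model_ring_full()`` are hypothetical names, and the sketch assumes free-running 32-bit producer and consumer indices over a power-of-two number of slots.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of a Tx ring; not a kernel structure. */
struct model_ring {
	uint32_t prod;  /* free-running index, advanced when packets are queued */
	uint32_t cons;  /* free-running index, advanced when completions are reaped */
	uint32_t size;  /* usable descriptors, assumed to be <= mask */
	uint32_t mask;  /* number of slots minus one, slots being a power of two */
};

/* Free descriptors; unsigned subtraction stays correct across wrap-around. */
static uint32_t model_tx_avail(const struct model_ring *r)
{
	uint32_t used = r->prod - r->cons;

	return r->size - (used & r->mask);
}

/* Mirrors the "ring full" test: no room left for the head plus nr_frags. */
static bool model_ring_full(const struct model_ring *r, uint32_t nr_frags)
{
	return model_tx_avail(r) <= nr_frags + 1;
}
```

Because the indices are never reduced modulo the ring size, the producer may wrap past zero while the consumer has not, and the difference still gives the number of descriptors in use. The model leaves out the memory ordering that READ_ONCE(), WRITE_ONCE() and the queue stop/wake helpers provide in a real driver.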
-3) Do not forget that once you return NETDEV_TX_OK from your
- ndo_start_xmit method, it is your driver's responsibility to free
- up the SKB and in some finite amount of time.
+Lockless queue stop / wake helper macros
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- For example, this means that it is not allowed for your TX
- mitigation scheme to let TX packets "hang out" in the TX
- ring unreclaimed forever if no new TX packets are sent.
- This error can deadlock sockets waiting for send buffer room
- to be freed up.
+.. kernel-doc:: include/net/netdev_queues.h
+ :doc: Lockless queue stopping / waking helpers.
- If you return NETDEV_TX_BUSY from the ndo_start_xmit method, you
- must not keep any reference to that SKB and you must not attempt
- to free it up.
+No exclusive ownership
+----------------------
-Probing guidelines:
+An ndo_start_xmit method must not modify the shared parts of a
+cloned SKB.
-1) Any hardware layer address you obtain for your device should
- be verified. For example, for ethernet check it with
- linux/etherdevice.h:is_valid_ether_addr()
+Timely completions
+------------------
-Close/stop guidelines:
+Do not forget that once you return NETDEV_TX_OK from your
+ndo_start_xmit method, it is your driver's responsibility to free
+the SKB, and to do so within a finite amount of time.
-1) After the ndo_stop routine has been called, the hardware must
- not receive or transmit any data. All in flight packets must
- be aborted. If necessary, poll or wait for completion of
- any reset commands.
+For example, this means that it is not allowed for your TX
+mitigation scheme to let TX packets "hang out" in the TX
+ring unreclaimed forever if no new TX packets are sent.
+This error can deadlock sockets waiting for send buffer room
+to be freed up.
-2) The ndo_stop routine will be called by unregister_netdevice
- if device is still UP.
+If you return NETDEV_TX_BUSY from the ndo_start_xmit method, you
+must not keep any reference to that SKB and you must not attempt
+to free it up.
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index e1bc6186d7ea..2540c70952ff 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -860,22 +860,24 @@ Request contents:
Kernel response contents:
- ==================================== ====== ===========================
- ``ETHTOOL_A_RINGS_HEADER`` nested reply header
- ``ETHTOOL_A_RINGS_RX_MAX`` u32 max size of RX ring
- ``ETHTOOL_A_RINGS_RX_MINI_MAX`` u32 max size of RX mini ring
- ``ETHTOOL_A_RINGS_RX_JUMBO_MAX`` u32 max size of RX jumbo ring
- ``ETHTOOL_A_RINGS_TX_MAX`` u32 max size of TX ring
- ``ETHTOOL_A_RINGS_RX`` u32 size of RX ring
- ``ETHTOOL_A_RINGS_RX_MINI`` u32 size of RX mini ring
- ``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring
- ``ETHTOOL_A_RINGS_TX`` u32 size of TX ring
- ``ETHTOOL_A_RINGS_RX_BUF_LEN`` u32 size of buffers on the ring
- ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` u8 TCP header / data split
- ``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE
- ``ETHTOOL_A_RINGS_TX_PUSH`` u8 flag of TX Push mode
- ``ETHTOOL_A_RINGS_RX_PUSH`` u8 flag of RX Push mode
- ==================================== ====== ===========================
+ ======================================= ====== ===========================
+ ``ETHTOOL_A_RINGS_HEADER`` nested reply header
+ ``ETHTOOL_A_RINGS_RX_MAX`` u32 max size of RX ring
+ ``ETHTOOL_A_RINGS_RX_MINI_MAX`` u32 max size of RX mini ring
+ ``ETHTOOL_A_RINGS_RX_JUMBO_MAX`` u32 max size of RX jumbo ring
+ ``ETHTOOL_A_RINGS_TX_MAX`` u32 max size of TX ring
+ ``ETHTOOL_A_RINGS_RX`` u32 size of RX ring
+ ``ETHTOOL_A_RINGS_RX_MINI`` u32 size of RX mini ring
+ ``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring
+ ``ETHTOOL_A_RINGS_TX`` u32 size of TX ring
+ ``ETHTOOL_A_RINGS_RX_BUF_LEN`` u32 size of buffers on the ring
+ ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` u8 TCP header / data split
+ ``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE
+ ``ETHTOOL_A_RINGS_TX_PUSH`` u8 flag of TX Push mode
+ ``ETHTOOL_A_RINGS_RX_PUSH`` u8 flag of RX Push mode
+ ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN`` u32 size of TX push buffer
+ ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN_MAX`` u32 max size of TX push buffer
+ ======================================= ====== ===========================
``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` indicates whether the device is usable with
page-flipping TCP zero-copy receive (``getsockopt(TCP_ZEROCOPY_RECEIVE)``).
@@ -891,6 +893,18 @@ through MMIO writes, thus reducing the latency. However, enabling this feature
may increase the CPU cost. Drivers may enforce additional per-packet
eligibility checks (e.g. on packet size).
+``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN`` specifies the maximum number of bytes of a
+transmitted packet a driver can push directly to the underlying device
+('push' mode). Pushing some of the payload bytes to the device has the
+advantages of reducing latency for small packets by avoiding DMA mapping (same
+as ``ETHTOOL_A_RINGS_TX_PUSH`` parameter) as well as allowing the underlying
+device to process packet headers ahead of fetching its payload.
+This can help the device act quickly on the packet's headers.
+This is similar to the "tx-copybreak" parameter, which copies the packet to a
+preallocated DMA memory area instead of mapping new memory. With the TX push
+buffer, however, the bytes are copied directly to the device, which allows the
+device to act on the packet sooner.
+
RINGS_SET
=========
@@ -908,6 +922,7 @@ Request contents:
``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE
``ETHTOOL_A_RINGS_TX_PUSH`` u8 flag of TX Push mode
``ETHTOOL_A_RINGS_RX_PUSH`` u8 flag of RX Push mode
+ ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN`` u32 size of TX push buffer
==================================== ====== ===========================
Kernel checks that requested ring sizes do not exceed limits reported by
@@ -1084,6 +1099,10 @@ such that the corresponding bit in ``ethtool_ops::supported_coalesce_params``
is not set), regardless of their values. Driver may impose additional
constraints on coalescing parameters and their values.
+Compared to requests issued via ``ioctl()``, the netlink version of this
+request will try harder to make sure that the values specified by the user
+have been applied, and may call the driver twice.
+
PAUSE_GET
=========
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 4ddcae33c336..a164ff074356 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -36,6 +36,7 @@ Contents:
scaling
tls
tls-offload
+ tls-handshake
nfc
6lowpan
6pack
@@ -73,6 +74,7 @@ Contents:
mpls-sysctl
mptcp-sysctl
multiqueue
+ napi
netconsole
netdev-features
netdevices
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 58a78a316697..6ec06a33688a 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -2721,6 +2721,13 @@ echo_ignore_anycast - BOOLEAN
Default: 0
+error_anycast_as_unicast - BOOLEAN
+ If set to 1, the kernel will respond with ICMP errors
+ resulting from requests sent to it over the IPv6 protocol and destined
+ to an anycast address, essentially treating anycast as unicast.
+
+ Default: 0
+
xfrm6_gc_thresh - INTEGER
(Obsolete since linux-4.14)
The threshold at which we will start garbage collecting for IPv6
diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst
new file mode 100644
index 000000000000..a7a047742e93
--- /dev/null
+++ b/Documentation/networking/napi.rst
@@ -0,0 +1,254 @@
+.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+
+.. _napi:
+
+====
+NAPI
+====
+
+NAPI is the event handling mechanism used by the Linux networking stack.
+The name NAPI no longer stands for anything in particular [#]_.
+
+In basic operation the device notifies the host about new events
+via an interrupt.
+The host then schedules a NAPI instance to process the events.
+The device may also be polled for events via NAPI without receiving
+interrupts first (:ref:`busy polling<poll>`).
+
+NAPI processing usually happens in the software interrupt context,
+but there is an option to use :ref:`separate kernel threads<threaded>`
+for NAPI processing.
+
+All in all, NAPI abstracts the context and configuration of event
+(packet Rx and Tx) processing away from the drivers.
+
+Driver API
+==========
+
+The two most important elements of NAPI are the struct napi_struct
+and the associated poll method. struct napi_struct holds the state
+of the NAPI instance while the method is the driver-specific event
+handler. The method will typically free Tx packets that have been
+transmitted and process newly received packets.
+
+.. _drv_ctrl:
+
+Control API
+-----------
+
+netif_napi_add() and netif_napi_del() add/remove a NAPI instance
+from the system. The instances are attached to the netdevice passed
+as argument (and will be deleted automatically when netdevice is
+unregistered). Instances are added in a disabled state.
+
+napi_enable() and napi_disable() manage the disabled state.
+A disabled NAPI can't be scheduled and its poll method is guaranteed
+to not be invoked. napi_disable() waits for ownership of the NAPI
+instance to be released.
+
+The control APIs are not idempotent. Control API calls are safe against
+concurrent use of datapath APIs but an incorrect sequence of control API
+calls may result in crashes, deadlocks, or race conditions. For example,
+calling napi_disable() multiple times in a row will deadlock.
+
+Datapath API
+------------
+
+napi_schedule() is the basic method of scheduling a NAPI poll.
+Drivers should call this function in their interrupt handler
+(see :ref:`drv_sched` for more info). A successful call to napi_schedule()
+will take ownership of the NAPI instance.
+
+Later, after NAPI is scheduled, the driver's poll method will be
+called to process the events/packets. The method takes a ``budget``
+argument - drivers can process completions for any number of Tx
+packets but should only process up to ``budget`` number of
+Rx packets. Rx processing is usually much more expensive.
+
+In other words, it is recommended to ignore the budget argument when
+performing TX buffer reclamation to ensure that the reclamation is not
+arbitrarily bounded; however, it is required to honor the budget argument
+for RX processing.
+
+.. warning::
+
+ The ``budget`` argument may be 0 if core tries to only process Tx completions
+ and no Rx packets.
+
+The poll method returns the amount of work done. If the driver still
+has outstanding work to do (e.g. ``budget`` was exhausted)
+the poll method should return exactly ``budget``. In that case,
+the NAPI instance will be serviced/polled again (without the
+need to be scheduled).
+
+If event processing has been completed (all outstanding packets
+processed) the poll method should call napi_complete_done()
+before returning. napi_complete_done() releases the ownership
+of the instance.
+
+.. warning::
+
+ The case of finishing all events and using exactly ``budget``
+ must be handled carefully. There is no way to report this
+ (rare) condition to the stack, so the driver must either
+ not call napi_complete_done() and wait to be called again,
+ or return ``budget - 1``.
+
+ If the ``budget`` is 0 napi_complete_done() should never be called.
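The return-value rules above can be exercised with a small userspace model. This is only a sketch under stated assumptions: ``struct model_napi`` and ``model_poll()`` are hypothetical, the ``completed`` flag stands in for a call to napi_complete_done(), and the model takes the first of the two options for the exact-``budget`` case (do not complete, wait to be polled again).

```c
#include <stdbool.h>

/* Hypothetical model of the state a poll method works on. */
struct model_napi {
	int rx_pending;  /* packets waiting in the Rx ring */
	bool completed;  /* stands in for having called napi_complete_done() */
};

/* Process up to budget Rx packets and report the amount of work done. */
static int model_poll(struct model_napi *n, int budget)
{
	int work_done = 0;

	/* Tx completions would be reclaimed here, regardless of budget. */

	while (work_done < budget && n->rx_pending > 0) {
		n->rx_pending--;
		work_done++;
	}

	/*
	 * Complete only when less than budget was used. This covers both
	 * budget == 0 (never complete) and the exact-budget case (return
	 * budget, get polled again, complete on the next, empty, pass).
	 */
	if (work_done < budget)
		n->completed = true;

	return work_done;
}
```

A real poll method would also unmask the device interrupt once napi_complete_done() reports success, as described in the scheduling section of this document.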
+
+Call sequence
+-------------
+
+Drivers should not make assumptions about the exact sequencing
+of calls. The poll method may be called without the driver scheduling
+the instance (unless the instance is disabled). Similarly,
+it's not guaranteed that the poll method will be called, even
+if napi_schedule() succeeded (e.g. if the instance gets disabled).
+
+As mentioned in the :ref:`drv_ctrl` section, napi_disable() and subsequent
+calls to the poll method only wait for the ownership of the instance
+to be released, not for the poll method to exit. This means that
+drivers should avoid accessing any data structures after calling
+napi_complete_done(), because ownership has been released at that point.
+
+.. _drv_sched:
+
+Scheduling and IRQ masking
+--------------------------
+
+Drivers should keep the interrupts masked after scheduling
+the NAPI instance - until NAPI polling finishes any further
+interrupts are unnecessary.
+
+Drivers which have to mask the interrupts explicitly (as opposed
+to IRQ being auto-masked by the device) should use the napi_schedule_prep()
+and __napi_schedule() calls:
+
+.. code-block:: c
+
+ if (napi_schedule_prep(&v->napi)) {
+ mydrv_mask_rxtx_irq(v->idx);
+ /* schedule after masking to avoid races */
+ __napi_schedule(&v->napi);
+ }
+
+IRQ should only be unmasked after a successful call to napi_complete_done():
+
+.. code-block:: c
+
+ if (budget && napi_complete_done(&v->napi, work_done)) {
+ mydrv_unmask_rxtx_irq(v->idx);
+ return min(work_done, budget - 1);
+ }
+
+napi_schedule_irqoff() is a variant of napi_schedule() which takes advantage
+of guarantees given by being invoked in IRQ context (no need to
+mask interrupts). Note that PREEMPT_RT forces all interrupts
+to be threaded so the interrupt may need to be marked ``IRQF_NO_THREAD``
+to avoid issues on real-time kernel configurations.
+
+Instance to queue mapping
+-------------------------
+
+Modern devices have multiple NAPI instances (struct napi_struct) per
+interface. There is no strong requirement on how the instances are
+mapped to queues and interrupts. NAPI is primarily a polling/processing
+abstraction without specific user-facing semantics. That said, most networking
+devices end up using NAPI in fairly similar ways.
+
+NAPI instances most often correspond 1:1:1 to interrupts and queue pairs
+(queue pair is a set of a single Rx and single Tx queue).
+
+In less common cases a NAPI instance may be used for multiple queues
+or Rx and Tx queues can be serviced by separate NAPI instances on a single
+core. Regardless of the queue assignment, however, there is usually still
+a 1:1 mapping between NAPI instances and interrupts.
+
+It's worth noting that the ethtool API uses a "channel" terminology where
+each channel can be either ``rx``, ``tx`` or ``combined``. It's not clear
+what constitutes a channel; the recommended interpretation is to understand
+a channel as an IRQ/NAPI which services queues of a given type. For example,
+a configuration of 1 ``rx``, 1 ``tx`` and 1 ``combined`` channel is expected
+to utilize 3 interrupts, 2 Rx and 2 Tx queues.
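The recommended interpretation amounts to simple arithmetic, sketched below in plain C. The structure and function names are hypothetical and do not correspond to any ethtool or kernel API; the sketch only assumes that each channel maps to one IRQ/NAPI and that a ``combined`` channel carries one Rx and one Tx queue.

```c
/* Hypothetical model of an ethtool channel configuration. */
struct model_channels {
	unsigned int rx;        /* Rx-only channels */
	unsigned int tx;        /* Tx-only channels */
	unsigned int combined;  /* channels servicing one Rx and one Tx queue */
};

/* One interrupt (and NAPI instance) per channel of any type. */
static unsigned int model_irqs(const struct model_channels *c)
{
	return c->rx + c->tx + c->combined;
}

/* Rx queues come from Rx-only and combined channels. */
static unsigned int model_rx_queues(const struct model_channels *c)
{
	return c->rx + c->combined;
}

/* Tx queues come from Tx-only and combined channels. */
static unsigned int model_tx_queues(const struct model_channels *c)
{
	return c->tx + c->combined;
}
```

For the configuration in the text (1 ``rx``, 1 ``tx``, 1 ``combined``) the model gives 3 interrupts, 2 Rx queues and 2 Tx queues.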
+
+User API
+========
+
+User interactions with NAPI depend on NAPI instance ID. The instance IDs
+are only visible to the user through the ``SO_INCOMING_NAPI_ID`` socket option.
+It's not currently possible to query IDs used by a given device.
+
+Software IRQ coalescing
+-----------------------
+
+NAPI does not perform any explicit event coalescing by default.
+In most scenarios batching happens due to IRQ coalescing which is done
+by the device. There are cases where software coalescing is helpful.
+
+NAPI can be configured to arm a repoll timer instead of unmasking
+the hardware interrupts as soon as all packets are processed.
+The ``gro_flush_timeout`` sysfs configuration of the netdevice
+is reused to control the delay of the timer, while
+``napi_defer_hard_irqs`` controls the number of consecutive empty polls
+before NAPI gives up and goes back to using hardware IRQs.
+
+.. _poll:
+
+Busy polling
+------------
+
+Busy polling allows a user process to check for incoming packets before
+the device interrupt fires. As is the case with any busy polling it trades
+off CPU cycles for lower latency (production uses of NAPI busy polling
+are not well known).
+
+Busy polling is enabled by either setting ``SO_BUSY_POLL`` on
+selected sockets or using the global ``net.core.busy_poll`` and
+``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling
+also exists.
+
+IRQ mitigation
+---------------
+
+While busy polling is supposed to be used by low latency applications,
+a similar mechanism can be used for IRQ mitigation.
+
+Very high request-per-second applications (especially routing/forwarding
+applications and especially applications using AF_XDP sockets) may not
+want to be interrupted until they finish processing a request or a batch
+of packets.
+
+Such applications can pledge to the kernel that they will perform a busy
+polling operation periodically, and the driver should keep the device IRQs
+permanently masked. This mode is enabled by using the ``SO_PREFER_BUSY_POLL``
+socket option. To avoid system misbehavior the pledge is revoked
+if ``gro_flush_timeout`` passes without any busy poll call.
+
+The NAPI budget for busy polling is lower than the default (which makes
+sense given the low latency intention of normal busy polling). This is
+not the case with IRQ mitigation, however, so the budget can be adjusted
+with the ``SO_BUSY_POLL_BUDGET`` socket option.
+
+.. _threaded:
+
+Threaded NAPI
+-------------
+
+Threaded NAPI is an operating mode that uses dedicated kernel
+threads rather than software IRQ context for NAPI processing.
+The configuration is per netdevice and will affect all
+NAPI instances of that device. Each NAPI instance will spawn a separate
+thread (called ``napi/${ifc-name}-${napi-id}``).
+
+It is recommended to pin each kernel thread to a single CPU, the same
+CPU as the CPU which services the interrupt. Note that the mapping
+between IRQs and NAPI instances may not be trivial (and is driver
+dependent). The NAPI instance IDs will be assigned in the opposite
+order from the process IDs of the kernel threads.
+
+Threaded NAPI is controlled by writing 0/1 to the ``threaded`` file in
+netdev's sysfs directory.
+
+.. rubric:: Footnotes
+
+.. [#] NAPI was originally referred to as New API in Linux 2.4.
diff --git a/Documentation/networking/page_pool.rst b/Documentation/networking/page_pool.rst
index 30f1344e7cca..873efd97f822 100644
--- a/Documentation/networking/page_pool.rst
+++ b/Documentation/networking/page_pool.rst
@@ -165,6 +165,7 @@ Registration
pp_params.pool_size = DESC_NUM;
pp_params.nid = NUMA_NO_NODE;
pp_params.dev = priv->dev;
+ pp_params.napi = napi; /* only if locking is tied to NAPI */
pp_params.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
page_pool = page_pool_create(&pp_params);
diff --git a/Documentation/networking/rxrpc.rst b/Documentation/networking/rxrpc.rst
index ec1323d92c96..e807e18ba32a 100644
--- a/Documentation/networking/rxrpc.rst
+++ b/Documentation/networking/rxrpc.rst
@@ -848,14 +848,21 @@ The kernel interface functions are as follows:
returned. The caller now holds a reference on this and it must be
properly ended.
- (#) End a client call::
+ (#) Shut down a client call::
- void rxrpc_kernel_end_call(struct socket *sock,
+ void rxrpc_kernel_shutdown_call(struct socket *sock,
+ struct rxrpc_call *call);
+
+ This is used to shut down a previously begun call. The user_call_ID is
+ expunged from AF_RXRPC's knowledge and will not be seen again in
+ association with the specified call.
+
+ (#) Release the ref on a client call::
+
+ void rxrpc_kernel_put_call(struct socket *sock,
struct rxrpc_call *call);
- This is used to end a previously begun call. The user_call_ID is expunged
- from AF_RXRPC's knowledge and will not be seen again in association with
- the specified call.
+ This is used to release the caller's ref on an rxrpc call.
(#) Send data through a call::
diff --git a/Documentation/networking/tls-handshake.rst b/Documentation/networking/tls-handshake.rst
new file mode 100644
index 000000000000..a2817a88e905
--- /dev/null
+++ b/Documentation/networking/tls-handshake.rst
@@ -0,0 +1,217 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+In-Kernel TLS Handshake
+=======================
+
+Overview
+========
+
+Transport Layer Security (TLS) is an Upper Layer Protocol (ULP) that runs
+over TCP. TLS provides end-to-end data integrity and confidentiality in
+addition to peer authentication.
+
+The kernel's kTLS implementation handles the TLS record subprotocol, but
+does not handle the TLS handshake subprotocol which is used to establish
+a TLS session. Kernel consumers can use the API described here to
+request TLS session establishment.
+
+There are several possible ways to provide a handshake service in the
+kernel. The API described here is designed to hide the details of those
+implementations so that in-kernel TLS consumers do not need to be
+aware of how the handshake gets done.
+
+
+User handshake agent
+====================
+
+As of this writing, there is no TLS handshake implementation in the
+Linux kernel. To provide a handshake service, a handshake agent
+(typically in user space) is started in each network namespace where a
+kernel consumer might require a TLS handshake. Handshake agents listen
+for events sent from the kernel that indicate a handshake request is
+waiting.
+
+An open socket is passed to a handshake agent via a netlink operation,
+which creates a socket descriptor in the agent's file descriptor table.
+If the handshake completes successfully, the handshake agent promotes
+the socket to use the TLS ULP and sets the session information using the
+SOL_TLS socket options. The handshake agent returns the socket to the
+kernel via a second netlink operation.
+
+
+Kernel Handshake API
+====================
+
+A kernel TLS consumer initiates a client-side TLS handshake on an open
+socket by invoking one of the tls_client_hello() functions. First, it
+fills in a structure that contains the parameters of the request:
+
+.. code-block:: c
+
+ struct tls_handshake_args {
+         struct socket *ta_sock;
+         tls_done_func_t ta_done;
+         void *ta_data;
+         unsigned int ta_timeout_ms;
+         key_serial_t ta_keyring;
+         key_serial_t ta_my_cert;
+         key_serial_t ta_my_privkey;
+         unsigned int ta_num_peerids;
+         key_serial_t ta_my_peerids[5];
+ };
+
+The @ta_sock field references an open and connected socket. The consumer
+must hold a reference on the socket to prevent it from being destroyed
+while the handshake is in progress. The consumer must also have
+instantiated a struct file in sock->file.
+
+
+@ta_done contains a callback function that is invoked when the handshake
+has completed. Further explanation of this function is in the "Handshake
+Completion" section below.
+
+The consumer can fill in the @ta_timeout_ms field to force the servicing
+handshake agent to exit after a number of milliseconds. This enables the
+socket to be fully closed once both the kernel and the handshake agent
+have closed their endpoints.
+
+Authentication material such as x.509 certificates, private certificate
+keys, and pre-shared keys are provided to the handshake agent in keys
+that are instantiated by the consumer before making the handshake
+request. In the @ta_keyring field, the consumer can provide a private
+keyring that is linked into the handshake agent's process keyring, which
+prevents other subsystems from accessing those keys.
+
+To request an x.509-authenticated TLS session, the consumer fills in
+the @ta_my_cert and @ta_my_privkey fields with the serial numbers of
+keys containing an x.509 certificate and the private key for that
+certificate. Then, it invokes this function:
+
+.. code-block:: c
+
+ ret = tls_client_hello_x509(args, gfp_flags);
+
+The function returns zero when the handshake request is under way. A
+zero return guarantees the callback function @ta_done will be invoked
+for this socket. The function returns a negative errno if the handshake
+could not be started. A negative errno guarantees the callback function
+@ta_done will not be invoked on this socket.
+
+
+To initiate a client-side TLS handshake with a pre-shared key, use:
+
+.. code-block:: c
+
+ ret = tls_client_hello_psk(args, gfp_flags);
+
+In this case, the consumer fills in the @ta_my_peerids array
+with serial numbers of keys containing the peer identities it wishes
+to offer, and the @ta_num_peerids field with the number of array
+entries it has filled in. The other fields are filled in as above.
+
+
+To initiate an anonymous client-side TLS handshake use:
+
+.. code-block:: c
+
+ ret = tls_client_hello_anon(args, gfp_flags);
+
+The handshake agent presents no peer identity information to the remote
+during this type of handshake. Only server authentication (i.e., the client
+verifies the server's identity) is performed during the handshake. Thus
+the established session uses encryption only.
+
+
+Consumers that are in-kernel servers use:
+
+.. code-block:: c
+
+ ret = tls_server_hello_x509(args, gfp_flags);
+
+or
+
+.. code-block:: c
+
+ ret = tls_server_hello_psk(args, gfp_flags);
+
+The argument structure is filled in as above.
+
+
+If the consumer needs to cancel the handshake request, say, due to a ^C
+or other exigent event, the consumer can invoke:
+
+.. code-block:: c
+
+ bool tls_handshake_cancel(sock);
+
+This function returns true if the handshake request associated with
+@sock has been canceled. The consumer's handshake completion callback
+will not be invoked. If this function returns false, then the consumer's
+completion callback has already been invoked.
+
+
+Handshake Completion
+====================
+
+When the handshake agent has completed processing, it notifies the
+kernel that the socket may be used by the consumer again. At this point,
+the consumer's handshake completion callback, provided in the @ta_done
+field in the tls_handshake_args structure, is invoked.
+
+The synopsis of this function is:
+
+.. code-block:: c
+
+ typedef void (*tls_done_func_t)(void *data, int status,
+                                 key_serial_t peerid);
+
+The consumer provides a cookie in the @ta_data field of the
+tls_handshake_args structure that is returned in the @data parameter of
+this callback. The consumer uses the cookie to match the callback to the
+thread waiting for the handshake to complete.
+
+The success status of the handshake is returned via the @status
+parameter:
+
++------------+----------------------------------------------+
+| status     | meaning                                      |
++============+==============================================+
+| 0          | TLS session established successfully         |
++------------+----------------------------------------------+
+| -EACCES    | Remote peer rejected the handshake or        |
+|            | authentication failed                        |
++------------+----------------------------------------------+
+| -ENOMEM    | Temporary resource allocation failure        |
++------------+----------------------------------------------+
+| -EINVAL    | Consumer provided an invalid argument        |
++------------+----------------------------------------------+
+| -ENOKEY    | Missing authentication material              |
++------------+----------------------------------------------+
+| -EIO       | An unexpected fault occurred                 |
++------------+----------------------------------------------+
+
+The @peerid parameter contains the serial number of a key containing the
+remote peer's identity or the value TLS_NO_PEERID if the session is not
+authenticated.
+
+A best practice is to close and destroy the socket immediately if the
+handshake failed.
+
+
+Other considerations
+--------------------
+
+While a handshake is under way, the kernel consumer must alter the
+socket's sk_data_ready callback function to ignore all incoming data.
+Once the handshake completion callback function has been invoked, normal
+receive operation can be resumed.
+
+Once a TLS session is established, the consumer must provide a buffer
+for and then examine the control message (CMSG) that is part of every
+subsequent sock_recvmsg(). Each control message indicates whether the
+received message data is TLS record data or session metadata.
+
+See tls.rst for details on how a kTLS consumer recognizes incoming
+(decrypted) application data, alerts, and handshake packets once the
+socket has been promoted to use the TLS ULP.