summaryrefslogtreecommitdiff
path: root/drivers/net/ethernet
AgeCommit message (Collapse)Author
2025-01-20net: move HDS config from ethtool stateJakub Kicinski
Separate the HDS config from the ethtool state struct. The HDS config contains just simple parameters, not state. Having it as a separate struct will make it easier to clone / copy and also long term potentially make it per-queue. Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250119020518.1962249-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20net: ethernet: ti: am65-cpsw: fix freeing IRQ in am65_cpsw_nuss_remove_tx_chns()Roger Quadros
When getting the IRQ we use k3_udma_glue_tx_get_irq() which returns negative error value on error. So not NULL check is not sufficient to deteremine if IRQ is valid. Check that IRQ is greater then zero to ensure it is valid. There is no issue at probe time but at runtime user can invoke .set_channels which results in the following call chain. am65_cpsw_set_channels() am65_cpsw_nuss_update_tx_rx_chns() am65_cpsw_nuss_remove_tx_chns() am65_cpsw_nuss_init_tx_chns() At this point if am65_cpsw_nuss_init_tx_chns() fails due to k3_udma_glue_tx_get_irq() then tx_chn->irq will be set to a negative value. Then, at subsequent .set_channels with higher channel count we will attempt to free an invalid IRQ in am65_cpsw_nuss_remove_tx_chns() leading to a kernel warning. The issue is present in the original commit that introduced this driver, although there, am65_cpsw_nuss_update_tx_rx_chns() existed as am65_cpsw_nuss_update_tx_chns(). Fixes: 93a76530316a ("net: ethernet: ti: introduce am65x/j721e gigabit eth subsystem driver") Signed-off-by: Roger Quadros <rogerq@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Siddharth Vadapalli <s-vadapalli@ti.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-18eth: bnxt: fix string truncation warning in FW versionJakub Kicinski
W=1 builds with gcc 14.2.1 report: drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c:4193:32: error: ā€˜%s’ directive output may be truncated writing up to 31 bytes into a region of size 27 [-Werror=format-truncation=] 4193 | "/pkg %s", buf); It's upset that we let buf be full length but then we use 5 characters for "/pkg ". The builds is also clear with clang version 19.1.5 now. Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250117183726.1481524-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-18octeon_ep_vf: update tx/rx stats locally for persistenceShinas Rasheed
Update tx/rx stats locally, so that ndo_get_stats64() can use that and not rely on per queue resources to obtain statistics. The latter used to cause race conditions when the device stopped. Signed-off-by: Shinas Rasheed <srasheed@marvell.com> Link: https://patch.msgid.link/20250117094653.2588578-5-srasheed@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-18octeon_ep_vf: remove firmware stats fetch in ndo_get_stats64Shinas Rasheed
The firmware stats fetch call that happens in ndo_get_stats64() is currently not required, and causes a warning to issue. The corresponding warn log for the PF is given below: [ 123.316837] ------------[ cut here ]------------ [ 123.316840] Voluntary context switch within RCU read-side critical section! [ 123.316917] pc : rcu_note_context_switch+0x2e4/0x300 [ 123.316919] lr : rcu_note_context_switch+0x2e4/0x300 [ 123.316947] Call trace: [ 123.316949] rcu_note_context_switch+0x2e4/0x300 [ 123.316952] __schedule+0x84/0x584 [ 123.316955] schedule+0x38/0x90 [ 123.316956] schedule_timeout+0xa0/0x1d4 [ 123.316959] octep_send_mbox_req+0x190/0x230 [octeon_ep] [ 123.316966] octep_ctrl_net_get_if_stats+0x78/0x100 [octeon_ep] [ 123.316970] octep_get_stats64+0xd4/0xf0 [octeon_ep] [ 123.316975] dev_get_stats+0x4c/0x114 [ 123.316977] dev_seq_printf_stats+0x3c/0x11c [ 123.316980] dev_seq_show+0x1c/0x40 [ 123.316982] seq_read_iter+0x3cc/0x4e0 [ 123.316985] seq_read+0xc8/0x110 [ 123.316987] proc_reg_read+0x9c/0xec [ 123.316990] vfs_read+0xc8/0x2ec [ 123.316993] ksys_read+0x70/0x100 [ 123.316995] __arm64_sys_read+0x20/0x30 [ 123.316997] invoke_syscall.constprop.0+0x7c/0xd0 [ 123.317000] do_el0_svc+0xb4/0xd0 [ 123.317002] el0_svc+0xe8/0x1f4 [ 123.317005] el0t_64_sync_handler+0x134/0x150 [ 123.317006] el0t_64_sync+0x17c/0x180 [ 123.317008] ---[ end trace 63399811432ab69b ]--- Fixes: c3fad23cdc06 ("octeon_ep_vf: add support for ndo ops") Signed-off-by: Shinas Rasheed <srasheed@marvell.com> Link: https://patch.msgid.link/20250117094653.2588578-4-srasheed@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-18octeon_ep: update tx/rx stats locally for persistenceShinas Rasheed
Update tx/rx stats locally, so that ndo_get_stats64() can use that and not rely on per queue resources to obtain statistics. The latter used to cause race conditions when the device stopped. Signed-off-by: Shinas Rasheed <srasheed@marvell.com> Link: https://patch.msgid.link/20250117094653.2588578-3-srasheed@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-18octeon_ep: remove firmware stats fetch in ndo_get_stats64Shinas Rasheed
The firmware stats fetch call that happens in ndo_get_stats64() is currently not required, and causes a warning to issue. The warn log is given below: [ 123.316837] ------------[ cut here ]------------ [ 123.316840] Voluntary context switch within RCU read-side critical section! [ 123.316917] pc : rcu_note_context_switch+0x2e4/0x300 [ 123.316919] lr : rcu_note_context_switch+0x2e4/0x300 [ 123.316947] Call trace: [ 123.316949] rcu_note_context_switch+0x2e4/0x300 [ 123.316952] __schedule+0x84/0x584 [ 123.316955] schedule+0x38/0x90 [ 123.316956] schedule_timeout+0xa0/0x1d4 [ 123.316959] octep_send_mbox_req+0x190/0x230 [octeon_ep] [ 123.316966] octep_ctrl_net_get_if_stats+0x78/0x100 [octeon_ep] [ 123.316970] octep_get_stats64+0xd4/0xf0 [octeon_ep] [ 123.316975] dev_get_stats+0x4c/0x114 [ 123.316977] dev_seq_printf_stats+0x3c/0x11c [ 123.316980] dev_seq_show+0x1c/0x40 [ 123.316982] seq_read_iter+0x3cc/0x4e0 [ 123.316985] seq_read+0xc8/0x110 [ 123.316987] proc_reg_read+0x9c/0xec [ 123.316990] vfs_read+0xc8/0x2ec [ 123.316993] ksys_read+0x70/0x100 [ 123.316995] __arm64_sys_read+0x20/0x30 [ 123.316997] invoke_syscall.constprop.0+0x7c/0xd0 [ 123.317000] do_el0_svc+0xb4/0xd0 [ 123.317002] el0_svc+0xe8/0x1f4 [ 123.317005] el0t_64_sync_handler+0x134/0x150 [ 123.317006] el0t_64_sync+0x17c/0x180 [ 123.317008] ---[ end trace 63399811432ab69b ]--- Fixes: 6a610a46bad1 ("octeon_ep: add support for ndo ops") Signed-off-by: Shinas Rasheed <srasheed@marvell.com> Link: https://patch.msgid.link/20250117094653.2588578-2-srasheed@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-18net: xilinx: axienet: Report an error for bad coalesce settingsSean Anderson
Instead of silently ignoring invalid/unsupported settings, report an error. Additionally, relax the check for non-zero usecs to apply only when it will be used (i.e. when frames != 1). Signed-off-by: Sean Anderson <sean.anderson@linux.dev> Reviewed by: Shannon Nelson <shannon.nelson@amd.com> Link: https://patch.msgid.link/20250116232954.2696930-3-sean.anderson@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-18net: xilinx: axienet: Add some symbolic constants for IRQ delay timerSean Anderson
Instead of using literals, add some symbolic constants for the IRQ delay timer calculation. Signed-off-by: Sean Anderson <sean.anderson@linux.dev> Reviewed by: Shannon Nelson <shannon.nelson@amd.com> Link: https://patch.msgid.link/20250116232954.2696930-2-sean.anderson@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-17net: mscc: ocelot: add TX timestamping statisticsVladimir Oltean
Add an u64 hardware timestamping statistics structure for each ocelot port. Export a function from the common switch library for reporting them to ethtool. This is called by the ocelot switchdev front-end for now. Note that for the switchdev driver, we report the one-step PTP packets as unconfirmed, even though in principle, for some transmission mechanisms like FDMA, we may be able to confirm transmission and bump the "pkts" counter in ocelot_fdma_tx_cleanup() instead. I don't have access to hardware which uses the switchdev front-end, and I've kept the implementation simple. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250116104628.123555-4-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-17mlxsw: Do not store Tx header length as driver parameterAmit Cohen
Tx header handling was moved to PCI code, as there is no several drivers which configure Tx header differently. Tx header length is stored as driver parameter, this is not really necessary as it always stores the same value. Remove this field and use the macro MLXSW_TXHDR_LEN explicitly. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/1fb7b3f007de4d311e559c8a954b673d0895d5e9.1737044384.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-17mlxsw: Move Tx header handling to PCI driverAmit Cohen
Tx header should be added to all packets transmitted from the CPU to Spectrum ASICs. Historically, handling this header was added as a driver function, as Tx header is different between Spectrum and Switch-X. See SwitchX implementation in commit 31557f0f9755 ("mlxsw: Introduce Mellanox SwitchX-2 ASIC support"). From May 2021, there is no support for SwitchX-2 ASIC, and all the relevant code was removed. For now, there is no justification to handle Tx header as part of spectrum.c, we can handle this as part of PCI, in skb_transmit(). A future patch set will add support for XDP in mlxsw driver, to support XDP_TX and XDP_REDIRECT actions, Tx header should be added before transmitting the packet. As preparation for this, move Tx header handling to PCI driver, so then XDP code will not have to call API from spectrum.c. This also improves the code as now Tx header is pushed just before transmitting, so it is not done from many flows which might miss something. Note that for PTP, we should configure Tx header differently, use the fields from mlxsw_txhdr_info to configure the packets correctly in PCI driver. Handle VLAN tagging in switch driver, verify that packet which should be transmitted as data is tagged, otherwise, tag it. Remove the calls for thxdr_construct() functions, as now this is done as part of skb_transmit(). Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/293a81e6f7d59a8ec9f9592edb7745536649ff11.1737044384.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-17mlxsw: Define Tx header fields in txheader.hAmit Cohen
The next patch will move Tx header constructing to pci.c. As preparation, move the definitions of Tx header fields from spectrum.c to txheader.h, so pci.c will include this header and can access the fields. Remove 'etclass' which is not used. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/2250b5cb3998ab4850fc8251c3a0f5926d32e194.1737044384.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-17mlxsw: Initialize txhdr_info according to PTP operationsAmit Cohen
A next patch will construct Tx header as part of pci.c. The switch driver (mlxsw_spectrum.ko) should encapsulate all the differences between the different ASICs and the bus driver (mlxsw_pci.ko) should remain unaware. As preparation, add the relevant info as part of mlxsw_txhdr_info structure, so later bus driver will merely construct the Tx header based on information passed from the switch driver. Most of the packets are transmitted as control packets, but PTP packets in Spectrum-2 and Spectrum-3 should be handled differently. The driver transmits them as data packets, and the default VLAN tag (4095) is added if the packet is not already tagged. Extend PTP operations to store a boolean which indicates whether packets should be transmitted as data packets. Set it for Spectrum-2 and Spectrum-3 only. Extend mlxsw_txhdr_info to store fields which will be used later to construct Tx header. Initialize such fields according to the new boolean which is stored in PTP operations. Note that for now, mlxsw_txhdr_info structure is initialized, but not used, a next patch will use it. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/efcaacd4bedef524e840a0c29f96cebf2c4bc0e0.1737044384.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-17mlxsw: Add mlxsw_txhdr_info structureAmit Cohen
mlxsw_tx_info structure is used to store information that is needed to process Tx completions when Tx time stamps are requested. A next patch will move Tx header handling from spectrum.c to pci.c. For that, some additional fields which are related to Tx should be passed to pci driver. As preparation, create an extended structure, called mlxsw_txhdr_info, and store mlxsw_tx_info inside. The new fields should not be added to mlxsw_tx_info structure as it is stored in the SKB control block which is of limited size. The next patch will extend the new structure with some fields which are needed in order to construct Tx header. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/93aed1961f046f79f46869bab37a3faa5027751d.1737044384.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-17net/mlx5: fix unintentional sign extension on shift of dest_attr->vport.vhca_idColin Ian King
Shifting dest_attr->vport.vhca_id << 16 results in a promotion from an unsigned 16 bit integer to a 32 bit signed integer, this is then sign extended to a 64 bit unsigned long on 64 bitarchitectures. If vhca_id is greater than 0x7fff then this leads to a sign extended result where all the upper 32 bits of idx are set to 1. Fix this by casting vhca_id to the same type as idx before performing the shift. Fixes: 8e2e08a6d1e0 ("net/mlx5: fs, add support for dest vport HWS action") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Moshe Shemesh <moshe@nvidia.com> Link: https://patch.msgid.link/20250116181700.96437-1-colin.i.king@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-17Merge branch '100GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== ice: support FW Recovery Mode Konrad Knitter says: Enable update of card in FW Recovery Mode * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: ice: support FW Recovery Mode devlink: add devl guard pldmfw: enable selected component update ==================== Link: https://patch.msgid.link/20250116212059.1254349-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-17net/mlxfw: Drop hard coded max FW flash image sizeMaher Sanalla
Currently, mlxfw kernel module limits FW flash image size to be 10MB at most, preventing the ability to burn recent BlueField-3 FW that exceeds the said size limit. Thus, drop the hard coded limit. Instead, rely on FW's max_component_size threshold that is reported in MCQI register as the size limit for FW image. Fixes: 410ed13cae39 ("Add the mlxfw module for Mellanox firmware flash process") Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/1737030796-1441634-1-git-send-email-moshe@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16net: stmmac: convert to phylink managed EEE supportRussell King (Oracle)
Convert stmmac to use phylink managed EEE support rather than delving into phylib: 1. Move the stmmac_eee_init() calls out of mac_link_down() and mac_link_up() methods into the new mac_{enable,disable}_lpi() methods. We leave the calls to stmmac_set_eee_pls() in place as these change bits which tell the EEE hardware when the link came up or down, and is used for a separate hardware timer. However, symmetrically conditionalise this with priv->dma_cap.eee. 2. Update the current LPI timer each time LPI is enabled - which we need for software-timed LPI. 3. With phylink managed EEE, phylink manages the receive clock stop configuration via phylink_config.eee_rx_clk_stop_enable. Set this appropriately which makes the call to phy_eee_rx_clock_stop() redundant. 4. From what I can work out, all supported interfaces support LPI signalling on stmmac (there's no restriction implemented.) It also appears to support LPI at all full duplex speeds at or over 100M. Set these capabilities. 5. The default timer appears to be derived from a module parameter. Set this the same, although we keep code that reconfigures the timer in stmmac_init_phy(). 6. Remove the direct call to phy_support_eee(), which phylink will do on the drivers behalf if phylink_config.eee_enabled_default is set. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/E1tYAEG-0014QH-9O@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16net: lan743x: convert to phylink managed EEERussell King (Oracle)
Convert lan743x to phylink managed EEE: - Set the lpi_capabilties. - Move the call to lan743x_mac_eee_enable() into the enable/disable tx_lpi functions. - Ensure that EEEEN is clear during probe. - Move the setting of the LPI timer into mac_enable_tx_lpi(). - Move reading of LPI timer to phylink initialisation to set the default timer value. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/E1tYAEB-0014QB-4s@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16net: lan743x: use netdev in lan743x_phylink_mac_link_down()Russell King (Oracle)
Use the netdev that we already have in lan743x_phylink_mac_link_down(). Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/E1tYAE5-0014Q5-Up@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16net: mvpp2: add EEE implementationRussell King (Oracle)
Add EEE support for mvpp2, using phylink's EEE implementation, which means we just need to implement the two methods for LPI control, and with the initial configuration. Only SGMII mode is supported, so only 100M and 1G speeds. Disabling LPI requires clearing a single bit. Enabling LPI needs a full configuration of several values, as the timer values are dependent on the MAC operating speed. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/E1tYAE0-0014Pz-R9@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16net: mvneta: convert to phylink EEE implementationRussell King (Oracle)
Convert mvneta to use phylink's EEE implementation by implementing the two LPI control methods, and adding the initial configuration and capabilities. Although disabling LPI requires clearing a single bit, for safety we clear the manual mode and force bits to ensure that auto mode will be used. Enabling LPI needs a full configuration of several values, as the timer values are dependent on the MAC operating speed, as per the original code. As Armada 388 states that EEE is only supported in "SGMII" modes, mark this in lpi_interfaces. Testing with RGMII on the Clearfog platform indicates that the receive path fails to detect LPI over RGMII. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/E1tYADv-0014Pt-NO@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16ice: support FW Recovery ModeKonrad Knitter
Recovery Mode is intended to recover from a fatal failure scenario in which the device is not accessible to the host, meaning the firmware is non-responsive. The purpose of the Firmware Recovery Mode is to enable software tools to update firmware and/or device configuration so the fatal error can be resolved. Recovery Mode Firmware supports a limited set of admin commands required for NVM update. Recovery Firmware does not support hardware interrupts so a polling mode is used. The driver will expose only the minimum set of devlink commands required for the recovery of the adapter. Using an appropriate NVM image, the user can recover the adapter using the devlink flash API. Prior to 4.20 E810 Adapter Recovery Firmware supports only the update and erase of the "fw.mgmt" component. E810 Adapter Recovery Firmware doesn't support selected preservation of cards settings or identifiers. The following command can be used to recover the adapter: $ devlink dev flash <pci-address> <update-image.bin> component fw.mgmt overwrite settings overwrite identifier Newer FW versions (4.20 or newer) supports update of "fw.undi" and "fw.netlist" components. $ devlink dev flash <pci-address> <update-image.bin> Tested on Intel Corporation Ethernet Controller E810-C for SFP FW revision 3.20 and 4.30. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Konrad Knitter <konrad.knitter@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.13-rc8). Conflicts: drivers/net/ethernet/realtek/r8169_main.c 1f691a1fc4be ("r8169: remove redundant hwmon support") 152d00a91396 ("r8169: simplify setting hwmon attribute visibility") https://lore.kernel.org/20250115122152.760b4e8d@canb.auug.org.au Adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt.c 152f4da05aee ("bnxt_en: add support for rx-copybreak ethtool command") f0aa6a37a3db ("eth: bnxt: always recalculate features after XDP clearing, fix null-deref") drivers/net/ethernet/intel/ice/ice_type.h 50327223a8bb ("ice: add lock to protect low latency interface") dc26548d729e ("ice: Fix quad registers read on E825") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16net/mlx5e: Always start IPsec sequence number from 1Leon Romanovsky
According to RFC4303, section "3.3.3. Sequence Number Generation", the first packet sent using a given SA will contain a sequence number of 1. This is applicable to both ESN and non-ESN mode, which was not covered in commit mentioned in Fixes line. Fixes: 3d42c8cc67a8 ("net/mlx5e: Ensure that IPsec sequence packet number starts from 1") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-16net/mlx5e: Rely on reqid in IPsec tunnel modeLeon Romanovsky
All packet offloads SAs have reqid in it to make sure they have corresponding policy. While it is not strictly needed for transparent mode, it is extremely important in tunnel mode. In that mode, policy and SAs have different match criteria. Policy catches the whole subnet addresses, and SA catches the tunnel gateways addresses. The source address of such tunnel is not known during egress packet traversal in flow steering as it is added only after successful encryption. As reqid is required for packet offload and it is unique for every SA, we can safely rely on it only. The output below shows the configured egress policy and SA by strongswan: [leonro@vm ~]$ sudo ip x s src 192.169.101.2 dst 192.169.101.1 proto esp spi 0xc88b7652 reqid 1 mode tunnel replay-window 0 flag af-unspec esn aead rfc4106(gcm(aes)) 0xe406a01083986e14d116488549094710e9c57bc6 128 anti-replay esn context: seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0 replay_window 1, bitmap-length 1 00000000 crypto offload parameters: dev eth2 dir out mode packet [leonro@064 ~]$ sudo ip x p src 192.170.0.0/16 dst 192.170.0.0/16 dir out priority 383615 ptype main tmpl src 192.169.101.2 dst 192.169.101.1 proto esp spi 0xc88b7652 reqid 1 mode tunnel crypto offload parameters: dev eth2 mode packet Fixes: b3beba1fb404 ("net/mlx5e: Allow policies with reqid 0, to support IKE policy holes") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-16net/mlx5e: Fix inversion dependency warning while enabling IPsec tunnelLeon Romanovsky
Attempt to enable IPsec packet offload in tunnel mode in debug kernel generates the following kernel panic, which is happening due to two issues: 1. In SA add section, the should be _bh() variant when marking SA mode. 2. There is not needed flush_workqueue in SA delete routine. It is not needed as at this stage as it is removed from SADB and the running work will be canceled later in SA free. ===================================================== WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected 6.12.0+ #4 Not tainted ----------------------------------------------------- charon/1337 [HC0[0]:SC0[4]:HE1:SE0] is trying to acquire: ffff88810f365020 (&xa->xa_lock#24){+.+.}-{3:3}, at: mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core] and this task is already holding: ffff88813e0f0d48 (&x->lock){+.-.}-{3:3}, at: xfrm_state_delete+0x16/0x30 which would create a new lock dependency: (&x->lock){+.-.}-{3:3} -> (&xa->xa_lock#24){+.+.}-{3:3} but this new dependency connects a SOFTIRQ-irq-safe lock: (&x->lock){+.-.}-{3:3} ... which became SOFTIRQ-irq-safe at: lock_acquire+0x1be/0x520 _raw_spin_lock_bh+0x34/0x40 xfrm_timer_handler+0x91/0xd70 __hrtimer_run_queues+0x1dd/0xa60 hrtimer_run_softirq+0x146/0x2e0 handle_softirqs+0x266/0x860 irq_exit_rcu+0x115/0x1a0 sysvec_apic_timer_interrupt+0x6e/0x90 asm_sysvec_apic_timer_interrupt+0x16/0x20 default_idle+0x13/0x20 default_idle_call+0x67/0xa0 do_idle+0x2da/0x320 cpu_startup_entry+0x50/0x60 start_secondary+0x213/0x2a0 common_startup_64+0x129/0x138 to a SOFTIRQ-irq-unsafe lock: (&xa->xa_lock#24){+.+.}-{3:3} ... which became SOFTIRQ-irq-unsafe at: ... lock_acquire+0x1be/0x520 _raw_spin_lock+0x2c/0x40 xa_set_mark+0x70/0x110 mlx5e_xfrm_add_state+0xe48/0x2290 [mlx5_core] xfrm_dev_state_add+0x3bb/0xd70 xfrm_add_sa+0x2451/0x4a90 xfrm_user_rcv_msg+0x493/0x880 netlink_rcv_skb+0x12e/0x380 xfrm_netlink_rcv+0x6d/0x90 netlink_unicast+0x42f/0x740 netlink_sendmsg+0x745/0xbe0 __sock_sendmsg+0xc5/0x190 __sys_sendto+0x1fe/0x2c0 __x64_sys_sendto+0xdc/0x1b0 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 other info that might help us debug this: Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&xa->xa_lock#24); local_irq_disable(); lock(&x->lock); lock(&xa->xa_lock#24); <Interrupt> lock(&x->lock); *** DEADLOCK *** 2 locks held by charon/1337: #0: ffffffff87f8f858 (&net->xfrm.xfrm_cfg_mutex){+.+.}-{4:4}, at: xfrm_netlink_rcv+0x5e/0x90 #1: ffff88813e0f0d48 (&x->lock){+.-.}-{3:3}, at: xfrm_state_delete+0x16/0x30 the dependencies between SOFTIRQ-irq-safe lock and the holding lock: -> (&x->lock){+.-.}-{3:3} ops: 29 { HARDIRQ-ON-W at: lock_acquire+0x1be/0x520 _raw_spin_lock_bh+0x34/0x40 xfrm_alloc_spi+0xc0/0xe60 xfrm_alloc_userspi+0x5f6/0xbc0 xfrm_user_rcv_msg+0x493/0x880 netlink_rcv_skb+0x12e/0x380 xfrm_netlink_rcv+0x6d/0x90 netlink_unicast+0x42f/0x740 netlink_sendmsg+0x745/0xbe0 __sock_sendmsg+0xc5/0x190 __sys_sendto+0x1fe/0x2c0 __x64_sys_sendto+0xdc/0x1b0 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 IN-SOFTIRQ-W at: lock_acquire+0x1be/0x520 _raw_spin_lock_bh+0x34/0x40 xfrm_timer_handler+0x91/0xd70 __hrtimer_run_queues+0x1dd/0xa60 hrtimer_run_softirq+0x146/0x2e0 handle_softirqs+0x266/0x860 irq_exit_rcu+0x115/0x1a0 sysvec_apic_timer_interrupt+0x6e/0x90 asm_sysvec_apic_timer_interrupt+0x16/0x20 default_idle+0x13/0x20 default_idle_call+0x67/0xa0 do_idle+0x2da/0x320 cpu_startup_entry+0x50/0x60 start_secondary+0x213/0x2a0 common_startup_64+0x129/0x138 INITIAL USE at: lock_acquire+0x1be/0x520 _raw_spin_lock_bh+0x34/0x40 xfrm_alloc_spi+0xc0/0xe60 xfrm_alloc_userspi+0x5f6/0xbc0 xfrm_user_rcv_msg+0x493/0x880 netlink_rcv_skb+0x12e/0x380 xfrm_netlink_rcv+0x6d/0x90 netlink_unicast+0x42f/0x740 netlink_sendmsg+0x745/0xbe0 __sock_sendmsg+0xc5/0x190 __sys_sendto+0x1fe/0x2c0 __x64_sys_sendto+0xdc/0x1b0 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 } ... key at: [<ffffffff87f9cd20>] __key.18+0x0/0x40 the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock: -> (&xa->xa_lock#24){+.+.}-{3:3} ops: 9 { HARDIRQ-ON-W at: lock_acquire+0x1be/0x520 _raw_spin_lock_bh+0x34/0x40 mlx5e_xfrm_add_state+0xc5b/0x2290 [mlx5_core] xfrm_dev_state_add+0x3bb/0xd70 xfrm_add_sa+0x2451/0x4a90 xfrm_user_rcv_msg+0x493/0x880 netlink_rcv_skb+0x12e/0x380 xfrm_netlink_rcv+0x6d/0x90 netlink_unicast+0x42f/0x740 netlink_sendmsg+0x745/0xbe0 __sock_sendmsg+0xc5/0x190 __sys_sendto+0x1fe/0x2c0 __x64_sys_sendto+0xdc/0x1b0 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 SOFTIRQ-ON-W at: lock_acquire+0x1be/0x520 _raw_spin_lock+0x2c/0x40 xa_set_mark+0x70/0x110 mlx5e_xfrm_add_state+0xe48/0x2290 [mlx5_core] xfrm_dev_state_add+0x3bb/0xd70 xfrm_add_sa+0x2451/0x4a90 xfrm_user_rcv_msg+0x493/0x880 netlink_rcv_skb+0x12e/0x380 xfrm_netlink_rcv+0x6d/0x90 netlink_unicast+0x42f/0x740 netlink_sendmsg+0x745/0xbe0 __sock_sendmsg+0xc5/0x190 __sys_sendto+0x1fe/0x2c0 __x64_sys_sendto+0xdc/0x1b0 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 INITIAL USE at: lock_acquire+0x1be/0x520 _raw_spin_lock_bh+0x34/0x40 mlx5e_xfrm_add_state+0xc5b/0x2290 [mlx5_core] xfrm_dev_state_add+0x3bb/0xd70 xfrm_add_sa+0x2451/0x4a90 xfrm_user_rcv_msg+0x493/0x880 netlink_rcv_skb+0x12e/0x380 xfrm_netlink_rcv+0x6d/0x90 netlink_unicast+0x42f/0x740 netlink_sendmsg+0x745/0xbe0 __sock_sendmsg+0xc5/0x190 __sys_sendto+0x1fe/0x2c0 __x64_sys_sendto+0xdc/0x1b0 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 } ... key at: [<ffffffffa078ff60>] __key.48+0x0/0xfffffffffff210a0 [mlx5_core] ... acquired at: __lock_acquire+0x30a0/0x5040 lock_acquire+0x1be/0x520 _raw_spin_lock_bh+0x34/0x40 mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core] xfrm_dev_state_delete+0x90/0x160 __xfrm_state_delete+0x662/0xae0 xfrm_state_delete+0x1e/0x30 xfrm_del_sa+0x1c2/0x340 xfrm_user_rcv_msg+0x493/0x880 netlink_rcv_skb+0x12e/0x380 xfrm_netlink_rcv+0x6d/0x90 netlink_unicast+0x42f/0x740 netlink_sendmsg+0x745/0xbe0 __sock_sendmsg+0xc5/0x190 __sys_sendto+0x1fe/0x2c0 __x64_sys_sendto+0xdc/0x1b0 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 stack backtrace: CPU: 7 UID: 0 PID: 1337 Comm: charon Not tainted 6.12.0+ #4 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x74/0xd0 check_irq_usage+0x12e8/0x1d90 ? print_shortest_lock_dependencies_backwards+0x1b0/0x1b0 ? check_chain_key+0x1bb/0x4c0 ? __lockdep_reset_lock+0x180/0x180 ? check_path.constprop.0+0x24/0x50 ? mark_lock+0x108/0x2fb0 ? print_circular_bug+0x9b0/0x9b0 ? mark_lock+0x108/0x2fb0 ? print_usage_bug.part.0+0x670/0x670 ? check_prev_add+0x1c4/0x2310 check_prev_add+0x1c4/0x2310 __lock_acquire+0x30a0/0x5040 ? lockdep_set_lock_cmp_fn+0x190/0x190 ? lockdep_set_lock_cmp_fn+0x190/0x190 lock_acquire+0x1be/0x520 ? mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core] ? lockdep_hardirqs_on_prepare+0x400/0x400 ? __xfrm_state_delete+0x5f0/0xae0 ? lock_downgrade+0x6b0/0x6b0 _raw_spin_lock_bh+0x34/0x40 ? mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core] mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core] xfrm_dev_state_delete+0x90/0x160 __xfrm_state_delete+0x662/0xae0 xfrm_state_delete+0x1e/0x30 xfrm_del_sa+0x1c2/0x340 ? xfrm_get_sa+0x250/0x250 ? check_chain_key+0x1bb/0x4c0 xfrm_user_rcv_msg+0x493/0x880 ? copy_sec_ctx+0x270/0x270 ? check_chain_key+0x1bb/0x4c0 ? lockdep_set_lock_cmp_fn+0x190/0x190 ? lockdep_set_lock_cmp_fn+0x190/0x190 netlink_rcv_skb+0x12e/0x380 ? copy_sec_ctx+0x270/0x270 ? netlink_ack+0xd90/0xd90 ? netlink_deliver_tap+0xcd/0xb60 xfrm_netlink_rcv+0x6d/0x90 netlink_unicast+0x42f/0x740 ? netlink_attachskb+0x730/0x730 ? lock_acquire+0x1be/0x520 netlink_sendmsg+0x745/0xbe0 ? netlink_unicast+0x740/0x740 ? __might_fault+0xbb/0x170 ? netlink_unicast+0x740/0x740 __sock_sendmsg+0xc5/0x190 ? fdget+0x163/0x1d0 __sys_sendto+0x1fe/0x2c0 ? __x64_sys_getpeername+0xb0/0xb0 ? do_user_addr_fault+0x856/0xe30 ? lock_acquire+0x1be/0x520 ? __task_pid_nr_ns+0x117/0x410 ? lock_downgrade+0x6b0/0x6b0 __x64_sys_sendto+0xdc/0x1b0 ? lockdep_hardirqs_on_prepare+0x284/0x400 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f7d31291ba4 Code: 7d e8 89 4d d4 e8 4c 42 f7 ff 44 8b 4d d0 4c 8b 45 c8 89 c3 44 8b 55 d4 8b 7d e8 b8 2c 00 00 00 48 8b 55 d8 48 8b 75 e0 0f 05 <48> 3d 00 f0 ff ff 77 34 89 df 48 89 45 e8 e8 99 42 f7 ff 48 8b 45 RSP: 002b:00007f7d2ccd94f0 EFLAGS: 00000297 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f7d31291ba4 RDX: 0000000000000028 RSI: 00007f7d2ccd96a0 RDI: 000000000000000a RBP: 00007f7d2ccd9530 R08: 00007f7d2ccd9598 R09: 000000000000000c R10: 0000000000000000 R11: 0000000000000297 R12: 0000000000000028 R13: 00007f7d2ccd9598 R14: 00007f7d2ccd96a0 R15: 00000000000000e1 </TASK> Fixes: 4c24272b4e2b ("net/mlx5e: Listen to ARP events to update IPsec L2 headers in tunnel mode") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-16net/mlx5: Clear port select structure when fail to createMark Zhang
Clear the port select structure on error so no stale values left after definers are destroyed. That's because the mlx5_lag_destroy_definers() always try to destroy all lag definers in the tt_map, so in the flow below lag definers get double-destroyed and cause kernel crash: mlx5_lag_port_sel_create() mlx5_lag_create_definers() mlx5_lag_create_definer() <- Failed on tt 1 mlx5_lag_destroy_definers() <- definers[tt=0] gets destroyed mlx5_lag_port_sel_create() mlx5_lag_create_definers() mlx5_lag_create_definer() <- Failed on tt 0 mlx5_lag_destroy_definers() <- definers[tt=0] gets double-destroyed Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 Mem abort info: ESR = 0x0000000096000005 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x05: level 1 translation fault Data abort info: ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 user pgtable: 64k pages, 48-bit VAs, pgdp=0000000112ce2e00 [0000000000000008] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000 Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP Modules linked in: iptable_raw bonding ip_gre ip6_gre gre ip6_tunnel tunnel6 geneve ip6_udp_tunnel udp_tunnel ipip tunnel4 ip_tunnel rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_fwctl(OE) fwctl(OE) mlx5_core(OE) mlxdevm(OE) ib_core(OE) mlxfw(OE) memtrack(OE) mlx_compat(OE) openvswitch nsh nf_conncount psample xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc netconsole overlay efi_pstore sch_fq_codel zram ip_tables crct10dif_ce qemu_fw_cfg fuse ipv6 crc_ccitt [last unloaded: mlx_compat(OE)] CPU: 3 UID: 0 PID: 217 Comm: kworker/u53:2 Tainted: G OE 6.11.0+ #2 Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : mlx5_del_flow_rules+0x24/0x2c0 [mlx5_core] lr : mlx5_lag_destroy_definer+0x54/0x100 [mlx5_core] sp : ffff800085fafb00 x29: ffff800085fafb00 x28: ffff0000da0c8000 x27: 0000000000000000 x26: ffff0000da0c8000 x25: ffff0000da0c8000 x24: ffff0000da0c8000 x23: ffff0000c31f81a0 x22: 0400000000000000 x21: ffff0000da0c8000 x20: 0000000000000000 x19: 0000000000000001 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffff8b0c9350 x14: 0000000000000000 x13: ffff800081390d18 x12: ffff800081dc3cc0 x11: 0000000000000001 x10: 0000000000000b10 x9 : ffff80007ab7304c x8 : ffff0000d00711f0 x7 : 0000000000000004 x6 : 0000000000000190 x5 : ffff00027edb3010 x4 : 0000000000000000 x3 : 0000000000000000 x2 : ffff0000d39b8000 x1 : ffff0000d39b8000 x0 : 0400000000000000 Call trace: mlx5_del_flow_rules+0x24/0x2c0 [mlx5_core] mlx5_lag_destroy_definer+0x54/0x100 [mlx5_core] mlx5_lag_destroy_definers+0xa0/0x108 [mlx5_core] mlx5_lag_port_sel_create+0x2d4/0x6f8 [mlx5_core] mlx5_activate_lag+0x60c/0x6f8 [mlx5_core] mlx5_do_bond_work+0x284/0x5c8 [mlx5_core] process_one_work+0x170/0x3e0 worker_thread+0x2d8/0x3e0 kthread+0x11c/0x128 ret_from_fork+0x10/0x20 Code: a9025bf5 aa0003f6 a90363f7 f90023f9 (f9400400) ---[ end trace 0000000000000000 ]--- Fixes: dc48516ec7d3 ("net/mlx5: Lag, add support to create definers for LAG") Signed-off-by: Mark Zhang <markzhang@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-16net/mlx5: SF, Fix add port error handlingChris Mi
If failed to add SF, error handling doesn't delete the SF from the SF table. But the hw resources are deleted. So when unload driver, hw resources will be deleted again. Firmware will report syndrome 0x68def3 which means "SF is not allocated can not deallocate". Fix it by delete SF from SF table if failed to add SF. Fixes: 2597ee190b4e ("net/mlx5: Call mlx5_sf_id_erase() once in mlx5_sf_dealloc()") Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Shay Drori <shayd@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-16net/mlx5: Fix a lockdep warning as part of the write combining testYishai Hadas
Fix a lockdep warning [1] observed during the write combining test. The warning indicates a potential nested lock scenario that could lead to a deadlock. However, this is a false positive alarm because the SF lock and its parent lock are distinct ones. The lockdep confusion arises because the locks belong to the same object class (i.e., struct mlx5_core_dev). To resolve this, the code has been refactored to avoid taking both locks. Instead, only the parent lock is acquired. [1] raw_ethernet_bw/2118 is trying to acquire lock: [ 213.619032] ffff88811dd75e08 (&dev->wc_state_lock){+.+.}-{3:3}, at: mlx5_wc_support_get+0x18c/0x210 [mlx5_core] [ 213.620270] [ 213.620270] but task is already holding lock: [ 213.620943] ffff88810b585e08 (&dev->wc_state_lock){+.+.}-{3:3}, at: mlx5_wc_support_get+0x10c/0x210 [mlx5_core] [ 213.622045] [ 213.622045] other info that might help us debug this: [ 213.622778] Possible unsafe locking scenario: [ 213.622778] [ 213.623465] CPU0 [ 213.623815] ---- [ 213.624148] lock(&dev->wc_state_lock); [ 213.624615] lock(&dev->wc_state_lock); [ 213.625071] [ 213.625071] *** DEADLOCK *** [ 213.625071] [ 213.625805] May be due to missing lock nesting notation [ 213.625805] [ 213.626522] 4 locks held by raw_ethernet_bw/2118: [ 213.627019] #0: ffff88813f80d578 (&uverbs_dev->disassociate_srcu){.+.+}-{0:0}, at: ib_uverbs_ioctl+0xc4/0x170 [ib_uverbs] [ 213.628088] #1: ffff88810fb23930 (&file->hw_destroy_rwsem){.+.+}-{3:3}, at: ib_init_ucontext+0x2d/0xf0 [ib_uverbs] [ 213.629094] #2: ffff88810fb23878 (&file->ucontext_lock){+.+.}-{3:3}, at: ib_init_ucontext+0x49/0xf0 [ib_uverbs] [ 213.630106] #3: ffff88810b585e08 (&dev->wc_state_lock){+.+.}-{3:3}, at: mlx5_wc_support_get+0x10c/0x210 [mlx5_core] [ 213.631185] [ 213.631185] stack backtrace: [ 213.631718] CPU: 1 UID: 0 PID: 2118 Comm: raw_ethernet_bw Not tainted 6.12.0-rc7_internal_net_next_mlx5_89a0ad0 #1 [ 213.632722] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 213.633785] Call Trace: [ 213.634099] [ 213.634393] dump_stack_lvl+0x7e/0xc0 [ 213.634806] print_deadlock_bug+0x278/0x3c0 [ 213.635265] __lock_acquire+0x15f4/0x2c40 [ 213.635712] lock_acquire+0xcd/0x2d0 [ 213.636120] ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core] [ 213.636722] ? mlx5_ib_enable_lb+0x24/0xa0 [mlx5_ib] [ 213.637277] __mutex_lock+0x81/0xda0 [ 213.637697] ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core] [ 213.638305] ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core] [ 213.638902] ? rcu_read_lock_sched_held+0x3f/0x70 [ 213.639400] ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core] [ 213.640016] mlx5_wc_support_get+0x18c/0x210 [mlx5_core] [ 213.640615] set_ucontext_resp+0x68/0x2b0 [mlx5_ib] [ 213.641144] ? debug_mutex_init+0x33/0x40 [ 213.641586] mlx5_ib_alloc_ucontext+0x18e/0x7b0 [mlx5_ib] [ 213.642145] ib_init_ucontext+0xa0/0xf0 [ib_uverbs] [ 213.642679] ib_uverbs_handler_UVERBS_METHOD_GET_CONTEXT+0x95/0xc0 [ib_uverbs] [ 213.643426] ? _copy_from_user+0x46/0x80 [ 213.643878] ib_uverbs_cmd_verbs+0xa6b/0xc80 [ib_uverbs] [ 213.644426] ? ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x130/0x130 [ib_uverbs] [ 213.645213] ? __lock_acquire+0xa99/0x2c40 [ 213.645675] ? lock_acquire+0xcd/0x2d0 [ 213.646101] ? ib_uverbs_ioctl+0xc4/0x170 [ib_uverbs] [ 213.646625] ? reacquire_held_locks+0xcf/0x1f0 [ 213.647102] ? do_user_addr_fault+0x45d/0x770 [ 213.647586] ib_uverbs_ioctl+0xe0/0x170 [ib_uverbs] [ 213.648102] ? ib_uverbs_ioctl+0xc4/0x170 [ib_uverbs] [ 213.648632] __x64_sys_ioctl+0x4d3/0xaa0 [ 213.649060] ? do_user_addr_fault+0x4a8/0x770 [ 213.649528] do_syscall_64+0x6d/0x140 [ 213.649947] entry_SYSCALL_64_after_hwframe+0x4b/0x53 [ 213.650478] RIP: 0033:0x7fa179b0737b [ 213.650893] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7d 2a 0f 00 f7 d8 64 89 01 48 [ 213.652619] RSP: 002b:00007ffd2e6d46e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 213.653390] RAX: ffffffffffffffda RBX: 00007ffd2e6d47f8 RCX: 00007fa179b0737b [ 213.654084] RDX: 00007ffd2e6d47e0 RSI: 00000000c0181b01 RDI: 0000000000000003 [ 213.654767] RBP: 00007ffd2e6d47c0 R08: 00007fa1799be010 R09: 0000000000000002 [ 213.655453] R10: 00007ffd2e6d4960 R11: 0000000000000246 R12: 00007ffd2e6d487c [ 213.656170] R13: 0000000000000027 R14: 0000000000000001 R15: 00007ffd2e6d4f70 Fixes: d98995b4bf98 ("net/mlx5: Reimplement write combining test") Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-16net/mlx5: Fix RDMA TX steering prioPatrisious Haddad
User added steering rules at RDMA_TX were being added to the first prio, which is the counters prio. Fix that so that they are correctly added to the BYPASS_PRIO instead. Fixes: 24670b1a3166 ("net/mlx5: Add support for RDMA TX steering") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-16net: stmmac: Convert prefetch() to net_prefetch() for received framesFurong Xu
The size of DMA descriptors is 32 bytes at most. net_prefetch() for received frames, and keep prefetch() for descriptors. This patch brings ~4.8% driver performance improvement in a TCP RX throughput test with iPerf tool on a single isolated Cortex-A65 CPU core, 2.92 Gbits/sec increased to 3.06 Gbits/sec. Suggested-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Furong Xu <0x1207@gmail.com> Reviewed-by: Yanteng Si <si.yanteng@linux.dev> Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-16net: stmmac: Optimize cache prefetch in RX pathFurong Xu
Current code prefetches cache lines for the received frame first, and then dma_sync_single_for_cpu() against this frame, this is wrong. Cache prefetch should be triggered after dma_sync_single_for_cpu(). This patch brings ~2.8% driver performance improvement in a TCP RX throughput test with iPerf tool on a single isolated Cortex-A65 CPU core, 2.84 Gbits/sec increased to 2.92 Gbits/sec. Signed-off-by: Furong Xu <0x1207@gmail.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-16net: stmmac: Set page_pool_params.max_len to a precise sizeFurong Xu
DMA engine will always write no more than dma_buf_sz bytes of a received frame into a page buffer, the remaining spaces are unused or used by CPU exclusively. Setting page_pool_params.max_len to almost the full size of page(s) helps nothing more, but wastes more CPU cycles on cache maintenance. For a standard MTU of 1500, then dma_buf_sz is assigned to 1536, and this patch brings ~16.9% driver performance improvement in a TCP RX throughput test with iPerf tool on a single isolated Cortex-A65 CPU core, from 2.43 Gbits/sec increased to 2.84 Gbits/sec. Signed-off-by: Furong Xu <0x1207@gmail.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-16net: stmmac: Switch to zero-copy in non-XDP RX pathFurong Xu
Avoid memcpy in non-XDP RX path by marking all allocated SKBs to be recycled in the upper network stack. This patch brings ~11.5% driver performance improvement in a TCP RX throughput test with iPerf tool on a single isolated Cortex-A65 CPU core, from 2.18 Gbits/sec increased to 2.43 Gbits/sec. Signed-off-by: Furong Xu <0x1207@gmail.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Reviewed-by: Yanteng Si <si.yanteng@linux.dev> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-15net/mlx5e: CT: Offload connections with hardware steering rulesCosmin Ratiu
This is modeled similar to how software steering works: - a reference-counted matcher is maintained for each combination of nat/no_nat x ipv4/ipv6 x tcp/udp/gre. - adding a rule involves finding+referencing or creating a corresponding matcher, then actually adding a rule. - updating rules is implemented using the bwc_rule update API, which can change a rule's actions without touching the match value. By using a T-Rex traffic generator to initiate multi-million UDP flows per second, a kernel running with these patches on the RX side was able to offload ~600K flows per second, which is about ~7x larger than what software steering could do on the same hardware (256-thread AMD EPYC, 512 GB RAM, ConnectX-7 b2b). Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250114130646.1937192-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15net/mlx5e: CT: Make mlx5_ct_fs_smfs_ct_validate_flow_rule reusableCosmin Ratiu
This function checks whether a flow_rule has the right flow dissector keys and masks used for a connection tracking flow offload. It is currently used locally by the tc_ct smfs module, but is about to be used from another place, so this commit moves it to a better place, renames it to mlx5e_tc_ct_is_valid_flow_rule and drops the unused fs argument. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250114130646.1937192-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15net/mlx5e: CT: Add initial support for Hardware SteeringCosmin Ratiu
Connection tracking can offload tuple matches to the NIC either via firmware commands (when the steering mode is dmfs or offload support is disabled due to eswitch being set to legacy) or via software-managed flow steering (smfs). This commit adds stub operations for a third mode, hardware-managed flow steering. This is enabled when both CONFIG_MLX5_TC_CT and CONFIG_MLX5_HW_STEERING are enabled. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250114130646.1937192-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15net/mlx5: HWS, rework the check if matcher size can be increasedYevgeny Kliteynik
When checking if the matcher size can be increased, check both match and action RTCs. Also, consider the increasing step - check that it won't cause the new matcher size to become unsupported. Additionally, since we're using '+ 1' for action RTC size yet again, define it as macro and use in all the required places. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250114130646.1937192-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15net: protect NAPI enablement with netdev_lock()Jakub Kicinski
Wrap napi_enable() / napi_disable() with netdev_lock(). Provide the "already locked" flavor of the API. iavf needs the usual adjustment. A number of drivers call napi_enable() under a spin lock, so they have to be modified to take netdev_lock() first, then spin lock then call napi_enable_locked(). Protecting napi_enable() implies that napi->napi_id is protected by netdev_lock(). Acked-by: Francois Romieu <romieu@fr.zoreil.com> # via-velocity Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115035319.559603-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15net: protect netdev->napi_list with netdev_lock()Jakub Kicinski
Hold netdev->lock when NAPIs are getting added or removed. This will allow safe access to NAPI instances of a net_device without rtnl_lock. Create a family of helpers which assume the lock is already taken. Switch iavf to them, as it makes extensive use of netdev->lock, already. Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115035319.559603-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15net: add netdev_lock() / netdev_unlock() helpersJakub Kicinski
Add helpers for locking the netdev instance, use it in drivers and the shaper code. This will make grepping for the lock usage much easier, as we extend the lock to cover more fields. Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/20250115035319.559603-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15Merge branch '100GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2025-01-08 (ice) This series contains updates to ice driver only. Przemek reworks implementation so that ice_init_hw() is called before ice_adapter initialization. The motivation is to have ability to act on the number of PFs in ice_adapter initialization. This is not done here but the code is also a bit cleaner. Michal adds priority to be considered when matching recipes for proper differentiation. Konrad adds devlink health reporting for firmware generated events. R Sundar utilizes string helpers over open coded versions. Jake adds implementation to utilize a lower latency interface to program PHY timer when supported. Additional information can be found on the original cover letter: https://lore.kernel.org/intel-wired-lan/20241216145453.333745-1-anton.nadezhdin@intel.com/ Karol adds and allows for different PTP delay values to be used per pin. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: ice: Add in/out PTP pin delays ice: implement low latency PHY timer updates ice: check low latency PHY timer update firmware capability ice: add lock to protect low latency interface ice: rename TS_LL_READ* macros to REG_LL_PROXY_H_* ice: use read_poll_timeout_atomic in ice_read_phy_tstamp_ll_e810 ice: use string choice helpers ice: add fw and port health reporters ice: add recipe priority check in search ice: ice_probe: init ice_adapter after HW init ice: minor: rename goto labels from err to unroll ice: split ice_init_hw() out from ice_init_dev() ice: c827: move wait for FW to ice_init_hw() ==================== Link: https://patch.msgid.link/20250115000844.714530-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15inet: ipmr: fix data-racesEric Dumazet
Following fields of 'struct mr_mfc' can be updated concurrently (no lock protection) from ip_mr_forward() and ip6_mr_forward() - bytes - pkt - wrong_if - lastuse They also can be read from other functions. Convert bytes, pkt and wrong_if to atomic_long_t, and use READ_ONCE()/WRITE_ONCE() for lastuse. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250114221049.1190631-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15bnxt_en: add support for hds-thresh ethtool commandTaehee Yoo
The bnxt_en driver has configured the hds_threshold value automatically when TPA is enabled based on the rx-copybreak default value. Now the hds-thresh ethtool command is added, so it adds an implementation of hds-thresh option. Configuration of the hds-thresh is applied only when the tcp-data-split is enabled. The default value of hds-thresh is 256, which is the default value of rx-copybreak, which used to be the hds_thresh value. The maximum hds-thresh is 1023. # Example: # ethtool -G enp14s0f0np0 tcp-data-split on hds-thresh 256 # ethtool -g enp14s0f0np0 Ring parameters for enp14s0f0np0: Pre-set maximums: ... HDS thresh: 1023 Current hardware settings: ... TCP data split: on HDS thresh: 256 Tested-by: Stanislav Fomichev <sdf@fomichev.me> Tested-by: Andy Gospodarek <gospo@broadcom.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250114142852.3364986-9-ap420073@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15bnxt_en: add support for tcp-data-split ethtool commandTaehee Yoo
NICs that uses bnxt_en driver supports tcp-data-split feature by the name of HDS(header-data-split). But there is no implementation for the HDS to enable by ethtool. Only getting the current HDS status is implemented and The HDS is just automatically enabled only when either LRO, HW-GRO, or JUMBO is enabled. The hds_threshold follows rx-copybreak value. and it was unchangeable. This implements `ethtool -G <interface name> tcp-data-split <value>` command option. The value can be <on> and <auto>. The value is <auto> and one of LRO/GRO/JUMBO is enabled, HDS is automatically enabled and all LRO/GRO/JUMBO are disabled, HDS is automatically disabled. HDS feature relies on the aggregation ring. So, if HDS is enabled, the bnxt_en driver initializes the aggregation ring. This is the reason why BNXT_FLAG_AGG_RINGS contains HDS condition. Acked-by: Jakub Kicinski <kuba@kernel.org> Tested-by: Stanislav Fomichev <sdf@fomichev.me> Tested-by: Andy Gospodarek <gospo@broadcom.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250114142852.3364986-8-ap420073@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15bnxt_en: add support for rx-copybreak ethtool commandTaehee Yoo
The bnxt_en driver supports rx-copybreak, but it couldn't be set by userspace. Only the default value(256) has worked. This patch makes the bnxt_en driver support following command. `ethtool --set-tunable <devname> rx-copybreak <value> ` and `ethtool --get-tunable <devname> rx-copybreak`. By this patch, hds_threshol is set to the rx-copybreak value. But it will be set by `ethtool -G eth0 hds-thresh N` in the next patch. Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Brett Creeley <brett.creeley@amd.com> Tested-by: Stanislav Fomichev <sdf@fomichev.me> Tested-by: Andy Gospodarek <gospo@broadcom.com> Signed-off-by: Taehee Yoo <ap420073@gmail.com> Reviewed-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250114142852.3364986-7-ap420073@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15eth: fbnic: Add hardware monitoring support via HWMON interfaceSanman Pradhan
This patch adds support for hardware monitoring to the fbnic driver, allowing for temperature and voltage sensor data to be exposed to userspace via the HWMON interface. The driver registers a HWMON device and provides callbacks for reading sensor data, enabling system admins to monitor the health and operating conditions of fbnic. Signed-off-by: Sanman Pradhan <sanman.p211993@gmail.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/20250114000705.2081288-4-sanman.p211993@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15eth: fbnic: hwmon: Add support for reading temperature and voltage sensorsSanman Pradhan
Add support for reading temperature and voltage sensor data from firmware by implementing a new TSENE message type and response parsing. This adds message handler infrastructure to transmit sensor read requests and parse responses. The sensor data will be exposed through the driver's hwmon interface. Signed-off-by: Sanman Pradhan <sanman.p211993@gmail.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/20250114000705.2081288-3-sanman.p211993@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>