Age | Commit message (Collapse) | Author |
|
The memory barrier in eq_update_ci() after the doorbell write is a
significant hot spot in mlx5_eq_comp_int(). Under heavy TCP load, we see
3% of CPU time spent on the mfence instruction.
98df6d5b877c ("net/mlx5: A write memory barrier is sufficient in EQ ci
update") already relaxed the full memory barrier to just a write barrier
in mlx5_eq_update_ci(), which duplicates eq_update_ci(). So replace mb()
with wmb() in eq_update_ci() too.
On strongly ordered architectures, no barrier is actually needed because
the MMIO writes to the doorbell register are guaranteed to appear to the
device in the order they were made. However, the kernel's ordered MMIO
primitive writel() lacks a convenient big-endian interface.
Therefore, we opt to stick with __raw_writel() + a barrier.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241107183054.2443218-1-csander@purestorage.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The gettimex64() doesn't modify values in timecounter, that's why there
is no need to update sequence counter. Reduce the contention on sequence
lock for multi-thread PHC reading use-case.
Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Acked-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241014170103.2473580-1-vadfed@meta.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
After adding HWS support in a separate folder, moving all the SWS
code into its own folder as well.
Now SWS and HWS implementation are located in their appropriate
folders:
- steering/sws/
- steering/hws/
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241031125856.530927-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Collecting crdump involves reading vsc registers from pci config space
of mlx device, which can take long time to complete. This might result
in starving other threads waiting to run on the cpu.
Numbers I got from testing ConnectX-5 Ex MCX516A-CDAT in the lab:
- mlx5_vsc_gw_read_block_fast() was called with length = 1310716.
- mlx5_vsc_gw_read_fast() reads 4 bytes at a time. It was not used to
read the entire 1310716 bytes. It was called 53813 times because
there are jumps in read_addr.
- On average mlx5_vsc_gw_read_fast() took 35284.4ns.
- In total mlx5_vsc_wait_on_flag() called vsc_read() 54707 times.
The average time for each call was 17548.3ns. In some instances
vsc_read() was called more than one time when the flag was not set.
As expected the thread released the cpu after 16 iterations in
mlx5_vsc_wait_on_flag().
- Total time to read crdump was 35284.4ns * 53813 ~= 1.898s.
It was seen in the field that crdump can take more than 5 seconds to
complete. During that time mlx5_vsc_wait_on_flag() did not release the
cpu because it did not complete 16 iterations. It is believed that pci
config reads were slow. Adding cond_resched() every 128 register read
improves the situation. In the common case the, crdump takes ~1.8989s,
the thread yields the cpu every ~4.51ms. If crdump takes ~5s, the thread
yields the cpu every ~18.0ms.
Fixes: 8b9d8baae1de ("net/mlx5: Add Crdump support")
Reviewed-by: Yuanyuan Zhong <yzhong@purestorage.com>
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Cross-merge networking fixes after downstream PR.
No conflicts.
Adjacent changes:
drivers/net/ethernet/broadcom/bnxt/bnxt.h
c948c0973df5 ("bnxt_en: Don't clear ntuple filters and rss contexts during ethtool ops")
f2878cdeb754 ("bnxt_en: Add support to call FW to update a VNIC")
Link: https://patch.msgid.link/20240822210125.1542769-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Prevent the call trace below from happening, by not allowing IPsec
creation over a slave, if master device doesn't support IPsec.
WARNING: CPU: 44 PID: 16136 at kernel/locking/rwsem.c:240 down_read+0x75/0x94
Modules linked in: esp4_offload esp4 act_mirred act_vlan cls_flower sch_ingress mlx5_vdpa vringh vhost_iotlb vdpa mst_pciconf(OE) nfsv3 nfs_acl nfs lockd grace fscache netfs xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_counter nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill cuse fuse rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_umad ib_iser libiscsi scsi_transport_iscsi rdma_cm ib_ipoib iw_cm ib_cm ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul mlx5_ib ghash_clmulni_intel sha1_ssse3 dell_smbios ib_uverbs aesni_intel crypto_simd dcdbas wmi_bmof dell_wmi_descriptor cryptd pcspkr ib_core acpi_ipmi sp5100_tco ccp i2c_piix4 ipmi_si ptdma k10temp ipmi_devintf ipmi_msghandler acpi_power_meter acpi_cpufreq ext4 mbcache jbd2 sd_mod t10_pi sg mgag200 drm_kms_helper syscopyarea sysfillrect mlx5_core sysimgblt fb_sys_fops cec
ahci libahci mlxfw drm pci_hyperv_intf libata tg3 sha256_ssse3 tls megaraid_sas i2c_algo_bit psample wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: mst_pci]
CPU: 44 PID: 16136 Comm: kworker/44:3 Kdump: loaded Tainted: GOE 5.15.0-20240509.el8uek.uek7_u3_update_v6.6_ipsec_bf.x86_64 #2
Hardware name: Dell Inc. PowerEdge R7525/074H08, BIOS 2.0.3 01/15/2021
Workqueue: events xfrm_state_gc_task
RIP: 0010:down_read+0x75/0x94
Code: 00 48 8b 45 08 65 48 8b 14 25 80 fc 01 00 83 e0 02 48 09 d0 48 83 c8 01 48 89 45 08 5d 31 c0 89 c2 89 c6 89 c7 e9 cb 88 3b 00 <0f> 0b 48 8b 45 08 a8 01 74 b2 a8 02 75 ae 48 89 c2 48 83 ca 02 f0
RSP: 0018:ffffb26387773da8 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffffa08b658af900 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ff886bc5e1366f2f RDI: 0000000000000000
RBP: ffffa08b658af940 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffa0a9bfb31540
R13: ffffa0a9bfb37900 R14: 0000000000000000 R15: ffffa0a9bfb37905
FS: 0000000000000000(0000) GS:ffffa0a9bfb00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055a45ed814e8 CR3: 000000109038a000 CR4: 0000000000350ee0
Call Trace:
<TASK>
? show_trace_log_lvl+0x1d6/0x2f9
? show_trace_log_lvl+0x1d6/0x2f9
? mlx5_devcom_for_each_peer_begin+0x29/0x60 [mlx5_core]
? down_read+0x75/0x94
? __warn+0x80/0x113
? down_read+0x75/0x94
? report_bug+0xa4/0x11d
? handle_bug+0x35/0x8b
? exc_invalid_op+0x14/0x75
? asm_exc_invalid_op+0x16/0x1b
? down_read+0x75/0x94
? down_read+0xe/0x94
mlx5_devcom_for_each_peer_begin+0x29/0x60 [mlx5_core]
mlx5_ipsec_fs_roce_tx_destroy+0xb1/0x130 [mlx5_core]
tx_destroy+0x1b/0xc0 [mlx5_core]
tx_ft_put+0x53/0xc0 [mlx5_core]
mlx5e_xfrm_free_state+0x45/0x90 [mlx5_core]
___xfrm_state_destroy+0x10f/0x1a2
xfrm_state_gc_task+0x81/0xa9
process_one_work+0x1f1/0x3c6
worker_thread+0x53/0x3e4
? process_one_work.cold+0x46/0x3c
kthread+0x127/0x144
? set_kthread_struct+0x60/0x52
ret_from_fork+0x22/0x2d
</TASK>
---[ end trace 5ef7896144d398e1 ]---
Fixes: dfbd229abeee ("net/mlx5: Configure IPsec steering for egress RoCEv2 MPV traffic")
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20240815071611.2211873-5-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
Documentation/devicetree/bindings/net/fsl,qoriq-mc-dpmac.yaml
c25504a0ba36 ("dt-bindings: net: fsl,qoriq-mc-dpmac: add missed property phys")
be034ee6c33d ("dt-bindings: net: fsl,qoriq-mc-dpmac: using unevaluatedProperties")
https://lore.kernel.org/20240815110934.56ae623a@canb.auug.org.au
drivers/net/dsa/vitesse-vsc73xx-core.c
5b9eebc2c7a5 ("net: dsa: vsc73xx: pass value in phy_write operation")
fa63c6434b6f ("net: dsa: vsc73xx: check busy flag in MDIO operations")
2524d6c28bdc ("net: dsa: vsc73xx: use defined values in phy operations")
https://lore.kernel.org/20240813104039.429b9fe6@canb.auug.org.au
Resolve by using FIELD_PREP(), Stephen's resolution is simpler.
Adjacent changes:
net/vmw_vsock/af_vsock.c
69139d2919dd ("vsock: fix recursive ->recvmsg calls")
744500d81f81 ("vsock: add support for SIOCOUTQ ioctl")
Link: https://patch.msgid.link/20240815141149.33862-1-pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Unconditionally calling the MPIR query on BF separate mode yields the FW
syndrome below [1]. Do not call it unless admin clearly specified the SD
group, i.e. expressing the intention of using the multi-PF netdev
feature.
This fix covers cases not covered in
commit fca3b4791850 ("net/mlx5: Do not query MPIR on embedded CPU function").
[1]
mlx5_cmd_out_err:808:(pid 8267): ACCESS_REG(0x805) op_mod(0x1) failed,
status bad system state(0x4), syndrome (0x685f19), err(-5)
Fixes: 678eb448055a ("net/mlx5: SD, Implement basic query and instantiation")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://patch.msgid.link/20240808144107.2095424-2-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Expose Precision Time Measurement support through related PTP ioctl.
The performance of PTM on ConnectX-7 was evaluated using both real-time
(RTC) and free-running (FRC) clocks under traffic and no traffic
conditions. Tests with phc2sys measured the maximum offset values at a 50Hz
rate, with and without PTM.
Results:
1. No traffic
+-----+--------+--------+
| | No-PTM | PTM |
+-----+--------+--------+
| FRC | 125 ns | <29 ns |
+-----+--------+--------+
| RTC | 248 ns | <34 ns |
+-----+--------+--------+
2. With traffic
+-----+--------+--------+
| | No-PTM | PTM |
+-----+--------+--------+
| FRC | 254 ns | <40 ns |
+-----+--------+--------+
| RTC | 255 ns | <45 ns |
+-----+--------+--------+
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Co-developed-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Carolina Jubran <cjubran@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Tested-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Link: https://patch.msgid.link/20240730134055.1835261-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In case pci channel becomes offline the driver should not wait for PCI
reads during health dump and recovery flow. The driver has timeout for
each of these loops trying to read PCI, so it would fail anyway.
However, in case of recovery waiting till timeout may cause the pci
error_detected() callback fail to meet pci_dpc_recovered() wait timeout.
Fixes: b3bd076f7501 ("net/mlx5: Report devlink health on FW fatal issues")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
A proper query to MPIR needs to set the correct value in the depth field.
On embedded CPU this value is not necessarily zero. As there is no real
use case for multi-PF netdev on the embedded CPU of the smart NIC, block
this option.
This fixes the following failure:
ACCESS_REG(0x805) op_mod(0x1) failed, status bad system state(0x4), syndrome (0x685f19), err(-5)
Fixes: 678eb448055a ("net/mlx5: SD, Implement basic query and instantiation")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
include/trace/events/rpcgss.h
386f4a737964 ("trace: events: cleanup deprecated strncpy uses")
a4833e3abae1 ("SUNRPC: Fix rpcgss_context trace event acceptor field")
Adjacent changes:
drivers/net/ethernet/intel/ice/ice_tc_lib.c
2cca35f5dd78 ("ice: Fix checking for unsupported keys on non-tunnel device")
784feaa65dfd ("ice: Add support for PFCP hardware offload in switchdev")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Check if devcom holds an error pointer and return immediately.
This fixes Smatch static checker warning:
drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c:221 sd_register()
error: 'devcom' dereferencing possible ERR_PTR()
Enhance mlx5_devcom_register_component() so it stops returning NULL,
making it easier for its callers.
Fixes: d3d057666090 ("net/mlx5: SD, Implement devcom communication and primary election")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/all/f09666c8-e604-41f6-958b-4cc55c73faf9@gmail.com/T/
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Link: https://lore.kernel.org/r/20240411115444.374475-3-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Replace matching on TCP and UDP protocols with new l4_type field which
is parsed by steering for ttc_table. It is enabled by the
outer_l4_type or inner_l4_type bits in nic_rx or port_sel flow table
capabilities and used only if pcc_ifa2 bit in HCA capabilities is set.
Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20240402133043.56322-10-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Have an actual mlx5_sd instance in the core device, and fix the getter
accordingly. This allows SD stuff to flow, the feature becomes supported
only here.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Add debugfs entries that describe the Socket-Direct group.
Example:
$ grep -H . /sys/kernel/debug/mlx5/0000\:08\:00.0/multi-pf/*
/sys/kernel/debug/mlx5/0000:08:00.0/multi-pf/group_id:0x00000101
/sys/kernel/debug/mlx5/0000:08:00.0/multi-pf/primary:0000:08:00.0 vhca 0x0
/sys/kernel/debug/mlx5/0000:08:00.0/multi-pf/secondary_0:0000:09:00.0 vhca 0x2
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Print to kernel log when an SD group moves from/to ready state.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Implement the needed SD steering adjustments for the primary and
secondaries.
While the SD multiple PFs are used to avoid cross-numa memory, when it
comes to chip level all traffic goes only through the primary device.
The secondaries are forced to silent mode, to guarantee they are not
involved in any unexpected ingress/egress traffic.
In RX, secondary devices will not have steering objects. Traffic will be
steered from the primary device to the RQs of a secondary device using
advanced cross-vhca RX steering capabilities.
In TX, the primary creates a new TX flow table, which is aliased by the
secondaries.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Use devcom to communicate between the different devices. Add a new
devcom component type for this.
Each device registers itself to the devcom component <SD, group ID>.
Once all devices of a component are registered, the component becomes
ready, and a primary device is elected.
In principle, any of the devices can act as a primary, they are all
capable, and a random election would've worked. However, we aim to
achieve predictability and consistency, hence each group always choses
the same device, with the lowest PCI BUS number, as primary.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Add implementation for querying the MPIR register for Socket-Direct
attributes, and instantiating a SD struct accordingly.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Add Socket-Direct API with empty/minimal implementation.
We fill-in the implementation gradually in downstream patches.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
mlx5 devices have specific constants for choosing the CQ period mode. These
constants do not have to match the constants used by the kernel software
API for DIM period mode selection.
Fixes: cdd04f4d4d71 ("net/mlx5: Add support to create SQ and CQ for ASO")
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Pull rdma updates from Jason Gunthorpe:
"Small cycle, with some typical driver updates:
- General code tidying in siw, hfi1, idrdma, usnic, hns rtrs and
bnxt_re
- Many small siw cleanups without an overeaching theme
- Debugfs stats for hns
- Fix a TX queue timeout in IPoIB and missed locking of the mcast
list
- Support more features of P7 devices in bnxt_re including a new work
submission protocol
- CQ interrupts for MANA
- netlink stats for erdma
- EFA multipath PCI support
- Fix Incorrect MR invalidation in iser"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (66 commits)
RDMA/bnxt_re: Fix error code in bnxt_re_create_cq()
RDMA/efa: Add EFA query MR support
IB/iser: Prevent invalidating wrong MR
RDMA/erdma: Add hardware statistics support
RDMA/erdma: Introduce dma pool for hardware responses of CMDQ requests
IB/iser: iscsi_iser.h: fix kernel-doc warning and spellos
RDMA/mana_ib: Add CQ interrupt support for RAW QP
RDMA/mana_ib: query device capabilities
RDMA/mana_ib: register RDMA device with GDMA
RDMA/bnxt_re: Fix the sparse warnings
RDMA/bnxt_re: Fix the offset for GenP7 adapters for user applications
RDMA/bnxt_re: Share a page to expose per CQ info with userspace
RDMA/bnxt_re: Add UAPI to share a page with user space
IB/ipoib: Fix mcast list locking
RDMA/mlx5: Expose register c0 for RDMA device
net/mlx5: E-Switch, expose eswitch manager vport
net/mlx5: Manage ICM type of SW encap
RDMA/mlx5: Support handling of SW encap ICM area
net/mlx5: Introduce indirect-sw-encap ICM properties
RDMA/bnxt_re: Adds MSN table capability for Gen P7 adapters
...
|
|
Revert "net/mlx5: Implement management PF Ethernet profile"
This reverts commit 22c4640698a1d47606b5a4264a584e8046641784.
Revert "net/mlx5: Enable SD feature"
This reverts commit c88c49ac9c18fb7c3fa431126de1d8f8f555e912.
Revert "net/mlx5e: Block TLS device offload on combined SD netdev"
This reverts commit 83a59ce0057b7753d7fbece194b89622c663b2a6.
Revert "net/mlx5e: Support per-mdev queue counter"
This reverts commit d72baceb92539a178d2610b0e9ceb75706a75b55.
Revert "net/mlx5e: Support cross-vhca RSS"
This reverts commit c73a3ab8fa6e93a783bd563938d7cf00d62d5d34.
Revert "net/mlx5e: Let channels be SD-aware"
This reverts commit e4f9686bdee7b4dd89e0ed63cd03606e4bda4ced.
Revert "net/mlx5e: Create EN core HW resources for all secondary devices"
This reverts commit c4fb94aa822d6c9d05fc3c5aee35c7e339061dc1.
Revert "net/mlx5e: Create single netdev per SD group"
This reverts commit e2578b4f983cfcd47837bbe3bcdbf5920e50b2ad.
Revert "net/mlx5: SD, Add informative prints in kernel log"
This reverts commit c82d360325112ccc512fc11a3b68cdcdf04a1478.
Revert "net/mlx5: SD, Implement steering for primary and secondaries"
This reverts commit 605fcce33b2d1beb0139b6e5913fa0b2062116b2.
Revert "net/mlx5: SD, Implement devcom communication and primary election"
This reverts commit a45af9a96740873db9a4b5bb493ce2ad81ccb4d5.
Revert "net/mlx5: SD, Implement basic query and instantiation"
This reverts commit 63b9ce944c0e26c44c42cdd5095c2e9851c1a8ff.
Revert "net/mlx5: SD, Introduce SD lib"
This reverts commit 4a04a31f49320d078b8078e1da4b0e2faca5dfa3.
Revert "net/mlx5: Fix query of sd_group field"
This reverts commit e04984a37398b3f4f5a79c993b94c6b1224184cc.
Revert "net/mlx5e: Use the correct lag ports number when creating TISes"
This reverts commit a7e7b40c4bc115dbf2a2bb453d7bbb2e0ea99703.
There are some unanswered questions on the list, and we don't
have any docs. Given the lack of replies so far and the fact
that v6.8 merge window has started - let's revert this and
revisit for v6.9.
Link: https://lore.kernel.org/all/20231221005721.186607-1-saeed@kernel.org/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Have an actual mlx5_sd instance in the core device, and fix the getter
accordingly. This allows SD stuff to flow, the feature becomes supported
only here.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Print to kernel log when an SD group moves from/to ready state.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Implement the needed SD steering adjustments for the primary and
secondaries.
While the SD multiple devices are used to avoid cross-numa memory, when
it comes to chip level all traffic goes only through the primary device.
The secondaries are forced to silent mode, to guarantee they are not
involved in any unexpected ingress/egress traffic.
In RX, secondary devices will not have steering objects. Traffic will be
steered from the primary device to the RQs of a secondary device using
advanced cross-vhca RX steering capabilities.
In TX, the primary creates a new TX flow table, which is aliased by the
secondaries.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Use devcom to communicate between the different devices. Add a new
devcom component type for this.
Each device registers itself to the devcom component <SD, group ID>.
Once all devices of a component are registered, the component becomes
ready, and a primary device is elected.
In principle, any of the devices can act as a primary, they are all
capable, and a random election would've worked. However, we aim to
achieve predictability and consistency, hence each group always choses
the same device, with the lowest PCI BUS number, as primary.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Add implementation for querying the MPIR register for Socket-Direct
attributes, and instantiating a SD struct accordingly.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Add Socket-Direct API with empty/minimal implementation.
We fill-in the implementation gradually in downstream patches.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Add a getter for the number of participants in a devcom
component (those who share the same component id and key).
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Support allocate/deallocate the new SW encap ICM type memory.
The new ICM type is used for encap context allocation managed by SW,
instead FW. It can increase encap context maximum number and allocation
speed
Signed-off-by: Shun Hao <shunh@nvidia.com>
Link: https://lore.kernel.org/r/bed5121255918eb132a1334141c76a0594df8143.1701871118.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
mlx5-updates-2023-11-13
1) Cleanup patches, leftovers from previous cycle
2) Allow sync reset flow when BF MGT interface device is present
3) Trivial ptp refactorings and improvements
4) Add local loopback counter to vport rep stats
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When running a phase adjustment operation, the free running clock should
not be modified at all. The phase control keyword is intended to trigger an
internal servo on the device that will converge to the provided delta. A
free running counter cannot implement phase adjustment.
Fixes: 8e11a68e2e8a ("net/mlx5: Add adjphase function to support hardware-only offset control")
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20231114215846.5902-5-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Some mlx5 devices do not support the default advertised maximum frequency
adjustment value for the PTP hardware clock that is set by the driver.
These devices need to be queried when initializing the clock functionality
in order to get the maximum supported frequency adjustment value. This
value can be greater than the minimum supported frequency adjustment across
mlx5 devices (50 million ppb).
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
adjustments
Represent scaled ppm as ppb to the device when the value in scaled ppm is
not representable as a 32-bit signed integer. mlx5 devices only support a
32-bit field for the frequency adjustment value in units of either scaled
ppm or ppb.
Since mlx5 devices only support a 32-bit field for the frequency adjustment
value independent of unit used, limit the maximum frequency adjustment to
S32_MAX ppb.
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Configure the PHC inside mlx5_init_timer_clock for calling mlx5_ptp_settime
later in the function. Would previously use mlx5_ptp_clock_info instance to
invoke mlx5_ptp_settime to set the NIC real-time clock to be synchronized
with the host system clock.
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Check if the MTUTC register of the NIC can be modified before attempting to
execute a real-time clock operation. Previous implementation aborted the
real-time clock operation pre-emptively when the MTUTC register used to
control the real-time clock was not modifiable, indicating real-time clock
mode was not enabled on the NIC. The original control flow was confusing
since the noop-if-RTC-disabled branch looked similar to an error handling
guard clause. The purpose of this patch is purely for improving readability
and should lead to no functional change.
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Commit 2ac9cfe78223 ("net/mlx5e: IPSec, Add Innova IPSec offload TX data path")
declared mlx5e_ipsec_inverse_table_init() but never implemented it.
Commit f52f2faee581 ("net/mlx5e: Introduce flow steering API")
declared mlx5e_fs_set_tc() but never implemented it.
Commit f2f3df550139 ("net/mlx5: EQ, Privatize eq_table and friends")
declared mlx5_eq_comp_cpumask() but never implemented it.
Commit cac1eb2cf2e3 ("net/mlx5: Lag, properly lock eswitch if needed")
removed mlx5_lag_update() but not its declaration.
Commit 35ba005d820b ("net/mlx5: DR, Set flex parser for TNL_MPLS dynamically")
removed mlx5dr_ste_build_tnl_mpls() but not its declaration.
Commit e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters")
declared but never implemented mlx5_alloc_cmd_mailbox_chain() and mlx5_free_cmd_mailbox_chain().
Commit 0cf53c124756 ("net/mlx5: FWPage, Use async events chain")
removed mlx5_core_req_pages_handler() but not its declaration.
Commit 938fe83c8dcb ("net/mlx5_core: New device capabilities handling")
removed mlx5_query_odp_caps() but not its declaration.
Commit f6a8a19bb11b ("RDMA/netdev: Hoist alloc_netdev_mqs out of the driver")
removed mlx5_rdma_netdev_alloc() but not its declaration.
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
mlx5_intf_lock is used to sync between LAG changes and its slaves
mlx5 core dev aux devices changes, which means every time mlx5 core
dev add/remove aux devices, mlx5 is taking this global lock, even if
LAG functionality isn't supported over the core dev.
This cause a bottleneck when probing VFs/SFs in parallel.
Hence, replace mlx5_intf_lock with HCA devcom component lock, or no
lock if LAG functionality isn't supported.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
LAG peer device lookout bus logic required the usage of global lock,
mlx5_intf_mutex.
As part of the effort to remove this global lock, refactor LAG peer
device lookout to use mlx5 devcom layer.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
Downstream patch will add devcom component which will be locked in
many places. This can lead to a false positive "possible circular
locking dependency" warning by lockdep, on flows which lock more than
one mlx5 devcom component, such as probing ETH aux device.
Hence, add a lock_class_key per mlx5 device.
Signed-off-by: Shay Drory <shayd@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
When the master device is unbinded, make sure to clean up all of the
steering rules or flow tables that were created over the master, in
order to allow proper unbinding of master, and for ethernet traffic
to continue to work independently.
Upon bringing master device back up and attaching the slave to it,
checks if the slave already has IPsec configured and if so reconfigure
the rules needed to support RoCE traffic.
Note that while master device is unbound, the user is unable to
configure IPsec again, since they are in a kind of illegal state in
which they are in MPV mode but the slave has no master.
However if IPsec was configured before hand, it will continue to work
for ethernet traffic while master is unbound, and would continue to
work for all traffic when the master is bound back again.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/8434e88912c588affe51b34669900382a132e873.1695296682.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add empty flow table in RDMA_RX master domain, to forward all received
traffic to it, in order to continue through the FW RoCE steering.
In order to achieve that however, first we check if the decrypted
traffic is RoCEv2, if so then forward it to RDMA_RX domain.
But in case the traffic is coming from the slave, have to first send the
traffic to an alias table in order to switch gvmi and from there we can
go to the appropriate gvmi flow table in RDMA_RX master domain.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/d2200b53158b1e7ef30996812107dd7207485c28.1695296682.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Add steering tables/rules in RDMA_TX master domain, to forward all traffic
to IPsec crypto table in NIC domain.
But in case the traffic is coming from the slave, have to first send the
traffic to an alias table in order to switch gvmi and from there we can
go to the appropriate gvmi crypto table in NIC domain.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/7ca5cf1ac5c6979359b8726e97510574e2b3d44d.1695296682.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Implements functions which creates an alias flow table, and check
if alias flow table creation is even supported, and if successful
returns the created alias flow table object id.
This function would be used in later patches to allow jumping from
one vhca to another, in order to add support for MPV mode.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/36e15ef41586f2a9aacc65b935de18391eef5607.1695296682.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Store the mlx5e priv devcom component within IPsec RoCE to enable
the IPsec RoCE code to access the other device's private information.
This includes retrieving the necessary device information and
the IPsec database, which helps determine if IPsec is configured or not.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/5bb3160ceeb07523542302886da54c78eef0d2af.1695296682.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
If the device is in MPV mode, the ethernet driver would now register
to events from IB driver about core devices affiliation or
de-affiliation.
Use the key provided in said event to connect each mlx5e priv
instance to it's master counterpart, this way the ethernet driver
is now aware of who is his master core device and even more, such
as knowing if partner device has IPsec configured or not.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://lore.kernel.org/r/279adfa0aa3a1957a339086f2c1739a50b8e4b68.1695296682.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
|
|
Merge in late fixes to prepare for the 6.6 net-next PR.
No conflicts.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
used as core library"
Recent merge had a conflict in:
drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec_fs.c
between commit:
aeb660171b06 ("net/mlx5e: fix double free in macsec_fs_tx_create_crypto_table_groups")
from Linus' tree and commit:
cb5ebe4896f9 ("net/mlx5e: Move MACsec flow steering operations to be used as core library")
from the mlx5-next tree. This was missed and the former commit
got lost, bring it back.
Fixes: 3c5066c6b0a5 ("Merge branch 'mlx5-next' of https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux")
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: https://lore.kernel.org/r/20230815123725.4ef5b7b9@canb.auug.org.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|