Age | Commit message (Collapse) | Author |
|
The i386 ABI disagrees with most other ABIs regarding alignment of
data types larger than 4 bytes: on most ABIs a padding must be added
at end of the structures, while it is not required on i386.
So for most ABI struct c4iw_alloc_ucontext_resp gets implicitly padded
to be aligned on a 8 bytes multiple, while for i386, such padding is
not added.
The tool pahole can be used to find such implicit padding:
$ pahole --anon_include \
--nested_anon_include \
--recursive \
--class_name c4iw_alloc_ucontext_resp \
drivers/infiniband/hw/cxgb4/iw_cxgb4.o
Then, structure layout can be compared between i386 and x86_64:
+++ obj-i386/drivers/infiniband/hw/cxgb4/iw_cxgb4.o.pahole.txt 2014-03-28 11:43:05.547432195 +0100
--- obj-x86_64/drivers/infiniband/hw/cxgb4/iw_cxgb4.o.pahole.txt 2014-03-28 10:55:10.990133017 +0100
@@ -2,9 +2,8 @@ struct c4iw_alloc_ucontext_resp {
__u64 status_page_key; /* 0 8 */
__u32 status_page_size; /* 8 4 */
- /* size: 12, cachelines: 1, members: 2 */
- /* last cacheline: 12 bytes */
+ /* size: 16, cachelines: 1, members: 2 */
+ /* padding: 4 */
+ /* last cacheline: 16 bytes */
};
This ABI disagreement will make an x86_64 kernel try to write past the
buffer provided by an i386 binary.
When boundary check will be implemented, the x86_64 kernel will refuse
to write past the i386 userspace provided buffer and the uverbs will
fail.
If the structure is on a page boundary and the next page is not
mapped, ib_copy_to_udata() will fail and the uverb will fail.
Additionally, as reported by Dan Carpenter, without the implicit
padding being properly cleared, an information leak would take place
in most architectures.
This patch adds an explicit padding to struct c4iw_alloc_ucontext_resp,
and, like 92b0ca7cb149 ("IB/mlx5: Fix stack info leak in
mlx5_ib_alloc_ucontext()"), makes function c4iw_alloc_ucontext()
not writting this padding field to userspace. This way, x86_64 kernel
will be able to write struct c4iw_alloc_ucontext_resp as expected by
unpatched and patched i386 libcxgb4.
Link: http://marc.info/?i=cover.1399309513.git.ydroneaud@opteya.com
Link: http://marc.info/?i=1395848977.3297.15.camel@localhost.localdomain
Link: http://marc.info/?i=20140328082428.GH25192@mwanda
Cc: <stable@vger.kernel.org>
Fixes: 05eb23893c2c ("cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes")
Reported-by: Yann Droneaud <ydroneaud@opteya.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
Modify the various routines used to allocate memory resources which
serve QPs in mlx4 to get an input GFP directive. Have the Ethernet
driver to use GFP_KERNEL in it's QP allocations as done prior to this
commit, and the IB driver to use GFP_NOIO when the IB verbs
IB_QP_CREATE_USE_GFP_NOIO QP creation flag is provided.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
Fix the usnic and thw qib drivers to err when QP creation flags that
they don't understand are provided.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
It is not possible to build only the drivers/infiniband/hw/ (or ulp/)
subdirectory with command such as:
$ make ARCH=x86_64 O=./obj-x86_64/ drivers/infiniband/hw/
This fails with following error messages:
make[2]: Nothing to be done for `all'.
make[2]: Nothing to be done for `relocs'.
CHK include/config/kernel.release
Using /home/ydroneaud/src/linux as source for kernel
GEN /home/ydroneaud/src/linux/obj-x86_64/Makefile
CHK include/generated/uapi/linux/version.h
CHK include/generated/utsrelease.h
CALL /home/ydroneaud/src/linux/scripts/checksyscalls.sh
/home/ydroneaud/src/linux/scripts/Makefile.build:44: /home/ydroneaud/src/linux/drivers/infiniband/hw/Makefile: No such file or directory
make[2]: *** No rule to make target `/home/ydroneaud/src/linux/drivers/infiniband/hw/Makefile'. Stop.
make[1]: *** [drivers/infiniband/hw/] Error 2
make: *** [sub-make] Error 2
This patch creates a Makefile in hw/ and ulp/ and moves each
corresponding parts of drivers/infiniband/Makefile in the new
Makefiles.
It should not break build except if some hw/ drivers or ulp/ were
allowed previously to be built while CONFIG_INFINIBAND is set to 'n',
but according to drivers/infiniband/Kconfig, it's not possible. So it
should be safe to apply.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
This reverts commit 70a640d0dae3a9b1b222ce673eb5d92c263ddd61.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The “affinity hint” mechanism is used by the user space
daemon, irqbalancer, to indicate a preferred CPU mask for irqs.
Irqbalancer can use this hint to balance the irqs between the
cpus indicated by the mask.
We wish the HCA to preferentially map the IRQs it uses to numa cores
close to it. To accomplish this, we use cpumask_set_cpu_local_first(), that
sets the affinity hint according the following policy:
First it maps IRQs to “close” numa cores. If these are exhausted, the
remaining IRQs are mapped to “far” numa cores.
Signed-off-by: Yuval Atias <yuvala@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The i386 ABI disagrees with most other ABIs regarding alignment of
data types larger than 4 bytes: on most ABIs a padding must be added
at end of the structures, while it is not required on i386.
So for most ABI struct c4iw_create_cq_resp gets implicitly padded
to be aligned on a 8 bytes multiple, while for i386, such padding
is not added.
The tool pahole can be used to find such implicit padding:
$ pahole --anon_include \
--nested_anon_include \
--recursive \
--class_name c4iw_create_cq_resp \
drivers/infiniband/hw/cxgb4/iw_cxgb4.o
Then, structure layout can be compared between i386 and x86_64:
+++ obj-i386/drivers/infiniband/hw/cxgb4/iw_cxgb4.o.pahole.txt 2014-03-28 11:43:05.547432195 +0100
--- obj-x86_64/drivers/infiniband/hw/cxgb4/iw_cxgb4.o.pahole.txt 2014-03-28 10:55:10.990133017 +0100
@@ -14,9 +13,8 @@ struct c4iw_create_cq_resp {
__u32 size; /* 28 4 */
__u32 qid_mask; /* 32 4 */
- /* size: 36, cachelines: 1, members: 6 */
- /* last cacheline: 36 bytes */
+ /* size: 40, cachelines: 1, members: 6 */
+ /* padding: 4 */
+ /* last cacheline: 40 bytes */
};
This ABI disagreement will make an x86_64 kernel try to write past the
buffer provided by an i386 binary.
When boundary check will be implemented, the x86_64 kernel will refuse
to write past the i386 userspace provided buffer and the uverbs will
fail.
If the structure is on a page boundary and the next page is not
mapped, ib_copy_to_udata() will fail and the uverb will fail.
This patch adds an explicit padding at end of structure
c4iw_create_cq_resp, and, like 92b0ca7cb149 ("IB/mlx5: Fix stack info
leak in mlx5_ib_alloc_ucontext()"), makes function c4iw_create_cq()
not writting this padding field to userspace. This way, x86_64 kernel
will be able to write struct c4iw_create_cq_resp as expected by
unpatched and patched i386 libcxgb4.
Link: http://marc.info/?i=cover.1399309513.git.ydroneaud@opteya.com
Cc: <stable@vger.kernel.org>
Fixes: cfdda9d764362 ("RDMA/cxgb4: Add driver for Chelsio T4 RNIC")
Fixes: e24a72a3302a6 ("RDMA/cxgb4: Fix four byte info leak in c4iw_create_cq()")
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
This commit adds the sysfs interface for enabling QP0 on VFs for
selected VF/port.
By default, no VFs are enabled for QP0 operation.
To enable QP0 operation on a VF/port, under
/sys/class/infiniband/mlx4_x/iov/<b:d:f>/ports/x there are two new entries:
- smi_enabled (read-only). Indicates whether smi is currently
enabled for the indicated VF/port
- enable_smi_admin (rw). Used by the admin to request that smi
capability be enabled or disabled for the indicated VF/port.
0 = disable, 1 = enable.
The requested enablement will occur at the next reset of the
VF (e.g. driver restart on the VM which owns the VF).
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
This commit adds the infrastructure for enabling selected VFs to
operate SMI (QP0) MADs without restriction.
Additionally, for these enabled VFs, their QP0 proxy and tunnel QPs
are MLX QPs. As such, they operate over VL15. Therefore, they are
not affected by "credit" problems or changes in the VLArb table (which
may shut down VL0).
Non-enabled VFs may only create UD proxy QP0 qps (which are forced by
the hypervisor to send packets using the q-key it assigns and places
in the qp-context). Thus, non-enabled VFs will not pose a security
risk. The hypervisor discards any privileged MADs it receives from
these non-enabled VFs.
By default, all VFs are NOT enabled, and must explicitly be enabled
by the administrator.
The sysfs interface which operates the VF enablement infrastructure
is provided in the next commit.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
Currently, VFs in SRIOV VFs are denied QP0 access. The main reason
for this decision is security, since Subnet Management Datagrams
(SMPs) are not restricted by network partitioning and may affect the
physical network topology. Moreover, even the SM may be denied access
from portions of the network by setting management keys unknown to the
SM.
However, it is desirable to grant SMI access to certain privileged
VFs, so that certain network management activities may be conducted
within virtual machines instead of the hypervisor.
This commit does the following:
1. Create QP0 tunnel QPs for all VFs.
2. Discard SMI mads sent-from/received-for non-privileged VFs in the
hypervisor MAD multiplex/demultiplex logic. SMI mads from/for
privileged VFs are allowed to pass.
3. MAD_IFC wrapper changes/fixes. For non-privileged VFs, only
host-view MAD_IFC commands are allowed, and only for SMI LID-Routed
GET mads. For privileged VFs, there are no restrictions.
This commit does not allow privileged VFs as yet. To determine if a VF
is privileged, it calls function mlx4_vf_smi_enabled(). This function
returns 0 unconditionally for now.
The next two commits allow defining and activating privileged VFs.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
mlx4_ib_modify_port is invoked in IB for resetting the Q_Key violations
counters and for modifying the IB port capability flags.
For example, when opensm is started up on the hypervisor,
mlx4_ib_modify_port is called to set the port's IsSM flag.
In multifunction mode, the SET_PORT command used in this flow should
be wrapped (so that the PF port capability flags are also tracked,
thus enabling the aggregate of all the VF/PF capability flags to be
tracked properly).
The procedure mlx4_SET_PORT() in main.c is also renamed to mlx4_ib_SET_PORT()
to differentiate it from procedure mlx4_SET_PORT() in port.c.
mlx4_ib_SET_PORT() is used exclusively by mlx4_ib_modify_port().
Finally, the CM invokes ib_modify_port() to set the IsCMSupported flag
even when running over RoCE. Therefore, when RoCE is active,
mlx4_ib_modify_port should return OK unconditionally (since the
capability flags and qkey violations counter are not relevant).
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
This patches changes user visible function names containing "qlogic"
in module init and cleanup.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Vinit Agnihotri <vinit.abhay.agnihotri@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
We know that "reset_tpt_entry" is false on this side of the if else
statement so there is no need to check again.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
Commit 297e0dad7206 ("IB/mlx4: Handle Ethernet L2 parameters for IP
based GID addressing") introduced a bug where is_mcast is now no
longer initialized on the non-multicast condition and so it can be
any random value from the stack. This issue was detected by cppcheck:
[drivers/infiniband/hw/mlx4/ah.c:103]: (error) Uninitialized
variable: is_mcast
Simple fix is to initialise is_mcast to zero.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
Time comparisons must use time_after / time_before to avoid problems
when jiffies wraps.
Signed-off-by: Manuel Schölling <manuel.schoelling@gmx.de>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
We need to cast wr_id to unsigned long before casting to a pointer.
This fixes:
drivers/infiniband/hw/mlx5/mr.c: In function 'mlx5_umr_cq_handler':
>> drivers/infiniband/hw/mlx5/mr.c:724:13: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
context = (struct mlx5_ib_umr_context *)wc.wr_id;
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
Prepends copyright and license to usnic_uiom_interval_tree.c
Signed-off-by: Upinder Malhi <umalhi@cisco.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
This patch addresses an issue where the legacy diagpacket is sent in
from the user, but the driver operates on only the extended
diagpkt. This patch specifically initializes the extended diagpkt
based on the legacy packet.
Cc: <stable@vger.kernel.org>
Reported-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
The code used a literal 1 in dispatching an IB_EVENT_PKEY_CHANGE.
As of the dual port qib QDR card, this is not necessarily correct.
Change to use the port as specified in the call.
Cc: <stable@vger.kernel.org>
Reported-by: Alex Estrin <alex.estrin@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
The cpl_abort_req struct has several reserved members which need to be
cleared to avoid disclosing kernel information. I have added a memset()
so now it matches the cxgb4 version of this function.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
The i386 ABI disagrees with most other ABIs regarding alignment of
data type larger than 4 bytes: on most ABIs a padding must be added at
end of the structures, while it is not required on i386.
So for most ABIs struct mlx5_ib_create_srq gets implicitly padded to be
aligned on a 8 bytes multiple, while for i386, such padding is not
added.
Tool pahole could be used to find such implicit padding:
$ pahole --anon_include \
--nested_anon_include \
--recursive \
--class_name mlx5_ib_create_srq \
drivers/infiniband/hw/mlx5/mlx5_ib.o
Then, structure layout can be compared between i386 and x86_64:
+++ obj-i386/drivers/infiniband/hw/mlx5/mlx5_ib.o.pahole.txt 2014-03-28 11:43:07.386413682 +0100
--- obj-x86_64/drivers/infiniband/hw/mlx5/mlx5_ib.o.pahole.txt 2014-03-27 13:06:17.788472721 +0100
@@ -69,7 +68,6 @@ struct mlx5_ib_create_srq {
__u64 db_addr; /* 8 8 */
__u32 flags; /* 16 4 */
- /* size: 20, cachelines: 1, members: 3 */
- /* last cacheline: 20 bytes */
+ /* size: 24, cachelines: 1, members: 3 */
+ /* padding: 4 */
+ /* last cacheline: 24 bytes */
};
ABI disagreement will make an x86_64 kernel try to read past
the buffer provided by an i386 binary.
When boundary check will be implemented, the x86_64 kernel will
refuse to read past the i386 userspace provided buffer and the
uverb will fail.
Anyway, if the structure lay in memory on a page boundary and
next page is not mapped, ib_copy_from_udata() will fail and the
uverb will fail.
This patch makes create_srq_user() takes care of the input
data size to handle the case where no padding was provided.
This way, x86_64 kernel will be able to handle struct mlx5_ib_create_srq
as sent by unpatched and patched i386 libmlx5.
Link: http://marc.info/?i=cover.1399309513.git.ydroneaud@opteya.com
Cc: <stable@vger.kernel.org>
Fixes: e126ba97dba9e ("mlx5: Add driver for Mellanox Connect-IB adapter")
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
The i386 ABI disagrees with most other ABIs regarding alignment of
data type larger than 4 bytes: on most ABIs a padding must be added at
end of the structures, while it is not required on i386.
So for most ABI struct mlx5_ib_create_cq get padded to be aligned on a
8 bytes multiple, while for i386, such padding is not added.
The tool pahole can be used to find such implicit padding:
$ pahole --anon_include \
--nested_anon_include \
--recursive \
--class_name mlx5_ib_create_cq \
drivers/infiniband/hw/mlx5/mlx5_ib.o
Then, structure layout can be compared between i386 and x86_64:
+++ obj-i386/drivers/infiniband/hw/mlx5/mlx5_ib.o.pahole.txt 2014-03-28 11:43:07.386413682 +0100
--- obj-x86_64/drivers/infiniband/hw/mlx5/mlx5_ib.o.pahole.txt 2014-03-27 13:06:17.788472721 +0100
@@ -34,9 +34,8 @@ struct mlx5_ib_create_cq {
__u64 db_addr; /* 8 8 */
__u32 cqe_size; /* 16 4 */
- /* size: 20, cachelines: 1, members: 3 */
- /* last cacheline: 20 bytes */
+ /* size: 24, cachelines: 1, members: 3 */
+ /* padding: 4 */
+ /* last cacheline: 24 bytes */
};
This ABI disagreement will make an x86_64 kernel try to read past the
buffer provided by an i386 binary.
When boundary check will be implemented, a x86_64 kernel will refuse
to read past the i386 userspace provided buffer and the uverb will
fail.
Anyway, if the structure lies in memory on a page boundary and next
page is not mapped, ib_copy_from_udata() will fail when trying to read
the 4 bytes of padding and the uverb will fail.
This patch makes create_cq_user() takes care of the input data size to
handle the case where no padding is provided.
This way, x86_64 kernel will be able to handle struct
mlx5_ib_create_cq as sent by unpatched and patched i386 libmlx5.
Link: http://marc.info/?i=cover.1399309513.git.ydroneaud@opteya.com
Cc: <stable@vger.kernel.org>
Fixes: e126ba97dba9e ("mlx5: Add driver for Mellanox Connect-IB adapter")
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
Instead of having the UMR context part of each memory region, allocate
a struct on the stack. This allows queuing multiple UMRs that access
the same memory region.
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
For user QPs, the creation process does not currently initialize the fields:
* qp->rq.offset
* qp->sq.offset
* qp->sq.wqe_shift
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
The patch stores iova, pd and size during mr creation and after UMRs
that modify them. It removes the unused access flags field.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
For memory regions that are allocated using reg_umr, the suffix of
mlx5_core_create_mkey isn't being called. Instead the creation is
completed in a callback function (reg_mr_callback). This means that
these MRs aren't being added to the MR radix tree. Add them in the
callback.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
If ib_post_send fails when posting the UMR work request in reg_umr,
the code doesn't release the temporary pas buffer allocated, and
doesn't dma_unmap it.
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
Some DIF implementations (SCSI initiator/target) may want to use different
input/output values for application tag and/or reference tag. So in
case memory/wire domain values don't match HW must not copy them.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
No need for repetition format pattern in case the data and protection
are already interleaved in the memory domain since the pattern
already exists. A single key entry is sufficient and may save some
extra fetch ops.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
When the data and protection are interleaved in the memory domain, no
need to expand the mkey total length.
At the moment no Linux user works (iSER initiator & target) in
interleaved mode. This may change in the future as for SCSI
pass-through devices there is no real point in target performing
de-interleaving and re-interleaving of the protection data in the PT
stage. Regardless, signature verbs support this mode.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
RDMA connections over a vlan interface don't work due to
import_ep() not using the correct egress device.
- use the real device in import_ep()
- use rdma_vlan_dev_real_dev() in get_real_dev().
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
This removes an open-coded duplicate of simple_open().
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
c4iw_alloc() bails out without freeing the storage that 'devp' points to.
Picked up by Coverity - CID 1204241.
Fixes: fa658a98a2 ("RDMA/cxgb4: Use the BAR2/WC path for kernel QPs and T5 devices")
Signed-off-by: Christoph Jaeger <christophjaeger@linux.com>
Acked-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
When we receive a netdev event indicating a netdev change and/or
a netdev address change, we must change the MAC index used by the
proxy QP1 (in the QP context), otherwise RoCE CM packets sent by the
VF will not carry the same source MAC address as the non-CM packets.
We use the UPDATE_QP command to perform this change.
In order to avoid modifying a QP context based on netdev event,
while the driver attempts to destroy this QP (e.g either the mlx4_ib
or ib_mad modules are unloaded), we use mutex locking in both flows.
Since the relevant mlx4 proxy GSI QP is created indirectly by the
mad module when they create their GSI QP, the mlx4 didn't need to
keep track on that QP prior to this change.
Now, when QP modifications are needed to this QP from within the
driver, we added refernece to it.
Signed-off-by: Matan Barak <matanb@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
The whole db drop avoidance stuff is for T4 only. So we cannot allow
that to be enabled for T5 devices.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
This is required to work around a T5 HW issue.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
In cases where the cm calls c4iw_modify_rc_qp() with the endpoint
mutex held, they must be called with internal == 1. rx_data() and
process_mpa_reply() are not doing this. This causes a deadlock
because c4iw_modify_rc_qp() might call c4iw_ep_disconnect() in some
!internal cases, and c4iw_ep_disconnect() acquires the endpoint mutex.
The design was intended to only do the disconnect for !internal calls.
Change rx_data(), FPDU_MODE case, to call c4iw_modify_rc_qp() with
internal == 1, and then disconnect only after releasing the mutex.
Change process_mpa_reply() to call c4iw_modify_rc_qp(TERMINATE) with
internal == 1 and set a new attr flag telling it to send a TERMINATE
message. Previously this was implied by !internal.
Change process_mpa_reply() to return whether the caller should
disconnect after releasing the endpoint mutex. Now rx_data() will do
the disconnect in the cases where process_mpa_reply() wants to
disconnect after the TERMINATE is sent.
Change c4iw_modify_rc_qp() RTS->TERM to only disconnect if !internal,
and to send a TERMINATE message if attrs->send_term is 1.
Change abort_connection() to not aquire the ep mutex for setting the
state, and make all calls to abort_connection() do so with the mutex
held.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
|
|
Need to get the endpoint reference before calling rdma_fini(), which
might fail causing us to not get the reference.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
The max depth of a fastreg mr depends on whether the device supports
DSGL or not. So compute it dynamically based on the device support
and the module use_dsgl option.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
There is a race when moving a QP from RTS->CLOSING where a SQ work
request could be posted after the FW receives the RDMA_RI/FINI WR.
The SQ work request will never get processed, and should be completed
with FLUSHED status. Function c4iw_flush_sq(), however was dropping
the oldest SQ work request when in CLOSING or IDLE states, instead of
completing the pending work request. If that oldest pending work
request was actually complete and has a CQE in the CQ, then when that
CQE is proceessed in poll_cq, we'll BUG_ON() due to the inconsistent
SQ/CQ state.
This is a very small timing hole and has only been hit once so far.
The fix is two-fold:
1) c4iw_flush_sq() MUST always flush all non-completed WRs with FLUSHED
status regardless of the QP state.
2) In c4iw_modify_rc_qp(), always set the "in error" bit on the queue
before moving the state out of RTS. This ensures that the state
transition will not happen while another thread is in
post_rc_send(), because set_state() and post_rc_send() both aquire
the qp spinlock. Also, once we transition the state out of RTS,
subsequent calls to post_rc_send() will fail because the "in error"
bit is set. I don't think this fully closes the race where the FW
can get a FINI followed a SQ work request being posted (because
they are posted to differente EQs), but the #1 fix will handle the
issue by flushing the SQ work request.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
Some HW platforms can reorder read operations, so we must rmb() after
we see a valid gen bit in a CQE but before we read any other fields
from the CQE.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
1) timedout endpoint processing can be starved. If there are continual
CPL messages flowing into the driver, the endpoint timeout
processing can be starved. This condition exposed the other bugs
below.
Solution: In process_work(), call process_timedout_eps() after each CPL
is processed.
2) Connection events can be processed even though the endpoint is on
the timeout list. If the endpoint is scheduled for timeout
processing, then we must ignore MPA Start Requests and Replies.
Solution: Change stop_ep_timer() to return 1 if the ep has already been
queued for timeout processing. All the callers of stop_ep_timer() need
to check this and act accordingly. There are just a few cases where
the caller needs to do something different if stop_ep_timer() returns 1:
1) in process_mpa_reply(), ignore the reply and process_timeout()
will abort the connection.
2) in process_mpa_request, ignore the request and process_timeout()
will abort the connection.
It is ok for callers of stop_ep_timer() to abort the connection since
that will leave the state in ABORTING or DEAD, and process_timeout()
now ignores timeouts when the ep is in these states.
3) Double insertion on the timeout list. Since the endpoint timers
are used for connection setup and teardown, we need to guard
against the possibility that an endpoint is already on the timeout
list. This is a rare condition and only seen under heavy load and
in the presense of the above 2 bugs.
Solution: In ep_timeout(), don't queue the endpoint if it is already on
the queue.
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
[ Fix cast from u64* to integer. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
Add support for the block multicast loopback QP creation flag along
the proper firmware API for that.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|