path: root/drivers/infiniband/ulp/ipoib
Age  Commit message  Author
2015-05-05  IPoIB/CM: Fix indentation level  (Bart Van Assche)
See also patch "IPoIB/cm: Add connected mode support for devices without SRQs" (commit ID 68e995a29572). Detected by smatch. Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-22  Merge tag 'rdma-for-linus' of ...  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband Pull InfiniBand/RDMA updates from Roland Dreier: - IPoIB fixes from Doug Ledford and Erez Shitrit - iSER updates from Sagi Grimberg - mlx4 GUID handling changes from Yishai Hadas - other misc fixes * tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (51 commits) mlx5: wrong page mask if CONFIG_ARCH_DMA_ADDR_T_64BIT enabled for 32Bit architectures IB/iser: Rewrite bounce buffer code path IB/iser: Bump version to 1.6 IB/iser: Remove code duplication for a single DMA entry IB/iser: Pass struct iser_mem_reg to iser_fast_reg_mr and iser_reg_sig_mr IB/iser: Modify struct iser_mem_reg members IB/iser: Make fastreg pool cache friendly IB/iser: Move PI context alloc/free to routines IB/iser: Move fastreg descriptor pool get/put to helper functions IB/iser: Merge build page-vec into register page-vec IB/iser: Get rid of struct iser_rdma_regd IB/iser: Remove redundant assignments in iser_reg_page_vec IB/iser: Move memory reg/dereg routines to iser_memory.c IB/iser: Don't pass ib_device to fall_to_bounce_buff routine IB/iser: Remove a redundant struct iser_data_buf IB/iser: Remove redundant cmd_data_len calculation IB/iser: Fix wrong calculation of protection buffer length IB/iser: Handle fastreg/local_inv completion errors IB/iser: Fix unload during ep_poll wrong dereference ib_srpt: convert printk's to pr_* functions ...
2015-04-17  IB/ipoib: Fix ndo_get_iflink  (Erez Shitrit)
Currently, the iflink of the parent interface is always accessed, even when the interface doesn't have a parent, and hence we crash there. Handle the interface types properly: for a child interface, return the ifindex of the parent; for a parent interface, return its own ifindex. For child devices, make sure to set the parent pointer prior to invoking register_netdevice(); this allows the new ndo to be called by the stack immediately after the child device is registered. Fixes: 5aa7add8f14b ('infiniband/ipoib: implement ndo_get_iflink') Reported-by: Honggang Li <honli@redhat.com> Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Honggang Li <honli@redhat.com> Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-15  IB/ipoib: Remove IPOIB_MCAST_RUN bit  (Erez Shitrit)
After Doug Ledford's changes there is no need for that bit; its semantics have become a subset of the IPOIB_FLAG_OPER_UP bit. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15  IB/ipoib: Save only IPOIB_MAX_PATH_REC_QUEUE skb's  (Erez Shitrit)
Whenever there is no path->ah to the destination, keep only a defined number of skb's. Otherwise there are cases where the driver can keep an unbounded list of skb's. For example, when a device wants to send a unicast ARP to the destination and for some reason the SM doesn't respond, the driver currently keeps all the skb's. If that unicast ARP traffic stops, all these skb's are kept by the path object until the interface is brought down. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
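For illustration, a minimal sketch of the bounded queueing described above, assuming the standard skb_queue helpers and the driver's IPOIB_MAX_PATH_REC_QUEUE limit; the helper name and the path->queue field are used as a sketch, not the driver's exact code:

    /* Sketch: keep at most IPOIB_MAX_PATH_REC_QUEUE skbs while there is no path->ah. */
    static void ipoib_path_queue_skb(struct ipoib_path *path, struct sk_buff *skb,
                                     struct net_device *dev)
    {
            if (skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
                    /* Hold the packet until the SM answers the path query. */
                    __skb_queue_tail(&path->queue, skb);
            } else {
                    /* Bound the backlog: account and drop instead of growing forever. */
                    ++dev->stats.tx_dropped;
                    dev_kfree_skb_any(skb);
            }
    }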
2015-04-15  IB/ipoib: Handle QP in SQE state  (Erez Shitrit)
As the result of a completion error the QP can be moved to the SQE state by the hardware. Since this is not the Error state, there are no flushes and hence the driver doesn't know about it. The fix creates a task that, after a completion error which is not a flush, checks the QP state and, if it is in the SQE state, moves it back to RTS. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
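A rough sketch of such a recovery task, assuming the standard ib_query_qp()/ib_modify_qp() verbs; the work-item field name below is hypothetical:

    /* Sketch: work item that checks for the SQE state and moves the QP back to RTS. */
    static void ipoib_qp_state_check_task(struct work_struct *work)
    {
            struct ipoib_dev_priv *priv =
                    container_of(work, struct ipoib_dev_priv, qp_check_work); /* hypothetical field */
            struct ib_qp_attr qp_attr;
            struct ib_qp_init_attr init_attr;

            if (ib_query_qp(priv->qp, &qp_attr, IB_QP_STATE, &init_attr))
                    return;

            if (qp_attr.qp_state == IB_QPS_SQE) {
                    qp_attr.qp_state = IB_QPS_RTS;
                    if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
                            pr_warn("ipoib: failed to move QP out of SQE back to RTS\n");
            }
    }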
2015-04-15  IB/ipoib: Update broadcast record values after each successful join request  (Erez Shitrit)
Update the cached broadcast record in the priv object after every new join of the broadcast domain group. These values are needed for the port configuration (MTU size) and as the initial parameters for all new multicast (non-broadcast) join requests. For example, the SM starts with a 2K MTU for the whole fabric, and later restarts (or hands over to a new SM) with a new port configuration of a 4K MTU. Without using the new values, the driver would keep its old configuration of 2K and would never apply the new configuration of 4K. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
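For illustration, a sketch of refreshing the cached record and the derived multicast MTU on each successful broadcast join, assuming ib_mtu_enum_to_int() and the IPOIB_UD_MTU() macro; the field names follow the commit's description and may not match the code exactly:

    /* Sketch: cache the newest mcmember values so the port MTU and later joins use them. */
    static void ipoib_refresh_broadcast_record(struct ipoib_dev_priv *priv,
                                               struct ib_sa_mcmember_rec *mcmember)
    {
            priv->broadcast->mcmember = *mcmember;
            /* Recompute the datagram MTU from what the (possibly new) SM handed back. */
            priv->mcast_mtu = IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu));
    }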
2015-04-15  IB/ipoib: Use one linear skb in RX flow  (Erez Shitrit)
The current code in the RX flow uses two sg entries for each incoming packet: the first for the IB headers and the second for the rest of the data. That causes two dma map/unmap calls, two allocations, and a few more actions on the data path. Use only one linear skb for each incoming packet, holding both the IB headers and the payload. This reduces the packet processing in the data path (only one skb, no frags, the first frag was not used anyway, fewer memory allocations) and the dma handling (only one dma map/unmap per incoming packet instead of two). After commit 73d3fe6d1c6d ("gro: fix aggregation for skb using frag_list") from Eric Dumazet, we get full aggregation for large packets. Running bandwidth tests before and after this change (over the card's numa node), using "netperf -H 1.1.1.3 -T -t TCP_STREAM", throughput goes from ~12Gbs to ~16Gbs on my setup (Mellanox's ConnectX3). Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
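A sketch of the single-buffer RX allocation this describes, assuming ib_dma_map_single() and a per-ring rx buffer array; the 4-byte reserve for the IPoIB encapsulation header and the field names are illustrative assumptions:

    /* Sketch: one linear skb per RX slot, IB header and payload in the same buffer. */
    static struct sk_buff *ipoib_alloc_rx_skb_sketch(struct ipoib_dev_priv *priv,
                                                     int id, int buf_size)
    {
            struct sk_buff *skb;
            u64 mapping;

            skb = dev_alloc_skb(buf_size + 4);
            if (unlikely(!skb))
                    return NULL;
            skb_reserve(skb, 4);    /* keep the IP header aligned after the 4-byte encap */

            mapping = ib_dma_map_single(priv->ca, skb->data, buf_size, DMA_FROM_DEVICE);
            if (unlikely(ib_dma_mapping_error(priv->ca, mapping))) {
                    dev_kfree_skb_any(skb);
                    return NULL;
            }

            /* One sg entry, one map/unmap for the whole packet. */
            priv->rx_ring[id].skb = skb;
            priv->rx_ring[id].mapping[0] = mapping;
            return skb;
    }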
2015-04-15  IB/ipoib: drop mcast_mutex usage  (Doug Ledford)
We needed the mcast_mutex when we had to prevent the join completion callback from having the value it stored in mcast->mc overwritten by a delayed return from ib_sa_join_multicast. By storing the return of ib_sa_join_multicast in an intermediate variable, we prevent a delayed return from ib_sa_join_multicast overwriting the valid contents of mcast->mc, and we no longer need a mutex to force the join callback to run after the return of ib_sa_join_multicast. This allows us to do away with the mutex entirely and protect our critical sections with just a spinlock instead. This is highly desirable as there were some places where we couldn't use a mutex because the code was not allowed to sleep, and so we were using a mix of mutex and spinlock to protect what we needed to protect. Now we only have a spinlock and the locking complexity is greatly reduced. Signed-off-by: Doug Ledford <dledford@redhat.com>
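A sketch of the intermediate-variable pattern under the spinlock, assuming the ib_sa_join_multicast() verb and the driver's ipoib_sa_client/ipoib_mcast_join_complete names; the error handling shown is illustrative only:

    /* Sketch: keep the verb's return value in a local; the join callback owns mcast->mc. */
    static int ipoib_mcast_start_join_sketch(struct ipoib_dev_priv *priv,
                                             struct ipoib_mcast *mcast,
                                             struct ib_sa_mcmember_rec *rec,
                                             ib_sa_comp_mask comp_mask)
    {
            struct ib_sa_multicast *multicast;
            int ret = 0;

            multicast = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
                                             rec, comp_mask, GFP_KERNEL,
                                             ipoib_mcast_join_complete, mcast);

            spin_lock_irq(&priv->lock);
            if (IS_ERR(multicast)) {
                    /* The join never went out; leave mcast->mc alone and back off. */
                    ret = PTR_ERR(multicast);
                    clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
            }
            spin_unlock_irq(&priv->lock);
            return ret;
    }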
2015-04-15  IB/ipoib: deserialize multicast joins  (Doug Ledford)
Allow the ipoib layer to attempt to join all outstanding multicast groups at once. The ib_sa layer will serialize multiple attempts to join the same group, but will process attempts to join different groups in parallel. Take advantage of that. In order to make this happen, change the mcast_join_thread to loop through all needed joins, sending a join request for each one that we still need to join. There are a few special cases we handle though: 1) Don't attempt to join anything but the broadcast group until the join of the broadcast group has succeeded. 2) No longer restart the join task at the end of completion handling. If we completed successfully, we are done. The join task now needs to be kicked by either mcast_send or mcast_restart_task or mcast_start_thread, but should not need to be started at any other time except when scheduling a backoff attempt to rejoin. 3) No longer use separate join/completion routines for regular and sendonly joins; pass them all through the same routine and just do the right thing based on the SENDONLY join flag. 4) Only try to join a SENDONLY join twice, then drop the packets and quit trying. We leave the mcast group in the list so that if we get a new packet, all that we have to do is queue up the packet and restart the join task and it will automatically try to join twice and then either send or flush the queue again. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15  IB/ipoib: fix MCAST_FLAG_BUSY usage  (Doug Ledford)
Commit a9c8ba5884 ("IPoIB: Fix usage of uninitialized multicast objects") added a new flag MCAST_JOIN_STARTED, but was not very strict in how it was used. We didn't always initialize the completion struct before we set the flag, and we didn't always call complete on the completion struct from all paths that complete it. And when we did complete it, sometimes we continued to touch the mcast entry after the completion, opening us up to possible use after free issues. This made it less than totally effective, and certainly made its use confusing. And in the flush function we would use the presence of this flag to signal that we should wait on the completion struct, but we never cleared this flag, ever. In order to make things clearer and aid in resolving the rtnl deadlock bug I've been chasing, I cleaned this up a bit.
 1) Remove the MCAST_JOIN_STARTED flag entirely
 2) Change MCAST_FLAG_BUSY so it now only means a join is in-flight
 3) Test mcast->mc directly to see if we have completed ib_sa_join_multicast (using IS_ERR_OR_NULL)
 4) Make sure that before setting MCAST_FLAG_BUSY we always initialize the mcast->done completion struct
 5) Make sure that before calling complete(&mcast->done), we always clear the MCAST_FLAG_BUSY bit
 6) Take the mcast_mutex before we call ib_sa_multicast_join and also take the mutex in our join callback. This forces ib_sa_multicast_join to return and set mcast->mc before we process the callback. This way, our callback can safely clear mcast->mc if there is an error on the join and we will do the right thing as a result in mcast_dev_flush.
 7) Because we need the mutex to synchronize mcast->mc, we can no longer call mcast_sendonly_join directly from mcast_send and instead must add sendonly join processing to the mcast_join_task
 8) Make MCAST_RUN mean that we have a working mcast subsystem, not that we have a running task. We know when we need to reschedule our join task thread and don't need a flag to tell us.
 9) Add a helper for rescheduling the join task thread
A number of different races are resolved with these changes. These races existed with the old MCAST_FLAG_BUSY usage, the MCAST_JOIN_STARTED flag was an attempt to address them, and while it helped, a determined effort could still trip things up. One race looks something like this:
    Thread 1                                        Thread 2
    ib_sa_join_multicast (as part of running restart mcast task)
      alloc member
      call callback
                                                    ifconfig ib0 down
                                                    wait_for_completion
      callback call completes
                                                    wait_for_completion in mcast_dev_flush completes
                                                      mcast->mc is PTR_ERR_OR_NULL so we skip ib_sa_leave_multicast
      return from callback
      return from ib_sa_join_multicast
    set mcast->mc = return from ib_sa_multicast
We now have a permanently unbalanced join/leave issue that trips up the refcounting in core/multicast.c Another like this:
    Thread 1                        Thread 2                        Thread 3
    ib_sa_multicast_join
                                                                    ifconfig ib0 down
                                                                    priv->broadcast = NULL
                                    join_complete
                                                                    wait_for_completion
                                    mcast->mc is not yet set, so don't clear
    return from ib_sa_join_multicast and set mcast->mc
                                    complete
                                    return -EAGAIN (making mcast->mc invalid)
                                                                    call ib_sa_multicast_leave on invalid mcast->mc, hang forever
By holding the mutex around ib_sa_multicast_join and taking the mutex early in the callback, we force mcast->mc to be valid at the time we run the callback. This allows us to clear mcast->mc if there is an error and the join is going to fail. We do this before we complete the mcast. In this way, mcast_dev_flush always sees consistent state in regards to mcast->mc membership at the time that the wait_for_completion() returns. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15  IB/ipoib: No longer use flush as a parameter  (Doug Ledford)
Various places in the IPoIB code had a deadlock related to flushing the ipoib workqueue. Now that we have per device workqueues and a specific flush workqueue, there is no longer a deadlock issue with flushing the device specific workqueues and we can do so unilaterally. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15  IB/ipoib: Use dedicated workqueues per interface  (Doug Ledford)
During my recent work on the rtnl lock deadlock in the IPoIB driver, I saw that even once I fixed the apparent races for a single device, as soon as that device had any children, new races popped up. It turns out that this is because no matter how well we protect against races on a single device, the fact that all devices use the same workqueue, and flush_workqueue() flushes *everything* from that workqueue means that we would also have to prevent all races between different devices (for instance, ipoib_mcast_restart_task on interface ib0 can race with ipoib_mcast_flush_dev on interface ib0.8002, resulting in a deadlock on the rtnl_lock). There are several possible solutions to this problem: Make carrier_on_task and mcast_restart_task try to take the rtnl for some set period of time and if they fail, then bail. This runs the real risk of dropping work on the floor, which can end up being its own separate kind of deadlock. Set some global flag in the driver that says some device is in the middle of going down, letting all tasks know to bail. Again, this can drop work on the floor. Or the method this patch attempts to use, which is when we bring an interface up, create a workqueue specifically for that interface, so that when we take it back down, we are flushing only those tasks associated with our interface. In addition, keep the global workqueue, but now limit it to only flush tasks. In this way, the flush tasks can always flush the device specific work queues without having deadlock issues. Signed-off-by: Doug Ledford <dledford@redhat.com>
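A sketch of the per-interface workqueue idea, assuming a single-threaded workqueue per device plus the existing global queue kept only for flush tasks; the priv->wq field name is an assumption:

    /* Sketch: one workqueue per interface, so flushing ib0 never waits on ib0.8002's work. */
    static int ipoib_dev_init_wq_sketch(struct net_device *dev)
    {
            struct ipoib_dev_priv *priv = netdev_priv(dev);

            priv->wq = create_singlethread_workqueue("ipoib_wq");
            return priv->wq ? 0 : -ENOMEM;
    }

    static void ipoib_dev_cleanup_wq_sketch(struct net_device *dev)
    {
            struct ipoib_dev_priv *priv = netdev_priv(dev);

            if (priv->wq) {
                    flush_workqueue(priv->wq);      /* only this interface's tasks */
                    destroy_workqueue(priv->wq);
                    priv->wq = NULL;
            }
    }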
2015-04-15  IB/ipoib: Make the carrier_on_task race aware  (Doug Ledford)
We blindly assume that we can just take the rtnl lock and that will prevent races with downing this interface. Unfortunately, that's not the case. In ipoib_mcast_stop_thread() we will call flush_workqueue() in an attempt to clear out all remaining instances of ipoib_join_task. But, since this task is put on the same workqueue as the join task, the flush_workqueue waits on this thread too. But this thread is deadlocked on the rtnl lock. The better thing here is to use trylock and loop on that until we either get the lock or we see that FLAG_OPER_UP has been cleared, in which case we don't need to do anything anyway and we just return. While investigating which flag should be used, FLAG_ADMIN_UP or FLAG_OPER_UP, it was determined that FLAG_OPER_UP was the more appropriate flag to use. However, there was a mix of these two flags in use in the existing code. So while we check for that flag here as part of this race fix, also cleanup the two places that had used the less appropriate flag for their tests. Signed-off-by: Doug Ledford <dledford@redhat.com>
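A sketch of the trylock-and-recheck loop described above, assuming rtnl_trylock() and the IPOIB_FLAG_OPER_UP flag bit:

    /* Sketch: never block on rtnl_lock; bail out if the interface went down meanwhile. */
    static void ipoib_mcast_carrier_on_task_sketch(struct work_struct *work)
    {
            struct ipoib_dev_priv *priv =
                    container_of(work, struct ipoib_dev_priv, carrier_on_task);

            while (!rtnl_trylock()) {
                    if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags))
                            return;         /* interface is going down, nothing to do */
                    msleep(20);             /* give whoever holds rtnl time to finish */
            }

            netif_carrier_on(priv->dev);
            rtnl_unlock();
    }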
2015-04-15  IB/ipoib: Consolidate rtnl_lock tasks in workqueue  (Doug Ledford)
The ipoib_mcast_flush_dev routine is called with the rtnl_lock held and needs to keep it held. It also needs to call flush_workqueue() to flush out any outstanding work. In the past, we've had to try and make sure that we didn't flush out any outstanding join completions because they also wanted to grab rtnl_lock() and that would deadlock. It turns out that the only thing in the join completion handler that needs this lock can be safely moved to our carrier_on_task, thereby reducing the potential for the join completion code and the flush code to deadlock against each other. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15  IB/ipoib: change init sequence ordering  (Doug Ledford)
In preparation for using per device work queues, we need to move the start of the neighbor thread task to after ipoib_ib_dev_init and move the destruction of the neighbor task to before ipoib_ib_dev_cleanup. Otherwise we will end up freeing our workqueue with work possibly still on it. Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-04-15  IB/ipoib: factor out ah flushing  (Doug Ledford)
Create ipoib_flush_ah and ipoib_stop_ah routines to use at appropriate times to flush out all remaining ah entries before we shut the device down. Because neighbors and mcast entries can each have a reference on any given ah, we must make sure to free all of those first before our ah will actually have a 0 refcount and be able to be reaped. This factoring is needed in preparation for having per-device work queues. The original per-device workqueue code resulted in the following error message: <ibdev>: ib_dealloc_pd failed That error was tracked down to this issue. With the changes to which workqueues were flushed when, there were no flushes of the per device workqueue after the last ah's were freed, resulting in an attempt to dealloc the pd with outstanding resources still allocated. This code puts the explicit flushes in the needed places to avoid that problem. Signed-off-by: Doug Ledford <dledford@redhat.com>
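For illustration, a sketch of the flush/stop helpers this describes, assuming a per-device workqueue, an IPOIB_STOP_REAPER flag, and a delayed ah_reap_task; treat the names as a sketch rather than the exact code:

    /* Sketch: make sure every queued ah reap has run before tearing the device down. */
    static void ipoib_flush_ah_sketch(struct net_device *dev)
    {
            struct ipoib_dev_priv *priv = netdev_priv(dev);

            flush_workqueue(priv->wq);              /* run any pending ah_reap work now */
    }

    static void ipoib_stop_ah_sketch(struct net_device *dev)
    {
            struct ipoib_dev_priv *priv = netdev_priv(dev);

            set_bit(IPOIB_STOP_REAPER, &priv->flags);  /* no more periodic reaping */
            cancel_delayed_work(&priv->ah_reap_task);
            ipoib_flush_ah_sketch(dev);                /* reap whatever is left */
    }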
2015-04-02  infiniband/ipoib: implement ndo_get_iflink  (Nicolas Dichtel)
Don't use dev->iflink anymore. CC: Roland Dreier <roland@kernel.org> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-30Revert "IPoIB: Consolidate rtnl_lock tasks in workqueue"Roland Dreier
This reverts commit afe1de664ef3cb756e70938d99417dcbc6b1379a. The series of IPoIB bug fixes that went into 3.19-rc1 introduce regressions, and after trying to sort things out, we decided to revert to 3.18's IPoIB driver and get things right for 3.20. Signed-off-by: Roland Dreier <roland@purestorage.com>
2015-01-30Revert "IPoIB: Make the carrier_on_task race aware"Roland Dreier
This reverts commit 67d7209e1f481cbaed37f9a224a328a3f83d0482. The series of IPoIB bug fixes that went into 3.19-rc1 introduce regressions, and after trying to sort things out, we decided to revert to 3.18's IPoIB driver and get things right for 3.20. Signed-off-by: Roland Dreier <roland@purestorage.com>
2015-01-30Revert "IPoIB: fix MCAST_FLAG_BUSY usage"Roland Dreier
This reverts commit 016d9fb25cd9817ea9c723f4f7ecd978636b4489. The series of IPoIB bug fixes that went into 3.19-rc1 introduce regressions, and after trying to sort things out, we decided to revert to 3.18's IPoIB driver and get things right for 3.20. Signed-off-by: Roland Dreier <roland@purestorage.com>
2015-01-30Revert "IPoIB: fix mcast_dev_flush/mcast_restart_task race"Roland Dreier
This reverts commit e5d1dcf1b0951f4ba00d93653942dda6196109d8. The series of IPoIB bug fixes that went into 3.19-rc1 introduce regressions, and after trying to sort things out, we decided to revert to 3.18's IPoIB driver and get things right for 3.20. Signed-off-by: Roland Dreier <roland@purestorage.com>
2015-01-30Revert "IPoIB: change init sequence ordering"Roland Dreier
This reverts commit 3bcce487fda8161597c20ed303d510e41ad7770e. The series of IPoIB bug fixes that went into 3.19-rc1 introduce regressions, and after trying to sort things out, we decided to revert to 3.18's IPoIB driver and get things right for 3.20. Signed-off-by: Roland Dreier <roland@purestorage.com>
2015-01-30Revert "IPoIB: Use dedicated workqueues per interface"Roland Dreier
This reverts commit 5141861cd5e17eac9676ff49c5abfafbea2b0e98. The series of IPoIB bug fixes that went into 3.19-rc1 introduce regressions, and after trying to sort things out, we decided to revert to 3.18's IPoIB driver and get things right for 3.20. Signed-off-by: Roland Dreier <roland@purestorage.com>
2015-01-30Revert "IPoIB: Make ipoib_mcast_stop_thread flush the workqueue"Roland Dreier
This reverts commit bb42a6dd02fb2901a69dbec2358810735b14b186. The series of IPoIB bug fixes that went into 3.19-rc1 introduce regressions, and after trying to sort things out, we decided to revert to 3.18's IPoIB driver and get things right for 3.20. Signed-off-by: Roland Dreier <roland@purestorage.com>
2015-01-30Revert "IPoIB: No longer use flush as a parameter"Roland Dreier
This reverts commit ce347ab90eaabc69a6146d41943981d51e7a9b82. The series of IPoIB bug fixes that went into 3.19-rc1 introduce regressions, and after trying to sort things out, we decided to revert to 3.18's IPoIB driver and get things right for 3.20. Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-12-15  IPoIB: No longer use flush as a parameter  (Doug Ledford)
Various places in the IPoIB code had a deadlock related to flushing the ipoib workqueue. Now that we have per device workqueues and a specific flush workqueue, there is no longer a deadlock issue with flushing the device specific workqueues and we can do so unilaterally. Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-12-15  IPoIB: Make ipoib_mcast_stop_thread flush the workqueue  (Doug Ledford)
We used to pass a flush variable to mcast_stop_thread to indicate if we should flush the workqueue or not. This was due to some code trying to flush a workqueue that it was currently running on, which is a no-no. Now that we have per-device work queues, and now that ipoib_mcast_restart_task has taken into account the fact that it is queued on a single-threaded workqueue with all of the ipoib_mcast_join_task's and therefore has no need to stop the join task while it runs, we can do away with the flush parameter and unilaterally flush always. Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-12-15  IPoIB: Use dedicated workqueues per interface  (Doug Ledford)
During my recent work on the rtnl lock deadlock in the IPoIB driver, I saw that even once I fixed the apparent races for a single device, as soon as that device had any children, new races popped up. It turns out that this is because, no matter how well we protect against races on a single device, all devices use the same workqueue, and flush_workqueue() flushes *everything* from that workqueue, so we can have one device in the middle of a down and holding the rtnl lock and another totally unrelated device needing to run mcast_restart_task, which wants the rtnl lock and will loop trying to take it unless it sees its own FLAG_ADMIN_UP flag go away. Because the unrelated interface will never see its own ADMIN_UP flag drop, the interface going down will deadlock trying to flush the queue. There are several possible solutions to this problem: Make carrier_on_task and mcast_restart_task try to take the rtnl for some set period of time and if they fail, then bail. This runs the real risk of dropping work on the floor, which can end up being its own separate kind of deadlock. Set some global flag in the driver that says some device is in the middle of going down, letting all tasks know to bail. Again, this can drop work on the floor. I suppose if our own ADMIN_UP flag doesn't go away, then maybe after a few tries on the rtnl lock we can queue our own task back up as a delayed work and return and avoid dropping work on the floor that way. But I'm not 100% convinced that we won't cause other problems. Or the method this patch attempts to use, which is when we bring an interface up, create a workqueue specifically for that interface, so that when we take it back down, we are flushing only those tasks associated with our interface. In addition, keep the global workqueue, but now limit it to only flush tasks. In this way, the flush tasks can always flush the device specific work queues without having deadlock issues. Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-12-15  IPoIB: change init sequence ordering  (Doug Ledford)
In preparation for using per device work queues, we need to move the start of the neighbor thread task to after ipoib_ib_dev_init and move the destruction of the neighbor task to before ipoib_ib_dev_cleanup. Otherwise we will end up freeing our workqueue with work possibly still on it. Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-12-15  IPoIB: fix mcast_dev_flush/mcast_restart_task race  (Doug Ledford)
Our mcast_dev_flush routine and our mcast_restart_task can race against each other. In particular, they both hold the priv->lock while manipulating the rbtree and while removing mcast entries from the multicast_list and while adding entries to the remove_list, but they also both drop their locks prior to doing the actual removes. The mcast_dev_flush routine is run entirely under the rtnl lock and so has at least some locking. The actual race condition is like this:
    Thread 1                                        Thread 2
    ifconfig ib0 up
      start multicast join for broadcast
      multicast join completes for broadcast
      start to add more multicast joins
        call mcast_restart_task to add new entries
                                                    ifconfig ib0 down
                                                    mcast_dev_flush
                                                      mcast_leave(mcast A)
        mcast_leave(mcast A)
As mcast_leave calls ib_sa_multicast_leave, and as member in core/multicast.c is ref counted, we run into an unbalanced refcount issue. To avoid stomping on each other's removes, take the rtnl lock specifically when we are deleting the entries from the remove list. Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-12-15  IPoIB: fix MCAST_FLAG_BUSY usage  (Doug Ledford)
Commit a9c8ba5884 ("IPoIB: Fix usage of uninitialized multicast objects") added a new flag MCAST_JOIN_STARTED, but was not very strict in how it was used. We didn't always initialize the completion struct before we set the flag, and we didn't always call complete on the completion struct from all paths that complete it. This made it less than totally effective, and certainly made its use confusing. And in the flush function we would use the presence of this flag to signal that we should wait on the completion struct, but we never cleared this flag, ever. This is further muddied by the fact that we overload the MCAST_FLAG_BUSY flag to mean two different things: we have a join in flight, and we have succeeded in getting an ib_sa_join_multicast. In order to make things clearer and aid in resolving the rtnl deadlock bug I've been chasing, I cleaned this up a bit.
 1) Remove the MCAST_JOIN_STARTED flag entirely
 2) Un-overload MCAST_FLAG_BUSY so it now only means a join is in-flight
 3) Test on mcast->mc directly to see if we have completed ib_sa_join_multicast (using IS_ERR_OR_NULL)
 4) Make sure that before setting MCAST_FLAG_BUSY we always initialize the mcast->done completion struct
 5) Make sure that before calling complete(&mcast->done), we always clear the MCAST_FLAG_BUSY bit
 6) Take the mcast_mutex before we call ib_sa_multicast_join and also take the mutex in our join callback. This forces ib_sa_multicast_join to return and set mcast->mc before we process the callback. This way, our callback can safely clear mcast->mc if there is an error on the join and we will do the right thing as a result in mcast_dev_flush.
 7) Because we need the mutex to synchronize mcast->mc, we can no longer call mcast_sendonly_join directly from mcast_send and instead must add sendonly join processing to the mcast_join_task
A number of different races are resolved with these changes. These races existed with the old MCAST_FLAG_BUSY usage, the MCAST_JOIN_STARTED flag was an attempt to address them, and while it helped, a determined effort could still trip things up. One race looks something like this:
    Thread 1                                        Thread 2
    ib_sa_join_multicast (as part of running restart mcast task)
      alloc member
      call callback
                                                    ifconfig ib0 down
                                                    wait_for_completion
      callback call completes
                                                    wait_for_completion in mcast_dev_flush completes
                                                      mcast->mc is PTR_ERR_OR_NULL so we skip ib_sa_leave_multicast
      return from callback
      return from ib_sa_join_multicast
    set mcast->mc = return from ib_sa_multicast
We now have a permanently unbalanced join/leave issue that trips up the refcounting in core/multicast.c Another like this:
    Thread 1                        Thread 2                        Thread 3
    ib_sa_multicast_join
                                                                    ifconfig ib0 down
                                                                    priv->broadcast = NULL
                                    join_complete
                                                                    wait_for_completion
                                    mcast->mc is not yet set, so don't clear
    return from ib_sa_join_multicast and set mcast->mc
                                    complete
                                    return -EAGAIN (making mcast->mc invalid)
                                                                    call ib_sa_multicast_leave on invalid mcast->mc, hang forever
By holding the mutex around ib_sa_multicast_join and taking the mutex early in the callback, we force mcast->mc to be valid at the time we run the callback. This allows us to clear mcast->mc if there is an error and the join is going to fail. We do this before we complete the mcast. In this way, mcast_dev_flush always sees consistent state in regards to mcast->mc membership at the time that the wait_for_completion() returns. Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-12-15  IPoIB: Make the carrier_on_task race aware  (Doug Ledford)
We blindly assume that we can just take the rtnl lock and that will prevent races with downing this interface. Unfortunately, that's not the case. In ipoib_mcast_stop_thread() we will call flush_workqueue() in an attempt to clear out all remaining instances of ipoib_join_task. But, since this task is put on the same workqueue as the join task, the flush_workqueue waits on this thread too. But this thread is deadlocked on the rtnl lock. The better thing here is to use trylock and loop on that until we either get the lock or we see that FLAG_ADMIN_UP has been cleared, in which case we don't need to do anything anyway and we just return. Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-12-15  IPoIB: Consolidate rtnl_lock tasks in workqueue  (Doug Ledford)
Setting the MTU can safely be moved to the carrier_on_task, which keeps us from needing to take the rtnl lock in the join_finish section. Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-10-07  net: better IFF_XMIT_DST_RELEASE support  (Eric Dumazet)
Testing xmit_more support with netperf and connected UDP sockets, I found strange dst refcount false sharing. Current handling of IFF_XMIT_DST_RELEASE is not optimal. Dropping dst in validate_xmit_skb() is certainly too late in case the packet was queued by cpu X but dequeued by cpu Y. The logical point to take care of drop/force is in __dev_queue_xmit() before even taking the qdisc lock. As Julian Anastasov pointed out, the need for skb_dst() might come from some packet schedulers or classifiers. This patch adds a new helper to cleanly express the needs of various drivers or qdiscs/classifiers. Drivers that need skb_dst() in their ndo_start_xmit() should replace the prior dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; in their setup with a call to the new helper: netif_keep_dst(dev); Instead of using a single bit, we use two bits, one being eventually rebuilt in bonding/team drivers. The other one is permanent and blocks IFF_XMIT_DST_RELEASE from being rebuilt in bonding/team. Eventually, we could add something smarter later. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-23  Merge tag 'rdma-for-linus' of ...  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband Pull infiniband/rdma fixes from Roland Dreier: "Last late set of InfiniBand/RDMA fixes for 3.17: - fixes for the new memory region re-registration support - iSER initiator error path fixes - grab bag of small fixes for the qib and ocrdma hardware drivers - larger set of fixes for mlx4, especially in RoCE mode" * tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (26 commits) IB/mlx4: Fix VF mac handling in RoCE IB/mlx4: Do not allow APM under RoCE IB/mlx4: Don't update QP1 in native mode IB/mlx4: Avoid accessing netdevice when building RoCE qp1 header mlx4: Fix mlx4 reg/unreg mac to work properly with 0-mac addresses IB/core: When marshaling uverbs path, clear unused fields IB/mlx4: Avoid executing gid task when device is being removed IB/mlx4: Fix lockdep splat for the iboe lock IB/mlx4: Get upper dev addresses as RoCE GIDs when port comes up IB/mlx4: Reorder steps in RoCE GID table initialization IB/mlx4: Don't duplicate the default RoCE GID IB/mlx4: Avoid null pointer dereference in mlx4_ib_scan_netdevs() IB/iser: Bump version to 1.4.1 IB/iser: Allow bind only when connection state is UP IB/iser: Fix RX/TX CQ resource leak on error flow RDMA/ocrdma: Use right macro in query AH RDMA/ocrdma: Resolve L2 address when creating user AH mlx4: Correct error flows in rereg_mr IB/qib: Correct reference counting in debugfs qp_stats IPoIB: Remove unnecessary port query ...
2014-09-22  ipoib: validate struct ipoib_cb size  (Eric Dumazet)
To catch future errors sooner. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
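A sketch of the kind of compile-time check this refers to, assuming ipoib_cb is the driver's private skb control-block structure; the accessor name is illustrative:

    /* Sketch: fail the build if ipoib_cb ever outgrows the 48-byte skb->cb area. */
    static inline struct ipoib_cb *ipoib_skb_cb_sketch(struct sk_buff *skb)
    {
            BUILD_BUG_ON(sizeof(struct ipoib_cb) > sizeof(skb->cb));
            return (struct ipoib_cb *)skb->cb;
    }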
2014-09-19  IPoIB: Remove unnecessary port query  (Alex Estrin)
There are two queries for port attributes, one after another. The second call is not needed since the port_attr structure already holds the data. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Alex Estrin <alex.estrin@intel.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-08-14  Merge tag 'rdma-for-linus' of ...  (Linus Torvalds)
git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband Pull infiniband/rdma updates from Roland Dreier: "Main set of InfiniBand/RDMA updates for 3.17 merge window: - MR reregistration support - MAD support for RMPP in userspace - iSER and SRP initiator updates - ocrdma hardware driver updates - other fixes..." * tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband: (52 commits) IB/srp: Fix return value check in srp_init_module() RDMA/ocrdma: report asic-id in query device RDMA/ocrdma: Update sli data structure for endianness RDMA/ocrdma: Obtain SL from device structure RDMA/uapi: Include socket.h in rdma_user_cm.h IB/srpt: Handle GID change events IB/mlx5: Use ARRAY_SIZE instead of sizeof/sizeof[0] IB/mlx4: Use ARRAY_SIZE instead of sizeof/sizeof[0] RDMA/amso1100: Check for integer overflow in c2_alloc_cq_buf() IPoIB: Remove unnecessary test for NULL before debugfs_remove() IB/mad: Add user space RMPP support IB/mad: add new ioctl to ABI to support new registration options IB/mad: Add dev_notice messages for various umad/mad registration failures IB/mad: Update module to [pr|dev]_* style print messages IB/ipoib: Avoid multicast join attempts with invalid P_key IB/umad: Update module to [pr|dev]_* style print messages IB/ipoib: Avoid flushing the workqueue from worker context IB/ipoib: Use P_Key change event instead of P_Key polling mechanism IB/ipath: Add P_Key change event support mlx4_core: Add support for secure-host and SMP firewall ...
2014-08-12  IPoIB: Remove unnecessary test for NULL before debugfs_remove()  (Fabian Frederick)
Fix checkpatch warning: WARNING: debugfs_remove(NULL) is safe this check is probably not required Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-08-10  IB/ipoib: Avoid multicast join attempts with invalid P_key  (Alex Estrin)
Currently, the parent interface keeps sending broadcast group join requests even if P_key index 0 is invalid, which is possible/common in virtualized environments where a VF has been probed into a VM but the actual P_key configuration has not yet been assigned by the management software. This creates unnecessary noise on the fabric and in the kernel logs: ib0: multicast join failed for ff12:401b:8000:0000:0000:0000:ffff:ffff, status -22 The original code ran the multicast task regardless of the actual P_key value, which can be avoided. The fix is to re-init resources and bring the interface up only if P_key index 0 is valid, either when starting up or on a PKEY_CHANGE event. Fixes: c290414169 ("IPoIB: Fix pkey change flow for virtualization environments") Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Alex Estrin <alex.estrin@intel.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
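For illustration, a sketch of gating bring-up on a valid P_Key at index 0, assuming the ib_query_pkey() verb and treating 0x0000 (ignoring the membership bit) as "not yet assigned":

    /* Sketch: only (re)init resources and bring the interface up if pkey index 0 is valid. */
    static bool ipoib_pkey_index0_valid(struct ipoib_dev_priv *priv)
    {
            u16 pkey = 0;

            if (ib_query_pkey(priv->ca, priv->port, 0, &pkey))
                    return false;

            return (pkey & 0x7fff) != 0;    /* 0x0000/0x8000 means no P_Key assigned yet */
    }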
2014-08-05  IB/ipoib: Avoid flushing the workqueue from worker context  (Erez Shitrit)
The error flow of ipoib_ib_dev_open() invokes ipoib_ib_dev_stop() with workqueue flushing enabled, which deadlocks if the open procedure itself was called by a worker thread. Fix this by adding a flush enabled flag to ipoib_ib_dev_open() and set it accordingly from the locations where such a call is made. The call trace was the following: [<ffffffff81095bc4>] ? flush_workqueue+0x54/0x80 [<ffffffffa056c657>] ? ipoib_ib_dev_stop+0x447/0x650 [ib_ipoib] [<ffffffffa056cc34>] ? ipoib_ib_dev_open+0x284/0x430 [ib_ipoib] [<ffffffffa05674a8>] ? ipoib_open+0x78/0x1d0 [ib_ipoib] [<ffffffffa05697b8>] ? ipoib_pkey_open+0x38/0x40 [ib_ipoib] [<ffffffffa056cf3c>] ? __ipoib_ib_dev_flush+0x15c/0x2c0 [ib_ipoib] [<ffffffffa056ce56>] ? __ipoib_ib_dev_flush+0x76/0x2c0 [ib_ipoib] [<ffffffffa056d0a0>] ? ipoib_ib_dev_flush_heavy+0x0/0x20 [ib_ipoib] [<ffffffffa056d0ba>] ? ipoib_ib_dev_flush_heavy+0x1a/0x20 [ib_ipoib] [<ffffffff81094d20>] ? worker_thread+0x170/0x2a0 [<ffffffff8109b2a0>] ? autoremove_wake_function+0x0/0x40 Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Acked-by: Alex Estrin <alex.estrin@intel.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-08-05  IB/ipoib: Use P_Key change event instead of P_Key polling mechanism  (Erez Shitrit)
The current code uses dedicated polling logic to determine when the P_Key assigned to the ipoib device is present in the HCA port table, and acts accordingly. Move to the code which acts upon getting a PKEY_CHANGE event to handle this task, and remove the P_Key polling logic/thread, as they add extra complexity which isn't needed. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Acked-by: Alex Estrin <alex.estrin@intel.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2014-07-15  net: set name_assign_type in alloc_netdev()  (Tom Gundersen)
Extend alloc_netdev{,_mq{,s}}() to take name_assign_type as argument, and convert all users to pass NET_NAME_UNKNOWN. Coccinelle patch: @@ expression sizeof_priv, name, setup, txqs, rxqs, count; @@ ( -alloc_netdev_mqs(sizeof_priv, name, setup, txqs, rxqs) +alloc_netdev_mqs(sizeof_priv, name, NET_NAME_UNKNOWN, setup, txqs, rxqs) | -alloc_netdev_mq(sizeof_priv, name, setup, count) +alloc_netdev_mq(sizeof_priv, name, NET_NAME_UNKNOWN, setup, count) | -alloc_netdev(sizeof_priv, name, setup) +alloc_netdev(sizeof_priv, name, NET_NAME_UNKNOWN, setup) ) v9: move comments here from the wrong commit Signed-off-by: Tom Gundersen <teg@jklm.no> Reviewed-by: David Herrmann <dh.herrmann@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-12  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next  (Linus Torvalds)
Pull networking updates from David Miller: 1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov. 2) Multiqueue support in xen-netback and xen-netfront, from Andrew J Benniston. 3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn Mork. 4) BPF now has a "random" opcode, from Chema Gonzalez. 5) Add more BPF documentation and improve test framework, from Daniel Borkmann. 6) Support TCP fastopen over ipv6, from Daniel Lee. 7) Add software TSO helper functions and use them to support software TSO in mvneta and mv643xx_eth drivers. From Ezequiel Garcia. 8) Support software TSO in fec driver too, from Nimrod Andy. 9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli. 10) Handle broadcasts more gracefully over macvlan when there are large numbers of interfaces configured, from Herbert Xu. 11) Allow more control over fwmark used for non-socket based responses, from Lorenzo Colitti. 12) Do TCP congestion window limiting based upon measurements, from Neal Cardwell. 13) Support busy polling in SCTP, from Neal Horman. 14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru. 15) Bridge promisc mode handling improvements from Vlad Yasevich. 16) Don't use inetpeer entries to implement ID generation any more, it performs poorly, from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits) rtnetlink: fix userspace API breakage for iproute2 < v3.9.0 tcp: fixing TLP's FIN recovery net: fec: Add software TSO support net: fec: Add Scatter/gather support net: fec: Increase buffer descriptor entry number net: fec: Factorize feature setting net: fec: Enable IP header hardware checksum net: fec: Factorize the .xmit transmit function bridge: fix compile error when compiling without IPv6 support bridge: fix smatch warning / potential null pointer dereference via-rhine: fix full-duplex with autoneg disable bnx2x: Enlarge the dorq threshold for VFs bnx2x: Check for UNDI in uncommon branch bnx2x: Fix 1G-baseT link bnx2x: Fix link for KR with swapped polarity lane sctp: Fix sk_ack_backlog wrap-around problem net/core: Add VF link state control policy net/fsl: xgmac_mdio is dependent on OF_MDIO net/fsl: Make xgmac_mdio read error message useful net_sched: drr: warn when qdisc is not work conserving ...
2014-06-02  IB: Add a QP creation flag to use GFP_NOIO allocations  (Or Gerlitz)
This addresses a problem where NFS client writes over IPoIB connected mode may deadlock on memory allocation/writeback. The problem is not directly memory reclamation. There is an indirect dependency between network filesystems writing back pages and ipoib_cm_tx_init() due to how a kworker is used. Page reclaim cannot make forward progress until ipoib_cm_tx_init() succeeds and it is stuck in page reclaim itself waiting for network transmission. Ordinarily this situation may be avoided by having the caller use GFP_NOFS but ipoib_cm_tx_init() does not have that information. To address this, take a general approach and add a new QP creation flag that tells the low-level hardware driver to use GFP_NOIO for the memory allocations related to the new QP. Use the new flag in the ipoib connected mode path, and if the driver doesn't support it, re-issue the QP creation without the flag. Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
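A sketch of the fallback described above, assuming the IB_QP_CREATE_USE_GFP_NOIO creation flag introduced here and the standard ib_create_qp(); the rest of the attribute setup is elided:

    /* Sketch: ask for GFP_NOIO allocations, retry without the flag if the HW driver rejects it. */
    static struct ib_qp *ipoib_cm_create_tx_qp_sketch(struct ib_pd *pd,
                                                      struct ib_qp_init_attr *attr)
    {
            struct ib_qp *qp;

            attr->create_flags |= IB_QP_CREATE_USE_GFP_NOIO;
            qp = ib_create_qp(pd, attr);
            if (PTR_ERR(qp) == -EINVAL) {
                    /* Low-level driver doesn't understand the flag; fall back to GFP_KERNEL. */
                    attr->create_flags &= ~IB_QP_CREATE_USE_GFP_NOIO;
                    qp = ib_create_qp(pd, attr);
            }
            return qp;
    }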
2014-05-13  net: get rid of SET_ETHTOOL_OPS  (Wilfried Klaebe)
Dave Miller mentioned he'd like to see SET_ETHTOOL_OPS gone. This does that. Mostly done via coccinelle script: @@ struct ethtool_ops *ops; struct net_device *dev; @@ - SET_ETHTOOL_OPS(dev, ops); + dev->ethtool_ops = ops; Compile tested only, but I'd seriously wonder if this broke anything. Suggested-by: Dave Miller <davem@davemloft.net> Signed-off-by: Wilfried Klaebe <w-lkml@lebenslange-mailadresse.de> Acked-by: Felipe Balbi <balbi@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-22  Merge branches 'cma', 'cxgb4', 'flowsteer', 'ipoib', 'misc', 'mlx4', 'mlx5', 'ocrdma', 'qib', 'srp' and 'usnic' into for-next  (Roland Dreier)
2014-01-22  IPoIB: Report operstate consistently when brought up without a link  (Michal Schmidt)
After booting without a working link, "ip link" shows: 5: mlx4_ib1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 2044 qdisc pfifo_fast state DOWN qlen 256 ... 7: mlx4_ib1.8003@mlx4_ib1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 2044 qdisc pfifo_fast state DOWN qlen 256 ... Then after connecting and disconnecting the link, which should result in exactly the same state as before, it shows: 5: mlx4_ib1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 2044 qdisc pfifo_fast state DOWN qlen 256 ... 7: mlx4_ib1.8003@mlx4_ib1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 2044 qdisc pfifo_fast state LOWERLAYERDOWN qlen 256 ... Notice the (now correct) LOWERLAYERDOWN operstate shown for the mlx4_ib1.8003 interface. Ideally the identical state would be shown right after boot. The problem is related to the calling of netif_carrier_off() in network drivers. For a long time it was known that doing netif_carrier_off() before registering the netdevice would result in the interface's operstate being shown as UNKNOWN if the device was brought up without a working link. This problem was fixed in commit 8f4cccbbd92 ('net: Set device operstate at registration time'), but still there remains the minor inconsistency demonstrated above. This patch fixes it by moving ipoib's call to netif_carrier_off() into the .ndo_open method, which is where network drivers ordinarily do it. With the patch when doing the same test as above, the operstate of mlx4_ib1.8003 is shown as LOWERLAYERDOWN right after boot. Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Acked-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
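A minimal sketch of moving the carrier-off call into the .ndo_open method, as the patch describes; the surrounding open logic is elided and the flag name follows the driver's existing conventions:

    /* Sketch: declare "no carrier" at open time, before the broadcast join can raise it. */
    static int ipoib_open_sketch(struct net_device *dev)
    {
            struct ipoib_dev_priv *priv = netdev_priv(dev);

            netif_carrier_off(dev);         /* operstate now shows the missing link correctly */
            set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);

            /* ... bring up IB resources and start the multicast join here ... */
            return 0;
    }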
2014-01-14  IB/core: Add flow steering support for IPoIB UD traffic  (Matan Barak)
When creating an IPoIB UD QP, provide a hint to the low-level driver that the QP should support flow steering. This means that privileged user space applications can steer TCP/IP IPoIB traffic from the network stack, in a manner similar to what is done with Ethernet RAW_PACKET QPs. The hint is provided through a new QP creation flag called NETIF_QP. Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
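A sketch of passing that hint at UD QP creation time, assuming the IB_QP_CREATE_NETIF_QP flag added by this change; the capability check shown is a hypothetical placeholder:

    /* Sketch: mark the UD QP as a network-interface QP so its traffic can be flow-steered. */
    struct ib_qp_init_attr init_attr = {
            .qp_type = IB_QPT_UD,
            .send_cq = priv->send_cq,
            .recv_cq = priv->recv_cq,
            /* ... capability sizes elided ... */
    };

    if (hca_supports_flow_steering)         /* hypothetical capability check */
            init_attr.create_flags |= IB_QP_CREATE_NETIF_QP;

    priv->qp = ib_create_qp(priv->pd, &init_attr);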