path: root/net
Age Commit message Author
2019-07-18SUNRPC: Optimise transport balancing codeTrond Myklebust
Moves the balancing code to avoid doing cursor changes on every search iteration. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-18SUNRPC: Ensure the bvecs are reset when we re-encode the RPC requestTrond Myklebust
The bvec tracks the list of pages, so if the number of pages changes due to a re-encode, we need to reset the bvec as well. Fixes: 277e4ab7d530 ("SUNRPC: Simplify TCP receive code by switching...") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: stable@vger.kernel.org # v4.20+
2019-07-18SUNRPC: Fix up backchannel slot table accountingTrond Myklebust
Add a per-transport maximum limit in the socket case, and add helpers to allow the NFSv4 code to discover that limit. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-18SUNRPC: Fix initialisation of struct rpc_xprt_switchTrond Myklebust
Ensure that we do initialise the fields xps_nactive, xps_queuelen and xps_net. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-16SUNRPC: Skip zero-refcount transportsTrond Myklebust
When looking for the next transport to use for an RPC call, skip those that are in the process of being destroyed and that have a zero refcount. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
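The skip described above can be sketched in userspace C11. This is a minimal illustration, not the kernel code: `struct transport`, `transport_get_unless_zero()` and `pick_next()` are hypothetical names, and C11 atomics stand in for the kernel's reference-counting helpers.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a transport with a reference count. */
struct transport {
	atomic_int refcount;
};

/* Take a reference only if the count is still non-zero. */
static bool transport_get_unless_zero(struct transport *t)
{
	int old = atomic_load(&t->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&t->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Walk the array round-robin from start, skipping dying transports. */
static struct transport *pick_next(struct transport *list, size_t n, size_t start)
{
	for (size_t i = 0; i < n; i++) {
		struct transport *t = &list[(start + i) % n];

		if (transport_get_unless_zero(t))
			return t;
	}
	return NULL;
}
```

A transport whose count has already dropped to zero is never handed out, so a caller cannot revive an object that is being torn down.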
2019-07-16SUNRPC: Replace division by multiplication in calculation of queue lengthTrond Myklebust
When checking whether or not a particular xprt queue length is shorter than the average queue length for all xprts, prefer to use multiplication rather than division for performance reasons. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
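As a rough illustration of the arithmetic, the test "queue length < total / nactive" can be rewritten as "queue length * nactive < total". The function and parameter names below are assumptions made for this sketch, not the identifiers used in the patch, and overflow of the product is ignored.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: is this transport's queue shorter than the
 * average queue length across all active transports?
 *
 * Division form:        xprt_queuelen < total_queuelen / nactive
 * Multiplication form:  xprt_queuelen * nactive < total_queuelen
 */
static bool queue_shorter_than_average(unsigned long xprt_queuelen,
				       unsigned long total_queuelen,
				       unsigned long nactive)
{
	return xprt_queuelen * nactive < total_queuelen;
}
```

The two forms are not bit-for-bit identical: integer division truncates the average, while the multiplication form compares against the exact average. It also needs no guard against a zero divisor.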
2019-07-12SUNRPC: Fix transport accounting when caller specifies an rpc_xprtTrond Myklebust
Ensure that we do the required accounting for the round robin queue when the caller to rpc_init_task() has passed in a transport to be used. Reported-by: Olga Kornievskaia <aglo@umich.edu> Reported-by: Neil Brown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-12Merge tag 'nfs-rdma-for-5.3-1' of git://git.linux-nfs.org/projects/anna/linux-nfsTrond Myklebust
NFSoRDMA client updates for 5.3

New features:
- Add a way to place MRs back on the free list
- Reduce context switching
- Add new trace events

Bugfixes and cleanups:
- Fix a BUG when tracing is enabled with NFSv4.1
- Fix a use-after-free in rpcrdma_post_recvs
- Replace use of xdr_stream_pos in rpcrdma_marshal_req
- Fix occasional transport deadlock
- Fix show_nfs_errors macros, other tracing improvements
- Remove RPCRDMA_REQ_F_PENDING and fr_state
- Various simplifications and refactors
2019-07-09xprtrdma: Modernize ops->connectChuck Lever
Adapt and apply changes that were made to the TCP socket connect code. See the following commits for details on the purpose of these changes: Commit 7196dbb02ea0 ("SUNRPC: Allow changing of the TCP timeout parameters on the fly") Commit 3851f1cdb2b8 ("SUNRPC: Limit the reconnect backoff timer to the max RPC message timeout") Commit 02910177aede ("SUNRPC: Fix reconnection timeouts") Some common transport code is moved to xprt.c to satisfy the code duplication police. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09xprtrdma: Remove rpcrdma_req::rl_bufferChuck Lever
Clean up. There is only one remaining function, rpcrdma_buffer_put(), that uses this field. Its caller can supply a pointer to the correct rpcrdma_buffer, enabling the removal of an 8-byte pointer field from a frequently-allocated shared data structure. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09xprtrdma: Refactor chunk encodingChuck Lever
Clean up. Move the "not present" case into the individual chunk encoders. This improves code organization and readability. The reason for the original organization was to optimize for the case where there are no chunks. The optimization turned out to be inconsequential, so let's err on the side of code readability. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09xprtrdma: Streamline rpcrdma_post_recvsChuck Lever
rb_lock is contended between rpcrdma_buffer_create, rpcrdma_buffer_put, and rpcrdma_post_recvs. Commit e340c2d6ef2a ("xprtrdma: Reduce the doorbell rate (Receive)") causes rpcrdma_post_recvs to take the rb_lock repeatedly when it determines more Receives are needed. Streamline this code path so it takes the lock just once in most cases to build the Receive chain that is about to be posted. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09xprtrdma: Simplify rpcrdma_rep_createChuck Lever
Clean up. Commit 7c8d9e7c8863 ("xprtrdma: Move Receive posting to Receive handler") reduced the number of rpcrdma_rep_create call sites to one. After that commit, the backchannel code no longer invokes it. Therefore the free list logic added by commit d698c4a02ee0 ("xprtrdma: Fix backchannel allocation of extra rpcrdma_reps") is no longer necessary, and in fact adds some extra overhead that we can do without. Simply post any newly created reps. They will get added back to the rb_recv_bufs list when they subsequently complete. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09xprtrdma: Wake RPCs directly in rpcrdma_wc_send pathChuck Lever
Eliminate a context switch in the path that handles RPC wake-ups when a Receive completion has to wait for a Send completion. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09xprtrdma: Reduce context switching due to Local InvalidationChuck Lever
Since commit ba69cd122ece ("xprtrdma: Remove support for FMR memory registration"), FRWR is the only supported memory registration mode. We can take advantage of the asynchronous nature of FRWR's LOCAL_INV Work Requests to get rid of the completion wait by having the LOCAL_INV completion handler take care of DMA unmapping MRs and waking the upper layer RPC waiter. This eliminates two context switches when local invalidation is necessary. As a side benefit, we will no longer need the per-xprt deferred completion work queue. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09xprtrdma: Add mechanism to place MRs back on the free listChuck Lever
When a marshal operation fails, any MRs that were already set up for that request are recycled. Recycling releases MRs and creates new ones, which is expensive. Since commit f2877623082b ("xprtrdma: Chain Send to FastReg WRs") was merged, recycling FRWRs is unnecessary. This is because before that commit, frwr_map had already posted FAST_REG Work Requests, so ownership of the MRs had already been passed to the NIC and thus dealing with them had to be delayed until they completed. Since that commit, however, FAST_REG WRs are posted at the same time as the Send WR. This means that if marshaling fails, we are certain the MRs are safe to simply unmap and place back on the free list because neither the Send nor the FAST_REG WRs have been posted yet. The kernel still has ownership of the MRs at this point. This reduces the total number of MRs that the xprt has to create under heavy workloads and makes the marshaling logic less brittle. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09xprtrdma: Remove fr_stateChuck Lever
Now that both the Send and Receive completions are handled in process context, it is safe to DMA unmap and return MRs to the free or recycle lists directly in the completion handlers. Doing this means rpcrdma_frwr no longer needs to track the state of each MR, meaning that a VALID or FLUSHED MR can no longer appear on an xprt's MR free list. Thus there is no longer a need to track the MR's registration state in rpcrdma_frwr. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09xprtrdma: Remove the RPCRDMA_REQ_F_PENDING flagChuck Lever
Commit 9590d083c1bb ("xprtrdma: Use xprt_pin_rqst in rpcrdma_reply_handler") pins incoming RPC/RDMA replies so they can be left in the pending requests queue while they are being processed without introducing a race between ->buf_free and the transport's reply handler. Therefore RPCRDMA_REQ_F_PENDING is no longer necessary. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09xprtrdma: Fix occasional transport deadlockChuck Lever
Under high I/O workloads, I've noticed that an RPC/RDMA transport occasionally deadlocks (IOPS goes to zero, and doesn't recover). Diagnosis shows that the sendctx queue is empty, but when sendctxs are returned to the queue, the xprt_write_space wake-up never occurs. The wake-up logic in rpcrdma_sendctx_put_locked is racy. I noticed that both EMPTY_SCQ and XPRT_WRITE_SPACE are implemented via an atomic bit. Just one of those is sufficient. Removing EMPTY_SCQ in favor of the generic bit mechanism makes the deadlock un-reproducible. Without EMPTY_SCQ, rpcrdma_buffer::rb_flags is no longer used and is therefore removed. Unfortunately this patch does not apply cleanly to stable. If needed, someone will have to port it and test it. Fixes: 2fad659209d5 ("xprtrdma: Wait on empty sendctx queue") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-09xprtrdma: Replace use of xdr_stream_pos in rpcrdma_marshal_reqChuck Lever
This is a latent bug. xdr_stream_pos works by subtracting xdr_stream::nwords from xdr_buf::len. But xdr_stream::nwords is not initialized by xdr_init_encode(). It works today only because all fields in rpcrdma_req::rl_stream are initialized to zero by rpcrdma_req_create, making the subtraction in xdr_stream_pos always a no-op. I found this issue via code inspection. It was introduced by commit 39f4cd9e9982 ("xprtrdma: Harden chunk list encoding against send buffer overflow"), but the code has changed enough since then that this fix can't be automatically applied to stable. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-07-06SUNRPC: Fix possible autodisconnect during connect due to old last_usedDave Wysochanski
Ensure last_used is updated before calling mod_timer inside xprt_schedule_autodisconnect. This avoids a possible xprt_autoclose firing immediately after a successful connect when xprt_unlock_connect calls xprt_schedule_autodisconnect with an old value of last_used. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06SUNRPC: Drop redundant CONFIG_ from CONFIG_SUNRPC_DISABLE_INSECURE_ENCTYPESAnna Schumaker
The "CONFIG_" portion is added automatically, so this was being expanded into "CONFIG_CONFIG_SUNRPC_DISABLE_INSECURE_ENCTYPES" Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06Merge branch 'multipath_tcp'Trond Myklebust
2019-07-06SUNRPC: Remove warning in debugfs.c when compiling with W=1Trond Myklebust
Remove the following warning: net/sunrpc/debugfs.c:13: warning: cannot understand function prototype: 'struct dentry *topdir; Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06Merge branch 'bh-remove'Trond Myklebust
2019-07-06SUNRPC: add links for all client xprts to debugfsNeilBrown
Now that a client can have multiple xprts, we need to add them all to debugfs. The first one is still "xprt". Subsequent xprts are "xprt1", "xprt2", etc. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06SUNRPC: Count ops completing with tk_status < 0Dave Wysochanski
We often see various error conditions with NFS4.x that show up with a very high operation count all completing with tk_status < 0 in a short period of time. Add a count to rpc_iostats to record on a per-op basis the ops that complete in this manner, which will enable lower overhead diagnostics. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06SUNRPC: enhance rpc_clnt_show_stats() to report on all xprts.NeilBrown
Now that a client can have multiple xprts, we need to report the statistics for all of them. Reported-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06SUNRPC: Use proper printk specifiers for unsigned long longDave Wysochanski
Update the printk specifiers inside _print_rpc_iostats to avoid a checkpatch warning. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06SUNRPC: Move call to rpc_count_iostats before rpc_call_doneDave Wysochanski
For diagnostic purposes, it would be useful to have an rpc_iostats metric of RPCs completing with tk_status < 0. Unfortunately, tk_status is reset inside the rpc_call_done functions for each operation, and the call to tally the per-op metrics comes after rpc_call_done. Move the call to rpc_count_iostats earlier in rpc_exit_task so we can count these RPCs completing in error. Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06NFS: send state management on a single connection.NeilBrown
With NFSv4.1, different network connections need to be explicitly bound to a session. During session startup, this is not possible, so only a single connection must be used for session startup. So add a task flag to disable the default round-robin choice of connections (when nconnect > 1) and force the use of a single connection. Then use that flag on all requests for session management. For consistency, include NFSv4.0 management (SETCLIENTID) and session destruction. Reported-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06SUNRPC: Allow creation of RPC clients with multiple connectionsTrond Myklebust
Add an argument to struct rpc_create_args that allows the specification of how many transport connections you want to set up to the server. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2019-07-06SUNRPC: Remove the bh-safe lock requirement on the rpc_wait_queue->lockTrond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06SUNRPC: Add basic load balancing to the transport switchTrond Myklebust
For now, just count the queue length. It is less accurate than counting number of bytes queued, but easier to implement. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2019-07-06SUNRPC: Remove the bh-safe lock requirement on xprt->transport_lockTrond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06SUNRPC: Replace direct task wakeups from softirq contextTrond Myklebust
Replace the direct task wakeups from inside a softirq context with wakeups from a process context. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-06SUNRPC: Replace the queue timer with a delayed work functionTrond Myklebust
The queue timer function, which walks the RPC queue in order to locate candidates for waking up is one of the current constraints against removing the bh-safe queue spin locks. Replace it with a delayed work queue, so that we can do the actual rpc task wake ups from an ordinary process context. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2019-07-05Merge tag 'nfsd-5.2-2' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd fixes from Bruce Fields:
"Two more quick bugfixes for nfsd: fixing a regression causing mount failures on high-memory machines and fixing the DRC over RDMA"

* tag 'nfsd-5.2-2' of git://linux-nfs.org/~bfields/linux:
  nfsd: Fix overflow causing non-working mounts on 1 TB machines
  svcrdma: Ignore source port when computing DRC hash
2019-07-03Bluetooth: Fix faulty expression for minimum encryption key size checkMatias Karhumaa
Fix minimum encryption key size check so that HCI_MIN_ENC_KEY_SIZE is also allowed as stated in the comment. This bug caused connection problems with devices having maximum encryption key size of 7 octets (56-bit). Fixes: 693cd8ce3f88 ("Bluetooth: Fix regression with minimum encryption key size alignment") Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203997 Signed-off-by: Matias Karhumaa <matias.karhumaa@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-02xprtrdma: Fix use-after-free in rpcrdma_post_recvsChuck Lever
Dereference wr->next /before/ the memory backing wr has been released. This issue was found by code inspection. It is not expected to be a significant problem because it is in an error path that is almost never executed. Fixes: 7c8d9e7c8863 ("xprtrdma: Move Receive posting to ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
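The ordering issue can be shown with a generic singly linked chain. This is only a sketch of the pattern: `struct recv_wr` and `release_chain()` are hypothetical names, and `free()` stands in for whatever releases the memory backing each work request.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a chained work request. */
struct recv_wr {
	struct recv_wr *next;
};

/* Release every element of the chain; returns how many were freed. */
static size_t release_chain(struct recv_wr *wr)
{
	size_t freed = 0;

	while (wr) {
		/* Read the link first: wr must not be touched after free(). */
		struct recv_wr *next = wr->next;

		free(wr);
		wr = next;
		freed++;
	}
	return freed;
}
```

Writing `free(wr); wr = wr->next;` instead would read freed memory, which is the class of bug the commit describes.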
2019-06-29Merge tag 'nfs-for-5.2-4' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull two more NFS client fixes from Anna Schumaker:
"These are both stable fixes. One to calculate the correct client message length in the case of partial transmissions. And the other to set the proper TCP timeout for flexfiles"

* tag 'nfs-for-5.2-4' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS/flexfiles: Use the correct TCP timeout for flexfiles I/O
  SUNRPC: Fix up calculation of client message length
2019-06-28SUNRPC: Fix up calculation of client message lengthTrond Myklebust
In the case where a record marker was used, xs_sendpages() needs to return the length of the payload + record marker so that we operate correctly in the case of a partial transmission. When the callers check return value, they therefore need to take into account the record marker length. Fixes: 06b5fc3ad94e ("Merge tag 'nfs-rdma-for-5.1-1'...") Cc: stable@vger.kernel.org # 5.1+ Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
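To make the accounting concrete: over a stream transport such as TCP, each RPC record fragment is preceded by a 4-byte record marker whose top bit flags the last fragment and whose low 31 bits carry the fragment length. The sketch below assumes hypothetical helper names (`make_record_marker()`, `send_complete()`) and is not the xs_sendpages() code itself.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RECORD_MARKER_LEN 4u	/* bytes occupied by the record marker */

/* Build a record marker: top bit = last fragment, low 31 bits = length. */
static uint32_t make_record_marker(uint32_t frag_len, bool last_fragment)
{
	return (last_fragment ? 0x80000000u : 0u) | (frag_len & 0x7fffffffu);
}

/*
 * Decide whether a message has been fully transmitted. When a marker
 * was sent, the byte count reported by the send path includes it, so
 * the target is payload plus marker, not payload alone.
 */
static bool send_complete(size_t total_sent, size_t payload_len, bool has_marker)
{
	size_t msg_len = payload_len + (has_marker ? RECORD_MARKER_LEN : 0u);

	return total_sent >= msg_len;
}
```

Comparing the sent byte count against the payload length alone would declare a message complete while its last 4 bytes were still unsent.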
2019-06-28Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller:

1) Fix ppp_mppe crypto soft dependencies, from Takashi Iwai.
2) Fix TX completion to be finite, from Sergej Benilov.
3) Use register_pernet_device to avoid a dst leak in tipc, from Xin Long.
4) Double free on TX cleanup in net/tls, from Dirk van der Merwe.
5) Memory leak in packet_set_ring(), from Eric Dumazet.
6) Out of bounds read in qmi_wwan, from Bjørn Mork.
7) Fix iif used in mcast/bcast looped back packets, from Stephen Suryaputra.
8) Fix neighbour resolution on raw ipv6 sockets, from Nicolas Dichtel.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (25 commits)
  af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET
  sctp: change to hold sk after auth shkey is created successfully
  ipv6: fix neighbour resolution with raw socket
  ipv6: constify rt6_nexthop()
  net: dsa: microchip: Use gpiod_set_value_cansleep()
  net: aquantia: fix vlans not working over bridged network
  ipv4: reset rt_iif for recirculated mcast/bcast out pkts
  team: Always enable vlan tx offload
  net/smc: Fix error path in smc_init
  net/smc: hold conns_lock before calling smc_lgr_register_conn()
  bonding: Always enable vlan tx offload
  net/ipv6: Fix misuse of proc_dointvec "skip_notify_on_dev_down"
  ipv4: Use return value of inet_iif() for __raw_v4_lookup in the while loop
  qmi_wwan: Fix out-of-bounds read
  tipc: check msg->req data len in tipc_nl_compat_bearer_disable
  net: macb: do not copy the mac address if NULL
  net/packet: fix memory leak in packet_set_ring()
  net/tls: fix page double free on TX cleanup
  net/sched: cbs: Fix error path of cbs_module_init
  tipc: change to use register_pernet_device
  ...
2019-06-26af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKETNeil Horman
When an application is run that:
a) Sets its scheduler to be SCHED_FIFO and
b) Opens a memory mapped AF_PACKET socket, and sends frames with the MSG_DONTWAIT flag cleared,
it's possible for the application to hang forever in the kernel.

This occurs because when waiting, the code in tpacket_snd calls schedule, which under normal circumstances allows other tasks to run, including ksoftirqd, which in some cases is responsible for freeing the transmitted skb (which in AF_PACKET calls a destructor that flips the status bit of the transmitted frame back to available, allowing the transmitting task to complete). However, when the calling application is SCHED_FIFO, its priority is such that the schedule call immediately places the task back on the cpu, preventing ksoftirqd from freeing the skb, which in turn prevents the transmitting task from detecting that the transmission is complete.

We can fix this by converting the schedule call to a completion mechanism. By using a completion queue, we force the calling task, when it detects there are no more frames to send, to schedule itself off the cpu until such time as the last transmitted skb is freed, allowing forward progress to be made.

Tested by myself and the reporter, with good results.

Change Notes:

V1->V2:
Enhance the sleep logic to support being interruptible and to honor SK_SNDTIMEO (Willem de Bruijn)

V2->V3:
Rearrange the point at which we wait for the completion queue, to avoid needing to check for ph/skb being null at the end of the loop. Also move the complete call to the skb destructor to avoid needing to modify __packet_set_status. Also gate calling complete on packet_read_pending returning zero to avoid multiple calls to complete. (Willem de Bruijn)
Move timeo computation within loop, to re-fetch the socket timeout since we also use the timeo variable to record the return code from the wait_for_completion call (Neil Horman)

V3->V4:
Willem has requested that the control flow be restored to the previous state. Doing so lets us eliminate the need for the po->wait_on_complete flag variable, and lets us get rid of the packet_next_frame function, but introduces another complexity. Specifically, because we use the packet pending count, if an application calls sendmsg multiple times with MSG_DONTWAIT set, each set of transmitted frames, when complete, will cause tpacket_destruct_skb to issue a complete call for which there will never be a matching wait_for_completion call. This imbalance will cause any future call to wait_for_completion here to return early, when the frames it sent may not have completed. To correct this, we need to re-init the completion queue on every call to tpacket_snd before we enter the loop, so as to ensure we wait properly for the frames we send in this iteration.
Change the timeout and interrupted gotos to out_put rather than out_status so that we don't try to free a non-existent skb.
Clean up some extra newlines (Willem de Bruijn)

Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
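The imbalance described in the V3->V4 note can be modelled with a plain counter. This is a single-threaded sketch only: it does not block, and `struct completion_model` and its helpers are hypothetical names that imitate just the "done" count behind a completion.

```c
#include <stdbool.h>

/* Single-threaded model of a completion's "done" counter. */
struct completion_model {
	unsigned int done;
};

/* (Re)initialise: forget any completions nobody waited for. */
static void model_init(struct completion_model *c)
{
	c->done = 0;
}

/* Signal one completion. */
static void model_complete(struct completion_model *c)
{
	c->done++;
}

/* True if a waiter would return immediately; consumes one count. */
static bool model_try_wait(struct completion_model *c)
{
	if (c->done == 0)
		return false;
	c->done--;
	return true;
}
```

A stale `model_complete()` left over from an earlier non-blocking send makes the next `model_try_wait()` succeed at once. Calling `model_init()` before each send loop discards the stale count, which is the role the re-init plays in the commit.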
2019-06-26sctp: change to hold sk after auth shkey is created successfullyXin Long
Now in sctp_endpoint_init(), it holds the sk then creates auth shkey. But when the creation fails, it doesn't release the sk, which causes a sk refcnt leak. Fix it by only holding the sk when auth shkey is created successfully. Fixes: a29a5bd4f5c3 ("[SCTP]: Implement SCTP-AUTH initializations.") Reported-by: syzbot+afabda3890cc2f765041@syzkaller.appspotmail.com Reported-by: syzbot+276ca1c77a19977c0130@syzkaller.appspotmail.com Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-26ipv6: fix neighbour resolution with raw socketNicolas Dichtel
The scenario is the following: the user uses a raw socket to send an ipv6 packet, destined to a not-connected network, and specifies a connected nh. Here is the corresponding python script to reproduce this scenario:

  import socket
  IPPROTO_RAW = 255
  send_s = socket.socket(socket.AF_INET6, socket.SOCK_RAW, IPPROTO_RAW)
  # scapy
  # p = IPv6(src='fd00:100::1', dst='fd00:200::fa')/ICMPv6EchoRequest()
  # str(p)
  req = b'`\x00\x00\x00\x00\x08:@\xfd\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xfd\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xfa\x80\x00\x81\xc0\x00\x00\x00\x00'
  send_s.sendto(req, ('fd00:175::2', 0, 0, 0))

fd00:175::/64 is a connected route and fd00:200::fa is not a connected host.

With this scenario, the kernel starts by sending a NS to resolve fd00:175::2. When it receives the NA, it flushes its queue and tries to send the initial packet. But instead of sending it, it sends another NS to resolve fd00:200::fa, which obviously fails, thus the packet is dropped. If the user sends the packet again, it now uses the right nh (fd00:175::2).

The problem is that ip6_dst_lookup_neigh() uses the rt6i_gateway, which is :: because the associated route is a connected route, thus it uses the dst addr of the packet. Let's use rt6_nexthop() to choose the right nh.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-26ipv6: constify rt6_nexthop()Nicolas Dichtel
There is no functional change in this patch, it only prepares the next one. rt6_nexthop() will be used by ip6_dst_lookup_neigh(), which uses const variables. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-26ipv4: reset rt_iif for recirculated mcast/bcast out pktsStephen Suryaputra
Multicast or broadcast egress packets have rt_iif set to the oif. These packets might be recirculated back as input and lookup to the raw sockets may fail because they are bound to the incoming interface (skb_iif). If rt_iif is not zero, during the lookup, inet_iif() function returns rt_iif instead of skb_iif. Hence, the lookup fails. v2: Make it non vrf specific (David Ahern). Reword the changelog to reflect it. Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-26net/smc: Fix error path in smc_initYueHaibing
If register_pernet_subsys succeeds in smc_init, we should clean it up if any later step fails. Fixes: 64e28b52c7a6 ("net/smc: add pnet table namespace support") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
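The general shape of such an error path, where each later failure unwinds everything registered before it, can be sketched as follows. The step functions and the `registered` counter are hypothetical and exist only to make the unwinding observable; they are not the smc_init code.

```c
#include <stdbool.h>

/* Number of hypothetical subsystems currently registered. */
static int registered;

static int register_step(bool fail)
{
	if (fail)
		return -1;
	registered++;
	return 0;
}

static void unregister_step(void)
{
	registered--;
}

/* fail_at selects which step (1..3) fails; 0 means all succeed. */
static int demo_init(int fail_at)
{
	int rc;

	rc = register_step(fail_at == 1);
	if (rc)
		return rc;
	rc = register_step(fail_at == 2);
	if (rc)
		goto out_first;
	rc = register_step(fail_at == 3);
	if (rc)
		goto out_second;
	return 0;

out_second:
	unregister_step();	/* undo step 2 */
out_first:
	unregister_step();	/* undo step 1 */
	return rc;
}
```

Labels are ordered so that jumping to a later failure point falls through every earlier cleanup, leaving nothing registered when the init function reports an error.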
2019-06-26net/smc: hold conns_lock before calling smc_lgr_register_conn()Huaping Zhou
After smc_lgr_create(), the newly created link group is added to smc_lgr_list, and is thus accessible from other contexts. Although link group creation is serialized by smc_create_lgr_pending, the new link group may still be accessed concurrently. For example, if ib_device is no longer active, smc_ib_port_event_work() will call smc_port_terminate(), which in turn will call __smc_lgr_terminate() on every link group of this device. So conns_lock is required here. Signed-off-by: Huaping Zhou <zhp@smail.nju.edu.cn> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>