summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2018-05-01tcp: send in-queue bytes in cmsg upon readSoheil Hassas Yeganeh
Applications with many concurrent connections, high variance in receive queue length and tight memory bounds cannot allocate worst-case buffer size to drain sockets. Knowing the size of receive queue length, applications can optimize how they allocate buffers to read from the socket. The number of bytes pending on the socket is directly available through ioctl(FIONREAD/SIOCINQ) and can be approximated using getsockopt(MEMINFO) (rmem_alloc includes skb overheads in addition to application data). But, both of these options add an extra syscall per recvmsg. Moreover, ioctl(FIONREAD/SIOCINQ) takes the socket lock. Add the TCP_INQ socket option to TCP. When this socket option is set, recvmsg() relays the number of bytes available on the socket for reading to the application via the TCP_CM_INQ control message. Calculate the number of bytes after releasing the socket lock to include the processed backlog, if any. To avoid an extra branch in the hot path of recvmsg() for this new control message, move all cmsg processing inside an existing branch for processing receive timestamps. Since the socket lock is not held when calculating the size of receive queue, TCP_INQ is a hint. For example, it can overestimate the queue size by one byte, if FIN is received. With this method, applications can start reading from the socket using a small buffer, and then use larger buffers based on the remaining data when needed. V3 change-log: As suggested by David Miller, added loads with barrier to check whether we have multiple threads calling recvmsg in parallel. When that happens we lock the socket to calculate inq. V4 change-log: Removed inline from a static function. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Suggested-by: David Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01udp: disable gso with no_check_txWillem de Bruijn
Syzbot managed to send a udp gso packet without checksum offload into the gso stack by disabling tx checksum (UDP_NO_CHECK6_TX). This triggered the skb_warn_bad_offload. RIP: 0010:skb_warn_bad_offload+0x2bc/0x600 net/core/dev.c:2658 skb_gso_segment include/linux/netdevice.h:4038 [inline] validate_xmit_skb+0x54d/0xd90 net/core/dev.c:3120 __dev_queue_xmit+0xbf8/0x34c0 net/core/dev.c:3577 dev_queue_xmit+0x17/0x20 net/core/dev.c:3618 UDP_NO_CHECK6_TX sets skb->ip_summed to CHECKSUM_NONE just after the udp gso integrity checks in udp_(v6_)send_skb. Extend those checks to catch and fail in this case. After the integrity checks jump directly to the CHECKSUM_PARTIAL case to avoid reading the no_check_tx flags again (a TOCTTOU race). Fixes: bec1f6f69736 ("udp: generate gso with UDP_SEGMENT") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01tcp: fix TCP_REPAIR_QUEUE bound checkingEric Dumazet
syzbot is able to produce a nasty WARN_ON() in tcp_verify_left_out() with following C-repro : socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3 setsockopt(3, SOL_TCP, TCP_REPAIR, [1], 4) = 0 setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [-1], 4) = 0 bind(3, {sa_family=AF_INET, sin_port=htons(20002), sin_addr=inet_addr("0.0.0.0")}, 16) = 0 sendto(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1242, MSG_FASTOPEN, {sa_family=AF_INET, sin_port=htons(20002), sin_addr=inet_addr("127.0.0.1")}, 16) = 1242 setsockopt(3, SOL_TCP, TCP_REPAIR_WINDOW, "\4\0\0@+\205\0\0\377\377\0\0\377\377\377\177\0\0\0\0", 20) = 0 writev(3, [{"\270", 1}], 1) = 1 setsockopt(3, SOL_TCP, TCP_REPAIR_OPTIONS, "\10\0\0\0\0\0\0\0\0\0\0\0|\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 386) = 0 writev(3, [{"\210v\r[\226\320t\231qwQ\204\264l\254\t\1\20\245\214p\350H\223\254;\\\37\345\307p$"..., 3144}], 1) = 3144 The 3rd system call looks odd : setsockopt(3, SOL_TCP, TCP_REPAIR_QUEUE, [-1], 4) = 0 This patch makes sure bound checking is using an unsigned compare. Fixes: ee9952831cfd ("tcp: Initial repair mode") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-01tcp: Add clean acked data hookIlya Lesokhin
Called when a TCP segment is acknowledged. Could be used by application protocols who hold additional metadata associated with the stream data. This is required by TLS device offload to release metadata associated with acknowledged TLS records. Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-30erspan: auto detect truncated packets.William Tu
Currently the truncated bit is set only when the mirrored packet is larger than mtu. For certain cases, the packet might already been truncated before sending to the erspan tunnel. In this case, the patch detect whether the IP header's total length is larger than the actual skb->len. If true, this indicated that the mirrored packet is truncated and set the erspan truncate bit. I tested the patch using bpf_skb_change_tail helper function to shrink the packet size and send to erspan tunnel. Reported-by: Xiaoyan Jin <xiaoyanj@vmware.com> Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-29tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receiveEric Dumazet
When adding tcp mmap() implementation, I forgot that socket lock had to be taken before current->mm->mmap_sem. syzbot eventually caught the bug. Since we can not lock the socket in tcp mmap() handler we have to split the operation in two phases. 1) mmap() on a tcp socket simply reserves VMA space, and nothing else. This operation does not involve any TCP locking. 2) getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements the transfert of pages from skbs to one VMA. This operation only uses down_read(&current->mm->mmap_sem) after holding TCP lock, thus solving the lockdep issue. This new implementation was suggested by Andy Lutomirski with great details. Benefits are : - Better scalability, in case multiple threads reuse VMAS (without mmap()/munmap() calls) since mmap_sem wont be write locked. - Better error recovery. The previous mmap() model had to provide the expected size of the mapping. If for some reason one part could not be mapped (partial MSS), the whole operation had to be aborted. With the tcp_zerocopy_receive struct, kernel can report how many bytes were successfuly mapped, and how many bytes should be read to skip the problematic sequence. - No more memory allocation to hold an array of page pointers. 16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/ - skbs are freed while mmap_sem has been released Following patch makes the change in tcp_mmap tool to demonstrate one possible use of mmap() and setsockopt(... TCP_ZEROCOPY_RECEIVE ...) Note that memcg might require additional changes. Fixes: 93ab6cc69162 ("tcp: implement mmap() for zero copy receive") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Suggested-by: Andy Lutomirski <luto@kernel.org> Cc: linux-mm@kvack.org Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27udp: remove stray export symbolWillem de Bruijn
UDP GSO needs to export __udp_gso_segment to call it from ipv6. I accidentally exported static ipv4 function __udp4_gso_segment. Remove that EXPORT_SYMBOL_GPL. Fixes: ee80d1ebe5ba ("udp: add udp gso") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27tcp: remove mss check in tcp_select_initial_window()Wei Wang
In tcp_select_initial_window(), we only set rcv_wnd to tcp_default_init_rwnd() if current mss > (1 << wscale). Otherwise, rcv_wnd is kept at the full receive space of the socket which is a value way larger than tcp_default_init_rwnd(). With larger initial rcv_wnd value, receive buffer autotuning logic takes longer to kick in and increase the receive buffer. In a TCP throughput test where receiver has rmem[2] set to 125MB (wscale is 11), we see the connection gets recvbuf limited at the beginning of the connection and gets less throughput overall. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-27tcp: ignore Fast Open on repair modeYuchung Cheng
The TCP repair sequence of operation is to first set the socket in repair mode, then inject the TCP stats into the socket with repair socket options, then call connect() to re-activate the socket. The connect syscall simply returns and set state to ESTABLISHED mode. As a result Fast Open is meaningless for TCP repair. However allowing sendto() system call with MSG_FASTOPEN flag half-way during the repair operation could unexpectedly cause data to be sent, before the operation finishes changing the internal TCP stats (e.g. MSS). This in turn triggers TCP warnings on inconsistent packet accounting. The fix is to simply disallow Fast Open operation once the socket is in the repair mode. Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26udp: add gso segment cmsgWillem de Bruijn
Allow specifying segment size in the send call. The new control message performs the same function as socket option UDP_SEGMENT while avoiding the extra system call. [ Export udp_cmsg_send for ipv6. -DaveM ] Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26udp: paged allocation with gsoWillem de Bruijn
When sending large datagrams that are later segmented, store data in page frags to avoid copying from linear in skb_segment. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26udp: better wmem accounting on gsoWillem de Bruijn
skb_segment by default transfers allocated wmem from the gso skb to the tail of the segment list. This underreports real truesize of the list, especially if the tail might be dropped. Similar to tcp_gso_segment, update wmem_alloc with the aggregate list truesize and make each segment responsible for its own share by setting skb->destructor. Clear gso_skb->destructor prior to calling skb_segment to skip the default assignment to tail. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26udp: generate gso with UDP_SEGMENTWillem de Bruijn
Support generic segmentation offload for udp datagrams. Callers can concatenate and send at once the payload of multiple datagrams with the same destination. To set segment size, the caller sets socket option UDP_SEGMENT to the length of each discrete payload. This value must be smaller than or equal to the relevant MTU. A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a per send call basis. Total byte length may then exceed MTU. If not an exact multiple of segment size, the last segment will be shorter. The implementation adds a gso_size field to the udp socket, ip(v6) cmsg cookie and inet_cork structure to be able to set the value at setsockopt or cmsg time and to work with both lockless and corked paths. Initial benchmark numbers show UDP GSO about as expensive as TCP GSO. tcp tso 3197 MB/s 54232 msg/s 54232 calls/s 6,457,754,262 cycles tcp gso 1765 MB/s 29939 msg/s 29939 calls/s 11,203,021,806 cycles tcp without tso/gso * 739 MB/s 12548 msg/s 12548 calls/s 11,205,483,630 cycles udp 876 MB/s 14873 msg/s 624666 calls/s 11,205,777,429 cycles udp gso 2139 MB/s 36282 msg/s 36282 calls/s 11,204,374,561 cycles [*] after reverting commit 0a6b2a1dc2a2 ("tcp: switch to GSO being always on") Measured total system cycles ('-a') for one core while pinning both the network receive path and benchmark process to that core: perf stat -a -C 12 -e cycles \ ./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4 Note the reduction in calls/s with GSO. Bytes per syscall drops increases from 1470 to 61818. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26udp: add udp gsoWillem de Bruijn
Implement generic segmentation offload support for udp datagrams. A follow-up patch adds support to the protocol stack to generate such packets. UDP GSO is not UFO. UFO fragments a single large datagram. GSO splits a large payload into a number of discrete UDP datagrams. The implementation adds a GSO type SKB_UDP_GSO_L4 to differentiate it from UFO (SKB_UDP_GSO). IPPROTO_UDPLITE is excluded, as that protocol has no gso handler registered. [ Export __udp_gso_segment for ipv6. -DaveM ] Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-26udp: expose inet cork to udpWillem de Bruijn
UDP segmentation offload needs access to inet_cork in the udp layer. Pass the struct to ip(6)_make_skb instead of allocating it on the stack in that function itself. This patch is a noop otherwise. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2018-04-24ipconfig: Write NTP server IPs to /proc/net/ipconfig/ntp_serversChris Novakovic
Distributed filesystems are most effective when the server and client clocks are synchronised. Embedded devices often use NFS for their root filesystem but typically do not contain an RTC, so the clocks of the NFS server and the embedded device will be out-of-sync when the root filesystem is mounted (and may not be synchronised until late in the boot process). Extend ipconfig with the ability to export IP addresses of NTP servers it discovers to /proc/net/ipconfig/ntp_servers. They can be supplied as follows: - If ipconfig is configured manually via the "ip=" or "nfsaddrs=" kernel command line parameters, one NTP server can be specified in the new "<ntp0-ip>" parameter. - If ipconfig is autoconfigured via DHCP, request DHCP option 42 in the DHCPDISCOVER message, and record the IP addresses of up to three NTP servers sent by the responding DHCP server in the subsequent DHCPOFFER message. ipconfig will only write the NTP server IP addresses it discovers to /proc/net/ipconfig/ntp_servers, one per line (in the order received from the DHCP server, if DHCP autoconfiguration is used); making use of these NTP servers is the responsibility of a user space process (e.g. an initrd/initram script that invokes an NTP client before mounting an NFS root filesystem). Signed-off-by: Chris Novakovic <chris@chrisn.me.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24ipconfig: Create /proc/net/ipconfig directoryChris Novakovic
To allow ipconfig to report IP configuration details to user space processes without cluttering /proc/net, create a new subdirectory /proc/net/ipconfig. All files containing IP configuration details should be written to this directory. Signed-off-by: Chris Novakovic <chris@chrisn.me.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24ipconfig: Correctly initialise ic_nameserversChris Novakovic
ic_nameservers, which stores the list of name servers discovered by ipconfig, is initialised (i.e. has all of its elements set to NONE, or 0xffffffff) by ic_nameservers_predef() in the following scenarios: - before the "ip=" and "nfsaddrs=" kernel command line parameters are parsed (in ip_auto_config_setup()); - before autoconfiguring via DHCP or BOOTP (in ic_bootp_init()), in order to clear any values that may have been set after parsing "ip=" or "nfsaddrs=" and are no longer needed. This means that ic_nameservers_predef() is not called when neither "ip=" nor "nfsaddrs=" is specified on the kernel command line. In this scenario, every element in ic_nameservers remains set to 0x00000000, which is indistinguishable from ANY and causes pnp_seq_show() to write the following (bogus) information to /proc/net/pnp: #MANUAL nameserver 0.0.0.0 nameserver 0.0.0.0 nameserver 0.0.0.0 This is potentially problematic for systems that blindly link /etc/resolv.conf to /proc/net/pnp. Ensure that ic_nameservers is also initialised when neither "ip=" nor "nfsaddrs=" are specified by calling ic_nameservers_predef() in ip_auto_config(), but only when ip_auto_config_setup() was not called earlier. This causes the following to be written to /proc/net/pnp, and is consistent with what gets written when ipconfig is configured manually but no name servers are specified on the kernel command line: #MANUAL Signed-off-by: Chris Novakovic <chris@chrisn.me.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24ipconfig: BOOTP: Request CONF_NAMESERVERS_MAX name serversChris Novakovic
When ipconfig is autoconfigured via BOOTP, the request packet initialised by ic_bootp_init_ext() always allocates 8 bytes for the name server option, limiting the BOOTP server to responding with at most 2 name servers even though ipconfig in fact supports an arbitrary number of name servers (as defined by CONF_NAMESERVERS_MAX, which is currently 3). Only request name servers in the request packet if CONF_NAMESERVERS_MAX is positive (to comply with [1, §3.8]), and allocate enough space in the packet for CONF_NAMESERVERS_MAX name servers to indicate the maximum number we can accept in response. [1] RFC 2132, "DHCP Options and BOOTP Vendor Extensions": https://tools.ietf.org/rfc/rfc2132.txt Signed-off-by: Chris Novakovic <chris@chrisn.me.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24ipconfig: BOOTP: Don't request IEN-116 name serversChris Novakovic
When ipconfig is autoconfigured via BOOTP, the request packet initialised by ic_bootp_init_ext() allocates 8 bytes for tag 5 ("Name Server" [1, §3.7]), but tag 5 in the response isn't processed by ic_do_bootp_ext(). Instead, allocate the 8 bytes to tag 6 ("Domain Name Server" [1, §3.8]), which is processed by ic_do_bootp_ext(), and appears to have been the intended tag to request. This won't cause any breakage for existing users, as tag 5 responses provided by BOOTP servers weren't being processed anyway. [1] RFC 2132, "DHCP Options and BOOTP Vendor Extensions": https://tools.ietf.org/rfc/rfc2132.txt Signed-off-by: Chris Novakovic <chris@chrisn.me.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24ipconfig: Tidy up reporting of name serversChris Novakovic
Commit 5e953778a2aab04929a5e7b69f53dc26e39b079e ("ipconfig: add nameserver IPs to kernel-parameter ip=") adds the IP addresses of discovered name servers to the summary printed by ipconfig when configuration is complete. It appears the intention in ip_auto_config() was to print the name servers on a new line (especially given the spacing and lack of comma before "nameserver0="), but they're actually printed on the same line as the NFS root filesystem configuration summary: [ 0.686186] IP-Config: Complete: [ 0.686226] device=eth0, hwaddr=xx:xx:xx:xx:xx:xx, ipaddr=10.0.0.2, mask=255.255.255.0, gw=10.0.0.1 [ 0.686328] host=test, domain=example.com, nis-domain=(none) [ 0.686386] bootserver=10.0.0.1, rootserver=10.0.0.1, rootpath= nameserver0=10.0.0.1 This makes it harder to read and parse ipconfig's output. Instead, print the name servers on a separate line: [ 0.791250] IP-Config: Complete: [ 0.791289] device=eth0, hwaddr=xx:xx:xx:xx:xx:xx, ipaddr=10.0.0.2, mask=255.255.255.0, gw=10.0.0.1 [ 0.791407] host=test, domain=example.com, nis-domain=(none) [ 0.791475] bootserver=10.0.0.1, rootserver=10.0.0.1, rootpath= [ 0.791476] nameserver0=10.0.0.1 Signed-off-by: Chris Novakovic <chris@chrisn.me.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24tcp: md5: only call tp->af_specific->md5_lookup() for md5 socketsEric Dumazet
RETPOLINE made calls to tp->af_specific->md5_lookup() quite expensive, given they have no result. We can omit the calls for sockets that have no md5 keys. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24Revert "net: init sk_cookie for inet socket"Yafang Shao
This reverts commit <c6849a3ac17e> ("net: init sk_cookie for inet socket") Per discussion with Eric, when update sock_net(sk)->cookie_gen, the whole cache cache line will be invalidated, as this cache line is shared with all cpus, that may cause great performace hit. Bellow is the data form Eric. "Performance is reduced from ~5 Mpps to ~3.8 Mpps with 16 RX queues on my host" when running synflood test. Have to revert it to prevent from cache line false sharing. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-24netfilter: xtables: use ipt_get_target_c instead of ipt_get_targetTaehee Yoo
ipt_get_target is used to get struct xt_entry_target and ipt_get_target_c is used to get const struct xt_entry_target. However in the ipt_do_table, ipt_get_target is used to get const struct xt_entry_target. it should be replaced by ipt_get_target_c. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24netfilter: add NAT support for shifted portmap rangesThierry Du Tre
This is a patch proposal to support shifted ranges in portmaps. (i.e. tcp/udp incoming port 5000-5100 on WAN redirected to LAN 192.168.1.5:2000-2100) Currently DNAT only works for single port or identical port ranges. (i.e. ports 5000-5100 on WAN interface redirected to a LAN host while original destination port is not altered) When different port ranges are configured, either 'random' mode should be used, or else all incoming connections are mapped onto the first port in the redirect range. (in described example WAN:5000-5100 will all be mapped to 192.168.1.5:2000) This patch introduces a new mode indicated by flag NF_NAT_RANGE_PROTO_OFFSET which uses a base port value to calculate an offset with the destination port present in the incoming stream. That offset is then applied as index within the redirect port range (index modulo rangewidth to handle range overflow). In described example the base port would be 5000. An incoming stream with destination port 5004 would result in an offset value 4 which means that the NAT'ed stream will be using destination port 2004. Other possibilities include deterministic mapping of larger or multiple ranges to a smaller range : WAN:5000-5999 -> LAN:5000-5099 (maps WAN port 5*xx to port 51xx) This patch does not change any current behavior. It just adds new NAT proto range functionality which must be selected via the specific flag when intended to use. A patch for iptables (libipt_DNAT.c + libip6t_DNAT.c) will also be proposed which makes this functionality immediately available. Signed-off-by: Thierry Du Tre <thierry@dtsystems.be> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24netfilter: nf_flow_table: move init code to nf_flow_table_core.cFelix Fietkau
Reduces duplication of .gc and .params in flowtable type definitions and makes the API clearer Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-24netfilter: nf_flow_table: move ipv4 offload hook code to nf_flow_tableFelix Fietkau
Allows some minor code sharing with the ipv6 hook code and is also useful as preparation for adding iptables support for offload Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-23net: init sk_cookie for inet socketYafang Shao
With sk_cookie we can identify a socket, that is very helpful for traceing and statistic, i.e. tcp tracepiont and ebpf. So we'd better init it by default for inet socket. When using it, we just need call atomic64_read(&sk->sk_cookie). Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23net: fib_rules: add extack supportRoopa Prabhu
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23net: introduce a new tracepoint for tcp_rcv_space_adjustYafang Shao
tcp_rcv_space_adjust is called every time data is copied to user space, introducing a tcp tracepoint for which could show us when the packet is copied to user. When a tcp packet arrives, tcp_rcv_established() will be called and with the existed tracepoint tcp_probe we could get the time when this packet arrives. Then this packet will be copied to user, and tcp_rcv_space_adjust will be called and with this new introduced tracepoint we could get the time when this packet is copied to user. With these two tracepoints, we could figure out whether the user program processes this packet immediately or there's latency. Hence in the printk message, sk_cookie is printed as a key to relate tcp_rcv_space_adjust with tcp_probe. Maybe we could export sockfd in this new tracepoint as well, then we could relate this new tracepoint with epoll/read/recv* tracepoints, and finally that could show us the whole lifespan of this packet. But we could also implement that with pid as these functions are executed in process context. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-23tcp: don't read out-of-bounds opsizeJann Horn
The old code reads the "opsize" variable from out-of-bounds memory (first byte behind the segment) if a broken TCP segment ends directly after an opcode that is neither EOL nor NOP. The result of the read isn't used for anything, so the worst thing that could theoretically happen is a pagefault; and since the physmap is usually mostly contiguous, even that seems pretty unlikely. The following C reproducer triggers the uninitialized read - however, you can't actually see anything happen unless you put something like a pr_warn() in tcp_parse_md5sig_option() to print the opsize. ==================================== #define _GNU_SOURCE #include <arpa/inet.h> #include <stdlib.h> #include <errno.h> #include <stdarg.h> #include <net/if.h> #include <linux/if.h> #include <linux/ip.h> #include <linux/tcp.h> #include <linux/in.h> #include <linux/if_tun.h> #include <err.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <string.h> #include <stdio.h> #include <unistd.h> #include <sys/ioctl.h> #include <assert.h> void systemf(const char *command, ...) { char *full_command; va_list ap; va_start(ap, command); if (vasprintf(&full_command, command, ap) == -1) err(1, "vasprintf"); va_end(ap); printf("systemf: <<<%s>>>\n", full_command); system(full_command); } char *devname; int tun_alloc(char *name) { int fd = open("/dev/net/tun", O_RDWR); if (fd == -1) err(1, "open tun dev"); static struct ifreq req = { .ifr_flags = IFF_TUN|IFF_NO_PI }; strcpy(req.ifr_name, name); if (ioctl(fd, TUNSETIFF, &req)) err(1, "TUNSETIFF"); devname = req.ifr_name; printf("device name: %s\n", devname); return fd; } #define IPADDR(a,b,c,d) (((a)<<0)+((b)<<8)+((c)<<16)+((d)<<24)) void sum_accumulate(unsigned int *sum, void *data, int len) { assert((len&2)==0); for (int i=0; i<len/2; i++) { *sum += ntohs(((unsigned short *)data)[i]); } } unsigned short sum_final(unsigned int sum) { sum = (sum >> 16) + (sum & 0xffff); sum = (sum >> 16) + (sum & 0xffff); return htons(~sum); } void fix_ip_sum(struct iphdr *ip) { unsigned int sum = 0; sum_accumulate(&sum, ip, sizeof(*ip)); ip->check = sum_final(sum); } void fix_tcp_sum(struct iphdr *ip, struct tcphdr *tcp) { unsigned int sum = 0; struct { unsigned int saddr; unsigned int daddr; unsigned char pad; unsigned char proto_num; unsigned short tcp_len; } fakehdr = { .saddr = ip->saddr, .daddr = ip->daddr, .proto_num = ip->protocol, .tcp_len = htons(ntohs(ip->tot_len) - ip->ihl*4) }; sum_accumulate(&sum, &fakehdr, sizeof(fakehdr)); sum_accumulate(&sum, tcp, tcp->doff*4); tcp->check = sum_final(sum); } int main(void) { int tun_fd = tun_alloc("inject_dev%d"); systemf("ip link set %s up", devname); systemf("ip addr add 192.168.42.1/24 dev %s", devname); struct { struct iphdr ip; struct tcphdr tcp; unsigned char tcp_opts[20]; } __attribute__((packed)) syn_packet = { .ip = { .ihl = sizeof(struct iphdr)/4, .version = 4, .tot_len = htons(sizeof(syn_packet)), .ttl = 30, .protocol = IPPROTO_TCP, /* FIXUP check */ .saddr = IPADDR(192,168,42,2), .daddr = IPADDR(192,168,42,1) }, .tcp = { .source = htons(1), .dest = htons(1337), .seq = 0x12345678, .doff = (sizeof(syn_packet.tcp)+sizeof(syn_packet.tcp_opts))/4, .syn = 1, .window = htons(64), .check = 0 /*FIXUP*/ }, .tcp_opts = { /* INVALID: trailing MD5SIG opcode after NOPs */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 19 } }; fix_ip_sum(&syn_packet.ip); fix_tcp_sum(&syn_packet.ip, &syn_packet.tcp); while (1) { int write_res = write(tun_fd, &syn_packet, sizeof(syn_packet)); if (write_res != sizeof(syn_packet)) err(1, "packet write failed"); } } ==================================== Fixes: cfb6eeb4c860 ("[TCP]: MD5 Signature Option (RFC2385) support.") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts were simple overlapping changes in microchip driver. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-21netfilter: nf_flow_table: cache mtu in struct flow_offload_tupleFelix Fietkau
Reduces the number of cache lines touched in the offload forwarding path. This is safe because PMTU limits are bypassed for the forwarding path (see commit f87c10a8aa1e for more details). Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-04-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Unbalanced refcounting in TIPC, from Jon Maloy. 2) Only allow TCP_MD5SIG to be set on sockets in close or listen state. Once the connection is established it makes no sense to change this. From Eric Dumazet. 3) Missing attribute validation in neigh_dump_table(), also from Eric Dumazet. 4) Fix address comparisons in SCTP, from Xin Long. 5) Neigh proxy table clearing can deadlock, from Wolfgang Bumiller. 6) Fix tunnel refcounting in l2tp, from Guillaume Nault. 7) Fix double list insert in team driver, from Paolo Abeni. 8) af_vsock.ko module was accidently made unremovable, from Stefan Hajnoczi. 9) Fix reference to freed llc_sap object in llc stack, from Cong Wang. 10) Don't assume netdevice struct is DMA'able memory in virtio_net driver, from Michael S. Tsirkin. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (62 commits) net/smc: fix shutdown in state SMC_LISTEN bnxt_en: Fix memory fault in bnxt_ethtool_init() virtio_net: sparse annotation fix virtio_net: fix adding vids on big-endian virtio_net: split out ctrl buffer net: hns: Avoid action name truncation docs: ip-sysctl.txt: fix name of some ipv6 variables vmxnet3: fix incorrect dereference when rxvlan is disabled llc: hold llc_sap before release_sock() MAINTAINERS: Direct networking documentation changes to netdev atm: iphase: fix spelling mistake: "Tansmit" -> "Transmit" net: qmi_wwan: add Wistron Neweb D19Q1 net: caif: fix spelling mistake "UKNOWN" -> "UNKNOWN" net: stmmac: Disable ACS Feature for GMAC >= 4 net: mvpp2: Fix DMA address mask size net: change the comment of dev_mc_init net: qualcomm: rmnet: Fix warning seen with fill_info tun: fix vlan packet truncation tipc: fix infinite loop when dumping link monitor summary tipc: fix use-after-free in tipc_nametbl_stop ...
2018-04-19tcp: export packets delivery infoYuchung Cheng
Export data delivered and delivered with CE marks to 1) SNMP TCPDelivered and TCPDeliveredCE 2) getsockopt(TCP_INFO) 3) Timestamping API SOF_TIMESTAMPING_OPT_STATS Note that for SCM_TSTAMP_ACK, the delivery info in SOF_TIMESTAMPING_OPT_STATS is reported before the info was fully updated on the ACK. These stats help application monitor TCP delivery and ECN status on per host, per connection, even per message level. Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19tcp: track total bytes delivered with ECN CE marksYuchung Cheng
Introduce a new delivered_ce stat in tcp socket to estimate number of packets being marked with CE bits. The estimation is done via ACKs with ECE bit. Depending on the actual receiver behavior, the estimation could have biases. Since the TCP sender can't really see the CE bit in the data path, so the sender is technically counting packets marked delivered with the "ECE / ECN-Echo" flag set. With RFC3168 ECN, because the ECE bit is sticky, this count can drastically overestimate the nummber of CE-marked data packets With DCTCP-style ECN this should be reasonably precise unless there is loss in the ACK path, in which case it's not precise. With AccECN proposal this can be made still more precise, even in the case some degree of ACK loss. However this is sender's best estimate of CE information. Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19tcp: new helper to calculate newly deliveredYuchung Cheng
Add new helper tcp_newly_delivered() to prepare the ECN accounting change. Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-19tcp: better delivery accounting for SYN-ACK and SYN-dataYuchung Cheng
the tcp_sock:delivered has inconsistent accounting for SYN and FIN. 1. it counts pure FIN 2. it counts pure SYN 3. it counts SYN-data twice 4. it does not count SYN-ACK For congestion control perspective it does not matter much as C.C. only cares about the difference not the aboslute value. But the next patch would export this field to user-space so it's better to report the absolute value w/o these caveats. This patch counts SYN, SYN-ACK, or SYN-data delivery once always in the "delivered" field. Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-18udp: enable UDP checksum offload for ESPJacek Kalwas
In case NIC has support for ESP TX CSUM offload skb->ip_summed is not set to CHECKSUM_PARTIAL which results in checksum calculated by SW. Fix enables ESP TX CSUM for UDP by extending condition with check for NETIF_F_HW_ESP_TX_CSUM. Signed-off-by: Jacek Kalwas <jacek.kalwas@intel.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-04-17net: Move fib_convert_metrics to metrics fileDavid Ahern
Move logic of fib_convert_metrics into ip_metrics_convert. This allows the code that converts netlink attributes into metrics struct to be re-used in a later patch by IPv6. This is mostly a code move with the following changes to variable names: - fi->fib_net becomes net - fc_mx and fc_mx_len are passed as inputs pulled from fib_config - metrics array is passed as an input from fi->fib_metrics->metrics Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16tcp: implement mmap() for zero copy receiveEric Dumazet
Some networks can make sure TCP payload can exactly fit 4KB pages, with well chosen MSS/MTU and architectures. Implement mmap() system call so that applications can avoid copying data without complex splice() games. Note that a successful mmap( X bytes) on TCP socket is consuming bytes, as if recvmsg() has been done. (tp->copied += X) Only PROT_READ mappings are accepted, as skb page frags are fundamentally shared and read only. If tcp_mmap() finds data that is not a full page, or a patch of urgent data, -EINVAL is returned, no bytes are consumed. Application must fallback to recvmsg() to read the problematic sequence. mmap() wont block, regardless of socket being in blocking or non-blocking mode. If not enough bytes are in receive queue, mmap() would return -EAGAIN, or -EIO if socket is in a state where no other bytes can be added into receive queue. An application might use SO_RCVLOWAT, poll() and/or ioctl( FIONREAD) to efficiently use mmap() On the sender side, MSG_EOR might help to clearly separate unaligned headers and 4K-aligned chunks if necessary. Tested: mlx4 (cx-3) 40Gbit NIC, with tcp_mmap program provided in following patch. MTU set to 4168 (4096 TCP payload, 40 bytes IPv6 header, 32 bytes TCP header) Without mmap() (tcp_mmap -s) received 32768 MB (0 % mmap'ed) in 8.13342 s, 33.7961 Gbit, cpu usage user:0.034 sys:3.778, 116.333 usec per MB, 63062 c-switches received 32768 MB (0 % mmap'ed) in 8.14501 s, 33.748 Gbit, cpu usage user:0.029 sys:3.997, 122.864 usec per MB, 61903 c-switches received 32768 MB (0 % mmap'ed) in 8.11723 s, 33.8635 Gbit, cpu usage user:0.048 sys:3.964, 122.437 usec per MB, 62983 c-switches received 32768 MB (0 % mmap'ed) in 8.39189 s, 32.7552 Gbit, cpu usage user:0.038 sys:4.181, 128.754 usec per MB, 55834 c-switches With mmap() on receiver (tcp_mmap -s -z) received 32768 MB (100 % mmap'ed) in 8.03083 s, 34.2278 Gbit, cpu usage user:0.024 sys:1.466, 45.4712 usec per MB, 65479 c-switches received 32768 MB (100 % mmap'ed) in 7.98805 s, 34.4111 Gbit, cpu usage user:0.026 sys:1.401, 43.5486 usec per MB, 65447 c-switches received 32768 MB (100 % mmap'ed) in 7.98377 s, 34.4296 Gbit, cpu usage user:0.028 sys:1.452, 45.166 usec per MB, 65496 c-switches received 32768 MB (99.9969 % mmap'ed) in 8.01838 s, 34.281 Gbit, cpu usage user:0.02 sys:1.446, 44.7388 usec per MB, 65505 c-switches Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16tcp: avoid extra wakeups for SO_RCVLOWAT usersEric Dumazet
SO_RCVLOWAT is properly handled in tcp_poll(), so that POLLIN is only generated when enough bytes are available in receive queue, after David change (commit c7004482e8dc "tcp: Respect SO_RCVLOWAT in tcp_poll().") But TCP still calls sk->sk_data_ready() for each chunk added in receive queue, meaning thread is awaken, and goes back to sleep shortly after. Tested: tcp_mmap test program, receiving 32768 MB of data with SO_RCVLOWAT set to 512KB -> Should get ~2 wakeups (c-switches) per MB, regardless of how many (tiny or big) packets were received. High speed (mostly full size GRO packets) received 32768 MB (100 % mmap'ed) in 8.03112 s, 34.2266 Gbit, cpu usage user:0.037 sys:1.404, 43.9758 usec per MB, 65497 c-switches received 32768 MB (99.9954 % mmap'ed) in 7.98453 s, 34.4263 Gbit, cpu usage user:0.03 sys:1.422, 44.3115 usec per MB, 65485 c-switches Low speed (sender is ratelimited and sends 1-MSS at a time, so GRO is not helping) received 22474.5 MB (100 % mmap'ed) in 6015.35 s, 0.0313414 Gbit, cpu usage user:0.05 sys:1.586, 72.7952 usec per MB, 44950 c-switches Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16tcp: fix delayed acks behavior for SO_RCVLOWATEric Dumazet
We should not delay acks if there are not enough bytes in receive queue to satisfy SO_RCVLOWAT. Since [E]POLLIN event is not going to be generated, there is little hope for a delayed ack to be useful. In fact, delaying ACK prevents sender from completing the transfer. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16tcp: fix SO_RCVLOWAT and RCVBUF autotuningEric Dumazet
Applications might use SO_RCVLOWAT on TCP socket hoping to receive one [E]POLLIN event only when a given amount of bytes are ready in socket receive queue. Problem is that receive autotuning is not aware of this constraint, meaning sk_rcvbuf might be too small to allow all bytes to be stored. Add a new (struct proto_ops)->set_rcvlowat method so that a protocol can override the default setsockopt(SO_RCVLOWAT) behavior. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16net: Fix one possible memleak in ip_setup_corkGao Feng
It would allocate memory in this function when the cork->opt is NULL. But the memory isn't freed if failed in the latter rt check, and return error directly. It causes the memleak if its caller is ip_make_skb which also doesn't free the cork->opt when meet a error. Now move the rt check ahead to avoid the memleak. Signed-off-by: Gao Feng <gfree.wind@vip.163.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-16tcp: clear tp->packets_out when purging write queueSoheil Hassas Yeganeh
Clear tp->packets_out when purging the write queue, otherwise tcp_rearm_rto() mistakenly assumes TCP write queue is not empty. This results in NULL pointer dereference. Also, remove the redundant `tp->packets_out = 0` from tcp_disconnect(), since tcp_disconnect() calls tcp_write_queue_purge(). Fixes: a27fd7a8ed38 (tcp: purge write queue upon RST) Reported-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Reported-by: Sami Farin <hvtaifwkbgefbaei@gmail.com> Tested-by: Sami Farin <hvtaifwkbgefbaei@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-15Merge tag 'kbuild-v4.17-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull more Kbuild updates from Masahiro Yamada: - pass HOSTLDFLAGS when compiling single .c host programs - build genksyms lexer and parser files instead of using shipped versions - rename *-asn1.[ch] to *.asn1.[ch] for suffix consistency - let the top .gitignore globally ignore artifacts generated by flex, bison, and asn1_compiler - let the top Makefile globally clean artifacts generated by flex, bison, and asn1_compiler - use safer .SECONDARY marker instead of .PRECIOUS to prevent intermediate files from being removed - support -fmacro-prefix-map option to make __FILE__ a relative path - fix # escaping to prepare for the future GNU Make release - clean up deb-pkg by using debian tools instead of handrolled source/changes generation - improve rpm-pkg portability by supporting kernel-install as a fallback of new-kernel-pkg - extend Kconfig listnewconfig target to provide more information * tag 'kbuild-v4.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kconfig: extend output of 'listnewconfig' kbuild: rpm-pkg: use kernel-install as a fallback for new-kernel-pkg Kbuild: fix # escaping in .cmd files for future Make kbuild: deb-pkg: split generating packaging and build kbuild: use -fmacro-prefix-map to make __FILE__ a relative path kbuild: mark $(targets) as .SECONDARY and remove .PRECIOUS markers kbuild: rename *-asn1.[ch] to *.asn1.[ch] kbuild: clean up *-asn1.[ch] patterns from top-level Makefile .gitignore: move *-asn1.[ch] patterns to the top-level .gitignore kbuild: add %.dtb.S and %.dtb to 'targets' automatically kbuild: add %.lex.c and %.tab.[ch] to 'targets' automatically genksyms: generate lexer and parser during build instead of shipping kbuild: clean up *.lex.c and *.tab.[ch] patterns from top-level Makefile .gitignore: move *.lex.c *.tab.[ch] patterns to the top-level .gitignore kbuild: use HOSTLDFLAGS for single .c executables
2018-04-12tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established socketsEric Dumazet
syzbot/KMSAN reported an uninit-value in tcp_parse_options() [1] I believe this was caused by a TCP_MD5SIG being set on live flow. This is highly unexpected, since TCP option space is limited. For instance, presence of TCP MD5 option automatically disables TCP TimeStamp option at SYN/SYNACK time, which we can not do once flow has been established. Really, adding/deleting an MD5 key only makes sense on sockets in CLOSE or LISTEN state. [1] BUG: KMSAN: uninit-value in tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720 CPU: 1 PID: 6177 Comm: syzkaller192004 Not tainted 4.16.0+ #83 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x185/0x1d0 lib/dump_stack.c:53 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676 tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720 tcp_fast_parse_options net/ipv4/tcp_input.c:3858 [inline] tcp_validate_incoming+0x4f1/0x2790 net/ipv4/tcp_input.c:5184 tcp_rcv_established+0xf60/0x2bb0 net/ipv4/tcp_input.c:5453 tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469 sk_backlog_rcv include/net/sock.h:908 [inline] __release_sock+0x2d6/0x680 net/core/sock.c:2271 release_sock+0x97/0x2a0 net/core/sock.c:2786 tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464 inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764 sock_sendmsg_nosec net/socket.c:630 [inline] sock_sendmsg net/socket.c:640 [inline] SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747 SyS_sendto+0x8a/0xb0 net/socket.c:1715 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x448fe9 RSP: 002b:00007fd472c64d38 EFLAGS: 00000216 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00000000006e5a30 RCX: 0000000000448fe9 RDX: 000000000000029f RSI: 0000000020a88f88 RDI: 0000000000000004 RBP: 00000000006e5a34 R08: 0000000020e68000 R09: 0000000000000010 R10: 00000000200007fd R11: 0000000000000216 R12: 0000000000000000 R13: 00007fff074899ef R14: 00007fd472c659c0 R15: 0000000000000009 Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline] kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314 kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321 slab_post_alloc_hook mm/slab.h:445 [inline] slab_alloc_node mm/slub.c:2737 [inline] __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369 __kmalloc_reserve net/core/skbuff.c:138 [inline] __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206 alloc_skb include/linux/skbuff.h:984 [inline] tcp_send_ack+0x18c/0x910 net/ipv4/tcp_output.c:3624 __tcp_ack_snd_check net/ipv4/tcp_input.c:5040 [inline] tcp_ack_snd_check net/ipv4/tcp_input.c:5053 [inline] tcp_rcv_established+0x2103/0x2bb0 net/ipv4/tcp_input.c:5469 tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469 sk_backlog_rcv include/net/sock.h:908 [inline] __release_sock+0x2d6/0x680 net/core/sock.c:2271 release_sock+0x97/0x2a0 net/core/sock.c:2786 tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464 inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764 sock_sendmsg_nosec net/socket.c:630 [inline] sock_sendmsg net/socket.c:640 [inline] SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747 SyS_sendto+0x8a/0xb0 net/socket.c:1715 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Fixes: cfb6eeb4c860 ("[TCP]: MD5 Signature Option (RFC2385) support.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-10ip_gre: clear feature flags when incompatible o_flags are setSabrina Dubroca
Commit dd9d598c6657 ("ip_gre: add the support for i/o_flags update via netlink") added the ability to change o_flags, but missed that the GSO/LLTX features are disabled by default, and only enabled some gre features are unused. Thus we also need to disable the GSO/LLTX features on the device when the TUNNEL_SEQ or TUNNEL_CSUM flags are set. These two examples should result in the same features being set: ip link add gre_none type gre local 192.168.0.10 remote 192.168.0.20 ttl 255 key 0 ip link set gre_none type gre seq ip link add gre_seq type gre local 192.168.0.10 remote 192.168.0.20 ttl 255 key 1 seq Fixes: dd9d598c6657 ("ip_gre: add the support for i/o_flags update via netlink") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Xin Long <lucien.xin@gmail.com> Acked-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>