summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2021-02-13bpf: Drop MTU check when doing TC-BPF redirect to ingressJesper Dangaard Brouer
The use-case for dropping the MTU check when TC-BPF does redirect to ingress, is described by Eyal Birger in email[0]. The summary is the ability to increase packet size (e.g. with IPv6 headers for NAT64) and ingress redirect packet and let normal netstack fragment packet as needed. [0] https://lore.kernel.org/netdev/CAHsH6Gug-hsLGHQ6N0wtixdOa85LDZ3HNRHVd0opR=19Qo4W4Q@mail.gmail.com/ V15: - missing static for function declaration V9: - Make net_device "up" (IFF_UP) check explicit in skb_do_redirect V4: - Keep net_device "up" (IFF_UP) check. - Adjustment to handle bpf_redirect_peer() helper Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/161287790971.790810.11785274340154740591.stgit@firesoul
2021-02-13bpf: Add BPF-helper for MTU checkingJesper Dangaard Brouer
This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs. The SKB object is complex and the skb->len value (accessible from BPF-prog) also include the length of any extra GRO/GSO segments, but without taking into account that these GRO/GSO segments get added transport (L4) and network (L3) headers before being transmitted. Thus, this BPF-helper is created such that the BPF-programmer don't need to handle these details in the BPF-prog. The API is designed to help the BPF-programmer, that want to do packet context size changes, which involves other helpers. These other helpers usually does a delta size adjustment. This helper also support a delta size (len_diff), which allow BPF-programmer to reuse arguments needed by these other helpers, and perform the MTU check prior to doing any actual size adjustment of the packet context. It is on purpose, that we allow the len adjustment to become a negative result, that will pass the MTU check. This might seem weird, but it's not this helpers responsibility to "catch" wrong len_diff adjustments. Other helpers will take care of these checks, if BPF-programmer chooses to do actual size adjustment. V14: - Improve man-page desc of len_diff. V13: - Enforce flag BPF_MTU_CHK_SEGS cannot use len_diff. V12: - Simplify segment check that calls skb_gso_validate_network_len. - Helpers should return long V9: - Use dev->hard_header_len (instead of ETH_HLEN) - Annotate with unlikely req from Daniel - Fix logic error using skb_gso_validate_network_len from Daniel V6: - Took John's advice and dropped BPF_MTU_CHK_RELAX - Returned MTU is kept at L3-level (like fib_lookup) V4: Lot of changes - ifindex 0 now use current netdev for MTU lookup - rename helper from bpf_mtu_check to bpf_check_mtu - fix bug for GSO pkt length (as skb->len is total len) - remove __bpf_len_adj_positive, simply allow negative len adj Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/161287790461.790810.3429728639563297353.stgit@firesoul
2021-02-13bpf: bpf_fib_lookup return MTU value as output when looked upJesper Dangaard Brouer
The BPF-helpers for FIB lookup (bpf_xdp_fib_lookup and bpf_skb_fib_lookup) can perform MTU check and return BPF_FIB_LKUP_RET_FRAG_NEEDED. The BPF-prog don't know the MTU value that caused this rejection. If the BPF-prog wants to implement PMTU (Path MTU Discovery) (rfc1191) it need to know this MTU value for the ICMP packet. Patch change lookup and result struct bpf_fib_lookup, to contain this MTU value as output via a union with 'tot_len' as this is the value used for the MTU lookup. V5: - Fixed uninit value spotted by Dan Carpenter. - Name struct output member mtu_result Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/161287789952.790810.13134700381067698781.stgit@firesoul
2021-02-13bpf: Fix bpf_fib_lookup helper MTU check for SKB ctxJesper Dangaard Brouer
BPF end-user on Cilium slack-channel (Carlo Carraro) wants to use bpf_fib_lookup for doing MTU-check, but *prior* to extending packet size, by adjusting fib_params 'tot_len' with the packet length plus the expected encap size. (Just like the bpf_check_mtu helper supports). He discovered that for SKB ctx the param->tot_len was not used, instead skb->len was used (via MTU check in is_skb_forwardable() that checks against netdev MTU). Fix this by using fib_params 'tot_len' for MTU check. If not provided (e.g. zero) then keep existing TC behaviour intact. Notice that 'tot_len' for MTU check is done like XDP code-path, which checks against FIB-dst MTU. V16: - Revert V13 optimization, 2nd lookup is against egress/resulting netdev V13: - Only do ifindex lookup one time, calling dev_get_by_index_rcu(). V10: - Use same method as XDP for 'tot_len' MTU check Fixes: 4c79579b44b1 ("bpf: Change bpf_fib_lookup to return lookup status") Reported-by: Carlo Carraro <colrack@gmail.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/161287789444.790810.15247494756551413508.stgit@firesoul
2021-02-13bpf: Remove MTU check in __bpf_skb_max_lenJesper Dangaard Brouer
Multiple BPF-helpers that can manipulate/increase the size of the SKB uses __bpf_skb_max_len() as the max-length. This function limit size against the current net_device MTU (skb->dev->mtu). When a BPF-prog grow the packet size, then it should not be limited to the MTU. The MTU is a transmit limitation, and software receiving this packet should be allowed to increase the size. Further more, current MTU check in __bpf_skb_max_len uses the MTU from ingress/current net_device, which in case of redirects uses the wrong net_device. This patch keeps a sanity max limit of SKB_MAX_ALLOC (16KiB). The real limit is elsewhere in the system. Jesper's testing[1] showed it was not possible to exceed 8KiB when expanding the SKB size via BPF-helper. The limiting factor is the define KMALLOC_MAX_CACHE_SIZE which is 8192 for SLUB-allocator (CONFIG_SLUB) in-case PAGE_SIZE is 4096. This define is in-effect due to this being called from softirq context see code __gfp_pfmemalloc_flags() and __do_kmalloc_node(). Jakub's testing showed that frames above 16KiB can cause NICs to reset (but not crash). Keep this sanity limit at this level as memory layer can differ based on kernel config. [1] https://github.com/xdp-project/bpf-examples/tree/master/MTU-tests Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/161287788936.790810.2937823995775097177.stgit@firesoul
2021-02-11bpf: Expose bpf_get_socket_cookie to tracing programsFlorent Revest
This needs a new helper that: - can work in a sleepable context (using sock_gen_cookie) - takes a struct sock pointer and checks that it's not NULL Signed-off-by: Florent Revest <revest@chromium.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: KP Singh <kpsingh@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20210210111406.785541-2-revest@chromium.org
2021-02-04net, veth: Alloc skb in bulk for ndo_xdp_xmitLorenzo Bianconi
Split ndo_xdp_xmit and ndo_start_xmit use cases in veth_xdp_rcv routine in order to alloc skbs in bulk for XDP_PASS verdict. Introduce xdp_alloc_skb_bulk utility routine to alloc skb bulk list. The proposed approach has been tested in the following scenario: eth (ixgbe) --> XDP_REDIRECT --> veth0 --> (remote-ns) veth1 --> XDP_PASS XDP_REDIRECT: xdp_redirect_map bpf sample XDP_PASS: xdp_rxq_info bpf sample traffic generator: pkt_gen sending udp traffic on a remote device bpf-next master: ~3.64Mpps bpf-next + skb bulking allocation: ~3.79Mpps Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Toshiaki Makita <toshiaki.makita1@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/a14a30d3c06fff24e13f836c733d80efc0bd6eb5.1611957532.git.lorenzo@kernel.org
2021-01-29bpf: Enable bpf_{g,s}etsockopt in BPF_CGROUP_UDP{4,6}_RECVMSGStanislav Fomichev
Those hooks run as BPF_CGROUP_RUN_SA_PROG_LOCK and operate on a locked socket. Note that we could remove the switch for prog->expected_attach_type altogether since all current sock_addr attach types are covered. However, it makes sense to keep it as a safe-guard in case new sock_addr attach types are added that might not operate on a locked socket. Therefore, avoid to let this slip through. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210127232853.3753823-5-sdf@google.com
2021-01-29bpf: Enable bpf_{g,s}etsockopt in BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAMEStanislav Fomichev
Those hooks run as BPF_CGROUP_RUN_SA_PROG_LOCK and operate on a locked socket. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210127232853.3753823-3-sdf@google.com
2021-01-29bpf: Enable bpf_{g,s}etsockopt in BPF_CGROUP_UDP{4,6}_SENDMSGStanislav Fomichev
Can be used to query/modify socket state for unconnected UDP sendmsg. Those hooks run as BPF_CGROUP_RUN_SA_PROG_LOCK and operate on a locked socket. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210127232853.3753823-2-sdf@google.com
2021-01-27bpf: Allow rewriting to ports under ip_unprivileged_port_startStanislav Fomichev
At the moment, BPF_CGROUP_INET{4,6}_BIND hooks can rewrite user_port to the privileged ones (< ip_unprivileged_port_start), but it will be rejected later on in the __inet_bind or __inet6_bind. Let's add another return value to indicate that CAP_NET_BIND_SERVICE check should be ignored. Use the same idea as we currently use in cgroup/egress where bit #1 indicates CN. Instead, for cgroup/bind{4,6}, bit #1 indicates that CAP_NET_BIND_SERVICE should be bypassed. v5: - rename flags to be less confusing (Andrey Ignatov) - rework BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY to work on flags and accept BPF_RET_SET_CN (no behavioral changes) v4: - Add missing IPv6 support (Martin KaFai Lau) v3: - Update description (Martin KaFai Lau) - Fix capability restore in selftest (Martin KaFai Lau) v2: - Switch to explicit return code (Martin KaFai Lau) Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Andrey Ignatov <rdna@fb.com> Link: https://lore.kernel.org/bpf/20210127193140.3170382-1-sdf@google.com
2021-01-28skmsg: Make sk_psock_destroy() staticCong Wang
sk_psock_destroy() is a RCU callback, I can't see any reason why it could be used outside. Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jakub Sitnicki <jakub@cloudflare.com> Cc: Lorenz Bauer <lmb@cloudflare.com> Link: https://lore.kernel.org/bpf/20210127221501.46866-1-xiyou.wangcong@gmail.com
2021-01-25xsk: Fold xp_assign_dev and __xp_assign_devBjörn Töpel
Fold xp_assign_dev and __xp_assign_dev. The former directly calls the latter. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20210122105351.11751-3-bjorn.topel@gmail.com
2021-01-25xsk: Remove explicit_free parameter from __xsk_rcv()Björn Töpel
The explicit_free parameter of the __xsk_rcv() function was used to mark whether the call was via the generic XDP or the native XDP path. Instead of clutter the code with if-statements and "true/false" parameters which are hard to understand, simply move the explicit free to the __xsk_map_redirect() which is always called from the native XDP path. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20210122105351.11751-2-bjorn.topel@gmail.com
2021-01-20bpf: Split cgroup_bpf_enabled per attach typeStanislav Fomichev
When we attach any cgroup hook, the rest (even if unused/unattached) start to contribute small overhead. In particular, the one we want to avoid is __cgroup_bpf_run_filter_skb which does two redirections to get to the cgroup and pushes/pulls skb. Let's split cgroup_bpf_enabled to be per-attach to make sure only used attach types trigger. I've dropped some existing high-level cgroup_bpf_enabled in some places because BPF_PROG_CGROUP_XXX_RUN macros usually have another cgroup_bpf_enabled check. I also had to copy-paste BPF_CGROUP_RUN_SA_PROG_LOCK for GETPEERNAME/GETSOCKNAME because type for cgroup_bpf_enabled[type] has to be constant and known at compile time. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20210115163501.805133-4-sdf@google.com
2021-01-20bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVEStanislav Fomichev
Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE. We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom call in do_tcp_getsockopt using the on-stack data. This removes 3% overhead for locking/unlocking the socket. Without this patch: 3.38% 0.07% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt | --3.30%--__cgroup_bpf_run_filter_getsockopt | --0.81%--__kmalloc With the patch applied: 0.52% 0.12% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt_kern Note, exporting uapi/tcp.h requires removing netinet/tcp.h from test_progs.h because those headers have confliciting definitions. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210115163501.805133-2-sdf@google.com
2021-01-20net, xdp: Introduce xdp_build_skb_from_frame utility routineLorenzo Bianconi
Introduce xdp_build_skb_from_frame utility routine to build the skb from xdp_frame. Respect to __xdp_build_skb_from_frame, xdp_build_skb_from_frame will allocate the skb object. Rely on xdp_build_skb_from_frame in veth driver. Introduce missing xdp metadata support in veth_xdp_rcv_one routine. Add missing metadata support in veth_xdp_rcv_one(). Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Toshiaki Makita <toshiaki.makita1@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/94ade9e853162ae1947941965193190da97457bc.1610475660.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-20net, xdp: Introduce __xdp_build_skb_from_frame utility routineLorenzo Bianconi
Introduce __xdp_build_skb_from_frame utility routine to build the skb from xdp_frame. Rely on __xdp_build_skb_from_frame in cpumap code. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/bpf/4f9f4c6b3dd3933770c617eb6689dbc0c6e25863.1610475660.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2021-01-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Conflicts: drivers/net/can/dev.c commit 03f16c5075b2 ("can: dev: can_restart: fix use after free bug") commit 3e77f70e7345 ("can: dev: move driver related infrastructure into separate subdir") Code move. drivers/net/dsa/b53/b53_common.c commit 8e4052c32d6b ("net: dsa: b53: fix an off by one in checking "vlan->vid"") commit b7a9e0da2d1c ("net: switchdev: remove vid_begin -> vid_end range from VLAN objects") Field rename. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20Merge tag 'net-5.11-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes for 5.11-rc5, including fixes from bpf, wireless, and can trees. Current release - regressions: - nfc: nci: fix the wrong NCI_CORE_INIT parameters Current release - new code bugs: - bpf: allow empty module BTFs Previous releases - regressions: - bpf: fix signed_{sub,add32}_overflows type handling - tcp: do not mess with cloned skbs in tcp_add_backlog() - bpf: prevent double bpf_prog_put call from bpf_tracing_prog_attach - bpf: don't leak memory in bpf getsockopt when optlen == 0 - tcp: fix potential use-after-free due to double kfree() - mac80211: fix encryption issues with WEP - devlink: use right genl user_ptr when handling port param get/set - ipv6: set multicast flag on the multicast route - tcp: fix TCP_USER_TIMEOUT with zero window Previous releases - always broken: - bpf: local storage helpers should check nullness of owner ptr passed - mac80211: fix incorrect strlen of .write in debugfs - cls_flower: call nla_ok() before nla_next() - skbuff: back tiny skbs with kmalloc() in __netdev_alloc_skb() too" * tag 'net-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (52 commits) net: systemport: free dev before on error path net: usb: cdc_ncm: don't spew notifications net: mscc: ocelot: Fix multicast to the CPU port tcp: Fix potential use-after-free due to double kfree() bpf: Fix signed_{sub,add32}_overflows type handling can: peak_usb: fix use after free bugs can: vxcan: vxcan_xmit: fix use after free bug can: dev: can_restart: fix use after free bug tcp: fix TCP socket rehash stats mis-accounting net: dsa: b53: fix an off by one in checking "vlan->vid" tcp: do not mess with cloned skbs in tcp_add_backlog() selftests: net: fib_tests: remove duplicate log test net: nfc: nci: fix the wrong NCI_CORE_INIT parameters sh_eth: Fix power down vs. is_opened flag ordering net: Disable NETIF_F_HW_TLS_RX when RXCSUM is disabled netfilter: rpfilter: mask ecn bits before fib lookup udp: mask TOS bits in udp_v4_early_demux() xsk: Clear pool even for inactive queues bpf: Fix helper bpf_map_peek_elem_proto pointing to wrong callback sh_eth: Make PHY access aware of Runtime PM to fix reboot crash ...
2021-01-20tcp: Fix potential use-after-free due to double kfree()Kuniyuki Iwashima
Receiving ACK with a valid SYN cookie, cookie_v4_check() allocates struct request_sock and then can allocate inet_rsk(req)->ireq_opt. After that, tcp_v4_syn_recv_sock() allocates struct sock and copies ireq_opt to inet_sk(sk)->inet_opt. Normally, tcp_v4_syn_recv_sock() inserts the full socket into ehash and sets NULL to ireq_opt. Otherwise, tcp_v4_syn_recv_sock() has to reset inet_opt by NULL and free the full socket. The commit 01770a1661657 ("tcp: fix race condition when creating child sockets from syncookies") added a new path, in which more than one cores create full sockets for the same SYN cookie. Currently, the core which loses the race frees the full socket without resetting inet_opt, resulting in that both sock_put() and reqsk_put() call kfree() for the same memory: sock_put sk_free __sk_free sk_destruct __sk_destruct sk->sk_destruct/inet_sock_destruct kfree(rcu_dereference_protected(inet->inet_opt, 1)); reqsk_put reqsk_free __reqsk_free req->rsk_ops->destructor/tcp_v4_reqsk_destructor kfree(rcu_dereference_protected(inet_rsk(req)->ireq_opt, 1)); Calling kmalloc() between the double kfree() can lead to use-after-free, so this patch fixes it by setting NULL to inet_opt before sock_put(). As a side note, this kind of issue does not happen for IPv6. This is because tcp_v6_syn_recv_sock() clones both ipv6_opt and pktopts which correspond to ireq_opt in IPv4. Fixes: 01770a166165 ("tcp: fix race condition when creating child sockets from syncookies") CC: Ricardo Dias <rdias@singlestore.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20210118055920.82516-1-kuniyu@amazon.co.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-20Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski
Daniel Borkmann says: ==================== pull-request: bpf 2021-01-20 1) Fix wrong bpf_map_peek_elem_proto helper callback, from Mircea Cirjaliu. 2) Fix signed_{sub,add32}_overflows type truncation, from Daniel Borkmann. 3) Fix AF_XDP to also clear pools for inactive queues, from Maxim Mikityanskiy. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Fix signed_{sub,add32}_overflows type handling xsk: Clear pool even for inactive queues bpf: Fix helper bpf_map_peek_elem_proto pointing to wrong callback ==================== Link: https://lore.kernel.org/r/20210120163439.8160-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19tcp: fix TCP socket rehash stats mis-accountingYuchung Cheng
The previous commit 32efcc06d2a1 ("tcp: export count for rehash attempts") would mis-account rehashing SNMP and socket stats: a. During handshake of an active open, only counts the first SYN timeout b. After handshake of passive and active open, stop updating after (roughly) TCP_RETRIES1 recurring RTOs c. After the socket aborts, over count timeout_rehash by 1 This patch fixes this by checking the rehash result from sk_rethink_txhash. Fixes: 32efcc06d2a1 ("tcp: export count for rehash attempts") Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20210119192619.1848270-1-ycheng@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19net: fix GSO for SG-enabled devicesPaolo Abeni
The commit dbd50f238dec ("net: move the hsize check to the else block in skb_segment") introduced a data corruption for devices supporting scatter-gather. The problem boils down to signed/unsigned comparison given unexpected results: if signed 'hsize' is negative, it will be considered greater than a positive 'len', which is unsigned. This commit addresses resorting to the old checks order, so that 'hsize' never has a negative value when compared with 'len'. v1 -> v2: - reorder hsize checks instead of explicit cast (Alex) Bisected-by: Matthieu Baerts <matthieu.baerts@tessares.net> Fixes: dbd50f238dec ("net: move the hsize check to the else block in skb_segment") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Link: https://lore.kernel.org/r/861947c2d2d087db82af93c21920ce8147d15490.1611074818.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19tcp: do not mess with cloned skbs in tcp_add_backlog()Eric Dumazet
Heiner Kallweit reported that some skbs were sent with the following invalid GSO properties : - gso_size > 0 - gso_type == 0 This was triggerring a WARN_ON_ONCE() in rtl8169_tso_csum_v2. Juerg Haefliger was able to reproduce a similar issue using a lan78xx NIC and a workload mixing TCP incoming traffic and forwarded packets. The problem is that tcp_add_backlog() is writing over gso_segs and gso_size even if the incoming packet will not be coalesced to the backlog tail packet. While skb_try_coalesce() would bail out if tail packet is cloned, this overwriting would lead to corruptions of other packets cooked by lan78xx, sharing a common super-packet. The strategy used by lan78xx is to use a big skb, and split it into all received packets using skb_clone() to avoid copies. The drawback of this strategy is that all the small skb share a common struct skb_shared_info. This patch rewrites TCP gso_size/gso_segs handling to only happen on the tail skb, since skb_try_coalesce() made sure it was not cloned. Fixes: 4f693b55c3d2 ("tcp: implement coalescing on backlog queue") Signed-off-by: Eric Dumazet <edumazet@google.com> Bisected-by: Juerg Haefliger <juergh@canonical.com> Tested-by: Juerg Haefliger <juergh@canonical.com> Reported-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=209423 Link: https://lore.kernel.org/r/20210119164900.766957-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19taprio: boolean values to a bool variableJiapeng Zhong
Fix the following coccicheck warnings: ./net/sched/sch_taprio.c:393:3-16: WARNING: Assignment of 0/1 to bool variable. ./net/sched/sch_taprio.c:375:2-15: WARNING: Assignment of 0/1 to bool variable. ./net/sched/sch_taprio.c:244:4-19: WARNING: Assignment of 0/1 to bool variable. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Zhong <abaci-bugfix@linux.alibaba.com> Link: https://lore.kernel.org/r/1610958662-71166-1-git-send-email-abaci-bugfix@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19net: nfc: nci: fix the wrong NCI_CORE_INIT parametersBongsu Jeon
Fix the code because NCI_CORE_INIT_CMD includes two parameters in NCI2.0 but there is no parameters in NCI1.x. Fixes: bcd684aace34 ("net/nfc/nci: Support NCI 2.x initial sequence") Signed-off-by: Bongsu Jeon <bongsu.jeon@samsung.com> Link: https://lore.kernel.org/r/20210118205522.317087-1-bongsu.jeon@samsung.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19net: Disable NETIF_F_HW_TLS_RX when RXCSUM is disabledTariq Toukan
With NETIF_F_HW_TLS_RX packets are decrypted in HW. This cannot be logically done when RXCSUM offload is off. Fixes: 14136564c8ee ("net: Add TLS RX offload feature") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Boris Pismenny <borisp@nvidia.com> Link: https://lore.kernel.org/r/20210117151538.9411-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19net: add inline function skb_csum_is_sctpXin Long
This patch is to define a inline function skb_csum_is_sctp(), and also replace all places where it checks if it's a SCTP CSUM skb. This function would be used later in many networking drivers in the following patches. Suggested-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19netfilter: rpfilter: mask ecn bits before fib lookupGuillaume Nault
RT_TOS() only masks one of the two ECN bits. Therefore rpfilter_mt() treats Not-ECT or ECT(1) packets in a different way than those with ECT(0) or CE. Reproducer: Create two netns, connected with a veth: $ ip netns add ns0 $ ip netns add ns1 $ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1 $ ip -netns ns0 link set dev veth01 up $ ip -netns ns1 link set dev veth10 up $ ip -netns ns0 address add 192.0.2.10/32 dev veth01 $ ip -netns ns1 address add 192.0.2.11/32 dev veth10 Add a route to ns1 in ns0: $ ip -netns ns0 route add 192.0.2.11/32 dev veth01 In ns1, only packets with TOS 4 can be routed to ns0: $ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10 Ping from ns0 to ns1 works regardless of the ECN bits, as long as TOS is 4: $ ip netns exec ns0 ping -Q 4 192.0.2.11 # TOS 4, Not-ECT ... 0% packet loss ... $ ip netns exec ns0 ping -Q 5 192.0.2.11 # TOS 4, ECT(1) ... 0% packet loss ... $ ip netns exec ns0 ping -Q 6 192.0.2.11 # TOS 4, ECT(0) ... 0% packet loss ... $ ip netns exec ns0 ping -Q 7 192.0.2.11 # TOS 4, CE ... 0% packet loss ... Now use iptable's rpfilter module in ns1: $ ip netns exec ns1 iptables-legacy -t raw -A PREROUTING -m rpfilter --invert -j DROP Not-ECT and ECT(1) packets still pass: $ ip netns exec ns0 ping -Q 4 192.0.2.11 # TOS 4, Not-ECT ... 0% packet loss ... $ ip netns exec ns0 ping -Q 5 192.0.2.11 # TOS 4, ECT(1) ... 0% packet loss ... But ECT(0) and ECN packets are dropped: $ ip netns exec ns0 ping -Q 6 192.0.2.11 # TOS 4, ECT(0) ... 100% packet loss ... $ ip netns exec ns0 ping -Q 7 192.0.2.11 # TOS 4, CE ... 100% packet loss ... After this patch, rpfilter doesn't drop ECT(0) and CE packets anymore. Fixes: 8f97339d3feb ("netfilter: add ipv4 reverse path filter match") Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19udp: mask TOS bits in udp_v4_early_demux()Guillaume Nault
udp_v4_early_demux() is the only function that calls ip_mc_validate_source() with a TOS that hasn't been masked with IPTOS_RT_MASK. This results in different behaviours for incoming multicast UDPv4 packets, depending on if ip_mc_validate_source() is called from the early-demux path (udp_v4_early_demux) or from the regular input path (ip_route_input_noref). ECN would normally not be used with UDP multicast packets, so the practical consequences should be limited on that side. However, IPTOS_RT_MASK is used to also masks the TOS' high order bits, to align with the non-early-demux path behaviour. Reproducer: Setup two netns, connected with veth: $ ip netns add ns0 $ ip netns add ns1 $ ip -netns ns0 link set dev lo up $ ip -netns ns1 link set dev lo up $ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1 $ ip -netns ns0 link set dev veth01 up $ ip -netns ns1 link set dev veth10 up $ ip -netns ns0 address add 192.0.2.10 peer 192.0.2.11/32 dev veth01 $ ip -netns ns1 address add 192.0.2.11 peer 192.0.2.10/32 dev veth10 In ns0, add route to multicast address 224.0.2.0/24 using source address 198.51.100.10: $ ip -netns ns0 address add 198.51.100.10/32 dev lo $ ip -netns ns0 route add 224.0.2.0/24 dev veth01 src 198.51.100.10 In ns1, define route to 198.51.100.10, only for packets with TOS 4: $ ip -netns ns1 route add 198.51.100.10/32 tos 4 dev veth10 Also activate rp_filter in ns1, so that incoming packets not matching the above route get dropped: $ ip netns exec ns1 sysctl -wq net.ipv4.conf.veth10.rp_filter=1 Now try to receive packets on 224.0.2.11: $ ip netns exec ns1 socat UDP-RECVFROM:1111,ip-add-membership=224.0.2.11:veth10,ignoreeof - In ns0, send packet to 224.0.2.11 with TOS 4 and ECT(0) (that is, tos 6 for socat): $ echo test0 | ip netns exec ns0 socat - UDP-DATAGRAM:224.0.2.11:1111,bind=:1111,tos=6 The "test0" message is properly received by socat in ns1, because early-demux has no cached dst to use, so source address validation is done by ip_route_input_mc(), which receives a TOS that has the ECN bits masked. Now send another packet to 224.0.2.11, still with TOS 4 and ECT(0): $ echo test1 | ip netns exec ns0 socat - UDP-DATAGRAM:224.0.2.11:1111,bind=:1111,tos=6 The "test1" message isn't received by socat in ns1, because, now, early-demux has a cached dst to use and calls ip_mc_validate_source() immediately, without masking the ECN bits. Fixes: bc044e8db796 ("udp: perform source validation for mcast early demux") Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-19xsk: Clear pool even for inactive queuesMaxim Mikityanskiy
The number of queues can change by other means, rather than ethtool. For example, attaching an mqprio qdisc with num_tc > 1 leads to creating multiple sets of TX queues, which may be then destroyed when mqprio is deleted. If an AF_XDP socket is created while mqprio is active, dev->_tx[queue_id].pool will be filled, but then real_num_tx_queues may decrease with deletion of mqprio, which will mean that the pool won't be NULLed, and a further increase of the number of TX queues may expose a dangling pointer. To avoid any potential misbehavior, this commit clears pool for RX and TX queues, regardless of real_num_*_queues, still taking into consideration num_*_queues to avoid overflows. Fixes: 1c1efc2af158 ("xsk: Create and free buffer pool independently from umem") Fixes: a41b4f3c58dd ("xsk: simplify xdp_clear_umem_at_qid implementation") Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Björn Töpel <bjorn.topel@intel.com> Link: https://lore.kernel.org/bpf/20210118160333.333439-1-maximmi@mellanox.com
2021-01-19Merge tag 'nfsd-5.11-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Avoid exposing parent of root directory in NFSv3 READDIRPLUS results - Fix a tracepoint change that went in the initial 5.11 merge * tag 'nfsd-5.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: SUNRPC: Move the svc_xdr_recvfrom tracepoint again nfsd4: readdirplus shouldn't return parent of export
2021-01-19net: core: devlink: use right genl user_ptr when handling port param get/setOleksandr Mazur
Fix incorrect user_ptr dereferencing when handling port param get/set: idx [0] stores the 'struct devlink' pointer; idx [1] stores the 'struct devlink_port' pointer; Fixes: 637989b5d77e ("devlink: Always use user_ptr[0] for devlink and simplify post_doit") CC: Parav Pandit <parav@mellanox.com> Signed-off-by: Oleksandr Mazur <oleksandr.mazur@plvision.eu> Signed-off-by: Vadym Kochan <vadym.kochan@plvision.eu> Link: https://lore.kernel.org/r/20210119085333.16833-1-vadym.kochan@plvision.eu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-18net/tls: Except bond interface from some TLS checksTariq Toukan
In the tls_dev_event handler, ignore tlsdev_ops requirement for bond interfaces, they do not exist as the interaction is done directly with the lower device. Also, make the validate function pass when it's called with the upper bond interface. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Boris Pismenny <borisp@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-18net/tls: Device offload to use lowest netdevice in chainTariq Toukan
Do not call the tls_dev_ops of upper devices. Instead, ask them for the proper lowest device and communicate with it directly. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Boris Pismenny <borisp@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-18net: netdevice: Add operation ndo_sk_get_lower_devTariq Toukan
ndo_sk_get_lower_dev returns the lower netdev that corresponds to a given socket. Additionally, we implement a helper netdev_sk_get_lowest_dev() to get the lowest one in chain. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Boris Pismenny <borisp@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-18net_sched: fix RTNL deadlock again caused by request_module()Cong Wang
tcf_action_init_1() loads tc action modules automatically with request_module() after parsing the tc action names, and it drops RTNL lock and re-holds it before and after request_module(). This causes a lot of troubles, as discovered by syzbot, because we can be in the middle of batch initializations when we create an array of tc actions. One of the problem is deadlock: CPU 0 CPU 1 rtnl_lock(); for (...) { tcf_action_init_1(); -> rtnl_unlock(); -> request_module(); rtnl_lock(); for (...) { tcf_action_init_1(); -> tcf_idr_check_alloc(); // Insert one action into idr, // but it is not committed until // tcf_idr_insert_many(), then drop // the RTNL lock in the _next_ // iteration -> rtnl_unlock(); -> rtnl_lock(); -> a_o->init(); -> tcf_idr_check_alloc(); // Now waiting for the same index // to be committed -> request_module(); -> rtnl_lock() // Now waiting for RTNL lock } rtnl_unlock(); } rtnl_unlock(); This is not easy to solve, we can move the request_module() before this loop and pre-load all the modules we need for this netlink message and then do the rest initializations. So the loop breaks down to two now: for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) { struct tc_action_ops *a_o; a_o = tc_action_load_ops(name, tb[i]...); ops[i - 1] = a_o; } for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) { act = tcf_action_init_1(ops[i - 1]...); } Although this looks serious, it only has been reported by syzbot, so it seems hard to trigger this by humans. And given the size of this patch, I'd suggest to make it to net-next and not to backport to stable. This patch has been tested by syzbot and tested with tdc.py by me. Fixes: 0fedc63fadf0 ("net_sched: commit action insertions together") Reported-and-tested-by: syzbot+82752bc5331601cf4899@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+b3b63b6bff456bd95294@syzkaller.appspotmail.com Reported-by: syzbot+ba67b12b1ca729912834@syzkaller.appspotmail.com Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Tested-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20210117005657.14810-1-xiyou.wangcong@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-18tcp: fix TCP_USER_TIMEOUT with zero windowEnke Chen
The TCP session does not terminate with TCP_USER_TIMEOUT when data remain untransmitted due to zero window. The number of unanswered zero-window probes (tcp_probes_out) is reset to zero with incoming acks irrespective of the window size, as described in tcp_probe_timer(): RFC 1122 4.2.2.17 requires the sender to stay open indefinitely as long as the receiver continues to respond probes. We support this by default and reset icsk_probes_out with incoming ACKs. This counter, however, is the wrong one to be used in calculating the duration that the window remains closed and data remain untransmitted. Thanks to Jonathan Maxwell <jmaxwell37@gmail.com> for diagnosing the actual issue. In this patch a new timestamp is introduced for the socket in order to track the elapsed time for the zero-window probes that have not been answered with any non-zero window ack. Fixes: 9721e709fa68 ("tcp: simplify window probe aborting on USER_TIMEOUT") Reported-by: William McCall <william.mccall@gmail.com> Co-developed-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Enke Chen <enchen@paloaltonetworks.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20210115223058.GA39267@localhost.localdomain Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-18ipv6: set multicast flag on the multicast routeMatteo Croce
The multicast route ff00::/8 is created with type RTN_UNICAST: $ ip -6 -d route unicast ::1 dev lo proto kernel scope global metric 256 pref medium unicast fe80::/64 dev eth0 proto kernel scope global metric 256 pref medium unicast ff00::/8 dev eth0 proto kernel scope global metric 256 pref medium Set the type to RTN_MULTICAST which is more appropriate. Fixes: e8478e80e5a7 ("net/ipv6: Save route type in rt6_info") Signed-off-by: Matteo Croce <mcroce@microsoft.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-18ipv6: create multicast route with RTPROT_KERNELMatteo Croce
The ff00::/8 multicast route is created without specifying the fc_protocol field, so the default RTPROT_BOOT value is used: $ ip -6 -d route unicast ::1 dev lo proto kernel scope global metric 256 pref medium unicast fe80::/64 dev eth0 proto kernel scope global metric 256 pref medium unicast ff00::/8 dev eth0 proto boot scope global metric 256 pref medium As the documentation says, this value identifies routes installed during boot, but the route is created when interface is set up. Change the value to RTPROT_KERNEL which is a better value. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Matteo Croce <mcroce@microsoft.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-18net: bridge: check vlan with eth_type_vlan() methodMenglong Dong
Replace some checks for ETH_P_8021Q and ETH_P_8021AD with eth_type_vlan(). Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Link: https://lore.kernel.org/r/20210117080950.122761-1-dong.menglong@zte.com.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-18Merge tag 'mac80211-for-net-2021-01-18.2' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Various fixes: * kernel-doc parsing fixes * incorrect debugfs string checks * locking fix in regulatory * some encryption-related fixes * tag 'mac80211-for-net-2021-01-18.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211: mac80211: check if atf has been disabled in __ieee80211_schedule_txq mac80211: do not drop tx nulldata packets on encrypted links mac80211: fix encryption key selection for 802.3 xmit mac80211: fix fast-rx encryption check mac80211: fix incorrect strlen of .write in debugfs cfg80211: fix a kerneldoc markup cfg80211: Save the regulatory domain with a lock cfg80211/mac80211: fix kernel-doc for SAR APIs ==================== Link: https://lore.kernel.org/r/20210118204750.7243-1-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-16sctp: remove the NETIF_F_SG flag before calling skb_segmentXin Long
It makes more sense to clear NETIF_F_SG instead of set it when calling skb_segment() in sctp_gso_segment(), since SCTP GSO is using head_skb's fraglist, of which all frags are linear skbs. This will make SCTP GSO code more understandable. Suggested-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-16net: move the hsize check to the else block in skb_segmentXin Long
After commit 89319d3801d1 ("net: Add frag_list support to skb_segment"), it goes to process frag_list when !hsize in skb_segment(). However, when using skb frag_list, sg normally should not be set. In this case, hsize will be set with len right before !hsize check, then it won't go to frag_list processing code. So the right thing to do is move the hsize check to the else block, so that it won't affect the !hsize check for frag_list processing. v1->v2: - change to do "hsize <= 0" check instead of "!hsize", and also move "hsize < 0" into else block, to save some cycles, as Alex suggested. Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-16skbuff: back tiny skbs with kmalloc() in __netdev_alloc_skb() tooAlexander Lobakin
Commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for tiny skbs") ensured that skbs with data size lower than 1025 bytes will be kmalloc'ed to avoid excessive page cache fragmentation and memory consumption. However, the fix adressed only __napi_alloc_skb() (primarily for virtio_net and napi_get_frags()), but the issue can still be achieved through __netdev_alloc_skb(), which is still used by several drivers. Drivers often allocate a tiny skb for headers and place the rest of the frame to frags (so-called copybreak). Mirror the condition to __netdev_alloc_skb() to handle this case too. Since v1 [0]: - fix "Fixes:" tag; - refine commit message (mention copybreak usecase). [0] https://lore.kernel.org/netdev/20210114235423.232737-1-alobakin@pm.me Fixes: a1c7fff7e18f ("net: netdev_alloc_skb() use build_skb()") Signed-off-by: Alexander Lobakin <alobakin@pm.me> Link: https://lore.kernel.org/r/20210115150354.85967-1-alobakin@pm.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15tcp_cubic: use memset and offsetof initYejune Deng
In bictcp_reset(), use memset and offsetof instead of = 0. Signed-off-by: Yejune Deng <yejune.deng@gmail.com> Link: https://lore.kernel.org/r/1610597696-128610-1-git-send-email-yejune.deng@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15nfc: netlink: use &w->w in nfc_genl_rcv_nl_eventGeliang Tang
Use the struct member w of the struct urelease_work directly instead of casting it. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Link: https://lore.kernel.org/r/f0ed86d6d54ac0834bd2e161d172bf7bb5647cf7.1610683862.git.geliangtang@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15net: dsa: add ops for devlink-sbVladimir Oltean
Switches that care about QoS might have hardware support for reserving buffer pools for individual ports or traffic classes, and configuring their sizes and thresholds. Through devlink-sb (shared buffers), this is all configurable, as well as their occupancy being viewable. Add the plumbing in DSA for these operations. Individual drivers still need to call devlink_sb_register() with the shared buffers they want to expose. A helper was not created in DSA for this purpose (unlike, say, dsa_devlink_params_register), since in my opinion it does not bring any benefit over plainly calling devlink_sb_register() directly. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-15net_sched: avoid shift-out-of-bounds in tcindex_set_parms()Eric Dumazet
tc_index being 16bit wide, we need to check that TCA_TCINDEX_SHIFT attribute is not silly. UBSAN: shift-out-of-bounds in net/sched/cls_tcindex.c:260:29 shift exponent 255 is too large for 32-bit type 'int' CPU: 0 PID: 8516 Comm: syz-executor228 Not tainted 5.10.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x107/0x163 lib/dump_stack.c:120 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395 valid_perfect_hash net/sched/cls_tcindex.c:260 [inline] tcindex_set_parms.cold+0x1b/0x215 net/sched/cls_tcindex.c:425 tcindex_change+0x232/0x340 net/sched/cls_tcindex.c:546 tc_new_tfilter+0x13fb/0x21b0 net/sched/cls_api.c:2127 rtnetlink_rcv_msg+0x8b6/0xb80 net/core/rtnetlink.c:5555 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline] netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1330 netlink_sendmsg+0x907/0xe40 net/netlink/af_netlink.c:1919 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:672 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2336 ___sys_sendmsg+0xf3/0x170 net/socket.c:2390 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2423 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20210114185229.1742255-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>