summaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)Author
2022-02-03net: dsa: mv88e6xxx: Improve multichip isolation of standalone portsTobias Waldekranz
Given that standalone ports are now configured to bypass the ATU and forward all frames towards the upstream port, extend the ATU bypass to multichip systems. Load VID 0 (standalone) into the VTU with the policy bit set. Since VID 4095 (bridged) is already loaded, we now know that all VIDs in use are always available in all VTUs. Therefore, we can safely enable 802.1Q on DSA ports. Setting the DSA ports' VTU policy to TRAP means that all incoming frames on VID 0 will be classified as MGMT - as a result, the ATU is bypassed on all subsequent switches. With this isolation in place, we are able to support configurations that are simultaneously very quirky and very useful. Quirky because it involves looping cables between local switchports like in this example: CPU | .------. .---0---. | .----0----. | sw0 | | | sw1 | '-1-2-3-' | '-1-2-3-4-' $ @ '---' $ @ % % We have three physically looped pairs ($, @, and %). This is very useful because it allows us to run the kernel's kselftests for the bridge on mv88e6xxx hardware. Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-03net: dsa: mv88e6xxx: Improve isolation of standalone portsTobias Waldekranz
Clear MapDA on standalone ports to bypass any ATU lookup that might point the packet in the wrong direction. This means that all packets are flooded using the PVT config. So make sure that standalone ports are only allowed to communicate with the local upstream port. Here is a scenario in which this is needed: CPU | .----. .---0---. | .--0--. | sw0 | | | sw1 | '-1-2-3-' | '-1-2-' '---' - sw0p1 and sw1p1 are bridged - sw0p2 and sw1p2 are in standalone mode - Learning must be enabled on sw0p3 in order for hardware forwarding to work properly between bridged ports 1. A packet with SA :aa comes in on sw1p2 1a. Egresses sw1p0 1b. Ingresses sw0p3, ATU adds an entry for :aa towards port 3 1c. Egresses sw0p0 2. A packet with DA :aa comes in on sw0p2 2a. If an ATU lookup is done at this point, the packet will be incorrectly forwarded towards sw0p3. With this change in place, the ATU is bypassed and the packet is forwarded in accordance with the PVT, which only contains the CPU port. Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-03page_pool: Refactor page_pool to enable fragmenting after allocationAlexander Duyck
This change is meant to permit a driver to perform "fragmenting" of the page from within the driver instead of the current model which requires pre-partitioning the page. The main motivation behind this is to support use cases where the page will be split up by the driver after DMA instead of before. With this change it becomes possible to start using page pool to replace some of the existing use cases where multiple references were being used for a single page, but the number needed was unknown as the size could be dynamic. For example, with this code it would be possible to do something like the following to handle allocation: page = page_pool_alloc_pages(); if (!page) return NULL; page_pool_fragment_page(page, DRIVER_PAGECNT_BIAS_MAX); rx_buf->page = page; rx_buf->pagecnt_bias = DRIVER_PAGECNT_BIAS_MAX; Then we would process a received buffer by handling it with: rx_buf->pagecnt_bias--; Once the page has been fully consumed we could then flush the remaining instances with: if (page_pool_defrag_page(page, rx_buf->pagecnt_bias)) continue; page_pool_put_defragged_page(pool, page -1, !!budget); The general idea is that we want to have the ability to allocate a page with excess fragment count and then trim off the unneeded fragments. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-02tcp: Use BPF timeout setting for SYN ACK RTOAkhmat Karakotov
When setting RTO through BPF program, some SYN ACK packets were unaffected and continued to use TCP_TIMEOUT_INIT constant. This patch adds timeout option to struct request_sock. Option is initialized with TCP_TIMEOUT_INIT and is reassigned through BPF using tcp_timeout_init call. SYN ACK retransmits now use newly added timeout option. Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru> Acked-by: Martin KaFai Lau <kafai@fb.com> v2: - Add timeout option to struct request_sock. Do not call tcp_timeout_init on every syn ack retransmit. v3: - Use unsigned long for min. Bound tcp_timeout_init to TCP_RTO_MAX. v4: - Refactor duplicate code by adding reqsk_timeout function. Signed-off-by: David S. Miller <davem@davemloft.net>
2022-02-02net: dsa: provide switch operations for tracking the master stateVladimir Oltean
Certain drivers may need to send management traffic to the switch for things like register access, FDB dump, etc, to accelerate what their slow bus (SPI, I2C, MDIO) can already do. Ethernet is faster (especially in bulk transactions) but is also more unreliable, since the user may decide to bring the DSA master down (or not bring it up), therefore severing the link between the host and the attached switch. Drivers needing Ethernet-based register access already should have fallback logic to the slow bus if the Ethernet method fails, but that fallback may be based on a timeout, and the I/O to the switch may slow down to a halt if the master is down, because every Ethernet packet will have to time out. The driver also doesn't have the option to turn off Ethernet-based I/O momentarily, because it wouldn't know when to turn it back on. Which is where this change comes in. By tracking NETDEV_CHANGE, NETDEV_UP and NETDEV_GOING_DOWN events on the DSA master, we should know the exact interval of time during which this interface is reliably available for traffic. Provide this information to switches so they can use it as they wish. An helper is added dsa_port_master_is_operational() to check if a master port is operational. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-31txhash: Add socket option to control TX hash rethink behaviorAkhmat Karakotov
Add the SO_TXREHASH socket option to control hash rethink behavior per socket. When default mode is set, sockets disable rehash at initialization and use sysctl option when entering listen state. setsockopt() overrides default behavior. Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-31txhash: Make rethinking txhash behavior configurable via sysctlAkhmat Karakotov
Add a per ns sysctl that controls the txhash rethink behavior: net.core.txrehash. When enabled, the same behavior is retained, when disabled, rethink is not performed. Sysctl is enabled by default. Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-31ipv4: Make ip_idents_reserve staticDavid Ahern
ip_idents_reserve is only used in net/ipv4/route.c. Make it static and remove the export. Signed-off-by: David Ahern <dsahern@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-29ipv4: drop fragmentation code from ip_options_build()Jakub Kicinski
Since v2.5.44 and addition of ip_options_fragment() ip_options_build() does not render headers for fragments directly. @is_frag is always 0. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-28Merge tag 'for-net-next-2022-01-28' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: - Add support for RTL8822C hci_ver 0x08 - Add support for RTL8852AE part 0bda:2852 - Fix WBS setting for Intel legacy ROM products - Enable SCO over I2S ib mt7921s - Increment management interface revision * tag 'for-net-next-2022-01-28' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (30 commits) Bluetooth: Increment management interface revision Bluetooth: hci_sync: Fix queuing commands when HCI_UNREGISTER is set Bluetooth: hci_h5: Add power reset via gpio in h5_btrtl_open Bluetooth: btrtl: Add support for RTL8822C hci_ver 0x08 Bluetooth: hci_event: Fix HCI_EV_VENDOR max_len Bluetooth: hci_core: Rate limit the logging of invalid SCO handle Bluetooth: hci_event: Ignore multiple conn complete events Bluetooth: msft: fix null pointer deref on msft_monitor_device_evt Bluetooth: btmtksdio: mask out interrupt status Bluetooth: btmtksdio: run sleep mode by default Bluetooth: btmtksdio: lower log level in btmtksdio_runtime_[resume|suspend]() Bluetooth: mt7921s: fix btmtksdio_[drv|fw]_pmctrl() Bluetooth: mt7921s: fix bus hang with wrong privilege Bluetooth: btmtksdio: refactor btmtksdio_runtime_[suspend|resume]() Bluetooth: mt7921s: fix firmware coredump retrieve Bluetooth: hci_serdev: call init_rwsem() before p->open() Bluetooth: Remove kernel-doc style comment block Bluetooth: btusb: Whitespace fixes for btusb_setup_csr() Bluetooth: btusb: Add one more Bluetooth part for the Realtek RTL8852AE Bluetooth: btintel: Fix WBS setting for Intel legacy ROM products ... ==================== Link: https://lore.kernel.org/r/20220128205915.3995760-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27ipv6: partially inline ipv6_fixup_optionsPavel Begunkov
Inline a part of ipv6_fixup_options() to avoid extra overhead on function call if opt is NULL. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27udp6: pass flow in ip6_make_skb together with corkPavel Begunkov
Another preparation patch. inet_cork_full already contains a field for iflow, so we can avoid passing a separate struct iflow6 into __ip6_append_data() and ip6_make_skb(), and use the flow stored in inet_cork_full. Make sure callers set cork->fl, i.e. we init it in ip6_append_data() and before calling ip6_make_skb(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27ipv4: remove sparse error in ip_neigh_gw4()Eric Dumazet
./include/net/route.h:373:48: warning: incorrect type in argument 2 (different base types) ./include/net/route.h:373:48: expected unsigned int [usertype] key ./include/net/route.h:373:48: got restricted __be32 [usertype] daddr Fixes: 5c9f7c1dfc2e ("ipv4: Add helpers for neigh lookup for nexthop") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20220127013404.1279313-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27ipv4: avoid using shared IP generator for connected socketsEric Dumazet
ip_select_ident_segs() has been very conservative about using the connected socket private generator only for packets with IP_DF set, claiming it was needed for some VJ compression implementations. As mentioned in this referenced document, this can be abused. (Ref: Off-Path TCP Exploits of the Mixed IPID Assignment) Before switching to pure random IPID generation and possibly hurt some workloads, lets use the private inet socket generator. Not only this will remove one vulnerability, this will also improve performance of TCP flows using pmtudisc==IP_PMTUDISC_DONT Fixes: 73f156a6e8c1 ("inetpeer: get rid of ip_id_count") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reported-by: Ray Che <xijiache@gmail.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-27net: sched: remove qdisc_qlen_cpu()Jakub Kicinski
Never used since it was added in v5.2. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-27net: sched: remove psched_tdiff_bounded()Jakub Kicinski
Not used since v3.9. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-27udplite: remove udplite_csum_outgoing()Jakub Kicinski
Not used since v4.0. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-27net: ax25: remove route refcountJakub Kicinski
Nothing takes the refcount since v4.9. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-27net: remove bond_slave_has_mac_rcu()Jakub Kicinski
No caller since v3.16. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-27Revert "ipv6: Honor all IPv6 PIO Valid Lifetime values"Guillaume Nault
This reverts commit b75326c201242de9495ff98e5d5cff41d7fc0d9d. This commit breaks Linux compatibility with USGv6 tests. The RFC this commit was based on is actually an expired draft: no published RFC currently allows the new behaviour it introduced. Without full IETF endorsement, the flash renumbering scenario this patch was supposed to enable is never going to work, as other IPv6 equipements on the same LAN will keep the 2 hours limit. Fixes: b75326c20124 ("ipv6: Honor all IPv6 PIO Valid Lifetime values") Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-27ipv4: Namespaceify min_adv_mss sysctl knobxu xin
Different netns has different requirement on the setting of min_adv_mss sysctl which the advertised MSS will be never lower than. Enable min_adv_mss to be configured per network namespace. Signed-off-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-26tcp: allocate tcp_death_row outside of struct netns_ipv4Eric Dumazet
I forgot tcp had per netns tracking of timewait sockets, and their sysctl to change the limit. After 0dad4087a86a ("tcp/dccp: get rid of inet_twsk_purge()"), whole struct net can be freed before last tw socket is freed. We need to allocate a separate struct inet_timewait_death_row object per netns. tw_count becomes a refcount and gains associated debugging infrastructure. BUG: KASAN: use-after-free in inet_twsk_kill+0x358/0x3c0 net/ipv4/inet_timewait_sock.c:46 Read of size 8 at addr ffff88807d5f9f40 by task kworker/1:7/3690 CPU: 1 PID: 3690 Comm: kworker/1:7 Not tainted 5.16.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: events pwq_unbound_release_workfn Call Trace: <IRQ> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255 __kasan_report mm/kasan/report.c:442 [inline] kasan_report.cold+0x83/0xdf mm/kasan/report.c:459 inet_twsk_kill+0x358/0x3c0 net/ipv4/inet_timewait_sock.c:46 call_timer_fn+0x1a5/0x6b0 kernel/time/timer.c:1421 expire_timers kernel/time/timer.c:1466 [inline] __run_timers.part.0+0x67c/0xa30 kernel/time/timer.c:1734 __run_timers kernel/time/timer.c:1715 [inline] run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1747 __do_softirq+0x29b/0x9c2 kernel/softirq.c:558 invoke_softirq kernel/softirq.c:432 [inline] __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637 irq_exit_rcu+0x5/0x20 kernel/softirq.c:649 sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097 </IRQ> <TASK> asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638 RIP: 0010:lockdep_unregister_key+0x1c9/0x250 kernel/locking/lockdep.c:6328 Code: 00 00 00 48 89 ee e8 46 fd ff ff 4c 89 f7 e8 5e c9 ff ff e8 09 cc ff ff 9c 58 f6 c4 02 75 26 41 f7 c4 00 02 00 00 74 01 fb 5b <5d> 41 5c 41 5d 41 5e 41 5f e9 19 4a 08 00 0f 0b 5b 5d 41 5c 41 5d RSP: 0018:ffffc90004077cb8 EFLAGS: 00000206 RAX: 0000000000000046 RBX: ffff88807b61b498 RCX: 0000000000000001 RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff888077027128 R08: 0000000000000001 R09: ffffffff8f1ea4fc R10: fffffbfff1ff93ee R11: 000000000000af1e R12: 0000000000000246 R13: 0000000000000000 R14: ffffffff8ffc89b8 R15: ffffffff90157fb0 wq_unregister_lockdep kernel/workqueue.c:3508 [inline] pwq_unbound_release_workfn+0x254/0x340 kernel/workqueue.c:3746 process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307 worker_thread+0x657/0x1110 kernel/workqueue.c:2454 kthread+0x2e9/0x3a0 kernel/kthread.c:377 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 </TASK> Allocated by task 3635: kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:46 [inline] set_alloc_info mm/kasan/common.c:437 [inline] __kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:470 kasan_slab_alloc include/linux/kasan.h:260 [inline] slab_post_alloc_hook mm/slab.h:732 [inline] slab_alloc_node mm/slub.c:3230 [inline] slab_alloc mm/slub.c:3238 [inline] kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3243 kmem_cache_zalloc include/linux/slab.h:705 [inline] net_alloc net/core/net_namespace.c:407 [inline] copy_net_ns+0x125/0x760 net/core/net_namespace.c:462 create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110 unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226 ksys_unshare+0x445/0x920 kernel/fork.c:3048 __do_sys_unshare kernel/fork.c:3119 [inline] __se_sys_unshare kernel/fork.c:3117 [inline] __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3117 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae The buggy address belongs to the object at ffff88807d5f9a80 which belongs to the cache net_namespace of size 6528 The buggy address is located 1216 bytes inside of 6528-byte region [ffff88807d5f9a80, ffff88807d5fb400) The buggy address belongs to the page: page:ffffea0001f57e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88807d5f9a80 pfn:0x7d5f8 head:ffffea0001f57e00 order:3 compound_mapcount:0 compound_pincount:0 memcg:ffff888070023001 flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff) raw: 00fff00000010200 ffff888010dd4f48 ffffea0001404e08 ffff8880118fd000 raw: ffff88807d5f9a80 0000000000040002 00000001ffffffff ffff888070023001 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 3634, ts 119694798460, free_ts 119693556950 prep_new_page mm/page_alloc.c:2434 [inline] get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4165 __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5389 alloc_pages+0x1aa/0x310 mm/mempolicy.c:2271 alloc_slab_page mm/slub.c:1799 [inline] allocate_slab mm/slub.c:1944 [inline] new_slab+0x28a/0x3b0 mm/slub.c:2004 ___slab_alloc+0x87c/0xe90 mm/slub.c:3018 __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3105 slab_alloc_node mm/slub.c:3196 [inline] slab_alloc mm/slub.c:3238 [inline] kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3243 kmem_cache_zalloc include/linux/slab.h:705 [inline] net_alloc net/core/net_namespace.c:407 [inline] copy_net_ns+0x125/0x760 net/core/net_namespace.c:462 create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110 unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226 ksys_unshare+0x445/0x920 kernel/fork.c:3048 __do_sys_unshare kernel/fork.c:3119 [inline] __se_sys_unshare kernel/fork.c:3117 [inline] __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3117 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae page last free stack trace: reset_page_owner include/linux/page_owner.h:24 [inline] free_pages_prepare mm/page_alloc.c:1352 [inline] free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1404 free_unref_page_prepare mm/page_alloc.c:3325 [inline] free_unref_page+0x19/0x690 mm/page_alloc.c:3404 skb_free_head net/core/skbuff.c:655 [inline] skb_release_data+0x65d/0x790 net/core/skbuff.c:677 skb_release_all net/core/skbuff.c:742 [inline] __kfree_skb net/core/skbuff.c:756 [inline] consume_skb net/core/skbuff.c:914 [inline] consume_skb+0xc2/0x160 net/core/skbuff.c:908 skb_free_datagram+0x1b/0x1f0 net/core/datagram.c:325 netlink_recvmsg+0x636/0xea0 net/netlink/af_netlink.c:1998 sock_recvmsg_nosec net/socket.c:948 [inline] sock_recvmsg net/socket.c:966 [inline] sock_recvmsg net/socket.c:962 [inline] ____sys_recvmsg+0x2c4/0x600 net/socket.c:2632 ___sys_recvmsg+0x127/0x200 net/socket.c:2674 __sys_recvmsg+0xe2/0x1a0 net/socket.c:2704 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae Memory state around the buggy address: ffff88807d5f9e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88807d5f9e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff88807d5f9f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88807d5f9f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88807d5fa000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Fixes: 0dad4087a86a ("tcp/dccp: get rid of inet_twsk_purge()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reported-by: Paolo Abeni <pabeni@redhat.com> Tested-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20220126180714.845362-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-25ipv4/tcp: do not use per netns ctl socketsEric Dumazet
TCP ipv4 uses per-cpu/per-netns ctl sockets in order to send RST and some ACK packets (on behalf of TIMEWAIT sockets). This adds memory and cpu costs, which do not seem needed. Now typical servers have 256 or more cores, this adds considerable tax to netns users. tcp sockets are used from BH context, are not receiving packets, and do not store any persistent state but the 'struct net' pointer in order to be able to use IPv4 output functions. Note that I attempted a related change in the past, that had to be hot-fixed in commit bdbbb8527b6f ("ipv4: tcp: get rid of ugly unicast_sock") This patch could very well surface old bugs, on layers not taking care of sk->sk_kern_sock properly. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-25ipv6: do not use per netns icmp socketsEric Dumazet
Back in linux-2.6.25 (commit 98c6d1b261e7 "[NETNS]: Make icmpv6_sk per namespace.", we added private per-cpu/per-netns ipv6 icmp sockets. This adds memory and cpu costs, which do not seem needed. Now typical servers have 256 or more cores, this adds considerable tax to netns users. icmp sockets are used from BH context, are not receiving packets, and do not store any persistent state but the 'struct net' pointer. icmpv6_xmit_lock() already makes sure to lock the chosen per-cpu socket. This patch has a considerable impact on the number of netns that the worker thread in cleanup_net() can dismantle per second, because ip6mr_sk_done() is no longer called, meaning we no longer acquire the rtnl mutex, competing with other threads adding new netns. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-25ipv4: do not use per netns icmp socketsEric Dumazet
Back in linux-2.6.25 (commit 4a6ad7a141cb "[NETNS]: Make icmp_sk per namespace."), we added private per-cpu/per-netns ipv4 icmp sockets. This adds memory and cpu costs, which do not seem needed. Now typical servers have 256 or more cores, this adds considerable tax to netns users. icmp sockets are used from BH context, are not receiving packets, and do not store any persistent state but the 'struct net' pointer. icmp_xmit_lock() already makes sure to lock the chosen per-cpu socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-25tcp/dccp: get rid of inet_twsk_purge()Eric Dumazet
Prior patches in the series made sure tw_timer_handler() can be fired after netns has been dismantled/freed. We no longer have to scan a potentially big TCP ehash table at netns dismantle. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-25tcp/dccp: no longer use twsk_net(tw) from tw_timer_handler()Eric Dumazet
We will soon get rid of inet_twsk_purge(). This means that tw_timer_handler() might fire after a netns has been dismantled/freed. Instead of adding a function (and data structure) to find a netns from tw->tw_net_cookie, just update the SNMP counters a bit earlier, when the netns is known to be alive. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-25tcp/dccp: add tw->tw_bslotEric Dumazet
We want to allow inet_twsk_kill() working even if netns has been dismantled/freed, to get rid of inet_twsk_purge(). This patch adds tw->tw_bslot to cache the bind bucket slot so that inet_twsk_kill() no longer needs to dereference twsk_net(tw) Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-24Bluetooth: hci_event: Ignore multiple conn complete eventsSoenke Huster
When one of the three connection complete events is received multiple times for the same handle, the device is registered multiple times which leads to memory corruptions. Therefore, consequent events for a single connection are ignored. The conn->state can hold different values, therefore HCI_CONN_HANDLE_UNSET is introduced to identify new connections. To make sure the events do not contain this or another invalid handle HCI_CONN_HANDLE_MAX and checks are introduced. Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=215497 Signed-off-by: Soenke Huster <soenke.huster@eknoes.de> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2022-01-24Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Daniel Borkmann says: ==================== pull-request: bpf-next 2022-01-24 We've added 80 non-merge commits during the last 14 day(s) which contain a total of 128 files changed, 4990 insertions(+), 895 deletions(-). The main changes are: 1) Add XDP multi-buffer support and implement it for the mvneta driver, from Lorenzo Bianconi, Eelco Chaudron and Toke Høiland-Jørgensen. 2) Add unstable conntrack lookup helpers for BPF by using the BPF kfunc infra, from Kumar Kartikeya Dwivedi. 3) Extend BPF cgroup programs to export custom ret value to userspace via two helpers bpf_get_retval() and bpf_set_retval(), from YiFei Zhu. 4) Add support for AF_UNIX iterator batching, from Kuniyuki Iwashima. 5) Complete missing UAPI BPF helper description and change bpf_doc.py script to enforce consistent & complete helper documentation, from Usama Arif. 6) Deprecate libbpf's legacy BPF map definitions and streamline XDP APIs to follow tc-based APIs, from Andrii Nakryiko. 7) Support BPF_PROG_QUERY for BPF programs attached to sockmap, from Di Zhu. 8) Deprecate libbpf's bpf_map__def() API and replace users with proper getters and setters, from Christy Lee. 9) Extend libbpf's btf__add_btf() with an additional hashmap for strings to reduce overhead, from Kui-Feng Lee. 10) Fix bpftool and libbpf error handling related to libbpf's hashmap__new() utility function, from Mauricio Vásquez. 11) Add support to BTF program names in bpftool's program dump, from Raman Shukhau. 12) Fix resolve_btfids build to pick up host flags, from Connor O'Brien. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (80 commits) selftests, bpf: Do not yet switch to new libbpf XDP APIs selftests, xsk: Fix rx_full stats test bpf: Fix flexible_array.cocci warnings xdp: disable XDP_REDIRECT for xdp frags bpf: selftests: add CPUMAP/DEVMAP selftests for xdp frags bpf: selftests: introduce bpf_xdp_{load,store}_bytes selftest net: xdp: introduce bpf_xdp_pointer utility routine bpf: generalise tail call map compatibility check libbpf: Add SEC name for xdp frags programs bpf: selftests: update xdp_adjust_tail selftest to include xdp frags bpf: test_run: add xdp_shared_info pointer in bpf_test_finish signature bpf: introduce frags support to bpf_prog_test_run_xdp() bpf: move user_size out of bpf_test_init bpf: add frags support to xdp copy helpers bpf: add frags support to the bpf_xdp_adjust_tail() API bpf: introduce bpf_xdp_get_buff_len helper net: mvneta: enable jumbo frames if the loaded XDP program support frags bpf: introduce BPF_F_XDP_HAS_FRAGS flag in prog_flags loading the ebpf program net: mvneta: add frags support to XDP_TX xdp: add frags support to xdp_return_{buff/frame} ... ==================== Link: https://lore.kernel.org/r/20220124221235.18993-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-24bonding: use rcu_dereference_rtnl when get bonding active slaveHangbin Liu
bond_option_active_slave_get_rcu() should not be used in rtnl_mutex as it use rcu_dereference(). Replace to rcu_dereference_rtnl() so we also can use this function in rtnl protected context. With this update, we can rmeove the rcu_read_lock/unlock in bonding .ndo_eth_ioctl and .get_ts_info. Reported-by: Vladimir Oltean <vladimir.oltean@nxp.com> Fixes: 94dd016ae538 ("bond: pass get_ts_info and SIOC[SG]HWTSTAMP ioctl to active device") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-21bpf: add frags support to the bpf_xdp_adjust_tail() APIEelco Chaudron
This change adds support for tail growing and shrinking for XDP frags. When called on a non-linear packet with a grow request, it will work on the last fragment of the packet. So the maximum grow size is the last fragments tailroom, i.e. no new buffer will be allocated. A XDP frags capable driver is expected to set frag_size in xdp_rxq_info data structure to notify the XDP core the fragment size. frag_size set to 0 is interpreted by the XDP core as tail growing is not allowed. Introduce __xdp_rxq_info_reg utility routine to initialize frag_size field. When shrinking, it will work from the last fragment, all the way down to the base buffer depending on the shrinking size. It's important to mention that once you shrink down the fragment(s) are freed, so you can not grow again to the original size. Acked-by: Toke Hoiland-Jorgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/eabda3485dda4f2f158b477729337327e609461d.1642758637.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-21bpf: introduce bpf_xdp_get_buff_len helperLorenzo Bianconi
Introduce bpf_xdp_get_buff_len helper in order to return the xdp buffer total size (linear and paged area) Acked-by: Toke Hoiland-Jorgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/aac9ac3504c84026cf66a3c71b7c5ae89bc991be.1642758637.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-21xdp: add frags support to xdp_return_{buff/frame}Lorenzo Bianconi
Take into account if the received xdp_buff/xdp_frame is non-linear recycling/returning the frame memory to the allocator or into xdp_frame_bulk. Acked-by: Toke Hoiland-Jorgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/a961069febc868508ce1bdf5e53a343eb4e57cb2.1642758637.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-21net: xdp: add xdp_update_skb_shared_info utility routineLorenzo Bianconi
Introduce xdp_update_skb_shared_info routine to update frags array metadata in skb_shared_info data structure converting to a skb from a xdp_buff or xdp_frame. According to the current skb_shared_info architecture in xdp_frame/xdp_buff and to the xdp frags support, there is no need to run skb_add_rx_frag() and reset frags array converting the buffer to a skb since the frag array will be in the same position for xdp_buff/xdp_frame and for the skb, we just need to update memory metadata. Introduce XDP_FLAGS_PF_MEMALLOC flag in xdp_buff_flags in order to mark the xdp_buff or xdp_frame as under memory-pressure if pages of the frags array are under memory pressure. Doing so we can avoid looping over all fragments in xdp_update_skb_shared_info routine. The driver is expected to set the flag constructing the xdp_buffer using xdp_buff_set_frag_pfmemalloc utility routine. Rely on xdp_update_skb_shared_info in __xdp_build_skb_from_frame routine converting the non-linear xdp_frame to a skb after performing a XDP_REDIRECT. Acked-by: Toke Hoiland-Jorgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/bfd23fb8a8d7438724f7819c567cdf99ffd6226f.1642758637.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-21xdp: introduce flags field in xdp_buff/xdp_frameLorenzo Bianconi
Introduce flags field in xdp_frame and xdp_buffer data structures to define additional buffer features. At the moment the only supported buffer feature is frags bit (XDP_FLAGS_HAS_FRAGS). frags bit is used to specify if this is a linear buffer (XDP_FLAGS_HAS_FRAGS not set) or a frags frame (XDP_FLAGS_HAS_FRAGS set). In the latter case the driver is expected to initialize the skb_shared_info structure at the end of the first buffer to link together subsequent buffers belonging to the same frame. Acked-by: Toke Hoiland-Jorgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/e389f14f3a162c0a5bc6a2e1aa8dd01a90be117d.1642758637.git.lorenzo@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-20ipv6: annotate accesses to fn->fn_sernumEric Dumazet
struct fib6_node's fn_sernum field can be read while other threads change it. Add READ_ONCE()/WRITE_ONCE() annotations. Do not change existing smp barriers in fib6_get_cookie_safe() and __fib6_update_sernum_upto_root() syzbot reported: BUG: KCSAN: data-race in fib6_clean_node / inet6_csk_route_socket write to 0xffff88813df62e2c of 4 bytes by task 1920 on cpu 1: fib6_clean_node+0xc2/0x260 net/ipv6/ip6_fib.c:2178 fib6_walk_continue+0x38e/0x430 net/ipv6/ip6_fib.c:2112 fib6_walk net/ipv6/ip6_fib.c:2160 [inline] fib6_clean_tree net/ipv6/ip6_fib.c:2240 [inline] __fib6_clean_all+0x1a9/0x2e0 net/ipv6/ip6_fib.c:2256 fib6_flush_trees+0x6c/0x80 net/ipv6/ip6_fib.c:2281 rt_genid_bump_ipv6 include/net/net_namespace.h:488 [inline] addrconf_dad_completed+0x57f/0x870 net/ipv6/addrconf.c:4230 addrconf_dad_work+0x908/0x1170 process_one_work+0x3f6/0x960 kernel/workqueue.c:2307 worker_thread+0x616/0xa70 kernel/workqueue.c:2454 kthread+0x1bf/0x1e0 kernel/kthread.c:359 ret_from_fork+0x1f/0x30 read to 0xffff88813df62e2c of 4 bytes by task 15701 on cpu 0: fib6_get_cookie_safe include/net/ip6_fib.h:285 [inline] rt6_get_cookie include/net/ip6_fib.h:306 [inline] ip6_dst_store include/net/ip6_route.h:234 [inline] inet6_csk_route_socket+0x352/0x3c0 net/ipv6/inet6_connection_sock.c:109 inet6_csk_xmit+0x91/0x1e0 net/ipv6/inet6_connection_sock.c:121 __tcp_transmit_skb+0x1323/0x1840 net/ipv4/tcp_output.c:1402 tcp_transmit_skb net/ipv4/tcp_output.c:1420 [inline] tcp_write_xmit+0x1450/0x4460 net/ipv4/tcp_output.c:2680 __tcp_push_pending_frames+0x68/0x1c0 net/ipv4/tcp_output.c:2864 tcp_push+0x2d9/0x2f0 net/ipv4/tcp.c:725 mptcp_push_release net/mptcp/protocol.c:1491 [inline] __mptcp_push_pending+0x46c/0x490 net/mptcp/protocol.c:1578 mptcp_sendmsg+0x9ec/0xa50 net/mptcp/protocol.c:1764 inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:643 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] kernel_sendmsg+0x97/0xd0 net/socket.c:745 sock_no_sendpage+0x84/0xb0 net/core/sock.c:3086 inet_sendpage+0x9d/0xc0 net/ipv4/af_inet.c:834 kernel_sendpage+0x187/0x200 net/socket.c:3492 sock_sendpage+0x5a/0x70 net/socket.c:1007 pipe_to_sendpage+0x128/0x160 fs/splice.c:364 splice_from_pipe_feed fs/splice.c:418 [inline] __splice_from_pipe+0x207/0x500 fs/splice.c:562 splice_from_pipe fs/splice.c:597 [inline] generic_splice_sendpage+0x94/0xd0 fs/splice.c:746 do_splice_from fs/splice.c:767 [inline] direct_splice_actor+0x80/0xa0 fs/splice.c:936 splice_direct_to_actor+0x345/0x650 fs/splice.c:891 do_splice_direct+0x106/0x190 fs/splice.c:979 do_sendfile+0x675/0xc40 fs/read_write.c:1245 __do_sys_sendfile64 fs/read_write.c:1310 [inline] __se_sys_sendfile64 fs/read_write.c:1296 [inline] __x64_sys_sendfile64+0x102/0x140 fs/read_write.c:1296 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x0000026f -> 0x00000271 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 15701 Comm: syz-executor.2 Not tainted 5.16.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 The Fixes tag I chose is probably arbitrary, I do not think we need to backport this patch to older kernels. Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20220120174112.1126644-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-20tcp: Add a stub for sk_defer_free_flush()Gal Pressman
When compiling the kernel with CONFIG_INET disabled, the sk_defer_free_flush() should be defined as a nop. This resolves the following compilation error: ld: net/core/sock.o: in function `sk_defer_free_flush': ./include/net/tcp.h:1378: undefined reference to `__sk_defer_free_flush' Fixes: 79074a72d335 ("net: Flush deferred skb free on socket destroy") Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220120123440.9088-1-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-20Bluetooth: mgmt: Add MGMT Adv Monitor Device Found/Lost eventsManish Mandlik
This patch introduces two new MGMT events for notifying the bluetoothd whenever the controller starts/stops monitoring a device. Test performed: - Verified by logs that the MSFT Monitor Device is received from the controller and the bluetoothd is notified whenever the controller starts/stops monitoring a device. Signed-off-by: Manish Mandlik <mmandlik@google.com> Reviewed-by: Miao-chen Chou <mcchou@google.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2022-01-20Bluetooth: msft: Handle MSFT Monitor Device EventManish Mandlik
Whenever the controller starts/stops monitoring a bt device, it sends MSFT Monitor Device event. Add handler to read this vendor event. Test performed: - Verified by logs that the MSFT Monitor Device event is received from the controller whenever it starts/stops monitoring a device. Signed-off-by: Manish Mandlik <mmandlik@google.com> Reviewed-by: Miao-chen Chou <mcchou@google.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2022-01-20Merge tag 'net-5.17-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter, bpf. Quite a handful of old regression fixes but most of those are pre-5.16. Current release - regressions: - fix memory leaks in the skb free deferral scheme if upper layer protocols are used, i.e. in-kernel TCP readers like TLS Current release - new code bugs: - nf_tables: fix NULL check typo in _clone() functions - change the default to y for Vertexcom vendor Kconfig - a couple of fixes to incorrect uses of ref tracking - two fixes for constifying netdev->dev_addr Previous releases - regressions: - bpf: - various verifier fixes mainly around register offset handling when passed to helper functions - fix mount source displayed for bpffs (none -> bpffs) - bonding: - fix extraction of ports for connection hash calculation - fix bond_xmit_broadcast return value when some devices are down - phy: marvell: add Marvell specific PHY loopback - sch_api: don't skip qdisc attach on ingress, prevent ref leak - htb: restore minimal packet size handling in rate control - sfp: fix high power modules without diagnostic monitoring - mscc: ocelot: - don't let phylink re-enable TX PAUSE on the NPI port - don't dereference NULL pointers with shared tc filters - smsc95xx: correct reset handling for LAN9514 - cpsw: avoid alignment faults by taking NET_IP_ALIGN into account - phy: micrel: use kszphy_suspend/_resume for irq aware devices, avoid races with the interrupt Previous releases - always broken: - xdp: check prog type before updating BPF link - smc: resolve various races around abnormal connection termination - sit: allow encapsulated IPv6 traffic to be delivered locally - axienet: fix init/reset handling, add missing barriers, read the right status words, stop queues correctly - add missing dev_put() in sock_timestamping_bind_phc() Misc: - ipv4: prevent accidentally passing RTO_ONLINK to ip_route_output_key_hash() by sanitizing flags - ipv4: avoid quadratic behavior in netns dismantle - stmmac: dwmac-oxnas: add support for OX810SE - fsl: xgmac_mdio: add workaround for erratum A-009885" * tag 'net-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (92 commits) ipv4: add net_hash_mix() dispersion to fib_info_laddrhash keys ipv4: avoid quadratic behavior in netns dismantle net/fsl: xgmac_mdio: Fix incorrect iounmap when removing module powerpc/fsl/dts: Enable WA for erratum A-009885 on fman3l MDIO buses dt-bindings: net: Document fsl,erratum-a009885 net/fsl: xgmac_mdio: Add workaround for erratum A-009885 net: mscc: ocelot: fix using match before it is set net: phy: micrel: use kszphy_suspend()/kszphy_resume for irq aware devices net: cpsw: avoid alignment faults by taking NET_IP_ALIGN into account nfc: llcp: fix NULL error pointer dereference on sendmsg() after failed bind() net: axienet: increase default TX ring size to 128 net: axienet: fix for TX busy handling net: axienet: fix number of TX ring slots for available check net: axienet: Fix TX ring slot available check net: axienet: limit minimum TX ring size net: axienet: add missing memory barriers net: axienet: reset core on initialization prior to MDIO access net: axienet: Wait for PhyRstCmplt after core reset net: axienet: increase reset timeout bpf, selftests: Add ringbuf memory type confusion test ...
2022-01-18net/netfilter: Add unstable CT lookup helpers for XDP and TC-BPFKumar Kartikeya Dwivedi
This change adds conntrack lookup helpers using the unstable kfunc call interface for the XDP and TC-BPF hooks. The primary usecase is implementing a synproxy in XDP, see Maxim's patchset [0]. Export get_net_ns_by_id as nf_conntrack_bpf.c needs to call it. This object is only built when CONFIG_DEBUG_INFO_BTF_MODULES is enabled. [0]: https://lore.kernel.org/bpf/20211019144655.3483197-1-maximmi@nvidia.com Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20220114163953.1455836-7-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-16Merge tag '9p-for-5.17-rc1' of git://github.com/martinetd/linuxLinus Torvalds
Pull 9p updates from Dominique Martinet: "Fixes, split 9p_net_fd, and new reviewer: - fix possible uninitialized memory usage for setattr - fix fscache reading hole in a file just after it's been grown - split net/9p/trans_fd.c in its own module like other transports. The new transport module defaults to 9P_NET and is autoloaded if required so users should not be impacted - add Christian Schoenebeck to 9p reviewers - some more trivial cleanup" * tag '9p-for-5.17-rc1' of git://github.com/martinetd/linux: 9p: fix enodata when reading growing file net/9p: show error message if user 'msize' cannot be satisfied MAINTAINERS: 9p: add Christian Schoenebeck as reviewer 9p: only copy valid iattrs in 9P2000.L setattr implementation 9p: Use BUG_ON instead of if condition followed by BUG. net/p9: load default transports 9p/xen: autoload when xenbus service is available 9p/trans_fd: split into dedicated module fs: 9p: remove unneeded variable 9p/trans_virtio: Fix typo in the comment for p9_virtio_create()
2022-01-13net_sched: restore "mpu xxx" handlingKevin Bracey
commit 56b765b79e9a ("htb: improved accuracy at high rates") broke "overhead X", "linklayer atm" and "mpu X" attributes. "overhead X" and "linklayer atm" have already been fixed. This restores the "mpu X" handling, as might be used by DOCSIS or Ethernet shaping: tc class add ... htb rate X overhead 4 mpu 64 The code being fixed is used by htb, tbf and act_police. Cake has its own mpu handling. qdisc_calculate_pkt_len still uses the size table containing values adjusted for mpu by user space. iproute2 tc has always passed mpu into the kernel via a tc_ratespec structure, but the kernel never directly acted on it, merely stored it so that it could be read back by `tc class show`. Rather, tc would generate length-to-time tables that included the mpu (and linklayer) in their construction, and the kernel used those tables. Since v3.7, the tables were no longer used. Along with "mpu", this also broke "overhead" and "linklayer" which were fixed in 01cb71d2d47b ("net_sched: restore "overhead xxx" handling", v3.10) and 8a8e3d84b171 ("net_sched: restore "linklayer atm" handling", v3.11). "overhead" was fixed by simply restoring use of tc_ratespec::overhead - this had originally been used by the kernel but was initially omitted from the new non-table-based calculations. "linklayer" had been handled in the table like "mpu", but the mode was not originally passed in tc_ratespec. The new implementation was made to handle it by getting new versions of tc to pass the mode in an extended tc_ratespec, and for older versions of tc the table contents were analysed at load time to deduce linklayer. As "mpu" has always been given to the kernel in tc_ratespec, accompanying the mpu-based table, we can restore system functionality with no userspace change by making the kernel act on the tc_ratespec value. Fixes: 56b765b79e9a ("htb: improved accuracy at high rates") Signed-off-by: Kevin Bracey <kevin@bracey.fi> Cc: Eric Dumazet <edumazet@google.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: Vimalkumar <j.vimal@gmail.com> Link: https://lore.kernel.org/r/20220112170210.1014351-1-kevin@bracey.fi Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-13inet: frags: annotate races around fqdir->dead and fqdir->high_threshEric Dumazet
Both fields can be read/written without synchronization, add proper accessors and documentation. Fixes: d5dd88794a13 ("inet: fix various use-after-free in defrags units") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-11net: sched: do not allocate a tracker in tcf_exts_init()Eric Dumazet
While struct tcf_exts has a net pointer, it is not refcounted until tcf_exts_get_net() is called. Fixes: dbdcda634ce3 ("net: sched: add netns refcount tracker to struct tcf_exts") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20220110094750.236478-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Merge in fixes directly in prep for the 5.17 merge window. No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-10net/p9: load default transportsThomas Weißschuh
Now that all transports are split into modules it may happen that no transports are registered when v9fs_get_default_trans() is called. When that is the case try to load more transports from modules. Link: https://lkml.kernel.org/r/20211103193823.111007-5-linux@weissschuh.net Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> [Dominique: constify v9fs_get_trans_by_name argument as per patch1v2] Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2022-01-109p/trans_fd: split into dedicated moduleThomas Weißschuh
This allows these transports only to be used when needed. Link: https://lkml.kernel.org/r/20211103193823.111007-3-linux@weissschuh.net Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> [Dominique: Kconfig NET_9P_FD: -depends VIRTIO, +default NET_9P] Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>