summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2024-07-12tipc: Consolidate redundant functionsShigeru Yoshida
link_is_up() and tipc_link_is_up() have the same functionality. Consolidate these functions. Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Reviewed-by: Tung Nguyen <tung.q.nguyen@endava.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-12tipc: Remove unused struct declarationShigeru Yoshida
struct tipc_name_table in core.h is not used. Remove this declaration. Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Reviewed-by: Tung Nguyen <tung.q.nguyen@endava.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-11netdevice: define and allocate &net_device _properly_Alexander Lobakin
In fact, this structure contains a flexible array at the end, but historically its size, alignment etc., is calculated manually. There are several instances of the structure embedded into other structures, but also there's ongoing effort to remove them and we could in the meantime declare &net_device properly. Declare the array explicitly, use struct_size() and store the array size inside the structure, so that __counted_by() can be applied. Don't use PTR_ALIGN(), as SLUB itself tries its best to ensure the allocated buffer is aligned to what the user expects. Also, change its alignment from %NETDEV_ALIGN to the cacheline size as per several suggestions on the netdev ML. bloat-o-meter for vmlinux: free_netdev 445 440 -5 netdev_freemem 24 - -24 alloc_netdev_mqs 1481 1450 -31 On x86_64 with several NICs of different vendors, I was never able to get a &net_device pointer not aligned to the cacheline size after the change. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20240710113036.2125584-1-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-11net: psample: fix flag being set in wrong skbAdrian Moreno
A typo makes PSAMPLE_ATTR_SAMPLE_RATE netlink flag be added to the wrong sk_buff. Fix the error and make the input sk_buff pointer "const" so that it doesn't happen again. Acked-by: Eelco Chaudron <echaudro@redhat.com> Fixes: 7b1b2b60c63f ("net: psample: allow using rate as probability") Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Antoine Tenart <atenart@kernel.org> Link: https://patch.msgid.link/20240710171004.2164034-1-amorenoz@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-11Merge tag 'wireless-next-2024-07-11' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v6.11 Most likely the last "new features" pull request for v6.11 with changes both in stack and in drivers. The big thing is the multiple radios for wiphy feature which makes it possible to better advertise radio capabilities to user space. mt76 enabled MLO and iwlwifi re-enabled MLO, ath12k and rtw89 Wi-Fi 6 devices got WoWLAN support. Major changes: cfg80211/mac80211 * remove DEAUTH_NEED_MGD_TX_PREP flag * multiple radios per wiphy support mac80211_hwsim * multi-radio wiphy support ath12k * DebugFS support for datapath statistics * WCN7850: support for WoW (Wake on WLAN) * WCN7850: device-tree bindings ath11k * QCA6390: device-tree bindings iwlwifi * mvm: re-enable Multi-Link Operation (MLO) * aggregation (A-MSDU) optimisations rtw89 * preparation for RTL8852BE-VT support * WoWLAN support for WiFi 6 chips * 36-bit PCI DMA support mt76 * mt7925 Multi-Link Operation (MLO) support * tag 'wireless-next-2024-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (204 commits) wifi: mac80211: fix AP chandef capturing in CSA wifi: iwlwifi: correctly reference TSO page information wifi: mt76: mt792x: fix scheduler interference in drv own process wifi: mt76: mt7925: enabling MLO when the firmware supports it wifi: mt76: mt7925: remove the unused mt7925_mcu_set_chan_info wifi: mt76: mt7925: update mt7925_mac_link_bss_add for MLO wifi: mt76: mt7925: update mt7925_mcu_bss_basic_tlv for MLO wifi: mt76: mt7925: update mt7925_mcu_set_timing for MLO wifi: mt76: mt7925: update mt7925_mcu_sta_phy_tlv for MLO wifi: mt76: mt7925: update mt7925_mcu_sta_rate_ctrl_tlv for MLO wifi: mt76: mt7925: add mt7925_mcu_sta_eht_mld_tlv for MLO wifi: mt76: mt7925: update mt7925_mcu_sta_update for MLO wifi: mt76: mt7925: update mt7925_mcu_add_bss_info for MLO wifi: mt76: mt7925: update mt7925_mcu_bss_mld_tlv for MLO wifi: mt76: mt7925: update mt7925_mcu_sta_mld_tlv for MLO wifi: mt76: mt7925: add mt7925_[assign,unassign]_vif_chanctx wifi: mt76: add def_wcid to struct mt76_wcid wifi: mt76: mt7925: report link information in rx status wifi: mt76: mt7925: update rate index according to link id wifi: mt76: mt7925: add link handling in the mt7925_ipv6_addr_change ... ==================== Link: https://patch.msgid.link/20240711102353.0C849C116B1@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-11net: ethtool: Fix RSS settingSaeed Mahameed
When user submits a rxfh set command without touching XFRM_SYM_XOR, rxfh.input_xfrm is set to RXH_XFRM_NO_CHANGE, which is equal to 0xff. Testing if (rxfh.input_xfrm & RXH_XFRM_SYM_XOR && !ops->cap_rss_sym_xor_supported) return -EOPNOTSUPP; Will always be true on devices that don't set cap_rss_sym_xor_supported, since rxfh.input_xfrm & RXH_XFRM_SYM_XOR is always true, if input_xfrm was not set, i.e RXH_XFRM_NO_CHANGE=0xff, which will result in failure of any command that doesn't require any change of XFRM, e.g RSS context or hash function changes. To avoid this breakage, test if rxfh.input_xfrm != RXH_XFRM_NO_CHANGE before testing other conditions. Note that the problem will only trigger with XFRM-aware userspace, old ethtool CLI would continue to work. Fixes: 0dd415d15505 ("net: ethtool: add a NO_CHANGE uAPI for new RXFH's input_xfrm") Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Ahmed Zaki <ahmed.zaki@intel.com> Link: https://patch.msgid.link/20240710225538.43368-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-11net: reduce rtnetlink_rcv_msg() stack usageEric Dumazet
IFLA_MAX is increasing slowly but surely. Some compilers use more than 512 bytes of stack in rtnetlink_rcv_msg() because it calls rtnl_calcit() for RTM_GETLINK message. Use noinline_for_stack attribute to not inline rtnl_calcit(), and directly use nla_for_each_attr_type() (Jakub suggestion) because we only care about IFLA_EXT_MASK at this stage. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240710151653.3786604-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-11net/sched: act_skbmod: convert comma to semicolonChen Ni
Replace a comma between expression statements by a semicolon. Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240709072838.1152880-1-nichen@iscas.ac.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-11ethtool: use the rss context XArray in ring deactivation safety-checkJakub Kicinski
ethtool_get_max_rxfh_channel() gets called when user requests deactivating Rx channels. Check the additional RSS contexts, too. While we do track whether RSS context has an indirection table explicitly set by the user, no driver looks at that bit. Assume drivers won't auto-regenerate the additional tables, to be safe. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20240710174043.754664-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-11ethtool: fail closed if we can't get max channel used in indirection tablesJakub Kicinski
Commit 0d1b7d6c9274 ("bnxt: fix crashes when reducing ring count with active RSS contexts") proves that allowing indirection table to contain channels with out of bounds IDs may lead to crashes. Currently the max channel check in the core gets skipped if driver can't fetch the indirection table or when we can't allocate memory. Both of those conditions should be extremely rare but if they do happen we should try to be safe and fail the channel change. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20240710174043.754664-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: net/sched/act_ct.c 26488172b029 ("net/sched: Fix UAF when resolving a clash") 3abbd7ed8b76 ("act_ct: prepare for stolen verdict coming from conntrack and nat engine") No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-11libceph: fix crush_choose_firstn() kernel-doc warningsJeff Johnson
Currently, when built with "make W=1", the following warnings are generated: net/ceph/crush/mapper.c:466: warning: Function parameter or struct member 'work' not described in 'crush_choose_firstn' net/ceph/crush/mapper.c:466: warning: Function parameter or struct member 'weight' not described in 'crush_choose_firstn' net/ceph/crush/mapper.c:466: warning: Function parameter or struct member 'weight_max' not described in 'crush_choose_firstn' net/ceph/crush/mapper.c:466: warning: Function parameter or struct member 'choose_args' not described in 'crush_choose_firstn' Update the crush_choose_firstn() kernel-doc to document these parameters. Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-07-11libceph: suppress crush_choose_indep() kernel-doc warningsJeff Johnson
Currently, when built with "make W=1", the following warnings are generated: net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'map' not described in 'crush_choose_indep' net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'work' not described in 'crush_choose_indep' net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'bucket' not described in 'crush_choose_indep' net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'weight' not described in 'crush_choose_indep' net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'weight_max' not described in 'crush_choose_indep' net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'x' not described in 'crush_choose_indep' net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'left' not described in 'crush_choose_indep' net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'numrep' not described in 'crush_choose_indep' net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'type' not described in 'crush_choose_indep' net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'out' not described in 'crush_choose_indep' net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'outpos' not described in 'crush_choose_indep' net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'tries' not described in 'crush_choose_indep' net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'recurse_tries' not described in 'crush_choose_indep' net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'recurse_to_leaf' not described in 'crush_choose_indep' net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'out2' not described in 'crush_choose_indep' net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'parent_r' not described in 'crush_choose_indep' net/ceph/crush/mapper.c:655: warning: Function parameter or struct member 'choose_args' not described in 'crush_choose_indep' These warnings are generated because the prologue comment for crush_choose_indep() uses the kernel-doc prefix, but the actual comment is a very brief description that is not in kernel-doc format. Since this is a static function there is no need to fully document the function, so replace the kernel-doc comment prefix with a standard comment prefix to remove these warnings. Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-07-11Merge tag 'nf-24-07-11' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following batch contains Netfilter fixes for net: Patch #1 fixes a bogus WARN_ON splat in nfnetlink_queue. Patch #2 fixes a crash due to stack overflow in chain loop detection by using the existing chain validation routines Both patches from Florian Westphal. netfilter pull request 24-07-11 * tag 'nf-24-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: prefer nft_chain_validate netfilter: nfnetlink_queue: drop bogus WARN_ON ==================== Link: https://patch.msgid.link/20240711093948.3816-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-11net, sunrpc: Remap EPERM in case of connection failure in xs_tcp_setup_socketDaniel Borkmann
When using a BPF program on kernel_connect(), the call can return -EPERM. This causes xs_tcp_setup_socket() to loop forever, filling up the syslog and causing the kernel to potentially freeze up. Neil suggested: This will propagate -EPERM up into other layers which might not be ready to handle it. It might be safer to map EPERM to an error we would be more likely to expect from the network system - such as ECONNREFUSED or ENETDOWN. ECONNREFUSED as error seems reasonable. For programs setting a different error can be out of reach (see handling in 4fbac77d2d09) in particular on kernels which do not have f10d05966196 ("bpf: Make BPF_PROG_RUN_ARRAY return -err instead of allow boolean"), thus given that it is better to simply remap for consistent behavior. UDP does handle EPERM in xs_udp_send_request(). Fixes: d74bad4e74ee ("bpf: Hooks for sys_connect") Fixes: 4fbac77d2d09 ("bpf: Hooks for sys_bind") Co-developed-by: Lex Siegel <usiegl00@gmail.com> Signed-off-by: Lex Siegel <usiegl00@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Neil Brown <neilb@suse.de> Cc: Trond Myklebust <trondmy@kernel.org> Cc: Anna Schumaker <anna@kernel.org> Link: https://github.com/cilium/cilium/issues/33395 Link: https://lore.kernel.org/bpf/171374175513.12877.8993642908082014881@noble.neil.brown.name Link: https://patch.msgid.link/9069ec1d59e4b2129fc23433349fd5580ad43921.1720075070.git.daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-11net/sched: Fix UAF when resolving a clashChengen Du
KASAN reports the following UAF: BUG: KASAN: slab-use-after-free in tcf_ct_flow_table_process_conn+0x12b/0x380 [act_ct] Read of size 1 at addr ffff888c07603600 by task handler130/6469 Call Trace: <IRQ> dump_stack_lvl+0x48/0x70 print_address_description.constprop.0+0x33/0x3d0 print_report+0xc0/0x2b0 kasan_report+0xd0/0x120 __asan_load1+0x6c/0x80 tcf_ct_flow_table_process_conn+0x12b/0x380 [act_ct] tcf_ct_act+0x886/0x1350 [act_ct] tcf_action_exec+0xf8/0x1f0 fl_classify+0x355/0x360 [cls_flower] __tcf_classify+0x1fd/0x330 tcf_classify+0x21c/0x3c0 sch_handle_ingress.constprop.0+0x2c5/0x500 __netif_receive_skb_core.constprop.0+0xb25/0x1510 __netif_receive_skb_list_core+0x220/0x4c0 netif_receive_skb_list_internal+0x446/0x620 napi_complete_done+0x157/0x3d0 gro_cell_poll+0xcf/0x100 __napi_poll+0x65/0x310 net_rx_action+0x30c/0x5c0 __do_softirq+0x14f/0x491 __irq_exit_rcu+0x82/0xc0 irq_exit_rcu+0xe/0x20 common_interrupt+0xa1/0xb0 </IRQ> <TASK> asm_common_interrupt+0x27/0x40 Allocated by task 6469: kasan_save_stack+0x38/0x70 kasan_set_track+0x25/0x40 kasan_save_alloc_info+0x1e/0x40 __kasan_krealloc+0x133/0x190 krealloc+0xaa/0x130 nf_ct_ext_add+0xed/0x230 [nf_conntrack] tcf_ct_act+0x1095/0x1350 [act_ct] tcf_action_exec+0xf8/0x1f0 fl_classify+0x355/0x360 [cls_flower] __tcf_classify+0x1fd/0x330 tcf_classify+0x21c/0x3c0 sch_handle_ingress.constprop.0+0x2c5/0x500 __netif_receive_skb_core.constprop.0+0xb25/0x1510 __netif_receive_skb_list_core+0x220/0x4c0 netif_receive_skb_list_internal+0x446/0x620 napi_complete_done+0x157/0x3d0 gro_cell_poll+0xcf/0x100 __napi_poll+0x65/0x310 net_rx_action+0x30c/0x5c0 __do_softirq+0x14f/0x491 Freed by task 6469: kasan_save_stack+0x38/0x70 kasan_set_track+0x25/0x40 kasan_save_free_info+0x2b/0x60 ____kasan_slab_free+0x180/0x1f0 __kasan_slab_free+0x12/0x30 slab_free_freelist_hook+0xd2/0x1a0 __kmem_cache_free+0x1a2/0x2f0 kfree+0x78/0x120 nf_conntrack_free+0x74/0x130 [nf_conntrack] nf_ct_destroy+0xb2/0x140 [nf_conntrack] __nf_ct_resolve_clash+0x529/0x5d0 [nf_conntrack] nf_ct_resolve_clash+0xf6/0x490 [nf_conntrack] __nf_conntrack_confirm+0x2c6/0x770 [nf_conntrack] tcf_ct_act+0x12ad/0x1350 [act_ct] tcf_action_exec+0xf8/0x1f0 fl_classify+0x355/0x360 [cls_flower] __tcf_classify+0x1fd/0x330 tcf_classify+0x21c/0x3c0 sch_handle_ingress.constprop.0+0x2c5/0x500 __netif_receive_skb_core.constprop.0+0xb25/0x1510 __netif_receive_skb_list_core+0x220/0x4c0 netif_receive_skb_list_internal+0x446/0x620 napi_complete_done+0x157/0x3d0 gro_cell_poll+0xcf/0x100 __napi_poll+0x65/0x310 net_rx_action+0x30c/0x5c0 __do_softirq+0x14f/0x491 The ct may be dropped if a clash has been resolved but is still passed to the tcf_ct_flow_table_process_conn function for further usage. This issue can be fixed by retrieving ct from skb again after confirming conntrack. Fixes: 0cc254e5aa37 ("net/sched: act_ct: Offload connections with commit action") Co-developed-by: Gerald Yang <gerald.yang@canonical.com> Signed-off-by: Gerald Yang <gerald.yang@canonical.com> Signed-off-by: Chengen Du <chengen.du@canonical.com> Link: https://patch.msgid.link/20240710053747.13223-1-chengen.du@canonical.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-11udp: Set SOCK_RCU_FREE earlier in udp_lib_get_port().Kuniyuki Iwashima
syzkaller triggered the warning [0] in udp_v4_early_demux(). In udp_v[46]_early_demux() and sk_lookup(), we do not touch the refcount of the looked-up sk and use sock_pfree() as skb->destructor, so we check SOCK_RCU_FREE to ensure that the sk is safe to access during the RCU grace period. Currently, SOCK_RCU_FREE is flagged for a bound socket after being put into the hash table. Moreover, the SOCK_RCU_FREE check is done too early in udp_v[46]_early_demux() and sk_lookup(), so there could be a small race window: CPU1 CPU2 ---- ---- udp_v4_early_demux() udp_lib_get_port() | |- hlist_add_head_rcu() |- sk = __udp4_lib_demux_lookup() | |- DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk)); `- sock_set_flag(sk, SOCK_RCU_FREE) We had the same bug in TCP and fixed it in commit 871019b22d1b ("net: set SOCK_RCU_FREE before inserting socket into hashtable"). Let's apply the same fix for UDP. [0]: WARNING: CPU: 0 PID: 11198 at net/ipv4/udp.c:2599 udp_v4_early_demux+0x481/0xb70 net/ipv4/udp.c:2599 Modules linked in: CPU: 0 PID: 11198 Comm: syz-executor.1 Not tainted 6.9.0-g93bda33046e7 #13 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:udp_v4_early_demux+0x481/0xb70 net/ipv4/udp.c:2599 Code: c5 7a 15 fe bb 01 00 00 00 44 89 e9 31 ff d3 e3 81 e3 bf ef ff ff 89 de e8 2c 74 15 fe 85 db 0f 85 02 06 00 00 e8 9f 7a 15 fe <0f> 0b e8 98 7a 15 fe 49 8d 7e 60 e8 4f 39 2f fe 49 c7 46 60 20 52 RSP: 0018:ffffc9000ce3fa58 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff8318c92c RDX: ffff888036ccde00 RSI: ffffffff8318c2f1 RDI: 0000000000000001 RBP: ffff88805a2dd6e0 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0001ffffffffffff R12: ffff88805a2dd680 R13: 0000000000000007 R14: ffff88800923f900 R15: ffff88805456004e FS: 00007fc449127640(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fc449126e38 CR3: 000000003de4b002 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 PKRU: 55555554 Call Trace: <TASK> ip_rcv_finish_core.constprop.0+0xbdd/0xd20 net/ipv4/ip_input.c:349 ip_rcv_finish+0xda/0x150 net/ipv4/ip_input.c:447 NF_HOOK include/linux/netfilter.h:314 [inline] NF_HOOK include/linux/netfilter.h:308 [inline] ip_rcv+0x16c/0x180 net/ipv4/ip_input.c:569 __netif_receive_skb_one_core+0xb3/0xe0 net/core/dev.c:5624 __netif_receive_skb+0x21/0xd0 net/core/dev.c:5738 netif_receive_skb_internal net/core/dev.c:5824 [inline] netif_receive_skb+0x271/0x300 net/core/dev.c:5884 tun_rx_batched drivers/net/tun.c:1549 [inline] tun_get_user+0x24db/0x2c50 drivers/net/tun.c:2002 tun_chr_write_iter+0x107/0x1a0 drivers/net/tun.c:2048 new_sync_write fs/read_write.c:497 [inline] vfs_write+0x76f/0x8d0 fs/read_write.c:590 ksys_write+0xbf/0x190 fs/read_write.c:643 __do_sys_write fs/read_write.c:655 [inline] __se_sys_write fs/read_write.c:652 [inline] __x64_sys_write+0x41/0x50 fs/read_write.c:652 x64_sys_call+0xe66/0x1990 arch/x86/include/generated/asm/syscalls_64.h:2 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x4b/0x110 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7fc44a68bc1f Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 e9 cf f5 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 3c d0 f5 ff 48 RSP: 002b:00007fc449126c90 EFLAGS: 00000293 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 00000000004bc050 RCX: 00007fc44a68bc1f RDX: 0000000000000032 RSI: 00000000200000c0 RDI: 00000000000000c8 RBP: 00000000004bc050 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000032 R11: 0000000000000293 R12: 0000000000000000 R13: 000000000000000b R14: 00007fc44a5ec530 R15: 0000000000000000 </TASK> Fixes: 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240709191356.24010-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-11netfilter: nf_tables: prefer nft_chain_validateFlorian Westphal
nft_chain_validate already performs loop detection because a cycle will result in a call stack overflow (ctx->level >= NFT_JUMP_STACK_SIZE). It also follows maps via ->validate callback in nft_lookup, so there appears no reason to iterate the maps again. nf_tables_check_loops() and all its helper functions can be removed. This improves ruleset load time significantly, from 23s down to 12s. This also fixes a crash bug. Old loop detection code can result in unbounded recursion: BUG: TASK stack guard page was hit at .... Oops: stack guard page: 0000 [#1] PREEMPT SMP KASAN CPU: 4 PID: 1539 Comm: nft Not tainted 6.10.0-rc5+ #1 [..] with a suitable ruleset during validation of register stores. I can't see any actual reason to attempt to check for this from nft_validate_register_store(), at this point the transaction is still in progress, so we don't have a full picture of the rule graph. For nf-next it might make sense to either remove it or make this depend on table->validate_state in case we could catch an error earlier (for improved error reporting to userspace). Fixes: 20a69341f2d0 ("netfilter: nf_tables: add netlink set API") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-07-11netfilter: nfnetlink_queue: drop bogus WARN_ONFlorian Westphal
Happens when rules get flushed/deleted while packet is out, so remove this WARN_ON. This WARN exists in one form or another since v4.14, no need to backport this to older releases, hence use a more recent fixes tag. Fixes: 3f8019688894 ("netfilter: move nf_reinject into nfnetlink_queue modules") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202407081453.11ac0f63-lkp@intel.com Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-07-11ethtool: netlink: do not return SQI value if link is downOleksij Rempel
Do not attach SQI value if link is down. "SQI values are only valid if link-up condition is present" per OpenAlliance specification of 100Base-T1 Interoperability Test suite [1]. The same rule would apply for other link types. [1] https://opensig.org/automotive-ethernet-specifications/# Fixes: 806602191592 ("ethtool: provide UAPI for PHY Signal Quality Index (SQI)") Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Woojung Huh <woojung.huh@microchip.com> Link: https://patch.msgid.link/20240709061943.729381-1-o.rempel@pengutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-10tcp: avoid too many retransmit packetsEric Dumazet
If a TCP socket is using TCP_USER_TIMEOUT, and the other peer retracted its window to zero, tcp_retransmit_timer() can retransmit a packet every two jiffies (2 ms for HZ=1000), for about 4 minutes after TCP_USER_TIMEOUT has 'expired'. The fix is to make sure tcp_rtx_probe0_timed_out() takes icsk->icsk_user_timeout into account. Before blamed commit, the socket would not timeout after icsk->icsk_user_timeout, but would use standard exponential backoff for the retransmits. Also worth noting that before commit e89688e3e978 ("net: tcp: fix unexcepted socket die when snd_wnd is 0"), the issue would last 2 minutes instead of 4. Fixes: b701a99e431d ("tcp: Add tcp_clamp_rto_to_user_timeout() helper to improve accuracy") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Jon Maxwell <jmaxwell37@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240710001402.2758273-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-10page_pool: use __cacheline_group_{begin, end}_aligned()Alexander Lobakin
Instead of doing __cacheline_group_begin() __aligned(), use the new __cacheline_group_{begin,end}_aligned(), so that it will take care of the group alignment itself. Also replace open-coded `4 * sizeof(long)` in two places with a definition. Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-07-10bpf: Remove tst_run from lwt_seg6local_prog_ops.Sebastian Andrzej Siewior
The syzbot reported that the lwt_seg6 related BPF ops can be invoked via bpf_test_run() without without entering input_action_end_bpf() first. Martin KaFai Lau said that self test for BPF_PROG_TYPE_LWT_SEG6LOCAL probably didn't work since it was introduced in commit 04d4b274e2a ("ipv6: sr: Add seg6local action End.BPF"). The reason is that the per-CPU variable seg6_bpf_srh_states::srh is never assigned in the self test case but each BPF function expects it. Remove test_run for BPF_PROG_TYPE_LWT_SEG6LOCAL. Suggested-by: Martin KaFai Lau <martin.lau@linux.dev> Reported-by: syzbot+608a2acde8c5a101d07d@syzkaller.appspotmail.com Fixes: d1542d4ae4df ("seg6: Use nested-BH locking for seg6_bpf_srh_states.") Fixes: 004d4b274e2a ("ipv6: sr: Add seg6local action End.BPF") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20240710141631.FbmHcQaX@linutronix.de Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-07-10wifi: mac80211: fix AP chandef capturing in CSAJohannes Berg
When the CSA is announced with only HT elements, the AP chandef isn't captured correctly, leading to crashes in the later code that checks for TPE changes during CSA. Capture the AP chandef correctly in both cases to fix this. Reported-by: Jouni Malinen <j@w1.fi> Fixes: 4540568136fe ("wifi: mac80211: handle TPE element during CSA") Link: https://patch.msgid.link/20240709160851.47805f24624d.I024091f701447f7921e93bb23b46e01c2f46347d@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-07-10libceph: fix race between delayed_work() and ceph_monc_stop()Ilya Dryomov
The way the delayed work is handled in ceph_monc_stop() is prone to races with mon_fault() and possibly also finish_hunting(). Both of these can requeue the delayed work which wouldn't be canceled by any of the following code in case that happens after cancel_delayed_work_sync() runs -- __close_session() doesn't mess with the delayed work in order to avoid interfering with the hunting interval logic. This part was missed in commit b5d91704f53e ("libceph: behave in mon_fault() if cur_mon < 0") and use-after-free can still ensue on monc and objects that hang off of it, with monc->auth and monc->monmap being particularly susceptible to quickly being reused. To fix this: - clear monc->cur_mon and monc->hunting as part of closing the session in ceph_monc_stop() - bail from delayed_work() if monc->cur_mon is cleared, similar to how it's done in mon_fault() and finish_hunting() (based on monc->hunting) - call cancel_delayed_work_sync() after the session is closed Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/66857 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Xiubo Li <xiubli@redhat.com>
2024-07-09net: fix rc7's __skb_datagram_iter()Hugh Dickins
X would not start in my old 32-bit partition (and the "n"-handling looks just as wrong on 64-bit, but for whatever reason did not show up there): "n" must be accumulated over all pages before it's added to "offset" and compared with "copy", immediately after the skb_frag_foreach_page() loop. Fixes: d2d30a376d9c ("net: allow skb_datagram_iter to be called from any context") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Link: https://patch.msgid.link/fef352e8-b89a-da51-f8ce-04bc39ee6481@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-09net: tls: Pass union tls_crypto_context pointer to memzero_explicitSimon Horman
Pass union tls_crypto_context pointer, rather than struct tls_crypto_info pointer, to memzero_explicit(). The address of the pointer is the same before and after. But the new construct means that the size of the dereferenced pointer type matches the size being zeroed. Which aids static analysis. As reported by Smatch: .../tls_main.c:842 do_tls_setsockopt_conf() error: memzero_explicit() 'crypto_info' too small (4 vs 56) No functional change intended. Compile tested only. Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240708-tls-memzero-v2-1-9694eaf31b79@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-09Merge tag 'for-netdev' of ↵Paolo Abeni
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-07-08 The following pull-request contains BPF updates for your *net-next* tree. We've added 102 non-merge commits during the last 28 day(s) which contain a total of 127 files changed, 4606 insertions(+), 980 deletions(-). The main changes are: 1) Support resilient split BTF which cuts down on duplication and makes BTF as compact as possible wrt BTF from modules, from Alan Maguire & Eduard Zingerman. 2) Add support for dumping kfunc prototypes from BTF which enables both detecting as well as dumping compilable prototypes for kfuncs, from Daniel Xu. 3) Batch of s390x BPF JIT improvements to add support for BPF arena and to implement support for BPF exceptions, from Ilya Leoshkevich. 4) Batch of riscv64 BPF JIT improvements in particular to add 12-argument support for BPF trampolines and to utilize bpf_prog_pack for the latter, from Pu Lehui. 5) Extend BPF test infrastructure to add a CHECKSUM_COMPLETE validation option for skbs and add coverage along with it, from Vadim Fedorenko. 6) Inline bpf_get_current_task/_btf() helpers in the arm64 BPF JIT which gives a small 1% performance improvement in micro-benchmarks, from Puranjay Mohan. 7) Extend the BPF verifier to track the delta between linked registers in order to better deal with recent LLVM code optimizations, from Alexei Starovoitov. 8) Fix bpf_wq_set_callback_impl() kfunc signature where the third argument should have been a pointer to the map value, from Benjamin Tissoires. 9) Extend BPF selftests to add regular expression support for test output matching and adjust some of the selftest when compiled under gcc, from Cupertino Miranda. 10) Simplify task_file_seq_get_next() and remove an unnecessary loop which always iterates exactly once anyway, from Dan Carpenter. 11) Add the capability to offload the netfilter flowtable in XDP layer through kfuncs, from Florian Westphal & Lorenzo Bianconi. 12) Various cleanups in networking helpers in BPF selftests to shave off a few lines of open-coded functions on client/server handling, from Geliang Tang. 13) Properly propagate prog->aux->tail_call_reachable out of BPF verifier, so that x86 JIT does not need to implement detection, from Leon Hwang. 14) Fix BPF verifier to add a missing check_func_arg_reg_off() to prevent an out-of-bounds memory access for dynpointers, from Matt Bobrowski. 15) Fix bpf_session_cookie() kfunc to return __u64 instead of long pointer as it might lead to problems on 32-bit archs, from Jiri Olsa. 16) Enhance traffic validation and dynamic batch size support in xsk selftests, from Tushar Vyavahare. bpf-next-for-netdev * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (102 commits) selftests/bpf: DENYLIST.aarch64: Remove fexit_sleep selftests/bpf: amend for wrong bpf_wq_set_callback_impl signature bpf: helpers: fix bpf_wq_set_callback_impl signature libbpf: Add NULL checks to bpf_object__{prev_map,next_map} selftests/bpf: Remove exceptions tests from DENYLIST.s390x s390/bpf: Implement exceptions s390/bpf: Change seen_reg to a mask bpf: Remove unnecessary loop in task_file_seq_get_next() riscv, bpf: Optimize stack usage of trampoline bpf, devmap: Add .map_alloc_check selftests/bpf: Remove arena tests from DENYLIST.s390x selftests/bpf: Add UAF tests for arena atomics selftests/bpf: Introduce __arena_global s390/bpf: Support arena atomics s390/bpf: Enable arena s390/bpf: Support address space cast instruction s390/bpf: Support BPF_PROBE_MEM32 s390/bpf: Land on the next JITed instruction after exception s390/bpf: Introduce pre- and post- probe functions s390/bpf: Get rid of get_probe_mem_regno() ... ==================== Link: https://patch.msgid.link/20240708221438.10974-1-daniel@iogearbox.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-09udp: Remove duplicate included header file trace/events/udp.hThorsten Blum
Remove duplicate included header file trace/events/udp.h and the following warning reported by make includecheck: trace/events/udp.h is included more than once Compile-tested only. Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240706071132.274352-2-thorsten.blum@toblux.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-09wifi: mac80211: add wiphy radio assignment and validationFelix Fietkau
Validate number of channels and interface combinations per radio. Assign each channel context to a radio. Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://patch.msgid.link/1d3e9ba70a30ce18aaff337f0a76d7aeb311bafb.1720514221.git-series.nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-07-09wifi: mac80211: move code in ieee80211_link_reserve_chanctx to a helperFelix Fietkau
Reduces indentation in preparation for further changes Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://patch.msgid.link/cce95007092336254d51570f4a27e05a6f150a53.1720514221.git-series.nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-07-09wifi: mac80211: extend ifcomb check functions for multi-radioFelix Fietkau
Add support for counting global and per-radio max/current number of channels, as well as checking radio-specific interface combinations. Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://patch.msgid.link/e76307f8ce562a91a74faab274ae01f6a5ba0a2e.1720514221.git-series.nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-07-09wifi: mac80211: add radio index to ieee80211_chanctx_confFelix Fietkau
Will be used to explicitly assign a channel context to a wiphy radio. Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://patch.msgid.link/59f76f57d935f155099276be22badfa671d5bfd9.1720514221.git-series.nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-07-09wifi: mac80211: add support for DFS with multiple radiosFelix Fietkau
DFS can be supported with multi-channel combinations, as long as each DFS capable radio only supports one channel. Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://patch.msgid.link/4d27a4adca99fa832af1f7cda4f2e71016bd9fda.1720514221.git-series.nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-07-09wifi: cfg80211: add helper for checking if a chandef is valid on a radioFelix Fietkau
Check if the full channel width is in the radio's frequency range. Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://patch.msgid.link/7c8ea146feb6f37cee62e5ba6be5370403695797.1720514221.git-series.nbd@nbd.name [add missing Return: documentation] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-07-09wifi: cfg80211: extend interface combination check for multi-radioFelix Fietkau
Add a field in struct iface_combination_params to check per-radio interface combinations instead of per-wiphy ones. Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://patch.msgid.link/32b28da89c2d759b0324deeefe2be4cee91de18e.1720514221.git-series.nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-07-09wifi: cfg80211: add support for advertising multiple radios belonging to a wiphyFelix Fietkau
The prerequisite for MLO support in cfg80211/mac80211 is that all the links participating in MLO must be from the same wiphy/ieee80211_hw. To meet this expectation, some drivers may need to group multiple discrete hardware each acting as a link in MLO under single wiphy. With this change, supported frequencies and interface combinations of each individual radio are reported to user space. This allows user space to figure out the limitations of what combination of channels can be used concurrently. Even for non-MLO devices, this improves support for devices capable of running on multiple channels at the same time. Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://patch.msgid.link/18a88f9ce82b1c9f7c12f1672430eaf2bb0be295.1720514221.git-series.nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-07-09wifi: mac80211: chanctx emulation set CHANGE_CHANNEL when in_reconfigZong-Zhe Yang
Chanctx emulation didn't info IEEE80211_CONF_CHANGE_CHANNEL to drivers during ieee80211_restart_hw (ieee80211_emulate_add_chanctx). It caused non-chanctx drivers to not stand on the correct channel after recovery. RX then behaved abnormally. Finally, disconnection/reconnection occurred. So, set IEEE80211_CONF_CHANGE_CHANNEL when in_reconfig. Signed-off-by: Zong-Zhe Yang <kevin_yang@realtek.com> Link: https://patch.msgid.link/20240709073531.30565-1-kevin_yang@realtek.com Cc: stable@vger.kernel.org Fixes: 0a44dfc07074 ("wifi: mac80211: simplify non-chanctx drivers") Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-07-09l2tp: fix possible UAF when cleaning up tunnelsJames Chapman
syzbot reported a UAF caused by a race when the L2TP work queue closes a tunnel at the same time as a userspace thread closes a session in that tunnel. Tunnel cleanup is handled by a work queue which iterates through the sessions contained within a tunnel, and closes them in turn. Meanwhile, a userspace thread may arbitrarily close a session via either netlink command or by closing the pppox socket in the case of l2tp_ppp. The race condition may occur when l2tp_tunnel_closeall walks the list of sessions in the tunnel and deletes each one. Currently this is implemented using list_for_each_safe, but because the list spinlock is dropped in the loop body it's possible for other threads to manipulate the list during list_for_each_safe's list walk. This can lead to the list iterator being corrupted, leading to list_for_each_safe spinning. One sequence of events which may lead to this is as follows: * A tunnel is created, containing two sessions A and B. * A thread closes the tunnel, triggering tunnel cleanup via the work queue. * l2tp_tunnel_closeall runs in the context of the work queue. It removes session A from the tunnel session list, then drops the list lock. At this point the list_for_each_safe temporary variable is pointing to the other session on the list, which is session B, and the list can be manipulated by other threads since the list lock has been released. * Userspace closes session B, which removes the session from its parent tunnel via l2tp_session_delete. Since l2tp_tunnel_closeall has released the tunnel list lock, l2tp_session_delete is able to call list_del_init on the session B list node. * Back on the work queue, l2tp_tunnel_closeall resumes execution and will now spin forever on the same list entry until the underlying session structure is freed, at which point UAF occurs. The solution is to iterate over the tunnel's session list using list_first_entry_not_null to avoid the possibility of the list iterator pointing at a list item which may be removed during the walk. Also, have l2tp_tunnel_closeall ref each session while it processes it to prevent another thread from freeing it. cpu1 cpu2 --- --- pppol2tp_release() spin_lock_bh(&tunnel->list_lock); for (;;) { session = list_first_entry_or_null(&tunnel->session_list, struct l2tp_session, list); if (!session) break; list_del_init(&session->list); spin_unlock_bh(&tunnel->list_lock); l2tp_session_delete(session); l2tp_session_delete(session); spin_lock_bh(&tunnel->list_lock); } spin_unlock_bh(&tunnel->list_lock); Calling l2tp_session_delete on the same session twice isn't a problem per-se, but if cpu2 manages to destruct the socket and unref the session to zero before cpu1 progresses then it would lead to UAF. Reported-by: syzbot+b471b7c936301a59745b@syzkaller.appspotmail.com Reported-by: syzbot+c041b4ce3a6dfd1e63e2@syzkaller.appspotmail.com Fixes: d18d3f0a24fc ("l2tp: replace hlist with simple list for per-tunnel session list") Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: Tom Parkin <tparkin@katalix.com> Link: https://patch.msgid.link/20240704152508.1923908-1-jchapman@katalix.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-09skmsg: Skip zero length skb in sk_msg_recvmsgGeliang Tang
When running BPF selftests (./test_progs -t sockmap_basic) on a Loongarch platform, the following kernel panic occurs: [...] Oops[#1]: CPU: 22 PID: 2824 Comm: test_progs Tainted: G OE 6.10.0-rc2+ #18 Hardware name: LOONGSON Dabieshan/Loongson-TC542F0, BIOS Loongson-UDK2018 ... ... ra: 90000000048bf6c0 sk_msg_recvmsg+0x120/0x560 ERA: 9000000004162774 copy_page_to_iter+0x74/0x1c0 CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE) PRMD: 0000000c (PPLV0 +PIE +PWE) EUEN: 00000007 (+FPE +SXE +ASXE -BTE) ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7) ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0) BADV: 0000000000000040 PRID: 0014c011 (Loongson-64bit, Loongson-3C5000) Modules linked in: bpf_testmod(OE) xt_CHECKSUM xt_MASQUERADE xt_conntrack Process test_progs (pid: 2824, threadinfo=0000000000863a31, task=...) Stack : ... Call Trace: [<9000000004162774>] copy_page_to_iter+0x74/0x1c0 [<90000000048bf6c0>] sk_msg_recvmsg+0x120/0x560 [<90000000049f2b90>] tcp_bpf_recvmsg_parser+0x170/0x4e0 [<90000000049aae34>] inet_recvmsg+0x54/0x100 [<900000000481ad5c>] sock_recvmsg+0x7c/0xe0 [<900000000481e1a8>] __sys_recvfrom+0x108/0x1c0 [<900000000481e27c>] sys_recvfrom+0x1c/0x40 [<9000000004c076ec>] do_syscall+0x8c/0xc0 [<9000000003731da4>] handle_syscall+0xc4/0x160 Code: ... ---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Fatal exception Kernel relocated by 0x3510000 .text @ 0x9000000003710000 .data @ 0x9000000004d70000 .bss @ 0x9000000006469400 ---[ end Kernel panic - not syncing: Fatal exception ]--- [...] This crash happens every time when running sockmap_skb_verdict_shutdown subtest in sockmap_basic. This crash is because a NULL pointer is passed to page_address() in the sk_msg_recvmsg(). Due to the different implementations depending on the architecture, page_address(NULL) will trigger a panic on Loongarch platform but not on x86 platform. So this bug was hidden on x86 platform for a while, but now it is exposed on Loongarch platform. The root cause is that a zero length skb (skb->len == 0) was put on the queue. This zero length skb is a TCP FIN packet, which was sent by shutdown(), invoked in test_sockmap_skb_verdict_shutdown(): shutdown(p1, SHUT_WR); In this case, in sk_psock_skb_ingress_enqueue(), num_sge is zero, and no page is put to this sge (see sg_set_page in sg_set_page), but this empty sge is queued into ingress_msg list. And in sk_msg_recvmsg(), this empty sge is used, and a NULL page is got by sg_page(sge). Pass this NULL page to copy_page_to_iter(), which passes it to kmap_local_page() and to page_address(), then kernel panics. To solve this, we should skip this zero length skb. So in sk_msg_recvmsg(), if copy is zero, that means it's a zero length skb, skip invoking copy_page_to_iter(). We are using the EFAULT return triggered by copy_page_to_iter to check for is_fin in tcp_bpf.c. Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Suggested-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/e3a16eacdc6740658ee02a33489b1b9d4912f378.1719992715.git.tanggeliang@kylinos.cn
2024-07-08net: page_pool: fix warning codeJohannes Berg
WARN_ON_ONCE("string") doesn't really do what appears to be intended, so fix that. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Fixes: 90de47f020db ("page_pool: fragment API support for 32-bit arch with 64-bit DMA") Link: https://patch.msgid.link/20240705134221.2f4de205caa1.I28496dc0f2ced580282d1fb892048017c4491e21@changeid Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-08bpf: Fix too early release of tcx_entryDaniel Borkmann
Pedro Pinto and later independently also Hyunwoo Kim and Wongi Lee reported an issue that the tcx_entry can be released too early leading to a use after free (UAF) when an active old-style ingress or clsact qdisc with a shared tc block is later replaced by another ingress or clsact instance. Essentially, the sequence to trigger the UAF (one example) can be as follows: 1. A network namespace is created 2. An ingress qdisc is created. This allocates a tcx_entry, and &tcx_entry->miniq is stored in the qdisc's miniqp->p_miniq. At the same time, a tcf block with index 1 is created. 3. chain0 is attached to the tcf block. chain0 must be connected to the block linked to the ingress qdisc to later reach the function tcf_chain0_head_change_cb_del() which triggers the UAF. 4. Create and graft a clsact qdisc. This causes the ingress qdisc created in step 1 to be removed, thus freeing the previously linked tcx_entry: rtnetlink_rcv_msg() => tc_modify_qdisc() => qdisc_create() => clsact_init() [a] => qdisc_graft() => qdisc_destroy() => __qdisc_destroy() => ingress_destroy() [b] => tcx_entry_free() => kfree_rcu() // tcx_entry freed 5. Finally, the network namespace is closed. This registers the cleanup_net worker, and during the process of releasing the remaining clsact qdisc, it accesses the tcx_entry that was already freed in step 4, causing the UAF to occur: cleanup_net() => ops_exit_list() => default_device_exit_batch() => unregister_netdevice_many() => unregister_netdevice_many_notify() => dev_shutdown() => qdisc_put() => clsact_destroy() [c] => tcf_block_put_ext() => tcf_chain0_head_change_cb_del() => tcf_chain_head_change_item() => clsact_chain_head_change() => mini_qdisc_pair_swap() // UAF There are also other variants, the gist is to add an ingress (or clsact) qdisc with a specific shared block, then to replace that qdisc, waiting for the tcx_entry kfree_rcu() to be executed and subsequently accessing the current active qdisc's miniq one way or another. The correct fix is to turn the miniq_active boolean into a counter. What can be observed, at step 2 above, the counter transitions from 0->1, at step [a] from 1->2 (in order for the miniq object to remain active during the replacement), then in [b] from 2->1 and finally [c] 1->0 with the eventual release. The reference counter in general ranges from [0,2] and it does not need to be atomic since all access to the counter is protected by the rtnl mutex. With this in place, there is no longer a UAF happening and the tcx_entry is freed at the correct time. Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support") Reported-by: Pedro Pinto <xten@osec.io> Co-developed-by: Pedro Pinto <xten@osec.io> Signed-off-by: Pedro Pinto <xten@osec.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Hyunwoo Kim <v4bel@theori.io> Cc: Wongi Lee <qwerty@theori.io> Cc: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20240708133130.11609-1-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-07-08gss_krb5: Fix the error handling path for crypto_sync_skcipher_setkeyGaosheng Cui
If we fail to call crypto_sync_skcipher_setkey, we should free the memory allocation for cipher, replace err_return with err_free_cipher to free the memory of cipher. Fixes: 4891f2d008e4 ("gss_krb5: import functionality to derive keys into the kernel") Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08sunrpc: refactor pool_mode setting codeJeff Layton
Allow the pool_mode setting code to be called from internal callers so we can call it from a new netlink op. Add a new svc_pool_map_get function to return the current setting. Change the existing module parameter handling to use the new interfaces under the hood. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08sunrpc: fix up the special handling of sv_nrpools == 1Jeff Layton
Only pooled services take a reference to the svc_pool_map. The sunrpc code has always used the sv_nrpools value to detect whether the service is pooled. The problem there is that nfsd is a pooled service, but when it's running in "global" pool_mode, it doesn't take a reference to the pool map because it has a sv_nrpools value of 1. This means that we have two separate codepaths for starting the server, depending on whether it's pooled or not. Fix this by adding a new flag to the svc_serv, that indicates whether the serv is pooled. With this we can have the nfsd service unconditionally take a reference, regardless of pool_mode. Note that this is a behavior change for /sys/module/sunrpc/parameters/pool_mode. Usually this file does not allow you to change the pool-mode while there are nfsd threads running, but if the pool-mode is "global" it's allowed. My assumption is that this is a bug, since it probably should never have worked this way. This patch changes the behavior such that you get back EBUSY even when nfsd is running in global mode. I think this is more reasonable behavior, and given that most people set this today using the module parameter, it's doubtful anyone will notice. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08SUNRPC: Add a trace point in svc_xprt_deferred_closeChuck Lever
The trace point in svc_xprt_close() reports only some local close requests. Try to capture more local close requests. Note that "trace-cmd record -T -e sunrpc:svc_xprt_close" will neatly capture the identity of the caller requesting the close. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08svcrdma: Handle ADDR_CHANGE CM event properlyChuck Lever
Sagi tells me that when a bonded device reports an address change, the consumer must destroy its listener IDs and create new ones. See commit a032e4f6d60d ("nvmet-rdma: fix bonding failover possible NULL deref"). Suggested-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08svcrdma: Refactor the creation of listener CMA IDChuck Lever
In a moment, I will add a second consumer of CMA ID creation in svcrdma. Refactor so this code can be reused. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08SUNRPC: avoid soft lockup when transmitting UDP to reachable server.NeilBrown
Prior to the commit identified below, call_transmit_status() would handle -EPERM and other errors related to an unreachable server by falling through to call_status() which added a 3-second delay and handled the failure as a timeout. Since that commit, call_transmit_status() falls through to handle_bind(). For UDP this moves straight on to handle_connect() and handle_transmit() so we immediately retransmit - and likely get the same error. This results in an indefinite loop in __rpc_execute() which triggers a soft-lockup warning. For the errors that indicate an unreachable server, call_transmit_status() should fall back to call_status() as it did before. This cannot cause the thundering herd that the previous patch was avoiding, as the call_status() will insert a delay. Fixes: ed7dc973bd91 ("SUNRPC: Prevent thundering herd when the socket is not connected") Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-08xprtrdma: Remove temp allocation of rpcrdma_rep objectsChuck Lever
The original code was designed so that most calls to rpcrdma_rep_create() would occur on the NUMA node that the device preferred. There are a few cases where that's not possible, so those reps are marked as temporary. However, we have the device (and its preferred node) already in rpcrdma_rep_create(), so let's use that to guarantee the memory is allocated from the correct node. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>