summaryrefslogtreecommitdiff
path: root/net/netfilter
AgeCommit message (Collapse)Author
2021-10-14netfilter: ipvs: make global sysctl readonly in non-init netnsAntoine Tenart
Because the data pointer of net/ipv4/vs/debug_level is not updated per netns, it must be marked as read-only in non-init netns. Fixes: c6d2d445d8de ("IPVS: netns, final patch enabling network name space.") Signed-off-by: Antoine Tenart <atenart@kernel.org> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-07netfilter: nf_tables: skip netdev events generated on netns removalFlorian Westphal
syzbot reported following (harmless) WARN: WARNING: CPU: 1 PID: 2648 at net/netfilter/core.c:468 nft_netdev_unregister_hooks net/netfilter/nf_tables_api.c:230 [inline] nf_tables_unregister_hook include/net/netfilter/nf_tables.h:1090 [inline] __nft_release_basechain+0x138/0x640 net/netfilter/nf_tables_api.c:9524 nft_netdev_event net/netfilter/nft_chain_filter.c:351 [inline] nf_tables_netdev_event+0x521/0x8a0 net/netfilter/nft_chain_filter.c:382 reproducer: unshare -n bash -c 'ip link add br0 type bridge; nft add table netdev t ; \ nft add chain netdev t ingress \{ type filter hook ingress device "br0" \ priority 0\; policy drop\; \}' Problem is that when netns device exit hooks create the UNREGISTER event, the .pre_exit hook for nf_tables core has already removed the base hook. Notifier attempts to do this again. The need to do base hook unregister unconditionally was needed in the past, because notifier was last stage where reg->dev dereference was safe. Now that nf_tables does the hook removal in .pre_exit, this isn't needed anymore. Reported-and-tested-by: syzbot+154bd5be532a63aa778b@syzkaller.appspotmail.com Fixes: 767d1216bff825 ("netfilter: nftables: fix possible UAF over chains from packet path in netns") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-07netfilter: Kconfig: use 'default y' instead of 'm' for bool config optionVegard Nossum
This option, NF_CONNTRACK_SECMARK, is a bool, so it can never be 'm'. Fixes: 33b8e77605620 ("[NETFILTER]: Add CONFIG_NETFILTER_ADVANCED option") Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-07netfilter: xt_IDLETIMER: fix panic that occurs when timer_type has garbage valueJuhee Kang
Currently, when the rule related to IDLETIMER is added, idletimer_tg timer structure is initialized by kmalloc on executing idletimer_tg_create function. However, in this process timer->timer_type is not defined to a specific value. Thus, timer->timer_type has garbage value and it occurs kernel panic. So, this commit fixes the panic by initializing timer->timer_type using kzalloc instead of kmalloc. Test commands: # iptables -A OUTPUT -j IDLETIMER --timeout 1 --label test $ cat /sys/class/xt_idletimer/timers/test Killed Splat looks like: BUG: KASAN: user-memory-access in alarm_expires_remaining+0x49/0x70 Read of size 8 at addr 0000002e8c7bc4c8 by task cat/917 CPU: 12 PID: 917 Comm: cat Not tainted 5.14.0+ #3 79940a339f71eb14fc81aee1757a20d5bf13eb0e Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: dump_stack_lvl+0x6e/0x9c kasan_report.cold+0x112/0x117 ? alarm_expires_remaining+0x49/0x70 __asan_load8+0x86/0xb0 alarm_expires_remaining+0x49/0x70 idletimer_tg_show+0xe5/0x19b [xt_IDLETIMER 11219304af9316a21bee5ba9d58f76a6b9bccc6d] dev_attr_show+0x3c/0x60 sysfs_kf_seq_show+0x11d/0x1f0 ? device_remove_bin_file+0x20/0x20 kernfs_seq_show+0xa4/0xb0 seq_read_iter+0x29c/0x750 kernfs_fop_read_iter+0x25a/0x2c0 ? __fsnotify_parent+0x3d1/0x570 ? iov_iter_init+0x70/0x90 new_sync_read+0x2a7/0x3d0 ? __x64_sys_llseek+0x230/0x230 ? rw_verify_area+0x81/0x150 vfs_read+0x17b/0x240 ksys_read+0xd9/0x180 ? vfs_write+0x460/0x460 ? do_syscall_64+0x16/0xc0 ? lockdep_hardirqs_on+0x79/0x120 __x64_sys_read+0x43/0x50 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f0cdc819142 Code: c0 e9 c2 fe ff ff 50 48 8d 3d 3a ca 0a 00 e8 f5 19 02 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24 RSP: 002b:00007fff28eee5b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f0cdc819142 RDX: 0000000000020000 RSI: 00007f0cdc032000 RDI: 0000000000000003 RBP: 00007f0cdc032000 R08: 00007f0cdc031010 R09: 0000000000000000 R10: 0000000000000022 R11: 0000000000000246 R12: 00005607e9ee31f0 R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000 Fixes: 68983a354a65 ("netfilter: xtables: Add snapshot of hardidletimer target") Signed-off-by: Juhee Kang <claudiajkang@gmail.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-10-02netfilter: nf_tables: honor NLM_F_CREATE and NLM_F_EXCL in event notificationPablo Neira Ayuso
Include the NLM_F_CREATE and NLM_F_EXCL flags in netlink event notifications, otherwise userspace cannot distiguish between create and add commands. Fixes: 96518518cc41 ("netfilter: add nftables") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-28netfilter: nf_tables: reverse order in rule replacement expansionPablo Neira Ayuso
Deactivate old rule first, then append the new rule, so rule replacement notification via netlink first reports the deletion of the old rule with handle X in first place, then it adds the new rule (reusing the handle X of the replaced old rule). Note that the abort path releases the transaction that has been created by nft_delrule() on error. Fixes: ca08987885a1 ("netfilter: nf_tables: deactivate expressions in rule replecement routine") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-28netfilter: nf_tables: add position handle in event notificationPablo Neira Ayuso
Add position handle to allow to identify the rule location from netlink events. Otherwise, userspace cannot incrementally update a userspace cache through monitoring events. Skip handle dump if the rule has been either inserted (at the beginning of the ruleset) or appended (at the end of the ruleset), the NLM_F_APPEND netlink flag is sufficient in these two cases. Handle NLM_F_REPLACE as NLM_F_APPEND since the rule replacement expansion appends it after the specified rule handle. Fixes: 96518518cc41 ("netfilter: add nftables") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-21netfilter: conntrack: serialize hash resizes and cleanupsEric Dumazet
Syzbot was able to trigger the following warning [1] No repro found by syzbot yet but I was able to trigger similar issue by having 2 scripts running in parallel, changing conntrack hash sizes, and: for j in `seq 1 1000` ; do unshare -n /bin/true >/dev/null ; done It would take more than 5 minutes for net_namespace structures to be cleaned up. This is because nf_ct_iterate_cleanup() has to restart everytime a resize happened. By adding a mutex, we can serialize hash resizes and cleanups and also make get_next_corpse() faster by skipping over empty buckets. Even without resizes in the picture, this patch considerably speeds up network namespace dismantles. [1] INFO: task syz-executor.0:8312 can't die for more than 144 seconds. task:syz-executor.0 state:R running task stack:25672 pid: 8312 ppid: 6573 flags:0x00004006 Call Trace: context_switch kernel/sched/core.c:4955 [inline] __schedule+0x940/0x26f0 kernel/sched/core.c:6236 preempt_schedule_common+0x45/0xc0 kernel/sched/core.c:6408 preempt_schedule_thunk+0x16/0x18 arch/x86/entry/thunk_64.S:35 __local_bh_enable_ip+0x109/0x120 kernel/softirq.c:390 local_bh_enable include/linux/bottom_half.h:32 [inline] get_next_corpse net/netfilter/nf_conntrack_core.c:2252 [inline] nf_ct_iterate_cleanup+0x15a/0x450 net/netfilter/nf_conntrack_core.c:2275 nf_conntrack_cleanup_net_list+0x14c/0x4f0 net/netfilter/nf_conntrack_core.c:2469 ops_exit_list+0x10d/0x160 net/core/net_namespace.c:171 setup_net+0x639/0xa30 net/core/net_namespace.c:349 copy_net_ns+0x319/0x760 net/core/net_namespace.c:470 create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110 unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226 ksys_unshare+0x445/0x920 kernel/fork.c:3128 __do_sys_unshare kernel/fork.c:3202 [inline] __se_sys_unshare kernel/fork.c:3200 [inline] __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3200 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f63da68e739 RSP: 002b:00007f63d7c05188 EFLAGS: 00000246 ORIG_RAX: 0000000000000110 RAX: ffffffffffffffda RBX: 00007f63da792f80 RCX: 00007f63da68e739 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000040000000 RBP: 00007f63da6e8cc4 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f63da792f80 R13: 00007fff50b75d3f R14: 00007f63d7c05300 R15: 0000000000022000 Showing all locks held in the system: 1 lock held by khungtaskd/27: #0: ffffffff8b980020 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260 kernel/locking/lockdep.c:6446 2 locks held by kworker/u4:2/153: #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline] #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: arch_atomic_long_set include/linux/atomic/atomic-long.h:41 [inline] #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: atomic_long_set include/linux/atomic/atomic-instrumented.h:1198 [inline] #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: set_work_data kernel/workqueue.c:634 [inline] #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: set_work_pool_and_clear_pending kernel/workqueue.c:661 [inline] #0: ffff888010c69138 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x896/0x1690 kernel/workqueue.c:2268 #1: ffffc9000140fdb0 ((kfence_timer).work){+.+.}-{0:0}, at: process_one_work+0x8ca/0x1690 kernel/workqueue.c:2272 1 lock held by systemd-udevd/2970: 1 lock held by in:imklog/6258: #0: ffff88807f970ff0 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0xe9/0x100 fs/file.c:990 3 locks held by kworker/1:6/8158: 1 lock held by syz-executor.0/8312: 2 locks held by kworker/u4:13/9320: 1 lock held by syz-executor.5/10178: 1 lock held by syz-executor.4/10217: Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-21netfilter: log: work around missing softdep backend moduleFlorian Westphal
iptables/nftables has two types of log modules: 1. backend, e.g. nf_log_syslog, which implement the functionality 2. frontend, e.g. xt_LOG or nft_log, which call the functionality provided by backend based on nf_tables or xtables rule set. Problem is that the request_module() call to load the backed in nf_logger_find_get() might happen with nftables transaction mutex held in case the call path is via nf_tables/nft_compat. This can cause deadlocks (see 'Fixes' tags for details). The chosen solution as to let modprobe deal with this by adding 'pre: ' soft dep tag to xt_LOG (to load the syslog backend) and xt_NFLOG (to load nflog backend). Eric reports that this breaks on systems with older modprobe that doesn't support softdeps. Another, similar issue occurs when someone either insmods xt_(NF)LOG directly or unloads the backend module (possible if no log frontend is in use): because the frontend module is already loaded, modprobe is not invoked again so the softdep isn't evaluated. Add a workaround: If nf_logger_find_get() returns -ENOENT and call is not via nft_compat, load the backend explicitly and try again. Else, let nft_compat ask for deferred request_module via nf_tables infra. Softdeps are kept in-place, so with newer modprobe the dependencies are resolved from userspace. Fixes: cefa31a9d461 ("netfilter: nft_log: perform module load from nf_tables") Fixes: a38b5b56d6f4 ("netfilter: nf_log: add module softdeps") Reported-and-tested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-21netfilter: nf_nat_masquerade: defer conntrack walk to work queueFlorian Westphal
The ipv4 and device notifiers are called with RTNL mutex held. The table walk can take some time, better not block other RTNL users. 'ip a' has been reported to block for up to 20 seconds when conntrack table has many entries and device down events are frequent (e.g., PPP). Reported-and-tested-by: Martin Zaharinov <micron10@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-21netfilter: nf_nat_masquerade: make async masq_inet6_event handling genericFlorian Westphal
masq_inet6_event is called asynchronously from system work queue, because the inet6 notifier is atomic and nf_iterate_cleanup can sleep. The ipv4 and device notifiers call nf_iterate_cleanup directly. This is legal, but these notifiers are called with RTNL mutex held. A large conntrack table with many devices coming and going will have severe impact on the system usability, with 'ip a' blocking for several seconds. This change places the defer code into a helper and makes it more generic so ipv4 and ifdown notifiers can be converted to defer the cleanup walk as well in a follow patch. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-21netfilter: nf_tables: Fix oversized kvmalloc() callsPablo Neira Ayuso
The commit 7661809d493b ("mm: don't allow oversized kvmalloc() calls") limits the max allocatable memory via kvmalloc() to MAX_INT. Reported-by: syzbot+cd43695a64bcd21b8596@syzkaller.appspotmail.com Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-21netfilter: nf_tables: unlink table before deleting itFlorian Westphal
syzbot reports following UAF: BUG: KASAN: use-after-free in memcmp+0x18f/0x1c0 lib/string.c:955 nla_strcmp+0xf2/0x130 lib/nlattr.c:836 nft_table_lookup.part.0+0x1a2/0x460 net/netfilter/nf_tables_api.c:570 nft_table_lookup net/netfilter/nf_tables_api.c:4064 [inline] nf_tables_getset+0x1b3/0x860 net/netfilter/nf_tables_api.c:4064 nfnetlink_rcv_msg+0x659/0x13f0 net/netfilter/nfnetlink.c:285 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2504 Problem is that all get operations are lockless, so the commit_mutex held by nft_rcv_nl_event() isn't enough to stop a parallel GET request from doing read-accesses to the table object even after synchronize_rcu(). To avoid this, unlink the table first and store the table objects in on-stack scratch space. Fixes: 6001a930ce03 ("netfilter: nftables: introduce table ownership") Reported-and-tested-by: syzbot+f31660cf279b0557160c@syzkaller.appspotmail.com Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-21netfilter: nat: include zone id in nat table hash againFlorian Westphal
Similar to the conntrack change, also use the zone id for the nat source lists if the zone id is valid in both directions. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-21netfilter: conntrack: include zone id in tuple hash againFlorian Westphal
commit deedb59039f111 ("netfilter: nf_conntrack: add direction support for zones") removed the zone id from the hash value. This has implications on hash chain lengths with overlapping tuples, which can hit 64k entries on released kernels, before upper droplimit was added in d7e7747ac5c ("netfilter: refuse insertion if chain has grown too large"). With that change reverted, test script coming with this series shows linear insertion time growth: 10000 entries in 3737 ms (now 10000 total, loop 1) 10000 entries in 16994 ms (now 20000 total, loop 2) 10000 entries in 47787 ms (now 30000 total, loop 3) 10000 entries in 72731 ms (now 40000 total, loop 4) 10000 entries in 95761 ms (now 50000 total, loop 5) 10000 entries in 96809 ms (now 60000 total, loop 6) inserted 60000 entries from packet path in 333825 ms With d7e7747ac5c in place, the test fails. There are three supported zone use cases: 1. Connection is in the default zone (zone 0). This means to special config (the default). 2. Connection is in a different zone (1 to 2**16). This means rules are in place to put packets in the desired zone, e.g. derived from vlan id or interface. 3. Original direction is in zone X and Reply is in zone 0. 3) allows to use of the existing NAT port collision avoidance to provide connectivity to internet/wan even when the various zones have overlapping source networks separated via policy routing. In case the original zone is 0 all three cases are identical. There is no way to place original direction in zone x and reply in zone y (with y != 0). Zones need to be assigned manually via the iptables/nftables ruleset, before conntrack lookup occurs (raw table in iptables) using the "CT" target conntrack template support (-j CT --{zone,zone-orig,zone-reply} X). Normally zone assignment happens based on incoming interface, but could also be derived from packet mark, vlan id and so on. This means that when case 3 is used, the ruleset will typically not even assign a connection tracking template to the "reply" packets, so lookup happens in zone 0. However, it is possible that reply packets also match a ct zone assignment rule which sets up a template for zone X (X > 0) in original direction only. Therefore, after making the zone id part of the hash, we need to do a second lookup using the reply zone id if we did not find an entry on the first lookup. In practice, most deployments will either not use zones at all or the origin and reply zones are the same, no second lookup is required in either case. After this change, packet path insertion test passes with constant insertion times: 10000 entries in 1064 ms (now 10000 total, loop 1) 10000 entries in 1074 ms (now 20000 total, loop 2) 10000 entries in 1066 ms (now 30000 total, loop 3) 10000 entries in 1079 ms (now 40000 total, loop 4) 10000 entries in 1081 ms (now 50000 total, loop 5) 10000 entries in 1082 ms (now 60000 total, loop 6) inserted 60000 entries from packet path in 6452 ms Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-21netfilter: conntrack: make max chain length randomFlorian Westphal
Similar to commit 67d6d681e15b ("ipv4: make exception cache less predictible"): Use a random drop length to make it harder to detect when entries were hashed to same bucket list. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-14ipvs: check that ip_vs_conn_tab_bits is between 8 and 20Andrea Claudi
ip_vs_conn_tab_bits may be provided by the user through the conn_tab_bits module parameter. If this value is greater than 31, or less than 0, the shift operator used to derive tab_size causes undefined behaviour. Fix this checking ip_vs_conn_tab_bits value to be in the range specified in ipvs Kconfig. If not, simply use default value. Fixes: 6f7edb4881bf ("IPVS: Allow boot time change of hash size") Reported-by: Yi Chen <yiche@redhat.com> Signed-off-by: Andrea Claudi <aclaudi@redhat.com> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-14netfilter: ipset: Fix oversized kvmalloc() callsJozsef Kadlecsik
The commit commit 7661809d493b426e979f39ab512e3adf41fbcc69 Author: Linus Torvalds <torvalds@linux-foundation.org> Date: Wed Jul 14 09:45:49 2021 -0700 mm: don't allow oversized kvmalloc() calls limits the max allocatable memory via kvmalloc() to MAX_INT. Apply the same limit in ipset. Reported-by: syzbot+3493b1873fb3ea827986@syzkaller.appspotmail.com Reported-by: syzbot+2b8443c35458a617c904@syzkaller.appspotmail.com Reported-by: syzbot+ee5cb15f4a0e85e0d54e@syzkaller.appspotmail.com Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-09-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfJakub Kicinski
Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Protect nft_ct template with global mutex, from Pavel Skripkin. 2) Two recent commits switched inet rt and nexthop exception hashes from jhash to siphash. If those two spots are problematic then conntrack is affected as well, so switch voer to siphash too. While at it, add a hard upper limit on chain lengths and reject insertion if this is hit. Patches from Florian Westphal. 3) Fix use-after-scope in nf_socket_ipv6 reported by KASAN, from Benjamin Hesmans. * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf: netfilter: socket: icmp6: fix use-after-scope netfilter: refuse insertion if chain has grown too large netfilter: conntrack: switch to siphash netfilter: conntrack: sanitize table size default settings netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutex ==================== Link: https://lore.kernel.org/r/20210903163020.13741-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-30Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Daniel Borkmann says: ==================== bpf-next 2021-08-31 We've added 116 non-merge commits during the last 17 day(s) which contain a total of 126 files changed, 6813 insertions(+), 4027 deletions(-). The main changes are: 1) Add opaque bpf_cookie to perf link which the program can read out again, to be used in libbpf-based USDT library, from Andrii Nakryiko. 2) Add bpf_task_pt_regs() helper to access userspace pt_regs, from Daniel Xu. 3) Add support for UNIX stream type sockets for BPF sockmap, from Jiang Wang. 4) Allow BPF TCP congestion control progs to call bpf_setsockopt() e.g. to switch to another congestion control algorithm during init, from Martin KaFai Lau. 5) Extend BPF iterator support for UNIX domain sockets, from Kuniyuki Iwashima. 6) Allow bpf_{set,get}sockopt() calls from setsockopt progs, from Prankur Gupta. 7) Add bpf_get_netns_cookie() helper for BPF_PROG_TYPE_{SOCK_OPS,CGROUP_SOCKOPT} progs, from Xu Liu and Stanislav Fomichev. 8) Support for __weak typed ksyms in libbpf, from Hao Luo. 9) Shrink struct cgroup_bpf by 504 bytes through refactoring, from Dave Marchevsky. 10) Fix a smatch complaint in verifier's narrow load handling, from Andrey Ignatov. 11) Fix BPF interpreter's tail call count limit, from Daniel Borkmann. 12) Big batch of improvements to BPF selftests, from Magnus Karlsson, Li Zhijian, Yucong Sun, Yonghong Song, Ilya Leoshkevich, Jussi Maki, Ilya Leoshkevich, others. 13) Another big batch to revamp XDP samples in order to give them consistent look and feel, from Kumar Kartikeya Dwivedi. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (116 commits) MAINTAINERS: Remove self from powerpc BPF JIT selftests/bpf: Fix potential unreleased lock samples: bpf: Fix uninitialized variable in xdp_redirect_cpu selftests/bpf: Reduce more flakyness in sockmap_listen bpf: Fix bpf-next builds without CONFIG_BPF_EVENTS bpf: selftests: Add dctcp fallback test bpf: selftests: Add connect_to_fd_opts to network_helpers bpf: selftests: Add sk_state to bpf_tcp_helpers.h bpf: tcp: Allow bpf-tcp-cc to call bpf_(get|set)sockopt selftests: xsk: Preface options with opt selftests: xsk: Make enums lower case selftests: xsk: Generate packets from specification selftests: xsk: Generate packet directly in umem selftests: xsk: Simplify cleanup of ifobjects selftests: xsk: Decrease sending speed selftests: xsk: Validate tx stats on tx thread selftests: xsk: Simplify packet validation in xsk tests selftests: xsk: Rename worker_* functions that are not thread entry points selftests: xsk: Disassociate umem size with packets sent selftests: xsk: Remove end-of-test packet ... ==================== Link: https://lore.kernel.org/r/20210830225618.11634-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-30netfilter: refuse insertion if chain has grown too largeFlorian Westphal
Also add a stat counter for this that gets exported both via old /proc interface and ctnetlink. Assuming the old default size of 16536 buckets and max hash occupancy of 64k, this results in 128k insertions (origin+reply), so ~8 entries per chain on average. The revised settings in this series will result in about two entries per bucket on average. This allows a hard-limit ceiling of 64. This is not tunable at the moment, but its possible to either increase nf_conntrack_buckets or decrease nf_conntrack_max to reduce average lengths. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-30netfilter: conntrack: switch to siphashFlorian Westphal
Replace jhash in conntrack and nat core with siphash. While at it, use the netns mix value as part of the input key rather than abuse the seed value. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-30netfilter: conntrack: sanitize table size default settingsFlorian Westphal
conntrack has two distinct table size settings: nf_conntrack_max and nf_conntrack_buckets. The former limits how many conntrack objects are allowed to exist in each namespace. The second sets the size of the hashtable. As all entries are inserted twice (once for original direction, once for reply), there should be at least twice as many buckets in the table than the maximum number of conntrack objects that can exist at the same time. Change the default multiplier to 1 and increase the chosen bucket sizes. This results in the same nf_conntrack_max settings as before but reduces the average bucket list length. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-30netfilter: add netfilter hooks to SRv6 data planeRyoga Saito
This patch introduces netfilter hooks for solving the problem that conntrack couldn't record both inner flows and outer flows. This patch also introduces a new sysctl toggle for enabling lightweight tunnel netfilter hooks. Signed-off-by: Ryoga Saito <contact@proelbtn.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-25netfilter: ctnetlink: missing counters and timestamp in nfnetlink_{log,queue}Pablo Neira Ayuso
Add counters and timestamps (if available) to the conntrack object that is represented in nfnetlink_log and _queue messages. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-25netfilter: ecache: remove nf_exp_event_notifier structureFlorian Westphal
Reuse the conntrack event notofier struct, this allows to remove the extra register/unregister functions and avoids a pointer in struct net. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-25netfilter: ecache: prepare for event notifier mergeFlorian Westphal
This prepares for merge for ct and exp notifier structs. The 'fcn' member is renamed to something unique. Second, the register/unregister api is simplified. There is only one implementation so there is no need to do any error checking. Replace the EBUSY logic with WARN_ON_ONCE. This allows to remove error unwinding. The exp notifier register/unregister function is removed in a followup patch. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-25netfilter: ecache: add common helper for nf_conntrack_eventmask_reportFlorian Westphal
nf_ct_deliver_cached_events and nf_conntrack_eventmask_report are very similar. Split nf_conntrack_eventmask_report into a common helper function that can be used for both cases. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-25netfilter: ecache: remove another indent levelFlorian Westphal
... by changing: if (unlikely(ret < 0 || missed)) { if (ret < 0) { to if (likely(ret >= 0 && !missed)) goto out; if (ret < 0) { After this nf_conntrack_eventmask_report and nf_ct_deliver_cached_events look pretty much the same, next patch moves common code to a helper. This patch has no effect on generated code. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-25netfilter: ecache: remove one indent levelFlorian Westphal
nf_conntrack_eventmask_report and nf_ct_deliver_cached_events shared most of their code. This unifies the layout by changing if (nf_ct_is_confirmed(ct)) { foo } to if (!nf_ct_is_confirmed(ct))) return foo This removes one level of indentation. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-19net: Fix offloading indirect devices dependency on qdisc order creationEli Cohen
Currently, when creating an ingress qdisc on an indirect device before the driver registered for callbacks, the driver will not have a chance to register its filter configuration callbacks. To fix that, modify the code such that it keeps track of all the ingress qdiscs that call flow_indr_dev_setup_offload(). When a driver calls flow_indr_dev_register(), go through the list of tracked ingress qdiscs and call the driver callback entry point so as to give it a chance to register its callback. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Eli Cohen <elic@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-17bpf: Refactor BPF_PROG_RUN into a functionAndrii Nakryiko
Turn BPF_PROG_RUN into a proper always inlined function. No functional and performance changes are intended, but it makes it much easier to understand what's going on with how BPF programs are actually get executed. It's more obvious what types and callbacks are expected. Also extra () around input parameters can be dropped, as well as `__` variable prefixes intended to avoid naming collisions, which makes the code simpler to read and write. This refactoring also highlighted one extra issue. BPF_PROG_RUN is both a macro and an enum value (BPF_PROG_RUN == BPF_PROG_TEST_RUN). Turning BPF_PROG_RUN into a function causes naming conflict compilation error. So rename BPF_PROG_RUN into lower-case bpf_prog_run(), similar to bpf_prog_run_xdp(), bpf_prog_run_pin_on_cpu(), etc. All existing callers of BPF_PROG_RUN, the macro, are switched to bpf_prog_run() explicitly. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210815070609.987780-2-andrii@kernel.org
2021-08-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Conflicts: drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.h 9e26680733d5 ("bnxt_en: Update firmware call to retrieve TX PTP timestamp") 9e518f25802c ("bnxt_en: 1PPS functions to configure TSIO pins") 099fdeda659d ("bnxt_en: Event handler for PPS events") kernel/bpf/helpers.c include/linux/bpf-cgroup.h a2baf4e8bb0f ("bpf: Fix potentially incorrect results with bpf_get_local_storage()") c7603cfa04e7 ("bpf: Add ambient BPF runtime context stored in current") drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c 5957cc557dc5 ("net/mlx5: Set all field of mlx5_irq before inserting it to the xarray") 2d0b41a37679 ("net/mlx5: Refcount mlx5_irq with integer") MAINTAINERS 7b637cd52f02 ("MAINTAINERS: fix Microchip CAN BUS Analyzer Tool entry typo") 7d901a1e878a ("net: phy: add Maxlinear GPY115/21x/24x driver") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Use nfnetlink_unicast() instead of netlink_unicast() in nft_compat. 2) Remove call to nf_ct_l4proto_find() in flowtable offload timeout fixup. 3) CLUSTERIP registers ARP hook on demand, from Florian. 4) Use clusterip_net to store pernet warning, also from Florian. 5) Remove struct netns_xt, from Florian Westphal. 6) Enable ebtables hooks in initns on demand, from Florian. 7) Allow to filter conntrack netlink dump per status bits, from Florian Westphal. 8) Register x_tables hooks in initns on demand, from Florian. 9) Remove queue_handler from per-netns structure, again from Florian. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-11netfilter: nft_ct: protect nft_ct_pcpu_template_refcnt with mutexPavel Skripkin
Syzbot hit use-after-free in nf_tables_dump_sets. The problem was in missing lock protection for nft_ct_pcpu_template_refcnt. Before commit f102d66b335a ("netfilter: nf_tables: use dedicated mutex to guard transactions") all transactions were serialized by global mutex, but then global mutex was changed to local per netnamespace commit_mutex. This change causes use-after-free bug, when 2 netnamespaces concurently changing nft_ct_pcpu_template_refcnt without proper locking. Fix it by adding nft_ct_pcpu_mutex and protect all nft_ct_pcpu_template_refcnt changes with it. Fixes: f102d66b335a ("netfilter: nf_tables: use dedicated mutex to guard transactions") Reported-and-tested-by: syzbot+649e339fa6658ee623d3@syzkaller.appspotmail.com Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-10netfilter: nf_queue: move hookfn registration out of struct netFlorian Westphal
This was done to detect when the pernet->init() function was not called yet, by checking if net->nf.queue_handler is NULL. Once the nfnetlink_queue module is active, all struct net pointers contain the same address. So place this back in nf_queue.c. Handle the 'netns error unwind' test by checking nfnl_queue_net for a NULL pointer and add a comment for this. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-09netfilter: x_tables: never register tables by defaultFlorian Westphal
For historical reasons x_tables still register tables by default in the initial namespace. Only newly created net namespaces add the hook on demand. This means that the init_net always pays hook cost, even if no filtering rules are added (e.g. only used inside a single netns). Note that the hooks are added even when 'iptables -L' is called. This is because there is no way to tell 'iptables -A' and 'iptables -L' apart at kernel level. The only solution would be to register the table, but delay hook registration until the first rule gets added (or policy gets changed). That however means that counters are not hooked either, so 'iptables -L' would always show 0-counters even when traffic is flowing which might be unexpected. This keeps table and hook registration consistent with what is already done in non-init netns: first iptables(-save) invocation registers both table and hooks. This applies the same solution adopted for ebtables. All tables register a template that contains the l3 family, the name and a constructor function that is called when the initial table has to be added. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-06netfilter: nfnetlink_hook: translate inet ingress to netdevPablo Neira Ayuso
The NFPROTO_INET pseudofamily is not exposed through this new netlink interface. The netlink dump either shows NFPROTO_IPV4 or NFPROTO_IPV6 for NFPROTO_INET prerouting/input/forward/output/postrouting hooks. The NFNLA_CHAIN_FAMILY attribute provides the family chain, which specifies if this hook applies to inet traffic only (either IPv4 or IPv6). Translate the inet/ingress hook to netdev/ingress to fully hide the NFPROTO_INET implementation details. Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-06netfilter: conntrack: remove offload_pickup sysctl againFlorian Westphal
These two sysctls were added because the hardcoded defaults (2 minutes, tcp, 30 seconds, udp) turned out to be too low for some setups. They appeared in 5.14-rc1 so it should be fine to remove it again. Marcelo convinced me that there should be no difference between a flow that was offloaded vs. a flow that was not wrt. timeout handling. Thus the default is changed to those for TCP established and UDP stream, 5 days and 120 seconds, respectively. Marcelo also suggested to account for the timeout value used for the offloading, this avoids increase beyond the value in the conntrack-sysctl and will also instantly expire the conntrack entry with altered sysctls. Example: nf_conntrack_udp_timeout_stream=60 nf_flowtable_udp_timeout=60 This will remove offloaded udp flows after one minute, rather than two. An earlier version of this patch also cleared the ASSURED bit to allow nf_conntrack to evict the entry via early_drop (i.e., table full). However, it looks like we can safely assume that connection timed out via HW is still in established state, so this isn't needed. Quoting Oz: [..] the hardware sends all packets with a set FIN flags to sw. [..] Connections that are aged in hardware are expected to be in the established state. In case it turns out that back-to-sw-path transition can occur for 'dodgy' connections too (e.g., one side disappeared while software-path would have been in RETRANS timeout), we can adjust this later. Cc: Oz Shlomo <ozsh@nvidia.com> Cc: Paul Blakey <paulb@nvidia.com> Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-06netfilter: nfnetlink_hook: Use same family as request messagePablo Neira Ayuso
Use the same family as the request message, for consistency. The netlink payload provides sufficient information to describe the hook object, including the family. This makes it easier to userspace to correlate the hooks are that visited by the packets for a certain family. Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-06netfilter: nfnetlink_hook: use the sequence number of the request messagePablo Neira Ayuso
The sequence number allows to correlate the netlink reply message (as part of the dump) with the original request message. The cb->seq field is internally used to detect an interference (update) of the hook list during the netlink dump, do not use it as sequence number in the netlink dump header. Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-06netfilter: nfnetlink_hook: missing chain familyPablo Neira Ayuso
The family is relevant for pseudo-families like NFPROTO_INET otherwise the user needs to rely on the hook function name to differentiate it from NFPROTO_IPV4 and NFPROTO_IPV6 names. Add nfnl_hook_chain_desc_attributes instead of using the existing NFTA_CHAIN_* attributes, since these do not provide a family number. Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-06netfilter: nfnetlink_hook: strip off module name from hookfnPablo Neira Ayuso
NFNLA_HOOK_FUNCTION_NAME should include the hook function name only, the module name is already provided by NFNLA_HOOK_MODULE_NAME. Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-06netfilter: conntrack: collect all entries in one cycleFlorian Westphal
Michal Kubecek reports that conntrack gc is responsible for frequent wakeups (every 125ms) on idle systems. On busy systems, timed out entries are evicted during lookup. The gc worker is only needed to remove entries after system becomes idle after a busy period. To resolve this, always scan the entire table. If the scan is taking too long, reschedule so other work_structs can run and resume from next bucket. After a completed scan, wait for 2 minutes before the next cycle. Heuristics for faster re-schedule are removed. GC_SCAN_INTERVAL could be exposed as a sysctl in the future to allow tuning this as-needed or even turn the gc worker off. Reported-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-05net: Remove redundant if statementsYajun Deng
The 'if (dev)' statement already move into dev_{put , hold}, so remove redundant if statements. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05netfilter: ctnetlink: allow to filter dump by status bitsFlorian Westphal
If CTA_STATUS is present, but CTA_STATUS_MASK is not, then the mask is automatically set to 'status', so that kernel returns those entries that have all of the requested bits set. This makes more sense than using a all-one mask since we'd hardly ever find a match. There are no other checks for status bits, so if e.g. userspace sets impossible combinations it will get an empty dump. If kernel would reject unknown status bits, then a program that works on a future kernel that has IPS_FOO bit fails on old kernels. Same for 'impossible' combinations: Kernel never sets ASSURED without first having set SEEN_REPLY, but its possible that a future kernel could do so. Therefore no sanity tests other than a 0-mask. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-05netfilter: ctnetlink: add and use a helper for mark parsingFlorian Westphal
ctnetlink dumps can be filtered based on the connmark. Prepare for status bit filtering by using a named structure and by moving the mark parsing code to a helper. Else ctnetlink_alloc_filter size grows a bit too big for my taste when status handling is added. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-04netfilter: ipset: Limit the maximal range of consecutive elements to add/deleteJozsef Kadlecsik
The range size of consecutive elements were not limited. Thus one could define a huge range which may result soft lockup errors due to the long execution time. Now the range size is limited to 2^20 entries. Reported-by: Brad Spengler <spender@grsecurity.net> Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-01netfilter: remove xt pernet dataFlorian Westphal
clusterip is now handled via net_generic. NOTRACK is tiny compared to rest of xt_CT feature set, even the existing deprecation warning is bigger than the actual functionality. Just remove the warning, its not worth keeping/adding a net_generic one. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-08-01netfilter: flowtable: remove nf_ct_l4proto_find() callPablo Neira Ayuso
TCP and UDP are built-in conntrack protocol trackers and the flowtable only supports for TCP and UDP, remove this call. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>