summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2022-11-28Daniel Borkmann says:Jakub Kicinski
==================== bpf-next 2022-11-25 We've added 101 non-merge commits during the last 11 day(s) which contain a total of 109 files changed, 8827 insertions(+), 1129 deletions(-). The main changes are: 1) Support for user defined BPF objects: the use case is to allocate own objects, build own object hierarchies and use the building blocks to build own data structures flexibly, for example, linked lists in BPF, from Kumar Kartikeya Dwivedi. 2) Add bpf_rcu_read_{,un}lock() support for sleepable programs, from Yonghong Song. 3) Add support storing struct task_struct objects as kptrs in maps, from David Vernet. 4) Batch of BPF map documentation improvements, from Maryam Tahhan and Donald Hunter. 5) Improve BPF verifier to propagate nullness information for branches of register to register comparisons, from Eduard Zingerman. 6) Fix cgroup BPF iter infra to hold reference on the start cgroup, from Hou Tao. 7) Fix BPF verifier to not mark fentry/fexit program arguments as trusted given it is not the case for them, from Alexei Starovoitov. 8) Improve BPF verifier's realloc handling to better play along with dynamic runtime analysis tools like KASAN and friends, from Kees Cook. 9) Remove legacy libbpf mode support from bpftool, from Sahid Orentino Ferdjaoui. 10) Rework zero-len skb redirection checks to avoid potentially breaking existing BPF test infra users, from Stanislav Fomichev. 11) Two small refactorings which are independent and have been split out of the XDP queueing RFC series, from Toke Høiland-Jørgensen. 12) Fix a memory leak in LSM cgroup BPF selftest, from Wang Yufen. 13) Documentation on how to run BPF CI without patch submission, from Daniel Müller. Signed-off-by: Jakub Kicinski <kuba@kernel.org> ==================== Link: https://lore.kernel.org/r/20221125012450.441-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-24bpf: Don't mark arguments to fentry/fexit programs as trusted.Alexei Starovoitov
The PTR_TRUSTED flag should only be applied to pointers where the verifier can guarantee that such pointers are valid. The fentry/fexit/fmod_ret programs are not in this category. Only arguments of SEC("tp_btf") and SEC("iter") programs are trusted (which have BPF_TRACE_RAW_TP and BPF_TRACE_ITER attach_type correspondingly) This bug was masked because convert_ctx_accesses() was converting trusted loads into BPF_PROBE_MEM loads. Fix it as well. The loads from trusted pointers don't need exception handling. Fixes: 3f00c5239344 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20221124215314.55890-1-alexei.starovoitov@gmail.com
2022-11-24bpf: Add kfunc bpf_rcu_read_lock/unlock()Yonghong Song
Add two kfunc's bpf_rcu_read_lock() and bpf_rcu_read_unlock(). These two kfunc's can be used for all program types. The following is an example about how rcu pointer are used w.r.t. bpf_rcu_read_lock()/bpf_rcu_read_unlock(). struct task_struct { ... struct task_struct *last_wakee; struct task_struct __rcu *real_parent; ... }; Let us say prog does 'task = bpf_get_current_task_btf()' to get a 'task' pointer. The basic rules are: - 'real_parent = task->real_parent' should be inside bpf_rcu_read_lock region. This is to simulate rcu_dereference() operation. The 'real_parent' is marked as MEM_RCU only if (1). task->real_parent is inside bpf_rcu_read_lock region, and (2). task is a trusted ptr. So MEM_RCU marked ptr can be 'trusted' inside the bpf_rcu_read_lock region. - 'last_wakee = real_parent->last_wakee' should be inside bpf_rcu_read_lock region since it tries to access rcu protected memory. - the ptr 'last_wakee' will be marked as PTR_UNTRUSTED since in general it is not clear whether the object pointed by 'last_wakee' is valid or not even inside bpf_rcu_read_lock region. The verifier will reset all rcu pointer register states to untrusted at bpf_rcu_read_unlock() kfunc call site, so any such rcu pointer won't be trusted any more outside the bpf_rcu_read_lock() region. The current implementation does not support nested rcu read lock region in the prog. Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221124053217.2373910-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-24bpf: Introduce might_sleep field in bpf_func_protoYonghong Song
Introduce bpf_func_proto->might_sleep to indicate a particular helper might sleep. This will make later check whether a helper might be sleepable or not easier. Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221124053211.2373553-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-23bpf: Add bpf_task_from_pid() kfuncDavid Vernet
Callers can currently store tasks as kptrs using bpf_task_acquire(), bpf_task_kptr_get(), and bpf_task_release(). These are useful if a caller already has a struct task_struct *, but there may be some callers who only have a pid, and want to look up the associated struct task_struct * from that to e.g. find task->comm. This patch therefore adds a new bpf_task_from_pid() kfunc which allows BPF programs to get a struct task_struct * kptr from a pid. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20221122145300.251210-2-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-24bpf: Unify and simplify btf_func_proto_check error handlingStanislav Fomichev
Replace 'err = x; break;' with 'return x;'. Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20221124002838.2700179-1-sdf@google.com
2022-11-24bpf: Prevent decl_tag from being referenced in func_proto argStanislav Fomichev
Syzkaller managed to hit another decl_tag issue: btf_func_proto_check kernel/bpf/btf.c:4506 [inline] btf_check_all_types kernel/bpf/btf.c:4734 [inline] btf_parse_type_sec+0x1175/0x1980 kernel/bpf/btf.c:4763 btf_parse kernel/bpf/btf.c:5042 [inline] btf_new_fd+0x65a/0xb00 kernel/bpf/btf.c:6709 bpf_btf_load+0x6f/0x90 kernel/bpf/syscall.c:4342 __sys_bpf+0x50a/0x6c0 kernel/bpf/syscall.c:5034 __do_sys_bpf kernel/bpf/syscall.c:5093 [inline] __se_sys_bpf kernel/bpf/syscall.c:5091 [inline] __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5091 do_syscall_64+0x54/0x70 arch/x86/entry/common.c:48 This seems similar to commit ea68376c8bed ("bpf: prevent decl_tag from being referenced in func_proto") but for the argument. Reported-by: syzbot+8dd0551dda6020944c5d@syzkaller.appspotmail.com Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20221123035422.872531-2-sdf@google.com
2022-11-23bpf: Don't use idx variable when registering kfunc dtorsDavid Vernet
In commit fda01efc6160 ("bpf: Enable cgroups to be used as kptrs"), I added an 'int idx' variable to kfunc_init() which was meant to dynamically set the index of the btf id entries of the 'generic_dtor_ids' array. This was done to make the code slightly less brittle as the struct cgroup * kptr kfuncs such as bpf_cgroup_aquire() are compiled out if CONFIG_CGROUPS is not defined. This, however, causes an lkp build warning: >> kernel/bpf/helpers.c:2005:40: warning: multiple unsequenced modifications to 'idx' [-Wunsequenced] .btf_id = generic_dtor_ids[idx++], Fix the warning by just hard-coding the indices. Fixes: fda01efc6160 ("bpf: Enable cgroups to be used as kptrs") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: David Vernet <void@manifault.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221123135253.637525-1-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-22bpf: Add bpf_cgroup_ancestor() kfuncDavid Vernet
struct cgroup * objects have a variably sized struct cgroup *ancestors[] field which stores pointers to their ancestor cgroups. If using a cgroup as a kptr, it can be useful to access these ancestors, but doing so requires variable offset accesses for PTR_TO_BTF_ID, which is currently unsupported. This is a very useful field to access for cgroup kptrs, as programs may wish to walk their ancestor cgroups when determining e.g. their proportional cpu.weight. So as to enable this functionality with cgroup kptrs before var_off is supported for PTR_TO_BTF_ID, this patch adds a bpf_cgroup_ancestor() kfunc which accesses the cgroup node on behalf of the caller, and acquires a reference on it. Once var_off is supported for PTR_TO_BTF_ID, and fields inside a struct can be marked as trusted so they retain the PTR_TRUSTED modifier when walked, this can be removed. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20221122055458.173143-4-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-22bpf: Enable cgroups to be used as kptrsDavid Vernet
Now that tasks can be used as kfuncs, and the PTR_TRUSTED flag is available for us to easily add basic acquire / get / release kfuncs, we can do the same for cgroups. This patch set adds the following kfuncs which enable using cgroups as kptrs: struct cgroup *bpf_cgroup_acquire(struct cgroup *cgrp); struct cgroup *bpf_cgroup_kptr_get(struct cgroup **cgrpp); void bpf_cgroup_release(struct cgroup *cgrp); A follow-on patch will add a selftest suite which validates these kfuncs. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20221122055458.173143-2-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-21bpf: Pin the start cgroup in cgroup_iter_seq_init()Hou Tao
bpf_iter_attach_cgroup() has already acquired an extra reference for the start cgroup, but the reference may be released if the iterator link fd is closed after the creation of iterator fd, and it may lead to user-after-free problem when reading the iterator fd. An alternative fix is pinning iterator link when opening iterator, but it will make iterator link being still visible after the close of iterator link fd and the behavior is different with other link types, so just fixing it by acquiring another reference for the start cgroup. Fixes: d4ccaf58a847 ("bpf: Introduce cgroup iter") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20221121073440.1828292-2-houtao@huaweicloud.com
2022-11-21bpf/verifier: Use kmalloc_size_roundup() to match ksize() usageKees Cook
Most allocation sites in the kernel want an explicitly sized allocation (and not "more"), and that dynamic runtime analysis tools (e.g. KASAN, UBSAN_BOUNDS, FORTIFY_SOURCE, etc) are looking for precise bounds checking (i.e. not something that is rounded up). A tiny handful of allocations were doing an implicit alloc/realloc loop that actually depended on ksize(), and didn't actually always call realloc. This has created a long series of bugs and problems over many years related to the runtime bounds checking, so these callers are finally being adjusted to _not_ depend on the ksize() side-effect, by doing one of several things: - tracking the allocation size precisely and just never calling ksize() at all [1]. - always calling realloc and not using ksize() at all. (This solution ends up actually be a subset of the next solution.) - using kmalloc_size_roundup() to explicitly round up the desired allocation size immediately [2]. The bpf/verifier case is this another of this latter case, and is the last outstanding case to be fixed in the kernel. Because some of the dynamic bounds checking depends on the size being an _argument_ to an allocator function (i.e. see the __alloc_size attribute), the ksize() users are rare, and it could waste local variables, it was been deemed better to explicitly separate the rounding up from the allocation itself [3]. Round up allocations with kmalloc_size_roundup() so that the verifier's use of ksize() is always accurate. [1] e.g.: https://git.kernel.org/linus/712f210a457d https://git.kernel.org/linus/72c08d9f4c72 [2] e.g.: https://git.kernel.org/netdev/net-next/c/12d6c1d3a2ad https://git.kernel.org/netdev/net-next/c/ab3f7828c979 https://git.kernel.org/netdev/net-next/c/d6dd508080a3 [3] https://lore.kernel.org/lkml/0ea1fc165a6c6117f982f4f135093e69cb884930.camel@redhat.com/ Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20221118183409.give.387-kees@kernel.org
2022-11-20bpf: Add a kfunc for generic type castYonghong Song
Implement bpf_rdonly_cast() which tries to cast the object to a specified type. This tries to support use case like below: #define skb_shinfo(SKB) ((struct skb_shared_info *)(skb_end_pointer(SKB))) where skb_end_pointer(SKB) is a 'unsigned char *' and needs to be casted to 'struct skb_shared_info *'. The signature of bpf_rdonly_cast() looks like void *bpf_rdonly_cast(void *obj, __u32 btf_id) The function returns the same 'obj' but with PTR_TO_BTF_ID with btf_id. The verifier will ensure btf_id being a struct type. Since the supported type cast may not reflect what the 'obj' represents, the returned btf_id is marked as PTR_UNTRUSTED, so the return value and subsequent pointer chasing cannot be used as helper/kfunc arguments. Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221120195437.3114585-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-20bpf: Add a kfunc to type cast from bpf uapi ctx to kernel ctxYonghong Song
Implement bpf_cast_to_kern_ctx() kfunc which does a type cast of a uapi ctx object to the corresponding kernel ctx. Previously if users want to access some data available in kctx but not in uapi ctx, bpf_probe_read_kernel() helper is needed. The introduction of bpf_cast_to_kern_ctx() allows direct memory access which makes code simpler and easier to understand. Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221120195432.3113982-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-20bpf: Add support for kfunc set with common btf_idsYonghong Song
Later on, we will introduce kfuncs bpf_cast_to_kern_ctx() and bpf_rdonly_cast() which apply to all program types. Currently kfunc set only supports individual prog types. This patch added support for kfunc applying to all program types. Signed-off-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221120195426.3113828-1-yhs@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-20bpf: Disallow bpf_obj_new_impl call when bpf_mem_alloc_init failsKumar Kartikeya Dwivedi
In the unlikely event that bpf_global_ma is not correctly initialized, instead of checking the boolean everytime bpf_obj_new_impl is called, simply check it while loading the program and return an error if bpf_global_ma_set is false. Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221120212610.2361700-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-20bpf: Add kfuncs for storing struct task_struct * as a kptrDavid Vernet
Now that BPF supports adding new kernel functions with kfuncs, and storing kernel objects in maps with kptrs, we can add a set of kfuncs which allow struct task_struct objects to be stored in maps as referenced kptrs. The possible use cases for doing this are plentiful. During tracing, for example, it would be useful to be able to collect some tasks that performed a certain operation, and then periodically summarize who they are, which cgroup they're in, how much CPU time they've utilized, etc. In order to enable this, this patch adds three new kfuncs: struct task_struct *bpf_task_acquire(struct task_struct *p); struct task_struct *bpf_task_kptr_get(struct task_struct **pp); void bpf_task_release(struct task_struct *p); A follow-on patch will add selftests validating these kfuncs. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20221120051004.3605026-4-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-20bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncsDavid Vernet
Kfuncs currently support specifying the KF_TRUSTED_ARGS flag to signal to the verifier that it should enforce that a BPF program passes it a "safe", trusted pointer. Currently, "safe" means that the pointer is either PTR_TO_CTX, or is refcounted. There may be cases, however, where the kernel passes a BPF program a safe / trusted pointer to an object that the BPF program wishes to use as a kptr, but because the object does not yet have a ref_obj_id from the perspective of the verifier, the program would be unable to pass it to a KF_ACQUIRE | KF_TRUSTED_ARGS kfunc. The solution is to expand the set of pointers that are considered trusted according to KF_TRUSTED_ARGS, so that programs can invoke kfuncs with these pointers without getting rejected by the verifier. There is already a PTR_UNTRUSTED flag that is set in some scenarios, such as when a BPF program reads a kptr directly from a map without performing a bpf_kptr_xchg() call. These pointers of course can and should be rejected by the verifier. Unfortunately, however, PTR_UNTRUSTED does not cover all the cases for safety that need to be addressed to adequately protect kfuncs. Specifically, pointers obtained by a BPF program "walking" a struct are _not_ considered PTR_UNTRUSTED according to BPF. For example, say that we were to add a kfunc called bpf_task_acquire(), with KF_ACQUIRE | KF_TRUSTED_ARGS, to acquire a struct task_struct *. If we only used PTR_UNTRUSTED to signal that a task was unsafe to pass to a kfunc, the verifier would mistakenly allow the following unsafe BPF program to be loaded: SEC("tp_btf/task_newtask") int BPF_PROG(unsafe_acquire_task, struct task_struct *task, u64 clone_flags) { struct task_struct *acquired, *nested; nested = task->last_wakee; /* Would not be rejected by the verifier. */ acquired = bpf_task_acquire(nested); if (!acquired) return 0; bpf_task_release(acquired); return 0; } To address this, this patch defines a new type flag called PTR_TRUSTED which tracks whether a PTR_TO_BTF_ID pointer is safe to pass to a KF_TRUSTED_ARGS kfunc or a BPF helper function. PTR_TRUSTED pointers are passed directly from the kernel as a tracepoint or struct_ops callback argument. Any nested pointer that is obtained from walking a PTR_TRUSTED pointer is no longer PTR_TRUSTED. From the example above, the struct task_struct *task argument is PTR_TRUSTED, but the 'nested' pointer obtained from 'task->last_wakee' is not PTR_TRUSTED. A subsequent patch will add kfuncs for storing a task kfunc as a kptr, and then another patch will add selftests to validate. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20221120051004.3605026-3-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-20bpf: Allow multiple modifiers in reg_type_str() prefixDavid Vernet
reg_type_str() in the verifier currently only allows a single register type modifier to be present in the 'prefix' string which is eventually stored in the env type_str_buf. This currently works fine because there are no overlapping type modifiers, but once PTR_TRUSTED is added, that will no longer be the case. This patch updates reg_type_str() to support having multiple modifiers in the prefix string, and updates the size of type_str_buf to be 128 bytes. Signed-off-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/r/20221120051004.3605026-2-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Add comments for map BTF matching requirement for bpf_list_headKumar Kartikeya Dwivedi
The old behavior of bpf_map_meta_equal was that it compared timer_off to be equal (but not spin_lock_off, because that was not allowed), and did memcmp of kptr_off_tab. Now, we memcmp the btf_record of two bpf_map structs, which has all fields. We preserve backwards compat as we kzalloc the array, so if only spin lock and timer exist in map, we only compare offset while the rest of unused members in the btf_field struct are zeroed out. In case of kptr, btf and everything else is of vmlinux or module, so as long type is same it will match, since kernel btf, module, dtor pointer will be same across maps. Now with list_head in the mix, things are a bit complicated. We implicitly add a requirement that both BTFs are same, because struct btf_field_list_head has btf and value_rec members. We obviously shouldn't force BTFs to be equal by default, as that breaks backwards compatibility. Currently it is only implicitly required due to list_head matching struct btf and value_rec member. value_rec points back into a btf_record stashed in the map BTF (btf member of btf_field_list_head). So that pointer and btf member has to match exactly. Document all these subtle details so that things don't break in the future when touching this code. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-19-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Add 'release on unlock' logic for bpf_list_push_{front,back}Kumar Kartikeya Dwivedi
This commit implements the delayed release logic for bpf_list_push_front and bpf_list_push_back. Once a node has been added to the list, it's pointer changes to PTR_UNTRUSTED. However, it is only released once the lock protecting the list is unlocked. For such PTR_TO_BTF_ID | MEM_ALLOC with PTR_UNTRUSTED set but an active ref_obj_id, it is still permitted to read them as long as the lock is held. Writing to them is not allowed. This allows having read access to push items we no longer own until we release the lock guarding the list, allowing a little more flexibility when working with these APIs. Note that enabling write support has fairly tricky interactions with what happens inside the critical section. Just as an example, currently, bpf_obj_drop is not permitted, but if it were, being able to write to the PTR_UNTRUSTED pointer while the object gets released back to the memory allocator would violate safety properties we wish to guarantee (i.e. not crashing the kernel). The memory could be reused for a different type in the BPF program or even in the kernel as it gets eventually kfree'd. Not enabling bpf_obj_drop inside the critical section would appear to prevent all of the above, but that is more of an artifical limitation right now. Since the write support is tangled with how we handle potential aliasing of nodes inside the critical section that may or may not be part of the list anymore, it has been deferred to a future patch. Acked-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-18-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Introduce single ownership BPF linked list APIKumar Kartikeya Dwivedi
Add a linked list API for use in BPF programs, where it expects protection from the bpf_spin_lock in the same allocation as the bpf_list_head. For now, only one bpf_spin_lock can be present hence that is assumed to be the one protecting the bpf_list_head. The following functions are added to kick things off: // Add node to beginning of list void bpf_list_push_front(struct bpf_list_head *head, struct bpf_list_node *node); // Add node to end of list void bpf_list_push_back(struct bpf_list_head *head, struct bpf_list_node *node); // Remove node at beginning of list and return it struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head); // Remove node at end of list and return it struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head); The lock protecting the bpf_list_head needs to be taken for all operations. The verifier ensures that the lock that needs to be taken is always held, and only the correct lock is taken for these operations. These checks are made statically by relying on the reg->id preserved for registers pointing into regions having both bpf_spin_lock and the objects protected by it. The comment over check_reg_allocation_locked in this change describes the logic in detail. Note that bpf_list_push_front and bpf_list_push_back are meant to consume the object containing the node in the 1st argument, however that specific mechanism is intended to not release the ref_obj_id directly until the bpf_spin_unlock is called. In this commit, nothing is done, but the next commit will be introducing logic to handle this case, so it has been left as is for now. bpf_list_pop_front and bpf_list_pop_back delete the first or last item of the list respectively, and return pointer to the element at the list_node offset. The user can then use container_of style macro to get the actual entry type. The verifier however statically knows the actual type, so the safety properties are still preserved. With these additions, programs can now manage their own linked lists and store their objects in them. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-17-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Permit NULL checking pointer with non-zero fixed offsetKumar Kartikeya Dwivedi
Pointer increment on seeing PTR_MAYBE_NULL is already protected against, hence make an exception for PTR_TO_BTF_ID | MEM_ALLOC while still keeping the warning for other unintended cases that might creep in. bpf_list_pop_{front,_back} helpers planned to be introduced in next commit will return a MEM_ALLOC register with incremented offset pointing to bpf_list_node field. The user is supposed to then obtain the pointer to the entry using container_of after NULL checking it. The current restrictions trigger a warning when doing the NULL checking. Revisiting the reason, it is meant as an assertion which seems to actually work and catch the bad case. Hence, under no other circumstances can reg->off be non-zero for a register that has the PTR_MAYBE_NULL type flag set. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-16-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Introduce bpf_obj_dropKumar Kartikeya Dwivedi
Introduce bpf_obj_drop, which is the kfunc used to free allocated objects (allocated using bpf_obj_new). Pairing with bpf_obj_new, it implicitly destructs the fields part of object automatically without user intervention. Just like the previous patch, btf_struct_meta that is needed to free up the special fields is passed as a hidden argument to the kfunc. For the user, a convenience macro hides over the kernel side kfunc which is named bpf_obj_drop_impl. Continuing the previous example: void prog(void) { struct foo *f; f = bpf_obj_new(typeof(*f)); if (!f) return; bpf_obj_drop(f); } Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-15-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Introduce bpf_obj_newKumar Kartikeya Dwivedi
Introduce type safe memory allocator bpf_obj_new for BPF programs. The kernel side kfunc is named bpf_obj_new_impl, as passing hidden arguments to kfuncs still requires having them in prototype, unlike BPF helpers which always take 5 arguments and have them checked using bpf_func_proto in verifier, ignoring unset argument types. Introduce __ign suffix to ignore a specific kfunc argument during type checks, then use this to introduce support for passing type metadata to the bpf_obj_new_impl kfunc. The user passes BTF ID of the type it wants to allocates in program BTF, the verifier then rewrites the first argument as the size of this type, after performing some sanity checks (to ensure it exists and it is a struct type). The second argument is also fixed up and passed by the verifier. This is the btf_struct_meta for the type being allocated. It would be needed mostly for the offset array which is required for zero initializing special fields while leaving the rest of storage in unitialized state. It would also be needed in the next patch to perform proper destruction of the object's special fields. Under the hood, bpf_obj_new will call bpf_mem_alloc and bpf_mem_free, using the any context BPF memory allocator introduced recently. To this end, a global instance of the BPF memory allocator is initialized on boot to be used for this purpose. This 'bpf_global_ma' serves all allocations for bpf_obj_new. In the future, bpf_obj_new variants will allow specifying a custom allocator. Note that now that bpf_obj_new can be used to allocate objects that can be linked to BPF linked list (when future linked list helpers are available), we need to also free the elements using bpf_mem_free. However, since the draining of elements is done outside the bpf_spin_lock, we need to do migrate_disable around the call since bpf_list_head_free can be called from map free path where migration is enabled. Otherwise, when called from BPF programs migration is already disabled. A convenience macro is included in the bpf_experimental.h header to hide over the ugly details of the implementation, leading to user code looking similar to a language level extension which allocates and constructs fields of a user type. struct bar { struct bpf_list_node node; }; struct foo { struct bpf_spin_lock lock; struct bpf_list_head head __contains(bar, node); }; void prog(void) { struct foo *f; f = bpf_obj_new(typeof(*f)); if (!f) return; ... } A key piece of this story is still missing, i.e. the free function, which will come in the next patch. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-14-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Support constant scalar arguments for kfuncsKumar Kartikeya Dwivedi
Allow passing known constant scalars as arguments to kfuncs that do not represent a size parameter. We use mark_chain_precision for the constant scalar argument to mark it precise. This makes the search pruning optimization of verifier more conservative for such kfunc calls, and each non-distinct argument is considered unequivalent. We will use this support to then expose a bpf_obj_new function where it takes the local type ID of a type in program BTF, and returns a PTR_TO_BTF_ID | MEM_ALLOC to the local type, and allows programs to allocate their own objects. Each type ID resolves to a distinct type with a possibly distinct size, hence the type ID constant matters in terms of program safety and its precision needs to be checked between old and cur states inside regsafe. The use of mark_chain_precision enables this. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-13-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Rewrite kfunc argument handlingKumar Kartikeya Dwivedi
As we continue to add more features, argument types, kfunc flags, and different extensions to kfuncs, the code to verify the correctness of the kfunc prototype wrt the passed in registers has become ad-hoc and ugly to read. To make life easier, and make a very clear split between different stages of argument processing, move all the code into verifier.c and refactor into easier to read helpers and functions. This also makes sharing code within the verifier easier with kfunc argument processing. This will be more and more useful in later patches as we are now moving to implement very core BPF helpers as kfuncs, to keep them experimental before baking into UAPI. Remove all kfunc related bits now from btf_check_func_arg_match, as users have been converted away to refactored kfunc argument handling. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-12-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Allow locking bpf_spin_lock in inner map valuesKumar Kartikeya Dwivedi
There is no need to restrict users from locking bpf_spin_lock in map values of inner maps. Each inner map lookup gets a unique reg->id assigned to the returned PTR_TO_MAP_VALUE which will be preserved after the NULL check. Distinct lookups into different inner map get unique IDs, and distinct lookups into same inner map also get unique IDs. Hence, lift the restriction by removing the check return -ENOTSUPP in map_in_map.c. Later commits will add comprehensive test cases to ensure that invalid cases are rejected. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-11-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Allow locking bpf_spin_lock global variablesKumar Kartikeya Dwivedi
Global variables reside in maps accessible using direct_value_addr callbacks, so giving each load instruction's rewrite a unique reg->id disallows us from holding locks which are global. The reason for preserving reg->id as a unique value for registers that may point to spin lock is that two separate lookups are treated as two separate memory regions, and any possible aliasing is ignored for the purposes of spin lock correctness. This is not great especially for the global variable case, which are served from maps that have max_entries == 1, i.e. they always lead to map values pointing into the same map value. So refactor the active_spin_lock into a 'active_lock' structure which represents the lock identity, and instead of the reg->id, remember two fields, a pointer and the reg->id. The pointer will store reg->map_ptr or reg->btf. It's only necessary to distinguish for the id == 0 case of global variables, but always setting the pointer to a non-NULL value and using the pointer to check whether the lock is held simplifies code in the verifier. This is generic enough to allow it for global variables, map lookups, and allocated objects at the same time. Note that while whether a lock is held can be answered by just comparing active_lock.ptr to NULL, to determine whether the register is pointing to the same held lock requires comparing _both_ ptr and id. Finally, as a result of this refactoring, pseudo load instructions are not given a unique reg->id, as they are doing lookup for the same map value (max_entries is never greater than 1). Essentially, we consider that the tuple of (ptr, id) will always be unique for any kind of argument to bpf_spin_{lock,unlock}. Note that this can be extended in the future to also remember offset used for locking, so that we can introduce multiple bpf_spin_lock fields in the same allocation. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-10-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Allow locking bpf_spin_lock in allocated objectsKumar Kartikeya Dwivedi
Allow locking a bpf_spin_lock in an allocated object, in addition to already supported map value pointers. The handling is similar to that of map values, by just preserving the reg->id of PTR_TO_BTF_ID | MEM_ALLOC as well, and adjusting process_spin_lock to work with them and remember the id in verifier state. Refactor the existing process_spin_lock to work with PTR_TO_BTF_ID | MEM_ALLOC in addition to PTR_TO_MAP_VALUE. We need to update the reg_may_point_to_spin_lock which is used in mark_ptr_or_null_reg to preserve reg->id, that will be used in env->cur_state->active_spin_lock to remember the currently held spin lock. Also update the comment describing bpf_spin_lock implementation details to also talk about PTR_TO_BTF_ID | MEM_ALLOC type. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-9-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Verify ownership relationships for user BTF typesKumar Kartikeya Dwivedi
Ensure that there can be no ownership cycles among different types by way of having owning objects that can hold some other type as their element. For instance, a map value can only hold allocated objects, but these are allowed to have another bpf_list_head. To prevent unbounded recursion while freeing resources, elements of bpf_list_head in local kptrs can never have a bpf_list_head which are part of list in a map value. Later patches will verify this by having dedicated BTF selftests. Also, to make runtime destruction easier, once btf_struct_metas is fully populated, we can stash the metadata of the value type directly in the metadata of the list_head fields, as that allows easier access to the value type's layout to destruct it at runtime from the btf_field entry of the list head itself. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-8-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Recognize lock and list fields in allocated objectsKumar Kartikeya Dwivedi
Allow specifying bpf_spin_lock, bpf_list_head, bpf_list_node fields in a allocated object. Also update btf_struct_access to reject direct access to these special fields. A bpf_list_head allows implementing map-in-map style use cases, where an allocated object with bpf_list_head is linked into a list in a map value. This would require embedding a bpf_list_node, support for which is also included. The bpf_spin_lock is used to protect the bpf_list_head and other data. While we strictly don't require to hold a bpf_spin_lock while touching the bpf_list_head in such objects, as when have access to it, we have complete ownership of the object, the locking constraint is still kept and may be conditionally lifted in the future. Note that the specification of such types can be done just like map values, e.g.: struct bar { struct bpf_list_node node; }; struct foo { struct bpf_spin_lock lock; struct bpf_list_head head __contains(bar, node); struct bpf_list_node node; }; struct map_value { struct bpf_spin_lock lock; struct bpf_list_head head __contains(foo, node); }; To recognize such types in user BTF, we build a btf_struct_metas array of metadata items corresponding to each BTF ID. This is done once during the btf_parse stage to avoid having to do it each time during the verification process's requirement to inspect the metadata. Moreover, the computed metadata needs to be passed to some helpers in future patches which requires allocating them and storing them in the BTF that is pinned by the program itself, so that valid access can be assumed to such data during program runtime. A key thing to note is that once a btf_struct_meta is available for a type, both the btf_record and btf_field_offs should be available. It is critical that btf_field_offs is available in case special fields are present, as we extensively rely on special fields being zeroed out in map values and allocated objects in later patches. The code ensures that by bailing out in case of errors and ensuring both are available together. If the record is not available, the special fields won't be recognized, so not having both is also fine (in terms of being a verification error and not a runtime bug). Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-7-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Introduce allocated objects supportKumar Kartikeya Dwivedi
Introduce support for representing pointers to objects allocated by the BPF program, i.e. PTR_TO_BTF_ID that point to a type in program BTF. This is indicated by the presence of MEM_ALLOC type flag in reg->type to avoid having to check btf_is_kernel when trying to match argument types in helpers. Whenever walking such types, any pointers being walked will always yield a SCALAR instead of pointer. In the future we might permit kptr inside such allocated objects (either kernel or program allocated), and it will then form a PTR_TO_BTF_ID of the respective type. For now, such allocated objects will always be referenced in verifier context, hence ref_obj_id == 0 for them is a bug. It is allowed to write to such objects, as long fields that are special are not touched (support for which will be added in subsequent patches). Note that once such a pointer is marked PTR_UNTRUSTED, it is no longer allowed to write to it. No PROBE_MEM handling is therefore done for loads into this type unless PTR_UNTRUSTED is part of the register type, since they can never be in an undefined state, and their lifetime will always be valid. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-6-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Populate field_offs for inner_map_metaKumar Kartikeya Dwivedi
Far too much code simply assumes that both btf_record and btf_field_offs are set to valid pointers together, or both are unset. They go together hand in hand as btf_record describes the special fields and btf_field_offs is compact representation for runtime copying/zeroing. It is very difficult to make this clear in the code when the only exception to this universal invariant is inner_map_meta which is used as reg->map_ptr in the verifier. This is simply a bug waiting to happen, as in verifier context we cannot easily distinguish if PTR_TO_MAP_VALUE is coming from an inner map, and if we ever end up using field_offs for any reason in the future, we will silently ignore the special fields for inner map case (as NULL is not an error but unset field_offs). Hence, simply copy field_offs from inner map together with btf_record. While at it, refactor code to unwind properly on errors with gotos. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Free inner_map_meta when btf_record_dup failsKumar Kartikeya Dwivedi
Whenever btf_record_dup fails, we must free inner_map_meta that was allocated before. This fixes a memory leak (in case of errors) during inner map creation. Fixes: aa3496accc41 ("bpf: Refactor kptr_off_tab into btf_record") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Do btf_record_free outside map_free callbackKumar Kartikeya Dwivedi
Since the commit being fixed, we now miss freeing btf_record for local storage maps which will have a btf_record populated in case they have bpf_spin_lock element. This was missed because I made the choice of offloading the job to free kptr_off_tab (now btf_record) to the map_free callback when adding support for kptrs. Revisiting the reason for this decision, there is the possibility that the btf_record gets used inside map_free callback (e.g. in case of maps embedding kptrs) to iterate over them and free them, hence doing it before the map_free callback would be leaking special field memory, and do invalid memory access. The btf_record keeps module references which is critical to ensure the dtor call made for referenced kptr is safe to do. If doing it after map_free callback, the map area is already freed, so we cannot access bpf_map structure anymore. To fix this and prevent such lapses in future, move bpf_map_free_record out of the map_free callback, and do it after map_free by remembering the btf_record pointer. There is no need to access bpf_map structure in that case, and we can avoid missing this case when support for new map types is added for other special fields. Since a btf_record and its btf_field_offs are used together, for consistency delay freeing of field_offs as well. While not a problem right now, a lot of code assumes that either both record and field_offs are set or none at once. Note that in case of map of maps (outer maps), inner_map_meta->record is only used during verification, not to free fields in map value, hence we simply keep the bpf_map_free_record call as is in bpf_map_meta_free and never touch map->inner_map_meta in bpf_map_free_deferred. Add a comment making note of these details. Fixes: db559117828d ("bpf: Consolidate spin_lock, timer management into btf_record") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17bpf: Fix early return in map_check_btfKumar Kartikeya Dwivedi
Instead of returning directly with -EOPNOTSUPP for the timer case, we need to free the btf_record before returning to userspace. Fixes: db559117828d ("bpf: Consolidate spin_lock, timer management into btf_record") Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221118015614.2013203-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
include/linux/bpf.h 1f6e04a1c7b8 ("bpf: Fix offset calculation error in __copy_map_value and zero_map_value") aa3496accc41 ("bpf: Refactor kptr_off_tab into btf_record") f71b2f64177a ("bpf: Refactor map->off_arr handling") https://lore.kernel.org/all/20221114095000.67a73239@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17bpf: Pass map file to .map_update_batch directlyHou Tao
Currently bpf_map_do_batch() first invokes fdget(batch.map_fd) to get the target map file, then it invokes generic_map_update_batch() to do batch update. generic_map_update_batch() will get the target map file by using fdget(batch.map_fd) again and pass it to bpf_map_update_value(). The problem is map file returned by the second fdget() may be NULL or a totally different file compared by map file in bpf_map_do_batch(). The reason is that the first fdget() only guarantees the liveness of struct file instead of file descriptor and the file description may be released by concurrent close() through pick_file(). It doesn't incur any problem as for now, because maps with batch update support don't use map file in .map_fd_get_ptr() ops. But it is better to fix the potential access of an invalid map file. Using __bpf_map_get() again in generic_map_update_batch() can not fix the problem, because batch.map_fd may be closed and reopened, and the returned map file may be different with map file got in bpf_map_do_batch(), so just passing the map file directly to .map_update_batch() in bpf_map_do_batch(). Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20221116075059.1551277-1-houtao@huaweicloud.com
2022-11-15bpf: propagate nullness information for reg to reg comparisonsEduard Zingerman
Propagate nullness information for branches of register to register equality compare instructions. The following rules are used: - suppose register A maybe null - suppose register B is not null - for JNE A, B, ... - A is not null in the false branch - for JEQ A, B, ... - A is not null in the true branch E.g. for program like below: r6 = skb->sk; r7 = sk_fullsock(r6); r0 = sk_fullsock(r6); if (r0 == 0) return 0; (a) if (r0 != r7) return 0; (b) *r7->type; (c) return 0; It is safe to dereference r7 at point (c), because of (a) and (b). Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20221115224859.2452988-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-15bpf: Expand map key argument of bpf_redirect_map to u64Toke Høiland-Jørgensen
For queueing packets in XDP we want to add a new redirect map type with support for 64-bit indexes. To prepare fore this, expand the width of the 'key' argument to the bpf_redirect_map() helper. Since BPF registers are always 64-bit, this should be safe to do after the fact. Acked-by: Song Liu <song@kernel.org> Reviewed-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/20221108140601.149971-3-toke@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14bpf: Refactor btf_struct_accessKumar Kartikeya Dwivedi
Instead of having to pass multiple arguments that describe the register, pass the bpf_reg_state into the btf_struct_access callback. Currently, all call sites simply reuse the btf and btf_id of the reg they want to check the access of. The only exception to this pattern is the callsite in check_ptr_to_map_access, hence for that case create a dummy reg to simulate PTR_TO_BTF_ID access. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221114191547.1694267-8-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14bpf: Rename MEM_ALLOC to MEM_RINGBUFKumar Kartikeya Dwivedi
Currently, verifier uses MEM_ALLOC type tag to specially tag memory returned from bpf_ringbuf_reserve helper. However, this is currently only used for this purpose and there is an implicit assumption that it only refers to ringbuf memory (e.g. the check for ARG_PTR_TO_ALLOC_MEM in check_func_arg_reg_off). Hence, rename MEM_ALLOC to MEM_RINGBUF to indicate this special relationship and instead open the use of MEM_ALLOC for more generic allocations made for user types. Also, since ARG_PTR_TO_ALLOC_MEM_OR_NULL is unused, simply drop it. Finally, update selftests using 'alloc_' verifier string to 'ringbuf_'. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221114191547.1694267-7-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14bpf: Rename RET_PTR_TO_ALLOC_MEMKumar Kartikeya Dwivedi
Currently, the verifier has two return types, RET_PTR_TO_ALLOC_MEM, and RET_PTR_TO_ALLOC_MEM_OR_NULL, however the former is confusingly named to imply that it carries MEM_ALLOC, while only the latter does. This causes confusion during code review leading to conclusions like that the return value of RET_PTR_TO_DYNPTR_MEM_OR_NULL (which is RET_PTR_TO_ALLOC_MEM | PTR_MAYBE_NULL) may be consumable by bpf_ringbuf_{submit,commit}. Rename it to make it clear MEM_ALLOC needs to be tacked on top of RET_PTR_TO_MEM. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221114191547.1694267-6-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14bpf: Support bpf_list_head in map valuesKumar Kartikeya Dwivedi
Add the support on the map side to parse, recognize, verify, and build metadata table for a new special field of the type struct bpf_list_head. To parameterize the bpf_list_head for a certain value type and the list_node member it will accept in that value type, we use BTF declaration tags. The definition of bpf_list_head in a map value will be done as follows: struct foo { struct bpf_list_node node; int data; }; struct map_value { struct bpf_list_head head __contains(foo, node); }; Then, the bpf_list_head only allows adding to the list 'head' using the bpf_list_node 'node' for the type struct foo. The 'contains' annotation is a BTF declaration tag composed of four parts, "contains:name:node" where the name is then used to look up the type in the map BTF, with its kind hardcoded to BTF_KIND_STRUCT during the lookup. The node defines name of the member in this type that has the type struct bpf_list_node, which is actually used for linking into the linked list. For now, 'kind' part is hardcoded as struct. This allows building intrusive linked lists in BPF, using container_of to obtain pointer to entry, while being completely type safe from the perspective of the verifier. The verifier knows exactly the type of the nodes, and knows that list helpers return that type at some fixed offset where the bpf_list_node member used for this list exists. The verifier also uses this information to disallow adding types that are not accepted by a certain list. For now, no elements can be added to such lists. Support for that is coming in future patches, hence draining and freeing items is done with a TODO that will be resolved in a future patch. Note that the bpf_list_head_free function moves the list out to a local variable under the lock and releases it, doing the actual draining of the list items outside the lock. While this helps with not holding the lock for too long pessimizing other concurrent list operations, it is also necessary for deadlock prevention: unless every function called in the critical section would be notrace, a fentry/fexit program could attach and call bpf_map_update_elem again on the map, leading to the same lock being acquired if the key matches and lead to a deadlock. While this requires some special effort on part of the BPF programmer to trigger and is highly unlikely to occur in practice, it is always better if we can avoid such a condition. While notrace would prevent this, doing the draining outside the lock has advantages of its own, hence it is used to also fix the deadlock related problem. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221114191547.1694267-5-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14bpf: Remove BPF_MAP_OFF_ARR_MAXKumar Kartikeya Dwivedi
In f71b2f64177a ("bpf: Refactor map->off_arr handling"), map->off_arr was refactored to be btf_field_offs. The number of field offsets is equal to maximum possible fields limited by BTF_FIELDS_MAX. Hence, reuse BTF_FIELDS_MAX as spin_lock and timer no longer are to be handled specially for offset sorting, fix the comment, and remove incorrect WARN_ON as its rec->cnt can never exceed this value. The reason to keep separate constant was the it was always more 2 more than total kptrs. This is no longer the case. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20221114191547.1694267-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-11Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Andrii Nakryiko says: ==================== bpf-next 2022-11-11 We've added 49 non-merge commits during the last 9 day(s) which contain a total of 68 files changed, 3592 insertions(+), 1371 deletions(-). The main changes are: 1) Veristat tool improvements to support custom filtering, sorting, and replay of results, from Andrii Nakryiko. 2) BPF verifier precision tracking fixes and improvements, from Andrii Nakryiko. 3) Lots of new BPF documentation for various BPF maps, from Dave Tucker, Donald Hunter, Maryam Tahhan, Bagas Sanjaya. 4) BTF dedup improvements and libbpf's hashmap interface clean ups, from Eduard Zingerman. 5) Fix veth driver panic if XDP program is attached before veth_open, from John Fastabend. 6) BPF verifier clean ups and fixes in preparation for follow up features, from Kumar Kartikeya Dwivedi. 7) Add access to hwtstamp field from BPF sockops programs, from Martin KaFai Lau. 8) Various fixes for BPF selftests and samples, from Artem Savkov, Domenico Cerasuolo, Kang Minchul, Rong Tao, Yang Jihong. 9) Fix redirection to tunneling device logic, preventing skb->len == 0, from Stanislav Fomichev. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (49 commits) selftests/bpf: fix veristat's singular file-or-prog filter selftests/bpf: Test skops->skb_hwtstamp selftests/bpf: Fix incorrect ASSERT in the tcp_hdr_options test bpf: Add hwtstamp field for the sockops prog selftests/bpf: Fix xdp_synproxy compilation failure in 32-bit arch bpf, docs: Document BPF_MAP_TYPE_ARRAY docs/bpf: Document BPF map types QUEUE and STACK docs/bpf: Document BPF ARRAY_OF_MAPS and HASH_OF_MAPS docs/bpf: Document BPF_MAP_TYPE_CPUMAP map docs/bpf: Document BPF_MAP_TYPE_LPM_TRIE map libbpf: Hashmap.h update to fix build issues using LLVM14 bpf: veth driver panics when xdp prog attached before veth_open selftests: Fix test group SKIPPED result selftests/bpf: Tests for btf_dedup_resolve_fwds libbpf: Resolve unambigous forward declarations libbpf: Hashmap interface update to allow both long and void* keys/values samples/bpf: Fix sockex3 error: Missing BPF prog type selftests/bpf: Fix u32 variable compared with less than zero Documentation: bpf: Escape underscore in BPF type name prefix selftests/bpf: Use consistent build-id type for liburandom_read.so ... ==================== Link: https://lore.kernel.org/r/20221111233733.1088228-1-andrii@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-11Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski
Andrii Nakryiko says: ==================== bpf 2022-11-11 We've added 11 non-merge commits during the last 8 day(s) which contain a total of 11 files changed, 83 insertions(+), 74 deletions(-). The main changes are: 1) Fix strncpy_from_kernel_nofault() to prevent out-of-bounds writes, from Alban Crequy. 2) Fix for bpf_prog_test_run_skb() to prevent wrong alignment, from Baisong Zhong. 3) Switch BPF_DISPATCHER to static_call() instead of ftrace infra, with a small build fix on top, from Peter Zijlstra and Nathan Chancellor. 4) Fix memory leak in BPF verifier in some error cases, from Wang Yufen. 5) 32-bit compilation error fixes for BPF selftests, from Pu Lehui and Yang Jihong. 6) Ensure even distribution of per-CPU free list elements, from Xu Kuohai. 7) Fix copy_map_value() to track special zeroed out areas properly, from Xu Kuohai. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Fix offset calculation error in __copy_map_value and zero_map_value bpf: Initialize same number of free nodes for each pcpu_freelist selftests: bpf: Add a test when bpf_probe_read_kernel_str() returns EFAULT maccess: Fix writing offset in case of fault in strncpy_from_kernel_nofault() selftests/bpf: Fix test_progs compilation failure in 32-bit arch selftests/bpf: Fix casting error when cross-compiling test_verifier for 32-bit platforms bpf: Fix memory leaks in __check_func_call bpf: Add explicit cast to 'void *' for __BPF_DISPATCHER_UPDATE() bpf: Convert BPF_DISPATCHER to use static_call() (not ftrace) bpf: Revert ("Fix dispatcher patchable function entry to 5 bytes nop") bpf, test_run: Fix alignment problem in bpf_prog_test_run_skb() ==================== Link: https://lore.kernel.org/r/20221111231624.938829-1-andrii@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-11bpf: Initialize same number of free nodes for each pcpu_freelistXu Kuohai
pcpu_freelist_populate() initializes nr_elems / num_possible_cpus() + 1 free nodes for some CPUs, and then possibly one CPU with fewer nodes, followed by remaining cpus with 0 nodes. For example, when nr_elems == 256 and num_possible_cpus() == 32, CPU 0~27 each gets 9 free nodes, CPU 28 gets 4 free nodes, CPU 29~31 get 0 free nodes, while in fact each CPU should get 8 nodes equally. This patch initializes nr_elems / num_possible_cpus() free nodes for each CPU firstly, then allocates the remaining free nodes by one for each CPU until no free nodes left. Fixes: e19494edab82 ("bpf: introduce percpu_freelist") Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20221110122128.105214-1-xukuohai@huawei.com
2022-11-11docs/bpf: Document BPF_MAP_TYPE_CPUMAP mapMaryam Tahhan
Add documentation for BPF_MAP_TYPE_CPUMAP including kernel version introduced, usage and examples. Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Maryam Tahhan <mtahhan@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20221107165207.2682075-2-mtahhan@redhat.com