2024-11-15  bpf: Do not alloc arena on unsupported arches  (Viktor Malik)
Do not allocate BPF arena on arches that do not support it, instead return EOPNOTSUPP. This is useful to prevent bugs such as soft lockups while trying to free the arena which we have witnessed on ppc64le [1]. [1] https://lore.kernel.org/bpf/4afdcb50-13f2-4772-8db1-3fd02bd985b3@redhat.com/ Signed-off-by: Viktor Malik <vmalik@redhat.com> Link: https://lore.kernel.org/r/20241115082548.74972-1-vmalik@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
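For illustration, the shape of such a guard might look as follows (the capability check below is a placeholder assumption, not the actual symbol used by the patch):

    /* Sketch of the guard: bail out early on arches without arena support.
     * CONFIG_ARCH_SUPPORTS_BPF_ARENA is a placeholder name, not the real symbol.
     */
    if (!IS_ENABLED(CONFIG_ARCH_SUPPORTS_BPF_ARENA))
            return ERR_PTR(-EOPNOTSUPP);   /* surfaced to user space at map create time */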
2024-11-14  selftests/bpf: Set test path for token/obj_priv_implicit_token_envvar  (Ihor Solodrai)
token/obj_priv_implicit_token_envvar test may fail in an environment where the process executing tests can not write to the root path. Example: https://github.com/libbpf/libbpf/actions/runs/11844507007/job/33007897936 Change default path used by the test to /tmp/bpf-token-fs, and make it runtime configurable via an environment variable. Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241115003853.864397-1-ihor.solodrai@pm.me
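A sketch of the runtime override described above; the environment variable name here is a placeholder for illustration, not necessarily the one the test actually reads:

    #include <stdlib.h>

    /* placeholder variable name; see the actual selftest for the real one */
    const char *bpffs_path = getenv("BPF_TOKEN_TEST_PATH");

    if (!bpffs_path)
            bpffs_path = "/tmp/bpf-token-fs";   /* new writable default */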
2024-11-13  Merge branch 'bpf-range_tree-for-bpf-arena'  (Andrii Nakryiko)
Alexei Starovoitov says: ==================== bpf: range_tree for bpf arena From: Alexei Starovoitov <ast@kernel.org> Introduce range_tree (interval tree plus rbtree) to track unallocated ranges in bpf arena and replace maple_tree with it. This is a step towards making bpf_arena|free_alloc_pages non-sleepable. The previous approach of reusing drm_mm to replace maple_tree reached a dead end, since sizeof(struct drm_mm_node) = 168 and sizeof(struct maple_node) = 256, while sizeof(struct range_node) = 64 as introduced in this patch. Not only is it smaller, but the algorithm also splits and merges adjacent ranges. Ultimate performance doesn't matter. The main objective of range_tree is to work in contexts where kmalloc/kfree are not safe. It achieves that via bpf_mem_alloc. ==================== Link: https://patch.msgid.link/20241108025616.17625-1-alexei.starovoitov@gmail.com Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-11-13  selftests/bpf: Add a test for arena range tree algorithm  (Alexei Starovoitov)
Add a test that verifies specific behavior of arena range tree algorithm and adjust existing big_alloc1 test due to use of global data in arena. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/bpf/20241108025616.17625-3-alexei.starovoitov@gmail.com
2024-11-13  bpf: Introduce range_tree data structure and use it in bpf arena  (Alexei Starovoitov)
Introduce range_tree data structure and use it in bpf arena to track ranges of allocated pages. range_tree is a large bitmap that is implemented as interval tree plus rbtree. The contiguous sequence of bits represents unallocated pages. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/bpf/20241108025616.17625-2-alexei.starovoitov@gmail.com
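As a rough illustration of the shape of such a structure (field names and layout below are guesses based on this description, not the exact definition in the patch), one node covers one contiguous run of unallocated pages and sits in both an interval tree and a second rbtree:

    /* Illustrative sketch only; see the patch for the real definition. */
    struct range_node {
            struct rb_node rn_rbnode;       /* interval tree, ordered by range start   */
            struct rb_node rn_size_rbnode;  /* rbtree ordered by range size (guess)    */
            u32 rn_start;                   /* first unallocated page index            */
            u32 rn_last;                    /* last unallocated page index, inclusive  */
    };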
2024-11-13  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (Alexei Starovoitov)
Cross-merge bpf fixes after downstream PR. In particular to bring the fix in commit aa30eb3260b2 ("bpf: Force checkpoint when jmp history is too long"). The follow up verifier work depends on it. And the fix in commit 6801cf7890f2 ("selftests/bpf: Use -4095 as the bad address for bits iterator"). It's fixing instability of BPF CI on s390 arch. No conflicts. Adjacent changes in: Auto-merging arch/Kconfig Auto-merging kernel/bpf/helpers.c Auto-merging kernel/bpf/memalloc.c Auto-merging kernel/bpf/verifier.c Auto-merging mm/slab_common.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-13  samples/bpf: Remove unused variable in xdp2skb_meta_kern.c  (Zhu Jun)
The variable is never referenced in the code, just remove it. Signed-off-by: Zhu Jun <zhujun2@cmss.chinamobile.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241111061514.3257-1-zhujun2@cmss.chinamobile.com
2024-11-13  samples/bpf: Remove unused variables in tc_l2_redirect_kern.c  (Zhu Jun)
These variables are never referenced in the code, just remove them. Signed-off-by: Zhu Jun <zhujun2@cmss.chinamobile.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241111062312.3541-1-zhujun2@cmss.chinamobile.com
2024-11-13  bpftool: Cast variable `var` to long long  (Luo Yifan)
When the SIGNED condition is met, the variable `var` should be cast to `long long` instead of `unsigned long long`. Signed-off-by: Luo Yifan <luoyifan@cmss.chinamobile.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Quentin Monnet <qmo@kernel.org> Link: https://lore.kernel.org/bpf/20241112073701.283362-1-luoyifan@cmss.chinamobile.com
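Not bpftool's actual code, but a small standalone illustration of the class of bug: printing a signed value through an unsigned long long cast yields a huge positive number instead of the negative value.

    #include <stdio.h>

    int main(void)
    {
            int var = -5;

            /* wrong: converts -5 to 2^64 - 5 before printing */
            printf("unsigned cast: %llu\n", (unsigned long long)var);
            /* right: sign-extends, prints -5 */
            printf("signed cast:   %lld\n", (long long)var);
            return 0;
    }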
2024-11-13  Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (Linus Torvalds)
Pull bpf fixes from Daniel Borkmann: - Fix a mismatching RCU unlock flavor in bpf_out_neigh_v6 (Jiawei Ye) - Fix BPF sockmap with kTLS to reject vsock and unix sockets upon kTLS context retrieval (Zijian Zhang) - Fix BPF bits iterator selftest for s390x (Hou Tao) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Fix mismatched RCU unlock flavour in bpf_out_neigh_v6 bpf: Add sk_is_inet and IS_ICSK check in tls_sw_has_ctx_tx/rx selftests/bpf: Use -4095 as the bad address for bits iterator
2024-11-13  Merge tag 'loongarch-fixes-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson  (Linus Torvalds)
Pull LoongArch fixes from Huacai Chen: - fix possible CPUs setup logical-physical CPU mapping, in order to avoid CPU hotplug issue - fix some KASAN bugs - fix AP booting issue in VM mode - some trivial cleanups * tag 'loongarch-fixes-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson: LoongArch: Fix AP booting issue in VM mode LoongArch: Add WriteCombine shadow mapping in KASAN LoongArch: Disable KASAN if PGDIR_SIZE is too large for cpu_vabits LoongArch: Make KASAN work with 5-level page-tables LoongArch: Define a default value for VM_DATA_DEFAULT_FLAGS LoongArch: Fix early_numa_add_cpu() usage for FDT systems LoongArch: For all possible CPUs setup logical-physical CPU mapping
2024-11-13  Merge tag 'mm-hotfixes-stable-2024-11-12-16-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  (Linus Torvalds)
Pull misc fixes from Andrew Morton: "10 hotfixes, 7 of which are cc:stable. 7 are MM, 3 are not. All singletons" * tag 'mm-hotfixes-stable-2024-11-12-16-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: swapfile: fix cluster reclaim work crash on rotational devices selftests: hugetlb_dio: fixup check for initial conditions to skip in the start mm/thp: fix deferred split queue not partially_mapped: fix mm/gup: avoid an unnecessary allocation call for FOLL_LONGTERM cases nommu: pass NULL argument to vma_iter_prealloc() ocfs2: fix UBSAN warning in ocfs2_verify_volume() nilfs2: fix null-ptr-deref in block_dirty_buffer tracepoint nilfs2: fix null-ptr-deref in block_touch_buffer tracepoint mm: page_alloc: move mlocked flag clearance into free_pages_prepare() mm: count zeromap read and set for swapout and swapin
2024-11-12  bpf, x86: Propagate tailcall info only for subprogs  (Leon Hwang)
In x64 JIT, propagate tailcall info only for subprogs, not for helpers or kfuncs. Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20241107134529.8602-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12  Merge branch 'add-kernel-symbol-for-struct_ops-trampoline'  (Alexei Starovoitov)
Xu Kuohai says: ==================== Add kernel symbol for struct_ops trampoline. Without kernel symbol for struct_ops trampoline, the unwinder may produce unexpected stacktraces. For example, the x86 ORC and FP unwinder stops stacktrace on a struct_ops trampoline address since there is no kernel symbol for the address. v4: - Add a separate cleanup patch to remove unused member rcu from bpf_struct_ops_map (patch 1) - Use funcs_cnt instead of btf_type_vlen(vt) for links memory calculation in .map_mem_usage (patch 2) - Include ksyms[] memory in map_mem_usage (patch 3) - Various fixes in patch 3 (Thanks to Martin) v3: https://lore.kernel.org/bpf/20241111121641.2679885-1-xukuohai@huaweicloud.com/ - Add a separate cleanup patch to replace links_cnt with funcs_cnt - Allocate ksyms on-demand in update_elem() to stay with the links allocation way - Set ksym name to prog__<struct_ops_name>_<member_name> v2: https://lore.kernel.org/bpf/20241101111948.1570547-1-xukuohai@huaweicloud.com/ - Refine the commit message for clarity and fix a test bot warning v1: https://lore.kernel.org/bpf/20241030111533.907289-1-xukuohai@huaweicloud.com/ ==================== Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20241112145849.3436772-1-xukuohai@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12  bpf: Add kernel symbol for struct_ops trampoline  (Xu Kuohai)
Without kernel symbols for struct_ops trampoline, the unwinder may produce unexpected stacktraces. For example, the x86 ORC and FP unwinders check if an IP is in kernel text by verifying the presence of the IP's kernel symbol. When a struct_ops trampoline address is encountered, the unwinder stops due to the absence of symbol, resulting in an incomplete stacktrace that consists only of direct and indirect child functions called from the trampoline. The arm64 unwinder is another example. While the arm64 unwinder can proceed across a struct_ops trampoline address, the corresponding symbol name is displayed as "unknown", which is confusing. Thus, add kernel symbol for struct_ops trampoline. The name is bpf__<struct_ops_name>_<member_name>, where <struct_ops_name> is the type name of the struct_ops, and <member_name> is the name of the member that the trampoline is linked to. Below is a comparison of stacktraces captured on x86 by perf record, before and after this patch. Before: ffffffff8116545d __lock_acquire+0xad ([kernel.kallsyms]) ffffffff81167fcc lock_acquire+0xcc ([kernel.kallsyms]) ffffffff813088f4 __bpf_prog_enter+0x34 ([kernel.kallsyms]) After: ffffffff811656bd __lock_acquire+0x30d ([kernel.kallsyms]) ffffffff81167fcc lock_acquire+0xcc ([kernel.kallsyms]) ffffffff81309024 __bpf_prog_enter+0x34 ([kernel.kallsyms]) ffffffffc000d7e9 bpf__tcp_congestion_ops_cong_avoid+0x3e ([kernel.kallsyms]) ffffffff81f250a5 tcp_ack+0x10d5 ([kernel.kallsyms]) ffffffff81f27c66 tcp_rcv_established+0x3b6 ([kernel.kallsyms]) ffffffff81f3ad03 tcp_v4_do_rcv+0x193 ([kernel.kallsyms]) ffffffff81d65a18 __release_sock+0xd8 ([kernel.kallsyms]) ffffffff81d65af4 release_sock+0x34 ([kernel.kallsyms]) ffffffff81f15c4b tcp_sendmsg+0x3b ([kernel.kallsyms]) ffffffff81f663d7 inet_sendmsg+0x47 ([kernel.kallsyms]) ffffffff81d5ab40 sock_write_iter+0x160 ([kernel.kallsyms]) ffffffff8149c67b vfs_write+0x3fb ([kernel.kallsyms]) ffffffff8149caf6 ksys_write+0xc6 ([kernel.kallsyms]) ffffffff8149cb5d __x64_sys_write+0x1d ([kernel.kallsyms]) ffffffff81009200 x64_sys_call+0x1d30 ([kernel.kallsyms]) ffffffff82232d28 do_syscall_64+0x68 ([kernel.kallsyms]) ffffffff8240012f entry_SYSCALL_64_after_hwframe+0x76 ([kernel.kallsyms]) Fixes: 85d33df357b6 ("bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS") Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20241112145849.3436772-4-xukuohai@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
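A sketch of how such a symbol name can be composed (only the name format comes from the commit text; the buffer size and surrounding variables are assumptions):

    char sym[KSYM_NAME_LEN];

    /* bpf__<struct_ops_name>_<member_name>, e.g. the
     * "bpf__tcp_congestion_ops_cong_avoid" frame in the stacktrace above;
     * st_ops_name and member_name stand in for the real variables.
     */
    snprintf(sym, sizeof(sym), "bpf__%s_%s", st_ops_name, member_name);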
2024-11-12  bpf: Use function pointers count as struct_ops links count  (Xu Kuohai)
Only function pointers in a struct_ops structure can be linked to bpf progs, so set the links count to the function pointers count, instead of the total members count in the structure. Suggested-by: Martin KaFai Lau <martin.lau@linux.dev> Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Link: https://lore.kernel.org/r/20241112145849.3436772-3-xukuohai@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12  bpf: Remove unused member rcu from bpf_struct_ops_map  (Xu Kuohai)
The rcu member in bpf_struct_ops_map is not used after commit b671c2067a04 ("bpf: Retire the struct_ops map kvalue->refcnt.") Remove it. Suggested-by: Martin KaFai Lau <martin.lau@linux.dev> Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Link: https://lore.kernel.org/r/20241112145849.3436772-2-xukuohai@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12  Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost  (Linus Torvalds)
Pull virtio fix from Michael Tsirkin: "A last minute mlx5 bugfix" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vdpa/mlx5: Fix PA offset with unaligned starting iotlb map
2024-11-12  Merge branch 'bpf-support-private-stack-for-bpf-progs'  (Alexei Starovoitov)
Yonghong Song says: ==================== bpf: Support private stack for bpf progs The main motivation for private stack comes from nested scheduler in sched-ext from Tejun. The basic idea is that - each cgroup will its own associated bpf program, - bpf program with parent cgroup will call bpf programs in immediate child cgroups. Let us say we have the following cgroup hierarchy: root_cg (prog0): cg1 (prog1): cg11 (prog11): cg111 (prog111) cg112 (prog112) cg12 (prog12): cg121 (prog121) cg122 (prog122) cg2 (prog2): cg21 (prog21) cg22 (prog22) cg23 (prog23) In the above example, prog0 will call a kfunc which will call prog1 and prog2 to get sched info for cg1 and cg2 and then the information is summarized and sent back to prog0. Similarly, prog11 and prog12 will be invoked in the kfunc and the result will be summarized and sent back to prog1, etc. The following illustrates a possible call sequence: ... -> bpf prog A -> kfunc -> ops.<callback_fn> (bpf prog B) ... Currently, for each thread, the x86 kernel allocate 16KB stack. Each bpf program (including its subprograms) has maximum 512B stack size to avoid potential stack overflow. Nested bpf programs further increase the risk of stack overflow. To avoid potential stack overflow caused by bpf programs, this patch set supported private stack and bpf program stack space is allocated during jit time. Using private stack for bpf progs can reduce or avoid potential kernel stack overflow. Currently private stack is applied to tracing programs like kprobe/uprobe, perf_event, tracepoint, raw tracepoint and struct_ops progs. Tracing progs enable private stack if any subprog stack size is more than a threshold (i.e. 64B). Struct-ops progs enable private stack based on particular struct op implementation which can enable private stack before verification at per-insn level. Struct-ops progs have the same treatment as tracing progs w.r.t when to enable private stack. For all these progs, the kernel will do recursion check (no nesting for per prog per cpu) to ensure that private stack won't be overwritten. The bpf_prog_aux struct has a callback func recursion_detected() which can be implemented by kernel subsystem to synchronously detect recursion, report error, etc. Only x86_64 arch supports private stack now. It can be extended to other archs later. Please see each individual patch for details. Change logs: v11 -> v12: - v11 link: https://lore.kernel.org/bpf/20241109025312.148539-1-yonghong.song@linux.dev/ - Fix a bug where allocated percpu space is less than actual private stack. - Add guard memory (before and after actual prog stack) to detect potential underflow/overflow. v10 -> v11: - v10 link: https://lore.kernel.org/bpf/20241107024138.3355687-1-yonghong.song@linux.dev/ - Use two bool variables, priv_stack_requested (used by struct-ops only) and jits_use_priv_stack, in order to make code cleaner. - Set env->prog->aux->jits_use_priv_stack to true if any subprog uses private stack. This is for struct-ops use case to kick in recursion protection. v9 -> v10: - v9 link: https://lore.kernel.org/bpf/20241104193455.3241859-1-yonghong.song@linux.dev/ - Simplify handling async cbs by making those async cb related progs using normal kernel stack. - Do percpu allocation in jit instead of verifier. v8 -> v9: - v8 link: https://lore.kernel.org/bpf/20241101030950.2677215-1-yonghong.song@linux.dev/ - Use enum to express priv stack mode. - Use bits in bpf_subprog_info struct to do subprog recursion check between main/async and async subprogs. 
- Fix potential memory leak. - Rename recursion detection func from recursion_skipped() to recursion_detected(). v7 -> v8: - v7 link: https://lore.kernel.org/bpf/20241029221637.264348-1-yonghong.song@linux.dev/ - Add recursion_skipped() callback func to bpf_prog->aux structure such that if a recursion miss happened and bpf_prog->aux->recursion_skipped is not NULL, the callback fn will be called so the subsystem can do proper action based on their respective design. v6 -> v7: - v6 link: https://lore.kernel.org/bpf/20241020191341.2104841-1-yonghong.song@linux.dev/ - Going back to do private stack allocation per prog instead per subtree. This can simplify implementation and avoid verifier complexity. - Handle potential nested subprog run if async callback exists. - Use struct_ops->check_member() callback to set whether a particular struct-ops prog wants private stack or not. v5 -> v6: - v5 link: https://lore.kernel.org/bpf/20241017223138.3175885-1-yonghong.song@linux.dev/ - Instead of using (or not using) private stack at struct_ops level, each prog in struct_ops can decide whether to use private stack or not. v4 -> v5: - v4 link: https://lore.kernel.org/bpf/20241010175552.1895980-1-yonghong.song@linux.dev/ - Remove bpf_prog_call() related implementation. - Allow (opt-in) private stack for sched-ext progs. v3 -> v4: - v3 link: https://lore.kernel.org/bpf/20240926234506.1769256-1-yonghong.song@linux.dev/ There is a long discussion in the above v3 link trying to allow private stack to be used by kernel functions in order to simplify implementation. But unfortunately we didn't find a workable solution yet, so we return to the approach where private stack is only used by bpf programs. - Add bpf_prog_call() kfunc. v2 -> v3: - Instead of per-subprog private stack allocation, allocate private stacks at main prog or callback entry prog. Subprogs not main or callback progs will increment the inherited stack pointer to be their frame pointer. - Private stack allows each prog max stack size to be 512 bytes, intead of the whole prog hierarchy to be 512 bytes. - Add some tests. ==================== Link: https://lore.kernel.org/r/20241112163902.2223011-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12  selftests/bpf: Add struct_ops prog private stack tests  (Yonghong Song)
Add three tests for struct_ops using private stack. ./test_progs -t struct_ops_private_stack #336/1 struct_ops_private_stack/private_stack:OK #336/2 struct_ops_private_stack/private_stack_fail:OK #336/3 struct_ops_private_stack/private_stack_recur:OK #336 struct_ops_private_stack:OK The following is a snippet of a struct_ops check_member() implementation: u32 moff = __btf_member_bit_offset(t, member) / 8; switch (moff) { case offsetof(struct bpf_testmod_ops3, test_1): prog->aux->priv_stack_requested = true; prog->aux->recursion_detected = test_1_recursion_detected; fallthrough; default: break; } return 0; The first test is with nested two different callback functions where the first prog has more than 512 byte stack size (including subprogs) with private stack enabled. The second test is a negative test where the second prog has more than 512 byte stack size without private stack enabled. The third test is the same callback function recursing itself. At run time, the jit trampoline recursion check kicks in to prevent the recursion. The recursion_detected() callback function is implemented by the bpf_testmod, the following message in dmesg bpf_testmod: oh no, recursing into test_1, recursion_misses 1 demonstrates the callback function is indeed triggered when recursion miss happens. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20241112163938.2225528-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12  bpf: Support private stack for struct_ops progs  (Yonghong Song)
For struct_ops progs, whether a particular prog uses private stack depends on the prog->aux->priv_stack_requested setting before actual insn-level verification for that prog. One particular implementation is to piggyback on struct_ops->check_member(). The next patch has an example for this. The struct_ops->check_member() sets prog->aux->priv_stack_requested to true, which enables private stack usage. The struct_ops prog follows the same rule as kprobe/tracing progs after function bpf_enable_priv_stack(). For example, even if a struct_ops prog requests private stack, it could still use the normal kernel stack if the stack size is small (< 64 bytes). Similar to tracing progs, nested same-cpu same-prog runs will be skipped. A field (recursion_detected()) is added to the bpf_prog_aux structure. If bpf_prog->aux->recursion_detected is implemented by the struct_ops subsystem and a nested same-cpu/prog run happens, the function will be triggered to report an error, collect related info, etc. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20241112163933.2224962-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12  selftests/bpf: Add tracing prog private stack tests  (Yonghong Song)
Some private stack tests are added including: - main prog only with stack size greater than BPF_PSTACK_MIN_SIZE. - main prog only with stack size smaller than BPF_PSTACK_MIN_SIZE. - prog with one subprog having MAX_BPF_STACK stack size and another subprog having non-zero small stack size. - prog with callback function. - prog with exception in main prog or subprog. - prog with async callback without nesting - prog with async callback with possible nesting Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20241112163927.2224750-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12  bpf, x86: Support private stack in jit  (Yonghong Song)
Private stack is allocated in function bpf_int_jit_compile() with alignment 8. The private stack allocation size includes the stack size determined by the verifier and additional space to protect against stack overflow and underflow. See below an illustration: ---> memory address increasing [8 bytes to protect overflow] [normal stack] [8 bytes to protect underflow] If overflow/underflow is detected, kernel messages will be emitted in dmesg like BPF private stack overflow/underflow detected for prog Fx BPF Private stack overflow/underflow detected for prog bpf_prog_a41699c234a1567a_subprog1x Those messages are generated when I made some changes to jitted code to intentionally cause overflow for some progs. For the jited prog, the x86 register 9 (X86_REG_R9) is used to replace the bpf frame register (BPF_REG_10). The private stack is used per subprog per cpu. The X86_REG_R9 is saved and restored around every func call (not including tailcall) to maintain the correctness of X86_REG_R9. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20241112163922.2224385-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
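A simplified sketch of the allocation layout described above (the macro names and guard value below are assumptions, not the JIT's exact code):

    /* [8-byte guard][ stack of stack_depth bytes ][8-byte guard]
     * Filling the guards with a known pattern lets the epilogue detect
     * overflow/underflow and emit the dmesg messages quoted above.
     */
    #define PRIV_STACK_GUARD_SZ  8
    #define PRIV_STACK_GUARD_VAL 0xEB9F12345678eb9fULL  /* assumed canary value */

    void __percpu *priv_stack;
    u32 alloc_sz = round_up(prog->aux->stack_depth, 8) + 2 * PRIV_STACK_GUARD_SZ;

    priv_stack = __alloc_percpu_gfp(alloc_sz, 8, GFP_KERNEL);
    /* guard words are written at both ends and checked on prog exit */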
2024-11-12  bpf, x86: Avoid repeated usage of bpf_prog->aux->stack_depth  (Yonghong Song)
Refactor the code to avoid repeated usage of bpf_prog->aux->stack_depth in the do_jit() func. If the private stack is used, the stack_depth will be 0 for that prog. Refactoring makes it easy to adjust stack_depth. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20241112163917.2224189-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12  bpf: Enable private stack for eligible subprogs  (Yonghong Song)
If private stack is used by any subprog, set that subprog prog->aux->jits_use_priv_stack to be true so later jit can allocate private stack for that subprog properly. Also set env->prog->aux->jits_use_priv_stack to be true if any subprog uses private stack. This is a use case for a single main prog (no subprogs) to use private stack, and also a use case for later struct-ops progs where env->prog->aux->jits_use_priv_stack will enable recursion check if any subprog uses private stack. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20241112163912.2224007-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12  bpf: Find eligible subprogs for private stack support  (Yonghong Song)
Private stack will be allocated with the percpu allocator at jit time. To avoid complexity at runtime, only one copy of private stack is available per cpu per prog. So a runtime recursion check is necessary to avoid stack corruption. Current private stack only supports kprobe/perf_event/tp/raw_tp, which have recursion checks in the kernel, and prog types that use the bpf trampoline recursion check. For trampoline related prog types, currently only tracing progs have recursion checking. To avoid complexity, all async_cb subprogs use the normal kernel stack, including those subprogs used by both the main prog subtree and the async_cb subtree. Any prog having a tail call also uses the kernel stack. To avoid a jit penalty with private stack support, a subprog stack size threshold is set such that private stack is supported only if the stack size is no less than the threshold. The current threshold is 64 bytes. This avoids a jit penalty if the stack usage is small. A useless 'continue' is also removed from a loop in func check_max_stack_depth(). Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20241112163907.2223839-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
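The eligibility rule can be pictured roughly as follows (a sketch under the assumptions in the text; the real verifier logic is more involved):

    /* Sketch: a subprog gets a private stack only when it is both safe and worth it. */
    static bool subprog_can_use_priv_stack(const struct bpf_subprog_info *si)
    {
            if (si->has_tail_call)       /* tail calls stay on the kernel stack    */
                    return false;
            if (si->is_async_cb)         /* async callbacks stay on the kernel stack */
                    return false;
            return si->stack_depth >= 64;  /* threshold from the text (BPF_PSTACK_MIN_SIZE) */
    }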
2024-11-12  mm: swapfile: fix cluster reclaim work crash on rotational devices  (Johannes Weiner)
syzbot and Daan report a NULL pointer crash in the new full swap cluster reclaim work: > Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN PTI > KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] > CPU: 1 UID: 0 PID: 51 Comm: kworker/1:1 Not tainted 6.12.0-rc6-syzkaller #0 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 > Workqueue: events swap_reclaim_work > RIP: 0010:__list_del_entry_valid_or_report+0x20/0x1c0 lib/list_debug.c:49 > Code: 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 89 fe 48 83 c7 08 48 83 ec 18 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 19 01 00 00 48 89 f2 48 8b 4e 08 48 b8 00 00 00 > RSP: 0018:ffffc90000bb7c30 EFLAGS: 00010202 > RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffff88807b9ae078 > RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000008 > RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000 > R10: 0000000000000001 R11: 000000000000004f R12: dffffc0000000000 > R13: ffffffffffffffb8 R14: ffff88807b9ae000 R15: ffffc90003af1000 > FS: 0000000000000000(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007fffaca68fb8 CR3: 00000000791c8000 CR4: 00000000003526f0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > <TASK> > __list_del_entry_valid include/linux/list.h:124 [inline] > __list_del_entry include/linux/list.h:215 [inline] > list_move_tail include/linux/list.h:310 [inline] > swap_reclaim_full_clusters+0x109/0x460 mm/swapfile.c:748 > swap_reclaim_work+0x2e/0x40 mm/swapfile.c:779 The syzbot console output indicates a virtual environment where swapfile is on a rotational device. In this case, clusters aren't actually used, and si->full_clusters is not initialized. Daan's report is from qemu, so likely rotational too. Make sure to only schedule the cluster reclaim work when clusters are actually in use. Link: https://lkml.kernel.org/r/20241107142335.GB1172372@cmpxchg.org Link: https://lore.kernel.org/lkml/672ac50b.050a0220.2edce.1517.GAE@google.com/ Link: https://github.com/systemd/systemd/issues/35044 Fixes: 5168a68eb78f ("mm, swap: avoid over reclaim of full clusters") Reported-by: syzbot+078be8bfa863cb9e0c6b@syzkaller.appspotmail.com Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Daan De Meyer <daan.j.demeyer@gmail.com> Cc: Kairui Song <ryncsn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-12  vdpa/mlx5: Fix PA offset with unaligned starting iotlb map  (Si-Wei Liu)
When calculating the physical address range based on the iotlb and mr [start,end) ranges, the offset of mr->start relative to map->start is not taken into account. This leads to some incorrect and duplicate mappings. For the case when mr->start < map->start the code is already correct: the range in [mr->start, map->start) was handled by a different iteration. Fixes: 94abbccdf291 ("vdpa/mlx5: Add shared memory registration code") Cc: stable@vger.kernel.org Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Message-Id: <20241021134040.975221-2-dtatulea@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
2024-11-12  Merge branch 'selftests-bpf-fix-for-bpf_signal-stalls-watchdog-for-test_progs'  (Alexei Starovoitov)
Eduard Zingerman says: ==================== selftests/bpf: fix for bpf_signal stalls, watchdog for test_progs Test case 'bpf_signal' had been recently reported to stall, both on the mailing list [1] and CI [2]. The stall is caused by CPU cycles perf event not being delivered within expected time frame, before test process enters system call and waits indefinitely. This patch-set addresses the issue in several ways: - A watchdog timer is added to test_progs.c runner: - it prints current sub-test name to stderr if sub-test takes longer than 10 seconds to finish; - it terminates process executing sub-test if sub-test takes longer than 120 seconds to finish. - The test case is updated to await perf event notification with a timeout and a few retries, this serves two purposes: - busy loops longer to increase the time frame for CPU cycles event generation/delivery; - makes a timeout, not stall, a worst case scenario. - The test case is updated to lower frequency of perf events, as high frequency of such events caused events generation throttling, which in turn delayed events delivery by amount of time sufficient to cause test case failure. Note: librt pthread-based timer API is used to implement watchdog timer. I chose this API over SIGALRM because signal handler execution within test process context was sufficient to trigger perf event delivery for send_signal/send_signal_nmi_thread_remote test case, w/o any additional changes. Thus I concluded that SIGALRM based implementation interferes with tests execution. [1] https://lore.kernel.org/bpf/CAP01T75OUeE8E-Lw9df84dm8ag2YmHW619f1DmPSVZ5_O89+Bg@mail.gmail.com/ [2] https://github.com/kernel-patches/bpf/actions/runs/11791485271/job/32843996871 ==================== Link: https://lore.kernel.org/r/20241112110906.3045278-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12  selftests/bpf: update send_signal to lower perf events frequency  (Eduard Zingerman)
Similar to commit [1] sample perf events less often in test_send_signal_nmi(). This should reduce perf events throttling. [1] 7015843afcaf ("selftests/bpf: Fix send_signal test with nested CONFIG_PARAVIRT") Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20241112110906.3045278-5-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
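In perf_event_attr terms, the change amounts to something like the following (the exact period value here is an assumption):

    struct perf_event_attr attr = {
            .type = PERF_TYPE_HARDWARE,
            .config = PERF_COUNT_HW_CPU_CYCLES,
            .sample_period = 10000,   /* sample every N cycles instead of every cycle */
    };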
2024-11-12  selftests/bpf: allow send_signal test to timeout  (Eduard Zingerman)
The following invocation: $ t1=send_signal/send_signal_perf_thread_remote \ t2=send_signal/send_signal_nmi_thread_remote \ ./test_progs -t $t1,$t2 leads to send_signal_nmi_thread_remote being stuck at line 180: /* wait for result */ err = read(pipe_c2p[0], buf, 1); In this test case: - a perf event PERF_COUNT_HW_CPU_CYCLES is created for the parent process; - a BPF program is attached to the perf event, and sends a signal to the child process when the event occurs; - the parent program burns some CPU in a busy loop and calls read() to get a notification from the child that it received a signal. The perf event is declared with .sample_period = 1. This forces perf to throttle events, and under some unclear conditions the event does not always occur while the parent is in the busy loop. After the parent enters the read() system call, CPU cycles events won't be generated for the parent anymore. Thus, if the perf event has not already occurred, the test is stuck. This commit updates the parent to wait for the notification with a timeout, doing several iterations of busy loop + read_with_timeout(). Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20241112110906.3045278-4-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-11-12  selftests/bpf: add read_with_timeout() utility function  (Eduard Zingerman)
int read_with_timeout(int fd, char *buf, size_t count, long usec) Acts like a regular read(2), but allows specifying a timeout in microseconds. Returns -EAGAIN on timeout. Implemented using select(). Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20241112110906.3045278-3-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
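A minimal select()-based sketch matching the description above; the actual selftest helper may differ in details such as error handling:

    #include <sys/select.h>
    #include <unistd.h>
    #include <errno.h>

    int read_with_timeout(int fd, char *buf, size_t count, long usec)
    {
            struct timeval tv = { .tv_sec = usec / 1000000, .tv_usec = usec % 1000000 };
            fd_set rfds;
            int err;

            FD_ZERO(&rfds);
            FD_SET(fd, &rfds);
            err = select(fd + 1, &rfds, NULL, NULL, &tv);
            if (err < 0)
                    return -errno;      /* select() itself failed */
            if (err == 0)
                    return -EAGAIN;     /* timed out, no data available */
            return read(fd, buf, count);
    }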
2024-11-12  selftests/bpf: watchdog timer for test_progs  (Eduard Zingerman)
This commit provides a watchdog timer that sets a limit of how long a single sub-test could run: - if sub-test runs for 10 seconds, the name of the test is printed (currently the name of the test is printed only after it finishes); - if sub-test runs for 120 seconds, the running thread is terminated with SIGSEGV (to trigger crash_handler() and get a stack trace). Specifically: - the timer is armed on each call to run_one_test(); - re-armed at each call to test__start_subtest(); - is stopped when exiting run_one_test(). Default timeout could be overridden using '-w' or '--watchdog-timeout' options. Value 0 can be used to turn the timer off. Here is an example execution: $ ./ssh-exec.sh ./test_progs -w 5 -t \ send_signal/send_signal_perf_thread_remote,send_signal/send_signal_nmi_thread_remote WATCHDOG: test case send_signal/send_signal_nmi_thread_remote executes for 5 seconds, terminating with SIGSEGV Caught signal #11! Stack trace: ./test_progs(crash_handler+0x1f)[0x9049ef] /lib64/libc.so.6(+0x40d00)[0x7f1f1184fd00] /lib64/libc.so.6(read+0x4a)[0x7f1f1191cc4a] ./test_progs[0x720dd3] ./test_progs[0x71ef7a] ./test_progs(test_send_signal+0x1db)[0x71edeb] ./test_progs[0x9066c5] ./test_progs(main+0x5ed)[0x9054ad] /lib64/libc.so.6(+0x2a088)[0x7f1f11839088] /lib64/libc.so.6(__libc_start_main+0x8b)[0x7f1f1183914b] ./test_progs(_start+0x25)[0x527385] #292 send_signal:FAIL test_send_signal_common:PASS:reading pipe 0 nsec test_send_signal_common:PASS:reading pipe error: size 0 0 nsec test_send_signal_common:PASS:incorrect result 0 nsec test_send_signal_common:PASS:pipe_write 0 nsec test_send_signal_common:PASS:setpriority 0 nsec Timer is implemented using timer_{create,start} librt API. Internally librt uses pthreads for SIGEV_THREAD timers, so this change adds a background timer thread to the test process. Because of this a few checks in tests 'bpf_iter' and 'iters' need an update to account for an extra thread. For parallelized scenario the watchdog is also created for each worker fork. If one of the workers gets stuck, it would be terminated by a watchdog. In theory, this might lead to a scenario when all worker threads are exhausted, however this should not be a problem for server_main(), as it would exit with some of the tests not run. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20241112110906.3045278-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
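A sketch of a librt SIGEV_THREAD watchdog of the kind described above; the callback contents and error handling are simplified assumptions, not the exact test_progs code (link with -lrt):

    #include <signal.h>
    #include <time.h>

    static void watchdog_fired(union sigval sv)
    {
            (void)sv;
            /* print the current sub-test name, or raise SIGSEGV on the hard limit */
    }

    static timer_t watchdog_create(void)
    {
            struct sigevent sev = {
                    .sigev_notify = SIGEV_THREAD,
                    .sigev_notify_function = watchdog_fired,
            };
            timer_t tid;

            timer_create(CLOCK_MONOTONIC, &sev, &tid);
            return tid;
    }

    static void watchdog_arm(timer_t tid, int seconds)
    {
            struct itimerspec its = { .it_value = { .tv_sec = seconds } };

            timer_settime(tid, 0, &its, NULL);   /* a zero value disarms the timer */
    }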
2024-11-12  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds)
Pull kvm fixes from Paolo Bonzini: "x86 and selftests fixes. x86: - When emulating a guest TLB flush for a nested guest, flush vpid01, not vpid02, if L2 is active but VPID is disabled in vmcs12, i.e. if L2 and L1 are sharing VPID '0' (from L1's perspective). - Fix a bug in the SNP initialization flow where KVM would return '0' to userspace instead of -errno on failure. - Move the Intel PT virtualization (i.e. outputting host trace to host buffer and guest trace to guest buffer) behind CONFIG_BROKEN. - Fix memory leak on failure of KVM_SEV_SNP_LAUNCH_START - Fix a bug where KVM fails to inject an interrupt from the IRR after KVM_SET_LAPIC. Selftests: - Increase the timeout for the memslot performance selftest to avoid false failures on arm64 and nested x86 platforms. - Fix a goof in the guest_memfd selftest where a for-loop initialized a bit mask to zero instead of BIT(0). - Disable strict aliasing when building KVM selftests to prevent the compiler from treating things like "u64 *" to "uint64_t *" cases as undefined behavior, which can lead to nasty, hard to debug failures. - Force -march=x86-64-v2 for KVM x86 selftests if and only if the uarch is supported by the compiler. - Fix broken compilation of kvm selftests after a header sync in tools/" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: VMX: Bury Intel PT virtualization (guest/host mode) behind CONFIG_BROKEN KVM: x86: Unconditionally set irr_pending when updating APICv state kvm: svm: Fix gctx page leak on invalid inputs KVM: selftests: use X86_MEMTYPE_WB instead of VMX_BASIC_MEM_TYPE_WB KVM: SVM: Propagate error from snp_guest_req_init() to userspace KVM: nVMX: Treat vpid01 as current if L2 is active, but with VPID disabled KVM: selftests: Don't force -march=x86-64-v2 if it's unsupported KVM: selftests: Disable strict aliasing KVM: selftests: fix unintentional noop test in guest_memfd_test.c KVM: selftests: memslot_perf_test: increase guest sync timeout
2024-11-12  Merge tag 'for-6.12/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm  (Linus Torvalds)
Pull device mapper fixes from Mikulas Patocka: - fix warnings about duplicate slab cache names * tag 'for-6.12/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-cache: fix warnings about duplicate slab caches dm-bufio: fix warnings about duplicate slab caches
2024-11-12  Merge tag 'integrity-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity  (Linus Torvalds)
Pull integrity fixes from Mimi Zohar: "One bug fix, one performance improvement, and the use of static_assert: - The bug fix addresses "only a cosmetic change" commit, which didn't take into account the original 'ima' template definition. - The performance improvement limits the atomic_read()" * tag 'integrity-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity: integrity: Use static_assert() to check struct sizes evm: stop avoidably reading i_writecount in evm_file_release ima: fix buffer overrun in ima_eventdigest_init_common
2024-11-12  Merge tag 'landlock-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux  (Linus Torvalds)
Pull landlock fixes from Mickaël Salaün: "This fixes issues in the Landlock's sandboxer sample and documentation, slightly refactors helpers (required for ongoing patch series), and improve/fix a feature merged in v6.12 (signal and abstract UNIX socket scoping)" * tag 'landlock-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux: landlock: Optimize scope enforcement landlock: Refactor network access mask management landlock: Refactor filesystem access mask management samples/landlock: Clarify option parsing behaviour samples/landlock: Refactor help message samples/landlock: Fix port parsing in sandboxer landlock: Fix grammar issues in documentation landlock: Improve documentation of previous limitations
2024-11-12  selftests: hugetlb_dio: fixup check for initial conditions to skip in the start  (Donet Tom)
This test verifies that a hugepage, used as a user buffer for DIO operations, is correctly freed upon unmapping. To test this, we read the count of free hugepages before and after the mmap, DIO, and munmap operations, then check if the free hugepage count is the same. Reading free hugepages before the test was removed by commit 0268d4579901 ('selftests: hugetlb_dio: check for initial conditions to skip at the start'), causing the test to always fail. This patch adds back reading the free hugepages before starting the test. With this patch, the tests are now passing. Test results without this patch: ./tools/testing/selftests/mm/hugetlb_dio TAP version 13 1..4 # No. Free pages before allocation : 0 # No. Free pages after munmap : 100 not ok 1 : Huge pages not freed! # No. Free pages before allocation : 0 # No. Free pages after munmap : 100 not ok 2 : Huge pages not freed! # No. Free pages before allocation : 0 # No. Free pages after munmap : 100 not ok 3 : Huge pages not freed! # No. Free pages before allocation : 0 # No. Free pages after munmap : 100 not ok 4 : Huge pages not freed! # Totals: pass:0 fail:4 xfail:0 xpass:0 skip:0 error:0 Test results with this patch: /tools/testing/selftests/mm/hugetlb_dio TAP version 13 1..4 # No. Free pages before allocation : 100 # No. Free pages after munmap : 100 ok 1 : Huge pages freed successfully ! # No. Free pages before allocation : 100 # No. Free pages after munmap : 100 ok 2 : Huge pages freed successfully ! # No. Free pages before allocation : 100 # No. Free pages after munmap : 100 ok 3 : Huge pages freed successfully ! # No. Free pages before allocation : 100 # No. Free pages after munmap : 100 ok 4 : Huge pages freed successfully ! # Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0 Link: https://lkml.kernel.org/r/20241110064903.23626-1-donettom@linux.ibm.com Fixes: 0268d4579901 ("selftests: hugetlb_dio: check for initial conditions to skip in the start") Signed-off-by: Donet Tom <donettom@linux.ibm.com> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Shuah Khan <shuah@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-12  mm/thp: fix deferred split queue not partially_mapped: fix  (Hugh Dickins)
Though even more elusive than before, list_del corruption has still been seen on THP's deferred split queue. The idea in commit e66f3185fa04 was right, but its implementation wrong. The context omitted an important comment just before the critical test: "split_folio() removes folio from list on success." In ignoring that comment, when a THP split succeeded, the code went on to release the preceding safe folio, preserving instead an irrelevant (formerly head) folio: which gives no safety because it's not on the list. Fix the logic. Link: https://lkml.kernel.org/r/3c995a30-31ce-0998-1b9f-3a2cb9354c91@google.com Fixes: e66f3185fa04 ("mm/thp: fix deferred split queue not partially_mapped") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-12  mm/gup: avoid an unnecessary allocation call for FOLL_LONGTERM cases  (John Hubbard)
commit 53ba78de064b ("mm/gup: introduce check_and_migrate_movable_folios()") created a new constraint on the pin_user_pages*() API family: a potentially large internal allocation must now occur, for FOLL_LONGTERM cases. A user-visible consequence has now appeared: user space can no longer pin more than 2GB of memory anymore on x86_64. That's because, on a 4KB PAGE_SIZE system, when user space tries to (indirectly, via a device driver that calls pin_user_pages()) pin 2GB, this requires an allocation of a folio pointers array of MAX_PAGE_ORDER size, which is the limit for kmalloc(). In addition to the directly visible effect described above, there is also the problem of adding an unnecessary allocation. The **pages array argument has already been allocated, and there is no need for a redundant **folios array allocation in this case. Fix this by avoiding the new allocation entirely. This is done by referring to either the original page[i] within **pages, or to the associated folio. Thanks to David Hildenbrand for suggesting this approach and for providing the initial implementation (which I've tested and adjusted slightly) as well. [jhubbard@nvidia.com: whitespace tweak, per David] Link: https://lkml.kernel.org/r/131cf9c8-ebc0-4cbb-b722-22fa8527bf3c@nvidia.com [jhubbard@nvidia.com: bypass pofs_get_folio(), per Oscar] Link: https://lkml.kernel.org/r/c1587c7f-9155-45be-bd62-1e36c0dd6923@nvidia.com Link: https://lkml.kernel.org/r/20241105032944.141488-2-jhubbard@nvidia.com Fixes: 53ba78de064b ("mm/gup: introduce check_and_migrate_movable_folios()") Signed-off-by: John Hubbard <jhubbard@nvidia.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Vivek Kasireddy <vivek.kasireddy@intel.com> Cc: Dave Airlie <airlied@redhat.com> Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dongwon Kim <dongwon.kim@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Junxiao Chang <junxiao.chang@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-12  LoongArch: Fix AP booting issue in VM mode  (Bibo Mao)
Native IPI is used for AP booting, because it is the booting interface between the OS and the BIOS firmware. The paravirt IPI is only used inside the OS, and native IPI is necessary to boot the AP. When booting an AP, we write the kernel entry address in the HW mailbox of the AP and send an IPI interrupt to it. The AP executes the idle instruction and waits for interrupts or SW events, then clears the IPI interrupt and jumps to the kernel entry from the HW mailbox. Between writing the HW mailbox and sending the IPI, the AP can be woken up by SW events and jump to the kernel entry, so the ACTION_BOOT_CPU IPI interrupt will keep pending during AP booting. And the native IPI interrupt handler needs to be registered so that it can clear the pending native IPI, else there will be endless interrupts during the AP booting stage. Here the native IPI interrupt is initialized even if paravirt IPI is used. Cc: stable@vger.kernel.org Fixes: 74c16b2e2b0c ("LoongArch: KVM: Add PV IPI support on guest side") Signed-off-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-11-12  LoongArch: Add WriteCombine shadow mapping in KASAN  (Kanglong Wang)
Currently, the kernel cannot boot when ARCH_IOREMAP, ARCH_WRITECOMBINE and KASAN are enabled together, because DMW2 is now used by the kernel and configured as 0xa000000000000000 for WriteCombine, but KASAN has no segment mapping for it. This patch fixes this issue. Solution: add the relevant definitions for WriteCombine (DMW2) in KASAN. Cc: stable@vger.kernel.org Fixes: 8e02c3b782ec ("LoongArch: Add writecombine support for DMW-based ioremap()") Signed-off-by: Kanglong Wang <wangkanglong@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-11-12  LoongArch: Disable KASAN if PGDIR_SIZE is too large for cpu_vabits  (Huacai Chen)
If PGDIR_SIZE is too large for cpu_vabits, KASAN_SHADOW_END will overflow UINTPTR_MAX because KASAN_SHADOW_START/KASAN_SHADOW_END are aligned up by PGDIR_SIZE. And then the overflowed KASAN_SHADOW_END looks like a user space address. For example, PGDIR_SIZE of CONFIG_4KB_4LEVEL is 2^39, which is too large for Loongson-2K series whose cpu_vabits = 39. Since CONFIG_4KB_4LEVEL is completely legal for CPUs with cpu_vabits <= 39, we just disable KASAN via early return in kasan_init(). Otherwise we get a boot failure. Moreover, we change KASAN_SHADOW_END from the first address after KASAN shadow area to the last address in KASAN shadow area, in order to avoid the end address exactly overflow to 0 (which is a legal case). We don't need to worry about alignment because pgd_addr_end() can handle it. Cc: stable@vger.kernel.org Reviewed-by: Jiaxun Yang <jiaxun.yang@flygoat.com> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-11-12  LoongArch: Make KASAN work with 5-level page-tables  (Huacai Chen)
Make KASAN work with 5-level page-tables, including: 1. Implement and use __pgd_none() and kasan_p4d_offset(). 2. As done in kasan_pmd_populate() and kasan_pte_populate(), restrict the loop conditions of kasan_p4d_populate() and kasan_pud_populate() to avoid unnecessary population. Cc: stable@vger.kernel.org Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-11-12  LoongArch: Define a default value for VM_DATA_DEFAULT_FLAGS  (Yuli Wang)
This is a trivial cleanup, commit c62da0c35d58518d ("mm/vma: define a default value for VM_DATA_DEFAULT_FLAGS") has unified default values of VM_DATA_DEFAULT_FLAGS across different platforms. Apply the same consistency to LoongArch. Suggested-by: Wentao Guan <guanwentao@uniontech.com> Signed-off-by: Yuli Wang <wangyuli@uniontech.com> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-11-12  LoongArch: Fix early_numa_add_cpu() usage for FDT systems  (Huacai Chen)
early_numa_add_cpu() applies on physical CPU id rather than logical CPU id, so use cpuid instead of cpu. Cc: stable@vger.kernel.org Fixes: 3de9c42d02a79a5 ("LoongArch: Add all CPUs enabled by fdt to NUMA node 0") Reported-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-11-12  LoongArch: For all possible CPUs setup logical-physical CPU mapping  (Huacai Chen)
In order to support ACPI-based physical CPU hotplug, we suppose for all "possible" CPUs cpu_logical_map() can work. Because some drivers want to use cpu_logical_map() for all "possible" CPUs, while currently we only setup logical-physical mapping for "present" CPUs. This lack of mapping also causes cpu_to_node() cannot work for hot-added CPUs. All "possible" CPUs are listed in MADT, and the "present" subset is marked as ACPI_MADT_ENABLED. To setup logical-physical CPU mapping for all possible CPUs and keep present CPUs continuous in cpu_present_mask, we parse MADT twice. The first pass handles CPUs with ACPI_MADT_ENABLED and the second pass handles CPUs without ACPI_MADT_ENABLED. The global flag (cpu_enumerated) is removed because acpi_map_cpu() calls cpu_number_map() rather than set_processor_mask() now. Reported-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-11-11  Merge branch 'libbpf-stringify-error-codes-in-log-messages'  (Andrii Nakryiko)
Mykyta Yatsenko says: ==================== libbpf: stringify error codes in log messages From: Mykyta Yatsenko <yatsenko@meta.com> Libbpf may report errors in two ways: 1. a numeric errno 2. the errno's text representation, returned by strerror Both ways may be confusing for users: a numeric code requires people to know how to find its meaning, and strerror may be too generic and unclear. These patches modify libbpf error reporting by replacing the numeric codes and strerror output with the standard short error name, for example: "failed to attach: -22" becomes "failed to attach: -EINVAL". ==================== Link: https://patch.msgid.link/20241111212919.368971-1-mykyta.yatsenko5@gmail.com Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
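For example, a call site in the style this series moves to (the helper name errstr() follows the series description; exact names and signatures are best checked against the patches themselves):

    int err = -EINVAL;

    /* before: numeric errno */
    pr_warn("prog '%s': failed to attach: %d\n", prog->name, err);         /* "... -22"     */
    /* after: short error name via the new helper (name assumed) */
    pr_warn("prog '%s': failed to attach: %s\n", prog->name, errstr(err)); /* "... -EINVAL" */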
2024-11-11  libbpf: Stringify errno in log messages in the remaining code  (Mykyta Yatsenko)
Convert numeric error codes into the string representations in log messages in the rest of libbpf source files. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241111212919.368971-5-mykyta.yatsenko5@gmail.com
2024-11-11  libbpf: Stringify errno in log messages in btf*.c  (Mykyta Yatsenko)
Convert numeric error codes into the string representations in log messages in btf.c and btf_dump.c. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241111212919.368971-4-mykyta.yatsenko5@gmail.com