ARG_PTR_TO_KPTR is currently only used by the bpf_kptr_xchg helper.
Although it limits reg types for that helper's first arg to
PTR_TO_MAP_VALUE, any arbitrary mapval won't do: further custom
verification logic ensures that the mapval reg being xchg'd into
points to a kptr field. If this is not the case, it's not safe to
xchg into that reg's pointee.
Let's rename the bpf_arg_type to more accurately describe the fairly
specific expectations that this arg type encodes.
This is a nonfunctional change.
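For illustration, the place where this arg type is consumed looks
roughly like this (a sketch; the new name shown is illustrative of
the rename, not necessarily the exact one chosen):
  static const struct bpf_func_proto bpf_kptr_xchg_proto = {
          .func           = bpf_kptr_xchg,
          .gpl_only       = false,
          .ret_type       = RET_PTR_TO_BTF_ID_OR_NULL | MEM_ALLOC,
          .arg1_type      = ARG_KPTR_XCHG_DEST,   /* was ARG_PTR_TO_KPTR */
          .arg2_type      = ARG_PTR_TO_BTF_ID_OR_NULL | OBJ_RELEASE,
  };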
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Link: https://lore.kernel.org/r/20240813212424.2871455-4-amery.hung@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Currently btf_parse_fields is used in two places to create struct
btf_record's for structs: when looking at mapval type, and when looking
at any struct in program BTF. The former looks for kptr fields while the
latter does not. This patch modifies the btf_parse_fields call made when
looking at prog BTF struct types to search for kptrs as well.
Before this series there was no reason to search for kptrs in non-mapval
types: a referenced kptr needs some owner to guarantee resource cleanup,
and map values were the only owner that supported this. If a struct with
a kptr field were to have some non-kptr-aware owner, the kptr field
might not be properly cleaned up and result in resources leaking. Only
searching for kptr fields in mapval was a simple way to avoid this
problem.
In practice, though, searching for BPF_KPTR when populating
struct_meta_tab does not expose us to this risk, as struct_meta_tab is
only accessed through btf_find_struct_meta helper, and that helper is
only called in contexts where recognizing the kptr field is safe:
* PTR_TO_BTF_ID reg w/ MEM_ALLOC flag
* Such a reg is a local kptr and must be free'd via bpf_obj_drop,
which will correctly handle kptr field
* When handling specific kfuncs which either expect MEM_ALLOC input or
return MEM_ALLOC output (obj_{new,drop}, percpu_obj_{new,drop},
list+rbtree funcs, refcount_acquire)
* Will correctly handle kptr field for same reasons as above
* When looking at kptr pointee type
* Called by functions which implement "correct kptr resource
handling"
* In btf_check_and_fixup_fields
* Helper that ensures no ownership loops for lists and rbtrees,
doesn't care about kptr field existence
So we should be able to find BPF_KPTR fields in all prog BTF structs
without leaking resources.
Further patches in the series will build on this change to support
kptr_xchg into non-mapval local kptr. Without this change there would be
no kptr field found in such a type.
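For illustration, the core of the change is adding BPF_KPTR to the
field mask used when parsing prog BTF struct types (a sketch; the
exact mask in the patch may differ):
  record = btf_parse_fields(btf, t,
                            BPF_SPIN_LOCK | BPF_LIST_HEAD | BPF_LIST_NODE |
                            BPF_RB_ROOT | BPF_RB_NODE | BPF_REFCOUNT |
                            BPF_KPTR,
                            t->size);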
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Link: https://lore.kernel.org/r/20240813212424.2871455-3-amery.hung@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
btf_parse_kptr() and btf_record_free() do btf_get() and btf_put()
respectively when working on a btf_record in a program or map that has
kptr fields. If the kptr is from program BTF, both callers already
track the life cycle of the program BTF, so it is safe to remove the
btf_get() and btf_put().
This change prevents a memory leak of program BTF later, when we start
searching for kptr fields while building the btf_record for programs.
The leak can happen when the btf fd is closed: the btf_put()
corresponding to the btf_get() in btf_parse_kptr() was supposed to be
called by btf_record_free() in btf_free_struct_meta_tab() in btf_free().
However, that will never happen, since the invocation of btf_free()
depends on the refcount of the btf dropping to 0 in the first place.
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Link: https://lore.kernel.org/r/20240813212424.2871455-2-amery.hung@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The timerslack_ns setting is used to specify how much the hardware
timers should be delayed, to potentially dispatch multiple timers in a
single interrupt. This is a performance optimization. Timers of
realtime tasks (having a realtime scheduling policy) should not be
delayed.
This logic was inconsistently applied to the hrtimers, leading to
delays of realtime tasks which used timed waits for events (e.g.
condition variables). Due to the downstream override of the slack for
rt tasks, procfs reported incorrect (non-zero) timerslack_ns values.
This is changed by setting the timer_slack_ns task attribute to 0 for
all tasks with an rt policy. This way, downstream users do not need to
handle rt tasks specially (w.r.t. the slack), and the procfs entry
shows the correct value of "0". Setting non-zero slack values (either
via procfs or PR_SET_TIMERSLACK) on tasks with an rt policy is ignored,
as stated in "man 2 PR_SET_TIMERSLACK":
Timer slack is not applied to threads that are scheduled under a
real-time scheduling policy (see sched_setscheduler(2)).
The special handling of timerslack on rt tasks in downstream users
is removed as well.
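A minimal sketch of the core change (the exact location in the
scheduler's policy-change path and the helper used are assumed):
  /* zero the slack whenever a task gets a realtime policy */
  if (rt_policy(policy))
          p->timer_slack_ns = 0;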
Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240814121032.368444-2-felix.moessbauer@siemens.com
|
|
Cross-merge bpf fixes after downstream PR, including fixes that are
important from the bpf-next point of view:
commit 41c24102af7b ("selftests/bpf: Filter out _GNU_SOURCE when compiling test_cpp")
commit fdad456cbcca ("bpf: Fix updating attached freplace prog in prog_array map")
No conflicts.
Adjacent changes in:
include/linux/bpf_verifier.h
kernel/bpf/verifier.c
tools/testing/selftests/bpf/Makefile
Link: https://lore.kernel.org/bpf/20240813234307.82773-1-alexei.starovoitov@gmail.com/
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
do_misc_fixups() replaces bpf_cast_to_kern_ctx() and bpf_rdonly_cast()
with a single instruction "r0 = r1". This follows the bpf_fastcall contract.
This commit allows bpf_fastcall pattern rewrite for these two
functions in order to use them in bpf_fastcall selftests.
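A sketch of the rewrite (assumed shape; the check lives in the
verifier's kfunc fixup path):
  if (desc->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx] ||
      desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
          /* both kfuncs pass their first argument through */
          insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
          *cnt = 1;
  }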
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240822084112.3257995-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Recognize bpf_fastcall patterns around kfunc calls.
For example, suppose bpf_cast_to_kern_ctx() follows the bpf_fastcall
contract (which it does); in such a case, allow the verifier to rewrite
the BPF program below:
  r2 = 1;
  *(u64 *)(r10 - 32) = r2;
  call %[bpf_cast_to_kern_ctx];
  r2 = *(u64 *)(r10 - 32);
  r0 = r2;
by removing the spill/fill pair:
  r2 = 1;
  call %[bpf_cast_to_kern_ctx];
  r0 = r2;
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240822084112.3257995-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The attribute used by the LLVM implementation of the feature has been
changed from no_caller_saved_registers to bpf_fastcall (see [1]).
This commit replaces references to nocsr with references to bpf_fastcall
to keep the LLVM and kernel parts in sync.
[1] https://github.com/llvm/llvm-project/pull/105417
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240822084112.3257995-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In arraymap.c:
In bpf_array_map_seq_start() and bpf_array_map_seq_next()
cast return values from the __percpu address space to
the generic address space via uintptr_t [1].
Correct the declaration of the pptr pointer in __bpf_array_map_seq_show()
to void __percpu * and cast the value from the generic address
space to the __percpu address space via uintptr_t [1].
In hashtab.c:
Assign the return value from bpf_mem_cache_alloc() to a void pointer
and cast the value to void __percpu ** (a void pointer to a percpu
void pointer) before dereferencing.
In memalloc.c:
Explicitly declare __percpu variables.
Cast obj to void __percpu **.
In helpers.c:
Cast ptr in BPF_CALL_1 and BPF_CALL_2 from generic address space
to __percpu address space via const uintptr_t [1].
Found by GCC's named address space checks.
There were no changes in the resulting object files.
[1] https://sparse.docs.kernel.org/en/latest/annotations.html#address-space-name
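For illustration, the casting idiom described above (variable names
are hypothetical):
  void __percpu *pptr = alloc_percpu(u64);
  void *p;

  p = (void *)(uintptr_t)pptr;            /* __percpu -> generic */
  pptr = (void __percpu *)(uintptr_t)p;   /* generic -> __percpu */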
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Song Liu <song@kernel.org>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Stanislav Fomichev <sdf@fomichev.me>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240811161414.56744-1-ubizjak@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
In case of a malformed relocation record of kind BPF_CORE_TYPE_ID_LOCAL
referencing a non-existing BTF type, the function bpf_core_calc_relo_insn
would cause a null pointer dereference.
Fix this by adding a proper check higher in the call stack, as malformed
relocation records could be passed from user space.
Simplest reproducer is a program:
r0 = 0
exit
With a single relocation record:
.insn_off = 0, /* patch first instruction */
.type_id = 100500, /* this type id does not exist */
.access_str_off = 6, /* offset of string "0" */
.kind = BPF_CORE_TYPE_ID_LOCAL,
See the link for original reproducer or next commit for a test case.
Fixes: 74753e1462e7 ("libbpf: Replace btf__type_by_id() with btf_type_by_id().")
Reported-by: Liu RuiTong <cnitlrt@gmail.com>
Closes: https://lore.kernel.org/bpf/CAK55_s6do7C+DVwbwY_7nKfUz0YLDoiA1v6X3Y9+p0sWzipFSA@mail.gmail.com/
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240822080124.2995724-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Directly call into dma-iommu just like we have been doing for dma-direct
for a while. This avoids the indirect call overhead for IOMMU ops and
removes the need to have DMA ops entirely for many common configurations.
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Almost all instances of the dma_map_ops ->map_page()/map_sg() methods
implement ->unmap_page()/unmap_sg() too. The one instance that doesn't,
dma_dummy_ops, is used to fail the DMA mapping, and thus there won't
be any calls to ->unmap_page()/unmap_sg().
Remove the checks for ->unmap_page()/unmap_sg() and call them directly to
create an interface that is symmetrical to ->map_page()/map_sg().
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The hardware DMA limit might not be a power of 2. When the RAM range
starts above 0 (say at 4GB), a DMA limit of 30 bits should end at 5GB.
A single high bit cannot encode this limit.
Use a plain address for the DMA zone limit instead.
Since the DMA zone can now potentially span beyond the 4GB physical
limit of DMA32, make sure to use the DMA zone for GFP_DMA32
allocations in that case.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Co-developed-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Petr Tesarik <ptesarik@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
In dma_common_find_pages(), area->flags are compared directly with
VM_DMA_COHERENT. This works because VM_DMA_COHERENT is the only set
flag.
During development of a new feature (ASI [1]), a new VM flag is
introduced, and that flag can be injected into VM_DMA_COHERENT mappings
(among others). The presence of that flag caused
dma_common_find_pages() to return NULL for VM_DMA_COHERENT addresses,
leading to a lot of problems that ended in crashes during boot.
Injecting a VM flag was a mistake to begin with, but it took a
significant amount of debugging to figure out the problem. Most users of
area->flags use bitmasking rather than strict equality to check for flags.
Update dma_common_find_pages() and dma_common_free_remap() to do the
same, which would have avoided the boot crashing. Instead, add a warning
in dma_common_find_pages() if any extra VM flags are set to catch such
problems more easily during development.
No functional change intended.
[1]https://lore.kernel.org/lkml/20240712-asi-rfc-24-v1-0-144b319a40d8@google.com/
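A sketch of the resulting check in dma_common_find_pages() (assumed
shape):
  struct vm_struct *area = find_vm_area(cpu_addr);

  /* bitmask test instead of strict equality */
  if (!area || !(area->flags & VM_DMA_COHERENT))
          return NULL;
  /* catch unexpected extra flags early during development */
  WARN(area->flags != VM_DMA_COHERENT,
       "unexpected flags in area: %p\n", cpu_addr);
  return area->pages;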
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
This reverts commit 3b5bbe798b2451820e74243b738268f51901e7d0.
Eric reported that systemd-shutdown gets broken by blocking the creation
of pidfds for kthreads, as older versions seem to rely on being able to
create a pidfd for any process in /proc.
Reported-by: Eric Biggers <ebiggers@kernel.org>
Link: https://lore.kernel.org/r/20240818035818.GA1929@sol.localdomain
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
With ftrace boot-time selftest, kmemleak reported some memory leaks in
the new test case for function graph storage for multiple tracers.
unreferenced object 0xffff888005060080 (size 32):
comm "swapper/0", pid 1, jiffies 4294676440
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 20 10 06 05 80 88 ff ff ........ .......
54 0c 1e 81 ff ff ff ff 00 00 00 00 00 00 00 00 T...............
backtrace (crc 7c93416c):
[<000000000238ee6f>] __kmalloc_cache_noprof+0x11f/0x2a0
[<0000000033d2b6c5>] enter_record+0xe8/0x150
[<0000000054c38424>] match_records+0x1cd/0x230
[<00000000c775b63d>] ftrace_set_hash+0xff/0x380
[<000000007bf7208c>] ftrace_set_filter+0x70/0x90
[<00000000a5c08dda>] test_graph_storage_multi+0x2e/0xf0
[<000000006ba028ca>] trace_selftest_startup_function_graph+0x1e8/0x260
[<00000000a715d3eb>] run_tracer_selftest+0x111/0x190
[<00000000395cbf90>] register_tracer+0xdf/0x1f0
[<0000000093e67f7b>] do_one_initcall+0x141/0x3b0
[<00000000c591b682>] do_initcall_level+0x82/0xa0
[<000000004e4c6600>] do_initcalls+0x43/0x70
[<0000000034f3c4e4>] kernel_init_freeable+0x170/0x1f0
[<00000000c7a5dab2>] kernel_init+0x1a/0x1a0
[<00000000ea105947>] ret_from_fork+0x3a/0x50
[<00000000a1932e84>] ret_from_fork_asm+0x1a/0x30
...
This means the filter hashes allocated for the fixtures are not
correctly released after the test.
Free those hash lists after the tests are done, and split the loop that
initializes fixtures from the one that registers them so that the
registration can be rolled back.
Fixes: dd120af2d5f8 ("ftrace: Add multiple fgraph storage selftest")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/172411539857.28895.13119957560263401102.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
Since register_ftrace_graph() assigns a new fgraph_ops to
fgraph_array before registering it via ftrace_startup_subops(), the new
fgraph_ops can be used in function_graph_enter().
In most cases this is still OK, because those fgraph_ops's hash table
has already been initialized by ftrace_set_filter*() etc.
But if a user registers a new fgraph_ops which does not initialize the
hash list, ftrace_ops_test() in function_graph_enter() causes a NULL
pointer dereference BUG because fgraph_ops->ops.func_hash is NULL.
This can be reproduced by the commands below, because the function
profiler's fgraph_ops does not initialize the hash list:
# cd /sys/kernel/tracing
# echo function_graph > current_tracer
# echo 1 > function_profile_enabled
To fix this problem, add a new fgraph_ops to fgraph_array after
ftrace_startup_subops(). Thus, until the new fgraph_ops is initialized,
we will see fgraph_stub on the corresponding fgraph_array entry.
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: bpf <bpf@vger.kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alan Maguire <alan.maguire@oracle.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Guo Ren <guoren@kernel.org>
Link: https://lore.kernel.org/172398528350.293426.8347220120333730248.stgit@devnote2
Fixes: c132be2c4fcc ("function_graph: Have the instances use their own ftrace_ops for filtering")
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
There are potentially useful cases where a specific iterator type might
need to be passed into some kfunc. So, in addition to existing
bpf_iter_<type>_{new,next,destroy}() kfuncs, allow passing an iterator
pointer to any kfunc.
We employ the "__iter" naming suffix for arguments that are meant to
accept iterators. We also enforce that they accept a PTR -> STRUCT
bpf_iter_<type> type chain and point to a valid, initialized
on-the-stack iterator state.
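For illustration, a kfunc taking an iterator might look like this
(hypothetical; only the "__iter" suffix convention comes from this
change):
  __bpf_kfunc int bpf_iter_num_consume_next(struct bpf_iter_num *it__iter)
  {
          /* safe: the verifier has proven it__iter points to a valid,
           * initialized on-the-stack iterator; next() advances it */
          int *v = bpf_iter_num_next(it__iter);

          return v ? *v : -1;
  }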
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20240808232230.2848712-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The verifier enforces that all iterator structs are named `bpf_iter_<name>`
and that, whenever an iterator is passed to a kfunc, it is passed as a valid
PTR -> STRUCT chain (with potentially const modifiers in between).
We'll need this check for upcoming changes, so instead of duplicating
the logic, extract it into a helper function.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240808232230.2848712-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
wq->lockdep_map is set only after __alloc_workqueue()
successfully returns. However, on its error path
__alloc_workqueue() may call destroy_workqueue(), which
expects wq->lockdep_map to already be set; this results
in a null-ptr-deref in touch_wq_lockdep_map().
Add a simple NULL-check to touch_wq_lockdep_map().
Oops: general protection fault, probably for non-canonical address
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: 0010:__lock_acquire+0x81/0x7800
[..]
Call Trace:
<TASK>
? __die_body+0x66/0xb0
? die_addr+0xb2/0xe0
? exc_general_protection+0x300/0x470
? asm_exc_general_protection+0x22/0x30
? __lock_acquire+0x81/0x7800
? mark_lock+0x94/0x330
? __lock_acquire+0x12fd/0x7800
? __lock_acquire+0x3439/0x7800
lock_acquire+0x14c/0x3e0
? __flush_workqueue+0x167/0x13a0
? __init_swait_queue_head+0xaf/0x150
? __flush_workqueue+0x167/0x13a0
__flush_workqueue+0x17d/0x13a0
? __flush_workqueue+0x167/0x13a0
? lock_release+0x50f/0x830
? drain_workqueue+0x94/0x300
drain_workqueue+0xe3/0x300
destroy_workqueue+0xac/0xc40
? workqueue_sysfs_register+0x159/0x2f0
__alloc_workqueue+0x1506/0x1760
alloc_workqueue+0x61/0x150
...
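A minimal sketch of the described check (assumed shape;
wq->lockdep_map is a pointer the error path has not set up yet):
  static void touch_wq_lockdep_map(struct workqueue_struct *wq)
  {
          if (unlikely(!wq->lockdep_map))
                  return; /* error path: lockdep map not installed */

          lock_map_acquire(wq->lockdep_map);
          lock_map_release(wq->lockdep_map);
  }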
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Mark emergency sections wherever multiple lines of
lock debugging output are generated. In an emergency
section, every printk() call will attempt to directly
flush to the consoles using the EMERGENCY priority.
Note that debug_show_all_locks() and
lockdep_print_held_locks() rely on their callers to
enter the emergency section. This is because these
functions can also be called in non-emergency
situations (such as sysrq).
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-36-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Mark emergency sections wherever multiple lines of
rcu stall information are generated. In an emergency
section, every printk() call will attempt to directly
flush to the consoles using the EMERGENCY priority.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20240820063001.36405-35-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Mark an emergency section beginning with oops_enter() until the
end of oops_exit(). In this section, every printk() call will
attempt to directly flush to the consoles using the EMERGENCY
priority.
The very end of oops_exit() performs a kmsg_dump(). This is not
included in the emergency section because it is another
flushing mechanism that should occur after the consoles have
flushed the oops messages.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-34-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Mark the full contents of __warn() as an emergency section. In
this section, every printk() call will attempt to directly
flush to the consoles using the EMERGENCY priority.
Co-developed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-33-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
In emergency situations (something has gone wrong but the
system continues to operate), usually important information
(such as a backtrace) is generated via printk(). This
information should be pushed out to the consoles ASAP.
Add per-CPU emergency nesting tracking because an emergency
can arise while in an emergency situation.
Add functions to mark the beginning and end of emergency
sections where the urgent messages are generated.
Perform direct console flushing at the emergency priority if
the current CPU is in an emergency state and it is safe to do
so.
Note that the emergency state is not system-wide. While one CPU
is in an emergency state, another CPU may attempt to print
console messages at normal priority.
Also note that printk() already attempts to flush consoles in
the caller context for normal priority. However, follow-up
changes will introduce printing kthreads, in which case the
normal priority printk() calls will offload to the kthreads.
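Illustrative usage of the section markers (function names assumed
from this series; the call site is hypothetical):
  nbcon_cpu_emergency_enter();
  /* every printk() here attempts a direct flush at EMERGENCY prio */
  pr_emerg("backtrace for debugging:\n");
  dump_stack();
  nbcon_cpu_emergency_exit();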
Co-developed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-32-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
There are many call sites where console flushing occur.
Depending on the system state and types of consoles, the flush
methods to use are different. A flush call site generally must
consider:
@have_boot_console
@have_nbcon_console
@have_legacy_console
@legacy_allow_panic_sync
is_printk_preferred()
and take into account the current CPU state:
NBCON_PRIO_NORMAL
NBCON_PRIO_EMERGENCY
NBCON_PRIO_PANIC
in order to decide if it should:
flush nbcon directly via the write_atomic() callback
flush legacy directly via console_unlock
flush legacy via offload to irq_work
All of these call sites use their own logic to make this
decision, which is complicated and error prone. Especially
later when two more flush methods will be introduced:
flush nbcon via offload to kthread
flush legacy via offload to kthread
Introduce a new internal struct console_flush_type that specifies
which console flushing methods should be used in the context of
the caller.
Introduce a helper function to fill out console_flush_type to
be used for flushing call sites.
Replace the logic of all flushing call sites to use the new
helper.
This change standardizes behavior, leading to both fixes and
optimizations across various call sites. For instance, in
console_cpu_notify(), the new logic ensures that nbcon consoles
are flushed when they aren't managed by the legacy loop.
Similarly, in console_flush_on_panic(), the system no longer
needs to flush nbcon consoles if none are present.
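A sketch of the resulting call-site pattern (struct and field names
assumed from the description above):
  struct console_flush_type ft;

  printk_get_console_flush_type(&ft);
  if (ft.nbcon_atomic)
          nbcon_atomic_flush_pending();
  if (ft.legacy_direct && console_trylock())
          console_unlock();
  if (ft.legacy_offload)
          defer_console_output();  /* flush via irq_work */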
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-31-john.ogness@linutronix.de
[pmladek@suse.com: Updated the commit message.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
If legacy and nbcon consoles are registered and the nbcon
consoles are allowed to flush (i.e. no boot consoles
registered), the legacy consoles will no longer perform
direct printing on the panic CPU until after the backtrace
has been stored. This will give the safe nbcon consoles a
chance to print the panic messages before allowing the
unsafe legacy consoles to print.
If no nbcon consoles are registered or they are not allowed
to flush because boot consoles are registered, there is no
change in behavior (i.e. legacy consoles will always attempt
to print from the printk() caller context).
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-30-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Add a global flag @have_nbcon_console to identify if any nbcon
consoles are registered. This will be used in follow-up commits
to preserve legacy behavior when no nbcon consoles are registered.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-29-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Currently the console lock is used to attempt legacy-type
printing even if there are no legacy or boot consoles registered.
If no such consoles are registered, the console lock does not
need to be taken.
Add tracking of legacy console registration and use it with
boot console tracking to avoid unnecessary code paths, i.e.
do not use the console lock if there are no boot consoles
and no legacy consoles.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-28-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Add nbcon_atomic_flush_unsafe() to flush all nbcon consoles
using the write_atomic() callback and allowing unsafe hostile
takeovers. Call this at the end of panic() as a final attempt
to flush any pending messages.
Note that legacy consoles use unsafe methods for flushing
from the beginning of panic (see bust_spinlocks()). Therefore,
systems using both legacy and nbcon consoles may still fail to
see panic messages due to unsafe legacy console usage.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-27-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
In console_flush_on_panic(), flush the nbcon consoles before
flushing legacy consoles. The legacy write() callbacks are not
fully safe when oops_in_progress is set.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-26-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
There may be new records that were added while a driver was
holding the nbcon context for non-printing purposes. These
new records must be flushed by the nbcon_device_release()
context because no other context will do it.
If boot consoles are registered, the legacy loop is used
(either direct or per irq_work) to handle the flushing.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-25-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
If printk has been explicitly deferred or is called from NMI
context, legacy console printing must be deferred to an irq_work
context. Introduce a helper function is_printk_legacy_deferred()
for a CPU to query if it must defer legacy console printing.
In follow-up commits this helper will be needed at other call
sites as well.
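A sketch of the helper (assumed shape):
  bool is_printk_legacy_deferred(void)
  {
          return (this_cpu_read(printk_context) || in_nmi());
  }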
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-24-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Allow nbcon consoles to print messages in the legacy printk()
caller context (printing via unlock) by integrating them into
console_flush_all(). The write_atomic() callback is used for
printing.
Provide nbcon_legacy_emit_next_record(), which acts as the
nbcon variant of console_emit_next_record(). Call this variant
within console_flush_all() for nbcon consoles. Since nbcon
consoles use their own @nbcon_seq variable to track the next
record to print, this also must be appropriately handled in
console_flush_all().
Note that the legacy printing logic uses @handover to detect
handovers for printing all consoles. For nbcon consoles,
handovers/takeovers occur on a per-console basis and thus do
not cause the console_flush_all() loop to abort.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-23-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Unfortunately it is not known if a boot console and a regular
(legacy or nbcon) console use the same hardware. For this reason
they must not be allowed to print simultaneously.
For legacy consoles this is not an issue because they are
already synchronized with the boot consoles using the console
lock. However nbcon consoles can be triggered separately.
Add a global flag @have_boot_console to identify if any boot
consoles are registered. This will be used in follow-up commits
to ensure that boot consoles and nbcon consoles cannot print
simultaneously.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-22-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Provide nbcon_atomic_flush_pending() to perform flushing of all
registered nbcon consoles using their write_atomic() callback.
Unlike console_flush_all(), nbcon_atomic_flush_pending() will
only flush up through the newest record at the time of the
call. This prevents a CPU from printing unbounded when other
CPUs are adding records. If new records are added while
flushing, it is expected that the dedicated printer threads
will print those records. If the printer thread is not
available (which is always the case at this point in the
rework), nbcon_atomic_flush_pending() _will_ flush all records
in the ringbuffer.
Unlike console_flush_all(), nbcon_atomic_flush_pending() will
fully flush one console before flushing the next. This helps to
guarantee that a block of pending records (such as a stack
trace in an emergency situation) can be printed atomically at
once before releasing console ownership.
nbcon_atomic_flush_pending() is safe in any context because it
uses write_atomic() and acquires with unsafe_takeover disabled.
Co-developed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Thomas Gleixner (Intel) <tglx@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-21-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Add a helper function to use the current state of the CPU to
determine which priority to assign to the printing context.
The EMERGENCY priority handling is added in a follow-up commit.
It will use a per-CPU variable.
Note: nbcon_device_try_acquire(), which is used by console
drivers to acquire the nbcon console for non-printing
activities, is hard-coded to always use NORMAL priority.
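A sketch of the helper (name and shape assumed; the EMERGENCY case
arrives in the follow-up commit):
  enum nbcon_prio nbcon_get_default_prio(void)
  {
          if (this_cpu_in_panic())
                  return NBCON_PRIO_PANIC;

          return NBCON_PRIO_NORMAL;
  }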
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-20-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
The caller of console_is_usable() usually needs @console->flags
for its own checks. Rather than having console_is_usable() read
its own copy, make the caller pass in the @flags. This also
ensures that the caller saw the same @flags value.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-19-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
The nbcon consoles use a different printing callback. For nbcon
consoles, check for the write_atomic() callback instead of
write().
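A sketch of the check inside console_is_usable() (shape assumed):
  if (flags & CON_NBCON) {
          if (!con->write_atomic)
                  return false;
  } else {
          if (!con->write)
                  return false;
  }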
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-18-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Move console_is_usable() as-is into internal.h so that it can
be used by nbcon printing functions as well.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-17-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
The headers kernel.h, serial_core.h, and console.h pull in the
definitions of many types and functions from other headers.
Rather than relying on these as proxy headers, explicitly
include all headers providing the needed definitions. Also sort the
include list alphabetically to make duplicates easy to detect.
Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-16-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Provide functions nbcon_device_try_acquire() and
nbcon_device_release() which will try to acquire the nbcon
console ownership with NBCON_PRIO_NORMAL and mark it unsafe for
handover/takeover.
These functions are to be used together with the device-specific
locking when performing non-printing activities on the console
device. They allow synchronization against the write_atomic()
callback, which is serialized, for higher priority contexts,
only by acquiring the console context ownership.
Pitfalls:
The API must be called from a context with migration disabled
because it uses per-CPU variables internally.
The context is marked unsafe for takeover the entire time. This
guarantees full serialization against any write_atomic() caller,
except for the final flush in panic(), which might try an unsafe
takeover.
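Illustrative driver-side pattern (the hardware poke is a
hypothetical helper):
  if (nbcon_device_try_acquire(con)) {
          /* non-printing hardware access; write_atomic() callers
           * see the context as unsafe and stay away */
          my_uart_set_baud(port, 115200);
          nbcon_device_release(con);
  }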
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-14-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Console drivers typically have to deal with access to the
hardware via user input/output (such as an interactive login
shell) and output of kernel messages via printk() calls.
In most situations they use a classic driver-specific locking
mechanism. But the console->write_atomic() callbacks used by
nbcon consoles are synchronized only by acquiring the console
context.
Synchronization via console context ownership is possible only
when the console driver is registered, i.e. when a particular
device driver is connected with a particular console driver.
The two synchronization mechanisms must be synchronized between
each other. It is tricky because the console context ownership
is quite special. It might be taken over by a higher priority
context. Also CPU migration must be disabled. The most tricky
part is to (dis)connect these two mechanisms during the console
(un)registration.
Use the driver-specific locking callbacks: device_lock(),
device_unlock(). They allow taking the device-specific lock
while the device is being (un)registered by the related console
driver.
For example, these callbacks lock/unlock the port lock for
serial port drivers.
Note that the driver-specific locking is only needed during
(un)register if it is an nbcon console with the write_atomic()
callback implemented. If write_atomic() is not implemented, the
driver should never attempt to access the hardware without
first acquiring its driver-specific lock.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-10-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
The return value of write_atomic() does not provide any useful
information. On the contrary, it makes it more complicated
for the caller to deal with that information appropriately.
Change write_atomic() to not have a return value. If the
message did not get printed due to loss of ownership, the
caller will notice this on its own. If ownership was not lost,
it will be assumed that the driver successfully printed the
message and the sequence number for that console will be
incremented.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-7-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
The functions nbcon_owner_matches() and nbcon_waiter_matches()
use a minimal set of data to determine if a context matches.
The existing kerneldoc and comments were not clear enough and
caused the printk folks to re-prove that the functions are
indeed reliable in all cases.
Update and expand the explanations so that it is clear that the
implementations are sufficient for all cases.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-6-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Add validation that printk_deferred_enter()/_exit() are called in
non-migration contexts.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-5-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
If a non-boot console is registering and boot consoles exist,
the consoles are flushed before being unregistered. This allows
the non-boot console to continue where the boot console left
off.
If for whatever reason flushing fails, the lowest seq found from
any of the enabled boot consoles is used. Until now con->seq was
checked. However, if it is an nbcon boot console, the function
nbcon_seq_read() must be used to read seq because con->seq is
not updated for nbcon consoles.
Check if it is an nbcon boot console and if so call
nbcon_seq_read() to read seq.
Also, avoid usage of con->seq as temporary storage of the
starting record. Instead, rename console_init_seq() to
get_init_console_seq() and just return the value. For nbcon
consoles set the sequence via nbcon_seq_force(), for legacy
consoles set con->seq.
The cleaner design makes sure that the value is set before the
console is added to the console list. It also unifies the
sequence number initialization for legacy and nbcon consoles.
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Link: https://lore.kernel.org/r/20240820063001.36405-4-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Rather than splitting the nbcon allocation and initialization into
two pieces, perform all initialization in nbcon_alloc(). Later,
the initial sequence is calculated and can be explicitly set using
nbcon_seq_force(). This removes the need for the strong rules of
nbcon_init() that even included a BUG_ON().
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20240820063001.36405-3-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
kernel/printk/printk.c:284:5: sparse: sparse: context imbalance in
'console_srcu_read_lock' - wrong count at exit
include/linux/srcu.h:301:9: sparse: sparse: context imbalance in
'console_srcu_read_unlock' - unexpected unlock
Fixes: 6c4afa79147e ("printk: Prepare for SRCU console list protection")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20240820063001.36405-2-john.ogness@linutronix.de
Signed-off-by: Petr Mladek <pmladek@suse.com>
|
|
Calling va_start/va_end multiple times is undefined and causes
problems with certain compiler/platform combinations.
Change alloc_ordered_workqueue_lockdep_map to a macro and update
__alloc_workqueue to take a va_list argument.
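A sketch of the macro form (assumed shape): the caller's varargs
expand directly at each call site, so no va_list is walked twice:
  #define alloc_ordered_workqueue_lockdep_map(fmt, flags, lockdep_map, args...) \
          alloc_workqueue_lockdep_map(fmt,                                      \
                                      WQ_UNBOUND | __WQ_ORDERED | (flags), 1,   \
                                      lockdep_map, ##args)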
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|