author | Alexei Starovoitov <ast@kernel.org> | 2020-05-09 17:05:27 -0700 |
---|---|---|
committer | Alexei Starovoitov <ast@kernel.org> | 2020-05-09 17:05:33 -0700 |
commit | 180139dca8b38c858027b8360ee10064fdb2fbf7 (patch) | |
tree | 4491e7c009f30efcb70b55948767df86a1f8ce70 /kernel/trace/bpf_trace.c | |
parent | 8086fbaf49345f988deec539ec8e182b02914401 (diff) | |
parent | 6879c042e10584ea9d5e2204939cafadcd500465 (diff) |
Merge branch 'bpf_iter'
Yonghong Song says:
====================
Motivation:
The current ways to dump kernel data structures are mostly:
1. /proc system
2. various specific tools like "ss" which require kernel support.
3. drgn
The drawback of the first two is that whenever you want to dump more, you
need to change the kernel. For example, Martin wants to dump socket local
storage with "ss". A kernel change is needed for it to work ([1]).
This is also the direct motivation for this work.
drgn ([2]) solves this problem nicely and no kernel change is needed.
But since drgn is not able to verify the validity of a particular pointer value,
it might present wrong results in rare cases.
In this patch set, we introduce bpf iterator. Initial kernel changes are
still needed for the kernel data of interest, but a later data structure
change will not require kernel changes any more. The bpf program itself can
adapt to new data structure changes. This gives a certain flexibility with
guaranteed correctness.
In this patch set, kernel seq_ops is used to facilitate iterating through
kernel data, similar to current /proc and many other lossless kernel
dumping facilities. In the future, different iterators can be
implemented to trade off losslessness for other criteria e.g. no
repeated object visits, etc.
User Interface:
1. Similar to prog/map/link, the iterator can be pinned into a
path within a bpffs mount point.
2. The bpftool command can pin an iterator to a file
bpftool iter pin <bpf_prog.o> <path>
3. Use `cat <path>` to dump the contents.
Use `rm -f <path>` to remove the pinned iterator.
4. The anonymous iterator can be created as well.
Please see patch #19 and #20 for bpf programs and bpf iterator
output examples.
Note that certain iterators are namespace aware. For example,
task and task_file targets only iterate through current pid namespace.
ipv6_route and netlink will iterate through current net namespace.
Please see individual patches for implementation details.
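
As an illustration of step 3, reading a pinned iterator from C needs nothing
beyond open()/read(), just as `cat <path>` does. The sketch below is not part
of this patch set; the function name dump_path() and any path passed to it
(for example a file under a bpffs mount) are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Copy everything readable from `path` to `out`, like `cat <path>`.
 * Returns the number of bytes copied, or a negative errno on failure.
 * `path` is assumed to be a pinned bpf iterator, but any readable
 * file works the same way.
 */
long dump_path(const char *path, FILE *out)
{
	char buf[4096];
	long total = 0;
	ssize_t n;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	/* read() returns 0 once there is no more output */
	while ((n = read(fd, buf, sizeof(buf))) != 0) {
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			total = -errno;
			break;
		}
		if (fwrite(buf, 1, (size_t)n, out) != (size_t)n) {
			total = -EIO;
			break;
		}
		total += n;
	}

	close(fd);
	return total;
}
```

A caller would pass the pinned path and stdout, e.g. dump_path(path, stdout),
and treat a negative return value as an errno code.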
Performance:
The bpf iterator provides in-kernel aggregation abilities
for kernel data. This can greatly improve performance
compared to e.g., iterating all process directories under /proc.
For example, I did an experiment on my VM with an application forking
a varying number of tasks, with each forked process opening a varying number
of files. The results, with latency in microseconds, are as follows:
  # of forked tasks   # of open files   # of bpf_prog calls   latency (us)
  100                 100               11503                 7586
  1000                1000              1013203               709513
  10000               100               1130203               764519
The number of bpf_prog calls may be more than forked tasks multiplied by
open files since there are other tasks running on the system.
The bpf program is a do-nothing program. One million bpf calls take
less than one second.
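
To make the per-call cost explicit, the arithmetic behind the table can be
sketched as below. The numbers are copied from the table; the struct and
function names are made up for this illustration and do not appear in the
patch set.

```c
#include <stdio.h>

/* One row of the measurement table (names are illustrative only). */
struct iter_run {
	unsigned long tasks;		/* number of forked tasks */
	unsigned long files;		/* open files per forked task */
	unsigned long prog_calls;	/* number of bpf_prog calls */
	unsigned long latency_us;	/* total latency in microseconds */
};

/* Measurements copied from the table in the commit message. */
static const struct iter_run runs[] = {
	{ 100,   100,  11503,   7586 },
	{ 1000,  1000, 1013203, 709513 },
	{ 10000, 100,  1130203, 764519 },
};

/* Average cost of one bpf_prog call, in nanoseconds. */
double ns_per_call(const struct iter_run *r)
{
	return (double)r->latency_us * 1000.0 / (double)r->prog_calls;
}

/* Print the average per-call cost for every row. */
void print_per_call_cost(void)
{
	size_t i;

	for (i = 0; i < sizeof(runs) / sizeof(runs[0]); i++)
		printf("%lu tasks x %lu files: %.0f ns per call\n",
		       runs[i].tasks, runs[i].files, ns_per_call(&runs[i]));
}
```

All three rows work out to roughly 0.66 to 0.70 us per call, which is
consistent with one million calls taking less than one second.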
Although the initial motivation is from Martin's sk_local_storage,
this patch set does not implement tcp6 sockets and sk_local_storage.
The /proc/net/tcp6 involves three types of sockets, timewait,
request and tcp6 sockets. Some kind of type casting or other
mechanism is needed to handle all these socket types in one
bpf program. This will be addressed in future work.
Currently, we do not support kernel data generated under module.
This requires some BTF work.
More work is needed for more iterators, e.g., tcp, udp, bpf_map elements, etc.
Changelog:
v3 -> v4:
- in bpf_seq_read(), if start() failed with an error, return that
error to user space (Andrii)
- in bpf_seq_printf(), if reading kernel memory failed for
%s and %p{i,I}{4,6}, set buffer to empty string or address 0.
Documented this behavior in uapi header (Andrii)
- fix a few error handling issues for bpftool (Andrii)
- A few other minor fixes and cosmetic changes.
v2 -> v3:
- add bpf_iter_unreg_target() to unregister a target, used in the
error path of the __init functions.
- handle err != 0 before handling overflow (Andrii)
- reference count "task" for task_file target (Andrii)
- remove some redundancy for bpf_map/task/task_file targets
- add bpf_iter_unreg_target() in ip6_route_cleanup()
- Handling "%%" format in bpf_seq_printf() (Andrii)
- implement auto-attach for bpf_iter in libbpf (Andrii)
- add macros offsetof and container_of in bpf_helpers.h (Andrii)
- add tests for auto-attach and program-return-1 cases
- some other minor fixes
v1 -> v2:
- removed target_feature, using callback functions instead
- checking target to ensure program specified btf_id supported (Martin)
- link_create change with new changes from Andrii
- better handling of btf_iter vs. seq_file private data (Martin, Andrii)
- implemented bpf_seq_read() (Andrii, Alexei)
- percpu buffer for bpf_seq_printf() (Andrii)
- better syntax for BPF_SEQ_PRINTF macro (Andrii)
- bpftool fixes (Quentin)
- a lot of other fixes
RFC v2 -> v1:
- rename bpfdump to bpf_iter
- use bpffs instead of a new file system
- use bpf_link to streamline and simplify iterator creation.
References:
[1]: https://lore.kernel.org/bpf/20200225230427.1976129-1-kafai@fb.com
[2]: https://github.com/osandov/drgn
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Diffstat (limited to 'kernel/trace/bpf_trace.c')
-rw-r--r-- | kernel/trace/bpf_trace.c | 214 |
1 files changed, 214 insertions, 0 deletions
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index e875c95d3ced..d961428fb5b6 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -457,6 +457,212 @@ const struct bpf_func_proto *bpf_get_trace_printk_proto(void)
 	return &bpf_trace_printk_proto;
 }
 
+#define MAX_SEQ_PRINTF_VARARGS		12
+#define MAX_SEQ_PRINTF_MAX_MEMCPY	6
+#define MAX_SEQ_PRINTF_STR_LEN		128
+
+struct bpf_seq_printf_buf {
+	char buf[MAX_SEQ_PRINTF_MAX_MEMCPY][MAX_SEQ_PRINTF_STR_LEN];
+};
+static DEFINE_PER_CPU(struct bpf_seq_printf_buf, bpf_seq_printf_buf);
+static DEFINE_PER_CPU(int, bpf_seq_printf_buf_used);
+
+BPF_CALL_5(bpf_seq_printf, struct seq_file *, m, char *, fmt, u32, fmt_size,
+	   const void *, data, u32, data_len)
+{
+	int err = -EINVAL, fmt_cnt = 0, memcpy_cnt = 0;
+	int i, buf_used, copy_size, num_args;
+	u64 params[MAX_SEQ_PRINTF_VARARGS];
+	struct bpf_seq_printf_buf *bufs;
+	const u64 *args = data;
+
+	buf_used = this_cpu_inc_return(bpf_seq_printf_buf_used);
+	if (WARN_ON_ONCE(buf_used > 1)) {
+		err = -EBUSY;
+		goto out;
+	}
+
+	bufs = this_cpu_ptr(&bpf_seq_printf_buf);
+
+	/*
+	 * bpf_check()->check_func_arg()->check_stack_boundary()
+	 * guarantees that fmt points to bpf program stack,
+	 * fmt_size bytes of it were initialized and fmt_size > 0
+	 */
+	if (fmt[--fmt_size] != 0)
+		goto out;
+
+	if (data_len & 7)
+		goto out;
+
+	for (i = 0; i < fmt_size; i++) {
+		if (fmt[i] == '%') {
+			if (fmt[i + 1] == '%')
+				i++;
+			else if (!data || !data_len)
+				goto out;
+		}
+	}
+
+	num_args = data_len / 8;
+
+	/* check format string for allowed specifiers */
+	for (i = 0; i < fmt_size; i++) {
+		/* only printable ascii for now. */
+		if ((!isprint(fmt[i]) && !isspace(fmt[i])) || !isascii(fmt[i])) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		if (fmt[i] != '%')
+			continue;
+
+		if (fmt[i + 1] == '%') {
+			i++;
+			continue;
+		}
+
+		if (fmt_cnt >= MAX_SEQ_PRINTF_VARARGS) {
+			err = -E2BIG;
+			goto out;
+		}
+
+		if (fmt_cnt >= num_args) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		/* fmt[i] != 0 && fmt[last] == 0, so we can access fmt[i + 1] */
+		i++;
+
+		/* skip optional "[0 +-][num]" width formating field */
+		while (fmt[i] == '0' || fmt[i] == '+'  || fmt[i] == '-' ||
+		       fmt[i] == ' ')
+			i++;
+		if (fmt[i] >= '1' && fmt[i] <= '9') {
+			i++;
+			while (fmt[i] >= '0' && fmt[i] <= '9')
+				i++;
+		}
+
+		if (fmt[i] == 's') {
+			/* try our best to copy */
+			if (memcpy_cnt >= MAX_SEQ_PRINTF_MAX_MEMCPY) {
+				err = -E2BIG;
+				goto out;
+			}
+
+			err = strncpy_from_unsafe(bufs->buf[memcpy_cnt],
+						  (void *) (long) args[fmt_cnt],
+						  MAX_SEQ_PRINTF_STR_LEN);
+			if (err < 0)
+				bufs->buf[memcpy_cnt][0] = '\0';
+			params[fmt_cnt] = (u64)(long)bufs->buf[memcpy_cnt];
+
+			fmt_cnt++;
+			memcpy_cnt++;
+			continue;
+		}
+
+		if (fmt[i] == 'p') {
+			if (fmt[i + 1] == 0 ||
+			    fmt[i + 1] == 'K' ||
+			    fmt[i + 1] == 'x') {
+				/* just kernel pointers */
+				params[fmt_cnt] = args[fmt_cnt];
+				fmt_cnt++;
+				continue;
+			}
+
+			/* only support "%pI4", "%pi4", "%pI6" and "%pi6". */
+			if (fmt[i + 1] != 'i' && fmt[i + 1] != 'I') {
+				err = -EINVAL;
+				goto out;
+			}
+			if (fmt[i + 2] != '4' && fmt[i + 2] != '6') {
+				err = -EINVAL;
+				goto out;
+			}
+
+			if (memcpy_cnt >= MAX_SEQ_PRINTF_MAX_MEMCPY) {
+				err = -E2BIG;
+				goto out;
+			}
+
+
+			copy_size = (fmt[i + 2] == '4') ? 4 : 16;
+
+			err = probe_kernel_read(bufs->buf[memcpy_cnt],
+						(void *) (long) args[fmt_cnt],
+						copy_size);
+			if (err < 0)
+				memset(bufs->buf[memcpy_cnt], 0, copy_size);
+			params[fmt_cnt] = (u64)(long)bufs->buf[memcpy_cnt];
+
+			i += 2;
+			fmt_cnt++;
+			memcpy_cnt++;
+			continue;
+		}
+
+		if (fmt[i] == 'l') {
+			i++;
+			if (fmt[i] == 'l')
+				i++;
+		}
+
+		if (fmt[i] != 'i' && fmt[i] != 'd' &&
+		    fmt[i] != 'u' && fmt[i] != 'x') {
+			err = -EINVAL;
+			goto out;
+		}
+
+		params[fmt_cnt] = args[fmt_cnt];
+		fmt_cnt++;
+	}
+
+	/* Maximumly we can have MAX_SEQ_PRINTF_VARARGS parameter, just give
+	 * all of them to seq_printf().
+	 */
+	seq_printf(m, fmt, params[0], params[1], params[2], params[3],
+		   params[4], params[5], params[6], params[7], params[8],
+		   params[9], params[10], params[11]);
+
+	err = seq_has_overflowed(m) ? -EOVERFLOW : 0;
+out:
+	this_cpu_dec(bpf_seq_printf_buf_used);
+	return err;
+}
+
+static int bpf_seq_printf_btf_ids[5];
+static const struct bpf_func_proto bpf_seq_printf_proto = {
+	.func		= bpf_seq_printf,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_BTF_ID,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE,
+	.arg4_type      = ARG_PTR_TO_MEM_OR_NULL,
+	.arg5_type      = ARG_CONST_SIZE_OR_ZERO,
+	.btf_id		= bpf_seq_printf_btf_ids,
+};
+
+BPF_CALL_3(bpf_seq_write, struct seq_file *, m, const void *, data, u32, len)
+{
+	return seq_write(m, data, len) ? -EOVERFLOW : 0;
+}
+
+static int bpf_seq_write_btf_ids[5];
+static const struct bpf_func_proto bpf_seq_write_proto = {
+	.func		= bpf_seq_write,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_BTF_ID,
+	.arg2_type	= ARG_PTR_TO_MEM,
+	.arg3_type	= ARG_CONST_SIZE_OR_ZERO,
+	.btf_id		= bpf_seq_write_btf_ids,
+};
+
 static __always_inline int
 get_map_perf_counter(struct bpf_map *map, u64 flags,
 		     u64 *value, u64 *enabled, u64 *running)
@@ -1226,6 +1432,14 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_xdp_output:
 		return &bpf_xdp_output_proto;
 #endif
+	case BPF_FUNC_seq_printf:
+		return prog->expected_attach_type == BPF_TRACE_ITER ?
+		       &bpf_seq_printf_proto :
+		       NULL;
+	case BPF_FUNC_seq_write:
+		return prog->expected_attach_type == BPF_TRACE_ITER ?
+		       &bpf_seq_write_proto :
+		       NULL;
 	default:
 		return raw_tp_prog_func_proto(func_id, prog);
 	}
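
The format-string checks that bpf_seq_printf() performs before calling
seq_printf() can be tried out in user space. The sketch below mirrors only the
specifier-validation loop of the diff; it does not copy strings or addresses
and never calls seq_printf(). The function name check_seq_fmt() and the
shortened macro names are assumptions made for this illustration, and
isascii() is replaced by an explicit range check to stay within ISO C.

```c
#include <ctype.h>
#include <errno.h>
#include <stddef.h>

#define MAX_VARARGS	12	/* mirrors MAX_SEQ_PRINTF_VARARGS */
#define MAX_MEMCPY	6	/* mirrors MAX_SEQ_PRINTF_MAX_MEMCPY */

/*
 * Validate a NUL-terminated format string against the subset accepted
 * in the diff. num_args is the number of u64 slots the caller supplies
 * (data_len / 8 in the helper). Returns the number of arguments the
 * format consumes, or a negative errno.
 */
int check_seq_fmt(const char *fmt, int num_args)
{
	int fmt_cnt = 0, memcpy_cnt = 0;
	size_t i;

	for (i = 0; fmt[i]; i++) {
		unsigned char c = (unsigned char)fmt[i];

		/* only printable ASCII and whitespace are allowed */
		if (c > 127 || (!isprint(c) && !isspace(c)))
			return -EINVAL;
		if (c != '%')
			continue;
		if (fmt[i + 1] == '%') {	/* literal "%%" */
			i++;
			continue;
		}
		if (fmt_cnt >= MAX_VARARGS)
			return -E2BIG;
		if (fmt_cnt >= num_args)
			return -EINVAL;
		i++;

		/* skip optional "[0 +-][num]" flags and width */
		while (fmt[i] == '0' || fmt[i] == '+' ||
		       fmt[i] == '-' || fmt[i] == ' ')
			i++;
		if (fmt[i] >= '1' && fmt[i] <= '9') {
			i++;
			while (fmt[i] >= '0' && fmt[i] <= '9')
				i++;
		}

		if (fmt[i] == 's') {	/* string, needs a copy buffer */
			if (memcpy_cnt >= MAX_MEMCPY)
				return -E2BIG;
			memcpy_cnt++;
			fmt_cnt++;
			continue;
		}
		if (fmt[i] == 'p') {
			char next = fmt[i + 1];

			/* plain kernel pointer: "%p" at end, "%pK", "%px" */
			if (next == '\0' || next == 'K' || next == 'x') {
				fmt_cnt++;
				continue;
			}
			/* otherwise only %pI4, %pi4, %pI6, %pi6 */
			if (next != 'i' && next != 'I')
				return -EINVAL;
			if (fmt[i + 2] != '4' && fmt[i + 2] != '6')
				return -EINVAL;
			if (memcpy_cnt >= MAX_MEMCPY)
				return -E2BIG;
			i += 2;
			memcpy_cnt++;
			fmt_cnt++;
			continue;
		}
		if (fmt[i] == 'l') {	/* optional "l" or "ll" */
			i++;
			if (fmt[i] == 'l')
				i++;
		}
		if (fmt[i] != 'i' && fmt[i] != 'd' &&
		    fmt[i] != 'u' && fmt[i] != 'x')
			return -EINVAL;
		fmt_cnt++;
	}
	return fmt_cnt;
}
```

With this sketch, "%d %s\n" with two arguments is accepted, "%f\n" is rejected
with -EINVAL, and a seventh "%s" in one format string hits the copy-buffer
limit and yields -E2BIG.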