summaryrefslogtreecommitdiff
path: root/kernel/seccomp.c
AgeCommit message (Collapse)Author
2024-07-24sysctl: treewide: constify the ctl_table argument of proc_handlersJoel Granados
const qualify the struct ctl_table argument in the proc_handler function signatures. This is a prerequisite to moving the static ctl_table structs into .rodata data which will ensure that proc_handler function pointers cannot be modified. This patch has been generated by the following coccinelle script: ``` virtual patch @r1@ identifier ctl, write, buffer, lenp, ppos; identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)"; @@ int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int write, void *buffer, size_t *lenp, loff_t *ppos); @r2@ identifier func, ctl, write, buffer, lenp, ppos; @@ int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int write, void *buffer, size_t *lenp, loff_t *ppos) { ... } @r3@ identifier func; @@ int func( - struct ctl_table * + const struct ctl_table * ,int , void *, size_t *, loff_t *); @r4@ identifier func, ctl; @@ int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int , void *, size_t *, loff_t *); @r5@ identifier func, write, buffer, lenp, ppos; @@ int func( - struct ctl_table * + const struct ctl_table * ,int write, void *buffer, size_t *lenp, loff_t *ppos); ``` * Code formatting was adjusted in xfs_sysctl.c to comply with code conventions. The xfs_stats_clear_proc_handler, xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where adjusted. * The ctl_table argument in proc_watchdog_common was const qualified. This is called from a proc_handler itself and is calling back into another proc_handler, making it necessary to change it as part of the proc_handler migration. Co-developed-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Co-developed-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-06-28seccomp: release task filters when the task exitsAndrei Vagin
Previously, seccomp filters were released in release_task(), which required the process to exit and its zombie to be collected. However, exited threads/processes can't trigger any seccomp events, making it more logical to release filters upon task exits. This adjustment simplifies scenarios where a parent is tracing its child process. The parent process can now handle all events from a seccomp listening descriptor and then call wait to collect a child zombie. seccomp_filter_release takes the siglock to avoid races with seccomp_sync_threads. There was an idea to bypass taking the lock by checking PF_EXITING, but it can be set without holding siglock if threads have SIGNAL_GROUP_EXIT. This means it can happen concurently with seccomp_filter_release. This change also fixes another minor problem. Suppose that a group leader installs the new filter without SECCOMP_FILTER_FLAG_TSYNC, exits, and becomes a zombie. Without this change, SECCOMP_FILTER_FLAG_TSYNC from any other thread can never succeed, seccomp_can_sync_threads() will check a zombie leader and is_ancestor() will fail. Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrei Vagin <avagin@google.com> Link: https://lore.kernel.org/r/20240628021014.231976-3-avagin@google.com Reviewed-by: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Kees Cook <kees@kernel.org>
2024-06-28seccomp: interrupt SECCOMP_IOCTL_NOTIF_RECV when all users have exitedAndrei Vagin
SECCOMP_IOCTL_NOTIF_RECV promptly returns when a seccomp filter becomes unused, as a filter without users can't trigger any events. Previously, event listeners had to rely on epoll to detect when all processes had exited. The change is based on the 'commit 99cdb8b9a573 ("seccomp: notify about unused filter")' which implemented (E)POLLHUP notifications. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrei Vagin <avagin@google.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/r/20240628021014.231976-2-avagin@google.com Reviewed-by: Tycho Andersen <tandersen@netflix.com> Signed-off-by: Kees Cook <kees@kernel.org>
2024-05-17Merge tag 'sysctl-6.10-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl Pull sysctl updates from Joel Granados: - Remove sentinel elements from ctl_table structs in kernel/* Removing sentinels in ctl_table arrays reduces the build time size and runtime memory consumed by ~64 bytes per array. Removals for net/, io_uring/, mm/, ipc/ and security/ are set to go into mainline through their respective subsystems making the next release the most likely place where the final series that removes the check for proc_name == NULL will land. This adds to removals already in arch/, drivers/ and fs/. - Adjust ctl_table definitions and references to allow constification - Remove unused ctl_table function arguments - Move non-const elements from ctl_table to ctl_table_header - Make ctl_table pointers const in ctl_table_root structure Making the static ctl_table structs const will increase safety by keeping the pointers to proc_handler functions in .rodata. Though no ctl_tables where made const in this PR, the ground work for making that possible has started with these changes sent by Thomas Weißschuh. * tag 'sysctl-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: sysctl: drop now unnecessary out-of-bounds check sysctl: move sysctl type to ctl_table_header sysctl: drop sysctl_is_perm_empty_ctl_table sysctl: treewide: constify argument ctl_table_root::permissions(table) sysctl: treewide: drop unused argument ctl_table_root::set_ownership(table) bpf: Remove the now superfluous sentinel elements from ctl_table array delayacct: Remove the now superfluous sentinel elements from ctl_table array kprobes: Remove the now superfluous sentinel elements from ctl_table array printk: Remove the now superfluous sentinel elements from ctl_table array scheduler: Remove the now superfluous sentinel elements from ctl_table array seccomp: Remove the now superfluous sentinel elements from ctl_table array timekeeping: Remove the now superfluous sentinel elements from ctl_table array ftrace: Remove the now superfluous sentinel elements from ctl_table array umh: Remove the now superfluous sentinel elements from ctl_table array kernel misc: Remove the now superfluous sentinel elements from ctl_table array
2024-05-08seccomp: Constify sysctl subhelpersKees Cook
The read_actions_logged() and write_actions_logged() helpers called by the sysctl proc handler seccomp_actions_logged_handler() are already expecting their sysctl table argument to be read-only. Actually mark the argument as const in preparation[1] for global constification of the sysctl tables. Suggested-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/lkml/20240423-sysctl-const-handler-v3-11-e0beccb836e2@weissschuh.net/ [1] Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20240508171337.work.861-kees@kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2024-04-24seccomp: Remove the now superfluous sentinel elements from ctl_table arrayJoel Granados
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel element from seccomp_sysctl_table. Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Joel Granados <j.granados@samsung.com>
2023-12-12file: remove __receive_fd()Christian Brauner
Honestly, there's little value in having a helper with and without that int __user *ufd argument. It's just messy and doesn't really give us anything. Just expose receive_fd() with that argument and get rid of that helper. Link: https://lore.kernel.org/r/20231130-vfs-files-fixes-v1-5-e73ca6f4ea83@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-17seccomp: Add missing kerndoc notationsKees Cook
The kerndoc for some struct member and function arguments were missing. Add them. Cc: Andy Lutomirski <luto@amacapital.net> Cc: Will Drewry <wad@chromium.org> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202308171742.AncabIG1-lkp@intel.com/ Signed-off-by: Kees Cook <keescook@chromium.org>
2023-07-17seccomp: add the synchronous mode for seccomp_unotifyAndrei Vagin
seccomp_unotify allows more privileged processes do actions on behalf of less privileged processes. In many cases, the workflow is fully synchronous. It means a target process triggers a system call and passes controls to a supervisor process that handles the system call and returns controls to the target process. In this context, "synchronous" means that only one process is running and another one is waiting. There is the WF_CURRENT_CPU flag that is used to advise the scheduler to move the wakee to the current CPU. For such synchronous workflows, it makes context switches a few times faster. Right now, each interaction takes 12µs. With this patch, it takes about 3µs. This change introduce the SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP flag that it used to enable the sync mode. Signed-off-by: Andrei Vagin <avagin@google.com> Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Link: https://lore.kernel.org/r/20230308073201.3102738-5-avagin@google.com Signed-off-by: Kees Cook <keescook@chromium.org>
2023-07-17seccomp: don't use semaphore and wait_queue togetherAndrei Vagin
The main reason is to use new wake_up helpers that will be added in the following patches. But here are a few other reasons: * if we use two different ways, we always need to call them both. This patch fixes seccomp_notify_recv where we forgot to call wake_up_poll in the error path. * If we use one primitive, we can control how many waiters are woken up for each request. Our goal is to wake up just one that will handle a request. Right now, wake_up_poll can wake up one waiter and up(&match->notif->request) can wake up one more. Signed-off-by: Andrei Vagin <avagin@google.com> Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Link: https://lore.kernel.org/r/20230308073201.3102738-2-avagin@google.com Signed-off-by: Kees Cook <keescook@chromium.org>
2023-04-13seccomp: simplify sysctls with register_sysctl_init()Luis Chamberlain
register_sysctl_paths() is only needed if you have childs (directories) with entries. Just use register_sysctl_init() as it also does the kmemleak check for you. Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-01-13seccomp: fix kernel-doc function name warningRandy Dunlap
Move the ACTION_ONLY() macro so that it is not between the kernel-doc notation and the function definition for seccomp_run_filters(), eliminating a kernel-doc warning: kernel/seccomp.c:400: warning: expecting prototype for seccomp_run_filters(). Prototype was for ACTION_ONLY() instead Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Will Drewry <wad@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20230108021228.15975-1-rdunlap@infradead.org
2022-05-03seccomp: Add wait_killable semantic to seccomp user notifierSargun Dhillon
This introduces a per-filter flag (SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV) that makes it so that when notifications are received by the supervisor the notifying process will transition to wait killable semantics. Although wait killable isn't a set of semantics formally exposed to userspace, the concept is searchable. If the notifying process is signaled prior to the notification being received by the userspace agent, it will be handled as normal. One quirk about how this is handled is that the notifying process only switches to TASK_KILLABLE if it receives a wakeup from either an addfd or a signal. This is to avoid an unnecessary wakeup of the notifying task. The reasons behind switching into wait_killable only after userspace receives the notification are: * Avoiding unncessary work - Often, workloads will perform work that they may abort (request racing comes to mind). This allows for syscalls to be aborted safely prior to the notification being received by the supervisor. In this, the supervisor doesn't end up doing work that the workload does not want to complete anyways. * Avoiding side effects - We don't want the syscall to be interruptible once the supervisor starts doing work because it may not be trivial to reverse the operation. For example, unmounting a file system may take a long time, and it's hard to rollback, or treat that as reentrant. * Avoid breaking runtimes - Various runtimes do not GC when they are during a syscall (or while running native code that subsequently calls a syscall). If many notifications are blocked, and not picked up by the supervisor, this can get the application into a bad state. Signed-off-by: Sargun Dhillon <sargun@sargun.me> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220503080958.20220-2-sargun@sargun.me
2022-04-29seccomp: Use FIFO semantics to order notificationsSargun Dhillon
Previously, the seccomp notifier used LIFO semantics, where each notification would be added on top of the stack, and notifications were popped off the top of the stack. This could result one process that generates a large number of notifications preventing other notifications from being handled. This patch moves from LIFO (stack) semantics to FIFO (queue semantics). Signed-off-by: Sargun Dhillon <sargun@sargun.me> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20220428015447.13661-1-sargun@sargun.me
2022-03-28Merge tag 'ptrace-cleanups-for-v5.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull ptrace cleanups from Eric Biederman: "This set of changes removes tracehook.h, moves modification of all of the ptrace fields inside of siglock to remove races, adds a missing permission check to ptrace.c The removal of tracehook.h is quite significant as it has been a major source of confusion in recent years. Much of that confusion was around task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled making the semantics clearer). For people who don't know tracehook.h is a vestiage of an attempt to implement uprobes like functionality that was never fully merged, and was later superseeded by uprobes when uprobes was merged. For many years now we have been removing what tracehook functionaly a little bit at a time. To the point where anything left in tracehook.h was some weird strange thing that was difficult to understand" * tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ptrace: Remove duplicated include in ptrace.c ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE ptrace: Return the signal to continue with from ptrace_stop ptrace: Move setting/clearing ptrace_message into ptrace_stop tracehook: Remove tracehook.h resume_user_mode: Move to resume_user_mode.h resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume signal: Move set_notify_signal and clear_notify_signal into sched/signal.h task_work: Decouple TIF_NOTIFY_SIGNAL and task_work task_work: Call tracehook_notify_signal from get_signal on all architectures task_work: Introduce task_work_pending task_work: Remove unnecessary include from posix_timers.h ptrace: Remove tracehook_signal_handler ptrace: Remove arch_syscall_{enter,exit}_tracehook ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h ptrace/arm: Rename tracehook_report_syscall report_syscall ptrace: Move ptrace_report_syscall into ptrace.h
2022-03-10tracehook: Remove tracehook.hEric W. Biederman
Now that all of the definitions have moved out of tracehook.h into ptrace.h, sched/signal.h, resume_user_mode.h there is nothing left in tracehook.h so remove it. Update the few files that were depending upon tracehook.h to bring in definitions to use the headers they need directly. Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20220309162454.123006-13-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2022-02-10seccomp: Invalidate seccomp mode to catch death failuresKees Cook
If seccomp tries to kill a process, it should never see that process again. To enforce this proactively, switch the mode to something impossible. If encountered: WARN, reject all syscalls, and attempt to kill the process again even harder. Cc: Andy Lutomirski <luto@amacapital.net> Cc: Will Drewry <wad@chromium.org> Fixes: 8112c4f140fa ("seccomp: remove 2-phase API") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2021-09-01Merge branch 'exit-cleanups-for-v5.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull exit cleanups from Eric Biederman: "In preparation of doing something about PTRACE_EVENT_EXIT I have started cleaning up various pieces of code related to do_exit. Most of that code I did not manage to get tested and reviewed before the merge window opened but a handful of very useful cleanups are ready to be merged. The first change is simply the removal of the bdflush system call. The code has now been disabled long enough that even the oldest userspace working userspace setups anyone can find to test are fine with the bdflush system call being removed. Changing m68k fsp040_die to use force_sigsegv(SIGSEGV) instead of calling do_exit directly is interesting only in that it is nearly the most difficult of the incorrect uses of do_exit to remove. The change to the seccomp code to simply send a signal instead of calling do_coredump directly is a very nice little cleanup made possible by realizing the existing signal sending helpers were missing a little bit of functionality that is easy to provide" * 'exit-cleanups-for-v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: signal/seccomp: Dump core when there is only one live thread signal/seccomp: Refactor seccomp signal and coredump generation signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die exit/bdflush: Remove the deprecated bdflush system call
2021-08-26signal/seccomp: Dump core when there is only one live threadEric W. Biederman
Replace get_nr_threads with atomic_read(&current->signal->live) as that is a more accurate number that is decremented sooner. Acked-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/87lf6z6qbd.fsf_-_@disp2133 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-08-26signal/seccomp: Refactor seccomp signal and coredump generationEric W. Biederman
Factor out force_sig_seccomp from the seccomp signal generation and place it in kernel/signal.c. The function force_sig_seccomp takes a parameter force_coredump to indicate that the sigaction field should be reset to SIGDFL so that a coredump will be generated when the signal is delivered. force_sig_seccomp is then used to replace both seccomp_send_sigsys and seccomp_init_siginfo. force_sig_info_to_task gains an extra parameter to force using the default signal action. With this change seccomp is no longer a special case and there becomes exactly one place do_coredump is called from. Further it no longer becomes necessary for __seccomp_filter to call do_group_exit. Acked-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/87r1gr6qc4.fsf_-_@disp2133 Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2021-08-11seccomp: Fix setting loaded filter count during TSYNCHsuan-Chi Kuo
The desired behavior is to set the caller's filter count to thread's. This value is reported via /proc, so this fixes the inaccurate count exposed to userspace; it is not used for reference counting, etc. Signed-off-by: Hsuan-Chi Kuo <hsuanchikuo@gmail.com> Link: https://lore.kernel.org/r/20210304233708.420597-1-hsuanchikuo@gmail.com Co-developed-by: Wiktor Garbacz <wiktorg@google.com> Signed-off-by: Wiktor Garbacz <wiktorg@google.com> Link: https://lore.kernel.org/lkml/20210810125158.329849-1-wiktorg@google.com Signed-off-by: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Fixes: c818c03b661c ("seccomp: Report number of loaded filters in /proc/$pid/status")
2021-06-28seccomp: Support atomic "addfd + send reply"Rodrigo Campos
Alban Crequy reported a race condition userspace faces when we want to add some fds and make the syscall return them[1] using seccomp notify. The problem is that currently two different ioctl() calls are needed by the process handling the syscalls (agent) for another userspace process (target): SECCOMP_IOCTL_NOTIF_ADDFD to allocate the fd and SECCOMP_IOCTL_NOTIF_SEND to return that value. Therefore, it is possible for the agent to do the first ioctl to add a file descriptor but the target is interrupted (EINTR) before the agent does the second ioctl() call. This patch adds a flag to the ADDFD ioctl() so it adds the fd and returns that value atomically to the target program, as suggested by Kees Cook[2]. This is done by simply allowing seccomp_do_user_notification() to add the fd and return it in this case. Therefore, in this case the target wakes up from the wait in seccomp_do_user_notification() either to interrupt the syscall or to add the fd and return it. This "allocate an fd and return" functionality is useful for syscalls that return a file descriptor only, like connect(2). Other syscalls that return a file descriptor but not as return value (or return more than one fd), like socketpair(), pipe(), recvmsg with SCM_RIGHTs, will not work with this flag. This effectively combines SECCOMP_IOCTL_NOTIF_ADDFD and SECCOMP_IOCTL_NOTIF_SEND into an atomic opteration. The notification's return value, nor error can be set by the user. Upon successful invocation of the SECCOMP_IOCTL_NOTIF_ADDFD ioctl with the SECCOMP_ADDFD_FLAG_SEND flag, the notifying process's errno will be 0, and the return value will be the file descriptor number that was installed. [1]: https://lore.kernel.org/lkml/CADZs7q4sw71iNHmV8EOOXhUKJMORPzF7thraxZYddTZsxta-KQ@mail.gmail.com/ [2]: https://lore.kernel.org/lkml/202012011322.26DCBC64F2@keescook/ Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> Signed-off-by: Sargun Dhillon <sargun@sargun.me> Acked-by: Tycho Andersen <tycho@tycho.pizza> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210517193908.3113-4-sargun@sargun.me
2021-05-29seccomp: Refactor notification handler to prepare for new semanticsSargun Dhillon
This refactors the user notification code to have a do / while loop around the completion condition. This has a small change in semantic, in that previously we ignored addfd calls upon wakeup if the notification had been responded to, but instead with the new change we check for an outstanding addfd calls prior to returning to userspace. Rodrigo Campos also identified a bug that can result in addfd causing an early return, when the supervisor didn't actually handle the syscall [1]. [1]: https://lore.kernel.org/lkml/20210413160151.3301-1-rodrigo@kinvolk.io/ Fixes: 7cf97b125455 ("seccomp: Introduce addfd ioctl to seccomp user notifier") Signed-off-by: Sargun Dhillon <sargun@sargun.me> Acked-by: Tycho Andersen <tycho@tycho.pizza> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Kees Cook <keescook@chromium.org> Tested-by: Rodrigo Campos <rodrigo@kinvolk.io> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210517193908.3113-3-sargun@sargun.me
2021-05-03Merge branch 'work.file' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull receive_fd update from Al Viro: "Cleanup of receive_fd mess" * 'work.file' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: split receive_fd_replace from __receive_fd
2021-04-16fs: split receive_fd_replace from __receive_fdChristoph Hellwig
receive_fd_replace shares almost no code with the general case, so split it out. Also remove the "Bump the sock usage counts" comment from both copies, as that is now what __receive_sock actually does. [AV: ... and make the only user of receive_fd_replace() choose between it and receive_fd() according to what userland had passed to it in flags] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2021-03-30seccomp: Fix "cacheable" typo in commentsCui GaoSheng
Do a trivial typo fix: s/cachable/cacheable/ Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Cui GaoSheng <cuigaosheng1@huawei.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210331030724.84419-1-cuigaosheng1@huawei.com
2021-02-10seccomp: Improve performace by optimizing rmb()wanghongzhe
According to Kees's suggest, we started with the patch that just replaces rmb() with smp_rmb() and did a performance test with UnixBench. The results showed the overhead about 2.53% in rmb() test compared to the smp_rmb() one, in a x86-64 kernel with CONFIG_SMP enabled running inside a qemu-kvm vm. The test is a "syscall" testcase in UnixBench, which executes 5 syscalls in a loop during a certain timeout (100 second in our test) and counts the total number of executions of this 5-syscall sequence. We set a seccomp filter with all allow rule for all used syscalls in this test (which will go bitmap path) to make sure the rmb() will be executed. The details for the test: with rmb(): /txm # ./syscall_allow_min 100 COUNT|35861159|1|lps /txm # ./syscall_allow_min 100 COUNT|35545501|1|lps /txm # ./syscall_allow_min 100 COUNT|35664495|1|lps with smp_rmb(): /txm # ./syscall_allow_min 100 COUNT|36552771|1|lps /txm # ./syscall_allow_min 100 COUNT|36491247|1|lps /txm # ./syscall_allow_min 100 COUNT|36504746|1|lps For a x86-64 kernel with CONFIG_SMP enabled, the smp_rmb() is just a compiler barrier() which have no impact in runtime, while rmb() is a lfence which will prevent all memory access operations (not just load according the recently claim by Intel) behind itself. We can also figure it out in disassembly: with rmb(): 0000000000001430 <__seccomp_filter>: 1430: 41 57 push %r15 1432: 41 56 push %r14 1434: 41 55 push %r13 1436: 41 54 push %r12 1438: 55 push %rbp 1439: 53 push %rbx 143a: 48 81 ec 90 00 00 00 sub $0x90,%rsp 1441: 89 7c 24 10 mov %edi,0x10(%rsp) 1445: 89 54 24 14 mov %edx,0x14(%rsp) 1449: 65 48 8b 04 25 28 00 mov %gs:0x28,%rax 1450: 00 00 1452: 48 89 84 24 88 00 00 mov %rax,0x88(%rsp) 1459: 00 145a: 31 c0 xor %eax,%eax * 145c: 0f ae e8 lfence 145f: 48 85 f6 test %rsi,%rsi 1462: 49 89 f4 mov %rsi,%r12 1465: 0f 84 42 03 00 00 je 17ad <__seccomp_filter+0x37d> 146b: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax 1472: 00 00 1474: 48 8b 98 80 07 00 00 mov 0x780(%rax),%rbx 147b: 48 85 db test %rbx,%rbx with smp_rmb(); 0000000000001430 <__seccomp_filter>: 1430: 41 57 push %r15 1432: 41 56 push %r14 1434: 41 55 push %r13 1436: 41 54 push %r12 1438: 55 push %rbp 1439: 53 push %rbx 143a: 48 81 ec 90 00 00 00 sub $0x90,%rsp 1441: 89 7c 24 10 mov %edi,0x10(%rsp) 1445: 89 54 24 14 mov %edx,0x14(%rsp) 1449: 65 48 8b 04 25 28 00 mov %gs:0x28,%rax 1450: 00 00 1452: 48 89 84 24 88 00 00 mov %rax,0x88(%rsp) 1459: 00 145a: 31 c0 xor %eax,%eax 145c: 48 85 f6 test %rsi,%rsi 145f: 49 89 f4 mov %rsi,%r12 1462: 0f 84 42 03 00 00 je 17aa <__seccomp_filter+0x37a> 1468: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax 146f: 00 00 1471: 48 8b 98 80 07 00 00 mov 0x780(%rax),%rbx 1478: 48 85 db test %rbx,%rbx Signed-off-by: wanghongzhe <wanghongzhe@huawei.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/1612496049-32507-1-git-send-email-wanghongzhe@huawei.com
2021-01-11seccomp: Add missing return in non-void functionPaul Cercueil
We don't actually care about the value, since the kernel will panic before that; but a value should nonetheless be returned, otherwise the compiler will complain. Fixes: 8112c4f140fa ("seccomp: remove 2-phase API") Cc: stable@vger.kernel.org # 4.7+ Signed-off-by: Paul Cercueil <paul@crapouillou.net> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210111172839.640914-1-paul@crapouillou.net
2020-12-16Merge tag 'seccomp-v5.11-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull seccomp updates from Kees Cook: "The major change here is finally gaining seccomp constant-action bitmaps, which internally reduces the seccomp overhead for many real-world syscall filters to O(1), as discussed at Plumbers this year. - Improve seccomp performance via constant-action bitmaps (YiFei Zhu & Kees Cook) - Fix bogus __user annotations (Jann Horn) - Add missed CONFIG for improved selftest coverage (Mickaël Salaün)" * tag 'seccomp-v5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: selftests/seccomp: Update kernel config seccomp: Remove bogus __user annotations seccomp/cache: Report cache data through /proc/pid/seccomp_cache xtensa: Enable seccomp architecture tracking sh: Enable seccomp architecture tracking s390: Enable seccomp architecture tracking riscv: Enable seccomp architecture tracking powerpc: Enable seccomp architecture tracking parisc: Enable seccomp architecture tracking csky: Enable seccomp architecture tracking arm: Enable seccomp architecture tracking arm64: Enable seccomp architecture tracking selftests/seccomp: Compare bitmap vs filter overhead x86: Enable seccomp architecture tracking seccomp/cache: Add "emulator" to check if filter is constant allow seccomp/cache: Lookup syscall allowlist bitmap for fast path
2020-12-14Merge tag 'core-entry-2020-12-14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core entry/exit updates from Thomas Gleixner: "A set of updates for entry/exit handling: - More generalization of entry/exit functionality - The consolidation work to reclaim TIF flags on x86 and also for non-x86 specific TIF flags which are solely relevant for syscall related work and have been moved into their own storage space. The x86 specific part had to be merged in to avoid a major conflict. - The TIF_NOTIFY_SIGNAL work which replaces the inefficient signal delivery mode of task work and results in an impressive performance improvement for io_uring. The non-x86 consolidation of this is going to come seperate via Jens. - The selective syscall redirection facility which provides a clean and efficient way to support the non-Linux syscalls of WINE by catching them at syscall entry and redirecting them to the user space emulation. This can be utilized for other purposes as well and has been designed carefully to avoid overhead for the regular fastpath. This includes the core changes and the x86 support code. - Simplification of the context tracking entry/exit handling for the users of the generic entry code which guarantee the proper ordering and protection. - Preparatory changes to make the generic entry code accomodate S390 specific requirements which are mostly related to their syscall restart mechanism" * tag 'core-entry-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) entry: Add syscall_exit_to_user_mode_work() entry: Add exit_to_user_mode() wrapper entry_Add_enter_from_user_mode_wrapper entry: Rename exit_to_user_mode() entry: Rename enter_from_user_mode() docs: Document Syscall User Dispatch selftests: Add benchmark for syscall user dispatch selftests: Add kselftest for syscall user dispatch entry: Support Syscall User Dispatch on common syscall entry kernel: Implement selective syscall userspace redirection signal: Expose SYS_USER_DISPATCH si_code type x86: vdso: Expose sigreturn address on vdso to the kernel MAINTAINERS: Add entry for common entry code entry: Fix boot for !CONFIG_GENERIC_ENTRY x86: Support HAVE_CONTEXT_TRACKING_OFFSTACK context_tracking: Only define schedule_user() on !HAVE_CONTEXT_TRACKING_OFFSTACK archs sched: Detect call to schedule from critical entry code context_tracking: Don't implement exception_enter/exit() on CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK context_tracking: Introduce HAVE_CONTEXT_TRACKING_OFFSTACK x86: Reclaim unused x86 TI flags ...
2020-11-20Merge branch 'for-linus/seccomp' into for-next/seccompKees Cook
2020-11-20seccomp: Remove bogus __user annotationsJann Horn
Buffers that are passed to read_actions_logged() and write_actions_logged() are in kernel memory; the sysctl core takes care of copying from/to userspace. Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler") Reviewed-by: Tyler Hicks <code@tyhicks.com> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20201120170545.1419332-1-jannh@google.com
2020-11-20seccomp/cache: Report cache data through /proc/pid/seccomp_cacheYiFei Zhu
Currently the kernel does not provide an infrastructure to translate architecture numbers to a human-readable name. Translating syscall numbers to syscall names is possible through FTRACE_SYSCALL infrastructure but it does not provide support for compat syscalls. This will create a file for each PID as /proc/pid/seccomp_cache. The file will be empty when no seccomp filters are loaded, or be in the format of: <arch name> <decimal syscall number> <ALLOW | FILTER> where ALLOW means the cache is guaranteed to allow the syscall, and filter means the cache will pass the syscall to the BPF filter. For the docker default profile on x86_64 it looks like: x86_64 0 ALLOW x86_64 1 ALLOW x86_64 2 ALLOW x86_64 3 ALLOW [...] x86_64 132 ALLOW x86_64 133 ALLOW x86_64 134 FILTER x86_64 135 FILTER x86_64 136 FILTER x86_64 137 ALLOW x86_64 138 ALLOW x86_64 139 FILTER x86_64 140 ALLOW x86_64 141 ALLOW [...] This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default of N because I think certain users of seccomp might not want the application to know which syscalls are definitely usable. For the same reason, it is also guarded by CAP_SYS_ADMIN. Suggested-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/ Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/94e663fa53136f5a11f432c661794d1ee7060779.1605101222.git.yifeifz2@illinois.edu
2020-11-20seccomp/cache: Add "emulator" to check if filter is constant allowYiFei Zhu
SECCOMP_CACHE will only operate on syscalls that do not access any syscall arguments or instruction pointer. To facilitate this we need a static analyser to know whether a filter will return allow regardless of syscall arguments for a given architecture number / syscall number pair. This is implemented here with a pseudo-emulator, and stored in a per-filter bitmap. In order to build this bitmap at filter attach time, each filter is emulated for every syscall (under each possible architecture), and checked for any accesses of struct seccomp_data that are not the "arch" nor "nr" (syscall) members. If only "arch" and "nr" are examined, and the program returns allow, then we can be sure that the filter must return allow independent from syscall arguments. Nearly all seccomp filters are built from these cBPF instructions: BPF_LD | BPF_W | BPF_ABS BPF_JMP | BPF_JEQ | BPF_K BPF_JMP | BPF_JGE | BPF_K BPF_JMP | BPF_JGT | BPF_K BPF_JMP | BPF_JSET | BPF_K BPF_JMP | BPF_JA BPF_RET | BPF_K BPF_ALU | BPF_AND | BPF_K Each of these instructions are emulated. Any weirdness or loading from a syscall argument will cause the emulator to bail. The emulation is also halted if it reaches a return. In that case, if it returns an SECCOMP_RET_ALLOW, the syscall is marked as good. Emulator structure and comments are from Kees [1] and Jann [2]. Emulation is done at attach time. If a filter depends on more filters, and if the dependee does not guarantee to allow the syscall, then we skip the emulation of this syscall. [1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/ [2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/ Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> Reviewed-by: Jann Horn <jannh@google.com> Co-developed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/71c7be2db5ee08905f41c3be5c1ad6e2601ce88f.1602431034.git.yifeifz2@illinois.edu
2020-11-20seccomp/cache: Lookup syscall allowlist bitmap for fast pathYiFei Zhu
The overhead of running Seccomp filters has been part of some past discussions [1][2][3]. Oftentimes, the filters have a large number of instructions that check syscall numbers one by one and jump based on that. Some users chain BPF filters which further enlarge the overhead. A recent work [6] comprehensively measures the Seccomp overhead and shows that the overhead is non-negligible and has a non-trivial impact on application performance. We observed some common filters, such as docker's [4] or systemd's [5], will make most decisions based only on the syscall numbers, and as past discussions considered, a bitmap where each bit represents a syscall makes most sense for these filters. The fast (common) path for seccomp should be that the filter permits the syscall to pass through, and failing seccomp is expected to be an exceptional case; it is not expected for userspace to call a denylisted syscall over and over. When it can be concluded that an allow must occur for the given architecture and syscall pair (this determination is introduced in the next commit), seccomp will immediately allow the syscall, bypassing further BPF execution. Each architecture number has its own bitmap. The architecture number in seccomp_data is checked against the defined architecture number constant before proceeding to test the bit against the bitmap with the syscall number as the index of the bit in the bitmap, and if the bit is set, seccomp returns allow. The bitmaps are all clear in this patch and will be initialized in the next commit. When only one architecture exists, the check against architecture number is skipped, suggested by Kees Cook [7]. [1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/ [2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/ [3] https://github.com/seccomp/libseccomp/issues/116 [4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json [5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270 [6] Draco: Architectural and Operating System Support for System Call Security https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020 [7] https://lore.kernel.org/bpf/202010091614.8BB0EB64@keescook/ Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu> Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu> Reviewed-by: Jann Horn <jannh@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/10f91a367ec4fcdea7fc3f086de3f5f13a4a7436.1602431034.git.yifeifz2@illinois.edu
2020-11-17seccomp: Set PF_SUPERPRIV when checking capabilityMickaël Salaün
Replace the use of security_capable(current_cred(), ...) with ns_capable_noaudit() which set PF_SUPERPRIV. Since commit 98f368e9e263 ("kernel: Add noaudit variant of ns_capable()"), a new ns_capable_noaudit() helper is available. Let's use it! Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Tyler Hicks <tyhicks@linux.microsoft.com> Cc: Will Drewry <wad@chromium.org> Cc: stable@vger.kernel.org Fixes: e2cfabdfd075 ("seccomp: add system call filtering using BPF") Signed-off-by: Mickaël Salaün <mic@linux.microsoft.com> Reviewed-by: Jann Horn <jannh@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20201030123849.770769-3-mic@digikod.net
2020-11-16seccomp: Migrate to use SYSCALL_WORK flagGabriel Krisman Bertazi
On architectures using the generic syscall entry code the architecture independent syscall work is moved to flags in thread_info::syscall_work. This removes architecture dependencies and frees up TIF bits. Define SYSCALL_WORK_SECCOMP, use it in the generic entry code and convert the code which uses the TIF specific helper functions to use the new *_syscall_work() helpers which either resolve to the new mode for users of the generic entry code or to the TIF based functions for the other architectures. Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Andy Lutomirski <luto@kernel.org> Link: https://lore.kernel.org/r/20201116174206.2639648-5-krisman@collabora.com
2020-10-08seccomp: Make duplicate listener detection non-racyJann Horn
Currently, init_listener() tries to prevent adding a filter with SECCOMP_FILTER_FLAG_NEW_LISTENER if one of the existing filters already has a listener. However, this check happens without holding any lock that would prevent another thread from concurrently installing a new filter (potentially with a listener) on top of the ones we already have. Theoretically, this is also a data race: The plain load from current->seccomp.filter can race with concurrent writes to the same location. Fix it by moving the check into the region that holds the siglock to guard against concurrent TSYNC. (The "Fixes" tag points to the commit that introduced the theoretical data race; concurrent installation of another filter with TSYNC only became possible later, in commit 51891498f2da ("seccomp: allow TSYNC and USER_NOTIF together").) Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace") Reviewed-by: Tycho Andersen <tycho@tycho.pizza> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20201005014401.490175-1-jannh@google.com
2020-09-08seccomp: Use current_pt_regs() instead of task_pt_regs(current)Denis Efremov
As described in commit a3460a59747c ("new helper: current_pt_regs()"): - arch versions are "optimized versions". - some architectures have task_pt_regs() working only for traced tasks blocked on signal delivery. current_pt_regs() needs to work for *all* processes. In preparation for adding a coccinelle rule for using current_*(), instead of raw accesses to current members, modify seccomp_do_user_notification(), __seccomp_filter(), __secure_computing() to use current_pt_regs(). Signed-off-by: Denis Efremov <efremov@linux.com> Link: https://lore.kernel.org/r/20200824125921.488311-1-efremov@linux.com [kees: Reworded commit log, add comment to populate_seccomp_data()] Signed-off-by: Kees Cook <keescook@chromium.org>
2020-09-08seccomp: kill process instead of thread for unknown actionsRich Felker
Asynchronous termination of a thread outside of the userspace thread library's knowledge is an unsafe operation that leaves the process in an inconsistent, corrupt, and possibly unrecoverable state. In order to make new actions that may be added in the future safe on kernels not aware of them, change the default action from SECCOMP_RET_KILL_THREAD to SECCOMP_RET_KILL_PROCESS. Signed-off-by: Rich Felker <dalias@libc.org> Link: https://lore.kernel.org/r/20200829015609.GA32566@brightrain.aerifal.cx [kees: Fixed up coredump selection logic to match] Signed-off-by: Kees Cook <keescook@chromium.org>
2020-09-08seccomp: don't leave dangling ->notif if file allocation failsTycho Andersen
Christian and Kees both pointed out that this is a bit sloppy to open-code both places, and Christian points out that we leave a dangling pointer to ->notif if file allocation fails. Since we check ->notif for null in order to determine if it's ok to install a filter, this means people won't be able to install a filter if the file allocation fails for some reason, even if they subsequently should be able to. To fix this, let's hoist this free+null into its own little helper and use it. Reported-by: Kees Cook <keescook@chromium.org> Reported-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Tycho Andersen <tycho@tycho.pizza> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20200902140953.1201956-1-tycho@tycho.pizza Signed-off-by: Kees Cook <keescook@chromium.org>
2020-09-08seccomp: don't leak memory when filter install racesTycho Andersen
In seccomp_set_mode_filter() with TSYNC | NEW_LISTENER, we first initialize the listener fd, then check to see if we can actually use it later in seccomp_may_assign_mode(), which can fail if anyone else in our thread group has installed a filter and caused some divergence. If we can't, we partially clean up the newly allocated file: we put the fd, put the file, but don't actually clean up the *memory* that was allocated at filter->notif. Let's clean that up too. To accomplish this, let's hoist the actual "detach a notifier from a filter" code to its own helper out of seccomp_notify_release(), so that in case anyone adds stuff to init_listener(), they only have to add the cleanup code in one spot. This does a bit of extra locking and such on the failure path when the filter is not attached, but it's a slow failure path anyway. Fixes: 51891498f2da ("seccomp: allow TSYNC and USER_NOTIF together") Reported-by: syzbot+3ad9614a12f80994c32e@syzkaller.appspotmail.com Signed-off-by: Tycho Andersen <tycho@tycho.pizza> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20200902014017.934315-1-tycho@tycho.pizza Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-14seccomp: Introduce addfd ioctl to seccomp user notifierSargun Dhillon
The current SECCOMP_RET_USER_NOTIF API allows for syscall supervision over an fd. It is often used in settings where a supervising task emulates syscalls on behalf of a supervised task in userspace, either to further restrict the supervisee's syscall abilities or to circumvent kernel enforced restrictions the supervisor deems safe to lift (e.g. actually performing a mount(2) for an unprivileged container). While SECCOMP_RET_USER_NOTIF allows for the interception of any syscall, only a certain subset of syscalls could be correctly emulated. Over the last few development cycles, the set of syscalls which can't be emulated has been reduced due to the addition of pidfd_getfd(2). With this we are now able to, for example, intercept syscalls that require the supervisor to operate on file descriptors of the supervisee such as connect(2). However, syscalls that cause new file descriptors to be installed can not currently be correctly emulated since there is no way for the supervisor to inject file descriptors into the supervisee. This patch adds a new addfd ioctl to remove this restriction by allowing the supervisor to install file descriptors into the intercepted task. By implementing this feature via seccomp the supervisor effectively instructs the supervisee to install a set of file descriptors into its own file descriptor table during the intercepted syscall. This way it is possible to intercept syscalls such as open() or accept(), and install (or replace, like dup2(2)) the supervisor's resulting fd into the supervisee. One replacement use-case would be to redirect the stdout and stderr of a supervisee into log file descriptors opened by the supervisor. The ioctl handling is based on the discussions[1] of how Extensible Arguments should interact with ioctls. Instead of building size into the addfd structure, make it a function of the ioctl command (which is how sizes are normally passed to ioctls). To support forward and backward compatibility, just mask out the direction and size, and match everything. The size (and any future direction) checks are done along with copy_struct_from_user() logic. As a note, the seccomp_notif_addfd structure is laid out based on 8-byte alignment without requiring packing as there have been packing issues with uapi highlighted before[2][3]. Although we could overload the newfd field and use -1 to indicate that it is not to be used, doing so requires changing the size of the fd field, and introduces struct packing complexity. [1]: https://lore.kernel.org/lkml/87o8w9bcaf.fsf@mid.deneb.enyo.de/ [2]: https://lore.kernel.org/lkml/a328b91d-fd8f-4f27-b3c2-91a9c45f18c0@rasmusvillemoes.dk/ [3]: https://lore.kernel.org/lkml/20200612104629.GA15814@ircssh-2.c.rugged-nimbus-611.internal Cc: Christoph Hellwig <hch@lst.de> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Tycho Andersen <tycho@tycho.ws> Cc: Jann Horn <jannh@google.com> Cc: Robert Sesek <rsesek@google.com> Cc: Chris Palmer <palmer@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-api@vger.kernel.org Suggested-by: Matt Denton <mpdenton@google.com> Link: https://lore.kernel.org/r/20200603011044.7972-4-sargun@sargun.me Signed-off-by: Sargun Dhillon <sargun@sargun.me> Reviewed-by: Will Drewry <wad@chromium.org> Co-developed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-10seccomp: Use -1 marker for end of mode 1 syscall listKees Cook
The terminator for the mode 1 syscalls list was a 0, but that could be a valid syscall number (e.g. x86_64 __NR_read). By luck, __NR_read was listed first and the loop construct would not test it, so there was no bug. However, this is fragile. Replace the terminator with -1 instead, and make the variable name for mode 1 syscall lists more descriptive. Cc: Andy Lutomirski <luto@amacapital.net> Cc: Will Drewry <wad@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-10seccomp: Fix ioctl number for SECCOMP_IOCTL_NOTIF_ID_VALIDKees Cook
When SECCOMP_IOCTL_NOTIF_ID_VALID was first introduced it had the wrong direction flag set. While this isn't a big deal as nothing currently enforces these bits in the kernel, it should be defined correctly. Fix the define and provide support for the old command until it is no longer needed for backward compatibility. Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace") Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-10seccomp: Use pr_fmtKees Cook
Avoid open-coding "seccomp: " prefixes for pr_*() calls. Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-10seccomp: notify about unused filterChristian Brauner
We've been making heavy use of the seccomp notifier to intercept and handle certain syscalls for containers. This patch allows a syscall supervisor listening on a given notifier to be notified when a seccomp filter has become unused. A container is often managed by a singleton supervisor process the so-called "monitor". This monitor process has an event loop which has various event handlers registered. If the user specified a seccomp profile that included a notifier for various syscalls then we also register a seccomp notify even handler. For any container using a separate pid namespace the lifecycle of the seccomp notifier is bound to the init process of the pid namespace, i.e. when the init process exits the filter must be unused. If a new process attaches to a container we force it to assume a seccomp profile. This can either be the same seccomp profile as the container was started with or a modified one. If the attaching process makes use of the seccomp notifier we will register a new seccomp notifier handler in the monitor's event loop. However, when the attaching process exits we can't simply delete the handler since other child processes could've been created (daemons spawned etc.) that have inherited the seccomp filter and so we need to keep the seccomp notifier fd alive in the event loop. But this is problematic since we don't get a notification when the seccomp filter has become unused and so we currently never remove the seccomp notifier fd from the event loop and just keep accumulating fds in the event loop. We've had this issue for a while but it has recently become more pressing as more and larger users make use of this. To fix this, we introduce a new "users" reference counter that tracks any tasks and dependent filters making use of a filter. When a notifier is registered waiting tasks will be notified that the filter is now empty by receiving a (E)POLLHUP event. The concept in this patch introduces is the same as for signal_struct, i.e. reference counting for life-cycle management is decoupled from reference counting taks using the object. There's probably some trickery possible but the second counter is just the correct way of doing this IMHO and has precedence. Cc: Tycho Andersen <tycho@tycho.ws> Cc: Kees Cook <keescook@chromium.org> Cc: Matt Denton <mpdenton@google.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Jann Horn <jannh@google.com> Cc: Chris Palmer <palmer@google.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Robert Sesek <rsesek@google.com> Cc: Jeffrey Vander Stoep <jeffv@google.com> Cc: Linux Containers <containers@lists.linux-foundation.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20200531115031.391515-3-christian.brauner@ubuntu.com Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-10seccomp: Lift wait_queue into struct seccomp_filterChristian Brauner
Lift the wait_queue from struct notification into struct seccomp_filter. This is cleaner overall and lets us avoid having to take the notifier mutex in the future for EPOLLHUP notifications since we need to neither read nor modify the notifier specific aspects of the seccomp filter. In the exit path I'd very much like to avoid having to take the notifier mutex for each filter in the task's filter hierarchy. Cc: Tycho Andersen <tycho@tycho.ws> Cc: Kees Cook <keescook@chromium.org> Cc: Matt Denton <mpdenton@google.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Jann Horn <jannh@google.com> Cc: Chris Palmer <palmer@google.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Robert Sesek <rsesek@google.com> Cc: Jeffrey Vander Stoep <jeffv@google.com> Cc: Linux Containers <containers@lists.linux-foundation.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-10seccomp: release filter after task is fully deadChristian Brauner
The seccomp filter used to be released in free_task() which is called asynchronously via call_rcu() and assorted mechanisms. Since we need to inform tasks waiting on the seccomp notifier when a filter goes empty we will notify them as soon as a task has been marked fully dead in release_task(). To not split seccomp cleanup into two parts, move filter release out of free_task() and into release_task() after we've unhashed struct task from struct pid, exited signals, and unlinked it from the threadgroups' thread list. We'll put the empty filter notification infrastructure into it in a follow up patch. This also renames put_seccomp_filter() to seccomp_filter_release() which is a more descriptive name of what we're doing here especially once we've added the empty filter notification mechanism in there. We're also NULL-ing the task's filter tree entrypoint which seems cleaner than leaving a dangling pointer in there. Note that this shouldn't need any memory barriers since we're calling this when the task is in release_task() which means it's EXIT_DEAD. So it can't modify its seccomp filters anymore. You can also see this from the point where we're calling seccomp_filter_release(). It's after __exit_signal() and at this point, tsk->sighand will already have been NULLed which is required for thread-sync and filter installation alike. Cc: Tycho Andersen <tycho@tycho.ws> Cc: Kees Cook <keescook@chromium.org> Cc: Matt Denton <mpdenton@google.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Jann Horn <jannh@google.com> Cc: Chris Palmer <palmer@google.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Robert Sesek <rsesek@google.com> Cc: Jeffrey Vander Stoep <jeffv@google.com> Cc: Linux Containers <containers@lists.linux-foundation.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20200531115031.391515-2-christian.brauner@ubuntu.com Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-10seccomp: rename "usage" to "refs" and documentChristian Brauner
Naming the lifetime counter of a seccomp filter "usage" suggests a little too strongly that its about tasks that are using this filter while it also tracks other references such as the user notifier or ptrace. This also updates the documentation to note this fact. We'll be introducing an actual usage counter in a follow-up patch. Cc: Tycho Andersen <tycho@tycho.ws> Cc: Kees Cook <keescook@chromium.org> Cc: Matt Denton <mpdenton@google.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Jann Horn <jannh@google.com> Cc: Chris Palmer <palmer@google.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Robert Sesek <rsesek@google.com> Cc: Jeffrey Vander Stoep <jeffv@google.com> Cc: Linux Containers <containers@lists.linux-foundation.org> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20200531115031.391515-1-christian.brauner@ubuntu.com Signed-off-by: Kees Cook <keescook@chromium.org>