path: root/kernel
Age  Commit message  Author
2021-04-20  Merge tag 'v5.12-rc8' into sched/core, to pick up fixes  (Ingo Molnar)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-04-17  Merge tag 'net-5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Linus Torvalds)
Pull networking fixes from Jakub Kicinski:
 "Networking fixes for 5.12-rc8, including fixes from netfilter, and
  bpf. BPF verifier changes stand out, otherwise things have slowed
  down.

  Current release - regressions:
   - gro: ensure frag0 meets IP header alignment
   - Revert "net: stmmac: re-init rx buffers when mac resume back"
   - ethernet: macb: fix the restore of cmp registers

  Previous releases - regressions:
   - ixgbe: Fix NULL pointer dereference in ethtool loopback test
   - ixgbe: fix unbalanced device enable/disable in suspend/resume
   - phy: marvell: fix detection of PHY on Topaz switches
   - make tcp_allowed_congestion_control readonly in non-init netns
   - xen-netback: Check for hotplug-status existence before watching

  Previous releases - always broken:
   - bpf: mitigate a speculative oob read of up to map value size by
     tightening the masking window
   - sctp: fix race condition in sctp_destroy_sock
   - sit, ip6_tunnel: Unregister catch-all devices
   - netfilter: nftables: clone set element expression template
   - netfilter: flowtable: fix NAT IPv6 offload mangling
   - net: geneve: check skb is large enough for IPv4/IPv6 header
   - netlink: don't call ->netlink_bind with table lock held"

* tag 'net-5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (52 commits)
  netlink: don't call ->netlink_bind with table lock held
  MAINTAINERS: update my email
  bpf: Update selftests to reflect new error states
  bpf: Tighten speculative pointer arithmetic mask
  bpf: Move sanitize_val_alu out of op switch
  bpf: Refactor and streamline bounds check into helper
  bpf: Improve verifier error messages for users
  bpf: Rework ptr_limit into alu_limit and add common error path
  bpf: Ensure off_reg has no mixed signed bounds for all types
  bpf: Move off_reg into sanitize_ptr_alu
  bpf: Use correct permission flag for mixed signed bounds arithmetic
  ch_ktls: do not send snd_una update to TCB in middle
  ch_ktls: tcb close causes tls connection failure
  ch_ktls: fix device connection close
  ch_ktls: Fix kernel panic
  i40e: fix the panic when running bpf in xdpdrv mode
  net/mlx5e: fix ingress_ifindex check in mlx5e_flower_parse_meta
  net/mlx5e: Fix setting of RS FEC mode
  net/mlx5: Fix setting of devlink traps in switchdev mode
  Revert "net: stmmac: re-init rx buffers when mac resume back"
  ...
2021-04-17  sched/debug: Rename the sched_debug parameter to sched_verbose  (Peter Zijlstra)
CONFIG_SCHED_DEBUG is the build-time Kconfig knob; the boot param sched_debug and the /debug/sched/debug_enabled knob control the sched_debug_enabled variable. What they all really do is make SCHED_DEBUG more verbose, so rename the lot. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2021-04-16  gcov: clang: fix clang-11+ build  (Johannes Berg)
With clang-11+, the code is broken due to my kvmalloc() conversion (which predated the clang-11 support code) leaving one vmalloc() in place. Fix that. Link: https://lkml.kernel.org/r/20210412214210.6e1ecca9cdc5.I24459763acf0591d5e6b31c7e3a59890d802f79c@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-16  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf  (David S. Miller)
Daniel Borkmann says: ==================== pull-request: bpf 2021-04-17 The following pull-request contains BPF updates for your *net* tree. We've added 10 non-merge commits during the last 9 day(s) which contain a total of 8 files changed, 175 insertions(+), 111 deletions(-). The main changes are: 1) Fix a potential NULL pointer dereference in libbpf's xsk umem handling, from Ciara Loftus. 2) Mitigate a speculative oob read of up to map value size by tightening the masking window, from Daniel Borkmann. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-16  bpf: Tighten speculative pointer arithmetic mask  (Daniel Borkmann)
This work tightens the offset mask we use for unprivileged pointer arithmetic in order to mitigate a corner case reported by Piotr and Benedict where in the speculative domain it is possible to advance, for example, the map value pointer by up to value_size-1 out-of-bounds in order to leak kernel memory via side-channel to user space. Before this change, the computed ptr_limit for the retrieve_ptr_limit() helper represented the largest valid distance when moving the pointer to the right or left, which was then fed as aux->alu_limit to generate masking instructions against the offset register. After the change, the derived aux->alu_limit represents the largest potential value of the offset register which we mask against, which is just a narrower subset of the former limit. For minimal complexity, we call sanitize_ptr_alu() from 2 observation points in adjust_ptr_min_max_vals(), that is, before and after the simulated alu operation. In the first step, we retrieve the alu_state and alu_limit before the operation, and branch off a verifier path and push it to the verification stack as we did before, which checks the dst_reg under truncation, in other words, when the speculative domain would attempt to move the pointer out-of-bounds. In the second step, we retrieve the new alu_limit and calculate the absolute distance between both. Moreover, we commit the alu_state and final alu_limit via update_alu_sanitation_state() to the env's instruction aux data, and bail out from there if there is a mismatch due to coming from different verification paths with different states. Reported-by: Piotr Krysiuk <piotras@gmail.com> Reported-by: Benedict Schlueter <benedict.schlueter@rub.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Tested-by: Benedict Schlueter <benedict.schlueter@rub.de>
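The limit computed this way feeds the runtime masking sequence the verifier patches in front of the pointer ALU operation. For illustration, a sketch of that sequence, following the sanitation logic introduced back in commit 979d63d50c0c (BPF_REG_AX is the verifier's scratch register; with the new semantics, alu_limit bounds the offset register's value directly):

  /* If off_reg exceeds alu_limit (or has the wrong sign), the
   * arithmetic below yields an all-zeroes mask and the final AND
   * forces the offset to 0 even in the speculative domain;
   * otherwise the mask is all-ones and the offset is unchanged.
   */
  *patch++ = BPF_MOV32_IMM(BPF_REG_AX, aux->alu_limit);
  *patch++ = BPF_ALU64_REG(BPF_SUB, BPF_REG_AX, off_reg);
  *patch++ = BPF_ALU64_REG(BPF_OR,  BPF_REG_AX, off_reg);
  *patch++ = BPF_ALU64_IMM(BPF_NEG, BPF_REG_AX, 0);
  *patch++ = BPF_ALU64_IMM(BPF_ARSH, BPF_REG_AX, 63);
  *patch++ = BPF_ALU64_REG(BPF_AND, BPF_REG_AX, off_reg);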
2021-04-16  bpf: Move sanitize_val_alu out of op switch  (Daniel Borkmann)
Add a small sanitize_needed() helper function and move sanitize_val_alu() out of the main opcode switch. In upcoming work, we'll move sanitize_ptr_alu() as well out of its opcode switch so this helps to streamline both. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-04-16  bpf: Refactor and streamline bounds check into helper  (Daniel Borkmann)
Move the bounds check in adjust_ptr_min_max_vals() into a small helper named sanitize_check_bounds() in order to simplify the former a bit. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-04-16  bpf: Improve verifier error messages for users  (Daniel Borkmann)
Consolidate all error handling and provide more user-friendly error messages from sanitize_ptr_alu() and sanitize_val_alu(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-04-16  bpf: Rework ptr_limit into alu_limit and add common error path  (Daniel Borkmann)
Small refactor with no semantic changes in order to consolidate the max ptr_limit boundary check. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-04-16  bpf: Ensure off_reg has no mixed signed bounds for all types  (Daniel Borkmann)
The mixed signed bounds check really belongs into retrieve_ptr_limit() instead of outside of it in adjust_ptr_min_max_vals(). The reason is that this check is not tied to PTR_TO_MAP_VALUE only, but to all pointer types that we handle in retrieve_ptr_limit() and given errors from the latter propagate back to adjust_ptr_min_max_vals() and lead to rejection of the program, it's a better place to reside to avoid anything slipping through for future types. The reason why we must reject such off_reg is that we otherwise would not be able to derive a mask, see details in 9d7eceede769 ("bpf: restrict unknown scalars of mixed signed bounds for unprivileged"). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org>
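For reference, a minimal sketch of what "mixed signed bounds" means in terms of the verifier's register state (smin_value/smax_value are the s64 bounds in struct bpf_reg_state; the helper name here is illustrative, not from the patch):

  /* A register whose signed range straddles zero: neither a pure
   * upper nor a pure lower distance can be derived from it, so no
   * safe mask exists and the program must be rejected.
   */
  static bool off_reg_has_mixed_signed_bounds(const struct bpf_reg_state *reg)
  {
      return reg->smin_value < 0 && reg->smax_value >= 0;
  }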
2021-04-16  bpf: Move off_reg into sanitize_ptr_alu  (Daniel Borkmann)
Small refactor to drag off_reg into sanitize_ptr_alu(), so we later on can use off_reg for generalizing some of the checks for all pointer types. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-04-16  bpf: Use correct permission flag for mixed signed bounds arithmetic  (Daniel Borkmann)
We forbid adding unknown scalars with mixed signed bounds due to the spectre v1 masking mitigation. Hence this also needs bypass_spec_v1 flag instead of allow_ptr_leaks. Fixes: 2c78ee898d8f ("bpf: Implement CAP_BPF") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org>
2021-04-16  sched,fair: Alternative sched_slice()  (Peter Zijlstra)
The current sched_slice() seems to have issues; there are two things that could be improved:

 - the 'nr_running' used for __sched_period() is daft when cgroups are
   considered. Using the RQ-wide h_nr_running seems like a much more
   consistent number.

 - (especially) cgroups can slice it real fine, which makes for easy
   over-scheduling; ensure min_gran is what the name says.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210412102001.611897312@infradead.org
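A sketch of the resulting shape, under the assumption that the change is gated behind two sched_feat() knobs (ALT_PERIOD, BASE_SLICE) as we read the patch; the per-cgroup weight scaling loop of the real sched_slice() is elided:

  static u64 sched_slice_sketch(struct cfs_rq *cfs_rq, struct sched_entity *se)
  {
      unsigned int nr_running = cfs_rq->nr_running;
      u64 slice;

      /* Size the period from the RQ-wide hierarchical task count. */
      if (sched_feat(ALT_PERIOD))
          nr_running = rq_of(cfs_rq)->cfs.h_nr_running;

      slice = __sched_period(nr_running + !se->on_rq);

      /* ... weight-based scaling over the se hierarchy elided ... */

      /* Make min_gran a true lower bound on the slice. */
      if (sched_feat(BASE_SLICE))
          slice = max(slice, (u64)sysctl_sched_min_granularity);

      return slice;
  }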
2021-04-16  sched: Move /proc/sched_debug to debugfs  (Peter Zijlstra)
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210412102001.548833671@infradead.org
2021-04-16  sched,debug: Convert sysctl sched_domains to debugfs  (Peter Zijlstra)
Stop polluting sysctl, move to debugfs for SCHED_DEBUG stuff. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/YHgB/s4KCBQ1ifdm@hirez.programming.kicks-ass.net
2021-04-16  sched,preempt: Move preempt_dynamic to debug.c  (Peter Zijlstra)
Move the #ifdef SCHED_DEBUG bits to kernel/sched/debug.c in order to collect all the debugfs bits. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210412102001.353833279@infradead.org
2021-04-16  sched: Move SCHED_DEBUG sysctl to debugfs  (Peter Zijlstra)
Stop polluting sysctl with undocumented knobs that really are debug only, move them all to /debug/sched/ along with the existing /debug/sched_* files that already exist. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210412102001.287610138@infradead.org
2021-04-16  sched: Remove sched_schedstats sysctl out from under SCHED_DEBUG  (Peter Zijlstra)
CONFIG_SCHEDSTATS does not depend on SCHED_DEBUG, it is inconsistent to have the sysctl depend on it. Suggested-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210412102001.161151631@infradead.org
2021-04-16  sched/numa: Allow runtime enabling/disabling of NUMA balance without SCHED_DEBUG  (Mel Gorman)
The ability to enable/disable NUMA balancing is not a debugging feature and should not depend on CONFIG_SCHED_DEBUG. For example, machines within an HPC cluster may disable NUMA balancing temporarily for some jobs and re-enable it for other jobs without needing to reboot. This patch removes the dependency on CONFIG_SCHED_DEBUG for the kernel.numa_balancing sysctl. The other NUMA-balancing-related sysctls are left as-is because if they need to be tuned then it is more likely that NUMA balancing needs to be fixed instead. Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210324133916.GQ15768@suse.de
2021-04-16  sched: Use cpu_dying() to fix balance_push vs hotplug-rollback  (Peter Zijlstra)
Use the new cpu_dying() state to simplify and fix the balance_push() vs CPU hotplug rollback state. Specifically, we currently rely on the notifiers sched_cpu_dying() / sched_cpu_activate() to terminate balance_push; however, if cpu_down() fails when we're past sched_cpu_deactivate(), it should terminate balance_push at that point and not wait until we hit sched_cpu_activate(). Similarly, when cpu_up() fails and we're going back down, balance_push should be active, where it currently is not. So instead, make sure balance_push is enabled below SCHED_AP_ACTIVE (when !cpu_active()), and gate its utility with cpu_dying(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/YHgAYef83VQhKdC2@hirez.programming.kicks-ass.net
2021-04-16  cpumask: Introduce DYING mask  (Peter Zijlstra)
Introduce a cpumask that indicates (for each CPU) what direction the CPU hotplug is currently going. Notably, it tracks rollbacks. E.g. when an up fails and we do a roll-back down, it will accurately reflect the direction. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210310150109.151441252@infradead.org
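A sketch of the accessor this provides, assuming the same declaration pattern as cpu_active_mask and friends (the exact declaration details are assumptions):

  extern struct cpumask __cpu_dying_mask;
  #define cpu_dying_mask ((const struct cpumask *)&__cpu_dying_mask)

  /* True while a CPU is on its way down, including a rollback
   * down after a failed bring-up.
   */
  static inline bool cpu_dying(unsigned int cpu)
  {
      return cpumask_test_cpu(cpu, cpu_dying_mask);
  }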
2021-04-14  rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs()  (Eric Dumazet)
Commit ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union, update includes") added regressions for our servers. Using copy_from_user() and clear_user() for 64-bit values is suboptimal. We can use the faster put_user() and get_user() on 64-bit arches. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lkml.kernel.org/r/20210413203352.71350-4-eric.dumazet@gmail.com
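A sketch of the pattern (rseq_cs.ptr64 being the 64-bit member of the uapi union mentioned in the referenced commit; error handling trimmed):

  /* One fixed-size faulting load instead of a copy_from_user() call. */
  u64 ptr;

  if (get_user(ptr, &t->rseq->rseq_cs.ptr64))
      return -EFAULT;

  /* ... and clearing it: a single store instead of clear_user(). */
  if (put_user(0ULL, &t->rseq->rseq_cs.ptr64))
      return -EFAULT;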
2021-04-14  rseq: Remove redundant access_ok()  (Eric Dumazet)
After commit 8f2817701492 ("rseq: Use get_user/put_user rather than __get_user/__put_user") we no longer need an access_ok() call from __rseq_handle_notify_resume(). Mathieu pointed out that the same cleanup can be done in rseq_syscall(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lkml.kernel.org/r/20210413203352.71350-3-eric.dumazet@gmail.com
2021-04-14  rseq: Optimize rseq_update_cpu_id()  (Eric Dumazet)
The two put_user() calls in rseq_update_cpu_id() are replaced by a pair of unsafe_put_user() calls with appropriate surroundings. This removes one stac/clac pair on x86 in the fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lkml.kernel.org/r/20210413203352.71350-2-eric.dumazet@gmail.com
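A sketch of the pattern, assuming the user_access_begin() family as in mainline (CONFIG_DEBUG_RSEQ details omitted). One user-access window covers both stores, so x86 executes one stac/clac pair rather than two:

  if (!user_write_access_begin(rseq, sizeof(*rseq)))
      goto efault;
  unsafe_put_user(cpu_id, &rseq->cpu_id_start, efault_end);
  unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
  user_write_access_end();
  return 0;

  efault_end:
      user_write_access_end();
  efault:
      return -EFAULT;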
2021-04-14  signal: Allow tasks to cache one sigqueue struct  (Thomas Gleixner)
The idea for this originates from the real-time tree, to make signal delivery for real-time applications more efficient. In quite some of these application scenarios a control task signals workers to start their computations. There is usually only one signal per worker in flight. This works nicely as long as the kmem cache allocations do not hit the slow path and cause latencies. To cure this, an optimistic caching was introduced (limited to RT tasks) which allows a task to cache a single sigqueue in a pointer in task_struct instead of handing it back to the kmem cache after consuming a signal. When the next signal is sent to the task, the cached sigqueue is used instead of allocating a new one. This solved the problem for this set of application scenarios nicely. The task cache is not preallocated, so the first signal sent to a task always goes to the kmem cache allocator. The cached sigqueue stays around until the task exits and is freed when task::sighand is dropped. After posting this solution for mainline, the discussion came up whether this would be useful in general and should not be limited to realtime tasks: https://lore.kernel.org/r/m11rcu7nbr.fsf@fess.ebiederm.org One concern leading to the original limitation was to avoid a large amount of pointlessly cached sigqueues in alive tasks. The other concern was vs. RLIMIT_SIGPENDING, as these cached sigqueues are not accounted for. The accounting problem is real, but on the other hand slightly academic. After gathering some statistics it turned out that after boot of a regular distro install there are less than 10 sigqueues cached in ~1500 tasks. In case of a 'mass fork and fire signal to child' scenario the extra 80 bytes of memory per task are well in the noise of the overall memory consumption of the fork bomb. If this should be limited then it would need an extra counter in struct user, more atomic instructions and a separate rlimit. Yet another tunable which is mostly unused. The caching is actually used. After boot and a full kernel compile on a 64-CPU machine with make -j128, the number of 'allocations' looks like this:

  From slab:       23996
  From task cache: 52223

I.e. it reduces the number of slab cache operations by ~68%. A typical pattern there is:

  <...>-58490 __sigqueue_alloc:  for 58488 from slab ffff8881132df460
  <...>-58488 __sigqueue_free:   cache ffff8881132df460
  <...>-58488 __sigqueue_alloc:  for 1149 from cache ffff8881103dc550
  bash-1149   exit_task_sighand: free ffff8881132df460
  bash-1149   __sigqueue_free:   cache ffff8881103dc550

The interesting sequence is that the exiting task 58488 grabs the sigqueue from bash's task cache to signal exit and bash sticks it back into its own cache. Lather, rinse and repeat. The caching is probably not noticeable for the general use case, but the benefit for latency-sensitive applications is clear. While kmem caches are usually just serving from the fast path, slab merging (the default) can, depending on the usage pattern of the merged slabs, cause occasional slow-path allocations. The time spared per cached entry is a few microseconds per signal, which is not relevant for e.g. a kernel build, but for signal-heavy workloads it's measurable. As there is no real downside to this caching mechanism, making it unconditionally available is preferred over more conditional code or new magic tunables.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Link: https://lkml.kernel.org/r/87sg4lbmxo.fsf@nanos.tec.linutronix.de
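Conceptually, the allocation side looks roughly like this (sigqueue_cache being the task_struct pointer the patch adds; the serialization details, which the changelog ties to the task's sighand, are omitted here):

  /* Prefer the task-local cached entry; fall back to the slab. */
  struct sigqueue *q = READ_ONCE(t->sigqueue_cache);

  if (q)
      WRITE_ONCE(t->sigqueue_cache, NULL);
  else
      q = kmem_cache_alloc(sigqueue_cachep, gfp_flags);

On the free side, the sigqueue is parked back into t->sigqueue_cache when that slot is empty, and only handed to kmem_cache_free() otherwise.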
2021-04-14  signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc()  (Thomas Gleixner)
There is no point in having the conditional at the callsite. Just hand in the allocation mode flag to __sigqueue_alloc() and use it to initialize sigqueue::flags. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210322092258.898677147@linutronix.de
2021-04-13  Merge tag 'trace-v5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  (Linus Torvalds)
Pull tracing fix from Steven Rostedt:
 "Fix a memory leak in dyn_event_release(). An error path exited the
  function before freeing the allocated 'argv' variable"

* tag 'trace-v5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/dynevent: Fix a memory leak in an error handling path
2021-04-13  tracing/dynevent: Fix a memory leak in an error handling path  (Christophe JAILLET)
We must free 'argv' before returning, as already done in all the other paths of this function. Link: https://lkml.kernel.org/r/21e3594ccd7fc88c5c162c98450409190f304327.1618136448.git.christophe.jaillet@wanadoo.fr Fixes: d262271d0483 ("tracing/dynevent: Delegate parsing to create function") Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-04-11  Merge tag 'locking-urgent-2021-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull locking fixlets from Ingo Molnar:
 "Two minor fixes: one for a Clang warning, the other improves an
  ambiguous/confusing kernel log message"

* tag 'locking-urgent-2021-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  lockdep: Address clang -Wformat warning printing for %hd
  lockdep: Add a missing initialization hint to the "INFO: Trying to register non-static key" message
2021-04-09  Merge branch 'akpm' (patches from Andrew)  (Linus Torvalds)
Merge misc fixes from Andrew Morton:
 "14 patches. Subsystems affected by this patch series: mm (kasan, gup,
  pagecache, and kfence), MAINTAINERS, mailmap, nds32, gcov, ocfs2,
  ia64, and lib"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  lib: fix kconfig dependency on ARCH_WANT_FRAME_POINTERS
  kfence, x86: fix preemptible warning on KPTI-enabled systems
  lib/test_kasan_module.c: suppress unused var warning
  kasan: fix conflict with page poisoning
  fs: direct-io: fix missing sdio->boundary
  ia64: fix user_stack_pointer() for ptrace()
  ocfs2: fix deadlock between setattr and dio_end_io_write
  gcov: re-fix clang-11+ support
  nds32: flush_dcache_page: use page_mapping_file to avoid races with swapoff
  mm/gup: check page posion status for coredump.
  .mailmap: fix old email addresses
  mailmap: update email address for Jordan Crouse
  treewide: change my e-mail address, fix my name
  MAINTAINERS: update CZ.NIC's Turris information
2021-04-09  Merge tag 'net-5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Linus Torvalds)
Pull networking fixes from Jakub Kicinski:
 "Networking fixes for 5.12-rc7, including fixes from can, ipsec,
  mac80211, wireless, and bpf trees. No scary regressions here or in
  the works, but small fixes for 5.12 changes keep coming.

  Current release - regressions:
   - virtio: do not pull payload in skb->head
   - virtio: ensure mac header is set in virtio_net_hdr_to_skb()
   - Revert "net: correct sk_acceptq_is_full()"
   - mptcp: revert "mptcp: provide subflow aware release function"
   - ethernet: lan743x: fix ethernet frame cutoff issue
   - dsa: fix type was not set for devlink port
   - ethtool: remove link_mode param and derive link params from driver
   - sched: htb: fix null pointer dereference on a null new_q
   - wireless: iwlwifi: Fix softirq/hardirq disabling in iwl_pcie_enqueue_hcmd()
   - wireless: iwlwifi: fw: fix notification wait locking
   - wireless: brcmfmac: p2p: Fix deadlock introduced by avoiding the rtnl dependency

  Current release - new code bugs:
   - napi: fix hangup on napi_disable for threaded napi
   - bpf: take module reference for trampoline in module
   - wireless: mt76: mt7921: fix airtime reporting and related tx hangs
   - wireless: iwlwifi: mvm: rfi: don't lock mvm->mutex when sending config command

  Previous releases - regressions:
   - rfkill: revert back to old userspace API by default
   - nfc: fix infinite loop, refcount & memory leaks in LLCP sockets
   - let skb_orphan_partial wake-up waiters
   - xfrm/compat: Cleanup WARN()s that can be user-triggered
   - vxlan, geneve: do not modify the shared tunnel info when PMTU
     triggers an ICMP reply
   - can: fix msg_namelen values depending on CAN_REQUIRED_SIZE
   - can: uapi: mark union inside struct can_frame packed
   - sched: cls: fix action overwrite reference counting
   - sched: cls: fix err handler in tcf_action_init()
   - ethernet: mlxsw: fix ECN marking in tunnel decapsulation
   - ethernet: nfp: Fix a use after free in nfp_bpf_ctrl_msg_rx
   - ethernet: i40e: fix receiving of single packets in xsk zero-copy mode
   - ethernet: cxgb4: avoid collecting SGE_QBASE regs during traffic

  Previous releases - always broken:
   - bpf: Refuse non-O_RDWR flags in BPF_OBJ_GET
   - bpf: Refcount task stack in bpf_get_task_stack
   - bpf, x86: Validate computation of branch displacements
   - ieee802154: fix many similar syzbot-found bugs
      - fix NULL dereferences in netlink attribute handling
      - reject unsupported operations on monitor interfaces
      - fix error handling in llsec_key_alloc()
   - xfrm: make ipv4 pmtu check honor ip header df
   - xfrm: make hash generation lock per network namespace
   - xfrm: esp: delete NETIF_F_SCTP_CRC bit from features for esp offload
   - ethtool: fix incorrect datatype in set_eee ops
   - xdp: fix xdp_return_frame() kernel BUG throw for page_pool memory model
   - openvswitch: fix send of uninitialized stack memory in ct limit reply

  Misc:
   - udp: add get handling for UDP_GRO sockopt"

* tag 'net-5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (182 commits)
  net: fix hangup on napi_disable for threaded napi
  net: hns3: Trivial spell fix in hns3 driver
  lan743x: fix ethernet frame cutoff issue
  net: ipv6: check for validity before dereferencing cfg->fc_nlinfo.nlh
  net: dsa: lantiq_gswip: Configure all remaining GSWIP_MII_CFG bits
  net: dsa: lantiq_gswip: Don't use PHY auto polling
  net: sched: sch_teql: fix null-pointer dereference
  ipv6: report errors for iftoken via netlink extack
  net: sched: fix err handler in tcf_action_init()
  net: sched: fix action overwrite reference counting
  Revert "net: sched: bump refcount for new action in ACT replace mode"
  ice: fix memory leak of aRFS after resuming from suspend
  i40e: Fix sparse warning: missing error code 'err'
  i40e: Fix sparse error: 'vsi->netdev' could be null
  i40e: Fix sparse error: uninitialized symbol 'ring'
  i40e: Fix sparse errors in i40e_txrx.c
  i40e: Fix parameters in aq_get_phy_register()
  nl80211: fix beacon head validation
  bpf, x86: Validate computation of branch displacements for x86-32
  bpf, x86: Validate computation of branch displacements for x86-64
  ...
2021-04-09  gcov: re-fix clang-11+ support  (Nick Desaulniers)
LLVM changed the expected function signature for llvm_gcda_emit_function() in the clang-11 release. Users of clang-11 or newer may have noticed their kernels producing invalid coverage information:

  $ llvm-cov gcov -a -c -u -f -b <input>.gcda -- gcno=<input>.gcno
  1 <func>: checksum mismatch, \
    (<lineno chksum A>, <cfg chksum B>) != (<lineno chksum A>, <cfg chksum C>)
  2 Invalid .gcda File!
  ...

Fix up the function signatures so calling this function interprets its parameters correctly and computes the correct cfg checksum. In particular, in clang-11, the additional checksum is no longer optional. Link: https://reviews.llvm.org/rG25544ce2df0daa4304c07e64b9c8b0f7df60c11d Link: https://lkml.kernel.org/r/20210408184631.1156669-1-ndesaulniers@google.com Reported-by: Prasad Sodagudi <psodagud@quicinc.com> Tested-by: Prasad Sodagudi <psodagud@quicinc.com> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nathan Chancellor <nathan@kernel.org> Cc: <stable@vger.kernel.org> [5.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-09  sched/fair: Introduce a CPU capacity comparison helper  (Valentin Schneider)
During load-balance, groups classified as group_misfit_task are filtered out if they do not pass group_smaller_max_cpu_capacity(<candidate group>, <local group>); which itself employs fits_capacity() to compare the sgc->max_capacity of both groups. Due to the underlying margin, fits_capacity(X, 1024) will return false for any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are {261, 871, 1024}. If a CPU-bound task ends up on one of those "medium" CPUs, misfit migration will never intentionally upmigrate it to a CPU of higher capacity due to the aforementioned margin. One may argue the 20% margin of fits_capacity() is excessive in the advent of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here is that fits_capacity() is meant to compare a utilization value to a capacity value, whereas here it is being used to compare two capacity values. As CPU capacity and task utilization have different dynamics, a sensible approach here would be to add a new helper dedicated to comparing CPU capacities. Also note that comparing capacity extrema of local and source sched_groups doesn't make much sense when at the end of the day the imbalance will be pulled by a known env->dst_cpu, whose capacity can be anywhere within the local group's capacity extrema. While at it, replace group_smaller_{min, max}_cpu_capacity() with comparisons of the source group's min/max capacity and the destination CPU's capacity. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Qais Yousef <qais.yousef@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Lingutla Chandrasekhar <clingutla@codeaurora.org> Link: https://lkml.kernel.org/r/20210407220628.3798191-4-valentin.schneider@arm.com
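A sketch of such a helper; the ~5% margin (1078/1024) is our reading of the patch as merged, mirroring the style of fits_capacity() but with a margin suited to capacity-vs-capacity comparisons:

  /* Compare two capacity values, requiring cap1 to clear cap2 by a
   * small (~5%) margin. Unlike fits_capacity(), both operands are
   * capacities, not a utilization against a capacity.
   */
  #define capacity_greater(cap1, cap2) ((cap1) * 1024 > (cap2) * 1078)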
2021-04-09  sched/fair: Clean up active balance nr_balance_failed trickery  (Valentin Schneider)
When triggering an active load balance, sd->nr_balance_failed is set to such a value that any further can_migrate_task() using said sd will ignore the output of task_hot(). This behaviour makes sense, as active load balance intentionally preempts a rq's running task to migrate it right away, but this asynchronous write is a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop before the sd->nr_balance_failed write either becomes visible to the stopper's CPU or even happens on the CPU that appended the stopper work. Add a struct lb_env flag to denote active balancing, and use it in can_migrate_task(). Remove the sd->nr_balance_failed write that served the same purpose. Clean up the LBF_DST_PINNED active balance special case. Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210407220628.3798191-3-valentin.schneider@arm.com
2021-04-09  sched/fair: Ignore percpu threads for imbalance pulls  (Lingutla Chandrasekhar)
During load balance, LBF_SOME_PINNED will be set if any candidate task cannot be detached due to CPU affinity constraints. This can result in setting env->sd->parent->sgc->group_imbalance, which can lead to a group being classified as group_imbalanced (rather than any of the other, lower group_type) when balancing at a higher level. In workloads involving a single task per CPU, LBF_SOME_PINNED can often be set due to per-CPU kthreads being the only other runnable tasks on any given rq. This results in changing the group classification during load-balance at higher levels when in reality there is nothing that can be done for this affinity constraint: per-CPU kthreads, as the name implies, don't get to move around (modulo hotplug shenanigans). It's not as clear for userspace tasks - a task could be in an N-CPU cpuset with N-1 offline CPUs, making it an "accidental" per-CPU task rather than an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which we can leverage here to not set LBF_SOME_PINNED. Note that the aforementioned classification to group_imbalance (when nothing can be done) is especially problematic on big.LITTLE systems, which have a topology the likes of:

  DIE [          ]
  MC  [    ][    ]
       0  1  2  3
       L  L  B  B

  arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)

Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads can significantly delay the upmigration of said misfit tasks. Systems relying on ASYM_PACKING are likely to face similar issues. Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org> [Use kthread_is_per_cpu() rather than p->nr_cpus_allowed] [Reword changelog] Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210407220628.3798191-2-valentin.schneider@arm.com
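The resulting check is tiny; a sketch of the hunk in can_migrate_task() as the changelog describes it:

  /* Disregard pcpu kthreads; they are where they need to be. */
  if (kthread_is_per_cpu(p))
      return 0;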
2021-04-09  sched/fair: Bring back select_idle_smt(), but differently  (Rik van Riel)
Mel Gorman did some nice work in 9fe1f127b913 ("sched/fair: Merge select_idle_core/cpu()"), resulting in the kernel being more efficient at finding an idle CPU, and in tasks spending less time waiting to be run, both according to the schedstats run_delay numbers, and according to measured application latencies. Yay. The flip side of this is that we see more task migrations (about 30% more), higher cache misses, higher memory bandwidth utilization, and higher CPU use, for the same number of requests/second. This is most pronounced on a memcache type workload, which saw a consistent 1-3% increase in total CPU use on the system, due to those increased task migrations leading to higher L2 cache miss numbers, and higher memory utilization. The exclusive L3 cache on Skylake does us no favors there. On our web serving workload, that effect is usually negligible. It appears that the increased number of CPU migrations is generally a good thing, since it leads to lower cpu_delay numbers, reflecting the fact that tasks get to run faster. However, the reduced locality and the corresponding increase in L2 cache misses hurts a little. The patch below appears to fix the regression, while keeping the benefit of the lower cpu_delay numbers, by reintroducing select_idle_smt with a twist: when a socket has no idle cores, check to see if the sibling of "prev" is idle, before searching all the other CPUs. This fixes both the occasional 9% regression on the web serving workload, and the continuous 2% CPU use regression on the memcache type workload. With Mel's patches and this patch together, task migrations are still high, but L2 cache misses, memory bandwidth, and CPU time used are back down to what they were before. The p95 and p99 response times for the memcache type application improve by about 10% over what they were before Mel's patches got merged. Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20210326151932.2c187840@imladris.surriel.com
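A sketch of the reintroduced helper under those constraints (idle-check primitives as in the mainline fair-class code; the scan is confined to the SMT siblings of the target, i.e. prev's core):

  static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
  {
      int cpu;

      for_each_cpu(cpu, cpu_smt_mask(target)) {
          /* Respect affinity and stay within this LLC domain. */
          if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
              !cpumask_test_cpu(cpu, sched_domain_span(sd)))
              continue;
          if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
              return cpu;
      }

      return -1;
  }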
2021-04-08  psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files  (Josh Hunt)
Currently only root can write files under /proc/pressure. Relax this to allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be able to write to these files. Signed-off-by: Josh Hunt <johunt@akamai.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20210402025833.27599-1-johunt@akamai.com
2021-04-04  workqueue/watchdog: Make unbound workqueues aware of touch_softlockup_watchdog()  (Wang Qing)
There are two workqueue-specific watchdog timestamps:

 + @wq_watchdog_touched_cpu (per-CPU) updated by touch_softlockup_watchdog()
 + @wq_watchdog_touched (global) updated by touch_all_softlockup_watchdogs()

watchdog_timer_fn() checks only the global @wq_watchdog_touched for unbound workqueues. As a result, unbound workqueues are not aware of touch_softlockup_watchdog(). The watchdog might report a stall even when the unbound workqueues are blocked by known slow code. Solution: touch_softlockup_watchdog() must also touch the global @wq_watchdog_touched timestamp. The global timestamp can no longer be used for bound workqueues because it is now updated from all CPUs. Instead, bound workqueues have to check only @wq_watchdog_touched_cpu, and these timestamps have to be updated for all CPUs in touch_all_softlockup_watchdogs(). Beware: the change might cause the opposite problem. An unbound workqueue might get blocked on CPU A because of a real softlockup. The workqueue watchdog would miss it when the timestamp got touched on CPU B. It is acceptable because softlockups are detected by the softlockup watchdog. The workqueue watchdog is there to detect stalls where a work never finishes, for example, because of dependencies of works queued into the same workqueue. V3: modify the commit message according to Petr's suggestion. Signed-off-by: Wang Qing <wangqing@vivo.com> Signed-off-by: Tejun Heo <tj@kernel.org>
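The fix itself is small; a sketch of wq_watchdog_touch() as described above (previously the global timestamp was written only when no CPU was given):

  void wq_watchdog_touch(int cpu)
  {
      if (cpu >= 0)
          per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;

      /* Always touch the global stamp so unbound workqueues see it. */
      wq_watchdog_touched = jiffies;
  }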
2021-04-04  workqueue: Move the position of debug_work_activate() in __queue_work()  (Zqiang)
debug_work_activate() should be called only on the premise that the work can actually be inserted, because if the workqueue is in WQ_DRAINING state the insertion may fail. Fixes: e41e704bc4f4 ("workqueue: improve destroy_workqueue() debuggability") Signed-off-by: Zqiang <qiang.zhang@windriver.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2021-04-02  Merge tag 'trace-v5.12-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  (Linus Torvalds)
Pull tracing fix from Steven Rostedt:
 "Fix stack trace entry size to stop showing garbage

  The macro that creates both the structure and the format displayed to
  user space for the stack trace event was changed a while ago to fix
  the parsing by user space tooling. But this change also modified the
  structure used to store the stack trace event. It changed the caller
  array field from [0] to [8]. Even though the size in the ring buffer
  is dynamic and can be something other than 8 (user space knows how to
  handle this), the 8 extra words were not accounted for when reserving
  the event on the ring buffer, and added 8 more entries, due to the
  calculation of "sizeof(*entry) + nr_entries * sizeof(long)", as the
  sizeof(*entry) now contains 8 entries. The size of the caller field
  needs to be subtracted from the size of the entry to create the
  correct allocation size"

* tag 'trace-v5.12-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix stack trace event size
2021-04-01  bpf: program: Refuse non-O_RDWR flags in BPF_OBJ_GET  (Lorenz Bauer)
As for bpf_link, refuse creating a non-O_RDWR fd. Since program fds currently don't allow modifications this is a precaution, not a straight up bug fix. Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210326160501.46234-2-lmb@cloudflare.com
2021-04-01  bpf: link: Refuse non-O_RDWR flags in BPF_OBJ_GET  (Lorenz Bauer)
Invoking BPF_OBJ_GET on a pinned bpf_link checks the path access permissions based on file_flags, but the returned fd ignores flags. This means that any user can acquire a "read-write" fd for a pinned link with mode 0664 by invoking BPF_OBJ_GET with BPF_F_RDONLY in file_flags. The fd can be used to invoke BPF_LINK_DETACH, etc. Fix this by refusing non-O_RDWR flags in BPF_OBJ_GET. This works because OBJ_GET by default returns a read write mapping and libbpf doesn't expose a way to override this behaviour for programs and links. Fixes: 70ed506c3bbc ("bpf: Introduce pinnable bpf_link abstraction") Signed-off-by: Lorenz Bauer <lmb@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20210326160501.46234-1-lmb@cloudflare.com
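A conceptual sketch of the check (the exact call site in kernel/bpf/inode.c may differ): after translating BPF_F_RDONLY/BPF_F_WRONLY in file_flags to open flags, anything other than O_RDWR is refused for pinned links:

  /* Pinned links only hand out read-write fds now. */
  if (f_flags != O_RDWR)
      return ERR_PTR(-EINVAL);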
2021-04-01  bpf: Refcount task stack in bpf_get_task_stack  (Dave Marchevsky)
On x86 the struct pt_regs * grabbed by task_pt_regs() points to an offset of task->stack. The pt_regs are later dereferenced in __bpf_get_stack (e.g. by the user_mode() check). This can cause a fault if the task in question exits while bpf_get_task_stack is executing, as warned by task_stack_page's comment:

  * When accessing the stack of a non-current task that might exit, use
  * try_get_task_stack() instead. task_stack_page will return a pointer
  * that could get freed out from under you.

Take the comment's advice and use try_get_task_stack() and put_task_stack() to hold the task->stack refcount, or bail early if it's already 0. Incrementing stack_refcount will ensure the task's stack sticks around while we're using its data. I noticed this bug while testing a bpf task iter similar to bpf_iter_task_stack in selftests, except mine grabbed the user stack, and getting intermittent crashes, which resulted in dumps like:

  BUG: unable to handle page fault for address: 0000000000003fe0
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  RIP: 0010:__bpf_get_stack+0xd0/0x230
  <snip...>
  Call Trace:
   bpf_prog_0a2be35c092cb190_get_task_stacks+0x5d/0x3ec
   bpf_iter_run_prog+0x24/0x81
   __task_seq_show+0x58/0x80
   bpf_seq_read+0xf7/0x3d0
   vfs_read+0x91/0x140
   ksys_read+0x59/0xd0
   do_syscall_64+0x48/0x120
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: fa28dcb82a38 ("bpf: Introduce helper bpf_get_task_stack()") Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20210401000747.3648767-1-davemarchevsky@fb.com
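A sketch of the pattern the fix applies (simplified; the surrounding bpf_get_task_stack() logic and its exact signature are elided):

  long res;
  struct pt_regs *regs;

  /* Pin task->stack, or bail if the refcount already hit zero. */
  if (!try_get_task_stack(task))
      return -EFAULT;

  regs = task_pt_regs(task);
  res = __bpf_get_stack(regs, task, NULL, buf, size, flags);

  put_task_stack(task);
  return res;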
2021-04-01  tracing: Fix stack trace event size  (Steven Rostedt (VMware))
Commit cbc3b92ce037 fixed an issue to modify the macros of the stack trace event so that user space could parse it properly. Originally the stack trace format exposed to user space showed the caller stack as a dynamic array. But it is not actually a dynamic array, in the way that other dynamic event arrays worked, and this broke user space parsing for it. The update was to make the array look to have 8 entries in it. Helper functions were added to make it parse correctly, as the stack was dynamic, but was determined by the size of the event stored. Although this fixed how user space read the event, it changed the internal structure used for the stack trace event. It changed the array size from [0] to [8] (added 8 entries). This increased the size of the stack trace event by 8 words. The size reserved on the ring buffer was the size of the stack trace event plus the number of stack entries found in the stack trace. That commit caused the amount to be 8 more than what was needed because it did not expect the caller field to have any size. This produced 8 entries of garbage (reading random data) in the stack trace event:

  <idle>-0 [002] d... 1976396.837549: <stack trace>
  => trace_event_raw_event_sched_switch
  => __traceiter_sched_switch
  => __schedule
  => schedule_idle
  => do_idle
  => cpu_startup_entry
  => secondary_startup_64_no_verify
  => 0xc8c5e150ffff93de
  => 0xffff93de
  => 0
  => 0
  => 0xc8c5e17800000000
  => 0x1f30affff93de
  => 0x00000004
  => 0x200000000

Instead, subtract the size of the caller field from the size of the event to make sure that only the amount needed to store the stack trace is reserved. Link: https://lore.kernel.org/lkml/your-ad-here.call-01617191565-ext-9692@work.hours/ Cc: stable@vger.kernel.org Fixes: cbc3b92ce037 ("tracing: Set kernel_stack's caller size properly") Reported-by: Vasily Gorbik <gor@linux.ibm.com> Tested-by: Vasily Gorbik <gor@linux.ibm.com> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
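In code terms, the reservation changes roughly like this (entry points at the stack trace event, whose caller field is now the static [8] array):

  /* before: sizeof(*entry) already contains caller[8], so this
   * over-reserves by 8 words
   */
  size = sizeof(*entry) + nr_entries * sizeof(unsigned long);

  /* after: subtract the static caller field first */
  size = (sizeof(*entry) - sizeof(entry->caller)) +
         nr_entries * sizeof(unsigned long);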
2021-03-31  Merge tag 'trace-v5.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace  (Linus Torvalds)
Pull ftrace fix from Steven Rostedt:
 "Add check of order < 0 before calling free_pages()

  The function addresses that are traced by ftrace are stored in pages,
  and the size is held in a variable. If there's some error in creating
  them, the allocated ones will be freed. In this case, it is possible
  that the order of pages to be freed may end up being negative due to a
  size of zero passed to get_count_order(), and then that negative
  number will cause free_pages() to free a very large section. Make sure
  that does not happen"

* tag 'trace-v5.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Check if pages were allocated before calling free_pages()
2021-03-30  ftrace: Check if pages were allocated before calling free_pages()  (Steven Rostedt (VMware))
It is possible that on error pg->size can be zero when getting its order, which would return a -1 value. It is dangerous to pass in an order of -1 to free_pages(). Check if order is greater than or equal to zero before calling free_pages(). Link: https://lore.kernel.org/lkml/20210330093916.432697c7@gandalf.local.home/ Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
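A sketch of the guard (ENTRIES_PER_PAGE as used in kernel/trace/ftrace.c; get_count_order(0) yields -1):

  if (pg->records) {
      order = get_count_order(pg->size / ENTRIES_PER_PAGE);
      if (order >= 0)
          free_pages((unsigned long)pg->records, order);
  }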
2021-03-28  Merge tag 'io_uring-5.12-2021-03-27' of git://git.kernel.dk/linux-block  (Linus Torvalds)
Pull io_uring fixes from Jens Axboe:

 - Use thread info versions of flag testing, as discussed last week.

 - The series enabling PF_IO_WORKER to just take signals, instead of
   needing to special case that they do not in a bunch of places. Ends
   up being pretty trivial to do, and then we can revert all the special
   casing we're currently doing.

 - Kill dead pointer assignment

 - Fix hashed part of async work queue trace

 - Fix sign extension issue for IORING_OP_PROVIDE_BUFFERS

 - Fix a link completion ordering regression in this merge window

 - Cancellation fixes

* tag 'io_uring-5.12-2021-03-27' of git://git.kernel.dk/linux-block:
  io_uring: remove unsued assignment to pointer io
  io_uring: don't cancel extra on files match
  io_uring: don't cancel-track common timeouts
  io_uring: do post-completion chore on t-out cancel
  io_uring: fix timeout cancel return code
  Revert "signal: don't allow STOP on PF_IO_WORKER threads"
  Revert "kernel: freezer should treat PF_IO_WORKER like PF_KTHREAD for freezing"
  Revert "kernel: treat PF_IO_WORKER like PF_KTHREAD for ptrace/signals"
  Revert "signal: don't allow sending any signals to PF_IO_WORKER threads"
  kernel: stop masking signals in create_io_thread()
  io_uring: handle signals for IO threads like a normal thread
  kernel: don't call do_exit() for PF_IO_WORKER threads
  io_uring: maintain CQE order of a failed link
  io-wq: fix race around pending work on teardown
  io_uring: do ctx sqd ejection in a clear context
  io_uring: fix provide_buffers sign extension
  io_uring: don't skip file_end_write() on reissue
  io_uring: correct io_queue_async_work() traces
  io_uring: don't use {test,clear}_tsk_thread_flag() for current
2021-03-27Revert "signal: don't allow STOP on PF_IO_WORKER threads"Jens Axboe
This reverts commit 4db4b1a0d1779dc159f7b87feb97030ec0b12597. The IO threads allow and handle SIGSTOP now, so don't special case them anymore in task_set_jobctl_pending(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-03-27Revert "kernel: freezer should treat PF_IO_WORKER like PF_KTHREAD for freezing"Jens Axboe
This reverts commit 15b2219facadec583c24523eed40fa45865f859f. Before IO threads accepted signals, the freezer using take signals to wake up an IO thread would cause them to loop without any way to clear the pending signal. That is no longer the case, so stop special casing PF_IO_WORKER in the freezer. Signed-off-by: Jens Axboe <axboe@kernel.dk>