Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
- MTE asynchronous support for KASan. Previously only synchronous
(slower) mode was supported. Asynchronous is faster but does not
allow precise identification of the illegal access.
- Run kernel mode SIMD with softirqs disabled. This allows using NEON
in softirq context for crypto performance improvements. The
conditional yield support is modified to take softirqs into account
and reduce the latency.
- Preparatory patches for Apple M1: handle CPUs that only have the VHE
mode available (host kernel running at EL2), add FIQ support.
- arm64 perf updates: support for HiSilicon PA and SLLC PMU drivers,
new functions for the HiSilicon HHA and L3C PMU, cleanups.
- Re-introduce support for execute-only user permissions but only when
the EPAN (Enhanced Privileged Access Never) architecture feature is
available.
- Disable fine-grained traps at boot and improve the documented boot
requirements.
- Support CONFIG_KASAN_VMALLOC on arm64 (only with KASAN_GENERIC).
- Add hierarchical eXecute Never permissions for all page tables.
- Add arm64 prctl(PR_PAC_{SET,GET}_ENABLED_KEYS) allowing user programs
to control which PAC keys are enabled in a particular task.
- arm64 kselftests for BTI and some improvements to the MTE tests.
- Minor improvements to the compat vdso and sigpage.
- Miscellaneous cleanups.
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (86 commits)
arm64/sve: Add compile time checks for SVE hooks in generic functions
arm64/kernel/probes: Use BUG_ON instead of if condition followed by BUG.
arm64: pac: Optimize kernel entry/exit key installation code paths
arm64: Introduce prctl(PR_PAC_{SET,GET}_ENABLED_KEYS)
arm64: mte: make the per-task SCTLR_EL1 field usable elsewhere
arm64/sve: Remove redundant system_supports_sve() tests
arm64: fpsimd: run kernel mode NEON with softirqs disabled
arm64: assembler: introduce wxN aliases for wN registers
arm64: assembler: remove conditional NEON yield macros
kasan, arm64: tests supports for HW_TAGS async mode
arm64: mte: Report async tag faults before suspend
arm64: mte: Enable async tag check fault
arm64: mte: Conditionally compile mte_enable_kernel_*()
arm64: mte: Enable TCO in functions that can read beyond buffer limits
kasan: Add report for async mode
arm64: mte: Drop arch_enable_tagging()
kasan: Add KASAN mode kernel parameter
arm64: mte: Add asynchronous mode support
arm64: Get rid of CONFIG_ARM64_VHE
arm64: Cope with CPUs stuck in VHE mode
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"The time and timers updates contain:
Core changes:
- Allow runtime power management when the clocksource is changed.
- A correctness fix for clock_adjtime32() so that the return value on
success is not overwritten by the result of the copy to user.
- Allow late installment of broadcast clockevent devices which was
broken because nothing switched them over to oneshot mode. This
went unnoticed so far because clockevent devices used to be built
in, but now people started to make them modular.
- Debugfs related simplifications
- Small cleanups and improvements here and there
Driver changes:
- The usual set of device tree binding updates for a wide range of
drivers/devices.
- The usual updates and improvements for drivers all over the place
but nothing outstanding.
- No new clocksource/event drivers. They'll come back next time"
* tag 'timers-core-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
posix-timers: Preserve return value in clock_adjtime32()
tick/broadcast: Allow late registered device to enter oneshot mode
tick: Use tick_check_replacement() instead of open coding it
time/timecounter: Mark 1st argument of timecounter_cyc2time() as const
dt-bindings: timer: nuvoton,npcm7xx: Add wpcm450-timer
clocksource/drivers/arm_arch_timer: Add __ro_after_init and __init
clocksource/drivers/timer-ti-dm: Handle dra7 timer wrap errata i940
clocksource/drivers/timer-ti-dm: Prepare to handle dra7 timer wrap issue
clocksource/drivers/dw_apb_timer_of: Add handling for potential memory leak
clocksource/drivers/npcm: Add support for WPCM450
clocksource/drivers/sh_cmt: Don't use CMTOUT_IE with R-Car Gen2/3
clocksource/drivers/pistachio: Fix trivial typo
clocksource/drivers/ingenic_ost: Fix return value check in ingenic_ost_probe()
clocksource/drivers/timer-ti-dm: Add missing set_state_oneshot_stopped
clocksource/drivers/timer-ti-dm: Fix posted mode status check order
dt-bindings: timer: renesas,cmt: Document R8A77961
dt-bindings: timer: renesas,cmt: Add r8a779a0 CMT support
clocksource/drivers/ingenic-ost: Add support for the JZ4760B
clocksource/drivers/ingenic: Add support for the JZ4760
dt-bindings: timer: ingenic: Add compatible strings for JZ4760(B)
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"The usual updates from the irq departement:
Core changes:
- Provide IRQF_NO_AUTOEN as a flag for request*_irq() so drivers can
be cleaned up which either use a seperate mechanism to prevent
auto-enable at request time or have a racy mechanism which disables
the interrupt right after request.
- Get rid of the last usage of irq_create_identity_mapping() and
remove the interface.
- An overhaul of tasklet_disable().
Most usage sites of tasklet_disable() are in task context and
usually in cleanup, teardown code pathes. tasklet_disable()
spinwaits for a tasklet which is currently executed. That's not
only a problem for PREEMPT_RT where this can lead to a live lock
when the disabling task preempts the softirq thread. It's also
problematic in context of virtualization when the vCPU which runs
the tasklet is scheduled out and the disabling code has to spin
wait until it's scheduled back in.
There are a few code pathes which invoke tasklet_disable() from
non-sleepable context. For these a new disable variant which still
spinwaits is provided which allows to switch tasklet_disable() to a
sleep wait mechanism. For the atomic use cases this does not solve
the live lock issue on PREEMPT_RT. That is mitigated by blocking on
the RT specific softirq lock.
- The PREEMPT_RT specific implementation of softirq processing and
local_bh_disable/enable().
On RT enabled kernels soft interrupt processing happens always in
task context and all interrupt handlers, which are not explicitly
marked to be invoked in hard interrupt context are forced into task
context as well. This allows to protect against softirq processing
with a per CPU lock, which in turn allows to make BH disabled
regions preemptible.
Most of the softirq handling code is still shared. The RT/non-RT
specific differences are addressed with a set of inline functions
which provide the context specific functionality. The
local_bh_disable() / local_bh_enable() mechanism are obviously
seperate.
- The usual set of small improvements and cleanups
Driver changes:
- New drivers for Nuvoton WPCM450 and DT 79rc3243x interrupt
controllers
- Extended functionality for MStar, STM32 and SC7280 irq chips
- Enhanced robustness for ARM GICv3/4.1 drivers
- The usual set of cleanups and improvements all over the place"
* tag 'irq-core-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
irqchip/xilinx: Expose Kconfig option for Zynq/ZynqMP
irqchip/gic-v3: Do not enable irqs when handling spurious interrups
dt-bindings: interrupt-controller: Add IDT 79RC3243x Interrupt Controller
irqchip: Add support for IDT 79rc3243x interrupt controller
irqdomain: Drop references to recusive irqdomain setup
irqdomain: Get rid of irq_create_strict_mappings()
irqchip/jcore-aic: Kill use of irq_create_strict_mappings()
ARM: PXA: Kill use of irq_create_strict_mappings()
irqchip/gic-v4.1: Disable vSGI upon (GIC CPUIF < v4.1) detection
irqchip/tb10x: Use 'fallthrough' to eliminate a warning
genirq: Reduce irqdebug cacheline bouncing
kernel: Initialize cpumask before parsing
irqchip/wpcm450: Drop COMPILE_TEST
irqchip/irq-mst: Support polarity configuration
irqchip: Add driver for WPCM450 interrupt controller
dt-bindings: interrupt-controller: Add nuvoton, wpcm450-aic
dt-bindings: qcom,pdc: Add compatible for sc7280
irqchip/stm32: Add usart instances exti direct event support
irqchip/gic-v3: Fix OF_BAD_ADDR error handling
irqchip/sifive-plic: Mark two global variables __ro_after_init
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core entry updates from Thomas Gleixner:
"A trivial cleanup of typo fixes"
* tag 'core-entry-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
entry: Fix typos in comments
|
|
Pull lockdep capacity limit updates from Tetsuo Handa:
"syzbot is occasionally reporting that fuzz testing is terminated due
to hitting upper limits lockdep can track.
Analysis via /proc/lockdep* did not show any obvious culprits, allow
tuning tracing capacity constants"
* tag 'tomoyo-pr-20210426' of git://git.osdn.net/gitroot/tomoyo/tomoyo-test1:
lockdep: Allow tuning tracing capacity constants.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Borislav Petkov:
"Fix ordering in the queued writer lock's slowpath"
* tag 'locking_urgent_for_v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/qrwlock: Fix ordering in queued_write_lock_slowpath()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Borislav Petkov:
"Fix a typo in a macro ifdeffery"
* tag 'sched_urgent_for_v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
preempt/dynamic: Fix typo in macro conditional statement
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core
Pull irqchip and irqdomain updates from Marc Zyngier:
New HW support:
- New driver for the Nuvoton WPCM450 interrupt controller
- New driver for the IDT 79rc3243x interrupt controller
- Add support for interrupt trigger configuration to the MStar irqchip
- Add more external interrupt support to the STM32 irqchip
- Add new compatible strings for QCOM SC7280 to the qcom-pdc binding
Fixes and cleanups:
- Drop irq_create_strict_mappings() and irq_create_identity_mapping()
from the irqdomain API, with cleanups in a couple of drivers
- Fix nested NMI issue with spurious interrupts on GICv3
- Don't allow GICv4.1 vSGIs when the CPU doesn't support them
- Various cleanups and minor fixes
Link: https://lore.kernel.org/r/20210424094640.1731920-1-maz@kernel.org
|
|
It was never completely implemented, and was removed a long time
ago. Adjust the documentation to reflect this.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210406093557.1073423-8-maz@kernel.org
|
|
No user of this helper is left, remove it.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Fix tp_printk command line and trace events
Masami added a wrapper to be able to unhash trace event pointers as
they are only read by root anyway, and they can also be extracted by
the raw trace data buffers. But this wrapper utilized the iterator to
have a temporary buffer to manipulate the text with.
tp_printk is a kernel command line option that will send the trace
output of a trace event to the console on boot up (useful when the
system crashes before finishing the boot). But the code used the same
wrapper that Masami added, and its iterator did not have a buffer, and
this caused the system to crash.
Have the wrapper just print the trace event normally if the iterator
has no temporary buffer"
* tag 'trace-v5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix checking event hash pointer logic when tp_printk is enabled
|
|
cap_setfcap is required to create file capabilities.
Since commit 8db6c34f1dbc ("Introduce v3 namespaced file capabilities"),
a process running as uid 0 but without cap_setfcap is able to work
around this as follows: unshare a new user namespace which maps parent
uid 0 into the child namespace.
While this task will not have new capabilities against the parent
namespace, there is a loophole due to the way namespaced file
capabilities are represented as xattrs. File capabilities valid in
userns 1 are distinguished from file capabilities valid in userns 2 by
the kuid which underlies uid 0. Therefore the restricted root process
can unshare a new self-mapping namespace, add a namespaced file
capability onto a file, then use that file capability in the parent
namespace.
To prevent that, do not allow mapping parent uid 0 if the process which
opened the uid_map file does not have CAP_SETFCAP, which is the
capability for setting file capabilities.
As a further wrinkle: a task can unshare its user namespace, then open
its uid_map file itself, and map (only) its own uid. In this case we do
not have the credential from before unshare, which was potentially more
restricted. So, when creating a user namespace, we record whether the
creator had CAP_SETFCAP. Then we can use that during map_write().
With this patch:
1. Unprivileged user can still unshare -Ur
ubuntu@caps:~$ unshare -Ur
root@caps:~# logout
2. Root user can still unshare -Ur
ubuntu@caps:~$ sudo bash
root@caps:/home/ubuntu# unshare -Ur
root@caps:/home/ubuntu# logout
3. Root user without CAP_SETFCAP cannot unshare -Ur:
root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
unable to set CAP_SETFCAP effective capability: Operation not permitted
root@caps:/home/ubuntu# unshare -Ur
unshare: write failed /proc/self/uid_map: Operation not permitted
Note: an alternative solution would be to allow uid 0 mappings by
processes without CAP_SETFCAP, but to prevent such a namespace from
writing any file capabilities. This approach can be seen at [1].
Background history: commit 95ebabde382 ("capabilities: Don't allow
writing ambiguous v3 file capabilities") tried to fix the issue by
preventing v3 fscaps to be written to disk when the root uid would map
to the same uid in nested user namespaces. This led to regressions for
various workloads. For example, see [2]. Ultimately this is a valid
use-case we have to support meaning we had to revert this change in
3b0c2d3eaa83 ("Revert 95ebabde382c ("capabilities: Don't allow writing
ambiguous v3 file capabilities")").
Link: https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4 [1]
Link: https://github.com/containers/buildah/issues/3071 [2]
Signed-off-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Andrew G. Morgan <morgan@kernel.org>
Tested-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Tested-by: Giuseppe Scrivano <gscrivan@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pointers in events that are printed are unhashed if the flags allow it,
and the logic to do so is called before processing the event output from
the raw ring buffer. In most cases, this is done when a user reads one of
the trace files.
But if tp_printk is added on the kernel command line, this logic is done
for trace events when they are triggered, and their output goes out via
printk. The unhash logic (and even the validation of the output) did not
support the tp_printk output, and would crash.
Link: https://lore.kernel.org/linux-tegra/9835d9f1-8d3a-3440-c53f-516c2606ad07@nvidia.com/
Fixes: efbbdaa22bb7 ("tracing: Show real address for trace event arguments")
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Tested-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
This reverts commit 04c53de57cb6435738961dace8b1b71d3ecd3c39.
Nathan Chancellor points out that it should not have been merged into
mainline by itself. It was a fix for "gcov: use kvmalloc()", which is
still in -mm/-next. Merging it alone has broken the build.
Link: https://github.com/ClangBuiltLinux/continuous-integration2/runs/2384465683?check_suite_focus=true
Reported-by: Nathan Chancellor <nathan@kernel.org>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 40607ee97e4e ("preempt/dynamic: Provide irqentry_exit_cond_resched()
static call") tried to provide irqentry_exit_cond_resched() static call
in irqentry_exit, but has a typo in macro conditional statement.
Fixes: 40607ee97e4e ("preempt/dynamic: Provide irqentry_exit_cond_resched() static call")
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210410073523.5493-1-zhouzhouyi@gmail.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.12-rc8, including fixes from netfilter, and
bpf. BPF verifier changes stand out, otherwise things have slowed
down.
Current release - regressions:
- gro: ensure frag0 meets IP header alignment
- Revert "net: stmmac: re-init rx buffers when mac resume back"
- ethernet: macb: fix the restore of cmp registers
Previous releases - regressions:
- ixgbe: Fix NULL pointer dereference in ethtool loopback test
- ixgbe: fix unbalanced device enable/disable in suspend/resume
- phy: marvell: fix detection of PHY on Topaz switches
- make tcp_allowed_congestion_control readonly in non-init netns
- xen-netback: Check for hotplug-status existence before watching
Previous releases - always broken:
- bpf: mitigate a speculative oob read of up to map value size by
tightening the masking window
- sctp: fix race condition in sctp_destroy_sock
- sit, ip6_tunnel: Unregister catch-all devices
- netfilter: nftables: clone set element expression template
- netfilter: flowtable: fix NAT IPv6 offload mangling
- net: geneve: check skb is large enough for IPv4/IPv6 header
- netlink: don't call ->netlink_bind with table lock held"
* tag 'net-5.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (52 commits)
netlink: don't call ->netlink_bind with table lock held
MAINTAINERS: update my email
bpf: Update selftests to reflect new error states
bpf: Tighten speculative pointer arithmetic mask
bpf: Move sanitize_val_alu out of op switch
bpf: Refactor and streamline bounds check into helper
bpf: Improve verifier error messages for users
bpf: Rework ptr_limit into alu_limit and add common error path
bpf: Ensure off_reg has no mixed signed bounds for all types
bpf: Move off_reg into sanitize_ptr_alu
bpf: Use correct permission flag for mixed signed bounds arithmetic
ch_ktls: do not send snd_una update to TCB in middle
ch_ktls: tcb close causes tls connection failure
ch_ktls: fix device connection close
ch_ktls: Fix kernel panic
i40e: fix the panic when running bpf in xdpdrv mode
net/mlx5e: fix ingress_ifindex check in mlx5e_flower_parse_meta
net/mlx5e: Fix setting of RS FEC mode
net/mlx5: Fix setting of devlink traps in switchdev mode
Revert "net: stmmac: re-init rx buffers when mac resume back"
...
|
|
The return value on success (>= 0) is overwritten by the return value of
put_old_timex32(). That works correct in the fault case, but is wrong for
the success case where put_old_timex32() returns 0.
Just check the return value of put_old_timex32() and return -EFAULT in case
it is not zero.
[ tglx: Massage changelog ]
Fixes: 3a4d44b61625 ("ntp: Move adjtimex related compat syscalls to native counterparts")
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Cochran <richardcochran@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210414030449.90692-1-chenjun102@huawei.com
|
|
While this code is executed with the wait_lock held, a reader can
acquire the lock without holding wait_lock. The writer side loops
checking the value with the atomic_cond_read_acquire(), but only truly
acquires the lock when the compare-and-exchange is completed
successfully which isn’t ordered. This exposes the window between the
acquire and the cmpxchg to an A-B-A problem which allows reads
following the lock acquisition to observe values speculatively before
the write lock is truly acquired.
We've seen a problem in epoll where the reader does a xchg while
holding the read lock, but the writer can see a value change out from
under it.
Writer | Reader
--------------------------------------------------------------------------------
ep_scan_ready_list() |
|- write_lock_irq() |
|- queued_write_lock_slowpath() |
|- atomic_cond_read_acquire() |
| read_lock_irqsave(&ep->lock, flags);
--> (observes value before unlock) | chain_epi_lockless()
| | epi->next = xchg(&ep->ovflist, epi);
| | read_unlock_irqrestore(&ep->lock, flags);
| |
| atomic_cmpxchg_relaxed() |
|-- READ_ONCE(ep->ovflist); |
A core can order the read of the ovflist ahead of the
atomic_cmpxchg_relaxed(). Switching the cmpxchg to use acquire
semantics addresses this issue at which point the atomic_cond_read can
be switched to use relaxed semantics.
Fixes: b519b56e378ee ("locking/qrwlock: Use atomic_cond_read_acquire() when spinning in qrwlock")
Signed-off-by: Ali Saidi <alisaidi@amazon.com>
[peterz: use try_cmpxchg()]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Tested-by: Steve Capper <steve.capper@arm.com>
|
|
With clang-11+, the code is broken due to my kvmalloc() conversion
(which predated the clang-11 support code) leaving one vmalloc() in
place. Fix that.
Link: https://lkml.kernel.org/r/20210412214210.6e1ecca9cdc5.I24459763acf0591d5e6b31c7e3a59890d802f79c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Tested-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Daniel Borkmann says:
====================
pull-request: bpf 2021-04-17
The following pull-request contains BPF updates for your *net* tree.
We've added 10 non-merge commits during the last 9 day(s) which contain
a total of 8 files changed, 175 insertions(+), 111 deletions(-).
The main changes are:
1) Fix a potential NULL pointer dereference in libbpf's xsk
umem handling, from Ciara Loftus.
2) Mitigate a speculative oob read of up to map value size by
tightening the masking window, from Daniel Borkmann.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This work tightens the offset mask we use for unprivileged pointer arithmetic
in order to mitigate a corner case reported by Piotr and Benedict where in
the speculative domain it is possible to advance, for example, the map value
pointer by up to value_size-1 out-of-bounds in order to leak kernel memory
via side-channel to user space.
Before this change, the computed ptr_limit for retrieve_ptr_limit() helper
represents largest valid distance when moving pointer to the right or left
which is then fed as aux->alu_limit to generate masking instructions against
the offset register. After the change, the derived aux->alu_limit represents
the largest potential value of the offset register which we mask against which
is just a narrower subset of the former limit.
For minimal complexity, we call sanitize_ptr_alu() from 2 observation points
in adjust_ptr_min_max_vals(), that is, before and after the simulated alu
operation. In the first step, we retieve the alu_state and alu_limit before
the operation as well as we branch-off a verifier path and push it to the
verification stack as we did before which checks the dst_reg under truncation,
in other words, when the speculative domain would attempt to move the pointer
out-of-bounds.
In the second step, we retrieve the new alu_limit and calculate the absolute
distance between both. Moreover, we commit the alu_state and final alu_limit
via update_alu_sanitation_state() to the env's instruction aux data, and bail
out from there if there is a mismatch due to coming from different verification
paths with different states.
Reported-by: Piotr Krysiuk <piotras@gmail.com>
Reported-by: Benedict Schlueter <benedict.schlueter@rub.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Benedict Schlueter <benedict.schlueter@rub.de>
|
|
Add a small sanitize_needed() helper function and move sanitize_val_alu()
out of the main opcode switch. In upcoming work, we'll move sanitize_ptr_alu()
as well out of its opcode switch so this helps to streamline both.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
|
|
Move the bounds check in adjust_ptr_min_max_vals() into a small helper named
sanitize_check_bounds() in order to simplify the former a bit.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
|
|
Consolidate all error handling and provide more user-friendly error messages
from sanitize_ptr_alu() and sanitize_val_alu().
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
|
|
Small refactor with no semantic changes in order to consolidate the max
ptr_limit boundary check.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
|
|
The mixed signed bounds check really belongs into retrieve_ptr_limit()
instead of outside of it in adjust_ptr_min_max_vals(). The reason is
that this check is not tied to PTR_TO_MAP_VALUE only, but to all pointer
types that we handle in retrieve_ptr_limit() and given errors from the latter
propagate back to adjust_ptr_min_max_vals() and lead to rejection of the
program, it's a better place to reside to avoid anything slipping through
for future types. The reason why we must reject such off_reg is that we
otherwise would not be able to derive a mask, see details in 9d7eceede769
("bpf: restrict unknown scalars of mixed signed bounds for unprivileged").
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
|
|
Small refactor to drag off_reg into sanitize_ptr_alu(), so we later on can
use off_reg for generalizing some of the checks for all pointer types.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
|
|
We forbid adding unknown scalars with mixed signed bounds due to the
spectre v1 masking mitigation. Hence this also needs bypass_spec_v1
flag instead of allow_ptr_leaks.
Fixes: 2c78ee898d8f ("bpf: Implement CAP_BPF")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
|
|
The broadcast device is switched to oneshot mode when the system switches
to oneshot mode. If a broadcast clock event device is registered after the
system switched to oneshot mode, it will stay in periodic mode forever.
Ensure that a late registered device which is selected as broadcast device
is initialized in oneshot mode when the system already uses oneshot mode.
[ tglx: Massage changelog ]
Signed-off-by: Jindong Yue <jindong.yue@nxp.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210331083318.21794-1-jindong.yue@nxp.com
|
|
The function tick_check_replacement() is the combination of
tick_check_percpu() and tick_check_preferred(), but tick_check_new_device()
has the same logic open coded.
Use the helper to simplify the code.
[ tglx: Massage changelog ]
Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210326022328.3266-1-wangwensheng4@huawei.com
|
|
The timecounter is not modified in this function. Mark it as const.
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210303103544.994855-1-mkl@pengutronix.de
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Fix a memory link in dyn_event_release().
An error path exited the function before freeing the allocated 'argv'
variable"
* tag 'trace-v5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/dynevent: Fix a memory leak in an error handling path
|
|
This change introduces a prctl that allows the user program to control
which PAC keys are enabled in a particular task. The main reason
why this is useful is to enable a userspace ABI that uses PAC to
sign and authenticate function pointers and other pointers exposed
outside of the function, while still allowing binaries conforming
to the ABI to interoperate with legacy binaries that do not sign or
authenticate pointers.
The idea is that a dynamic loader or early startup code would issue
this prctl very early after establishing that a process may load legacy
binaries, but before executing any PAC instructions.
This change adds a small amount of overhead to kernel entry and exit
due to additional required instruction sequences.
On a DragonBoard 845c (Cortex-A75) with the powersave governor, the
overhead of similar instruction sequences was measured as 4.9ns when
simulating the common case where IA is left enabled, or 43.7ns when
simulating the uncommon case where IA is disabled. These numbers can
be seen as the worst case scenario, since in more realistic scenarios
a better performing governor would be used and a newer chip would be
used that would support PAC unlike Cortex-A75 and would be expected
to be faster than Cortex-A75.
On an Apple M1 under a hypervisor, the overhead of the entry/exit
instruction sequences introduced by this patch was measured as 0.3ns
in the case where IA is left enabled, and 33.0ns in the case where
IA is disabled.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Link: https://linux-review.googlesource.com/id/Ibc41a5e6a76b275efbaa126b31119dc197b927a5
Link: https://lore.kernel.org/r/d6609065f8f40397a4124654eb68c9f490b4d477.1616123271.git.pcc@google.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
|
|
We must free 'argv' before returning, as already done in all the other
paths of this function.
Link: https://lkml.kernel.org/r/21e3594ccd7fc88c5c162c98450409190f304327.1618136448.git.christophe.jaillet@wanadoo.fr
Fixes: d262271d0483 ("tracing/dynevent: Delegate parsing to create function")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixlets from Ingo Molnar:
"Two minor fixes: one for a Clang warning, the other improves an
ambiguous/confusing kernel log message"
* tag 'locking-urgent-2021-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
lockdep: Address clang -Wformat warning printing for %hd
lockdep: Add a missing initialization hint to the "INFO: Trying to register non-static key" message
|
|
note_interrupt() increments desc->irq_count for each interrupt even for
percpu interrupt handlers, even when they are handled successfully. This
causes cacheline bouncing and limits scalability.
Instead of incrementing irq_count every time, only start incrementing it
after seeing an unhandled irq, which should avoid the cache line
bouncing in the common path.
This actually should give better consistency in handling misbehaving
irqs too, because instead of the first unhandled irq arriving at an
arbitrary point in the irq_count cycle, its arrival will begin the
irq_count cycle.
Cédric reports the result of his IPI throughput test:
Millions of IPIs/s
----------- --------------------------------------
upstream upstream patched
chips cpus default noirqdebug default (irqdebug)
----------- -----------------------------------------
1 0-15 4.061 4.153 4.084
0-31 7.937 8.186 8.158
0-47 11.018 11.392 11.233
0-63 11.460 13.907 14.022
2 0-79 8.376 18.105 18.084
0-95 7.338 22.101 22.266
0-111 6.716 25.306 25.473
0-127 6.223 27.814 28.029
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210402132037.574661-1-npiggin@gmail.com
|
|
KMSAN complains that new_value at cpumask_parse_user() from
write_irq_affinity() from irq_affinity_proc_write() is uninitialized.
[ 148.133411][ T5509] =====================================================
[ 148.135383][ T5509] BUG: KMSAN: uninit-value in find_next_bit+0x325/0x340
[ 148.137819][ T5509]
[ 148.138448][ T5509] Local variable ----new_value.i@irq_affinity_proc_write created at:
[ 148.140768][ T5509] irq_affinity_proc_write+0xc3/0x3d0
[ 148.142298][ T5509] irq_affinity_proc_write+0xc3/0x3d0
[ 148.143823][ T5509] =====================================================
Since bitmap_parse() from cpumask_parse_user() calls find_next_bit(),
any alloc_cpumask_var() + cpumask_parse_user() sequence has possibility
that find_next_bit() accesses uninitialized cpu mask variable. Fix this
problem by replacing alloc_cpumask_var() with zalloc_cpumask_var().
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20210401055823.3929-1-penguin-kernel@I-love.SAKURA.ne.jp
|
|
Merge misc fixes from Andrew Morton:
"14 patches.
Subsystems affected by this patch series: mm (kasan, gup, pagecache,
and kfence), MAINTAINERS, mailmap, nds32, gcov, ocfs2, ia64, and lib"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
lib: fix kconfig dependency on ARCH_WANT_FRAME_POINTERS
kfence, x86: fix preemptible warning on KPTI-enabled systems
lib/test_kasan_module.c: suppress unused var warning
kasan: fix conflict with page poisoning
fs: direct-io: fix missing sdio->boundary
ia64: fix user_stack_pointer() for ptrace()
ocfs2: fix deadlock between setattr and dio_end_io_write
gcov: re-fix clang-11+ support
nds32: flush_dcache_page: use page_mapping_file to avoid races with swapoff
mm/gup: check page posion status for coredump.
.mailmap: fix old email addresses
mailmap: update email address for Jordan Crouse
treewide: change my e-mail address, fix my name
MAINTAINERS: update CZ.NIC's Turris information
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.12-rc7, including fixes from can, ipsec,
mac80211, wireless, and bpf trees.
No scary regressions here or in the works, but small fixes for 5.12
changes keep coming.
Current release - regressions:
- virtio: do not pull payload in skb->head
- virtio: ensure mac header is set in virtio_net_hdr_to_skb()
- Revert "net: correct sk_acceptq_is_full()"
- mptcp: revert "mptcp: provide subflow aware release function"
- ethernet: lan743x: fix ethernet frame cutoff issue
- dsa: fix type was not set for devlink port
- ethtool: remove link_mode param and derive link params from driver
- sched: htb: fix null pointer dereference on a null new_q
- wireless: iwlwifi: Fix softirq/hardirq disabling in
iwl_pcie_enqueue_hcmd()
- wireless: iwlwifi: fw: fix notification wait locking
- wireless: brcmfmac: p2p: Fix deadlock introduced by avoiding the
rtnl dependency
Current release - new code bugs:
- napi: fix hangup on napi_disable for threaded napi
- bpf: take module reference for trampoline in module
- wireless: mt76: mt7921: fix airtime reporting and related tx hangs
- wireless: iwlwifi: mvm: rfi: don't lock mvm->mutex when sending
config command
Previous releases - regressions:
- rfkill: revert back to old userspace API by default
- nfc: fix infinite loop, refcount & memory leaks in LLCP sockets
- let skb_orphan_partial wake-up waiters
- xfrm/compat: Cleanup WARN()s that can be user-triggered
- vxlan, geneve: do not modify the shared tunnel info when PMTU
triggers an ICMP reply
- can: fix msg_namelen values depending on CAN_REQUIRED_SIZE
- can: uapi: mark union inside struct can_frame packed
- sched: cls: fix action overwrite reference counting
- sched: cls: fix err handler in tcf_action_init()
- ethernet: mlxsw: fix ECN marking in tunnel decapsulation
- ethernet: nfp: Fix a use after free in nfp_bpf_ctrl_msg_rx
- ethernet: i40e: fix receiving of single packets in xsk zero-copy
mode
- ethernet: cxgb4: avoid collecting SGE_QBASE regs during traffic
Previous releases - always broken:
- bpf: Refuse non-O_RDWR flags in BPF_OBJ_GET
- bpf: Refcount task stack in bpf_get_task_stack
- bpf, x86: Validate computation of branch displacements
- ieee802154: fix many similar syzbot-found bugs
- fix NULL dereferences in netlink attribute handling
- reject unsupported operations on monitor interfaces
- fix error handling in llsec_key_alloc()
- xfrm: make ipv4 pmtu check honor ip header df
- xfrm: make hash generation lock per network namespace
- xfrm: esp: delete NETIF_F_SCTP_CRC bit from features for esp
offload
- ethtool: fix incorrect datatype in set_eee ops
- xdp: fix xdp_return_frame() kernel BUG throw for page_pool memory
model
- openvswitch: fix send of uninitialized stack memory in ct limit
reply
Misc:
- udp: add get handling for UDP_GRO sockopt"
* tag 'net-5.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (182 commits)
net: fix hangup on napi_disable for threaded napi
net: hns3: Trivial spell fix in hns3 driver
lan743x: fix ethernet frame cutoff issue
net: ipv6: check for validity before dereferencing cfg->fc_nlinfo.nlh
net: dsa: lantiq_gswip: Configure all remaining GSWIP_MII_CFG bits
net: dsa: lantiq_gswip: Don't use PHY auto polling
net: sched: sch_teql: fix null-pointer dereference
ipv6: report errors for iftoken via netlink extack
net: sched: fix err handler in tcf_action_init()
net: sched: fix action overwrite reference counting
Revert "net: sched: bump refcount for new action in ACT replace mode"
ice: fix memory leak of aRFS after resuming from suspend
i40e: Fix sparse warning: missing error code 'err'
i40e: Fix sparse error: 'vsi->netdev' could be null
i40e: Fix sparse error: uninitialized symbol 'ring'
i40e: Fix sparse errors in i40e_txrx.c
i40e: Fix parameters in aq_get_phy_register()
nl80211: fix beacon head validation
bpf, x86: Validate computation of branch displacements for x86-32
bpf, x86: Validate computation of branch displacements for x86-64
...
|
|
LLVM changed the expected function signature for llvm_gcda_emit_function()
in the clang-11 release. Users of clang-11 or newer may have noticed
their kernels producing invalid coverage information:
$ llvm-cov gcov -a -c -u -f -b <input>.gcda -- gcno=<input>.gcno
1 <func>: checksum mismatch, \
(<lineno chksum A>, <cfg chksum B>) != (<lineno chksum A>, <cfg chksum C>)
2 Invalid .gcda File!
...
Fix up the function signatures so calling this function interprets its
parameters correctly and computes the correct cfg checksum. In
particular, in clang-11, the additional checksum is no longer optional.
Link: https://reviews.llvm.org/rG25544ce2df0daa4304c07e64b9c8b0f7df60c11d
Link: https://lkml.kernel.org/r/20210408184631.1156669-1-ndesaulniers@google.com
Reported-by: Prasad Sodagudi <psodagud@quicinc.com>
Tested-by: Prasad Sodagudi <psodagud@quicinc.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Cc: <stable@vger.kernel.org> [5.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The sole user of irq_create_identity_mapping() having been converted,
get rid of the unused helper.
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Since syzkaller continues various test cases until the kernel crashes,
syzkaller tends to examine more locking dependencies than normal systems.
As a result, syzbot is reporting that the fuzz testing was terminated
due to hitting upper limits lockdep can track [1] [2] [3]. Since analysis
via /proc/lockdep* did not show any obvious culprit [4] [5], we have no
choice but allow tuning tracing capacity constants.
[1] https://syzkaller.appspot.com/bug?id=3d97ba93fb3566000c1c59691ea427370d33ea1b
[2] https://syzkaller.appspot.com/bug?id=381cb436fe60dc03d7fd2a092b46d7f09542a72a
[3] https://syzkaller.appspot.com/bug?id=a588183ac34c1437fc0785e8f220e88282e5a29f
[4] https://lkml.kernel.org/r/4b8f7a57-fa20-47bd-48a0-ae35d860f233@i-love.sakura.ne.jp
[5] https://lkml.kernel.org/r/1c351187-253b-2d49-acaf-4563c63ae7d2@i-love.sakura.ne.jp
References: https://lkml.kernel.org/r/1595640639-9310-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
|
|
84;0;0c84;0;0c
There are two workqueue-specific watchdog timestamps:
+ @wq_watchdog_touched_cpu (per-CPU) updated by
touch_softlockup_watchdog()
+ @wq_watchdog_touched (global) updated by
touch_all_softlockup_watchdogs()
watchdog_timer_fn() checks only the global @wq_watchdog_touched for
unbound workqueues. As a result, unbound workqueues are not aware
of touch_softlockup_watchdog(). The watchdog might report a stall
even when the unbound workqueues are blocked by a known slow code.
Solution:
touch_softlockup_watchdog() must touch also the global @wq_watchdog_touched
timestamp.
The global timestamp can no longer be used for bound workqueues because
it is now updated from all CPUs. Instead, bound workqueues have to check
only @wq_watchdog_touched_cpu and these timestamps have to be updated for
all CPUs in touch_all_softlockup_watchdogs().
Beware:
The change might cause the opposite problem. An unbound workqueue
might get blocked on CPU A because of a real softlockup. The workqueue
watchdog would miss it when the timestamp got touched on CPU B.
It is acceptable because softlockups are detected by softlockup
watchdog. The workqueue watchdog is there to detect stalls where
a work never finishes, for example, because of dependencies of works
queued into the same workqueue.
V3:
- Modify the commit message clearly according to Petr's suggestion.
Signed-off-by: Wang Qing <wangqing@vivo.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The debug_work_activate() is called on the premise that
the work can be inserted, because if wq be in WQ_DRAINING
status, insert work may be failed.
Fixes: e41e704bc4f4 ("workqueue: improve destroy_workqueue() debuggability")
Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Fix stack trace entry size to stop showing garbage
The macro that creates both the structure and the format displayed to
user space for the stack trace event was changed a while ago to fix
the parsing by user space tooling. But this change also modified the
structure used to store the stack trace event. It changed the caller
array field from [0] to [8].
Even though the size in the ring buffer is dynamic and can be
something other than 8 (user space knows how to handle this), the 8
extra words was not accounted for when reserving the event on the ring
buffer, and added 8 more entries, due to the calculation of
"sizeof(*entry) + nr_entries * sizeof(long)", as the sizeof(*entry)
now contains 8 entries.
The size of the caller field needs to be subtracted from the size of
the entry to create the correct allocation size"
* tag 'trace-v5.12-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix stack trace event size
|
|
As for bpf_link, refuse creating a non-O_RDWR fd. Since program fds
currently don't allow modifications this is a precaution, not a
straight up bug fix.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210326160501.46234-2-lmb@cloudflare.com
|
|
Invoking BPF_OBJ_GET on a pinned bpf_link checks the path access
permissions based on file_flags, but the returned fd ignores flags.
This means that any user can acquire a "read-write" fd for a pinned
link with mode 0664 by invoking BPF_OBJ_GET with BPF_F_RDONLY in
file_flags. The fd can be used to invoke BPF_LINK_DETACH, etc.
Fix this by refusing non-O_RDWR flags in BPF_OBJ_GET. This works
because OBJ_GET by default returns a read write mapping and libbpf
doesn't expose a way to override this behaviour for programs
and links.
Fixes: 70ed506c3bbc ("bpf: Introduce pinnable bpf_link abstraction")
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210326160501.46234-1-lmb@cloudflare.com
|
|
On x86 the struct pt_regs * grabbed by task_pt_regs() points to an
offset of task->stack. The pt_regs are later dereferenced in
__bpf_get_stack (e.g. by user_mode() check). This can cause a fault if
the task in question exits while bpf_get_task_stack is executing, as
warned by task_stack_page's comment:
* When accessing the stack of a non-current task that might exit, use
* try_get_task_stack() instead. task_stack_page will return a pointer
* that could get freed out from under you.
Taking the comment's advice and using try_get_task_stack() and
put_task_stack() to hold task->stack refcount, or bail early if it's
already 0. Incrementing stack_refcount will ensure the task's stack
sticks around while we're using its data.
I noticed this bug while testing a bpf task iter similar to
bpf_iter_task_stack in selftests, except mine grabbed user stack, and
getting intermittent crashes, which resulted in dumps like:
BUG: unable to handle page fault for address: 0000000000003fe0
\#PF: supervisor read access in kernel mode
\#PF: error_code(0x0000) - not-present page
RIP: 0010:__bpf_get_stack+0xd0/0x230
<snip...>
Call Trace:
bpf_prog_0a2be35c092cb190_get_task_stacks+0x5d/0x3ec
bpf_iter_run_prog+0x24/0x81
__task_seq_show+0x58/0x80
bpf_seq_read+0xf7/0x3d0
vfs_read+0x91/0x140
ksys_read+0x59/0xd0
do_syscall_64+0x48/0x120
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: fa28dcb82a38 ("bpf: Introduce helper bpf_get_task_stack()")
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210401000747.3648767-1-davemarchevsky@fb.com
|
|
Commit cbc3b92ce037 fixed an issue to modify the macros of the stack trace
event so that user space could parse it properly. Originally the stack
trace format to user space showed that the called stack was a dynamic
array. But it is not actually a dynamic array, in the way that other
dynamic event arrays worked, and this broke user space parsing for it. The
update was to make the array look to have 8 entries in it. Helper
functions were added to make it parse it correctly, as the stack was
dynamic, but was determined by the size of the event stored.
Although this fixed user space on how it read the event, it changed the
internal structure used for the stack trace event. It changed the array
size from [0] to [8] (added 8 entries). This increased the size of the
stack trace event by 8 words. The size reserved on the ring buffer was the
size of the stack trace event plus the number of stack entries found in
the stack trace. That commit caused the amount to be 8 more than what was
needed because it did not expect the caller field to have any size. This
produced 8 entries of garbage (and reading random data) from the stack
trace event:
<idle>-0 [002] d... 1976396.837549: <stack trace>
=> trace_event_raw_event_sched_switch
=> __traceiter_sched_switch
=> __schedule
=> schedule_idle
=> do_idle
=> cpu_startup_entry
=> secondary_startup_64_no_verify
=> 0xc8c5e150ffff93de
=> 0xffff93de
=> 0
=> 0
=> 0xc8c5e17800000000
=> 0x1f30affff93de
=> 0x00000004
=> 0x200000000
Instead, subtract the size of the caller field from the size of the event
to make sure that only the amount needed to store the stack trace is
reserved.
Link: https://lore.kernel.org/lkml/your-ad-here.call-01617191565-ext-9692@work.hours/
Cc: stable@vger.kernel.org
Fixes: cbc3b92ce037 ("tracing: Set kernel_stack's caller size properly")
Reported-by: Vasily Gorbik <gor@linux.ibm.com>
Tested-by: Vasily Gorbik <gor@linux.ibm.com>
Acked-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fix from Steven Rostedt:
"Add check of order < 0 before calling free_pages()
The function addresses that are traced by ftrace are stored in pages,
and the size is held in a variable. If there's some error in creating
them, the allocate ones will be freed. In this case, it is possible
that the order of pages to be freed may end up being negative due to a
size of zero passed to get_count_order(), and then that negative
number will cause free_pages() to free a very large section.
Make sure that does not happen"
* tag 'trace-v5.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Check if pages were allocated before calling free_pages()
|