path: root/kernel
Age  Commit message  Author
2015-11-03  Merge branch 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds]
Pull RAS changes from Ingo Molnar:
 "The main system reliability related changes were from x86, but also
  some generic RAS changes:

   - AMD MCE error injection subsystem enhancements. (Aravind Gopalakrishnan)

   - Fix MCE and CPU hotplug interaction bug. (Ashok Raj)

   - kcrash bootup robustness fix. (Baoquan He)

   - kcrash cleanups. (Borislav Petkov)

   - x86 microcode driver rework: simplify it by unmodularizing it and
     other cleanups. (Borislav Petkov)"

* 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  x86/mce: Add a default case to the switch in __mcheck_cpu_ancient_init()
  x86/mce: Add a Scalable MCA vendor flags bit
  MAINTAINERS: Unify the microcode driver section
  x86/microcode/intel: Move #ifdef DEBUG inside the function
  x86/microcode/amd: Remove maintainers from comments
  x86/microcode: Remove modularization leftovers
  x86/microcode: Merge the early microcode loader
  x86/microcode: Unmodularize the microcode driver
  x86/mce: Fix thermal throttling reporting after kexec
  kexec/crash: Say which char is the unrecognized
  x86/setup/crash: Check memblock_reserve() retval
  x86/setup/crash: Cleanup some more
  x86/setup/crash: Remove alignment variable
  x86/setup: Cleanup crashkernel reservation functions
  x86/amd_nb, EDAC: Rename amd_get_node_id()
  x86/setup: Do not reserve crashkernel high memory if low reservation failed
  x86/microcode/amd: Do not overwrite final patch levels
  x86/microcode/amd: Extract current patch level read to a function
  x86/ras/mce_amd_inj: Inject bank 4 errors on the NBC
  x86/ras/mce_amd_inj: Trigger deferred and thresholding errors interrupts
  ...
2015-11-03  Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds]
Pull perf updates from Ingo Molnar:
 "Kernel side changes:

   - Improve accuracy of perf/sched clock on x86. (Adrian Hunter)

   - Intel DS and BTS updates. (Alexander Shishkin)

   - Intel cstate PMU support. (Kan Liang)

   - Add group read support to perf_event_read(). (Peter Zijlstra)

   - Branch call hardware sampling support, implemented on x86 and
     PowerPC. (Stephane Eranian)

   - Event groups transactional interface enhancements. (Sukadev Bhattiprolu)

   - Enable proper x86/intel/uncore PMU support on multi-segment PCI
     systems. (Taku Izumi)

   - ... misc fixes and cleanups.

  The perf tooling team was very busy again with 200+ commits, the full
  diff doesn't fit into lkml size limits. Here's an (incomplete) list of
  the tooling highlights:

  New features:

   - Change the default event used in all tools (record/top): use the
     most precise "cycles" hw counter available, i.e. when the user
     doesn't specify any event, it will try using cycles:ppp, cycles:pp,
     etc and fall back transparently until it finds a working counter.
     (Arnaldo Carvalho de Melo)

   - Integration of perf with eBPF that, given an eBPF .c source file
     (or .o file built for the 'bpf' target with clang), will get it
     automatically built, validated and loaded into the kernel via the
     sys_bpf syscall, which can then be used and seen using 'perf trace'
     and other tools. (Wang Nan)

  Various user interface improvements:

   - Automatic pager invocation on long help output. (Namhyung Kim)

   - Search for more options when passing args to -h, e.g.: (Arnaldo
     Carvalho de Melo)

	$ perf report -h interface

	Usage: perf report [<options>]

	   --gtk    Use the GTK2 interface
	   --stdio  Use the stdio interface
	   --tui    Use the TUI interface

   - Show ordered command line options when -h is used or when an
     unknown option is specified. (Arnaldo Carvalho de Melo)

   - If options are passed after -h, show just its descriptions, not all
     options. (Arnaldo Carvalho de Melo)

   - Implement column based horizontal scrolling in the hists browser
     (top, report), making it possible to use the TUI for things like
     'perf mem report' where there are many more columns than can fit in
     a terminal. (Arnaldo Carvalho de Melo)

   - Enhance the error reporting of tracepoint event parsing, e.g.:

	$ oldperf record -e sched:sched_switc usleep 1
	event syntax error: 'sched:sched_switc'
			     \___ unknown tracepoint
	Run 'perf list' for a list of valid events

     Now we get the much nicer:

	$ perf record -e sched:sched_switc ls
	event syntax error: 'sched:sched_switc'
			     \___ can't access trace events
	Error: No permissions to read /sys/kernel/debug/tracing/events/sched/sched_switc
	Hint:  Try 'sudo mount -o remount,mode=755 /sys/kernel/debug'

     And after we have those mount point permissions fixed:

	$ perf record -e sched:sched_switc ls
	event syntax error: 'sched:sched_switc'
			     \___ unknown tracepoint
	Error: File /sys/kernel/debug/tracing/events/sched/sched_switc not found.
	Hint:  Perhaps this kernel misses some CONFIG_ setting to enable this feature?.

     I.e. basically now the event parsing routine uses the strerror_open()
     routines introduced by and used in 'perf trace' work. (Jiri Olsa)

   - Fail properly when pattern matching fails to find a tracepoint,
     i.e. '-e non:existent' was being correctly handled, with a proper
     error message about that not being a valid event, but
     '-e non:existent*' wasn't, fix it. (Jiri Olsa)

   - Do event name substring search as last resort in 'perf list'.
     (Arnaldo Carvalho de Melo)

     E.g.:

	# perf list clock

	List of pre-defined events (to be used in -e):

	 cpu-clock                           [Software event]
	 task-clock                          [Software event]
	 uncore_cbox_0/clockticks/           [Kernel PMU event]
	 uncore_cbox_1/clockticks/           [Kernel PMU event]
	 kvm:kvm_pvclock_update              [Tracepoint event]
	 kvm:kvm_update_master_clock         [Tracepoint event]
	 power:clock_disable                 [Tracepoint event]
	 power:clock_enable                  [Tracepoint event]
	 power:clock_set_rate                [Tracepoint event]
	 syscalls:sys_enter_clock_adjtime    [Tracepoint event]
	 syscalls:sys_enter_clock_getres     [Tracepoint event]
	 syscalls:sys_enter_clock_gettime    [Tracepoint event]
	 syscalls:sys_enter_clock_nanosleep  [Tracepoint event]
	 syscalls:sys_enter_clock_settime    [Tracepoint event]
	 syscalls:sys_exit_clock_adjtime     [Tracepoint event]
	 syscalls:sys_exit_clock_getres      [Tracepoint event]
	 syscalls:sys_exit_clock_gettime     [Tracepoint event]
	 syscalls:sys_exit_clock_nanosleep   [Tracepoint event]
	 syscalls:sys_exit_clock_settime     [Tracepoint event]

  Intel PT hardware tracing enhancements:

   - Accept a zero --itrace period, meaning "as often as possible". In
     the case of Intel PT that is the same as a period of 1 and a unit
     of 'instructions' (i.e. --itrace=i1i). (Adrian Hunter)

   - Harmonize itrace's synthesized callchains with the existing
     --max-stack tool option. (Adrian Hunter)

   - Allow time to be displayed in nanoseconds in 'perf script'.
     (Adrian Hunter)

   - Fix potential infinite loop when handling Intel PT timestamps.
     (Adrian Hunter)

   - Slightly improve Intel PT debug logging. (Adrian Hunter)

   - Warn when AUX data has been lost, just like when processing
     PERF_RECORD_LOST. (Adrian Hunter)

   - Further document export-to-postgresql.py script. (Adrian Hunter)

   - Add option to synthesize branch stack from auxtrace data.
     (Adrian Hunter)

  Misc notable changes:

   - Switch the default callchain output mode to 'graph,0.5,caller', to
     make it look like the default for other tools, reducing the
     learning curve for people used to 'caller' based viewing.
     (Arnaldo Carvalho de Melo)

   - Various call chain usability enhancements. (Namhyung Kim)

   - Introduce the 'P' event modifier, meaning 'max precision level,
     please', i.e.:

	$ perf record -e cycles:P usleep 1

     Is now similar to:

	$ perf record usleep 1

     Useful, for instance, when specifying multiple events. (Jiri Olsa)

   - Add 'socket' sort entry, to sort by the processor socket in
     'perf top' and 'perf report'. (Kan Liang)

   - Introduce --socket-filter to 'perf report', for filtering by
     processor socket. (Kan Liang)

   - Add new "Zoom into Processor Socket" operation in the perf hists
     browser, used in 'perf top' and 'perf report'. (Kan Liang)

   - Allow probing on kmodules without DWARF. (Masami Hiramatsu)

   - Fix 'perf probe -l' for probes added to kernel module functions.
     (Masami Hiramatsu)

   - Preparatory work for the 'perf stat record' feature that will allow
     generating perf.data files with counting data in addition to the
     sampling mode we have now. (Jiri Olsa)

   - Update libtraceevent KVM plugin. (Paolo Bonzini)

   - ... plus lots of other enhancements that I failed to list properly,
     by: Adrian Hunter, Alexander Shishkin, Andi Kleen, Andrzej Hajda,
     Arnaldo Carvalho de Melo, Dima Kogan, Don Zickus, Geliang Tang,
     He Kuang, Huaitong Han, Ingo Molnar, Jan Stancek, Jiri Olsa,
     Kan Liang, Kirill Tkhai, Masami Hiramatsu, Matt Fleming,
     Namhyung Kim, Paolo Bonzini, Peter Zijlstra, Rabin Vincent,
     Scott Wood, Stephane Eranian, Sukadev Bhattiprolu, Taku Izumi,
     Vaishali Thakkar, Wang Nan, Yang Shi and Yunlong Song"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (260 commits)
  perf unwind: Pass symbol source to libunwind
  tools build: Fix libiberty feature detection
  perf tools: Compile scriptlets to BPF objects when passing '.c' to --event
  perf record: Add clang options for compiling BPF scripts
  perf bpf: Attach eBPF filter to perf event
  perf tools: Make sure fixdep is built before libbpf
  perf script: Enable printing of branch stack
  perf trace: Add cmd string table to decode sys_bpf first arg
  perf bpf: Collect perf_evsel in BPF object files
  perf tools: Load eBPF object into kernel
  perf tools: Create probe points for BPF programs
  perf tools: Enable passing bpf object file to --event
  perf ebpf: Add the libbpf glue
  perf tools: Make perf depend on libbpf
  perf symbols: Fix endless loop in dso__split_kallsyms_for_kcore
  perf tools: Enable pre-event inherit setting by config terms
  perf symbols: we can now read separate debug-info files based on a build ID
  perf symbols: Fix type error when reading a build-id
  perf tools: Search for more options when passing args to -h
  perf stat: Cache aggregated map entries in extra cpumap
  ...
2015-11-03  atomic: remove all traces of READ_ONCE_CTRL() and atomic*_read_ctrl()  [Linus Torvalds]
This seems to be a mis-reading of how alpha memory ordering works, and
is not backed up by the alpha architecture manual. The helper functions
don't do anything special on any other architectures, and the arguments
that support them being safe on other architectures also argue that
they are safe on alpha.

Basically, the "control dependency" is between a previous read and a
subsequent write that is dependent on the value read. Even if the
subsequent write is actually done speculatively, there is no way that
such a speculative write could be made visible to other cpu's until it
has been committed, which requires validating the speculation.

Note that most weakly ordered architectures (very much including alpha)
do not guarantee any ordering relationship between two loads that
depend on each other on a control dependency:

    read A
    if (val == 1)
        read B

because the conditional may be predicted, and the "read B" may be
speculatively moved up to before reading the value A. So we require the
user to insert a smp_rmb() between the two accesses to be correct:

    read A;
    if (A == 1)
        smp_rmb()
    read B

Alpha is further special in that it can break that ordering even if the
*address* of B depends on the read of A, because the cacheline that is
read later may be stale unless you have a memory barrier in between the
pointer read and the read of the value behind a pointer:

    read ptr
    read offset(ptr)

whereas all other weakly ordered architectures guarantee that the data
dependency (as opposed to just a control dependency) will order the two
accesses. As a result, alpha needs a "smp_read_barrier_depends()" in
between those two reads for them to be ordered.

The control dependency that "READ_ONCE_CTRL()" and "atomic_read_ctrl()"
had was a control dependency to a subsequent *write*, however, and
nobody can finalize such a subsequent write without having actually done
the read. And were you to write such a value to a "stale" cacheline (the
way the unordered reads came to be), that would seem to lose the write
entirely.

So the things that make alpha able to re-order reads even more
aggressively than other weak architectures do not seem to be relevant
for a subsequent write. Alpha memory ordering may be strange, but
there's no real indication that it is *that* strange.

Also, the alpha architecture reference manual very explicitly talks
about the definition of "Dependence Constraints" in section 5.6.1.7,
where a preceding read dominates a subsequent write.

Such a dependence constraint admittedly does not impose a BEFORE (alpha
architecture term for globally visible ordering), but it does guarantee
that there can be no "causal loop". I don't see how you could avoid such
a loop if another cpu could see the stored value and then impact the
value of the first read. Put another way: the read and the write could
not be seen as being out of order wrt other cpus.

So I do not see how these "x_ctrl()" functions can currently be
necessary.

I may have to eat my words at some point, but in the absence of clear
proof that alpha actually needs this, or indeed even an explanation of
how alpha could _possibly_ need it, I do not believe these functions are
called for.

And if it turns out that alpha really _does_ need a barrier for this
case, that barrier still should not be "smp_read_barrier_depends()".
We'd have to make up some new speciality barrier just for alpha, along
with the documentation for why it really is necessary.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul E McKenney <paulmck@us.ibm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
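A minimal C sketch of the two cases discussed above; 'flag', 'data',
'reader' and 'writer' are hypothetical names, not from the patch:

	int flag, data;

	int reader(void)
	{
		int a = READ_ONCE(flag);		/* read A */

		if (a == 1) {
			/* A branch does NOT order two loads: the CPU may
			 * speculate past the conditional, so an explicit
			 * read barrier is needed before the second load.
			 */
			smp_rmb();
			return READ_ONCE(data);		/* read B */
		}
		return 0;
	}

	void writer(void)
	{
		int a = READ_ONCE(flag);

		/* A load->store control dependency, by contrast, cannot
		 * become visible speculatively: the store below cannot be
		 * committed before the load it depends on, which is the
		 * argument for removing the *_ctrl() variants.
		 */
		if (a == 1)
			WRITE_ONCE(data, 42);
	}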
2015-11-03  Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds]
Pull locking changes from Ingo Molnar:
 "The main changes in this cycle were:

   - More gradual enhancements to atomic ops: new atomic*_read_ctrl()
     ops, synchronize atomic_{read,set}() ordering requirements between
     architectures, add atomic_long_t bitops. (Peter Zijlstra)

   - Add _{relaxed|acquire|release}() variants for inc/dec atomics and
     use them in various locking primitives: mutex, rtmutex, mcs, rwsem.
     This enables weakly ordered architectures (such as arm64) to make
     use of more locking related optimizations. (Davidlohr Bueso)

   - Implement atomic[64]_{inc,dec}_relaxed() on ARM. (Will Deacon)

   - Futex kernel data cache footprint micro-optimization. (Rasmus Villemoes)

   - pvqspinlock runtime overhead micro-optimization. (Waiman Long)

   - misc smaller fixlets"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  ARM, locking/atomics: Implement _relaxed variants of atomic[64]_{inc,dec}
  locking/rwsem: Use acquire/release semantics
  locking/mcs: Use acquire/release semantics
  locking/rtmutex: Use acquire/release semantics
  locking/mutex: Use acquire/release semantics
  locking/asm-generic: Add _{relaxed|acquire|release}() variants for inc/dec atomics
  atomic: Implement atomic_read_ctrl()
  atomic, arch: Audit atomic_{read,set}()
  atomic: Add atomic_long_t bitops
  futex: Force hot variables into a single cache line
  locking/pvqspinlock: Kick the PV CPU unconditionally when _Q_SLOW_VAL
  locking/osq: Relax atomic semantics
  locking/qrwlock: Rename ->lock to ->wait_lock
  locking/Documentation/lockstat: Fix typo - lokcing -> locking
  locking/atomics, cmpxchg: Privatize the inclusion of asm/cmpxchg.h
2015-11-03  Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds]
Pull RCU changes from Ingo Molnar:
 "The main changes in this cycle were:

   - Improvements to expedited grace periods (Paul E McKenney)

   - Performance improvements to and locktorture tests for percpu-rwsem
     (Oleg Nesterov, Paul E McKenney)

   - Torture-test changes (Paul E McKenney, Davidlohr Bueso)

   - Documentation updates (Paul E McKenney)

   - Miscellaneous fixes (Paul E McKenney, Boqun Feng, Oleg Nesterov,
     Patrick Marlier)"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
  fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in bdi_split_work_to_wbs()
  rcu: Better hotplug handling for synchronize_sched_expedited()
  rcu: Enable stall warnings for synchronize_rcu_expedited()
  rcu: Add tasks to expedited stall-warning messages
  rcu: Add online/offline info to expedited stall warning message
  rcu: Consolidate expedited CPU selection
  rcu: Prepare for consolidating expedited CPU selection
  cpu: Remove try_get_online_cpus()
  rcu: Stop excluding CPU hotplug in synchronize_sched_expedited()
  rcu: Stop silencing lockdep false positive for expedited grace periods
  rcu: Switch synchronize_sched_expedited() to IPI
  locktorture: Fix module unwind when bad torture_type specified
  torture: Forgive non-plural arguments
  rcutorture: Fix unused-function warning for torturing_tasks()
  rcutorture: Fix module unwind when bad torture_type specified
  rcu_sync: Cleanup the CONFIG_PROVE_RCU checks
  locking/percpu-rwsem: Clean up the lockdep annotations in percpu_down_read()
  locking/percpu-rwsem: Fix the comments outdated by rcu_sync
  locking/percpu-rwsem: Make use of the rcu_sync infrastructure
  locking/percpu-rwsem: Make percpu_free_rwsem() after kzalloc() safe
  ...
2015-11-03  Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds]
Pull irq updates from Thomas Gleixner:
 "The irq department delivers:

   - Rework the irqdomain core infrastructure to accommodate ACPI based
     systems. This is required to support ARM64 without creating
     artificial device tree nodes.

   - Sanitize the ACPI based ARM GIC initialization by making use of the
     new firmware independent irqdomain core

   - Further improvements to the generic MSI management

   - Generalize the irq migration on CPU hotplug

   - Improvements to the threaded interrupt infrastructure

   - Allow the migration of "chained" low level interrupt handlers

   - Allow optional force masking of interrupts in disable_irq[_nosync]

   - Support for two new interrupt chips - Sigh!

   - A larger set of errata fixes for ARM gicv3

   - The usual pile of fixes, updates, improvements and cleanups all
     over the place"

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
  Document that IRQ_NONE should be returned when IRQ not actually handled
  PCI/MSI: Allow the MSI domain to be device-specific
  PCI: Add per-device MSI domain hook
  of/irq: Use the msi-map property to provide device-specific MSI domain
  of/irq: Split of_msi_map_rid to reuse msi-map lookup
  irqchip/gic-v3-its: Parse new version of msi-parent property
  PCI/MSI: Use of_msi_get_domain instead of open-coded "msi-parent" parsing
  of/irq: Use of_msi_get_domain instead of open-coded "msi-parent" parsing
  of/irq: Add support code for multi-parent version of "msi-parent"
  irqchip/gic-v3-its: Add handling of PCI requester id.
  PCI/MSI: Add helper function pci_msi_domain_get_msi_rid().
  of/irq: Add new function of_msi_map_rid()
  Docs: dt: Add PCI MSI map bindings
  irqchip/gic-v2m: Add support for multiple MSI frames
  irqchip/gic-v3: Fix translation of LPIs after conversion to irq_fwspec
  irqchip/mxs: Add Alphascale ASM9260 support
  irqchip/mxs: Prepare driver for hardware with different offsets
  irqchip/mxs: Panic if ioremap or domain creation fails
  irqdomain: Documentation updates
  irqdomain/msi: Use fwnode instead of of_node
  ...
2015-11-03  Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds]
Pull timer updates from Thomas Gleixner:
 "The timer department provides:

   - More y2038 work in the area of ntp and pps.

   - Optimization of posix cpu timers

   - New time related selftests

   - Some new clocksource drivers

   - The usual pile of fixes, cleanups and improvements"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
  timeconst: Update path in comment
  timers/x86/hpet: Type adjustments
  clocksource/drivers/armada-370-xp: Implement ARM delay timer
  clocksource/drivers/tango_xtal: Add new timer for Tango SoCs
  clocksource/drivers/imx: Allow timer irq affinity change
  clocksource/drivers/exynos_mct: Use container_of() instead of this_cpu_ptr()
  clocksource/drivers/h8300_*: Remove unneeded memset()s
  clocksource/drivers/sh_cmt: Remove unneeded memset() in sh_cmt_setup()
  clocksource/drivers/em_sti: Remove unneeded memset()s
  clocksource/drivers/mediatek: Use GPT as sched clock source
  clockevents/drivers/mtk: Fix spurious interrupt leading to crash
  posix_cpu_timer: Reduce unnecessary sighand lock contention
  posix_cpu_timer: Convert cputimer->running to bool
  posix_cpu_timer: Check thread timers only when there are active thread timers
  posix_cpu_timer: Optimize fastpath_timer_check()
  timers, kselftest: Add 'adjtick' test to validate adjtimex() tick adjustments
  timers: Use __fls in apply_slack()
  clocksource: Remove return statement from void functions
  net: sfc: avoid using timespec
  ntp/pps: use y2038 safe types in pps_event_time
  ...
2015-11-03  ring_buffer: Remove unneeded smp_wmb() before wakeup of reader benchmark  [Steven Rostedt (Red Hat)]
wake_up_process() has a memory barrier before doing anything, thus adding a memory barrier before calling it is redundant. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-03  tracing: Allow dumping traces without tracking trace started cpus  [Sasha Levin]
We don't init iter->started when dumping the ftrace buffer, and there's no real need to do so - so allow skipping that check if the iter doesn't have an initialized ->started cpumask. Link: http://lkml.kernel.org/r/1441385156-27279-1-git-send-email-sasha.levin@oracle.com Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-03  ring_buffer: Fix more races when terminating the producer in the benchmark  [Petr Mladek]
The commit b44754d8262d3aab8 ("ring_buffer: Allow to exit the ring
buffer benchmark immediately") added a hack into ring_buffer_producer()
that set @kill_test when kthread_should_stop() returned true. It
improved the situation a lot. It stopped the kthread in most cases
because the producer spent most of the time in the patched while loop.

But there are still a few possible races when kthread_should_stop() is
set outside of the loop. Then we do not set @kill_test and some other
checks pass.

This patch adds a better fix. It renames @test_kill/TEST_KILL() to the
more descriptive @test_error/TEST_ERROR(). It also introduces a
break_test() function that checks for both @test_error and
kthread_should_stop(); see the sketch below.

The new function is used in the producer when the check for @test_error
is not enough. It is not used in the consumer because its state is
manipulated by the producer via the "reader_finish" variable.

Also we add a missing check into ring_buffer_producer_thread() between
setting TASK_INTERRUPTIBLE and calling schedule_timeout(). Otherwise,
we might miss a wakeup from kthread_stop().

Link: http://lkml.kernel.org/r/1441629518-32712-3-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
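A minimal sketch of the break_test() helper described above, assuming
the renamed @test_error flag:

	static bool test_error;

	/* True when the benchmark should stop: either a test error was
	 * flagged or the kthread is being stopped from the outside.
	 */
	static bool break_test(void)
	{
		return test_error || kthread_should_stop();
	}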
2015-11-03  ring_buffer: Do not complete benchmark reader too early  [Petr Mladek]
It seems that complete(&read_done) might be called too early in some
situations.

1st scenario:
-------------

  CPU0					CPU1

  ring_buffer_producer_thread()
    wake_up_process(consumer);
    wait_for_completion(&read_start);

					ring_buffer_consumer_thread()
					  complete(&read_start);

    ring_buffer_producer()
      # producing data in
      # the do-while cycle

					  ring_buffer_consumer();
					    # reading data
					    # got error
					    # set kill_test = 1;
					    set_current_state(
						TASK_INTERRUPTIBLE);
					    if (reader_finish)  # false
					    schedule();

      # producer still in the middle of
      # do-while cycle
      if (consumer && !(cnt % wakeup_interval))
	wake_up_process(consumer);

					    # spurious wakeup
					    while (!reader_finish &&
						   !kill_test)
					    # leaving because
					    # kill_test == 1
					    reader_finish = 0;
					    complete(&read_done);

1st BANG: We might access uninitialized "read_done" if this is the
	  first round.

      # producer finally leaving
      # the do-while cycle because kill_test == 1;

      if (consumer) {
	reader_finish = 1;
	wake_up_process(consumer);
	wait_for_completion(&read_done);

2nd BANG: This will never complete because consumer already did
	  the completion.

2nd scenario:
-------------

  CPU0					CPU1

  ring_buffer_producer_thread()
    wake_up_process(consumer);
    wait_for_completion(&read_start);

					ring_buffer_consumer_thread()
					  complete(&read_start);

    ring_buffer_producer()
      # CPU3 removes the module	     <--- difference from
      # and stops producer	     <--- the 1st scenario
      if (kthread_should_stop())
	kill_test = 1;

					  ring_buffer_consumer();
					    while (!reader_finish &&
						   !kill_test)
					    # kill_test == 1 => we never go
					    # into the top level while()
					    reader_finish = 0;
					    complete(&read_done);

      # producer still in the middle of
      # do-while cycle
      if (consumer && !(cnt % wakeup_interval))
	wake_up_process(consumer);

					    # spurious wakeup
					    while (!reader_finish &&
						   !kill_test)
					    # leaving because kill_test == 1
					    reader_finish = 0;
					    complete(&read_done);

BANG: We are in the same "bang" situations as in the 1st scenario.

Root of the problem:
--------------------

ring_buffer_consumer() must complete "read_done" only when
"reader_finish" variable is set. It must not be skipped due to other
conditions.

Note that we still must keep the check for "reader_finish" in a loop
because there might be spurious wakeups as described in the above
scenarios.

Solution:
---------

The top level cycle in ring_buffer_consumer() will finish only when
"reader_finish" is set. The data will be read in "while-do" cycle so
that they are not read after an error (kill_test == 1) or a spurious
wake up.

In addition, "reader_finish" is manipulated by the producer thread.
Therefore we add READ_ONCE() to make sure that the fresh value is read
in each cycle. Also we add the corresponding barrier to synchronize the
sleep check.

Next we set the state back to TASK_RUNNING for the situation where we
did not sleep.

Just for paranoid reasons, we initialize both completions statically.
This is safer, in case there are other races that we are unaware of.

As a side effect we could remove the memory barrier from
ring_buffer_producer_thread(). IMHO, this was the reason for the
barrier. ring_buffer_reset() uses spin locks that should provide the
needed memory barrier for using the buffer.

Link: http://lkml.kernel.org/r/1441629518-32712-2-git-send-email-pmladek@suse.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
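A sketch of the sleep/wakeup pattern the solution relies on (not the
literal patched code): the task state is set before the exit condition
is re-checked, "reader_finish" is re-read each iteration with
READ_ONCE(), and the state is reset in case we never slept:

	while (!READ_ONCE(reader_finish)) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (!READ_ONCE(reader_finish))
			schedule();
		/* read data here, before re-checking the exit condition */
	}
	__set_current_state(TASK_RUNNING);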
2015-11-03  tracing: Remove redundant TP_ARGS redefining  [Dmitry Safonov]
TP_ARGS is not used anywhere in trace.h nor trace_entries.h. At first I
left just the #undef TP_ARGS in place and got no build errors, so
remove the redefinition entirely.

Link: http://lkml.kernel.org/r/1446576560-14085-1-git-send-email-0x7f454c46@gmail.com
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-03  tracing: Rename max_stack_lock to stack_trace_max_lock  [Steven Rostedt (Red Hat)]
Now that max_stack_lock is a global variable, it requires a naming convention that is unlikely to collide. Rename it to the same naming convention that the other stack_trace variables have. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-03  tracing: Allow arch-specific stack tracer  [AKASHI Takahiro]
A stack frame may be used in a different way depending on cpu
architecture. Thus it is not always appropriate to slurp the stack
contents, as the current check_stack() does, in order to calculate a
stack index (height) at a given function call. At least not on arm64.
In addition, there is a possibility that we will mistakenly detect a
stale stack frame which has not been overwritten.

This patch makes check_stack() a weak function so that an arch-specific
version can be implemented later.

Link: http://lkml.kernel.org/r/1446182741-31019-5-git-send-email-takahiro.akashi@linaro.org
Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
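The mechanism boils down to the weak-symbol pattern; an illustration of
the idea (not the full tracer logic, and the exact signature may differ):

	/* Generic implementation; an architecture can now provide its
	 * own check_stack() that overrides this weak symbol.
	 */
	void __weak check_stack(unsigned long ip, unsigned long *stack)
	{
		/* slurp the stack contents and compute the stack index ... */
	}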
2015-11-03  bpf, verifier: annotate verbose printer with __printf  [Daniel Borkmann]
The verbose() printer dumps the verifier state to user space, so let gcc take care to check calls to verbose() for (future) errors. make with W=1 correctly suggests: function might be possible candidate for 'gnu_printf' format attribute [-Wsuggest-attribute=format]. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
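The annotation itself is one line on the declaration; a sketch (the
function body is elided):

	/* Tell gcc to type-check verbose() calls like printf(fmt, ...). */
	static __printf(1, 2) void verbose(const char *fmt, ...)
	{
		/* vscnprintf() into the verifier log buffer ... */
	}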
2015-11-02  bpf: add support for persistent maps/progs  [Daniel Borkmann]
This work adds support for "persistent" eBPF maps/programs. The term
"persistent" here means that maps/programs have a facility that lets
them survive process termination. This is desired by various eBPF
subsystem users.

Just to name one example: tc classifier/action. Whenever tc parses the
ELF object, extracts and loads maps/progs into the kernel, these file
descriptors will be out of reach after the tc instance exits. So a
subsequent tc invocation won't be able to access/relocate on this
resource, and therefore maps cannot easily be shared, f.e. between the
ingress and egress networking data path.

The current workaround is that Unix domain sockets (UDS) need to be
instrumented in order to pass the created eBPF map/program file
descriptors to a third party management daemon through UDS' socket
passing facility. This makes it a bit complicated to deploy shared eBPF
maps or programs (programs f.e. for tail calls) among various processes.

We've been brainstorming on how we could tackle this issue and various
approaches have been tried out so far, which can be read up further in
the below reference.

The architecture we eventually ended up with is a minimal file system
that can hold map/prog objects. The file system is a per mount namespace
singleton, and the default mount point is /sys/fs/bpf/. Any subsequent
mounts within a given namespace will point to the same instance. The
file system allows for creating a user-defined directory structure. The
objects for maps/progs are created/fetched through bpf(2) with two new
commands (BPF_OBJ_PIN/BPF_OBJ_GET). I.e. a bpf file descriptor along
with a pathname is being passed to bpf(2) that in turn creates (we call
it eBPF object pinning) the file system nodes. Only the pathname is
being passed to bpf(2) for getting a new BPF file descriptor to an
existing node. The user can use that to access maps and progs later on,
through bpf(2). Removal of file system nodes is being managed through
normal VFS functions such as unlink(2), etc. The file system code is
kept to a very minimum and can be further extended later on.

The next step I'm working on is to add dump eBPF map/prog commands to
bpf(2), so that a specification from a given file descriptor can be
retrieved. This can be used by things like CRIU but also applications
can inspect the meta data after calling BPF_OBJ_GET.

Big thanks also to Alexei and Hannes who significantly contributed in
the design discussion that eventually let us end up with this
architecture here.

Reference: https://lkml.org/lkml/2015/10/15/925

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
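A hypothetical userspace sketch of the two new commands (the pin path
"/sys/fs/bpf/my_map" is made up for illustration):

	#include <linux/bpf.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int bpf_obj_pin(int fd, const char *pathname)
	{
		union bpf_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.pathname = (__u64)(unsigned long)pathname;
		attr.bpf_fd = fd;

		return syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));
	}

	static int bpf_obj_get(const char *pathname)
	{
		union bpf_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.pathname = (__u64)(unsigned long)pathname;

		return syscall(__NR_bpf, BPF_OBJ_GET, &attr, sizeof(attr));
	}

	/* bpf_obj_pin(map_fd, "/sys/fs/bpf/my_map");
	 * map_fd = bpf_obj_get("/sys/fs/bpf/my_map");
	 */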
2015-11-02  bpf: consolidate bpf_prog_put{, _rcu} dismantle paths  [Daniel Borkmann]
We currently have duplicated cleanup code in bpf_prog_put() and bpf_prog_put_rcu() cleanup paths. Back then we decided that it was not worth it to make it a common helper called by both, but with the recent addition of resource charging, we could have avoided the fix in commit ac00737f4e81 ("bpf: Need to call bpf_prog_uncharge_memlock from bpf_prog_put") if we would have had only a single, common path. We can simplify it further by assigning aux->prog only once during allocation time. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02  bpf: align and clean bpf_{map,prog}_get helpers  [Daniel Borkmann]
Add a bpf_map_get() function that we're going to use later on and
align/clean the remaining helpers a bit so that we have them more
consistent:

 - __bpf_map_get() and __bpf_prog_get() that both work on the fd
   struct, check whether the descriptor is eBPF and return the pointer
   to the map/prog stored in the private data. Also, we can return
   f.file->private_data directly, the function signature is enough of
   a documentation already.

 - bpf_map_get() and bpf_prog_get() that both work on u32 user fd,
   call their respective __bpf_map_get()/__bpf_prog_get() variants,
   and take a reference.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
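A sketch of the double-underscore variant's shape as described above
(validate the fd, return the pointer, take no reference):

	static struct bpf_map *__bpf_map_get(struct fd f)
	{
		if (!f.file)
			return ERR_PTR(-EBADF);
		if (f.file->f_op != &bpf_map_fops) {
			fdput(f);
			return ERR_PTR(-EINVAL);
		}
		/* the signature is documentation enough */
		return f.file->private_data;
	}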
2015-11-02  bpf: abstract anon_inode_getfd invocations  [Daniel Borkmann]
Since we're going to use anon_inode_getfd() invocations in more than just the current places, make a helper function for both, so that we only need to pass a map/prog pointer to the helper itself in order to get a fd. The new helpers are called bpf_map_new_fd() and bpf_prog_new_fd(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
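Both helpers reduce to a single anon_inode_getfd() call; a sketch of
the map variant:

	int bpf_map_new_fd(struct bpf_map *map)
	{
		return anon_inode_getfd("bpf-map", &bpf_map_fops, map,
					O_RDWR | O_CLOEXEC);
	}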
2015-11-02  bpf: convert hashtab lock to raw lock  [Yang Shi]
When running bpf samples on rt kernel, it reports the below warning:

  BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:917
  in_atomic(): 1, irqs_disabled(): 128, pid: 477, name: ping
  Preemption disabled at:[<ffff80000017db58>] kprobe_perf_func+0x30/0x228

  CPU: 3 PID: 477 Comm: ping Not tainted 4.1.10-rt8 #4
  Hardware name: Freescale Layerscape 2085a RDB Board (DT)
  Call trace:
  [<ffff80000008a5b0>] dump_backtrace+0x0/0x128
  [<ffff80000008a6f8>] show_stack+0x20/0x30
  [<ffff8000007da90c>] dump_stack+0x7c/0xa0
  [<ffff8000000e4830>] ___might_sleep+0x188/0x1a0
  [<ffff8000007e2200>] rt_spin_lock+0x28/0x40
  [<ffff80000018bf9c>] htab_map_update_elem+0x124/0x320
  [<ffff80000018c718>] bpf_map_update_elem+0x40/0x58
  [<ffff800000187658>] __bpf_prog_run+0xd48/0x1640
  [<ffff80000017ca6c>] trace_call_bpf+0x8c/0x100
  [<ffff80000017db58>] kprobe_perf_func+0x30/0x228
  [<ffff80000017dd84>] kprobe_dispatcher+0x34/0x58
  [<ffff8000007e399c>] kprobe_handler+0x114/0x250
  [<ffff8000007e3bf4>] kprobe_breakpoint_handler+0x1c/0x30
  [<ffff800000085b80>] brk_handler+0x88/0x98
  [<ffff8000000822f0>] do_debug_exception+0x50/0xb8
  Exception stack(0xffff808349687460 to 0xffff808349687580)
  7460: 4ca2b600 ffff8083 4a3a7000 ffff8083 49687620 ffff8083 0069c5f8 ffff8000
  7480: 00000001 00000000 007e0628 ffff8000 496874b0 ffff8083 007e1de8 ffff8000
  74a0: 496874d0 ffff8083 0008e04c ffff8000 00000001 00000000 4ca2b600 ffff8083
  74c0: 00ba2e80 ffff8000 49687528 ffff8083 49687510 ffff8083 000e5c70 ffff8000
  74e0: 00c22348 ffff8000 00000000 ffff8083 49687510 ffff8083 000e5c74 ffff8000
  7500: 4ca2b600 ffff8083 49401800 ffff8083 00000001 00000000 00000000 00000000
  7520: 496874d0 ffff8083 00000000 00000000 00000000 00000000 00000000 00000000
  7540: 2f2e2d2c 33323130 00000000 00000000 4c944500 ffff8083 00000000 00000000
  7560: 00000000 00000000 008751e0 ffff8000 00000001 00000000 124e2d1d 00107b77

Convert hashtab lock to raw lock to avoid such warning.

Signed-off-by: Yang Shi <yang.shi@linaro.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
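The conversion pattern in a nutshell ('update_elem' below is a stand-in
name): a raw_spinlock_t stays a true spinning lock on PREEMPT_RT,
instead of being converted to a sleeping rt_mutex, so taking it in the
atomic kprobe context shown above is legal:

	struct bpf_htab {
		/* ... */
		raw_spinlock_t lock;	/* was: spinlock_t */
	};

	static void update_elem(struct bpf_htab *htab)
	{
		unsigned long flags;

		raw_spin_lock_irqsave(&htab->lock, flags);
		/* ... update the hash table element ... */
		raw_spin_unlock_irqrestore(&htab->lock, flags);
	}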
2015-11-02  tracing: ftrace_event_is_function() can return boolean  [Yaowei Bai]
Make ftrace_event_is_function() return bool to improve readability due to this particular function only using either one or zero as its return value. No functional change. Link: http://lkml.kernel.org/r/1443537816-5788-9-git-send-email-bywxiaobai@163.com Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
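This and the following conversions in the series share the same shape;
a minimal sketch of the pattern (the body shown is assumed):

	/* Before: returned int, but only ever 0 or 1. After: bool. */
	static bool ftrace_event_is_function(struct trace_event_call *call)
	{
		return call == &event_function;	/* hypothetical body */
	}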
2015-11-02  tracing: is_legal_op() can return boolean  [Yaowei Bai]
Make is_legal_op() return bool to improve readability due to this particular function only using either one or zero as its return value. No functional change. Link: http://lkml.kernel.org/r/1443537816-5788-8-git-send-email-bywxiaobai@163.com Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-02  ring-buffer: rb_event_is_commit() can return boolean  [Yaowei Bai]
Make rb_event_is_commit() return bool to improve readability due to this particular function only using either one or zero as its return value. No functional change. Link: http://lkml.kernel.org/r/1443537816-5788-7-git-send-email-bywxiaobai@163.com Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-02  ring-buffer: rb_per_cpu_empty() can return boolean  [Yaowei Bai]
Makes rb_per_cpu_empty() return bool to improve readability. No functional change. Link: http://lkml.kernel.org/r/1443537816-5788-6-git-send-email-bywxiaobai@163.com Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-02  ring_buffer: ring_buffer_empty{cpu}() can return boolean  [Yaowei Bai]
Make ring_buffer_empty() and ring_buffer_empty_cpu() return bool. No functional change. Link: http://lkml.kernel.org/r/1443537816-5788-5-git-send-email-bywxiaobai@163.com Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-02  ring-buffer: rb_is_reader_page() can return boolean  [Yaowei Bai]
Make rb_is_reader_page() return bool to improve readability due to this particular function only using either true or false as its return value. No functional change. Link: http://lkml.kernel.org/r/1443537816-5788-4-git-send-email-bywxiaobai@163.com Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-02  tracing: report_latency() in trace_irqsoff.c can return boolean  [Yaowei Bai]
This patch makes report_latency return bool due to this particular function only using either one or zero as its return value. No functional change. Link: http://lkml.kernel.org/r/1443537816-5788-3-git-send-email-bywxiaobai@163.com Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-02  tracing: report_latency() in trace_sched_wakeup.c can return boolean  [Yaowei Bai]
This patch makes report_latency return bool to improve readability, indicating whether this new latency should be reported/recorded. No functional change. Link: http://lkml.kernel.org/r/1443537816-5788-2-git-send-email-bywxiaobai@163.com Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-02  tracing: Update instance_rmdir() to use tracefs_remove_recursive  [Jiaxing Wang]
Update instance_rmdir() to use tracefs_remove_recursive() instead of
debugfs_remove_recursive(). This was left over from the transition from
debugfs to tracefs.

Link: http://lkml.kernel.org/r/1445169490-18315-2-git-send-email-hello.wjx@gmail.com
Cc: stable@vger.kernel.org # 4.1+
Fixes: 8434dc9340cd2 ("tracing: Convert the tracing facility over to use tracefs")
Signed-off-by: Jiaxing Wang <hello.wjx@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-02  tracing: Only benchmark the time tracepoints take if tracing is on  [Chunyan Zhang]
There's no need to record the time tracepoints take when tracing is
off. This is because:

1) We cannot see these records since the ring_buffer recording is off
   at that moment.

2) If tracing is off while the benchmark tracepoint is enabled, the
   time the tracepoint takes is less than in the same situation when
   tracing is on, since the data does not need to be written into the
   ring_buffer, which would take extra time. If tracing were turned on
   at that moment, the average and standard deviation could not
   accurately represent the time the tracepoints take to write data
   into the ring_buffer.

Link: http://lkml.kernel.org/r/1445947933-27955-1-git-send-email-zhang.chunyan@linaro.org
Signed-off-by: Chunyan Zhang <zhang.chunyan@linaro.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-02  tracing: Call on_each_cpu() when adding or removing single pids from set_event_pid  [Steven Rostedt (Red Hat)]
For the case where pids are already in set_event_pid and one is added
or removed, each CPU should be checked to make sure that the new or old
pid is on or not on a CPU.

For example:

 # echo 123 >> set_event_pid

or

 # echo '!123' >> set_event_pid

Link: http://lkml.kernel.org/r/20151030061643.GA19480@cac
Suggested-by: Jiaxing Wang <hello.wjx@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-02  Merge branch 'pm-sleep'  [Rafael J. Wysocki]
* pm-sleep:
  PM / hibernate: fix a comment typo
  input: i8042: Avoid resetting controller on system suspend/resume
  PM / PCI / ACPI: Kick devices that might have been reset by firmware
  PM / sleep: Add flags to indicate platform firmware involvement
  PM / sleep: Drop pm_request_idle() from pm_generic_complete()
  PCI / PM: Avoid resuming more devices during system suspend
  PM / wakeup: wakeup_source_create: use kstrdup_const
  PM / sleep: Report interrupt that caused system wakeup
2015-11-01  Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm  [Linus Torvalds]
Pull memremap fix from Dan Williams:
 "The new memremap() api introduced in the 4.3 cycle to unify/replace
  ioremap_cache() and ioremap_wt() is mishandling the highmem case.
  This patch has received a build success notification from a
  0day-kbuild-robot run and has received an ack from Ard"

From the commit message:
 "The impact of this bug is low for now since the pmem driver is the
  only user of memremap(), but this is important to fix before more
  conversions to memremap arrive in 4.4"

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  memremap: fix highmem support
2015-11-01  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  [David S. Miller]
2015-10-30  blktrace: re-write setting q->blk_trace  [Davidlohr Bueso]
This is really about simplifying the double xchg patterns into a single
cmpxchg, with the same logic. Other than the immediate cleanup, there
are some subtleties this change deals with:

(i) While the load of the old bt is fully ordered wrt everything, ie:

	old_bt = xchg(&q->blk_trace, bt);		[barrier]
	if (old_bt)
		(void) xchg(&q->blk_trace, old_bt);	[barrier]

    blk_trace could still be changed between the xchg and the old_bt
    load. Note that this description is merely theoretical and afaict
    very small, but doing everything in a single context with cmpxchg
    closes this potential race.

(ii) Ordering guarantees are obviously kept with cmpxchg.

(iii) Gets rid of the hacky-by-nature (void)xchg pattern.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
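The resulting single-cmpxchg shape, as a sketch (error-label name
assumed): install bt only if no blk_trace is set, otherwise back out:

	ret = -EBUSY;
	if (cmpxchg(&q->blk_trace, NULL, bt))
		goto err;	/* someone else installed one first */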
2015-10-29  cgroup: fix race condition around termination check in css_task_iter_next()  [Tejun Heo]
css_task_iter_next() checked @it->cur_task before grabbing css_set_lock
and assumed that the result won't change afterwards; however, tasks
could leave the cgroup being iterated terminating the iterator before
css_task_lock is acquired. If this happens, css_task_iter_next() tries
to calculate the current task from NULL cg_list pointer leading to the
following oops.

  BUG: unable to handle kernel paging request at fffffffffffff7d0
  IP: [<ffffffff810d5f22>] css_task_iter_next+0x42/0x80
  ...
  CPU: 4 PID: 6391 Comm: JobQDisp2 Not tainted 4.0.9-22_fbk4_rc3_81616_ge8d9cb6 #1
  Hardware name: Quanta Freedom/Winterfell, BIOS F03_3B08 03/04/2014
  task: ffff880868e46400 ti: ffff88083404c000 task.ti: ffff88083404c000
  RIP: 0010:[<ffffffff810d5f22>] [<ffffffff810d5f22>] css_task_iter_next+0x42/0x80
  RSP: 0018:ffff88083404fd28 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff88083404fd68 RCX: ffff8804697fb8b0
  RDX: fffffffffffff7c0 RSI: ffff8803b7dff800 RDI: ffffffff822c0278
  RBP: ffff88083404fd38 R08: 0000000000017160 R09: ffff88046f4070c0
  R10: ffffffff810d61f7 R11: 0000000000000293 R12: ffff880863bf8400
  R13: ffff88046b87fd80 R14: 0000000000000000 R15: ffff88083404fe58
  FS: 00007fa0567e2700(0000) GS:ffff88046f900000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: fffffffffffff7d0 CR3: 0000000469568000 CR4: 00000000001406e0
  Stack:
   0000000000000246 0000000000000000 ffff88083404fde8 ffffffff810d6248
   ffff88083404fd68 0000000000000000 ffff8803b7dff800 000001ef000001ee
   0000000000000000 0000000000000000 ffff880863bf8568 0000000000000000
  Call Trace:
   [<ffffffff810d6248>] cgroup_pidlist_start+0x258/0x550
   [<ffffffff810cf66d>] cgroup_seqfile_start+0x1d/0x20
   [<ffffffff8121f8ef>] kernfs_seq_start+0x5f/0xa0
   [<ffffffff811cab76>] seq_read+0x166/0x380
   [<ffffffff812200fd>] kernfs_fop_read+0x11d/0x180
   [<ffffffff811a7398>] __vfs_read+0x18/0x50
   [<ffffffff811a745d>] vfs_read+0x8d/0x150
   [<ffffffff811a756f>] SyS_read+0x4f/0xb0
   [<ffffffff818d4772>] system_call_fastpath+0x12/0x17

Fix it by moving the termination condition check inside css_set_lock.
@it->cur_task is now cleared after being put and @it->task_pos is
tested for termination instead of @it->cset_pos as they indicate the
same condition and @it->task_pos is what's being dereferenced.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Calvin Owens <calvinowens@fb.com>
Fixes: ed27b9f7a17d ("cgroup: don't hold css_set_rwsem across css task iteration")
Acked-by: Zefan Li <lizefan@huawei.com>
2015-10-28  Merge branch 'linus' into core/rcu, to fix up a semantic conflict  [Ingo Molnar]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-27  seccomp, ptrace: add support for dumping seccomp filters  [Tycho Andersen]
This patch adds support for dumping a process' (classic BPF) seccomp
filters via ptrace.

PTRACE_SECCOMP_GET_FILTER allows the tracer to dump the user's classic
BPF seccomp filters. addr should be an integer which represents the ith
seccomp filter (0 is the most recently installed filter). data should
be a struct sock_filter * with enough room for the ith filter, or NULL,
in which case the filter is not saved. The return value for this
command is the number of BPF instructions the program represents, or
negative in the case of errors. Command specific errors are ENOENT,
which indicates that there is no ith filter in this seccomp tree, and
EMEDIUMTYPE, which indicates that the ith filter was not installed as a
classic BPF filter.

A caveat with this approach is that there is no way to get explicitly
at the hierarchy of seccomp filters, and users need to memcmp() filters
to decide which are inherited. This means that a task which installs
two of the same filter can potentially confuse users of this interface.

v2: * make save_orig const
    * check that the orig_prog exists (not necessary right now, but
      when seccomp grows eBPF support it will be)
    * s/n/filter_off and make it an unsigned long to match ptrace
    * count "down" the tree instead of "up" when passing a filter offset
v3: * don't take the current task's lock for inspecting its seccomp mode
    * use a 0x42** constant for the ptrace command value
v4: * don't copy to userspace while holding spinlocks
v5: * add another condition to WARN_ON
v6: * rebase on net-next

Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
CC: Will Drewry <wad@chromium.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Pavel Emelyanov <xemul@parallels.com>
CC: Serge E. Hallyn <serge.hallyn@ubuntu.com>
CC: Alexei Starovoitov <ast@kernel.org>
CC: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
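A hypothetical tracer-side usage sketch (error handling trimmed): call
once with data == NULL to learn the length, then fetch filter 0, the
most recently installed one:

	#include <sys/ptrace.h>
	#include <linux/filter.h>
	#include <stdlib.h>

	static long dump_latest_filter(pid_t pid)
	{
		long n = ptrace(PTRACE_SECCOMP_GET_FILTER, pid, 0, NULL);
		struct sock_filter *insns;

		if (n < 0)
			return n;	/* -1 with errno set, e.g. ENOENT */

		insns = calloc(n, sizeof(*insns));
		n = ptrace(PTRACE_SECCOMP_GET_FILTER, pid, 0, insns);
		/* ... inspect/memcmp() the classic BPF program ... */
		free(insns);
		return n;
	}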
2015-10-28  Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux  [Linus Torvalds]
Pull module preemption fix from Rusty Russell:
 "Turns out we should have always been disabling preemption here;
  someone finally caught it thanks to Peter Z's additional checks"

* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
  module: Fix locking in symbol_put_addr()
2015-10-26  bpf: make tracing helpers gpl only  [Alexei Starovoitov]
Exported perf symbols are GPL only; mark eBPF helper functions used in
tracing as GPL only as well.

Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-26  bpf: fix bpf_perf_event_read() helper  [Alexei Starovoitov]
Fix safety checks for bpf_perf_event_read():

 - only non-inherited events can be added to the perf_event_array map
   (do this check statically at map insertion time)

 - dynamically check that the event is local and !pmu->count

Otherwise a buggy bpf program can cause a kernel splat. Also fix the
error path after perf_event_attrs() and remove a redundant 'extern'.

Fixes: 35578d798400 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Wang Nan <wangnan0@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-26  memremap: fix highmem support  [Dan Williams]
Currently memremap checks if the range is "System RAM" and returns the
kernel linear address. This is broken for highmem platforms where a
range may be "System RAM", but is not part of the kernel linear
mapping. Fall back to ioremap_cache() in these cases, to let the arch
code attempt to handle it.

Note that ARM ioremap will WARN when attempting to remap ram, and in
that case the caller needs to be fixed. For this reason, existing
ioremap_cache() usages for ARM are already trained to avoid attempts to
remap ram.

The impact of this bug is low for now since the pmem driver is the only
user of memremap(), but this is important to fix before more
conversions to memremap arrive in 4.4.

Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
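A sketch of the fix's shape (helper name and exact flow assumed): only
use the linear mapping when the page actually has one, otherwise return
NULL so the caller falls back to ioremap_cache():

	static void *try_ram_remap(resource_size_t offset, size_t size)
	{
		struct page *page = pfn_to_page(offset >> PAGE_SHIFT);

		/* In the simple case just return the existing linear address */
		if (!PageHighMem(page))
			return __va(offset);
		return NULL;	/* caller falls back to ioremap_cache() */
	}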
2015-10-26  tracing: Fix sparse RCU warning  [Steven Rostedt (Red Hat)]
p_start() and p_stop() are seq_file functions that match. Teach sparse
that the rcu_read_lock_sched() taken by p_start() is released by
p_stop().

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-10-25  tracing: Check all tasks on each CPU when filtering pids  [Steven Rostedt (Red Hat)]
My tests found that if a task is running but not filtered when set_event_pid is modified, then it can still be traced. Call on_each_cpu() to check if the current running task should be filtered and update the per cpu flags of tr->data appropriately. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-10-25  tracing: Implement event pid filtering  [Steven Rostedt (Red Hat)]
Add the necessary hooks to use the pids loaded in set_event_pid to
filter all the events enabled in the tracing instance that match the
pids listed.

Two probes are added to both the sched_switch and sched_wakeup
tracepoints, to be called before other probes are called and after the
other probes are called. The first is used to set the necessary flags
to let the probes know to test if they should be traced or not.

The sched_switch pre probe will set the "ignore_pid" flag if neither
the previous nor the next task has a matching pid.

The sched_switch post probe will set the "ignore_pid" flag if the next
task does not have a matching pid. The pre probe allows for probes
tracing sched_switch to be traced if necessary.

The sched_wakeup pre probe will set the "ignore_pid" flag if neither
the current task nor the wakee task has a matching pid.

The sched_wakeup post probe will set the "ignore_pid" flag if the
current task does not have a matching pid.

Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-10-25  tracing: Add set_event_pid directory for future use  [Steven Rostedt (Red Hat)]
Create a tracing directory called set_event_pid, which currently has no function, but will be used to filter all events for the tracing instance or the pids that are added to the file. The reason no functionality is added with this commit is that this commit focuses on the creation and removal of the pids in a safe manner. And tests can be made against this change to make sure things are correct before hooking features to the list of pids. Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-10-25  tracepoint: Give priority to probes of tracepoints  [Steven Rostedt (Red Hat)]
In order to guarantee that a probe will be called before other probes
that are attached to a tracepoint, there needs to be a mechanism to
provide priority of one probe over the others.

Add a prio field to struct tracepoint_func, which lets the probes be
sorted by the priority set in the structure. If no priority is
specified, then a priority of 10 is given (this is a macro, and perhaps
may be changed in the future).

Now probes may be added to affect other probes that are attached to a
tracepoint with a guaranteed order.

One use case would be to allow tracing of tracepoints to be filtered by
pid. A special (higher priority) probe may be added to the sched_switch
tracepoint to set the necessary flags of the other tracepoints to
notify them if they should be traced or not. In case a tracepoint is
enabled at the sched_switch tracepoint too, the order of the two is no
longer random.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
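A sketch of the data-structure change described above (field placement
assumed):

	#define TRACEPOINT_DEFAULT_PRIO	10

	struct tracepoint_func {
		void *func;
		void *data;
		int prio;	/* probes are kept sorted by this */
	};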
2015-10-26  timeconst: Update path in comment  [Jason A. Donenfeld]
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: hofrat@osadl.org Link: http://lkml.kernel.org/r/1436894685-5868-1-git-send-email-Jason@zx2c4.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-10-23  Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds]
Pull scheduler fixes from Ingo Molnar:
 "Misc fixes all around the map: an instrumentation fix, a nohz
  usability fix, a lockdep annotation fix and two task group scheduling
  fixes"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Add missing lockdep_unpin() annotations
  sched/deadline: Fix migration of SCHED_DEADLINE tasks
  nohz: Revert "nohz: Set isolcpus when nohz_full is set"
  sched/fair: Update task group's load_avg after task migration
  sched/fair: Fix overly small weight for interactive group entities
  sched, tracing: Stop/start critical timings around the idle=poll idle loop
2015-10-23  Merge branch 'akpm' (patches from Andrew)  [Linus Torvalds]
Merge fixes from Andrew Morton:
 "9 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  ocfs2/dlm: unlock lockres spinlock before dlm_lockres_put
  fault-inject: fix inverted interval/probability values in printk
  lib/Kconfig.debug: disable -Wframe-larger-than warnings with KASAN=y
  mm: make sendfile(2) killable
  thp: use is_zero_pfn() only after pte_present() check
  mailmap: update Javier Martinez Canillas' email
  MAINTAINERS: add Sergey as zsmalloc reviewer
  mm: cma: fix incorrect type conversion for size during dma allocation
  kmod: don't run async usermode helper as a child of kworker thread