summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2017-08-17locking/refcounts, x86/asm: Implement fast refcount overflow protectionKees Cook
This implements refcount_t overflow protection on x86 without a noticeable performance impact, though without the fuller checking of REFCOUNT_FULL. This is done by duplicating the existing atomic_t refcount implementation but with normally a single instruction added to detect if the refcount has gone negative (e.g. wrapped past INT_MAX or below zero). When detected, the handler saturates the refcount_t to INT_MIN / 2. With this overflow protection, the erroneous reference release that would follow a wrap back to zero is blocked from happening, avoiding the class of refcount-overflow use-after-free vulnerabilities entirely. Only the overflow case of refcounting can be perfectly protected, since it can be detected and stopped before the reference is freed and left to be abused by an attacker. There isn't a way to block early decrements, and while REFCOUNT_FULL stops increment-from-zero cases (which would be the state _after_ an early decrement and stops potential double-free conditions), this fast implementation does not, since it would require the more expensive cmpxchg loops. Since the overflow case is much more common (e.g. missing a "put" during an error path), this protection provides real-world protection. For example, the two public refcount overflow use-after-free exploits published in 2016 would have been rendered unexploitable: http://perception-point.io/2016/01/14/analysis-and-exploitation-of-a-linux-kernel-vulnerability-cve-2016-0728/ http://cyseclabs.com/page?n=02012016 This implementation does, however, notice an unchecked decrement to zero (i.e. caller used refcount_dec() instead of refcount_dec_and_test() and it resulted in a zero). Decrements under zero are noticed (since they will have resulted in a negative value), though this only indicates that a use-after-free may have already happened. Such notifications are likely avoidable by an attacker that has already exploited a use-after-free vulnerability, but it's better to have them reported than allow such conditions to remain universally silent. On first overflow detection, the refcount value is reset to INT_MIN / 2 (which serves as a saturation value) and a report and stack trace are produced. When operations detect only negative value results (such as changing an already saturated value), saturation still happens but no notification is performed (since the value was already saturated). On the matter of races, since the entire range beyond INT_MAX but before 0 is negative, every operation at INT_MIN / 2 will trap, leaving no overflow-only race condition. As for performance, this implementation adds a single "js" instruction to the regular execution flow of a copy of the standard atomic_t refcount operations. (The non-"and_test" refcount_dec() function, which is uncommon in regular refcount design patterns, has an additional "jz" instruction to detect reaching exactly zero.) Since this is a forward jump, it is by default the non-predicted path, which will be reinforced by dynamic branch prediction. The result is this protection having virtually no measurable change in performance over standard atomic_t operations. The error path, located in .text.unlikely, saves the refcount location and then uses UD0 to fire a refcount exception handler, which resets the refcount, handles reporting, and returns to regular execution. This keeps the changes to .text size minimal, avoiding return jumps and open-coded calls to the error reporting routine. Example assembly comparison: refcount_inc() before: .text: ffffffff81546149: f0 ff 45 f4 lock incl -0xc(%rbp) refcount_inc() after: .text: ffffffff81546149: f0 ff 45 f4 lock incl -0xc(%rbp) ffffffff8154614d: 0f 88 80 d5 17 00 js ffffffff816c36d3 ... .text.unlikely: ffffffff816c36d3: 48 8d 4d f4 lea -0xc(%rbp),%rcx ffffffff816c36d7: 0f ff (bad) These are the cycle counts comparing a loop of refcount_inc() from 1 to INT_MAX and back down to 0 (via refcount_dec_and_test()), between unprotected refcount_t (atomic_t), fully protected REFCOUNT_FULL (refcount_t-full), and this overflow-protected refcount (refcount_t-fast): 2147483646 refcount_inc()s and 2147483647 refcount_dec_and_test()s: cycles protections atomic_t 82249267387 none refcount_t-fast 82211446892 overflow, untested dec-to-zero refcount_t-full 144814735193 overflow, untested dec-to-zero, inc-from-zero This code is a modified version of the x86 PAX_REFCOUNT atomic_t overflow defense from the last public patch of PaX/grsecurity, based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Thanks to PaX Team for various suggestions for improvement for repurposing this code to be a refcount-only protection. Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Elena Reshetova <elena.reshetova@intel.com> Cc: Eric Biggers <ebiggers3@gmail.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Hans Liljestrand <ishkamiel@gmail.com> Cc: James Bottomley <James.Bottomley@hansenpartnership.com> Cc: Jann Horn <jannh@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Serge E. Hallyn <serge@hallyn.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arozansk@redhat.com Cc: axboe@kernel.dk Cc: kernel-hardening@lists.openwall.com Cc: linux-arch <linux-arch@vger.kernel.org> Link: http://lkml.kernel.org/r/20170815161924.GA133115@beast Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-11Merge branch 'linus' into locking/core, to resolve conflictsIngo Molnar
Conflicts: include/linux/mm_types.h mm/huge_memory.c I removed the smp_mb__before_spinlock() like the following commit does: 8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()") and fixed up the affected commits. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10Merge tag 'drm-fixes-for-v4.13-rc5' of ↵Linus Torvalds
git://people.freedesktop.org/~airlied/linux Pull drm fixes from Dave Airlie: "Nothing too earth shattering here, it just seems like lots of little things all over the place. msm has probably the larger amount of changes, but they all seem fine, otherwise, some rockchip, i915, etnaviv and exynos fixes, along with one nouveau regression fix for some older GPUs" * tag 'drm-fixes-for-v4.13-rc5' of git://people.freedesktop.org/~airlied/linux: (35 commits) drm/nouveau/disp/nv04: avoid creation of output paths drm: make DRM_STM default n drm/exynos: forbid creating framebuffers from too small GEM buffers drm/etnaviv: Fix off-by-one error in reloc checking drm/i915: fix backlight invert for non-zero minimum brightness drm/i915/shrinker: Wrap need_resched() inside preempt-disable drm/i915/perf: fix flex eu registers programming drm/i915: Fix out-of-bounds array access in bdw_load_gamma_lut drm/i915/gvt: Change the max length of mmio_reg_rw from 4 to 8 drm/i915/gvt: Initialize MMIO Block with HW state drm/rockchip: vop: report error when check resource error drm/rockchip: vop: round_up pitches to word align drm/rockchip: vop: fix NV12 video display error drm/rockchip: vop: fix iommu page fault when resume drm/i915/gvt: clean workload queue if error happened drm/i915/gvt: change resetting to resetting_eng drm/msm: gpu: don't abuse dma_alloc for non-DMA allocations drm/msm: gpu: call qcom_mdt interfaces only for ARCH_QCOM drm/msm/adreno: Prevent unclocked access when retrieving timestamps drm/msm: Remove __user from __u64 data types ...
2017-08-10Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "21 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (21 commits) userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage zram: rework copy of compressor name in comp_algorithm_store() rmap: do not call mmu_notifier_invalidate_page() under ptl mm: fix list corruptions on shmem shrinklist mm/balloon_compaction.c: don't zero ballooned pages MAINTAINERS: copy virtio on balloon_compaction.c mm: fix KSM data corruption mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem mm: make tlb_flush_pending global mm: refactor TLB gathering API Revert "mm: numa: defer TLB flush for THP migration as long as possible" mm: migrate: fix barriers around tlb_flush_pending mm: migrate: prevent racy access to tlb_flush_pending fault-inject: fix wrong should_fail() decision in task context test_kmod: fix small memory leak on filesystem tests test_kmod: fix the lock in register_test_dev_kmod() test_kmod: fix bug which allows negative values on two config options test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY" userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case mm: ratelimit PFNs busy info message ...
2017-08-10mm: fix MADV_[FREE|DONTNEED] TLB flush miss problemMinchan Kim
Nadav reported parallel MADV_DONTNEED on same range has a stale TLB problem and Mel fixed it[1] and found same problem on MADV_FREE[2]. Quote from Mel Gorman: "The race in question is CPU 0 running madv_free and updating some PTEs while CPU 1 is also running madv_free and looking at the same PTEs. CPU 1 may have writable TLB entries for a page but fail the pte_dirty check (because CPU 0 has updated it already) and potentially fail to flush. Hence, when madv_free on CPU 1 returns, there are still potentially writable TLB entries and the underlying PTE is still present so that a subsequent write does not necessarily propagate the dirty bit to the underlying PTE any more. Reclaim at some unknown time at the future may then see that the PTE is still clean and discard the page even though a write has happened in the meantime. I think this is possible but I could have missed some protection in madv_free that prevents it happening." This patch aims for solving both problems all at once and is ready for other problem with KSM, MADV_FREE and soft-dirty story[3]. TLB batch API(tlb_[gather|finish]_mmu] uses [inc|dec]_tlb_flush_pending and mmu_tlb_flush_pending so that when tlb_finish_mmu is called, we can catch there are parallel threads going on. In that case, forcefully, flush TLB to prevent for user to access memory via stale TLB entry although it fail to gather page table entry. I confirmed this patch works with [4] test program Nadav gave so this patch supersedes "mm: Always flush VMA ranges affected by zap_page_range v2" in current mmotm. NOTE: This patch modifies arch-specific TLB gathering interface(x86, ia64, s390, sh, um). It seems most of architecture are straightforward but s390 need to be careful because tlb_flush_mmu works only if mm->context.flush_mm is set to non-zero which happens only a pte entry really is cleared by ptep_get_and_clear and friends. However, this problem never changes the pte entries but need to flush to prevent memory access from stale tlb. [1] http://lkml.kernel.org/r/20170725101230.5v7gvnjmcnkzzql3@techsingularity.net [2] http://lkml.kernel.org/r/20170725100722.2dxnmgypmwnrfawp@suse.de [3] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com [4] https://patchwork.kernel.org/patch/9861621/ [minchan@kernel.org: decrease tlb flush pending count in tlb_finish_mmu] Link: http://lkml.kernel.org/r/20170808080821.GA31730@bbox Link: http://lkml.kernel.org/r/20170802000818.4760-7-namit@vmware.com Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Nadav Amit <namit@vmware.com> Reported-by: Nadav Amit <namit@vmware.com> Reported-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Jeff Dike <jdike@addtoit.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10mm: make tlb_flush_pending globalMinchan Kim
Currently, tlb_flush_pending is used only for CONFIG_[NUMA_BALANCING| COMPACTION] but upcoming patches to solve subtle TLB flush batching problem will use it regardless of compaction/NUMA so this patch doesn't remove the dependency. [akpm@linux-foundation.org: remove more ifdefs from world's ugliest printk statement] Link: http://lkml.kernel.org/r/20170802000818.4760-6-namit@vmware.com Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Nadav Amit <namit@vmware.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10mm: refactor TLB gathering APIMinchan Kim
This patch is a preparatory patch for solving race problems caused by TLB batch. For that, we will increase/decrease TLB flush pending count of mm_struct whenever tlb_[gather|finish]_mmu is called. Before making it simple, this patch separates architecture specific part and rename it to arch_tlb_[gather|finish]_mmu and generic part just calls it. It shouldn't change any behavior. Link: http://lkml.kernel.org/r/20170802000818.4760-5-namit@vmware.com Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Nadav Amit <namit@vmware.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Jeff Dike <jdike@addtoit.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10mm: migrate: fix barriers around tlb_flush_pendingNadav Amit
Reading tlb_flush_pending while the page-table lock is taken does not require a barrier, since the lock/unlock already acts as a barrier. Removing the barrier in mm_tlb_flush_pending() to address this issue. However, migrate_misplaced_transhuge_page() calls mm_tlb_flush_pending() while the page-table lock is already released, which may present a problem on architectures with weak memory model (PPC). To deal with this case, a new parameter is added to mm_tlb_flush_pending() to indicate if it is read without the page-table lock taken, and calling smp_mb__after_unlock_lock() in this case. Link: http://lkml.kernel.org/r/20170802000818.4760-3-namit@vmware.com Signed-off-by: Nadav Amit <namit@vmware.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Tony Luck <tony.luck@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10mm: migrate: prevent racy access to tlb_flush_pendingNadav Amit
Patch series "fixes of TLB batching races", v6. It turns out that Linux TLB batching mechanism suffers from various races. Races that are caused due to batching during reclamation were recently handled by Mel and this patch-set deals with others. The more fundamental issue is that concurrent updates of the page-tables allow for TLB flushes to be batched on one core, while another core changes the page-tables. This other core may assume a PTE change does not require a flush based on the updated PTE value, while it is unaware that TLB flushes are still pending. This behavior affects KSM (which may result in memory corruption) and MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior). A proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED. Memory corruption in KSM is harder to produce in practice, but was observed by hacking the kernel and adding a delay before flushing and replacing the KSM page. Finally, there is also one memory barrier missing, which may affect architectures with weak memory model. This patch (of 7): Setting and clearing mm->tlb_flush_pending can be performed by multiple threads, since mmap_sem may only be acquired for read in task_numa_work(). If this happens, tlb_flush_pending might be cleared while one of the threads still changes PTEs and batches TLB flushes. This can lead to the same race between migration and change_protection_range() that led to the introduction of tlb_flush_pending. The result of this race was data corruption, which means that this patch also addresses a theoretically possible data corruption. An actual data corruption was not observed, yet the race was was confirmed by adding assertion to check tlb_flush_pending is not set by two threads, adding artificial latency in change_protection_range() and using sysctl to reduce kernel.numa_balancing_scan_delay_ms. Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com Fixes: 20841405940e ("mm: fix TLB flush race between migration, and change_protection_range") Signed-off-by: Nadav Amit <namit@vmware.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Russell King <linux@armlinux.org.uk> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10Merge tag 'pci-v4.13-fixes-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI fix from Bjorn Helgaas: "Work around Renesas uPD72020x 32-bit DMA issue" * tag 'pci-v4.13-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: xhci: Reset Renesas uPD72020x USB controller for 32-bit DMA issue PCI: Add pci_reset_function_locked()
2017-08-10locking/lockdep: Apply crossrelease to completionsByungchul Park
Although wait_for_completion() and its family can cause deadlock, the lock correctness validator could not be applied to them until now, because things like complete() are usually called in a different context from the waiting context, which violates lockdep's assumption. Thanks to CONFIG_LOCKDEP_CROSSRELEASE, we can now apply the lockdep detector to those completion operations. Applied it. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: boqun.feng@gmail.com Cc: kernel-team@lge.com Cc: kirill@shutemov.name Cc: npiggin@gmail.com Cc: walken@google.com Cc: willy@infradead.org Link: http://lkml.kernel.org/r/1502089981-21272-10-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10locking/lockdep: Handle non(or multi)-acquisition of a crosslockByungchul Park
No acquisition might be in progress on commit of a crosslock. Completion operations enabling crossrelease are the case like: CONTEXT X CONTEXT Y --------- --------- trigger completion context complete AX commit AX wait_for_complete AX acquire AX wait where AX is a crosslock. When no acquisition is in progress, we should not perform commit because the lock does not exist, which might cause incorrect memory access. So we have to track the number of acquisitions of a crosslock to handle it. Moreover, in case that more than one acquisition of a crosslock are overlapped like: CONTEXT W CONTEXT X CONTEXT Y CONTEXT Z --------- --------- --------- --------- acquire AX (gen_id: 1) acquire A acquire AX (gen_id: 10) acquire B commit AX acquire C commit AX where A, B and C are typical locks and AX is a crosslock. Current crossrelease code performs commits in Y and Z with gen_id = 10. However, we can use gen_id = 1 to do it, since not only 'acquire AX in X' but 'acquire AX in W' also depends on each acquisition in Y and Z until their commits. So make it use gen_id = 1 instead of 10 on their commits, which adds an additional dependency 'AX -> A' in the example above. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: boqun.feng@gmail.com Cc: kernel-team@lge.com Cc: kirill@shutemov.name Cc: npiggin@gmail.com Cc: walken@google.com Cc: willy@infradead.org Link: http://lkml.kernel.org/r/1502089981-21272-8-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10locking/lockdep: Detect and handle hist_lock ring buffer overwriteByungchul Park
The ring buffer can be overwritten by hardirq/softirq/work contexts. That cases must be considered on rollback or commit. For example, |<------ hist_lock ring buffer size ----->| ppppppppppppiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii wrapped > iiiiiiiiiiiiiiiiiiiiiii.................... where 'p' represents an acquisition in process context, 'i' represents an acquisition in irq context. On irq exit, crossrelease tries to rollback idx to original position, but it should not because the entry already has been invalid by overwriting 'i'. Avoid rollback or commit for entries overwritten. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: boqun.feng@gmail.com Cc: kernel-team@lge.com Cc: kirill@shutemov.name Cc: npiggin@gmail.com Cc: walken@google.com Cc: willy@infradead.org Link: http://lkml.kernel.org/r/1502089981-21272-7-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10locking/lockdep: Implement the 'crossrelease' featureByungchul Park
Lockdep is a runtime locking correctness validator that detects and reports a deadlock or its possibility by checking dependencies between locks. It's useful since it does not report just an actual deadlock but also the possibility of a deadlock that has not actually happened yet. That enables problems to be fixed before they affect real systems. However, this facility is only applicable to typical locks, such as spinlocks and mutexes, which are normally released within the context in which they were acquired. However, synchronization primitives like page locks or completions, which are allowed to be released in any context, also create dependencies and can cause a deadlock. So lockdep should track these locks to do a better job. The 'crossrelease' implementation makes these primitives also be tracked. Signed-off-by: Byungchul Park <byungchul.park@lge.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: boqun.feng@gmail.com Cc: kernel-team@lge.com Cc: kirill@shutemov.name Cc: npiggin@gmail.com Cc: walken@google.com Cc: willy@infradead.org Link: http://lkml.kernel.org/r/1502089981-21272-6-git-send-email-byungchul.park@lge.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10locking/lockdep: Rework FS_RECLAIM annotationPeter Zijlstra
A while ago someone, and I cannot find the email just now, asked if we could not implement the RECLAIM_FS inversion stuff with a 'fake' lock like we use for other things like workqueues etc. I think this should be possible which allows reducing the 'irq' states and will reduce the amount of __bfs() lookups we do. Removing the 1 IRQ state results in 4 less __bfs() walks per dependency, improving lockdep performance. And by moving this annotation out of the lockdep code it becomes easier for the mm people to extend. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nikolay Borisov <nborisov@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akpm@linux-foundation.org Cc: boqun.feng@gmail.com Cc: iamjoonsoo.kim@lge.com Cc: kernel-team@lge.com Cc: kirill@shutemov.name Cc: npiggin@gmail.com Cc: walken@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10locking: Remove smp_mb__before_spinlock()Peter Zijlstra
Now that there are no users of smp_mb__before_spinlock() left, remove it entirely. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10locking: Introduce smp_mb__after_spinlock()Peter Zijlstra
Since its inception, our understanding of ACQUIRE, esp. as applied to spinlocks, has changed somewhat. Also, I wonder if, with a simple change, we cannot make it provide more. The problem with the comment is that the STORE done by spin_lock isn't itself ordered by the ACQUIRE, and therefore a later LOAD can pass over it and cross with any prior STORE, rendering the default WMB insufficient (pointed out by Alan). Now, this is only really a problem on PowerPC and ARM64, both of which already defined smp_mb__before_spinlock() as a smp_mb(). At the same time, we can get a much stronger construct if we place that same barrier _inside_ the spin_lock(). In that case we upgrade the RCpc spinlock to an RCsc. That would make all schedule() calls fully transitive against one another. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10mm, locking: Rework {set,clear,mm}_tlb_flush_pending()Peter Zijlstra
Commit: af2c1401e6f9 ("mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates") added smp_mb__before_spinlock() to set_tlb_flush_pending(). I think we can solve the same problem without this barrier. If instead we mandate that mm_tlb_flush_pending() is used while holding the PTL we're guaranteed to observe prior set_tlb_flush_pending() instances. For this to work we need to rework migrate_misplaced_transhuge_page() a little and move the test up into do_huge_pmd_numa_page(). NOTE: this relies on flush_tlb_range() to guarantee: (1) it ensures that prior page table updates are visible to the page table walker and (2) it ensures that subsequent memory accesses are only made visible after the invalidation has completed This is required for architectures that implement TRANSPARENT_HUGEPAGE (arc, arm, arm64, mips, powerpc, s390, sparc, x86) or otherwise use mm_tlb_flush_pending() in their page-table operations (arm, arm64, x86). This appears true for: - arm (DSB ISB before and after), - arm64 (DSB ISHST before, and DSB ISH after), - powerpc (PTESYNC before and after), - s390 and x86 TLB invalidate are serializing instructions But I failed to understand the situation for: - arc, mips, sparc Now SPARC64 is a wee bit special in that flush_tlb_range() is a no-op and it flushes the TLBs using arch_{enter,leave}_lazy_mmu_mode() inside the PTL. It still needs to guarantee the PTL unlock happens _after_ the invalidate completes. Vineet, Ralf and Dave could you guys please have a look? Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rik van Riel <riel@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10jump_label: Provide hotplug context variantsMarc Zyngier
As using the normal static key API under the hotplug lock is pretty much impossible, let's provide a variant of some of them that require the hotplug lock to have already been taken. These function are only meant to be used in CPU hotplug callbacks. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Leo Yan <leo.yan@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/20170801080257.5056-4-marc.zyngier@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10cpuset: Make nr_cpusets privatePaolo Bonzini
Any use of key->enabled (that is static_key_enabled and static_key_count) outside jump_label_lock should handle its own serialization. In the case of cpusets_enabled_key, the key is always incremented/decremented under cpuset_mutex, and hence the same rule applies to nr_cpusets. The rule *is* respected currently, but the mutex is static so nr_cpusets should be static too. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Zefan Li <lizefan@huawei.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1501601046-35683-4-git-send-email-pbonzini@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10jump_label: Fix concurrent static_key_enable/disable()Paolo Bonzini
static_key_enable/disable are trying to cap the static key count to 0/1. However, their use of key->enabled is outside jump_label_lock so they do not really ensure that. Rewrite them to do a quick check for an already enabled (respectively, already disabled), and then recheck under the jump label lock. Unlike static_key_slow_inc/dec, a failed check under the jump label lock does not modify key->enabled. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jason Baron <jbaron@akamai.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1501601046-35683-2-git-send-email-pbonzini@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10locking/rwsem-xadd: Add killable versions of rwsem_down_read_failed()Kirill Tkhai
Rename rwsem_down_read_failed() in __rwsem_down_read_failed_common() and teach it to abort waiting in case of pending signals and killable state argument passed. Note, that we shouldn't wake anybody up in EINTR path, as: We check for (waiter.task) under spinlock before we go to out_nolock path. Current task wasn't able to be woken up, so there are a writer, owning the sem, or a writer, which is the first waiter. In the both cases we shouldn't wake anybody. If there is a writer, owning the sem, and we were the only waiter, remove RWSEM_WAITING_BIAS, as there are no waiters anymore. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arnd@arndb.de Cc: avagin@virtuozzo.com Cc: davem@davemloft.net Cc: fenghua.yu@intel.com Cc: gorcunov@virtuozzo.com Cc: heiko.carstens@de.ibm.com Cc: hpa@zytor.com Cc: ink@jurassic.park.msu.ru Cc: mattst88@gmail.com Cc: rth@twiddle.net Cc: schwidefsky@de.ibm.com Cc: tony.luck@intel.com Link: http://lkml.kernel.org/r/149789534632.9059.2901382369609922565.stgit@localhost.localdomain Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10locking/rwsem-spinlock: Add killable versions of __down_read()Kirill Tkhai
Rename __down_read() in __down_read_common() and teach it to abort waiting in case of pending signals and killable state argument passed. Note, that we shouldn't wake anybody up in EINTR path, as: We check for signal_pending_state() after (!waiter.task) test and under spinlock. So, current task wasn't able to be woken up. It may be in two cases: a writer is owner of the sem, or a writer is a first waiter of the sem. If a writer is owner of the sem, no one else may work with it in parallel. It will wake somebody, when it call up_write() or downgrade_write(). If a writer is the first waiter, it will be woken up, when the last active reader releases the sem, and sem->count became 0. Also note, that set_current_state() may be moved down to schedule() (after !waiter.task check), as all assignments in this type of semaphore (including wake_up), occur under spinlock, so we can't miss anything. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arnd@arndb.de Cc: avagin@virtuozzo.com Cc: davem@davemloft.net Cc: fenghua.yu@intel.com Cc: gorcunov@virtuozzo.com Cc: heiko.carstens@de.ibm.com Cc: hpa@zytor.com Cc: ink@jurassic.park.msu.ru Cc: mattst88@gmail.com Cc: rth@twiddle.net Cc: schwidefsky@de.ibm.com Cc: tony.luck@intel.com Link: http://lkml.kernel.org/r/149789533283.9059.9829416940494747182.stgit@localhost.localdomain Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10locking/atomic: Fix atomic_set_release() for 'funny' architecturesPeter Zijlstra
Those architectures that have a special atomic_set implementation also need a special atomic_set_release(), because for the very same reason WRITE_ONCE() is broken for them, smp_store_release() is too. The vast majority is architectures that have spinlock hash based atomic implementation except hexagon which seems to have a hardware 'feature'. The spinlock based atomics should be SC, that is, none of them appear to place extra barriers in atomic_cmpxchg() or any of the other SC atomic primitives and therefore seem to rely on their spinlock implementation being SC (I did not fully validate all that). Therefore, the normal atomic_set() is SC and can be used at atomic_set_release(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Chris Metcalf <cmetcalf@mellanox.com> [for tile] Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: davem@davemloft.net Cc: james.hogan@imgtec.com Cc: jejb@parisc-linux.org Cc: rkuo@codeaurora.org Cc: vgupta@synopsys.com Link: http://lkml.kernel.org/r/20170609110506.yod47flaav3wgoj5@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10Merge branch 'linus' into locking/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10Merge tag 'drm-misc-fixes-2017-08-08' of ↵Dave Airlie
git://anongit.freedesktop.org/git/drm-misc into drm-fixes Core Changes: - dma-buf: Allow multiple sync_files to wrap a single dma-fence (Chris) Driver Changes: - rockchip: misc fixes to vop driver from the downstream rockchip tree (Mark) - Error path cleanups to tc358767 & host1x (Lucas & Paul, respectively) * tag 'drm-misc-fixes-2017-08-08' of git://anongit.freedesktop.org/git/drm-misc: drm/rockchip: vop: report error when check resource error drm/rockchip: vop: round_up pitches to word align drm/rockchip: vop: fix NV12 video display error drm/rockchip: vop: fix iommu page fault when resume dma-buf/sync_file: Allow multiple sync_files to wrap a single dma-fence drm/bridge: tc358767: fix probe without attached output node
2017-08-10Merge branch 'msm-fixes-4.13-rc3' of ↵Dave Airlie
git://people.freedesktop.org/~robclark/linux into drm-fixes Bunch of msm fixes for 4.13 * 'msm-fixes-4.13-rc3' of git://people.freedesktop.org/~robclark/linux: drm/msm: gpu: don't abuse dma_alloc for non-DMA allocations drm/msm: gpu: call qcom_mdt interfaces only for ARCH_QCOM drm/msm/adreno: Prevent unclocked access when retrieving timestamps drm/msm: Remove __user from __u64 data types drm/msm: args->fence should be args->flags drm/msm: Turn off hardware clock gating before reading A5XX registers drm/msm: Allow hardware clock gating to be toggled drm/msm: Remove some potentially blocked register ranges drm/msm/mdp5: Drop clock names with "_clk" suffix drm/msm/mdp5: Fix typo in encoder_enable path drm/msm: NULL pointer dereference in drivers/gpu/drm/msm/msm_gem_vma.c drm/msm: fix WARN_ON in add_vma() with no iommu drm/msm/dsi: Calculate link clock rates with updated dsi->lanes drm/msm/mdp5: fix unclocked register access in _cursor_set() drm/msm: unlock on error in msm_gem_get_iova() drm/msm: fix an integer overflow test drm/msm/mdp5: Fix compilation warnings
2017-08-09Merge tag 'pinctrl-v4.13-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl Pull pin control fixes from Linus Walleij: "These are the pin control fixes I have gathered since the return from my vacation. They boiled in -next a while so let's get them in. Apart from the documentation build it is purely driver fixes. Which is nice. The Intel fixes seem kind of important. - Fix the documentation build as the docs were moved - Correct the UART pin list on the Intel Merrifield - Fix pin assignment and number of pins on the Marvell Armada 37xx pin controller - Cover the Setzer models in the Chromebook DMI quirk in the Intel cheryview driver so they start working - Add the missing "sim" function to the sunxi driver - Fix USB pin definitions on Uniphier Pro4 - Smatch fix for invalid reference in the zx pin control driver" * tag 'pinctrl-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: pinctrl: generic: update references to Documentation/pinctrl.txt pinctrl: intel: merrifield: Correct UART pin lists pinctrl: armada-37xx: Fix number of pin in south bridge pinctrl: armada-37xx: Fix the pin 23 on south bridge pinctrl: cherryview: Add Setzer models to the Chromebook DMI quirk pinctrl: sunxi: add a missing function of A10/A20 pinctrl driver pinctrl: uniphier: fix USB3 pin assignment for Pro4 pinctrl: zte: fix dereference of 'data' in zx_set_mux()
2017-08-09Merge branch 'i2c/for-current' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fixes from Wolfram Sang: "The main thing is to allow empty id_tables for ACPI to make some drivers get probed again. It looks a bit bigger than usual because it needs some internal renaming, too. Other than that, there is a fix for broken DSTDs, a super simple enablement for ARM MPS, and two documentation fixes which I'd like to see in v4.13 already" * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: rephrase explanation of I2C_CLASS_DEPRECATED i2c: allow i2c-versatile for ARM MPS platforms i2c: designware: Some broken DSTDs use 1MiHz instead of 1MHz i2c: designware: Print clock freq on invalid clock freq error i2c: core: Allow empty id_table in ACPI case as well i2c: mux: pinctrl: mention correct module name in Kconfig help text
2017-08-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "The pull requests are getting smaller, that's progress I suppose :-) 1) Fix infinite loop in CIPSO option parsing, from Yujuan Qi. 2) Fix remote checksum handling in VXLAN and GUE tunneling drivers, from Koichiro Den. 3) Missing u64_stats_init() calls in several drivers, from Florian Fainelli. 4) TCP can set the congestion window to an invalid ssthresh value after congestion window reductions, from Yuchung Cheng. 5) Fix BPF jit branch generation on s390, from Daniel Borkmann. 6) Correct MIPS ebpf JIT merge, from David Daney. 7) Correct byte order test in BPF test_verifier.c, from Daniel Borkmann. 8) Fix various crashes and leaks in ASIX driver, from Dean Jenkins. 9) Handle SCTP checksums properly in mlx4 driver, from Davide Caratti. 10) We can potentially enter tcp_connect() with a cached route already, due to fastopen, so we have to explicitly invalidate it. 11) skb_warn_bad_offload() can bark in legitimate situations, fix from Willem de Bruijn" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (52 commits) net: avoid skb_warn_bad_offload false positives on UFO qmi_wwan: fix NULL deref on disconnect ppp: fix xmit recursion detection on ppp channels rds: Reintroduce statistics counting tcp: fastopen: tcp_connect() must refresh the route net: sched: set xt_tgchk_param par.net properly in ipt_init_target net: dsa: mediatek: add adjust link support for user ports net/mlx4_en: don't set CHECKSUM_COMPLETE on SCTP packets qed: Fix a memory allocation failure test in 'qed_mcp_cmd_init()' hysdn: fix to a race condition in put_log_buffer s390/qeth: fix L3 next-hop in xmit qeth hdr asix: Fix small memory leak in ax88772_unbind() asix: Ensure asix_rx_fixup_info members are all reset asix: Add rx->ax_skb = NULL after usbnet_skb_return() bpf: fix selftest/bpf/test_pkt_md_access on s390x netvsc: fix race on sub channel creation bpf: fix byte order test in test_verifier xgene: Always get clk source, but ignore if it's missing for SGMII ports MIPS: Add missing file for eBPF JIT. bpf, s390: fix build for libbpf and selftest suite ...
2017-08-08Merge tag 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull rdma fixes from Doug Ledford: "Third set of -rc fixes for 4.13 cycle - small set of miscellanous fixes - a reasonably sizable set of IPoIB fixes that deal with multiple long standing issues" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: IB/hns: checking for IS_ERR() instead of NULL RDMA/mlx5: Fix existence check for extended address vector IB/uverbs: Fix device cleanup RDMA/uverbs: Prevent leak of reserved field IB/core: Fix race condition in resolving IP to MAC IB/ipoib: Notify on modify QP failure only when relevant Revert "IB/core: Allow QP state transition from reset to error" IB/ipoib: Remove double pointer assigning IB/ipoib: Clean error paths in add port IB/ipoib: Add get statistics support to SRIOV VF IB/ipoib: Add multicast packets statistics IB/ipoib: Set IPOIB_NEIGH_TBL_FLUSH after flushed completion initialization IB/ipoib: Prevent setting negative values to max_nonsrq_conn_qp IB/ipoib: Make sure no in-flight joins while leaving that mcast IB/ipoib: Use cancel_delayed_work_sync when needed IB/ipoib: Fix race between light events and interface restart
2017-08-08Merge tag 'scsi-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "Two small fixes, one re-fix of a previous fix and five patches sorting out hotplug in the bnx2X class of drivers. The latter is rather involved, but necessary because these drivers have started dropping lockdep recursion warnings on the hotplug lock because of its conversion to a percpu rwsem" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: sg: only check for dxfer_len greater than 256M scsi: aacraid: reading out of bounds scsi: qedf: Limit number of CQs scsi: bnx2i: Simplify cpu hotplug code scsi: bnx2fc: Simplify CPU hotplug code scsi: bnx2i: Prevent recursive cpuhotplug locking scsi: bnx2fc: Prevent recursive cpuhotplug locking scsi: bnx2fc: Plug CPU hotplug race
2017-08-07Merge tag 'for-linus-20170807' of git://git.infradead.org/linux-mtdLinus Torvalds
Pull MTD fixes from Brian Norris: "I missed getting these out for rc4, but here are some MTD fixes. Just NAND fixes (in both the core handling, and a few drivers). Notes stolen from Boris: Core fixes: - fix data interface setup for ONFI NANDs that do not support the SET FEATURES command - fix a kernel doc header - fix potential integer overflow when retrieving timing information from the parameter page - fix wrong OOB layout for small page NANDs Driver fixes: - fix potential division-by-zero bug - fix backward compat with old atmel-nand DT bindings - fix ->setup_data_interface() in the atmel NAND driver" * tag 'for-linus-20170807' of git://git.infradead.org/linux-mtd: mtd: nand: atmel: Fix EDO mode check mtd: nand: Declare tBERS, tR and tPROG as u64 to avoid integer overflow mtd: nand: Fix timing setup for NANDs that do not support SET FEATURES mtd: nand: Fix a docs build warning mtd: nand: sunxi: fix potential divide-by-zero error nand: fix wrong default oob layout for small pages using soft ecc mtd: nand: atmel: Fix DT backward compatibility in pmecc.c
2017-08-07pinctrl: generic: update references to Documentation/pinctrl.txtLudovic Desroches
Update deprecated references to Documentation/pinctrl.txt since it has been moved to Documentation/driver-api/pinctl.rst. Signed-off-by: Ludovic Desroches <ludovic.desroches@o2linux.fr> Fixes: 5a9b73832e9e ("pinctrl.txt: move it to the driver-api book") Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2017-08-06Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "A large number of ext4 bug fixes and cleanups for v4.13" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix copy paste error in ext4_swap_extents() ext4: fix overflow caused by missing cast in ext4_resize_fs() ext4, project: expand inode extra size if possible ext4: cleanup ext4_expand_extra_isize_ea() ext4: restructure ext4_expand_extra_isize ext4: fix forgetten xattr lock protection in ext4_expand_extra_isize ext4: make xattr inode reads faster ext4: inplace xattr block update fails to deduplicate blocks ext4: remove unused mode parameter ext4: fix warning about stack corruption ext4: fix dir_nlink behaviour ext4: silence array overflow warning ext4: fix SEEK_HOLE/SEEK_DATA for blocksize < pagesize ext4: release discard bio after sending discard commands ext4: convert swap_inode_data() over to use swap() on most of the fields ext4: error should be cleared if ea_inode isn't added to the cache ext4: Don't clear SGID when inheriting ACLs ext4: preserve i_mode if __ext4_set_acl() fails ext4: remove unused metadata accounting variables ext4: correct comment references to ext4_ext_direct_IO()
2017-08-05Merge tag 'media/v4.13-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull media fixes from Mauro Carvalho Chehab: "This series is larger than I would like to submit for -rc4. My original intent were to sent it to either -rc2 or -rc3. Unfortunately, due to my vacations, I got a lot of pending stuff after my return, and had to do some biz trips, with prevented me to send this earlier. Several fixes: - some fixes at atomisp staging driver - several gcc 7 warning fixes - cleanup media SVG files, in order to fix PDF build on some distros - fix random Kconfig build of venus driver - some fixes for the venus driver - some changes from semaphone to mutex in ngene's driver - some locking fixes at dib0700 driver - several fixes on ngene's driver and frontends to make it properly support some new boards added on Kernel 4.13 - some fixes to CEC drivers - omap_vout: vrfb: convert to dmaengine - docs-rst: document EBUSY for VIDIOC_S_FMT Please notice that the big diffstat changes here are at the SVG files. Visually, the images look the same, but the file size is now a lot smaller than before, and they don't use some XML tags that would cause them to be badly parsed by some ImageMagick versions, or to require a lot of memory by TeTex, with would break PDF output on some distributions" * tag 'media/v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (68 commits) media: atomisp2: array underflow in imx_enum_frame_size() media: atomisp2: array underflow in ap1302_enum_frame_size() media: atomisp2: Array underflow in atomisp_enum_input() media: platform: davinci: drop VPFE_CMD_S_CCDC_RAW_PARAMS media: platform: davinci: return -EINVAL for VPFE_CMD_S_CCDC_RAW_PARAMS ioctl media: venus: don't abuse dma_alloc for non-DMA allocations media: venus: hfi: fix error handling in hfi_sys_init_done() media: venus: fix compile-test build on non-qcom ARM platform media: venus: mark PM functions as __maybe_unused media: cec-notifier: small improvements media: pulse8-cec: persistent_config should be off by default media: cec: cec_transmit_attempt_done: ignore CEC_TX_STATUS_MAX_RETRIES media: staging: atomisp: array underflow in ioctl media: lirc: LIRC_GET_REC_RESOLUTION should return microseconds media: svg: avoid too long lines media: svg files: simplify files media: selection.svg: simplify the SVG file media: vimc: set id_table for platform drivers media: staging: atomisp: disable warnings with cc-disable-warning media: davinci: variable 'common' set but not used ...
2017-08-04Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM fixes from Radim Krčmář: "ARM: - Yet another race with VM destruction plugged - A set of small vgic fixes x86: - Preserve pending INIT - RCU fixes in paravirtual async pf, VM teardown, and VMXOFF emulation - nVMX interrupt injection and dirty tracking fixes - initialize to make UBSAN happy" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: arm/arm64: vgic: Use READ_ONCE fo cmpxchg KVM: nVMX: Fix interrupt window request with "Acknowledge interrupt on exit" KVM: nVMX: mark vmcs12 pages dirty on L2 exit kvm: nVMX: don't flush VMCS12 during VMXOFF or VCPU teardown KVM: nVMX: do not pin the VMCS12 KVM: avoid using rcu_dereference_protected KVM: X86: init irq->level in kvm_pv_kick_cpu_op KVM: X86: Fix loss of pending INIT due to race KVM: async_pf: make rcu irq exit if not triggered from idle task KVM: nVMX: fixes to nested virt interrupt injection KVM: nVMX: do not fill vm_exit_intr_error_code in prepare_vmcs12 KVM: arm/arm64: Handle hva aging while destroying the vm KVM: arm/arm64: PMU: Fix overflow interrupt injection KVM: arm/arm64: Fix bug in advertising KVM_CAP_MSI_DEVID capability
2017-08-04RDMA/mlx5: Fix existence check for extended address vectorLeon Romanovsky
The extended address vector is the highest bit in be32 variable, but it was compared with the lowest. This patch fixes the endianness of that check and removes already declared define. Fixes: 17d2f88f92ce ("IB/mlx5: Add ODP atomics support") Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-08-04Merge tag 'ceph-for-4.13-rc4' of git://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: "A bunch of fixes and follow-ups for -rc1 Luminous patches: issues with ->reencode_message() and last minute RADOS semantic changes in v12.1.2" * tag 'ceph-for-4.13-rc4' of git://github.com/ceph/ceph-client: libceph: make RECOVERY_DELETES feature create a new interval libceph: upmap semantic changes crush: assume weight_set != null imples weight_set_size > 0 libceph: fallback for when there isn't a pool-specific choose_arg libceph: don't call ->reencode_message() more than once per message libceph: make encode_request_*() work with r_mempool requests
2017-08-04Merge tag 'sound-4.13-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "Now we hit the usual ASoC-fix-flood in the middle of release. Most of the changes are trivial and device-specific, while one significant change is the fix for unbalanced of_graph_*() refcounts. This involved a change in the graph API itself that had been a bit messy" * tag 'sound-4.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: ALSA: hda - Fix speaker output from VAIO VPCL14M1R device property: Fix usecount for of_graph_get_port_parent() ASoC: rt5665: fix wrong register for bclk ratio control ASoC: Intel: Use MCLK instead of BLCK as the sysclock for RT5514 codec on kabylake platform ASoC: Intel: Enabling ASRC for RT5663 codec on kabylake platform ASoC: codecs: msm8916-analog: fix DIG_CLK_CTL_RXD3_CLK_EN define ASoC: Intel: Skylake: Fix missing sentinels in sst_acpi_mach ASoC: sh: hac: add missing "int ret" ASoC: samsung: odroid: Fix EPLL frequency values ASoC: sgtl5000: Use snd_soc_kcontrol_codec() ASoC: rt5665: fix GPIO6 pin function define ASoC: ux500: Restore platform DAI assignments ASoC: fix pcm-creation regression ASoC: do not close shared backend dailink ASoC: pxa: SND_PXA2XX_SOC should depend on HAS_DMA ASoC: Intel: Skylake: Fix default dma_buffer_size ASoC: rt5663: Update the HW default values based on the shipping version ASoC: imx-ssi: add check on platform_get_irq return value
2017-08-03tcp: introduce tcp_rto_delta_us() helper for xmit timer fixNeal Cardwell
Pure refactor. This helper will be required in the xmit timer fix later in the patch series. (Because the TLP logic will want to make this calculation.) Fixes: 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)") Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03Merge tag 'vfio-v4.13-rc4' of git://github.com/awilliam/linux-vfioLinus Torvalds
Pull VFIO fixes from Alex Williamson: - SPAPR/EEH config build fix (Murilo Opsfelder Araujo) - Fix possible device lock deadlock (Alex Williamson) - Correctly size integrated endpoint PCIe capabilities (Alex Williamson) * tag 'vfio-v4.13-rc4' of git://github.com/awilliam/linux-vfio: vfio/pci: Fix handling of RC integrated endpoint PCIe capability size vfio/pci: Use pci_try_reset_function() on initial open include/linux/vfio.h: Guard powerpc-specific functions with CONFIG_VFIO_SPAPR_EEH
2017-08-03Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "15 fixes" [ This does not merge the "fortify: use WARN instead of BUG for now" patch, which needs a bit of extra work to build cleanly with all configurations. Arnd is on it. - Linus ] * emailed patches from Andrew Morton <akpm@linux-foundation.org>: ocfs2: don't clear SGID when inheriting ACLs mm: allow page_cache_get_speculative in interrupt context userfaultfd: non-cooperative: flush event_wqh at release time ipc: add missing container_of()s for randstruct cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() userfaultfd_zeropage: return -ENOSPC in case mm has gone mm: take memory hotplug lock within numa_zonelist_order_handler() mm/page_io.c: fix oops during block io poll in swapin path zram: do not free pool->size_class kthread: fix documentation build warning kasan: avoid -Wmaybe-uninitialized warning userfaultfd: non-cooperative: notify about unmap of destination during mremap mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries pid: kill pidhash_size in pidhash_init() mm/hugetlb.c: __get_user_pages ignores certain follow_hugetlb_page errors
2017-08-03Merge tag 'kvm-arm-for-v4.13-rc4' of ↵Radim Krčmář
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm KVM/ARM Fixes for v4.13-rc4 - Yet another race with VM destruction plugged - A set of small vgic fixes
2017-08-02Merge tag 'nfs-for-4.13-4' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client fixes from Anna Schumaker: "Two fixes from Trond this time, now that he's back from his vacation. The first is a stable fix for the EXCHANGE_ID issue on the mailing list, and the other fixes a double-free situation that he found at the same time. Stable fix: - Fix EXCHANGE_ID corrupt verifier issue Other fix: - Fix double frees in nfs4_test_session_trunk()" * tag 'nfs-for-4.13-4' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFSv4: Fix double frees in nfs4_test_session_trunk() NFSv4: Fix EXCHANGE_ID corrupt verifier issue
2017-08-02mm: allow page_cache_get_speculative in interrupt contextKan Liang
Kernel panic when calling the IRQ-safe __get_user_pages_fast in NMI handler. The bug was introduced by commit 2947ba054a4d ("x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation"). The original x86 __get_user_page_fast used plain get_page() or page_ref_add(). However, the generic __get_user_page_fast uses page_cache_get_speculative(), which has VM_BUG_ON(in_interrupt()). There is no reason to prevent page_cache_get_speculative from using in interrupt context. According to the author, putting a BUG_ON there is just because the code is not verifying correctness of interrupt races. I did some tests in interrupt context. There is no issue found. Removing VM_BUG_ON(in_interrupt()) for page_cache_get_speculative(). Link: http://lkml.kernel.org/r/1501609146-59730-1-git-send-email-kan.liang@intel.com Fixes: 2947ba054a4d ("x86/mm/gup: Switch GUP to the generic get_user_page_fast() implementation") Signed-off-by: Kan Liang <kan.liang@intel.com> Cc: Jens Axboe <axboe@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ying Huang <ying.huang@intel.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02cpuset: fix a deadlock due to incomplete patching of cpusets_enabled()Dima Zavin
In codepaths that use the begin/retry interface for reading mems_allowed_seq with irqs disabled, there exists a race condition that stalls the patch process after only modifying a subset of the static_branch call sites. This problem manifested itself as a deadlock in the slub allocator, inside get_any_partial. The loop reads mems_allowed_seq value (via read_mems_allowed_begin), performs the defrag operation, and then verifies the consistency of mem_allowed via the read_mems_allowed_retry and the cookie returned by xxx_begin. The issue here is that both begin and retry first check if cpusets are enabled via cpusets_enabled() static branch. This branch can be rewritted dynamically (via cpuset_inc) if a new cpuset is created. The x86 jump label code fully synchronizes across all CPUs for every entry it rewrites. If it rewrites only one of the callsites (specifically the one in read_mems_allowed_retry) and then waits for the smp_call_function(do_sync_core) to complete while a CPU is inside the begin/retry section with IRQs off and the mems_allowed value is changed, we can hang. This is because begin() will always return 0 (since it wasn't patched yet) while retry() will test the 0 against the actual value of the seq counter. The fix is to use two different static keys: one for begin (pre_enable_key) and one for retry (enable_key). In cpuset_inc(), we first bump the pre_enable key to ensure that cpuset_mems_allowed_begin() always return a valid seqcount if are enabling cpusets. Similarly, when disabling cpusets via cpuset_dec(), we first ensure that callers of cpuset_mems_allowed_retry() will start ignoring the seqcount value before we let cpuset_mems_allowed_begin() return 0. The relevant stack traces of the two stuck threads: CPU: 1 PID: 1415 Comm: mkdir Tainted: G L 4.9.36-00104-g540c51286237 #4 Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 task: ffff8817f9c28000 task.stack: ffffc9000ffa4000 RIP: smp_call_function_many+0x1f9/0x260 Call Trace: smp_call_function+0x3b/0x70 on_each_cpu+0x2f/0x90 text_poke_bp+0x87/0xd0 arch_jump_label_transform+0x93/0x100 __jump_label_update+0x77/0x90 jump_label_update+0xaa/0xc0 static_key_slow_inc+0x9e/0xb0 cpuset_css_online+0x70/0x2e0 online_css+0x2c/0xa0 cgroup_apply_control_enable+0x27f/0x3d0 cgroup_mkdir+0x2b7/0x420 kernfs_iop_mkdir+0x5a/0x80 vfs_mkdir+0xf6/0x1a0 SyS_mkdir+0xb7/0xe0 entry_SYSCALL_64_fastpath+0x18/0xad ... CPU: 2 PID: 1 Comm: init Tainted: G L 4.9.36-00104-g540c51286237 #4 Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017 task: ffff8818087c0000 task.stack: ffffc90000030000 RIP: int3+0x39/0x70 Call Trace: <#DB> ? ___slab_alloc+0x28b/0x5a0 <EOE> ? copy_process.part.40+0xf7/0x1de0 __slab_alloc.isra.80+0x54/0x90 copy_process.part.40+0xf7/0x1de0 copy_process.part.40+0xf7/0x1de0 kmem_cache_alloc_node+0x8a/0x280 copy_process.part.40+0xf7/0x1de0 _do_fork+0xe7/0x6c0 _raw_spin_unlock_irq+0x2d/0x60 trace_hardirqs_on_caller+0x136/0x1d0 entry_SYSCALL_64_fastpath+0x5/0xad do_syscall_64+0x27/0x350 SyS_clone+0x19/0x20 do_syscall_64+0x60/0x350 entry_SYSCALL64_slow_path+0x25/0x25 Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com Fixes: 46e700abc44c ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled") Signed-off-by: Dima Zavin <dmitriyz@waymo.com> Reported-by: Cliff Spradlin <cspradlin@waymo.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christopher Lameter <cl@linux.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02kthread: fix documentation build warningJonathan Corbet
The kerneldoc comment for kthread_create() had an incorrect argument name, leading to a warning in the docs build. Correct it, and make one more small step toward a warning-free build. Link: http://lkml.kernel.org/r/20170724135916.7f486c6f@lwn.net Signed-off-by: Jonathan Corbet <corbet@lwn.net> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02mm, mprotect: flush TLB if potentially racing with a parallel reclaim ↵Mel Gorman
leaving stale TLB entries Nadav Amit identified a theoritical race between page reclaim and mprotect due to TLB flushes being batched outside of the PTL being held. He described the race as follows: CPU0 CPU1 ---- ---- user accesses memory using RW PTE [PTE now cached in TLB] try_to_unmap_one() ==> ptep_get_and_clear() ==> set_tlb_ubc_flush_pending() mprotect(addr, PROT_READ) ==> change_pte_range() ==> [ PTE non-present - no flush ] user writes using cached RW PTE ... try_to_unmap_flush() The same type of race exists for reads when protecting for PROT_NONE and also exists for operations that can leave an old TLB entry behind such as munmap, mremap and madvise. For some operations like mprotect, it's not necessarily a data integrity issue but it is a correctness issue as there is a window where an mprotect that limits access still allows access. For munmap, it's potentially a data integrity issue although the race is massive as an munmap, mmap and return to userspace must all complete between the window when reclaim drops the PTL and flushes the TLB. However, it's theoritically possible so handle this issue by flushing the mm if reclaim is potentially currently batching TLB flushes. Other instances where a flush is required for a present pte should be ok as either the page lock is held preventing parallel reclaim or a page reference count is elevated preventing a parallel free leading to corruption. In the case of page_mkclean there isn't an obvious path that userspace could take advantage of without using the operations that are guarded by this patch. Other users such as gup as a race with reclaim looks just at PTEs. huge page variants should be ok as they don't race with reclaim. mincore only looks at PTEs. userfault also should be ok as if a parallel reclaim takes place, it will either fault the page back in or read some of the data before the flush occurs triggering a fault. Note that a variant of this patch was acked by Andy Lutomirski but this was for the x86 parts on top of his PCID work which didn't make the 4.13 merge window as expected. His ack is dropped from this version and there will be a follow-on patch on top of PCID that will include his ack. [akpm@linux-foundation.org: tweak comments] [akpm@linux-foundation.org: fix spello] Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de Reported-by: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: <stable@vger.kernel.org> [v4.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02KVM: avoid using rcu_dereference_protectedPaolo Bonzini
During teardown, accesses to memslots and buses are using rcu_dereference_protected with an always-true condition because these accesses are done outside the usual mutexes. This is because the last reference is gone and there cannot be any concurrent modifications, but rcu_dereference_protected is ugly and unobvious. Instead, check the refcount in kvm_get_bus and __kvm_memslots. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>