summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2019-10-08Add my linux-leds branch to MAINTAINERSPavel Machek
Add pointer to my git tree to MAINTAINERS. I'd like to maintain linux-leds for-next branch for 5.5. Signed-off-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Jacek Anaszewski <jacek.anaszewski@gmail.com>
2019-10-08leds: core: Fix leds.h structure documentationDan Murphy
Update the leds.h structure documentation to define the correct arguments. Signed-off-by: Dan Murphy <dmurphy@ti.com> Signed-off-by: Jacek Anaszewski <jacek.anaszewski@gmail.com>
2019-10-08Merge tag 'gpio-v5.4-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio Pull GPIO fixes from Linus Walleij: - don't clear FLAG_IS_OUT when emulating open drain/source in gpiolib - fix up the usage of nonexclusive GPIO descriptors from device trees - fix the incorrect IEC offset when toggling trigger edge in the Spreadtrum driver - use the correct unit for debounce settings in the MAX77620 driver * tag 'gpio-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: gpio: max77620: Use correct unit for debounce times gpio: eic: sprd: Fix the incorrect EIC offset when toggling gpio: fix getting nonexclusive gpiods from DT gpiolib: don't clear FLAG_IS_OUT when emulating open-drain/open-source
2019-10-08Merge tag 'selinux-pr-20191007' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux Pull selinuxfix from Paul Moore: "One patch to ensure we don't copy bad memory up into userspace" * tag 'selinux-pr-20191007' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux: selinux: fix context string corruption in convert_context()
2019-10-08Merge tag 'linux-kselftest-5.4-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest Pull Kselftest fixes from Shuah Khan: "Fixes for existing tests and the framework. Cristian Marussi's patches add the ability to skip targets (tests) and exclude tests that didn't build from run-list. These patches improve the Kselftest results. Ability to skip targets helps avoid running tests that aren't supported in certain environments. As an example, bpf tests from mainline aren't supported on stable kernels and have dependency on bleeding edge llvm. Being able to skip bpf on systems that can't meet this llvm dependency will be helpful. Kselftest can be built and installed from the main Makefile. This change help simplify Kselftest use-cases which addresses request from users. Kees Cook added per test timeout support to limit individual test run-time" * tag 'linux-kselftest-5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: selftests: watchdog: Add command line option to show watchdog_info selftests: watchdog: Validate optional file argument selftests/kselftest/runner.sh: Add 45 second timeout per test kselftest: exclude failed TARGETS from runlist kselftest: add capability to skip chosen TARGETS selftests: Add kselftest-all and kselftest-install targets
2019-10-08x86/cpu: Add Comet Lake to the Intel CPU models headerKan Liang
Comet Lake is the new 10th Gen Intel processor. Add two new CPU model numbers to the Intel family list. The CPU model numbers are not published in the SDM yet but they come from an authoritative internal source. [ bp: Touch up commit message. ] Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: ak@linux.intel.com Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/1570549810-25049-2-git-send-email-kan.liang@linux.intel.com
2019-10-08doc: move namespaces.rst from kbuild/ to core-api/Masahiro Yamada
We discussed a better location for this file, and agreed that core-api/ is a good fit. Rename it to symbol-namespaces.rst for disambiguation, and also add it to index.rst and MAINTAINERS. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Matthias Maennich <maennich@google.com> Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-10-08arm64: armv8_deprecated: Checking return value for memory allocationYunfeng Ye
There are no return value checking when using kzalloc() and kcalloc() for memory allocation. so add it. Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com> Signed-off-by: Will Deacon <will@kernel.org>
2019-10-08lib/string: Make memzero_explicit() inline instead of externalArvind Sankar
With the use of the barrier implied by barrier_data(), there is no need for memzero_explicit() to be extern. Making it inline saves the overhead of a function call, and allows the code to be reused in arch/*/purgatory without having to duplicate the implementation. Tested-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Borislav Petkov <bp@alien8.de> Cc: H . Peter Anvin <hpa@zytor.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephan Mueller <smueller@chronox.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-crypto@vger.kernel.org Cc: linux-s390@vger.kernel.org Fixes: 906a4bb97f5d ("crypto: sha256 - Use get/put_unaligned_be32 to get input, memzero_explicit") Link: https://lkml.kernel.org/r/20191007220000.GA408752@rani.riverdale.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-08x86/cpu/vmware: Use the full form of INL in VMWARE_PORTSami Tolvanen
LLVM's assembler doesn't accept the short form INL instruction: inl (%%dx) but instead insists on the output register to be explicitly specified: <inline asm>:1:7: error: invalid operand for instruction inl (%dx) ^ LLVM ERROR: Error parsing inline asm Use the full form of the instruction to fix the build. Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Thomas Hellstrom <thellstrom@vmware.com> Cc: clang-built-linux@googlegroups.com Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: virtualization@lists.linux-foundation.org Cc: "VMware, Inc." <pv-drivers@vmware.com> Cc: x86-ml <x86@kernel.org> Link: https://github.com/ClangBuiltLinux/linux/issues/734 Link: https://lkml.kernel.org/r/20191007192129.104336-1-samitolvanen@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-08x86/asm: Fix MWAITX C-state hint valueJanakarajan Natarajan
As per "AMD64 Architecture Programmer's Manual Volume 3: General-Purpose and System Instructions", MWAITX EAX[7:4]+1 specifies the optional hint of the optimized C-state. For C0 state, EAX[7:4] should be set to 0xf. Currently, a value of 0xf is set for EAX[3:0] instead of EAX[7:4]. Fix this by changing MWAITX_DISABLE_CSTATES from 0xf to 0xf0. This hasn't had any implications so far because setting reserved bits in EAX is simply ignored by the CPU. [ bp: Fixup comment in delay_mwaitx() and massage. ] Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "x86@kernel.org" <x86@kernel.org> Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20191007190011.4859-1-Janakarajan.Natarajan@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-08btrfs: silence maybe-uninitialized warning in clone_rangeAustin Kim
GCC throws warning message as below: ‘clone_src_i_size’ may be used uninitialized in this function [-Wmaybe-uninitialized] #define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0) ^ fs/btrfs/send.c:5088:6: note: ‘clone_src_i_size’ was declared here u64 clone_src_i_size; ^ The clone_src_i_size is only used as call-by-reference in a call to get_inode_info(). Silence the warning by initializing clone_src_i_size to 0. Note that the warning is a false positive and reported by older versions of GCC (eg. 7.x) but not eg 9.x. As there have been numerous people, the patch is applied. Setting clone_src_i_size to 0 does not otherwise make sense and would not do any action in case the code changes in the future. Signed-off-by: Austin Kim <austindh.kim@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-08efi/tpm: Fix sanity check of unsigned tbl_size being less than zeroColin Ian King
Currently the check for tbl_size being less than zero is always false because tbl_size is unsigned. Fix this by making it a signed int. Addresses-Coverity: ("Unsigned compared against 0") Signed-off-by: Colin Ian King <colin.king@canonical.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Jerry Snitselaar <jsnitsel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-janitors@vger.kernel.org Cc: linux-efi@vger.kernel.org Fixes: e658c82be556 ("efi/tpm: Only set 'efi_tpm_final_log_size' after successful event log parsing") Link: https://lkml.kernel.org/r/20191008100153.8499-1-colin.king@canonical.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-08drm/panel: tpo-td043mtea1: Fix SPI aliasLaurent Pinchart
The panel-tpo-td043mtea1 driver incorrectly includes the OF vendor prefix in its SPI alias. Fix it, and move the manual alias to an SPI module device table. Fixes: dc2e1e5b2799 ("drm/panel: Add driver for the Toppoly TD043MTEA1 panel") Reported-by: H. Nikolaus Schaller <hns@goldelico.com> Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com> Link: https://patchwork.freedesktop.org/patch/msgid/20191007170801.27647-6-laurent.pinchart@ideasonboard.com Acked-by: Sam Ravnborg <sam@ravnborg.org> Reviewed-by: Sebastian Reichel <sebastian.reichel@collabora.com> Tested-by: H. Nikolaus Schaller <hns@goldelico.com>
2019-10-08drm/panel: tpo-td028ttec1: Fix SPI aliasLaurent Pinchart
The panel-tpo-td028ttec1 driver incorrectly includes the OF vendor prefix in its SPI alias. Fix it. Fixes: 415b8dd08711 ("drm/panel: Add driver for the Toppoly TD028TTEC1 panel") Reported-by: H. Nikolaus Schaller <hns@goldelico.com> Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com> Link: https://patchwork.freedesktop.org/patch/msgid/20191007170801.27647-5-laurent.pinchart@ideasonboard.com Acked-by: Sam Ravnborg <sam@ravnborg.org> Reviewed-by: Sebastian Reichel <sebastian.reichel@collabora.com> Tested-by: H. Nikolaus Schaller <hns@goldelico.com> Tested-by: Andreas Kemnade <andreas@kemnade.info>
2019-10-08drm/panel: sony-acx565akm: Fix SPI aliasLaurent Pinchart
The panel-sony-acx565akm driver incorrectly includes the OF vendor prefix in its SPI alias. Fix it, and move the manual alias to an SPI module device table. Fixes: 1c8fc3f0c5d2 ("drm/panel: Add driver for the Sony ACX565AKM panel") Reported-by: H. Nikolaus Schaller <hns@goldelico.com> Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com> Link: https://patchwork.freedesktop.org/patch/msgid/20191007170801.27647-4-laurent.pinchart@ideasonboard.com Acked-by: Sam Ravnborg <sam@ravnborg.org> Reviewed-by: Sebastian Reichel <sebastian.reichel@collabora.com>
2019-10-08drm/panel: nec-nl8048hl11: Fix SPI aliasLaurent Pinchart
The panel-nec-nl8048hl11 driver incorrectly includes the OF vendor prefix in its SPI alias. Fix it, and move the manual alias to an SPI module device table. Fixes: df439abe6501 ("drm/panel: Add driver for the NEC NL8048HL11 panel") Reported-by: H. Nikolaus Schaller <hns@goldelico.com> Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com> Link: https://patchwork.freedesktop.org/patch/msgid/20191007170801.27647-3-laurent.pinchart@ideasonboard.com Acked-by: Sam Ravnborg <sam@ravnborg.org> Reviewed-by: Sebastian Reichel <sebastian.reichel@collabora.com>
2019-10-08drm/panel: lg-lb035q02: Fix SPI aliasLaurent Pinchart
The panel-lg-lb035q02 driver incorrectly includes the OF vendor prefix in its SPI alias. Fix it, and move the manual alias to an SPI module device table. Fixes: f5b0c6542476 ("drm/panel: Add driver for the LG Philips LB035Q02 panel") Reported-by: H. Nikolaus Schaller <hns@goldelico.com> Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com> Link: https://patchwork.freedesktop.org/patch/msgid/20191007170801.27647-2-laurent.pinchart@ideasonboard.com Acked-by: Sam Ravnborg <sam@ravnborg.org> Reviewed-by: Sebastian Reichel <sebastian.reichel@collabora.com>
2019-10-07io_uring: remove wait loop spurious wakeupsPavel Begunkov
Any changes interesting to tasks waiting in io_cqring_wait() are commited with io_cqring_ev_posted(). However, io_ring_drop_ctx_refs() also tries to do that but with no reason, that means spurious wakeups every io_free_req() and io_uring_enter(). Just use percpu_ref_put() instead. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "The usual shower of hotfixes. Chris's memcg patches aren't actually fixes - they're mature but a few niggling review issues were late to arrive. The ocfs2 fixes are quite old - those took some time to get reviewer attention. Subsystems affected by this patch series: ocfs2, hotfixes, mm/memcg, mm/slab-generic" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two) mm, sl[ou]b: improve memory accounting mm, memcg: make scan aggression always exclude protection mm, memcg: make memory.emin the baseline for utilisation determination mm, memcg: proportional memory.{low,min} reclaim mm/vmpressure.c: fix a signedness bug in vmpressure_register_event() mm/page_alloc.c: fix a crash in free_pages_prepare() mm/z3fold.c: claim page in the beginning of free kernel/sysctl.c: do not override max_threads provided by userspace memcg: only record foreign writebacks with dirty pages when memcg is not disabled mm: fix -Wmissing-prototypes warnings writeback: fix use-after-free in finish_writeback_work() mm/memremap: drop unused SECTION_SIZE and SECTION_MASK panic: ensure preemption is disabled during panic() fs: ocfs2: fix a possible null-pointer dereference in ocfs2_info_scan_inode_alloc() fs: ocfs2: fix a possible null-pointer dereference in ocfs2_write_end_nolock() fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry() ocfs2: clear zero in unaligned direct IO
2019-10-07mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)Vlastimil Babka
In most configurations, kmalloc() happens to return naturally aligned (i.e. aligned to the block size itself) blocks for power of two sizes. That means some kmalloc() users might unknowingly rely on that alignment, until stuff breaks when the kernel is built with e.g. CONFIG_SLUB_DEBUG or CONFIG_SLOB, and blocks stop being aligned. Then developers have to devise workaround such as own kmem caches with specified alignment [1], which is not always practical, as recently evidenced in [2]. The topic has been discussed at LSF/MM 2019 [3]. Adding a 'kmalloc_aligned()' variant would not help with code unknowingly relying on the implicit alignment. For slab implementations it would either require creating more kmalloc caches, or allocate a larger size and only give back part of it. That would be wasteful, especially with a generic alignment parameter (in contrast with a fixed alignment to size). Ideally we should provide to mm users what they need without difficult workarounds or own reimplementations, so let's make the kmalloc() alignment to size explicitly guaranteed for power-of-two sizes under all configurations. What this means for the three available allocators? * SLAB object layout happens to be mostly unchanged by the patch. The implicitly provided alignment could be compromised with CONFIG_DEBUG_SLAB due to redzoning, however SLAB disables redzoning for caches with alignment larger than unsigned long long. Practically on at least x86 this includes kmalloc caches as they use cache line alignment, which is larger than that. Still, this patch ensures alignment on all arches and cache sizes. * SLUB layout is also unchanged unless redzoning is enabled through CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache. With this patch, explicit alignment is guaranteed with redzoning as well. This will result in more memory being wasted, but that should be acceptable in a debugging scenario. * SLOB has no implicit alignment so this patch adds it explicitly for kmalloc(). The potential downside is increased fragmentation. While pathological allocation scenarios are certainly possible, in my testing, after booting a x86_64 kernel+userspace with virtme, around 16MB memory was consumed by slab pages both before and after the patch, with difference in the noise. [1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/ [2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/ [3] https://lwn.net/Articles/787740/ [akpm@linux-foundation.org: documentation fixlet, per Matthew] Link: http://lkml.kernel.org/r/20190826111627.7505-3-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Christoph Hellwig <hch@lst.de> Cc: David Sterba <dsterba@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: "Darrick J . Wong" <darrick.wong@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07mm, sl[ou]b: improve memory accountingVlastimil Babka
Patch series "guarantee natural alignment for kmalloc()", v2. This patch (of 2): SLOB currently doesn't account its pages at all, so in /proc/meminfo the Slab field shows zero. Modifying a counter on page allocation and freeing should be acceptable even for the small system scenarios SLOB is intended for. Since reclaimable caches are not separated in SLOB, account everything as unreclaimable. SLUB currently doesn't account kmalloc() and kmalloc_node() allocations larger than order-1 page, that are passed directly to the page allocator. As they also don't appear in /proc/slabinfo, it might look like a memory leak. For consistency, account them as well. (SLAB doesn't actually use page allocator directly, so no change there). Ideally SLOB and SLUB would be handled in separate patches, but due to the shared kmalloc_order() function and different kfree() implementations, it's easier to patch both at once to prevent inconsistencies. Link: http://lkml.kernel.org/r/20190826111627.7505-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: "Darrick J . Wong" <darrick.wong@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07mm, memcg: make scan aggression always exclude protectionChris Down
This patch is an incremental improvement on the existing memory.{low,min} relative reclaim work to base its scan pressure calculations on how much protection is available compared to the current usage, rather than how much the current usage is over some protection threshold. This change doesn't change the experience for the user in the normal case too much. One benefit is that it replaces the (somewhat arbitrary) 100% cutoff with an indefinite slope, which makes it easier to ballpark a memory.low value. As well as this, the old methodology doesn't quite apply generically to machines with varying amounts of physical memory. Let's say we have a top level cgroup, workload.slice, and another top level cgroup, system-management.slice. We want to roughly give 12G to system-management.slice, so on a 32GB machine we set memory.low to 20GB in workload.slice, and on a 64GB machine we set memory.low to 52GB. However, because these are relative amounts to the total machine size, while the amount of memory we want to generally be willing to yield to system.slice is absolute (12G), we end up putting more pressure on system.slice just because we have a larger machine and a larger workload to fill it, which seems fairly unintuitive. With this new behaviour, we don't end up with this unintended side effect. Previously the way that memory.low protection works is that if you are 50% over a certain baseline, you get 50% of your normal scan pressure. This is certainly better than the previous cliff-edge behaviour, but it can be improved even further by always considering memory under the currently enforced protection threshold to be out of bounds. This means that we can set relatively low memory.low thresholds for variable or bursty workloads while still getting a reasonable level of protection, whereas with the previous version we may still trivially hit the 100% clamp. The previous 100% clamp is also somewhat arbitrary, whereas this one is more concretely based on the currently enforced protection threshold, which is likely easier to reason about. There is also a subtle issue with the way that proportional reclaim worked previously -- it promotes having no memory.low, since it makes pressure higher during low reclaim. This happens because we base our scan pressure modulation on how far memory.current is between memory.min and memory.low, but if memory.low is unset, we only use the overage method. In most cromulent configurations, this then means that we end up with *more* pressure than with no memory.low at all when we're in low reclaim, which is not really very usable or expected. With this patch, memory.low and memory.min affect reclaim pressure in a more understandable and composable way. For example, from a user standpoint, "protected" memory now remains untouchable from a reclaim aggression standpoint, and users can also have more confidence that bursty workloads will still receive some amount of guaranteed protection. Link: http://lkml.kernel.org/r/20190322160307.GA3316@chrisdown.name Signed-off-by: Chris Down <chris@chrisdown.name> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07mm, memcg: make memory.emin the baseline for utilisation determinationChris Down
Roman points out that when when we do the low reclaim pass, we scale the reclaim pressure relative to position between 0 and the maximum protection threshold. However, if the maximum protection is based on memory.elow, and memory.emin is above zero, this means we still may get binary behaviour on second-pass low reclaim. This is because we scale starting at 0, not starting at memory.emin, and since we don't scan at all below emin, we end up with cliff behaviour. This should be a fairly uncommon case since usually we don't go into the second pass, but it makes sense to scale our low reclaim pressure starting at emin. You can test this by catting two large sparse files, one in a cgroup with emin set to some moderate size compared to physical RAM, and another cgroup without any emin. In both cgroups, set an elow larger than 50% of physical RAM. The one with emin will have less page scanning, as reclaim pressure is lower. Rebase on top of and apply the same idea as what was applied to handle cgroup_memory=disable properly for the original proportional patch http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name ("mm, memcg: Handle cgroup_disable=memory when getting memcg protection"). Link: http://lkml.kernel.org/r/20190201051810.GA18895@chrisdown.name Signed-off-by: Chris Down <chris@chrisdown.name> Suggested-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Dennis Zhou <dennis@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07mm, memcg: proportional memory.{low,min} reclaimChris Down
cgroup v2 introduces two memory protection thresholds: memory.low (best-effort) and memory.min (hard protection). While they generally do what they say on the tin, there is a limitation in their implementation that makes them difficult to use effectively: that cliff behaviour often manifests when they become eligible for reclaim. This patch implements more intuitive and usable behaviour, where we gradually mount more reclaim pressure as cgroups further and further exceed their protection thresholds. This cliff edge behaviour happens because we only choose whether or not to reclaim based on whether the memcg is within its protection limits (see the use of mem_cgroup_protected in shrink_node), but we don't vary our reclaim behaviour based on this information. Imagine the following timeline, with the numbers the lruvec size in this zone: 1. memory.low=1000000, memory.current=999999. 0 pages may be scanned. 2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned. 3. memory.low=1000000, memory.current=1000001. 1000001* pages may be scanned. (?!) * Of course, we won't usually scan all available pages in the zone even without this patch because of scan control priority, over-reclaim protection, etc. However, as shown by the tests at the end, these techniques don't sufficiently throttle such an extreme change in input, so cliff-like behaviour isn't really averted by their existence alone. Here's an example of how this plays out in practice. At Facebook, we are trying to protect various workloads from "system" software, like configuration management tools, metric collectors, etc (see this[0] case study). In order to find a suitable memory.low value, we start by determining the expected memory range within which the workload will be comfortable operating. This isn't an exact science -- memory usage deemed "comfortable" will vary over time due to user behaviour, differences in composition of work, etc, etc. As such we need to ballpark memory.low, but doing this is currently problematic: 1. If we end up setting it too low for the workload, it won't have *any* effect (see discussion above). The group will receive the full weight of reclaim and won't have any priority while competing with the less important system software, as if we had no memory.low configured at all. 2. Because of this behaviour, we end up erring on the side of setting it too high, such that the comfort range is reliably covered. However, protected memory is completely unavailable to the rest of the system, so we might cause undue memory and IO pressure there when we *know* we have some elasticity in the workload. 3. Even if we get the value totally right, smack in the middle of the comfort zone, we get extreme jumps between no pressure and full pressure that cause unpredictable pressure spikes in the workload due to the current binary reclaim behaviour. With this patch, we can set it to our ballpark estimation without too much worry. Any undesirable behaviour, such as too much or too little reclaim pressure on the workload or system will be proportional to how far our estimation is off. This means we can set memory.low much more conservatively and thus waste less resources *without* the risk of the workload falling off a cliff if we overshoot. As a more abstract technical description, this unintuitive behaviour results in having to give high-priority workloads a large protection buffer on top of their expected usage to function reliably, as otherwise we have abrupt periods of dramatically increased memory pressure which hamper performance. Having to set these thresholds so high wastes resources and generally works against the principle of work conservation. In addition, having proportional memory reclaim behaviour has other benefits. Most notably, before this patch it's basically mandatory to set memory.low to a higher than desirable value because otherwise as soon as you exceed memory.low, all protection is lost, and all pages are eligible to scan again. By contrast, having a gradual ramp in reclaim pressure means that you now still get some protection when thresholds are exceeded, which means that one can now be more comfortable setting memory.low to lower values without worrying that all protection will be lost. This is important because workingset size is really hard to know exactly, especially with variable workloads, so at least getting *some* protection if your workingset size grows larger than you expect increases user confidence in setting memory.low without a huge buffer on top being needed. Thanks a lot to Johannes Weiner and Tejun Heo for their advice and assistance in thinking about how to make this work better. In testing these changes, I intended to verify that: 1. Changes in page scanning become gradual and proportional instead of binary. To test this, I experimented stepping further and further down memory.low protection on a workload that floats around 19G workingset when under memory.low protection, watching page scan rates for the workload cgroup: +------------+-----------------+--------------------+--------------+ | memory.low | test (pgscan/s) | control (pgscan/s) | % of control | +------------+-----------------+--------------------+--------------+ | 21G | 0 | 0 | N/A | | 17G | 867 | 3799 | 23% | | 12G | 1203 | 3543 | 34% | | 8G | 2534 | 3979 | 64% | | 4G | 3980 | 4147 | 96% | | 0 | 3799 | 3980 | 95% | +------------+-----------------+--------------------+--------------+ As you can see, the test kernel (with a kernel containing this patch) ramps up page scanning significantly more gradually than the control kernel (without this patch). 2. More gradual ramp up in reclaim aggression doesn't result in premature OOMs. To test this, I wrote a script that slowly increments the number of pages held by stress(1)'s --vm-keep mode until a production system entered severe overall memory contention. This script runs in a highly protected slice taking up the majority of available system memory. Watching vmstat revealed that page scanning continued essentially nominally between test and control, without causing forward reclaim progress to become arrested. [0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project [akpm@linux-foundation.org: reflow block comments to fit in 80 cols] [chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection] Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name Signed-off-by: Chris Down <chris@chrisdown.name> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07mm/vmpressure.c: fix a signedness bug in vmpressure_register_event()Dan Carpenter
The "mode" and "level" variables are enums and in this context GCC will treat them as unsigned ints so the error handling is never triggered. I also removed the bogus initializer because it isn't required any more and it's sort of confusing. [akpm@linux-foundation.org: reduce implicit and explicit typecasting] [akpm@linux-foundation.org: fix return value, add comment, per Matthew] Link: http://lkml.kernel.org/r/20190925110449.GO3264@mwanda Fixes: 3cadfa2b9497 ("mm/vmpressure.c: convert to use match_string() helper") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Enrico Weigelt <info@metux.net> Cc: Kate Stewart <kstewart@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07mm/page_alloc.c: fix a crash in free_pages_prepare()Qian Cai
On architectures like s390, arch_free_page() could mark the page unused (set_page_unused()) and any access later would trigger a kernel panic. Fix it by moving arch_free_page() after all possible accessing calls. Hardware name: IBM 2964 N96 400 (z/VM 6.4.0) Krnl PSW : 0404e00180000000 0000000026c2b96e (__free_pages_ok+0x34e/0x5d8) R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3 Krnl GPRS: 0000000088d43af7 0000000000484000 000000000000007c 000000000000000f 000003d080012100 000003d080013fc0 0000000000000000 0000000000100000 00000000275cca48 0000000000000100 0000000000000008 000003d080010000 00000000000001d0 000003d000000000 0000000026c2b78a 000000002717fdb0 Krnl Code: 0000000026c2b95c: ec1100b30659 risbgn %r1,%r1,0,179,6 0000000026c2b962: e32014000036 pfd 2,1024(%r1) #0000000026c2b968: d7ff10001000 xc 0(256,%r1),0(%r1) >0000000026c2b96e: 41101100 la %r1,256(%r1) 0000000026c2b972: a737fff8 brctg %r3,26c2b962 0000000026c2b976: d7ff10001000 xc 0(256,%r1),0(%r1) 0000000026c2b97c: e31003400004 lg %r1,832 0000000026c2b982: ebff1430016a asi 5168(%r1),-1 Call Trace: __free_pages_ok+0x16a/0x5d8) memblock_free_all+0x206/0x290 mem_init+0x58/0x120 start_kernel+0x2b0/0x570 startup_continue+0x6a/0xc0 INFO: lockdep is turned off. Last Breaking-Event-Address: __free_pages_ok+0x372/0x5d8 Kernel panic - not syncing: Fatal exception: panic_on_oops 00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 26A2379C In the past, only kernel_poison_pages() would trigger this but it needs "page_poison=on" kernel cmdline, and I suspect nobody tested that on s390. Recently, kernel_init_free_pages() (commit 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")) was added and could trigger this as well. [akpm@linux-foundation.org: add comment] Link: http://lkml.kernel.org/r/1569613623-16820-1-git-send-email-cai@lca.pw Fixes: 8823b1dbc05f ("mm/page_poison.c: enable PAGE_POISONING as a separate option") Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options") Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Cc: <stable@vger.kernel.org> [5.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07mm/z3fold.c: claim page in the beginning of freeVitaly Wool
There's a really hard to reproduce race in z3fold between z3fold_free() and z3fold_reclaim_page(). z3fold_reclaim_page() can claim the page after z3fold_free() has checked if the page was claimed and z3fold_free() will then schedule this page for compaction which may in turn lead to random page faults (since that page would have been reclaimed by then). Fix that by claiming page in the beginning of z3fold_free() and not forgetting to clear the claim in the end. [vitalywool@gmail.com: v2] Link: http://lkml.kernel.org/r/20190928113456.152742cf@bigdell Link: http://lkml.kernel.org/r/20190926104844.4f0c6efa1366b8f5741eaba9@gmail.com Signed-off-by: Vitaly Wool <vitalywool@gmail.com> Reported-by: Markus Linnala <markus.linnala@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Markus Linnala <markus.linnala@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07kernel/sysctl.c: do not override max_threads provided by userspaceMichal Hocko
Partially revert 16db3d3f1170 ("kernel/sysctl.c: threads-max observe limits") because the patch is causing a regression to any workload which needs to override the auto-tuning of the limit provided by kernel. set_max_threads is implementing a boot time guesstimate to provide a sensible limit of the concurrently running threads so that runaways will not deplete all the memory. This is a good thing in general but there are workloads which might need to increase this limit for an application to run (reportedly WebSpher MQ is affected) and that is simply not possible after the mentioned change. It is also very dubious to override an admin decision by an estimation that doesn't have any direct relation to correctness of the kernel operation. Fix this by dropping set_max_threads from sysctl_max_threads so any value is accepted as long as it fits into MAX_THREADS which is important to check because allowing more threads could break internal robust futex restriction. While at it, do not use MIN_THREADS as the lower boundary because it is also only a heuristic for automatic estimation and admin might have a good reason to stop new threads to be created even when below this limit. This became more severe when we switched x86 from 4k to 8k kernel stacks. Starting since 6538b8ea886e ("x86_64: expand kernel stack to 16K") (3.16) we use THREAD_SIZE_ORDER = 2 and that halved the auto-tuned value. In the particular case 3.12 kernel.threads-max = 515561 4.4 kernel.threads-max = 200000 Neither of the two values is really insane on 32GB machine. I am not sure we want/need to tune the max_thread value further. If anything the tuning should be removed altogether if proven not useful in general. But we definitely need a way to override this auto-tuning. Link: http://lkml.kernel.org/r/20190922065801.GB18814@dhcp22.suse.cz Fixes: 16db3d3f1170 ("kernel/sysctl.c: threads-max observe limits") Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Heinrich Schuchardt <xypron.glpk@gmx.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07memcg: only record foreign writebacks with dirty pages when memcg is not ↵Baoquan He
disabled In kdump kernel, memcg usually is disabled with 'cgroup_disable=memory' for saving memory. Now kdump kernel will always panic when dump vmcore to local disk: BUG: kernel NULL pointer dereference, address: 0000000000000ab8 Oops: 0000 [#1] SMP NOPTI CPU: 0 PID: 598 Comm: makedumpfile Not tainted 5.3.0+ #26 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 10/02/2018 RIP: 0010:mem_cgroup_track_foreign_dirty_slowpath+0x38/0x140 Call Trace: __set_page_dirty+0x52/0xc0 iomap_set_page_dirty+0x50/0x90 iomap_write_end+0x6e/0x270 iomap_write_actor+0xce/0x170 iomap_apply+0xba/0x11e iomap_file_buffered_write+0x62/0x90 xfs_file_buffered_aio_write+0xca/0x320 [xfs] new_sync_write+0x12d/0x1d0 vfs_write+0xa5/0x1a0 ksys_write+0x59/0xd0 do_syscall_64+0x59/0x1e0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 And this will corrupt the 1st kernel too with 'cgroup_disable=memory'. Via the trace and with debugging, it is pointing to commit 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing") which introduced this regression. Disabling memcg causes the null pointer dereference at uninitialized data in function mem_cgroup_track_foreign_dirty_slowpath(). Fix it by returning directly if memcg is disabled, but not trying to record the foreign writebacks with dirty pages. Link: http://lkml.kernel.org/r/20190924141928.GD31919@MiWiFi-R3L-srv Fixes: 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing") Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07mm: fix -Wmissing-prototypes warningsYi Wang
We get two warnings when build kernel W=1: mm/shuffle.c:36:12: warning: no previous prototype for `shuffle_show' [-Wmissing-prototypes] mm/sparse.c:220:6: warning: no previous prototype for `subsection_mask_set' [-Wmissing-prototypes] Make the functions static to fix this. Link: http://lkml.kernel.org/r/1566978161-7293-1-git-send-email-wang.yi59@zte.com.cn Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07writeback: fix use-after-free in finish_writeback_work()Tejun Heo
finish_writeback_work() reads @done->waitq after decrementing @done->cnt. However, once @done->cnt reaches zero, @done may be freed (from stack) at any moment and @done->waitq can contain something unrelated by the time finish_writeback_work() tries to read it. This led to the following crash. "BUG: kernel NULL pointer dereference, address: 0000000000000002" #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: 0002 [#1] SMP DEBUG_PAGEALLOC CPU: 40 PID: 555153 Comm: kworker/u98:50 Kdump: loaded Not tainted ... Workqueue: writeback wb_workfn (flush-btrfs-1) RIP: 0010:_raw_spin_lock_irqsave+0x10/0x30 Code: 48 89 d8 5b c3 e8 50 db 6b ff eb f4 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 9c 5b fa 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 05 48 89 d8 5b c3 89 c6 e8 fe ca 6b ff eb f2 66 90 RSP: 0018:ffffc90049b27d98 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000246 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000003 RDI: 0000000000000002 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001 R10: ffff889fff407600 R11: ffff88ba9395d740 R12: 000000000000e300 R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88bfdfa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000002 CR3: 0000000002409005 CR4: 00000000001606e0 Call Trace: __wake_up_common_lock+0x63/0xc0 wb_workfn+0xd2/0x3e0 process_one_work+0x1f5/0x3f0 worker_thread+0x2d/0x3d0 kthread+0x111/0x130 ret_from_fork+0x1f/0x30 Fix it by reading and caching @done->waitq before decrementing @done->cnt. Link: http://lkml.kernel.org/r/20190924010631.GH2233839@devbig004.ftw2.facebook.com Fixes: 5b9cce4c7eb069 ("writeback: Generalize and expose wb_completion") Signed-off-by: Tejun Heo <tj@kernel.org> Debugged-by: Chris Mason <clm@fb.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> [5.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07mm/memremap: drop unused SECTION_SIZE and SECTION_MASKAnshuman Khandual
SECTION_SIZE and SECTION_MASK macros are not getting used anymore. But they do conflict with existing definitions on arm64 platform causing following warning during build. Lets drop these unused macros. mm/memremap.c:16: warning: "SECTION_MASK" redefined #define SECTION_MASK ~((1UL << PA_SECTION_SHIFT) - 1) arch/arm64/include/asm/pgtable-hwdef.h:79: note: this is the location of the previous definition #define SECTION_MASK (~(SECTION_SIZE-1)) mm/memremap.c:17: warning: "SECTION_SIZE" redefined #define SECTION_SIZE (1UL << PA_SECTION_SHIFT) arch/arm64/include/asm/pgtable-hwdef.h:78: note: this is the location of the previous definition #define SECTION_SIZE (_AC(1, UL) << SECTION_SHIFT) Link: http://lkml.kernel.org/r/1569312010-31313-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reported-by: kbuild test robot <lkp@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07panic: ensure preemption is disabled during panic()Will Deacon
Calling 'panic()' on a kernel with CONFIG_PREEMPT=y can leave the calling CPU in an infinite loop, but with interrupts and preemption enabled. From this state, userspace can continue to be scheduled, despite the system being "dead" as far as the kernel is concerned. This is easily reproducible on arm64 when booting with "nosmp" on the command line; a couple of shell scripts print out a periodic "Ping" message whilst another triggers a crash by writing to /proc/sysrq-trigger: | sysrq: Trigger a crash | Kernel panic - not syncing: sysrq triggered crash | CPU: 0 PID: 1 Comm: init Not tainted 5.2.15 #1 | Hardware name: linux,dummy-virt (DT) | Call trace: | dump_backtrace+0x0/0x148 | show_stack+0x14/0x20 | dump_stack+0xa0/0xc4 | panic+0x140/0x32c | sysrq_handle_reboot+0x0/0x20 | __handle_sysrq+0x124/0x190 | write_sysrq_trigger+0x64/0x88 | proc_reg_write+0x60/0xa8 | __vfs_write+0x18/0x40 | vfs_write+0xa4/0x1b8 | ksys_write+0x64/0xf0 | __arm64_sys_write+0x14/0x20 | el0_svc_common.constprop.0+0xb0/0x168 | el0_svc_handler+0x28/0x78 | el0_svc+0x8/0xc | Kernel Offset: disabled | CPU features: 0x0002,24002004 | Memory Limit: none | ---[ end Kernel panic - not syncing: sysrq triggered crash ]--- | Ping 2! | Ping 1! | Ping 1! | Ping 2! The issue can also be triggered on x86 kernels if CONFIG_SMP=n, otherwise local interrupts are disabled in 'smp_send_stop()'. Disable preemption in 'panic()' before re-enabling interrupts. Link: http://lkml.kernel.org/r/20191002123538.22609-1-will@kernel.org Link: https://lore.kernel.org/r/BX1W47JXPMR8.58IYW53H6M5N@dragonstone Signed-off-by: Will Deacon <will@kernel.org> Reported-by: Xogium <contact@xogium.me> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Feng Tang <feng.tang@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07fs: ocfs2: fix a possible null-pointer dereference in ↵Jia-Ju Bai
ocfs2_info_scan_inode_alloc() In ocfs2_info_scan_inode_alloc(), there is an if statement on line 283 to check whether inode_alloc is NULL: if (inode_alloc) When inode_alloc is NULL, it is used on line 287: ocfs2_inode_lock(inode_alloc, &bh, 0); ocfs2_inode_lock_full_nested(inode, ...) struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); Thus, a possible null-pointer dereference may occur. To fix this bug, inode_alloc is checked on line 286. This bug is found by a static analysis tool STCheck written by us. Link: http://lkml.kernel.org/r/20190726033717.32359-1-baijiaju1990@gmail.com Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07fs: ocfs2: fix a possible null-pointer dereference in ocfs2_write_end_nolock()Jia-Ju Bai
In ocfs2_write_end_nolock(), there are an if statement on lines 1976, 2047 and 2058, to check whether handle is NULL: if (handle) When handle is NULL, it is used on line 2045: ocfs2_update_inode_fsync_trans(handle, inode, 1); oi->i_sync_tid = handle->h_transaction->t_tid; Thus, a possible null-pointer dereference may occur. To fix this bug, handle is checked before calling ocfs2_update_inode_fsync_trans(). This bug is found by a static analysis tool STCheck written by us. Link: http://lkml.kernel.org/r/20190726033705.32307-1-baijiaju1990@gmail.com Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()Jia-Ju Bai
In ocfs2_xa_prepare_entry(), there is an if statement on line 2136 to check whether loc->xl_entry is NULL: if (loc->xl_entry) When loc->xl_entry is NULL, it is used on line 2158: ocfs2_xa_add_entry(loc, name_hash); loc->xl_entry->xe_name_hash = cpu_to_le32(name_hash); loc->xl_entry->xe_name_offset = cpu_to_le16(loc->xl_size); and line 2164: ocfs2_xa_add_namevalue(loc, xi); loc->xl_entry->xe_value_size = cpu_to_le64(xi->xi_value_len); loc->xl_entry->xe_name_len = xi->xi_name_len; Thus, possible null-pointer dereferences may occur. To fix these bugs, if loc-xl_entry is NULL, ocfs2_xa_prepare_entry() abnormally returns with -EINVAL. These bugs are found by a static analysis tool STCheck written by us. [akpm@linux-foundation.org: remove now-unused ocfs2_xa_add_entry()] Link: http://lkml.kernel.org/r/20190726101447.9153-1-baijiaju1990@gmail.com Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07ocfs2: clear zero in unaligned direct IOJia Guo
Unused portion of a part-written fs-block-sized block is not set to zero in unaligned append direct write.This can lead to serious data inconsistencies. Ocfs2 manage disk with cluster size(for example, 1M), part-written in one cluster will change the cluster state from UN-WRITTEN to WRITTEN, VFS(function dio_zero_block) doesn't do the cleaning because bh's state is not set to NEW in function ocfs2_dio_wr_get_block when we write a WRITTEN cluster. For example, the cluster size is 1M, file size is 8k and we direct write from 14k to 15k, then 12k~14k and 15k~16k will contain dirty data. We have to deal with two cases: 1.The starting position of direct write is outside the file. 2.The starting position of direct write is located in the file. We need set bh's state to NEW in the first case. In the second case, we need mapped twice because bh's state of area out file should be set to NEW while area in file not. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/5292e287-8f1a-fd4a-1a14-661e555e0bed@huawei.com Signed-off-by: Jia Guo <guojia12@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07x86/xen: Return from panic notifierBoris Ostrovsky
Currently execution of panic() continues until Xen's panic notifier (xen_panic_event()) is called at which point we make a hypercall that never returns. This means that any notifier that is supposed to be called later as well as significant part of panic() code (such as pstore writes from kmsg_dump()) is never executed. There is no reason for xen_panic_event() to be this last point in execution since panic()'s emergency_restart() will call into xen_emergency_restart() from where we can perform our hypercall. Nevertheless, we will provide xen_legacy_crash boot option that will preserve original behavior during crash. This option could be used, for example, if running kernel dumper (which happens after panic notifiers) is undesirable. Reported-by: James Dingwall <james@dingwall.me.uk> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Juergen Gross <jgross@suse.com>
2019-10-07riscv: Correct the handling of unexpected ebreak in do_trap_break()Vincent Chen
For the kernel space, all ebreak instructions are determined at compile time because the kernel space debugging module is currently unsupported. Hence, it should be treated as a bug if an ebreak instruction which does not belong to BUG_TRAP_TYPE_WARN or BUG_TRAP_TYPE_BUG is executed in kernel space. For the userspace, debugging module or user problem may intentionally insert an ebreak instruction to trigger a SIGTRAP signal. To approach the above two situations, the do_trap_break() will direct the BUG_TRAP_TYPE_NONE ebreak exception issued in kernel space to die() and will send a SIGTRAP to the trapped process only when the ebreak is in userspace. Signed-off-by: Vincent Chen <vincent.chen@sifive.com> Reviewed-by: Christoph Hellwig <hch@lst.de> [paul.walmsley@sifive.com: fixed checkpatch issue] Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
2019-10-07riscv: avoid sending a SIGTRAP to a user thread trapped in WARN()Vincent Chen
On RISC-V, when the kernel runs code on behalf of a user thread, and the kernel executes a WARN() or WARN_ON(), the user thread will be sent a bogus SIGTRAP. Fix the RISC-V kernel code to not send a SIGTRAP when a WARN()/WARN_ON() is executed. Signed-off-by: Vincent Chen <vincent.chen@sifive.com> Reviewed-by: Christoph Hellwig <hch@lst.de> [paul.walmsley@sifive.com: fixed subject] Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
2019-10-07riscv: avoid kernel hangs when trapped in BUG()Vincent Chen
When the CONFIG_GENERIC_BUG is disabled by disabling CONFIG_BUG, if a kernel thread is trapped by BUG(), the whole system will be in the loop that infinitely handles the ebreak exception instead of entering the die function. To fix this problem, the do_trap_break() will always call the die() to deal with the break exception as the type of break is BUG_TRAP_TYPE_BUG. Signed-off-by: Vincent Chen <vincent.chen@sifive.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
2019-10-07uaccess: implement a proper unsafe_copy_to_user() and switch filldir over to itLinus Torvalds
In commit 9f79b78ef744 ("Convert filldir[64]() from __put_user() to unsafe_put_user()") I made filldir() use unsafe_put_user(), which improves code generation on x86 enormously. But because we didn't have a "unsafe_copy_to_user()", the dirent name copy was also done by hand with unsafe_put_user() in a loop, and it turns out that a lot of other architectures didn't like that, because unlike x86, they have various alignment issues. Most non-x86 architectures trap and fix it up, and some (like xtensa) will just fail unaligned put_user() accesses unconditionally. Which makes that "copy using put_user() in a loop" not work for them at all. I could make that code do explicit alignment etc, but the architectures that don't like unaligned accesses also don't really use the fancy "user_access_begin/end()" model, so they might just use the regular old __copy_to_user() interface. So this commit takes that looping implementation, turns it into the x86 version of "unsafe_copy_to_user()", and makes other architectures implement the unsafe copy version as __copy_to_user() (the same way they do for the other unsafe_xyz() accessor functions). Note that it only does this for the copying _to_ user space, and we still don't have a unsafe version of copy_from_user(). That's partly because we have no current users of it, but also partly because the copy_from_user() case is slightly different and cannot efficiently be implemented in terms of a unsafe_get_user() loop (because gcc can't do asm goto with outputs). It would be trivial to do this using "rep movsb", which would work really nicely on newer x86 cores, but really badly on some older ones. Al Viro is looking at cleaning up all our user copy routines to make this all a non-issue, but for now we have this simple-but-stupid version for x86 that works fine for the dirent name copy case because those names are short strings and we simply don't need anything fancier. Fixes: 9f79b78ef744 ("Convert filldir[64]() from __put_user() to unsafe_put_user()") Reported-by: Guenter Roeck <linux@roeck-us.net> Reported-and-tested-by: Tony Luck <tony.luck@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07drm/i915: Mark contents as dirty on a write faultChris Wilson
Since dropping the set-to-gtt-domain in commit a679f58d0510 ("drm/i915: Flush pages on acquisition"), we no longer mark the contents as dirty on a write fault. This has the issue of us then not marking the pages as dirty on releasing the buffer, which means the contents are not written out to the swap device (should we ever pick that buffer as a victim). Notably, this is visible in the dumb buffer interface used for cursors. Having updated the cursor contents via mmap, and swapped away, if the shrinker should evict the old cursor, upon next reuse, the cursor would be invisible. E.g. echo 80 > /proc/sys/kernel/sysrq ; echo f > /proc/sysrq-trigger Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111541 Fixes: a679f58d0510 ("drm/i915: Flush pages on acquisition") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Matthew Auld <matthew.william.auld@gmail.com> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Cc: <stable@vger.kernel.org> # v5.2+ Reviewed-by: Matthew Auld <matthew.william.auld@gmail.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190920121821.7223-1-chris@chris-wilson.co.uk (cherry picked from commit 5028851cdfdf78dc22eacbc44a0ab0b3f599ee4a) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2019-10-07drm/i915: Prevent bonded requests from overtaking each other on preemptionChris Wilson
Force bonded requests to run on distinct engines so that they cannot be shuffled onto the same engine where timeslicing will reverse the order. A bonded request will often wait on a semaphore signaled by its master, creating an implicit dependency -- if we ignore that implicit dependency and allow the bonded request to run on the same engine and before its master, we will cause a GPU hang. [Whether it will hang the GPU is debatable, we should keep on timeslicing and each timeslice should be "accidentally" counted as forward progress, in which case it should run but at one-half to one-third speed.] We can prevent this inversion by restricting which engines we allow ourselves to jump to upon preemption, i.e. baking in the arrangement established at first execution. (We should also consider capturing the implicit dependency using i915_sched_add_dependency(), but first we need to think about the constraints that requires on the execution/retirement ordering.) Fixes: 8ee36e048c98 ("drm/i915/execlists: Minimalistic timeslicing") References: ee1136908e9b ("drm/i915/execlists: Virtual engine bonding") Testcase: igt/gem_exec_balancer/bonded-slice Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190923152844.8914-3-chris@chris-wilson.co.uk (cherry picked from commit e2144503bf3b22275dd33cef2880e1cb5fb200c5) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2019-10-07drm/i915: Bump skl+ max plane width to 5k for linear/x-tiledVille Syrjälä
The officially validated plane width limit is 4k on skl+, however we already had people using 5k displays before we started to enforce the limit. Also it seems Windows allows 5k resolutions as well (though not sure if they do it with one plane or two). According to hw folks 5k should work with the possible exception of the following features: - Ytile (already limited to 4k) - FP16 (already limited to 4k) - render compression (already limited to 4k) - KVMR sprite and cursor (don't care) - horizontal panning (need to verify this) - pipe and plane scaling (need to verify this) So apart from last two items on that list we are already fine. We should really verify what happens with those last two items but I don't have a 5k display on hand atm so it'll have to wait. In the meantime let's just bump the limit back up to 5k since several users have already been using it without apparent issues. At least we'll be no worse off than we were prior to lowering the limits. Cc: stable@vger.kernel.org Cc: Sean Paul <sean@poorly.run> Cc: José Roberto de Souza <jose.souza@intel.com> Tested-by: Leho Kraav <leho@kraav.com> Fixes: 372b9ffb5799 ("drm/i915: Fix skl+ max plane width") Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111501 Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190905135044.2001-1-ville.syrjala@linux.intel.com Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Reviewed-by: Sean Paul <sean@poorly.run> (cherry picked from commit bed34ef544f9ab37ab349c04cf4142282c4dcf5d) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2019-10-07drm/i915: Verify the engine after acquiring the active.lockChris Wilson
When using virtual engines, the rq->engine is not stable until we hold the engine->active.lock (as the virtual engine may be exchanged with the sibling). Since commit 22b7a426bbe1 ("drm/i915/execlists: Preempt-to-busy") we may retire a request concurrently with resubmitting it to HW, we need to be extra careful to verify we are holding the correct lock for the request's active list. This is similar to the issue we saw with rescheduling the virtual requests, see sched_lock_engine(). Or else: <4> [876.736126] list_add corruption. prev->next should be next (ffff8883f931a1f8), but was dead000000000100. (prev=ffff888361ffa610). <4> [876.736136] WARNING: CPU: 2 PID: 21 at lib/list_debug.c:28 __list_add_valid+0x4d/0x70 <4> [876.736137] Modules linked in: i915(+) amdgpu gpu_sched ttm vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic mei_hdcp x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul snd_intel_nhlt snd_hda_codec snd_hwdep snd_hda_core ghash_clmulni_intel e1000e cdc_ether usbnet mii snd_pcm ptp pps_core mei_me mei prime_numbers btusb btrtl btbcm btintel bluetooth ecdh_generic ecc [last unloaded: i915] <4> [876.736154] CPU: 2 PID: 21 Comm: ksoftirqd/2 Tainted: G U 5.3.0-CI-CI_DRM_6898+ #1 <4> [876.736156] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.3183.A00.1905020411 05/02/2019 <4> [876.736157] RIP: 0010:__list_add_valid+0x4d/0x70 <4> [876.736159] Code: c3 48 89 d1 48 c7 c7 20 33 0e 82 48 89 c2 e8 4a 4a bc ff 0f 0b 31 c0 c3 48 89 c1 4c 89 c6 48 c7 c7 70 33 0e 82 e8 33 4a bc ff <0f> 0b 31 c0 c3 48 89 f2 4c 89 c1 48 89 fe 48 c7 c7 c0 33 0e 82 e8 <4> [876.736160] RSP: 0018:ffffc9000018bd30 EFLAGS: 00010082 <4> [876.736162] RAX: 0000000000000000 RBX: ffff888361ffc840 RCX: 0000000000000104 <4> [876.736163] RDX: 0000000080000104 RSI: 0000000000000000 RDI: 00000000ffffffff <4> [876.736164] RBP: ffffc9000018bd68 R08: 0000000000000000 R09: 0000000000000001 <4> [876.736165] R10: 00000000aed95de3 R11: 000000007fe927eb R12: ffff888361ffca10 <4> [876.736166] R13: ffff888361ffa610 R14: ffff888361ffc880 R15: ffff8883f931a1f8 <4> [876.736168] FS: 0000000000000000(0000) GS:ffff88849fd00000(0000) knlGS:0000000000000000 <4> [876.736169] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <4> [876.736170] CR2: 00007f093a9173c0 CR3: 00000003bba08005 CR4: 0000000000760ee0 <4> [876.736171] PKRU: 55555554 <4> [876.736172] Call Trace: <4> [876.736226] __i915_request_submit+0x152/0x370 [i915] <4> [876.736263] __execlists_submission_tasklet+0x6da/0x1f50 [i915] <4> [876.736293] ? execlists_submission_tasklet+0x29/0x50 [i915] <4> [876.736321] execlists_submission_tasklet+0x34/0x50 [i915] <4> [876.736325] tasklet_action_common.isra.5+0x47/0xb0 <4> [876.736328] __do_softirq+0xd8/0x4ae <4> [876.736332] ? smpboot_thread_fn+0x23/0x280 <4> [876.736334] ? smpboot_thread_fn+0x6b/0x280 <4> [876.736336] run_ksoftirqd+0x2b/0x50 <4> [876.736338] smpboot_thread_fn+0x1d3/0x280 <4> [876.736341] ? sort_range+0x20/0x20 <4> [876.736343] kthread+0x119/0x130 <4> [876.736345] ? kthread_park+0xa0/0xa0 <4> [876.736347] ret_from_fork+0x24/0x50 <4> [876.736353] irq event stamp: 2290145 <4> [876.736356] hardirqs last enabled at (2290144): [<ffffffff8123cde8>] __slab_free+0x3e8/0x500 <4> [876.736358] hardirqs last disabled at (2290145): [<ffffffff819cfb4d>] _raw_spin_lock_irqsave+0xd/0x50 <4> [876.736360] softirqs last enabled at (2290114): [<ffffffff81c0033e>] __do_softirq+0x33e/0x4ae <4> [876.736361] softirqs last disabled at (2290119): [<ffffffff810b815b>] run_ksoftirqd+0x2b/0x50 <4> [876.736363] WARNING: CPU: 2 PID: 21 at lib/list_debug.c:28 __list_add_valid+0x4d/0x70 <4> [876.736364] ---[ end trace 3e58d6c7356c65bf ]--- <4> [876.736406] ------------[ cut here ]------------ <4> [876.736415] list_del corruption. prev->next should be ffff888361ffca10, but was ffff88840ac2c730 <4> [876.736421] WARNING: CPU: 2 PID: 5490 at lib/list_debug.c:53 __list_del_entry_valid+0x79/0x90 <4> [876.736422] Modules linked in: i915(+) amdgpu gpu_sched ttm vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic mei_hdcp x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul snd_intel_nhlt snd_hda_codec snd_hwdep snd_hda_core ghash_clmulni_intel e1000e cdc_ether usbnet mii snd_pcm ptp pps_core mei_me mei prime_numbers btusb btrtl btbcm btintel bluetooth ecdh_generic ecc [last unloaded: i915] <4> [876.736433] CPU: 2 PID: 5490 Comm: i915_selftest Tainted: G U W 5.3.0-CI-CI_DRM_6898+ #1 <4> [876.736435] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.3183.A00.1905020411 05/02/2019 <4> [876.736436] RIP: 0010:__list_del_entry_valid+0x79/0x90 <4> [876.736438] Code: 0b 31 c0 c3 48 89 fe 48 c7 c7 30 34 0e 82 e8 ae 49 bc ff 0f 0b 31 c0 c3 48 89 f2 48 89 fe 48 c7 c7 68 34 0e 82 e8 97 49 bc ff <0f> 0b 31 c0 c3 48 c7 c7 a8 34 0e 82 e8 86 49 bc ff 0f 0b 31 c0 c3 <4> [876.736439] RSP: 0018:ffffc900003ef758 EFLAGS: 00010086 <4> [876.736440] RAX: 0000000000000000 RBX: ffff888361ffc840 RCX: 0000000000000002 <4> [876.736442] RDX: 0000000080000002 RSI: 0000000000000000 RDI: 00000000ffffffff <4> [876.736443] RBP: ffffc900003ef780 R08: 0000000000000000 R09: 0000000000000001 <4> [876.736444] R10: 000000001418e4b7 R11: 000000007f0ea93b R12: ffff888361ffcab8 <4> [876.736445] R13: ffff88843b6d0000 R14: 000000000000217c R15: 0000000000000001 <4> [876.736447] FS: 00007f4e6f255240(0000) GS:ffff88849fd00000(0000) knlGS:0000000000000000 <4> [876.736448] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <4> [876.736449] CR2: 00007f093a9173c0 CR3: 00000003bba08005 CR4: 0000000000760ee0 <4> [876.736450] PKRU: 55555554 <4> [876.736451] Call Trace: <4> [876.736488] i915_request_retire+0x224/0x8e0 [i915] <4> [876.736521] i915_request_create+0x4b/0x1b0 [i915] <4> [876.736550] nop_virtual_engine+0x230/0x4d0 [i915] Fixes: 22b7a426bbe1 ("drm/i915/execlists: Preempt-to-busy") Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111695 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190918145453.8800-1-chris@chris-wilson.co.uk (cherry picked from commit 37fa0de3c137d5f54f7e64f53495c9d501d42a4d) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2019-10-07drm/i915: Extend Haswell GT1 PSMI workaround to allChris Wilson
A few times in CI, we have detected a GPU hang on our Haswell GT2 systems with the characteristic IPEHR of 0x780c0000. When the PSMI w/a was first introducted, it was applied to all Haswell, but later on we found an erratum that supposedly restricted the issue to GT1 and so constrained it only be applied on GT1. That may have been a mistake... Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111692 Fixes: 167bc759e823 ("drm/i915: Restrict PSMI context load w/a to Haswell GT1") References: 2c550183476d ("drm/i915: Disable PSMI sleep messages on all rings around context switches") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190917194746.26710-1-chris@chris-wilson.co.uk (cherry picked from commit 56c05de6bd773b96deca379370965c49042b5fbf) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2019-10-07drm/i915: Don't mix srcu tag and negative error codesChris Wilson
While srcu may use an integer tag, it does not exclude potential error codes and so may overlap with our own use of -EINTR. Use a separate outparam to store the tag, and report the error code separately. Fixes: 2caffbf11762 ("drm/i915: Revoke mmaps and prevent access to fence registers across reset") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190912160834.30601-1-chris@chris-wilson.co.uk (cherry picked from commit eebab60f224fcfd560957715d08c31564d8672ed) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2019-10-07drm/i915: Whitelist COMMON_SLICE_CHICKEN2Kenneth Graunke
This allows userspace to use "legacy" mode for push constants, where they are committed at 3DPRIMITIVE or flush time, rather than being committed at 3DSTATE_BINDING_TABLE_POINTERS_XS time. Gen6-8 and Gen11 both use the "legacy" behavior - only Gen9 works in the "new" way. Conflating push constants with binding tables is painful for userspace, we would like to be able to avoid doing so. Signed-off-by: Kenneth Graunke <kenneth@whitecape.org> Cc: stable@vger.kernel.org Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Link: https://patchwork.freedesktop.org/patch/msgid/20190911014801.26821-1-kenneth@whitecape.org (cherry picked from commit 0606259e3b3a1220a0f04a92a1654a3f674f47ee) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>