summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2019-10-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Several cases of overlapping changes which were for the most part trivially resolvable. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: "I was battling a cold after some recent trips, so quite a bit piled up meanwhile, sorry about that. Highlights: 1) Fix fd leak in various bpf selftests, from Brian Vazquez. 2) Fix crash in xsk when device doesn't support some methods, from Magnus Karlsson. 3) Fix various leaks and use-after-free in rxrpc, from David Howells. 4) Fix several SKB leaks due to confusion of who owns an SKB and who should release it in the llc code. From Eric Biggers. 5) Kill a bunc of KCSAN warnings in TCP, from Eric Dumazet. 6) Jumbo packets don't work after resume on r8169, as the BIOS resets the chip into non-jumbo mode during suspend. From Heiner Kallweit. 7) Corrupt L2 header during MPLS push, from Davide Caratti. 8) Prevent possible infinite loop in tc_ctl_action, from Eric Dumazet. 9) Get register bits right in bcmgenet driver, based upon chip version. From Florian Fainelli. 10) Fix mutex problems in microchip DSA driver, from Marek Vasut. 11) Cure race between route lookup and invalidation in ipv4, from Wei Wang. 12) Fix performance regression due to false sharing in 'net' structure, from Eric Dumazet" * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (145 commits) net: reorder 'struct net' fields to avoid false sharing net: dsa: fix switch tree list net: ethernet: dwmac-sun8i: show message only when switching to promisc net: aquantia: add an error handling in aq_nic_set_multicast_list net: netem: correct the parent's backlog when corrupted packet was dropped net: netem: fix error path for corrupted GSO frames macb: propagate errors when getting optional clocks xen/netback: fix error path of xenvif_connect_data() net: hns3: fix mis-counting IRQ vector numbers issue net: usb: lan78xx: Connect PHY before registering MAC vsock/virtio: discard packets if credit is not respected vsock/virtio: send a credit update when buffer size is changed mlxsw: spectrum_trap: Push Ethernet header before reporting trap net: ensure correct skb->tstamp in various fragmenters net: bcmgenet: reset 40nm EPHY on energy detect net: bcmgenet: soft reset 40nm EPHYs before MAC init net: phy: bcm7xxx: define soft_reset for 40nm EPHY net: bcmgenet: don't set phydev->link from MAC net: Update address for MediaTek ethernet driver in MAINTAINERS ipv4: fix race condition between route lookup and invalidation ...
2019-10-18symbol namespaces: revert to previous __ksymtab name schemeMatthias Maennich
The introduction of Symbol Namespaces changed the naming schema of the __ksymtab entries from __kysmtab__symbol to __ksymtab_NAMESPACE.symbol. That caused some breakages in tools that depend on the name layout in either the binaries(vmlinux,*.ko) or in System.map. E.g. kmod's depmod would not be able to read System.map without a patch to support symbol namespaces. A warning reported by depmod for namespaced symbols would look like depmod: WARNING: [...]/uas.ko needs unknown symbol usb_stor_adjust_quirks In order to address this issue, revert to the original naming scheme and rather read the __kstrtabns_<symbol> entries and their corresponding values from __ksymtab_strings to update the namespace values for symbols. After having read all symbols and handled them in handle_modversions(), the symbols are created. In a second pass, read the __kstrtabns_ entries and update the namespaces accordingly. Fixes: 8651ec01daed ("module: add support for symbol namespaces.") Reported-by: Stefan Wahren <stefan.wahren@i2se.com> Suggested-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Matthias Maennich <maennich@google.com> Signed-off-by: Jessica Yu <jeyu@kernel.org>
2019-10-17Merge tag 'arm64-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Will Deacon: "The main thing here is a long-awaited workaround for a CPU erratum on ThunderX2 which we have developed in conjunction with engineers from Cavium/Marvell. At the moment, the workaround is unconditionally enabled for affected CPUs at runtime but we may add a command-line option to disable it in future if performance numbers show up indicating a significant cost for real workloads. Summary: - Work around Cavium/Marvell ThunderX2 erratum #219 - Fix regression in mlock() ABI caused by sign-extension of TTBR1 addresses - More fixes to the spurious kernel fault detection logic - Fix pathological preemption race when enabling some CPU features at boot - Drop broken kcore macros in favour of generic implementations - Fix userspace view of ID_AA64ZFR0_EL1 when SVE is disabled - Avoid NULL dereference on allocation failure during hibernation" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: tags: Preserve tags for addresses translated via TTBR1 arm64: mm: fix inverted PAR_EL1.F check arm64: sysreg: fix incorrect definition of SYS_PAR_EL1_F arm64: entry.S: Do not preempt from IRQ before all cpufeatures are enabled arm64: hibernate: check pgd table allocation arm64: cpufeature: Treat ID_AA64ZFR0_EL1 as RAZ when SVE is not enabled arm64: Fix kcore macros after 52-bit virtual addressing fallout arm64: Allow CAVIUM_TX2_ERRATUM_219 to be selected arm64: Avoid Cavium TX2 erratum 219 when switching TTBR arm64: Enable workaround for Cavium TX2 erratum 219 when running SMT arm64: KVM: Trap VM ops when ARM64_WORKAROUND_CAVIUM_TX2_219_TVM is set
2019-10-17net: phy: micrel: Update KSZ87xx PHY nameMarek Vasut
The KSZ8795 PHY ID is in fact used by KSZ8794/KSZ8795/KSZ8765 switches. Update the PHY ID and name to reflect that, as this family of switches is commonly refered to as KSZ87xx Signed-off-by: Marek Vasut <marex@denx.de> Cc: Andrew Lunn <andrew@lunn.ch> Cc: David S. Miller <davem@davemloft.net> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: George McCollister <george.mccollister@gmail.com> Cc: Heiner Kallweit <hkallweit1@gmail.com> Cc: Sean Nyekjaer <sean.nyekjaer@prevas.dk> Cc: Tristram Ha <Tristram.Ha@microchip.com> Cc: Woojung Huh <woojung.huh@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-17Merge branch 'errata/tx2-219' into for-next/fixesWill Deacon
Workaround for Cavium/Marvell ThunderX2 erratum #219. * errata/tx2-219: arm64: Allow CAVIUM_TX2_ERRATUM_219 to be selected arm64: Avoid Cavium TX2 erratum 219 when switching TTBR arm64: Enable workaround for Cavium TX2 erratum 219 when running SMT arm64: KVM: Trap VM ops when ARM64_WORKAROUND_CAVIUM_TX2_219_TVM is set
2019-10-17Merge tag 'gpio-v5.4-3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio Pull GPIO fixes from Linus Walleij: "The fixes pertain to a problem with initializing the Intel GPIO irqchips when adding gpiochips. Andy fixed it up elegantly by adding a hardware initialization callback to the struct gpio_irq_chip so let's use this. Tested and verified on the target hardware" * tag 'gpio-v5.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: gpio: lynxpoint: set default handler to be handle_bad_irq() gpio: merrifield: Move hardware initialization to callback gpio: lynxpoint: Move hardware initialization to callback gpio: intel-mid: Move hardware initialization to callback gpiolib: Initialize the hardware with a callback gpio: merrifield: Restore use of irq_base
2019-10-17bpf: Check types of arguments passed into helpersAlexei Starovoitov
Introduce new helper that reuses existing skb perf_event output implementation, but can be called from raw_tracepoint programs that receive 'struct sk_buff *' as tracepoint argument or can walk other kernel data structures to skb pointer. In order to do that teach verifier to resolve true C types of bpf helpers into in-kernel BTF ids. The type of kernel pointer passed by raw tracepoint into bpf program will be tracked by the verifier all the way until it's passed into helper function. For example: kfree_skb() kernel function calls trace_kfree_skb(skb, loc); bpf programs receives that skb pointer and may eventually pass it into bpf_skb_output() bpf helper which in-kernel is implemented via bpf_skb_event_output() kernel function. Its first argument in the kernel is 'struct sk_buff *'. The verifier makes sure that types match all the way. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20191016032505.2089704-11-ast@kernel.org
2019-10-17bpf: Add support for BTF pointers to x86 JITAlexei Starovoitov
Pointer to BTF object is a pointer to kernel object or NULL. Such pointers can only be used by BPF_LDX instructions. The verifier changed their opcode from LDX|MEM|size to LDX|PROBE_MEM|size to make JITing easier. The number of entries in extable is the number of BPF_LDX insns that access kernel memory via "pointer to BTF type". Only these load instructions can fault. Since x86 extable is relative it has to be allocated in the same memory region as JITed code. Allocate it prior to last pass of JITing and let the last pass populate it. Pointer to extable in bpf_prog_aux is necessary to make page fault handling fast. Page fault handling is done in two steps: 1. bpf_prog_kallsyms_find() finds BPF program that page faulted. It's done by walking rb tree. 2. then extable for given bpf program is binary searched. This process is similar to how page faulting is done for kernel modules. The exception handler skips over faulting x86 instruction and initializes destination register with zero. This mimics exact behavior of bpf_probe_read (when probe_kernel_read faults dest is zeroed). JITs for other architectures can add support in similar way. Until then they will reject unknown opcode and fallback to interpreter. Since extable should be aligned and placed near JITed code make bpf_jit_binary_alloc() return 4 byte aligned image offset, so that extable aligning formula in bpf_int_jit_compile() doesn't need to rely on internal implementation of bpf_jit_binary_alloc(). On x86 gcc defaults to 16-byte alignment for regular kernel functions due to better performance. JITed code may be aligned to 16 in the future, but it will use 4 in the meantime. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20191016032505.2089704-10-ast@kernel.org
2019-10-17bpf: Add support for BTF pointers to interpreterAlexei Starovoitov
Pointer to BTF object is a pointer to kernel object or NULL. The memory access in the interpreter has to be done via probe_kernel_read to avoid page faults. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20191016032505.2089704-9-ast@kernel.org
2019-10-17bpf: Implement accurate raw_tp context access via BTFAlexei Starovoitov
libbpf analyzes bpf C program, searches in-kernel BTF for given type name and stores it into expected_attach_type. The kernel verifier expects this btf_id to point to something like: typedef void (*btf_trace_kfree_skb)(void *, struct sk_buff *skb, void *loc); which represents signature of raw_tracepoint "kfree_skb". Then btf_ctx_access() matches ctx+0 access in bpf program with 'skb' and 'ctx+8' access with 'loc' arguments of "kfree_skb" tracepoint. In first case it passes btf_id of 'struct sk_buff *' back to the verifier core and 'void *' in second case. Then the verifier tracks PTR_TO_BTF_ID as any other pointer type. Like PTR_TO_SOCKET points to 'struct bpf_sock', PTR_TO_TCP_SOCK points to 'struct bpf_tcp_sock', and so on. PTR_TO_BTF_ID points to in-kernel structs. If 1234 is btf_id of 'struct sk_buff' in vmlinux's BTF then PTR_TO_BTF_ID#1234 points to one of in kernel skbs. When PTR_TO_BTF_ID#1234 is dereferenced (like r2 = *(u64 *)r1 + 32) the btf_struct_access() checks which field of 'struct sk_buff' is at offset 32. Checks that size of access matches type definition of the field and continues to track the dereferenced type. If that field was a pointer to 'struct net_device' the r2's type will be PTR_TO_BTF_ID#456. Where 456 is btf_id of 'struct net_device' in vmlinux's BTF. Such verifier analysis prevents "cheating" in BPF C program. The program cannot cast arbitrary pointer to 'struct sk_buff *' and access it. C compiler would allow type cast, of course, but the verifier will notice type mismatch based on BPF assembly and in-kernel BTF. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20191016032505.2089704-7-ast@kernel.org
2019-10-17bpf: Add attach_btf_id attribute to program loadAlexei Starovoitov
Add attach_btf_id attribute to prog_load command. It's similar to existing expected_attach_type attribute which is used in several cgroup based program types. Unfortunately expected_attach_type is ignored for tracing programs and cannot be reused for new purpose. Hence introduce attach_btf_id to verify bpf programs against given in-kernel BTF type id at load time. It is strictly checked to be valid for raw_tp programs only. In a later patches it will become: btf_id == 0 semantics of existing raw_tp progs. btd_id > 0 raw_tp with BTF and additional type safety. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20191016032505.2089704-5-ast@kernel.org
2019-10-17bpf: Process in-kernel BTFAlexei Starovoitov
If in-kernel BTF exists parse it and prepare 'struct btf *btf_vmlinux' for further use by the verifier. In-kernel BTF is trusted just like kallsyms and other build artifacts embedded into vmlinux. Yet run this BTF image through BTF verifier to make sure that it is valid and it wasn't mangled during the build. If in-kernel BTF is incorrect it means either gcc or pahole or kernel are buggy. In such case disallow loading BPF programs. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20191016032505.2089704-4-ast@kernel.org
2019-10-17bpf: Add typecast to bpf helpers to help BTF generationAlexei Starovoitov
When pahole converts dwarf to btf it emits only used types. Wrap existing bpf helper functions into typedef and use it in typecast to make gcc emits this type into dwarf. Then pahole will convert it to btf. The "btf_#name_of_helper" types will be used to figure out types of arguments of bpf helpers. The generated code before and after is the same. Only dwarf and btf sections are different. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20191016032505.2089704-3-ast@kernel.org
2019-10-17netfilter: add and use nf_hook_slow_list()Florian Westphal
At this time, NF_HOOK_LIST() macro will iterate the list and then calls nf_hook() for each individual skb. This makes it so the entire list is passed into the netfilter core. The advantage is that we only need to fetch the rule blob once per list instead of per-skb. NF_HOOK_LIST now only works for ipv4 and ipv6, as those are the only callers. v2: use skb_list_del_init() instead of list_del (Edward Cree) Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-10-16net: sfp: move fwnode parsing into sfp-bus layerRussell King
Rather than parsing the sfp firmware node in phylink, parse it in the sfp-bus code, so we can re-use this code for PHYs without having to duplicate the parsing. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-16arm64: entry.S: Do not preempt from IRQ before all cpufeatures are enabledJulien Thierry
Preempting from IRQ-return means that the task has its PSTATE saved on the stack, which will get restored when the task is resumed and does the actual IRQ return. However, enabling some CPU features requires modifying the PSTATE. This means that, if a task was scheduled out during an IRQ-return before all CPU features are enabled, the task might restore a PSTATE that does not include the feature enablement changes once scheduled back in. * Task 1: PAN == 0 ---| |--------------- | |<- return from IRQ, PSTATE.PAN = 0 | <- IRQ | +--------+ <- preempt() +-- ^ | reschedule Task 1, PSTATE.PAN == 1 * Init: --------------------+------------------------ ^ | enable_cpu_features set PSTATE.PAN on all CPUs Worse than this, since PSTATE is untouched when task switching is done, a task missing the new bits in PSTATE might affect another task, if both do direct calls to schedule() (outside of IRQ/exception contexts). Fix this by preventing preemption on IRQ-return until features are enabled on all CPUs. This way the only PSTATE values that are saved on the stack are from synchronous exceptions. These are expected to be fatal this early, the exception is BRK for WARN_ON(), but as this uses do_debug_exception() which keeps IRQs masked, it shouldn't call schedule(). Signed-off-by: Julien Thierry <julien.thierry@arm.com> [james: Replaced a really cool hack, with an even simpler static key in C. expanded commit message with Julien's cover-letter ascii art] Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Will Deacon <will@kernel.org>
2019-10-15net: phylink: use more linkmode_*Russell King
Use more linkmode_* helpers rather than open-coding the bitmap operations. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-15net/sched: fix corrupted L2 header with MPLS 'push' and 'pop' actionsDavide Caratti
the following script: # tc qdisc add dev eth0 clsact # tc filter add dev eth0 egress protocol ip matchall \ > action mpls push protocol mpls_uc label 0x355aa bos 1 causes corruption of all IP packets transmitted by eth0. On TC egress, we can't rely on the value of skb->mac_len, because it's 0 and a MPLS 'push' operation will result in an overwrite of the first 4 octets in the packet L2 header (e.g. the Destination Address if eth0 is an Ethernet); the same error pattern is present also in the MPLS 'pop' operation. Fix this error in act_mpls data plane, computing 'mac_len' as the difference between the network header and the mac header (when not at TC ingress), and use it in MPLS 'push'/'pop' core functions. v2: unbreak 'make htmldocs' because of missing documentation of 'mac_len' in skb_mpls_pop(), reported by kbuild test robot CC: Lorenzo Bianconi <lorenzo@kernel.org> Fixes: 2a2ea50870ba ("net: sched: add mpls manipulation actions to TC") Reviewed-by: Simon Horman <simon.horman@netronome.com> Acked-by: John Hurley <john.hurley@netronome.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-15gpiolib: Initialize the hardware with a callbackAndy Shevchenko
After changing the drivers to use GPIO core to add an IRQ chip it appears that some of them requires a hardware initialization before adding the IRQ chip. Add an optional callback ->init_hw() to allow that drivers to initialize hardware if needed. This change is a part of the fix NULL pointer dereference brought to the several drivers recently. Cc: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2019-10-14xarray.h: fix kernel-doc warningRandy Dunlap
Fix (Sphinx) kernel-doc warning in <linux/xarray.h>: include/linux/xarray.h:232: WARNING: Unexpected indentation. Link: http://lkml.kernel.org/r/89ba2134-ce23-7c10-5ee1-ef83b35aa984@infradead.org Fixes: a3e4d3f97ec8 ("XArray: Redesign xa_alloc API") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-14bitmap.h: fix kernel-doc warning and typoRandy Dunlap
Fix kernel-doc warning in <linux/bitmap.h>: include/linux/bitmap.h:341: warning: Function parameter or member 'nbits' not described in 'bitmap_or_equal' Also fix small typo (bitnaps). Link: http://lkml.kernel.org/r/0729ea7a-2c0d-b2c5-7dd3-3629ee0803e2@infradead.org Fixes: b9fa6442f704 ("cpumask: Implement cpumask_or_equal()") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-14mm, page_owner: rename flag indicating that page is allocatedVlastimil Babka
Commit 37389167a281 ("mm, page_owner: keep owner info when freeing the page") has introduced a flag PAGE_EXT_OWNER_ACTIVE to indicate that page is tracked as being allocated. Kirril suggested naming it PAGE_EXT_OWNER_ALLOCATED to make it more clear, as "active is somewhat loaded term for a page". Link: http://lkml.kernel.org/r/20190930122916.14969-4-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Walter Wu <walter-zh.wu@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-14mm, page_owner: fix off-by-one error in __set_page_owner_handle()Vlastimil Babka
Patch series "followups to debug_pagealloc improvements through page_owner", v3. These are followups to [1] which made it to Linus meanwhile. Patches 1 and 3 are based on Kirill's review, patch 2 on KASAN request [2]. It would be nice if all of this made it to 5.4 with [1] already there (or at least Patch 1). This patch (of 3): As noted by Kirill, commit 7e2f2a0cd17c ("mm, page_owner: record page owner for each subpage") has introduced an off-by-one error in __set_page_owner_handle() when looking up page_ext for subpages. As a result, the head page page_owner info is set twice, while for the last tail page, it's not set at all. Fix this and also make the code more efficient by advancing the page_ext pointer we already have, instead of calling lookup_page_ext() for each subpage. Since the full size of struct page_ext is not known at compile time, we can't use a simple page_ext++ statement, so introduce a page_ext_next() inline function for that. Link: http://lkml.kernel.org/r/20190930122916.14969-2-vbabka@suse.cz Fixes: 7e2f2a0cd17c ("mm, page_owner: record page owner for each subpage") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Kirill A. Shutemov <kirill@shutemov.name> Reported-by: Miles Chen <miles.chen@mediatek.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Walter Wu <walter-zh.wu@mediatek.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf-next 2019-10-14 The following pull-request contains BPF updates for your *net-next* tree. 12 days of development and 85 files changed, 1889 insertions(+), 1020 deletions(-) The main changes are: 1) auto-generation of bpf_helper_defs.h, from Andrii. 2) split of bpf_helpers.h into bpf_{helpers, helper_defs, endian, tracing}.h and move into libbpf, from Andrii. 3) Track contents of read-only maps as scalars in the verifier, from Andrii. 4) small x86 JIT optimization, from Daniel. 5) cross compilation support, from Ivan. 6) bpf flow_dissector enhancements, from Jakub and Stanislav. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-14dmaengine: imx-sdma: fix size check for sdma script_numberRobin Gong
Illegal memory will be touch if SDMA_SCRIPT_ADDRS_ARRAY_SIZE_V3 (41) exceed the size of structure sdma_script_start_addrs(40), thus cause memory corrupt such as slob block header so that kernel trap into while() loop forever in slob_free(). Please refer to below code piece in imx-sdma.c: for (i = 0; i < sdma->script_number; i++) if (addr_arr[i] > 0) saddr_arr[i] = addr_arr[i]; /* memory corrupt here */ That issue was brought by commit a572460be9cf ("dmaengine: imx-sdma: Add support for version 3 firmware") because SDMA_SCRIPT_ADDRS_ARRAY_SIZE_V3 (38->41 3 scripts added) not align with script number added in sdma_script_start_addrs(2 scripts). Fixes: a572460be9cf ("dmaengine: imx-sdma: Add support for version 3 firmware") Cc: stable@vger.kernel Link: https://www.spinics.net/lists/arm-kernel/msg754895.html Signed-off-by: Robin Gong <yibin.gong@nxp.com> Reported-by: Jurgen Lambrecht <J.Lambrecht@TELEVIC.com> Link: https://lore.kernel.org/r/1569347584-3478-1-git-send-email-yibin.gong@nxp.com [vkoul: update the patch title] Signed-off-by: Vinod Koul <vkoul@kernel.org>
2019-10-13tcp: add rcu protection around tp->fastopen_rskEric Dumazet
Both tcp_v4_err() and tcp_v6_err() do the following operations while they do not own the socket lock : fastopen = tp->fastopen_rsk; snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una; The problem is that without appropriate barrier, the compiler might reload tp->fastopen_rsk and trigger a NULL deref. request sockets are protected by RCU, we can simply add the missing annotations and barriers to solve the issue. Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-13Merge tag 'hwmon-for-v5.4-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging Pull hwmon fixes from Guenter Roeck: - Update/fix inspur-ipsps1 and k10temp Documentation - Fix nct7904 driver - Fix HWMON_P_MIN_ALARM mask in hwmon core * tag 'hwmon-for-v5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging: hwmon: docs: Extend inspur-ipsps1 title underline hwmon: (nct7904) Add array fan_alarm and vsen_alarm to store the alarms in nct7904_data struct. docs: hwmon: Include 'inspur-ipsps1.rst' into docs hwmon: Fix HWMON_P_MIN_ALARM mask hwmon: (k10temp) Update documentation and add temp2_input info hwmon: (nct7904) Fix the incorrect value of vsen_mask in nct7904_data struct
2019-10-12Merge tag 'usb-5.4-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB fixes from Greg KH: "Here are a lot of small USB driver fixes for 5.4-rc3. syzbot has stepped up its testing of the USB driver stack, now able to trigger fun race conditions between disconnect and probe functions. Because of that we have a lot of fixes in here from Johan and others fixing these reported issues that have been around since almost all time. We also are just deleting the rio500 driver, making all of the syzbot bugs found in it moot as it turns out no one has been using it for years as there is a userspace version that is being used instead. There are also a number of other small fixes in here, all resolving reported issues or regressions. All have been in linux-next without any reported issues" * tag 'usb-5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (65 commits) USB: yurex: fix NULL-derefs on disconnect USB: iowarrior: use pr_err() USB: iowarrior: drop redundant iowarrior mutex USB: iowarrior: drop redundant disconnect mutex USB: iowarrior: fix use-after-free after driver unbind USB: iowarrior: fix use-after-free on release USB: iowarrior: fix use-after-free on disconnect USB: chaoskey: fix use-after-free on release USB: adutux: fix use-after-free on release USB: ldusb: fix NULL-derefs on driver unbind USB: legousbtower: fix use-after-free on release usb: cdns3: Fix for incorrect DMA mask. usb: cdns3: fix cdns3_core_init_role() usb: cdns3: gadget: Fix full-speed mode USB: usb-skeleton: drop redundant in-urb check USB: usb-skeleton: fix use-after-free after driver unbind USB: usb-skeleton: fix NULL-deref on disconnect usb:cdns3: Fix for CV CH9 running with g_zero driver. usb: dwc3: Remove dev_err() on platform_get_irq() failure usb: dwc3: Switch to platform_get_irq_byname_optional() ...
2019-10-12Merge branch 'efi-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI fixes from Ingo Molnar: "Misc EFI fixes all across the map: CPER error report fixes, fixes to TPM event log parsing, fix for a kexec hang, a Sparse fix and other fixes" * 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: efi/tpm: Fix sanity check of unsigned tbl_size being less than zero efi/x86: Do not clean dummy variable in kexec path efi: Make unexported efi_rci2_sysfs_init() static efi/tpm: Only set 'efi_tpm_final_log_size' after successful event log parsing efi/tpm: Don't traverse an event log with no events efi/tpm: Don't access event->count when it isn't mapped efivar/ssdt: Don't iterate over EFI vars if no SSDT override was specified efi/cper: Fix endianness of PCIe class code
2019-10-12Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "A handful of fixes: a kexec linking fix, an AMD MWAITX fix, a vmware guest support fix when built under Clang, and new CPU model number definitions" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Add Comet Lake to the Intel CPU models header lib/string: Make memzero_explicit() inline instead of external x86/cpu/vmware: Use the full form of INL in VMWARE_PORT x86/asm: Fix MWAITX C-state hint value
2019-10-11Merge tag 'nfs-for-5.4-2' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Anna Schumaker: "Stable bugfixes: - Fix O_DIRECT accounting of number of bytes read/written # v4.1+ Other fixes: - Fix nfsi->nrequests count error on nfs_inode_remove_request() - Remove redundant mirror tracking in O_DIRECT - Fix leak of clp->cl_acceptor string - Fix race to sk_err after xs_error_report" * tag 'nfs-for-5.4-2' of git://git.linux-nfs.org/projects/anna/linux-nfs: SUNRPC: fix race to sk_err after xs_error_report NFSv4: Fix leak of clp->cl_acceptor string NFS: Remove redundant mirror tracking in O_DIRECT NFS: Fix O_DIRECT accounting of number of bytes read/written nfs: Fix nfsi->nrequests count error on nfs_inode_remove_request
2019-10-11bpf: Align struct bpf_prog_statsEric Dumazet
Do not risk spanning these small structures on two cache lines. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20191011181140.2898-1-edumazet@google.com
2019-10-11Merge tag 'modules-for-v5.4-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux Pull module fixes from Jessica Yu: "Code cleanups and kbuild/namespace related fixups from Masahiro. Most importantly, it fixes a namespace-related modpost issue for external module builds - Fix broken external module builds due to a modpost bug in read_dump(), where the namespace was not being strdup'd and sym->namespace would be set to bogus data. - Various namespace-related kbuild fixes and cleanups thanks to Masahiro Yamada" * tag 'modules-for-v5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: doc: move namespaces.rst from kbuild/ to core-api/ nsdeps: make generated patches independent of locale nsdeps: fix hashbang of scripts/nsdeps kbuild: fix build error of 'make nsdeps' in clean tree module: rename __kstrtab_ns_* to __kstrtabns_* to avoid symbol conflict modpost: fix broken sym->namespace for external module builds module: swap the order of symbol.namespace scripts: add_namespace: Fix coccicheck failed
2019-10-11compiler_attributes.h: Add 'fallthrough' pseudo keyword for switch/case useJoe Perches
Reserve the pseudo keyword 'fallthrough' for the ability to convert the various case block /* fallthrough */ style comments to appear to be an actual reserved word with the same gcc case block missing fallthrough warning capability. All switch/case blocks now should end in one of: break; fallthrough; goto <label>; return [expression]; continue; In C mode, GCC supports the __fallthrough__ attribute since 7.1, the same time the warning and the comment parsing were introduced. fallthrough devolves to an empty "do {} while (0)" if the compiler version (any version less than gcc 7) does not support the attribute. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Suggested-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-10SUNRPC: fix race to sk_err after xs_error_reportBenjamin Coddington
Since commit 4f8943f80883 ("SUNRPC: Replace direct task wakeups from softirq context") there has been a race to the value of the sk_err if both XPRT_SOCK_WAKE_ERROR and XPRT_SOCK_WAKE_DISCONNECT are set. In that case, we may end up losing the sk_err value that existed when xs_error_report was called. Fix this by reverting to the previous behavior: instead of using SO_ERROR to retrieve the value at a later time (which might also return sk_err_soft), copy the sk_err value onto struct sock_xprt, and use that value to wake pending tasks. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Fixes: 4f8943f80883 ("SUNRPC: Replace direct task wakeups from softirq context") Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-09DIM: fix dim.h kernel-doc and headersRandy Dunlap
Lots of fixes to kernel-doc in structs, enums, and functions. Also add header files that are being used but not yet #included. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Yamin Friedman <yaminf@mellanox.com> Cc: Tal Gilboa <talgi@mellanox.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: linux-rdma@vger.kernel.org Cc: netdev@vger.kernel.org Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-09Merge tag 'spi-ptp-api' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi Pull in a dependency for Vladimir's work on more precise packet time stamping. Mark Brown says: ==================== spi: Add a PTP API For detailed timestamping of operations. ==================== Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-08Revert "tun: call dev_get_valid_name() before register_netdevice()"Eric Dumazet
This reverts commit 0ad646c81b2182f7fa67ec0c8c825e0ee165696d. As noticed by Jakub, this is no longer needed after commit 11fc7d5a0a2d ("tun: fix memory leak in error path") This no longer exports dev_get_valid_name() for the exclusive use of tun driver. Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-08Merge tag 'led-fixes-for-5.4-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds Pull LED fixes from Jacek Anaszewski: - fix a leftover from earlier stage of development in the documentation of recently added led_compose_name() and fix old mistake in the documentation of led_set_brightness_sync() parameter name. - MAINTAINERS: add pointer to Pavel Machek's linux-leds.git tree. Pavel is going to take over LED tree maintainership from myself. * tag 'led-fixes-for-5.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds: Add my linux-leds branch to MAINTAINERS leds: core: Fix leds.h structure documentation
2019-10-08leds: core: Fix leds.h structure documentationDan Murphy
Update the leds.h structure documentation to define the correct arguments. Signed-off-by: Dan Murphy <dmurphy@ti.com> Signed-off-by: Jacek Anaszewski <jacek.anaszewski@gmail.com>
2019-10-08spi: Add a PTP system timestamp to the transfer structureVladimir Oltean
SPI is one of the interfaces used to access devices which have a POSIX clock driver (real time clocks, 1588 timers etc). The fact that the SPI bus is slow is not what the main problem is, but rather the fact that drivers don't take a constant amount of time in transferring data over SPI. When there is a high delay in the readout of time, there will be uncertainty in the value that has been read out of the peripheral. When that delay is constant, the uncertainty can at least be approximated with a certain accuracy which is fine more often than not. Timing jitter occurs all over in the kernel code, and is mainly caused by having to let go of the CPU for various reasons such as preemption, servicing interrupts, going to sleep, etc. Another major reason is CPU dynamic frequency scaling. It turns out that the problem of retrieving time from a SPI peripheral with high accuracy can be solved by the use of "PTP system timestamping" - a mechanism to correlate the time when the device has snapshotted its internal time counter with the Linux system time at that same moment. This is sufficient for having a precise time measurement - it is not necessary for the whole SPI transfer to be transmitted "as fast as possible", or "as low-jitter as possible". The system has to be low-jitter for a very short amount of time to be effective. This patch introduces a PTP system timestamping mechanism in struct spi_transfer. This is to be used by SPI device drivers when they need to know the exact time at which the underlying device's time was snapshotted. More often than not, SPI peripherals have a very exact timing for when their SPI-to-interconnect bridge issues a transaction for snapshotting and reading the time register, and that will be dependent on when the SPI-to-interconnect bridge figures out that this is what it should do, aka as soon as it sees byte N of the SPI transfer. Since spi_device drivers are the ones who'd know best how the peripheral behaves in this regard, expose a mechanism in spi_transfer which allows them to specify which word (or word range) from the transfer should be timestamped. Add a default implementation of the PTP system timestamping in the SPI core. This is not going to be satisfactory performance-wise, but should at least increase the likelihood that SPI device drivers will use PTP system timestamping in the future. There are 3 entry points from the core towards the SPI controller drivers: - transfer_one: The driver is passed individual spi_transfers to execute. This is the easiest to timestamp. - transfer_one_message: The core passes the driver an entire spi_message (a potential batch of spi_transfers). The core puts the same pre and post timestamp to all transfers within a message. This is not ideal, but nothing better can be done by default anyway, since the core has no insight into how the driver batches the transfers. - transfer: Like transfer_one_message, but for unqueued drivers (i.e. the driver implements its own queue scheduling). Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Link: https://lore.kernel.org/r/20190905010114.26718-3-olteanv@gmail.com Signed-off-by: Mark Brown <broonie@kernel.org>
2019-10-08lib/string: Make memzero_explicit() inline instead of externalArvind Sankar
With the use of the barrier implied by barrier_data(), there is no need for memzero_explicit() to be extern. Making it inline saves the overhead of a function call, and allows the code to be reused in arch/*/purgatory without having to duplicate the implementation. Tested-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Borislav Petkov <bp@alien8.de> Cc: H . Peter Anvin <hpa@zytor.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephan Mueller <smueller@chronox.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-crypto@vger.kernel.org Cc: linux-s390@vger.kernel.org Fixes: 906a4bb97f5d ("crypto: sha256 - Use get/put_unaligned_be32 to get input, memzero_explicit") Link: https://lkml.kernel.org/r/20191007220000.GA408752@rani.riverdale.lan Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc fixes from Andrew Morton: "The usual shower of hotfixes. Chris's memcg patches aren't actually fixes - they're mature but a few niggling review issues were late to arrive. The ocfs2 fixes are quite old - those took some time to get reviewer attention. Subsystems affected by this patch series: ocfs2, hotfixes, mm/memcg, mm/slab-generic" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two) mm, sl[ou]b: improve memory accounting mm, memcg: make scan aggression always exclude protection mm, memcg: make memory.emin the baseline for utilisation determination mm, memcg: proportional memory.{low,min} reclaim mm/vmpressure.c: fix a signedness bug in vmpressure_register_event() mm/page_alloc.c: fix a crash in free_pages_prepare() mm/z3fold.c: claim page in the beginning of free kernel/sysctl.c: do not override max_threads provided by userspace memcg: only record foreign writebacks with dirty pages when memcg is not disabled mm: fix -Wmissing-prototypes warnings writeback: fix use-after-free in finish_writeback_work() mm/memremap: drop unused SECTION_SIZE and SECTION_MASK panic: ensure preemption is disabled during panic() fs: ocfs2: fix a possible null-pointer dereference in ocfs2_info_scan_inode_alloc() fs: ocfs2: fix a possible null-pointer dereference in ocfs2_write_end_nolock() fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry() ocfs2: clear zero in unaligned direct IO
2019-10-07mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)Vlastimil Babka
In most configurations, kmalloc() happens to return naturally aligned (i.e. aligned to the block size itself) blocks for power of two sizes. That means some kmalloc() users might unknowingly rely on that alignment, until stuff breaks when the kernel is built with e.g. CONFIG_SLUB_DEBUG or CONFIG_SLOB, and blocks stop being aligned. Then developers have to devise workaround such as own kmem caches with specified alignment [1], which is not always practical, as recently evidenced in [2]. The topic has been discussed at LSF/MM 2019 [3]. Adding a 'kmalloc_aligned()' variant would not help with code unknowingly relying on the implicit alignment. For slab implementations it would either require creating more kmalloc caches, or allocate a larger size and only give back part of it. That would be wasteful, especially with a generic alignment parameter (in contrast with a fixed alignment to size). Ideally we should provide to mm users what they need without difficult workarounds or own reimplementations, so let's make the kmalloc() alignment to size explicitly guaranteed for power-of-two sizes under all configurations. What this means for the three available allocators? * SLAB object layout happens to be mostly unchanged by the patch. The implicitly provided alignment could be compromised with CONFIG_DEBUG_SLAB due to redzoning, however SLAB disables redzoning for caches with alignment larger than unsigned long long. Practically on at least x86 this includes kmalloc caches as they use cache line alignment, which is larger than that. Still, this patch ensures alignment on all arches and cache sizes. * SLUB layout is also unchanged unless redzoning is enabled through CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache. With this patch, explicit alignment is guaranteed with redzoning as well. This will result in more memory being wasted, but that should be acceptable in a debugging scenario. * SLOB has no implicit alignment so this patch adds it explicitly for kmalloc(). The potential downside is increased fragmentation. While pathological allocation scenarios are certainly possible, in my testing, after booting a x86_64 kernel+userspace with virtme, around 16MB memory was consumed by slab pages both before and after the patch, with difference in the noise. [1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/ [2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/ [3] https://lwn.net/Articles/787740/ [akpm@linux-foundation.org: documentation fixlet, per Matthew] Link: http://lkml.kernel.org/r/20190826111627.7505-3-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Christoph Hellwig <hch@lst.de> Cc: David Sterba <dsterba@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: "Darrick J . Wong" <darrick.wong@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07mm, memcg: make scan aggression always exclude protectionChris Down
This patch is an incremental improvement on the existing memory.{low,min} relative reclaim work to base its scan pressure calculations on how much protection is available compared to the current usage, rather than how much the current usage is over some protection threshold. This change doesn't change the experience for the user in the normal case too much. One benefit is that it replaces the (somewhat arbitrary) 100% cutoff with an indefinite slope, which makes it easier to ballpark a memory.low value. As well as this, the old methodology doesn't quite apply generically to machines with varying amounts of physical memory. Let's say we have a top level cgroup, workload.slice, and another top level cgroup, system-management.slice. We want to roughly give 12G to system-management.slice, so on a 32GB machine we set memory.low to 20GB in workload.slice, and on a 64GB machine we set memory.low to 52GB. However, because these are relative amounts to the total machine size, while the amount of memory we want to generally be willing to yield to system.slice is absolute (12G), we end up putting more pressure on system.slice just because we have a larger machine and a larger workload to fill it, which seems fairly unintuitive. With this new behaviour, we don't end up with this unintended side effect. Previously the way that memory.low protection works is that if you are 50% over a certain baseline, you get 50% of your normal scan pressure. This is certainly better than the previous cliff-edge behaviour, but it can be improved even further by always considering memory under the currently enforced protection threshold to be out of bounds. This means that we can set relatively low memory.low thresholds for variable or bursty workloads while still getting a reasonable level of protection, whereas with the previous version we may still trivially hit the 100% clamp. The previous 100% clamp is also somewhat arbitrary, whereas this one is more concretely based on the currently enforced protection threshold, which is likely easier to reason about. There is also a subtle issue with the way that proportional reclaim worked previously -- it promotes having no memory.low, since it makes pressure higher during low reclaim. This happens because we base our scan pressure modulation on how far memory.current is between memory.min and memory.low, but if memory.low is unset, we only use the overage method. In most cromulent configurations, this then means that we end up with *more* pressure than with no memory.low at all when we're in low reclaim, which is not really very usable or expected. With this patch, memory.low and memory.min affect reclaim pressure in a more understandable and composable way. For example, from a user standpoint, "protected" memory now remains untouchable from a reclaim aggression standpoint, and users can also have more confidence that bursty workloads will still receive some amount of guaranteed protection. Link: http://lkml.kernel.org/r/20190322160307.GA3316@chrisdown.name Signed-off-by: Chris Down <chris@chrisdown.name> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07mm, memcg: make memory.emin the baseline for utilisation determinationChris Down
Roman points out that when when we do the low reclaim pass, we scale the reclaim pressure relative to position between 0 and the maximum protection threshold. However, if the maximum protection is based on memory.elow, and memory.emin is above zero, this means we still may get binary behaviour on second-pass low reclaim. This is because we scale starting at 0, not starting at memory.emin, and since we don't scan at all below emin, we end up with cliff behaviour. This should be a fairly uncommon case since usually we don't go into the second pass, but it makes sense to scale our low reclaim pressure starting at emin. You can test this by catting two large sparse files, one in a cgroup with emin set to some moderate size compared to physical RAM, and another cgroup without any emin. In both cgroups, set an elow larger than 50% of physical RAM. The one with emin will have less page scanning, as reclaim pressure is lower. Rebase on top of and apply the same idea as what was applied to handle cgroup_memory=disable properly for the original proportional patch http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name ("mm, memcg: Handle cgroup_disable=memory when getting memcg protection"). Link: http://lkml.kernel.org/r/20190201051810.GA18895@chrisdown.name Signed-off-by: Chris Down <chris@chrisdown.name> Suggested-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Dennis Zhou <dennis@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07mm, memcg: proportional memory.{low,min} reclaimChris Down
cgroup v2 introduces two memory protection thresholds: memory.low (best-effort) and memory.min (hard protection). While they generally do what they say on the tin, there is a limitation in their implementation that makes them difficult to use effectively: that cliff behaviour often manifests when they become eligible for reclaim. This patch implements more intuitive and usable behaviour, where we gradually mount more reclaim pressure as cgroups further and further exceed their protection thresholds. This cliff edge behaviour happens because we only choose whether or not to reclaim based on whether the memcg is within its protection limits (see the use of mem_cgroup_protected in shrink_node), but we don't vary our reclaim behaviour based on this information. Imagine the following timeline, with the numbers the lruvec size in this zone: 1. memory.low=1000000, memory.current=999999. 0 pages may be scanned. 2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned. 3. memory.low=1000000, memory.current=1000001. 1000001* pages may be scanned. (?!) * Of course, we won't usually scan all available pages in the zone even without this patch because of scan control priority, over-reclaim protection, etc. However, as shown by the tests at the end, these techniques don't sufficiently throttle such an extreme change in input, so cliff-like behaviour isn't really averted by their existence alone. Here's an example of how this plays out in practice. At Facebook, we are trying to protect various workloads from "system" software, like configuration management tools, metric collectors, etc (see this[0] case study). In order to find a suitable memory.low value, we start by determining the expected memory range within which the workload will be comfortable operating. This isn't an exact science -- memory usage deemed "comfortable" will vary over time due to user behaviour, differences in composition of work, etc, etc. As such we need to ballpark memory.low, but doing this is currently problematic: 1. If we end up setting it too low for the workload, it won't have *any* effect (see discussion above). The group will receive the full weight of reclaim and won't have any priority while competing with the less important system software, as if we had no memory.low configured at all. 2. Because of this behaviour, we end up erring on the side of setting it too high, such that the comfort range is reliably covered. However, protected memory is completely unavailable to the rest of the system, so we might cause undue memory and IO pressure there when we *know* we have some elasticity in the workload. 3. Even if we get the value totally right, smack in the middle of the comfort zone, we get extreme jumps between no pressure and full pressure that cause unpredictable pressure spikes in the workload due to the current binary reclaim behaviour. With this patch, we can set it to our ballpark estimation without too much worry. Any undesirable behaviour, such as too much or too little reclaim pressure on the workload or system will be proportional to how far our estimation is off. This means we can set memory.low much more conservatively and thus waste less resources *without* the risk of the workload falling off a cliff if we overshoot. As a more abstract technical description, this unintuitive behaviour results in having to give high-priority workloads a large protection buffer on top of their expected usage to function reliably, as otherwise we have abrupt periods of dramatically increased memory pressure which hamper performance. Having to set these thresholds so high wastes resources and generally works against the principle of work conservation. In addition, having proportional memory reclaim behaviour has other benefits. Most notably, before this patch it's basically mandatory to set memory.low to a higher than desirable value because otherwise as soon as you exceed memory.low, all protection is lost, and all pages are eligible to scan again. By contrast, having a gradual ramp in reclaim pressure means that you now still get some protection when thresholds are exceeded, which means that one can now be more comfortable setting memory.low to lower values without worrying that all protection will be lost. This is important because workingset size is really hard to know exactly, especially with variable workloads, so at least getting *some* protection if your workingset size grows larger than you expect increases user confidence in setting memory.low without a huge buffer on top being needed. Thanks a lot to Johannes Weiner and Tejun Heo for their advice and assistance in thinking about how to make this work better. In testing these changes, I intended to verify that: 1. Changes in page scanning become gradual and proportional instead of binary. To test this, I experimented stepping further and further down memory.low protection on a workload that floats around 19G workingset when under memory.low protection, watching page scan rates for the workload cgroup: +------------+-----------------+--------------------+--------------+ | memory.low | test (pgscan/s) | control (pgscan/s) | % of control | +------------+-----------------+--------------------+--------------+ | 21G | 0 | 0 | N/A | | 17G | 867 | 3799 | 23% | | 12G | 1203 | 3543 | 34% | | 8G | 2534 | 3979 | 64% | | 4G | 3980 | 4147 | 96% | | 0 | 3799 | 3980 | 95% | +------------+-----------------+--------------------+--------------+ As you can see, the test kernel (with a kernel containing this patch) ramps up page scanning significantly more gradually than the control kernel (without this patch). 2. More gradual ramp up in reclaim aggression doesn't result in premature OOMs. To test this, I wrote a script that slowly increments the number of pages held by stress(1)'s --vm-keep mode until a production system entered severe overall memory contention. This script runs in a highly protected slice taking up the majority of available system memory. Watching vmstat revealed that page scanning continued essentially nominally between test and control, without causing forward reclaim progress to become arrested. [0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project [akpm@linux-foundation.org: reflow block comments to fit in 80 cols] [chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection] Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name Signed-off-by: Chris Down <chris@chrisdown.name> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07memcg: only record foreign writebacks with dirty pages when memcg is not ↵Baoquan He
disabled In kdump kernel, memcg usually is disabled with 'cgroup_disable=memory' for saving memory. Now kdump kernel will always panic when dump vmcore to local disk: BUG: kernel NULL pointer dereference, address: 0000000000000ab8 Oops: 0000 [#1] SMP NOPTI CPU: 0 PID: 598 Comm: makedumpfile Not tainted 5.3.0+ #26 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 10/02/2018 RIP: 0010:mem_cgroup_track_foreign_dirty_slowpath+0x38/0x140 Call Trace: __set_page_dirty+0x52/0xc0 iomap_set_page_dirty+0x50/0x90 iomap_write_end+0x6e/0x270 iomap_write_actor+0xce/0x170 iomap_apply+0xba/0x11e iomap_file_buffered_write+0x62/0x90 xfs_file_buffered_aio_write+0xca/0x320 [xfs] new_sync_write+0x12d/0x1d0 vfs_write+0xa5/0x1a0 ksys_write+0x59/0xd0 do_syscall_64+0x59/0x1e0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 And this will corrupt the 1st kernel too with 'cgroup_disable=memory'. Via the trace and with debugging, it is pointing to commit 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing") which introduced this regression. Disabling memcg causes the null pointer dereference at uninitialized data in function mem_cgroup_track_foreign_dirty_slowpath(). Fix it by returning directly if memcg is disabled, but not trying to record the foreign writebacks with dirty pages. Link: http://lkml.kernel.org/r/20190924141928.GD31919@MiWiFi-R3L-srv Fixes: 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing") Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jan Kara <jack@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-07netfilter: ipset: move ip_set_get_ip_port() to ip_set_bitmap_port.c.Jeremy Sowden
ip_set_get_ip_port() is only used in ip_set_bitmap_port.c. Move it there and make it static. Signed-off-by: Jeremy Sowden <jeremy@azazel.net> Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>