summaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)Author
2017-02-23Merge tag 'armsoc-drivers' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc Pull ARM SoC driver updates from Arnd Bergmann: "Driver updates for ARM SoCs. A handful of driver changes this time around. The larger changes are: - Reset drivers for hi3660 and zx2967 - AHCI driver for Davinci, acked by Tejun and brought in here due to platform dependencies - Cleanups of atmel-ebi (External Bus Interface) - Tweaks for Rockchip GRF (General Register File) usage (kitchensink misc register range on the SoCs) - PM domains changes for support of two new ZTE SoCs (zx296718 and zx2967)" * tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (53 commits) soc: samsung: pmu: Add register defines for pad retention control reset: make zx2967 explicitly non-modular reset: core: fix reset_control_put soc: samsung: pm_domains: Read domain name from the new label property soc: samsung: pm_domains: Remove message about failed memory allocation soc: samsung: pm_domains: Remove unused name field soc: samsung: pm_domains: Use full names in subdomains registration log sata: ahci-da850: un-hardcode the MPY bits sata: ahci-da850: add a workaround for controller instability sata: ahci: export ahci_do_hardreset() locally sata: ahci-da850: implement a workaround for the softreset quirk sata: ahci-da850: add device tree match table sata: ahci-da850: get the sata clock using a connection id soc: samsung: pmu: Remove duplicated define for ARM_L2_OPTION register memory: atmel-ebi: Enable the SMC clock if specified soc: samsung: pmu: Remove unused and duplicated defines memory: atmel-ebi: Properly handle multiple reference to the same CS memory: atmel-ebi: Fix the test to enable generic SMC logic soc: samsung: pm_domains: Add new Exynos5433 compatible soc: samsung: pmu: Add dummy support for Exynos5433 SoC ...
2017-02-23Merge tag 'pci-v4.11-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI updates from Bjorn Helgaas: - add ASPM L1 substate support - enable PCIe Extended Tags when supported - configure PCIe MPS settings on iProc, Versatile, X-Gene, and Xilinx - increase VPD access timeout - add ACS quirks for Intel Union Point, Qualcomm QDF2400 and QDF2432 - use new pci_irq_alloc_vectors() in more drivers - fix MSI affinity memory leak - remove unused MSI interfaces and update documentation - remove unused AER .link_reset() callback - avoid pci_lock / p->pi_lock deadlock seen with perf - serialize sysfs enable/disable num_vfs operations - move DesignWare IP from drivers/pci/host/ to drivers/pci/dwc/ and refactor so we can support both hosts and endpoints - add DT ECAM-like support for HiSilicon Hip06/Hip07 controllers - add Rockchip system power management support - add Thunder-X cn81xx and cn83xx support - add Exynos 5440 PCIe PHY support * tag 'pci-v4.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (93 commits) PCI: dwc: Remove dependency of designware on CONFIG_PCI PCI: dwc: Add CONFIG_PCIE_DW_HOST to enable PCI dwc host PCI: dwc: Split pcie-designware.c into host and core files PCI: dwc: designware: Fix style errors in pcie-designware.c PCI: dwc: designware: Parse "num-lanes" property in dw_pcie_setup_rc() PCI: dwc: all: Split struct pcie_port into host-only and core structures PCI: dwc: designware: Get device pointer at the start of dw_pcie_host_init() PCI: dwc: all: Rename cfg_read/cfg_write to read/write PCI: dwc: all: Use platform_set_drvdata() to save private data PCI: dwc: designware: Move register defines to designware header file PCI: dwc: Use PTR_ERR_OR_ZERO to simplify code PCI: dra7xx: Group PHY API invocations PCI: dra7xx: Enable MSI and legacy interrupts simultaneously PCI: dra7xx: Add support to force RC to work in GEN1 mode PCI: dra7xx: Simplify probe code with devm_gpiod_get_optional() PCI: Move DesignWare IP support to new drivers/pci/dwc/ directory PCI: exynos: Support the PHY generic framework Documentation: binding: Modify the exynos5440 PCIe binding phy: phy-exynos-pcie: Add support for Exynos PCIe PHY Documentation: samsung-phy: Add exynos-pcie-phy binding ...
2017-02-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Some 'const'ing in qlogic networking drivers, from Bhumika Goyal. 2) Fix scheduling while atomic in l2tp network namespace exit by deferring the work to the workqueue. From Ridge Kennedy. 3) Fix use after free in dccp timewait handling, from Andrey Ryabinin. 4) mlx5e CQE compression engine not initialized properly, from Tariq Toukan. 5) Some UAPI header fixes from Dmitry V. Levin. 6) Don't overwrite module parameter value in mlx4 driver, from Majd Dibbiny. 7) Fix divide by zero in xt_hashlimit netfilter module, from Alban Browaeys. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits) bpf: Fix bpf_xdp_event_output net/mlx4_en: Use __skb_fill_page_desc() net/mlx4_core: Use cq quota in SRIOV when creating completion EQs net/mlx4_core: Fix VF overwrite of module param which disables DMFS on new probed PFs net/mlx4: Spoofcheck and zero MAC can't coexist net/mlx4: Change ENOTSUPP to EOPNOTSUPP uapi: fix linux/rds.h userspace compilation errors uapi: fix linux/seg6.h and linux/seg6_iptunnel.h userspace compilation errors lib: Remove string from parman config selection forcedeth: Remove return from a void function bpf: fix spelling mistake: "proccessed" -> "processed" uapi: fix linux/llc.h userspace compilation error uapi: fix linux/ip6_tunnel.h userspace compilation errors net/mlx5e: Fix wrong CQE decompression net/mlx5e: Update MPWQE stride size when modifying CQE compress state net/mlx5e: Fix broken CQE compression initialization net/mlx5e: Do not reduce LRO WQE size when not using build_skb net/mlx5e: Register/unregister vport representors on interface attach/detach net/mlx5e: s390 system compilation fix tcp: account for ts offset only if tsecr not zero ...
2017-02-23Merge tag 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma Pull Mellanox rdma updates from Doug Ledford: "Mellanox specific updates for 4.11 merge window Because the Mellanox code required being based on a net-next tree, I keept it separate from the remainder of the RDMA stack submission that is based on 4.10-rc3. This branch contains: - Various mlx4 and mlx5 fixes and minor changes - Support for adding a tag match rule to flow specs - Support for cvlan offload operation for raw ethernet QPs - A change to the core IB code to recognize raw eth capabilities and enumerate them (touches non-Mellanox code) - Implicit On-Demand Paging memory registration support" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (40 commits) IB/mlx5: Fix configuration of port capabilities IB/mlx4: Take source GID by index from HW GID table IB/mlx5: Fix blue flame buffer size calculation IB/mlx4: Remove unused variable from function declaration IB: Query ports via the core instead of direct into the driver IB: Add protocol for USNIC IB/mlx4: Support raw packet protocol IB/mlx5: Support raw packet protocol IB/core: Add raw packet protocol IB/mlx5: Add implicit MR support IB/mlx5: Expose MR cache for mlx5_ib IB/mlx5: Add null_mkey access IB/umem: Indicate that process is being terminated IB/umem: Update on demand page (ODP) support IB/core: Add implicit MR flag IB/mlx5: Support creation of a WQ with scatter FCS offload IB/mlx5: Enable QP creation with cvlan offload IB/mlx5: Enable WQ creation and modification with cvlan offload IB/mlx5: Expose vlan offloads capabilities IB/uverbs: Enable QP creation with cvlan offload ...
2017-02-23Merge branch 'linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto update from Herbert Xu: "API: - Try to catch hash output overrun in testmgr - Introduce walksize attribute for batched walking - Make crypto_xor() and crypto_inc() alignment agnostic Algorithms: - Add time-invariant AES algorithm - Add standalone CBCMAC algorithm Drivers: - Add NEON acclerated chacha20 on ARM/ARM64 - Expose AES-CTR as synchronous skcipher on ARM64 - Add scalar AES implementation on ARM64 - Improve scalar AES implementation on ARM - Improve NEON AES implementation on ARM/ARM64 - Merge CRC32 and PMULL instruction based drivers on ARM64 - Add NEON acclerated CBCMAC/CMAC/XCBC AES on ARM64 - Add IPsec AUTHENC implementation in atmel - Add Support for Octeon-tx CPT Engine - Add Broadcom SPU driver - Add MediaTek driver" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (142 commits) crypto: xts - Add ECB dependency crypto: cavium - switch to pci_alloc_irq_vectors crypto: cavium - switch to pci_alloc_irq_vectors crypto: cavium - remove dead MSI-X related define crypto: brcm - Avoid double free in ahash_finup() crypto: cavium - fix Kconfig dependencies crypto: cavium - cpt_bind_vq_to_grp could return an error code crypto: doc - fix typo hwrng: omap - update Kconfig help description crypto: ccm - drop unnecessary minimum 32-bit alignment crypto: ccm - honour alignmask of subordinate MAC cipher crypto: caam - fix state buffer DMA (un)mapping crypto: caam - abstract ahash request double buffering crypto: caam - fix error path for ctx_dma mapping failure crypto: caam - fix DMA API leaks for multiple setkey() calls crypto: caam - don't dma_map key for hash algorithms crypto: caam - use dma_map_sg() return code crypto: caam - replace sg_count() with sg_nents_for_len() crypto: caam - check sg_count() return value crypto: caam - fix HW S/G in ablkcipher_giv_edesc_alloc() ..
2017-02-23Merge tag 'rpmsg-v4.11' of git://github.com/andersson/remoteprocLinus Torvalds
Pull rpmsg updates from Bjorn Andersson: "This introduces an interface for user space to directly communicate on rpmsg endpoints without the implementation of specific kernel drivers, which is useful for e.g. debug channels: * tag 'rpmsg-v4.11' of git://github.com/andersson/remoteproc: rpmsg: rpmsg_create_ept() returns NULL on error rpmsg: qcom: smd: Return positively when not enabled rpmsg: unlock on error in rpmsg_eptdev_read() rpmsg: char: add CONFIG_NET dependency rpmsg: smd: Register rpmsg user space interface for edges rpmsg: Driver for user space endpoint interface rpmsg: qcom_smd: Implement endpoint "poll" rpmsg: Introduce "poll" to endpoint ops rpmsg: qcom_smd: Add support for "label" property
2017-02-23Merge tag 'rproc-v4.11' of git://github.com/andersson/remoteprocLinus Torvalds
Pull remoteproc updates from Bjorn Andersson: "This introduces support for booting the dedicated sensor core in the Qualcomm MSM8996, updates the Qualcomm ADSP and Hexagon drivers to utilize SMD subdevice helpers for properly handle shutdowns and restarts of the remoteproc, add virtio support to the ST remoteproc and refactor the Qualcomm Hexagon driver to handle variations between platforms. The support code for parsing, loading and authenticating Qualcomm firmware files (MDT) is refactored and move to drivers/soc/qcom, to allow for non-remoteproc drivers to utilize this. Finally it brings some cleanups to the remoteproc core" * tag 'rproc-v4.11' of git://github.com/andersson/remoteproc: (27 commits) remoteproc: qcom: mdt_loader: Use signed type for offset remoteproc: st: add virtio communication support remoteproc: st: correct probe error management remoteproc: Modify the function names remoteproc: Reduce asynchronous request_firmware to auto-boot only remoteproc: Drop qcom_scm_pas_supported() from adsp_probe() MAINTAINERS: Add missing rpmsg include path remoteproc: qcom: Use common SMD edge handler remoteproc: qcom: wcnss: Make SMD handling common remoteproc: Move qcom_mdt_loader into drivers/soc/qcom remoteproc: qcom: mdt_loader: Refactor MDT loader remoteproc: qcom: mdt_loader: Don't overwrite firmware object remoteproc: qcom: Extract non-mdt related helper remoteproc: qcom: q6v5: Decouple driver from MDT loader remoteproc: qcom: q6v5: Remove mss supply from 8916 remoteproc: qcom: fix initializers for qcom_mss_reg_res array remoteproc: Drop firmware_loading_complete remoteproc: Add RPROC_DELETED state remoteproc: Move rproc_delete_debug_dir() to rproc_del() remoteproc: qcom: Add SLPI rproc support to load and boot slpi proc. ...
2017-02-23Merge tag 'sound-4.11-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound updates from Takashi Iwai: "Here is the update of sound bits for 4.11: again at this time, no big changes in ALSA and ASoC core but only cosmetic changes like consitifaction. Meanwhile, quite a lot of developments are seen in a few driver side. ALSA Core: - Clean up, consitification of some ops HD-audio: - A slight behavior change of single_cmd option - Quirks for AmigaOne X1000, Samsung Ativ Book 8, Dell AiO, ALC221 HP, and fixes for Lewisburg controller - Realtek ALC299, ALC1220 codecs Others: - USB-audio: Tascam US-16x08 DSP mixer quirk - Intel HDMI LPE audio support for Baytrail / Cherrytrail; this contains some updates in drm/i915 for the new platform binding ASoC: - Lots of updates in Intel drivers, mostly for DisplayPort and HDMI on Skylake and onwards, as well as more Baytrail / Cherrytrail boards support - Channel mapping support for HDMI - Support for AllWinner A31 and A33, Everest Semiconductor ES8328, Nuvoton NAU8540. * tag 'sound-4.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (323 commits) ALSA: usb-audio: Tidy up mixer_us16x08.c ALSA: usb-audio: Fix memory leak and corruption in mixer_us16x08.c ALSA: usb-audio: purge needless variable length array ALSA: x86: hdmi: select CONFIG_SND_PCM ALSA: x86: Don't enable runtime PM as default ALSA: x86: Use runtime PM autosuspend ALSA: usb-audio: localize function without external linkage ALSA: usb-audio: localize one-referrer variable ALSA: usb-audio: Tascam US-16x08 DSP mixer quirk ALSA: emu10k1: constify snd_emux_operators structure ASoC: sun4i-spdif: drop unnessary snd_soc_unregister_component() ASoC: Intel: bxt: Add jack port initialize in bxt_rt298 machine ASoC: nau8825: automatic BCLK and LRC divde in master mode ASoC: hdac_hdmi: Add device id for Geminilake ASoC: Intel: Skylake: Add Geminlake IDs ASoC: rt298: Add DMI match for Geminilake reference platform ASoC: Intel: Skylake: Check device type to get endpoint configuration ASoC: Intel: bxt: Add jack port initialize in da7219_max98357a machine ASoC: Intel: Skylake: Add jack port initialize in nau88l25_ssm4567 machine ASoC: Intel: Skylake: Add jack port initialize in nau88l25_max98357a machine ...
2017-02-23Merge tag 'gpio-v4.11-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio Pull GPIO updates from Linus Walleij: "This is the bulk of GPIO changes for the v4.11 cycle Core changes: - Augment fwnode_get_named_gpiod() to configure the GPIO pin immediately after requesting it like all other APIs do. This is a treewide change also updating all users. - Pass a GPIO label down to gpiod_request() from fwnode_get_named_gpiod(). This makes debugfs and the userspace ABI correctly reflect the current in-kernel consumer of a pin taken using this abstraction. This is a treewide change also updating all users. - Rename devm_get_gpiod_from_child() to devm_fwnode_get_gpiod_from_child() to reflect the fact that this function is operating on a fwnode object. This is a treewide change also updating all users. - Make it possible to take multiple GPIOs in a single hog of device tree hogs. - The refactorings switching GPIO chips to use the .set_config() callback using standard pin control properties and providing a backend into the pin control subsystem that were also merged into the pin control tree naturally appear here too. Testing instrumentation: - A whole slew of cleanups and improvements to the mockup GPIO driver. We now have an extended userspace test exercising the subsystem, and we can inject interrupts etc from userspace to fully test the core GPIO functionality. New drivers: - New driver for the Cortina Systems Gemini GPIO controller. - New driver for the Exar XR17V352/354/358 chips. - New driver for the ACCES PCI-IDIO-16 PCI GPIO card. Driver changes: - RCAR: set the irqchip parent device, add fine-grained runtime PM support. - pca953x: support optional RESET control line on the chip. - DaVinci: cleanups and simplifications. Add support for multiple instances. - .set_multiple() and naming of lines on more or less all of the ISA/PCI GPIO controllers. - mcp23s08: refactored to use regmap as a first step to further rewrites and modernizations" * tag 'gpio-v4.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (61 commits) gpio: reintroduce devm_get_gpiod_from_child() gpio: pci-idio-16: Fix PCI BAR index gpio: pci-idio-16: Fix PCI device ID code gpio: mockup: implement event injecting over debugfs gpio: mockup: add a dummy irqchip gpio: mockup: implement naming the lines gpio: mockup: code shrink gpio: mockup: readability tweaks gpio: Add GPIO support for the ACCES PCI-IDIO-16 gpio: Add the devm_fwnode_get_index_gpiod_from_child() helper gpio: Rename devm_get_gpiod_from_child() gpio: mcp23s08: Select REGMAP/REGMAP_I2C to fix build error gpio: ws16c48: Add support for GPIO names gpio: gpio-mm: Add support for GPIO names gpio: 104-idio-16: Add support for GPIO names gpio: 104-idi-48: Add support for GPIO names gpio: 104-dio-48e: Add support for GPIO names gpio: ws16c48: Remove unnecessary driver_data set gpio: gpio-mm: Remove unnecessary driver_data set gpio: 104-idio-16: Remove unnecessary driver_data set ...
2017-02-23Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input updates from Dmitry: - a new driver for Zeitech touchscreen controller - a new driver for Samsung "touchkeys" - touchscreen driver for Moorestown platform has been removed because platform support is gone - MPU3050 accelerometer driver was removed in favor of IIO driver - miscellaneous driver cleanup and fixes * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (88 commits) Input: zet6223 - export OF device ID as module aliases Input: tsc2004/5 - switch to using generic device properties Input: tsc2004/5 - fix regulator handling Input: tsc2005 - add OF device table Input: add driver for Zeitec ZET6223 Input: joydev - do not report stale values on first open Input: synaptics-rmi4 - forward upper mechanical buttons to PS/2 guest Input: synaptics-rmi4 - clean up F30 implementation Input: synaptics - use SERIO_OOB_DATA to handle trackstick buttons Input: psmouse - add a custom serio protocol to send extra information Input: synaptics-rmi4 - fix error return code in rmi_probe_interrupts() Input: xpad - restore LED state after device resume Input: synaptics-rmi4 - add rmi_find_function() Input: xpad - fix stuck mode button on Xbox One S pad Input: joydev - use clamp() macro Input: refuse to register absolute devices without absinfo Input: synaptics-rmi4 - add sysfs interfaces for hardware IDs Input: synaptics-rmi4 - add sysfs attribute update_fw_status Input: mousedev - stop offering PS/2 to userspace by default Input: tca8418 - switch to using generic device properties ...
2017-02-23Merge tag 'mfd-for-linus-4.11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd Pull MFD updates from Lee Jones: "Core Frameworks: - Add new !TOUCHSCREEN_SUN4I dependency for SUN4I_GPADC - List include/dt-bindings/mfd/* to files supported in MAINTAINERS New Drivers: - Intel Apollo Lake SPI NOR - ST STM32 Timers (Advanced, Basic and PWM) - Motorola 6556002 CPCAP (PMIC) New Device Support: - Add support for AXP221 to axp20x - Add support for Intel Gemini Lake to intel-lpss-pci - Add support for MT6323 LED to mt6397-core - Add support for COMe-bBD#, COMe-bSL6, COMe-bKL6, COMe-cAL6 and COMe-cKL6 to kempld-core New Functionality: - Add support for Analog CODAC to sun6i-prcm - Add support for Watchdog to lpc_ich Fix-ups: - Error handling improvements; axp288_charger, axp20x, ab8500-sysctrl - Adapt platform data handling; axp20x - IRQ handling improvements; arizona, axp20x - Remove superfluous code; arizona, axp20x, lpc_ich - Trivial coding style/spelling fixes; axp20x, abx500, mfd.txt - Regmap fix-ups; axp20x - DT changes; mfd.txt, aspeed-lpc, aspeed-gfx, ab8500-core, tps65912, mt6397 - Use new I2C probing mechanism; max77686 - Constification; rk808 Bug Fixes: - Stop data transfer whilst suspended; cros_ec" * tag 'mfd-for-linus-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (43 commits) mfd: lpc_ich: Enable watchdog on Intel Apollo Lake PCH mfd: lpc_ich: Remove useless comments in core part mfd: Add support for several boards to Kontron PLD driver mfd: constify regmap_irq_chip structures MAINTAINERS: Add include/dt-bindings/mfd to MFD entry mfd: cpcap: Add minimal support mfd: mt6397: Add MT6323 LED support into MT6397 driver Documentation: devicetree: Add LED subnode binding for MT6323 PMIC mfd: tps65912: Export OF device ID table as module aliases mfd: ab8500-core: Rename clock device and compatible mfd: cros_ec: Send correct suspend/resume event to EC mfd: max77686: Remove I2C device ID table mfd: max77686: Use the struct i2c_driver .probe_new instead of .probe mfd: max77686: Use of_device_get_match_data() helper mfd: max77686: Don't attempt to get i2c_device_id .data mfd: ab8500-sysctrl: Handle probe deferral mfd: intel-lpss: Add Intel Gemini Lake PCI IDs mfd: axp20x: Fix AXP806 access errors on cold boot mfd: cros_ec: Send suspend state notification to EC mfd: cros_ec: Prevent data transfer while device is suspended ...
2017-02-23net/mlx4: Spoofcheck and zero MAC can't coexistEugenia Emantayev
Spoofcheck can't be enabled if VF MAC is zero. Vice versa, can't zero MAC if spoofcheck is on. Fixes: 8f7ba3ca12f6 ('net/mlx4: Add set VF mac address support') Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-22Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge updates from Andrew Morton: "142 patches: - DAX updates - various misc bits - OCFS2 updates - most of MM" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (142 commits) mm/z3fold.c: limit first_num to the actual range of possible buddy indexes mm: fix <linux/pagemap.h> stray kernel-doc notation zram: remove obsolete sysfs attrs mm/memblock.c: remove unnecessary log and clean up oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA mm: drop unused argument of zap_page_range() mm: drop zap_details::check_swap_entries mm: drop zap_details::ignore_dirty mm, page_alloc: warn_alloc nodemask is NULL when cpusets are disabled mm: help __GFP_NOFAIL allocations which do not trigger OOM killer mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically mm: consolidate GFP_NOFAIL checks in the allocator slowpath lib/show_mem.c: teach show_mem to work with the given nodemask arch, mm: remove arch specific show_mem mm, page_alloc: warn_alloc print nodemask mm, page_alloc: do not report all nodes in show_mem Revert "mm: bail out in shrink_inactive_list()" mm, vmscan: consider eligible zones in get_scan_count mm, vmscan: cleanup lru size claculations mm, vmscan: do not count freed pages as PGDEACTIVATE ...
2017-02-22Merge tag 'devicetree-for-4.11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux Pull DeviceTree updates from Rob Herring: "Pretty standard stuff with dtc upstream sync being the biggest piece. - Sync dtc to upstream commit 0931cea3ba20. This picks up overlay support in dtc. - Set dma_ops for reserved memory users. - Make references to IOMMU consistent in DT bindings. - Cleanup references to pm_power_off in bindings. - Move some display bindings that snuck into the old bindings/video/ path. - Fix some wrong documentation paths caused from binding restructuring. - Vendor prefixes for Faraday and Fujitsu. - Fix an of_node ref counting leak in of_find_node_opts_by_path - Introduce new graph helper of_graph_get_remote_node() which will be used by DRM drivers in 4.12" * tag 'devicetree-for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: (27 commits) DT: add Faraday Tec. as vendor of: introduce of_graph_get_remote_node of: Add missing space at end of pr_fmt(). of: make of_device_make_bus_id() static of: fix of_node leak caused in of_find_node_opts_by_path dt-bindings: net: remove reference to fixed link support dt-bindings: power: reset: qnap-poweroff: Drop reference to pm_power_off dt-bindings: power: reset: gpio-poweroff: Drop reference to pm_power_off dt-bindings: mfd: as3722: Drop reference to pm_power_off dt-bindings: display: move ANX7814 and SiI8620 bridge bindings of/unittest: Swap arguments of of_unittest_apply_overlay() Documentation: usb: fix wrong documentation paths serial: fsl-imx-uart.txt: Remove generic property devicetree: Add Fujitsu Ltd. vendor prefix Documentation: display: fix wrong documentation paths of: remove redundant memset in overlay bus:qcom : Fix typo in qcom,ebi2.txt dt-bindings: qman: Remove pool channel node Documentation: panel-dpi: fix path to display-timing.txt devicetree: bindings: clk: mvebu: fix description for sata1 on Armada XP ...
2017-02-22Merge tag 'docs-4.11' of git://git.lwn.net/linuxLinus Torvalds
Pull documentation updates from Jonathan Corbet: "A slightly quieter cycle for documentation this time around. Three more DocBook template files have been converted to RST; only 21 to go. There are various build improvements and the usual array of documentation improvements and fixes" * tag 'docs-4.11' of git://git.lwn.net/linux: (44 commits) docs / driver-api: Fix structure references in device_link.rst PM / docs: Fix structure references in device.rst Add a target to check broken external links in the Documentation Documentation: Fix linux-api list typo Documentation: DocBook/Makefile comment typo Improve sparse documentation Documentation: make Makefile.sphinx no-ops quieter Documentation: DMA-ISA-LPC.txt Documentation: input: fix path to input code definitions docs: Remove the copyright year from conf.py docs: Fix a warning in the Korean HOWTO.rst translation PM / sleep / docs: Convert PM notifiers document to reST PM / core / docs: Convert sleep states API document to reST PM / core: Update kerneldoc comments in pm.h doc-rst: Fix recursive make invocation from macros doc-rst: Delete output of failed dot-SVG conversion doc-rst: Break shell command sequences on failure Documentation/sphinx: make targets independent of Sphinx work for HAVE_SPHINX=0 doc-rst: fixed cleandoc target when used with O=dir Documentation/sphinx: prevent generation of .pyc files in the source tree ...
2017-02-22Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Paolo Bonzini: "4.11 is going to be a relatively large release for KVM, with a little over 200 commits and noteworthy changes for most architectures. ARM: - GICv3 save/restore - cache flushing fixes - working MSI injection for GICv3 ITS - physical timer emulation MIPS: - various improvements under the hood - support for SMP guests - a large rewrite of MMU emulation. KVM MIPS can now use MMU notifiers to support copy-on-write, KSM, idle page tracking, swapping, ballooning and everything else. KVM_CAP_READONLY_MEM is also supported, so that writes to some memory regions can be treated as MMIO. The new MMU also paves the way for hardware virtualization support. PPC: - support for POWER9 using the radix-tree MMU for host and guest - resizable hashed page table - bugfixes. s390: - expose more features to the guest - more SIMD extensions - instruction execution protection - ESOP2 x86: - improved hashing in the MMU - faster PageLRU tracking for Intel CPUs without EPT A/D bits - some refactoring of nested VMX entry/exit code, preparing for live migration support of nested hypervisors - expose yet another AVX512 CPUID bit - host-to-guest PTP support - refactoring of interrupt injection, with some optimizations thrown in and some duct tape removed. - remove lazy FPU handling - optimizations of user-mode exits - optimizations of vcpu_is_preempted() for KVM guests generic: - alternative signaling mechanism that doesn't pound on tsk->sighand->siglock" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (195 commits) x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64 x86/paravirt: Change vcp_is_preempted() arg type to long KVM: VMX: use correct vmcs_read/write for guest segment selector/base x86/kvm/vmx: Defer TR reload after VM exit x86/asm/64: Drop __cacheline_aligned from struct x86_hw_tss x86/kvm/vmx: Simplify segment_base() x86/kvm/vmx: Get rid of segment_base() on 64-bit kernels x86/kvm/vmx: Don't fetch the TSS base from the GDT x86/asm: Define the kernel TSS limit in a macro kvm: fix page struct leak in handle_vmon KVM: PPC: Book3S HV: Disable HPT resizing on POWER9 for now KVM: Return an error code only as a constant in kvm_get_dirty_log() KVM: Return an error code only as a constant in kvm_get_dirty_log_protect() KVM: Return directly after a failed copy_from_user() in kvm_vm_compat_ioctl() KVM: x86: remove code for lazy FPU handling KVM: race-free exit from KVM_RUN without POSIX signals KVM: PPC: Book3S HV: Turn "KVM guest htab" message into a debug message KVM: PPC: Book3S PR: Ratelimit copy data failure error messages KVM: Support vCPU-based gfn->hva cache KVM: use separate generations for each address space ...
2017-02-22Merge tag 'xfs-4.11-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Darrick Wong: "Here are the XFS changes for 4.11. We aren't introducing any major features in this release cycle except for this being the first merge window I've managed on my own. :) Changes since last update: - Various cleanups - Livelock fixes for eofblocks scanning - Improved input verification for on-disk metadata - Fix races in the copy on write remap mechanism - Fix buffer io error timeout controls - Streamlining of directio copy on write - Asynchronous discard support - Fix asserts when splitting delalloc reservations - Don't bloat bmbt when right shifting extents - Inode alignment fixes for 32k block sizes" * tag 'xfs-4.11-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (39 commits) xfs: remove XFS_ALLOCTYPE_ANY_AG and XFS_ALLOCTYPE_START_AG xfs: simplify xfs_rtallocate_extent xfs: tune down agno asserts in the bmap code xfs: Use xfs_icluster_size_fsb() to calculate inode chunk alignment xfs: don't reserve blocks for right shift transactions xfs: fix len comparison in xfs_extent_busy_trim xfs: fix uninitialized variable in _reflink_convert_cow xfs: split indlen reservations fairly when under reserved xfs: handle indlen shortage on delalloc extent merge xfs: resurrect debug mode drop buffered writes mechanism xfs: clear delalloc and cache on buffered write failure xfs: don't block the log commit handler for discards xfs: improve busy extent sorting xfs: improve handling of busy extents in the low-level allocator xfs: don't fail xfs_extent_busy allocation xfs: correct null checks and error processing in xfs_initialize_perag xfs: update ctime and mtime on clone destinatation inodes xfs: allocate direct I/O COW blocks in iomap_begin xfs: go straight to real allocations for direct I/O COW writes xfs: return the converted extent in __xfs_reflink_convert_cow ...
2017-02-22Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk Pull printk updates from Petr Mladek: - Add Petr Mladek, Sergey Senozhatsky as printk maintainers, and Steven Rostedt as the printk reviewer. This idea came up after the discussion about printk issues at Kernel Summit. It was formulated and discussed at lkml[1]. - Extend a lock-less NMI per-cpu buffers idea to handle recursive printk() calls by Sergey Senozhatsky[2]. It is the first step in sanitizing printk as discussed at Kernel Summit. The change allows to see messages that would normally get ignored or would cause a deadlock. Also it allows to enable lockdep in printk(). This already paid off. The testing in linux-next helped to discover two old problems that were hidden before[3][4]. - Remove unused parameter by Sergey Senozhatsky. Clean up after a past change. [1] http://lkml.kernel.org/r/1481798878-31898-1-git-send-email-pmladek@suse.com [2] http://lkml.kernel.org/r/20161227141611.940-1-sergey.senozhatsky@gmail.com [3] http://lkml.kernel.org/r/20170215044332.30449-1-sergey.senozhatsky@gmail.com [4] http://lkml.kernel.org/r/20170217015932.11898-1-sergey.senozhatsky@gmail.com * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: printk: drop call_console_drivers() unused param printk: convert the rest to printk-safe printk: remove zap_locks() function printk: use printk_safe buffers in printk printk: report lost messages in printk safe/nmi contexts printk: always use deferred printk when flush printk_safe lines printk: introduce per-cpu safe_print seq buffer printk: rename nmi.c and exported api printk: use vprintk_func in vprintk() MAINTAINERS: Add printk maintainers
2017-02-22Merge tag 'modules-for-v4.11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux Pull modules updates from Jessica Yu: "Summary of modules changes for the 4.11 merge window: - A few small code cleanups - Add modules git tree url to MAINTAINERS" * tag 'modules-for-v4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: MAINTAINERS: add tree for modules module: fix memory leak on early load_module() failures module: Optimize search_module_extables() modules: mark __inittest/__exittest as __maybe_unused livepatch/module: print notice of TAINT_LIVEPATCH module: Drop redundant declaration of struct module
2017-02-22mm: fix <linux/pagemap.h> stray kernel-doc notationRandy Dunlap
Delete stray (second) function description in find_lock_page() kernel-doc notation. Note: scripts/kernel-doc just ignores the second function description. Fixes: 2457aec63745e ("mm: non-atomically mark page accessed during page cache allocation where possible") Link: http://lkml.kernel.org/r/b037e9a3-516c-ec02-6c8e-fa5479747ba6@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Matthew Wilcox <mawilcox@microsoft.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm: drop unused argument of zap_page_range()Kirill A. Shutemov
There's no users of zap_page_range() who wants non-NULL 'details'. Let's drop it. Link: http://lkml.kernel.org/r/20170118122429.43661-3-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm: drop zap_details::check_swap_entriesKirill A. Shutemov
detail == NULL would give the same functionality as .check_swap_entries==true. Link: http://lkml.kernel.org/r/20170118122429.43661-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm: drop zap_details::ignore_dirtyKirill A. Shutemov
The only user of ignore_dirty is oom-reaper. But it doesn't really use it. ignore_dirty only has effect on file pages mapped with dirty pte. But oom-repear skips shared VMAs, so there's no way we can dirty file pte in them. Link: http://lkml.kernel.org/r/20170118122429.43661-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22lib/show_mem.c: teach show_mem to work with the given nodemaskMichal Hocko
show_mem() allows to filter out node specific data which is irrelevant to the allocation request via SHOW_MEM_FILTER_NODES. The filtering is done in skip_free_areas_node which skips all nodes which are not in the mems_allowed of the current process. This works most of the time as expected because the nodemask shouldn't be outside of the allocating task but there are some exceptions. E.g. memory hotplug might want to request allocations from outside of the allowed nodes (see new_node_page). Get rid of this hardcoded behavior and push the allocation mask down the show_mem path and use it instead of cpuset_current_mems_allowed. NULL nodemask is interpreted as cpuset_current_mems_allowed. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170117091543.25850-5-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm, page_alloc: warn_alloc print nodemaskMichal Hocko
warn_alloc is currently used for to report an allocation failure or an allocation stall. We print some details of the allocation request like the gfp mask and the request order. We do not print the allocation nodemask which is important when debugging the reason for the allocation failure as well. We alreaddy print the nodemask in the OOM report. Add nodemask to warn_alloc and print it in warn_alloc as well. Link: http://lkml.kernel.org/r/20170117091543.25850-3-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm, vmscan: cleanup lru size claculationsMichal Hocko
lruvec_lru_size returns the full size of the LRU list while we sometimes need a value reduced only to eligible zones (e.g. for lowmem requests). inactive_list_is_low is one such user. Later patches will add more of them. Add a new parameter to lruvec_lru_size and allow it filter out zones which are not eligible for the given context. Link: http://lkml.kernel.org/r/20170117103702.28542-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm, thp: add new defer+madvise defrag optionDavid Rientjes
There is no thp defrag option that currently allows MADV_HUGEPAGE regions to do direct compaction and reclaim while all other thp allocations simply trigger kswapd and kcompactd in the background and fail immediately. The "defer" setting simply triggers background reclaim and compaction for all regions, regardless of MADV_HUGEPAGE, which makes it unusable for our userspace where MADV_HUGEPAGE is being used to indicate the application is willing to wait for work for thp memory to be available. The "madvise" setting will do direct compaction and reclaim for these MADV_HUGEPAGE regions, but does not trigger kswapd and kcompactd in the background for anybody else. For reasonable usage, there needs to be a mesh between the two options. This patch introduces a fifth mode, "defer+madvise", that will do direct reclaim and compaction for MADV_HUGEPAGE regions and trigger background reclaim and compaction for everybody else so that hugepages may be available in the near future. A proposal to allow direct reclaim and compaction for MADV_HUGEPAGE regions as part of the "defer" mode, making it a very powerful setting and avoids breaking userspace, was offered: http://marc.info/?t=148236612700003 This additional mode is a compromise. A second proposal to allow both "defer" and "madvise" to be selected at the same time was also offered: http://marc.info/?t=148357345300001. This is possible, but there was a concern that it might break existing userspaces the parse the output of the defrag mode, so the fifth option was introduced instead. This patch also cleans up the helper function for storing to "enabled" and "defrag" since the former supports three modes while the latter supports five and triple_flag_store() was getting unnecessarily messy. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701101614330.41805@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm/swap: skip readahead only when swap slot cache is enabledHuang Ying
Because during swap off, a swap entry may have swap_map[] == SWAP_HAS_CACHE (for example, just allocated). If we return NULL in __read_swap_cache_async(), the swap off will abort. So when swap slot cache is disabled, (for swap off), we will wait for page to be put into swap cache in such race condition. This should not be a problem for swap slot cache, because swap slot cache should be drained after clearing swap_slot_cache_enabled. [ying.huang@intel.com: fix memory leak in __read_swap_cache_async()] Link: http://lkml.kernel.org/r/874lzt6znd.fsf@yhuang-dev.intel.com Link: http://lkml.kernel.org/r/5e2c5f6abe8e6eb0797408897b1bba80938e9b9d.1484082593.git.tim.c.chen@linux.intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> escreveu: Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm/swap: add cache for swap slots allocationTim Chen
We add per cpu caches for swap slots that can be allocated and freed quickly without the need to touch the swap info lock. Two separate caches are maintained for swap slots allocated and swap slots returned. This is to allow the swap slots to be returned to the global pool in a batch so they will have a chance to be coaelesced with other slots in a cluster. We do not reuse the slots that are returned right away, as it may increase fragmentation of the slots. The swap allocation cache is protected by a mutex as we may sleep when searching for empty slots in cache. The swap free cache is protected by a spin lock as we cannot sleep in the free path. We refill the swap slots cache when we run out of slots, and we disable the swap slots cache and drain the slots if the global number of slots fall below a low watermark threshold. We re-enable the cache agian when the slots available are above a high watermark. [ying.huang@intel.com: use raw_cpu_ptr over this_cpu_ptr for swap slots access] [tim.c.chen@linux.intel.com: add comments on locks in swap_slots.h] Link: http://lkml.kernel.org/r/20170118180327.GA24225@linux.intel.com Link: http://lkml.kernel.org/r/35de301a4eaa8daa2977de6e987f2c154385eb66.1484082593.git.tim.c.chen@linux.intel.com Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> escreveu: Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm/swap: free swap slots in batchTim Chen
Add new functions that free unused swap slots in batches without the need to reacquire swap info lock. This improves scalability and reduce lock contention. Link: http://lkml.kernel.org/r/c25e0fcdfd237ec4ca7db91631d3b9f6ed23824e.1484082593.git.tim.c.chen@linux.intel.com Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> escreveu: Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm/swap: allocate swap slots in batchesTim Chen
Currently, the swap slots are allocated one page at a time, causing contention to the swap_info lock protecting the swap partition on every page being swapped. This patch adds new functions get_swap_pages and scan_swap_map_slots to request multiple swap slots at once. This will reduces the lock contention on the swap_info lock. Also scan_swap_map_slots can operate more efficiently as swap slots often occurs in clusters close to each other on a swap device and it is quicker to allocate them together. Link: http://lkml.kernel.org/r/9fec2845544371f62c3763d43510045e33d286a6.1484082593.git.tim.c.chen@linux.intel.com Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> escreveu: Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm/swap: skip readahead for unreferenced swap slotsTim Chen
We can avoid needlessly allocating page for swap slots that are not used by anyone. No pages have to be read in for these slots. Link: http://lkml.kernel.org/r/0784b3f20b9bd3aa5552219624cb78dc4ae710c9.1484082593.git.tim.c.chen@linux.intel.com Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> escreveu: Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm/swap: split swap cache into 64MB trunksHuang, Ying
The patch is to improve the scalability of the swap out/in via using fine grained locks for the swap cache. In current kernel, one address space will be used for each swap device. And in the common configuration, the number of the swap device is very small (one is typical). This causes the heavy lock contention on the radix tree of the address space if multiple tasks swap out/in concurrently. But in fact, there is no dependency between pages in the swap cache. So that, we can split the one shared address space for each swap device into several address spaces to reduce the lock contention. In the patch, the shared address space is split into 64MB trunks. 64MB is chosen to balance the memory space usage and effect of lock contention reduction. The size of struct address_space on x86_64 architecture is 408B, so with the patch, 6528B more memory will be used for every 1GB swap space on x86_64 architecture. One address space is still shared for the swap entries in the same 64M trunks. To avoid lock contention for the first round of swap space allocation, the order of the swap clusters in the initial free clusters list is changed. The swap space distance between the consecutive swap clusters in the free cluster list is at least 64M. After the first round of allocation, the swap clusters are expected to be freed randomly, so the lock contention should be reduced effectively. Link: http://lkml.kernel.org/r/735bab895e64c930581ffb0a05b661e01da82bc5.1484082593.git.tim.c.chen@linux.intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> escreveu: Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm/swap: add cluster lockHuang, Ying
This patch is to reduce the lock contention of swap_info_struct->lock via using a more fine grained lock in swap_cluster_info for some swap operations. swap_info_struct->lock is heavily contended if multiple processes reclaim pages simultaneously. Because there is only one lock for each swap device. While in common configuration, there is only one or several swap devices in the system. The lock protects almost all swap related operations. In fact, many swap operations only access one element of swap_info_struct->swap_map array. And there is no dependency between different elements of swap_info_struct->swap_map. So a fine grained lock can be used to allow parallel access to the different elements of swap_info_struct->swap_map. In this patch, a spinlock is added to swap_cluster_info to protect the elements of swap_info_struct->swap_map in the swap cluster and the fields of swap_cluster_info. This reduced locking contention for swap_info_struct->swap_map access greatly. Because of the added spinlock, the size of swap_cluster_info increases from 4 bytes to 8 bytes on the 64 bit and 32 bit system. This will use additional 4k RAM for every 1G swap space. Because the size of swap_cluster_info is much smaller than the size of the cache line (8 vs 64 on x86_64 architecture), there may be false cache line sharing between spinlocks in swap_cluster_info. To avoid the false sharing in the first round of the swap cluster allocation, the order of the swap clusters in the free clusters list is changed. So that, the swap_cluster_info sharing the same cache line will be placed as far as possible. After the first round of allocation, the order of the clusters in free clusters list is expected to be random. So the false sharing should be not serious. Compared with a previous implementation using bit_spin_lock, the sequential swap out throughput improved about 3.2%. Test was done on a Xeon E5 v3 system. The swap device used is a RAM simulated PMEM (persistent memory) device. To test the sequential swapping out, the test case created 32 processes, which sequentially allocate and write to the anonymous pages until the RAM and part of the swap device is used. [ying.huang@intel.com: v5] Link: http://lkml.kernel.org/r/878tqeuuic.fsf_-_@yhuang-dev.intel.com [minchan@kernel.org: initialize spinlock for swap_cluster_info] Link: http://lkml.kernel.org/r/1486434945-29753-1-git-send-email-minchan@kernel.org [hughd@google.com: annotate nested locking for cluster lock] Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1702161050540.21773@eggly.anvils Link: http://lkml.kernel.org/r/dbb860bbd825b1aaba18988015e8963f263c3f0d.1484082593.git.tim.c.chen@linux.intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> escreveu: Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22powerpc: do not make the entire heap executableDenys Vlasenko
On 32-bit powerpc the ELF PLT sections of binaries (built with --bss-plt, or with a toolchain which defaults to it) look like this: [17] .sbss NOBITS 0002aff8 01aff8 000014 00 WA 0 0 4 [18] .plt NOBITS 0002b00c 01aff8 000084 00 WAX 0 0 4 [19] .bss NOBITS 0002b090 01aff8 0000a4 00 WA 0 0 4 Which results in an ELF load header: Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align LOAD 0x019c70 0x00029c70 0x00029c70 0x01388 0x014c4 RWE 0x10000 This is all correct, the load region containing the PLT is marked as executable. Note that the PLT starts at 0002b00c but the file mapping ends at 0002aff8, so the PLT falls in the 0 fill section described by the load header, and after a page boundary. Unfortunately the generic ELF loader ignores the X bit in the load headers when it creates the 0 filled non-file backed mappings. It assumes all of these mappings are RW BSS sections, which is not the case for PPC. gcc/ld has an option (--secure-plt) to not do this, this is said to incur a small performance penalty. Currently, to support 32-bit binaries with PLT in BSS kernel maps *entire brk area* with executable rights for all binaries, even --secure-plt ones. Stop doing that. Teach the ELF loader to check the X bit in the relevant load header and create 0 filled anonymous mappings that are executable if the load header requests that. Test program showing the difference in /proc/$PID/maps: int main() { char buf[16*1024]; char *p = malloc(123); /* make "[heap]" mapping appear */ int fd = open("/proc/self/maps", O_RDONLY); int len = read(fd, buf, sizeof(buf)); write(1, buf, len); printf("%p\n", p); return 0; } Compiled using: gcc -mbss-plt -m32 -Os test.c -otest Unpatched ppc64 kernel: 00100000-00120000 r-xp 00000000 00:00 0 [vdso] 0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094 /usr/lib/libc-2.17.so 0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094 /usr/lib/libc-2.17.so 0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094 /usr/lib/libc-2.17.so 10000000-10010000 r-xp 00000000 fd:00 100674505 /home/user/test 10010000-10020000 r--p 00000000 fd:00 100674505 /home/user/test 10020000-10030000 rw-p 00010000 fd:00 100674505 /home/user/test 10690000-106c0000 rwxp 00000000 00:00 0 [heap] f7f70000-f7fa0000 r-xp 00000000 fd:00 67898089 /usr/lib/ld-2.17.so f7fa0000-f7fb0000 r--p 00020000 fd:00 67898089 /usr/lib/ld-2.17.so f7fb0000-f7fc0000 rw-p 00030000 fd:00 67898089 /usr/lib/ld-2.17.so ffa90000-ffac0000 rw-p 00000000 00:00 0 [stack] 0x10690008 Patched ppc64 kernel: 00100000-00120000 r-xp 00000000 00:00 0 [vdso] 0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094 /usr/lib/libc-2.17.so 0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094 /usr/lib/libc-2.17.so 0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094 /usr/lib/libc-2.17.so 10000000-10010000 r-xp 00000000 fd:00 100674505 /home/user/test 10010000-10020000 r--p 00000000 fd:00 100674505 /home/user/test 10020000-10030000 rw-p 00010000 fd:00 100674505 /home/user/test 10180000-101b0000 rw-p 00000000 00:00 0 [heap] ^^^^ this has changed f7c60000-f7c90000 r-xp 00000000 fd:00 67898089 /usr/lib/ld-2.17.so f7c90000-f7ca0000 r--p 00020000 fd:00 67898089 /usr/lib/ld-2.17.so f7ca0000-f7cb0000 rw-p 00030000 fd:00 67898089 /usr/lib/ld-2.17.so ff860000-ff890000 rw-p 00000000 00:00 0 [stack] 0x10180008 The patch was originally posted in 2012 by Jason Gunthorpe and apparently ignored: https://lkml.org/lkml/2012/9/30/138 Lightly run-tested. Link: http://lkml.kernel.org/r/20161215131950.23054-1-dvlasenk@redhat.com Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Tested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Florian Weimer <fweimer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm: page_alloc: skip over regions of invalid pfns where possiblePaul Burton
When using a sparse memory model memmap_init_zone() when invoked with the MEMMAP_EARLY context will skip over pages which aren't valid - ie. which aren't in a populated region of the sparse memory map. However if the memory map is extremely sparse then it can spend a long time linearly checking each PFN in a large non-populated region of the memory map & skipping it in turn. When CONFIG_HAVE_MEMBLOCK_NODE_MAP is enabled, we have sufficient information to quickly discover the next valid PFN given an invalid one by searching through the list of memory regions & skipping forwards to the first PFN covered by the memory region to the right of the non-populated region. Implement this in order to speed up memmap_init_zone() for systems with extremely sparse memory maps. James said "I have tested this patch on a virtual model of a Samurai CPU with a sparse memory map. The kernel boot time drops from 109 to 62 seconds. " Link: http://lkml.kernel.org/r/20161125185518.29885-1-paul.burton@imgtec.com Signed-off-by: Paul Burton <paul.burton@imgtec.com> Tested-by: James Hartley <james.hartley@imgtec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm, compaction: add vmstats for kcompactd workDavid Rientjes
A "compact_daemon_wake" vmstat exists that represents the number of times kcompactd has woken up. This doesn't represent how much work it actually did, though. It's useful to understand how much compaction work is being done by kcompactd versus other methods such as direct compaction and explicitly triggered per-node (or system) compaction. This adds two new vmstats: "compact_daemon_migrate_scanned" and "compact_daemon_free_scanned" to represent the number of pages kcompactd has scanned as part of its migration scanner and freeing scanner, respectively. These values are still accounted for in the general "compact_migrate_scanned" and "compact_free_scanned" for compatibility. It could be argued that explicitly triggered compaction could also be tracked separately, and that could be added if others find it useful. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1612071749390.69852@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22mm: un-export wake_up_page functionsNicholas Piggin
These are no longer used outside mm/filemap.c, so un-export them and make them static where possible. These were exported specifically for NFS use in commit a4796e37c12e ("MM: export page_wakeup functions"). Link: http://lkml.kernel.org/r/20170103182234.30141-3-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Anna Schumaker <anna.schumaker@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: shmem: introduce vma_is_shmemMike Rapoport
Currently userfault relies on vma_is_anonymous and vma_is_hugetlb to ensure compatibility of a VMA with userfault. Introduction of vma_is_shmem allows detection if tmpfs backed VMAs, so that they may be used with userfaultfd. Current implementation presumes usage of vma_is_shmem only by slow path routines in userfaultfd, therefore the vma_is_shmem is not made inline to leave the few remaining free bits in vm_flags. Link: http://lkml.kernel.org/r/20161216144821.5183-30-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd supportMike Rapoport
shmem_mcopy_atomic_pte is the low level routine that implements the userfaultfd UFFDIO_COPY command. It is based on the existing mcopy_atomic_pte routine with modifications for shared memory pages. Link: http://lkml.kernel.org/r/20161216144821.5183-29-aarcange@redhat.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: hugetlbfs: gup: support VM_FAULT_RETRYAndrea Arcangeli
Add support for VM_FAULT_RETRY to follow_hugetlb_page() so that get_user_pages_unlocked/locked and "nonblocking/FOLL_NOWAIT" features will work on hugetlbfs. This is required for fully functional userfaultfd non-present support on hugetlbfs. Link: http://lkml.kernel.org/r/20161216144821.5183-25-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: hugetlbfs: fix __mcopy_atomic_hugetlb retry/error processingMike Kravetz
The new routine copy_huge_page_from_user() uses kmap_atomic() to map PAGE_SIZE pages. However, this prevents page faults in the subsequent call to copy_from_user(). This is OK in the case where the routine is copied with mmap_sema held. However, in another case we want to allow page faults. So, add a new argument allow_pagefault to indicate if the routine should allow page faults. [dan.carpenter@oracle.com: unmap the correct pointer] Link: http://lkml.kernel.org/r/20170113082608.GA3548@mwanda [akpm@linux-foundation.org: kunmap() takes a page*, per Hugh] Link: http://lkml.kernel.org/r/20161216144821.5183-20-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: hugetlbfs: add hugetlb_mcopy_atomic_pte for userfaultfd supportMike Kravetz
hugetlb_mcopy_atomic_pte is the low level routine that implements the userfaultfd UFFDIO_COPY command. It is based on the existing mcopy_atomic_pte routine with modifications for huge pages. Link: http://lkml.kernel.org/r/20161216144821.5183-18-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd ↵Mike Kravetz
support userfaultfd UFFDIO_COPY allows user level code to copy data to a page at fault time. The data is copied from user space to a newly allocated huge page. The new routine copy_huge_page_from_user performs this copy. Link: http://lkml.kernel.org/r/20161216144821.5183-17-aarcange@redhat.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: non-cooperative: add madvise() event for MADV_DONTNEED requestPavel Emelyanov
If the page is punched out of the address space the uffd reader should know this and zeromap the respective area in case of the #PF event. Link: http://lkml.kernel.org/r/20161216144821.5183-14-aarcange@redhat.com Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: non-cooperative: optimize mremap_userfaultfd_complete()Andrea Arcangeli
Optimize the mremap_userfaultfd_complete() interface to pass only the vm_userfaultfd_ctx pointer through the stack as a microoptimization. Link: http://lkml.kernel.org/r/20161216144821.5183-13-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Hillf Danton <hillf.zj@alibaba-inc.com> Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: non-cooperative: add mremap() eventPavel Emelyanov
The event denotes that an area [start:end] moves to different location. Length change isn't reported as "new" addresses, if they appear on the uffd reader side they will not contain any data and the latter can just zeromap them. Waiting for the event ACK is also done outside of mmap sem, as for fork event. Link: http://lkml.kernel.org/r/20161216144821.5183-12-aarcange@redhat.com Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22userfaultfd: non-cooperative: Add fork() eventPavel Emelyanov
When the mm with uffd-ed vmas fork()-s the respective vmas notify their uffds with the event which contains a descriptor with new uffd. This new descriptor can then be used to get events from the child and populate its mm with data. Note, that there can be different uffd-s controlling different vmas within one mm, so first we should collect all those uffds (and ctx-s) in a list and then notify them all one by one but only once per fork(). The context is created at fork() time but the descriptor, file struct and anon inode object is created at event read time. So some trickery is added to the userfaultfd_ctx_read() to handle the ctx queues' locking vs file creation. Another thing worth noticing is that the task that fork()-s waits for the uffd event to get processed WITHOUT the mmap sem. [aarcange@redhat.com: build warning fix] Link: http://lkml.kernel.org/r/20161216144821.5183-10-aarcange@redhat.com Link: http://lkml.kernel.org/r/20161216144821.5183-9-aarcange@redhat.com Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michael Rapoport <RAPOPORT@il.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22slab: use memcg_kmem_cache_wq for slab destruction operationsTejun Heo
If there's contention on slab_mutex, queueing the per-cache destruction work item on the system_wq can unnecessarily create and tie up a lot of kworkers. Rename memcg_kmem_cache_create_wq to memcg_kmem_cache_wq and make it global and use that workqueue for the destruction work items too. While at it, convert the workqueue from an unbound workqueue to a per-cpu one with concurrency limited to 1. It's generally preferable to use per-cpu workqueues and concurrency limit of 1 is safe enough. This is suggested by Joonsoo Kim. Link: http://lkml.kernel.org/r/20170117235411.9408-11-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Acked-by: Vladimir Davydov <vdavydov@tarantool.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22slab: remove synchronous synchronize_sched() from memcg cache deactivation pathTejun Heo
With kmem cgroup support enabled, kmem_caches can be created and destroyed frequently and a great number of near empty kmem_caches can accumulate if there are a lot of transient cgroups and the system is not under memory pressure. When memory reclaim starts under such conditions, it can lead to consecutive deactivation and destruction of many kmem_caches, easily hundreds of thousands on moderately large systems, exposing scalability issues in the current slab management code. This is one of the patches to address the issue. slub uses synchronize_sched() to deactivate a memcg cache. synchronize_sched() is an expensive and slow operation and doesn't scale when a huge number of caches are destroyed back-to-back. While there used to be a simple batching mechanism, the batching was too restricted to be helpful. This patch implements slab_deactivate_memcg_cache_rcu_sched() which slub can use to schedule sched RCU callback instead of performing synchronize_sched() synchronously while holding cgroup_mutex. While this adds online cpus, mems and slab_mutex operations, operating on these locks back-to-back from the same kworker, which is what's gonna happen when there are many to deactivate, isn't expensive at all and this gets rid of the scalability problem completely. Link: http://lkml.kernel.org/r/20170117235411.9408-9-tj@kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jay Vana <jsvana@fb.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>