summaryrefslogtreecommitdiff
path: root/arch/powerpc
AgeCommit message (Collapse)Author
2023-11-02Merge tag 'mm-nonmm-stable-2023-11-02-14-08' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "As usual, lots of singleton and doubleton patches all over the tree and there's little I can say which isn't in the individual changelogs. The lengthier patch series are - 'kdump: use generic functions to simplify crashkernel reservation in arch', from Baoquan He. This is mainly cleanups and consolidation of the 'crashkernel=' kernel parameter handling - After much discussion, David Laight's 'minmax: Relax type checks in min() and max()' is here. Hopefully reduces some typecasting and the use of min_t() and max_t() - A group of patches from Oleg Nesterov which clean up and slightly fix our handling of reads from /proc/PID/task/... and which remove task_struct.thread_group" * tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (64 commits) scripts/gdb/vmalloc: disable on no-MMU scripts/gdb: fix usage of MOD_TEXT not defined when CONFIG_MODULES=n .mailmap: add address mapping for Tomeu Vizoso mailmap: update email address for Claudiu Beznea tools/testing/selftests/mm/run_vmtests.sh: lower the ptrace permissions .mailmap: map Benjamin Poirier's address scripts/gdb: add lx_current support for riscv ocfs2: fix a spelling typo in comment proc: test ProtectionKey in proc-empty-vm test proc: fix proc-empty-vm test with vsyscall fs/proc/base.c: remove unneeded semicolon do_io_accounting: use sig->stats_lock do_io_accounting: use __for_each_thread() ocfs2: replace BUG_ON() at ocfs2_num_free_extents() with ocfs2_error() ocfs2: fix a typo in a comment scripts/show_delta: add __main__ judgement before main code treewide: mark stuff as __ro_after_init fs: ocfs2: check status values proc: test /proc/${pid}/statm compiler.h: move __is_constexpr() to compiler.h ...
2023-11-02Merge tag 'mm-stable-2023-11-01-14-33' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series 'Fixes and cleanups to compaction' - Joel Fernandes has a patchset ('Optimize mremap during mutual alignment within PMD') which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series 'Do not try to access unaccepted memory' Adrian Hunter provides some fixups for the recently-added 'unaccepted memory' feature. To increase the feature's checking coverage. 'Plug a few gaps where RAM is exposed without checking if it is unaccepted memory' - In the series 'cleanups for lockless slab shrink' Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series 'use refcount+RCU method to implement lockless slab shrink' - David Hildenbrand contributes some maintenance work for the rmap code in the series 'Anon rmap cleanups' - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series 'mm: migrate: more folio conversion and unification' - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series 'Add and use bdev_getblk()' - In the series 'Use nth_page() in place of direct struct page manipulation' Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames - In the series 'mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO' has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code rationalization and folio conversions in the hugetlb code - Yin Fengwei has improved mlock()'s handling of large folios in the series 'support large folio for mlock' - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named 'MDWE without inheritance' - Kefeng Wang has provided the series 'mm: convert numa balancing functions to use a folio' which does what it says - In the series 'mm/ksm: add fork-exec support for prctl' Stefan Roesch makes is possible for a process to propagate KSM treatment across exec() - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use 'high bandwidth memory' in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named 'memory tiering: calculate abstract distance based on ACPI HMAT' - In the series 'Smart scanning mode for KSM' Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series 'mm: memcg: fix tracking of pending stats updates values' - In the series 'Implement IOCTL to get and optionally clear info about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU - Hugh Dickins contributed the series 'shmem,tmpfs: general maintenance', a bunch of relatively minor maintenance tweaks to this code - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series 'Handle more faults under the VMA lock'. Some rationalizations of the fault path became possible as a result - In the series 'mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()' David Hildenbrand has implemented some cleanups and folio conversions - In the series 'various improvements to the GUP interface' Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements - Andrey Konovalov has sent along the series 'kasan: assorted fixes and improvements' which does those things - Some page allocator maintenance work from Kemeng Shi in the series 'Two minor cleanups to break_down_buddy_pages' - In thes series 'New selftest for mm' Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups and an optimization to the core pagecache code - Nhat Pham has added memcg accounting for hugetlb memory in the series 'hugetlb memcg accounting' - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series 'Abstract vma_merge() and split_vma()' - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series 'Fix page_owner's use of free timestamps' - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series 'permit write-sealed memfd read-only shared mappings' - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series 'Batch hugetlb vmemmap modification operations' - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series 'Finish the create_empty_buffers() transition' - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series 'mm: PCP high auto-tuning' - Roman Gushchin has contributed the patchset 'mm: improve performance of accounted kernel memory allocations' which improves their performance by ~30% as measured by a micro-benchmark - folio conversions from Kefeng Wang in the series 'mm: convert page cpupid functions to folios' - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about kmemleak' - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series 'handle memoryless nodes more appropriately' - khugepaged conversions from Vishal Moola in the series 'Some khugepaged folio conversions'" [ bcachefs conflicts with the dynamically allocated shrinkers have been resolved as per Stephen Rothwell in https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/ with help from Qi Zheng. The clone3 test filtering conflict was half-arsed by yours truly ] * tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits) mm/damon/sysfs: update monitoring target regions for online input commit mm/damon/sysfs: remove requested targets when online-commit inputs selftests: add a sanity check for zswap Documentation: maple_tree: fix word spelling error mm/vmalloc: fix the unchecked dereference warning in vread_iter() zswap: export compression failure stats Documentation: ubsan: drop "the" from article title mempolicy: migration attempt to match interleave nodes mempolicy: mmap_lock is not needed while migrating folios mempolicy: alloc_pages_mpol() for NUMA policy without vma mm: add page_rmappable_folio() wrapper mempolicy: remove confusing MPOL_MF_LAZY dead code mempolicy: mpol_shared_policy_init() without pseudo-vma mempolicy trivia: use pgoff_t in shared mempolicy tree mempolicy trivia: slightly more consistent naming mempolicy trivia: delete those ancient pr_debug()s mempolicy: fix migrate_pages(2) syscall return nr_failed kernfs: drop shared NUMA mempolicy hooks hugetlbfs: drop shared NUMA mempolicy pretence mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets() ...
2023-11-01Merge tag 'sysctl-6.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "To help make the move of sysctls out of kernel/sysctl.c not incur a size penalty sysctl has been changed to allow us to not require the sentinel, the final empty element on the sysctl array. Joel Granados has been doing all this work. On the v6.6 kernel we got the major infrastructure changes required to support this. For v6.7-rc1 we have all arch/ and drivers/ modified to remove the sentinel. Both arch and driver changes have been on linux-next for a bit less than a month. It is worth re-iterating the value: - this helps reduce the overall build time size of the kernel and run time memory consumed by the kernel by about ~64 bytes per array - the extra 64-byte penalty is no longer inncurred now when we move sysctls out from kernel/sysctl.c to their own files For v6.8-rc1 expect removal of all the sentinels and also then the unneeded check for procname == NULL. The last two patches are fixes recently merged by Krister Johansen which allow us again to use softlockup_panic early on boot. This used to work but the alias work broke it. This is useful for folks who want to detect softlockups super early rather than wait and spend money on cloud solutions with nothing but an eventual hung kernel. Although this hadn't gone through linux-next it's also a stable fix, so we might as well roll through the fixes now" * tag 'sysctl-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (23 commits) watchdog: move softlockup_panic back to early_param proc: sysctl: prevent aliased sysctls from getting passed to init intel drm: Remove now superfluous sentinel element from ctl_table array Drivers: hv: Remove now superfluous sentinel element from ctl_table array raid: Remove now superfluous sentinel element from ctl_table array fw loader: Remove the now superfluous sentinel element from ctl_table array sgi-xp: Remove the now superfluous sentinel element from ctl_table array vrf: Remove the now superfluous sentinel element from ctl_table array char-misc: Remove the now superfluous sentinel element from ctl_table array infiniband: Remove the now superfluous sentinel element from ctl_table array macintosh: Remove the now superfluous sentinel element from ctl_table array parport: Remove the now superfluous sentinel element from ctl_table array scsi: Remove now superfluous sentinel element from ctl_table array tty: Remove now superfluous sentinel element from ctl_table array xen: Remove now superfluous sentinel element from ctl_table array hpet: Remove now superfluous sentinel element from ctl_table array c-sky: Remove now superfluous sentinel element from ctl_talbe array powerpc: Remove now superfluous sentinel element from ctl_table arrays riscv: Remove now superfluous sentinel element from ctl_table array x86/vdso: Remove now superfluous sentinel element from ctl_table array ...
2023-11-01Merge tag 'docs-6.7' of git://git.lwn.net/linuxLinus Torvalds
Pull documentation updates from Jonathan Corbet: "The number of commits for documentation is not huge this time around, but there are some significant changes nonetheless: - Some more Spanish-language and Chinese translations - The much-discussed documentation of the confidential-computing threat model - Powerpc and RISCV documentation move under Documentation/arch - these complete this particular bit of documentation churn - A large traditional-Chinese documentation update - A new document on backporting and conflict resolution - Some kernel-doc and Sphinx fixes Plus the usual smattering of smaller updates and typo fixes" * tag 'docs-6.7' of git://git.lwn.net/linux: (40 commits) scripts/kernel-doc: Fix the regex for matching -Werror flag docs: backporting: address feedback Documentation: driver-api: pps: Update PPS generator documentation speakup: Document USB support doc: blk-ioprio: Bring the doc in line with the implementation docs: usb: fix reference to nonexistent file in UVC Gadget docs: doc-guide: mention 'make refcheckdocs' Documentation: fix typo in dynamic-debug howto scripts/kernel-doc: match -Werror flag strictly Documentation/sphinx: Remove the repeated word "the" in comments. docs: sparse: add SPDX-License-Identifier docs/zh_CN: Add subsystem-apis Chinese translation docs/zh_TW: update contents for zh_TW docs: submitting-patches: encourage direct notifications to commenters docs: add backporting and conflict resolution document docs: move riscv under arch docs: update link to powerpc/vmemmap_dedup.rst mm/memory-hotplug: fix typo in documentation docs: move powerpc under arch PCI: Update the devres documentation regarding to pcim_*() ...
2023-11-01Merge tag 'asm-generic-6.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic Pull ia64 removal and asm-generic updates from Arnd Bergmann: - The ia64 architecture gets its well-earned retirement as planned, now that there is one last (mostly) working release that will be maintained as an LTS kernel. - The architecture specific system call tables are updated for the added map_shadow_stack() syscall and to remove references to the long-gone sys_lookup_dcookie() syscall. * tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: hexagon: Remove unusable symbols from the ptrace.h uapi asm-generic: Fix spelling of architecture arch: Reserve map_shadow_stack() syscall number for all architectures syscalls: Cleanup references to sys_lookup_dcookie() Documentation: Drop or replace remaining mentions of IA64 lib/raid6: Drop IA64 support Documentation: Drop IA64 from feature descriptions kernel: Drop IA64 support from sig_fault handlers arch: Remove Itanium (IA-64) architecture
2023-11-01Merge tag 'for-6.7/block-2023-10-30' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - Improvements to the queue_rqs() support, and adding null_blk support for that as well (Chengming) - Series improving badblocks support (Coly) - Key store support for sed-opal (Greg) - IBM partition string handling improvements (Jan) - Make number of ublk devices supported configurable (Mike) - Cancelation improvements for ublk (Ming) - MD pull requests via Song: - Handle timeout in md-cluster, by Denis Plotnikov - Cleanup pers->prepare_suspend, by Yu Kuai - Rewrite mddev_suspend(), by Yu Kuai - Simplify md_seq_ops, by Yu Kuai - Reduce unnecessary locking array_state_store(), by Mariusz Tkaczyk - Make rdev add/remove independent from daemon thread, by Yu Kuai - Refactor code around quiesce() and mddev_suspend(), by Yu Kuai - NVMe pull request via Keith: - nvme-auth updates (Mark) - nvme-tcp tls (Hannes) - nvme-fc annotaions (Kees) - Misc cleanups and improvements (Jiapeng, Joel) * tag 'for-6.7/block-2023-10-30' of git://git.kernel.dk/linux: (95 commits) block: ublk_drv: Remove unused function md: cleanup pers->prepare_suspend() nvme-auth: allow mixing of secret and hash lengths nvme-auth: use transformed key size to create resp nvme-auth: alloc nvme_dhchap_key as single buffer nvmet-tcp: use 'spin_lock_bh' for state_lock() powerpc/pseries: PLPKS SED Opal keystore support block: sed-opal: keystore access for SED Opal keys block:sed-opal: SED Opal keystore ublk: simplify aborting request ublk: replace monitor with cancelable uring_cmd ublk: quiesce request queue when aborting queue ublk: rename mm_lock as lock ublk: move ublk_cancel_dev() out of ub->mutex ublk: make sure io cmd handled in submitter task context ublk: don't get ublk device reference in ublk_abort_queue() ublk: Make ublks_max configurable ublk: Limit dev_id/ub_number values md-cluster: check for timeout while a new disk adding nvme: rework NVME_AUTH Kconfig selection ...
2023-11-01Merge tag 'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drmLinus Torvalds
Pull drm updates from Dave Airlie: "Highlights: - AMD adds some more upcoming HW platforms - Intel made Meteorlake stable and started adding Lunarlake - nouveau has a bunch of display rework in prepartion for the NVIDIA GSP firmware support - msm adds a7xx support - habanalabs has finished migration to accel subsystem Detail summary: kernel: - add initial vmemdup-user-array core: - fix platform remove() to return void - drm_file owner updated to reflect owner - move size calcs to drm buddy allocator - let GPUVM build as a module - allow variable number of run-queues in scheduler edid: - handle bad h/v sync_end in EDIDs panfrost: - add Boris as maintainer fbdev: - use fb_ops helpers more - only allow logo use from fbcon - rename fb_pgproto to pgprot_framebuffer - add HPD state to drm_connector_oob_hotplug_event - convert to fbdev i/o mem helpers i915: - Enable meteorlake by default - Early Xe2 LPD/Lunarlake display enablement - Rework subplatforms into IP version checks - GuC based TLB invalidation for Meteorlake - Display rework for future Xe driver integration - LNL FBC features - LNL display feature capability reads - update recommended fw versions for DG2+ - drop fastboot module parameter - added deviceid for Arrowlake-S - drop preproduction workarounds - don't disable preemption for resets - cleanup inlines in headers - PXP firmware loading fix - Fix sg list lengths - DSC PPS state readout/verification - Add more RPL P/U PCI IDs - Add new DG2-G12 stepping - DP enhanced framing support to state checker - Improve shared link bandwidth management - stop using GEM macros in display code - refactor related code into display code - locally enable W=1 warnings - remove PSR watchdog timers on LNL amdgpu: - RAS/FRU EEPROM updatse - IP discovery updatses - GC 11.5 support - DCN 3.5 support - VPE 6.1 support - NBIO 7.11 support - DML2 support - lots of IP updates - use flexible arrays for bo list handling - W=1 fixes - Enable seamless boot in more cases - Enable context type property for HDMI - Rework GPUVM TLB flushing - VCN IB start/size alignment fixes amdkfd: - GC 10/11 fixes - GC 11.5 support - use partial migration in GPU faults radeon: - W=1 Fixes - fix some possible buffer overflow/NULL derefs nouveau: - update uapi for NO_PREFETCH - scheduler/fence fixes - rework suspend/resume for GSP-RM - rework display in preparation for GSP-RM habanalabs: - uapi: expose tsc clock - uapi: block access to eventfd through control device - uapi: force dma-buf export to PAGE_SIZE alignments - complete move to accel subsystem - move firmware interface include files - perform hard reset on PCIe AXI drain event - optimise user interrupt handling msm: - DP: use existing helpers for DPCD - DPU: interrupts reworked - gpu: a7xx (a730/a740) support - decouple msm_drv from kms for headless devices mediatek: - MT8188 dsi/dp/edp support - DDP GAMMA - 12 bit LUT support - connector dynamic selection capability rockchip: - rv1126 mipi-dsi/vop support - add planar formats ast: - rename constants panels: - Mitsubishi AA084XE01 - JDI LPM102A188A - LTK050H3148W-CTA6 ivpu: - power management fixes qaic: - add detach slice bo api komeda: - add NV12 writeback tegra: - support NVSYNC/NHSYNC - host1x suspend fixes ili9882t: - separate into own driver" * tag 'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drm: (1803 commits) drm/amdgpu: Remove unused variables from amdgpu_show_fdinfo drm/amdgpu: Remove duplicate fdinfo fields drm/amd/amdgpu: avoid to disable gfxhub interrupt when driver is unloaded drm/amdgpu: Add EXT_COHERENT support for APU and NUMA systems drm/amdgpu: Retrieve CE count from ce_count_lo_chip in EccInfo table drm/amdgpu: Identify data parity error corrected in replay mode drm/amdgpu: Fix typo in IP discovery parsing drm/amd/display: fix S/G display enablement drm/amdxcp: fix amdxcp unloads incompletely drm/amd/amdgpu: fix the GPU power print error in pm info drm/amdgpu: Use pcie domain of xcc acpi objects drm/amd: check num of link levels when update pcie param drm/amdgpu: Add a read to GFX v9.4.3 ring test drm/amd/pm: call smu_cmn_get_smc_version in is_mode1_reset_supported. drm/amdgpu: get RAS poison status from DF v4_6_2 drm/amdgpu: Use discovery table's subrevision drm/amd/display: 3.2.256 drm/amd/display: add interface to query SubVP status drm/amd/display: Read before writing Backlight Mode Set Register drm/amd/display: Disable SYMCLK32_SE RCO on DCN314 ...
2023-10-30Merge tag 'hardening-v6.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: "One of the more voluminous set of changes is for adding the new __counted_by annotation[1] to gain run-time bounds checking of dynamically sized arrays with UBSan. - Add LKDTM test for stuck CPUs (Mark Rutland) - Improve LKDTM selftest behavior under UBSan (Ricardo Cañuelo) - Refactor more 1-element arrays into flexible arrays (Gustavo A. R. Silva) - Analyze and replace strlcpy and strncpy uses (Justin Stitt, Azeem Shaikh) - Convert group_info.usage to refcount_t (Elena Reshetova) - Add __counted_by annotations (Kees Cook, Gustavo A. R. Silva) - Add Kconfig fragment for basic hardening options (Kees Cook, Lukas Bulwahn) - Fix randstruct GCC plugin performance mode to stay in groups (Kees Cook) - Fix strtomem() compile-time check for small sources (Kees Cook)" * tag 'hardening-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (56 commits) hwmon: (acpi_power_meter) replace open-coded kmemdup_nul reset: Annotate struct reset_control_array with __counted_by kexec: Annotate struct crash_mem with __counted_by virtio_console: Annotate struct port_buffer with __counted_by ima: Add __counted_by for struct modsig and use struct_size() MAINTAINERS: Include stackleak paths in hardening entry string: Adjust strtomem() logic to allow for smaller sources hardening: x86: drop reference to removed config AMD_IOMMU_V2 randstruct: Fix gcc-plugin performance mode to stay in group mailbox: zynqmp: Annotate struct zynqmp_ipi_pdata with __counted_by drivers: thermal: tsens: Annotate struct tsens_priv with __counted_by irqchip/imx-intmux: Annotate struct intmux_data with __counted_by KVM: Annotate struct kvm_irq_routing_table with __counted_by virt: acrn: Annotate struct vm_memory_region_batch with __counted_by hwmon: Annotate struct gsc_hwmon_platform_data with __counted_by sparc: Annotate struct cpuinfo_tree with __counted_by isdn: kcapi: replace deprecated strncpy with strscpy_pad isdn: replace deprecated strncpy with strscpy NFS/flexfiles: Annotate struct nfs4_ff_layout_segment with __counted_by nfs41: Annotate struct nfs4_file_layout_dsaddr with __counted_by ...
2023-10-30Merge tag 'rcu-next-v6.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks Pull RCU updates from Frederic Weisbecker: - RCU torture, locktorture and generic torture infrastructure updates that include various fixes, cleanups and consolidations. Among the user visible things, ftrace dumps can now be found into their own file, and module parameters get better documented and reported on dumps. - Generic and misc fixes all over the place. Some highlights: * Hotplug handling has seen some light cleanups and comments * An RCU barrier can now be triggered through sysfs to serialize memory stress testing and avoid OOM * Object information is now dumped in case of invalid callback invocation * Also various SRCU issues, too hard to trigger to deserve urgent pull requests, have been fixed - RCU documentation updates - RCU reference scalability test minor fixes and doc improvements. - RCU tasks minor fixes - Stall detection updates. Introduce RCU CPU Stall notifiers that allows a subsystem to provide informations to help debugging. Also cure some false positive stalls. * tag 'rcu-next-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (56 commits) srcu: Only accelerate on enqueue time locktorture: Check the correct variable for allocation failure srcu: Fix callbacks acceleration mishandling rcu: Comment why callbacks migration can't wait for CPUHP_RCUTREE_PREP rcu: Standardize explicit CPU-hotplug calls rcu: Conditionally build CPU-hotplug teardown callbacks rcu: Remove references to rcu_migrate_callbacks() from diagrams rcu: Assume rcu_report_dead() is always called locally rcu: Assume IRQS disabled from rcu_report_dead() rcu: Use rcu_segcblist_segempty() instead of open coding it rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects srcu: Fix srcu_struct node grpmask overflow on 64-bit systems torture: Convert parse-console.sh to mktemp rcutorture: Traverse possible cpu to set maxcpu in rcu_nocb_toggle() rcutorture: Replace schedule_timeout*() 1-jiffy waits with HZ/20 torture: Add kvm.sh --debug-info argument locktorture: Rename readers_bind/writers_bind to bind_readers/bind_writers doc: Catch-up update for locktorture module parameters locktorture: Add call_rcu_chains module parameter locktorture: Add new module parameters to lock_torture_print_module_parms() ...
2023-10-30Merge tag 'sched-core-2023-10-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Fair scheduler (SCHED_OTHER) improvements: - Remove the old and now unused SIS_PROP code & option - Scan cluster before LLC in the wake-up path - Use candidate prev/recent_used CPU if scanning failed for cluster wakeup NUMA scheduling improvements: - Improve the VMA access-PID code to better skip/scan VMAs - Extend tracing to cover VMA-skipping decisions - Improve/fix the recently introduced sched_numa_find_nth_cpu() code - Generalize numa_map_to_online_node() Energy scheduling improvements: - Remove the EM_MAX_COMPLEXITY limit - Add tracepoints to track energy computation - Make the behavior of the 'sched_energy_aware' sysctl more consistent - Consolidate and clean up access to a CPU's max compute capacity - Fix uclamp code corner cases RT scheduling improvements: - Drive dl_rq->overloaded with dl_rq->pushable_dl_tasks updates - Drive the ->rto_mask with rt_rq->pushable_tasks updates Scheduler scalability improvements: - Rate-limit updates to tg->load_avg - On x86 disable IBRS when CPU is offline to improve single-threaded performance - Micro-optimize in_task() and in_interrupt() - Micro-optimize the PSI code - Avoid updating PSI triggers and ->rtpoll_total when there are no state changes Core scheduler infrastructure improvements: - Use saved_state to reduce some spurious freezer wakeups - Bring in a handful of fast-headers improvements to scheduler headers - Make the scheduler UAPI headers more widely usable by user-space - Simplify the control flow of scheduler syscalls by using lock guards - Fix sched_setaffinity() vs. CPU hotplug race Scheduler debuggability improvements: - Disallow writing invalid values to sched_rt_period_us - Fix a race in the rq-clock debugging code triggering warnings - Fix a warning in the bandwidth distribution code - Micro-optimize in_atomic_preempt_off() checks - Enforce that the tasklist_lock is held in for_each_thread() - Print the TGID in sched_show_task() - Remove the /proc/sys/kernel/sched_child_runs_first sysctl ... and misc cleanups & fixes" * tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits) sched/fair: Remove SIS_PROP sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup sched/fair: Scan cluster before scanning LLC in wake-up path sched: Add cpus_share_resources API sched/core: Fix RQCF_ACT_SKIP leak sched/fair: Remove unused 'curr' argument from pick_next_entity() sched/nohz: Update comments about NEWILB_KICK sched/fair: Remove duplicate #include sched/psi: Update poll => rtpoll in relevant comments sched: Make PELT acronym definition searchable sched: Fix stop_one_cpu_nowait() vs hotplug sched/psi: Bail out early from irq time accounting sched/topology: Rename 'DIE' domain to 'PKG' sched/psi: Delete the 'update_total' function parameter from update_triggers() sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes sched/headers: Remove comment referring to rq::cpu_load, since this has been removed sched/numa: Complete scanning of inactive VMAs when there is no alternative sched/numa: Complete scanning of partial VMAs regardless of PID activity sched/numa: Move up the access pid reset logic sched/numa: Trace decisions related to skipping VMAs ...
2023-10-30Merge tag 'locking-core-2023-10-28' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Info Molnar: "Futex improvements: - Add the 'futex2' syscall ABI, which is an attempt to get away from the multiplex syscall and adds a little room for extentions, while lifting some limitations. - Fix futex PI recursive rt_mutex waiter state bug - Fix inter-process shared futexes on no-MMU systems - Use folios instead of pages Micro-optimizations of locking primitives: - Improve arch_spin_value_unlocked() on asm-generic ticket spinlock architectures, to improve lockref code generation - Improve the x86-32 lockref_get_not_zero() main loop by adding build-time CMPXCHG8B support detection for the relevant lockref code, and by better interfacing the CMPXCHG8B assembly code with the compiler - Introduce arch_sync_try_cmpxchg() on x86 to improve sync_try_cmpxchg() code generation. Convert some sync_cmpxchg() users to sync_try_cmpxchg(). - Micro-optimize rcuref_put_slowpath() Locking debuggability improvements: - Improve CONFIG_DEBUG_RT_MUTEXES=y to have a fast-path as well - Enforce atomicity of sched_submit_work(), which is de-facto atomic but was un-enforced previously. - Extend <linux/cleanup.h>'s no_free_ptr() with __must_check semantics - Fix ww_mutex self-tests - Clean up const-propagation in <linux/seqlock.h> and simplify the API-instantiation macros a bit RT locking improvements: - Provide the rt_mutex_*_schedule() primitives/helpers and use them in the rtmutex code to avoid recursion vs. rtlock on the PI state. - Add nested blocking lockdep asserts to rt_mutex_lock(), rtlock_lock() and rwbase_read_lock() .. plus misc fixes & cleanups" * tag 'locking-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits) futex: Don't include process MM in futex key on no-MMU locking/seqlock: Fix grammar in comment alpha: Fix up new futex syscall numbers locking/seqlock: Propagate 'const' pointers within read-only methods, remove forced type casts locking/lockdep: Fix string sizing bug that triggers a format-truncation compiler-warning locking/seqlock: Change __seqprop() to return the function pointer locking/seqlock: Simplify SEQCOUNT_LOCKNAME() locking/atomics: Use atomic_try_cmpxchg_release() to micro-optimize rcuref_put_slowpath() locking/atomic, xen: Use sync_try_cmpxchg() instead of sync_cmpxchg() locking/atomic/x86: Introduce arch_sync_try_cmpxchg() locking/atomic: Add generic support for sync_try_cmpxchg() and its fallback locking/seqlock: Fix typo in comment futex/requeue: Remove unnecessary ‘NULL’ initialization from futex_proxy_trylock_atomic() locking/local, arch: Rewrite local_add_unless() as a static inline function locking/debug: Fix debugfs API return value checks to use IS_ERR() locking/ww_mutex/test: Make sure we bail out instead of livelock locking/ww_mutex/test: Fix potential workqueue corruption locking/ww_mutex/test: Use prng instead of rng to avoid hangs at bootup futex: Add sys_futex_requeue() futex: Add flags2 argument to futex_requeue() ...
2023-10-30Merge tag 'vfs-6.7.ctime' of ↵Linus Torvalds
gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs inode time accessor updates from Christian Brauner: "This finishes the conversion of all inode time fields to accessor functions as discussed on list. Changing timestamps manually as we used to do before is error prone. Using accessors function makes this robust. It does not contain the switch of the time fields to discrete 64 bit integers to replace struct timespec and free up space in struct inode. But after this, the switch can be trivially made and the patch should only affect the vfs if we decide to do it" * tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (86 commits) fs: rename inode i_atime and i_mtime fields security: convert to new timestamp accessors selinux: convert to new timestamp accessors apparmor: convert to new timestamp accessors sunrpc: convert to new timestamp accessors mm: convert to new timestamp accessors bpf: convert to new timestamp accessors ipc: convert to new timestamp accessors linux: convert to new timestamp accessors zonefs: convert to new timestamp accessors xfs: convert to new timestamp accessors vboxsf: convert to new timestamp accessors ufs: convert to new timestamp accessors udf: convert to new timestamp accessors ubifs: convert to new timestamp accessors tracefs: convert to new timestamp accessors sysv: convert to new timestamp accessors squashfs: convert to new timestamp accessors server: convert to new timestamp accessors client: convert to new timestamp accessors ...
2023-10-30Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds
Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Rename and export helpers that get write access to a mount. They are used in overlayfs to get write access to the upper mount. - Print the pretty name of the root device on boot failure. This helps in scenarios where we would usually only print "unknown-block(1,2)". - Add an internal SB_I_NOUMASK flag. This is another part in the endless POSIX ACL saga in a way. When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip the umask because if the relevant inode has POSIX ACLs set it might take the umask from there. But if the inode doesn't have any POSIX ACLs set then we apply the umask in the filesytem itself. So we end up with: (1) no SB_POSIXACL -> strip umask in vfs (2) SB_POSIXACL -> strip umask in filesystem The umask semantics associated with SB_POSIXACL allowed filesystems that don't even support POSIX ACLs at all to raise SB_POSIXACL purely to avoid umask stripping. That specifically means NFS v4 and Overlayfs. NFS v4 does it because it delegates this to the server and Overlayfs because it needs to delegate umask stripping to the upper filesystem, i.e., the filesystem used as the writable layer. This went so far that SB_POSIXACL is raised eve on kernels that don't even have POSIX ACL support at all. Stop this blatant abuse and add SB_I_NOUMASK which is an internal superblock flag that filesystems can raise to opt out of umask handling. That should really only be the two mentioned above. It's not that we want any filesystems to do this. Ideally we have all umask handling always in the vfs. - Make overlayfs use SB_I_NOUMASK too. - Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in IS_POSIXACL() if the kernel doesn't have support for it. This is a very old patch but it's only possible to do this now with the wider cleanup that was done. - Follow-up work on fake path handling from last cycle. Citing mostly from Amir: When overlayfs was first merged, overlayfs files of regular files and directories, the ones that are installed in file table, had a "fake" path, namely, f_path is the overlayfs path and f_inode is the "real" inode on the underlying filesystem. In v6.5, we took another small step by introducing of the backing_file container and the file_real_path() helper. This change allowed vfs and filesystem code to get the "real" path of an overlayfs backing file. With this change, we were able to make fsnotify work correctly and report events on the "real" filesystem objects that were accessed via overlayfs. This method works fine, but it still leaves the vfs vulnerable to new code that is not aware of files with fake path. A recent example is commit db1d1e8b9867 ("IMA: use vfs_getattr_nosec to get the i_version"). This commit uses direct referencing to f_path in IMA code that otherwise uses file_inode() and file_dentry() to reference the filesystem objects that it is measuring. This contains work to switch things around: instead of having filesystem code opt-in to get the "real" path, have generic code opt-in for the "fake" path in the few places that it is needed. Is it far more likely that new filesystems code that does not use the file_dentry() and file_real_path() helpers will end up causing crashes or averting LSM/audit rules if we keep the "fake" path exposed by default. This change already makes file_dentry() moot, but for now we did not change this helper just added a WARN_ON() in ovl_d_real() to catch if we have made any wrong assumptions. After the dust settles on this change, we can make file_dentry() a plain accessor and we can drop the inode argument to ->d_real(). - Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks like a small change but it really isn't and I would like to see everyone on their tippie toes for any possible bugs from this work. Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU for files since a very long time because of the nasty interactions between the SCM_RIGHTS file descriptor garbage collection. So extending it makes a lot of sense but it is a subtle change. There are almost no places that fiddle with file rcu semantics directly and the ones that did mess around with struct file internal under rcu have been made to stop doing that because it really was always dodgy. I forgot to put in the link tag for this change and the discussion in the commit so adding it into the merge message: https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com Cleanups: - Various smaller pipe cleanups including the removal of a spin lock that was only used to protect against writes without pipe_lock() from O_NOTIFICATION_PIPE aka watch queues. As that was never implemented remove the additional locking from pipe_write(). - Annotate struct watch_filter with the new __counted_by attribute. - Clarify do_unlinkat() cleanup so that it doesn't look like an extra iput() is done that would cause issues. - Simplify file cleanup when the file has never been opened. - Use module helper instead of open-coding it. - Predict error unlikely for stale retry. - Use WRITE_ONCE() for mount expiry field instead of just commenting that one hopes the compiler doesn't get smart. Fixes: - Fix readahead on block devices. - Fix writeback when layztime is enabled and inodes whose timestamp is the only thing that changed reside on wb->b_dirty_time. This caused excessively large zombie memory cgroup when lazytime was enabled as such inodes weren't handled fast enough. - Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups()" * tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (26 commits) file, i915: fix file reference for mmap_singleton() vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs chardev: Simplify usage of try_module_get() ovl: rely on SB_I_NOUMASK fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n fs: store real path instead of fake path in backing file f_path fs: create helper file_user_path() for user displayed mapped file path fs: get mnt_writers count for an open backing file's real path vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked vfs: predict the error in retry_estale as unlikely backing file: free directly vfs: fix readahead(2) on block devices io_uring: use files_lookup_fd_locked() file: convert to SLAB_TYPESAFE_BY_RCU vfs: shave work on failed file open fs: simplify misleading code to remove ambiguity regarding ihold()/iput() watch_queue: Annotate struct watch_filter with __counted_by fs/pipe: use spinlock in pipe_read() only if there is a watch_queue fs/pipe: remove unnecessary spinlock from pipe_write() ...
2023-10-27Merge tag 'powerpc-6.6-6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix boot crash with FLATMEM since set_ptes() introduction - Avoid calling arch_enter/leave_lazy_mmu() in set_ptes() Thanks to Aneesh Kumar K.V and Erhard Furtner. * tag 'powerpc-6.6-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes powerpc/mm: Fix boot crash with FLATMEM
2023-10-25powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptesAneesh Kumar K.V
With commit 9fee28baa601 ("powerpc: implement the new page table range API") we added set_ptes to powerpc architecture. The implementation included calling arch_enter/leave_lazy_mmu() calls. The patch removes the usage of arch_enter/leave_lazy_mmu() because set_pte is not supposed to be used when updating a pte entry. Powerpc architecture uses this rule to skip the expensive tlb invalidate which is not needed when you are setting up the pte for the first time. See commit 56eecdb912b5 ("mm: Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit") for more details The patch also makes sure we are not using the interface to update a valid/present pte entry by adding VM_WARN_ON check all the ptes we are setting up. Furthermore, we add a comment to set_pte_filter to clarify it can only update folio-related flags and cannot filter pfn specific details in pte filtering. Removal of arch_enter/leave_lazy_mmu() also will avoid nesting of these functions that are not supported. For ex: remap_pte_range() -> arch_enter_lazy_mmu() -> set_ptes() -> arch_enter_lazy_mmu() -> arch_leave_lazy_mmu() -> arch_leave_lazy_mmu() Fixes: 9fee28baa601 ("powerpc: implement the new page table range API") Signed-off-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231024143604.16749-1-aneesh.kumar@linux.ibm.com
2023-10-23powerpc/mm: Fix boot crash with FLATMEMMichael Ellerman
Erhard reported that his G5 was crashing with v6.6-rc kernels: mpic: Setting up HT PICs workarounds for U3/U4 BUG: Unable to handle kernel data access at 0xfeffbb62ffec65fe Faulting instruction address: 0xc00000000005dc40 Oops: Kernel access of bad area, sig: 11 [#1] BE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2 PowerMac Modules linked in: CPU: 0 PID: 0 Comm: swapper/0 Tainted: G T 6.6.0-rc3-PMacGS #1 Hardware name: PowerMac11,2 PPC970MP 0x440101 PowerMac NIP: c00000000005dc40 LR: c000000000066660 CTR: c000000000007730 REGS: c0000000022bf510 TRAP: 0380 Tainted: G T (6.6.0-rc3-PMacGS) MSR: 9000000000001032 <SF,HV,ME,IR,DR,RI> CR: 44004242 XER: 00000000 IRQMASK: 3 GPR00: 0000000000000000 c0000000022bf7b0 c0000000010c0b00 00000000000001ac GPR04: 0000000003c80000 0000000000000300 c0000000f20001ae 0000000000000300 GPR08: 0000000000000006 feffbb62ffec65ff 0000000000000001 0000000000000000 GPR12: 9000000000001032 c000000002362000 c000000000f76b80 000000000349ecd8 GPR16: 0000000002367ba8 0000000002367f08 0000000000000006 0000000000000000 GPR20: 00000000000001ac c000000000f6f920 c0000000022cd985 000000000000000c GPR24: 0000000000000300 00000003b0a3691d c0003e008030000e 0000000000000000 GPR28: c00000000000000c c0000000f20001ee feffbb62ffec65fe 00000000000001ac NIP hash_page_do_lazy_icache+0x50/0x100 LR __hash_page_4K+0x420/0x590 Call Trace: hash_page_mm+0x364/0x6f0 do_hash_fault+0x114/0x2b0 data_access_common_virt+0x198/0x1f0 --- interrupt: 300 at mpic_init+0x4bc/0x10c4 NIP: c000000002020a5c LR: c000000002020a04 CTR: 0000000000000000 REGS: c0000000022bf9f0 TRAP: 0300 Tainted: G T (6.6.0-rc3-PMacGS) MSR: 9000000000001032 <SF,HV,ME,IR,DR,RI> CR: 24004248 XER: 00000000 DAR: c0003e008030000e DSISR: 40000000 IRQMASK: 1 ... NIP mpic_init+0x4bc/0x10c4 LR mpic_init+0x464/0x10c4 --- interrupt: 300 pmac_setup_one_mpic+0x258/0x2dc pmac_pic_init+0x28c/0x3d8 init_IRQ+0x90/0x140 start_kernel+0x57c/0x78c start_here_common+0x1c/0x20 A bisect pointed to the breakage beginning with commit 9fee28baa601 ("powerpc: implement the new page table range API"). Analysis of the oops pointed to a struct page with a corrupted compound_head being loaded via page_folio() -> _compound_head() in hash_page_do_lazy_icache(). The access by the mpic code is to an MMIO address, so the expectation is that the struct page for that address would be initialised by init_unavailable_range(), as pointed out by Aneesh. Instrumentation showed that was not the case, which eventually lead to the realisation that pfn_valid() was returning false for that address, causing the struct page to not be initialised. Because the system is using FLATMEM, the version of pfn_valid() in memory_model.h is used: static inline int pfn_valid(unsigned long pfn) { ... return pfn >= pfn_offset && (pfn - pfn_offset) < max_mapnr; } Which relies on max_mapnr being initialised. Early in boot max_mapnr is zero meaning no PFNs are valid. max_mapnr is initialised in mem_init() called via: start_kernel() mm_core_init() # init/main.c:928 mem_init() But that is too late for the usage in init_unavailable_range() called via: start_kernel() setup_arch() # init/main.c:893 paging_init() free_area_init() init_unavailable_range() Although max_mapnr is currently set in mem_init(), the value is actually already available much earlier, as soon as mem_topology_setup() has completed, which is also before paging_init() is called. So move the initialisation there, which causes paging_init() to correctly initialise the struct page and fixes the bug. This bug seems to have been lurking for years, but went unnoticed because the pre-folio code was inspecting the uninitialised page->flags but not dereferencing it. Thanks to Erhard and Aneesh for help debugging. Reported-by: Erhard Furtner <erhard_f@mailbox.org> Closes: https://lore.kernel.org/all/20230929132750.3cd98452@yea/ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231023112500.1550208-1-mpe@ellerman.id.au
2023-10-23Merge tag 'v6.6-rc7' into sched/core, to pick up fixesIngo Molnar
Pick up recent sched/urgent fixes merged upstream. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-23BackMerge tag 'v6.6-rc7' into drm-nextDave Airlie
This is needed to add the msm pr which is based on a higher base. Signed-off-by: Dave Airlie <airlied@redhat.com>
2023-10-21Merge tag 'powerpc-6.6-5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix stale propagated yield_cpu in qspinlocks leading to lockups - Fix broken hugepages on some configs due to ARCH_FORCE_MAX_ORDER - Fix a spurious warning when copros are in use at exit time Thanks to Nicholas Piggin, Christophe Leroy, Nysal Jan K.A Sachin Sant, and Shrikanth Hegde. * tag 'powerpc-6.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/qspinlock: Fix stale propagated yield_cpu powerpc/64s/radix: Don't warn on copros in radix__tlb_flush() powerpc/mm: Allow ARCH_FORCE_MAX_ORDER up to 12
2023-10-19file: convert to SLAB_TYPESAFE_BY_RCUChristian Brauner
In recent discussions around some performance improvements in the file handling area we discussed switching the file cache to rely on SLAB_TYPESAFE_BY_RCU which allows us to get rid of call_rcu() based freeing for files completely. This is a pretty sensitive change overall but it might actually be worth doing. The main downside is the subtlety. The other one is that we should really wait for Jann's patch to land that enables KASAN to handle SLAB_TYPESAFE_BY_RCU UAFs. Currently it doesn't but a patch for this exists. With SLAB_TYPESAFE_BY_RCU objects may be freed and reused multiple times which requires a few changes. So it isn't sufficient anymore to just acquire a reference to the file in question under rcu using atomic_long_inc_not_zero() since the file might have already been recycled and someone else might have bumped the reference. In other words, callers might see reference count bumps from newer users. For this reason it is necessary to verify that the pointer is the same before and after the reference count increment. This pattern can be seen in get_file_rcu() and __files_get_rcu(). In addition, it isn't possible to access or check fields in struct file without first aqcuiring a reference on it. Not doing that was always very dodgy and it was only usable for non-pointer data in struct file. With SLAB_TYPESAFE_BY_RCU it is necessary that callers first acquire a reference under rcu or they must hold the files_lock of the fdtable. Failing to do either one of this is a bug. Thanks to Jann for pointing out that we need to ensure memory ordering between reallocations and pointer check by ensuring that all subsequent loads have a dependency on the second load in get_file_rcu() and providing a fixup that was folded into this patch. Cc: Jann Horn <jannh@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-18powerpc: implement arch_xor_unlock_is_negative_byte on 32-bitMatthew Wilcox (Oracle)
Simply remove the ifdef. The assembly is identical to that in the non-optimised case of test_and_clear_bits() on PPC32, and it's not clear to me how the PPC32 optimisation works, nor whether it would work for arch_xor_unlock_is_negative_byte(). If that optimisation would work, someone can implement it later, but this is more efficient than the implementation in filemap.c. Link: https://lkml.kernel.org/r/20231004165317.1061855-12-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18bitops: add xor_unlock_is_negative_byte()Matthew Wilcox (Oracle)
Replace clear_bit_and_unlock_is_negative_byte() with xor_unlock_is_negative_byte(). We have a few places that like to lock a folio, set a flag and unlock it again. Allow for the possibility of combining the latter two operations for efficiency. We are guaranteed that the caller holds the lock, so it is safe to unlock it with the xor. The caller must guarantee that nobody else will set the flag without holding the lock; it is not safe to do this with the PG_dirty flag, for example. Link: https://lkml.kernel.org/r/20231004165317.1061855-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-18spufs: convert to new timestamp accessorsJeff Layton
Convert to using the new inode timestamp accessor functions. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20231004185347.80880-1-jlayton@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-18powerpc/qspinlock: Fix stale propagated yield_cpuNicholas Piggin
yield_cpu is a sample of a preempted lock holder that gets propagated back through the queue. Queued waiters use this to yield to the preempted lock holder without continually sampling the lock word (which would defeat the purpose of MCS queueing by bouncing the cache line). The problem is that yield_cpu can become stale. It can take some time to be passed down the chain, and if any queued waiter gets preempted then it will cease to propagate the yield_cpu to later waiters. This can result in yielding to a CPU that no longer holds the lock, which is bad, but particularly if it is currently in H_CEDE (idle), then it appears to be preempted and some hypervisors (PowerVM) can cause very long H_CONFER latencies waiting for H_CEDE wakeup. This results in latency spikes and hard lockups on oversubscribed partitions with lock contention. This is a minimal fix. Before yielding to yield_cpu, sample the lock word to confirm yield_cpu is still the owner, and bail out of it is not. Thanks to a bunch of people who reported this and tracked down the exact problem using tracepoints and dispatch trace logs. Fixes: 28db61e207ea ("powerpc/qspinlock: allow propagation of yield CPU down the queue") Cc: stable@vger.kernel.org # v6.2+ Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reported-by: Laurent Dufour <ldufour@linux.ibm.com> Reported-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Debugged-by: "Nysal Jan K.A" <nysal@linux.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20231016124305.139923-2-npiggin@gmail.com
2023-10-18powerpc/64s/radix: Don't warn on copros in radix__tlb_flush()Michael Ellerman
Sachin reported a warning when running the inject-ra-err selftest: # selftests: powerpc/mce: inject-ra-err Disabling lock debugging due to kernel taint MCE: CPU19: machine check (Severe) Real address Load/Store (foreign/control memory) [Not recovered] MCE: CPU19: PID: 5254 Comm: inject-ra-err NIP: [0000000010000e48] MCE: CPU19: Initiator CPU MCE: CPU19: Unknown ------------[ cut here ]------------ WARNING: CPU: 19 PID: 5254 at arch/powerpc/mm/book3s64/radix_tlb.c:1221 radix__tlb_flush+0x160/0x180 CPU: 19 PID: 5254 Comm: inject-ra-err Kdump: loaded Tainted: G M E 6.6.0-rc3-00055-g9ed22ae6be81 #4 Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1030.20 (NH1030_058) hv:phyp pSeries ... NIP radix__tlb_flush+0x160/0x180 LR radix__tlb_flush+0x104/0x180 Call Trace: radix__tlb_flush+0xf4/0x180 (unreliable) tlb_finish_mmu+0x15c/0x1e0 exit_mmap+0x1a0/0x510 __mmput+0x60/0x1e0 exit_mm+0xdc/0x170 do_exit+0x2bc/0x5a0 do_group_exit+0x4c/0xc0 sys_exit_group+0x28/0x30 system_call_exception+0x138/0x330 system_call_vectored_common+0x15c/0x2ec And bisected it to commit e43c0a0c3c28 ("powerpc/64s/radix: combine final TLB flush and lazy tlb mm shootdown IPIs"), which added a warning in radix__tlb_flush() if mm->context.copros is still elevated. However it's possible for the copros count to be elevated if a process exits without first closing file descriptors that are associated with a copro, eg. VAS. If the process exits with a VAS file still open, the release callback is queued up for exit_task_work() via: exit_files() put_files_struct() close_files() filp_close() fput() And called via: exit_task_work() ____fput() __fput() file->f_op->release(inode, file) coproc_release() vas_user_win_ops->close_win() vas_deallocate_window() mm_context_remove_vas_window() mm_context_remove_copro() But that is after exit_mm() has been called from do_exit() and triggered the warning. Fix it by dropping the warning, and always calling __flush_all_mm(). In the normal case of no copros, that will result in a call to _tlbiel_pid(mm->context.id, RIC_FLUSH_ALL) just as the current code does. If the copros count is elevated then it will cause a global flush, which should flush translations from any copros. Note that the process table entry was cleared in arch_exit_mmap(), so copros should not be able to fetch any new translations. Fixes: e43c0a0c3c28 ("powerpc/64s/radix: combine final TLB flush and lazy tlb mm shootdown IPIs") Reported-by: Sachin Sant <sachinp@linux.ibm.com> Closes: https://lore.kernel.org/all/A8E52547-4BF1-47CE-8AEA-BC5A9D7E3567@linux.ibm.com/ Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Tested-by: Sachin Sant <sachinp@linux.ibm.com> Link: https://msgid.link/20231017121527.1574104-1-mpe@ellerman.id.au
2023-10-17powerpc/pseries: PLPKS SED Opal keystore supportGreg Joyce
Define operations for SED Opal to read/write keys from POWER LPAR Platform KeyStore(PLPKS). This allows non-volatile storage of SED Opal keys. Signed-off-by: Greg Joyce <gjoyce@linux.vnet.ibm.com> Reviewed-by: Jonathan Derrick <jonathan.derrick@linux.dev> Link: https://lore.kernel.org/r/20231004201957.1451669-4-gjoyce@linux.vnet.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-15Merge tag 'powerpc-6.6-4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix softlockup/crash when using hcall tracing - Fix pte_access_permitted() for PAGE_NONE on 8xx - Fix inverted pte_young() test in __ptep_test_and_clear_young() on 64-bit BookE - Fix unhandled math emulation exception on 85xx - Fix kernel crash on syscall return on 476 Thanks to Athira Rajeev, Christophe Leroy, Eddie James, and Naveen N Rao. * tag 'powerpc-6.6-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/47x: Fix 47x syscall return crash powerpc/85xx: Fix math emulation exception powerpc/64e: Fix wrong test in __ptep_test_and_clear_young() powerpc/8xx: Fix pte_access_permitted() for PAGE_NONE powerpc/pseries: Remove unused r0 in the hcall tracing code powerpc/pseries: Fix STK_PARAM access in the hcall tracing code
2023-10-15powerpc/mm: Allow ARCH_FORCE_MAX_ORDER up to 12Michael Ellerman
Christophe reported that the change to ARCH_FORCE_MAX_ORDER to limit the range to 10 had broken his ability to configure hugepages: # echo 1 > /sys/kernel/mm/hugepages/hugepages-8192kB/nr_hugepages sh: write error: Invalid argument Several of the powerpc defconfigs previously set the ARCH_FORCE_MAX_ORDER value to 12, via the definition in arch/powerpc/configs/fsl-emb-nonhw.config, used by: mpc85xx_defconfig mpc85xx_smp_defconfig corenet32_smp_defconfig corenet64_smp_defconfig mpc86xx_defconfig mpc86xx_smp_defconfig Fix it by increasing the allowed range to 12 to restore the previous behaviour. Fixes: 358e526a1648 ("powerpc/mm: Reinstate ARCH_FORCE_MAX_ORDER ranges") Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu> Closes: https://lore.kernel.org/all/8011d806-5b30-bf26-2bfe-a08c39d57e20@csgroup.eu/ Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230824122849.942072-1-mpe@ellerman.id.au
2023-10-12sched/topology: Rename 'DIE' domain to 'PKG'Peter Zijlstra
While reworking the x86 topology code Thomas tripped over creating a 'DIE' domain for the package mask. :-) Since these names are CONFIG_SCHED_DEBUG=y only, rename them to make the name less ambiguous. [ Shrikanth Hegde: rename on s390 as well. ] [ Valentin Schneider: also rename it in the comments. ] [ mingo: port to recent kernels & find all remaining occurances. ] Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Valentin Schneider <vschneid@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Gautham R. Shenoy <gautham.shenoy@amd.com> Acked-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20230712141056.GI3100107@hirez.programming.kicks-ass.net
2023-10-12fbdev: Replace fb_pgprotect() with pgprot_framebuffer()Thomas Zimmermann
Rename the fbdev mmap helper fb_pgprotect() to pgprot_framebuffer(). The helper sets VMA page-access flags for framebuffers in device I/O memory. Also clean up the helper's parameters and return value. Instead of the VMA instance, pass the individial parameters separately: existing page-access flags, the VMAs start and end addresses and the offset in the underlying device memory rsp file. Return the new page-access flags. These changes align pgprot_framebuffer() with other pgprot_() functions. v4: * fix commit message (Christophe) v3: * rename fb_pgprotect() to pgprot_framebuffer() (Arnd) Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k Reviewed-by: Javier Martinez Canillas <javierm@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230922080636.26762-3-tzimmermann@suse.de
2023-10-12fbdev: Avoid file argument in fb_pgprotect()Thomas Zimmermann
Only PowerPC's fb_pgprotect() needs the file argument, although the implementation in either phys_mem_access_prot() or pci_phys_mem_access_prot() does not use it. Pass NULL to the internal helper in preparation of further updates. A later patch will remove the file parameter from fb_pgprotect(). While at it, replace the shift operation with PHYS_PFN(). v5: * state function names in commit description (Javier) Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Javier Martinez Canillas <javierm@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20230922080636.26762-2-tzimmermann@suse.de
2023-10-11powerpc/47x: Fix 47x syscall return crashMichael Ellerman
Eddie reported that newer kernels were crashing during boot on his 476 FSP2 system: kernel tried to execute user page (b7ee2000) - exploit attempt? (uid: 0) BUG: Unable to handle kernel instruction fetch Faulting instruction address: 0xb7ee2000 Oops: Kernel access of bad area, sig: 11 [#1] BE PAGE_SIZE=4K FSP-2 Modules linked in: CPU: 0 PID: 61 Comm: mount Not tainted 6.1.55-d23900f.ppcnf-fsp2 #1 Hardware name: ibm,fsp2 476fpe 0x7ff520c0 FSP-2 NIP:  b7ee2000 LR: 8c008000 CTR: 00000000 REGS: bffebd83 TRAP: 0400   Not tainted (6.1.55-d23900f.ppcnf-fs p2) MSR:  00000030 <IR,DR>  CR: 00001000  XER: 20000000 GPR00: c00110ac bffebe63 bffebe7e bffebe88 8c008000 00001000 00000d12 b7ee2000 GPR08: 00000033 00000000 00000000 c139df10 48224824 1016c314 10160000 00000000 GPR16: 10160000 10160000 00000008 00000000 10160000 00000000 10160000 1017f5b0 GPR24: 1017fa50 1017f4f0 1017fa50 1017f740 1017f630 00000000 00000000 1017f4f0 NIP [b7ee2000] 0xb7ee2000 LR [8c008000] 0x8c008000 Call Trace: Instruction dump: XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX ---[ end trace 0000000000000000 ]--- The problem is in ret_from_syscall where the check for icache_44x_need_flush is done. When the flush is needed the code jumps out-of-line to do the flush, and then intends to jump back to continue the syscall return. However the branch back to label 1b doesn't return to the correct location, instead branching back just prior to the return to userspace, causing bogus register values to be used by the rfi. The breakage was introduced by commit 6f76a01173cc ("powerpc/syscall: implement system call entry/exit logic in C for PPC32") which inadvertently removed the "1" label and reused it elsewhere. Fix it by adding named local labels in the correct locations. Note that the return label needs to be outside the ifdef so that CONFIG_PPC_47x=n compiles. Fixes: 6f76a01173cc ("powerpc/syscall: implement system call entry/exit logic in C for PPC32") Cc: stable@vger.kernel.org # v5.12+ Reported-by: Eddie James <eajames@linux.ibm.com> Tested-by: Eddie James <eajames@linux.ibm.com> Link: https://lore.kernel.org/linuxppc-dev/fdaadc46-7476-9237-e104-1d2168526e72@linux.ibm.com/ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Link: https://msgid.link/20231010114750.847794-1-mpe@ellerman.id.au
2023-10-10powerpc: Remove now superfluous sentinel element from ctl_table arraysJoel Granados
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) Remove sentinel from powersave_nap_ctl_table and nmi_wd_lpm_factor_ctl_table. This removal is safe because register_sysctl implicitly uses ARRAY_SIZE() in addition to checking for the sentinel. Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-10-10docs: move powerpc under archCosta Shulyupin
and fix all in-tree references. Architecture-specific documentation is being moved into Documentation/arch/ as a way of cleaning up the top-level documentation directory and making the docs hierarchy more closely match the source hierarchy. Signed-off-by: Costa Shulyupin <costa.shul@redhat.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Jonathan Corbet <corbet@lwn.net> Link: https://lore.kernel.org/r/20230826165737.2101199-1-costa.shul@redhat.com
2023-10-10powerpc/85xx: Fix math emulation exceptionChristophe Leroy
Booting mpc85xx_defconfig kernel on QEMU leads to: Bad trap at PC: fe9bab0, SR: 2d000, vector=800 awk[82]: unhandled trap (5) at 0 nip fe9bab0 lr fe9e01c code 5 in libc-2.27.so[fe5a000+17a000] awk[82]: code: 3aa00000 3a800010 4bffe03c 9421fff0 7ca62b78 38a00000 93c10008 83c10008 awk[82]: code: 38210010 4bffdec8 9421ffc0 7c0802a6 <fc00048e> d8010008 4815190d 93810030 Trace/breakpoint trap WARNING: no useful console This is because allthough CONFIG_MATH_EMULATION is selected, Exception 800 calls unknown_exception(). Call emulation_assist_interrupt() instead. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/066caa6d9480365da9b8ed83692d7101e10ac5f8.1695657339.git.christophe.leroy@csgroup.eu
2023-10-09Merge tag 'v6.6-rc5' into locking/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2023-10-09powerpc/64e: Fix wrong test in __ptep_test_and_clear_young()Christophe Leroy
Commit 45201c879469 ("powerpc/nohash: Remove hash related code from nohash headers.") replaced: if ((pte_val(*ptep) & (_PAGE_ACCESSED | _PAGE_HASHPTE)) == 0) return 0; By: if (pte_young(*ptep)) return 0; But it should be: if (!pte_young(*ptep)) return 0; Fix it. Fixes: 45201c879469 ("powerpc/nohash: Remove hash related code from nohash headers.") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/8bb7f06494e21adada724ede47a4c3d97e879d40.1695659959.git.christophe.leroy@csgroup.eu
2023-10-09powerpc/8xx: Fix pte_access_permitted() for PAGE_NONEChristophe Leroy
On 8xx, PAGE_NONE is handled by setting _PAGE_NA instead of clearing _PAGE_USER. But then pte_user() returns 1 also for PAGE_NONE. As _PAGE_NA prevent reads, add a specific version of pte_read() that returns 0 when _PAGE_NA is set instead of always returning 1. Fixes: 351750331fc1 ("powerpc/mm: Introduce _PAGE_NA") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/57bcfbe578e43123f9ed73e040229b80f1ad56ec.1695659959.git.christophe.leroy@csgroup.eu
2023-10-06arch: Reserve map_shadow_stack() syscall number for all architecturesSohil Mehta
commit c35559f94ebc ("x86/shstk: Introduce map_shadow_stack syscall") recently added support for map_shadow_stack() but it is limited to x86 only for now. There is a possibility that other architectures (namely, arm64 and RISC-V), that are implementing equivalent support for shadow stacks, might need to add support for it. Independent of that, reserving arch-specific syscall numbers in the syscall tables of all architectures is good practice and would help avoid future conflicts. map_shadow_stack() is marked as a conditional syscall in sys_ni.c. Adding it to the syscall tables of other architectures is harmless and would return ENOSYS when exercised. Note, map_shadow_stack() was assigned #453 during the merge process since #452 was taken by fchmodat2(). For Powerpc, map it to sys_ni_syscall() as is the norm for Powerpc syscall tables. For Alpha, map_shadow_stack() takes up #563 as Alpha still diverges from the common syscall numbering system in the other architectures. Link: https://lore.kernel.org/lkml/20230515212255.GA562920@debug.ba.rivosinc.com/ Link: https://lore.kernel.org/lkml/b402b80b-a7c6-4ef0-b977-c0f5f582b78a@sirena.org.uk/ Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-10-04rcu: Standardize explicit CPU-hotplug callsFrederic Weisbecker
rcu_report_dead() and rcutree_migrate_callbacks() have their headers in rcupdate.h while those are pure rcutree calls, like the other CPU-hotplug functions. Also rcu_cpu_starting() and rcu_report_dead() have different naming conventions while they mirror each other's effects. Fix the headers and propose a naming that relates both functions and aligns with the prefix of other rcutree CPU-hotplug functions. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
2023-10-04crash_core: change the prototype of function parse_crashkernel()Baoquan He
Add two parameters 'low_size' and 'high' to function parse_crashkernel(), later crashkernel=,high|low parsing will be added. Make adjustments in all call sites of parse_crashkernel() in arch. Link: https://lkml.kernel.org/r/20230914033142.676708-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Zhen Lei <thunder.leizhen@huawei.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chen Jiahao <chenjiahao16@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-10-04locking/local, arch: Rewrite local_add_unless() as a static inline functionUros Bizjak
Rewrite local_add_unless() as a static inline function with boolean return value, similar to the arch_atomic_add_unless() arch fallbacks. The function is currently unused. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230731084458.28096-1-ubizjak@gmail.com
2023-10-03syscalls: Cleanup references to sys_lookup_dcookie()Sohil Mehta
commit 'be65de6b03aa ("fs: Remove dcookies support")' removed the syscall definition for lookup_dcookie. However, syscall tables still point to the old sys_lookup_dcookie() definition. Update syscall tables of all architectures to directly point to sys_ni_syscall() instead. Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Namhyung Kim <namhyung@kernel.org> # for perf Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-10-01Merge tag 'mm-hotfixes-stable-2023-10-01-08-34' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Fourteen hotfixes, eleven of which are cc:stable. The remainder pertain to issues which were introduced after 6.5" * tag 'mm-hotfixes-stable-2023-10-01-08-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: Crash: add lock to serialize crash hotplug handling selftests/mm: fix awk usage in charge_reserved_hugetlb.sh and hugetlb_reparenting_test.sh that may cause error mm: mempolicy: keep VMA walk if both MPOL_MF_STRICT and MPOL_MF_MOVE are specified mm/damon/vaddr-test: fix memory leak in damon_do_test_apply_three_regions() mm, memcg: reconsider kmem.limit_in_bytes deprecation mm: zswap: fix potential memory corruption on duplicate store arm64: hugetlb: fix set_huge_pte_at() to work with all swap entries mm: hugetlb: add huge page size param to set_huge_pte_at() maple_tree: add MAS_UNDERFLOW and MAS_OVERFLOW states maple_tree: add mas_is_active() to detect in-tree walks nilfs2: fix potential use after free in nilfs_gccache_submit_read_data() mm: abstract moving to the next PFN mm: report success more often from filemap_map_folio_range() fs: binfmt_elf_efpic: fix personality for ELF-FDPIC
2023-09-30powerpc/pseries: Remove unused r0 in the hcall tracing codeAthira Rajeev
In the plpar_hcall trace code, currently we use r0 to store the value of r4. But this value is not used subsequently in the code. Hence remove this unused save to r0 in plpar_hcall and plpar_hcall9 Suggested-by: Naveen N Rao <naveen@kernel.org> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230929172337.7906-2-atrajeev@linux.vnet.ibm.com
2023-09-30powerpc/pseries: Fix STK_PARAM access in the hcall tracing codeAthira Rajeev
In powerpc pseries system, below behaviour is observed while enabling tracing on hcall: # cd /sys/kernel/debug/tracing/ # cat events/powerpc/hcall_exit/enable 0 # echo 1 > events/powerpc/hcall_exit/enable # ls -bash: fork: Bad address Above is from power9 lpar with latest kernel. Past this, softlockup is observed. Initially while attempting via perf_event_open to use "PERF_TYPE_TRACEPOINT", kernel panic was observed. perf config used: ================ memset(&pe[1],0,sizeof(struct perf_event_attr)); pe[1].type=PERF_TYPE_TRACEPOINT; pe[1].size=96; pe[1].config=0x26ULL; /* 38 raw_syscalls/sys_exit */ pe[1].sample_type=0; /* 0 */ pe[1].read_format=PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_TOTAL_TIME_RUNNING|PERF_FORMAT_ID|PERF_FORMAT_GROUP|0x10ULL; /* 1f */ pe[1].inherit=1; pe[1].precise_ip=0; /* arbitrary skid */ pe[1].wakeup_events=0; pe[1].bp_type=HW_BREAKPOINT_EMPTY; pe[1].config1=0x1ULL; Kernel panic logs: ================== Kernel attempted to read user page (8) - exploit attempt? (uid: 0) BUG: Kernel NULL pointer dereference on read at 0x00000008 Faulting instruction address: 0xc0000000004c2814 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: nfnetlink bonding tls rfkill sunrpc dm_service_time dm_multipath pseries_rng xts vmx_crypto xfs libcrc32c sd_mod t10_pi crc64_rocksoft crc64 sg ibmvfc scsi_transport_fc ibmveth dm_mirror dm_region_hash dm_log dm_mod fuse CPU: 0 PID: 1431 Comm: login Not tainted 6.4.0+ #1 Hardware name: IBM,8375-42A POWER9 (raw) 0x4e0202 0xf000005 of:IBM,FW950.30 (VL950_892) hv:phyp pSeries NIP page_remove_rmap+0x44/0x320 LR wp_page_copy+0x384/0xec0 Call Trace: 0xc00000001416e400 (unreliable) wp_page_copy+0x384/0xec0 __handle_mm_fault+0x9d4/0xfb0 handle_mm_fault+0xf0/0x350 ___do_page_fault+0x48c/0xc90 hash__do_page_fault+0x30/0x70 do_hash_fault+0x1a4/0x330 data_access_common_virt+0x198/0x1f0 --- interrupt: 300 at 0x7fffae971abc git bisect tracked this down to below commit: 'commit baa49d81a94b ("powerpc/pseries: hvcall stack frame overhead")' This commit changed STACK_FRAME_OVERHEAD (112 ) to STACK_FRAME_MIN_SIZE (32 ) since 32 bytes is the minimum size for ELFv2 stack. With the latest kernel, when running on ELFv2, STACK_FRAME_MIN_SIZE is used to allocate stack size. During plpar_hcall_trace, first call is made to HCALL_INST_PRECALL which saves the registers and allocates new stack frame. In the plpar_hcall_trace code, STK_PARAM is accessed at two places. 1. To save r4: std r4,STK_PARAM(R4)(r1) 2. To access r4 back: ld r12,STK_PARAM(R4)(r1) HCALL_INST_PRECALL precall allocates a new stack frame. So all the stack parameter access after the precall, needs to be accessed with +STACK_FRAME_MIN_SIZE. So the store instruction should be: std r4,STACK_FRAME_MIN_SIZE+STK_PARAM(R4)(r1) If the "std" is not updated with STACK_FRAME_MIN_SIZE, we will end up with overwriting stack contents and cause corruption. But instead of updating 'std', we can instead remove it since HCALL_INST_PRECALL already saves it to the correct location. similarly load instruction should be: ld r12,STACK_FRAME_MIN_SIZE+STK_PARAM(R4)(r1) Fix the load instruction to correctly access the stack parameter with +STACK_FRAME_MIN_SIZE and remove the store of r4 since the precall saves it correctly. Cc: stable@vger.kernel.org # v6.2+ Fixes: baa49d81a94b ("powerpc/pseries: hvcall stack frame overhead") Co-developed-by: Naveen N Rao <naveen@kernel.org> Signed-off-by: Naveen N Rao <naveen@kernel.org> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230929172337.7906-1-atrajeev@linux.vnet.ibm.com
2023-09-29mm: hugetlb: add huge page size param to set_huge_pte_at()Ryan Roberts
Patch series "Fix set_huge_pte_at() panic on arm64", v2. This series fixes a bug in arm64's implementation of set_huge_pte_at(), which can result in an unprivileged user causing a kernel panic. The problem was triggered when running the new uffd poison mm selftest for HUGETLB memory. This test (and the uffd poison feature) was merged for v6.5-rc7. Ideally, I'd like to get this fix in for v6.6 and I've cc'ed stable (correctly this time) to get it backported to v6.5, where the issue first showed up. Description of Bug ================== arm64's huge pte implementation supports multiple huge page sizes, some of which are implemented in the page table with multiple contiguous entries. So set_huge_pte_at() needs to work out how big the logical pte is, so that it can also work out how many physical ptes (or pmds) need to be written. It previously did this by grabbing the folio out of the pte and querying its size. However, there are cases when the pte being set is actually a swap entry. But this also used to work fine, because for huge ptes, we only ever saw migration entries and hwpoison entries. And both of these types of swap entries have a PFN embedded, so the code would grab that and everything still worked out. But over time, more calls to set_huge_pte_at() have been added that set swap entry types that do not embed a PFN. And this causes the code to go bang. The triggering case is for the uffd poison test, commit 99aa77215ad0 ("selftests/mm: add uffd unit test for UFFDIO_POISON"), which causes a PTE_MARKER_POISONED swap entry to be set, coutesey of commit 8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") - added in v6.5-rc7. Although review shows that there are other call sites that set PTE_MARKER_UFFD_WP (which also has no PFN), these don't trigger on arm64 because arm64 doesn't support UFFD WP. If CONFIG_DEBUG_VM is enabled, we do at least get a BUG(), but otherwise, it will dereference a bad pointer in page_folio(): static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry) { VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry)); return page_folio(pfn_to_page(swp_offset_pfn(entry))); } Fix === The simplest fix would have been to revert the dodgy cleanup commit 18f3962953e4 ("mm: hugetlb: kill set_huge_swap_pte_at()"), but since things have moved on, this would have required an audit of all the new set_huge_pte_at() call sites to see if they should be converted to set_huge_swap_pte_at(). As per the original intent of the change, it would also leave us open to future bugs when people invariably get it wrong and call the wrong helper. So instead, I've added a huge page size parameter to set_huge_pte_at(). This means that the arm64 code has the size in all cases. It's a bigger change, due to needing to touch the arches that implement the function, but it is entirely mechanical, so in my view, low risk. I've compile-tested all touched arches; arm64, parisc, powerpc, riscv, s390, sparc (and additionally x86_64). I've additionally booted and run mm selftests against arm64, where I observe the uffd poison test is fixed, and there are no other regressions. This patch (of 2): In order to fix a bug, arm64 needs to be told the size of the huge page for which the pte is being set in set_huge_pte_at(). Provide for this by adding an `unsigned long sz` parameter to the function. This follows the same pattern as huge_pte_clear(). This commit makes the required interface modifications to the core mm as well as all arches that implement this function (arm64, parisc, powerpc, riscv, s390, sparc). The actual arm64 bug will be fixed in a separate commit. No behavioral changes intended. Link: https://lkml.kernel.org/r/20230922115804.2043771-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20230922115804.2043771-2-ryan.roberts@arm.com Fixes: 8a13897fb0da ("mm: userfaultfd: support UFFDIO_POISON for hugetlbfs") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> [powerpc 8xx] Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> [vmalloc change] Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> [6.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-09-22hardening: Provide Kconfig fragments for basic optionsKees Cook
Inspired by Salvatore Mesoraca's earlier[1] efforts to provide some in-tree guidance for kernel hardening Kconfig options, add a new fragment named "hardening-basic.config" (along with some arch-specific fragments) that enable a basic set of kernel hardening options that have the least (or no) performance impact and remove a reasonable set of legacy APIs. Using this fragment is as simple as running "make hardening.config". More extreme fragments can be added[2] in the future to cover all the recognized hardening options, and more per-architecture files can be added too. For now, document the fragments directly via comments. Perhaps .rst documentation can be generated from them in the future (rather than the other way around). [1] https://lore.kernel.org/kernel-hardening/1536516257-30871-1-git-send-email-s.mesoraca16@gmail.com/ [2] https://github.com/KSPP/linux/issues/14 Cc: Salvatore Mesoraca <s.mesoraca16@gmail.com> Cc: x86@kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-doc@vger.kernel.org Cc: linux-kbuild@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2023-09-22powerpc/stacktrace: Fix arch_stack_walk_reliable()Michael Ellerman
The changes to copy_thread() made in commit eed7c420aac7 ("powerpc: copy_thread differentiate kthreads and user mode threads") inadvertently broke arch_stack_walk_reliable() because it has knowledge of the stack layout. Fix it by changing the condition to match the new logic in copy_thread(). The changes make the comments about the stack layout incorrect, rather than rephrasing them just refer the reader to copy_thread(). Also the comment about the stack backchain is no longer true, since commit edbd0387f324 ("powerpc: copy_thread add a back chain to the switch stack frame"), so remove that as well. Fixes: eed7c420aac7 ("powerpc: copy_thread differentiate kthreads and user mode threads") Reported-by: Joe Lawrence <joe.lawrence@redhat.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20230921232441.1181843-1-mpe@ellerman.id.au
2023-09-21futex: Add sys_futex_requeue()peterz@infradead.org
Finish off the 'simple' futex2 syscall group by adding sys_futex_requeue(). Unlike sys_futex_{wait,wake}() its arguments are too numerous to fit into a regular syscall. As such, use struct futex_waitv to pass the 'source' and 'destination' futexes to the syscall. This syscall implements what was previously known as FUTEX_CMP_REQUEUE and uses {val, uaddr, flags} for source and {uaddr, flags} for destination. This design explicitly allows requeueing between different types of futex by having a different flags word per uaddr. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/r/20230921105248.511860556@noisy.programming.kicks-ass.net