path: root/arch/riscv/include/asm
Age | Commit message | Author
2024-10-01 | riscv: Fix kernel stack size when KASAN is enabled | Alexandre Ghiti
We use Kconfig to select the kernel stack size, doubling the default size if KASAN is enabled.

But that actually only works if KASAN is selected from the beginning, meaning that if KASAN config is added later (for example using menuconfig), CONFIG_THREAD_SIZE_ORDER won't be updated, keeping the default size, which is not enough for KASAN as reported in [1].

So fix this by moving the logic to compute the right kernel stack into a header.

Fixes: a7555f6b62e7 ("riscv: stack: Add config of thread stack size")
Reported-by: syzbot+ba9eac24453387a9d502@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000eb301906222aadc2@google.com/ [1]
Cc: stable@vger.kernel.org
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240917150328.59831-1-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
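The idea of computing the stack size in a header, rather than freezing it in Kconfig, can be sketched roughly as follows. This is a userspace model only, not the code from the patch: `MODEL_PAGE_SIZE`, `model_thread_size_order()` and `model_thread_size()` are hypothetical names, and the base order is passed as a parameter instead of coming from CONFIG_THREAD_SIZE_ORDER.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's PAGE_SIZE (assumed 4 KiB here). */
#define MODEL_PAGE_SIZE 4096UL

/*
 * Derive the effective stack order from the configured base order plus a
 * KASAN adjustment that is evaluated every time the code is built, so a
 * KASAN option enabled later is still taken into account.
 */
static unsigned int model_thread_size_order(unsigned int base_order,
                                            int kasan_enabled)
{
    /* One extra order doubles the stack size. */
    unsigned int kasan_order = kasan_enabled ? 1 : 0;

    assert(base_order < 16); /* keep the shift below well defined */
    return base_order + kasan_order;
}

/* Stack size in bytes for a given base order and KASAN setting. */
static unsigned long model_thread_size(unsigned int base_order,
                                       int kasan_enabled)
{
    return MODEL_PAGE_SIZE << model_thread_size_order(base_order,
                                                      kasan_enabled);
}
```

With a base order of 2, the model yields a 16 KiB stack without KASAN and a 32 KiB stack with it, independent of when the KASAN option was switched on.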
2024-09-24 | Merge tag 'riscv-for-linus-6.12-mw1' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V updates from Palmer Dabbelt:
- Support using Zkr to seed KASLR
- Support IPI-triggered CPU backtracing
- Support for generic CPU vulnerabilities reporting to userspace
- A few cleanups for missing licenses
- The size limit on the XIP kernel has been removed
- Support for tracing userspace stacks
- Support for the Svvptc extension
- Various cleanups and fixes throughout the tree

* tag 'riscv-for-linus-6.12-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (47 commits)
  crash: Fix riscv64 crash memory reserve dead loop
  perf/riscv-sbi: Add platform specific firmware event handling
  tools: Optimize ring buffer for riscv
  tools: Add riscv barrier implementation
  RISC-V: Don't have MAX_PHYSMEM_BITS exceed phys_addr_t
  ACPI: NUMA: initialize all values of acpi_early_node_map to NUMA_NO_NODE
  riscv: Enable bitops instrumentation
  riscv: Omit optimized string routines when using KASAN
  ACPI: RISCV: Make acpi_numa_get_nid() to be static
  riscv: Randomize lower bits of stack address
  selftests: riscv: Allow mmap test to compile on 32-bit
  riscv: Make riscv_isa_vendor_ext_andes array static
  riscv: Use LIST_HEAD() to simplify code
  riscv: defconfig: Disable RZ/Five peripheral support
  RISC-V: Implement kgdb_roundup_cpus() to enable future NMI Roundup
  riscv: avoid Imbalance in RAS
  riscv: cacheinfo: Add back init_cache_level() function
  riscv: Remove unused _TIF_WORK_MASK
  drivers/perf: riscv: Remove redundant macro check
  riscv: define ILLEGAL_POINTER_VALUE for 64bit
  ...
2024-09-21 | Merge tag 'mm-stable-2024-09-20-02-31' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:
"Along with the usual shower of singleton patches, notable patch series in this pull request are:
- "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds consistency to the APIs and behaviour of these two core allocation functions. This also simplifies/enables Rustification.
- "Some cleanups for shmem" from Baolin Wang. No functional changes - more code reuse, better function naming, logic simplifications.
- "mm: some small page fault cleanups" from Josef Bacik. No functional changes - code cleanups only.
- "Various memory tiering fixes" from Zi Yan. A small fix and a little cleanup.
- "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and simplifications and .text shrinkage.
- "Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt. This is a feature, it adds new fields to /proc/vmstat such as
    $ grep kstack /proc/vmstat
    kstack_1k 3
    kstack_2k 188
    kstack_4k 11391
    kstack_8k 243
    kstack_16k 0
  which tells us that 11391 processes used 4k of stack while none at all used 16k. Useful for some system tuning things, but particularly useful for "the dynamic kernel stack project".
- "kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov. Teaches kmemleak to detect leakage of percpu memory.
- "mm: memcg: page counters optimizations" from Roman Gushchin. "3 independent small optimizations of page counters".
- "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David Hildenbrand. Improves PTE/PMD splitlock detection, makes powerpc/8xx work correctly by design rather than by accident.
- "mm: remove arch_make_page_accessible()" from David Hildenbrand. Some folio conversions which make arch_make_page_accessible() unneeded.
- "mm, memcg: cg2 memory{.swap,}.peak write handlers" from David Finkel. Cleans up and fixes our handling of the resetting of the cgroup/process peak-memory-use detector.
- "Make core VMA operations internal and testable" from Lorenzo Stoakes. Rationalization and encapsulation of the VMA manipulation APIs. With a view to better enable testing of the VMA functions, even from a userspace-only harness.
- "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix issues in the zswap global shrinker, resulting in improved performance.
- "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill in some missing info in /proc/zoneinfo.
- "mm: replace follow_page() by folio_walk" from David Hildenbrand. Code cleanups and rationalizations (conversion to folio_walk()) resulting in the removal of follow_page().
- "improving dynamic zswap shrinker protection scheme" from Nhat Pham. Some tuning to improve zswap's dynamic shrinker. Significant reductions in swapin and improvements in performance are shown.
- "mm: Fix several issues with unaccepted memory" from Kirill Shutemov. Improvements to the new unaccepted memory feature.
- "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on DAX PUDs. This was missing, although nobody seems to have noticed yet.
- "Introduce a store type enum for the Maple tree" from Sidhartha Kumar. Cleanups and modest performance improvements for the maple tree library code.
- "memcg: further decouple v1 code from v2" from Shakeel Butt. Move more cgroup v1 remnants away from the v2 memcg code.
- "memcg: initiate deprecation of v1 features" from Shakeel Butt. Adds various warnings telling users that memcg v1 features are deprecated.
- "mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li. Greatly improves the success rate of the mTHP swap allocation.
- "mm: introduce numa_memblks" from Mike Rapoport. Moves various disparate per-arch implementations of numa_memblk code into generic code.
- "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly improves the performance of munmap() of swap-filled ptes.
- "support large folio swap-out and swap-in for shmem" from Baolin Wang. With this series we no longer split shmem large folios into single-page folios when swapping out shmem.
- "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice performance improvements and code reductions for gigantic folios.
- "support shmem mTHP collapse" from Baolin Wang. Adds support for khugepaged's collapsing of shmem mTHP folios.
- "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect() performance regression due to the addition of mseal().
- "Increase the number of bits available in page_type" from Matthew Wilcox. Increases the number of bits available in page_type!
- "Simplify the page flags a little" from Matthew Wilcox. Many legacy page flags are now folio flags, so the page-based flags and their accessors/mutators can be removed.
- "mm: store zero pages to be swapped out in a bitmap" from Usama Arif. An optimization which permits us to avoid writing/reading zero-filled zswap pages to backing store.
- "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race window which occurs when a MAP_FIXED operation is occurring during an unrelated vma tree walk.
- "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of the vma_merge() functionality, making it cleaner, more testable and better tested.
- "misc fixups for DAMON {self,kunit} tests" from SeongJae Park. Minor fixups of DAMON selftests and kunit tests.
- "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang. Code cleanups and folio conversions.
- "Shmem mTHP controls and stats improvements" from Ryan Roberts. Cleanups for shmem controls and stats.
- "mm: count the number of anonymous THPs per size" from Barry Song. Expose additional anon THP stats to userspace for improved tuning.
- "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio conversions and removal of now-unused page-based APIs.
- "replace per-quota region priorities histogram buffer with per-context one" from SeongJae Park. DAMON histogram rationalization.
- "Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae Park. DAMON documentation updates.
- "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve related doc and warn" from Jason Wang: fixes usage of page allocator __GFP_NOFAIL and GFP_ATOMIC flags.
- "mm: split underused THPs" from Yu Zhao. Improve THP=always policy. This was overprovisioning THPs in sparsely accessed memory areas.
- "zram: introduce custom comp backends API" from Sergey Senozhatsky. Add support for zram run-time compression algorithm tuning.
- "mm: Care about shadow stack guard gap when getting an unmapped area" from Mark Brown. Fix up the various arch_get_unmapped_area() implementations to better respect guard areas.
- "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability of mem_cgroup_iter() and various code cleanups.
- "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge pfnmap support.
- "resource: Fix region_intersects() vs add_memory_driver_managed()" from Huang Ying. Fix a bug in region_intersects() for systems with CXL memory.
- "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches a couple more code paths to correctly recover from the encountering of poisoned memory.
- "mm: enable large folios swap-in support" from Barry Song.
  Support the swapin of mTHP memory into appropriately-sized folios, rather than into single-page folios"

* tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits)
  zram: free secondary algorithms names
  uprobes: turn xol_area->pages[2] into xol_area->page
  uprobes: introduce the global struct vm_special_mapping xol_mapping
  Revert "uprobes: use vm_special_mapping close() functionality"
  mm: support large folios swap-in for sync io devices
  mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
  mm: fix swap_read_folio_zeromap() for large folios with partial zeromap
  mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries
  set_memory: add __must_check to generic stubs
  mm/vma: return the exact errno in vms_gather_munmap_vmas()
  memcg: cleanup with !CONFIG_MEMCG_V1
  mm/show_mem.c: report alloc tags in human readable units
  mm: support poison recovery from copy_present_page()
  mm: support poison recovery from do_cow_fault()
  resource, kunit: add test case for region_intersects()
  resource: make alloc_free_mem_region() works for iomem_resource
  mm: z3fold: deprecate CONFIG_Z3FOLD
  vfio/pci: implement huge_fault support
  mm/arm64: support large pfn mappings
  mm/x86: support large pfn mappings
  ...
2024-09-20 | perf/riscv-sbi: Add platform specific firmware event handling | Mayuresh Chitale
The SBI v2.0 specification pointed to by the link below reserves the event code 0xffff for platform specific firmware events. Update the driver to be able to parse and program such events.

The platform specific firmware events must now be specified in the perf command as below:

    perf stat -e rCxxx ...

where bits[63:62] = 0x3 of the event config indicate a platform specific firmware event and xxx indicate the actual event code which is passed as the event data.

Signed-off-by: Mayuresh Chitale <mchitale@ventanamicro.com>
Link: https://github.com/riscv-non-isa/riscv-sbi-doc/releases/download/v2.0/riscv-sbi.pdf
Link: https://lore.kernel.org/r/20240812051109.6496-1-mchitale@ventanamicro.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
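The encoding described above (bits[63:62] = 0x3 marking a platform specific firmware event) can be illustrated with a small userspace sketch. All names here are hypothetical and not taken from the driver. In particular, treating every bit below bit 62 as event data is an assumption made for illustration; the commit message does not state the exact width of the data field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Event code reserved for platform firmware events, per the commit message. */
#define MODEL_SBI_PLATFORM_FW_EVENT 0xffffULL

/* bits[63:62] of the perf event config select the event class. */
#define MODEL_CLASS_SHIFT 62
#define MODEL_CLASS_PLATFORM_FW 0x3ULL

/* Assumption: everything below bit 62 is passed on as the event data. */
#define MODEL_DATA_MASK ((1ULL << MODEL_CLASS_SHIFT) - 1)

/* True when the config carries the platform specific firmware marker. */
static bool model_is_platform_fw_event(uint64_t config)
{
    return (config >> MODEL_CLASS_SHIFT) == MODEL_CLASS_PLATFORM_FW;
}

/* Extract the event data that would accompany event code 0xffff. */
static uint64_t model_platform_fw_event_data(uint64_t config)
{
    assert(model_is_platform_fw_event(config));
    return config & MODEL_DATA_MASK;
}
```

Under these assumptions, a raw config of 0xC000000000000123 is classified as a platform firmware event with event data 0x123.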
2024-09-20 | RISC-V: Don't have MAX_PHYSMEM_BITS exceed phys_addr_t | Palmer Dabbelt
I recently ended up with a warning on some compilers along the lines of

    CC kernel/resource.o
    In file included from include/linux/ioport.h:16,
                     from kernel/resource.c:15:
    kernel/resource.c: In function 'gfr_start':
    include/linux/minmax.h:49:37: error: conversion from 'long long unsigned int' to 'resource_size_t' {aka 'unsigned int'} changes value from '17179869183' to '4294967295' [-Werror=overflow]
       49 | ({ type ux = (x); type uy = (y); __cmp(op, ux, uy); })
          | ^
    include/linux/minmax.h:52:9: note: in expansion of macro '__cmp_once_unique'
       52 | __cmp_once_unique(op, type, x, y, __UNIQUE_ID(x_), __UNIQUE_ID(y_))
          | ^~~~~~~~~~~~~~~~~
    include/linux/minmax.h:161:27: note: in expansion of macro '__cmp_once'
      161 | #define min_t(type, x, y) __cmp_once(min, type, x, y)
          | ^~~~~~~~~~
    kernel/resource.c:1829:23: note: in expansion of macro 'min_t'
     1829 | end = min_t(resource_size_t, base->end,
          | ^~~~~
    kernel/resource.c: In function 'gfr_continue':
    include/linux/minmax.h:49:37: error: conversion from 'long long unsigned int' to 'resource_size_t' {aka 'unsigned int'} changes value from '17179869183' to '4294967295' [-Werror=overflow]
       49 | ({ type ux = (x); type uy = (y); __cmp(op, ux, uy); })
          | ^
    include/linux/minmax.h:52:9: note: in expansion of macro '__cmp_once_unique'
       52 | __cmp_once_unique(op, type, x, y, __UNIQUE_ID(x_), __UNIQUE_ID(y_))
          | ^~~~~~~~~~~~~~~~~
    include/linux/minmax.h:161:27: note: in expansion of macro '__cmp_once'
      161 | #define min_t(type, x, y) __cmp_once(min, type, x, y)
          | ^~~~~~~~~~
    kernel/resource.c:1847:24: note: in expansion of macro 'min_t'
     1847 | addr <= min_t(resource_size_t, base->end,
          | ^~~~~
    cc1: all warnings being treated as errors

which looks like a real problem: our phys_addr_t is only 32 bits now, so having 34-bit masks is just going to result in overflows.

Reviewed-by: Charlie Jenkins <charlie@rivosinc.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240731162159.9235-2-palmer@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
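The overflow in the compiler message can be reproduced in a few lines of userspace C. The sketch below is illustrative only: the helper names are hypothetical, and the clamp function shows one possible way to keep a physical-memory bit count within the width of the address type, which is not necessarily how the patch itself is written.

```c
#include <assert.h>
#include <stdint.h>

/* All-ones mask covering the low `bits` bits of a physical address. */
static uint64_t model_physmem_mask(unsigned int bits)
{
    return bits >= 64 ? UINT64_MAX : (1ULL << bits) - 1;
}

/* What happens when that mask is stored in a 32-bit phys_addr_t. */
static uint32_t model_to_phys32(uint64_t value)
{
    return (uint32_t)value; /* high bits are silently discarded */
}

/* Hypothetical guard: never advertise more bits than the type can hold. */
static unsigned int model_clamp_physmem_bits(unsigned int wanted_bits,
                                             unsigned int phys_addr_bits)
{
    assert(phys_addr_bits > 0);
    return wanted_bits < phys_addr_bits ? wanted_bits : phys_addr_bits;
}
```

A 34-bit mask is 17179869183; truncated to 32 bits it becomes 4294967295, which matches the two values quoted in the diagnostic.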
2024-09-19 | Merge patch series "riscv: Improve KASAN coverage to fix unit tests" | Palmer Dabbelt
Samuel Holland <samuel.holland@sifive.com> says:

This series fixes two areas where uninstrumented assembly routines caused gaps in KASAN coverage on RISC-V, which were caught by KUnit tests. The KASAN KUnit test suite passes after applying this series.

This series fixes the following test failures:

    # kasan_strings: EXPECTATION FAILED at mm/kasan/kasan_test.c:1520
    KASAN failure expected in "kasan_int_result = strcmp(ptr, "2")", but none occurred
    # kasan_strings: EXPECTATION FAILED at mm/kasan/kasan_test.c:1524
    KASAN failure expected in "kasan_int_result = strlen(ptr)", but none occurred
    not ok 60 kasan_strings
    # kasan_bitops_generic: EXPECTATION FAILED at mm/kasan/kasan_test.c:1531
    KASAN failure expected in "set_bit(nr, addr)", but none occurred
    # kasan_bitops_generic: EXPECTATION FAILED at mm/kasan/kasan_test.c:1533
    KASAN failure expected in "clear_bit(nr, addr)", but none occurred
    # kasan_bitops_generic: EXPECTATION FAILED at mm/kasan/kasan_test.c:1535
    KASAN failure expected in "clear_bit_unlock(nr, addr)", but none occurred
    # kasan_bitops_generic: EXPECTATION FAILED at mm/kasan/kasan_test.c:1536
    KASAN failure expected in "__clear_bit_unlock(nr, addr)", but none occurred
    # kasan_bitops_generic: EXPECTATION FAILED at mm/kasan/kasan_test.c:1537
    KASAN failure expected in "change_bit(nr, addr)", but none occurred
    # kasan_bitops_generic: EXPECTATION FAILED at mm/kasan/kasan_test.c:1543
    KASAN failure expected in "test_and_set_bit(nr, addr)", but none occurred
    # kasan_bitops_generic: EXPECTATION FAILED at mm/kasan/kasan_test.c:1545
    KASAN failure expected in "test_and_set_bit_lock(nr, addr)", but none occurred
    # kasan_bitops_generic: EXPECTATION FAILED at mm/kasan/kasan_test.c:1546
    KASAN failure expected in "test_and_clear_bit(nr, addr)", but none occurred
    # kasan_bitops_generic: EXPECTATION FAILED at mm/kasan/kasan_test.c:1548
    KASAN failure expected in "test_and_change_bit(nr, addr)", but none occurred
    not ok 61 kasan_bitops_generic

Samuel Holland (2):
  riscv: Omit optimized string routines when using KASAN
  riscv: Enable bitops instrumentation

 arch/riscv/include/asm/bitops.h | 43 ++++++++++++++++++---------------
 arch/riscv/include/asm/string.h |  2 ++
 arch/riscv/kernel/riscv_ksyms.c |  3 ---
 arch/riscv/lib/Makefile         |  2 ++
 arch/riscv/lib/strcmp.S         |  1 +
 arch/riscv/lib/strlen.S         |  1 +
 arch/riscv/lib/strncmp.S        |  1 +
 arch/riscv/purgatory/Makefile   |  2 ++
 8 files changed, 32 insertions(+), 23 deletions(-)

* b4-shazam-merge:
  riscv: Enable bitops instrumentation
  riscv: Omit optimized string routines when using KASAN

Link: https://lore.kernel.org/r/20240801033725.28816-1-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-09-19 | riscv: Enable bitops instrumentation | Samuel Holland
Instead of implementing the bitops functions directly in assembly, provide the arch_-prefixed versions and use the wrappers from asm-generic to add instrumentation. This improves KASAN coverage and fixes the kasan_bitops_generic() unit test.

Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240801033725.28816-3-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
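The split between an `arch_`-prefixed primitive and an instrumented wrapper can be modeled in userspace roughly as below. This is a simplified, non-atomic sketch: `model_instrument_write()` and its counter are hypothetical stand-ins for the sanitizer hook, and the real kernel primitives are atomic, which this model is not.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Counts how many accesses were reported to the (modeled) sanitizer. */
static unsigned long model_instrumented_accesses;

/* Hypothetical hook: a real sanitizer would validate [addr, addr + size). */
static void model_instrument_write(const volatile void *addr, size_t size)
{
    (void)addr;
    (void)size;
    model_instrumented_accesses++;
}

/* Architecture primitive: performs the operation, knows nothing of KASAN. */
static void arch_set_bit(unsigned int nr, volatile unsigned long *addr)
{
    addr[nr / MODEL_BITS_PER_LONG] |= 1UL << (nr % MODEL_BITS_PER_LONG);
}

/* Generic wrapper: report the access first, then call the primitive. */
static void set_bit(unsigned int nr, volatile unsigned long *addr)
{
    model_instrument_write(addr + nr / MODEL_BITS_PER_LONG,
                           sizeof(unsigned long));
    arch_set_bit(nr, addr);
}
```

Because callers only ever see `set_bit()`, every use passes through the hook, whereas an assembly-only implementation would give the sanitizer nothing to observe.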
2024-09-19 | riscv: Omit optimized string routines when using KASAN | Samuel Holland
The optimized string routines are implemented in assembly, so they are not instrumented for use with KASAN. Fall back to the C version of the routines in order to improve KASAN coverage. This fixes the kasan_strings() unit test.

Signed-off-by: Samuel Holland <samuel.holland@sifive.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240801033725.28816-2-samuel.holland@sifive.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-09-17 | ACPI: RISCV: Make acpi_numa_get_nid() to be static | Hanjun Guo
acpi_numa_get_nid() is only called in acpi_numa.c for riscv, so there is no need to declare it in a header file. Make it static and remove the related functions from asm/acpi.h.

Spotted by doing some cleanup for arm64 ACPI.

Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
Reviewed-by: Haibo Xu <haibo1.xu@intel.com>
Link: https://lore.kernel.org/r/20240811031804.3347298-1-guohanjun@huawei.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-09-17 | riscv: Randomize lower bits of stack address | Yunhui Cui
Implement arch_align_stack() to randomize the lower bits of the stack address.

Signed-off-by: Yunhui Cui <cuiyunhui@bytedance.com>
Link: https://lore.kernel.org/r/20240625030502.68988-1-cuiyunhui@bytedance.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
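A rough userspace model of randomizing the low bits of a stack pointer is shown below. It is a sketch under stated assumptions, not the kernel implementation: `model_align_stack()` is a hypothetical name, `rand()` stands in for the kernel's random number source, the random range is assumed to be one 4 KiB page, and the result is assumed to need 16-byte alignment.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096UL  /* assumed range of the random offset */
#define MODEL_STACK_ALIGN 16UL  /* assumed required stack alignment */

/*
 * Move the stack pointer down by a random amount below one page, then
 * round down to the alignment boundary. With randomization disabled,
 * only the alignment step is applied.
 */
static unsigned long model_align_stack(unsigned long sp, int randomize)
{
    assert(sp >= MODEL_PAGE_SIZE); /* avoid wrapping below zero */

    if (randomize)
        sp -= (unsigned long)rand() % MODEL_PAGE_SIZE;

    return sp & ~(MODEL_STACK_ALIGN - 1);
}
```

The result never exceeds the original pointer, stays aligned, and differs from it by less than one page plus the alignment slack.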
2024-09-16 | Merge tag 'acpi-6.12-rc1' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI updates from Rafael Wysocki:
"These update the ACPICA code in the kernel to upstream version 20240827, add support for ACPI-based enumeration of interrupt controllers on RISC-V along with some related irqchip updates, clean up the ACPI device object sysfs interface, add some quirks for backlight handling and IRQ overrides, fix assorted issues and clean up code.

Specifics:
- Check return value in acpi_db_convert_to_package() (Pei Xiao)
- Detect FACS and allow setting the waking vector on reduced-hardware ACPI platforms (Jiaqing Zhao)
- Allow ACPICA to represent semaphores as integers (Adrien Destugues)
- Complete CXL 3.0 CXIMS structures support in ACPICA (Zhang Rui)
- Make ACPICA support SPCR version 4 and add RISC-V SBI Subtype to DBG2 (Sia Jee Heng)
- Implement the Dword_PCC Resource Descriptor Macro in ACPICA (Jose Marinho)
- Correct the typo in struct acpi_mpam_msc_node member (Punit Agrawal)
- Implement ACPI_WARNING_ONCE() and ACPI_ERROR_ONCE() and use them to prevent a Stall() violation warning from being printed every time this takes place (Vasily Khoruzhick)
- Allow PCC Data Type in MCTP resource (Adam Young)
- Fix memory leaks on acpi_ps_get_next_namepath() and acpi_ps_get_next_field() failures (Armin Wolf)
- Add support for suppressing leading zeros in hex strings when converting them to integers and update integer-to-hex-string conversions in ACPICA (Armin Wolf)
- Add support for Windows 11 22H2 _OSI string (Armin Wolf)
- Avoid warning for Dump Functions in ACPICA (Adam Lackorzynski)
- Add extended linear address mode to HMAT MSCIS in ACPICA (Dave Jiang)
- Handle empty connection_node in iasl (Aleksandrs Vinarskis)
- Allow for more flexibility in _DSM args (Saket Dumbre)
- Setup for ACPICA release 20240827 (Saket Dumbre)
- Add ACPI device enumeration support for interrupt controller probing including taking dependencies into account (Sunil V L)
- Implement ACPI-based interrupt controller probing on RISC-V (Sunil V L)
- Add ACPI support for AIA in riscv-intc and add ACPI support to riscv-imsic, riscv-aplic, and sifive-plic (Sunil V L)
- Do not release locks during operation region accesses in the ACPI EC driver (Rafael Wysocki)
- Fix up the _STR handling in the ACPI device object sysfs interface, make it represent the device object attributes as an attribute group and make it rely on driver core functionality for sysfs attribute management (Thomas Weißschuh)
- Extend error messages printed to the kernel log when acpi_evaluate_dsm() fails to include revision and function number (David Wang)
- Add a new AMDI0015 platform device ID to the ACPI APD driver for AMD SoCs (Shyam Sundar S K)
- Use the driver core for the async probing management in the ACPI battery driver (Thomas Weißschuh)
- Remove redundant initializations of a local variable to NULL from the ACPI battery driver (Ilpo Järvinen)
- Remove unneeded check in tps68470_pmic_opregion_probe() (Aleksandr Mishin)
- Add support for setting the EPP register through the ACPI CPPC sysfs interface if it is in FFH (Mario Limonciello)
- Fix MASK_VAL() usage in the ACPI CPPC library (Clément Léger)
- Reduce the log level of a per-CPU message about idle states in the ACPI processor driver (Li RongQing)
- Fix crash in exit_round_robin() in the ACPI processor aggregator device (PAD) driver (Seiji Nishikawa)
- Add force_vendor quirk for Panasonic Toughbook CF-18 in the ACPI backlight driver (Hans de Goede)
- Make the DMI checks related to backlight handling on Lenovo Yoga Tab 3 X90F less strict (Hans de Goede)
- Enforce native backlight handling on Apple MacbookPro9,2 (Esther Shimanovich)
- Add IRQ override quirks for Asus Vivobook Go E1404GAB and MECHREV GM7XG0M, and refine the TongFang GMxXGxx quirk (Li Chen, Tamim Khan, Werner Sembach)
- Quirk ASUS ROG M16 to default to S3 sleep (Luke D. Jones)
- Define and use symbols for device and class name lengths in the ACPI bus type code and make the code use strscpy() instead of strcpy() in several places (Muhammad Qasim Abdul Majeed)"

* tag 'acpi-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (70 commits)
  ACPI: resource: Add another DMI match for the TongFang GMxXGxx
  ACPI: CPPC: Add support for setting EPP register in FFH
  ACPI: PM: Quirk ASUS ROG M16 to default to S3 sleep
  ACPI: video: Add force_vendor quirk for Panasonic Toughbook CF-18
  ACPI: battery: use driver core managed async probing
  ACPI: button: Use strscpy() instead of strcpy()
  ACPI: resource: Skip IRQ override on Asus Vivobook Go E1404GAB
  ACPI: CPPC: Fix MASK_VAL() usage
  irqchip/sifive-plic: Add ACPI support
  ACPICA: Setup for ACPICA release 20240827
  ACPICA: Allow for more flexibility in _DSM args
  ACPICA: iasl: handle empty connection_node
  ACPICA: HMAT: Add extended linear address mode to MSCIS
  ACPICA: Avoid warning for Dump Functions
  ACPICA: Add support for Windows 11 22H2 _OSI string
  ACPICA: Update integer-to-hex-string conversions
  ACPICA: Add support for supressing leading zeros in hex strings
  ACPICA: Allow for supressing leading zeros when using acpi_ex_convert_to_ascii()
  ACPICA: Fix memory leak if acpi_ps_get_next_field() fails
  ACPICA: Fix memory leak if acpi_ps_get_next_namepath() fails
  ...
2024-09-15 | Merge patch series "Svvptc extension to remove preventive sfence.vma" | Palmer Dabbelt
Alexandre Ghiti <alexghiti@rivosinc.com> says:

In RISC-V, after a new mapping is established, a sfence.vma needs to be emitted for different reasons:

- if the uarch caches invalid entries, we need to invalidate it otherwise we would trap on this invalid entry,
- if the uarch does not cache invalid entries, a reordered access could fail to see the new mapping and then trap (sfence.vma acts as a fence).

We can actually avoid emitting those (mostly) useless and costly sfence.vma by handling the traps instead:

- for new kernel mappings: only vmalloc mappings need to be taken care of, other new mapping are rare and already emit the required sfence.vma if needed. That must be achieved very early in the exception path as explained in patch 3, and this also fixes our fragile way of dealing with vmalloc faults.
- for new user mappings: Svvptc makes update_mmu_cache() a no-op but we can take some gratuitous page faults (which are very unlikely though).

Patch 1 and 2 introduce Svvptc extension probing.

On our uarch that does not cache invalid entries and a 6.5 kernel, the gains are measurable:

* Kernel boot: 6%
* ltp - mmapstress01: 8%
* lmbench - lat_pagefault: 20%
* lmbench - lat_mmap: 5%

Here are the corresponding numbers of sfence.vma emitted:

* Ubuntu boot to login: Before: ~630k sfence.vma After: ~200k sfence.vma
* ltp - mmapstress01 Before: ~45k After: ~6.3k
* lmbench - lat_pagefault Before: ~665k After: 832 (!)
* lmbench - lat_mmap Before: ~546k After: 718 (!)

Thanks to Ved and Matt Evans for triggering the discussion that led to this patchset!

* b4-shazam-merge:
  riscv: Stop emitting preventive sfence.vma for new userspace mappings with Svvptc
  riscv: Stop emitting preventive sfence.vma for new vmalloc mappings
  dt-bindings: riscv: Add Svvptc ISA extension description
  riscv: Add ISA extension parsing for Svvptc

Link: https://lore.kernel.org/r/20240717060125.139416-1-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-09-15 | riscv: Remove unused _TIF_WORK_MASK | Jinjie Ruan
Since commit f0bddf50586d ("riscv: entry: Convert to generic entry"), _TIF_WORK_MASK is no longer used, so remove it.

Fixes: f0bddf50586d ("riscv: entry: Convert to generic entry")
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Reviewed-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Andy Chiu <andy.chiu@sifive.com>
Link: https://lore.kernel.org/r/20240711111508.1373322-1-ruanjinjie@huawei.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-09-15 | riscv: Stop emitting preventive sfence.vma for new userspace mappings with ↵ | Alexandre Ghiti
Svvptc

The preventive sfence.vma were emitted because new mappings must be made visible to the page table walker, but Svvptc guarantees that this will happen within a bounded timeframe. There is therefore no need to sfence.vma on the uarchs that implement this extension; we will instead take gratuitous (but very unlikely) page faults, similarly to x86 and arm64.

This drastically reduces the number of sfence.vma emitted:

* Ubuntu boot to login: Before: ~630k sfence.vma After: ~200k sfence.vma
* ltp - mmapstress01 Before: ~45k After: ~6.3k
* lmbench - lat_pagefault Before: ~665k After: 832 (!)
* lmbench - lat_mmap Before: ~546k After: 718 (!)

Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240717060125.139416-5-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
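The effect on fence counts can be pictured with a toy model in which the per-mapping fence is skipped when the extension is present. Everything in this sketch is hypothetical: the function names, the `has_svvptc` flag and the counter are illustration aids and do not mirror the kernel's actual update_mmu_cache() code.

```c
#include <assert.h>
#include <stdbool.h>

/* Counts the fences the model would have emitted. */
static unsigned long model_sfence_count;

/* Stand-in for emitting one sfence.vma for a freshly mapped page. */
static void model_local_flush_tlb_page(void)
{
    model_sfence_count++;
}

/*
 * Called after a new user mapping is installed. With Svvptc the new entry
 * becomes visible to the page table walker within a bounded time, so the
 * preventive fence is skipped and a rare spurious fault is tolerated.
 */
static void model_update_mmu_cache(bool has_svvptc)
{
    if (has_svvptc)
        return;

    model_local_flush_tlb_page();
}
```

Installing 1000 mappings costs 1000 fences without the extension and none with it in this model; the real figures in the commit message are lower-bounded by fences that remain necessary for other reasons.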
2024-09-15 | riscv: Stop emitting preventive sfence.vma for new vmalloc mappings | Alexandre Ghiti
In 6.5, we removed the vmalloc fault path because that can't work (see [1] [2]). Then in order to make sure that new page table entries were seen by the page table walker, we had to preventively emit a sfence.vma on all harts [3] but this solution is very costly since it relies on IPI.

And even there, we could end up in a loop of vmalloc faults if a vmalloc allocation is done in the IPI path (for example if it is traced, see [4]), which could result in a kernel stack overflow.

Those preventive sfence.vma needed to be emitted because:

- if the uarch caches invalid entries, the new mapping may not be observed by the page table walker and an invalidation may be needed.
- if the uarch does not cache invalid entries, a reordered access could "miss" the new mapping and trap: in that case, we would actually only need to retry the access, no sfence.vma is required.

So this patch removes those preventive sfence.vma and actually handles the possible (and unlikely) exceptions. And since the kernel stacks mappings lie in the vmalloc area, this handling must be done very early when the trap is taken, at the very beginning of handle_exception: this also rules out the vmalloc allocations in the fault path.

Link: https://lore.kernel.org/linux-riscv/20230531093817.665799-1-bjorn@kernel.org/ [1]
Link: https://lore.kernel.org/linux-riscv/20230801090927.2018653-1-dylan@andestech.com [2]
Link: https://lore.kernel.org/linux-riscv/20230725132246.817726-1-alexghiti@rivosinc.com/ [3]
Link: https://lore.kernel.org/lkml/20200508144043.13893-1-joro@8bytes.org/ [4]
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Yunhui Cui <cuiyunhui@bytedance.com>
Link: https://lore.kernel.org/r/20240717060125.139416-4-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-09-15 | riscv: Add ISA extension parsing for Svvptc | Alexandre Ghiti
Add support to parse the Svvptc string in the riscv,isa string.

Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Link: https://lore.kernel.org/r/20240717060125.139416-2-alexghiti@rivosinc.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-09-15 | Merge tag 'kvm-riscv-6.12-1' of https://github.com/kvm-riscv/linux into HEAD | Paolo Bonzini
KVM/riscv changes for 6.12

- Fix sbiret init before forwarding to userspace
- Don't zero-out PMU snapshot area before freeing data
- Allow legacy PMU access from guest
- Fix to allow hpmcounter31 from the guest
2024-09-12 | Merge patch series "remove size limit on XIP kernel" | Palmer Dabbelt
Nam Cao <namcao@linutronix.de> says:

Hi,

For XIP kernel, the writable data section is always at offset specified in XIP_OFFSET, which is hard-coded to 32MB.

Unfortunately, this means the read-only section (placed before the writable section) is restricted in size. This causes build failure if the kernel gets too large.

This series removes the use of XIP_OFFSET one by one, then removes this macro entirely at the end, with the goal of lifting this size restriction.

Also some cleanup and documentation along the way.

* b4-shazam-merge:
  riscv: remove limit on the size of read-only section for XIP kernel
  riscv: drop the use of XIP_OFFSET in create_kernel_page_table()
  riscv: drop the use of XIP_OFFSET in kernel_mapping_va_to_pa()
  riscv: drop the use of XIP_OFFSET in XIP_FIXUP_FLASH_OFFSET
  riscv: drop the use of XIP_OFFSET in XIP_FIXUP_OFFSET
  riscv: replace misleading va_kernel_pa_offset on XIP kernel
  riscv: don't export va_kernel_pa_offset in vmcoreinfo for XIP kernel
  riscv: cleanup XIP_FIXUP macro
  riscv: change XIP's kernel_map.size to be size of the entire kernel
  ...

Link: https://lore.kernel.org/r/cover.1717789719.git.namcao@linutronix.de
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-09-12 | riscv: remove limit on the size of read-only section for XIP kernel | Nam Cao
XIP_OFFSET is the hard-coded offset of writable data section within the kernel.

By hard-coding this value, the read-only section of the kernel (which is placed before the writable data section) is restricted in size. This causes build failures if the kernel gets too big [1].

Remove this limit.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202404211031.J6l2AfJk-lkp@intel.com [1]
Signed-off-by: Nam Cao <namcao@linutronix.de>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/3bf3a77be10ebb0d8086c028500baa16e7a8e648.1717789719.git.namcao@linutronix.de
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-09-12riscv: drop the use of XIP_OFFSET in kernel_mapping_va_to_pa()Nam Cao
XIP_OFFSET is the hard-coded offset of writable data section within the kernel. By hard-coding this value, the read-only section of the kernel (which is placed before the writable data section) is restricted in size. As a preparation to remove this hard-coded macro XIP_OFFSET entirely, remove the use of XIP_OFFSET in kernel_mapping_va_to_pa(). The macro XIP_OFFSET is used in this case to check if the virtual address is mapped to Flash or to RAM. The same check can be done with kernel_map.xiprom_sz. Signed-off-by: Nam Cao <namcao@linutronix.de> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/644c13d9467525a06f5d63d157875a35b2edb4bc.1717789719.git.namcao@linutronix.de Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
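The check described in this commit can be sketched in plain C. This is only an illustration under stated assumptions, not the kernel's macro: `struct xip_map`, its field names other than `xiprom_sz`, and `xip_va_to_pa()` are hypothetical stand-ins, and the sketch assumes the RAM-resident part is mapped directly after the flash-resident part.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's kernel_map. */
struct xip_map {
    uint64_t virt_addr;      /* first virtual address of the kernel mapping */
    uint64_t xiprom_sz;      /* size of the read-only part kept in flash */
    uint64_t text_pa_offset; /* VA minus PA for the flash-resident part */
    uint64_t data_pa_offset; /* VA minus PA for the RAM-resident part */
};

/*
 * Translate a kernel virtual address to a physical address.
 * The flash/RAM decision uses the size of the flash-resident part
 * instead of a fixed offset such as XIP_OFFSET.
 */
static uint64_t xip_va_to_pa(const struct xip_map *m, uint64_t va)
{
    assert(va >= m->virt_addr); /* only kernel-mapping addresses are valid */
    if (va < m->virt_addr + m->xiprom_sz)
        return va - m->text_pa_offset; /* mapped to flash */
    return va - m->data_pa_offset;     /* mapped to RAM */
}
```

Because the boundary comes from the actual size of the read-only part, the sketch keeps working when that part grows beyond 32MB.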
2024-09-12riscv: drop the use of XIP_OFFSET in XIP_FIXUP_FLASH_OFFSETNam Cao
XIP_OFFSET is the hard-coded offset of writable data section within the kernel. By hard-coding this value, the read-only section of the kernel (which is placed before the writable data section) is restricted in size. As a preparation to remove this hard-coded macro XIP_OFFSET entirely, stop using XIP_OFFSET in XIP_FIXUP_FLASH_OFFSET. Instead, use __data_loc and _sdata to do the same thing. While at it, also add a description for XIP_FIXUP_FLASH_OFFSET. Signed-off-by: Nam Cao <namcao@linutronix.de> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/7b3319657edd1822f3457e7e7c07aaa326cc2f87.1717789719.git.namcao@linutronix.de Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-09-12riscv: drop the use of XIP_OFFSET in XIP_FIXUP_OFFSETNam Cao
XIP_OFFSET is the hard-coded offset of writable data section within the kernel. By hard-coding this value, the read-only section of the kernel (which is placed before the writable data section) is restricted in size. As a preparation to remove this hard-coded macro XIP_OFFSET entirely, stop using XIP_OFFSET in XIP_FIXUP_OFFSET. Instead, use CONFIG_PHYS_RAM_BASE and _sdata to do the same thing. While at it, also add a description for XIP_FIXUP_OFFSET. Signed-off-by: Nam Cao <namcao@linutronix.de> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/dba0409518b14ee83b346e099b1f7f934daf7b74.1717789719.git.namcao@linutronix.de Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-09-12riscv: replace misleading va_kernel_pa_offset on XIP kernelNam Cao
On XIP kernel, the name "va_kernel_pa_offset" is misleading: unlike "normal" kernel, it is not the virtual-physical address offset of kernel mapping, it is the offset of kernel mapping's first virtual address to first physical address in DRAM, which is not meaningful because the kernel's first physical address is not in DRAM. For XIP kernel, there are 2 different offsets because the read-only part of the kernel resides in ROM while the rest is in RAM. The offset to ROM is in kernel_map.va_kernel_xip_pa_offset, while the offset to RAM is not stored anywhere: it is calculated on-the-fly. Remove this confusing "va_kernel_pa_offset" and add "va_kernel_xip_data_pa_offset" as its replacement. This new variable is the offset of virtual mapping of the kernel's data portion to the corresponding physical addresses. With the introduction of this new variable, also rename va_kernel_xip_pa_offset -> va_kernel_xip_text_pa_offset to make it clear that this one is about the .text section. Signed-off-by: Nam Cao <namcao@linutronix.de> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/84e5d005c1386d88d7b2531e0b6707ec5352ee54.1717789719.git.namcao@linutronix.de Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-09-12riscv: cleanup XIP_FIXUP macroNam Cao
The XIP_FIXUP macro is used to fix addresses early during boot before MMU: generated code "thinks" the data section is in ROM while it is actually in RAM. So this macro corrects the addresses in the data section. This macro determines if the address needs to be fixed by checking if it is within the range starting from ROM address up to the size of (2 * XIP_OFFSET). This means if the kernel size is bigger than (2 * XIP_OFFSET), some addresses would not be fixed up. XIP kernel can still work if the above scenario does not happen. But this macro is obviously incorrect. Rewrite this macro to only fix up addresses within the data section. Signed-off-by: Nam Cao <namcao@linutronix.de> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/95f50a4ec8204ec4fcbf2a80c9addea0e0609e3b.1717789719.git.namcao@linutronix.de Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
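The rewritten behaviour ("only fix up addresses within the data section") can be sketched as a plain function. This is a minimal sketch, not the kernel macro: `xip_fixup()` and its parameters are hypothetical, and the caller is assumed to know where the data section sits in ROM and where RAM starts.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: if addr points into the ROM image of the data
 * section [rom_data_start, rom_data_end), return the matching RAM
 * address; otherwise return addr unchanged.
 */
static uint64_t xip_fixup(uint64_t addr, uint64_t rom_data_start,
                          uint64_t rom_data_end, uint64_t ram_base)
{
    assert(rom_data_start <= rom_data_end);
    if (addr >= rom_data_start && addr < rom_data_end)
        return addr - rom_data_start + ram_base;
    return addr; /* text or unrelated address: no fixup needed */
}
```

Bounding the check by the data section itself, rather than by a multiple of a fixed offset, means the result does not depend on the overall kernel size.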
2024-09-11Merge branch 'acpi-riscv'Rafael J. Wysocki
Merge ACPI and irqchip updates related to external interrupt controller support on RISC-V: - Add ACPI device enumeration support for interrupt controller probing including taking dependencies into account (Sunil V L). - Implement ACPI-based interrupt controller probing on RISC-V (Sunil V L). - Add ACPI support for AIA in riscv-intc and add ACPI support to riscv-imsic, riscv-aplic, and sifive-plic (Sunil V L). * acpi-riscv: irqchip/sifive-plic: Add ACPI support irqchip/riscv-aplic: Add ACPI support irqchip/riscv-imsic: Add ACPI support irqchip/riscv-imsic-state: Create separate function for DT irqchip/riscv-intc: Add ACPI support for AIA ACPI: RISC-V: Implement function to add implicit dependencies ACPI: RISC-V: Initialize GSI mapping structures ACPI: RISC-V: Implement function to reorder irqchip probe entries ACPI: RISC-V: Implement PCI related functionality ACPI: pci_link: Clear the dependencies after probe ACPI: bus: Add RINTC IRQ model for RISC-V ACPI: scan: Define weak function to populate dependencies ACPI: scan: Add RISC-V interrupt controllers to honor list ACPI: scan: Refactor dependency creation ACPI: bus: Add acpi_riscv_init() function ACPI: scan: Add a weak arch_sort_irqchip_probe() to order the IRQCHIP probe arm64: PCI: Migrate ACPI related functions to pci-acpi.c
2024-09-03arch, mm: move definition of node_data to generic codeMike Rapoport (Microsoft)
Every architecture that supports NUMA defines node_data in the same way: struct pglist_data *node_data[MAX_NUMNODES]; No reason to keep multiple copies of this definition and its forward declarations, especially when such forward declaration is the only thing in include/asm/mmzone.h for many architectures. Add definition and declaration of node_data to generic code and drop architecture-specific versions. Link: https://lkml.kernel.org/r/20240807064110.1003856-8-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Tested-by: Zi Yan <ziy@nvidia.com> # for x86_64 and arm64 Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU] Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-03riscv: Fix RISCV_ALTERNATIVE_EARLYAlexandre Ghiti
RISCV_ALTERNATIVE_EARLY will issue sbi_ecall() very early in the boot process, before the first memory mapping is setup so we can't have any instrumentation happening here. In addition, when the kernel is relocatable, we must also not issue any relocation this early since they would have been patched virtually only. So, instead of disabling instrumentation for the whole kernel/sbi.c file and compiling it with -fno-pie, simply move __sbi_ecall() and __sbi_base_ecall() into their own file where this is fixed. Reported-by: Conor Dooley <conor.dooley@microchip.com> Closes: https://lore.kernel.org/linux-riscv/20240813-pony-truck-3e7a83e9759e@spud/ Reported-by: syzbot+cfbcb82adf6d7279fd35@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-riscv/00000000000065062c061fcec37b@google.com/ Fixes: 1745cfafebdf ("riscv: don't use global static vars to store alternative data") Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20240829165048.49756-1-alexghiti@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-09-03riscv: Add license to vmalloc.hCharlie Jenkins
Add a missing license to vmalloc.h. Signed-off-by: Charlie Jenkins <charlie@rivosinc.com> Link: https://lore.kernel.org/r/20240729-riscv_fence_license-v1-2-7d5648069640@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-09-03riscv: Add license to fence.hCharlie Jenkins
Add a missing license to fence.h. Signed-off-by: Charlie Jenkins <charlie@rivosinc.com> Link: https://lore.kernel.org/r/20240729-riscv_fence_license-v1-1-7d5648069640@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-08-29Merge patch series "riscv: mm: Do not restrict mmap address based on hint"Palmer Dabbelt
Charlie Jenkins <charlie@rivosinc.com> says: There have been a couple of reports that using the hint address to restrict the address returned by mmap has caused issues in applications. A different solution for restricting addresses returned by mmap is necessary to avoid breakages. [Palmer: This also just wasn't doing the right thing in the first place, as it didn't handle the sv39 cases we were trying to deal with.] * b4-shazam-merge: riscv: mm: Do not restrict mmap address based on hint riscv: selftests: Remove mmap hint address checks Revert "RISC-V: mm: Document mmap changes" Link: https://lore.kernel.org/r/20240826-riscv_mmap-v1-0-cd8962afe47f@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-08-29riscv: mm: Do not restrict mmap address based on hintCharlie Jenkins
The hint address should not forcefully restrict the addresses returned by mmap as this causes mmap to report ENOMEM when there is memory still available. Signed-off-by: Charlie Jenkins <charlie@rivosinc.com> Fixes: b5b4287accd7 ("riscv: mm: Use hint address in mmap if available") Fixes: add2cc6b6515 ("RISC-V: mm: Restrict address space for sv39,sv48,sv57") Closes: https://lore.kernel.org/linux-kernel/ZbxTNjQPFKBatMq+@ghost/T/#mccb1890466bf5a488c9ce7441e57e42271895765 Link: https://lore.kernel.org/r/20240826-riscv_mmap-v1-3-cd8962afe47f@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-08-27irqchip/riscv-intc: Add ACPI support for AIASunil V L
The RINTC subtype structure in MADT also has information about other interrupt controllers. Save this information and provide interfaces to retrieve them when required by corresponding drivers. Signed-off-by: Sunil V L <sunilvl@ventanamicro.com> Reviewed-by: Anup Patel <anup@brainfault.org> Tested-by: Björn Töpel <bjorn@rivosinc.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20240812005929.113499-14-sunilvl@ventanamicro.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-08-27ACPI: RISC-V: Initialize GSI mapping structuresSunil V L
RISC-V has PLIC and APLIC in MADT as well as namespace devices. Initialize the list of those structures using MADT and namespace devices to create mapping between the ACPI handle and the GSI ranges. This will be used later to add dependencies. Signed-off-by: Sunil V L <sunilvl@ventanamicro.com> Tested-by: Björn Töpel <bjorn@rivosinc.com> Link: https://patch.msgid.link/20240812005929.113499-12-sunilvl@ventanamicro.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-08-19RISC-V: KVM: Fix to allow hpmcounter31 from the guestAtish Patra
The csr_fun defines a count parameter that specifies the total number of CSRs emulated in KVM starting from the base. This value should equal the total number of counters that can be trapped and emulated (32). Fixes: a9ac6c37521f ("RISC-V: KVM: Implement trap & emulate for hpmcounters") Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240816-kvm_pmu_fixes-v1-2-cdfce386dd93@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
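The off-by-one can be illustrated with a small range check. This is a sketch under assumptions: `csr_in_range()` is a hypothetical helper modelled on a base-plus-count lookup, and the CSR numbers are the user-level counter addresses from the RISC-V privileged specification (cycle at 0xC00, hpmcounter31 at 0xC1F).

```c
#include <assert.h>
#include <stdbool.h>

#define CSR_CYCLE        0xc00u /* first user-level counter CSR */
#define CSR_HPMCOUNTER31 0xc1fu /* last user-level counter CSR */

/* Hypothetical lookup: does [base, base + count) cover this CSR? */
static bool csr_in_range(unsigned int csr, unsigned int base,
                         unsigned int count)
{
    return csr >= base && csr < base + count;
}

/* Encodes the claim in the commit message: 32 entries are needed. */
static void csr_range_self_check(void)
{
    assert(!csr_in_range(CSR_HPMCOUNTER31, CSR_CYCLE, 31));
    assert(csr_in_range(CSR_HPMCOUNTER31, CSR_CYCLE, 32));
}
```

With a count of 31 the covered range ends at 0xC1E, so an access to hpmcounter31 falls outside the emulated range.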
2024-08-19RISC-V: KVM: Allow legacy PMU access from guestAtish Patra
Currently, KVM traps & emulates PMU counter access only if SBI PMU is available, as the guest can configure/read PMU counters only via SBI. However, if SBI PMU is not enabled in the host, the guest will fall back to the legacy PMU, which will try to access cycle/instret and result in an illegal instruction trap, which is not desired. KVM can allow dummy emulation of cycle/instret only for the guest if SBI PMU is not enabled in the host. The dummy emulation will still return zero as we don't want to expose the host counter values to a guest using legacy PMU. Fixes: a9ac6c37521f ("RISC-V: KVM: Implement trap & emulate for hpmcounters") Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240816-kvm_pmu_fixes-v1-1-cdfce386dd93@rivosinc.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-08-14RISC-V: hwprobe: Add MISALIGNED_PERF keyEvan Green
RISCV_HWPROBE_KEY_CPUPERF_0 was mistakenly flagged as a bitmask in hwprobe_key_is_bitmask(), when in reality it was an enum value. This causes problems when used in conjunction with RISCV_HWPROBE_WHICH_CPUS, since SLOW, FAST, and EMULATED have values whose bits overlap with each other. If the caller asked for the set of CPUs that was SLOW or EMULATED, the returned set would also include CPUs that were FAST. Introduce a new hwprobe key, RISCV_HWPROBE_KEY_MISALIGNED_PERF, which returns the same values in response to a direct query (with no flags), but is properly handled as an enumerated value. As a result, SLOW, FAST, and EMULATED are all correctly treated as distinct values under the new key when queried with the WHICH_CPUS flag. Leave the old key in place to avoid disturbing applications which may have already come to rely on the key, with or without its broken behavior with respect to the WHICH_CPUS flag. Fixes: e178bf146e4b ("RISC-V: hwprobe: Introduce which-cpus flag") Signed-off-by: Evan Green <evan@rivosinc.com> Reviewed-by: Charlie Jenkins <charlie@rivosinc.com> Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Link: https://lore.kernel.org/r/20240809214444.3257596-2-evan@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
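The overlap problem can be reproduced with a few lines of C. This is an illustrative sketch: the helper names are hypothetical, and the numeric values (EMULATED = 1, SLOW = 2, FAST = 3) are assumed from the uapi header and should be checked against it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values of the misaligned-access performance states. */
enum {
    MISALIGNED_EMULATED = 1,
    MISALIGNED_SLOW     = 2,
    MISALIGNED_FAST     = 3,
};

/* Bitmask-style match: every requested bit must be set on the CPU. */
static bool match_as_bitmask(uint64_t cpu_value, uint64_t wanted)
{
    return (cpu_value & wanted) == wanted;
}

/* Enum-style match: the value must be exactly the requested one. */
static bool match_as_enum(uint64_t cpu_value, uint64_t wanted)
{
    return cpu_value == wanted;
}

/* A FAST CPU wrongly matches SLOW and EMULATED under bitmask rules. */
static void misaligned_self_check(void)
{
    assert(match_as_bitmask(MISALIGNED_FAST, MISALIGNED_SLOW));
    assert(match_as_bitmask(MISALIGNED_FAST, MISALIGNED_EMULATED));
    assert(!match_as_enum(MISALIGNED_FAST, MISALIGNED_SLOW));
}
```

Since 3 has both bits of 1 and 2 set, treating the key as a bitmask makes FAST CPUs appear in a query for SLOW or EMULATED, which is the behaviour the new key avoids.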
2024-08-07RISC-V: Enable IPI CPU BacktraceRyo Takakura
Add arch_trigger_cpumask_backtrace() which is a generic infrastructure for sampling other CPUs' backtrace using IPI. The feature is used when lockups are detected or in case of oops/panic if parameters are set accordingly. Below is the case of oops with the oops_all_cpu_backtrace enabled. $ sysctl kernel.oops_all_cpu_backtrace=1 triggering oops shows: [ 212.214237] NMI backtrace for cpu 1 [ 212.214390] CPU: 1 PID: 610 Comm: in:imklog Tainted: G OE 6.10.0-rc6 #1 [ 212.214570] Hardware name: riscv-virtio,qemu (DT) [ 212.214690] epc : fallback_scalar_usercopy+0x8/0xdc [ 212.214809] ra : _copy_to_user+0x20/0x40 [ 212.214913] epc : ffffffff80c3a930 ra : ffffffff8059ba7e sp : ff20000000eabb50 [ 212.215061] gp : ffffffff82066f90 tp : ff6000008e958000 t0 : 3463303866660000 [ 212.215210] t1 : 000000000000005b t2 : 3463303866666666 s0 : ff20000000eabb60 [ 212.215358] s1 : 0000000000000386 a0 : 00007ff6e81df926 a1 : ff600000824df800 [ 212.215505] a2 : 000000000000003f a3 : 7fffffffffffffc0 a4 : 0000000000000000 [ 212.215651] a5 : 000000000000003f a6 : 0000000000000000 a7 : 0000000000000000 [ 212.215857] s2 : ff600000824df800 s3 : ffffffff82066cc0 s4 : 0000000000001c1a [ 212.216074] s5 : ffffffff8206a5a8 s6 : 00007ff6e81df926 s7 : ffffffff8206a5a0 [ 212.216278] s8 : ff600000824df800 s9 : ffffffff81e25de0 s10: 000000000000003f [ 212.216471] s11: ffffffff8206a59d t3 : ff600000824df812 t4 : ff600000824df812 [ 212.216651] t5 : ff600000824df818 t6 : 0000000000040000 [ 212.216796] status: 0000000000040120 badaddr: 0000000000000000 cause: 8000000000000001 [ 212.217035] [<ffffffff80c3a930>] fallback_scalar_usercopy+0x8/0xdc [ 212.217207] [<ffffffff80095f56>] syslog_print+0x1f4/0x2b2 [ 212.217362] [<ffffffff80096e5c>] do_syslog.part.0+0x94/0x2d8 [ 212.217502] [<ffffffff800979e8>] do_syslog+0x66/0x88 [ 212.217636] [<ffffffff803a5dda>] kmsg_read+0x44/0x5c [ 212.217764] [<ffffffff80392dbe>] proc_reg_read+0x7a/0xa8 [ 212.217952] [<ffffffff802ff726>] vfs_read+0xb0/0x24e [ 212.218090] 
[<ffffffff803001ba>] ksys_read+0x64/0xe4 [ 212.218264] [<ffffffff8030025a>] __riscv_sys_read+0x20/0x2c [ 212.218453] [<ffffffff80c4af9a>] do_trap_ecall_u+0x60/0x1d4 [ 212.218664] [<ffffffff80c56998>] ret_from_exception+0x0/0x64 Signed-off-by: Ryo Takakura <takakura@valinux.co.jp> Link: https://lore.kernel.org/r/20240718093659.158912-1-takakura@valinux.co.jp Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-27Merge tag 'riscv-for-linus-6.11-mw2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull more RISC-V updates from Palmer Dabbelt: - Support for NUMA (via SRAT and SLIT), console output (via SPCR), and cache info (via PPTT) on ACPI-based systems. - The trap entry/exit code no longer breaks the return address stack predictor on many systems, which results in an improvement to trap latency. - Support for HAVE_ARCH_STACKLEAK. - The sv39 linear map has been extended to support 128GiB mappings. - The frequency of the mtime CSR is now visible via hwprobe. * tag 'riscv-for-linus-6.11-mw2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (21 commits) RISC-V: Provide the frequency of time CSR via hwprobe riscv: Extend sv39 linear mapping max size to 128G riscv: enable HAVE_ARCH_STACKLEAK riscv: signal: Remove unlikely() from WARN_ON() condition riscv: Improve exception and system call latency RISC-V: Select ACPI PPTT drivers riscv: cacheinfo: initialize cacheinfo's level and type from ACPI PPTT riscv: cacheinfo: remove the useless input parameter (node) of ci_leaf_init() RISC-V: ACPI: Enable SPCR table for console output on RISC-V riscv: boot: remove duplicated targets line trace: riscv: Remove deprecated kprobe on ftrace support riscv: cpufeature: Extract common elements from extension checking riscv: Introduce vendor variants of extension helpers riscv: Add vendor extensions to /proc/cpuinfo riscv: Extend cpufeature.c to detect vendor extensions RISC-V: run savedefconfig for defconfig RISC-V: hwprobe: sort EXT_KEY()s in hwprobe_isa_ext0() alphabetically ACPI: NUMA: replace pr_info with pr_debug in arch_acpi_numa_init ACPI: NUMA: change the ACPI_NUMA to a hidden option ACPI: NUMA: Add handler for SRAT RINTC affinity structure ...
2024-07-26Merge tag 'bitmap-6.11-rc1' of https://github.com:/norov/linuxLinus Torvalds
Pull bitmap updates from Yury Norov: "Random fixes" * tag 'bitmap-6.11-rc1' of https://github.com:/norov/linux: riscv: Remove unnecessary int cast in variable_fls() radix tree test suite: put definition of bitmap_clear() into lib/bitmap.c bitops: Add a comment explaining the double underscore macros lib: bitmap: add missing MODULE_DESCRIPTION() macros cpumask: introduce assign_cpu() macro
2024-07-26RISC-V: Provide the frequency of time CSR via hwprobePalmer Dabbelt
The RISC-V architecture makes a real time counter CSR (via RDTIME instruction) available for applications in U-mode but there is no architected mechanism for an application to discover the frequency the counter is running at. Some applications (e.g., DPDK) use the time counter for basic performance analysis as well as fine grained time-keeping. Add support to the hwprobe system call to export the time CSR frequency to code running in U-mode. Signed-off-by: Yunhui Cui <cuiyunhui@bytedance.com> Reviewed-by: Evan Green <evan@rivosinc.com> Reviewed-by: Anup Patel <anup@brainfault.org> Acked-by: Punit Agrawal <punit.agrawal@bytedance.com> Link: https://lore.kernel.org/r/20240702033731.71955-2-cuiyunhui@bytedance.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
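Once user space knows the counter frequency, a difference between two time CSR readings can be turned into wall-clock time. The following is a minimal sketch of that arithmetic only; `ticks_to_ns()` is a hypothetical helper, and the frequency is assumed to have been obtained separately (for example from hwprobe) and to be nonzero and well below 18 GHz so the intermediate product fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ull

/*
 * Convert a tick count from the time CSR into nanoseconds.
 * Whole seconds and the remainder are handled separately so that
 * ticks * NSEC_PER_SEC is never computed on a large tick count.
 */
static uint64_t ticks_to_ns(uint64_t ticks, uint64_t freq_hz)
{
    assert(freq_hz != 0);
    uint64_t secs = ticks / freq_hz;
    uint64_t rem = ticks % freq_hz;
    return secs * NSEC_PER_SEC + (rem * NSEC_PER_SEC) / freq_hz;
}
```

For example, with a 10 MHz counter, 15 ticks correspond to 1500 ns.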
2024-07-26riscv: Extend sv39 linear mapping max size to 128GStuart Menefy
This harmonizes all virtual addressing modes which can now all map (PGDIR_SIZE * PTRS_PER_PGD) / 4 of physical memory. The RISCV implementation of KASAN requires that the boundary between shallow mappings are aligned on an 8G boundary. In this case we need VMALLOC_START to be 8G aligned. So although we only need to move the start of the linear mapping down by 4GiB to allow 128GiB to be mapped, we actually move it down by 8GiB (creating a 4GiB hole between the linear mapping and KASAN shadow space) to maintain the alignment requirement. Signed-off-by: Stuart Menefy <stuart.menefy@codasip.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/r/20240630110550.1731929-1-stuart.menefy@codasip.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
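The formula (PGDIR_SIZE * PTRS_PER_PGD) / 4 can be worked through for each addressing mode. The sketch below assumes 4 KiB pages with 512 entries per table and top-level shifts of 30, 39 and 48 for sv39, sv48 and sv57; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed: 512 entries per page table on rv64 with 4 KiB pages. */
#define PTRS_PER_PGD 512ull

/* Maximum linear-map size for a given top-level (PGDIR) shift. */
static uint64_t linear_map_max_bytes(unsigned int pgdir_shift)
{
    uint64_t pgdir_size = 1ull << pgdir_shift;
    return (pgdir_size * PTRS_PER_PGD) / 4;
}

static void linear_map_self_check(void)
{
    assert(linear_map_max_bytes(30) == (128ull << 30)); /* sv39: 128 GiB */
    assert(linear_map_max_bytes(39) == (64ull << 40));  /* sv48: 64 TiB */
    assert(linear_map_max_bytes(48) == (32ull << 50));  /* sv57: 32 PiB */
}
```

For sv39 this gives 1 GiB * 512 / 4 = 128 GiB, matching the new limit in the commit title.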
2024-07-26riscv: enable HAVE_ARCH_STACKLEAKJisheng Zhang
Add support for the stackleak feature. Whenever the kernel returns to user space, the kernel stack is filled with a poison value. At the same time, disable the plugin in the EFI stub code because the EFI stub is out of scope for the protection. Tested on qemu and milkv duo: / # echo STACKLEAK_ERASING > /sys/kernel/debug/provoke-crash/DIRECT [ 38.675575] lkdtm: Performing direct entry STACKLEAK_ERASING [ 38.678448] lkdtm: stackleak stack usage: [ 38.678448] high offset: 288 bytes [ 38.678448] current: 496 bytes [ 38.678448] lowest: 1328 bytes [ 38.678448] tracked: 1328 bytes [ 38.678448] untracked: 448 bytes [ 38.678448] poisoned: 14312 bytes [ 38.678448] low offset: 8 bytes [ 38.689887] lkdtm: OK: the rest of the thread stack is properly erased Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Reviewed-by: Charlie Jenkins <charlie@rivosinc.com> Link: https://lore.kernel.org/r/20240623235316.2010-1-jszhang@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-22Merge patch series "riscv: Separate vendor extensions from standard extensions"Palmer Dabbelt
Charlie Jenkins <charlie@rivosinc.com> says: All extensions, both standard and vendor, live in one struct "riscv_isa_ext". There is currently one vendor extension, xandespmu, but it is likely that more vendor extensions will be added to the kernel in the future. As more vendor extensions (and standard extensions) are added, riscv_isa_ext will become more bloated with a mix of vendor and standard extensions. Separating vendor extensions from standard ones avoids that, and also allows each vendor's extensions to be conditionally enabled through Kconfig. * b4-shazam-merge: riscv: cpufeature: Extract common elements from extension checking riscv: Introduce vendor variants of extension helpers riscv: Add vendor extensions to /proc/cpuinfo riscv: Extend cpufeature.c to detect vendor extensions Link: https://lore.kernel.org/r/20240719-support_vendor_extensions-v3-0-0af7587bbec0@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-22riscv: cpufeature: Extract common elements from extension checkingCharlie Jenkins
The __riscv_has_extension_likely() and __riscv_has_extension_unlikely() functions from the vendor_extensions.h can be used to simplify the standard extension checking code as well. Migrate those functions to cpufeature.h and reorganize the code in the file to use the functions. Signed-off-by: Charlie Jenkins <charlie@rivosinc.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Reviewed-by: Andy Chiu <andy.chiu@sifive.com> Link: https://lore.kernel.org/r/20240719-support_vendor_extensions-v3-4-0af7587bbec0@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-22riscv: Introduce vendor variants of extension helpersCharlie Jenkins
Vendor extensions are maintained in per-vendor structs (separate from standard extensions which live in riscv_isa). Create vendor variants for the existing extension helpers to interface with the riscv_isa_vendor bitmaps. Signed-off-by: Charlie Jenkins <charlie@rivosinc.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Reviewed-by: Andy Chiu <andy.chiu@sifive.com> Link: https://lore.kernel.org/r/20240719-support_vendor_extensions-v3-3-0af7587bbec0@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-22riscv: Extend cpufeature.c to detect vendor extensionsCharlie Jenkins
Instead of grouping all vendor extensions into the same riscv_isa_ext that standard instructions use, create a struct "riscv_isa_vendor_ext_data_list" that allows each vendor to maintain their vendor extensions independently of the standard extensions. xandespmu is currently the only vendor extension so that is the only extension that is affected by this change. An additional benefit of this is that the extensions of each vendor can be conditionally enabled. A config RISCV_ISA_VENDOR_EXT_ANDES has been added to allow for that. Signed-off-by: Charlie Jenkins <charlie@rivosinc.com> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Reviewed-by: Andy Chiu <andy.chiu@sifive.com> Tested-by: Yu Chien Peter Lin <peterlin@andestech.com> Reviewed-by: Yu Chien Peter Lin <peterlin@andestech.com> Link: https://lore.kernel.org/r/20240719-support_vendor_extensions-v3-1-0af7587bbec0@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-22Merge patch series "Add ACPI NUMA support for RISC-V"Palmer Dabbelt
Haibo Xu <haibo1.xu@intel.com> says: This patch series enable RISC-V ACPI NUMA support which was based on the recently approved ACPI ECR[1]. Patch 1/4 add RISC-V specific acpi_numa.c file to parse NUMA information from SRAT and SLIT ACPI tables. Patch 2/4 add the common SRAT RINTC affinity structure handler. Patch 3/4 change the ACPI_NUMA to a hidden option since it would be selected by default on all supported platform. Patch 4/4 replace pr_info with pr_debug in arch_acpi_numa_init() to avoid potential boot noise on ACPI platforms that are not NUMA. Based-on: https://github.com/linux-riscv/linux-riscv/tree/for-next [1] https://drive.google.com/file/d/1YTdDx2IPm5IeZjAW932EYU-tUtgS08tX/view?usp=sharing Testing: Since the ACPI AIA/PLIC support patch set is still under upstream review, hence it is tested using the poll based HVC SBI console and RAM disk. 1) Build latest Qemu with the following patch backported https://github.com/vlsunil/qemu/commit/42bd4eeefd5d4410a68f02d54fee406d8a1269b0 2) Build latest EDK-II https://github.com/tianocore/edk2/blob/master/OvmfPkg/RiscVVirt/README.md 3) Build Linux with the following configs enabled CONFIG_RISCV_SBI_V01=y CONFIG_SERIAL_EARLYCON_RISCV_SBI=y CONFIG_NONPORTABLE=y CONFIG_HVC_RISCV_SBI=y CONFIG_NUMA=y CONFIG_ACPI_NUMA=y 4) Build buildroot rootfs.cpio 5) Launch the Qemu machine qemu-system-riscv64 -nographic \ -machine virt,pflash0=pflash0,pflash1=pflash1 -smp 4 -m 8G \ -blockdev node-name=pflash0,driver=file,read-only=on,filename=RISCV_VIRT_CODE.fd \ -blockdev node-name=pflash1,driver=file,filename=RISCV_VIRT_VARS.fd \ -object memory-backend-ram,size=4G,id=m0 \ -object memory-backend-ram,size=4G,id=m1 \ -numa node,memdev=m0,cpus=0-1,nodeid=0 \ -numa node,memdev=m1,cpus=2-3,nodeid=1 \ -numa dist,src=0,dst=1,val=30 \ -kernel linux/arch/riscv/boot/Image \ -initrd buildroot/output/images/rootfs.cpio \ -append "root=/dev/ram ro console=hvc0 earlycon=sbi" [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x80000000-0x17fffffff] [ 
0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x180000000-0x27fffffff] [ 0.000000] NUMA: NODE_DATA [mem 0x17fe3bc40-0x17fe3cfff] [ 0.000000] NUMA: NODE_DATA [mem 0x27fff4c40-0x27fff5fff] ... [ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> HARTID 0x0 -> Node 0 [ 0.000000] ACPI: NUMA: SRAT: PXM 0 -> HARTID 0x1 -> Node 0 [ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> HARTID 0x2 -> Node 1 [ 0.000000] ACPI: NUMA: SRAT: PXM 1 -> HARTID 0x3 -> Node 1 * b4-shazam-merge: ACPI: NUMA: replace pr_info with pr_debug in arch_acpi_numa_init ACPI: NUMA: change the ACPI_NUMA to a hidden option ACPI: NUMA: Add handler for SRAT RINTC affinity structure ACPI: RISCV: Add NUMA support based on SRAT and SLIT Link: https://lore.kernel.org/r/cover.1718268003.git.haibo1.xu@intel.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-22ACPI: RISCV: Add NUMA support based on SRAT and SLITHaibo Xu
Add acpi_numa.c file to enable parsing NUMA information from the ACPI SRAT and SLIT tables. The SRAT table provides the mapping of CPUs (harts) and memory nodes to proximity domains, while the SLIT table provides the distance metrics between proximity domains. Signed-off-by: Haibo Xu <haibo1.xu@intel.com> Reviewed-by: Sunil V L <sunilvl@ventanamicro.com> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Link: https://lore.kernel.org/r/65dbad1fda08a32922c44886e4581e49b4a2fecc.1718268003.git.haibo1.xu@intel.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-21Merge tag 'mm-stable-2024-07-21-14-50' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - In the series "mm: Avoid possible overflows in dirty throttling" Jan Kara addresses a couple of issues in the writeback throttling code. These fixes are also targeted at -stable kernels. - Ryusuke Konishi's series "nilfs2: fix potential issues related to reserved inodes" does that. This should actually be in the mm-nonmm-stable tree, along with the many other nilfs2 patches. My bad. - More folio conversions from Kefeng Wang in the series "mm: convert to folio_alloc_mpol()" - Kemeng Shi has sent some cleanups to the writeback code in the series "Add helper functions to remove repeated code and improve readability of cgroup writeback" - Kairui Song has made the swap code a little smaller and a little faster in the series "mm/swap: clean up and optimize swap cache index". - In the series "mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David Hildenbrand has reworked the rather sketchy handling of the use of the zeropage in MAP_SHARED mappings. I don't see any runtime effects here - more a cleanup/understandability/maintainability thing. - Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling of higher addresses, for aarch64. The (poorly named) series is "Restructure va_high_addr_switch". - The core TLB handling code gets some cleanups and possible slight optimizations in Bang Li's series "Add update_mmu_tlb_range() to simplify code". - Jane Chu has improved the handling of our fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in the series "Enhance soft hwpoison handling and injection". - Jeff Johnson has sent a billion patches everywhere to add MODULE_DESCRIPTION() to everything. Some landed in this pull. - In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang has simplified migration's use of hardware-offload memory copying. 
- Yosry Ahmed performs more folio API conversions in his series "mm: zswap: trivial folio conversions". - In the series "large folios swap-in: handle refault cases first", Chuanhua Han inches us forward in the handling of large pages in the swap code. This is a cleanup and optimization, working toward the end objective of full support of large folio swapin/out. - In the series "mm,swap: cleanup VMA based swap readahead window calculation", Huang Ying has contributed some cleanups and a possible fixlet to his VMA based swap readahead code. - In the series "add mTHP support for anonymous shmem" Baolin Wang has taught anonymous shmem mappings to use multisize THP. By default this is a no-op - users must opt in via sysfs controls. Dramatic improvements in pagefault latency are realized. - David Hildenbrand has some cleanups to our remaining use of page_mapcount() in the series "fs/proc: move page_mapcount() to fs/proc/internal.h". - David also has some highmem accounting cleanups in the series "mm/highmem: don't track highmem pages manually". - Build-time fixes and cleanups from John Hubbard in the series "cleanups, fixes, and progress towards avoiding "make headers"". - Cleanups and consolidation of the core pagemap handling from Barry Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers and utilize them". - Lance Yang's series "Reclaim lazyfree THP without splitting" has reduced the latency of the reclaim of pmd-mapped THPs under fairly common circumstances. A 10x speedup is seen in a microbenchmark. It does this by punting to another CPU but I guess that's a win unless all CPUs are pegged. - hugetlb_cgroup cleanups from Xiu Jianfeng in the series "mm/hugetlb_cgroup: rework on cftypes". - Miaohe Lin's series "Some cleanups for memory-failure" does just that thing. - Someone other than SeongJae has developed a DAMON feature in Honggyu Kim's series "DAMON based tiered memory management for CXL memory". 
This adds DAMON features which may be used to help determine the efficiency of our placement of CXL/PCIe attached DRAM. - DAMON user API centralization and simplificatio work in SeongJae Park's series "mm/damon: introduce DAMON parameters online commit function". - In the series "mm: page_type, zsmalloc and page_mapcount_reset()" David Hildenbrand does some maintenance work on zsmalloc - partially modernizing its use of pageframe fields. - Kefeng Wang provides more folio conversions in the series "mm: remove page_maybe_dma_pinned() and page_mkclean()". - More cleanup from David Hildenbrand, this time in the series "mm/memory_hotplug: use PageOffline() instead of PageReserved() for !ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline() pages" and permits the removal of some virtio-mem hacks. - Barry Song's series "mm: clarify folio_add_new_anon_rmap() and __folio_add_anon_rmap()" is a cleanup to the anon folio handling in preparation for mTHP (multisize THP) swapin. - Kefeng Wang's series "mm: improve clear and copy user folio" implements more folio conversions, this time in the area of large folio userspace copying. - The series "Docs/mm/damon/maintaier-profile: document a mailing tool and community meetup series" tells people how to get better involved with other DAMON developers. From SeongJae Park. - A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does that. - David Hildenbrand sends along more cleanups, this time against the migration code. The series is "mm/migrate: move NUMA hinting fault folio isolation + checks under PTL". - Jan Kara has found quite a lot of strangenesses and minor errors in the readahead code. He addresses this in the series "mm: Fix various readahead quirks". - SeongJae Park's series "selftests/damon: test DAMOS tried regions and {min,max}_nr_regions" adds features and addresses errors in DAMON's self testing code. - Gavin Shan has found a userspace-triggerable WARN in the pagecache code. 
The series "mm/filemap: Limit page cache size to that supported by xarray" addresses this. The series is marked cc:stable. - Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations and cleanup" cleans up and slightly optimizes KSM. - Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of code motion. The series (which also makes the memcg-v1 code Kconfigurable) are "mm: memcg: separate legacy cgroup v1 code and put under config option" and "mm: memcg: put cgroup v1-specific memcg data under CONFIG_MEMCG_V1" - Dan Schatzberg's series "Add swappiness argument to memory.reclaim" adds an additional feature to this cgroup-v2 control file. - The series "Userspace controls soft-offline pages" from Jiaqi Yan permits userspace to stop the kernel's automatic treatment of excessive correctable memory errors. In order to permit userspace to monitor and handle this situation. - Kefeng Wang's series "mm: migrate: support poison recover from migrate folio" teaches the kernel to appropriately handle migration from poisoned source folios rather than simply panicing. - SeongJae Park's series "Docs/damon: minor fixups and improvements" does those things. - In the series "mm/zsmalloc: change back to per-size_class lock" Chengming Zhou improves zsmalloc's scalability and memory utilization. - Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for pinning memfd folios" makes the GUP code use FOLL_PIN rather than bare refcount increments. So these paes can first be moved aside if they reside in the movable zone or a CMA block. - Andrii Nakryiko has added a binary ioctl()-based API to /proc/pid/maps for much faster reading of vma information. The series is "query VMAs from /proc/<pid>/maps". - In the series "mm: introduce per-order mTHP split counters" Lance Yang improves the kernel's presentation of developer information related to multisize THP splitting. 
- Michael Ellerman has developed the series "Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64)". This permits userspace to use all available huge page sizes.
- In the series "revert unconditional slab and page allocator fault injection calls" Vlastimil Babka removes a performance-affecting and not very useful feature from slab fault injection.

* tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (411 commits)
  mm/mglru: fix ineffective protection calculation
  mm/zswap: fix a white space issue
  mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio
  mm/hugetlb: fix possible recursive locking detected warning
  mm/gup: clear the LRU flag of a page before adding to LRU batch
  mm/numa_balancing: teach mpol_to_str about the balancing mode
  mm: memcg1: convert charge move flags to unsigned long long
  alloc_tag: fix page_ext_get/page_ext_put sequence during page splitting
  lib: reuse page_ext_data() to obtain codetag_ref
  lib: add missing newline character in the warning message
  mm/mglru: fix overshooting shrinker memory
  mm/mglru: fix div-by-zero in vmpressure_calc_level()
  mm/kmemleak: replace strncpy() with strscpy()
  mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC
  mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB
  mm: ignore data-race in __swap_writepage
  hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr
  mm: shmem: rename mTHP shmem counters
  mm: swap_state: use folio_alloc_mpol() in __read_swap_cache_async()
  mm/migrate: putback split folios when numa hint migration fails
  ...
2024-07-20 Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Linus Torvalds
Pull kvm updates from Paolo Bonzini:

"ARM:
- Initial infrastructure for shadow stage-2 MMUs, as part of nested virtualization enablement
- Support for userspace changes to the guest CTR_EL0 value, enabling (in part) migration of VMs between heterogeneous hardware
- Fixes + improvements to pKVM's FF-A proxy, adding support for v1.1 of the protocol
- FPSIMD/SVE support for nested, including merged trap configuration and exception routing
- New command-line parameter to control the WFx trap behavior under KVM
- Introduce kCFI hardening in the EL2 hypervisor
- Fixes + cleanups for handling presence/absence of FEAT_TCRX
- Miscellaneous fixes + documentation updates

LoongArch:
- Add paravirt steal time support
- Add support for KVM_DIRTY_LOG_INITIALLY_SET
- Add perf kvm-stat support for loongarch

RISC-V:
- Redirect AMO load/store access fault traps to guest
- perf kvm stat support
- Use guest files for IMSIC virtualization, when available

s390:
- Assortment of tiny fixes which are not time critical

x86:
- Fixes for Xen emulation
- Add a global struct to consolidate tracking of host values, e.g. EFER
- Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC bus frequency, because TDX
- Print the name of the APICv/AVIC inhibits in the relevant tracepoint
- Clean up KVM's handling of vendor specific emulation to consistently act on "compatible with Intel/AMD", versus checking for a specific vendor
- Drop MTRR virtualization, and instead always honor guest PAT on CPUs that support self-snoop
- Update to the newfangled Intel CPU FMS infrastructure
- Don't advertise IA32_PERF_GLOBAL_OVF_CTRL as an MSR-to-be-saved, as it reads '0' and writes from userspace are ignored
- Misc cleanups

x86 - MMU:
- Small cleanups, renames and refactoring extracted from the upcoming Intel TDX support
- Don't allocate kvm_mmu_page.shadowed_translation for shadow pages that can't hold leaf SPTEs
- Unconditionally drop mmu_lock when allocating TDP MMU page tables for eager page splitting, to avoid stalling vCPUs when splitting huge pages
- Bug the VM instead of simply warning if KVM tries to split a SPTE that is non-present or not-huge. KVM is guaranteed to end up in a broken state because the callers fully expect a valid SPTE, and it is dangerous to let more MMU changes happen afterwards

x86 - AMD:
- Make per-CPU save_area allocations NUMA-aware
- Force sev_es_host_save_area() to be inlined to avoid calling into an instrumentable function from noinstr code
- Base support for running SEV-SNP guests. API-wise, this includes a new KVM_X86_SNP_VM type, encrypting/measuring the initial image into guest memory, and finalizing it before launching it. Internally, there are some gmem/mmu hooks needed to prepare gmem-allocated pages before mapping them into guest private memory ranges
  This includes basic support for attestation guest requests, enough to say that KVM supports the GHCB 2.0 specification
  There is no support yet for loading into the firmware those signing keys to be used for attestation requests, and therefore no need yet for the host to provide certificate data for those keys. To support fetching certificate data from userspace, a new KVM exit type will be needed to handle fetching the certificate from userspace. An attempt to define a new KVM_EXIT_COCO / KVM_EXIT_COCO_REQ_CERTS exit type to handle this was introduced in v1 of this patchset, but is still being discussed by the community, so for now this patchset only implements a stub version of SNP Extended Guest Requests that does not provide certificate data

x86 - Intel:
- Remove an unnecessary EPT TLB flush when enabling hardware
- Fix a series of bugs that cause KVM to fail to detect nested pending posted interrupts as valid wake events for a vCPU executing HLT in L2 (with HLT-exiting disabled by L1)
- KVM: x86: Suppress MMIO that is triggered during task switch emulation
  Explicitly suppress userspace emulated MMIO exits that are triggered when emulating a task switch as KVM doesn't support userspace MMIO during complex (multi-step) emulation
  Silently ignoring the exit request can result in the WARN_ON_ONCE(vcpu->mmio_needed) firing if KVM exits to userspace for some other reason prior to purging mmio_needed
  See commit 0dc902267cb3 ("KVM: x86: Suppress pending MMIO write exits if emulator detects exception") for more details on KVM's limitations with respect to emulated MMIO during complex emulator flows

Generic:
- Rename the AS_UNMOVABLE flag that was introduced for KVM to AS_INACCESSIBLE, because the special casing needed by these pages is not due to just unmovability (and in fact they are only unmovable because the CPU cannot access them)
- New ioctl to populate the KVM page tables in advance, which is useful to mitigate KVM page faults during guest boot or after live migration. The code will also be used by TDX, but (probably) not through the ioctl
- Enable halt poll shrinking by default, as Intel found it to be a clear win
- Setup empty IRQ routing when creating a VM to avoid having to synchronize SRCU when creating a split IRQCHIP on x86
- Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag that arch code can use for hooking both sched_in() and sched_out()
- Take the vCPU @id as an "unsigned long" instead of "u32" to avoid truncating a bogus value from userspace, e.g. to help userspace detect bugs
- Mark a vCPU as preempted if and only if it's scheduled out while in the KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest memory when retrieving guest state during live migration blackout

Selftests:
- Remove dead code in the memslot modification stress test
- Treat "branch instructions retired" as supported on all AMD Family 17h+ CPUs
- Print the guest pseudo-RNG seed only when it changes, to avoid spamming the log for tests that create lots of VMs
- Make the PMU counters test less flaky when counting LLC cache misses by doing CLFLUSH{OPT} in every loop iteration"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (227 commits)
  crypto: ccp: Add the SNP_VLEK_LOAD command
  KVM: x86/pmu: Add kvm_pmu_call() to simplify static calls of kvm_pmu_ops
  KVM: x86: Introduce kvm_x86_call() to simplify static calls of kvm_x86_ops
  KVM: x86: Replace static_call_cond() with static_call()
  KVM: SEV: Provide support for SNP_EXTENDED_GUEST_REQUEST NAE event
  x86/sev: Move sev_guest.h into common SEV header
  KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event
  KVM: x86: Suppress MMIO that is triggered during task switch emulation
  KVM: x86/mmu: Clean up make_huge_page_split_spte() definition and intro
  KVM: x86/mmu: Bug the VM if KVM tries to split a !hugepage SPTE
  KVM: selftests: x86: Add test for KVM_PRE_FAULT_MEMORY
  KVM: x86: Implement kvm_arch_vcpu_pre_fault_memory()
  KVM: x86/mmu: Make kvm_mmu_do_page_fault() return mapped level
  KVM: x86/mmu: Account pf_{fixed,emulate,spurious} in callers of "do page fault"
  KVM: x86/mmu: Bump pf_taken stat only in the "real" page fault handler
  KVM: Add KVM_PRE_FAULT_MEMORY vcpu ioctl to pre-populate guest memory
  KVM: Document KVM_PRE_FAULT_MEMORY ioctl
  mm, virt: merge AS_UNMOVABLE and AS_INACCESSIBLE
  perf kvm: Add kvm-stat for loongarch64
  LoongArch: KVM: Add PV steal time support in guest side
  ...