summaryrefslogtreecommitdiff
path: root/arch/powerpc
AgeCommit message (Collapse)Author
2024-10-31powerpc/trace: Account for -fpatchable-function-entry support by toolchainNaveen N Rao
So far, we have relied on the fact that gcc supports both -mprofile-kernel, as well as -fpatchable-function-entry, and clang supports neither. Our Makefile only checks for CONFIG_MPROFILE_KERNEL to decide which files to build. Clang has a feature request out [*] to implement -fpatchable-function-entry, and is unlikely to support -mprofile-kernel. Update our Makefile checks so that we pick up the correct files to build once clang picks up support for -fpatchable-function-entry. [*] https://github.com/llvm/llvm-project/issues/57031 Signed-off-by: Naveen N Rao <naveen@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241030070850.1361304-2-hbathini@linux.ibm.com
2024-10-29of/fdt: add dt_phys arg to early_init_dt_scan and early_init_dt_verifyUsama Arif
__pa() is only intended to be used for linear map addresses and using it for initial_boot_params which is in fixmap for arm64 will give an incorrect value. Hence save the physical address when it is known at boot time when calling early_init_dt_scan for arm64 and use it at kexec time instead of converting the virtual address using __pa(). Note that arm64 doesn't need the FDT region reserved in the DT as the kernel explicitly reserves the passed in FDT. Therefore, only a debug warning is fixed with this change. Reported-by: Breno Leitao <leitao@debian.org> Suggested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Usama Arif <usamaarif642@gmail.com> Fixes: ac10be5cdbfa ("arm64: Use common of_kexec_alloc_and_setup_fdt()") Link: https://lore.kernel.org/r/20241023171426.452688-1-usamaarif642@gmail.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
2024-10-29powerpc/64: Remove maple platformMichael Ellerman
The maple platform was added in 2004 [1], to support the "Maple" 970FX evaluation board. It was later used for IBM JS20/JS21 machines, as well as the Bimini machine, aka "Yellow Dog Powerstation". Sadly all those machines have passed into memory, and there's been no evidence for years that anyone is still using any of them. Remove the platform and related code. It can always be reinstated if there's interest. Note that this has no impact on support for 970FX based Power Macs. [1]: https://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux-fullhistory.git/commit/?id=f0d068d65c5e555ffcfbc189de32598f6f00770c Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241013102957.548291-1-mpe@ellerman.id.au
2024-10-29powerpc/boot: Remove bogus reference to liloMichael Ellerman
The help text refers to lilo, but the install script does not run lilo and never has. The reference to lilo seems to have come originally from arch/ppc/Makefile, but it was not true there either. Remove it. Reported-by: Thorsten Leemhuis <linux@leemhuis.info> Link: https://fosstodon.org/@kernellogger/113032940928131612 Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241009053806.135807-1-mpe@ellerman.id.au
2024-10-29powerpc/pseries: Fix dtl_access_lock to be a rw_semaphoreMichael Ellerman
The dtl_access_lock needs to be a rw_sempahore, a sleeping lock, because the code calls kmalloc() while holding it, which can sleep: # echo 1 > /proc/powerpc/vcpudispatch_stats BUG: sleeping function called from invalid context at include/linux/sched/mm.h:337 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 199, name: sh preempt_count: 1, expected: 0 3 locks held by sh/199: #0: c00000000a0743f8 (sb_writers#3){.+.+}-{0:0}, at: vfs_write+0x324/0x438 #1: c0000000028c7058 (dtl_enable_mutex){+.+.}-{3:3}, at: vcpudispatch_stats_write+0xd4/0x5f4 #2: c0000000028c70b8 (dtl_access_lock){+.+.}-{2:2}, at: vcpudispatch_stats_write+0x220/0x5f4 CPU: 0 PID: 199 Comm: sh Not tainted 6.10.0-rc4 #152 Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries Call Trace: dump_stack_lvl+0x130/0x148 (unreliable) __might_resched+0x174/0x410 kmem_cache_alloc_noprof+0x340/0x3d0 alloc_dtl_buffers+0x124/0x1ac vcpudispatch_stats_write+0x2a8/0x5f4 proc_reg_write+0xf4/0x150 vfs_write+0xfc/0x438 ksys_write+0x88/0x148 system_call_exception+0x1c4/0x5a0 system_call_common+0xf4/0x258 Fixes: 06220d78f24a ("powerpc/pseries: Introduce rwlock to gatekeep DTLB usage") Tested-by: Kajol Jain <kjain@linux.ibm.com> Reviewed-by: Nysal Jan K.A <nysal@linux.ibm.com> Reviewed-by: Kajol Jain <kjain@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20240819122401.513203-1-mpe@ellerman.id.au
2024-10-29powerpc/machdep: Drop include of dma-mapping.hMichael Ellerman
Drop the include of dma-mapping.h in machdep.h, replace it with forward declarations of struct device and struct pci_dev, and include time64.h and page.h which are required for time64_t and pgprot_t respectively. Add direct includes of some other headers to some files that were getting them via machdep.h. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241009051826.132805-2-mpe@ellerman.id.au
2024-10-29powerpc/machdep: Drop include of seq_file.hMichael Ellerman
Drop the include of seq_file.h in machdep.h, replace it with a forward declaration of struct seq_file, which is all that's required. Add direct includes of seq_file.h to some files that were getting seq_file.h via machdep.h. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241009051826.132805-1-mpe@ellerman.id.au
2024-10-29powerpc/64: Drop IPI_PRIORITY from asm-offsetsMichael Ellerman
The last use of IPI_PRIORITY in asm was removed in commit 37f55d30df2e ("KVM: PPC: Book3S HV: Convert kvmppc_read_intr to a C function"). Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241009051701.132282-1-mpe@ellerman.id.au
2024-10-29dma-mapping: drop unneeded includes from dma-mapping.hChristoph Hellwig
Back in the day a lot of logic was implemented inline in dma-mapping.h and needed various includes. Move of this has long been moved out of line, so we can drop various includes to improve kernel rebuild times. Signed-off-by: Christoph Hellwig <hch@lst.de>
2024-10-28asm-generic: provide generic page_to_phys and phys_to_page implementationsChristoph Hellwig
page_to_phys is duplicated by all architectures, and from some strange reason placed in <asm/io.h> where it doesn't fit at all. phys_to_page is only provided by a few architectures despite having a lot of open coded users. Provide generic versions in <asm-generic/memory_model.h> to make these helpers more easily usable. Note with this patch powerpc loses the CONFIG_DEBUG_VIRTUAL pfn_valid check. It will be added back in a generic version later. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-10-25KVM: PPC: Explicitly require struct page memory for Ultravisor sharingSean Christopherson
Explicitly require "struct page" memory when sharing memory between guest and host via an Ultravisor. Given the number of pfn_to_page() calls in the code, it's safe to assume that KVM already requires that the pfn returned by gfn_to_pfn() is backed by struct page, i.e. this is likely a bug fix, not a reduction in KVM capabilities. Switching to gfn_to_page() will eventually allow removing gfn_to_pfn() and kvm_pfn_to_refcounted_page(). Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-79-seanjc@google.com>
2024-10-25KVM: PPC: Use kvm_vcpu_map() to map guest memory to patch dcbz instructionsSean Christopherson
Use kvm_vcpu_map() when patching dcbz in guest memory, as a regular GUP isn't technically sufficient when writing to data in the target pages. As per Documentation/core-api/pin_user_pages.rst: Correct (uses FOLL_PIN calls): pin_user_pages() write to the data within the pages unpin_user_pages() INCORRECT (uses FOLL_GET calls): get_user_pages() write to the data within the pages put_page() As a happy bonus, using kvm_vcpu_{,un}map() takes care of creating a mapping and marking the page dirty. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-75-seanjc@google.com>
2024-10-25KVM: PPC: Remove extra get_page() to fix page refcount leakSean Christopherson
Don't manually do get_page() when patching dcbz, as gfn_to_page() gifts the caller a reference. I.e. doing get_page() will leak the page due to not putting all references. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-74-seanjc@google.com>
2024-10-25KVM: PPC: Use kvm_faultin_pfn() to handle page faults on Book3s PRSean Christopherson
Convert Book3S PR to __kvm_faultin_pfn()+kvm_release_faultin_page(), which are new APIs to consolidate arch code and provide consistent behavior across all KVM architectures. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-65-seanjc@google.com>
2024-10-25KVM: PPC: Book3S: Mark "struct page" pfns dirty/accessed after installing PTESean Christopherson
Mark pages/folios dirty/accessed after installing a PTE, and more specifically after acquiring mmu_lock and checking for an mmu_notifier invalidation. Marking a page/folio dirty after it has been written back can make some filesystems unhappy (backing KVM guests will such filesystem files is uncommon, and the race is minuscule, hence the lack of complaints). See the link below for details. This will also allow converting Book3S to kvm_release_faultin_page(), which requires that mmu_lock be held (for the aforementioned reason). Link: https://lore.kernel.org/all/cover.1683044162.git.lstoakes@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-64-seanjc@google.com>
2024-10-25KVM: PPC: Drop unused @kvm_ro param from kvmppc_book3s_instantiate_page()Sean Christopherson
Drop @kvm_ro from kvmppc_book3s_instantiate_page() as it is now only written, and never read. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-63-seanjc@google.com>
2024-10-25KVM: PPC: Use __kvm_faultin_pfn() to handle page faults on Book3s RadixSean Christopherson
Replace Book3s Radix's homebrewed (read: copy+pasted) fault-in logic with __kvm_faultin_pfn(), which functionally does pretty much the exact same thing. Note, when the code was written, KVM indeed didn't do fast GUP without "!atomic && !async", but that has long since changed (KVM tries fast GUP for all writable mappings). Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-62-seanjc@google.com>
2024-10-25KVM: PPC: Use __kvm_faultin_pfn() to handle page faults on Book3s HVSean Christopherson
Replace Book3s HV's homebrewed fault-in logic with __kvm_faultin_pfn(), which functionally does pretty much the exact same thing. Note, when the code was written, KVM indeed didn't do fast GUP without "!atomic && !async", but that has long since changed (KVM tries fast GUP for all writable mappings). Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-61-seanjc@google.com>
2024-10-25KVM: PPC: e500: Use __kvm_faultin_pfn() to handle page faultsSean Christopherson
Convert PPC e500 to use __kvm_faultin_pfn()+kvm_release_faultin_page(), and continue the inexorable march towards the demise of kvm_pfn_to_refcounted_page(). Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-55-seanjc@google.com>
2024-10-25KVM: PPC: e500: Mark "struct page" pfn accessed before dropping mmu_lockSean Christopherson
Mark pages accessed before dropping mmu_lock when faulting in guest memory so that shadow_map() can convert to kvm_release_faultin_page() without tripping its lockdep assertion on mmu_lock being held. Marking pages accessed outside of mmu_lock is ok (not great, but safe), but marking pages _dirty_ outside of mmu_lock can make filesystems unhappy. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-54-seanjc@google.com>
2024-10-25KVM: PPC: e500: Mark "struct page" dirty in kvmppc_e500_shadow_map()Sean Christopherson
Mark the underlying page as dirty in kvmppc_e500_ref_setup()'s sole caller, kvmppc_e500_shadow_map(), which will allow converting e500 to __kvm_faultin_pfn() + kvm_release_faultin_page() without having to do a weird dance between ref_setup() and shadow_map(). Opportunistically drop the redundant kvm_set_pfn_accessed(), as shadow_map() puts the page via kvm_release_pfn_clean(). Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-53-seanjc@google.com>
2024-10-25KVM: Drop unused "hva" pointer from __gfn_to_pfn_memslot()Sean Christopherson
Drop @hva from __gfn_to_pfn_memslot() now that all callers pass NULL. No functional change intended. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-19-seanjc@google.com>
2024-10-25KVM: Drop @atomic param from gfn=>pfn and hva=>pfn APIsSean Christopherson
Drop @atomic from the myriad "to_pfn" APIs now that all callers pass "false", and remove a comment blurb about KVM running only the "GUP fast" part in atomic context. No functional change intended. Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-13-seanjc@google.com>
2024-10-25KVM: Drop KVM_ERR_PTR_BAD_PAGE and instead return NULL to indicate an errorSean Christopherson
Remove KVM_ERR_PTR_BAD_PAGE and instead return NULL, as "bad page" is just a leftover bit of weirdness from days of old when KVM stuffed a "bad" page into the guest instead of actually handling missing pages. See commit cea7bb21280e ("KVM: MMU: Make gfn_to_page() always safe"). Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-2-seanjc@google.com>
2024-10-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
Cross-merge networking fixes after downstream PR. No conflicts and no adjacent changes. Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-23powerpc: Adjust adding stack protector flags to KBUILD_CLAGS for clangNathan Chancellor
After fixing the HAVE_STACKPROTECTER checks for clang's in-progress per-task stack protector support [1], the build fails during prepare0 because '-mstack-protector-guard-offset' has not been added to KBUILD_CFLAGS yet but the other '-mstack-protector-guard' flags have. clang: error: '-mstack-protector-guard=tls' is used without '-mstack-protector-guard-offset', and there is no default clang: error: '-mstack-protector-guard=tls' is used without '-mstack-protector-guard-offset', and there is no default make[4]: *** [scripts/Makefile.build:229: scripts/mod/empty.o] Error 1 make[4]: *** [scripts/Makefile.build:102: scripts/mod/devicetable-offsets.s] Error 1 Mirror other architectures and add all '-mstack-protector-guard' flags to KBUILD_CFLAGS atomically during stack_protector_prepare, which resolves the issue and allows clang's implementation to fully work with the kernel. Cc: stable@vger.kernel.org # 6.1+ Link: https://github.com/llvm/llvm-project/pull/110928 [1] Reviewed-by: Keith Packard <keithp@keithp.com> Tested-by: Keith Packard <keithp@keithp.com> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241009-powerpc-fix-stackprotector-test-clang-v2-2-12fb86b31857@kernel.org
2024-10-23powerpc: Fix stack protector Kconfig test for clangNathan Chancellor
Clang's in-progress per-task stack protector support [1] does not work with the current Kconfig checks because '-mstack-protector-guard-offset' is not provided, unlike all other architecture Kconfig checks. $ fd Kconfig -x rg -l mstack-protector-guard-offset ./arch/arm/Kconfig ./arch/riscv/Kconfig ./arch/arm64/Kconfig This produces an error from clang, which is interpreted as the flags not being supported at all when they really are. $ clang --target=powerpc64-linux-gnu \ -mstack-protector-guard=tls \ -mstack-protector-guard-reg=r13 \ -c -o /dev/null -x c /dev/null clang: error: '-mstack-protector-guard=tls' is used without '-mstack-protector-guard-offset', and there is no default This argument will always be provided by the build system, so mirror other architectures and use '-mstack-protector-guard-offset=0' for testing support, which fixes the issue for clang and does not regress support with GCC. Even with the first problem addressed, the 32-bit test continues to fail because Kbuild uses the powerpc64le-linux-gnu target for clang and nothing flips the target to 32-bit, resulting in an error about an invalid register valid: $ clang --target=powerpc64le-linux-gnu \ -mstack-protector-guard=tls -mstack-protector-guard-reg=r2 \ -mstack-protector-guard-offset=0 \ -x c -c -o /dev/null /dev/null clang: error: invalid value 'r2' in 'mstack-protector-guard-reg=', expected one of: r13 While GCC allows arbitrary registers, the implementation of '-mstack-protector-guard=tls' in LLVM shares the same code path as the user space thread local storage implementation, which uses a fixed register (2 for 32-bit and 13 for 62-bit), so the command line parsing enforces this limitation. Use the Kconfig macro '$(m32-flag)', which expands to '-m32' when supported, in the stack protector support cc-option call to properly switch the target to a 32-bit one, which matches what happens in Kbuild. While the 64-bit macro does not strictly need it, add the equivalent 64-bit option for symmetry. Cc: stable@vger.kernel.org # 6.1+ Link: https://github.com/llvm/llvm-project/pull/110928 [1] Reviewed-by: Keith Packard <keithp@keithp.com> Tested-by: Keith Packard <keithp@keithp.com> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241009-powerpc-fix-stackprotector-test-clang-v2-1-12fb86b31857@kernel.org
2024-10-23book3s64/hash: Early detect debug_pagealloc size requirementRitesh Harjani (IBM)
Add hash_supports_debug_pagealloc() helper to detect whether debug_pagealloc can be supported on hash or not. This checks for both, whether debug_pagealloc config is enabled and the linear map should fit within rma_size/4 region size. This can then be used early during htab_init_page_sizes() to decide linear map pagesize if hash supports either debug_pagealloc or kfence. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/c33c6691b2a2cf619cc74ac100118ca4dbf21a48.1729271995.git.ritesh.list@gmail.com
2024-10-23book3s64/hash: Disable kfence if not early initRitesh Harjani (IBM)
Enable kfence on book3s64 hash only when early init is enabled. This is because, kfence could cause the kernel linear map to be mapped at PAGE_SIZE level instead of 16M (which I guess we don't want). Also currently there is no way to - 1. Make multiple page size entries for the SLB used for kernel linear map. 2. No easy way of getting the hash slot details after the page table mapping for kernel linear setup. So even if kfence allocate the pool in late init, we won't be able to get the hash slot details in kfence linear map. Thus this patch disables kfence on hash if kfence early init is not enabled. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/4a6eea8cfd1cd28fccfae067026bff30cbec1d4b.1729271995.git.ritesh.list@gmail.com
2024-10-23book3s64/radix: Refactoring common kfence related functionsRitesh Harjani (IBM)
Both radix and hash on book3s requires to detect if kfence early init is enabled or not. Hash needs to disable kfence if early init is not enabled because with kfence the linear map is mapped using PAGE_SIZE rather than 16M mapping. We don't support multiple page sizes for slb entry used for kernel linear map in book3s64. This patch refactors out the common functions required to detect kfence early init is enabled or not. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/f4a787224fbe5bb787158ace579780c0257f6602.1729271995.git.ritesh.list@gmail.com
2024-10-23book3s64/hash: Add kfence functionalityRitesh Harjani (IBM)
Now that linear map functionality of debug_pagealloc is made generic, enable kfence to use this generic infrastructure. 1. Define kfence related linear map variables. - u8 *linear_map_kf_hash_slots; - unsigned long linear_map_kf_hash_count; - DEFINE_RAW_SPINLOCK(linear_map_kf_hash_lock); 2. The linear map size allocated in RMA region is quite small (KFENCE_POOL_SIZE >> PAGE_SHIFT) which is 512 bytes by default. 3. kfence pool memory is reserved using memblock_phys_alloc() which has can come from anywhere. (default 255 objects => ((1+255) * 2) << PAGE_SHIFT = 32MB) 4. The hash slot information for kfence memory gets added in linear map in hash_linear_map_add_slot() (which also adds for debug_pagealloc). Reported-by: Pavithra Prakash <pavrampu@linux.vnet.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/5c2b61941b344077a2b8654dab46efa0322af3af.1729271995.git.ritesh.list@gmail.com
2024-10-23book3s64/hash: Disable debug_pagealloc if it requires more memoryRitesh Harjani (IBM)
Make size of the linear map to be allocated in RMA region to be of ppc64_rma_size / 4. If debug_pagealloc requires more memory than that then do not allocate any memory and disable debug_pagealloc. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/e1ef66f32a1fe63bcbb89d5c11d86c65beef5ded.1729271995.git.ritesh.list@gmail.com
2024-10-23book3s64/hash: Make kernel_map_linear_page() genericRitesh Harjani (IBM)
Currently kernel_map_linear_page() function assumes to be working on linear_map_hash_slots array. But since in later patches we need a separate linear map array for kfence, hence make kernel_map_linear_page() take a linear map array and lock in it's function argument. This is needed to separate out kfence from debug_pagealloc infrastructure. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/5b67df7b29e68d7c78d6fc1f42d41137299bac6b.1729271995.git.ritesh.list@gmail.com
2024-10-23book3s64/hash: Refactor hash__kernel_map_pages() functionRitesh Harjani (IBM)
This refactors hash__kernel_map_pages() function to call hash_debug_pagealloc_map_pages(). This will come useful when we will add kfence support. No functionality changes in this patch. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/0cb8ddcccdcf61ea06ab4d92aacd770c16cc0f2c.1729271995.git.ritesh.list@gmail.com
2024-10-23book3s64/hash: Add hash_debug_pagealloc_alloc_slots() functionRitesh Harjani (IBM)
This adds hash_debug_pagealloc_alloc_slots() function instead of open coding that in htab_initialize(). This is required since we will be separating the kfence functionality to not depend upon debug_pagealloc. Now that everything required for debug_pagealloc is under a #ifdef config. Bring in linear_map_hash_slots and linear_map_hash_count variables under the same config too. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/d1d5aabe1e4c693a983e59ccf3de08e3c28c5161.1729271995.git.ritesh.list@gmail.com
2024-10-23book3s64/hash: Add hash_debug_pagealloc_add_slot() functionRitesh Harjani (IBM)
This adds hash_debug_pagealloc_add_slot() function instead of open coding that in htab_bolt_mapping(). This is required since we will be separating kfence functionality to not depend upon debug_pagealloc. No functionality change in this patch. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/026f0aaa1dddd89154dc8d20ceccfca4f63ccf79.1729271995.git.ritesh.list@gmail.com
2024-10-23book3s64/hash: Refactor kernel linear map related callsRitesh Harjani (IBM)
This just brings all linear map related handling at one place instead of having those functions scattered in hash_utils file. Makes it easy for review. No functionality changes in this patch. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/56c610310aa50b5417976a39c5f15b78bc76c764.1729271995.git.ritesh.list@gmail.com
2024-10-23book3s64/hash: Remove kfence support temporarilyRitesh Harjani (IBM)
Kfence on book3s Hash on pseries is anyways broken. It fails to boot due to RMA size limitation. That is because, kfence with Hash uses debug_pagealloc infrastructure. debug_pagealloc allocates linear map for entire dram size instead of just kfence relevant objects. This means for 16TB of DRAM it will require (16TB >> PAGE_SHIFT) which is 256MB which is half of RMA region on P8. crash kernel reserves 256MB and we also need 2048 * 16KB * 3 for emergency stack and some more for paca allocations. That means there is not enough memory for reserving the full linear map in the RMA region, if the DRAM size is too big (>=16TB) (The issue is seen above 8TB with crash kernel 256 MB reservation). Now Kfence does not require linear memory map for entire DRAM. It only needs for kfence objects. So this patch temporarily removes the kfence functionality since debug_pagealloc code needs some refactoring. We will bring in kfence on Hash support in later patches. Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/1761bc39674473c8878dedca15e0d9a0d3a1b528.1729271995.git.ritesh.list@gmail.com
2024-10-23powerpc/mm/fault: Fix kfence page fault reportingRitesh Harjani (IBM)
copy_from_kernel_nofault() can be called when doing read of /proc/kcore. /proc/kcore can have some unmapped kfence objects which when read via copy_from_kernel_nofault() can cause page faults. Since *_nofault() functions define their own fixup table for handling fault, use that instead of asking kfence to handle such faults. Hence we search the exception tables for the nip which generated the fault. If there is an entry then we let the fixup table handler handle the page fault by returning an error from within ___do_page_fault(). This can be easily triggered if someone tries to do dd from /proc/kcore. eg. dd if=/proc/kcore of=/dev/null bs=1M Some example false negatives: =============================== BUG: KFENCE: invalid read in copy_from_kernel_nofault+0x9c/0x1a0 Invalid read at 0xc0000000fdff0000: copy_from_kernel_nofault+0x9c/0x1a0 0xc00000000665f950 read_kcore_iter+0x57c/0xa04 proc_reg_read_iter+0xe4/0x16c vfs_read+0x320/0x3ec ksys_read+0x90/0x154 system_call_exception+0x120/0x310 system_call_vectored_common+0x15c/0x2ec BUG: KFENCE: use-after-free read in copy_from_kernel_nofault+0x9c/0x1a0 Use-after-free read at 0xc0000000fe050000 (in kfence-#2): copy_from_kernel_nofault+0x9c/0x1a0 0xc00000000665f950 read_kcore_iter+0x57c/0xa04 proc_reg_read_iter+0xe4/0x16c vfs_read+0x320/0x3ec ksys_read+0x90/0x154 system_call_exception+0x120/0x310 system_call_vectored_common+0x15c/0x2ec Fixes: 90cbac0e995d ("powerpc: Enable KFENCE for PPC32") Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reported-by: Disha Goel <disgoel@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/a411788081d50e3b136c6270471e35aba3dfafa3.1729271995.git.ritesh.list@gmail.com
2024-10-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
Cross-merge networking fixes after downstream PR (net-6.12-rc4). Conflicts: 107a034d5c1e ("net/mlx5: qos: Store rate groups in a qos domain") 1da9cfd6c41c ("net/mlx5: Unregister notifier on eswitch init failure") Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-21powerpc/fadump: Move fadump_cma_init to setup_arch() after initmem_init()Ritesh Harjani (IBM)
During early init CMA_MIN_ALIGNMENT_BYTES can be PAGE_SIZE, since pageblock_order is still zero and it gets initialized later during initmem_init() e.g. setup_arch() -> initmem_init() -> sparse_init() -> set_pageblock_order() One such use case where this causes issue is - early_setup() -> early_init_devtree() -> fadump_reserve_mem() -> fadump_cma_init() This causes CMA memory alignment check to be bypassed in cma_init_reserved_mem(). Then later cma_activate_area() can hit a VM_BUG_ON_PAGE(pfn & ((1 << order) - 1)) if the reserved memory area was not pageblock_order aligned. Fix it by moving the fadump_cma_init() after initmem_init(), where other such cma reservations also gets called. <stack trace> ============== page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10010 flags: 0x13ffff800000000(node=1|zone=0|lastcpupid=0x7ffff) CMA raw: 013ffff800000000 5deadbeef0000100 5deadbeef0000122 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: VM_BUG_ON_PAGE(pfn & ((1 << order) - 1)) ------------[ cut here ]------------ kernel BUG at mm/page_alloc.c:778! Call Trace: __free_one_page+0x57c/0x7b0 (unreliable) free_pcppages_bulk+0x1a8/0x2c8 free_unref_page_commit+0x3d4/0x4e4 free_unref_page+0x458/0x6d0 init_cma_reserved_pageblock+0x114/0x198 cma_init_reserved_areas+0x270/0x3e0 do_one_initcall+0x80/0x2f8 kernel_init_freeable+0x33c/0x530 kernel_init+0x34/0x26c ret_from_kernel_user_thread+0x14/0x1c Fixes: 11ac3e87ce09 ("mm: cma: use pageblock_order as the single alignment") Suggested-by: David Hildenbrand <david@redhat.com> Reported-by: Sachin P Bappalige <sachinpb@linux.ibm.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/3ae208e48c0d9cefe53d2dc4f593388067405b7d.1729146153.git.ritesh.list@gmail.com
2024-10-21powerpc/fadump: Reserve page-aligned boot_memory_size during fadump_reserve_memRitesh Harjani (IBM)
This patch refactors all CMA related initialization and alignment code to within fadump_cma_init() which gets called in the end. This also means that we keep [reserve_dump_area_start, boot_memory_size] page aligned during fadump_reserve_mem(). Then later in fadump_cma_init() we extract the aligned chunk and provide it to CMA. This inherently also fixes an issue in the current code where the reserve_dump_area_start is not aligned when the physical memory can have holes and the suitable chunk starts at an unaligned boundary. After this we should be able to call fadump_cma_init() independently later in setup_arch() where pageblock_order is non-zero. Suggested-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/805d6b900968fb9402ad8f4e4775597db42085c4.1729146153.git.ritesh.list@gmail.com
2024-10-21powerpc/fadump: Refactor and prepare fadump_cma_init for late initRitesh Harjani (IBM)
We anyway don't use any return values from fadump_cma_init(). Since fadump_reserve_mem() from where fadump_cma_init() gets called today, already has the required checks. This patch makes this function return type as void. Let's also handle extra cases like return if fadump_supported is false or dump_active, so that in later patches we can call fadump_cma_init() separately from setup_arch(). Acked-by: Hari Bathini <hbathini@linux.ibm.com> Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/a2afc3d6481a87a305e89cfc4a3f3d2a0b8ceab3.1729146153.git.ritesh.list@gmail.com
2024-10-16powerpc/cell: Switch to irq_get_nr_irqs()Bart Van Assche
Use the irq_get_nr_irqs() function instead of the global variable 'nr_irqs'. Prepare for changing 'nr_irqs' from an exported global variable into a variable with file scope. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Link: https://lore.kernel.org/all/20241015190953.1266194-5-bvanassche@acm.org
2024-10-16powerpc/vdso: Flag VDSO64 entry points as functionsChristophe Leroy
On powerpc64 as shown below by readelf, vDSO functions symbols have type NOTYPE. $ powerpc64-linux-gnu-readelf -a arch/powerpc/kernel/vdso/vdso64.so.dbg ELF Header: Magic: 7f 45 4c 46 02 02 01 00 00 00 00 00 00 00 00 00 Class: ELF64 Data: 2's complement, big endian Version: 1 (current) OS/ABI: UNIX - System V ABI Version: 0 Type: DYN (Shared object file) Machine: PowerPC64 Version: 0x1 ... Symbol table '.dynsym' contains 12 entries: Num: Value Size Type Bind Vis Ndx Name ... 1: 0000000000000524 84 NOTYPE GLOBAL DEFAULT 8 __[...]@@LINUX_2.6.15 ... 4: 0000000000000000 0 OBJECT GLOBAL DEFAULT ABS LINUX_2.6.15 5: 00000000000006c0 48 NOTYPE GLOBAL DEFAULT 8 __[...]@@LINUX_2.6.15 Symbol table '.symtab' contains 56 entries: Num: Value Size Type Bind Vis Ndx Name ... 45: 0000000000000000 0 OBJECT GLOBAL DEFAULT ABS LINUX_2.6.15 46: 00000000000006c0 48 NOTYPE GLOBAL DEFAULT 8 __kernel_getcpu 47: 0000000000000524 84 NOTYPE GLOBAL DEFAULT 8 __kernel_clock_getres To overcome that, commit ba83b3239e65 ("selftests: vDSO: fix vDSO symbols lookup for powerpc64") was applied to have selftests also look for NOTYPE symbols, but the correct fix should be to flag VDSO entry points as functions. The original commit that brought VDSO support into powerpc/64 has the following explanation: Note that the symbols exposed by the vDSO aren't "normal" function symbols, apps can't be expected to link against them directly, the vDSO's are both seen as if they were linked at 0 and the symbols just contain offsets to the various functions. This is done on purpose to avoid a relocation step (ppc64 functions normally have descriptors with abs addresses in them). When glibc uses those functions, it's expected to use it's own trampolines that know how to reach them. The descriptors it's talking about are the OPD function descriptors used on ABI v1 (big endian). But it would be more correct for a text symbol to have type function, even if there's no function descriptor for it. glibc has a special case already for handling the VDSO symbols which creates a fake opd pointing at the kernel symbol. So changing the VDSO symbol type to function shouldn't affect that. For ABI v2, there is no function descriptors and VDSO functions can safely have function type. So lets flag VDSO entry points as functions and revert the selftest change. Link: https://github.com/mpe/linux-fullhistory/commit/5f2dd691b62da9d9cc54b938f8b29c22c93cb805 Fixes: ba83b3239e65 ("selftests: vDSO: fix vDSO symbols lookup for powerpc64") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-By: Segher Boessenkool <segher@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/b6ad2f1ee9887af3ca5ecade2a56f4acda517a85.1728512263.git.christophe.leroy@csgroup.eu
2024-10-16powerpc/vdso: Implement __arch_get_vdso_rng_data()Christophe Leroy
VDSO time functions do not call any other function, so they don't need to save/restore LR. However, retrieving the address of VDSO data page requires using LR hence saving then restoring it, which can be heavy on some CPUs. On the other hand, VDSO functions on powerpc are not standard functions and require a wrapper function to call C VDSO functions. And that wrapper has to save and restore LR in order to call the C VDSO function, so retrieving VDSO data page address in that wrapper doesn't require additional save/restore of LR. For random VDSO functions it is a bit different. Because the function calls __arch_chacha20_blocks_nostack(), it saves and restores LR. Retrieving VDSO data page address can then be done there without additional save/restore of LR. So lets implement __arch_get_vdso_rng_data() and simplify the wrapper. It starts paving the way for the day powerpc will implement a more standard ABI for VDSO functions. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/a1a9bd0df508f1b5c04684b7366940577dfc6262.1727858295.git.christophe.leroy@csgroup.eu
2024-10-16powerpc/vdso: Add a page for non-time dataChristophe Leroy
The page containing VDSO time data is swapped with the one containing TIME namespace data when a process uses a non-root time namespace. For other data like powerpc specific data and RNG data, it means tracking whether time namespace is the root one or not to know which page to use. Simplify the logic behind by moving time data out of first data page so that the first data page which contains everything else always remains the first page. Time data is in the second or third page depending on selected time namespace. While we are playing with get_datapage macro, directly take into account the data offset inside the macro instead of adding that offset afterwards. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/0557d3ec898c1d0ea2fc59fa8757618e524c5d94.1727858295.git.christophe.leroy@csgroup.eu
2024-10-16KVM: PPC: replace call_rcu by kfree_rcu for simple kmem_cache_free callbackJulia Lawall
Since SLOB was removed and since commit 6c6c47b063b5 ("mm, slab: call kvfree_rcu_barrier() from kmem_cache_destroy()"), it is not necessary to use call_rcu when the callback only performs kmem_cache_free. Use kfree_rcu() directly. The changes were made using Coccinelle. Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Acked-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241013201704.49576-14-Julia.Lawall@inria.fr
2024-10-16powerpc/rtas: Use fsleep() to minimize additional sleep durationAnna-Maria Behnsen
When commit 38f7b7067dae ("powerpc/rtas: rtas_busy_delay() improvements") was introduced, documentation about proper usage of sleep related functions was outdated. The commit message references the usage of a HZ=100 system. When using a 20ms sleep duration on such a system and therefore using msleep(), the possible additional slack will be +10ms. When the system is configured with HZ=100 the granularity of a jiffy and of a bucket of the lowest timer wheel level is 10ms. To make sure a timer will not expire early (when queueing of the timer races with an concurrent update of jiffies), timers are always queued into the next bucket. This is the reason for the maximal possible slack of 10ms. fsleep() limits the maximal possible slack to 25% by making threshold between usleep_range() and msleep() HZ dependent. As soon as the accuracy of msleep() is sufficient, the less expensive timer list timer based sleeping function is used instead of the more expensive hrtimer based usleep_range() function. The udelay() will not be used in this specific usecase as the lowest sleep length is larger than 1 millisecond. Use fsleep() directly instead of using an own heuristic for the best sleeping mechanism to use. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Link: https://lore.kernel.org/all/20241014-devel-anna-maria-b4-timers-flseep-v3-13-dc8b907cb62f@linutronix.de
2024-10-16powerpc/powernv: Free name on error in opal_event_init()Michael Ellerman
In opal_event_init() if request_irq() fails name is not freed, leading to a memory leak. The code only runs at boot time, there's no way for a user to trigger it, so there's no security impact. Fix the leak by freeing name in the error path. Reported-by: 2639161967 <2639161967@qq.com> Closes: https://lore.kernel.org/linuxppc-dev/87wmjp3wig.fsf@mail.lhotse Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20240920093520.67997-1-mpe@ellerman.id.au