<feed xmlns='http://www.w3.org/2005/Atom'>
<title>pm24.git/arch/arm64/lib, branch master</title>
<subtitle>Unnamed repository; edit this file 'description' to name the repository.
</subtitle>
<id>https://git.kobert.dev/pm24.git/atom?h=master</id>
<link rel='self' href='https://git.kobert.dev/pm24.git/atom?h=master'/>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/'/>
<updated>2024-11-14T12:07:28Z</updated>
<entry>
<title>Merge branch 'for-next/mops' into for-next/core</title>
<updated>2024-11-14T12:07:28Z</updated>
<author>
<name>Catalin Marinas</name>
<email>catalin.marinas@arm.com</email>
</author>
<published>2024-11-14T12:07:28Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=437330d90c507be109a161667a77eaf61be0edac'/>
<id>urn:sha1:437330d90c507be109a161667a77eaf61be0edac</id>
<content type='text'>
* for-next/mops:
  : More FEAT_MOPS (memcpy instructions) uses - in-kernel routines
  arm64: mops: Document requirements for hypervisors
  arm64: lib: Use MOPS for copy_page() and clear_page()
  arm64: lib: Use MOPS for memcpy() routines
  arm64: mops: Document booting requirement for HCR_EL2.MCE2
  arm64: mops: Handle MOPS exceptions from EL1
  arm64: probes: Disable kprobes/uprobes on MOPS instructions

# Conflicts:
#	arch/arm64/kernel/entry-common.c
</content>
</entry>
<entry>
<title>arm64/crc32: Implement 4-way interleave using PMULL</title>
<updated>2024-10-22T10:54:43Z</updated>
<author>
<name>Ard Biesheuvel</name>
<email>ardb@kernel.org</email>
</author>
<published>2024-10-18T07:53:51Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=a6478d69cf56d5deb4c28a6486376d9c7895abec'/>
<id>urn:sha1:a6478d69cf56d5deb4c28a6486376d9c7895abec</id>
<content type='text'>
Now that kernel-mode NEON no longer disables preemption, using FP/SIMD
in library code that is not obviously part of the crypto subsystem is
no longer problematic, as it does not incur unexpected latencies.

So accelerate the CRC-32 library code on arm64 to use a 4-way
interleave, using PMULL instructions to implement the folding.

On Apple M2, this results in a speedup of 2-2.8x when using input
sizes of 1k-8k. For smaller sizes, the overhead of preserving and
restoring the FP/SIMD register file may not be worth it, so 1k is used
as a threshold for choosing this code path.
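
As a rough sketch of the resulting dispatch (a hand-written
illustration; the helper names below are assumptions, not the actual
symbols in the patch):

  #define PMULL_MIN_LEN	1024	/* below this, FP/SIMD save/restore dominates */

  u32 crc32_le(u32 crc, const u8 *p, size_t len)
  {
  	/* fall back to the scalar CRC32 instructions for short inputs */
  	if (len &lt; PMULL_MIN_LEN || !may_use_simd())
  		return crc32_le_scalar(crc, p, len);
  	return crc32_le_pmull_4way(crc, p, len);	/* 4-way PMULL folding */
  }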

The coefficient tables were generated using code provided by Eric [0].

[0] https://github.com/ebiggers/libdeflate/blob/master/scripts/gen_crc32_multipliers.c

Cc: Eric Biggers &lt;ebiggers@kernel.org&gt;
Signed-off-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Reviewed-by: Eric Biggers &lt;ebiggers@google.com&gt;
Link: https://lore.kernel.org/r/20241018075347.2821102-8-ardb+git@google.com
Signed-off-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
</content>
</entry>
<entry>
<title>arm64/crc32: Reorganize bit/byte ordering macros</title>
<updated>2024-10-22T10:54:43Z</updated>
<author>
<name>Ard Biesheuvel</name>
<email>ardb@kernel.org</email>
</author>
<published>2024-10-18T07:53:50Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=b98b23e19492f4009070761c53b755f623f60e49'/>
<id>urn:sha1:b98b23e19492f4009070761c53b755f623f60e49</id>
<content type='text'>
In preparation for a new user, reorganize the bit/byte ordering macros
that are used to parameterize the crc32 template code and instantiate
CRC-32, CRC-32c and 'big endian' CRC-32.

Signed-off-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Reviewed-by: Eric Biggers &lt;ebiggers@google.com&gt;
Link: https://lore.kernel.org/r/20241018075347.2821102-7-ardb+git@google.com
Signed-off-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
</content>
</entry>
<entry>
<title>arm64/lib: Handle CRC-32 alternative in C code</title>
<updated>2024-10-22T10:54:43Z</updated>
<author>
<name>Ard Biesheuvel</name>
<email>ardb@kernel.org</email>
</author>
<published>2024-10-18T07:53:49Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=fc7454107d1b7c27bb98d3b109e5f44a8a46d7f8'/>
<id>urn:sha1:fc7454107d1b7c27bb98d3b109e5f44a8a46d7f8</id>
<content type='text'>
In preparation for adding another code path for performing CRC-32, move
the alternative patching for ARM64_HAS_CRC32 into C code. The logic for
deciding whether to use this new code path will be implemented in C too.
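
A sketch of the C-side dispatch (assuming the alternative_has_cap_likely()
cpucap helper; the other function names are illustrative, and the exact
code in the patch may differ):

  u32 crc32_le_arch(u32 crc, const u8 *p, size_t len)
  {
  	/* patched via alternatives once cpucaps are finalized */
  	if (!alternative_has_cap_likely(ARM64_HAS_CRC32))
  		return crc32_le_base(crc, p, len);	/* generic table-based code */
  	return crc32_le_scalar(crc, p, len);	/* CRC32X/CRC32W instructions */
  }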

Reviewed-by: Eric Biggers &lt;ebiggers@google.com&gt;
Signed-off-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Link: https://lore.kernel.org/r/20241018075347.2821102-6-ardb+git@google.com
Signed-off-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
</content>
</entry>
<entry>
<title>arm64: lib: Use MOPS for copy_page() and clear_page()</title>
<updated>2024-10-17T15:42:51Z</updated>
<author>
<name>Kristina Martsenko</name>
<email>kristina.martsenko@arm.com</email>
</author>
<published>2024-09-30T16:10:51Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=ce6b5ff5f16dd9267d62d09b3af3f0c7dc3c24f0'/>
<id>urn:sha1:ce6b5ff5f16dd9267d62d09b3af3f0c7dc3c24f0</id>
<content type='text'>
Similarly to what was done to the memcpy() routines, make copy_page()
and clear_page() also use the Armv8.8 FEAT_MOPS instructions.

Note: For copy_page() this uses the CPY* instructions instead of CPYF*,
as CPYF* doesn't allow src and dst to be equal. It's not clear whether
copy_page() needs to allow equal src and dst, but it has worked so far
with the current implementation and there is no documentation forbidding
it.

Note: the unoptimized version of copy_page() in assembler.h is left as
it is.
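
For illustration, a minimal C sketch of the three-instruction CPY*
sequence (hand-written, not the kernel's actual copy_page.S; assumes an
assembler that understands the MOPS extension):

  static inline void mops_copy_page(void *dst, const void *src)
  {
  	unsigned long n = 4096;	/* PAGE_SIZE */

  	/* CPYP/CPYM/CPYE: prologue, main and epilogue of one CPY* copy */
  	asm volatile(".arch_extension mops\n"
  		     "cpyp [%0]!, [%1]!, %2!\n"
  		     "cpym [%0]!, [%1]!, %2!\n"
  		     "cpye [%0]!, [%1]!, %2!"
  		     : "+r" (dst), "+r" (src), "+r" (n)
  		     :
  		     : "memory");
  }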

Signed-off-by: Kristina Martsenko &lt;kristina.martsenko@arm.com&gt;
Link: https://lore.kernel.org/r/20240930161051.3777828-6-kristina.martsenko@arm.com
Signed-off-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
</content>
</entry>
<entry>
<title>arm64: lib: Use MOPS for memcpy() routines</title>
<updated>2024-10-17T15:42:51Z</updated>
<author>
<name>Kristina Martsenko</name>
<email>kristina.martsenko@arm.com</email>
</author>
<published>2024-09-30T16:10:50Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=836ed3c4e473fef9e0814a2ba6dd40f9656c03f1'/>
<id>urn:sha1:836ed3c4e473fef9e0814a2ba6dd40f9656c03f1</id>
<content type='text'>
Make memcpy(), memmove() and memset() use the Armv8.8 FEAT_MOPS
instructions when implemented on the CPU.

The CPY*/SET* instructions copy or set a block of memory of arbitrary
size and alignment. They can be interrupted by the CPU and the copying
resumed later. Their performance is expected to be close to the best
generic copy/set sequence of loads/stores for a given CPU. Using them in
the kernel's copy/set routines therefore avoids the need to periodically
rewrite the routines to optimize for new microarchitectures. It could
also lead to a performance improvement for some CPUs and systems.
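
As an illustration of the shape of such a sequence, a hand-written
memset()-style sketch (not the kernel's actual memset.S; assumes an
assembler that understands the MOPS extension):

  static inline void mops_set(void *dst, int c, unsigned long n)
  {
  	unsigned long val = (unsigned char)c;

  	/* SETP/SETM/SETE: prologue, main and epilogue of one SET* operation */
  	asm volatile(".arch_extension mops\n"
  		     "setp [%0]!, %1!, %2\n"
  		     "setm [%0]!, %1!, %2\n"
  		     "sete [%0]!, %1!, %2"
  		     : "+r" (dst), "+r" (n)
  		     : "r" (val)
  		     : "memory");
  }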

With this change the kernel will always use the instructions if they are
implemented on the CPU (and have not been disabled by the arm64.nomops
command line parameter). When they are not implemented, the usual
routines are used instead (patched in via alternatives). Note: we patch a
B/NOP rather than the whole sequence, to avoid executing a partially
patched sequence in case the compiler generates a mem*() call inside the
alternatives patching code.

Note that MOPS instructions have relaxed behavior on Device memory, but
it is expected that these routines are not generally used on MMIO.

Note: For memcpy(), this uses the CPY* instructions instead of CPYF*, as
CPY* allows overlaps between the source and destination buffers and,
despite this contradicting the C standard, compilers require that memcpy()
work with exactly overlapping source and destination:
  https://gcc.gnu.org/onlinedocs/gcc/Standards.html#C-Language
  https://reviews.llvm.org/D86993

Signed-off-by: Kristina Martsenko &lt;kristina.martsenko@arm.com&gt;
Link: https://lore.kernel.org/r/20240930161051.3777828-5-kristina.martsenko@arm.com
Signed-off-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
</content>
</entry>
<entry>
<title>arm64: crypto: use CC_FLAGS_FPU for NEON CFLAGS</title>
<updated>2024-05-19T21:36:18Z</updated>
<author>
<name>Samuel Holland</name>
<email>samuel.holland@sifive.com</email>
</author>
<published>2024-03-29T07:18:20Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=7177089525d97bd8d7fefd6a9406f45d818410e0'/>
<id>urn:sha1:7177089525d97bd8d7fefd6a9406f45d818410e0</id>
<content type='text'>
Now that CC_FLAGS_FPU is exported and can be used anywhere in the source
tree, use it instead of duplicating the flags here.

Link: https://lkml.kernel.org/r/20240329072441.591471-6-samuel.holland@sifive.com
Signed-off-by: Samuel Holland &lt;samuel.holland@sifive.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Acked-by: Christian König &lt;christian.koenig@amd.com&gt; 
Cc: Alex Deucher &lt;alexander.deucher@amd.com&gt;
Cc: Borislav Petkov (AMD) &lt;bp@alien8.de&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Huacai Chen &lt;chenhuacai@kernel.org&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Masahiro Yamada &lt;masahiroy@kernel.org&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: Nathan Chancellor &lt;nathan@kernel.org&gt;
Cc: Nicolas Schier &lt;nicolas@fjasle.eu&gt;
Cc: Palmer Dabbelt &lt;palmer@rivosinc.com&gt;
Cc: Russell King &lt;linux@armlinux.org.uk&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: WANG Xuerui &lt;git@xen0n.name&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>arm64, bpf: add internal-only MOV instruction to resolve per-CPU addrs</title>
<updated>2024-05-12T23:54:34Z</updated>
<author>
<name>Puranjay Mohan</name>
<email>puranjay12@gmail.com</email>
</author>
<published>2024-05-02T15:18:53Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=7a4c32222b0e14349a6311e72bf6ebd3e1d1064b'/>
<id>urn:sha1:7a4c32222b0e14349a6311e72bf6ebd3e1d1064b</id>
<content type='text'>
Support an instruction for resolving the absolute addresses of per-CPU
data from their per-CPU offsets. This instruction is internal-only;
users are not allowed to use it directly. For now it is only used for
internal inlining optimizations between the BPF verifier and the BPF
JITs.

Since commit 7158627686f0 ("arm64: percpu: implement optimised pcpu
access using tpidr_el1"), the per-cpu offset for the CPU is stored in
the tpidr_el1/2 register of that CPU.

To support this BPF instruction in the ARM64 JIT, the following ARM64
instructions are emitted:

mov dst, src		// Move src to dst, if src != dst
mrs tmp, tpidr_el1/2	// Move the per-CPU offset of the current CPU into tmp
add dst, dst, tmp	// Add the per-CPU offset to dst
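
In plain C, the emitted sequence computes the following (an
illustrative sketch; bpf_percpu_addr() is a hypothetical name):

  static inline void *bpf_percpu_addr(unsigned long off)
  {
  	unsigned long base;

  	/* tpidr_el1 holds the per-CPU base offset of the current CPU */
  	asm("mrs %0, tpidr_el1" : "=r" (base));
  	return (void *)(off + base);
  }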

To measure the performance improvement provided by this change, the
benchmark in [1] was used:

Before:
glob-arr-inc   :   23.597 ± 0.012M/s
arr-inc        :   23.173 ± 0.019M/s
hash-inc       :   12.186 ± 0.028M/s

After:
glob-arr-inc   :   23.819 ± 0.034M/s
arr-inc        :   23.285 ± 0.017M/s
hash-inc       :   12.419 ± 0.011M/s

[1] https://github.com/anakryiko/linux/commit/8dec900975ef

Signed-off-by: Puranjay Mohan &lt;puranjay12@gmail.com&gt;
Acked-by: Andrii Nakryiko &lt;andrii@kernel.org&gt;
Link: https://lore.kernel.org/r/20240502151854.9810-4-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>arm64: Get rid of ARM64_HAS_NO_HW_PREFETCH</title>
<updated>2023-12-05T12:02:52Z</updated>
<author>
<name>Marc Zyngier</name>
<email>maz@kernel.org</email>
</author>
<published>2023-11-22T13:37:54Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=103423ad7e56d6c756738823c332c414b07899e6'/>
<id>urn:sha1:103423ad7e56d6c756738823c332c414b07899e6</id>
<content type='text'>
Back in 2016, it was argued that implementations lacking a HW
prefetcher could be helped by sprinkling a number of PRFM
instructions in strategic locations.

In 2023, the one platform that presumably needed this hack is no
longer in active use (let alone maintained), and a quick
experiment shows that dropping this hack leads to only a 0.4% drop
on a full kernel compilation (tested on an MT30-GS0 48-CPU system).

Given that this is pretty much in the noise department and that
it may give odd ideas to other implementers, drop the hack for
good.

Suggested-by: Will Deacon &lt;will@kernel.org&gt;
Suggested-by: Mark Rutland &lt;mark.rutland@arm.com&gt;
Signed-off-by: Marc Zyngier &lt;maz@kernel.org&gt;
Acked-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Link: https://lore.kernel.org/r/20231122133754.1240687-1-maz@kernel.org
Signed-off-by: Will Deacon &lt;will@kernel.org&gt;
</content>
</entry>
<entry>
<title>arm64: Avoid cpus_have_const_cap() for ARM64_HAS_WFXT</title>
<updated>2023-10-16T13:17:05Z</updated>
<author>
<name>Mark Rutland</name>
<email>mark.rutland@arm.com</email>
</author>
<published>2023-10-16T10:24:47Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=4c73056e3277f803ccdb7769e55fe7153edb07cc'/>
<id>urn:sha1:4c73056e3277f803ccdb7769e55fe7153edb07cc</id>
<content type='text'>
In __delay() we use cpus_have_const_cap() to check for ARM64_HAS_WFXT,
but this is not necessary and alternative_has_cap() would be preferable.

For historical reasons, cpus_have_const_cap() is more complicated than
it needs to be. Before cpucaps are finalized, it will perform a bitmap
test of the system_cpucaps bitmap, and once cpucaps are finalized it
will use an alternative branch. This used to be necessary to handle some
race conditions in the window between cpucap detection and the
subsequent patching of alternatives and static branches, where different
branches could be out-of-sync with one another (or w.r.t. alternative
sequences). Now that we use alternative branches instead of static
branches, these are all patched atomically w.r.t. one another, and there
are only a handful of cases that need special care in the window between
cpucap detection and alternative patching.

Due to the above, it would be nice to remove cpus_have_const_cap(), and
migrate callers over to alternative_has_cap_*(), cpus_have_final_cap(),
or cpus_have_cap(), depending on their requirements. This will
remove redundant instructions and improve code generation, and will make
it easier to determine how each callsite will behave before, during, and
after alternative patching.

The cpus_have_const_cap() check in __delay() is an optimization to use
WFIT and WFET in preference to busy-polling the counter and/or using
regular WFE and relying upon the architected timer event stream. It is
not necessary to apply this optimization in the window between detecting
the ARM64_HAS_WFXT cpucap and patching alternatives.

This patch replaces the use of cpus_have_const_cap() with
alternative_has_cap_unlikely(), which will avoid generating code to test
the system_cpucaps bitmap and should be better for all subsequent calls
at runtime.
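
The resulting shape of the check, as a rough sketch (assuming the
wfet() wrapper and get_cycles(); the exact code in the patch may
differ):

  static void __delay_cycles(u64 cycles)
  {
  	const u64 start = get_cycles();

  	if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) {
  		/* WFET: wait for an event or until the counter reaches a value */
  		while ((get_cycles() - start) &lt; cycles)
  			wfet(start + cycles);
  	} else {
  		while ((get_cycles() - start) &lt; cycles)
  			cpu_relax();
  	}
  }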

Signed-off-by: Mark Rutland &lt;mark.rutland@arm.com&gt;
Cc: Marc Zyngier &lt;maz@kernel.org&gt;
Cc: Suzuki K Poulose &lt;suzuki.poulose@arm.com&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Signed-off-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
</content>
</entry>
</feed>
