<feed xmlns='http://www.w3.org/2005/Atom'>
<title>pm24.git/arch/arm64/lib, branch master</title>
<subtitle>Unnamed repository; edit this file 'description' to name the repository.
</subtitle>
<id>https://git.kobert.dev/pm24.git/atom?h=master</id>
<link rel='self' href='https://git.kobert.dev/pm24.git/atom?h=master'/>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/'/>
<updated>2024-11-14T12:07:28Z</updated>
<entry>
<title>Merge branch 'for-next/mops' into for-next/core</title>
<updated>2024-11-14T12:07:28Z</updated>
<author>
<name>Catalin Marinas</name>
<email>catalin.marinas@arm.com</email>
</author>
<published>2024-11-14T12:07:28Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=437330d90c507be109a161667a77eaf61be0edac'/>
<id>urn:sha1:437330d90c507be109a161667a77eaf61be0edac</id>
<content type='text'>
* for-next/mops:
  : More FEAT_MOPS (memcpy instructions) uses - in-kernel routines
  arm64: mops: Document requirements for hypervisors
  arm64: lib: Use MOPS for copy_page() and clear_page()
  arm64: lib: Use MOPS for memcpy() routines
  arm64: mops: Document booting requirement for HCR_EL2.MCE2
  arm64: mops: Handle MOPS exceptions from EL1
  arm64: probes: Disable kprobes/uprobes on MOPS instructions

# Conflicts:
#	arch/arm64/kernel/entry-common.c
</content>
</entry>
<entry>
<title>arm64/crc32: Implement 4-way interleave using PMULL</title>
<updated>2024-10-22T10:54:43Z</updated>
<author>
<name>Ard Biesheuvel</name>
<email>ardb@kernel.org</email>
</author>
<published>2024-10-18T07:53:51Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=a6478d69cf56d5deb4c28a6486376d9c7895abec'/>
<id>urn:sha1:a6478d69cf56d5deb4c28a6486376d9c7895abec</id>
<content type='text'>
Now that kernel-mode NEON no longer disables preemption, using FP/SIMD
in library code that is not obviously part of the crypto subsystem is
no longer problematic, as it does not incur unexpected latencies.

So accelerate the CRC-32 library code on arm64 to use a 4-way
interleave, using PMULL instructions to implement the folding.

On Apple M2, this results in a speedup of 2-2.8x when using input
sizes of 1k-8k. For smaller sizes, the overhead of preserving and
restoring the FP/SIMD register file may not be worth it, so 1k is used
as a threshold for choosing this code path.
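
As a rough sketch of the resulting dispatch (a hand-written
illustration; the helper names below are assumptions, not the actual
symbols in the patch):

  #define PMULL_MIN_LEN	1024	/* below this, FP/SIMD save/restore dominates */

  u32 crc32_le(u32 crc, const u8 *p, size_t len)
  {
  	/* fall back to the scalar CRC32 instructions for short inputs */
  	if (len &lt; PMULL_MIN_LEN || !may_use_simd())
  		return crc32_le_scalar(crc, p, len);
  	return crc32_le_pmull_4way(crc, p, len);	/* 4-way PMULL folding */
  }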

The coefficient tables were generated using code provided by Eric [0].

[0] https://github.com/ebiggers/libdeflate/blob/master/scripts/gen_crc32_multipliers.c

Cc: Eric Biggers &lt;ebiggers@kernel.org&gt;
Signed-off-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Reviewed-by: Eric Biggers &lt;ebiggers@google.com&gt;
Link: https://lore.kernel.org/r/20241018075347.2821102-8-ardb+git@google.com
Signed-off-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
</content>
</entry>
<entry>
<title>arm64/crc32: Reorganize bit/byte ordering macros</title>
<updated>2024-10-22T10:54:43Z</updated>
<author>
<name>Ard Biesheuvel</name>
<email>ardb@kernel.org</email>
</author>
<published>2024-10-18T07:53:50Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=b98b23e19492f4009070761c53b755f623f60e49'/>
<id>urn:sha1:b98b23e19492f4009070761c53b755f623f60e49</id>
<content type='text'>
In preparation for a new user, reorganize the bit/byte ordering macros
that are used to parameterize the crc32 template code and instantiate
CRC-32, CRC-32c and 'big endian' CRC-32.

Signed-off-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Reviewed-by: Eric Biggers &lt;ebiggers@google.com&gt;
Link: https://lore.kernel.org/r/20241018075347.2821102-7-ardb+git@google.com
Signed-off-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
</content>
</entry>
<entry>
<title>arm64/lib: Handle CRC-32 alternative in C code</title>
<updated>2024-10-22T10:54:43Z</updated>
<author>
<name>Ard Biesheuvel</name>
<email>ardb@kernel.org</email>
</author>
<published>2024-10-18T07:53:49Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=fc7454107d1b7c27bb98d3b109e5f44a8a46d7f8'/>
<id>urn:sha1:fc7454107d1b7c27bb98d3b109e5f44a8a46d7f8</id>
<content type='text'>
In preparation for adding another code path for performing CRC-32, move
the alternative patching for ARM64_HAS_CRC32 into C code. The logic for
deciding whether to use this new code path will be implemented in C too.
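
A sketch of the C-side dispatch (assuming the alternative_has_cap_likely()
cpucap helper; the other function names are illustrative, and the exact
code in the patch may differ):

  u32 crc32_le_arch(u32 crc, const u8 *p, size_t len)
  {
  	/* patched via alternatives once cpucaps are finalized */
  	if (!alternative_has_cap_likely(ARM64_HAS_CRC32))
  		return crc32_le_base(crc, p, len);	/* generic table-based code */
  	return crc32_le_scalar(crc, p, len);	/* CRC32X/CRC32W instructions */
  }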

Reviewed-by: Eric Biggers &lt;ebiggers@google.com&gt;
Signed-off-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Link: https://lore.kernel.org/r/20241018075347.2821102-6-ardb+git@google.com
Signed-off-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
</content>
</entry>
<entry>
<title>arm64: lib: Use MOPS for copy_page() and clear_page()</title>
<updated>2024-10-17T15:42:51Z</updated>
<author>
<name>Kristina Martsenko</name>
<email>kristina.martsenko@arm.com</email>
</author>
<published>2024-09-30T16:10:51Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=ce6b5ff5f16dd9267d62d09b3af3f0c7dc3c24f0'/>
<id>urn:sha1:ce6b5ff5f16dd9267d62d09b3af3f0c7dc3c24f0</id>
<content type='text'>
Similarly to what was done to the memcpy() routines, make copy_page()
and clear_page() also use the Armv8.8 FEAT_MOPS instructions.

Note: For copy_page() this uses the CPY* instructions instead of CPYF*,
as CPYF* doesn't allow src and dst to be equal. It's not clear whether
copy_page() needs to allow equal src and dst, but it has worked so far
with the current implementation and there is no documentation forbidding
it.

Note: the unoptimized version of copy_page() in assembler.h is left as
it is.
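
For illustration, a minimal C sketch of the three-instruction CPY*
sequence (hand-written, not the kernel's actual copy_page.S; assumes an
assembler that understands the MOPS extension):

  static inline void mops_copy_page(void *dst, const void *src)
  {
  	unsigned long n = 4096;	/* PAGE_SIZE */

  	/* CPYP/CPYM/CPYE: prologue, main and epilogue of one CPY* copy */
  	asm volatile(".arch_extension mops\n"
  		     "cpyp [%0]!, [%1]!, %2!\n"
  		     "cpym [%0]!, [%1]!, %2!\n"
  		     "cpye [%0]!, [%1]!, %2!"
  		     : "+r" (dst), "+r" (src), "+r" (n)
  		     :
  		     : "memory");
  }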

Signed-off-by: Kristina Martsenko &lt;kristina.martsenko@arm.com&gt;
Link: https://lore.kernel.org/r/20240930161051.3777828-6-kristina.martsenko@arm.com
Signed-off-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
</content>
</entry>
<entry>
<title>arm64: lib: Use MOPS for memcpy() routines</title>
<updated>2024-10-17T15:42:51Z</updated>
<author>
<name>Kristina Martsenko</name>
<email>kristina.martsenko@arm.com</email>
</author>
<published>2024-09-30T16:10:50Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=836ed3c4e473fef9e0814a2ba6dd40f9656c03f1'/>
<id>urn:sha1:836ed3c4e473fef9e0814a2ba6dd40f9656c03f1</id>
<content type='text'>
Make memcpy(), memmove() and memset() use the Armv8.8 FEAT_MOPS
instructions when implemented on the CPU.

The CPY*/SET* instructions copy or set a block of memory of arbitrary
size and alignment. They can be interrupted by the CPU and the copying
resumed later. Their performance is expected to be close to the best
generic copy/set sequence of loads/stores for a given CPU. Using them in
the kernel's copy/set routines therefore avoids the need to periodically
rewrite the routines to optimize for new microarchitectures. It could
also lead to a performance improvement for some CPUs and systems.
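
As an illustration of the shape of such a sequence, a hand-written
memset()-style sketch (not the kernel's actual memset.S; assumes an
assembler that understands the MOPS extension):

  static inline void mops_set(void *dst, int c, unsigned long n)
  {
  	unsigned long val = (unsigned char)c;

  	/* SETP/SETM/SETE: prologue, main and epilogue of one SET* operation */
  	asm volatile(".arch_extension mops\n"
  		     "setp [%0]!, %1!, %2\n"
  		     "setm [%0]!, %1!, %2\n"
  		     "sete [%0]!, %1!, %2"
  		     : "+r" (dst), "+r" (n)
  		     : "r" (val)
  		     : "memory");
  }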

With this change the kernel will always use the instructions if they are
implemented on the CPU (and have not been disabled by the arm64.nomops
command line parameter). When they are not implemented, the usual
routines are used instead (patched in via alternatives). Note: we patch a
B/NOP rather than the whole sequence, to avoid executing a partially
patched sequence in case the compiler generates a mem*() call inside the
alternatives patching code.

Note that MOPS instructions have relaxed behavior on Device memory, but
it is expected that these routines are not generally used on MMIO.

Note: For memcpy(), this uses the CPY* instructions instead of CPYF*, as
CPY* allows overlaps between the source and destination buffers and,
despite this contradicting the C standard, compilers require that memcpy()
work with exactly overlapping source and destination:
  https://gcc.gnu.org/onlinedocs/gcc/Standards.html#C-Language
  https://reviews.llvm.org/D86993

Signed-off-by: Kristina Martsenko &lt;kristina.martsenko@arm.com&gt;
Link: https://lore.kernel.org/r/20240930161051.3777828-5-kristina.martsenko@arm.com
Signed-off-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
</content>
</entry>
<entry>
<title>arm64: crypto: use CC_FLAGS_FPU for NEON CFLAGS</title>
<updated>2024-05-19T21:36:18Z</updated>
<author>
<name>Samuel Holland</name>
<email>samuel.holland@sifive.com</email>
</author>
<published>2024-03-29T07:18:20Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=7177089525d97bd8d7fefd6a9406f45d818410e0'/>
<id>urn:sha1:7177089525d97bd8d7fefd6a9406f45d818410e0</id>
<content type='text'>
Now that CC_FLAGS_FPU is exported and can be used anywhere in the source
tree, use it instead of duplicating the flags here.

Link: https://lkml.kernel.org/r/20240329072441.591471-6-samuel.holland@sifive.com
Signed-off-by: Samuel Holland &lt;samuel.holland@sifive.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Acked-by: Christian König &lt;christian.koenig@amd.com&gt; 
Cc: Alex Deucher &lt;alexander.deucher@amd.com&gt;
Cc: Borislav Petkov (AMD) &lt;bp@alien8.de&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Dave Hansen &lt;dave.hansen@linux.intel.com&gt;
Cc: Huacai Chen &lt;chenhuacai@kernel.org&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Masahiro Yamada &lt;masahiroy@kernel.org&gt;
Cc: Michael Ellerman &lt;mpe@ellerman.id.au&gt;
Cc: Nathan Chancellor &lt;nathan@kernel.org&gt;
Cc: Nicolas Schier &lt;nicolas@fjasle.eu&gt;
Cc: Palmer Dabbelt &lt;palmer@rivosinc.com&gt;
Cc: Russell King &lt;linux@armlinux.org.uk&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: WANG Xuerui &lt;git@xen0n.name&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>arm64, bpf: add internal-only MOV instruction to resolve per-CPU addrs</title>
<updated>2024-05-12T23:54:34Z</updated>
<author>
<name>Puranjay Mohan</name>
<email>puranjay12@gmail.com</email>
</author>
<published>2024-05-02T15:18:53Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=7a4c32222b0e14349a6311e72bf6ebd3e1d1064b'/>
<id>urn:sha1:7a4c32222b0e14349a6311e72bf6ebd3e1d1064b</id>
<content type='text'>
Support an instruction for resolving the absolute addresses of per-CPU
data from their per-CPU offsets. This instruction is internal-only;
users are not allowed to use it directly. For now it is only used for
internal inlining optimizations between the BPF verifier and the BPF
JITs.

Since commit 7158627686f0 ("arm64: percpu: implement optimised pcpu
access using tpidr_el1"), the per-cpu offset for the CPU is stored in
the tpidr_el1/2 register of that CPU.

To support this BPF instruction in the ARM64 JIT, the following ARM64
instructions are emitted:

mov dst, src		// Move src to dst, if src != dst
mrs tmp, tpidr_el1/2	// Move the per-CPU offset of the current CPU into tmp
add dst, dst, tmp	// Add the per-CPU offset to dst
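
In plain C, the emitted sequence computes the following (an
illustrative sketch; bpf_percpu_addr() is a hypothetical name):

  static inline void *bpf_percpu_addr(unsigned long off)
  {
  	unsigned long base;

  	/* tpidr_el1 holds the per-CPU base offset of the current CPU */
  	asm("mrs %0, tpidr_el1" : "=r" (base));
  	return (void *)(off + base);
  }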

To measure the performance improvement provided by this change, the
benchmark in [1] was used:

Before:
glob-arr-inc   :   23.597 ± 0.012M/s
arr-inc        :   23.173 ± 0.019M/s
hash-inc       :   12.186 ± 0.028M/s

After:
glob-arr-inc   :   23.819 ± 0.034M/s
arr-inc        :   23.285 ± 0.017M/s
hash-inc       :   12.419 ± 0.011M/s

[1] https://github.com/anakryiko/linux/commit/8dec900975ef

Signed-off-by: Puranjay Mohan &lt;puranjay12@gmail.com&gt;
Acked-by: Andrii Nakryiko &lt;andrii@kernel.org&gt;
Link: https://lore.kernel.org/r/20240502151854.9810-4-puranjay@kernel.org
Signed-off-by: Alexei Starovoitov &lt;ast@kernel.org&gt;
</content>
</entry>
<entry>
<title>arm64: Get rid of ARM64_HAS_NO_HW_PREFETCH</title>
<updated>2023-12-05T12:02:52Z</updated>
<author>
<name>Marc Zyngier</name>
<email>maz@kernel.org</email>
</author>
<published>2023-11-22T13:37:54Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=103423ad7e56d6c756738823c332c414b07899e6'/>
<id>urn:sha1:103423ad7e56d6c756738823c332c414b07899e6</id>
<content type='text'>
Back in 2016, it was argued that implementations lacking a HW
prefetcher could be helped by sprinkling a number of PRFM
instructions in strategic locations.

In 2023, the one platform that presumably needed this hack is no
longer in active use (let alone maintained), and a quick
experiment shows that dropping this hack leads to only a 0.4% drop
on a full kernel compilation (tested on an MT30-GS0 48-CPU system).

Given that this is pretty much in the noise department and that
it may give odd ideas to other implementers, drop the hack for
good.

Suggested-by: Will Deacon &lt;will@kernel.org&gt;
Suggested-by: Mark Rutland &lt;mark.rutland@arm.com&gt;
Signed-off-by: Marc Zyngier &lt;maz@kernel.org&gt;
Acked-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Link: https://lore.kernel.org/r/20231122133754.1240687-1-maz@kernel.org
Signed-off-by: Will Deacon &lt;will@kernel.org&gt;
</content>
</entry>
<entry>
<title>arm64: Avoid cpus_have_const_cap() for ARM64_HAS_WFXT</title>
<updated>2023-10-16T13:17:05Z</updated>
<author>
<name>Mark Rutland</name>
<email>mark.rutland@arm.com</email>
</author>
<published>2023-10-16T10:24:47Z</published>
<link rel='alternate' type='text/html' href='https://git.kobert.dev/pm24.git/commit/?id=4c73056e3277f803ccdb7769e55fe7153edb07cc'/>
<id>urn:sha1:4c73056e3277f803ccdb7769e55fe7153edb07cc</id>
<content type='text'>
In __delay() we use cpus_have_const_cap() to check for ARM64_HAS_WFXT,
but this is not necessary and alternative_has_cap() would be preferable.

For historical reasons, cpus_have_const_cap() is more complicated than
it needs to be. Before cpucaps are finalized, it will perform a bitmap
test of the system_cpucaps bitmap, and once cpucaps are finalized it
will use an alternative branch. This used to be necessary to handle some
race conditions in the window between cpucap detection and the
subsequent patching of alternatives and static branches, where different
branches could be out-of-sync with one another (or w.r.t. alternative
sequences). Now that we use alternative branches instead of static
branches, these are all patched atomically w.r.t. one another, and there
are only a handful of cases that need special care in the window between
cpucap detection and alternative patching.

Due to the above, it would be nice to remove cpus_have_const_cap(), and
migrate callers over to alternative_has_cap_*(), cpus_have_final_cap(),
or cpus_have_cap(), depending on their requirements. This will
remove redundant instructions and improve code generation, and will make
it easier to determine how each callsite will behave before, during, and
after alternative patching.

The cpus_have_const_cap() check in __delay() is an optimization to use
WFIT and WFET in preference to busy-polling the counter and/or using
regular WFE and relying upon the architected timer event stream. It is
not necessary to apply this optimization in the window between detecting
the ARM64_HAS_WFXT cpucap and patching alternatives.

This patch replaces the use of cpus_have_const_cap() with
alternative_has_cap_unlikely(), which will avoid generating code to test
the system_cpucaps bitmap and should be better for all subsequent calls
at runtime.
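
The resulting shape of the check, as a rough sketch (assuming the
wfet() wrapper and get_cycles(); the exact code in the patch may
differ):

  static void __delay_cycles(u64 cycles)
  {
  	const u64 start = get_cycles();

  	if (alternative_has_cap_unlikely(ARM64_HAS_WFXT)) {
  		/* WFET: wait for an event or until the counter reaches a value */
  		while ((get_cycles() - start) &lt; cycles)
  			wfet(start + cycles);
  	} else {
  		while ((get_cycles() - start) &lt; cycles)
  			cpu_relax();
  	}
  }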

Signed-off-by: Mark Rutland &lt;mark.rutland@arm.com&gt;
Cc: Marc Zyngier &lt;maz@kernel.org&gt;
Cc: Suzuki K Poulose &lt;suzuki.poulose@arm.com&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Signed-off-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
</content>
</entry>
</feed>
