pm24.git

Age	Commit message	Author
2024-11-19	Merge tag 'v6.13-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6	Linus Torvalds
	Pull crypto updates from Herbert Xu:
	"API:
	  - Add sig driver API
	  - Remove signing/verification from akcipher API
	  - Move crypto_simd_disabled_for_test to lib/crypto
	  - Add WARN_ON for return values from driver that indicates memory corruption
	 Algorithms:
	  - Provide crc32-arch and crc32c-arch through Crypto API
	  - Optimise crc32c code size on x86
	  - Optimise crct10dif on arm/arm64
	  - Optimise p10-aes-gcm on powerpc
	  - Optimise aegis128 on x86
	  - Output full sample from test interface in jitter RNG
	  - Retry without padata when it fails in pcrypt
	 Drivers:
	  - Add support for Airoha EN7581 TRNG
	  - Add support for STM32MP25x platforms in stm32
	  - Enable iproc-r200 RNG driver on BCMBCA
	  - Add Broadcom BCM74110 RNG driver"
	* tag 'v6.13-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (112 commits)
	  crypto: marvell/cesa - fix uninit value for struct mv_cesa_op_ctx
	  crypto: cavium - Fix an error handling path in cpt_ucode_load_fw()
	  crypto: aesni - Move back to module_init
	  crypto: lib/mpi - Export mpi_set_bit
	  crypto: aes-gcm-p10 - Use the correct bit to test for P10
	  hwrng: amd - remove reference to removed PPC_MAPLE config
	  crypto: arm/crct10dif - Implement plain NEON variant
	  crypto: arm/crct10dif - Macroify PMULL asm code
	  crypto: arm/crct10dif - Use existing mov_l macro instead of __adrl
	  crypto: arm64/crct10dif - Remove remaining 64x64 PMULL fallback code
	  crypto: arm64/crct10dif - Use faster 16x64 bit polynomial multiply
	  crypto: arm64/crct10dif - Remove obsolete chunking logic
	  crypto: bcm - add error check in the ahash_hmac_init function
	  crypto: caam - add error check to caam_rsa_set_priv_key_form
	  hwrng: bcm74110 - Add Broadcom BCM74110 RNG driver
	  dt-bindings: rng: add binding for BCM74110 RNG
	  padata: Clean up in padata_do_multithreaded()
	  crypto: inside-secure - Fix the return value of safexcel_xcbcmac_cra_init()
	  crypto: qat - Fix missing destroy_workqueue in adf_init_aer()
	  crypto: rsassa-pkcs1 - Reinstate support for legacy protocols
	  ...
2024-11-15	crypto: aesni - Move back to module_init	Herbert Xu
	This patch reverts commit 0fbafd06bdde938884f7326548d3df812b267c3c ("crypto: aesni - fix failing setkey for rfc4106-gcm-aesni") by moving the aesni init function back to module_init from late_initcall. The original patch was needed because tests were synchronous. This is no longer the case so there is no need to postpone the registration. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28	crypto: x86/aegis128 - remove unneeded RETs	Eric Biggers
	Remove returns that are immediately followed by another return. Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28	crypto: x86/aegis128 - remove unneeded FRAME_BEGIN and FRAME_END	Eric Biggers
	Stop using FRAME_BEGIN and FRAME_END in the AEGIS assembly functions, since all these functions are now leaf functions. This eliminates some unnecessary instructions. Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28	crypto: x86/aegis128 - take advantage of block-aligned len	Eric Biggers
	Update a caller of aegis128_aesni_ad() to round down the length to a block boundary. After that, aegis128_aesni_ad(), aegis128_aesni_enc(), and aegis128_aesni_dec() are only passed whole blocks. Update the assembly code to take advantage of that, which eliminates some unneeded instructions. For aegis128_aesni_enc() and aegis128_aesni_dec(), the length is also always nonzero, so stop checking for zero length. Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28	crypto: x86/aegis128 - optimize partial block handling using SSE4.1	Eric Biggers
	Optimize the code that loads and stores partial blocks, taking advantage of SSE4.1. The code is adapted from that in aes-gcm-aesni-x86_64.S. Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28	crypto: x86/aegis128 - improve assembly function prototypes	Eric Biggers
	Adjust the prototypes of the AEGIS assembly functions:
	- Use proper types instead of 'void *', when applicable.
	- Move the length parameter to after the buffers it describes rather than before, to match the usual convention. Also shorten its name to just len (which is the name used in the assembly code).
	- Declare register aliases at the beginning of each function rather than once per file. This was necessary because len was moved, but also it allows adding some aliases where raw registers were used before.
	- Put assoclen and cryptlen in the correct order when declaring the finalization function in the .c file.
	- Remove the unnecessary "crypto_" prefix.
	Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28	crypto: x86/aegis128 - optimize length block preparation using SSE4.1	Eric Biggers
	Start using SSE4.1 instructions in the AES-NI AEGIS code, with the first use case being preparing the length block in fewer instructions. In practice this does not reduce the set of CPUs on which the code can run, because all Intel and AMD CPUs with AES-NI also have SSE4.1. Upgrade the existing SSE2 feature check to SSE4.1, though it seems this check is not strictly necessary; the aesni-intel module has been getting away with using SSE4.1 despite checking for AES-NI only. Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28	crypto: x86/aegis128 - don't bother with special code for aligned data	Eric Biggers
	Remove the AEGIS assembly code paths that were "optimized" to operate on 16-byte aligned data using movdqa, and instead just use the code paths that use movdqu and can handle data with any alignment. This does not reduce performance. movdqa is basically a historical artifact; on aligned data, movdqu and movdqa have had the same performance since Intel Nehalem (2008) and AMD Bulldozer (2011). And code that requires AES-NI cannot run on CPUs older than those anyway. Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28	crypto: x86/aegis128 - eliminate some indirect calls	Eric Biggers
	Instead of using a struct of function pointers to decide whether to call the encryption or decryption assembly functions, use a conditional branch on a bool. Force-inline the functions to avoid actually generating the branch. This improves performance slightly since indirect calls are slow. Remove the now-unnecessary CFI stubs. Note that just force-inlining the existing functions might cause the compiler to optimize out the indirect branches, but that would not be a reliable way to do it and the CFI stubs would still be required. Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
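The pattern described in this commit can be illustrated with a minimal user-space C sketch. All names here (demo_enc(), demo_dec(), crypt_common(), demo_encrypt(), demo_decrypt()) are hypothetical stand-ins rather than the kernel's functions, and the byte transforms are placeholders for the real assembly routines. The always_inline attribute is the GCC/Clang spelling behind the kernel's __always_inline.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder routines standing in for the assembly enc/dec functions. */
void demo_enc(uint8_t *dst, const uint8_t *src, size_t len)
{
	for (size_t i = 0; i < len; i++)
		dst[i] = (uint8_t)(src[i] + 1);
}

void demo_dec(uint8_t *dst, const uint8_t *src, size_t len)
{
	for (size_t i = 0; i < len; i++)
		dst[i] = (uint8_t)(src[i] - 1);
}

/*
 * Shared helper selected by a bool instead of a function pointer.
 * Because it is force-inlined and each caller passes a constant,
 * the compiler can resolve the branch at compile time and emit a
 * direct call.
 */
static inline __attribute__((always_inline)) void
crypt_common(bool enc, uint8_t *dst, const uint8_t *src, size_t len)
{
	if (enc)
		demo_enc(dst, src, len);
	else
		demo_dec(dst, src, len);
}

void demo_encrypt(uint8_t *dst, const uint8_t *src, size_t len)
{
	crypt_common(true, dst, src, len);
}

void demo_decrypt(uint8_t *dst, const uint8_t *src, size_t len)
{
	crypt_common(false, dst, src, len);
}
```

Calling demo_encrypt() and then demo_decrypt() on the same buffer returns the original bytes; neither path goes through a function pointer.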
2024-10-28	crypto: x86/aegis128 - remove no-op init and exit functions	Eric Biggers
	Don't bother providing empty stubs for the init and exit methods in struct aead_alg, since they are optional anyway. Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-28	crypto: x86/aegis128 - access 32-bit arguments as 32-bit	Eric Biggers
	Fix the AEGIS assembly code to access 'unsigned int' arguments as 32-bit values instead of 64-bit, since the upper bits of the corresponding 64-bit registers are not guaranteed to be zero. Note: there haven't been any reports of this bug actually causing incorrect behavior. Neither gcc nor clang guarantee zero-extension to 64 bits, but zero-extension is likely to happen in practice because most instructions that operate on 32-bit registers zero-extend to 64 bits. Fixes: 1d373d4e8e15 ("crypto: x86 - Add optimized AEGIS implementations") Cc: stable@vger.kernel.org Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
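The hazard described in this commit can be modelled in plain C. This is only a sketch: reg stands for a 64-bit argument register whose upper half may hold leftover bits, and the two function names are hypothetical.

```c
#include <stdint.h>

/* Buggy pattern: uses the full 64-bit register as the 32-bit argument. */
uint64_t arg_read_as_64bit(uint64_t reg)
{
	return reg;
}

/*
 * Fixed pattern: uses only the low 32 bits, which is what accessing
 * the 32-bit form of the register does in assembly.
 */
uint64_t arg_read_as_32bit(uint64_t reg)
{
	return (uint32_t)reg;
}
```

With reg = 0xdeadbeef00000010, the 32-bit read yields 16 while the 64-bit read yields a huge bogus length.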
2024-10-26	crypto: x86/crc32c - eliminate jump table and excessive unrolling	Eric Biggers
	crc32c-pcl-intel-asm_64.S has a loop with 1 to 127 iterations fully unrolled and uses a jump table to jump into the correct location. This optimization is misguided, as it bloats the binary code size and introduces an indirect call. x86_64 CPUs can predict loops well, so it is fine to just use a loop instead. Loop bookkeeping instructions can compete with the crc instructions for the ALUs, but this is easily mitigated by unrolling the loop by a smaller amount, such as 4 times.
	Therefore, re-roll the loop and make related tweaks to the code. This reduces the binary code size of crc_pclmul() from 4546 bytes to 418 bytes, a 91% reduction. In general it also makes the code faster, with some large improvements seen when retpoline is enabled.
	More detailed performance results are shown below. They are given as percent improvement in throughput (negative means regressed) for CPU microarchitecture vs. input length in bytes. E.g. an improvement from 40 GB/s to 50 GB/s would be listed as 25%.
	Table 1: Results with retpoline enabled (the default):
	                     |   512 |   833 |  1024 |  2000 |  3173 |  4096 |
	---------------------+-------+-------+-------+-------+-------+-------+
	Intel Haswell        | 35.0% | 20.7% | 17.8% |  9.7% | -0.2% |  4.4% |
	Intel Emerald Rapids | 66.8% | 45.2% | 36.3% | 19.3% |  0.0% |  5.4% |
	AMD Zen 2            | 29.5% | 17.2% | 13.5% |  8.6% | -0.5% |  2.8% |
	Table 2: Results with retpoline disabled:
	                     |   512 |   833 |  1024 |  2000 |  3173 |  4096 |
	---------------------+-------+-------+-------+-------+-------+-------+
	Intel Haswell        |  3.3% |  4.8% |  4.5% |  0.9% | -2.9% |  0.3% |
	Intel Emerald Rapids |  7.5% |  6.4% |  5.2% |  2.3% | -0.0% |  0.6% |
	AMD Zen 2            | 11.8% |  1.4% |  0.2% |  1.3% | -0.9% | -0.2% |
	Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
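The idea of replacing a fully unrolled body with a loop unrolled only 4 times can be sketched in portable C. This is an illustration under stated assumptions, not the kernel's code: a bitwise software CRC32C step stands in for the crc32 instruction, the names crc32c_byte(), crc32c_simple(), and crc32c_unroll4() are hypothetical, and the real assembly is structured differently.

```c
#include <stddef.h>
#include <stdint.h>

/* One CRC32C step over a single byte (reflected polynomial 0x82F63B78). */
static uint32_t crc32c_byte(uint32_t crc, uint8_t b)
{
	crc ^= b;
	for (int k = 0; k < 8; k++)
		crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	return crc;
}

/* Plain loop: one step per iteration, loop bookkeeping on every step. */
uint32_t crc32c_simple(uint32_t crc, const uint8_t *p, size_t len)
{
	for (size_t i = 0; i < len; i++)
		crc = crc32c_byte(crc, p[i]);
	return crc;
}

/*
 * Loop unrolled by 4: bookkeeping is paid once per four steps, and a
 * short tail loop handles the remaining 0 to 3 bytes. No jump table
 * is needed to enter the loop at the right place.
 */
uint32_t crc32c_unroll4(uint32_t crc, const uint8_t *p, size_t len)
{
	size_t i = 0;

	for (; i + 4 <= len; i += 4) {
		crc = crc32c_byte(crc, p[i]);
		crc = crc32c_byte(crc, p[i + 1]);
		crc = crc32c_byte(crc, p[i + 2]);
		crc = crc32c_byte(crc, p[i + 3]);
	}
	for (; i < len; i++)
		crc = crc32c_byte(crc, p[i]);
	return crc;
}
```

Both functions compute the same value for any input; only the amount of loop overhead per step differs.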
2024-10-26	crypto: x86/crc32c - access 32-bit arguments as 32-bit	Eric Biggers
	Fix crc32c-pcl-intel-asm_64.S to access 32-bit arguments as 32-bit values instead of 64-bit, since the upper bits of the corresponding 64-bit registers are not guaranteed to be zero. Also update the type of the length argument to be unsigned int rather than int, as the assembly code treats it as unsigned. Note: there haven't been any reports of this bug actually causing incorrect behavior. Neither gcc nor clang guarantee zero-extension to 64 bits, but zero-extension is likely to happen in practice because most instructions that operate on 32-bit registers zero-extend to 64 bits. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-26	crypto: x86/crc32c - simplify code for handling fewer than 200 bytes	Eric Biggers
	The assembly code in crc32c-pcl-intel-asm_64.S is invoked only for lengths >= 512, due to the overhead of saving and restoring FPU state. Therefore, it is unnecessary for this code to be excessively "optimized" for lengths < 200. Eliminate the excessive unrolling of this part of the code and use a more straightforward qword-at-a-time loop. Note: the part of the code in question is not entirely redundant, as it is still used to process any remainder mod 24, as well as any remaining data when fewer than 200 bytes remain after at least one 3072-byte chunk. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-10	crypto: x86/cast5 - Remove unused cast5_ctr_16way	Dr. David Alan Gilbert
	commit e2d60e2f597a ("crypto: x86/cast5 - drop CTR mode implementation") removed the calls to cast5_ctr_16way but left the avx implementation. Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-10-02	move asm/unaligned.h to linux/unaligned.h	Al Viro
	asm/unaligned.h is always an include of asm-generic/unaligned.h; might as well move that thing to linux/unaligned.h and include that - there's nothing arch-specific in that header.
	auto-generated by the following:
	  for i in `git grep -l -w asm/unaligned.h`; do
	  	sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
	  done
	  for i in `git grep -l -w asm-generic/unaligned.h`; do
	  	sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
	  done
	  git mv include/asm-generic/unaligned.h include/linux/unaligned.h
	  git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
	  sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
	  sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-09-06	crypto: x86/aesni - update docs for aesni-intel module	Eric Biggers
	Update the kconfig help and module description to reflect that VAES instructions are now used in some cases. Also fix XTR => XCTR. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-08-24	crypto: x86/sha256 - Add parentheses around macros' single arguments	Fangrui Song
	The macros FOUR_ROUNDS_AND_SCHED and DO_4ROUNDS rely on an unexpected/undocumented behavior of the GNU assembler, which might change in the future (https://sourceware.org/bugzilla/show_bug.cgi?id=32073).
	  M (1) (2) // 1 arg !? Future: 2 args
	  M 1 + 2   // 1 arg !? Future: 3 args
	  M 1 2     // 2 args
	Add parentheses around the single arguments to support future GNU assembler and LLVM integrated assembler (when the IsOperator hack from the following link is dropped).
	Link: https://github.com/llvm/llvm-project/commit/055006475e22014b28a070db1bff41ca15f322f0 Signed-off-by: Fangrui Song <maskray@google.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-08-10	crypto: x86/aes-gcm - fix PREEMPT_RT issue in gcm_crypt()	Eric Biggers
	On PREEMPT_RT, kfree() takes sleeping locks and must not be called with preemption disabled. Therefore, on PREEMPT_RT skcipher_walk_done() must not be called from within a kernel_fpu_{begin,end}() pair, even when it's the last call which is guaranteed to not allocate memory. Therefore, move the last skcipher_walk_done() in gcm_crypt() to the end of the function so that it goes after the kernel_fpu_end(). To make this work cleanly, rework the data processing loop to handle only non-last data segments. Fixes: b06affb1cb58 ("crypto: x86/aes-gcm - add VAES and AVX512 / AVX10 optimized AES-GCM") Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Closes: https://lore.kernel.org/linux-crypto/20240802102333.itejxOsJ@linutronix.de Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-06-07	crypto: x86/aes-gcm - rewrite the AES-NI optimized AES-GCM	Eric Biggers
	Rewrite the AES-NI implementations of AES-GCM, taking advantage of things I learned while writing the VAES-AVX10 implementations. This is a complete rewrite that reduces the AES-NI GCM source code size by about 70% and the binary code size by about 95%, while not regressing performance and in fact improving it significantly in many cases.
	The following summarizes the state before this patch:
	- The aesni-intel module registered algorithms "generic-gcm-aesni" and "rfc4106-gcm-aesni" with the crypto API that actually delegated to one of three underlying implementations according to the CPU capabilities detected at runtime: AES-NI, AES-NI + AVX, or AES-NI + AVX2.
	- The AES-NI + AVX and AES-NI + AVX2 assembly code was in aesni-intel_avx-x86_64.S and consisted of 2804 lines of source and 257 KB of binary. This massive binary size was not really appropriate, and depending on the kconfig it could take up over 1% of the size of the entire vmlinux. The main loops did 8 blocks per iteration. The AVX code minimized the use of carryless multiplication whereas the AVX2 code did not. The "AVX2" code did not actually use AVX2; the check for AVX2 was really a check for Intel Haswell or later to detect support for fast carryless multiplication. The long source length was caused by factors such as significant code duplication.
	- The AES-NI only assembly code was in aesni-intel_asm.S and consisted of 1501 lines of source and 15 KB of binary. The main loops did 4 blocks per iteration and minimized the use of carryless multiplication by using Karatsuba multiplication and a multiplication-less reduction.
	- The assembly code was contributed in 2010-2013. Maintenance has been sporadic and most design choices haven't been revisited.
	- The assembly function prototypes and the corresponding glue code were separate from and were not consistent with the new VAES-AVX10 code I recently added. The older code had several issues such as not precomputing the GHASH key powers, which hurt performance.
	This rewrite achieves the following goals:
	- Much shorter source and binary sizes. The assembly source shrinks from 4300 lines to 1130 lines, and it produces about 9 KB of binary instead of 272 KB. This is achieved via a better designed AES-GCM implementation that doesn't excessively unroll the code and instead prioritizes the parts that really matter. Sharing the C glue code with the VAES-AVX10 implementations also saves 250 lines of C source.
	- Improve performance on most (possibly all) CPUs on which this code runs, for most (possibly all) message lengths. Benchmark results are given in Tables 1 and 2 below.
	- Use the same function prototypes and glue code as the new VAES-AVX10 algorithms. This fixes some issues with the integration of the assembly and results in some significant performance improvements, primarily on short messages. Also, the AVX and non-AVX implementations are now registered as separate algorithms with the crypto API, which makes them both testable by the self-tests.
	- Keep support for AES-NI without AVX (for Westmere, Silvermont, Goldmont, and Tremont), but unify the source code with AES-NI + AVX. Since 256-bit vectors cannot be used without VAES anyway, this is made feasible by just using the non-VEX coded form of most instructions.
	- Use a unified approach where the main loop does 8 blocks per iteration and uses Karatsuba multiplication to save one pclmulqdq per block but does not use the multiplication-less reduction. This strikes a good balance across the range of CPUs on which this code runs.
	- Don't spam the kernel log with an informational message on every boot.
	The following tables summarize the improvement in AES-GCM throughput on various CPU microarchitectures as a result of this patch:
	Table 1: AES-256-GCM encryption throughput improvement, CPU microarchitecture vs. message length in bytes:
	                   | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
	-------------------+-------+-------+-------+-------+-------+-------+
	Intel Broadwell    |    2% |    8% |   11% |   18% |   31% |   26% |
	Intel Skylake      |    1% |    4% |    7% |   12% |   26% |   19% |
	Intel Cascade Lake |    3% |    8% |   10% |   18% |   33% |   24% |
	AMD Zen 1          |    6% |   12% |    6% |   15% |   27% |   24% |
	AMD Zen 2          |    8% |   13% |   13% |   19% |   26% |   28% |
	AMD Zen 3          |    8% |   14% |   13% |   19% |   26% |   25% |
	                   |   300 |   200 |    64 |    63 |    16 |
	-------------------+-------+-------+-------+-------+-------+
	Intel Broadwell    |   35% |   29% |   45% |   55% |   54% |
	Intel Skylake      |   25% |   19% |   28% |   33% |   27% |
	Intel Cascade Lake |   36% |   28% |   39% |   49% |   54% |
	AMD Zen 1          |   27% |   22% |   23% |   29% |   26% |
	AMD Zen 2          |   32% |   24% |   22% |   25% |   31% |
	AMD Zen 3          |   30% |   24% |   22% |   23% |   26% |
	Table 2: AES-256-GCM decryption throughput improvement, CPU microarchitecture vs. message length in bytes:
	                   | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
	-------------------+-------+-------+-------+-------+-------+-------+
	Intel Broadwell    |    3% |    8% |   11% |   19% |   32% |   28% |
	Intel Skylake      |    3% |    4% |    7% |   13% |   28% |   27% |
	Intel Cascade Lake |    3% |    9% |   11% |   19% |   33% |   28% |
	AMD Zen 1          |   15% |   18% |   14% |   20% |   36% |   33% |
	AMD Zen 2          |    9% |   16% |   13% |   21% |   26% |   27% |
	AMD Zen 3          |    8% |   15% |   12% |   18% |   23% |   23% |
	                   |   300 |   200 |    64 |    63 |    16 |
	-------------------+-------+-------+-------+-------+-------+
	Intel Broadwell    |   36% |   31% |   40% |   51% |   53% |
	Intel Skylake      |   28% |   21% |   23% |   30% |   30% |
	Intel Cascade Lake |   36% |   29% |   36% |   47% |   53% |
	AMD Zen 1          |   35% |   31% |   32% |   35% |   36% |
	AMD Zen 2          |   31% |   30% |   27% |   38% |   30% |
	AMD Zen 3          |   27% |   23% |   24% |   32% |   26% |
	The above numbers are percentage improvements in single-thread throughput, so e.g. an increase from 3000 MB/s to 3300 MB/s would be listed as 10%. They were collected by directly measuring the Linux crypto API performance using a custom kernel module. Note that indirect benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O) include more overhead and won't see quite as much of a difference. All these benchmarks used an associated data length of 16 bytes. Note that AES-GCM is almost always used with short associated data lengths.
	I didn't test Intel CPUs before Broadwell, AMD CPUs before Zen 1, or Intel low-power CPUs, as these weren't readily available to me. However, based on the design of the new code and the available information about these other CPU microarchitectures, I wouldn't expect any significant regressions, and there's a good chance performance is improved just as it is above.
	Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
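The Karatsuba trick mentioned in this commit message (three carryless multiplications instead of four for a 128x128-bit product) can be sketched in portable C. This is a minimal illustration, not the kernel's assembly: clmul64() is a slow software stand-in for pclmulqdq, all function names are hypothetical, and the GHASH reduction step is not shown.

```c
#include <stdint.h>

/* Carryless 64x64 -> 128-bit multiply; software stand-in for pclmulqdq. */
void clmul64(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
	uint64_t l = 0, h = 0;

	for (int i = 0; i < 64; i++) {
		if ((b >> i) & 1) {
			l ^= a << i;
			if (i)
				h ^= a >> (64 - i);
		}
	}
	*lo = l;
	*hi = h;
}

/* Schoolbook 128x128 -> 256-bit product: four 64x64 multiplies. */
void clmul128_schoolbook(const uint64_t a[2], const uint64_t b[2],
			 uint64_t out[4])
{
	uint64_t ll_lo, ll_hi, hh_lo, hh_hi, lh_lo, lh_hi, hl_lo, hl_hi;

	clmul64(a[0], b[0], &ll_lo, &ll_hi);
	clmul64(a[1], b[1], &hh_lo, &hh_hi);
	clmul64(a[0], b[1], &lh_lo, &lh_hi);
	clmul64(a[1], b[0], &hl_lo, &hl_hi);
	out[0] = ll_lo;
	out[1] = ll_hi ^ lh_lo ^ hl_lo;
	out[2] = hh_lo ^ lh_hi ^ hl_hi;
	out[3] = hh_hi;
}

/*
 * Karatsuba: three 64x64 multiplies. In GF(2)[x] addition is XOR, so
 * (a0 ^ a1) * (b0 ^ b1) ^ a0*b0 ^ a1*b1 equals a0*b1 ^ a1*b0.
 */
void clmul128_karatsuba(const uint64_t a[2], const uint64_t b[2],
			uint64_t out[4])
{
	uint64_t ll_lo, ll_hi, hh_lo, hh_hi, mid_lo, mid_hi;

	clmul64(a[0], b[0], &ll_lo, &ll_hi);
	clmul64(a[1], b[1], &hh_lo, &hh_hi);
	clmul64(a[0] ^ a[1], b[0] ^ b[1], &mid_lo, &mid_hi);
	mid_lo ^= ll_lo ^ hh_lo;
	mid_hi ^= ll_hi ^ hh_hi;
	out[0] = ll_lo;
	out[1] = ll_hi ^ mid_lo;
	out[2] = hh_lo ^ mid_hi;
	out[3] = hh_hi;
}
```

Both routines produce the same 256-bit result; the Karatsuba form trades one multiplication for a few extra XORs.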
2024-06-07	crypto: x86/aes-gcm - add VAES and AVX512 / AVX10 optimized AES-GCM	Eric Biggers
	Add implementations of AES-GCM for x86_64 CPUs that support VAES (vector AES), VPCLMULQDQ (vector carryless multiplication), and either AVX512 or AVX10. There are two implementations, sharing most source code: one using 256-bit vectors and one using 512-bit vectors. This patch improves AES-GCM performance by up to 162%; see Tables 1 and 2 below.
	I wrote the new AES-GCM assembly code from scratch, focusing on correctness, performance, code size (both source and binary), and documenting the source. The new assembly file aes-gcm-avx10-x86_64.S is about 1200 lines including extensive comments, and it generates less than 8 KB of binary code. The main loop does 4 vectors at a time, with the AES and GHASH instructions interleaved. Any remainder is handled using a simple 1 vector at a time loop, with masking.
	Several VAES + AVX512 implementations of AES-GCM exist from Intel, including one in OpenSSL and one proposed for inclusion in Linux in 2021 (https://lore.kernel.org/linux-crypto/1611386920-28579-6-git-send-email-megha.dey@intel.com/). These aren't really suitable to be used, though, due to the massive amount of binary code generated (696 KB for OpenSSL, 200 KB for Linux) as well as the significantly larger amount of assembly source (4978 lines for OpenSSL, 1788 lines for Linux). Also, Intel's code does not support 256-bit vectors, which makes it not usable on future AVX10/256-only CPUs, and also not ideal for certain Intel CPUs that have downclocking issues. So I ended up starting from scratch. Usually my much shorter code is actually slightly faster than Intel's AVX512 code, though it depends on message length and on which of Intel's implementations is used; for details, see Tables 3 and 4 below.
	To facilitate potential integration into other projects, I've dual-licensed aes-gcm-avx10-x86_64.S under Apache-2.0 OR BSD-2-Clause, the same as the recently added RISC-V crypto code.
	The following two tables summarize the performance improvement over the existing AES-GCM code in Linux that uses AES-NI and AVX2:
	Table 1: AES-256-GCM encryption throughput improvement, CPU microarchitecture vs. message length in bytes:
	                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
	----------------------+-------+-------+-------+-------+-------+-------+
	Intel Ice Lake        |   42% |   48% |   60% |   62% |   70% |   69% |
	Intel Sapphire Rapids |  157% |  145% |  162% |  119% |   96% |   96% |
	Intel Emerald Rapids  |  156% |  144% |  161% |  115% |   95% |  100% |
	AMD Zen 4             |  103% |   89% |   78% |   56% |   54% |   54% |
	                      |   300 |   200 |    64 |    63 |    16 |
	----------------------+-------+-------+-------+-------+-------+
	Intel Ice Lake        |   66% |   48% |   49% |   70% |   53% |
	Intel Sapphire Rapids |   80% |   60% |   41% |   62% |   38% |
	Intel Emerald Rapids  |   79% |   60% |   41% |   62% |   38% |
	AMD Zen 4             |   51% |   35% |   27% |   32% |   25% |
	Table 2: AES-256-GCM decryption throughput improvement, CPU microarchitecture vs. message length in bytes:
	                      | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
	----------------------+-------+-------+-------+-------+-------+-------+
	Intel Ice Lake        |   42% |   48% |   59% |   63% |   67% |   71% |
	Intel Sapphire Rapids |  159% |  145% |  161% |  125% |  102% |  100% |
	Intel Emerald Rapids  |  158% |  144% |  161% |  124% |  100% |  103% |
	AMD Zen 4             |  110% |   95% |   80% |   59% |   56% |   54% |
	                      |   300 |   200 |    64 |    63 |    16 |
	----------------------+-------+-------+-------+-------+-------+
	Intel Ice Lake        |   67% |   56% |   46% |   70% |   56% |
	Intel Sapphire Rapids |   79% |   62% |   39% |   61% |   39% |
	Intel Emerald Rapids  |   80% |   62% |   40% |   58% |   40% |
	AMD Zen 4             |   49% |   36% |   30% |   35% |   28% |
	The above numbers are percentage improvements in single-thread throughput, so e.g. an increase from 4000 MB/s to 6000 MB/s would be listed as 50%. They were collected by directly measuring the Linux crypto API performance using a custom kernel module. Note that indirect benchmarks (e.g. 'cryptsetup benchmark' or benchmarking dm-crypt I/O) include more overhead and won't see quite as much of a difference. All these benchmarks used an associated data length of 16 bytes. Note that AES-GCM is almost always used with short associated data lengths.
	The following two tables summarize how the performance of my code compares with Intel's AVX512 AES-GCM code, both the version that is in OpenSSL and the version that was proposed for inclusion in Linux. Neither version exists in Linux currently, but these are alternative AES-GCM implementations that could be chosen instead of mine. I collected the following numbers on Emerald Rapids using a userspace benchmark program that calls the assembly functions directly.
	I've also included a comparison with Cloudflare's AES-GCM implementation from https://boringssl-review.googlesource.com/c/boringssl/+/65987/3.
	Table 3: VAES-based AES-256-GCM encryption throughput in MB/s, implementation name vs. message length in bytes:
	                     | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
	---------------------+-------+-------+-------+-------+-------+-------+
	This implementation  | 14171 | 12956 | 12318 |  9588 |  7293 |  6449 |
	AVX512_Intel_OpenSSL | 14022 | 12467 | 11863 |  9107 |  5891 |  6472 |
	AVX512_Intel_Linux   | 13954 | 12277 | 11530 |  8712 |  6627 |  5898 |
	AVX512_Cloudflare    | 12564 | 11050 | 10905 |  8152 |  5345 |  5202 |
	                     |   300 |   200 |    64 |    63 |    16 |
	---------------------+-------+-------+-------+-------+-------+
	This implementation  |  4939 |  3688 |  1846 |  1821 |   738 |
	AVX512_Intel_OpenSSL |  4629 |  4532 |  2734 |  2332 |  1131 |
	AVX512_Intel_Linux   |  4035 |  2966 |  1567 |  1330 |   639 |
	AVX512_Cloudflare    |  3344 |  2485 |  1141 |  1127 |   456 |
	Table 4: VAES-based AES-256-GCM decryption throughput in MB/s, implementation name vs. message length in bytes:
	                     | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
	---------------------+-------+-------+-------+-------+-------+-------+
	This implementation  | 14276 | 13311 | 13007 | 11086 |  8268 |  8086 |
	AVX512_Intel_OpenSSL | 14067 | 12620 | 12421 |  9587 |  5954 |  7060 |
	AVX512_Intel_Linux   | 14116 | 12795 | 11778 |  9269 |  7735 |  6455 |
	AVX512_Cloudflare    | 13301 | 12018 | 11919 |  9182 |  7189 |  6726 |
	                     |   300 |   200 |    64 |    63 |    16 |
	---------------------+-------+-------+-------+-------+-------+
	This implementation  |  6454 |  5020 |  2635 |  2602 |  1079 |
	AVX512_Intel_OpenSSL |  5184 |  5799 |  2957 |  2545 |  1228 |
	AVX512_Intel_Linux   |  4394 |  4247 |  2235 |  1635 |   922 |
	AVX512_Cloudflare    |  4289 |  3851 |  1435 |  1417 |   574 |
	So, usually my code is actually slightly faster than Intel's code, though the OpenSSL implementation has a slight edge on messages shorter than 256 bytes in this microbenchmark. (This also holds true when doing the same tests on AMD Zen 4.) It can be seen that the large code size (up to 94x larger!) of the Intel implementations doesn't seem to bring much benefit, so starting from scratch with much smaller code, as I've done, seems appropriate. The performance of my code on messages shorter than 256 bytes could be improved through a limited amount of unrolling, but it's unclear it would be worth it, given code size considerations (e.g. caches) that don't get measured in microbenchmarks.
	Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-06-07	crypto: x86 - add missing MODULE_DESCRIPTION() macros	Jeff Johnson
	On x86, make allmodconfig && make W=1 C=1 warns:
	  WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/crypto/crc32-pclmul.o
	  WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/crypto/curve25519-x86_64.o
	Add the missing MODULE_DESCRIPTION() macro invocations.
	Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-05-31	crypto: x86/poly1305 - Switch to new Intel CPU model defines	Tony Luck
	New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-05-31	crypto: x86/twofish - Switch to new Intel CPU model defines	Tony Luck
	New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-05-22	crypto: x86/aes-xts - switch to new Intel CPU model defines	Tony Luck
	New CPU #defines encode vendor and family as well as model. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Eric Biggers <ebiggers@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Link: https://lore.kernel.org/r/20240520224620.9480-2-tony.luck@intel.com
2024-04-26	crypto: x86/aes-gcm - simplify GCM hash subkey derivation	Eric Biggers
	Remove a redundant expansion of the AES key, and use rodata for zeroes. Also rename rfc4106_set_hash_subkey() to aes_gcm_derive_hash_subkey() because it's used for both versions of AES-GCM, not just RFC4106. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-26	crypto: x86/aes-gcm - delete unused GCM assembly code	Eric Biggers
	Delete aesni_gcm_enc() and aesni_gcm_dec() because they are unused. Only the incremental AES-GCM functions (aesni_gcm_init(), aesni_gcm_enc_update(), aesni_gcm_finalize()) are actually used. This saves 17 KB of object code. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-26	crypto: x86/aes-xts - simplify loop in xts_crypt_slowpath()	Eric Biggers
	Since the total length processed by the loop in xts_crypt_slowpath() is a multiple of AES_BLOCK_SIZE, just round the length down to AES_BLOCK_SIZE even on the last step. This doesn't change behavior, as the last step will process a multiple of AES_BLOCK_SIZE regardless. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
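The reasoning in this commit can be checked with a small C sketch. It assumes a simplified model in which data arrives in steps of at most step bytes; round_down_to_block() and bytes_processed() are hypothetical helpers, not the kernel's xts_crypt_slowpath().

```c
#include <stddef.h>

#define AES_BLOCK_SIZE 16

/* Round n down to a multiple of AES_BLOCK_SIZE (a power of two). */
size_t round_down_to_block(size_t n)
{
	return n & ~(size_t)(AES_BLOCK_SIZE - 1);
}

/*
 * Process total bytes in steps of at most step bytes, rounding every
 * step down to a whole number of blocks, including the last one.
 * Because total is a multiple of AES_BLOCK_SIZE and every earlier step
 * consumed whole blocks, the final step is already block-aligned and
 * the rounding leaves it unchanged.
 */
size_t bytes_processed(size_t total, size_t step)
{
	size_t done = 0;

	while (done < total) {
		size_t avail = total - done;
		size_t n;

		if (avail > step)
			avail = step;
		n = round_down_to_block(avail);
		if (n == 0)
			break;	/* step smaller than one block */
		done += n;
	}
	return done;
}
```

For example, with total = 64 and step = 40 the loop consumes 32 bytes and then the remaining 32, so all 64 bytes are processed without special-casing the last step.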
2024-04-19	crypto: x86/aes-xts - optimize size of instructions operating on lengths	Eric Biggers
	x86_64 has the "interesting" property that the instruction size is generally a bit shorter for instructions that operate on the 32-bit (or less) part of registers, or registers that are in the original set of 8. This patch adjusts the AES-XTS code to take advantage of that property by changing the LEN parameter from size_t to unsigned int (which is all that's needed and is what the non-AVX implementation uses) and using the %eax register for KEYLEN. This decreases the size of aes-xts-avx-x86_64.o by 1.2%. Note that changing the kmovq to kmovd was going to be needed anyway to make the AVX10/256 code really work on CPUs that don't support 512-bit vectors (since the AVX10 spec says that 64-bit opmask instructions will only be supported on processors that support 512-bit vectors). Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19	crypto: x86/aes-xts - eliminate a few more instructions	Eric Biggers
	- For conditionally subtracting 16 from LEN when decrypting a message whose length isn't a multiple of 16, use the cmovnz instruction. - Fold the addition of 4*VL to LEN into the sub of VL or 16 from LEN. - Remove an unnecessary test instruction. This results in slightly shorter code, both source and binary. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19	crypto: x86/aes-xts - handle AES-128 and AES-192 more efficiently	Eric Biggers
	Decrease the amount of code specific to the different AES variants by "right-aligning" the sequence of round keys, and for AES-128 and AES-192 just skipping irrelevant rounds at the beginning. This shrinks the size of aes-xts-avx-x86_64.o by 13.3%, and it improves the efficiency of AES-128 and AES-192. The tradeoff is that for AES-256 some additional not-taken conditional jumps are now executed. But these are predicted well and are cheap on x86. Note that the ARMv8 CE based AES-XTS implementation uses a similar strategy to handle the different AES variants. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
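	The "right-aligning" idea can be illustrated with a small index calculation in C. This is only a sketch of the indexing, not the assembly from the patch: `MAX_ROUND_KEYS`, `aes_rounds()`, `first_slot()`, and `slots_used()` are hypothetical names. It assumes the standard AES round counts (10, 12, and 14 rounds for 128-, 192-, and 256-bit keys, each needing one more round key than rounds).

```c
#define MAX_ROUND_KEYS 15 /* AES-256: 14 rounds plus the initial round key */

/* Number of AES rounds for a key of key_bytes bytes (16, 24, or 32). */
static int aes_rounds(int key_bytes)
{
	return key_bytes / 4 + 6; /* 10, 12, or 14 */
}

/*
 * Slot holding the first round key when the schedule is right-aligned,
 * i.e. the last round key always sits in slot MAX_ROUND_KEYS - 1.
 */
static int first_slot(int key_bytes)
{
	return MAX_ROUND_KEYS - (aes_rounds(key_bytes) + 1);
}

/*
 * Iterate over every slot with one shared loop, skipping the leading
 * slots that are irrelevant for shorter keys. Returns the slots used.
 */
static int slots_used(int key_bytes)
{
	int used = 0;

	for (int slot = 0; slot < MAX_ROUND_KEYS; slot++) {
		if (slot < first_slot(key_bytes))
			continue; /* skipped for AES-128 and AES-192 */
		used++;
	}
	return used;
}
```

	With this layout the tail of the loop is identical for all key sizes; only the entry point differs (slot 4 for AES-128, slot 2 for AES-192, slot 0 for AES-256).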
2024-04-19	crypto: x86/aesni-xts - deduplicate aesni_xts_enc() and aesni_xts_dec()	Eric Biggers
	Since aesni_xts_enc() and aesni_xts_dec() are very similar, generate them from a macro that's passed an argument enc=1 or enc=0. This reduces the length of aesni-intel_asm.S by 112 lines while still producing the exact same object file in both 32-bit and 64-bit mode. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19	crypto: x86/aes-xts - handle CTS encryption more efficiently	Eric Biggers
	When encrypting a message whose length isn't a multiple of 16 bytes, encrypt the last full block in the main loop. This works because only decryption uses the last two tweaks in reverse order, not encryption. This improves the performance of encrypting messages whose length isn't a multiple of the AES block length, shrinks the size of aes-xts-avx-x86_64.o by 5.0%, and eliminates two instructions (a test and a not-taken conditional jump) when encrypting a message whose length is a multiple of the AES block length. While it's not super useful to optimize for ciphertext stealing given that it's rarely needed in practice, the other two benefits mentioned above make this optimization worthwhile. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19	crypto: x86/sha256-ni - simplify do_4rounds	Eric Biggers
	Instead of loading the message words into both MSG and \m0 and then adding the round constants to MSG, load the message words into \m0 and the round constants into MSG and then add \m0 to MSG. This shortens the source code slightly. It changes the instructions slightly, but it doesn't affect binary code size and doesn't seem to affect performance. Suggested-by: Stefan Kanthak <stefan.kanthak@nexgo.de> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19	crypto: x86/sha256-ni - optimize code size	Eric Biggers
	- Load the SHA-256 round constants relative to a pointer that points into the middle of the constants rather than to the beginning. Since x86 instructions use signed offsets, this decreases the instruction length required to access some of the later round constants. - Use punpcklqdq or punpckhqdq instead of longer instructions such as pshufd, pblendw, and palignr. This doesn't harm performance. The end result is that sha256_ni_transform shrinks from 839 bytes to 791 bytes, with no loss in performance. Suggested-by: Stefan Kanthak <stefan.kanthak@nexgo.de> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19	crypto: x86/sha256-ni - rename some register aliases	Eric Biggers
	MSGTMP[0-3] are used to hold the message schedule and are not temporary registers per se. MSGTMP4 is used as a temporary register for several different purposes and isn't really related to MSGTMP[0-3]. Rename them to MSG[0-3] and TMP accordingly. Also add a comment that clarifies what MSG is. Suggested-by: Stefan Kanthak <stefan.kanthak@nexgo.de> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19	crypto: x86/sha256-ni - convert to use rounds macros	Eric Biggers
	To avoid source code duplication, do the SHA-256 rounds using macros. This reduces the length of sha256_ni_asm.S by 153 lines while still producing the exact same object file. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-19	crypto: x86/aes-xts - access round keys using single-byte offsets	Eric Biggers
	Access the AES round keys using offsets -7*16 through 7*16, instead of 0*16 through 14*16. This allows VEX-encoded instructions to address all round keys using 1-byte offsets, whereas before some needed 4-byte offsets. This decreases the code size of aes-xts-avx-x86_64.o by 4.2%. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
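	The effect of recentering the offsets can be checked with a few lines of C. The sketch below assumes 16-byte round keys and relies on the fact that an x86 memory operand can carry a signed 8-bit displacement (-128 to 127) and otherwise needs a 32-bit one; `fits_disp8()` and `count_disp8()` are hypothetical helper names.

```c
#include <stdbool.h>

#define ROUND_KEY_BYTES 16

/* True if a byte offset fits in a signed 8-bit displacement. */
static bool fits_disp8(int offset)
{
	return offset >= -128 && offset <= 127;
}

/*
 * Count how many round keys with indices first..last (inclusive) can be
 * addressed with a 1-byte displacement from the base pointer.
 */
static int count_disp8(int first, int last)
{
	int n = 0;

	for (int i = first; i <= last; i++)
		if (fits_disp8(i * ROUND_KEY_BYTES))
			n++;
	return n;
}
```

	With indices 0 through 14, only 8 of the 15 offsets (0 to 112) fit in one byte, since 8*16 = 128 is already out of range. With indices -7 through 7, the offsets span -112 to 112 and all 15 fit.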
2024-04-12	crypto: x86/aes-xts - make non-AVX implementation use new glue code	Eric Biggers
	Make the non-AVX implementation of AES-XTS (xts-aes-aesni) use the new glue code that was introduced for the AVX implementations of AES-XTS. This reduces code size, and it improves the performance of xts-aes-aesni due to the optimization for messages that don't span page boundaries. This required moving the new glue functions higher up in the file and allowing the IV encryption function to be specified by the caller. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-12	crypto: x86/sha512-avx2 - add missing vzeroupper	Eric Biggers
	Since sha512_transform_rorx() uses ymm registers, execute vzeroupper before returning from it. This is necessary to avoid reducing the performance of SSE code. Fixes: e01d69cb0195 ("crypto: sha512 - Optimized SHA512 x86_64 assembly routine using AVX instructions.") Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-12	crypto: x86/sha256-avx2 - add missing vzeroupper	Eric Biggers
	Since sha256_transform_rorx() uses ymm registers, execute vzeroupper before returning from it. This is necessary to avoid reducing the performance of SSE code. Fixes: d34a460092d8 ("crypto: sha256 - Optimized sha256 x86_64 routine using AVX2's RORX instructions") Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-12	crypto: x86/nh-avx2 - add missing vzeroupper	Eric Biggers
	Since nh_avx2() uses ymm registers, execute vzeroupper before returning from it. This is necessary to avoid reducing the performance of SSE code. Fixes: 0f961f9f670e ("crypto: x86/nhpoly1305 - add AVX2 accelerated NHPoly1305") Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-05	crypto: x86/aes-xts - wire up VAES + AVX10/512 implementation	Eric Biggers
	Add an AES-XTS implementation "xts-aes-vaes-avx10_512" for x86_64 CPUs with the VAES, VPCLMULQDQ, and either AVX10/512 or AVX512BW + AVX512VL extensions. This implementation uses zmm registers to operate on four AES blocks at a time. The assembly code is instantiated using a macro so that most of the source code is shared with other implementations. To avoid downclocking on older Intel CPU models, an exclusion list is used to prevent this 512-bit implementation from being used by default on some CPU models. They will use xts-aes-vaes-avx10_256 instead. For now, this exclusion list is simply coded into aesni-intel_glue.c. It may make sense to eventually move it into a more central location. xts-aes-vaes-avx10_512 is slightly faster than xts-aes-vaes-avx10_256 on some current CPUs. E.g., on AMD Zen 4, AES-256-XTS decryption throughput increases by 13% with 4096-byte inputs, or 14% with 512-byte inputs. On Intel Sapphire Rapids, AES-256-XTS decryption throughput increases by 2% with 4096-byte inputs, or 3% with 512-byte inputs. Future CPUs may provide stronger 512-bit support, in which case a larger benefit should be seen. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
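	An exclusion list of this kind can be sketched as a simple table lookup. The C below is a user-space illustration only: the model identifiers are placeholders rather than real CPU models, and `is_zmm_excluded()`, `pick_vector_bits()`, and `pick_xts_impl()` are hypothetical names that do not reflect how aesni-intel_glue.c is actually written.

```c
#include <stddef.h>

/* Placeholder model IDs; these are not real CPUID values. */
enum cpu_model { MODEL_A = 1, MODEL_B = 2, MODEL_C = 3 };

/* Hypothetical list of models on which 512-bit vectors are avoided. */
static const enum cpu_model zmm_excluded[] = { MODEL_A, MODEL_B };

/* Return 1 if the model appears on the exclusion list, else 0. */
static int is_zmm_excluded(enum cpu_model m)
{
	size_t n = sizeof(zmm_excluded) / sizeof(zmm_excluded[0]);

	for (size_t i = 0; i < n; i++)
		if (zmm_excluded[i] == m)
			return 1;
	return 0;
}

/* Vector width (in bits) preferred by default for this model. */
static int pick_vector_bits(enum cpu_model m)
{
	return is_zmm_excluded(m) ? 256 : 512;
}

/* Name of the implementation preferred by default for this model. */
static const char *pick_xts_impl(enum cpu_model m)
{
	return pick_vector_bits(m) == 512 ? "xts-aes-vaes-avx10_512"
					  : "xts-aes-vaes-avx10_256";
}
```

	The list only changes the default preference: an excluded model still supports the 512-bit code, it is just not selected first.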
2024-04-05	crypto: x86/aes-xts - wire up VAES + AVX10/256 implementation	Eric Biggers
	Add an AES-XTS implementation "xts-aes-vaes-avx10_256" for x86_64 CPUs with the VAES, VPCLMULQDQ, and either AVX10/256 or AVX512BW + AVX512VL extensions. This implementation avoids using zmm registers, instead using ymm registers to operate on two AES blocks at a time. The assembly code is instantiated using a macro so that most of the source code is shared with other implementations. This is the optimal implementation on CPUs that support VAES and AVX512 but where the zmm registers should not be used due to downclocking effects, for example Intel's Ice Lake. It should also be the optimal implementation on future CPUs that support AVX10/256 but not AVX10/512. The performance is slightly better than that of xts-aes-vaes-avx2, which uses the same 256-bit vector length, due to factors such as being able to use ymm16-ymm31 to cache the AES round keys, and being able to use the vpternlogd instruction to do XORs more efficiently. For example, on Ice Lake, the throughput of decrypting 4096-byte messages with AES-256-XTS is 6.6% higher with xts-aes-vaes-avx10_256 than with xts-aes-vaes-avx2. While this is a small improvement, it is straightforward to provide this implementation (xts-aes-vaes-avx10_256) as long as we are providing xts-aes-vaes-avx2 and xts-aes-vaes-avx10_512 anyway, due to the way the _aes_xts_crypt macro is structured. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-05	crypto: x86/aes-xts - wire up VAES + AVX2 implementation	Eric Biggers
	Add an AES-XTS implementation "xts-aes-vaes-avx2" for x86_64 CPUs with the VAES, VPCLMULQDQ, and AVX2 extensions, but not AVX512 or AVX10. This implementation uses ymm registers to operate on two AES blocks at a time. The assembly code is instantiated using a macro so that most of the source code is shared with other implementations. This is the optimal implementation on AMD Zen 3. It should also be the optimal implementation on Intel Alder Lake, which similarly supports VAES but not AVX512. Comparing to xts-aes-aesni-avx on Zen 3, xts-aes-vaes-avx2 provides 70% higher AES-256-XTS decryption throughput with 4096-byte messages, or 23% higher with 512-byte messages. A large improvement is also seen with CPUs that do support AVX512 (e.g., 98% higher AES-256-XTS decryption throughput on Ice Lake with 4096-byte messages), though the following patches add AVX512 optimized implementations to get a bit more performance on those CPUs. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-05	crypto: x86/aes-xts - wire up AESNI + AVX implementation	Eric Biggers
	Add an AES-XTS implementation "xts-aes-aesni-avx" for x86_64 CPUs that have the AES-NI and AVX extensions but not VAES. It's similar to the existing xts-aes-aesni in that it uses xmm registers to operate on one AES block at a time. It differs from xts-aes-aesni in the following ways: - It uses the VEX-coded (non-destructive) instructions from AVX. This improves performance slightly. - It incorporates some additional optimizations such as interleaving the tweak computation with AES en/decryption, handling single-page messages more efficiently, and caching the first round key. - It supports only 64-bit (x86_64). - It's generated by an assembly macro that will also be used to generate VAES-based implementations. The performance improvement over xts-aes-aesni varies from small to large, depending on the CPU and other factors such as the size of the messages en/decrypted. For example, the following increases in AES-256-XTS decryption throughput are seen on the following CPUs:
	                      | 4096-byte messages | 512-byte messages |
	----------------------+--------------------+-------------------+
	Intel Skylake         |         6%         |        31%        |
	Intel Cascade Lake    |         4%         |        26%        |
	AMD Zen 1             |        61%         |        73%        |
	AMD Zen 2             |        36%         |        59%        |
	(The above CPUs don't support VAES, so they can't use VAES instead.) While this isn't as large an improvement as what VAES provides, this still seems worthwhile. This implementation is fairly easy to provide based on the assembly macro that's needed for VAES anyway, and it will be the best implementation on a large number of CPUs (very roughly, the CPUs launched by Intel and AMD from 2011 to 2018). This makes the existing xts-aes-aesni mostly obsolete. For now, leave it in place to support 32-bit kernels and also CPUs like Intel Westmere that support AES-NI but not AVX. (We could potentially remove it anyway and just rely on the indirect acceleration via ecb-aes-aesni in those cases, but that change will need to be considered separately.) Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-05	crypto: x86/aes-xts - add AES-XTS assembly macro for modern CPUs	Eric Biggers
	Add an assembly file aes-xts-avx-x86_64.S which contains a macro that expands into AES-XTS implementations for x86_64 CPUs that support at least AES-NI and AVX, optionally also taking advantage of VAES, VPCLMULQDQ, and AVX512 or AVX10. This patch doesn't expand the macro at all. Later patches will do so, adding each implementation individually so that the motivation and use case for each individual implementation can be fully presented. The file also provides a function aes_xts_encrypt_iv() which handles the encryption of the IV (tweak), using AES-NI and AVX. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-02	crypto: x86/aesni - Update aesni_set_key() to return void	Chang S. Bae
	The aesni_set_key() implementation has no error case, yet its prototype declares an error-code return value. Modify the function prototype to return void and adjust the related code. Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: linux-crypto@vger.kernel.org Cc: x86@kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2024-04-02	crypto: x86/aesni - Rearrange AES key size check	Chang S. Bae
	aes_expandkey() already includes an AES key size check. If AES-NI is unusable, invoke the function without the size check. Also, use aes_check_keylen() instead of open-coding the check. Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: linux-crypto@vger.kernel.org Cc: x86@kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
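	A centralized key-size check of the kind this message refers to might look like the following. This is a user-space sketch of the idea, not the kernel's aes_check_keylen() itself; `check_keylen_sketch()` is a hypothetical name, and the accepted sizes are the three standard AES key lengths.

```c
#include <errno.h>

/*
 * Return 0 if keylen (in bytes) is a valid AES key size, or -EINVAL
 * otherwise. Callers use this instead of repeating the comparison.
 */
static int check_keylen_sketch(unsigned int keylen)
{
	switch (keylen) {
	case 16: /* AES-128 */
	case 24: /* AES-192 */
	case 32: /* AES-256 */
		return 0;
	default:
		return -EINVAL;
	}
}
```

	Keeping the check in one helper means a caller that goes on to a key-expansion routine with its own validation can skip the explicit check, while other paths still reject bad sizes consistently.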