Concepedia

Publication | Closed Access

A High-Throughput Toom-Cook-4 Polynomial Multiplier for Lattice-Based Cryptography Using a Novel Winograd-Schoolbook Algorithm

10

Citations

24

References

2023

Year

Abstract

Polynomial multiplication over rings is a significant bottleneck of ring learning with error (RLWE)-based encryption. To speed it up, the number theoretic transform (NTT) and Toom-Cook-4 (TC4) are commonly used algorithms. Compared with NTT, TC4 is less restrictive and more flexible. However, there is a large opportunity at the algorithm level to improve the Schoolbook algorithm and postprocessing of TC4. Therefore, we propose a novel and efficient Winograd-Schoolbook algorithm that reduces multiplication by 29.1% (N = 256). We also propose a fused and low-density postprocessing that simplifies the algorithm flow and reduces multiplication by 56.25%. In total, these two-part improvements reduce the multiplication of TC4 by 32.47%. A high throughput and efficiency TC4 polynomial multiplier (TCMW) is proposed to speed up polynomial multiplication over rings. In TCMW, a highly parallel full pipelined structure without data waiting between modules is designed to make the parallelism of each module match and avoid the storage of intermediate results. In addition, based on the improved TC4, data buffers with data reuse, elementwise vector multiplication (EWVM) arrays, and efficient interpolation arrays are all designed to improve the performance and efficiency of TCMW. Implemented on the Xilinx VC709 field programmable gate array (FPGA) platform, TCMW can perform a TC4-based <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$256\times 256$ </tex-math></inline-formula> polynomial multiplication over rings with an unrestricted modulus (as long as its factors do not contain 3 or 5) every <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$1.89~\mu $ </tex-math></inline-formula> s at a 385 MHz clock frequency. Compared with prior designs of TC4, under the same conditions, the throughput of TCMW achieves an improvement of <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$1.91\times \,\,\sim \,\,7.71\times $ </tex-math></inline-formula> , and the efficiency of LUT and DSP achieve improvements of <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$1.31\times \,\,\sim \,\,3.67\times $ </tex-math></inline-formula> and <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"> <tex-math notation="LaTeX">$1.87\times \,\,\sim \,\,4.92\times $ </tex-math></inline-formula> , respectively.

References

YearCitations

Page 1