Publication | Open Access
CHARM: <u>C</u>omposing <u>H</u>eterogeneous <u>A</u>ccele<u>R</u>ators for <u>M</u>atrix Multiply on Versal ACAP Architecture
Citations: 37
References: 25
Year: 2023
Keywords: Small MM Layers, Engineering, Hardware Acceleration, High-performance Architecture, Versal ACAP Architecture, Hardware Algorithm, Many-core Architecture, Computer Engineering, Computer Architecture, Machine Learning Models, Parallel Programming, Computer Science, Parallel Computing, Deep Learning, Dense Matrix Multiply
Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic (PL) with AI Engine (AIE) processors optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can theoretically provide up to 6.4 TFLOPS of performance for 32-bit floating-point (fp32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. In our investigation, we observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieves less than 5% of the theoretical peak performance. Therefore, one key question arises: How can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes?
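The 6.4 TFLOPS figure in the abstract follows from a simple roofline-style product of core count, clock frequency, and per-core throughput. The sketch below reproduces that arithmetic; the 16 fp32 FLOPs/cycle/core value is an assumption chosen so that 400 cores at 1 GHz yield the stated peak, and the 0.3 TFLOPS "achieved" rate is a hypothetical illustration of a small MM layer running below 5% of peak, not a measurement from the paper.

```python
def peak_tflops(num_cores=400, freq_ghz=1.0, flops_per_cycle=16):
    """Theoretical peak in TFLOPS: cores x clock (GHz) x FLOPs per cycle.

    flops_per_cycle=16 is an assumed fp32 rate per AIE core, picked so the
    defaults reproduce the abstract's 6.4 TFLOPS array-level peak.
    """
    return num_cores * freq_ghz * flops_per_cycle / 1000.0


def efficiency(achieved_tflops, **kwargs):
    """Fraction of the theoretical peak actually achieved."""
    return achieved_tflops / peak_tflops(**kwargs)


print(peak_tflops())        # 6.4 -- matches the 6.4 TFLOPS quoted above
# Hypothetical small MM layer sustaining 0.3 TFLOPS: under 5% of peak.
print(f"{efficiency(0.3):.1%}")
```

This kind of back-of-the-envelope model is useful for the question the abstract poses: it separates the compute ceiling (fixed by the array) from the achieved rate (limited by how well a given MM shape keeps the cores and communication fabric busy).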