Publication | Closed Access
CHIP-KNN: A Configurable and High-Performance K-Nearest Neighbors Accelerator on Cloud FPGAs
Citations: 27
References: 22
Year: 2020
Venue: Unknown
Cluster Computing, Engineering, FPGA Platforms, Advanced Computing, Hardware Algorithm, Computer Architecture, Cloud FPGAs, Hardware Security, KNN Parameters, Data Science, High-performance Architecture, Parallel Computing, Computer Engineering, FPGA Platform, Computer Science, FPGA Design, External-memory Algorithm, Hardware Acceleration, Edge Computing, Cloud Computing, Domain-specific Accelerator, Parallel Programming
The k-nearest neighbors (KNN) algorithm is an essential algorithm in many applications, such as similarity search, image classification, and database query. With the rapid growth in dataset size and in the feature dimension of each data point, KNN processing becomes increasingly compute- and memory-hungry. Most prior studies focus on accelerating the computation of KNN using the abundant parallel resources on FPGAs. However, they often overlook memory access optimizations on FPGA platforms and achieve only a marginal speedup over a multithreaded CPU implementation for large datasets. In this paper, we design and implement CHIP-KNN, an HLS-based, configurable, and high-performance KNN accelerator that optimizes off-chip memory access on cloud FPGAs with multiple DRAM or HBM (high-bandwidth memory) banks. CHIP-KNN is configurable for all essential parameters used in the algorithm, including the size of the search dataset, the feature dimension of each data point, the distance metric, and the number of nearest neighbors, K. To optimize its performance, we build an analytical performance model to explore the design space and balance the computation and memory access performance. Given a user configuration of the KNN parameters, our tool can automatically generate the optimal accelerator design on the given FPGA platform. Our experimental results on the Nimbix cloud computing platform show that, compared to a 16-thread CPU implementation, CHIP-KNN on the Xilinx Alveo U200 FPGA board with four DRAM banks and on the U280 FPGA board with HBM achieves an average of 7.5x and 19.8x performance speedup, and 6.1x and 16.0x performance/dollar improvement, respectively.
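To make the configurable parameters named in the abstract concrete, the following is a minimal software sketch of brute-force KNN search, not the CHIP-KNN accelerator design itself. The function name `knn_search`, the `METRICS` table, and the two metrics offered (Manhattan and squared Euclidean) are assumptions chosen for illustration; the abstract does not state which distance metrics are supported. Squared Euclidean distance is used because it yields the same neighbor ranking as Euclidean distance without the square root.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple

Point = Sequence[float]


def manhattan(a: Point, b: Point) -> float:
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def squared_euclidean(a: Point, b: Point) -> float:
    """Sum of squared coordinate differences (rank-equivalent to Euclidean)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical metric table; the metrics actually supported by CHIP-KNN
# are not listed in the abstract.
METRICS: Dict[str, Callable[[Point, Point], float]] = {
    "manhattan": manhattan,
    "euclidean": squared_euclidean,
}


def knn_search(
    dataset: Sequence[Point],
    query: Point,
    k: int,
    metric: str = "euclidean",
) -> List[Tuple[float, int]]:
    """Return the k nearest points as (distance, index) pairs, nearest first.

    The dataset size is len(dataset) and the feature dimension is
    len(query); together with `metric` and `k` these mirror the four
    parameters the abstract describes as configurable.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    dist = METRICS[metric]
    # Every point is scanned once: a distance computation per point,
    # followed by selection of the k smallest distances.
    scored = ((dist(point, query), index) for index, point in enumerate(dataset))
    return heapq.nsmallest(k, scored)
```

For example, `knn_search([(0, 0), (1, 1), (3, 4), (-1, 0)], (0, 0), k=2)` returns the points at indices 0 and 3. Because every query streams the whole dataset, the scan is limited by how fast points can be read as much as by how fast distances can be computed, which is the balance the abstract's performance model is described as exploring.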