Publication | Closed Access
iPIM: Programmable In-Memory Image Processing Accelerator Using Near-Bank Architecture
Citations: 76 | References: 79 | Year: 2020
Unknown Venue
Hardware Security, Image Processing, Image Analysis, Memory Access Patterns, Engineering, Hardware Acceleration, Edge Computing, High-performance Architecture, Image Processor, Computer Engineering, Computer Architecture, SIMB ISA, Many-core Architecture, Parallel Programming, Computer Science, Parallel Computing, Processor Architecture, In-memory Computing
Image processing is becoming an increasingly important domain for many applications on workstations and in the datacenter, which require accelerators for high performance and energy efficiency. The GPU, the state-of-the-art accelerator for image processing, suffers from a memory bandwidth bottleneck. To tackle this bottleneck, near-bank architecture provides a promising solution due to its enormous bank-internal bandwidth and low-energy memory access. However, previous work lacks hardware programmability, while image processing workloads contain numerous heterogeneous pipeline stages with diverse computation and memory access patterns. Enabling a programmable near-bank architecture with low hardware overhead remains challenging.

This work proposes iPIM, the first programmable in-memory image processing accelerator using near-bank architecture. First, we design a decoupled control-execution architecture to provide lightweight programmability support. Second, we propose the SIMB (Single-Instruction-Multiple-Bank) ISA to enable flexible control flow and data access. Third, we present an end-to-end compilation flow based on Halide that supports a wide range of image processing applications and maps them to our SIMB ISA. We further develop iPIM-aware compiler optimizations, including register allocation, instruction reordering, and memory order enforcement, to improve performance. We evaluate a set of representative image processing applications on iPIM and demonstrate that, on average, iPIM obtains 11.02× acceleration and 79.49% energy saving over an NVIDIA Tesla V100 GPU. Further analysis shows that our compiler optimizations contribute a 3.19× speedup over the unoptimized baseline.
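The reported averages can be restated in a few other forms with simple arithmetic. The Python sketch below is illustrative only and is not taken from the paper: it converts the 79.49% energy saving into the fraction of GPU energy still consumed and the equivalent reduction factor. It also derives a rough speedup for the unoptimized configuration, under the loud assumption that the 11.02× and 3.19× figures compose multiplicatively; averages of ratios generally do not compose exactly, so that last number is only an estimate. The function and variable names are hypothetical.

```python
# Figures quoted in the abstract (averages over the evaluated applications).
SPEEDUP_VS_GPU = 11.02         # iPIM speedup over an NVIDIA Tesla V100
ENERGY_SAVING_PERCENT = 79.49  # energy saved relative to the same GPU
COMPILER_SPEEDUP = 3.19        # optimized vs. unoptimized iPIM code


def energy_ratio(saving_percent: float) -> float:
    """Fraction of the baseline energy still consumed after a percent saving."""
    return 1.0 - saving_percent / 100.0


def reduction_factor(saving_percent: float) -> float:
    """How many times less energy is used, given a percent saving."""
    return 1.0 / energy_ratio(saving_percent)


remaining = energy_ratio(ENERGY_SAVING_PERCENT)
factor = reduction_factor(ENERGY_SAVING_PERCENT)

# ASSUMPTION (not stated in the source): the two speedups multiply, so the
# unoptimized configuration would be about 11.02 / 3.19 times faster than the GPU.
implied_unoptimized = SPEEDUP_VS_GPU / COMPILER_SPEEDUP

print(f"Energy used relative to GPU: {remaining:.4f}")
print(f"Energy reduction factor:     {factor:.2f}x")
print(f"Implied unoptimized speedup: {implied_unoptimized:.2f}x (estimate)")
```

Read this way, a 79.49% saving means iPIM uses about one fifth of the GPU's energy on these workloads, which is a little under a 5× reduction.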