VISTA: A 704mW 4K-UHD CNN Processor for Video and Image Spatial/Temporal Interpolation Acceleration

Abstract

Video convolutional neural networks (CNNs) have achieved great success in high-resolution imaging applications such as video super-resolution (VSR), and have demonstrated superior quality and temporal consistency by leveraging temporal information. In particular, as shown in Fig. 2.6.1, video CNNs can also support applications such as video-frame interpolation (VFI), which is difficult to achieve with single-image CNNs. Video CNNs therefore have enormous potential for next-generation imaging and display technology. However, high-throughput video CNN inference poses three design challenges. First, external memory access (EMA) and computational complexity are massive, since both grow with the number of input frames (N). Second, supporting cross-frame alignment with in-order frame scheduling requires a large amount of feature-map (FM) memory. Third, supporting deformable convolution (DC) for alignment costs extra line buffers for the irregular samples and extra computation for bilinear interpolation (BI). In this work, we present a video CNN processor that supports inference for diverse video CNN applications at 4K-UHD resolution and addresses these challenges through three key features: 1) a cuboid-based layer-fusion (CBLF) inference flow to reduce EMA and computational complexity; 2) an alignment-aware memory optimization technique to reduce the FM memory size; and 3) a hardware-model co-design of tile-based offset-confined deformable convolution (TODC) to alleviate the overhead of the additional FM line buffers and computation logic that DC induces.
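To make the third challenge concrete, the following minimal Python sketch shows why a deformable-convolution tap needs bilinear interpolation and how confining the learned offset bounds the region of the feature map a tap can reach. It is an illustration under stated assumptions, not the TODC design itself: the function names, the `max_offset` parameter, the simple clamping rule, and the zero padding are all hypothetical choices made here, and the abstract does not specify how the offsets are actually confined.

```python
import math


def bilinear_sample(fm, y, x):
    """Sample a 2-D feature map (list of rows) at a fractional position.

    Positions outside the map read as 0.0 (zero padding, an assumption).
    """
    h, w = len(fm), len(fm[0])
    y0, x0 = math.floor(y), math.floor(x)
    dy, dx = y - y0, x - x0

    def at(r, c):
        return fm[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    # Weighted sum of the four neighbouring integer positions.
    return ((1 - dy) * (1 - dx) * at(y0, x0)
            + (1 - dy) * dx * at(y0, x0 + 1)
            + dy * (1 - dx) * at(y0 + 1, x0)
            + dy * dx * at(y0 + 1, x0 + 1))


def confine(offset, max_offset):
    """Clamp a learned offset to [-max_offset, +max_offset] (hypothetical rule)."""
    return max(-max_offset, min(max_offset, offset))


def deformable_tap(fm, y, x, off_y, off_x, max_offset):
    """Read one deformable-convolution tap with a confined offset."""
    return bilinear_sample(fm,
                           y + confine(off_y, max_offset),
                           x + confine(off_x, max_offset))


if __name__ == "__main__":
    fm = [[0.0, 1.0],
          [2.0, 3.0]]
    # A half-pixel offset in both directions averages the four neighbours.
    print(deformable_tap(fm, 0, 0, 0.5, 0.5, max_offset=1.0))
    # A large offset is clamped, so the tap cannot reach beyond one pixel.
    print(deformable_tap(fm, 0, 0, 10.0, 10.0, max_offset=1.0))
```

Because every tap stays within `max_offset` of its regular sampling position, the number of feature-map rows that must be kept available for a given output row is bounded, which is the property that limits line-buffer cost in this simplified view.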
