Learning Contextual Transformer Network for Image Inpainting

Abstract

Fully convolutional networks with attention modules have proven effective for learning-based image inpainting. While many existing approaches can produce visually reasonable results, the generated images often show blurry textures or distorted structures around corrupted areas. The main reason is that convolutional neural networks have limited capacity for modeling contextual information with long-range dependencies. Although the attention mechanism can alleviate this problem to some extent, existing attention modules tend to emphasize similarities between the corrupted and the uncorrupted regions while ignoring the dependencies within each of them. Hence, this paper proposes the Contextual Transformer Network (CTN), which not only learns relationships between the corrupted and the uncorrupted regions but also exploits the internal closeness within each of them. In addition, rather than relying on a fully convolutional network, CTN stacks several transformer blocks in place of convolution layers to better model long-range dependencies. Finally, by dividing the image into patches of different sizes, we propose a multi-scale multi-head attention module to better model the affinity among various image regions. Experiments on several benchmark datasets demonstrate the superior performance of the proposed approach.
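To make the multi-scale idea concrete, the following is a minimal NumPy sketch, not the CTN implementation. It splits an image into non-overlapping patches at several sizes, runs plain scaled dot-product self-attention over all patches at each size (so every patch attends to every other patch, whether corrupted or not), and averages the results. The function names, the patch sizes, the single head per scale, the absence of learned projections, and the averaging step are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def split_into_patches(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def merge_patches(tokens, patch_size, height, width, channels):
    """Inverse of split_into_patches: rebuild the (H, W, C) image."""
    gh, gw = height // patch_size, width // patch_size
    x = tokens.reshape(gh, gw, patch_size, patch_size, channels)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, p, gw, p, c)
    return x.reshape(height, width, channels)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def patch_self_attention(tokens):
    """Scaled dot-product self-attention over patch tokens.

    Assumption: queries, keys and values are the raw tokens
    (no learned projections), to keep the sketch short.
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)  # affinity between all patch pairs
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ tokens, weights


def multi_scale_attention(image, patch_sizes=(2, 4, 8)):
    """Attend at several patch sizes and average the reconstructions.

    Assumption: one attention pass per scale, fused by a plain mean.
    """
    h, w, c = image.shape
    outputs = []
    for p in patch_sizes:
        tokens = split_into_patches(image, p)
        attended, _ = patch_self_attention(tokens)
        outputs.append(merge_patches(attended, p, h, w, c))
    return np.mean(outputs, axis=0)
```

For a 16 x 16 RGB array, `multi_scale_attention(image)` returns an array of the same shape. Small patches capture fine texture affinities and large patches capture coarser structure, which is the intuition behind attending at more than one scale.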
