Publication | Open Access
Self-Attention Generative Adversarial Networks
Citations: 2.2K | References: 0 | Year: 2018
Keywords: GAN Generator, Generative System, Cognitive Science, Engineering, Machine Learning, Generative Adversarial Network, Spectral Normalization, Attention Layers, Generative Models, Generative AI, Attention, Deep Learning, Social Sciences, Computer Vision, Synthetic Image Generation
Traditional convolutional GANs generate high‑resolution details based only on spatially local points in lower‑resolution feature maps, and recent work shows that generator conditioning influences GAN performance. The paper proposes the Self‑Attention Generative Adversarial Network (SAGAN) to enable attention‑driven, long‑range dependency modeling in image generation. SAGAN generates details using cues from all feature locations, its discriminator checks that highly detailed features in distant regions of the image are consistent with each other, and spectral normalization of the generator improves training dynamics. SAGAN achieves state‑of‑the‑art results on ImageNet, with an Inception score of 52.52 and a Fréchet Inception distance of 18.65, and visualizations of the attention layers show that the generator focuses on neighborhoods corresponding to object shapes.
In this paper, we propose the Self-Attention Generative Adversarial Network (SAGAN), which allows attention-driven, long-range dependency modeling for image generation tasks. Traditional convolutional GANs generate high-resolution details as a function of only spatially local points in lower-resolution feature maps. In SAGAN, details can be generated using cues from all feature locations. Moreover, the discriminator can check that highly detailed features in distant portions of the image are consistent with each other. Furthermore, recent work has shown that generator conditioning affects GAN performance. Leveraging this insight, we apply spectral normalization to the GAN generator and find that this improves training dynamics. The proposed SAGAN achieves state-of-the-art results, boosting the best published Inception score from 36.8 to 52.52 and reducing Fréchet Inception distance from 27.62 to 18.65 on the challenging ImageNet dataset. Visualization of the attention layers shows that the generator leverages neighborhoods that correspond to object shapes rather than local regions of fixed shape.
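The two ingredients in the abstract, a non-local self-attention block and spectral normalization, can be sketched in a few lines. Below is a minimal NumPy illustration, not the paper's implementation: the matrices `Wf`, `Wg`, `Wh`, `Wv` stand in for the 1×1 convolutions that produce queries, keys, and values (with the paper's C/8 and C/2 channel reductions), `gamma` is the learnable residual scale initialized to 0, and `spectral_norm` estimates the largest singular value by power iteration, as used to constrain weight matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wf, Wg, Wh, Wv, gamma=0.0):
    """Non-local self-attention over a flattened feature map.

    x  : (N, C) array, N = H*W spatial positions, C channels.
    Wf, Wg : (C, C//8) query/key projections (1x1 convs in the paper).
    Wh : (C, C//2) value projection; Wv : (C//2, C) output projection.
    gamma : learnable scalar, initialized to 0 so the network first
            relies on local cues and only gradually uses attention.
    """
    f = x @ Wf                        # queries, (N, C//8)
    g = x @ Wg                        # keys,    (N, C//8)
    h = x @ Wh                        # values,  (N, C//2)
    beta = softmax(f @ g.T, axis=-1)  # (N, N): each position attends over all positions
    o = (beta @ h) @ Wv               # attended features, back to C channels
    return gamma * o + x              # residual connection

def spectral_norm(W, n_iter=30):
    """Largest singular value of W, estimated by power iteration."""
    u = np.ones(W.shape[0]) / np.sqrt(W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return u @ (W @ v)
```

With `gamma = 0` the block is the identity, matching the paper's observation that attention is blended in progressively during training; dividing a weight matrix by `spectral_norm(W)` constrains its Lipschitz constant.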