Separating Style and Content with Bilinear Models

TLDR

Perceptual systems routinely separate content from style, but a general, tractable computational model of this ability remains elusive: existing factor models are either too simple to capture factor interactions or lack efficient learning algorithms. The study introduces a general framework for learning two-factor tasks with bilinear models, which capture factor interactions yet can be fit efficiently using the singular value decomposition and expectation-maximization. The authors report promising results on three tasks in three perceptual domains: spoken vowel classification, extrapolation of fonts to unseen letters, and translation of faces to novel illuminants.

Abstract

Perceptual systems routinely separate "content" from "style," classifying familiar words spoken in an unfamiliar accent, identifying a font or handwriting style across letters, or recognizing a familiar face or object seen under unfamiliar viewing conditions. Yet a general and tractable computational model of this ability to untangle the underlying factors of perceptual observations remains elusive (Hofstadter, 1985). Existing factor models (Mardia, Kent, & Bibby, 1979; Hinton & Zemel, 1994; Ghahramani, 1995; Bell & Sejnowski, 1995; Hinton, Dayan, Frey, & Neal, 1995; Dayan, Hinton, Neal, & Zemel, 1995; Hinton & Ghahramani, 1997) are either insufficiently rich to capture the complex interactions of perceptually meaningful factors such as phoneme and speaker accent or letter and font, or do not allow efficient learning algorithms. We present a general framework for learning to solve two-factor tasks using bilinear models, which provide sufficiently expressive representations of factor interactions but can nonetheless be fit to data using efficient algorithms based on the singular value decomposition and expectation-maximization. We report promising results on three different tasks in three different perceptual domains: spoken vowel classification with a benchmark multi-speaker database, extrapolation of fonts to unseen letters, and translation of faces to novel illuminants.
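
As a rough illustration of how a bilinear model might be fit with the singular value decomposition, the sketch below assumes an "asymmetric" form in which each observation is a style-specific matrix applied to a content vector. The abstract does not spell out this formulation, so the stacking layout, the function names (fit_asymmetric_bilinear, reconstruct), and the synthetic data are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def fit_asymmetric_bilinear(Y, n_styles, J):
    """Fit Y ~= A @ B with a rank-J truncated SVD (illustrative sketch).

    Y is assumed to stack, for each style, a K-dimensional observation
    per content class, giving shape (n_styles * K, n_contents).
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    A = U[:, :J] * s[:J]          # stacked style matrices, (n_styles * K, J)
    B = Vt[:J, :]                 # content vectors as columns, (J, n_contents)
    K = Y.shape[0] // n_styles
    return A.reshape(n_styles, K, J), B

def reconstruct(A_styles, B):
    """Rebuild the stacked observation matrix from the fitted factors."""
    S, K, J = A_styles.shape
    return A_styles.reshape(S * K, J) @ B

# Synthetic check: data generated from an exact rank-J bilinear model
# (hypothetical sizes, chosen only for the demonstration).
rng = np.random.default_rng(0)
S, C, K, J = 4, 5, 6, 3
true_A = rng.normal(size=(S, K, J))
true_B = rng.normal(size=(J, C))
Y = true_A.reshape(S * K, J) @ true_B

A_styles, B = fit_asymmetric_bilinear(Y, n_styles=S, J=J)
max_error = np.max(np.abs(Y - reconstruct(A_styles, B)))
```

On noise-free data of this kind the truncated SVD recovers the observations up to floating-point error, although the factors themselves are determined only up to an invertible linear transformation. The expectation-maximization step mentioned in the abstract is not covered by this sketch.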
