Category Learning Through Multimodality Sensing

TLDR

Humans learn complex categories without explicit labels, whereas most computational models assume labeled data, yet natural environments provide temporal correlations across modalities that influence human perception. This study proposes a simple unsupervised neural network that exploits these natural multimodal correlations. The network trains on co‑occurring lip motion and sound signals from a human speaker to develop separate visual and auditory speech classifiers. The resulting classifiers achieve performance comparable to supervised networks.

Abstract

Humans and other animals learn to form complex categories without receiving a target output, or teaching signal, with each input pattern. In contrast, most computer algorithms that emulate such performance assume the brain is provided with the correct output at the neuronal level or require grossly unphysiological methods of information propagation. Natural environments do not contain explicit labeling signals, but they do contain important information in the form of temporal correlations between sensations to different sensory modalities, and humans are affected by this correlational structure (Howells, 1944; McGurk & MacDonald, 1976; MacDonald & McGurk, 1978; Zellner & Kautz, 1990; Durgin & Proffitt, 1996). In this article we describe a simple, unsupervised neural network algorithm that also uses this natural structure. Using only the co-occurring patterns of lip motion and sound signals from a human speaker, the network learns separate visual and auditory speech classifiers that perform comparably to supervised networks.

References

Page 1

	Year	Citations

Page 1