Audio-visual speech perception without speech cues

Abstract

A series of experiments was conducted in which listeners transcribed audio-visual sentences. The visual component of each stimulus was a male talker's face. The acoustic component was one of the following: (1) natural speech; (2) envelope-shaped noise, which preserved the duration and amplitude of the original speech waveform; or (3) various types of sine wave speech signals that followed the formant frequencies of a natural utterance. Sine wave speech is a skeletonized version of a natural utterance that retains the frequency and amplitude variation of the formants but lacks the fine-grained acoustic structure of speech. Intelligibility of the present set of sine wave sentences was relatively low compared with previous findings (Remez, Rubin, Pisoni, and Carrell, 1981). However, intelligibility increased greatly when visual information from the talker's face was presented along with the auditory stimuli. Further experiments demonstrated that the intelligibility of single tones increased differentially depending on which formant analog was presented. It was predicted that the increase in intelligibility for sine wave speech with an added video display would be greater than the gain observed with envelope-shaped noise. This prediction was based on the assumption that the information-bearing phonetic properties of spoken utterances are preserved in the audio-visual sine wave conditions.
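To make the two synthetic stimulus types concrete, the following is a minimal illustrative sketch, not the procedure used in the experiments. It assumes that formant frequency and amplitude tracks are already available as per-frame lists; the function names, the 10 ms frame duration, the 16 kHz sample rate, and the block-wise RMS envelope are all hypothetical choices made for illustration.

```python
import math
import random


def synthesize_tone(freqs_hz, amps, frame_dur_s=0.01, sample_rate=16000):
    """One time-varying sinusoid that follows a single formant track.

    freqs_hz and amps hold one value per analysis frame (assumed input).
    Values are linearly interpolated between frames, and phase is
    accumulated sample by sample so frequency changes stay continuous.
    """
    samples_per_frame = int(round(frame_dur_s * sample_rate))
    n_frames = len(freqs_hz)
    out = []
    phase = 0.0
    for i in range(n_frames):
        f0, a0 = freqs_hz[i], amps[i]
        # Hold the last frame's values when there is no next frame.
        f1 = freqs_hz[i + 1] if i + 1 < n_frames else f0
        a1 = amps[i + 1] if i + 1 < n_frames else a0
        for k in range(samples_per_frame):
            t = k / samples_per_frame
            freq = f0 + (f1 - f0) * t
            amp = a0 + (a1 - a0) * t
            out.append(amp * math.sin(phase))
            phase += 2.0 * math.pi * freq / sample_rate
    return out


def sine_wave_speech(tracks, frame_dur_s=0.01, sample_rate=16000):
    """Sum one sinusoid per formant track; tracks is a list of (freqs, amps)."""
    tones = [synthesize_tone(f, a, frame_dur_s, sample_rate) for f, a in tracks]
    return [sum(vals) for vals in zip(*tones)]


def envelope_shaped_noise(signal, window=160, seed=0):
    """Noise with the duration and block-wise RMS amplitude of the input.

    The RMS of each block of `window` samples scales uniform random noise,
    so the amplitude contour is kept and the spectral detail is discarded.
    """
    rng = random.Random(seed)
    out = []
    for start in range(0, len(signal), window):
        block = signal[start:start + window]
        rms = math.sqrt(sum(x * x for x in block) / len(block))
        out.extend(rms * rng.uniform(-1.0, 1.0) for _ in block)
    return out


# Hypothetical two-frame tracks for two formant analogs.
tracks = [([500.0, 600.0], [1.0, 0.5]), ([1500.0, 1400.0], [0.5, 0.25])]
sws = sine_wave_speech(tracks)
noise = envelope_shaped_noise(sws)
```

Passing a single `(freqs, amps)` pair to `sine_wave_speech` would give a single-tone stimulus of the kind described above, although which formant analog it represents depends entirely on the track supplied.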
