What's in a face? Visual contributions to speech segmentation

Abstract

Recent research has demonstrated that adults successfully segment two interleaved artificial speech streams with incongruent statistics (i.e., streams whose combined statistics are noisier than the encapsulated statistics) only when provided with an indexical cue of speaker voice. In a series of five experiments, our study explores whether learners can utilise visual information to encapsulate statistics for each speech stream. We initially presented learners with incongruent artificial speech streams produced by the same female voice along with an accompanying visual display. Learners successfully segmented both streams when the audio stream was presented with an indexical cue of talking faces (Experiment 1). This learning cannot be attributed to the presence of the talking face display alone, as a single face paired with a single input stream did not improve segmentation (Experiment 2). Additionally, participants failed to segment two streams when they were paired with a synchronised single talking face display (Experiment 3). Likewise, learners failed to segment both streams when the visual indexical cue lacked audio-visual synchrony, such as changes in background screen colour (Experiment 4) or a static face display (Experiment 5). We end by discussing the possible relevance of the speaker's face in speech segmentation and bilingual language acquisition.

Keywords: Face processing; Indexical cues; Multiple representations; Speech segmentation; Statistical learning

Acknowledgements

We thank Beth Buerger, Molly Jamison, and Troy Gury for conducting experiments. We also thank Marissa Weyer and Chip Gerfen for help in assembling the visual stimuli. We are grateful to Rich Carlson and Chip for helpful comments and to NIH R03 grant HD048996-01 for support of this research.

Notes

1. Note that both experiments were conducted in the same laboratory using an identical auditory familiarisation stream and test files. Given this similarity (excepting the visual cues of the present experiment), we analysed the differences in performance.

2. The other experiments in this study require greater numbers of participants in order to counterbalance face and language presentation order. Because the critical comparison for Experiment 2 is with Weiss et al., and not with other experiments here, we chose to keep the sample size comparable to Weiss et al., which included 13 participants.

3. Unlike the previous experiments, the same person whose voice was used to create the audio stream was used to create the face video. Thus, if learners are able to pick up speaker-specific details (e.g., fundamental frequency) from the speech stream, then these cues should be, to a first approximation, compatible with the face display.
