Speech waveform envelope cues for consonant recognition

TLDR

This study examined which time-intensity envelope cues support consonant recognition. Twelve normal-hearing listeners heard noise modulated by the speech envelopes of 19 /aCa/ nonsense syllables, with the envelopes low-pass filtered at 20, 200, or 2000 Hz. Consonant identification was above chance in all three conditions and improved significantly when the envelope bandwidth increased from 20 to 200 Hz. Multidimensional scaling of the confusion data identified three envelope features that divided the consonants into four “enveme” groups. Combined with visemes, these groups can distinguish most of the 19 consonants, which suggests that near-perfect identification could be attained from envelope and visual cues alone, without spectral information.

Abstract

This study investigated the cues for consonant recognition that are available in the time-intensity envelope of speech. Twelve normal-hearing subjects listened to three sets of spectrally identical noise stimuli created by multiplying noise with the speech envelopes of 19 /aCa/ natural-speech nonsense syllables. The speech envelope for each of the three noise conditions was derived using a different low-pass filter cutoff (20, 200, and 2000 Hz). Average consonant identification performance was above chance for the three noise conditions and improved significantly with the increase in envelope bandwidth from 20 to 200 Hz. SINDSCAL multidimensional scaling analysis of the consonant confusion data identified three speech envelope features that divided the 19 consonants into four envelope feature groups (“envemes”). The enveme groups in combination with visually distinctive speech feature groupings (“visemes”) can distinguish most of the 19 consonants. These results suggest that near-perfect consonant identification performance could be attained by subjects who receive only enveme and viseme information and no spectral information.
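
The stimulus construction described above (a speech envelope, low-pass filtered at a chosen cutoff, multiplied with noise) can be illustrated with a minimal sketch. The abstract does not specify the rectification method, the filter design, or the noise type, so the full-wave rectifier, the one-pole low-pass filter, the uniform white noise, the 16 kHz sample rate, and the toy amplitude-modulated tone standing in for a syllable are all assumptions made for illustration. The function names are hypothetical.

```python
import math
import random


def lowpass(signal, cutoff_hz, sample_rate_hz):
    """One-pole low-pass filter (an assumed stand-in for the study's filter)."""
    # Smoothing coefficient of a first-order RC-style filter, in (0, 1).
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    out, state = [], 0.0
    for x in signal:
        state += alpha * (x - state)
        out.append(state)
    return out


def envelope_modulated_noise(speech, cutoff_hz, sample_rate_hz, seed=0):
    """Return (stimulus, envelope): noise multiplied by the smoothed envelope."""
    rng = random.Random(seed)
    rectified = [abs(x) for x in speech]  # full-wave rectification (assumption)
    envelope = lowpass(rectified, cutoff_hz, sample_rate_hz)
    noise = [rng.uniform(-1.0, 1.0) for _ in speech]  # white noise (assumption)
    stimulus = [e * n for e, n in zip(envelope, noise)]
    return stimulus, envelope


# Toy "speech": one second of a 1 kHz tone with 4 Hz amplitude modulation.
sample_rate_hz = 16000
speech = [
    0.5 * (1.0 + math.sin(2.0 * math.pi * 4.0 * n / sample_rate_hz))
    * math.sin(2.0 * math.pi * 1000.0 * n / sample_rate_hz)
    for n in range(sample_rate_hz)
]

# One stimulus per envelope cutoff used in the study (20, 200, and 2000 Hz).
stimuli = {
    cutoff: envelope_modulated_noise(speech, cutoff, sample_rate_hz)[0]
    for cutoff in (20, 200, 2000)
}
```

Because the same seed produces the same noise carrier in every condition, the three stimuli in this sketch differ only in how much envelope detail survives the filter: the 20 Hz version keeps only the slow amplitude contour, while the wider cutoffs retain progressively faster fluctuations.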