Effects of Stimulus Content and Duration on Talker Identification

TLDR

Sixteen listeners identified talkers from speech samples of varying duration and content, recorded by 10 talkers, in five types: vowels, CV sequences, monosyllabic words, disyllabic nonsense words, and sentences. Accuracy rose with the number of phonemes in the sample even when duration was controlled. Vowel type ([a] versus [i]) affected relative talker identifiability, response preference, and error patterns, and confusion matrices for a given vowel were asymmetric. With the tapes reversed, performance deteriorated on even the briefest excerpts. The results pose difficulties for models of talker identification based on attributes of voice quality.

Abstract

Sixteen listeners attempted to identify the talker when listening to speech samples of varying duration and content. The samples, recorded by 10 different talkers, were of five types: excerpted vowels, excerpted consonant-vowel (CV) sequences, monosyllabic words, disyllabic nonsense words, and sentences. Identification accuracy improved directly with the number of phonemes in the sample, even when duration was controlled. Stimulus-response matrices differed substantially between the two vowels ([a] and [i]) used in the vowel and CV samples: relative identifiability of the talkers, response preference, and error patterns were all found to depend on vowel type. Confusion matrices for a given vowel exhibited definite asymmetries. In a limited additional study, subsets of listeners responded to the same tapes played in reverse; performance deteriorated on even the briefest excerpts. The results pose some difficulties for a model of talker-identification behavior based on attributes of voice quality.
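
As an illustration of the quantities named above, the following minimal Python sketch builds a stimulus-response (confusion) matrix and derives overall accuracy, per-talker identifiability, response preference, and a simple asymmetry measure. The function names, the three-talker setup, and the trial lists are hypothetical and made up for illustration; they are not data or procedures from the study, and the asymmetry measure (the matrix minus its transpose) is one plausible choice rather than the authors' method.

```python
def confusion_matrix(stimuli, responses, n_talkers):
    """Count matrix: rows = actual talker, columns = talker named by the listener."""
    m = [[0] * n_talkers for _ in range(n_talkers)]
    for s, r in zip(stimuli, responses):
        m[s][r] += 1
    return m


def summarize(m):
    """Return accuracy, per-talker identifiability, response shares, and asymmetry."""
    n = len(m)
    total = sum(sum(row) for row in m)
    # Proportion of all trials on which the talker was named correctly.
    accuracy = sum(m[i][i] for i in range(n)) / total
    # Proportion correct for each talker (assumes every talker was presented).
    per_talker = [m[i][i] / sum(m[i]) for i in range(n)]
    # Share of all responses given to each talker name (response preference).
    response_share = [sum(m[i][j] for i in range(n)) / total for j in range(n)]
    # Entry (i, j) > 0 means talker i was called j more often than j was called i.
    asymmetry = [[m[i][j] - m[j][i] for j in range(n)] for i in range(n)]
    return accuracy, per_talker, response_share, asymmetry


# Made-up trials for three hypothetical talkers (0, 1, 2); not data from the study.
stimuli = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
responses = [0, 0, 1, 1, 1, 1, 1, 0, 2, 2, 2, 1]

m = confusion_matrix(stimuli, responses, 3)
accuracy, per_talker, response_share, asymmetry = summarize(m)
```

In this toy example talker 1 attracts half of all responses, and talker 0 is mistaken for talker 1 more often than the reverse, so the matrix is asymmetric. Computing such a matrix separately for each vowel would be one way to compare identifiability, response preference, and error patterns across vowel types.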