Probing phoneme, language and speaker information in unsupervised speech representations

Abstract

Unsupervised representation models based on Contrastive Predictive Coding (CPC) [1] are primarily used in spoken language modelling because they encode phonetic information. In this study, we ask what other types of information are present in CPC speech representations. We focus on three categories: phone class, gender and language, and compare monolingual and bilingual models. Using qualitative and quantitative tools, we find that both gender and phone class information are present in both types of models. Language information, however, is very salient in the bilingual model only, suggesting CPC models learn to discriminate languages when trained on multiple languages. Some language information can also be retrieved from monolingual models, but it is more diffused across all features. These patterns hold when analyses are carried out on the discrete units from a downstream clustering model. However, although the number of target clusters has no effect on phone class and language information, more gender information is encoded with more clusters. Finally, we find that there is some cost to being exposed to two languages on a downstream phoneme discrimination task.
