Publication | Closed Access
I-vector-based speaker adaptation of deep neural networks for French broadcast audio transcription
Year: 2014
Citations: 119
References: 16
Venue: Unknown
Topics: Engineering, Machine Learning, Speech Segments, Speech Recognition, Natural Language Processing, Phonetics, Acoustic Feature Vectors, Speaker Diarization, Robust Speech Recognition, Voice Recognition, Language Studies, Computer Science, Deep Learning, Speech Communication, Deep Neural Networks, ETAPE 2011, I-vector-based Speaker Adaptation, Multi-speaker Speech Recognition, Speech Processing, Speech Input, Speech Perception, Linguistics, Speaker Recognition
State-of-the-art speaker recognition systems are based on the i-vector representation of speech segments. In this paper we show how this representation can be used to perform blind speaker adaptation of a hybrid DNN-HMM speech recognition system, and we report excellent results on a French-language audio transcription task. The implementation is very simple. An audio file is first diarized, and each speaker cluster is represented by an i-vector. Acoustic feature vectors are augmented with the corresponding i-vectors before being presented to the DNN. (The same i-vector is used for all acoustic feature vectors aligned with a given speaker.) This supplementary information improves the DNN's ability to discriminate between phonetic events in a speaker-independent way, without requiring any modification to the DNN training algorithms. We report results on the ETAPE 2011 transcription task and show that i-vector-based speaker adaptation is effective irrespective of whether cross-entropy or sequence training is used. For cross-entropy training, we obtained a word error rate (WER) reduction from 22.16% to 20.67%, whereas for sequence training the WER drops from 19.93% to 18.40%.
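Since the augmentation step described in the abstract amounts to concatenating a per-speaker i-vector onto every frame of that speaker's acoustic features, a minimal sketch may help. The code below assumes a diarization pass has already assigned a speaker-cluster label to each frame and that one i-vector has been extracted per cluster; the dimensions (40-dim features, 100-dim i-vectors), the toy data, and the function name `augment_with_ivectors` are illustrative assumptions, not the paper's actual configuration.

```python
# Illustrative sketch of i-vector feature augmentation; dimensions
# and names are assumptions, not the paper's setup.
import numpy as np

def augment_with_ivectors(frames, frame_speakers, speaker_ivectors):
    """Concatenate each acoustic frame with its speaker's i-vector.

    frames           : (T, D) array of acoustic feature vectors
    frame_speakers   : length-T list of speaker-cluster labels,
                       e.g. produced by a diarization pass
    speaker_ivectors : dict mapping speaker label -> (K,) i-vector
    Returns a (T, D + K) array; the same i-vector is repeated for
    every frame aligned with a given speaker, as in the paper.
    """
    augmented = [np.concatenate([frame, speaker_ivectors[spk]])
                 for frame, spk in zip(frames, frame_speakers)]
    return np.stack(augmented)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D, K = 6, 40, 100                      # frames, feature dim, i-vector dim (assumed)
    frames = rng.normal(size=(T, D))          # stand-in acoustic features
    frame_speakers = ["spk0", "spk0", "spk1", "spk1", "spk1", "spk0"]
    ivectors = {s: rng.normal(size=K) for s in ("spk0", "spk1")}
    dnn_input = augment_with_ivectors(frames, frame_speakers, ivectors)
    print(dnn_input.shape)                    # (6, 140): augmented vectors fed to the DNN
```

Because the i-vector enters only as extra input dimensions, the DNN architecture and training procedure are left untouched, which is the point the abstract makes about not modifying the training algorithms.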