Picture My Voice: Audio to Visual Speech Synthesis using Artificial Neural Networks

Abstract

This paper presents an initial implementation and evaluation of a system that synthesizes visual speech directly from the acoustic waveform. An artificial neural network (ANN) was trained to map the cepstral coefficients of an individual's natural speech to the control parameters of an animated synthetic talking head. We trained on two data sets: one was a set of 400 words spoken in isolation by a single speaker, and the other was a subset of extemporaneous speech from 10 different speakers. The system showed learning in both cases. A perceptual evaluation test indicated that the system's generalization to new words by the same speaker provides significant visible information, though significantly less than that given by a text-to-speech algorithm.

1. INTRODUCTION

People find it hard to communicate when the auditory conditions are poor, e.g. due to noise, limited bandwidth, or hearing impairment. Under such circumstances, face-to-face communication is preferable. The visual component of speech c...

