Publication | Closed Access
Binary coding of speech spectrograms using a deep auto-encoder
Citations: 382
References: 13
Year: 2010
Venue: Unknown
Topics: Engineering, Machine Learning, Binary Coding, Autoencoders, Speech Spectrogram, Speech Recognition, Log Likelihood Gradient, Data Science, Robust Speech Recognition, Automatic Recognition, Health Sciences, Generative Models, Computer Science, Deep Learning, Speech Communication, Deep Neural Networks, Speech Acoustics, Speech Processing, Speech Input, Speech Feature Extraction
TL;DR
The study explores a layer-by-layer training strategy for a multi-layer generative model of speech spectrogram patches. The model learns binary codes through layer-by-layer pretraining with contrastive divergence, is then unrolled into a deep auto-encoder fine-tuned by back-propagation, and reconstructed spectrograms are assembled via overlap-and-add. Experimental results show that the binary codes yield roughly 2 dB lower log-spectral distortion than subband vector quantization across the full frequency range of wide-band speech.
Abstract
This paper reports our recent exploration of the layer-by-layer learning strategy for training a multi-layer generative model of patches of speech spectrograms. The top layer of the generative model learns binary codes that can be used for efficient compression of speech and could also be used for scalable speech recognition or rapid speech content retrieval. Each layer of the generative model is fully connected to the layer below, and the weights on these connections are pretrained efficiently by using the contrastive divergence approximation to the log likelihood gradient. After layer-by-layer pre-training we "unroll" the generative model to form a deep auto-encoder, whose parameters are then fine-tuned using back-propagation. To reconstruct the full-length speech spectrogram, individual spectrogram segments predicted by their respective binary codes are combined using an overlap-and-add method. Experimental results on speech spectrogram coding demonstrate that the binary codes produce a log-spectral distortion that is approximately 2 dB lower than that of a subband vector quantization technique over the entire frequency range of wide-band speech.
Index Terms: deep learning, speech feature extraction, neural networks, auto-encoder, binary codes, Boltzmann machine
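The paper gives no code, but the pretraining step it describes is the standard contrastive-divergence update for a restricted Boltzmann machine. Below is a minimal NumPy sketch of a single CD-1 weight update for a Bernoulli RBM; the function name `cd1_step`, the learning rate, and the binary visible units are assumptions made for brevity (a Gaussian visible layer is the usual choice for the first layer when the inputs are real-valued spectrogram values), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_vis, b_hid, lr=0.01):
    """One CD-1 update for a Bernoulli RBM: visible batch v0 has shape
    (batch, n_vis); weights W have shape (n_vis, n_hid)."""
    # Positive phase: hidden probabilities driven by the data
    h0 = sigmoid(v0 @ W + b_hid)
    h0_sample = (rng.random(h0.shape) < h0).astype(v0.dtype)
    # Negative phase: one Gibbs step back down to the visibles and up again
    v1 = sigmoid(h0_sample @ W.T + b_vis)
    h1 = sigmoid(v1 @ W + b_hid)
    # Contrastive-divergence approximation to the log-likelihood gradient
    n = v0.shape[0]
    W += lr * (v0.T @ h0 - v1.T @ h1) / n
    b_vis += lr * (v0 - v1).mean(axis=0)
    b_hid += lr * (h0 - h1).mean(axis=0)

# Illustrative usage: pretrain one 256-to-128 layer on random "patches".
# After training, sigmoid(v @ W + b_hid) would feed the next RBM, and the
# stack of weights is unrolled into an encoder/decoder for back-propagation
# fine-tuning, as the abstract describes.
n_vis, n_hid = 256, 128
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)
patches = (rng.random((32, n_vis)) > 0.5).astype(float)
for _ in range(10):
    cd1_step(patches, W, b_vis, b_hid)
```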
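Likewise, a minimal sketch of the overlap-and-add reconstruction, assuming decoded fixed-length segments whose start frames are `hop` frames apart and simple averaging where segments overlap; the paper states only that an overlap-and-add method is used, so the function name and the averaging scheme are assumptions.

```python
import numpy as np

def overlap_add(segments, hop):
    """Reassemble a full spectrogram from a list of decoded segments,
    each of shape (seg_len, n_bins), taken every `hop` frames
    (hop <= seg_len so every frame is covered); overlapping frames
    are averaged."""
    seg_len, n_bins = segments[0].shape
    total = hop * (len(segments) - 1) + seg_len
    out = np.zeros((total, n_bins))
    count = np.zeros((total, 1))  # how many segments cover each frame
    for i, seg in enumerate(segments):
        start = i * hop
        out[start:start + seg_len] += seg
        count[start:start + seg_len] += 1.0
    return out / count
```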