Fake audio detection using Hierarchical Representations Learning and Spectrogram Features

Abstract

Fake voice, or synthetic speech, is speech generated by a machine or by other techniques. A replay attack is one type of fake voice, in which an attacker plays back a recording of the target speaker. Smartphones and audio recording devices offer high-quality recording and playback, so they can be used to carry out replay attacks. In this system we extract features from the audio in two ways. The first uses conventional features, namely Mel-frequency Cepstral Coefficients (MFCC). The second uses a Convolutional Neural Network (CNN): we first convert the audio signal into a spectrogram and then extract features from the last layer of the CNN model. We combine all of these features and pass them to a long short-term memory (LSTM) network. The LSTM learns from the combined features, and a single dense layer produces the output. For the experiments we used the ASVspoof 2019 dataset, which is divided into two parts, logical access (LA) and physical access (PA). Because the PA partition contains the replayed audio, we use PA. The experimental results show that the system achieves an equal error rate of 0.046 with 98% accuracy on the ASVspoof 2019 dataset.
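The equal error rate (EER) reported above is the operating point at which the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). The sketch below shows one simple way to estimate it from per-utterance scores. It is an illustration only, not the evaluation code used in this work: the function name `equal_error_rate`, the convention that higher scores mean "more likely genuine", and the toy scores are all assumptions.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    Assumed convention: an utterance is accepted as genuine when its
    score is greater than or equal to the threshold.
    Returns (eer, threshold) at the point where the false acceptance
    and false rejection rates are closest.
    """
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best = None  # (gap between rates, eer estimate, threshold)
    for t in thresholds:
        # False acceptance rate: spoofed utterances scoring at or above t.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: genuine utterances scoring below t.
        frr = sum(g < t for g in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores for illustration (hypothetical, not taken from the dataset).
genuine = [0.9, 0.8, 0.7, 0.6, 0.3]
spoof = [0.1, 0.2, 0.4, 0.5, 0.65]
eer, threshold = equal_error_rate(genuine, spoof)
```

With these toy scores, one of five genuine utterances is rejected and one of five spoofed utterances is accepted at a threshold of 0.6, so the estimated EER is 0.2. On a finite score set the two rates may never be exactly equal, which is why the sketch averages them at the closest point.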
