Publication | Open Access
Audio-Based Multimedia Event Detection with DNNs and Sparse Sampling
Citations: 16
References: 10
Year: 2015
Unknown Venue
Convolutional Neural Network, Engineering, Machine Learning, Multimedia Analysis, Video Interpretation, Speech Recognition, Image Analysis, Data Science, Pattern Recognition, Video Transformer, Sparse Sampling, Feature Learning, Event Detection, Audio Retrieval, Video Understanding, Deep Learning, Birthday Party, Signal Processing, Computer Vision, Audio Mining, Speech Processing, Audio Content Information
This paper presents advances in analyzing audio content information to detect events in videos, such as a parade or a birthday party. We developed a set of tools for audio processing within the predominantly vision-focused deep neural network (DNN) framework Caffe. Using these tools, we show, for the first time, the potential of using only a DNN for audio-based multimedia event detection. Training DNNs for event detection using the entire audio track from each video causes a computational bottleneck. Here, we address this problem by developing a sparse audio frame-sampling method that improves event-detection speed and accuracy. We achieved a 10 percentage-point improvement in event classification accuracy, with a 200x reduction in the number of training input examples compared to using the entire track. This reduction in input feature volume led to a 16x reduction in the size of the DNN architecture and a 300x reduction in training time. We applied our method to the recently released YLI-MED dataset and compared our results with a state-of-the-art system and with results reported in the literature for TRECVID MED. Our results show much higher mean average precision (MAP) scores than a baseline i-vector system, at a significantly reduced computational cost. The speed improvement is relevant for processing videos on a large scale, and could enable more effective deployment in mobile systems.
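The abstract does not spell out how the sparse audio frame sampling works, so the following is only a minimal sketch of the general idea, assuming uniform random selection of a fixed number of frames per audio track. The function name `sparse_sample_frames`, the `num_samples` and `seed` parameters, and the toy feature values are all hypothetical and not taken from the paper.

```python
import random
from typing import List, Sequence


def sparse_sample_frames(
    frames: Sequence[Sequence[float]],
    num_samples: int,
    seed: int = 0,
) -> List[Sequence[float]]:
    """Keep at most `num_samples` frames, chosen uniformly at random.

    Assumption: uniform random sampling; the paper's actual scheme may differ.
    The kept frames are returned in their original time order.
    """
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    if len(frames) <= num_samples:
        # Short tracks are kept whole.
        return list(frames)
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(frames)), num_samples))
    return [frames[i] for i in keep]


# Toy stand-in for one video's audio track: 10,000 frames of 3 features each.
track = [[float(t), 0.0, 1.0] for t in range(10_000)]
sampled = sparse_sample_frames(track, num_samples=50)

# Ratio of full-track frames to sampled frames for this toy input.
reduction = len(track) / len(sampled)
```

In this toy setup, 50 of 10,000 frames are kept, which happens to give a reduction ratio of the same order as the 200x figure quoted above. The frame counts were chosen for illustration only and say nothing about the configuration used in the paper.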