Audio-Based Multimedia Event Detection with DNNs and Sparse Sampling

This paper presents advances in analyzing audio content to detect events in videos, such as a parade or a birthday party. We developed a set of tools for audio processing within the predominantly vision-focused deep neural network (DNN) framework Caffe. Using these tools, we show, for the first time, the potential of using only a DNN for audio-based multimedia event detection. Training DNNs for event detection on the entire audio track of each video creates a computational bottleneck; we address this problem by developing a sparse audio frame-sampling method that improves both event-detection speed and accuracy. We achieved a 10-percentage-point improvement in event classification accuracy, with a 200x reduction in the number of training input examples compared to using the entire track. This reduction in input feature volume led to a 16x reduction in the size of the DNN architecture and a 300x reduction in training time. We applied our method to the recently released YLI-MED dataset and compared our results with a state-of-the-art system and with results reported in the literature for TRECVID MED. Our results show much higher mean average precision (MAP) scores than the baseline i-vector system, at a significantly reduced computational cost. The speed improvement is relevant for processing videos at large scale, and could enable more effective deployment in mobile systems.
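
The abstract does not specify how the sparse frames are selected. The following is a minimal sketch of one plausible uniform-stride strategy, where `sparse_sample_frames`, `num_samples`, and the feature shapes are illustrative assumptions rather than the paper's actual method; it only shows how a ~200x reduction in per-video training inputs could be realized.

```python
# Minimal sketch of sparse audio frame sampling, assuming a uniform-stride
# selection over per-frame features. All names and parameters here are
# illustrative assumptions, not the paper's published algorithm.
import numpy as np

def sparse_sample_frames(frame_features: np.ndarray, num_samples: int) -> np.ndarray:
    """Select a sparse, evenly spaced subset of audio frames.

    frame_features: (num_frames, feature_dim) array of per-frame features
                    (e.g., log-mel filterbank vectors) for one video's track.
    num_samples:    how many frames to keep as DNN training inputs; a
                    ~200x reduction means num_samples ~= num_frames / 200.
    """
    num_frames = frame_features.shape[0]
    if num_frames <= num_samples:
        return frame_features  # track too short to subsample further
    # Evenly spaced indices across the whole track, so the sampled frames
    # still cover the full temporal extent of the event.
    indices = np.linspace(0, num_frames - 1, num_samples).astype(int)
    return frame_features[indices]

# Example: a 2-minute track at 100 frames/s yields 12,000 frames;
# keeping 60 of them is a 200x reduction in training examples.
track = np.random.randn(12000, 40)   # hypothetical log-mel features
subset = sparse_sample_frames(track, 60)
print(subset.shape)                   # (60, 40)
```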