Novel Dilated Separable Convolution Networks for Efficient Video Salient Object Detection in the Wild

Abstract

Appearance and motion are essential features in Video Salient Object Detection (VSOD) tasks. Most existing approaches rely on local features and therefore fail to capture appearance and motion-specific semantics at the global level. As a result, these methods perform poorly in unconstrained scenarios that combine multiple challenges, such as partial occlusion, motion blur, noise, and cluttered backgrounds. Moreover, these approaches require substantial computational resources because of their complex structures, which limits their real-world deployment. To resolve these issues and to balance accuracy against computational complexity, this paper proposes a Dilation Separable Convolution Network (DSCNet), which is equipped with a Dilation Attention Fusion Module (DAFM), a Bi-directional Cross-modality Fusion Module (BCFM), and a Saliency Prediction Module (SPM) to extract enhanced multi-scale motion and appearance features without increasing model complexity. Further, a Bi-directional Separable Convolution Network (BSC-Net), equipped with Separable Convolution Modules (SCMs) and FlowNet2.0, is proposed to exploit multi-scale contextual information across appearance cues and to generate enhanced multi-scale motion maps. For faster and better training of the DSCNet model, we propose a novel Stochastic Gradient-based Firefly Algorithm (SGFA), which adaptively balances exploration and exploitation in multi-scale, cross-modal embedded sub-spaces. Using the proposed SGFA, the DSCNet+ model is constructed on top of DSCNet, which further improves both training speed and the other evaluation metrics. The proposed models are evaluated on six benchmark datasets, and a detailed comparative study against sixteen state-of-the-art (SOTA) models is provided. A major highlight of the work is the strong performance of the proposed models on DAVSOD-Diff, the most difficult of these datasets and the one that best reflects challenging real-world scenarios.
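
The claim that dilated separable convolutions enlarge context without increasing model complexity can be illustrated with simple arithmetic. The sketch below is not taken from DSCNet: the channel counts, kernel size, and dilation rates are illustrative assumptions, and bias terms are ignored. It compares the weight count of a standard convolution with that of a depthwise separable one and shows how dilation widens the effective kernel while the weight count stays fixed.

```python
# Illustrative arithmetic only; layer sizes are assumptions, not DSCNet's.

def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def separable_conv_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise

def effective_kernel(k, dilation):
    """Spatial extent covered by a k x k kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)

if __name__ == "__main__":
    c_in, c_out, k = 64, 128, 3  # hypothetical layer sizes
    print("standard :", standard_conv_params(c_in, c_out, k))
    print("separable:", separable_conv_params(c_in, c_out, k))
    for d in (1, 2, 4):
        # Dilation changes the extent, not the number of weights.
        print("dilation", d, "-> extent", effective_kernel(k, d))
```

Under these assumed sizes the separable layer uses roughly an eighth of the weights of the standard layer, and a dilation rate of 4 stretches a 3 x 3 kernel to cover a 9 x 9 window at no extra parameter cost.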
