SwiftNet: Real-time Video Object Segmentation

TLDR

SwiftNet is a real-time, semi-supervised (one-shot) video object segmentation framework intended as a strong, efficient baseline for mobile vision applications. It compresses spatiotemporal redundancy with Pixel-Adaptive Memory (PAM), which updates memory only on frames with significant inter-frame changes and only on dynamic pixels, and it uses a light-aggregation encoder built on reversed sub-pixel operations. On the DAVIS 2017 validation set, SwiftNet reports 77.8% J&F at 70 FPS, which the authors describe as leading existing solutions in overall accuracy and speed. The source code is available at https://github.com/haochenheheda/SwiftNet.

Abstract

In this work we present SwiftNet for real-time semi-supervised video object segmentation (one-shot VOS), which reports 77.8% $\mathcal{J}\&\mathcal{F}$ and 70 FPS on the DAVIS 2017 validation set, leading all present solutions in overall accuracy and speed performance. We achieve this by compressing spatiotemporal redundancy in matching-based VOS via Pixel-Adaptive Memory (PAM). Temporally, PAM adaptively triggers memory updates on frames where objects display noteworthy inter-frame variations. Spatially, PAM selectively performs memory update and matching on dynamic pixels while ignoring static ones, significantly reducing redundant computation spent on segmentation-irrelevant pixels. To promote efficient reference encoding, SwiftNet also introduces a light-aggregation encoder that deploys reversed sub-pixel operations. We hope SwiftNet can set a strong and efficient baseline for real-time VOS and facilitate its application in mobile vision. The source code of SwiftNet can be found at https://github.com/haochenheheda/SwiftNet.
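
The two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the released SwiftNet code. The function names, the thresholds, and the use of a plain frame difference as the variation measure are all assumptions made for illustration, and the abstract does not specify the actual trigger or pixel-selection rule. The last function shows a reversed sub-pixel operation in its common space-to-depth form, which trades spatial resolution for channels without discarding values.

```python
import numpy as np

def pam_gate(prev_frame, cur_frame, pixel_threshold=0.1, frame_threshold=0.05):
    """Return (trigger, dynamic_mask) for two (H, W, 3) frames scaled to [0, 1].

    Both thresholds are hypothetical values chosen for this sketch.
    """
    diff = np.abs(cur_frame.astype(np.float32) - prev_frame.astype(np.float32))
    per_pixel = diff.mean(axis=-1)            # (H, W) change magnitude
    dynamic = per_pixel > pixel_threshold      # spatial: which pixels changed
    # Temporal: update memory only if enough of the frame changed.
    trigger = bool(dynamic.mean() > frame_threshold)
    return trigger, dynamic

def update_memory(memory, features, dynamic):
    """Overwrite (H, W, D) memory at dynamic pixels; static pixels keep old entries."""
    updated = memory.copy()
    updated[dynamic] = features[dynamic]
    return updated

def reversed_subpixel(x, r):
    """Space-to-depth: (C, H, W) -> (C*r*r, H//r, W//r), keeping every value."""
    c, h, w = x.shape
    if h % r or w % r:
        raise ValueError("H and W must be divisible by r")
    x = x.reshape(c, h // r, r, w // r, r)
    return x.transpose(0, 2, 4, 1, 3).reshape(c * r * r, h // r, w // r)
```

In this sketch a frame that fails the temporal gate is skipped entirely, and a frame that passes only pays for the pixels flagged as dynamic, which is the source of the savings the abstract describes.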
