Publication | Closed Access
A Transductive Approach for Video Object Segmentation
Citations: 127
References: 35
Year: 2020
Unknown Venue
Scene Analysis, Engineering, Machine Learning, Video Interpretation, Transductive Approach, Image Analysis, Data Science, Pattern Recognition, Video Content Analysis, Strong Transductive Method, Computational Geometry, Video Transformer, Vanilla Resnet50 Backbone, Machine Vision, Computer Science, Video Understanding, Deep Learning, Computer Vision, Video Segmentation, Scene Understanding, Image Segmentation, Instance Segmentation
Semi‑supervised video object segmentation seeks to isolate a target object from a video given its first‑frame mask, yet existing methods rely on external modules such as optical flow and instance segmentation, limiting fair comparison. The authors propose a simple transductive method that eliminates the need for extra modules, datasets, or specialized architectures. Their approach propagates pixel labels forward via feature similarity in an embedding space, holistically diffusing temporal information to capture long‑term appearance, and is implemented with a vanilla ResNet50 backbone. The method achieves 72.3 % on DAVIS 2017 validation and 63.1 % on test while running at ~37 fps, demonstrating that a simple transductive approach can serve as an efficient baseline.
Semi-supervised video object segmentation aims to separate a target object from a video sequence, given the mask in the first frame. Most currently prevailing methods use information from additional modules trained in other domains, such as optical flow and instance segmentation, and as a result they do not compete with other methods on common ground. To address this issue, we propose a simple yet strong transductive method that needs no additional modules, datasets, or dedicated architectural designs. Our method takes a label propagation approach in which pixel labels are passed forward based on feature similarity in an embedding space. Unlike other propagation methods, ours diffuses temporal information in a holistic manner that takes account of long-term object appearance. In addition, our method incurs little additional computational overhead and runs at a fast ~37 fps. Our single model with a vanilla ResNet50 backbone achieves an overall score of 72.3% on the DAVIS 2017 validation set and 63.1% on the test set. This simple yet high-performing and efficient method can serve as a solid baseline that facilitates future research. Code and models are available at https://github.com/microsoft/transductive-vos.pytorch.
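The label propagation idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `propagate_labels`, the `top_k` and `temperature` parameters, and the use of cosine similarity with a softmax over the nearest reference pixels are all assumptions chosen for illustration. The reference pixels stand in for embeddings pooled from any set of earlier frames, which is one simple way to let long-term appearance inform the current frame.

```python
import numpy as np


def propagate_labels(ref_feats, ref_labels, query_feats, num_classes,
                     top_k=5, temperature=0.1):
    """Assign labels to query pixels from labeled reference pixels.

    ref_feats:   (N, D) embeddings of already-labeled pixels (hypothetical input)
    ref_labels:  (N,)   integer object ids for those pixels
    query_feats: (M, D) embeddings of pixels in the frame to be labeled
    Returns (predicted_labels of shape (M,), probabilities of shape (M, C)).
    """
    eps = 1e-8
    # L2-normalize so the dot product below is cosine similarity.
    ref = ref_feats / (np.linalg.norm(ref_feats, axis=1, keepdims=True) + eps)
    qry = query_feats / (np.linalg.norm(query_feats, axis=1, keepdims=True) + eps)

    sim = qry @ ref.T  # (M, N) similarity of every query pixel to every reference pixel

    # Keep only the k most similar reference pixels for each query pixel.
    k = min(top_k, ref.shape[0])
    idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]   # (M, k)
    top_sim = np.take_along_axis(sim, idx, axis=1)      # (M, k)

    # Softmax over the kept similarities (assumed weighting scheme).
    weights = np.exp((top_sim - top_sim.max(axis=1, keepdims=True)) / temperature)
    weights /= weights.sum(axis=1, keepdims=True)

    # Weighted vote over the one-hot labels of the selected reference pixels.
    one_hot = np.eye(num_classes)[ref_labels]           # (N, C)
    probs = np.einsum("mk,mkc->mc", weights, one_hot[idx])
    return probs.argmax(axis=1), probs


# Toy usage: two reference clusters in a 2-D embedding space.
ref_feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
ref_labels = np.array([0, 0, 1, 1])
query_feats = np.array([[1.0, 0.05], [0.05, 1.0]])
pred, probs = propagate_labels(ref_feats, ref_labels, query_feats,
                               num_classes=2, top_k=2)
```

In a real pipeline the embeddings would come from a backbone network and the reference set would grow as frames are labeled; the sketch only shows the similarity-weighted voting step.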