Publication | Closed Access
FusionSeg: Learning to Combine Motion and Appearance for Fully Automatic Segmentation of Generic Objects in Videos
376 Citations · 53 References · Year: 2017 · Venue: Unknown
Keywords: Scene Analysis, Engineering, Machine Learning, Pixel Level Segmentations, Combine Motion, Video Segmentation Benchmarks, Video Interpretation, Image Sequence Analysis, Image Analysis, Pattern Recognition, Machine Vision, Generic Objects, Video Understanding, Deep Learning, Computer Vision, Fully Automatic Segmentation, Video Segmentation, Scene Interpretation, Scene Understanding
Summary: The authors propose an end-to-end learning framework for segmenting generic objects in videos. They train a two-stream fully convolutional network that fuses motion and appearance cues, framing the task as structured prediction and bootstrapping weakly annotated videos with image datasets. Experiments on three challenging video segmentation benchmarks show that the method substantially outperforms the state of the art for generic, unseen objects, and the code and pretrained models are publicly released.
Abstract: We propose an end-to-end learning framework for segmenting generic objects in videos. Our method learns to combine appearance and motion information to produce pixel-level segmentation masks for all prominent objects in videos. We formulate this task as a structured prediction problem and design a two-stream fully convolutional neural network which fuses together motion and appearance in a unified framework. Since large-scale video datasets with pixel-level segmentations are scarce, we show how to bootstrap weakly annotated videos together with existing image recognition datasets for training. Through experiments on three challenging video segmentation benchmarks, our method substantially improves the state of the art for segmenting generic (unseen) objects. Code and pre-trained models are available on the project website.
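To make the fusion idea concrete, the sketch below shows how per-pixel object/background scores from an appearance stream and a motion stream could be combined with a simple elementwise rule and turned into a binary mask. This is a conceptual NumPy illustration under assumed shapes and fusion rules ("max" and "sum" are two of several options), not the paper's actual network or its learned fusion layer; all function names here are hypothetical.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_streams(appearance_logits, motion_logits, mode="max"):
    """Fuse per-pixel background/object logits from two streams.

    appearance_logits, motion_logits: arrays of shape (H, W, 2),
    channel 0 = background score, channel 1 = object score.
    `mode` selects a simple elementwise fusion rule; the learned
    fusion of the actual paper is not modeled here.
    """
    if mode == "max":
        fused = np.maximum(appearance_logits, motion_logits)
    elif mode == "sum":
        fused = appearance_logits + motion_logits
    else:
        raise ValueError(f"unknown fusion mode: {mode}")
    probs = softmax(fused, axis=-1)
    # Per-pixel binary mask: 1 where the object class wins.
    return (probs[..., 1] > probs[..., 0]).astype(np.uint8)

# Toy example: a 4x4 frame where the appearance stream is confident
# about the left half and the motion stream about the top half;
# max-fusion keeps the union of confident object evidence.
app = np.zeros((4, 4, 2)); app[:, :2, 1] = 5.0
mot = np.zeros((4, 4, 2)); mot[:2, :, 1] = 5.0
mask = fuse_streams(app, mot, mode="max")
```

In the toy example the fused mask covers 12 of the 16 pixels: the top two rows plus the left two columns of the bottom rows, i.e. the union of the two streams' confident regions.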