Publication | Closed Access
Mix-Teaching: A Simple, Unified and Effective Semi-Supervised Learning Framework for Monocular 3D Object Detection
Citations: 41
References: 58
Year: 2023
Engineering, Machine Learning, 3D Computer Vision, Image Analysis, Pattern Recognition, Image-based Modeling, Computational Imaging, Robot Learning, Semi-supervised Learning, Machine Vision, Object Detection, Image Patches, Computer Science, Deep Learning, Monocular 3D, 3D Object Recognition, Computer Vision, 3D Vision, Object Recognition, Scene Understanding, Scene Modeling
Semi-supervised learning (SSL) has promising potential for improving model performance using both labelled and unlabelled data. Since recovering 3D information from 2D images is an ill-posed problem, current state-of-the-art methods for monocular 3D object detection (Mono3D) have relatively low precision and recall, which makes semi-supervised learning for Mono3D tasks challenging and understudied. In this work, we propose a unified and effective semi-supervised learning framework called Mix-Teaching that can be applied to most monocular 3D object detectors. Based on the idea of decomposition and recombination, unlabelled samples are first decomposed into collections of image patches with high-quality predictions and collections of background images containing no objects. The recombination operation then generates mixed images containing dense instances with high-quality pseudo-labels, and the student model is trained on these mixed images. In addition, we propose an uncertainty-based filter to distinguish high-quality pseudo-labels from noisy predictions during the decomposition process. Experiments on the KITTI and nuScenes benchmarks show that Mix-Teaching consistently improves MonoFlex and GUPNet by significant margins under various labelling ratios. Our method improves on GUPNet by around +6.34% $AP_{3D}$ on the validation set when using only 10% labelled data. Using the full training set and the additional 38K raw images from KITTI, it further improves MonoFlex by +4.65% absolute $AP_{3D}$ for car detection, reaching 18.54% $AP_{3D}$, which ranks first among all monocular methods on the KITTI test leaderboard.
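The decomposition and recombination idea described in the abstract can be illustrated with a minimal, self-contained sketch. It is not the authors' implementation. The `Prediction` fields, the scalar `uncertainty` value, both thresholds, and the choice to paste each patch at its original image location are all assumptions made for illustration. Images are toy grayscale grids stored as nested lists.

```python
from dataclasses import dataclass
from typing import List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values


@dataclass
class Prediction:
    """Hypothetical teacher output for one object (field names are assumptions)."""
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels, x2/y2 exclusive
    score: float                    # classification confidence
    uncertainty: float              # assumed scalar uncertainty, lower is better


def filter_pseudo_labels(preds: List[Prediction],
                         max_uncertainty: float = 0.2,
                         min_score: float = 0.5) -> List[Prediction]:
    """Keep predictions that are both confident and low-uncertainty.

    The thresholds are illustrative placeholders, not values from the paper.
    """
    return [p for p in preds
            if p.uncertainty <= max_uncertainty and p.score >= min_score]


def crop(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Decomposition step: cut the patch covered by `box` out of `image`."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def paste(background: Image, patch: Image,
          box: Tuple[int, int, int, int]) -> Image:
    """Return a copy of `background` with `patch` written into `box`."""
    x1, y1, x2, _ = box
    mixed = [row[:] for row in background]
    for dy, patch_row in enumerate(patch):
        mixed[y1 + dy][x1:x2] = patch_row
    return mixed


def mix(background: Image,
        sources: List[Tuple[Image, List[Prediction]]]
        ) -> Tuple[Image, List[Prediction]]:
    """Recombination step: paste every accepted patch onto an empty background.

    Assumes all images share the background's size and ignores patch overlap.
    """
    mixed = [row[:] for row in background]
    pseudo_labels: List[Prediction] = []
    for image, preds in sources:
        for p in filter_pseudo_labels(preds):
            mixed = paste(mixed, crop(image, p.box), p.box)
            pseudo_labels.append(p)
    return mixed, pseudo_labels


if __name__ == "__main__":
    background = [[0] * 6 for _ in range(4)]   # image with no objects
    unlabelled = [[9] * 6 for _ in range(4)]   # image with teacher predictions
    preds = [
        Prediction(box=(1, 1, 3, 3), score=0.9, uncertainty=0.05),  # kept
        Prediction(box=(3, 0, 6, 2), score=0.8, uncertainty=0.60),  # rejected
    ]
    mixed, labels = mix(background, [(unlabelled, preds)])
    for row in mixed:
        print(row)
    print(len(labels), "pseudo-label(s) kept")
```

In this toy run only the first prediction passes the filter, so the mixed image contains a single pasted patch and the student would be trained on one pseudo-label. A real pipeline would also need to handle overlapping patches and border blending, which this sketch leaves out.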