RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild

TLDR

Uncontrolled face detection has improved markedly, but accurate and efficient 2D face alignment and 3D face reconstruction in the wild remain open challenges. The paper introduces RetinaFace, a single-shot, multi-level face localisation method that unifies face box prediction, 2D landmark localisation, and 3D vertex regression under one target: point regression on the image plane. To supply training data, the authors manually annotated five facial landmarks on WIDER FACE and used a semi-automatic pipeline to generate 3D vertices for face images from WIDER FACE, AFLW, and FDDB. The 3D regression target is the set of 3D vertices projected onto the image plane, constrained by a common 3D topology; this branch trains jointly with the box and 2D landmark branches without optimisation difficulty. Experiments show that RetinaFace simultaneously achieves stable face detection, accurate 2D alignment, and robust 3D reconstruction with efficient single-shot inference.

Abstract

Though tremendous strides have been made in uncontrolled face detection, accurate and efficient 2D face alignment and 3D face reconstruction in-the-wild remain an open challenge. In this paper, we present a novel single-shot, multi-level face localisation method, named RetinaFace, which unifies face box prediction, 2D facial landmark localisation and 3D vertices regression under one common target: point regression on the image plane. To fill the data gap, we manually annotated five facial landmarks on the WIDER FACE dataset and employed a semi-automatic annotation pipeline to generate 3D vertices for face images from the WIDER FACE, AFLW and FDDB datasets. Based on extra annotations, we propose a mutually beneficial regression target for 3D face reconstruction, that is predicting 3D vertices projected on the image plane constrained by a common 3D topology. The proposed 3D face reconstruction branch can be easily incorporated, without any optimisation difficulty, in parallel with the existing box and 2D landmark regression branches during joint training. Extensive experimental results show that RetinaFace can simultaneously achieve stable face detection, accurate 2D face alignment and robust 3D face reconstruction while being efficient through single-shot inference.
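
The idea of a common target, point regression on the image plane, can be illustrated with a small sketch. It is not the authors' implementation: the anchor-relative encoding, the smooth L1 loss, the toy coordinates, and every name here (`encode_points`, `decode_points`, `smooth_l1`, `anchor`) are assumptions chosen for illustration. The point is only that box corners, five landmarks, and projected mesh vertices are all 2D points, so one encoding and one loss could in principle serve all three branches.

```python
import numpy as np

def encode_points(points_xy, anchor):
    """Encode image-plane points as offsets from the anchor centre, scaled by anchor size.
    (Assumed encoding for illustration; the abstract does not specify one.)"""
    cx, cy, w, h = anchor
    pts = np.asarray(points_xy, dtype=np.float64)
    return (pts - np.array([cx, cy])) / np.array([w, h])

def decode_points(offsets, anchor):
    """Invert encode_points: map scaled offsets back to pixel coordinates."""
    cx, cy, w, h = anchor
    return np.asarray(offsets, dtype=np.float64) * np.array([w, h]) + np.array([cx, cy])

def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 loss, a common choice for point regression (assumed here)."""
    diff = np.abs(np.asarray(pred) - np.asarray(target))
    return np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).mean()

# Hypothetical anchor: centre x, centre y, width, height (pixels).
anchor = (100.0, 100.0, 50.0, 50.0)

# Toy annotations, all expressed as 2D points on the image plane.
box_corners = np.array([[80.0, 75.0], [120.0, 130.0]])       # top-left, bottom-right
landmarks = np.array([[90.0, 90.0], [110.0, 90.0], [100.0, 100.0],
                      [92.0, 115.0], [108.0, 115.0]])        # five facial landmarks
vertices_2d = np.array([[85.0, 80.0], [100.0, 105.0],
                        [115.0, 125.0]])                     # stand-in for projected 3D vertices

# The same encoding is applied to every level of localisation.
targets = {
    "box": encode_points(box_corners, anchor),
    "landmarks": encode_points(landmarks, anchor),
    "vertices": encode_points(vertices_2d, anchor),
}

# With dummy all-zero predictions, the joint loss is a sum of per-branch point losses.
total_loss = sum(smooth_l1(np.zeros_like(t), t) for t in targets.values())
print(total_loss)
```

In a real mesh the vertex set would be far larger and would share a fixed topology across faces; the three-point `vertices_2d` array only stands in for that.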
