Improved YOLOv3 model with feature map cropping for multi-scale road object detection

TLDR

Road object detection is essential for autonomous driving, yet the multi-scale and uncertain distribution of vehicles and pedestrians poses a significant challenge to detection algorithms. This work proposes a YOLOv3-based approach that improves cross-scale detection by concentrating on valuable regions. The method uses a K-means-GIoU algorithm to generate prior boxes whose shapes are close to the real boxes, which reduces training complexity and speeds up convergence. It adds a small-target detection branch with a feature-map cropping module that suppresses background and easy-to-detect targets, and it incorporates channel and spatial attention modules to strengthen the focus on key regions. On the KITTI dataset the approach achieves up to 2.86% higher mAP than YOLOv3-ultralytics while maintaining a fast detection speed, with the largest gains on small-scale objects.
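The prior-box clustering step can be illustrated with a minimal sketch. The summary does not give the exact K-means-GIoU formulation, so everything below is an assumption: boxes are reduced to (width, height) pairs aligned at a common corner, the distance is 1 - GIoU, and centroids are updated with the cluster mean. The names `giou_wh` and `kmeans_giou` are hypothetical.

```python
import numpy as np


def giou_wh(boxes, centroids):
    """GIoU between (N, 2) boxes and (K, 2) centroids given as (w, h).

    Assumption: all boxes share one corner, so only shape matters.
    Returns an (N, K) matrix.
    """
    w, h = boxes[:, None, 0], boxes[:, None, 1]
    cw, ch = centroids[None, :, 0], centroids[None, :, 1]
    inter = np.minimum(w, cw) * np.minimum(h, ch)
    union = w * h + cw * ch - inter
    # Smallest box enclosing both shapes.
    enclose = np.maximum(w, cw) * np.maximum(h, ch)
    return inter / union - (enclose - union) / enclose


def kmeans_giou(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs with distance d = 1 - GIoU (illustrative only)."""
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        dist = 1.0 - giou_wh(boxes, centroids)
        assign = dist.argmin(axis=1)
        new = centroids.copy()
        for j in range(k):
            members = boxes[assign == j]
            if len(members):  # keep the old centroid if a cluster empties
                new[j] = members.mean(axis=0)
        if np.allclose(new, centroids):
            break
        centroids = new
    # Sort by area so the priors run from small to large.
    return centroids[np.argsort(centroids.prod(axis=1))]


if __name__ == "__main__":
    sample = [[10, 10], [11, 11], [100, 100], [101, 101]]
    print(kmeans_giou(sample, k=2))
```

In practice the input would be the ground-truth box sizes of the training set, and `k` would match the number of prior boxes the detector uses across its detection branches.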

Abstract

Road object detection is an essential step for intelligent vehicles. Road objects such as vehicles and pedestrians are typically multi-scale and unevenly distributed, which places high demands on the detection algorithm. This paper therefore proposes a YOLOv3 (You Only Look Once v3)-based method that aims to enhance cross-scale detection and to focus on the valuable areas of the image. The proposed method fills an urgent need for multi-scale detection, and its individual components will be useful in road object detection. First, the K-means-GIoU algorithm is designed to generate a priori boxes whose shapes are close to the real boxes. This greatly reduces the complexity of training and paves the way for fast convergence. Second, a detection branch is added to detect small targets. A feature map cropping module is introduced into this new branch to remove the areas that have a high probability of containing background or easy-to-detect targets; the cropped areas of the feature map are filled with zeros. Third, a channel attention module and a spatial attention module are added to strengthen the network's attention to major regions. Experimental results on the KITTI dataset show that the proposed method maintains a fast detection speed and increases the mAP (mean average precision) by as much as 2.86% compared with YOLOv3-ultralytics, and that it especially improves detection performance for small-scale objects.
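The zero-fill cropping described above can be sketched as a simple masking operation. The abstract does not say how the per-location probabilities are obtained or how the cut-off is chosen, so the `drop_prob` input, the `threshold` parameter, and the function name `crop_feature_map` are all hypothetical.

```python
import numpy as np


def crop_feature_map(features, drop_prob, threshold=0.5):
    """Zero out spatial cells judged to be background or easy targets.

    features:  (C, H, W) feature map of the small-target branch.
    drop_prob: (H, W) assumed per-cell probability that the cell holds
               only background or an easy-to-detect target.
    threshold: hypothetical cut-off; cells above it are cropped.
    """
    features = np.asarray(features, dtype=float)
    keep = np.asarray(drop_prob) <= threshold  # (H, W) boolean mask
    # Broadcasting over channels fills cropped cells with 0 and keeps the shape.
    return features * keep[None, :, :]


if __name__ == "__main__":
    fmap = np.ones((2, 2, 2))
    prob = np.array([[0.9, 0.1], [0.2, 0.8]])
    print(crop_feature_map(fmap, prob))
```

Because the cropped cells are set to zero rather than removed, the feature map keeps its spatial size and the rest of the branch can process it unchanged.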
