TLDR

Tracking multiple objects in videos relies on modeling spatial‑temporal interactions among the objects. This work introduces TransMOT, a graph‑transformer framework that efficiently captures spatial and temporal interactions for multiple‑object tracking. TransMOT represents trajectories and detection candidates as sparse weighted graphs and applies spatial and temporal transformer encoder layers followed by a spatial decoder, learning end‑to‑end to associate detections across time. On MOT15, MOT16, MOT17, and MOT20 benchmarks, TransMOT achieves state‑of‑the‑art performance.

Abstract

Tracking multiple objects in videos relies on modeling the spatial-temporal interactions of the objects. In this paper, we propose TransMOT, which leverages powerful graph transformers to efficiently model the spatial and temporal interactions among the objects. TransMOT is capable of effectively modeling the interactions of a large number of objects by arranging the trajectories of the tracked targets and detection candidates as a set of sparse weighted graphs, and constructing a spatial graph transformer encoder layer, a temporal transformer encoder layer, and a spatial graph transformer decoder layer based on the graphs. Through end-to-end learning, TransMOT can exploit the spatial-temporal clues to directly estimate association from a large number of loosely filtered detection predictions for robust MOT in complex scenes. The proposed method is evaluated on multiple benchmark datasets, including MOT15, MOT16, MOT17, and MOT20, and it achieves state-of-the-art performance on all the datasets.
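To make the idea concrete, the sketch below illustrates the core mechanism the abstract describes: detections arranged as a sparse weighted graph, self-attention restricted to graph edges (the "spatial graph transformer encoder"), and a decoder-style step where trajectories attend to encoded detections to produce a soft association matrix. This is a minimal numpy illustration under assumed simplifications (random features, a distance-threshold graph, no learned projections, no temporal encoder, no virtual nodes for births/deaths), not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_sparse_graph(centers, radius=2.0):
    """Weighted adjacency over detections: connect objects whose centers lie
    within `radius`, plus self-loops, mimicking a sparse spatial graph."""
    d = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    adj = (d < radius).astype(float)
    np.fill_diagonal(adj, 1.0)  # self-loops keep every attention row non-empty
    return adj

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention with non-edges masked out, so each node
    only aggregates information from its graph neighbors."""
    s = q @ k.T / np.sqrt(q.shape[1])
    s = np.where(mask > 0, s, -1e9)  # block attention outside graph edges
    w = np.exp(s - s.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ v

# Toy frame: 4 tracked trajectories, 5 detection candidates, 8-d features
# (all randomly generated here purely for illustration).
traj = rng.normal(size=(4, 8))
dets = rng.normal(size=(5, 8))
det_centers = rng.uniform(0, 5, size=(5, 2))

adj = build_sparse_graph(det_centers)               # sparse spatial graph
dets_enc = masked_attention(dets, dets, dets, adj)  # graph self-attention "encoder"

# Decoder-style step: each trajectory attends to the encoded detections,
# and a row softmax turns scores into soft association probabilities.
assoc = traj @ dets_enc.T / np.sqrt(8)
assoc = np.exp(assoc - assoc.max(axis=1, keepdims=True))
assoc /= assoc.sum(axis=1, keepdims=True)
print(assoc.shape)  # (4, 5): trajectory-to-detection association matrix
```

In the full method, such association scores would be trained end-to-end, with additional handling for unmatched trajectories and newly appearing detections; here each row of `assoc` simply sums to one over the candidate detections.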
