LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model

Abstract

The capacity of existing human keypoint localization models is limited by the keypoint priors provided by the training data. To alleviate this restriction and pursue a more general model, this work studies keypoint localization from a different perspective, reasoning about locations based on keypoint clues in text descriptions. We propose LocLLM, the first Large Language Model (LLM) based keypoint localization model that takes images and text instructions as inputs and outputs the desired keypoint coordinates. LocLLM leverages the strong reasoning capability of the LLM and the clues of keypoint type, location, and relationship in textual descriptions for keypoint localization. To effectively tune LocLLM, we construct localization-based instruction conversations that connect each keypoint description with the corresponding coordinates in the input image, and fine-tune the whole model in a parameter-efficient training pipeline. LocLLM shows remarkable performance on standard 2D/3D keypoint localization benchmarks. Moreover, incorporating language clues into localization gives LocLLM superior flexibility and generalization capability in cross-dataset keypoint localization, and even allows it to detect novel types of keypoints unseen during training.

Project page: https://github.com/kennethwdk/LocLLM
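
As an illustration of the localization-based instruction conversations mentioned in the abstract, the following minimal Python sketch pairs a keypoint description with its coordinates in the input image. The abstract does not specify the prompt template or the coordinate format, so everything here is an assumption made for illustration: the names `KeypointSample`, `build_conversation`, and `parse_answer`, the question wording, and the choice of coordinates normalized to [0, 1] are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KeypointSample:
    """One annotated keypoint (hypothetical container, not from the paper)."""
    name: str         # keypoint type, e.g. "left elbow"
    description: str  # textual clue about location and relation to other joints
    x: float          # pixel x-coordinate in the input image
    y: float          # pixel y-coordinate in the input image


def build_conversation(sample, width, height, precision=3):
    """Build a (question, answer) pair linking a description to coordinates.

    Assumption: coordinates are normalized by image size and written as text.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image width and height must be positive")
    # Normalize to [0, 1] and clamp so out-of-frame labels stay in range.
    nx = min(max(sample.x / width, 0.0), 1.0)
    ny = min(max(sample.y / height, 0.0), 1.0)
    question = (
        f"Where is the {sample.name} of the person in this image? "
        f"{sample.description}"
    )
    answer = f"[{nx:.{precision}f},{ny:.{precision}f}]"
    return question, answer


def parse_answer(answer, width, height):
    """Convert a textual answer back to pixel coordinates."""
    nx, ny = (float(v) for v in answer.strip().strip("[]").split(","))
    return nx * width, ny * height


sample = KeypointSample(
    name="left elbow",
    description="The left elbow is the joint between the left upper arm "
                "and the left forearm.",
    x=64.0,
    y=192.0,
)
question, answer = build_conversation(sample, width=256, height=256)
pixel_xy = parse_answer(answer, width=256, height=256)
```

In a setup like this, the question (together with the image) would serve as the model input and the answer string as the training target, so the textual description and the coordinates are tied together in a single conversation turn.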