GSVA: Generalized Segmentation via Multimodal Large Language Models

Abstract

Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to refer to multiple objects in one expression or identify the empty targets absent in the image. GRES poses challenges in modeling the complex spatial relationships of the instances in the image and identifying non-existing referents. Multimodal Large Language Models (MLLMs) have recently shown tremendous progress in these complicated vision-language tasks. Connecting Large Language Models (LLMs) and vision models, MLLMs are proficient in understanding contexts with visual inputs. Among them, LISA, as a representative, adopts a special [SEG] token to prompt a segmentation mask decoder, e.g., SAM, to enable MLLMs in the RES task. However, existing solutions to GRES remain unsatisfactory since current segmentation MLLMs cannot correctly handle the cases where users might reference multiple subjects in a singular prompt or provide descriptions incongruent with any image target. In this paper, we propose Generalized Segmentation Vision Assistant (GSVA) to address this gap. Specifically, GSVA reuses the [SEG] token to prompt the segmentation model towards supporting multiple mask references simultaneously and innovatively learns to generate a [REJ] token to reject the null targets explicitly. Experiments validate GSVA's efficacy in resolving the GRES issue, marking a notable enhancement and setting a new record on the GRES benchmark gRefCOCO dataset. GSVA also proves effective across various classic referring segmentation and comprehension tasks. Code is available at https://github.com/LeapLabTHU/GSVA.
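
To make the roles of the two tokens concrete, the following is a minimal, text-level sketch of how a response containing several [SEG] and [REJ] tokens could be routed: phrases paired with [SEG] would be passed on for mask prediction, while phrases paired with [REJ] would be reported as absent. The response format (`phrase: [TOKEN]`, comma-separated) and the function name `route_targets` are assumptions made for illustration only; they are not taken from the GSVA code. In a model of this kind the mask decoder is prompted through the [SEG] token itself, not through parsed text, so this sketch shows only the routing decision.

```python
import re
from typing import List, Tuple

SEG, REJ = "[SEG]", "[REJ]"

# Hypothetical response format: "<phrase>: [SEG]" or "<phrase>: [REJ]",
# with entries separated by commas. This format is an assumption.
_ENTRY = re.compile(r"\s*([^:,]+?)\s*:\s*(\[SEG\]|\[REJ\])")


def route_targets(response: str) -> Tuple[List[str], List[str]]:
    """Split referred phrases into (to_segment, rejected) lists."""
    to_segment: List[str] = []
    rejected: List[str] = []
    for phrase, token in _ENTRY.findall(response):
        if token == SEG:
            to_segment.append(phrase)   # a mask would be requested for this target
        else:
            rejected.append(phrase)     # target judged absent from the image
    return to_segment, rejected


if __name__ == "__main__":
    seg, rej = route_targets("the dog: [SEG], the red kite: [REJ], the man: [SEG]")
    print("segment:", seg)
    print("reject:", rej)
```

Keeping one token per referred phrase is what allows a single expression to yield several masks and several rejections at once, which is the behavior the GRES setting requires.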
