GRES: Generalized Referring Expression Segmentation

Abstract

Referring Expression Segmentation (RES) aims to generate a segmentation mask for the object described by a given language expression. Existing classic RES datasets and methods commonly support only single-target expressions, i.e., one expression refers to one target object. Multi-target and no-target expressions are not considered, which limits the usage of RES in practice. In this paper, we introduce a new benchmark called Generalized Referring Expression Segmentation (GRES), which extends classic RES to allow expressions to refer to an arbitrary number of target objects. Towards this, we construct the first large-scale GRES dataset, called gRefCOCO, which contains multi-target, no-target, and single-target expressions. GRES and gRefCOCO are designed to be highly compatible with RES, facilitating extensive experiments that study the performance gap of existing RES methods on the GRES task. In the experimental study, we find that one of the big challenges of GRES is complex relationship modeling. Based on this, we propose a region-based GRES baseline, ReLA, that adaptively divides the image into regions with sub-instance clues and explicitly models the region-region and region-language dependencies. The proposed approach ReLA achieves new state-of-the-art performance on both the newly proposed GRES task and the classic RES task.

GRES Setting

Generalized Referring Expression Segmentation (GRES) allows expressions to indicate any number of target objects. GRES takes an image and a referring expression as input and requires predicting a mask for the target object(s).

☆ Multi-target expressions: an expression refers to multiple target objects.
☆ No-target expressions: an expression does not refer to any object in the image.
☆ Single-target expressions: an expression refers to a single target object.
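The three expression types can be illustrated with a minimal sketch, assuming each referred object is available as a binary instance mask and that the GRES target is the pixel-wise union of all referred objects (an all-zero mask for a no-target expression). The names `gres_target_mask` and `is_no_target` are hypothetical helpers for illustration, not part of the released GRES code.

```python
from typing import List

# Binary H x W mask stored as nested lists; 1 marks an object pixel.
Mask = List[List[int]]


def gres_target_mask(instance_masks: List[Mask], height: int, width: int) -> Mask:
    """Merge the masks of all referred objects into one binary target (hypothetical helper).

    - multi-target expression: union of several instance masks
    - single-target expression: exactly one instance mask
    - no-target expression: empty instance list, giving an all-zero mask
    """
    target = [[0] * width for _ in range(height)]
    for inst in instance_masks:
        for y in range(height):
            for x in range(width):
                if inst[y][x]:
                    target[y][x] = 1
    return target


def is_no_target(mask: Mask) -> bool:
    """Return True when the mask contains no foreground pixel."""
    return not any(any(row) for row in mask)


if __name__ == "__main__":
    a = [[1, 0], [0, 0]]
    b = [[0, 0], [0, 1]]
    print(gres_target_mask([a, b], 2, 2))  # multi-target: union of a and b
    print(is_no_target(gres_target_mask([], 2, 2)))  # no-target case
```

Under this assumption, a single output format covers all three cases, which is what keeps the setting compatible with classic RES: a single-target sample is simply the one-instance special case.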

Figure 2. Additional applications enabled by GRES through its support for multi-target and no-target expressions, compared with classic RES.

Experiments

To the best of our knowledge, we benchmark the state-of-the-art methods on gRefCOCO. If your method performs better, please feel free to contact us for benchmark evaluation, and we will update the results.

TABLE 1. GRES results: comparison on the gRefCOCO dataset.

TABLE 2. Results on classic RES in terms of cIoU. U: UMD split. G: Google split.
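cIoU (cumulative IoU), the metric used in Table 2, is commonly computed as the total intersection divided by the total union accumulated over an entire evaluation split, rather than as an average of per-sample IoUs. The following is a minimal pure-Python sketch under that definition; the function names are hypothetical and this is not the official evaluation script.

```python
from typing import List, Sequence, Tuple

# Binary H x W mask stored as nested lists; 1 marks a foreground pixel.
Mask = List[List[int]]


def intersection_and_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count intersecting and union pixels between one prediction and its ground truth."""
    inter = union = 0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    return inter, union


def cumulative_iou(preds: Sequence[Mask], gts: Sequence[Mask]) -> float:
    """cIoU: sum of intersections over sum of unions across all samples."""
    total_inter = total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_and_union(pred, gt)
        total_inter += inter
        total_union += union
    # An all-empty split has no defined cIoU; returning 0.0 is an arbitrary choice for this sketch.
    return total_inter / total_union if total_union else 0.0


if __name__ == "__main__":
    preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
    gts = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]
    print(cumulative_iou(preds, gts))
```

In this example the per-sample IoUs are 1/2 and 4/4, so their mean is 0.75, while cIoU is (1 + 4) / (2 + 4) = 5/6. Because pixels are pooled before dividing, cIoU gives larger objects more weight than a per-sample average does.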

BibTeX

Please consider citing GRES if it helps your research.

@inproceedings{GRES,
  title={{GRES}: Generalized Referring Expression Segmentation},
  author={Liu, Chang and Ding, Henghui and Jiang, Xudong},
  booktitle={CVPR},
  year={2023}
}

License

GRES is licensed under a CC BY-NC-SA 4.0 License. The gRefCOCO data is released for non-commercial research purposes only.