GRES: Generalized Referring Expression Segmentation

Nanyang Technological University

Figure 1. Classic Referring Expression Segmentation (RES) only supports expressions that indicate a single target object, e.g., (1). Compared with classic RES, the proposed Generalized Referring Expression Segmentation (GRES) supports expressions indicating an arbitrary number of target objects, for example, multi-target expressions like (2)-(5) and no-target expressions like (6). To support the GRES task, we construct the first large-scale GRES dataset, called gRefCOCO.


Referring Expression Segmentation (RES) aims to generate a segmentation mask for the object described by a given language expression. Existing classic RES datasets and methods commonly support single-target expressions only, i.e., one expression refers to one target object; multi-target and no-target expressions are not considered. This limits the use of RES in practice. In this paper, we introduce a new benchmark called Generalized Referring Expression Segmentation (GRES), which extends classic RES to allow expressions to refer to an arbitrary number of target objects. To support this, we construct the first large-scale GRES dataset, gRefCOCO, which contains multi-target, no-target, and single-target expressions. GRES and gRefCOCO are designed to be fully compatible with RES, facilitating extensive experiments that study the performance gap of existing RES methods on the GRES task. In this experimental study, we find that one of the major challenges of GRES is complex relationship modeling. Based on this, we propose a region-based GRES baseline, ReLA, that adaptively divides the image into regions with sub-instance clues and explicitly models region-region and region-language dependencies. The proposed ReLA achieves new state-of-the-art performance on both the newly proposed GRES task and the classic RES task.
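The region-region and region-language dependencies described above can be illustrated with standard attention operations. The following is a minimal sketch, not the official ReLA implementation: the feature dimensions, the residual updates, and the plain (single-head, unprojected) attention are all simplifying assumptions made here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Scaled dot-product attention: each query aggregates the values."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values

rng = np.random.default_rng(0)
num_regions, num_words, dim = 16, 8, 32          # toy sizes, not the paper's
region_feats = rng.standard_normal((num_regions, dim))
word_feats = rng.standard_normal((num_words, dim))

# Region-language dependency: each region attends to the expression words.
region_feats = region_feats + attend(region_feats, word_feats, word_feats)
# Region-region dependency: regions attend to each other (self-attention).
region_feats = region_feats + attend(region_feats, region_feats, region_feats)

print(region_feats.shape)  # (16, 32)
```

Each region feature is updated first with language context and then with context from the other regions, which is the kind of explicit dependency modeling the abstract refers to.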

GRES Setting

Generalized Referring Expression Segmentation (GRES) allows expressions indicating any number of target objects. GRES takes an image and a referring expression as input, and requires mask prediction of the target object(s).

☆ Multi-target expressions: an expression indicates multiple target objects.
☆ No-target expressions: an expression does not refer to any object in the image.
☆ Single-target expressions: an expression indicates a single target object.
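Because no-target expressions have an empty ground-truth mask, evaluating GRES involves both a mask-overlap metric and checking whether empty predictions are made correctly. Below is a minimal sketch of two such metrics; the function names and the exact metric definitions used for the gRefCOCO benchmark are assumptions here (see the paper for the official definitions), though cIoU as cumulative intersection over cumulative union is the standard convention.

```python
import numpy as np

def ciou(preds, gts):
    """Cumulative IoU: total intersection over total union across all samples."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return inter / union if union > 0 else 0.0

def no_target_accuracy(preds, gts):
    """Fraction of no-target samples (empty GT mask) predicted as empty."""
    hits = total = 0
    for p, g in zip(preds, gts):
        if g.sum() == 0:            # ground truth says no target
            total += 1
            hits += int(p.sum() == 0)
    return hits / total if total else 0.0

# Toy example: one single-target sample and one no-target sample.
gts = [np.array([[1, 0], [0, 0]], bool), np.zeros((2, 2), bool)]
preds = [np.array([[1, 1], [0, 0]], bool), np.zeros((2, 2), bool)]
print(ciou(preds, gts))                # 0.5 (intersection 1 / union 2)
print(no_target_accuracy(preds, gts))  # 1.0
```

Note that cIoU pools pixels over the whole dataset rather than averaging per-sample IoU, so large objects weigh more heavily in the score.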

Figure 2. More applications of GRES brought by supporting multi-target and no-target expressions compared to classic RES.


We benchmark the state-of-the-art methods on gRefCOCO to the best of our knowledge. If your method achieves better results, please feel free to contact us for benchmark evaluation, and we will update the results.

TABLE 1. GRES results: comparison on gRefCOCO dataset.

TABLE 2. Results on classic RES in terms of cIoU. U: UMD split. G: Google split.



Please consider citing GRES if it helps your research.

@inproceedings{GRES,
  title={{GRES}: Generalized Referring Expression Segmentation},
  author={Liu, Chang and Ding, Henghui and Jiang, Xudong},
  booktitle={CVPR},
  year={2023}
}


Creative Commons License
GRES is licensed under a CC BY-NC-SA 4.0 License. The data of gRefCOCO is released for non-commercial research purposes only.