In this work, we study universal zero-shot segmentation, which aims to achieve panoptic, instance, and semantic segmentation for novel categories without any training samples. Such zero-shot ability relies on inter-class relationships in the semantic space to transfer visual knowledge learned from seen categories to unseen ones. It is therefore essential to bridge the semantic and visual spaces and apply the semantic relationships to visual feature learning. We introduce a generative model that synthesizes features for unseen categories, linking the two spaces and addressing the lack of unseen training data. Furthermore, to mitigate the domain gap between the semantic and visual spaces, we first enhance the vanilla generator with learned primitives, each encoding fine-grained category-related attributes, and synthesize unseen features by selectively assembling these primitives. Second, we propose to disentangle each visual feature into a semantic-related part and a semantic-unrelated part, the latter containing useful classification cues that are less relevant to the semantic representation. The inter-class relationships of the semantic-related visual features are then aligned with those in the semantic space, thereby transferring semantic knowledge to visual feature learning. The proposed approach achieves state-of-the-art performance on zero-shot panoptic segmentation, zero-shot instance segmentation, and zero-shot semantic segmentation.
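For concreteness, the sketch below illustrates the feature-disentanglement idea in PyTorch. The module name, layer sizes, and the reconstruction constraint are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDisentangler(nn.Module):
    """Split a visual class embedding into a semantic-related part and a
    semantic-unrelated part (sketch; dimensions and losses are assumptions)."""
    def __init__(self, dim=256):
        super().__init__()
        self.to_related = nn.Linear(dim, dim)        # semantic-related branch
        self.to_unrelated = nn.Linear(dim, dim)      # semantic-unrelated branch
        self.reconstruct = nn.Linear(2 * dim, dim)   # rebuild the input from both parts

    def forward(self, v):
        v_rel = self.to_related(v)
        v_unr = self.to_unrelated(v)
        v_rec = self.reconstruct(torch.cat([v_rel, v_unr], dim=-1))
        # A reconstruction term encourages the two parts to jointly preserve v;
        # an additional independence/orthogonality term could keep them disjoint.
        loss_rec = F.mse_loss(v_rec, v)
        return v_rel, v_unr, loss_rec
```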
Figure 2. Overview of our approach PADing for universal zero-shot image segmentation. We first obtain class-agnostic masks and their corresponding global representations, called class embeddings, from the backbone. A primitive generator is trained to produce synthetic features (i.e., fake class embeddings). The classifier, which takes class embeddings as input, is trained on both the real class embeddings from images and the synthetic class embeddings produced by the generator. During generator training, the proposed feature disentanglement and relationship alignment are employed to constrain the synthesized features.
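A hedged sketch of this training flow is given below. `backbone`, `generator`, and `classifier` are placeholders, and the tensor shapes, noise dimension, and label layout are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def train_step(backbone, generator, classifier, images, gt_labels,
               seen_word_embeds, unseen_word_embeds, optimizer):
    # 1) Real class embeddings: global representations of class-agnostic masks.
    real_embeds = backbone(images)                         # (N, D), aligned with gt_labels
    # 2) Synthesize fake class embeddings for unseen categories from word vectors.
    noise = torch.randn(unseen_word_embeds.size(0), 128)   # latent noise, dim assumed
    fake_embeds = generator(unseen_word_embeds, noise)     # (U, D) synthetic embeddings
    fake_labels = torch.arange(seen_word_embeds.size(0),
                               seen_word_embeds.size(0) + unseen_word_embeds.size(0))
    # 3) Train the classifier on both real (seen) and synthetic (unseen) embeddings.
    logits = classifier(torch.cat([real_embeds, fake_embeds], dim=0))
    labels = torch.cat([gt_labels, fake_labels], dim=0)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```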
Figure 3. Primitive Cross-Modal Generator. We use a large collection of learned primitives to represent fine-grained attributes. The generator synthesizes visual features by assembling these primitives according to the input semantic embedding.
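The sketch below shows one plausible way to assemble learned primitives with attention conditioned on the semantic embedding; the number of primitives, feature dimensions, and the single attention layer are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PrimitiveGenerator(nn.Module):
    """Sketch of a primitive-based generator: a bank of learned primitive vectors
    is selectively assembled via attention driven by the semantic embedding."""
    def __init__(self, sem_dim=512, feat_dim=256, num_primitives=100):
        super().__init__()
        self.primitives = nn.Parameter(torch.randn(num_primitives, feat_dim))
        self.query_proj = nn.Linear(sem_dim, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.out_proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, sem_embed, noise=None):
        # sem_embed: (B, sem_dim) semantic (word) embedding of each category.
        q = self.query_proj(sem_embed).unsqueeze(1)            # (B, 1, feat_dim)
        if noise is not None:                                  # optional stochasticity
            q = q + noise.unsqueeze(1)
        kv = self.primitives.unsqueeze(0).expand(q.size(0), -1, -1)
        assembled, _ = self.attn(q, kv, kv)                    # weighted mix of primitives
        return self.out_proj(assembled.squeeze(1))             # synthetic class embedding
```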
Figure 4. Relationship alignment. (a) Conventional relationship alignment. (b) Our proposed two-step relationship alignment. To account for the domain gap, we introduce a semantic-related visual space, whose features are disentangled from the visual space and have more direct relevance to the semantic space. The relationships in this semantic-related visual space are aligned with those in the semantic space. u_i / s_j denotes an unseen / seen category. Taking u1 (dog) as an example, we aim to transfer its similarities with {cat, elephant, horse, zebra} from the semantic space to the visual space.
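The alignment step can be summarized as matching pairwise similarity matrices, as in the sketch below; the use of cosine similarity and an L1 penalty is an assumption for illustration. In training, such a term would be applied to the semantic-related part produced by the disentanglement above.

```python
import torch
import torch.nn.functional as F

def relationship_alignment_loss(visual_rel_embeds, semantic_embeds):
    """Align inter-class relationships in the semantic-related visual space with
    those in the semantic space (sketch).
    visual_rel_embeds: (C, D_v) semantic-related visual features, one per category.
    semantic_embeds:   (C, D_s) word embeddings of the same categories."""
    v = F.normalize(visual_rel_embeds, dim=-1)
    s = F.normalize(semantic_embeds, dim=-1)
    sim_visual = v @ v.t()        # pairwise similarities in the visual space
    sim_semantic = s @ s.t()      # pairwise similarities in the semantic space
    # e.g. "dog" should stay closer to "cat" than to "zebra" in both spaces
    return F.l1_loss(sim_visual, sim_semantic)
```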
@inproceedings{PADing,
  title={Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation},
  author={He, Shuting and Ding, Henghui and Jiang, Wei},
  booktitle={CVPR},
  year={2023}
}