MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions

Nanyang Technological University

Figure 1. Examples of video clips from Motion expressions Video Segmentation (MeViS), illustrating the nature and complexity of the dataset. The selected target objects are masked in orange ▇. The expressions in MeViS focus primarily on motion attributes, so the referred target object cannot be identified from a single frame alone. For instance, the first example features three parrots with similar appearances, and the target object is identified as "The bird flying away". This object can only be recognized by capturing its motion throughout the video.


Abstract

This paper strives for motion expressions guided video segmentation, which focuses on segmenting objects in video content based on a sentence describing the motion of the objects. Existing referring video object datasets typically focus on salient objects and use language expressions that contain excessive static attributes that could potentially enable the target object to be identified in a single frame. These datasets downplay the importance of motion in video content for language-guided video object segmentation. To investigate the feasibility of using motion expressions to ground and segment objects in videos, we propose a large-scale dataset called MeViS, which contains numerous motion expressions to indicate target objects in complex environments. We benchmarked 5 existing referring video object segmentation (RVOS) methods and conducted a comprehensive comparison on the MeViS dataset. The results show that current RVOS methods cannot effectively address motion expression-guided video segmentation. We further analyze the challenges and propose a baseline approach for the proposed MeViS dataset. The goal of our benchmark is to provide a platform that enables the development of effective language-guided video segmentation algorithms that leverage motion expressions as a primary cue for object segmentation in complex video scenes.

MeViS Setting


Given a video and an expression describing the motion clues of the target object(s), MeViS requires segmenting and tracking the target object(s) accurately.

☆ Input: a video and a sentence that refers to the target object(s).
☆ Output: video segmentation masks for the target object(s).
☆ Expression: focuses on describing motions, which may span several frames or hundreds of frames.
☆ Target Object: a sentence may refer to any number of target objects.
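The input/output contract above can be sketched in a few lines of Python. This is only an illustration: the names `MotionExpressionSample` and `merge_object_masks` are hypothetical, and representing the output as one merged binary mask per frame (the union of all referred objects) is an assumption made here, not something stated on this page.

```python
from dataclasses import dataclass
from typing import List

# A binary mask for one frame: H rows x W columns, values 0 or 1.
Mask = List[List[int]]


@dataclass
class MotionExpressionSample:
    """Hypothetical container for one (video, expression) input pair."""
    video_id: str
    expression: str
    num_frames: int


def merge_object_masks(per_object_masks: List[List[Mask]]) -> List[Mask]:
    """Union the per-frame masks of all referred objects.

    per_object_masks[k][t] is the mask of object k in frame t.
    Returns one merged mask per frame (assumed output format).
    """
    if not per_object_masks:
        return []
    num_frames = len(per_object_masks[0])
    merged: List[Mask] = []
    for t in range(num_frames):
        frame_masks = [obj[t] for obj in per_object_masks]
        # zip(*frame_masks) groups the same row across objects;
        # zip(*rows) groups the same pixel across objects.
        merged.append(
            [[int(any(px)) for px in zip(*rows)] for rows in zip(*frame_masks)]
        )
    return merged


sample = MotionExpressionSample("example_video", "The bird flying away", 1)
object_a = [[[1, 0], [0, 0]]]  # one frame, 2x2
object_b = [[[0, 0], [0, 1]]]  # one frame, 2x2
output_masks = merge_object_masks([object_a, object_b])
```

Because an expression may refer to several objects, the sketch takes a list of per-object mask sequences rather than a single one.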

Dataset Statistics

TABLE 1. Scale comparison between MeViS and existing language-guided video segmentation datasets.
“Obj/Video”: average number of objects per video. “Obj/Expn”: average number of referred objects per expression.

| Dataset           | Pub. & Year | Videos | Object | Expression | Mask  | Obj/Video | Obj/Expn | Target    |
|-------------------|-------------|--------|--------|------------|-------|-----------|----------|-----------|
| A2D Sentence      | CVPR 2018   | 3,782  | 4,825  | 6,656      | 58k   | 1.28      | 1        | Actor     |
| J-HMDB Sentence   | CVPR 2018   | 928    | 928    | 928        | 31.8k | 1         | 1        | Actor     |
| DAVIS16-RVOS      | ACCV 2018   | 50     | 50     | 100        | 3.4k  | 1         | n/a      | Object    |
| DAVIS17-RVOS      | ACCV 2018   | 90     | 205    | 1,544      | 13.5k | 2.27      | 1        | Object    |
| Refer-Youtube-VOS | ECCV 2020   | 3,978  | 7,451  | 15,009     | 131k  | 1.86      | 1        | Object    |
| MeViS (ours)      | ICCV 2023   | 2,006  | 8,171  | 28,570     | 443k  | 4.28      | 1.59     | Object(s) |

The newly built MeViS has the largest number of objects and language expressions among these datasets. More importantly, MeViS focuses on segmenting objects in videos as indicated by motion expressions. MeViS thus enables investigating the feasibility of using motion expressions for object segmentation and grounding in videos.

A Simple Baseline Approach


Figure 2. Overview of the architecture of the proposed baseline approach, Language-guided Motion Perception and Matching (LMPM). We first detect all possible target objects in each frame of the video and represent them with object embeddings produced by the Language-Guided Extractor. Then, Motion Perception is conducted on all the object embeddings of the video to grasp the global temporal context. By leveraging language queries and object embeddings with motion information, we generate object trajectories through a Transformer Decoder. Finally, we match the language features with the predicted object trajectories to identify the target object(s).
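The final matching step can be illustrated with a small sketch. It is not the LMPM implementation: the use of cosine similarity, the threshold value, and the names `cosine_similarity` and `match_trajectories` are all assumptions chosen for illustration. The idea shown is only that each predicted trajectory embedding is scored against the language feature, and every trajectory scoring above a threshold is kept, which allows zero, one, or several target objects.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_trajectories(
    language_feature: Sequence[float],
    trajectory_embeddings: List[Sequence[float]],
    threshold: float = 0.8,  # hypothetical value, not taken from the paper
) -> List[int]:
    """Return indices of trajectories whose similarity exceeds the threshold."""
    return [
        i
        for i, emb in enumerate(trajectory_embeddings)
        if cosine_similarity(language_feature, emb) > threshold
    ]


# Toy example: trajectories 0 and 2 point in nearly the same direction
# as the language feature, trajectory 1 is orthogonal to it.
selected = match_trajectories(
    [1.0, 0.0],
    [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]],
)
```

Selecting by threshold rather than by arg-max is what lets a single expression pick out more than one object.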


Experiments

We benchmark the state-of-the-art methods to the best of our knowledge; please see the paper for details. If your method is more powerful, please feel free to contact us for benchmark evaluation, and we will update the results.


TABLE 2. MeViS Benchmark Results.

Downloads & Evaluation



● We use Region Jaccard J, Boundary F measure F, and their mean J&F as the evaluation metrics.
● For the validation sets, the expressions are released to indicate the objects that are considered in evaluation.
● The validation set online evaluation server is [here] for daily evaluation.
● The test set online evaluation server will be open during the competition period only.
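As a rough illustration of the metrics listed above, the sketch below computes the region Jaccard J (intersection over union) for binary masks and combines it with a boundary score F into J&F. It is not the official evaluation code: the helper names are hypothetical, returning 1.0 when both masks are empty is an assumed convention, and the boundary F-measure is taken as a given number because computing it requires boundary extraction that is out of scope here.

```python
from typing import List

Mask = List[List[int]]  # binary mask, H x W, values 0 or 1


def region_jaccard(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two binary masks of the same size."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def video_jaccard(pred_masks: List[Mask], gt_masks: List[Mask]) -> float:
    """Mean per-frame Jaccard over a video (assumes at least one frame)."""
    scores = [region_jaccard(p, g) for p, g in zip(pred_masks, gt_masks)]
    return sum(scores) / len(scores)


def j_and_f(j: float, f: float) -> float:
    """Mean of region similarity J and boundary accuracy F."""
    return (j + f) / 2.0


pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
j = region_jaccard(pred, gt)  # 1 shared pixel out of 2 in the union
```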

BibTeX

Please consider citing MeViS if it helps your research.
@inproceedings{MeViS,
  title={{MeViS}: A Large-scale Benchmark for Video Segmentation with Motion Expressions},
  author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Loy, Chen Change},
  booktitle={ICCV},
  year={2023}
}

License

MeViS is licensed under a CC BY-NC-SA 4.0 License. The data of MeViS is released for non-commercial research purposes only.