MOSE: A New Dataset for Video Object Segmentation in Complex Scenes

1Nanyang Technological University     2Zhejiang University     3University of Oxford     4ByteDance    

Figure 1. Examples of video clips from the coMplex video Object SEgmentation (MOSE) dataset. The selected target objects are masked in orange ▇. The most notable feature of MOSE is its complex scenes, including the disappearance and reappearance of objects, inconspicuous small objects, heavy occlusions, crowded environments, etc. The goal of the MOSE dataset is to provide a platform that promotes the development of more comprehensive and robust video object segmentation algorithms.


Video object segmentation (VOS) aims to segment a particular object throughout an entire video sequence. State-of-the-art VOS methods have achieved excellent performance (e.g., 90+% J&F) on existing datasets. However, since the target objects in these datasets are usually relatively salient, dominant, and isolated, VOS under complex scenes has rarely been studied. To revisit VOS and make it more applicable in the real world, we collect a new VOS dataset called coMplex video Object SEgmentation (MOSE) to study tracking and segmenting objects in complex environments. MOSE contains 2,149 video clips and 5,200 objects from 36 categories, with 431,725 high-quality object segmentation masks. The most notable feature of the MOSE dataset is its complex scenes with crowded and occluded objects: the target objects are commonly occluded by other objects and disappear in some frames. To analyze the proposed dataset, we benchmark 18 existing VOS methods under 4 different settings on MOSE and conduct comprehensive comparisons. The experiments show that current VOS algorithms cannot perceive objects well in complex scenes. For example, under the semi-supervised VOS setting, the highest J&F achieved by existing state-of-the-art VOS methods is only 59.4% on MOSE, much lower than their ∼90% J&F on DAVIS. The results reveal that, although excellent performance has been achieved on existing benchmarks, unresolved challenges remain in complex scenes, and more effort is needed to explore them in the future.



Dataset Statistics

TABLE 1. Scale comparison between MOSE and existing VOS datasets.
“mBOR”: mean Bounding-box Occlusion Rate, measuring how heavily object bounding boxes overlap. “Disapp. Rate”: the fraction of objects that disappear in at least one frame.
Dataset Year Videos Categories Objects Annotations Duration (min) mBOR Disapp. Rate
YouTube-Objects 2012 96 10 96 1,692 9.01 - -
SegTrack-v2 2013 14 11 24 1,475 0.69 0.12 8.3%
FBMS 2014 59 16 139 1,465 7.70 0.01 11.2%
JumpCut 2015 22 14 22 6,331 3.52 0 0%
DAVIS-2016 2016 50 - 50 3,440 2.28 - -
DAVIS-2017 2017 90 - 205 13,543 5.17 0.03 16.1%
YouTube-VOS 2018 4,453 94 7,755 197,272 334.81 0.05 13.0%
MOSE (ours) 2023 2,149 36 5,200 431,725 443.62 0.23 28.8%
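As a rough illustration of the “Disapp. Rate” column, the sketch below computes a disappearance rate from per-object, per-frame mask presence. This is a hedged approximation under one plausible reading (an object "disappears" if it is absent in at least one frame between its first and last appearance); the exact definition used by MOSE is given in the paper, and the function and argument names here are hypothetical.

```python
import numpy as np

def disappearance_rate(presence):
    """presence: dict mapping object_id -> list of per-frame booleans
    (True if the object's mask is non-empty in that frame).

    An object counts as "disappearing" if it is absent in at least one
    frame between its first and last appearance, i.e. it leaves the
    scene and later reappears.
    """
    disappearing = 0
    for frames in presence.values():
        frames = np.asarray(frames, dtype=bool)
        visible = np.flatnonzero(frames)
        # Check for a gap between first and last visible frame.
        if len(visible) and not frames[visible[0]:visible[-1] + 1].all():
            disappearing += 1
    return disappearing / len(presence)
```

For example, a video with one object that vanishes mid-clip and one that stays visible throughout would score 0.5 under this definition.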


We benchmark the state-of-the-art methods to the best of our knowledge; please see the Dataset Report for details. If your method performs better, please feel free to contact us for benchmark evaluation, and we will update the results.

TABLE 2. Benchmark results of semi-supervised (one-shot) VOS.


The dataset is available on OneDrive, Google Drive, and Baidu WangPan (Access Code: MOSE); please refer to MOSE-api for more details.
🚀 Download the dataset using gdown command:
📦 train.tar.gz 20.5 GB
📦 valid.tar.gz 3.61 GB
Tip: gdown may be temporarily throttled by Google Drive due to excessive downloads; you may wait 24 hours or download directly from the Google Drive page with a Google account. If problems persist, please feel free to open an issue on MOSE-api.


    ● Following DAVIS, we use Region Jaccard J, Boundary F measure F, and their mean J&F as the evaluation metrics.
    ● For the validation sets, the first-frame annotations are released to indicate the objects that are considered in evaluation.
    ● The online evaluation server for the validation set is [here], open for daily evaluation.
    ● The test set online evaluation server will be open during the competition period only.


    Please consider citing MOSE if it helps your research.

    @article{MOSE,
      title={MOSE: A New Dataset for Video Object Segmentation in Complex Scenes},
      author={Ding, Henghui and Liu, Chang and He, Shuting and Jiang, Xudong and Torr, Philip HS and Bai, Song},
      journal={arXiv preprint arXiv:2302.01872},
      year={2023}
    }


    Creative Commons License
    MOSE is licensed under a CC BY-NC-SA 4.0 License. The data of MOSE is released for non-commercial research purposes only.