The 3rd Pixel-level Video Understanding in the Wild Challenge Workshop

17 June, 2024

CVPR 2024, Seattle

Introduction

Pixel-level scene understanding is one of the fundamental problems in computer vision, aiming to recognize the object class, mask, and semantics of each pixel in a given image. Since the real world is dynamic rather than static, learning to perform video segmentation is more reasonable and practical for realistic applications. To advance segmentation from images to videos, this workshop presents new datasets and competitions for the challenging yet practical task of Pixel-level Video Understanding in the Wild (PVUW). The workshop also features a workshop paper track.

This workshop will cover, but is not limited to, the following topics:

● Semantic/panoptic segmentation for images/videos 

● Video object/instance segmentation

● Efficient computation for video scene parsing 

● Object tracking 

● Semi-supervised recognition in videos 

● New metrics to evaluate the quality of video scene parsing results 

● Real-world video applications, including autonomous driving, indoor robotics, visual navigation, etc.

Challenges

The Pixel-level Video Understanding in the Wild (PVUW) challenge includes four tracks. This year, we add two new tracks: the Complex Video Object Segmentation track, based on MOSE [1], and the Motion Expression guided Video Segmentation track, based on MeViS [2]. In these two new tracks, we provide additional videos and annotations featuring challenging elements, such as the disappearance and reappearance of objects, inconspicuous small objects, heavy occlusions, and crowded environments in MOSE. Moreover, we provide MeViS, a new motion-expression-guided video segmentation dataset, to study natural-language-guided video understanding in complex environments. These new videos, sentences, and annotations foster the development of more comprehensive and robust pixel-level understanding of video scenes in complex environments and realistic scenarios.


[1] MOSE: A New Dataset for Video Object Segmentation in Complex Scenes. ICCV 2023

[2] MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions. ICCV 2023


Track 1: Video Semantic Segmentation (VSS) Track 

The video semantic segmentation task aims to recognize the semantics of all frames in a given video. To participate in Track 1, please visit this link.
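As an illustration, video semantic segmentation results are commonly scored with mean Intersection-over-Union (mIoU) accumulated over all frames. The sketch below assumes integer label maps and the standard mIoU definition; the track's exact evaluation protocol may differ, so treat this as illustrative only:

```python
import numpy as np

def miou(preds, gts, num_classes):
    """Mean Intersection-over-Union accumulated over all video frames.

    preds, gts: iterables of HxW integer label maps (one per frame).
    num_classes: number of semantic classes.
    """
    inter = np.zeros(num_classes, dtype=np.int64)
    union = np.zeros(num_classes, dtype=np.int64)
    for p, g in zip(preds, gts):
        for c in range(num_classes):
            pc, gc = (p == c), (g == c)
            inter[c] += np.logical_and(pc, gc).sum()
            union[c] += np.logical_or(pc, gc).sum()
    valid = union > 0  # skip classes absent from both prediction and ground truth
    return float((inter[valid] / union[valid]).mean())
```

Accumulating intersections and unions per class across frames (rather than averaging per-frame IoUs) keeps small objects that appear in only a few frames from being over- or under-weighted.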


Track 2: Video Panoptic Segmentation (VPS) Track

The video panoptic segmentation task aims to jointly predict object classes, bounding boxes, masks, instance ID tracking, and semantic segmentation in video frames. To participate in Track 2, please visit this link.


Track 3: Complex Video Object Segmentation Track

The complex video object segmentation task aims to track and segment objects in complex environments. To participate in Track 3, please visit this link.
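Video object segmentation benchmarks in this line of work typically report the region similarity J (Jaccard index of the predicted and ground-truth object masks) alongside a boundary measure F. The minimal sketch below covers only the J part and assumes binary per-object masks; whether the track uses exactly this protocol is an assumption:

```python
import numpy as np

def jaccard(pred_masks, gt_masks):
    """Region similarity J: mean IoU of binary object masks over frames.

    pred_masks, gt_masks: iterables of HxW boolean arrays, one per frame.
    Frames where the object is absent from both masks score 1.0.
    """
    scores = []
    for p, g in zip(pred_masks, gt_masks):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        scores.append(1.0 if union == 0 else inter / union)
    return float(np.mean(scores))
```

Averaging per-frame IoUs penalizes a tracker that loses an object in even a few frames, which matters for the disappearance/reappearance cases this track emphasizes.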


Track 4: Motion Expression guided Video Segmentation Track

The motion expression guided video segmentation track focuses on segmenting objects in video content based on a sentence describing the motion of the objects. To participate in Track 4, please visit this link.


Important Dates: 


Call for Papers

Submission: We invite authors to submit unpublished papers (8-page CVPR format) to our workshop, to be presented at a poster session upon acceptance. All submissions will go through a double-blind review process. All contributions must be submitted (along with supplementary materials, if any) at this link.

Accepted papers will be published in the official CVPR Workshops proceedings and the Computer Vision Foundation (CVF) Open Access archive.

Important Dates: 


Invited Speakers

Xiaojuan Qi

Assistant Professor

The University of Hong Kong

Martin Danelljan

Lecturer 

ETH Zurich

Chen Change Loy

Professor

Nanyang Technological University

Yun Liu

Senior Scientist

A*STAR

Workshop Schedule

13:30 Chairs’ opening remarks

13:45 Invited talk 1, Prof. Xiaojuan Qi, The University of Hong Kong

14:15 Invited talk 2, Dr. Martin Danelljan, ETH Zurich

14:45 Invited talk 3, Prof. Chen Change Loy, Nanyang Technological University

15:15 Break

15:30 Invited talk 4, Dr. Yun Liu, Institute for Infocomm Research, A*STAR

16:00 Challenge Track 1 first-place winners’ oral presentation

16:10 Challenge Track 2 first-place winners’ oral presentation

16:20 Challenge Track 3 first-place winners’ oral presentation

16:30 Challenge Track 4 first-place winners’ oral presentation

16:40 Award ceremony and concluding remarks

Organizers

Henghui Ding

Fudan University

Jiaxu Miao

Zhejiang University

Yunchao Wei

Beijing Jiaotong University 

Zongxin Yang

Zhejiang University

Nikhila Ravi

Meta AI

Yi Yang

Zhejiang University

Si Liu

Beihang University

Yi Zhu

Amazon

Elisa Ricci

University of Trento

Cees Snoek

University of Amsterdam

Song Bai

ByteDance

Philip Torr

University of Oxford