Open Panoramic Segmentation
ECCV 2024
Junwei Zheng1
Ruiping Liu1
Yufan Chen1
Kunyu Peng1
Chengzhi Wu1
Kailun Yang2
Jiaming Zhang1,†
Rainer Stiefelhagen1
1Karlsruhe Institute of Technology
2Hunan University
†Corresponding Author


Panoramic images, capturing a 360° field of view (FoV), encompass omnidirectional spatial information crucial for scene understanding. However, it is not only costly to obtain sufficient densely-annotated panoramas for training, but models trained in a closed-vocabulary setting are also restricted in application. To tackle these problems, in this work, we define a new task termed Open Panoramic Segmentation (OPS), where models are trained on FoV-restricted pinhole images in the source domain in an open-vocabulary setting and evaluated on FoV-open panoramic images in the target domain, enabling zero-shot open panoramic semantic segmentation. Moreover, we propose a model named OOOPS with a Deformable Adapter Network (DAN), which significantly improves zero-shot panoramic semantic segmentation performance. To further enhance distortion-aware modeling from the pinhole source domain, we propose a novel data augmentation method called Random Equirectangular Projection (RERP), specifically designed to address object deformations in advance. Surpassing other state-of-the-art open-vocabulary semantic segmentation approaches, a remarkable performance boost on three panoramic datasets, WildPASS, Stanford2D3D, and Matterport3D, proves the effectiveness of our proposed OOOPS model with RERP on the OPS task, especially +2.2% mIoU on outdoor WildPASS and +2.4% mIoU on indoor Stanford2D3D.

Task Definition

Fig. 1: (a) The challenge of existing state-of-the-art segmentation models. (b) The limitation of categories in traditional closed-vocabulary panoramic segmentation tasks. (c) Our newly defined Open Panoramic Segmentation (OPS) task aims at tackling the above challenges. OPS consists of three important elements: Open the FoV, targeting the challenge of the 360° FoV; Open the Vocabulary, targeting the drawback of closed-vocabulary panoramic segmentation; and Open the Domain, targeting the scarcity of panoramic labels.


Fig. 2: Overview of the OOOPS model architecture. It consists of a frozen CLIP model and a Deformable Adapter Network (DAN), which includes Transformer Layers and the proposed Deformable Adapter Operator (DAO). Feature fusion occurs between the intermediate layers of CLIP and DAN. During training, pinhole images are fed into OOOPS, producing mask proposals and proposal logits for the loss calculation. During inference, panoramas are fed into OOOPS to produce the segmentation predictions.

Fig. 3: Salient map generation in DAO.

Fig. 4: (a) Visualization of ERP on a panoramic image and (b) Our proposed RERP on pinhole images.
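Fig. 4 contrasts ERP on a panorama with RERP applied to pinhole images. For context, the standard equirectangular projection maps a direction on the unit sphere, given by longitude and latitude, linearly to pixel coordinates, which is what produces the characteristic distortion near the poles. A minimal sketch of this standard mapping (an illustration only; the paper's RERP augmentation is not reproduced here):

```python
import numpy as np

def erp_pixel(lon, lat, width, height):
    """Map spherical coordinates (radians) to equirectangular pixel coords.

    lon in [-pi, pi) increases left-to-right; lat in [-pi/2, pi/2]
    increases bottom-to-top, so the north pole maps to row 0.
    """
    u = (lon + np.pi) / (2 * np.pi) * width   # longitude -> column
    v = (np.pi / 2 - lat) / np.pi * height    # latitude  -> row
    return u, v
```

For a 2048×1024 panorama, the forward direction (lon = 0, lat = 0) lands at the image center (1024, 512), while rows near the top of the image correspond to a shrinking circle around the pole, stretching objects horizontally.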


Fig. 5: (a) Comparison on the WildPASS dataset and (b) Visualization of the prediction from OOOPS in close- and open-vocabulary settings.

Fig. 6: Visualization of deformable offsets. The green points • are sampling locations. The red points • are deformable offsets across 2 levels, indicating a deformable receptive field (e.g., each level has a 3×3 kernel size, resulting in (3×3)² = 81 red points).
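The deformable sampling visualized in Fig. 6 follows the general deformable-convolution idea: features are gathered not on a rigid grid, but at grid positions displaced by predicted fractional offsets, using bilinear interpolation. A minimal NumPy sketch of this generic operation (an illustration of deformable sampling, not the paper's DAO implementation):

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a (H, W) feature map at fractional (y, x).
    Coordinates are clamped to the map borders."""
    h, w = feat.shape
    y0 = min(max(int(np.floor(y)), 0), h - 1)
    x0 = min(max(int(np.floor(x)), 0), w - 1)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0] + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0] + wy * wx * feat[y1, x1])

def deformable_sample(feat, center, offsets):
    """Sample a 3x3 deformable neighborhood around `center`.

    Each of the 9 regular grid positions is shifted by a learned
    (dy, dx) offset from `offsets`, shape (9, 2), which bends the
    receptive field toward the image content (green/red points in Fig. 6).
    """
    cy, cx = center
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    return np.array([bilinear_sample(feat, cy + dy + oy, cx + dx + ox)
                     for (dy, dx), (oy, ox) in zip(grid, offsets)])
```

With all offsets zero this reduces to an ordinary 3×3 neighborhood; stacking two such 3×3 levels is what yields the (3×3)² = 81 effective sampling locations shown in the figure.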


If you find our work useful in your research, please cite:

@inproceedings{zheng2024open,
  title={Open Panoramic Segmentation},
  author={Zheng, Junwei and Liu, Ruiping and Chen, Yufan and Peng, Kunyu and Wu, Chengzhi and Yang, Kailun and Zhang, Jiaming and Stiefelhagen, Rainer},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}