Extract Free Dense Labels from CLIP

ECCV 2022, Oral

Paper

Abstract

Contrastive Language-Image Pre-training (CLIP) has made a remarkable breakthrough in open-vocabulary zero-shot image recognition. Many recent studies leverage pre-trained CLIP models for image-level classification and manipulation. In this paper, we wish to examine the intrinsic potential of CLIP for pixel-level dense prediction, specifically semantic segmentation. To this end, we show that with minimal modification, MaskCLIP yields compelling segmentation results on open concepts across various datasets in the absence of annotations and fine-tuning. By adding pseudo labeling and self-training, MaskCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins; e.g., the mIoUs of unseen classes on PASCAL VOC, PASCAL Context, and COCO Stuff are improved from 35.6, 20.7, and 30.3 to 86.1, 66.7, and 54.7, respectively. We also test the robustness of MaskCLIP under input corruption and evaluate its capability in discriminating fine-grained objects and novel concepts. Our findings suggest that MaskCLIP can serve as a new reliable source of supervision for dense prediction tasks to achieve annotation-free segmentation.

Here we show the original image in (a), the segmentation result of MaskCLIP+ in (b), and the confidence maps of MaskCLIP and MaskCLIP+ for Batman in (c) and (d), respectively. Through the adaptation of CLIP, MaskCLIP can be directly used to segment fine-grained and novel concepts (e.g., Batman and Joker) without any training or annotations. Combined with pseudo labeling and self-training, MaskCLIP+ further improves the segmentation results.

The MaskCLIP Framework

Compared to conventional fine-tuning, the key to the success of MaskCLIP is keeping the pre-trained weights frozen and making only minimal adaptations so that the visual-language association is preserved. Moreover, to compensate for the weakness of the CLIP image encoder in segmentation (it is designed for classification), MaskCLIP+ uses the outputs of MaskCLIP as pseudo labels and trains a more advanced segmentation network such as DeepLabv2.
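
As a concrete illustration, the following is a minimal sketch of this adaptation (not the authors' released code), assuming the OpenAI clip package with its ResNet-50 checkpoint; the forward hook on layer4, the example prompts, and example.jpg are illustrative choices.

import torch
import torch.nn.functional as F
from PIL import Image
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)  # weights stay frozen

# Capture the last feature map of the visual backbone (before attention pooling)
# with a forward hook on layer4.
captured = {}
model.visual.layer4.register_forward_hook(
    lambda _m, _i, out: captured.update(fmap=out))

class_prompts = ["a photo of a dog", "a photo of grass", "a photo of the sky"]
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    model.encode_image(image)              # fills captured["fmap"]
    fmap = captured["fmap"]                # [1, 2048, H/32, W/32] for RN50

    # Reuse the value and output projections of the attention-pooling layer as
    # 1x1 convolutions over the whole feature map, skipping the query/key path
    # (and, for simplicity in this sketch, the positional embedding).
    ap = model.visual.attnpool
    x = F.conv2d(fmap, ap.v_proj.weight[:, :, None, None], ap.v_proj.bias)
    x = F.conv2d(x, ap.c_proj.weight[:, :, None, None], ap.c_proj.bias)
    x = F.normalize(x, dim=1)              # dense image features, [1, 1024, h, w]

    # The frozen text embeddings act as a 1x1-conv classifier on those features.
    text = model.encode_text(clip.tokenize(class_prompts).to(device))
    text = F.normalize(text, dim=-1)       # [num_classes, 1024]

    logits = torch.einsum("bchw,kc->bkhw", x, text)
    seg = logits.argmax(dim=1)             # coarse per-pixel class map

Upsampling logits (e.g., bilinearly) to the input resolution yields a segmentation map at the original image size.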

Qualitative results on PASCAL Context

Here all results are obtained without any annotation. PD and KS refer to prompt denoising and key smoothing, respectively. With PD, some distracting classes are removed. KS is more aggressive: its outputs are much less noisy but are dominated by a small number of classes. Finally, MaskCLIP+ yields the best results.
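
For reference, the two refinements can be sketched on top of the previous snippet (reusing fmap, ap, and logits); the 0.5 denoising threshold and the exact form of the affinity are illustrative assumptions rather than the released implementation.

with torch.no_grad():
    b, c, h, w = logits.shape

    # Key smoothing: re-weight each pixel's prediction with the predictions of
    # pixels whose key embeddings (from the attention-pooling layer) are similar.
    k = F.conv2d(fmap, ap.k_proj.weight[:, :, None, None], ap.k_proj.bias)
    k = F.normalize(k.flatten(2), dim=1)                  # [b, 2048, h*w]
    affinity = torch.einsum("bci,bcj->bij", k, k)         # pairwise cosine similarity
    smoothed = torch.einsum("bij,bcj->bci", affinity, logits.flatten(2))
    smoothed = smoothed.reshape(b, c, h, w)

    # Prompt denoising: suppress classes whose best score anywhere in the image
    # is low, so unlikely (distracting) classes cannot win the per-pixel argmax.
    conf = logits.softmax(dim=1).flatten(2).max(dim=-1).values   # [b, c]
    smoothed[conf < 0.5] = float("-inf")
    seg = smoothed.argmax(dim=1)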

Qualitative results on Web images

Here we show the segmentation results of MaskCLIP and MaskCLIP+ on various unseen classes, including fine-grained classes such as cars in different colors and imagery styles, celebrities, and animation characters. All results are obtained without any annotation.

Zero-shot Segmentation

ST stands for self-training. mIoU(U) denotes the mIoU of the unseen classes. SPNet-C is SPNet with calibration. On PASCAL Context, all methods use DeepLabv3+-ResNet101 as the backbone segmentation model, while the other two datasets use DeepLabv2-ResNet101. For MaskCLIP+, CLIP-ResNet-50 is used to generate pseudo labels; a minimal sketch of this pseudo-label training step follows the table.

Method            |      PASCAL-VOC       |      COCO-Stuff       |    PASCAL-Context
                  | mIoU(U)  mIoU   hIoU  | mIoU(U)  mIoU   hIoU  | mIoU(U)  mIoU   hIoU
------------------+-----------------------+-----------------------+---------------------
(Inductive)
SPNet             |   0.0    56.9    0.0  |   0.7    31.6    1.4  |    -      -      -
SPNet-C           |  15.6    63.2   26.1  |   8.7    32.8   14.0  |    -      -      -
ZS3Net            |  17.7    61.6   28.7  |   9.5    33.3   15.0  |  12.7    19.4   15.8
CaGNet            |  26.6    65.5   39.7  |  12.2    33.5   18.2  |  18.5    23.2   21.2
(Transductive)
SPNet+ST          |  25.8    64.8   38.8  |  26.9    34.0   30.3  |    -      -      -
ZS3Net+ST         |  21.2    63.0   33.3  |  10.6    33.7   16.2  |  20.7    26.0   23.4
CaGNet+ST         |  30.3    65.8   43.7  |  13.4    33.7   19.5  |    -      -      -
STRICT            |  35.6    70.9   49.8  |  30.3    34.9   32.6  |    -      -      -
MaskCLIP+ (ours)  |  86.1    88.1   87.4  |  54.7    39.6   45.0  |  66.7    48.1   53.3
                  | ↑50.5   ↑17.2  ↑37.6  | ↑24.4   ↑ 4.7  ↑12.4  | ↑46.0   ↑22.1  ↑29.9
Fully Sup.        |   -      88.2    -    |   -      39.9    -    |   -      48.2    -
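
As noted in the caption above, MaskCLIP+ trains a standard segmentation network on MaskCLIP's dense outputs. Below is a minimal sketch of one such pseudo-label training step under stated assumptions: a torchvision DeepLabv3-ResNet101 stands in for the DeepLabv2/DeepLabv3+ models used in the table, and the dense CLIP logits are assumed to be produced as in the earlier MaskCLIP sketch.

import torch
import torch.nn.functional as F
from torchvision.models.segmentation import deeplabv3_resnet101

num_classes = 21                                   # e.g., PASCAL VOC
student = deeplabv3_resnet101(num_classes=num_classes)
optimizer = torch.optim.SGD(student.parameters(), lr=1e-3, momentum=0.9)

def train_step(image, maskclip_logits):
    # image: [B, 3, H, W]; maskclip_logits: [B, num_classes, h, w] dense CLIP
    # logits from the frozen MaskCLIP sketch above (no human annotations used).
    with torch.no_grad():
        up = F.interpolate(maskclip_logits, size=image.shape[-2:],
                           mode="bilinear", align_corners=False)
        pseudo = up.argmax(dim=1)                  # [B, H, W] pseudo ground truth
    out = student(image)["out"]                    # [B, num_classes, H, W]
    loss = F.cross_entropy(out, pseudo)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In the paper this is combined with self-training, where the student's own predictions eventually serve as pseudo labels.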

Paper

Citation

@InProceedings{zhou2022maskclip,
 author = {Zhou, Chong and Loy, Chen Change and Dai, Bo},
 title = {Extract Free Dense Labels from CLIP},
 booktitle = {European Conference on Computer Vision (ECCV)},
 year = {2022}
}

Related Projects

  • Neural Prompt Search
    Y. Zhang, K. Zhou, Z. Liu
    Technical report, arXiv:2206.04673, 2022
    [arXiv] [Project Page]
  • Conditional Prompt Learning for Vision-Language Models
    K. Zhou, J. Yang, C. C. Loy, Z. Liu
    in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022
    [PDF] [arXiv] [Project Page]
  • Learning to Prompt for Vision-Language Models
    K. Zhou, J. Yang, C. C. Loy, Z. Liu
    International Journal of Computer Vision (IJCV), 2022
    [arXiv] [Project Page]

Contact


Chong Zhou
Email: chong033 at ntu.edu.sg