Masked Frequency Modeling for
Self-Supervised Visual Pre-Training

ICLR 2023

Abstract

We present Masked Frequency Modeling (MFM), a unified frequency-domain-based approach for the self-supervised pre-training of visual models. Instead of randomly inserting mask tokens into the input embeddings in the spatial domain, MFM shifts the perspective to the frequency domain. Specifically, MFM first masks out a portion of the frequency components of the input image and then predicts the missing frequencies on the frequency spectrum. Our key insight is that predicting masked components in the frequency domain is better suited to revealing underlying image patterns than predicting masked patches in the spatial domain, due to the heavy spatial redundancy of images. Our findings suggest that, with the right configuration of the mask-and-predict strategy, both the structural information within high-frequency components and the low-level statistics among the low-frequency counterparts are useful for learning good representations. For the first time, MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations even using none of the following: (i) extra data, (ii) extra model, (iii) mask token. Experimental results on image classification and semantic segmentation, as well as on several robustness benchmarks, show the competitive performance and advanced robustness of MFM compared with recent masked image modeling approaches. Furthermore, we comprehensively investigate the effectiveness of classical image restoration tasks for representation learning from a unified frequency perspective and reveal their intriguing relations with our MFM approach.

Comparison of masking recipes in Masked Language Modeling (MLM), Masked Image Modeling (MIM), low-level image processing, and Masked Frequency Modeling (MFM). Note the differences in the masked information among MIM, low-level image processing, and MFM.

The MFM Pipeline

We convert each input image into frequency domain via Fast Fourier Transform (FFT) and mask a portion of frequencies on the frequency spectrum via a low-/high-pass filter. After inverse FFT (iFFT), the low-/high-pass filtered spatial images are then randomly fed to the encoder (e.g., ViT, CNN), with a lightweight one-layer head to predict the masked frequency values on the frequency spectrum via a frequency loss.
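As a concrete illustration, below is a minimal PyTorch sketch of the mask-and-predict step. It is not the official implementation: the circular mask shape, the 50/50 choice between low- and high-pass filtering, and the plain L1 spectrum distance used as a stand-in for the frequency loss are assumptions based on the description above.

import torch

def mfm_corrupt(img: torch.Tensor, mask_radius: float = 16.0):
    """Mask a portion of frequencies of `img` (C, H, W); return the corrupted image
    and the boolean mask of frequencies the model must predict."""
    _, H, W = img.shape
    # 2D FFT with the zero-frequency component shifted to the center of the spectrum.
    freq = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))

    # Distance of every frequency coordinate from the spectrum center.
    ys = torch.arange(H).view(-1, 1) - H / 2
    xs = torch.arange(W).view(1, -1) - W / 2
    dist = torch.sqrt(ys ** 2 + xs ** 2)

    # Randomly apply a low-pass (keep the center) or high-pass (keep the periphery) filter.
    if torch.rand(1).item() < 0.5:
        keep = dist <= mask_radius   # low-pass: high frequencies are masked
    else:
        keep = dist > mask_radius    # high-pass: low frequencies are masked

    masked_freq = freq * keep.to(freq.real.dtype)
    # Back to the spatial domain; this corrupted image is what the encoder sees.
    corrupted = torch.fft.ifft2(torch.fft.ifftshift(masked_freq, dim=(-2, -1))).real
    return corrupted, ~keep          # ~keep marks the masked frequencies

def frequency_loss(pred_img: torch.Tensor, target_img: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """L1 distance between predicted and ground-truth spectra on the masked frequencies
    (a simple stand-in for the frequency loss used in the paper)."""
    pred_f = torch.fft.fftshift(torch.fft.fft2(pred_img), dim=(-2, -1))
    target_f = torch.fft.fftshift(torch.fft.fft2(target_img), dim=(-2, -1))
    return (pred_f - target_f).abs()[:, mask].mean()

The prediction target is the original, unmasked spectrum, so the encoder has to infer the removed frequency band from the band it still observes.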

Diagnosis of Low-Level Image Processing Tasks

We examine the representation learning capability of three representative low-level image processing tasks (i.e., image super-resolution (SR), deblurring, and denoising) from a unified frequency perspective. We observe that: 1) The optimal degradation level of each task in the context of representation learning is much heavier than in its original task setting. 2) With the right configuration of task difficulty, all these low-level tasks can achieve comparable or even better performance than their supervised counterpart (e.g., 81.8% for DeiT). 3) Representation learning benefits from frequencies across the whole spectrum. Compared with these tasks, MFM provides a more general and unified frequency perspective for performing low-level corruptions while being conceptually simpler: we directly remove certain frequencies on the frequency spectrum via a low-/high-pass filter (a code sketch of these degradation operators is given after Table 1).

Table 1: Comparison of SR, deblurring, denoising and MFM tasks with ViT-B/16 on ImageNet-1K. All models are pre-trained for 300 epochs, and evaluated with top-1 fine-tuning accuracy. Corrupted image samples from ImageNet-1K training set with different degradation levels are visualized in both image and frequency domain. The studied hyper-parameter that controls the difficulty of degradation for each task is (a) downsampling scale factor, (b) Gaussian blur sigma, (c) Gaussian noise sigma, and (d) mask radius, respectively.

(a)

Task Param. Top-1 acc
SR x2 82.1
x4 82.2
x8 82.4
x16 82.1

(b)

Task Param. Top-1 acc
Deblur 1 79.7
3 81.2
5 81.7
7 81.5

(c)

Task Param. Top-1 acc
Denoise 25 82.4
50 82.6
75 82.7
100 82.6

(d)

Task Param. Top-1 acc
MFM 8 82.8
16 83.1
24 82.7
32 82.6
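For illustration, here is a rough sketch of the three spatial-domain degradation operators at the best-performing settings in Table 1 (scale x8 for SR, blur sigma 5 for deblurring, noise sigma 75 for denoising). It approximates the task setups rather than reproducing the exact pre-training pipeline; in particular, the bicubic resampling, the 21x21 blur kernel, and the assumption of pixel values in [0, 255] for the noise are our choices.

import torch
import torch.nn.functional as F
from torchvision.transforms import GaussianBlur

def degrade_sr(img: torch.Tensor, scale: int = 8) -> torch.Tensor:
    """Downsample by `scale` and upsample back; the pre-training task restores the lost detail."""
    _, h, w = img.shape
    small = F.interpolate(img[None], size=(h // scale, w // scale), mode="bicubic", align_corners=False)
    return F.interpolate(small, size=(h, w), mode="bicubic", align_corners=False)[0]

def degrade_blur(img: torch.Tensor, sigma: float = 5.0) -> torch.Tensor:
    """Gaussian blur: suppresses high frequencies while leaving low frequencies intact."""
    return GaussianBlur(kernel_size=21, sigma=sigma)(img)

def degrade_noise(img: torch.Tensor, sigma: float = 75.0) -> torch.Tensor:
    """Additive Gaussian noise (pixel values assumed in [0, 255]); corrupts all frequencies."""
    return (img + sigma * torch.randn_like(img)).clamp(0, 255)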

Comparison with Previous Methods

ViT

Compared with other representative self-supervised learners, MFM can achieve comparable performance with fewer pre-training epochs while using none of the following: (i) extra data, (ii) extra model, (iii) mask token.

Table 2: ImageNet-1K top-1 fine-tuning accuracy of self-supervised models using ViT-S/16 and ViT-B/16 as the encoder.

Method    Pre-train data  Extra model   Mask token  Epochs  ViT-S  ViT-B
Scratch   -               -             -           -       79.9   81.8
MoCo v3   IN-1K           momentum ViT  -           600     81.4   83.2
DINO      IN-1K           momentum ViT  -           1600    81.5   82.8
BEiT      IN-1K+DALL-E    dVAE          ✓           300     81.3   82.9
MAE       IN-1K           -             ✓           300     80.6   82.9
SR        IN-1K           -             -           300     80.8   82.4
Deblur    IN-1K           -             -           300     79.4   81.7
Denoise   IN-1K           -             -           300     81.1   82.7
MFM       IN-1K           -             -           300     81.6   83.1

ResNet-50

In contrast to the ViT results, we observe performance degeneration for low-level image processing tasks such as SR, deblurring and denoising compared with the RSB training-from-scratch baseline. We hypothesize that this discrepancy is due to the architectural difference between ViT and CNN: the convolution operation in CNNs is already effective at capturing high-frequency components, so encouraging a CNN model to reconstruct the high-frequency components of images brings no benefit to performance. For ViT, on the other hand, learning to restore high-frequency information compensates for its limited ability to capture such components. MFM, which leverages both low- and high-frequency components, outperforms its supervised counterparts on both ViT and CNN architectures.

Table 3: ImageNet-1K top-1 fine-tuning accuracy of self-supervised models using ResNet-50 as the encoder.

(a) Training-from-scratch baselines

Method     Epochs  Top-1 acc
Original   90      75.3
PyTorch    90      76.1
FixRes     120     77.0
DeiT       300     78.4
ResNet-RS  350     78.8
FAMS       400     79.5

(b) Fine-tuning for 100 epochs

Method Epochs Top-1 acc
RSB A3 - 78.1
SR 300 77.9
Deblur 300 78.0
Denoise 300 77.5
MFM 300 78.5

(c) Fine-tuning for 300 epochs

Method Epochs Top-1 acc
RSB A2 - 79.8
SimSiam 400 79.1
MoCo v2 400 79.6
SimCLR 800 79.9
BYOL 400 80.0
SwAV 600 80.1
MFM 300 80.1

Robustness Evaluation

We evaluate the robustness of our models on a series of benchmarks covering three aspects: (i) adversarial robustness (FGSM, PGD and ImageNet-A), (ii) common corruption robustness (ImageNet-C), and (iii) out-of-distribution robustness (ImageNet-R and ImageNet-Sketch). We draw three observations: 1) Transformer-based models (e.g., ViT) are more robust than their CNN counterparts (e.g., ResNet-50). 2) Corruption-based tasks (e.g., SR, Deblur, Denoise and MFM) are generally more robust than MIM tasks (e.g., MAE and SimMIM). 3) MFM achieves the best trade-off between standard performance and robustness: its robustness always ranks within the top two, while its standard accuracy is the best.
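As an illustration of the adversarial part of this protocol, a minimal single-step FGSM evaluation loop is sketched below. The epsilon value and the assumption that the loader yields images scaled to [0, 1] are ours; the exact attack settings used in the paper may differ.

import torch
import torch.nn.functional as F

def fgsm_accuracy(model, loader, epsilon: float = 1.0 / 255, device: str = "cuda") -> float:
    """Top-1 accuracy (%) on single-step FGSM adversarial examples."""
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        images.requires_grad_(True)
        loss = F.cross_entropy(model(images), labels)
        grad = torch.autograd.grad(loss, images)[0]
        # Perturb each pixel one step in the direction of the loss gradient sign.
        adv = (images + epsilon * grad.sign()).clamp(0, 1).detach()
        with torch.no_grad():
            correct += (model(adv).argmax(dim=1) == labels).sum().item()
        total += labels.size(0)
    return 100.0 * correct / total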

Table 4: Robustness evaluation on six robustness benchmarks (ViT-B/16). We report top-1 accuracy except for IN-C that uses the mean corruption error (mCE). The original ImageNet top-1 fine-tuning results are also appended for reference. The best results are in bold, and the second best results are underlined.

Method   FGSM  PGD  IN-C (↓)  IN-A  IN-R  IN-SK  Orig.
Scratch 46.3 21.2 48.5 28.1 44.7 32.0 81.8
MAE 38.9 11.2 52.3 31.5 48.3 33.8 82.9
SR 46.1 21.5 46.3 29.1 49.2 35.5 82.4
Deblur 42.5 17.2 49.2 25.3 46.9 33.2 81.7
Denoise 47.6 24.3 47.8 30.7 48.4 34.8 82.7
MFM 47.7 24.4 47.5 32.7 48.6 34.8 83.1

Table 5: Robustness evaluation on six robustness benchmarks (ResNet-50). We report top-1 accuracy except for IN-C that uses the mean corruption error (mCE). The original ImageNet top-1 fine-tuning results are also appended for reference. The best results are in bold, and the second best results are underlined.

Method   FGSM  PGD  IN-C (↓)  IN-A  IN-R  IN-SK  Orig.
Scratch 20.2 3.4 77.0 6.6 36.0 25.0 78.1
SimMIM 16.8 2.1 77.0 5.7 34.9 24.2 77.7
SR 17.2 1.9 73.6 6.5 35.8 25.4 77.9
Deblur 17.2 2.0 74.8 8.2 37.2 26.5 78.0
Denoise 15.8 1.8 78.0 7.2 35.6 24.7 77.5
MFM 18.5 2.3 74.2 9.0 36.9 26.7 78.5

Qualitative Results

Compared with SR, Deblur and Denoise, MFM can utilize both high-frequency and low-frequency information for prediction.

Example results of recovered images on the ImageNet-1K validation set for the SR, deblurring, denoising and MFM tasks. We visualize both the images and their frequency spectra. We use the best pre-trained model of each task studied in our paper for visualization, i.e., the downsampling scale factor is ×8 for SR, the Gaussian blur sigma is 5 for Deblur, the Gaussian noise sigma is 75 for Denoise, and the mask radius is 16 for MFM.

Example results of recovered images on the COCO validation set for the SR, deblurring, denoising and MFM tasks, using the models pre-trained on ImageNet-1K. We visualize both the images and their frequency spectra.

Citation

@inproceedings{xie2023masked,
 title = {Masked Frequency Modeling for Self-Supervised Visual Pre-Training},
 author = {Xie, Jiahao and Li, Wei and Zhan, Xiaohang and Liu, Ziwei and Ong, Yew Soon and Loy, Chen Change},
 booktitle = {ICLR},
 year = {2023}
}

Related Projects

  • Correlational Image Modeling for Self-Supervised Visual Pre-Training
    W. Li, J. Xie, C. C. Loy
    in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2023 (CVPR)
    [arXiv] [Project Page]
  • Delving into Inter-Image Invariance for Unsupervised Visual Representations
    J. Xie, X. Zhan, Z. Liu, Y. S. Ong, C. C. Loy
    International Journal of Computer Vision, 2022 (IJCV)
    [PDF] [DOI] [arXiv] [Project Page]
  • Unsupervised Object-Level Representation Learning from Scene Images
    J. Xie, X. Zhan, Z. Liu, Y. S. Ong, C. C. Loy
    in Proceedings of Neural Information Processing Systems, 2021 (NeurIPS)
    [PDF] [arXiv] [Project Page]
  • Online Deep Clustering for Unsupervised Representation Learning
    X. Zhan*, J. Xie*, Z. Liu, Y. S. Ong, C. C. Loy
    in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2020 (CVPR)
    [PDF] [arXiv] [Project Page]

Contact


Jiahao Xie
Email: jiahao003 at e.ntu.edu.sg