Masked Autoencoders (MAE) have become a prevailing paradigm for large-scale
vision representation pre-training. By reconstructing masked image patches from
a small portion of visible image regions, MAE forces the model to infer
semantic correlations within an image. Recently, some approaches have applied
semantic-rich teacher models to extract image features as the reconstruction
target, leading to better performance. However, unlike low-level targets such
as pixel values, we argue that the features extracted by a powerful teacher
model already encode rich semantic correlations across regions of an intact
image. This raises a question: is reconstruction necessary in Masked Image
Modeling (MIM) with a teacher model? In this paper, we propose an efficient MIM
paradigm named MaskAlign.
MaskAlign simply learns the consistency between the
visible-patch features extracted by the student model and the intact-image
features extracted by the teacher model. To further improve performance and
address the input inconsistency between the student and teacher models, we
propose a Dynamic Alignment (DA) module that applies learnable alignment. Our
experimental results demonstrate that masked modeling does not lose
effectiveness even without reconstruction of masked regions. Combined with
Dynamic Alignment, MaskAlign achieves state-of-the-art performance with much
higher efficiency.
higher efficiency. Code and models will be available at
https://github.com/OpenPerceptionX/maskalign.
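
To make the objective concrete, here is a minimal PyTorch sketch of the idea described above, not the authors' released implementation: the `student`/`teacher` interfaces, `visible_idx`, and feature dimensions are assumptions, and a single linear projection stands in for the Dynamic Alignment module. The key point it illustrates is that the loss is computed only on visible-patch features, with no decoder and no reconstruction of masked regions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskAlignSketch(nn.Module):
    """Hypothetical sketch of the MaskAlign objective.

    The student encodes only the visible patches; the frozen teacher
    encodes the intact image. A learnable projection (a stand-in for
    the Dynamic Alignment module) maps student features into the
    teacher's feature space, and a consistency loss is taken only at
    the visible-patch positions.
    """

    def __init__(self, student, teacher, dim_s, dim_t):
        super().__init__()
        self.student = student            # e.g. a ViT run on visible patches only
        self.teacher = teacher.eval()     # frozen, semantic-rich teacher
        for p in self.teacher.parameters():
            p.requires_grad = False
        self.align = nn.Linear(dim_s, dim_t)  # placeholder for Dynamic Alignment

    def forward(self, images, visible_idx):
        # Student sees only the visible patches: (B, N_vis, dim_s).
        s_feat = self.student(images, visible_idx)
        # Teacher sees the intact image: (B, N, dim_t), no gradients.
        with torch.no_grad():
            t_feat = self.teacher(images)
        # Gather teacher features at the visible positions: (B, N_vis, dim_t).
        idx = visible_idx.unsqueeze(-1).expand(-1, -1, t_feat.size(-1))
        t_vis = t_feat.gather(1, idx)
        # Consistency loss: cosine distance between aligned student
        # features and teacher features on visible patches only.
        return 1 - F.cosine_similarity(self.align(s_feat), t_vis, dim=-1).mean()
```

Because the teacher runs without gradients and the student processes only the visible subset of patches, this formulation avoids the decoder and masked-token computation of reconstruction-based MIM, which is where the efficiency gain claimed above comes from.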