Pre-training

MaPeT: Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training

We propose MaPeT, a novel self-supervised pre-training technique for Vision Transformers, and k-CLIP, a novel image tokenizer that directly employs discretized CLIP features.

Lorenzo Baraldi, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, Andrea Pilzer, Rita Cucchiara
