Roberto Amoroso
Self-supervised learning
MaPeT: Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training
We propose a novel self-supervised pre-training technique for Vision Transformers called MaPeT and a novel image tokenizer called k-CLIP, which directly employs discretized CLIP features.
Lorenzo Baraldi, Roberto Amoroso, Marcella Cornia, Lorenzo_Baraldi, Andrea Pilzer, Rita Cucchiara