Vision-and-Language

[ CVIU 2025 ] We propose a novel self-supervised pre-training technique for Vision Transformers, called MaPeT, and a novel image tokenizer, called k-CLIP, which directly employs discretized CLIP features.

Lorenzo Baraldi, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, Andrea Pilzer, Rita Cucchiara

[ TOMM 2024 ] We propose a novel deepfake detection method for images generated by Diffusion Models and introduce a new dataset, COCO-Fake, consisting of 650K generated fake images.

Roberto Amoroso, Davide Morelli, Marcella Cornia, Lorenzo Baraldi, Alberto Del Bimbo, Rita Cucchiara