Active data curation effectively distills large-scale multimodal models

Authors: Vishaal Udandarao, Nikhil Parthasarathy, Muhammad Ferjad Naeem, Talfan Evans, Samuel Albanie, Federico Tombari, Yongqin Xian, Alessio Tonioni, Olivier J Hénaff
Published on arXiv, 2024

Abstract

Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. Prior works have explored ever more complex KD strategies involving different objective functions, teacher ensembles, and weight inheritance. In this work we explore an alternative, yet simple, approach: active data curation as effective distillation for contrastive multimodal pretraining. Our simple online batch selection method, ACID, outperforms strong KD baselines across various model, data, and compute configurations. Further, we find that such an active data curation strategy is in fact complementary to standard KD and can be effectively combined with it to train highly performant, inference-efficient models. Our simple and scalable pretraining framework, ACED, achieves state-of-the-art results across 27 zero-shot classification and retrieval tasks with up to 11% fewer inference FLOPs. We further demonstrate that our ACED models yield strong vision encoders for training generative multimodal models in the LiT-Decoder setting, outperforming larger vision encoders on image-captioning and visual question-answering tasks.
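To make the "active data curation as distillation" idea concrete, below is a minimal sketch of one plausible form of online batch selection: each super-batch is scored with per-example contrastive losses from the student and a pretrained teacher, and only the most "learnable" examples (hard for the student, easy for the teacher) are kept for the training step. The abstract does not specify ACID's exact scoring rule, so the scoring function, the `keep_ratio` parameter, and the `student`/`teacher` interfaces here are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of learnability-based online batch selection.
# Assumptions (not taken from the paper): student/teacher are callables
# returning (image_embeddings, text_embeddings) for a batch, and examples
# are ranked by student loss minus teacher loss.

import torch
import torch.nn.functional as F


def per_example_clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for each image-text pair (no reduction)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets, reduction="none")
    loss_t2i = F.cross_entropy(logits.t(), targets, reduction="none")
    return 0.5 * (loss_i2t + loss_t2i)


def select_learnable_subbatch(student, teacher, images, texts, keep_ratio=0.5):
    """Keep the fraction of a super-batch with the highest learnability score."""
    with torch.no_grad():
        s_img, s_txt = student(images, texts)
        t_img, t_txt = teacher(images, texts)
        student_loss = per_example_clip_loss(s_img, s_txt)
        teacher_loss = per_example_clip_loss(t_img, t_txt)
        # High score: the student currently struggles, but the teacher does not.
        learnability = student_loss - teacher_loss
    k = max(1, int(keep_ratio * images.size(0)))
    idx = torch.topk(learnability, k).indices
    return images[idx], texts[idx]
```

In such a scheme, the teacher shapes the student's training distribution rather than (or in addition to) its logits, which is what makes data curation act as an implicit form of distillation.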

Paper

BibTeX

@article{udandarao2024active,
  title={Active data curation effectively distills large-scale multimodal models},
  author={Udandarao, Vishaal and Parthasarathy, Nikhil and Naeem, Muhammad Ferjad and Evans, Talfan and Albanie, Samuel and Tombari, Federico and Xian, Yongqin and Tonioni, Alessio and H{\'e}naff, Olivier J},
  journal={arXiv preprint arXiv:2411.18674},
  year={2024}
}