AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding

Abstract

Aligning visual and language representations in vision-language models hinges on the connector that maps vision-encoder features into the LLM space. Conventional MLP connectors often emit noisy, out-of-distribution inputs that misalign the modalities. AlignVLM instead maps visual tokens to a weighted mixture of existing LLM text embeddings, leveraging linguistic priors to keep visual features in-distribution. Coupled with document-specific modeling, this connector yields stronger vision-text alignment, improved robustness to noise, and state-of-the-art accuracy on document understanding benchmarks.
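To make the connector idea concrete, here is a minimal PyTorch-style sketch inferred from the abstract, not the authors' implementation: names such as AlignConnector, vision_dim, and text_embeddings are assumptions. Each visual token is projected to logits over the LLM vocabulary, and the output is the softmax-weighted mixture of the (frozen) LLM text-embedding table, so visual features stay inside the span of existing text embeddings.

```python
# Sketch of an ALIGN-style connector (assumed names/shapes, not the paper's code):
# map each visual token to a probability distribution over the LLM vocabulary,
# then return the corresponding convex combination of the text-embedding matrix.
import torch
import torch.nn as nn


class AlignConnector(nn.Module):
    def __init__(self, vision_dim: int, llm_embeddings: torch.Tensor):
        super().__init__()
        vocab_size, llm_dim = llm_embeddings.shape
        # Frozen copy of the LLM's input embedding table.
        self.register_buffer("text_embeddings", llm_embeddings)
        # Learned projection from vision-encoder features to vocabulary logits.
        self.to_vocab_logits = nn.Linear(vision_dim, vocab_size)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_visual_tokens, vision_dim)
        logits = self.to_vocab_logits(vision_feats)        # (B, T, vocab)
        weights = torch.softmax(logits, dim=-1)            # convex weights
        # Weighted mixture of text embeddings keeps outputs in-distribution
        # for the LLM: (B, T, vocab) @ (vocab, llm_dim) -> (B, T, llm_dim)
        return weights @ self.text_embeddings


# Toy usage with small random tensors standing in for real model weights.
if __name__ == "__main__":
    llm_emb = torch.randn(1000, 512)            # stand-in embedding table
    connector = AlignConnector(vision_dim=256, llm_embeddings=llm_emb)
    visual_tokens = torch.randn(2, 196, 256)    # stand-in vision features
    print(connector(visual_tokens).shape)       # torch.Size([2, 196, 512])
```

Because the output is a convex combination of real text embeddings, the LLM never sees out-of-distribution inputs from the vision side, which is the robustness argument made in the abstract.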

Type
Conference paper
Publication
In NeurIPS 2025

Paper (coming soon)

Tianyu Zhang
Ph.D. Student in Machine Learning

My research interests include algorithmic game theory, agent-based model simulation, AI for climate change, multi-agent reinforcement learning, self-supervised learning, and domain adaptation. I am still exploring these areas and learning as I go.