AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
Ahmed Masry, Juan A Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte Suresh, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Tejaswi Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque, Christopher Pal, Issam H Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar
February 2025
Abstract
Aligning visual and language representations in vision-language models hinges on the connector that maps vision-encoder features into the LLM space. Conventional MLP connectors often produce noisy, out-of-distribution features that misalign the two modalities. AlignVLM instead maps each visual token to a weighted mixture of existing LLM text embeddings, leveraging linguistic priors to keep visual features in-distribution. Coupled with document-specific modeling, this connector yields stronger vision-text alignment, improved robustness to noise, and state-of-the-art accuracy on document understanding benchmarks.
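To make the connector idea concrete, here is a minimal sketch of mapping visual tokens to a convex combination of LLM text embeddings, as described in the abstract. This is not the authors' released implementation; the dimension names (`d_vision`, `d_llm`), the single linear projection, and the frozen embedding table are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlignConnector(nn.Module):
    """Maps each visual token to a weighted mixture of LLM text embeddings."""

    def __init__(self, d_vision: int, llm_embedding: torch.Tensor):
        super().__init__()
        vocab_size, _ = llm_embedding.shape
        # Project visual features to logits over the LLM vocabulary.
        self.to_vocab_logits = nn.Linear(d_vision, vocab_size)
        # Frozen copy of the LLM's input embedding table (vocab_size x d_llm).
        self.register_buffer("llm_embedding", llm_embedding)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, d_vision)
        logits = self.to_vocab_logits(visual_tokens)   # (B, T, vocab_size)
        weights = F.softmax(logits, dim=-1)            # convex mixture weights
        # A weighted mixture of existing text embeddings keeps the output
        # inside the LLM's embedding distribution.
        return weights @ self.llm_embedding            # (B, T, d_llm)


# Hypothetical usage with illustrative shapes:
# emb = llm.get_input_embeddings().weight.detach()          # (vocab, d_llm)
# connector = AlignConnector(d_vision=1024, llm_embedding=emb)
# llm_inputs = connector(vision_encoder(images))
```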
Publication
In NeurIPS 2025
Paper (coming soon)
