AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding

Abstract

Aligning visual and language representations in vision-language models hinges on the connector that maps vision-encoder features into the LLM space. Conventional MLP connectors often emit noisy, out-of-distribution inputs that misalign the modalities. AlignVLM instead maps visual tokens to a weighted mixture of existing LLM text embeddings, leveraging linguistic priors to keep visual features in-distribution. Coupled with document-specific modeling, this connector yields stronger vision-text alignment, improved robustness to noise, and state-of-the-art accuracy on document understanding benchmarks.
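To make the connector idea concrete, here is a minimal PyTorch-style sketch inferred from the abstract, not the authors' implementation: names such as AlignConnector, vision_dim, and text_embeddings are assumptions. Each visual token is projected to logits over the LLM vocabulary, and the output is the softmax-weighted mixture of the (frozen) LLM text-embedding table, so visual features stay inside the span of existing text embeddings.

```python
# Sketch of an ALIGN-style connector (assumed names/shapes, not the paper's code):
# map each visual token to a probability distribution over the LLM vocabulary,
# then return the corresponding convex combination of the text-embedding matrix.
import torch
import torch.nn as nn


class AlignConnector(nn.Module):
    def __init__(self, vision_dim: int, llm_embeddings: torch.Tensor):
        super().__init__()
        vocab_size, llm_dim = llm_embeddings.shape
        # Frozen copy of the LLM's input embedding table.
        self.register_buffer("text_embeddings", llm_embeddings)
        # Learned projection from vision-encoder features to vocabulary logits.
        self.to_vocab_logits = nn.Linear(vision_dim, vocab_size)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_visual_tokens, vision_dim)
        logits = self.to_vocab_logits(vision_feats)        # (B, T, vocab)
        weights = torch.softmax(logits, dim=-1)            # convex weights
        # Weighted mixture of text embeddings keeps outputs in-distribution
        # for the LLM: (B, T, vocab) @ (vocab, llm_dim) -> (B, T, llm_dim)
        return weights @ self.text_embeddings


# Toy usage with small random tensors standing in for real model weights.
if __name__ == "__main__":
    llm_emb = torch.randn(1000, 512)            # stand-in embedding table
    connector = AlignConnector(vision_dim=256, llm_embeddings=llm_emb)
    visual_tokens = torch.randn(2, 196, 256)    # stand-in vision features
    print(connector(visual_tokens).shape)       # torch.Size([2, 196, 512])
```

Because the output is a convex combination of real text embeddings, the LLM never sees out-of-distribution inputs from the vision side, which is the robustness argument made in the abstract.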

Type
Conference paper
Publication
In NeurIPS 2025

Paper (coming soon)

Tianyu Zhang
Ph.D. Student in Machine Learning

My research interests include algorithmic game theory, agent-based model simulation, AI for climate change, multi-agent reinforcement learning, self-supervised learning, and domain adaptation. I am still exploring these areas and learning as I go.