Improving GUI Grounding with Explicit Position-to-Coordinate Mapping
Suyuchen Wang, Tianyu Zhang, Ahmed Masry, Christopher Pal, Spandana Gella, Bang Liu, Perouz Taslakian
October 2025
Abstract
GUI grounding requires mapping natural-language instructions to pixel coordinates, but
existing VLMs implicitly infer patch-to-pixel mappings and struggle on high-resolution displays. Our method
injects RULER tokens as explicit coordinate markers so the model references positions rather than generating
coordinates from scratch, and introduces Interleaved MRoPE to balance width and height positional encoding.
Across ScreenSpot, ScreenSpot-V2, and ScreenSpot-Pro, this explicit position-to-coordinate mapping yields
consistent gains, especially on high-resolution interfaces, enabling more reliable GUI automation.
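The two ideas in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the function names, the default patch size of 14, and the two-axis (height/width only) layout are assumptions made for illustration. The first function shows the patch-to-pixel arithmetic a model would otherwise have to infer implicitly; the other two contrast a contiguous split of rotary frequency indices with an alternating (interleaved) assignment.

```python
def patch_center(row, col, patch_size=14):
    """Pixel (x, y) at the center of the image patch at (row, col).

    patch_size=14 is an illustrative assumption, not a value from the paper.
    """
    x = col * patch_size + patch_size // 2
    y = row * patch_size + patch_size // 2
    return x, y


def chunked_axes(num_freqs):
    """Contiguous split: the first half of the frequency indices encodes
    height, the remainder encodes width (hypothetical two-axis layout)."""
    half = num_freqs // 2
    return ["h"] * half + ["w"] * (num_freqs - half)


def interleaved_axes(num_freqs):
    """Alternating assignment, so height and width each receive a mix of
    low and high frequency indices (hypothetical two-axis layout)."""
    return ["h" if i % 2 == 0 else "w" for i in range(num_freqs)]


if __name__ == "__main__":
    print(patch_center(2, 3))
    print(chunked_axes(8))
    print(interleaved_axes(8))
```

In the chunked layout one axis receives only the lower frequency indices and the other only the higher ones, whereas the alternating layout spreads both ranges across both axes, which is one plausible reading of "balancing" width and height encoding.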
Publication
arXiv preprint arXiv:2510.03230
Ph.D. Student in Machine Learning
My research interests include Algorithmic Game Theory, Agent-based Model Simulators, AI
for Climate Change, Multi-agent Reinforcement Learning, Self-supervised Learning, and Domain Adaptation. I am
still exploring and learning slowly.