Improving GUI Grounding with Explicit Position-to-Coordinate Mapping

Suyuchen Wang, Tianyu Zhang, Ahmed Masry, Christopher Pal, Spandana Gella, Bang Liu, Perouz Taslakian

October, 2025

Abstract

GUI grounding requires mapping natural-language instructions to pixel coordinates, but existing VLMs implicitly infer patch-to-pixel mappings and struggle on high-resolution displays. Our method injects RULER tokens as explicit coordinate markers so the model references positions rather than generating coordinates from scratch, and introduces Interleaved MRoPE to balance width and height positional encoding. Across ScreenSpot, ScreenSpot-V2, and ScreenSpot-Pro, this explicit position-to-coordinate mapping yields consistent gains, especially on high-resolution interfaces, enabling more reliable GUI automation.
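To make the patch-to-pixel mapping mentioned in the abstract concrete, here is a minimal sketch of the arithmetic a model must otherwise learn implicitly. It is an illustration only, not the paper's RULER implementation: the function names and the 14-pixel patch size are assumptions chosen for the example.

```python
# Illustrative sketch only: names and the default patch size are assumptions,
# not taken from the paper.

def patch_center_to_pixel(row, col, patch_size=14):
    """Return the (x, y) pixel at the center of the patch at (row, col)."""
    x = col * patch_size + patch_size // 2
    y = row * patch_size + patch_size // 2
    return x, y


def pixel_to_patch(x, y, patch_size=14):
    """Return the (row, col) of the patch that contains pixel (x, y)."""
    return y // patch_size, x // patch_size


def grid_shape(width, height, patch_size=14):
    """Return (rows, cols) of whole patches that fit in a width x height image."""
    return height // patch_size, width // patch_size


if __name__ == "__main__":
    print(patch_center_to_pixel(2, 3))  # center of the patch in row 2, column 3
    print(pixel_to_patch(49, 35))       # patch containing pixel (49, 35)
    print(grid_shape(3840, 2160))       # patch grid for a 4K screenshot
```

Under these assumptions a 4K screenshot already spans hundreds of patches per axis, which suggests why leaving this mapping implicit becomes harder at high resolution.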

Type

Preprint

Publication

arXiv preprint arXiv:2510.03230

Paper

Tianyu Zhang

Ph.D. Student in Machine Learning

My research interests include algorithmic game theory, agent-based model simulators, AI for climate change, multi-agent reinforcement learning, self-supervised learning, and domain adaptation. I am still exploring and learning slowly.