Improving GUI Grounding with Explicit Position-to-Coordinate Mapping

Abstract

GUI grounding requires mapping natural-language instructions to pixel coordinates, but existing VLMs implicitly infer patch-to-pixel mappings and struggle on high-resolution displays. Our method injects RULER tokens as explicit coordinate markers so the model references positions rather than generating coordinates from scratch, and introduces Interleaved MRoPE to balance width and height positional encoding. Across ScreenSpot, ScreenSpot-V2, and ScreenSpot-Pro, this explicit position-to-coordinate mapping yields consistent gains, especially on high-resolution interfaces, enabling more reliable GUI automation.
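One way to picture the "balance" that Interleaved MRoPE aims for is to look at which rotary frequency indices each spatial axis receives. The sketch below is only an illustration under stated assumptions, not the paper's implementation: it assumes two axes (height and width), assumes that a lower index means a higher rotary frequency as in standard RoPE, and uses hypothetical function names (`contiguous_axes`, `interleaved_axes`, `mean_index`). It compares a contiguous split of the frequency indices with a round-robin (interleaved) split.

```python
def contiguous_axes(num_freqs, axes=("h", "w")):
    """Assign frequency indices to axes in contiguous blocks (assumed baseline)."""
    block = num_freqs // len(axes)
    # Any leftover indices go to the last axis.
    return [axes[min(i // block, len(axes) - 1)] for i in range(num_freqs)]


def interleaved_axes(num_freqs, axes=("h", "w")):
    """Assign frequency indices to axes round-robin (assumed interleaving)."""
    return [axes[i % len(axes)] for i in range(num_freqs)]


def mean_index(assignment, axis):
    """Average frequency index owned by an axis; lower means higher frequencies."""
    owned = [i for i, a in enumerate(assignment) if a == axis]
    return sum(owned) / len(owned)


if __name__ == "__main__":
    for name, assign in (("contiguous", contiguous_axes), ("interleaved", interleaved_axes)):
        layout = assign(8)
        print(name, layout, mean_index(layout, "h"), mean_index(layout, "w"))
```

With 8 frequency indices, the contiguous split gives height the four lowest indices and width the four highest, so one axis sees only high frequencies and the other only low ones. The round-robin split gives both axes a mix of high and low frequencies, which is the kind of balance the abstract describes. The actual allocation used in the paper may differ, for example by including a temporal axis.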

Type
Preprint
Publication
arXiv preprint arXiv:2510.03230

Tianyu Zhang
Ph.D. Student in Machine Learning

My research interests include Algorithmic Game Theory, Agent-based Model Simulators, AI for Climate Change, Multi-agent Reinforcement Learning, Self-supervised Learning, and Domain Adaptation. I am still exploring and learning slowly.