Comparison of attended regions for X-LAN (10a) and MSSA (10b) during caption generation. The highlighted regions demonstrate the areas of focus for each model. MSSA effectively incorporates geometric features to produce captions that are spatially precise and contextually aligned.