MQADet: a plug-and-play paradigm for enhancing open-vocabulary object detection via multimodal question answering

  • Caixiong Li (1, 2, 5)
  • Xiongwei Zhao (3)
  • Jinhang Zhang (4)
  • Xing Zhang (1, 2, 5)
  • Qihao Sun (4)
  • Zhou Wu (6)

Scientific Reports (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Engineering
  • Mathematics and computing

Abstract

Open-vocabulary detection (OVD) aims to detect and classify objects from an unrestricted set of categories, including those unseen during training. Existing open-vocabulary detectors often suffer from visual-textual misalignment and long-tailed category imbalance, leading to poor performance when handling objects described by complex, long-tailed textual queries. To overcome these challenges, we propose Multimodal Question Answering Detection (MQADet), a universal plug-and-play paradigm that enhances existing open-vocabulary detectors by leveraging the cross-modal reasoning capabilities of multimodal large language models (MLLMs). MQADet can be seamlessly integrated with pre-trained object detectors without requiring additional training or fine-tuning. Specifically, we design a novel three-stage Multimodal Question Answering (MQA) pipeline that guides MLLMs to accurately localize objects described by complex textual queries while refining the focus of existing detectors toward semantically relevant regions. To evaluate our approach, we construct a comprehensive benchmark across four challenging open-vocabulary datasets and integrate three state-of-the-art detectors as baselines. Extensive experiments demonstrate that MQADet consistently improves detection accuracy, particularly for unseen and linguistically complex categories, across diverse and challenging scenarios. To support further research, we will publicly release our code.
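
As a rough illustration of the plug-and-play idea described above (the authors' code is not yet released), the following Python sketch wraps a frozen, pre-trained open-vocabulary detector with an MLLM that first simplifies a complex textual query into detector-friendly phrases and then verifies the detector's candidate boxes against the original query. The `OpenVocabDetector` and `MultimodalLLM` interfaces, their method names, the prompts, and the score threshold are hypothetical placeholders for illustration, not the MQADet implementation.

```python
# Minimal sketch (not the authors' released code): a plug-and-play wrapper that
# pairs a frozen open-vocabulary detector with an MLLM in a question-answering
# loop, loosely mirroring the pipeline described in the abstract.
# All interfaces below (OpenVocabDetector, MultimodalLLM) are hypothetical.

from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Detection:
    box: tuple    # (x1, y1, x2, y2) in pixel coordinates
    label: str
    score: float


class OpenVocabDetector(Protocol):
    """Any pre-trained open-vocabulary detector used as-is, without fine-tuning."""
    def detect(self, image, text_prompts: List[str]) -> List[Detection]: ...


class MultimodalLLM(Protocol):
    """Any MLLM that answers free-form questions about an image."""
    def ask(self, image, question: str) -> str: ...


def mqa_detect(image, query: str, detector: OpenVocabDetector,
               mllm: MultimodalLLM, keep_threshold: float = 0.25) -> List[Detection]:
    """Question-answering wrapper around a frozen open-vocabulary detector."""
    # Stage 1: ask the MLLM to rewrite a complex, long-tailed query into
    # short category phrases the detector can ground.
    answer = mllm.ask(image, f"List the object categories referred to by: '{query}'. "
                             "Answer with short noun phrases separated by commas.")
    phrases = [p.strip() for p in answer.split(",") if p.strip()]

    # Stage 2: run the pre-trained detector, unchanged, on the simplified prompts.
    candidates = [d for d in detector.detect(image, phrases) if d.score >= keep_threshold]

    # Stage 3: ask the MLLM to verify each candidate against the original query,
    # keeping only regions it judges semantically relevant.
    verified = []
    for det in candidates:
        verdict = mllm.ask(image, f"Does the region {det.box} contain '{det.label}' "
                                  f"as described by '{query}'? Answer yes or no.")
        if verdict.strip().lower().startswith("yes"):
            verified.append(det)
    return verified
```

Because both the detector and the MLLM are used as-is in this sketch, the wrapper requires no additional training or fine-tuning, which is the integration property the abstract emphasizes.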


Data availability

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.


Funding

This study was funded by the Natural Science Foundation of Qinghai Province under Grant 2023-QLGKLYCZX-017.

Author information

Author notes
  1. These authors contributed equally: Caixiong Li and Xiongwei Zhao.

Authors and Affiliations

  1. School of Computer and Information Science, Qinghai Institute of Technology, Xining, 810016, China

    Caixiong Li & Xing Zhang

  2. Qinghai Provincial Key Laboratory of Big Data in Finance and Artificial Intelligence Application Technology, Xining, 810016, China

    Caixiong Li & Xing Zhang

  3. School of Information Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, China

    Xiongwei Zhao

  4. State Key Laboratory of Robotics and System, Harbin Institute of Technology, Harbin, 150000, China

    Jinhang Zhang & Qihao Sun

  5. School of Computer Science and Technology, Qinghai University, Xining, 810016, China

    Caixiong Li & Xing Zhang

  6. Eryuan Digital Technology Co., Ltd., Zhengzhou, 450000, China

    Zhou Wu


Contributions

Caixiong Li: Methodology, Software, Writing – original draft preparation, Writing – review and editing. Xiongwei Zhao: Methodology, Writing – original draft preparation, Writing – review and editing. Jinhang Zhang: Data curation, Investigation, Formal analysis. Xing Zhang: Resources, Funding acquisition. Qihao Sun: Visualization, Project administration. Zhou Wu: Visualization, Validation.

Corresponding author

Correspondence to Xing Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Li, C., Zhao, X., Zhang, J. et al. MQADet: a plug-and-play paradigm for enhancing open-vocabulary object detection via multimodal question answering. Sci Rep (2026). https://doi.org/10.1038/s41598-026-36936-x


  • Received: 26 September 2025

  • Accepted: 18 January 2026

  • Published: 27 January 2026

  • DOI: https://doi.org/10.1038/s41598-026-36936-x


Keywords

  • Open-vocabulary detection
  • Multimodal question answering
  • Multimodal large language models