Abstract
Most current machine learning methods exploit only the three-channel color features of optical images for computer vision tasks. However, an optical image explicitly presents only RGB color and two-dimensional planar shape, leaving spatial features along the third dimension underexploited and limiting the achievable recognition performance. To address this issue, we propose a detection scheme that enhances a model's detection capability by combining pseudo-depth and RGB features into four independent channels, without adding any additional hardware sensors. A monocular depth estimation model is first used as a virtual depth sensor to extract pseudo-depth features from the input optical images. The fused Depth-RGB features are then fed into the neural network model for object detection training and inference, strengthening its ability to extract spatial features. Experiments show that the proposed method improves the detection metric mAP\(_{50}\) by 3.8 and 8.0 percentage points on the public M\(^3\)FD and COCO datasets, respectively. Notably, the scheme can be embedded into virtually any machine learning model to improve detection performance.
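To make the pipeline concrete, the sketch below implements the two steps described above in PyTorch: a monocular depth estimator acts as the virtual depth sensor, and the detector's stem convolution is widened to accept the four-channel Depth-RGB input. This is a minimal sketch, not the authors' implementation: the MiDaS_small estimator, the min-max depth normalization, and the widen_first_conv helper (which seeds the new depth channel with the mean of the pretrained RGB filters) are all illustrative assumptions.

```python
import cv2
import numpy as np
import torch

# Load a MiDaS model from torch.hub as the "virtual depth sensor"
# (any monocular depth estimator could stand in here).
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.small_transform

def rgbd_from_rgb(path: str) -> np.ndarray:
    """Return an H x W x 4 array: RGB plus one pseudo-depth channel."""
    img = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        pred = midas(transform(img))                 # (1, h, w) relative depth
        depth = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=img.shape[:2],
            mode="bicubic", align_corners=False,
        ).squeeze().cpu().numpy()
    # Normalize pseudo-depth to [0, 255] so it matches the RGB value range.
    depth = (255 * (depth - depth.min()) / (np.ptp(depth) + 1e-8)).astype(np.uint8)
    return np.dstack([img, depth])                   # fused Depth-RGB input

def widen_first_conv(conv: torch.nn.Conv2d) -> torch.nn.Conv2d:
    """Swap a 3-channel stem conv for a 4-channel one. RGB weights are
    copied; the new depth channel starts as the mean of the RGB filters
    (an assumed initialization, not necessarily the authors' choice)."""
    new = torch.nn.Conv2d(4, conv.out_channels, conv.kernel_size,
                          stride=conv.stride, padding=conv.padding,
                          bias=conv.bias is not None)
    with torch.no_grad():
        new.weight[:, :3] = conv.weight
        new.weight[:, 3:] = conv.weight.mean(dim=1, keepdim=True)
        if conv.bias is not None:
            new.bias.copy_(conv.bias)
    return new
```

In use, one would apply widen_first_conv to the first convolution of the chosen detector backbone (e.g., a YOLO-style stem) and then train and run inference on the four-channel inputs as usual.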
Data availability
The datasets generated and/or analysed during the current study are available in the GitHub repository at https://github.com/htyb275/Pseudo-Depth-Detection.
Acknowledgements
This work is supported by the National Key R&D Program of China (2022YFA1604803), the National Major in High Resolution Earth Observation (68-Y50G07-9001-22/23), and the Natural Science Basic Research Program of Shaanxi (2025JC-YBMS-020).
Author information
Authors and Affiliations
Contributions
Conceptualization: Q.L., W.F., and B.L.; Methodology: S.Q.L., W.F., and Q.L.; Formal analysis and data curation: S.Q.L., Q.L., and B.L.; Writing—original draft preparation: S.Q.L., X.T., and Q.L.; Writing—review and editing: S.Q.L., W.F., B.L., X.T., and Q.L. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, SQ., Feng, W., Liu, B. et al. Pseudo-depth-based deep neural network model for object detection. Sci Rep (2026). https://doi.org/10.1038/s41598-026-45310-w