Abstract
Most current machine learning methods exploit only the three-channel color features of optical images for computer vision tasks. However, an optical image explicitly presents only RGB color and two-dimensional planar shape, leaving spatial features along the third dimension underexploited and limiting the achievable recognition performance. To address this issue, we propose a detection scheme that enhances a model's detection capability by combining pseudo-depth and RGB features into four independent channels, without adding any additional hardware sensors. A monocular depth estimation model is first used as a virtual depth sensor to extract pseudo-depth features from the input optical images. The fused Depth-RGB features are then fed into the neural network model for object detection training and inference, strengthening its ability to extract spatial features. Experiments show that the proposed method improves the detection metric mAP\(_{50}\) by 3.8 and 8.0 percentage points on the public M\(^3\)FD and COCO datasets, respectively. Notably, the scheme can be embedded into virtually any machine learning model to improve detection performance.
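To make the pipeline concrete, the sketch below implements the two steps described above in PyTorch: a monocular depth estimator acts as the virtual depth sensor, and the detector's stem convolution is widened to accept the four-channel Depth-RGB input. This is a minimal sketch, not the authors' implementation: the MiDaS_small estimator, the min-max depth normalization, and the widen_first_conv helper (which seeds the new depth channel with the mean of the pretrained RGB filters) are all illustrative assumptions.

```python
import cv2
import numpy as np
import torch

# Load a MiDaS model from torch.hub as the "virtual depth sensor"
# (any monocular depth estimator could stand in here).
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.small_transform

def rgbd_from_rgb(path: str) -> np.ndarray:
    """Return an H x W x 4 array: RGB plus one pseudo-depth channel."""
    img = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        pred = midas(transform(img))                 # (1, h, w) relative depth
        depth = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=img.shape[:2],
            mode="bicubic", align_corners=False,
        ).squeeze().cpu().numpy()
    # Normalize pseudo-depth to [0, 255] so it matches the RGB value range.
    depth = (255 * (depth - depth.min()) / (np.ptp(depth) + 1e-8)).astype(np.uint8)
    return np.dstack([img, depth])                   # fused Depth-RGB input

def widen_first_conv(conv: torch.nn.Conv2d) -> torch.nn.Conv2d:
    """Swap a 3-channel stem conv for a 4-channel one. RGB weights are
    copied; the new depth channel starts as the mean of the RGB filters
    (an assumed initialization, not necessarily the authors' choice)."""
    new = torch.nn.Conv2d(4, conv.out_channels, conv.kernel_size,
                          stride=conv.stride, padding=conv.padding,
                          bias=conv.bias is not None)
    with torch.no_grad():
        new.weight[:, :3] = conv.weight
        new.weight[:, 3:] = conv.weight.mean(dim=1, keepdim=True)
        if conv.bias is not None:
            new.bias.copy_(conv.bias)
    return new
```

In use, one would apply widen_first_conv to the first convolution of the chosen detector backbone (e.g., a YOLO-style stem) and then train and run inference on the four-channel inputs as usual.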
Data availability
The datasets generated and/or analysed during the current study are available in the GitHub repository at https://github.com/htyb275/Pseudo-Depth-Detection.
Acknowledgements
This work is supported by the National Key R&D Program of China (2022YFA1604803), the National Major in High Resolution Earth Observation (68-Y50G07-9001-22/23), and the Natural Science Basic Research Program of Shaanxi (2025JC-YBMS-020).
Author information
Authors and Affiliations
Contributions
Conceptualization: Q.L., W.F., and B.L.; Methodology: S.Q.L., W.F., and Q.L.; Formal analysis and data curation: S.Q.L., Q.L., and B.L.; Writing—original draft preparation: S.Q.L., X.T., and Q.L.; Writing—review and editing: S.Q.L., W.F., B.L., X.T., and Q.L. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, SQ., Feng, W., Liu, B. et al. Pseudo-depth-based deep neural network model for object detection. Sci Rep (2026). https://doi.org/10.1038/s41598-026-45310-w