Scientific Reports
SVRS: self-supervised 3D voxel reconstruction network from stereo vision
  • Article
  • Open access
  • Published: 31 March 2026

  • Zhengyang Zou1,
  • Yunxia Wu1,
  • Hailan Zhang1,
  • Qian Xu1 &
  • Ruifeng Wang1

Scientific Reports (2026)


We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Engineering
  • Mathematics and computing

Abstract

Three-dimensional voxel reconstruction from stereo vision is essential for environmental perception in autonomous robots. Existing pseudo-LiDAR methods recover voxel grids by estimating depth maps and projecting them pixel by pixel, which incurs high computational cost and over-smooths object boundaries. To overcome these issues, we model the inverse relationship between 2D pixels and 3D voxel grids and propose a Self-supervised 3D Voxel Reconstruction network from Stereo vision (SVRS). Specifically, we represent a given 3D scene as multi-scale uniform cubic voxel grids and introduce a novel Pixel-Voxel Projecting Module (PVPM). PVPM projects the 3D position of each voxel grid into pixel index coordinates, establishing implicit stereo–voxel correspondences and converting dense pixel features into sparse voxel representations. Furthermore, we explore an Octree-based Encoder-Decoder Architecture (OEDA) that reconstructs multi-scale voxel grids via hierarchical spatial partitioning, preventing dense empty grids from overwhelming sparse occupied grids in a coarse-to-fine manner. Finally, SVRS leverages off-the-shelf stereo matching methods within a self-supervised training framework. Experiments on the DrivingStereo dataset show that SVRS achieves competitive reconstruction accuracy while improving inference speed by up to 14× over advanced pseudo-LiDAR approaches and 3× over real-time approaches.
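The voxel-to-pixel direction described in the abstract can be illustrated with a generic pinhole-camera sketch. This is an illustrative reconstruction of the idea, not the authors' PVPM implementation: the intrinsics, scene extent, and helper names below are hypothetical. Each voxel center is projected to an integer pixel index, and only voxels whose projection lands inside the image gather a feature from the dense pixel feature map, yielding a sparse voxel representation:

```python
import numpy as np

def voxel_centers(extent, voxel_size):
    """Centers of a uniform cubic voxel grid in camera coordinates (Z forward)."""
    xs = np.arange(extent[0][0], extent[0][1], voxel_size) + voxel_size / 2
    ys = np.arange(extent[1][0], extent[1][1], voxel_size) + voxel_size / 2
    zs = np.arange(extent[2][0], extent[2][1], voxel_size) + voxel_size / 2
    X, Y, Z = np.meshgrid(xs, ys, zs, indexing="ij")
    return np.stack([X, Y, Z], axis=-1).reshape(-1, 3)

def project_to_pixels(centers, K, width, height):
    """Pinhole projection of voxel centers to integer pixel indices.

    Returns (u, v, valid): `valid` marks voxels whose projection lands
    inside the image, i.e. voxels that can gather a pixel feature.
    """
    fx, fy, cx, cy = K
    u = np.round(fx * centers[:, 0] / centers[:, 2] + cx).astype(int)
    v = np.round(fy * centers[:, 1] / centers[:, 2] + cy).astype(int)
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return u, v, valid

# Toy example: 0.5 m voxels in a small frustum, illustrative intrinsics.
centers = voxel_centers(extent=[(-2, 2), (-1, 1), (4, 8)], voxel_size=0.5)
u, v, valid = project_to_pixels(centers, K=(700.0, 700.0, 640.0, 360.0),
                                width=1280, height=720)
features = np.random.rand(720, 1280, 16)    # stand-in dense pixel feature map
voxel_feats = features[v[valid], u[valid]]  # sparse per-voxel features
```

In spirit, this voxel-to-pixel lookup inverts the pixel-to-3D lifting used by pseudo-LiDAR pipelines: the number of projections scales with occupied voxels rather than with image resolution.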


Data availability

The data supporting the findings of this study are available within the paper. The pre-processed raw data can be shared with interested parties upon reasonable request; please contact the corresponding author for more information.

Code availability

Our code is available at https://github.com/zzy729425207/SVRS. Please contact the corresponding author for more information.


Funding

This publication has emanated from research supported by the National Natural Science Foundation of China (Grant No. 52374165).

Author information

Authors and Affiliations

  1. School of Artificial Intelligence, China University of Mining and Technology-Beijing, Beijing, 100083, China

    Zhengyang Zou, Yunxia Wu, Hailan Zhang, Qian Xu & Ruifeng Wang

Authors
  1. Zhengyang Zou
  2. Yunxia Wu
  3. Hailan Zhang
  4. Qian Xu
  5. Ruifeng Wang

Contributions

Z. Zou and Y. Wu wrote the main manuscript text and conducted the experiments; H. Zhang and Q. Xu checked the manuscript text; R. Wang collected the experimental data. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Zhengyang Zou or Yunxia Wu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Zou, Z., Wu, Y., Zhang, H. et al. SVRS: self-supervised 3D voxel reconstruction network from stereo vision. Sci Rep (2026). https://doi.org/10.1038/s41598-026-45924-0

Download citation

  • Received: 31 October 2025

  • Accepted: 23 March 2026

  • Published: 31 March 2026

  • DOI: https://doi.org/10.1038/s41598-026-45924-0


Keywords

  • Stereo vision
  • Environmental perception
  • Voxel reconstruction
  • Octree architecture

Associated content

Collection

AI-based neurotechnologies



Scientific Reports (Sci Rep)

ISSN 2045-2322 (online)
