Abstract
Three-dimensional voxel reconstruction based on stereo vision is essential for environmental perception in autonomous robots. Existing pseudo-LiDAR methods recover voxel grids by estimating depth maps and projecting them pixel by pixel, leading to high computational cost and boundary over-smoothing. To overcome these issues, we model the inverse relationship between 2D pixels and 3D voxel grids and propose a Self-supervised 3D Voxel Reconstruction network from Stereo vision (SVRS). Specifically, we represent a given 3D scene as multi-scale uniform cubic voxel grids and introduce a novel Pixel-Voxel Projecting Module (PVPM). PVPM projects the 3D position of each voxel grid into index coordinates, which establishes implicit stereo–voxel correspondences and converts dense pixel features into sparse voxel representations. Furthermore, we explore an Octree-based Encoder-Decoder Architecture (OEDA) that reconstructs multi-scale voxel grids via hierarchical spatial partitioning, avoiding the influence of dense empty grids on sparse occupied grids in a coarse-to-fine manner. Finally, SVRS leverages off-the-shelf stereo matching methods within a self-supervised training framework. Experiments on the DrivingStereo dataset show that SVRS achieves competitive reconstruction accuracy while improving inference speed by up to 14\(\times\) over advanced pseudo-LiDAR approaches and 3\(\times\) over real-time approaches.
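The pixel-voxel projection described above can be sketched with a standard pinhole camera model: each voxel center \((X, Y, Z)\) in the camera frame maps to pixel index coordinates \(u = f_x X / Z + c_x\), \(v = f_y Y / Z + c_y\), so image features can be gathered directly at voxel locations instead of back-projecting every pixel. The snippet below is a minimal illustration of this idea only, not the paper's implementation; the function name and intrinsic values are hypothetical.

```python
import numpy as np

def project_voxels(centers, fx, fy, cx, cy, width, height):
    """Project Nx3 voxel centers (camera frame, meters) to pixel coordinates.

    Returns an Nx2 array of (u, v) pixel coordinates and a boolean mask of
    voxels that land inside the image (those outside can be treated as empty).
    """
    X, Y, Z = centers[:, 0], centers[:, 1], centers[:, 2]
    in_front = Z > 1e-6                       # only voxels in front of the camera
    u = np.zeros_like(X)
    v = np.zeros_like(Y)
    # Pinhole projection for valid depths
    u[in_front] = fx * X[in_front] / Z[in_front] + cx
    v[in_front] = fy * Y[in_front] / Z[in_front] + cy
    inside = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return np.stack([u, v], axis=1), inside

# Example: a voxel on the optical axis at 10 m projects to the principal point.
centers = np.array([[0.0, 0.0, 10.0],   # visible voxel
                    [0.0, 0.0, -1.0]])  # behind the camera -> masked out
uv, mask = project_voxels(centers, fx=100.0, fy=100.0, cx=64.0, cy=48.0,
                          width=128, height=96)
```

Because the number of voxels is fixed by the chosen grid resolution rather than the image size, this direction of projection keeps the cost independent of image resolution, which is consistent with the speedups reported in the abstract.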
Data availability
The data supporting the findings of this study are available within the paper. The associated pre-processed raw data can be shared with interested parties upon reasonable request; please contact the corresponding author for more information.
Code availability
Our code is available at https://github.com/zzy729425207/SVRS. Please contact the corresponding author for more information.
Funding
This publication has emanated from research supported by The National Natural Science Foundation of China (Grant No. 52374165).
Author information
Authors and Affiliations
Contributions
Zou.Z and Wu.Y wrote the main manuscript text and conducted the experiments; Zhang.H and Xu.Q checked the manuscript text; Wang.R collected the experimental data. All authors reviewed the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zou, Z., Wu, Y., Zhang, H. et al. SVRS: self-supervised 3D voxel reconstruction network from stereo vision. Sci Rep (2026). https://doi.org/10.1038/s41598-026-45924-0
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-026-45924-0


