Abstract
As autonomous driving technology enters the phase of large-scale commercialization, safety and comfort have become key indicators of its performance. Some studies have begun to improve the safety and comfort of urban driving by attending to irregular surface regions. However, datasets and studies for unstructured scenes, which are characterized by numerous irregular obstacles and road surface undulations, remain exceedingly rare. To expand the scope of autonomous driving applications, we have built a perception dataset that focuses on irregular obstacles and road surface vibrations in unstructured scenes. The dataset reflects the fact that detecting the various irregular obstacles in unstructured scenes plays a key role in trajectory planning, while recognizing undulating road surface conditions is crucial for speed planning. We therefore investigate unstructured scene understanding through 3D semantic occupancy prediction, which detects irregular obstacles, and road surface elevation reconstruction, which characterizes bumpy and uneven road surfaces. The dataset provides detailed annotations for both tasks, offering a comprehensive representation of unstructured scenes. In addition, trajectory and speed planning information is provided to explore the relationship between perception and planning in unstructured scenes, and natural language scene descriptions are provided to explore the interpretability of autonomous driving decision-making. Experiments with various state-of-the-art methods demonstrate the effectiveness of our dataset and the challenges posed by these tasks. To the best of our knowledge, this is the world’s first comprehensive benchmark for perception in unstructured scenes, and it serves as a valuable resource for extending autonomous driving technology from urban to unstructured scenes.
Background & Summary
Autonomous driving data, the foundation for algorithmic innovation, is essential for the development of autonomous driving technology1. Public datasets, which serve as benchmarks for the evaluation and optimization of algorithms, play a dominant role. However, beyond structured urban scenes, little attention has been paid to autonomous driving in unstructured environments. According to data from the U.S. Federal Highway Administration (FHWA), 32.11% of roads in urban and rural areas of the U.S. were unpaved as of 20202. It is worth noting that unpaved roads account for up to 20% of fatal traffic accidents in certain states3. Therefore, as autonomous driving technology becomes more widespread, it is imperative to extend its applications from structured urban scenes to complex unstructured scenes.
Effective perception and understanding of various scenes is the key distinction between autonomous driving in structured and unstructured scenes. Environmental perception, a core component of autonomous driving systems, has been studied extensively; it is one of the most challenging tasks in autonomous driving and is widely considered a critical factor in safe and efficient autonomous navigation. Numerous public datasets for autonomous driving in urban scenes have been developed to facilitate research and development of perception methods. The KITTI4 dataset is one of the earliest and most foundational datasets for autonomous driving; it covers urban and suburban roads in Germany and supports various tasks, including 3D object detection, depth estimation, and semantic segmentation. The nuScenes5 dataset focuses on the complex urban environments of Singapore and Boston and provides multi-modal sensor data widely used for 3D object detection and trajectory prediction. The Waymo Open Dataset6 is a large-scale multi-modal perception dataset collected in multiple cities and suburbs in the United States. ApolloScape7 covers urban and suburban areas in China with dense annotations for scene parsing, car instance segmentation, and lane segmentation.
In contrast, research on autonomous driving in non-urban scenes still has significant potential to be explored. The DeepScene8 and RUGD9 datasets focus on dense semantic segmentation in outdoor scenes. However, they lack additional sensor modalities such as LiDAR and provide relatively few annotated frames, which limits their applicability to large-scale or multi-modal learning tasks. The A2D210 dataset covers both urban and rural roads and provides multi-sensor fusion data for tasks such as 3D object detection and semantic segmentation, but it exhibits limited variability in weather conditions. AutoMine11 is the first autonomous driving dataset for perception and localization in mining scenes and is collected in unstructured environments. Furthermore, the Rellis-3D12 dataset emphasizes farmlands and grasslands and provides rich 3D perception data, which is of great significance for research on off-road autonomous driving. WildOcc13 extends this line of research by providing multi-modal 3D occupancy annotations in diverse off-road scenes, enabling detailed semantic understanding of complex terrains. Although the aforementioned datasets support some perception tasks in unstructured scenes, they primarily focus on traditional tasks such as 3D object detection and lack a comprehensive understanding and representation of unstructured scenes. Table 1 provides an overview of existing off-road datasets along with their data modalities, annotation characteristics, supported tasks, and environmental conditions.
Road surface perception, which determines the safety and comfort of vehicles, has been explored recently. In 2024, the RSRD dataset14 was developed specifically for road surface reconstruction. It was collected in both urban and rural areas under various road surface conditions and focuses on road surface reconstruction with an emphasis on road obstacles such as cracks and speed bumps. However, most of these data were collected on structured roads, resulting in limited scene coverage. The dataset also lacks unified annotations for general object detection, especially for irregularly shaped obstacles such as muddiness and puddles on roads15,16. Although there exist some autonomous driving perception datasets for non-urban scenes, a dedicated dataset that provides a comprehensive understanding and representation of unstructured scenes has yet to be established.
In this paper, we have rethought the characteristics of unstructured scenes, which are defined by numerous irregular obstacles and road surface undulations. In traditional urban scenes, where roads are generally flat, most studies have overlooked the influence of road surface undulations on speed planning. Previous datasets have primarily focused on driving risks (such as vehicles, pedestrians, and bicyclists on the road) and vehicle passage areas (such as lane detection and road segmentation). Little effort has been made to explore road surface elevation reconstruction, which models the elevations and depressions of the road surface17. In reality, road surface perception plays a critical role in autonomous driving perception tasks to avoid risks presented by impassable areas and discomfort caused by road surface bumps. As the only physical interface between the vehicle and the ground, the traversability and irregularities of the road essentially determine the boundaries of vehicle dynamics in terms of safety and comfort18. Additionally, one key difference between unstructured and structured environments is the presence of numerous irregular obstacles on the road, such as fallen rocks, soil piles, muddiness, and irregular road boundaries. These factors are all critical determinants of vehicle trajectory planning.
In summary, the literature lacks a dedicated perception dataset for unstructured scenes that provides a comprehensive representation and understanding of road surface conditions and irregular obstacles, thereby supporting the safety and comfort of autonomous vehicles in these challenging environments. In this paper, we introduce a perception dataset for unstructured scenes, characterized by 3D semantic occupancy prediction and road surface elevation reconstruction tasks. 3D semantic occupancy prediction is a general object detection method that effectively detects various irregular obstacles and provides support for trajectory planning. Road surface elevation reconstruction models undulating road conditions, providing guidance for speed planning. These two tasks are jointly employed to represent scene understanding in autonomous driving for unstructured environments. This dataset serves as an addition and complement to existing public datasets for autonomous driving.
Our data collection platform consists of a monocular camera, a Light Detection and Ranging (LiDAR) sensor, an Inertial Measurement Unit (IMU) and a Real-Time Kinematic (RTK) system. The scanning range is primarily focused on the road surface. The collected scenes cover various irregular obstacles, including muddiness, elevations, and depressions. Multi-sensor data synchronization was conducted to ensure consistency. The unstructured scene perception dataset comprises 23,549 frames, including high-resolution images, LiDAR point clouds, 3D semantic occupancy prediction labels, depth estimation labels, road surface elevation reconstruction labels, 3D bounding boxes, vehicle motion data, vehicle planning data, and image captions.
To validate the effectiveness of our dataset, we conduct evaluations on two representative tasks: 3D semantic occupancy prediction and road surface elevation reconstruction. These tasks reflect recent trends in 3D scene understanding, which has evolved from early single-view estimation to transformer-based architectures and multi-modal reasoning. In 3D semantic occupancy prediction, methods such as MonoScene19, TPVFormer20, and OccFormer21 enhance voxel-level representations using camera-only inputs, while Co-Occ22 introduces LiDAR-camera fusion and L2COcc23 distills LiDAR knowledge into efficient monocular models. Similarly, in road surface elevation reconstruction, models have progressed from CRF-based designs like NeWCRFs24 to transformer-based approaches such as DepthFormer25 and BinsFormer26. Furthermore, ScaleDepth27 decomposes depth estimation into relative prediction and scale recovery for improved generalization. We adopt these representative methods to evaluate the performance and applicability of our dataset.
The major contributions of this paper are summarized as follows:
(i) We constructed the world’s first comprehensive perception dataset for unstructured scenes, which focuses on irregular obstacles and road surface undulations. The tasks of 3D semantic occupancy prediction and road surface elevation reconstruction were defined to effectively represent and understand unstructured scenes.
(ii) Depth estimation labels were provided for monocular vision tasks. Annotations for path planning and natural language scene descriptions were also provided, which are used for future exploration of end-to-end (E2E) autonomous driving and vision-language models (VLMs) in unstructured scenes.
(iii) Experiments were conducted with the state-of-the-art (SOTA) methods to evaluate the effectiveness of our dataset. The results demonstrated its effectiveness across three tasks: 1) 3D semantic occupancy prediction, 2) road surface elevation reconstruction, and 3) depth estimation.
Methods
Hardware platform
We use a transport truck as the hardware acquisition platform. The vehicle is equipped with a front-facing LiDAR, an inertial navigation system, and a monocular camera. The sensors are mounted at the front of the vehicle with a 5-degree downward tilt to better focus on the ground. To obtain denser point cloud data with wider coverage in open environments for precise road surface reconstruction, we utilize the Livox Horizon LiDAR to acquire dense ground point cloud data. The camera is the Basler acA1920-40g with an 8 mm focal length, capable of capturing high-quality images at a resolution of 1920 × 1200. The RTK system provides a horizontal positioning accuracy of 8 mm and a vertical positioning accuracy of 15 mm, meeting the requirements for ground reconstruction and ground truth acquisition.
Sensor configuration
The Livox Horizon LiDAR operates at a capture frequency of 10 Hz, with an 81.7° horizontal field of view (FOV), a 25.1° vertical FOV, a detection range of 90 meters (up to a maximum of 260 meters), an accuracy of ±2 cm, and a maximum point output of 480,000 points per second. The industrial camera has a capture frequency of 10 Hz, a 1/1.2-inch sensor format, a resolution of 1920 × 1200, and a 70° FOV. The inertial navigation system operates at an update frequency of 50 Hz.
Data collection
Data collection was conducted from June 2022 to May 2025 in six different surface mines and surrounding unstructured areas located in Inner Mongolia and Shaanxi Province, China. To mitigate potential biases introduced by a single data collection platform, we employed a multi-platform data acquisition strategy involving various types of trucks and cars. Baseline evaluations were conducted to assess the cross-platform generalization capability of the dataset, as shown in Table 4.
To enhance the diversity of the dataset and improve the practical applicability of trained models, we collected data under a wide range of environmental conditions, including cloudy, rainy, snowy, dusty, night, and backlight scenes. Raw data were collected on unstructured roads with features such as bumps, protrusions, depressions, and muddy surfaces. Vehicle movement on uneven terrain caused vibrations that posed significant challenges for sensor calibration and synchronization. In addition, to further enhance the practical applicability of the dataset, we also collected data from semi-structured transition areas that connect structured roads with unstructured roads. These regions are primarily located in the areas surrounding surface mines. Through statistical analysis and preliminary experiments, we observed a pronounced long-tailed distribution of object classes in the dataset. For instance, traffic signs and pedestrians, which are commonly seen in urban environments, are infrequent classes in unstructured scenes. Moreover, the experimental results revealed notable challenges in small object detection in such scenes. To address these issues, we refined our data collection strategy by implementing dedicated acquisition plans for rare categories and small objects. The objective is to increase the number of samples for infrequent and small objects in order to mitigate data imbalance and enhance perception performance in long-tailed distributions and small-object scenes.
All sensors were connected via cables to an onboard control unit running a ROS environment, and data were recorded and stored in ROSbag files. A unified timestamp was applied to all data types, facilitating subsequent software-based synchronization. Image data were saved in the original RAW format to ensure high-quality acquisition. LiDAR data included the XYZ coordinates of the scanned points and their reflection intensity. The IMU measured the vehicle’s roll and pitch angles during movement, while the GNSS module provided outputs including longitude, latitude, altitude (LLA), heading (yaw), and speed.
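For illustration, the following is a minimal Python sketch of the timestamp-based soft synchronization described above, assuming the ROS1 rosbag API; the topic names and bag file name are hypothetical placeholders rather than the project's actual configuration.

```python
# Minimal sketch of timestamp-based soft synchronization from the recorded ROSbag files.
# Topic names and the bag file name are hypothetical placeholders.
import rosbag  # ROS1 Python API

IMAGE_TOPIC = "/camera/image_raw"   # hypothetical
LIDAR_TOPIC = "/livox/lidar"        # hypothetical
IMU_TOPIC = "/imu/data"             # hypothetical

def extract_streams(bag_path):
    """Collect (timestamp, message) pairs per sensor topic from one ROSbag."""
    streams = {IMAGE_TOPIC: [], LIDAR_TOPIC: [], IMU_TOPIC: []}
    bag = rosbag.Bag(bag_path)
    for topic, msg, t in bag.read_messages(topics=list(streams)):
        streams[topic].append((t.to_sec(), msg))
    bag.close()
    return streams

def nearest(candidates, t_ref):
    """Return the (timestamp, message) pair closest in time to t_ref."""
    return min(candidates, key=lambda item: abs(item[0] - t_ref))

# Pair every camera frame with the temporally closest LiDAR sweep and IMU sample.
streams = extract_streams("unstructured_scene.bag")
synced = [(t_img, img, nearest(streams[LIDAR_TOPIC], t_img), nearest(streams[IMU_TOPIC], t_img))
          for t_img, img in streams[IMAGE_TOPIC]]
```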
Data processing
First, we synchronize the collected multi-sensor data based on their timestamps to achieve temporal alignment for subsequent joint processing of images and point clouds. Next, a dense 3D representation of the scene is required to provide ground truth for 3D tasks. To obtain static point clouds, the points belonging to moving obstacles such as vehicles, which are annotated with AI assistance and verified manually, are removed. Following this, a local static map, i.e., a dense point cloud representation, is constructed from the static point cloud frames by multi-frame stacking using a LiDAR odometry approach, as shown in Fig. 1(a).
Subsequently, we perform instance segmentation annotation assisted by pretrained GLEE28 to extract semantic information from the images. Then, we obtain a semantic point cloud map by utilizing the projection relationships between images and LiDAR, combined with the local point cloud map. 3D semantic occupancy prediction labels are generated through the voxelization of the semantic point cloud map. Ground truth for road surface elevation reconstruction and depth estimation is generated by extracting the elevation and depth values from the local point cloud map, respectively. The visualization of different tasks is illustrated in Figs. 1(b) and 2(c).
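The Python sketch below illustrates the two label-generation steps just described: projecting image semantics onto the aggregated point cloud and voxelizing the painted points into a semantic occupancy grid. The camera intrinsics K, LiDAR-to-camera extrinsics T_cam_lidar, voxel size, and point cloud range are assumed placeholders, and the last-write voxel labeling is a simplification of the actual pipeline rather than its exact implementation.

```python
# Simplified sketch: paint LiDAR points with image semantics, then voxelize into occupancy labels.
import numpy as np

def paint_semantics(points, semantic_image, K, T_cam_lidar):
    """Assign each LiDAR point the semantic label of the pixel it projects to (-1 = unlabeled)."""
    pts_h = np.hstack([points[:, :3], np.ones((len(points), 1))])   # homogeneous XYZ1
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]                      # transform into camera frame
    valid = pts_cam[:, 2] > 0                                       # keep points in front of camera
    uv = (K @ pts_cam[valid].T).T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)                       # pixel coordinates
    h, w = semantic_image.shape
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    labels = np.full(len(points), -1, dtype=np.int32)
    idx = np.flatnonzero(valid)[inside]
    labels[idx] = semantic_image[uv[inside, 1], uv[inside, 0]]
    return labels

def voxelize(points, labels, voxel_size=0.2, pc_range=(0.0, -25.6, -3.0, 51.2, 25.6, 5.0)):
    """Build a voxel grid of class indices (0 = free space); assumed range/size, last-write labeling."""
    x0, y0, z0, x1, y1, z1 = pc_range
    dims = (np.array([x1 - x0, y1 - y0, z1 - z0]) / voxel_size).astype(int)
    grid = np.zeros(dims, dtype=np.int32)
    idx = ((points[:, :3] - np.array([x0, y0, z0])) / voxel_size).astype(int)
    keep = np.all((idx >= 0) & (idx < dims), axis=1) & (labels >= 0)
    for (i, j, k), lab in zip(idx[keep], labels[keep]):
        grid[i, j, k] = lab + 1        # shift labels so that 0 remains free space
    return grid
```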
Data Overview. (a) Hardware system that includes a monocular camera, a Light Detection and Ranging (LiDAR) sensor, an Inertial Measurement Unit (IMU) and a Real-Time Kinematic (RTK) system. (b) Scene Reconstruction, which is characterized by numerous irregularly shaped obstacles and an uneven road surface. Semantic classes are predicted from camera images with the assistance of SAM for annotation. (c) A dense point cloud map is generated from single-frame point clouds using LiDAR odometry. (d) Semantic images are then projected onto the dense point cloud map, creating a dense semantic point cloud. Voxelization is performed on this map to obtain 3D semantic occupancy prediction labels, which distinguish drivable and non-drivable areas and provide guidance for trajectory planning. (e) A rule-based aggregation method converts the dense point cloud map into a Road Elevation Map, where road undulations affect speed planning.
Data annotation
To enhance the safety and comfort of autonomous vehicles in unstructured scenes, we provide 3D semantic occupancy prediction, which distinguishes various regions of the road to offer critical data for trajectory planning, and unstructured road surface elevation reconstruction labels, which provide valuable information for speed planning.
The 3D semantic occupancy prediction task accurately captures the complex shapes and spatial layouts of unstructured scenes, providing a comprehensive three-dimensional representation for complex environments. There are 11 types of obstacles, including foreground elements, such as trucks, widebody vehicles, machinery, excavators, cars, pedestrians, and traffic signs, as well as background elements, such as roads, walls, barriers, and muddiness. These detailed semantic annotations provide more precise and comprehensive perception information for decision-making and trajectory planning as illustrated in Fig. 2(d).
The unstructured road surface elevation reconstruction label captures the bumpiness of unstructured roads using elevation information, providing valuable data for autonomous vehicle speed planning as illustrated in Fig. 2(e). Moreover, the depth estimation label can be used to evaluate the performance of monocular depth estimation algorithms in unstructured scenes and enhance the performance of monocular 3D tasks.
Additionally, to support future research on planning tasks, we also provide natural language description labels, which can be used to aid scene understanding for decision-making and planning. Trajectory and speed planning information for the data collection vehicles is also provided.
Annotation statistics
The object classes differ significantly from those in urban scenes. We classify vehicles in our dataset into five types, all uniformly labeled under the vehicle class. Our primary focus is on road surface classes in unstructured scenes. In addition to traditional drivable roads, we annotate obstacles such as elevations and muddiness on the road, which affect normal vehicle operation. This expands the applicability of autonomous driving in diverse scenes. Moreover, we also annotate traffic signs and background objects. The relevant class statistics are shown in Fig. 3(a). As can be seen, the data exhibits a long-tail distribution, which reflects real-world scene characteristics while also presenting challenges for algorithm design. Additionally, we analyze the distribution of dimensions for different objects, as shown in Fig. 3(b), and the distribution of distances for different objects, as shown in Fig. 3(c). The distribution of object dimensions reveals significant variations in scale. Small-object and irregular-object detection thus emerge as challenging problems.
Future Technology Outlook
Despite significant recent advancements in the field of autonomous driving, navigating unstructured, non-urban scenes remains challenging and can lead to severe accidents. Large language models (LLMs) have demonstrated impressive reasoning abilities, approaching the capabilities of general artificial intelligence. In autonomous driving, related work has already begun exploring how to construct vision-language models (VLMs) to achieve scene description, scene analysis, and decision reasoning, thereby enabling higher levels of autonomous driving29,30. We believe that exploring the construction of VLMs for autonomous driving with visual representations, combined with 3D semantic occupancy prediction31,32 and road surface elevation reconstruction in unstructured scenes, presents a challenging but meaningful task, as shown in Fig. 1(d). In addition, data-centric autonomous driving has demonstrated robust algorithm-driven capabilities. Constructing a data closed loop in unstructured scenes to address the long-tail problem in autonomous driving represents another future challenge, as illustrated in Fig. 1(d).
Data Records
Dataset organization

The raw data in this study have been deposited in the Science Data Bank33, where all data files and documentation are openly accessible. We have created a dataset containing 23,549 pairs of autonomous driving data samples with diverse labels in unstructured scenes. The raw data include image data captured by a camera, point cloud data collected by Horizon LiDAR and Ouster LiDAR, and IMU data. All files are named using a timestamp followed by the file format suffix. The provided annotation labels include 3D semantic occupancy prediction labels, road elevation labels, depth estimation labels, instance segmentation labels, 3D point cloud segmentation labels, and local dense point cloud maps. All labels are named in the same manner as the source data and are stored in separate folders. Additionally, image-text descriptions are provided for future multimodal task exploration. Moreover, we provide camera intrinsic and extrinsic calibration information, dataset partitioning, and real-time vehicle pose data. The detailed data organization hierarchy is shown in Table 1, which comprehensively presents the file structure, data types, and data descriptions. From the dataset, we extract time-continuous sequences to conduct and evaluate experiments related to motion planning. The data collection frequency is 10 Hz, which has been downsampled to 2 Hz.
Data Format
The source images are losslessly converted from the raw format to .jpg format. The source LiDAR point clouds are saved in .bin format, containing XYZ coordinates and reflection intensity. Labels for 3D semantic occupancy prediction are stored in .npy format, representing the semantic class and occupancy state of each voxel in 3D space. Road elevation reconstruction labels and depth estimation labels are saved losslessly in 16-bit .jpg format. The camera’s intrinsic and extrinsic parameters follow the calibration file format of the KITTI dataset, with each data frame corresponding to a calibration file stored as a text file.
Vehicle motion information includes global and local poses, with three degrees of freedom each for XYZ translation and rotation, as well as three degrees of freedom for linear speed. All this information is stored in text files. Additionally, we provide dataset partition files for the training, validation, and test sets; developers can also re-partition the data according to their own criteria. To support future multimodal task exploration, we also provide image-text description labels, stored in text files. The detailed data types and descriptions are provided in Table 2.
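As a usage illustration of the formats above, here is a minimal loading sketch; the folder names, the example timestamp, and the 16-bit encoding scale are assumptions rather than the released specification, for which the UnScenes3D toolkit should be consulted.

```python
# Minimal example of reading one frame under the formats described above (paths are placeholders).
import numpy as np
import cv2

stamp = "1718000000.100000"   # hypothetical timestamp-based file name

# Camera image (.jpg)
image = cv2.imread(f"images/{stamp}.jpg", cv2.IMREAD_COLOR)

# LiDAR sweep (.bin): XYZ coordinates plus reflection intensity per point
points = np.fromfile(f"lidar/{stamp}.bin", dtype=np.float32).reshape(-1, 4)

# 3D semantic occupancy label (.npy): semantic class / occupancy state per voxel
occupancy = np.load(f"occupancy/{stamp}.npy")

# Road elevation and depth labels: 16-bit single-channel images; the scale factor
# used here to decode metric values is an assumption, not the official convention.
elevation = cv2.imread(f"elevation/{stamp}.jpg", cv2.IMREAD_UNCHANGED).astype(np.float32) / 256.0
depth = cv2.imread(f"depth/{stamp}.jpg", cv2.IMREAD_UNCHANGED).astype(np.float32) / 256.0

print(image.shape, points.shape, occupancy.shape, elevation.shape, depth.shape)
```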
Technical Validation
We have constructed a dataset for unstructured scene understanding. To validate its effectiveness, we conducted experimental evaluations on the benchmark. First, we performed the 3D semantic occupancy prediction task, which divides space into a grid and simultaneously predicts the occupancy status and semantic class of each cell, providing more accurate environmental perception and understanding for autonomous vehicles. Second, we carried out the road surface elevation reconstruction task, which is used to assess the bumpiness of road surfaces in unstructured scenes. Third, we conducted experimental validation of monocular depth estimation on the dataset; the depth estimation task aims to infer the distance (depth) from each point in the scene to the camera. The latest algorithms were selected for experimental validation. Furthermore, we conducted a systematic evaluation to assess the accuracy of the automated labeling results.
Evaluation of Automated Annotation
To assess the reliability of our automated annotation pipeline, we conducted a quantitative evaluation against a manually annotated ground-truth subset. Specifically, we selected a representative subset of 6,137 frames and manually labeled instance segmentation masks. We then applied our automated labeling system, which is pretrained on various datasets of unstructured scenes, to the same subset. The automated method achieves a mask AP of 84.49 in instance segmentation when compared with the human-annotated ground truth. Compared with state-of-the-art methods such as GLEE28 and Grounded-SAM34, which achieve 49.6 mIoU on the SegInW benchmark (Segmentation in the Wild) and 62.0 mask AP on the COCO dataset, respectively, our method demonstrates significantly improved annotation performance. Additionally, all automatically generated annotations were further manually verified and corrected to ensure label quality.
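As an illustration of how such a mask-AP comparison can be computed, the following is a hedged sketch using the pycocotools evaluator, assuming the manual ground truth and the automated masks are exported as COCO-format JSON files (the file names are placeholders).

```python
# Sketch of mask-AP evaluation of automated labels against a manually annotated subset.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("manual_subset_gt.json")         # manually annotated ground-truth subset
coco_dt = coco_gt.loadRes("auto_labels.json")   # automated annotation results

evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")  # instance-segmentation (mask) AP
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()                            # prints AP, AP50, AP75, etc.
```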
3D Semantic Occupancy Prediction
Baseline & Metrics
We consider four baselines for semantic occupancy prediction, all of which are among the best available open-source methods: TPVFormer20, MonoScene19, OccFormer(large)21, and OccFormer(small)21. These methods represent classical monocular semantic occupancy prediction approaches, and we evaluated them using their RGB input versions. All model inputs were resized to a resolution of 384 × 1280, except for OccFormer(large)21, which uses a higher resolution of 896 × 1920.
In 3D semantic occupancy prediction tasks, commonly used evaluation metrics measure the accuracy of the model’s predictions regarding the semantic classification and occupancy of each voxel in 3D space. These metrics are similar to those used in 3D segmentation and semantic classification tasks, and are summarized as follows:
IoU (Intersection over Union): IoU measures the overlap between the predicted and ground truth voxels for each class, as defined in Eq. (1). It is calculated as the ratio of the intersection to the union of the predicted and actual voxels:

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \qquad (1)$$

where TP, FP, and FN denote the numbers of true positive, false positive, and false negative predictions, respectively.
mIoU (Mean Intersection over Union): mIoU, as shown in Eq. (2), is the average of the IoU values across all classes and is used to assess the overall performance of the model across multiple classes:

$$\mathrm{mIoU} = \frac{1}{C}\sum_{i=1}^{C}\mathrm{IoU}_{i} \qquad (2)$$

where C represents the total number of classes and IoU_i is the IoU for the i-th class.
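For reference, the following is a small numpy sketch that computes the per-class IoU and mIoU of Eqs. (1) and (2) over voxel grids of class indices; treating index 0 as free space is an assumption about the label convention, not a statement of the released format.

```python
# Per-class IoU and mIoU over integer voxel grids (Eqs. (1)-(2)); index 0 assumed to be free space.
import numpy as np

def occupancy_miou(pred, gt, num_classes):
    """Return (per_class_iou, miou) for two integer voxel grids of equal shape."""
    ious = []
    for c in range(1, num_classes + 1):                 # skip class 0 (assumed free space)
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        union = tp + fp + fn
        ious.append(tp / union if union > 0 else np.nan)  # classes absent from both are ignored
    ious = np.array(ious)
    return ious, np.nanmean(ious)
```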
Benchmark Evaluation
Table 3 presents the per-class results and the mean IoU (mIoU) across all classes. For each method, we separately evaluated performance on moving and static objects in the scene. As shown in Table 3, the different methods achieve reasonable performance on high-frequency classes. However, for rare classes that suffer from the long-tail distribution, the experimental results are poor, with some IoU scores approaching zero. It is evident that the data exhibits a pronounced long-tail distribution, which poses significant challenges for research. Among static objects, performance on muddiness is poor; one reason is the limited number of samples for the muddiness class, and the irregular shape of muddiness further complicates accurate recognition. In addition, we noticed that the performance on pedestrians is also below expectations. We conducted validation tests on Co-Occ22, a multimodal fusion method, and found that multimodal fusion significantly improves performance for rare classes, as shown in Fig. 4.
From the evaluations of these methods, it is evident that the unstructured scenes we address are more challenging than traditional urban scenes. On one hand, our scenes introduce a large number of new object classes with irregular shapes, which significantly affect the safe driving of autonomous vehicles. As shown in Table 3, performance on muddiness is relatively poor. One reason is the highly variable morphology of muddiness, which leads to significant intra-class feature variability; additionally, the blending of muddiness with the road results in low feature distinction between the two. On the other hand, the data exhibits a long-tailed distribution, where some classes, such as excavator, car, and pedestrian, suffer from poor model performance due to scarce samples. In addition, the sample classes present challenges such as small object detection, camouflaged object detection, and boundary detection. These issues pose new challenges for traditional 3D semantic occupancy prediction methods, and this work could potentially open a new line of research.
Generalization Evaluation
We conducted a series of experiments to validate the generalization capability of models across platforms and domains. Our dataset consists of multi-source data collected from six distinct surface mines, each varying in terrain structure and sensor configuration. In this experiment, data from Regions 1 to 4 were used for model training, while Regions 5 and 6 were kept as unseen test regions to evaluate generalization performance. We then perform two types of evaluation: 1) In-domain test: evaluation on the test sets from Regions 1 to 4 to assess model performance within the same platforms and domains. 2) Cross-domain test: evaluation on the test sets from the remaining two mining regions to examine generalization to unseen platforms and scenes.
The results show that, although the model experiences performance degradation in the cross-domain test compared to the in-domain setting, it still maintains stable performance overall. This demonstrates the model’s ability to generalize across different platforms and scenes. Moreover, the experimental design highlights the practical value and inherent challenges of our dataset in supporting real-world deployment under diverse and unstructured scenes.
Road Elevation Reconstruction
Baseline & Metrics
We considered four primary monocular depth estimation baselines to validate road surface elevation reconstruction and depth estimation tasks in unstructured scenes. The selected methods represent the best available open-source approaches, including NeWCRFs24, DepthFormer25, BinsFormer26, and ScaleDepth27, which are seminal depth estimation methods. We evaluated the RGB input versions of each method.
In depth estimation tasks, key evaluation metrics include Absolute Relative Error (AbsRel), which measures the average relative difference between predicted and true depths, and Root Mean Squared Error (RMSE), which calculates the overall error magnitude. In addition, Squared Relative Error (SqRel) penalizes larger errors by squaring the differences, while accuracy under a threshold evaluates the proportion of predictions that are within an acceptable range of the ground truth. These metrics provide a balanced view of a model’s depth estimation performance.
Absolute Relative Error (AbsRel): It calculates the mean of the absolute differences between predicted depth values \({d}_{i}\) and ground truth depth values \({d}_{i}^{\ast }\), relative to the ground truth. It provides a relative error measure, indicating the average error in terms of the actual depth values, as shown in Eq. (3):

$$\mathrm{AbsRel} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left|d_{i}-d_{i}^{\ast }\right|}{d_{i}^{\ast }} \qquad (3)$$

Root Mean Squared Error (RMSE): RMSE is the square root of the average of the squared differences between predicted and true depth values. It penalizes larger errors more heavily, making it sensitive to outliers, and provides a measure of the absolute difference between predicted and true depths, as shown in Eq. (4):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(d_{i}-d_{i}^{\ast }\right)^{2}} \qquad (4)$$

Squared Relative Error (SqRel): This metric calculates the mean of the squared differences between predicted and ground truth depth values, normalized by the ground truth. It is a variant of the relative error that squares the difference, as shown in Eq. (5):

$$\mathrm{SqRel} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left(d_{i}-d_{i}^{\ast }\right)^{2}}{d_{i}^{\ast }} \qquad (5)$$

RMSE Logarithmic (RMSE log): RMSE log computes the RMSE in logarithmic space, which is useful when depth values span several orders of magnitude. It measures the error in log space, thus penalizing relative differences rather than absolute ones, as shown in Eq. (6):

$$\mathrm{RMSE}_{\log } = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\log d_{i}-\log d_{i}^{\ast }\right)^{2}} \qquad (6)$$

Accuracy under a threshold (δ < 1.25, δ < 1.25², δ < 1.25³): These metrics measure the fraction of predicted depth values that fall within a certain factor of the ground truth depth values, providing a straightforward indication of how many predictions are acceptably close to the actual values, as shown in Eq. (7):

$$\delta = \max \left(\frac{d_{i}}{d_{i}^{\ast }},\ \frac{d_{i}^{\ast }}{d_{i}}\right) < \mathrm{thr},\quad \mathrm{thr}\in \{1.25,\ 1.25^{2},\ 1.25^{3}\} \qquad (7)$$
δ < 1.25: the percentage of predictions where the predicted depth is within a factor of 1.25 of the ground truth.
δ < 1.25²: the percentage of predictions within a factor of 1.25² of the ground truth.
δ < 1.25³: the percentage of predictions within a factor of 1.25³ of the ground truth.
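The following is a compact numpy sketch of the metrics in Eqs. (3)–(7); masking out pixels without ground truth (encoded here as zero) is an assumption about the evaluation protocol, for which the released toolkit is authoritative.

```python
# Depth-estimation metrics of Eqs. (3)-(7) over a pair of metric depth maps.
import numpy as np

def depth_metrics(pred, gt):
    """pred, gt: depth maps of equal shape; gt == 0 is assumed to mark pixels without ground truth."""
    mask = gt > 0
    d, g = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(d - g) / g)                          # Eq. (3)
    rmse = np.sqrt(np.mean((d - g) ** 2))                         # Eq. (4)
    sq_rel = np.mean((d - g) ** 2 / g)                            # Eq. (5)
    rmse_log = np.sqrt(np.mean((np.log(d) - np.log(g)) ** 2))     # Eq. (6)
    ratio = np.maximum(d / g, g / d)
    deltas = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]      # Eq. (7)
    return abs_rel, rmse, sq_rel, rmse_log, deltas
```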
Results and Discussion
Table 5 presents the road surface elevation reconstruction results of various depth estimation models. Our region of interest is defined as 15 to 30 meters ahead of the vehicle. Owing to the high-precision and dense point cloud labels, all models achieved strong performance on the metrics shown in Table 5. The experimental results show that the algorithms can accurately reconstruct irregular obstacles, including muddiness, walls, and elevations, on unstructured roads, demonstrating the effectiveness of our dataset in representing the structure of unstructured road surfaces. The Absolute Relative Error is approximately 6% within the 15 to 30 meter depth range; in the elevation direction, an elevation of 30 cm will thus result in an error of about 1.8 cm. This level of accuracy meets the performance requirements for unstructured road surface reconstruction. The experimental results of road surface elevation reconstruction are visualized in Fig. 5, which displays the elevation predictions from a BEV (Bird’s Eye View) perspective.
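The quoted elevation error follows from applying the reported relative error to the elevation magnitude; this is a worked restatement of the arithmetic above, not an additional result:

$$0.06 \times 30\ \mathrm{cm} = 1.8\ \mathrm{cm}$$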
As shown in Fig. 5, the ground truth in the first row exhibits continuous elevation changes, and both ScaleDepth27 and NeWCRFs24 are able to reconstruct them. In the second row, the ground truth shows a distinct elevation on the right side of the road, and all algorithms are able to reconstruct the elevation in this area. Additionally, in the last row, the ground has significant undulations, which clearly demonstrates the effectiveness of the different algorithms in reconstructing uneven road surfaces. Reconstructing the road surface in front of the vehicle is crucial for effective vehicle speed control strategies, and road surface elevation reconstruction data provides a foundation for comfortable and safe driving. According to the experimental statistics, the relative error increases with distance, indicating higher accuracy at shorter ranges. This phenomenon is consistent with the observed data pattern: texture details are preserved at small depths but diminish at greater depths due to the perspective effect. This dataset is highly challenging, and future research will require further exploration to develop more advanced models for achieving more accurate road surface elevation reconstruction.
In addition to evaluating road surface elevation reconstruction, we also evaluated monocular depth estimation to verify the effectiveness of our dataset. Table 6 presents the monocular depth estimation results of different depth estimation models. Owing to the high-precision and dense point cloud labels, all models achieved notable performance across the metrics shown in Table 6. Consistent with the results of road elevation reconstruction, the relative error in monocular depth estimation increases with distance, indicating higher accuracy at closer ranges. We thus provide a depth estimation dataset for unstructured scenes, where the complex and varied backgrounds pose additional challenges to traditional depth estimation methods. Further research is required to develop advanced models for more accurate depth estimation.
Usage Notes
This dataset also provides trajectory planning and speed planning annotations for the ego vehicle. Researchers can conduct studies related to scene perception and trajectory planning in unstructured scenes. The dataset is available for download under the CC BY 4.0 license.
Code availability
We have released the processing code for this dataset, written in Python, which includes scripts for workflow processing, visualization, and data parsing. This toolkit is available in the code repository at https://github.com/ruiqi-song/UnScenes3D.
References
Li, L. et al. Data-Centric Evolution in Autonomous Driving: A Comprehensive Survey of Big Data System, Data Mining, and Closed-Loop Technologies. arXiv preprint arXiv:2401.12888. (2024).
Federal Highway Administration. Highway Statistics 2020: Kilometers by Type of Surface and Ownership/Functional System, National Summary. (2021).
Unknown Author. Unpaved Roads: Safety Needs and Treatments. (2014).
Geiger, A., Lenz, P., Stiller, C. & Urtasun, R. Vision meets Robotics: The KITTI Dataset. International Journal of Robotics Research (IJRR) 32, 1231–1237 (2013).
Caesar, H. et al. nuScenes: A multimodal dataset for autonomous driving. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 11621-11631 (2020).
Sun, P. et al. Waymo Open Dataset: An Autonomous Driving Dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2446-2454 (2020).
Huang, X. et al. The apolloscape dataset for autonomous driving. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 954-960 (2019).
Valada, A., Oliveira, G. L., Brox, T. & Burgard, W. Deep multispectral semantic scene understanding of forested environments using multimodal fusion. International Symposium on Experimental Robotics (ISER). 465-477 (2016).
Wigness, M., Eum, S., Rogers, J. G., Han, D. & Kwon, H. A Rugd Dataset for Autonomous Navigation and Visual Perception in Unstructured Outdoor Environments. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 5000-5007 (2020).
Geyer, J. et al. A2D2 Autonomous Driving Dataset. arXiv preprint arXiv:2004.06320. (2019).
Li, Y. et al. AutoMine: An unmanned mine dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 21308-21317 (2022).
Jiang, P., Osteen, P., Wigness, M. & Saripalli, S. Rellis-3d dataset: Data, benchmarks and analysis. In 2021 IEEE international conference on robotics and automation (ICRA). 1110-1116 (2020).
Zhai, H. et al. WildOcc: 3D Semantic Occupancy Prediction in the Wild. arXiv preprint arXiv:2410.15792. (2024).
Zhao, T. et al. A road surface reconstruction dataset for autonomous driving. Scientific data 11, 459 (2024).
Song, R. et al. MSFANet: A Light Weight Object Detector Based on Context Aggregation and Attention Mechanism for Autonomous Mining Truck. IEEE Transactions on Intelligent Vehicles 8, 2285–2295 (2023).
Ai, Y. et al. A Real-Time Road Boundary Detection Approach in Surface Mine Based on Meta Random Forest. IEEE Transactions on Intelligent Vehicles 9, 1989–2001 (2024).
Chen, L. et al. Milestones in Autonomous Driving and Intelligent Vehicles—Part II: Perception and Planning. IEEE Transactions on Systems, Man, and Cybernetics: Systems 53, 6401–6415 (2023).
Chang, C. et al. MetaScenario: A Framework for Driving Scenario Data Description, Storage and Indexing. IEEE Transactions on Intelligent Vehicles 8, 1156–1175 (2023).
Cao, A. Q., & De Charette, R. Monoscene: Monocular 3d semantic scene completion. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3991-4001 (2022).
Huang, Y., Zheng, W., Zhang, Y., Zhou, J., & Lu, J. Tri-perspective view for vision-based 3d semantic occupancy prediction. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9223-9232 (2023).
Zhang, Y., Zhu, Z., & Du, D. Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction. Proceedings of the IEEE/CVF International Conference on Computer Vision. 9433-9443 (2023).
Pan, J., Wang, Z. & Wang, L. Co-Occ: Coupling Explicit Feature Fusion With Volume Rendering Regularization for Multi-Modal 3D Semantic Occupancy Prediction. IEEE Robotics and Automation Letters 9, 5687–5694 (2024).
Wang, R. et al. L2COcc: Lightweight Camera-Centric Semantic Scene Completion via Distillation of LiDAR Model. arXiv preprint arXiv:2503.12369. (2025).
Yuan, W., Gu, X., Dai, Z., Zhu, S. & Tan, P. Neural window fully-connected crfs for monocular depth estimation. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 3916-3925 (2022).
Agarwal, A., & Arora, C. Depthformer: Multiscale vision transformer for monocular depth estimation with global local information fusion. 2022 IEEE International Conference on Image Processing (ICIP), 3873-3877 (2022).
Li, Z., Wang, X., Liu, X., & Jiang, J. Binsformer: Revisiting adaptive bins for monocular depth estimation. IEEE Transactions on Image Processing (2024).
Zhu, R. et al. ScaleDepth: Decomposing Metric Depth Estimation into Scale Prediction and Relative Depth Estimation. arXiv preprint arXiv:2407.08187 (2024).
Wu, J. et al. General object foundation model for images and videos at scale. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3783-3795 (2024).
Tian et al. Drivevlm: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289. (2024).
Zhao, R. et al. DriveLLaVA: Human-Level Behavior Decisions via Vision Language Model. Sensors (Basel, Switzerland) 24, 4113 (2024).
Zheng, W. et al. Occworld: Learning a 3d occupancy world model for autonomous driving. In European conference on computer vision. 55-72 (2024).
Wei, J. et al. OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving. arXiv preprint arXiv:2409.03272. (2024).
UnScenes3D. Scene as Occupancy and Reconstruction: A Comprehensive Dataset for Unstructured Scene Understanding. https://doi.org/10.57760/sciencedb.16688 (2025).
Ren, T. et al. Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks. arXiv preprint arXiv:2401.14159. (2024).
Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF international conference on computer vision. 10012-10022 (2021).
Dai, Z., Liu, H., Le, Q. V. & Tan, M. Coatnet: Marrying convolution and attention for all data sizes. Advances in neural information processing systems 34, 3965–3977 (2021).
Tan, M., & Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning. 6105-6114 (2019).
Dabbiru, L., Goodin, C. & Scherrer, N. LIDAR Data Segmentation in Off-Road Environment Using Convolutional Neural Networks. SAE International Journal of Advances and Current Practices in Mobility 2, 3288–3292 (2020).
Yu, Z. et al. Context and geometry aware voxel transformer for semantic scene completion. arXiv preprint arXiv:2405.13675 (2024).
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 62373356) and the National Key Research and Development Program of China (Grant No.2022YFB4703700).
Author information
Authors and Affiliations
Contributions
L.C. was responsible for the overall organization of the paper and contributed to its supervision. R.S. conducted the experiments, wrote and edited the manuscript. H.W. and L.L. advised, revised, and reviewed the manuscript. B.D. conducted data collection, processing and experiments. F.W. led the project and all related work comprehensively.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Chen, L., Song, R., Wu, H. et al. Scene as Occupancy and Reconstruction: A Comprehensive Dataset for Unstructured Scene Understanding. Sci Data 12, 1232 (2025). https://doi.org/10.1038/s41597-025-05532-5