Background & Summary

Autonomous driving data, which is the foundation for algorithmic innovation, is essential for the development of autonomous driving technology1. Public datasets, which serve as benchmarks for the evaluation and optimization of algorithms, play a dominant role in this development. However, while structured urban scenes have received extensive attention, autonomous driving in unstructured environments has been largely overlooked. According to data from the U.S. Federal Highway Administration (FHWA), 32.11% of roads in both urban and rural areas in the U.S. were unpaved as of 20202. It is worth noting that unpaved roads account for up to 20% of fatal traffic accidents in certain states3. Therefore, as autonomous driving technology becomes more widespread, it is imperative to extend its applications from structured urban scenes to complex unstructured scenes.

Effective perception and understanding of diverse scenes is the key challenge that distinguishes autonomous driving in unstructured scenes from driving in structured scenes. Environmental perception, a core component of autonomous driving systems, has been studied extensively. It is one of the most challenging tasks in autonomous driving and is widely considered a critical factor in safe and efficient autonomous navigation. Currently, numerous public datasets for autonomous driving in urban scenes have been developed to facilitate research and development of perception methods. The KITTI4 dataset is one of the earliest and most foundational datasets for autonomous driving. It covers urban and suburban roads in Germany and supports various tasks, including 3D object detection, depth estimation, and semantic segmentation. The nuScenes5 dataset focuses on the complex urban environments of Singapore and Boston. It provides multi-modal sensor data widely used for 3D object detection and trajectory prediction tasks. The Waymo Open Dataset6 is a large-scale multi-modal perception dataset collected in multiple cities and suburbs in the United States. The ApolloScape7 dataset covers urban and suburban areas in China with dense annotations for scene parsing, car instance segmentation, and lane segmentation.

In contrast, research on autonomous driving in non-urban scenes still has significant potential to be explored. The DeepScene8 and RUGD9 datasets focus on dense semantic segmentation in outdoor scenes. However, they lack additional sensor modalities such as LiDAR and provide a relatively small number of annotated frames, which limits their applicability to large-scale or multi-modal learning tasks. Although the A2D210 dataset covers both urban and rural roads and provides multi-sensor fusion data for tasks such as 3D object detection and semantic segmentation, it exhibits limited variability in weather conditions. AutoMine11 is the first autonomous driving dataset for perception and localization in mining scenes, which are unstructured. Furthermore, the Rellis-3D12 dataset emphasizes farmlands and grasslands. It provides rich 3D perception data, which is of great significance for research on off-road autonomous driving. WildOcc13 extends this line of research by providing multi-modal 3D occupancy annotations in diverse off-road scenes, enabling detailed semantic understanding of complex terrains. Although the aforementioned datasets contain some perception tasks for unstructured scenes, they primarily focus on traditional tasks such as 3D object detection and lack a comprehensive understanding and representation of unstructured scenes. Table 1 provides an overview of existing off-road datasets along with their data modalities, annotation characteristics, supported tasks, and environmental conditions.

Table 1 Overview of autonomous driving datasets.

Road surface perception, which determines the safety and comfort of vehicles, has been explored recently. In 2024, the RSRD dataset14 was developed specifically for road surface reconstruction. This dataset was collected in both urban and rural areas with various road surface conditions, and it primarily focuses on road surface reconstruction with an emphasis on road surface features such as cracks and speed bumps. However, most of these data were collected on structured roads, resulting in limited scene coverage. It also lacks unified annotations for general object detection, especially for irregularly shaped obstacles such as muddiness and puddles on roads15,16. Although some autonomous driving perception datasets exist for non-urban scenes, a dedicated dataset that provides a comprehensive understanding and representation of unstructured scenes has yet to be established.

In this paper, we rethink the characteristics of unstructured scenes, which are defined by numerous irregular obstacles and road surface undulations. In traditional urban scenes, where roads are generally flat, most studies have overlooked the influence of road surface undulations on speed planning. Previous datasets have primarily focused on driving risks (such as vehicles, pedestrians, and bicyclists on the road) and vehicle passage areas (such as lane detection and road segmentation). Little effort has been made to explore road surface elevation reconstruction, which models the elevations and depressions of the road surface17. In reality, road surface perception plays a critical role in autonomous driving perception tasks to avoid risks presented by impassable areas and discomfort caused by road surface bumps. As the only physical interface between the vehicle and the ground, the traversability and irregularity of the road surface essentially determine the boundaries of vehicle dynamics in terms of safety and comfort18. Additionally, one key difference between unstructured and structured environments is the presence of numerous irregular obstacles on the road, such as fallen rocks, soil piles, muddiness, and irregular road boundaries. These factors are all critical determinants of vehicle trajectory planning.

In summary, the literature lacks a dedicated perception dataset for unstructured scenes that provides a comprehensive representation and understanding of road surface conditions and irregular obstacles, supporting the safety and comfort of autonomous vehicles in these challenging environments. In this paper, we introduce a perception dataset for unstructured scenes, characterized by 3D semantic occupancy prediction and road surface elevation reconstruction tasks. 3D semantic occupancy prediction is a general object detection method that effectively detects various irregular obstacles and provides support for trajectory planning. Road surface elevation reconstruction models undulating road conditions, providing guidance for speed planning. These two tasks are jointly employed to represent scene understanding in autonomous driving for unstructured environments. This dataset serves as an addition and complement to existing public datasets for autonomous driving.

Our data collection platform consists of a monocular camera, a Light Detection and Ranging (LiDAR) sensor, an Inertial Measurement Unit (IMU), and a Real-Time Kinematic (RTK) positioning system. The scanning range is primarily focused on the road surface. The collected scenes cover various irregular obstacles, including muddiness, elevations, and depressions. Multi-sensor data synchronization was conducted to ensure consistency. The unstructured scene perception dataset comprises 23,549 frames, including high-resolution images, LiDAR point clouds, 3D semantic occupancy prediction labels, depth estimation labels, road surface elevation reconstruction labels, 3D bounding boxes, vehicle motion data, vehicle planning data, and image captions.

To validate the effectiveness of our dataset, we conduct evaluations on two representative tasks: 3D semantic occupancy prediction and road surface elevation reconstruction. These tasks reflect recent trends in 3D scene understanding, which has evolved from early single-view estimation to transformer-based architectures and multi-modal reasoning. In 3D semantic occupancy prediction, methods such as MonoScene19, TPVFormer20, and OccFormer21 enhance voxel-level representations using camera-only inputs, while Co-Occ22 introduces LiDAR-camera fusion and L2COcc23 distills LiDAR knowledge into efficient monocular models. Similarly, in road surface elevation reconstruction, models have progressed from CRF-based designs such as NeWCRFs24 to transformer-based approaches such as DepthFormer25 and BinsFormer26. Furthermore, ScaleDepth27 decomposes depth estimation into relative depth prediction and scale recovery for improved generalization. We adopted these representative methods to evaluate the performance and applicability of our dataset.

The major contributions of this paper are summarized as follows:

(i) We constructed the world’s first comprehensive perception dataset for unstructured scenes, which focuses on irregular obstacles and road surface undulations. The tasks of 3D semantic occupancy prediction and road surface elevation reconstruction were defined to effectively represent and understand unstructured scenes.

(ii) Depth estimation labels were provided for monocular vision tasks. Annotations for path planning and natural language scene descriptions were also provided, which are used for future exploration of end-to-end (E2E) autonomous driving and vision-language models (VLMs) in unstructured scenes.

(iii) Experiments were conducted with the state-of-the-art (SOTA) methods to evaluate the effectiveness of our dataset. The results demonstrated its effectiveness across three tasks: 1) 3D semantic occupancy prediction, 2) road surface elevation reconstruction, and 3) depth estimation.

Methods

Hardware platform

We use a transport truck as the hardware acquisition platform. The vehicle is equipped with a front-facing LiDAR, an inertial navigation system, and a monocular camera. The sensors are mounted at the front of the vehicle with a 5-degree downward tilt to better focus on the ground. To obtain denser point cloud data with wider coverage in open environments for precise road surface reconstruction, we utilize the Livox Horizon LiDAR to acquire dense ground point cloud data. The camera is the Basler acA1920-40g with an 8 mm focal length, capable of capturing high-quality images at a resolution of 1920 × 1200. The RTK system provides a horizontal positioning accuracy of 8 mm and a vertical positioning accuracy of 15 mm, meeting the requirements for ground reconstruction and ground truth acquisition.

Sensor configuration

The Livox Horizon LiDAR operates at a capture frequency of 10 Hz, with an 81.7° horizontal field of view (FOV), a 25.1° vertical FOV, a ranging accuracy of ±2 cm, a maximum point output of 480,000 points per second, and a detection range of 90 meters (up to a maximum of 260 meters). The industrial camera has a capture frequency of 10 Hz, a 1/1.2-inch sensor format, a resolution of 1920 × 1200, and a 70° field of view (FOV). The inertial navigation system operates at an update frequency of 50 Hz.

Data collection

Data collection was conducted from June 2022 to May 2025 in six different surface mines and surrounding unstructured areas located in Inner Mongolia and Shaanxi Province, China. To mitigate potential biases introduced by a single data collection platform, we employed a multi-platform data acquisition strategy involving various types of trucks and cars. Baseline evaluations were conducted to assess the cross-platform generalization capability of the dataset, as shown in Table 4.

To enhance the diversity of the dataset and improve the practical applicability of trained models, we collected data under a wide range of environmental conditions, including cloudy, rainy, snowy, dusty, night, and backlight scenes. Raw data were collected on unstructured roads with features such as bumps, protrusions, depressions, and muddy surfaces. Vehicle movement over uneven terrain caused vibrations that pose significant challenges for sensor calibration and synchronization. In addition, to further enhance the practical applicability of the dataset, we also collected data from semi-structured transition areas that connect structured roads with unstructured roads. These regions are primarily located in the areas surrounding surface mines. Through statistical analysis and preliminary experiments, we observed a pronounced long-tailed distribution of object classes in the dataset. For instance, traffic signs and pedestrians, which are commonly seen in urban environments, are infrequent classes in unstructured scenes. Moreover, the experimental results revealed notable challenges in small object detection under such scenes. To address these issues, we refined our data collection strategy by implementing dedicated acquisition plans for rare categories and small objects. The objective is to increase the number of samples for infrequent and small objects in order to mitigate data imbalance and enhance perception performance in long-tailed and small-object scenarios.

All sensors were connected via cables to an onboard control unit equipped with a ROS environment, and data were recorded and stored in ROSbag files. A unified timestamp was applied to all data types, facilitating subsequent software-based synchronization. Image data were saved in their original RAW format to ensure high-quality acquisition. LiDAR data included the XYZ coordinates of the scanned points and their reflection intensity. The IMU measured the vehicle’s roll and pitch angles during movement, while the GNSS module provided outputs including longitude, latitude, altitude (LLA), heading (yaw), and speed.

Data processing

First, we synchronize the collected multi-sensor data based on their timestamps to achieve temporal alignment for subsequent joint processing of images and point clouds. Next, to provide ground truth for 3D tasks, a dense 3D representation of the scene is necessary. To obtain static point clouds, the points belonging to moving obstacles such as vehicles, annotated through AI assistance and verified manually, are removed. Following this, a local static map, which is a dense point cloud representation, is constructed from the resulting static point cloud frames by multi-frame stacking using a LiDAR odometry approach, as shown in Fig. 1(a).
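To make the synchronization step concrete, the following Python sketch pairs each camera frame with the nearest LiDAR sweep by timestamp. It is a minimal illustration rather than the exact pipeline; the timestamp-based file naming and the 50 ms tolerance are assumptions made for this example.

```python
# Minimal sketch of timestamp-based camera-LiDAR synchronization.
# Assumes files are named by their capture timestamp (see Data Records);
# the 50 ms tolerance is an illustrative choice, not a dataset constant.
from pathlib import Path
import bisect

def load_timestamps(folder: str, suffix: str) -> list:
    """Collect sorted timestamps (seconds) from file names like '1654321098.123456.jpg'."""
    return sorted(float(p.stem) for p in Path(folder).glob(f"*{suffix}"))

def synchronize(cam_ts: list, lidar_ts: list, tol: float = 0.05):
    """Pair each camera timestamp with the nearest LiDAR timestamp within `tol` seconds."""
    pairs = []
    for t in cam_ts:
        i = bisect.bisect_left(lidar_ts, t)
        # candidates: the LiDAR sweep just before and just after the image
        candidates = [j for j in (i - 1, i) if 0 <= j < len(lidar_ts)]
        if not candidates:
            continue
        j = min(candidates, key=lambda k: abs(lidar_ts[k] - t))
        if abs(lidar_ts[j] - t) <= tol:
            pairs.append((t, lidar_ts[j]))
    return pairs

# Example usage (folder names are placeholders):
# pairs = synchronize(load_timestamps("images", ".jpg"), load_timestamps("lidar", ".bin"))
```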

Fig. 1
figure 1

Dataset construction framework and future outlook: (a) Data processing. (b) Data label visualization. (c) Scene text description. (d) Future work outlook.

Subsequently, we perform instance segmentation annotation assisted by the pretrained GLEE28 model to extract semantic information from the images. Then, we obtain a semantic point cloud map by utilizing the projection relationships between the images and LiDAR, combined with the local point cloud map. 3D semantic occupancy prediction labels are generated through voxelization of the semantic point cloud map. Ground truth for road surface elevation reconstruction and depth estimation is generated by extracting the elevation and depth values from the local point cloud map, respectively. The visualization of the different tasks is illustrated in Figs. 1(b) and 2(c).
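The projection and voxelization steps can be illustrated with the following simplified sketch, which assigns each LiDAR point the semantic class of the image pixel it projects onto (using a KITTI-style projection matrix P and LiDAR-to-camera transform Tr) and then voxelizes the labeled points by majority vote. The voxel size and grid extents are illustrative assumptions, not the exact parameters used to generate the released labels.

```python
# Simplified sketch: assign image semantics to LiDAR points, then voxelize.
# Calibration follows a KITTI-style convention (P: 3x4 projection, Tr: 4x4 LiDAR-to-camera);
# the voxel size and grid range below are illustrative assumptions.
import numpy as np

def label_points(points, sem_image, P, Tr):
    """points: (N,3) LiDAR XYZ; sem_image: (H,W) per-pixel class ids."""
    pts_h = np.hstack([points, np.ones((len(points), 1))])   # (N,4) homogeneous LiDAR coords
    cam = (Tr @ pts_h.T).T                                    # points in the camera frame, (N,4)
    uvw = (P @ cam.T).T                                       # image-plane coordinates, (N,3)
    z = uvw[:, 2]
    front = z > 1e-6                                          # keep only points in front of the camera
    uv = np.zeros((len(points), 2), dtype=int)
    uv[front] = (uvw[front, :2] / z[front, None]).astype(int)
    H, W = sem_image.shape
    inside = front & (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    labels = np.full(len(points), -1, dtype=int)              # -1 marks unlabeled points
    labels[inside] = sem_image[uv[inside, 1], uv[inside, 0]]
    return labels

def voxelize(points, labels, voxel=0.2, x_rng=(0, 51.2), y_rng=(-25.6, 25.6), z_rng=(-2.0, 4.4)):
    """Majority-vote semantic label per occupied voxel; returns a dict {(i,j,k): class_id}."""
    grid = {}
    for p, c in zip(points, labels):
        if c < 0:
            continue
        if not (x_rng[0] <= p[0] < x_rng[1] and y_rng[0] <= p[1] < y_rng[1] and z_rng[0] <= p[2] < z_rng[1]):
            continue
        key = (int((p[0] - x_rng[0]) / voxel),
               int((p[1] - y_rng[0]) / voxel),
               int((p[2] - z_rng[0]) / voxel))
        grid.setdefault(key, []).append(c)
    return {k: max(set(v), key=v.count) for k, v in grid.items()}
```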

Fig. 2
figure 2

Data overview. (a) Hardware system, which includes a monocular camera, a Light Detection and Ranging (LiDAR) sensor, an Inertial Measurement Unit (IMU), and a Real-Time Kinematic (RTK) GNSS system. (b) Scene reconstruction, which is characterized by numerous irregularly shaped obstacles and an uneven road surface. Semantic classes are predicted from camera images with the assistance of SAM for annotation. (c) A dense point cloud map is generated from single-frame point clouds using LiDAR odometry. (d) Semantic images are then projected onto the dense point cloud map, creating a dense semantic point cloud. Voxelization is performed on this map to obtain 3D semantic occupancy prediction labels, which distinguish drivable and non-drivable areas and provide guidance for trajectory planning. (e) A rule-based aggregation method converts the dense point cloud map into a road elevation map, where road undulations affect speed planning.

Data annotation

To enhance the safety and comfort of autonomous vehicles in unstructured scenes, we provide 3D semantic occupancy prediction, which distinguishes various regions of the road to offer critical data for trajectory planning, and unstructured road surface elevation reconstruction labels, which provide valuable information for speed planning.

The 3D semantic occupancy prediction task accurately captures the complex shapes and spatial layouts of unstructured scenes, providing a comprehensive three-dimensional representation for complex environments. There are 11 semantic classes, including foreground elements, such as trucks, widebody vehicles, machinery, excavators, cars, pedestrians, and traffic signs, as well as background elements, such as roads, walls, barriers, and muddiness. These detailed semantic annotations provide more precise and comprehensive perception information for decision-making and trajectory planning, as illustrated in Fig. 2(d).

The unstructured road surface elevation reconstruction label captures the bumpiness of unstructured roads using elevation information, providing valuable data for autonomous vehicle speed planning as illustrated in Fig. 2(e). Moreover, the depth estimation label can be used to evaluate the performance of monocular depth estimation algorithms in unstructured scenes and enhance the performance of monocular 3D tasks.

Additionally, to support future research on planning tasks, we also provide natural language description labels, which can be used to aid scene understanding for decision-making and planning. Trajectory and speed planning information for the collected vehicles is also provided.

Annotation statistics

The object classes in our dataset differ significantly from those in urban scenes. We classify vehicles into five types, all uniformly labeled under the vehicle root class. Our primary focus is on road surface classes in unstructured scenes. In addition to traditional drivable roads, we annotate obstacles such as elevations and muddiness on the road, which affect normal vehicle operation. This expands the applicability of autonomous driving to diverse scenes. Moreover, we also annotate traffic signs and background objects. The relevant class statistics are shown in Fig. 3(a). As can be seen, the data exhibits a long-tail distribution, which reflects real-world scene characteristics while also presenting challenges for algorithm design. Additionally, we analyze the distribution of dimensions for different objects, as shown in Fig. 3(b), and the distribution of distances for different objects, as shown in Fig. 3(c). The distribution of object dimensions reveals significant variations in scale. Detection of small and irregular objects has emerged as a challenging issue.

Fig. 3
figure 3

Label distribution. (a) The number of labeled voxels for each class is presented, along with the root classes corresponding to these classes. (b) Distribution of length, width, and height dimensions for 3D object detection. (c) Distribution of distance for 3D object detection.

Future Technology Outlook

Despite significant recent advancements in the field of autonomous driving, navigating in unstructured, non-urban scenes remains challenging and can potentially lead to severe accidents. Large language models (LLMs) have demonstrated impressive reasoning abilities, approaching the capabilities of general artificial intelligence. In autonomous driving, related work has already begun exploring how to construct vision-language models (VLMs) to achieve scene description, scene analysis, and decision reasoning, thereby enabling higher levels of autonomous driving29,30. We believe that exploring the construction of VLMs for autonomous driving with visual representations, combined with 3D semantic occupancy prediction31,32 and road surface elevation reconstruction in unstructured scenes, presents a challenging but meaningful task, as shown in Fig. 1(d). In addition, data-centric autonomous driving has demonstrated robust algorithm-driven capabilities. Constructing a data closed loop in unstructured scenes to address the long-tail problem in autonomous driving represents another future challenge, as illustrated in Fig. 1(d).

Data Records

Dataset organization

The raw data in this study have been deposited in the Science Data Bank33, where all data files and documentation are openly accessible. We have created a dataset containing 23,549 pairs of autonomous driving data samples with diverse labels in unstructured scenes. The raw data include image data captured by a camera, point cloud data collected by Horizon LiDAR and Ouster LiDAR, and IMU data. All files are named using their timestamps followed by the file format suffix. The provided annotation labels include 3D semantic occupancy prediction labels, road elevation labels, depth estimation labels, instance segmentation labels, 3D point cloud segmentation labels, and local dense point cloud maps. All labels are named in the same manner as the source data and are stored in separate folders. Additionally, image-text descriptions are provided for future multimodal task exploration. Moreover, we provide camera intrinsic and extrinsic calibration information, dataset partitioning, and real-time vehicle pose data. The detailed data organization hierarchy is shown in Table 2, which comprehensively presents the file structure, data types, and data descriptions. From the dataset, we extract time-continuous sequences to conduct and evaluate experiments related to motion planning. The data collection frequency is 10 Hz, which has been downsampled to 2 Hz.

Data Format

The source images are losslessly converted from the raw format to .jpg format. The source LiDAR point clouds are saved in .bin format, containing XYZ coordinate information and reflection intensity. Labels for 3D semantic occupancy prediction are stored in .npy format, which represents the semantic class and occupancy state of each voxel in 3D space. Road elevation reconstruction labels and depth estimation labels are saved losslessly in 16-bit .png format. The camera’s intrinsic and extrinsic parameters follow the calibration file format of the KITTI dataset, with each data frame corresponding to a calibration file stored as a text file.

Vehicle motion information includes global and local poses, with three degrees of freedom for translation (XYZ) and three for rotation, as well as three components of linear speed. All this information is stored in text files. Additionally, we provide dataset partition files for the training, validation, and test sets. Developers can also re-partition the data according to their own criteria. To support future multimodal task exploration, we also provide image-text description labels, stored in text files. The detailed data types and descriptions are provided in Table 2.
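As a usage-oriented sketch, one frame of the dataset can be loaded roughly as follows; the folder names, depth scaling factor, and file extensions in this snippet are illustrative placeholders rather than the exact layout documented in Table 2.

```python
# Hedged loading sketch for a single frame; folder names, the depth scaling factor,
# and file extensions are illustrative assumptions (see Table 2 for the released layout).
import numpy as np
from PIL import Image

ts = "1654321098.123456"  # hypothetical timestamp-based file name

# Camera image stored as .jpg
image = np.array(Image.open(f"images/{ts}.jpg"))

# LiDAR sweep: x, y, z, intensity per point, stored as float32 in a .bin file
points = np.fromfile(f"lidar/{ts}.bin", dtype=np.float32).reshape(-1, 4)

# 3D semantic occupancy label: per-voxel class / occupancy state
occupancy = np.load(f"occupancy/{ts}.npy")

# 16-bit elevation / depth label (extension and KITTI-style scaling are assumptions)
depth_raw = np.array(Image.open(f"depth/{ts}.png"))
depth_m = depth_raw.astype(np.float32) / 256.0

# KITTI-style calibration file: one "KEY: v1 v2 ..." entry per line
with open(f"calib/{ts}.txt") as f:
    calib = {k: np.array(v.split(), dtype=np.float64)
             for k, v in (line.strip().split(":", 1) for line in f if ":" in line)}
```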

Table 2 Data File Descriptions.

Technical Validation

We have constructed a dataset for unstructured scene understanding. To validate the effectiveness of the dataset, we conducted experimental evaluations on benchmark tasks. First, we performed the 3D semantic occupancy prediction task, which divides space into a voxel grid and simultaneously predicts the occupancy status and semantic class of each cell, providing more accurate environmental perception and understanding for autonomous vehicles. Second, we carried out the road surface elevation reconstruction task, which is used to assess the bumpiness of road surfaces in unstructured scenes. Third, we conducted experimental validation of monocular depth estimation on the dataset. The depth estimation task aims to infer the distance (depth) from each point in the scene to the camera. The latest algorithms were selected for experimental validation. Furthermore, we conducted a systematic evaluation to assess the accuracy of the automated labeling results.

Evaluation of Automated Annotation

To assess the reliability of our automated annotation pipeline, we conducted a quantitative evaluation against a manually annotated ground-truth subset. Specifically, we selected a representative subset of 6,137 frames and manually labeled instance segmentation masks. We then applied our automated labeling system, which is pretrained on various unstructured-scene datasets, to the same subset. The results show that the automated method achieves 84.49 APmask in instance segmentation when compared to the human-annotated ground truth. Compared with state-of-the-art methods such as GLEE28 and Grounded-SAM34, which achieve 49.6 mIoU on the SegInW benchmark (segmentation in the wild) and 62.0 APmask on the COCO dataset, respectively, our method demonstrates significantly improved annotation performance. Additionally, all automatically generated annotations were further manually verified and corrected to ensure label quality.

3D Semantic Occupancy Prediction

Baseline & Metrics

We consider four baselines for semantic occupancy prediction, all of which are among the best available open-source methods: TPVFormer20, MonoScene19, OccFormer(large)21, and OccFormer(small)21. These methods represent classical monocular semantic occupancy prediction approaches, and we evaluated them using their RGB input versions. All model inputs were resized to a resolution of 384 × 1280, except for OccFormer(large)21, which uses a higher resolution of 896 × 1920.

In 3D semantic occupancy prediction tasks, commonly used evaluation metrics measure the accuracy of the model’s predictions regarding the semantic classification and occupancy of each voxel in 3D space. These metrics are similar to those used in 3D segmentation and semantic classification tasks, and are summarized as follows:

IoU (Intersection over Union): IoU measures the overlap between the predicted and ground truth voxels for each class, as defined in Eq. (1). It is calculated as the ratio of the intersection to the union of predicted and ground truth voxels.

$${\rm{IoU}}=\frac{{\rm{TP}}}{{\rm{TP}}+{\rm{FP}}+{\rm{FN}}}$$
(1)

where TP, FP, and FN indicate the number of true positive, false positive, and false negative predictions.

mIoU (Mean Intersection over Union): mIoU, as defined in Eq. (2), is the average of the IoU values across all classes and is used to assess the overall performance of the model across multiple classes.

$${\rm{mIoU}}=\frac{1}{C}\mathop{\sum }\limits_{i=1}^{C}{{\rm{IoU}}}_{i}$$
(2)

where C represents the total number of classes and IoUi is the IoU for the i-th class.
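For reference, a minimal NumPy implementation of these two metrics over voxel grids, following Eqs. (1) and (2), might look as follows; the ignore label used for unannotated voxels is an assumption for this example.

```python
# Minimal sketch of voxel-level IoU / mIoU, following Eqs. (1) and (2).
# Assumes integer class ids per voxel and an ignore label for unannotated voxels.
import numpy as np

def per_class_iou(pred, gt, num_classes, ignore_label=255):
    """pred, gt: integer arrays of the same shape (e.g. a voxel grid)."""
    valid = gt != ignore_label
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c) & valid)   # true positives
        fp = np.sum((pred == c) & (gt != c) & valid)   # false positives
        fn = np.sum((pred != c) & (gt == c) & valid)   # false negatives
        denom = tp + fp + fn
        ious.append(tp / denom if denom > 0 else float("nan"))  # undefined if class absent
    return ious

def mean_iou(pred, gt, num_classes):
    ious = per_class_iou(pred, gt, num_classes)
    return float(np.nanmean(ious))   # average over classes that appear
```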

Benchmark Evaluation

Table 3 presents the per-class results and the mean IoU (mIoU) across all classes. For each method, we separately evaluated performance on moving and static objects in the scene. As shown in Table 3, the different methods achieve reasonable performance on high-frequency classes. However, for rare classes that suffer from the long-tail distribution, the experimental results are poor, with some IoU scores approaching zero. It is evident that the data exhibits a pronounced long-tail distribution, which poses significant challenges for research. For static objects, the performance on muddiness is poor. One reason is the limited number of samples for the muddiness class; moreover, the irregular shape of muddiness adds complexity to accurate recognition. In addition, we noticed that the performance on pedestrians is also below expectations. We conducted validation tests on Co-Occ22, a multimodal fusion method, and found that multimodal fusion significantly improves performance on rare classes, as shown in Fig. 4.

Table 3 Evaluation and comparison experiments of 3D semantic occupancy prediction.
Fig. 4
figure 4

Comparison of the performance of different 3D semantic occupancy prediction methods. The scenes cover a range of conditions, including sunny, rainy, slippery, nighttime, snowy, and backlight conditions, as well as semi-structured roads.

From the evaluations of these methods, it is evident that the unstructured scenes we target are more challenging than traditional urban scenes. On one hand, our scenes introduce a large number of new object classes with irregular shapes, which have a significant impact on the safe driving of autonomous vehicles. As shown in Table 3, the performance on muddiness is relatively poor. One reason is the highly variable morphology of muddiness, which leads to significant intra-class feature variability. Additionally, the blending of muddiness with the road results in low feature distinction between the two. On the other hand, the data exhibits a long-tailed distribution, where some classes, such as excavator, car, and pedestrian, suffer from poor model performance due to scarce samples. In addition, the sample classes present challenges such as small object detection, camouflaged object detection, and boundary detection. These issues present new challenges for traditional 3D semantic occupancy prediction methods, and this work could potentially open a new line of research.

Table 4 Evaluation and comparison of 3D semantic occupancy prediction across platforms and domains.

Generalization Evaluation

We conducted a series of experiments to validate the generalization capability of the model across platforms and domains. Our dataset consists of multi-source data collected from six distinct surface mines, each varying in terrain structure and sensor configuration. In this experiment, data from Regions 1 to 4 were used for model training, while Regions 5 and 6 were kept as unseen test regions to evaluate generalization performance. We then performed two types of evaluation: 1) in-domain test: evaluation on the held-out test set from Regions 1 to 4 to assess model performance within the same platforms and domains; 2) cross-domain test: evaluation on the test sets from the two remaining mining regions to examine generalization to unseen platforms and scenes.
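A hedged sketch of this region-based partition is given below; the frame_region mapping and heldout_ids set are hypothetical helpers introduced for illustration, and the released partition files should be used in practice.

```python
# Hedged sketch of the region-based split used for the generalization study.
# `frame_region` is a hypothetical mapping from frame id to mine region (1-6);
# `heldout_ids` is a hypothetical set of frames reserved for in-domain testing.
def split_by_region(frame_ids, frame_region, heldout_ids):
    train        = [f for f in frame_ids if frame_region[f] <= 4 and f not in heldout_ids]
    in_domain    = [f for f in frame_ids if frame_region[f] <= 4 and f in heldout_ids]   # Regions 1-4 test split
    cross_domain = [f for f in frame_ids if frame_region[f] >= 5]                        # unseen Regions 5-6
    return train, in_domain, cross_domain
```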

The results show that, although the model experiences performance degradation in the cross-domain test compared to the in-domain setting, it still maintains stable performance overall. This demonstrates the model’s ability to generalize across different platforms and scenes. Moreover, the experimental design highlights the practical value and inherent challenges of our dataset in supporting real-world deployment under diverse and unstructured scenes.

Road Elevation Reconstruction

Baseline & Metrics

We considered four primary monocular depth estimation baselines to validate the road surface elevation reconstruction and depth estimation tasks in unstructured scenes. The selected methods, NeWCRFs24, DepthFormer25, BinsFormer26, and ScaleDepth27, represent seminal, openly available depth estimation approaches. We evaluated the RGB input versions of each method.

In depth estimation tasks, key evaluation metrics include Absolute Relative Error (AbsRel), which measures the average relative difference between predicted and true depths, and Root Mean Squared Error (RMSE), which calculates the overall error magnitude. In addition, Squared Relative Error (SqRel) penalizes larger errors by squaring the differences, while accuracy under a threshold evaluates the proportion of predictions that are within an acceptable range of the ground truth. These metrics provide a balanced view of a model’s depth estimation performance.

Absolute Relative Error (AbsRel): It calculates the mean of the absolute differences between predicted depth values di and ground truth depth values \({d}_{i}^{\ast }\), normalized by the ground truth. It provides a relative error measure, which indicates the average error in terms of the actual depth values. The formula is shown in the following equation (3):

$${\rm{AbsRel}}=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\frac{| {d}_{i}-{d}_{i}^{\ast }| }{{d}_{i}^{\ast }}$$
(3)

Root Mean Squared Error (RMSE): RMSE is the square root of the average of the squared differences between predicted and true depth values. It penalizes larger errors more heavily, making it sensitive to outliers. RMSE provides a measure of the absolute difference between predicted and true depths. The formula is shown in the following equation (4):

$${\rm{RMSE}}=\sqrt{\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{\left({d}_{i}-{d}_{i}^{\ast }\right)}^{2}}$$
(4)

Squared Relative Error (SqRel): This metric calculates the mean of the squared differences between predicted and ground truth depth values, normalized by the ground truth. It is a variant of the relative error that squares the difference. The formula is shown in the following equation (5):

$${\rm{SqRel}}=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\frac{{\left({d}_{i}-{d}_{i}^{\ast }\right)}^{2}}{{d}_{i}^{\ast }}$$
(5)

RMSE Logarithmic (RMSE log): RMSE log computes the RMSE in logarithmic space, which is useful when depth values span several orders of magnitude. It measures the error in log space, thus penalizing relative differences rather than absolute ones. The formula is shown in the following equation (6):

$${\rm{RMSElog}}=\sqrt{\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{\left(\log ({d}_{i})-\log ({d}_{i}^{\ast })\right)}^{2}}$$
(6)

Accuracy under a threshold (δ < 1.25, δ < 1.25², δ < 1.25³): These metrics measure the fraction of predicted depth values that fall within a certain factor of the ground truth depth values, providing a straightforward indication of how many predictions are acceptably close to the actual values. The ratio δ is computed as shown in the following equation (7):

$$\delta =\max \left(\frac{{d}_{i}}{{d}_{i}^{\ast }},\frac{{d}_{i}^{\ast }}{{d}_{i}}\right)$$
(7)

δ < 1.25: The percentage of predictions where the predicted depth is within a factor of 1.25 of the ground truth.

δ < 1.25²: The percentage of predictions within a factor of 1.25 squared of the ground truth.

δ < 1.25³: The percentage of predictions within a factor of 1.25 cubed of the ground truth.
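For reference, the depth metrics of Eqs. (3) to (7) can be computed with the following minimal sketch; masking out pixels with invalid (zero) ground-truth depth is an assumed convention.

```python
# Minimal sketch of the depth metrics in Eqs. (3)-(7).
# Pixels with non-positive ground-truth or predicted depth are masked out (an assumed convention).
import numpy as np

def depth_metrics(pred, gt):
    mask = (gt > 0) & (pred > 0)
    d, d_star = pred[mask], gt[mask]
    ratio = np.maximum(d / d_star, d_star / d)           # threshold ratio of Eq. (7)
    return {
        "AbsRel":   np.mean(np.abs(d - d_star) / d_star),                 # Eq. (3)
        "RMSE":     np.sqrt(np.mean((d - d_star) ** 2)),                  # Eq. (4)
        "SqRel":    np.mean((d - d_star) ** 2 / d_star),                  # Eq. (5)
        "RMSE log": np.sqrt(np.mean((np.log(d) - np.log(d_star)) ** 2)),  # Eq. (6)
        "delta<1.25":   np.mean(ratio < 1.25),
        "delta<1.25^2": np.mean(ratio < 1.25 ** 2),
        "delta<1.25^3": np.mean(ratio < 1.25 ** 3),
    }
```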

Results and Discussion

Table 5 presents the road surface elevation reconstruction results of various depth estimation models. Our region of interest is defined as 15 to 30 meters ahead of the vehicle. Owing to the high-precision and dense point cloud labels, all models achieved strong performance on the metrics shown in Table 5. The experimental results show that the algorithms can accurately reconstruct irregular obstacles on unstructured roads, including muddiness, walls, and elevations, which demonstrates the effectiveness of our dataset in representing the structure of unstructured road surfaces. The absolute relative error is approximately 6% within the 15 to 30 meter depth range; in the elevation direction, an elevation of 30 cm therefore corresponds to an error of about 1.8 cm. This level of accuracy meets the performance requirements for unstructured road surface reconstruction. The experimental results of road surface elevation reconstruction are visualized in Fig. 5, which displays the elevation predictions from a BEV (Bird’s Eye View) perspective.

Table 5 Evaluation and comparison experiments of road surface elevation reconstruction.
Fig. 5
figure 5

Comparison of the performance of different road surface elevation reconstruction methods. The scenes cover a range of conditions, including sunny, dusty, cloudy, snowy, and nighttime conditions.

As shown in Fig. 5, the ground truth in the first row exhibits continuous elevation changes, and both ScaleDepth27 and NeWCRFs24 are able to reconstruct them. In the second row, the ground truth shows a distinct elevation on the right side of the road, and all algorithms are able to reconstruct the elevation in this area. Additionally, in the last row, the ground has significant undulations, which clearly demonstrates the effectiveness of the different algorithms in reconstructing uneven road surfaces. Reconstructing the road surface in front of the vehicle is crucial for effective vehicle speed control strategies, and road surface elevation reconstruction data provides a foundation for comfortable and safe driving. According to the experimental statistics, the relative error increases with distance, indicating higher accuracy at shorter ranges. This phenomenon is consistent with the observed data pattern: texture details are preserved at small depths but diminish at greater depths due to the perspective effect. This dataset is highly challenging, and future research will require further exploration to develop more advanced models for more accurate road surface elevation reconstruction.

In addition to evaluating road surface elevation reconstruction, we also evaluated the task of monocular depth estimation to verify the effectiveness of our dataset. Table 6 presents the monocular depth estimation results of different depth estimation models. Owing to the high-precision and dense point cloud labels, all models achieved notable performance across the metrics shown in Table 6. Consistent with the results of road elevation reconstruction, the relative error in monocular depth estimation increases with distance, indicating higher accuracy at closer ranges. We thus provide a depth estimation dataset for unstructured scenes, where the complex and varied backgrounds pose additional challenges to traditional depth estimation methods. Further research is required to develop advanced models for more accurate depth estimation.

Table 6 Evaluation and comparison experiments of depth estimation.

Usage Notes

This dataset also provides trajectory planning and speed planning annotations for the ego vehicle. Researchers can conduct studies related to scene perception and trajectory planning in unstructured scenes. The dataset is available for download under the CC BY 4.0 license.