Background & Summary

With the accelerating pace of urbanization and the increasing complexity of traffic issues, traditional urban traffic management methods (e.g., fixed surveillance cameras) can no longer meet the demands for efficient, comprehensive, and reliable monitoring. This is particularly true in complex and dynamic urban road scenes, where traditional monitoring methods often face limitations such as restricted viewpoints and incomplete information. UAV remote sensing technology1, with its high flexibility, broad coverage, and relatively low cost, has gradually become a key tool for addressing these issues.

However, despite the immense potential of UAV remote sensing images in many applications2,3,4,5,6,7, a single image modality often struggles to provide comprehensive information in complex environments8,9,10,11. This limitation is particularly evident in low-illumination, complex, and dynamically changing urban road scenes12. As shown in Fig. 1, RGB images perform poorly under low-illumination conditions. Although thermal infrared (TIR) images can provide effective thermal radiation information under low-illumination conditions, such as at night, they lack sufficient color and shape detail. Therefore, combining information from multiple modalities and leveraging the strengths of each has become a key issue13,14,15 in enhancing UAV visual perception capabilities.

Fig. 1
figure 1

(a) and (b) show heatmap visualization results; (c) illustrates the “artifact” phenomenon in TIR images. (a) In low-illumination environments, thermal radiation information from the TIR modality (e.g., pedestrians on the sidewalk) dominates. (b) In contrast, the color information from the RGB modality (e.g., “white” vehicles) becomes dominant. (c) Due to sunlight exposure and vehicle occlusion, the ground temperature drops below the ambient temperature after the vehicle leaves, causing an “artifact”. Such “artifacts” in the TIR modality are easily misinterpreted as real vehicles.

To achieve the fine-grained scene understanding critical for traffic management, such as identifying vehicles and pedestrians, semantic segmentation, the pixel-level classification of images, emerges as a fundamental task. In recent years, many semantic segmentation datasets8,16,17,18,19,20 captured from a UAV perspective have been released, significantly advancing the development of data-driven UAV semantic segmentation models; an attribute comparison of existing UAV semantic segmentation datasets and multimodal semantic segmentation datasets is shown in Table 1. However, these datasets face three key challenges that need to be addressed:

  • High scene redundancy. Existing UAV-based semantic segmentation datasets often suffer from information redundancy: many images16,18,19 are collected from the same or highly similar scenes, so consecutive frames exhibit minimal inter-frame variation. Aeroscapes16, for example, contains 3,269 images with ~23 captures per scene. This redundancy fundamentally limits the effective information gain per annotated sample, consequently constraining the generalization capability and real-world robustness of segmentation models in diverse operational scenarios.

  • Low data volume. Due to the wide spatial coverage of UAV-captured images, which typically contain many diverse object categories and exhibit complex, densely populated scenes, the manual annotation process for semantic segmentation is both time-consuming and labor-intensive. As a result, existing UAV-based semantic segmentation datasets (e.g., UDD17 and UAVid19) generally suffer from limited sample sizes. This shortage of large-scale annotated data hinders the ability to fully meet the demands of data-driven deep learning models, which require substantial amounts of training data to achieve optimal performance.

  • Limited robustness discussion. Existing UAV-based semantic segmentation datasets only include RGB images, which limits their ability to meet all-time monitoring requirements, particularly in low-illumination environments. Additionally, the stability of imaging sensors is often compromised during UAV flights, especially under high-speed, vibration-prone, or windy conditions. These conditions are commonly encountered in real-world applications and can easily degrade image quality.

Table 1 Overview of existing open access RGB-TIR and UAV-based single-modality semantic segmentation datasets.

To address the issues of existing UAV-based semantic segmentation datasets, as depicted in Fig. 2, we propose a multimodal UAV semantic segmentation dataset, termed Kust4K, which covers a wide range of urban road scenes and contains both RGB and thermal infrared (TIR) images. RGB and TIR modalities exhibit distinct advantages, and neither modality alone can provide reliable visual perception. As shown in Fig. 1(a,b), each modality provides unique feature information: RGB images offer color and texture details of scene contents, while TIR images capture the thermal radiation of objects. The complementary information of these two modalities enables more robust multimodal joint perception. Moreover, some TIR-exclusive studies10,21 face reliability limitations under specific operational conditions. As illustrated in Fig. 1(c), vehicle occlusion induces artifacts in TIR images, which are particularly problematic in UAV perspectives where such artifacts may be misinterpreted as actual vehicles. Consequently, our Kust4K dataset integrates both RGB and TIR modalities to mitigate these constraints. Samples of our Kust4K dataset under different lighting conditions in various urban road scenes, along with their corresponding semantic segmentation annotations, are shown in Fig. 3.

Fig. 2
figure 2

Samples from different open access UAV-based semantic segmentation datasets.

Fig. 3
figure 3

The Kust4K dataset includes samples from diverse scenes and under varying lighting conditions, from a well-lit day to a low-lit night. The first row displays the RGB images, the second row shows the TIR images, and the third row presents the semantic segmentation annotations, where each pixel is labelled with one of 8 object categories: Road, Building, Motorcycle, Car, Truck, Tree, Person, and Traffic Facilities.

Additionally, considering the potential image quality degradation that may occur during UAV flight15, we manually applied noise to a subset of the images. By introducing noise, we can evaluate the model’s performance under low-quality imaging conditions, further advancing the application and development of UAV semantic segmentation technologies in complex environments. For detailed information on the noise augmentation process, please refer to the Dataset Attributes and Statistics section. Kust4K is a UAV-based RGB-TIR semantic segmentation dataset, specifically designed for robust urban traffic scene understanding under challenging conditions, which serves as a valuable resource to facilitate advances in robust multimodal perception and intelligent transportation systems.

Methods

We utilized a hexacopter drone, model HET-3021C, equipped with both RGB and thermal infrared (TIR) dual-spectrum cameras as the multimodal image acquisition platform. The HET-3021C is a professional-grade unmanned aerial vehicle specifically designed for long-range, high-precision patrol and surveillance missions. The detailed specifications of the HET-3021C drone are summarized in Table 2. The construction of our Kust4K dataset consists of four main stages: data acquisition, data registration, data annotation, and dataset generation.

Table 2 Hexacopter drone HET-3021C setup.

Data acquisition

Previous datasets captured from UAV perspectives often consist of consecutive video frames with limited incremental information between frames, which poses challenges for effectively improving downstream semantic segmentation performance. To address this issue, we sampled synchronized RGB and TIR frames every 15 frames from the recorded dual-modal videos. This approach yielded a total of 4,024 paired RGB-TIR images. The RGB images have a resolution of 1920 × 1080, while the TIR images are captured at a resolution of 640 × 512.
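A minimal sketch of this frame-sampling step is given below, assuming the synchronized dual-modal recordings are available as separate RGB and TIR video files; the paths, output naming, and helper function are illustrative rather than the exact scripts used to build the dataset.

```python
import cv2

def sample_paired_frames(rgb_video_path, tir_video_path, out_dir, interval=15):
    """Save synchronized RGB-TIR frame pairs every `interval` frames."""
    rgb_cap = cv2.VideoCapture(rgb_video_path)
    tir_cap = cv2.VideoCapture(tir_video_path)
    index, saved = 0, 0
    while True:
        ok_rgb, rgb_frame = rgb_cap.read()
        ok_tir, tir_frame = tir_cap.read()
        if not (ok_rgb and ok_tir):
            break  # stop when either recording ends
        if index % interval == 0:
            cv2.imwrite(f"{out_dir}/rgb_{saved:05d}.png", rgb_frame)
            cv2.imwrite(f"{out_dir}/tir_{saved:05d}.png", tir_frame)
            saved += 1
        index += 1
    rgb_cap.release()
    tir_cap.release()
    return saved  # number of RGB-TIR pairs written
```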

Data registration

Due to inherent differences in imaging time and field of view between RGB and TIR sensors, registration of the collected RGB-TIR image pairs is necessary to ensure accurate spatial alignment. This challenge is particularly pronounced in urban road scenes, where the presence of fast-moving vehicles exacerbates misalignment between the two modalities. Proper registration is critical to enable effective multimodal fusion and to maintain the integrity of semantic segmentation labels across modalities. To address this, we first applied the Scale-Invariant Feature Transform (SIFT) algorithm to extract key points from the two modalities and subsequently computed the transformation matrix based on the matched feature points to align the RGB and TIR images. However, since the Kust4K dataset was collected under varying lighting conditions, low-illumination environments posed significant challenges for reliable key point detection; in such cases, we resorted to manual registration. During this process, the RGB-TIR images were cropped and resized to a resolution of 640 × 512 pixels to match the TIR image resolution. The RGB-TIR multimodal image registration process is shown in Fig. 4.
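A minimal OpenCV sketch of the SIFT-based alignment step is shown below; the matching thresholds are illustrative assumptions, and image pairs for which too few reliable matches are found fall back to manual registration as described above.

```python
import cv2
import numpy as np

def register_rgb_to_tir(rgb, tir, min_matches=10):
    """Estimate a homography from SIFT matches and warp the RGB image onto
    the TIR image plane (640 x 512). Returns None when too few reliable
    matches are found, in which case manual registration is used instead."""
    sift = cv2.SIFT_create()
    gray_rgb = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)
    gray_tir = cv2.cvtColor(tir, cv2.COLOR_BGR2GRAY) if tir.ndim == 3 else tir
    kp1, des1 = sift.detectAndCompute(gray_rgb, None)
    kp2, des2 = sift.detectAndCompute(gray_tir, None)
    if des1 is None or des2 is None:
        return None
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des1, des2, k=2)
    # Lowe's ratio test to keep only distinctive matches
    good = [m[0] for m in knn if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]
    if len(good) < min_matches:
        return None
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if H is None:
        return None
    h, w = gray_tir.shape[:2]
    return cv2.warpPerspective(rgb, H, (w, h))  # RGB aligned to the TIR frame
```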

Fig. 4
figure 4

RGB-TIR dual-modal image acquisition and registration workflow. This diagram shows the process of aligning RGB and TIR images captured by a UAV, using SIFT for key point extraction followed by manual registration to ensure accurate spatial alignment between the two modalities.

Data annotation

The Kust4K dataset includes RGB images captured under varying lighting conditions. Relying solely on a single modality, particularly in low-illumination environments, makes it challenging to achieve high-precision, high-quality semantic segmentation annotations. Therefore, it is essential to simultaneously leverage information from both modalities for accurate semantic segmentation labelling. Besides, the semantic segmentation annotation tool Labelme22 integrates the Segment Anything Model (SAM)23, significantly enhancing the efficiency of semantic segmentation labelling. This is particularly valuable for UAV remote sensing images, which often cover large areas and include a wide variety of content categories. Based on this, we have designed a systematic semi-automated annotation workflow for the Kust4K multimodal semantic segmentation dataset. Considering the varying importance and frequency of occurrence of different content in real-world applications, we selected eight categories for annotation: Road, Building, Motorcycle, Car, Truck, Tree, Person, and Traffic Facilities. The annotation workflow for our Kust4K dataset is shown in Fig. 5, which consists of four stages:

  • Multimodal Channel Aggregation. To simultaneously acquire both RGB and TIR information for annotation in Labelme22, we first aggregated the RGB and TIR images. Specifically, we split the TIR image into three channels and calculated the average value of these channels, which was then used as the fourth channel and combined with the RGB image. The resulting four-channel image carries both the color information from the RGB image and the thermal radiation information from the TIR image (a minimal sketch of this aggregation is given after this list).

  • SAM-based Initial Masking. We use SAM in Labelme22 to generate initial semantic segmentation masks for the multimodal four-channel images. Through interactive prompt points, initialization masks are generated for the corresponding contents, as shown in Fig. 5(b).

  • Expert Annotation Refinement. Expert annotators refine the initial labels generated by SAM, referring to information such as color and texture from the original RGB image, and temperature and thermal radiation from the TIR image. This process includes boundary refinement, category correction, and other optimizations. Particular attention is given to small objects, such as motorcycles and persons.

  • Experts Cross Verification. Finally, we adopted a cross-validation approach, in which expert annotators validate and correct the semantic segmentation labels created by other annotators. The final review is conducted by a reviewer who was not involved in the previous annotation process. This iterative cycle of validation, correction, and review continues until the reviewer determines that the quality of the semantic segmentation labels meets the required standards. Only then are the validated samples included in the final Kust4K dataset.
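A minimal sketch of the channel aggregation step referenced above, assuming the registered TIR frames are stored as three-channel images; the file paths are illustrative.

```python
import cv2
import numpy as np

def aggregate_rgb_tir(rgb_path, tir_path):
    """Build a four-channel image: the RGB channels plus the
    channel-averaged TIR image as the fourth channel."""
    rgb = cv2.imread(rgb_path, cv2.IMREAD_COLOR)   # H x W x 3
    tir = cv2.imread(tir_path, cv2.IMREAD_COLOR)   # TIR stored as 3 identical channels
    tir_mean = tir.astype(np.float32).mean(axis=2)         # average the three TIR channels
    return np.dstack([rgb, tir_mean.astype(np.uint8)])     # H x W x 4
```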

Fig. 5
figure 5

Annotation workflow for Kust4K semantic segmentation dataset in urban scenes. (a) Aggregation of multimodal four-channel images. (b) Initial annotations generated using Segment Anything (SAM)23. (c) Expert manual correction of the initial annotations produced by SAM23. (d) Expert cross validation and verification of the annotated information.

Generation of dataset

The semantic segmentation mask annotations exported from Labelme22 in JSON format are then converted into PNG mask images suitable for semantic segmentation tasks. Specifically, according to the JSON file, the corresponding mask annotations (segmentation points and categories) are mapped to the PNG mask image using distinct pixel values: “Road” is assigned a pixel value of 1, “Building” 2, “Motorcycle” 3, “Car” 4, “Truck” 5, “Tree” 6, “Person” 7, and “Traffic Facilities” 8. Any unannotated areas are considered “unlabeled” and are assigned a pixel value of 0.
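The following is a minimal conversion sketch based on Labelme's standard JSON export (a list of "shapes", each with "label" and "points" fields); it is an illustrative reimplementation rather than the exact script used to generate the released masks.

```python
import json
import numpy as np
import cv2

# Category-to-pixel-value mapping used for the Kust4K masks.
CLASS_MAP = {"Road": 1, "Building": 2, "Motorcycle": 3, "Car": 4,
             "Truck": 5, "Tree": 6, "Person": 7, "Traffic Facilities": 8}

def json_to_mask(json_path, height=512, width=640):
    """Rasterize Labelme polygon annotations into a single-channel mask.
    Unannotated pixels keep the value 0 ("unlabeled")."""
    with open(json_path, "r") as f:
        record = json.load(f)
    mask = np.zeros((height, width), dtype=np.uint8)
    for shape in record.get("shapes", []):          # assumes polygon annotations
        label = shape["label"]
        if label not in CLASS_MAP:
            continue
        polygon = np.array(shape["points"], dtype=np.int32)
        cv2.fillPoly(mask, [polygon], CLASS_MAP[label])
    return mask

# Example: cv2.imwrite("00001D.png", json_to_mask("00001D.json"))
```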

Data Records

The Kust4K24 dataset is publicly available on figshare. It is designed to provide a reliable resource for data-driven UAV visual perception, particularly for developing robust multimodal semantic segmentation models. This section introduces the image file naming convention and the dataset splits.

File naming

The image naming format we define is “NNNNNT.png”, where “NNNNN” represents the image’s sequence number, starting from “00001” to ensure that each image has a unique identifier. The “T” denotes the time of image acquisition, with “D” indicating daytime and “N” indicating nighttime. For the same scene, the naming of the RGB and TIR images remains consistent, ensuring a clear correspondence between the RGB and TIR images.
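A minimal parsing sketch based on this naming rule; the helper function is illustrative and not part of the released dataset tooling.

```python
import re

FILENAME_PATTERN = re.compile(r"^(\d{5})([DN])\.png$")

def parse_name(filename):
    """Return (sequence_number, 'day' or 'night') for a Kust4K image name."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename}")
    seq, flag = match.groups()
    return int(seq), "day" if flag == "D" else "night"

# Example: parse_name("00001D.png") -> (1, "day")
```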

These detailed annotations and statistical analyses drive the development of UAV-based semantic segmentation technology in intelligent urban traffic management.

Figure 6(a) presents a statistical analysis of the samples under different lighting conditions in the Kust4K dataset. Additionally, considering the potential image quality degradation that may occur in real-world deployments, we applied noise processing to a subset of the Kust4K dataset to enhance the dataset’s diversity and improve the model’s adaptability to low-quality data. Specifically, as shown in Fig. 6(b), for RGB failure samples, we set all pixel values to 0 to simulate scenes under complete darkness or extremely low light conditions. For TIR failure samples, we introduced Gaussian blur to simulate image blurring caused by focusing failure or vibrations during high-speed UAV flight. For each RGB-TIR image pair, we only applied noise to one modality to ensure that the model could still extract information from the other modality. This design not only simulates potential modality failures in real-world applications but also helps prevent the model from becoming overly reliant on any single modality during multimodal learning. The experimental results related to modality failure can be found in the Technical Validation section.
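A minimal sketch of the two failure simulations described above; the Gaussian-blur kernel size and sigma are illustrative choices, not necessarily the values used to build Kust4K.

```python
import cv2
import numpy as np

def simulate_rgb_failure(rgb):
    """Simulate an RGB sensor failure by zeroing all pixel values
    (complete darkness or extremely low light)."""
    return np.zeros_like(rgb)

def simulate_tir_failure(tir, kernel_size=15, sigma=5.0):
    """Simulate TIR degradation from focusing failure or vibration during
    high-speed flight with a Gaussian blur (kernel size must be odd)."""
    return cv2.GaussianBlur(tir, (kernel_size, kernel_size), sigma)
```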

Fig. 6
figure 6

Kust4K dataset attributes and statistical analysis. (a) Statistics of images and RGB-TIR samples under different lighting levels. (b) Statistics of modality failure samples.

Dataset splits

We divided the Kust4K dataset into training, testing, and validation subsets with a 7:2:1 split ratio. To prevent model bias caused by illumination imbalance, we balanced the samples based on illumination intensity. The training set contains 2814 RGB-TIR image pairs (1758 taken in day, 1056 taken in night), the testing set contains 807 RGB-TIR image pairs (504 taken in day, 303 taken in night), and the validation set contains 403 RGB-TIR image pairs (252 taken in day, 151 taken in night). Besides, the training set contains 528 RGB failure samples and 316 TIR failure samples, the testing set contains 151 RGB failure samples and 91 TIR failure samples, and the validation set contains 74 RGB failure samples and 46 TIR failure samples.
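One possible way to reproduce such an illumination-balanced 7:2:1 split is sketched below; grouping the image pairs into day and night lists and the random seed are illustrative assumptions, not the exact procedure used to create the released splits.

```python
import random

def balanced_split(day_pairs, night_pairs, ratios=(0.7, 0.2, 0.1), seed=0):
    """Split day and night pairs separately with the same ratios so each
    subset preserves the overall day/night proportion."""
    rng = random.Random(seed)
    splits = {"train": [], "test": [], "val": []}
    for pairs in (day_pairs, night_pairs):
        pairs = pairs[:]            # avoid shuffling the caller's list
        rng.shuffle(pairs)
        n = len(pairs)
        n_train, n_test = int(ratios[0] * n), int(ratios[1] * n)
        splits["train"] += pairs[:n_train]
        splits["test"] += pairs[n_train:n_train + n_test]
        splits["val"] += pairs[n_train + n_test:]
    return splits
```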

Data Overview

Dataset attributes and statistics

The Kust4K dataset covers a variety of urban road scenes and is designed to provide reliable UAV-based semantic segmentation data for intelligent urban traffic management. The dataset annotates key categories in urban traffic management, offering comprehensive support for traffic scene analysis. The Kust4K dataset includes 8 object categories: “Road”, “Building”, “Motorcycle”, “Car”, “Truck”, “Tree”, “Person”, and “Traffic Facilities”, with all unlabeled regions designated as background. The Kust4K dataset thus provides reliable semantic segmentation annotations for researchers in the field. In Fig. 7, we present a statistical visualization of the pixel counts for the annotations of different categories in the Kust4K dataset. This allows us to visually observe the distribution of each category within the dataset, helping researchers better understand the composition of our Kust4K dataset.

Fig. 7
figure 7

Statistics on the number of labelled pixels for different categories in Kust4K dataset.

Technical Validation

Task definition

Semantic segmentation is one of the core tasks in computer vision, with the primary goal of assigning a precise semantic label to each pixel in a scene, representing the object category or region type it belongs to. Kust4K dataset contains 4,024 pairs of RGB-TIR images captured from a UAV perspective, aimed at providing a detailed analysis of complex urban road scenes under varying lighting conditions.

Models and training details

Semantic segmentation models

We selected 8 state-of-the-art semantic segmentation methods25,26,27,28,29,30,31,32 for comparison and evaluation, including 3 CNN-based methods (UNet30, UperNet31 and RTFNet25), 4 Attention-based methods (FEANet26, EAEFNet27, CMNeXt28 and CMX29), and the latest Mamba-based method (Sigma32). Traditional single-modality semantic segmentation methods (UNet30 and UperNet31) lack modality fusion and interaction modules. To address this, we replicated the original single-modality branches and fused the two modalities by element-wise addition of their feature matrices. These 8 methods cover a range from traditional single-modality methods to multimodal fusion techniques, aiming to thoroughly explore and evaluate the segmentation performance of different algorithms in complex urban road scenes.
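A minimal PyTorch sketch of this element-wise addition fusion for the duplicated single-modality branches; the encoder and decoder modules are placeholders standing in for the UNet/UperNet components, not the exact modules used in the experiments.

```python
import torch
import torch.nn as nn

class DualBranchAdditionFusion(nn.Module):
    """Duplicate a single-modality encoder into RGB and TIR branches and
    fuse them by element-wise addition of the extracted feature maps."""
    def __init__(self, rgb_encoder: nn.Module, tir_encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.rgb_encoder = rgb_encoder   # original single-modality encoder
        self.tir_encoder = tir_encoder   # replicated branch for the TIR input
        self.decoder = decoder

    def forward(self, rgb: torch.Tensor, tir: torch.Tensor) -> torch.Tensor:
        rgb_feat = self.rgb_encoder(rgb)
        tir_feat = self.tir_encoder(tir)
        fused = rgb_feat + tir_feat      # element-wise addition fusion
        return self.decoder(fused)
```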

Training details

All models were built using the PyTorch framework, and experiments were conducted on a single Nvidia GeForce RTX 4090 GPU with 24 GB of VRAM; the resolution of all input RGB and TIR images is 640 × 512. To ensure the fairness and reliability of the experimental results, we strictly adhered to the hyperparameter configurations recommended in the original papers for each method, including optimizer type, learning rate, loss functions, and other relevant training details. In terms of data augmentation, all methods used horizontal flipping with a probability of 0.5 during training; this was the only data augmentation strategy applied, to ensure consistency across models. Furthermore, the pre-trained weights for all models were obtained from the official publicly available checkpoints to ensure consistency in the starting point of model training.
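A minimal sketch of the paired horizontal-flip augmentation, which must be applied consistently to the RGB image, the TIR image, and the label mask; the channels-first tensor layout with width as the last dimension is an assumption.

```python
import random
import torch

def paired_horizontal_flip(rgb: torch.Tensor, tir: torch.Tensor,
                           mask: torch.Tensor, p: float = 0.5):
    """Flip all three tensors along the width axis with probability p,
    keeping the RGB image, TIR image, and label mask spatially aligned."""
    if random.random() < p:
        rgb = torch.flip(rgb, dims=[-1])
        tir = torch.flip(tir, dims=[-1])
        mask = torch.flip(mask, dims=[-1])
    return rgb, tir, mask
```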

Performance evaluation metrics

Following previous multimodal semantic segmentation works26,27,28, we employ mean intersection over union (mIoU) as the evaluation metric to assess the performance of different semantic segmentation models on the Kust4K dataset. mIoU averages, over all categories, the overlap ratio between the ground-truth masks and the predicted segmentation. The mathematical definition of mIoU is as follows:

$$mIoU=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\frac{T{P}_{i}}{T{P}_{i}+F{P}_{i}+F{N}_{i}}$$

where N represents the number of categories, \(T{P}_{i}={\sum }_{m=1}^{M}{P}_{ii}^{m}\), \(F{P}_{i}={\sum }_{m=1}^{M}{\sum }_{j=1,j\ne i}^{N}{P}_{ji}^{m}\), and \(F{N}_{i}={\sum }_{m=1}^{M}{\sum }_{j=1,j\ne i}^{N}{P}_{ij}^{m}\), where M is the number of samples in the test set. For the test sample m, \({P}_{ii}^{m}\) is the number of pixels correctly classified as class i, \({P}_{ji}^{m}\) is the number of pixels misclassified as class i but actually belonging to class j, and \({P}_{ij}^{m}\) is the number of pixels misclassified as class j but actually belonging to class i.
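A minimal NumPy sketch of this metric; indexing the eight categories as 1-8 and ignoring unlabeled pixels (value 0) follows the mask encoding described above, while the exact handling in the benchmarked codebases may differ.

```python
import numpy as np

def mean_iou(preds, targets, num_classes=8, ignore_index=0):
    """Compute mIoU over a test set by accumulating per-class TP, FP, and FN
    across all samples; pixels labelled `ignore_index` are skipped."""
    tp = np.zeros(num_classes, dtype=np.int64)
    fp = np.zeros(num_classes, dtype=np.int64)
    fn = np.zeros(num_classes, dtype=np.int64)
    for pred, target in zip(preds, targets):   # each is an H x W integer array
        valid = target != ignore_index
        p, t = pred[valid], target[valid]
        for c in range(1, num_classes + 1):    # classes are labelled 1..8
            tp[c - 1] += np.sum((p == c) & (t == c))
            fp[c - 1] += np.sum((p == c) & (t != c))
            fn[c - 1] += np.sum((p != c) & (t == c))
    iou = tp / np.maximum(tp + fp + fn, 1)     # avoid division by zero
    return iou.mean()
```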

Experimental result analysis

Table 3 and Fig. 8 present the quantitative and qualitative experimental results of different semantic segmentation methods on the Kust4K dataset. We use an element-wise addition feature fusion module for the CNN-based single-modality UNet30 and UperNet31. While UNet30 and UperNet31 perform reasonably well in simple scenes, there remains a significant performance gap when handling complex objects and low-contrast details, compared to the well-designed multimodal fusion modules in multimodal semantic segmentation methods. The CNN-based multimodal semantic segmentation method RTFNet25 improves mIoU by 7.3% compared to UNet30 and by 4.6% compared to UperNet31. This improvement is attributed to RTFNet’s multi-level cross-modal feature fusion design, which incorporates multi-scale TIR features into the RGB features, thereby enhancing the overall performance.

Table 3 Quantitative experiments of different semantic segmentation methods on Kust4K dataset.
Fig. 8
figure 8

Qualitative experiments of different semantic segmentation methods on Kust4K dataset.

Besides, small objects are often challenging to segment accurately, particularly under low-illumination conditions; this is especially evident in our Kust4K dataset, which contains numerous small objects. As shown in Table 3, compared to CNN-based methods, Attention-based and Mamba-based methods exhibit better segmentation performance for small objects such as Motorcycle and Traffic Facilities. This improvement is attributed to the long-range feature interaction design in Attention-based (global attention) and Mamba-based (state space recurrence) methods, which allows for a deeper understanding of the advantageous features of both modalities. By fusing the color and shape information from the RGB images with the thermal radiation information from the TIR images, these methods allow the two modalities to compensate for each other’s weaknesses, thereby enhancing the model’s ability to segment small objects.

In Fig. 9, we present qualitative comparisons of different methods under single-modality failure scenes. As observed from Fig. 9, UperNet31 demonstrates notably weaker segmentation performance compared to other multimodal semantic segmentation methods in cases of modality failure. For example, UperNet31 exhibits misclassification of vehicles when the TIR modality fails and reduced accuracy in “tree” segmentation when the RGB modality fails. In contrast, the other multimodal semantic segmentation methods27,32, benefiting from their specially designed adaptive feature fusion modules, exhibit stronger robustness against modality failure. This indicates that modality failure negatively impacts semantic segmentation tasks, particularly when the model lacks the ability to adaptively adjust the weights of each modality.

Fig. 9
figure 9

Qualitative comparisons of different methods under single-modality failure scenarios.

Modality robustness analysis

To further analyse the effect of different modal images on model robustness, we carried out experiments comparing unimodal and multimodal inputs, followed by a discussion of the case in which a particular modality fails.

Modality-specific analysis

UNet30, CMX29 and Sigma32 were trained and tested using single RGB images, single TIR images, and RGB-TIR image pairs (multimodal input). As shown in Table 4, single-modality inputs generally perform worse, especially RGB images. In contrast, TIR images are less sensitive to lighting conditions and can provide effective target cues by capturing the thermal radiation of objects, resulting in significantly better performance than unimodal RGB images. When using RGB-TIR image pairs as multimodal inputs, the models can simultaneously leverage the color and texture information from RGB images and the thermal radiation data from TIR images, compensating for the limitations of single-modality inputs. As shown in Table 4, when RGB + TIR is used as input, UNet30, CMX29 and Sigma32 achieve optimal performance. Moreover, the “RGB + TIR” multimodal input yields a larger improvement for the multimodal methods than for the single-modality method: for UNet30, the “RGB + TIR” input improves performance by 3.6% compared to TIR alone, while CMX29 and Sigma32 show improvements of 5.1% and 6.3%, respectively. This indicates that multimodal methods (CMX29 and Sigma32) are better at utilizing information from different modalities.

Table 4 Comparison of the performance of different methods and input modalities on the Kust4K dataset, using mIoU (%) as the evaluation metric.

Modality-failure analysis

In real-world applications, RGB and TIR imaging sensors may experience modality failure due to adverse weather conditions, hardware malfunctions, or other factors. Therefore, testing the robustness of the model in cases of single-modality failure is crucial. We conducted experiments on UNet30, CMX29 and Sigma32 trained with Kust4K multimodal images, specifically testing them on modality failure samples from the test set (151 RGB failure samples and 91 TIR failure samples); the results are presented in Table 5. The segmentation performance of all three methods decreases in modality failure scenes. Compared to their overall performance, the performance gap between UNet30, CMX29 and Sigma32 narrows when the TIR modality fails, indicating that the TIR modality plays a dominant role in feature fusion. While multimodal methods effectively leverage the strengths of both modalities, they are more susceptible to the failure of the dominant modality. In Fig. 10, we present the semantic segmentation results of CMX29 using single-modality and multimodal inputs when the RGB or TIR modality fails. The segmentation results indicate that when the failed image is used as a single-modality input, the model is almost unable to perform reliable semantic segmentation. When only the non-failed modality is used for segmentation, the model performs better and can accurately segment the scene.

Table 5 Comparison of the performance of different methods under different modality failure scenarios on the Kust4K dataset, using mIoU (%) as the evaluation metric.
Fig. 10
figure 10

Qualitative experiments of CMX on modality failure samples.

When RGB + TIR is used as the input for CMX29 under modality failure, the segmentation results are inferior to those obtained with the normal (non-failed) unimodal input (“Tree” and “Traffic Facilities” in the RGB failure case). However, the multimodal interaction and fusion module in the CMX29 model is still able to extract effective information from the non-failed modality, maintaining relatively stable performance. This result demonstrates that even when one modality fails, the model can still effectively use the data from the remaining modality, partially compensating for the impact of the failed modality and highlighting the important role of multimodal input in enhancing model robustness.