Abstract
The classification and identification of forest tree species is of great value in the study of species diversity and forest monitoring. With the development of emerging technologies, the combination of remote sensing imagery and deep learning has become an important means of multi-label image classification. However, because the differences between tree species images are subtle, manual labeling is laborious, and suitable datasets are difficult to obtain, studies on multi-label classification of tree species images remain scarce. Therefore, taking the TreeSatAI dataset as an example, we propose a multi-branch multi-label image classification model (MMTSC) designed specifically for multi-source remote sensing data and use it to classify and identify 15 tree species in the dataset. In a complex forest stand scenario with unbalanced data, the model reaches an F1-Score of about 72% and a Precision of about 82%. Visualizations of the confusion matrix and Grad-CAM heat maps further verify the model's recognition ability across categories. To comprehensively evaluate performance, we compared the model with other state-of-the-art (SOTA) multi-label image classification methods and conducted a series of ablation experiments. The results show that the MMTSC model outperforms the other SOTA methods in F1-Score, Precision, Recall, and mAP. In addition, we compared the model's backbone, DenseNet121, with the classic EfficientNet-B0, ConvNeXt-Tiny, ResNet-18, MobileNetV3, and RegNetX-800MF architectures; DenseNet121 performed best on this task, verifying its effectiveness and adaptability as a backbone network. Finally, we use the outputs of the deep learning-based multi-label tree species classification model for biomass estimation, providing practical suggestions for relevant institutions and thereby contributing to the scientific management of forest resources and the improvement of carbon sequestration capacity.
Introduction
Throughout human history, forests around the world have profoundly influenced human survival and development1. They provide extremely important ecosystem services, such as the provisioning of food and timber, the cultural service of recreation and leisure, the regulating service of balancing climate change, and the supporting service of nutrient cycling2,3. In recent years, however, the health and development of forests have been seriously threatened by extreme anthropogenic and natural disturbances4,5. It is therefore urgent to carry out forest monitoring to obtain real-time data on forest resources and to inform practical measures by relevant organizations.
The classification and identification of forest tree species is an important part of monitoring operations and can accurately reveal the distribution of different tree species and their changing trends6,7,8. Satellite remote sensing, a rapidly developing technology, has been widely used to map the spatiotemporal distribution of tree species over large forest areas9,10,11,12. Forest monitoring systems supported by multi-source remote sensing data can effectively extract information of significant economic and ecological value. However, a single remote sensing image comprises many spectral bands and has a large, complex data structure, and the tree species within it are densely distributed, so classification and recognition by manual means alone is highly challenging. To classify remote sensing images efficiently and accurately, the combination of computer technology and image recognition algorithms has become an important research direction13,14,15,16.
Early studies on tree species classification predominantly relied on machine learning algorithms such as Support Vector Machines (SVM) and Random Forests (RF) for prediction and classification17,18,19. Today, deep learning technologies represented by convolutional neural networks (CNN) show superior performance in image classification and segmentation compared with machine learning methods20,21. The key difference from SVM, RF, and other machine learning methods is that CNN-based deep learning can automatically extract feature information from tree species images, reduce the dependence on manual feature engineering, and achieve high-precision prediction and classification with the help of spatial-spectral structural information. One study used a dense convolutional network (DenseNet) model to classify leaf images of 185 tree species, achieving an accuracy of over 90% on the test set22. In addition, a comparison of the recognition performance of CNN and traditional machine learning methods in eastern South American forests23, as well as an evaluation of ResNet and the light gradient boosting machine (LightGBM) on different tree species in German forests using multi-source datasets24, consistently demonstrated that CNN outperformed the other methods in classification and recognition accuracy. As CNN models were studied in greater depth, scholars began to introduce spatial and channel attention mechanisms, which focus on the discriminative areas in an image and enhance the recognition of sparsely distributed tree species, thereby improving the overall prediction accuracy25. One study investigated natural secondary forests in northern China and integrated the Convolutional Block Attention Module (CBAM) into various CNN architectures, obtaining satisfactory F1 scores26. The incorporation of CSF as a spatial attention mechanism also significantly improved tree crown detection27. Attention mechanisms such as multi-head attention28 and temporal attention29 are likewise widely used to improve tree species recognition accuracy. At the same time, multimodal learning methods have attracted increasing attention: fusing remote sensing data from drones, hyperspectral imagery, radar, and satellites has shown great potential for single-tree identification in forests30,31, providing a new path for multi-label forest tree species classification and identification.
In summary, most studies on tree species recognition focus on single-species classification, that is, assigning a single label to a remote sensing image containing only one tree species and predicting a specific category. Research on multi-label classification of remote sensing images containing multiple tree species remains relatively limited, with existing studies primarily addressing land use and built-up areas32,33,34,35. In natural ecosystems, however, mixed forests in which tree species coexist are far more common, and single-label tree species image classification cannot meet actual needs. In contrast, multi-label tree species image classification can not only improve classification accuracy but also provide more scientific support for biodiversity research and forest management24. Yet existing remote sensing data of forest tree species often suffer from severe class imbalance, and the data usually come from different sensors, making it difficult to effectively combine multi-source data to improve classification performance36. Moreover, because images covering multiple tree species are difficult to collect and require extensive manual annotation, open datasets for multi-label tree species image classification are extremely scarce. It is therefore of great significance and necessity to carry out multi-label classification research on forest tree species images.
In this study, we focus on the joint identification of multiple tree species in images and propose a multi-branch multi-label tree species classification model (MMTSC) to classify and identify diverse forest tree species in complex environments using multimodal remote sensing datasets. We further apply the classification results to the accurate estimation of the biomass of ecological and economic forests, promoting the closed-loop application of tree species monitoring in actual forest management. The main contributions of this paper are as follows:
1. The MMTSC model was constructed. The architecture is designed specifically for multi-source remote sensing data, with a multi-branch structure for aerial drone imagery, Sentinel-1 (S1), and Sentinel-2 (S2) inputs. Each branch has a customized feature extraction module, and features are fused through a modality-aware strategy that combines learnable weights, multi-head attention, and residual connections. Compared with conventional concatenation, the proposed model dynamically captures key interactions between modalities and significantly enhances multi-label classification performance in complex forest stand scenarios.
2. Excellent classification results were achieved. Fifteen tree species were monitored under extremely unbalanced class conditions, with an F1-Score of 72%, surpassing other SOTA methods. A series of ablation experiments verified the rationality and effectiveness of the model design.
3. The application potential and generalization ability of deep learning models in practical forest resource management were demonstrated. Based on the classification results and actual inventory data, we obtain accurate biomass estimates for ecological and economic tree species, put forward practical suggestions for regional forest operation and management, and provide a new path for subsequent carbon storage estimation.
Materials
Study area
We conducted the experiments using the publicly available TreeSatAI dataset24. The dataset was collected in Lower Saxony, in northwestern Germany, between latitudes 51.29° N and 53.89° N and longitudes 6.39° E and 11.60° E, as shown in Fig. 1. The region has a temperate maritime climate with mild temperatures, abundant precipitation, and flat terrain at an average elevation of about 50 m. It is mainly covered by deciduous forest tree species such as Fagus and Quercus.
TreeSatAI dataset research area. The map was generated using ArcGIS 10.8 (Esri, https://www.esri.com/en-us/arcgis/about-arcgis/overview). Administrative boundaries were sourced from the GADM database (version 3.6, https://gadm.org/download_country_v3.html), and elevation data were retrieved from the Geospatial Data Cloud (http://www.gscloud.cn).
Dataset
The TreeSatAI dataset includes three types of image data: Aerial, Sentinel-1 (S1), and Sentinel-2 (S2), collected between 2015 and 2020, with varying image dimensions such as 60 × 60 m and 120 × 120 m. As shown in Table 1, the dataset also provides detailed information on spectral bands and spatial resolution. For consistency, we used only the 60 × 60 m images from each source in our experiments.
In the early stage of building the model, we programmatically read and preprocess the TreeSatAI dataset, loading 18,948 single-species images and 31,433 multi-species images and generating a multi-hot label vector for each image, in which each dimension independently represents one category and an image may carry one or more species labels. To enlarge the receptive field, all 50,381 images were uniformly resized to 224 × 224. After resampling, the labeled dataset was randomly split into a training set and a validation set at a ratio of 7:3.
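As an illustration, a minimal PyTorch-style sketch of this preprocessing (the helper names and the dummy tensors are ours, not part of the released dataset tooling):

import torch
import torch.nn.functional as F
from torch.utils.data import TensorDataset, random_split

SPECIES = ["Abies", "Acer", "Alnus", "Betula", "Cleared", "Fagus", "Fraxinus",
           "Larix", "Picea", "Pinus", "Populus", "Prunus", "Pseudotsuga",
           "Quercus", "Tilia"]  # the 15 TreeSatAI classes

def to_multi_hot(names):
    # each image maps to a 15-dim vector with a 1 for every species present
    y = torch.zeros(len(SPECIES))
    for n in names:
        y[SPECIES.index(n)] = 1.0
    return y

y = to_multi_hot(["Fagus", "Quercus"])  # e.g. a mixed Fagus/Quercus patch

# resize 4-band aerial patches to 224 x 224, then split 7:3
x = torch.rand(100, 4, 60, 60)          # dummy stand-in for real patches
x = F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)
labels = torch.zeros(100, len(SPECIES))
data = TensorDataset(x, labels)
train_set, val_set = random_split(data, [70, 30])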
Methods
Multi-branch multi-label image classification model
Based on three types of imagery with distinct characteristics (aerial photography, Sentinel-1, and Sentinel-2), we designed a multi-branch multi-label tree species classification model (MMTSC), which also handles images containing a single tree species. The overall framework is shown in Fig. 2. The model processes data from the different sources (Aerial, S1, S2) through separate branches. The procedure first concatenates the flattened feature \({X_{A1}}\) extracted by the aerial branch with the pooled and flattened wavelet feature \({X_{w1}}\). These concatenated features are then integrated with the flattened features \({X_{s1}}\) from the S1 branch and \({X_{s2}}\) from the S2 branch through the layerwise fusion module. Finally, the multi-label classification output is generated by the classification head, which comprises a dropout layer for regularization and a fully connected (linear) layer. Details are discussed in the following sections.
Aerial imaging branch
Aerial data consists of images obtained from multiple drone flights. Aerial images, especially high-resolution ones, usually carry rich, detailed, and well-structured semantic information. We therefore designed an improved version of DenseNet121, the DenseNet121CW model, which consists of the basic Dense Blocks, a fine-tuned convolutional block attention module (CBAMC), and a wavelet transform module, as shown in Fig. 3.
The significant advantage of DenseNet121 is that each layer is directly connected to all previous layers, so input features accumulate layer by layer22,37. Feature reuse not only effectively alleviates vanishing gradients but also reduces redundant feature maps, making the network more lightweight. We therefore customize DenseNet121 as the main part of the aerial branch and design the first convolutional layer with a 4-channel input \({F_{input}}\) to accommodate the four bands:
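In the notation defined below, this first convolution takes the standard cross-correlation form (written out here from those definitions):

\(F_1^{(c',i,j)} = \sum_{c=1}^{4} \sum_{m=0}^{6} \sum_{n=0}^{6} W_{conv0}^{(c',c,m,n)} \, F_{input}^{(c,i+m,j+n)}\)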
Among them, \(F_{1}^{{\left( {c^{\prime},i,j} \right)}}\) is the output of the first layer of convolution, \({W_{conv0}} \in {R^{64 \times 4 \times 7 \times 7}}\) is a 7 × 7 convolution kernel with 64 output channels.
By removing the classification head and using DenseNet121 as a feature extractor, the output is the extracted feature vector part \({F_{dense}}\), which provides the basis for subsequent fusion:
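Schematically, with the truncated backbone acting as a feature mapping:

\(F_{dense} = \mathrm{DenseNet121}\left( F_1 \right)\)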
Since the tree species in the images are relatively dense, CBAMC weighting is introduced after feature extraction to focus on key areas and adjust the channel attention weights38. We replace the fully connected layers in the channel attention with 1 × 1 convolutions to better process the features and keep them consistent with the standard convolutions:
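Following the standard CBAM formulation38, with the 1 × 1 convolutions \(W_1\) and \(W_2\) in place of the fully connected layers (the 7 × 7 spatial kernel \(W_{7 \times 7}\) is the usual choice and is assumed here):

\(M_c = \sigma\left( W_2\,\mathrm{ReLU}\left( W_1 F_{avg} \right) + W_2\,\mathrm{ReLU}\left( W_1 F_{max} \right) \right)\)
\(M_s = \sigma\left( W_{7 \times 7}\left[ F_{avgs};\, F_{maxs} \right] \right)\)
\(F' = M_c \otimes F_{dense}, \quad F'' = M_s \otimes F', \quad X_{A1} = \mathrm{Flatten}\left( F'' \right)\)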
Among them, \({M_c}\) and \({M_s}\) are the channel attention and spatial attention weights, respectively; \(\sigma\) is the Sigmoid activation function, used to normalize the channel weight \({M_c}\); \({F_{avg}}\) and \({F_{max}}\) are global average pooling and maximum pooling, \({F_{avgs}}\) and \({F_{maxs}}\) are channel-dimension average pooling and maximum pooling, and \({W_1}\) and \({W_2}\) are two 1 × 1 convolution kernels; the result is finally flattened into the vector \({X_{A1}}\).
The wavelet transform extracts features from the aerial images and stacks them into a multi-channel feature map \({F_{wavelet}}\). Through two convolutional layers producing \({F_{wavelet1}}\) and \({F_{wavelet2}}\), higher-dimensional features are learned to compensate for missing edge information, further improving classification performance:
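In the notation defined below, the wavelet pathway is:

\(F_{wavelet} = \mathrm{Stack}\left( cA,\, cH,\, cV,\, cD \right)\)
\(F_{wavelet1} = \mathrm{ReLU}\left( W_{wavelet1} * F_{wavelet} \right), \quad F_{wavelet2} = \mathrm{ReLU}\left( W_{wavelet2} * F_{wavelet1} \right)\)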
Among them, \(cA\) is the low-frequency component; \(cH, cV, cD\) are the high-frequency components; \(ReLU\) is the nonlinear activation function; and \({W_{wavelet1}}\) and \({W_{wavelet2}}\) are the convolution kernels of the first and second convolutional layers, respectively.
Finally, through pooling and flattening:
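A schematic form (adaptive average pooling is assumed as the pooling operator):

\(X_{w1} = \mathrm{Flatten}\left( \mathrm{AvgPool}\left( F_{wavelet2} \right) \right)\)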
Among them, \({X_{w1}} \in {R^{B \times 1280}}\) is a feature vector of fixed size.
Traditional wavelet features are mostly processed independently of the backbone. Building on this, we incorporate the concept of embedding39: the features \({X_{A1}}\) and \({X_{w1}}\) are deeply concatenated and embedded, and together with the subsequent branches they undergo fusion by dynamic weighting and multi-head attention.
Sentinel-1 and Sentinel-2 imaging branches
S1 data are backscatter satellite data provided by the European Space Agency (ESA), and S2 data are atmospherically corrected satellite data, also provided by ESA. Given the low resolution and multi-spectral characteristics of S1 and S2 imagery, we designed improved versions of Swin Transformer tiny, namely Swin Transformer tinyC1 and Swin Transformer tinyC2, to capture long-range dependencies and multi-scale features and thereby enhance overall classification performance. As shown in Figs. 4 and 5, these two models mainly comprise the basic Swin Transformer Block and the CBAMC module.
Swin Transformer tiny is an efficient vision model based on hierarchical shifted windows. It can handle the complex relationship between long-range and local areas in an image while reducing computational complexity and saving resources40. We therefore used Swin Transformer tiny as the main model of the S1 and S2 branches in our design.
For S1, low-resolution images usually exhibit strong surface structure and rich texture information but are also accompanied by substantial scattering noise; for S2, although the spatial resolution is relatively low, the global spectral information is more important. To suppress noise while retaining important structural texture features in the S1 branch, depthwise convolution (depthwise_conv) is first applied to the 3 input channels, expanding them to 48 channels. These channels are then integrated by a 1 × 1 pointwise convolution, reducing the number of channels back to 3:
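With the symbols defined below (the depthwise step is written per channel; the 3-to-48 expansion corresponds to a channel multiplier of 16):

\(X_1^{(c,i,j)} = \sum_{m,n} W_{dw}^{(c,m,n)}\, X_{in}^{(c,i+m,j+n)}\)
\(X_2^{(c',i,j)} = \sum_{c} W_{adj}^{(c',c)}\, X_1^{(c,i,j)}\)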
Among them, \(X_{1}^{{\left( {c,i,j} \right)}}\) is the output feature map after depthwise convolution; \(W_{{dw}}^{{\left( {c,m,n} \right)}}\) is the weight value of the depthwise convolution kernel in channel c; \(X_{{in}}^{{\left( {c,i+m,j+n} \right)}}\) is the pixel value of the input image in channel c and position \(\left( {i,j} \right)\); \(X_{2}^{{\left( {c^{\prime},i,j} \right)}}\) is the feature map after channel adjustment; and \(W_{{adj}}^{{\left( {c^{\prime},c} \right)}}\) is the channel-adjusted 1 × 1 convolution weight.
After normalization and extraction of deep image features, the CBAMC module described in the aerial branch above is applied to produce the weighted feature \({X_5}\):
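In the notation defined below (\(\epsilon\) is a small constant for numerical stability):

\(X_3^{(c,i,j)} = \gamma_c \, \frac{ X_2^{(c,i,j)} - \mu_c }{ \sqrt{ \sigma_c^2 + \epsilon } } + \beta_c\)
\(X_4 = \mathrm{SwinT}\left( X_3 \right), \quad X_5 = \mathrm{CBAMC}\left( X_4 \right)\)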
Among them, \(X_3^{(c,i,j)}\) is the batch-normalized feature of each channel; \({\mu _c}\) and \(\sigma _{c}^{2}\) are the mean and variance of the cth channel, respectively; \({\gamma _c}\) and \({\beta _c}\) are learnable scaling and shifting parameters, respectively; and \({X_4}\) is the deep feature extracted by the Swin Transformer.
Finally, the multi-dimensional feature map is flattened into a one-dimensional vector \({X_{s1}}\) for subsequent fusion:
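Schematically:

\(X_{s1} = \mathrm{Flatten}\left( X_5 \right)\)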
The S2 branch mirrors the design of the S1 branch, except that the depthwise convolution is removed and the first convolutional layer is adjusted to accept a 12-channel input through a 1 × 1 pointwise convolution; the multi-dimensional feature map is likewise flattened into a one-dimensional vector \({X_{s2}}\), laying the foundation for subsequent fusion.
Fusion branch
As an indispensable component of the MMTSC model, the fusion branch computes parallel features of different subspaces and improves the information expression ability of the model under the action of the multi-head attention mechanism. Therefore, we adjusted the contribution of each branch feature before feeding them into the multi-head attention module, based on the differences in image resolution and the relative importance of the structural information contained in the three types of data:
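A form consistent with the weights defined below (the exact concatenation order is our assumption):

\(X_{connect} = \mathrm{Concat}\left( \alpha_1 \left[ X_{A1};\, X_{w1} \right],\; \alpha_2 X_{s1},\; \alpha_3 X_{s2} \right)\)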
Among them, \({\alpha _1}, {\alpha _2}, {\alpha _3}\) are learnable weight parameters for the aerial, S1, and S2 branch features, respectively.
The multi-head attention mechanism is used to calculate the correlation between different branch features. In order to enhance the attention features while retaining the original feature information28, the output is residually connected with the dimensionality reduction:
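With the fusion weights defined below:

\(X_{fused} = \mathrm{MultiHead}\left( X_{connect} \right) + W_{fusion} X_{connect}\)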
Among them, \({W_{fusion}}\) is the weight matrix of the fully connected layer, and \({W_{fusion}}{X_{connect}}\) is the dimensionality-reduction operation applied to the residual path.
Finally, the prediction result Y of the multi-label classification tree species image is obtained:
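In the notation defined below:

\(Y = \sigma\left( W_{cls} X_{fused} \right)\)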
Among them, \(\sigma\) is the Sigmoid activation function, which is usually used to predict category probabilities in multi-label classification tasks, and its range is \(\left[ {0,1} \right]\); \({W_{cls}}\) is the weight matrix of the fully connected layer of the classification head.
Experimental setup
Our experimental environment is configured with an Nvidia RTX 4090 graphics card and a 64-bit Windows 11 operating system, and the network model is built with the PyTorch 2.0 deep learning framework.
As mentioned above, datasets for multi-label tree species classification are scarce. Our dataset of only about 50,000 labeled samples covers 15 tree species categories with extreme class imbalance, making it challenging to train a deep learning model from scratch. We therefore use the mixup technique to generate additional training samples by mixing existing ones. This not only alleviates, to a certain extent, the small-sample problem of minority categories, but also reduces the risk of the model overfitting individual samples41.
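A minimal sketch of mixup for multi-hot labels (λ is drawn from a Beta distribution as in the original formulation41; the α value here is illustrative):

import torch

def mixup(x, y, alpha=0.2):
    # x: batch of images (B, C, H, W); y: multi-hot labels (B, 15)
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0))          # random pairing within the batch
    x_mix = lam * x + (1.0 - lam) * x[idx]   # blend images
    y_mix = lam * y + (1.0 - lam) * y[idx]   # blend labels the same way
    return x_mix, y_mix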
To prevent overfitting, the AdamW optimizer with a weight decay of 1e-5 is used during training42. The learning rate was set to 1e-4 for the base parameters and 1e-3 for the fusion module to accelerate its learning. Furthermore, OneCycleLR replaces a fixed learning rate, enabling faster convergence early in training43. The learning rate is gradually increased to 1e-3 during the first 30% of training to avoid local optima, and then decreased to near zero over the remaining 70% to reduce oscillation and ensure stable convergence. The model is trained for 100 epochs with a batch size of 32, and the best-performing model on the validation set is saved.
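Under these settings, the optimizer and scheduler can be configured roughly as follows (a sketch: the stand-in modules and the step count are our assumptions, and a scalar max_lr is applied to both parameter groups for simplicity):

import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

model = nn.Sequential(nn.Linear(8, 15))   # stand-in for the MMTSC backbone
fusion = nn.Linear(15, 15)                # stand-in for the fusion module

optimizer = AdamW(
    [{"params": model.parameters(), "lr": 1e-4},    # base parameters
     {"params": fusion.parameters(), "lr": 1e-3}],  # fusion module learns faster
    weight_decay=1e-5,
)

epochs, steps_per_epoch = 100, 1102   # ~35,267 training images / batch size 32
scheduler = OneCycleLR(optimizer, max_lr=1e-3,
                       total_steps=epochs * steps_per_epoch,
                       pct_start=0.3)  # warm up over the first 30% of steps

for step in range(3):     # inside the training loop, after loss.backward():
    optimizer.step()
    scheduler.step()      # advance the one-cycle schedule every step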
In multi-label classification tasks, our dataset is characterized by a severe class imbalance, with no more than approximately four tree species per image. To address this issue and enhance the model’s learning ability under such conditions, an asymmetric loss function is used. By asymmetrically adjusting the gradients of positive and negative samples, the dominance of negative samples in the loss function is reduced, thereby improving the performance of the model in the case of class imbalance44:
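Combining the per-class terms defined below into one objective (the L2 form of the regularizer is our assumption):

\(L = - \sum_{i=1}^{K} \left[ y_i \left( 1 - \hat{y}_i \right)^{\gamma_{pos}} \log\left( \hat{y}_i \right) + \left( 1 - y_i \right) \hat{y}_i^{\,\gamma_{neg}} \log\left( 1 - \hat{y}_i \right) \right] + \lambda_{reg} \left\| \theta \right\|^2\)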
Among them, \(y, \hat{y}\) are the true label and predicted probability of the multi-label binary classification; \({\gamma _{pos}}\) is the gradient amplification factor for positive samples; \({\gamma _{neg}}\) is the gradient suppression factor for negative samples; \({\lambda _{reg}}\) is the regularization weight; and \(\theta\) are the trainable model parameters. For each category i, when \(y_i = 1\) it is a positive sample, with positive loss \(- y_i \left( 1 - \hat{y}_i \right)^{\gamma_{pos}} \log\left( \hat{y}_i \right)\); when \(y_i = 0\) it is a negative sample, with negative loss \(- \left( 1 - y_i \right) \hat{y}_i^{\,\gamma_{neg}} \log\left( 1 - \hat{y}_i \right)\).
Evaluation indicators
We use F1-Score Micro, F1-Score Weighted, Precision Micro, Precision Weighted, Recall Micro, Recall Weighted, and mean average precision (mAP) as evaluation indicators:
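In the notation defined below, these indicators take their standard forms (grouped here for compactness; the original presents them as Eqs. (19)–(27)):

\(Precision_{Micro} = \frac{\sum_{k=1}^{K} TP_k}{\sum_{k=1}^{K} \left( TP_k + FP_k \right)}, \quad Recall_{Micro} = \frac{\sum_{k=1}^{K} TP_k}{\sum_{k=1}^{K} \left( TP_k + FN_k \right)}, \quad F1_{Micro} = \frac{2 \cdot Precision_{Micro} \cdot Recall_{Micro}}{Precision_{Micro} + Recall_{Micro}}\)
\(Precision_k = \frac{TP_k}{TP_k + FP_k}, \quad Recall_k = \frac{TP_k}{TP_k + FN_k}, \quad F1_k = \frac{2 \cdot Precision_k \cdot Recall_k}{Precision_k + Recall_k}\)
\(Precision_{Weighted} = \sum_{k=1}^{K} \frac{N_k}{N}\, Precision_k, \quad Recall_{Weighted} = \sum_{k=1}^{K} \frac{N_k}{N}\, Recall_k, \quad F1_{Weighted} = \sum_{k=1}^{K} \frac{N_k}{N}\, F1_k\)
\(mAP = \frac{1}{K} \sum_{k=1}^{K} AP_k\), where \(AP_k\) is the average precision of class k.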
These metrics are calculated based on true positives (TP), where the actual value is positive and predicted as positive; false positives (FP), where the actual value is negative but predicted as positive; false negatives (FN), where the actual value is positive but predicted as negative; and true negatives (TN), where the actual value is negative and predicted as negative. They additionally involve the total number of samples (N), the total number of classes (K), the class index (k), and the number of samples in class k (\({N_k}\)).
Among them, Precision Micro measures the overall prediction accuracy of the model, while Precision Weighted reflects its recognition ability in each category. Recall Micro evaluates the overall recall over all positive samples, while Recall Weighted reflects the model's ability to identify true positives under class imbalance. F1-Score Micro measures overall classification performance, while F1-Score Weighted provides a comprehensive evaluation under imbalanced data by integrating the per-class precision and recall. mAP measures the overall performance of the model. F1-Score Micro, Precision Micro, and Recall Micro are calculated by aggregating the TP, FP, and FN across all classes, making them independent of the number of samples per class. In contrast, F1-Score Weighted, Precision Weighted, and Recall Weighted are first computed for each class individually and then weighted by the number of samples in each class, so the contribution of each class to the overall weighted metric is positively correlated with its sample size. In Eqs. (19)–(27), \(TP_k\), \(FP_k\), and \(FN_k\) represent the true positives, false positives, and false negatives for class k, respectively; \(Precision_k\) and \(Recall_k\) denote the precision and recall for class k. By calculating these indicators, we can not only assess the overall classification performance of the model but also effectively evaluate its behavior on rare categories in the dataset.
Results
Performance of the MMTSC model
Based on the above experimental settings, the MMTSC model achieved optimal performance, demonstrating the effectiveness of the proposed approach. On the validation set, the F1-Score Micro and F1-Score Weighted for classifying 15 tree species reached 72.05% and 71.73%, respectively. The Precision Weighted and Precision Micro were 81.74% and 78.93%, respectively, while both Recall Micro and Recall Weighted were 67.42%. The mAP was 76.60%. At the same time, based on this classification result and WorldCover land use coverage data, we drew a classification map of various tree species in this study area.
Tree species classification map of Lower Saxony, Germany. This map was created using ArcGIS 10.8 (Esri, https://www.esri.com/en-us/arcgis/about-arcgis/overview). The base land cover layer was obtained from the ESA WorldCover 2020 dataset (https://esa-worldcover.org/en). Tree species classification results were generated by applying our proposed MMTSC model to the publicly available TreeSatAI dataset.
As shown in Fig. 6, the various tree species are mostly distributed in the southeast, northeast and central parts of Lower Saxony, Germany. As can be seen from the figure, the southeast is mainly dominated by Picea, Fagus and Quercus; Pinus is the main tree species in the northeast; Pseudotsuga is the main tree species in the central part, mixed with small proportions of Abies and Betula.
Comparison of MMTSC model with other state-of-the-art methods (SOTA)
To demonstrate the superior performance of the MMTSC model, we compare it with the following SOTA methods without changing the experimental settings. Because the dataset is class-imbalanced, precision and recall are both affected by it; F1-Score, as the harmonic mean of the two, serves as a proxy for the overall classification ability of the model. Therefore, F1-Score Micro and F1-Score Weighted are used as primary indicators, and Precision Weighted, Precision Micro, Recall Micro, Recall Weighted, and mAP as secondary indicators. Each indicator is a percentage, the maximum value is bolded, and the specific results are shown in Table 2.
Among them, TResNet45, a representative of lightweight yet high-performance models, achieved state-of-the-art results on multiple image classification tasks as early as 2020, ranking among the top methods in multi-label image classification. ML-Decoder46 is a decoder structure that has emerged as a SOTA architecture for multi-label visual classification when integrated with feature extractors such as Vision Transformer (ViT)47 and Swin Transformer (Swin)40, with its predictive capability validated on multiple public datasets. Asymmetric Loss (ASL)44, robust to label imbalance, combined with the Spatial Attention Mechanism (SAL)48, which enhances attention to local areas, has become one of the representative approaches in multi-label image classification. As shown in the table, our MMTSC model improved the F1-Score Micro, F1-Score Weighted, and mAP values by about 2–9%, showing stable and excellent performance.
Ablation study of MMTSC model
To evaluate the actual effect of each module in our proposed multi-branch multi-label tree species classification model, we conducted several ablation experiments. We denote the complete model, the model without the S1 and S2 branches, without the wavelet transform, without the fine-tuned CBAM (CBAMC), and a baseline built on the existing backbone as (a) to (e), respectively, to verify whether each module contributes to the overall design. The results are shown in Table 3. The evaluation indicators remain unchanged, with all values expressed as percentages and the highest value for each indicator in bold.
As can be seen from Table 3, the F1-Score of the full network is as high as 72%, Precision exceeds 80%, and mAP is about 77%, all improved to varying degrees over the established baseline. After removing each module, both the Micro and Weighted F1-Scores decline to varying degrees: removing the S1 and S2 branches decreases F1-Score Micro and F1-Score Weighted by 0.82% and 0.65%, respectively; removing the wavelet transform module decreases them by 1.08% and 0.77%; and removing the CBAMC module decreases them by 0.51% and 0.73%. In summary, the experiments demonstrate that each module of our proposed multi-branch multi-label tree species classification network contributes to the results.
Indicator value heat map and confusion matrix visualization analysis
In a multi-label classification task, each category can be regarded as an independent binary classification task. Based on this feature, we saved the best-performing model according to the highest indicator obtained on the validation set after training. We then plotted a heat map of indicator values for all categories in the dataset, along with the corresponding confusion matrix, to evaluate and analyze the classification performance of each category, as shown in Figs. 7 and 8a–o. Figure 7 is a heat map of indicators for all categories. The horizontal axis represents all categories, and the vertical axis represents the F1-Score, Recall, and Precision evaluation index values (the value range is from 0 to 1). The darker the color, the better the classification performance. Figure 8a–o is a confusion matrix diagram of all categories. Taking (a) as an example, it shows the classification result of the Abies class. The darker the color, the higher the value. In the matrix, the upper left represents TN, the lower left represents FN, the upper right represents FP, and the lower right represents TP.
Confusion matrix of all categories. a Confusion matrix for the Abies class, b confusion matrix for the Acer class, c confusion matrix for the Alnus class, d confusion matrix for the Betula class, e confusion matrix for the cleared class, f confusion matrix for the Fagus class, g confusion matrix for the Fraxinus class, h confusion matrix for the Larix class, i confusion matrix for the Picea class, j confusion matrix for the Pinus class, k confusion matrix for the Populus class, l confusion matrix for the Prunus class, m confusion matrix for the Pseudotsuga class, n confusion matrix for the Quercus class, o confusion matrix for the Tilia class.
Taking F1-Score as the main indicator, Fig. 7 shows that the F1-Scores of Tilia and Prunus are higher, while those of Betula and Larix are lower. Figure 8 shows that the FN values of the Prunus (l) and Tilia (o) classes in the confusion matrix are both 0, while the FN and FP values of the Betula (d) and Larix (h) classes are significantly higher. This indicates that the Tilia and Prunus classes are classified relatively well, while the Betula and Larix classes have the most classification errors and relatively poor performance.
Grad-CAM heat map visualization analysis
Due to the extremely serious class imbalance in the dataset, the resampling of the training set may be one reason for the excellent classification performance of the Tilia and Prunus classes. On the other hand, compared with other categories, the sample features of Tilia and Prunus may be easier to distinguish, so the model learned their classification rules well during training. Although the Betula and Larix categories have relatively many samples, their features may overlap strongly with those of other tree species, making it difficult for the model to identify them effectively and resulting in many false detections and omissions. To further visualize the degree to which the model attends to the image features of different tree species, we carried out Grad-CAM heat map analysis to explain the model predictions49. Because of the low resolution of the S1 and S2 images, we only show heat map visualizations of the convolutional layers in the aerial branch of the backbone network. Figure 9a–o shows randomly selected original images of all categories in the dataset and their corresponding heat maps.
Original images and Grad-CAM heatmaps of all categories. a The original image alongside the Grad-CAM heatmap for the Abies class, b the original image alongside the Grad-CAM heatmap for the Acer class, c the original image alongside the Grad-CAM heatmap for the Alnus class, d the original image alongside the Grad-CAM heatmap for the Betula class, e the original image alongside the Grad-CAM heatmap for the Cleared class, f the original image alongside the Grad-CAM heatmap for the Fagus class, g the original image alongside the Grad-CAM heatmap for the Fraxinus class, h the original image alongside the Grad-CAM heatmap for the Larix class, i the original image alongside the Grad-CAM heatmap for the Picea class, j the original image alongside the Grad-CAM heatmap for the Pinus class, k the original image alongside the Grad-CAM heatmap for the Populus class, l the original image alongside the Grad-CAM heatmap for the Prunus class, m the original image alongside the Grad-CAM heatmap for the Pseudotsuga class, n the original image alongside the Grad-CAM heatmap for the Quercus class, o the original image alongside the Grad-CAM heatmap for the Tilia class.
The Grad-CAM heat map is a tool for explaining model predictions. Red, green, and yellow indicate the areas the model mainly identifies and focuses on, and blue indicates the areas the model attends to least. As can be seen from Fig. 9, in the heat maps of Prunus (l) and Tilia (o), green areas outnumber blue areas; comparison with the original images shows that the model effectively attends to the important characteristics of the two species. In the heat maps of the Betula (d) and Larix (h) classes, blue occupies the vast majority of the image; comparison with the original images suggests that the model fails to capture the class features of Betula and Larix effectively. These results are consistent with the model's predictions, once again confirming the validity of the analysis.
Discussion
Advantages of MMTSC model
In order to effectively combine forest tree species image data from different sensor sources and improve their classification performance, we proposed the MMTSC model.
To demonstrate the excellent performance of our designed model, we compare it with other SOTA multi-label image classification methods. As shown in Table 2, the indicators of the above SOTA methods are not as good as those of the MMTSC model. We believe that the combination of ML-Decoder and feature extractors such as ViT and Swin fails to effectively identify local features in tree species areas. Due to the relatively sparse image labels, decoder-based modeling faces certain challenges. In contrast, TResNet benefits from a CNN architecture that is inherently suited for capturing local textures, while ASL + SAL demonstrate strong robustness in handling class imbalance and focusing on local regions. As a result, the method combining ML-Decoder with feature extractors performs slightly worse on the TreeSatAI dataset. The MMTSC model has multi-scale, multi-modal, and multi-stage enhancement capabilities, surpassing some single-branch structure models such as TResNet and ASL + SAL.
As can be seen from Fig. 2, the MMTSC model is designed to process, and is optimized specifically for, data of different modalities. For example, the Aerial branch uses the DenseNet121CW model for efficient feature reuse and the processing of high-resolution visible-light data. The S1 and S2 branches use the Swin Transformer tinyC1 and Swin Transformer tinyC2 models, respectively, which can establish long-range dependencies, to process radar and optical remote sensing data. To weaken the impact of modal differences on classification performance, we use adaptive weight parameters to adjust the features extracted from the different modal branches before feature fusion and complementation. In addition, we add the CBAMC and wavelet transform modules to supplement multi-scale information. As the ablation results in Table 3 show, each component of the MMTSC model contributes to the classification performance.
To further validate the effectiveness of our proposed model, we replaced the DenseNet121 network in the main part of the multi-branch multi-label tree species classification model with representative lightweight networks: EfficientNetB050, ConvNeXttiny51, ResNet1852, MobileNetV353, and RegNetX800MF54, and compared them on the same TreeSatAI dataset. The results are shown in Table 4. The evaluation indicators remain unchanged, each value is a percentage, and the maximum value is bolded.
The table shows that all five representative lightweight networks, when used as replacements, resulted in varying degrees of performance degradation. We believe the reason may be that DenseNet121's characteristic dense connections and feature reuse allow the network to better capture the associations between the labels of different tree species. EfficientNetB050 and MobileNetV353 have fewer parameters, while ResNet1852 and RegNetX800MF54 have shallow depths, leaving these models unable to capture deep semantics, so their performance is slightly inferior to DenseNet121. ConvNeXttiny, a lightweight network with modern design optimizations, has nearly 28 M parameters, far exceeding DenseNet121, EfficientNetB050, ResNet1852, MobileNetV353, and RegNetX800MF54; it is better suited to large datasets and overfits on our small dataset, preventing it from fully demonstrating its advantages. These comparative experiments indicate that the network architecture we designed is the best performing among those tested.
In summary, based on the results from Tables 2, 3 and 4, as well as the visualization analysis, the MMTSC model demonstrates excellent classification performance.
MMTSC model application
Biomass estimation of forest tree species is one application of multi-label tree species image classification. As a key indicator of stand structure and function, biomass estimation is an important task in forestry management and environmental monitoring and has attracted the attention of many scholars55. Traditional direct observation and growth-equation-based estimation methods are not only insufficiently accurate but also time-consuming and laborious.
Therefore, based on the results of the multi-label tree species classification model above, we use the German Fourth National Forest Inventory database and the biomass conversion and expansion factors proposed by the Intergovernmental Panel on Climate Change (IPCC) for common tree species in temperate regions to estimate the aboveground biomass (AGB) of each tree species. The calculation formula is as follows:
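Consistent with the factor definitions below, the per-image, per-species estimate is the product of the three terms (yielding units of \(t/ha\)):

\(AGB = pro \times Annual\_volume \times BCEFs\)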
Among them, \(pro\) is the probability value of the model predicting the tree species on a certain image; \(Annual\_volume\) is the average annual volume of the tree species, in \({m^3}/ha\); BCEFs stands for the Biomass Conversion and Expansion Factors, in \(t/{m^3}\).
We removed the Cleared class and divided the remaining 14 classes into ecological tree species and economic tree species, as shown in Table 5.
Table 5 shows that the aboveground biomass of economic tree species is much higher than that of ecological tree species. Combined with the latest federal forest inventory report, we find that Fagus, an important component of the forest ecosystem, is threatened by climate extremes such as drought and high temperature and has been severely damaged; Picea and Pinus, the most important raw materials for the wood and papermaking industries, have been severely harmed by beetle and fungal infestations; and forest aging is becoming increasingly serious. We therefore suggest the following measures: (1) give priority to tree species adapted to local climate change, such as Abies, to enhance forest stability and ecological diversity; (2) introduce broad-leaved species into economic forests to increase the proportion of mixed forest, and at the same time use aerial photography and remote sensing technologies such as drones to monitor the health of species like Picea and Pinus and strengthen the prevention and control of beetle and fungal threats; (3) moderately extend the cutting cycle to promote the transition of economic forests to old forests with higher ecological value, and replant harvested land in a timely manner to renew and optimize forest structure; (4) apply artificial intelligence technologies such as deep learning to tree species classification to enable efficient and accurate forest monitoring, including biomass estimation, and to ensure sustainable forest ecosystem services.
Limitations and future works
Although the proposed MMTSC model achieves superior performance on the remote sensing dataset of Lower Saxony, Germany, accurate identification of tree species in complex multi-stand scenarios is still a challenging task in realistic remote sensing images. Since publicly available high-quality multi-label and multi-source tree species remote sensing datasets are still relatively scarce, this study was only verified on one regional dataset and still has certain regional adaptability limitations. To further verify the transferability, generalization ability and practicality of the model, we will expand the scope of research in the future. We have completed the field research in Laoshan Forest Farm in Nanjing, China, and are currently conducting classification and identification experiments on major tree species using the MMTSC model proposed in this paper. In addition, we also plan to combine the classification results with ecological parameters in the future to model and predict the forest carbon storage and carbon sink potential in the region, and compare and analyze it with the forest carbon storage in Lower Saxony, Germany, to promote the practical application of the MMTSC model in forest resource monitoring and carbon assessment.
Conclusion
Small inter-class differences in remote sensing images and the scarcity of public multi-label tree species datasets are major obstacles. Therefore, in this paper, the MMTSC model for multi-source tree species images is designed and verified on the public TreeSatAI dataset. On the validation set, our F1-Score, Precision, and mAP values reached approximately 72%, 82%, and 77%, respectively. We conducted ablation and comparison experiments and performed visualization analyses to verify the effectiveness and superior performance of the proposed network. In addition, we estimated the aboveground biomass of different tree species in the dataset based on the classification results, and on this basis put forward several policy recommendations for forest management.
Data availability
The datasets and algorithm code used or analysed during the current study are available from the corresponding author on reasonable request.
References
Bonan, G. B. Forests and climate change: forcings, feedbacks, and the climate benefits of forests. Science 320, 1444–1449 (2008).
Guo, R. Z., Song, Y. B. & Dong, M. Progress and prospects of ecosystem disservices: an updated literature review. Sustainability 14, 10396 (2022).
Zhao, Q. & Shao, J. Evaluating the impact of simulated land use changes under multiple scenarios on ecosystem services in Ji’an, China. Ecol. Ind. 156, 111040 (2023).
Gittman, R. K. et al. Assessing how restoration can facilitate 30×30 goals for climate-resilient coastal ecosystems in the United States. Conserv. Biol. e14429. https://doi.org/10.1111/cobi.14429 (2024).
Bergkvist, J. et al. Quantifying the impact of climate change and forest management on Swedish forest ecosystems using the dynamic vegetation model LPJ-GUESS. Earth’s Future 13, e2024EF004662 (2025).
Vanguri, R., Laneve, G. & Hościło, A. Mapping forest tree species and its biodiversity using enmap hyperspectral data along with Sentinel-2 temporal data: an approach of tree species classification and diversity indices. Ecol. Ind. 167, 112671 (2024).
Zhang, Y. et al. The potential of optical and SAR time-series data for the improvement of aboveground biomass carbon estimation in Southwestern china’s evergreen coniferous forests. GISci. Remote Sens. 61, 2345438 (2024).
Makhubele, L., Chirwa, P. & Araia, M. Tree biomass carbon stocks and biodiversity, and their determinants in a traditional agroforestry landscape in the Vhembe biosphere reserve, South Africa. Agrofor. Syst. 99, 7 (2025).
Zheng, J. et al. Surveying coconut trees using high-resolution satellite imagery in remote atolls of the Pacific ocean. Remote Sens. Environ. 287, 113485 (2023).
López-García, J., Cruz-Bello, G. M., & De Lourdes Manzo-Delgado, L. A long-term analysis, modeling and drivers of forest recovery in central Mexico. Environ. Monit. Assess. 197, 87 (2024).
Wang, X. et al. Semantic segmentation network for Mangrove tree species based on UAV remote sensing images. Sci. Rep. 14, 29860 (2024).
Roca, M. et al. Subtidal seagrass and blue carbon mapping at the regional scale: a cloud-native multi-temporal Earth observation approach. GISci. Remote Sens. 62, 2438838 (2025).
Onishi, M. & Ise, T. Explainable identification and mapping of trees using UAV RGB image and deep learning. Sci. Rep. 11, 903 (2021).
Cheng, G., Xie, X., Han, J., Guo, L. & Xia, G. S. Remote sensing image scene classification meets deep learning: challenges, methods, benchmarks, and opportunities. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 13, 3735–3756 (2020).
Yu, N., Ren, H., Deng, T. & Fan, X. Stepwise locating bidirectional pyramid network for object detection in remote sensing imagery. IEEE Geosci. Remote Sens. Lett. 20, 6001905 (2023).
Guo, J. et al. C3DA: a universal domain adaptation method for scene classification from remote sensing imagery. IEEE Geosci. Remote Sens. Lett. 21, 1–5 (2024).
Prasad, A. M., Iverson, L. R. & Liaw, A. Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems 9, 181–199 (2006).
Kandare, K., Ørka, H. O., Dalponte, M., Næsset, E. & Gobakken, T. Individual tree crown approach for predicting site index in boreal forests using airborne laser scanning and hyperspectral data. Int. J. Appl. Earth Obs. Geoinf. 60, 72–82 (2017).
Dalponte, M., Frizzera, L. & Gianelle, D. Individual tree crown delineation and tree species classification with hyperspectral and lidar data. PeerJ 6, e6227 (2019).
Broni-Bediako, C., Xia, J. & Yokoya, N. Real-time semantic segmentation: a brief survey and comparative study in remote sensing. IEEE Geosci. Remote Sens. Mag. https://doi.org/10.1109/MGRS.2023.3321258 (2023).
Rodríguez-Lira, D. C. et al. Trends in machine and deep learning techniques for plant disease identification: a systematic review. Agriculture 14, 2188 (2024).
Wang, N., Pu, T., Zhang, Y., Liu, Y. & Zhang, Z. More appropriate DenseNetBL classifier for small sample tree species classification using UAV-based RGB imagery. Heliyon 9, e20467 (2023).
Polonen, I. et al. Tree species identification using 3D spectral data and 3D convolutional neural network. In: 2018 9th Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS) (IEEE, New York, 2018).
Ahlswede, S. et al. TreeSatAI benchmark archive: a multi-sensor, multi-label dataset for tree species classification in remote sensing. Earth Syst. Sci. Data. 15, 681–695 (2023).
Chen, L., Tian, X., Chai, G., Zhang, X. & Chen, E. A new CBAM-P-Net model for few-shot forest species classification using airborne hyperspectral images. Remote Sens. 13, 1269 (2021).
Ma, Y., Zhao, Y., Im, J., Zhao, Y. & Zhen, Z. A deep-learning-based tree species classification for natural secondary forests using unmanned aerial vehicle hyperspectral images and lidar. Ecol. Ind. 159, 111608 (2024).
Zheng, J. et al. Growing status observation for oil palm trees using unmanned aerial vehicle (UAV) images. ISPRS J. Photogramm. Remote Sens. 173, 95–121 (2021).
Zhang, Y. et al. Attention is all you need: utilizing attention in AI-enabled drug discovery. Brief. Bioinform. 25, bbad467 (2023).
He, C., Wang, Q., Xu, W., Yuan, B. & Xie, W. A rocky desertification land extraction method based on spectral-texture-scattering-terrain multisource features with time series. IEEE Trans. Geosci. Remote Sens. 63, 1–16 (2025).
Dong, R. et al. An adaptive image fusion method for Sentinel-2 images and high-resolution images with long-time intervals. Int. J. Appl. Earth Obs. Geoinf. 121, 103381 (2023).
Jiang, Y., Li, X., Peng, L., Li, C. & Song, T. Assessing the potential of multi-seasonal Sentinel-2 satellite imagery combined with airborne lidar for urban tree species identification. Sci. Rep. 15, 25107 (2025).
Li, W. et al. Joint semantic–geometric learning for polygonal building segmentation from high-resolution remote sensing images. ISPRS J. Photogramm. Remote Sens. 201, 26–37 (2023).
Cui, Y. et al. A novel approach to land cover classification by integrating automatic generation of training samples and machine learning algorithms on Google Earth engine. Ecol. Ind. 154, 110904 (2023).
Goodin, D. G., Anibas, K. L. & Bezymennyi, M. Mapping land cover and land use from object-based classification: an example from a complex agricultural landscape. Int. J. Remote Sens. 36, 4702–4723 (2015).
Guo, J., Jiao, S., Sun, H., Song, B. & Chi, Y. Cross-modal compositional learning for multilabel remote sensing image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 18, 5810–5823 (2025).
Chavda, S. & Goyani, M. Scene level image classification: a literature review. Neural Process. Lett. 55, 2471–2520 (2023).
Zhou, J. et al. Intelligent classification of maize straw types from UAV remote sensing images using DenseNet201 deep transfer learning algorithm. Ecol. Ind. 166, 112331 (2024).
Zhang, Y., Xu, M. & Li, X. Remote sensing image retrieval based on DenseNet model and CBAM. In: IEEE 3rd International Conference on Computer and Communication Engineering Technology (CCET) 86–90 (IEEE, Beijing, 2020). https://doi.org/10.1109/CCET50901.2020.9213121
Savareh, B. A., Emami, H., Hajiabadi, M., Azimi, S. M. & Ghafoori, M. Wavelet-enhanced convolutional neural network: a new Idea in a deep learning paradigm. Biomed. Eng. Biomed. Tech. 64, 195–205 (2019).
Jiao, D. et al. SymSwin: multi-scale-aware super-resolution of remote sensing images based on swin transformers. Remote Sens. 16, 4734 (2024).
Zhang, H., Cisse, M., Dauphin, Y. N. & Lopez-Paz, D. mixup: beyond empirical risk minimization. Preprint at https://doi.org/10.48550/arXiv.1710.09412 (2018).
Zhou, P., Xie, X., Lin, Z. & Yan, S. Towards understanding convergence and generalization of AdamW. IEEE Trans. Pattern Anal. Mach. Intell. 46, 6486–6493 (2024).
Kornblith, S., Shlens, J. & Le, Q. V. Do Better ImageNet Models Transfer Better? In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019) 2656–2666 (IEEE Computer Soc, Los Alamitos, 2019). https://doi.org/10.1109/CVPR.2019.00277
Ridnik, T. et al. Asymmetric loss for multi-label classification. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021) 82–91 (IEEE, New York, 2021). https://doi.org/10.1109/ICCV48922.2021.00015
Ridnik, T. et al. TResNet: high performance GPU-dedicated architecture. In: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (IEEE, 2021).
Ridnik, T., Sharir, G., Ben-Cohen, A., Ben-Baruch, E. & Noy, A. ML-decoder: scalable and versatile classification head. In: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (IEEE, Waikoloa, 2023). https://doi.org/10.1109/wacv56688.2023.00012
Dosovitskiy, A. et al. An image is worth 16×16 words: transformers for image recognition at scale. Preprint at https://doi.org/10.48550/arXiv.2010.11929 (2021).
Woo, S., Park, J., Lee, J. Y. & Kweon, I. S. CBAM: convolutional block attention module. Preprint at https://doi.org/10.48550/arXiv.1807.06521 (2018).
Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via Gradient-Based localization. Int. J. Comput. Vis. 128, 336–359 (2020).
Shahi, T. B., Dahal, S., Sitaula, C., Neupane, A. & Guo, W. Deep learning-based weed detection using UAV images: a comparative study. Drones 7, 624 (2023).
Khan, T., Khan, Z. A. & Choi, C. Enhancing real-time fire detection: an effective multi-attention network and a fire benchmark. Neural Comput. Applic. https://doi.org/10.1007/s00521-023-09298-y (2023).
Wang, J., Du, C. & Gao, T. Remote sensing scene classification based on ResNet18 and support vector machine. In: International Conference on Machine Intelligence and Digital Applications 666–670 (ACM, Ningbo China, 2024). https://doi.org/10.1145/3662739.3673680
Howard, A. et al. Searching for MobileNetV3. In: IEEE/CVF International Conference on Computer Vision (ICCV) 1314–1324 (IEEE, Seoul, 2019). https://doi.org/10.1109/ICCV.2019.00140
Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K. & Dollar, P. Designing Network Design Spaces. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 10425–10433 (IEEE, Seattle, 2020). https://doi.org/10.1109/CVPR42600.2020.01044
Yan, X. et al. Evaluation of machine learning methods and multi-source remote sensing data combinations to construct forest above-ground biomass models. Int. J. Digit. Earth. 16, 4471–4491 (2023).
Acknowledgements
We would like to thank the researchers who provided the TreeSatAI dataset, and we would also like to thank the anonymous reviewers and editors for suggesting changes to this paper.
Funding
This study was funded by the Major Special Project for Philosophy and Social Science Research in Higher Educational Institutions of Jiangsu Province (Grant No. 2020SJZDA073) and the Jiangsu Provincial Graduate Student Research and Practice Innovation Program (Grant No. KYCX25_1440).
Author information
Authors and Affiliations
Contributions
Conceptualization, T.Q. and Q.Z.; methodology, T.Q.; software, T.Q.; validation, T.Q. and Q.Z.; formal analysis, Q.Z.; investigation, Q.Z.; resources, Q.Z.; data curation, T.Q.; writing—original draft preparation, T.Q.; writing—review and editing, T.Q. and Q.Z.; visualization, T.Q.; supervision, Q.Z.; project administration, Q.Z.; funding acquisition, Q.Z. All authors have read and agreed to the published version of the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Qin, T., Zhao, Q. Multi-branch and multi-label tree species classification using deep learning for UAV aerial photography and Sentinel remote sensing images. Sci Rep 15, 32710 (2025). https://doi.org/10.1038/s41598-025-19827-5