Introduction

Winter wheat is one of the world’s main food crops, and its quality and yield affect the food supply of nearly half of the world’s population1,2. As the global population continues to grow, the demand for wheat will keep rising, so studying and safeguarding the growth of winter wheat is of great significance. The tillering stage is a critical period that determines the life cycle and yield of wheat, and efficient, accurate monitoring of growth status at this stage has an important impact on the yield and quality of winter wheat3,4,5. In recent years, UAV-based remote sensing has become a research hotspot owing to its miniaturization, low cost, flexible operation and high timeliness, and it has been widely used in crop growth monitoring. A UAV typically carries various sensors, such as RGB, thermal infrared, multispectral and hyperspectral sensors, to perform monitoring tasks in specific scenarios6,7. Liu et al.8 used a UAV to capture RGB images of wheat in the field to obtain information on the uniformity of seedling emergence and seedling deficiency, providing an innovative method for efficiently measuring the uniformity of wheat seedling emergence. Liyuan et al.9 proposed the RGRI-Otsu method to accurately extract the maize canopy temperature from UAV RGB and thermal infrared images, and showed a strong correlation with temperatures measured on the ground with a hand-held infrared thermometer. Tao et al.10 acquired multispectral images of wheat with a UAV and proposed a gradient change features (GCFs) method based on vegetation indices (VIs) to accurately estimate the wheat tiller number across multiple growth stages and fertilization treatments.

Although UAV remote sensing provides a new means of monitoring crop growth over large areas, accurately and efficiently processing these images and extracting targeted feature information from them has become a new challenge11,12. In recent years, the continuous development of deep learning has prompted more and more scholars to apply it to the segmentation of agricultural remote sensing images. Ma et al.13 proposed EarSegNet, a deep-learning-based semantic segmentation method, to segment wheat ears from field wheat canopy images. The results show that EarSegNet not only achieves accurate segmentation of wheat ears from canopy images at flowering, but also outperforms other segmentation methods. Hanhui et al.14 proposed an end-to-end potato stem and leaf segmentation method by combining YOLOv8x with five bands including RGB and RGB-DSM. Their results show that the strong abrupt-change (edge) information at potato leaf and stem boundaries can effectively improve leaf and stem segmentation, and the method can be used for automated phenotypic segmentation of other cultivated crops.

Winter wheat gradually forms axillary buds between the leaf sheaths and leaves at the tillering stage, showing a compact growth habit and low canopy coverage. It is therefore very challenging to segment the winter wheat canopy from the complex background in remote sensing images acquired at this stage. In recent years, research has mainly focused on using deep learning to analyze the contour and spectral characteristics of crops for segmentation tasks. Dehua et al.15 explored segmentation of the canopy from a complex soil background based on the spectral characteristics of green vegetation, using an 807 nm near-infrared image, excess green (ExG) images from the visible bands at 470 nm, 550 nm and 660 nm, and a soil adjusted vegetation index (SAVI) image from the 807 and 660 nm bands. Xiaowu et al.16 proposed a fast weed segmentation method based on a crop detection model (CDM) and ExG, achieving an accuracy of 92.50% and an IoU of 76.14%. However, it is difficult to achieve end-to-end segmentation with these methods. For example, the ExG-based method requires a segmentation threshold to be found, and other methods based on spectral features require prior knowledge of the characteristic bands of the object, which increases the cost of data preprocessing. In addition, the canopy coverage of winter wheat at the tillering stage is low, and the complex background leads to a complex spectral composition in the remote sensing images, which further increases the difficulty of segmentation. Unlike optical sensors, which measure reflected electromagnetic radiation, thermal infrared sensors obtain temperature information by detecting the thermal infrared radiation emitted by the object itself. Research on using the temperature information captured by thermal infrared sensors to guide the segmentation of remote sensing images, and thereby improve segmentation accuracy, remains scarce.

This paper proposes Tiff-SegFormer, a semantic segmentation method for multi-source remote sensing images acquired by RGB and thermal infrared sensors. The main objectives of this study are (1) to design a segmentation method for multi-source remote sensing images by combining the temperature characteristics of thermal infrared images with the spectral characteristics of visible light, (2) to compare this method with other models on a self-built winter wheat tillering-stage dataset to evaluate its performance, and (3) to verify the generalization ability of the proposed model on a separate test dataset.

Materials and methods

Experimental site

This study was carried out in high-standard farmland in Dalu Town, Zhenjiang City, Jiangsu Province, China (32°11′30″ N, 119°45′12″ E). The region has a subtropical monsoon climate with four distinct seasons, which is suitable for growing wheat. The wheat variety “Zhenmai No. 12” was planted on November 20, 2023.

Data collection

Image acquisition was performed 55 days after sowing (DAS), corresponding to the tillering stage. The UAV was a DJI M300 RTK, and the image acquisition module was a Zenmuse H20T, as shown in Fig. 1a. The H20T has four sensors: a 20-megapixel zoom camera, a 12-megapixel wide-angle camera, a laser rangefinder, and a 640 × 512 thermal imaging camera. Collection was carried out at noon (11:00 a.m. to 12:30 p.m.) in clear, windless weather. The DJI M300 RTK equipped with the H20T collected images using the wide-angle and thermal infrared sensors. The route planning area is shown in Fig. 1b (the three red circles in the figure are the experimental areas); the flight height was 30 m and the overlap rate was 80%. A total of 1444 original images were obtained, including 722 visible images with a resolution of 4056 × 3040 pixels (JPG format) and 722 thermal infrared images with a resolution of 640 × 512 pixels (JPG format). The images can be found at https://github.com/wylSUGAR/wheat_tillering_stage.

Fig. 1
figure 1

DJI M300 UAV carrying the H20T thermal infrared camera (a) and UAV path planning area (b).

Data preprocessing and data extraction

In the RGB image, abrupt changes in the edges and contours between the wheat and the background can be observed. In the TIR image, detailed contours are blurred, but abrupt changes in temperature between the wheat and the background can be observed. In this study, the RGB and TIR images obtained by the H20T camera were preprocessed in the following three steps, as shown in Fig. 2a. In the first step, the temperature information of the TIR image was extracted using the DJI Thermal SDK, and the TIR image was converted into a TIFF image (the code can be found at https://github.com/wylSUGAR/TIR_DJ_tiff). The TIFF image corresponds to a two-dimensional matrix in which each element represents the temperature value at that location, as shown in Fig. 2b. In the second step, the RGB and TIR images were registered, because the wide-angle and thermal infrared sensors differ in position and resolution17,18,19, using the UAV thermal infrared and visible light image registration method proposed by Lingxuan et al.17. The resolution of the registered RGB and TIR images was 640 × 512. In the third step, the images were cropped to 512 × 512 as the input data of the model.
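As an illustration of the first and third steps, the minimal sketch below reads one registered image pair and crops both modalities to the 512 × 512 model input. The filenames are hypothetical, and the registration itself (step two) follows the cited method17 and is not shown here.

```python
import numpy as np
import tifffile                     # reads/writes the single-band temperature TIFF
from PIL import Image

# Hypothetical filenames for one registered image pair (640 x 512 after registration).
rgb = np.asarray(Image.open("reg_rgb.jpg"))                  # (512, 640, 3) uint8 array
temp = tifffile.imread("reg_temp.tiff").astype(np.float32)   # (512, 640) temperature map

# Crop both modalities to the 512 x 512 model input size (here: the left-most crop).
h, w = 512, 512
rgb_crop, temp_crop = rgb[:h, :w, :], temp[:h, :w]

Image.fromarray(rgb_crop).save("rgb_512.png")
tifffile.imwrite("temp_512.tiff", temp_crop)
```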

Fig. 2
figure 2

Steps of image preprocessing (a); the TIFF image is converted into a two-dimensional temperature matrix (b).

Model construction

Dataset preparation

After data preprocessing, 722 RGB images and 722 TIFF images with a resolution of 512 × 512 were obtained. Labelme software (https://github.com/labelmeai/labelme) was used to annotate the RGB images of wheat at the tillering stage with two labels, namely background and wheat2 (wheat canopy at the tillering stage). The dataset was divided at a ratio of 4:1 for model training and evaluation. In addition, the data were augmented by changing the contrast (factors of 0.3 and 1.1), brightness (factors of 0.4 and 1.2) and color (factors of 0.3 and 1.3), adding motion blur and Gaussian noise, and flipping the original images vertically and horizontally. This helped compensate for the limited size of the dataset and improved the generalization and robustness of the model. Finally, 5489 RGB images and 5489 TIFF images were obtained, of which 4420 formed the training set and 1069 the evaluation set.
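A minimal Pillow/NumPy sketch of these augmentations is given below. The enhancement factors follow the text; the motion-blur kernel, the noise standard deviation and the note that geometric flips must be applied identically to the TIFF images and label masks are our own illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter, ImageOps

def augment_rgb(img: Image.Image) -> list:
    """Return augmented copies of one RGB training image."""
    out = []
    for f in (0.3, 1.1):                                   # contrast factors from the text
        out.append(ImageEnhance.Contrast(img).enhance(f))
    for f in (0.4, 1.2):                                   # brightness factors
        out.append(ImageEnhance.Brightness(img).enhance(f))
    for f in (0.3, 1.3):                                   # color factors
        out.append(ImageEnhance.Color(img).enhance(f))
    # Simple horizontal motion blur: a 5x5 kernel averaging along the middle row.
    k = [0.0] * 25
    for i in range(5):
        k[2 * 5 + i] = 1.0
    out.append(img.filter(ImageFilter.Kernel((5, 5), k, scale=5)))
    # Additive Gaussian noise (std = 10 is an illustrative choice).
    arr = np.asarray(img).astype(np.float32)
    noisy = np.clip(arr + np.random.normal(0, 10, arr.shape), 0, 255).astype(np.uint8)
    out.append(Image.fromarray(noisy))
    # Vertical and horizontal flips; the same flips must also be applied
    # to the paired TIFF image and the label mask.
    out.append(ImageOps.flip(img))
    out.append(ImageOps.mirror(img))
    return out
```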

The Tiff-SegFormer for semantic segmentation

The Transformer is currently a mainstream deep learning model and has a stronger ability to capture global information than traditional CNN models20,21,22,23. Transformer models combined with an encoder-decoder structure are currently a hot topic in semantic segmentation24,25. The design of Tiff-SegFormer is inspired by SegFormer25 and builds on the SegFormer encoder and decoder with several improvements, as shown in Fig. 3.

The encoder of Tiff-SegFormer is divided into two branches (upper and lower), and each branch has the same structure as the SegFormer encoder. Each branch consists of 4 transformer blocks, which extract multi-level, multi-scale features from the input image. Each transformer block consists of several efficient self-attention and mix feed-forward network (Mix-FFN) modules, together with an overlapped patch merging module. Efficient self-attention reduces the computational complexity of the self-attention mechanism and is defined in Eqs. (1)–(3).

$$Attention(Q,K,V) = Softmax\left( {\frac{{QK^{T} }}{{\sqrt {d_{{head}} } }}} \right)V$$
(1)
$$\hat{K} = Reshape\left( {\frac{N}{R},C \cdot R} \right)(K)$$
(2)
$$K = Linear(C \cdot R,C)(\hat{K})$$
(3)

where \(Reshape\) denotes reshaping \(K\) into a sequence of shape \(\frac{N}{R} \times (C \cdot R)\), \(Linear\) denotes a linear layer that maps a tensor of dimension \(C \cdot R\) to a tensor of dimension \(C\), \(R\) is the reduction ratio, and the resulting dimension of \(K\) is \(\frac{N}{R} \times C\). The complexity of the self-attention mechanism is thus reduced from \(O(N^{2} )\) to \(O(\frac{{N^{2} }}{R}).\)
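To make the sequence-reduction step concrete, the following PyTorch sketch shows an efficient self-attention module in the spirit of Eqs. (1)–(3). It follows the public SegFormer implementation, where the Reshape/Linear reduction is realized with a strided convolution; the class and parameter names are ours, not the authors' code.

```python
import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    """Self-attention with sequence reduction (cf. Eqs. 1-3); an illustrative sketch."""
    def __init__(self, dim: int, heads: int, reduction_ratio: int):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        self.r = reduction_ratio
        if reduction_ratio > 1:
            # Spatial reduction, implemented as a strided convolution as in SegFormer.
            self.sr = nn.Conv2d(dim, dim, kernel_size=reduction_ratio, stride=reduction_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, c = x.shape                                  # n = h * w tokens
        q = self.q(x).reshape(b, n, self.heads, c // self.heads).transpose(1, 2)
        if self.r > 1:
            x_ = x.transpose(1, 2).reshape(b, c, h, w)     # tokens -> feature map
            x_ = self.sr(x_).reshape(b, c, -1).transpose(1, 2)  # reduced token sequence
            x_ = self.norm(x_)
        else:
            x_ = x
        kv = self.kv(x_).reshape(b, -1, 2, self.heads, c // self.heads).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]
        attn = (q @ k.transpose(-2, -1)) * self.scale      # softmax(QK^T / sqrt(d_head))
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)
```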

Mix-FFN removes the positional encoding and uses a \(3 \times 3\) convolution to obtain position information and additional inductive bias, as defined in Eq. (4). This design allows the test dataset to have a different resolution from the training dataset.

$${\text{x}}_{out} = MLP(GELU(Conv_{3 \times 3} (MLP(x_{in} )))) + x_{in}$$
(4)
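A PyTorch sketch of Eq. (4) is given below. Following the public SegFormer implementation, the 3 × 3 convolution is depth-wise; the class and argument names are ours and only illustrate the structure.

```python
import torch
import torch.nn as nn

class MixFFN(nn.Module):
    """Mix-FFN of Eq. (4): MLP -> 3x3 depth-wise conv (positional cue) -> GELU -> MLP."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # 3x3 per channel
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        residual = x
        x = self.fc1(x)
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)      # tokens -> map for the 3x3 conv
        x = self.dwconv(x).flatten(2).transpose(1, 2)  # back to a token sequence
        x = self.fc2(self.act(x))
        return x + residual                            # the "+ x_in" term in Eq. (4)
```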

Overlapped patch merging implements the size and channel transformation of multi-level features by a convolution operation.
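Overlapped patch merging can be sketched as a single strided convolution whose kernel is larger than its stride (7 × 7 with stride 4 in the first stage and 3 × 3 with stride 2 afterwards in SegFormer); the sketch below is illustrative rather than the authors' exact implementation.

```python
import torch.nn as nn

class OverlapPatchMerging(nn.Module):
    """Size and channel transformation of the feature map by one overlapping convolution."""
    def __init__(self, in_ch: int, out_ch: int, patch: int = 3, stride: int = 2):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=patch, stride=stride, padding=patch // 2)
        self.norm = nn.LayerNorm(out_ch)

    def forward(self, x):                        # x: (B, C_in, H, W)
        x = self.proj(x)                         # (B, C_out, H/stride, W/stride)
        b, c, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)         # token sequence for the next transformer block
        return self.norm(x), h, w
```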

The input of the upper encoder of Tiff-SegFormer is a three-channel RGB image, and its multi-level features are obtained from the hierarchical transformer blocks. The input of the lower encoder is a preprocessed single-channel TIFF image (the same size as the RGB image). The TIFF image is first converted into a two-dimensional matrix (each element represents a temperature), then min-max normalized and multiplied by 255 (to keep it on the same order of magnitude as the RGB image for easy feature concatenation in the decoder), as defined in Eq. (5). The processed result is passed through the hierarchical transformer blocks to obtain the multi-level features of the TIFF image.

$$x^{*} = \frac{{x - x_{\min } }}{{x_{\max } - x_{\min } }}$$
(5)
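A minimal NumPy sketch of Eq. (5) followed by the ×255 scaling is shown below; the small epsilon in the denominator is our own safeguard against a constant-temperature image and is not part of Eq. (5).

```python
import numpy as np

def scale_temperature(temp: np.ndarray) -> np.ndarray:
    """Min-max normalize the temperature matrix (Eq. 5) and scale it to 0-255."""
    t = (temp - temp.min()) / (temp.max() - temp.min() + 1e-8)  # epsilon guards a flat image
    return (t * 255.0).astype(np.float32)
```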
Fig. 3
figure 3

Tiff-SegFormer model architecture.

The decoder of Tiff-SegFormer differs from that of SegFormer. Firstly, the multi-level features from the upper and lower encoders each pass through a multilayer perceptron (MLP) layer, unifying the number of channels to D. Secondly, the multi-level features of the RGB and TIFF images are up-sampled to 1/4 of the input size and concatenated, giving a feature of dimension \(\frac{H}{4} \times \frac{W}{4} \times (D \times 8)\). Thirdly, a convolutional block attention module (CBAM)26, which combines channel and spatial attention, extracts the key information from the different multi-level features of the RGB and TIFF images. Finally, all features are fused through an MLP layer and a segmentation mask \(M\) of size \(\frac{H}{4} \times \frac{W}{4} \times N_{{{\text{c}}ls}}\) is computed, as defined in Eqs. (6)–(9):

$$\hat{F}_{i} = Linear(C_{i} ,C)(F_{i} ),\forall i$$
(6)
$$\hat{F}_{i} = Upsample\left( {\frac{H}{4} \times \frac{W}{4}} \right)\left( {\hat{F}_{i} } \right),\forall i$$
(7)
$$F = Linear(4C,C)(CBAM(Concat(\hat{F}_{i} ))),\forall i$$
(8)
(8)
$$M = Linear(C,N_{{{\text{cls}}}} )(F)$$
(9)

where \(N_{{{\text{cls}}}}\) is the number of classes to be predicted, \(Linear(C_{in} ,C_{out} )( \cdot )\) denotes that the input of the linear layer is \(C_{in}\) and the output is \(C_{out}.\)
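The fusion decoder of Eqs. (6)–(9) can be sketched in PyTorch as follows. The CBAM block is a compact re-implementation of the cited module26, 1 × 1 convolutions stand in for the per-level MLP layers, and the class and argument names are ours; the exact configuration used in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBAM(nn.Module):
    """Minimal channel + spatial attention, sketching the block cited as ref. 26."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                                 nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1).flatten(1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1).flatten(1))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)                   # channel attention
        s = torch.cat([x.mean(1, keepdim=True), x.max(1, keepdim=True)[0]], dim=1)
        return x * torch.sigmoid(self.spatial(s))                          # spatial attention

class FusionDecoder(nn.Module):
    """Eqs. (6)-(9): unify channels, upsample to 1/4 scale, concatenate the 8 feature maps
    from the two encoders, apply CBAM, fuse, and predict the segmentation mask."""
    def __init__(self, in_dims=(32, 64, 160, 256), d: int = 768, n_cls: int = 2):
        super().__init__()
        # One 1x1 projection per level, shared structure for the RGB and TIFF branches.
        self.proj = nn.ModuleList([nn.Conv2d(c, d, 1) for c in in_dims + in_dims])
        self.cbam = CBAM(d * 8)
        self.fuse = nn.Conv2d(d * 8, d, 1)
        self.head = nn.Conv2d(d, n_cls, 1)

    def forward(self, rgb_feats, tiff_feats):
        feats, size = [], rgb_feats[0].shape[2:]          # H/4 x W/4 of the first RGB level
        for f, p in zip(list(rgb_feats) + list(tiff_feats), self.proj):
            feats.append(F.interpolate(p(f), size=size, mode="bilinear", align_corners=False))
        x = self.fuse(self.cbam(torch.cat(feats, dim=1)))
        return self.head(x)                               # (B, N_cls, H/4, W/4)
```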

Tiff-SegFormer training implementation

The model was trained on a cloud computing server running Ubuntu 20.04.6 LTS, with an NVIDIA GeForce RTX 4090 GPU (24 GB of video memory), a 20-core CPU and 80 GB of RAM. The deep learning framework was PyTorch 1.13.

The inputs of Tiff-SegFormer were a 512 × 512 × 3 RGB image and a 512 × 512 TIFF image. Before training, the images were normalized with the mean and variance of the ImageNet dataset and then fed into the model. The backbone of the Tiff-SegFormer model used the b0 architecture25 by default. The upper and lower encoders generated four-level hierarchical features, providing high-resolution and low-resolution scale features of the RGB and TIFF images, respectively. The dimension of the features at each level was \(\frac{H}{{2^{i + 1} }} \times \frac{W}{{2^{i + 1} }} \times C_{i}\), where \(H = W = 512,C_{i} \in \{ 32,64,160,256\} ,i \in \{ 1,2,3,4\}\). In addition, in the experiments of this study, the reduction ratio (R) in each efficient self-attention module of the upper and lower encoders was set to 4, and the number of efficient self-attention and Mix-FFN blocks in each transformer block was set to N = 2. In the MLP layer of the decoder, the unified channel number D was set to 768. Finally, the output was restored to the original image size by bilinear interpolation.

The model was trained with the pre-trained weights of the VOC2012 + SBD dataset for 150 epochs with a batch size of 16. To improve training efficiency and reduce the number of trainable parameters, thereby reducing the dimensionality of the gradient computation and the model optimization space, the backbone weights were frozen for the first 50 epochs27. The Adam optimizer was used to improve convergence efficiency. The initial learning rate was set to 0.001 and the minimum learning rate to 0.00001.
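A sketch of this freeze-then-unfreeze schedule is given below. Here `train_one_epoch` is a placeholder for the actual data loop, `model.backbone` is an assumed attribute name for the encoder, and the cosine schedule is only one possible way to decay the learning rate from 0.001 to 0.00001, since the paper specifies only those two values.

```python
import torch

def train_with_freezing(model: torch.nn.Module, train_one_epoch,
                        epochs: int = 150, freeze_until: int = 50):
    """Training-loop sketch: encoder weights are frozen for the first `freeze_until` epochs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Illustrative decay from the initial (1e-3) to the minimum (1e-5) learning rate.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs, eta_min=1e-5)
    for epoch in range(epochs):
        frozen = epoch < freeze_until
        for p in model.backbone.parameters():      # assumes the encoder is exposed as `backbone`
            p.requires_grad = not frozen
        train_one_epoch(model, optimizer)          # placeholder for the per-epoch data loop
        scheduler.step()
```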

Model performance evaluation

The compared methods

To demonstrate the effectiveness of the Tiff-SegFormer model, a comparative experiment was conducted against the following segmentation methods (Table 1).

(1) Research28,29,30,31 has shown that UNet and DeepLabv3+ networks can accurately perform semantic segmentation tasks on UAV remote sensing images, so this study used these two networks as comparison methods. Both UNet and DeepLabv3+ used the pre-trained weights of the VOC2012 + SBD dataset, and the input was a 512 × 512 × 3 RGB image. The backbone of UNet was VGG1632, and the backbone of DeepLabv3+ was MobileNetV233. The other training hyperparameters were the same as for Tiff-SegFormer.

(2) The HRNet34,35 model maintains high-resolution representations throughout the visual task and learns strong semantic and accurate spatial information. HRNet is a relatively recent deep learning model that can be used for semantic segmentation tasks, so this study used it as a comparison method. The backbone of HRNet was hrnetv2_w18, and the pre-trained weights of the VOC2012 + SBD dataset were also used. The other training hyperparameters were the same as for Tiff-SegFormer.

(3) SegFormer is the backbone network of Tiff-SegFormer, so it was necessary to use SegFormer as a comparison network. The backbone adopted was SegFormer-b0, consistent with the b0 architecture adopted by Tiff-SegFormer by default. In the initialization phase of SegFormer-b0, the pre-trained weights of the VOC2012 + SBD dataset were also used. The other training hyperparameters were the same as for Tiff-SegFormer.

(4) Hanhui et al.14 combined RGB images with a digital surface model (DSM) or crop height model (CHM) and achieved good results in potato stem and leaf segmentation. Caiwang et al.36 combined RGB and near-infrared bands and used Mask R-CNN to segment the strawberry plant canopy from the combined image, also achieving good segmentation results. Therefore, this paper combined the RGB and TIFF images by channel concatenation (see the sketch after this list) and used the concatenated four-channel (RGB + TIFF) image as the input of a SegFormer-b0 model as a comparison method for Tiff-SegFormer. In the initialization stage, the SegFormer-b0 pre-trained weights of the VOC2012 + SBD dataset were used, and the other hyperparameters were the same as for Tiff-SegFormer.
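A minimal sketch of the channel concatenation used for this baseline is shown below, assuming the temperature map has already been scaled by Eq. (5) and ×255; the first patch-embedding layer of SegFormer-b0 must then be modified to accept four input channels.

```python
import numpy as np

def stack_rgb_tiff(rgb: np.ndarray, temp_scaled: np.ndarray) -> np.ndarray:
    """Build the four-channel (RGB + TIFF) input for the comparison SegFormer-b0.
    `temp_scaled` is the temperature map after Eq. (5) and x255 scaling, same H x W as `rgb`."""
    return np.concatenate([rgb.astype(np.float32), temp_scaled[..., None]], axis=-1)  # (H, W, 4)
```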

Table 1 Details of Tiff-SegFormer and the compared methods.

Performance evaluation with the test dataset

The test data set was prepared separately according to the method in section “Data preprocessing and data extraction” and contained 507 RGB images and 507 corresponding TIFF images. The test data were applied to the trained Tiff-SegFormer model as well as to UNet, DeepLabv3+, HRNet, SegFormer and the four-channel (RGB + TIFF) SegFormer to evaluate the performance of Tiff-SegFormer in application. These data do not overlap with the training and evaluation data sets.

Evaluation metrics

To verify the performance of the Tiff-SegFormer model, six metrics were used to quantitatively evaluate all segmentation methods29, namely Precision, Recall, Intersection over Union (IoU), mean Intersection over Union (mIoU), mean pixel accuracy (mPA), and accuracy (Eqs. 10–15). The formulas are as follows:

$${\text{Precision }} = \, \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}}$$
(10)
$${\text{Recall}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}$$
(11)
$${\text{IoU}} = \frac{{{\text{TP}}}}{{{\text{FN}} + {\text{FP}} + {\text{TP}}}}$$
(12)
$${\text{mIoU}} = \frac{{1}}{{{\text{k}} + {1}}}\sum\limits_{{{\text{i}} = {0}}}^{{\text{k}}} {\frac{{{\text{TP}}}}{{{\text{FN}} + {\text{FP}} + {\text{TP}}}}}$$
(13)
$${\text{mPA}} = \frac{{1}}{{{\text{k}} + {1}}}\sum\limits_{{{\text{i}} = {0}}}^{{\text{k}}} {\frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}}$$
(14)
$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{FP}} + {\text{FN}} + {\text{TN}}}}$$
(15)

where k represents the number of classes excluding the background, and its value is 1. TP represents the number of pixels correctly classified as ‘winter wheat’; TN represents the number of pixels correctly classified as ‘background’; FP represents the number of ‘background’ pixels incorrectly judged as ‘winter wheat’; FN represents the number of ‘winter wheat’ pixels incorrectly judged as ‘background’.
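For reference, Eqs. (10)–(15) can be computed directly from the pixel-level confusion counts of a predicted mask and its ground truth. The sketch below assumes binary masks with 1 for winter wheat and 0 for background, and that both classes occur in the image (so no denominator is zero).

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Compute Eqs. (10)-(15) for a binary mask (1 = winter wheat, 0 = background)."""
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    iou_w = tp / (tp + fp + fn)                    # IoU of the wheat class
    iou_b = tn / (tn + fn + fp)                    # IoU of the background class
    return {
        "Precision": tp / (tp + fp),
        "Recall": tp / (tp + fn),
        "IoU": iou_w,
        "mIoU": (iou_w + iou_b) / 2,               # averaged over k + 1 = 2 classes
        "mPA": (tp / (tp + fn) + tn / (tn + fp)) / 2,
        "Accuracy": (tp + tn) / (tp + tn + fp + fn),
    }
```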

Results and discussion

Model training results

Using the training and evaluation datasets in section “Dataset preparation”, a total of six semantic segmentation models were constructed, namely UNet, DeepLabv3+, HRNet, SegFormer, the four-channel (RGB + TIFF) SegFormer, and Tiff-SegFormer. Figure 4 shows the mIoU on the evaluation dataset as a function of epoch during the training of all models. Since all models are initialized with pre-trained weights, the mIoU increases rapidly in the first 10 epochs. Among them, Tiff-SegFormer has the largest increase in mIoU in the first 10 epochs, exceeding 70%, while the four-channel (RGB + TIFF) SegFormer has the smallest increase, reaching close to 60%. All models use frozen training for the first 50 epochs, so the mIoU curve rises gently at this stage and then rises rapidly after epoch 50. After 80 epochs, the mIoU curve rises slowly and stabilizes.

Fig. 4
figure 4

mIoU–epoch curves of different models during training: UNet (a), DeepLabv3+ (b), HRNet (c), SegFormer (d), SegFormer (RGB + TIFF) (e), Tiff-SegFormer (f).

As shown in Table 2, on the validation dataset the mIoU, mPA and accuracy of Tiff-SegFormer are 84.28%, 88.97% and 94.55%, respectively, all better than UNet, DeepLabv3+, HRNet, SegFormer and the four-channel (RGB + TIFF) SegFormer. Among them, the mIoU, mPA and accuracy of SegFormer reached 83.38%, 88.43% and 94.19%, respectively, the closest performance to Tiff-SegFormer. The four-channel SegFormer14,36 that concatenates RGB and TIFF images achieves mIoU, mPA and accuracy of 82.89%, 88.0% and 94.02%, respectively, which is worse than the SegFormer that takes only RGB images as input. A possible reason is that the RGB image carries spectral features while the TIFF image carries temperature features; although the two are quantized to the same scale, direct concatenation fusion may cause interference between these heterogeneous features, which is counterproductive. DeepLabv3+ performs worst in terms of mIoU, mPA and accuracy, at 81.05%, 86.28% and 93.40%, respectively. In terms of per-class performance, all models perform better on the Winter Wheat (W) class than on the Background (B) class. Tiff-SegFormer achieved 93.48%, 98.48% and 94.84% for IoU, recall and precision on the W class, and 75.08%, 79.45% and 93.18% for IoU, recall and precision on the B class. Tiff-SegFormer is also superior to the other models in all per-class performance indicators.

In terms of model parameters and training time, UNet has the largest number of parameters, 97.25 M, and the longest training time, 8 h 10 min, while SegFormer has the smallest number of parameters, 14.58 M, and the shortest training time, 5 h 28 min. The encoder of the Tiff-SegFormer model is divided into two branches, which process the RGB images and TIFF images respectively, so the number of parameters is roughly doubled compared with SegFormer, at 28.63 M. However, the training time does not increase significantly: it is 5 h 48 min, 20 min more than that of SegFormer.

Table 2 Performance of the models over the training and validation dataset.

Results of test

To verify the generalization ability of the model, the trained models were applied to the test data set described in section “Performance evaluation with the test dataset”, and the results are shown in Table 3. The Tiff-SegFormer model performs best in terms of mIoU, mPA and accuracy, reaching 84.94%, 91.46% and 94.71%. The SegFormer is second only to Tiff-SegFormer on mIoU and mPA, with 83.13% and 90.23% respectively, while its accuracy of 93.69% is only better than that of DeepLabv3+. For these three indicators, Tiff-SegFormer is therefore 1.81%, 1.23% and 1.02% higher than SegFormer, respectively. The DeepLabv3+ model performs worst in terms of mIoU, mPA and accuracy, at 80.12%, 87.71% and 92.88%, respectively. For the Winter Wheat (W) and Background (B) classes, the Tiff-SegFormer model is also the best in terms of IoU, Recall and Precision. On the W class the advantage is modest: its IoU is 0.76% higher than the second-best four-channel (RGB + TIFF) SegFormer, and its recall and precision are 0.03% and 0.3% higher than the second-performing SegFormer. However, Tiff-SegFormer shows a clear advantage on the B class, exceeding the second-performing SegFormer by 2.48%, 2.43% and 0.18% on the three indicators, respectively.

In addition, the test results show that, compared with SegFormer, which uses only spectral features, Tiff-SegFormer, which integrates spectral and temperature features, improves all performance indicators, with a particularly large improvement in the segmentation accuracy of the Background (B) class. This indicates that for the Winter Wheat (W) class, the visible light images already provide sufficient feature information in terms of spectral features such as color and contour, whereas for the Background (B) class, whose spectral features are indistinct and irregular, the temperature features provide additional information that improves the segmentation.

Table 3 Performance of the models over the test dataset.
Fig. 5
figure 5

Examples of segmentation results with different models.

Figure 5 shows the segmentation results of the different models. All models successfully segmented and extracted winter wheat at the tillering stage from the UAV images, but the segmentation of the Tiff-SegFormer model is better than that of the other models. In Fig. 5a, where the distribution of winter wheat and background is complex, UNet, DeepLabv3+, HRNet, SegFormer and the four-channel (RGB + TIFF) SegFormer all show obvious mis-segmentation, in particular misidentifying background pixels as wheat pixels, whereas the segmentation of Tiff-SegFormer is closer to the ground truth. In Fig. 5d, all models identify the obvious lateral field gullies, but only the four-channel (RGB + TIFF) SegFormer and Tiff-SegFormer identify the less obvious longitudinal gullies, and Tiff-SegFormer produces a smoother result. Because Tiff-SegFormer uses temperature information in addition to RGB visible light information to guide segmentation, it is better at identifying the edges between categories in more complex environments, as is most obvious in Fig. 5a and d. In summary, the proposed Tiff-SegFormer can better segment winter wheat from UAV remote sensing images captured at the tillering stage and can more accurately identify the boundaries between winter wheat and background.

Ablation experiment

To verify the performance improvement brought by the proposed feature fusion method in semantic segmentation, several ablation experiments were conducted on the experimental data set. The ablation experiments mainly analyze the effect of increasing the backbone size of Tiff-SegFormer on model performance. Table 4a shows the parameters of the different backbones in the four experiments. The main differences between the backbones are the feature dimensions of each transformer block (Embed_dims) and the number of efficient self-attention and Mix-FFN modules in each transformer block (N). Table 4b shows the performance of the different backbones on the dataset. As Embed_dims and N increase, the performance of the model improves, especially on the Background (B) class. A possible reason is that the Background (B) class is disordered, and deeper iterations extract more temperature feature information, thus improving the segmentation. As the backbone size increases, so do the model size and training time. The ablation experiments demonstrate the potential of Tiff-SegFormer for different application scenarios.

Table 4 Ablation experiments on model size and performance.

Discussion

Innovation of study

The complexity of the field environment, coupled with the low canopy coverage of winter wheat at the tillering stage, means that the soil background and wheat canopy may appear interleaved in UAV remote sensing images, and it is difficult to cope with this challenge using manual segmentation or deep learning segmentation methods based only on RGB remote sensing images6,7,13. It is therefore necessary to fuse multi-source data to improve the segmentation. Previous studies have fused RGB remote sensing with hyperspectral or point cloud information37,38, but these approaches require more expensive data acquisition equipment, and the larger data volume increases the computational cost. In this study, RGB and thermal infrared images are collected simultaneously by the multi-source sensors of a UAV, and a deep learning method based on the fusion of visible light spectral and temperature features is proposed to automatically segment winter wheat and background at the tillering stage. The proposed method has the advantages of low cost and simple operation. Compared with other widely used segmentation methods, the segmentation accuracy is improved while the model training efficiency is not greatly affected.

The proposed Tiff-SegFormer model adopts an encoder-decoder architecture. The encoder is divided into upper and lower branches, which extract spectral and temperature features respectively with an efficient self-attention mechanism. In the forward pass, a convolution operation is used instead of positional encoding to obtain more inductive bias and improve segmentation, so the input size at test time is not constrained by the training samples. A CBAM block is added to the decoder to apply spatial and channel attention to each feature layer, which improves the robustness of the model.

Potential application

This work helps to achieve efficient and accurate segmentation of UAV remote sensing images in complex farmland environments. RGB remote sensing imaging has relatively high requirements for illumination, whereas the actual environment may introduce considerable noise (cloudy days, fogging, equipment jitter, etc.). Tiff-SegFormer introduces the radiative temperature information of the object itself, obtained from the thermal infrared image, and therefore has better noise resistance. Figure 6a shows an image degraded by Gaussian noise and motion blur, Fig. 6b shows its ground truth, and Fig. 6c shows the segmentation produced by the trained Tiff-SegFormer, demonstrating good noise resistance and robustness. In addition, when a 640 × 512 × 3 RGB image (Fig. 6d) and a 640 × 512 TIFF image (Fig. 6e) are input directly into the Tiff-SegFormer model, good segmentation results are also obtained (Fig. 6f). Therefore, Tiff-SegFormer is not limited by the size of the training images and can be applied to images taken by sensors of different models or manufacturers.

In addition, since the Tiff-SegFormer model combines visible light and thermal infrared images, the temperature information provides extra segmentation features when the phenotype and edge features of the segmented object are difficult to distinguish. Therefore, the application of Tiff-SegFormer to weed identification, multi-crop intercropping segmentation, crop pest identification and segmentation tasks for other growth periods of winter wheat can be further studied.

Finally, the essence of this work is the feature fusion of multi-source images, so the Tiff-SegFormer model can be applied not only to visible and thermal infrared images but also to the feature fusion of other image combinations, including multispectral and hyperspectral data. The main differences lie in the preprocessing stage and the Overlap Patch Embeddings module. In the preprocessing stage, the images to be fused must be registered so that their resolutions and pixel features match. In addition, because different image combinations have different numbers of channels, the input of the Overlap Patch Embeddings module needs to be adjusted, while the other encoder modules do not.
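As a sketch, adapting the first-stage Overlap Patch Embeddings to a different modality only requires changing the number of input channels; the kernel size, stride and embedding dimension shown below follow the b0 configuration, and the function name is ours.

```python
import torch.nn as nn

def make_patch_embedding(in_channels: int, embed_dim: int = 32) -> nn.Module:
    """First-stage overlap patch embedding; only `in_channels` changes with the modality
    (3 for RGB, 1 for the temperature TIFF, more for multispectral or hyperspectral stacks)."""
    return nn.Conv2d(in_channels, embed_dim, kernel_size=7, stride=4, padding=3)
```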

Fig. 6
figure 6

Examples of robustness testing results with Tiff-SegFormer: RGB image after adding noise (a), ground truth of (a) (b), segmentation result of (a) by Tiff-SegFormer (c), 640 × 512 × 3 RGB image (d), 640 × 512 TIFF image (e), segmentation result of the 640 × 512 images by Tiff-SegFormer (f).

Conclusions

In this study, Tiff-SegFormer, an automatic segmentation model for UAV multi-source remote sensing images of winter wheat at the tillering stage, was proposed. The UAV remote sensing images collected from a winter wheat field at the tillering stage are classified at the pixel level to segment the winter wheat canopy from the background. Tiff-SegFormer combines visible light spectral features and thermal infrared temperature features, which significantly improves the segmentation accuracy of winter wheat at the tillering stage. The results show that Tiff-SegFormer achieves accurate segmentation of winter wheat from remote sensing images captured at the tillering stage (mIoU = 84.28%, mPA = 88.97%, accuracy = 94.55%). Compared with widely used segmentation models and methods, Tiff-SegFormer has better segmentation performance and is an efficient and robust segmentation tool. The evaluation on the test data shows that Tiff-SegFormer has better generalization ability (mIoU = 84.94%, mPA = 91.46%, accuracy = 94.71%) and a better segmentation effect than the other models. In addition, after extreme noise processing and size transformation of the test images, Tiff-SegFormer still achieves effective segmentation of winter wheat. Therefore, Tiff-SegFormer has great potential for automatic segmentation of crop scenes such as winter wheat at the tillering stage.