Introduction

With the continuous improvement in computer processing power, deep learning has made remarkable progress in various fields1,2,3. In the agricultural domain, deep models have demonstrated outstanding performance in areas such as plant disease identification4, fruit counting5, and other applications6,7.

Intelligent tea picking is also a highly researched direction in agricultural applications. Tea, as a crucial economic crop, often suffers significant damage during traditional mechanical picking due to indiscriminate cutting, leading to a notable decline in the quality and economic value of the tea leaves. To address this issue, some scholars have dedicated themselves to designing intelligent tea-picking devices. Accurate identification and detection of tea leaves are fundamental to the operation of these devices and crucial for precise harvesting. For this purpose, some studies8,9,10 employ a combination of computer vision and deep learning to achieve accurate identification of tea leaves. However, their models have large numbers of parameters and slow inference speeds, and they must be deployed on devices with abundant computing resources and large operating memory. This poses new challenges for recognizing tea leaves in natural environments.

Knowledge distillation11,12,13 has been proven to be an effective model compression method, capable of addressing the aforementioned challenges. It transfers the dark knowledge of a powerful yet complex teacher model to a lightweight student model, improving the student's performance without incurring additional inference costs and thereby achieving model compression. In recent years, with in-depth research on knowledge distillation, many efficient methods have emerged. Mimick14 opted to learn the features of the regions proposed by the student's region proposal network, instead of the entire teacher feature map, to simplify the task. FGFI15 obtained fine-grained feature maps by leveraging Ground-Truth boxes and anchors. The above methods mainly focus on extracting features from the foreground, but some works16,17 have found that the background is also helpful for object detection. Guo et al.18 found, through separate knowledge distillation experiments on the background and foreground, that both contribute to improved student performance, confirming the existence of exploitable dark knowledge in the background. Additionally, through visual analysis of the gradient information of the Feature Pyramid Network (FPN)19, they found that the gradient for the foreground is significantly larger than that for the background, indicating different levels of importance for foreground and background. On the other hand, MGD20 discovered that, during distillation, masking random regions of the student's feature map and forcing the student to reconstruct the features in these masked areas through a simple generation block enhances the student's feature representation ability. However, MGD uses randomly generated masks, which do not enable the student to focus on areas considered important by the teacher.
They also overlook the fact that the background contains far more pixels than the foreground, so background features account for a larger share of the distillation loss, causing an imbalance. This issue is more pronounced in the tea leaves dataset we constructed, as shown in Fig. 1: the pixel ratio between foreground and background regions in the tea leaves dataset is approximately 0.9:9.1, whereas in the COCO dataset21 it is approximately 5.6:4.4. This stark disproportion makes background features far more likely to dominate in the tea leaves dataset. Given these characteristics, treating all pixels equally and extracting features for training the student model in a uniform manner would further exacerbate the negative impact on distillation. Therefore, to address this loss imbalance, we adopt Decoupled Distillation, aiming to alleviate the impact of the highly imbalanced foreground-to-background pixel ratio in the tea leaves dataset.

Fig. 1

Comparing the proportion of foreground pixels in the tea leaves dataset to that in the COCO dataset.

Based on the observations and insights above, we propose Reconstructed Feature and Dual Distillation (RFDD), as shown in Fig. 2. Specifically, we first obtain a spatial attention map from the teacher model's feature map. Based on the magnitudes of the attention values, we generate a specific mask to mask the student model's features, guiding the subsequent generation block to focus on regions considered important by the teacher. Secondly, studies such as SENet22 and CBAM23 have shown that adding channel attention on top of spatial attention can further enhance a model's performance, and Zhang et al.24 demonstrated that having the student learn the teacher's channel attention map during distillation improves student performance. This indicates that the teacher's level of interest in different channels is valuable dark knowledge. Therefore, to further exploit the positive impact of channels on distillation, we use the SE block22 to learn channel clues from the teacher's feature map and apply them to enhance the student's features, improving the student model's channel awareness. Next, we separate foreground and background features based on the Ground-Truth, guiding the student to focus on critical pixels and channels; during Decoupled Distillation, we adopt an equal strategy for extracting foreground and background features. Furthermore, since Decoupled Distillation severs the connection between foreground and background, resulting in the loss of global relation information, we employ GcNet25 to implement Global Distillation, collecting this relation knowledge and transferring it to the student. Our loss function is calculated only on the feature map, making our method applicable to various detectors, including two-stage models, anchor-based one-stage models, and anchor-free one-stage models.
Experimental results indicate that using RFDD in tea leaves recognition models across different frameworks has yielded promising outcomes.

Fig. 2

An illustration of RFDD, including the two components: Reconstructed Feature and Dual Distillation. Reconstructed Feature enhances the representation capability of the student model. Dual Distillation comprises Decoupled Distillation and Global Distillation, where Decoupled Distillation enforces the student model to focus on critical pixels and channels, Global Distillation bridges the gap in global pixel-wise relation knowledge caused by Decoupled Distillation.

Related works

Object detection

Object detection has always been a fundamental and challenging task in computer vision, aiming to identify the category and location of objects in images. Typically, detectors based on Convolutional Neural Networks (CNN) can be categorized into two-stage detectors26,27 and one-stage detectors28,29,30. Two-stage detectors are renowned for their higher detection accuracy. However, due to the use of a Region Proposal Network (RPN), they have a slower inference speed, making it difficult for them to meet real-time application requirements. In contrast, one-stage detectors directly predict the categories and bounding boxes of target objects on the feature map, offering faster detection speed. Furthermore, one-stage detectors can be divided into anchor-based28 and anchor-free29,30 categories. Anchor-based models enhance detection speed while maintaining accuracy by using predefined anchor boxes, but the number of anchor boxes far exceeds the number of targets, leading to additional computation. Anchor-free detectors are more lightweight, directly predicting the key points and positions of target objects, but they have relatively lower detection accuracy. Despite the different detection heads used in these three types of detectors, they all take features as input. Therefore, our knowledge distillation method can be easily applied to all of them.

Tea leaves recognition

Early scholars both domestically and internationally primarily employed traditional machine learning and image processing techniques for tea leaves recognition and detection. Chen et al.31 achieved three-dimensional reconstruction and detection of tea leaves using color transformation, area filtering, and edge computation. Wu et al.32 obtained a higher tea leaves recognition rate than the Otsu algorithm by using color change and the k-means clustering algorithm. Wu et al.33 designed a maximum variance automatic threshold method for G and G-B components, effectively distinguishing buds from old leaves, achieving a recognition rate of up to 92%. However, the above studies were conducted under controlled conditions, and it is challenging to replicate the real conditions in tea gardens with complex environments, including uncontrolled lighting and high similarity between foreground and background. Therefore, many scholars have begun to apply deep learning to the study of tea leaves.

Compared to machine learning, the advantage of deep learning lies in abandoning feature engineering, enabling automatic feature selection based on the given data, and establishing end-to-end models. Chen et al.8 successfully implemented detection and picking point localization of tea leaves using Faster-RCNN and Fully Convolutional Networks (FCN). Wang et al.34 proposed a tea leaves picking point localization method based on Mask-RCNN. Qing et al.35 proposed an identification method using the YOLOv3 deep convolutional model, which enhanced the model’s ability to recognize tea leaves by adding the SPP module in the convolution. However, while these deep models exhibit high performance and strong generalization capabilities, their large number of parameters and slow inference speed do not meet the real-time requirements of tea leaves recognition in natural environments. Therefore, we need to compress large models, and employing the knowledge distillation approach proves to be an effective solution to this problem. By distilling knowledge into smaller models, we can achieve faster detection speed, fewer parameters and computations while maintaining a detection accuracy close to that of larger models. Consequently, the small model obtained through knowledge distillation can replace the large model to accomplish real-time tea leaves detection tasks.

Knowledge distillation

Knowledge distillation is a model compression technique initially proposed by Hinton et al.11. This method transfers dark knowledge by using soft labels generated by the teacher model as targets for the student model, thereby enhancing the classification accuracy of the student model without incurring additional computational costs. Building upon Hinton's ideas, many researchers have designed efficient distillation methods for classification tasks36,37,38.

In recent years, some studies have successfully applied knowledge distillation to object detection. Chen et al.39 were the first to apply knowledge distillation to object detection models. Taking the classic Faster R-CNN framework as an example, the authors distilled knowledge from three parts: the neck features, the classification head, and the regression head, and transferred it to the student model. Many researchers previously believed that the extreme imbalance between foreground and background was a major factor influencing distillation. The authors of FGFI15, in order to reduce the impact of background features on distillation, used fine-grained masks to extract features of target objects and their surrounding areas. Sun et al.40 used Gaussian masks to guide the student to focus on the Ground-Truth and its surrounding areas. However, the study by Guo et al.18 indicates that background features also have a positive effect on distillation; the authors distilled knowledge from the background and enhanced the performance of the student model. Additionally, Yang et al.20 found that by randomly masking the student model's feature maps and using a simple generative block to reconstruct the features, the representation capability of the student could be improved. Based on this, they proposed an innovative distillation method called Masked Generative Distillation (MGD). However, MGD utilizes random masks, which may introduce additional noisy features, and its authors treat all pixels equally.

Materials and methods

Dataset construction

The Taiping Houkui tea leaves dataset used in this paper was collected from the Huangshan Houkui Tea Plantation, as shown in Fig. 3. The 3500 images were collected between April and May 2023 using two digital cameras, a NIKON D7100 and a SONY ZV-E10, at a resolution of 4000 × 6000 pixels, and were stored in JPG format. To obtain clear and accurate images of tea leaves, we shot from an overhead perspective with the camera axis forming an angle of 30°–60° with the horizontal, at a distance of 40–60 cm. If the shooting angle is too low, a significant area between the leaves is blocked, while too high an angle makes the stems of the tender shoots invisible. All tea leaves images were captured in natural environments, and to ensure the diversity of the dataset, it includes images of tea leaves obtained under different angles, backgrounds, occlusions, and lighting conditions.

Fig. 3

Taiping Houkui Tea dataset images.

Additionally, due to the complex tea garden environment, the varying growth angles of tea leaves, and the influence of environmental factors such as weather and lighting on tea leaves recognition, we employed common data augmentation techniques such as flipping, random rotation, and color jitter to expand the tea leaves images and improve the generalization ability of the neural network model. Augmenting image data through flipping and random rotation simulates tea leaves features from various angles; adjusting image brightness and contrast simulates changes in tea leaves color due to factors like lighting and weather in the tea garden. A partial illustration of the data augmentation is shown in Fig. 4. Subsequently, we used the LabelMe software41 to annotate our collected tea leaves dataset, defining the annotated regions as the "tea" class and generating JSON files. Finally, we randomly divided the images into training, validation, and testing sets in an 8:1:1 ratio.
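The augmentations described above can be sketched directly with tensor operations; this is a minimal illustration (function name `augment` and the fixed rotation/brightness parameters are our assumptions, not the paper's pipeline — in practice, torchvision-style random transforms would be used):

```python
import torch

def augment(img, flip=True, k_rot=1, brightness=1.2):
    """Minimal sketch of the augmentations used to expand the dataset:
    horizontal flip, 90-degree rotation, and brightness jitter.
    img: float tensor of shape (C, H, W) with values in [0, 1]."""
    out = img
    if flip:
        out = torch.flip(out, dims=[-1])            # horizontal flip
    out = torch.rot90(out, k=k_rot, dims=[-2, -1])  # rotation (fixed k here)
    out = (out * brightness).clamp(0.0, 1.0)        # brightness jitter
    return out

img = torch.rand(3, 64, 48)   # toy image
aug = augment(img)            # rotated copy: shape (3, 48, 64)
```

In a real pipeline each transform would be applied with a random probability so that every epoch sees a different variant of each image.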

Fig. 4

The effect of data augmentation on tea leaves images.

Method

In knowledge distillation, Gou et al.42 categorize dark knowledge into relation-based knowledge, response-based knowledge, and feature-based knowledge according to its types. In this paper, we define \(T \in {{\mathbb{R}}^{C \times H \times W}}\) and \(S \in {{\mathbb{R}}^{C \times H \times W}}\) to represent the features extracted by the teacher model and the student model, respectively. Mathematically, the distillation formula based on feature is expressed as:

$$L=\frac{1}{CHW}\sum\limits_{c=1}^{C}\sum\limits_{h=1}^{H}\sum\limits_{w=1}^{W}{\left({T_{c,h,w}}-\phi({S_{c,h,w}})\right)^2}$$
(1)

where \(\phi\) is the adaptive layer that aligns the shapes of \({T_{c,h,w}}\) and \({S_{c,h,w}}\), and \(C,H,W\) denote the channel, height, and width of the feature maps, respectively. Building upon Eq. (1), many studies43,44,45 have focused on guiding the student to imitate the teacher's feature maps to improve the student's performance. The recent MGD discovered that teachers can enhance students' representational capabilities by guiding them in feature reconstruction. However, MGD adopted random masks and overlooked the negative impact of foreground-background imbalance on distillation. To address these issues, we propose RFDD.
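Eq. (1) can be sketched in PyTorch as follows; the shapes, channel counts, and the helper name `feature_distill_loss` are illustrative assumptions, with \(\phi\) realized as a \(1 \times 1\) convolution as is common for adaptive layers:

```python
import torch
import torch.nn as nn

def feature_distill_loss(t_feat, s_feat, adapt):
    """Eq. (1): MSE between teacher features and adapted student features,
    normalized by C * H * W."""
    c, h, w = t_feat.shape[-3:]
    return ((t_feat - adapt(s_feat)) ** 2).sum() / (c * h * w)

adapt = nn.Conv2d(128, 256, kernel_size=1)   # adaptive layer phi
t = torch.randn(1, 256, 8, 8)                # teacher feature map
s = torch.randn(1, 128, 8, 8)                # student feature map
loss = feature_distill_loss(t, s, adapt)     # non-negative scalar
```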

In section “Reconstructed feature” of this paper, we guide the student model to reconstruct the complete feature of the teacher model, simultaneously adding the SE block to enhance the student model’s channel awareness. Subsequently, in section “Decoupled Distillation”, we segment the feature maps into foreground and background based on Ground-Truth for Decoupled Distillation. Finally, in section “Global Distillation”, we utilize GcNet to capture global pixels relation, collecting dark knowledge lost during the Decoupled Distillation.

Reconstructed feature

To overcome the issues introduced by the random masks in MGD, we utilize the teacher’s spatial attention maps to design the masks, guiding the generation block more accurately to reconstruct features in areas deemed important by the teacher. First, we calculate the absolute mean along the channel dimension of the teacher model:

$${G^{Spatial}}(T)=\frac{1}{C}\sum\limits_{c=1}^{C}\left|{T_c}\right|$$
(2)

where \({G^{Spatial}} \in {{\mathbb{R}}^{1 \times {\text{H}} \times {\text{W}}}}\) represents the spatial attention map. Subsequently, the spatial attention mask obtained based on \({G^{Spatial}}\) can be formulated as:

$${\mathcal{A}^{Spatial}}(T)=H \cdot W \cdot {\text{softmax}}\left( {\frac{{{G^{Spatial}}(T)}}{\mathcal{T}}} \right)$$
(3)

where \(\mathcal{T}\) is the hyperparameter introduced in11 for changing the probability distribution.

After applying Eqs. (2) and (3), \({\mathcal{A}^{Spatial}}(T)\) has shape \(1 \times H \times W\), and we interpret its values as the teacher model's level of interest in different pixels of the feature map. We then set a hyperparameter \(\lambda\): when the value of \({\mathcal{A}^{Spatial}}(T)\) at a position exceeds \(\lambda\), the mask value \(M_{{h,w}}^{{Spatial}}\) is set to 0; otherwise, it is set to 1. This can be expressed as:

$$M_{{h,w}}^{{Spatial}}=\left\{ \begin{array}{ll} {\text{0}},& \quad {\text{if }}\mathcal{A}_{{h,w}}^{{Spatial}}>\lambda \\ 1,& \quad {\text{Otherwise}} \end{array} \right.$$
(4)

where \(\lambda\) is a hyperparameter controlling how many pixels are masked, and \(\mathcal{A}_{{h,w}}^{{Spatial}}\) is the attention score of the teacher's feature map at the coordinate \(\left( {h,w} \right)\). Next, we apply the mask \(M_{{h,w}}^{{Spatial}}\) to the feature map of the student model:

$$S_{{c,h,w}}^{{mask}}=\phi \left( {{S_{c,h,w}}} \right) \cdot M_{{h,w}}^{{Spatial}}$$
(5)

In this paper, we use a \(1 \times 1\) convolutional layer as the adaptive layer \(\phi\). The mask generated with the help of the teacher's attention map allows us to mask specific regions of the student feature map according to the teacher's level of interest in different areas, eliminating the drawbacks introduced by randomly generated masks. Afterward, we reconstruct the teacher's features using a simple generative block consisting of two convolutional layers, \({{\mathbf{W}}_{G1}}\) and \({{\mathbf{W}}_{G2}}\), with a ReLU activation in between. It can be formulated as:

$${X_{gen}}={{\mathbf{W}}_{G2}}\left(\mathrm{ReLU}\left({{\mathbf{W}}_{G1}}\left(S_{{c,h,w}}^{{mask}}\right)\right)\right)$$
(6)

where \({{\mathbf{W}}_{G1}}\) and \({{\mathbf{W}}_{G2}}\) are \(3 \times 3\) convolutional layers.
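The attention-guided masking and reconstruction of Eqs. (2)–(6) can be sketched in PyTorch as follows; tensor shapes, channel counts, and the names `spatial_mask` and `GenerationBlock` are illustrative assumptions, not the paper's released code:

```python
import torch
import torch.nn as nn

def spatial_mask(t_feat, temperature=0.5, lam=1.0):
    """Eqs. (2)-(4): binary mask from the teacher's spatial attention.
    t_feat: (C, H, W). Pixels the teacher attends to most (score > lam)
    are zeroed so the student must reconstruct them."""
    c, h, w = t_feat.shape
    g = t_feat.abs().mean(dim=0)                                         # Eq. (2)
    a = h * w * torch.softmax(g.flatten() / temperature, 0).view(h, w)   # Eq. (3)
    return (a <= lam).float()                                            # Eq. (4)

class GenerationBlock(nn.Module):
    """Eq. (6): two 3x3 convolutions with a ReLU in between."""
    def __init__(self, channels):
        super().__init__()
        self.g1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.g2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return self.g2(torch.relu(self.g1(x)))

phi = nn.Conv2d(128, 256, kernel_size=1)   # adaptive layer phi (1x1 conv)
gen = GenerationBlock(256)
t = torch.randn(256, 8, 8)                 # teacher feature (C, H, W)
s = torch.randn(1, 128, 8, 8)              # student feature (N, C', H, W)
mask = spatial_mask(t)                     # (H, W)
s_masked = phi(s) * mask                   # Eq. (5): mask broadcasts over channels
x_gen = gen(s_masked)                      # Eq. (6): reconstructed feature
```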

On the other hand, because the tea leaves dataset contains a large number of detection targets, detecting tea leaves is essentially a dense prediction task over multiple objects. For dense prediction, extracting target information at different scales can cause significant performance fluctuations in detectors, which many works46,47 have not considered. Therefore, we introduce a simple and lightweight SE (Squeeze-and-Excitation) layer to learn channel clues from the teacher's features and apply them to enhance the student's features, further improving the model's perceptual capability. Firstly, we compress the global spatial information into a channel attention map through global average pooling. Formally, the channel attention map \({G^{Channel}} \in {{\mathbb{R}}^{C \times 1 \times 1}}\) is obtained by collapsing the \({\text{H}} \times {\text{W}}\) dimensions of the feature map:

$${G^{Channel}}(T)=\frac{1}{H \times W}\sum\limits_{h=1}^{H}\sum\limits_{w=1}^{W}\left|{T_{h,w}}\right|$$
(7)

Afterwards, we perform a linear projection operation on it to obtain the clue matrix \({{\text{X}}_{clue}} \in {{\mathbb{R}}^{C \times 1 \times 1}}\), which is mathematically represented as follows:

$${X_{clue}}=\sigma\left({{\mathbf{W}}_{C2}}\left(\mathrm{ReLU}\left({{\mathbf{W}}_{C1}}\left({G^{Channel}}(T)\right)\right)\right)\right)$$
(8)

where \(\sigma\) denotes the sigmoid function, and \({{\mathbf{W}}_{C1}}\) and \({{\mathbf{W}}_{C2}}\) represent two fully connected layers. Finally, we merge the generated feature map \({X_{gen}} \in {{\mathbb{R}}^{C \times H \times W}}\) with the obtained channel clue matrix \({{X}_{clue}}\) to enhance the student model's channel perception ability. This can be defined as:

$$\mathcal{G}(S_{{c,h,w}}^{{mask}})={X_{gen}} \odot {X_{clue}}$$
(9)

where \(\odot\) denotes the Hadamard product, broadcast over the spatial dimensions. In a nutshell, the Reconstructed Feature described above improves the feature-masking strategy and successfully transmits dark knowledge about the importance of regions. The resulting student feature map contains more valuable semantic information, ultimately enhancing the representation capability of the student model.
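The SE-style channel clue of Eqs. (7)–(9) can be sketched as follows; the class name `ChannelClue`, the reduction ratio, and the toy shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ChannelClue(nn.Module):
    """Squeeze the teacher feature into a channel clue (Eqs. 7-8) and
    use it to rescale the generated student feature (Eq. 9)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, t_feat, x_gen):
        # Eq. (7): global average pooling of |T| over H x W -> (N, C)
        g = t_feat.abs().mean(dim=(-2, -1))
        # Eq. (8): two FC layers with ReLU, then sigmoid
        clue = torch.sigmoid(self.fc2(torch.relu(self.fc1(g))))
        # Eq. (9): Hadamard product, broadcasting the clue over H x W
        return x_gen * clue[:, :, None, None]

se = ChannelClue(256)
t = torch.randn(1, 256, 8, 8)       # teacher feature
x_gen = torch.randn(1, 256, 8, 8)   # reconstructed student feature
out = se(t, x_gen)                  # channel-enhanced feature G(S^mask)
```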

Decoupled distillation

Because the tea leaves dataset has far more background pixels than foreground pixels, background features dominate when calculating the loss function, which degrades the distillation effect on the student model. Therefore, in this section, we perform Decoupled Distillation to balance the extraction of pixels. First, we set a binary mask \(M_{{h,w}}^{{Segmentation}}\) based on the Ground-Truth to separate foreground and background. It can be formalized as:

$$M_{{h,w}}^{{Segmentation}}=\left\{ \begin{array}{ll} 1,&\quad{\text{if }}\left( {h,w} \right) \in {\text{B}} \\ 0,&\quad {\text{Otherwise}} \end{array} \right.$$
(10)

where \({\text{B}}\) represents the set of annotated tea leaves boxes, and \(\left( {h,w} \right)\) denotes the coordinates of the corresponding position in the feature map. If a pixel lies inside a Ground-Truth box, \(M_{{h,w}}^{{Segmentation}}\) is set to 1; otherwise, it is 0. Secondly, to minimize the negative impact of pixel imbalance on distillation, we designed the following loss function:

$$\begin{aligned} {L_{decoupled}}&=\frac{\alpha }{{N_{obj}}} \cdot \sum\limits_{{c=1}}^{C} \sum\limits_{{h=1}}^{H} \sum\limits_{{w=1}}^{W} M_{{h,w}}^{{Segmentation}}{\left({T_{c,h,w}} - \mathcal{G}(S_{{c,h,w}}^{{mask}})\right)^2} \\ &\quad+\frac{\beta }{{N_{bg}}} \cdot \sum\limits_{{c=1}}^{C} \sum\limits_{{h=1}}^{H} \sum\limits_{{w=1}}^{W} \left( {1 - M_{{h,w}}^{{Segmentation}}} \right){\left({T_{c,h,w}} - \mathcal{G}(S_{{c,h,w}}^{{mask}})\right)^2} \\ {N_{obj}}&=C \cdot \sum\limits_{{h=1}}^{H} \sum\limits_{{w=1}}^{W} M_{{h,w}}^{{Segmentation}},\qquad {N_{bg}}=C \cdot \sum\limits_{{h=1}}^{H} \sum\limits_{{w=1}}^{W} \left( {1 - M_{{h,w}}^{{Segmentation}}} \right) \end{aligned}$$
(11)

where \(\alpha\) and \(\beta\) are hyperparameters used to balance foreground and background distillation losses, \({N_{obj}}\) and \({N_{bg}}\) represent the number of pixels in the foreground and background regions, respectively.
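Eqs. (10)–(11) can be sketched as follows; the helper name `decoupled_loss` and the box representation (per-feature-map index ranges) are illustrative assumptions:

```python
import torch

def decoupled_loss(t, g_s, boxes, alpha, beta):
    """Eqs. (10)-(11): foreground/background-decoupled MSE.
    t, g_s: (C, H, W) teacher feature and reconstructed student feature.
    boxes: list of (h1, h2, w1, w2) Ground-Truth regions on the feature map."""
    c, h, w = t.shape
    m = torch.zeros(h, w)                 # Eq. (10): segmentation mask
    for h1, h2, w1, w2 in boxes:
        m[h1:h2, w1:w2] = 1.0
    sq = (t - g_s) ** 2
    n_obj = c * m.sum()                   # foreground pixel count
    n_bg = c * (1 - m).sum()              # background pixel count
    fg = (m * sq).sum() / n_obj.clamp(min=1)
    bg = ((1 - m) * sq).sum() / n_bg.clamp(min=1)
    return alpha * fg + beta * bg         # Eq. (11)

t = torch.randn(256, 8, 8)
g_s = torch.randn(256, 8, 8)
loss = decoupled_loss(t, g_s, [(2, 5, 1, 4)], alpha=2e-4, beta=4e-4)
```

Normalizing the two terms by their own pixel counts is what keeps the far larger background region from dominating the loss.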

Global distillation

In knowledge distillation, relations between different pixels48,49 have been proven to be a form of dark knowledge that can enhance the performance of detection models. In section “Decoupled Distillation”, we performed Decoupled Distillation, allowing the student model to attend differently to foreground and background features. However, it severs the relation between foreground and background. Therefore, we propose Global Distillation: as shown in Fig. 5, we utilize GcBlock25 to extract relation knowledge between global pixels and distill it from the teacher detector to the student detector. The loss function for Global Distillation is as follows:

$$\begin{aligned}{L_{global}}&=\gamma \cdot \sum {\left( {Z\left( {{T_{c,h,w}}} \right) - Z\left( {\mathcal{G}(S_{{c,h,w}}^{{mask}})} \right)} \right)^2} \\ Z\left( Y \right)&=Y+{{\mathbf{W}}_{L3}}\left( {\mathrm{ReLU}\left( {{\text{LN}}\left( {{{\mathbf{W}}_{L2}}\left( {\sum\nolimits_{{j=1}}^{{N^p}} {\frac{{e^{{{\mathbf{W}}_{L1}}{Y_j}}}}{{\sum\nolimits_{{m=1}}^{{N^p}} {e^{{{\mathbf{W}}_{L1}}{Y_m}}}}}} {Y_j}} \right)} \right)} \right)} \right)\end{aligned}$$
(12)

where \({{\mathbf{W}}_{L1}}\), \({{\mathbf{W}}_{L2}}\), and \({{\mathbf{W}}_{L3}}\) denote convolutional layers, \(LN\) denotes layer normalization, \({N^p}\) is the number of pixels in the feature map, and \(\gamma\) is a hyperparameter to balance the loss.
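A simplified sketch of Eq. (12) follows; this `GcBlock` keeps only the context-pooling and bottleneck-transform structure (the reduction ratio and shapes are our assumptions, and the official GcNet implementation differs in detail):

```python
import torch
import torch.nn as nn

class GcBlock(nn.Module):
    """Simplified GcBlock: softmax-pooled global context followed by a
    bottleneck transform with LayerNorm, added back to the input (Z(Y) in Eq. 12)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.wl1 = nn.Conv2d(channels, 1, 1)                   # context attention
        self.wl2 = nn.Conv2d(channels, channels // reduction, 1)
        self.ln = nn.LayerNorm([channels // reduction, 1, 1])
        self.wl3 = nn.Conv2d(channels // reduction, channels, 1)

    def forward(self, y):
        n, c, h, w = y.shape
        attn = torch.softmax(self.wl1(y).view(n, 1, h * w), dim=-1)   # (N, 1, HW)
        ctx = torch.bmm(y.view(n, c, h * w), attn.transpose(1, 2))    # (N, C, 1)
        ctx = ctx.view(n, c, 1, 1)
        return y + self.wl3(torch.relu(self.ln(self.wl2(ctx))))

def global_loss(gc, t, g_s, gamma):
    """L_global of Eq. (12): squared error between GcBlock outputs."""
    return gamma * ((gc(t) - gc(g_s)) ** 2).sum()

gc = GcBlock(256)
t = torch.randn(1, 256, 8, 8)      # teacher feature
g_s = torch.randn(1, 256, 8, 8)    # reconstructed student feature
loss = global_loss(gc, t, g_s, gamma=5e-6)
```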

Fig. 5

Employing GcBlock for Global Distillation. The input features are derived from the original teacher feature and the reconstructed student feature, respectively.

Overall loss

Based on our proposed distillation method (RFDD), the overall loss function for training the student detector is as follows:

$${L_{total}}={L_{original}}+{L_{decoupled}}+{L_{global}}$$
(13)

where \({L_{original}}\) is the original loss of the detector. Our distillation loss is calculated only on the feature maps, which are obtained from the neck of the detector; therefore, our method can be applied to different tea detectors.

Experiments

Dataset

We evaluate our proposed knowledge distillation method on the Taiping Houkui Tea dataset constructed above. Specifically, we use 2800 tea leaves images for model training, validate the trained model on 350 images, and finally test the model on another set of 350 images. To evaluate model performance, metrics such as Average Precision (AP), mean Average Precision (mAP), and frames per second (FPS) are commonly chosen to assess detection accuracy and speed. Since our tea leaves dataset contains only the "tea" category, AP and mAP are equivalent in our experiments; therefore, we use AP and FPS to assess the student model before and after distillation. AP is computed at an IoU threshold of 0.5.

Details

Our comprehensive experiments involve detectors from three different frameworks: the two-stage detectors Faster R-CNN and Mask R-CNN, the anchor-based one-stage detector RetinaNet, and the anchor-free one-stage detectors FCOS and RepPoints. We use ResNet101 and ResNet5050 as the backbone networks for the teacher and student models, respectively. Additionally, we conduct a series of ablation experiments and study the sensitivity of the hyperparameters. In all experiments, we use the MMDetection51 toolbox and the PyTorch52 framework on a server equipped with an RTX 3080Ti GPU.

RFDD uses 5 hyperparameters: \(\alpha\), \(\beta\), \(\gamma\), \(\lambda\), and \(\mathcal{T}\). Specifically, we set {\(\alpha =6 \times {10^{ - 5}}\),\(\beta =2 \times {10^{ - 5}}\),\(\gamma =5 \times {10^{ - 7}}\),\(\lambda =0.8\),\(\mathcal{T}=0.5\)} for two-stage detectors and {\(\alpha =2 \times {10^{ - 4}}\),\(\beta =4 \times {10^{ - 4}}\),\(\gamma =5 \times {10^{ - 6}}\),\(\lambda =1\),\(\mathcal{T}=0.5\)} for all one-stage detectors. During training, we utilized the SGD optimizer and trained each model for 24 epochs. The initial learning rate was set to 0.0025, batch size was 2, and the learning rate was divided by 10 at the 16th and 22nd epochs. Additionally, we set the momentum to 0.9 and weight decay to 0.0001.
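The training schedule described above (SGD, initial learning rate 0.0025, momentum 0.9, weight decay 1e-4, learning rate divided by 10 at epochs 16 and 22 of 24) can be sketched as follows; `model` is a stand-in module, since in practice it would be the student detector:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)   # stand-in for the student detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.0025,
                            momentum=0.9, weight_decay=1e-4)
# Divide the learning rate by 10 at the 16th and 22nd epochs
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[16, 22], gamma=0.1)

lrs = []
for epoch in range(24):
    # ... one training epoch over the tea-leaves set would run here ...
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
```

With this schedule, epochs 0–15 run at 0.0025, epochs 16–21 at 0.00025, and the final two epochs at 0.000025.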

Table 1 shows the params, flops, and FPS of the models we used in the experiments. Typically, within the same framework, as the backbone network deepens, the model’s performance improves, but the number of parameters and computational load also significantly increase. This results in a noticeable decrease in detection speed (FPS), failing to meet the real-time requirements of detection tasks. Therefore, balancing detection accuracy and speed is crucial in the task of tea leaves detection in natural environments.

Table 1 Params, flops, and FPS of the models used in this paper.

Main results

In the comparative experiments of this paper, we conducted distillation experiments on five detectors and compared our RFDD method with three knowledge distillation methods: Defeat18, MGD20, and DiffKD53. As shown in Table 2 and Fig. 6, our distillation method surpasses MGD, Defeat, and DiffKD. In the first group of experiments, RetinaNet was used as both the teacher and student detection framework. Compared to the baseline RetinaNet-Res50, our distillation method improved AP by 3.14%, which is 0.71%, 0.45%, and 0.33% higher than the improvements achieved by Defeat, MGD, and DiffKD, respectively. As in the first group, each subsequent experiment used ResNet50 as the backbone network for the student model and ResNet101 for the teacher model. The results of each group confirm the effectiveness of our method, providing a promising solution for high-precision detection of tea leaves with lightweight models. Additionally, as shown in Fig. 7, we present visualized feature maps from student models trained with the aforementioned distillation methods. These visualizations show that student models trained with the different knowledge distillation methods all gain enhanced feature representation capabilities, and that the features extracted by RFDD are superior and more distinguishable than those of the other methods.

Table 2 Detection results on the Taiping Houkui Tea dataset using different distillation methods and detection frameworks. Defeat can only be applied to anchor-based detectors.
Fig. 6

Comparison of detection accuracy, detection speed, and model size between the student models guided by RFDD and the teacher models.

Fig. 7

Visualization of feature maps from the original student model and student models trained using different distillation methods. The teacher detector is Faster R-CNN-Res101, while the student detector is Faster R-CNN-Res50.

Analysis

RFDD is mainly composed of two parts and is equipped with five hyperparameters. Therefore, we need to conduct a series of ablation experiments and hyperparameters studies to explore the effectiveness of each part in distillation. In the subsequent experiments in this paper, we use RetinaNet as the framework for the model, with ResNet101 and ResNet50 set as the backbone networks for the teacher and student models, respectively, for exploration and research.

Different distillation regions

In this section, we designed experiments that decouple foreground and background during distillation. Surprisingly, as shown in Table 3, distilling foreground and background together resulted in the poorest performance, even worse than distilling only the background or only the foreground. We also observed that distilling only the background region and distilling only the foreground region achieved nearly identical performance, further indicating that the background plays a positive role in distillation. Moreover, when we decoupled the distillation of foreground and background, we achieved performance surpassing that of distilling only the foreground or only the background, indicating that Decoupled Distillation contributes to improved performance of the student model.

Table 3 Exploration research on distillation regions.
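The decoupled distillation above can be sketched as follows. This is a minimal illustration, not the paper's exact loss: the function name, the squared-error form, and the way Ground-Truth boxes are rasterized into a binary foreground mask are all assumptions.

```python
import torch

def decoupled_distill_loss(f_student, f_teacher, gt_boxes, alpha=6e-5, beta=2e-5):
    """Sketch of decoupled foreground/background feature distillation.

    f_student, f_teacher: (N, C, H, W) feature maps at the same FPN scale.
    gt_boxes: list of (x1, y1, x2, y2) boxes in feature-map coordinates.
    alpha, beta weight the foreground and background terms separately.
    """
    n, c, h, w = f_student.shape
    mask = torch.zeros(n, 1, h, w)                 # 1 = foreground pixel
    for x1, y1, x2, y2 in gt_boxes:
        mask[:, :, y1:y2, x1:x2] = 1.0
    diff = (f_student - f_teacher) ** 2
    fg_loss = (diff * mask).sum()                  # foreground-only term
    bg_loss = (diff * (1.0 - mask)).sum()          # background-only term
    return alpha * fg_loss + beta * bg_loss
```

Decoupling lets the two regions receive independent weights instead of one uniform penalty over the whole feature map.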

The attention mask and channel clue

As shown in Table 4, we explored the impact of two crucial modules, the Attention Mask and the Channel Clue, during feature reconstruction. Experimental results indicate that the model performs best when both modules are active simultaneously, and removing either component degrades performance to varying degrees: without the Attention Mask, AP decreases by 0.41%, and without the Channel Clue, AP decreases by 0.32%. This indicates that the Attention Mask and Channel Clue collaborate to enhance the student model's perception of targets, which is crucial for tea leaves detection.

Table 4 An ablation study of two critical modules, Attention Mask and Channel Clue, under the RetinaNet framework.
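Since the Channel Clue is built on the SE block, its role can be sketched with a standard squeeze-and-excitation module. The class name and the reduction ratio `r` are illustrative assumptions; the paper's exact configuration may differ.

```python
import torch
import torch.nn as nn

class ChannelClue(nn.Module):
    """SE-style channel attention, sketching the Channel Clue module."""

    def __init__(self, channels, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (N, C, H, W)
        w = x.mean(dim=(2, 3))                  # squeeze: global average pool
        w = self.fc(w)                          # excite: per-channel weights in (0, 1)
        return x * w[:, :, None, None]          # reweight channels
```

The learned per-channel weights highlight informative channels, complementing the spatial Attention Mask, which highlights informative pixels.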

The generation block

RFDD employs a simple block to reconstruct the masked features. In Eq. (6), we implement it with two \(3 \times 3\) convolutional layers and one \(\mathrm{ReLU}\) activation layer. As shown in Table 5, we designed experimental settings that vary the convolutional kernel size and the number of convolutional layers. With only one convolutional layer, the performance of the student model improves slightly; increasing the number of layers to two or three yields a more significant improvement. Regarding kernel size, the \(5 \times 5\) kernel requires more computational resources, and the experimental results show that, with two convolutional layers, the \(5 \times 5\) convolution performs worse than the \(3 \times 3\) one. Based on these results and Occam's razor, we finally chose two \(3 \times 3\) convolutional layers for the generation block in RFDD.

Table 5 Discussion on the generation block. Conv Layers refer to the number of convolutional layers, and Kernel size represents the size of the convolutional kernel.
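The chosen generation block (two \(3 \times 3\) convolutions with a ReLU in between) can be written directly. Padding of 1 is an assumption made here so that the reconstructed feature keeps the same spatial size as the teacher's feature map.

```python
import torch
import torch.nn as nn

def generation_block(channels):
    """Generation block sketch: two 3x3 convs with a ReLU in between.

    Padding of 1 (an assumption) preserves the spatial size so the
    reconstructed feature can be compared against the teacher's feature.
    """
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
    )
```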

Sensitivity study of global distillation

In Decoupled Distillation, we forcibly segment the feature map, which discards the relation knowledge between foreground pixels and background pixels. To compensate for this information loss, we adopt Global Distillation. In this section, we explored Global Distillation using GcBlock and Non-Local54. As shown in Table 6, both methods improve the student model with valuable extracted global relation knowledge, and GcBlock performs better, bringing a 0.49% increase in AP.

Table 6 Discussion on global distillation. The baseline represents experimental results without global distillation.
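A minimal GcBlock-style global context module, in the spirit of GCNet, is sketched below; the internal layer sizes and reduction ratio are assumptions, and in distillation its outputs on teacher and student features would be compared (e.g. via MSE) to transfer relation knowledge.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GcBlock(nn.Module):
    """Sketch of a GcBlock-style global context module (after GCNet)."""

    def __init__(self, channels, r=4):
        super().__init__()
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)     # context attention
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1),
            nn.LayerNorm([channels // r, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1),
        )

    def forward(self, x):                                     # x: (N, C, H, W)
        n, c, h, w = x.shape
        a = self.attn(x).view(n, 1, h * w)
        a = F.softmax(a, dim=2)                               # spatial attention over pixels
        ctx = torch.bmm(x.view(n, c, h * w), a.transpose(1, 2))  # (N, C, 1) global context
        ctx = ctx.view(n, c, 1, 1)
        return x + self.transform(ctx)                        # fuse context back into x
```

Pooling all pixels into one context vector is what restores the foreground-background relations that the decoupled terms cannot see.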

Sensitivity study of \(\mathcal{T}\)

In Eq. (3), we utilize the temperature hyperparameter \(\mathcal{T}\) to adjust the pixel distribution of the feature map. When \(\mathcal{T}\) is greater than 1 or less than 1, the gaps between pixels on the feature map become wider or narrower, respectively. In this section, we designed several experiments to explore the impact of the width of these gaps on distillation and to examine whether our method is sensitive to \(\mathcal{T}\). As shown in Fig. 8, the student models in each group achieve an AP of 76% or higher. Specifically, when \(\mathcal{T}\) is 0.5, the AP differs by only 0.25% from when \(\mathcal{T}\) is 1, and the gap between the best and worst AP is only 0.27%. This indicates that our model is not sensitive to the hyperparameter \(\mathcal{T}\).

Fig. 8

Sensitivity study of hyperparameters using RetinaNet-Res101 and RetinaNet-Res50.
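The role of \(\mathcal{T}\) can be sketched as a temperature-scaled softmax over per-pixel saliency. The exact formulation of Eq. (3) is not reproduced here; this sketch scales saliency by \(\mathcal{T}\) so that \(\mathcal{T} > 1\) widens the gaps between pixel responses and \(\mathcal{T} < 1\) narrows them, matching the description above, and the saliency definition is an assumption.

```python
import torch

def spatial_attention(feat, T=1.0):
    """Sketch of temperature-adjusted spatial attention over a feature map.

    feat: (N, C, H, W). Scaling saliency by T before the softmax makes
    T > 1 widen (sharpen) and T < 1 narrow (flatten) the pixel gaps,
    as described in the text; the paper's Eq. (3) may differ in detail.
    """
    n, c, h, w = feat.shape
    s = feat.abs().mean(dim=1).view(n, h * w)   # per-pixel saliency (assumption)
    a = torch.softmax(s * T, dim=1)             # temperature-adjusted distribution
    return a.view(n, h, w)
```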

Sensitivity study of α, β and γ

In this paper, we use three hyperparameters \(\alpha\), \(\beta\), and \(\gamma\) to balance the different distillation loss terms. They control the trade-off between performance improvement and knowledge transfer for the student detector, allowing the model to fine-tune the type and amount of knowledge transferred from the teacher detector during training to ensure effective learning. To achieve optimal distillation performance, we conducted a series of experiments to analyze the sensitivity of these hyperparameters and determine their values. As shown in Fig. 9, we selected RetinaNet and Faster RCNN as examples of one-stage and two-stage models, respectively. The experimental results indicate that for one-stage detectors, the model achieves the best Decoupled Distillation performance when \(\alpha = 6 \times 10^{-5}\) and \(\beta = 2 \times 10^{-5}\), and the best Global Distillation performance when \(\gamma = 5 \times 10^{-7}\). For two-stage detectors, the student model performs best when \(\alpha = 6 \times 10^{-5}\), \(\beta = 2 \times 10^{-5}\), and \(\gamma = 5 \times 10^{-7}\). Additionally, in both frameworks the worst hyperparameter combinations reduce AP by only 0.24% and 0.25%, respectively, yet still outperform the baseline models by 2.9% and 3.28% AP. This indicates that our method is insensitive to the choice of hyperparameters.

Fig. 9

Sensitivity analysis of \(\alpha\), \(\beta\), and \(\gamma\) with the RetinaNet (left) and Faster RCNN (right) frameworks.
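Combining the terms with the best one-stage settings can be sketched as a weighted sum. The function name is illustrative, and the assumption that the total loss is a plain weighted sum of the three terms is ours; the defaults are the values found best in the sensitivity study above.

```python
def rfdd_total_loss(fg_loss, bg_loss, global_loss,
                    alpha=6e-5, beta=2e-5, gamma=5e-7):
    """Sketch of the balanced distillation objective.

    alpha/beta weight the decoupled foreground/background terms and
    gamma weights the global-relation term.
    """
    return alpha * fg_loss + beta * bg_loss + gamma * global_loss
```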

Sensitivity study of λ

In our proposed method RFDD, an important parameter \(\lambda\) controls the range of masked feature pixels. As \(\lambda\) increases, pixels with higher attention scores in the teacher model are masked, and these pixels are mostly located in the Ground-Truth region; as \(\lambda\) decreases, the masked pixels shift toward the background region. According to MGD, reconstructed features are more valuable, so if the model can reconstruct feature pixels with relatively high scores in the background region, the performance of the student model will be further improved. Based on this, we designed an exploration experiment on the value of \(\lambda\), as shown in Fig. 10. The results show that as \(\lambda\) increases, the AP gain of the student increases, but when \(\lambda\) becomes too large, performance decreases. Therefore, we control the value of \(\lambda\) to help the model better balance encoding low-score and high-score regions.

Fig. 10

Analysis of the hyperparameter \(\lambda\).
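One plausible reading of how \(\lambda\) selects masked pixels is a threshold on the teacher's attention map relative to its mean, sketched below. The thresholding rule is our assumption, not the paper's exact formula; it only reproduces the behavior described above, where larger \(\lambda\) masks progressively higher-scoring pixels.

```python
import torch

def lambda_mask(teacher_attn, lam=0.5):
    """Sketch: select pixels to mask from the teacher's attention map.

    teacher_attn: (N, H, W) attention scores. Pixels below
    lam * mean(attention) are masked (0 in the returned mask) and must be
    reconstructed by the generation block; larger lam masks more pixels,
    reaching into higher-scoring (Ground-Truth) regions. The rule is an
    assumption for illustration.
    """
    thresh = lam * teacher_attn.mean(dim=(1, 2), keepdim=True)
    return (teacher_attn >= thresh).float()     # 1 = keep, 0 = masked
```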

Qualitative analysis

In this section, we tested tea leaves images from the same test set using the original student model and the distilled student model. The detection results are shown in Fig. 11. The results show that the distilled student model significantly reduces missed and false detections of tea leaves while increasing the number of correct detections. This indicates that the performance of the distilled student model is greatly improved, demonstrating its potential for real-time tea leaves detection tasks.

Fig. 11

Detection results comparison of RetinaNet-Res50 (Left) and RetinaNet-Res50-RFDD (Right) on tea leaves images.

Conclusions

In this paper, to enable lightweight models to efficiently perform tea leaves detection in natural environments, we propose a new knowledge distillation method, RFDD. On the one hand, we selectively mask specific pixels in the student feature map and reconstruct the masked features through a generation block, while enhancing the student model's channel perception with the SE block. On the other hand, we decouple the feature map into foreground and background features according to the Ground-Truth and distill them separately. This allows the student model to focus on critical pixels and channels while balancing the extraction of foreground and background features. Finally, we utilize the GcBlock to perform Global Distillation, capturing the relation information lost during Decoupled Distillation. Experimental results on various frameworks and detectors with different backbones demonstrate the effectiveness of our method. However, our method relies on a hyperparameter to limit the size of the mask. Since we aim to minimize human intervention in model performance, our future research will focus on alternative strategies for obtaining masks.