Learning lightweight tea detector with reconstructed feature and dual distillation

Zheng, Zhe; Zuo, Guanpeng; Zhang, Wu; Zhang, Chenlu; Zhang, Jing; Rao, Yuan; Jiang, Zhaohui

doi:10.1038/s41598-024-73674-4

Download PDF

Article
Open access
Published: 10 October 2024

Learning lightweight tea detector with reconstructed feature and dual distillation

Zhe Zheng¹,
Guanpeng Zuo¹,
Wu Zhang^1,2,
Chenlu Zhang¹,
Jing Zhang¹,
Yuan Rao^1,2 &
…
Zhaohui Jiang^1,2

Scientific Reports volume 14, Article number: 23669 (2024) Cite this article

1998 Accesses
3 Citations
Metrics details

Subjects

Abstract

Currently, image recognition based on deep neural networks has become the mainstream direction of research; therefore, significant progress has been made in its application in the field of tea detection. Many deep models exhibit high recognition rates in tea leaves detection. However, deploying these models directly on tea-picking equipment in natural environments is impractical; the extremely high parameters and computational complexity of these models make it challenging to perform real-time tea leaves detection. Meanwhile, lightweight models struggle to achieve competitive detection accuracy; therefore, this paper addresses the issue of computational resource constraints in remote mountain areas and proposes Reconstructed Feature and Dual Distillation (RFDD) to enhance the detection capability of lightweight models for tea leaves. In our method, the Reconstructed Feature selectively masks the feature of the student model based on the spatial attention map of the teacher model; it utilizes a generation block to force the student model to generate the teacher’s full feature. The Dual Distillation comprises Decoupled Distillation and Global Distillation. Decoupled Distillation divides the reconstructed feature into foreground and background features based on the Ground-Truth. This compels the student model to allocate different attention to foreground and background, focusing on their critical pixels and channels. However, Decoupled Distillation leads to the loss of relation knowledge between foreground and background pixels. Therefore, we further perform Global Distillation to extract this lost knowledge. Since RFDD only requires loss calculation on feature map, it can be easily applied to various detectors. We conducted experiments on detectors with different frameworks, using a tea dataset collected at the Huangshan Houkui Tea Plantation. The experimental results indicate that, under the guidance of RFDD, the student detectors have achieved performance improvements to varying degrees. For instance, a one-stage detector like RetinaNet (ResNet-50) experienced a 3.14% increase in Average Precision (AP) after RFDD guidance. Similarly, a two-stage model like Faster RCNN (ResNet-50) obtained a 3.53% improvement in AP. This offers promising prospects for lightweight models to efficiently perform real-time tea leaves detection tasks.

Multistage feature fusion knowledge distillation

Article Open access 11 June 2024

Cosine similarity-guided knowledge distillation for robust object detectors

Article Open access 14 August 2024

Convolutional neural network based on transfer learning for discriminating the fermentation degree of black tea

Article Open access 08 August 2025

Introduction

With the continuous improvement in computer processing power, deep learning has made remarkable progress in various fields^1,2,3. In the agricultural domain, deep models have demonstrated outstanding performance in areas such as plant diseases identification⁴, fruit counting⁵, as well as other applications^6,7.

Intelligent tea picking is also a highly researched direction in agricultural applications. Tea, as a crucial economic crop, often faces significant damage during traditional mechanical tea picking due to indiscriminate cutting, leading to a notable decline in the quality and economic value of the tea leaves. To address this issue, some scholars are dedicated to designing intelligent tea-picking devices. Accurate identification and detection of tea leaves are fundamental to the operation of these devices and are crucial for precise harvesting. For this purpose, some studies^8,9,10 employ a combination of computer vision and deep learning to achieve accurate identification of tea leaves. However, their models have high parameters, slow inference speeds, and require deployment on devices with abundant computing resources and large operating memory. This poses new challenges for achieving the recognition of tea leaves in natural environments.

Knowledge distillation^11,12,13 has been proven to be an effective model compression method, capable of addressing the aforementioned challenges. It achieves this by transferring the dark knowledge of a powerful yet complex teacher model to a lightweight student model, thereby improving the performance of the student model without incurring additional costs, thus achieving the goal of model compression. In recent years, with in-depth research on knowledge distillation, many efficient methods have emerged. Mimick¹⁴ opted to learn the feature of the regions proposed by the student’s region proposal network instead of the entire teacher’s feature to simplify the task. FGFI¹⁵ obtained fine-grained feature maps by leveraging Ground-Truth and anchors. The above methods mainly focus on extracting feature from the foreground, but some works^16,17 have found that the background is also helpful for object detection. Guo et al.¹⁸ found through separate knowledge distillation experiments for the background and foreground that both contribute to improved performance of the student model. This confirms the existence of exploitable dark knowledge in the background. Additionally, through visual analysis of the gradient information of the Feature Pyramid Network (FPN)¹⁹, they found that the gradient for the foreground is significantly larger than that for the background, indicating different levels of importance for foreground and background. On the other hand, MGD²⁰ discovered that, during distillation, masking random regions of the student’s feature map and forcing the student to reconstruct feature in these masked areas through a simple generation block enhances the student’s feature representation ability. However, MGD uses randomly generated masks, which do not enable the student to focus on areas considered important by the teacher. They also overlook the fact that the background has more pixels compared to the foreground, leading to a scenario where background features occupy more loss during distillation, causing an imbalance in losses. This issue is more pronounced in the tea leaves dataset we constructed, as shown in Fig. 1. The pixel ratio between foreground and background regions in the tea leaves dataset is approximately 0.9:9.1. In contrast, the ratio in the COCO dataset²¹ is approximately 5.6:4.4. This stark disproportion in the ratio makes background feature more likely to dominate in the tea leaves dataset. Due to the mentioned characteristics, treating all pixels equally and extracting feature for training the student model in an equal manner would further exacerbate the negative impact on distillation. Therefore, to address this imbalance in losses, we adopted Decoupled Distillation, aiming to alleviate the impact of the highly imbalanced foreground-to-background pixels ratio in the tea leaves dataset.

Based on the profound observations and understanding mentioned above, we propose Reconstructed Feature and Dual Distillation (RFDD), as shown in Fig. 2. Specifically, we start by obtaining a spatial attention map from the feature map of the teacher model. Based on the magnitudes of attention values, we generate a specific mask to mask the feature of the student model. This is intended to guide the subsequent generation block to focus on regions considered important by the teacher. Secondly, studies like SENet²² and CBAM²³ have shown that adding channel attention in addition to spatial attention can further enhance the model’s performance. Additionally, Zhang et al.²⁴ demonstrated that having the student model learn the channel attention map of the teacher model during distillation leads to improved student performance. This indicates that the teacher’s level of interest in different channels is valuable dark knowledge. Therefore, to further explore the positive impact of channels on distillation, we utilize the SE block²² to learn channel clues from the teacher’s feature map and then use them to enhance the student’s feature, improving the student model’s channel awareness. Next, we separate foreground and background features based on Ground-Truth, guiding the student to focus on critical pixels and channels. During Decoupled Distillation, we adopt an equal strategy for extracting foreground and background features. Furthermore, since the Decoupled Distillation severs the connection between foreground and background, resulting in the loss of global relation information, to compensate for this deficiency, we employ GcNet²⁵ to implement Global Distillation, collecting information about these losses and transfer it to the student. Our loss function is calculated only on the feature map, making our method applicable to various detectors, including two-stage models, anchor-based one-stage models, and anchor-free one-stage models. Experimental results indicate that using RFDD in tea leaves recognition models across different frameworks has yielded promising outcomes.

Related works

Object detection

Object detection has always been a fundamental and challenging task in computer vision, aiming to identify the category and location of objects in images. Typically, detectors based on Convolutional Neural Networks (CNN) can be categorized into two-stage detectors^26,27 and one-stage detectors^28,29,30. Two-stage detectors are renowned for their higher detection accuracy. However, due to the use of Region Proposal Network (RPN), they have a slower inference speed, making them challenging to meet real-time application requirements. In contrast, one-stage detectors directly predict the category and bounding boxes of target objects on the feature map, offering faster detection speed. Furthermore, one-stage detectors can be further divided into anchor-based²⁸ and anchor-free^29,30 categories. Anchor-based models enhance detection speed and maintain accuracy by using predefined anchor boxes, but the number of anchor boxes far exceeds the number of targets, leading to additional computations. On the other hand, anchor-free detectors are more lightweight, directly predicting the key points and positions of target objects, but they have relatively lower detection accuracy. Despite the use of different detection heads in these three types of detectors, they all take feature as input. Therefore, our knowledge distillation method can be easily applied to them.

Tea leaves recognition

Early scholars both domestically and internationally primarily employed traditional machine learning and image processing techniques for tea leaves recognition and detection. Chen et al.³¹ achieved three-dimensional reconstruction and detection of tea leaves using color transformation, area filtering, and edge computation. Wu et al.³² obtained a higher tea leaves recognition rate than the Otsu algorithm by using color change and the k-means clustering algorithm. Wu et al.³³ designed a maximum variance automatic threshold method for G and G-B components, effectively distinguishing buds from old leaves, achieving a recognition rate of up to 92%. However, the above studies were conducted under controlled conditions, and it is challenging to replicate the real conditions in tea gardens with complex environments, including uncontrolled lighting and high similarity between foreground and background. Therefore, many scholars have begun to apply deep learning to the study of tea leaves.

Compared to machine learning, the advantage of deep learning lies in abandoning feature engineering, enabling automatic feature selection based on the given data, and establishing end-to-end models. Chen et al.⁸ successfully implemented detection and picking point localization of tea leaves using Faster-RCNN and Fully Convolutional Networks (FCN). Wang et al.³⁴ proposed a tea leaves picking point localization method based on Mask-RCNN. Qing et al.³⁵ proposed an identification method using the YOLOv3 deep convolutional model, which enhanced the model’s ability to recognize tea leaves by adding the SPP module in the convolution. However, while these deep models exhibit high performance and strong generalization capabilities, their large number of parameters and slow inference speed do not meet the real-time requirements of tea leaves recognition in natural environments. Therefore, we need to compress large models, and employing the knowledge distillation approach proves to be an effective solution to this problem. By distilling knowledge into smaller models, we can achieve faster detection speed, fewer parameters and computations while maintaining a detection accuracy close to that of larger models. Consequently, the small model obtained through knowledge distillation can replace the large model to accomplish real-time tea leaves detection tasks.

Knowledge distillation

Knowledge distillation is a model compression technique initially proposed by Hinton et al.¹¹. This method transfers dark knowledge by using soft labels generated by the teacher model as inputs for the student model, thereby enhancing the classification accuracy of the student model without incurring additional computational costs. Building upon Hinton’s ideas, many researchers have designed efficient distillation methods for classification task^36,37,38.

In recent years, some studies have successfully applied knowledge distillation to object detection. Chen et al.³⁹ were the first to apply knowledge distillation to object detection models. The authors took the classic Faster R-CNN model framework as an example, distilling knowledge from three parts: neck feature, the classification head, and the regression head, and transferring it to the student model. Many researchers in the past believed that the extreme imbalance between foreground and background was a major factor influencing distillation. The authors of FGFI¹⁵, in order to reduce the impact of background feature on distillation, used fine-grained masks to extract feature of target objects and their surrounding areas. Sun et al.⁴⁰ used Gaussian masks to guide the student to focus on the Ground-Truth and its surrounding areas. However, the study by Guo et al.¹⁸ indicates that background features also have a positive effect on distillation. The authors distilled knowledge from the background, enhancing the performance of the student model. Additionally, Yang et al.²⁰ found that by randomly masking the feature maps of the student model and using a simple generative block to reconstruct feature, the representation capability of the student could be improved. Based on this, they proposed an innovative distillation method called Masked Generative Distillation (MGD). However, MGD utilizes random masks, which may introduce additional noise feature, and the authors treat all pixels equally.

Materials and methods

Dataset construction

The Taiping Houkui tea leaves dataset used in this paper was collected from the Huangshan Houkui Tea Plantation, as shown in Fig. 3. These images (3500 in total) were collected between April and May 2023 using two digital cameras, namely the NIKON D7100 and the SONY ZV-E10, with a resolution of 4000 × 6000 pixels, and were stored in JPG format. To obtain clear and accurate images of tea leaves, we chose to capture them by shooting from an overhead perspective with a vertical camera axis forming an angle of 30°-60° with the horizontal line, at a distance of 40–60 cm. If the shooting angle is too low, there will be a significant area blocked between the leaves, while a too high angle makes the stems of the tender shoots invisible. All tea leaves images were captured in natural environments, and to ensure the diversity of the dataset, it includes images of tea leaves obtained from different angles, backgrounds, occlusions, and lighting conditions.

Additionally, due to the complex tea garden environment, the varying growth angles of tea leaves, and the influence of environmental factors such as weather and lighting on tea leaves recognition, we employed common data augmentation techniques such as flipping, random rotation, and color jitter to increase tea leaves images and improve the generalization ability of the neural network model. Augmenting image data through techniques like flipping and random rotation allows for simulation of tea leaves features from various angles; adjustments to image brightness and contrast simulate changes in tea leaves color due to factors like lighting and weather in the tea garden. The partial effect diagram of data augmentation is shown in Fig. 4, Subsequently, we used LabelMe software⁴¹ to annotate our collected tea leaves dataset, defining the annotated regions as the “tea” class and generating JSON files. Finally, we randomly divided the images into training, validation, and testing sets in an 8:1:1 ratio.

Method

In knowledge distillation, Gou et al.⁴² categorize dark knowledge into relation-based knowledge, response-based knowledge, and feature-based knowledge according to its types. In this paper, we define $T \in {{\mathbb{R}}^{C \times H \times W}}$ and $S \in {{\mathbb{R}}^{C \times H \times W}}$ to represent the features extracted by the teacher model and the student model, respectively. Mathematically, the distillation formula based on feature is expressed as:

$$L=\frac{1}{{CHW}}\mathop { \cdot \sum }\nolimits_{{c=1}}^{C} \mathop \sum \nolimits_{{h=1}}^{H} \mathop \sum \nolimits_{{w=1}}^{W} {({T_{c,h,w}} - \phi ({S_{c,h,w}}))^2}$$

(1)

where $\phi$ is the adaptive layer that aligns the shapes of ${T_{c,h,w}}$ and ${S_{c,h,w}}$ and $C,H,W$ denote the channel, height, and width of the feature maps, respectively. Building upon the foundation of Eq. (1), many researches^43,44,45 have focused on guiding student to imitate the teacher’s feature maps to improve the student’s performance. The recent MGD discovered that teachers can enhance students’ representational capabilities by guiding them in feature reconstruction. However, MGD adopted random masks and overlooked the negative impact of foreground and background imbalance on distillation. To address these issues, we propose RFDD.

In section “Reconstructed feature” of this paper, we guide the student model to reconstruct the complete feature of the teacher model, simultaneously adding the SE block to enhance the student model’s channel awareness. Subsequently, in section “Decoupled Distillation”, we segment the feature maps into foreground and background based on Ground-Truth for Decoupled Distillation. Finally, in section “Global Distillation”, we utilize GcNet to capture global pixels relation, collecting dark knowledge lost during the Decoupled Distillation.

Reconstructed feature

To overcome the issues introduced by the random masks in MGD, we utilize the teacher’s spatial attention maps to design the masks, guiding the generation block more accurately to reconstruct features in areas deemed important by the teacher. First, we calculate the absolute mean along the channel dimension of the teacher model:

$${G^{Spatial}}(T)=\frac{1}{C}\mathop { \cdot \sum }\nolimits_{{c=1}}^{C} \left| {{T_c}} \right|$$

(2)

where ${G^{Spatial}} \in {{\mathbb{R}}^{1 \times {\text{H}} \times {\text{W}}}}$ represents the spatial attention map. Subsequently, the spatial attention mask obtained based on ${G^{Spatial}}$ can be formulated as:

$${\mathcal{A}^{Spatial}}(T)=H \cdot W \cdot {\text{softmax}}\left( {\frac{{{G^{Spatial}}(T)}}{\mathcal{T}}} \right)$$

(3)

where $\mathcal{T}$ is the hyperparameter introduced in¹¹ for changing the probability distribution.

After Eqs. (2) and (3), the shape of ${\mathcal{A}^{Spatial}}(T)$ we obtain is $1 \times {\text{H}} \times W$. Therefore, we consider the values of ${\mathcal{A}^{Spatial}}(T)$ to represent the teacher model’s level of interest in different pixels of the feature map. Later, we set a hyperparameter $\lambda$, and when the value of ${\mathcal{A}^{Spatial}}(T)$ is greater than $\lambda$, the value of the mask $M_{{h,w}}^{{Spatial}}$ is set to 0; otherwise, it is set to 1. This can be expressed using the formula:

$$M_{{h,w}}^{{Spatial}}=\left\{ \begin{array}{ll} {\text{0}},& \quad {\text{if }}\mathcal{A}_{{h,w}}^{{Spatial}}>\lambda \\ 1,& \quad {\text{Otherwise}} \end{array} \right.$$

(4)

where $\lambda$ is a hyperparameter controlling the number of pixels in the mask. ${\mathcal{A}^{Spatial}}(T)$ represents the attention score of the teacher’s feature map at the coordinate point $\left( {h,w} \right)$. Next, we use the mask $M_{{h,w}}^{{Spatial}}$ to overlay the feature map of the student model:

$$S_{{c,h,w}}^{{mask}}=\phi \left( {{S_{c,h,w}}} \right) \cdot M_{{h,w}}^{{Spatial}}$$

(5)

In this paper, we use $1 \times 1$ convolutional layers as the adaptive layer $\phi$. The mask generated with the help of teacher attention map allow us to mask specific regions of the student feature map based on the teacher’s level of interest in different areas. This effectively eliminates the drawbacks introduced by randomly generated mask. Afterward, we reconstruct the teacher’s feature using a simple generative block consisting of two convolutional layers: ${{\mathbf{W}}_{G1}}$ and ${{\mathbf{W}}_{G2}}$, an activation layer $\text{R}\text{e}\text{L}\text{U}$. It can be formulated as:

$${{\rm X}_{gen}}={{\mathbf{W}}_{G2}}\left( {\text{R}\text{e}\text{L}\text{U}\left( {{{\mathbf{W}}_{G1}}\left( {S_{{c,h,w}}^{{mask}}} \right)} \right)} \right)$$

(6)

where ${{\mathbf{W}}_{G1}}$ and ${{\mathbf{W}}_{G2}}$ are $3 \times 3$ convolutional layers.

On the other hand, due to the tea leaves dataset containing a large number of detection targets, the task of detecting tea leaves is essentially a dense prediction task for multiple objects. For dense prediction, extracting target information at different scales can bring significant performance fluctuations to the detectors, which has not been considered in many works^46,47. Therefore, based on this, we introduce a simple and lightweight SE (Squeeze-and-Excitation) layer to learn channel clues from the teacher’s feature and apply them to enhance the student’s feature, further improving the model’s perceptual capability. Firstly, we compress the global spatial information into channel attention map through global average pooling. In a formal sense, the channel attention map ${G^{Channel}} \in {{\mathbb{R}}^{C \times 1 \times 1}}$ is obtained by collapsing the dimensions of the feature map ${\text{H}} \times {\text{W}}$, and its formula is as follows:

$${G^{Channel}}(T)=\frac{1}{{H \times W}}\mathop { \cdot \sum }\nolimits_{{h=1}}^{H} \sum\nolimits_{{w=1}}^{W} {\left| {{T_{h,w}}} \right|}$$

(7)

Afterwards, we perform a linear projection operation on it to obtain the clue matrix ${{\text{X}}_{clue}} \in {{\mathbb{R}}^{C \times 1 \times 1}}$, which is mathematically represented as follows:

$${{\rm X}_{clue}}=\sigma \left( {{{\mathbf{W}}_{C2}}\left( {ReLU\left( {{{\mathbf{W}}_{C1}}\left( {{G^{Channel}}(T)} \right)} \right)} \right)} \right)$$

(8)

where ${{\mathbf{W}}_{C1}}$ and ${{\mathbf{W}}_{C2}}$ represent two fully connected layers. Finally, we merge the generated feature map ${{\rm X}_{gen}} \in {{\mathbb{R}}^{C \times H \times W}}$ with the obtained channel clue matrix ${{\text{X}}_{clue}}$ to enhance the student model’s channel perception ability. This can be defined as:

$$\mathcal{G}(S_{{c,h,w}}^{{mask}})={{\rm X}_{gen}} \odot {{\rm X}_{clue}}$$

(9)

where $\odot$ denotes the Hadamard product. In a nutshell, through the aforementioned Reconstructed Feature, we optimized the strategy of masking feature, successfully transmitting dark knowledge about the importance of regions. The resulting student feature map will contain more valuable semantic information, ultimately enhancing the representation capability of the student model.

Decoupled distillation

Due to the tea dataset having a significantly larger number of background pixels than foreground pixels, background feature dominate when calculating the loss function, thus affecting the distillation effect of the student model. Therefore, in this section, we perform Decoupled Distillation to balance the extraction of pixels. First, we set a binary mask $M_{{h,w}}^{{Segmetation}}$ based on the Ground-Truth to separate foreground and background. It can be formalized as:

$$M_{{h,w}}^{{Segmetation}}=\left\{ \begin{array}{ll} 1,&\quad{\text{if}}\left( {h,w} \right) \in {\text{B}} \\ 0,&\quad {\text{Otherwise}} \end{array} \right.$$

(10)

where ${\text{B}}$ represents the actual tea leaves annotation boxes. $\left( {h,w} \right)$ denotes the horizontal and vertical coordinates of the corresponding position in the feature map. If a pixel is inside the Ground-Truth boxes, the value of $M_{{h,w}}^{{Segmetation}}$ is set to 1; otherwise, it is 0. Secondly, to minimize the negative impact of pixel imbalance on distillation, we designed the following loss function:

$$\begin{aligned} {L_{decoupled}}&=\frac{\alpha }{{{N_{obj}}}} \cdot \sum\limits_{{c=1}}^{C} {\sum\limits_{{h=1}}^{H} {\sum\limits_{{w=1}}^{W} {M_{{h,w}}^{{Segmetation}}{{({T_{c,h,w}} - \mathcal{G}(S_{{c,h,w}}^{{mask}}))}^2}} } } \\ &\quad+\frac{\beta }{{{N_{bg}}}} \cdot \sum\limits_{{c=1}}^{C} {\sum\limits_{{h=1}}^{H} {\sum\limits_{{w=1}}^{W} {\left( {1 - M_{{h,w}}^{{Segmetation}}} \right){{({T_{c,h,w}} - \mathcal{G}(S_{{c,h,w}}^{{mask}}))}^2}} } } \\ {{\text{N}}_{{\text{obj}}}}&={\text{C}} \cdot \sum\limits_{{{\text{h=1}}}}^{{\text{H}}} {\sum\limits_{{{\text{w=1}}}}^{{\text{W}}} {{\text{M}}_{{{\text{h,w}}}}^{{{\text{Segmetation}}}}} } ,{{\text{N}}_{{\text{bg}}}}{\text{=C}} \cdot \sum\limits_{{{\text{h=1}}}}^{{\text{H}}} {\sum\limits_{{{\text{w=1}}}}^{{\text{W}}} {\left( {1 - {\text{M}}_{{{\text{h,w}}}}^{{{\text{Segmetation}}}}} \right)} } \end{aligned}$$

(11)

where $\alpha$ and $\beta$ are hyperparameters used to balance foreground and background distillation losses, ${N_{obj}}$ and ${N_{bg}}$ represent the number of pixels in the foreground and background regions, respectively.

Global distillation

In knowledge distillation, the relation between different pixels^48,49 have been proven to be a form of dark knowledge that can enhance the performance of detection models. In section “Decoupled Distillation”, we performed Decoupled Distillation, allowing the student model to have different levels of attention to foreground and background features. However, it severs the relation between foreground and background. Therefore, we propose Global Distillation, as shown in Fig. 5, we utilize the GcBlock²⁵ to extract relation knowledge between global pixels and distill it from the teacher detector to the student detector. The loss function for Global Distillation is as follows:

$$\begin{aligned}{L_{global}}&=\gamma \cdot {\sum {\left( {{\rm Z}\left( {{{\text{T}}_{{\text{c,h,w}}}}} \right) - {\rm Z}\left( {\mathcal{G}(S_{{c,h,w}}^{{mask}})} \right)} \right)} ^2} \\ {\rm Z}\left( {\text{Y}} \right)&={\text{Y}}+{{\mathbf{W}}_{L3}}\left( {\operatorname{Re} LU\left( {{\text{LN}}\left( {{{\mathbf{W}}_{L2}}\left( {\sum\nolimits_{{j=1}}^{{{N^p}}} {\frac{{{e^{{{\mathbf{W}}_{L1}}{Y_j}}}}}{{\sum\nolimits_{{m=1}}^{{{N^p}}} {{e^{{{\mathbf{W}}_{L1}}{Y_{\text{M}}}}}} }}} {{\text{Y}}_{\text{j}}}} \right)} \right)} \right)} \right)\end{aligned}$$

(12)

where ${{\mathbf{W}}_{L1}}$, ${{\mathbf{W}}_{L2}}$ and ${{\mathbf{W}}_{L3}}$denote convolutional layers, $LN$ denotes the layer normalization. ${N^p}$is the number of pixels in the feature and $\gamma$ is a hyper-parameter to balance the loss.

Overall loss

Based on our proposed distillation method (RFDD), the overall loss function for training the student detector is as follows:

$${L_{toatl}}={L_{original}}+{L_{decoupled}}+{L_{global}}$$

(13)

where ${{\text{L}}_{{\text{original}}}}$ is the original loss of the detector. Our distillation loss is calculated only on the feature maps, which are obtained from the neck of the detector. Therefore, our method can be applied to different tea detectors.

Experiments

Dataset

We evaluate our proposed knowledge distillation method on the Taiping HouKui Tea dataset constructed above. Specifically, we use 2800 tea leaves images for model training, validate the trained model on 350 images, and finally test the model on another set of 350 images. For evaluating model performance, metrics such as Average Precision (AP), mean Average Precision (mAP), and frames per second (FPS) are commonly chosen for assessing detection accuracy and speed. Since our tea leaves dataset consists only of the “tea” category, AP and mAP are equivalent in our experiments. Therefore, we use AP and FPS to assess the student model before and after distillation. Additionally, we calculate the model’s AP with a confidence threshold IOU of 0.5.

Details

Our comprehensive experiments involve three different frameworks of detectors: two-stage detectors Faster RCNN, Mask RCNN, anchor-based one-stage detector RetinaNet, and anchor-free one-stage detectors FCOS, RepPoints. We use ResNet101 and ResNet50⁵⁰ as the backbone networks for the teacher and student models, respectively. Additionally, we conduct a series of ablation experiments and studies the sensitivity of hyperparameters. In all experiments, we utilize the MMdetection⁵¹toolbox and PyTorch⁵² framework on a server equipped with an RTX 3080Ti GPU.

RFDD uses 5 hyperparameters: $\alpha$, $\beta$, $\gamma$, $\lambda$, and $\mathcal{T}$. Specifically, we set {$\alpha =6 \times {10^{ - 5}}$,$\beta =2 \times {10^{ - 5}}$,$\gamma =5 \times {10^{ - 7}}$,$\lambda =0.8$,$\mathcal{T}=0.5$} for two-stage detectors and {$\alpha =2 \times {10^{ - 4}}$,$\beta =4 \times {10^{ - 4}}$,$\gamma =5 \times {10^{ - 6}}$,$\lambda =1$,$\mathcal{T}=0.5$} for all one-stage detectors. During training, we utilized the SGD optimizer and trained each model for 24 epochs. The initial learning rate was set to 0.0025, batch size was 2, and the learning rate was divided by 10 at the 16th and 22nd epochs. Additionally, we set the momentum to 0.9 and weight decay to 0.0001.

Table 1 shows the params, flops, and FPS of the models we used in the experiments. Typically, within the same framework, as the backbone network deepens, the model’s performance improves, but the number of parameters and computational load also significantly increase. This results in a noticeable decrease in detection speed (FPS), failing to meet the real-time requirements of detection tasks. Therefore, balancing detection accuracy and speed is crucial in the task of tea leaves detection in natural environments.

Table 1 Params, flops, and FPS of the models used in this paper.

Full size table

Main results

In the comparative experiments of this paper, we conducted distillation experiments on five detectors and compared our RFDD method with three knowledge distillation methods: Defeat¹⁸, MGD²⁰, and DiffKD⁵³. As shown in Table 2; Fig. 6, our distillation method surpasses MGD, Defeat and DiffKD. In the first group of experiments, RetinaNet was used as both the teacher and student detection frameworks. Compared to the baseline RetinaNet-Res50, our distillation method improved the AP by 3.14%. The improvement was 0.71%, 0.45%, and 0.33% higher than that achieved by Defeat, MGD, and DiffKD, respectively. Similar to the first group, each subsequent experiment used ResNet50 as the backbone network for the student model and ResNet101 for the teacher model. The results of each group confirmed the effectiveness of our method, providing a promising solution for achieving high-precision detection of tea leaves with lightweight models. Additionally, as shown in Fig. 7, we present the visualized results of feature maps from student models trained using the aforementioned distillation methods. From these visualizations, it is clear that student models trained with different knowledge distillation methods have enhanced feature representation capabilities, and compared to features extracted by other distillation methods, the features extracted by RFDD are superior and more distinguishable.

Table 2 Detection results on the Taiping Houkui Tea dataset using different distillation methods and detection frameworks. Defeat can only be applied to anchor-based detectors.

Full size table

Analysis

RFDD is mainly composed of two parts and is equipped with five hyperparameters. Therefore, we need to conduct a series of ablation experiments and hyperparameters studies to explore the effectiveness of each part in distillation. In the subsequent experiments in this paper, we use RetinaNet as the framework for the model, with ResNet101 and ResNet50 set as the backbone networks for the teacher and student models, respectively, for exploration and research.

Different distillation regions

In this section, we designed experiments by decoupling foreground and background during the distillation. Surprisingly, as shown in Table 3, distilling foreground and background together resulted in the poorest performance, even worse than distilling only the background or foreground. And we observed that distilling only the background region and distilling only the foreground region achieved nearly identical performance, further indicating that the background plays a positive role in distillation. Additionally, when we decoupled the distillation of foreground and background, we achieved performance surpassing that of distilling only the foreground or only the background, indicating that Decoupling Distillation contributes to improved performance in the student model.

Table 3 Exploration research on distillation regions.

Full size table

The attention mask and channel clue

As shown in Table 4, we explored the impact of two crucial modules, Attention Mask and Channel Clue, during the Reconstructed Feature. Experimental results indicate that the model exhibits optimal performance when both modules are active simultaneously. Furthermore, when we arbitrarily remove one component, the model’s performance experiences varying degrees of decline. Without Attention Mask, AP decreases by 0.41%, and without Channel Clue, AP decreases by 0.32%. This indicates that Attention Mask and Channel Clue can collaborate to enhance the perceptual ability of the student model towards targets, which is crucial for the task of tea leaves detection.

Table 4 A ablation study of two critical modules, attention mask and Channel Clue, under the RetinaNet framework.

Full size table

The generation block

RFDD employs a simple block to reconstruct the masked features. In Eq. (6), we use two $3 \times 3$ convolutional layers and one activation layer $\text{R}\text{e}\text{L}\text{U}$ to implement it. As shown in Table 5, we designed corresponding experimental settings based on the size of convolutional kernels and the number of convolutional layers. When there is only one convolutional layer, the performance of the student model is slightly improved, and when the number of convolutional layers is increased to two or three, a more significant performance improvement can be observed. Regarding the size of the convolutional kernel, the $5 \times 5$ convolution kernel requires more computational resources, and according to the experimental results, the performance of the $5 \times 5$ convolution is weaker than that of the $3 \times 3$ one under the premise of using two convolutional layers. Based on the experimental results and Occam’s razor principle, we finally chose two $3 \times 3$ convolutional layers for the generation block in RFDD.

Table 5 Discussion on the generation block. Conv Layers refer to the number of convolutional layers, and Kernel size represents the size of the convolutional kernel.

Full size table

Sensitivity study of global distillation

In Decoupled Distillation, we forcibly segmented the feature map, leading to a lack of relation knowledge between foreground pixels and background pixels. To compensate for this information loss, we adopted Global Distillation. In this section, we explored Global Distillation using GcBlock and Non-Local⁵⁴. As shown in Table 6, both methods were able to improve the student model with valuable extracted global relation knowledge. And GCBlock performed better, bringing a 0.49% increase in AP.

Table 6 Discussion on global distillation. The baseline represents experimental results without global distillation.

Full size table

Sensitivity study of $\mathcal{T}$

In Eq. (3), we utilize the temperature hyperparameter $\mathcal{T}$ to adjust the pixel distribution of the feature map. When $\mathcal{T}$ is greater than 1 or less than 1, the gaps between pixels on the feature map become wider or narrower, respectively. In this section, we designed several experiments to explore the impact of the width of gaps between pixels on distillation. As shown in Fig. 8, we examined whether our method is sensitive to the hyperparameter $\mathcal{T}$.Experimental results show that the student models in each group achieve an AP value of 76% or higher. Specifically, when $\mathcal{T}$ is 0.5, the AP differs by 0.25% compared to when $\mathcal{T}$ is 1. Additionally, the gap between the best and worst results in terms of AP is only 0.27%. This indicates that our model is not sensitive to the hyperparameter $\mathcal{T}$.

Sensitivity study of α, β and γ

In this paper, we use three hyperparameters $\alpha$, $\beta$, and $\gamma$ to balance different distillation loss terms. They control the trade-off between performance improvement and knowledge transfer for the student detector, allowing the model to fine-tune the type and amount of knowledge transferred from the teacher detector during training to ensure the effective learning of the student detector. To achieve optimal distillation performance, in this section, we conducted a series of experiments to analyze the sensitivity of hyperparameters and determine their values. As shown in Fig. 9, we selected RetinaNet and Faster RCNN as examples of one-stage and two-stage models, respectively. The experimental results indicate that for one-stage detectors, the model achieves the best Decoupled Distillation performance when $\alpha =6 \times {10^{ - 5}}$ and $\beta =2 \times {10^{ - 5}}$, and the model obtains the best Global Distillation performance when $\gamma =5 \times {10^{ - 7}}$. For two-stage detectors, the student model performs best when $\alpha =6 \times {10^{ - 5}}$,$\beta =2 \times {10^{ - 5}}$ and $\gamma =5 \times {10^{ - 7}}$. Additionally, it is observed that in both frameworks, the worst hyperparameter combinations result in a decrease of 0.24%AP and 0.25%AP, respectively. However, they outperformed the baseline models by 2.9%AP and 3.28%AP. This indicates that our method is insensitive to the choice of hyperparameters.

Sensitivity study of λ

In our proposed method RFDD, there is an important parameter $\lambda$, which controls the range of masked feature pixels. As the value of $\lambda$ increases, pixels with higher attention scores in the teacher model will be masked, and these pixels are mostly located in the Ground-Truth region. When the value of $\lambda$ decreases, the masked pixels will appear in the background region. According to the research on MGD, the reconstructed features are more valuable. Therefore, if the model can reconstruct feature pixels with relatively high scores in the background region, the performance of the student model will be further improved. Based on this, we designed an exploration experiment for the values of $\lambda$, as shown in Fig. 10. According to the experimental results, we found that as $\lambda$ increases, the AP gain of the student increases, but when $\lambda$ becomes too large, the performance decreases. Therefore, we hope to control the value of $\lambda$ to help the model better compromise between encoding low-score and high-score regions.

Qualitative analysis

In this section, we tested tea leaves images from the same test set using the original student model and the distilled student model. The detection results are shown in Fig. 11. From the results of the test, it can be observed that the distilled student model significantly reduces the number of missed and false detections of tea leaves, while increasing the number of correctly detected tea leaves. This indicates that the performance of the distilled student model is greatly improved, demonstrating the potential to accomplish real-time tea leaves tasks.

Conclusions

In this paper, to enable lightweight models to efficiently perform tea leaves detection in natural environments, we propose a new knowledge distillation method RFDD. On the one hand, we selectively mask specific pixels in the student feature map and reconstruct the masked feature through a generation block. Simultaneously, we enhance the student model’s channel perception by incorporating the SE block. On the other hand, we decouple the feature into foreground and background features according to the Ground-Truth, separately distilling them. This allows the student model to focus on critical pixels and channels while balancing the extraction of foreground and background features. Finally, we utilize the GcBlock to perform Global Distillation, capturing the relation information lost during Decoupled Distillation. Experimental results on various frameworks and detectors with different backbones demonstrate the effectiveness of our method. However, our method relies on a hyperparameter to limit the size of the mask. We aim to minimize human intervention in model performance, so our future research will focus on alternative strategies for obtaining masks.

Data availability

The datasets used during the study are available from the corresponding author upon reasonable request.

Plant material availability

We confirm that all the experimental research and field studies on plants (either cultivated or wild), including the collection of plant material, complied with relevant institutional, national, and international guidelines and legislation. The Taiping Houkui tea leaves dataset used in this paper was collected from the Huangshan Houkui Tea Plantation. All the material is owned by the authors and/or no permissions are required.

References

Lauriola, I., Lavelli, A. & Aiolli, F. An introduction to deep learning in natural language processing: Models, techniques, and tools. Neurocomputing 470, 443–456. https://doi.org/10.1016/j.neucom.2021.05.103 (2022).
Article Google Scholar
Yin, L. et al. Convolution-Transformer for Image Feature Extraction (2024). https://doi.org/10.32604/cmes.2024.051083
Yin, L. et al. AFBNet: A lightweight adaptive feature fusion module for super-resolution algorithms. Comput. Model. Eng. Sci. 140(3). https://doi.org/10.32604/cmes.2024.050853 (2024).
Wang, R. et al. Deep neural network compression for plant disease recognition. Symmetry 13(10), 1769. https://doi.org/10.3390/sym13101769 (2021).
Article ADS Google Scholar
Kang, H. & Chen, C. Fast implementation of real-time fruit detection in apple orchards using deep learning. Comput. Electron. Agric. 168, 105108. https://doi.org/10.1016/j.compag.2019.105108 (2020).
Article Google Scholar
Khan, S. D., Alarabi, L. & Basalamah, S. Segmentation of farmlands in aerial images by deep learning framework with feature fusion and context aggregation modules. Multimedia Tools Appl. 82(27), 42353–42372. https://doi.org/10.1007/s11042-023-14962-5 (2023).
Article Google Scholar
Wang, D. & Yang, S. X. Broad learning system with Takagi–Sugeno fuzzy subsystem for tobacco origin identification based on near infrared spectroscopy. Appl. Soft Comput. 134, 109970. https://doi.org/10.1016/j.asoc.2022.109970 (2023).
Article Google Scholar
Chen, Y. T. & Chen, S. F. Localizing plucking points of tea leaves using deep convolutional neural networks. Comput. Electron. Agric. 171, 105298. https://doi.org/10.1016/j.compag.2020.105298 (2020).
Article Google Scholar
Xu, W. et al. Detection and classification of tea buds based on deep learning. Comput. Electron. Agric. 192, 106547. https://doi.org/10.1016/j.compag.2021.106547 (2022).
Article Google Scholar
Zhang, S. et al. Edge device detection of tea leaves with one bud and two leaves based on ShuffleNetv2-YOLOv5-Lite-E. 13(2), 577. https://doi.org/10.3390/agronomy13020577 (2023).
Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. arXiv Preprint. https://doi.org/10.48550/arXiv.1503.02531 (2015).
Wang, R. et al. Progressive multi-level distillation learning for pruning network. Complex. Intell. Syst. 1–13. https://doi.org/10.1007/s40747-023-01036-0 (2023).
Zhao, B., Cui, Q., Song, R., Qiu, Y. & Liang, J. Decoupled knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 11953–11962 (2022). https://doi.org/10.1109/CVPR52688.2022.01165
Li, Q., Jin, S. & Yan, J. Mimicking very efficient network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 6356–6364. https://doi.org/10.1109/CVPR.2017.776 (2017).
Wang, T., Yuan, L., Zhang, X. & Feng, J. Distilling object detectors with fine-grained feature imitation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 4933–4942. https://doi.org/10.1109/CVPR.2019.00507 (2019).
De Vries, T., Misra, I., Wang, C. & Van der Maaten, L. Does object recognition work for everyone? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops 52–59. https://doi.org/10.1109/CVPRW56347.2022.00443 (2019).
Guo, J. et al. Beyond human parts: Dual part-aligned representations for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision 3642–3651. https://doi.org/10.1109/ICCV.2019.00374 (2019).
Guo, J. et al. Distilling object detectors via decoupled features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2154–2164. https://doi.org/10.1109/CVPR46437.2021.00219 (2021).
Lin, T. Y. et al. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2117–2125. https://doi.org/10.1109/CVPR.2017.106 (2017).
Yang, Z. et al. Masked generative distillation. In European Conference on Computer Vision 53–69 (Springer, 2022). https://doi.org/10.48550/arXiv.2205.01529
Lin, T. Y. et al. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V 13 740–755 (Springer, 2014). https://doi.org/10.1007/978-3-319-10602-1_48.
Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7132–7141) (2018). https://doi.org/10.1109/CVPR.2018.00745
Woo, S., Park, J., Lee, J. Y. & Kweon, I. S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV) 3–19. https://doi.org/10.48550/arXiv.1807.06521 (2018).
Zhang, L. & Ma, K. Improve object detection with feature-based knowledge distillation: Towards accurate and efficient detectors. In International Conference on Learning Representations. https://doi.org/10.48550/arXiv.2205.15156 (2020).
Cao, Y., Xu, J., Lin, S., Wei, F. & Hu, H. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. https://doi.org/10.48550/arXiv.1904.11492 (2019).
Ren, S., He, K., Girshick, R. & Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural. Inf. Process. Syst. 28. https://doi.org/10.48550/arXiv.1506.01497 (2015).
He, K., Gkioxari, G., Dollár, P. & Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision 2961–2969. https://doi.org/10.1109/ICCV.2017.322 (2017).
Lin, T. Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision 2980–2988. https://doi.org/10.48550/arXiv.1708.02002 (2017).
Tian, Z., Shen, C., Chen, H. & He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision 9627–9636. https://doi.org/10.48550/arXiv.1904.01355 (2019).
Yang, Z., Liu, S., Hu, H., Wang, L. & Lin, S. Reppoints: Point set representation for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision 9657–9666. https://doi.org/10.48550/arXiv.1904.11490 (2019).
Chen, J. et al. Research on a parallel robot for tea flushes plucking. In 2015 International Conference on Education, Management, Information and Medicine (Atlantis Press, 2015). https://doi.org/10.2991/emim-15.2015.5 (22–26).
Wu, X. et al. Tea buds image identification based on lab color model and K-means clustering. J. Chin. Agric. Mech. 36, 161–164. https://doi.org/10.13733/j.jcam.issn.2095-5553.2015.05.040 (2015).
Article ADS CAS Google Scholar
Wu, X., Zhang, F. & Lv, J. Research on recognition of tea tender leaf based on image color information. J. Tea Sci. 33(6), 584–589. https://doi.org/10.13305/j.cnki.jts.2013.06.015 (2013).
Article Google Scholar
Wang, T. et al. Tea picking point detection and location based on Mask-RCNN. Inform. Process. Agric. 10(2), 267–275. https://doi.org/10.1016/j.inpa.2021.12.004 (2023).
Article ADS CAS Google Scholar
Qingqing, Z. H. A. N. G. et al. Tea buds recognition under complex scenes based on optimized YOLOV3 model. Acta Agriculturae Zhejiangensis. 33(9), 1740. https://doi.org/10.3969/j.issn.1004-1524.2021.09.18 (2021).
Article Google Scholar
Chen, Y. et al. Improved feature distillation via projector ensemble. Adv. Neural Inf. Process. Syst. 35, 12084–12095. https://doi.org/10.48550/arXiv.2210.15274 (2022).
Hao, Z., Guo, J., Han, K., Tang, Y., Hu, H., Wang, Y., & Xu, C. One for All: Bridge the Gap Between Heterog eneous Architectures in Knowledge Distillation. arXiv preprint arXiv:2310.19444. https://doi.org/10.48550/arXiv.2310.19444(2023)
Walawalkar, D., Shen, Z. & Savvides, M. Online ensemble model compression using knowledge distillation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIX 16 18–35 (Springer, 2020). https://doi.org/10.1007/978-3-030-58529-7_2
Chen, G., Choi, W., Yu, X., Han, T. & Chandraker, M. Learning efficient object detection models with knowledge distillation. Adv. Neural. Inf. Process. Syst. 30 (2017).
Sun, R., Tang, F., Zhang, X., Xiong, H. & Tian, Q. Distilling object detectors with task adaptive regularization. arXiv Preprint. https://doi.org/10.48550/arXiv.2006.13108. arXiv:2006.13108 (2020).
Russell, B. C., Torralba, A., Murphy, K. P. & Freeman, W. T. LabelMe: A database and web-based tool for image annotation. Int. J. Comput. Vision 77, 157–173. https://doi.org/10.1007/s11263-007-0090-8 (2008).
Article Google Scholar
Gou, J., Yu, B., Maybank, S. J. & Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vision 129, 1789–1819. https://doi.org/10.1007/s11263-021-01453-z (2021).
Article Google Scholar
Dai, X. et al. General instance distillation for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 7842–7851 (2021). https://doi.org/10.1109/CVPR46437.2021.00775.
Tung, F. & Mori, G. Similarity-preserving knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision 1365–1374. https://doi.org/10.1109/ICCV.2019.00145 (2019).
Zhixing, D. et al. Distilling object detectors with feature richness. Adv. Neural. Inf. Process. Syst. 34, 5213–5224. https://doi.org/10.48550/arXiv.2111.00674 (2021).
Article Google Scholar
Yang, Z. et al. Focal and global knowledge distillation for detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 4643–4652. https://doi.org/10.48550/arXiv.2111.11837 (2022).
Carion, N. et al. End-to-end object detection with transformers. In European Conference on Computer Vision 213–229 (Springer, 2020). https://doi.org/10.1007/978-3-030-58452-8_13
Park, W., Kim, D., Lu, Y. & Cho, M. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 3967–3976 . https://doi.org/10.48550/arXiv.1904.05068 2019).
Liu, Y. et al. Knowledge distillation via instance relationship graph. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 7096–7104. https://doi.org/10.1109/CVPR.2019.00726 (2019).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 770–778. https://doi.org/10.48550/arXiv.1512.03385 (2016)
Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X. et al. MMDetection:Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. https://doi.org/10.48550/arXiv.1906.07155 (2019).
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G. et al. Pytorch: An imperative style, high-performance deep learning library. Advances Neural Inf. Process. Syst. 32. https://doi.org/10.48550/arXiv.1912.01703 (2019).
Huang, T. et al. Knowledge diffusion for distillation. Adv. Neural. Inf. Process. Syst. 36. https://doi.org/10.48550/arXiv.2305.15712 (2024).
Wang, X., Girshick, R., Gupta, A. & He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 7794–7803. https://doi.org/10.1109/CVPR.2018.00813 (2018).

Download references

Acknowledgements

The authors acknowledge the financial support provided by Key Research and Development Project of Anhui Province (202204c06020022, 202104a06020012), Independent Project of Anhui Key Laboratory of Smart Agricultural Technology and Equipment (APKLSATE2019X001), The National Natural Science Foundation of China (32371993).

Author information

Authors and Affiliations

School of Information and Artificial Intelligence, Anhui Agricultural University, 130 Changjiang West Road, Shushan District, Hefei City, Anhui Province, China
Zhe Zheng, Guanpeng Zuo, Wu Zhang, Chenlu Zhang, Jing Zhang, Yuan Rao & Zhaohui Jiang
Key Laboratory of Agricultural Sensors, Ministry of Agriculture and Rural Affairs, Hefei, China
Wu Zhang, Yuan Rao & Zhaohui Jiang

Authors

Zhe Zheng
View author publications
Search author on:PubMed Google Scholar
Guanpeng Zuo
View author publications
Search author on:PubMed Google Scholar
Wu Zhang
View author publications
Search author on:PubMed Google Scholar
Chenlu Zhang
View author publications
Search author on:PubMed Google Scholar
Jing Zhang
View author publications
Search author on:PubMed Google Scholar
Yuan Rao
View author publications
Search author on:PubMed Google Scholar
Zhaohui Jiang
View author publications
Search author on:PubMed Google Scholar

Contributions

Z.Z. conceived ideas, designed experiments, drew figures and tables, and wrote major manuscripts, while G.Z. supervised the entire practical work and participated in the construction of the dataset. W.Z. reviewed the manuscript and provided suggestions for potential improvements. C.Z., J.Z., Y.R. and J.Z. reviewed the manuscript and guided the overall project.

Corresponding author

Correspondence to Wu Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Zheng, Z., Zuo, G., Zhang, W. et al. Learning lightweight tea detector with reconstructed feature and dual distillation. Sci Rep 14, 23669 (2024). https://doi.org/10.1038/s41598-024-73674-4

Download citation

Received: 12 January 2024
Accepted: 19 September 2024
Published: 10 October 2024
Version of record: 10 October 2024
DOI: https://doi.org/10.1038/s41598-024-73674-4

Subjects

Abstract

Similar content being viewed by others

Multistage feature fusion knowledge distillation

Cosine similarity-guided knowledge distillation for robust object detectors

Convolutional neural network based on transfer learning for discriminating the fermentation degree of black tea

Introduction

Related works

Object detection

Tea leaves recognition

Knowledge distillation

Materials and methods

Dataset construction

Method

Reconstructed feature

Decoupled distillation

Global distillation

Overall loss

Experiments

Dataset

Details

Main results

Analysis

Different distillation regions

The attention mask and channel clue

The generation block

Sensitivity study of global distillation

Sensitivity study of \(\mathcal{T}\)

Sensitivity study of α, β and γ

Sensitivity study of λ

Qualitative analysis

Conclusions

Data availability

Plant material availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Publisher’s note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Quick links