Introduction

Black tea, a fully fermented tea, is the second largest tea category in China1. According to customs statistics, the export value of Chinese tea in 2024 was 1.419 billion dollars, with an export volume of 374,100 t; black tea accounted for 24,800 t of this, or 6.62%. In 2024, the import value of Chinese tea was 157 million dollars, with an import volume of 54,000 t, of which black tea accounted for 41,900 t, or 77.63%. Black tea is named for the red color of the tea liquor and the infused leaves after the dry tea is brewed. It is made from the buds and leaves of the tea plant and refined through typical processes such as withering, rolling, fermentation, and drying. Fermentation is the key step in forming the flavor of black tea, so recognizing the degree of black tea fermentation is crucial2. At present, the identification of fermentation levels in black tea processing relies entirely on the tea master's own tea-making experience, which is arbitrary and subjective and is not conducive to the mass production of high-quality black tea3. Consequently, precisely quantifying the fermentation stages of black tea remains a significant obstacle to digitalized tea processing.

In recent years, scholars have carried out a large number of studies on discriminating the degree of fermentation. Wei et al. used PLS regression to analyze differences in the content of volatile organic compounds in pomelo wine at different fermentation stages. The PLS model showed that the α-phellandrene/geraniol ratio in pomelo wine could be a potential indicator of the degree of fermentation4. Jiang et al. qualitatively identified the degree of solid-state fermentation using PLS-DA after wavelength variable screening with FT-NIR spectroscopy; CARS and SCARS were used to screen important wavelengths. The results showed that the SCARS-PLS-DA model achieved the best validation performance, with a discrimination rate of 91.43%5. Riza et al. developed the YOLO-CoLa model within the YOLOv8 framework to detect the degree of fermentation of cocoa beans. The proposed model achieved a mAP@0.5 of 70.4%, an improvement of 9.3% over the original model, effectively enhancing detection performance6.

The fermentation of cocoa beans and pomelo wine in the above studies differs greatly from tea fermentation, as does the discrimination of the degree of fermentation. In recent years, scholars have also studied the identification of tea fermentation degree. Chen et al. conducted spectral analysis of total catechins and theanine in 161 tea samples; the best calibration models for these compounds demonstrated strong predictive capability, indicating that NIR could be an effective method for detecting the degree of tea fermentation quickly and accurately7. Fraser et al. studied the biochemical components of oolong tea during fermentation using non-targeted methods. Correlation of the spectra revealed two volatile compounds whose concentrations increased during the fermentation phase, highlighting the potential of DART-MS for rapid monitoring of complex production processes such as tea fermentation8. Cao et al. developed a sensing system based on carbon quantum dots doped with cobalt ions to assess the fermentation levels of black tea; their least squares support vector machine model was 100% accurate in distinguishing the degree of fermentation, providing an accurate and effective way to measure black tea fermentation levels9. A specific comparison is shown in Table 1.

Table 1 Comparison of existing methods for determining the degree of tea fermentation

All of the above discriminations of tea fermentation level rely on traditional techniques, but the high cost of spectrometers, the susceptibility of spectral reflectance to interference, and the high production cost of high-quality, high-purity carbon quantum dots are not conducive to large-scale application in actual production. In recent years, deep learning technology has been widely used in agriculture10. Chawla et al. proposed a new method for identifying okra infected with yellow vein mosaic virus using deep learning models; the MobileNet model achieved excellent accuracy, exceeding 99.27%, when combined with each of three RNNs11. Chen et al. proposed MTD-YOLOv7, an automated detection model for fruit and fruit-bundle maturity, which achieved a total score of 86.6% in multi-task learning with high accuracy and fast detection speed12. Tian et al. proposed an apple detection model based on YOLOv3 for different growth stages in complex orchards; with an average detection time of 0.304 s/frame, it could detect apples in the orchard in real time13. However, the application of deep learning to discriminating tea fermentation degree has rarely been reported. Moreover, large models are unsuitable for deployment because of hardware limitations in actual production environments14, and they cannot meet the strict real-time requirements of actual production. Lightweight models have clear advantages in addressing these challenges, including high computational efficiency, low memory usage, and ease of deployment on edge devices15. Zhang et al. put forward a lightweight framework based on a knowledge distillation strategy, which greatly reduced the complexity of a multimodal solar irradiance prediction model while guaranteeing acceptable accuracy and facilitating actual deployment16. Sun et al.
proposed a lightweight, high-accuracy model for detecting passion fruit in complex environments. Knowledge distillation was utilized to transfer knowledge from a strong teacher model to a weaker student model, significantly enhancing detection accuracy17. In terms of average accuracy and detection capability, the proposed model was superior to the most advanced models. These studies provide a reference for the lightweight research in this paper, but they cannot be used directly to distinguish the fermentation level of black tea. On this basis, a lightweight convolutional neural network based on transfer learning is proposed to determine the level of black tea fermentation. The main contributions of this paper are as follows. (1) Using a transfer learning strategy, 14 types of convolutional neural networks are experimentally compared, and the student and teacher models are selected. (2) Replacing the loss function improves the student model's discriminative performance. (3) Changing the model's optimizer further improves discriminative performance. (4) Knowledge distillation experiments are conducted on the above model at different Distillation Loss ratios; the model shows the best discriminative performance at a Distillation Loss ratio of 2.0. The research process of this paper is detailed in Fig. 1.

Fig. 1: The specific research flowchart.
figure 1

This is the research flowchart of the entire article, starting from practical industrial problems, creating and dividing a dataset, training and improving the network, and finally testing the model’s performance.

Results

Experimental environment and parameter settings

The training framework and parameter settings used in this research experiment are listed in detail in Table 2.

Table 2 Experimental framework and parameter setting

Model evaluation indicators

This study is an image classification task, so FLOPs, Params, Accuracy, Precision, Recall, F1, and FPS are used to evaluate the performance of the discrimination model.

  (1) FLOPs: the number of floating-point operations in model inference, reflecting the computational complexity of the model.

  (2) Params: the number of parameters of the model, reflecting its size.

  (3) Accuracy: the proportion of correctly classified samples among all samples.

  (4) Precision: the proportion of correct positive predictions (TP) among all positive predictions (TP + FP).

  (5) Recall: the proportion of actual positive samples (TP + FN) that are predicted correctly (TP).

  (6) F1: Precision and Recall alone sometimes cannot fully evaluate the performance of a model; the F1 score combines them into a comprehensive measure.

  (7) FPS: the number of image frames processed per second, reflecting inference speed.

The calculation formula is shown in Eq. (1).

$$\left\{\begin{array}{l}\mathrm{Accuracy}=\frac{TP+TN}{TP+FP+FN+TN}\\ \mathrm{Precision}=\frac{TP}{TP+FP}\\ \mathrm{Recall}=\frac{TP}{TP+FN}\\ \mathrm{F1}=\frac{2\times \mathrm{Precision}\times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}\end{array}\right.$$
(1)
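As a minimal illustration, the four metrics in Eq. (1) can be computed directly from the confusion counts. The counts below are illustrative values, not results from this study:

```python
# Sketch: computing the Eq. (1) metrics from raw confusion counts (binary case).
def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only.
acc, prec, rec, f1 = classification_metrics(tp=90, fp=10, fn=5, tn=95)
```

For the multi-class setting used in this paper, the same formulas are applied per class and averaged.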

Loss changes during training

The curve of the loss value during training of the improved model is shown in Fig. 2. Figure 2 shows that the loss value of the improved model gradually decreased as the number of training epochs increased, and the overall trend stabilized.

Fig. 2: Visualization of the model training process.
figure 2

The black curve represents the change in training loss, the red curve represents the change in validation loss, and the blue curve represents the change in knowledge distillation loss.

Comparative results of basic network experiments

Under the same experimental conditions, pre-trained weights were loaded via the transfer learning strategy, and experiments were conducted on the selected convolutional neural network models (Table 3). Table 3 shows that all models could discriminate the black tea fermentation level, among which Efficientnet_v2_m had the best discriminative result and was therefore used as the teacher model. The FLOPs, Params, Accuracy, Precision, Recall, F1, and FPS of this model were 5.445 G, 52.862 M, 0.9706, 0.9740, 0.9379, 0.9550, and 13.78, respectively. Considering that large models are unsuitable for deployment in practical applications, and to minimize FLOPs and Params while maintaining discriminative accuracy, ResNet18 was selected as the student model; its FLOPs, Params, Accuracy, Precision, Recall, F1, and FPS were 1.824 G, 11.178 M, 0.9037, 0.9065, 0.8153, 0.8519, and 75.24, respectively.

Table 3 Comparative results of basic network experiments

Optimizer comparison experiment

Under the same experimental conditions, Table 4 shows the results of three optimizers applied to the model after the loss function was replaced. Comparing the results in Table 4, none of the three optimizers changes the FLOPs or Params of the model. With AdamW, the model had the highest Accuracy, Precision, Recall, F1, and FPS, at 0.9425, 0.9272, 0.8881, 0.9064, and 74.60, respectively, followed by RMSProp, with SGD the worst. The reason is that AdamW adaptively adjusts the learning rate based on first- and second-order moment estimates of the gradient, and performs weight decay after the gradient is computed, a more accurate implementation that better regularizes the model and enhances its generalization ability.

Table 4 Experimental results of optimizer comparison

Knowledge distillation experiment results

The AT method was applied to the student model after the optimizer was replaced, and knowledge distillation experiments were carried out with Distillation Loss ratios from 0.1 to 2.0 (Table 5). Table 5 shows that at Distillation Loss ratios of 0.1, 0.5, 0.8, 1.4, 1.8, and 2.0, the discriminative performance of the model was enhanced, its complexity did not increase, and its speed was similar.

Table 5 Knowledge distillation experiment results

At this point, Precision and Recall alone cannot determine which model is superior, so the F1 score is considered in combination. On this basis, a Distillation Loss ratio of 2.0 had the best effect; the model's Accuracy, Precision, Recall, F1, and FPS were 0.9452, 0.9280, 0.9055, 0.9164, and 74.22, respectively.

Results of ablation experiments

To verify the effectiveness of each improvement step, ablation experiments were conducted under the same experimental conditions. The results are presented in Table 6, and the metrics during the improvement process are visualized in Fig. 3. From Table 6 and Fig. 3, it can be seen that for the selected student model ResNet18, after replacing the loss function with PolyLoss, the FLOPs, Params, Accuracy, Precision, Recall, F1, and FPS of the model were 1.824 G, 11.178 M, 0.9265, 0.9164, 0.8630, 0.8836, and 73.75, respectively. This indicates that PolyLoss can guide model learning in a richer information space, enabling the model to capture data features more comprehensively and improve discrimination accuracy. After replacing the optimizer with AdamW, the FLOPs, Params, Accuracy, Precision, Recall, F1, and FPS of the model were 1.824 G, 11.178 M, 0.9425, 0.9272, 0.8881, 0.9064, and 74.60, respectively. This demonstrates that AdamW combines the advantages of several optimization algorithms, such as RMSProp, and can adaptively adjust the parameter update step during training; this adaptive capability allows AdamW to update parameters more accurately and accelerates convergence. After knowledge distillation with the AT method, the FLOPs, Params, Accuracy, Precision, Recall, F1, and FPS of the model were 1.824 G, 11.178 M, 0.9452, 0.9280, 0.9055, 0.9164, and 74.22, respectively. This indicates that at a Distillation Loss ratio of 2.0, the model can effectively mine knowledge from the teacher model without affecting speed, thereby optimizing the performance of the student model.

Fig. 3: Performance comparison of model improvement process.
figure 3

Among them, the black curve represents the Accuracy change of the model, the red curve represents the Precision change of the model, the blue curve represents the Recall change of the model, and the green curve represents the F1 change of the model.

Table 6 Results of ablation experiments
Table 7 Convolutional neural network models

Confusion matrix comparison

The confusion matrices of the model before and after improvement were created (Fig. 4). Figure 4 shows that the enhanced model improved the accuracy of distinguishing each fermentation level of black tea, but, compared with the original model, the probability of misclassifying mild fermentation and excessive fermentation as moderate fermentation increased. The likely reason is that moderate fermentation lies between mild and excessive fermentation; it is a transitional stage that partly overlaps with both.

Fig. 4: Comparison of model discriminant confusion matrices before and after improvement.
figure 4

The first image represents the discriminant confusion matrix of the original model, and the second image represents the discriminant confusion matrix of the improved model.

Comparison of model detection effects

Two photos were randomly selected from the test set for testing and heat map visualization (Fig. 5). Figure 5 shows that in the first group the original model misjudged moderate fermentation as excessive fermentation, whereas the modified model avoided this misjudgment. In the second group, the improved model identified the level of tea fermentation with higher confidence. The heat map of the second group shows that the original model focused on a smaller region, whereas the improved model attended to a wider area of the fermented black tea, making its judgment more integrated and comprehensive.

Fig. 5: Comparison of model discrimination effect.
figure 5

Group 1 in the figure shows the discrimination results, and Group 2 shows the discrimination heat maps.

Discussion

In this study, a lightweight convolutional neural network based on transfer learning was proposed to identify the fermentation level of black tea. First, the transfer learning strategy was used to experimentally compare 14 kinds of convolutional neural networks; the student model ResNet18 and the teacher model Efficientnet_v2_m were selected based on model complexity and the experimental results. Second, the student model's loss function was replaced with PolyLoss, and the original optimizer RMSProp was replaced with AdamW. Finally, the AT method was used to distill knowledge into the model after the optimizer replacement. Experiments on a custom dataset showed that the Accuracy, Precision, Recall, F1, and FPS of the improved model were 0.9452, 0.9280, 0.9055, 0.9164, and 74.22, respectively. The model improved Accuracy, Precision, Recall, and F1 by 0.0415, 0.0215, 0.0902, and 0.0645, respectively, without increasing complexity and with comparable speed. The improved model distinguished the various levels of black tea fermentation more accurately than the original model, but the probability of misclassifying mild fermentation and excessive fermentation as moderate fermentation increased. The likely reason is that moderate fermentation lies between mild and excessive fermentation; it is a transitional stage that partly overlaps with both. The model should be further optimized to reduce this misjudgment rate.

Although the improved model has better discriminative performance and is lightweight, certain limitations remain. For example, the image data are limited, so the model may not fully learn the complex features and subtle differences of images during black tea fermentation; in particular, generalization may be insufficient for rare or special fermentation states. The lighting conditions and background during image collection were also relatively simple and cannot cover all situations in an actual production environment. Furthermore, deploying the model in actual production is a technical challenge. Hardware compatibility is undoubtedly one of the primary challenges: existing production environments are often equipped with diverse hardware devices of different models, specifications, and performance, and our model must be adapted to these components to integrate smoothly. Real-time processing is another strict requirement: the model must analyze input data and output results in a very short time to maintain the continuity and efficiency of the production process.

Next, we aim to further refine the discriminative model and design effective deployment strategies to ensure successful deployment in actual production. We plan to gather additional images of black tea fermentation from diverse varieties and complex settings to broaden the dataset and enhance the model's generalization capability. In addition, to improve robustness under different lighting conditions, we will introduce lighting normalization to adaptively adjust image pixel values, simulate imaging under different lighting environments, and enable the model to learn more robust feature representations. Simultaneously, we will employ deep learning techniques to integrate the image features of fermented black tea with its internal chemical components, allowing a more thorough and comprehensive assessment of the fermentation level.

Methods

Dataset production

The images used in this study were collected from the Tea Research Institute of the Chinese Academy of Agricultural Sciences, located at 120.03°E, 30.18°N. The samples were one bud and one leaf of Tie Guanyin. Fermentation experiments were conducted in an artificial climate box (LHS-150), with the fermentation temperature and relative humidity set to 30 °C and 90%, respectively. Black tea was fermented for 5 h, and a total of 187 black tea fermentation images were collected at 0, 1, 2, 3, 4, and 5 h using a Canon camera (EOS 80D, Canon). The image samples covered the entire black tea fermentation process. During image collection, the camera was positioned 400 mm above the fermented tea samples.

Because the number of collected image samples was small, to enhance the robustness of the model, this paper applied rotation, mirroring, noise addition, and cropping to the 187 images. Rotation simulates actual changes in shooting angle, allowing the model to recognize image features from different angles. Mirroring simulates the symmetry of images in the horizontal or vertical direction, increasing data diversity, since some fermentation features may be symmetric. Noise addition simulates potential interference during image acquisition, such as device noise, making the model adaptable to noise and improving robustness. Cropping removes unnecessary background or redundant parts from an image, highlighting key information and allowing the model to better focus on the key features of black tea fermentation, thereby improving generalization. The dataset was thus expanded from 187 to 3740 images18. The samples from the six fermentation time points were categorized into three groups by fermentation degree: 0–3 h for mild fermentation, 4 h for moderate fermentation, and 5 h for excessive fermentation. The dataset was divided into 60% for training, 20% for validation, and 20% for testing. Figure 6 shows specific examples of black tea fermentation degree and the dataset production.

Fig. 6: Specific samples and dataset production.
figure 6

The fermentation stages are divided into three categories in time sequence, images are collected and then augmented, and the dataset is divided.
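The four augmentation operations described above can be sketched as follows. This is a minimal NumPy example; the rotation angle, noise level, and crop margins are illustrative assumptions, as the paper does not specify the parameters used:

```python
import numpy as np

# Sketch of the four augmentation operations (rotation, mirroring,
# noise addition, cropping) on an HxWx3 image array.
rng = np.random.default_rng(0)

def augment(img):
    rotated = np.rot90(img, k=1)                                  # 90-degree rotation
    mirrored = img[:, ::-1]                                       # horizontal mirror
    noisy = np.clip(img + rng.normal(0, 10, img.shape), 0, 255)   # Gaussian noise
    h, w = img.shape[:2]
    cropped = img[h // 8 : -(h // 8), w // 8 : -(w // 8)]         # central crop
    return rotated, mirrored, noisy, cropped

img = rng.integers(0, 256, (64, 64, 3)).astype(np.float32)
variants = augment(img)
```

In practice each operation can be applied with several parameter settings per image, which is how 187 source images expand to a dataset of several thousand.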

Convolutional neural network

The convolutional neural network is one of the representative algorithms of deep learning. It consists of convolutional layers and pooling layers and performs particularly well in image processing19. Many convolutional neural network models exist for different tasks and scene image classification. This article selects the convolutional neural network models shown in Table 7.

Transfer learning

Transfer learning plays a vital role in deep learning. Its core idea is that models trained on large datasets can be "migrated" to new tasks, avoiding training from scratch20. Using pre-trained models is a particular transfer learning strategy. In this study, a pre-trained convolutional neural network model is loaded during training to shorten training time and improve model performance.

ResNet18

The ResNet model was proposed by He et al. in 201521 and has been widely applied in computer vision tasks, with its performance and stability fully validated. It has a simple structure, efficient training, and good generalization ability, achieving good training results in a short time. Based on model maturity, discriminative effect, and resource and time efficiency, this paper selects the ResNet18 network as the student model. It is constructed by stacking multiple residual blocks; each residual block contains two 3 × 3 convolutional layers, and the input is added directly to the output through a skip connection, solving the problems of gradient vanishing and degradation in deep networks. This design enables effective training of deeper networks. The specific network framework is shown in Fig. 7.

Fig. 7: Structure of ResNet18 network.
figure 7

The ResNet18 network consists of 17 convolutional layers and 1 fully connected layer. The solid line represents no change in the number of channels in the residual block, while the dashed line represents a change in the number of channels.
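The residual block described above can be sketched as follows. This is a simplified PyTorch version of the basic block; the layer hyperparameters follow the standard ResNet18 design rather than values stated in the paper, and the optional 1 × 1 projection corresponds to the dashed connections in Fig. 7 where the channel count changes:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection adding input to output."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # 1x1 projection when shape changes (the dashed connections in Fig. 7).
        self.downsample = None
        if stride != 1 or in_ch != out_ch:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)

block = BasicBlock(64, 128, stride=2)       # channel change: dashed connection
y = block(torch.randn(1, 64, 56, 56))       # output: (1, 128, 28, 28)
```

Stacking such blocks (with periodic channel doubling and stride-2 downsampling) plus the stem convolution and final fully connected layer yields the 18-layer network of Fig. 7.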

Efficientnet_v2_m

Efficientnetv2 is the second-generation model of the Efficientnet family, presented by Google at ICML 202122. It inherits the core concept of Efficientnetv1, the compound scaling method, but makes several improvements to achieve smaller model size, faster training, and better parameter efficiency. Efficientnetv2 has several variants, including s, m, and l, each with different complexity and performance. Efficientnetv2 adopts the Fused-MBConv structure, an improvement on the traditional MBConv structure that merges the expansion convolution and depthwise convolution into a single standard 3 × 3 convolutional layer, simplifying the network structure, reducing computational cost, and accelerating training. In the initial part of the network, Fused-MBConv significantly improves training speed; as network depth increases, the model gradually returns to the more traditional MBConv modules to balance performance and efficiency. In addition, Efficientnet_v2_m further optimizes the scaling strategy to suit the new Fused-MBConv structure, achieving an optimal balance of efficiency and performance at different scales. The network structure of Efficientnet_v2_m is presented in Fig. 8.

Fig. 8: Network architecture diagram of Efficientnet_v2_m.
figure 8

On the left is the structural diagram of Efficientnet_v2_m, on the upper right is the Fused-MBConv structure, and on the lower right is the MBConv structure.

PolyLoss

Cross-Entropy Loss measures the difference between the actual output of a neural network and the correct label and updates network parameters through backpropagation23. It mitigates the effect of class imbalance during training and is robust to class ordering. The calculation formula is shown in Eq. (2), where \({\alpha }_{j}\in {R}^{+}\) is a polynomial coefficient and \({P}_{t}\) is the predicted probability of the target class label.

$${L}_{{CE}}=-\log \left({P}_{t}\right)=\mathop{\sum }\limits_{j=1}^{\infty }{\alpha }_{j}{\left(1-{P}_{t}\right)}^{j}$$
(2)

PolyLoss is an optimized version of Cross-Entropy Loss that approximates the loss function by Taylor expansion within a simple framework24. The loss function is designed as a linear combination of polynomial functions; the specific calculation formula is shown in Eq. (3), where N is the number of leading coefficients to be adjusted and \({\varepsilon }_{j}\in [-\frac{1}{j},\infty )\) is the perturbation term. In this study, the original Cross-Entropy loss of the selected student model is replaced with PolyLoss.

$${L}_{{Poly}}=-\log \left({P}_{t}\right)+\mathop{\sum }\limits_{j=1}^{N}{\varepsilon }_{j}{\left(1-{P}_{t}\right)}^{j}$$
(3)
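Eq. (3) with N = 1, the commonly used "Poly-1" form, can be sketched as follows. The value of the coefficient `epsilon` is an assumption, since the paper does not report the coefficients used:

```python
import torch
import torch.nn.functional as F

def poly1_loss(logits, target, epsilon=1.0):
    """Poly-1 loss: cross-entropy plus epsilon_1 * (1 - P_t), per Eq. (3) with N=1."""
    ce = F.cross_entropy(logits, target, reduction="none")
    # P_t: predicted probability of the true class for each sample.
    pt = torch.gather(F.softmax(logits, dim=-1), 1, target.unsqueeze(1)).squeeze(1)
    return (ce + epsilon * (1.0 - pt)).mean()

logits = torch.tensor([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])
target = torch.tensor([0, 1])
loss = poly1_loss(logits, target)
```

Setting `epsilon=0` recovers plain Cross-Entropy Loss, which makes the relationship between Eqs. (2) and (3) easy to verify numerically.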

AdamW

The AdamW optimizer is a variant of the Adam optimizer that incorporates weight decay into Adam25. The key to AdamW is that it applies weight decay separately from the gradient update, which addresses the incompatibility of L2 regularization with adaptive learning rate algorithms. In this study, the optimizer of the model after the loss function replacement is changed from RMSProp to AdamW.
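The optimizer swap can be sketched as follows; the model and hyperparameter values here are illustrative assumptions, not the settings reported in Table 2:

```python
import torch
import torch.nn.functional as F

# A toy model stands in for the student network.
model = torch.nn.Linear(10, 3)

# AdamW decouples weight decay from the adaptive gradient step:
# decay is applied to the parameters directly, not added to the gradient.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

x, y = torch.randn(4, 10), torch.randint(0, 3, (4,))
loss = F.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The same three-line loop body (`zero_grad`, `backward`, `step`) is unchanged from the RMSProp version; only the optimizer constructor differs.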

Knowledge distillation

In recent years, computing power has continuously improved, and deep learning models have grown ever larger. However, limited by resource capacity, deep neural models are difficult to deploy on devices. As an effective model optimization method, knowledge distillation can reduce model complexity and computational overhead while retaining the key knowledge of a high-performance model26.

Attention Transfer (AT), proposed at the ICLR 2017 conference, is a knowledge distillation method27. It extracts attention from the teacher network and distills the learned attention maps into the student network as a form of knowledge, so that the student network learns to generate attention maps similar to those of the teacher network, thereby improving the student's performance. This study uses the Efficientnet_v2_m model as the teacher and employs the AT method to conduct knowledge distillation experiments on the student model ResNet18 at different Distillation Loss ratios. The specific schematic diagram is shown in Fig. 9.

Fig. 9: Schematic diagram of AT.
figure 9

The top of the picture is the teacher model, and the bottom is the student model. This method extracts attention from the teacher network and uses it as a goal to guide the student network.
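The AT loss can be sketched as follows. This is a simplified version under stated assumptions: each feature map is reduced to a spatial attention map by averaging squared activations over channels and normalizing, the student-teacher distance over these maps is penalized, and `beta` plays the role of the Distillation Loss ratio (2.0 performed best in this study). The feature shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def attention_map(feat):
    """Collapse a BxCxHxW feature map to a normalized Bx(H*W) attention map."""
    am = feat.pow(2).mean(dim=1).flatten(1)  # average squared activations over channels
    return F.normalize(am, dim=1)

def at_loss(student_feat, teacher_feat):
    """Distance between student and teacher attention maps (same spatial size)."""
    return (attention_map(student_feat) - attention_map(teacher_feat)).pow(2).mean()

def total_loss(logits, target, s_feat, t_feat, beta=2.0):
    """Task loss plus beta (the Distillation Loss ratio) times the AT term."""
    return F.cross_entropy(logits, target) + beta * at_loss(s_feat, t_feat)

s_feat = torch.randn(2, 64, 14, 14)    # student feature map
t_feat = torch.randn(2, 128, 14, 14)   # teacher may have more channels
logits, target = torch.randn(2, 3), torch.tensor([0, 2])
loss = total_loss(logits, target, s_feat, t_feat)
```

Because the channel dimension is collapsed, student and teacher layers only need matching spatial sizes, which is what lets a small ResNet18 learn from the wider Efficientnet_v2_m.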