Introduction

Pears are the second most consumed pome fruit worldwide1. Today, approximately 1.5 million ha are devoted to pear production worldwide, yielding 25 million tons of pears annually2. Pears have a pleasant sweet taste, are a source of fiber and vitamin C, and have a low sodium content, which makes them a healthy snack when eaten fresh2. In addition, pears have many properties beneficial to human health, including antitussive, anti-inflammatory, antihyperglycemic, and diuretic effects3. Pears are therefore very popular in daily life, with consumption by the adult population ranging from 23 to 108 g/day4. As an ancient fruit of temperate regions5, the pear comprises many groups, such as the white pear, sand pear, and Ussurian pear. Typically, several varieties of fruit are grown simultaneously in an orchard, so varieties with a similar appearance are easily mixed during harvesting and marketing6,7. Although different varieties are similar in external appearance, they differ in intrinsic qualities such as taste and nutritional value8,9. For example, China has a great diversity of pear varieties owing to the widespread consumption of the fruit. Each variety is characterized by a different appearance and by different contents of phenolic compounds, nutritional ingredients, antioxidant and anti-inflammatory activities, and other properties3. This is why people spend a lot of time sorting, packing and labeling fruits before selling them8,9. In addition, the identification and classification of fruits is a necessary task in smart agricultural applications10, such as mechanized automatic picking and harvest assessment in large orchards, intensive processing of fruits in factories, and integrated fruit weighing in supermarkets to calculate fruit prices automatically.
In the traditional agricultural planting and harvesting process, fruit identification relies mainly on manual labor, which is highly subjective, slow, and unable to meet the needs of large orchards11. Commonly used physicochemical analysis methods based on testing fruit components are time-consuming, expensive to run, and complex in sample preparation12,13. With industrial development and technological advances, a number of rapid and non-destructive techniques for differentiating fruit varieties have emerged, such as the electronic nose, visible and near-infrared spectroscopy, and image processing-based methods14,15. In particular, the application of machine vision and deep learning can improve efficiency in many aspects of agriculture.

Deep learning has been successfully applied as a non-destructive technique for the automatic identification, classification and detection of fruits and vegetables, with the advantages of speed, convenience, low cost, and high accuracy16. In particular, Convolutional Neural Networks (CNNs), a deep learning framework with strong capabilities in automatic feature learning from images, have achieved impressive results in various food and agricultural challenges17,18. Recently, CNNs have been used for fruit recognition tasks, but mainly for quality assessment and bruise detection11. For instance, Agus Pratondo19 used a classifier built on Inception-v3 to classify three types of pears and compared it with several traditional machine learning algorithms. The Inception-v3 classifier performed best, with an accuracy of 94.00%. Ismail et al.20 used the EfficientNet methodology to develop a system that was trained and evaluated on two fruit datasets. The average accuracy on the two testing datasets was 96.7% and 93.8%, respectively. Rojas-Aranda, J. L., et al.21 presented a system for classifying three types of fruit based on a CNN architecture and MobileNetV2. This method achieved 95% accuracy for fruits that were not wrapped in plastic bags. Gill, Harmandeep Singh, et al.22 employed CNN, Recurrent Neural Network, and Long Short-Term Memory deep learning methods to extract optimal image features, select features after extraction, and finally classify the fruits using the extracted image features. José Naranjo-Torres23 found in his review that CNNs are highly efficient at addressing critical fruit image processing tasks within the agro-food industry. However, CNN-based approaches still face important challenges before they can be applied in real-world scenarios.
For example, the search for CNN parameters, such as the number of layers and filters when proposing a CNN architecture for a specific problem, as well as the determination of the parameters and hyperparameters of the model, remains a relevant problem that is commonly solved by trial-and-error tuning until the best settings are found, which is very time-consuming for deep learning models. Additionally, images captured during agricultural production and processing are often noisy, and the “black-box” nature of deep learning results in a lack of model interpretability. These factors further limit the widespread application of CNNs in agriculture.

Accordingly, the contributions of this study are as follows.

  1. We used Bayesian optimization (BO) deep learning to classify 9 categories of pears. Different hyperparameters and the corresponding model performances were evaluated and compared, and two of the optimal models were further analyzed.

  2. We set up two datasets and different validation set proportions to study the impact of data configuration on the performance of BO deep learning. Specifically, one dataset follows a common configuration, i.e. more training data and less testing data, and the other is designed to approach the reality that the testing set is effectively infinite, i.e. less training data and more testing data11. In addition, we added different degrees of Gaussian white noise to the data to bring it closer to actual agricultural production applications.

  3. We used three visualization methods (feature visualization, strongest activations and LIME) to reveal how the two optimal models, as “black boxes”, make classification decisions.

Materials and methods

Fruits-360 dataset

“Fruits-360” (https://www.kaggle.com/moltean/fruits, Version: 2020.05.18.0, accessed on 10 October 2022) is a publicly available benchmark dataset8,24 that has been employed by several studies to evaluate their proposed models. For example, Siddiqi25 used this dataset to classify different categories of fruits and showed that the Fruits-360 dataset is larger than other fruit datasets. Kodors et al.26 used this dataset to classify apples and pears in order to compare the performance of different deep learning architectures. Choudhary, K., et al.27 developed a fruit recognition approach using this dataset, in which the CNN-based ResNet-50 method was employed for feature extraction and fruit identification and was reported to be 99% accurate. Rahman M M et al.28, in their literature review section, enumerated and compared ten publications on fruit recognition and classification, four of which used the Fruits-360 dataset. On this basis, an enhanced version of the “Fruits-360” dataset was employed in this study to objectively evaluate and demonstrate the performance of BO deep learning models and to make it easier for researchers to reproduce our work. Fruits-360 is a comprehensive dataset featuring 131 different categories of fruits and vegetables, totaling 90,483 images. Among them, pears comprise 9 types, namely Pear, Pear 2, Pear Abate, Pear Forelle, Pear Kaiser, Pear Monster, Pear Red, Pear Stone, and Pear Williams. Each image (100 × 100 pixels) shows a single pear on a white background, as shown in Fig. 1.

Fig. 1

Nine classes of pears: (1) Pear, (2) Pear 2, (3) Pear Abate, (4) Pear Forelle, (5) Pear Kaiser, (6) Pear Monster, (7) Pear Red, (8) Pear Stone and (9) Pear Williams.

Training and testing datasets set-up

Two datasets were constructed from images of the 9 categories of pears in Fruits-360, as shown in Table 1. Dataset A is configured according to the original training and test sets provided by Fruits-360, and Dataset B is an inverted version of Dataset A. That is, the test set of Dataset B corresponds to the training set of Dataset A, and the training set of Dataset B corresponds to the test set of Dataset A. In the original “Fruits-360” dataset, the fifth category (Pear Kaiser) has the fewest training and testing images, 300 and 102 respectively, while the eighth category (Pear Stone) has the most, 711 and 237 respectively. To avoid data imbalance, 300 and 100 images were randomly selected from the training and test sets of each category, respectively, and data augmentation was then performed on each image to construct the two datasets used in this study.
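
The balancing and inversion steps described above can be illustrated with a minimal Python sketch. The image identifiers, the helper name build_datasets and the fixed random seed are hypothetical; only the per-class counts (300 training and 100 test images) and the exchange of roles in Dataset B are taken from the text. The sketch covers the selection step only, before any noise augmentation.

```python
import random

def build_datasets(train_ids, test_ids, n_train=300, n_test=100, seed=0):
    """Balance every class, then build Dataset A and its inverted twin B.

    train_ids / test_ids map a class name to a list of image identifiers
    (hypothetical stand-ins for the Fruits-360 image files).
    """
    rng = random.Random(seed)
    dataset_a = {"train": {}, "test": {}}
    for name in train_ids:
        # Draw the same number of images per class to avoid class imbalance.
        dataset_a["train"][name] = rng.sample(train_ids[name], n_train)
        dataset_a["test"][name] = rng.sample(test_ids[name], n_test)
    # Dataset B swaps the roles: A's test images train B, A's training images test B.
    dataset_b = {"train": dataset_a["test"], "test": dataset_a["train"]}
    return dataset_a, dataset_b

# Toy identifiers using the smallest and largest class sizes quoted in the text.
train_ids = {
    "Pear Kaiser": [f"kaiser_train_{i}" for i in range(300)],
    "Pear Stone": [f"stone_train_{i}" for i in range(711)],
}
test_ids = {
    "Pear Kaiser": [f"kaiser_test_{i}" for i in range(102)],
    "Pear Stone": [f"stone_test_{i}" for i in range(237)],
}
dataset_a, dataset_b = build_datasets(train_ids, test_ids)
```

Because Dataset B reuses exactly the images selected for Dataset A, any difference between the two configurations comes from the train/test proportions rather than from different images.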

Data augmentation is a commonly used technique that enhances the training of deep learning models29. It mitigates overfitting in deep networks by effectively expanding the dataset, particularly when the training set is limited. Common data augmentation strategies include geometric rotation, adversarial training, and generative adversarial networks (GANs)30. Despite their utility, these approaches have limitations. For instance, geometric rotation may not resolve the low accuracy of CNNs on noisy images, while GANs are complex and difficult to train31. Additive Gaussian white noise is a fundamental noise model used in information theory to mimic the effect of many random processes that occur in nature32. Adding Gaussian white noise to images therefore provides a straightforward method of dataset augmentation. Consequently, to enlarge the dataset and to simulate the varying image quality encountered in real-world scenarios, data augmentation was performed by injecting Gaussian white noise with a mean (M) ranging from 0 to 1 in increments of 0.1 and a fixed variance of 0.01. Pear images with different levels of added Gaussian white noise are shown in Fig. 2.
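
As an illustration of this augmentation, the following Python sketch adds Gaussian noise with a given mean and a variance of 0.01 to a small grayscale image. It rests on several assumptions that the text does not state: pixel intensities are scaled to [0, 1], noisy values are clipped back to that range, and a single-channel placeholder image stands in for the colour pear images. The function name add_gaussian_noise is hypothetical.

```python
import random

def add_gaussian_noise(image, mean, variance=0.01, rng=None):
    """Add Gaussian white noise to a grayscale image with values in [0, 1].

    Clipping the result back to [0, 1] is an assumption of this sketch.
    """
    rng = rng or random.Random()
    sigma = variance ** 0.5
    return [
        [min(1.0, max(0.0, pixel + rng.gauss(mean, sigma))) for pixel in row]
        for row in image
    ]

# Noise means from 0 to 1 in steps of 0.1, i.e. 11 levels.
noise_means = [round(0.1 * k, 1) for k in range(11)]

rng = random.Random(0)
image = [[0.5 for _ in range(4)] for _ in range(4)]   # tiny placeholder image
augmented = {m: add_gaussian_noise(image, m, rng=rng) for m in noise_means}
```

With a mean close to 1, most pixels saturate at the upper bound under this clipping assumption, which is consistent with the observation that heavily degraded images are hard to recognize by eye.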

Other studies33,34,35 have generally employed only one dataset (more training data and less testing data, similar to the data configuration of the original “Fruits-360”) to evaluate the performance of models. In contrast, we set up these two datasets in order to determine the effect of different dataset configurations on the classification results. This is because it is often difficult to obtain large training sets in practice, and even when large training sets are available, labeling them manually is time-consuming and laborious36. In addition, a larger training set means more computing resources and a longer training time for the same model. Therefore, this dataset configuration, especially Dataset B, is closer to real applications and helps reflect the real performance of the models.

Table 1 The number of pear images per class for training and testing in the two datasets.
Fig. 2

Pear images with the addition of different levels of Gaussian white noise.

Deep learning using Bayesian optimization

In the implementation of deep learning, the choice of an appropriate network depth and hyperparameters typically relies on the practitioner’s expertise and experience. These parameters must be adjusted repeatedly while training the model in order to determine the optimal parameters and the corresponding optimal model. This process often has to be repeated, particularly when the training dataset is modified or the network requires periodic updates, resulting in considerable time and computing expenditure. BO treats the training of a model with a given set of hyperparameters as the evaluation of an objective function, which makes it suitable for handling the expensive evaluations of the deep learning training process. Because of its powerful hyperparameter adjustment strategy and its high efficiency, BO is being applied in a growing number of deep learning studies. In addition, BO reduces the manual trial and error often required in deep learning applications by dynamically balancing the exploration of the hyperparameter space against the exploitation of existing information, thereby speeding up development and making it possible to discover better configurations than manual tuning. It is often more efficient at finding optimal hyperparameter combinations than traditional methods such as random search.

Choose hyperparameters to optimize

In this study, BO is employed to select the optimal hyperparameters. The hyperparameters include network section depth (NSD), initial learning rate, stochastic gradient descent (SGD) momentum, and L2 regularization strength, all of which can significantly affect the training process and the performance of the model. The search ranges of the hyperparameters were determined from experience and expertise, the literature, or a few simple trials. The details are as follows.

  (1) NSD. The network architecture is divided into three sections, each containing the same number of convolutional layers, so that the total number of convolutional layers equals three times the NSD. To keep the number of parameters and the required amount of computation roughly the same for different NSDs in each iteration, the objective function sets the number of convolutional filters in each layer proportional to 1/sqrt(NSD). The search range of this parameter is set from 1 to 3.

  (2) Initial learning rate. The initial learning rate determines the size of the steps taken during the optimization process. A good learning rate is crucial for efficient convergence during training. In addition, the optimal initial learning rate can vary depending on the dataset and network architecture. The search range of this parameter is set from 1e-2 to 1.

  (3) SGD momentum. Momentum introduces inertia into the parameter updates, helping the optimization algorithm to navigate the loss landscape more efficiently. By incorporating information from previous updates, momentum smooths the trajectory of parameter updates, leading to faster convergence and better generalization. The search range of this parameter is set from 0.8 to 0.98.

  (4) L2 regularization strength. L2 regularization is used to mitigate overfitting by penalizing large weights in the network. The regularization strength controls the impact of this penalty on the loss function, and an appropriate value is important for balancing reduced overfitting against model performance. The search range of this parameter is set from 1e-10 to 1e-2.
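
The four search ranges and the 1/sqrt(NSD) scaling rule can be written down compactly. The Python sketch below is illustrative only: the key names, the log-uniform sampling of the learning rate and the regularization strength, and the constant base_filters = 16 are assumptions, since the text gives only the ranges and the proportionality.

```python
import math
import random

# Search ranges quoted in the text; the key names are illustrative.
SEARCH_SPACE = {
    "section_depth": (1, 3),              # integer-valued NSD
    "initial_learn_rate": (1e-2, 1.0),
    "momentum": (0.8, 0.98),
    "l2_regularization": (1e-10, 1e-2),
}

def sample_hyperparameters(rng):
    """Draw one candidate configuration; log-uniform sampling is an assumption."""
    def log_uniform(low, high):
        return 10 ** rng.uniform(math.log10(low), math.log10(high))
    return {
        "section_depth": rng.randint(*SEARCH_SPACE["section_depth"]),
        "initial_learn_rate": log_uniform(*SEARCH_SPACE["initial_learn_rate"]),
        "momentum": rng.uniform(*SEARCH_SPACE["momentum"]),
        "l2_regularization": log_uniform(*SEARCH_SPACE["l2_regularization"]),
    }

def network_layout(section_depth, base_filters=16):
    """Convolutional layer count and scaled filter count for a given NSD.

    base_filters is a hypothetical constant; the text only states that the
    filter count is proportional to 1 / sqrt(NSD).
    """
    return {
        "conv_layers": 3 * section_depth,
        "filters": round(base_filters / math.sqrt(section_depth)),
    }

candidate = sample_hyperparameters(random.Random(0))
layouts = {depth: network_layout(depth) for depth in (1, 2, 3)}
```

Under these assumptions, an NSD of 1, 2 or 3 gives 3, 6 or 9 convolutional layers with 16, 11 or 9 filters respectively, so deeper candidates are narrower and the cost per evaluation stays roughly comparable.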

Perform BO and objective function

The objective function for the Bayesian optimizer is designed to minimize the classification error on the validation set while tuning the hyperparameters used to train the CNNs. The optimizer leverages past evaluations to guide subsequent iterations, with the goal of exploring the hyperparameter space efficiently. The objective function takes hyperparameters as input and trains a CNN model with them on the training data. The model’s performance is then assessed by its classification error on the validation set. This error serves as the optimization criterion, guiding the BO process towards hyperparameter configurations that yield a lower validation error.
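
The loop described above can be sketched in a few dozen lines of Python. This is not the implementation used in the study: it assumes a Gaussian process surrogate with a squared-exponential kernel and an expected-improvement acquisition function (the text does not name either), it tunes a single variable, and toy_validation_error is a hypothetical stand-in for "train a CNN and return its validation error".

```python
import math
import random

def kernel(a, b, length_scale=0.2):
    """Squared-exponential covariance between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def cholesky(matrix):
    """Lower-triangular Cholesky factor of a small symmetric matrix."""
    n = len(matrix)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / low[j][j]
    return low

def cho_solve(low, rhs):
    """Solve (low @ low.T) x = rhs by forward and back substitution."""
    n = len(rhs)
    y = [0.0] * n
    for i in range(n):
        y[i] = (rhs[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x

def expected_improvement(mu, sigma, best):
    """Expected improvement over the best value so far (minimization)."""
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best - mu) * cdf + sigma * pdf

def bayes_opt(objective, n_init=3, n_iter=12, jitter=1e-4, seed=0):
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n_init)]      # initial random designs
    ys = [objective(x) for x in xs]
    grid = [i / 100 for i in range(101)]            # candidate inputs in [0, 1]
    for _ in range(n_iter):
        mean_y, best = sum(ys) / len(ys), min(ys)
        cov = [[kernel(a, b) + (jitter if i == j else 0.0)
                for j, b in enumerate(xs)] for i, a in enumerate(xs)]
        low = cholesky(cov)
        alpha = cho_solve(low, [y - mean_y for y in ys])

        def acquisition(x):
            k_star = [kernel(a, x) for a in xs]
            mu = mean_y + sum(k * w for k, w in zip(k_star, alpha))
            v = cho_solve(low, k_star)
            var = kernel(x, x) + jitter - sum(k * w for k, w in zip(k_star, v))
            return expected_improvement(mu, math.sqrt(max(var, 1e-12)), best)

        x_next = max(grid, key=acquisition)         # most promising candidate
        xs.append(x_next)
        ys.append(objective(x_next))                # one "expensive" evaluation
    i_best = ys.index(min(ys))
    return xs[i_best], ys[i_best], ys

def toy_validation_error(x):
    """Hypothetical stand-in for "train a CNN, return validation error"."""
    log10_lr = -2.0 + 2.0 * x      # map [0, 1] to a learning rate in [1e-2, 1]
    return 0.05 + 0.4 * (log10_lr + 1.0) ** 2

x_best, y_best, history = bayes_opt(toy_validation_error)
```

Each pass through the loop corresponds to one row of a BO results table: the surrogate is refitted to all past evaluations, the acquisition function proposes the next configuration, and the objective is evaluated once more. In the real setting each evaluation is a full training run, which is why the number of evaluations is kept small.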

Validation set configuration and optimal model selection

After the set maximum number of iterations is reached, the hyperparameters yielding the lowest validation set error are selected as the optimal configuration. A final optimal model is trained with these hyperparameters on the training dataset. The performance of the optimal model is then evaluated on the independent test set, providing an unbiased assessment of its performance on unseen data. We also considered the impact of different validation set configurations on the performance of the optimal models.

Image processing

Both the original images in the “Fruits-360” dataset and the noise-added images are 100 × 100 pixels in size. Because the models in this study are trained from scratch, the input image size is not constrained by the requirements of pretrained networks. To reduce the models’ reliance on high-performance computers for storage, training, and inference, and to further enhance their practical applicability, all images were resized to 32 × 32 pixels. The resized images are much smaller than both the original images and the input sizes required by pretrained networks. For instance, in previous research11, all images were resized to fit the input size of each pretrained network, specifically 227 × 227 pixels for AlexNet and 224 × 224 pixels for VGG-19, ResNet-18, ResNet-50 and ResNet-101.
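
The size reduction can be quantified with simple arithmetic on the image dimensions quoted above. The short Python sketch below counts pixels per colour channel only; it says nothing about the resulting training time or memory use, which depend on the network as well.

```python
def pixel_count(side):
    """Number of pixels in a square image with the given side length."""
    return side * side

resized = pixel_count(32)            # 1024 pixels per channel
relative_size = {
    "original 100 x 100": resized / pixel_count(100),
    "224 x 224 input": resized / pixel_count(224),
    "227 x 227 input": resized / pixel_count(227),
}
for name, ratio in relative_size.items():
    print(f"32 x 32 vs {name}: {ratio:.1%} of the pixels")
```

A 32 × 32 image thus holds roughly 10% of the pixels of the original 100 × 100 image and about 2% of the pixels of a 224 × 224 or 227 × 227 network input.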

Metrics for performance evaluation of optimal model

Because BO selects the optimal model according to the minimum error observed on the validation set, the optimal model may overfit the validation set. Therefore, an independent testing set is used to test the performance of the optimal model, which is reported through the confusion matrix, precision, recall and F1-score. The confusion matrix is a table layout used to describe and visualize the performance of a trained model on a testing set37.
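
For reference, the per-class metrics can be derived from a confusion matrix as in the following Python sketch. The layout (rows as true classes, columns as predicted classes) is an assumption, and the 3-class counts are made up for illustration; they are not results from this study.

```python
def per_class_metrics(confusion):
    """Precision, recall and F1-score per class from a square confusion matrix.

    Rows are taken as true classes and columns as predicted classes, which is
    an assumption about the layout.
    """
    n = len(confusion)
    metrics = []
    for c in range(n):
        tp = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n))   # column total
        actual = sum(confusion[c])                            # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics

def accuracy(confusion):
    """Share of all samples that lie on the diagonal."""
    correct = sum(confusion[c][c] for c in range(len(confusion)))
    return correct / sum(sum(row) for row in confusion)

# Made-up 3-class counts, not results from this study.
toy_confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 1, 9],
]
toy_metrics = per_class_metrics(toy_confusion)
toy_accuracy = accuracy(toy_confusion)
```

Precision penalizes images of other classes that are assigned to a class, recall penalizes images of the class that are missed, and the F1-score is their harmonic mean.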

Visualization methods

Deep learning models are often perceived as opaque “black boxes”. Although their remarkable performance is undeniable, their lack of interpretability poses a challenge for widespread adoption, particularly in fields like food science. Hence, this study employs three visualization techniques to elucidate the inner workings of the optimal models obtained through BO, enhancing their interpretability and credibility in food applications.

Feature visualization

Features are generally the physical characteristics of an object that can be used to distinguish it from other objects. A fruit has many physical characteristics, including color, texture, shape, and size, which are used by traditional machine learning methods based on fixed features for recognition and classification tasks, such as detecting the defects or maturity of fruits38,39. However, classifiers based on such fixed, simple features are neither robust nor suitable for complex tasks, because fruits have many inter-class and intra-class similarities and variations; inter-class similarities and intra-class variations in particular pose significant challenges10,40. In contrast to these methods, CNNs are able to learn and integrate features from training images automatically and use them for classification tasks41. Specifically, the convolutional layers act as feature extractors for the input images, whose dimensionality is subsequently reduced by the pooling layers, and the fully connected layers act as classifiers42,43.

Strongest activations

The strongest activations are presented in order to observe and compare how the optimal models recognize pears. In the strongest activation images, strong positive activation is shown by white pixels and strong negative activation by black pixels43. We focus on the white areas of the images, as they indicate the areas recognized by the optimal models44. To elucidate the discriminative features learned by the models, one pear image was randomly selected from the testing set, and the corresponding series of images was fed into the trained model one by one. The strongest activations of the last convolutional layer were then displayed11. This approach offers insight into the regions of interest within the input images that contribute significantly to the model’s classification decisions.
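
One common way to obtain such an image, given here as an assumption rather than as the procedure used in the study, is to pick the channel of the layer output that contains the largest activation and rescale it so that its maximum maps to white and its minimum to black. The function names and the 2-channel toy output below are hypothetical.

```python
def strongest_channel(activations):
    """Index of the channel containing the largest activation value.

    activations is a list of channels, each a 2-D list of floats.
    """
    peaks = [max(max(row) for row in channel) for channel in activations]
    return peaks.index(max(peaks))

def to_grayscale(channel):
    """Rescale one channel to [0, 1]: maximum -> white (1), minimum -> black (0)."""
    values = [v for row in channel for v in row]
    low, high = min(values), max(values)
    span = (high - low) or 1.0            # avoid division by zero on flat maps
    return [[(v - low) / span for v in row] for row in channel]

# Hypothetical 2-channel, 2 x 2 output of a last convolutional layer.
activations = [
    [[0.1, -0.3], [0.2, 0.0]],
    [[-1.0, 0.5], [3.0, 1.0]],
]
index = strongest_channel(activations)
image = to_grayscale(activations[index])
```

In this toy case the second channel is selected, and its largest response becomes the single white pixel of the rescaled map.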

LIME

LIME typically uses simple, more interpretable models (e.g. a linear model or a decision tree) to approximate locally the predictions of the target black-box model. It was applied here to determine how the optimal models make classification decisions on pears, in order to further improve the interpretability of the models45,46.

To provide interpretable insights at the instance level, an image of a pear was randomly selected from the testing set and fed into the trained model, and the corresponding LIME (local interpretable model-agnostic explanations) image was generated. This method facilitates the understanding of how the model makes decisions by highlighting the regions of the input image that most influence the model’s output, thereby enhancing interpretability and trustworthiness.
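
The core idea of LIME can be shown with a deliberately simplified Python sketch: image segments are switched on and off at random, the black-box score of each perturbed image is recorded, and a weighted linear model is fitted so that its coefficients rank the segments by importance. Everything specific here is an assumption for illustration: the four fixed segments, the distance measure and kernel width, the ridge term, and toy_model, which stands in for a trained CNN. Practical LIME implementations also segment the image into superpixels, which is omitted.

```python
import math
import random

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x

def lime_weights(predict, n_segments, n_samples=200, width=0.25, ridge=1e-6, seed=0):
    """Fit a locally weighted linear surrogate; return one weight per segment."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        mask = [rng.randint(0, 1) for _ in range(n_segments)]   # 1 = segment kept
        distance = 1.0 - sum(mask) / n_segments                 # share of hidden segments
        rows.append(mask + [1.0])                                # last column: intercept
        targets.append(predict(mask))
        weights.append(math.exp(-(distance ** 2) / width ** 2))
    d = n_segments + 1
    # Normal equations of weighted ridge regression: (X'WX + ridge*I) beta = X'Wy.
    gram = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows)) + (ridge if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    moment = [sum(w * r[i] * t for w, r, t in zip(weights, rows, targets)) for i in range(d)]
    return solve_linear(gram, moment)[:n_segments]

def toy_model(mask):
    """Hypothetical black box: the score depends mostly on segment 2."""
    return 0.1 + 0.7 * mask[2] + 0.1 * mask[0]

segment_weights = lime_weights(toy_model, n_segments=4)
top_segment = max(range(4), key=lambda s: segment_weights[s])
```

The segments with the largest coefficients are the ones that would be highlighted in a LIME image; in this toy case the surrogate recovers segment 2 as the dominant region.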

Computer configuration and model hyperparameters

BO for deep learning was implemented in MATLAB R2023b, with all experiments run on the same personal desktop with one Intel(R) Core i9-13900KF CPU, two 32 GB RAM modules and one NVIDIA® GeForce RTX 4090 GPU, and the networks were trained by SGD with momentum. The hyperparameters of deep learning based on BO comprise both fixed values and values within specified search ranges. The fixed hyperparameters were: learn rate drop factor = 0.1, learn rate drop period = 40, minibatch size = 256, and max epochs = 60. In addition, the objective function was evaluated 30 times to better exploit BO.
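
The fixed drop factor and drop period imply a piecewise-constant learning rate schedule. The Python sketch below assumes that the rate is multiplied by the drop factor after every full drop period of epochs and that epochs are counted from 1; the initial rate of 0.1 is only an example value from the search range.

```python
def learn_rate(epoch, initial_rate, drop_factor=0.1, drop_period=40):
    """Piecewise-constant learning rate for a 1-based epoch index."""
    return initial_rate * drop_factor ** ((epoch - 1) // drop_period)

MAX_EPOCHS = 60
schedule = [learn_rate(epoch, initial_rate=0.1) for epoch in range(1, MAX_EPOCHS + 1)]
```

Under this reading, a run of 60 epochs contains a single drop: the first 40 epochs use the initial rate and the last 20 use one tenth of it.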

Results and discussion

Performance of the optimal models

Tables 2 and 3 present the results of the BO process for hyperparameter tuning of the deep learning model. Each row represents one iteration of the BO process. This type of table is useful for tracking the progress of hyperparameter optimization, observing how changes in hyperparameters affect model performance, and determining the best combination of hyperparameters. Based on these two tables, the optimal models and their hyperparameters under the two dataset configurations were determined. The optimal models trained using Dataset A and Dataset B are referred to as model A and model B, respectively.

Figures 3 and 4 show the progression of the minimum objective value over the number of function evaluations during the BO process. These graphs help visualize the effectiveness of BO, specifically how quickly it converges to a near-optimal solution and how stable its estimates are throughout the process. The X-axis is the number of function evaluations, i.e. how many hyperparameter configurations the optimization algorithm has tested. The Y-axis is the minimum objective value, i.e. the best value of the performance metric achieved up to each evaluation; lower values indicate better model performance. The blue line (Min observed objective) tracks the lowest error actually observed up to each evaluation. This line can only decrease or remain flat, indicating that the optimization process is either finding better solutions or retaining the best solution found so far. The green line (Estimated min objective) represents the minimum objective value that the Bayesian model estimates from the data gathered in previous evaluations. It provides a prediction of the minimum error that can be achieved with the given hyperparameters. The fluctuations in this line reflect BO’s exploration of the hyperparameter space and its estimate of where the lowest error might be found. Together, these graphs illustrate the relationship between the observed performance of the deep learning model and the performance predicted by the Bayesian model.
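
The minimum observed objective is simply a running minimum over the sequence of validation errors, which is why that curve never rises. A short Python sketch with made-up error values (not taken from Tables 2 or 3) makes this explicit.

```python
from itertools import accumulate

def min_observed_trace(objective_values):
    """Best (lowest) objective value seen up to and including each evaluation."""
    return list(accumulate(objective_values, min))

# Made-up validation errors for six evaluations, not values from Tables 2 or 3.
observed = [0.30, 0.42, 0.25, 0.25, 0.31, 0.12]
trace = min_observed_trace(observed)
```

A flat stretch of the trace means that the evaluations in that stretch did not improve on the best configuration found so far.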

Specifically, in Fig. 3, during evaluations 0–9 the estimated min objective fluctuates significantly, which is expected because the optimization process explores the hyperparameter space in the initial stage. The min observed objective (blue line) remains relatively flat, suggesting that the optimization process did not find significantly better hyperparameters during this period. During evaluations 9–18, the blue and green lines behave similarly, and both decrease at the beginning. The green line then shows several ups and downs, indicating continued exploration and refinement of the hyperparameter space, while the blue line remains flat, indicating that the hyperparameters of the tenth evaluation were the best found during this period. During evaluations 18–30, the estimated min objective drops significantly, and the min observed objective also decreases and subsequently remains stable, reflecting an actual improvement in the model’s performance. Overall, Fig. 3 indicates a successful BO process that progressively improved the model’s performance on the validation set and finally yielded the optimal model A.

In Fig. 4, the changes in the green and blue lines are simpler than those in Fig. 3. After both lines declined rapidly during evaluations 0–3, the green line entered a fluctuating state, while the blue line remained flat until the end of the evaluations. The green line’s fluctuations show BO’s exploration and refinement process as it tries to estimate and find even better hyperparameters. However, the lack of significant improvement in the blue line indicates that the process is primarily exploring around a local optimum. In other words, the BO process quickly identified a set of hyperparameters that significantly reduced the model error on the validation set and did not find significantly better hyperparameters in subsequent evaluations. This stability is a positive sign and indicates consistent performance. Overall, Fig. 4 also indicates a successful optimization process in which a high-quality set of hyperparameters was found early, with subsequent evaluations confirming its effectiveness, finally yielding the optimal model B.

Table 2 The results from the BO process for hyperparameter tuning in a deep learning model on dataset A.
Fig. 3

Minimum objective versus Number of function evaluations on dataset A.

Table 3 The results from the BO process for hyperparameter tuning in a deep learning model on dataset B.
Fig. 4

Minimum objective versus Number of function evaluations on dataset B.

The confusion matrices of the two optimal models for the testing sets of Dataset A and Dataset B are shown in Figs. 5 and 6, respectively. Correct predictions for each category are located on the diagonal of the confusion matrix and marked in blue, while incorrect predictions are marked in pink. In Fig. 5, the optimal model (model A) trained using the training set of Dataset A (30% of which is validation data) achieves the higher accuracy, 97.29%. In contrast, in Fig. 6, the optimal model (model B) trained with the training set of Dataset B (70% of which is validation data) shows more misclassifications across all categories, resulting in an accuracy of 90.39%. The precision, recall and F1-score of the optimal models A and B are shown in Table 4. Figures 5 and 6 and Table 4 indicate that the dataset configuration has a substantial impact on the classification results of Bayesian optimized deep learning models. On the one hand, relatively more training data and less test data help to improve the overall accuracy of the optimal model. The ratios of training to test data used by model A and model B are 21:10 and 1:10, respectively. In real applications, the test set is often infinitely large47, so Dataset B is more consistent with real applications. On the other hand, a lot of noise was added to the data. As shown in Fig. 2, when the noise intensity is high, even the human eye cannot recognize that the picture contains a pear. Even so, model B still achieved an overall accuracy of more than 90%. In addition, we tested deep learning based on BO using the original pear data in “Fruits-360” (with all parameters unchanged and the validation set accounting for 20% of the training set), and the optimal model accuracy was 100%. We also tested different proportions of the validation set within the training set, ranging from 10 to 80%.
As shown in Table 5, on Dataset A, the optimal model’s accuracy fluctuated between 95.68% and 97.29%, while on Dataset B, the optimal model’s accuracy varied between 86.30% and 90.39%. This indicates that in this study, the more training data available, the less the optimal model is affected by the size of the validation set.
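
One reading of the data configuration that reproduces the 21:10 and 1:10 ratios quoted above is sketched below in Python. It assumes that every selected image yields one copy per noise level (11 levels for means from 0 to 1 in steps of 0.1) and that the validation images are taken out of the training pool before the ratio is computed. Both points are assumptions, and the function name is hypothetical.

```python
from fractions import Fraction

NOISE_LEVELS = 11                        # means 0, 0.1, ..., 1.0 (assumed one copy per level)
LARGE_POOL = 300 * NOISE_LEVELS          # 3300 images per class
SMALL_POOL = 100 * NOISE_LEVELS          # 1100 images per class

def train_test_ratio(train_pool, test_pool, validation_percent):
    """Ratio of images left for training (after the validation split) to test images."""
    training = train_pool * (100 - validation_percent) // 100
    return Fraction(training, test_pool)

# Dataset A: large pool for training (30% held out for validation), small pool for testing.
ratio_a = train_test_ratio(LARGE_POOL, SMALL_POOL, validation_percent=30)
# Dataset B: the pools are swapped and 70% of the training pool is held out.
ratio_b = train_test_ratio(SMALL_POOL, LARGE_POOL, validation_percent=70)
```

Under these assumptions, model A is trained on 2310 images per class and tested on 1100 (21:10), while model B is trained on 330 images per class and tested on 3300 (1:10).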

Fig. 5

Confusion matrix of the optimal model A for the test set of Dataset A.

Fig. 6

Confusion matrix of the optimal model B for the test set of Dataset B.

Table 4 Precision, recall and F1-score of the optimal models A and B.
Table 5 Different proportions of the validation set in the training set and the corresponding optimal model accuracy.

Model interpretability analysis

Feature visualization

In this study, feature visualization of the last fully connected layer of the two optimal models was used to explain how the optimal models obtained through BO under different dataset configurations build their understanding of pear images, i.e. the common, high-level features of each type of pear that the optimal models learned from the training set48, as shown in Fig. 7.

The feature visualization images of the two optimal models clearly exhibit distinct patterns or styles, even for the same class of pear. This suggests that different models interpret the same class of pear in different ways. Additionally, although some classes of pears are very similar in appearance, the corresponding feature visualization images generated by the models remain distinct. This indicates that the two optimal models have successfully learned the true differences between classes of pears41. For example, the first (Pear) and sixth (Pear Monster) types, as well as the fourth (Pear Forelle) and seventh (Pear Red) types in Fig. 1, have certain similarities in shape, size and color, but their corresponding feature visualization images are obviously different, which is similar to our previous research results11. Furthermore, the color of the sixth category in the two feature visualization images is significantly brighter than that of the first category, and the seventh category is more reddish than the fourth, which is consistent to a certain extent with the color of these pears.

In addition, the two optimal models are both series networks but differ in depth: model A has 34 layers and model B has 25 layers. The feature visualization images of the two models are abstract and difficult to understand, and those of model A appear more abstract and intricate than those of model B. This phenomenon is due to the different depths of the models. CNNs typically build their understanding of images hierarchically over many layers: earlier layers learn basic, low-level features such as colours, edges, textures, or shapes, and later layers integrate these simple features into increasingly complex and abstract features such as patterns, parts or objects. The last fully connected layer thus learns the high-level features of each class, which are used for classification, but these high-level features are sometimes too abstract to be interpreted45,49. Since deeper layers can learn combinations of the features learned by previous layers, a deeper model, which has more convolutional layers, can extract more advanced and complex features than a relatively shallow model11,50.

Fig. 7

Feature visualization of the last fully connected layer of the optimal models (1. Pear, 2. Pear 2, 3. Pear Abate, 4. Pear Forelle, 5. Pear Kaiser, 6. Pear Monster, 7. Pear Red, 8. Pear Stone, and 9. Pear Williams).

Strongest activations

Figure 8 shows a series of images of the same type of pear, randomly selected from the different test sets and containing different levels of added noise, together with the corresponding strongest activations of the last convolutional layer of the two optimal models. In previous research11, we revealed how a CNN-based model classifies different fruits; those results suggested that models with different frameworks and depths recognize fruits in different ways. In this study, we further examine how Gaussian white noise in the images affects the strongest activation of the last convolutional layer of each optimal model, and analyze patterns and differences in how the models respond to noise.

Specifically, the strongest activation of Model A is more consistent across images with different noise levels than that of Model B, indicating that Model A has better feature extraction capability and detects features more reliably in noisy images. Model B, by contrast, requires higher-quality, less noisy images to remain effective, which is consistent with the accuracies of the two optimal models. Even so, this does not mean that Model B is unacceptable in practical applications, because real-world images do not always have a low signal-to-noise ratio. For example, when the Gaussian white noise intensity exceeds 0.5, it becomes challenging for the human eye to discern the image content. In addition, Model B achieved an accuracy greater than 90% on the challenging test set.
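To make the notion of noise level concrete, the following is a minimal Python sketch of adding zero-mean Gaussian white noise to an image and estimating the resulting signal-to-noise ratio. It is an illustration only and not the procedure used in this study: it assumes that pixel values are scaled to [0, 1] and that the noise "intensity" corresponds to the noise variance, and the function names `add_gaussian_white_noise` and `snr_db` are hypothetical.

```python
import numpy as np

def add_gaussian_white_noise(image, variance, mean=0.0, rng=None):
    """Add independent Gaussian noise to every pixel of an image in [0, 1].

    Assumption: `variance` plays the role of the noise intensity in the text.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(mean, np.sqrt(variance), size=image.shape)
    # Clip so that the noisy image stays in the valid intensity range
    return np.clip(image + noise, 0.0, 1.0)

def snr_db(clean, noisy):
    """Signal-to-noise ratio in decibels of `noisy` relative to `clean`."""
    signal_power = np.mean(clean ** 2)
    noise_power = np.mean((noisy - clean) ** 2)
    return 10.0 * np.log10(signal_power / noise_power)
```

Under these assumptions, a larger variance yields a lower signal-to-noise ratio, which is the regime in which the two models are compared above.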

Fig. 8

Images of pears with different Gaussian white noise added and their corresponding strongest activations in the last convolutional layer of the two optimal models.
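As an illustration of what a strongest activation is, the sketch below selects, from the output of a convolutional layer, the channel with the largest peak response and rescales it to [0, 1] for display. It is a simplified sketch and not the code used in this study: the activation array is assumed to have shape (height, width, channels), and the function name `strongest_activation` is hypothetical.

```python
import numpy as np

def strongest_activation(activations):
    """Return the index and normalized map of the most strongly activated channel.

    Assumption: `activations` has shape (height, width, channels).
    """
    # Peak response of each channel over all spatial positions
    peaks = activations.reshape(-1, activations.shape[-1]).max(axis=0)
    idx = int(np.argmax(peaks))
    channel = activations[..., idx].astype(float)
    # Min-max normalization so the map can be shown as a grayscale image
    span = channel.max() - channel.min()
    if span > 0:
        channel = (channel - channel.min()) / span
    else:
        channel = np.zeros_like(channel)
    return idx, channel
```

Bright positions in the returned map indicate where in the input image the selected channel responds most strongly.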

LIME

Figure 9 shows the feature importance maps of Models A and B as determined by LIME. The first column shows the classification results of the two optimal models on pear images randomly selected from the different test sets, with the three categories that received the highest classification probabilities displayed at the top of each image. The second column shows the image regions that the model used for classification, and the third column shows the most important features identified by each model50. For instance, in row 1, column 1, Model A classified the pear image as Class 7 (Pear Red) with a probability of 100%, and as Class 4 (Pear Forelle) and Class 5 (Pear Kaiser) with a probability of 0. In row 1, column 2, the feature map shows which regions of the image were important for classification as Pear Red (Class 7). According to the color bar, the red regions have high importance; that is, Model A focuses on the lower part of the pear to predict Class 7, and the prediction accuracy decreases when these regions are removed46. Row 1, column 3 is a masked image whose visible regions are the most important features identified by Model A; they correspond to the important regions in the row 1, column 2 image.

Compared with the LIME map of Model A, that of Model B has fewer important (warm-toned) areas. In the top-4 feature images, the features of Model B overlap less with the pear than those of Model A. We can therefore infer that Model A might be preferred if the application requires high accuracy and detail (e.g., quality control in pear processing). For applications that need faster processing with reasonable accuracy (e.g., tasks with less noisy images), Model B could be more suitable: its accuracy of 90.39% is lower than Model A’s 97.29%, but it has fewer layers.
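The idea behind the LIME maps discussed above can be sketched in a few lines of Python. The sketch is a simplified illustration and not the implementation used in this study: it replaces superpixel segmentation with a regular grid, hides segments by setting them to black, and weights samples with an exponential kernel of width 0.25. The function name `lime_importance` and the callable `predict_fn`, which is assumed to return a vector of class probabilities for an image, are hypothetical.

```python
import numpy as np

def lime_importance(image, predict_fn, class_idx, grid=4, n_samples=200, rng=None):
    """Estimate LIME-style importance of grid segments of `image` (H, W, C)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    # Assign every pixel to one of grid * grid rectangular segments
    rows = np.arange(h) * grid // h
    cols = np.arange(w) * grid // w
    segments = rows[:, None] * grid + cols[None, :]
    n_seg = grid * grid

    # Random on/off masks over segments; the first sample keeps every segment
    masks = rng.integers(0, 2, size=(n_samples, n_seg))
    masks[0] = 1
    scores = np.empty(n_samples)
    for i, mask in enumerate(masks):
        keep = mask[segments].astype(bool)                 # per-pixel mask
        perturbed = np.where(keep[..., None], image, 0.0)  # hidden parts are black
        scores[i] = predict_fn(perturbed)[class_idx]

    # Samples closer to the original image (more segments kept) get more weight
    distance = 1.0 - masks.mean(axis=1)
    weights = np.exp(-(distance ** 2) / 0.25 ** 2)

    # Weighted least squares fit: score ~ intercept + masks @ importance
    design = np.hstack([np.ones((n_samples, 1)), masks])
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], scores * sw, rcond=None)
    return segments, coef[1:]
```

Segments with large positive coefficients play the role of the warm-toned regions in a LIME map: hiding them lowers the predicted probability of the chosen class.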

Fig. 9

Understanding the two optimal models using the LIME technique.

Based on the above, Sect. 3.2 provides insight into the two optimal models through three visualization methods in order to explore the models’ working mechanisms in this task. Specifically, the feature visualization images show how differently the two optimal models understand pear images, while the strongest activations and LIME images show how and why the optimal models make their classification decisions. These results help us to explain model predictions and build trust in deep learning for practical applications45. In addition, they can help us to optimize BO-based deep learning and further improve the performance of the optimal models.

Conclusion

Automated and efficient fruit variety recognition and classification systems are essential in agricultural and food practices, as they can significantly reduce labor costs and enhance economic benefits throughout the fruit supply chain, from harvesting to sales. In this study, BO deep learning was employed to identify and classify noise-added images of nine pear varieties on two challenging datasets that are close to real agricultural applications. Two optimal models were identified from the plot of minimum objective versus number of function evaluations and from the table of BO hyperparameter tuning results. The important findings are as follows.

(1) Dataset configuration significantly impacts the classification accuracy of the BO optimal models: optimal model A achieved an accuracy of 97.29% on dataset A (training-to-testing ratio = 21:10), optimal model B achieved an accuracy of 90.39% on dataset B (training-to-testing ratio = 1:10), and the BO optimal model reached an accuracy of 100% on the original Fruit360 dataset (training-to-testing ratio = 12:5).

(2) Although the BO process explores the hyperparameter space based on the model error on the validation set, the proportion of the validation set within the training set has a relatively minor effect on the performance of the optimal models, especially when the training set is large. Specifically, when the proportion of the validation set within the training set was varied from 10% to 80%, the accuracy of the resulting optimal models fluctuated between 95.68% and 97.29% on dataset A, and between 86.30% and 90.39% on dataset B.

(3) Feature visualization revealed that the two optimal models understand the different pears differently, and that for certain types of pears (Pear, Pear Forelle, Pear Monster, Pear Red) color might influence the classification results. The strongest activations showed which areas of the images the two optimal models used to classify pear images with different noise levels. LIME showed the important features used by the two optimal models to make classification decisions, and the results indicated that the number of features might correlate with classification accuracy: the more warm-toned features, the higher the accuracy of the corresponding optimal model.

These results not only showcase the excellent performance of BO deep learning in classifying noisy pear images but also address challenges faced by deep learning in agricultural applications, promoting the wider application of deep learning in the food field. Our future work will focus on the following aspects: (1) increasing the variety and number of fruits used for classification, rather than being limited to a single type of fruit, with the aim of developing a general model for the fruit and even the food sector; (2) increasing the number of hyperparameters tuned by BO and integrating the data augmentation process into the training of BO deep learning models; and (3) continuing to study the interpretability of deep learning-based models to reveal the feature evolution mechanisms of black-box models, thereby enhancing the trust of users in the food sector in deep learning.