Visual feature-based multi-scale hybrid attention network for fine-grained Hawthorn varieties identification

Tan, Chaoqun; Deng, Jiale; Wu, Chunjie; Wang, Maojia; Ke, Li

doi:10.1038/s41598-025-20875-0

Download PDF

Article
Open access
Published: 22 October 2025

Visual feature-based multi-scale hybrid attention network for fine-grained Hawthorn varieties identification

Chaoqun Tan¹,
Jiale Deng²,
Chunjie Wu²,
Maojia Wang¹ &
…
Li Ke³

Scientific Reports volume 15, Article number: 36895 (2025) Cite this article

1337 Accesses
Metrics details

Subjects

Abstract

Hawthorn is a well-known economic crop widely recognized for its efficacy in cardiovascular protection and blood pressure reduction. However, accurately identifying Hawthorn varieties, which arise from diverse cultivation conditions, poses a significant challenge in species authentication. To address this challenge, we introduce a visual feature-based method for Hawthorn identification. Specifically, we propose a multi-scale hybrid deep learning model to capture and merge both local and global features of Hawthorn images. Our model incorporates shallow prior and high-level semantic information, thereby enhancing classification precision. Furthermore, to improve the model ability to recognize local details in fine-grained images, we propose a novel spatial local attention mechanism. The loss functions are designed to reduce the low-frequency features in the fine-grained image. Extensive experiments conducted on our Hawthorn dataset, as well as two public datasets, demonstrate that our model outperforms state-of-the-art methods.

Introduction

Hawthorn, a member of the Rosaceae family¹, encompasses fruit-bearing trees and shrubs. Renowned for its protective effects on cardiovascular, anti-oxidation, anti-cancer, and anti-inflammatory properties, as well as its ability to lower blood pressure and cholesterol^2,3. Hawthorn has over 10 varieties⁴, serving different purposes, the confusion surrounding these varieties has raised concerns about their impact on quality and commercial value, prompting increased public awareness⁵. Consequently, there is an urgent need for practical authentication methods to distinguish between Hawthorn varieties in real-world applications.

Testing the active ingredients, such as organic acids and flavonoids⁶, can aid in identifying the different varieties of Hawthorn. However, relying solely on physicochemical identification and biological evaluation^7,8,9,10 may be difficult to reflect the intrinsic quality, as these methods focus on limited compounds and can be time-consuming^11,12. An effective alternative gaining prominence is the use of sensory technology, including an electronic nose, electronic tongue, and electronic eyes, coupled with chemometric methods^13,14,15. While these methods have shown promise, they typically require professional equipment. A growing emphasis is on developing non-destructive and accurate identification methods for Hawthorn.

With the development of deep learning, computer vision technology is widely used in image identification^16,17,18, leveraging its capacity to discern images solely through visual information^19,20. Convolutional Neural Network (CNN) has gained widespread interest in computer vision due to its efficient feature expression capabilities. For example, Wang et al.²¹ proposed a CNN-based model to fuse features of leaves from different parts of soybean plants for cultivar recognition. Pérez et al.²² used CNN to classify and detect the five types of potential adulteration. However, a notable limitation of CNN-based models is their focus solely on local relationships^23,24. To overcome this limitation, transformer-based models are proposed to extract global information. For instance, Chang et al.¹⁹ proposed to use a vision transformer model to extract edge features for plant disease identification. Pacal²⁵ introduced an advanced transformer module to extract features from maize leaves for detecting diseases. However, one drawback of those transformer-based models is their high computational cost and poor generalization^26,27.

Moreover, some recent references^28,29,30 propose different similarity-preserving metrics and quantization methods for fine-grained image retrieval. In contrast, the OTQ method³¹ leverages cross-X semantic hypergraph learning to mine discriminative features through synergistic interactions across scales, layers, and images, thereby enhancing retrieval robustness against intra-class variations. Inspired by our observations above, we propose a rapid and effective visual feature-based model for identifying the different varieties of Hawthorn. Based on the characteristic of small inter-class differences of the fine-grained image, our model obtains and merges local details and global features from the fine-grained images by combining MBConv³² and Swin Transformer³³. Specifically, MBConv is used to extract multi-scale subtle features, while Swin Transformer is employed to extract the global information features. Then, the shallow-level feature map and high-level feature are fused by Patch-Merge layers using global and local features at different scales. Through an element-wise add operation, we combine shallow-level and high-level features. After that, we use a separated CNN layer and a fully connected layer to output the Hawthorn identification results. The framework of our model is presented in Fig. 4, and the details of our proposed method are presented in Sect. 3. On the other hand, a comprehensive visual multi-varieties Hawthorn images dataset is constructed, where high-resolution images are captured using a self-developed acquisition device, which is detailed in Sect. 2. The experimental results and analysis are shown in Sect. 4. The conclusion is drawn in Sect. 5.

Our contributions are highlighted as follows:

A novel multi-scale hybrid model is proposed to capture and merge local details and global features of fine-grained images. Through the infusion of shallow prior features to guide high-level semantic information, our method demonstrates superior accuracy in identifying varieties of Hawthorn. Remarkably, to the best of our knowledge, our approach stands as the pioneering application of deep learning technology in Hawthorn identification.
A newly spatial local attention module is designed to enhance the awareness of detailed local features within fine-grained images and improve the learning capacity. Softmax normalizes the features of a linear transformation. The loss of local information in maximum pooling is carefully reduced by highlighting the low-frequency features of the fine-grained image.
To address the absence of publicly available image datasets for different varieties of Hawthorn, a visual Hawthorn image dataset from various varieties is first constructed.
Extensive experiments are performed on our dataset as well as public datasets. The experimental results show that our model achieves the highest accuracy among state-of-the-art models. Moreover, the practical utility of our proposed model extends beyond Hawthorn identification, showcasing excellent performance in the broader context of plant identification.

Dataset collection

We collect 8 different varieties of Hawthorn, namely (A) Xiaojinxing, (B) Xinglongshisheng, (C) Hongkongqi, (D) Baiquan, (E) Damianqiu, (F) Yubeihong, (G) Dajinxing and (H) Dawuling. The collected samples illustrated in Fig. 1, are sourced from the Lotus Pond Chinese herbal medicine market in Chengdu. These samples are certified by experts from the Hospital of Chengdu University of Traditional Chinese Medicine (China), and samples encompass various slices obtained from intact specimens. Our comprehensive Hawthorn dataset comprises 3960 images, with the distribution of classes presented in Fig. 2.

The image acquisition process utilizes a wood-made device equipped with lighting and image acquisition systems. The lighting system incorporates PHILIPS Graphica TL-D light with a temperature of 5000 K, comprising four light tubes and scattering plates to eliminate shadows during the image capture. The wood-made device has a reflective grey coating with a reflectivity of 18%, as depicted in Fig. 3(A).

In the image acquisition system, a camera (Canon EOS 60D) is employed to capture high-resolution images of 5120 × 3840 × 3 pixels, as shown in Fig. 3B. YoloV4³⁴ is used to obtain the 2D bounding boxes of images. Moreover, subsequent cropping is performed based on the detected bounding box, as demonstrated in Fig. 3C. Additionally, we manually remove any incomplete, blurry, or inappropriate images. Then, data augmentation strategies can be applied to mitigate the effects of data imbalance by generating additional samples, enhancing model generalization, and improving classification performance through preprocessing transformations such as rotation, flipping, cropping, and duplicating minority class samples.

Methodology

Overview

Our model was designed to extract local and global information from input images for Hawthorn identification, as illustrated in Fig. 4. The architecture incorporates spatial local attention (SLA) and MBCovn layers to capture multi-level features deemed as local characteristics. Additionally, Swin-Transformer Blocks were employed to acquire hierarchical features through the merging and reduction of spatial resolution, serving as global features.

Subsequently, PatchMerge layers were used to fuse the local and global features by doubling the channels and halving the width and height. Two fused feature maps: F_s for shallow-level features and F_h for high-level features were obtained. Finally, the identification results were computed through fully connected layers. This approach enables adeptly handling fine-grained analysis by leveraging the flexibility of various scale features.

Local feature extraction

Spatial local attention

Our model included two sub-modules in the local feature extraction. Due to the similarity of visual features of images, SLA was proposed to focus more precisely on the color and surface details of different varieties of Hawthorn and learn more about the subtle features of images. The structure of SLA is displayed in Fig. 5.

In SLA, the maximum pooling layer is first used to contain the highest values in the local area and effectively retain the most significant features. However, other important fine-grained information (such as texture and global context) is ignored, which may lead to a decrease in the expressiveness of features. To address this issue, an average pooling layer was incorporated to produce a flat vector, enabling the extraction of global spatial context information from the feature maps.

$$\:{F}_{max}\left(x\right)=\underset{i=\text{1,2}\dots\:C}{\text{max}}X,\:\:\:{F}_{max}\left(x\right)\in\:{\mathbb{R}}^{h\times\:w\times\:1}$$

(1)

$$\:{F}_{avg}\left(x\right)=\frac{1}{C}\sum\:_{i=1}^{C}X,\:{F}_{avg}\left(x\right)\in\:{\mathbb{R}}^{h\times\:w\times\:1}$$

(2)

where X is the input feature. Then the fusion feature $\:{F}_{con}$ by concatenating the outputs of average pooling and maximum pooling according to the channel dimension.

$$\:{F}_{con}=\text{C}oncat({F}_{avg}\left(x\right),{F}_{max}\left(x\right))$$

(3)

Furthermore, compared with SPA, the SLA compresses the feature map using a 1 × 1 linear mapping, which reduces high-dimensional features into more compact low-dimensional representations while aggregating cross-channel information. It is expressed in formula 4.

$$\:C\left({X}_{i}\right)=\frac{{e}^{{Z}_{i}}}{{\sum\:}_{j=1}{e}^{{Z}_{j}}}$$

(4)

Then, Softmax is subsequently applied to generate normalized attention weights, emphasizing important channels and enhancing fine-grained features. In this way, self-attention weight can be obtained to focus more on fine-grained features. Mathematically,

$$\:{Z}^{{\prime\:}}=\sum\:_{i=1}\frac{{\:W}_{n}\times\:{X}_{i}}{C\left({X}_{i}\right)}{\times\:W}_{m}$$

(5)

Combining with the maximum pooling feature vector, the average pooling feature vector, and self-attention weight, the local feature Z’ was obtained. Where W_m is a linear transformation matrix of 1 × 1. C(X_i) is the normalization factor, $\:{Z}_{i}$is the output of the i node, and j is the number of output nodes. The self-attention weight of [0–1] is calculated by the Softmax function.

Then, the 7 × 7 Conv W_v was used to extract features, and then the spatial attention weight with channel number 1 was obtained by the Sigmoid activation function. The larger kernels have larger receptive fields and can capture local dependencies of images. The attentions were divided by smoothing the obtained attention weight. The low-frequency regions of the image were highlighted and the loss of local features in the maximum pooling was reduced. Then the weight of attention was multiplied by the connected feature vectors:

$$\:{Z}^{{\prime\:}{\prime\:}}=\sigma\:\left({Z}^{{\prime\:}}\times\:{\:W}_{v}\right)$$

(6)

where σ is the non-linear Sigmod function, Z ” represents the output weight. Finally, the sigmoid activation function was used to generate the feature attention. By multiplying the attention weight value Z” and the input X, the attention-weighted X’ can be obtained. As shown in formula 7.

$$X^{\prime } = Z^{{\prime \prime }} \times X$$

(7)

The experimental results have shown that SLA can enhance the perception of local fine-grained features while suppressing significant sensitive areas in images. Subsequently, this module was integrated into the MBConv module to enable a more accurate focus on the specific features and the correspondence.

Improved MBConv module

MBConv combined depth-wise convolution and attention mechanisms to reduce the number of parameters and improve learning ability. Inspired by the CBAM³⁵, the SLA module was added after the Squeeze and Excitation (SE) layer. In this way, it was an effective and simple attention module that stacked channel attention and spatial attention modules. The comparison of our newly designed MBConv with the original one is presented in Fig. 6.

Depth-wise convolution is first applied to each input channel by the $\:K\times\:K$ convolution kernel, expressed as follows in formula 8:

$$\:{Y}_{depth}\left(i,j,c\right)=\sum\:_{m=1}^{K}\sum\:_{n=1}^{K}X\left(i+m,j+n,c\right).{W}_{depth}(m,n,c)$$

(8)

where $\:{Y}_{depth}\in\:{\mathbb{R}}^{h\times\:w\times\:c}$ is the output feature map, $\:X\in\:{\mathbb{R}}^{h\times\:w\times\:c}$is the input feature map. $\:{W}_{depth}\in\:{\mathbb{R}}^{h\times\:w\times\:c}$ is the depth convolution kernel, $\:K$ is 3. Then, the SE and SLA modules are used to obtain more detailed information. The attention maps were multiplied by the input feature map for adaptive feature refinement. Depth-wise convolution operates on each channel independently, preserving local spatial details of the input features. Finally, the $\:1\times\:1$ convolution kernel is used to fuse the output channels of the depth-wise convolution.

$$\:Y\left(i,j,c\right)=\sum\:_{c=1}^{C}{Y}_{depth}\left(i,j,c\right).W(c,k)$$

(9)

where $\:W(c,k)\in\:{\mathbb{R}}^{1\times\:1\times\:c\times\:c}$ is the point-wise convolution kernel. Our model can learn abundant details and features to deal with the similarity of visual features. These processes were repeated from low level to high level, so that could obtain multiple local features.

Global feature extraction

In this paper, the Swin Transformer was employed to obtain global features. To enhance a greater generalization property of our model, Multi-head Self-Attention (MSA) was replaced with window-based long-head self-attention (W-MSA) and the shifted window-based long head self-attention (SW-MSA) in Swin Transformer. The input $\:X\in\:{\mathbb{R}}^{h\times\:w\times\:c}$ is divided into fixed-size windows $\:W\times\:W$. Then, it is computed by self-attention, the scaled dot-product attention is to obtain $\:Z\in\:{\mathbb{R}}^{d\times\:d}$.

$$\:Z=Attn\left(z\right)=Softmax(Q{K}^{\text{{\rm\:T}}}/\sqrt{w})V$$

(10)

the $\:Softmax$ attention $\:Attn(\bullet\:)$ with a global receptive field works as the following nonlinear mapping:

$$\:y{\prime\:}=WMSA\left(LN\left({Z}^{l-1}\right)\right)+{Z}^{l-1}$$

(11)

where $\:LN\left(\bullet\:\right)$ is the Layer Normalization that essentially is a learnable column scaling with a shift, and $\:FFN\left(\bullet\:\right)$ is a standard two-layer feedforward neural network applied to the embedding of each patch. The scaled dot-product attention (8) of $\:Z$, the jth element of its ith row $\:{Z}_{i}^{j}$ is obtained in Formula 12.

$$\:{Z}_{i}^{j}=\frac{{e}^{{(Q{K}^{\text{{\rm\:T}}}/\sqrt{w})}_{i}}}{\sum\:_{j=1}^{h}{e}^{{(Q{K}^{\text{{\rm\:T}}}/\sqrt{w})}_{ij}}}.\:V=\:Softmax({q}_{i}{K}^{\text{{\rm\:T}}}/\sqrt{w})V$$

(12)

Perform multi-head attention fusion on the attention results of each layer of windows to obtain hierarchical fusion features and enhance attention to fine-grained features. The calculation process is shown in formula 14:

$$\:{z}^{l+1}=SWMSA\left(LN\left({Z}^{l}\right)\right)+{Z}^{l}$$

(13)

The multi-scale structure, built upon the enhanced MBConv module, effectively captures features ranging from shallow details to deep semantic information and enhances classification precision.

Multi-features fusion

The cropped images, $\:{x}^{i}$ were as input to MBConv and obtain the appearance features $\:{O}_{s}$. All patches were calculated to double channels and halve the width and height by the PatchMerge layer. The extracted features were transferred into Swin Transformer, and then the shallow feature maps $\:{F}_{s}$ were generated, similarly, the higher-level features $\:{F}_{h}$ were obtained:

$$\:{F}_{h}=Swin\left(PatchMerge\left({O}_{s}\right)\right)$$

(14)

$$\:y=Conv\left({{F}_{s})+Conv(F}_{h}\right)$$

(15)

Subsequently, the features were superimposed by skipping the connection. The shallow prior features were injected to guide high-level semantic information. The local and global features of images were extracted to identify the fine-grained images.

Results and discussions

Evaluation

Training parameters

The model is optimized by the AdamW³⁶ algorithm. The initial learning rate is 1.5e-3, and the learning rate decay strategy is StepLR³⁶. The batch size is set to 16, and the final model is obtained when it reaches 400 epochs. Due to the imbalanced data, the Focalloss is used as the loss function. The code is built by using Pytorch = 1.8.1 and Python = 3.9. The model is trained on a PC (equipped with an Intel i7 processor) with a graphics processing unit card (NVIDIA 2080TI, 11G memory).

Metrics

We consider four metrics to evaluate our model identification performance, including Accuracy, Precision, Recall, Specificity, and F 1Sore:

$$\:\text{P}\text{r}\text{e}\text{c}\text{i}\text{s}\text{i}\text{o}\text{n}=\frac{TP}{TP+FP}$$

(16)

$$\:\text{R}\text{e}\text{c}\text{a}\text{l}\text{l}=\frac{TP}{TP+FN}$$

(17)

$$\:\text{S}\text{p}\text{e}\text{c}\text{i}\text{f}\text{i}\text{c}\text{i}\text{t}\text{y}=\frac{TN}{FP+TN}$$

(18)

$$\:\text{F}1\:\text{S}\text{c}\text{o}\text{r}\text{e}=\frac{2\text{*}\text{P}\text{r}\text{e}\text{c}\text{i}\text{s}\text{i}\text{o}\text{n}\text{*}\text{R}\text{e}\text{c}\text{a}\text{l}\text{l}}{\text{P}\text{r}\text{e}\text{c}\text{i}\text{s}\text{i}\text{o}\text{n}+\text{R}\text{e}\text{c}\text{a}\text{l}\text{l}}$$

(19)

where $\:TN$ represents the number of True Negative, and $\:TP$ denotes the number of True Positive. $\:FN$ indicates the number of False Negative, and $\:FP$ represents the number of False Positive³⁷.

Hawthorn identification

Results of our model

We split our dataset into a training set and a testing set. Specifically, 80% of the data are used for training, and the remaining 20% is used for testing. The change curves of loss and accuracy are used to reflect the training performance, and the testing loss and accuracy results of our model are shown in Fig. 7. During the training process, the loss gradually decreased, the model converged and the accuracy gradually increased. Until 200 epochs, the loss becomes stable. And the model achieves the highest accuracy of 90.96% in 384 epochs. The results of the testing set show that the model has good identification performance.

The experimental results are detailed in Table 1, where we can see that the recall is higher than precision, which indicates the predicted positive is much higher than the actual positive.

Table 1 The experimental results of different varieties of Hawthorn by our model.

Full size table

The identification performance for (A) Xiaojinxing (C) Hongkongqi (D) Baiquan (G) Dajinxing (H) Dawuling have better performance, benefiting from their distinct appearance information and diverse places of origin. However, for (B)Xinglongshisheng, (E) Damianqiu, and (F) Yubeihong, where visual similarities lead to confusion. This difficulty is compounded by the fact that (E) Damianqiu and (F) Yubeihong are different varieties but from the same origin. To further evaluate our model performance, we employ the confusion matrix, and the area under the Receiver Operating Characteristic Curve (ROC)³⁸ curve is calculated. The experiential results are depicted in Fig. 8a, b, respectively.

In Fig. 8a, the columns represent the predicted labels, and the rows represent the true labels. The values corresponding to rows and columns indicated the number of correct classes predicted from true data. The distribution of the actual and predicted numbers of each class is shown. Consistent with the results in Table 1, (B) Xinglongshisheng has confusion numbers with (E) Damianqiu and (G) Dajinxing, especially with (G) Dajinxing. For (E) Damianqiu, it is incorrectly predicted as (A) Xiaojinxing. And (F) Yubeihong is also incorrectly predicted as (G) Dajinxing. Our dataset has the characteristics of imbalance, (G) Dajinxing has the largest number.

Therefore, the similar visual features of different Hawthorns are analogous, the model has a stronger learning ability and learns more features for the classes with the largest classes. Similarly, based on the confusion matrix, ROC is computed to reflect the difference between the True Positive Rate and the False Positive Rate. It measures the ability of our model to correctly identify certain categories in Fig. 8b. Among them, (B)Xinglongshisheng, (E) Damianqiu, and (F) Yubeihong have the lower accuracy. It also reaffirms the credibility of the identification results generated by our proposed model.

Manual identification

To evaluate the speed and accuracy of our method, we compare our model with the manual identification results from five experts. These experts, affiliated with the pharmacy department of the Hospital of Chengdu University of Traditional Chinese Medicine, perform manual identification solely by visual features. The accuracy for each class is calculated from the identification of 346 images, taking the average of the results from the 5 experts.

Table 2 The results of manual identification by five experts.

Full size table

Table 2 displays the outcomes, revealing an overall accuracy of 65.24% for manual identification. Notably, our model surpasses this performance, demonstrating higher accuracy for each class. In contrast to manual identification, our model showcases superior speed and remarkable accuracy, emphasizing its substantial value and potential for practical applications.

Ablation study

Different attention modules

To verify the effectiveness of the SLA module, the SPA³⁵ in CBAM and the Global Context Attention Module (GC)³⁹ are compared. In fairness, the baseline is based on the improved model, and the remaining parameters remain unchanged. The classification accuracy is shown in Table 3. The comparative loss and accuracy results of multiple attention modules on the testing set are shown in Fig. 9.

Table 3 Ablation study results for attention mechanism modules.

Full size table

From Fig. 9, we can see SLA has the best advantages in improving performance. From Table 3, SLA has the best accuracy of 90.96%, which is superior to the others. Compared with SPA, SLA has an increase of 1.64%. Simultaneously, compared with no additional attention module, it can be confirmed that the performance of the model can be significantly improved. Compared with GC, although the computational complexity of this model is lower, the identification results are limited, and our SLA still has the best performance. Moreover, the parameters have not changed much, and SLA has good practicability. To better verify the effect of the two attentions, further, the visualized heat maps are depicted in Fig. 10.

The difference in regions of interest (ROIs) under different attentions can be observed via the visualized heatmaps. The heatmaps illustrate the regions of the image that the model emphasizes when making classification decisions. Bright regions highlight areas of greater significance, where the model has identified critical features. All can focus on the ROIs of images, which demonstrate the ability of models to focus on key regions of interest, such as color variation, texture, or structural anomalies.

Compared with SPA and GC, SLA has a deeper and wider range of ROIs, such as (A) Xiaojinxing, (D) Baiquan, and (E) Damianqiu. For (A) Xiaojinxing, the model pays more attention to the texture details in pulp. Surface color features are more critical for the identification of (D) Baiquan. For (E) Damianqiu, the model pays more attention to the texture and color details, which can better improve the classification performance. For B (Xinglongshisheng) and F (Yubeihong) with lower precision, the different ROIs are focused on by different models, and the comparative advantage is not obvious. Besides, for (H) Dawuling, compared with SLA, GC pays more attention to blank redundant areas, which affect the model performance and lead to lower accuracy. And SLA concentrates on the fine-grained features of the target and enhances the classification performance. In short, SLA can learn more effective features and improve recognition accuracy. It can also be used to recognize fine-grained images by focusing on the detailed features of local areas.

Moreover, some failure results are also presented in Fig. 11. The confusion matrix indicates that (A) Xiaojinxing and (E) Damianqiu are frequently confused in Fig. 8. Corresponding heatmaps reveal that the model focuses on similar regions for these two classes. Misidentified (A) Xiaojinxing images appear to place greater emphasis on color details. For (B) Xinglongshisheng, common misclassifications are (F) Yubeihong and (G) Dajinxing, with heatmaps showing the model focusing more on blank areas during these errors. When processing (C) Hongkongqi, the model primarily attends to the core region and is also prone to confusion with (A) Xiaojinxing. The highest degree of confusion occurs between (F) Yubeihong and (G) Dajinxing, which exhibit significant visual similarity as evidenced in Fig. 11. Through analysis, we can know that the SLA can enhance local feature awareness within single images (e.g., flesh texture details), the isolated learning paradigm fails to leverage inter-sample semantic relationships. This limitation leads to confusion between visually similar classes, such as (A) Xiaojinxing and (E) Damianqiu in Fig. 8. To address these issues, we propose integrating a hypergraph attention layer⁴⁰ after SLA. Specifically, local features extracted from multiple hawthorn samples would serve as hypergraph nodes, with semantic classes defining hyperedges. This structure would explicitly model discriminative cross-sample associations, enabling feature refinement through high-order interaction propagation. Such synergy would enhance robustness against intra-class variance while preserving fine-grained locality.

Visualization SLA module

The visualization of the convolutional layer feature maps is crucial for a deeper understanding of the SLA’s effectiveness in extracting fine-grained features for different hawthorn varieties. Thus, the compared experiment that visualizes the feature maps at various layers with the SLA is designed. The detailed Grad-CAM visualizations are included to further illustrate how the SLA influences the decision-making process of models and highlights the key visual features the model relies on when distinguishing between different hawthorn varieties. The visualization results are shown in Fig. 12.

The brighter areas highlight the regions where the model’s attention is focused during classification. As shown in Fig. 11, Different convolutional layers focus on distinct regions in the heatmaps for various hawthorn species, revealing the key visual features the model considers when distinguishing between different hawthorn varieties. For instance, for (A) Xiaojinxing, the texture details in pulp are key visual features. B (Xinglongshisheng) and F (Yubeihong) have similar visual features. For (E) Damianqiu, the model pays more attention to the texture and color details, which can better improve the classification performance. The spatial localized attention (SLA) mechanism proposed in this paper demonstrates its ability to precisely capture color and surface details, enabling the learning of richer fine-grained features. This effectiveness is evident in the refined area distinctions presented in the heatmaps, underscoring the role of SLA in improving classification performance.

Different CNN modules

To verify the effect of the MBConv, ablation experiments are performed, wherein the MBConv is replaced with other CNN modules. To ensure fairness, we maintain the network structure. The accuracy comparison with various CNN modules is presented in Table 4.

Table 4 The comparative experimental results based on different CNNs.

Full size table

Notably, our original model stands out with optimal parameters and computational complexity and attains the highest accuracy. Our original model surpasses the one replaced by ResNet at least 3.69%. MBConv combines depth-wise convolution and an attention mechanism, which improves the learning ability by residual connections. Compared with the Depthwise Separable block, MBConv has a higher accuracy by 1.14%. The experimental results also verify the validity and rationality of the MBconv module.

Horizontal comparison and performance trade-offs

Comparative experimental results

For horizontal comparison, state-of-the-art ConvNet models and multiple different Transformer models are compared with ours. And FocalLoss is chosen as the loss function for all models. The experimental results for the multiple ConvNet models are shown in Table 5; Fig. 3.

Table 5 The results of the comparison for the different ConvNets.

Full size table

Simultaneously, the value of Parameter (Params) and Floating-point operations per second (FLOPs) are used to reflect the computational complexity and number of parameters of the model. And the Frames Per Second (FPS) are computed to evaluate the model speed. Our model achieves the highest accuracy among the compared models. It is 3.69% higher than DenseNet and 8.95% higher than EfficientNet. From Fig. 12, we can see (F) yubeihong and (G) dajinxing are easily confused. The lower the accuracy of the model, the greater the amount of confusion for different classes. In addition, from the parameters and complexity, the basic convolution has many parameters, high computational complexity, and limited classification accuracy. Compared with the lightweight network, the parameters of models are greatly reduced, and the accuracy is also improved. It also provided a theoretical foundation for the selection of the MBConv module. Simultaneously, multiple state-of-the-art Transformer algorithms are compared, including ViT⁴⁶, FocalNet⁴⁷, Swin Transformer³³, CMT⁴⁸, CvT⁴⁹, PVT⁵⁰, MaxViT⁵¹, EfficientViT⁵², and SwinFG⁵³.

The experimental results are shown in Table 6; Fig. 14, where we can find that our model has the best identification accuracy among state-of-the-art Transformer models. Interestingly, Transformer-based models have lower accuracy than based-ConvNets models. Transformers require a large amount of data to improve their learning ability, and the ConvNets-based models have a strong priority by inductive bias.

Table 6 The experimental results for multiple different Transformers.

Full size table

Hence, our model has good generalization and high accuracy in multi-scale structures. It is 20.66% higher than ViT, 5.29% higher than Swin Transformer, and 17.2% higher than CvT. Our model has fewer parameters than Transformer-based models such as FocalNet, MaxViT, and SwinFG, yet it achieves the highest accuracy. This demonstrates that our model effectively balances computational efficiency and classification performance. Although the FPS of our model is not the fastest, the experimental results show that the hierarchical network is more advantageous for classifying fine-grained images. By fusing multi-level features of images, it can learn more effective subtle details and achieve visual fine classification for the different varieties of Hawthorn.

Visualization comparison results

To better reflect the classification performance of our model, the visualization of different models is intuitively shown in Fig. 15.

According to the comparative experimental results of Tables 5 and 6, we randomly selected the visualization results of DenseNet, MobileNet, and Swin Transformer. From Fig. 14, traditional models focus more on more scattered areas, indicating that the model has limited ability to capture key features. Traditional models tend to focus on irrelevant regions, such as the background or secondary areas, indicating a limited ability to capture key features. In contrast, the heat maps generated by the model proposed in this paper show concentrated attention on critical areas, such as the color, surface texture, and flesh morphology of hawthorn. This demonstrates the heightened sensitivity of the model to specific, meaningful features essential for classification. In addition, similar attention distribution is shown in the heatmaps of different varieties for traditional models. The focus on differential features enhances the interpretability of our model. Notably, our model demonstrates a more precise focus on fine-grained features, highlighting its superior ability to capture critical details for classification.

Blossoms and leaf and fruit identification

To validate the generalization of our method, additional experiments are conducted on two publicly available herb datasets, with comparisons made against other methods. The two herb datasets are Chinese Medicinal Blossom⁵⁴, Medicinal Leaf⁵⁵, and Fruit dataset⁵⁶, and excerpts of these datasets are illustrated in Figs. 16 and 17, and Fig. 18. The Chinese Medicinal Blossom dataset comprises 12 different types of Chinese medicinal flowers, comprising a total of 11,500 images. The Medicinal Leaf dataset contains 30 types of healthy herbal leaves, comprising a total of 6,500 images. The Fruit dataset includes 45 different types of fruit, comprising a total of 21,668 images.

Table 7 Comparative analysis of classification results on Chinese medicinal blossom, Chinese medicinal leaf, and fruit datasets.

Full size table

Chinese medicinal blossom We compare our method with other state-of-the-art methods on the Chinese medicinal blossom dataset, the comparison results are shown in Table 7. From the table, we can see that our model exhibits the highest accuracy among other mainstream methods. Our proposed model focuses more on the fine-grained features, while the fusion of local features and global information enhances model performance and robustness.

Chinese medicinal leaf The comparison results of different mainstream methods and our method on the public Chinese Medicinal Leaf dataset are shown in Table 7. The medicinal Leaf dataset comprises 30 medicinal plants that exhibit subtle visual differences. From Table 7, the conventional CNN showcases limited classification effectiveness. Conversely, our demonstrates conspicuous advantages. Its commendable performance across various datasets underscores its robust generalization capability.

Fruit dataset The comparison results of different mainstream methods and our method on the public Fruit dataset are shown in Table 7. The fruit dataset comprises 45 different fruits. As shown in Table 7, CNNs exhibit limited classification performance, and Transformers have lower accuracy. In contrast, our approach demonstrates significant advantages, with consistently strong results across diverse datasets, highlighting its superior generalization capability.

Conclusion

Considering the impact of different cultivation conditions and areas on the varieties of Hawthorn, hence, was an indispensable need for species identification. With the development of deep learning, a novel supervised learning hybrid model was designed to better capture the local and global information of fine-grained images via injecting shallow prior knowledge to guide high-level semantic features. On this basis, a novel SLA was proposed to highlight the low-frequency features of the image by calculating the self-attention weight of spatial dimension. It can enhance the awareness of detailed local features and improve the learning capacity. Extensive experiments were performed on our dataset as well as the public dataset. Compared with different state-of-the-art ConvNets and Transformer models, ours have the best accuracy of 90.96% and significantly outperforms others. Furthermore, the effectiveness and rationality of the model have also been proved by the different ablation experiments. Hence, it has been confirmed that our model was preferable to identify the varieties of Hawthorn. It has also been proven to be an exceptional practice for the high efficiency of agriculture and plant technology.

However, some limitations to the identification still exist. Due to the small and unbalanced dataset, moreover, the similarity of visual features of images, the learning ability of the model for subtle features has still been lacking, such as the low accuracy of (B) xinglongshisheng and (F) yubeihong. Thus, while the amount of dataset was gradually increased in the follow-up, the identification of the subtle differences between similar classes was researched to improve the identification accuracy of Hawthorn. Moreover, we will explore generative data augmentation methods to address the issue of imbalanced data distribution, such as Denoising Diffusion Probabilistic Models (DDPM) and data resampling, to optimize data distribution.

Data availability

The datasets used during the current study are available from the corresponding author on reasonable request.

References

Guo, Q., Du, J., Jiang, Y., Goff, H. D. & Cui, S. W. Pectic polysaccharides from hawthorn: physicochemical and partial structural characterization. Food Hydrocoll. 90, 146–153 (2019).
Article CAS Google Scholar
Wu, J., Peng, W., Qin, R. & Zhou, H. Crataegus pinnatifida: chemical constituents, pharmacology, and potential applications. Molecules 19 (2), 1685–1712 (2014).
Article PubMed PubMed Central Google Scholar
Liu, C. et al. Digestion-promoting effects and mechanisms of Dashanzha pill based on Raw and charred crataegi fructus. Chem. Biodivers. 18 (12), 2100705 (2021).
Article Google Scholar
Xue, Q. et al. Profiling and analysis of multiple constituents in crataegi fructus before and after processing by ultrahigh-performance liquid chromatography quadrupole time-of-flight mass spectrometry. Rapid Commun. Mass Spectrom. 35 (7), 9033 (2021).
Article Google Scholar
Li, L. et al. Hawthorn pectin: Extraction, function and utilization. Curr. Res. Food Sci. 4, 429–435 (2021).
Article CAS PubMed PubMed Central Google Scholar
Wu, M. et al. Roles and mechanisms of Hawthorn and its extracts on atherosclerosis: a review. Front. Pharmacol. 11, 118 (2020).
Article CAS PubMed PubMed Central Google Scholar
Yin, F. et al. Quality control of processed crataegi fructus and its medicinal parts by ultra high performance liquid chromatography with electrospray ionization tandem mass spectrometry. J. Sep. Sci. 38 (15), 2630–2639 (2015).
Article CAS PubMed Google Scholar
Yu, J., Guo, M., Jiang, W., Dao, Y. & Pang, X. Illumina-based analysis yields new insights into the fungal contamination associated with the processed products of crataegi fructus. Front. Nutr. 9, 883698 (2022).
Article PubMed PubMed Central Google Scholar
Qin, R. et al. The combination of Catechin and epicatechin gallate from fructus crataegi potentiates β-lactamantibiotics against methicillin-resistant Staphylococcus aureus (mrsa) in vitro and in vivo. Int. J. Mol. Sci. 14 (1), 1802–1821 (2013).
Article CAS PubMed PubMed Central Google Scholar
Fei, C. et al. Qual- Ity evaluation of Raw and processed crataegi fructus by color measurement and fingerprint analysis. J. Sep. Sci. 41 (2), 582–589 (2018).
Article CAS PubMed Google Scholar
Lee, J. J., Lee, H. J. & Oh, S. W. Antiobesity effects of Sansa (crataegi fructus) on 3t3-l1 cells and on high-fat–high-cholesterol diet-induced obese rats. J. Med. Food. 20 (1), 19–29 (2017).
Article CAS PubMed Google Scholar
Wang, T. et al. Effect of the fermentation broth of the mixture of pueraria lobata, lonicera japonica, and Crataegus pinnatifida by Lactobacillus rhamnosus 217-1 on liver health and intesti- Nal flora in mice with alcoholic liver disease induced by liquor. Front. Microbiol. 12, 722171 (2021).
Article PubMed PubMed Central Google Scholar
Wang, T. et al. An e-nose and convolu- Tion neural network-based recognition method for processed products of crataegi fructus. Comb. Chem. High Throughput Screen. 24 (7), 921–932 (2021).
Article MathSciNet CAS PubMed Google Scholar
Fei, C. et al. Identification of the Raw and processed crataegi fructus based on the electronic nose coupled with chemometric methods. Sci. Rep. 11 (1), 1849 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Yang, S. et al. A novel method for rapid discrimination of bulbus of fritillaria by using electronic nose and electronic tongue technology. Anal. Methods. 7 (3), 943–952 (2015).
Article Google Scholar
Wang, X., Zhang, S. & Zhang, T. Crop insect pest detection based on dilated multi-scale attention u-net. Plant. Methods. 20 (1), 34 (2024).
Article PubMed PubMed Central Google Scholar
Theiß, M., Steier, A., Rascher, U. & Mu¨ller-Linow, M. Completing the picture of field-grown cereal crops: a new method for detailed leaf surface models in wheat. Plant. Methods. 20 (1), 21 (2024).
Article PubMed PubMed Central Google Scholar
Wang, L. et al. Small-and medium-sized rice fields identification in hilly areas using all available sentinel-1/2 images. Plant. Methods. 20 (1), 25 (2024).
Article CAS PubMed PubMed Central Google Scholar
Chang, B., Wang, Y., Zhao, X., Li, G. & Yuan, P. A general-purpose edge-feature guidance module to enhance vision Transformers for plant disease identification. Expert Syst. Appl. 237, 121638 (2024).
Article Google Scholar
Wang, Y. et al. Application of hyperspectral imaging assisted with integrated deep learning approaches in identifying geographical origins and predicting nutrient contents of Coix seeds. Food Chem. 404, 134503 (2023).
Article CAS PubMed Google Scholar
Wang, B. et al. Fusing deep learning features of triplet leaf image patterns to boost soybean cultivar identification. Comput. Electron. Agric. 197, 106914 (2022).
Article Google Scholar
Lu, T., Yu, F., Xue, C. & Han, B. Identification, classification, and quantification of three physical mechanisms in oil-in-water emulsions using Alexnet with transfer learning. J. Food Eng. 288, 110220 (2021).
Article Google Scholar
Han, K. et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 45 (1), 87–110 (2022).
Article ADS PubMed Google Scholar
Bazi, Y., Bashmal, L., Rahhal, M. M. A., Dayil, R. A. & Ajlan, N. A. Vision trans- formers for remote sensing image classification. Remote Sens. 13 (3), 516 (2021).
Article ADS Google Scholar
Pacal, I. Enhancing crop productivity and sustainability through disease iden- tification in maize leaves: exploiting a large dataset with an advanced vision transformer model. Expert Syst. Appl. 238, 122099 (2024).
Article Google Scholar
Kim, K. et al. Rethink- ing the self-attention in vision transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (3071–3075) (2021).
Touvron, H., Cord, M., El-Nouby, A., Verbeek, J. & J´egou, H. Three things everyone should know about vision transformers. In: European Conference on Computer Vision, (497–515) (Springer, 2022).
Ma, L., Li, X., Shi, Y., Wu, J. & Zhang, Y. Correlation filtering-based hashing for fine-grained image retrieval. IEEE. Signal. Process. Lett. 27, 2129–2133 (2020).
Article ADS Google Scholar
Ma, L., Hong, H., Meng, F., Wu, Q. & Wu, J. Deep progressive asymmetric quantization based on causal intervention for fine-grained image retrieval. IEEE Trans. Multimedia. 26, 1306–1318 (2023).
Article Google Scholar
Ma, L., Luo, X., Hong, H., Meng, F. & Wu, Q. Logit variated product quantization based on parts interaction and metric learning with knowledge distillation for Fine-Grained image retrieval. IEEE Trans. Multimedia (2024).
Ma, L. et al. Optimal transport quantization based on Cross-X semantic hypergraph learning for Fine-Grained image retrieval. IEEE Trans. Circuits Syst. Video Technol. 35 (7), 7005–7019 (2025).
Article Google Scholar
Dai, Z., Liu, H., Le, Q. V. & Tan, M. Coatnet: marrying Convolution and attention for all data sizes. Adv. Neural. Inf. Process. Syst. 34, 3965–3977 (2021).
Google Scholar
Liu, Z. et al. Swin trans- former: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, (10012–10022) (2021).
Bochkovskiy, A., Wang, C. Y. & Liao, H. Y. M. Yolov4: Optimal speed and accuracy of object detection. Preprint at https://arXiv/org:2004.10934 (2020).
Woo, S., Park, J., Lee, J. Y. & Kweon, I. S. Cbam: Convolutional block atten- tion module. In: Proceedings of the European Conference on Computer Vision (ECCV), (3–19) (2018).
Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. Preprint at https://arXiv/org/1711.05101 (2017).
Yang, L., Wang, X. & Zhai, J. Waterline extraction for artificial Coast with vision Transformers. Front. Environ. Sci. 10, 16 (2022).
Article Google Scholar
Fan, S. et al. On line detection of defective apples using computer vision system combined with deep learning methods. J. Food Eng. 286, 110102 (2020).
Article Google Scholar
Cao, Y., Xu, J., Lin, S., Wei, F. & Hu, H. Gcnet: non-local networks meet squeeze-excitation networks and beyond. arXiv. (2019).
Ma, L., Zhao, F., Hong, H. Y., Wang, L. & Zhu, Y. Complementary parts contrastive learning for Fine-Grained weakly supervised object Co-Localization. IEEE Trans. Circuits Syst. Video Technol. 33 (11), 6635–6648 (2023).
Article Google Scholar
Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. Preprint at https://arXiv/org/14091556 (2014).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recogni- tion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (770–778) (2016).
Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (4700–4708) (2017).
Tan, M. & Le, Q. Efficientnet: Rethinking model scaling for convolutional neu- ral networks. In: International Conference on Machine Learning, (6105–6114) (PMLR, 2019).
Howard, A. G. et al. Mobilenets: efficient convolutional neural networks for mobile vision applications. Preprint at https://arXiv/org/170404861 (2017).
Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. (2020). Preprint at https://arXiv/org/2010.11929.
Yang, J., Li, C., Dai, X. & Gao, J. Focal modulation networks. Adv. Neural. Inf. Process. Syst. 35, 4203–4217 (2022).
Google Scholar
Guo, J. et al. Cmt: Convolutional neural networks meet vision transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (12175–12185) (2022).
Wu, H. et al. Cvt: Intro- ducing convolutions to vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, (22–31) (2021).
Wang, W. et al. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, (568–578) (2021).
Tu, Z. et al. Multi-Axis Vision Transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, (2022).
Liu, X. et al. Memory efficient vision transformer with cascaded group attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (14420–14430). (2023).
Ma, Z., Wu, X., Chu, A., Huang, L. & Wei, Z. SwinFG: A fine-grained recognition scheme based on Swin transformer. Expert Syst. Appl. 244, 123021 (2024).
Article Google Scholar
Huang, M. & Xu, Y. Image classification of Chinese medicinal flowers based on convolutional neural network. Math. Biosci. Engineering: MBE. 20 (8), 14978–14994 (2023).
Article PubMed Google Scholar
Thella, P. K. & Ulagamuthalvi, V. An efficient double labelling image segmenta- Tion model for leaf pixel extraction for medical plant detection. Annals Romanian Soc. Cell. Biology, 2241–2251 (2021).
Oltean, M. August. ‘Fruits 360 dataset: new research directions.’ Source: (2022). https://www.kaggle.com/datasets/moltean/fruits (accessed on (2021).

Download references

Acknowledgements

The authors would like to acknowledge the generous guidance provided by the rest of the National Key Laboratory of Fundamental Science on Synthetic Vision. They would also like to acknowledge Pro.Yongliang Huang for providing additional information about medicinal plants.

Funding

This study was funded by the National Natural Science Foundation of China (No. 82405033), China Postdoctoral Science Foundation (No. 2025MD774046) and the Research Promotion Plan for Xinglin Scholars in Chengdu University of Traditional Chinese Medicine (No. BSZ2024030).

Author information

Authors and Affiliations

School of Intelligent Medicine, Chengdu University of Traditional Chinese Medicine, Chengdu, 611137, China
Chaoqun Tan & Maojia Wang
A State Key Laboratory of Southwestern Chinese Medicine Resources, School of Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, 611137, China
Jiale Deng & Chunjie Wu
National Key Laboratory of Fundamental Science on Synthetic Vision, School of Computer Science, Sichuan University, Chengdu, 610065, China
Li Ke

Authors

Chaoqun Tan
View author publications
Search author on:PubMed Google Scholar
Jiale Deng
View author publications
Search author on:PubMed Google Scholar
Chunjie Wu
View author publications
Search author on:PubMed Google Scholar
Maojia Wang
View author publications
Search author on:PubMed Google Scholar
Li Ke
View author publications
Search author on:PubMed Google Scholar

Contributions

Chaoqun Tan proposed the idea, conducted the experiments, and drafted and revised the manuscript. Jiale Deng and Maojia Wang analyzed the results, and revised sections of the manuscript. Chunjie Wu and Ke Li participated in project management and obtained the funding for this study. All authors contributed to the paper and approved the submitted version.

Corresponding author

Correspondence to Li Ke.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethics approval and consent to participate

All authors agreed to publish this manuscript.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Tan, C., Deng, J., Wu, C. et al. Visual feature-based multi-scale hybrid attention network for fine-grained Hawthorn varieties identification. Sci Rep 15, 36895 (2025). https://doi.org/10.1038/s41598-025-20875-0

Download citation

Received: 20 September 2024
Accepted: 17 September 2025
Published: 22 October 2025
Version of record: 22 October 2025
DOI: https://doi.org/10.1038/s41598-025-20875-0

Subjects

Abstract

Introduction

Dataset collection

Methodology

Overview

Local feature extraction

Spatial local attention

Improved MBConv module

Global feature extraction

Multi-features fusion

Results and discussions

Evaluation

Training parameters

Metrics

Hawthorn identification

Results of our model

Manual identification

Ablation study

Different attention modules

Visualization SLA module

Different CNN modules

Horizontal comparison and performance trade-offs

Comparative experimental results

Visualization comparison results

Blossoms and leaf and fruit identification

Conclusion

Data availability

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Ethics approval and consent to participate

Additional information

Publisher’s note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Quick links