Abstract
Traditional Chinese Medicinal Plants (TCMPs) are widely used to prevent and treat human diseases. Because different medicinal plants have distinct therapeutic effects, plant recognition is an important research topic. Traditional identification of medicinal plants relies mainly on human experts and cannot keep pace with the growing demands of clinical practice. Artificial Intelligence (AI) research on plant recognition, in turn, is hindered by the lack of a comprehensive medicinal plant dataset. We therefore present a TCMP dataset that includes 52,089 images in 300 categories. Compared with existing medicinal plant datasets, ours covers more categories and fine-grained plant parts, facilitating comprehensive plant recognition. The plant images were collected through the Bing search engine and cleaned by a pretrained vision foundation model with human verification. We conduct technical validation by training several state-of-the-art image classification models with advanced data augmentation on the dataset, achieving 89.64% accuracy. Our dataset promotes the development and validation of advanced AI models for robust and accurate plant recognition.
Background & Summary
Medicinal plants are plants used to prevent and treat diseases and to maintain human health. They are not only the source of a large number of Western medicinal compounds but also the main source of Complementary and Alternative Medicine (CAM). China is rich in medicinal plant resources: according to statistics1, there are 11,146 species of medicinal plants in China, belonging to 383 families and 2313 genera. In recent years, medicinal plants have played an increasingly important role in the prevention and treatment of various diseases2,3,4. However, different medicinal plants have distinct therapeutic effects, and in clinical practice they must be used according to the specific condition to mitigate adverse reactions and avoid toxicity. Therefore, accurately identifying medicinal plants to avoid misplanting, miscollection, misuse, and fraudulent substitution is the most fundamental and crucial step in ensuring the quality and efficacy of medicines.
As a branch of systematics, taxonomy is closely related to phylogenetic studies and to research on evolutionary processes. Its tasks include using the findings of systematics to name and classify organisms under reasonable principles, providing an organizational framework that facilitates the storage and retrieval of biodiversity information, and offering guidance for identification work5. Plant taxonomy is a long-established discipline that has continuously developed and improved through the practice of human recognition and utilization of plants. Guided by Darwin’s theory of evolution6, plant taxonomists from various countries have proposed insights into plant classification systems by studying the origin and development of the plant kingdom. Among the most influential systems are those of Engler, Hutchinson, Takhtajan, and Cronquist; the Engler and Hutchinson systems, in particular, represent the “Pseudanthium” and “Euanthium” schools, respectively, with flower characteristics being the primary distinguishing features7. The classification of medicinal plants adopts the principles and methods of plant taxonomy. Mastering characteristics such as plant morphology and microscopic structure ensures the accurate identification of plants with medicinal value8, which guarantees the authenticity of medicinal plants and herbal medicines and facilitates better research, rational development, and utilization.
Traditional identification of medicinal plants mainly relies on professionals working with the corresponding classification system. Following the written descriptions of the classification system, medicinal plants are identified by sight, touch, taste, and smell9. This approach depends heavily on the expertise and experience of the operator; identification is therefore subjective, and accuracy varies among operators. With the development of human society, the demand for medicinal plant resources is increasing, and traditional identification methods cannot keep pace with the rich diversity of plant taxa. At present, medicinal plants can be classified from different perspectives, including pharmacognosy, pollen, leaf epidermis and seed coat, cytology, and molecular identification10.
In particular, plant image recognition based on deep learning11,12 is an intelligent and reliable auxiliary technology that can improve the accuracy and efficiency of plant identification. However, because medicinal plant images are more difficult to collect and annotate, related work remains relatively scarce. Moreover, Traditional Chinese Medicine (TCM) is most prevalent in Southeast Asia, so research on Traditional Chinese Medicinal Plant (TCMP) datasets is also concentrated in that region. In 2021, Tung and Vinh13 proposed VNPlant-200, a large, public, multi-class dataset of medicinal plant images. The dataset consists of 20,000 images of 200 labeled Vietnamese medicinal plants, and the authors reported preliminary classification results using two local image descriptors. Huang et al.14 built a Chinese medicinal blossoms dataset consisting of twelve blossoms used in TCM; the authors employed various data augmentation methods to increase the number of samples and provided image classification results using AlexNet15 and InceptionV316. Abdollahi17 proposed using MobileNet18 to identify medicinal plants in Ardabil, Iran; the dataset used for training and evaluation includes 30 classes of medicinal plants, totaling 3000 images. DIMPSAR19 is a dataset for the analysis and recognition of Indian medicinal plant species, consisting of a leaf part with 80 classes and a plant part with 40 classes. MED-11720 is an image dataset of common medicinal plant leaves from Assam, India; beyond classification, it also includes segmented leaf frames for leaf segmentation using U-Net21. Ref. 22 reported a collection of leaf images from 10 common medicinal plants in Bangladesh and used several common Convolutional Neural Networks (CNNs) to classify the dataset. In addition to medicinal plant datasets, researchers have also collected an image dataset of TCM fruits23, which includes 20 common types of TCM fruits and is likewise classified with CNNs. We summarize the details of these datasets in Table 1.
Although several medicinal plant datasets have been introduced, many challenges remain in this field:
1. Limited diversity of species. Existing medicinal plant datasets capture a limited number of plant species and parts, restricting their application in clinical practice.
2. Lack of a systematic data collection and cleaning framework. The data collection and cleaning process for existing datasets often relies heavily on manual curation without automation tools.
3. Inadequate validation framework. Previous papers do not conduct technical validation using state-of-the-art network architectures and advanced data augmentation methods, which hampers the accurate assessment of datasets and leads to suboptimal performance.
To address these challenges, our TCMP-30024 dataset stands out with its comprehensive design and structured methodology. The overall data processing workflow and evaluation are illustrated in Fig. 1. By leveraging an automated web crawler, we ensure a diverse representation of species and plant parts, capturing images from various angles and contexts. This not only enhances the dataset’s ability to reflect the true diversity of medicinal plants but also mitigates the limitations of manual curation. Furthermore, our systematic data-cleaning process ensures that only high-quality and relevant images are included, providing a strong foundation for subsequent research and model training. The final dataset is verified by human TCM experts. We conduct comprehensive technical validation using three popular image-classification network families: mobile networks, CNNs, and Vision Transformers (ViTs). Additionally, we propose HybridMix, an advanced data augmentation technique applied during training, which enriches the training data and enhances model robustness. The largest model, Swin-Base25, achieves 89.64% accuracy on our TCMP-30024 dataset. Our dataset, together with the released models, sets a new standard for future medicinal plant recognition research, making it easier for researchers to reproduce results and build upon our findings.
Among the selected medicinal plants, the families with the most species are Asteraceae (27 species), Lamiaceae (18 species), Liliaceae (14 species), Ranunculaceae and Fabaceae (11 species each), Apiaceae (10 species), Brassicaceae and Solanaceae (8 species each), and Rosaceae, Papaveraceae, Scrophulariaceae, Araliaceae, and Campanulaceae (6 species each), while the remaining families contain fewer than 5 species. The selected medicinal plants predominantly contain bioactive compounds such as flavonoids, alkaloids, volatile oils, coumarins, sterols, and tannins. They exhibit clinical effects such as (1) clearing heat and detoxifying; (2) antibacterial and anti-inflammatory; (3) relieving cough and resolving phlegm; (4) reducing swelling and dissipating nodules; and (5) promoting blood circulation and resolving stasis. Certain folk medicinal herbs among them, such as bittersweet, black nightshade, and rorippa, have not yet been fully explored and utilized; these plants hold significant potential as resources for developing new traditional Chinese medicines in clinical practice.
In summary, the comprehensive and meticulously curated TCMP-30024 dataset represents a significant advancement in the field of medicinal plant research. By facilitating the classification and identification of diverse plant species, this dataset not only enhances our understanding of the medicinal properties of these plants but also accelerates the discovery of new therapeutic agents. For example, the development of artemisinin, a Nobel Prize-winning treatment for malaria, was the result of extensive manual research to identify effective compounds. With a well-structured dataset, researchers could potentially streamline the process of discovering new drugs from nature, significantly improving efficiency in the search for novel therapies. Moreover, this dataset aligns with the United Nations Sustainable Development Goal 3: “Ensure healthy lives and promote well-being for all at all ages.” By providing a robust resource for the classification and study of medicinal plants, it can support initiatives aimed at improving healthcare and promoting the use of traditional medicine. Ultimately, the development of such datasets not only contributes to scientific knowledge but also has the potential to foster sustainable practices in healthcare and biodiversity conservation, paving the way for a healthier future for all.
Methods
Image crawling
The dataset images were collected through the Bing search engine using an automated web crawler that queried with standardized scientific names (following the binomial nomenclature system: Genus species + authority, e.g., Angelica sinensis (Oliv.) Diels and Pulsatilla chinensis (Bunge) Regel) of medicinal plants. After initial automated retrieval, all images underwent rigorous manual and taxonomic validation to ensure botanical fidelity. The final collection includes high-quality images depicting key morphological features of each plant species, such as blossoms, stems, leaves, roots, fruits, and whole-plant profiles, with representative examples shown in Fig. 2.
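As a concrete illustration of this step, the sketch below queries Bing by scientific name with the open-source icrawler package; the library choice, the query list, and the per-species download limit are assumptions for illustration, not the exact crawler released with the dataset.

```python
# Hedged sketch of the image-crawling step, assuming the `icrawler` package.
from pathlib import Path
from icrawler.builtin import BingImageCrawler

# Hypothetical subset of the scientific names used as search queries.
QUERIES = [
    "Angelica sinensis (Oliv.) Diels",
    "Pulsatilla chinensis (Bunge) Regel",
]

def crawl_species(name: str, out_root: str = "raw_images", max_num: int = 400) -> None:
    """Download up to `max_num` Bing image results for one scientific name."""
    out_dir = Path(out_root) / name.replace(" ", "_")
    out_dir.mkdir(parents=True, exist_ok=True)
    crawler = BingImageCrawler(storage={"root_dir": str(out_dir)})
    crawler.crawl(keyword=name, max_num=max_num)

if __name__ == "__main__":
    for species in QUERIES:
        crawl_species(species)
```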
Data cleaning
The data-cleaning process began with the development of textual prompts based on medicinal plant labels, followed by the application of CLIP (Contrastive Language-Image Pre-Training)26, a robust vision foundation model, to classify the crawled images. Guided by TCM experts, precise classification thresholds were established for each category, enabling the identification and removal of low-quality samples. Figure 3 illustrates examples of dirty data, such as images containing textual content or pharmaceutical products, which were prevalent in the initial dataset. In this first cleaning phase, we eliminated 52.73% of the initial images as noise. Figure 5 compares the number of images per class before and after cleaning, showing a reduction from 222~400 to 101~354 images per class and highlighting the poor quality of the original data. To further refine the dataset, TCM experts conducted manual verification to address misclassified or overlooked samples. Figure 4 demonstrates two types of errors: mistakenly cleaned samples (e.g., rare variants of “Sambucus williamsii” misclassified due to atypical appearances or obscured features) and mistakenly uncleaned samples (e.g., images with ambiguous visual features or subtle text overlays that evaded automated detection). This comprehensive approach ensured a high-quality dataset for subsequent analysis. Note that a few images in the TCMP-30024 dataset may still contain visible watermarks or embedded annotations (e.g., numbers). According to the observation of Northcutt et al.27, noisy data is a common phenomenon in computer vision datasets such as MNIST28, CIFAR-1029, CIFAR-10029, Caltech-25630, and ImageNet31. This further demonstrates that noisy data is not a problem specific to our dataset but a reality of the computer vision field.
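The following sketch illustrates the foundation-model cleaning idea under stated assumptions: each image is scored against a text prompt built from its class label using the Hugging Face CLIP implementation, and images whose similarity falls below a per-class threshold are dropped. The checkpoint name, prompt template, and threshold value are placeholders, not the exact settings used for TCMP-300.

```python
# Hedged sketch of CLIP-based data cleaning with per-class thresholds.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, label: str) -> float:
    """Cosine similarity between an image and the prompt 'a photo of ... <label>'."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[f"a photo of the medicinal plant {label}"],
                       images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())

def is_clean(image_path: str, label: str, threshold: float = 0.25) -> bool:
    """Keep an image only if its score exceeds the (illustrative) class threshold."""
    return clip_score(image_path, label) > threshold
```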
Data Records
Folder structure and recording format
The TCMP-30024 dataset, publicly available at https://doi.org/10.6084/m9.figshare.29432726, comprises 52,089 images across 300 TCMP categories. Users can download “tcmp-300-release.tar.gz”, which contains the TCMP images, and “tcmp-info-20250329.csv”, which describes the class information. Running “tar -zxvf tcmp-300-release.tar.gz” extracts the image files. The dataset is organized into category-specific subfolders following the naming convention “[Class ID].[English Name of TCMP]” (e.g., “001.Veronica persica Poir.”), with each subfolder containing all corresponding medicinal plant images. All images are provided in widely compatible formats (PNG, JPG, JPEG, and WebP), ensuring seamless integration with modern AI frameworks such as PyTorch32.
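Because the subfolder layout follows the one-folder-per-class convention, it can be read directly with torchvision's ImageFolder. The sketch below assumes the archive has been extracted into a local directory named tcmp-300-release; the directory name and the ImageNet normalization statistics are assumptions.

```python
# Minimal loading sketch for the extracted "[Class ID].[Name]" folder layout.
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # assumed ImageNet stats
                         std=[0.229, 0.224, 0.225]),
])

dataset = datasets.ImageFolder("tcmp-300-release", transform=transform)  # assumed path
loader = torch.utils.data.DataLoader(dataset, batch_size=128,
                                     shuffle=True, num_workers=4)
print(len(dataset.classes), "classes,", len(dataset), "images")
```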
Data splitting
Because our dataset exhibits a long-tailed distribution, we split the data class by class to preserve this characteristic. To divide the whole dataset into training and validation subsets, we first define a threshold τ. For class Ci containing ∣Ci∣ images, with i ∈ [1, 2, ⋯ , n] and n the total number of classes, the numbers of training and validation images are computed by Equ. (1):
$${N}_{train}^{{C}_{i}}=\lfloor \tau \cdot | {C}_{i}| \rfloor ,\quad {N}_{validation}^{{C}_{i}}=| {C}_{i}| -{N}_{train}^{{C}_{i}},$$ (1)
where \({N}_{train}^{{C}_{i}}\) and \({N}_{validation}^{{C}_{i}}\) are the numbers of images in the training and validation subsets for class Ci, respectively, and ⌊⋅⌋ denotes the round-down (floor) operation. In our practice, we set τ = 0.7.
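A minimal sketch of this class-wise split is given below; shuffling with a fixed seed before splitting is an assumption, as the released scripts may order files differently.

```python
# Sketch of the class-wise split: the first floor(tau * |Ci|) images of each
# class go to the training subset and the remainder to the validation subset.
import math
import random
from pathlib import Path

def split_class(image_paths, tau: float = 0.7, seed: int = 0):
    """Return (train_paths, val_paths) for one class under threshold tau."""
    paths = sorted(image_paths)
    random.Random(seed).shuffle(paths)          # assumed shuffling for fairness
    n_train = math.floor(tau * len(paths))      # N_train = floor(tau * |Ci|)
    return paths[:n_train], paths[n_train:]     # remainder is validation

def split_dataset(root: str, tau: float = 0.7):
    """Apply the per-class split to every category subfolder under `root`."""
    train, val = {}, {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            train[class_dir.name], val[class_dir.name] = split_class(
                list(class_dir.glob("*")), tau)
    return train, val
```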
Properties
To make our dataset suitable for a wide range of identification tasks in practice, the TCMP-30024 dataset has three valuable properties:
- Comprehensiveness. Our dataset covers the largest number of TCMP categories among existing datasets. It encompasses six major parts of TCMPs: blossom, stem, leaf, root, fruit, and the whole plant. This extensive categorization ensures a complete representation of the various TCMP components.
- Diversity. To augment the dataset’s diversity, images were scraped from the internet covering various perspectives, indoor and outdoor settings, a spectrum of lighting conditions, and complex backgrounds. This mirrors the multifaceted contexts in which medicinal plants are typically found, and merging images captured from diverse environments supports the generalization of TCMP recognition.
- Long-tailed distribution. The distribution of instances across categories emulates the real-world prevalence of TCM: the number of images for frequently used medicinal plants substantially exceeds that of rarer ones, as illustrated in Fig. 5. This long-tailed distribution requires models to perform well across all categories, even those with few samples.
Technical Validation
Model architectures
We adopt popular image classification models, including ResNet33, DenseNet34, RegNet35, MobileNetV336, ShuffleNetV237, EfficientNet38, ConvNeXt39, ViT40, and Swin Transformer25, to evaluate the dataset. The first seven models are CNN backbones, among which MobileNetV3, ShuffleNetV2, and EfficientNet are lightweight networks designed for edge devices; the last two are Vision Transformer networks.
Classification framework
Figure 6 shows the pipeline of the TCMP image classification framework. The input is an image x with RGB channels. We first apply a data augmentation module τ(⋅) to preprocess the input image and produce an augmented image \(\widetilde{{\boldsymbol{x}}}\), i.e., \(\widetilde{{\boldsymbol{x}}}=\tau ({\boldsymbol{x}})\). A general image classification network f, e.g., ResNet, is then applied to classify the augmented image \(\widetilde{{\boldsymbol{x}}}\). The classification network f typically comprises a feature extractor ϕ for feature learning and a linear classifier ξ that outputs the class probability distribution \({\boldsymbol{p}}=f(\widetilde{{\boldsymbol{x}}})=\xi (\phi (\widetilde{{\boldsymbol{x}}}))\in {{\mathbb{R}}}^{C}\), where C is the number of classes. The feature extractor ϕ includes L stages \({\{{\phi }_{l}\}}_{l=1}^{L}\) to process image features; each stage ϕl downsamples its input to refine hierarchical information and generate semantic features. We apply the standard cross-entropy loss to train the classification network. Given the input image x and its ground-truth label y, the cross-entropy loss is formulated as Equ. (2):
$${{\mathcal{L}}}_{CE}=-\mathop{\sum }\limits_{c=1}^{C}{\tau }_{c,y}\log {p}_{c},$$ (2)
where τc,y is an indicator function: τc,y = 1 if c = y and τc,y = 0 otherwise, and pc is the predicted probability of class c.
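A minimal sketch of this pipeline with a ResNet-18 backbone as the example network f is shown below; PyTorch's nn.CrossEntropyLoss applies the softmax and the indicator-weighted log term of Equ. (2) internally, and the backbone choice is only illustrative.

```python
# Sketch of the classification framework: p = xi(phi(x_aug)) trained with CE.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 300
model = models.resnet18(weights=None)                      # example backbone phi
model.fc = nn.Linear(model.fc.in_features, num_classes)    # linear classifier xi
criterion = nn.CrossEntropyLoss()                          # cross-entropy, Equ. (2)

x = torch.randn(4, 3, 224, 224)            # a batch of augmented images
y = torch.randint(0, num_classes, (4,))    # ground-truth labels
logits = model(x)                          # class scores before softmax
loss = criterion(logits, y)
loss.backward()
```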
Data augmentation
Following the ImageNet classification recipe, we utilize random cropping and flipping as the standard data augmentation41. The input image is first randomly cropped to between 0.08 and 1.0 of the original image area, with an aspect ratio ranging from 3/4 to 4/3. The cropped image is resized to 224 × 224 and horizontally flipped with a probability of 0.5. Finally, it is normalized by the mean and standard deviation over the RGB channels.
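Expressed with torchvision transforms, this standard augmentation reads as follows; the ImageNet normalization statistics are an assumption, since the text only states that per-channel mean and standard deviation are used.

```python
# Standard ImageNet-style augmentation described in the text.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3/4, 4/3)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # assumed ImageNet stats
                         std=[0.229, 0.224, 0.225]),
])
```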
We further propose HybridMix, an image mixture technique, as an extra data augmentation to improve model accuracy. As shown in Fig. 7, given an input image, HybridMix chooses Mixup or CutMix with a probability of 0.5 to perform a global or local image mixture, as detailed below (a code sketch follows the list).
1. Global Image Mixture: Mixup42. Given two different images xi and xj from the mini-batch, Mixup performs a global pixel-wise mixture with a balancing factor λ ∈ [0, 1] to produce a new mixed image \({\widetilde{{\boldsymbol{x}}}}_{ij}\). The label of the mixed image is linearly interpolated between yi and yj with the same factor λ to generate the mixed label \({\widetilde{y}}_{ij}\), as shown in Equ. (3):
$${\widetilde{{\boldsymbol{x}}}}_{ij}=\lambda \ast {{\boldsymbol{x}}}_{i}+(1-\lambda )\ast {{\boldsymbol{x}}}_{j},\,{\widetilde{y}}_{ij}=\lambda \ast {y}_{i}+(1-\lambda )\ast {y}_{j}$$ (3)
2. Local Image Mixture: CutMix43. The main idea of CutMix is to crop a patch from xj and paste it onto xi. We denote the bounding box of the cropped patch on xj as B = (rx, ry, rw, rh), where \({r}_{w}=W\sqrt{\lambda }\) and \({r}_{h}=H\sqrt{\lambda }\), λ ∈ [0, 1], and W and H are the width and height of the image, respectively. rx and ry are uniformly sampled as rx ~ U(0, W) and ry ~ U(0, H). The CutMix image is generated by removing the region B from xi and filling it with the corresponding patch cropped from xj. In practice, a binary mask M ∈ {0, 1}W×H is set to 0 inside the bounding box B and to 1 elsewhere. The resulting CutMix image and interpolated label are expressed as Equ. (4):
$$\widehat{{\boldsymbol{x}}}={\bf{M}}\odot {{\boldsymbol{x}}}_{i}+({\boldsymbol{1}}-{\bf{M}})\odot {{\boldsymbol{x}}}_{j},\,{\widetilde{y}}_{ij}=\lambda \ast {y}_{i}+(1-\lambda )\ast {y}_{j}.$$ (4)
Here, ⊙ denotes element-wise multiplication and 1 = {1}W×H is the mask filled with ones.
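The sketch below implements the HybridMix idea on a mini-batch under stated assumptions: λ is drawn from Beta(1, 1), the CutMix patch size follows the common √(1−λ) convention, and λ is recomputed from the clipped patch area so that it matches the fraction of xi pixels retained; the released implementation may differ in these details.

```python
# Hedged sketch of HybridMix: choose Mixup or CutMix with probability 0.5.
import torch
import torch.nn.functional as F

def hybrid_mix(x: torch.Tensor, y: torch.Tensor, num_classes: int):
    """x: (B, 3, H, W) images, y: (B,) integer labels. Returns mixed images
    and soft labels for a soft-target cross-entropy loss."""
    lam = torch.distributions.Beta(1.0, 1.0).sample().item()   # assumed Beta(1, 1)
    perm = torch.randperm(x.size(0))                           # pair each sample with another
    y_onehot = F.one_hot(y, num_classes).float()

    if torch.rand(1).item() < 0.5:                             # Mixup: global mixture
        x_mix = lam * x + (1 - lam) * x[perm]
    else:                                                      # CutMix: local patch paste
        _, _, H, W = x.shape
        rw, rh = int(W * (1 - lam) ** 0.5), int(H * (1 - lam) ** 0.5)
        cx, cy = torch.randint(W, (1,)).item(), torch.randint(H, (1,)).item()
        x1, x2 = max(cx - rw // 2, 0), min(cx + rw // 2, W)
        y1, y2 = max(cy - rh // 2, 0), min(cy + rh // 2, H)
        x_mix = x.clone()
        x_mix[:, :, y1:y2, x1:x2] = x[perm][:, :, y1:y2, x1:x2]
        lam = 1.0 - (x2 - x1) * (y2 - y1) / (W * H)            # fraction of x_i kept

    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix
```

The returned soft labels can be consumed by nn.CrossEntropyLoss, which accepts probability targets in recent PyTorch versions, or by a manually implemented soft-target cross-entropy.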
Training setup
All models are trained with a stochastic gradient descent (SGD) optimizer with a momentum of 0.9, a weight decay of 1 × 10−4, and a batch size of 128. We use a cosine learning rate schedule: the learning rate starts at 0.1 and gradually decays to 0 over the total 300 epochs.
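The stated optimization setup maps directly onto PyTorch's SGD optimizer and CosineAnnealingLR scheduler, as sketched below; stepping the scheduler once per epoch is an assumption.

```python
# Optimizer and schedule matching the stated setup: SGD (momentum 0.9,
# weight decay 1e-4), initial LR 0.1, cosine decay to 0 over 300 epochs.
import torch

def build_optimizer_and_scheduler(model, epochs: int = 300):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs, eta_min=0.0)
    return optimizer, scheduler

# Typical loop structure (one scheduler step per epoch, assumed):
# for epoch in range(300):
#     train_one_epoch(model, loader, optimizer)
#     scheduler.step()
```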
Evaluation metrics
We utilize Accuracy (Acc), F1 score, Balanced Accuracy (B-Acc), and Balanced F1 score (B-F1) to measure the performance of TCMP recognition. The predictions for each class are summarized by the True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) counts, from which the four metrics are formulated as follows (a scikit-learn sketch follows the list):

1. Accuracy: the proportion of correctly classified samples over all samples:
$$\,{\rm{Acc}}\,=\frac{{\sum }_{c=1}^{C}T{P}_{c}}{{\sum }_{c=1}^{C}(T{P}_{c}+F{N}_{c})},$$ (5)
where C denotes the total number of classes and TPc denotes the True Positives of class c.

2. F1 score: the arithmetic mean (macro average) of the per-class F1 scores:
$$\begin{array}{c}{{\rm{Precision}}}_{c}=\frac{T{P}_{c}}{T{P}_{c}+F{P}_{c}},\,{{\rm{Recall}}}_{c}=\frac{T{P}_{c}}{T{P}_{c}+F{N}_{c}},\\ {{\rm{F1}}}_{c}=\frac{2{{\rm{Precision}}}_{c}\cdot {{\rm{Recall}}}_{c}}{{{\rm{Precision}}}_{c}+{{\rm{Recall}}}_{c}},\,\,{\rm{F1\; score}}=\frac{1}{C}\mathop{\sum }\limits_{c=1}^{C}{{\rm{F1}}}_{c}.\end{array}$$ (6)

3. Balanced Accuracy: the average recall over all classes:
$$\,{\rm{B}} \mbox{-} {\rm{Acc}}=\frac{1}{C}\mathop{\sum }\limits_{c=1}^{C}\,{{\rm{Recall}}}_{c}.$$ (7)

4. Balanced F1 score: the weighted average of per-class F1 scores, where each class is weighted by its proportion of the total sample size:
$${w}_{c}=\frac{{N}_{c}}{N},\,\,{\rm{B}} \mbox{-} {\rm{F1}}=\mathop{\sum }\limits_{c=1}^{C}{w}_{c}{{\rm{F1}}}_{c},$$ (8)
where Nc denotes the number of samples of class c and N denotes the total number of samples.
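These four metrics correspond to standard scikit-learn calls, as sketched below; mapping the F1 score of Equ. (6) to the macro average and B-F1 of Equ. (8) to the support-weighted average follows the definitions above.

```python
# Computing Acc, F1, B-Acc, and B-F1 from predictions with scikit-learn.
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

def evaluate(y_true, y_pred):
    return {
        "Acc":   accuracy_score(y_true, y_pred),                # Equ. (5)
        "F1":    f1_score(y_true, y_pred, average="macro"),     # Equ. (6)
        "B-Acc": balanced_accuracy_score(y_true, y_pred),       # Equ. (7)
        "B-F1":  f1_score(y_true, y_pred, average="weighted"),  # Equ. (8)
    }
```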
Image classification accuracy with advanced data augmentation
As shown in Table 2, we evaluate the proposed dataset on various classification models with regular training and with the advanced data augmentation. Our proposed HybridMix augmentation consistently improves over the baseline training, with an average gain of 1.57% across eleven models, including 1.42% for mobile networks, 1.01% for CNNs, and 2.25% for ViTs. The results indicate that HybridMix enhances the generalization capability for TCMP recognition. Models of different sizes show distinct performance: the lightweight MobileNetV3, ShuffleNetV2, and EfficientNet-B0 models, which are typically deployed on resource-limited mobile devices, achieve 83% ~ 86% accuracy, and accuracy generally improves as the parameter count increases. The best CNN and ViT models are ConvNeXt-Tiny and Swin-Base, which achieve 89.43% and 89.64%, respectively, demonstrating that larger models reach the best accuracy. Given the class imbalance in the dataset, we also report balanced accuracy (B-Acc), F1 score, and balanced F1 score (B-F1) alongside overall accuracy. HybridMix leads to consistent improvements on the B-Acc, F1, and B-F1 metrics, showing that models trained with HybridMix also work well under class imbalance.
Image classification accuracy with various input resolutions
As shown in Table 3, we investigate the effectiveness of the proposed dataset with various input resolutions, conducting experiments on multiple classification models to increase the confidence of the results. The accuracies of the classification models generally improve as the input resolution increases. When the input resolution is increased from 224 × 224 to 336 × 336, the classification models achieve an average accuracy gain of 0.84%; at 448 × 448, the average gain rises to 1.65%, and RegNet-X achieves the best accuracy of 89.54%. Larger input resolutions also consistently enhance the B-Acc, F1, and B-F1 metrics. The experimental results demonstrate that classification accuracy benefits from larger input resolutions.
Analysis of accuracy curve during the training process
Figure 8 shows the accuracy curves of the CNN and ViT models during training. The accuracy steadily increases as training proceeds: it improves rapidly during the first 50 epochs, grows slowly afterwards, and converges by the 300th epoch, where the best accuracy is reached.
Visualization of learned feature spaces
Figures 9 and 10 show t-SNE44 visualizations of the learned feature spaces of various networks for random and hard samples, respectively. For clear visualization, we randomly sample 10 classes from the dataset. In the figures, each cluster of points sharing the same color represents a unique class. As shown in Fig. 9, each network exhibits good intra-class compactness and inter-class separability, indicating that each network learns a discriminative feature space and thus achieves good classification performance. We also analyze the feature space of hard samples in Fig. 10, where a hard sample is defined as a sample misclassified by the AI model. The feature space of hard samples is less discriminative than that of random samples, which helps explain why hard cases tend to be misclassified.
Fig. 9: Visualization of the learned feature space from random samples by t-SNE44.
Fig. 10: Visualization of the learned feature space from hard cases by t-SNE44.
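A sketch of how such t-SNE plots can be produced is given below: penultimate-layer features are extracted from a torchvision-style backbone and projected to two dimensions with scikit-learn. The feature-extraction step assumes a ResNet-like model whose last child module is the classifier; other backbones expose features differently.

```python
# Sketch of the t-SNE feature-space visualization for a sampled class subset.
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

@torch.no_grad()
def extract_features(model, loader, device="cpu"):
    """Collect penultimate-layer features and labels over a dataloader."""
    backbone = torch.nn.Sequential(*list(model.children())[:-1])  # drop classifier
    backbone.eval().to(device)
    feats, labels = [], []
    for x, y in loader:
        f = backbone(x.to(device)).flatten(1)
        feats.append(f.cpu())
        labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

def plot_tsne(features, labels, out_path="tsne.png"):
    """Project features to 2-D and color each point by its class label."""
    emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=5)
    plt.savefig(out_path, dpi=300)
```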
Visualization of attention heatmaps
Figures 11 and 12 show attention heatmaps of a pretrained ResNet-50 model generated by Grad-CAM45. As shown in Fig. 11, the model trained on our TCMP-30024 dataset captures fine-grained discriminative features of critical parts, such as the blossom, stem, leaf, root, fruit, and whole plant, while ignoring background noise. These captured features serve as classification evidence for robust and accurate TCMP recognition. We also found a small number of failure cases in which the model focuses on inaccurate TCMP-related features, as shown in Fig. 12, indicating that AI models may still exhibit potential biases on a few TCMP images.
Fig. 11: Visualization of attention heatmaps with accurate TCMP-related features by Grad-CAM45.
Fig. 12: Visualization of attention heatmaps with inaccurate TCMP-related features by Grad-CAM45.
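For reference, the sketch below computes a Grad-CAM heatmap with forward and backward hooks on the last convolutional stage of a ResNet-50; the target layer and the untrained example model are assumptions, not the exact visualization script used for these figures.

```python
# Minimal Grad-CAM sketch using hooks on a ResNet-50's last conv stage.
import torch
import torch.nn.functional as F
from torchvision import models

def grad_cam(model, layer, image, class_idx=None):
    """image: (1, 3, H, W). Returns a heatmap of shape (H, W) scaled to [0, 1]."""
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

    logits = model(image)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()          # gradients of the target class score
    h1.remove(); h2.remove()

    weights = grads["v"].mean(dim=(2, 3), keepdim=True)   # GAP over gradient maps
    cam = F.relu((weights * acts["v"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)[0, 0]
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

model = models.resnet50(weights=None).eval()              # placeholder weights
heatmap = grad_cam(model, model.layer4, torch.randn(1, 3, 224, 224))
```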
Usage Notes
Anyone can access the TCMP-30024 dataset from the figshare platform. The standard PyTorch dataloader module can be used to load the dataset for training and validation. We also release several strong AI models pretrained on the TCMP-30024 dataset as baselines and references for future research to explore more powerful models. As shown in Fig. 13, we host a leaderboard for our dataset on Papers with Code at https://paperswithcode.com/sota/image-classification-on-tcmp-300, where anyone can submit improved accuracy online to promote the progress of TCMP recognition.
To facilitate clinical practice, we deploy our model as an online web application on Hugging Face. As shown in Fig. 14, anyone can visit the system at https://winycg-tcmprecognition.hf.space/?__theme=system&deep_link=trNDfB9dwx8 and upload an image for online plant recognition. The system also outputs the top-5 class probabilities to interpret the classification evidence.
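A local top-5 prediction mirroring the web demo could look like the sketch below; the checkpoint filename, backbone choice, and preprocessing are placeholders for the released assets.

```python
# Sketch of offline top-5 inference with a pretrained TCMP-300 classifier.
import torch
from PIL import Image
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

model = models.resnet50(weights=None, num_classes=300)               # example backbone
model.load_state_dict(torch.load("tcmp300_resnet50.pth",             # placeholder path
                                 map_location="cpu"))
model.eval()

image = preprocess(Image.open("query.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    probs = model(image).softmax(dim=1)
top5 = probs.topk(5, dim=1)
for p, idx in zip(top5.values[0], top5.indices[0]):
    print(f"class {idx.item():3d}: {p.item():.3f}")
```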
Our dataset provides a platform for expanding both the number of categories and the number of images per category. We release ready-made tools for image crawling and data cleaning, so users can construct personalized datasets according to their requirements. We hope the TCMP-30024 dataset will grow into a larger-scale dataset with more diverse categories and samples.
Code availability
The GitHub repository for the TCMP-30024 dataset is available at https://github.com/winycg/TCMP-300. This repository offers detailed tools and guidelines for users to integrate the TCMP-30024 dataset into their research. It contains comprehensive tools to promote studies with the TCMP-30024 dataset and enables customization of workflows for image crawling, data cleaning, data preprocessing, model training, model inference, and model deployment. Moreover, we also provide data analysis tools to plot training curves and to visualize t-SNE spaces and Grad-CAM heatmaps.
References
Chen, S. et al. Validation of the its2 region as a novel dna barcode for identifying medicinal plant species. PloS one 5, e8613 (2010).
Zhu, Y. et al. A high-quality chromosome-level genome assembly of the traditional chinese medicinal herb zanthoxylum nitidum. Sci. Data 11, 1–13 (2024).
Li, Q. et al. Chromosome-level genome assembly of the tetraploid medicinal and natural dye plant persicaria tinctoria. Sci. Data 11, 1440 (2024).
Shi, Y. et al. Chromosome-level genome assembly of the traditional medicinal plant lindera aggregata. Sci. Data 12, 565 (2025).
Williams, D. M. Plant Taxonomy: The Systematic Evaluation of Comparative Data, 2nd edition. Syst. Biol. 59, 608–610, https://doi.org/10.1093/sysbio/syq017 (2010).
Darwin, C. et al. On the origin of species by means of natural selection, or the preservation of favoured races in the struggle for life. Oxf. Text Arch. Core Collect. (1859).
Cronquist, A. The status of the general system of classification of flowering plants. Annals Mo. Bot. Gard. 52, 281–303 (1965).
Wang, Y. et al. A haplotype-resolved genome assembly of coptis teeta, an endangered plant of significant medicinal value. Sci. Data 11, 1012 (2024).
Ye, F., Cai, G., Zheng, Z., Qi, X. & Yin, P. Medicinal plant specimen identification based on multi-feature fusion. In Proceedings of the 2011 China Intelligent Automation Academic Conference (2011).
Wang, Z., Du, F., Zhang, L., Wang, S. & Wei, S. Advances in classification and identification of the plants of polygonati rhizoma and its adulterants. North. Hortic. 24, 130–136 (2019).
Li, A., Dai, Z. & Zhang, J. et al. Progress in the application of machine learning in plant phenotyping. Plant Fiber Sci. China 45, 248–260 (2023).
Yanhui, Z., Fan, X. & Zhang, J. A preliminary study on the application of deep learning methods in medicinal plant image recognition based on deeplearning4j on spark. Chin. J. Libr. Inf. Sci. for Tradit. Chin. Medicine 42, 18–22 (2018).
Quoc, T. N. & Hoang, V. T. Vnplant-200 – a public and large-scale of vietnamese medicinal plant images dataset. In Antipova, T. (ed.) Integrated Science in Digital Age 2020, 406–411 (Springer International Publishing, Cham, 2021).
Huang, M.-L., Xu, Y.-X. & Liao, Y.-C. Image dataset on the chinese medicinal blossoms for classification through convolutional neural network. Data Brief 39, 107655, https://doi.org/10.1016/j.dib.2021.107655 (2021).
Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Adv. Neural Information Processing Systems 25 (2012).
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2818–2826 (2016).
Abdollahi, J. Identification of medicinal plants in ardabil using deep learning : Identification of medicinal plants using deep learning. In 2022 27th International Computer Conference, Computer Society of Iran (CSICC), 1–6, https://doi.org/10.1109/CSICC55295.2022.9780493 (2022).
Howard, A. G. et al. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
B R, P. & Rani, N. S. Dimpsar: Dataset for indian medicinal plant species analysis and recognition. Data Brief 49, 109388, https://doi.org/10.1016/j.dib.2023.109388 (2023).
Sarma, P., Boruah, P. A. & Buragohain, R. Med 117: A dataset of medicinal plants mostly found in assam with their leaf images, segmented leaf frames and name table. Data Brief 47, 108983, https://doi.org/10.1016/j.dib.2023.108983 (2023).
Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, 234–241 (Springer, 2015).
Borkatulla, B., Ferdous, J., Uddin, A. H. & Mahmud, P. Bangladeshi medicinal plant dataset. Data Brief 48, 109211, https://doi.org/10.1016/j.dib.2023.109211 (2023).
Tian, D., Zhou, C., Wang, Y., Zhang, R. & Yao, Y. Nb-tcm-chm: Image dataset of the chinese herbal medicine fruits and its application in classification through deep learning. Data Brief 54, 110405, https://doi.org/10.1016/j.dib.2024.110405 (2024).
Zhang, Y. et al. Tcmp-300: A comprehensive traditional chinese medicinal plant dataset for plant recognition. figshare https://doi.org/10.6084/m9.figshare.29432726 (2025).
Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, 10012–10022 (2021).
Stevens, S. et al. Bioclip: A vision foundation model for the tree of life. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19412–19424 (2024).
Northcutt, C. G., Athalye, A. & Mueller, J. Pervasive label errors in test sets destabilize machine learning benchmarks. arXiv preprint arXiv:2103.14749 (2021).
LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
Krizhevsky, A. et al. Learning multiple layers of features from tiny images. Tech. Rep., Master’s thesis, Department of Computer Science, University of Toronto (2009).
Griffin, G. et al. Caltech-256 object category dataset. Tech. Rep., Technical Report 7694, California Institute of Technology Pasadena (2007).
Deng, J. et al. Imagenet: A large-scale hierarchical image database. In IEEE conference on computer vision and pattern recognition, 248–255 (Ieee, 2009).
Paszke, A. et al. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703 (2019).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016).
Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4700–4708 (2017).
Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K. & Dollár, P. Designing network design spaces. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 10428–10436 (2020).
Howard, A. et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF international conference on computer vision, 1314–1324 (2019).
Ma, N., Zhang, X., Zheng, H.-T. & Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), 116–131 (2018).
Tan, M. & Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, 6105–6114 (PMLR, 2019).
Liu, Z. et al. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 11976–11986 (2022).
Dosovitskiy, A. et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
Yang, C. et al. Gated convolutional networks with hybrid connectivity for image classification. In Proceedings of the AAAI conference on artificial intelligence, vol. 34, 12581–12588 (2020).
Zhang, H., Cisse, M., Dauphin, Y. N. & Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017).
Yun, S. et al. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, 6023–6032 (2019).
Van der Maaten, L. & Hinton, G. Visualizing data using t-sne. J. machine learning research 9 (2008).
Selvaraju, R. R. et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, 618–626 (2017).
Acknowledgements
This work is partially supported by the National Natural Science Foundation of China under Grant Number 62476264 and 62406312, the Postdoctoral Fellowship Program and China Postdoctoral Science Foundation under Grant Number BX20240385 (China National Postdoctoral Program for Innovative Talents), the Beijing Natural Science Foundation under Grant Number 4244098, and the Science Foundation of the Chinese Academy of Sciences.
Author information
Authors and Affiliations
Contributions
C.Y., L.H., and Z.A. led this research. Y.Z. and W.S. were responsible for image crawling and data curation. C.Y. wrote the manuscript, conceived the experiments, analyzed the results and deployed AI models. L.H. and Z.A. investigated related background and works. L.H. and W.T. performed data cleaning. W.F. conducted the experiments. Y.X. refined the organization and writing of the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhang, Y., Sun, W., Yang, C. et al. TCMP-300: A Comprehensive Traditional Chinese Medicinal Plant Dataset for Plant Recognition. Sci Data 12, 1166 (2025). https://doi.org/10.1038/s41597-025-05522-7