Abstract
Traditional Chinese Medicinal Plants (TCMPs) are widely used to prevent and treat human diseases. Because different medicinal plants have distinct therapeutic effects, plant recognition is an important research topic. Traditional identification of medicinal plants relies mainly on human experts and cannot keep pace with the growing demands of clinical practice. Artificial Intelligence (AI) research on plant recognition, in turn, is hindered by the lack of a comprehensive medicinal plant dataset. We therefore present a TCMP dataset that includes 52,089 images in 300 categories. Compared with existing medicinal plant datasets, ours covers more categories and fine-grained plant parts, facilitating comprehensive plant recognition. The plant images were collected through the Bing search engine and cleaned by a pretrained vision foundation model with human verification. We conduct technical validation by training several state-of-the-art image classification models with advanced data augmentation on the dataset, achieving 89.64% accuracy. Our dataset promotes the development and validation of advanced AI models for robust and accurate plant recognition.
Background & Summary
Medicinal plants are plants used to prevent and treat diseases and to maintain human health. They are not only the source of a large number of Western medicinal compounds but also the main source of Complementary and Alternative Medicine (CAM). China is rich in medicinal plant resources: according to statistics1, there are 11,146 species of medicinal plants in China, belonging to 383 families and 2313 genera. In recent years, medicinal plants have played an increasingly important role in the prevention and treatment of various diseases2,3,4. However, different medicinal plants have distinct therapeutic effects, and in clinical practice they must be used according to the specific condition to mitigate adverse reactions and avoid toxicity. Therefore, accurately identifying medicinal plants to avoid misplanting, miscollection, misuse, and fraudulent substitution is the most fundamental and crucial step in ensuring the quality and efficacy of medicines.
As a branch of systematics, taxonomy is closely related to phylogenetic studies and to research on evolutionary processes. Its tasks include using the findings of systematics to name and classify organisms under reasonable principles, providing an organizational framework that facilitates the storage and retrieval of biodiversity information, and offering guidance for identification work5. Plant taxonomy is a long-established discipline that has continuously developed and improved through the practice of human recognition and utilization of plants. Guided by Darwin’s theory of evolution6, plant taxonomists from various countries have proposed insights into plant classification systems by studying the origin and development of the plant kingdom. Among the most influential systems are those of Engler, Hutchinson, Takhtajan, and Cronquist; the Engler and Hutchinson systems, in particular, represent the “Pseudanthium” and “Euanthium” schools, respectively, with flower characteristics being the primary distinguishing features7. The classification of medicinal plants adopts the principles and methods of plant taxonomy. Mastering characteristics such as plant morphology and microscopic structure ensures the accurate identification of plants with medicinal value8, which guarantees the authenticity of medicinal plants and herbal medicines and facilitates better research, rational development, and utilization.
Traditional identification of medicinal plants mainly relies on professionals working with the corresponding classification system. Following the written descriptions of the classification system, medicinal plants are identified by sight, touch, taste, and smell9. This approach depends heavily on the expertise and experience of the operator; identification is therefore subjective, and accuracy varies among operators. With the development of human society, the demand for medicinal plant resources is increasing, and traditional identification methods cannot keep pace with the rich diversity of plant taxa. At present, medicinal plants can be classified from different perspectives, including pharmacognosy, pollen, leaf epidermis and seed coat, cytology, and molecular identification10.
In particular, plant image recognition based on deep learning11,12 is an intelligent and reliable auxiliary technology that can improve the accuracy and efficiency of plant identification. However, because medicinal plant images are more difficult to collect and annotate, related work remains relatively scarce. Moreover, Traditional Chinese Medicine (TCM) is most prevalent in Southeast Asia, so research on Traditional Chinese Medicinal Plant (TCMP) datasets is also concentrated in that region. In 2021, Tung and Vinh13 proposed VNPlant-200, a large, public, multi-class dataset of medicinal plant images. The dataset consists of 20,000 images of 200 labeled Vietnamese medicinal plants, and the authors reported preliminary classification results using two local image descriptors. Huang et al.14 built a Chinese medicinal blossoms dataset consisting of twelve blossoms used in TCM; the authors employed various data augmentation methods to increase the number of samples and provided image classification results using AlexNet15 and InceptionV316. Abdollahi17 proposed using MobileNet18 to identify medicinal plants in Ardabil, Iran; the dataset used for training and evaluation includes 30 classes of medicinal plants, totaling 3000 images. DIMPSAR19 is a dataset for the analysis and recognition of Indian medicinal plant species, consisting of a leaf part with 80 classes and a plant part with 40 classes. MED-11720 is an image dataset of common medicinal plant leaves from Assam, India; beyond classification, it also includes segmented leaf frames for leaf segmentation using U-Net21. Ref. 22 reported a collection of leaf images from 10 common medicinal plants in Bangladesh and used several common Convolutional Neural Networks (CNNs) to classify the dataset. In addition to medicinal plant datasets, researchers have also collected an image dataset of TCM fruits23, which includes 20 common types of TCM fruits and is likewise classified with CNNs. We summarize the details of these datasets in Table 1.
Although several medicinal plant datasets have been introduced, many challenges remain in this field:
1. Limited diversity of species. Existing medicinal plant datasets capture a limited number of plant species and parts, restricting their application in clinical practice.
2. Lack of a systematic data collection and cleaning framework. The data collection and cleaning process for existing datasets often relies heavily on manual curation without automation tools.
3. Inadequate validation framework. Previous papers do not conduct technical validation using state-of-the-art network architectures and advanced data augmentation methods, which hampers the accurate assessment of datasets and leads to suboptimal performance.
To address these challenges, our TCMP-30024 dataset stands out with its comprehensive design and structured methodology. The overall data processing workflow and evaluation are illustrated in Fig. 1. By leveraging an automated web crawler, we ensure a diverse representation of species and plant parts, capturing images from various angles and contexts. This not only enhances the dataset’s ability to reflect the true diversity of medicinal plants but also mitigates the limitations of manual curation. Furthermore, our systematic data-cleaning process ensures that only high-quality and relevant images are included, providing a strong foundation for subsequent research and model training. The final dataset is verified by human TCM experts. We conduct comprehensive technical validation using three popular image-classification network families: mobile networks, CNNs, and Vision Transformers (ViTs). Additionally, we propose HybridMix, an advanced data augmentation technique applied during training, which enriches the training data and enhances model robustness. The largest model, Swin-Base25, achieves 89.64% accuracy on our TCMP-30024 dataset. Our dataset, together with the released models, sets a new standard for future medicinal plant recognition research, making it easier for researchers to reproduce results and build upon our findings.
Among the selected medicinal plants, the families with the most species are Asteraceae (27 species), Lamiaceae (18 species), Liliaceae (14 species), Ranunculaceae and Fabaceae (11 species each), Apiaceae (10 species), Brassicaceae and Solanaceae (8 species each), and Rosaceae, Papaveraceae, Scrophulariaceae, Araliaceae, and Campanulaceae (6 species each), while the remaining families contain fewer than 5 species. The selected medicinal plants predominantly contain bioactive compounds such as flavonoids, alkaloids, volatile oils, coumarins, sterols, and tannins. They exhibit clinical effects such as (1) clearing heat and detoxifying; (2) antibacterial and anti-inflammatory; (3) relieving cough and resolving phlegm; (4) reducing swelling and dissipating nodules; and (5) promoting blood circulation and resolving stasis. Certain folk medicinal herbs among them, such as bittersweet, black nightshade, and rorippa, have not yet been fully explored and utilized; these plants hold significant potential as resources for developing new traditional Chinese medicines in clinical practice.
In summary, the comprehensive and meticulously curated TCMP-30024 dataset represents a significant advancement in the field of medicinal plant research. By facilitating the classification and identification of diverse plant species, this dataset not only enhances our understanding of the medicinal properties of these plants but also accelerates the discovery of new therapeutic agents. For example, the development of artemisinin, a Nobel Prize-winning treatment for malaria, was the result of extensive manual research to identify effective compounds. With a well-structured dataset, researchers could potentially streamline the process of discovering new drugs from nature, significantly improving efficiency in the search for novel therapies. Moreover, this dataset aligns with the United Nations Sustainable Development Goal 3: “Ensure healthy lives and promote well-being for all at all ages.” By providing a robust resource for the classification and study of medicinal plants, it can support initiatives aimed at improving healthcare and promoting the use of traditional medicine. Ultimately, the development of such datasets not only contributes to scientific knowledge but also has the potential to foster sustainable practices in healthcare and biodiversity conservation, paving the way for a healthier future for all.
Methods
Image crawling
The dataset images were collected through the Bing search engine using an automated web crawler that queried with standardized scientific names (following the binomial nomenclature system: Genus species + authority, e.g., Angelica sinensis (Oliv.) Diels and Pulsatilla chinensis (Bunge) Regel) of medicinal plants. After initial automated retrieval, all images underwent rigorous manual and taxonomic validation to ensure botanical fidelity. The final collection includes high-quality images depicting key morphological features of each plant species, such as blossoms, stems, leaves, roots, fruits, and whole-plant profiles, with representative examples shown in Fig. 2.
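As a concrete illustration of this step, the sketch below queries Bing by scientific name with the open-source icrawler package; the library choice, the query list, and the per-species download limit are assumptions for illustration, not the exact crawler released with the dataset.

```python
# Hedged sketch of the image-crawling step, assuming the `icrawler` package.
from pathlib import Path
from icrawler.builtin import BingImageCrawler

# Hypothetical subset of the scientific names used as search queries.
QUERIES = [
    "Angelica sinensis (Oliv.) Diels",
    "Pulsatilla chinensis (Bunge) Regel",
]

def crawl_species(name: str, out_root: str = "raw_images", max_num: int = 400) -> None:
    """Download up to `max_num` Bing image results for one scientific name."""
    out_dir = Path(out_root) / name.replace(" ", "_")
    out_dir.mkdir(parents=True, exist_ok=True)
    crawler = BingImageCrawler(storage={"root_dir": str(out_dir)})
    crawler.crawl(keyword=name, max_num=max_num)

if __name__ == "__main__":
    for species in QUERIES:
        crawl_species(species)
```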
Data cleaning
The data-cleaning process began with the development of textual prompts based on medicinal plant labels, followed by the application of CLIP (Contrastive Language-Image Pre-Training)26, a robust vision foundation model, to classify the crawled images. Guided by TCM experts, precise classification thresholds were established for each category, enabling the identification and removal of low-quality samples. Figure 3 illustrates examples of dirty data, such as images containing textual content or pharmaceutical products, which were prevalent in the initial dataset. In this first cleaning phase, we eliminated 52.73% of the initial images as noise. Figure 5 compares the number of images per class before and after cleaning, showing a reduction from 222~400 to 101~354 images per class and highlighting the poor quality of the original data. To further refine the dataset, TCM experts conducted manual verification to address misclassified or overlooked samples. Figure 4 demonstrates two types of errors: mistakenly cleaned samples (e.g., rare variants of “Sambucus williamsii” misclassified due to atypical appearances or obscured features) and mistakenly uncleaned samples (e.g., images with ambiguous visual features or subtle text overlays that evaded automated detection). This comprehensive approach ensured a high-quality dataset for subsequent analysis. Note that a few images in the TCMP-30024 dataset may still contain visible watermarks or embedded annotations (e.g., numbers). According to the observation of Northcutt et al.27, noisy data is a common phenomenon in computer vision datasets such as MNIST28, CIFAR-1029, CIFAR-10029, Caltech-25630, and ImageNet31. This further demonstrates that noisy data is not a problem specific to our dataset but a reality of the computer vision field.
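The following sketch illustrates the foundation-model cleaning idea under stated assumptions: each image is scored against a text prompt built from its class label using the Hugging Face CLIP implementation, and images whose similarity falls below a per-class threshold are dropped. The checkpoint name, prompt template, and threshold value are placeholders, not the exact settings used for TCMP-300.

```python
# Hedged sketch of CLIP-based data cleaning with per-class thresholds.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, label: str) -> float:
    """Cosine similarity between an image and the prompt 'a photo of ... <label>'."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[f"a photo of the medicinal plant {label}"],
                       images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())

def is_clean(image_path: str, label: str, threshold: float = 0.25) -> bool:
    """Keep an image only if its score exceeds the (illustrative) class threshold."""
    return clip_score(image_path, label) > threshold
```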
Data Records
Folder structure and recording format
The TCMP-30024 dataset, publicly available at https://doi.org/10.6084/m9.figshare.29432726, comprises 52,089 images across 300 TCMP categories. Users can download “tcmp-300-release.tar.gz”, which contains the TCMP images, and “tcmp-info-20250329.csv”, which describes the class information. Running “tar -zxvf tcmp-300-release.tar.gz” extracts the image files. The dataset is organized into category-specific subfolders following the naming convention “[Class ID].[English Name of TCMP]” (e.g., “001.Veronica persica Poir.”), with each subfolder containing all corresponding medicinal plant images. All images are provided in widely compatible formats (PNG, JPG, JPEG, and WebP), ensuring seamless integration with modern AI frameworks such as PyTorch32.
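Because the subfolder layout follows the one-folder-per-class convention, it can be read directly with torchvision's ImageFolder. The sketch below assumes the archive has been extracted into a local directory named tcmp-300-release; the directory name and the ImageNet normalization statistics are assumptions.

```python
# Minimal loading sketch for the extracted "[Class ID].[Name]" folder layout.
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # assumed ImageNet stats
                         std=[0.229, 0.224, 0.225]),
])

dataset = datasets.ImageFolder("tcmp-300-release", transform=transform)  # assumed path
loader = torch.utils.data.DataLoader(dataset, batch_size=128,
                                     shuffle=True, num_workers=4)
print(len(dataset.classes), "classes,", len(dataset), "images")
```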
Data splitting
Because our dataset exhibits a long-tailed distribution, we split the data class by class to preserve this characteristic. To divide the whole dataset into training and validation subsets, we first define a threshold τ. For class Ci containing ∣Ci∣ images, with i ∈ [1, 2, ⋯ , n] and n the total number of classes, the numbers of training and validation images are computed by Equ. (1):
$${N}_{train}^{{C}_{i}}=\lfloor \tau \cdot | {C}_{i}| \rfloor ,\quad {N}_{validation}^{{C}_{i}}=| {C}_{i}| -{N}_{train}^{{C}_{i}},$$ (1)
where \({N}_{train}^{{C}_{i}}\) and \({N}_{validation}^{{C}_{i}}\) are the numbers of images in the training and validation subsets for class Ci, respectively, and ⌊⋅⌋ denotes the round-down (floor) operation. In our practice, we set τ = 0.7.
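A minimal sketch of this class-wise split is given below; shuffling with a fixed seed before splitting is an assumption, as the released scripts may order files differently.

```python
# Sketch of the class-wise split: the first floor(tau * |Ci|) images of each
# class go to the training subset and the remainder to the validation subset.
import math
import random
from pathlib import Path

def split_class(image_paths, tau: float = 0.7, seed: int = 0):
    """Return (train_paths, val_paths) for one class under threshold tau."""
    paths = sorted(image_paths)
    random.Random(seed).shuffle(paths)          # assumed shuffling for fairness
    n_train = math.floor(tau * len(paths))      # N_train = floor(tau * |Ci|)
    return paths[:n_train], paths[n_train:]     # remainder is validation

def split_dataset(root: str, tau: float = 0.7):
    """Apply the per-class split to every category subfolder under `root`."""
    train, val = {}, {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            train[class_dir.name], val[class_dir.name] = split_class(
                list(class_dir.glob("*")), tau)
    return train, val
```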
Properties
To make our dataset suitable for a wide range of identification tasks in practice, the TCMP-30024 dataset has three valuable properties:
- Comprehensiveness. Our dataset covers the largest number of TCMP categories among existing datasets. It encompasses six major parts of TCMPs: blossom, stem, leaf, root, fruit, and the whole plant. This extensive categorization ensures a complete representation of the various TCMP components.
- Diversity. To augment the dataset’s diversity, images were scraped from the internet covering various perspectives, indoor and outdoor settings, a spectrum of lighting conditions, and complex backgrounds. This mirrors the multifaceted contexts in which medicinal plants are typically found, and merging images captured from diverse environments supports the generalization of TCMP recognition.
- Long-tailed distribution. The distribution of instances across categories emulates the real-world prevalence of TCM: the number of images for frequently used medicinal plants substantially exceeds that of rarer ones, as illustrated in Fig. 5. This long-tailed distribution requires models to perform well across all categories, even those with few samples.
Technical Validation
Model architectures
We adopt popular image classification models, including ResNet33, DenseNet34, RegNet35, MobileNetV336, ShuffleNetV237, EfficientNet38, ConvNeXt39, ViT40, and Swin Transformer25, to evaluate the dataset. The first seven models are CNN backbones, among which MobileNetV3, ShuffleNetV2, and EfficientNet are lightweight networks designed for edge devices; the last two are Vision Transformer networks.
Classification framework
Figure 6 shows the pipeline of the TCMP image classification framework. The input is an image x with RGB channels. We first apply a data augmentation module τ(⋅) to preprocess the input image and produce an augmented image \(\widetilde{{\boldsymbol{x}}}\), i.e., \(\widetilde{{\boldsymbol{x}}}=\tau ({\boldsymbol{x}})\). A general image classification network f, e.g., ResNet, is then applied to classify the augmented image \(\widetilde{{\boldsymbol{x}}}\). The classification network f typically comprises a feature extractor ϕ for feature learning and a linear classifier ξ that outputs the class probability distribution \({\boldsymbol{p}}=f(\widetilde{{\boldsymbol{x}}})=\xi (\phi (\widetilde{{\boldsymbol{x}}}))\in {{\mathbb{R}}}^{C}\), where C is the number of classes. The feature extractor ϕ includes L stages \({\{{\phi }_{l}\}}_{l=1}^{L}\) to process image features; each stage ϕl downsamples its input to refine hierarchical information and generate semantic features. We apply the standard cross-entropy loss to train the classification network. Given the input image x and its ground-truth label y, the cross-entropy loss is formulated as Equ. (2):
$${{\mathcal{L}}}_{CE}=-\mathop{\sum }\limits_{c=1}^{C}{\tau }_{c,y}\log {p}_{c},$$ (2)
where τc,y is an indicator function: τc,y = 1 if c = y and τc,y = 0 otherwise, and pc is the predicted probability of class c.
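A minimal sketch of this pipeline with a ResNet-18 backbone as the example network f is shown below; PyTorch's nn.CrossEntropyLoss applies the softmax and the indicator-weighted log term of Equ. (2) internally, and the backbone choice is only illustrative.

```python
# Sketch of the classification framework: p = xi(phi(x_aug)) trained with CE.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 300
model = models.resnet18(weights=None)                      # example backbone phi
model.fc = nn.Linear(model.fc.in_features, num_classes)    # linear classifier xi
criterion = nn.CrossEntropyLoss()                          # cross-entropy, Equ. (2)

x = torch.randn(4, 3, 224, 224)            # a batch of augmented images
y = torch.randint(0, num_classes, (4,))    # ground-truth labels
logits = model(x)                          # class scores before softmax
loss = criterion(logits, y)
loss.backward()
```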
Data augmentation
Following the ImageNet classification recipe, we utilize random cropping and flipping as the standard data augmentation41. The input image is first randomly cropped to between 0.08 and 1.0 of the original image area, with an aspect ratio ranging from 3/4 to 4/3. The cropped image is resized to 224 × 224 and horizontally flipped with a probability of 0.5. Finally, it is normalized by the mean and standard deviation over the RGB channels.
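Expressed with torchvision transforms, this standard augmentation reads as follows; the ImageNet normalization statistics are an assumption, since the text only states that per-channel mean and standard deviation are used.

```python
# Standard ImageNet-style augmentation described in the text.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3/4, 4/3)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # assumed ImageNet stats
                         std=[0.229, 0.224, 0.225]),
])
```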
We further propose HybridMix, an image mixture technique, as an extra data augmentation to improve model accuracy. As shown in Fig. 7, given an input image, HybridMix chooses Mixup or CutMix with a probability of 0.5 to perform a global or local image mixture, as detailed below (a code sketch follows the list).
1. Global Image Mixture: Mixup42. Given two different images xi and xj from the mini-batch, Mixup performs a global pixel-wise mixture with a balancing factor λ ∈ [0, 1] to produce a new mixed image \({\widetilde{{\boldsymbol{x}}}}_{ij}\). The label of the mixed image is linearly interpolated between yi and yj with the same factor λ to generate the mixed label \({\widetilde{y}}_{ij}\), as shown in Equ. (3):
$${\widetilde{{\boldsymbol{x}}}}_{ij}=\lambda \ast {{\boldsymbol{x}}}_{i}+(1-\lambda )\ast {{\boldsymbol{x}}}_{j},\,{\widetilde{y}}_{ij}=\lambda \ast {y}_{i}+(1-\lambda )\ast {y}_{j}$$ (3)
2. Local Image Mixture: CutMix43. The main idea of CutMix is to crop a patch from xj and paste it onto xi. We denote the bounding box of the cropped patch on xj as B = (rx, ry, rw, rh), where \({r}_{w}=W\sqrt{\lambda }\) and \({r}_{h}=H\sqrt{\lambda }\), λ ∈ [0, 1], and W and H are the width and height of the image, respectively. rx and ry are uniformly sampled as rx ~ U(0, W) and ry ~ U(0, H). The CutMix image is generated by removing the region B from xi and filling it with the corresponding patch cropped from xj. In practice, a binary mask M ∈ {0, 1}W×H is set to 0 inside the bounding box B and to 1 elsewhere. The resulting CutMix image and interpolated label are expressed as Equ. (4):
$$\widehat{{\boldsymbol{x}}}={\bf{M}}\odot {{\boldsymbol{x}}}_{i}+({\boldsymbol{1}}-{\bf{M}})\odot {{\boldsymbol{x}}}_{j},\,{\widetilde{y}}_{ij}=\lambda \ast {y}_{i}+(1-\lambda )\ast {y}_{j}.$$ (4)
Here, ⊙ denotes element-wise multiplication and 1 = {1}W×H is the mask filled with ones.
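The sketch below implements the HybridMix idea on a mini-batch under stated assumptions: λ is drawn from Beta(1, 1), the CutMix patch size follows the common √(1−λ) convention, and λ is recomputed from the clipped patch area so that it matches the fraction of xi pixels retained; the released implementation may differ in these details.

```python
# Hedged sketch of HybridMix: choose Mixup or CutMix with probability 0.5.
import torch
import torch.nn.functional as F

def hybrid_mix(x: torch.Tensor, y: torch.Tensor, num_classes: int):
    """x: (B, 3, H, W) images, y: (B,) integer labels. Returns mixed images
    and soft labels for a soft-target cross-entropy loss."""
    lam = torch.distributions.Beta(1.0, 1.0).sample().item()   # assumed Beta(1, 1)
    perm = torch.randperm(x.size(0))                           # pair each sample with another
    y_onehot = F.one_hot(y, num_classes).float()

    if torch.rand(1).item() < 0.5:                             # Mixup: global mixture
        x_mix = lam * x + (1 - lam) * x[perm]
    else:                                                      # CutMix: local patch paste
        _, _, H, W = x.shape
        rw, rh = int(W * (1 - lam) ** 0.5), int(H * (1 - lam) ** 0.5)
        cx, cy = torch.randint(W, (1,)).item(), torch.randint(H, (1,)).item()
        x1, x2 = max(cx - rw // 2, 0), min(cx + rw // 2, W)
        y1, y2 = max(cy - rh // 2, 0), min(cy + rh // 2, H)
        x_mix = x.clone()
        x_mix[:, :, y1:y2, x1:x2] = x[perm][:, :, y1:y2, x1:x2]
        lam = 1.0 - (x2 - x1) * (y2 - y1) / (W * H)            # fraction of x_i kept

    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix
```

The returned soft labels can be consumed by nn.CrossEntropyLoss, which accepts probability targets in recent PyTorch versions, or by a manually implemented soft-target cross-entropy.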
Training setup
All models are trained with a stochastic gradient descent (SGD) optimizer with a momentum of 0.9, a weight decay of 1 × 10−4, and a batch size of 128. We use a cosine learning rate schedule: the learning rate starts at 0.1 and gradually decays to 0 over the total 300 epochs.
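The stated optimization setup maps directly onto PyTorch's SGD optimizer and CosineAnnealingLR scheduler, as sketched below; stepping the scheduler once per epoch is an assumption.

```python
# Optimizer and schedule matching the stated setup: SGD (momentum 0.9,
# weight decay 1e-4), initial LR 0.1, cosine decay to 0 over 300 epochs.
import torch

def build_optimizer_and_scheduler(model, epochs: int = 300):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs, eta_min=0.0)
    return optimizer, scheduler

# Typical loop structure (one scheduler step per epoch, assumed):
# for epoch in range(300):
#     train_one_epoch(model, loader, optimizer)
#     scheduler.step()
```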
Evaluation metrics
We utilize Accuracy (Acc), F1 score, Balanced Accuracy (B-Acc), and Balanced F1 score (B-F1) to measure the performance of TCMP recognition. The predictions for each class are summarized by the True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) counts, from which the four metrics are formulated as follows (a scikit-learn sketch follows the list):

1. Accuracy: the proportion of correctly classified samples over all samples:
$$\,{\rm{Acc}}\,=\frac{{\sum }_{c=1}^{C}T{P}_{c}}{{\sum }_{c=1}^{C}(T{P}_{c}+F{N}_{c})},$$ (5)
where C denotes the total number of classes and TPc denotes the True Positives of class c.

2. F1 score: the arithmetic mean (macro average) of the per-class F1 scores:
$$\begin{array}{c}{{\rm{Precision}}}_{c}=\frac{T{P}_{c}}{T{P}_{c}+F{P}_{c}},\,{{\rm{Recall}}}_{c}=\frac{T{P}_{c}}{T{P}_{c}+F{N}_{c}},\\ {{\rm{F1}}}_{c}=\frac{2{{\rm{Precision}}}_{c}\cdot {{\rm{Recall}}}_{c}}{{{\rm{Precision}}}_{c}+{{\rm{Recall}}}_{c}},\,\,{\rm{F1\; score}}=\frac{1}{C}\mathop{\sum }\limits_{c=1}^{C}{{\rm{F1}}}_{c}.\end{array}$$ (6)

3. Balanced Accuracy: the average recall over all classes:
$$\,{\rm{B}} \mbox{-} {\rm{Acc}}=\frac{1}{C}\mathop{\sum }\limits_{c=1}^{C}\,{{\rm{Recall}}}_{c}.$$ (7)

4. Balanced F1 score: the weighted average of per-class F1 scores, where each class is weighted by its proportion of the total sample size:
$${w}_{c}=\frac{{N}_{c}}{N},\,\,{\rm{B}} \mbox{-} {\rm{F1}}=\mathop{\sum }\limits_{c=1}^{C}{w}_{c}{{\rm{F1}}}_{c},$$ (8)
where Nc denotes the number of samples of class c and N denotes the total number of samples.
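These four metrics correspond to standard scikit-learn calls, as sketched below; mapping the F1 score of Equ. (6) to the macro average and B-F1 of Equ. (8) to the support-weighted average follows the definitions above.

```python
# Computing Acc, F1, B-Acc, and B-F1 from predictions with scikit-learn.
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

def evaluate(y_true, y_pred):
    return {
        "Acc":   accuracy_score(y_true, y_pred),                # Equ. (5)
        "F1":    f1_score(y_true, y_pred, average="macro"),     # Equ. (6)
        "B-Acc": balanced_accuracy_score(y_true, y_pred),       # Equ. (7)
        "B-F1":  f1_score(y_true, y_pred, average="weighted"),  # Equ. (8)
    }
```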
Image classification accuracy with advanced data augmentation
As shown in Table 2, we evaluate the proposed dataset on various classification models with regular training and with the advanced data augmentation. Our proposed HybridMix augmentation consistently improves over the baseline training, with an average gain of 1.57% across eleven models, including 1.42% for mobile networks, 1.01% for CNNs, and 2.25% for ViTs. The results indicate that HybridMix enhances the generalization capability for TCMP recognition. Models of different sizes show distinct performance: the lightweight MobileNetV3, ShuffleNetV2, and EfficientNet-B0 models, which are typically deployed on resource-limited mobile devices, achieve 83% ~ 86% accuracy, and accuracy generally improves as the parameter count increases. The best CNN and ViT models are ConvNeXt-Tiny and Swin-Base, which achieve 89.43% and 89.64%, respectively, demonstrating that larger models reach the best accuracy. Given the class imbalance in the dataset, we also report balanced accuracy (B-Acc), F1 score, and balanced F1 score (B-F1) alongside overall accuracy. HybridMix leads to consistent improvements on the B-Acc, F1, and B-F1 metrics, showing that models trained with HybridMix also work well under class imbalance.
Image classification accuracy with various input resolutions
As shown in Table 3, we investigate the effectiveness of the proposed dataset with various input resolutions, conducting experiments on multiple classification models to increase the confidence of the results. The accuracies of the classification models generally improve as the input resolution increases. When the input resolution is increased from 224 × 224 to 336 × 336, the classification models achieve an average accuracy gain of 0.84%; at 448 × 448, the average gain rises to 1.65%, and RegNet-X achieves the best accuracy of 89.54%. Larger input resolutions also consistently enhance the B-Acc, F1, and B-F1 metrics. The experimental results demonstrate that classification accuracy benefits from larger input resolutions.
Analysis of accuracy curve during the training process
Figure 8 shows the accuracy curves of the CNN and ViT models during training. The accuracy steadily increases as training proceeds: it improves rapidly during the first 50 epochs, grows slowly afterwards, and converges by the 300th epoch, where the best accuracy is reached.
Visualization of learned feature spaces
Figures 9 and 10 show t-SNE44 visualizations of the learned feature spaces of various networks for random and hard samples, respectively. For clear visualization, we randomly sample 10 classes from the dataset. In the figures, each cluster of points sharing the same color represents a unique class. As shown in Fig. 9, each network exhibits good intra-class compactness and inter-class separability, indicating that each network learns a discriminative feature space and thus achieves good classification performance. We also analyze the feature space of hard samples in Fig. 10, where a hard sample is defined as a sample misclassified by the AI model. The feature space of hard samples is less discriminative than that of random samples, which helps explain why hard cases tend to be misclassified.
Fig. 9: Visualization of the learned feature space from random samples by t-SNE44.
Fig. 10: Visualization of the learned feature space from hard cases by t-SNE44.
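A sketch of how such t-SNE plots can be produced is given below: penultimate-layer features are extracted from a torchvision-style backbone and projected to two dimensions with scikit-learn. The feature-extraction step assumes a ResNet-like model whose last child module is the classifier; other backbones expose features differently.

```python
# Sketch of the t-SNE feature-space visualization for a sampled class subset.
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

@torch.no_grad()
def extract_features(model, loader, device="cpu"):
    """Collect penultimate-layer features and labels over a dataloader."""
    backbone = torch.nn.Sequential(*list(model.children())[:-1])  # drop classifier
    backbone.eval().to(device)
    feats, labels = [], []
    for x, y in loader:
        f = backbone(x.to(device)).flatten(1)
        feats.append(f.cpu())
        labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

def plot_tsne(features, labels, out_path="tsne.png"):
    """Project features to 2-D and color each point by its class label."""
    emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=5)
    plt.savefig(out_path, dpi=300)
```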
Visualization of attention heatmaps
Figures 11 and 12 show attention heatmaps of a pretrained ResNet-50 model generated by Grad-CAM45. As shown in Fig. 11, the model trained on our TCMP-30024 dataset captures fine-grained discriminative features of critical parts, such as the blossom, stem, leaf, root, fruit, and whole plant, while ignoring background noise. These captured features serve as classification evidence for robust and accurate TCMP recognition. We also found a small number of failure cases in which the model focuses on inaccurate TCMP-related features, as shown in Fig. 12, indicating that AI models may still exhibit potential biases on a few TCMP images.
Fig. 11: Visualization of attention heatmaps with accurate TCMP-related features by Grad-CAM45.
Fig. 12: Visualization of attention heatmaps with inaccurate TCMP-related features by Grad-CAM45.
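For reference, the sketch below computes a Grad-CAM heatmap with forward and backward hooks on the last convolutional stage of a ResNet-50; the target layer and the untrained example model are assumptions, not the exact visualization script used for these figures.

```python
# Minimal Grad-CAM sketch using hooks on a ResNet-50's last conv stage.
import torch
import torch.nn.functional as F
from torchvision import models

def grad_cam(model, layer, image, class_idx=None):
    """image: (1, 3, H, W). Returns a heatmap of shape (H, W) scaled to [0, 1]."""
    acts, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

    logits = model(image)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()          # gradients of the target class score
    h1.remove(); h2.remove()

    weights = grads["v"].mean(dim=(2, 3), keepdim=True)   # GAP over gradient maps
    cam = F.relu((weights * acts["v"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)[0, 0]
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

model = models.resnet50(weights=None).eval()              # placeholder weights
heatmap = grad_cam(model, model.layer4, torch.randn(1, 3, 224, 224))
```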
Usage Notes
Anyone can access the TCMP-30024 dataset from the figshare platform. The standard PyTorch dataloader module can be used to load the dataset for training and validation. We also release several strong AI models pretrained on the TCMP-30024 dataset as baselines and references for future research to explore more powerful models. As shown in Fig. 13, we host a leaderboard for our dataset on Papers with Code at https://paperswithcode.com/sota/image-classification-on-tcmp-300, where anyone can submit improved accuracy online to promote the progress of TCMP recognition.
To facilitate clinical practice, we deploy our model as an online web application on Hugging Face. As shown in Fig. 14, anyone can visit the system at https://winycg-tcmprecognition.hf.space/?__theme=system&deep_link=trNDfB9dwx8 and upload an image for online plant recognition. The system also outputs the top-5 class probabilities to interpret the classification evidence.
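A local top-5 prediction mirroring the web demo could look like the sketch below; the checkpoint filename, backbone choice, and preprocessing are placeholders for the released assets.

```python
# Sketch of offline top-5 inference with a pretrained TCMP-300 classifier.
import torch
from PIL import Image
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

model = models.resnet50(weights=None, num_classes=300)               # example backbone
model.load_state_dict(torch.load("tcmp300_resnet50.pth",             # placeholder path
                                 map_location="cpu"))
model.eval()

image = preprocess(Image.open("query.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    probs = model(image).softmax(dim=1)
top5 = probs.topk(5, dim=1)
for p, idx in zip(top5.values[0], top5.indices[0]):
    print(f"class {idx.item():3d}: {p.item():.3f}")
```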
Our dataset provides a platform for expanding both the number of categories and the number of images per category. We release ready-made tools for image crawling and data cleaning, so users can construct personalized datasets according to their requirements. We hope the TCMP-30024 dataset will grow into a larger-scale dataset with more diverse categories and samples.
Code availability
The GitHub repository for the TCMP-30024 dataset is available at https://github.com/winycg/TCMP-300. This repository offers detailed tools and guidelines for users to integrate the TCMP-30024 dataset into their research. It contains comprehensive tools to promote studies with the TCMP-30024 dataset and enables customization of workflows for image crawling, data cleaning, data preprocessing, model training, model inference, and model deployment. Moreover, we also provide data analysis tools to plot training curves and to visualize t-SNE spaces and Grad-CAM heatmaps.
References
Chen, S. et al. Validation of the its2 region as a novel dna barcode for identifying medicinal plant species. PloS one 5, e8613 (2010).
Zhu, Y. et al. A high-quality chromosome-level genome assembly of the traditional chinese medicinal herb zanthoxylum nitidum. Sci. Data 11, 1–13 (2024).
Li, Q. et al. Chromosome-level genome assembly of the tetraploid medicinal and natural dye plant persicaria tinctoria. Sci. Data 11, 1440 (2024).
Shi, Y. et al. Chromosome-level genome assembly of the traditional medicinal plant lindera aggregata. Sci. Data 12, 565 (2025).
Williams, D. M. Plant Taxonomy: The Systematic Evaluation of Comparative Data, 2nd edition. Syst. Biol. 59, 608–610, https://doi.org/10.1093/sysbio/syq017 (2010).
Darwin, C. et al. On the origin of species by means of natural selection, or the preservation of favoured races in the struggle for life. Oxf. Text Arch. Core Collect. (1859).
Cronquist, A. The status of the general system of classification of flowering plants. Annals Mo. Bot. Gard. 52, 281–303 (1965).
Wang, Y. et al. A haplotype-resolved genome assembly of coptis teeta, an endangered plant of significant medicinal value. Sci. Data 11, 1012 (2024).
Ye, F., Cai, G., Zheng, Z., Qi, X. & Yin, P. Medicinal plant specimen identification based on multi-feature fusion. In Proceedings of the 2011 China Intelligent Automation Academic Conference (2011).
Wang, Z., Du, F., Zhang, L., Wang, S. & Wei, S. Advances in classification and identification of the plants of polygonati rhizoma and its adulterants. North. Hortic. 24, 130–136 (2019).
Li, A., Dai, Z. & Zhang, J. et al. Progress in the application of machine learning in plant phenotyping. Plant Fiber Sci. China 45, 248–260 (2023).
Yanhui, Z., Fan, X. & Zhang, J. A preliminary study on the application of deep learning methods in medicinal plant image recognition based on deeplearning4j on spark. Chin. J. Libr. Inf. Sci. for Tradit. Chin. Medicine 42, 18–22 (2018).
Quoc, T. N. & Hoang, V. T. Vnplant-200 – a public and large-scale of vietnamese medicinal plant images dataset. In Antipova, T. (ed.) Integrated Science in Digital Age 2020, 406–411 (Springer International Publishing, Cham, 2021).
Huang, M.-L., Xu, Y.-X. & Liao, Y.-C. Image dataset on the chinese medicinal blossoms for classification through convolutional neural network. Data Brief 39, 107655, https://doi.org/10.1016/j.dib.2021.107655 (2021).
Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Adv. Neural Information Processing Systems 25 (2012).
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. & Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2818–2826 (2016).
Abdollahi, J. Identification of medicinal plants in ardabil using deep learning : Identification of medicinal plants using deep learning. In 2022 27th International Computer Conference, Computer Society of Iran (CSICC), 1–6, https://doi.org/10.1109/CSICC55295.2022.9780493 (2022).
Howard, A. G. et al. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
B R, P. & Rani, N. S. Dimpsar: Dataset for indian medicinal plant species analysis and recognition. Data Brief 49, 109388, https://doi.org/10.1016/j.dib.2023.109388 (2023).
Sarma, P., Boruah, P. A. & Buragohain, R. Med 117: A dataset of medicinal plants mostly found in assam with their leaf images, segmented leaf frames and name table. Data Brief 47, 108983, https://doi.org/10.1016/j.dib.2023.108983 (2023).
Ronneberger, O., Fischer, P. & Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, 234–241 (Springer, 2015).
Borkatulla, B., Ferdous, J., Uddin, A. H. & Mahmud, P. Bangladeshi medicinal plant dataset. Data Brief 48, 109211, https://doi.org/10.1016/j.dib.2023.109211 (2023).
Tian, D., Zhou, C., Wang, Y., Zhang, R. & Yao, Y. Nb-tcm-chm: Image dataset of the chinese herbal medicine fruits and its application in classification through deep learning. Data Brief 54, 110405, https://doi.org/10.1016/j.dib.2024.110405 (2024).
Zhang, Y. et al. Tcmp-300: A comprehensive traditional chinese medicinal plant dataset for plant recognition. figshare https://doi.org/10.6084/m9.figshare.29432726 (2025).
Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, 10012–10022 (2021).
Stevens, S. et al. Bioclip: A vision foundation model for the tree of life. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19412–19424 (2024).
Northcutt, C. G., Athalye, A. & Mueller, J. Pervasive label errors in test sets destabilize machine learning benchmarks. arXiv preprint arXiv:2103.14749 (2021).
LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
Krizhevsky, A. et al. Learning multiple layers of features from tiny images. Tech. Rep., Master’s thesis, Department of Computer Science, University of Toronto (2009).
Griffin, G. et al. Caltech-256 object category dataset. Tech. Rep., Technical Report 7694, California Institute of Technology Pasadena (2007).
Deng, J. et al. Imagenet: A large-scale hierarchical image database. In IEEE conference on computer vision and pattern recognition, 248–255 (Ieee, 2009).
Paszke, A. et al. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703 (2019).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016).
Huang, G., Liu, Z., Van Der Maaten, L. & Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4700–4708 (2017).
Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K. & Dollár, P. Designing network design spaces. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 10428–10436 (2020).
Howard, A. et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF international conference on computer vision, 1314–1324 (2019).
Ma, N., Zhang, X., Zheng, H.-T. & Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), 116–131 (2018).
Tan, M. & Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, 6105–6114 (PMLR, 2019).
Liu, Z. et al. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 11976–11986 (2022).
Dosovitskiy, A. et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
Yang, C. et al. Gated convolutional networks with hybrid connectivity for image classification. In Proceedings of the AAAI conference on artificial intelligence, vol. 34, 12581–12588 (2020).
Zhang, H., Cisse, M., Dauphin, Y. N. & Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017).
Yun, S. et al. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, 6023–6032 (2019).
Van der Maaten, L. & Hinton, G. Visualizing data using t-sne. J. machine learning research 9 (2008).
Selvaraju, R. R. et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, 618–626 (2017).
Acknowledgements
This work is partially supported by the National Natural Science Foundation of China under Grant Number 62476264 and 62406312, the Postdoctoral Fellowship Program and China Postdoctoral Science Foundation under Grant Number BX20240385 (China National Postdoctoral Program for Innovative Talents), the Beijing Natural Science Foundation under Grant Number 4244098, and the Science Foundation of the Chinese Academy of Sciences.
Author information
Authors and Affiliations
Contributions
C.Y., L.H., and Z.A. led this research. Y.Z. and W.S. were responsible for image crawling and data curation. C.Y. wrote the manuscript, conceived the experiments, analyzed the results and deployed AI models. L.H. and Z.A. investigated related background and works. L.H. and W.T. performed data cleaning. W.F. conducted the experiments. Y.X. refined the organization and writing of the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhang, Y., Sun, W., Yang, C. et al. TCMP-300: A Comprehensive Traditional Chinese Medicinal Plant Dataset for Plant Recognition. Sci Data 12, 1166 (2025). https://doi.org/10.1038/s41597-025-05522-7