Background & Summary

The Guangdong-Hong Kong-Macao Greater Bay Area (hereafter referred to as the GBA) comprises nine cities in Guangdong Province together with the Special Administrative Regions of Hong Kong and Macao. The GBA features a diverse geographical landscape, characterized by mountainous and hilly terrain to the east, west, and north, and is predominantly covered by subtropical evergreen broad-leaved forests and monsoon forests1. The region is distinguished by its humid subtropical climate and high summer precipitation, which foster a wide variety of flora and ecological interactions. The GBA is important both for its economic and innovative activity and for the diverse habitats it supports. Seed plants, including angiosperms (flowering plants) and gymnosperms (conifers and their relatives), constitute the largest component of the region’s plant diversity. Despite its rich botanical diversity, the GBA faces challenges related to plant conservation and sustainable land use. In recent decades, the degradation of vegetation within the GBA has primarily resulted from rapid urbanization, economic growth, and climate change2. As the area has evolved into a major economic hub in China, primary forests have increasingly been converted for infrastructure projects, resulting in substantial ecological degradation and biodiversity loss. This decline adversely affects essential ecosystem services, which are crucial for maintaining environmental balance in the region, and necessitates extensive conservation efforts.

Species identification is crucial for the monitoring, conservation, and sustainable utilization of biodiversity. Traditional morphological classification methods typically rely on structural differences in plant organs. However, identifying species from morphological characters can be challenging for specimens at early growth stages or for those displaying similar characteristics due to shared environmental conditions, particularly among closely related or morphologically similar species3. The rapid advancement of DNA barcoding techniques has significantly improved the accuracy of species identification, even for specimens that have been processed or preserved for extended periods4,5,6,7. By addressing many limitations associated with traditional morphological taxonomy, DNA barcoding plays an essential role in biological research8,9. This approach utilizes one or a few standardized DNA sequences. While cytochrome c oxidase subunit 1 (CO1) has been recommended as a universal marker for most animal species, it is not suitable for plants because mitochondrial genes evolve more slowly in plants than in animals8,10,11. Instead, two core plastid DNA barcode regions, namely partial sequences of the rbcL and matK genes, are employed as standard markers12. These regions are often supplemented with the more variable internal transcribed spacer 2 (ITS2) region of nuclear ribosomal DNA to enhance resolution for seed plants13,14,15,16.

Establishing high-resolution and well-curated taxonomic reference libraries is a necessary first step to ensure accuracy in species identification and discovery17,18. A local DNA barcode library often proves more effective than a global DNA library due to its higher specificity and enhanced resolution. In this study, we constructed a reliable DNA barcode reference database for seed plants in the GBA using three standard barcodes: matK, rbcL, and ITS2 (Fig. 1). The database comprises 2,864 native species across 1,117 genera and 192 families (Table 1). Among these, 695 samples representing 516 species were newly sequenced, contributing 1,860 sequences to GenBank and enriching the ITS2 barcode dataset. Additionally, 1,528 individuals have all three markers, representing 1,002 species across 545 genera and 151 families.

Fig. 1 Workflow of reference DNA barcode library development.

Table 1 Summary of the standard DNA barcodes for native seed plants in the GBA.

The successful establishment of this comprehensive DNA barcode reference database marks a significant advancement in the understanding and preservation of plant diversity19. While previous libraries often centered on specific forest types or ecological zones20,21, our study provides a regionally targeted and taxonomically inclusive barcode reference library specifically for the GBA, which is distinctive in both its taxonomic breadth and its focus on a highly urbanized and fragmented landscape. Importantly, our dataset was constructed to fill the critical gap in localized barcode representation.

The database aids biodiversity monitoring and conservation initiatives by facilitating the rapid and accurate identification of species, including rare and endangered plants in the region. This resource enhances our understanding of plant biodiversity and supports effective conservation strategies, both regionally and globally.

Methods

Generation of the seed plants list in the Guangdong-Hong Kong-Macao Greater Bay Area

The seed plant list was derived from specimen records housed at the Herbarium of South China Botanical Garden (IBSC). Species names and taxonomic authorities were verified using the ‘status’ function of the R package ‘plantlist’ version 0.8.022 and cross-referenced with Plants of the World Online (http://powo.science.kew.org/). The family classification follows APG IV for angiosperms23 and Yang et al.24 for gymnosperms. The GBA has a total of 3,876 native seed plant species belonging to 1,229 genera and 198 families. Among them, 2,864 species from 1,117 genera and 192 families were included in this study (Table 1). Leaf material and sequence data could not be obtained for the remaining species; details can be found in Table S1 at Figshare25. Available barcode information for the included species is documented in “Barcodes information” at Figshare25.

Sample collection and data acquisition

From September 2022 to December 2024, field expeditions were conducted to collect specimens across eight cities in Guangdong Province: Zhaoqing, Foshan, Jiangmen, Guangzhou, Dongguan, Huizhou, Shenzhen, and Zhuhai (Fig. 2, “Barcode Dataset” at Figshare25). Fresh leaves were preserved in silica gel during the fieldwork, and voucher specimens were deposited in the herbarium of the South China Botanical Garden (IBSC).

Fig. 2 Localities of newly collected specimens in the GBA. Red dots mark sampling sites, and dot size indicates the number of samples collected.

In addition, we incorporated partial standard barcode data from previous studies on subtropical regions in southern China20,26. To enhance the representation of taxa and genetic variation among seed plant species in the GBA, we retrieved a set of barcode sequences from GenBank. Sequences were manually filtered to exclude those that were low-quality (e.g., truncated, ambiguous, or lacking voucher data) or taxonomically inconsistent (e.g., outdated synonyms, misspellings, or provisional names not present in the checklist).
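The filtering described above was done manually. As a minimal sketch of how the same criteria could be encoded, the Python function below rejects records that are too short, too ambiguous, unvouchered, or named outside the checklist. The record fields, the minimum length, and the ambiguity cutoff are hypothetical choices for illustration, not values used in this study.

```python
from dataclasses import dataclass

UNAMBIGUOUS = set("ACGT")


@dataclass
class BarcodeRecord:
    """Hypothetical record layout; not the schema of the published tables."""
    accession: str
    species: str
    sequence: str
    voucher: str  # empty string when no voucher is recorded


def ambiguous_fraction(seq: str) -> float:
    """Fraction of bases that are not A, C, G, or T."""
    seq = seq.upper()
    if not seq:
        return 1.0
    return sum(base not in UNAMBIGUOUS for base in seq) / len(seq)


def passes_filter(rec: BarcodeRecord, checklist: set,
                  min_length: int = 200, max_ambiguous: float = 0.01) -> bool:
    """Apply the four exclusion criteria; thresholds are assumed values."""
    if len(rec.sequence) < min_length:                 # truncated
        return False
    if ambiguous_fraction(rec.sequence) > max_ambiguous:  # ambiguous
        return False
    if not rec.voucher.strip():                        # lacking voucher data
        return False
    return rec.species in checklist                    # name must match the checklist
```

A name that fails the checklist test (for example a misspelling or an outdated synonym) would still need manual resolution against the accepted name before the record is discarded or relabelled.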

DNA extraction, amplification and sequencing

Total genomic DNA was extracted from silica-dried leaves following the cetyl trimethyl ammonium bromide (CTAB) method27, which involves cell lysis, removal of protein and debris, and DNA recovery. Amplification was performed using universal DNA barcode primers for rbcL28 (rbcLa_F: ATGTCACCACAAACAGAGACTAAAGC; rbcLa_R: GTAAAATCAAGTCCACCTCG), matK (K. J. Kim, unpublished; matK_3F: CGTACAGTACTTTTGTGTTTACGAG; matK_1R: ACCCAGTCCATCTGGAAATCTTGGTTC), and ITS2 (primers ITS2_S2F/ITS2_S3R15; ITS2_S2F: ATGCGATACTTGGTGTGAAT; ITS2_S3R: GACGCTTCTCCAGACTACAAT). The PCR reaction mixture had a total volume of 25 μl and included 2.5 μl of 10 × PCR buffer (Tris-HCl, 100 mM; KCl, 500 mM; MgCl2, 15 mM), 0.5 μl of each primer (10 μM), 2.0 μl of dNTPs (2.5 μM), 0.5 μl of DNA template (about 20-30 ng), 0.2 μl of rTaq (5 U μl−1), and 18.8 μl of ddH2O29. PCR amplification was performed under the following conditions: initial denaturation at 94 °C for three minutes; 35 cycles of denaturation at 94 °C for 30 seconds, annealing for 45 seconds (at 50 °C for rbcL, 48 °C for matK, and 55 °C for ITS2), and extension at 72 °C for one minute; and a final extension at 72 °C for 10 minutes. All PCR products were visualized on a 1.0% agarose gel and sequenced by Sanger sequencing on an ABI3730 DNA analyzer. Both PCR product purification and Sanger sequencing were carried out by Tsingke Biotechnology Co., Ltd. Raw sequences were subsequently edited and assembled using Geneious 11.0.230,31.
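As a quick consistency check on the protocol above, the following Python sketch sums the listed reagent volumes and estimates the programmed hold time of the thermocycler run. The dictionary names are illustrative, and the time estimate ignores temperature ramping, which depends on the instrument.

```python
# Reagent volumes (μl) as listed in the protocol for one reaction.
REACTION_UL = {
    "10x PCR buffer": 2.5,
    "forward primer": 0.5,
    "reverse primer": 0.5,
    "dNTPs": 2.0,
    "DNA template": 0.5,
    "rTaq": 0.2,
    "ddH2O": 18.8,
}

# Marker-specific annealing temperatures (°C) from the protocol.
ANNEALING_C = {"rbcL": 50, "matK": 48, "ITS2": 55}


def total_volume_ul(mix: dict) -> float:
    """Sum of all component volumes, rounded to avoid float noise."""
    return round(sum(mix.values()), 2)


def cycling_minutes(cycles: int = 35) -> float:
    """Programmed hold time only; ramp times between steps are not included."""
    initial_denaturation_s = 3 * 60
    per_cycle_s = 30 + 45 + 60          # denaturation + annealing + extension
    final_extension_s = 10 * 60
    return (initial_denaturation_s + cycles * per_cycle_s + final_extension_s) / 60


print(total_volume_ul(REACTION_UL))  # 25.0
print(cycling_minutes())             # 91.75
```

The components sum to the stated 25 μl, and the program holds for about 92 minutes per run regardless of marker, since only the annealing temperature differs.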

All sequences obtained by Sanger sequencing were verified using the BLASTn tool. Where a species was absent from the GenBank database, the target fragment was considered reliable if its highest bit-score hit came from an accession of the same genus or family. In total, 695 samples yielded at least one verified sequence in this study.
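The verification rule can be sketched in Python as follows. This is one possible reading of the procedure, assuming that BLASTn hits have already been parsed into records carrying the subject's species name, family, and bit score; the `Hit` structure and the function name are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    """Hypothetical parsed BLASTn hit; field names are illustrative."""
    subject_species: str   # binomial, e.g. "Genus epithet"
    subject_family: str
    bit_score: float


def verification_level(query_species: str, query_family: str,
                       hits: List[Hit]) -> Optional[str]:
    """Return the taxonomic level at which the top-scoring hit agrees
    with the query label, or None when the sequence is not verified."""
    if not hits:
        return None
    best = max(hits, key=lambda h: h.bit_score)
    if best.subject_species == query_species:
        return "species"
    if best.subject_species.split()[0] == query_species.split()[0]:
        return "genus"
    if best.subject_family == query_family:
        return "family"
    return None
```

A result of `None` would flag the sequence for re-examination, for instance as a possible contaminant or a mislabelled sample.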

Data verification

To minimize the impact of missing data, analyses were restricted to species represented by multiple individuals. Raw sequences were aligned using MAFFT v7.4714532 and manually refined in Geneious 11.0.230. Default parameters were applied for aligning rbcL and matK sequences, while ITS2 sequences were aligned separately within each order. Following alignment, the barcodes were concatenated to calculate the resolution percentages for various barcode combinations. Three widely recognized methods were employed to assess the efficacy of DNA barcodes in species authentication: (1) Barcoding gap method: Species with multiple specimens were analysed to detect a barcoding gap, defined as the smallest interspecific genetic distance exceeding the largest intraspecific genetic distance33. The uncorrected intra- and interspecific genetic distances were calculated using the “distancematrix” function in R34. (2) Similarity-based method: The percentage of correctly identified sequences was determined using the “BM/BCM (best match/best close match)” function in “taxonDNA”. In the ‘best match’ analysis, each query is assigned to its closest barcode match, whereas the ‘best close match’ analysis is more rigorous, as it additionally depends on a 95% pairwise distance threshold35. Pairwise genetic distances were calculated using the Kimura 2-parameter (K2P) model. (3) Tree-based method: Species recovered as monophyletic clades with bootstrap support of at least 50% were classified as successful identifications. A Maximum Likelihood (ML) phylogenetic tree was constructed from the combination of the three genetic markers under the GTRGAMMA model in “raxml-hpc2” (v8.2.12) on the CIPRES platform, with 1,000 bootstrap replicates to estimate node support36.
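To make the distance measures and the barcoding gap criterion concrete, here is a minimal Python sketch. It is a stand-in for illustration, not the R and TaxonDNA code used in this study: it computes the uncorrected p-distance and the K2P distance for two aligned sequences, skipping any site with a gap or ambiguous base, and tests whether a species shows a barcoding gap. The function names and the site-skipping rule are assumptions.

```python
import math
from itertools import combinations

BASES = set("ACGT")
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def site_counts(a: str, b: str):
    """Count compared sites, transitions, and transversions."""
    n = ts = tv = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in BASES or y not in BASES:
            continue  # skip gaps and ambiguity codes (an assumed convention)
        n += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            ts += 1
        else:
            tv += 1
    if n == 0:
        raise ValueError("no comparable sites")
    return n, ts, tv


def p_distance(a: str, b: str) -> float:
    """Uncorrected proportion of differing sites."""
    n, ts, tv = site_counts(a, b)
    return (ts + tv) / n


def k2p_distance(a: str, b: str) -> float:
    """Kimura 2-parameter distance: -1/2 ln(1-2P-Q) - 1/4 ln(1-2Q)."""
    n, ts, tv = site_counts(a, b)
    p, q = ts / n, tv / n
    if 1 - 2 * p - q <= 0 or 1 - 2 * q <= 0:
        raise ValueError("sequences too divergent for K2P")
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


def has_barcoding_gap(species: str, records, dist=p_distance) -> bool:
    """records: list of (species, aligned_sequence) pairs.
    True when the smallest interspecific distance exceeds the largest
    intraspecific distance for the given species."""
    own = [s for sp, s in records if sp == species]
    others = [s for sp, s in records if sp != species]
    if len(own) < 2 or not others:
        raise ValueError("need at least two conspecifics and one other species")
    max_intra = max(dist(a, b) for a, b in combinations(own, 2))
    min_inter = min(dist(a, b) for a in own for b in others)
    return min_inter > max_intra
```

The species-level success rate under this criterion is then the share of multi-specimen species for which `has_barcoding_gap` returns `True`.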

Data Records

A total of 695 individuals representing 516 species were successfully sequenced, yielding 591 matK, 680 rbcL, and 589 ITS2 sequences. All DNA barcodes are accessible in GenBank under the following accession numbers: PQ331846-PQ332525 for rbcL, PQ331250-PQ331840 for matK, and PQ564758-PQ565346 for ITS2. The sequence record lists can be found in three supplementary files at Figshare (Tables S2–S4)25. Detailed information on the newly generated sequences is provided in Table S2. For species reported in this region but lacking samples of sufficient quality for DNA barcode sequencing, we selected 5,661 standard barcodes from 887 species based on our previous studies conducted in subtropical regions of China (Table S3)20,26. Additionally, we incorporated 12,838 sequences downloaded from GenBank (Table S4), encompassing 1,461 species across 832 genera and 165 families.

In total, this database comprises 20,359 standard barcodes, covering 2,864 native species from 1,117 genera within 192 families in the GBA37. The families with the highest species diversity are Poaceae, with 219 species from 97 genera; Fabaceae, with 171 species from 74 genera; and Orchidaceae, with 134 species from 54 genera (Table 2). All specimen details and standard DNA barcode sequences have been uploaded to the publicly available BOLD system under the dataset name “DS-GHMGBA”37 (https://doi.org/10.5883/DS-GHMGBA). All supplementary files for this study are shared at Figshare25.
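The record counts reported in this section can be cross-checked with simple arithmetic. The Python sketch below assumes that each GenBank accession range is contiguous and belongs entirely to this dataset; the helper name is illustrative.

```python
import re


def accession_range_size(span: str) -> int:
    """Number of accessions in an inclusive range such as 'PQ331846-PQ332525'."""
    start, end = span.split("-")
    m_start = re.fullmatch(r"([A-Z]+)(\d+)", start)
    m_end = re.fullmatch(r"([A-Z]+)(\d+)", end)
    if not m_start or not m_end or m_start.group(1) != m_end.group(1):
        raise ValueError(f"malformed accession range: {span}")
    return int(m_end.group(2)) - int(m_start.group(2)) + 1


NEW_RANGES = {
    "rbcL": "PQ331846-PQ332525",
    "matK": "PQ331250-PQ331840",
    "ITS2": "PQ564758-PQ565346",
}

new_counts = {marker: accession_range_size(r) for marker, r in NEW_RANGES.items()}
new_total = sum(new_counts.values())

# Newly generated + previous studies (Table S3) + GenBank downloads (Table S4).
library_total = new_total + 5661 + 12838

print(new_counts)     # {'rbcL': 680, 'matK': 591, 'ITS2': 589}
print(new_total)      # 1860
print(library_total)  # 20359
```

The range sizes match the per-marker counts, and the three sources sum to the stated 20,359 barcodes.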

Table 2 Families ranked by number of species for native seed plants in the GBA.

Technical Validation

We evaluated the discriminatory power of the standard barcodes among species with multiple individuals using three common methods (Table 3, Fig. 3). The ITS2 marker exhibited the highest species resolution among the single barcode fragments, achieving identification rates of 87.6%, 69.72%, and 70.87% for the BM/BCM method, the barcoding gap method, and the tree-based method, respectively. However, we found that across higher-level taxonomic groups (such as above the family or order level), ITS2 sequences are so variable that they become very difficult to align effectively. In addition, ITS2 is prone to fungal contamination and contains a large number of paralogous copies38. All of these factors limit the accuracy of using ITS2 alone as a DNA barcode region.

Table 3 Species discrimination rates using three barcodes and their combinations based on three methods for native seed plants of the GBA.
Fig. 3 Species-level discrimination rates for the three barcodes and their combinations using three different methods. BM/BCM: Best Match/Best Close Match; RM: rbcL + matK; RI2: rbcL + ITS2; MI2: matK + ITS2; RMI2: rbcL + matK + ITS2. The results for BM and BCM were identical across all barcodes in this study, so only a single set of values is presented.

In contrast, rbcL and matK provided lower resolution, with identification rates of 56.2% for rbcL and 63.29% for matK using the BM/BCM method. These rates are substantially lower than the 90% successful identification rate reported by Lahaye et al.39 for 1,084 angiosperm species. Using the barcoding gap method, the identification rates of matK and rbcL were 46.9% and 35.72%, respectively, while the tree-based method yielded rates of 31.89% for rbcL and 43.5% for matK. The combination of both plastid barcodes (matK and rbcL) yielded rates of around 55%. Similar findings have been reported in recent studies focusing on seed plants40,41,42,43. The combination of rbcL + matK + ITS2 (referred to as RMI2) had the highest species resolution for both the barcoding gap and the tree-based methods (Table 3), with rates of 76.83% and 71.63%, respectively. Therefore, combining nuclear (ITS2) and plastid markers (rbcL and matK) remains the most robust strategy, balancing primer universality, species resolution, and broader taxonomic applicability13.
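For readers unfamiliar with the similarity-based criterion behind the BM/BCM rates, the following Python sketch shows one common formulation of best match and best close match. It is an illustration rather than the TaxonDNA implementation: the distance function is a plain mismatch proportion on equal-length aligned sequences, and the threshold is passed in as a parameter whose derivation is outside the sketch.

```python
from typing import Callable, List, Optional, Tuple


def mismatch_proportion(a: str, b: str) -> float:
    """Proportion of differing sites between two equal-length aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def classify_query(query_species: str, query_seq: str,
                   library: List[Tuple[str, str]],
                   dist: Callable[[str, str], float] = mismatch_proportion,
                   threshold: Optional[float] = None) -> str:
    """Best match when threshold is None; best close match otherwise.
    library holds (species, sequence) pairs and must not contain the query."""
    if not library:
        raise ValueError("empty reference library")
    scored = [(dist(query_seq, seq), sp) for sp, seq in library]
    best = min(d for d, _ in scored)
    if threshold is not None and best > threshold:
        return "no match"            # closest barcode is too distant
    species_at_best = {sp for d, sp in scored if d == best}
    if species_at_best == {query_species}:
        return "correct"
    if query_species in species_at_best:
        return "ambiguous"           # tie between conspecific and other species
    return "incorrect"
```

The identification rate is the share of queries classified as `"correct"`. The two analyses give identical rates whenever no correctly matched query lies beyond the threshold, which is consistent with the identical BM and BCM values reported in this study.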

The maximum likelihood (ML) phylogenetic tree was constructed using RMI2, based on 1,399 samples representing 979 species from 505 genera across 148 families and 47 orders (Fig. 4). Of the nodes in the tree, 17.83% had support values below 50%, while 14.35% displayed moderate support, with values ranging from 50% to 75%. Notably, 67.82% of the nodes had support values exceeding 75%, providing adequate resolution to distinguish most genera and families. However, 42 of the 505 genera appeared as non-monophyletic in the tree. In these cases, we cannot completely rule out systematic error arising from the limited phylogenetic signal of the barcodes; however, some instances may also be attributed to the inherent complexity of classification in certain plant groups. For example, Turpinia and its relatives are readily distinguishable morphologically, but molecular evidence fails to differentiate them effectively44, as also observed in this study. Some genera, such as Castanopsis, Lithocarpus, and Quercus (Fagaceae), also could not be effectively distinguished by the current DNA barcodes. This is likely due to strong hybridization and introgression among these genera in the past45. A similar process may have occurred in Lauraceae46. Overall, the DNA barcode database for the GBA offers substantial benefits for plant species identification, efficient biodiversity monitoring, and research on plant evolution within the region. We will continue to improve and expand this database to better meet the needs of the local community.
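A genus-level monophyly check of the kind summarized above can be sketched in a few lines of Python. The example assumes a rooted tree encoded as nested tuples, tip labels of the form `Genus_suffix`, and no use of bootstrap support; the toy topology and the genus names in it are invented for illustration and are not results from this study.

```python
def clade_leaf_sets(tree):
    """Return one frozenset of tip labels per clade (tips included)."""
    clades = []

    def visit(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(visit(child) for child in node))
        clades.append(leaves)
        return leaves

    visit(tree)
    return clades


def non_monophyletic_genera(tree):
    """Genera whose tips do not form exactly one clade of the rooted tree."""
    clades = set(clade_leaf_sets(tree))
    all_tips = max(clades, key=len)           # the root clade holds every tip
    by_genus = {}
    for tip in all_tips:
        by_genus.setdefault(tip.split("_")[0], set()).add(tip)
    return sorted(g for g, tips in by_genus.items()
                  if frozenset(tips) not in clades)


# Invented toy topology: GenusB is split across two lineages.
toy_tree = ((("GenusA_1", "GenusA_2"), "GenusB_1"),
            ("GenusB_2", ("GenusC_1", "GenusC_2")))
print(non_monophyletic_genera(toy_tree))  # ['GenusB']
```

On a real ML tree the result depends on where the tree is rooted, so the rooting would need to be fixed (for example with an outgroup) before applying a check like this.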

Fig. 4 The maximum likelihood (ML) tree using a combination of three barcodes for seed plants of the GBA.