Introduction

Spatholobus suberectus is distributed in Fujian Province and the Guangxi Zhuang Autonomous Region of China1. It is a leguminous plant used in traditional Chinese medicine. Dried S. suberectus stem is used as a medicinal component. Due to the red juice that exudes during harvesting, it is also known as “Ji Xue Teng” in China. Modern pharmacological and clinical research has shown that S. suberectus has anti-inflammatory1, antioxidant2, antiphotoaging3, antidiabetic4, and anticancer5 properties. In addition, S. suberectus has long been used as a nourishing food additive (wine, soup and tea) in China2. Owing to its importance in food and medicine, the market demand for S. suberectus is high, but the scarcity of wild resources and long growth period (more than 7 years) before it can be used as a medicine limit its supply6. Unscrupulous businessmen, driven by profits, mix vine plants with S. suberectus, which greatly affects the effectiveness and safety of its use in clinical medicine. The keys to solving this problem will be the development of methods for the identification of S. suberectus and its products and transformation of the supply model from wild resources to artificial cultivation.

Methods such as source, character and microscopic identification and chemical composition analysis are often used to identify medicinal plants or processed products7,8,9. DNA barcoding is a molecular diagnostic technology that uses standard, sufficiently variable DNA fragments for species identification and delimitation10. The DNA barcode fragments are short and easy to amplify, so even partially degraded DNA from samples that are not fresh (e.g., herbarium specimens or prepared products) can be identified11,12. The approach rests on constructing barcode libraries from known taxa, assigning query sequences to those libraries, and analysing phylogenetic relationships to distinguish species13. Internationally recognized candidate sequences for plant DNA barcodes include plastid regions (matK, rbcL, ycf, psbA-trnH, etc.) and the nuclear internal transcribed spacer (ITS) region14. The Consortium for the Barcode of Life (CBOL) Plant Group proposed the combination of plastid and nuclear ITS regions as an effective barcoding tool for distinguishing plant species15. The China CBOL Plant Group has incorporated ITS (or ITS2) into the core barcode for seed plant identification, and psbA-trnH is recommended as an auxiliary barcode16. Chen et al.17 conducted a comparative analysis of the amplification success rate, intraspecific and interspecific variation, and barcoding gap of multiple candidate sequences and reported that ITS2 performed the best. They also evaluated the identification ability of ITS2 using 6,600 samples of 4,800 plant species and obtained a species identification success rate of 92.7%. Therefore, the use of the ITS2 sequence as a universal DNA barcode for medicinal plants was proposed. The “Pharmacopoeia of the People’s Republic of China” (2015 edition) includes the guiding principles for DNA barcoding technology and establishes a Chinese herbal medicine identification system based on ITS2 (ITS) supplemented with psbA-trnH18,19.

Several DNA barcodes have been found to be useful for identifying S. suberectus. An et al. used 26S rDNA to distinguish S. suberectus, Callerya dielsiana, Derris taiwaniana, Mucuna sempervirens and Derris trifoliata through seven samples20. Huang et al. used matK to distinguish S. suberectus, D. trifoliata, Entada phaseoloides, Callerya cinerea and Sargentodoxa cuneata through 8 samples21. Zhou et al. used psbA-trnH to distinguish S. suberectus, S. cuneata, Kadsura interior, Kadsura heteroclita, M. sempervirens, Mucuna birdwoodiana, C. dielsiana and Callerya tsui through 79 samples22. ITS2 is located between the 5.8S and 26S eukaryotic ribosomal RNA genes and does not encode proteins23. Since the ITS region is not incorporated into the ribosome, it is subject to less natural selection pressure during evolution, thus tolerating more variation and showing extremely extensive sequence polymorphism in most eukaryotic organisms24,25. The ITS2 region has been used as a phylogenetic marker to identify many medicinal plants, closely related plants and a wide range of species10. Bupleurum L. (Apiaceae)26, Uncaria27, Aristolochia28, Eryngium29, Gnaphalium affine30, Rheum officinale31 and other medicinal plants can all be identified to a certain extent via ITS2 barcodes. We used ITS2 to distinguish S. suberectus from the source species of almost all easily confused products (17 species). This study addresses the lack of a universally applicable barcode for the identification of S. suberectus, thereby helping to ensure the safe use of S. suberectus as a medicine.

The highly variable ITS2 is not only used in species identification but also contributes to genetic diversity analysis of species and varieties. Khazal et al. analysed the genetic diversity of Leishmania major through phylogenetic inference based on ITS232. Delva et al. analysed the genetic diversity of Amylomyces rouxii through phylogenetic analysis, genetic distance, genetic variation and haplotype network construction on the basis of ITS1/ITS2 and D1/D233. Lin et al. used ITS2 and the mitochondrial cytochrome c oxidase subunit 1 gene (cox1) as genetic markers to conduct genotyping analysis, identified 17 different ITS2 haplotypes and determined the population genetic structure of Sargassum plagiophyllum C. Agardh34. The mature secondary structure of the catalytic ribosomal RNA (a central loop connected to a four-finger structure) is highly conserved35,36. The prediction of secondary structure can not only serve to supplement and verify phylogeny at the sequence level but also assist in the discovery of genotypic variations in the population29. Umdale et al. evaluated the species and genetic diversity of Asian Vigna through haplotype and secondary structure analysis based on ITS237. Therefore, we analysed the genetic diversity of wild S. suberectus via ITS2. Genotype mining will provide information and labels for screening excellent varieties in the future and lay the foundation for artificial cultivation.

The guanine and cytosine (GC) content provides the material basis for species diversity and genetic diversity. GC base pairs also guarantee the structural stability of double-stranded DNA and RNA38,39. The GC content and distribution may be constrained and driven by structure, thermodynamic stability, and other factors40,41. The GC content and distribution are also reflective of sequence selection and structural evolution42. A recent study revealed that the paired region of angiosperm ITS2 contains a relatively high GC content and that GC-biased gene conversion (gBGC) is one of the main reasons for the high GC content43. We explored the evolution of S. suberectus ITS2 in relation to structure-related GC substitution trends and mechanisms and the relationships between GC content and species differentiation and genetic diversity.

Materials and methods

Sample collection and specimen identification

Field sampling of S. suberectus and source plants of easily confused products was conducted from May to June 2023. Fresh leaves from a total of 56 samples, including S. suberectus (39), M. sempervirens (6), and Craspedolobium unijugum (11), were collected in this study. The leaves were immediately placed into a sealed plastic bag containing enough silica gel to avoid DNA degradation. All the plants were identified by Prof. Yunfeng Huang and Prof. Kejian Yan of the Guangxi Institute of Chinese Medicine & Pharmaceutical Science. S. suberectus (Herbarium: 00327831), M. sempervirens (Herbarium: 02014496), and C. unijugum (Herbarium: 02028691) can be identified in the Chinese Virtual Herbarium (https://www.cvh.ac.cn/index.php). The samples were obtained mainly from the Guangxi Zhuang Autonomous Region and Yunnan Province, China. The distance between all the sampled individuals of the same population was greater than 50 m. Sample information is shown in Table S1.

DNA extraction, amplification, and sequencing

The genomic DNA of all the samples was extracted from approximately 15 mg of silica gel-dried leaves via the Fast Pure Plant DNA Isolation Mini Kit (Vazyme, Nanjing, China). The quality and concentration of the genomic DNA were determined via a NanoDrop 1000 spectrophotometer. Each DNA solution was diluted or concentrated to approximately 50 ng/µL for PCR amplification. PCRs were performed in a volume of 25 µL, which consisted of 50 ng of template DNA (1 µL), 12.5 µL of 2×Taq PCR master mix (Vazyme, Nanjing, China), 2 µL of 10 µmol/L forward and reverse primers, and 9.5 µL of ddH2O. The primers used were as follows: ITS2-2F “ATGCGATACTTGGTGTGAAT” and ITS2-3R “GACGCTTCTCCAGACTACAAT”10. The reaction program was as follows: 94 °C for 5 min; 30 cycles of 94 °C for 30 s, 58 °C for 45 s, and 72 °C for 45 s; and 72 °C for 10 min. All the PCR products were detected via agarose gel electrophoresis, and the gel was photographed via a UV transilluminator. The product was purified via a Fast Pure Gel DNA Extraction Mini Kit (Vazyme, Nanjing, China), and the purified product was sequenced on an ABI 3130xl automatic sequencer (Applied Biosystems, Foster City, California, USA).
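The reaction composition above can be checked and scaled with a few lines of arithmetic. The following is a minimal sketch, not part of the original protocol: it assumes that the 2 µL of primers is the combined forward and reverse volume (which makes the components sum to 25 µL), and the helper name scale_mix and the 10% pipetting overage are hypothetical choices made for illustration.

```python
# Per-reaction volumes in microlitres, taken from the protocol text.
# Assumption: the 2 uL of primers is the combined forward + reverse volume.
REACTION_UL = {
    "template_dna": 1.0,
    "taq_master_mix_2x": 12.5,
    "primers_10uM": 2.0,
    "ddH2O": 9.5,
}


def scale_mix(per_reaction, n_reactions, overage=0.10):
    """Scale each component to n reactions plus a fractional pipetting overage."""
    if n_reactions < 1:
        raise ValueError("n_reactions must be at least 1")
    factor = n_reactions * (1.0 + overage)
    return {name: round(vol * factor, 2) for name, vol in per_reaction.items()}


# The listed components should add up to the stated 25 uL reaction volume.
total_per_reaction = sum(REACTION_UL.values())

# Volumes for the 56 samples of this study (overage is a hypothetical default).
batch = scale_mix(REACTION_UL, n_reactions=56)
```

In practice the template DNA is added to each tube separately, so only the remaining components would be pooled into a master mix.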

Sequence assembly, feature comparison and genetic analysis

The sequencing chromatograms were assembled and proofread via CodonCode Aligner 8.0.2 software. The primers and low-quality regions of the sequenced ITS2 sequences were trimmed according to the annotation file to obtain the complete sequences. The HMMer annotation method, which is based on the hidden Markov model, was used to remove the 5.8S and 28S sequences to obtain accurate ITS2 spacer sequences44. We used BLAST to search for homologous genes for all obtained sequences against the National Center for Biotechnology Information (NCBI) database (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome, accessed on 28 September 2023), and the source species were determined on the basis of the homologous genes with the highest similarity score and the lowest E value45. The ITS2 sequences of S. suberectus, M. sempervirens and C. unijugum were submitted to the GenBank database (https://www.ncbi.nlm.nih.gov/genbank/, accession numbers: PP465924-PP465979, accessed on 12 March 2024). Two hundred and thirty-six (236) ITS2 sequences from 15 adulteration-prone species were downloaded from GenBank (https://www.ncbi.nlm.nih.gov/genbank/, accessed on 28 September 2023) and analysed together. Searches for C. dielsiana (Herbarium: 02015435), S. cuneata (Herbarium: 02108794), M. birdwoodiana (Herbarium: 02098429), C. cinerea (Herbarium: 01924183), K. interior (Herbarium: 02015552), Kadsura heteroclita (Herbarium: 02231169), D. trifoliata (Herbarium: 01924020), E. phaseoloides (Herbarium: 02074730), Mucuna macrocarpa (Herbarium: 01781725), Padbruggea filipes (Herbarium: 1289072), Bauhinia championii (Herbarium: 01965870), Callerya nitida (Herbarium: 02036623), Schisandra propinqua (Herbarium: 02231152), and Schisandra henryi (Herbarium: 02231143) were performed against the Chinese Virtual Herbarium database (https://www.cvh.ac.cn/index.php). Searches for Wisteriopsis reticulata (ID: 88791) were performed against the Chinese Field Herbarium database (https://www.cfh.ac.cn/album/ShowSpAlbum.aspx?spid=88791). MAFFT was used to perform sequence alignment46. The tool trimAl was used to trim aligned sequences47. ModelTest-NG was used to select the optimal evolutionary model for ITS2 sequences48. In accordance with the corrected Akaike information criterion (AICc), the JC model was selected to construct a phylogenetic tree via RAxML-NG software49. The bootstrap method (1000 replicates) was used to check the support rate of each branch50. The R package ggtree was used for visualization of evolutionary trees51.
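The alignment, trimming, model selection and tree inference steps can be chained as sketched below. This is an illustrative outline only: the text names the tools, the JC model and 1000 bootstrap replicates, but not the command-line options, so the options shown here (MAFFT --auto, trimAl -automated1, the file names and the prefix) are assumptions rather than the settings actually used. The function only builds the argument lists and runs nothing.

```python
def build_commands(fasta, prefix="its2", model="JC", bootstraps=1000):
    """Return argument lists for each step; file names and options are assumed."""
    aligned = f"{prefix}.aln.fasta"
    trimmed = f"{prefix}.trim.fasta"
    return {
        # MAFFT writes the alignment to stdout; the caller redirects it to `aligned`.
        "align": ["mafft", "--auto", fasta],
        # trimAl with an automated trimming heuristic (assumed option).
        "trim": ["trimal", "-in", aligned, "-out", trimmed, "-automated1"],
        # ModelTest-NG on nucleotide data; the best model is read from its report.
        "model": ["modeltest-ng", "-i", trimmed, "-d", "nt"],
        # RAxML-NG: ML search plus bootstrapping in a single run.
        "tree": [
            "raxml-ng", "--all", "--msa", trimmed,
            "--model", model, "--bs-trees", str(bootstraps),
            "--prefix", prefix,
        ],
    }


commands = build_commands("all_its2.fasta")
```

Each list can be passed to subprocess.run once the tools are installed; for the alignment step, stdout would be redirected to the file that trimAl reads.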

The R package DECIPHER was used to calculate intraspecific and interspecific genetic distances52. The R package ggplot2 was used to visualize the results in the form of box plots53. DnaSP v6.0 was used to analyse the Ks value between the ITS2 sequences of S. suberectus and other plants, with the sequences treated as noncoding54. The results are presented as density plots via the R package ggplot253. The R package pegas was used for statistical analysis of the haplotypes, and the results were visualized via the basic plotting functions of R software55. RNAfold software was used to obtain the secondary structure of ITS2 from S. suberectus56. LocARNA software was used to obtain consensus secondary structures and secondary structure-based phylogenetic trees57.
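To make the distinction between intraspecific and interspecific distances concrete, the sketch below computes an uncorrected p-distance between aligned sequences and sorts the pairwise values relative to a focal species. It is a simplified stand-in, not the DECIPHER calculation used in this study, and the function names and toy sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences, skipping gaps."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # ignore alignment gaps
        compared += 1
        differing += x != y
    return differing / compared if compared else 0.0


def split_distances(samples, focal):
    """samples maps sample id -> (species, aligned sequence).

    Returns (intraspecific, interspecific) distance lists for the focal species.
    """
    intra, inter = [], []
    for (_, (sp1, s1)), (_, (sp2, s2)) in combinations(samples.items(), 2):
        d = p_distance(s1, s2)
        if sp1 == focal and sp2 == focal:
            intra.append(d)
        elif focal in (sp1, sp2):
            inter.append(d)
    return intra, inter


# Toy alignment for illustration only; not data from this study.
toy = {
    "a": ("focal", "ACGT"),
    "b": ("focal", "ACGA"),
    "c": ("other", "TTGA"),
}
intra, inter = split_distances(toy, "focal")
```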

In the studies of Xian and Liu et al., the DNA/RNA hybrid substitution model was used to explain the substitution patterns of the ITS2 paired and unpaired regions43,58. We used the substitution model selection script (model_selection.pl) in PHASE 3.0 to select the best substitution model on the basis of the AICc value59. The phylogeny of ITS2 was inferred on the basis of sequence alignment files, consensus secondary structure files and NJ trees. MCMC analysis was performed for 10,000,000 generations to reach convergence, with sampling every 100 generations and the first 30,000 sampled trees (30%) discarded as burn-in. The remaining trees were used to infer substitution rates at initial and equilibrium states via the mcmcsummarize program of the PHASE package.
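The sampling figures quoted for the MCMC run follow from simple arithmetic, sketched here with a hypothetical helper (it is not part of PHASE): 10,000,000 generations sampled every 100 generations give 100,000 trees, of which 30% are discarded.

```python
def mcmc_sample_counts(generations, sample_every, burn_in_percent):
    """Return (total sampled trees, burn-in trees, retained trees)."""
    if sample_every < 1 or not 0 <= burn_in_percent <= 100:
        raise ValueError("invalid sampling interval or burn-in percentage")
    total = generations // sample_every
    burn_in = total * burn_in_percent // 100  # integer arithmetic avoids rounding
    return total, burn_in, total - burn_in


# Settings stated in the text: 10,000,000 generations, every 100th sampled, 30% burn-in.
total, burn_in, retained = mcmc_sample_counts(10_000_000, 100, 30)
```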

The equilibrium GC content (GC*) was calculated according to the method of Xian and Liu et al.43,58. In the convergence state, the GC content of the sequence in the equilibrium substitution mode can be calculated as the percentage of the AT→GC substitution rate in the sum of the AT→GC and GC→AT substitution rates60.
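Written out, the definition above is GC* = r(AT→GC) / [r(AT→GC) + r(GC→AT)], expressed as a percentage. A minimal sketch of this calculation follows; the function name and the example rates are hypothetical and are not values estimated in this study.

```python
def equilibrium_gc(rate_at_to_gc, rate_gc_to_at):
    """Equilibrium GC content (GC*) as a percentage of the two opposing rates."""
    if rate_at_to_gc < 0 or rate_gc_to_at < 0:
        raise ValueError("substitution rates must be non-negative")
    total = rate_at_to_gc + rate_gc_to_at
    if total == 0:
        raise ValueError("at least one substitution rate must be positive")
    return 100.0 * rate_at_to_gc / total


# Hypothetical rates: AT->GC three times as frequent as GC->AT.
example_gc_star = equilibrium_gc(3.0, 1.0)
```

When the AT→GC rate exceeds the GC→AT rate, GC* lies above 50%; comparing GC* with the observed GC content then indicates whether the GC content is expected to rise or fall under the inferred substitution pattern.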

Results

Species identification, sequence characterization and phylogenetic inference

We performed BLAST alignment of the ITS2 sequences of 56 samples. Consistent with the morphological identification, S. suberectus (39), M. sempervirens (6) and C. unijugum (11) were identified. Percent identity was above 96% in all cases. The average percent identity of S. suberectus ITS2 was 99.83%. A total of 236 sequences from adulteration-prone species in GenBank were analysed together. S. suberectus had the shortest sequence (201/202 bp) and the highest GC content (69.31–71.29%). S. cuneata had the longest sequence (228–257 bp). B. championii had the lowest GC content (52.75–54.13%) (Table S1).

At the sequence level, phylogenetic analysis was performed according to the ML method to better distinguish species. There were obvious topological differences between S. suberectus and 17 easily confused species, including M. sempervirens and C. unijugum, indicating the usefulness of ITS2 in identifying S. suberectus (Fig. 1).

Fig. 1

ITS2-based phylogenetic analysis of S. suberectus and easily confused species. The source species of different sequences are marked with different colours and shapes, and the colour of the circles at the nodes, changing from green to red, represents an increase in bootstrap value.

Genetic differentiation of S. suberectus and the source plants of adulterated products

Genetic distance models are used to measure the extent of genetic differences between species. The intraspecific genetic distance of S. suberectus was distributed between 0 and 0.244, with an average value of 0.149 (Fig. 2). The average interspecific genetic distance between S. suberectus and S. propinqua was the smallest (0.648), ranging from 0.609 to 0.741. The average interspecific genetic distance between S. suberectus and P. filipes was the greatest (0.746), ranging from 0.721 to 0.756. The intraspecific genetic distances of S. suberectus were all smaller than the interspecific genetic distances of S. suberectus and other species, indicating clear genetic differences.

Fig. 2

The intraspecific genetic distance of S. suberectus and the interspecific genetic distance between S. suberectus and other species. The red dots represent the average genetic distance.

Ks can be used to compare gene duplication events and evolutionary rates within and between species. The Ks value within S. suberectus was the smallest (0.002), while the Ks value between S. suberectus and K. heteroclita was the highest (0.352) (Fig. 3). These findings indicated that S. suberectus and K. heteroclita completed the differentiation of ITS2 at an early stage. The Ks distribution of ITS2 within S. suberectus had two peaks separated by a large interval, indicating that ITS2 had undergone at least two large-scale duplications in S. suberectus and that the differentiation rate was slow.

Fig. 3

Ks frequency distributions of ITS2 within S. suberectus and between S. suberectus and other species.

Intraspecific variation in S. suberectus ITS2 sequences

We used a haplotype network to analyse the genetic variation in the ITS2 sequence of S. suberectus. The 39 ITS2 sequences were divided into 8 haplotypes (H). The main haplotypes were H2 and H6, which contained 11 and 20 sequences, respectively (Fig. 4). There were fewer mutation sites between H2 and H1/3/4/5 (5, 5, 4 and 1, respectively) and between H8 and H6/7 (1 and 2, respectively). There were 21 mutations between H2 and H7, indicating that ITS2 evolved in two different directions, toward H2 and H6. In addition, the only member of H7 was SS030, whose base at position 183 could not be resolved and was recorded as the ambiguity code Y (C or T). H7 and H6 had no other differences except at this position. This result indicated that the base at position 183 of SS030 may be T, and H6 and H7 can be classified into the same haplotype.
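The reasoning about H6 and H7 amounts to asking whether two aligned sequences remain compatible once ambiguity codes are taken into account. The sketch below illustrates that check with the standard IUPAC nucleotide codes; it is not the pegas procedure used in this study, and the short sequences are toy data rather than the actual haplotypes.

```python
# Standard IUPAC nucleotide codes mapped to the bases they can represent.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"}, "W": {"A", "T"},
    "K": {"G", "T"}, "M": {"A", "C"},
    "B": {"C", "G", "T"}, "D": {"A", "G", "T"},
    "H": {"A", "C", "T"}, "V": {"A", "C", "G"},
    "N": {"A", "C", "G", "T"}, "-": {"-"},
}


def incompatible_sites(a, b):
    """1-based alignment positions where two sequences cannot share a base."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        i + 1
        for i, (x, y) in enumerate(zip(a.upper(), b.upper()))
        if not IUPAC[x] & IUPAC[y]
    ]


def compatible(a, b):
    """True if the two sequences could belong to the same haplotype."""
    return not incompatible_sites(a, b)


# Toy example: Y (C or T) is compatible with T but not with A.
same = compatible("ACGTT", "ACGYT")
diff_sites = incompatible_sites("ACGAT", "ACGYT")
```

Compatibility defined in this way is not transitive, so merging haplotypes on this basis is best confirmed by resequencing the ambiguous site.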

Fig. 4

S. suberectus ITS2 haplotype network. Different haplotypes are represented by circles of different colours, and the size and number of sectors they are divided into represent the number of sequence entries that make up the haplotype. The length of the lines between haplotypes represents the number of mutation sites. Variant positions and changed bases between haplotypes are marked and connected via dashes.

Prediction and phylogenetic inference of the secondary structure of S. suberectus ITS2

A phylogenetic tree was constructed on the basis of the ITS2 sequence and secondary structure of S. suberectus. The 39 S. suberectus ITS2 sequences were divided into 4 branches (Fig. 5). The members within each branch contained more similar sequences and secondary structures (Figures S1 and S2). ITS2 of branches I, II, and III all contained a classic four-arm structure with one ring, whereas the four arms of ITS2 of branch IV were distributed on a free single strand. In all branches, structure IV had the most rings. Clade I contained 3 bulges, 3 internal loops and a hairpin loop, whereas Clade II contained 2 bulges, 4 internal loops and a hairpin loop. This was also the most important structural difference between Clades I and II. Structure IV of Clade III contained 3 bulges, 2 internal loops and a hairpin loop, whereas structure IV of Clade IV contained 4 bulges, 2 internal loops and a hairpin loop. In addition to the central loop, Clade III had another multiple loop in structure I, which was quite different from the results for the other clades.

Fig. 5

Phylogenetic tree of S. suberectus ITS2 and consensus secondary structures of members of each clade. The colour gradient from red to green represents an increase in the degree of base conservation.

Structure-based GC heterogeneity and mutation direction of S. suberectus ITS2

Liu et al. defined the equilibrium GC content (GC*) as the GC content when the substitution pattern of the sequence remains unchanged over time (convergent evolution) in the future equilibrium state43. The GC* provides clues for inferring the evolution trend of the GC content. We performed statistical analysis of the GC content (pGC and upGC) of the paired and unpaired regions as well as the equilibrium GC (pGC* and upGC*) (Fig. 6). The pGC (75.85 ± 0.49) was significantly greater than the upGC (58.12 ± 0.87). The pGC* (70.4) was lower than the current pGC, indicating a downwards trend in paired region GC replacement patterns. In addition, the upGC* (58.28) was similar to the current upGC content, indicating the opposite evolutionary trend for paired regions and nonpaired regions.
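The paired and unpaired GC contents compared above can be derived from a sequence together with its secondary structure in dot-bracket notation, the format produced by RNAfold. The following is a minimal sketch under that assumption; the exact procedure used in this study is not described, and the function name and the toy hairpin are hypothetical.

```python
def gc_by_pairing(seq, structure):
    """Return (pGC, upGC) as percentages from a sequence and dot-bracket string."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    paired = [b for b, s in zip(seq.upper(), structure) if s in "()"]
    unpaired = [b for b, s in zip(seq.upper(), structure) if s == "."]

    def gc_percent(bases):
        # An empty region is reported as 0.0 rather than raising an error.
        return 100.0 * sum(b in "GC" for b in bases) / len(bases) if bases else 0.0

    return gc_percent(paired), gc_percent(unpaired)


# Toy hairpin: a GC-rich stem (paired) around an AU-rich loop (unpaired).
p_gc, up_gc = gc_by_pairing("GGCAUAGCC", "(((...)))")
```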

Fig. 6

Comparison of the GC and equilibrium GC (GC*) contents of paired and unpaired regions of the ITS2 secondary structure. Boxplots with data points in different colours represent the GC content of paired and unpaired regions (pGC and upGC), respectively. GC* values in different regions are marked with red solid lines. The red lines for the paired and unpaired regions are marked on the right with “pGC*” and “upGC*”, respectively.

We used the best substitution model, HKY85 + G_RNA16A, which is based on the lowest AICc value, to infer the base pair substitution process. We found a total of 8 double-base substitutions, including correctly paired substitutions (such as AU→GC) and hybrid mismatched substitutions (such as GU→GC) (Fig. 7A and B). We also identified 12 possible single-base substitution events. They included 8 heterozygous mismatches (such as GU→GC) and 4 homozygous mismatched substitutions (such as GG→GC) (Fig. 7C and D). When substitution occurred in the initial or convergent state, the transition rate generated by the driving GC was always higher than that generated by the AU (Fig. 7). Base pair substitutions primarily drove the generation of correct pairs (AU and GC) through single-base substitutions. The substitution rate in the convergence state was higher than the initial substitution rate (Fig. 7C and D).

Fig. 7

Base substitution rates for generating AU and GC in the initial state (I) and equilibrium state (E). Both nucleotides in the base pair were substituted to produce AU and GC. Before the substitution, they exhibited correct pairing (A) and heterozygous pairing (B), respectively. Only one nucleotide in the base pair was substituted to produce AU and GC. Before substitution, the pairs exhibited heterozygous pairing (C) and homozygous pairing (D). Different substitution processes are marked with different colours in the legend.

Discussion

ITS2 is a DNA barcode that can be used to effectively identify S. suberectus

The ITS region is one of the most widely used DNA barcodes. The noncoding internal transcribed spacers (ITS1 and ITS2) of ribosomal DNA have a higher evolutionary rate than the coding regions do, show a high degree of differentiation at the species level, and can be used to identify closely related species61. Their recognition ability exceeds that of the plastid region15,62,63,64. Amplification and sequencing success rates are the basis for barcoding applications65. Kress et al. proposed that short DNA sequences are easier and more economical to extract and sequence66. Cahyaningsih et al. reported that the GC content is positively correlated with sequencing accuracy67. The ITS2 sequences used in this study met these conditions (length ~220 bp, GC content ~61.74%) (Table S1).

Meier et al. used the condition that the minimum interspecific genetic distance was greater than the maximum intraspecific genetic distance as the criterion for effectively distinguishing species68. The Ks value is positively correlated with the degree of differentiation69. These criteria, combined with our results (Figs. 2 and 3), suggest that ITS2 is suitable for the identification of S. suberectus. In addition, compared with the studies on the identification of S. suberectus using 26S rDNA (7 samples, 4 species)20, matK (8 samples, 5 species)21 and psbA-trnH (79 samples, 8 species)22, our study included the source species of almost all easily confused products (292 sequences, 17 species), providing more comprehensive and valuable results for practical applications.
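The criterion attributed to Meier et al. reduces to a single comparison, sketched below with a hypothetical helper. The example reuses two figures reported in the Results, the maximum intraspecific distance of S. suberectus (0.244) and the lower bound of its distance range to S. propinqua (0.609); the remaining values are range endpoints quoted in the text and serve as placeholders for the full distance lists.

```python
def has_barcoding_gap(intraspecific, interspecific):
    """True if the smallest interspecific distance exceeds the largest intraspecific one."""
    if not intraspecific or not interspecific:
        raise ValueError("both distance lists must be non-empty")
    return min(interspecific) > max(intraspecific)


# Range endpoints quoted in the Results, used here as stand-ins for full lists.
gap_present = has_barcoding_gap([0.0, 0.244], [0.609, 0.756])
```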

Genetic variation of S. suberectus ITS2

ITS2 has sufficient variation to be an essential marker for classification and genetic diversity analysis of animals, plants and microorganisms37,70. Ding et al. used ITS2 sequences to evaluate the genetic differences of Artemisia annua71. Lin et al. reported that S. plagiophyllum on the west coast of Thailand contained a total of 17 different ITS2 haplotypes34. Our study revealed that S. suberectus contained 8 ITS2 haplotypes and that there were two main haplotypes (H2 and H6) (Fig. 4). We speculate that fewer types of variation result from homogenization caused by natural selection. Preliminary analysis of ITS2 haplotypes is the basis for distinguishing the molecular characteristics of members within species72. In the future, joint analysis of medicinal ingredient content, haplotypes and copy numbers will promote the development of screening methods for high-quality S. suberectus.

ITS2 secondary structure is highly relevant to species taxonomy73. It is difficult to use ITS2 sequences to identify changes at the species level, but comparisons of secondary structures make up for this shortcoming29. ITS2 secondary structures often differ among genotypes. The secondary structure of ITS2 can be used as a marker for the genotypes of Eryngium foetidum29 and Colocasia esculenta74. We predicted the secondary structure of ITS2 from S. suberectus (Figures S1 and S2). On the basis of these findings, we constructed a phylogenetic tree and drew a consensus secondary structure map (Fig. 5). These results can be used to develop variety markers for the cultivation and selection of wild resources and promote the protection and utilization of wild resources in the future.

A high GC content is the basis for the successful identification and analysis of the genetic diversity of S. suberectus via ITS2

During meiosis, chromosomal recombination results in base mismatches75. The gBGC hypothesis suggests that DNA repair mismatches are preferentially converted to GCs rather than ATs76. ITS2 is a region of ribosomal DNA (nrDNA) with a high local recombination rate. ITS2 evolved due to chromosomal recombination in a wide range of organisms61. Rapidly reorganized regions containing higher GC contents are thought to be characteristic of the gBGC model77. gBGC is considered one of the reasons for the increased GC content in the ITS2 of angiosperms, including those of the genus Corydalis43,58. The conversion of base pairs in the pairing region of ITS2 of S. suberectus to GC is consistent with the above characteristics (Fig. 7). In addition, since the GC content in the current study was higher than the equilibrium GC content, we speculate that the driving force for maintaining the high GC content of S. suberectus ITS2 is not only gBGC (Fig. 6). The synthesis of GC requires more biochemical resources than the synthesis of AT78. The current high GC content in the paired region may be driven by structural selection, ensuring the thermodynamic stability of ITS279.

High GC content is an intuitive reflection of high recombination and mutation rates caused by high levels of meiosis80,81. gBGC can maintain mutations within a certain range and produce more homologous genes82,83. We speculate that S. suberectus may have a relatively high level of meiosis, resulting in relatively high recombination in ITS2, which makes S. suberectus easy to distinguish from other species in terms of ITS2, and there are many types within S. suberectus.

Conclusion

In this study, phylogenetic trees were constructed, and genetic distances and Ks values were calculated via ITS2 of S. suberectus and 17 other species. ITS2 of S. suberectus was assigned to a separate branch in the phylogenetic tree. The genetic distance and Ks value of ITS2 within S. suberectus were smaller than those between S. suberectus and other species. These results support the potential of using ITS2 for the identification of S. suberectus.

The genetic diversity of S. suberectus based on ITS2 was analysed. S. suberectus ITS2 had 8 haplotypes, and the most important haplotypes were H2 and H6. The phylogenetic tree based on secondary structure revealed 4 branches. These results provide information for the division of S. suberectus diversity.

One of the reasons for the high GC content in S. suberectus ITS2 is gBGC. The high degree of recombination and mutation of ITS2 caused by the high degree of meiosis is the basis for distinguishing S. suberectus from other species, as is the high degree of polymorphism within S. suberectus.