Introduction

A codon is a continuous three-base sequence on the messenger RNA chain that determines an amino acid. It plays a crucial role in the transmission of genetic information from messenger RNA to proteins in organisms1. In eukaryotes, there are 61 different codons encoding 20 amino acids2. All amino acids except tryptophan and methionine have multiple corresponding codons, known as synonymous codons. The frequency of synonymous codon usage varies among species, a phenomenon recognised as codon usage bias (CUB)3. This bias significantly affects the efficiency of protein translation, with genes encoded by optimal codons tending to be highly expressed or having polymorphic sites, thereby maintaining genetic information stability and optimising function during evolution4,5. Codon usage bias is generally considered a comprehensive response to the drift of non-synonymous codon mutations and the selection pressure of optimal codons6. Previous studies have shown that this bias is also influenced by multiple factors such as nucleotide composition, GC content7, gene expression levels8, and transfer RNA abundance9. Codon usage patterns vary among different species, but closely related species often share similar biases10. Therefore, studying codon usage patterns in plants helps reveal the adaptability of species to their environment and offers perspectives on gene expression regulation and species evolution2,11,12.

Chloroplasts are semi-autonomous organelles and energy conversion systems unique to higher plants and algae, and play important roles in photosynthesis, biosynthesis, and carbon sequestration13. Since the chloroplast genome (CPG) sequences of tobacco14 and liverwort15 were first released in 1986, their structures and functions have received widespread attention. These sequences are of moderate length and are rich in genetic information16. In contrast to mitochondrial and nuclear genomes, CPGs are widely applied in molecular ecology and evolutionary studies because of their moderate nucleic acid substitution rates, conserved genome structures and gene composition, and the absence of interference from paralogous homologues17,18,19. The study of codon usage patterns in CPGs serves as a fundamental step in understanding chloroplast function, while analysing base bias provides insights into genetic modification and transgenic systems of the chloroplast genome20.

Araceae, a large family with 144 accepted genera and 3645 species worldwide, is rich in resources and widely distributed21. Aroideae is the largest subfamily of Araceae, and is known for the significant medicinal and edible value of its plants. Pinellia ternata and Arisaema erubescens have been documented in the classic medical tome “Shennong Bencao Jing” of the Western Han Dynasty in China. Most species in Aroideae have pharmacological actions, including phlegm-resolving, anti-inflammatory, and antitumour properties22,23. Furthermore, the bulb of Amorphophallus konjac has abundant glucomannan components, increasing its potential applications in the fields of medicine, food, and the chemical industry24. As a plant rich in starch and cellulose, Colocasia esculenta is considered an important food crop in some regions and plays a key role in local agricultural production and the food supply25.

In recent years, significant progress has been made in research on the CPGs of Aroideae. The CPGs of Arisaema erubescens and Pinellia ternata were published by Zhang et al.26 and Cai et al.27, respectively, and the phylogenetic analysis revealed that Arisaema and Pinellia were most closely related. The CPGs of eight species in the Dracunculus clade were published in 2021, further highlighting the molecular evolution of Aroideae28. The CPGs of several Amorphophallus species, including A. konjac29, A. yunnanensis30, and A. coaetaneus31, have been assembled and characterised. Our previous research published the CPGs of three famous Aroideae species and conducted comparative analyses among 17 Aroideae species32. Currently, the NCBI database (https://www.ncbi.nlm.nih.gov/) contains CPG data for 61 Aroideae species. However, the codon usage bias and the phylogenetic relationships of these species remain unclear. In this study, the codon usage patterns of 61 Aroideae species were analysed using their CPGs, and the base composition, influencing factors, and optimal codons were identified. Furthermore, phylogenetic trees were constructed based on complete chloroplast genomes and protein-coding sequences, and a principal component clustering analysis (PCA) was performed based on relative synonymous codon usage values, providing new insights into the evolution of Aroideae.

Results

Statistics of the chloroplast sequence data and codon composition analysis

The CPGs of 61 Aroideae species belonging to 15 tribes were downloaded from NCBI (https://www.ncbi.nlm.nih.gov/) (Supplementary Table 1). The CPG sizes of the 61 Aroideae species ranged from 158,177 bp (Anchomanes hookeri) to 177,076 bp (Amorphophallus muelleri). Several genes, such as psbC, rpl16, cemA, and ycf68, are absent in some species, whereas others, such as rps16, ndhD, psbI, and atpH, are shorter than 300 bp. A total of 3,103 protein-coding genes were screened in the CPGs of 61 Aroideae species, including 1,512 photosynthesis-related genes, 786 ribosomal protein-related genes, 244 self-replicating-related genes, 270 ycf genes and 291 other genes (Supplementary Table 2). According to functional classification, photosynthesis-related genes presented the highest total GC content (39.28%) and relatively high optimal codon frequency (FOP) and codon adaptation index (CAI) values, which suggested that these genes may be highly expressed. The effective number of codons (ENC) values of the chloroplast genes ranged from 37.17 to 59.64, with the ENC values of rps18 and rps8 being less than 40, suggesting greater codon bias and greater stability for these two genes.

The codon usage patterns of the 61 Aroideae species were different (Fig. 1). Thomsonieae (tribe 2) and Arisaemateae (tribe 4) presented greater variation in codon numbers and GC values among the 15 tribes. Notably, Arisaema prazeri, Arisaema heterophyllum and Arisaema ringens had relatively low codon numbers, whereas Arisaema ringens, Arisaema prazeri, and Caladium bicolor had relatively high GC values. The GC contents at the first (GC1), second (GC2), and third (GC3) positions of the codon were 45.51–46.98%, 37.29–39.18% and 28.47–32.36%, respectively, suggesting a preference for A/T bases over G/C bases in the codons of Aroideae. Moreover, the GC content at the three codon positions followed the pattern GC1 > GC2 > GC3. The average GC content of the 61 Aroideae species was 37.91%. The AT content at the 3rd position of the codon was greater than the GC content, indicating a strong bias for codons ending with A/T bases (Supplementary Table 3).

Fig. 1

Number of codons used, GC content, GC1, GC2, and GC3 analyses in Aroideae. X-axis: Different coloured blocks represent different tribes, and numbers 1–15 represent the corresponding 15 tribes. Y-axis: From top to bottom, it represents the number of codons used, the total GC content, and the GC content at the three positions of the codons.

Correlation analysis of chloroplast codon bias parameters in Aroideae

To explore the potential associations among codon composition, codon bias, and gene expression, a Pearson correlation analysis was performed using the values of codon-related parameters of the chloroplast genes in Aroideae (Fig. 2). The results revealed that the GC content was positively correlated with both GC12 (p < 0.0001, r = 0.92) and GC3 (p < 0.0001, r = 0.57), but the correlation between GC12 and GC3 was not significant, suggesting that the GC contents at the three codon positions in the CPG of Aroideae were different. GC3 was positively correlated with ENC (p < 0.001, r = 0.41), suggesting that the 3rd base of the codon might influence the codon usage bias. As a measure of the bias towards preferred codons in highly expressed genes, the CAI was positively correlated with FOP (p < 0.0001, r = 0.7) and GC (p < 0.01, r = 0.3), suggesting that the usage frequency of optimal codons and the GC content were related to gene expression. Aromo was positively correlated with T3s (p < 0.01, r = 0.4), indicating that codons encoding aromatic amino acids prefer to end with T. Additionally, Gravy was positively correlated with T3s (p < 0.05, r = 0.37) and negatively correlated with A3s (p < 0.05, r = -0.4), suggesting that the hydrophobicity of proteins might be influenced by whether the 3rd position of the codon was A or T.

Fig. 2

Correlation analysis of codon indices of chloroplast genes in Aroideae. A3s, T3s, G3s, C3s, composition at the third synonymous codon position; GC, overall GC content; GC3, GC content at the third position of the codons; GC12, average G/C contents of the 1st and 2nd positions of codons; CAI, codon adaptation index; CBI, codon bias index; FOP, optimal codon frequency; ENC, effective number of codons; Gravy, general average hydropathicity; Aromo, frequency of aromatic amino acids. *Significant at p < 0.05 (two-tailed); **Significant at p < 0.01 (two-tailed); ***Significant at p < 0.001 (two-tailed); ****Significant at p < 0.0001 (two-tailed).

Analysis of factors influencing codon bias

The ENC-GC3 plot and the frequency distribution of ENC ratios are effective tools for analysing the factors affecting the codon usage bias of Aroideae. In the ENC-GC3 plots, the chloroplast genes of 61 Aroideae species deviated from the ENCexp standard curve (Fig. 3). Most genes, such as rps18, psbA, rpl16, and atpF, were located in the lower region of the plot, indicating that the bias of these genes was influenced by natural selection. Only a few genes, such as psaA, ycf3, ycf68, and ndhE, were located above the standard curve, indicating that mutation was the dominant factor leading to their codon bias. In addition, the frequency distribution table of the ENC ratios revealed that 46.76% of the chloroplast genes fell within the range of -0.05 to 0.05, whereas 53.24% of the genes were outside this range (Supplementary Table 4). This further indicated that natural selection played a key role in the codon bias of most genes in Aroideae.

Fig. 3

ENC-GC3 plots of chloroplast gene codons in 61 Aroideae species.

The PR2-plot analysis can reveal the usage frequency of A/T and C/G at the third codon position. If genes cluster in the centre of the PR2-plot plane, the frequencies of A and T, and of G and C, are similar, and mutation is solely responsible for the codon bias. The distribution of chloroplast genes in Aroideae on the plane was uneven, with a majority of the genes distributed at the bottom of the plane (Fig. 4). These results suggested that the CPG of Aroideae preferred to use A/T bases, and the biased base usage implied that the codon usage patterns were more influenced by natural selection.

Fig. 4

PR2 plots of chloroplast gene codons in 61 Aroideae species.

High frequency and optimal codons

A total of 30 codons with RSCU > 1 were identified in the CPGs of 61 Aroideae species, and 29 of these (96.7%) ended in A/T, the exception being UUG. Conversely, the number of codons with RSCU < 1 was 32, and 90.6% of these ended in C/G. This result suggested that high-frequency synonymous codons (RSCU > 1) tended to end in A/T, whereas low-frequency synonymous codons (RSCU < 1) tended to end in C/G (Supplementary Table 5). A codon heatmap was generated using the RSCU values of 64 synonymous codons (Fig. 5). The codon colours in the map were similar across different species, suggesting that the RSCU values were stable in Aroideae, with a conservative evolutionary process. Based on the ENC value, codons that fulfilled the criteria of RSCU > 1 and ΔRSCU ≥ 0.08 were screened out (Fig. 6). The results revealed that the number of optimal codons for the CPGs of 61 Aroideae species ranged from 13 (Alocasia navicularis) to 20 (Amorphophallus paeoniifolius). AGT (59), ATT (53), and TTA (55) were identified as optimal codons in more than 50 species, and CGT and TTT were found to be the optimal codons in 60 and 61 species, respectively.

Fig. 5

Heatmap of the RSCU values of 61 Aroideae species.

Fig. 6

Optimal codon analysis in 61 Aroideae species. Light blue background: RSCU > 1 and ΔRSCU ≥ 0.08; white background: RSCU > 1.

Genomic comparative and nucleotide diversity analyses

The divergence and conservation of the CPGs of Aroideae were studied using MultiPipMaker software, with the annotated chloroplast genome sequence of Alocasia fornicata as the reference (Fig. 7). The alignments revealed that the large single-copy (LSC) and small single-copy (SSC) regions were more divergent than the inverted repeat (IR) regions, and the non-coding regions were more divergent than the coding regions.

Fig. 7

Structural comparison of the CPGs of 61 Aroideae species. Above the alignment, black arrows and bold black lines depict the orientation of genes, with each colour strip representing a different region: the blue strip for LSC, the orange strip for IRs, and the yellow strip for SSC. Peachblow strips indicate different chloroplast genomes, green bars indicate mismatches, and white bars represent insertions or deletions (indels).

To identify the sequence divergence hotspots in the CPGs of Aroideae, DnaSP software was used to calculate nucleotide diversity (Pi) values within a 600 bp window (Fig. 8). The results revealed that the LSC and SSC regions (SCs) were more variable than the IR regions. A total of eight regions with high variability (Pi > 0.064) were identified in the SSC region: five genes, ndhF (0.072), rpl32 (0.068), ccsA (0.066), ndhE (0.064), and ndhG (0.07), and three intergenic regions, ndhF-rpl32 (0.069), ccsA-ndhD (0.065), and ndhE-ndhG (0.064). Four regions, trnV-trnM (0.05), trnM-atpE (0.048), accD (0.05), and rpl36-rps8 (0.049), were identified as highly variable regions (Pi > 0.048) in the LSC region. These regions presented high numbers of variable sites (VSs) and parsimony informative sites (Pins), and high discrimination success rates based on the distance method (DSR) and average K-2P distances (Supplementary Table 6). Moreover, three variable regions (trnV-trnM, ndhE, and ndhF-rpl32) with DSR values higher than 85% were used to construct neighbor-joining phylogenetic trees. The results revealed that the ndhE region could discriminate different chloroplasts in the Aroideae subfamily, indicating its potential to be developed as a valuable DNA barcode. Consistent with the results of the genomic alignments, the SC regions were more divergent than the IR regions. This could be attributed to a higher selective pressure during the evolution of SC regions, resulting in the accumulation of more mutations.

Fig. 8

Nucleotide diversity (Pi) analysis of the CPGs of 61 Aroideae species. Window length: 600 bp; step size: 200 bp.

Phylogenetic analysis

To investigate the phylogenetic relationships in Aroideae, we used the maximum likelihood method to construct chloroplast phylogenetic trees on the basis of CPGs and protein-coding sequences (CDSs) (Fig. 9). Lemna minor was used as an outgroup for both trees. The topologies of the two chloroplast phylogenies were similar, with most nodes showing high support values. In the Aroideae clade, three distinct clades (the Zantedeschia, Amorphophallus and Ambrosina clades) could be identified and distinguished, although the positions of some species differed between the two trees.

Fig. 9

Phylogenetic analysis of 61 Aroideae species. Lemna minor was used as an outgroup. (a) phylogenetic tree based on filtered CDS; (b) phylogenetic tree based on CPG.

In the phylogenetic tree constructed from the CDS, Montrichardia arborescens was sister to Anubias heterophylla in the Zantedeschia clade, and Calla palustris did not form a cluster with any other species (Fig. 9a). However, in the tree based on the CPG, Anubias heterophylla, Montrichardia arborescens, and C. palustris were grouped together (Fig. 9b). The Alocasia branch, which includes Alocasia fornicata, Alocasia navicularis, and Leucocasia gigantea, was clustered in the Arisaemateae clade in the tree based on CDS, but was clustered in the Colocasieae clade in the tree based on CPG. The Caladieae species and Thomsonieae species were clustered in the Amorphophallus clade on both trees.

To further reveal the evolutionary relationships of Aroideae species, a PCA clustering analysis based on RSCU values was conducted for the Zantedeschia, Calla and Ambrosina clades (Fig. 10). The results revealed that Calla palustris was separated from Anubias heterophylla and Montrichardia arborescens (Fig. 10a). In the Ambrosina clade, Alocasia was more closely related to Colocasieae than to Arisaemateae (Fig. 10b).

Fig. 10

PCA clustering analysis based on RSCU values. (a) PCA clustering analysis of species in the Zantedeschia and Calla clades; (b) PCA clustering analysis of species in the Ambrosina clade.

Discussion

In this study, the codon usage patterns of the chloroplast genomes of 61 Aroideae species were analysed. The protein-coding sequences of Aroideae were rich in A or T, with an average GC content of 37.91%. Arisaema ringens, Arisaema prazeri, and Caladium bicolor presented relatively high GC contents. A trend towards increasing GC values from GC3 to GC2 to GC1 was identified in 61 Aroideae species, and similar results were also reported in Aconitum and Juglandaceae species33,34. Correlation analysis revealed that CAI was significantly positively correlated with GC and FOP, suggesting that genes with high GC contents presented greater codon usage bias and higher expression levels. This was consistent with previous research findings35. In the CPG of Aroideae, the highest CAI (0.18) and GC values (39.28%) were found for photosynthesis-related genes, implying that these genes presented strong codon usage bias and were highly expressed. It could be inferred that photosynthesis-related genes may have played crucial roles in the evolutionary process of Aroideae adapting to natural environments.

Codon usage bias results from a combination of multiple factors36. Explaining the causes of this phenomenon in the chloroplast genomes of different species can help to further understand the evolutionary mechanisms of plants37. The analyses of the ENC plot, ENC ratios, and PR2 plot revealed that natural selection contributed the most to the codon usage bias in the CPG of Aroideae. Similar results have been reported in Gynostemma38, Miscanthus39 and Euphorbiaceae40. However, in Medicago truncatula41, mutation is the main reason for bias. In Mesona chinensis42, mutation pressure is relatively balanced with the influence of natural selection.

The RSCU value is a key index for evaluating the degree of codon usage bias. An RSCU that exceeds 1 indicates a high frequency of codon usage and a strong bias43. Heatmap plotting and cluster analysis of the RSCU revealed that 64 synonymous codons could be divided into two groups: codons ending with G or C and codons ending with A or T. Twenty-nine high-frequency codons (96.67%) were clustered in the group of codons ending with A or T, suggesting that the chloroplast genes of Aroideae tend to use codons ending with A or T. Furthermore, AGT, ATT, TTA, TTT, and CGT were identified as optimal codons in more than 50 Aroideae species, and all follow the “NNA” or “NNT” pattern. Studies have shown that codons with this pattern can effectively improve the efficiency of transcription and translation during gene expression44,45. Therefore, selecting these optimal codons is expected to improve gene expression efficiency in chloroplast genetic engineering46.

Mutation hotspots are regions of the genome that are prone to mutation and play crucial roles in understanding evolutionary mechanisms47. Using MultiPipMaker and nucleotide diversity analysis, eight highly variable regions were identified including five gene regions (ndhF, rpl32, ccsA, ndhE, and ndhG) and three intergenic regions (ndhF-rpl32, ccsA-ndhD, and ndhE-ndhG). The ndhE-ndhG region, for example, has been shown to be a particularly useful marker for predicting phylogeny among related species of Spathiphyllum in the Araceae family48. Moreover, the ndhF-rpl32, ccsA-ndhD, and ndhE-ndhG intergenic regions have been identified as suitable DNA barcodes for species identification and phylogenetic analysis in a range of plant species, including Dracocephalum49, Siraitia Merrill50, Magnolia polytepala51, and Gynopodium52. The ndhF gene region is another noteworthy hypervariable fragment, particularly in Lagerstroemia53, Lirianthe54, and Dalbergia species55. Collectively, these highly variable regions serve as tools for the identification and evaluation of germplasm resources, genetic diversity analysis, and population evolution in Aroideae.

In the analysis of codon usage patterns, an exploration of RSCU values in different species can aid in understanding their evolutionary relationships56,57. The phylogenetic positions of Calla palustris and Alocasia were ambiguous in the chloroplast phylogenies (Fig. 9). In nuclear analysis, Calla palustris, Anubias heterophylla and Montrichardia arborescens were found to form a sister group58, but the support for this relationship in the mitochondrial phylogeny and in molecular and morphological data was weak59,60. According to our PCA results based on the RSCU values, Calla clearly deviated from Montrichardia and Anubias, providing evidence that supports the possibility of independent evolution of the Calla clade and the Zantedeschia clade. Alocasia was more closely related to Colocasieae than to Arisaemateae according to the PCA, and the same result was found in the previously reported RAxML tree based on mitochondrial sequences and the IQ-tree based on CPGs61,62. In general, our ML analysis based on the CDS yielded outcomes that were consistent with those obtained from the CPG. For example, Pinellia species are located most closely to Arisaema, Sauromatum and Typhonium in the Arisaemateae clade63. Thomsonieae were clustered together with Caladieae in the Amorphophallus clade64,65. Therefore, the RSCU values of chloroplast protein-coding sequences could play an important role in the study of the phylogenetic relationships and taxonomy of Aroideae species.

Conclusion

In this study, the codon usage patterns of the chloroplast genomes of 61 Aroideae species were analysed. Our results revealed that Aroideae chloroplast genes preferred codons ending in A or T, and natural selection was the primary force driving codon usage bias. AGT (Ser), ATT (Ile), TTA (Leu), TTT (Phe), and CGT (Arg) were the five optimal codons shared by more than 50 species. We also identified highly variable gene regions (ndhF, rpl32, ccsA, ndhE, and ndhG) that could serve as reliable DNA barcodes for species identification and genetic diversity studies of Aroideae. Furthermore, principal component clustering analysis based on RSCU values can help to better understand the phylogenetic relationships among Aroideae species and may serve as a tool in species identification and classification.

Materials and methods

Sequence data

The original protein-coding sequences (CDSs) of chloroplast genomes (CPGs) of 61 Aroideae species were obtained from the National Center for Biotechnology Information (NCBI) (https://www.ncbi.nlm.nih.gov/) on April 15, 2024. To reduce sampling bias and improve the accuracy of the analysis, repeated sequences and sequences less than 300 bp in length were excluded66. The CDS must have both a start codon and a termination codon. Finally, a total of 3,103 eligible coding sequences were screened for subsequent analysis (Supplementary Table 1).
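The screening criteria above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `filter_cds` is hypothetical, "repeated sequences" is assumed to mean exact duplicate sequences, and only ATG is accepted as a start codon, which is a simplifying assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def filter_cds(records, min_len=300, start_codons=("ATG",)):
    """Return the (name, sequence) pairs that pass the screening rules.

    records: iterable of (name, sequence) pairs.
    Assumptions (not from the source): duplicates are exact sequence
    matches, and only the codons in start_codons count as start codons.
    """
    seen = set()
    kept = []
    for name, seq in records:
        seq = seq.upper()
        if len(seq) < min_len:             # drop sequences shorter than 300 bp
            continue
        if seq[:3] not in start_codons:    # require a start codon
            continue
        if seq[-3:] not in STOP_CODONS:    # require a termination codon
            continue
        if seq in seen:                    # drop repeated sequences
            continue
        seen.add(seq)
        kept.append((name, seq))
    return kept
```

The first copy of a duplicated sequence is retained and later copies are discarded, so the result depends on input order.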

Analysis of the codon usage index

CodonW 1.4.2 software (http://codonw.sourceforge.net/) was used to obtain the codon usage indices, including (1) relative synonymous codon usage (RSCU); (2) effective number of codons (ENC); (3) the codon adaptation index (CAI); (4) the codon bias index (CBI); (5) the total number of amino acids (L_aa); (6) the optimal codon frequency (FOP); (7) the number of synonymous codons (L_sym); (8) the general average hydropathicity (Gravy); (9) the frequency of aromatic amino acids (Aromo); (10) the GC content at the third position of the synonymous codons (GC3s); and (11) the composition at the third synonymous codon position (A3s, T3s, G3s, and C3s). The GC content, including the overall GC content (GC) and the GC content at the first (GC1), second (GC2) and third (GC3) codon positions, was calculated using MEGA-X software38, and the average G/C content of the 1st and 2nd positions of codons (GC12) was calculated in Excel 2019. SPSS 26.0 software was used for correlation analysis of the above parameters, and the graphs were created using Origin 2022.
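As an illustration of the position-specific GC measures, the sketch below computes GC1, GC2, GC3 and GC12 for a single in-frame CDS. The function name `gc_by_position` is hypothetical, and counting every codon (including the stop codon) is an assumption that may differ from the settings used in MEGA-X. It computes GC3 over all codons, not the GC3s index of CodonW, which is restricted to synonymous codons.

```python
def gc_by_position(cds):
    """Return (GC1, GC2, GC3, GC12) as fractions for one in-frame CDS.

    Assumption: all complete codons are counted, including the stop codon.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3       # ignore a trailing partial codon
    n_codons = usable // 3
    if n_codons == 0:
        raise ValueError("sequence contains no complete codon")
    counts = [0, 0, 0]                     # G/C counts at codon positions 1, 2, 3
    for i in range(usable):
        if cds[i] in "GC":
            counts[i % 3] += 1
    gc1, gc2, gc3 = (c / n_codons for c in counts)
    return gc1, gc2, gc3, (gc1 + gc2) / 2  # GC12 is the mean of GC1 and GC2
```

For example, `gc_by_position("ATGGCCTAA")` counts one G/C base at each of the first two positions and two at the third position across the three codons.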

ENC plot analysis

The ENC is an effective index for quantifying the degree of synonymous codon bias67. By comparing the expected ENC value with the GC3 value, the ENC-plot can be used to investigate the influence of base composition on the codon usage bias. The ENC-plot was constructed with the GC3 value as the abscissa and the ENC value as the ordinate. The standard curve indicated the expected ENC when mutation pressure was the only determinant of codon usage bias, and the formula of the standard curve66 was as follows: $\mathrm{ENC} = 2 + \mathrm{GC3s} + \dfrac{29}{\mathrm{GC3s}^{2} + (1 - \mathrm{GC3s})^{2}}$. To accurately evaluate the difference between the observed value (ENCobs) and the expected value (ENCexp), the ENC ratio was calculated using the formula $\text{ENC ratio} = (\mathrm{ENC_{exp}} - \mathrm{ENC_{obs}})/\mathrm{ENC_{exp}}$, and the difference between ENCobs and ENCexp was quantified according to the distribution of ENC ratios68.
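The standard curve and the ENC ratio can be written as a short sketch. The function names are hypothetical, and `fraction_within` simply reports the share of ratios inside a chosen interval (the Results use -0.05 to 0.05); it is an illustration, not the authors' script.

```python
def enc_expected(gc3s):
    """Expected ENC under mutation pressure alone, for GC3s given as a fraction."""
    return 2 + gc3s + 29 / (gc3s ** 2 + (1 - gc3s) ** 2)

def enc_ratio(enc_obs, gc3s):
    """(ENCexp - ENCobs) / ENCexp for one gene."""
    exp = enc_expected(gc3s)
    return (exp - enc_obs) / exp

def fraction_within(ratios, low=-0.05, high=0.05):
    """Share of ENC ratios falling inside [low, high]."""
    ratios = list(ratios)
    inside = sum(1 for r in ratios if low <= r <= high)
    return inside / len(ratios)
```

As a worked value, a gene with GC3s = 0.5 has an expected ENC of 2 + 0.5 + 29/0.5 = 60.5, so an observed ENC of 60.5 gives a ratio of 0 and lies on the standard curve.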

Parity rule 2 (PR2) plot analysis

PR2 plot analysis is widely used to explain the influence of mutational pressure and natural selection on the nucleotide composition of double-stranded DNA. With G3s/(G3s + C3s) as the abscissa and A3s/(A3s + T3s) as the ordinate, the centre of the plane is the position where A3s equals T3s and where G3s equals C3s, indicating that there is no mutational pressure or natural selection bias69. If G3s and C3s or A3s and T3s are close, the codon usage bias in the CPG is affected only by mutation pressure; if there is a large difference between G3s and C3s or A3s and T3s, the bias is attributed primarily to natural selection70.
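The plot coordinates described above follow directly from the third-position composition. The sketch below is a minimal illustration with a hypothetical function name; it takes the four third-position values (counts or frequencies) as inputs and does not reproduce how CodonW derives A3s, T3s, G3s and C3s.

```python
def pr2_point(a3, t3, g3, c3):
    """Return (x, y) = (G3/(G3 + C3), A3/(A3 + T3)) for one gene.

    A point at (0.5, 0.5) corresponds to A3 = T3 and G3 = C3.
    """
    if g3 + c3 == 0 or a3 + t3 == 0:
        raise ValueError("third-position counts must not sum to zero")
    return g3 / (g3 + c3), a3 / (a3 + t3)
```

For instance, `pr2_point(a3=30, t3=50, g3=10, c3=10)` lies on the vertical midline (G3 equals C3) but below the centre, because T is used more often than A at the third position.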

RSCU and optimal codon analysis

Relative synonymous codon usage values (RSCUs) refer to the ratio of the observed frequency of codon usage to the expected frequency under an unbiased usage. A codon with RSCU > 1 is considered a high-frequency codon, RSCU > 2 indicates that the codon is used with extremely high frequency, and RSCU < 1 indicates that the codon is a low-frequency codon43. A heatmap of the average RSCU values for all synonymous codons in the CPGs of 61 Aroideae species was constructed via TBtools v1.10871. Excel 2019 software was used to sort ENC values according to their size, and the genes in the top 10% and bottom 10% of ENC values were chosen to create high-expression gene datasets and low-expression gene datasets. The RSCU values of the two datasets were calculated according to previous research, and the ΔRSCU values were obtained by subtraction43. Codons with a ΔRSCU ≥ 0.08 and RSCU > 1 were defined as optimal codons1.
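The RSCU definition and the optimal-codon rule can be sketched as follows. This is an illustrative reimplementation, not CodonW or the authors' workflow. The function names are hypothetical, stop codons are treated as one synonymous family so that all 64 codons receive a value, and ΔRSCU is assumed to be the value in the high-expression dataset minus the value in the low-expression dataset. The caller supplies the two datasets, since the sketch does not decide which ENC extreme corresponds to high expression.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard codon assignments, built in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def rscu(sequences):
    """RSCU for each of the 64 codons, pooled over in-frame DNA sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:       # skip codons with ambiguous bases
                counts[codon] += 1
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        for c in codons:
            # observed count divided by the mean count of the family
            values[c] = counts[c] * len(codons) / total if total else 0.0
    return values

def optimal_codons(high_seqs, low_seqs, all_seqs, delta=0.08):
    """Codons with overall RSCU > 1 and (high - low) RSCU >= delta."""
    high, low, overall = rscu(high_seqs), rscu(low_seqs), rscu(all_seqs)
    return sorted(c for c in CODON_TABLE
                  if overall[c] > 1 and high[c] - low[c] >= delta)
```

With this definition, the RSCU values within one amino acid family sum to the number of synonymous codons in that family, so single-codon amino acids (Met and Trp) always have a value of 1 when they occur.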

Genomic variation analysis

The CPGs of 61 Aroideae species were aligned via the Multiple Sequence Alignment Program (MAFFT v. 7.427)72. The online MultiPipMaker software (http://pipmaker.bx.psu.edu/pipmaker/) with default parameters was subsequently used to make alignments among the 61 CPGs of Aroideae, with the annotated chloroplast genome of Alocasia fornicata used as a reference73. To determine the level of nucleotide variability (Pi) within the Aroideae, a sliding window analysis was conducted via the DnaSP v5.10 program with a step size of 200 bp and a window length of 600 bp74. To develop molecular markers for the identification of different chloroplasts in Aroideae, the variable sites (VSs), parsimony informative sites (Pins), discrimination success rate based on the distance method (DSR), and average K-2P distance of polymorphic sites were analysed, and neighbor-joining phylogenetic trees were constructed from the divergent regions with DSR values higher than 85%. Statistical analysis was conducted using Excel 2019.
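The sliding-window calculation can be illustrated with a simplified sketch. It is not DnaSP: the function names are hypothetical, alignment columns containing gaps or ambiguous bases are skipped entirely (an assumed treatment), and each window is reported at its midpoint. Pi is taken as the mean proportion of pairwise differences per retained site.

```python
from collections import Counter

def window_pi(columns):
    """Mean pairwise difference per site over the usable alignment columns."""
    total, n_sites = 0.0, 0
    for col in columns:
        if any(b not in "ACGT" for b in col):   # skip gapped/ambiguous columns
            continue
        n = len(col)
        counts = Counter(col)
        # proportion of sequence pairs that differ at this site
        total += (n * n - sum(c * c for c in counts.values())) / (n * (n - 1))
        n_sites += 1
    return total / n_sites if n_sites else 0.0

def sliding_pi(alignment, window=600, step=200):
    """Return [(window midpoint, Pi), ...] for aligned, equal-length sequences."""
    seqs = [s.upper() for s in alignment]
    if len(seqs) < 2:
        raise ValueError("at least two aligned sequences are required")
    length = len(seqs[0])
    results = []
    for start in range(0, length - window + 1, step):
        cols = ["".join(s[i] for s in seqs) for i in range(start, start + window)]
        results.append((start + window // 2, window_pi(cols)))
    return results
```

Windows whose Pi exceeds a chosen cutoff can then be flagged as candidate divergence hotspots.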

Phylogenetic analysis

To clarify the phylogenetic relationships among the 61 Aroideae species, maximum likelihood (ML) phylogenetic trees were constructed using RAxML, based on the CPGs and filtered CDSs75, and the sequences were aligned with MAFFT v. 7.427. Bootstrap analysis was conducted with 1,000 replicates, and the other parameters were set to their defaults. The phylogenetic trees were visualised using ChiPlot76. Moreover, we performed principal component analysis (PCA) based on RSCU values via Origin 2022 to further investigate the genetic relationships within Aroideae.
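To show what a PCA on an RSCU matrix involves, the sketch below extracts only the first principal component from a species-by-codon table using power iteration on the covariance matrix. It is an illustration under stated assumptions and not a substitute for Origin: the function name is hypothetical, the data are centred but not scaled (Origin's settings may differ), and only PC1 is returned.

```python
import math
import random

def first_principal_component(rows, iterations=500, seed=0):
    """Return (loadings, scores) of PC1 for rows = one list of values per species."""
    n, p = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("at least two rows are required")
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Random start: a uniform vector would be annihilated by RSCU data,
    # because values within a codon family sum to a constant.
    rng = random.Random(seed)
    vec = [rng.uniform(-1, 1) for _ in range(p)]
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:                      # no variance in the data
            break
        vec = [x / norm for x in nxt]
    scores = [sum(c[j] * vec[j] for j in range(p)) for c in centred]
    return vec, scores
```

The sign of a principal component is arbitrary, so only the relative positions of species along the axis are meaningful.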