Introduction

Macadamia (genus Macadamia, family Proteaceae), an economically significant nut crop indigenous to eastern Australia, is now cultivated worldwide, with major production regions in China, Australia, South Africa, and Kenya1. Renowned for their nutritional composition, macadamia kernels contain elevated levels of monounsaturated fatty acids and essential micronutrients, positioning them as a functional food with demonstrated health benefits2. Expanding applications in culinary oils, cosmetics, and confectionery products have driven substantial increases in global demand. Despite its economic prominence, macadamia remains a recently domesticated species with a commercial cultivation history of less than a century. Domestication has been based mainly on two of the four recognized species in the genus: M. integrifolia and M. tetraphylla3. Initial domestication efforts in Australia and subsequent commercial development in Hawaii established foundational cultivars that continue to underpin global production4. Notably, most commercial genotypes remain separated by only 2–4 generations from wild progenitors5, indicating limited genetic divergence during domestication.

China’s macadamia industry has experienced exponential growth over three decades, particularly in Yunnan Province. Early production relied heavily on Australian and Hawaiian cultivars, while parallel breeding initiatives using introduced germplasm have developed regionally adapted accessions for diverse cultivation regions and environmental conditions. Given the limited generation turnover and short cultivation history of macadamia in China, the introduced germplasm can be considered ‘raw materials’, while the locally selected germplasm represents the ‘first generation’. Cultivar and accession identification are critical for modern plant breeding programs and germplasm collection6. However, detailed parental information is largely unavailable for most ‘first generation’ samples, except for some hybrids generated through cross-breeding. The lack of improved varieties remains a major constraint for China’s macadamia industry, underscoring the importance of elucidating genetic relationships within germplasm collections to facilitate genetic improvement.

Studies of genetic relationships in plants have traditionally relied on both morphological characteristics and molecular markers. While morphological traits offer valuable phenotypic information, molecular markers provide more precise and reliable insights into genetic variation at the species level. For macadamia, extensive efforts have been made to evaluate genetic diversity and population structure using various nuclear genome-derived markers, including amplified fragment length polymorphisms (AFLPs)7, simple sequence repeats (SSRs)8,9, and single nucleotide polymorphisms (SNPs)10,11. However, studies that characterize the genetic and structural variation of chloroplast genomes in Macadamia and use this variation to assess the genetic diversity of domesticated macadamia varieties remain limited.

Chloroplasts, the energy-producing organelles in plant cells, possess their own genetic material. The chloroplast genome (plastome) of land plants typically forms a circular DNA molecule characterized by a conserved quadripartite structure: one large single-copy region (LSC), one small single-copy region (SSC), and two inverted repeat regions (IRs)12. Compared to nuclear genomes, plastomes exhibit several distinctive features, including smaller size (120–160 kb), predominantly maternal inheritance, and relatively lower nucleotide substitution rates13. However, chloroplast DNA demonstrates a higher mutation rate than mitochondrial DNA, another maternally inherited organellar genome14,15. Notably, noncoding regions of plastomes, particularly intergenic spacers, evolve more rapidly than coding regions16, making the plastome an invaluable tool for phylogenetic studies across various taxonomic levels. While plastome structure and gene content are conserved during crop domestication, the plastomes of cultivated species can still exhibit altered genetic diversity17, haplotype diversity18, and other features19 compared to their wild relatives.

In Macadamia, plastomic data have proven instrumental in confirming its phylogenetic position in the Proteaceae family20, tracing the geographical origins of cultivated germplasm of Macadamia integrifolia4, and, combined with nuclear gene phylogenetic analysis, revealing reticulate evolution in Macadamia21. In addition, the utility of the plastome for phylogenetic analyses has been well documented across numerous plant species, including olive22, jujube23, Isodon rubescens24, and Japanese apricot25. Building upon these previous studies, we assembled complete plastomes of 185 macadamia individuals from a core germplasm collection in China to reveal the maternal progenitors of these resources and the genetic and structural variation of their plastomes. Given macadamia’s few cultivation cycles and recent introduction to China, we hypothesized that most of the ‘first generation’ individuals share the same maternal donor and have a uniform plastomic background. Our genetic and structural variation analyses aimed to identify unique plastomic haplotypes and the distribution of SSRs, long repeats, and SNPs across these plastomes. These findings provide essential resources for variety differentiation and enhance our understanding of genetic variation within existing macadamia breeding pools, offering valuable insights for future breeding and conservation efforts.

Results

Characteristics of Macadamia plastome

All 185 assembled plastomes of macadamia exhibit a typical quadripartite structure, with total lengths spanning 159,195 to 159,726 bp. Structural analysis revealed distinct regional variations: the large single-copy (LSC) region measured 87,651–88,107 bp, while the small single-copy (SSC) region ranged from 18,743 to 18,819 bp. The inverted repeat (IR) regions were more stable in length, ranging between 26,378 and 26,422 bp (Fig. 1). The total GC content across all genomes remained consistent at 38.11–38.14%, with regional differentiation: LSC regions displayed 36.54–36.61%, SSC regions 31.64–31.71%, and IR regions the highest values at 43.01–43.05% (Supplementary Table S1). The plastomes of the 185 macadamia individuals were highly conserved; each contained a complement of 115 unique genes, including 81 protein-coding genes, 30 transfer RNA (tRNA) genes, and four ribosomal RNA (rRNA) genes (Supplementary Table S2). The rps12 gene in the 185 analyzed macadamia plastomes is distributed across the LSC and IR regions. Specifically, its 5’ exon resides in the LSC, while its 3’ exon is duplicated within the IRs. Sequence length analysis showed that protein-coding regions occupied 80,793–80,829 bp, tRNA genes 2,729–2,877 bp, and rRNA genes 8,855–9,058 bp. The GC content was 38.43–38.46% in protein-coding genes, 53.17–53.33% in rRNA genes, and 55.25–55.27% in tRNA genes. The 185 complete plastomes were classified into 23 different haplotypes (TYPE1–TYPE23), among which TYPE1 has the largest number of individuals (49), followed by TYPE9 (39). Eight haplotypes are each represented by a single individual (Supplementary Table S3).

Fig. 1

Plastome maps of Macadamia. Gene location is shown in the outer circle, GC content is shown in the inner circle.

Analyses of SSRs, long repeats, and codon usage

Comprehensive characterization of simple sequence repeats (SSRs) showed that the number of SSRs ranged from 75 to 89 across the 23 macadamia haplotypes (Fig. 2). Mononucleotide repeats dominated the SSR profile, ranging from 50 to 59 and accounting for 66.29–70.73% of all SSRs. Dinucleotide repeats ranged from 9 to 12, accounting for 12–13.48%, while trinucleotide repeats ranged from 4 to 6, representing 4.94–6.9% of all SSRs. Tetranucleotide repeats ranged from 9 to 10, making up 10.98–14.67% of the total SSRs. No pentanucleotide repeats were detected in any plastome, and hexanucleotide repeats were found in four plastomic haplotypes (TYPE5, TYPE6, TYPE7, and TYPE20). Regional distribution analysis showed that LSC regions harbored 59–72 SSRs (74.68–80.9%), far more than the SSC (11–14; 13.58–17.72%) and IR regions (4–6; 4.6–7.41%). In addition, protein-coding regions contained 21–22 SSRs (23.6–28%) (Supplementary Table S4).

Fig. 2

The type and number of SSRs in 23 macadamia plastome haplotypes. (a) The number of general SSR types in different haplotypes; (b) the constitution of different repeat class types in each haplotype.

A total of 577 long repeats were identified across all plastomes, including 329 palindromic repeats, 192 forward repeats, 38 reverse repeats, and 18 complement repeats (Fig. 3). Both palindromic and forward repeats occurred universally and were detected in all plastomes, whereas complement repeats and reverse repeats showed restricted distribution and were found in 8 and 12 plastome haplotypes, respectively. Meanwhile, the long repeats can be classified into 20 length types (Supplementary Table S5). Among these, long repeats spanning 30–39 bp were predominant, accounting for 446 (73.4%), followed by those in the 40–49 bp range with 106 (17.4%). Notably, repeats exceeding 60 bp demonstrated a higher frequency of 23 (3.8%) than the 50–59 bp category, which showed the lowest frequency (0.3%).

Fig. 3

The long repeats in macadamia plastomes. (a) The number of four types of long repeats in 23 haplotypes; (b) Constitution of long repeats in different length ranges.

The relative synonymous codon usage (RSCU) values of the 81 protein-coding genes ranged from 0.35 to 1.85, and the number of codons ranged from 26,841 to 26,853 (Supplementary Table S6). Among all amino acids, leucine was the most frequently encoded (10.31–10.32%), while cysteine was the least frequent (1.19–1.2%). A total of 29 codons had RSCU values > 1.00, indicating codon usage bias in the protein-coding genes of each plastome. With the exception of methionine and tryptophan, all amino acids were encoded by multiple synonymous codons.
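
For readers unfamiliar with the statistic, RSCU is the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally, so values above 1.00 indicate preferred codons. The following is a minimal sketch of that calculation, not the Cusp-based workflow used in this study; the function name `rscu` and the use of the standard codon-to-amino-acid assignments are assumptions made for illustration.

```python
from collections import Counter

# Standard codon-to-amino-acid assignments, built from the canonical
# TCAG ordering (assumption: sufficient for plastid protein-coding genes).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def rscu(cds_sequences):
    """Return {codon: RSCU} for in-frame coding sequences (hypothetical helper)."""
    counts = Counter()
    for seq in cds_sequences:
        seq = seq.upper()
        # Walk the sequence codon by codon, ignoring a trailing partial codon.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            # Skip stop codons and codons with ambiguous bases.
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1

    # Group sense codons into synonymous families, one per amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the input
        expected = total / len(codons)  # count under equal usage
        for c in codons:
            values[c] = counts[c] / expected
    return values
```

Under this definition, single-codon amino acids such as methionine and tryptophan always have an RSCU of 1.00, which is why they carry no information about codon bias.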

SNPs and highly divergent regions among Macadamia

The nucleotide variability values revealed by sliding window analysis across the whole plastome ranged from 0 to 0.03922 (Fig. 4). Five peaks with values > 0.01 were identified as regions of high nucleotide variability, four of which were located in the LSC region and one in the SSC region. The most variable region was trnS–trnG-exon1 in the LSC region (0.03922), followed by ndhD–psaC (0.01757), trnH–psbA (0.01653), petA–psbL (0.01565), and psbC–psbZ (0.01009). In addition, we identified 573 non-redundant SNPs across all plastomes, of which 380 (66.32%) were located in intergenic spacer regions. SNPs in coding regions included 118 nonsynonymous and 75 synonymous mutations. Among the coding genes, ycf1 had the highest number of SNPs (35), followed by ndhF (15), rpoC2 (13), and matK (11). Of all SNPs, 289 were transitions and 284 were transversions. The most common substitution was G to A (101), followed by C to T (96), while G to C (19) was the least frequent.
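
The transition and transversion classes reported above follow the usual definition: a transition exchanges two purines (A, G) or two pyrimidines (C, T), and a transversion exchanges a purine for a pyrimidine or vice versa. A minimal sketch of that classification is given below; the function names and the input format (a list of reference/alternate base pairs) are hypothetical and are not taken from the study's pipeline.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Label a single-base substitution as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref!r} -> {alt!r}")
    # Both bases in the same chemical class means a transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ts_tv_summary(snps):
    """Return (transitions, transversions, Ts/Tv ratio) for (ref, alt) pairs."""
    counts = Counter(classify_substitution(ref, alt) for ref, alt in snps)
    ts, tv = counts["transition"], counts["transversion"]
    ratio = ts / tv if tv else float("inf")
    return ts, tv, ratio
```

Applied to the counts reported here (289 transitions, 284 transversions), the ratio is about 1.02.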

Fig. 4

The nucleotide variability of the 185 complete macadamia plastomes. Window length: 300 bp; step size: 200 bp. Five regions with the highest π values were annotated. X-axis: nucleotide positions in the plastomes. Y-axis: nucleotide diversity of each window.

Phylogenetic analyses

The 185 macadamia accessions were divided into four clades in the phylogeny (Fig. 5), three of which received full support (bootstrap value = 100) (Supplementary Fig. S1). Clade1 consists of eight haplotypes and 70 individuals, comprising 14 ‘raw material’ and 56 ‘first generation’ individuals. Clade2 has the fewest haplotypes and includes only two ‘raw material’ individuals. Clade3 is composed of four haplotypes, with a total of 29 individuals, including 17 ‘raw material’ and 12 ‘first generation’ individuals. Clade4 contains the largest number of plastome types, with nine haplotypes represented among 84 individuals, including 49 ‘raw material’ and 35 ‘first generation’ individuals. Among the 28 hybrids created by cross-breeding within the ‘first generation’ individuals, 26 had plastomes identical to that of their female parent. For example, ‘ym20’, ‘ym04’, ‘ym05’, and ‘ym07’ are hybrids from the maternal cultivar ‘D’, and all five individuals shared the same plastome (TYPE10). Similarly, both the female parent (‘D4’) and its hybrids (‘y-4-1604’, ‘ym29’, and ‘ym24’) shared the TYPE12 plastome. The remaining two hybrids, ‘ym10’ and ‘ym11’, share the same female parent, cultivar ‘NSW-44’; however, because the plastome of ‘NSW-44’ could not be successfully assembled, these two hybrids were the only individuals carrying the TYPE21 plastome. In addition, most raw materials introduced from Hawaii shared the TYPE1 plastome and clustered in Clade4. TYPE6 (M. tetraphylla) and TYPE13 (M. ternifolia) were resolved as early-diverging lineages of Clade3 and Clade1, respectively. Furthermore, three individuals, ‘Bailahe1’, ‘JingG6’, and ‘Jing40’, were collected from seedling orchards and considered ‘first generation’. However, ‘Bailahe1’ and ‘JingG6’ shared the TYPE22 plastome, while ‘Jing40’ was the only individual with the TYPE23 plastome, raising questions about their maternal origin.

Fig. 5

Maximum likelihood phylogenetic tree of 185 macadamia accessions categorized into ‘raw material’ (RM) and ‘first generation’ (FG) groups, with type classified into 23 plastome haplotypes (TYPE1–TYPE23). Representative accessions with known identity include NC_025288.1 (M. integrifolia), ‘AM-T-95’ (M. tetraphylla), and ‘AM-TF’ (M. ternifolia).

Discussion

In this study, we assembled and analyzed 185 complete plastomes from macadamia specimens, representing a significant advancement in our understanding of Macadamia chloroplast genomics. The sequenced plastomes exhibited length polymorphism ranging from 159,195 to 159,726 bp, with size variations primarily attributed to differences in the large single-copy (LSC) region rather than the small single-copy (SSC) or inverted repeat (IR) regions. This pattern of LSC variability aligns with observations in other plant taxa, such as Japanese apricot25, highlighting the dynamic nature of this genomic region. Despite variations in plastome size, we observed remarkable conservation in gene order and content across all macadamia plastomes. The rps12 gene, a notable chloroplast gene, encodes the small ribosomal subunit S12 protein via trans-splicing26. Notably, no variations were detected in the exon location or intron content of rps12 within our study samples. This conserved structure contrasts with the variability observed in ferns27. The GC content, a crucial genomic feature associated with genome size and ecological adaptation in plants28, showed minimal variation among samples, reflecting its conservation at lower taxonomic levels. The IR regions exhibited the highest GC content (43%), followed by the LSC (36.5%) and SSC (31.7%) regions, a pattern consistent with other plant plastomes29,30 that may be attributed to the clustering of rRNA genes in IR regions31. Our analysis of codon usage bias, which is influenced by both natural selection and mutation32, identified 29 high-frequency codons in Macadamia plastomes, predominantly ending with A/T. This pattern mirrors observations in other dicotyledonous plants, including Ipomoea33 and Asteraceae species34. These findings underscore the evolutionary characteristics of Macadamia plastomes.

Plastome simple sequence repeats (SSRs) have become invaluable tools in genetic studies due to their high polymorphism and reproducibility, making them particularly useful for assessing genetic diversity, reconstructing phylogenies, and facilitating species identification35. Our analysis revealed 75–89 SSR motifs per plastome, predominantly located in the large single-copy (LSC) region, a distribution pattern consistent with findings in Polystachya36. Similar to previous reports37, we observed a preferential accumulation of SSRs in intergenic regions rather than coding sequences. Among the five SSR types identified in macadamia plastomes (all except pentanucleotide repeats), mononucleotide repeats were the most abundant, primarily consisting of homopolymeric A/T tracts. This observation aligns with SSR distribution patterns observed in both dicotyledonous38 and monocotyledonous plants36. The identified SSRs represent valuable molecular markers for future investigations into genetic diversity and phylogenetic relationships within macadamia germplasm populations. Additionally, our repeat pattern analysis detected palindromic and forward repeats in all examined plastomes, whereas reverse and complement repeats were restricted to subsets of haplotypes (12 and 8, respectively). These findings provide important insights into the structural organization and evolutionary dynamics of the Macadamia plastome.

Divergent plastomic hotspots hold considerable potential for phylogenetic reconstruction39 and cultivar discrimination40. Our investigation identified five hypervariable regions (π > 0.01) in Macadamia plastomes: trnS–trnG-exon1, ndhD–psaC, trnH–psbA, petA–psbL, and psbC–psbZ. Notably, the trnH–psbA intergenic spacer, an established universal chloroplast marker, has been widely employed across taxonomic levels41. Both trnS–trnG-exon1 and petA–psbL, situated within the large single-copy (LSC) region, have demonstrated significant variability in other plant plastomes24. These three previously reported markers, combined with the newly identified ndhD–psaC and psbC–psbZ regions, represent promising candidates for phylogenetic analyses in Macadamia.

Chloroplast DNA’s moderate mutation rate and uniparental inheritance render it particularly suitable for phylogenetic and parentage studies42. However, conventional chloroplast markers often lack sufficient resolution for distinguishing closely related cultivars43. Recent advancements in plastome sequencing have successfully enabled varietal differentiation in rice44, cacao45,46, and olive22. Our phylogenomic analysis of 185 Macadamia plastomes revealed 23 distinct haplotypes. Three representative species accessions (‘HAES741’ representing M. integrifolia, ‘AM-T-95’ representing M. tetraphylla, and ‘AM-TF’ representing M. ternifolia) formed well-supported clades, confirming the efficacy of plastomes for interspecific relationship analysis. Nevertheless, we also found identical plastome sequences among multiple individuals within taxonomic groups (Fig. 5), which aligns with a recent study based on plastomic SNPs21. This highlights the significant limitations of plastomic data in varietal differentiation, particularly for modern cultivars. For instance, similar results were observed in tomato breeding programs, where plastome sequencing failed to distinguish contemporaneous varieties47. Furthermore, the sharing of identical haplotypes among multiple individuals suggests that parental lines carrying these haplotypes may possess a selective advantage in breeding programs, which could explain their widespread use in Macadamia breeding.

Nonetheless, our results corroborate previous reports that plastome sequences exhibit higher discriminatory power for ancestral varieties than for modern cultivars43. Notably, raw germplasm materials representing ancestral stock displayed unique haplotypes (TYPE13–TYPE19) that were readily distinguishable in phylogenetic reconstructions (Supplementary Table S3, Fig. 5). These unique haplotypes reflect evolutionary uniqueness and likely represent genotypes that have not undergone artificial selection. Conversely, first-generation cultivars predominantly shared identical plastomes (e.g., haplotypes in Clade4), indicating genetic homogeneity in modern breeding lines. It is also notable that three accessions (‘Bailahe1’, ‘JingG6’, and ‘Jing40’) exhibited ambiguous phylogenetic placement, warranting further investigation. The nucleotide polymorphism levels in the FG and RM populations are similar (Supplementary Table S8), which may be attributed to either incomplete coverage of parental materials or an extremely short domestication period. This comprehensive assessment provides novel insights into maternal lineages and evolutionary origins of Macadamia cultivars through combined analysis of plastome variability and conservation. Future studies incorporating more plastomic data from wild macadamia accessions are needed to better trace the maternal origin of each cultivar lineage.

Conclusion

Our study presents the assembly and characterization of 185 macadamia plastomes, revealing conserved genomic architecture in GC content, gene composition, and codon usage patterns. We identified polymorphic SSRs, SNPs, and five hypervariable regions with potential as molecular markers for phylogenetic and population genetic studies. Phylogenetic reconstruction using complete plastome sequences successfully delineated relationships among ancestral germplasm and characterized first-generation cultivars. These findings demonstrate substantial genetic diversity within the macadamia germplasm resources in Yunnan, China, and provide critical baseline data for understanding genotypic relationships and informing future breeding strategies.

Materials and methods

Materials

This study included a total of 185 samples covering three macadamia species: M. integrifolia, M. tetraphylla, and M. ternifolia. Of these, 82 samples introduced from Australia and the USA were defined as ‘raw material’, while the remaining 103 samples, created and selected in China, were defined as ‘first generation’. All samples are maintained at the Germplasm Repository of Macadamia Nut Jinghong City, Ministry of Agriculture, Xishuangbanna, Yunnan Province, China. Detailed sample information is provided in Supplementary Table S7.

Chloroplast genome assembly, annotation, and analyses

Whole-genome sequencing data for 208 macadamia samples from our previous study (Li et al.48; NCBI SRA accession PRJNA909356) were reanalyzed. Briefly, young leaves were collected and total genomic DNA was extracted using the TaKaRa MiniBEST Plant Genomic DNA Extraction Kit (Takara Bio, China). Standard DNA libraries were then constructed following procedures that included DNA fragmentation, and paired-end reads of 150 bp were generated on the DNBSEQ-T7 platform (Biomarker Technologies, China). Raw reads were removed if they: (1) contained sequencing adapters; (2) had > 50% of bases with quality scores < 20 (Phred-like score); or (3) were paired-end reads containing > 10% ‘N’ bases. Following quality filtering, plastomes were assembled with GetOrganelle49 using default parameters, yielding complete circular genomes for 185 samples, which were used in the subsequent annotation and analyses. Assembly orientations were standardized against the reference genome (NC_025288.1), and potential chloroplast partition structures were identified. GetOrganelle’s assembly steps and the reference genome employed effectively control for interference from contaminating bacterial reads. To identify unique plastomic haplotypes, redundant sequences were removed through BLAST-based pairwise genome comparisons. Annotation was performed using GeSeq50 with default parameters to predict protein-coding genes, tRNAs, and rRNAs. Gene positions were verified through BLAST alignment to reference chloroplast genes, with manual curation of start/stop codons and exon-intron boundaries. Genome maps were generated using OrganellarGenomeDRAW51. Functional annotation involved BLAST searches (E-value ≤ 10e−5) against the NCBI-Nr, Swiss-Prot, COG, KEGG, and GO databases.
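
The three read-filtering criteria listed above can be expressed as a simple predicate. The sketch below is illustrative only and is not the software used in this study: it assumes Phred+33 quality encoding, detects adapters by exact substring match, and discards a pair if either mate fails any criterion. The function names and the adapter string in the usage note are hypothetical placeholders.

```python
def low_quality_fraction(quality, threshold=20, offset=33):
    """Fraction of bases with a quality score below `threshold`.

    Assumes Phred+33 ASCII encoding (an assumption, not stated in the text).
    """
    if not quality:
        return 1.0
    low = sum(1 for ch in quality if ord(ch) - offset < threshold)
    return low / len(quality)


def n_fraction(sequence):
    """Fraction of uncalled ('N') bases in a read."""
    if not sequence:
        return 1.0
    return sequence.upper().count("N") / len(sequence)


def keep_read_pair(read1, read2, adapters=()):
    """Return True if a read pair passes all three filters.

    Each read is a (sequence, quality_string) tuple. The pair is rejected
    if either mate (1) contains an adapter, (2) has > 50% of bases with
    quality < 20, or (3) has > 10% 'N' bases.
    """
    for sequence, quality in (read1, read2):
        if any(adapter in sequence for adapter in adapters):
            return False
        if low_quality_fraction(quality) > 0.5:
            return False
        if n_fraction(sequence) > 0.10:
            return False
    return True
```

For example, `keep_read_pair(("ACGTACGTAC", "IIIIIIIIII"), ("ACGTACGTAC", "IIIIIIIIII"), adapters=("GGGGCCCC",))` keeps the pair, where `GGGGCCCC` stands in for a real adapter sequence.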

Analyses of SSRs, repeat sequences and codon usage

Simple sequence repeats (SSRs) in all assembled plastomes were identified using the MISA software52, with minimum repeat numbers set to 8, 5, 4, 3, 3, and 3 for mononucleotides, dinucleotides, trinucleotides, tetranucleotides, pentanucleotides, and hexanucleotides, respectively. REPuter software53 was used to detect four types of long repeats, namely forward (F), reverse (R), complement (C), and palindromic (P), with parameters set to a minimal repeat size of 30 bp, a Hamming distance of 3, and a maximum of 5000 computed repeats. In addition, the Cusp software54 was employed to calculate relative synonymous codon usage (RSCU) and obtain codon usage patterns of Macadamia plastomes.
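
To make the SSR thresholds concrete, the following sketch scans a sequence for perfect tandem repeats using the same minimum repeat numbers (8, 5, 4, 3, 3, and 3 for unit sizes 1 to 6). It is a simplified regular-expression stand-in, not MISA itself: compound SSRs are not handled, and the function names are hypothetical.

```python
import re

# Minimum number of repeats per motif length, matching the thresholds above.
MIN_REPEATS = {1: 8, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(
        n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n)
    )


def find_ssrs(sequence):
    """Return sorted (start, end, motif, repeats) tuples; 0-based, end exclusive."""
    sequence = sequence.upper()
    hits = []
    for unit, min_rep in MIN_REPEATS.items():
        # A motif of `unit` bases followed by at least (min_rep - 1) copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_rep - 1))
        pos = 0
        while True:
            match = pattern.search(sequence, pos)
            if match is None:
                break
            motif = match.group(1)
            if is_primitive(motif):
                repeats = (match.end() - match.start()) // unit
                hits.append((match.start(), match.end(), motif, repeats))
                pos = match.end()
            else:
                # e.g. 'AA' repeats are already reported as mononucleotide
                # SSRs; step one base so overlapping motifs are not missed.
                pos = match.start() + 1
    return sorted(hits)
```

In this sketch a run of ten A bases is reported once as a mononucleotide SSR rather than again as an "AA" dinucleotide repeat, which is the reason for the primitive-motif check.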

Variant calling and annotation

MUMmer55 was used to globally align the sequence of each sample against the reference sequence (TYPE1), identify variant sites between them, and perform an initial filtering to detect potential SNP sites. We extracted 100 bp sequences flanking each SNP site in the reference sequence and used BLAT (v35)56 to align the extracted sequences against the assemblies to verify the SNP site. Alignments shorter than 101 bp were deemed unreliable and discarded. SNPs whose flanking sequences aligned multiple times were considered to lie in repetitive regions and were also discarded. The remaining SNPs were retained as reliable.
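
The two discard rules above (alignment shorter than 101 bp, or more than one alignment) amount to a small filter over the alignment hits. The sketch below assumes the hits have already been parsed into (SNP identifier, aligned length) pairs, one per hit; this input format and the function name are hypothetical and do not reflect BLAT's actual output columns.

```python
from collections import defaultdict


def filter_reliable_snps(hits, min_length=101):
    """Return sorted SNP ids whose flanking sequence aligned exactly once
    with an aligned length of at least `min_length` bp.

    `hits` is an iterable of (snp_id, aligned_length) pairs, one per hit.
    """
    lengths_by_snp = defaultdict(list)
    for snp_id, aligned_length in hits:
        lengths_by_snp[snp_id].append(aligned_length)

    reliable = []
    for snp_id, lengths in lengths_by_snp.items():
        if len(lengths) > 1:
            continue  # multiple hits: treated as a repetitive region
        if lengths[0] < min_length:
            continue  # alignment too short: treated as unreliable
        reliable.append(snp_id)
    return sorted(reliable)
```

SNPs with no hit at all never appear in the input and are therefore also excluded.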

Analyses of nucleotide variability and phylogenetic analysis

DnaSP57 was used to estimate the nucleotide variability of coding genes and non-coding regions by a sliding window method, with a window length of 300 bp and a step size of 200 bp. We used all 185 complete plastomes to reconstruct the phylogenetic tree. The plastome sequences of two Proteaceae taxa, Helicia nilagirica (NC_057271.1) and Helicia shweliensis (NC_045942.1), were downloaded as outgroups. Maximum likelihood (ML) analyses were run using PhyML58 with 1000 bootstrap replicates. The phylogenetic tree was visualized with tvBOT59.
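
Nucleotide diversity (π) is the average number of pairwise differences per site among the aligned sequences. The sketch below illustrates a sliding-window π calculation with the window length and step size given above. It is not DnaSP and may differ from it in detail: in particular, it assumes that alignment columns containing gaps or ambiguous bases are skipped and that each window's π is averaged over the retained columns. The function names are hypothetical.

```python
from collections import Counter


def column_diversity(column):
    """Proportion of sequence pairs that differ at one alignment column."""
    n = len(column)
    total_pairs = n * (n - 1) // 2
    identical_pairs = sum(c * (c - 1) // 2 for c in Counter(column).values())
    return (total_pairs - identical_pairs) / total_pairs


def sliding_window_pi(alignment, window=300, step=200):
    """Return (start, end, pi) per window; coordinates are 1-based inclusive.

    `alignment` is a list of equal-length aligned sequences.
    """
    if len(alignment) < 2:
        raise ValueError("at least two sequences are required")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("sequences must be aligned to the same length")

    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        end = min(start + window, length)
        diversity_sum, sites = 0.0, 0
        for i in range(start, end):
            column = [seq[i].upper() for seq in alignment]
            # Assumption: skip columns with gaps or ambiguous bases.
            if any(base not in "ACGT" for base in column):
                continue
            sites += 1
            diversity_sum += column_diversity(column)
        pi = diversity_sum / sites if sites else 0.0
        results.append((start + 1, end, pi))
    return results
```

With three sequences that differ at a single column of a four-column window, two of the three pairs differ at that site, giving π = (2/3)/4 ≈ 0.167 for the window.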