Background & Summary

Anoectochilus roxburghii (Wall.) Lindl., a perennial herb of the genus Anoectochilus in the family Orchidaceae, is highly valued for its dual medicinal and edible properties1. This species has been used as a natural, nutritional food ingredient and a traditional Chinese herb for thousands of years2. A. roxburghii has various biological activities, including anti-tumor, anti-oxidative, hypoglycemic, anti-inflammatory, and immunomodulatory activities3,4,5. Research indicates that the primary active ingredients in A. roxburghii include alkaloids, flavonoids, polysaccharides, steroids, and terpenes6,7. Among these, flavonoids are widely recognized as important indicator components for quality assessment of A. roxburghii1,8,9.

In this study, we report the first haplotype-resolved genome assembly of wild A. roxburghii widely distributed in Fujian Province, China. Through an integrated multi-omics approach, we combined PacBio high-fidelity (HiFi) sequencing, high-throughput chromosome conformation capture (Hi-C) sequencing, Illumina short-read sequencing, and RNA-Seq data to assemble and annotate the haplotype-resolved genomes of A. roxburghii. The haploid genomes exhibit sizes of 1.92 Gb and 1.93 Gb, with contig N50 values of 22.72 Mb and 22.17 Mb. The Benchmarking Universal Single-Copy Orthologs (BUSCO)10 analysis demonstrates a completeness score of 93.5% and 92.9%, respectively. Furthermore, we annotated a total of 26,239 and 26,324 protein-coding genes in the two haplotypes. Phylogenetic analysis elucidated the evolutionary relationships among A. roxburghii and related Orchidaceae species. These high-quality haplotype-resolved genomes provide a fundamental genetic resource that will enable comprehensive elucidation of secondary metabolite (particularly flavonoid) biosynthetic pathways and will significantly advance both functional genomics studies and molecular breeding applications.

Methods

Sample collection, library construction, and sequencing

High-quality PacBio HiFi libraries were prepared following the manufacturer’s protocol and sequenced on a PacBio Sequel II platform, yielding a total of 162.58 Gb circular consensus sequencing (CCS) reads with N50 of 17,251 bp (~78 × coverage). Libraries prepared with Illumina TruSeq PCR-free kits were sequenced on a NovaSeq X platform, generating 150 bp paired-end reads that yielded 124.55 Gb of data. For Hi-C library construction, young leaves were cross-linked with formaldehyde. Genomic DNA was then isolated using the CTAB method and digested with DpnII. The Hi-C libraries were prepared following a standard protocol and sequenced on an Illumina HiSeq 3000 platform, generating a total of 188.98 Gb of paired-end reads (Table 1). RNA sequencing libraries were generated with the TruSeq RNA Library Prep Kit according to the manufacturer’s guidelines, with triplicate biological replicates sequenced on an Illumina NovaSeq platform.

Table 1 Summary of haplotype-resolved genome assembly of A. roxburghii.

Estimation of genome size and heterozygosity

The genome size and heterozygosity of A. roxburghii were estimated through k-mer frequency analysis, a method that models the distribution of k-mer depths across the genome with a Poisson distribution11. Prior to assembly, we used Jellyfish12 (v2.2.10) to generate the 39-mer frequency distribution of PacBio HiFi reads. We then used GenomeScope 2.013 to evaluate the genomic features. The haploid genome size of A. roxburghii was estimated to be 1.92 Gb, with a heterozygosity rate of 2.19% (Fig. 1).
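The idea behind k-mer based size estimation can be illustrated with a minimal Python sketch. It uses the simplest single-peak form (total k-mer count divided by the peak depth) and is not the mixture model that GenomeScope 2.0 fits; the function name, the `min_depth` error cutoff, and the toy histogram are assumptions made only for illustration.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram maps k-mer depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing errors and ignored
    (the cutoff of 5 is an arbitrary illustrative choice).
    """
    usable = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Take the depth with the most distinct k-mers as the main coverage peak.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences divided by peak depth approximates genome size.
    total_kmers = sum(depth * count for depth, count in usable.items())
    return total_kmers / peak_depth


# Invented toy histogram: an error spike at depth 1 and one peak at depth 20.
toy_histogram = {1: 1000, 18: 10, 19: 30, 20: 50, 21: 30, 22: 10}
toy_size = estimate_genome_size(toy_histogram)
```

For a highly heterozygous genome the histogram has separate heterozygous and homozygous peaks, so picking a single maximum as done here would not be adequate; that is the case a fitted model is meant to handle.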

Fig. 1

Genome survey results based on k-mer analysis.

Haplotype-resolved genome assembly

We assembled the A. roxburghii genome using multiple sequencing datasets, including 162.58 Gb (~78 × coverage) of PacBio HiFi reads, 124.55 Gb (~60 × coverage) of Illumina reads, and 188.98 Gb (~90 × coverage) of Hi-C paired-end reads (Table 1). To address the assembly challenges caused by the high heterozygosity of the A. roxburghii genome, we performed genome assembly and phasing using hifiasm14 (v0.23.0) with PacBio HiFi reads in Hi-C mode (-s 0.55 for the haplotype similarity threshold; -D 5 for the k-mer filter threshold), generating two phased haplotype contig assemblies. Redundant and low-quality sequences were removed using Purge_Dups15 (v1.2.6) with stringent parameters (-f 0.7 for the sequence retention threshold). Subsequent error correction was conducted using Pilon16 (v1.24) with Illumina paired-end data, employing diploid-optimized parameters. To further elevate the assembly to the chromosomal level, Hi-C data were used to cluster, order, and orient the contigs with ALLHiC17 (v0.9.8) using the following parameters: -k 80 for the cluster size cutoff and --nonunique 0.7 for the mapping tolerance threshold. This process generated 20 pseudochromosomes for each haplotype (A and B) (Fig. 2), with anchoring rates of 99.83% and 98.52% and scaffold N50 values of 104.19 Mb and 105.51 Mb, respectively (Table 1).
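For readers unfamiliar with the N50 statistic quoted for contigs and scaffolds, the following Python sketch shows the standard definition: the length of the sequence at which the cumulative sum, taken from longest to shortest, first reaches half of the total assembly length. The function name and the toy lengths are illustrative assumptions, not values from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the running sum to half the total
        # defines the N50.
        if running >= half_total:
            return length


# Invented toy lengths (total 150): 50 + 40 = 90 >= 75, so N50 is 40.
toy_n50 = n50([10, 50, 30, 40, 20])
```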

Fig. 2

Characteristics of A. roxburghii genome assembly. (a) Genomic landscape of A. roxburghii. (b) Hi-C contact heat map of A. roxburghii genome of haplotype A. (c) Hi-C contact heat map of A. roxburghii genome of haplotype B.

Genome annotation

For repeat sequence annotation, we used RepeatMasker18 (v4.1.2) and RepeatModeler19 (v2.0.5) for homology-based prediction and de novo prediction, respectively. Additionally, sequences predicted as “Unknown” repeats were further classified using DeepTE20 (v1.0). By integrating the predicted results and removing redundancy, we determined that repeat sequences accounted for 76.54% and 76.68% of the two haploid genomes (Table 2 and Table 3).

Table 2 Statistics of repetitive sequences in haplotype A genome of A. roxburghii.
Table 3 Statistics of repetitive sequences in haplotype B genome of A. roxburghii.

Gene structure prediction integrated de novo gene prediction, homology-based gene prediction, and transcriptome-based gene prediction. First, Augustus21 (v4.0.0) was used for de novo gene prediction, HISAT222 (v2.2.1) and StringTie23 (v3.0.0) were used for transcriptome-based prediction, and TransDecoder24 (v5.4.0) was applied to predict open reading frames. Furthermore, Exonerate25 (v2.2.0) was used to align homologous peptides from several closely related species, including Dendrobium catenatum (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/605/985/GCF_001605985.2_ASM160598v2/), Phalaenopsis equestris (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/263/595/GCF_001263595.1_ASM126359v1/), Gastrodia elata (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/016/760/335/GCA_016760335.1_NIFOS_GasEla_1.0/), and Apostasia shenzhenica (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/786/265/GCA_002786265.1_ASM278626v1/), to the assembled genome to obtain homology-based predictions. Finally, the Geta pipeline (https://github.com/chenlianfu/geta) was used to integrate the gene models, and quality checks were conducted using HMMScan26 (v3.3.2) and BLASTp27 (v1.0.0) to retain high-confidence genes. In total, we annotated 26,239 and 26,324 protein-coding genes in the two haplotypes (Table 1). Additionally, we employed standardized workflows for gene function annotation. DIAMOND BLASTp27 (v1.0.0) was used to compare the predicted protein sequences against public databases, including UniProt, NR, GO, and KEGG, with an E-value cutoff of 1e-5 (ref. 28). This approach provided information on gene functions and the metabolic pathways in which these genes are involved.
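As an illustration of the E-value cutoff applied during functional annotation, the sketch below filters alignment hits in the standard 12-column BLAST tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). It assumes the hits were written in that layout and that the cutoff is inclusive; neither detail is stated in the text, and the function name and sequence identifiers are hypothetical.

```python
def filter_hits(lines, max_evalue=1e-5):
    """Keep (query, subject, evalue) for hits with evalue <= max_evalue.

    Assumes the default 12-column tabular layout, where the E-value is the
    11th tab-separated field.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        evalue = float(fields[10])
        if evalue <= max_evalue:
            kept.append((fields[0], fields[1], evalue))
    return kept


# Hypothetical hits; identifiers and scores are invented for the example.
toy_lines = [
    "geneA\tprot1\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-50\t400",
    "geneB\tprot2\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25",
    "geneC\tprot3\t60.0\t120\t48\t0\t1\t120\t1\t120\t1e-5\t90",
]
toy_hits = filter_hits(toy_lines)
```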

Phylogenetic analysis

We reconstructed a phylogenetic tree and estimated divergence times for A. roxburghii and 15 other plants, including 4 orchids (P. equestris29, D. catenatum30, G. elata31, and A. shenzhenica32), 1 species of Liliaceae (Asparagus officinalis33), 6 other monocotyledons (Brachypodium distachyon34, Oryza sativa35, Sorghum bicolor36, Ananas comosus37, Musa acuminata38, and Spirodela polyrhiza39), 3 dicotyledons (Populus trichocarpa40, Arabidopsis thaliana41, and Vitis vinifera42), and 1 basal angiosperm (Amborella trichopoda43). Orthologous gene families were identified across all species using OrthoFinder44 (v3.0.1b1) with all-vs-all BLASTp alignment (E-value ≤ 1e-5). Comparative analysis of the five orchid species revealed 8,111 conserved gene families shared among all members, while 1,060 gene families were unique to A. roxburghii (Fig. 3a). Furthermore, the 343 single-copy orthologous gene sequences were aligned using MUSCLE45 (v5.2). Conserved blocks were selected using Gblocks46 (v0.91b), with optimal amino acid substitution models determined by ProtTest47 (v3.4.2). A maximum-likelihood phylogenetic tree was constructed in RAxML48 (v8.2.12) with 1,000 bootstrap replicates. Divergence times were subsequently estimated with the MCMCTree module of the PAML49 (v4.10.3) package under a relaxed molecular clock model. To calibrate the molecular clock, we applied calibration constraints at four key nodes, with all calibration times obtained from the TimeTree50 database (http://timetree.org/): the divergence between A. officinalis and P. equestris (92.5–118.5 million years ago, Mya), O. sativa and B. distachyon (41.5–62.0 Mya), M. acuminata and O. sativa (103.2–117.1 Mya), and the basal angiosperm node represented by A. trichopoda and A. thaliana (179.9–205.0 Mya). Finally, gene family expansion and contraction were analyzed using CAFE 5 (ref. 51), which identified 2,100 expanded and 2,120 contracted gene families.
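The selection of single-copy orthologous families for tree building can be sketched as follows. This is a minimal illustration of the selection criterion (exactly one gene per species in a family), not the OrthoFinder implementation; the data structure, species codes, and family identifiers are hypothetical.

```python
def single_copy_orthogroups(counts, species):
    """Return sorted IDs of families with exactly one gene in every species.

    counts maps family ID -> {species code: gene count}; a species missing
    from the inner mapping is treated as having zero genes in that family.
    """
    return sorted(
        family
        for family, per_species in counts.items()
        if all(per_species.get(sp, 0) == 1 for sp in species)
    )


# Hypothetical species codes and gene counts, invented for the example.
toy_species = ["Aro", "Peq", "Dca"]
toy_counts = {
    "OG1": {"Aro": 1, "Peq": 1, "Dca": 1},  # single-copy in all species
    "OG2": {"Aro": 2, "Peq": 1, "Dca": 1},  # duplicated in one species
    "OG3": {"Aro": 1, "Peq": 1},            # absent from one species
}
toy_single_copy = single_copy_orthogroups(toy_counts, toy_species)
```

Only families that pass this filter in all 16 species would contribute to a concatenated alignment of the kind described above.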

Fig. 3

Phylogenetic and comparative genomics analyses of the A. roxburghii haplotype A genome. (a) The number of shared and unique gene families in A. roxburghii and four Orchidaceae species. (b) Phylogenetic tree showing the evolutionary relationship of A. roxburghii and 15 other plants. Expansion (green) and contraction (red) of gene family numbers are shown. Predicted divergence times (Mya, million years ago) are labelled in black at other intermediate nodes.

Data Records

The raw sequencing data52,53 generated in this study have been deposited in both the Genome Sequence Archive (GSA) at the National Genomics Data Center (CNCB-NGDC) under accession CRA021929 and the NCBI Sequence Read Archive (SRA) under accession SRP605955. The two haplotype genome assemblies have been deposited in the European Nucleotide Archive (ENA) under the accession numbers GCA_976986765 for haplotype A and GCA_976986775 for haplotype B54,55. The genome assembly and annotation data56 have been deposited in the Figshare database at the following link: https://figshare.com/articles/dataset/Genome_of_Anoectochilus_roxburghii/28163756.

Technical Validation

To evaluate the quality of the genome assembly, we aligned Illumina short reads to the reference genome using BWA57 (v0.7.17), achieving a mapping rate of 99.99%. Genome completeness was further assessed through BUSCO10 (v2.2) analysis against the embryophyta_odb10 database. The results indicated high completeness, with 93.5% and 92.9% of core conserved plant genes identified in the two haplotype assemblies, respectively.

To validate the reliability of the haplotype-resolved genome assembly, we aligned HiFi reads to the merged haplotype assemblies using minimap2 (v2.24; ref. 58) with the -N 0 parameter to retain only primary alignments. Among the 10,149,730 reads successfully mapped to the two haplotypic chromosome sets, 49.55% (5,028,978 reads) aligned specifically to haplotype A chromosomes, while 49.21% (4,994,602 reads) aligned specifically to haplotype B chromosomes. Only 1.24% (126,150 reads) exhibited cross-mapping between the two haplotypic chromosome sets, demonstrating high inter-haplotype sequence specificity. To assess anchoring quality, we aligned the Hi-C data to the reference genome. The contact matrix revealed markedly stronger intra-chromosomal interaction signals than inter-chromosomal signals (Fig. 2b and Fig. 2c). The interaction patterns also showed a prominent diagonal distribution within chromosomes, providing additional validation of the accuracy of genome assembly and scaffolding.
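The read-assignment percentages above can be recomputed directly from the reported counts. The short Python sketch below does so; the function and key names are illustrative assumptions, and the three counts are the ones stated in the text.

```python
def assignment_percentages(hap_a, hap_b, cross):
    """Return the total read count and the percentage in each category."""
    total = hap_a + hap_b + cross
    percentages = {
        "haplotype_A": round(100 * hap_a / total, 2),
        "haplotype_B": round(100 * hap_b / total, 2),
        "cross_mapped": round(100 * cross / total, 2),
    }
    return total, percentages


# Read counts as reported in the text.
total_reads, read_shares = assignment_percentages(5_028_978, 4_994_602, 126_150)
```

The three categories sum to the 10,149,730 mapped reads, and the rounded shares match the reported 49.55%, 49.21%, and 1.24%.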

Code availability

The Geta pipeline is publicly available under the MIT License at GitHub: https://github.com/chenlianfu/geta. All parameters used in this study are described in the Methods.