Background & Summary

The advent of telomere-to-telomere (T2T) genome assemblies represents a milestone in genomics, enabling near-complete chromosomal reconstructions and significantly improving resolution in previously intractable regions such as centromeres, telomeres, and rDNA clusters1,2. Traditional assemblies often leave these repetitive regions unresolved. Recent progress in sequencing technologies—including PacBio HiFi, Oxford Nanopore ultra-long reads, and Hi-C chromatin interaction maps—coupled with advanced assembly tools such as hifiasm3 and quarTeT4, has enabled the generation of highly contiguous and accurate plant genomes. T2T assemblies have now been achieved in multiple crops, including soybean5, rice6, maize7, and cucumber8, as well as woody species like Chinese cork oak9, Chinese bayberry10, and Populus alba × P. tremula (84 K poplar)11, offering unprecedented insights into genome structure and evolution.

The genus Acer (maples), belonging to the family Sapindaceae, comprises over 200 species distributed across Eurasia and the Americas, with China being the modern diversity center, harboring approximately 140 species12. A. truncatum, widely distributed between 28°N–46°N and 102°E–143°E13, is a species of ecological and economic significance. It is noted for its tolerance to drought and cold14,15, striking autumn foliage16,17, and seed oil rich in unsaturated fatty acids, especially nervonic acid18,19,20. These compounds are beneficial to cardiovascular and neurological health, with nervonic acid playing roles in neural repair and possibly mitigating neurodegenerative diseases such as Alzheimer’s and multiple sclerosis21.

To date, several Acer genomes have been assembled, including A. catalpifolium22, A. negundo23, A. palmatum24, A. pseudosieboldianum25, A. rubrum26, A. saccharum23, A. truncatum17,18, and A. yangbiense27, providing valuable resources for studies on phylogeny, stress tolerance, and leaf coloration. However, no high-completeness, gapless genome has been reported for the genus. Earlier versions of the A. truncatum genome included a 653 Mb assembly using PacBio Sequel I, 10x Genomics, and Illumina data (1,453 contigs, contig N50 = 773 kb)20, and a hybrid HiFi/ONT/Hi-C assembly with 181 scaffolds (scaffold N50 = 9.14 Mb)17.

In this study, we reanalyzed the dataset from Zhang et al.17 to generate a near-complete, haplotype-resolved genome of A. truncatum. The final assembly spans 1.2 Gb, with a contig N50 of 42 Mb and scaffold N50 of 44 Mb, and contains only seven gaps. Telomeric and centromeric sequences were accurately resolved across most chromosomes. This represents the first T2T-level genome assembly in the Acer genus. The resulting high-contiguity genome provides a valuable and comprehensive reference for comparative and functional genomic studies in maples and related taxa.

Methods

Data acquisition

All raw sequencing data for A. truncatum ‘Lihong’, including PacBio subreads, ONT reads, Hi-C reads, and Illumina paired-end reads, were obtained from the National Genomics Data Center (NGDC) Sequence Read Archive under accession number PRJCA014724 (https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA014724). The ccs tool (v6.4.0) from Pacific Biosciences (https://github.com/PacificBiosciences/ccs) was used with the --all option to regenerate PacBio HiFi reads from the provided subreads. This process generated 38 Gb (~62 × coverage) of HiFi data, which was subsequently used in the genome assembly, alongside ONT, Hi-C, and Illumina data (Supplementary Tables 1 and 2).
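As a rough illustration of how the yield and coverage figures relate, the short sketch below back-calculates the genome size implied by 38 Gb at ~62 × coverage. The reference size actually used to compute coverage is not stated here, so the function name and the interpretation are assumptions made for illustration only.

```python
def implied_genome_size(total_bases: float, coverage: float) -> float:
    """Return the genome size (bp) implied by a sequencing yield and a fold coverage."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return total_bases / coverage


# Values taken from the text: 38 Gb of HiFi data at roughly 62x coverage.
hifi_bases = 38e9
hifi_coverage = 62

size_bp = implied_genome_size(hifi_bases, hifi_coverage)
print(f"Implied genome size: {size_bp / 1e6:.0f} Mb")
```

The implied size of about 613 Mb is close to half of the 1.2 Gb diploid assembly, which suggests that the coverage was expressed relative to a single haplotype.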

Genome assembly and polishing

We reassembled the ‘Lihong’ genome using hifiasm (v0.19.8-r602)3 to generate haplotype-resolved contigs by integrating PacBio HiFi reads, Oxford Nanopore long reads, and Hi-C sequencing data. Hi-C reads were aligned to the assembled contigs using Juicer (v1.6)28, and an initial chromosome-scale assembly was generated using 3D-DNA (v201008)29. This draft assembly was manually curated using Juicebox (v2.20.00)30 to refine chromosome boundaries, correct misassemblies, and resolve haplotype switch errors based on Hi-C contact maps30.

This process yielded a chromosome-scale assembly of 1.2 Gb, comprising two complete haplotypes (Fig. 1). Attempts to phase the two haplotypes using subphaser (v1.2)31 were unsuccessful due to the absence of discriminative k-mers, and the haplotypes were thus arbitrarily labeled as “a” and “b”.

Fig. 1

Genomic landscape of the haplotype-resolved A. truncatum genome. (a) Chromosome length in Mb; (b) Density of Class I transposable elements (TEs); (c) Density of Class II TEs; (d) Gene density across the genome; (e) Proportion of tandem repeats; (f) GC content; (g) Collinear blocks larger than 100 kb. (b–f) represent statistical analyses using a window size of 500 kb.

Gap filling

Gaps within the scaffolded assembly were filled using quarTeT (v1.2.5)4, leveraging the high accuracy of PacBio HiFi reads. This step markedly improved the assembly’s continuity.

Telomere extension

To improve the completeness of telomeric regions and extend assemblies to chromosome termini, HiFi reads were mapped to the draft genome using minimap2 (v2.29)32. Reads mapping to chromosomal termini were extracted and reassembled into contigs using hifiasm, which were then aligned back to the assemblies to extend the telomeric regions. Telomeres were detected at all expected positions except one end of chromosomes 4a, 5a, 6b, and 7b, and both ends of chromosome 7a (Fig. 2a). Conserved plant rDNA repeats33,34 were localized: 18S–5.8S–28S arrays were enriched at the termini of chromosomes 1a, 6a, 7a, and 7b (Fig. 2b), while 5S arrays were primarily distributed on chromosomes 3a, 3b, and 8b (Fig. 2c).

Organelle genome assembly

Separate assemblies for the chloroplast and mitochondrial genomes were performed using GetOrganelle (v1.7.7.1)35, based on PacBio HiFi reads.

Fig. 2

Distribution and quantity of telomeres, 18-5.8-28S rDNA, and 5S rDNA on the A. truncatum chromosomes. (a) Distribution and quantity of telomeres; (b) Distribution and quantity of 18-5.8-28S rDNA; (c) Distribution and quantity of 5S rDNA.
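The telomere counts summarized in Fig. 2a can be approximated with a simple motif scan of chromosome ends. The sketch below is a minimal illustration, not the procedure used in this study: it assumes the Arabidopsis-type plant telomere repeat (CCCTAAA/TTTAGGG), and the function names and window size are hypothetical. Neither the motif nor the detection tool is specified in the text.

```python
import re


def revcomp(seq: str) -> str:
    """Reverse-complement an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def terminal_repeat_counts(seq: str, motif: str = "CCCTAAA", window: int = 10_000):
    """Count telomere repeat units within `window` bp of each sequence end.

    Assumption: the plant-type motif CCCTAAA (or its reverse complement
    TTTAGGG) marks telomeres; matches are counted without overlap.
    """
    seq = seq.upper()
    motif = motif.upper()
    pattern = re.compile(f"(?:{motif})|(?:{revcomp(motif)})")
    left = len(pattern.findall(seq[:window]))
    right = len(pattern.findall(seq[-window:]))
    return left, right


# Toy chromosome: 50 repeat units at the start, 30 at the end.
toy = "CCCTAAA" * 50 + "ACGT" * 5000 + "TTTAGGG" * 30
print(terminal_repeat_counts(toy, window=1000))
```

A chromosome end would then be called telomere-bearing when its count exceeds a chosen threshold; that threshold is a design choice and is not given in the text.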

Redundancy and contamination removal

Redundans (v2.0.1)36 was used to align all contig and scaffold sequences to the chromosomal and organellar genome assemblies. This analysis identified low-coverage fragments, potential haplotigs, rDNA fragments, and other redundant or contaminating sequences within the unplaced or scattered sequences. These fragments were manually curated and removed, resulting in a final assembly with 7 gaps, each fixed at 100 bp.
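The gap count and N50 statistics quoted for the final assembly can be recomputed from the scaffold sequences alone. The following sketch, with hypothetical helper names, assumes that gaps are encoded as runs of N and that contigs are the N-free segments between them. It is an illustration rather than the script used in this study.

```python
import re


def n50(lengths):
    """Return the N50: the length L such that sequences >= L cover half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def split_at_gaps(scaffold: str):
    """Split a scaffold into contigs at runs of N (assumed gap encoding)."""
    return [contig for contig in re.split(r"[Nn]+", scaffold) if contig]


def count_gaps(scaffold: str, gap_len: int = 100):
    """Count N-runs of exactly `gap_len` bp (100 bp gaps, as stated in the text)."""
    return sum(1 for m in re.finditer(r"[Nn]+", scaffold) if len(m.group()) == gap_len)


# Toy assembly: one scaffold with a single 100 bp gap, plus one gap-free scaffold.
scaffolds = ["A" * 500 + "N" * 100 + "C" * 300, "G" * 200]
contigs = [c for s in scaffolds for c in split_at_gaps(s)]

print("gaps:", sum(count_gaps(s) for s in scaffolds))
print("contig N50:", n50([len(c) for c in contigs]))
print("scaffold N50:", n50([len(s) for s in scaffolds]))
```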

Repeat sequence identification

De novo identification and annotation of transposable elements (TEs) were performed to build a custom TE library for A. truncatum using EDTA (v1.9.9)37 with parameters --sensitive 1 --anno 1. The genome assembly was then masked for repetitive sequences using this library with RepeatMasker (v4.1.8) (https://www.repeatmasker.org/RepeatMasker/). This analysis identified 2,128,936 repetitive sequences, covering 774,332,459 base pairs, which represent 63.98% of the entire assembly (Tables 1 and 2). Among the repeat classes, Long Terminal Repeats (LTRs) were the most abundant, comprising 738,696 elements that span 515,715,855 bp (42.61% of the genome). The LTR subclasses copia and gypsy accounted for 18.36% and 12.20% of the genome, respectively. Terminal Inverted Repeats (TIRs) made up 11.58% of the genome sequence.

Table 1 Annotation statistics of the Acer truncatum genome assembly.
Table 2 Repeat annotation statistics of the Acer truncatum genome.

Gene prediction and annotation

Protein-coding gene annotation combined several evidence sources and computational tools. Homology-based evidence included 334,064 non-redundant protein sequences from 14 plant species (Supplementary Table 3). Transcript evidence consisted of the merged transcript set evaluated under Technical Validation. The transcript and homology evidence were used in the PASA (v2.4.1) pipeline38 to annotate gene structures and identify full-length transcripts. These full-length transcripts were used to train the ab initio gene prediction tools AUGUSTUS (v3.5.0)39 and SNAP40, with AUGUSTUS undergoing five rounds of iterative optimization.

The MAKER2 (v3.01.03) pipeline41 was then used to integrate evidence from ab initio predictions (AUGUSTUS, SNAP), transcript alignments using BLASTN and TBLASTX42, and homologous protein alignments using BLASTX. Exonerate (v2.2.0)43 was used to refine alignment evidence, excluding repetitive regions masked by RepeatMasker. EvidenceModeler (EVM) (v2.1.0)44 was employed to integrate the MAKER output with the gene models generated by PASA, producing a more consistent gene set. To minimize the inclusion of transposable element genes, TEsorter (v1.4.7)45 was used to identify TE protein domains within the predicted genes, and these domains were masked during the EVM integration step. The final gene models were refined using PASA to add untranslated regions (UTRs) and model alternative splicing isoforms. Finally, gene models were filtered to remove those with internal stop codons, ambiguous bases, missing start or stop codons, or encoded proteins shorter than 50 amino acids.
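The final filtering criteria can be expressed as a small predicate over coding sequences. The sketch below is illustrative only and is not the script used in this study. It assumes the standard genetic code (ATG start; TAA, TAG, and TGA stops) and additionally rejects sequences whose length is not a multiple of three, a criterion not stated in the text. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def passes_filters(cds: str, min_protein_len: int = 50) -> bool:
    """Return True if a CDS satisfies the filtering criteria listed in the text."""
    cds = cds.upper()
    # Ambiguous bases (anything other than A, C, G, T).
    if set(cds) - set("ACGT"):
        return False
    # Assumption (not stated in the text): partial codons are treated as failures.
    if len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # Missing start or stop codon.
    if not codons or codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    # Internal (premature) stop codon.
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        return False
    # Protein length excludes the terminal stop codon.
    return len(codons) - 1 >= min_protein_len


complete = "ATG" + "GCT" * 49 + "TAA"          # 50 amino acids, kept
too_short = "ATG" + "GCT" * 48 + "TAA"         # 49 amino acids, removed
print(passes_filters(complete), passes_filters(too_short))
```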

This comprehensive annotation pipeline resulted in a final set of 58,569 protein-coding genes, comprising 81,299 transcripts (Table 1, Supplementary Table 4). Subgenome “a” contains 29,648 genes, with an average gene length of 4,481 bp, an average of 5.9 exons per gene, and an average exon length of 303 bp (Supplementary Table 5). Subgenome “b” contains 28,921 genes, with an average gene length of 4,464 bp, an average of 5.7 exons per gene, and an average exon length of 290 bp.

ncRNA annotation

Transfer RNA (tRNA) genes were annotated using tRNAscan-SE (v2.0.12)46. Ribosomal RNA (rRNA) genes were annotated with barrnap (v0.9) (https://github.com/tseemann/barrnap), with partial gene predictions filtered out. Other types of non-coding RNAs (ncRNAs), including microRNAs (miRNAs), small nuclear RNAs (snRNAs), and others, were identified by aligning to the Rfam database using RfamScan (v15.0) (https://rfam.org/). In total, we annotated 1,259 rRNA sequences, 1,560 tRNA sequences, and 338 small ncRNAs (Supplementary Table 6).

Functional annotation

Functional annotation of the protein-coding genes was performed using three complementary strategies. First, Gene Ontology (GO) terms and KEGG pathway annotations were assigned using eggNOG-mapper (v2)47 against the eggNOG homology database48. Second, sequence similarity searches were conducted with DIAMOND (v2.0.15)49 against multiple protein databases: Swiss-Prot, TrEMBL, NR, and the Arabidopsis thaliana proteome. Only the best hit per gene was retained, requiring an alignment identity greater than 30% and an E-value less than 1e-5. Third, conserved protein domains, motifs, and functional sites were identified using InterProScan (v5.74-105.0)48, which queries databases such as PRINTS, Pfam, SMART, PANTHER, and CDD.
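The best-hit selection for the similarity searches can be sketched as follows. This example assumes the 12-column BLAST-style tabular output (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and defines the "best" hit as the one with the highest bit score. The text specifies neither the output format nor the ranking criterion, so both are assumptions, and the function name is hypothetical.

```python
import csv


def best_hits(lines, min_identity: float = 30.0, max_evalue: float = 1e-5):
    """Keep the highest-bitscore hit per query with identity > 30% and E-value < 1e-5."""
    best = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        query, subject = row[0], row[1]
        pident, evalue, bitscore = float(row[2]), float(row[10]), float(row[11])
        # Thresholds from the text: identity strictly above 30%, E-value strictly below 1e-5.
        if pident <= min_identity or evalue >= max_evalue:
            continue
        if query not in best or bitscore > best[query][3]:
            best[query] = (subject, pident, evalue, bitscore)
    return best


# Toy alignments with hypothetical gene and protein identifiers.
example = [
    "geneA\tP1\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t350",
    "geneA\tP2\t90.0\t210\t21\t0\t1\t210\t1\t210\t1e-60\t400",
    "geneB\tP3\t25.0\t150\t112\t2\t1\t150\t1\t150\t1e-10\t80",
    "geneC\tP4\t45.0\t100\t55\t1\t1\t100\t1\t100\t1e-3\t40",
]
for gene, hit in best_hits(example).items():
    print(gene, hit)
```

In this toy input only geneA retains a hit (P2), because geneB fails the identity threshold and geneC fails the E-value threshold.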

Overall, 96.3% of the protein-coding genes received at least one type of functional annotation. Specifically, GO terms were assigned to 42.63% of the genes, and KEGG pathway annotations were assigned to 40.57% (Table 1, Supplementary Table 7).

Data Records

All data generated or analyzed during this study are publicly available. The haplotype-resolved genome assembly of Acer truncatum is available at the European Nucleotide Archive (ENA) under accession GCA_976991395 (https://identifiers.org/insdc.gca:GCA_976991395)50. The genome assembly and annotation files are also available on Figshare (https://doi.org/10.6084/m9.figshare.27020836.v1)51. Supplementary material associated with this study can be accessed online via Figshare (https://doi.org/10.6084/m9.figshare.30294964.v1)52.

Technical Validation

Evaluation of the assembled and the annotated genome

To evaluate the accuracy and completeness of the genome assembly, we performed multiple assessments based on read mapping, genome coverage, sequence consistency, gene completeness, and chromosomal structure.

Illumina short reads were mapped to the final assembly using BWA (v0.7.17)53, achieving a mapping rate of 99.69%. PacBio HiFi reads were mapped using minimap2, with a 99.55% mapping rate after removing non-primary alignments (Table 3). Depth analysis revealed that 99.81% of the genome was covered at least 10 × by Illumina reads and 99.58% by HiFi reads, indicating near-complete coverage.
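The coverage breadth reported above (the fraction of bases covered at least 10 ×) can be computed from a per-base depth table. The sketch below assumes three tab-separated columns (sequence name, position, depth), such as the output of `samtools depth -a`. The tool used for the depth analysis is not named in the text, so this input format and the function name are assumptions.

```python
def breadth_at_depth(depth_lines, min_depth: int = 10) -> float:
    """Return the fraction of positions whose depth is >= min_depth.

    Each input line is assumed to be 'name<TAB>position<TAB>depth', with every
    genomic position present (including zero-depth positions).
    """
    covered = 0
    total = 0
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        depth = int(line.split("\t")[2])
        total += 1
        if depth >= min_depth:
            covered += 1
    return covered / total if total else 0.0


# Toy depth table: 4 of 5 positions reach 10x.
toy_depths = [
    "chr1a\t1\t35",
    "chr1a\t2\t12",
    "chr1a\t3\t10",
    "chr1a\t4\t9",
    "chr1a\t5\t48",
]
print(f"{breadth_at_depth(toy_depths):.2%} of positions covered at >= 10x")
```

Zero-depth positions must be present in the input for the denominator to be correct, which is why the full per-position output is assumed.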

Table 3 Alignment and coverage statistics of different types of sequencing reads mapped to the Acer truncatum genome.

To detect potential redundancy or collapsed regions, sequencing depth was analyzed across all genomic loci. Both single-copy and multi-copy BUSCO regions showed unimodal depth distributions approximating a Poisson model, without secondary peaks or signs of heterozygosity (Fig. 3a–d). GC-depth correlation analysis showed no significant GC bias in either sequencing dataset (Fig. 3e,f).

Fig. 3

Assessment of genome assembly quality through read coverage depth. (a) Genome coverage depth by Illumina reads; (b) Coverage depth of BUSCO core genes by Illumina reads; (c) Genome coverage depth by HiFi reads; (d) Coverage depth of BUSCO core genes by HiFi reads; (e) Illumina read coverage depth across different GC contents; (f) HiFi read coverage depth across different GC contents.

Assembly completeness was further assessed using BUSCO (v5.8.2)54 with the embryophyta_odb10 dataset. The complete diploid assembly (haplotypes a and b) contained 98.9% complete BUSCOs, with subgenomes “a” and “b” showing 98.7% and 98.9% completeness, respectively. Evaluation of the annotated protein-coding gene set yielded an even higher completeness rate of 99.3% (Table 1, Supplementary Table 8), with 98.2% in subgenome “a” and 98.0% in subgenome “b”.

To validate haplotype phasing, we used KAT (v2.4.1)55 to generate k-mer spectra from the raw Illumina reads. The spectra showed clear separation of haplotype-specific k-mers between subgenomes “a” and “b” (Fig. 4), confirming phasing accuracy.

Fig. 4

Quality assessment of the genome assembly. K-mer analysis was performed using KAT to compare HiFi reads to the complete genome (a), subgenome a (b), and subgenome b (c) of A. truncatum. The plots are color-coded to show the frequency of specific k-mers from the reads in the assembly. K-mers absent from the assembly are shown in black, while those present are indicated by red (once), purple (twice), green (three times), blue (four times), yellow (five times), and orange (six or more times).

Hi-C contact matrices were generated by aligning Hi-C reads to the final assembly using Juicer, and visualized using Juicebox. The matrices showed strong intra-chromosomal contacts and minimal off-diagonal noise, indicating high-quality scaffolding (Fig. 5a,b). A large-scale inversion polymorphism between chromosomes 6a and 6b was identified and confirmed by synteny analysis using minimap2 (Fig. 5c,d), indicating real biological structural variation rather than misassembly.

Fig. 5

Hi-C interaction heatmaps of the A. truncatum genome. (a) Hi-C interaction heatmap for subgenome a; (b) Hi-C interaction heatmap for subgenome b; (c) Hi-C interaction heatmap for the complete haplotype genome; (d) Synteny between the complete haplotype genome and subgenome.

Evaluation of the assembled transcriptome

Transcriptome assemblies were validated using BUSCO against the embryophyta_odb10 dataset. The de novo assembled transcript set from Trinity (v2.15.1)38 yielded 91.0% completeness. The genome-guided assembly, generated by mapping reads with Hisat2 (v2.2.1)56 and assembling transcripts with StringTie (v3.0.0)57, achieved 98.0% completeness. The merged transcript set contained 126,155 sequences and showed a BUSCO completeness of 99.0%, confirming its suitability as transcript evidence for gene annotation.