Background & Summary

Xanthoceras sorbifolium, commonly known as yellowhorn, is a monotypic species of the Sapindaceae family and a deciduous shrub or small tree native to northern China1. Its spring blossoms display notable petal color changes and diverse shapes, making it a popular ornamental in landscaping2. With robust tolerance to drought, cold, and saline-alkali soils, yellowhorn thrives in arid and semi-arid regions, aiding desertification control in northwestern China3. Yellowhorn seed kernels contain 55%–65% oil, with 93% unsaturated fatty acids, supporting both edible oil and biodiesel production4. This combination of ornamental appeal, resistance, and high oil content makes yellowhorn a valuable species in horticultural and agricultural contexts.

A notable ornamental variant is the double-flowered form, characterized by multiple petal whorls but rendered sterile due to a transposon insertion in the intron region of the AGAMOUS (AG) gene XsAG15. However, the complete structure of XsAG1 remains unclear, and a comprehensive genome for the double-flowered variety is still lacking. Beyond ornamentation, yellowhorn displays significant resistance to plant pathogens and abiotic stress6,7. NLR genes are essential for pathogen detection and immune signaling8,9,10, underscoring the need for high-quality genomes to identify NLR gene loci and further development of disease-resistant varieties11.

Although several yellowhorn genomes have been assembled since 20197,12,13,14,15,16, limitations including incomplete telomeres and centromeres, unclosed gaps, and limited annotation quality constrain our genetic understanding. A haplotype-resolved T2T genome can serve as a blueprint for precise breeding. In this study, we constructed the first haplotype-resolved T2T gapless genomes for two ornamental yellowhorn varieties: “PBN-43”, a single-flowered, white-petaled cultivar from Zhangjiakou (Hebei, China), and “PBN-126”, a double-flowered, white-petaled cultivar from Chifeng (Nei Mongol, China) (Supplementary Fig. S1). We employed PacBio HiFi reads, high-throughput chromatin conformation capture (Hi-C) reads, Illumina paired-end (NGS) reads, and RNA-seq reads for genome assembly and annotation. The final haplotype genomes range in length from 464.34 Mb to 468.93 Mb and contain an average of 35,084 high-confidence protein-coding genes (PCGs). Additionally, we systematically revealed profiles of genetic variations, transposable elements (TEs), and NLR genes based on the assembled genomes, collectively providing valuable genomic resources for future studies and breeding.

Methods

Plant materials and sequencing

The single-flowered variety (PBN-43) and double-flowered variety (PBN-126) of Xanthoceras sorbifolium were cultivated at the Woqi Agriculture Development Co., Ltd yellowhorn orchard, Weifang, Shandong Province, China (Supplementary Fig. S1). For each variety, all tissue samples were collected from the same individual plant. Specifically, 10 g of young fresh leaves were collected and flash-frozen in liquid nitrogen for DNA and RNA extraction and sequencing. For flower tissues and mature seeds (77 days after flowering, only for PBN-43), 2 g per sample were collected for RNA extraction and sequencing. Genomic DNA was isolated using cetyltrimethylammonium bromide (CTAB) method. Frozen young leaves were ground into powders in liquid nitrogen for DNA extraction, and the quality of the isolated genomic DNA was assessed by monitoring DNA degradation and contamination on 1% agarose gels and measuring DNA concentration using the Qubit® DNA Assay Kit with a Qubit® 2.0 Fluorometer (Life Technologies, CA, USA). Total RNA was isolated by utilizing RNAprep Pure Plant Plus Kit (Polysaccharides & Polyphenolics-rich, TIANGEN, Beijing, China) according to the manufacturer’s instructions. RNA integrity was assessed using the RNA Nano 6000 Assay Kit of the Bioanalyzer 2100 system (Agilent Technologies, CA, USA). Sequencing was performed on the Illumina NovaSeq. 6000 platform.

To assemble and annotate haplotype-resolved T2T gapless genomes of yellowhorn varieties PBN-43 and PBN-126, we generated high-depth Illumina paired-end (NGS) reads (mean depth ~60.66×), PacBio HiFi reads (mean depth ~205.27×), high-throughput chromatin conformation capture (Hi-C) sequencing reads (mean depth ~217.48×), and RNA-seq reads for both accessions (Table 1). For Illumina paired-end (NGS) reads, a library was constructed following the manufacturer’s protocol and sequenced on the Illumina NovaSeq. 6000 platform. Library preparation involved DNA fragmentation, end-polishing, A-tailing, adapter ligation, PCR amplification, and purification. The libraries were analyzed for size distribution using an Agilent 2100 Bioanalyzer and quantified by real-time PCR. For PacBio HiFi reads, a 15 kb library was constructed using a SMRTbell Express Template Prep Kit 3.0 (Pacific Biosciences, CA, USA). Library preparation included DNA shearing, damage repair, end repair, hairpin adapter ligation, size selection, and purification of the library. After quality control testing, the SMRTbell library was sequenced using a single 25 M SMRT Cell on the PacBio Revio platform (Pacific Biosciences, CA, USA). For Hi-C reads, the library was prepared from cross-linked chromatin isolated from young leaves using a standard Hi-C protocol. Chromatin was cross-linked with formaldehyde, digested with restriction enzyme DpnII, labeled with biotin-14-dCTP, and ligated to form proximity ligation products. DNA was then purified, fragmented, and enriched with streptavidin beads before adapter ligation. The resulting library was amplified and sequenced on the DNBSEQ platform to generate 2 × 150 bp paired-end reads.

Table 1 Sequencing data statistics for PBN-43 and PBN-126.

Genome survey and assembly

The estimation of genome size and heterozygosity was conducted by Jellyfish (v2.2.10)17 and GCE (v1.0.2)18 based on k-mer size = 23. The genome assembly process involved several steps. Hifiasm (v0.19.9-r616)19 was utilized to conduct de novo genome assembly by integrating HiFi reads and Hi-C reads with the parameters “-s 0.45 -l 2 -x 0.95 -y 0.6 --telo-m CCCTAAA” to accommodate heterozygosity and assemble more contigs with telomere motifs. Purge_haplotigs (v1.1.3)20 was then used to improve haplotype phasing by further removing redundant contigs based on hifiasm preliminary assemblies and HiFi reads. Then, Hi-C data were used to anchor the contigs of each haplotype via the Juicer (v1.6)21 and the 3D-DNA (v201013)22 pipelines. After this step, almost half of chromosomes in each haplotype reached T2T level. Manual corrections for misassemblies were made using Juicebox (v2.17.00)23, and short contigs were discarded during this process. Verkko (v2.1)24 was additionally employed for assembly by integrating PacBio HiFi and Hi-C data to generate continuous long contigs with default parameters. Verkko assembly, hifiasm assembly, and HiFi reads were utilized for closing gaps with the help of quarTeT (v1.1.7)25 GapFiller module (with default parameter) and Integrative Genomics Viewer (IGV)26. These assemblies and HiFi reads were also used for telomere patching using Teloclip pipeline (v0.0.4) (https://github.com/Adamtaranto/teloclip). Finally, PGA (v.0.2) (https://github.com/likui345/PGA) was used to rearrange the final assemblies according to ZS4 with default parameters. For telomere identification, the quarTeT TeloExplorer module was used with the “-c plant” parameter. For centromere identification, centromere candidate regions were initially generated using the quarTeT CentroMiner module for each chromosome with repetitive elements and gene annotation under default parameters. These candidate regions were further manually determined by integrating information from repeat density, gene density, and Hi-C chromatin interaction heatmaps. After these steps, we obtained four haplotype genomes ranging from 464.34 Mb to 468.93 Mb, all of which reached gapless telomere-to-telomere (T2T) level with high-quality assessment metrics, representing the most comprehensive and complete yellowhorn genomes to date (Fig. 1a–e, Table 2, Supplementary Table S1).

Fig. 1
figure 1

Comprehensive overview of haplotype-resolved yellowhorn genomes. (a) Circos map of each haplotype’s genomic features. Including: I. chromosomes (blue bar for PBN-43_hap1, orange bar for PBN-43_hap2, green bar for PBN-126_hap1, and purple bar for PBN-126_hap2; grey segments within the chromosome bars indicate centromere locations), II. GC content, III. gene density, IV. TE density, V. LTR/Copia density, VI. LTR/Gypsy density, VII. syntenic blocks. (b–e) Hi-C chromatin interaction heatmaps for PBN-43_hap1 (b), PBN-43_hap2 (c), PBN-126_hap1 (d), and PBN-126_hap2 (e), displaying chromosomal interaction frequencies for each haplotype. (f) LTR-RT insertion times (in Mya) across selected Sapindaceae species. Clade colors: red for Sapindoideae, blue for Hippocastanoideae, and green for Xanthoceroideae.

Table 2 Statistics for genome assembly and annotation of PBN-43 and PBN-126.

Repetitive elements annotation

We conducted repetitive elements annotation using a combination of de novo prediction and homology-based searches. For both varieties, we first utilized RepeatModeler (v2.0.5)27 to construct a de novo repeat library with default parameters. Next, to further refine the library, we classified all sequences labeled as “Unknown” in the de novo library using TEsorter (v1.4.6)28 with default parameters. Additionally, we incorporated Sapindales-specific repeat families from the Dfam database29 using the famdb.py tool, resulting in an updated species-specific repeat library. Finally, RepeatMasker (v4.1.6) (https://www.repeatmasker.org/) was used to annotate repeat regions, employing both the species-specific repeat library and the RepBase30 library. We performed genome-wide LTR-RT identification and classification using LTR_retriever (v2.9.5)31, with outputs from the LTRharvest (v1.6.5)32 and LTR_FINDER_parallel (v1.1)33 pipelines. The neutral mutation rate parameter was set to “-u 7e-9” (according to Arabidopsis34) in LTR_retriever to estimate LTR-RT insertion times.

For yellowhorn, an average of 67.99% (~317.09 Mb) of genomic regions across all haplotype genomes consisted of repetitive elements, particularly long terminal repeat retrotransposons (LTR-RT). The Gypsy superfamily was slightly more abundant than the Copia superfamily in PBN-43, whereas the opposite pattern was observed in PBN-126. Additionally, unclassified elements were reduced to an all-time low, averaging just 1.24% across all haplotype genomes (Table 2). We also found that most Sapindaceae species experienced LTR-RT insertion events within the last 6 million years (Mya), and several species (e.g., Dipteronia dyeriana and Litchi chinensis) exhibited complex and extended insertion patterns. Yellowhorn underwent a more recent LTR-RT insertion during the last 0.45–0.48 Mya (Fig. 1f).

Genes annotation

Multiple prediction methods contributed to the final gene annotation. For transcriptome-based prediction, we first used HISAT2 (v2.2.1)35 to align transcriptome reads to the genomes to generate BAM files with default parameters, which were then sorted using SAMtools (v1.2.0)36. Then, two transcriptome-based prediction pipelines were employed: (1) a de novo transcript assembly pipeline consisting of Trinity (v2.15.1)37, PASA (v2.5.3)38, and TransDecoder (v5.7.1) (https://github.com/TransDecoder/TransDecoder); and (2) a reference-guided transcript assembly pipeline using StringTie (v2.2.1)39 and TransDecoder (v5.7.1). For homology-based prediction, we used a homologous protein dataset containing sequences from Acer truncatum, Aesculus chinensis, Arabidopsis thaliana, Dimocarpus longan, Litchi chinensis, and Xanthoceras sorbifolium “ZS4”. Gene models were predicted using miniprot (v0.12-r237)40 with the parameters “-I –gff –outc 0.8 –outn 1000” and GeMoMa (v1.9)41 with default parameters. For ab initio prediction, we trained and generated gene models using AUGUSTUS (v3.5.0)42 and GeneMark-ETP (v1.02)43 via the BRAKER3 (v3.0.3)44 pipeline, based on hard-masked genomes, aligned RNA-seq reads (BAM files), and the homologous protein dataset described above. After assigning weights to these three prediction methods, all predicted gene models were integrated into consensus gene models using EvidenceModeler (v2.0.0)45. For functional annotation, DIAMOND (v2.1.9)46 was used to search protein sequences against the NCBI non-redundant (NR) protein database47 and the UniProt database48 with the parameters “--evalue 1e-5 --max-target-seqs 1”. Additionally, InterProScan (v5.47–82.0)49, eggNOG-mapper (v2.1.12)50, and kofam_scan (v1.3.0)51 were utilized for further functional annotation.

Collectively, an average of 35,084 protein-coding genes was predicted with an average gene length of 3,166 bp. Across all haplotype genomes, at least 93.75% of all genes were functionally annotated (Table 2). Compared with the first published yellowhorn genome ZS4 (23,365 genes), a mean of 11,719 more genes have been predicted in haplotype genomes, and more than 99.8% of ZS4 genes were found in our genomes (Supplementary Table S2).

Genomes comparison and synteny analysis

To capture and catalog variations among haplotypes and previously published genomes, we first used Minimap2 (v2.26-r1175)52 with the parameters “asm5 -ax –eqx” to align the genomes. Synteny and Rearrangement Identifier (SyRI, v1.6.3)53 was then applied to identify variations based on the genome alignment results, followed by visualization using plotsr (v1.1.1)54. For variation annotation, SnpEff (v5.2-1)55 databases were constructed for both varieties and used with default parameters on the VCF files generated by SyRI. Variations classified as “High” and “Moderate” impact by SnpEff were regarded as deleterious variations. With the haplotype-resolved yellowhorn genomes assembled, alleles were identified directly from the genome sequences using the JCVI (v0.0.0)56 pipeline and manual curation. Gene annotation files for hap1 and hap2 were converted into BED format using the jcvi.formats.gff module with default parameters, and orthologous gene pairs were detected using the jcvi.compara.catalog module with the parameter --cscore=0.99, serving as preliminary allele pairs. Redundant pairs were manually removed based on coding protein sequence similarity and chromosomal location, particularly removing interchromosomal pairs, to produce the final allele set.

To investigate genomic variation among different yellowhorn varieties, we chose PBN-43_hap1 and PBN-126_hap1 as reference genomes and aligned them with each other, as well as individually compared them to the previously published yellowhorn genome ZS4. In comparing PBN-43_hap1 vs. PBN-126_hap1, we detected more syntenic regions (343.90 Mb and 399.72 Mb) but fewer inversions (43.89 Mb and 10.66 Mb) and translocations (10.29 Mb and 4.60 Mb) than in comparisons involving ZS4, indicating greater genomic similarity between these two varieties (Fig. 2a, Supplementary Table S3). Although the SNP density (5.31 SNPs/Kb and 5.23 SNPs/Kb) was slightly higher for PBN-43_hap1 vs. PBN-126_hap1, fewer of these SNPs resulted in deleterious variations such as frameshifts or stop-gained mutations, and more occurred in intergenic regions. Likewise, the numbers of insertions (105,700 and 129,675) and deletions (105,901 and 120,709) were significantly lower than in alignments with ZS4, and structural variants (SVs) showed a similar trend (Supplementary Tables S3, S4). We also identified several large-scale SVs, especially on Chr02, Chr06, and Chr07, which were supported by HiFi reads spanning the junction, confirming the accuracy of the assemblies (Fig. 2a, Supplementary Fig. S3).

Fig. 2
figure 2

Genomic variations among different yellowhorn varieties and within the haplotypes. (a) Genomic variations among PBN-43_hap1, PBN-126_hap1, and the previously published yellowhorn genome ZS4. (b-c) Haplotype-resolved variations in PBN-43 (b) and PBN-126 (c), with SNP and InDel densities calculated per 100 Kb using hap1 as the reference genome.

At the haplotype level, variations between haplotypes for PBN-43 and PBN-126 were identified and annotated, and a total of 27,388 (PBN-43) and 27,888 (PBN-126) allelic genes were identified (Fig. 2b,c, Supplementary Tables S3, S4). Several large-scale SVs were detected among haplotypes and confirmed by HiFi read coverage on junction loci (Supplementary Fig. S3). Additionally, both the number of variations and deleterious variations in PBN-126 haplotypes is lower than those in PBN-43, which corresponds to the heterozygosity rate differences (Supplementary Table S3, S4). Notably, the previously discovered AG gene variation5, which leads to the double-flowered phenotype, is present in both alleles in PBN-126 and we found that it is actually a 342 bp LINE-RH insertion instead of a LINE1 (Supplementary Fig. S4).

Identification of NLR genes

NLR-annotator (v2.1b)57 was used to identify NLR domains across the genomes of 12 Sapindaceae species with default parameters. Genes were classified as NLR genes if they contained at least an NB-ARC domain, a C-terminal LRR domain, and an N-terminal domain of either Toll/interleukin-1 receptor (TIR), coiled-coil (CC), or RPW8, as determined by InterProScan functional annotation. Classification of NLR genes was based on the type of N-terminal domain.

In this study, we conducted genome-wide identification of NLR genes in both haplotypes of PBN-43 and PBN-126, as well as in the genomes of 11 other Sapindaceae species. Across the Sapindaceae family, the number of NLR genes varies tremendously, ranging from 90 in Aesculus chinensis to 589 in Dimocarpus longan (Fig. 3a, Supplementary Tables S5, S6). This variation aligns with the species-specific mechanisms of NLR gene expansion and contraction characteristic of flowering plants58. In yellowhorn, we identified 211/252 NLR genes in PBN-43 hap1/hap2 and 232/260 in PBN-126 hap1/hap2. Comparison of NLR gene locations among PBN-43, PBN-126, and ZS4 revealed that the main differences lie in some dispersed NLR genes, such as those on Chr09 (Fig. 3b). Regarding the NLR gene densities along chromosomes, we found that they tend to form gene clusters in yellowhorn, especially on the short arms of Chr04 and Chr09. Other chromosomes show a sparse distribution of NLR genes, reflecting that “hotspots” of NLR genes remain consistent among haplotypes and varieties (Fig. 3c, Supplementary Fig. S5). This phenomenon aligns with the duplication mode of yellowhorn NLR genes, with proximal and tandem duplications constituting the majority (Fig. 3d). NLR gene clusters may provide functional redundancy, allowing multiple genes in the same region to perform similar immune functions. In summary, this study presents for the first time complete haplotype-resolved genomes of yellowhorn, laying a foundation for future research into its ornamental traits, breeding potential, and broader genomic studies.

Fig. 3
figure 3

NLR gene landscape in Sapindaceae species and yellowhorn. (a) NLR gene numbers across species from the three Sapindaceae subfamilies species. (b) Chromosomal localization of NLR genes in PBN-43, PBN-126, and ZS4 genomes. (c) Density of NLR genes along chromosomes in PBN-126_hap1. (d) Duplication modes of all NLR genes in both PBN-43 and PBN-126 genomes.

Data Records

The raw sequence data (PacBio HiFi, Illumina paired-end, Hi-C, and RNA-seq) reported in this paper have been deposited in the Genome Sequence Archive (GSA) of the National Genomics Data Center (NGDC, https://ngdc.cncb.ac.cn) under the accessions CRA02162959, CRA02162860, CRA02163161, and CRA02162462, respectively, with the BioProject accession PRJCA0334897063. These data have also been deposited in the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov) under the accession SRP57310664, with the BioProject accession PRJNA120013565. The assembled genomes have been deposited in the NCBI GenBank under the GenBank accessions JBMIRF00000000066 (PBN-43_hap1), JBMIRG00000000067 (PBN-43_hap2), JBMRGO00000000068 (PBN-126_hap1), and JBMRGN00000000069 (PBN-126_hap2). For broader accessibility, we have also deposited the corresponding files including genome assemblies, genome annotation, functional annotation, and VCF files at Figshare70.

Technical Validation

Genome assembly and annotation quality assessment

We assessed the quality and completeness of the genome assemblies in multiple ways. First, HiC-Pro (v3.1.0)71 was used to align Hi-C reads to the final assemblies and generate matrix files with default parameters. These were then visualized using HiCPlotter (v0.6.02)72 to create Hi-C chromatin interaction maps. We checked Hi-C chromatin interaction maps and found no significant contig misassemblies in any of the genome assemblies (Fig. 1b–e). For mapping rates, PacBio HiFi reads and NGS reads were aligned to each haplotype genome using Minimap2 and BWA-MEM (v0.7.18)73, respectively. Mapping rates and genome coverages were calculated using SAMtools, resulting in an overall mapping rate of over 99%. Benchmarking Universal Single-Copy Orthologs (BUSCO, v5.5.0)74 was employed to assess the completeness of gene regions with the embryophyta_odb10 database (2024-01-08), and 98.82% to 99.01% of 1,614 plant core conserved genes were complete in all assemblies. Merqury (v1.3)75 was used to calculate QV based on a 23-mer database generated from HiFi reads using the meryl tool, and the QV values of all assemblies ranges from 66.79 to 67.02, indicating the accuracies were >99.99%. Additionally, LTR_retriever utilized outputs from LTRharvest and LTR_FINDER_parallel to generate LTR Assembly Index (LAI) for evaluating the assembly continuity. LAI scores of the four haplotype genomes range from 16.53 to 18.04, which can be categorized as “Reference” level. Collectively, these assessments suggest that we have produced high-quality haplotype-resolved yellowhorn genome assemblies in terms of contiguity, accuracy, and completeness (Table 2). The annotated genes achieved an average BUSCO completeness of 98.27% based on the embryophyta_odb10 database (2024-01-08), and an average of 93.90% of all genes were functionally annotated across all assemblies, indicating high-quality annotations (Table 2).