Background & Summary

Advances in long-read sequencing and assembly algorithms now enable the generation of telomere-to-telomere (T2T), haplotype-resolved genomes1. Recently, a haplotype-resolved T2T human genome was completed, providing an alternative reference genome for studying the genetics of East Asian populations2. T2T genomes have also been assembled for many plant species, significantly accelerating breeding efforts3. High-quality genome assemblies allow detailed investigation of complex genomic regions such as centromeres, tandem duplications, and segmental duplications1,4,5,6. However, haplotype-resolved T2T genomes remain scarce in non-primate vertebrates, limiting our understanding of the evolution of complex regions in these groups5.

With an increasing number of reports on distant hybridization in fish7, the discovery and utilization of wild germplasm resources could significantly accelerate fish breeding efforts8,9,10. Hypsibarbus vernayi (Cypriniformes: Cyprinidae) is a medium-sized barb distributed throughout the Mekong River basin in Southeast Asia and Yunnan, China. As a commonly consumed fish in the region, it holds potential value for aquaculture development. To date, a high-quality genome assembly is still lacking for this species. A well-assembled genome could accelerate its genetic breeding and aquaculture development.

In this study, we assembled a haplotype-resolved and near-T2T genome of H. vernayi using PacBio HiFi, Oxford Nanopore, and Hi-C data. The high-quality genome provides a valuable genetic resource for the aquaculture and breeding of cyprinid fishes.

Methods

Material sampling, DNA extraction and sequencing

A male H. vernayi was sampled from Yunnan Agricultural University. Five tissues were collected for RNA sequencing: brain, liver, kidney, heart, and spleen. The sampling and experimental protocol of this study complied with animal welfare laws and guidelines and was approved by the Scientific Ethics Committee of Yunnan Agricultural University, Kunming, China (Approval No. APYNAU202202018). Genomic DNA (gDNA) was isolated from muscle using standard phenol-chloroform extraction.

For Oxford Nanopore library construction, the SQK-LSK110 kit (Oxford Nanopore Technologies, UK) was used following the manufacturer’s instructions, and the library was sequenced on the PromethION sequencer (Oxford Nanopore Technologies, UK). For PacBio HiFi sequencing, the quality and length of gDNA were assessed using the Qubit 1X dsDNA HS assay kit (Thermo Fisher Scientific, USA) and the gDNA 165 kb kit on the Femto Pulse system (Agilent Technologies, USA). The SMRTbell library was constructed using the SMRTbell® prep kit 3.0 (Pacific Biosciences, USA) according to the manufacturer’s standard protocol. DNA fragments shorter than 5 kb were discarded using the AMPure PB bead size selection kit (Pacific Biosciences, USA) following the manufacturer’s instructions. The library was then sequenced on the PacBio Revio system (Pacific Biosciences, USA).

The Hi-C library was constructed following the protocol described by Lieberman-Aiden et al.11. Liver tissue was processed by cell cross-linking, DNA digestion and fragmentation (DpnII), biotin labelling, proximal chromatin DNA ligation, enrichment of biotin-labelled fragments, and purification. The concentration and insert size were evaluated using Qubit 3.0 (Thermo Fisher Scientific, USA) and the Agilent Bioanalyzer 2100 (Agilent Technologies, USA). The library was sequenced on the NovaSeq X Plus platform in PE 150 bp mode.

For RNA sequencing, total RNA was extracted with the RNAprep Pure Plant Kit (TIANGEN, China), and mRNA was purified using oligo(dT)-attached mRNA capture beads. Library construction and sequencing were performed at the Wuhan Benagen Technology Company Limited. A total of 110.9 Gb (~78.3X) of PacBio HiFi reads, 38.2 Gb (~27.0X) of Oxford Nanopore reads, and 250.6 Gb of Hi-C reads were generated, enabling us to assemble a phased diploid genome.
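
The reported depths can be approximately reproduced by dividing read yield by genome size. The sketch below assumes that the denominator is the combined length of the two haplotype assemblies reported later in the Methods; the text does not state which genome size was used, so this is an illustration rather than the authors' calculation. The function name is hypothetical.

```python
# Minimal sketch: sequencing depth = total sequenced bases / genome size.
# ASSUMPTION: depth is expressed relative to the combined (diploid) assembly
# length; the paper does not state the denominator explicitly.

HAP1_BP = 709_573_396  # contig-level size of the first haplotype (from the text)
HAP2_BP = 707_790_980  # contig-level size of the second haplotype (from the text)


def sequencing_depth(total_bases: float, genome_size_bp: float) -> float:
    """Return the average fold coverage for a given read yield."""
    return total_bases / genome_size_bp


diploid_bp = HAP1_BP + HAP2_BP
hifi_depth = sequencing_depth(110.9e9, diploid_bp)  # PacBio HiFi yield
ont_depth = sequencing_depth(38.2e9, diploid_bp)    # Oxford Nanopore yield

print(f"HiFi depth: {hifi_depth:.1f}X, ONT depth: {ont_depth:.1f}X")
```

Under this assumption the values land close to the reported ~78.3X and ~27.0X; small differences are expected from rounding of the read yields.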

Genome survey and assembly

We conducted a genome survey from the long-read data using a k-mer approach. KMC v3.2.412 was used to count k-mer frequencies with k = 19, and GenomeScope213 was used to estimate the genome size and heterozygosity. The estimated genome size was 692,991,697 bp, with a heterozygosity of 0.593% (Fig. 1b).
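
The idea behind k-mer-based size estimation can be illustrated with a deliberately simplified calculation: the total number of non-error k-mers divided by the depth of the homozygous peak. GenomeScope2 fits a statistical mixture model to the k-mer spectrum, so the sketch below is not its algorithm; the function name, the error cutoff, and the toy histogram are all hypothetical.

```python
# Simplified k-mer genome size estimate (NOT the GenomeScope2 model).
# histogram maps k-mer depth -> number of distinct k-mers seen at that depth.


def estimate_genome_size(histogram: dict, error_cutoff: int, homozygous_peak: int) -> float:
    """Estimate haploid genome size as (total k-mer occurrences) / (peak depth).

    Depths below error_cutoff are treated as sequencing errors and ignored.
    """
    total_kmers = sum(
        depth * count for depth, count in histogram.items() if depth >= error_cutoff
    )
    return total_kmers / homozygous_peak


# Toy spectrum: an error tail at depths 1-2, a small heterozygous peak near
# depth 20, and a homozygous peak near depth 40.
toy_histogram = {1: 5000, 2: 800, 20: 100, 39: 30, 40: 900, 41: 40}
toy_size = estimate_genome_size(toy_histogram, error_cutoff=5, homozygous_peak=40)
print(toy_size)
```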

Fig. 1

(a) The morphology of Hypsibarbus vernayi. (b) Genome survey based on the k-mer distribution (k = 19) of long reads. (c) BUSCO completeness assessment of the two haploid genomes. (d,e) Hi-C contact maps of the first (d) and second (e) haplotypes.

For genome assembly, we ran hifiasm v0.19.914 in Hi-C mode with default parameters, using HiFi and Oxford Nanopore reads to generate contigs. The Juicer pipeline v1.615 was then used to map the Hi-C reads against the contigs. 3D-DNA v20100816 was used to generate the .hic file, which was visualized and manually adjusted using Juicebox Assembly Tools15. The final genome was generated from the manually corrected assembly. For the alternative haplotype, we used RagTag v2.1.017 to order the contigs based on the assembly of the first haplotype, and then applied the Juicer pipeline15 and 3D-DNA16 in the same manner as above.

De novo assembly generated two sets of contigs with total sizes of 709,573,396 bp and 707,790,980 bp. Based on the Hi-C data, 705,107,344 bp and 705,617,558 bp of contig sequence, respectively, were anchored to 25 chromosomes, accounting for 99.37% and 99.69% of the total assembled size of each haplotype (Fig. 1d,e). This chromosome number is consistent with that reported for most Cypriniformes fishes in karyotype studies18. BUSCO assessment indicated a completeness of 99.0% and 99.1% for the two haplotypes (Fig. 1c). We also aligned the genome to that of Onychostoma macrolepis19 using minimap220. The dot plot showed one-to-one synteny between the chromosomes of the two genomes (Supplementary Fig. 1).
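
The anchoring rates follow directly from the base counts given above. The short check below recomputes them; the function name is hypothetical and the inputs are taken from the text.

```python
# Recompute the proportion of each haplotype assembly anchored to chromosomes.


def anchored_percent(anchored_bp: int, total_bp: int) -> float:
    """Return the percentage of assembled bases placed on chromosomes."""
    return 100.0 * anchored_bp / total_bp


hap1_rate = anchored_percent(705_107_344, 709_573_396)
hap2_rate = anchored_percent(705_617_558, 707_790_980)

print(f"hap1: {hap1_rate:.2f}%  hap2: {hap2_rate:.2f}%")
```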

Genome annotation

For repetitive element annotation, we used RepeatModeler v2.0.521 for de novo prediction of repetitive sequences and generated a satellite DNA library using SRF22. The two libraries were combined as input for RepeatMasker v4.1.2-p123 to produce a soft-masked genome, which was then used for gene model annotation. The repeat content was comparable between the two haplotypes, with 289,423,506 bp (40.79% of the genome) and 284,959,886 bp (40.26%) annotated in the first and second haploid genomes, respectively. The identified transposable elements include long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), long terminal repeats (LTRs), and DNA transposons. DNA transposons are the major component of the repetitive elements, accounting for approximately 35% of a haploid genome (Table 1).

Table 1 Statistics of the H. vernayi genomes.

We employed the BRAKER324 pipeline to predict gene models for the first haplotype, incorporating RNA-seq data and the protein sequences of Onychostoma macrolepis. For the second haplotype, LiftOn25 was used to lift over the homologous gene models from the annotated haplotype. A total of 24,172 protein-coding genes were annotated in the first haploid genome, and the homology-based lift-over yielded 24,919 protein-coding genes in the second. These gene numbers are comparable to that of O. macrolepis (24,77026), a closely related cyprinid fish in the subfamily Barbinae (sensu27).

Centromere and telomere identification

We used PyTandemFinder28 to search for the most abundant tandem repeat sequences. The candidate sequences were then masked in the genome using RepeatMasker23. We validated the putative centromeric regions by calculating the number of genes, the number of transposable elements, and the methylation level in 10 kb windows; the ranges of these three measures were normalized to 0~1. To calculate haplotype-specific methylation levels, Nanopolish29 was used for methylation calling. Single nucleotide polymorphisms (SNPs) were called for each haplotype and subsequently used to assign Oxford Nanopore reads to each haplotype. The levels of 5-methylcytosine in a CpG context were then calculated for each haplotype using the script calculate_methylation_frequency.py provided by Nanopolish. Repeat sequences were visualized using StainedGlass30 to further delimit the centromeric regions. Telomeric repeats were detected using quarTeT31 by searching for “AACCCT” repeats across the genome.
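
The windowing and normalization step can be sketched as follows. This is a minimal illustration assuming that per-site values (for example, CpG methylation frequencies) are averaged within each 10 kb window and then min-max scaled; the exact aggregation used in the study is not specified, and the function names and toy data are hypothetical.

```python
# Minimal sketch: average per-site values in fixed windows, then min-max
# normalize the window values to the range 0 to 1.
# ASSUMPTION: window value = mean of site values; the paper does not specify.
from collections import defaultdict

WINDOW_BP = 10_000  # 10 kb windows, as in the text


def window_means(sites, window_bp=WINDOW_BP):
    """sites: iterable of (position, value). Returns {window_index: mean value}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for position, value in sites:
        index = position // window_bp
        sums[index] += value
        counts[index] += 1
    return {index: sums[index] / counts[index] for index in sums}


def min_max_normalize(window_values):
    """Rescale window values linearly so the minimum is 0 and the maximum is 1."""
    low = min(window_values.values())
    high = max(window_values.values())
    if high == low:  # avoid division by zero for a flat track
        return {index: 0.0 for index in window_values}
    return {index: (v - low) / (high - low) for index, v in window_values.items()}


# Toy CpG sites: (position in bp, methylation frequency)
toy_sites = [(1_200, 0.9), (8_500, 0.7), (12_000, 0.2), (25_000, 0.5)]
toy_means = window_means(toy_sites)
toy_normalized = min_max_normalize(toy_means)
print(toy_normalized)
```

The same two functions apply unchanged to gene or transposable element counts per window, which puts all three tracks on a common scale for plotting.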

As a result, a 262 bp sequence (CEN262) was selected as the putative centromeric repeat. CEN262 occupied a single location on most chromosomes, with the exception of Chr12 and Chr22. The regions occupied by CEN262 coincided with reduced densities of transposable elements and genes and with reduced DNA methylation, characteristics typically associated with centromeric regions (Fig. 2b). For the two chromosomes lacking the CEN262 signal, we inspected the tandem repeat annotation to determine whether the centromere had been assembled. We did not detect an obvious repeat in Chr12, but a repeat-rich region was observed in Chr22. This suggests that the centromere of Chr12 may not have been assembled, whereas Chr22 may use a different centromeric repeat. We therefore searched for the most abundant tandem repeat in Chr22 and masked the genome following the same steps, which identified a different 262 bp repeat variant (CEN262_ALT). Based on centromere position, 13 of the 25 chromosomes in H. vernayi were classified as acrocentric and 11 as submetacentric.

Fig. 2

(a) The distribution of centromeres, telomeres, gaps, TE density, and methylation level (blue line) in the first haplotype genome. TE density and methylation level are shown in 10 kb windows. This haploid genome contains only one gap, while the other haploid genome contains 7 gaps (see Supplementary Fig. 2). (b) Visualization of the centromeric region of Chr25, showing the distribution of CEN262, gene density, 5mC methylation level, and TE density in 10 kb windows, where ranges are normalized to 0~1. The triangle heatmaps represent pairwise sequence identity between 0.87 Mb sequences.

A total of 41 telomeres were identified in the first haplotype and 43 in the second. In the first haplotype, 12 chromosomes have telomeres assembled at both ends, while the remaining chromosomes have telomeres at a single end (Fig. 2a). In the second haplotype, 19 chromosomes have telomeres at both ends, while 5 have telomeres assembled only at one end (Supplementary Fig. 2).
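
How a chromosome end might be scored for telomeric repeats can be sketched with a simple tandem-run search. This is not the quarTeT algorithm; the window size, the minimum copy number, and the function names are hypothetical choices made only for illustration.

```python
# Minimal sketch: flag chromosome ends that carry a tandem run of the
# telomeric motif (AACCCT or its reverse complement AGGGTT).
# ASSUMPTION: thresholds are illustrative, not those used by quarTeT.
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_tandem_run(region: str, motif: str) -> int:
    """Largest number of back-to-back motif copies (either strand) in region."""
    best = 0
    for unit in (motif, reverse_complement(motif)):
        for match in re.finditer(f"(?:{unit})+", region):
            best = max(best, len(match.group()) // len(unit))
    return best


def telomere_ends(seq: str, motif: str = "AACCCT", end_window: int = 1000,
                  min_copies: int = 10):
    """Return (left_has_telomere, right_has_telomere) for one chromosome."""
    seq = seq.upper()
    left = longest_tandem_run(seq[:end_window], motif) >= min_copies
    right = longest_tandem_run(seq[-end_window:], motif) >= min_copies
    return left, right


# Toy chromosome: 20 motif copies on the left, only 5 copies on the right.
toy_chromosome = "AACCCT" * 20 + "ACGTTGCA" * 50 + "AGGGTT" * 5
toy_result = telomere_ends(toy_chromosome, end_window=100)
print(toy_result)
```

Counting the ends flagged across all chromosomes gives the per-haplotype telomere totals; chromosomes flagged at both ends correspond to those described as having telomeres assembled at both ends.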

Structural variation between haplotypes

We used SyRI v1.7.032 to detect structural variations between the two haploid genomes. First, nucmer from MUMmer v4.0.033 was used for pairwise alignment of the two haplotypes, and one-to-one alignments with a minimum length of 300 bp were retained. SyRI32 was then applied to these alignments to identify structural variations, and the results were visualized using plotsr v1.1.534. Approximately 660 Mb of syntenic regions were identified between the two haplotypes, representing 93% of the genome and indicating a high level of similarity (Fig. 3). Inversions accounted for 0.069% of the genome (~0.4 Mb), and translocations for 1% (~7.5 Mb). We also observed a few unaligned regions whose locations corresponded well with the centromeric regions.

Fig. 3

Synteny between the two haploid genomes of H. vernayi.

Data Records

The raw sequencing data have been deposited in the NCBI database under BioProject accession PRJNA1280703. The PacBio HiFi, ONT, and Hi-C data can be found under accession number SRP59633835. The genome assemblies were deposited in the European Nucleotide Archive (ENA) under accession number GCA_977016905.236. The annotation files have been deposited in Figshare37.

Technical Validation

We evaluated the quality and completeness of the genome assemblies by BUSCO analysis, which indicated a completeness of 99.0% and 99.1% for the two haplotype genomes, respectively (Fig. 1c). The contig-level assemblies produced by hifiasm v0.19.914 contain only 52 and 50 contigs, respectively. The final diploid genome contains only 8 gaps, 1 in the first haplotype and 7 in the second. We also assessed the quality of the assembled haplotypes using Merqury38. The quality values (QV) of the two haplotypes were 64.4 and 66.3, respectively, corresponding to a base-level accuracy of over 99.9999%. The k-mer completeness was about 90% for each haplotype (90.659% for hap1 and 90.624% for hap2), while the total completeness of the diploid genome was 99.89%. These results indicate that the genome was assembled with high quality.
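
The QV is a Phred-scaled value, so it converts to a per-base error probability as 10^(-QV/10). The sketch below applies this standard conversion to the two reported QVs; the function name is hypothetical.

```python
# Convert a Phred-scaled consensus quality value (QV) to base-level accuracy.
# error probability = 10 ** (-QV / 10); accuracy = 1 - error probability.


def qv_to_accuracy_percent(qv: float) -> float:
    """Return the base-level accuracy (in percent) implied by a Phred-scaled QV."""
    error_rate = 10 ** (-qv / 10)
    return 100.0 * (1.0 - error_rate)


hap1_accuracy = qv_to_accuracy_percent(64.4)  # QV reported for the first haplotype
hap2_accuracy = qv_to_accuracy_percent(66.3)  # QV reported for the second haplotype

print(f"hap1: {hap1_accuracy:.6f}%  hap2: {hap2_accuracy:.6f}%")
```

Because QV 60 corresponds to one expected error per million bases, both haplotypes exceed that level of accuracy.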