Background & Summary

Gibbons, a group of small apes in the Hylobatidae family, represent a sister lineage to the “great apes” (Pongo, Gorilla, Pan and Homo) in the phylogeny of Primates. Endemic to the forests of Southeast Asia, all the four extant gibbon genera (Nomascus, Hylobates, Symphalangus, Hoolock) face severe endangerment due to deforestation and illegal hunting. Gibbons diverged from the great apes after the split of the Old World Monkeys (Cercopithecidae)1,2; they exhibit unique behaviors and genomic features, such as social pair-bonding, suspensory brachiation, rapid karyotype evolution via frequent large-scale chromosomal rearrangements3,4,5. This makes gibbons ideal for exploring the genomic plasticity, genome evolution and the conservation of endangered primate species.

Nomascus leucogenys, also known as the northern white-cheeked gibbon, is among the most endangered species, classified as Critically Endangered on the IUCN Red List. Found in the forests of southern China, Laos, and Vietnam, N. leucogenys is believed to be ecologically extinct in China6,7. To date, genome data is available only for a female individual of this species, which was initially sequenced on the Sanger platform in 20141 and subsequently updated through the integration of PacBio reads8. Another gibbon species, Hylobates moloch, was sequenced in 2022 by the third-generation technology4. Both studies used female individuals, and the genome of H. moloch was assembled on scaffold-level only. In a comprehensive phylogenetic study, 27 new primate genomes were sequenced, including Nomascus siki (PacBio + 10X Genomics), Hoolock leuconedys (Oxford Nanopore), Symphalangus syndactylus (PacBio + 10X Genomics) and Hylobates pileatus (Oxford Nanopore) from the gibbon lineage9. Among these, only H. pileatus had Hi-C data, while the assembly of N. siki remained at scaffold-level. The latest report on T2T genomes of five great apes includes a new assembly for the Y-chromosome of S. syndactylus, with a gapless length of ~30 Mb10.

In this study, we utilized PacBio long-read sequencing, Illumina short paired-end reads, Hi-C technology, and RNA-sequencing to produce a high-quality chromosome-level genome assembly of a male N. leucogenys individual using a lymphoblastoid cell line. Notably, we achieved improved continuity in assembling the Y-chromosome. Our assembly provides an essential reference for understanding gibbon genomic features, exploring the evolutionary adaptation of the gibbon genome, particularly the evolution of gibbon Y-chromosome, and aiding in the conservation of endangered primate species.

Methods

Ethics statement

All the experimental procedures in this study were in accordance with the Laboratory Management Regulations of Shandong University of Technology (No. 163).

Sample collection, preparation, and DNA extraction

An EBV-transformed lymphoblastoid cell line (KCB93007) derived from a male N. leucogenys was provided by Kunming Cell Bank. Genomic DNA was isolated using a Blood & Cell Culture DNA Kit (Qiagen 13323, Qiagen, Germany) following the manufacturer’s protocol. The quality and concentration of DNA was assessed using 1% agarose gel electrophoresis and a Qubit 2.0 Fluorometer (Life Technologies, USA). Total DNA was purified by AMPure PB beads (PacBio 100-265-900, PacBio, USA) to obtain high-quality genome DNA for library construction.

PacBio and Hi-C library construction and sequencing

Size-selected SMRTbell libraries were prepared using SMRTbell® Express Template Prep Kit 2.0 (PacBio 101-853-100, PacBio, USA) and sequenced for 30 hours using the PacBio Sequel II system (PacBio, USA). A total of 8,900,210 HiFi reads were generated from four SMRT Cells, with an average read length of 17,093 bp. After trimming the low-quality reads and adaptor sequences, 143.4 Gb of PacBio HiFi data were obtained (~49X coverage).

For Hi-C library construction, lymphoblastoid cells of N. leucogenys were fixed with formaldehyde, followed by the addition of glycine to quench the crosslinking reaction. DNA was extracted using the Blood & Cell Culture DNA Kit (Qiagen 13323, Qiagen, Germany), following the manufacturer’s recommended protocol. The extracted cross-linked DNA was then cut with restriction endonucleases overnight to generate pairs of distally located, physically associated DNA molecules. Subsequently, the DNA ends were labeled with biotin-14-dCTP, and blunt-end ligation of the cross-linked fragments was performed. After cell lysis, a four-cutter restriction enzyme (MboI) was used to digest the cross-linked DNA, and biotin-labelled Hi-C samples were enriched using streptavidin magnetic beads. The Hi-C libraries were amplified through 10–12 cycles of PCR and sequenced on an Illumina Novaseq6000 platform with 2 × 150 bp reads.

After filtering adaptors and low-quality reads (when the proportion of low-quality bases exceeded 20% of the read length), we obtained 1,011,168,632 paired-end clean reads for further genome assembly.

Transcriptome sequencing

For transcriptome sequencing, lymphoblastoid cells and liver tissues from male N. leucogenys, frozen in liquid nitrogen, were obtained from the tissue archive of Kunming Cell Bank. Total RNA was extracted using a RNeasy Mini Kit (Qiagen 74104, Qiagen, Germany). Then TruSeq RNA Sample Prep Kit (IlluminaRS-122-2002, Illumina, USA) was used for poly(A) mRNA isolation, first-strand and second-strand cDNA synthesis, fragment and adapter ligation, as well as cDNA library preparation.

The libraries from different tissues were sequenced on an Illumina Novaseq6000 platform using 150 pair-end sequencing. Low-quality reads and adaptor sequences were filtered out by fastp (v.0.23.2, https://github.com/OpenGene/fastp) prior to further analyses. In total, 11.62 Gb (98.29%) and 14.88 Gb (94.41%) of clean data were obtained from the lymphoblastoid cells (GB) and liver tissues (PL) respectively.

Genome assembly and quality assessment

The genome size of N. leucogenys (approx. 2.94 G) and its sibling species N. siki (approx. 2.78 G) has been reported in previous studies1,9. K-mer distribution analysis (K = 21) using GenomeScope v2.011 revealed an estimated heterozygosity rate of 0.39%, comparable to that of Hylobates moloch (0.38%–0.39%) and falling within the range of the Hylobatidae family (0.19%–0.41%)4. The Genomescope profile is illustrated in Fig. 1.

Fig. 1
Fig. 1
Full size image

Genomescope profile based on the HiFi reads (K = 21).

De novo assembly of PacBio HiFi reads (CCS mode) was performed using HiFiasm (version 0.16.1)12 to generate contigs. Hi-C reads were also incorporated at this stage to achieve better haplotype-phased assemblies. Before further processing, quality of the assembly was assessed using BUSCO (v.5.1, with database primates_odb, 13,780 genes), which evaluates genomic assembly based on expected gene content from near-universal single-copy orthologs13. The assembly comprised 143 scaffolds for Haplotype 1 (Hap1), with a total length of 3,069 Mb and scaffold N50 size of 124.2 Mb; and 90 scaffolds for Haplotype 2 (Hap2), with a total length of 2,960 Mb and scaffold N50 size of 121.2 Mb. The BUSCO completeness value exceeded 95%, with 12,889 complete and single-copy genes (~93.5%), 323 complete and duplicated genes (~2.4%), 141 fragmented genes (~1.0%), and 427 missing BUSCO genes (~3.1%) for Hap1 (Table 1). Comparative assessment with eight other gibbon assemblies using the same BUSCO database and parameters revealed that the completeness of our N. leucogenys genome was the highest among all the published gibbon genomes (Table 2) with the lowest proportion of missing genes, indicating a significant improvement in assembly continuity and completeness. Following polishing with PacBio reads, the final assembled contigs measured 3.06 Gb in length (Hap1), with a GC content of 40.9%.

Table 1 Statistics of N. leucogenys genome sequencing and assembly.
Table 2 BUSCO assessment results of available gibbon genomes.

Chromosome anchoring and quality assessment

The Hi-C data obtained from the N. leucogenys cells underwent initial quality control using fastp14, which involved reads trimming, adapter removal, and elimination of low-quality bases. Cleaned reads were subsequently aligned to the assembled genome sequences following the Omni-C pipelines (https://omni-c.readthedocs.io). High-quality mapped reads were used to create a pairs-file with pairtools15, which categorized pairs based on insert distance and read types. The parsed pairs-files were then sorted by pairtools sort. PCR duplicates were removed from downstream analysis through the pairtools dedup command with the -output-stats option, resulting in the exclusion of 17.2% of read pairs as PCR duplicates. In the remaining valid data, 94.6% were Mapped Read Pairs, and 5.4% were Unmapped Read Pairs, leading to a total of 565,327,778 unique mapped read pairs for scaffolding.

YaHS16 was employed to construct chromosome-scale scaffolds from Hi-C data. The pipeline corrected assembly errors and produced a final scaffolds file. Juicer tools17 were utilized to generate the Hi-C contact matrix and then convert it into editable Hi-C map files (out_JBAT.hic and out_JBAT.assembly). Juicebox (v. 2.16)18 was used for visualizing the Hi-C map, identifying, and correcting errors in the genome assembly. New AGP and FASTA files were generated for the final genome assembly after the edit was completed. The sizes of each generated scaffolds were calculated by faidx19.

For Hap1, 2.96 Gb of the final assembly were anchored onto 27 (25 + X + Y) chromosomes (pseudomolecules) (Fig. 2a). Among the chromosomes of Hap1, the longest was Chr_03 (168.1 Mb), and the shortest was Chr_Y (16.1 M) (Table 3). For Hap2, 2.79 Gb of the scaffolds were anchored onto chromosomes (Fig. 2b). The principal features of the assembled genome (Hap1) were illustrated in a Circos plot (Fig. 3).

Fig. 2
Fig. 2
Full size image

(a) Heatmap of Hi-C interaction density for Hap1. (b) Heatmap of Hi-C interaction density for Hap2. The interaction density for each pair of windows (resolution: 500 kb) is measured by the number of supporting Hi-C reads.

Table 3 Chromosome sizes and assignment for Hi-C scaffolds.
Fig. 3
Fig. 3
Full size image

Circos plot showing key features of the N. leucogenys genome (Hap1). (A) Chromosome numbers; (B) Genomic landscape of the pseudochromosomes, with scales indicating chromosome lengths; The sexual chromosomes X and Y are highlighted in blue and red, respectively; (C) Gene density; (D) Repeat coverage; (E) ncRNA density; (F) GC ratio; (G) GC skew; (H) Synteny relationship between pseudochromosomes.

Repeats and gene annotations

Chromosome-level sequences of N. leucogenys were utilized for repeats and gene annotations. To identify repeat sequences in the gibbon genome, a combination of de novo and homology-based approaches was employed. RepeatModeler v2.0.120 and EDTA (Extensive de novo TE Annotator)21,22 were used to construct the de novo repeat library and the transposable elements (TE) library, respectively. RepeatMasker v4.1.023 was then used to identify and annotate the repetitive elements using a combined database of the Repbase, de novo repeat library, and the TE library. Identified repeats were soft-masked for gene annotation.

Results of the RepeatMasker annotation revealed a total of 51.5% (Hap1) and 50.1% (Hap2) of the genome assembly consisted of repeats. Many of the repeats belongs to the TE family (retroelements, 47.2% for Hap1, and 45.8 for Hap2), which could be further categorized into various subtypes. Detailed classifications of the repetitive sequences are provided in Table 4.

Table 4 Classifications of the repetitive sequences in the genome of N. leucogenys.

Gene structures were predicted using the BRAKER 3 pipeline24, which integrated de novo, homology-based, and transcriptome sequencing-based annotation workflows. Prior to prediction, Augustus (http://augustus.gobics.de) was trained with human data, and GeneMark (http://exon.gatech.edu/GeneMark) was self-trained according to the software instructions25,26. Through de novo prediction, Augustus identified 34,671 protein-coding genes. GeneMark-ETP made predictions from homologous proteins from Homo sapiens (GCF_009914755.1, GCF_009914755.1_T2T-CHM13v2.0)27, Hylobates pileatus (GCA_021498465.1, 20115467.v1)9, H. moloch (GCF_009828535.3, GCF_009828535.3-RS_2023_07)4 and Nomascus leucogenys (GCF_000146795.3, Annotation Release 102)1, as well as the transcriptomic data. For proteins, the longest (representative) sequences were obtained for each gene with the aid of TBtools-II28. Clean data generated from transcriptome sequencing (described above) was provided for the BRAKER 3 pipeline in fastq format.

The results were combined to generate the prediction gene set. Genes with TE insertions were identified and removed by TEsorter (v.1.4.6)29, followed by further filtering with AGAT (v.1.2.0)30 to eliminate incomplete genes. Considering the highly repetitive sequences on the Y-chromosome increased the difficulty and inaccuracy for gene prediction, genes with CDS covered by over 60% of repeats were excluded9. Additionally, genes with alignment identity below 65% to the Swiss-Prot database were excluded from the final gene set.

Statistical analysis for the numbers of genes, mRNAs, exons, and introns was conducted using the “GXF Stat” function in TBtools-II28. The final prediction yielded 18,925 protein-coding genes (with 23,783 mRNAs), comparable to the gene counts in the Nleu_3.0 (18,991 genes, NCBI Nomascus leucogenys Annotation Release 102, 2015) and H. moloch genomes (19,302 genes)4. And 58 genes were annotated for the Y-chromosome after manual curation.

The average length of the coding genes was 28,215 bp, with an average of 7.71 exons per gene. For Hap1, chromosome 10 has the highest number of genes (1,188 genes, 9,719 exons), while chromosome 25 has the fewest genes (189 genes, 1,568 exons) among the autosomes. The density of gene distribution on each chromosome was illustrated in Fig. 3C. The average gene density is highest on chromosome 24 (10.9 genes/Mb), followed by chromosome 10 (10.6 genes/Mb) and 19 (10.3 genes/Mb). Chromosomes with the lowest gene density are Y (3.61 genes/Mb), 20 (3.67 genes/Mb) and 16 (3.76 genes/Mb).

Protein-coding genes were functionally annotated using different databases, including Swissprot (https://web.expasy.org/docs/swiss-prot_guideline.html), PFAM (http://xfam.org/), eggNOG (http://eggnogdb.embl.de/), GO (http://geneontology.org/page/go-database), and KEGG (http://www.genome.jp/kegg/) through the eggNOG-mapper (v.2) pipeline31. In all, 17,860 genes (95.48% of the total coding genes) were annotated in at least one database. Overall, 8,066 genes were matched to 310 KEGG pathways, and 13,171 genes were annotated to at least one KO number32. “Genetic information processing; translation (ko03010, 560 genes)”, “MAPK signaling pathway (ko04010, 207 genes)” and “Nucleocytoplasmic transport (ko03013, 196 genes)” were the top three pathways with the most genes annotated.

Noncoding RNAs (ncRNAs) are a class of RNAs that do not encode proteins. Four types of ncRNAs were identified in the genome: transfer RNAs (tRNAs), ribosomal RNA (rRNAs), microRNAs (miRNAs), and small nuclear RNAs (snRNAs). tRNAs were annotated by tRNAscan-SE (version 2.0)33, rRNAs were annotated by barrnap (v. 0.9)34. miRNAs and snRNAs were annotated using INFERNAL (version 1.1.2) by searching the Rfam database (version 14.9)35. A total of 8,021 annotated small ncRNAs were obtained, including 2,818 miRNAs, 423 rRNAs, 720 tRNAs, 208 lncRNA and 3,484 snRNAs. The distribution of ncRNAs on each chromosome is depicted Fig. 3E.

Comparison of primate genome assemblies

To further assess the quality of our genome assembly, we compared the N. leucogenys genome with other primate genomes that include Y-chromosome data (Table 5). Our chromosomal-level assembly exhibited a scaffold N50 of approx. 124.2 Mb (Hap1), surpassing most gibbon genomes published so far. We assembled over 16 Mb of Y-sequence, the second longest among reported gibbon genome data.

Table 5 Comparison of genome assemblies of human, chimpanzee and five gibbon species.

We annotated 58 Y-genes for N. leucogenys after manual curation, with a gene density on Y (3.61 genes/Mb) comparable to human (3.96 genes/Mb) and chimpanzee (4.12 genes/Mb) (Table 5).

In conclusion, we have successfully generated a haplotype-resolved chromosome-level genome for the critically endangered northern white-cheeked gibbon, using the third-generation sequencing technologies. The scaffold N50 length is approximately five-times longer compared to previously published data1, with a BUSCO completeness score exceeding 95%. This high-quality genome dataset will serve as a valuable resource for investigating the evolution and conservation of N. leucogenys and other highly endangered primate species.

Data Records

The raw data have been deposited in the National Center for Biotechnology Information (NCBI) database, under project PRJNA1108630 (SRP525211)36, as well as the Genome Sequence Archive (accession number: CRA016205) of the National Genomics Data Center, under project PRJCA025566 (PacBio: SAMC3564548/CRX1037479; HiC: SAMC3564549/CRX1037480, CRX1052507; RNA-Seq: SAMC3564550/CRX1037481, SAMC3564551/CRX1037482)37. Chromosome-level assemblies for Hap1 and Hap2 were submitted to GenBank at NCBI under accession numbers JBDORE000000000 (PRJNA1108630)38 and JBDORF000000000 (PRJNA1108629)39, respectively. The genome annotation results (GFF files, protein sequences, GO and KEGG annotations) were deposited in the Figshare database32.

Technical Validation

Genomic DNA was extracted by a Blood & Cell Culture DNA Kit (Qiagen 13323, Qiagen, Germany) according to the manufacturer’s instructions. The quality and quantity of total DNA was examined by Qubit fluorometer. DNA integrity was determined using an Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, California, USA).

Total RNA was isolated using a RNeasy Mini Kit (Qiagen 74104, Qiagen, Germany). RNA integrity was determined by the Agilent 2100 Bioanalyzer. RNA samples with a RIN values ≥ 8 were used to construct cDNA libraries for PacBio sequencing.