Haplotype-phased genome assemblies and annotation of the northern white-cheeked gibbon (Nomascus leucogenys)

Luo, Zhonglai; Jiang, Libo; Xu, Jianing; Wang, Jinhuan; Nie, Wenhui; Ning, Zemin; Yang, Fengtang

doi:10.1038/s41597-024-04073-7

Download PDF

Data Descriptor
Open access
Published: 25 November 2024

Haplotype-phased genome assemblies and annotation of the northern white-cheeked gibbon (Nomascus leucogenys)

Zhonglai Luo ORCID: orcid.org/0000-0002-2131-6712¹,
Libo Jiang¹,
Jianing Xu¹,
Jinhuan Wang²,
Wenhui Nie²,
Zemin Ning³ &
…
Fengtang Yang¹

Scientific Data volume 11, Article number: 1279 (2024) Cite this article

3170 Accesses
1 Altmetric
Metrics details

Subjects

Abstract

Nomascus leucogenys is a critically endangered species of small apes. Here, we sequenced and assembled the male genome of N. leucogenys, using PacBio and Hi-C datasets, with a particular focus on its Y-chromosome. The resulting high-quality haplotype-phased assemblies are at chromosome-scale, with scaffold/contig N50 values of 124.2/102.2 Mb for Haplotype 1 and 121.2/85.67 Mb for Haplotype 2. The assembled Y-chromosome spans 16.06 Mb. BUSCO assessment indicated completeness scores exceeding 95%. We predicted 18,925 protein-coding genes (23,783 mRNAs), including 58 genes on the Y-chromosome. Approximately 50% of the genome comprises repetitive elements. These comprehensive genome datasets will serve as a valuable resource for future studies on the genetics and protection of gibbons and improve our understanding on the evolution of Y-chromosome-related genes in primates.

Chromosome-level genome assembly of a critically endangered species Leuciscus chuanchicus

Article Open access 15 March 2025

A chromosome-level genome assembly of critically endangered Ochetobius elongatus

Article Open access 19 December 2024

Chromosome-level genome assembly of the obligate cave loach Troglonectes daqikongensis

Article Open access 05 December 2025

Background & Summary

Gibbons, a group of small apes in the Hylobatidae family, represent a sister lineage to the “great apes” (Pongo, Gorilla, Pan and Homo) in the phylogeny of Primates. Endemic to the forests of Southeast Asia, all the four extant gibbon genera (Nomascus, Hylobates, Symphalangus, Hoolock) face severe endangerment due to deforestation and illegal hunting. Gibbons diverged from the great apes after the split of the Old World Monkeys (Cercopithecidae)^1,2; they exhibit unique behaviors and genomic features, such as social pair-bonding, suspensory brachiation, rapid karyotype evolution via frequent large-scale chromosomal rearrangements^3,4,5. This makes gibbons ideal for exploring the genomic plasticity, genome evolution and the conservation of endangered primate species.

Nomascus leucogenys, also known as the northern white-cheeked gibbon, is among the most endangered species, classified as Critically Endangered on the IUCN Red List. Found in the forests of southern China, Laos, and Vietnam, N. leucogenys is believed to be ecologically extinct in China^6,7. To date, genome data is available only for a female individual of this species, which was initially sequenced on the Sanger platform in 2014¹ and subsequently updated through the integration of PacBio reads⁸. Another gibbon species, Hylobates moloch, was sequenced in 2022 by the third-generation technology⁴. Both studies used female individuals, and the genome of H. moloch was assembled on scaffold-level only. In a comprehensive phylogenetic study, 27 new primate genomes were sequenced, including Nomascus siki (PacBio + 10X Genomics), Hoolock leuconedys (Oxford Nanopore), Symphalangus syndactylus (PacBio + 10X Genomics) and Hylobates pileatus (Oxford Nanopore) from the gibbon lineage⁹. Among these, only H. pileatus had Hi-C data, while the assembly of N. siki remained at scaffold-level. The latest report on T2T genomes of five great apes includes a new assembly for the Y-chromosome of S. syndactylus, with a gapless length of ~30 Mb¹⁰.

In this study, we utilized PacBio long-read sequencing, Illumina short paired-end reads, Hi-C technology, and RNA-sequencing to produce a high-quality chromosome-level genome assembly of a male N. leucogenys individual using a lymphoblastoid cell line. Notably, we achieved improved continuity in assembling the Y-chromosome. Our assembly provides an essential reference for understanding gibbon genomic features, exploring the evolutionary adaptation of the gibbon genome, particularly the evolution of gibbon Y-chromosome, and aiding in the conservation of endangered primate species.

Methods

Ethics statement

All the experimental procedures in this study were in accordance with the Laboratory Management Regulations of Shandong University of Technology (No. 163).

Sample collection, preparation, and DNA extraction

An EBV-transformed lymphoblastoid cell line (KCB93007) derived from a male N. leucogenys was provided by Kunming Cell Bank. Genomic DNA was isolated using a Blood & Cell Culture DNA Kit (Qiagen 13323, Qiagen, Germany) following the manufacturer’s protocol. The quality and concentration of DNA was assessed using 1% agarose gel electrophoresis and a Qubit 2.0 Fluorometer (Life Technologies, USA). Total DNA was purified by AMPure PB beads (PacBio 100-265-900, PacBio, USA) to obtain high-quality genome DNA for library construction.

PacBio and Hi-C library construction and sequencing

Size-selected SMRTbell libraries were prepared using SMRTbell® Express Template Prep Kit 2.0 (PacBio 101-853-100, PacBio, USA) and sequenced for 30 hours using the PacBio Sequel II system (PacBio, USA). A total of 8,900,210 HiFi reads were generated from four SMRT Cells, with an average read length of 17,093 bp. After trimming the low-quality reads and adaptor sequences, 143.4 Gb of PacBio HiFi data were obtained (~49X coverage).

For Hi-C library construction, lymphoblastoid cells of N. leucogenys were fixed with formaldehyde, followed by the addition of glycine to quench the crosslinking reaction. DNA was extracted using the Blood & Cell Culture DNA Kit (Qiagen 13323, Qiagen, Germany), following the manufacturer’s recommended protocol. The extracted cross-linked DNA was then cut with restriction endonucleases overnight to generate pairs of distally located, physically associated DNA molecules. Subsequently, the DNA ends were labeled with biotin-14-dCTP, and blunt-end ligation of the cross-linked fragments was performed. After cell lysis, a four-cutter restriction enzyme (MboI) was used to digest the cross-linked DNA, and biotin-labelled Hi-C samples were enriched using streptavidin magnetic beads. The Hi-C libraries were amplified through 10–12 cycles of PCR and sequenced on an Illumina Novaseq6000 platform with 2 × 150 bp reads.

After filtering adaptors and low-quality reads (when the proportion of low-quality bases exceeded 20% of the read length), we obtained 1,011,168,632 paired-end clean reads for further genome assembly.

Transcriptome sequencing

For transcriptome sequencing, lymphoblastoid cells and liver tissues from male N. leucogenys, frozen in liquid nitrogen, were obtained from the tissue archive of Kunming Cell Bank. Total RNA was extracted using a RNeasy Mini Kit (Qiagen 74104, Qiagen, Germany). Then TruSeq RNA Sample Prep Kit (IlluminaRS-122-2002, Illumina, USA) was used for poly(A) mRNA isolation, first-strand and second-strand cDNA synthesis, fragment and adapter ligation, as well as cDNA library preparation.

The libraries from different tissues were sequenced on an Illumina Novaseq6000 platform using 150 pair-end sequencing. Low-quality reads and adaptor sequences were filtered out by fastp (v.0.23.2, https://github.com/OpenGene/fastp) prior to further analyses. In total, 11.62 Gb (98.29%) and 14.88 Gb (94.41%) of clean data were obtained from the lymphoblastoid cells (GB) and liver tissues (PL) respectively.

Genome assembly and quality assessment

The genome size of N. leucogenys (approx. 2.94 G) and its sibling species N. siki (approx. 2.78 G) has been reported in previous studies^1,9. K-mer distribution analysis (K = 21) using GenomeScope v2.0¹¹ revealed an estimated heterozygosity rate of 0.39%, comparable to that of Hylobates moloch (0.38%–0.39%) and falling within the range of the Hylobatidae family (0.19%–0.41%)⁴. The Genomescope profile is illustrated in Fig. 1.

De novo assembly of PacBio HiFi reads (CCS mode) was performed using HiFiasm (version 0.16.1)¹² to generate contigs. Hi-C reads were also incorporated at this stage to achieve better haplotype-phased assemblies. Before further processing, quality of the assembly was assessed using BUSCO (v.5.1, with database primates_odb, 13,780 genes), which evaluates genomic assembly based on expected gene content from near-universal single-copy orthologs¹³. The assembly comprised 143 scaffolds for Haplotype 1 (Hap1), with a total length of 3,069 Mb and scaffold N50 size of 124.2 Mb; and 90 scaffolds for Haplotype 2 (Hap2), with a total length of 2,960 Mb and scaffold N50 size of 121.2 Mb. The BUSCO completeness value exceeded 95%, with 12,889 complete and single-copy genes (~93.5%), 323 complete and duplicated genes (~2.4%), 141 fragmented genes (~1.0%), and 427 missing BUSCO genes (~3.1%) for Hap1 (Table 1). Comparative assessment with eight other gibbon assemblies using the same BUSCO database and parameters revealed that the completeness of our N. leucogenys genome was the highest among all the published gibbon genomes (Table 2) with the lowest proportion of missing genes, indicating a significant improvement in assembly continuity and completeness. Following polishing with PacBio reads, the final assembled contigs measured 3.06 Gb in length (Hap1), with a GC content of 40.9%.

Table 1 Statistics of N. leucogenys genome sequencing and assembly.

Full size table

Table 2 BUSCO assessment results of available gibbon genomes.

Full size table

Chromosome anchoring and quality assessment

The Hi-C data obtained from the N. leucogenys cells underwent initial quality control using fastp¹⁴, which involved reads trimming, adapter removal, and elimination of low-quality bases. Cleaned reads were subsequently aligned to the assembled genome sequences following the Omni-C pipelines (https://omni-c.readthedocs.io). High-quality mapped reads were used to create a pairs-file with pairtools¹⁵, which categorized pairs based on insert distance and read types. The parsed pairs-files were then sorted by pairtools sort. PCR duplicates were removed from downstream analysis through the pairtools dedup command with the -output-stats option, resulting in the exclusion of 17.2% of read pairs as PCR duplicates. In the remaining valid data, 94.6% were Mapped Read Pairs, and 5.4% were Unmapped Read Pairs, leading to a total of 565,327,778 unique mapped read pairs for scaffolding.

YaHS¹⁶ was employed to construct chromosome-scale scaffolds from Hi-C data. The pipeline corrected assembly errors and produced a final scaffolds file. Juicer tools¹⁷ were utilized to generate the Hi-C contact matrix and then convert it into editable Hi-C map files (out_JBAT.hic and out_JBAT.assembly). Juicebox (v. 2.16)¹⁸ was used for visualizing the Hi-C map, identifying, and correcting errors in the genome assembly. New AGP and FASTA files were generated for the final genome assembly after the edit was completed. The sizes of each generated scaffolds were calculated by faidx¹⁹.

For Hap1, 2.96 Gb of the final assembly were anchored onto 27 (25 + X + Y) chromosomes (pseudomolecules) (Fig. 2a). Among the chromosomes of Hap1, the longest was Chr_03 (168.1 Mb), and the shortest was Chr_Y (16.1 M) (Table 3). For Hap2, 2.79 Gb of the scaffolds were anchored onto chromosomes (Fig. 2b). The principal features of the assembled genome (Hap1) were illustrated in a Circos plot (Fig. 3).

Table 3 Chromosome sizes and assignment for Hi-C scaffolds.

Full size table

Repeats and gene annotations

Chromosome-level sequences of N. leucogenys were utilized for repeats and gene annotations. To identify repeat sequences in the gibbon genome, a combination of de novo and homology-based approaches was employed. RepeatModeler v2.0.1²⁰ and EDTA (Extensive de novo TE Annotator)^21,22 were used to construct the de novo repeat library and the transposable elements (TE) library, respectively. RepeatMasker v4.1.0²³ was then used to identify and annotate the repetitive elements using a combined database of the Repbase, de novo repeat library, and the TE library. Identified repeats were soft-masked for gene annotation.

Results of the RepeatMasker annotation revealed a total of 51.5% (Hap1) and 50.1% (Hap2) of the genome assembly consisted of repeats. Many of the repeats belongs to the TE family (retroelements, 47.2% for Hap1, and 45.8 for Hap2), which could be further categorized into various subtypes. Detailed classifications of the repetitive sequences are provided in Table 4.

Table 4 Classifications of the repetitive sequences in the genome of N. leucogenys.

Full size table

Gene structures were predicted using the BRAKER 3 pipeline²⁴, which integrated de novo, homology-based, and transcriptome sequencing-based annotation workflows. Prior to prediction, Augustus (http://augustus.gobics.de) was trained with human data, and GeneMark (http://exon.gatech.edu/GeneMark) was self-trained according to the software instructions^25,26. Through de novo prediction, Augustus identified 34,671 protein-coding genes. GeneMark-ETP made predictions from homologous proteins from Homo sapiens (GCF_009914755.1, GCF_009914755.1_T2T-CHM13v2.0)²⁷, Hylobates pileatus (GCA_021498465.1, 20115467.v1)⁹, H. moloch (GCF_009828535.3, GCF_009828535.3-RS_2023_07)⁴ and Nomascus leucogenys (GCF_000146795.3, Annotation Release 102)¹, as well as the transcriptomic data. For proteins, the longest (representative) sequences were obtained for each gene with the aid of TBtools-II²⁸. Clean data generated from transcriptome sequencing (described above) was provided for the BRAKER 3 pipeline in fastq format.

The results were combined to generate the prediction gene set. Genes with TE insertions were identified and removed by TEsorter (v.1.4.6)²⁹, followed by further filtering with AGAT (v.1.2.0)³⁰ to eliminate incomplete genes. Considering the highly repetitive sequences on the Y-chromosome increased the difficulty and inaccuracy for gene prediction, genes with CDS covered by over 60% of repeats were excluded⁹. Additionally, genes with alignment identity below 65% to the Swiss-Prot database were excluded from the final gene set.

Statistical analysis for the numbers of genes, mRNAs, exons, and introns was conducted using the “GXF Stat” function in TBtools-II²⁸. The final prediction yielded 18,925 protein-coding genes (with 23,783 mRNAs), comparable to the gene counts in the Nleu_3.0 (18,991 genes, NCBI Nomascus leucogenys Annotation Release 102, 2015) and H. moloch genomes (19,302 genes)⁴. And 58 genes were annotated for the Y-chromosome after manual curation.

The average length of the coding genes was 28,215 bp, with an average of 7.71 exons per gene. For Hap1, chromosome 10 has the highest number of genes (1,188 genes, 9,719 exons), while chromosome 25 has the fewest genes (189 genes, 1,568 exons) among the autosomes. The density of gene distribution on each chromosome was illustrated in Fig. 3C. The average gene density is highest on chromosome 24 (10.9 genes/Mb), followed by chromosome 10 (10.6 genes/Mb) and 19 (10.3 genes/Mb). Chromosomes with the lowest gene density are Y (3.61 genes/Mb), 20 (3.67 genes/Mb) and 16 (3.76 genes/Mb).

Protein-coding genes were functionally annotated using different databases, including Swissprot (https://web.expasy.org/docs/swiss-prot_guideline.html), PFAM (http://xfam.org/), eggNOG (http://eggnogdb.embl.de/), GO (http://geneontology.org/page/go-database), and KEGG (http://www.genome.jp/kegg/) through the eggNOG-mapper (v.2) pipeline³¹. In all, 17,860 genes (95.48% of the total coding genes) were annotated in at least one database. Overall, 8,066 genes were matched to 310 KEGG pathways, and 13,171 genes were annotated to at least one KO number³². “Genetic information processing; translation (ko03010, 560 genes)”, “MAPK signaling pathway (ko04010, 207 genes)” and “Nucleocytoplasmic transport (ko03013, 196 genes)” were the top three pathways with the most genes annotated.

Noncoding RNAs (ncRNAs) are a class of RNAs that do not encode proteins. Four types of ncRNAs were identified in the genome: transfer RNAs (tRNAs), ribosomal RNA (rRNAs), microRNAs (miRNAs), and small nuclear RNAs (snRNAs). tRNAs were annotated by tRNAscan-SE (version 2.0)³³, rRNAs were annotated by barrnap (v. 0.9)³⁴. miRNAs and snRNAs were annotated using INFERNAL (version 1.1.2) by searching the Rfam database (version 14.9)³⁵. A total of 8,021 annotated small ncRNAs were obtained, including 2,818 miRNAs, 423 rRNAs, 720 tRNAs, 208 lncRNA and 3,484 snRNAs. The distribution of ncRNAs on each chromosome is depicted Fig. 3E.

Comparison of primate genome assemblies

To further assess the quality of our genome assembly, we compared the N. leucogenys genome with other primate genomes that include Y-chromosome data (Table 5). Our chromosomal-level assembly exhibited a scaffold N50 of approx. 124.2 Mb (Hap1), surpassing most gibbon genomes published so far. We assembled over 16 Mb of Y-sequence, the second longest among reported gibbon genome data.

Table 5 Comparison of genome assemblies of human, chimpanzee and five gibbon species.

Full size table

We annotated 58 Y-genes for N. leucogenys after manual curation, with a gene density on Y (3.61 genes/Mb) comparable to human (3.96 genes/Mb) and chimpanzee (4.12 genes/Mb) (Table 5).

In conclusion, we have successfully generated a haplotype-resolved chromosome-level genome for the critically endangered northern white-cheeked gibbon, using the third-generation sequencing technologies. The scaffold N50 length is approximately five-times longer compared to previously published data¹, with a BUSCO completeness score exceeding 95%. This high-quality genome dataset will serve as a valuable resource for investigating the evolution and conservation of N. leucogenys and other highly endangered primate species.

Data Records

The raw data have been deposited in the National Center for Biotechnology Information (NCBI) database, under project PRJNA1108630 (SRP525211)³⁶, as well as the Genome Sequence Archive (accession number: CRA016205) of the National Genomics Data Center, under project PRJCA025566 (PacBio: SAMC3564548/CRX1037479; HiC: SAMC3564549/CRX1037480, CRX1052507; RNA-Seq: SAMC3564550/CRX1037481, SAMC3564551/CRX1037482)³⁷. Chromosome-level assemblies for Hap1 and Hap2 were submitted to GenBank at NCBI under accession numbers JBDORE000000000 (PRJNA1108630)³⁸ and JBDORF000000000 (PRJNA1108629)³⁹, respectively. The genome annotation results (GFF files, protein sequences, GO and KEGG annotations) were deposited in the Figshare database³².

Technical Validation

Genomic DNA was extracted by a Blood & Cell Culture DNA Kit (Qiagen 13323, Qiagen, Germany) according to the manufacturer’s instructions. The quality and quantity of total DNA was examined by Qubit fluorometer. DNA integrity was determined using an Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, California, USA).

Total RNA was isolated using a RNeasy Mini Kit (Qiagen 74104, Qiagen, Germany). RNA integrity was determined by the Agilent 2100 Bioanalyzer. RNA samples with a RIN values ≥ 8 were used to construct cDNA libraries for PacBio sequencing.

Code availability

The software and pipelines in this study for data process and analysis were used according to the developers’ instructions, as described in the cited references. Default parameters were used as suggested in the references if no detailed parameters are mentioned for a software.

References

Carbone, L. et al. Gibbon genome and the fast karyotype evolution of small apes. Nature 513, 195–201 (2014).
Article ADS PubMed PubMed Central CAS Google Scholar
Shao, Y. et al. Phylogenomic analyses provide insights into primate evolution. Science 380, 913–924 (2023).
Article ADS PubMed CAS Google Scholar
Carbone, L. et al. A high-resolution map of synteny disruptions in gibbon and human genomes. PLoS Genet. 2, e223 (2006).
Article PubMed PubMed Central Google Scholar
Escalona, M. et al. Whole-genome sequence and assembly of the Javan gibbon (Hylobates moloch). J. Hered. 114, 35–43 (2022).
Article PubMed Central Google Scholar
Reichard, U. H., Barelli, C., Hirai, H., & Nowak, M.G. The evolution of gibbons and Siamang. In Louise Barrett (ed.). Evolution of gibbons and Siamang: phylogeny, morphology, and cognition. New York (NY): Springer New York. p. 3–41 (2016).
Harding, L. E. Nomascus leucogenys (Primates: Hylobatidae). Mamm. Species 44, 1–15 (2012).
Article Google Scholar
Fan, P. F., Fei, H. L. & Luo, A. D. Ecological extinction of the Critically Endangered northern white-cheeked gibbon Nomascus leucogenys in China. Oryx 48, 52–55 (2014).
Article Google Scholar
Mao, Y. F. et al. Structurally divergent and recurrently mutated regions of primate genomes. Cell 187(6), 1547–1562 (2024).
Article PubMed CAS Google Scholar
Zhou, Y. et al. Eighty million years of rapid evolution of the primate Y chromosome. Nat. Ecol. Evol. 7, 1114–1130 (2023).
Article PubMed Google Scholar
Makova, K. D. et al. The complete sequence and comparative analysis of ape sex chromosomes. Nature 630, 401–411 (2024).
Article ADS PubMed PubMed Central CAS Google Scholar
Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat. Commun. 11, 1432 (2020).
Article ADS PubMed PubMed Central CAS Google Scholar
Cheng, H., Concepcion, G. T., Feng, X. W., Zhang, H. W. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Article PubMed PubMed Central CAS Google Scholar
Manni, M., Berkeley, M. R., Seppey, M., Simao, F. A. & Zdobnov, M. BUSCO update: Novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol. Biol. Evol. 38, 4647–4654 (2021).
Article PubMed PubMed Central CAS Google Scholar
Chen, S. F. Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp. iMeta 2, e107 (2023).
Article PubMed PubMed Central Google Scholar
Abdennur, N. et al. Pairtools: from sequencing data to chromosome contacts. bioRxiv https://doi.org/10.1101/2023.02.13.528389 (2023).
Article PubMed PubMed Central Google Scholar
Zhou, C., McCarthy, S. A. & Durbin, R. YaHS: yet another Hi-C scaffolding tool. Bioinformatics 39, btac808 (2023).
Article PubMed CAS Google Scholar
Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 3, 95–98 (2016).
Article PubMed PubMed Central CAS Google Scholar
Robinson, J. T. et al. Juicebox.js provides a cloud-based visualization system for Hi-C data. Cell Syst. 6, 256–258 (2018).
Article PubMed PubMed Central CAS Google Scholar
Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Article PubMed PubMed Central Google Scholar
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl Acad. Sci. USA 117, 9451–9457 (2020).
Article ADS PubMed PubMed Central CAS Google Scholar
Ou, S. et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol. 20, 275 (2019).
Article PubMed PubMed Central CAS Google Scholar
Ou, S. et al. Differences in activity and stability drive transposable element variation in tropical and temperate maize. bioRxiv 511471, https://doi.org/10.1101/2022.10.09.511471 (2022).
Smit, A. F. A., Hubley RR, Green PR. RepeatMasker Open-4.0. Available from: http://repeatmasker.org/ (2013).
Gabriel, L. et al. BRAKER3: Fully automated genome annotation using RNA-seq and protein evidence with GeneMark-ETP, AUGUSTUS and TSEBRA. bioRxiv https://doi.org/10.1101/2023.06.10.544449 (2023).
Article PubMed PubMed Central Google Scholar
Stanke, M., Steinkamp, R., Waack, S. & Morgenster, B. AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res. 32, W309–W312 (2004).
Article PubMed PubMed Central CAS Google Scholar
Tomas, B., Lomsadze, A. & Borodovsky, M. GeneMark-ETP: automatic gene finding in eukaryotic genomes in consistence with extrinsic data. bioRxiv., (2023).
Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53, https://doi.org/10.1126/science.abj698 (2022).
Article ADS PubMed PubMed Central CAS Google Scholar
Chen, C. J. et al. TBtools-II: A “One for All, All for One” bioinformatics platform for biological big-data mining. Mol. Plant 16, 1733–1742 (2023).
Article PubMed CAS Google Scholar
Zhang, R. G. et al. TEsorter: An accurate and fast method to classify LTR-retrotransposons in plant genomes. Hortic. Res. 9, uhac017 (2022).
Article PubMed PubMed Central Google Scholar
Dainat, J. AGAT: Another Gff Analysis Toolkit to handle annotations in any GTF/GFF format. (Version v0.8.0) https://github.com/NBISweden/AGAT.
Cantalapiedra, C. P. et al. eggNOG-mapper v2: Functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Biol. Evol. 38, 5825–5829 (2021).
Article PubMed PubMed Central CAS Google Scholar
Luo, Z., Yang, F. & Ning, Z. Haplotype-phased genome assemblies and annotation of the northern white-cheeked gibbon (Nomascus leucogenys). Figshare https://doi.org/10.6084/m9.figshare.27233328 (2024).
Chan, P. P. & Lowe, T. M. tRNAscan-SE: Searching for tRNA genes in genomic sequences. Methods Mol. Biol. 1962, 1–14 (2019).
Article PubMed PubMed Central CAS Google Scholar
Seemann, T. Barrnap 0.9: rapid ribosomal RNA prediction (RRID:SCR_015995). https://github.com/tseemann/barrnap (2013).
Nawrocki, E. P., Kolbe, D. L. & Eddy, S. R. Infernal 1.0: inference of RNA alignments. Bioinformatics 25, 1335–1337 (2009).
Article PubMed PubMed Central CAS Google Scholar
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP525211 (2024).
Genome Sequence Archive https://ngdc.cncb.ac.cn/gsa/browse/CRA016205 (2024).
Luo, Z., Yang, F. & Ning, Z. Nomascus leucogenys isolate KCB93007, whole genome shotgun sequencing project. GenBank https://identifiers.org/ncbi/insdc.gca:GCA_040113115.1 (2024).
Luo, Z., Yang, F. & Ning, Z. Nomascus leucogenys isolate KCB93007, whole genome shotgun sequencing project. GenBank https://identifiers.org/ncbi/insdc.gca:GCA_040113105.1 (2024).
Rhie, A. et al. The complete sequence of a human Y chromosome. Nature 621, 344–354 (2023).
Article ADS PubMed PubMed Central CAS Google Scholar
National Human Genome Research Institute, National Institutes of Health. Genome assembly NHGRI_mPanTro3-v1.1-hic.freeze_pri. https://identifiers.org/ncbi/insdc.gca:GCA_028858775.1 (2023).

Download references

Acknowledgements

This work was supported by grants from the National Natural Science Foundation of China (Grant No. 32370689, 31970252, 32070601), the Initial Funding for Doctoral Research from Shandong University of Technology (Grant No. 4041/422047), and the Natural Science Fund for Excellent Young Scholars of Shandong Province (ZR2022YQ23).

Author information

Authors and Affiliations

School of Life Sciences and Medicine, Shandong University of Technology, Zibo, China
Zhonglai Luo, Libo Jiang, Jianing Xu & Fengtang Yang
State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, 650223, Yunnan, China
Jinhuan Wang & Wenhui Nie
Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
Zemin Ning

Authors

Zhonglai Luo
View author publications
Search author on:PubMed Google Scholar
Libo Jiang
View author publications
Search author on:PubMed Google Scholar
Jianing Xu
View author publications
Search author on:PubMed Google Scholar
Jinhuan Wang
View author publications
Search author on:PubMed Google Scholar
Wenhui Nie
View author publications
Search author on:PubMed Google Scholar
Zemin Ning
View author publications
Search author on:PubMed Google Scholar
Fengtang Yang
View author publications
Search author on:PubMed Google Scholar

Contributions

F.T.Y., Z.M.N. and Z.L.L. conceived and designed the project. F.T.Y., W.H.N. and J.H.W. provided the samples. Z.M.N. assembled the genome data. Z.L.L., Z.M.N. and L.B.J. performed bioinformatics analysis. Z.L.L. wrote the manuscript and prepared the figures. J.N.X. provided technical assistance. F.T.Y., Z.M.N. and Z.L.L. revised and edited the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Zhonglai Luo or Fengtang Yang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Luo, Z., Jiang, L., Xu, J. et al. Haplotype-phased genome assemblies and annotation of the northern white-cheeked gibbon (Nomascus leucogenys). Sci Data 11, 1279 (2024). https://doi.org/10.1038/s41597-024-04073-7

Download citation

Received: 03 June 2024
Accepted: 04 November 2024
Published: 25 November 2024
Version of record: 25 November 2024
DOI: https://doi.org/10.1038/s41597-024-04073-7