Background & Summary

Monitor lizards (family Varanidae) are represented by a single genus, Varanus, which comprises 88 currently recognized species1. Many of these species have only recently been recognized as distinct, based primarily on genetic evidence. Furthermore, the number of newly identified cryptic species (morphologically similar or nearly identical forms) is steadily increasing2. Varanid lizards exhibit remarkable body size disparity, ranging from the diminutive Dampier Peninsula monitor (V. sparnus; approximately 230 mm in length and <17 g) to the massive Komodo dragon (V. komodoensis; up to 3,000 mm in length and 100 kg)2. Distributed primarily across Africa, Asia, and Australia, varanid lizards also inhabit diverse island systems (including oceanic islands, land-bridge islands, and continental fragments) throughout New Guinea, the Philippines, Indonesia, and the Solomon Islands. This distribution renders them an excellent model system for studying the roles of continents and islands in species evolution and diversification3. Additionally, many varanid species are involved in international trade and are listed under the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES)2. Consequently, monitor lizards are of significant scientific value for studying continental and insular drivers of species differentiation, and of considerable conservation importance. However, to date, a chromosome-level genome assembly has been published only for the ridgetail monitor (V. acanthurus)4, while scaffold-level genome assemblies are available for just five other Varanus species: V. komodoensis5,6, the Southeast Asian water monitor (V. salvator macromaculatus)7, the Nile monitor (V. niloticus)8, the white-throated monitor (V. albigularis)8, and the blue tree monitor (V. macraei)9. Therefore, generating high-quality genomic resources for additional varanid species remains a priority.

The water monitor lizard (V. salvator; Fig. 1a), which can reach up to 1,170 mm in snout-vent length10, is an oviparous species. It is listed in Appendix II of CITES and is a Class I key protected wild animal in China (https://www.gov.cn/zhengce/2021-02/05/content_5727412.htm). Its distribution extends from South and Southwest China across Bangladesh, Brunei, the Indochinese Peninsula, northeastern India, Indonesia, and Sri Lanka11 (Fig. 1b). Water monitor lizards typically inhabit areas close to water sources, such as rivers, lakes, and marshes, and are highly adapted to both aquatic and terrestrial environments2. Like other ectothermic vertebrates, they rely on external heat sources to regulate their body temperature and show a strong capacity to adapt to environmental fluctuations12. They are carnivorous, with a diet consisting primarily of snails, crabs, fish, frogs, other small vertebrates, and carrion13,14. This species plays an important role in maintaining ecological balance by regulating prey populations and facilitating nutrient cycling15,16.

Fig. 1

Photo (a) and potential habitat distribution area (b) of the water monitor lizard (Varanus salvator). The species distribution was projected using maximum entropy (MaxEnt) modeling based on 276 occurrence records, following Guo et al.63.

Varanus salvator is critically endangered in China17 and holds cultural significance, having been traditionally regarded as the “five-clawed golden dragon”, a symbol of imperial power in Chinese culture18. Karyotype analysis has established its diploid chromosome number as 2n = 4019, and k-mer-based genome survey analysis estimates its total genome size at approximately 1.67 Gb. In this study, we generated a chromosome-level genome assembly for V. salvator by integrating data from Illumina short-read sequencing (104× coverage), PacBio single-molecule real-time (SMRT) long-read sequencing (105×), 10× Genomics linked-read sequencing (111×), and high-throughput chromosome conformation capture (Hi-C) sequencing (102×). The assembled genome is 1.64 Gb in size, with a contig N50 length of 27.34 Mb and a GC content of 44.2% (Fig. 2a; Table 1). Approximately 97.4% of the assembled sequences were anchored onto 20 pseudochromosomes (Fig. 2b). Comparative genomic analysis identified chromosome 16 as the Z sex chromosome (Fig. 2b). The genome exhibits a heterozygosity rate of 0.24% and contains 34.0% repetitive sequences, including LTR retrotransposons (6.96%, 114,420 kb), LINE retrotransposons (28.8%, 472,895 kb), and other types (Fig. 2a; Table 2). A total of 19,347 protein-coding genes were identified, with an average coding sequence (CDS) length of 1,572 bp. Of these, 19,116 (98.8%) were functionally annotated (Table 3). In addition, 3,275 non-coding RNAs were identified, including 719 miRNAs, 984 tRNAs, 432 rRNAs, and 369 snRNAs (Table 2). Assessments of genome completeness yielded high-quality metrics, with BUSCO and CEGMA scores of 96.7% and 87.9%, respectively. This high-quality, chromosome-level genome assembly and its annotation for V. salvator provide a valuable resource for ecological and evolutionary studies within the genus Varanus and lay a foundation for future research in molecular ecology and conservation genetics.

Fig. 2

Genomic features and chromosomal assignment of V. salvator. (a) Circos plot of genome characteristics, showing, from the outside to the inside, chromosome ideograms, gene density, GC content, and self-vs-self collinearity blocks. See Tables 1–3 for more detailed statistics. (b) Identification of the Z sex chromosome (chromosome 16) in V. salvator by comparison with V. acanthurus.

Table 1 Genome assembly statistics.
Table 2 Summary of genome annotation of V. salvator.
Table 3 Summary of gene functional annotation of V. salvator.

Methods

Sample collection and genomic sequencing

An adult male water monitor lizard was sourced from the Hainan Key Laboratory for Herpetological Research. Following its natural death, tissues including skin, fat bodies, liver, testis, spleen, muscle, pancreas, and other organs were collected and stored separately at −80 °C for subsequent analysis. All procedures were conducted in accordance with prevailing Chinese regulations on animal welfare and scientific research and were approved by the Animal Research Ethics Committee of Nanjing Normal University (IACUC Approval No. 20200511). Genomic DNA was extracted from muscle tissue using a standard phenol-chloroform-isoamyl alcohol (PCI; 25:24:1, v/v/v) method, followed by precipitation with chloroform-isoamyl alcohol (24:1, v/v). DNA concentration was quantified using a Qubit 2.0 fluorometer (Thermo Fisher Scientific). DNA purity was assessed using a NanoDrop spectrophotometer (Thermo Fisher Scientific), and DNA integrity was assessed by 1.0% agarose gel electrophoresis.

Whole-genome sequencing was performed using a combination of four platforms: Illumina short-read, PacBio single-molecule real-time (SMRT) long-read, 10× Genomics linked-read, and Hi-C technologies. For genome survey analysis, a paired-end library with an insert size of 350 bp was constructed and sequenced on an Illumina HiSeq X Ten System (Novogene, Beijing, China). After adapter trimming and removal of low-quality reads, more than 99.8% of the sequences were retained as clean data, yielding approximately 303 Gb. K-mer frequency analysis was performed on the paired-end short reads using Jellyfish v2.2.620 with a k-mer size of 21 (Fig. 3a). The resulting k-mer distribution was analyzed with GenomeScope v1.021 to estimate genome size, heterozygosity, and repetitive sequence content.
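The idea behind a k-mer-based genome size estimate can be illustrated with a minimal Python sketch. This is not the GenomeScope model: it assumes a simple peak-based estimate (total k-mers divided by the depth of the main peak), a two-column "multiplicity count" histogram such as the one `jellyfish histo` writes, and a hypothetical error cutoff (`min_depth`). The function names and the toy histogram are illustrative only.

```python
def parse_histo(text):
    """Parse two-column 'multiplicity count' lines into a dict."""
    histogram = {}
    for line in text.splitlines():
        if line.strip():
            depth, count = line.split()
            histogram[int(depth)] = int(count)
    return histogram


def estimate_genome_size(histogram, min_depth=5):
    """Return (estimated genome size, peak depth) from a k-mer histogram.

    min_depth is a hypothetical cutoff: k-mers seen fewer times are
    treated as sequencing errors and ignored.
    """
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # The depth with the most distinct k-mers approximates the homozygous
    # coverage peak (reasonable only when heterozygosity is low).
    peak_depth = max(solid, key=solid.get)
    # Total number of k-mer occurrences divided by per-copy depth.
    total_kmers = sum(d * n for d, n in solid.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram, not real data from this study.
toy = parse_histo("1 5000\n2 800\n9 50\n10 100\n11 50\n")
size, peak = estimate_genome_size(toy)
print(size, peak)  # 200.0 10
```

Model-based tools additionally fit heterozygous and repeat peaks, which is why they can report heterozygosity and repeat content alongside genome size.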

Fig. 3

Genome survey and chromosome assembly of the water monitor lizard (Varanus salvator). (a) 21-mer spectrum generated by GenomeScope from the Illumina short reads, showing the estimated genome size and heterozygosity rate. (b) Hi-C heatmap of interactions among the chromosomes of V. salvator.

Genomic DNA was randomly fragmented, and the resulting fragments were size-selected and purified using AMPure PB beads (Pacific Biosciences, USA). After end-repair and A-tailing, SMRTbell hairpin adapters were ligated for library construction. To maximize read length, libraries with insert sizes greater than 15 kb were size-selected using the BluePippin System (Sage Science, USA). These libraries were sequenced on a PacBio Sequel II system (Novogene) in continuous long read (CLR) mode, producing long reads with a random error profile. Sequencing yielded 175.33 Gb of raw data, corresponding to approximately 105× genome coverage. These long, error-prone reads are suitable for assembly using overlap-layout-consensus algorithms.

A 10× Genomics library was prepared using the Chromium Controller instrument and a Chromium Genome Chip following the manufacturer’s protocol for the Chromium Genome Reagent Kit (10× Genomics)22. The library was sequenced on an Illumina HiSeq X Ten platform. After quality control, 185.09 Gb of clean data were obtained, achieving approximately 111× genome coverage.

A Hi-C library was constructed from muscle tissue following a previously published protocol23 with minor modifications. Cells were cross-linked with 4% formaldehyde. The cross-linked DNA was then extracted and digested with 400 units of the restriction enzyme MboI on a rocking platform. The digested fragments were end-repaired and labeled with biotin-14-dCTP, followed by blunt-end ligation. After ligation, the DNA was purified, fragmented into pieces of approximately 300–500 bp, and used to construct Illumina-compatible paired-end sequencing libraries. The libraries were PCR-amplified (12–14 cycles) and sequenced on an Illumina NovaSeq 6000 platform (PE 125 bp). This yielded 170.85 Gb of clean data, providing approximately 102× genome coverage.
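The fold-coverage values quoted for each library follow from dividing the data yield by the genome size. The short Python sketch below reproduces that arithmetic, assuming coverage is expressed relative to the 1.67 Gb genome survey estimate (an assumption; the text does not state the denominator). The names `GENOME_SIZE_GB` and `fold_coverage` are illustrative.

```python
# Genome size from the genome survey (assumed denominator for coverage).
GENOME_SIZE_GB = 1.67

# Data yields reported in the text, in Gb.
yields_gb = {
    "PacBio CLR": 175.33,
    "10x Genomics": 185.09,
    "Hi-C": 170.85,
}


def fold_coverage(yield_gb, genome_size_gb=GENOME_SIZE_GB):
    """Sequencing depth as total bases divided by genome size."""
    return yield_gb / genome_size_gb


coverage = {name: round(fold_coverage(gb)) for name, gb in yields_gb.items()}
for name, depth in coverage.items():
    print(f"{name}: ~{depth}x")
# PacBio CLR: ~105x
# 10x Genomics: ~111x
# Hi-C: ~102x
```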

Genome assembly and quality assessment

The initial de novo assembly was performed with Falcon v0.5, which implements an overlap-layout-consensus (OLC) algorithm for assembling PacBio long reads24. Key overlap filtering parameters were set as follows: --max_diff 100 --max_cov 100 --min_cov 2 --bestn 10, and the analysis was run with --n_core 12 computational threads. Before assembly, Illumina short reads were mapped to the PacBio long reads for error correction, leveraging the high base accuracy of short reads to produce corrected long reads. These corrected reads were then assembled into primary consensus contigs by the OLC algorithm in Falcon. The primary assembly was further polished with the aligned Illumina short reads using Pilon v1.2325 with default parameters, generating a polished assembly.

Subsequently, linked reads from the 10× Genomics libraries were aligned to the polished assembly. Scaffolds were constructed from the polished contigs using fragScaff v140324.126 based on these alignments. The scaffolding process was executed with the following parameters: -maxCore 200 for thread allocation; -fs1 '-m 3000 -1 30 -E 25000 -o 50000' to define insert size, mapping quality, fragment length, and overlap thresholds for valid read pairs; and the advanced options -fs2 '-C 3' and -fs3 '-j 1 -u 2'. For chromosome-scale scaffolding, we used the LACHESIS tool27 with Hi-C data. Hi-C reads were mapped to the scaffolded assembly, and LACHESIS was then used to cluster, order, and orient the sequences into chromosome-scale scaffolds with parameters optimized for our data: CLUSTER_N = 17, CLUSTER_MIN_RE_SITES = 2076, CLUSTER_MAX_LINK_DENSITY = 3, CLUSTER_DRAW_HEATMAP = 1, and CLUSTER_DRAW_DOTPLOT = 1. Other parameters were left at their default values. The resulting interaction matrices were used to validate the assembly and correct potential misassemblies, yielding the final pseudochromosomes (Fig. 3b). Thus, the final chromosome-level genome assembly was generated by integrating data from all four sequencing technologies.

To identify sex chromosomes, we performed a whole-genome alignment between the V. salvator assembly and the published V. acanthurus genome using MUMmer v3.028. The comparative results were visualized using Circos software.

Gene prediction and functional annotation

Repetitive elements were annotated using a combination of homology-based and de novo prediction approaches. First, for homology-based prediction, RepeatMasker v4.0.529 and ProteinMask v4.0.5 were run with default parameters against the Repbase database (v2018-10-26)30 to identify known repeats. Tandem repeats were predicted de novo using Tandem Repeats Finder (TRF) v4.0931 with default parameters. Next, for de novo repeat annotation, LTR_Finder v1.0.732, RepeatScout v1.0.533, and RepeatModeler v1.0.834 were used with default parameters to construct a de novo repeat library. Sequences longer than 100 bp with less than 5% ambiguous bases (‘N’s) were retained from the de novo predictions to generate a custom transposable element (TE) library. A non-redundant combined repeat library was created by merging the Repbase and custom TE libraries. Finally, RepeatMasker was run against this combined library, using ProteinMask v4.0.5 as the search engine, to perform comprehensive repeat masking and annotation.
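The curation rule for the custom TE library (longer than 100 bp, less than 5% ambiguous bases) can be written as a simple filter. The Python sketch below is an illustration under stated assumptions, not the pipeline actually used: sequences are assumed to be already loaded into a name-to-sequence dictionary, and `keep_repeat` and `filter_library` are hypothetical helper names.

```python
def keep_repeat(seq, min_len=100, max_n_frac=0.05):
    """True if seq is longer than min_len and has < max_n_frac 'N' bases."""
    seq = seq.upper()
    # The length check runs first, so empty sequences never reach the division.
    return len(seq) > min_len and seq.count("N") / len(seq) < max_n_frac


def filter_library(records):
    """Keep only the de novo repeat sequences that pass both thresholds."""
    return {name: seq for name, seq in records.items() if keep_repeat(seq)}


# Toy records, not real repeat consensus sequences.
records = {
    "rep1": "ACGT" * 50,           # 200 bp, no N: kept
    "rep2": "ACGT" * 25,           # exactly 100 bp: dropped (not longer than 100)
    "rep3": "A" * 190 + "N" * 10,  # 200 bp, exactly 5% N: dropped
    "rep4": "A" * 196 + "N" * 4,   # 200 bp, 2% N: kept
}
curated = filter_library(records)
print(sorted(curated))  # ['rep1', 'rep4']
```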

Protein-coding gene structures were predicted by integrating evidence from three approaches: homology-based, ab initio, and transcriptome-based. For homology-based prediction, protein sequences from six reptilian species (Sphenodon punctatus, Ophisaurus gracilis, Anolis carolinensis, Shinisaurus crocodilurus, Pogona vitticeps, and Gekko japonicus) were aligned to the genome using TBLASTN v2.2.26 (E-value ≤ 1e-5)35. Significant hits were extended into full-length gene models using GeneWise v2.4.136. For ab initio prediction, we used Augustus v3.2.337, GeneID v1.438, Genscan v1.039, GlimmerHMM v3.0440, and SNAP (v2013-11-29)41. For transcriptome-based annotation, total RNA was extracted from eight tissues (skin, fat body, liver, testis, kidney, spleen, muscle, and pancreas) of the same specimen. RNA integrity and concentration were assessed using an Agilent 2100 Bioanalyzer (Agilent Technologies, USA). Strand-specific RNA-seq libraries were prepared and sequenced on an Illumina NovaSeq 6000 platform (2 × 150 bp paired-end), yielding approximately 23 Gb of clean data. After adapter trimming and quality filtering (following the same procedure as for genomic DNA libraries), the transcriptome reads were de novo assembled using Trinity v2.1.142. The RNA-seq reads from each tissue were also aligned to the genome assembly using TopHat v2.0.1143 to identify exon boundaries and splice junctions. Corresponding transcript assemblies were generated from the alignments using Cufflinks v2.2.144 with default parameters. Predictions from the three approaches were integrated using EvidenceModeler (EVM) v1.1.145 to generate a consensus, non-redundant set of gene models. The EVM gene models were further refined using the Program to Assemble Spliced Alignments (PASA), which leverages transcriptome assembly evidence to add untranslated regions (UTRs), identify alternative splicing isoforms, and produce the final gene set.

Subsequently, gene functions were annotated by searching predicted protein sequences against public databases. Predicted protein sequences were aligned against the Swiss-Prot database46 using BLASTP with an E-value ≤ 1e-5, and the best significant hit was used for functional annotation. Protein domains and motifs were identified using InterProScan v4.847, which interrogates multiple databases including ProDom48, PRINTS49, Pfam50, SMART51, PANTHER52, and PROSITE53. Gene Ontology (GO) terms were assigned based on the InterProScan results. Additional functional annotations were assigned based on the best BLAST hits (E-value < 1e-5) against the NCBI non-redundant (NR) and Swiss-Prot databases. Kyoto Encyclopedia of Genes and Genomes (KEGG) Orthology (KO) terms and pathway mappings were assigned using the KEGG Automatic Annotation Server (KAAS)54. tRNAs were identified using tRNAscan-SE v2.055. rRNAs were identified by aligning known rRNA sequences from related species to the assembly using BLASTN. Other non-coding RNAs (e.g., miRNAs, snRNAs) were identified by searching against the Rfam database v14.4 using Infernal v1.1.3 with default parameters.
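Selecting the best significant hit per protein can be sketched in Python as follows. The sketch assumes BLAST results in the standard 12-column tabular layout (E-value in column 11, bit score in column 12) and takes "best" to mean the highest bit score among hits passing the E-value cutoff; both are assumptions, since the output format and tie-breaking rule are not stated in the text. `best_hits` and the sample identifiers are hypothetical.

```python
def best_hits(lines, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} for the top hit per query.

    Assumes tab-separated BLAST output with the 12 standard columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore.
    """
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # not significant at the chosen threshold
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Toy alignments with made-up identifiers, not results from this study.
sample = [
    "gene1\tP001\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t200",
    "gene1\tP002\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-40\t250",
    "gene2\tP003\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.01\t30",
]
hits = best_hits(sample)
print(hits)  # {'gene1': ('P002', 1e-40, 250.0)}
```

Genes without any hit below the threshold (here `gene2`) simply remain unannotated by this database.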

Data Records

The genomic datasets generated and analysed in this study are available at the following repositories. All raw sequencing data [including Illumina short reads (CRR2192420-23), PacBio long reads (CRR2192408-10), 10× Genomics linked reads (CRR2192416-19), Hi-C data (CRR2192411-15), and RNA-seq data from skin (CRR2192430), fat body (CRR2192424), liver (CRR2192426), testis (CRR2192429), kidney (CRR2192425), spleen (CRR2192428), muscle (CRR2192431), and pancreas (CRR2192427)] and the primary genome assembly were deposited at the National Genomics Data Center (NGDC; https://ngdc.cncb.ac.cn/)56,57 under BioProject accession number PRJCA04577358. The final chromosome-scale genome assembly and annotation files are available at the NGDC Genome Warehouse (GWH) under accession number GWHHKDC00000000.159.

Technical Validation

The quality and completeness of the V. salvator genome assembly were assessed through multiple complementary approaches. First, analysis of the GC content distribution indicated no significant contamination in the assembly (Fig. 2a). Second, the Hi-C contact map revealed strong intra-chromosomal interaction signals along the diagonal (Fig. 3b), confirming the structural integrity of the genome. Third, assembly completeness was assessed using Benchmarking Universal Single-Copy Orthologues (BUSCO) v5.0 with the vertebrata_odb10 database (v2020-09-10), which contains 978 conserved single-copy orthologs60. The analysis showed that 96.7% of the expected single-copy orthologs were complete (C: 96.7%, of which 95.7% were single-copy and 1.0% duplicated; F: 0.8% fragmented; M: 2.5% missing; n: 978). In addition, completeness was evaluated using CEGMA v2.561 based on 248 highly conserved eukaryotic genes, which showed that 87.9% of the core genes were successfully assembled. To assess assembly accuracy, Illumina short reads were aligned to the final genome assembly using BWA v0.7.1762, resulting in a mapping rate of 99.4% and a genome coverage of 99.7%. Furthermore, 19,116 of the 19,347 predicted gene models (98.8%) were functionally annotated against major databases, including Swiss-Prot, NR, KEGG, GO, Pfam, and InterPro. Collectively, these results indicate that our de novo assembly of the V. salvator genome is both high-quality and complete.
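The reported completeness figures can be cross-checked with a few lines of arithmetic. The Python sketch below uses only values given in this paper (the BUSCO category percentages and the annotated and predicted gene counts from the Background & Summary); the variable names are illustrative.

```python
# BUSCO category percentages as reported in the text.
busco = {"single": 95.7, "duplicated": 1.0, "fragmented": 0.8, "missing": 2.5}

# Complete = single-copy + duplicated; all categories should sum to 100%.
complete = busco["single"] + busco["duplicated"]
total = complete + busco["fragmented"] + busco["missing"]

# Functional annotation rate from the reported gene counts.
annotated, predicted = 19116, 19347
annotation_rate = 100 * annotated / predicted

print(round(complete, 1), round(total, 1), round(annotation_rate, 1))
# 96.7 100.0 98.8
```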