Background & Summary

The genus Pseudoxenodon, characterized by the obliquely arranged scales on the anterior part of the dorsal body, is a group of snakes that are widely distributed across southern and southeastern Asia1. This genus consists of seven species including P. stejnegeri, P. macrops, P. karlschmidti, P. inornatus, P. jacobsonii, P. baramensis and P. bambusicola2. Despite their ecological importance and unique morphological adaptations, they have been poorly studied, especially in a phylogenetic context. Due to lack of sufficient molecular data, the taxonomic classification of the genus Pseudoxenodon has remained controversial within the herpetological community. Recently, high-throughput sequencing techniques have been used to uncover snake genomic information and inform studies of snake evolution and development3, adaptation4,5, venom6,7 and phylogeny8.

Previously, inference of the phylogenetic relationships of the genus Pseudoxenodon has been based mainly on mitochondrial genes9,10,11,12. Many studies have demonstrated conflicting phylogenetic signals and evolutionary histories between mitochondrial and nuclear genes13,14,15,16. Moreover, a growing number of studies indicate that nuclear genes may provide more robust phylogenetic resolution for closely related lineages17,18. In recent years, phylogenetic reconstruction based on whole genomes has emerged as a powerful and reliable tool for deciphering the biodiversity, ecology and evolution of organisms19,20,21,22,23. However, only one species of Pseudoxenodon has undergone genome sequencing and assembly, and only with short-read sequencing technology8. These limited genetic resources severely hinder accurate determination of the evolutionary relationships within Pseudoxenodon and in-depth study of their evolutionary history.

In this study, we present the first chromosome-level genome assembly of P. stejnegeri based on PacBio, Hi-C and Illumina sequencing technologies. We assembled a high-quality genome with a size of 1601.26 Mb and a scaffold N50 of 203.68 Mb. In total, about 97.07% of the bases were anchored onto 18 chromosomes. This genome assembly not only provides genomic data for studying the genetic diversity and population genetics of P. stejnegeri but also offers a valuable resource for studies of Pseudoxenodon phylogenetics, adaptive evolution and comparative genomics.

Materials & Methods

Ethics statement

All animal experimental procedures were conducted in accordance with the Chinese Laboratory Animal Welfare and Ethics law (GB/T 35892-2018) and were approved by the Biomedical Ethics Committee of Chengdu University.

Sample collection and DNA extraction

An adult female individual of P. stejnegeri (Fig. 1a) was collected from Ningbo City, Zhejiang Province, China in August 2023. Muscle tissue was used to extract genomic DNA for whole-genome sequencing. Genomic DNA was extracted using QIAGEN Genomic Kits following the manufacturer’s protocol. The quality and quantity of the total DNA were determined using a NanoDrop 2000 Spectrophotometer (Thermo Fisher Scientific) and a Qubit Fluorometer (Invitrogen). The integrity of the DNA was further evaluated using 1% agarose gel electrophoresis. Additionally, seven tissue samples (muscle, blood, heart, kidney, liver, lung and spleen) were collected from the same specimen for transcriptome sequencing. Total RNA was isolated using Trizol reagent (Invitrogen) as instructed by the manufacturer.

Fig. 1

The morphological characteristics and genome information of P. stejnegeri. (a) Live specimen of P. stejnegeri. (b) The K-mer (K = 51) distribution for genome size estimation of P. stejnegeri genome. (c) The quality and length distribution of PacBio sequencing results. (d) Hi-C interaction heatmap of P. stejnegeri genome.

Library preparation and sequencing

For long-read sequencing, genomic DNA was used to construct a PacBio SMRTbell library with an insert size of 15 kb using the SMRTbell Express Template Prep Kit 3.0. The size and concentration of library fragments were measured with an Agilent 2100 Bioanalyzer (Agilent Technologies, USA). The qualified libraries were evenly loaded on a SMRT Cell and sequenced on the Sequel II platform (Pacific Biosciences, CA, USA) in CCS mode. For Illumina sequencing, a library with an insert size of 350 bp was constructed using the TruSeq Nano DNA HT Sample Preparation Kit (Illumina, USA). The Hi-C library was prepared using the Smartgenomics Hi-C kit (Smartgenomics Technology Institute, China). Initially, muscle tissue was fixed with 1% formaldehyde to cross-link DNA and proteins. The cross-linked DNA was then digested with the HindIII restriction enzyme and the resulting overhangs were filled in with biotinylated nucleotides. The resulting blunt ends were then ligated, and Dynabeads M-280 Streptavidin (Life Technologies) was used to enrich the library for fragments containing biotinylated ligation junctions. Both the standard Illumina genomic library and the Hi-C library were sequenced on an Illumina NovaSeq 6000 platform with 2 × 150 bp reads. RNA-seq libraries were constructed using the Hieff NGS Ultima Dual-mode RNA Library Prep Kit (Yeasen) and sequenced (2 × 150 bp) on the DNBSEQ-T7 platform.

Genome survey

The whole-genome survey analysis was performed using short reads from Illumina sequencing. The raw reads were first subjected to quality control using fastp v0.23.424 with default parameters, which yielded 80.34 Gb of clean data (Table S1). Based on these high-quality data, we used Jellyfish v2.3.125 to analyze the k-mer frequency distribution with a K value of 51, following a previous study3. The k-mer distribution was then imported into GenomeScope v2.0 to predict genome size and heterozygosity. The genome size of P. stejnegeri was estimated to be approximately 1411.46 Mb, with a heterozygosity rate of around 0.61% (Fig. 1b).
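The arithmetic behind a k-mer-based size estimate can be sketched in a few lines of Python. This is a simplified peak-based estimate, not the mixture model that GenomeScope fits; the function name, the error cutoff and the toy histogram are assumptions made only for illustration. In a heterozygous genome the tallest peak may be the heterozygous one, which is one reason a model-based tool is used in practice.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Rough haploid genome size from a k-mer depth histogram.

    histogram: dict mapping k-mer depth -> number of distinct k-mers
               observed at that depth (e.g. parsed from a histogram file).
    min_depth: depths below this are treated as sequencing errors
               (an assumed cutoff, not a value from the study).
    """
    # Discard the low-depth tail, which is dominated by erroneous k-mers.
    kept = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not kept:
        raise ValueError("no k-mers above the error cutoff")
    # Take the most populated depth as the main coverage peak.
    peak_depth = max(kept, key=kept.get)
    # Total k-mer instances divided by the peak depth approximates genome size.
    total_kmers = sum(depth * count for depth, count in kept.items())
    return total_kmers / peak_depth


# Toy histogram (hypothetical numbers): an error tail plus one peak at depth 50.
toy = {1: 1000, 2: 200, 48: 10, 49: 30, 50: 50, 51: 30, 52: 10}
print(estimate_genome_size(toy))
```
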

Genome assembly

First, the PacBio sequencing data were filtered to remove low-quality polymerase reads using the PacBio SMRT-Analysis software package. Reads shorter than 50 bp, reads with an average quality value < 0.8, and reads containing self-ligated SMRTbell adapters were discarded to obtain high-quality polymerase reads. We employed ccs v4.2.0 in SMRTLink v9.0 with parameters --min-passes = 3 and --min-rq = 0.99 to process the remaining subreads and generate HiFi reads, resulting in 6,044,853 reads (107.33 Gb) with a read N50 of 17.7 kb (Fig. 1c, Table S2). Then, the HiFi long reads were assembled into contigs using Hifiasm v0.19.926 with default parameters. The assembled contig-level genome comprises 660 contigs spanning 1,600,567,841 base pairs, with an N50 value of 94.32 Mb (Table 1).
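For readers unfamiliar with the N50 statistic quoted for reads, contigs and scaffolds, a minimal sketch of its standard definition follows: the length of the sequence at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half of the total. The function name and the toy lengths are illustrative and not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that brings the sum to half the total.
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be a non-empty collection")


# Toy contig lengths (hypothetical): total is 290, half is 145,
# and 80 + 70 = 150 is the first cumulative sum to reach it.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```
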

Table 1 Genome assembly statistics for P. stejnegeri.

To generate a chromosome-level genome, the raw Hi-C sequencing data were filtered using fastp v0.23.4 with default parameters, retaining 133.77 Gb of clean reads (Table S3). These high-quality reads were subsequently mapped against the preliminary contigs by HiCUP v0.9.227 along with Bowtie2 v2.5.428. After Hi-C data alignment, we obtained about 219.40 million uniquely aligned valid reads, comprising 49.13% of the total reads (Table S4). Based on the valid reads, we applied ALLHiC v0.9.1429 to cluster, orient, and order the contigs for scaffold-level assembly. Finally, we used Juicebox v2.2230 to manually fine-tune the assembly, resulting in a chromosome-level assembly. The assembled chromosome-level genome was 1.6 Gb, with 1.55 Gb (97.07%) anchored onto 18 pseudochromosomes and a scaffold N50 of 203.68 Mb (Figs. 1d, 2, 3; Tables 1, S5). The assembled chromosomes were named chr1 to chr18 in descending order of length. We used the telo subcommand of seqtk v1.5 (https://github.com/lh3/seqtk) to detect telomeric repeats in the pseudochromosomes, and three pseudochromosomes achieved true telomere-to-telomere continuity (Table S6).
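The idea of a telomere-to-telomere check can be illustrated with a small Python sketch that looks for tandem runs of the vertebrate telomeric repeat (TTAGGG, or its reverse complement CCCTAA at the start of the sequence) near both ends of a pseudochromosome. This is a deliberately simplified stand-in and does not reproduce the scoring used by seqtk telo; the function name, window size and copy threshold are assumptions chosen for illustration.

```python
import re


def telomere_ends(seq, motif="TTAGGG", window=600, min_copies=10):
    """Return (left, right) flags for telomeric repeat runs at each end.

    window and min_copies are assumed, illustrative thresholds.
    """
    seq = seq.upper()
    # Reverse complement of the motif, expected at the 5' end (CCCTAA for TTAGGG).
    complement = str.maketrans("ACGT", "TGCA")
    rc_motif = motif.translate(complement)[::-1]

    def has_run(region, unit):
        # Require at least min_copies perfect tandem copies of the unit.
        pattern = "(?:%s){%d,}" % (re.escape(unit), min_copies)
        return re.search(pattern, region) is not None

    left = has_run(seq[:window], rc_motif)
    right = has_run(seq[-window:], motif)
    return left, right


# Toy sequence (hypothetical) with repeat runs at both ends.
toy = "CCCTAA" * 12 + "ACGT" * 500 + "TTAGGG" * 12
print(telomere_ends(toy))  # (True, True)
```

Under this simplified criterion, a pseudochromosome would count as telomere-to-telomere only when both flags are true.
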

Fig. 2

Snail plot for visualization of genome assembly and assessment metrics.

Fig. 3

Circos plot showing the distribution of genomic features. The tracks from outermost to innermost are pseudo-chromosomes, tandem repeat density (maximum count: 11689), DNA transposon density (maximum count: 176), LINE density (maximum count: 1679), SINE density (maximum count: 53), LTR density (maximum count: 363), non-coding RNA density (maximum count: 235), protein-coding gene density (maximum count: 125), GC content and synteny among chromosomes.

Gene structure annotation

To obtain a high-quality gene annotation, three methods were used to predict protein-coding gene structure: homology-based prediction, transcriptome-based prediction and ab initio prediction. For homology-based prediction, protein sequences of five species (Pantherophis guttatus, Thamnophis sirtalis, T. elegans, Ahaetulla prasina and Mus musculus) were downloaded from the NCBI database (Table S7). The protein sequences of each species were aligned to the assembled genome using genBlastA v1.0.431. The candidate homologous regions were provided as inputs to GeneWise v2.4.132 to precisely annotate gene structures. For transcriptome-based prediction, the raw RNA sequencing datasets were filtered using fastp v0.23.4, and the retained clean reads were aligned to the reference genome with TopHat v2.1.133. The alignment results were analyzed using Cufflinks v2.2.134 to perform genome-guided transcript assembly. For ab initio prediction, Augustus v3.5.035, geneid v1.4.536 and GENSCAN v1.037 were applied to annotate genes. The gene models derived from these different approaches were integrated using EVidenceModeler v2.1.038 to produce a non-redundant and complete gene set, which was further corrected using PASA v2.5.339 to add untranslated regions (UTRs) and alternative splicing information. Ultimately, we obtained a total of 21,579 protein-coding genes, with an average gene length of 28,275.09 bp, an average CDS length of 1,406.54 bp, and an average exon number of 8.31 (Fig. 4a, Table 2).

Fig. 4

Venn diagrams for protein-coding gene annotation. (a) Genes annotated by different strategies. (b) Gene functions annotated by different databases.

Table 2 Statistics of the predicted protein-coding genes by different approaches.

Gene functional annotation

The predicted protein-coding genes were aligned against the NCBI non-redundant (nr) database and Swiss-Prot40 database using DIAMOND v2.1.1141. Conserved domains, structural motifs and functional signatures were annotated using InterProScan v5.5942 to search against InterPro v91.043 database. We also employed eggNOG-mapper v2.1.844 together with eggNOG v5.0.245 database to predict gene functions through evolutionary homology analysis. Both InterProScan and eggNOG-mapper automatically performed gene ontology (GO46) assignment. In addition, we used pyfastx v2.2.047 to split protein sequence file into three smaller files which were then submitted to BlastKOALA v3.148 server for KEGG49 pathway identification. Overall, 17,531 (80.87%) predicted protein-coding genes were functionally annotated by at least one functional database (Fig. 4b, Table 3).

Table 3 Summary of the functionally annotated protein-coding genes.

Repeat annotation

Repetitive elements in the P. stejnegeri genome were detected using a hybrid method that combined homology-based and de novo search strategies. We applied RepeatMasker v4.1.7 and RepeatProteinMask to carry out homology-based prediction with the Repbase v23.0850 and Dfam v3.851 databases. For de novo prediction, LTR_Finder v1.0.752, Piler v1.053, RepeatScout v1.0.754 and RepeatModeler v2.0.655 were used to build a library of repetitive sequences. Subsequently, RepeatMasker was used to predict transposable elements based on this library. Additionally, we identified tandem repeats in the P. stejnegeri genome using Krait v2.0.656 with pytrf v1.4.157 as the search engine, a maximum motif size of 100 bp, and a minimum length of 10 bp. In total, we identified 9,976,736 repeat elements with a total length of 908.04 Mb, accounting for 56.71% of the assembled genome (Table 4).
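When annotations from several repeat finders are combined, the same bases are often reported more than once, so a repeat fraction is usually computed on merged intervals. The Python sketch below shows that bookkeeping on toy coordinates; the function name and intervals are hypothetical, and the text does not state whether the reported total was derived this way. The final line simply reproduces the reported percentage from the two totals given above.

```python
def covered_bases(intervals):
    """Total bases covered by half-open (start, end) intervals,
    counting overlapping regions only once."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block before opening a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered


# Toy annotations (hypothetical): two overlapping hits and one separate hit.
print(covered_bases([(0, 100), (50, 150), (200, 250)]))  # 200

# Repeat fraction from the totals reported in the text (Mb).
print(round(100 * 908.04 / 1601.26, 2))  # 56.71
```
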

Table 4 Summary of the repetitive elements in the P. stejnegeri genome.

Non-coding RNA identification

We initially employed Infernal v1.1.558 to align the assembled genome against Rfam v15.059 database for detecting non-coding RNAs (rRNAs, tRNAs, snRNAs, and miRNAs). Then, tRNAscan-SE v2.0.1260 with default parameters was used to explore tRNAs. Barrnap v0.9 (https://github.com/tseemann/barrnap) was used to predict ribosomal RNAs with the --kingdom parameter set to euk. Finally, we identified 3440 non-coding RNAs including 273 miRNAs, 1083 rRNAs, 1549 tRNAs and 342 snRNAs (Table 5).

Table 5 Statistics of the annotated non-coding RNAs.

Data Records

The raw PacBio, Hi-C, Illumina and RNA-seq data were submitted to the Sequence Read Archive at NCBI under accession number SRP64781861. We have also deposited the raw sequencing data in the Genome Sequence Archive (GSA62) at the National Genomics Data Center (NGDC63) with accession number CRA02513464 under BioProject PRJCA039323. The final genome and annotation data have been made available on the Figshare repository65. The final genome assembly has also been deposited at DDBJ/ENA/GenBank under the accession JBNIJY00000000066.

Technical Validation

We used multiple methods to assess the quality of the genome assembly. First, the completeness of the genome assembly was evaluated using Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.6.067 with the vertebrata_odb10 lineage dataset and the Core Eukaryotic Genes Mapping Approach (CEGMA) v2.568. The BUSCO result revealed 97.8% completeness (Fig. 2, Table S8), and 231 (93.15%) of the 248 core eukaryotic genes from CEGMA were identified in the assembled genome (Table S9). Then, we mapped the filtered Illumina reads to the assembled genome using BWA v0.7.1869 to assess accuracy. The mapping result indicated that 99.71% of paired-end reads could be aligned to the assembled genome (Table S10). We further assessed the quality value (QV) and k-mer completeness using Merqury v1.370 with 21-mers generated from the Illumina short reads. The QV score and k-mer completeness were estimated as 47.1 and 89.91%, respectively. We also performed chromosomal synteny analysis between P. stejnegeri and two other snakes with well-assembled genomes (Ahaetulla prasina and T. elegans) using MCScanX v1.0.071. We observed a high degree of synteny among these species (Fig. 5). In conclusion, these results indicate that the assembled genome is a high-quality chromosome-level reference genome for P. stejnegeri.
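To put the reported QV in more intuitive terms, the sketch below converts between a Phred-scaled quality value and a per-base error rate using the standard relation QV = -10 log10(E). The function names are illustrative; the inputs are the values reported above.

```python
import math


def qv_to_error_rate(qv):
    """Per-base error probability implied by a Phred-scaled QV."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Phred-scaled QV implied by a per-base error probability."""
    return -10 * math.log10(error_rate)


# QV 47.1 corresponds to roughly 2 errors per 100,000 bases.
print(f"{qv_to_error_rate(47.1):.2e}")

# CEGMA completeness from the reported counts: 231 of 248 genes.
print(round(100 * 231 / 248, 2))  # 93.15
```
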

Fig. 5

Gene synteny analysis of genome chromosomes between P. stejnegeri and two other snakes (A. prasina and T. elegans).