Background & Summary

Equipped with a suite of evolved adaptations, suckermouth catfishes have emerged as one of the most notorious invasive groups globally, with documented invasions across tropical and subtropical regions, including Southeast Asia, the southern United States, and Central America, where they exert significant impacts on ecosystems1,2,3,4. Pterygoplichthys pardalis, native to the Amazon Basin, is a well-known and typical representative of this group5. It has established invasive populations in various countries, where it disrupts food webs, alters benthic habitats through burrowing, and damages fisheries infrastructure2,3,6,7. These invaders not only compete with native species for food resources but also prey on eggs and young fish, leading to declines in native fish populations and threatening the integrity of local food chains4,6. Economic costs arise from levee erosion, reduced catch yields, and expensive eradication efforts2,3,6,7, underscoring the urgency of understanding the biology of this species to inform management strategies.

This omnivorous fish feeds on a wide variety of food sources, including algae, organic material, small invertebrates, and sediment particles1,8, which enables it to exploit resource-poor environments. Notably, its modified gastric system functions as an additional respiratory organ, allowing it to thrive at low levels of dissolved oxygen9,10, a common feature of polluted or eutrophic habitats. It can also survive cold temperatures and drought by burrowing underground, even when the water level drops below the entrance of its burrows1,11. After accidental introduction into non-native habitats, its rapid growth, high reproductive capacity, and lack of natural predators allow it to establish invasive populations rapidly1,12,13,14,15,16. Together, these traits enable P. pardalis to monopolize niches, displace native species, and degrade ecosystems12,13. Once a population is established, eradication becomes challenging. Although ichthyologists, ecologists, and evolutionary biologists have studied this species for decades1,7,14,17, the genetic mechanisms underlying the adaptability of P. pardalis remain poorly understood. The only available genomic resources, a single mitochondrial genome (NCBI Accession: NC_058365)18 and a highly fragmented nuclear draft genome (contig N50: 4.15 kb)19, are insufficient for resolving these complex traits. This gap limits insight into the molecular drivers of invasiveness and constrains comparative analyses with native and invasive relatives.

To address these challenges, we present a chromosome-level genome assembly (1.51 Gb) of P. pardalis generated by integrating Illumina short reads, Nanopore long reads, and Hi-C data. By combining multiple annotation strategies, we determined that repetitive sequences span 0.97 Gb, or 64.47% of the genome, and we predicted 23,859 protein-coding genes. This assembly provides a high-quality genome resource for P. pardalis, facilitates large-scale comparative genomic studies, and supports applications aimed at prevention and control.

Methods

Data acquisition

The catfish samples used in this study were purchased from an ornamental fish wholesale market in Xi’an, China (Fig. 1). The remaining tissue of the sequenced specimen (Catfish_01) has been cryopreserved at −80 °C in the Biodiversity Repository of the Institute of Basic and Translational Medicine at Xi’an Medical University. All animal specimens were collected legally in accordance with the policy of the Animal Care and Use Ethics of the institution. Genomic DNA was extracted from the muscle tissue of one suckermouth catfish (P. pardalis) using the Blood & Cell Culture DNA Mini Kit (Qiagen, Hilden, Germany). To obtain a high-quality chromosome-level genome assembly, data from multiple sequencing platforms were acquired: 1) a short-insert paired-end library was prepared and sequenced on the Illumina NovaSeq 6000 platform; 2) a Nanopore library was prepared and sequenced across 26 flow cells using the Nanopore PromethION 48 (Oxford Nanopore, Oxford, UK); 3) a Hi-C library was constructed and sequenced on the Illumina NovaSeq 6000 platform; 4) to support genome annotation, total RNA was extracted from muscle using a TRIzol Kit (Life Technologies) and subsequently used for library construction and sequencing on the Illumina NovaSeq 6000 platform. All library construction and genome/transcriptome sequencing were conducted by biotechnology companies according to their standard workflows. In total, we obtained 146.15 Gb of Illumina paired-end short-read data, 218.07 Gb of Nanopore long-read data, and 149.24 Gb of high-throughput chromosome conformation capture (Hi-C) sequencing data (Table 1).

Fig. 1

A photo of the P. pardalis specimen used for the genome sequencing. (a) Dorsal view; (b) Ventral view.

Table 1 Statistics of the sequencing data generated in this study.

Quality control of sequencing data

To facilitate high-quality genome assembly, we applied strict quality control. For Illumina reads, adaptor sequences and polymerase chain reaction (PCR) duplicates were removed from all paired-end reads with Perl scripts20. In addition, any Illumina read containing more than 5% unknown bases or more than 30 low-quality bases was discarded together with its paired-end mate21. For Nanopore reads, only those with a mean quality score >7 were retained for subsequent analysis21.
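The filtering rules described above can be sketched in a few lines of Python. This is a minimal illustration, not the Perl scripts used in the study: the Phred threshold that defines a "low-quality" base (`LOW_QUAL_PHRED = 20`), the Phred+33 encoding, and the use of a plain arithmetic mean for the Nanopore read score are assumptions that the text does not state, and all function names are hypothetical.

```python
from statistics import mean

MAX_N_FRACTION = 0.05      # from the text: more than 5% unknown bases
MAX_LOW_QUAL_BASES = 30    # from the text: more than 30 low-quality bases
MIN_NANOPORE_MEAN_Q = 7    # from the text: mean quality score > 7
LOW_QUAL_PHRED = 20        # ASSUMPTION: "low-quality" is not defined in the text


def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def read_fails(seq, qual):
    """Return True if a single Illumina read violates either threshold."""
    if not seq:
        return True
    n_fraction = seq.upper().count("N") / len(seq)
    low_quality = sum(1 for q in phred_scores(qual) if q < LOW_QUAL_PHRED)
    return n_fraction > MAX_N_FRACTION or low_quality > MAX_LOW_QUAL_BASES


def keep_illumina_pair(read1, read2):
    """Keep a pair only if both mates pass; each read is (sequence, quality)."""
    return not (read_fails(*read1) or read_fails(*read2))


def keep_nanopore_read(qual):
    """Keep a long read if its mean Phred score exceeds the cutoff.

    ASSUMPTION: arithmetic mean of Phred scores; some tools instead
    average error probabilities, which gives a lower value.
    """
    scores = phred_scores(qual)
    return bool(scores) and mean(scores) > MIN_NANOPORE_MEAN_Q
```

Discarding the pair as a unit keeps the two FASTQ files synchronized, which downstream paired-end mappers generally expect.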

Genome size estimation

A k-mer-based strategy was employed to estimate the genome size of P. pardalis. 17-mer frequencies were analyzed using all the cleaned short-insert Illumina reads (https://github.com/fanagislab/kmerfreq). The genome size was calculated as G = Knum/Kdepth, where G is the estimated genome size, Knum is the total count of 17-mers, and Kdepth is the peak depth of the 17-mer distribution22. The genome of P. pardalis was estimated to be approximately 1.48 Gb, with a considerable level of heterozygosity (Fig. 2).
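As a rough illustration of the formula G = Knum/Kdepth, the sketch below computes a genome size estimate from a k-mer depth histogram. The input format (a mapping from depth to the number of distinct k-mers seen at that depth), the function name, and the `min_depth` cutoff used to skip the low-depth error peak are assumptions made for this example; they are not taken from kmerfreq or from the study.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size as Knum / Kdepth.

    histogram: dict mapping k-mer depth -> number of distinct k-mers at that depth.
    min_depth: ASSUMPTION, hypothetical cutoff that excludes the error peak
               near depth 1 when searching for the main peak.
    """
    # Knum: total number of k-mer occurrences across all depths.
    k_num = sum(depth * count for depth, count in histogram.items())

    # Kdepth: the depth with the most distinct k-mers, ignoring likely errors.
    candidates = {d: c for d, c in histogram.items() if d >= min_depth}
    if not candidates:
        raise ValueError("no histogram entries at or above min_depth")
    k_depth = max(candidates, key=candidates.get)

    return k_num / k_depth


# Toy histogram: an error spike at depth 1 and a main peak at depth 20.
toy = {1: 100, 19: 800, 20: 1000, 21: 800}
size = estimate_genome_size(toy)
```

In a heterozygous genome the histogram typically shows a second peak near half the main depth, so the choice of peak matters and is usually checked against the plotted distribution.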

Fig. 2

Genomic information of P. pardalis. Survey of genomic characteristics. X-axis represents 17-mer depth, y-axis represents 17-mer frequency.

Genome assembly

The genome assembly was performed in the following steps: 1) Long reads from the Nanopore platform were used for the contig-level assembly using NextDenovo (v2.2; https://github.com/Nextomics/NextDenovo). Key parameters included a read cutoff of 1k, a seed cutoff of 59754, and a blocksize of 5 g. 2) Cleaned short reads generated from the Illumina short-insert library were mapped onto the assembled contigs using BWA (v0.7.17)23. To further improve single-base accuracy, we performed two iterations of correction using Pilon (v1.22)24. 3) We mapped the Hi-C sequencing reads to the corrected contigs and then used Juicer (v1.5.7)25 and 3D de novo assembly (v180922)26 to perform chromosome-level genome assembly. The resulting 1.51 Gb chromosome-level reference genome comprises 26 chromosomes and has a scaffold N50 length of 49.47 Mb (Figs. 3, 4, Tables 2, 3). The assembled genome size closely matched the size estimated by k-mer analysis (1.48 Gb) (Fig. 2), indicating the high integrity of the assembly. To further evaluate the quality of the genome assembly, multiple strategies were employed, including the BUSCO (v5.2.2, Vertebrata_odb10)27 score (98.8%) (Table 4), the mapping ratios of short-insert reads (98.52%) (Table 5), transcripts (99.61%) (Tables 5, 6), and Nanopore reads (99.89%) (Table 5), the consensus quality value (QV; 31.59), and inspection of the Hox clusters (Fig. 5). The Nanopore reads were remapped with minimap2 (v2.26-r1175), and the QV was assessed with Merqury (v3.0.1; https://github.com/marbl/merqury). All these results indicate that the P. pardalis genome assembly exhibits both high integrity and accuracy.
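Two of the metrics reported above, the scaffold N50 and the QV, have simple definitions that can be sketched as follows. The function names are hypothetical and the code only illustrates the standard definitions; it does not reproduce how the assembly tools or Merqury compute them internally. Under the Phred-scale definition, a QV of 31.59 corresponds to roughly 7 errors per 10,000 bases.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    total = sum(lengths)
    if total == 0:
        raise ValueError("no sequence lengths supplied")
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


# Toy example with eight hypothetical scaffold lengths.
toy_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])

# Per-base error probability implied by the reported QV of 31.59.
reported_error = qv_to_error_rate(31.59)
```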

Fig. 3

Heatmap of chromosomal interactions. Blocks represent contact between corresponding locations.

Fig. 4

Distributions of genomic elements in the P. pardalis genome. From the outer to the inner ring: protein-coding genes, tandem repeats (TRP), long terminal repeats (LTR), short interspersed nuclear elements (SINE), long interspersed nuclear elements (LINE), DNA elements, and GC content.

Table 2 Statistics of chromosomal level assembly of P. pardalis.
Table 3 Statistics of assembly information of P. pardalis.
Table 4 Completeness assessment of P. pardalis genome by BUSCO.
Table 5 Statistics of the mapping ratio of the reads and transcripts to the P. pardalis genome.
Table 6 Statistics of transcript assembly by Bridger software.
Fig. 5

Hox gene clusters in the P. pardalis genome. Solid lines represent genes functionally annotated in the database; dotted lines indicate that only a gene fragment could be found.

Genome annotation

Tandem repeats in the genome were identified using Tandem Repeat Finder (v4.07)28. Non-interspersed repeats in the genome were annotated using RepeatMasker (v4.1.0)29. Transposable elements (TEs) were annotated at both the DNA and protein levels. At the DNA level, a de novo repeat library was constructed using RepeatModeler (v1.0.4; https://github.com/Dfam-consortium/RepeatModeler), enabling the identification of potentially novel repetitive sequences, and the genome assembly was searched against Repbase (v23.06) using RepeatMasker (v4.1.0)29 to detect homologous repetitive sequences. At the protein level, RM-BLASTX within RepeatProteinMask (v4.1.0) was used to query the TE protein database. In total, 0.97 Gb of repetitive sequences were identified, accounting for 64.47% of the P. pardalis genome assembly (Table 7). Among TEs, DNA elements (499.77 Mb; 33.15%) constituted the largest proportion, followed by long interspersed nuclear elements (LINEs; 144.01 Mb; 9.55%), long terminal repeats (LTRs; 78.00 Mb; 5.17%), and short interspersed nuclear elements (SINEs; 29.27 Mb; 1.94%) (Table 8).

Table 7 Statistics of the repetitive sequences annotated by each method in the P. pardalis genome.
Table 8 Statistics of the predicted transposable elements in the P. pardalis genome.

Prediction and functional annotation of protein-coding genes

Protein-coding genes were predicted using three distinct strategies. For de novo-based prediction, transcripts of P. pardalis muscle tissue were assembled from RNA-seq data using Bridger (r2014-12-01)30. The assembled transcripts were then filtered and used for primary prediction with the PASA pipeline (v2.1.0)31 and AUGUSTUS (v2.5.5)32. For homology-based prediction, protein sequences of Bagarius yarrelli (GCA_005784505.1), Ameiurus melas (GCA_012411365.1), Ictalurus punctatus (GCF_001660625.1), Pangasianodon hypophthalmus (GCF_009078355.1), Tachysurus fulvidraco (GCF_003724035.1), Hemibagrus wyckioides (GCA_019097595.1), Silurus meridionalis (GCF_014805685.1), Clarias magur (GCA_013621035.1), Danio rerio (GCF_000002035.6), Pelteobagrus fulvidraco (http://gigadb.org/dataset/100506), and Glyptosternon maculatum (https://doi.org/10.1093/gigascience/giy104) were downloaded. To further refine the coding gene prediction, we selected the longest transcript for each gene and removed those with premature termination sites. Homology-based annotation was then performed using the Basic Local Alignment Search Tool (BLAST) (v2.2.26; https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.2.26/) with an e-value threshold of 1e-5, together with GeneWise (v2.4.1)33. For transcript-based prediction, RNA-seq reads were mapped to the assembled genome using BLAT (v34)34, and spliced alignments were subsequently linked using the PASA pipeline (v2.1.0)31. Finally, the gene models obtained from the three strategies were integrated using EvidenceModeler (r2012-06-25)35. We predicted 23,859 protein-coding genes in the P. pardalis genome (Table 9), with a BUSCO score of 87.7% (metazoa_odb10, Table 10). To validate the quality of these predicted genes, we compared the length distributions of several gene features, including mRNA (Fig. 6a), coding sequences (CDS) (Fig. 6b), exons (Fig. 6c), and introns (Fig. 6d), between P. pardalis and other species. The predicted protein-coding genes in P. pardalis were of comparable quality to those previously reported in other species (Fig. 6).
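The refinement step mentioned above, keeping the longest transcript per gene and then discarding those with premature termination sites, might look like the following sketch. The input format (tuples of gene ID, transcript ID, and coding sequence), the function names, and the use of the standard genetic code are assumptions for illustration; the study does not describe its implementation.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # ASSUMPTION: standard genetic code


def has_premature_stop(cds):
    """Return True if an in-frame stop codon occurs before the final codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return any(codon in STOP_CODONS for codon in codons[:-1])


def select_representatives(transcripts):
    """Pick the longest transcript per gene, then drop those with premature stops.

    transcripts: iterable of (gene_id, transcript_id, cds) tuples (hypothetical format).
    Returns a dict mapping gene_id -> (transcript_id, cds).
    """
    longest = {}
    for gene_id, transcript_id, cds in transcripts:
        current = longest.get(gene_id)
        if current is None or len(cds) > len(current[1]):
            longest[gene_id] = (transcript_id, cds)
    return {
        gene_id: entry
        for gene_id, entry in longest.items()
        if not has_premature_stop(entry[1])
    }


# Toy input: gene g1 has two isoforms; gene g2 has an internal stop codon.
example = [
    ("g1", "t1", "ATGAAATAA"),
    ("g1", "t2", "ATGAAACCCTAA"),
    ("g2", "t3", "ATGTAAGGGTAA"),
]
representatives = select_representatives(example)
```

Note that the order of the two steps matters: filtering after selection, as written in the text, removes a gene entirely when its longest transcript is defective, whereas filtering first would fall back to a shorter isoform.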

Table 9 Statistics of functional annotation for protein coding genes.
Table 10 Completeness assessment of the P. pardalis gene set by BUSCO.
Fig. 6

Quality comparison of protein-coding genes between P. pardalis and other species. Quality of gene annotation based on (a) gene length, (b) CDS length, (c) exon length, and (d) intron length, respectively.

For functional annotation, all the predicted protein-coding genes were aligned to multiple databases, including InterPro (https://www.ebi.ac.uk/interpro/), Gene Ontology (GO) (https://geneontology.org/), Kyoto Encyclopedia of Genes and Genomes (KEGG) (https://www.kegg.jp/), UniProt/SwissProt (https://www.uniprot.org/), UniProt/TrEMBL (https://www.uniprot.org/), and the Non-Redundant Protein Sequence Database (NR; https://ftp.ncbi.nlm.nih.gov/blast/db). The majority of the predicted genes (22,169; 92.92%) had homologs in these public databases (Table 9).

Data Records

All the raw sequencing data, including Nanopore and Illumina reads, have been deposited in the NCBI database (National Center for Biotechnology Information, https://www.ncbi.nlm.nih.gov) under the BioProject accession number PRJNA116548336. The genome assembly and annotation files have been deposited in the Dryad Digital Repository (https://doi.org/10.5061/dryad.bk3j9kdgh)37 and GenBank (GCA_050231285.1)38.

Technical Validation

The final assembly (1.51 Gb) of P. pardalis is slightly larger than the estimated genome size (1.48 Gb), which may be caused by genome heterozygosity (Fig. 2). Three distinct strategies were employed to predict protein-coding genes. Using Hi-C technology, we assembled 26 chromosomes of P. pardalis (Fig. 3), which is consistent with the result of a karyotype experiment in a previous study39. Genome annotation further revealed that the length and proportion of repetitive sequences in P. pardalis (0.97 Gb and 64.47%) are markedly higher than those of other catfish species (I. punctatus: 0.27 Gb and 34.92%, P. hypophthalmus: 0.27 Gb and 36.90%, H. wyckioides: 0.32 Gb and 40.12%, S. meridionalis: 0.30 Gb and 40.12%, G. maculatum: 0.25 Gb and 32.76%, and P. fulvidraco: 0.28 Gb and 38.47%) (Fig. 4), suggesting that the expansion of repetitive regions is the main reason for the large genome of P. pardalis.