Background & Summary

The darkbarbel catfish (Pelteobagrus vachelli) belongs to the order Siluriformes, family Bagridae, and genus Pelteobagrus1. This species is the largest and fastest-growing within its genus, with individuals reaching up to 2 kg in weight and 50 cm in length2,3. Owing to its flavorful meat, high nutritional value, and few intermuscular bones, P. vachelli is highly sought after by consumers and the market, making it one of the fastest-growing specialty fish species in pond aquaculture over the past decade4,5. The yellow catfish (Pelteobagrus fulvidraco), a closely related species within the same genus, is also an important aquaculture species in China, with a production of 622,651 tons in 2023. Although P. vachelli grows faster and attains a larger body weight than P. fulvidraco, its flesh is less tender. Recent research indicates that hybrid yellow catfish (P. fulvidraco ♀ × P. vachelli ♂) show significant advantages in growth, survival rate, disease resistance, and transportability6. Consequently, hybrid yellow catfish have gradually become the main aquaculture form, leading to the development of the new variety “Huangyou No. 1”7. As a promising new aquaculture variety, the hybrid yellow catfish has a very positive market outlook8. However, the molecular mechanisms underlying heterosis in these interspecific hybrids remain unclear. Further research using multi-omics analyses and other techniques is needed to elucidate the genetic and gene-regulatory mechanisms responsible for this heterosis.

Species within the family Bagridae exhibit pronounced sexual dimorphism in growth, which makes the study of their sex determination mechanisms particularly important9,10. To date, chromosome-level genomes have been decoded for several bagrid species, including Pelteobagrus fulvidraco11, Leiocassis longirostris12, Pseudobagrus ussuriensis13, Hemibagrus wyckioides14,15, Hemibagrus macropterus16, and Hemibagrus guttatus17. These genome sequences provide a solid foundation for analyzing key economic traits, particularly for investigating sex determination mechanisms and their eventual application in monosex breeding18,19,20. Pelteobagrus vachelli shows pronounced differences in growth between sexes in both wild and farmed populations. One-year-old males grow approximately 50% faster than females under the same farming conditions, while two-year-old males grow 2 to 3 times faster than females2,21. Exploiting these growth differences, Chinese researchers have developed a sex-specific molecular marker-assisted technique for producing all-male fish, which has significantly increased the yield and economic benefits of the yellow catfish industry22,23. Aquaculture practice has also shown that significant growth differences between sexes persist in hybrid yellow catfish, with two-year-old males growing about 50% faster than females7. This suggests that all-male breeding remains both necessary and promising in hybrid yellow catfish24. Consequently, analyzing the genomic information of male P. vachelli, developing male-specific molecular markers, and establishing rapid methods for identifying the genetic sex of P. vachelli are crucial for the all-male breeding of P. vachelli and hybrid yellow catfish. While previous studies have reported the female genome25,26, information on the male genome remains scarce.

In this study, we combined short-read sequencing, PacBio HiFi long-read sequencing, and Hi-C technology to generate a high-quality, chromosome-level assembly of the male Pelteobagrus vachelli genome. This reference genome is expected to support population genetics and the identification of functional genes linked to key economic traits in P. vachelli. It also provides a genomic foundation that may facilitate hybrid breeding and all-male breeding programs for P. vachelli.

Methods

Sample collection and DNA extraction

A mature male Pelteobagrus vachelli was collected from the Pearl River Fisheries Research Institute, Chinese Academy of Fishery Sciences, Guangzhou, China. Muscle tissue from this specimen was used to extract DNA for whole-genome sequencing, including short-read and long-read sequencing, as well as Hi-C sequencing. All experiments were conducted in accordance with the recommendations of the Ethics Committee of the Pearl River Fisheries Research Institute, Chinese Academy of Fishery Sciences. Genomic DNA was extracted from the muscle tissue using a Qiagen DNeasy Blood and Tissue Kit (Qiagen, USA), following the manufacturer’s instructions. The quality and concentration of the extracted DNA were assessed using a NanoDrop One spectrophotometer (Thermo Scientific, USA) and 1% agarose gel electrophoresis.

Genome sequencing

The extracted DNA was randomly sheared into approximately 350 bp fragments, and a short-read library was constructed using the MGIEasy Universal DNA Library Prep Set (MGI, China). Sequencing was performed on the MGISEQ T7 platform (MGI, China), producing 38.89 Gb of paired-end raw reads, each 150 bp in length (Table 1). For PacBio sequencing, we employed the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, USA) following PacBio’s standard protocol with insert sizes of 15 kb, and sequenced on the Pacific Biosciences Sequel II platform in CCS mode. This process yielded 31.22 Gb of HiFi data, with an average read length of 14.05 kb (Table 1). For Hi-C sequencing, approximately 1 g of muscle tissue from the male Pelteobagrus vachelli was dissected and processed using the GrandOmics Hi-C kit (DpnII restriction enzyme; GrandOmics, China) according to the manufacturer’s protocol. The Hi-C library was sequenced on the MGISEQ T7 platform (MGI, China), yielding 81.36 Gb of Hi-C read data (Table 1).

Table 1 Statistics of sequencing data.

RNA extraction and transcriptome sequencing

To facilitate genome annotation, total RNA was extracted from twelve tissues, including the brain, liver, kidney, heart, muscle, spleen, skin, gill, swim bladder, intestine, blood, and testis. The RNA quality was assessed using a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific, USA) and an Agilent 2100 Bioanalyzer (Agilent Technologies, USA). A cDNA library was constructed from the mixed RNA sample using the MGIEasy Universal DNA Library Prep Set (MGI, China), following the manufacturer’s protocol. This library was sequenced on the MGISEQ T7 platform (MGI, China) with a paired-end 150 bp layout, producing 32.47 Gb of transcriptome data (Table 1).

Genome size and heterozygosity estimation

The raw MGI short reads were filtered using fastp (v0.23.2)27 (parameters: -q 15 -l 150) to remove low-quality reads and adaptor sequences. To estimate the genome size of P. vachelli, a k-mer analysis was performed using 38.05 Gb of clean reads. First, Jellyfish (v2.3.0)15 was used to count 17-mers and generate a k-mer frequency table. Subsequently, GenomeScope (v2.0)28 was used to analyze the overall genomic properties. The preliminary genome survey estimated a genome size of approximately 594,436,389 bp, a heterozygosity of 0.568%, and a repeat content of 39.3% (Fig. 1).
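
As a rough illustration of the k-mer approach, genome size can be approximated as the total number of k-mers divided by the depth of the main peak. The Python sketch below is a simplification and not the GenomeScope model, which fits a mixture distribution to the histogram. The function names, the error cutoff `min_depth`, and the toy histogram are illustrative assumptions; the input is assumed to follow the two-column "depth count" layout written by `jellyfish histo`.

```python
def parse_histo(lines):
    """Parse 'depth count' lines (assumed jellyfish histo layout) into a dict."""
    histogram = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        histogram[int(parts[0])] = int(parts[1])
    return histogram


def estimate_genome_size(histogram, min_depth=5):
    """Return (estimated genome size, peak depth) from a k-mer depth histogram.

    histogram maps k-mer depth -> number of distinct k-mers at that depth.
    Depths below min_depth are treated as sequencing errors (an assumption).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Assumption: the tallest remaining bin is the homozygous peak.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences divided by peak depth approximates genome size.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram, not data from this study.
toy = parse_histo(["1 1000", "2 200", "29 50", "30 100", "31 50"])
print(estimate_genome_size(toy))  # (200.0, 30)
```

In a strongly heterozygous genome the tallest bin may be the heterozygous peak, in which case this shortcut would overestimate the size; model-based tools handle that case explicitly.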

Fig. 1 The 17-mer frequency distribution analysis chart for the Pelteobagrus vachelli genome.

Genome assembly

The genome was assembled using Hifiasm (v0.19.5)29 with default parameters from 2,222,451 HiFi reads totaling 31.22 Gb. The HiFi long reads were input into Hifiasm to generate the primary assembly contig graph. This process resulted in 824 contigs with a total length of 728.88 Mb, a maximum contig size of 17.96 Mb, and an N50 of 5.60 Mb (Table 2). Scaffolding was achieved using Juicer (v1.6)30 in conjunction with 3D-DNA (v201008)31. First, BWA (v0.7.17)32 was used to index the contig-level genome, after which Juicer was used to identify restriction enzyme cutting sites. Clean paired-end Hi-C reads were mapped to the contigs using Juicer, and Hi-C-assisted initial chromosome assembly was performed with the 3D-DNA algorithm following standard protocols. Chromosome boundaries were refined and scaffolds were manually corrected in Juicebox (v1.11.08)33, resulting in the resolution of 26 chromosomes (Figs. 2, 3). The file modified in Juicebox was further revised and used as input for 3D-DNA for re-scaffolding on a per-chromosome basis. The final assembly comprised 32 scaffolds, with a maximum scaffold size of 45.05 Mb and an N50 of 28.76 Mb (Tables 2, 3).
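
For readers unfamiliar with the N50 statistic reported here, it is the length L such that sequences of length L or longer together cover at least half of the total assembly. The following minimal Python sketch computes it from a list of contig or scaffold lengths; the function name and the toy lengths are illustrative and are not taken from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths supplied")
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example: total 20, so the running sum 6 + 5 = 11 first reaches half (10).
print(n50([2, 3, 4, 5, 6]))  # 5
```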

Table 2 Summary statistics of P. vachelli genome assembly.

Fig. 2 Hi-C contact map produced by 3D-DNA.

Fig. 3 Features of the P. vachelli genome. From outside to inside: (a) the 26 pseudo-chromosomes, (b) GC content, (c) gene density, (d) repeat content, (e) LTR content, (f) LINE content, (g) DNA-TE content and (h) male and female P. vachelli.

Table 3 Pseudo-chromosome length statistics after Hi-C assisted assembly.

Repeat annotation

To annotate tandem repeats, we used two software tools, GMATA (v2.2.1)34 and Tandem Repeats Finder (TRF, v4.10.0)35, to perform a genome-wide search with default parameters. GMATA is designed primarily to identify simple sequence repeats (SSRs) with short repeat units, whereas TRF can detect tandem repeats with repeat units of any type. The search results indicated that SSRs comprise 1.49% of the total genome length, while tandem repeat sequences account for 1.99%. We then investigated the dispersed repetitive sequences. First, MITE-hunter36 was used to identify miniature inverted-repeat transposable elements (MITEs) within the genome and to build a MITE library file (MITE.lib). The genome was then hard-masked, with repeated sequences replaced by ‘N’, and RepeatModeler (v2.05)37 was used to conduct a de novo search for additional repeated sequences, producing a de novo library file (RepMod.lib). Because RepMod.lib contained numerous unclassified repeated sequences, TEclass38 was used to classify them. Finally, the MITE.lib, RepMod.lib, and Repbase (v19.06)39 libraries were merged into a comprehensive library file, which was used with RepeatMasker (v4.1.6)40 to search for repeated sequences throughout the entire genome. The results revealed that dispersed repetitive sequences constitute 32.53% of the total genome length (Table 4). Among transposable elements (TEs), DNA elements are the most prevalent, making up 15.21% of the genome, followed by long interspersed nuclear elements (LINEs) at 6.88%, long terminal repeat (LTR) retrotransposons at 5.00%, and MITEs at 2.92%. In total, 284,053,865 bp of repetitive sequences were identified, comprising 38.97% of the entire genome (Table 4, Fig. 4).
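
To make the notion of an SSR concrete, the Python sketch below finds perfect tandem repeats of short motifs with a regular expression and converts a repeat length in bp into a percentage of the genome. It is only an illustration: GMATA and TRF use their own algorithms and scoring, and the function names, the motif-size limit, and the `min_repeats` threshold are assumptions chosen for the example.

```python
import re


def find_ssrs(sequence, max_motif=6, min_repeats=4):
    """Return (start, motif, copies) for perfect tandem repeats in sequence."""
    if min_repeats < 2:
        raise ValueError("min_repeats must be at least 2")
    # Lazy motif capture tries the shortest repeat unit first; the
    # backreference then requires at least min_repeats - 1 further copies.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_repeats - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits


def percent_of_genome(repeat_bp, genome_bp):
    """Express a repeat length as a percentage of the genome length."""
    return 100.0 * repeat_bp / genome_bp


# Toy sequence: five copies of "AC" flanked by non-repetitive bases.
print(find_ssrs("GGTT" + "AC" * 5 + "GG"))  # [(4, 'AC', 5)]

# Totals reported in the text: 284,053,865 bp of repeats in a 728.88 Mb assembly.
print(round(percent_of_genome(284_053_865, 728_880_000), 2))  # 38.97
```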

Table 4 Repetitive sequences in the genome of P. vachelli.

Fig. 4 Sequence divergence distribution of four TE types annotated by RepeatMasker.

Gene prediction and function assignment

Gene structure prediction was carried out using three approaches: homology-based, transcriptome-based, and ab initio annotation. For the homology-based prediction, we used GeMoMa (v1.6.1)41 to align homologous proteins from six related species (Danio rerio, Ictalurus punctatus42, Silurus meridionalis43, Pangasianodon hypophthalmus44, Pseudobagrus ussuriensis13, and Pelteobagrus fulvidraco11) to our assembled genome. Transcriptome-based annotation with PASA (v2.3.3)45 provided transcript-supported gene information. This information was then used in a semi-supervised self-training process with GeneMark-ST (v5.1)46 to predict gene models. The predicted genes were compared against the Swiss-Prot database47 using BLASTP, and alignment results were filtered for identity ≥ 95%. We selected the 3,000 GeneMark-ST genes with the highest alignment scores as the training set for AUGUSTUS. Subsequently, AUGUSTUS (v3.5.0)48 was used with the trained model to predict genes within the genome. The gene prediction results from the ab initio, homology-based, and transcriptome-based annotations were converted into a format compatible with EVM (v2.1.0)45 and integrated using EVM with default parameters to produce an initial non-redundant gene set. Our predictions identified a total of 23,638 genes in the genome, with an average gene length of 14,706.84 bp, an average coding sequence length of 167.49 bp, and an average of 9.99 exons per gene (Table 5, Fig. 5).
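
The selection of an AUGUSTUS training set described above (identity filter, then ranking by alignment score) can be sketched as follows. This is a hedged illustration and not the authors' script: it assumes the BLASTP results are in the standard 12-column tabular layout (`-outfmt 6`), takes the bit score as the "alignment score", and keeps the best hit per query; the function name and the toy hits are hypothetical.

```python
def select_training_genes(blast_lines, min_identity=95.0, top_n=3000):
    """Return up to top_n query IDs ranked by their best bit score.

    Assumed columns (BLAST tabular format 6): qseqid sseqid pident length
    mismatch gapopen qstart qend sstart send evalue bitscore.
    """
    best_score = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, identity, bitscore = fields[0], float(fields[2]), float(fields[11])
        if identity < min_identity:
            continue  # identity filter, as described in the text
        if query not in best_score or bitscore > best_score[query]:
            best_score[query] = bitscore
    # Rank by descending bit score; ties are broken by query ID for determinism.
    ranked = sorted(best_score, key=lambda q: (-best_score[q], q))
    return ranked[:top_n]


# Toy hits, not data from this study.
hits = [
    "g1\tP1\t99.0\t300\t3\t0\t1\t300\t1\t300\t1e-100\t600",
    "g1\tP2\t96.0\t300\t12\t0\t1\t300\t1\t300\t1e-110\t650",
    "g2\tP3\t90.0\t400\t40\t0\t1\t400\t1\t400\t1e-150\t900",
    "g3\tP4\t95.0\t250\t12\t0\t1\t250\t1\t250\t1e-90\t500",
]
print(select_training_genes(hits, top_n=2))  # ['g1', 'g3']
```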

Table 5 Gene structures and function annotation.

Fig. 5 Venn diagram of function annotations from various databases.

Data Records

The raw sequencing reads of all libraries have been deposited in the NCBI SRA database under the BioProject accession number PRJNA1000294 (ref. 49). The assembled genome has been deposited at GenBank under the accession number GCA_033026395.1 (ref. 50). The genome annotations, predicted coding sequences, and protein sequences are available at Figshare51.

Technical Validation

Genome synteny analysis

To investigate chromosomal synteny with closely related species, we performed a comparative analysis of the P. vachelli genome against those of Pelteobagrus fulvidraco11 and Pseudobagrus ussuriensis13. Whole-genome alignments between P. vachelli and the other two species were conducted using MCscan (v0.8)52, and syntenic relationships were visualized with JCVI (v1.1.12)53. The collinearity analysis revealed chromosomal rearrangements on six chromosomes between P. vachelli and P. fulvidraco. In contrast, the genomes of P. vachelli and P. ussuriensis exhibited a one-to-one correspondence between their chromosomes, supporting the quality and accuracy of our assembly (Fig. 6).
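
One simple way to summarize such a comparison is to tabulate, for each chromosome of one species, which chromosomes of the other species hold a substantial share of its syntenic anchors. The Python sketch below does this for a list of anchor pairs. It is a hypothetical post-processing step, not part of MCscan or JCVI; the function name, the `min_share` threshold, and the toy chromosome labels are assumptions.

```python
from collections import Counter, defaultdict


def summarize_synteny(anchor_pairs, min_share=0.1):
    """Summarize chromosome correspondence from syntenic anchor pairs.

    anchor_pairs is an iterable of (chromosome_in_A, chromosome_in_B) tuples,
    one per anchor. Returns (partners, one_to_one), where partners maps each
    chromosome of A to the sorted chromosomes of B holding at least min_share
    of its anchors.
    """
    counts = defaultdict(Counter)
    for chrom_a, chrom_b in anchor_pairs:
        counts[chrom_a][chrom_b] += 1

    partners = {}
    for chrom_a, counter in counts.items():
        total = sum(counter.values())
        partners[chrom_a] = sorted(
            b for b, n in counter.items() if n / total >= min_share
        )

    # One-to-one: every chromosome has exactly one partner, and no partner is shared.
    one_to_one = all(len(p) == 1 for p in partners.values()) and len(
        {p[0] for p in partners.values()}
    ) == len(partners)
    return partners, one_to_one


# Toy anchors: A2 is split between B2 and B3, hinting at a rearrangement.
toy_pairs = [("A1", "B1")] * 10 + [("A2", "B2")] * 6 + [("A2", "B3")] * 4
print(summarize_synteny(toy_pairs))
```

Chromosomes with more than one partner are candidates for rearrangements, although confirming them requires inspecting the underlying alignments.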

Fig. 6 Chromosomal synteny between Pelteobagrus vachelli and two related species, Pelteobagrus fulvidraco and Pseudobagrus ussuriensis.

Assessment of genome assembly

The completeness of the P. vachelli genome assembly was evaluated with BUSCO (v5.4.3)54 using the conserved Actinopterygii gene set ‘actinopterygii_odb10’. The analysis demonstrated an overall completeness of 98.1%. Specifically, 96.8% of the genes were complete and single-copy, 1.3% were complete and duplicated, 0.9% were fragmented, and 1.0% were missing. These findings indicate the high quality of the P. vachelli genome assembly (Table 6).
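
The BUSCO categories are related by simple arithmetic: complete (C) is the sum of single-copy (S) and duplicated (D), and C, fragmented (F), and missing (M) should add up to 100%. The short Python sketch below checks this for the percentages quoted above and formats them in the compact C/S/D/F/M notation used in BUSCO summaries. The function name and the rounding tolerance are illustrative assumptions.

```python
def busco_summary(single, duplicated, fragmented, missing, tolerance=0.2):
    """Format BUSCO percentages as a one-line summary after a sanity check."""
    complete = single + duplicated
    total = complete + fragmented + missing
    # Reported percentages are rounded, so allow a small deviation from 100.
    if abs(total - 100.0) > tolerance:
        raise ValueError(f"categories sum to {total:.2f}%, expected 100%")
    return (
        f"C:{complete:.1f}%[S:{single:.1f}%,D:{duplicated:.1f}%],"
        f"F:{fragmented:.1f}%,M:{missing:.1f}%"
    )


# Assembly-level values quoted in the text.
print(busco_summary(96.8, 1.3, 0.9, 1.0))
# C:98.1%[S:96.8%,D:1.3%],F:0.9%,M:1.0%
```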

Table 6 BUSCO analysis of the genome assembly and genes.

Gene annotation validation

To evaluate the completeness of the annotated gene set, we conducted a BUSCO analysis using conserved single-copy orthologs from the actinopterygii_odb10 library. The results showed that 96.54% of the BUSCO genes were complete in the annotated gene set, indicating a high level of completeness in the conserved gene predictions. Specifically, 95.08% of the genes were complete and single-copy BUSCOs, with only 0.47% fragmented and 2.99% missing from the annotation (Table 7). These findings indicate that conserved gene content is well represented in the P. vachelli genome annotation and support confidence in the gene predictions.

Table 7 BUSCO analysis of the genome annotation and genes.