Background & Summary

The bighead catfish (Clarias macrocephalus, Clariidae) is a freshwater species native to Southeast Asia. It belongs to the order Siluriformes, one of the most diverse fish orders, with over 3,000 extant species. Its distribution lies primarily across the Mekong River Basin, including Cambodia, Thailand, and Vietnam, and has been extended to several other countries such as China, Guam, Malaysia, and the Philippines1,2. Phylogenetic analyses indicate that the bighead catfish is most closely related to the white-spotted catfish (Clarias fuscus), from which it diverged approximately 25 million years ago3, and that it is phylogenetically distant from the North African catfish (C. gariepinus), a clariid catfish of major economic importance globally, and the eel catfish (Channallabes apus)4,5,6. The bighead catfish is a benthopelagic, facultative air-breather that can survive in hypoxic waters, migrate across land, and aestivate in mud during the dry season. These features reflect ecological plasticity and make it a valuable model animal for studying respiratory physiology and environmental adaptation7. The bighead catfish is also an economically important aquaculture species in Southeast Asia. However, its wild populations are in a critical situation: they are declining owing to habitat degradation, overfishing, and genetic introgression from hybridization with alien invasive species, such as the North African catfish used for aquaculture breeding1,7.

Despite progress in aquaculture practices and ecological research, genomic resources for the bighead catfish remain limited, and existing data are incomplete. This limitation hampers efforts toward biodiversity conservation and sustainable aquaculture. Genome-wide studies using reduced-representation approaches, such as DArTseq™, lack support from a contiguous reference genome and are inadequate for resolving structural variants and functional elements8,9. By contrast, chromosome-level genome assemblies of other catfish species have greatly advanced our understanding of their physiology, adaptation, and aquaculture traits. For example, the genome of the channel catfish (Ictalurus punctatus) revealed a key regulator of dermal bone and scale formation, providing insights into skeletal evolution in teleosts and information for selection programs10. Similarly, the genome of the Chinese longsnout catfish (Leiocassis longirostris) has revealed molecular pathways related to carnivorous feeding and energy metabolism, demonstrating the role of genomic data in ecological adaptation and nutritional studies11. Such resources facilitate genomic selection in breeding programs, but comparable analyses have not been possible for the bighead catfish, for which a chromosome-scale assembly is lacking and only a scaffold-level assembly has been produced12.

Here, to overcome the longstanding deficiency in genomic resources for the bighead catfish and to support conservation and breeding efforts, we generated the first chromosome-scale, haplotype-resolved genome assembly using a combination of third-generation platforms, namely Pacific Biosciences High-Fidelity (HiFi) long reads and Oxford Nanopore Technologies long reads, complemented by Illumina paired-end short reads and High-throughput Chromosome Conformation Capture (Hi-C) paired-end short reads. This new assembly will serve as a foundational resource for genetic and evolutionary research.

Methods

Sample collection and DNA preparation

Broodstock of the bighead catfish (C. macrocephalus) was obtained from the Faculty of Fisheries at Kasetsart University, Thailand. The bighead catfish was euthanized by severing the spinal cord anterior to the dorsal fin, and the liver tissue was collected and preserved in ethanol for DNA extraction. The sex of the bighead catfish was determined from the morphology of the genital papillae13. Genomic DNA was extracted using a standard salting-out method, as described previously by Supikamolseni et al.14.

Quality control and preparation of sequenced reads

Raw sequencing reads were evaluated for quality using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) v0.1.12 for short reads and NanoPlot15 v1.42.0 for long reads. The sequencing depth and FASTQ quality summaries (base/quality summary) were calculated using SeqKit16 and Seqtk (https://github.com/lh3/seqtk). Illumina data, excluding Hi-C data, were trimmed to remove synthetic sequencing adapters using AdapterRemoval17 v2.3.3. PacBio HiFi reads were trimmed using HiFiAdapterFilt (‘-reward 1 -penalty -5 -gapopen 3 -gapextend 3 -dust no -soft_masking true -value 700 -searchsp 1750000000000’)18 v64d1c7b. Cross-species contamination of raw read data was evaluated against the Mash database (‘refseq.genomes + plasmids.k21.s1000.msh’) using Mash (‘sketch size = 1000, k = 21’)19 v2.3. Reads were filtered by length and quality whenever necessary using NanoFilt (‘-l $length -q $Quality’)15 v2.8.0.

PacBio HiFi long reads

PacBio HiFi reads (library source: genomic, library selection: circular consensus sequencing (CCS)/PCR) were generated using CCS, which achieves a base accuracy of >99.9% and maximizes assembly quality by reading the same circularized DNA molecule multiple times in a continuous polymerase reaction without denaturation20. Sequencing on the Sequel II system produced 10.62 Gb of raw data, representing 12.9X genome coverage. The dataset achieved a mean quality score of Q28.7 and a median of Q36, with over 72% of reads scoring above Q30 (Fig. 1C).

Fig. 1

Summary of sequencing data of the bighead catfish (Clarias macrocephalus) genome. (A) Summary of sequencing platforms and data types, including PacBio HiFi, Oxford Nanopore Technologies (ONT), and Illumina paired-end sequencing (Hi-C and WGS) and their genome coverage. (B) GenomeScope2.0 profile based on k-mer size of 21, which indicates a haploid genome size of approximately 899 Mb with 0.056% heterozygosity and 1.77% duplication. The x-axis shows k-mer coverage, and y-axis shows the observed frequency of k-mers. (C) Quality and length distribution of PacBio HiFi reads. The x-axis shows read length (bp), and the y-axis shows the mean PHRED score. The difference in GC content is represented by difference in color intensity. (D) Quality and length distribution of ONT reads. The x-axis shows read length (bp, log scale), and the y-axis shows the mean PHRED score. The difference in GC content is represented by difference in color intensity.

Oxford Nanopore Technologies (ONT) non-UL 1D-long noisy reads

ONT reads (library source: genomic, library selection: size fractionation) provided long but noisy data with a median error rate of approximately 20%. ONT reads have been used to improve assembly continuity and to span complex repeat regions21. The libraries were prepared from size-selected, end-polished double-stranded DNA fragments to which adapters were ligated, and were purified using AMPure XP beads. Sequencing on a PromethION system generated 29.56 Gb of raw data, equivalent to 36X genome coverage, with a read length N50 of 13 kb. The median quality score was Q11.03, and 97.0% of reads were above Q7 (Fig. 1D).

Proximity-ligation Illumina paired-end short reads

Hi-C data (library source: genomic, library selection: DNase) were generated using proximity ligation to provide long-range linkage information, allowing contig phasing (haplotype separation) and genome scaffolding22,23. The data were generated using an in situ Hi-C protocol that employed DpnII endonuclease digestion and were sequenced on an Illumina NextSeq 2000, which yielded 37.19 Gb of paired-end reads and achieved 39.77X genome coverage. The data quality was high, with 123.96 M read pairs, 74.77 M unique Hi-C contacts, and > 97.18% of the data at Q30.

Illumina paired-end short reads

Illumina paired-end short reads24 (library source: genomic, library selection: PCR) were sequenced on an Illumina NextSeq 2000 platform, which yielded 31.91 Gb of raw read data (read length = 151 bp) with 40X genome coverage. The average quality score was Q25, and 92% of the reads were above Q30.

Reference-free genome profiling survey

All reference-free genome profiling analyses were performed using the K-mer Analysis Toolkit (KAT)25 v2.4.1 and GenomeScope26 v2.0. For the genome survey (Table 1), canonical 21-mer distributions were estimated from trimmed Illumina and HiFi reads using Meryl27 and Jellyfish28 v2.3.1. The genome survey predicted a haploid genome size of approximately 890 Mb and an inter-haplotype heterozygosity rate of approximately 0.549% for the bighead catfish (Fig. 1B).
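The idea behind a k-mer genome survey can be illustrated with a minimal Python sketch. It is not how Meryl, Jellyfish, or GenomeScope are implemented: GenomeScope fits a mixture model to the histogram, whereas this sketch simply divides the number of trusted k-mers by the depth of the tallest peak, assuming that peak is the homozygous one. The function names and the `min_depth` cut-off are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_DNA = frozenset("ACGT")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_canonical_kmers(reads, k=21):
    """Count canonical k-mers over an iterable of read strings, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if _DNA.issuperset(kmer):
                counts[canonical(kmer)] += 1
    return counts


def kmer_histogram(counts):
    """Map each depth (multiplicity) to the number of distinct k-mers seen at that depth."""
    return Counter(counts.values())


def estimate_haploid_size(histogram, min_depth=5):
    """Naive estimate: trusted k-mer total divided by the depth of the tallest peak.

    Assumption (hypothetical simplification): k-mers below min_depth are
    sequencing errors, and the tallest remaining peak is the homozygous peak.
    """
    usable = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    peak_depth = max(usable, key=lambda depth: usable[depth])
    total_kmers = sum(depth * n for depth, n in usable.items())
    return total_kmers / peak_depth
```

For a real dataset the histogram would come from the k-mer counter's own output rather than from in-memory counting, which does not scale to tens of gigabases.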

Table 1 Genomic sequencing data and genome survey.

De novo haplotype-resolved assembly, Hi-C scaffolding, and haplotype phasing

The haplotype-resolved diploid assembly of the bighead catfish was generated from HiFi, ONT, and Hi-C data using Hifiasm (‘-hic-hifi -ul -primary’)29 v0.19.8-r603, resulting in two Hi-C-phased contig sets (.hic.hap1 and .hic.hap2), which represented two separate and complete haploid homologous haplotypes30. Each haplotype genome comprises a random assortment of chromosomes inherited from the maternal and paternal sets. GreenHill (‘-cph $hap1 $hap2 -p $combined_ont_hifi -IP1 -IP2 -HIC’)31 v1.0.0 was used to scaffold haplotype 1 and haplotype 2 with long reads (HiFi and ONT), proximity-ligation (Hi-C) reads, and Illumina paired-end short reads.

Hi-C maps, manual review, and post-review for obtaining chromosome-scale scaffolds

To build heat maps of Hi-C contacts between pairs of loci and obtain Hi-C scaffolds (post-haplotype resolved assembly), we used the Juicer pipeline (‘-assembly -S early’)32 v2.17.00 and HiCCUPS CPU (‘-ignore-sparse -cpu’) v2.17.00, which aligned Hi-C reads to GreenHill scaffolds and assigned a mapping quality score using BWA-MEM33 v0.7.17_r1188. PCR duplicates and near-duplicate mapped reads were removed using Samtools34 v1.18. Hi-C contacts were generated at various resolutions (2.5 Mb to 5 kb). Hi-C maps were visualized by running the runassembly_visualizer.sh script from the 3D-DNA pipeline22 v02/12/2018. To manually review Hi-C scaffolds for misjoins or errors, Juicebox Assembly Tools (JBAT)32 v2.16.00 was used, and regions with misjoins or under-collapsed heterozygosity were edited based on off-diagonal signals in the Hi-C read density heatmap, which indicated contig/scaffolding errors. Post-review validation of the Hi-C scaffolds was performed using 3D-DNA-post-review.sh v180114 (‘-sort-output -c 27’), a module of the 3D-DNA pipeline22. Unplaced scaffolds were filtered out using SeqKit (‘-by-length -reverse -m 2,500,000’)16 v2.7.0, which resulted in 27 chromosome-length scaffolds per haplotype (Fig. 2).
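The final length filter can be pictured with a short Python sketch. It mirrors the effect of the SeqKit step (keep scaffolds of at least 2.5 Mb, sorted from longest to shortest) but is not SeqKit itself; `read_fasta` and `chromosome_scale` are hypothetical helper names introduced only for illustration.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the identifier, dropping any free-text description.
            name, chunks = line[1:].split()[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def chromosome_scale(records, min_length=2_500_000):
    """Keep records at least min_length long, sorted by decreasing length."""
    kept = [(name, seq) for name, seq in records if len(seq) >= min_length]
    return sorted(kept, key=lambda record: len(record[1]), reverse=True)
```

With a file on disk, `chromosome_scale(read_fasta(open(path)))` would return the candidate pseudo-chromosomes; under the threshold reported above, 27 records per haplotype would be expected.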

Fig. 2

Workflows of haplotype-resolved genome assembly and genome scaffolding for the bighead catfish (Clarias macrocephalus). (A) The results of scaffolding using Juicebox v2.16.0. (B) The strategy of genome assembly using PacBio HiFi, ONT, and Illumina reads, followed by consensus assembly using Flye and haplotype-resolved assembly using Hifiasm (Hi-C UL mode). (C) Genome scaffolding using Hi-C reads, processed through GreenHill, 3D-DNA, and Juicer tools, producing haplotype-resolved and consensus assemblies. (D) Manual curation using JBAT and 3D-DNA post-review to correct scaffolding errors and finalize assemblies. (E) Final assembly improvements include multiple gap-filling rounds with TGS-GapCloser, QuarTeT GapFiller. (F) Polishing with NextPolish2, producing high-quality pseudo-chromosomes and unplaced scaffolds. Various software pipelines and types of sequencing data are used throughout the assembly process, as indicated by the icons. (G) Assembly polishing and QV assessment using Merqury, Pilon, and short-read alignment tools. (H) Mitochondrial genome assembly using Minimap2 and MitoHiFi pipelines.

Haplotype-aware genome polishing with NextPolish2

To increase the QV values and correct small-variant errors (SNVs/indels), we generated a HiFi mapping file with Winnowmap after counting repetitive k-mers in the reads using Meryl (k = 15). Winnowmap is a version of Minimap2 optimized for read alignment in repetitive regions that leverages a Bloom filter to filter alignments based on k-mer multiplicity35,36. Next, we generated two k-mer databases (21-mer and 31-mer files) from the trimmed short reads using the Yak k-mer analyzer (https://github.com/lh3/yak). The Winnowmap mapping file in BAM format, the target assembly in FASTA format, and the two Yak databases in Yak format were used as inputs for genome polishing using NextPolish237 v0.2.0.

Align-genus and read-homology-based methods for gap-filling and joining of contigs

Homology-based approaches have been used to resolve assembly gaps (i.e., gaps between contigs in pseudochromosomes/scaffolds)38. For this purpose, additional scaffold sets were generated by using Hifiasm (‘-primary’) as a contigger, with variations in the type of read data input (HiFi, ONT, or Hi-C) and haplotype purging options (‘-l0’). Additionally, orthologous sequences were retrieved from NCBI Datasets for closely related species in the family Clariidae (Taxonomy ID: 13012, n = 3 reference sequences): C. fuscus (GCA_030347435.1)39, C. gariepinus (GCA_024256425.2)40, and Channallabes apus (GCA_030522415.1)41, which were last accessed in February 2024. Three rounds of TGS-GapCloser42 v1.2.1, followed by one round of QuarTeT GapFiller, were applied to haplotypes 1 and 2. In the first round, filtered high-quality HiFi reads (Q20, length >10 kb) were used to fill medium-sized gaps (<20,000 bases). In the second round, filtered ONT reads (Q25, length >10 kb) were employed to fill larger gaps (>20,000 bases), whereas in the third round, polished unitigs (.p_utg.fa) from the hifiasm assembly were utilized. The number of gaps was reduced from N = 4015 (haplotype 2) to N = 2501 (haplotype 2), and further to N = 1100 after TGS-GapCloser. Finally, all reference genomes and scaffold sets (p_utg) were used as combined inputs, and the number of gaps was reduced to N = 550 (haplotype 2) by using QuarTeT GapFiller (default parameters: ‘q *.fa’)43 v1.2.1.
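Gap counts such as those reported above are typically obtained by scanning the scaffolds for runs of N. The following Python sketch shows one way to do this and to split gaps at the 20,000-base boundary used in the gap-filling rounds. It is an illustration only, not the detgaps or TGS-GapCloser implementation, and it assumes that the run length of N reflects the estimated gap size, which is not true for assemblers that insert fixed-size placeholders.

```python
import re

_GAP = re.compile(r"[Nn]+")


def find_gaps(sequence, min_len=1):
    """Return (start, end) half-open coordinates of every run of N at least min_len long."""
    return [
        (match.start(), match.end())
        for match in _GAP.finditer(sequence)
        if match.end() - match.start() >= min_len
    ]


def classify_gaps(gaps, threshold=20_000):
    """Split gaps into (medium, large) lists around a length threshold in bases."""
    medium = [gap for gap in gaps if gap[1] - gap[0] < threshold]
    large = [gap for gap in gaps if gap[1] - gap[0] >= threshold]
    return medium, large
```

Summing `len(find_gaps(seq))` over all pseudo-chromosomes before and after each round gives the kind of gap tally tracked in this section.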

Targeted haplotype-aware genome polishing with Pilon

To enhance the accuracy and quality values (QV) of the haplotype assemblies of C. macrocephalus and to polish error-prone regions precisely, k-mer analysis, read alignment, and sequence polishing with Pilon44 v1.24 were conducted in three steps.

Identification of missing reads at assembly error seqmer positions: Positions of erroneous, non-repetitive k-mers found in the genome (i.e., the error seqmers) were identified with Merqury and Meryl, as explained in the T2T-Polish GitHub workflow (“QV estimate with hybrid k-mer db”; https://github.com/arangrhie/T2T-Polish). K-mers were counted using Meryl (‘meryl count k = $k’) for each haplotype assembly, and a combined (hybrid) 21-mer Meryl database was created from the Illumina and HiFi data through union-summing (‘meryl union-sum’). We further filtered k-mers from the hybrid read database, retaining only k-mers with higher multiplicity values, such as greater than 5 or 10 (‘meryl greater-than’), to eliminate low-confidence sequences. Next, the k-mers unique to the reads and absent from the assemblies were isolated using Meryl’s difference operation (‘meryl difference’). These read-only k-mers represent genomic variations that were not captured in the assemblies. They were used to extract the reads containing them from both the Illumina and HiFi datasets using the lookup command (‘meryl-lookup’).

Mapping of reads: ONT reads were filtered using NanoFilt to remove low-quality bases, whereas for Illumina paired-end reads (R1 and R2), low-quality sequences and adapter contamination were removed using Fastp (‘-5 -3 -n 0 -f 5 -F 5 -t 5 -T 5 -q 20’)45 v0.23.4. Next, HiFi reads were aligned to the combined haplotype assemblies using Winnowmap based on k-mers found in repeats (‘-W repetitive_k15.txt’). For ONT reads, we used Minimap2 (‘map-ont’) without secondary alignments (‘-secondary = no’), and the alignments were filtered at MAPQ > 30 (‘-q 30’)46. Illumina short reads were aligned using Bowtie247 in end-to-end sensitive mode (‘-sensitive’) to minimize spurious alignments, with no mixed read pairs (‘-no-mixed’) and no discordant mappings (‘-no-discordant’), to keep only alignments no farther apart than the expected insert size (500 bp + 2 × read length for Illumina data). PCR duplicates were removed using the SAMtools command ‘markdup’.

Running Pilon and calling the consensus: Pilon was used to refine the assemblies by correcting SNPs, indels, and other base-level errors. For each haplotype scaffold, Pilon was run with specific parameters (-genome, -frags, -bam, -targets, -fix all, -vcf, -diploid, -minmq 30, -minqual 30) to incorporate alignments from all data types (HiFi, ONT, and short reads). The output Variant Call Format (VCF) files containing the detected variants were sorted by position using BCFtools (‘bcftools sort’)48, compressed with bgzip, and indexed (‘bcftools index’), and a consensus sequence was generated for each scaffold (‘bcftools consensus -f $genome.fa -H 1’). The overall median quality increased from 41 to approximately 45–47 after the haplotype-aware targeted assembly polishing.
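The logic of the read-only k-mer selection (union-sum, multiplicity filter, difference, lookup) can be sketched with plain Python collections. This is a toy model of the set operations, not the Meryl implementation: it ignores canonical k-mers and holds everything in memory, and the function names and `min_count` parameter are hypothetical.

```python
from collections import Counter


def kmers(sequence, k):
    """Yield every k-mer of a sequence (forward strand only, for simplicity)."""
    return (sequence[i:i + k] for i in range(len(sequence) - k + 1))


def count_kmers(sequences, k):
    """Count k-mers over an iterable of sequences."""
    counts = Counter()
    for sequence in sequences:
        counts.update(kmers(sequence, k))
    return counts


def read_only_kmers(illumina_reads, hifi_reads, assembly, k, min_count):
    """Return reliable k-mers present in the reads but absent from the assembly."""
    # Union-sum: add the Illumina and HiFi multiplicities together.
    hybrid = count_kmers(illumina_reads, k) + count_kmers(hifi_reads, k)
    # Greater-than: drop low-multiplicity k-mers, which are likely sequencing errors.
    reliable = {kmer for kmer, n in hybrid.items() if n > min_count}
    # Difference: keep only k-mers that never occur in the assembly.
    return reliable - set(count_kmers(assembly, k))


def reads_with_kmers(reads, targets, k):
    """Lookup: return the reads that contain at least one target k-mer."""
    return [read for read in reads if any(kmer in targets for kmer in kmers(read, k))]
```

The reads returned by the lookup step are the ones that would then be aligned and passed to the polisher as evidence for the regions where the assembly disagrees with the read data.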

Additional Hi-C scaffolding and sequence integration using QuarTeT

To reintegrate unplaced scaffolds into the pseudo-chromosomes of each haplotype, we used QuarTeT and HapHiC49 v1.0. First, we aligned the unplaced contigs to the reference genomes of C. fuscus (GCA_030347435.1) and C. gariepinus (GCA_024256425.2) using QuarTeT AssemblyMapper with the following parameters: ‘-r $reference -q $contigs -c 50000 -l 2000 -i 90 -a Minimap2’. We identified 53 Mb and 33 Mb of unplaced scaffold sequences in bighead catfish haplotype 1 and haplotype 2, respectively, with strong homology to C. fuscus pseudo-chromosomes. We then filtered haplotype 1 and haplotype 2 with SeqKit to retain the pseudo-chromosomes and concatenated them with the unplaced scaffolds. In preparation for Hi-C scaffolding, Hi-C reads were mapped to each haplotype separately using BWA-MEM (‘-5SP’) after building a BWT index for haplotype 1 and haplotype 2 (‘bwa index $genome.fasta’). Alignments were filtered to remove PCR duplicates and secondary alignments using Samblaster (‘samblaster $BAM | samtools view -@ $threads -S -h -b -F 3340’)50 v0.1.26. Hi-C scaffolding was then performed with HapHiC. Hi-C contact maps were visualized for each haplotype in JBAT and with the haphic plot tool (Figs. 35). Manual post-review was carried out as described previously. Finally, three rounds of TGS-GapCloser followed by one round of targeted Pilon polishing specifying the new targets resulted in a genome of higher global quality, both in terms of QV metrics and structural accuracy (CRE/CSE) as measured by CRAQ. Duplication artefacts observed in heterozygous peaks of the Merqury k-mer spectra were addressed using haplotype-specific k-mer databases. Erroneous k-mers were identified with meryl difference, and the corresponding reads were extracted using meryl lookup. Final consensus sequences were generated using BCFtools with the opposite haplotype reference (-H 1), mitigating most haplotype switch errors.
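The exclusion mask passed to `samtools view -F` is a sum of SAM FLAG bits, so the value 3340 can be unpacked to see exactly which alignments are discarded. The short Python sketch below uses the bit definitions from the SAM specification; the function names are hypothetical.

```python
# FLAG bits as defined in the SAM format specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(value):
    """List the names of all FLAG bits set in an integer value."""
    return [name for bit, name in SAM_FLAGS.items() if value & bit]


def passes_filter(read_flag, exclude=3340):
    """Mimic `samtools view -F exclude`: keep a read only if no excluded bit is set."""
    return (read_flag & exclude) == 0


print(decode_flag(3340))
# ['unmapped', 'mate_unmapped', 'secondary', 'duplicate', 'supplementary']
```

In other words, the filter keeps only primary, non-duplicate alignments in which both mates are mapped, which is the input expected for Hi-C contact counting.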

Fig. 3

Genome assembly status of the bighead catfish in February 2024 and November 2024. (A) Ideograms of haplotype 1 (left) and haplotype 2 (right) generated in February 2024 after manual review with Juicebox. Gaps and telomere lengths on pseudo-chromosomes are shown by orange rectangles and blue triangles, respectively. (B) The genome assemblies generated in November 2024.

Fig. 4

Hi-C contact maps of scaffolded pseudo-chromosomes in the bighead catfish (Clarias macrocephalus). (A) Contact map of the haplotype 1 assembly, visualized in blue. (B) Contact map of the haplotype 2 assembly, visualized in purple. The x-axis and y-axis represent genomic coordinates in megabases (Mb), and each axis is segmented according to individual pseudo-chromosome scaffolds, allowing visualization of chromosomal contact patterns. Color intensity indicates the normalized Hi-C contact values (Knight–Ruiz (KR) matrix balancing); darker colors represent higher interaction frequencies. Bin size = 50 kb.

Fig. 5

The heat maps of Hi-C contact matrix displaying individual pseudo-chromosome scaffolds in haplotype 1, which were sorted by length. The x-axis and y-axis represent genomic coordinates along each pseudo-chromosome. Darker colors along the diagonal indicate higher contact frequencies, reflecting local chromatin interactions and supporting accurate scaffold continuity and orientation.

For base accuracy and phasing error correction, polishing tools based on several methods were employed. For pileup-based polishing, Pilon was used to correct pileup-based errors and single-nucleotide polymorphism (SNP) switch errors that had been identified previously with Meryl based on Merqury positioning of assembly errors (hereafter referred to as “error seqmers”)51. Inspector52 was used to assess assembly quality and, in some cases, to repair structural errors, pileup errors, and base-level errors through its HiFi-read-based correction module. Racon53 with Merfin54 (HiFi) or Clair355 with Merfin (HiFi) effectively handled SNPs and other small nucleotide variants (SNVs), especially under low coverage. Polishing based on Nanopore data was always followed by HiFi-based polishing to maintain high base accuracy. TGS-GapCloser was used to bridge gaps, and CRAQ was applied to break assemblies at read hard-clipping points. JBAT/Hi-C scaffolding was subsequently applied, followed by a manual review to ensure structural accuracy. Each step depended on the outputs of the preceding steps. For instance, JBAT scaffolding necessitated subsequent gap closing (e.g., with TGS-GapCloser), while iterative refinement employed tools such as Racon or the BCFtools consensus command. This iterative refinement was crucial for validating assembly completeness and reducing errors. Non-haplotype-aware tools (e.g., Racon, Clair3) were applied cautiously to minimize errors from parental haplotype switches, while Merfin filtered the polishing edits and BCFtools consensus produced the final polished sequence. For large variants and gap closing based on read alignments, we used Sniffles56 to call large structural variants (SVs), specifically insertions (INS) and deletions (DEL). The final genome assembly was required to have a high mapping quality, with no MAPQ0 regions and an overall QV > 50. Variant calls were validated visually using IGV57, and the overall quality and structural consistency were assessed using Meryl, Asset/detgaps, CRAQ, Inspector, and VerityMap58.

Additional targeted consensus automated polishing (SVs and SNPs)

The consensus polishing was automated using the T2T-Polish GitHub repository (https://github.com/arangrhie/T2T-Polish). A file of repetitive k-mers (k = 15) was generated with Meryl and then used in Winnowmap for HiFi read alignment (‘-MD -W repetitive_k15.txt -ax map-pb’). Alignment filtering was performed using Samtools (‘-Sb’). The tool pb-falconc (‘bam-filter-clipped -t -F 0x104 -output-count-fn’) (https://github.com/PacificBiosciences/pbbioconda) v1.15.0 was used to remove clipped reads. Genome polishing was carried out using the liftover branch of the Racon GitHub repository and Racon v1.5.0 with options (‘-L -S’). After polishing, the k-mers present in the genome (seqmers) were counted using meryl count (k = 21). Merfin was then employed (‘-readmers Illumina.HiFi.gt1.PCRfree.hybrid.meryl -seqmers’) to evaluate the results by comparing the k-mer distributions in the reads and in the polished genome51,54. For consensus generation (i.e., applying polishing edits to the genome assembly), we used BCFtools (‘-H 1’) for two rounds of genome correction. Assembly quality was measured in terms of QV, completeness, and BUSCO scores, and we found a large increase in QV for most chromosomes (min. increase > +1-5 QV points) after ONT Racon and Merfin. The median QV was 50, and the progress over time is presented in Fig. 3.
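For orientation, the k-mer-based QV tracked throughout polishing can be computed from two counts: the k-mers found only in the assembly and the total k-mers in the assembly. The Python sketch below follows the consensus-quality formula published with Merqury; it is a reimplementation for illustration, not the Merqury code, and the function names are hypothetical.

```python
import math


def kmer_qv(asm_only_kmers, total_asm_kmers, k=21):
    """Phred-scaled consensus quality from assembly-only k-mers.

    A base is assumed correct when all k-mers covering it are supported
    by the reads, so P(correct base) = (shared / total) ** (1 / k).
    """
    if asm_only_kmers == 0:
        return math.inf
    p_correct = (1 - asm_only_kmers / total_asm_kmers) ** (1 / k)
    return -10 * math.log10(1 - p_correct)


def qv_to_error_rate(qv):
    """Convert a Phred-scaled QV back to a per-base error probability."""
    return 10 ** (-qv / 10)
```

A QV of 50 corresponds to roughly one erroneous base per 100,000 bases, and a QV of 41 to about one per 12,600, which puts the reported gains from polishing into perspective.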

Organelles: Bighead catfish mitochondrial genome

The mitochondrial genome was assembled by aligning reads to the reference (NC_046749.1) bighead catfish mtDNA sequence that is available at NCBI Nucleotide59. Nanopore reads were mapped to the reference using Minimap2 (‘-ax map-ont -secondary = no’), and Illumina reads were mapped using Minimap2 (‘-ax sr’). PCR duplicates in the short reads were removed from the alignments using Samtools, and the results were visualized using IGV. Pilon (‘-fix all -diploid -changes -vcf -tracks -minmq 10’) was used to correct the reference (NC_046749.1), to call SNPs, SVs, gaps, and local variants, and to obtain the consensus mtDNA sequence. Reads were realigned to the consensus and no additional SNPs were visible in the IGV. Subsequently, mapped reads were filtered with Samtools view (‘-F4 -q 20’) and were re-assembled de novo using Unicycler60 for comparison. The results were visualized in Bandage-ng61 v2022.09. The assembly was polished with Pilon, and the two homologous mitochondria were compared using Minimap2 (‘-eqx -x asm5’). Gene annotations were generated using MitoFinder62 v1.4.2. To ensure that the mitochondrial genome was correct and not dissimilar from other mtDNAs in Siluriformes, we downloaded all reference mtDNA sequences for 209 species of Siluriformes catfish from NCBI Nucleotide (last accessed September 2024). All sequences were obtained from NCBI RefSeq and not from NCBI GenBank. All mtDNA nucleotide sequences (N = 210) were renamed using the Pan-SN naming scheme (https://github.com/pangenome/PanSN-spec), concatenated in a single multi-FASTA, including the bighead catfish mtDNA sequence that was generated after Pilon polishing. All-versus-all alignments were performed with wfmash, and ODGI63 was used to filter the alignment graph. Visualizations were made with Bandage and multiQC64. All tools used in the pipeline were integrated within the larger PGGB (pan-genome graph builder) framework65.

Comparative synteny analysis

A comparative synteny analysis was conducted using orthology and alignment tools to investigate the conservation of chromosome structures and identify syntenic regions between the bighead catfish and other teleost fishes. The proteomes of seven representative teleost species, rainbow trout (Oncorhynchus mykiss), medaka (Oryzias latipes), common carp (Cyprinus carpio), Nile tilapia (Oreochromis niloticus), zebrafish (Danio rerio), barramundi (Lates calcarifer), and spotted gar (Lepisosteus oculatus), were retrieved from Ensembl using the latest genome versions to ensure annotation consistency. Orthologous gene clusters were identified using OrthoFinder v2.5.566, followed by reciprocal best-hit filtering with rbhXpress v1.2.3. Syntenic blocks, which enable cross-species comparisons of genome-wide gene orders, were visualized using macrosyntR v0.2.1967.
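Reciprocal best-hit filtering keeps a gene pair only when each gene is the other's top-scoring match. The generic Python sketch below illustrates that criterion on (query, target, score) tuples. It is not the rbhXpress implementation, and the input format and function names are assumptions made for illustration.

```python
def best_hits(hits):
    """Map each query to its highest-scoring target from (query, target, score) tuples."""
    best = {}
    for query, target, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)
```

The resulting one-to-one pairs are the kind of input from which gene-order comparisons between two chromosome-scale assemblies can be drawn.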

Transposable element annotation

De novo annotation of transposable elements (TEs) was performed using the EDTA pipeline (v2.2.0), which integrates multiple tools to detect and classify diverse TE families, including long terminal repeats (LTRs), terminal inverted repeats (TIRs), helitrons, and other types of repeats68.

Benchmarking of assembly quality metrics

Metrics of continuity, structural accuracy, base accuracy, and functional completeness were used for benchmarking, as described in the Vertebrate Genomes Project (VGP) paper5. The assembly quality metrics are listed in Table 2, scaffold-wise assembly metrics are listed in Table 3, and the methodology for their calculation is presented below.

  1. Continuity and summary statistics: To evaluate the continuity and summary statistics of the assembly, the following measures for scaffolds/contigs (N50, N90, NG50, LG50, and LG90) were computed using RagTag (ragtag.py asmstats -g)69 v2.1.0.

  2. Repeat completeness and continuity of repeats: To measure repeat completeness and continuity as indicators of assembly quality, the percentage of fully assembled LTR retroelements (LTR-RTs) was estimated, and the long terminal repeat (LTR) assembly index (LAI) was calculated using LTR_retriever70 v2.9.00. Telomeres, which are nucleoprotein complexes located at the ends of eukaryotic chromosomes71, have been identified in all vertebrates studied so far. The DNA component of telomeres contains a tandemly repeated G-rich hexanucleotide sequence (TTAGGG/CCCTAA)n72,73. To assess the assembly quality of complex repeats (centromeres), TandemTools and TandemQUAST58 v1.0 were used (results not shown). Prediction of the presence/absence and orientation of telomeres was performed using TIDK, a Telomere Identification Toolkit, implemented in TeloExplorer (‘-m 50 -c animal’), a module of QuarTeT. To estimate gap lengths and their locations in the genome, detgaps, a script from Asset available at GitHub (https://github.com/dfguan/asset), was used.

  3. Structural accuracy (regional and structural errors and reliable blocks): To assess structural accuracy, CRAQ (‘sms_coverage = 6, ngs_coverage = 20, -Minimap2-sensitive’)74 v1.0.9 was used. CRAQ is a method that relies on examining mapped reads, clipped reads, and coverage support from two or more simultaneous sequencing platforms against a reference to identify supporting regions (i.e., the reliable blocks in the VGP paper). CRAQ metrics consist of global AQI (R-AQI and S-AQI), Small Clip-based Regional Errors (CREs), and Large Clip-based Structural Errors (CSEs), which indicate incorrect assembly breakpoints. The mapping results were inspected by loading multiple read-to-genome alignment files in BAM format, along with BED and BigWig genome annotation files, into the Integrative Genomics Viewer (IGV)55 from the command line (‘igv -g $genome.fa $BAM’). To assess cross-species structural correctness, the same references from related catfish species (C. gariepinus and C. fuscus) were used for one-to-one nucleotide-level alignments of orthologous segments using MashMap2 (‘-s 2000000 -pi 90 -c 100000’)75 v3.1.3.

  4. Base accuracy and assembly completeness: To assess base accuracy and completeness, specifically Merqury’s QV and 21-mer genome completeness (%), Merqury50 v1.3 was used. It was run three times using different k-mer databases: Illumina, HiFi, and a hybrid 21-mer database combining Illumina and HiFi reads, using the method described in the “Consensus polishing” section above. To assess functional completeness and evaluate gene set completeness, BUSCO (‘-l actinopterygii_odb10’)76 v5.6.1 was used with the ray-finned fish lineage dataset. Finally, mapping rates and inconsistencies were determined by mapping HiFi reads using Minimap2 (‘-ax map-hifi -secondary = no’) and Winnowmap (‘-W repetitive.15.txt’) at MAPQ > 10, while for ONT reads, read-to-assembly alignments at MAPQ > 10 were generated with the map-ont default for Minimap2 and the map-pb default in Winnowmap. Visualizations were performed using IGV (‘igv -g $genome $BAM(s) $merqury_only_bed_wig_kmer_files’).

  5. Consensus genome assembly with Flye: To validate the haplotype-resolved assembly, we assembled an additional consensus genome with the assembler Flye77 (default parameters and 10 rounds of minimap2 overlap), which collapses haplotypes and generates approximations of genes by merging divergent alleles into single chimeric alleles.
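The continuity statistics listed above (N50, N90, NG50 and their L/LG counterparts) all follow one definition: sort sequences from longest to shortest and report the length at which the running total first reaches a given percentage of a reference size. A minimal Python sketch is shown below. It is not the RagTag implementation, and the function name `nx` is hypothetical.

```python
def nx(lengths, x=50, genome_size=None):
    """Return (Nx, Lx) for a list of sequence lengths.

    With genome_size=None the target is x% of the assembly size (N50, N90).
    With an estimated genome size the result is the NGx/LGx variant.
    Returns (None, None) when the assembly never reaches the target.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    target = total * x / 100
    running = 0
    for rank, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, rank
    return None, None
```

For a chromosome-scale assembly with 27 pseudo-chromosomes per haplotype, the scaffold L50 is bounded by the chromosome number, so contig-level values computed the same way are usually the more informative measure of continuity.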

Table 2 Summary of the haplotype-resolved genome assembly for Clarias macrocephalus.

Data Records

The sequencing datasets and genome assemblies of Clarias macrocephalus have been deposited in multiple public repositories. All raw sequencing data are hosted under NCBI BioProject number PRJNA1132508, with the BioSample accession SAMN41769988 for the diploid bighead catfish isolate CMAM (TaxID: 35657). The sequencing data deposited in the NCBI Sequence Read Archive include: Oxford Nanopore Technologies long-read data (SRR29723575)78; PacBio HiFi sequencing data (SRR29723576)79; Hi-C chromatin conformation data (SRR29723577)80; Illumina paired-end sequencing data from a male sample (SRR29723578)81; and additional Illumina paired-end sequencing data from a female sample (SRR30463128)82. The complete genome assemblies and associated datasets are available through Zenodo (https://doi.org/10.5281/zenodo.14826875)83. The assembled haplotype-resolved genome sequences were deposited in GenBank as whole-genome shotgun sequencing projects, with haplotype 1 available under accession JBLWMO000000000 (ref. 84) and haplotype 2 under accession JBLWMP000000000 (ref. 85).

Technical Validation

Technical validation of bighead catfish genome

The haplotype-resolved genome assembly of the bighead catfish was validated using multiple complementary approaches. The assembly produced 27 Hi-C scaffolds per haplotype with a total size of ~880 Mb per haplotype, consistent with the haploid chromosome number of the diploid karyotype (2n = 54), thereby confirming chromosome-scale structural accuracy. Structural integrity was further supported by visual inspection of Hi-C contact maps (Figs. 4, 5, 6).

Fig. 6

Heat maps of the Hi-C contact matrix for the individual pseudo-chromosome scaffolds of haplotype 2. The x-axis and y-axis represent genomic coordinates along each pseudo-chromosome. Darker colors along the diagonal indicate higher contact frequencies, supporting the accuracy of scaffold assembly and orientation.

Pairwise mapping between phased haplotypes, performed with Minimap2 in asm5 mode, revealed 1,968,666 heterozygous single-nucleotide polymorphisms, corresponding to a heterozygosity rate of 0.594%. Structural variant analysis with PlotSR suite86 identified more than 390,000 insertions and deletions spanning ~7.7 Mb in each haplotype. Additional copy number variants were detected, including 114 copy gains (184 kb) and 123 copy losses (416 kb), distributed across highly divergent genomic regions spanning ~57.9 Mb in haplotype 1 and ~56.0 Mb in haplotype 2.
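The heterozygosity rate above is a ratio of SNP count to the number of bases compared between the two haplotypes. The short Python sketch below makes that arithmetic explicit, assuming the rate is defined as SNPs divided by compared bases; the denominator is not stated in this section, so the helper names and that definition are assumptions rather than the authors' documented procedure.

```python
def heterozygosity_percent(snp_count: int, compared_bases: int) -> float:
    """Heterozygosity (%) = 100 * SNPs / bases compared between haplotypes."""
    if compared_bases <= 0:
        raise ValueError("compared_bases must be positive")
    return 100.0 * snp_count / compared_bases


def implied_compared_bases(snp_count: int, rate_percent: float) -> float:
    """Back out the denominator implied by a reported SNP count and rate."""
    if rate_percent <= 0:
        raise ValueError("rate_percent must be positive")
    return snp_count / (rate_percent / 100.0)


# Values reported in the text: 1,968,666 SNPs and a rate of 0.594%.
reported_snps = 1_968_666
reported_rate = 0.594
span = implied_compared_bases(reported_snps, reported_rate)
print(f"Implied compared span: {span / 1e6:.1f} Mb")
```

The second helper is only a consistency check: it shows which denominator a reported SNP count and rate imply, which can then be compared with the aligned length reported by the alignment tool.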

Assembly quality was high, with a scaffold N50 of 35.4 Mb, 95.5% completeness based on Benchmarking Universal Single-Copy Orthologs (BUSCO), and an overall consensus quality value (QV) of 50. RagTag asmstats reported a contig NG50 of 3 Mb and a scaffold N50 of 34 Mb, with LG50 and LG90 values of 11 and 24, respectively (Table 2).
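For readers unfamiliar with the contiguity metrics above, the following minimal Python sketch shows how N50/L50-type values are conventionally computed from a list of sequence lengths. When an estimated genome size is supplied, the same routine yields the NG50/LG50-type values. The function name and the toy lengths are hypothetical; this is not the RagTag implementation.

```python
from typing import Iterable, Optional, Tuple


def nx_stats(
    lengths: Iterable[int],
    fraction: float = 0.5,
    genome_size: Optional[int] = None,
) -> Tuple[Optional[int], Optional[int]]:
    """Return (Nx, Lx) for a set of contig or scaffold lengths.

    Nx is the length of the shortest sequence in the smallest set of
    longest sequences whose lengths sum to at least `fraction` of the
    reference total; Lx is the number of sequences in that set.
    The reference total is the assembly size (N50, L50) unless
    `genome_size` is given (NG50, LG50).
    """
    lengths = sorted(lengths, reverse=True)
    reference_total = genome_size if genome_size is not None else sum(lengths)
    target = fraction * reference_total
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= target:
            return length, count
    # The assembly is too short to reach the target of the genome size.
    return None, None


# Toy lengths, for illustration only.
toy = [50, 40, 30, 20, 10]
print(nx_stats(toy))                   # N50, L50
print(nx_stats(toy, fraction=0.9))     # N90, L90
print(nx_stats(toy, genome_size=200))  # NG50, LG50
```

Because N50 is normalized by assembly size and NG50 by genome size, the two can differ for the same assembly, which is one reason contiguity values from different tools are not always directly comparable.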

A k-mer-based quality assessment using Merqury showed a substantial increase in consensus accuracy, yielding a final median QV > 46, which exceeds the Vertebrate Genomes Project (VGP) standard of QV > 40. Merqury grey peaks indicated the presence of read-only k-mers, supporting the conclusion that some biologically or technically complex regions were not fully recovered. A minor peak at 3× multiplicity was observed, suggesting low-level false duplications; these could be resolved in future versions using tools such as PurgeDups87 or Purge Haplotigs88.
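QV is a Phred-scaled value, so the thresholds quoted above translate directly into per-base error rates. The following Python sketch shows the standard conversion; the function names are illustrative only.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Per-base error probability for a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def qv_to_accuracy(qv: float) -> float:
    """Per-base accuracy (fraction of correct bases) for a given QV."""
    return 1 - qv_to_error_rate(qv)


def error_rate_to_qv(error_rate: float) -> float:
    """Phred-scaled QV for a per-base error probability."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


for qv in (40, 46, 50):
    bases_per_error = 1 / qv_to_error_rate(qv)
    print(f"QV {qv}: accuracy {qv_to_accuracy(qv):.5%}, "
          f"about one error per {bases_per_error:,.0f} bases")
```

On this scale, each increase of 10 in QV corresponds to a tenfold reduction in the expected error rate, so QV 40 corresponds to roughly one error per 10,000 bases and QV 50 to one per 100,000.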

Genome completeness was additionally supported by BUSCO analysis with the actinopterygii_odb10 dataset. For comparison, a collapsed assembly generated with Flye achieved high BUSCO scores (C:97.7%, S:11.2%, D:86.5%, F:0.6%, M:1.7%, n = 3640), with 3,557 complete BUSCOs (408 single-copy and 3,149 duplicated). This collapsed assembly spanned 1.84 Gb across 4,961 contigs and scaffolds with an N50 of 1 Mb. It contained ~2,480 gaps (estimated as half the contig count) but reported 0% gap content by length. Base-level consensus quality was QV 40–50, corresponding to 99.99–99.999% accuracy. Although the assembly from Flye is inherently collapsed, it remains useful for estimating total DNA content and sample-specific gene content: a small fraction of missing orthologs in the Flye assembly likely reflects unresolved complex regions or conservative parameters (Fig. 7).
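The BUSCO percentages quoted for the Flye assembly follow directly from the reported counts. The Python sketch below reproduces that arithmetic from the three values given in the text (408 single-copy, 3,149 duplicated, n = 3640). Fragmented and missing BUSCOs are reported only as percentages, so they are combined here into a single remainder. The function name is illustrative.

```python
def busco_percentages(single_copy: int, duplicated: int, total: int) -> dict:
    """Summarize BUSCO categories as percentages of the lineage dataset size."""
    if total <= 0:
        raise ValueError("total must be positive")
    complete = single_copy + duplicated
    if complete > total:
        raise ValueError("complete BUSCOs cannot exceed the dataset size")

    def pct(n: int) -> float:
        return round(100 * n / total, 1)

    return {
        "C": pct(complete),          # complete (single-copy + duplicated)
        "S": pct(single_copy),       # complete, single-copy
        "D": pct(duplicated),        # complete, duplicated
        "F+M": pct(total - complete) # fragmented plus missing, combined
    }


# Counts reported in the text for the collapsed Flye assembly.
flye = busco_percentages(single_copy=408, duplicated=3_149, total=3_640)
print(flye)
```

The high duplicated fraction is consistent with the 1.84 Gb span of this assembly, which is about twice the ~880 Mb haplotype size, meaning that many genes are represented by both alleles.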

Fig. 7

Quality assessment of the genome assembly of bighead catfish. (A) K-mer spectra generated using Merqury. This spectra-asm plot shows k-mer contents for read-only (black), haplotype 1-specific (red), haplotype 2-specific (blue), and shared (green) regions. The x-axis shows k-mer multiplicity, and the y-axis shows the count of k-mers. (B) BUSCO completeness assessment for three assemblies: the two haplotype-resolved assemblies and a collapsed assembly created using the Flye assembler (positive control). The x-axis shows the percentage of BUSCOs for four categories: complete single-copy (blue), complete duplicated (light blue), fragmented (yellow), and missing (red) genes. (C) K-mer spectra-cn plot displaying copy number multiplicity of k-mers in the diploid genome. The x-axis shows k-mer multiplicity, and the y-axis shows the count of k-mers. Most k-mers fall into the expected diploid peak (copy number = 2, blue). A minor peak at copy number 3 suggests a small degree of homozygous duplication, which may be further reduced by purging redundant haplotypes or trimming contig ends.

Approximately 35.25% of the genome was predicted to consist of transposable elements or other repetitive sequences. Among the identified classes, terminal inverted repeat (TIR) DNA transposons were the most abundant (19.12%), followed by LTR retrotransposons (8.30%) and Helitrons (4.47%). LINE elements were present at low abundance (0.46%), and unclassified repeats contributed 2.82% (Table 4).

Table 3 Scaffold-level assembly statistics for C. macrocephalus.
Table 4 Transposable elements in haplotype 2 of Clarias macrocephalus.

The synteny plots showed strong chromosomal collinearity between bighead catfish, common carp, and Nile tilapia, which share a common ancestor among otophysan teleosts. By contrast, more fragmented or rearranged syntenic signals were observed in zebrafish and spotted gar, suggesting a greater evolutionary divergence (Fig. 8).

Fig. 8

Synteny analysis of chromosome-level assemblies for various catfish samples. (A) Whole-genome four-way synteny analysis with 1,413 single-copy orthologs shared across Clarias gariepinus (GCA_024256425.2), Clarias macrocephalus (database number), Danio rerio (GRCz11), and Oreochromis niloticus (O_niloticus_UMD_NMBU) reveals conserved chromosome structures and chromosomal rearrangements. (B) Simple phylogeny and geological timescales of C. macrocephalus and related teleosts examined using TimeTree version 3.

The current assembly contains at most 358 unresolved gaps, substantially fewer than previous efforts12, thus providing a more complete view of the bighead catfish genome. The comparative analysis with related Clarias genomes revealed conserved mtDNA with homology to other catfish species (Fig. 9).

Fig. 9

All-against-all pairwise comparisons of mitochondrial DNA sequence alignments in Siluriformes catfishes. The assembled mtDNA sequence of the bighead catfish is represented on the last row (blue color).

Together, these validations demonstrate the high accuracy, biological completeness, and diploid-resolved nature of the bighead catfish genome assembly, supporting its reliability as a reference for future genomic, evolutionary, and applied aquaculture studies.

Usage Notes

The primary genome assembly is well-suited for comparative genomic studies, particularly those involving synteny analysis, due to its high gene set completeness. For studies focused on genetic variation, including single-nucleotide variants or structural variation, either haplotype 1 or haplotype 2 can be employed, depending on the research objective.