Background & Summary

The cyprinid genus Acrossocheilus Oshima, 1919 comprises 26 valid species. These small- to medium-sized barbines are principally characterized by a medially interrupted lower lip with two thick lateral lobes, which are anteriorly separated from the lower jaw by a distinct groove running the entire length of the jaw1. The genus is widely distributed across East and Southeast Asia, including Laos, Vietnam, Taiwan, Hainan, and mainland China2. Acrossocheilus longipinnis, an endemic species of mainland China currently known only from the Pearl River basin, has an elongated, laterally compressed body covered in dense scales with a prominent lateral line. Its silver-gray base coloration is adorned with five distinct pale yellow vertical bars. A key morphological trait in males is the elongation of the last branched ray and first unbranched ray of the dorsal fin into filamentous projections. The species is valued in the ornamental fish trade for its unique morphology and striking coloration, and recent fishery resource assessments indicate that its wild populations have declined significantly. This decline is attributed to multiple anthropogenic threats, including cascading hydropower dam construction, extensive sand mining, overfishing, environmental pollution, and the introduction of invasive fish species. Consequently, A. longipinnis has been classified as Vulnerable on the IUCN Red List.

Molecular research on A. longipinnis remains limited. To date, only its mitochondrial genome has been sequenced3. A reference genome assembly for this species is still lacking, which significantly hinders progress in understanding its biology, advancing genetic breeding programs, and developing desirable aquacultural traits. Recent advances in DNA sequencing technologies, however, offer new opportunities for genomic research. Notably, the Circular Consensus Sequencing (CCS) mode of Pacific Biosciences (PacBio) provides long reads (10–20 kb) with high accuracy (>99%), greatly facilitating de novo assembly of both plant and animal genomes4,5. According to the overview by Li and Durbin6, high-fidelity (HiFi) sequencing enables near-telomere-to-telomere assemblies by resolving repetitive regions and segmental duplications that are challenging for short-read approaches. Similarly, Wang et al.7 emphasize the applications of HiFi sequencing in complex genomic regions, such as centromeres and ribosomal DNA arrays, and its superiority in variant detection and phasing compared to other long-read platforms such as Oxford Nanopore Technologies. When integrated with complementary approaches such as chromosomal conformation capture (Hi-C) sequencing, these technologies enable the generation of highly contiguous, chromosome-level genome assemblies. Such integrated approaches have already been successfully applied to another Acrossocheilus species, Acrossocheilus fasciatus, demonstrating their utility in resolving genomic architecture within this genus8.

Here, we assembled a high-quality genome of A. longipinnis by combining short sequencing reads, PacBio HiFi long reads, and Hi-C sequencing data. The final longfin barb genome assembly had a total length of 936.04 Mb, with 99.06% (927.20 Mb) of the sequences successfully anchored to 25 chromosomes. The assembly demonstrated high continuity (contig N50 = 36.09 Mb) and completeness (BUSCO = 98.76%), supported by quality metrics including a QV value of 54.46, a GCI score of 29.76, and a CRAQ value of 96.40. Subsequent annotation identified 24,718 protein-coding genes and 553.06 Mb of repetitive sequences. This high-quality genome assembly not only facilitates population genetic research and evolutionary analyses of A. longipinnis but also provides valuable resources for optimizing genetic breeding efforts.

Methods

Sampling, DNA and RNA extraction

This study was carried out according to the recommendations for the care and use of animals for scientific purposes set up by the Animal Care and Use Committee of the Chinese Academy of Fishery Sciences (ACUC-CAFS). Samples of A. longipinnis were collected from Hechi City, Guangxi Zhuang Autonomous Region, China (coordinates: 107°33′–108°13′ E, 24°22′–24°55′ N). Tissue samples were promptly collected, snap-frozen in liquid nitrogen, and then stored at −80 °C. DNA and RNA extraction, library construction, and sequencing in this study were performed using standard experimental and analytical protocols provided by NextOmics Biosciences (Wuhan, China).

Long read DNA preparation and sequencing

A total of 8 μg of high-quality genomic DNA was extracted from muscle tissue using a Qiagen DNeasy Blood and Tissue Kit (Qiagen, USA) according to the manufacturer’s instructions. The quality and concentration of the extracted DNA were assessed using a NanoDrop One spectrophotometer (Thermo Scientific, USA) and 1% agarose gel electrophoresis. PacBio long-insert libraries were prepared using the SMRTbell Express Template Prep Kit 2.0 according to the manufacturer’s instructions, with an insert size of approximately 20 kb. The libraries were sequenced on the PacBio Revio system in CCS mode. Subreads were processed with SMRTLink (v11.1.0)9 using the parameters “--minPasses 3 --minPredictedAccuracy 0.99 --minLength 500”, producing approximately 114.37 Gb of HiFi reads with an N50 of 16,728 bp (Table 1). Setting “minPredictedAccuracy” to 0.99 means that only reads with a predicted accuracy of 99% or higher were retained for further analysis.
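The three read-level thresholds and the read N50 reported above can be illustrated with a minimal Python sketch. It is not the SMRTLink implementation: the `CcsRead` record and the function names are hypothetical, and only the threshold values (3 passes, 0.99 predicted accuracy, 500 bp) come from the text.

```python
from dataclasses import dataclass


@dataclass
class CcsRead:
    """Hypothetical per-read summary; field names are illustrative only."""
    name: str
    length: int                # consensus read length in bp
    passes: int                # number of full passes around the SMRTbell
    predicted_accuracy: float  # predicted consensus accuracy, 0..1


def passes_filters(read, min_passes=3, min_accuracy=0.99, min_length=500):
    """Return True if a read meets all three thresholds quoted in the text."""
    return (
        read.passes >= min_passes
        and read.predicted_accuracy >= min_accuracy
        and read.length >= min_length
    )


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input
```

Applying `n50` to the lengths of the reads that satisfy `passes_filters` gives the read N50 in the sense used for Table 1.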

Table 1 Summary of DNA sequencing data of A. longipinnis genome.

Short read DNA preparation and sequencing

The extracted DNA (~5 μg) was randomly sheared into approximately 350 bp fragments, and a short fragment library was constructed using the MGIEasy Universal DNA Library Prep Set (MGI, China). Sequencing was conducted on the MGISEQ T7 platform (MGI, China), resulting in a total of 56.50 Gb of short sequencing reads, each 150 bp in length (Table 1).

Hi-C DNA library preparation and sequencing

A Hi-C library was generated using the DpnII restriction enzyme (GrandOmics, China). Muscle tissue samples were treated with 1% formaldehyde at room temperature for 10–30 minutes to crosslink chromatin-interacting proteins. Subsequently, the DNA was digested with the restriction enzyme, and the 5′ overhangs were repaired with a biotinylated residue. A paired-end library with insert sizes of approximately 300 bp was prepared and then sequenced on the MGISEQ T7 platform (MGI, China). A total of 127.92 Gb of clean data was obtained from 129.09 Gb of sequencing data using the software fastp (v0.19.5)10 with parameters “-w 16 --length_required 150” (Table 1).

RNA library preparation and sequencing

For RNA sequencing, total RNA was extracted from muscle, heart, liver, spleen, gill, kidney, skin, and fin tissues using the TRIzol reagent (Invitrogen, USA) following the manufacturer’s protocol. The purity of the mixed total RNA was assessed with a NanoPhotometer spectrophotometer (IMPLEN, CA, USA), while RNA concentration was quantified using the Qubit RNA Assay Kit with a Qubit 2.0 Fluorometer (Life Technologies, CA, USA). RNA-seq libraries were prepared using the TruSeq Stranded mRNA Library Prep Kit (Illumina, USA) according to the manufacturer’s instructions. Sequencing was performed on an MGISEQ T7 platform (MGI, China), generating 150 bp paired-end reads.

Genome size estimation

The genome size of A. longipinnis was estimated through k-mer profiling. Raw short sequencing reads first underwent quality control using fastp (v0.19.5)10. K-mer analysis (K = 21) of the quality-filtered short reads was then performed with findGSE (v1.94.R)11, which estimated the genome size at 961,326,620 bp (Fig. 1).
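The idea behind k-mer-based genome size estimation can be sketched in a few lines of Python. This is a deliberately simplified peak-based estimate, not the skew-normal fitting that findGSE performs; the function name and the toy histogram are hypothetical. It assumes a histogram with one dominant homozygous peak and treats everything left of the first valley as sequencing-error k-mers.

```python
def estimate_genome_size(hist):
    """Estimate genome size from a k-mer histogram {depth: number_of_distinct_kmers}.

    Simplified approach (an assumption, not findGSE's model):
      1. find the first valley, which separates error k-mers from genomic k-mers;
      2. take the most frequent depth beyond the valley as the coverage peak;
      3. divide the total number of k-mer occurrences beyond the valley by that peak.
    """
    depths = sorted(hist)
    valley = None
    for prev, cur in zip(depths, depths[1:]):
        if hist[cur] > hist[prev]:  # counts start rising again
            valley = prev
            break
    if valley is None:
        raise ValueError("no valley found; histogram has no coverage peak")

    solid = {d: c for d, c in hist.items() if d >= valley}
    peak_depth = max(solid, key=solid.get)
    total_kmers = sum(d * c for d, c in solid.items())
    return round(total_kmers / peak_depth)
```

In a heterozygous genome the highest peak may be the heterozygous one, in which case this shortcut would overestimate the size; model-based tools are designed to handle that case.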

Fig. 1

K-mer frequency distribution estimated. The observed K-mer (raw K-mer) frequencies (in grey), fitted K-mer frequencies (in blue) with skew normal distribution model, and overall fitting (in red) that concatenated observed and fitted K-mer frequencies.

De novo assembly and Hi-C assembly

Primary contigs were assembled from HiFi reads using Hifiasm (v0.25.0)12 with the parameters “-t 100 --n-hap 2 --telo-m TTAGGG hifi.fa”. Genome base errors (single-nucleotide variants and small indels) were corrected using NextPolish (v1.4.1)13, integrating both HiFi reads and quality-filtered short reads. This yielded 132 contigs spanning 936.78 Mb with an N50 of 33.36 Mb. For chromosomal anchoring, BWA (v0.7.12)14 was used to align the Hi-C clean data to the assembled contigs. Low-quality reads were filtered using the HiC-Pro pipeline15 with default parameters. The remaining valid reads were used to anchor contigs to chromosomes with Juicer16 and the 3D-DNA pipeline17, followed by manual correction with Juicebox (v2.13.07)18. In the 3D-DNA pipeline, a default gap of 500 bp was inserted between consecutive sequences. Next, we applied the LR_Gapcloser19 program to close the gaps in the assembly. To further improve base accuracy, the assembly was polished with NextPolish2 (v0.2.0)20 using HiFi reads and quality-filtered short reads. Ultimately, 99.06% of contig sequences were anchored to 25 pseudochromosomes, with only two gaps remaining (one each in pseudochromosomes 5 and 20) (Table 2 and Fig. 2). The sizes of these two gaps were 3 bp and 151 bp, respectively. The longest and shortest pseudochromosomes measured 56.97 Mb and 28.75 Mb, respectively (Table 3). The final assembly totaled 936.04 Mb with a contig N50 of 36.09 Mb (Table 2 and Fig. 3).
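The gap counts and the anchoring rate quoted above can be reproduced from a scaffolded assembly with a short Python sketch. It assumes that gaps are represented as runs of N characters, which is the common convention; the function names are hypothetical and this is not part of the pipelines cited in the text.

```python
import re

# A gap is modelled here as any run of N/n characters inside a sequence.
GAP_PATTERN = re.compile(r"[Nn]+")


def split_contigs(scaffold):
    """Split a scaffold sequence into its gap-free contigs."""
    return [piece for piece in GAP_PATTERN.split(scaffold) if piece]


def gap_lengths(scaffold):
    """Return the length of every N-run (leading or trailing runs included)."""
    return [len(match.group()) for match in GAP_PATTERN.finditer(scaffold)]


def anchored_percent(anchored_bp, total_bp):
    """Percentage of assembled bases placed on pseudochromosomes."""
    return 100.0 * anchored_bp / total_bp
```

With the values from the text, `anchored_percent(927.20, 936.04)` rounds to 99.06, matching the reported anchoring rate.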

Table 2 Summary statistics of A. longipinnis assembly.
Fig. 2

Hi-C chromosome interaction heat map of the assembly. The abscissa and ordinate represent the order of each bin on the corresponding chromosome group. The colour scale indicates the intensity of interaction from white (low) to red (high).

Table 3 Pseudo-chromosome length statistics after Hi-C assisted assembly.
Fig. 3

Snail plot showing the features of the assembled A. longipinnis genome. The main plot is divided into 1,000 size-ordered bins around the circumference with each bin representing 0.1% of the 936,040,231 bp assembly. The distribution of chromosome lengths is shown in dark grey with the plot radius scaled to the longest chromosome present in the assembly. Orange and pale-orange arcs show the N50 and N90 chromosome lengths (36,094,363 and 29,100,020 bp), respectively. The pale grey spiral shows the cumulative chromosome count on a log scale with white scale lines showing successive orders of magnitude. The blue and pale-blue area around the outside of the plot shows the distribution of GC, AT and N percentages in the same bins as the inner plot.

Repetitive sequence annotation

Repeat elements in the A. longipinnis genome were annotated using a combination of homology-based alignment and de novo prediction. Homology-based searches against the RepBase database (http://www.girinst.org/repbase/)21 were performed with RepeatMasker (v4.0.7)22 and Proteinmask to identify known repeat elements. For de novo annotation, we first employed LTR_FINDER (v1.06)23 and RepeatModeler (v1.0.4)24 to build a de novo repeat library, which was then used to predict repeat elements with RepeatMasker (v4.0.7)22 under default parameters. Additionally, Tandem Repeats Finder (v4.10.0)25 was used to identify tandem repeats with default parameters. In total, 553.06 Mb (~59.09%) of the assembly was annotated as repetitive sequence. Among the interspersed repeats, long terminal repeats were the most prevalent type, accounting for 32.67% of the genome (Table 4).
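When annotations from several repeat-finding tools are combined, the same bases are often reported more than once, so a genome-wide repeat fraction is normally computed on merged, non-overlapping intervals. The source does not describe how this was done; the following Python sketch shows one common approach under that assumption, with hypothetical function names and half-open (start, end) coordinates on a single sequence.

```python
def merged_length(intervals):
    """Total bases covered by a set of half-open (start, end) intervals,
    counting overlapping regions only once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Disjoint from the current block: close it and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def repeat_percent(repeat_bp, assembly_bp):
    """Repeat content as a percentage of the assembly length."""
    return 100.0 * repeat_bp / assembly_bp
```

Using the totals reported in the text, `repeat_percent(553.06, 936.04)` rounds to 59.09.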

Table 4 Statistics of interspersed repetitive sequences in A. longipinnis assembly.

Gene prediction and functional annotation

Gene prediction was performed using a multifaceted approach incorporating transcriptome-based, homology-based, and ab initio methods. For the transcriptome-based prediction, a total of 8.73 Gb of RNA-seq clean reads were aligned to the A. longipinnis assembly using Hisat2 (v2.2.1)26 (Table 5). Stringtie (v1.2.2)27 was then used to assemble transcripts from the alignment results. In addition, the RNA-seq data were de novo assembled with Trinity (v2.15.2)28 using the parameters “--seqType fq --max_memory 200G --min_kmer_cov 2 --min_glue 2 --CPU 60 --min_contig_length 200”. The assembled transcripts were then aligned against the A. longipinnis assembly using the Program to Assemble Spliced Alignments (PASA; v2.4.1)29. For homology-based prediction, we used miniprot (v0.11) to align the protein sequences of seven vertebrate species to the assembly: A. fasciatus8, Ctenopharyngodon idella30, Cyprinus carpio31, Poropuntius huangchuchieni32, Onychostoma macrolepis (GCF_012432095.1), Danio rerio (GCF_049306965.1), and Homo sapiens (GCF_009914755.1). For ab initio prediction, 2,000 high-quality genes from PASA were randomly selected as the training set for AUGUSTUS (v3.2.3)33, which was then employed to predict coding regions in the repeat-masked genome. Fgenesh (v2.4.5)34 was also used for ab initio prediction. Finally, all gene models were integrated using EvidenceModeler (v2.1.0)35. The final comprehensive gene set comprised 24,718 genes (Table 6), with an average of 10.44 exons per gene, an average exon length of 170.64 bp, and an average coding sequence (CDS) length of 1781.09 bp.
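Summary statistics such as exons per gene, mean exon length, and mean CDS length are typically derived from the final annotation file. The Python sketch below shows one way to compute them from GFF3 lines; it is an illustration rather than the authors' script, the function name is hypothetical, and it assumes one transcript per gene with `exon` and `CDS` features linked to the transcript through a `Parent=` attribute.

```python
from collections import defaultdict


def gene_structure_stats(gff_lines):
    """Compute mean exons per transcript, mean exon length, and mean CDS length
    from an iterable of GFF3 lines (1-based, inclusive coordinates)."""
    exon_lengths = defaultdict(list)  # transcript ID -> list of exon lengths
    cds_lengths = defaultdict(int)    # transcript ID -> summed CDS length

    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] not in ("exon", "CDS"):
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        parent = attrs.get("Parent")
        if parent is None:
            continue
        length = int(cols[4]) - int(cols[3]) + 1
        if cols[2] == "exon":
            exon_lengths[parent].append(length)
        else:
            cds_lengths[parent] += length

    if not exon_lengths or not cds_lengths:
        raise ValueError("no exon or CDS features with a Parent attribute found")

    all_exons = [n for lengths in exon_lengths.values() for n in lengths]
    return {
        "exons_per_transcript": len(all_exons) / len(exon_lengths),
        "mean_exon_length": sum(all_exons) / len(all_exons),
        "mean_cds_length": sum(cds_lengths.values()) / len(cds_lengths),
    }
```

If a gene carries several isoforms, a representative transcript would have to be chosen first, since this sketch averages over transcripts rather than genes.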

Table 5 Summary of RNAseq sequencing data of A. longipinnis genome.
Table 6 Statistics of functional annotation result.

After gene prediction, the final gene set was functionally annotated by searching against several databases. Briefly, amino-acid sequences were aligned to SwissProt36, Kyoto Encyclopedia of Genes and Genomes (KEGG)37, and the NCBI non-redundant database (NR) using DIAMOND (v2.1.10)38 with an E-value cutoff of 1e-05. Protein domains were identified using InterProScan (v5.30)39, and Gene Ontology (GO) terms for each gene were also extracted from the InterProScan results. Overall, 24,228 (98.02%) of the predicted protein-coding genes were functionally annotated (Table 6).
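The E-value cutoff described above can be applied to alignment output with a small Python sketch. It assumes the standard 12-column BLAST-style tabular format (query, subject, identity, length, mismatches, gap opens, query start and end, subject start and end, E-value, bit score); the function names are hypothetical and the choice of the highest bit score as the "best" hit is an assumption, not something stated in the text.

```python
def best_hits(tabular_lines, max_evalue=1e-5):
    """Keep, for each query, the highest-scoring hit whose E-value passes the cutoff.

    Returns {query_id: (subject_id, evalue, bitscore)}.
    """
    best = {}
    for line in tabular_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # skip malformed or header lines
        query, subject = cols[0], cols[1]
        evalue, bitscore = float(cols[10]), float(cols[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


def annotated_percent(annotated_genes, total_genes):
    """Share of predicted genes with at least one functional annotation."""
    return 100.0 * annotated_genes / total_genes
```

With the counts from the text, `annotated_percent(24228, 24718)` rounds to 98.02.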

Ethical approval

The study did not involve any wild animals. All experimental procedures involving fish were conducted in strict compliance with the Guide for the Hongshui River Rare Fish Conservation Center to minimize animal suffering and ensure animal welfare.

Data Records

The raw sequencing data have been deposited into the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) database with accession number SRP60447140 under BioProject number PRJNA1297891. Additionally, the genome assembly and annotation are available at the Figshare dataset41.

Technical Validation

Genome assembly and gene prediction quality assessment

We employed a multi-faceted approach to rigorously evaluate the accuracy and completeness of the A. longipinnis genome assembly. First, we used Merqury (v1.3)42 with a combination of HiFi long reads and short reads, setting the K-mer size to 19, to calculate the consensus QV. The analysis yielded a QV of 54.46, indicating a high level of accuracy in the assembled genome sequence (Table 2). Subsequently, we aligned the HiFi reads and quality-filtered short reads to the assembly using minimap2 (v2.24-r1122)43 and BWA (v0.7.12)14, respectively. Mapping rates were very high, with 99.99% of the HiFi reads and 99.85% of the short sequencing reads successfully mapped to the genome (Table 2). Centromeric regions were predicted following the method described in the recent telomere-to-telomere genome study of Cyprinus carpio31. The centromeric regions displayed the canonical features of centromeres: high repetitive sequence content, low gene density, and low HiFi read coverage depth, consistent with previous reports31,44 (Fig. 4). Additionally, both assembly gaps were located within highly repetitive regions, one of which lay within a centromere. The HiFi read coverage in the regions flanking these gaps was notably lower than the genome-wide average. Clipping information for revealing assembly quality (CRAQ, v1.10)45 was used to assess the accuracy of our genome assembly based on PacBio HiFi and quality-filtered short reads, resulting in an S-AQI of 96.40, confirming high assembly quality. In addition, genome continuity inspector (GCI, v1.0)46 yielded a value of 29.76, which was comparable to that of the chicken complete genome47. To assess genome completeness, we performed an analysis with Benchmarking Universal Single-Copy Orthologs (BUSCO) (v5.5.0)48 using the actinopterygii_odb10 database. The results showed that 98.76% of the BUSCO genes were complete, including 97.53% single-copy and 1.24% duplicated orthologs, while only 0.93% of the genes were fragmented (Fig. 5). Furthermore, BUSCO analysis of the genome annotation revealed that 97.14% of the BUSCOs were complete, consisting of 95.11% single-copy and 2.03% duplicated genes (Fig. 5). Collectively, these evaluation metrics strongly suggest that the A. longipinnis genome assembly has achieved a high standard of quality, providing a reliable resource for subsequent genetic and biological studies.
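The consensus QV is a Phred-scaled error rate, so it can be converted to an expected per-base error probability with a short Python sketch. The conversion itself is standard; the k-mer-based function follows the formulation described for Merqury as we understand it (per-base error derived from the fraction of assembly k-mers also found in the reads) and should be read as an assumption rather than as Merqury's code. All function names are hypothetical.

```python
import math


def qv_to_error_rate(qv):
    """Phred-scaled quality value -> per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Per-base error probability -> Phred-scaled quality value."""
    if error_rate <= 0:
        return float("inf")
    return -10 * math.log10(error_rate)


def kmer_based_qv(shared_kmers, total_kmers, k):
    """QV from k-mer support: a base is assumed correct with probability
    (shared / total) ** (1 / k), where 'shared' counts assembly k-mers
    that are also present in the read set."""
    error_rate = 1 - (shared_kmers / total_kmers) ** (1 / k)
    return error_rate_to_qv(error_rate)
```

Under this conversion, a QV of 54.46 corresponds to a per-base error rate of roughly 3.6 × 10^-6, that is, on the order of a few errors per megabase.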

Fig. 4

Characterization of centromeric regions and gap locations visualized by a circos plot. From inside to outside: Gene density in 1 Mb sliding windows; Percentage of repetitive sequence in 1 Mb sliding windows; Centromere density in 1 Mb sliding windows; Gap locations; HiFi reads coverage depth; The length of pseudochromosome in the size of Mb.

Fig. 5

BUSCO assessments of A. longipinnis genome and gene sets.