Abstract
Homatula variegata is a small benthic loach from the upper Yangtze and adjacent basins that has aquaculture and ornamental value but lacks a reference genome. We present a near telomere-to-telomere (T2T) chromosome-level assembly built from PacBio HiFi, Oxford Nanopore ultra-long, and Illumina short reads together with Hi-C data. The 641.26-Mb genome resolves 24 chromosomes as single contigs (contig N50, 24.40 Mb). Hi-C confirms chromosome-length scaffolding; we detect 24 putative centromeres and 20 terminal telomeric tracts, with 22 chromosomes gap-free and two containing one gap each. Annotation identifies 24,479 protein-coding genes, 93% of which are functionally assigned, and 27.13% repetitive content dominated by DNA transposons. Quality assessments show high completeness (BUSCO, 96.48% complete) and base-level accuracy consistent with k-mer and read-mapping metrics. To our knowledge, this is the first near-T2T reference for any loach (Cobitoidei), filling a key gap in Cypriniformes genomics. This resource will enable comparative and population genomics, illuminate adaptation to montane stream habitats, and support selective breeding, conservation, and aquaculture of this native species.
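To illustrate the contig N50 statistic quoted in the abstract, here is a minimal Python sketch. It is not the pipeline used in this study. The function name `n50` and the toy lengths are hypothetical and chosen only for demonstration. N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and summed until the
    running total reaches half of the overall length; the length of the
    sequence that crosses that threshold is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths supplied")


# Hypothetical toy lengths (not data from this assembly).
toy_lengths = [40, 30, 20, 10]
print(n50(toy_lengths))
```

In practice the lengths would be read from the assembly FASTA or its index. The toy list only demonstrates the arithmetic.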
Data availability
Sequencing reads are deposited in the NCBI Sequence Read Archive under accession SRP610030 (https://identifiers.org/ncbi/insdc.sra:SRP610030), and the genome assembly is available from GenBank under accession GCA_052674685.1 (https://identifiers.org/ncbi/insdc.gca:GCA_052674685.1).
Code availability
All software and versions are listed above. No custom code was used for this study.
Acknowledgements
This study was funded by the Leshan Municipal Science and Technology Bureau Key Research Project (Grant No. 23NZD002) and by the Leshan Sub-center of the National Swine Industry Center.
Ethics declarations
Competing interests
The authors declare no competing interests.
About this article
Cite this article
Tang, Y., Wu, Q., Wang, Y. et al. A chromosome level genome assembly of Homatula variegata from the Yangtze River basin. Sci Data (2026). https://doi.org/10.1038/s41597-026-06667-9


