A chromosome level genome assembly of Homatula variegata from the Yangtze River basin
  • Data Descriptor
  • Open access
  • Published: 26 January 2026

  • Yong Tang¹,
  • Qiaoxing Wu¹,
  • Yusu Wang¹,
  • Shaoqi Jiang¹,
  • Lan Liu¹ &
  • Lingjin Xian¹

Scientific Data (2026)

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

Subjects

  • Agricultural genetics
  • Genome

Abstract

Homatula variegata is a small benthic loach from the upper Yangtze and adjacent basins with aquaculture and ornamental value but no reference genome. We present a near telomere-to-telomere (T2T) chromosome-level assembly built from PacBio HiFi, Oxford Nanopore ultra-long, Illumina short reads, and Hi-C. The 641.26-Mb genome resolves 24 chromosomes as single contigs (contig N50, 24.40 Mb). Hi-C confirms chromosome-length scaffolding; we detect 24 putative centromeres and 20 terminal telomeric tracts, with 22 chromosomes gap-free and two containing one gap. Annotation identifies 24,479 protein-coding genes, 93% functionally assigned, and 27.13% repetitive content dominated by DNA transposons. Quality assessments show high completeness (BUSCO, 96.48% complete) and base-level accuracy consistent with k-mer and read-mapping metrics. To our knowledge this is the first near T2T-level reference for any loach (Cobitoidei), filling a key gap in Cypriniformes genomics. This resource will enable comparative and population genomics, illuminate adaptation to montane stream habitats, and support selective breeding, conservation, and aquaculture of this native species.
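
The contig N50 quoted above is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. A minimal sketch of that calculation follows; the function name `n50` and the toy lengths are illustrative assumptions and are not taken from the authors' pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together account for at least half of the total length.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy lengths, not values from the assembly.
toy_lengths = [10, 8, 6, 4, 2]
print(n50(toy_lengths))  # prints 8
```

For a real assembly, the lengths would come from the FASTA index or from a report produced by a tool such as QUAST, which is cited in the reference list.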

Data availability

All raw sequencing reads (Illumina WGS, Oxford Nanopore, PacBio HiFi, and Hi-C), the final genome assembly (FASTA), and functional/structural annotations (GFF3/FASTA) are available under NCBI BioProject PRJNA1307247 (ref. 48). The assembly is deposited at GenBank under accession GCA_052674685.1 (ref. 49).
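
As a convenience for readers who script their data retrieval, the sketch below checks that a string looks like a GenBank assembly accession and builds the identifiers.org link in the form used in the reference list. The function name `assembly_url` and the regular expression are assumptions made for illustration; the pattern reflects the common `GCA_` + nine digits + version layout and is not an official validator.

```python
import re

# Assumed layout of a GenBank assembly accession: "GCA_", nine digits,
# a dot, and a version number (for example GCA_052674685.1).
GCA_PATTERN = re.compile(r"^GCA_\d{9}\.\d+$")


def assembly_url(accession):
    """Return the identifiers.org URL for a GenBank assembly accession."""
    if not GCA_PATTERN.match(accession):
        raise ValueError(f"not a versioned GCA accession: {accession!r}")
    # URL prefix copied from the GenBank entry in the reference list.
    return f"https://identifiers.org/ncbi/insdc.gca:{accession}"


print(assembly_url("GCA_052674685.1"))
```

Rejecting malformed strings up front is a deliberate choice: an accession with footnote digits fused onto the version number would fail the check instead of producing a dead link.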

Code availability

All software and versions are listed above. No custom code was used for this study.

References

  1. Li, W., Pu, Y. & Tian, H. Spatial and temporal distribution characteristics and optimum habitat conditions of Paracobitis variegatus in Heishui River. Journal of Fishery Sciences of China 30, 515–524 (2023).

  2. Mauice, K. Subspecific differentiation of Paracobitis variegatus with comments on its zoogeography. Zoological Research 15, 58–67 (1994).

  3. Zhou, Y. Preliminary study on the biology of Paramisgurnus rubripes in the middle reaches of Qingyi River. Sichuan Agricultural University (2007).

  4. Ma, B. S. et al. Length–weight and length–length relationships of four native fish species from the Yalong River, China. Journal of Applied Ichthyology 33, 839–841 (2017).

  5. Guo, Z. Sequencing of mitochondrial genome of Paragonimus rubripes and phylogenetic analysis of Cyprinus carpio. Shaanxi Normal University (2012).

  6. Liu, C. Z., Wei, G. H., Hu, J. H. & Liu, X. Y. Complete mitochondrial genome of Paracobitis variegates and its phylogenetic analysis. Mitochondrial DNA Part A 27, 2421–2422 (2016).

  7. Liu, F. et al. The telomere-to-telomere gapless genome of grass carp provides insights for genetic improvement. GigaScience 14, giaf059 (2025).

  8. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022).

  9. Yuan, J. et al. A telomere-to-telomere genome assembly of koi carp (Cyprinus carpio) using long reads and Hi-C technology. GigaScience 14, giaf087 (2025).

  10. Zhang, X., Chen, J., Zhou, W., Wen, J. & Shi, Q. A telomere-to-telomere gap-free genome assembly of the protandrous hermaphrodite Asian seabass (Lates calcarifer). Scientific data 12, 1457 (2025).

  11. Chen, S. Ultrafast one-pass FASTQ data preprocessing, quality control, and deduplication using fastp. iMeta 2, e107 (2023).

  12. De Coster, W. & Rademakers, R. NanoPack2: population-scale evaluation of long-read sequencing data. Bioinformatics 39, btad311 (2023).

  13. Jung, Y. & Han, D. BWA-MEME: BWA-MEM emulated with a machine learning approach. Bioinformatics 38, 2404–2413 (2022).

  14. Servant, N. et al. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome biology 16, 259 (2015).

  15. Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell systems 3, 95–98 (2016).

  16. Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770 (2011).

  17. Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nature communications 11, 1432 (2020).

  18. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature methods 18, 170–175 (2021).

  19. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  20. Guan, D. et al. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics 36, 2896–2898 (2020).

  21. Hu, J., Fan, J., Sun, Z. & Liu, S. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 36, 2253–2255 (2020).

  22. Madden, T. The BLAST sequence analysis tool. The NCBI handbook 2, 425–436 (2013).

  23. Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).

  24. Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell systems 3, 99–101 (2016).

  25. Zhou, C., McCarthy, S. A. & Durbin, R. YaHS: yet another Hi-C scaffolding tool. Bioinformatics 39, btac808 (2023).

  26. Xu, M. et al. TGS-GapCloser: a fast and accurate gap closer for large genomes with low coverage of error-prone long reads. GigaScience 9, giaa094 (2020).

  27. Brown, M. R., de La Rosa, M. G. & Blaxter, M. Tidk: a toolkit to rapidly identify telomeric repeats from genomic datasets. Bioinformatics 41, btaf049 (2025).

  28. Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids research 27, 573–580 (1999).

  29. Open2C, Abdennur, N. et al. Cooltools: enabling high-resolution Hi-C analysis in Python. PLOS Computational Biology 20, e1012067 (2024).

  30. Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences 117, 9451–9457 (2020).

  31. Ou, S. & Jiang, N. LTR_retriever: a highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant physiology 176, 1410–1422 (2018).

  32. Han, Y. & Wessler, S. R. MITE-Hunter: a program for discovering miniature inverted-repeat transposable elements from genomic sequences. Nucleic acids research 38, e199 (2010).

  33. Abrusán, G., Grundmann, N., DeMester, L. & Makalowski, W. TEclass—a tool for automated classification of unknown eukaryotic transposable elements. Bioinformatics 25, 1329–1330 (2009).

  34. Jurka, J. et al. Repbase Update, a database of eukaryotic repetitive elements. Cytogenetic and genome research 110, 462–467 (2005).

  35. Chen, N. Using Repeat Masker to identify repetitive elements in genomic sequences. Current protocols in bioinformatics 5, 4.10.1–4.10.14 (2004).

  36. Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).

  37. Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature biotechnology 37, 907–915 (2019).

  38. Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature biotechnology 33, 290–295 (2015).

  39. Keilwagen, J., Hartung, F. & Grau, J. GeMoMa: homology-based gene prediction utilizing intron position conservation and RNA-seq data. In Gene prediction: Methods and protocols, 161–177 (Springer, 2019).

  40. Stanke, M., Diekhans, M., Baertsch, R. & Haussler, D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644 (2008).

  41. Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome biology 9, R7 (2008).

  42. Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nature methods 12, 59–60 (2015).

  43. Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).

  44. Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P. & Huerta-Cepas, J. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Molecular biology and evolution 38, 5825–5829 (2021).

  45. Aramaki, T. et al. KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics 36, 2251–2252 (2020).

  46. Chan, P. P., Lin, B. Y., Mak, A. J. & Lowe, T. M. tRNAscan-SE 2.0: improved detection and functional classification of transfer RNA genes. Nucleic acids research 49, 9077–9096 (2021).

  47. Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29, 2933–2935 (2013).

  48. NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP610030 (2025).

  49. GenBank https://identifiers.org/ncbi/insdc.gca:GCA_052674685.1 (2025).

  50. Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

  51. Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).

  52. Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome biology 21, 245 (2020).

  53. Tegenfeldt, F. et al. OrthoDB and BUSCO update: annotation of orthologs with wider sampling of genomes. Nucleic acids research 53, D516–D522 (2025).

  54. Parra, G., Bradnam, K. & Korf, I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23, 1061–1067 (2007).

Acknowledgements

This study was funded by the Leshan Municipal Science and Technology Bureau Key Research Project (Grant No. 23NZD002) and by the Leshan Sub-center of the National Swine Industry Center.

Author information

Authors and Affiliations

  1. School of Modern Agriculture, Leshan Vocational and Technical College, Leshan, Sichuan Province, 614000, China

    Yong Tang, Qiaoxing Wu, Yusu Wang, Shaoqi Jiang, Lan Liu & Lingjin Xian

Corresponding authors

Correspondence to Yong Tang or Lingjin Xian.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Tang, Y., Wu, Q., Wang, Y. et al. A chromosome level genome assembly of Homatula variegata from the Yangtze River basin. Sci Data (2026). https://doi.org/10.1038/s41597-026-06667-9

  • Received: 25 September 2025

  • Accepted: 21 January 2026

  • Published: 26 January 2026

  • DOI: https://doi.org/10.1038/s41597-026-06667-9
