Background & Summary

The reed vole (Microtus fortis) is a small rodent with significant value as a model organism across multiple biological disciplines. Its unique physiological traits make it a compelling subject for genetic and biomedical research. These include a specialized digestive system adapted to a high-fiber diet, which offers insights into herbivore metabolism and gut microbiota interactions [1]. Furthermore, it serves as a rare natural model for spontaneous ovarian cancer, providing a clinically relevant system for studying tumorigenesis without artificial induction [2,3]. It is also used in behavioral studies to explore social dynamics and other complex behaviors [4]. Most notably, M. fortis possesses a remarkable innate resistance to parasites such as Schistosoma japonicum [5,6,7], making it an invaluable non-permissive host model for dissecting the genetic underpinnings of anti-parasite immunity. However, exploration of the genetic basis for these characteristics has been significantly hampered by the lack of a high-quality genomic reference. Previous genomic resources for M. fortis were limited to transcriptomic data or highly fragmented draft assemblies [3,8,9,10]. Such resources are insufficient for studying large-scale genomic architecture, as they cannot be used to analyze synteny, identify large structural variants, or accurately resolve the structure and copy number of complex and tandemly arrayed gene families, such as those related to immunity. These limitations have prevented a deep investigation into the evolutionary adaptations and the molecular mechanisms underlying the vole's unique phenotypes.

High-quality, chromosome-level reference genomes are foundational for modern genomics, enabling comprehensive analyses of genome evolution, function, and regulation [11]. The advent of third-generation long-read sequencing technologies has revolutionized de novo genome assembly. Specifically, PacBio High-Fidelity (HiFi) sequencing, which generates long reads (>10 kb) with very high accuracy (>99.9%), is particularly effective for resolving complex repeat regions and heterozygous sequences, thus generating highly contiguous initial assemblies with superior completeness [12]. To elevate such an assembly to a chromosomal scale, these contiguous sequences can be combined with data from chromatin conformation capture techniques such as Hi-C. Hi-C data provide empirical, long-range information about the three-dimensional proximity of DNA segments within the nucleus. This orthogonal dataset is well suited to ordering and orienting the assembled contigs into chromosome-length pseudomolecules, which is essential for validating karyotype and enabling studies of genome-wide structural organization [13].

To address the critical resource gap for M. fortis, we employed this powerful hybrid assembly strategy, integrating deep-coverage PacBio HiFi long-read data with extensive Hi-C-based scaffolding. This approach has allowed us to produce the first high-quality, chromosome-level reference genome for the species, overcoming the specific limitations of previous draft versions. This Data Descriptor provides a detailed account of the sample collection, sequencing protocols, assembly pipeline, and annotation methods used, presenting a robust and valuable genomic resource that will facilitate advanced, previously intractable research into the unique biology of M. fortis.

Methods

Sample collection and preparation

A healthy adult male Microtus fortis was sourced from the Dongting Lake region and subsequently maintained at the Xiangya Medical College, Central South University (Changsha, China). To minimize allelic variation and simplify the assembly process, tissues from a single individual were used for genome sequencing. Skeletal muscle tissue was collected for whole-genome shotgun sequencing, Hi-C library construction, and RNA sequencing. Liver tissue was additionally collected for whole-genome shotgun sequencing. Immediately following dissection, all samples were snap-frozen in liquid nitrogen and subsequently stored at −80 °C to maintain integrity. All procedures were conducted in strict accordance with institutional guidelines and were approved by the Laboratory Animal Welfare and Ethics Committee of Central South University (Changsha, China; approval no. CSU-2022-0654).

Genome sequencing

Total genomic DNA was isolated from skeletal muscle tissue using a TIANamp Genomic DNA Kit (Tiangen Biotech, Beijing, China). For genome survey purposes, a short-read library with an insert size of approximately 300 bp was constructed using DNBSEQ technology. The genomic DNA was fragmented, and selected fragments underwent end-repair, A-tailing, and ligation of sequencing adapters. The adapter-ligated products were then heat-denatured and circularized by splint oligo hybridization. Unligated linear DNA molecules were digested, and the resulting single-stranded circular DNA molecules were amplified via rolling circle amplification (RCA) to create DNA Nanoballs (DNBs). The DNBs were loaded onto sequencing flow cells and sequenced on the DNBSEQ-T7 platform (MGI, Shenzhen, China), generating a total of 195.46 Gb of 150 bp paired-end data (Table 1).

Table 1 Statistics of the DNA sequencing data for the M. fortis genome assembly.
Table 2 Statistics of the final genome assembly of M. fortis.
Table 3 Chromosome information of the M. fortis genome assembly.
Table 4 BUSCO assessment of genome completeness for M. fortis (glires_odb10).
Table 5 Repetitive element annotations in the M. fortis genome.
Table 6 Summary of non-coding RNA annotation in the M. fortis genome.

For long-read sequencing, a SMRTbell library with an insert size of approximately 15 kb was prepared using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA). High-molecular-weight DNA was sheared to the target fragment size. The fragmented DNA was then subjected to a DNA damage repair step, followed by an end-repair and A-tailing process. Hairpin SMRTbell adapters were ligated to both ends of the DNA fragments, creating a closed, single-stranded circular topology. Exonuclease treatment was performed to remove any remaining unligated linear DNA fragments. The resulting SMRTbell templates were size-selected to obtain the desired 15 kb library. A sequencing primer was annealed to the SMRTbell templates, which were then bound with DNA polymerase. This library was sequenced on a PacBio Sequel II platform (Pacific Biosciences, Menlo Park, CA, USA) using the circular consensus sequencing (CCS) mode, which generates highly accurate HiFi reads by sequencing the same molecule multiple times. This process yielded a total of 94.26 Gb of HiFi long reads (Table 1).

Hi-C sequencing

For chromosome-level scaffolding, an in situ Hi-C library was constructed following established protocols [14]. Approximately 1 g of skeletal muscle tissue was finely minced and treated with 2% formaldehyde for 30 minutes to crosslink proteins and DNA, thereby preserving the native three-dimensional chromatin architecture. The crosslinked cells were lysed, and the intact nuclei were purified. The chromatin was then digested overnight using the 4-cutter restriction enzyme DpnII, which creates sticky ends. The digested fragment ends were filled in with nucleotides, including a biotinylated dATP, to mark the original termini of the interacting fragments. Next, proximity ligation was performed in a dilute solution, which favors the ligation of fragments that were spatially adjacent within the nucleus. After ligation, the crosslinks were reversed by proteinase K treatment and overnight incubation at 65 °C. The DNA was purified, and non-ligated biotinylated ends were removed. The purified DNA, now enriched for chimeric ligation products, was sheared to a size range of 300–600 bp. The biotin-containing fragments, representing the Hi-C junctions, were captured and enriched using streptavidin-coated magnetic beads. Finally, the enriched fragments were processed into a sequencing library through end-repair, A-tailing, and adapter ligation. The final Hi-C library was sequenced on the DNBSEQ-T7 platform (MGI, Shenzhen, China), producing approximately 233.49 Gb of 150 bp paired-end data (Table 1).
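As a rough orientation, the nominal sequencing depth implied by these yields can be computed by dividing each yield by a genome size. The sketch below uses the three yields stated in the text together with the k-mer-based size estimate (2.09 Gb) and the contig assembly span (2.29 Gb) reported elsewhere in this descriptor. The function and variable names are illustrative and are not part of the authors' pipeline.

```python
# Yields in Gb as stated in the text for each library type.
YIELDS_GB = {
    "DNBSEQ short reads": 195.46,
    "PacBio HiFi": 94.26,
    "Hi-C": 233.49,
}


def fold_coverage(yield_gb: float, genome_size_gb: float) -> float:
    """Nominal fold coverage = sequenced bases / genome size."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return yield_gb / genome_size_gb


def coverage_table(genome_size_gb: float) -> dict:
    """Fold coverage per library, rounded to one decimal place."""
    return {
        name: round(fold_coverage(gb, genome_size_gb), 1)
        for name, gb in YIELDS_GB.items()
    }


if __name__ == "__main__":
    # 2.09 Gb: k-mer estimate; 2.29 Gb: contig assembly span.
    for size_gb in (2.09, 2.29):
        print(size_gb, coverage_table(size_gb))
```

These figures are nominal: they ignore read filtering, duplicates, and unmapped reads, so realized depth after quality control will be somewhat lower.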

RNA-seq sequencing

Total RNA was extracted from skeletal muscle tissue using a Tiangen RNA extraction kit (Tiangen Biotech, Beijing, China). For transcriptome-assisted annotation, a DNBSEQ RNA-seq library was prepared. First, messenger RNA (mRNA) was enriched from the total RNA using oligo(dT)-coated magnetic beads to specifically capture polyadenylated transcripts. The enriched mRNA was then randomly fragmented. Using these fragments as templates, first-strand cDNA was synthesized with random hexamer primers and reverse transcriptase. Subsequently, second-strand cDNA was synthesized using DNA polymerase I and RNase H. The resulting double-stranded cDNA fragments underwent end-repair, A-tailing, and ligation of sequencing adapters. The adapter-ligated fragments were then amplified by PCR to create the final library, which was sequenced on the DNBSEQ platform (MGI, Shenzhen, China), generating approximately 12 Gb of new transcriptomic data from skeletal muscle. To broaden the evidence available for gene annotation, these newly generated data were combined with previously published RNA-seq datasets from liver (NCBI BioProject: PRJNA395088) and ovary (NCBI BioProject: PRJNA687349) tissues [3,10].

Genome survey

To gain preliminary insights into the genomic landscape of M. fortis prior to assembly, a k-mer-based analysis was performed on the high-coverage short-read data. The frequency of all 19-base-pair subsequences (19-mers) was counted from approximately 195.46 Gb of quality-filtered reads using JELLYFISH v2.2.10 [15], generating a k-mer frequency distribution histogram. The resulting distribution was then analyzed with GenomeScope v2.0 [16], which models the genome's properties by fitting the k-mer profile to statistical expectations. The model identified a main peak corresponding to homozygous, single-copy k-mers and a smaller, secondary peak at half the depth, representing heterozygous k-mers (Fig. 1). Based on the position and relative size of these peaks, a haploid genome size of approximately 2.09 Gb, a genomic heterozygosity rate of 0.44%, and a repeat content of 25.37% were estimated. These pre-assembly metrics guided the assembly strategy and were later used to validate the size and complexity of the final assembly.
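The principle behind k-mer-based size estimation can be illustrated with a much simpler estimator than the mixture model fitted by GenomeScope: the total number of k-mer occurrences divided by the depth of the main (homozygous) peak. The sketch below is only that simplified estimator, not the authors' method. It assumes a two-column "depth count" histogram as written by `jellyfish histo`; the `min_depth` cutoff for discarding likely error k-mers and the toy histogram values are illustrative assumptions.

```python
def parse_histo(text: str) -> dict:
    """Parse a two-column 'depth count' k-mer histogram into a dict."""
    hist = {}
    for line in text.strip().splitlines():
        depth, count = line.split()
        hist[int(depth)] = int(count)
    return hist


def estimate_genome_size(hist: dict, min_depth: int = 5):
    """Return (estimated haploid size in bp, main peak depth).

    hist maps depth -> number of distinct k-mers seen at that depth.
    K-mers below min_depth are treated as sequencing errors (assumption).
    The tallest remaining bin is taken as the homozygous peak, which
    holds only when heterozygosity is low.
    """
    usable = {d: n for d, n in hist.items() if d >= min_depth}
    if not usable:
        raise ValueError("no histogram bins at or above min_depth")
    peak_depth = max(usable, key=lambda d: usable[d])
    total_kmer_occurrences = sum(d * n for d, n in usable.items())
    return total_kmer_occurrences / peak_depth, peak_depth


if __name__ == "__main__":
    # Toy histogram: error k-mers at depth 1-2, a small heterozygous
    # shoulder at depth 20, and the homozygous peak at depth 40.
    toy = parse_histo("1 50000\n2 9000\n20 100\n39 800\n40 1000\n41 800\n")
    print(estimate_genome_size(toy))
```

Because heterozygous k-mers sit at half depth, each contributes half a genome position to the total, which is why the estimate refers to the haploid genome.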

Fig. 1

The 19-mer count distribution for the genome size estimation. The genome size, heterozygous rate, and repeat content of M. fortis were estimated to be 2.09 Gb, 0.44% and 25.37% respectively.

Contig assembly

The initial contig-level assembly was generated from the PacBio HiFi long reads using Hifiasm v0.16.1-r375 [17], which is specifically designed for long, accurate reads and constructs a phased assembly graph to resolve heterozygous regions. This initial step produced an assembly containing both primary contigs and alternative haplotigs. To create a single, non-redundant reference for scaffolding, this assembly was processed with Purge_dups v1.2.5 [18], which identifies and removes redundant sequences corresponding to alternative alleles by analyzing read depth across the assembled contigs. After this purging step, a clean, haploid representation of the genome was obtained. This final contig-level assembly spanned 2.29 Gb, comprising 107 contigs with a contig N50 of 68.89 Mb (Table 2), indicating excellent contiguity prior to scaffolding.
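For readers unfamiliar with the N50 statistic quoted above, it is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. A minimal sketch follows; the function name and the toy lengths are illustrative and do not come from the assembly itself.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    at which the running sum first reaches half of the total length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    # Toy contig lengths (arbitrary units), not values from the paper.
    print(n50([10, 8, 6, 4, 2]))
```

The same function applies to scaffold lengths, which is how a scaffold N50 is obtained after Hi-C scaffolding.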

Chromosome-level genome assembly

The highly contiguous contig-level assembly was scaffolded into chromosome-level pseudomolecules using the high-coverage Hi-C data. First, the raw Hi-C reads were aligned to the contig assembly. The Juicer pipeline (v1.6) [19] was then employed to process these alignments, filter out invalid pairs (e.g., self-ligations, random breaks), and generate a genome-wide contact map that quantifies the interaction frequency between all pairs of contigs. Subsequently, the 3D-DNA pipeline (v190716) [20] used this contact map to automatically cluster, order, and orient the contigs into large scaffolds corresponding to chromosomes. For quality control and refinement, the resulting scaffolds were manually reviewed in Juicebox (v1.11.08) [21]. The initial automated scaffolding was of high quality, requiring only minor corrective edits. This manual process involved resolving a small number of clear, unambiguous misjoins and connecting a few scaffolds where strong, contiguous Hi-C interaction signals provided clear evidence of their adjacency. All manual edits were conservative and strictly guided by the Hi-C contact patterns. The final chromosome-level assembly anchors 97.73% of the sequence into 26 pseudomolecules (Table 3 and Fig. 2), consistent with previous karyotype studies of M. fortis from the same geographical region [22]. The assembly exhibits exceptional long-range contiguity, with a scaffold N50 of 91.23 Mb. To assess its genic completeness, the assembly was evaluated with BUSCO v5.4.6 [23] against the glires_odb10 lineage dataset, which identified 96.3% of the expected conserved genes as complete, confirming the high quality of the final genome assembly (Table 4).

Fig. 2

Hi-C interaction heatmap of the M. fortis genome assembly. The heatmap illustrates the contact density between genomic regions. The 26 assembled chromosomes are arranged in order of size. The intense signal along the diagonal of each chromosome block indicates high interaction frequency within each chromosome, validating the high quality of the chromosome-level scaffolding. The color bar reflects the logarithm of contact density, from high (dark red) to low (light yellow).

Identification of sex chromosomes

Depth-of-coverage analysis of both short-read (DNBSEQ-T7) and long-read (PacBio HiFi) data showed that chrX had approximately half the average coverage of the other chromosomes. This indicates a hemizygous state, as expected for the X chromosome of a male individual. This identification was further validated by genome-wide synteny analysis, which showed that chrX is highly homologous to the X chromosomes of Microtus ochrogaster and Mus musculus (Supplementary Fig. 3). Owing to its high repetitive content and relatively small size, the Y chromosome was not successfully scaffolded, and its sequence remains distributed among the 62 unplaced scaffolds.
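The coverage-based reasoning above can be sketched as follows: each chromosome's mean depth is compared with the median across chromosomes, and those near half the baseline are flagged as hemizygous candidates. This is a minimal illustration, not the authors' implementation. The ratio window (0.4 to 0.6), the function name, and the per-chromosome depth values are all hypothetical.

```python
from statistics import median


def hemizygous_candidates(mean_depth: dict, low: float = 0.4, high: float = 0.6):
    """Return chromosomes whose depth is roughly half the genome baseline.

    mean_depth maps chromosome name -> mean read depth. The baseline is
    the median across chromosomes, which is robust to a single
    half-covered sex chromosome. The [low, high] window is an assumption.
    """
    if not mean_depth:
        raise ValueError("mean_depth must not be empty")
    baseline = median(mean_depth.values())
    return sorted(
        name for name, depth in mean_depth.items()
        if low <= depth / baseline <= high
    )


if __name__ == "__main__":
    # Hypothetical per-chromosome mean depths, not values from the paper.
    depths = {"chr1": 41.0, "chr2": 40.2, "chr3": 41.5, "chrX": 20.8}
    print(hemizygous_candidates(depths))
```

In practice the per-chromosome depths would come from read alignments, and agreement between the short-read and long-read datasets strengthens the call.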

Gene prediction and functional annotation

To obtain a high-quality gene set, protein-coding genes (PCGs) were predicted on the repeat-masked genome using an integrated, evidence-based strategy that combined three distinct lines of evidence: homology-based prediction, de novo gene finding, and transcriptome-based evidence.

Homology-based prediction

Protein sequences from two closely related species (Microtus ochrogaster and Microtus oregoni) and from a previous M. fortis draft assembly were aligned to the new genome using TBLASTN. The gene structures corresponding to these alignments were then predicted using Exonerate v2.2.0 [24].

De novo prediction

Gene models were predicted based on the intrinsic properties of the DNA sequence. First, high-quality proteins identified from the RNA-seq dataset were used to train gene prediction models for Augustus v3.3 [25] and Genscan [26] using MAKER2 v2.31.10 [27]. Then, these trained models were used to perform ab initio gene prediction across the genome.

Transcriptome-based prediction

RNA-seq reads from multiple tissues were mapped to the genome using TopHat2 v2.1.1 [28] and assembled into transcripts with StringTie v2.4.0 [29]. Likely protein-coding regions within these transcripts were subsequently identified with TransDecoder v5.7.0.

Finally, the gene predictions from these three approaches were merged into a final, non-redundant consensus gene set using EVidenceModeler (EVM) v2.1.0 [30], which weighs the different sources of evidence to generate the most reliable gene models. This pipeline identified a total of 23,678 PCGs.

For functional annotation, the amino acid sequences of these 23,678 predicted genes were aligned against multiple public databases using BLASTP (E-value < 1e-5), including the NCBI non-redundant protein (Nr) and UniProt (Swiss-Prot/TrEMBL) databases [31]. Further functional information was derived by searching for conserved protein domains and families using InterProScan [32], which scans against databases such as Pfam. Gene Ontology (GO) terms were also assigned, and genes were associated with biological pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG) and eukaryotic Orthologous Groups (KOG) databases via eggNOG-mapper [33]. This approach assigned putative functions to 23,088 genes (97.5% of the total predicted set).

Repeat annotation

The identification and masking of repetitive elements is a critical step before gene annotation to prevent spurious predictions. We employed a strategy combining both homology-based and de novo approaches. For homology-based prediction, which identifies known repeats, the genome was screened using RepeatMasker open-4.0.9 [34] and RepeatProteinMask against the RepBase library [35], a curated database of known transposable elements. For de novo prediction, which discovers novel repeat families specific to the M. fortis genome, we first used RepeatModeler v1.0.4 [36] to construct a species-specific repeat library. Tandem repeats, which are simple, consecutively repeated sequences, were identified separately using Tandem Repeats Finder v4.07 [37]. The results from these different approaches were then integrated to create a final repeat annotation. This analysis revealed that repetitive sequences constitute 41.87% of the M. fortis genome (Table 5). The most abundant categories of transposable elements were Long Terminal Repeats (LTRs) at 17.16%, followed by Long Interspersed Nuclear Elements (LINEs) at 15.16%, Short Interspersed Nuclear Elements (SINEs) at 9.05%, and DNA transposons at 2.8%.

Non-coding RNA annotation

A comprehensive annotation of non-coding RNAs (ncRNAs) was performed to identify key functional RNA molecules within the genome. Different classes of ncRNAs were identified using specialized bioinformatics tools. Transfer RNAs (tRNAs), which are essential for translating mRNA into protein, were predicted using tRNAscan-SE v2.0 [38], which searches for the characteristic cloverleaf secondary structure of tRNA genes. Other major classes of ncRNAs, including ribosomal RNAs (rRNAs), microRNAs (miRNAs), and small nuclear RNAs (snRNAs), were identified by searching the genome against the Rfam database (v14.7) [39] using the Infernal v1.1.2 software [40]. Infernal employs covariance models that account for both sequence and secondary structure conservation, allowing for highly sensitive and specific identification of structured ncRNA families. The combined results from these analyses identified a total of 1,398 rRNA genes, 566 miRNA genes, and 1,964 snRNA genes (Table 6).

Mitochondrial genome assembly and annotation

The complete mitochondrial genome was assembled de novo using the MitoZ (v1.3) software [41], which utilized both the DNBSEQ short-read data and the PacBio HiFi long-read data. The resulting circular mitogenome was annotated using both MitoZ and the MITOS2 server [42], selecting the appropriate reference database based on the species. To ensure high accuracy, all protein-coding genes in the final annotation were manually checked and curated by comparing them to the mitogenome of a closely related species, Microtus obscurus (NC_087845.1). The final annotated mitogenome map was visualized using the OGDRAW software [43]. The final mitogenome is 16,367 bp in length and contains the full set of 37 typical vertebrate mitochondrial genes (Supplementary Fig. 5).

Data Records

All sequencing data generated for this study have been deposited in the NCBI database under BioProject accession number PRJNA1271721. The raw sequencing reads are available in the Sequence Read Archive (SRA) under study accession SRP589748 [44]. Specifically, DNBSEQ genomic sequencing data are deposited under accession SRR33821528 [45], and PacBio HiFi genomic sequencing data are available under accessions SRR33821526 [46] and SRR33821527 [47]. Additionally, Hi-C interaction data and transcriptomic (RNA-seq) data have been deposited under accessions SRR33821525 [48] and SRR33821524 [49], respectively. The complete mitochondrial genome assembly is available in GenBank in FASTA format under accession number PX549189.1 [50]. The final chromosome-level genome assembly has been deposited in GenBank under accession JBQVRV000000000.1 [51] and in the Genome Warehouse (GWH) at the National Genomics Data Center (NGDC) under accession GWHESEF00000000 [52]. The GWH entry includes the genome assembly, genome annotation, coding sequences, and protein sequences available for download.

Technical Validation

DNA quantification and qualification

The quality of the starting biological material was rigorously assessed prior to library construction to establish a foundation of high-quality data. High-molecular-weight genomic DNA, essential for long-read sequencing, was extracted from muscle tissue. Its integrity was confirmed by 1% agarose gel electrophoresis, which revealed a distinct, high-molecular-weight band with no visible smearing, indicating that the DNA was not degraded and was suitable for generating long PacBio reads. The concentration of the DNA was precisely quantified using a Qubit Fluorometer (Thermo Fisher Scientific, Waltham, MA, USA), which employs a dsDNA-specific dye for accuracy, ensuring optimal input for library preparation. Purity was assessed with a NanoDrop spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA); an A260/280 ratio of approximately 1.8 confirmed that the sample was largely free of protein contamination, which can inhibit downstream enzymatic reactions.

Quality control of raw sequencing data

The raw sequencing data from all platforms underwent a stringent quality control process to remove low-quality reads and artifacts. For the DNBSEQ short-read data, raw reads were processed to remove adapter sequences, reads with a high proportion of N bases (>5%), and low-quality reads where more than 50% of the bases had a phred quality score below 20. This ensured that only high-quality reads were used for the downstream genome survey. For the PacBio sequencing, the raw subread data was processed on-instrument to generate highly accurate HiFi reads (>99.9% accuracy) through circular consensus sequencing (CCS), which inherently filters out low-quality reads. For Hi-C data, raw reads were first processed to remove adapter sequences and low-quality reads. The resulting clean data was then aligned to the reference genome and filtered for valid interaction pairs. The filtering steps included removing unmapped reads, invalid pairs (such as self-circles and dangling ends), and PCR duplicates. Only the valid, unique read pairs were used for the subsequent scaffolding process.
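The two short-read thresholds described above (more than 5% N bases, or more than 50% of bases below Phred 20) can be expressed as a simple per-read predicate. The sketch below is illustrative only and does not reproduce the filtering software actually used. It assumes Phred+33 quality encoding, omits adapter removal, and uses hypothetical function names.

```python
def phred33(qual_string: str) -> list:
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in qual_string]


def passes_filter(seq: str, quals: list,
                  max_n_frac: float = 0.05,
                  min_q: int = 20,
                  max_lowq_frac: float = 0.5) -> bool:
    """Return True if a read survives the N-content and quality filters.

    A read is discarded when its fraction of N bases exceeds max_n_frac,
    or when the fraction of bases with quality below min_q exceeds
    max_lowq_frac. Adapter trimming is not modeled here.
    """
    if not seq or len(seq) != len(quals):
        return False
    n_frac = seq.upper().count("N") / len(seq)
    lowq_frac = sum(q < min_q for q in quals) / len(quals)
    return n_frac <= max_n_frac and lowq_frac <= max_lowq_frac


if __name__ == "__main__":
    read = "ACGT" * 5
    print(passes_filter(read, phred33("I" * 20)))             # all bases Q40
    print(passes_filter(read, phred33("#" * 11 + "I" * 9)))   # 55% below Q20
```

For paired-end data a pair would typically be dropped if either mate fails, although the text does not state which rule was applied.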

RNA quality evaluation

For the transcriptome analysis, the quality and integrity of the total RNA extracted from skeletal muscle tissue were paramount. RNA quality was evaluated on an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). The resulting electropherograms for all samples used in library preparation yielded an RNA Integrity Number (RIN) greater than 8.0. This high RIN value, which is computed from the electropherogram and reflects features such as the 28S and 18S ribosomal RNA peaks, confirms that the RNA was intact and had not undergone significant degradation. This is critical for ensuring that the RNA-seq data accurately represent the full-length transcript repertoire without 3' bias, thus providing reliable evidence for gene annotation.

Evaluation of the assembled genome

The final genome assembly was evaluated for continuity, completeness, and accuracy. The assembly exhibits exceptional continuity, with a contig N50 of 68.89 Mb and a scaffold N50 of 91.23 Mb. A total of 97.73% of the assembled sequence was successfully anchored into 26 chromosome-level pseudomolecules, which is consistent with the known karyotype of M. fortis. The completeness was evaluated using Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.4.6 [23] against the glires_odb10 lineage dataset, yielding a complete score of 96.3% (C:96.3% [S:93.7%, D:2.6%], F:0.6%, M:3.1%, n:13798). The assembly accuracy was further validated using Merqury, which estimated a consensus quality (QV) score of 59.76 and a base error rate of 1.056 × 10^−6. The k-mer copy number and assembly spectra (Supplementary Figs. 1, 2) showed a clean distribution with minimal artificial duplications, confirming the high fidelity of the chromosome-level assembly. The accuracy of the assembly was also supported by a high read mapping rate of 99.98% with Inspector [53] and by the clean intra-chromosomal interaction patterns shown in the Hi-C contact map (Fig. 2). The quality of the gene annotation was also assessed by comparing the length distributions of gene elements to those of related species, which showed similar patterns and supported the accuracy of the gene prediction pipeline (Fig. 3). Finally, a contamination screen using BlobToolKit [54] confirmed the high purity of the assembly, showing that the vast majority of contigs belonged to a single, high-coverage cluster assigned to Chordata, with no evidence of significant contamination (Supplementary Fig. 4).
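The QV and base error rate quoted above are two expressions of the same quantity under the Phred scale, QV = −10 · log10(error rate). The short sketch below shows the conversion in both directions; the function names are illustrative, and the check uses only the QV value reported in the text.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled consensus quality value to a per-base error rate."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error rate to a Phred-scaled quality value."""
    if error_rate <= 0:
        raise ValueError("error rate must be positive")
    return -10 * math.log10(error_rate)


if __name__ == "__main__":
    # QV reported in the text; the result is about one error per million bases.
    print(qv_to_error_rate(59.76))
```

A QV near 60 therefore corresponds to roughly one consensus error per million bases, in line with the error rate reported by Merqury.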

Fig. 3

Comparative analysis of gene element length distributions. The plots compare the length distributions of key gene structure elements between the current M. fortis genome (black), a previous M. fortis draft assembly (GCF_014885135.2, blue), and two other closely related species, M. ochrogaster (GCF_000317375.1, green) and M. oregoni (GCF_018167655.1, red). The four panels show distributions for: (top left) overall gene length, (top right) coding sequence (CDS) length, (bottom left) exon length, and (bottom right) intron length. The similar patterns observed across these species, particularly for CDS and exon lengths, support the high quality and accuracy of the gene prediction in the current assembly.