Hi-C sequencing data from frontal cortex of laboratory rats

Kim, Panjun; Ward, Rachel R.; Sharp, Burt M.; Williams, Robert W.; Chen, Hao

doi:10.1038/s41597-025-06173-4

Data Descriptor
Open access
Published: 04 December 2025

Panjun Kim¹,
Rachel R. Ward¹,
Burt M. Sharp¹,
Robert W. Williams¹ &
…
Hao Chen ORCID: orcid.org/0000-0002-2680-6921²

Scientific Data volume 12, Article number: 1910 (2025)

Abstract

The three-dimensional structure of chromosomes is integral to nuclear organization and influences processes such as DNA replication, repair, and gene expression. Hi-C methods generate high-resolution, genome-wide data that map physical interactions between different parts of the chromatin. These data have applications in de novo genome assembly, the detection of structural variants, and the analysis of gene regulatory mechanisms. This report describes a Hi-C dataset from the frontal cortex of laboratory rats. The dataset includes a diverse panel of inbred strains (SHR/OlaIpcv, BN-Lx/Cub, BXH6/Cub, HXB2/Ipcv, HXB10/Ipcv, HXB23/Ipcv, HXB31/Ipcv, LE/Stm, F344/Stm) and an F1 hybrid (SHR/OlaIpcv × BN/NHsdMcwi). This dataset provides a valuable resource for advancing rat genomics and offers opportunities for integration with other omic datasets generated using this widely used model organism.

Background & Summary

High-throughput chromosome conformation capture (Hi-C) is a technique for studying the three-dimensional (3D) organization of the genome¹. The method identifies regions of the genome that are physically close to one another inside the cell nucleus. This is achieved by first crosslinking the chromatin to fix its 3D structure. The DNA is then digested, and fragments that are in close proximity are ligated together. These newly joined DNA molecules, which represent interacting genomic regions, are identified through next-generation paired-end sequencing².

Hi-C data are frequently used for de novo genome assembly. The frequency of interactions between DNA fragments provides information about their spatial proximity, which allows smaller sequenced fragments (contigs) to be ordered and oriented into chromosome-length assemblies. This improves the contiguity and accuracy of the resulting genome. The approach is effective for assembling complex genomes, such as those of polyploid species³, and can also assist in separating the contributions of homologous chromosomes in a process known as haplotype phasing⁴.

Additionally, Hi-C data facilitate the detection of large-scale structural variants (SVs), such as deletions, duplications, insertions, inversions, and translocations. These variants, which span more than 50 base pairs⁵, can be difficult to identify with standard short-read sequencing methods⁶. Since SVs alter the expected physical arrangement of the genome, they produce distinct changes in Hi-C interaction patterns. These changes serve as signatures for identifying the location and type of structural variation⁷.

Hi-C is also a primary tool for mapping chromatin loops, which are 3D structures formed when distant genomic regions interact. These loops often bring gene promoters into contact with distal regulatory elements like enhancers or silencers, playing a key role in regulating gene expression⁸. Architectural proteins such as CTCF and cohesin frequently bind to specific DNA motifs to stabilize these interactions⁹. Hi-C therefore provides genome-wide maps of the physical interactions that are important for understanding long-range gene regulation.

We have generated Hi-C datasets from the frontal cortex tissue of ten distinct rat strains. This collection includes significant genetic diversity from the HXB/BXH recombinant inbred strains¹⁰, which are a component of the Hybrid Rat Diversity Panel (HRDP)¹¹,¹². The HRDP is a resource for systems genetics that facilitates high-resolution mapping of complex traits and allows for integrative analyses of genomic, transcriptomic, and phenotypic data¹¹,¹³. The HRDP contains both recombinant and genetically diverse inbred strains, which support reproducible studies of behavior, physiology, and disease. We expect these Hi-C datasets will aid in rat genome assembly, the discovery of structural variation, and the analysis of regulatory architecture. They can also be integrated with other ‘omic’ datasets available for this model organism¹⁴,¹⁵,¹⁶.

Methods

Animals

Rats were bred at the University of Tennessee Health Science Center from breeders supplied by the Medical College of Wisconsin. All rats were group-housed under standard laboratory conditions and had not been subjected to behavioral or drug treatments. Euthanasia was performed via isoflurane overdose, followed by rapid brain removal. The average age at euthanasia was 139 days. Brains were immediately frozen using freeze spray. All procedures were approved by the Institutional Animal Care and Use Committee of the University of Tennessee Health Science Center and followed the NIH Guidelines concerning the Care and Use of Laboratory Animals.

Tissue processing, Hi-C library preparation, and sequencing

Frozen rat brains were transferred from −80 °C storage to −20 °C for at least 30 minutes, then cryosectioned at −15 °C into 120 µm slices. On a −10 °C cold block, 30–200 mg of cortex tissue was microdissected using an 18-gauge needle with a bent tip and forceps as needed. Following extensive troubleshooting, the optimal tissue input weight was determined to be approximately 50 mg. The bilateral cortex tissue was immediately transferred to pre-chilled 1.5 mL DNA LoBind tubes (Eppendorf) and kept on dry ice until all tissue was collected. The tissue was then pulverized in small quantities using a Cellcrusher-mini pre-cooled in liquid nitrogen. After all tissue was pulverized, 1X PBS was used to collect the sample from the device.

Tissue samples were processed according to the Arima Hi-C+ Kit protocol. Samples that passed the first quality control step (QC1) in the Arima protocol continued to the library preparation stage using the KAPA HyperPrep Kit. Samples were fragmented using a Covaris S2 instrument, with settings for each library detailed in Table 1. The fragmented samples were analyzed with an Agilent Bioanalyzer to confirm a median fragment size between 200 and 600 bp.

Table 1 Settings for Covaris instrument used for each library.

Size selection methods varied across samples due to extensive troubleshooting. The optimal procedure was determined to be a two-sided selection using fresh AMPure XP beads, which aimed for a narrow peak with a 400 bp median fragment size. Steps involving AMPure XP beads were performed with either a DynaMag for 1.5 mL tubes or a magnetic rack from Parse Biosciences for PCR-sized tubes. Magnetic 96-well plates proved unsuitable because they resulted in bead loss.

Library preparation was completed using the KAPA HyperPrep Kit with either Illumina TruSeq DNA Single Indexes or Illumina TruSeq UDI v2. Pre-amplification libraries were quantified using the KAPA Library Quantification Kit on a LightCycler 480. Amplified libraries and samples at all other quantification steps were analyzed using the Qubit HS DNA Assay and an Agilent Bioanalyzer HS DNA chip. A complete list of reagents and part numbers is available in Table 2.

Table 2 Kits and reagents for library preparation.

Generating prerequisite files for the Juicer pipeline

The Juicer pipeline (v1.6)¹⁷ requires three input files in addition to the FASTQ sequencing data: a reference genome, a chromosome size file, and a restriction site file.

a. Reference genome: The mRatBN7.2/rn7 genome was used as the reference. The file (“rn7.fa.gz”) was downloaded from the UCSC Genome Browser. To improve mapping specificity, only canonical chromosomes were retained, excluding unlocalized and unplaced sequences.
b. Chromosome size file: The reference genome was indexed using the faidx subcommand of samtools (v1.16.1)¹⁸. The chromosome names and sizes were extracted from the first two columns of the resulting .fai file.
c. Restriction enzyme site file: A file containing the genomic locations of restriction enzyme cleavage sites was generated using the generate_site_positions.py script from the Juicer pipeline. The four recognition sequences used in the Arima Genomics protocol (GATC, GANTC, CTNAG, and TTAA) were supplied as parameters along with the reference genome.
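As a rough illustration of what the chromosome size and restriction site files contain, the following Python sketch reads name and length pairs from .fai-formatted text and locates the four recognition sequences in a toy sequence. It is not a reimplementation of generate_site_positions.py: the function names, the toy sequence, and the toy index are hypothetical, N is assumed to match any of A, C, G, or T, and the positions reported are 0-based motif starts rather than cleavage coordinates.

```python
import re

ARIMA_MOTIFS = ["GATC", "GANTC", "CTNAG", "TTAA"]  # recognition sequences named in the text


def motif_to_regex(motif):
    # Assumption: "N" stands for any of the four bases; other letters match literally.
    return "".join("[ACGT]" if base == "N" else base for base in motif)


def find_sites(sequence, motifs):
    """Return sorted 0-based start positions of all motif occurrences."""
    sequence = sequence.upper()
    starts = set()
    for motif in motifs:
        # The lookahead makes overlapping occurrences visible to finditer.
        pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
        starts.update(match.start() for match in pattern.finditer(sequence))
    return sorted(starts)


def chrom_sizes_from_fai(fai_text, keep=None):
    """Take (name, length) from the first two tab-separated columns of .fai text."""
    sizes = []
    for line in fai_text.splitlines():
        if not line.strip():
            continue
        name, length = line.split("\t")[:2]
        if keep is None or name in keep:
            sizes.append((name, int(length)))
    return sizes


# Toy inputs, invented for illustration only.
toy_sites = find_sites("GATCGAATCTTAA", ARIMA_MOTIFS)
toy_fai = "chr1\t1000\t6\t60\t61\nchrUn_x\t50\t1030\t60\t61\n"
toy_sizes = chrom_sizes_from_fai(toy_fai, keep={"chr1"})
```

Passing a set of canonical chromosome names as `keep` mirrors the exclusion of unlocalized and unplaced sequences described above.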

Processing Hi-C data using the Juicer pipeline

The Hi-C sequencing data and the three prerequisite files were processed using the Juicer pipeline (https://github.com/aidenlab/juicer/releases/tag/1.6) with a juicer_tools.jar file (https://github.com/aidenlab/juicer/wiki/Download). The pipeline automates read alignment, filtering of invalid reads, and generation of contact maps.

A. Alignment. Paired-end FASTQ files were aligned to the reference genome using BWA-MEM (v0.7.17)¹⁹ with Juicer’s default parameters (-SP5M). No pre-alignment trimming was performed.
B. Filtering. After alignment, reads were classified as normal paired, chimeric paired, chimeric ambiguous (mapped equally well to multiple locations), or unmapped. Chimeric ambiguous and unmapped reads were excluded from downstream analyses.
C. Duplicate removal. PCR and optical duplicates were identified and removed using Juicer’s dups.awk module. This step produced the final non-redundant dataset (merged_nodups.txt).
D. Generating contact matrices. The alignments were then used to generate contact matrices, stored in .hic files, at two mapping quality (MAPQ) thresholds. The inter.hic file includes reads with MAPQ ≥ 1, while the high-stringency inter_30.hic file includes only reads with MAPQ ≥ 30. The latter file, containing uniquely mapped reads, was used for downstream analyses.
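The effect of the two mapping quality thresholds can be sketched in a few lines of Python. This is a simplified illustration, not Juicer's code: the record layout, field names, and toy values are hypothetical, and the sketch assumes that a pair is retained only when both ends meet the threshold.

```python
from collections import Counter

# Hypothetical, simplified read-pair records; real pipeline output carries more fields.
toy_pairs = [
    {"chrom_a": "chr1", "pos_a": 100, "mapq_a": 60,
     "chrom_b": "chr1", "pos_b": 250_000, "mapq_b": 60},
    {"chrom_a": "chr1", "pos_a": 120, "mapq_a": 60,
     "chrom_b": "chr1", "pos_b": 260_000, "mapq_b": 12},
    {"chrom_a": "chr1", "pos_a": 500, "mapq_a": 0,
     "chrom_b": "chr2", "pos_b": 900, "mapq_b": 60},
]


def passes_mapq(pair, threshold):
    # Assumption: the weaker of the two ends decides whether the pair is kept.
    return min(pair["mapq_a"], pair["mapq_b"]) >= threshold


def bin_contacts(pairs, threshold, bin_size):
    """Count contacts between fixed-size genomic bins for pairs passing the threshold."""
    counts = Counter()
    for pair in pairs:
        if not passes_mapq(pair, threshold):
            continue
        end_a = (pair["chrom_a"], pair["pos_a"] // bin_size)
        end_b = (pair["chrom_b"], pair["pos_b"] // bin_size)
        # Sort the two ends so (a, b) and (b, a) fall in the same matrix cell.
        counts[tuple(sorted((end_a, end_b)))] += 1
    return counts


loose = bin_contacts(toy_pairs, threshold=1, bin_size=100_000)    # analogous to inter.hic
strict = bin_contacts(toy_pairs, threshold=30, bin_size=100_000)  # analogous to inter_30.hic
```

In this toy input, the pair with one end at MAPQ 12 contributes to the MAPQ ≥ 1 matrix but not to the MAPQ ≥ 30 matrix, and the pair with one end at MAPQ 0 is absent from both.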

Data Records

The dataset is publicly available through the NCBI Sequence Read Archive²⁰–²⁹. Accession numbers are provided in Table 3. The Hi-C sequencing data are organized into three groups: (1) four individual rats from the parental strains of the Hybrid Rat Diversity Panel’s recombinant inbred (RI) lines, (2) five RI strains, and (3) one F1 hybrid (SHR/OlaIpcv × BN/NHsdMcwi). The deposited data consist of paired-end raw sequencing reads in fastq.gz format.

Table 3 Statistics of Hi-C sequencing after running the Juicer pipeline.

Technical Validation

Hi-C sequencing data were processed with the Juicer pipeline, using the mRatBN7.2/rn7 reference genome for alignment. Quality control metrics for each sample are detailed in Table 3. The experimental goal was to generate 500–600 million (M) paired-end reads per sample. The average yield was 623.39 M reads per sample. Two samples, LE/Stm and F344/Stm, had lower sequencing depth due to challenges in quantifying the library concentration, which led to imbalanced output in a pooled sequencing run. However, these two samples produced a high proportion of unique reads (76.51%) compared to the sample average of 60.85%.

Across all samples, an average of only 1.31% of the reads failed to map to the reference genome. Normal (non-chimeric) pairs derived from single genomic fragments made up an average of 41.30% of total reads. A further 57.40% of total reads were chimeric, containing sequences from different genomic locations. These comprised chimeric pairs that mapped to unique genomic locations (47.34% of total reads) and chimeric pairs that mapped ambiguously to multiple locations (10.06% of total reads).

The alignable read pairs, which include both normal pairs and uniquely mapped chimeric pairs, constituted an average of 88.64% of the total reads. From this set, optical duplicates accounted for a small fraction (0.83% of total reads), while PCR duplicates were more common (23.83% of total reads). After removing duplicates, the final set of unique, alignable read pairs used for analysis represented an average of 63.98% of the initial total reads.
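The read accounting in the two preceding paragraphs can be checked with simple arithmetic. The Python sketch below uses only the average percentages quoted in the text and treats each one as a share of total reads; the variable names are illustrative.

```python
# Average percentages of total reads, as quoted in the text.
normal = 41.30
chimeric_unique = 47.34
chimeric_ambiguous = 10.06
unmapped = 1.31
optical_duplicates = 0.83
pcr_duplicates = 23.83

# Chimeric reads are the uniquely mapped plus the ambiguously mapped ones.
chimeric_total = round(chimeric_unique + chimeric_ambiguous, 2)

# Alignable pairs are normal pairs plus uniquely mapped chimeric pairs.
alignable = round(normal + chimeric_unique, 2)

# Removing both duplicate classes leaves the unique, alignable pairs.
unique_alignable = round(alignable - optical_duplicates - pcr_duplicates, 2)

# The three top-level classes should account for nearly all reads.
category_sum = round(normal + chimeric_total + unmapped, 2)

print(chimeric_total, alignable, unique_alignable, category_sum)
```

The derived values reproduce the reported 57.40%, 88.64%, and 63.98%. The three top-level classes sum to 100.01%, a small excess that is consistent with rounding of the individual averages.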

For high-confidence contact analysis, these unique read pairs were filtered based on mapping quality (MAPQ). Pairs with a MAPQ score below 30 were excluded, which removed an average of 22.32% of the unique pairs. The remaining 77.68% of unique pairs, averaging 297.38 M high-quality pairs per sample, were used to construct the final Hi-C contact maps.

Analysis of the high-quality pairs showed that inter-chromosomal contacts (interactions between different chromosomes) accounted for an average of 52.28 M pairs per sample (17.6%). The majority, 245.10 M pairs (82.4%), were intra-chromosomal contacts (interactions within the same chromosome). These intra-chromosomal interactions were further divided by the linear genomic distance between contacting sites. Short-range interactions (<20 kb) comprised an average of 119.65 M pairs, while long-range interactions (>20 kb) made up the remaining 125.41 M pairs.
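The contact fractions above follow directly from the reported pair counts. The short Python sketch below recomputes them from the per-sample averages given in the text (in millions of pairs); the variable names are illustrative.

```python
# Average counts per sample, in millions of high-quality pairs, as quoted in the text.
high_quality_pairs = 297.38
inter_chromosomal = 52.28
intra_chromosomal = 245.10
short_range = 119.65  # intra-chromosomal, <20 kb
long_range = 125.41   # intra-chromosomal, >20 kb

inter_pct = 100 * inter_chromosomal / high_quality_pairs
intra_pct = 100 * intra_chromosomal / high_quality_pairs

# Share of intra-chromosomal contacts that are short range, using the two distance classes.
short_pct_of_intra = 100 * short_range / (short_range + long_range)

# Difference between the reported intra-chromosomal total and the sum of the two classes.
distance_gap = intra_chromosomal - (short_range + long_range)

print(round(inter_pct, 1), round(intra_pct, 1), round(short_pct_of_intra, 1))
```

The inter- and intra-chromosomal shares come out at 17.6% and 82.4%, matching the text. The two distance classes sum to 245.06 M, which is 0.04 M below the reported intra-chromosomal average, a gap that may simply reflect rounding of per-sample averages.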

Taken together, these processing steps and the resulting metrics indicate that the Hi-C dataset is of high quality, meeting or exceeding the benchmark recommendations provided by Arima Genomics. Key criteria met include a low percentage of unmapped reads (1.31% observed vs. <6% recommended), a limited number of ambiguous chimeric reads (10.06% observed vs. <20% recommended), a high proportion of alignable reads (with over 98% of reads initially mapped, compared to a recommendation of >80%), and an appropriate fraction of inter-chromosomal contacts (approximately 17.6% observed vs. ~20% recommended). These metrics suggest that the dataset provides a solid foundation for investigating the 3D genomic architecture and functional genomics of the rat.
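The comparison against the recommended thresholds can be written as a small table-driven check. The Python sketch below encodes only the three strict thresholds quoted above; the dictionary keys and the function name are hypothetical, and the mapped percentage is derived here as 100 minus the unmapped percentage. The inter-chromosomal recommendation is stated only approximately (~20%), so it is left out of the pass/fail logic.

```python
import operator

# Thresholds as quoted in the text: (comparison, limit in percent).
BENCHMARKS = {
    "unmapped_pct": (operator.lt, 6.0),
    "ambiguous_chimeric_pct": (operator.lt, 20.0),
    "mapped_pct": (operator.gt, 80.0),
}

# Observed dataset averages from the text; mapped_pct is derived, not quoted.
observed = {
    "unmapped_pct": 1.31,
    "ambiguous_chimeric_pct": 10.06,
    "mapped_pct": 100 - 1.31,
}


def check_benchmarks(values, benchmarks):
    """Return a dict mapping each metric name to True if it meets its threshold."""
    return {
        name: compare(values[name], limit)
        for name, (compare, limit) in benchmarks.items()
    }


results = check_benchmarks(observed, BENCHMARKS)
for name, passed in results.items():
    print(f"{name}: {'pass' if passed else 'fail'}")
```

Keeping the thresholds in a single dictionary makes it straightforward to apply the same check to per-sample values from Table 3 rather than to the dataset averages.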

Data availability

The Hi-C sequencing dataset supporting this Data Descriptor has been deposited in the NCBI Sequence Read Archive (SRA) under BioProject accession number PRJNA1197090. The data are publicly accessible at https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1197090/.

Code availability

Data processing was conducted using Juicer 1.6 (https://github.com/aidenlab/juicer/tree/main). The command line used and the input files are available on GitHub at https://github.com/distilledchild/rat-cortex-hic-descriptor/.

References

1. Bonev, B. & Cavalli, G. Organization and function of the 3D genome. Nat Rev Genet 17, 772 (2016).
2. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).
3. Zhang, X., Zhang, S., Zhao, Q., Ming, R. & Tang, H. Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data. Nat Plants 5, 833–845 (2019).
4. Duan, H. et al. Physical separation of haplotypes in dikaryons allows benchmarking of phasing accuracy in Nanopore and HiFi assemblies with Hi-C data. Genome Biol. 23, 84 (2022).
5. Mahmoud, M. et al. Structural variant calling: the long and the short of it. Genome Biol. 20, 246 (2019).
6. Dixon, J. R. et al. Integrative detection and analysis of structural variation in cancer genomes. Nat. Genet. 50, 1388–1398 (2018).
7. Song, F., Xu, J., Dixon, J. & Yue, F. Analysis of Hi-C Data for Discovery of Structural Variations in Cancer. Methods Mol Biol 2301, 143–161 (2022).
8. Siersbæk, R. et al. Dynamic rewiring of promoter-anchored chromatin loops during adipocyte differentiation. Mol. Cell 66, 420–435.e5 (2017).
9. Davidson, I. F. et al. CTCF is a DNA-tension-dependent barrier to cohesin-mediated loop extrusion. Nature 616, 822–827 (2023).
10. Printz, M. P., Jirout, M., Jaworski, R., Alemayehu, A. & Kren, V. Invited Review: HXB/BXH rat recombinant inbred strain platform: a newly enhanced tool for cardiovascular, behavioral, and developmental genetics and genomics. J. Appl. Physiol. 94, 2510–2522 (2003).
11. Dwinell, M. R. et al. Establishing the hybrid rat diversity program: a resource for dissecting complex traits. Mamm. Genome https://doi.org/10.1007/s00335-024-10102-y (2025).
12. Tabakoff, B., Smith, H., Vanderlinden, L. A., Hoffman, P. L. & Saba, L. M. Networking in Biology: The Hybrid Rat Diversity Panel. Methods Mol. Biol. 2018, 213–231 (2019).
13. Tabakoff, B., Hoffman, P. L. & Saba, L. The hybrid rat diversity panel for addiction research. 1.
14. de Jong, T. V. et al. A revamped rat reference genome improves the discovery of genetic diversity in laboratory rats. Cell Genom 4, 100527 (2024).
15. Pattee, J. et al. Evaluation and characterization of expression quantitative trait analysis methods in the Hybrid Rat Diversity Panel. Front. Genet. 13, 947423 (2022).
16. Duttke, S. H. et al. Glucocorticoid Receptor-Regulated Enhancers Play a Central Role in the Gene Regulatory Networks Underlying Drug Addiction. Front. Neurosci. 16, 858427 (2022).
17. Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst 3, 95–98 (2016).
18. Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10 (2021).
19. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 [q-bio] (2013).
20. https://identifiers.org/ncbi/insdc.sra:SRR31858083.
21. https://identifiers.org/ncbi/insdc.sra:SRR31858082.
22. https://identifiers.org/ncbi/insdc.sra:SRR34514257.
23. https://identifiers.org/ncbi/insdc.sra:SRR31858086.
24. https://identifiers.org/ncbi/insdc.sra:SRR31858081.
25. https://identifiers.org/ncbi/insdc.sra:SRR34514258.
26. https://identifiers.org/ncbi/insdc.sra:SRR34514260.
27. https://identifiers.org/ncbi/insdc.sra:SRR31858085.
28. https://identifiers.org/ncbi/insdc.sra:SRR31858084.
29. https://identifiers.org/ncbi/insdc.sra:SRR34514259.

Acknowledgements

We gratefully acknowledge the contributions to the Hybrid Rat Diversity Panel (HRDP): Dr. Melinda R. Dwinell (Medical College of Wisconsin) provided the breeders; the Center for Integrative and Translational Genomics at the University of Tennessee Health Science Center (UTHSC) supported colony maintenance; and Mr. Angel Garcia Martinez and Ms. Caroline Jones assisted with breeding. Sequencing data were generated by the UTHSC Molecular Resource Center of Excellence and University of Tennessee Genomics Core. Computational analyses utilized the University of Tennessee Infrastructure for Scientific Applications and Advanced Computing (ISAAC). This work was supported by grant U01 DA-053672 from NIH/NIDA (to B.M.S., R.W.W., and H.C.).

Author information

Authors and Affiliations

Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, 38103, USA
Panjun Kim, Rachel R. Ward, Burt M. Sharp & Robert W. Williams
Department of Pharmacology, Addiction Science, and Toxicology, University of Tennessee Health Science Center, Memphis, TN, 38103, USA
Hao Chen

Contributions

P.K. performed data analysis and drafted the manuscript. R.R.W. generated the Hi-C data and contributed to manuscript preparation. H.C. oversaw data generation and analysis. H.C., R.W.W. and B.M.S. designed the experiments and revised the manuscript.

Corresponding author

Correspondence to Hao Chen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Kim, P., Ward, R.R., Sharp, B.M. et al. Hi-C sequencing data from frontal cortex of laboratory rats. Sci Data 12, 1910 (2025). https://doi.org/10.1038/s41597-025-06173-4

Received: 11 April 2025
Accepted: 21 October 2025
Published: 04 December 2025
Version of record: 04 December 2025
DOI: https://doi.org/10.1038/s41597-025-06173-4