Background & Summary

High-throughput chromosome conformation capture (Hi-C) is a technique for studying the three-dimensional (3D) organization of the genome1. The method identifies regions of the genome that are physically close to one another inside the cell nucleus. This is achieved by first crosslinking the chromatin to fix its 3D structure. The DNA is then digested, and fragments that are in close proximity are ligated together. These newly joined DNA molecules, which represent interacting genomic regions, are identified through next-generation paired-end sequencing2.

Hi-C data are frequently used for de novo genome assembly. The frequency of interactions between DNA fragments provides information about their spatial proximity, which allows smaller sequenced fragments (contigs) to be ordered and oriented into chromosome-length assemblies. This improves the contiguity and accuracy of the resulting genome. The approach is effective for assembling complex genomes, such as those of polyploid species3, and can also assist in separating the contributions of homologous chromosomes in a process known as haplotype phasing4.

Additionally, Hi-C data facilitate the detection of large-scale structural variants (SVs), such as deletions, duplications, insertions, inversions, and translocations. These variants, which span more than 50 base pairs5, can be difficult to identify with standard short-read sequencing methods6. Since SVs alter the expected physical arrangement of the genome, they produce distinct changes in Hi-C interaction patterns. These changes serve as signatures for identifying the location and type of structural variation7.

Hi-C is also a primary tool for mapping chromatin loops, which are 3D structures formed when distant genomic regions interact. These loops often bring gene promoters into contact with distal regulatory elements like enhancers or silencers, playing a key role in regulating gene expression8. Architectural proteins such as CTCF and cohesin frequently bind to specific DNA motifs to stabilize these interactions9. Hi-C therefore provides genome-wide maps of the physical interactions that are important for understanding long-range gene regulation.

We have generated Hi-C datasets from the frontal cortex tissue of ten distinct rat strains. This collection includes significant genetic diversity from the HXB/BXH recombinant inbred strains10, which are a component of the Hybrid Rat Diversity Panel (HRDP)11,12. The HRDP is a resource for systems genetics that facilitates high-resolution mapping of complex traits and allows for integrative analyses of genomic, transcriptomic, and phenotypic data11,13. The HRDP contains both recombinant and genetically diverse inbred strains, which support reproducible studies of behavior, physiology, and disease. We expect these Hi-C datasets will aid in rat genome assembly, the discovery of structural variation, and the analysis of regulatory architecture. They can also be integrated with other ‘omic’ datasets available for this model organism14,15,16.

Methods

Animals

Rats were bred at the University of Tennessee Health Science Center from breeders supplied by the Medical College of Wisconsin. All rats were group-housed under standard laboratory conditions and had not been subjected to behavioral or drug treatments. Euthanasia was performed via isoflurane overdose, followed by rapid brain removal. The average age at euthanasia was 139 days. Brains were immediately frozen using freeze spray. All procedures were approved by the Institutional Animal Care and Use Committee of the University of Tennessee Health Science Center and followed the NIH Guidelines concerning the Care and Use of Laboratory Animals.

Tissue processing, Hi-C library preparation, and sequencing

Frozen rat brains were transferred from −80 °C storage to −20 °C for at least 30 minutes, then cryosectioned at −15 °C into 120 µm slices. On a −10 °C cold block, 30–200 mg of cortex tissue was microdissected using an 18-gauge needle with a bent tip and forceps as needed. Following extensive troubleshooting, the optimal tissue input weight was determined to be approximately 50 mg. The bilateral cortex tissue was immediately transferred to pre-chilled 1.5 mL DNA LoBind tubes (Eppendorf) and kept on dry ice until all tissue was collected. The tissue was then pulverized in small quantities using a Cellcrusher-mini pre-cooled in liquid nitrogen. After all tissue was pulverized, 1X PBS was used to collect the sample from the device.

Tissue samples were processed according to the Arima Hi-C+ Kit protocol. Samples that passed the first quality control step (QC1) in the Arima protocol continued to the library preparation stage using the KAPA HyperPrep Kit. Samples were fragmented using a Covaris S2 instrument, with settings for each library detailed in Table 1. The fragmented samples were analyzed with an Agilent Bioanalyzer to confirm a median fragment size between 200 and 600 bp.

Table 1 Settings for Covaris instrument used for each library.

Size selection methods varied across samples because the procedure was refined through extensive troubleshooting. The optimal procedure was determined to be a two-sided selection using fresh AMPure XP beads, which aimed for a narrow peak with a 400 bp median fragment size. Steps involving AMPure XP beads were performed with either a DynaMag for 1.5 mL tubes or a magnetic rack from Parse Biosciences for PCR-sized tubes. Magnetic 96-well plates were unsuitable because they resulted in bead loss.

Library preparation was completed using the KAPA HyperPrep Kit with either Illumina TruSeq DNA Single Indexes or Illumina TruSeq UDI v2. Pre-amplification libraries were quantified using the KAPA Library Quantification Kit on a LightCycler 480. Amplified libraries and samples at all other quantification steps were analyzed using the Qubit HS DNA Assay and an Agilent Bioanalyzer HS DNA chip. A complete list of reagents and part numbers is available in Table 2.

Table 2 Kits and reagents for library preparation.

Generating prerequisite files for the Juicer pipeline

The Juicer pipeline (v1.6)17 requires three input files in addition to the FASTQ sequencing data: a reference genome, a chromosome size file, and a restriction site file.

a. Reference genome: The mRatBN7.2/rn7 genome was used as the reference. The file (“rn7.fa.gz”) was downloaded from the UCSC Genome Browser. To improve mapping specificity, only canonical chromosomes were retained, excluding unlocalized and unplaced sequences.

b. Chromosome size file: The reference genome was indexed using the faidx subcommand of samtools (v1.16.1)18. The chromosome names and sizes were extracted from the first two columns of the resulting .fai file.

c. Restriction enzyme site file: A file containing the genomic locations of restriction enzyme cleavage sites was generated using the generate_site_positions.py script from the Juicer pipeline. The four recognition sequences used in the Arima Genomics protocol (GATC, GANTC, CTNAG, and TTAA) were supplied as parameters along with the reference genome.
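The logic behind the chromosome size file and the restriction site file can be illustrated with a short Python sketch. This is not the code used in this study, which relied on samtools faidx and Juicer's generate_site_positions.py. The function names, the regular expression for canonical chromosome names (chr1, chr2, ..., chrX, chrY, chrM), and the 1-based position convention are assumptions made for illustration only.

```python
import re

# The four recognition sequences listed for the Arima protocol; "N" is any base.
ARIMA_MOTIFS = ["GATC", "GANTC", "CTNAG", "TTAA"]

# Assumption: canonical chromosomes are named chr<number>, chrX, chrY or chrM.
CANONICAL = re.compile(r"^chr(\d+|X|Y|M)$")


def chrom_sizes_from_fai(fai_text):
    """Return (name, length) pairs from the first two columns of a .fai index,
    keeping only names that look like canonical chromosomes."""
    sizes = []
    for line in fai_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        name, length = fields[0], int(fields[1])
        if CANONICAL.match(name):
            sizes.append((name, length))
    return sizes


def motif_to_regex(motif):
    """Expand the IUPAC wildcard N into a character class."""
    return motif.replace("N", "[ACGT]")


def site_positions(sequence, motifs=ARIMA_MOTIFS):
    """Return sorted 1-based start positions of every motif occurrence.

    A lookahead is used so that overlapping occurrences are all reported.
    """
    alternatives = "|".join(motif_to_regex(m) for m in motifs)
    pattern = re.compile("(?=(" + alternatives + "))")
    return sorted({m.start() + 1 for m in pattern.finditer(sequence.upper())})


if __name__ == "__main__":
    # Toy sequence containing one occurrence of each motif.
    print(site_positions("AAGATCTTAAGACTCCTGAG"))  # → [3, 7, 11, 16]
```

The position convention and output layout of generate_site_positions.py may differ from this sketch, so the script itself should be used when preparing input for Juicer.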

Processing Hi-C data using the Juicer pipeline

The Hi-C sequencing data and the three prerequisite files were processed using the Juicer pipeline (https://github.com/aidenlab/juicer/releases/tag/1.6) with a juicer_tools.jar file (https://github.com/aidenlab/juicer/wiki/Download). The pipeline automates read alignment, filtering of invalid reads, and generation of contact maps.

A. Alignment. Paired-end FASTQ files were aligned to the reference genome using BWA-MEM (v0.7.17)19 with Juicer’s default parameters (-SP5M). No pre-alignment trimming was performed.

B. Filtering. After alignment, reads were classified as normal paired, chimeric paired, chimeric ambiguous (mapped equally well to multiple locations), or unmapped. Chimeric ambiguous and unmapped reads were excluded from downstream analyses.

C. Duplicate removal. PCR and optical duplicates were identified and removed using Juicer’s dups.awk module. This step produced the final non-redundant dataset (merged_nodups.txt).

D. Generating contact matrices. The alignments were then used to generate contact matrices, stored in .hic files, at two mapping quality (MAPQ) thresholds. The inter.hic file includes reads with MAPQ ≥ 1, while the high-stringency inter_30.hic file includes only reads with MAPQ ≥ 30. The latter file, containing uniquely mapped reads, was used for downstream analyses.
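The effect of the two mapping quality thresholds can be sketched as a filter over deduplicated contact records. The example below is an illustration, not Juicer's own implementation. It assumes a whitespace-delimited record layout in which the mapping qualities of the two read ends are the 9th and 12th fields, and it assumes that a pair is retained when the lower of its two mapping qualities meets the threshold. Both assumptions, and the function names, should be checked against the Juicer documentation before use.

```python
# Assumed record layout (0-based field indices): the mapping qualities of
# the two read ends are in fields 8 and 11.
MAPQ_FIELDS = (8, 11)


def pair_mapq(record_line, mapq_fields=MAPQ_FIELDS):
    """Return the lower of the two read-end mapping qualities."""
    fields = record_line.split()
    return min(int(fields[i]) for i in mapq_fields)


def count_by_threshold(lines, thresholds=(1, 30)):
    """Count records whose pair-level mapping quality meets each threshold."""
    counts = {t: 0 for t in thresholds}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        quality = pair_mapq(line)
        for t in thresholds:
            if quality >= t:
                counts[t] += 1
    return counts
```

Under these assumptions, the count at threshold 1 corresponds to the pairs eligible for inter.hic and the count at threshold 30 to those eligible for inter_30.hic.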

Data Records

The dataset is publicly available through the NIH Sequence Read Archive (SRA)20,21,22,23,24,25,26,27,28,29. Accession numbers are provided in Table 3. The Hi-C sequencing data are organized into three groups: (1) four individual rats from the parental strains of the Hybrid Rat Diversity Panel’s recombinant inbred (RI) lines, (2) five RI strains, and (3) one F1 hybrid (SHR/OlaIpcv × BN/NHsdMcwi). The deposited data consist of paired-end raw sequencing reads in fastq.gz format.

Table 3 Statistics of Hi-C sequencing after running the Juicer pipeline.

Technical Validation

Hi-C sequencing data were processed with the Juicer pipeline, using the mRatBN7.2/rn7 reference genome for alignment. Quality control metrics for each sample are detailed in Table 3. The experimental goal was to generate 500–600 million (M) paired-end reads per sample. The average yield was 623.39 M reads per sample. Two samples, LE/Stm and F344/Stm, had lower sequencing depth due to challenges in quantifying the library concentration, which led to imbalanced output in a pooled sequencing run. However, these two samples produced a high proportion of unique reads (76.51%) compared to the sample average of 60.85%.

On average across all samples, only 1.31% of reads failed to map to the reference genome. The following percentages are likewise expressed as a share of total reads. An average of 41.30% were normal (non-chimeric) pairs derived from single genomic fragments. A further 57.40% were chimeric, containing sequences from different genomic locations. These chimeric reads comprised 47.34% of total reads that mapped to a unique genomic location and 10.06% that mapped ambiguously to multiple locations.

The alignable read pairs, which include both normal pairs and uniquely mapped chimeric pairs, constituted an average of 88.64% of the total reads. From this set, optical duplicates accounted for a small fraction (0.83% of total reads), while PCR duplicates were more common (23.83% of total reads). After removing duplicates, the final set of unique, alignable read pairs used for analysis represented an average of 63.98% of the initial total reads.
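The relationship between these averages can be checked with simple arithmetic when each value is taken as a percentage of total reads. The short Python sketch below uses only the figures reported in the text; the variable names are illustrative.

```python
# Average percentages of total reads, as reported in the text.
normal_pairs = 41.30
chimeric_unique = 47.34
chimeric_ambiguous = 10.06
unmapped = 1.31
optical_duplicates = 0.83
pcr_duplicates = 23.83

# Alignable pairs = normal pairs + uniquely mapped chimeric pairs.
alignable = round(normal_pairs + chimeric_unique, 2)

# Unique alignable pairs = alignable pairs minus both duplicate classes.
unique_alignable = round(alignable - optical_duplicates - pcr_duplicates, 2)

# All four read categories together should account for about 100% of reads.
all_categories = round(alignable + chimeric_ambiguous + unmapped, 2)

print(alignable)         # → 88.64
print(unique_alignable)  # → 63.98
print(all_categories)    # → 100.01
```

The four categories sum to 100.01% rather than exactly 100%, which is consistent with rounding of the per-category averages.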

For high-confidence contact analysis, these unique read pairs were filtered based on mapping quality (MAPQ). Pairs with a MAPQ score below 30 were excluded, which removed an average of 22.32% of the unique pairs. The remaining 77.68% of unique pairs, averaging 297.38 M high-quality pairs per sample, were used to construct the final Hi-C contact maps.

Analysis of the high-quality pairs showed that inter-chromosomal contacts (interactions between different chromosomes) accounted for an average of 52.28 M pairs per sample (17.6%). The majority, 245.10 M pairs (82.4%), were intra-chromosomal contacts (interactions within the same chromosome). These intra-chromosomal interactions were further divided by the linear genomic distance between contacting sites. Short-range interactions (<20 kb) comprised an average of 119.65 M pairs, while long-range interactions (>20 kb) comprised an average of 125.41 M pairs.
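The contact categories described above can be expressed as a simple classification rule. The following Python sketch is illustrative and is not the code used by Juicer to produce its statistics. The function names are hypothetical, and assigning a separation of exactly 20 kb to the long-range class is an assumption, since the text specifies only <20 kb and >20 kb.

```python
from collections import Counter

SHORT_RANGE_CUTOFF = 20_000  # base pairs


def classify_contact(chrom1, pos1, chrom2, pos2, cutoff=SHORT_RANGE_CUTOFF):
    """Label a contact as inter-chromosomal, or as a short- or long-range
    intra-chromosomal contact based on linear genomic distance."""
    if chrom1 != chrom2:
        return "inter"
    if abs(pos2 - pos1) < cutoff:
        return "intra_short"
    return "intra_long"  # assumption: exactly 20 kb counts as long-range


def summarize(contacts):
    """Return {category: (count, percent of all contacts)}."""
    counts = Counter(classify_contact(*c) for c in contacts)
    total = sum(counts.values())
    return {k: (n, 100.0 * n / total) for k, n in counts.items()}
```

Applied to the reported averages, 52.28 M inter-chromosomal pairs out of 297.38 M high-quality pairs corresponds to approximately 17.6%.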

Taken together, these processing steps and the resulting metrics indicate that the Hi-C dataset is of high quality, meeting or exceeding the benchmark recommendations provided by Arima Genomics. Key criteria met include a low percentage of unmapped reads (1.31% observed vs. <6% recommended), a limited number of ambiguous chimeric reads (10.06% observed vs. <20% recommended), a high proportion of alignable reads (with over 98% of reads initially mapped, compared to a recommendation of >80%), and an appropriate fraction of inter-chromosomal contacts (approximately 17.6% observed vs. ~20% recommended). These metrics suggest that the dataset provides a solid foundation for investigating the 3D genomic architecture and functional genomics of the rat.
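The comparison against the recommended thresholds can be written as a small check. The Python sketch below encodes only the thresholds and observed averages quoted in this section. The dictionary keys and function name are illustrative, and the mapped percentage is derived here as 100% minus the unmapped percentage, which is an assumption about how that figure is defined. The inter-chromosomal fraction is omitted because its recommendation (~20%) is a target rather than a bound.

```python
import operator

# (comparison, threshold in percent) for each recommendation quoted in the text.
BENCHMARKS = {
    "unmapped_pct": (operator.lt, 6.0),
    "chimeric_ambiguous_pct": (operator.lt, 20.0),
    "mapped_pct": (operator.gt, 80.0),
}

# Observed averages from this section.
OBSERVED = {
    "unmapped_pct": 1.31,
    "chimeric_ambiguous_pct": 10.06,
    "mapped_pct": 100.0 - 1.31,  # assumption: mapped = 100% - unmapped
}


def check_benchmarks(observed, benchmarks=BENCHMARKS):
    """Return {metric: True/False} for each benchmark comparison."""
    return {
        name: compare(observed[name], threshold)
        for name, (compare, threshold) in benchmarks.items()
    }


if __name__ == "__main__":
    for metric, passed in check_benchmarks(OBSERVED).items():
        print(metric, "pass" if passed else "fail")
```

The same check could be applied per sample using the values in Table 3 rather than the across-sample averages.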