Young KRAB-zinc finger gene clusters are highly dynamic incubators of ERV-driven genetic heterogeneity in mice

Bruno, Melania; Farhana, Sharaf M.; Mitra, Apratim; Costello, Kevin; Watkins-Chow, Dawn E.; Logsdon, Glennis A.; Gambogi, Craig W.; Dumont, Beth L.; Black, Ben E.; Keane, Thomas M.; Ferguson-Smith, Anne C.; Dale, Ryan K.; Macfarlan, Todd S.

doi:10.1038/s41467-025-64609-2

Published: 30 October 2025

Nature Communications volume 16, Article number: 9608 (2025)

Abstract

KRAB-zinc finger proteins (KZFPs) comprise the largest family of mammalian transcription factors, rapidly evolving within and between species. Most KZFPs in humans and mice have been found to repress endogenous retroviruses (ERVs) and other retrotransposons, with KZFP gene numbers correlating with the ERV load across species, suggesting coevolution. Whether new KZFPs emerge in response to ERV invasions is currently unknown. Using a combination of long-read sequencing technologies and genome assembly, we present a detailed comparative analysis of young KZFP gene clusters in the mouse lineage, which has undergone recent KZFP gene expansion and ERV infiltration. Detailed annotation of KZFP genes in a cluster on Mus musculus Chromosome 4 reveals parallel expansion and diversification of this locus in different mouse strains (C57BL/6J, 129S1/SvImJ and CAST/EiJ) and species (Mus spretus and Mus pahari). Our data support a model by which new ERV integrations within young KZFP gene clusters likely promoted recombination events leading to the emergence of new KZFPs that repress them. At the same time, ERVs also increased their numbers by duplication rather than by retrotransposition alone, revealing a new mechanism for ERV enrichment at these loci.

Introduction

KRAB-zinc finger proteins (KZFPs) comprise a large family of transcription factors, numbering in the hundreds in most mammalian genomes¹. KZFPs are characterized by a variable array of tandem C2H2 zinc fingers conferring DNA binding specificity, and a Krüppel-associated box (KRAB) domain recruiting the SUMO ligase KAP1/TRIM28, which engages with several heterochromatin-forming complexes to induce gene silencing². While a few evolutionarily old and conserved KZFPs have been shown to play essential roles in core developmental processes like embryonic development³, imprinting^4,5 and meiotic hotspot determination^6,7, the vast majority of KZFPs, both young and old, bind to and repress transposable elements (TEs)¹. In particular, several evolutionarily young, clade- or species-specific KZFPs have been shown to silence endogenous retroviruses (ERVs) of similar age^8,9. This finding, together with a striking correlation between the number of zinc finger genes and the load of ERVs in different mammalian genomes, has led to a model of KZFP and ERV coevolution¹⁰. ERVs are the genetic remnants of retroviral integrations in the germline that, when passed on through generations, can lead to the rewiring of gene regulatory networks in the host genome across evolutionary time^11,12. Whereas new ERV subfamilies can establish themselves either from new viral infections or by diversification from other ERVs, it is not known how new KZFP genes coevolve with new waves of ERV infiltration of the host genome. Genomic analyses have shown that KZFP genes evolve by gene duplications that give rise to highly repetitive KZFP gene clusters^13,14; however, the precise evolutionary dynamics of how these clusters respond to ERV colonization to promote the emergence of new KZFP genes are still poorly understood.

The mouse lineage has been recently colonized by several murine-specific ERVs¹⁵. Concurrently, mice and other Eumuroida species have undergone dramatic expansions of their KZFP gene repertoire, with hundreds of species-specific KZFP units. In contrast, other clades, such as Hominoidea, exhibit much less extensive expansions (Fig. 1a, Supplementary Fig. 1a). There is growing evidence that different Mus musculus strains and even sub-strains possess KZFP gene variants that explain diverse regulation of specific ERV families and their nearby genes^9,16,17,18,19,20,21,22,23. Although it is known that KZFPs serve as an important regulatory layer for ERV repression, the KZFP gene clusters in mice remain relatively understudied genomic loci. This is largely due to gaps in the available genome assemblies, which hinder comprehensive analysis of these loci and obscure the full extent of KZFP gene diversity and evolution across the mouse lineage.

**Fig. 1: Mouse displays several evolutionarily young KZFP gene clusters with partially unknown sequence.**

In this study, we leverage the power of long-read sequencing technologies to investigate the content of young mouse KZFP gene clusters and uncover dynamics of their rapid evolution and divergence. We show that the integration of ERVs new to the mouse lineage within evolutionarily young KZFP gene clusters likely promoted recombination events leading to the emergence of new KZFPs that repress them. Simultaneously, ERV copies also expanded by duplication in addition to retrotransposition, revealing a previously unrecognized mechanism driving ERV enrichment at these loci.

Results

De novo Mus musculus assemblies reveal a much larger KZFP gene cluster at the end of Chromosome 4

KZFP genes are organized in genomic clusters on several chromosomes in the Mus musculus genome (Fig. 1b). While some KZFP gene clusters primarily comprise old genes shared across species, others, such as the double-cluster on Chromosome 12 (Chr12), harbor genes entirely unique to mice. A few clusters, like those at the end of Chr2 and Chr4, contain at least one KZFP gene shared with other rodent species while mostly encoding KZFP genes unique to mice. The KZFP cluster at the distal end of Chr4 stands out for several reasons. This KZFP gene cluster appears to be specific to the Murinae clade, likely originating in the last common ancestor of rats and mice (Fig. 1c). Despite the conserved syntenic block defined by Tnfrsf8 and Miip genes flanking this locus, the cluster is absent in other closely related muroids, such as gerbils (Fig. 1c, Supplementary Fig. 1b). Comparative analysis between mouse and rat reveals that this region expanded significantly in the mouse lineage, acquiring multiple new KZFP genes. Finally, this cluster has been repeatedly implicated in studies mapping modifier loci for variably regulated ERVs across mouse strains^9,16,23, making it an excellent model to explore the evolution and diversification of KZFP gene clusters. Extensive analyses of this locus have been hindered by persistent sequence gaps in this region even in the current GRCm39 reference assembly, the size and content of which have remained largely undefined (Fig. 1d, Supplementary Fig. 1c).

Thus, we generated de novo genome assemblies to fill the gaps in this locus and other young KZFP gene clusters for two widely used laboratory mouse strains, C57BL/6J (BL6J) and 129S1/SvImJ (129S1), by combining the high sequencing accuracy of PacBio HiFi sequencing with the ultralong reads of ONT sequencing (Fig. 2a). While our primary goal was to resolve gaps in KZFP gene cluster loci, the resulting assemblies achieved high overall quality as assessed by the Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis²⁴, with several contigs spanning entire chromosomes (Fig. 2b, c, Supplementary Fig. 2a, b). To refine the assemblies, we retained and strand-corrected contigs aligned to known locations in the GRCm39 reference, removing unplaced and redundant nested contigs (Supplementary Fig. 2a, b).

**Fig. 2: De novo assemblies for the C57BL/6J (BL6J) and 129S1/SvImJ (129S1) strains reveal that the Chr4 KZFP gene cluster is much larger than previously estimated.**

Focusing on the KZFP gene cluster on Chr4, we uncovered over 2.5 Mb of previously unassembled sequence in BL6J, resolving five gaps in the GRCm39 reference. This expanded the total cluster size from 2.8 Mb to 5.4 Mb and revealed new regions of high sequence similarity within this locus (Fig. 2d). De novo transcriptome assembly, combining published RNA-seq data²⁵ with newly generated PacBio IsoSeq data, allowed us to curate and complete the annotation of this locus. We identified 38 previously unannotated KZFP genes and resolved the full sequence of three genes (Zfp986, Zfp993, and Gm21411) whose zinc finger arrays were incomplete due to sequence gaps (Fig. 2d, Supplementary Data 1).

The identification of novel sequence nearly doubling the size of this locus underscores a key limitation of interpreting genomic datasets using the GRCm39 reference assembly, where short-read mapping of ChIP-seq and RNA-seq experiments can result in artificial read pileups when reads derived from gap sequences are misaligned to the highly similar sequences available in the reference assembly (Supplementary Fig. 2c, d).

Interestingly, the Chr4 KZFP gene cluster is even larger in the 129S1 strain than in BL6J, spanning 6.9 Mb and containing 83 KZFP genes (Fig. 2e, Supplementary Data 1). Accuracy of the de novo assemblies at the Chr4 KZFP gene cluster (as well as at another double cluster on Chr12) was confirmed by inspecting the alignment of strain-specific long ONT reads (>100 kb) over these loci in the corresponding assembly (Supplementary Fig. 3).

Complete telomere-to-telomere (T2T) assemblies for C57BL/6J and CAST/EiJ (CAST) mouse strains have recently become available²⁶, allowing us to extend our comparative analysis to a third mouse strain. Applying a similar gene annotation and curation strategy to the CAST Chr4 KZFP gene cluster, we identified 86 KZFP genes in a 6.6 Mb region (Supplementary Data 1). Importantly, despite differences in sequencing and assembly strategies, our BL6J assembly is structurally identical to the T2T C57BL/6J assembly at the Chr4 KZFP gene cluster, as well as at other young clusters (Supplementary Fig. 4). This suggests that the large differences in KZFP gene cluster size observed between the three mouse strains are due to strain-specific, rather than individual, locus divergence.

The Chr4 KZFP gene cluster has been independently expanding in different mouse strains and species

Sequence comparison of the BL6J, 129S1, and CAST Chr4 KZFP gene cluster revealed that this locus is fundamentally heterogeneous between the three mouse strains, beyond the size difference (Fig. 3a, b). While the beginning and end of the cluster are conserved, most of the locus is rearranged, with some regions of sequence similarity scrambled across the cluster and variably duplicated in the three strains. While the overall sequence comparison hints at high divergence of this locus in the three mouse strains, detailed comparison of the curated KZFP gene annotation in this cluster further revealed a disparate KZFP gene repertoire (Fig. 3c, d). Fingerprint amino acids, corresponding to the amino acids at positions −1, +2, +3, and +6 within each zinc finger according to helical nomenclature, are major determinants of KZFP DNA binding specificity, as they directly contact the target nucleotide sequence. Thus, we focused on the arrays of fingerprint amino acids of the coding KZFP genes and identified the repertoire of distinct fingerprint arrays in the Chr4 cluster of each mouse strain (Supplementary Data 2). Several fingerprint arrays were shared across multiple KZFPs and we identified 38, 47, and 45 distinct Chr4 KZFP fingerprint arrays in the BL6J, 129S1, and CAST strains, respectively. Only a small number of fingerprint arrays were found to have an exact match across different strains; instead, the majority of fingerprint arrays are unique to the individual strains (Fig. 3d). Even fingerprint arrays shared between strains often have different representation, with varying numbers of KZFP copies in each strain (Supplementary Data 2). Our analysis indicates that the Chr4 KZFP gene cluster has undergone parallel evolution in the BL6J, 129S1, and CAST mouse strains, marked by independent duplication events within the locus. This conclusion is further supported by the distinct patterns of self-identical sequences observed in the locus across the three strains (Fig. 3e).
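As an illustration of what a fingerprint array is, the following minimal Python sketch extracts the residues at helix positions −1, +2, +3, and +6 from each C2H2 zinc finger in a protein sequence. It is not the annotation pipeline used in the study: the regular expression assumes the canonical C-X(2-4)-C-X(12)-H-X(3-5)-H spacing, the function name `fingerprint_array` is hypothetical, and the example sequence is a toy two-finger sequence modeled on classical C2H2 fingers. With this spacing the first histidine sits at helix position +7, so the four fingerprint residues are found 7, 5, 4, and 1 positions before it.

```python
import re

# Canonical C2H2 finger: C-X(2-4)-C-X(12)-H-X(3-5)-H (an assumption; real
# annotation tools use profile HMMs and tolerate non-canonical spacing).
C2H2 = re.compile(r"C.{2,4}C(.{12})H.{3,5}H")


def fingerprint_array(protein: str) -> list[str]:
    """Return one 4-letter fingerprint (-1, +2, +3, +6) per detected finger."""
    fingerprints = []
    for match in C2H2.finditer(protein):
        linker = match.group(1)  # the 12 residues between the 2nd Cys and 1st His
        # The first His is helix position +7, so within the 12-residue stretch:
        # index 5 -> -1, index 7 -> +2, index 8 -> +3, index 11 -> +6.
        fingerprints.append(linker[5] + linker[7] + linker[8] + linker[11])
    return fingerprints


# Toy two-finger sequence (illustrative only).
toy = "YACPVESCDRRFSRSDELTRHIRIHTGQKPFQCRICMRNFSRSDHLTTHIRTHTGEKP"
print(fingerprint_array(toy))  # ['RDER', 'RDHT']
```

Two genes can then be treated as sharing a fingerprint array when their lists compare equal, which is one simple way to count distinct arrays per strain.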

**Fig. 3: The Chr4 KZFP gene cluster is highly heterogeneous between mouse strains and species.**

Since the Mus musculus Chr4 KZFP gene cluster locus is much larger than the corresponding locus in rat, we investigated whether this cluster expansion is unique to Mus musculus or if it also occurred in other mouse species. To explore this, we generated a de novo assembly for Mus spretus using a partially inbred strain (SPRET2) and also extended our analysis to a de novo assembly of Mus pahari²⁷. These assemblies completely spanned the locus corresponding to Chr4 KZFP gene cluster, as well as other KZFP gene clusters analyzed in this study, allowing comparison of gapless sequences. Sequence comparisons among Rattus norvegicus, Mus pahari, Mus spretus, and the three Mus musculus strains reveal that the independent expansion of the Chr4 KZFP gene cluster is a common feature of the mouse lineage compared to rat. However, the Mus musculus strains exhibit substantially larger cluster sizes than those observed in Mus spretus and Mus pahari (Fig. 3f).

The Chr4 KZFP gene cluster rapidly expanded by large segmental duplications encompassing multiple KZFP genes

To explore modes of rapid locus expansion, we traced the regions of segmental duplications within the Chr4 KZFP gene cluster. Due to the highly repetitive nature of this locus, many short sequence stretches exhibit high similarity, as revealed by various strategies for identifying self-identical sequences (Fig. 2d, e, Fig. 3e). However, detailed annotation of the cluster’s gene content in each strain showed that several genes shared identical or highly similar fingerprint arrays (Supplementary Data 2). This raised the question of whether these genes duplicated independently or as groups. To investigate this, we compared the sequences of the whole 3’ exon (including both the portion encoding the zinc finger array and the 3’UTR) of all KZFP genes within the BL6J cluster. This strategy enabled us to compare the underlying DNA sequence of the exon that contributes the most to the individual KZFP gene identity, while disregarding their actual coding potential. Limiting our analysis to the coding region would have biased the comparison, as truncated arrays caused by isolated point mutations would appear highly dissimilar, even though the sequences are nearly identical (Supplementary Fig. 5). This analysis allowed us to identify similarity relationships between all the KZFP genes within the BL6J Chr4 cluster (Fig. 4a). By examining the gene positions within the cluster alongside their sequence similarity, we uncovered gene blocks, that is, groups of genes with high similarity that were located in different regions of the cluster. These gene blocks ranged in size, containing between 3 and 7 genes (Fig. 4b). Interestingly, we also found evidence of partial duplications suggestive of multiple rounds of interstitial segmental duplications.
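The idea of a duplicated gene block can be made concrete with a toy Python sketch. This section does not describe how blocks were called, so everything here is an assumption for illustration: each gene is reduced to a similarity-group label (for example, derived from clustering 3’ exon sequences), genes are listed in genomic order, and the hypothetical function `find_repeated_blocks` reports runs of at least three consecutive labels that recur elsewhere in the list without overlapping.

```python
def find_repeated_blocks(labels, min_len=3):
    """Find runs of consecutive similarity-group labels that occur twice.

    labels  -- similarity-group label of each gene, in genomic order
               (hypothetical input; not the study's data format)
    min_len -- minimum number of genes for a run to count as a block
    Returns a list of (first_start, second_start, length) tuples.
    """
    blocks = []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            # Skip pairs that merely continue a run starting one gene upstream.
            if i > 0 and labels[i - 1] == labels[j - 1]:
                continue
            k = 0
            # Extend the match while labels agree and the two copies do not overlap.
            while j + k < n and i + k < j and labels[i + k] == labels[j + k]:
                k += 1
            if k >= min_len:
                blocks.append((i, j, k))
    return blocks


# Toy gene order: the block A-B-C appears twice, separated by unrelated genes.
gene_groups = ["A", "B", "C", "D", "X", "A", "B", "C", "Y"]
print(find_repeated_blocks(gene_groups))  # [(0, 5, 3)]
```

Exact label matching is a deliberate simplification; partial duplications or diverged copies would need a tolerance for mismatches and gaps.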

**Fig. 4: The Chr4 KZFP gene cluster expanded by large segmental duplications spanning several genes and TEs.**

Collectively, this analysis revealed that the Chr4 KZFP gene cluster is a highly recombinogenic locus and that large segmental duplications have been responsible for the rapid expansion of the Mus musculus locus.

Chromosomal position and meiotic recombination are not major drivers of KZFP gene cluster expansion and heterogeneity in mouse strains

We next sought to identify features that may have contributed to the recombination and segmental duplications driving the expansion and divergence of the Chr4 KZFP gene cluster in the mouse lineage. Several molecular mechanisms can be responsible for segmental duplications^28,29,30. It has been observed that recombination and segmental duplication events in humans are more frequent in genomic regions close to the subtelomeres^31,32. Given the proximity of the Chr4 KZFP gene cluster to the telomeric region, we investigated whether its chromosomal position might have played a crucial role in its recombinogenic potential. To this end, we analyzed two additional young KZFP gene clusters: one on Chr2 located at a similar distance from the telomere as the Chr4 KZFP gene cluster, and a non-telomeric KZFP gene double-cluster on Chr12 located much closer to the centromere (Fig. 1b). Like the Chr4 cluster, the KZFP gene cluster at the end of Chr2 had several gaps in both the GRCm39 reference assembly and the 129S1.v3 assembly (Supplementary Fig. 6a–c). Surprisingly, we observed that after filling the gaps, the BL6J locus is slightly smaller than initially predicted in the GRCm39 reference. A comparison of this KZFP gene cluster across mouse strains revealed that the locus is nearly identical between the BL6J and 129S1 Mus musculus strains but shows dramatic expansion in the CAST strain (Supplementary Fig. 6d, e).

In contrast, the KZFP gene double-cluster locus on Chr12 exhibited pronounced heterogeneity among the three examined Mus musculus strains (Supplementary Fig. 7). Notably, this double-cluster is specific to the mouse lineage and is entirely absent in rat and other species. Interestingly, this locus not only underwent independent expansion in mice, but each of the two clusters expanded independently as well, as demonstrated by the varying sizes of the clusters in different mouse strains and species (Supplementary Fig. 7d, e). Furthermore, the CAST strain displayed evidence of recombination between the two KZFP clusters within this locus, resulting in an inversion detectable by the change in orientation of the non-KZFP genes located between the clusters. These findings indicate that the chromosomal position of the KZFP gene clusters does not necessarily dictate the recombinogenic potential of these loci: the Chr2 KZFP gene cluster, which is located near the telomere, shows little heterogeneity between BL6J and 129S1 mouse strains, while the Chr12 double-cluster, positioned far from the telomere, exhibits high heterogeneity across all three mouse strains analyzed. Moreover, inbreeding of Mus musculus strains also does not appear to be the primary driver of KZFP gene cluster expansion. This is demonstrated by the striking expansion of the Chr12 double-cluster in Mus spretus.

Since meiotic recombination can also promote the emergence of structural variants^33,34, we next investigated whether the young KZFP gene clusters are enriched for meiotic hotspots, which could explain the frequent recombination events and locus heterogeneity observed between mouse strains. To address this, we looked at available datasets from BL6J and CAST mouse testes for PRDM9 binding, which determines the position of meiotic double-strand breaks, and for DMC1 binding to single-stranded DNA (SSDS), which indicates loci undergoing DNA break repair during meiosis^35,36 (Supplementary Fig. 8). Young KZFP gene clusters in the BL6J and CAST strains do not appear to be significantly enriched for PRDM9 binding sites or for DMC1-bound single-stranded DNA during spermatogenesis; rather, both signals are largely excluded from these loci compared to adjacent genomic regions. This is consistent with the low meiotic recombination frequency also observed at zinc finger gene and repeat loci in humans³⁷. While low-frequency meiotic recombination events cannot be entirely excluded, meiotic recombination does not appear to be a major driver of the sequence divergence observed at these loci.

Young and divergent KZFP gene clusters harbor a high load of self-identical sequences compared to older conserved clusters

Since the KZFP gene clusters on Chr4, Chr2, and Chr12 are all young clusters, we also examined two evolutionarily older KZFP gene clusters: one located on Chr6 and the other on Chr7 in Mus musculus. These older clusters harbor KZFP genes that are conserved across mammals and are located on the respective chromosomes at positions similar to the Chr12 double-cluster locus (Fig. 1b). Both older clusters (Chr6 and Chr7) exhibit remarkable sequence similarity not only between the mouse strains and species analyzed but also with rat (Supplementary Fig. 9a, c). Furthermore, these older clusters display a much lower load (Chr6) or even absence (Chr7) of self-identical sequences (Supplementary Fig. 9b, d), compared to the younger KZFP gene clusters. While large stretches of high sequence similarity in the young KZFP gene clusters are likely a consequence of recently duplicated gene blocks, young KZFP gene clusters also contain an abundance of small self-identical repetitive sequences compared to the older and more conserved KZFP gene clusters.

Young mouse KZFP gene clusters expanded together with the infiltration of mouse ERVs at these loci

Repetitive sequences and TEs have been shown to contribute to structural variations, including large deletions, segmental duplications and chromosomal rearrangements^38,39,40, and previous studies have demonstrated that mouse KZFP gene clusters display TE enrichment, particularly enrichment of ERVs¹⁴. Thus, we hypothesized that the expansion and divergence of the young KZFP gene clusters might correlate with TE enrichment at these loci in the mouse lineage compared to rat. To address this, we compared the TE content across the genomes of rat, Mus pahari, Mus spretus, and the three Mus musculus strains. Genome-wide, we observed only minor differences in the overall abundance of different TE classes, with a subtle increase in LTR elements in the mouse lineage compared to rat (Supplementary Fig. 10a). However, when analyzing the TE content specifically within the Chr4 KZFP gene cluster, we found that the rat cluster is heavily enriched in LINE elements (45% of the cluster), whereas in mice, this locus acquired a higher LTR load (Supplementary Fig. 10c). A similar pattern was observed for the Chr2 KZFP gene cluster (Supplementary Fig. 10b). The LTR load was particularly striking at the Chr12 KZFP gene double-cluster locus in Mus spretus and in the three analyzed Mus musculus strains. This locus, which underwent the largest expansion in Mus spretus, is absent in rat and quite small in Mus pahari, where it displays a marked LINE-rich composition compared to Mus spretus and Mus musculus (Supplementary Fig. 10d). This trend is less prominent in the older KZFP gene clusters located on Chr6 and Chr7 (Supplementary Fig. 10e, f). Further analysis of the enrichment of individual LTR families at each examined KZFP gene cluster revealed that distinct mouse-specific ERVs have colonized different young clusters, particularly in Mus musculus and Mus spretus compared to Mus pahari and rat (Supplementary Fig. 10g, Supplementary Data 3). In contrast, older KZFP gene clusters in mice displayed enrichment for many fewer LTR elements, which were mostly shared with rat. Finally, analysis of the individual LINE families at the Chr4 KZFP gene cluster locus revealed that the LINE families with the highest representation at this cluster in mice are conserved at the corresponding rat locus (Supplementary Data 3), suggesting that this cluster might have originated as a LINE-rich region and rapidly expanded in the mouse lineage together with the infiltration of new, lineage-specific ERVs. Close inspection of the Chr4 KZFP gene cluster revealed examples of chimeric LINE and ERVs (Fig. 4e, f, Supplementary Fig. 11). These chimeric elements have been described as the result of TE-mediated non-allelic homologous recombination, in which the initial retrotransposition event can also be the cause of the initial DNA breaks, and are often associated with the establishment of structural variants^41,42,43. We found examples of these recombination scars at the boundaries of recombined and duplicated portions of the BL6J Chr4 KZFP gene cluster (Supplementary Fig. 11). Altogether, our analysis suggests that the infiltration of new ERVs in the young KZFP gene clusters in the mouse lineage likely increased the frequency of non-allelic homologous recombination events, thus providing a possible mechanism whereby ERVs directly promote the expansion and divergence of young KZFP gene clusters.

TE enrichment at young KZFP gene clusters is a consequence of the locus expansion by segmental duplication

Interestingly, while we observed increased content of mouse ERVs in several young KZFP gene clusters, we also found that the large duplicated gene blocks in the Chr4 KZFP gene cluster harbored copies of the TEs that had integrated prior to the segmental duplication events (Fig. 4c, d, Supplementary Fig. 11a). This suggests that the rapid gain of TE enrichment in KZFP gene clusters was driven by the segmental duplication events, rather than by retrotransposition events alone. Supporting this interpretation, we observed that despite the much smaller size of the Mus spretus Chr4 KZFP gene cluster locus relative to its Mus musculus counterpart, the overall enrichment for several ERV families remains similar between the two species (Supplementary Fig. 10g). This implies that the ERV load increased proportionally with locus expansion during duplication events in the mouse lineage. Further evidence for gain of TE enrichment independent of transposition comes from the enrichment of DNA transposons, prominent at the Chr12 KZFP double-cluster locus (Supplementary Fig. 10h, Supplementary Data 3). Because DNA transposons are mostly inactive fossil elements in mammals^44,45, and even the few active ones replicate via a cut-and-paste rather than a copy-and-paste mechanism, their increased enrichment suggests that pre-existing copies were duplicated during segmental duplication of the surrounding genomic regions.

To better understand these patterns, we focused on the BL6J Chr4 KZFP gene cluster, identifying specific ERVs that are highly represented relative to their genome-wide distribution. Of all the annotated MLTR18A_MM elements, 25.2% were found within the Chr4 KZFP gene cluster. Similarly, a large portion of all annotated ERVs of other distinct families (18.5% for LTRIS6, 16.1% for MMTV-int, 14.8% for RLTR13D3, 13.3% for LTRIS3, and 11.2% for RLTR1D2_MM elements) were found in the same cluster. These percentages are particularly striking, given that the Chr4 KZFP gene cluster accounts for only 0.2% of the total BL6J genome. Analysis of the sequence divergence of ERV subfamilies (measured as percentage of divergence to the consensus) for which more than 2% of total annotations occur at the Chr4 KZFP gene cluster revealed that while genome-wide there is a continuum of divergence (as expected for TEs that independently accumulate mutations) the Chr4 cluster displayed several groups of ERV copies with nearly identical sequence divergence (Fig. 5a). This is distinct from the overall bimodal distribution of sequence divergence observed for some ERV families⁴⁶, and likely reflects the segmental duplication of ERVs as opposed to independent insertions. Similar analyses in the 129S1 and CAST strains, as well as in Mus spretus, revealed strain-specific differences in ERV representation and distribution of sequence divergence (Supplementary Fig. 11a, Supplementary Fig. 12), underscoring the dynamic nature of these loci.
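To put these percentages in perspective, a short Python calculation compares each family's share of annotations in the cluster with the share expected if copies were spread uniformly over the genome. This is only a back-of-the-envelope sketch: the uniform expectation is an assumption, not the enrichment statistic used in the study, and because the 0.2% genome share is a rounded figure the resulting fold values are approximate. The names `share_in_cluster` and `fold_over_uniform` are hypothetical.

```python
# Percentage of each ERV family's genome-wide annotations that fall within
# the BL6J Chr4 KZFP gene cluster (values quoted in the text).
share_in_cluster = {
    "MLTR18A_MM": 25.2,
    "LTRIS6": 18.5,
    "MMTV-int": 16.1,
    "RLTR13D3": 14.8,
    "LTRIS3": 13.3,
    "RLTR1D2_MM": 11.2,
}

# Percentage of the BL6J genome occupied by the cluster (rounded in the text).
CLUSTER_GENOME_SHARE = 0.2


def fold_over_uniform(share_pct, region_pct=CLUSTER_GENOME_SHARE):
    """Observed share divided by the share expected under a uniform spread."""
    return round(share_pct / region_pct, 1)


fold = {family: fold_over_uniform(pct) for family, pct in share_in_cluster.items()}
for family, value in fold.items():
    print(f"{family}: ~{value}-fold over uniform expectation")
```

Under this naive expectation, every listed family is over-represented in the cluster by more than 50-fold, with MLTR18A_MM at roughly 126-fold.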

**Fig. 5: Highly represented ERVs in the Chr4 KZFP gene cluster increased their number by duplication.**

Taken together, our findings suggest the following model (Fig. 5b): the integration of new ERVs within KZFP gene clusters may have increased the recombinogenic potential of these loci by promoting non-allelic homologous recombination events, driven by regions of microhomology shared between different ERVs. Repair processes leading to segmental duplications resulted in cluster expansion, possibly occasionally counterbalanced by repair events that caused cluster contraction. These two forces, operating independently and in parallel across different mouse strains, alongside species- and strain-specific ERV integrations, have likely driven the divergence of young KZFP gene clusters in mice. As a result, the loci now appear highly divergent, with the emergence of distinct young KZFP gene repertoires. This model would also explain how the content of ERVs and KZFP genes can increase in concert, with the KZFP gene clusters gaining more recombinogenic potential as they expand, due to the concomitant increase in repeat load. This raises an additional question: could the emergence of new KZFP genes that bind to and silence these newly integrated ERVs act as a brake on this self-reinforcing recombinogenic system?

Several KZFPs in the BL6J Chr4 cluster target ERVs moderately enriched within the Chr4 cluster itself

To address whether there is a relationship between the ERVs that promoted KZFP gene cluster recombination and the emergence of new KZFP genes that could bind and repress them, we characterized the DNA binding properties of all the KZFPs encoded in the BL6J Chr4 cluster that differ by at least one amino acid, combining new ChIP-seq data for 42 KZFPs (including KZFPs with new or updated annotation) with previously published ChIP-seq data for 8 KZFPs⁹. Thus, we generated a TE target map for all the KZFPs encoded in the BL6J Chr4 cluster (Fig. 6a, Supplementary Data 5). As expected from previous studies, we observed that several KZFPs specifically target distinct TEs, and we were able to identify target motifs by integrating canonical motif discovery from experimentally determined peak regions with target motif prediction accounting for the KZFP amino acid sequence⁴⁷ (Supplementary Data 6). This analysis allowed us to identify the KZFPs responsible for targeting and silencing RLTR4 elements (Fig. 6b), which had previously been shown to be over-expressed in mESCs lacking the Chr4 cluster⁹ and for which the specific KZFP responsible for their repression in wild-type cells had remained unknown. Furthermore, we found that many KZFPs exhibiting specific TE targeting were unique to the BL6J strain compared to the 129S1 and CAST strains. Among them, we found several KZFPs that target distinct subfamilies of IAP LTRs and IAPEz internal regions (Fig. 6c), thereby identifying the modifiers likely responsible for the reported variable methylation of these IAP elements in different mouse strains¹⁶. Lastly, we found examples of KZFPs that did not display strong or specific binding to TEs, and for which we could not identify even a general target motif (Supplementary Data 6), suggesting that not all the KZFPs that have emerged thus far in the BL6J Chr4 cluster have a specific function.

**Fig. 6: Several KZFPs in the BL6J Chr4 cluster bind to ERVs that display only a mild enrichment within the Chr4 cluster itself.**

Interestingly, we also observed that several ERVs targeted by KZFPs encoded in the Chr4 cluster are only moderately, rather than strongly, enriched at this locus (Supplementary Fig. 10g, Fig. 6a). This observation hints at the intriguing possibility that the emergence of KZFPs targeting these ERVs may have acted as a brake on their enrichment. We speculate that KZFPs may limit the further expansion of their target ERVs in two synergistic ways: as KZFPs can repress the ERVs they bind to, they can reduce their retrotransposition; at the same time, as the ERVs cannot increase their numbers by new integrations, they cannot further increase the recombinogenic potential of the KZFP gene cluster locus, further limiting their expansion by segmental duplication.

Discussion

KZFPs represent the largest family of DNA-binding factors in mammals, with a unique evolutionary flexibility in their overall numbers as well as in their DNA-binding specificity, which has enabled a remarkable diversification of the KZFP repertoire in different species. Rodents, and specifically mice, are an important model to understand how KZFP gene repertoires have been established over time, since KZFP genes have undergone dramatic expansion in different lineages, such as the Mus genus.

Although KZFPs have the potential to modulate the expression of ERV sequences, which are increasingly recognized for their role in both physiological and pathogenic processes^48,49,50, evolutionary young KZFP gene cluster loci in the mouse genome have remained relatively understudied due to the challenges associated with assembling these highly repetitive loci completely and reliably. In this study we demonstrate that the complete assembly and annotation of a single locus (Chr4 KZFP gene cluster) has led to the discovery of dozens of previously uncharacterized KZFP coding genes, which have the potential to contribute to complex gene regulation by repressing specific TEs. Furthermore, with the availability of full sequences from three different mouse strains, we were able to explore, at an unprecedented resolution, the significant heterogeneity of three young KZFP gene clusters in mice. Our findings reveal these loci as largely underexplored reservoirs of genetic diversity, tightly linked to ERV epigenetic heterogeneity. Although we could not observe large variation of the examined KZFP gene clusters between two BL6J Mus musculus individuals, we cannot exclude that pangenome analyses across larger populations and in different mouse strains and species might reveal more intra-species heterogeneity at evolutionary young KZFP gene clusters. Beyond the simple presence or absence of KZFPs with specific TE-targeting capabilities, we hypothesize that the varying numbers of copies of the same KZFP across different mouse strains could also influence transcriptional repression, contributing to gradual differences in gene regulation rather than binary on/off (gene present or absent) states. Future studies that systematically annotate and functionally validate all KZFPs across different mouse strains will likely illuminate several observed differences in gene expression regulation. These differences are not easily explained by the mere presence or absence of individual genes⁵¹, but might be due to the interplay of multiple KZFPs acting as epigenetic modifiers in more complex regulatory networks.

Our analysis also suggests that while new ERV integrations likely contributed to the expansion of KZFP gene clusters, their enrichment in these loci may be a consequence of the expansion of the KZFP gene clusters by segmental duplication, in a self-reinforcing loop that has likely promoted the independent expansion of evolutionary young KZFP gene clusters in the mouse lineage. This loop provides a compelling model for the correlation of LTR elements and KZFP gene numbers in the mouse genome compared to rat. The retroviral infiltration of KZFP genes might also not be a unique feature of the mouse lineage. ERV accumulation at KZFP gene clusters has also been observed in the Peromyscus lineage⁵² and examples of lineage specific KZFP gene repertoire expansion are also present in primates⁵³. A similar trend of ERV infiltration and KZFP gene cluster expansion is also evident in the human genome, where large primate-specific clusters are heavily infiltrated by primate-specific ERVs, in contrast to conserved KZFP gene cluster loci (Supplementary Fig. 13). One main question remains: do ERVs land at KZFP gene clusters only by chance, with no significant negative consequences, allowing them to escape selection, or is there an active mechanism that facilitates the integration of new ERVs at these loci? While more work is also required to address how new KZFP gene clusters are seeded, we can start to speculate that certain TE integrations may have promoted recombination events between regular genomic regions and existing KZFP gene clusters, leading to the emergence of new clusters.

Methods

Analysis of KZFP unit conservation in rodents and primates

We re-analyzed the KZFP unit census (including genes, pseudogenes, and predicted related sequences) from Supplementary Table 2 of Imbeault et al. (2017)¹, using cluster IDs to represent distinct zinc finger arrays. To account for gene duplication events, we also considered the number of KZFP units with the same cluster ID in each analyzed species.

First, we identified and counted KZFP units unique to each species within the primate or rodent clades, considering all species for comparison. Second, we quantified KZFP units shared between at least two rodent species and absent in non-rodent species, and similarly, for primates. Third, we counted KZFP units shared with at least one other mammalian species but absent outside mammals. The remaining KZFP units represented those shared across species beyond mammals.
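
The four-tier census described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function name `classify_unit`, the tier labels, and the toy species sets are assumptions made for the example, and each KZFP unit is represented only by the set of species in which its cluster ID occurs.

```python
def classify_unit(species_with_unit, clade, mammals):
    """Assign one KZFP unit (cluster ID) to a conservation tier.

    species_with_unit: species in which the cluster ID was found
    clade:             species of the clade of interest (e.g. rodents)
    mammals:           all mammalian species in the comparison
    """
    present = set(species_with_unit)
    if present <= clade:
        # Found only inside the clade: unique to one species or shared by several.
        return "species-specific" if len(present) == 1 else "clade-specific"
    if present <= mammals:
        # Shared with at least one mammal outside the clade, absent beyond mammals.
        return "mammal-restricted"
    return "beyond mammals"


# Hypothetical toy species sets, for illustration only.
rodents = {"mouse", "rat"}
mammals = rodents | {"human", "cow"}

for unit in ({"mouse"}, {"mouse", "rat"}, {"mouse", "human"}, {"mouse", "chicken"}):
    print(sorted(unit), "->", classify_unit(unit, rodents, mammals))
```

Counting duplicated units, as mentioned above, would additionally require the number of copies per cluster ID in each species, which this sketch leaves out.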

Phylogenetic trees, ideograms and synteny plots

Phylogenetic trees were generated using TimeTree⁵⁴ and downloaded in Newick format. Trees were then visualized using the ggtree R package⁵⁵. Ideogram plots were generated using the karyoploteR package⁵⁶, while synteny plots were generated using the SVbyEye tool⁵⁷.

De novo assembly of C57BL/6J and 129S1/SvImJ mouse strains and Mus spretus

PacBio HiFi sequencing was performed for both C57BL/6J (BL6J) and 129S1/SvImJ (129S1) mouse strains and Mus spretus. All mouse procedures had been reviewed and approved by the National Institute of Child Health and Human Development (NICHD) Animal Care and Use Committee (ACUC) at the National Institutes of Health (ASP#: 24-026). Mice were housed with a light cycle of 14 h on/10 h off, temperature maintained between 20 and 22 °C, humidity kept at 40–55%, and with no more than 5 mice per cage. Adult mice (between 2 and 6 months of age) were euthanized using CO₂ following the approved procedure.

Genomic high molecular weight (HMW) DNA was isolated from kidney of adult males of pure strain C57BL/6J and 129S1/SvImJ mice purchased from JAX (Strain #000664 and #002448, respectively) and of an adult female of SPR2 Mus spretus (RIKEN RBRC00208), using the Monarch HMW DNA Extraction kit for Tissue (NEB T3060), following the manufacturer's instructions.

Sequencing libraries were prepared using the SMRTbell Express Template Prep Kit 2.0 and sequenced on a Sequel II using version 2.0 sequencing reagents. Circular consensus sequence (CCS)/HiFi reads were generated off-instrument from the initial subread data from each SMRTCell using the pb_ccs workflow (ccs version 6.3.0) within PacBio SMRTLink version 11.0.0.146107.

ONT ultralong read sequencing was performed for BL6J and 129S1 mouse strains, by deriving F1 hybrid mouse embryonic stem cells (XY) from E3.5 blastocysts from a cross between a female BL6J and a male 129S1 mouse. Cells were tested for karyotype stability by mitotic chromosome spreading and counting. Ultra-HMW genomic DNA was extracted with the Monarch HMW DNA Extraction Kit for Cells and Blood (NEB T3050) following the manufacturer's instructions. Libraries were prepared using the Nanopore Ultra-Long Sequencing Kit (SQK-ULK001) following the manufacturer's instructions and sequenced on ONT FLO-PRO002 and FLO-MIN106 flow cells. Basecalling was done on instrument using Guppy v6.3.9 or Guppy v6.4.6 in high-accuracy mode.

To generate de novo assembly for BL6J and 129S1 mouse strains, first Canu was used for trio-binning of ONT reads⁵⁸, then Verkko was used to assemble each strain separately using the individual PacBio HiFi data as parental and the binned F1 ONT data⁵⁹.

De novo assembly of Mus spretus was generated using hifiasm⁶⁰ from PacBio HiFi data only, and the assembly graph of primary contigs output (.bp.p_ctg.gfa output file) was used to generate the assembly fasta file.

De novo assembly QC and contig filtering

Assemblies were tested for completeness using BUSCO v5.4.7⁶¹.

To facilitate navigation through the BL6J and 129S1 assemblies without scaffolding and the introduction of gaps, we retained only contigs that aligned to known chromosomes, using the GRCm39 assembly as a reference. De novo assemblies were aligned to the GRCm39 assembly using minimap2 with the -x asm5 option⁶². The resulting PAF files were filtered based on the following criteria: (i) only contigs aligning to known chromosomes were kept; (ii) if a contig aligned to multiple chromosomes, it was assigned to the chromosome with the longest alignment; (iii) for overlapping alignments, only the contig with the largest alignment was retained, removing smaller nested contigs. Additionally, strand corrections (flipping) were applied to contigs aligned to the minus strand. Due to fragmentation of ChrY in the BL6J assembly, with multiple contigs mapping to the same regions, only autosomes and ChrX contigs were retained for downstream analyses in both strains. The completeness of the filtered, strand-corrected assemblies was reassessed using BUSCO, confirming equivalent completeness compared to the unfiltered assemblies.
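
Filtering criteria (i) to (iii) and the strand correction can be sketched as below. This is an illustrative reading, not the authors' script: it assumes that "longest alignment" means the single longest alignment block per contig (PAF column 11) and that "nested" means an alignment interval fully contained in that of a larger retained contig. The function `filter_contigs`, the helper `paf`, and the toy records are hypothetical.

```python
from collections import defaultdict


def filter_contigs(paf_lines, known_chroms):
    """Return {contig: (chromosome, needs_flip)} after criteria (i)-(iii)."""
    # (i) + (ii): keep alignments to known chromosomes and, per contig,
    # remember only its single longest alignment block (PAF column 11).
    best = {}
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        contig, strand, chrom = f[0], f[4], f[5]
        t_start, t_end, block_len = int(f[7]), int(f[8]), int(f[10])
        if chrom not in known_chroms:
            continue
        if contig not in best or block_len > best[contig][0]:
            best[contig] = (block_len, chrom, strand, t_start, t_end)

    # (iii): on each chromosome, drop contigs whose best alignment is fully
    # nested inside the interval covered by another retained contig.
    by_chrom = defaultdict(list)
    for contig, (_, chrom, strand, t_start, t_end) in best.items():
        by_chrom[chrom].append((t_start, -t_end, contig, strand))

    kept = {}
    for chrom, items in by_chrom.items():
        max_end = -1
        # Sorted by start, then by longest interval first.
        for t_start, neg_end, contig, strand in sorted(items):
            if -neg_end <= max_end:
                continue  # nested in an earlier, larger alignment
            max_end = -neg_end
            kept[contig] = (chrom, strand == "-")  # minus strand -> flip
    return kept


def paf(contig, strand, chrom, t_start, t_end):
    """Build a minimal 12-column PAF record (toy values for unused fields)."""
    length = t_end - t_start
    fields = [contig, length, 0, length, strand, chrom,
              5000, t_start, t_end, length, length, 60]
    return "\t".join(map(str, fields))


records = [
    paf("ctgA", "+", "chr1", 0, 1000),
    paf("ctgA", "+", "chr2", 100, 300),   # shorter hit: chr1 wins
    paf("ctgB", "-", "chr1", 200, 500),   # nested inside ctgA: dropped
    paf("ctgC", "-", "chr1", 900, 1700),  # partial overlap: kept, flipped
    paf("ctgD", "+", "chrUn_x", 0, 500),  # not a known chromosome: dropped
]
print(filter_contigs(records, {"chr1", "chr2"}))
```

Contigs flagged with `needs_flip` would then be reverse-complemented in the assembly FASTA, a step not shown here.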

To assess assembly accuracy at the KZFP gene clusters, haplotype specific ONT reads (F1 ONT reads after trio binning using the parental PacBio HiFi data) were aligned to the corresponding assembly or, as a control, to the ‘wrong’ strain assembly using minimap2 (-ax map-ont option); the resulting sam alignments were converted to bam format and indexed using samtools view and index commands, respectively. Alignments of reads >100 kb were inspected in IGV to confirm tiling and homogeneous coverage of the read alignments over the KZFP gene cluster loci. Small indel threshold was set to 50 bp to improve visibility.

SNP analysis between two BL6J assemblies

Sequence alignment was performed using the nucmer program, part of the MUMmer tool (v4.0.1)⁶³, using the BL6J de novo assembly from this study as reference and the T2T BL6J assembly (GCA_964188535) from ref. ²⁶ as query. The resulting delta.filter output was then used to identify SNPs with the show-snps command (with -Clr -T options), and SNPs within the examined KZFP gene clusters were quantified.

Sequence alignments for synteny and self-identity plots

Sequence comparison of KZFP cluster loci was performed by pairwise sequence alignment using lastz⁶⁴ (default options, --format=PAF) for the analysis across different mammal species shown in Fig. 1c and Supplementary Fig. 1b. For mouse versus rat comparison (Fig. 1c), only alignments larger than 3 kb were retained to improve visibility. For comparisons between different mouse strains, mouse species and rat, minimap2 (-x asm20 -c --eqx --secondary=yes) was used. To compare KZFP gene cluster loci between de novo assemblies and available reference genomes (GRCm39 and 129S1.v3), the same minimap2 strategy was used, but with the --secondary=no option. For self-identity arc plots, minimap2 was used with the following options: -x asm5 -c --eqx -D -P --dual=no. Self-identity arc plots above the alignment plots in Fig. 2d, e were generated using SVbyEye, after filtering the paf file with filterPaf(., is.selfaln = TRUE). Self-identity dotplots were generated using the ModDotPlot tool⁶⁵ with the static -id 85 --color <custom colors> options. The -w parameter was set to 2000 for all the plots to maintain the same window size across assemblies and clusters, with the only exception of the KZFP gene cluster on Chr7, for which -w was set to 500 to account for the much smaller locus size.

Known gene and repeat annotation

Known genes from the Gencode M32 annotation release were annotated using the liftoff tool (with -copies option)⁶⁶. This approach was also used to generate gene annotation and identify all KZFP gene cluster boundaries in the available assemblies from different mouse strains. Repeats were annotated using RepeatMasker v4.1.5⁶⁷ (-species “mus musculus”) using NCBI/RMBLAST [2.14.1+] search engine and Dfam with RBRM v3.8.

Manual curation of new KZFP gene annotation at the Chr4 cluster

Manual curation of de novo KZFP gene annotation at the Chr4 KZFP gene cluster in BL6J and 129S1 mice was achieved by combining PacBio Iso-Seq from our F1 hybrid mESCs (from BL6J x 129S1) and published short-read RNA-seq from pure strain mESCs for the two strains (SRR23065639, SRR23065640, and SRR23065641 for BL6J; SRR23065649, SRR23065650, and SRR23065651 for 129S1)²⁵.

For PacBio Iso-Seq, RNA was isolated from mESCs using the Zymo Research Quick-RNA Microprep kit following the manufacturer instructions. Iso-Seq libraries were created using the PacBio SMRTbell prep kit 3.0. Separate libraries were generated for mRNAs <3 kb or >3 kb. Sequencing was performed on a Sequel IIe sequencer (Pacific Biosciences) running instrument control software version 11.0.0.144466 and a movie collection time of 25 h per SMRTCell with 2 hr pre-extension. CCS/HiFi reads were generated from the initial subread data using the pb_ccs workflow (ccs v.6.3.0) within PacBio SMRTLink version 11.0.0.146107. CCS/HiFi reads with the proper orientation of 5’ and 3’ Iso-Seq primers on the sequence ends were identified using LIMA v.2.7.1. Iso-Seq primers were trimmed from the LIMA-FL reads. IsoSeq3 refine v.3.8.2 was used to evaluate and process the LIMA-FL reads by: (i) retaining reads with proper orientation of 5’ and 3’ IsoSeq primers and poly-A tail; (ii) orienting reads so that poly-A tail is on 3’ end; (iii) trimming poly-A tail from read; (iv) removing possible chimeric reads (LIMA-FL reads with barcode sequences in the middle). Polished high quality isoform sequences (predicted accuracy >= 0.99) were obtained by using IsoSeq3 cluster on Refine-FLNC reads from both <3 kb and >3 kb mRNA libraries, to also reduce redundancy. These isoforms were used for downstream analysis and aligned to both BL6J and 129S1 assemblies using minimap2 (-ax splice:hq -uf options). Sam files were converted to bam using SAMtools v1.19⁶⁸.

BL6J and 129S1 short-read RNA-seq data was aligned to the respective genome assembly using STAR v.2.7.10b⁶⁹ with default parameters. De novo transcriptome assemblies for BL6J and 129S1 were generated using StringTie v.2.2.1 (--mix -u options)⁷⁰, combining the bam files from both short-read RNA-seq and PacBio Iso-Seq data.

For the manual curation of Chr4cl KZFP genes in the CAST mouse strain the same analysis approach was used, combining published short-read RNA-seq data from pure strain mESCs (SRR23065636, SRR23065637, SRR23065638) with ONT long-read RNA-seq data instead of PacBio Iso-Seq. Briefly, RNA was isolated from F1 BL6J/CAST mESCs (XY) – described in ref. ²⁶ – using the Qiagen AllPrep DNA/RNA Kit. ONT sequencing libraries were generated using the ONT cDNA-PCR Sequencing V14 - Barcoding (SQK-PCB114.24) kit and sequenced on a PromethION machine using the 10.4 flow cell chemistry. Fastq files were generated with the Dorado base caller using the SUP (super accurate) setting. Reads were mapped to the CAST genome assembly (GCA_964188545) using minimap2 (-ax splice option).

For the manual curation of KZFP genes, all transcripts identified by StringTie and falling at the Chr4 cluster locus were manually inspected in IGV and compared with the bam alignment generated from the short-read RNA-seq for the specific mouse strain. We observed that all KZFP genes at the Chr4 cluster were highlighted by the presence of a 3’exon overlapping MMSAT4/MurSatRep1 repeats. The sequences from these transcripts were then imported into Snapgene and manually inspected for the presence of canonical splice sites around the identified exons and for open reading frames in the spliced transcripts. Whenever multiple transcript isoforms were detected for the same gene, we prioritized transcripts with higher abundance estimated by StringTie and inspected less abundant transcripts when the most abundant ones were missing critical exons that were obviously highlighted in the bam track from the short-read RNA-seq. Transcripts with the potential to encode both a KRAB domain and at least one zinc finger in at least one of the StringTie identified isoforms were used to annotate coding KZFP genes. Transcripts missing the start codon or the exon encoding part of the KRAB domain were considered pseudogenes or otherwise classified as ‘other’, if they could still retain a minimal coding potential. Finally, for the CAST annotation only, we manually extended the end of the 3’exon whenever the StringTie assembled transcripts revealed coding potential but were truncated in the coding portion of the 3’exon, resulting in otherwise truncated zinc finger arrays; the extended annotation always contained several more in frame zinc fingers and a stop codon, revealing complete open reading frames.

Fingerprint analysis

During the manual curation of the KZFP gene annotation, all zinc finger arrays were also manually annotated. Fingerprint amino acid sequences were then extracted by position within the annotated zinc fingers (−1, +2, +3, and +6 positions, according to helical nomenclature). Furthermore, fingerprints in zinc fingers with mutations for at least one of the two cysteine or histidine were highlighted in red; fingerprints in zinc fingers with mutations for at least one of the other conserved structural amino acids (−12 F/Y, −3 F, +4 L) were highlighted in yellow.
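
Although the fingerprints above were annotated manually, the position-based extraction can be illustrated with a short Python sketch. It assumes the canonical C2H2 spacing C-X(2-4)-C-X(12)-H-X(3-5)-H, with the first zinc-coordinating histidine at helix position +7 and no position 0, so that positions −1, +2, +3 and +6 lie 7, 5, 4 and 1 residues before that histidine. The regular expression, the function name `fingerprints`, and the toy two-finger sequence are assumptions made for this example; degenerate fingers with mutated cysteines or histidines would not be matched and would still need manual inspection.

```python
import re

# Assumed canonical C2H2 zinc finger: C-X(2-4)-C-X(12)-H-X(3-5)-H.
# Group 1 captures the 12 residues between the second Cys and the first His.
FINGER = re.compile(r"C.{2,4}C(.{12})H.{3,5}H")

# Indices within the 12-residue stretch for helix positions -1, +2, +3, +6
# (i.e. 7, 5, 4 and 1 residues before the first His).
FINGERPRINT_IDX = (5, 7, 8, 11)


def fingerprints(protein):
    """Return one 4-letter fingerprint per canonical zinc finger found."""
    result = []
    for match in FINGER.finditer(protein):
        stretch = match.group(1)
        result.append("".join(stretch[i] for i in FINGERPRINT_IDX))
    return result


# Toy sequence with two fingers joined by a linker (illustrative only).
toy = "CPVESCDRRFSRSDELTRHIRIH" + "TGEKPFQ" + "CRICMRNFSRSDHLTTHIRTH"
print(fingerprints(toy))
```

For the toy sequence, the helix residues of the two fingers are RSDELTR and RSDHLTT, so the extracted fingerprints are the residues at the first, third, fourth and seventh of those positions.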

To sort the KZFPs by similarity of the fingerprint arrays and facilitate comparison of their target sequence in ChIP-seq experiments (Fig. 6a), sequences of fingerprint arrays were compared by multiple sequence alignment using MAFFT v.7⁷¹ with G-INS-1 progressive alignment method and a 0.5 value for align versus leave gappy regions. The fasta alignment was then used to generate a distance matrix based on maximum likelihood estimation using the phangorn v.2.12.1 R package (dist.ml function)⁷², and the matrix was used to generate a tree estimation based on the Minimum Evolution Algorithm, using the fastme.bal function of the ape v.5.8 R package⁷³. The pml (default parameters) and optim.pml (optNni=TRUE option) of the phangorn package were then used to optimize the tree topology using nearest-neighbor interchange.

KZFP gene duplication analysis

To identify highly similar KZFP genes within the BL6J Chr4cl, the 3’exon DNA sequence of all the annotated KZFP genes (both coding and pseudogene/other) was used to generate a multiple sequence alignment using clustalO 1.2.4 with default settings⁷⁴. The output tree was saved in Newick format and displayed using the ggtree R package⁵⁵. KZFP genes with highly similar 3’exons were highlighted with similar colors and used to identify patterns of KZFP gene cluster duplications spanning multiple genes.

TE content, enrichment, and divergence analysis

TE content analysis for each assembly was performed based on de novo repeat annotation by RepeatMasker⁶⁷ for all mouse strains and species, and using both available annotation of rat repeats as well as de novo annotation of mouse repeats in the Rattus norvegicus mRatBN7.2 assembly.

TE enrichment at KZFP gene clusters was calculated based on the number of bp annotated for each TE subfamily as follows: (bp of TE within the KZFP cluster locus/bp of KZFP cluster locus) / (bp of TE in the whole genome/bp of whole genome).
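
As a worked illustration of this ratio, the following Python sketch computes the enrichment and its log2 value for one TE subfamily. The function name `te_enrichment` and all numbers are hypothetical and serve only to show the arithmetic.

```python
import math


def te_enrichment(te_bp_in_cluster, cluster_bp, te_bp_in_genome, genome_bp):
    """Ratio of a TE subfamily's bp density in the cluster vs genome-wide."""
    cluster_density = te_bp_in_cluster / cluster_bp
    genome_density = te_bp_in_genome / genome_bp
    return cluster_density / genome_density


# Hypothetical values: 8 kb of a TE subfamily in a 1 Mb cluster,
# 2 Mb of the same subfamily in a 2 Gb genome.
ratio = te_enrichment(8_000, 1_000_000, 2_000_000, 2_000_000_000)
print(f"enrichment = {ratio:.2f}, log2 enrichment = {math.log2(ratio):.2f}")
```

A ratio above 1 (log2 above 0) indicates that the subfamily occupies a larger fraction of the cluster than of the genome as a whole.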

The P-value of the overlap of each TE subfamily with each KZFP cluster was calculated by permutation test (n = 1000) using the overlapPermTest function with default settings of the regioneR package⁷⁵. All data from this analysis is available in Supplementary Data 3; Supplementary Fig. 8g, h display the log2 enrichment for TEs that display enrichment with P-value < 0.001 in at least one KZFP gene cluster in one or more Mus musculus strains or in Mus spretus. The same strategy was used for the human TE enrichment analysis, and TEs enriched with P-value < 0.001 in at least one of the KZFP gene clusters examined are displayed in Supplementary Fig. 13.

To compare the sequence divergence of LTR elements in the Chr4 KZFP cluster versus genome-wide, the percentage of divergence value calculated by RepeatMasker was used. All data from this analysis is available in Supplementary Data 4; LTR elements displayed in Fig. 5a were selected if more than 2% of their total annotations (and more than 10 annotations) occurred within the BL6J Chr4 KZFP gene cluster.

ChIP-seq experiments

ChIP-seq experiments were performed by over-expressing each KZFP coding construct in mouse F9 embryonal carcinoma cells (ATCC, CRL-1720). Cells were grown in Dulbecco’s Modified Eagle’s Medium (DMEM 4.5 g/L D-Glucose, L-Glutamine, 110 mg/L Sodium Pyruvate) (Gibco, 11995-065) supplemented with 10% Fetal Bovine Serum (EMD Millipore, ES-009-B), 1X GlutaMAX (Gibco, 35050-061), 1X Anti-Anti (Gibco, 15240-062), at 37 °C under 5% CO₂.

The CDS for each KZFP was codon optimized for expression in mammalian cells, synthesized and cloned with a C-terminal 3x-HA tag by Genscript into a Sleeping Beauty (SB) transposon-based vector harboring a puromycin resistance cassette⁹.

F9 cells were transfected with individual KZFP encoding pSB plasmids together with the DNA transposase plasmid, using Lipofectamine 2000 reagent (Invitrogen, Cat#11668019) according to the manufacturer's instructions. 48 h after transfection, cells were grown for an additional 48 h in medium supplemented with a final concentration of 1 µg/mL puromycin. Cell cultures were then expanded using regular culture medium and expression of KZFP constructs of the expected size was validated by Western Blot.

For ChIP-seq, cells expressing each KZFP construct or transfected with empty vector as negative control were harvested by trypsinization, resuspended in PBS, counted and fixed in 1% formaldehyde for 10 min at room temperature with gentle mixing. Fixation was quenched with 0.4 M glycine for 5 min at room temperature with gentle mixing. Cells were washed twice with cold PBS and the cell pellet was frozen on dry ice and stored at −80 °C. Cells were lysed in cell lysis buffer (5 mM PIPES pH8.0, 85 mM KCl, 0.5% NP-40, 1X EDTA-free protease inhibitor cocktail (Roche, 5056489001)) for 10 min on ice and homogenized using a type-B Dounce homogenizer. Released nuclei were pelleted and lysed in nuclei lysis buffer (50 mM Tris-HCl pH8.0, 150 mM NaCl, 2 mM EDTA pH8.0, 1% NP-40, 0.5% Sodium Deoxycholate, 0.1% SDS, 1X EDTA-free protease inhibitor cocktail) to release chromatin. Chromatin was sonicated with a Bioruptor® 300 (Diagenode), nuclear debris was then pelleted at 20,800 g for 15 min at 4 °C and the supernatant was used for chromatin immunoprecipitation. Sonicated chromatin from 30 million cells was used for each ChIP. Dynabeads™ Protein A magnetic beads (Invitrogen, 10002D) were incubated with anti-HA tag antibody (Abcam, ab9110) and washed in 0.5% BSA in PBS. ChIP was performed overnight on rotation at 4 °C. Beads were then washed once with low salt wash buffer (0.1% SDS, 1% Triton X-100, 2 mM EDTA pH8.0, 20 mM Tris-HCl pH8.0, 150 mM NaCl), twice with high salt wash buffer (0.1% SDS, 1% Triton X-100, 2 mM EDTA pH8.0, 20 mM Tris-HCl pH8.0, 500 mM NaCl), twice with LiCl wash buffer (250 mM LiCl, 1% NP-40, 1% Sodium Deoxycholate, 1 mM EDTA pH8.0, 10 mM Tris-HCl pH8.0) and twice with TE buffer (10 mM Tris-HCl pH8.0, 1 mM EDTA pH8.0). Beads were incubated overnight at 65 °C in elution buffer (10 mM Tris-HCl pH8.0, 0.3 M NaCl, 5 mM EDTA pH8.0, 0.5% SDS) with 0.1 µg/µL RNase A (Thermo Scientific, EN0531). Eluates were transferred to fresh tubes and incubated for 2 h at 55 °C with 0.3 µg/µL proteinase K (Roche, 3115852001). For the chromatin input sample, elution buffer was added to an aliquot of sonicated chromatin and the sample was treated similarly to ChIP samples. DNA was finally purified with the DNA Clean & Concentrator-5 kit (Zymo Research, D4004).

DNA libraries for NGS were obtained with the ThruPLEX® DNA-Seq Kit (Takara, R400676) with DNA Single Index Kit −12S Set A and B (Takara, R400695 and R400695), following manufacturer instructions. Input samples for each experimental batch were mixed in equal amounts at the library preparation step to generate batch specific input samples. Samples were sequenced as 100 bp paired end reads on an Illumina NovaSeq 6000 system.

H3K4me3 and H3K9me3 ChIP-seq experiments in pure BL6J cells were performed as described above with the following modifications: HGTC8 cells⁷⁶ were used for the experiment; anti-H3K4me3 (Abcam ab8580) and anti-H3K9me3 (Abcam ab8898) antibodies were used for ChIP.

ChIP-seq analysis

Read quality was assessed by fastQC v0.12.1. Reads were aligned to the Mus musculus GRCm39 reference genome assembly (GCF_000001635.27) for all the KZFP ChIP-seq experiments, using the Burrows-Wheeler Alignment (BWA) tool v0.7.17⁷⁷ (bwa aln and bwa sampe commands, default settings – only one best alignment was randomly retained in case of equally good multiple read alignments). Sam files were then converted into bam files with SAMtools v1.19⁶⁸, while removing any unmapped and duplicated reads, and retaining only primary alignments (samtools view -F 0x4,0x400,0x100,0x800 -b -h file.sam > file.bam). Bam files were sorted and indexed with SAMtools and converted to bigwig normalized to 1x genome coverage (RPGC normalization) for each sample with deepTools v3.5.4⁷⁸ (bamCoverage --bam file.bam -o file.bw -of bigwig --binSize 10 --effectiveGenomeSize 2521902382 --normalizeUsing RPGC --extendReads 200). ChIP bigwigs were further normalized by the input of the respective batch using deepTools (bigwigCompare -b1 ChIP.bw -b2 input.bw -o ChIP_input_ratio.bw -of bigwig --operation ratio --skipZeroOverZero --binSize 10).
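
The read filter in the samtools command above excludes any alignment carrying at least one of four SAM flag bits. The sketch below shows how those bits combine into a single mask and how a given flag value would be evaluated. The names `EXCLUDED_BITS`, `EXCLUDE_MASK` and `is_retained` are hypothetical, and the example assumes the four listed values act together as one exclusion mask.

```python
# SAM flag bits listed in the samtools -F option, with their standard meaning.
EXCLUDED_BITS = {
    "read unmapped": 0x4,
    "secondary alignment": 0x100,
    "PCR or optical duplicate": 0x400,
    "supplementary alignment": 0x800,
}

# Combine the individual bits into one exclusion mask.
EXCLUDE_MASK = 0
for bit in EXCLUDED_BITS.values():
    EXCLUDE_MASK |= bit


def is_retained(flag):
    """True if a read with this SAM flag has none of the excluded bits set."""
    return (flag & EXCLUDE_MASK) == 0


print(f"combined mask: {EXCLUDE_MASK} ({hex(EXCLUDE_MASK)})")
for flag in (99, 355, 1123, 4):
    print(flag, "retained" if is_retained(flag) else "removed")
```

Here flag 99 describes a properly paired primary alignment and is retained, whereas 355 (99 plus the secondary bit), 1123 (99 plus the duplicate bit) and 4 (unmapped) are removed.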

The same analysis strategy was used also to analyze the H3K4me3 and H3K9me3 ChIP-seq data from HGTC8 cells and previously published PRDM9 ChIP-seq data (GSM1493404)³⁵. To generate bigwig normalized to 1x genome coverage for reads aligned to the de novo BL6J or CAST assembly, we used --effectiveGenomeSize 2525297504 or 2630722264, respectively, calculated using the unique-kmers.py command of the tool khmer v2.1.1 (with -k 200)^79,80,81.

For the re-alignment of previously published anti-DMC1 SSDS datasets (GSM1954835, GSM1954839 and GSM1954846)³⁶ to de novo BL6J and CAST assemblies, identification of ssDNA derived reads and genome alignment was completed using a published SSDS processing pipeline (https://github.com/kevbrick/SSDSnextflowPipeline)⁸². Reads from hybrid samples were aligned in replicate to each parental strain genome using the same parameters. Also in this case, only one best alignment was randomly retained in case of equally good multiple read alignments.

Peaks from the KZFP ChIP-seq experiments were called using MACS2 v2.2.7.1⁸³ (macs2 callpeak -t ChIP.bam -c input.bam -f BAMPE -g 2521902382). Peaks with even 1 bp overlap with peaks called in any of the negative control replicate samples were removed.

Peaks were further filtered to retain only those with q-value ≤ 0.01 and a fold enrichment over input of at least 10. For samples that retained fewer than 20 peaks, a fold enrichment over input of at least 5 was used as cutoff.
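
The peak filtering steps can be summarized with the following sketch. It is an approximation of the described logic rather than the authors' pipeline: the `Peak` record, the function names, and the toy peaks are hypothetical, the fold-enrichment cutoffs are read as "at least 10" and "at least 5", and the q-value is assumed to be stored as −log10(q), as in MACS2 narrowPeak output, so that q ≤ 0.01 corresponds to a value of at least 2.

```python
from collections import namedtuple

Peak = namedtuple("Peak", "chrom start end fold neg_log10_q")


def overlaps(a, b):
    """True if two half-open intervals share at least 1 bp."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def filter_peaks(peaks, control_peaks, min_peaks=20):
    """Apply control-overlap removal, q-value and fold-enrichment cutoffs."""
    # Remove peaks overlapping any peak called in the negative controls.
    clean = [p for p in peaks if not any(overlaps(p, c) for c in control_peaks)]
    # q-value <= 0.01 is equivalent to -log10(q) >= 2.
    significant = [p for p in clean if p.neg_log10_q >= 2.0]
    strict = [p for p in significant if p.fold >= 10]
    if len(strict) >= min_peaks:
        return strict
    # Fewer than min_peaks retained: relax the fold-enrichment cutoff to 5.
    return [p for p in significant if p.fold >= 5]


# Toy peaks, for illustration only.
peaks = [
    Peak("chr1", 100, 200, 12.0, 5.0),
    Peak("chr1", 300, 400, 6.0, 3.0),
    Peak("chr1", 500, 600, 4.0, 3.0),   # fold enrichment too low
    Peak("chr1", 700, 800, 20.0, 1.0),  # q-value too high
    Peak("chr2", 100, 200, 15.0, 5.0),  # overlaps a control peak by 1 bp
]
controls = [Peak("chr2", 199, 250, 3.0, 2.5)]
for p in filter_peaks(peaks, controls):
    print(p)
```

In this toy case only one peak passes the strict cutoff, so the relaxed cutoff applies and the first two peaks are returned.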

The same analysis pipeline was used to re-analyze data published in Wolf et al. 2020⁹, with the following modifications for single read samples: bwa samse command (instead of bwa sampe) was used to generate sam files and -f BAM option (instead of -f BAMPE) was used to call peaks by MACS2.

For target motif analysis, fasta sequences of 200 bp surrounding the summits of the retained peaks were extracted using getfasta command from BEDTools v2.31.1⁸⁴. Motifs were then identified using meme-chip tool from MEME suite v5.5.5⁸⁵, as well as the RCADE tool⁴⁷. Target motif prediction based on the KZFP amino acid sequences alone was also performed, using the Zinc Finger Recognition Code (ZiFRC) tool⁸⁶.
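
To make the summit-centred extraction concrete, the sketch below converts a peak start and summit offset into a 200 bp BED interval that could be passed to BEDTools getfasta. It assumes a symmetric window of 100 bp on each side of the summit and a narrowPeak-style summit offset relative to the peak start; the function name `summit_window` and the coordinates are hypothetical.

```python
def summit_window(chrom, peak_start, summit_offset, width=200):
    """BED interval (0-based, half-open) of `width` bp centred on a summit."""
    # Assumed input: summit given as an offset from the peak start.
    summit = peak_start + summit_offset
    half = width // 2
    start = max(0, summit - half)  # clip at the chromosome start
    return chrom, start, summit + half


chrom, start, end = summit_window("chr1", 1000, 50)
print(f"{chrom}\t{start}\t{end}")
```

Windows near a chromosome start are clipped at 0 and are therefore shorter than 200 bp; clipping at chromosome ends would need chromosome lengths and is not handled here.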

TE enrichment of ChIP-seq peaks for each KZFP was assessed by permutation test (n = 1000) using the overlapPermTest function with default settings of the regioneR package (Supplementary Data 5). To validate distinct TE targeting by KZFPs (top bound TEs in Supplementary Data 6), ChIP-seq reads were re-aligned to the consensus sequences of all mouse repeats in the Dfam repeat library and the presence of bona fide peaks over TEs that displayed peak enrichment with P-value < 0.01 based on the permutation test was manually inspected in the Integrative Genomics Viewer (IGV)⁸⁷.

For the bubble heatmap plot in Fig. 6a, a more stringent cutoff of P-value < 0.001 based on the permutation test was used.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All raw data generated in this study has been deposited in the SRA database under the BioProject PRJNA1219187. The ChIP-seq data generated in this study, together with the corresponding processed data files (normalized bigwig coverage files and peak files), has been deposited as a GEO series in the NCBI GEO database under accession number GSE292055.

The Mus musculus C57BL/6J de novo assembly after contig filtering and strand correction generated in this study has been deposited at DDBJ/ENA/GenBank under the accession JBPRBH000000000; the version described in this paper is version JBPRBH010000000 [https://www.ncbi.nlm.nih.gov/nuccore/JBPRBH000000000]. The Mus musculus 129S1/SvImJ de novo assembly after contig filtering and strand correction generated in this study has been deposited at DDBJ/ENA/GenBank under the accession JBPRBI000000000; the version described in this paper is version JBPRBI010000000 [https://www.ncbi.nlm.nih.gov/nuccore/JBPRBI000000000]. The Mus spretus SPR2 de novo assembly generated in this study has been deposited at DDBJ/ENA/GenBank under the accession JBQVYN000000000; the version described in this paper is version JBQVYN010000000 [https://www.ncbi.nlm.nih.gov/nuccore/JBQVYN000000000].

The raw data used to generate the Mus pahari assembly is available in the SRA database under the BioProject PRJNA966193.

The RNA-seq data used in this study are available in the SRA database under the BioProject PRJNA923323. Additional KZFP ChIP-seq data used in this study are available as a GEO series in the NCBI GEO database under accession number GSE115287.

The T2T C57BL/6 J assembly used in this study is available in the NCBI Genome database under the GenBank accession number GCA_964188535.1 [https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_964188535.1/]. The T2T CAST/EiJ assembly used in this study is available in the NCBI Genome database under the GenBank accession number GCA_964188545.1 [https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_964188545.1/].

The PRDM9 ChIP-seq data used in this study is available in the SRA database under accession number SRX689499. The DMC1 SSDS data used in this study is available in the GEO database as GEO series GSE75419. Source data are provided with this paper.

Code availability

All data analysis was performed using publicly available tools, and the options used are described in detail in the Methods section.

References

Imbeault, M., Helleboid, P. Y. & Trono, D. KRAB zinc-finger proteins contribute to the evolution of gene regulatory networks. Nature 543, 550–554 (2017).
Bruno, M., Mahgoub, M. & Macfarlan, T. S. The Arms Race Between KRAB-Zinc Finger Proteins and Endogenous Retroelements and Its Impact on Mammals. Annu Rev. Genet 53, 393–416 (2019).
Yang, P. et al. A placental growth factor is silenced in mouse embryos by the zinc finger protein ZFP568. Science 356, 757–759 (2017).
Li, X. et al. A maternal-zygotic effect gene, Zfp57, maintains both maternal and paternal imprints. Dev. Cell 15, 547–557 (2008).
Takahashi, N. et al. ZNF445 is a primary regulator of genomic imprinting. Genes Dev. 33, 49–54 (2019).
Baudat, F. et al. PRDM9 is a major determinant of meiotic recombination hotspots in humans and mice. Science 327, 836–840 (2010).
Myers, S. et al. Drive against hotspot motifs in primates implicates the PRDM9 gene in meiotic recombination. Science 327, 876–879 (2010).
de Tribolet-Hardy, J. et al. Genetic features and genomic targets of human KRAB-zinc finger proteins. Genome Res 33, 1409–1423 (2023).
Wolf, G. et al. KRAB-zinc finger protein gene expansion in response to active retrotransposons in the murine lineage. Elife 9, (2020).
Thomas, J. H. & Schneider, S. Coevolution of retroelements and tandem zinc finger genes. Genome Res 21, 1800–1812 (2011).
Mager, D. L. & Stoye, J. P. Mammalian Endogenous Retroviruses. Microbiol. Spectrum 3, https://doi.org/10.1128/microbiolspec.mdna3-0009-2014 (2015).
Fueyo, R., Judd, J., Feschotte, C. & Wysocka, J. Roles of transposable elements in the regulation of mammalian transcription. Nat. Rev. Mol. Cell Biol. 23, 481–497 (2022).
Huntley, S. et al. A comprehensive catalog of human KRAB-associated zinc finger genes: insights into the evolutionary history of a large family of transcriptional repressors. Genome Res 16, 669–677 (2006).
Kauzlaric, A. et al. The mouse genome displays highly dynamic populations of KRAB-zinc finger protein genes and related genetic units. PLoS One 12, e0173746 (2017).
Baillie, G. J., van de Lagemaat, L. N., Baust, C. & Mager, D. L. Multiple groups of endogenous betaretroviruses in mice, rats, and other mammals. J. Virol. 78, 5784–5798 (2004).
Bertozzi, T. M., Elmer, J. L., Macfarlan, T. S. & Ferguson-Smith, A. C. KRAB zinc finger protein diversification drives mammalian interindividual methylation variability. Proc. Natl Acad. Sci. USA 117, 31290–31300 (2020).
Juriloff, D. M. et al. Investigations of the genomic region that contains the clf1 mutation, a causal gene in multifactorial cleft lip and palate in mice. Birth Defects Res A Clin. Mol. Teratol. 73, 103–113 (2005).
Plamondon, J. A., Harris, M. J., Mager, D. L., Gagnier, L. & Juriloff, D. M. The clf2 gene has an epigenetic role in the multifactorial etiology of cleft lip and palate in the A/WySn mouse strain. Birth Defects Res A Clin. Mol. Teratol. 91, 716–727 (2011).
Kano, H., Kurahashi, H. & Toda, T. Genetically regulated epigenetic transcriptional activation of retrotransposon insertion confers mouse dactylaplasia phenotype. Proc. Natl Acad. Sci. USA 104, 19034–19039 (2007).
Treger, R. S. et al. The Lupus Susceptibility Locus Sgp3 Encodes the Suppressor of Endogenous Retrovirus Expression SNERV. Immunity 50, 334–347.e9 (2019).
Bertozzi, T. M., Takahashi, N., Hanin, G., Kazachenka, A. & Ferguson-Smith, A. C. A spontaneous genetically induced epiallele at a retrotransposon shapes host genome function. Elife 10, https://doi.org/10.7554/eLife.65233 (2021).
Young, G. R. et al. Gv1, a Zinc Finger Gene Controlling Endogenous MLV Expression. Mol. Biol. Evol. 38, 2468–2474 (2021).
Byers, C. et al. Genetic control of the pluripotency epigenome determines differentiation bias in mouse embryonic stem cells. EMBO J. 41, e109445 (2022).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Ferraj, A. et al. Resolution of structural variation in diverse mouse genomes reveals chromatin remodeling due to transposable elements. Cell Genom. 3, 100291 (2023).
Francis, B. et al. The structural diversity of telomeres and centromeres across mouse subspecies revealed by complete assemblies. bioRxiv 10, 619615 (2024).
Gambogi, C. W. et al. Centromere innovations within a mouse species. Sci. Adv. 9, eadi5764 (2023).
Hastings, P. J., Lupski, J. R., Rosenberg, S. M. & Ira, G. Mechanisms of change in gene copy number. Nat. Rev. Genet. 10, 551–564 (2009).
Balachandran, P. & Beck, C. R. Structural variant identification and characterization. Chromosome Res 28, 31–47 (2020).
Abdullaev, E. T., Umarova, I. R. & Arndt, P. F. Modelling segmental duplications in the human genome. BMC Genomics 22, 496 (2021).
Linardopoulou, E. V. et al. Human subtelomeres are hot spots of interchromosomal recombination and segmental duplication. Nature 437, 94–100 (2005).
Audano, P. A. et al. Characterizing the Major Structural Variant Alleles of the Human Genome. Cell 176, 663–675.e19 (2019).
Hinch, R., Donnelly, P. & Hinch, A. G. Meiotic DNA breaks drive multifaceted mutagenesis in the human germ line. Science 382, eadh2531 (2023).
Lukaszewicz, A., Lange, J., Keeney, S. & Jasin, M. De novo deletions and duplications at recombination hotspots in mouse germlines. Cell 184, 5970–5984.e18 (2021).
Baker, C. L. et al. PRDM9 drives evolutionary erosion of hotspots in Mus musculus through haplotype-specific initiation of meiotic recombination. PLoS Genet 11, e1004916 (2015).
Smagulova, F., Brick, K., Pu, Y., Camerini-Otero, R. D. & Petukhova, G. V. The evolutionary turnover of recombination hot spots contributes to speciation in mice. Genes Dev. 30, 266–280 (2016).
Spence, J. P. & Song, Y. S. Inference and analysis of population-specific fine-scale recombination maps across 26 diverse human populations. Sci. Adv. 5, eaaw9206 (2019).
Bailey, J. A. & Eichler, E. E. Primate segmental duplications: crucibles of evolution, diversity and disease. Nat. Rev. Genet. 7, 552–564 (2006).
Weckselblatt, B. & Rudd, M. K. Human Structural Variation: Mechanisms of Chromosome Rearrangements. Trends Genet 31, 587–599 (2015).
Balachandran, P. et al. Transposable element-mediated rearrangements are prevalent in human genomes. Nat. Commun. 13, 7115 (2022).
Han, K. et al. Genomic rearrangements by LINE-1 insertion-mediated deletion in the human and chimpanzee lineages. Nucleic Acids Res. 33, 4040–4052 (2005).
Startek, M. et al. Genome-wide analyses of LINE–LINE-mediated nonallelic homologous recombination. Nucleic Acids Res. 43, 2188–2198 (2015).
Campbell, I. M. et al. Human endogenous retroviral elements promote genome instability via non-allelic homologous recombination. BMC Biol. 12, 74 (2014).
Pace, J. K. 2nd & Feschotte, C. The evolutionary history of human DNA transposons: evidence for intense activity in the primate lineage. Genome Res 17, 422–432 (2007).
Gagnier, L., Belancio, V. P. & Mager, D. L. Mouse germ line mutations due to retrotransposon insertions. Mob. DNA 10, 15 (2019).
Kawase, M. & Ichiyanagi, K. Mouse retrotransposons: sequence structure, evolutionary age, genomic distribution and function. Genes Genet Syst. 98, 337–351 (2024).
Najafabadi, H. S., Albu, M. & Hughes, T. R. Identification of C2H2-ZF binding preferences from ChIP-seq data using RCADE. Bioinformatics 31, 2879–2881 (2015).
Lyu, Y. et al. Stem cell activity-coupled suppression of endogenous retrovirus governs adult tissue regeneration. Cell, (2024).
Ivancevic, A. et al. Endogenous retroviruses mediate transcriptional rewiring in response to oncogenic signaling in colorectal cancer. Sci. Adv. 10, eado1218 (2024).
Lykoskoufis, N. M. R., Planet, E., Ongen, H., Trono, D. & Dermitzakis, E. T. Transposable elements mediate genetic effects altering the expression of nearby genes in colorectal cancer. Nat. Commun. 15, 749 (2024).
Cooper, K. L. The case against simplistic genetic explanations of evolution. Development 151, https://doi.org/10.1242/dev.203077 (2024).
Gozashti, L., Feschotte, C. & Hoekstra, H. E. Transposable Element Interactions Shape the Ecology of the Deer Mouse Genome. Mol. Biol. Evol. 40, https://doi.org/10.1093/molbev/msad069 (2023).
Jacobs, F. M. et al. An evolutionary arms race between KRAB zinc-finger genes ZNF91/93 and SVA/L1 retrotransposons. Nature 516, 242–245 (2014).
Kumar, S. et al. TimeTree 5: An Expanded Resource for Species Divergence Times. Mol. Biol. Evol. 39, msac174 (2022).
Yu, G., Smith, D. K., Zhu, H., Guan, Y. & Lam, T. T.-Y. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods Ecol. Evolution 8, 28–36 (2017).
Gel, B. & Serra, E. karyoploteR: an R/Bioconductor package to plot customizable genomes displaying arbitrary data. Bioinformatics 33, 3088–3090 (2017).
Porubsky, D. et al. SVbyEye: A visual tool to characterize structural variation among whole-genome assemblies. bioRxiv 09, 612418 (2024).
Koren, S. et al. De novo assembly of haplotype-resolved genomes with trio binning. Nat. Biotechnol. 36, 1174–1182 (2018).
Rautiainen, M. et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat. Biotechnol. 41, 1474–1482 (2023).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Mol. Biol. Evolution 38, 4647–4654 (2021).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Marçais, G. et al. MUMmer4: A fast and versatile genome alignment system. PLOS Computational Biol. 14, e1005944 (2018).
Harris, R. S. Improved pairwise alignment of genomic DNA. Ph.D. Thesis. The Pennsylvania State University. https://www.bx.psu.edu/~rsharris/lastz/ (2007).
Sweeten, A. P., Schatz, M. C. & Phillippy, A. M. ModDotPlot—rapid and interactive visualization of tandem repeats. Bioinformatics 40, https://doi.org/10.1093/bioinformatics/btae493 (2024).
Shumate, A. & Salzberg, S. L. Liftoff: accurate mapping of gene annotations. Bioinformatics 37, 1639–1643 (2021).
Smit, A., Hubley, R. & Green, P. RepeatMasker Open-4.0 (2013–2015).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Shumate, A., Wong, B., Pertea, G. & Pertea, M. Improved transcriptome assembly using a hybrid of long and short reads with StringTie. PLOS Computational Biol. 18, e1009730 (2022).
Katoh, K., Misawa, K., Kuma, K. I. & Miyata, T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059–3066 (2002).
Schliep, K. P. phangorn: phylogenetic analysis in R. Bioinformatics 27, 592–593 (2010).
Paradis, E. & Schliep, K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 35, 526–528 (2018).
Sievers, F. et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7, 539 (2011).
Gel, B. et al. regioneR: an R/Bioconductor package for the association analysis of genomic regions based on permutation tests. Bioinformatics 32, 289–291 (2015).
Cheng, J., Dutra, A., Takesono, A., Garrett-Beal, L. & Schwartzberg, P. L. Improved generation of C57BL/6J mouse embryonic stem cells in a defined serum-free media. Genesis 39, 100–104 (2004).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Ramírez, F. et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 44, W160–W165 (2016).
Crusoe, M. et al. The khmer software package: enabling efficient nucleotide sequence analysis [version 1; peer review: 2 approved, 1 approved with reservations]. F1000Res. 4, https://doi.org/10.12688/f1000research.6924.1 (2015).
Döring, A., Weese, D., Rausch, T. & Reinert, K. SeqAn An efficient, generic C++ library for sequence analysis. BMC Bioinforma. 9, 11 (2008).
Luiz, C., Irber, J. & Brown, C. T. Efficient cardinality estimation for k-mers in large DNA sequencing data sets. k-mer cardinality estimation, 056846, (2016).
Khil, P. P., Smagulova, F., Brick, K. M., Camerini-Otero, R. D. & Petukhova, G. V. Sensitive mapping of recombination hotspots using sequencing-based detection of ssDNA. Genome Res 22, 957–965 (2012).
Zhang, Y. et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 9, R137 (2008).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Bailey, T. L. & Elkan, C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int Conf. Intell. Syst. Mol. Biol. 2, 28–36 (1994).
Najafabadi, H. S. et al. C2H2 zinc finger proteins greatly expand the human regulatory lexicon. Nat. Biotechnol. 33, 555–562 (2015).
Thorvaldsdóttir, H., Robinson, J. T. & Mesirov, J. P. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief. Bioinforma. 14, 178–192 (2012).

Acknowledgements

We thank James Thomas, Alice Young, Shelise Brook and Morgan Park at the National Institutes of Health Intramural Sequencing Center (NISC) for generating the long-read sequencing data for the C57BL/6J and 129S1/SvImJ mouse strains; and we thank Tianwei Li and James R. Iben at the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) Molecular Genomic Core for generating the PacBio HiFi sequencing data of Mus spretus and the Next Generation Sequencing data for ChIP-seq experiments. We are very grateful to Takashi Akera and Warif El Yakoubi for providing Mus spretus specimens, as well as to members of the Macfarlan lab, Zuzana Loubalova and Adam Phillippy for helpful discussions. This study utilized the computational resources of the NIH HPC Biowulf Cluster (http://hpc.nih.gov). This work was supported by the Intramural Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development DIR 1ZIAHD008933 (T.S.M.) and ZICHD008986 (R.K.D.) at the National Institutes of Health (NIH), the Wellcome Investigator Award 210757/Z/18/Z (A.C.F.S. and K.C.) and the NIH grant R35-GM130302 (B.E.B.).

Funding

Open access funding provided by the National Institutes of Health.

Author information

Authors and Affiliations

The Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD, USA
Melania Bruno, Sharaf M. Farhana, Apratim Mitra, Dawn E. Watkins-Chow, Ryan K. Dale & Todd S. Macfarlan
Department of Genetics, University of Cambridge, Downing Street, Cambridge, UK
Kevin Costello & Anne C. Ferguson-Smith
Department of Genetics, Epigenetics Institute, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Glennis A. Logsdon
Department of Biochemistry & Biophysics, Penn Center for Genome Integrity, Epigenetics Institute, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Craig W. Gambogi & Ben E. Black
The Jackson Laboratory, Bar Harbor, ME, USA
Beth L. Dumont
European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
Thomas M. Keane

Authors

Melania Bruno
Sharaf M. Farhana
Apratim Mitra
Kevin Costello
Dawn E. Watkins-Chow
Glennis A. Logsdon
Craig W. Gambogi
Beth L. Dumont
Ben E. Black
Thomas M. Keane
Anne C. Ferguson-Smith
Ryan K. Dale
Todd S. Macfarlan

Contributions

Conceptualization, M.B. and T.S.M.; Methodology, M.B., S.M.F., A.M. and K.C.; Investigation, M.B., S.M.F., A.M., K.C., D.E.W.C., R.K.D. and T.S.M.; Data sharing, G.A.L., C.W.G., B.L.D., B.E.B. and T.M.K.; Writing – Original Draft, M.B.; Writing – Review & Editing, M.B., D.E.W.C. and T.S.M. with input from all the authors; Funding Acquisition, R.K.D. and T.S.M.; Resources, T.S.M.; Supervision, M.B., R.K.D., A.C.F.S. and T.S.M.

Corresponding authors

Correspondence to Melania Bruno or Todd S. Macfarlan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Vincenza Colonna, who co-reviewed with Laura Pignata; Didier Trono and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Description of Additional Supplementary Files

Supplementary Data 1

Supplementary Data 2

Supplementary Data 3

Supplementary Data 4

Supplementary Data 5

Supplementary Data 6

Reporting Summary

Transparent Peer Review file

Source data

Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Bruno, M., Farhana, S.M., Mitra, A. et al. Young KRAB-zinc finger gene clusters are highly dynamic incubators of ERV-driven genetic heterogeneity in mice. Nat Commun 16, 9608 (2025). https://doi.org/10.1038/s41467-025-64609-2

Received: 11 March 2025
Accepted: 23 September 2025
Published: 30 October 2025
Version of record: 30 October 2025
DOI: https://doi.org/10.1038/s41467-025-64609-2
