Nucleotide dependency analysis of genomic language models detects functional elements

Tomaz da Silva, Pedro; Karollus, Alexander; Hingerl, Johannes; Galindez, Gihanna Sta. Teresa; Wagner, Nils; Hernandez-Alias, Xavier; Incarnato, Danny; Gagneur, Julien

doi:10.1038/s41588-025-02347-3

Article
Open access
Published: 10 October 2025

Nature Genetics volume 57, pages 2589–2602 (2025)

Abstract

Deciphering how nucleotides in genomes encode regulatory instructions and molecular machines is a long-standing goal. Genomic language models (gLMs) implicitly capture functional elements and their organization from genomic sequences alone by modeling probabilities of each nucleotide given its sequence context. However, discovering functional genomic elements from gLMs has been challenging due to the lack of interpretable methods. Here we introduce nucleotide dependencies, which quantify how nucleotide substitutions at one genomic position affect the probabilities of nucleotides at other positions. We demonstrate that nucleotide dependencies are more effective at indicating the deleteriousness of genetic variants than alignment-based conservation and gLM reconstruction. Dependency analysis accurately detects regulatory motifs and highlights bases in contact within RNAs, including pseudoknots and tertiary structure contacts, revealing new, experimentally validated RNA structures. Finally, we leverage dependency maps to reveal critical limitations of several gLM architectures and training strategies. Altogether, nucleotide dependency analysis opens a new avenue for discovering and studying functional elements and their interactions in genomes.

Main

The basic blueprint of every living organism is encoded in its genome. While high-throughput sequencing allows us to read this genetic information, interpreting its meaning remains a major challenge. A key interpretation method is sequence comparison¹, which identifies functional elements by leveraging nucleotide-level conservation as well as statistical dependencies between nucleotides. Covariation analysis, in particular, has been crucial in structural biology^2,3, for instance, in identifying conservation of Watson–Crick base pairing in RNA. However, these analyses traditionally relied on sequence alignments, limiting their use to highly conserved genomic regions.

Genomic language models (gLMs) have emerged as an alignment-free alternative^4,5. Trained to predict nucleotides from their sequence context, these models learn evolutionary patterns directly from vast amounts of genomic data⁴. Studies have shown that gLMs capture biologically relevant information, distinguishing between functional and nonfunctional transcription factor (TF) binding motifs and identifying genetic variants with phenotypic effects^4,5,6. They have also found use as so-called foundation models for predicting molecular phenotypes, sometimes outperforming other methods^{4,7,8,9,10,11,12,13,14,15,16}. These analyses indicate that gLMs intrinsically represent genomic functional elements. However, the foundation model paradigm uses gLMs as intermediate black boxes and does not reveal these elements.

In this work, we leverage gLMs to provide a measure of dependencies between nucleotide pairs. We systematically study the resulting nucleotide dependency maps to determine which genomic elements they encode and thus exploit them to characterize functional elements and their interactions. This approach also allows us to compare different gLMs and identify their limitations.

Results

Nucleotide dependency maps

Genomic language models are trained to reconstruct nucleotides, thereby providing nucleotide probabilities given their surrounding sequence context (Fig. 1a). In principle, success at reconstructing nucleotides requires detecting characteristic genomic features that are more likely to be found in the sequence context. For example, the probability of a particular nucleotide in the human genome being a guanine strongly depends on whether it is intronic (~22% (ref. ¹⁷)) or located at the third base of a start codon (~100%). To study the relationship between nucleotides and their context using gLMs, we use a technique analogous to in silico mutagenesis (explained in ref. ¹⁸). Specifically, we mutate a nucleotide in the sequence context (query nucleotide) into all three possible alternatives and record the change in predicted probabilities at a target nucleotide in terms of odds ratios (Fig. 1b and Methods). This procedure, which can be repeated for all possible query-target combinations, quantifies the extent to which the language model prediction of the target nucleotide depends on the query nucleotide, all else equal.
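
The procedure described above can be illustrated with a short Python sketch that mutates one query position into its three alternatives and records the log-odds ratio at every target. This is a minimal sketch under stated assumptions, not the implementation used in this work: `toy_predict` is a hypothetical stand-in for a gLM (it simply favours the complement of the mirror position), odds are taken as p/(1 - p), and the base-2 logarithm is an arbitrary choice. The exact definition is given in Methods.

```python
import numpy as np

NUCS = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def toy_predict(seq):
    """Hypothetical stand-in for a gLM: returns an (L, 4) probability array.

    Each position is predicted to be the complement of its mirror position
    with probability 0.85; the other three nucleotides get 0.05 each.
    """
    length = len(seq)
    probs = np.full((length, 4), 0.05)
    for i in range(length):
        partner = seq[length - 1 - i]
        probs[i, NUCS.index(COMPLEMENT[partner])] = 0.85
    return probs

def log_odds_ratios(predict, seq, query):
    """Log2 odds ratios at all targets for the three substitutions at `query`.

    Returns an array of shape (3, L, 4): alternative allele x target position
    x target nucleotide.
    """
    eps = 1e-9  # guards against division by zero for extreme probabilities
    ref = np.clip(predict(seq), eps, 1 - eps)
    ref_odds = ref / (1 - ref)
    out = []
    for alt in NUCS:
        if alt == seq[query]:
            continue  # skip the reference nucleotide
        mutated = seq[:query] + alt + seq[query + 1:]
        alt_probs = np.clip(predict(mutated), eps, 1 - eps)
        out.append(np.log2((alt_probs / (1 - alt_probs)) / ref_odds))
    return np.stack(out)

lor = log_odds_ratios(toy_predict, "ACGTTA", query=0)
```

In this toy setting, substituting position 0 of `ACGTTA` changes the predictions only at the mirrored target (position 5); all other targets have log-odds ratios of zero.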

**Fig. 1: Probing nucleotide dependencies from gLMs.**

We applied this general procedure to 14 gLMs (Extended Data Table 1 and Methods). Unless stated otherwise, we present results from our SpeciesLM gLMs that were trained on regions 5′ of start codons in fungi and metazoa (SpeciesLM fungi and SpeciesLM metazoa; Methods). On selected biological applications, we turn to other gLMs. To assess the biological relevance of these dependencies, we sought to verify that single-nucleotide variants (SNVs) of known functional importance have a greater impact on gLM predictions. Given an SNV at a query position, we computed for any target position the maximum absolute log-odds ratio over all four possible target nucleotides. Next, we averaged these values across all targets to obtain an aggregate score of query variant impact (Methods). We named this metric the variant influence score. In the ClinVar database¹⁹, the influence score was significantly higher for noncoding pathogenic variants than for benign variants (Extended Data Fig. 1a,b). This is despite using a gLM trained only on 2-kb regions 5′ of start codons, which only overlap a small fraction of all transcribed bases. Prior studies leveraged gLM reconstruction probabilities to prioritize functional variants, positing that lower probability indicates greater deleteriousness^5,6 (Methods). Remarkably, this reconstruction-based metric showed substantially lower performance than the influence score. However, the influence score did not outperform alignment-based scores, perhaps because the criteria used by ClinVar to categorize variants as pathogenic include bioinformatics predictions that often integrate alignment-based conservation.
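
The aggregation behind the variant influence score can be sketched as follows. The function takes the log-odds ratios caused by one SNV, keeps the largest absolute value over the four target nucleotides and averages over targets. Excluding the query position from the average is an assumption made here for illustration; the function name is hypothetical and the exact definition is in Methods.

```python
import numpy as np

def variant_influence_score(log_odds, query):
    """Aggregate impact of one SNV on all target positions.

    log_odds: (L, 4) array of log-odds ratios (target position x target
    nucleotide) induced by a single substitution at index `query`.
    """
    # Strongest effect at each target, whichever nucleotide it concerns.
    per_target = np.abs(log_odds).max(axis=1)
    # Assumption: the query position is not counted as its own target.
    keep = np.ones(len(per_target), dtype=bool)
    keep[query] = False
    return per_target[keep].mean()

example = np.zeros((5, 4))
example[2, 1] = -4.0  # strong decrease for one nucleotide at target 2
example[3, 0] = 2.0   # moderate increase for one nucleotide at target 3
score = variant_influence_score(example, query=0)
```

Here the per-target maxima are 0, 4, 2 and 0 for the four non-query targets, so the score is their mean.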

For a less biased comparison, we focused on a dataset from a saturation mutagenesis experiment on nine selected human promoters²⁰ (Fig. 1c). Here the variant influence score correlated with variant effects on absolute gene expression fold change, outperforming reconstruction, as well as alignment-based conservation scores^21,22,23,24. Remarkably, the purely unsupervised influence score was on par with the state-of-the-art supervised expression predictor Borzoi²⁵. These two approaches appeared to capture complementary predictive signals, because a simple integrative model further improved performance (Fig. 1c; similar observations on noncoding ClinVar variants are shown in Extended Data Fig. 1b). The variant influence score also outperformed reconstruction and alignment-based conservation at distinguishing fine-mapped promoter expression quantitative trait locus (eQTL) single-nucleotide polymorphisms from matched controls, both in human (where it did not outperform Borzoi) and in yeast^26,27,28,29 (Extended Data Fig. 1c–f).

Having shown that aggregate dependency strengths reflect functional importance, we then studied individual query-target pairs. For every query-target pair, we considered the maximum effect a query nucleotide change has on the predicted odds of a target, yielding two-dimensional (2D) nucleotide dependency maps (Methods). An example map is shown for the yeast arginine tRNA (Fig. 1d). The entire secondary structure of the tRNA, defined by base pairing within the four arms, clearly stands out with high dependencies. The dependency map also highlighted a tertiary structure contact. Upon introducing single-nucleotide substitutions in these pairs, the gLM adapted its predictions according to the Watson–Crick base pairing and, with a lesser preference, to wobble base pairing (see Fig. 1d for an example). Remarkably, the model recapped structural RNA rules from its reconstruction objective alone, in an alignment-free manner and without focused training on tRNAs.

Nucleotide pairs have two dependencies, depending on which nucleotide is the query. Scoring nucleotide pairs by the maximum of those two values yielded near-perfect secondary structure contact predictions across 172 tRNAs of Saccharomyces cerevisiae. Alternative metrics, including gradient-based dependencies and using masking instead of nucleotide substitution on query, showed lower predictive signal. This trend was confirmed when further assessing the dependencies on cognate donor and acceptor splice sites (Extended Data Fig. 1g,h).
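
The two steps described above, building the 2D map and scoring each pair by the larger of its two directed dependencies, can be sketched in a few lines. The array layout (query x alternative x target x nucleotide) and the zeroed diagonal are assumptions made for this illustration, and the function names are hypothetical.

```python
import numpy as np

def dependency_map(log_odds):
    """Collapse log-odds ratios into an (L, L) query-by-target map.

    log_odds: (L, 3, L, 4) array indexed as
    [query position, alternative allele, target position, target nucleotide].
    """
    dep = np.abs(log_odds).max(axis=(1, 3))
    np.fill_diagonal(dep, 0.0)  # assumption: self-dependencies are ignored
    return dep

def contact_scores(dep):
    """Score each pair by the larger of its two directed dependencies."""
    return np.maximum(dep, dep.T)

toy = np.zeros((4, 3, 4, 4))
toy[0, 1, 3, 2] = -5.0  # mutating position 0 strongly affects position 3
toy[3, 0, 0, 1] = 2.0   # the reverse direction is weaker
dep = dependency_map(toy)
contacts = contact_scores(dep)
```

The directed map is asymmetric (5 in one direction, 2 in the other), whereas the contact score assigns 5 to the pair regardless of which nucleotide was the query.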

In the following sections, we explore and categorize patterns found in nucleotide dependency maps, associate them to biological mechanisms and exploit them to detect and characterize functional elements in the genome.

Blocks along the diagonal highlight regulatory sequence motif instances

We observed that short sets of contiguous nucleotides frequently exhibited strong reciprocal dependencies, manifesting as dense blocks along the diagonal. Many dense blocks were observed at TF motif instances in promoters (Fig. 2a,b). This was in striking contrast to other well-reconstructed locations, including simple repeats such as poly(dA:dT) stretches. Intuitively, an individual mutation in the poly(dA:dT) stretch will have a mild impact on predicting any other element of the stretch and thus dependencies in the repeat tend to be less dense. In contrast, all bases in a TF motif are strongly interdependent, as a mutation at any position can disrupt the entire site’s function by reducing binding. Therefore, we reasoned that TF motifs could be detected using gLMs by searching for dependency blocks.

**Fig. 2: Blocks along the diagonal of dependency maps highlight instances of regulatory sequence motifs.**

To find dependency blocks, we computed the first quartile of query-target dependencies among six consecutive nucleotides (Methods). This quantile-based block score is more robust than the average in isolating strong interactions, while privileging dense blocks. To assess how block scores facilitate the detection of TF binding sites, we leveraged the near-complete TF binding data in S. cerevisiae with nucleotide-level preferences (position weight matrices (PWMs)³⁰). We considered the 1-kb regions 5′ of start codons and defined PWM matches within 10 bp of an experimental binding peak as binding sites for 68 TFs³¹.
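
A quantile-based block score of this kind can be sketched by sliding a 6 x 6 window along the diagonal of a dependency map and taking the first quartile of its entries. This is an illustrative sketch only: excluding the diagonal of each window and reporting one score per window start are assumptions, and how window scores are assigned to individual nucleotides is described in Methods.

```python
import numpy as np

def block_scores(dep, width=6):
    """First quartile of dependencies within each diagonal window.

    dep: (L, L) dependency map. Returns one score per window start,
    that is, an array of length L - width + 1.
    """
    n_windows = dep.shape[0] - width + 1
    off_diag = ~np.eye(width, dtype=bool)  # assumption: skip self-pairs
    scores = np.empty(n_windows)
    for start in range(n_windows):
        window = dep[start:start + width, start:start + width]
        scores[start] = np.quantile(window[off_diag], 0.25)
    return scores

dep = np.zeros((12, 12))
dep[3:9, 3:9] = 2.0  # one dense block spanning positions 3 to 8
scores = block_scores(dep)
```

Because a quartile is used rather than a mean, a window that only partly overlaps the dense block scores zero here, and only the window starting at position 3 receives the full score.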

While reconstruction varied widely for binding site nucleotides and repeat elements, the block score of binding site nucleotides was generally higher than for nucleotides in repeats identified by RepeatMasker (LTR retrotransposons, telomeric/centromeric repeats, rDNA regions, low-complexity DNA, simple repeats; Fig. 2c). Consistent with this, the block score discriminated binding site nucleotides substantially better than reconstruction (Fig. 2d). By comparison, the PhastCons conservation inferred from the alignment of seven yeast Saccharomyces species²¹ had no discriminative power at this task, and modest discriminative power if overlapping coding sequences were removed (Fig. 2d and Extended Data Fig. 2). Extending the alignment to 69 species of the Saccharomycetales order using default parameters did not improve results (Extended Data Fig. 2 and Methods).

Moreover, the block score discriminated binding site nucleotides as effectively as PWM scanning. This result is remarkable because the block score was obtained in a completely unsupervised fashion, whereas the PWMs were not only derived from experimental data but also used to define the positive class. Additionally, if we only benchmark on nucleotides forming part of a PWM match, the block score will demonstrate an ability to discriminate binding from nonbinding PWM matches, thus showing that the gLM considers the context of the motif (Extended Data Table 2). In sum, this analysis demonstrates the ability of gLMs to detect regulatory elements and the utility of dependency maps.

We note that not all motifs appear as complete blocks. S. cerevisiae Abf1 motif, for example, is represented as two spaced and interacting blocks, reflecting the dimeric binding preferences of this factor (Fig. 2e). Thus, even within motifs, the dependency maps can serve to visualize underlying functional relationships.

Off-diagonal blocks indicate sequence element interactions

Blocks in the dependency maps also occurred away from the diagonal, revealing distal interactions, such as between key transcription initiation elements (TATA box and INR) in Drosophila melanogaster (Fig. 3a) and the primary splicing determinants (donor, branch and acceptor sites) in S. cerevisiae (Fig. 3b and Extended Data Fig. 3). The short length of yeast introns allowed a genome-wide assessment that showed that dependencies between donor and acceptor splice sites were higher than dependencies between donor and decoy acceptor-like sequences within the intron or background dependencies at matched distances (Fig. 3c). These results indicate that distal dependencies capture a range of functional relationships among sequence elements, including promoter and transcript architecture.

**Fig. 3: Off-diagonal blocks highlight sequence element interactions.**

Going a step further, we asked whether the maps could also reflect changes in transcript structure due to interindividual variation. To this end, we leveraged aberrant splicing events associated with rare variants from 946 human individuals (GTEx³²) and SpliceBERT, a language model trained on vertebrate RNA sequences¹⁴. As an example, a rare variant in the TRPC6 gene disrupts a canonical donor splice site, leading to the use of a cryptic site and the creation of an aberrant, shorter intron. The dependency map reflects this by showing a strong interaction between the canonical donor and the boundaries of this new intron (Fig. 3d). Across 1,811 rare-variant-associated aberrant splicing events, dependencies between the variant position and the ends of the corresponding outlier intron exceeded those between nucleotides at matched distances (Fig. 3e). These results held for both outlier intron ends and all variant location categories (Fig. 3e). We conclude that dependency maps capture splicing rules and can reflect variant-induced transcript structure alterations.

Nucleotide dependencies reveal RNA secondary and tertiary structure contacts

Besides blocks, we observed antiparallel diagonals, that is, distal stretches of consecutive nucleotides that depend on each other one-to-one in reverse order as in the case of the four arms of the yeast arginine tRNA described above (Fig. 1d). Using a convolutional filter, we systematically called regions with antiparallel elements across diverse fungal genomes (Methods and Extended Data Fig. 4a). Dependencies in antiparallel diagonals were typically consistent with Watson–Crick or wobble base pairing (Extended Data Fig. 4b), indicating that they captured helical stems, critical to RNA folding. Moreover, antiparallel diagonals with the strongest dependencies were found among highly structured RNAs such as tRNAs and ribosomal RNAs (Extended Data Fig. 4c and Methods). Hence, these findings suggest that detecting antiparallel diagonals in nucleotide dependency maps could be instrumental in inferring RNA structures.
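
One simple way to call antiparallel diagonals is to correlate the dependency map with a small kernel that has weight only on its anti-diagonal. The sketch below shows this idea with NumPy; the kernel size, the uniform weights and the function name are assumptions for illustration, and the filter actually used is described in Methods.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def antiparallel_response(dep, k=5):
    """Mean dependency along the anti-diagonal of every k x k window.

    dep: (L, L) dependency map. Returns an (L-k+1, L-k+1) array whose entry
    [r, c] scores the window with top-left corner at (r, c).
    """
    kernel = np.fliplr(np.eye(k)) / k  # weights on the anti-diagonal only
    windows = sliding_window_view(dep, (k, k))
    return (windows * kernel).sum(axis=(2, 3))

# Toy map with one 5-bp stem: positions 2..6 pair with 17..13 in reverse order.
dep = np.zeros((20, 20))
for t in range(5):
    dep[2 + t, 17 - t] = 3.0

response = antiparallel_response(dep, k=5)
top_left = np.unravel_index(response.argmax(), response.shape)
```

The strongest response is found for the window whose anti-diagonal coincides with the stem, with top-left corner at row 2 and column 13.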

To evaluate the potential of dependency maps for capturing RNA structures, we used RiNALMo, a language model trained on 36 million noncoding RNA sequences, both annotated and unannotated, from RNAcentral³³, nt³⁴, Rfam³⁵ and Ensembl³⁶, spanning a wide variety of species⁷. Originally, the authors of RiNALMo trained this LM as a foundation for a supervised predictor of RNA secondary structures. We, instead, scored contacts as the largest of the two dependency map entries for each pair of nucleotides, computed using RiNALMo’s underlying LM. Remarkably, these scores were strongly predictive of secondary structure contacts, with areas under the receiver operating characteristic (ROC) curve exceeding 0.9 for most RNA families (ArchiveII database³⁷; Fig. 4a), although we performed no fine-tuning on secondary structures. Nonetheless, tools optimized for RNA secondary structure analysis, such as RNAalifold and the fine-tuned RiNALMo, outperformed dependency scores at predicting secondary structure contacts of experimentally validated structures (Extended Data Fig. 4d).
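
Evaluating pairwise scores against annotated contacts amounts to computing an area under the ROC curve over nucleotide pairs. The sketch below shows one way to do this with NumPy, using the pairwise (Mann-Whitney) definition of the AUROC. The helper names and the choice to evaluate each unordered pair once are assumptions made for illustration.

```python
import numpy as np

def upper_triangle_pairs(matrix, min_separation=1):
    """Values for each unordered pair (i < j) at least min_separation apart."""
    i, j = np.triu_indices(matrix.shape[0], k=min_separation)
    return matrix[i, j]

def auroc(scores, labels):
    """Probability that a random contact outscores a random non-contact.

    Ties count as one half. Quadratic in the number of pairs, which is
    acceptable for a sketch on short RNAs.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

perfect = auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
partial = auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

In practice, `upper_triangle_pairs` would be applied to both the symmetric contact-score matrix and the binary matrix of annotated contacts before calling `auroc`.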

**Fig. 4: Dependency maps reveal known and new RNA structures, and highlight tertiary contacts.**

However, secondary structures are simplified planar representations of the topology of a single possible 3D folding of an RNA sequence, missing important contacts occurring in the 3D fold. We observed that some apparent false positive predictions from our dependency-map approach corresponded to tertiary structure contacts that were absent from predictions by the supervised RiNALMo. For instance, in Archaeoglobus fulgidus isoleucine tRNA, the dependency maps showed 6 bp with dependencies as strong as secondary-structure contact dependencies (dependency > 6). These constituted six of the eight known contacts found only in the tertiary structure. RiNALMo’s supervised secondary structure predictor unsurprisingly missed these tertiary structure contacts (Fig. 4b). This ability to detect tertiary interactions is important, as they provide useful spatial constraints to help determine an RNA’s 3D structure.

To systematically evaluate the added value of dependency maps in capturing tertiary structure contacts, we analyzed noncanonical (that is, non-Watson–Crick/wobble) contacts in Protein Data Bank (PDB) RNA structures in the CompaRNA database³⁸. We found that 50% of the pairs with a dependency score larger than 13.5 and not predicted to be in secondary structure by RiNALMo’s supervised model were annotated as contacts (Fig. 4c). Across the entire database, noncanonical base pairs were well-captured by the dependency maps (Fig. 4d; area under the curve (AUC) = 0.8). In contrast, this information was largely lost by the supervised RiNALMo and by RNAalifold (Fig. 4d and Extended Data Fig. 4d). These results indicate the utility of dependency maps for RNA structure inference by providing candidate contacts not captured by secondary structure contact predictors.

These findings prompted us to investigate further the potential of dependency maps in addressing major challenges of secondary structure prediction, such as pseudoknot detection. Pseudoknots are important nonsecondary structure elements that form when base pairs are not nested, for example, bases in a loop pairing with another single-stranded region. We observed high dependencies between bases of documented contacts implied by pseudoknots, such as in the 396 nt-long RNase P RNA (see Fig. 4e and Extended Data Fig. 4e for another example in a riboswitch), in which not only the stems but also the pseudoknot are reflected with strong antiparallel diagonals. Analyzing systematically 2,530 pseudoknot-containing RNA structures with less than 90% sequence similarity from the bpRNA-1m(90) dataset³⁹, we found that pairs of nucleotides in pseudoknots showed substantially higher dependencies than pairs not belonging to structural contacts (AUC = 0.92; Extended Data Fig. 4f).

An RNA’s secondary structure represents the topology of a single conformation. However, an RNA sequence can adopt alternative functional RNA folds. We found that dependency maps can capture alternative structures. For instance, the dependency map of the tryptophan leader sequence in the bacterium Escherichia coli (Fig. 4f), a structured region for which tryptophan abundance regulates the switch between terminator and antiterminator conformations⁴⁰, captures the two alternative folds, with domain 3 being involved in antiparallel diagonals with both domain 2 and domain 4 (ref. ⁴⁰).

To assess the capacity of dependency maps to derive new structural predictions, we performed in-cell chemical probing of E. coli with DMS, followed by high-throughput mutational profiling analysis (DMS-MaPseq), a transcriptome-wide assay probing adenines and cytosines not engaged in Watson–Crick base pairing⁴¹. Transcriptome-wide, the structural contacts predicted by the antiparallel patterns in dependency maps can efficiently capture experimentally derived RNA base-pair contacts (Extended Data Fig. 4g,h).

We then focused on all noncoding regions upstream of the start codon spanning 500 nucleotides, as they harbor different transcribed regions, including structures with roles in translation and transcription regulation⁴². We selected dependency maps indicating the presence of at least two stem loops and not belonging to an annotated structure, revealing four previously unreported secondary structures corroborated by experimental data from DMS-MaPseq and validated by covariation analysis (Fig. 4g and Extended Data Fig. 4i). Notably, as covariation analysis typically requires a high-quality sequence alignment and, optionally, a predicted RNA structure, the ability of nucleotide dependencies to capture functionally relevant RNA structural contacts in an alignment-free and unsupervised fashion underscores their predictive power.

Collectively, these results show that dependency-map analysis can overcome the typical challenges associated with RNA structure prediction, capturing both secondary and tertiary structure contacts, pseudoknots and alternative structures of functionally relevant RNAs.

gLMs capture forward and inverted duplications without memorization

We observed parallel (Fig. 5a and Extended Data Fig. 5a) and antiparallel diagonals reflecting duplicated sequences in the forward and reverse complement orientations, respectively. Further in silico experiments demonstrated that gLMs have modeled the duplication operation itself, rather than relying on memorizing these sequences (Fig. 5b, Extended Data Fig. 5b and Supplementary Note).

**Fig. 5: gLMs capture sequence duplications without memorization.**

Additionally, gLMs will only introduce short stretches of antiparallel dependencies in specific contexts, rather than associating any pair that could theoretically engage in Watson–Crick base pairing, demonstrating that the models have learned determinants of RNA structure beyond reverse complementarity (Extended Data Fig. 5c and Supplementary Note).

Dependency strength depends on genomic distance

We then investigated pattern-independent, global properties of the distribution of dependencies. To this end, we focused on S. cerevisiae as a model system. Nucleotide dependencies followed a power–law relationship with respect to distance to the query nucleotide, decaying by about 78% per tenfold distance increase (Fig. 6a). We did not find substantial variations in the decay rate across various types of genomic regions (Fig. 6b). However, dependencies were generally 1.64× stronger in the mitochondrial than in the nuclear genome (Fig. 6b). Browsing dependency maps of mitochondria revealed dependency-rich regions whose biological interpretation needs further investigation (Fig. 6c). Investigating deviations from the general power–law trend revealed higher dependencies at 3-nucleotide spacing, perhaps as a consequence of the high content of coding sequences in yeast. Nucleosome positioning also appeared to influence dependency distributions, with stronger dependencies than expected by the power law at distances corresponding to nucleosome position periodicity on both S. cerevisiae (164 bp) and Schizosaccharomyces pombe (152 bp)⁴³ (Fig. 6d). We conclude that nucleotide dependency maps offer new avenues to study general constraints on genomic sequences.
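
The relation between a power-law exponent and the reported decay per tenfold distance increase can be made explicit with a small worked example. If dependency scales as distance raised to a slope s, then a tenfold increase in distance multiplies the dependency by 10^s, so a 78% decay corresponds to s = log10(0.22), roughly -0.66. The sketch below fits such a slope by linear regression in log-log space on synthetic data; the function names and the synthetic curve are illustrative assumptions, not data from this study.

```python
import numpy as np

def fit_power_law(distance, dependency):
    """Least-squares line in log10-log10 space; returns (slope, intercept)."""
    slope, intercept = np.polyfit(np.log10(distance), np.log10(dependency), 1)
    return slope, intercept

def decay_per_decade(slope):
    """Fractional loss of dependency for each tenfold increase in distance."""
    return 1.0 - 10.0 ** slope

# Synthetic curve constructed to lose 78% per decade (not real data).
distance = np.arange(1, 1001, dtype=float)
dependency = 3.0 * distance ** np.log10(0.22)

slope, intercept = fit_power_law(distance, dependency)
decay = decay_per_decade(slope)
```

Deviations from the fitted line at specific distances, such as a 3-nucleotide spacing, would then appear as residuals in the same log-log space.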

**Fig. 6: Dependencies relate to distance in a species and region-specific way and reveal periodicities intrinsic to the genome.**

Dependency maps uncover shortcomings in gLM model designs and training data selection

Current gLMs differ in both model architecture and the sequence data on which they were trained. As of the time of writing, there is no consensus on the advantages and disadvantages of these different approaches, and comparisons are challenging due to the complexity of gLMs. We set out to use nucleotide dependencies, which can be computed for any gLM, as a general tool for visualizing and getting insights into existing gLMs.

Human tRNAs are suitable loci for comparative analysis both because several models have been trained on human genomes only and because tRNAs entail well-established and highly conserved distal functional dependencies. We observed that some modeling choices introduce artifacts in the dependency maps. For example, models belonging to the Nucleotide Transformer family⁹ do not reconstruct at the single base level but instead predict nonoverlapping spans of six nucleotides. This produces artificial dependency blocks along the diagonal, which do not represent motif instances but arise because nucleotides of the same span are generally more dependent (Fig. 7a). Nevertheless, these models are capable of learning dependencies at the single base level, for example, some tRNA stem contacts in the human tRNA-Arg-TCT-4-1 (Fig. 7a).

**Fig. 7: Dependency maps to compare gLMs and diagnose their shortcomings.**

Equally, autoregressive models, for example, Evo⁸, do not consider bidirectional context when making predictions; instead, they are designed to predict the next nucleotide given its 5′ context. This creates an artifact at the beginning of genomic elements such as the tRNA, which likely arises because the model cannot deduce the element until it has seen sufficiently many tokens inside of it (Fig. 7a). This problem can be mitigated by running the model both on the forward and reverse strand and taking the maximum dependency within a pair of nucleotides. Nevertheless, more appropriate measures of nucleotide dependencies for autoregressive models may need to be developed.
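
The mitigation described above can be sketched as follows. A map computed on the reverse-complement strand indexes positions from the opposite end, so it is flipped along both axes before the element-wise maximum is taken, and the result is then symmetrized within each pair. The function name and the assumption that both maps cover the same window are illustrative.

```python
import numpy as np

def combine_strands(dep_forward, dep_reverse):
    """Merge dependency maps from forward and reverse-complement runs.

    dep_forward: (L, L) map computed on the forward strand.
    dep_reverse: (L, L) map computed on the reverse complement, indexed in
    reverse-complement coordinates (position i there is L - 1 - i here).
    """
    aligned = dep_reverse[::-1, ::-1]  # back to forward-strand coordinates
    both = np.maximum(dep_forward, aligned)
    return np.maximum(both, both.T)    # maximum within each nucleotide pair

forward = np.zeros((5, 5))
forward[0, 4] = 1.0  # query 0 affects downstream target 4
reverse = np.zeros((5, 5))
reverse[1, 3] = 2.0  # in forward coordinates: query 3 affects target 1
combined = combine_strands(forward, reverse)
```

In this toy case, the dependency between positions 1 and 3 is visible only in the reverse-complement run, yet it appears symmetrically in the combined map.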

Comparing models trained on different types of sequences revealed starker differences. Specifically, models trained only on the human genome, regardless of architecture, parameter count or whether within-species variation was included, did not learn the human tRNA structure to any meaningful degree (Fig. 7b). By contrast, models trained on multiple species succeeded in at least learning aspects of human tRNA structure, regardless of architecture and whether the training data included any human genomes. Similar results were observed when evaluating the performance of gLMs from the Nucleotide Transformer family, which all show very similar architectures, on the human promoter saturation mutagenesis assay²⁰ (Fig. 7c) and ClinVar¹⁹ (Extended Data Fig. 6). We conclude that infrequent genomic elements, even if they are highly conserved, generally require a multispecies approach to be learned.

Discussion

In conclusion, we introduced nucleotide dependencies that quantify how nucleotide substitutions at one genomic position affect the likelihood of nucleotides at another position. This new metric appears to be a general and effective approach to identifying functionally related nucleotides using gLMs. Nucleotide dependency maps reveal functional elements across various biological processes, including transcriptional and post-transcriptional regulatory elements, their interactions and RNA folding. Therefore, this new metric has implications across multiple areas of computational and genome biology.

Traditionally, comparative genomics has helped identify functional sequences by leveraging the concept of sequence conservation, a major indicator of functional importance based on purifying selection among homologous sequences, that is, sequences descended from a common ancestor. Algorithmically, sequence alignment is first used to identify homologous sequences; conservation is then estimated from the aligned nucleotide frequencies adjusted for phylogenetic drift and mutational biases. This approach limits the scope to alignable homologous sequences. In contrast, gLMs can more flexibly borrow information across sequences with similar contexts, allowing them to capture recurrent patterns such as TF binding site motifs and their functional arrangements that can have arisen independently on nonhomologous sequences. In principle, this also allows gLMs to capture instances of positive selection, for example, where a sequence element has been acquired only recently in a specific species, although this ability is currently unexplored. Nonetheless, there may be specific new evolutionary features that exceed the current reach of genomic language modeling.

The nucleotides predicted by gLMs are not only shaped by functional elements but also include mutational biases and easy-to-predict low-complexity regions that follow simple rules such as repeats. We provide preliminary evidence that analyzing nucleotide dependencies helps disentangle some of these factors, such as highly reconstructed regulatory elements compared to highly reconstructed repeats. However, development of gLM training strategies explicitly accounting for repeats⁵ and mutational processes may help to further focus these models on functional elements.

So far, gLM-derived variant effect metrics leverage reconstruction probability^5,6, presuming that unlikely sequences are more deleterious. We showed that the influence of a nucleotide on predicting others is a more effective indicator of deleteriousness and could outperform alignment-based conservation. However, accounting for genetic drift and mutational biases will require research at the intersection of genomic language modeling and population genetics.

We have shown that dependency maps provide a promising new entry point to unravel the regulatory code. Regulatory elements, such as TF binding sites, manifest as dense blocks in dependency maps. We showed in yeast that applying simple image processing techniques on dependency maps identified these sites with an accuracy comparable to models trained on experimental binding data. Thus, this method is valuable for discovering regulatory elements, particularly where experimental data are limited (for example, nonmodel species, post-transcriptional regulation). Future improvements could involve modeling motifs with variable-sized blocks and accounting for all base-level dependencies. Moreover, dependencies also highlighted interactions between sequence elements in splicing and promoters, a property that future work could leverage to explore how sequence context governs the activity of regulatory elements.

Dependency maps accurately reflect bases in contact within RNA folds, a substantial finding given the limited ground-truth data in RNA structural biology. Our entirely unsupervised approach, which relates to techniques recently proposed for unraveling amino acid contacts from protein language models^44,45,46,47, overcomes limitations of secondary structure inference, yielding information on both canonical and noncanonical contacts, pseudoknots and alternative folding. Analyzing nucleotide dependencies within RNA structure sequences is related to covariation analysis, which identifies compensating substitutions between pairs of positions in an alignment as evidence for evolutionarily conserved contacts. In contrast to covariation analysis, our approach does not require alignments, which are rarely unique and for which even a single-nucleotide shift can introduce ambiguities, affecting the covariation statistics. We note, however, that nucleotide dependency analysis and sequence-alignment-based approaches are complementary. Notably, sequence alignment often provides direct evidence of a common ancestor sequence. In contrast, gLM dependencies provide more flexibility for detecting functional interactions such as noncanonical contacts and in regimes with few alignable sequences. Furthermore, using nucleotide dependencies to infer structural contacts relies on the gLM having been trained on enough sequences to have captured relevant evolutionary footprints. In this respect, future work could investigate the influence of the choice of species, sequences and model design.

The gLM evaluations are often based on high-level aggregate statistics, such as the area under the ROC (AUROC) curve and R², which assess the performance of downstream tasks that further models build upon. These evaluations conflate the contributions of gLMs as foundational models with those of the downstream supervised models and thus provide narrow, unidimensional assessments. Nucleotide dependencies instead enable benchmarking the gLMs themselves. We revealed critical limitations in current model architectures and single-species training practices, paving the way for more effective and generalizable gLMs.

Across various scientific fields, visualization tools also enable researchers to generate new observations and hypotheses. A nonquantifiable contribution of dependency maps, but perhaps not the least, is to allow visualizing selective constraints on sequence in a new way.

Methods

SpeciesLM training

For SpeciesLM metazoa, we obtained metazoan genomes comprising 494 different species from the Ensembl 110 database³⁶. For each annotated protein-coding gene, we extracted 2,000 bases 5′ to the start codon and trained a species-aware masked language model on this region. We followed the training and tokenization procedure outlined in Species-aware gLMs⁴, but kept the batch size at 2,304, despite increasing the input sequence length, resulting in approximately twice as many tokens seen during training as in SpeciesLM fungi 5′. We used rotary positional encoding to inject positional information into the Transformer blocks.

For SpeciesLM fungi, we deviated from the above recipe by tokenizing each base of the sequences discussed in ref. ⁴ separately (single nucleotide, 1-mer tokenization) and using learned absolute positional encodings. To stabilize training, we increased dropout in the multilayer perceptron layers of the transformer to 0.2 and set it to 0.1 for attention dropout.

Overall, we improved the training efficiency by fusing biases of the linear layers, the multilayer perceptron in the transformer and the optimizer using Nvidia Apex. We used FlashAttention2 (ref. ⁵¹) to train all models.

Nucleotide dependencies and variant influence score

We define the dependency between a variant nucleotide k_alt at position i and a target position j as

$${e}_{i,j,{k}_{\mathrm{alt}}}=\max {\left\{\left|{\log }_{2}\left(\frac{\widehat{\mathrm{odds}}\left({n}_{j}=k\mid {n}_{1},\ldots ,{n}_{i}={k}_{\mathrm{alt}},\ldots ,{n}_{N}\right)}{\widehat{\mathrm{odds}}\left({n}_{j}=k\mid {n}_{1},\ldots ,{n}_{i}={k}_{\mathrm{ref}},\ldots ,{n}_{N}\right)}\right)\right|\right\}}_{k\in \left\{\mathrm{A},\mathrm{C},\mathrm{G},\mathrm{T}\right\}}$$

where k is one of the four possible nucleotides A, C, G or T; n_i and n_j are the nucleotides at position i and j, respectively; k_ref is the nucleotide in the reference, nonaltered input sequence, and k_alt is the nucleotide in the alternative sequence. The odds estimates are computed from the predictions of a gLM under consideration. For this computation, none of the nucleotides (including the target nucleotide) is masked.

The variant influence score $e_{i,k_{\mathrm{alt}}}$, for a sequence of N nucleotides, is defined by averaging the dependencies on a variant nucleotide at position i across all positions j = 1, …, N such that $j\ne i$.
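The two definitions above can be illustrated with a minimal sketch. It assumes that the per-position probabilities predicted by the gLM for the reference and for the alternative sequence are already available as lists of four values (A, C, G, T). The function names and the clipping constant `eps` are illustrative choices, not taken from the authors' code.

```python
import math

def log2_odds(p, eps=1e-9):
    """log2 of the odds p / (1 - p), with p clipped away from 0 and 1 (eps is an assumption)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log2(p / (1.0 - p))

def variant_dependency(probs_ref_j, probs_alt_j):
    """e_{i,j,k_alt}: maximum over k of |log2 odds ratio| at one target position j."""
    return max(
        abs(log2_odds(p_alt) - log2_odds(p_ref))
        for p_ref, p_alt in zip(probs_ref_j, probs_alt_j)
    )

def variant_influence_score(probs_ref, probs_alt, i):
    """Average of e_{i,j,k_alt} over all target positions j != i."""
    deps = [
        variant_dependency(probs_ref[j], probs_alt[j])
        for j in range(len(probs_ref))
        if j != i
    ]
    return sum(deps) / len(deps)
```

For example, if a variant raises the probability of A at a single target position from 0.5 to 0.8 and leaves all other positions unchanged, the dependency at that position is log2(4) = 2, and the influence score is 2 divided by the number of target positions.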

A nucleotide dependency e_i,j between a query position i and a target position j on a sequence of N nucleotides is given by:

$${e}_{i,j}=\max {\left\{\left|{\log }_{2}\left(\frac{\widehat{\mathrm{odds}}\left({n}_{j}=k\mid {n}_{1},\ldots ,{n}_{i}\ne {k}_{\mathrm{ref}},\ldots ,{n}_{N}\right)}{\widehat{\mathrm{odds}}\left({n}_{j}=k\mid {n}_{1},\ldots ,{n}_{i}={k}_{\mathrm{ref}},\ldots ,{n}_{N}\right)}\right)\right|\right\}}_{k\in \left\{\mathrm{A},\mathrm{C},\mathrm{G},\mathrm{T}\right\}}$$

We compute dependencies for all i,j pairs such that $i\ne j$; that is, we do not consider self-dependencies.
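A minimal sketch of how a full dependency map could be assembled is given below. It assumes that the maximum is also taken over the three alternative nucleotides at the query position, and that a callable `predict_probs` (a hypothetical wrapper around a gLM) returns the four nucleotide probabilities at every position of an unmasked input. `toy_model` is a made-up stand-in used only to make the example runnable.

```python
import math

NUCS = "ACGT"

def log2_odds(p, eps=1e-9):
    """log2 odds with clipping; eps is an illustrative choice."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log2(p / (1.0 - p))

def dependency_map(seq, predict_probs):
    """Return e[i][j] for all i != j; the diagonal is left at 0.0."""
    n = len(seq)
    ref = [[log2_odds(p) for p in row] for row in predict_probs(seq)]
    e = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for alt in NUCS:
            if alt == seq[i]:
                continue
            # Substitute the query nucleotide; nothing is masked.
            mutated = seq[:i] + alt + seq[i + 1:]
            probs = predict_probs(mutated)
            for j in range(n):
                if j == i:
                    continue
                dep = max(abs(log2_odds(p) - r) for p, r in zip(probs[j], ref[j]))
                e[i][j] = max(e[i][j], dep)
    return e

def toy_model(seq):
    """Hypothetical stand-in for a gLM: the last position 'pairs' with the first."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    probs = [[0.25] * 4 for _ in seq]
    probs[-1] = [0.7 if k == comp[seq[0]] else 0.1 for k in NUCS]
    return probs
```

With `dependency_map("ACGT", toy_model)`, only the entry for query position 0 and target position 3 is nonzero, mirroring how a base pair would appear as an off-diagonal signal. Each query position requires three forward passes of the model.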

In autoregressive models, a query variant cannot directly affect the prediction of a target position located 5′ of the query. Thus, to obtain the lower triangular matrix of the dependency map, we also run the model on the reverse strand.

In the SpeciesLM metazoa, which predicts nucleotides as overlapping 6-mers, the procedure needs to be adapted to yield one prediction for each target nucleotide. This is achieved by first computing for each of the six 6-mers that overlap the target nucleotide of interest, which probability it implies for this target nucleotide, as previously described⁴. We then average these six probabilities to obtain a single probability.

For the Nucleotide Transformer models, which predict only nonoverlapping 6-mers, we use a similar approach. Consider the case of predicting the probability of observing nucleotide n at position i of the sequence. In the tokenized sequence, this nucleotide has position p in the kth 6-mer where:

$$k=\left\lfloor \frac{i}{6}\right\rfloor$$

$$p=i \bmod 6$$

The model predicts a distribution over all 4⁶ possible 6-mers at position k. We first discard all predictions corresponding to 6-mers that contain a nucleotide that differs from the reference sequence at any location other than p—which leaves only four 6-mers. We renormalize so that the predicted probability of these remaining 6-mers sums to one. We then record the (renormalized) probability of the 6-mer that has the desired nucleotide n at position p.
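The renormalization step can be sketched as follows, assuming 0-based positions and a predicted distribution given as a list aligned with `KMERS`. The lexicographic ordering of `KMERS` is an illustrative assumption and does not reflect the actual vocabulary order of the Nucleotide Transformer tokenizer.

```python
from itertools import product

NUCS = "ACGT"
KMERS = ["".join(t) for t in product(NUCS, repeat=6)]  # all 4**6 = 4096 6-mers
KMER_INDEX = {kmer: idx for idx, kmer in enumerate(KMERS)}

def nucleotide_probs(seq, i, kmer_probs):
    """Probabilities of A, C, G, T at 0-based position i of seq.

    kmer_probs is the predicted distribution over KMERS for the token
    containing position i (token k = i // 6, offset p = i % 6).
    """
    k, p = divmod(i, 6)
    ref_kmer = seq[6 * k: 6 * k + 6]
    # Keep only the four 6-mers that match the reference everywhere except at p.
    kept = [
        kmer_probs[KMER_INDEX[ref_kmer[:p] + n + ref_kmer[p + 1:]]]
        for n in NUCS
    ]
    total = sum(kept)
    return [x / total for x in kept]
```

Conditioning on the reference at the other five positions of the token, rather than summing over them, keeps the resulting probability specific to the sequence context that was actually provided.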

Apart from extracting nucleotide-level probabilities with the above-mentioned method, we have also experimented with computing the probability for a nucleotide at position i as the sum of all k-mers containing that nucleotide at that position. Evaluation of nucleotide dependencies within tRNAs revealed a worse performance with this method.

Variant impact benchmarks

As our metric of variant impact, we used the variant influence score. This average is computed over the full receptive field of the model for the SpeciesLM. For Nucleotide Transformer models, we only average over the central 2 kb, so as to facilitate comparisons. Nevertheless, we provide the full sequence context for which this model has been trained.

For comparison, we also calculated a variant effect score based on the gLM reconstruction at the query variant. Specifically, this score is the log ratio between the predicted probability of the variant nucleotide and the predicted probability of the reference nucleotide^5,6.

Finally, we downloaded conservation scores (PhyloP and PhastCons) for human and S. cerevisiae from the University of California, Santa Cruz genome browser database^{21,22,23,24,52}. For humans, these include the conservation scores based on the 100-way, 447-way and 470-way alignment.

Promoter saturation mutagenesis

Promoter saturation mutagenesis (ref. ²⁰) data mapped to hg38 were provided by V. Agarwal (mRNA Center of Excellence, Sanofi, Waltham, MA, USA). As discussed in ref. ²⁹, we excluded the FOXE1 promoter due to the low replicability of the measurements, leaving nine promoters and comprising 8,635 variants. Variants were then intersected with the human gene 5′ regions (that is, the regions 2-kb 5′ of annotated start codons). Then, the variant influence score was calculated for each variant measured in the assay from the LM dependencies for these regions. The variant influence score was then correlated with the absolute value of the measured log₂ fold change in expression. This correlation was computed for each promoter and then averaged across promoters.

To determine confidence intervals, we performed 100 bootstrap samples per promoter and recomputed the correlation for each bootstrap sample. The confidence interval was defined by adding/subtracting two standard deviations of the average correlation.
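One possible reading of this procedure is sketched below. The type of correlation coefficient is not specified in this section, so Pearson correlation is used as an assumption. The sketch also assumes that variants are resampled with replacement within each promoter and that the standard deviation is taken over the bootstrap replicates of the promoter-averaged correlation. All function names are illustrative.

```python
import random
import statistics

def pearson(x, y):
    """Pearson correlation (an assumption; the coefficient type is not stated)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def mean_correlation(promoters):
    """promoters: dict name -> (influence_scores, abs_log2_fold_changes)."""
    return statistics.fmean(pearson(s, m) for s, m in promoters.values())

def bootstrap_interval(promoters, n_boot=100, seed=0):
    """Average correlation +/- two standard deviations of its bootstrap replicates."""
    rng = random.Random(seed)
    boot_means = []
    for _ in range(n_boot):
        resampled = {}
        for name, (scores, measured) in promoters.items():
            # Resample variants with replacement within each promoter.
            idx = [rng.randrange(len(scores)) for _ in scores]
            resampled[name] = ([scores[t] for t in idx], [measured[t] for t in idx])
        boot_means.append(mean_correlation(resampled))
    centre = mean_correlation(promoters)
    sd = statistics.stdev(boot_means)
    return centre - 2 * sd, centre + 2 * sd
```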

eQTL variants

For human eQTL, we downloaded SUSIE²⁶ fine-mapped GTEx eQTL data from EBI. We then intersected these data with the human gene 5′ regions. This procedure, by design, enriches for promoter eQTL. Similar to the details in ref. ²⁹, we considered every eQTL variant with a posterior inclusion probability higher than 0.9 as putative causal and we considered any eQTL variant with posterior inclusion probability lower than 0.01 as putative noncausal. We only considered putative noncausal eQTL intersecting regions, which also include at least one causal eQTL. This procedure gave 2,958 eQTL variants, of which 1,631 were classified as putative causal. Then, the influence score for each variant was computed based on the nucleotide dependencies in these regions. We ranked variants according to the influence score. Confidence intervals were computed using bootstrapping as before.
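The labeling rule for human eQTL variants can be sketched as follows. The input layout (tuples of variant identifier, region identifier and posterior inclusion probability) and the function name are assumptions made for illustration.

```python
def label_eqtl_variants(variants, causal_pip=0.9, noncausal_pip=0.01):
    """variants: iterable of (variant_id, region_id, pip) tuples.

    Returns dict variant_id -> True (putative causal) or False (putative
    noncausal). Noncausal variants are kept only in regions that also
    contain at least one putative causal variant; variants with an
    intermediate PIP are dropped.
    """
    variants = list(variants)
    causal_regions = {region for _, region, pip in variants if pip > causal_pip}
    labels = {}
    for variant_id, region, pip in variants:
        if pip > causal_pip:
            labels[variant_id] = True
        elif pip < noncausal_pip and region in causal_regions:
            labels[variant_id] = False
    return labels
```

Restricting negatives to regions that contain a positive keeps the comparison within the same genes, so that a score cannot succeed merely by distinguishing eQTL-rich from eQTL-poor promoters.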

For yeast eQTL, we downloaded the results of an MPRA study assessing candidate cis-eQTL variants²⁷. Following this study, we classified any eQTL variant with false discovery rate < 0.05 in the MPRA assay as causal and any eQTL variant with (unadjusted) P value of >0.2 as noncausal. This yielded 3,056 eQTL variants, of which 379 were classified as causal. These eQTL variants were then intersected with yeast gene 5′ regions and influence scores were computed from the SpeciesLM fungi dependency maps. Confidence intervals were computed using bootstrapping as before.

ClinVar

We used ClinVar version 2023_07_17 (ref. ¹⁹), previously downloaded from https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/. We considered noncoding any variant in the categories ‘intron_variant’, ‘5_prime_UTR_variant’, ‘splice_acceptor_variant’, ‘splice_donor_variant’, ‘3_prime_UTR_variant’, ‘non_coding_transcript_variant’, ‘genic_upstream_transcript_variant’ and ‘genic_downstream_transcript_variant’. As discussed in ref. ⁵³, we considered as pathogenic any variant classified as pathogenic or likely pathogenic and as benign any variant classified as benign or likely benign. We excluded variants with fewer than one review star. This resulted in 385,572 variants, of which 22,313 were classified as pathogenic.

As most ClinVar variants fall outside the 5′ regions of genes, we chose not to intersect with these regions. Instead, we computed the dependency map centered on the variant of interest. Confidence intervals were computed using bootstrapping as before.

Borzoi

We ran Borzoi in mixed precision to reduce computational overhead using the PyTorch Borzoi package. Replicate zero of Borzoi was used for all analyses. For the eQTL analysis, we computed the L2 score as discussed in ref. ²⁵. We used the Borzoi prediction tracks for the tissues matching the eQTLs. If several Borzoi tracks matched the tissue, we averaged the scores across these tracks. For ClinVar, we followed a similar approach, except that we collected Borzoi predictions for all tissues and assays. We then computed the L2 score across tracks to give a tissue-agnostic and mechanism-agnostic variant-effect score.

For the Kircher saturation mutagenesis dataset, we computed the logSED score as discussed in ref. ²⁵. We mapped the cell types used in the assay to Borzoi tracks as follows: for the GP1BB, HBB, HBG1 and PKLR promoters, we used ‘RNA:K562’; for the F9 and LDLR promoters, we used ‘RNA:HepG2’; for HNF4A and MSMB, we used ‘RNA:kidney’ (as HEK293 is originally a kidney cell line) and for TERT, we used ‘RNA:astrocyte’ (as glioblastoma cells are cancerous astrocytes).

Integrative model using Borzoi and the influence score

We integrated Borzoi and the influence score using logistic regression for the eQTL and ClinVar predictions and linear regression for the mutagenesis data, with a fivefold cross-validation scheme for all benchmarks. Notably, for the Kircher saturation mutagenesis task, model fitting and cross-validation were performed separately for each promoter, and performance was then averaged across folds and promoters.

Alternative dependency metrics

All benchmarks on alternative dependency metrics were performed on the SpeciesLM fungi.

Gradient-based

We computed the gradient of the prediction for each nucleotide at position i with respect to each nucleotide at position j yielding a 4 × 4 matrix. To achieve this, we first replaced the tokenization layer with a one-hot encoding and a linear layer, which map the one-hot encoded nucleotides to their respective token embeddings. We then propagated gradients from each target nucleotide prediction to each one-hot encoded input nucleotide. As a metric of nucleotide dependency, we then used the maximum absolute value across the 4 × 4 matrix of each i,j position.

Mask-based

Mask-based dependencies are computed as:

$${e}_{i,j}=\max {\left\{\left|{\log }_{2}\left(\frac{\widehat{\mathrm{odds}}\left({n}_{j}=k\mid {n}_{1},\ldots ,{n}_{i}=\left[\mathrm{MASK}\right],\ldots ,{n}_{N}\right)}{\widehat{\mathrm{odds}}\left({n}_{j}=k\mid {n}_{1},\ldots ,{n}_{i}={k}_{\mathrm{ref}},\ldots ,{n}_{N}\right)}\right)\right|\right\}}_{k\in \left\{\mathrm{A},\mathrm{C},\mathrm{G},\mathrm{T}\right\}}$$

where ‘[MASK]’ stands for the mask token, k belongs to one of the four possible nucleotides A, C, G or T; n_i and n_j are the nucleotides at position i and j, respectively; k_ref is the nucleotide in the reference, nonaltered input sequence.

S. cerevisiae tRNA structure benchmark

S. cerevisiae genome assembly version R64-1-1 and annotation version R64-1-1.53 were downloaded from EnsemblFungi³⁶. The S. cerevisiae tRNA secondary structures were downloaded from GtRNAdb⁵⁴. We considered only the tRNAs overlapping the 1 kb 5′ regions to any yeast start codon, yielding 172 tRNA sequences. Subsequently, dependency maps on tRNAs were processed by taking the maximum between e_i,j and e_j,i. This symmetrizes the dependency map and achieves one unique score per pair of positions in the tRNA sequence. We then used this score to predict whether a pair of nucleotides belonged to a secondary structure contact.
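The symmetrization and the contact-prediction evaluation can be sketched as follows. The AUROC is computed here directly from its pairwise definition (the probability that a contact scores higher than a non-contact), which is quadratic in the number of pairs and is meant only for illustration. The function names are not taken from the authors' code.

```python
def symmetrize(dep):
    """One score per pair i < j: the maximum of e_ij and e_ji."""
    n = len(dep)
    return {
        (i, j): max(dep[i][j], dep[j][i])
        for i in range(n)
        for j in range(i + 1, n)
    }

def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Given a set `contacts` of annotated base pairs `(i, j)` with `i < j`, the labels are obtained as `[(pair in contacts) for pair in sorted(pairs)]`, where `pairs = symmetrize(dep)`.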

Assessment of donor–acceptor dependencies in S. cerevisiae

We extracted intron sequences by selecting the regions within annotated gene intervals that lie between exon annotations. This resulted in 380 sequences. We then retained only introns bounded by canonical splice site dinucleotides GT and AG, yielding 272 sequences. We then computed the average dependency between every donor and acceptor nucleotide within the intron as a measure of dependency between the donor and acceptor sites. We designed two negative sets for a given intron. For the negative set ‘Decoy acceptor’, we compute the average dependency between donor nucleotides and each AG dinucleotide within the intron that does not include the acceptor site. For the negative set ‘Matched distance’, we sampled four random dependencies between nucleotides that were as distant from each other as the donor was from the acceptor, without including the donor or acceptor themselves.
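A minimal sketch of the donor–acceptor score and of the ‘Decoy acceptor’ negative set is shown below. It assumes that the dependency map is indexed in intron coordinates and that only dependencies with the donor as query and the acceptor as target are averaged; the direction of averaging is not specified in the text. The function names are illustrative.

```python
def block_mean(dep, rows, cols):
    """Mean of dep[r][c] over all query rows r and target columns c."""
    values = [dep[r][c] for r in rows for c in cols]
    return sum(values) / len(values)

def donor_acceptor_scores(intron, dep):
    """Return (observed donor-acceptor score, list of decoy-acceptor scores)."""
    n = len(intron)
    if intron[:2] != "GT" or intron[-2:] != "AG":
        raise ValueError("intron is not bounded by canonical GT...AG dinucleotides")
    donor, acceptor = (0, 1), (n - 2, n - 1)
    observed = block_mean(dep, donor, acceptor)
    # Every AG dinucleotide inside the intron other than the true acceptor.
    decoys = [
        block_mean(dep, donor, (s, s + 1))
        for s in range(2, n - 2)
        if intron[s:s + 2] == "AG"
    ]
    return observed, decoys
```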

TF motif mapping

We downloaded FIMO PWM scan results from http://www.yeastss.org (ref. ⁵⁵) and ChIP-exo TF binding peaks from http://www.yeastepigenome.org (ref. ³¹). We then extracted all ChIP-exo peaks for the available PWMs. We excluded PWM matches for which no ChIP-exo data were available for the corresponding factor. This procedure yielded data for 68 TFs. We annotated every nucleotide within 1 kb 5′ of a start codon as part of a TF binding motif if (1) it is part of a PWM match with P value of <0.01 and (2) this PWM match is within ten bases of a ChIP-exo peak of the corresponding TF. We defined the positive class in this way to ensure that we capture nucleotides relevant for determining binding (that is, motif) rather than all nucleotides close to a ChIP-exo peak, regardless of their role in binding. This resulted in 92,117 binding nucleotides out of a total of 6,538,427. We designated a nucleotide as repeat if it was masked by RepeatMasker. We extracted this information from the soft-masked GTF provided by Ensembl³⁶.

The 69-way alignment

We used progressive cactus⁵⁶ to align 69 budding yeast species using default parameters and specifying S. cerevisiae as reference quality genome. We then extracted fourfold degenerate sites and used phyloFit with the EM algorithm to estimate a neutral model. Using this neutral model and the alignment, we ran phastCons with the parameters --rho 0.3 --estimate-rho --target-coverage 0.4 --expected-length 23, which correspond to the parameters used in the seven-way alignment²¹. We also ran phyloP²², with --method LRT --mode CONACC.

Dependencies in rare-variant-associated aberrant splicing

We computed dependency maps for all rare SNVs associated with splicing outliers in GTEx²⁸ as described earlier³². Because the input length of SpliceBert¹⁴ is limited to 1,024 bp, the complete set of variant outlier pairs (n = 18,371) was filtered such that the variant and associated outlier junction were located within an 800-bp window (n = 1,811) and 100 bp of sequence was added from the maximum and minimum positions of the variant and outlier junction splice sites. For each variant location, we extracted the average value of the dependency map at the intersection of either variant and outlier donor dinucleotide or variant and outlier acceptor dinucleotide. This variant effect score was compared against a background score. This background score was computed as the mean over all dependencies that were as distant from each other as the variant was from the outlier donor (matched distance) or the outlier acceptor. The scores were filtered for a minimum distance of 5 bp between the variant and splicing dinucleotide to filter values near the diagonal corresponding to self-interactions. Variant categories were annotated with the Ensembl variant effect predictor (VEP)⁵⁷. For each variant, the most severe VEP annotation was considered. For the ‘exon’ category, the following VEP categories were grouped together: synonymous_variant, missense_variant, stop_lost, stop_gained.

Genome-wide search for parallel and antiparallel dependencies

We scanned dependency maps for parallel and antiparallel dependencies using 5 × 5 convolutional filters. We constructed the antiparallel filter by populating the antidiagonal of a zero-filled 5 × 5 matrix with ones, and for the parallel filter, by populating the diagonal with ones. We then centered each filter by subtracting the mean value from each position to ensure that a convolution on a uniform 5 × 5 region yields a result of zero. We applied these filters to dependency maps from SpeciesLM fungi (both filters) and RiNALMo (antiparallel filter only)^4,7.
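The construction of the two filters can be sketched as follows. The sliding-window operation is written as an unpadded cross-correlation, which coincides with a true convolution here because both filters are unchanged by a 180° rotation. The function names and the choice of no padding are illustrative assumptions.

```python
def make_filter(kind, size=5):
    """Mean-centered 'parallel' (diagonal) or 'antiparallel' (antidiagonal) filter."""
    filt = [[0.0] * size for _ in range(size)]
    for t in range(size):
        filt[t][t if kind == "parallel" else size - 1 - t] = 1.0
    mean = sum(map(sum, filt)) / size ** 2
    return [[value - mean for value in row] for row in filt]

def convolve_valid(dep, filt):
    """Slide filt over the dependency map without padding."""
    n, k = len(dep), len(filt)
    out = [[0.0] * (n - k + 1) for _ in range(n - k + 1)]
    for r in range(n - k + 1):
        for c in range(n - k + 1):
            out[r][c] = sum(
                filt[a][b] * dep[r + a][c + b]
                for a in range(k)
                for b in range(k)
            )
    return out
```

On a uniform 5 × 5 region both filters return zero, whereas a run of five consecutive antiparallel dependencies of value 1 gives a response of 5 × (1 − 0.2) = 4 for the antiparallel filter and 0 for the parallel filter.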

Search for parallel and antiparallel dependencies in fungi using the SpeciesLM fungi

For the SpeciesLM fungi, we have computed dependency maps spanning 1 kb 5′ of each annotated start codon on a set of representative fungi species, including Agaricus bisporus, Candida albicans, Debaryomyces hansenii, Kluyveromyces lactis, Neurospora crassa, S. cerevisiae, S. pombe and Yarrowia lipolytica. The genomes and annotation files for each species were downloaded from EnsemblFungi release 53 with accessions GCA_000300555.1, GCA000182965v3, GCA_000006445.2, GCA000002515.1, GCA_000182925.2, GCA_003046715.1, GCA_000002945.2 and GCA_000002525.1, respectively.

All regions annotated as ‘five_prime_utr’, ‘three_prime_utr’, ‘intron’, ‘CDS’, ‘pseudogene_with_CDS’ and other regions (for example, nonannotated introns) inside an annotated gene interval were categorized as protein-coding gene. All regions annotated as ‘tRNA’, ‘tRNA_pseudogene’, ‘rRNA’, ‘snRNA’, ‘ribozyme’, ‘SRP_RNA’, ‘snoRNA’, ‘RNase_P_RNA’ and ‘RNase_MRP_RNA’ were categorized as structured RNA. Finally, all regions annotated as ‘transposable_element’, ‘pseudogene’ and regions without any annotation were considered as intergenic.

Search for antiparallel dependencies and RNA structure in E. coli using RiNALMo

For RiNALMo, we computed dependency maps for regions 100, 200 and 500 bp before each annotated start codon in E. coli str. K-12 substr. MG1655, whose genome and annotation were downloaded from GenBank⁵⁸ with accession U00096.3.

As candidates for a new RNA structure, we first filtered positions whose convolution value is greater than or equal to 25 to select only high-value antiparallel dependencies, resulting in a filtered convolved dependency map. Next, we counted the unique number of antidiagonals potentially belonging to one stem by extracting the unique i + j nonzero positions supported by at least three nonzero values.

As candidates for a new structure, we selected maps suggesting the existence of at least two potential stems.
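The candidate selection can be sketched as follows, taking a convolved dependency map as input. Only the upper triangle (j > i) is counted so that a contact is not counted twice; this restriction is an assumption of the sketch and is not stated in the text. The function names are illustrative.

```python
from collections import Counter

def count_candidate_stems(conv, threshold=25.0, min_support=3):
    """Number of antidiagonals (constant i + j) with at least min_support
    upper-triangle entries whose convolution value is >= threshold."""
    support = Counter(
        i + j
        for i, row in enumerate(conv)
        for j, value in enumerate(row)
        if j > i and value >= threshold
    )
    return sum(1 for count in support.values() if count >= min_support)

def is_structure_candidate(conv, min_stems=2):
    """True if the map suggests at least min_stems potential stems."""
    return count_candidate_stems(conv) >= min_stems
```

Grouping by i + j works because the paired positions of a single stem satisfy i + j = constant, so all of its contacts fall on one antidiagonal.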

RNA secondary structure benchmarking

We downloaded the database of secondary structures Archive II³⁷, which includes 3,865 curated RNA structures across nine families (5S rRNA, SRP RNA, tRNA, tmRNA, RNase P RNA, group I intron, 16S rRNA, telomerase RNA and 23S rRNA). For each structure, we generated the dependency map with the pretrained RiNALMo and retained the largest of the two dependency map entries for each pair of nucleotides (maximum of i,j and j,i). The AUROC curve was computed for each structure against the Archive II secondary structure annotations.

Benchmarking of canonical and noncanonical RNA contacts

We downloaded the database of RNA structures CompaRNA³⁸, which is a compilation of RNA contacts based on 201 available RNA structures in the Protein Data Bank by RNAView⁵⁹. Contacts are classified either as ‘standard’ or as ‘extended’. While the first includes only canonical AU, GC and wobble GU pairs in the cis-Watson–Crick/Watson–Crick conformation⁶⁰, the latter calls all interacting bases regardless of their conformation, including noncanonical or tertiary contacts. Of the 201 structures, 196 had a length below the maximum input length of RiNALMo (1,022 nt). For each structure, we generated the dependency map using the pretrained RiNALMo and retained the largest entry from the two dependency maps for each pair of nucleotides. Similarly, the same structures were also evaluated with the fine-tuned RiNALMo model version rinalmo_giga_ss_bprna_ft, resulting in a predicted value for each pair of nucleotides. To evaluate their performance in predicting noncanonical contacts, we excluded all canonical contacts and computed the AUROC curve for all remaining positions across all structures. Significance between ROC AUCs was determined by bootstrapping over 10,000 permutations.

Comparison with RNAalifold

We evaluated the performance of the dependency maps against RNAalifold⁶¹, a standard alignment-based method for predicting a consensus RNA structure by incorporating sequence covariation from a set of aligned RNA sequences as input. For this, we use the 201 PDB entries in CompaRNA that had at least one Rfam match and consider two subsets. The first subset consisted of the 33 PDB sequences that contained an exact sequence match between the PDB entry and at least one Rfam seed alignment. In case of multiple matching Rfam seed alignments (for example, ribosomal RNA), we considered an arbitrarily chosen single Rfam seed alignment to avoid confounding the evaluations by duplicates. The second subset consisted of the remaining 168 sequences. For this, we used nhmmer (v3.1b2)⁶² to find homologous sequences within a database of 220,478 bacterial and archaeal genomes and plasmids downloaded from NCBI. After removing sequences longer than 1,022 nt (the maximum context length for which the gLM RiNALMo has been trained), this resulted in 67 sequences with hits in the database.

On the first subset, we use the Rfam seed alignments as input to RNAalifold. To assess the robustness of the analyses to the alignment procedure, we additionally realigned the sequences in the seed alignments using Clustal-Omega (v1.2.4)⁶³ and MAFFT (v7.525)⁶⁴. For the second subset, we performed sequence alignments using both Clustal-Omega and MAFFT, limiting the alignments to a maximum of 1,000 sequences (by aligning the PDB sequence to the top 999 nhmmer hits) to reduce computation time. On both subsets and from each alignment, a base-pair probability matrix corresponding to the predicted RNA structure was generated using RNAalifold available through the ViennaRNA (v2.6.4) package⁶⁵. RNAalifold was run in the following two modes: using the default energy model (command: RNAalifold -p) and with RIBOSUM scoring (command: RNAalifold -p -r).

Pseudoknot benchmark

We downloaded the compendium dataset bpRNA-1m(90) that contains 28,370 annotated RNA structures with less than 90% sequence similarity obtained from the databases CRW, tmRNA, SRP, tRNAdb2009, RNP, RFAM and PDB³⁹. From these, we extracted all structures that contain pseudoknot contacts and are no longer than 1,022 nt (the maximum context length for which the gLM RiNALMo has been trained). These resulted in 2,530 structures of varying lengths and sources. We then extracted the pseudoknot contacts from the dot-bracket notation provided by bpRNA that takes into account non-nested pairs³⁹. Finally, we computed the dependency maps for each one of these structures and evaluated their ability to predict whether a pair of nucleotides belongs to a pseudoknot contact (positive set) or does not belong to a structure contact (pseudoknot or canonical structure contact—negative set).

DMS-MaPseq analysis of E. coli cells

E. coli TOP10 cells were grown in LB broth at 37 °C with shaking until OD₆₀₀ = 0.5, after which dimethyl sulfate (DMS; Sigma-Aldrich, D186309), prediluted 1:4 in ethanol, was added to a final concentration of 200 mM. Bacteria were incubated for 2 min at 37 °C, and the reaction was quenched by addition of 0.5 M final DTT. Bacteria were pelleted by centrifugation at 17,000g for 1 min at 4 °C, after which cell pellets were resuspended by vortexing in 12.5 μl of resuspension buffer (20 mM Tris–HCl pH 8.0; 80 mM NaCl; 10 mM EDTA pH 8.0), supplemented with 100 μg ml⁻¹ final lysozyme (L6876, Merck) and 20 U SUPERase·In RNase Inhibitor (Thermo Fisher Scientific, A2696). After 1 min, 12.5 μl of lysis buffer (0.5% Tween-20; 0.4% sodium deoxycholate; 2 M NaCl; 10 mM EDTA) was added, and samples were incubated at room temperature for an additional 2 min. Then 1 ml TRIzol Reagent (Thermo Fisher Scientific, 15596018) was added, and RNA was extracted as per the manufacturer’s instructions. rRNA depletion was performed on 1 μg total RNA using the RiboCop for Bacteria kit (Lexogen, 126). DMS-MaPseq library preparation was performed as previously described⁴¹. After sequencing, reads were aligned to the E. coli str. K-12 substr. MG1655 genome (GenBank, U00096.3), using the rf-map module of the RNA framework⁶⁶ and Bowtie2 (ref. ⁶⁷). Counting of DMS-induced mutations and coverage, and reactivity normalization, were performed using the rf-count-genome and rf-norm modules of the RNA framework. Experimentally informed structure modeling was performed using the rf-fold module of the RNA framework and ViennaRNA (v2.5.1)⁶⁷.

RNA structure covariation analysis

Covariation analysis was performed using the cm-builder pipeline (https://github.com/dincarnato/labtools) and a nonredundant database of 7,598 representative archaeal and bacterial genomes (and associated plasmids, when present) from RefSeq⁶⁸.

Evaluation of artificial forward and inverted duplications

We generated random sequences of 100 nucleotides by sampling from regions 1 kb 5′ of the start codon in S. cerevisiae to ensure a representative GC content and shuffling the sequences to destroy potential functional elements. Additionally, we created 100 unique duplicated sequences, ranging from 2 to 20 nucleotides in length, by randomly sampling each nucleotide with equal probability. The two copies of each duplicated sequence were then inserted into a uniquely generated 100-nucleotide sequence at a random distance from each other, ensuring that they did not overlap. We used the SpeciesLM fungi to generate dependency maps for each sequence. We then computed average dependencies by taking the mean of the dependencies between nucleotides and their duplicates. This involved averaging across a parallel diagonal for forward duplications and an antiparallel diagonal for inverted duplications.

For tRNA-sized sequences, we followed a similar method but generated each sequence by shuffling each unique tRNA sequence in S. cerevisiae once. We computed the average number of inverted duplications by averaging the occurrences of duplicated sequences of specific lengths across 10,000 shuffled versions of each tRNA sequence.

Genome-wide analysis of dependency distribution

Using the SpeciesLM fungi, we computed dependency maps across the genomes of S. cerevisiae and S. pombe. Because the SpeciesLM fungi was pretrained on sequences of 1,003 nucleotides, including the start codon at the end, we discarded dependencies involving the last three nucleotides of each sequence, yielding dependencies for 1,000 nucleotides. Genome-wide dependency maps of 1-kb span were obtained with a tiling approach. Along each chromosome, we computed 1-kb square dependency maps every 500 bp and averaged overlapping entries.
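The tiling scheme can be sketched as follows. `compute_map` is a hypothetical callable returning the square dependency map of a half-open interval of the chromosome; how the end of a chromosome that is shorter than one window is handled is not covered by this sketch.

```python
from collections import defaultdict

def tile_dependency_maps(chrom_len, compute_map, window=1000, step=500):
    """Average window x window maps computed every `step` bp along a chromosome.

    compute_map(start, end) is a hypothetical callable returning the
    dependency map (list of lists) for positions [start, end).
    Returns dict (i, j) -> mean dependency in chromosome coordinates.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for start in range(0, chrom_len - window + 1, step):
        local = compute_map(start, start + window)
        for a, row in enumerate(local):
            for b, value in enumerate(row):
                if a != b:  # self-dependencies are not considered
                    sums[start + a, start + b] += value
                    counts[start + a, start + b] += 1
    return {pair: sums[pair] / counts[pair] for pair in sums}
```

With a 500-bp step and a 1-kb window, pairs of positions that both fall into the overlap of two consecutive windows receive two estimates, which are averaged.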

To ensure that the same number of targets is computed before and after a specific query nucleotide, we considered dependencies involving nucleotides at most 500 positions away from each other. For each map, we sampled 1,000 dependencies and averaged dependencies mapping to the same genomic positions but computed from different overlapping maps. Due to limitations in numerical precision, we considered only dependencies larger than 0.001.

To compute the power–law coefficients, a linear regression was fitted to predict the logarithm of the dependency from the logarithm of its corresponding distance in nucleotides. The scaling coefficient was then obtained by exponentiating the fitted intercept of the linear regression, and the decay rate was obtained directly from the fitted slope. The scaling coefficient and decay rate were computed for different regions in the genome which are as follows: (1) nuclear—involving all dependencies belonging to nuclear DNA; (2) mitochondria—involving all dependencies within mitochondrial DNA; (3) structured RNA—belonging to the annotations ‘tRNA’, ‘tRNA_pseudogene’, ‘rRNA’, ‘snRNA’, ‘ribozyme’, ‘SRP_RNA’, ‘snoRNA’, ‘RNase_P_RNA’ or ‘RNase_MRP_RNA’; (4) protein-coding gene—belonging to the annotations ‘five_prime_utr’, ‘three_prime_utr’, ‘CDS’ or ‘pseudogene_with_CDS’; (5) intron—belonging to the regions inside an annotated gene interval but not to exons and (6) intergenic—belonging to all regions annotated as ‘transposable_element’, ‘pseudogene’, as well as regions without any annotation.
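The log–log fit can be sketched with ordinary least squares as follows. Natural logarithms are assumed, so that the scaling coefficient is obtained by exponentiating the intercept, and dependencies at or below 0.001 are discarded as described above. The function name is illustrative.

```python
import math

def fit_power_law(distances, dependencies, min_dependency=0.001):
    """Fit dependency ~ a * distance**b; return (a, b).

    a is the scaling coefficient (exp of the fitted intercept) and
    b is the decay rate (the fitted slope).
    """
    pairs = [
        (math.log(d), math.log(e))
        for d, e in zip(distances, dependencies)
        if e > min_dependency
    ]
    xs, ys = zip(*pairs)
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope
```

For dependencies that follow 2·d^(−1.5) exactly, the fit recovers a scaling coefficient of 2 and a decay rate of −1.5.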

Model comparison

All other models used were downloaded from Huggingface or from their publicly available repositories. Human tRNA sequences were downloaded from GtRNAdb⁵⁴. Exact duplicate sequences were removed, leaving 266 tRNAs.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Data to reproduce the analysis, together with the 69-Saccharomycetales genome alignment and conservation score, as well as the bacterial and archaeal genomes and the plasmid sequences used for the benchmark against RNAalifold, are provided in ref. ⁶⁹. The SpeciesLM models are available at https://huggingface.co/collections/johahi/specieslms-678a39261cfff01c1fa3ae41. Raw DMS-MaPseq data have been deposited to the Gene Expression Omnibus database under accession GSE271937.

Code availability

The code required to reproduce the results in the paper is available at https://github.com/gagneurlab/dependencies_DNALM or in ref. ⁶⁹.

References

Alföldi, J. & Lindblad-Toh, K. Comparative genomics as a tool to understand evolution and disease. Genome Res 23, 1063–1068 (2013).
Altschuh, D., Lesk, A. M., Bloomer, A. C. & Klug, A. Correlation of co-ordinated amino acid substitutions with function in viruses related to tobacco mosaic virus. J. Mol. Biol. 193, 693–707 (1987).
Noller, H. F. et al. Secondary structure model for 23S ribosomal RNA. Nucleic Acids Res 9, 6167–6189 (1981).
Karollus, A. et al. Species-aware DNA language models capture regulatory elements and their evolution. Genome Biol 25, 83 (2024).
Benegas, G., Batra, S. S. & Song, Y. S. DNA language models are powerful predictors of genome-wide variant effects. Proc. Natl Acad. Sci. USA 120, e2311219120 (2023).
Benegas, G., Albors, C., Aw, A. J., Ye, C. & Song, Y. S. A DNA language model based on multispecies alignment predicts the effects of genome-wide variants. Nat. Biotechnol. https://doi.org/10.1038/s41587-024-02511-w (2025).
Penić, R. J., Vlašić, T., Huber, R. G., Wan, Y. & Šikić, M. RiNALMo: general-purpose RNA language models can generalize well on structure prediction tasks. Nat. Commun. 16, 5671 (2025).
Nguyen, E. et al. Sequence modeling and design from molecular to genome scale with Evo. Science 386, eado9336 (2024).
Dalla-Torre, H. et al. Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nat. Methods 22, 287–297 (2025).
Ji, Y., Zhou, Z., Liu, H. & Davuluri, R. V. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics 37, 2112–2120 (2021).
Schiff, Y. et al. Caduceus: bi-directional equivariant long-range DNA sequence modeling. Proc. Mach. Learn. Res. 235, 43632–43648 (2024).
Nguyen, E. et al. HyenaDNA: long-range genomic sequence modeling at single nucleotide resolution. Adv. Neural Inf. Process. Syst. 36, 43177–43201 (2023).
Vilov, S. & Heinig, M. Investigating the performance of foundation models on human 3′UTR sequences. Nucleic Acids Res. 53, gkaf871 (2025).
Chen, K. et al. Self-supervised learning on millions of primary RNA sequences from 72 vertebrates improves sequence-based RNA splicing prediction. Brief. Bioinform. 25, bbae163 (2024).
Shen, T. et al. Accurate RNA 3D structure prediction using a language model-based deep learning approach. Nat. Methods 21, 2287–2298 (2024).
Marin, F. I. et al. BEND: benchmarking DNA language models on biologically meaningful tasks. In Proc. 12th International Conference on Learning Representations (ICLR, 2024).
Gazave, E., Marqués-Bonet, T., Fernando, O., Charlesworth, B. & Navarro, A. Patterns and rates of intron divergence between humans and chimpanzees. Genome Biol 8, R21 (2007).
Eraslan, G., Avsec, Ž., Gagneur, J. & Theis, F. J. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 20, 389–403 (2019).
Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res 46, D1062–D1067 (2018).
Kircher, M. et al. Saturation mutagenesis of twenty disease-associated regulatory elements at single base-pair resolution. Nat. Commun. 10, 3583 (2019).
Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15, 1034–1050 (2005).
Pollard, K. S., Hubisz, M. J., Rosenbloom, K. R. & Siepel, A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res 20, 110–121 (2010).
Sullivan, P. F. et al. Leveraging base-pair mammalian constraint to understand genetic variation and human disease. Science 380, eabn2937 (2023).
Kuderna, L. F. K. et al. Identification of constrained sequence elements across 239 primate genomes. Nature 625, 735–742 (2024).
Linder, J., Srivastava, D., Yuan, H., Agarwal, V. & Kelley, D. R. Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. Nat. Genet. 57, 949–961 (2025).
Wang, G., Sarkar, A., Carbonetto, P. & Stephens, M. A simple new approach to variable selection in regression, with application to genetic fine mapping. J. R. Stat. Soc. Series B 82, 1273–1300 (2020).
Renganaath, K. et al. Systematic identification of cis-regulatory variants that cause gene expression differences in a yeast cross. eLife 9, e62669 (2020).
Aguet, F. et al. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
Avsec, Ž. et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).
De Boer, C. G. & Hughes, T. R. YeTFaSCo: a database of evaluated yeast transcription factor sequence specificities. Nucleic Acids Res 40, D169–D179 (2012).
Rossi, M. J. et al. A high-resolution protein architecture of the budding yeast genome. Nature 592, 309–314 (2021).
Wagner, N. et al. Aberrant splicing prediction across human tissues. Nat. Genet. 55, 861–870 (2023).
The RNAcentral Consortium. RNAcentral: a hub of information for non-coding RNA sequences. Nucleic Acids Res 47, D221–D229 (2019).
Sayers, E. W. et al. Database resources of the National Center for Biotechnology Information in 2023. Nucleic Acids Res. 51, D29–D38 (2023).
Kalvari, I. et al. Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res 49, D192–D200 (2021).
Martin, F. J. et al. Ensembl 2023. Nucleic Acids Res. 51, D933–D941 (2023).
Mathews, D. H. How to benchmark RNA secondary structure prediction accuracy. Methods 162–163, 60–67 (2019).
Puton, T., Kozlowski, L. P., Rother, K. M. & Bujnicki, J. M. CompaRNA: a server for continuous benchmarking of automated methods for RNA secondary structure prediction. Nucleic Acids Res 41, 4307–4323 (2013).
Danaee, P. et al. bpRNA: large-scale automated annotation and analysis of RNA secondary structure. Nucleic Acids Res 46, 5381–5394 (2018).
Yanofsky, C. RNA-based regulation of genes of tryptophan synthesis and degradation, in bacteria. RNA 13, 1141–1154 (2007).
Zubradt, M. et al. DMS-MaPseq for genome-wide or targeted RNA structure probing in vivo. Nat. Methods 14, 75–82 (2017).
Kavita, K. & Breaker, R. R. Discovering riboswitches: the past and the future. Trends Biochem. Sci. 48, 119–141 (2023).
Givens, R. M. et al. Chromatin architectures at fission yeast transcriptional promoters and replication origins. Nucleic Acids Res 40, 7176–7189 (2012).
Vig, J., Madani, A., Varshney, L. R., Xiong, C. & Rajani, N. BERTology meets biology: interpreting attention in protein language models. In Proc. International Conference on Learning Representations (ICLR, 2021).
Bhattacharya, N. et al. Interpreting Potts and transformer protein models through the lens of simplified attention. Pac. Symp. Biocomput. 27, 34–45 (2022).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Zhang, Z. et al. Protein language models learn evolutionary statistics of interacting sequence motifs. Proc. Natl Acad. Sci. 121, e2406285121 (2024).
Delagoutte, B., Moras, D. & Cavarelli, J. tRNA aminoacylation by arginyl-tRNA synthetase: induced conformations during substrates binding. EMBO J 19, 5599–5610 (2000).
Vorontsov, I. E. et al. HOCOMOCO in 2024: a rebuild of the curated collection of binding models for human and mouse transcription factors. Nucleic Acids Res. 52, D154–D163 (2024).
Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinformatics 5, 4.10.1–4.10.14 (2004).
Dao, T. FlashAttention-2: faster attention with better parallelism and work partitioning. In Proc. 12th International Conference on Learning Representations (ICLR, 2023).
Raney, B. J. et al. The UCSC Genome Browser database: 2024 update. Nucleic Acids Res 52, D1082–D1088 (2024).
Cheng, J. et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381, eadg7492 (2023).
Chan, P. P. & Lowe, T. M. GtRNAdb: a database of transfer RNA genes detected in genomic sequence. Nucleic Acids Res 37, D93–D97 (2009).
Grant, C. E., Bailey, T. L. & Noble, W. S. FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).
Armstrong, J. et al. Progressive Cactus is a multiple-genome aligner for the thousand-genome era. Nature 587, 246–251 (2020).
McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biol 17, 122 (2016).
Sayers, E. W. et al. GenBank. Nucleic Acids Res 49, D92–D96 (2021).
Yang, H. et al. Tools for the automatic identification and classification of RNA base pairs. Nucleic Acids Res 31, 3450–3460 (2003).
Leontis, N. B. & Westhof, E. Geometric nomenclature and classification of RNA base pairs. RNA 7, 499–512 (2001).
Bernhart, S. H., Hofacker, I. L., Will, S., Gruber, A. R. & Stadler, P. F. RNAalifold: improved consensus structure prediction for RNA alignments. BMC Bioinformatics 9, 474 (2008).
Wheeler, T. J. & Eddy, S. R. nhmmer: DNA homology search with profile HMMs. Bioinformatics 29, 2487–2489 (2013).
Sievers, F. et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7, 539 (2011).
Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
Lorenz, R. et al. ViennaRNA package 2.0. Algorithms Mol. Biol. 6, 26 (2011).
Incarnato, D., Morandi, E., Simon, L. M. & Oliviero, S. RNA framework: an all-in-one toolkit for the analysis of RNA structures and post-transcriptional modifications. Nucleic Acids Res 46, e97 (2018).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Manfredonia, I. et al. Genome-wide mapping of SARS-CoV-2 RNA structures identifies therapeutically-relevant elements. Nucleic Acids Res 48, 12436–12452 (2020).
Tomaz da Silva, P. et al. Data, models and code for: ‘nucleotide dependency analysis of DNA language models reveals genomic functional elements’. Zenodo https://doi.org/10.5281/zenodo.16524884 (2025).
Kerimov, N. et al. A compendium of uniformly processed human gene expression and splicing quantitative trait loci. Nat. Genet. 53, 1290–1299 (2021).
Gould, G. M. et al. Identification of new branch points and unconventional introns in Saccharomyces cerevisiae. RNA 22, 1522–1534 (2016).


Acknowledgements

P.T.d.S. is supported by the Munich Center for Machine Learning. N.W. is supported by the Helmholtz Association under the joint research school ‘Munich School for Data Science—MUDS’. D.I. was supported by the Dutch Research Council (NWO), NWO Open Competitie ENW—XS (project OCENW.XS22.1.015) and by the European Research Council (ERC), European Union’s Horizon Europe research and innovation program (grant agreements 101124787 and RNAStrEnD). X.H.-A. was supported by an EMBO Postdoctoral Fellowship (ALTF 792-2022). J.G. was supported by the German Bundesministerium für Bildung und Forschung (BMBF) through the Model Exchange for Regulatory Genomics project MERGE (031L0174A). J.G. was also supported by the Deutsche Forschungsgemeinschaft (DFG; German Research Foundation) through the project NFDI 1/1 ‘GHGA—German Human Genome-Phenome Archive’ (441914366), and funded through the EVUK program (‘Next-generation AI for Integrated Diagnostics’) of the Free State of Bavaria. J.G. and G.S.T.G. are supported by the DFG (German Research Foundation) through the TRR267 (403584255). This study was supported by the DFG (German Research Foundation) through the IT Infrastructure for Computational Molecular Medicine (project 461264291). This study was also supported by the ERC (EPIC; 101118521 to P.T.d.S., A.K., J.H., N.W. and J.G.) and by the European Union. Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union or the ERC Executive Agency. Neither the European Union nor the granting authority can be held responsible for them. We thank F. Bonneau for assistance with the presentation of the three-dimensional structure of a tRNA and further thank S. Aerts and J. Cheng for the useful feedback on the manuscript.

Funding

Open access funding provided by Technische Universität München.

Author information

These authors contributed equally: Pedro Tomaz da Silva, Alexander Karollus.

Authors and Affiliations

School of Computation, Information and Technology, Technical University of Munich, Munich, Germany
Pedro Tomaz da Silva, Alexander Karollus, Johannes Hingerl, Gihanna Sta. Teresa Galindez, Nils Wagner, Xavier Hernandez-Alias & Julien Gagneur
Munich Center for Machine Learning, Munich, Germany
Pedro Tomaz da Silva, Alexander Karollus & Johannes Hingerl
Munich Data Science Institute, Technical University of Munich, Munich, Germany
Gihanna Sta. Teresa Galindez
Mechanisms of Protein Biogenesis, Max Planck Institute of Biochemistry, Martinsried, Germany
Xavier Hernandez-Alias
Department of Molecular Genetics, Groningen Biomolecular Sciences and Biotechnology Institute (GBB), University of Groningen, Groningen, the Netherlands
Danny Incarnato
Institute of Human Genetics, School of Medicine and Health, Technical University of Munich, Munich, Germany
Julien Gagneur
Computational Health Center, Helmholtz Munich, Neuherberg, Germany
Julien Gagneur

Authors

Pedro Tomaz da Silva
Alexander Karollus
Johannes Hingerl
Gihanna Sta. Teresa Galindez
Nils Wagner
Xavier Hernandez-Alias
Danny Incarnato
Julien Gagneur

Contributions

P.T.d.S. and A.K. conceptualized the study and performed the methodology, software development, formal analysis, investigation and visualization, and contributed to writing the original draft and writing, reviewing and editing the final draft of the paper. P.T.d.S. managed project administration. J.H. performed the methodology, software development and investigation, and contributed to writing, reviewing and editing the final draft of the manuscript. G.S.T.G., N.W. and X.H.-A. performed the methodology, software development, investigation and formal analysis, and contributed to writing, reviewing and editing the final draft of the paper. D.I. conducted formal analysis, investigation, resource management and data curation; contributed to writing, reviewing and editing the final draft of the paper; and was responsible for visualization, supervision and funding acquisition. J.G. conceptualized the study and performed methodology and resource management; contributed to writing the original draft and writing, reviewing and editing the final draft of the paper; and was responsible for visualization, supervision, project administration and funding acquisition.

Corresponding author

Correspondence to Julien Gagneur.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks Maria Barna and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Nucleotide dependencies capture functional interactions and variant effects in ClinVar and eQTLs.

a, Variant influence score for pathogenic and benign variants as classified in ClinVar (P < 10⁻⁶, two-sided Wilcoxon rank-sum test). b, Performance (AUROC) for the classification of ClinVar variants as pathogenic or benign, comparing the variant influence score, the gLM log ratio between the predicted probabilities of the reference and variant nucleotides, alignment-based conservation scores from PhyloP and PhastCons, the supervised model Borzoi, and a logistic regression on the influence score and Borzoi (the latter was fitted and evaluated using 5-fold cross-validation on this dataset). Pathogenic N = 22,313; benign N = 363,259. c, Variant influence score for putative causal and putative non-causal variants obtained from fine-mapped human eQTL²⁶,⁷⁰ (P < 10⁻⁶, two-sided Wilcoxon rank-sum test). d, Area under the receiver operating characteristic curve (AUROC) for the classification of putative causal versus putative non-causal variants from the fine-mapped human eQTL. Ensemble model fitted as in b. Non-fine-mapped N = 1,327; fine-mapped N = 1,631. e, Variant influence score for putative causal and putative non-causal variants obtained from yeast eQTL²⁷ (P < 10⁻⁶, two-sided Wilcoxon rank-sum test). f, Performance (AUROC) for the classification of yeast putative causal versus putative non-causal eQTL variants. Putative non-causal N = 2,677; putative causal N = 379. g, Performance (AUROC) for the prediction of tRNA secondary structure contacts using different nucleotide dependency metrics: gradient-based, mask-based and substitution-based. h, Precision-recall curves for the prediction of splice site interactions using the same nucleotide dependency metrics as in g. Donor-acceptor interactions N = 238; non-donor-acceptor interactions at matched distances N = 476.
For all boxplots: center line, median; box limits, first and third quartiles; whiskers span all data within 1.5× interquartile ranges of the lower and upper quartiles. All error bars represent ±2 standard deviations, constructed using 100 bootstrap samples. The height of each bar corresponds to the AUROC using the different variant scores.
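The bootstrap error bars described above could be computed along the following lines. This is a hedged sketch, not the authors' implementation: the helper names `auroc` and `bootstrap_auroc`, the unstratified resampling scheme, the fixed seed and the toy scores are all assumptions, and the standard library is used only to keep the example self-contained.

```python
import random
import statistics

def auroc(scores, labels):
    """AUROC from the rank-sum (Mann-Whitney U) identity, using average ranks for ties.
    labels are 1 for the positive class and 0 for the negative class."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1          # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def bootstrap_auroc(scores, labels, n_boot=100, seed=0):
    """Return (AUROC on the full data, 2 x SD of AUROC over n_boot resamples)."""
    rng = random.Random(seed)
    n = len(scores)
    values = []
    while len(values) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        boot_labels = [labels[i] for i in idx]
        if 0 < sum(boot_labels) < n:                 # skip resamples missing a class
            values.append(auroc([scores[i] for i in idx], boot_labels))
    return auroc(scores, labels), 2 * statistics.stdev(values)

# Toy scores (hypothetical, not data from the paper).
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.3]
toy_labels = [0, 0, 1, 1, 1, 0, 1, 0]
bar_height, half_width = bootstrap_auroc(toy_scores, toy_labels)
```

Here `bar_height` plays the role of the plotted bar and `half_width` the ±2 standard deviation error bar.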

Extended Data Fig. 2 The block-score also outperforms other metrics in identifying transcription factor (TF) binding sites when restricted to non-coding regions.

Receiver operating characteristic (ROC) curve comparing the ability of different metrics to classify whether a nucleotide is part of a bound TF motif or not (92,117 binding nucleotides out of 3,334,202 overall). As in Fig. 2d, but overlap with coding sequences was removed. This improves the performance of alignment-based conservation somewhat, but the block-score still performs much better. Computing conservation scores on a 69-way alignment of budding yeast species did not improve discrimination.

Extended Data Fig. 3 Dependency map highlights an alternative branch-point of the yeast gene LSM2.

Dependency map for an intron of the yeast gene LSM2, highlighting not only the canonical donor, acceptor and branch point but also an alternative non-canonical branch point. While the canonical branch point appears as an on-diagonal block, a parallel off-diagonal block is also visible, suggesting that if mutations altered the canonical branch point, compensatory mutations at the alternative branch point would be favored. The target nucleotides of this block belong to a branch-point-like sequence, indicating a role as an alternative branch point, which has also been found experimentally⁷¹.

Extended Data Fig. 4 Nucleotide dependencies systematically highlight RNA secondary structure, pseudoknot and non-Watson-Crick contacts.

a, Example of a convolution of a 5 × 5 anti-parallel diagonal filter on the dependency map of yeast tR(ACG)O tRNA. The dependency map is shown on the left, while the resulting convolution is shown on the right. The maximum convolution values highlight the anti-parallel dependencies within the tRNA. b, Fraction of nucleotides that show Watson-Crick or wobble correspondence within the 5 base-pair region defined by the convolution filter location with the maximum hit for different convolution values in a region of 1 kb 5′ of a start codon. c, Maximum anti-parallel dependencies within a 1-kb region 5′ of each start codon across fungal species and their location in the genome, categorized as structured RNA, protein coding (spanning the whole interval of a protein-coding gene) or intergenic (spanning mostly non-annotated regions between genes). d, ROC curve for classifying experimentally obtained canonical (left) and non-canonical (right) contacts from the CompaRNA database. The performance of RNAalifold base-pair probabilities, the fine-tuned RiNALMo and nucleotide dependencies is shown for two regimes: (1) high-quality manually curated seed alignments from the Rfam database together with realignment of the same sequences using MAFFT and Clustal Omega (top) and (2) alignments constructed from a database search of 220,478 bacterial and archaeal genomes and plasmids downloaded from NCBI (bottom). e, E. coli cobalamin riboswitch dependency map together with the highlighted pseudoknot contacts and nucleotide reconstruction on top. f, Distribution of dependencies for pairs of nucleotides belonging to an annotated pseudoknot contact (right, N = 21,051) or not belonging to a structure contact (left, N = 175,016,129). Dependencies discriminate between these two categories (area under the ROC curve 0.92, two-sided Wilcoxon rank-sum test P < 10⁻¹⁶).
All dependencies were computed for RNAs with pseudoknot contacts across 2,530 structures in bpRNA spanning multiple species and database sources (Methods). g, Distribution of DMS mutation frequencies for all A and C nucleotides (the nucleotides probed by DMS) in E. coli non-coding regions in antiparallel dependencies against all remaining non-coding nucleotides. Nucleotides that are part of dependency-map antiparallel stretches have significantly lower mutation rates (P < 10⁻¹⁶, two-sided Wilcoxon rank-sum test, nucleotides in anti-parallel dependencies N = 5,008, other N = 103,867), which indicates they were more protected from DMS and therefore more likely to be in Watson-Crick base pairing. h, Ground-truth base-pairing probabilities derived from DMS-MaPseq experimentally constrained RNA secondary structure prediction for nucleotide pairs in anti-parallel dependencies against all pairs (P < 10⁻¹⁶, two-sided Wilcoxon rank-sum test, all pairs N = 1,133,018, pairs in anti-parallel dependencies N = 2,743). i, Covariation analysis of four novel structures validated by DMS-MaPseq 5′ of genes fkpB (b0028), glnS (b0680), mtlD (b3600) and rlmB (b4180). For all boxplots: center line, median; box limits, first and third quartiles; whiskers span all data within 1.5× interquartile ranges of the lower and upper quartiles. P values were computed using the paired two-sided Wilcoxon test.
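The anti-parallel diagonal filter of panel a can be illustrated with a small sketch. The filter weights are not given in this caption, so the version below assumes a binary filter with ones on the anti-diagonal and zeros elsewhere; the function names and the toy map are hypothetical. Because such a filter is unchanged when flipped along both axes, sliding it as written (a cross-correlation) gives the same result as a true convolution.

```python
def antidiagonal_scores(dep_map, size=5):
    """Slide a size x size anti-diagonal filter over a square dependency map
    (valid positions only). Entry [i][j] scores the window with top-left corner (i, j)."""
    n = len(dep_map)
    out = []
    for i in range(n - size + 1):
        row = []
        for j in range(n - size + 1):
            # Sum entries (i + k, j + size - 1 - k): the row index increases
            # while the column index decreases, as in a base-paired stem.
            row.append(sum(dep_map[i + k][j + size - 1 - k] for k in range(size)))
        out.append(row)
    return out

def best_hit(scores):
    """Return (row, column, value) of the maximum filter response."""
    value, i, j = max((v, i, j) for i, r in enumerate(scores) for j, v in enumerate(r))
    return i, j, value

# Toy 12 x 12 map with one 5-bp "stem": positions 1..5 paired with 10..6.
# Only the upper triangle is filled so that the maximum is unique.
toy_map = [[0.0] * 12 for _ in range(12)]
for k in range(5):
    toy_map[1 + k][10 - k] = 1.0

hit = best_hit(antidiagonal_scores(toy_map))
```

In the toy map the strongest response sits at the window whose anti-diagonal covers the whole stem, which is the behavior the caption describes for tRNA stems.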

Extended Data Fig. 5 gLMs capture repeated sequences genome-wide but distinguish between inverted repeats within and outside structural contacts.

a, Fraction of equal nucleotides within the 5 base-pair region defined by the parallel diagonal convolution filter location with the maximum hit for different convolution values. The strongest parallel diagonal dependencies belong to repeated sequences, indicating that repeats are highlighted genome-wide in parallel dependency patterns. b, Dependency map and nucleotide reconstruction for a 1-kb random sequence containing an inserted, artificially generated random duplicated sequence of 100 bp. Despite the repeated nucleotides being spaced 800 bp apart, the gLM highlights the parallel dependency linking each nucleotide. c, Top, average dependency against inverted repeat length for tRNA-length sequences (black dots). The red dots indicate the average dependency within anti-parallel dependencies in tRNA stems. Error bars indicate 95% confidence intervals across 100 simulated sequences (black) or all tRNAs with specific stem lengths (red). Bottom, average number of inverted repeats expected by chance for each repeat length.

Extended Data Fig. 6 Performance of the Nucleotide Transformer models on ClinVar variant pathogenicity prediction.

Area under the ROC curve for absolute variant effect prediction on the same dataset as Extended Data Fig. 1a using variant influence scores computed from Nucleotide Transformer models. Error bars represent ±2 standard deviations, constructed using 100 bootstrap samples. The height of each bar corresponds to the AUROC using the different model scores.

Extended Data Table 1 All gLMs assessed in this study together with their input sequence specifications, training data, architecture, and figure panels where they are used


Extended Data Table 2 The block-score discriminates between binding and non-binding sites within sequences with a PWM match


Supplementary information

Supplementary Information

Supplementary Note.

Reporting Summary

Peer Review File

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Tomaz da Silva, P., Karollus, A., Hingerl, J. et al. Nucleotide dependency analysis of genomic language models detects functional elements. Nat Genet 57, 2589–2602 (2025). https://doi.org/10.1038/s41588-025-02347-3


Received: 24 September 2024
Accepted: 28 August 2025
Published: 10 October 2025
Issue date: October 2025
DOI: https://doi.org/10.1038/s41588-025-02347-3
