Main

Intrinsically disordered proteins and protein regions (collectively referred to as intrinsically disordered regions (IDRs)) play important and often essential roles in many biological processes across the kingdoms of life, often in the context of molecular recognition1,2,3. Despite their importance, IDRs are often poorly conserved as assessed by alignment-based methods4,5,6,7,8. One exception here is short linear motifs (SLiMs, referred to here as motifs): 5- to 15-residue regions that often contain multiple conserved positions in a consensus pattern9,10,11. Motifs can engage in sequence-specific binding (also referred to as site-specific binding12), where the precise order and identity of amino acids is critical11,13. Importantly, motifs are modular: they can mediate molecular recognition when inserted into otherwise non-binding contexts9,13.

IDRs can also mediate molecular interactions in the absence of motifs via distributed multivalent interactions driven by sequence-encoded chemistry, an idea referred to as chemical specificity3,8,14,15. Such interactions can mediate stoichiometric protein–protein interactions or the formation of biomolecular condensates12,16,17,18,19,20. For distributed multivalent interactions, the precise order or even identity of specific amino acids can be unimportant; what matters is the presentation of certain chemical moieties.

Motif-based (sequence-specific) interaction and distributed multivalent (chemically specific) interaction are generally thought to drive orthogonal types of molecular recognition (that is, loss of a motif cannot be compensated for by changing the distal chemical context)16,21. This orthogonality may also reflect the degree of order in the bound state19. Accordingly, our current working model for IDR evolution posits that substantial sequence variation is tolerated so long as essential conserved positions in motifs are retained and/or the bulk amino acid composition and patterning are maintained5,22,23,24.

In this Article, using Saccharomyces cerevisiae as a model, we investigate the interplay between these two modes of interaction in the context of conservation and function in an essential IDR. Our results suggest that, rather than two distinct modes of interaction, IDR-mediated molecular recognition should be considered as a two-dimensional (2D) landscape, in which motifs can cooperate with—and even be entirely replaced by—distributed multivalent interactions (that is, chemical specificity). The insertion of short sequence modules can impart function not as bona fide motifs, but simply by altering the sequence chemistry. Our results imply that sequence context and motifs can compensate for and buffer one another, providing a key missing piece in our understanding of the sequence constraints on IDR evolution.

Results

Proteome-wide analyses reveal weak alignment-based conservation for yeast IDRs

IDRs often undergo more substantial sequence variation than folded domains4,6. To quantify this difference (Fig. 1a), we performed a systematic analysis of sequence conservation and predicted disorder across the S. cerevisiae proteome (Methods)25,26. This analysis revealed pervasive disorder and confirmed an expected anti-correlation between conservation and disorder (Fig. 1b and Extended Data Fig. 1a–f). Interestingly, essential proteins in S. cerevisiae were, on average, as disordered as non-essential proteins (Fig. 1c)27. We therefore wondered whether seemingly poorly conserved IDRs in essential proteins could be functionally conserved despite substantial sequence variation.

Fig. 1: IDRs are poorly conserved, as assessed by linear alignment, and Abf1 is an essential protein with a poorly conserved yet essential IDR.
figure 1

a, Schematic showing conservation and disorder across a hypothetical protein. b, Histogram of per-protein correlations (r) between per-residue conservation and disorder scores. The inset shows an example of a single protein, with each marker representing a single amino acid. The histogram reports on r values for the entire yeast proteome. c, Percentage of the sequence defined as IDRs for essential (n = 1,182) versus non-essential (n = 3,989) S. cerevisiae proteins. The box plot reports on the distribution of values across the yeast proteome, with the median value shown as a central line, the box capturing the interquartile range and whiskers representing values within 1.5× the interquartile range. A Mann–Whitney U-test to assess whether the two distributions differ revealed a P value of 0.69, indicating that they are not significantly different. d, Sequence analysis of Abf1 with per-residue conservation. The horizontal grey line corresponds to the conservation score expected for a randomly shuffled sequence (Null). e, Schematic of the plasmid shuffle assay. The measured variants were expressed from a plasmid. f, Spot dilution assays for semi-quantitative assessment of sequence-dependent growth rates, scoring each viable construct between +4 and +1 (Supplementary Fig. 2). g, Domain diagrams for Abf1 variants encoded from truncation mutations, with their viability shown to the right-hand side. h, Amino acid sequence of Abf1 IDR2449–662—the focus of this study. NLS, SV40 NLS; WT, wild type.

Abf1 is an essential chromatin protein with an essential IDR

We sought a model protein to explore the conservation of function without sequence conservation. Because long IDRs are abundant in chromatin-associated proteins (Extended Data Fig. 1g), we turned to yeast general regulatory factors (GRFs). GRFs typically harbour sequence-specific DNA-binding domains (DBDs) and long IDRs (Supplementary Fig. 3)28, are abundant and are mostly essential for viability. Although often referred to as transcription factors, their IDRs mostly do not contain transactivation domains29, and they modulate genomic processes by organizing or partitioning chromatin30,31,32,33,34,35,36,37,38, akin to mammalian architectural factors, such as CTCF39.

Abf1 is an essential S. cerevisiae GRF40 that modulates DNA accessibility via nucleosome positioning31,33,35. As an insulator, it uncouples expression levels at bidirectional promoters30,41,42 and participates in roadblock termination of transcription43,44,45. With a bipartite DBD and two poorly conserved IDRs (IDR1 and IDR2), Abf1 was well suited for our study (Fig. 1d). Although the DBD mediates sequence-specific DNA recognition, the function of the IDRs is unclear, although all evidence suggests they mediate molecular interactions with partners46,47,48,49,50.

Our interest in Abf1 was buoyed by previous observations—reproduced by us (Supplementary Figs. 1 and 2)—that the full-length Kluyveromyces lactis orthologous Abf1 could complement function in S. cerevisiae, despite 150+ million years of evolution51. That orthologous IDRs offer the same functional conservation is often assumed but rarely tested. To investigate this, we began by asking which wild-type (S. cerevisiae) IDRs from Abf1 were necessary and sufficient for function.

Previous truncation experiments suggested important carboxy (C)-terminal sequences (CS1/2) in IDR2 (refs. 52,53). We extended this truncation approach using a classical plasmid shuffling assay, in which the genomic copy of ABF1 was deleted and the sole copy of wild-type ABF1 was provided on a plasmid carrying a URA3 marker (Fig. 1e)54. Transformation with a plasmid bearing a mutated abf1 gene and selection on 5-fluoroorotic acid (5-FOA) plates for loss of the wild-type ABF1 gene plasmid (5-FOA is toxic in the presence of URA3) enabled us to assess whether the mutated abf1 gene supported viability (Methods, Supplementary Fig. 1 and Supplementary Tables 18).

While abf1 deletions are inviable, how Abf1 variants alter growth rate provides one route to investigate Abf1’s essential (but so far ill-defined) function. We rated the growth rates of viable mutant constructs semi-quantitatively, from +4 (similar to that of the wild type) to +1 (very slow growth) in serial-dilution spotting assays (Fig. 1f and Supplementary Fig. 2). For the inviable constructs, we further confirmed that they: (1) were expressed; (2) entered the nucleus; and (3) bound specifically to selected Abf1 DNA-binding sites by anti-Flag chromatin immunoprecipitation (ChIP) targeting the Flag-tagged Abf1 variant in contrast with untagged wild-type Abf1 (Methods and Extended Data Fig. 2a).

Our truncation approach revealed that IDR2 was essential, but not IDR1 (Fig. 1g). The maximal C-terminal IDR2 truncation reported previously to be viable in the presence of IDR1 (that is, IDR2449–662)53 was also viable at a wild type-like growth rate in the absence of IDR1 (Fig. 1g,h). We used IDR2449–662 as our wild-type IDR reference and refer to it as IDR2 for the remainder of this work.

IDR2 is poorly conserved across all orthologues (Fig. 1d), yet essential for viability, at least in S. cerevisiae. The CS1/2 region—the conservation peak in Fig. 1d at around residue 650—is a nuclear localization signal (NLS)49 and was unnecessary if a heterologous SV40 NLS was included, as was the case for all mutant constructs herein (Fig. 1g; IDR2449–623).

Abf1 IDR2 is poorly conserved by alignment but modestly conserved by chemistry

Although conservation is often assessed by linear sequence (for example, via alignment), previous work established that IDRs can be conserved in terms of sequence composition5,8,16. We reasoned that protein sequences could be analysed in terms of compositional conservation and linear sequence conservation (Fig. 2a). A comprehensive analysis across S. cerevisiae IDRs and folded domains found that many IDRs were well conserved compositionally, despite poor linear sequence conservation (Fig. 2b and Extended Data Fig. 2b–g). Importantly, this analysis revealed that IDR2 is more conserved in terms of charged and hydrophobic residues than most IDRs with a similar degree of sequence conservation (Fig. 2b and Extended Data Fig. 2c–f).

Fig. 2: IDR2 shows relatively conserved composition, yet most orthologous sequences cannot rescue function.
figure 2

a, Schematic of compositional and sequence conservation. b, An analysis of all S. cerevisiae IDRs revealed that Abf1 IDR2 (coloured dot) was relatively conserved in charge and hydrophobicity (Hydro.) composition, but less so for polar composition. See also Extended Data Fig. 2. c, Schematic of orthologous IDR2s shown across full sequence alignment. d, Phylogenetic tree (for the whole organism) versus IDR2 rescue ability. Red and green shading represent inviability and viability, respectively, of the IDR2 orthologue in S. cerevisiae (shaded yellow). Most orthologues did not support viability. Sequence identity versus S. cerevisiae IDR2 is shown to the right. See also Extended Data Fig. 3. e, Alternative IDRs tested, with their original functions noted to the right. In all cases, IDRs are fused C-terminal of the bipartite Abf1 DBD. See also Supplementary Fig. 4.

IDR2s from orthologous Abf1 proteins largely fail to rescue function in S. cerevisiae

We expected that compositional conservation in IDR2 would explain functional conservation, such that orthologous IDRs with similar composition would support viability in S. cerevisiae. To test this, we fused orthologous versions of IDR2449–662 from 18 different yeast Abf1s to the S. cerevisiae Abf1 DBD (Fig. 2c). These orthologous IDR2449–662 regions were identified using the highly conserved DBD and NLSs, which defined the start and end of IDR2449–662 orthologues (Fig. 1d).

We tested our evolutionary chimeric proteins (S. cerevisiae Abf1 DBD and orthologous Abf1 IDR2) in our plasmid shuffle assay (Fig. 2d). Unexpectedly, outside of the sensu stricto S. cerevisiae complex (the bottom four species in Fig. 2d), only two of the 15 chimeras were viable. There was no obvious relationship between viability and sequence composition, sequence identity or sequence length (Extended Data Fig. 3a,b), falsifying our expectation that compositionally conserved IDRs would be functionally conserved.

Next, we tested several IDRs from proteins with similar functions and similar amino acid compositions, including IDRs from Abf1 (IDR1), other GRFs (Rap1 and Mcm1), a yeast transcription factor (Gcn4) and a human insulator (CTCF) (Fig. 2e, Extended Data Fig. 3c and Supplementary Fig. 4). We also tested unrelated but compositionally similar low-complexity IDRs from the human RNA-binding protein FUS and the yeast translation termination factor Sup35 (Fig. 2e and Extended Data Fig. 3c). These IDRs failed to confer viability. However, to our surprise, some compositionally similar IDRs taken from other S. cerevisiae DNA-binding proteins (the transcription factors Gal4 and Pho4, as well as another GRF, Reb1) conferred viability in place of IDR2 (Fig. 3a and Supplementary Fig.5).

Fig. 3: Sequence motifs play a key role in IDR2 function.
figure 3

a, IDRs from other (compare with Fig. 2e) yeast transactivators (Gal4 and Pho4) and from the GRF Reb1 conferred viability. b, Global shuffles were inviable (that is, IDR2 composition alone was insufficient for viability). c, Sequential sequence shuffling pinpointed an essential motif (EM) in the centre of IDR2. Note that LS-12 is also referred to as EM shuffle in Figs. 4c and 6. d, The EM was not conserved across orthologues or enriched for charge or hydrophobicity compared with the rest of IDR2. e, Three putative motifs are shown in the context of their IDRs. Abf1G4 and Gal4G4 were identified by sequence alignment of IDR2 and Gal4768–881 (Extended Data Fig. 3g). f, Structural bioinformatics implicated the EM in forming a transient helix. g, Mutations leading to helix disruption (those encoding substitutions of glycine for other polar residues) abrogated viability. h, Schematic showing motif and context. i, Insertion of the EM into the (non-functional) FUS1−16312E (shaded blue) and Sup35 (shaded orange) contexts conferred viability. Similarly, insertion of Gal4G4 or Abf1G4 into FUS1−16312E conferred viability. j, Gal4G4 rescued viability in several unrelated non-functional IDR contexts. k, IDR2 context mutations that reduced hydrophobicity but preserved the EM were inviable. l, FUS1−16312E context mutations that reduced aromaticity but preserved Gal4G4 were inviable. m, Sub-sequences compositionally matched to Gal4G4 and taken from a range of transcription factors also provided viability when inserted into FUS1−16312E. Con., conservation.

Collectively, these results suggested that: (1) not any IDR could functionally replace Abf1 IDR2; (2) similarity in IDR composition is insufficient for IDR2 function; (3) not all yeast GRFs function in the same way via their IDRs; and (4) there may be functional overlap between the GRF Abf1 and some transactivating transcription factors (Gal4 and Pho4), but not others (Gcn4).

Amino acid composition in IDR2 is insufficient to explain function

IDR2 appears to provide molecular recognition similar to that of Gal4, Pho4 and Reb1. These factors can also mediate chromatin remodelling31,32,33,34,35,36,55,56,57. We therefore wondered whether these IDRs share a common motif for recruiting the requisite machinery. Given that motifs depend on their linear sequence, shuffling a motif should disrupt its function. As an initial test, we designed three globally shuffled variants of IDR2 in which ~65% of the residues were shuffled across the sequence (Fig. 3b). These shuffles retained the same composition and wild type-like levels of patterning for different chemical groups (Extended Data Fig. 3d–f). These shuffles were inviable, demonstrating that IDR2-like composition alone is insufficient for viability and suggesting the presence of one or more motifs.

IDR2 possesses an essential SLiM

To identify motif(s) in IDR2, we developed an unbiased approach termed sequential sequence shuffling (Fig. 3c). IDR2 was subdivided into abutting 30-residue windows and the sequence in each window was locally shuffled. This revealed two central windows that were intolerant to shuffling, which we confirmed by shuffling everything except the central 60-residue sub-sequence (Fig. 3c). We then repeated the procedure using ten-residue windows within the 60-residue sub-sequence. We identified a 20-residue sub-sequence (the essential motif) that could not be shuffled and was essential for IDR2 function (Fig. 3c). Despite being essential, this region was unremarkable with respect to conservation and other sequence properties (Fig. 3d).

We questioned whether similar motifs might exist in the other functional IDRs identified in Fig. 3a. Sequence alignment between IDR2 and Gal4768–881 was relatively poor (Extended Data Fig. 3g), identifying just one loosely homologous region (named Abf1G4 in Abf1 and Gal4G4 in Gal4) (Fig. 3e). Intriguingly, shuffles of IDR2 in which Abf1G4 was shuffled (LS-13 and LS-15) showed a growth defect (Fig. 3c), hinting that Abf1G4 may contribute to function as an additional, albeit non-essential, motif.

Finally, we investigated the molecular basis for essential motif function. Structural bioinformatics predicted that this region forms a transient helix (Fig. 3f)—a feature that is frequently associated with IDR-mediated interactions21. To establish the importance of helicity, we introduced helix-disrupting point mutations in the DNA encoding the essential motif, which abrogated viability (Fig. 3g). These results are at least consistent with a model in which sequence-specific binding relies on a helical bound state.

IDR context is a critical functional determinant

Motifs should—in principle—confer function when inserted into a non-functional context10. Context here refers to the IDR sequence surrounding the motif (Fig. 3h). We tested whether the essential motif met this definition. As a non-functional context, we selected the phosphomimetic variant of the low-complexity IDR from the human RNA-binding protein FUS (FUS1−16312E)58 (Supplementary Fig. 6). FUS1−16312E is a compositionally uniform low-complexity disordered region that lacks secondary structure or known binding motifs58,59. However, FUS1−16312E contains uniformly spaced hydrophobic (aromatic) and acidic residues, conferring sequence properties similar to those of IDR2. Although FUS1−16312E alone was inviable (Fig. 2e), insertion of the essential motif into the FUS1−16312E context rescued viability (Fig. 3i). Similarly, inserting the essential motif into another inviable low-complexity sequence (Sup351–131) also conferred viability (Fig. 3i). These results were consistent with the essential motif being a bona fide motif: it conferred viability as a modular entity. Interestingly, insertion of Gal4G4 and Abf1G4 into the FUS1−16312E context also rescued viability (Fig. 3i), consistent with (but not demonstrating) a model where Gal4G4 and Abf1G4 are also motifs.

In addition to the importance of inserted motifs, we tested whether viability depended on properties of the host IDR sequence context. First, the essential motif alone fused to the Abf1 DBD was inviable, demonstrating the need for IDR context (Supplementary Fig. 1). Next, we inserted Gal4G4 in three more otherwise inviable IDR contexts (Sup351–131, Rap1231–361 and Abf1 IDR187–311) (Fig. 3j). In all of these, Gal4G4 conferred viability, confirming its general ability to rescue function. Although these constructs were viable, they showed distinct growth rates; the Abf1 IDR1 and Sup35 contexts conferred slower growth than the Rap1 and FUS1−16312E contexts. Apparently, viability could be modulated not only through motifs, but also by altering the context.

Motivated by our conservation analysis, we tested the importance of hydrophobicity outside the essential motif (Fig. 2b). In stark contrast with our shuffle variants (Fig. 3c), two variants that preserved the essential motif but reduced hydrophobicity in distinct ways were inviable (Fig. 3k). Next, we designed variants of FUS1−16312E + Gal4G4 in which the contextual aromatic residues were converted to serine or leucine (Fig. 3l), yielding inviable constructs (Fig. 3l). Furthermore, if Gal4G4 was inserted into a glutamine-rich IDR from the yeast transcriptional co-repressor Ssn6, this variant was also inviable (Supplementary Fig. 1). This indicates that IDR context hydrophobicity is a key determinant of IDR function.

Compositionally selected sub-sequences can rescue Abf1 function

Although IDR context depends on sequence chemistry, motifs should depend on sequence order and amino acid identity. To confirm that Abf1G4 and Gal4G4 were bona fide motifs, we intended to demonstrate that unrelated but length-matched sub-sequences with amino acid composition matching Abf1G4 or Gal4G4 were inviable. To our surprise, five randomly selected sub-sequences from yeast and non-yeast transcription factor IDRs with an amino acid composition similar to Gal4G4 (Supplementary Fig. 7) were viable if inserted in the FUS1−16312E context (Fig. 3m). This extremely surprising result forced us to step back and revisit how IDR2 could be interacting with its partners.

IDR-mediated function is driven by sequence and chemical specificity

Conventionally speaking, the ability of a short (<20-residue) sequence to confer function when inserted into a non-functional context is taken to demonstrate a bona fide motif (for example, a SLiM)9,60. Given that SLiMs can interact without acquiring a defined 3D structure, structural characterization is sufficient but not necessary to nominate a region as a SLiM61.

Sequence-specific motifs are often defined by three characteristics: an inability to tolerate shuffling (Fig. 3c); sensitivity to point mutations (Fig. 3g); and autonomous modular activity (Fig. 3i). As such, our results confirmed that the essential motif is a bona fide motif. However, the discovery that compositionally similar but unrelated sub-sequences were also functional implied that either we had a remarkable ability to identify motifs by eye or something else was at play.

Thus far, we identified two functional determinants: (1) the presence of a motif (Fig. 3g,i,j); and (2) the presence of an IDR context that we interpreted to mediate distributed multivalent interactions, as hydrophobic residues were critical (Fig. 3k,l). These two functional determinants can be considered in terms of sequence specificity (dependence on a precise amino acid sequence) and chemical specificity (dependence on sequence-encoded complementary chemistry, without the requirement for an exact amino acid order). Generally, these two modes are discussed separately; motifs are considered in the context-specific stoichiometric interactions, whereas distributed multivalent interactions are mainly associated with biomolecular condensates13,16,20,21. However, we now wondered whether these two interaction modes might instead exist on a combined 2D landscape (Fig. 4a).

Fig. 4: IDR2-mediated interactions can be understood via a 2D binding landscape.
figure 4

a, IDR-mediated interactions can be understood in terms of motif binding and context binding. b, Motif and context binding can be projected onto a simple 2D binding surface. Numbers represent identifiers to describe different types of mutations that modulate context (1→2) or motifs (1→3). c, The EM is a true motif; it cannot be shuffled or distributed (distr.). d, Variants can be interpreted through the 2D binding landscape. e, The sub-sequences identified in Fig. 3m conferred viability if distributed in FUS1−16312E and Sup35 contexts, with minimal change to growth phenotypes compared with the non-distributed motifs. Two independent Gal4G4 distribution variants were tested to exclude the possibility that a new motif was generated by chance. f, Rational de novo design of a functional context-only IDR based on FUS1−16312E. g, Gal4G4 rescued viability when added to the FUS1−16312E context, and the residue order or relative positions of the Gal4G4 residues did not affect this effect. The composition of acidic residues tuned the growth phenotypes in both a FUS1−16312E context (top, shaded blue) and a Sup35 context (bottom, shaded orange). The patterning of aromatic residues also tunes the growth phenotype in a FUS1−16312E context. h, Abf1G4 rescued function in the FUS1−16312E context, but did not tolerate distribution, suggesting that this region is a motif. i, For the 68 sequences that lacked the EM, viable (blue) and inviable sequences (red) could be delineated based on the charge score and binding score. The numerical values are based on the weighted sequence composition. j, Global shuffles of the Gal4768–881 construct revealed a variety of functional consequences. k, Although Pho41–249 supported viability, global and segmental (seg.) shuffle constructs were inviable, suggesting that Pho41–249 harbours a motif.

To guide our intuition, we performed simple, coarse-grained simulations to quantify 1:1 binding between an IDR and a partner as a function of motif and context binding strength (Fig. 4b and Extended Data Fig. 4a–d). These simulations revealed a 2D landscape in which a bound state could be achieved through many combinations of motif and context binding strengths. Sequence changes could alter the context (Fig. 4b; from 1→2), the motif (Fig. 4b; 1→3) or both. Accordingly, we tested this conceptual framework via further rational sequence design.

Motifs are, by definition, sequence specific; they depend on their precise linear order of amino acids. It should not be possible to shuffle or distribute the residues of a motif and retain function. In support of this, variants in which the essential motif was either locally shuffled or had its residues distributed across the FUS1−16312E context were inviable (Fig. 4c). Essential motif distribution was also not tolerated in a Sup35 context (Fig. 4c), confirming that this result was not context dependent.

Next we asked whether motif shuffling or redistribution could be interpreted in the context of our 2D landscape. Motif redistribution simultaneously disrupted the motif and altered the context chemistry (Fig. 4d; 2→3). Motif shuffling (or point mutations in the DNA sequence encoding the motif) disrupted the motif without impacting the context chemistry (Fig. 4d; 4→6). Finally, variants resulting from mutations introduced to the DNA sequence encoding amino acid residues outside the motif disrupted the context chemistry (Fig. 4d; 4→5). This conceptual framework allowed us to re-interpret our additional variants in a new light.

The identification of motifs requires motif shuffling or redistribution as a control

Our binding landscape model raised the intriguing possibility that GalG4 and other sequences identified in Fig. 3m were not bona fide motifs, but instead altered the IDR sequence chemistry (that is, context), albeit locally. We tested this by distributing the amino acids of these sequences across the IDR context. In all cases, including multiple independent shuffles of the same sequence and across numerous distinct contexts, these distributed variants were viable (Fig. 4e). This demonstrated that these subregions (that is, Gal4G4, p65 and GR, as shown in Fig. 3m) were in fact not bona fide motifs, but instead simply altered the sequence chemistry without regard for precise amino acid order or identity. In contrast, we reiterate that the essential motif is a bona fide motif: it does not tolerate point mutations (Fig. 3g), shuffling (Fig. 3c) or distribution (Fig. 4c). However, the same degree of viability could be achieved either with a combination of context and motif or with a context-only IDR.

Rational design reveals the molecular grammar of chemical specificity

The context-only IDRs (Fig. 4e) prompted us to generate rational designs based on chemical principles. Given the importance of hydrophobic residues (Fig. 3k,l) we tested the sufficiency of a hydrophobic context by designing a FUS1−16312E variant with additional, evenly distributed hydrophobic residues (+4 tyrosine (aromatic) and +3 methionine (aliphatic)) mirroring the chemical groups introduced by Gal4G4. This completely synthetic construct was fully viable with a near-wild-type growth phenotype (Fig. 4f). This result demonstrates that—at least in the context of our growth assay—an IDR with an evolved and essential bona fide motif can be replaced by a de novo-designed IDR lacking any kind of motif. We further established that locally shuffling or distributing Gal4G4 in FUS1−16312E had no impact on growth (Fig. 4g; top), cementing the ability of a motif-free sequence to confer wild type-like viability.

We further explored the molecular grammar of chemical specificity using rational sequence design. Our bioinformatics analysis implicated acidic residues as conserved (Fig. 2b). To test their importance, we removed all acidic residues from the context in the FUS1−16312E + Gal4G4 sequence, with no impact on growth compared with FUS1−16312E + Gal4G4 (Fig. 4g; bottom). However, removing all acidic residues from the sequence resulted in a viable but slow-growing phenotype (Fig. 4g). We interpret these variants as tuning chemical specificity (Fig. 4b). Similarly, the Sup35 + Gal4G4 and Sup35 + EM constructs (where EM stands for essential motif) showed slower growth at baseline than the FUS1−16312E + Gal4G4 and FUS1−16312E + EM constructs, respectively, and removing all acidic residues from the Sup35 context led to an inviable construct (Fig. 4c,g). Together, these results suggested that Sup35 provided a weaker context than FUS1−16312E. The Sup35 context may be generally worse for binding than the FUS1−16312E context, or the two contexts may lead to distinct patterns of interaction with cellular components.

Patterning of aromatic residues has been shown to tune intermolecular interactions in various contexts16,62,63. To investigate this, we designed a FUS1−16312E + Gal4G4 variant with aromatic residues clustered together (Extended Data Fig. 4e). Although this variant was viable—suggesting that evenly spaced aromatic residues were not a strict prerequisite for function—we observed a growth defect (Fig. 4g). This result suggests that changes in residue patterning can further tune chemical specificity.

Next we asked whether the essential motif is the only bona fide motif in IDR2. Given that shuffling Abf1G4 in IDR2 compromised growth (Fig. 3c; LS-13 and LS-15), this hinted that IDR2 may contain another non-essential motif (i.e. Abf1G4) in addition to the already identified essential motif (EM). To test this, we designed a construct in which Abf1G4 was distributed across FUS1−16312E (Fig. 4h). Whereas FUS1−16312E + Abf1G4 was viable, this distributed-Abf1G4 construct was inviable, suggesting that Abf1G4 is a bona fide motif.

Finally, we asked whether our motif-free constructs could reveal chemical principles to differentiate between viable and non-viable sequences. We divided our 67 viable and inviable sequences based on amino acid composition along two dimensions: charge (y axis; charge score) and the hydrophobicity of aliphatic and aromatic residues (x axis; binding score) (Fig. 4i). This enabled us to deliniate between viable and inviable sequences (dashed line in Fig. 4i). In doing so, we noted two outliers—Gal4768–881 and Pho41–249—suggesting that these sequences may also possess bona fide motifs. We designed further variants to explicitly test this.

With Gal4768–881, we generated three global shuffles (controlling for residue patterning; Extended Data Fig. 3e). Although each of these was viable, they showed variable growth phenotypes, ranging from wild type (+4) to relatively slow growing (+2) (Fig. 4j). Despite excluding the presence of a bona fide motif, these results illustrate how local sequence chemistry (beyond just composition) can influence function.

With Pho41–249, we generated two types of shuffle: (1) a global shuffle, where the entire sequence was shuffled; and (2) a segmental shuffle, in which subregions of ~40 amino acids were locally shuffled, preserving the overall chemical distribution (Fig. 4k). The segmental shuffle was designed explicitly to preserve local chemical composition and sequence patterning (Extended Data Fig. 3f). Unlike in Gal4768–881, both of these shuffles were inviable, consistent with the presence of at least one bona fide motif. Although further investigation is needed to pinpoint the motif(s) (Discussion), this suggests that additional motifs beyond the essential motif could mediate Abf1 function.

In summary, although IDR-mediated interactions have historically been viewed through the lens of sequence-specific motif binding, here we show that a motif-free construct can fully rescue an IDR with an essential motif in the wild-type context if appropriate chemically specific interactions are provided. Furthermore, chemically specific interactions were sufficient and necessary for function, whereas sequence-specific interactions were essential only in a small window of chemical contexts and were otherwise insufficient. More generally, our results imply that chemical specificity could be conserved despite massive changes in linear sequence—an idea we sought to investigate further.

Conservation of chemical specificity is invisible to alignment-based methods

Next we asked how conservation of chemical specificity might manifest across orthologues. To investigate this, we leveraged FINCHES, a recently developed approach to predict IDR-mediated chemical specificity from sequence64. FINCHES allowed us to take an IDR and a partner protein and predict which IDR regions or residues facilitate attractive and repulsive interactions (Fig. 5a). Using this, we evolved IDR2 in silico under the constraints of retaining chemically specific interactions with a putative partner.

Fig. 5: Chemical specificity imparts cryptic conservation that is challenging to identify from multiple sequence alignments.
figure 5

a, Chemical specificity between two IDRs can be predicted directly from sequence using FINCHES. b, In silico evolution scheme. Mutations were generated by converting the protein sequence to nucleic acids, stochastically mutating nucleotides using standard transition/transversion rates and then converting back to protein space. Selection was then performed by computing the predicted intermolecular interaction map between the IDR2 variant and the partner protein, requiring overall conservation of attractive interactions. c, Comparison of per-residue conservation for sets of orthologues evolved under selection (blue) versus no selection (dashed red). These two profiles were indistinguishable. d, Schematic of error-prone PCR used for in vitro evolution. e, Expected conservation patterns for sequence (left) versus chemical (right) specificity. If sequence-specific conservation is at play, we expect to see local conservation peaks reflecting highly conserved regions. f, Results from in vitro and in silico evolution experiments. Top, in vitro evolution identified 14 viable and 34 inviable constructs. Viable constructs had mutations distributed throughout the sequence, including the region encoding the EM, without conserved peaks. Residues that were never altered in viable constructs are coloured by amino acid type (top bar plot). Bottom, comparison of per-residue conservation from 14 in vitro (black) and in silico (blue) evolution, where in silico experiments show the distribution that emerged when sets of 14 viable variants (that is, with conserved attractive interactions) were generated many different times with the same average number of mutations. Seqs., sequences.

Although the IDR2 binding partners essential for viability remain unknown, Abf1 was previously isolated in a stable trimeric complex with Rad7 and Rad16 (refs. 47,65). Moreover, the amino (N)-terminal IDR of Rad7 (Rad7NTD) possesses several distinct regions that are predicted to interact with IDR2 (Extended Data Fig. 4f,g). We therefore performed in silico evolution of IDR2 under selection for chemically specific interactions with the Rad7NTD (Fig. 5b). Although Rad7NTD was used here, we emphasize that our conclusions do not depend on the specific partner (Extended Data Fig. 4h). As a control, we also evolved IDR2 without selection. In the limit of small sets of orthologues (20–30 sequences), we could not distinguish conserved versus unconserved sets of artificial orthologues via alignment-based methods, despite the conserved variants being evolved under strong selective pressure (Fig. 5c). This suggests that alignment-based approaches are unable to capture conservation of chemical specificity.

To test our computational results experimentally, we performed random mutagenesis to investigate the signatures of conservation in IDR2 (Fig. 5d). Using error-prone PCR, we generated a large number of length-matched randomly mutagenized versions of IDR2, tested their viability in vivo and then assessed per-residue conservation across viable sequences. High conservation of specific residues or regions would imply conservation of sequence specificity (Fig. 5e; top). Alternatively, a lack of regional conservation would be consistent with (although not unambiguously demonstrate) conservation of chemical specificity (Fig. 5e; bottom). This approach yielded 48 variants—14 viable and 34 inviable—with a distribution of point mutations (Fig. 5f; top). Across the 14 viable sequences generated by error-prone PCR, no region or residue was statistically enriched for conservation as assessed by alignment. Furthermore, a pool of mutationally matched sequences evolved in silico under conservation of chemical specificity produced data that were statistically indistinguishable from the experimental data (Fig. 5f; bottom). Although caveats remain (Discussion), these results further support the conclusion that alignment-based methods cannot capture conservation of chemical specificity.

Finally, across the yeast proteome, we identified ~450 IDRs from proteins that were poorly conserved as assessed by alignment, yet highly conserved in chemical specificity across various chemical interactions (Methods and Supplementary Table 9). These proteins are involved in essential cellular processes, including transcriptional regulation, DNA repair and cell signalling. We reiterate that conserved sequence features (for example, amino acid composition and patterning parameters) have been reported before5,8,23,66. Here we propose that conservation operates not at the level of sequence-intrinsic features but at the level of complementary chemical interfaces. Our work, and that of others, argues that conservation in yeast IDRs is widespread, albeit following different rules than conservation in folded domains.

The IDR of Abf1 is not required for nucleosome barrier function in vitro

Next, we sought to determine which essential cellular function(s) IDR2 is responsible for. The best-known role of Abf1 is nucleosome barrier function. Abf1 binding to DNA generates nucleosome-free regions (NFRs) and an alignment point (barrier) against which regular nucleosome arrays become phased36,67,68. Such nucleosome positioning impacts transcription at promoters, both by NFR generation and by positioning the so-called +1 nucleosomes at transcription start sites69. The nucleosome barrier function of Abf1 relies on cooperation with ATP-dependent chromatin remodellers (for example, the yeast INO80 complex). The remodeller slides nucleosomes against DNA-bound Abf1 and spaces nucleosomes into regular phased arrays67,70. We hypothesized that this nucleosome barrier function involves the IDR of Abf1 (for example, by regulating or recruiting the remodeller).

We tested this hypothesis using in vitro genome-wide chromatin reconstitution, an approach previously used to demonstrate barrier function mechanism67,70,71. This assay employs purified factors and micrococcal nuclease digestion coupled with high-throughput sequencing (MNase-seq), allowing genome-wide mapping of nucleosome positions (Fig. 6a). To our surprise, neither an Abf1 variant with an inviable IDR (FUS1−16312E) nor the lack of an IDR (ΔIDR1/2; that is, Abf1 DBD only) compromised the barrier function of Abf1 (Fig. 6b and Extended Data Fig. 5a–c). Furthermore, the IDR was not involved in INO80 remodeller recruitment in vitro (Extended Data Fig. 5b). In short, our results demonstrate that IDR2 is not required for barrier function, falsifying our first hypothesis.

Fig. 6: Variant IDRs have diverse effects on Abf1 function.
figure 6

a, Schematic of MNase-seq mapping. b, In vitro genome-wide nucleosome positioning assay at Abf1 sites (n = 379; class I + II sites69) revealed that Abf1 IDRs were not required for nucleosome positioning barrier function. The lines compare the indicated purified Abf1 variants in combination with ATP, reconstituted nucleosomes (salt gradient dialysis (SGD) chromatin) and the yeast chromatin remodeller INO80. The grey area reflects the experiment with wild-type Abf1, but in the absence of INO80. c, The anchor-away technique enables ad hoc depletion of wild-type Abf1 from the nucleus. The wild-type copy of Abf1 is genetically fused to FRB, whereas the ribosomal protein Rpl13A, which transits through the nucleus during ribosome assembly, is fused to the double FKBP12 tag (2×FKBP). After adding rapamycin, FRB and FKBP bind and nuclear Abf1 is sequestered at cytosolic ribosomes, leading to ad hoc nuclear depletion. d, Schematic of ODM-seq chromatin mapping. A change from black to yellow represents DNA methylation. e, Abf1 IDR variants did not increase occupancy in NFRs relative to wild-type Abf1 and did not grossly alter nucleosome organization around Abf1 sites. The composite plots show ODM-seq data for the indicated Abf1 IDR variants aligned at n = 380 responder sites (classes I and II69; see also Extended Data Fig. 5d,e). f, Inviable Abf1 IDR variants or conditions lacking Abf1 affected +1 nucleosome positioning at essential genes differently than viable Abf1 IDR variants (growth rates in parentheses). PCA results are shown for all essential genes of ODM-seq data in the upstream flank of the in vivo +1 nucleosome (Extended Data Fig. 7g) averaged over two biological and two technical replicates (Extended Data Fig. 8a) each for the indicated Abf1 conditions. g, The Gal4 transactivation domain in the Gal4768–881 construct allowed growth even under fully GAL-inducing conditions with galactose as the sole carbon source. Shown is a spot dilution assay after two days of incubation. h, Whereas a classical view might delineate IDRs into motifs and non-motifs, our work here—and that of others—supports an emerging model in which motif function is licensed by a permissive context and in which chemical specificity contributes additional interactions that can even compensate for the loss of a motif. aro. clust., aromatic clusters.

Inviable Abf1 variants mostly maintain NFRs and flanking nucleosomal arrays in vivo

Our next hypothesis was that inviable Abf1 IDR2 variants impair nucleosome organization in vivo, especially at essential genes. We turned to an ad hoc in vivo depletion approach using an Abf1 anchor-away system, which allows the testing of inviable Abf1 variants after depletion of wild-type Abf1 (Fig. 6c)42,69,72. Our ad hoc depletion system was validated by the recapitulation of previous observations. Upon Abf1 depletion, NFRs became partially filled and flanking nucleosomal arrays became disorganized (the peaks shifted towards the NFRs and peak-to-trough ratios decreased) at previously annotated Abf1 responder sites in promoters, in contrast with annotated non-responders (Extended Data Fig. 5d,e)69. Moreover, we reproduced gene expression changes (total RNA sequencing (RNA-seq)) for genes with responder Abf1 sites (defined by effects on nucleosome organization69) in their promoters69 (Extended Data Fig. 5f).

Although we used MNase-seq to recapitulate previous work, we turned to next-generation DNA methylation footprinting coupled with nanopore sequencing (occupancy measurement by DNA methylation (ODM-seq)—also known as Fiber-seq; Fig. 6d71,73). ODM-seq is a single-molecule technique that scores both occupied and non-occupied DNA (Fig. 6d), employs saturation of methylation, yields low variation among replicates and thereby reliably measures nucleosome positions and occupancy (Extended Data Fig. 6 and Supplementary Fig. 8).

Combining the anchor-away technique with ODM-seq, we mapped genome-wide nucleosome organization for seven different conditions in the nucleus upon wild-type Abf1 depletion: wild-type Abf1 (positive control); no Abf1 (negative control); two inviable constructs (that is, FUS1−16312E (Fig. 3i) and essential motif shuffle (LS-12; Fig. 3c)); and three viable constructs with either a wild type-like growth rate of +4 (FUS1−16312E + GalG4 distr. (Fig. 4e) or Gal4768–881 shuffle 1 (Fig. 4j)) or a compromised growth rate of +2 (FUS1−16312E aromatic clusters (Fig. 4g)). Both viable and inviable Abf1 IDR variants generated nucleosome organizations at previously annotated promoter Abf1 responder sites that were very similar to wild-type Abf1 and different from those generated in the absence of Abf1 (Fig. 6e). This suggested that both NFR generation or maintenance and nucleosome positioning in vivo mainly relied on the Abf1 DBD and less on the IDR properties as seen in vitro (Fig. 6b). Nonetheless, we note that composite plots of many genes may obscure small effects and/or effects at few genes.

Abf1 IDR is required for +1 nucleosome positioning at essential genes

As annotated Abf1 responder sites did not reveal gross differences in nucleosome organization between viable and inviable constructs, we wondered whether we were missing relevant effects by focusing only on these sites. A comprehensive comparison of annotated Abf1 sites in yeast—identified either via various in vivo mapping techniques or consensus DNA sequence motifs (position weight matrices (PWMs))—revealed a striking incongruency (Extended Data Fig. 7a,b). This analysis also recapitulated the previous conclusion that binding sites mapped in vivo need not contain a PWM (Extended Data Fig. 7c)41,74. Furthermore, recent work by Mahendrawada et al.74 identified many genes whose expression changes in response to transcription factor ablation, even if those genes lack binding sites for the ablated transcription factor in their promoters. We plotted our ODM-seq data of wild-type versus no Abf1 at genes annotated by Mahendrawada et al.74 as: (1) only having Abf1 binding sites in their promoters; (2) only responding to Abf1 ablation; or (3) both being responders and having Abf1 promoter sites. Indeed, transcription response, but not necessarily Abf1 sites, correlated with altered nucleosome organization (Extended Data Fig. 7d–f). This confirmed, on the previously unassessed level of nucleosome organization, that a response to Abf1 ablation need not be linked to an Abf1 site. It also confirmed that alterations in nucleosome organization at the NFR and +1 nucleosome correlate with gene expression changes38,69,75,76. Therefore, we searched in our ODM-seq data of Abf1 variants not for Abf1 sites around which nucleosome organization was affected, but for genes where nucleosome organization was affected at NFRs or +1 nucleosomes.

As gleaned from the composite plots comparing wild-type Abf1 and no Abf1, we looked for changes in: (1) the NFR region, which may reflect differential binding of nucleosomes or transcription machinery (for example, a preinitiation complex); (2) the upstream flank of the +1 nucleosome, which usually contains the transcription start site41,77 and where even small shifts of nucleosome positions may impact transcription regulation38,69,75,76,78; and (3) the downstream flank of the +1 nucleosome, which also reflects shifted positions (Extended Data Fig. 7g). Given our focus on essential genes, we performed principal component analyses (PCAs) of ODM-seq data for wild-type Abf1, no Abf1 and Abf1 variants in these three regions of essential genes. Strikingly, PCA clearly separated inviable from viable conditions for the upstream flank of the +1 nucleosomes of essential genes (Fig. 6f) and the downstream flank, but not for the NFR region (Extended Data Fig. 8). We concluded that proper +1 nucleosome positioning—as a proxy for proper gene regulation—at essential genes depended on Abf1 IDR, in a manner incompatible with the inviable variants or the absence of Abf1.

Abf1 can accommodate a strong and transcriptionally active activation domain

Abf1 lacks a transactivation domain29 and participates in the insulation of transcription regulation41,79. It was therefore unclear a priori whether a GRF such as Abf1 could tolerate bearing a strong transactivation domain. We identified several viable Abf1 constructs with well-known transactivation domains, especially the Gal4768–881 construct with the strong Gal4 transactivation domain (Fig. 3a). Nonetheless, it was possible that the Gal4 activation domain, including the short Gal4G4 activation domain sequence present in several of our constructs, was repressed by Gal80 in our glucose-containing media80. We confirmed that the Gal4768–881 construct still supported viability with galactose as the sole carbon source (that is, with the GAL regulon fully induced; Fig. 6g). We concluded that Abf1’s essential GRF function was compatible with a strong transactivation potential.

Discussion

IDR-mediated interactions are often seen to be driven by sequence-specific binding motifs (for example, SLiMs) or by distributed multivalent interactions. These interaction modes are determined by sequence specificity and chemical specificity, respectively, and both are functionally important17,21,81,82,83. Surprisingly, we uncovered here that—at least in Abf1—these two modes were compensatory for one another. Using viability as a readout for function, a weak context could be compensated for by introducing a motif (Fig. 3) and, more surprisingly, the absence of a motif could be compensated for by the gain of context strength (Fig. 4). Importantly, we showed that testing motif shuffling or distribution was essential to identify bona fide motifs.

We interpreted our results using a 2D binding landscape, where the function of IDR2 reflects the interoperable combination of motif- and context-dependent binding (Fig. 4b). Based on our molecular understanding, we rationally designed new IDRs that were functional, albeit dramatically different from the wild-type sequence (Fig. 4). Beyond our work here, biophysical hints for a model in which sequence and chemical specificity cooperate in driving IDR-mediated binding have been found in in vitro reconstituted systems18,19,84,85,86,87,88. Finally, our work suggests plausible rules for evolutionary conservation in IDRs. Although our in silico and in vitro evolution experiments do not definitively prove one model over another, they are consistent with the conservation of chemical specificity shaping sequence variation in IDRs. Moreover, proteome-wide analyses identified hundreds of examples in which such conservation indicated conserved intermolecular interactions despite large changes in amino acid sequences (Supplementary Table 9).

We note several limitations of our work. First, we posit that if two IDRs offered equivalent growth phenotypes when provided in a protein where alteration or deletion of the same IDR rendered yeast inviable, these IDRs were functionally equivalent. This definition was empirically determined by our assay. However, different IDRs may confer viability in different ways by interacting with different partners. As the Abf1 interactome is revealed in future work, this question can be answered more directly. Second, we focus here on a minimal system lacking IDR1. However, assessing full-length IDRs (as opposed to just IDR2) of Abf1 orthologues may reveal motif migration within IDRs (see below). Third, IDR2 was important for +1 nucleosome positioning in vivo—especially at essential genes (Fig. 6f)—by a so far unknown mechanism beyond a pure nucleosome barrier function. In vivo +1 nucleosome positioning depends on combinations of barriers, such as Abf1, with remodellers, and possibly the transcription machinery67,89,90. We speculate that the interplay with transcriptional machinery may depend on IDR2, but we showed that the pure barrier function did not (Fig. 6b). Furthermore, +1 nucleosome positioning may involve a role of GRFs in torsional insulation91 or 3D chromatin organization92, which may also affect genes without Abf1 sites, for which we could not distinguish direct from indirect effects. The distinction between direct and indirect effects, while mechanistically important, does not matter for the distinction between viable and inviable outcomes.

Our ODM-seq data for Abf1 variants in vivo demonstrate that the function of GRFs such as Abf1 does not solely entail generating NFRs by binding to specific sites. Although NFR maintenance was grossly impaired upon Abf1 depletion, as described69, even inviable Abf1 constructs maintained NFRs (Fig. 6e). Instead, the combination of NFRs and proper organization of flanking nucleosomes, especially +1 nucleosomes, appeared to constitute Abf1 function, especially at essential genes (Fig. 6f). This parallels with recent findings in the context of replication93 and 3D genome organization92, where not just NFRs alone but the combination of NFRs and phased regular arrays was key.

Over the past decade, there has been a growing appreciation for the intersection of sequence and chemical specificity in the context of IDR function. The chemical determinants of IDR function have been explored in various contexts5,8,29,63,64,66,94,95,96,97,98,99,100,101, yet how these properties influence the evolutionary landscape of disordered regions remains less clear. Our work suggests that compensatory changes, facilitated by chemical specificity, may buffer the loss of motifs. Our in silico and in vitro evolution experiments implied that the conservation of chemical specificity may be almost invisible in alignment-based analyses. This motivates novel interpretations of conservation in IDRs5,22,23,24,66,102,103.

Functional molecular interactions generally involve specificity, facilitating interactions with relevant partners and preventing deleterious off-target binding. Shape complementarity and chemical compatibility typically mediate specificity in folded domains and, to a lesser extent, SLiMs15,104. Here we find that even for an IDR that depends on a bona fide motif, alternative sequences with rationally interpretable chemical features can uphold function without any motif. Whether motif-free variants interact with the same partners as motif-containing variants (our preferred interpretation) or with different partners without obvious phenotypic consequences (an alternative interpretation) remains to be seen. However, we speculate that this context-only mode, driven by just chemical specificity, offers an evolutionarily plastic mode of molecular recognition (Fig. 6h)8,17,20. Although work here does not strongly support a homotypic phase-separation-based model to underlie Abf1 function (Extended Data Fig. 9), it is consistent with a model in which Abf1 is recruited to specific loci in the nucleus through chemically specific interactions105,106,107. With this in mind, we propose that chemically specific recognition could tune or even determine recruitment or interaction with molecular assemblies without the need for strict stoichiometry (for example, biomolecular condensates)108,109, providing an ideal paradigm for evolutionarily adaptable multicomponent cellular assemblies. Finally, our work demonstrates that nothing can be assumed when interrogating motifs; shuffling and/or distribution are required tests to confirm whether the effect of a putative motif is mediated by sequence or chemical specificity (Extended Data Fig. 9h,i).

The importance of context in relation to motif function has been seen in other systems61,82,110. If context can tune motif function, we might expect a rational evolution experiment to reveal compensatory mutations in action. To test this, we took a sequence generated by random mutagenesis (NCS-21) with 55 point mutations and a +3 growth phenotype. We reverted six mutations that had introduced de novo hydrophobic residues outside the essential motif to their original polar amino acids. Although this variant was more similar to the wild type in sequence, it showed severely impaired growth (+1 growth phenotype) (Supplementary Fig. 2). Therefore, NCS-21 provided a direct example of compensatory mutations offsetting otherwise deleterious mutations.

Whereas full-length Abf1 from K. lactis confers functionality in S. cerevisiae, the K. lactis IDR2 does not. More broadly, we found limited functional conservation of IDR2 across orthologues, despite conservation of amino acid composition and sequence features (Fig. 2d and Supplementary Fig. 9). There are several (non-mutually exclusive) explanations for this result. First, interaction networks co-evolve by coupled evolutionary changes, such that motifs in orthologues may be incompatible with partners in S. cerevisiae. Second, functionally important features can relocate across an IDR-containing protein. The specific location of a binding motif in the protein may be relatively unimportant, such that motifs can be lost from one region and emerge in another. An intriguing prediction from our work is that motifs are expected to appear and disappear within a given IDR, a prediction supported by extant work on ex nihilo motif evolution111.

Although the essential motif in Abf1 appeared to be poorly conserved by alignment, we still expect evolutionarily conserved motifs to be important and ubiquitous across proteomes. Accordingly, we identified thousands of short, conserved hydrophobic sub-sequences within IDRs, with almost twice as many in essential versus non-essential proteins (Extended Data Fig. 10 and Supplementary Tables 1014). We also identified many IDRs with low sequence conservation but extremely high chemical specificity (Supplementary Table 9). As a corollary, we wondered whether other non-conserved regions of transient structure may be found in the yeast proteome and, upon analysis, identified 963 short (<40-residue) subregions in IDRs with predicted transient helicity (Supplementary Table 17). This set of de novo predictions included the previously identified Pho4 activation domain (Pho469–94) and four separate subregions in the N-terminal IDR of Reb1, both of which conferred viability (Fig. 3a). Moreover, global and segmental shuffles (in which local subregions were internally shuffled) of Pho41–249 were non-viable (Fig. 4k), suggesting a bona fide motif within Pho41–249. Although many such regions may be inert, others may offer specific binding interfaces (as was the case for Abf1) or specific helical regions identified in transactivation domains99,112.

Finally, we focused on the function measured by viability in S. cerevisiae growing under low-challenge laboratory conditions. This probably missed some facets of Abf1 function and, accordingly, the importance of other Abf1 regions and features. For example, ongoing work implied large-scale IDR-dependent remodelling of transcription during glucose starvation with IDRs as intrinsic sensors of intracellular state113. Tuning IDR chemistry (via post-translational modifications, changes in protonation state or changes in side chain solvation properties) offers an attractive route to modulate interaction specificity and thereby regulatory output.

Methods

Plasmids and yeast strains for 5-FOA plasmid shuffling assay and growth rate scoring by spot assay

The plasmid pRS416-Abf1 contained the wild-type ABF1 gene (a PCR fragment from 690 base pairs (bp) upstream and 170 bp downstream of the coding region, generated with primers pRS416_Abf1_for and pRS416_Abf1_rev (Supplementary Table 2) and genomic DNA of BY4741 as a template) inserted into the EcoRI and BamHI sites of plasmid pRS416 (ref. 114). Plasmid pRS315-Abf1 contained the equivalent insert to plasmid pRS416-Abf1 (a PCR fragment amplified with primers pRS315_Abf1_for and pRS315_Abf1_rev (Supplementary Table 2) and genomic DNA of BY4741 as a template), but it was inserted via BamHI and HindIII in plasmid pRS315 (ref. 114).

Rationally designed abf1 mutant constructs were generated via Gibson cloning115,116 using the PCR primers and templates in Supplementary Table 3 for insert and backbone fragments. The templates for inserts were mainly ordered by gene synthesis (Thermo Fisher Scientific; Supplementary Table 3) or amplified from genomic DNA of strain BY4741. All constructs in pRS315 contained an SV40 NLS (amino acid sequence: PKKKRKV) at the C terminus. All rationally designed constructs in pRS315, except ΔIDR1, contained the 3xFlag tag introduced via Gibson cloning using primers and template, as specified in Supplementary Tables 2 and 3.

For PCR, Phusion polymerase (New England Biolabs) and thermo cycling conditions were used according to the manufacturer’s recommendations. A pool of inserts for randomly mutagenized abf1 constructs was generated by error-prone PCR (dNTP-Mutagenesis Kit; PP-101; Jena Bioscience) with primers N2-IDR2-N*dC and N3-IDR2-N*dC and assembled via Gibson cloning into a backbone PCR fragment generated with primers N1-IDR2-N*dC and N4-IDR2-N*dC (Supplementary Table 4).

For both the mutagenized insert and the plasmid backbone PCR, pRS315-ΔIDR1&IDR2449–662 was used as a template. After transformation of the Gibson assembly reaction into Escherichia coli strain DH5α and DNA isolation (NucleoSpin Plasmid (NoLid); 740499.250; Macherey-Nagel) from single bacterial colonies, the insert DNA sequences were confirmed by Sanger sequencing with the primers listed in Supplementary Table 5.

The parent diploid strain with heterozygous deletion of the ABF1 gene (Y24962 (BY4743; MATa/MATα; his3Δ1/his3Δ1; leu2Δ0/leu2Δ0; LYS2/lys2Δ0; met15Δ0/MET15; ura3Δ0/ura3Δ0; YKL112w/YKL112w::kanMX4)) was obtained from EUROSCARF (www.euroscarf.de). Y24962 was transformed with plasmid pRS416-Abf1 and sporulated in liquid culture (1% potassium acetate, 5 µg ml−1 histidine and 25 µg ml−1 leucine). Sporulation and tetrad dissection were performed according to Dunham et al.117. Spores with both the abf1::kanMX deletion allele and plasmid pRS416-Abf1 were selected after tetrad dissection (MSM 400; Singer Instruments) on yeast nitrogen base (YNB) − uracil plates and sub-sequent growth on YPDA + kanamycin (0.2 mg ml−1) selection medium.

Three strains originating from independent such spores were kept for further use: 7-A2 (MATa), 7-C5 (MATα) and 7-D4 (MATa). Their mating type was determined by colony PCR with primers as published118. These strains were transformed with plasmids (pRS315-[abf1 construct name]) bearing abf1 mutant constructs, as detailed in Supplementary Tables 14, and selected on YNB −uracil − leucine medium. The insert region of every plasmid batch was checked by Sanger sequencing before transformation into yeast. For each construct, independent single colonies after transformation (for which the numbers of replicates (always ≥3) are given in Supplementary Fig. 1 and Supplementary Table 1) were re-streaked for single colonies on YNB −uracil − leucine selection medium, then single colonies thereof were streaked out as patches on the same medium. From these patches, strains were streaked again as patches on 5-FOA − leucine plates (0.67% YNB, 0.2% Synthetic Complete Supplement Mixture − uracil − leucine drop-out mix, 2% glucose, 50 µg ml−1 uracil and 0.1% 5-FOA (Toronto Research Chemicals (F595000) or Diagnostic Chemicals Limited (1555)) and incubated for 3 days at 30 °C. Viability on 5-FOA plates was visually scored by comparison with known viable or inviable strains on the same plate.

Inviable strains were distinguished from viable strains because they showed a smeary appearance and, at most, sparse single colonies. In contrast, viable strains grew as one more or less dense patch. (Supplementary Fig. 1 and Supplementary Table 1). Sparse single colonies that came up on 5-FOA plates (Supplementary Fig. 1) were shown to bear a plasmid of the wrong size or sequence by colony PCR (Supplementary Table 8), usually reflecting recombination between pRS416-Abf1 and pRS315-[abf1 construct name] so that at least parts of wild-type Abf1 IDR2 were gained that supported viability. Three independent single colonies of strains with constructs that scored as inviable on 5-FOA plates were stored as glycerol cultures from the patches on YNB − uracil − leucine medium.

Constructs viable on 5-FOA plates were tested for uracil auxotrophy on YNB − uracil medium and for bearing the correct plasmid by colony PCR using the primers given in Supplementary Table 8. The colony PCR product was often further validated by Sanger sequencing using the primers given in Supplementary Table 5. For each construct, three validated strains from independent single colonies that were viable on 5-FOA and inviable on −uracil medium were stored as glycerol culture.

Yeast cells grown from these three glycerol cultures of independent clones for each viable Abf1 construct were scored for growth rate by spotting serial dilutions in water (1:10, 1:100, 1:1,000 and 1:10,000) of overnight cultures adjusted with yeast peptone dextrose adenine (YPDA) media to an optical density measured at a wavelength of 600 nm (OD600) of 0.5 (Genesys 20 photometer; Thermo Fisher Scientific) on YPDA plates, then incubated at 30 °C for 1–3 days (Supplementary Fig. 2). Growth rates were scored visually into four broad categories (+1 to +4) compared with the rate of two independent clones of strain 7-A2 bearing plasmid pRS315-Abf1 spotted on the same plates and defined to have a growth rate score of +4 (Supplementary Fig. 2 and Supplementary Tables 1 and 7).

ChIP and quantitative real-time PCR

ChIP assays were performed for all inviable constructs bearing a Flag-tag and some viable constructs, as indicated by the respective ChIP values in Supplementary Table 1. Corresponding yeast strains were grown to logarithmic phase in YNB − uracil − leucine selection medium (inviable constructs or viable constructs together with the plasmid pRS416-Abf1) or in YNB − leucine selection medium (viable constructs) at 30 °C. One hundred OD600 units (Zeiss PMQ II photometer) were crosslinked for 15 min at room temperature with a 1% final concentration of formaldehyde while shaking slowly, and crosslinking was quenched by adding a 250 mM final concentration of glycine for 10 min at room temperature while shaking slowly. Cells were harvested by centrifugation for 5 min at 4,000 r.p.m. (Heraeus Cryofuge 6000i) and 4 °C. Cell pellets were washed with ice-cold ST buffer (10 mM Tris-HCl (pH 7.5) and 150 mM NaCl) and washed twice in FA lysis buffer (50 mM HEPES-KOH (pH 7.5), 150 mM NaCl, 1 mM ethylenediaminetetraacetic acid (EDTA), 1% Triton X-100, 0.1% sodium deoxycholate and 0.1% sodium dodecyl sulfate (SDS)). Washed pellets were flash-frozen in liquid nitrogen and stored at −80 °C. For cell lysis, pellets of 50 OD600 units were resuspended in 250 µl FA lysis buffer supplemented with protease inhibitors (1× cOmplete EDTA-free protease inhibitor cocktail (Roche) and 1 mM phenylmethylsulfonyl fluoride). Zirconia beads (350 µl 0.5 mm glass beads; 11079105; BioSpec Products) were added and cells were lysed in a bead beater (Precellys 24; Bertin Technologies). Lysis efficiency was controlled by comparing OD600 values before and after bead beating and was 75–90%. Beads were removed by centrifugation, cell lysates were adjusted to 1.2 ml with FA lysis buffer, and chromatin was sheared by sonication at 4.5 °C (30 cycles; 30 s on and 30 s off; high intensity; Bioruptor, Diagenode). Fragment lengths were usually around 1 kilobase as assessed in a Bioanalyzer after DNA purification (high-sensitivity DNA chip). If this fragmentation range was not met, the sample preparation was repeated, starting from a new culture. The chromatin solution was pre-cleared by incubation with 25 µl protein G beads for 1–2 h at 4 °C with soft rolling. For immunoprecipitation, 500 µl pre-cleared chromatin solution was incubated with or without (no antibody control) 2.5 µl anti-Flag M2 antibody (F3165; Sigma–Aldrich) overnight at 4 °C with soft rolling and for four further hours with 25 µl pre-blocked (10% bovine serum albumin) protein G beads. Beads were washed twice with high-salt FA lysis buffer (0.50 instead of 0.15 M NaCl), twice with ChIP Wash buffer (10 mM Tris-HCl (pH 8.0), 250 mM LiCl, 0.5% NP-40, 0.5% sodium deoxycholate and 1 mM EDTA) and once with TE buffer (10 mM Tris-HCl (pH 8.0) and 1 mM EDTA). Elution was done twice for 15 min at 65 °C while shaking in 100 µl ChIP Elution buffer (50 mM Tris-HCl (pH 7.5), 10 mM EDTA and 1% SDS). Eluates were pooled, and crosslinking was reversed by incubation overnight at 65 °C while shaking. 50 µl chromatin sample for input DNA was taken off after sonication, adjusted to 200 µl with ChIP Elution buffer and treated in parallel for DNA purification. Immunoprecipitated or input chromatin was treated with RNaseA (0.5 mg ml−1) for 30 min at 37 °C, then with proteinase K (0.5 mg ml−1) for 2 h at 37 °C while shaking. DNA was purified with AMPure beads (1.8 vol), according to the manufacturer’s instructions.

Input DNA (1 µl; 1:100 dilution in water), immunoprecipitated DNA (1 µl) and mock-precipitated (no antibody) DNA (1 µl) were analysed in triplicate by quantitative real-time PCR (LightCycler 480 II; Roche) in a 10 µl total volume using the primers given in Supplementary Table 6 (at a final concentration of 3 µM) and 1× Fast SYBR Green Master Mix (Applied Biosystems). ChIP values were calculated from quantitative real-time PCR cycle threshold values as an average of PCR triplicate averages for each of the three amplicons with the Abf1 site divided by the average of PCR triplicate averages for each of the three amplicons without the Abf1 site.

For unknown reasons, some replicates of inviable constructs failed to score in the ChIP assay (that is, they had a ChIP value close to 1; Supplementary Table 1). Nonetheless, if at least one replicate was positive (with a ChIP value of > 10), this confirmed that the respective construct was expressed and imported into the nucleus and could bind specifically to Abf1 sites. ChIP values for inviable versus viable constructs were, on average, lower (Extended Data Fig. 2a). This was probably due to the competition between the untagged wild-type Abf1 and the Flag-tagged inviable construct for binding to Abf1 sites. This competition was unavoidable for inviable constructs, but usually not the case for viable constructs, as these were tested in the absence of the untagged wild-type Abf1. To confirm this, we tested two viable constructs in the presence of untagged wild-type Abf1 to mimic the competition between wild-type Abf1 and the Flag-tagged construct (marked in Extended Data Fig. 2a) and found that they indeed showed ChIP values that matched the expected values from the inviable constructs (P = 0.24; Mann–Whitney U-test) in the range of the viable constructs.

Abf1 anchor-away strains and growth conditions for RNA- and ODM-seq analyses

The Abf1 anchor-away strain ABF1-aa-V5 clone A9 was obtained from W. de Jonge and F. Holstege42. The strain was transformed with the same pRS315-based plasmids encoding wild-type Abf1, select Abf1 constructs or the empty pRS315 plasmid (no Abf1 control), as was used for the 5-FOA plasmid shuffle assays. At least three independent clones were isolated and stored as glycerol cultures after transformation and used for biologically independent replicate experiments. Anchor-away strains were grown to log phase (~OD600 = 0.4; Genesys 20; Thermo Fisher Scientific) in YNB − leucine media, incubated for 75 min (or 60 min for one RNA-seq replicate each for wild-type Abf1, Abf1 IDR2, FUS1−16312E and FUS1–163 + Y/M) with 8 µg ml−1 rapamycin (Santa Cruz Biotechnology) and crosslinked for 5 min (for ODM-seq) with 1% formaldehyde at room temperature. Formaldehyde crosslinking was quenched for 15 min with 250 mM glycine at room temperature.

RNA-seq

Total RNA was prepared after Abf1 anchor-away induction with an RNeasy kit (Qiagen) according to the manufacturer’s protocol. Briefly, four OD600 units (Genesys 20; Thermo Fisher Scientific) of cells were washed with phosphate-buffered saline (50 mM phosphate buffer (pH 7) and 150 mM NaCl) and flash-frozen in liquid nitrogen. The cell wall was broken up by incubation with beta-mercaptoethanol and zirconia bead beating (6 × 3 min at 30 Hz with 3 min breaks; Precellys; Bertin Technologies). The cell lysate was cleared by centrifugation, mixed with ethanol and bound to a column provided in the kit. We also employed the optional DNase I digestion on the column. RNA was washed on the resin, eluted and digested using DNase I (Qiagen) in solution. After ethanol precipitation, the total RNA preparation was converted into sequencing libraries with the NEBNext Ultra II directional RNA library prep kit for Illumina.

Cloning, expression, purification and characterization of M.SssI DNA methyltransferase

DNA methyltransferase cloning

The DNA sequence coding for the M.SssI methyltransferase, where the TGA stop codons coding for tryptophan in the original organism were replaced by TGG codons, was PCR amplified from the plasmid pCMV-FLAG-4azf-M.SssI (a gift from K. Ford at Kings College London) and cloned into the NcoI and SalI sites of the pBAD24 plasmid (87399; American Type Culture Collection) to achieve tight transcriptional control by the E. coli arabinose-regulated araBAD promoter. An in-frame C-terminal 6xHis-tag was introduced via the 3′ primer during the PCR. The resulting expression plasmid is called pBAD24-M.SssI-His6 and was transformed (selection on plates containing Luria–Bertani growth medium supplemented with ampicillin (LB amp) plus 0.2% glucose) into E. coli One Shot TOP10 (Thermo Fisher Scientific), yielding the expression strain for M.SssI expression.

A single colony streaked out on LB amp plates containing 0.2% glucose was used to inoculate a 100 ml overnight culture in LB amp with 0.2% glucose at 37 °C and 130 r.p.m. Six 1 l volumes of LB amp media (without glucose) were inoculated with 10 ml each of the overnight culture and incubated at 37 °C and 130 r.p.m. until the OD600 reached 0.5 (Genesys 20; Thermo Fisher Scientific). Expression was induced by the addition of 0.0002% arabinose for 4 h at 37 °C and 130 r.p.m. Cells were harvested by centrifugation (4,000 r.p.m.; 30 min; 4 °C; Avanti JXN-26; JLA-8.1000 rotor; Beckman Coulter), resuspended and combined in 75 ml lysis buffer (50 mM Tris-HCl (pH 8) and 300 mM NaCl). Aliquots of 35 ml were snap-frozen in liquid nitrogen and stored at −70 °C.

DNA methyltransferase purification

Purification started with one frozen cell pellet aliquot that was thawed at room temperature, topped up to 50 ml with lysis buffer and supplemented with 1 mg ml−1 chicken egg-white lysozyme (28260; SERVA) and protease inhibitors (0.7 µg ml−1 pepstatin A, 1 µg ml−1 leupeptin, 1 µg ml−1 aprotinin and 0.2 mM phenylmethylsulfonyl fluoride), incubated on ice for 30 min and then sonicated (six cycles of 10 s bursts and 10 s breaks at 50% peak power; Branson 250D sonifier). Lysed cells were centrifuged for 1 h at 4 °C and 20,000g (Optima XPN-80; Ti-45 rotor; Beckman Coulter). During centrifugation, 5 ml slurry of Ni-NTA agarose (745400.25; Macherey-Nagel) was washed twice with 20 ml lysis buffer (1 min; 1,000 r.p.m.; 4 °C; Eppendorf 5810R centrifugation for collecting the agarose) and resuspended in 30 ml lysis buffer and then distributed into three 10 ml aliquots. Next, Ni-NTA beads were collected by centrifugation and the supernatant discarded. After ultracentrifugation, the lysed cells’ supernatant was evenly distributed onto the pre-washed Ni-NTA beads and incubated for 1.5 h at 4 °C on a rotating wheel. Beads were collected by centrifugation (2 min; 1,000 r.p.m.; 4 °C; Eppendorf 5810R), the supernatant was discarded and the beads were resuspended in 20 ml cold lysis buffer with protease inhibitors then transferred to a gravity flow column (Econo-Pac; 7321010; Bio-Rad) at room temperature. The column was then washed by gravity flow with 100 ml cold lysis buffer, 20 ml cold wash buffer (50 mM Tris-HCl (pH 8), 250 mM NaCl and 20 mM imidazole-HCl (pH 6.6)). After the addition of 2 ml cold elution buffer (50 mM Tris-HCl (pH 8), 250 mM NaCl and 300 mM imidazole-HCl (pH 6.6)), the column was closed, lightly vortexed and incubated on ice for 10 min with light vortexing every 2 min. One 2 ml fraction was collected on ice by gravity flow. Additionally, three 1 ml fractions were collected after the addition of 3 ml elution buffer. Fractions of the purification steps were analysed by SDS polyacrylamide gel electrophoresis (SDS-PAGE) (Supplementary Fig. 10a and uncropped Supplementary Fig. 13a).

Fractions from the column enriched for the 42 kDa M.SssI methyltransferase were pooled and directly loaded in the cold room onto a HiLoad 16/600 Superdex 200 pg exclusion column (Cytiva) equilibrated with size exclusion buffer (20 mM Tris-HCl (pH 8), 500 mM NaCl, 0.1 mM EDTA and 0.1 mM dithiothreitol (DTT)). Elution was with size exclusion buffer at a 0.5 ml min−1 flow rate, and 2 ml fractions were collected in the cold. Fractions were analysed by SDS-PAGE (Supplementary Fig. 10b and uncropped Supplementary Fig. 13b) and fractions were enriched for M.SssI combined and dialysed against dialysis buffer (10 mM Tris-HCl (pH 8), 0.2 mM EDTA, 1 mM DTT and 50% glycerol), which also concentrated the protein solution. The dialysed and concentrated M.SssI solution was analysed again by SDS-PAGE (Supplementary Fig. 10c and uncropped Supplementary Fig. 13c) and the methyltransferase activity was quantified by restriction enzyme activity assay. Aliquots corresponding to ~440 units were snap-frozen in liquid nitrogen and stored at −80 °C.

Quantification of CpG-specific DNA methyltransferase activity

Plasmid pUC19-ftz (500 ng; containing a fragment of the Drosophila ftz locus from sequence TAGTTTCCTAATGAT to GCCGAAGATGATGCT cloned into the pUC19 multiple cloning site) was incubated in a 122 µl final volume with increasing amounts of purified M.SssI solution or commercially acquired M.SssI (M0226; New England Biolabs (NEB)) in NEBuffer 2 (NEB) supplemented with 160 µM S-adenosyl-L-methionine for 1 h at 25 °C. The reaction was stopped by the addition of a one-tenth volume of 10× stop buffer (50 mM Tris-HCl (pH 7.5), 4% SDS and 100 mM EDTA) and incubated with 5% proteinase K. After incubation for 1 h overnight at 37 °C, a one-fifth volume of 5 M NaClO4 was added to the solution, followed by an equal volume of phenol (77607; Sigma–Aldrich), vortexing, an equal volume of chloroform:isoamylalcohol (24:1) and then more vortexing. After this, the phases were separated by centrifugation (Hettich Microliter). The DNA was recovered from the aqueous phase by ethanol precipitation and resuspended in 37 µl rCutSmart Buffer (NEB). The DNA was incubated with 30 units each of BamHI-HF and PvuI-HF (NEB) for 1 h at 37 °C. The complete reaction was loaded onto a 1% agarose gel in 1× Tris-acetate-EDTA (TAE) buffer and electrophoresed for 1.5 h at 100 V. The DNA in the gel was stained with 5 µl Midori Green solution (Nippon Genetics Europe) per 100 ml gel for a few minutes and directly imaged (ChemiDoc; Bio-Rad) (Supplementary Fig. 11a and uncropped Supplementary Fig. 14a,b). BamHI linearized the 5,769 bp pUC-ftz plasmid regardless of CpG methylation. PvuI cut in a CpG-methylation-sensitive way and yielded 2,544-, 2,329- and 896-bp fragments in combination with the BamHI cut if cleavage was complete, but only yielded the 5,769-bp linearized plasmid band if complete CpG methylation blocked PvuI digestion. Comparison of the degree of blocked PvuI digestion between reactions with purified and commercial M.SssI (NEB) allowed us to quantify the specific activity of the purified M.SssI according to the unit definition by NEB.

The M.SssI preparation was checked for contaminating nucleases by incubation with 500 ng supercoiled pUC19-ftz plasmid in either rCutSmart or NEBuffer 2 (NEB) for 2 h or overnight at 25 °C. The reaction was electrophoresed in 1% agarose 1× TAE gel, stained and imaged. The persistence of the supercoiled form demonstrated the lack of contaminating nucleases (Supplementary Fig. 11b and uncropped Supplementary Fig. 14c).

Absolute nucleosome occupancy mapping by ODM-seq

Chromatin was prepared from formaldehyde-crosslinked yeast cultures as described71. Chromatin corresponding to a 0.1 g wet cell pellet was washed in cold Kladde buffer (20 mM HEPES-NaOH (pH 7.5), 70 mM NaCl, 0.25 mM EDTA (pH 8.0), 0.5 mM EGTA (pH 8.0) and 0.5% (vol/vol) glycerol), resuspended in 870 µl Kladde buffer and incubated with 4,000 U M.SssI methyltransferase at 25 °C. After 90 min, 4,000 U M.SssI were added, together with the plasmid pUC19-ftz, which was included as a spike-in control. After 180 min, one 430 µl aliquot was removed, and plasmid pFMP233 (ref. 119) was added to the remaining aliquot as a second spike-in control. The reaction was further incubated for 60 min (240 min total) at 25 °C. A similar degree of CpG methylation after 180 versus 240 min (Extended Data Fig. 6a), in combination with ongoing CpG methylation as probed by methylation of the spike-in plasmids (Extended Data Fig. 6b), indicated saturated CpG methylation. After DNA methylation, the reaction was stopped by adding a one-tenth volume of 10× stop buffer. The resulting DNA was purified by proteinase K digestion, phenolization and ethanol precipitation and then resuspended in TE buffer. Nanopore sequencing libraries were prepared and sequenced on a PromethION device by the Laboratory for Functional Genome Analysis at the Gene Center at LMU Munich.

Cloning, expression and purification of wild-type Abf1 and the constructs FUS1 −16312E and ΔIDR1/2

Expression plasmids for wild-type Abf1 and Abf1 variant constructs were constructed in pPROEX as described by Krietenstein et al.67, but starting from the respective pRS315 plasmids with the respective Abf1 constructs and using the primers for backbone and insert PCR, respectively, which were combined by Gibson cloning using the following sequences: pPROEX_V_rev (AATTTGTCCATggGATCCATGG); Abf1_I_for (GATCCCATGGACAAATTAGTCGTGAAT); Abf1_I_rev (ACCGCATGCCTCGAGCTActtatcgtcatcgtctttg); and pPROEX_V_for (CTCGAGGCATGCGGTACC).

Wild-type Abf1 and variant proteins were expressed using the BL21 Star (DE3) pLysS strain (Thermo Fisher Scientific). LB amp media (100 ml) was inoculated with a single colony and incubated overnight at 130 r.p.m. and 37 °C. Then, 20 ml of this preculture was used to inoculate 1 l LB amp and incubated at 130 r.p.m. and 37 °C until the OD600 was 0.5–0.6 (Genesys 20; Themo Fisher Scientific). Isopropyl β-D-1-thiogalactopyranoside was added to a final concentration of 1 mM and incubated for 4 h at 37 °C and 130 r.p.m. Cells were collected by centrifugation (4,000 r.p.m.; 20 min; 4 °C; Avanti JXN-26; JLA-8.1000 rotor; Beckman Coulter), combined in a 50 ml Greiner tube in 30 ml lysis buffer with protease inhibitors and centrifuged (1 min; 1,000 r.p.m.; 4 °C; Eppendorf 5810R). The pellet was flash-frozen and stored at −70 °C.

For purification in the cold room, the cell pellet was thawed on ice, resuspended in 40 ml lysis buffer with 1 mg ml−1 chicken egg-white lysozyme and incubated for 30 min on ice. Cell lysis was performed by sonication on ice (Branson 250D sonifier; six cycles of 10 s bursts and 10 s breaks at 50% peak power). The cell lysate was centrifuged at 45,000 r.p.m. for 1 h at 4 °C (Optima XPN-80; Ti-45 rotor; Beckman Coulter) and the supernatant was collected and transferred to a 50 ml Greiner tube, where it was mixed with pre-washed Ni-NTA resin (see above) equivalent to 2 ml slurry. This mixture was incubated for 1.5 h at 4 °C on a rotating wheel, then washed with 100 ml lysis buffer then 20 ml wash buffer and eluted with 5–10 ml elution buffer. The eluate was dialysed overnight at 4 °C against 1–2 l dialysis buffer II (20 mM HEPES-NaOH (pH 7.5), 70 mM NaCl, 0.1% Tween, 40% glycerol and 1 mM DTT) and concentrated if necessary (Amicon Ultra-4 Ultracells 10 or 30 K; Millipore). The dialysed and maybe concentrated sample, for the wild-type Abf1 and FUS116312E variants, was loaded onto a 1 ml HiTrap Q FF anion exchange chromatography column (Cytiva) equilibrated in low-salt buffer (50 mM Tris-HCl (pH 8.0), 80 mM NaCl, 10% glycerol, 1 mM DTT and 1 mM EDTA) and washed to baseline with low-salt buffer. Elution was by linear gradient from 0–100% high-salt buffer (50 mM Tris-HCl (pH 8.0), 350 mM NaCl, 10% glycerol, 1 mM DTT and 1 mM EDTA) over 20 column volumes. The dialysed and maybe concentrated ΔIDR1/2 construct was loaded onto a 1 ml HiTrap Heparin HP column (Cytiva) equilibrated with heparin buffer A (25 mM Tris-HCl (pH 7.6), 10% glycerol, 1 mM DTT, 50 mM NaCl and 1 mM EDTA), washed to baseline with heparin buffer A and eluted by linear gradient from 0–100% of heparin buffer B (25 mM Tris-HCl (pH 7.6), 10% glycerol, 1 mM DTT, 1 M NaCl and 1 mM EDTA) over 30 column volumes.

The purified proteins (Supplementary Fig. 12 and uncropped Supplementary Fig. 15) were dialysed overnight against storage buffer (20 mM HEPES-NaOH (pH 7.5), 350 mM NaCl, 0.1% Tween, 40% glycerol and 1 mM DTT), aliquoted, flash-frozen in liquid nitrogen and stored at −80 °C.

Genome-wide chromatin reconstitution and remodelling assay

Genome-wide chromatin reconstitution, remodelling assay, MNase-seq assay and Illumina sequencing were done as described by Chacin et al.93 for Fig. 6b and Extended Data Fig. 5a,c, and as described by Oberbeckmann et al.89 (but without ammonium sulfate in the final buffer) for Extended Data Fig. 5b, with the exception that: (1) Illumina sequencing was in paired-end mode; and (2) for the samples shown in Fig. 6b, recombinant human octamers were used, reconstituted from H2A/H2B dimers and H3/H4 tetramers according to Kunert et al.120. Recombinant human histones were obtained from the Histone Source (Colorado State University). The concentration of INO80 was 10 nM unless stated differently in Extended Data Fig. 5b. To verify the robustness of this assay, we repeated the experiment (as seen in Fig. 6b) using Drosophila embryo histone octamers (Extended Data Fig. 5c), yielding analogous results. Wild-type and variant Abf1 proteins were used at 30 nM in Fig. 6b and Extended Data Fig. 5b and 20 nM in Extended Data Fig. 5c. We showed that INO80 could not generate nucleosomal arrays phased to Abf1 sites in the absence of Abf1 (Extended Data Fig. 5a).

Data processing for RNA-seq

Raw sequence reads were preprocessed with a Snakemake pipeline that involved several steps, including quality control, alignment of reads to the S. cerevisiae sacCer3 (R64-1-1 build) genome using the STAR aligner121 and quantification of gene expression via RSEM122. Rule-based directives in Snakemake made it easier to perform key preprocessing steps, ensuring the effective management of computational resources and data dependencies. The processed data were imported into RStudio, where a summarized experiment object was created to encapsulate gene expression metrics alongside sample metadata. This structured data object made robust data manipulation and subset selection in R easier. Before creating plots, the transcripts per million data were log2 transformed to normalize expression levels, improving the statistical reliability of comparisons. The ggplot2 (refs. 123,124) package was utilized to generate boxplots that illustrate the differences in gene expression between wild-type Abf1 and the Abf1 variants (Extended Data Fig. 5f).

Data processing for occupancy measurement via DNA methylation and high-throughput sequencing (ODM-seq)

Both natural and modified bases were called for Oxford Nanopore sequencing reads, and reads were filtered and demultiplexed with Dorado (version 1.0.0+4a76f8aa; Oxford Nanopore Technologies (ONT) Basecalling Software). The resulting reads were mapped to the S. cerevisiae genome (sacCer3 build R64-2-1) with minimap2 (version 2.17), sorted using samtools sort (SAMtools version 1.9) and indexed with samtools index (SAMtools version 1.15). Methylation information was added to the BAM files using modkit repair (version 0.1.13; ONT).

For computational analysis in R (R versions 4.2.0 and 4.3.1 and tidyverse version 2.0.0), the methylation probabilities were extracted from bam files using modkit pileup (version 0.1.13; ONT). The modkit pileup output corresponds to the averaged methylation probabilities over all reads per genomic coordinate per sample. The genome average methylation was calculated for each sample by averaging the modkit pileup output for 5mC modifications over all coordinates (Extended Data Fig. 10c). As slight batch effects were visible on this level of genome average methylation, we introduced a correction factor. For each Abf1 variant, the genome average methylation was determined by averaging across the four genome average methylation values for each individual replicate (two biological replicates (numbers 1 and 2) with two technical replicates each (180 and 240 min incubation with M.SssI as DNA methylation reached saturation; Extended Data Figs. 6a and 10c)). The resulting sample averages were divided by the analogous average of the four wild-type Abf1 replicates. This quotient was used as a correction factor for each individual replicate.

+1-aligned composite plots

The corrected modkit pileup outputs were plotted in composite plots (ggplot2 version 3.4.2) relative to alignment points. Genomic coordinates were converted to new coordinates relative to the alignment points (data.table versions 1.14.8 and 1.17.8; foverlaps function), which were set to zero in each case. Signs of relative coordinates were flipped if the gene corresponding to the alignment point was on the negative strand. Averages of all four replicates per relative coordinate were calculated. Plots were smoothed by plotting the rolling mean over a 50 bp window at the central base pair of each window (data.table; frollmean(., n = 50, align = “center”)). The in vivo +1 nucleosome dyad annotations for this were taken from Mahendrawada et al.74.

PCA

PCA was conducted using the corrected modkit pileup outputs, converted to relative coordinates with respect to the +1 nucleosome positions (see the section ‘+1-aligned plots’). PCA (stats 4.3.1; prcomp(., scale. = F))) compared average occupancies per sample (first averaged over the biological replicates and then over the technical replicates) or per individual replicate in respective regions: downstream flank of +1 nucleosome (0 to 50 bp relative to the in vivo +1 nucleosome dyad); upstream flank of +1 nucleosome (−80 to 0 bp relative to the in vivo +1 nucleosome dyad); and NFR (−145 to −95 bp relative to the in vivo +1 nucleosome dyad). Each +1 nucleosome annotation for each sample or replicate contributed individually to the PCA. In very few cases, the same +1 nucleosome dyad annotation belonged to two different upstream or downstream genes.

Venn diagrams of binding sites

First, we compiled a comprehensive list of all sites, annotated either by PWMs or some in vivo mapping technique. PWMs for the prediction of Abf1 binding sites solely from DNA sequence (FIMO (MEME version 5.5.7)) were taken from JASPAR (https://jaspar.elixir.no)125 and Badis et al.35. Previously annotated coordinates according to PWM predictions by others were according to Kubik et al.69, MacIsaac et al.126 and Pachkov et al.127. The previously annotated in vivo-mapped Abf1 binding sites were either according to Mahendrawada et al.74 (and kindly shared by L. Mahendrawada and S. Hahn) or published by Zentner et al.128, Badjatia et al.129 and Gutin et al.130. If binding sites were annotated as regions, each binding site was mapped to the single genomic coordinate at its central position. If binding sites were annotated for replicates, we averaged over the replicates and used the central position of the average.

For each Venn diagram, the compared subset of Abf1 binding sites selected from the comprehensive list was turned into a reference set of sites. For each site, the distance to the closest site with a lower genomic coordinate was measured. If this distance was <20 bp, the site with the higher genomic coordinate was discarded. From this reference set of sites, pseudo-windows of 1 bp length were constructed.

The Abf1 binding sites selected from the comprehensive list for comparison in each Venn diagram were extended by 75 bp to either side, and overlap with the previously constructed reference set of these sites was determined using foverlaps (data.table 1.17.8). Venn diagrams were constructed from the reference set IDs in each of the displayed categories using ggVennDiagram (of the ggVennDiagram Package) in R131.

Data processing for MNase-seq

Illumina sequencing data were mapped to the S. cerevisiae sacCer3 (R64-1-1 build) genome using Bowtie 2 (ref. 132), excluding reads with multiple matches. The BAM files were converted to BED files and then to RDS files of ranges (start to end), which were imported into RStudio. For comparing samples with different total read numbers, we applied subsampling. To mainly include DNA fragments generated by MNase digestion of canonical nucleosome core particles, we selected only 120- to 170-bp reads. To visualize the MNase-seq patterns, we plotted 50-bp extended dyads. Regions of >200 bp contiguous nulls (no coverage) were omitted among all compared samples for a particular graph. Signs of relative coordinates were flipped if the gene corresponding to the alignment point (the Abf1 site) was on the negative strand.

Proteome-wide bioinformatics analysis of disordered regions

Motivated by previous work, the set of aligned syntenic genes from 20 different yeast species was obtained from the Yeast Gene Order Browser, as described previously5,8,22,25,133. This dataset contains 5,430 protein sequences. Orthologous sets of sequences were then aligned using Clustal Omega to generate alignments for every S. cerevisiae protein134. Per-residue sequence conservation was calculated using normalized Jensen–Shannon divergence based on the BLOSUM62 matrix, as was first introduced by Capra and Singh135,136,137. This analysis yields a per-residue conservation score, with values ranging from 0.019–0.948 across the yeast proteome (Fig. 1).

Disorder scores and predicted predicted local distance difference test (pLDDT) scores for analyses were calculated using metapredict (V1), where defined IDRs were identified using the IDR delineation mode26. The predicted pLDDT value reflects a prediction of the per-residue pLDDT score—a metric introduced by DeepMind in the context of structural evaluation for AlphaFold2. Metapredict implements a deep learning model to predict the pLDDT score from the sequence alone.

Almost all disorder predictions in this work used metapredict V1, instead of the more accurate metapredict V2 or V3. This was a deliberate decision driven by the underlying training data used for metapredict V2 and V3. Metapredict V1 was trained in a manner that avoids convolving evolutionary information inferred from AlphaFold, whereas V2 and V3 combine AlphaFold-derived information with a set of additional predictions, meaning evolutionary information is (indirectly) encoded, given that AlphaFold’s ability to confidently predict structure is in part determined by the availability of large sets of alignable homologous sequences. Using the V1 predictions allows us to legitimately compare disorder and conservation without inadvertently capturing trends encoded in the underlying bidirectional recurrent neural networks. The one exception to this is examining IDRs that are conserved in terms of chemical specificity but not sequence specificity (Fig. 5). For this analysis, we used metapredict V3 first to identify IDRs across the yeast proteome, and then followed up using these IDRs. We made this decision given the accuracy metapredict V3 has for correctly identifying large contiguous IDRs. At the time of writing, metapredict V3 was not published but was publicly available in a branch of metapredict.

pLDDT-defined structured domains are set using a predicted pLDDT threshold of 65 or greater; gaps between contiguous stretches of predicted pLDDT values over 65 that are three residues or smaller are filled in. A minimum region size of five residues is set for convenience.

Sequence analysis

All sequence analyses were performed using SHEPHARD (https://shephard.readthedocs.io/). The complete set of sequences, conservation scores, disorder scores, ppLDDT scores and IDR domain boundaries are provided as SHEPHARD-compatible data files (structured, tab-separated data files). IDR sequence analyses (compositional analysis, hydrophobicity, predicted disorder and predicted pLDDT scores) were performed using localCIDER, SPARROW (https://github.com/idptools/sparrow/) and metapredict26,138,139,140.

Sequence design

IDR designs were generated using GOOSE, a Python package for the rational design of IDRs and IDR variants141. Documentation for GOOSE is available at https://goose.readthedocs.io/ and the open-source code is available at https://github.com/idptools/goose.

Sequence designs constrained residue patterning parameters to retain wild type-like residue patterning (Extended Data Fig. 3d–f). Patterning here is quantified in terms of binary patterning using a kappa-like definition for patterning for chemically distinct residue groups, where κ is a patterning parameter that measures the degree to which sets of residues intermix23,138,142. A high κ value means that residues in groups are highly segregated, whereas a low value implies an even intermixing of residues. In the original definition of κ, three residue groups were considered (positive, negative and neutral), where κ quantifies the extent of intermixing between positive and negative residues. κ can also be defined as a binary patterning parameter16,143, where it measures the extent to which residues in one group segregate from all other residues. The other patterning parameters in Extended Data Fig. 3d–f reflect the binary patterning of groups of amino acids noted below each datapoint versus all other residues. For example, GSNHQT reflects the degree of segregation of residues {GSNHQT} from all other residues. While not quantified here, all shuffle and distribution designs ensured that residues were evenly distributed across the context sequence to minimize the possibility of accidentally generating an additional ex nihilo motif.

Identification of conserved sub-sequences in the yeast proteome

Conserved sub-sequences were identified via the following procedure. First, hydrophobic (I, L, V, M, Y, F and W) residues in an IDR with a conservation score above 0.65 or higher were identified. These residues are in the 85th (or higher) percentile for conservation across all disordered regions (dashed lines; Extended Data Fig. 1b–d). For each conserved hydrophobic residue (initiator methionine residues are excluded) in an IDR, we identified the residues ±6 of that central conserved IDR (truncating at the N and C termini as necessary). We then calculated the median disorder for this subregion of (up to) 13 amino acids. Finally, having done this for an entire protein, we extracted contiguous subregions where the median disorder score was above some threshold. A lower disorder threshold finds more regions but runs the risk of identifying residues at the boundaries between disordered and folded domains, which may, in fact, be folded. A higher disorder threshold identifies fewer regions but provides higher confidence that the identified regions are bona fide IDR. With the most permissive disorder threshold, we identified 5,173 conserved subregions. With the least permissive disorder threshold, we identified 884 subregions. All sub-sequences, including their position and parent protein, are provided in Supplementary Tables 711.

Identification of sub-sequences with predicted transient structure

Using a deep learning model trained on ~360,000 protein structures from AlphaFold2, we used the metapredict package to predict per-residue predicted structure across the yeast proteome26. Cross-referencing those scores against predicted disordered regions and excluding sub-sequences directly adjacent to folded domains, we finally filtered for contiguous sub-sequences with a predicted pLDDT score above 40, where the sub-sequence was up to 40 residues long. The complete set of these sub-sequences, average disorder score and average conservation are provided in Supplementary Table 17.

Identification of IDRs with compositional conservation

Compositional conservation was calculated by taking a set of orthologous aligned IDRs and computing the mean Euclidean distance between fractional amino acid composition across all IDRs. Compositional conservation was only calculated for IDRs ten residues or longer in length (without alignment gaps) and for sequences where ten or more orthologous IDRs were identified. The complete set of IDRs, including median and mean sequence conservation and compositional conservation, is provided in Supplementary Table 16.

Hydrophobicity and charge scores

Hydrophobicity and charge scores (Fig. 4i) were calculated as compositional weighted metrics.

The hydrophobicity score (B) for a sequence was calculated using the following expression;

$$B=\mathop{\sum }\limits_{i}^{\beta }{f}_{i}{\beta }_{i}$$
(1)

Where, for a given protein, B is the binding score, β is the set of residues used to calculate the binding score (Y, W, F, M, V, I, L, A and P), βi is the binding score coefficient associated with residue identity i and fi is the fraction of the sequence made up of residue i.

The charge scores (C) were calculated analogously:

$$C=\mathop{\sum }\limits_{i}^{\theta }{f}_{i}{\theta }_{i}$$
(2)

Where the residues in set θ are R, K, E and D and θi is the charge score coefficient. These coefficients are listed in Supplementary Table 16. This definition is an ad hoc analysis used specifically in the context of Abf1 IDR2, and we do not necessarily expect these precise parameters to be transferable over to other systems. That said, the spirit of this analysis may well transfer to other types of IDR-mediated interactions.

AlphaFold2 model

The AlphaFold2 model was taken directly from the European Bioinformatics Institute AlphaFold2 protein structure prediction database (https://alphafold.ebi.ac.uk/entry/P14164)144,145.

Gene Ontology analysis

The Gene Ontology analysis described in Extended Data Fig. 1g, and for proteins with IDRs that are conserved in terms of chemical specificity but not sequence, was performed using PANTHER146.

Structure predictions

Helical structural predictions were generated using AlphaFold2 (Fig. 3)145.

Coarse-grained simulations

Coarse-grained simulations were performed using the PIMMS simulation engine (https://github.com/holehouse-lab/pimms) to generate the 2D binding landscapes. A 50-bead flexible polymer was defined in which the central six residues were designated as a motif, whereas the flanking residues on either side were defined as the context. A second 50-bead polymer that collapsed due to attractive intramolecular interactions defined a binding partner. Motif:globule (M) and context:globule (C) interactions were systematically titrated to generate a 2D sweep of parameter space (Extended Data Fig. 4a–d). The fraction bound was calculated for each simulation run at a unique combination of M and C interaction strengths. Binding here was defined as the two molecules in direct interaction and did not distinguish between motif- and context-based binding.

Monte Carlo simulations were run on a cubic lattice of 30 × 30 × 30 sites. The system was evolved via a combination of chain repetition and rigid-body translation moves. Each simulation was run for a total of approximately 30 × 106 individual Monte Carlo moves.

In silico evolution

In silico evolution (Fig. 6) was performed by iteratively acquiring a set of mutations and then evaluating the fitness of the sequence with respect to some selection function. In our case, mutations were introduced as single-point mutations only (no insertions or deletions) and fitness was evaluated in terms of attractive interactions with a specific binding partner.

Chemical specificity was calculated using FINCHES with the Mpipi-GG forcefield64,102,147. In silico evolution used a fitness function whereby the summed attractive interactions across the predicted intermolecular interaction map were computed and variants survived if the summed attractive interactions remained approximately equal to or more attractive than the wild-type-associated value. Mutations were introduced by sequentially making a series of mutations before evaluating fitness, enabling compensatory mutations to appear.

Mutations were introduced through an iterative mutational strategy. First, the amino acid sequence was converted into a nucleotide sequence. Next, the underlying nucleotide sequence was stochastically mutated according to the expected transition or transversion rate for nucleobases in coding sequences. This was done by randomly selecting a nucleotide and having it undergo a transition or transversion event. We typically performed 30–40 events per nucleotide sequence to mutagenize the underlying nucleic acid sequence to a consistent number of amino acid changes. Having mutagenized the underlying nucleic acid sequence, this sequence was converted back into protein space. If a premature stop codon was introduced, the sequence was discarded; otherwise, that sequence was saved as one of a set of possible variants. This procedure was repeated many times to generate an initial library of variants. Finally, each variant was assessed with respect to the underlying fitness function and only those variants that were sufficiently fit were kept. For null models (evolution-absent selection), the same procedure was carried out, except we did not apply a fitness selection filter at the end. The acquisition of all mutations before evaluation by fitness function facilitated the gain of compensatory mutations. It also more accurately mirrors the logistical steps taken for the in vitro error-prone PCR experiment.

The described protocol for in silico evolution was used to create two libraries of 350 sequences; in one library (the real library), every sequence was evaluated and kept based on the fitness function, whereas in the other (the null library), sequences were randomly mutagenized without constraints on which sequences were kept or discarded. We ensured that the numbers of mutations observed in sequences in the real library and null library were always matched and were typically around 28–31 mutations per sequence (that is, affecting 12–15% of the residues). However, it is worth noting that the results did not change if we increased or decreased the number of mutations.

To perform statistical comparisons, we performed a bootstrap-like analysis to investigate per-residue conservation for sequences generated for our two libraries (Fig. 5c). This involved randomly sampling 20 sequences (without replacement) from either the real library or the null library, and for those 20 sequences, calculating the per-residue conservation score. This was then repeated ten times to build a distribution of per-residue conservation scores expected for sequences generated under selection versus sequences generated from the null model. For each residue position, the mean and standard deviation for those distributions are shown in Fig. 5c.

We initially compared the conservation of IDR2 versus the N-terminal IDR of Rad7 as our hypothetical target protein. To establish that the apparent lack of conservation observed with alignment-based methods is a general phenomenon rather than a fluke, we repeated this analysis using eight additional IDRs from proteins with which Abf1 has been proposed to interact. Putative interactors were taken as physical interactors from BioGRID50. However, we note that in many cases the interactions were identified in the presence of DNA, such that we cannot distinguish between a scenario in which Abf1 interacts directly with the putative partner and one in which both proteins interact with the same DNA molecule. Rad16 and Rad7 are notable exceptions where protein-only complexes have been identified47,65. Regardless, in silico evolution for the eight additional potential partners leads us to the same conclusion (Extended Data Fig. 4h).

For comparing in silico evolution versus the results from error-prone PCR experiments (Fig. 5e), a similar strategy was taken to that described above, except we matched the number of mutations to the average number of mutations seen experimentally (~28) and subsampled groups of 14 sequences to mirror the number of viable sequences identified experimentally.

As a final note, our in silico evolution was far from a faithful reproduction of the putative dynamics that shape mutations in protein-coding sequences. We consider this a toy system enabling us to ask a stylized question: if the only determinants of protein sequence changes were constrained in this way, could conservation be detected through multiple sequence alignments? In reality, partners are co-evolving and post-transcriptional restraints (that is, RNA secondary structure, processing and degradation) will also shape protein-coding sequences, and insertions and deletions will contribute additional noise. We consider our simplified model to be an overly stringent and deliberately simple model for selection—if signatures of conservation are not easily detectable here, the ability to do so in even noisier scenarios seems even less likely.

Conservation of chemical specificity

To assess IDRs in which chemical specificity was conserved, we first identified those IDRs with a low degree of average sequence conservation. We used an average sequence conservation value of 0.35. To assess chemical specificity, we computed average IDR interactions with a set of 36 previously identified chemically orthogonal dipeptides64. For each set of orthologous IDRs, we calculated the variance in mean-field interaction scores with each of those 36 peptides. We then identified those IDRs where, despite very low sequence conservation, the variance among one or more dipeptide interactions was in the bottom 15%. The set of proteins, IDR sequences and dipeptides they were conserved with respect to are included in Supplementary Table 9.

This analysis focused explicitly on large IDRs and is a relatively unsophisticated way to systematically identify signatures of conservation with respect to chemical specificity. This is an area of active and ongoing investigation.

Flory–Huggins phase diagrams

To generate hypothetical Flory–Huggins phase diagrams, we solved the Flory–Huggins free energy of mixing using the FIREBALL package148. This allowed us to calculate volume fraction (ϕ) and temperature phase diagrams for different chain lengths (Extended Data Fig. 9c). The reduced temperature was calculated as the binodal temperature divided by the critical temperature (Tc). To convert volume fraction into molar concentration, we operated under the assumption that

$$\varphi =c\left(\frac{{v}_{{\rm{m}}}{N}_{{\rm{a}}}}{{M}_{{\rm{m}}}}\right)$$
(3)

Where ϕ is the volume fraction, c is the mass concentration (in mg ml−1), vm is the average volume occupied by a single amino acid (140 Å2), Na is Avagadro’s number and Mm is the average mass of an amino acid (110 g mol−1). Given that three of the four terms here are known, this expression can be rewritten as

$$c=\varphi {\rho }_{0}$$
(4)

Where ρ0 is 1,310.16 mg ml−1. Converting volume fraction to mass concentration (in units of mg ml−1) involves multiplying the volume fraction by 1,310.16.

Converting from mass concentration to molar concentration involves dividing the mass concentration by the molar mass of the molecule. For the short and long synthetic repetitive FUS sequences, the molar masses were 7,208.9 and 17,422.1 g mol−1, respectively. The precise sequences are listed in Supplementary Table 1.

Data collection

Throughout the methods, data collection and analysis were not performed blind to the conditions of the experiments.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.