Fig. 1: Intradomain insertions are common in natural proteins.

a, Novel proteins can arise from the insertion of one domain into another. b, Strategy for generating a large domain insertion dataset of natural proteins. c, The number of unique domain superfamily combinations is shown for a given parent (left panel) or insert domain (right panel). Only the top five most promiscuous domains are shown. d, AlphaFold2-generated structures of example proteins from the dataset with PDZ domain insertions. Insert domains are in green and parent proteins in blue. e, Length distribution of insert domains, parent domains (without the respective insert) and parent proteins in the dataset. Parent domains are defined as the annotated InterPro domains originally carrying the insert, while parent protein refers to the full-length protein without the insert. aa, amino acids. f, Distribution of relative insertion site positions within parent domains and parent proteins. g–j, The frequency of different CATH-Gene3D domain types at the ‘class’ (g) and ‘architecture’ (h, mainly alpha; i, mainly beta; j, alpha beta) levels (according to the CATH hierarchy) within the whole CATH database is compared with the corresponding distribution of the insert and parent domains in our dataset.
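
The quantities in panels e and f can be illustrated with a minimal sketch. It assumes 1-based, inclusive residue coordinates and assumes that the relative insertion site is the fraction of the parent (with the insert removed) that precedes the insert; the caption does not state the exact formula used, so this definition, the function name `relative_insertion_position` and the example coordinates are all hypothetical.

```python
def relative_insertion_position(insert_start, insert_end, parent_start, parent_end):
    """Return the relative insertion site within a parent region.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    The parent length is computed without the insert, mirroring the
    caption's definition of parent domains and parent proteins.
    """
    if not (parent_start <= insert_start <= insert_end <= parent_end):
        raise ValueError("insert must lie within the parent region")

    insert_len = insert_end - insert_start + 1
    parent_len_without_insert = (parent_end - parent_start + 1) - insert_len
    if parent_len_without_insert <= 0:
        raise ValueError("parent must be longer than the insert")

    # Number of parent residues located N-terminal to the insert.
    residues_before_insert = insert_start - parent_start
    return residues_before_insert / parent_len_without_insert


# Hypothetical example: a 100 aa insert at residues 101-200,
# inside a parent domain spanning 51-350 of a 400 aa protein.
pos_in_domain = relative_insertion_position(101, 200, 51, 350)
pos_in_protein = relative_insertion_position(101, 200, 1, 400)
```

Evaluating the same insert against the domain boundaries and against the full-length protein yields two different relative positions, which is why panel f reports the two distributions separately.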