Fig. 2: Latent space representation of sequences captures phylogenetic relationships between sequences. | Nature Communications

From: Deciphering protein evolution and fitness landscapes with latent space models

a, b Latent space representation of sequences from the multiple sequence alignments (MSAs) of the fibronectin type III domain family and the cytochrome P450 family, respectively.

c Latent space representation of 10,000 random sequences of 100 amino acids each, sampled from the equilibrium distributions of the LG evolutionary model.

d A schematic representation of the phylogenetic tree used to simulate the evolution of a random protein sequence of 100 amino acids. The actual tree has 10,000 leaf nodes. The dashed lines, \(\alpha\) and \(\beta\), mark two reference evolutionary time points at which the sequences of leaf nodes are grouped. Leaf-node sequences are in the same group if they are in the same branch at the reference time point. \(\alpha\) and \(\beta\) lie at evolutionary distances of 0.5 and 0.9 from the root node, respectively. The evolutionary distance from the root node is the expected number of substitutions per site relative to the root node sequence.

e Latent space representation of the simulated sequences of all leaf nodes. Sequences are separated into groups at the reference time point \(\alpha\) and colored by group. Quantification of the clustering can be found in Supplementary Fig. 4.

f Sequences from the yellow group in e (enclosed by the dashed triangle) are regrouped and recolored based on the reference time point \(\beta\).

g Latent space representation of grouped sequences of the fibronectin type III domain family. A phylogenetic tree is inferred from the family's MSA using FastTree2. Based on the inferred tree, sequences are grouped as in d, e, using a reference time point at an evolutionary distance of 2.4. The 20 largest groups are plotted, and sequences are colored by group.

h A plot similar to g for the cytochrome P450 family.

i Sequences from the purple group in h (enclosed by the dashed triangle) are regrouped and recolored based on a reference time point at an evolutionary distance of 2.6.
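
The grouping rule described in d can be illustrated with a short Python sketch. This is not the authors' code: the `Node` class, the `group_leaves_at` function, and the toy eight-leaf tree are hypothetical and only show one way to cut a rooted tree at a reference distance and collect the leaves under each branch that crosses the cut. It also assumes that a leaf whose branch ends before the cut forms its own group, a case the caption does not address.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node; branch_length is the branch from the parent."""
    name: str
    branch_length: float = 0.0
    children: list[Node] = field(default_factory=list)


def leaf_names(node: Node) -> list[str]:
    """Return the names of all leaves below (or at) this node."""
    if not node.children:
        return [node.name]
    names: list[str] = []
    for child in node.children:
        names.extend(leaf_names(child))
    return names


def group_leaves_at(root: Node, cutoff: float) -> list[list[str]]:
    """Group leaves by the branch they sit on at distance `cutoff` from the root."""
    groups: list[list[str]] = []
    stack = [(root, 0.0)]  # (node, distance of the node's parent from the root)
    while stack:
        node, parent_dist = stack.pop()
        dist = parent_dist + node.branch_length
        # The branch into this node crosses the cut, or (assumption) the
        # node is a leaf that ends before the cut: close a group here.
        if dist >= cutoff or not node.children:
            groups.append(sorted(leaf_names(node)))
        else:
            stack.extend((child, dist) for child in node.children)
    return sorted(groups)


# Toy tree in which every leaf is at distance 1.0 from the root.
toy_tree = Node("root", 0.0, [
    Node("A", 0.6, [Node("a1", 0.4), Node("a2", 0.4)]),
    Node("B", 0.3, [
        Node("C", 0.4, [Node("c1", 0.3), Node("c2", 0.3)]),
        Node("D", 0.65, [Node("d1", 0.05), Node("d2", 0.05)]),
    ]),
])

groups_alpha = group_leaves_at(toy_tree, 0.5)  # cut analogous to alpha
groups_beta = group_leaves_at(toy_tree, 0.9)   # cut analogous to beta
```

A later cut can only split existing groups, never merge them, which is why regrouping one cluster from e at \(\beta\) in f yields a finer partition of the same sequences.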
