Fig. 3: Unsupervised learning reveals amino acid preferences for ligand specificity.
From: Highly multiplexed design of an allosteric transcription factor to sense new ligands

a Uniform Manifold Approximation and Projection (UMAP) 2D embedding of 17,430 variants, based on the physicochemical properties of the amino acids at each variable position in the TtgR library. Colors mark the 23 clusters identified with Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). Gray points are variants that HDBSCAN classified as noise. b Dot plot showing the performance of each of the 23 clusters identified by UMAP-HDBSCAN. The color of each dot represents the average F-score of the variants in that cluster-ligand pair that have an F-score of at least 1.5. The size of each dot represents the percentage of variants in the cluster with an F-score >1.5, normalized to the highest percentage for that ligand. c Heatmaps for the top 3 performing clusters for each ligand, showing the log2 enrichment of each possible amino acid at the variable positions of TtgR. Clusters are shown from rank 1 to rank 3, left to right. Enrichment was calculated as the F-score-weighted frequency of each amino acid in the cluster, using only variants with an F-score of at least 1.5, normalized to the DNA count-weighted frequency of that amino acid in the initial library. Red letters denote the wild-type residues. See “Methods” for an in-depth description of the analysis. Source data are provided as a Source Data file.
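The enrichment calculation described for panel c can be illustrated with a minimal sketch. This is not the authors' code: the function names, the data layout (lists of sequence/weight pairs), the use of `>=` for the 1.5 cutoff, and the choice to skip amino acids that are absent from the filtered cluster are all assumptions made for illustration. The "Methods" section remains the authoritative description.

```python
import math
from collections import defaultdict


def weighted_frequencies(pairs, position):
    """Weighted frequency of each amino acid at one position.

    pairs: iterable of (sequence, weight) tuples (hypothetical layout).
    """
    totals = defaultdict(float)
    for sequence, weight in pairs:
        totals[sequence[position]] += weight
    grand_total = sum(totals.values())
    return {aa: t / grand_total for aa, t in totals.items()}


def log2_enrichment(cluster, library, position, min_f_score=1.5):
    """log2 of (F-score-weighted cluster frequency / count-weighted library frequency).

    cluster: list of (sequence, f_score) for variants in one cluster.
    library: list of (sequence, dna_count) for the initial library.
    Assumption: variants below min_f_score are dropped, and amino acids
    not observed among the remaining variants are omitted from the result.
    """
    kept = [(seq, f) for seq, f in cluster if f >= min_f_score]
    if not kept:
        return {}
    cluster_freq = weighted_frequencies(kept, position)
    library_freq = weighted_frequencies(library, position)
    return {
        aa: math.log2(freq / library_freq[aa])
        for aa, freq in cluster_freq.items()
        if aa in library_freq
    }


# Toy data with two variable positions (made up, not from the paper).
toy_cluster = [("AL", 2.0), ("AV", 2.0), ("GL", 1.0)]
toy_library = [("AL", 10), ("AV", 10), ("GL", 20)]
print(log2_enrichment(toy_cluster, toy_library, position=0))
```

In the toy data, the variant "GL" falls below the cutoff, so alanine accounts for all of the F-score weight at the first position while making up half of the library by DNA count, which gives a log2 enrichment of 1 for alanine.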