Fig. 3: Signature-based analysis of compound collections.
From: Bioactivity descriptors for uncharacterized chemical compounds

a Chemical libraries are hierarchically clustered by their proximity to the full CC; here, proximity is determined by the cluster occupancy vector relative to the k-means clusters identified in the CC collection (number of clusters = (N/2)^(1/2); GSigs are used). Proximal libraries have small Euclidean distances between their normalized occupancy vectors. Size of the circles is proportional to the number of molecules available in the collection. Color (blue-to-red) indicates the homogeneity (Gini coefficient) of the occupancy vectors relative to the CC. b Occupancy of high-applicability regions is further analyzed for five collections (plus the full CC). In particular, we measure the average 10-nearest-neighbor L2-distance (measured in the GSig space) of molecules to the high-α subset of CC molecules (10^3, Fig. 2). The red line denotes the distance corresponding to an empirical similarity P-value of 0.01. The percentage indicates the number of molecules in the collection having high-α vicinities that are, on average, below the significance threshold. This percentage is shown for the rest of the libraries in a. c The previous five compound collections are merged and projected together (t-SNE). Each of them is highlighted in a different color with darker color indicating a higher density of molecules. d Detail of the compound collections. The first column shows the chemical diversity of the projections, measured as the average Tanimoto similarity of the 5-nearest neighbors. Blue denotes high diversity and red high structural similarity between neighboring compounds. Coloring is done on a per-cluster basis. The rest of the columns focus on annotated subsets of molecules. Blue indicates high-density regions.
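
The quantities named in panels a and b can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`n_clusters`, `occupancy_vector`, `gini`, `mean_knn_distance`) are hypothetical, the k-means centroids are assumed to have been fitted on the CC beforehand, rounding the cluster count is an assumption, and the caption does not specify exactly how the Gini coefficient is taken "relative to the CC", so the function below simply computes it for whatever non-negative vector it is given.

```python
import numpy as np


def n_clusters(n_molecules):
    """Cluster count from the caption's rule (N/2)^(1/2); rounding is an assumption."""
    return max(1, int(round(np.sqrt(n_molecules / 2))))


def occupancy_vector(signatures, centroids):
    """Normalized fraction of molecules falling into each precomputed cluster.

    Each molecule is assigned to its nearest centroid (Euclidean distance).
    The full pairwise distance matrix is built, so this suits small arrays only.
    """
    dists = np.linalg.norm(signatures[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    counts = np.bincount(labels, minlength=len(centroids)).astype(float)
    return counts / counts.sum()


def gini(values):
    """Gini coefficient of a non-negative vector (0 means perfectly even)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    total = x.sum()
    if total == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float(2 * (ranks * x).sum() / (n * total) - (n + 1) / n)


def mean_knn_distance(queries, reference, k=10):
    """Average L2 distance from each query to its k nearest reference points."""
    dists = np.linalg.norm(queries[:, None, :] - reference[None, :, :], axis=2)
    k = min(k, reference.shape[0])
    # np.partition places the k smallest distances in the first k columns.
    nearest = np.partition(dists, k - 1, axis=1)[:, :k]
    return nearest.mean(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for signature vectors; real GSigs are not used here.
    cc_centroids = rng.normal(size=(n_clusters(800), 8))
    library = rng.normal(size=(200, 8))
    high_alpha_subset = rng.normal(size=(50, 8))

    occ = occupancy_vector(library, cc_centroids)
    print("occupancy sums to", occ.sum())
    print("Gini of occupancy:", gini(occ))
    print("mean 10-NN distance, first molecule:",
          mean_knn_distance(library, high_alpha_subset, k=10)[0])
```

Under these assumptions, two libraries would be compared by the Euclidean distance between their occupancy vectors, and the percentage in panel b would be the share of molecules whose mean 10-nearest-neighbor distance falls below the threshold tied to the empirical P-value of 0.01; how that threshold is derived is not described in the caption and is not reproduced here.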