Fig. 2: Average neighborhood size (y-axis, log scale), defined by a similarity threshold of 0.8 to a source compound, with source compounds grouped by number of heavy atoms (x-axis). | Nature Communications

From: Exhaustive local chemical space exploration using a transformer model

The Tanimoto similarity was evaluated with count-based ECFP4 fingerprints. The transformer models trained with λ = 10 always outperform the transformer models trained with λ = 0, where λ is the hyperparameter controlling the regularization term. The shaded area surrounding the solid lines for HAC (heavy atom count) 1 to 12 and 19 to 36 represents the standard deviation. The standard deviation is calculated from a variable number of samples, which depends on the HAC and on the specific dataset or set of generated compounds. Because of the computational cost, the similarity on GDB-12* was computed with ECFP4 fingerprints without counts. Also included in the figure is the size of the near-neighborhood retrieved from PubChem for each source compound. For HAC between 13 and 18, the neighborhood size was plotted explicitly, since only a few source compounds were available in the TTD database, whereas for HAC greater than or equal to 19 the average and standard deviation are depicted.
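
The quantities in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that count fingerprints are represented as plain feature-to-count mappings (in practice they would come from a cheminformatics toolkit), that the Tanimoto similarity on counts is the usual sum-of-minima over sum-of-maxima form, that a compound belongs to the neighborhood when its similarity is at least 0.8, and that the population standard deviation is used. The function names and the toy fingerprints are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def count_tanimoto(fp_a, fp_b):
    """Tanimoto similarity for count fingerprints: sum(min) / sum(max)."""
    keys = set(fp_a) | set(fp_b)
    shared = sum(min(fp_a.get(k, 0), fp_b.get(k, 0)) for k in keys)
    total = sum(max(fp_a.get(k, 0), fp_b.get(k, 0)) for k in keys)
    return shared / total if total else 0.0


def neighborhood_size(source_fp, candidate_fps, threshold=0.8):
    """Number of candidates with similarity >= threshold (assumed inclusive)."""
    return sum(
        1 for fp in candidate_fps if count_tanimoto(source_fp, fp) >= threshold
    )


def summarize_by_hac(records):
    """Group (hac, neighborhood_size) pairs by HAC; return mean and std per HAC.

    The population standard deviation is an assumption; it is also defined
    when a HAC bin holds a single source compound.
    """
    grouped = defaultdict(list)
    for hac, size in records:
        grouped[hac].append(size)
    return {hac: (mean(s), pstdev(s)) for hac, s in sorted(grouped.items())}


# Toy fingerprints (hypothetical feature ids mapped to counts).
source = Counter({10: 3, 11: 3, 12: 3})
candidates = [
    Counter({10: 3, 11: 3, 12: 3}),         # identical: similarity 1.0
    Counter({10: 3, 11: 3, 12: 3, 13: 1}),  # one extra feature: 9/10 = 0.9
    Counter({10: 3, 11: 1}),                # 4/9, below the threshold
]

size = neighborhood_size(source, candidates)
summary = summarize_by_hac([(10, 2), (10, 4), (11, 5)])
print(size, summary)
```

Plotting the per-HAC mean on a logarithmic y-axis, with a band of one standard deviation around it, would give a curve of the kind described in the caption.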
