Fig. 3: Effects of molecular weight and dataset size on PCA of molecular fingerprints.

Structural isomers sharing a molecular formula were retrieved from PubChem for two formulas. Red circles represent compounds with the formula C₆H₆O₂ (n = 377), and blue circles represent compounds with the formula C₄₈H₈₉NO₁₈ (n = 31). The PCA projection shows that conventional fingerprints are strongly influenced by dataset size and molecular weight, which biases the resulting representation of chemical space.
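
To give a sense of scale for the two isomer sets, the short sketch below computes the average molecular weight implied by each formula and compares the weight ratio with the ratio of set sizes. It is illustrative only and not part of the analysis pipeline: the helper `molecular_weight` and the `ATOMIC_WEIGHT` table are assumptions introduced here, the atomic weights are standard values rounded to three decimals, and the formulas are written with plain ASCII digits.

```python
import re

# Standard atomic weights in g/mol, rounded to three decimals (assumed table,
# limited to the elements that occur in the two formulas).
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def molecular_weight(formula: str) -> float:
    """Average molecular weight of a simple formula such as 'C6H6O2'.

    Hypothetical helper: handles only flat formulas (no brackets, charges,
    or isotopes) built from the elements in ATOMIC_WEIGHT.
    """
    total = 0.0
    # Each match is (element symbol, optional count); a missing count means 1.
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHT[symbol] * (int(count) if count else 1)
    return total


small = molecular_weight("C6H6O2")      # red set, n = 377
large = molecular_weight("C48H89NO18")  # blue set, n = 31

print(f"C6H6O2:     {small:.2f} g/mol")
print(f"C48H89NO18: {large:.2f} g/mol")
print(f"weight ratio (large/small): {large / small:.1f}")
print(f"set-size ratio (377/31):    {377 / 31:.1f}")
```

Under these assumptions the two formulas come out at roughly 110 and 968 g/mol, so the sets differ by nearly an order of magnitude in molecular weight and by about twelvefold in size, the two factors the figure examines.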