Structural Isomer Cumulative molecular fingerprinting method (SIC) for standardizing structural isomeric relationships

Torigoe, Taihei

doi:10.1038/s42004-025-01798-3

Article
Open access
Published: 12 December 2025

Taihei Torigoe (鳥越大平) ORCID: orcid.org/0009-0008-4761-8090¹

Communications Chemistry volume 8, Article number: 406 (2025)

Abstract

Standardizing structural isomeric relationships and evaluating their distribution in chemical space remain major challenges in cheminformatics. Conventional molecular fingerprints and dimensionality reduction techniques are often sensitive to dataset size and structural complexity. Here, we introduce a molecular fingerprint, Structural Isomer Cumulative molecular fingerprint (SIC), that quantitatively captures relative structural differences among isomers with high precision. SIC consists of two variables: SIC_em, representing exact mass, and SIC_L, a cumulative descriptor derived from substructural differences. SIC_L enables calculation of relative structural distances within isomeric groups regardless of dataset size or molecular complexity. Using SIC, we successfully quantified structural differences across positional, skeletal, and functional group isomers, which were not adequately captured by existing descriptors. Furthermore, a scatter plot of SIC_em and SIC_L visualized metabolite distributions among cellular compartments, and nine endogenous metabolites were identified whose structural characteristics suggest potential toxicity.

Introduction

Molecular fingerprints have been widely employed for compound similarity assessment, biological activity evaluation, and the construction of predictive models^1,2,3, establishing themselves as core technologies in cheminformatics and drug discovery. Among widely used molecular fingerprints, the Atom Pair fingerprint (AP)⁴ represents descriptors based on interatomic distances across the entire molecule, enabling global structural evaluation. However, its sensitivity to subtle substructural differences is limited. In contrast, Extended-Connectivity Fingerprints (ECFP⁵) compute local environments around each atom, achieving high resolution for identifying substructures, but they struggle to capture holistic features such as overall molecular size and shape⁶. To address these limitations, MAP4⁶ and MAP4C⁷ have been developed, which are less sensitive to molecular size and capable of more accurately encoding substructural information. Nonetheless, these approaches still face challenges in continuously and relatively evaluating distances between structural isomers. In particular, for compound groups comprising positional, skeletal, or functional group isomers, conventional fingerprint-based similarity metrics often fail to appropriately reflect their structural differences. Currently, no established method exists for continuously and relatively evaluating such fine-grained structural variations among isomers, which remains an important yet underexplored issue in many applied domains, including toxicity prediction, bioactive molecule screening, and metabolite identification. A numerical descriptor that enables the continuous and relative characterization of structural isomerism using a single variable is especially valuable for applications such as assessing the coverage of in-house compound libraries for unknown compound identification and conducting structure-based toxicity screening. 
In this study, we propose a novel molecular descriptor, the SIC fingerprint, which aims to quantify and standardize structural isomeric relationships. SIC consists of two components: SIC_em, representing the exact mass, and SIC_L, a cumulative structural distance metric derived from substructural differences. This framework enables visualization and comparison of chemical space within isomeric compound groups, independent of dataset size or molecular weight. The method was applied to various datasets containing endogenous and toxic compounds, including YMDB^8,9, ECMDB^10,11, HMDB^12,13, T3DB¹⁴, and TOXRIC¹⁵. Compared to conventional fingerprints such as ECFP and Atom Pair, SIC demonstrated superior ability to evaluate structural isomeric relationships as relative structural distances. Furthermore, analysis based on SIC_L identified nine endogenous human metabolites whose structural features were located in close proximity to known toxic compounds in the SIC_L-defined chemical space, suggesting potential toxicological relevance.

Results

The Structural Isomer Cumulative molecular fingerprint (SIC) provides a standardized framework for representing datasets consisting of structurally isomeric compounds

Conventional molecular fingerprints often exhibit substantial bias due to factors such as dataset size, molecular weight, and structural complexity. Furthermore, when chemical space is visualized using dimensionality reduction methods such as principal component analysis (PCA), it is often difficult to interpret the resulting distributions in terms of structural features. The SIC addresses these challenges by providing a structural distance metric that is relatively insensitive to molecular size and dataset scale. SIC enables continuous and relative quantification of structural differences, ranging from subtle substructural variations to large-scale scaffold-level differences. Moreover, when compounds are visualized in the two-dimensional space defined by SIC_L and SIC_em, the resulting chemical space exhibits high interpretability (Fig. 1). SIC consists of two variables: SIC_L, a structural distance metric, and SIC_em, the monoisotopic exact mass. An overview of the SIC_L calculation is shown in Fig. 2. To calculate SIC_L, compounds are first grouped by molecular formula to ensure relative comparisons within isomeric groups. For each compound, a list of substructures is defined based on atom and bond information (Supplementary Data 1). The molecular center was defined as the centroid of the 2D atomic coordinates of all atoms, generated using RDKit. The planar distance from this center (L_sub) was then computed for each substructure. Within each formula group, the median distance (L_median) was calculated for each substructure type and used as a reference. When the same substructure pattern occurred multiple times within a molecule, each occurrence was treated as an independent instance. Thus, multiple instances of the same functional group (e.g., R-NH₂) could differ in their classification, with some considered structurally divergent and others not, depending on their individual distances relative to the group median. 
This design allows SIC to capture positional differences of identical functional groups within a molecule. If the absolute difference between L_sub and L_median exceeds a predefined threshold (0.01), the substructure is considered structurally divergent. The product of its distance (L_i) and molecular weight (M_i) is computed. The sum of these products across all such substructures is denoted as S, and the SIC_L value is obtained by normalizing this sum with the monoisotopic mass M:

$$\mathrm{SIC}_{L}=\frac{S}{M}$$

**Fig. 2: Workflow for calculating the SIC.**

In this formulation, compounds with many substructures that are both heavy and distant from the group median contribute to a higher SIC_L value. The second axis, SIC_em, represents the monoisotopic mass. Therefore, in the two-dimensional SIC space, a wide spread along the SIC_L axis indicates the presence of diverse structural isomers. Conversely, compounds with similar SIC_L values tend to be structurally similar, often representing positional isomers with substructures situated at comparable distances.
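
As a purely illustrative calculation (all numbers here are hypothetical and not taken from the study), consider an isomer with monoisotopic mass M = 110.037 in which two substructures exceed the threshold: one with L_1 = 2.5 and M_1 = 17.007, and one with L_2 = 1.2 and M_2 = 13.019. Under the definition above:

$$S = 2.5\times 17.007 + 1.2\times 13.019 = 58.140,\qquad \mathrm{SIC}_{L}=\frac{58.140}{110.037}\approx 0.528$$

A second isomer in which the same two substructures lie farther from the molecular center would obtain a larger S at the same M, and would therefore sit farther along the SIC_L axis.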

Benchmarking the performance of chemical space visualization for structurally isomeric compounds

Chemical space visualizations based on conventional molecular fingerprints are often strongly influenced by dataset size and molecular weight. This effect becomes particularly pronounced in structurally isomeric compounds, which share the same molecular formula but differ in structure, making similarity evaluation difficult. In this study, we compared the visualization performance of several existing molecular fingerprints (MQN¹⁶, MAP4C, MHFP¹⁷, MACCS Key¹⁸, RDKit¹⁹, ECFP, and AP) with that of the newly developed fingerprint, SIC, using principal component analysis (PCA). The evaluation was conducted on two isomer sets retrieved from PubChem: compounds with the formula C₆H₆O₂ (377 compounds) and C₄₈H₈₉NO₁₈ (31 compounds). Dimensionality reduction was performed using PCA, a linear method with high interpretability and reproducibility²⁰. As a result, MQN, MAP4C, MHFP, MACCS Key, RDKit, and ECFP exhibited significantly biased distributions depending on dataset size and molecular weight (Fig. 3). In particular, AP showed a strong dependency on molecular weight. These fingerprints failed to produce chemically interpretable distributions. In contrast, SIC was minimally affected by dataset size or molecular weight. The first principal component (PC1) primarily reflected molecular weight, while the second (PC2) captured subtle structural differences. These results suggest that SIC is a highly effective method for visualizing and comparing structurally isomeric compounds.

**Fig. 3: Evaluation of the effects of molecular weight and dataset size on PCA using molecular fingerprints.**

Performance evaluation of molecular fingerprints/Tanimoto similarity scores and scaled SIC_L

SIC_L inherently represents structural diversity within a compound set as relative distances between compounds, thus eliminating the need for pairwise comparisons such as those used in conventional Tanimoto similarity. However, to ensure a fair performance comparison, this study compared SIC_L with conventional molecular fingerprints and Tanimoto similarity-based structural similarity evaluations. SIC_L values were divided by their maximum value and scaled to the range 0–1. To enable pairwise comparison using SIC_L, the absolute difference between the normalized SIC_L values for each compound pair was calculated, and this difference was subtracted from one to derive a pairwise similarity score. This procedure reinterprets the numerical distance between SIC_L values as a measure of “similarity,” allowing direct comparison with traditional fingerprint-based methods evaluated by Tanimoto coefficients. The evaluation was performed using five sets of structural isomers obtained from PubChem: C₆H₆O₂ (377 compounds), C₁₂H₁₄O₇S (141 compounds), C₆H₁₆O₁₈P₄ (22 compounds), C₃₉H₇₉N₂O₆P (79 compounds), and C₄₈H₈₉NO₁₈ (31 compounds). The compounds selected for these evaluations were determined based on molecular formula and molecular weight, which fall within the structural evaluation range targeted by SIC.
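
The scaling and pairwise conversion described above can be sketched as follows. This is a minimal illustration rather than the study's code: the function names and the three SIC_L values are hypothetical, and the sketch assumes non-negative SIC_L values within one molecular formula group.

```python
from itertools import combinations


def scale_by_max(sic_l):
    """Divide each SIC_L value by the group maximum so results fall in 0-1."""
    top = max(sic_l.values())
    if top == 0:
        # Degenerate group: every value is zero, so nothing can be scaled.
        return {name: 0.0 for name in sic_l}
    return {name: value / top for name, value in sic_l.items()}


def pairwise_similarity(sic_l):
    """Similarity of each pair = 1 - |difference of scaled SIC_L values|."""
    scaled = scale_by_max(sic_l)
    return {
        (a, b): 1.0 - abs(scaled[a] - scaled[b])
        for a, b in combinations(sorted(scaled), 2)
    }


# Hypothetical SIC_L values for three isomers of one molecular formula.
example = {"isomer_A": 0.40, "isomer_B": 0.50, "isomer_C": 0.80}
sims = pairwise_similarity(example)
for pair, score in sims.items():
    print(pair, round(score, 3))
```

Because every score is derived from a single scalar per compound, the pairwise table adds no information beyond the SIC_L values themselves; it only puts them on the same footing as Tanimoto scores.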

First, we compared the average similarity scores by molecular weight between the molecular fingerprints with average Tanimoto similarity scores from previous studies and the SIC_L with average pairwise similarity scores obtained in this study (Fig. 4a). The results showed that the average Tanimoto similarity scores of conventional molecular fingerprints tended toward 1.0 with increasing molecular weight. In contrast, the scaled SIC_L remained stable between ~0.5 and 0.8, indicating that it is independent of molecular weight. The average pairwise similarity score derived from the scaled SIC_L also remained stable, between ~0.8 and 0.9, similarly showing molecular weight-independent behavior. The higher average pairwise similarity score compared with the scaled SIC_L itself can be explained by the definition of SIC_L, which represents the relative differences in the mass and structural distances of substructures within each molecular formula group. Consequently, computing pairwise similarity from SIC_L values, already representing relative structural distances, leads to redundant evaluation of the same structural relationships.

**Fig. 4: Comparison of structural similarity between molecular fingerprints and SIC_L.**

Next, we compared the distribution of average pairwise similarity scores across different molecular weights (Fig. 4b). The results showed that for conventional molecular fingerprints with average Tanimoto similarity scores, the interquartile range (IQR) narrowed and tended toward 1.0 as molecular weight increased. In contrast, the scaled SIC_L exhibited a stable IQR between ~0.8 and 0.95, demonstrating molecular weight-independent behavior.

As similarity scores approach 1.0, they indicate higher structural similarity. Previous studies have discussed the limitations of conventional fingerprints in distinguishing substructural differences among high molecular-weight compounds^6,21. The SIC developed in this study can evaluate structural similarity independently of molecular weight. Unlike conventional approaches, it does not require calculating pairwise similarity scores using molecular fingerprints. Instead, it directly evaluates the structural differences among compounds within the same molecular formula group as relative distance values.

Benchmarking chemical space visualization performance using public compound databases

To evaluate the practical utility of SIC, we conducted a comparative analysis using multiple compound databases spanning from prokaryotic to eukaryotic organisms. As reference methods, we selected AP and ECFP, which, although clearly influenced by dataset size and molecular weight, showed relatively less distortion in distribution compared to other fingerprints, as demonstrated in the results of Fig. 3. First, using the ECMDB, YMDB, and HMDB datasets (Supplementary Data 2), we performed chemical space visualizations. The results indicated that AP and ECFP tended to produce compressed distributions due to data size and molecular weight effects, resulting in limited coverage of chemical space (Fig. 5). In contrast, SIC demonstrated the broadest distribution, effectively capturing a greater range of structural diversity. This trend was consistent across HMDB and YMDB, which represent complex metabolic networks of eukaryotic organisms and encompass broader chemical diversity. By contrast, ECMDB showed a narrower spread, likely reflecting the simpler metabolic repertoire of E. coli, with its limited range of molecular weights and structural types. To assess whether this broader distribution holds biological relevance, we evaluated the structural diversity of compounds associated with different organelles, based on subcellular annotations provided in HMDB. The SIC-based visualization revealed that small-molecule metabolites were broadly distributed in the nucleus and mitochondria, whereas the cytosol exhibited a more diverse distribution, including higher-molecular-weight compounds. This trend may reflect functional differences in metabolism among subcellular compartments. To verify that these observed distributions were not simply due to differences in dataset size, we compared two databases with relatively similar numbers of compounds: HMDB (2933 endogenous metabolites) and T3DB (3457 toxic compounds). 
This analysis aimed to evaluate differences in chemical space coverage based on structural diversity alone. The results showed that with AP and ECFP, compound distributions from both databases largely overlapped, and no distinct clustering was observed. In contrast, SIC revealed clear distinctions: HMDB compounds exhibited broad diversity, including high-molecular-weight lipids, whereas T3DB compounds showed structural diversity biased toward low-molecular-weight compounds. Taken together, these findings suggest that SIC is a useful method for effectively evaluating structural diversity across a wide range of compounds, from endogenous metabolites to toxic xenobiotics.

**Fig. 5: Chemical space visualization of compound datasets collected from public databases using PCA.**

Evaluation of SIC utility based on structural distances between endogenous eukaryotic metabolites and toxic compounds

Among endogenous metabolites in eukaryotes, some compounds are listed in toxic compound databases such as T3DB and TOXRIC (Supplementary Data 3). Compounds structurally similar to known toxic substances are considered to carry a higher risk of toxicity²². Therefore, the accurate identification of metabolites structurally analogous to toxic compounds (even at the level of positional isomers) is considered critically important for risk assessment. The SIC enables the calculation of structural distances based on relative differences between structural isomers. To evaluate the performance and utility of SIC, we analyzed endogenous metabolites (from HMDB and YMDB) and toxic compounds (from T3DB and TOXRIC) to identify potential novel toxic candidates among endogenous metabolites. Using InChIKeys, we annotated endogenous metabolites that are already listed as toxic compounds in those databases. Then, compounds were grouped by the molecular formulae derived from their canonical SMILES, and SIC values were calculated. For visualization, SIC_L and SIC_em were directly used as axes, allowing chemical space mapping based on structural similarity and substructural differences weighted by molecular mass (Fig. 6). Notably, several endogenous metabolites without known toxicity were found to cluster closely with known toxic compounds, particularly at the level of positional or functional group isomers. To further validate this, we manually examined the top nine endogenous metabolites that showed the smallest SIC_L differences to a toxic compound, sorted within each group of compounds sharing the same molecular formula (Fig. 7). The dataset used comprised pairs where one compound was a known toxicant and the other had no toxicity annotation despite being a structural isomer. The threshold of the top nine was selected because there were 94 compound pairs with SIC_L scores above 0.9, and manual evaluation of all cases was not practical (Fig. 8c).
Thus, this analysis represents a pilot assessment to verify the plausibility of SIC-based toxic analog identification. Within the top nine, compounds such as 6-Hydroxynicotinic acid (HMDB0002658, C₆H₅NO₃) and 2-Butanol (HMDB0011469, C₄H₁₀O) were judged to have low toxicity potential. 2-Hydroxyfluorene (HMDB0013163, C₁₃H₁₀O), while part of a dataset that included a toxic positional isomer, was matched as the most similar compound based on a functional group isomer, indicating a possible false positive in the SIC_L score. However, Methylsuccinic acid (HMDB0001844, C₅H₈O₄) is a positional isomer of a known toxic compound and therefore may carry a significantly high toxicological risk, despite lacking current toxicity annotation. Ethyl hexanoate (YMDB01381, C₈H₁₆O₂) was a compound for which the toxic form could be generated through ester hydrolysis via esterases (EC 3.1.1.x class). Pyruvaldehyde (HMDB0001167, C₃H₄O₂) and Propanal (HMDB0003366, C₃H₆O), both bearing reactive aldehyde groups, were also found to be highly similar to toxic counterparts and may represent hazardous endogenous compounds, especially in the context of aging-related accumulation and abnormal chemical modifications^23,24,25,26,27,28,29,30,31.
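
The ranking step used to shortlist candidates can be sketched as below. This is an assumption-laden illustration, not the study's pipeline: the record layout, the function name `nearest_toxic_isomers`, and all compound names and SIC_L values are hypothetical.

```python
from collections import defaultdict


def nearest_toxic_isomers(records, top_n=9):
    """Rank unannotated compounds by SIC_L distance to their closest toxic isomer.

    records: iterable of (name, formula, sic_l, is_toxic) tuples.
    Returns up to top_n tuples (difference, name, toxic_name, formula),
    smallest difference first.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[1]].append(record)

    hits = []
    for formula, members in groups.items():
        toxic = [m for m in members if m[3]]
        if not toxic:
            continue  # no toxic reference compound in this formula group
        for name, _, value, is_toxic in members:
            if is_toxic:
                continue
            best = min(toxic, key=lambda t: abs(t[2] - value))
            hits.append((abs(best[2] - value), name, best[0], formula))
    return sorted(hits)[:top_n]


# Hypothetical records; names and SIC_L values are placeholders.
records = [
    ("met_1", "C3H6O", 0.50, False),
    ("tox_1", "C3H6O", 0.52, True),
    ("tox_2", "C3H6O", 0.90, True),
    ("met_2", "C5H8O4", 0.30, False),
    ("tox_3", "C5H8O4", 0.31, True),
    ("met_3", "C4H10O", 0.70, False),
]
ranked = nearest_toxic_isomers(records)
for diff, name, toxic_name, formula in ranked:
    print(f"{name} ~ {toxic_name} ({formula}): {diff:.3f}")
```

A shortlist produced this way still needs manual review, since a small SIC_L difference can arise from a functional group isomer rather than a positional one, as the 2-Hydroxyfluorene case shows.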

**Fig. 6: Distribution of SIC-based structural distances between toxic compounds and their positional isomeric endogenous metabolites.**

**Fig. 7: Endogenous human metabolites exhibiting high structural similarity to toxic compounds based on the SIC.**

**Fig. 8: Structural similarity between toxic compounds and endogenous metabolites evaluated using different molecular fingerprints.**

Finally, we compared SIC_L scores with Tanimoto coefficients calculated using AP and ECFP fingerprints, scaling SIC_L to a 0–1 range. All pairs of toxicants and structurally related nontoxic endogenous metabolites were analyzed (Fig. 8). The majority of absolute SIC_L differences were below 1.0 (Fig. 8a). Each score was scaled by its maximum value and compared to AP and ECFP (Fig. 8b). SIC_L produced unique scores for every compound pair, sharply distinguishing structural differences with minimal redundancy (Supplementary Data 4). To evaluate the characteristics of high-similarity pairs, we analyzed those with scores ≥0.9. The focus of this comparison was not the number of pairs per se, but the proportion of structural isomers and the occurrence of false-perfect matches (score = 1.0 between distinct structures). SIC_L identified 94 such pairs, with a higher proportion of structural isomers and no false-perfect matches, whereas AP and ECFP identified 26 and 27 pairs, respectively, including several false-perfect matches (Fig. 8c). Furthermore, the proportion of positional isomeric pairs among high-similarity matches (score ≥ 0.9) was 22.3% for SIC_L, compared to 0% and 2.85% for ECFP and AP, respectively. Collectively, these findings demonstrate that SIC_L exhibits superior discriminative power in evaluating fine structural similarity between isomeric compounds, particularly in identifying endogenous metabolites closely resembling toxic compounds. Additionally, SIC_L provides a novel approach for capturing structural proximity beyond the reach of traditional molecular fingerprints evaluated using Tanimoto similarity, making it a promising tool for toxicological prediction and safety assessment.
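
To make the notion of a false-perfect match concrete, the sketch below computes the Tanimoto coefficient on binary fingerprints represented as sets of on-bit indices and lists distinct compounds that nevertheless score exactly 1.0. The bit indices and compound names are hypothetical, and the helper names are introduced only for this illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    # Returning 0.0 for two empty fingerprints is a convention chosen here.
    return len(a & b) / len(union) if union else 0.0


def false_perfect_matches(fingerprints):
    """Return pairs of distinct compound ids whose fingerprints score 1.0."""
    ids = sorted(fingerprints)
    return [
        (p, q)
        for i, p in enumerate(ids)
        for q in ids[i + 1:]
        if tanimoto(fingerprints[p], fingerprints[q]) == 1.0
    ]


# Hypothetical on-bit sets: isomer_1 and isomer_2 are different structures
# that happen to set exactly the same bits.
fps = {
    "isomer_1": {3, 17, 42},
    "isomer_2": {3, 17, 42},
    "isomer_3": {3, 17, 99},
}
matches = false_perfect_matches(fps)
print(matches)
print(tanimoto(fps["isomer_1"], fps["isomer_3"]))
```

A scalar descriptor such as SIC_L can only produce a perfect score when two compounds receive identical values, which is the property the comparison in Fig. 8c examines.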

Discussion

Conventional molecular fingerprint-based similarity evaluations are often influenced by dataset size and molecular complexity, and have continued to face challenges in the continuous and relative assessment of structurally isomeric compound groups. To address this issue, we developed a method that accumulates structural differences within compound groups sharing the same molecular formula, based on substructural differences and their associated molecular weights. Traditional approaches assess chemical space using overall molecular structure information, and therefore have limited capacity to distinguish small differences. In contrast, the SIC method computes structural distances using variables derived from differing substructures and molecular weight. This design minimizes the impact of dataset size and molecular weight, allowing for relative evaluation of structural distances even at the level of positional isomers.

To evaluate the utility of SIC, we compared its performance with that of existing molecular fingerprints. The comparison used isomer datasets with differing molecular weights and dataset sizes (C₆H₆O₂: 377 compounds; C₄₈H₈₉NO₁₈: 31 compounds) and employed principal component analysis (PCA) to visualize chemical space. We focused primarily on small molecules with molecular weights under 1000 Da. To ensure a fair performance comparison, this study evaluated SIC_L alongside conventional molecular fingerprints and their Tanimoto similarity scores. The evaluation was conducted using five sets of structural isomers obtained from PubChem: C₆H₆O₂ (377 compounds), C₁₂H₁₄O₇S (141 compounds), C₆H₁₆O₁₈P₄ (22 compounds), C₃₉H₇₉N₂O₆P (79 compounds), and C₄₈H₈₉NO₁₈ (31 compounds). Comparisons with larger or extremely small molecules were not conducted, and therefore future studies will be needed to explore SIC’s applicability across a broader range. In addition, we assessed the generalizability of SIC using diverse metabolite databases derived from prokaryotic and eukaryotic organisms (HMDB, YMDB, ECMDB), as well as toxic compound datasets (T3DB). The results showed that SIC_em and SIC_L explicitly represented structural distributions based on molecular weight and diversity. We also plan to evaluate how increasing dataset size affects chemical space visualization to assess its effectiveness for mapping the diversity of natural products. Furthermore, in toxicity assessments using SIC, we were able to identify numerous endogenous metabolites with structures highly similar to known toxic compounds, which had not been detected using previous methods. However, manual curation was labor-intensive, highlighting the need for standardized annotation criteria in the future.

The benchmarking conducted in this study addressed several key issues in prior methods, including sensitivity to molecular weight and dataset size, limited resolution of substructural differences, and general applicability. As a result, we demonstrated that SIC can effectively minimize the influence of dataset size and molecular weight, and enable relative evaluation of structural distances at the level of positional isomers. Compared to existing molecular fingerprints, SIC showed greater clarity in capturing differences among structural isomers while reducing dependency on dataset size and molecular weight. Conventional visualization techniques for chemical space struggle to represent structural diversity explicitly. In contrast, SIC, comprising two fingerprint variables (SIC_L and SIC_em), enabled clear visualization of structural differences as relative distances. As shown in Fig. 5, the compound distribution was well organized, with SIC_em capturing molecular weight variation, while SIC_L reflected substructural differences. SIC also demonstrated high discriminative power by sensitively extracting endogenous metabolites structurally close to toxic compounds from databases such as T3DB and TOXRIC. Unlike previous methods, SIC avoided issues such as redundant scores or score saturation, allowing for the assignment of unique structural distances.

SIC minimizes the influence of dataset size and molecular weight and effectively captures differences among structural isomers. However, as it quantifies relative structural distances within groups sharing the same molecular formula, its application becomes unstable when the number of compounds in a group is insufficient, making practical use difficult in such cases. Therefore, identifying the minimum number of required comparison compounds for reliable SIC evaluation remains an important issue. On the other hand, SIC does not require the construction of predictive models and enables explicit evaluation of structural distances, making it a fast and simple tool for structural assessment, without the need for prior domain validation or large datasets typically required for machine learning approaches. Looking ahead, SIC may be extended to evaluate stereoisomers; however, this would require overcoming computational challenges such as incorporating quantum chemical calculations to assess structural stability.

A limitation of the present SIC implementation is that it relies on a predefined list of substructures. In this study, the evaluation was centered on endogenous metabolites and toxic compounds, and therefore the atomic types included were restricted to those most relevant to this context. From the standpoint of general chemical research, the omission of elements such as B or Si may appear unusual, and this limitation should be acknowledged. In addition, the current approach may yield false positives when substructures with nearly identical molecular weights are cumulatively evaluated, leading to compounds with different scaffolds being placed in closer proximity than expected along the SIC axis. This reflects the fact that SIC quantifies structural differences using 2D atomic coordinates and substructure molecular weights, which may not fully capture subtle electronic effects. In future work, the SMARTS list can be expanded to incorporate additional elements, and integration of 3D structural descriptors or quantum chemical information such as electron density distributions could provide a more robust evaluation of structural diversity. In addition, the dataset dependence of SIC_L has both advantages and drawbacks. For metabolomic applications, variability across datasets is useful for comparing isomeric repertoires between organisms, whereas in contexts such as drug discovery, dataset dependence may hinder evaluation of candidate chemical spaces. In such cases, fixed reference values for each substructure type would be preferable.

SIC offers a novel approach to the explicit quantification of structural differences among isomers and opens new possibilities in the evaluation of chemical diversity and toxicological screening. Future work will include expanding its application to stereochemical isomers and quantitatively validating its relationship with biological similarity. In addition, integration with clustering or visualization methods based on structural distance may enhance the accuracy and interpretability of compound classification and discovery pipelines.

Methods

Method for calculating the structural isomer cumulative molecular fingerprint

The calculation of the Structural Isomer Cumulative molecular fingerprint (SIC) was performed as follows:

1. Grouping by molecular formula: Canonical SMILES strings were used to calculate molecular formulas, and compounds with identical formulas were grouped together.

2. Substructure distance calculation: For each compound, a predefined list of substructures—each defined by a single atom and its bond types—was used to compute the 2D atomic coordinates of atoms (generated using RDKit). The molecular center was defined as the centroid of these coordinates, and the distance to this center (L_sub) was calculated for each substructure. This list was systematically constructed to comprehensively cover covalent bonding patterns of the atomic types analyzed in this study.

3. Median distance determination: For each substructure type across the group, the median distance L_median was computed.

4. Cumulative distance calculation: For each substructure where |L_sub − L_median| ≥ 0.01, the distance L_i and the substructure molecular weight M_i were recorded. The cumulative structural score S was then calculated as:

$$S=\sum_{i=1}^{n} L_{i}\times M_{i}$$

where i is the index of the substructure and n is the number of substructures meeting the threshold condition. The threshold of 0.01 was determined empirically through benchmarking analyses across datasets with different molecular weights and sample sizes. Thresholds ranging from 1 to 0.0001 were tested, and 0.01 was found to yield the most stable and consistent distributions (Supplementary Data 5).

5. Normalization and fingerprint output: The final SIC values consisted of two variables: (1) SIC_L, defined as the normalized cumulative score S/M, where M is the monoisotopic exact mass of the compound, and (2) SIC_em, the exact mass M itself (rounded to six significant digits).

The SIC fingerprint is designed such that SIC_L increases when the substructures of a compound have both larger spatial deviation of L_sub from the median and higher molecular weights (M_i) within a group of structural isomers. By normalizing S with respect to M, variation due to overall molecular size is minimized. The substructure list was generated using SMARTS notation, covering atoms of C, N, P, O, S, H, Cl, Br, I, F, and Hg, with classification based on bonding type and coordination. Molecular weights M_i for substructures included the atomic mass of the central atom and its attached hydrogens. The threshold value of 0.01 for |L_sub − L_median| was selected based on performance evaluation, as it maximized the ability to distinguish positional isomers in compounds with molecular weights below 1000. All computations were implemented using Python 3.11.5, with RDKit 2023.3.2, pandas 1.3.5, SciPy 1.7.3, and NumPy 1.21.6. The SIC tool developed in this study is available at BioChemCalc (BCC, https://biochemcalc.com/).
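
The five steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the published implementation: 2D coordinates, substructure types, and masses are supplied by hand instead of being derived with RDKit and the SMARTS list, the centroid is taken over the atoms supplied, and every name (`centroid_distances`, `sic_for_group`, `sic`, the toy molecules) is hypothetical.

```python
from collections import defaultdict
from math import hypot
from statistics import median

THRESHOLD = 0.01  # value reported in the text


def centroid_distances(atoms):
    """atoms: list of (substructure_type, x, y, mass_i).

    Returns (substructure_type, L_sub, mass_i) for every atom, where L_sub is
    the planar distance from the centroid of the supplied coordinates.
    """
    cx = sum(a[1] for a in atoms) / len(atoms)
    cy = sum(a[2] for a in atoms) / len(atoms)
    return [(t, hypot(x - cx, y - cy), m) for t, x, y, m in atoms]


def sic_for_group(molecules):
    """Compute (SIC_L, SIC_em) for molecules that share one formula."""
    dists = {n: centroid_distances(m["atoms"]) for n, m in molecules.items()}

    # Step 3: median distance per substructure type across the whole group;
    # repeated occurrences within a molecule count as separate instances.
    by_type = defaultdict(list)
    for subs in dists.values():
        for sub_type, l_sub, _ in subs:
            by_type[sub_type].append(l_sub)
    medians = {t: median(v) for t, v in by_type.items()}

    result = {}
    for name, subs in dists.items():
        # Step 4: accumulate L_i * M_i over substructures beyond the threshold.
        s = sum(l * m for t, l, m in subs if abs(l - medians[t]) >= THRESHOLD)
        mass = molecules[name]["mass"]
        # Step 5: normalize by exact mass; SIC_em kept to six significant digits.
        result[name] = (s / mass, float(f"{mass:.6g}"))
    return result


def sic(molecules):
    """Step 1: group by formula, then score each group independently."""
    groups = defaultdict(dict)
    for name, mol in molecules.items():
        groups[mol["formula"]][name] = mol
    out = {}
    for members in groups.values():
        out.update(sic_for_group(members))
    return out


# Toy two-atom "isomers" with made-up coordinates and masses.
toy = {
    "A": {"formula": "X", "mass": 28.0,
          "atoms": [("O", 0.0, 0.0, 16.0), ("C", 2.0, 0.0, 12.0)]},
    "B": {"formula": "X", "mass": 28.0,
          "atoms": [("O", 0.0, 0.0, 16.0), ("C", 4.0, 0.0, 12.0)]},
}
scores = sic(toy)
print(scores)
```

In the toy group every substructure deviates from its median by 0.5, so all of them contribute, and the more extended molecule B receives the larger SIC_L. Because the medians are recomputed per group, the same molecule can receive a different SIC_L when its comparison set changes, which is the dataset dependence noted in the Discussion.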

Chemical space visualization

Molecular fingerprints were calculated using RDKit (version 2023.3.2) and mapchiral (version 0.0.7). For the comparison with existing fingerprints, the following specifications were used: Atom Pair (count fingerprint, 2048 bits), Morgan (binary, radius = 2, 2048 bits; corresponding to ECFP4), MACCS Keys (binary, 165 bits), MAP4C (binary, 2048 bits), MHFP (MinHash fingerprint, 128 dimensions), MQN (count, 42 dimensions), and RDKit (binary, 2048 bits). Prior to principal component analysis (PCA), all fingerprint vectors were standardized by Z-score normalization so that each feature had a mean of 0 and a standard deviation of 1. PCA was performed with scikit-learn (version 1.0.2), and the resulting scores were exported to Microsoft Excel for graphical visualization.
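The Z-score standardization applied before PCA can be sketched with the Python standard library alone. This is an illustrative sketch rather than the pipeline used in the study: the function name `zscore_columns` is hypothetical, the population standard deviation is assumed, and constant columns (common in sparse binary fingerprints) are set to zero instead of being divided by zero, which is an assumption about how such features are handled.

```python
from statistics import fmean, pstdev


def zscore_columns(rows):
    """Standardize each feature (column) to mean 0 and standard deviation 1.

    rows: list of equal-length fingerprint vectors, one per compound.
    """
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # ASSUMPTION: population std (ddof = 0)
    return [
        # Constant features carry no variance; map them to 0.0 to avoid division by zero
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


fingerprints = [
    [0, 1, 5],
    [1, 1, 7],
    [0, 1, 9],
]
standardized = zscore_columns(fingerprints)
```

The standardized matrix would then be passed to a PCA routine. Standardizing first prevents count-based features with large numeric ranges from dominating the leading components.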

Use of public databases

Datasets of endogenous metabolites from both prokaryotic and eukaryotic organisms were retrieved from YMDB⁸,⁹ (Detected and quantified), ECMDB¹⁰,¹¹, and HMDB¹²,¹³ (Detected and quantified, Endogenous). Toxic compounds were obtained from T3DB¹⁴ and TOXRIC¹⁵ (Toxicity Category: Carcinogenicity). Since the datasets included stereoisomeric information, InChIKeys were computed using RDKit from canonical SMILES. These InChIKeys were then used to merge stereoisomers into structural isomers. Additionally, metabolites from YMDB and HMDB that matched InChIKeys in T3DB and TOXRIC were extracted and classified as toxic compounds.
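One way to merge stereoisomers and match toxic compounds by InChIKey is sketched below. It assumes that merging is done on the first 14-character block of the InChIKey, which hashes the connectivity layers while stereochemistry is encoded in the second block; the text does not state which part of the key was compared. The function names are hypothetical, and the keys are taken as already computed.

```python
from collections import defaultdict


def connectivity_block(inchikey):
    """Return the first InChIKey block, which is shared by stereoisomers."""
    return inchikey.split("-")[0]


def merge_stereoisomers(records):
    """Group (name, InChIKey) records by connectivity block."""
    merged = defaultdict(list)
    for name, inchikey in records:
        merged[connectivity_block(inchikey)].append(name)
    return dict(merged)


def flag_toxic(metabolite_keys, toxic_keys):
    """Return metabolite keys whose connectivity block occurs in the toxic set.

    ASSUMPTION: matching is done after merging, i.e. on the first block only.
    """
    toxic_blocks = {connectivity_block(k) for k in toxic_keys}
    return [k for k in metabolite_keys if connectivity_block(k) in toxic_blocks]


records = [
    ("L-alanine", "QNAYBMKLOCPYGJ-REOHZZHVSA-N"),
    ("D-alanine", "QNAYBMKLOCPYGJ-UWTATZPHSA-N"),
    ("beta-alanine", "UCMIRNVEIXFBKS-UHFFFAOYSA-N"),
]
structural_isomers = merge_stereoisomers(records)
```

In this example the two alanine enantiomers collapse into one entry, while β-alanine, a positional isomer, keeps its own entry and so remains a distinct member of the isomer group.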

Data availability

The source data supporting the figures and graphs are provided in the Supplementary Data file. Additional data generated or analyzed during this study are available from the corresponding author upon reasonable request.

Code availability

The tools and source code developed in this study are freely accessible at https://biochemcalc.com and may be used without restriction for academic purposes. The source code implementing the SIC algorithm is openly available at https://github.com/TaiheiTorigoe/SIC and archived on Zenodo (DOI: 10.5281/zenodo.17446622) under the MIT License³².

References

1. Gomez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).
2. Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).
3. Awale, M. & Reymond, J. L. The polypharmacology browser: a web-based multi-fingerprint target prediction tool using ChEMBL bioactivity data. J. Cheminform. 9, 11 (2017).
4. Carhart, R. E., Smith, D. H. & Venkataraghavan, R. Atom pairs as molecular features in structure-activity studies: definition and applications. J. Chem. Inf. Comput. Sci. 25, 64–73 (1985).
5. Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
6. Capecchi, A., Probst, D. & Reymond, J. L. One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome. J. Cheminform. 12, 43 (2020).
7. Orsi, M. & Reymond, J. L. One chiral fingerprint to find them all. J. Cheminform. 16, 53 (2024).
8. Ramirez-Gaona, M. et al. YMDB 2.0: a significantly expanded version of the yeast metabolome database. Nucleic Acids Res. 45, D440–D445 (2017).
9. Jewison, T. et al. YMDB: the yeast metabolome database. Nucleic Acids Res. 40, D815–D820 (2012).
10. Sajed, T. et al. ECMDB 2.0: a richer resource for understanding the biochemistry of E. coli. Nucleic Acids Res. 44, D495–D501 (2016).
11. Guo, A. C. et al. ECMDB: the E. coli metabolome database. Nucleic Acids Res. 41, D625–D630 (2013).
12. Wishart, D. S. et al. HMDB: the human metabolome database. Nucleic Acids Res. 35, D521–D526 (2007).
13. Wishart, D. S. et al. HMDB 5.0: the human metabolome database for 2022. Nucleic Acids Res. 50, D622–D631 (2022).
14. Lim, E. et al. T3DB: a comprehensively annotated database of common toxins and their targets. Nucleic Acids Res. 38, D781–D786 (2010).
15. Wu, L. et al. TOXRIC: a comprehensive database of toxicological data and benchmarks. Nucleic Acids Res. 51, D1432–D1445 (2023).
16. Awale, M., van Deursen, R. & Reymond, J. L. MQN-mapplet: visualization of chemical space with interactive maps of DrugBank, ChEMBL, PubChem, GDB-11, and GDB-13. J. Chem. Inf. Model. 53, 509–518 (2013).
17. Probst, D. & Reymond, J. L. A probabilistic molecular fingerprint for big data settings. J. Cheminform. 10, 66 (2018).
18. Durant, J. L., Leland, B. A., Henry, D. R. & Nourse, J. G. Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 42, 1273–1280 (2002).
19. Landrum, G. Fingerprints in the RDKit. https://www.rdkit.org/UGM/2012/Landrum_RDKit_UGM.Fingerprints.Final.pptx.pdf (2012).
20. Jolliffe, I. T. & Cadima, J. Principal component analysis: a review and recent developments. Philos. Trans. A Math. Phys. Eng. Sci. 374, 20150202 (2016).
21. Bajusz, D., Racz, A. & Heberger, K. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J. Cheminform. 7, 20 (2015).
22. Wassenaar, P. N. H., Rorije, E., Vijver, M. G. & Peijnenburg, W. Evaluating chemical similarity as a measure to identify potential substances of very high concern. Regul. Toxicol. Pharm. 119, 104834 (2021).
23. Banerjee, S. Methylglyoxal-induced modification of myoglobin: an insight into glycation mediated protein aggregation. Vitam. Horm. 125, 31–46 (2024).
24. Zhu, Z. et al. Acrolein, an endogenous aldehyde induces synaptic dysfunction in vitro and in vivo: involvement of RhoA/ROCK2 pathway. Aging Cell 21, e13587 (2022).
25. Li, Y. et al. Oxidative stress and 4-hydroxy-2-nonenal (4-HNE): implications in the pathogenesis and treatment of aging-related diseases. J. Immunol. Res. 2022, 2233906 (2022).
26. Demir, E. & Marcos, R. Assessing the genotoxic effects of two lipid peroxidation products (4-oxo-2-nonenal and 4-hydroxy-hexenal) in haemocytes and midgut cells of Drosophila melanogaster larvae. Food Chem. Toxicol. 105, 1–7 (2017).
27. Islam, U. L., Moinuddin, B., Mahmood, R. & Ali, A. Genotoxicity and immunogenicity of crotonaldehyde modified human DNA. Int. J. Biol. Macromol. 65, 471–478 (2014).
28. Tong, Z. et al. Accumulated hippocampal formaldehyde induces age-dependent memory decline. Age (Dordr.) 35, 583–596 (2013).
29. Perluigi, M., Coccia, R. & Butterfield, D. A. 4-Hydroxy-2-nonenal, a reactive product of lipid peroxidation, and neurodegenerative diseases: a toxic combination illuminated by redox proteomics studies. Antioxid. Redox Sig. 17, 1590–1609 (2012).
30. Garaycoechea, J. I. et al. Genotoxic consequences of endogenous aldehydes on mouse haematopoietic stem cell function. Nature 489, 571–575 (2012).
31. Stein, S., Lao, Y., Yang, I. Y., Hecht, S. S. & Moriya, M. Genotoxicity of acetaldehyde- and crotonaldehyde-induced 1,N2-propanodeoxyguanosine DNA adducts in human cells. Mutat. Res. 608, 1–7 (2006).
32. Torigoe, T. SIC_ver_1. Zenodo https://doi.org/10.5281/zenodo.17446622 (2025).

Acknowledgements

I thank Ryohei Torigoe for his generous personal financial support that enabled this research. I also appreciate the valuable feedback and English editing assistance provided by Omidreza Heravizadeh, M.Sc., during the preparation of the manuscript.

Author information

Authors and Affiliations

Independent Researcher, Karatsu, Saga, Japan
Taihei Torigoe (鳥越大平)

Contributions

Taihei Torigoe conceived the study, developed the SIC algorithm, performed all computational analyses, created figures and tables, and wrote the manuscript.

Corresponding author

Correspondence to Taihei Torigoe (鳥越大平).

Ethics declarations

Competing interests

The author declares no competing interests.

Peer review

Peer review information

Communications Chemistry thanks Yannick Djoumbou-Feunang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Description of Additional Supplementary Files

Supplementary Data 1

Supplementary Data 2

Supplementary Data 3

Supplementary Data 4

Supplementary Data 5

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Torigoe, T. Structural Isomer Cumulative molecular fingerprinting method (SIC) for standardizing structural isomeric relationships. Commun Chem 8, 406 (2025). https://doi.org/10.1038/s42004-025-01798-3

Received: 31 July 2025
Accepted: 06 November 2025
Published: 12 December 2025
Version of record: 23 December 2025
DOI: https://doi.org/10.1038/s42004-025-01798-3