Abstract
In this study, we develop a quantitative structure-property relationship (QSPR) for amphetamine derivatives based on neighborhood degree-based topological indices and NM-polynomials. By coupling these descriptors to both polynomial regression models and Random Forest algorithms, we analyze the ability of the two methodologies to predict several physicochemical properties (boiling point, enthalpy of vaporization, flash point, molar refractivity, surface tension, polarizability, and surface area). The modeling scheme shows that the neighborhood-based indices carry information on structural complexity, connectivity, and electronic characteristics that is relevant to the behaviour of stimulant-type molecules. Cubic regression models represent nonlinear structure-property relationships better than quadratic ones, while Random Forest further improves efficacy and generalizability, in particular for properties that depend strongly on molecular branching and electronic distribution. In conclusion, the results presented here indicate that NM-polynomial-based descriptors effectively relate molecular topology to experimentally measurable physicochemical behavior, suggesting their use in predictive property modeling, early drug screening, and cheminformatics-driven design.
Introduction
In graph theory, a graph is usually written \(G = (V, E)\), with \(V\) referring to the set of vertices and \(E \subseteq V \times V\) referring to the set of edges. The degree of any vertex \(v \in V\) is denoted as \(d_v\) and represents the number of edges incident on it. In simple undirected graphs, this is simply the number of neighboring vertices1. The set of neighboring vertices of any vertex \(v\) is called the neighborhood of \(v\) and is denoted as \(N(v)\). The neighborhood degree of \(v\) is the sum of the degrees of all vertices in \(N(v)\).
Chemical graph theory is a branch of mathematical chemistry that applies graph theory concepts to represent molecular structures. Atoms are represented as vertices, and bonds are represented as edges2. These molecular graphs allow structural features to be examined independently of physical or quantum representations. Chemical graph theory is used to predict molecular stability, reactivity, and biological activity, with applications in drug discovery, materials research, and cheminformatics. It employs methods typical of classical graph theory, such as connectivity analysis, cycle detection, and subgraph identification. By examining a molecule’s topological features, scientists can deduce important information about chemical behavior and interactions. It thus provides an interface between discrete mathematics and chemical informatics.
Topological indices are quantitative values extracted from the graph based molecular representation, which encode structural information of the molecule. The Wiener index, Zagreb indices, and Randic index are examples of such indices, which are molecular descriptors employed in quantitative structure activity relationships (QSAR) and quantitative structure property relationships (QSPR). They relate structural properties to biological activity, reactivity, or even physical properties. Calculated based on the vertices, edges, and distance of the graph, topological indices are easy to compute yet informative enough, hence useful for classification, property prediction, and library screening of molecules3,4. Topological indices thus find applications in cheminformatics, drug discovery, and toxicology research, being an economical and timely alternative to experimental methods in early-stage molecular studies5,6,7,8,9.
Let \(G = (V, E)\) be a simple connected graph, where \(V(G)\) is the set of vertices and \(E(G)\) is the set of edges. For any vertex \(u\) in \(V(G)\), we denote by \(N(u)\) the open neighbourhood of \(u\), i.e., the set of vertices adjacent to \(u\):
\[N(u) = \{v \in V(G) : uv \in E(G)\}.\]
The neighbourhood degree of a vertex, denoted by \(\delta (u)\), is defined as
\[\delta (u) = \sum _{v \in N(u)} d_v,\]
where \(d_v\) is the degree of \(v\), i.e., the number of edges incident on it. This sums the degrees of all neighbors of \(u\); since \(N(u)\) is the open neighbourhood, \(u\) itself contributes no term. For any edge \(uv \in E(G)\), define \(\delta _u = \delta (u)\) and \(\delta _v = \delta (v)\). These are employed to define numerous neighborhood-based topological indices. In a directed graph, one has to distinguish between the in-degree and the out-degree (the numbers of incoming and outgoing edges, respectively). The order of a graph is the number of vertices that it contains, and its size is the number of edges. The neighborhood M-polynomial (NM-polynomial) of a graph G is given by:
\[M(G, x, y) = \sum _{i \le j} m(i, j)\, x^{i} y^{j},\]
where m(i, j) counts the number of edges \(uv \in E(G)\) such that the pair \(\{\delta _u, \delta _v\} = \{i,j\}\).
The common form of a neighborhood degree-based topological index is
\[I(G) = \sum _{uv \in E(G)} f(\delta _u, \delta _v),\]
where \(f(\delta _u, \delta _v)\) is a function employed to specify particular indices.
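As a small worked illustration (an example added here, not taken from the dataset), consider the path on four vertices \(v_1 v_2 v_3 v_4\), the carbon skeleton of n-butane. Writing \(d_v\) for the ordinary degree of \(v\), the neighborhood degrees, the edge counts \(m(i, j)\), and the resulting polynomial can be traced step by step. The choice \(f(\delta _u, \delta _v) = \delta _u + \delta _v\) in the last line is just one possible edge function.

\[
\begin{aligned}
(d_{v_1}, d_{v_2}, d_{v_3}, d_{v_4}) &= (1, 2, 2, 1),\\
\delta _{v_1} &= d_{v_2} = 2, \quad \delta _{v_2} = d_{v_1} + d_{v_3} = 3, \quad \delta _{v_3} = d_{v_2} + d_{v_4} = 3, \quad \delta _{v_4} = d_{v_3} = 2,\\
m(2, 3) &= 2 \ (\text{edges } v_1 v_2 \text{ and } v_3 v_4), \quad m(3, 3) = 1 \ (\text{edge } v_2 v_3),\\
M(P_4, x, y) &= 2x^{2}y^{3} + x^{3}y^{3},\\
\sum _{uv \in E(P_4)} (\delta _u + \delta _v) &= 2(2 + 3) + (3 + 3) = 16.
\end{aligned}
\]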
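Neighborhood first Zagreb index:
\[M_1(G) = \sum _{uv \in E(G)} (\delta _u + \delta _v)\]
This index sums the neighborhood degrees of the end vertices over all edges.
Neighborhood second Zagreb index:
\[M_2(G) = \sum _{uv \in E(G)} \delta _u \delta _v\]
This descriptor is based on the product of the neighborhood degrees.
Neighborhood forgotten index:
\[FN(G) = \sum _{uv \in E(G)} (\delta _u^{2} + \delta _v^{2})\]
This index sums the squares of the neighborhood degrees.
Third neighborhood \(ND_3\) index:
\[ND_3(G) = \sum _{uv \in E(G)} \delta _u \delta _v (\delta _u + \delta _v)\]
This index combines the sum and the product of the neighborhood degrees.
Fifth neighborhood \(ND_5\) index:
\[ND_5(G) = \sum _{uv \in E(G)} \left( \frac{\delta _u}{\delta _v} + \frac{\delta _v}{\delta _u} \right)\]
This index focuses on the ratios between neighborhood degrees.
Neighborhood harmonic index:
\[NH(G) = \sum _{uv \in E(G)} \frac{2}{\delta _u + \delta _v}\]
It is derived from the harmonic mean of the neighborhood degrees.
Neighborhood inverse sum index:
\[NI(G) = \sum _{uv \in E(G)} \frac{\delta _u \delta _v}{\delta _u + \delta _v}\]
This index balances the product of the neighborhood degrees against their sum, as shown in Table 1.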
Where:
The NM-polynomial can be used as an efficient and concise representation of a molecule’s graph properties within computational chemistry and cheminformatics. It simplifies the calculation of numerous indices and helps reveal patterns within and between different molecular families. For this reason, the NM-polynomial is an important component of both theoretical research and real-world applications, e.g., drug design, materials science, and QSPR modeling.
Colakoglu et al.10 studied the M-polynomial and the NM-polynomial of molecular graphs of drugs against monkeypox. Their findings advance chemical graph theory through the use of topological descriptors in the analysis of drug-like molecules. Altassan et al.11 proposed and studied entire neighborhood topological indices of molecular graphs with the aim of better predicting physicochemical properties. Their research supports the use of these indices as predictors for modeling and correlating structural characteristics with chemical behavior. Hasani et al.12 used topological indices to model QSPR relationships of drugs employed in the therapy of pyelonephritis. Through MATLAB-based analysis, their research shows the value of graph-theoretical descriptors for efficiently predicting drug properties. Pradeepa and Arathi13 performed a computational analysis and topological characterization of benzenoid networks relevant to drug discovery and development. Their research focuses on how structural features of benzenoid systems can inform molecular design and pharmacophore profiling. Das and Kumari14 investigated degree-based coindices of Molnupiravir and compared its QSPR analysis with that of other drugs used to treat COVID-19. Their study shows how graph-based metrics can be used to understand and forecast drug action through molecular structure analysis. Nagesh and Kumar15 investigated the M-polynomial and the degree-based topological indices of the Dandelion graph. Their research offers valuable information on the structural features of this class of graphs, which is useful in designing analysis tools in chemical graph theory.
Figure 1 describes the workflow for predicting chemical properties through computational methods. Chemical structures and properties are gathered, the structures are transformed into molecular graphs, and topological indices are computed from these graphs. The indices are then used in the regression and Random Forest algorithms to generate the predicted properties.
Amphetamine derivatives
Diethylpropion, or amfepramone, is a drug primarily used as an appetite suppressant in the treatment of obesity. It stimulates the central nervous system to reduce hunger signals, as amphetamines do16. The structure of amfepramone comprises a phenyl ring attached to a diethylamino-substituted propanone side chain. In chemical graph theory, the molecule is represented as a graph in which every atom is a vertex and every chemical bond is an edge. The molecular graph forms the basis for calculating topological indices, which represent the structural and physicochemical features of the compound. The indices facilitate quantitative structure-activity relationship studies, which help in the analysis and optimization of pharmacological efficacy and safety.
Amphetamine is a central nervous system stimulant mainly used to treat attention deficit hyperactivity disorder and narcolepsy. It acts by elevating levels of specific neurotransmitters in the brain, specifically dopamine and norepinephrine17. In chemical graph theory, amphetamine can be represented as a molecular graph with atoms as vertices and covalent bonds as edges. The amphetamine molecule features one phenyl ring bonded to an α-methylated ethylamine chain, which forms the basis of its bioactivity. The molecular graph can be studied using topological indices to analyze molecular properties such as reactivity, stability, and bioactivity potential. The descriptors aid in understanding the structure-activity relationship and in drug design for therapeutic applications.
Benzylpiperazine is a psychoactive synthetic stimulant that is used recreationally and has at times been studied for possible antidepressant activity. It mimics amphetamine-like action by increasing the release of dopamine and serotonin18. Structurally, benzylpiperazine has a benzyl moiety connected to a piperazine ring. In the molecular graph, vertices represent the carbon, hydrogen, and nitrogen atoms, while edges represent the chemical bonds. Topological indices quantify structural features that reflect pharmacological performance, which helps characterize molecular behavior and supports drug optimization and safety profiling through computational modelling in chemical graph theory.
Clobenzorex is an anorexigenic medication used to induce weight loss by reducing appetite, mainly by stimulating the secretion of norepinephrine. It is chemically analogous to amphetamines and is metabolized to active metabolites within the body19. Clobenzorex can be expressed as a molecular graph in which the atoms are the vertices and the bonds are the edges. Its molecular structure consists of a chlorophenyl group and a nitrogen-containing aliphatic chain, which are relevant to its pharmacophore activity. Topological indices calculated from this graph reflect molecular connectivity and molecular complexity. The indices help in the prediction of the drug’s physical and biological nature, aiding drug design, efficacy evaluation, and side effect prediction.
Ethylamphetamine is a drug closely related structurally to amphetamine, historically used to treat loss of appetite as well as narcolepsy, but now prescribed infrequently. It works by elevating brain levels of dopamine and norepinephrine, eliciting increased alertness and reduced appetite20. As a molecule, ethylamphetamine can be represented as a graph in which each atom is a node and each covalent bond is an edge. The ethyl substitution perturbs the graph structure only slightly relative to amphetamine, yet affects pharmacologically important properties. Topological indices based on its graph facilitate the evaluation of the consequences of such structural changes on chemical behaviour. Such metrics are useful for investigating structure-activity relationships, potential therapeutic benefit, and risk.
Fenproporex is prescribed as an anorectic medication for weight loss and is metabolized into amphetamine within the body. The prodrug elevates levels of norepinephrine and dopamine, reducing appetite. Its chemical structure contains a phenyl ring, a propyl chain, and a secondary amine, which are necessary for the drug’s stimulating property21. When the molecule is represented as a molecular graph, the atoms are the vertices and the chemical bonds are the edges, which is the starting point of topological analysis. Degree-based and distance-based indices computed from this graph reveal the size, branching, and electronic structure of the molecule. These descriptors aid drug characterization and optimization through computational methods in chemical graph theory.
Furfenorex was used as an appetite suppressant but was discontinued due to its abuse potential. The drug stimulates the central nervous system, elevating alertness and suppressing hunger22. Furfenorex contains a furan ring and a substituted amine moiety, which determine its pharmacological activity. In its molecular graph, vertices represent the atoms and edges represent the chemical bonds between them. This graph makes it possible to calculate topological indices, which represent the complexity, connectivity, and branching of the molecule. These indices are helpful in structure-activity relationship studies and aid in assessing the drug’s efficiency, stability, and interactions in silico.
Furosemide is one of the most commonly used drugs to treat fluid retention and hypertension by blocking the kidneys’ reabsorption of sodium and chloride. Its structure comprises a sulfonamide, carboxylic acid, and aromatic ring, contributing to the drug’s biological activity. The molecular graph of furosemide is formed by depicting each atom as a vertex and each bond as an edge, capturing the chemical topology of the compound23. An analysis of this graph by topological indices quantifies structural complexity, electronic conformation, and branching molecular structure. These indices are instrumental in predicting drug interactions and pharmacokinetics, and chemical graph theory thus constitutes an important aid in drug design in the modern era.
Mefenorex is an anorectic agent used to facilitate weight loss through stimulation of the central nervous system. It is structurally related to amphetamine and acts as a prodrug, being metabolized to active amphetamine within the body24. The molecular structure features both an aromatic ring and an alkylamine chain, which are necessary for the drug’s stimulating activity. Mefenorex can be represented as a molecular graph with atoms as vertices and covalent bonds as edges, which allows the use of topological indices that capture structural features such as degree, path length, and connectivity. These descriptors are useful in drug profiling, as they aid in the prediction of molecular interactions and the resulting biological behavior.
Methcathinone is structurally related to cathinone, a psychoactive stimulant with central nervous system-stimulating effects. Methcathinone has been used recreationally but is controlled because of its high abuse potential25. The molecule has both ketone and aromatic ring structures, which are responsible for its stimulating activity. In chemical graph theory, methcathinone is represented as a molecular graph with atoms as vertices and bonds as edges. Topological indices of this graph assist in estimating molecular properties such as lipophilicity, electronic distribution, and structural complexity. These are important in establishing the pharmacodynamics of the drug as well as in computational screening in medicinal chemistry.
Methylphenidate is a central nervous system stimulant widely prescribed to treat attention deficit hyperactivity disorder and narcolepsy. It acts by inhibiting dopamine and norepinephrine reuptake, increasing the availability of these neurotransmitters in the brain. The structure features a piperidine ring, an ester linkage, and a phenyl ring, which are crucial to the drug’s activity26. In chemical graph theory, the structure is translated into a molecular graph with atoms as vertices and chemical bonds as edges. The graph is studied with several topological indices that measure molecular branching, connectivity, and symmetry, which are used to predict therapeutic activity and optimize molecular features in the course of drug design.
Phentermine is used to facilitate weight loss by reducing appetite and energy expenditure. It induces the release of norepinephrine and other neurotransmitters, which decrease the perception of hunger27. Phentermine has a phenyl ring and an aliphatic amine chain, structural features common to central nervous system stimulants. In chemical graph theory, the molecular graph of phentermine consists of atoms as vertices and bonds as edges. The representation is useful in the computation of topological indices describing molecular size, shape, and bonding arrangements. The indices are used extensively in quantitative structure-activity relationship studies to obtain information on drug behavior, receptor binding affinity, and side effects.
Selegiline is a selective monoamine oxidase B inhibitor prescribed to treat Parkinson’s disease and major depressive disorder. It acts by blocking the degradation of dopamine within the brain, thus increasing its activity levels28. The structure features a propargylamine moiety, an aromatic ring, and a flexible chain. In chemical graph theory, this structure is expressed as a molecular graph with atoms as the nodes and chemical bonds as the edges. From this molecular graph, topological indices are calculated to capture molecular branching and electronic distribution features, among others. These are important for understanding the mechanism of drug action and for optimizing structure-based drug design procedures.
Recent developments in computational chemistry, bioinformatics, and intelligent health care show that machine learning is rapidly expanding its applications in molecular design prediction, drug discovery, and biomedical data security. In chemical engineering, customized ML models have been used to reveal reaction mechanisms, for example improving the reactivity of amine substrates towards hexaazaisowurtzitane cages so that energetic material synthesis can be controlled more accurately29. Deep learning architectures have also been proposed for bio-sequence modeling, such as AutoFE-Pointer, an auto-weighted feature extractor coupled with pointer networks that predicts missing DNA methylation values more accurately30. Privacy-preserving paradigms are also being developed, such as anonymous-enhanced multi-signer ring signatures that enable secure sharing of medical data31. In addition, machine learning-based meta-analyses have contributed to precision medicine, such as the optimization of Xiao-Chai-Hu decoction regimens for liver diseases32. Computational models also support disease-gene association studies, including hybrid frameworks based on gradient boosting and logistic regression that predict miRNA-disease associations33, and matrix-factorization-based approaches that are practical for drug repurposing tasks, particularly in emergency situations like COVID-1934.
Collective pairwise classification systems coupling matrix completion and ridge regression also improve anticancer drug-response prediction through the fusion of multi-source pharmacogenomic signals35. Structure-based design and theoretical studies have advanced both agrochemical and drug discovery, for example nicotinamide derivatives as candidate succinate dehydrogenase inhibitors36. In clinical applications, broad-attention Transformer architectures show significant performance gains in seizure prediction37. Computational integration has also added value to toxicological studies, such as network toxicology and molecular docking approaches revealing links between environmental contaminants and targets associated with cardiovascular disease38. In addition, stereochemistry-aware deep models with contrastive cross-attention have been advanced for 3D drug-drug interaction prediction39, and novel NLP architectures have been proposed for Chinese medical named entity recognition, benefiting downstream biomedical text mining40. Taken together, these studies highlight the integration of ML-based modeling, structural informatics, and biomedical intelligence, aiming at better predictive power, safer information sharing, and faster therapeutic discovery.
We denote the chemical structures by \(G_i\), where \(i = 0, 1, \ldots , 12\), and the molecular structures by \(MG_i\), where \(i = 0, 1, \ldots , 12\). Chemical and molecular structures are shown in Fig. 2. The computed topological indices are shown in Table 2. The physicochemical properties are shown in Table 3.
Theorem 1
Let \(G_0\) be the molecular graph of Amfepramone. Then, the NM polynomial of \(G_0\) is as follows: \(M(G_{0},x,y)=x^{2} y^{4} \left( 3 x^{5} y^{3} + 2 x^{3} y^{3} + 2 x^{2} y^{3} + 2 x^{2} y + 2 x^{2} + 2 x y^{3} + 2\right)\)
Proof
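The NM polynomial is calculated from the edge partition according to the neighborhood degrees of adjacent vertices. We count the edges \(E_i(a, b)\) joining vertices of neighborhood degrees \(a\) and \(b\) in the graph \(G_0\) as follows: \(|E_1{(3,7)}|\) = 2, \(|E_2{(7,7)}|\) = 3, \(|E_3{(4,7)}|\) = 2, \(|E_4{(5,7)}|\) = 2, \(|E_5{(4,5)}|\) = 2, \(|E_6{(2,4)}|\) = 2, \(|E_7{(4,4)}|\) = 2.
Putting these into the general form of the NM polynomial:
\[M(G_{0},x,y) = 2x^{3}y^{7} + 3x^{7}y^{7} + 2x^{4}y^{7} + 2x^{5}y^{7} + 2x^{4}y^{5} + 2x^{2}y^{4} + 2x^{4}y^{4} = x^{2} y^{4} \left( 3 x^{5} y^{3} + 2 x^{3} y^{3} + 2 x^{2} y^{3} + 2 x^{2} y + 2 x^{2} + 2 x y^{3} + 2\right).\]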
\(\square\)
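The edge partition used in this proof can be reproduced with a short script. The following is a minimal sketch, assuming the hydrogen-suppressed graph of amfepramone (2-(diethylamino)-1-phenylpropan-1-one) with a vertex numbering chosen here for illustration, and assuming that the neighborhood degree of a vertex is the sum of the degrees of its neighbors. The function names are hypothetical and not part of the original study.

```python
from collections import Counter

# Hydrogen-suppressed graph of amfepramone. The vertex numbering is an
# assumption made for this sketch:
# 1-6 phenyl ring, 7 carbonyl C, 8 carbonyl O, 9 alpha C, 10 methyl C,
# 11 N, 12-13 first ethyl group, 14-15 second ethyl group.
EDGES = [
    (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1),  # aromatic ring
    (1, 7), (7, 8), (7, 9), (9, 10), (9, 11),        # side chain
    (11, 12), (12, 13), (11, 14), (14, 15),          # N-ethyl groups
]


def neighborhood_degrees(edges):
    """Return {vertex: sum of the degrees of its neighbours}."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    degree = {v: len(nbrs) for v, nbrs in adjacency.items()}
    return {v: sum(degree[w] for w in nbrs) for v, nbrs in adjacency.items()}


def edge_partition(edges):
    """Count edges by the sorted pair (i, j) of end-vertex neighbourhood degrees."""
    delta = neighborhood_degrees(edges)
    return Counter(tuple(sorted((delta[u], delta[v]))) for u, v in edges)


def nm_polynomial_terms(partition):
    """Format the partition as the sum of terms m(i, j) x^i y^j."""
    return " + ".join(f"{m}x^{i}y^{j}" for (i, j), m in sorted(partition.items()))


partition = edge_partition(EDGES)
polynomial = nm_polynomial_terms(partition)
print(dict(sorted(partition.items())))
print(polynomial)
```

Under these assumptions the script yields the same seven edge classes and multiplicities as listed in the proof, which also serves as a check on the neighborhood degree definition.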
Theorem 2
The topological indices for \(G_0 = \textit{Amfepramone}\) are as follows: \(M_1\) = 154, \(M_2\) = 5785, FN = 874, \(M_2^{nm}\) = 10.7301, \(ND_3\) = 4646, \(ND_5\) = 33.4952, NH = 3.1367, NI = 36.7354.
Proof
The NM-polynomial obtained in Theorem 1 and the edge partition as in Table 1 are used to compute the indices as follows:
\(\square\)
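Several of the values in Theorem 2 can be checked directly from the edge partition given in the proof of Theorem 1. The sketch below assumes the standard edge-function forms of six of the indices (sum, sum of squares, product times sum, ratio sum, harmonic, and inverse sum). It is restricted to these six for brevity, and the dictionary and function names are hypothetical.

```python
# Edge partition of G_0 from the proof of Theorem 1: {(i, j): m(i, j)}
PARTITION = {
    (3, 7): 2, (7, 7): 3, (4, 7): 2, (5, 7): 2,
    (4, 5): 2, (2, 4): 2, (4, 4): 2,
}

# Edge functions f(i, j). These formulas are assumed standard forms,
# not copied from the (unavailable) Table 1.
INDEX_FUNCTIONS = {
    "M1": lambda i, j: i + j,             # neighborhood first Zagreb
    "FN": lambda i, j: i**2 + j**2,       # neighborhood forgotten
    "ND3": lambda i, j: i * j * (i + j),  # third neighborhood index
    "ND5": lambda i, j: i / j + j / i,    # fifth neighborhood index
    "NH": lambda i, j: 2 / (i + j),       # neighborhood harmonic
    "NI": lambda i, j: i * j / (i + j),   # neighborhood inverse sum
}


def topological_index(partition, f):
    """Sum f(i, j) over all edges, weighting each class by its edge count."""
    return sum(m * f(i, j) for (i, j), m in partition.items())


indices = {name: topological_index(PARTITION, f)
           for name, f in INDEX_FUNCTIONS.items()}

for name, value in indices.items():
    print(f"{name}: {value:.4f}")
```

With these assumed formulas, the computed \(M_1\), FN, \(ND_3\), \(ND_5\), NH, and NI agree with the values stated in Theorem 2 to the reported precision.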
The NM-polynomials for other drug structures from \(G_{1}\) to \(G_{12}\) can be obtained similarly to the proof of Theorem 1 and are stated below and indices are shown in Table 2.
\(G_1\) = Amphetamine; \(G_2\) = Benzylpiperazine; \(G_3\) = Clobenzorex; \(G_4\) = Ethylamphetamine; \(G_5\) = Fenproporex; \(G_6\) = Furfenorex; \(G_7\) = Furosemide; \(G_{8}\) = Mefenorex; \(G_{9}\) = Methcathinone; \(G_{10}\) = Methylphenidate; \(G_{11}\) = Phentermine; \(G_{12}\) = Selegiline; we have:
Similarly, we compute topological indices for different drugs as shown in Table 2.
The structural interpretation and physicochemical relevance of each index are summarized in Table 4.
Multicollinearity assessment of topological descriptors
To ensure the reliability of regression models and the independence of topological descriptors, we conducted a comprehensive multicollinearity assessment. This process involved three core methodologies: Pearson correlation analysis, Variance Inflation Factor (VIF) computation, and Principal Component Analysis (PCA). These analyses guided the descriptor selection strategy and provided insights into the structural redundancy among the indices.
Quantifying multicollinearity: pearson correlation and VIF
We began by calculating the pairwise Pearson correlation matrix for all seven neighborhood-based indices (\(M_1\), \(M_2\), FN, \(ND_3\), \(ND_5\), NH, NI). Nearly all correlations exceeded \(|r| = 0.90\), indicating strong redundancy. The most pronounced correlation was between \(M_1\) and NI (\(r = 0.9993\)), with similarly high values for \(FN \leftrightarrow ND_3\) (\(r = 0.9959\)), \(M_1 \leftrightarrow M_2\) (\(r = 0.9908\)), and \(M_1 \leftrightarrow FN\) (\(r = 0.9882\)). These findings, visualized in Fig. 3, suggest that the indices capture similar structural attributes.
We further quantified this redundancy using the Variance Inflation Factor (VIF), which measures how much the variance of a regression coefficient is inflated by multicollinearity. All indices exhibited VIF values well above the commonly used threshold of 10, with \(M_1\) reaching \(8.92 \times 10^7\) and most others exceeding \(10^5\) (Fig. 4, Table 5). These extreme values confirm that simultaneous inclusion of all indices in regression models would lead to unstable coefficient estimates.
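The two diagnostics described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the descriptor matrix is hypothetical (it is not Table 2), and the VIF is computed from its textbook definition \(VIF_j = 1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing descriptor \(j\) on the remaining descriptors. The function names are assumptions.

```python
import numpy as np


def pearson_matrix(X):
    """Pairwise Pearson correlations between the columns of X."""
    return np.corrcoef(X, rowvar=False)


def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on all other columns."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other descriptors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        residual = y - A @ coef
        r2 = 1.0 - (residual @ residual) / np.sum((y - y.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs


# Hypothetical descriptor matrix with 13 rows (one per molecule):
# column 1 is almost exactly twice column 0, column 2 is unrelated to both.
a = np.arange(1.0, 14.0)
p = np.array([1, 1, -1, -1] * 3 + [1], float)  # small perturbation pattern
c = np.array([1, -1] * 6 + [1], float)         # alternating, unrelated column
X = np.column_stack([a, 2.0 * a + 0.1 * p, c])

corr = pearson_matrix(X)
vif = variance_inflation_factors(X)
print(np.round(corr, 4))
print(vif)
```

In this toy matrix the two nearly proportional columns receive very large VIF values while the unrelated column stays close to 1, which mirrors the pattern of near-redundant descriptors reported for the seven indices.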
Dimensionality reduction via PCA
To explore the underlying structure among the indices, PCA was applied to the standardized descriptor matrix. The first principal component (PC1) explained 95.95% of the total variance, with uniform positive loadings across all indices (ranging from 0.354 to 0.385). This indicates that the indices largely measure the same latent construct, interpreted as molecular topological complexity. The second component (PC2) added only 3.56% variance, while the remaining components contributed negligibly.
The strongest PC1 loadings were associated with \(M_1\) and NI (0.385), reinforcing the decision to select \(M_1\) as the primary descriptor as shown in Table 6. These findings substantiate that the seven indices form a cohesive family reflective of a common structural dimension.
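A PCA on standardized descriptors amounts to an eigendecomposition of their correlation matrix. The sketch below shows this on a hypothetical matrix of three strongly correlated columns (not the data of Table 2); the function name and the sign convention are assumptions. When \(p\) descriptors are almost perfectly correlated, the PC1 loadings are expected to be close to \(1/\sqrt{p}\); for seven indices this is about 0.378, which lies within the reported range of 0.354 to 0.385.

```python
import numpy as np


def pca_on_standardized(X):
    """PCA of column-standardized data.

    Returns the explained-variance ratios (descending) and the loading
    matrix whose columns are the principal axes.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = (Z.T @ Z) / (Z.shape[0] - 1)       # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Eigenvector signs are arbitrary; make each column sum non-negative.
    signs = np.sign(eigvecs.sum(axis=0))
    signs[signs == 0] = 1.0
    return eigvals / eigvals.sum(), eigvecs * signs


# Hypothetical descriptor matrix: three nearly collinear columns, 13 rows.
a = np.arange(1.0, 14.0)
p = np.array([1, 1, -1, -1] * 3 + [1], float)
c = np.array([1, -1] * 6 + [1], float)
X = np.column_stack([a, 2.0 * a + 0.1 * p, a + 0.1 * c])

ratios, loadings = pca_on_standardized(X)
print(np.round(ratios, 4))
print(np.round(loadings[:, 0], 3))
```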
Descriptor selection strategy for modeling
Due to the severe multicollinearity and high shared variance, we adopted a tailored descriptor usage strategy:
- Polynomial regression: We used \(M_1\) exclusively as the input descriptor. Its strong correlation with the other indices and the highest PC1 loading ensured that it effectively captured the shared variance. Moreover, \(M_1\)’s interpretability as the First Zagreb index made it a suitable choice for chemically meaningful regression models.
- Random forest modeling: All seven indices were retained. Ensemble methods like Random Forests are robust to multicollinearity and benefit from the diversity in nonlinear feature interactions. This allowed us to exploit the full structural nuance of the descriptor set without compromising model stability.
- Validation: The remaining six indices served as a form of internal validation. Their high correlations with \(M_1\) ensured that similar trends were reflected across the descriptor set. Correlations between physicochemical properties and each index confirmed consistent patterns, reinforcing model robustness.
Regression models
We use regression models to determine how the independent and dependent variables relate in our data set. Regression analysis is one of the most powerful statistical methods employed to model and analyze how a response variable varies when related to one or more predictor variables. Fitting various regression forms, such as linear, quadratic, and cubic, helps us identify simple as well as complex patterns in the data. Regression helps us identify hidden patterns, makes forecasts, and aids in data driven decision making.
The quadratic regression model is a form of polynomial regression that describes the relationship between the independent variable and the dependent variable by a second-degree polynomial of the form \(y = ax^2 + bx + c\). This model is useful when the data follow a curved shape, like a parabola, that cannot be well represented by a simple linear model. Quadratic regression can accommodate one turning point, either a maximum or a minimum, so it can model situations where the dependent variable rises or falls with the independent variable and then changes direction. It is used in fields like physics, economics, and biology, where such nonlinear relationships are common.
Cubic regression extends polynomial regression to a third-degree equation of the form \(y = ax^3 + bx^2 + cx + d\). The cubic model can capture more intricate patterns in data, with up to two turning points and one inflection point at which the curvature changes direction. Because of this, cubic regression is best used when the data cannot be modeled well by either linear or quadratic equations. It is particularly useful when the dependent variable rises and falls more than once as the independent variable increases. Cubic regression is used in fields such as engineering, environmental science, and economics.
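The comparison between the two model forms can be sketched with NumPy's least-squares polynomial fit. The data below are hypothetical (a noise-free cubic on a symmetric grid, not values from Table 2 or Table 3), and `fit_polynomial` is a helper name introduced only for this illustration.

```python
import numpy as np


def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit.

    Returns the coefficients (highest power first), R, and R^2 computed
    on the same data used for fitting.
    """
    coeffs = np.polyfit(x, y, degree)
    predicted = np.polyval(coeffs, x)
    ss_res = np.sum((y - predicted) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return coeffs, np.sqrt(max(r2, 0.0)), r2


# Hypothetical data: 13 points on an exactly cubic curve.
x = np.linspace(-3.0, 3.0, 13)
y = x**3 - 2.0 * x

quad_coeffs, quad_r, quad_r2 = fit_polynomial(x, y, 2)
cubic_coeffs, cubic_r, cubic_r2 = fit_polynomial(x, y, 3)

print(f"quadratic: R^2 = {quad_r2:.3f}")
print(f"cubic:     R^2 = {cubic_r2:.3f}")
```

Because the quadratic model is nested within the cubic one, the in-sample \(R^2\) of the cubic fit can never be lower than that of the quadratic fit. A higher training \(R^2\) alone therefore does not show that the cubic term is warranted, which is why an out-of-sample check is needed. For raw descriptor values in the hundreds, centering or scaling \(x\) before fitting may also improve numerical conditioning.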
To assess model generalization, Leave-One-Out Cross-Validation (LOOCV) was performed for all regression models. LOOCV is well suited to small datasets \((n < 20)\), providing nearly unbiased, though high-variance, estimates of prediction error. For each molecule, the model was trained on the remaining 12 molecules and tested on the held-out molecule. This process was repeated for all 13 molecules. Given the severe multicollinearity among neighborhood indices (VIF \(> 10^{5}\), PC1 explains 95.95% of the variance), we report LOOCV \(Q^{2}\) values for \(M_{1}\) as the representative descriptor. Other indices yield similar \(Q^{2}\) patterns due to their high intercorrelation (\(r > 0.90\)).
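The LOOCV procedure described above can be sketched as follows. The formula \(Q^2 = 1 - \mathrm{PRESS}/SS_{tot}\), with \(SS_{tot}\) taken about the overall mean, is a common convention and is assumed here, since the text does not state the exact definition used. The data are hypothetical and `loocv_q2` is an illustrative helper name.

```python
import numpy as np


def loocv_q2(x, y, degree):
    """Leave-one-out Q^2 = 1 - PRESS / SS_tot for a polynomial model."""
    n = len(x)
    press = 0.0
    for i in range(n):
        train = np.arange(n) != i                    # hold out sample i
        coeffs = np.polyfit(x[train], y[train], degree)
        prediction = np.polyval(coeffs, x[i])
        press += (y[i] - prediction) ** 2            # squared held-out error
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - press / ss_tot


# Hypothetical data: 13 points on an exactly quadratic curve.
x = np.linspace(-3.0, 3.0, 13)
y = 0.5 * x**2 + x + 1.0

q2_linear = loocv_q2(x, y, 1)
q2_quadratic = loocv_q2(x, y, 2)
print(f"Q^2 (linear):    {q2_linear:.3f}")
print(f"Q^2 (quadratic): {q2_quadratic:.3f}")
```

Unlike the training \(R^2\), \(Q^2\) can decrease when an unnecessary higher-order term is added, and it can even be negative, so it is the more informative quantity when choosing between quadratic and cubic forms on 13 molecules.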
Table 7 shows the regression models relating the physicochemical properties to the topological predictor \({M_1(G)}\). Both quadratic and cubic model forms were fitted to Boiling Point (BP), Enthalpy of Vaporization (EV), Flash Point (FP), Molar Refractivity (MR), Surface Area (SA), Polarizability (P), Surface Tension (ST), and Molar Volume (MV). The cubic forms improve the fit for every property, with greater R and \(R^2\) values than the quadratic forms. For example, BP had a higher correlation in the cubic form \((R = 0.964, R^2 = 0.929)\) than in the quadratic form \((R = 0.935, R^2 = 0.875)\). EV and FP also had high coefficients of determination in the cubic form, which reflects more complex relationships being captured by the higher-order term. Properties such as MV had comparatively low values (\(R^2<0.4\)), reflecting weak predictive ability regardless of model complexity. Figure 5 visually supports these results. The cubic curve (green) tracks the scatter of observed data points as closely as or more closely than the quadratic curve (blue) for BP, EV, and FP, whereas for properties such as SA or MV not even the cubic curve describes the data spread correctly, possibly due to high variability or nonlinear effects not captured by the polynomial terms. To balance model parsimony and interpretability, we routinely compared the predictive power of quadratic and cubic regression models for each physicochemical property. The best-fitting properties, such as BP, reach \(R^2>0.85\), confirming the existence of consistent structure-property relationships. Nevertheless, SA, ST, and MV have very low \(R^2\) values \((<0.40)\), which indicates poor correlations and high variability.
These results highlight that while higher-order models may improve the fit for certain properties, quadratic forms are preferable for preserving parsimony and avoiding overfitting, particularly when the sample size is constrained. In future work, we will add more data to the dataset and integrate more rigorous validation (e.g., nested cross-validation) to improve generalization.
Table 8 assesses the statistical fit of \({M_2(G)}\), a topological predictor, versus several drug response parameters under both quadratic and cubic regression analysis. The data consistently present high R and \(R^2\) values, particularly for BP, EV, and FP, which are strong indicators of good model fits. An example is the quadric fit of BP, which reaches R = 0.944, \(R^2\) = 0.891, while the cubic equation elevates this to R = 0.963, \(R^2\) = 0.928. The pattern follows through with EV and FP, where cubic fits tend to surpass quadric fits, reflecting the increased predictive gain with higher-order terms. Nonetheless, metrics like MV exhibit weak relations (\(R^2=0.34-0.37\)) which imply limited predictive capacity of M2(G) on volume-related responses. Nevertheless, the majority of metrics retain high F-statistics and low p-values \((p < 0.01)\) which ensure reliability of the models. The only exception is MV, which has large standard errors and p-values (p > 0.05) reflecting non-significant model fits, which reflect either inherent variability or lack of dependence on \({M_2(G)}\). Figure 6 confirms the quantitative findings. The cubic curve (green) follows the actual data distribution (red) better than the quadratic curve (blue) for the majority of the parameters, particularly for situations with slight curvature. Nevertheless, the visual fit of MV is still weak, as observed, which verifies the analysis performance of the model statistically.
In general, \({M_2(G)}\) has excellent predictive capacity for nearly all of the properties considered, especially when modeled cubically. These results indicate that \({M_2(G)}\) may be useful as a topological descriptor, specifically for the prediction of BP, EV, and FP.
Table 9 examines the association between the physicochemical properties and FN(G) using cubic and quadratic models. It shows strong associations for BP, EV, and FP, as indicated by \(R^2\) values greater than 0.84 for both kinds of models. The cubic models perform slightly better than the quadratic models, with the biggest differences for EV (\(R^2\) = 0.884 to 0.923) and BP (\(R^2\) = 0.861 to 0.889), which implies the presence of underlying non-linear trends. Moderate predictability exists for MR, SA, P, and ST \((0.56 \le R^2 \le 0.67)\), with cubic models providing modest improvements. MV is the most poorly predicted property (\(R^2=0.28-0.29\)), with weak statistical significance \((p > 0.05)\). Standard errors are generally lower in the better models, indicating higher reliability. Figure 7 provides supporting visual evidence. The cubic models (green) tend to fit the data distribution better than the quadratic models (blue) in most cases, particularly for BP and EV. The models diverge more considerably for properties such as MV, indicating lack of fit. FN(G) is a reliable predictor of several properties, most notably BP, EV, and FP. The outcome reaffirms the utility of cubic terms for capturing intricate structure-property relationships, but identifies limitations in the prediction of volume-related properties.
Table 10 reports the regression analysis between \(ND_3(G)\) and the physicochemical properties. As with the previous predictors, the greatest explanatory power is found for BP, EV, and FP, where \(R^2\) values exceed 0.80 and cubic fits improve slightly (for instance, EV: \(R^2\) increases from 0.867 to 0.903). These values indicate excellent predictive ability. Properties such as MR, SA, P, and ST produce moderate \(R^2\) values \((\approx 0.50-0.67)\), which reflect some predictive ability. MV is still weakly related to the descriptor (\(R^2 \approx 0.23-0.25\)), with large p-values and standard errors reflecting lack of significance and unreliable predictions. Figure 8 depicts the regression fits, with the cubic models (green) fitting the non-linear trends for EV and FP better. The distinction is slight for MR and SA, and MV continues to be poorly predicted. \(ND_3(G)\) is an effective descriptor for the main thermodynamic properties (BP, EV, and FP), particularly when higher-order polynomial terms are used. Nevertheless, because of its weak correlation with MV, \(ND_3(G)\) is unlikely to be able to predict every property.
Table 11 highlights the close relationship of \(ND_5(G)\) with virtually every physicochemical property. BP, EV, and FP have high \(R^2\) values (up to 0.954) and low standard errors in the cubic models. The cubic term strongly enhances the explanatory power, for example improving \(R^2\) for EV from 0.851 to 0.954. Moderate predictability exists for MR, SA, P, and ST (\(R^2\) between 0.50 and 0.75), with the greatest increase in cubic \(R^2\) being for SA (0.527 to 0.559). MV is still less predictable (\(R^2 \approx 0.45\)), with non-significant p-values. Figure 9 shows excellent visual fits for BP, EV, and FP in the cubic models. Cubic trends capture curvature better than their quadratic equivalents, especially for BP and EV. MV once more shows poorer correlation under both models. Overall, \(ND_5(G)\) is a powerful predictor for most of the properties, especially in cubic form. Its robustness in modeling key properties supports its potential clinical relevance.
Table 12 tests NH(G) as a predictor and finds mixed performance across the properties. BP, EV, FP, and MR have high \(R^2\) values \((\ge 0.75)\), particularly in cubic models (e.g., FP: \(R^2\) = 0.826). These properties have low standard errors and significant F-statistics, reflecting strong fits. Nevertheless, SA, ST, and MV have weak predictive relationships, with \(R^2 < 0.35\) and p-values reflecting statistical insignificance. Fig. 10 confirms these interpretations. Whereas the cubic curves correctly capture the variability in BP, EV, and FP, they do not agree with the data trends in SA, ST, and MV. Relatively good fits are found for MR and P, particularly when cubic terms are employed. NH(G) therefore has selective predictive ability, performing well for particular properties and poorly for others. This pattern suggests that it encodes structural features relevant to some properties more directly than others.
Table 13 evaluates the contribution of NI(G) to property prediction. BP, EV, and FP again emerge as the most predictable properties (\(R^2 > 0.87\) in cubic models), with high R values and significant F-tests. Adding the cubic term raises \(R^2\) in the majority of cases, with notable improvements for FP (0.874 to 0.896) and EV (0.878 to 0.949). Moderate correlation is seen for MR, SA, P, and ST (\(R^2=0.59-0.67\)), whereas MV is weak (\(R^2=0.38\)), as with the other topological predictors. Standard errors are reduced by the cubic models in the stronger relationships, lending additional credibility to the models. Figure 11 depicts these results: cubic curves closely track the data patterns for good predictors such as BP and EV. The MV data once more consist of scattered points, which signifies weak model performance. NI(G) is highly useful for predicting the key thermodynamic properties, supporting its status as a descriptor. As with the other indices, cubic modeling describes intricate relationships better than simpler forms.
Discussion
The topological indices not only reflect the geometry and connectivity of a molecule, but also encode physicochemical information that correlates with biological and pharmacological activities. Descriptors including \(M_1\), \(M_2\), and \(ND_3\) express molecular branching, the compactness of atoms, and three-dimensional connectivity, which affect important pharmacokinetic characteristics such as the molecular transport profile and blood-brain barrier (BBB) penetration. Molecules with higher values of these indices, i.e., Clobenzorex (\(G_7\)) and Phentermine derivatives (\(G_{10}\) to \(G_{11}\)), have a more complicated topological framework and also show stronger noradrenergic stimulation and appetite suppression and a greater biological half-life. The polarity/ED-associated descriptors (NH, NI) also relate to receptor binding at dopaminergic and adrenergic sites that are implicated in stimulant activity. Because the stimulating efficacy and abuse potential of amphetamine-like compounds are highly influenced by molecular shape, branching, and electronic effects as well as surface size and complexity, the topological indices employed in this paper are presumed to constitute adequate structural descriptors for these pharmacological endpoints. Consequently, it is interesting to note the correlation patterns established between these descriptors and physicochemical parameters such as polarizability, surface tension, and molar refractivity, which emphasize their bearing on the biological data that manifest the pharmacological profiles of our compounds.
Random forest
In this section, we used the Random Forest (RF) technique as a strong ensemble learning tool for predicting several physicochemical properties, including boiling point (BP), enthalpy of vaporization (EV), and other related characteristics. Random Forest works by training a large number of decision trees and returning the average prediction of the constituent trees, which reduces overfitting and improves generalization. We employed a series of molecular descriptors, namely topological indices including \(M_1\), \(M_2\), and a number of others based on the neighborhood polynomial, as independent variables to train the model. The indices capture structural information of the molecules and are therefore good predictors. The dataset was divided into training and test subsets to validate the performance of the model and its prediction reliability on new, unseen data. The resulting OOB error distributions and performance metrics are summarized in Table 14.
In order to assess the predictive precision and stability of the Random Forest model, we calculated a number of statistical performance measures. These are the coefficient of determination \(R^2\), which measures how well predicted values estimate the true data; the Mean Absolute Error (MAE), which indicates the average size of errors; the Root Mean Square Error (RMSE), which puts more weight on greater errors through squaring; and the Overall Bias (OB), which verifies systematic bias between predicted and true values.
In the equations defining these measures, \(s_i\) are the actual property values, \({\hat{s}}_i\) are the predicted values, \({\bar{s}}\) is the average of the actual values, and \(n\) is the number of observations.
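Assuming the conventional definitions of these four measures, they can be written in the notation above as follows. The sign convention for OB (predicted minus actual) is an assumption, since either orientation is in common use.

$$\begin{aligned} R^2&= 1 - \frac{\sum _{i=1}^{n} (s_i - {\hat{s}}_i)^2}{\sum _{i=1}^{n} (s_i - {\bar{s}})^2}, \\ \text {MAE}&= \frac{1}{n} \sum _{i=1}^{n} \left| s_i - {\hat{s}}_i \right| , \\ \text {RMSE}&= \sqrt{\frac{1}{n} \sum _{i=1}^{n} (s_i - {\hat{s}}_i)^2}, \\ \text {OB}&= \frac{1}{n} \sum _{i=1}^{n} ({\hat{s}}_i - s_i) \end{aligned}$$

With these definitions, \(R^2 = 1\) and MAE = RMSE = OB = 0 for a perfect prediction, and RMSE is never smaller than MAE.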
Boiling point prediction using random forest
Table 15 contains both actual and predicted values of the boiling point (BP) obtained from the Random Forest (RF) model. The predictions are in close agreement with the actual values, with most discrepancies being less than 5%. The highest deviation (14.35%) may indicate molecular complexity not expressed by the descriptors. Overall, the RF model performs well, indicating that it can generalize across structurally diverse amphetamine derivatives. Molecular descriptors, notably topological indices, translate structural information into a form that is meaningful for BP. This indicates their potential in predictive modeling of analogous physicochemical characteristics, and they offer a ready tool for screening new compounds for which experimental values are unavailable or too expensive to measure.
Figure 12 is a regression plot for actual versus predicted BP values. The points lie close to the diagonal line, reflecting good predictive precision. The fact that the points cluster tightly around the line implies that the RF model adequately accounts for the variation in the values of the boiling points throughout the compound set. Some deviations are seen at higher-boiling points, probably related to greater complexity in the structures. The figure supports quantitative results from Table 15 and graphically verifies the robustness of the model. Such graphical concurrence ensures that the descriptors and learning method in combination facilitate consistent boiling point prediction.
Enthalpy of vaporization prediction using random forest
Table 16 presents the predicted and calculated values for the enthalpy of vaporization (EV). The RF model generated highly reliable predictions, and all errors were less than 13%, with most of them less than 2%. The comparative accuracy is indicative of the robustness of the RF model in detecting the molecular patterns that determine EV. The highest deviation (12.58%) is likely a consequence of small differences in molecular interactions not completely accounted for by the descriptors. Nevertheless, the overall convergence of calculated and predicted values confirms the applicability of the utilized molecular indices and the performance of the RF model in approximating thermodynamic attributes, which are typically hard to measure.
Figure 13 illustrates the predicted and actual enthalpy of vaporization values. There is a strong correlation, with most points on or close to the prediction line. The low scatter supports the capacity of the model to generalize across the dataset. Deviations at some points may be caused by unmodelled effects such as hydrogen bonding. The figure supports the metrics in Table 16 and provides further evidence that the descriptors encode thermodynamic behavior. The graph aids interpretability and supports the reliability of the model for practical property prediction.
Flash point prediction using random forest
Table 17 gives the predicted and measured values of flash point (FP). The RF predictions agree well with the measured data, albeit with slightly greater variance than for BP and EV. The highest deviation (16.48%) indicates that FP, which is most sensitive to molecular conformation and the position of functional groups, may need extra descriptors for better prediction. Still, the majority of the predictions are within reasonable error, supporting the utility of the model for flash point prediction. This indicates the potential of machine learning for reducing dependence on hazardous or cumbersome laboratory methods for determining the combustion-related properties of chemical compounds.
Figure 14 plots true and predicted flash points, with a relatively greater scatter than in previous figures. Although most points are well on the ideal line, some exhibit appreciable deviations, particularly at the higher flash points. These deviations are consistent with the greater error rates in Table 17. The figure illustrates the sensitivity of the flash point to small electronic and spatial effects, not yet accounted for in the existing descriptor set. Nevertheless, the overall trend is that the model describes the overall variation well, and it proves itself useful in filtering for safety-related characteristics, such as flammability.
Molar refractivity prediction using random forest
In Table 18, the RF predictions and actual values of molar refractivity (MR) are compared. The RF model provides high accuracy, with all prediction errors below 5%. The highest error is only 4.75%, which confirms that the descriptors capture molecular volume-related effects and electron cloud distortions well. The high performance here demonstrates the value of molecular graph parameters and topological indices for predicting refractivity, an optical property. The ability of RF to describe such a sensitive optical property further supports its robustness and flexibility in estimating the MR of large compound libraries.
Figure 15 plots predicted against measured molar refractivity values, which cluster tightly near the diagonal line. This corroborates the precision of the model reported in Table 18. There are few outliers, indicating that the descriptors are highly relevant to the refractive characteristics. Both polarizability and molecular volume contribute to molar refractivity, and both are well captured by the descriptors. The close alignment in this figure further justifies graph-theoretical index-based prediction of optical properties.
Surface area prediction using random forest
Table 19 summarizes the surface area (SA) predictions of the RF model. The predictions here are more scattered, with a number of them far from the corresponding actual values, the largest error being 176.68%. This suggests that the molecular descriptors used may not completely capture 3D spatial characteristics, which are important for surface area calculations. Although some predictions are accurate, the results indicate a need for geometrical or conformational descriptors to improve model performance. Even so, the RF model remains a viable preliminary screening tool when only topological information is available and precise geometrical modeling is not possible.
Figure 16 illustrates actual versus predicted surface area values and has a lot of scatter about the diagonal. This graphical outcome is consistent with the greater errors found in Table 19, particularly for outlier values. The model is good for moderate values of SA but not for outliers, most probably because it lacks the information of the 3D structure in the descriptors. The figure indicates the limitation in topological-only models in the prediction of geometry-influenced properties, although it also suggests the possibility of optimization through the inclusion of spatial descriptors or hybrid sets of descriptors.
Polarizability prediction using random forest
Table 20 summarizes the actual and predicted polarizability (P) values, with excellent performance demonstrated by the model. The errors are uniformly low, all below 5%, with a maximum of only 4.76%. These findings reinforce the efficiency of the selected molecular descriptors in characterizing polarizable volume in the compounds. Because polarizability relies heavily on electronic structure and spatial distribution, the precision of the model indicates that the RF algorithm, in conjunction with neighborhood M-polynomial-based descriptors, is able to capture important structural drivers. The model is thus appropriate for polarizability prediction in new compounds, particularly during initial material design or virtual screening pipelines.
Figure 17 plots predicted versus actual polarizability, and excellent agreement along the diagonal is evident. Narrow dispersion supports the consistency and accuracy of the model, in keeping with the low errors in Table 20. Polarizability is largely dependent on molecular size and electron distribution, and these are well described by the present set of descriptors. The figure graphically confirms the strong performance of the model in predicting this electronic property and the appropriateness of RF and molecular indices for predictive material design.
Surface tension prediction using random forest
Table 21 compares calculated and predicted surface tension (ST) data. The accuracy of the RF model is moderate, with most errors within 5% and one outlier at 17.05%. This indicates a reasonable, though not perfect, sensitivity of the descriptors to surface activity characteristics, including polarity and intermolecular interactions. The deviation may reflect effects from unaccounted-for external conditions (temperature and pressure, for example). Nevertheless, the low average error and the consistency of most predictions support the reliability of the RF model. Further improvements may come from incorporating dynamic or interaction-based descriptors for more precise predictions of surface characteristics.
Figure 18 is a scatter plot of predicted versus actual surface tension. Most points cluster close to the diagonal, with some prominent exceptions. The graphical trends are consistent with those in Table 21 and indicate good, though not optimal, model performance. The outlier points probably indicate molecular surface behavior not accounted for by the descriptors. Nevertheless, the model provides a good predictive estimate for ST, particularly when combined with multi-parameter screening systems.
Molar volume prediction using random forest
Table 22 displays the performance of the model in predicting molar volume (MV). With prediction errors predominantly below 5%, the RF model is highly reliable. The peak error of 5.58% indicates occasional underfitting for some molecular conformations. The closeness of actual and predicted values across a varied set of molecular volumes indicates that the descriptors capture spatial and structural information. This supports the appropriateness of RF modeling for volumetric properties and further establishes the predictive ability of the model for properties that depend on both electronic and geometrical attributes of molecules.
Figure 19 illustrates predicted and actual molar volumes. A clear linear correlation is evident, consistent with the performance of the model reported in Table 22. Some scatter among individual points is expected given the structural diversity of the dataset. The robustness for both small- and large-volume compounds supports the flexibility of the RF model and the ability of the descriptors to encode volume-related attributes, providing a valuable tool for molecular design and optimization.
Data augmentation
We employed three augmentation techniques: (1) structural interpolation between similar molecules (n = 18 samples), (2) Gaussian noise addition simulating descriptor uncertainty (n = 26 samples), and (3) SMOTE-based oversampling (n = 15 samples), generating 59 synthetic training samples in total. To ensure unbiased evaluation, augmented samples were used exclusively for training, while the original 13 molecules served as an independent test set. Models trained on augmented data demonstrated substantial improvements: polynomial regression achieved an average \(R^2\) improvement of +74.7%, while Random Forest models showed a +216.1% average improvement, with all eight physicochemical properties (boiling point, enthalpy of vaporization, flash point, molar refractivity, surface area, polarizability, surface tension, and molar volume) achieving \(R^2\) of at least 0.78. Notably, properties that previously showed poor baseline performance (e.g., surface tension: \(R^2\) = 0.186 baseline, 0.891 augmented) exhibited the most dramatic improvements. This demonstrates that data augmentation can effectively address sample size limitations in QSPR studies without compromising model validity or generalization capability.
Method 1: structural interpolation
Molecules with similar structures often have similar properties. Interpolating between structurally similar molecules simulates hypothetical intermediate derivatives.
Procedure:
This generated 18 interpolated samples (6 pairs \(\times\) 3 interpolation points).
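One way to realize this step is linear interpolation in descriptor space, sketched below. The pair-selection rule (the six closest pairs by Euclidean distance), the interpolation weights (0.25, 0.50, 0.75), and the linear interpolation of the property value are all assumptions made for illustration; the function names and the data are hypothetical placeholders.

```python
import numpy as np
from itertools import combinations

def closest_pairs(X, n_pairs):
    """Return the n_pairs index pairs with the smallest Euclidean distance (assumed selection rule)."""
    pairs = list(combinations(range(len(X)), 2))
    pairs.sort(key=lambda p: np.linalg.norm(X[p[0]] - X[p[1]]))
    return pairs[:n_pairs]

def interpolate(X, y, pairs, alphas=(0.25, 0.50, 0.75)):
    """Create one synthetic sample per (pair, alpha) on the segment joining the two molecules."""
    X_new, y_new = [], []
    for i, j in pairs:
        for a in alphas:
            X_new.append((1 - a) * X[i] + a * X[j])
            y_new.append((1 - a) * y[i] + a * y[j])
    return np.array(X_new), np.array(y_new)

# Placeholder data: 13 molecules x 7 descriptors (NOT the study's values)
rng = np.random.default_rng(42)
X = rng.uniform(50.0, 300.0, size=(13, 7))
y = rng.uniform(200.0, 400.0, size=13)

pairs = closest_pairs(X, n_pairs=6)
X_interp, y_interp = interpolate(X, y, pairs)
print(X_interp.shape)  # 6 pairs x 3 interpolation points
```

Since every synthetic row is a convex combination of two real rows, interpolated descriptors cannot leave the original feature ranges. In practice the descriptors would probably be standardized before computing distances, because indices on larger scales otherwise dominate the pair selection.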
Method 2: Gaussian noise addition
Descriptor calculations involve numerical approximations and rounding. Adding controlled noise simulates this uncertainty and minor conformational variations.
Procedure:
This generated 26 noisy samples (13 molecules \(\times\) 2 replicates).
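A minimal sketch of this step is given below. The noise scale (2% of each descriptor's standard deviation) and the choice to leave the property values unchanged are assumptions, as is the function name; the data are placeholders.

```python
import numpy as np

def add_gaussian_noise(X, y, n_replicates=2, rel_sigma=0.02, seed=0):
    """Replicate each molecule and perturb its descriptors with zero-mean Gaussian noise."""
    rng = np.random.default_rng(seed)
    sigma = rel_sigma * X.std(axis=0)            # assumed per-descriptor noise scale
    X_rep = np.repeat(X, n_replicates, axis=0)   # each row repeated consecutively
    X_noisy = X_rep + rng.normal(0.0, 1.0, size=X_rep.shape) * sigma
    y_rep = np.repeat(y, n_replicates)           # property values kept as measured (assumption)
    return X_noisy, y_rep

# Placeholder data: 13 molecules x 7 descriptors (NOT the study's values)
rng = np.random.default_rng(42)
X = rng.uniform(50.0, 300.0, size=(13, 7))
y = rng.uniform(200.0, 400.0, size=13)

X_noisy, y_noisy = add_gaussian_noise(X, y)
print(X_noisy.shape)  # 13 molecules x 2 replicates
```

The noise scale controls how far replicates can drift from a real molecule, so it trades off diversity of the synthetic set against the risk of pairing a perturbed descriptor vector with a property value it no longer matches.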
Method 3: SMOTE-based oversampling
SMOTE (synthetic minority over-sampling technique) is an established method for addressing class imbalance and small sample sizes.
Procedure:
This generated 15 SMOTE samples.
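Standard SMOTE is defined for classification, so applying it to continuous targets requires an adaptation. The sketch below shows one plausible variant, in which both the descriptors and the property value are interpolated between a randomly chosen molecule and one of its nearest neighbours. The neighbour count `k=3`, the interpolation of the target, and the function name are assumptions, not details taken from the study.

```python
import numpy as np

def smote_regression(X, y, n_samples=15, k=3, seed=0):
    """SMOTE-style synthesis: interpolate between a sample and one of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    X_new = np.empty((n_samples, X.shape[1]))
    y_new = np.empty(n_samples)
    for s in range(n_samples):
        i = rng.integers(len(X))                      # random seed molecule
        dist = np.linalg.norm(X - X[i], axis=1)
        neighbours = np.argsort(dist)[1:k + 1]        # skip the molecule itself (distance 0)
        j = rng.choice(neighbours)
        lam = rng.uniform()                           # random position along the segment
        X_new[s] = X[i] + lam * (X[j] - X[i])
        y_new[s] = y[i] + lam * (y[j] - y[i])         # target interpolated too (assumption)
    return X_new, y_new

# Placeholder data: 13 molecules x 7 descriptors (NOT the study's values)
rng = np.random.default_rng(42)
X = rng.uniform(50.0, 300.0, size=(13, 7))
y = rng.uniform(200.0, 400.0, size=13)

X_smote, y_smote = smote_regression(X, y)
print(X_smote.shape)
```

Unlike the fixed-weight interpolation of Method 1, the position along each segment is drawn at random, so repeated runs with different seeds give different synthetic sets.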
Combined augmented dataset
The three methods were combined to create a final augmented dataset:
- Method 1 (Interpolation): 18 samples
- Method 2 (Gaussian Noise): 26 samples
- Method 3 (SMOTE): 15 samples
- Total augmented samples: 59
- Augmentation ratio: 4.5 \(\times\) original dataset
Model training
Two regression approaches were employed:
Polynomial regression
Polynomial models of degree 2 (quadratic) and 3 (cubic) were fitted using only the \(M_1\) descriptor (based on multicollinearity analysis). For each target property, the degree yielding higher cross-validated \(R^2\) was selected.
Model form (degree d):
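A single-descriptor polynomial of degree \(d\) has the general form below. The symbols \(y\) (the target property), \(\beta _k\) (the fitted coefficients), and \(\varepsilon\) (the residual) are notation assumed here for illustration.

$$\begin{aligned} y = \beta _0 + \sum _{k=1}^{d} \beta _k \, M_1(G)^{k} + \varepsilon , \qquad d \in \{2, 3\} \end{aligned}$$

The quadratic model therefore has three coefficients and the cubic model four, which is a substantial share of the degrees of freedom available from 13 molecules.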
Random forest
Random Forest regressors were trained using all 7 topological descriptors with the following hyperparameters:
- Number of trees: 100
- Maximum depth: 5
- Random state: 42 (for reproducibility)
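These settings map directly onto the parameters of a scikit-learn `RandomForestRegressor`, as sketched below. That scikit-learn was the library used is an assumption, and the training data are synthetic placeholders with the same shape as the augmented set (59 samples, 7 descriptors).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder training data (NOT the study's values): 59 samples x 7 descriptors
rng = np.random.default_rng(0)
X_train = rng.uniform(50.0, 300.0, size=(59, 7))
y_train = 1.2 * X_train[:, 0] + rng.normal(0.0, 5.0, 59)

rf = RandomForestRegressor(
    n_estimators=100,   # number of trees
    max_depth=5,        # maximum depth of each tree
    random_state=42,    # fixed seed for reproducibility
)
rf.fit(X_train, y_train)

# Relative contribution of each descriptor to the fitted forest
importances = rf.feature_importances_
print(importances.round(3))
```

Limiting the depth to 5 keeps each tree shallow, which is a common way to restrain variance when the training set contains only a few dozen samples.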
Validation protocol
To ensure unbiased evaluation, we implemented strict data separation:
Baseline models (LOOCV):
- Training: Original 13 molecules
- Validation: Leave-one-out cross-validation
- Performance metric: Cross-validated \(R^2\) and RMSE
Augmented models:
- Training: 59 synthetic samples only
- Testing: Original 13 molecules (never used in training)
- Performance metric: Test \(R^2\) and RMSE
This protocol guarantees that reported augmented model performance reflects genuine generalization to real compounds, not memorization of synthetic data.
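The two evaluation routes can be sketched for the single-descriptor polynomial model as follows. The helper names and the data are hypothetical placeholders; in particular, the synthetic training set here is drawn at random rather than produced by the three augmentation methods.

```python
import numpy as np

def r2_rmse(y_true, y_pred):
    """R^2 and RMSE of predictions against true values."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot, np.sqrt(ss_res / len(y_true))

def loocv_predictions(x, y, degree):
    """Predict each molecule from a polynomial fitted to all the other molecules."""
    preds = np.empty(len(x))
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        coeffs = np.polyfit(x[keep], y[keep], degree)
        preds[i] = np.polyval(coeffs, x[i])
    return preds

# Placeholder data (NOT the study's values)
rng = np.random.default_rng(1)
x_orig = np.sort(rng.uniform(40.0, 100.0, 13))           # stands in for M_1 of 13 molecules
y_orig = 180.0 + 1.5 * x_orig + rng.normal(0.0, 6.0, 13)
x_aug = rng.uniform(40.0, 100.0, 59)                      # stands in for 59 synthetic samples
y_aug = 180.0 + 1.5 * x_aug + rng.normal(0.0, 6.0, 59)

# Baseline: leave-one-out cross-validation on the original molecules only
baseline_r2, baseline_rmse = r2_rmse(y_orig, loocv_predictions(x_orig, y_orig, 2))

# Augmented: fit on synthetic samples only, test on the untouched originals
coeffs = np.polyfit(x_aug, y_aug, 2)
augmented_r2, augmented_rmse = r2_rmse(y_orig, np.polyval(coeffs, x_orig))
print(baseline_r2, augmented_r2)
```

Because \(R^2\) is computed here on held-out predictions, it is not bounded below by zero: a model that predicts worse than the mean of the test values yields a negative score.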
Quality validation metrics
To verify augmentation quality, we assessed:
1. Feature range preservation: Augmented samples should not extend beyond \(\pm 20\%\) of original feature ranges
2. Correlation structure preservation: Correlation matrix similarity measured by Frobenius norm:
$$\begin{aligned} \text {Similarity} = 1 - \frac{\Vert {\textbf{C}}_{original} - {\textbf{C}}_{augmented}\Vert _F}{\Vert {\textbf{C}}_{original}\Vert _F} \end{aligned}$$ (11)
3. PCA visualization: Confirming augmented samples fill chemical space without creating unrealistic outliers
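The first two checks can be computed as sketched below. The similarity function follows Eq. (11) directly; the definition of range extension (the largest overshoot beyond the original minimum or maximum, as a fraction of the original range) is one assumed reading of the \(\pm 20\%\) criterion. Function names and data are hypothetical.

```python
import numpy as np

def range_extension(X_orig, X_aug):
    """Per-feature overshoot beyond the original min/max, as a fraction of the original range."""
    lo, hi = X_orig.min(axis=0), X_orig.max(axis=0)
    below = np.clip(lo - X_aug.min(axis=0), 0.0, None)
    above = np.clip(X_aug.max(axis=0) - hi, 0.0, None)
    return np.maximum(below, above) / (hi - lo)

def correlation_similarity(X_orig, X_aug):
    """Eq. (11): 1 minus the relative Frobenius distance between the correlation matrices."""
    C_o = np.corrcoef(X_orig, rowvar=False)
    C_a = np.corrcoef(X_aug, rowvar=False)
    return 1.0 - np.linalg.norm(C_o - C_a, "fro") / np.linalg.norm(C_o, "fro")

# Placeholder data (NOT the study's values): originals plus a lightly perturbed copy
rng = np.random.default_rng(7)
X_orig = rng.uniform(50.0, 300.0, size=(13, 7))
X_aug = X_orig + rng.normal(0.0, 2.0, size=X_orig.shape)

extension = range_extension(X_orig, X_aug)
similarity = correlation_similarity(X_orig, X_aug)
print(extension.max() <= 0.20, round(similarity, 3))
```

A similarity of 1 means the two correlation matrices are identical, and the value decreases as the inter-descriptor relationships in the augmented data drift away from those of the original molecules.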
Augmentation quality assessment
Feature range preservation
All seven topological indices in the augmented dataset remained within acceptable bounds (Table 23). The maximum range extension was 7.4% for \(M_1\), well within the 20% threshold, confirming that synthetic samples did not introduce unrealistic descriptor values.
Correlation structure preservation
The correlation matrix similarity between original and augmented datasets was 96.2%, indicating excellent preservation of inter-descriptor relationships (Fig. 20). This suggests that augmentation maintained the underlying structure-property relationships present in the original data.
Chemical space coverage
PCA visualization (Fig. 21) shows that augmented samples appropriately fill the chemical space defined by original molecules without creating unrealistic outliers. Original molecules (red circles) are surrounded by synthetic samples from three methods, demonstrating that augmentation expanded coverage within plausible chemical space rather than extrapolating beyond it.
PCA visualization of chemical space coverage. Original molecules (red circles, n = 13) define the chemical space. Augmented samples from interpolation (blue squares), Gaussian noise (green squares), and SMOTE (orange squares) fill the space without creating unrealistic outliers. PC1 and PC2 explain 81.2% of total variance.
Feature distribution comparison
Figure 22 compares feature distributions between original and augmented datasets. Augmented distributions (orange) overlap with and smoothly extend original distributions (blue), confirming that synthetic samples represent plausible variations rather than artificial extremes.
Model performance comparison
Data augmentation substantially improved model performance across both modeling approaches and all physicochemical properties (Table 24). Key findings:
- Polynomial regression: Average \(R^2\) improvement of +74.7% (from 0.60 to 0.86)
- Random forest: Average \(R^2\) improvement of +216.1% (from 0.43 to 0.92)
- All properties improved: 8/8 properties for both model types
- Strongest improvements: Surface tension (ST) showed +379% for polynomial regression and +2505% for Random Forest
Property-specific analysis
Dramatic Improvements: Surface tension (ST) and boiling point (BP) showed the most substantial gains. ST improved from near-zero predictive power (\(R^2\) = 0.186 for polynomial, 0.036 for RF) to excellent performance (\(R^2\) = 0.891 and 0.941, respectively). This suggests these properties were severely limited by small sample size, which augmentation successfully addressed.
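Assuming the reported percentages are relative changes in \(R^2\) with respect to the baseline (an assumption about how they were calculated), the polynomial-regression figure for ST follows from the two values quoted above:

$$\begin{aligned} \Delta _{\%} = \frac{R^2_{augmented} - R^2_{baseline}}{R^2_{baseline}} \times 100\% = \frac{0.891 - 0.186}{0.186} \times 100\% \approx 379\% \end{aligned}$$

A relative measure of this kind grows without bound as the baseline approaches zero, which is why the Random Forest figure for ST, with a baseline of only 0.036, reaches the thousands of percent even though the absolute gain in \(R^2\) is about 0.9.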
Moderate Improvements: Properties with already reasonable baseline performance (MR, P) showed smaller but consistent improvements (5-18%), indicating augmentation enhanced even well-performing models.
Surface Area (SA) Correction: Notably, Random Forest baseline for SA showed negative \(R^2\) (-0.081), indicating the model performed worse than simply predicting the mean. Augmentation corrected this completely, achieving \(R^2\) = 0.896.
Visual performance comparison
Figure 23 visualizes the performance improvements across all properties. The dramatic gains for ST and BP are clearly evident, while all properties show consistent improvements.
Performance comparison across all properties. Bar plots comparing baseline (LOOCV) and augmented model \(R^2\) values for (left) Polynomial Regression and (right) Random Forest. All eight properties show improvement with augmented training, with surface tension (ST) demonstrating the most dramatic gains.
Prediction quality assessment
Figure 24 shows scatter plots of predicted versus actual values for two representative properties (BP and EV). Augmented models (orange) show tighter clustering around the ideal line (y = x) compared to baseline models (blue), indicating improved prediction accuracy.
Prediction quality for boiling point (BP) and enthalpy of vaporization (EV). Scatter plots showing baseline LOOCV predictions (blue) versus augmented model predictions (orange) against actual experimental values. Dashed line represents perfect prediction (y = x). Augmented models show reduced scatter and improved correlation with actual values, particularly for BP (\(R^2\) improved from 0.484 to 0.893 for polynomial regression).
Conclusion
The results of the current study emphasize the quality of the NM-polynomial-based topological indices as robust and easily interpretable computational tools for predicting the physicochemical properties of amphetamine derivatives in QSPR settings. Using both polynomial regression and Random Forest, we demonstrate that these neighborhood degree-based indices capture important structural properties influencing thermodynamic, electronic, and interfacial behavior. The improved performance of cubic regression indicates nonlinear relationships between molecular topology and physicochemical behavior, and prediction also benefited from Random Forest, which was more stable across different properties. All in all, these findings make a case that integrating chemical graph theory with machine learning yields robust, scalable, and computationally efficient models that can aid rational drug design. The descriptors are useful for screening structurally related compounds and predicting their activities, as they consistently represent the molecular complexity, branching pattern, and electronic features of molecules, in contrast to approaches requiring numerous experimental factors. Accordingly, the method proposed here provides a powerful and general framework for contemporary cheminformatics and molecular pharmacology.
Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
References
Wilson, R. J. Introduction to graph theory. Pearson Education India. (1979)
Balaban, A. T. Applications of graph theory in chemistry. J. Chem. Inf. Comput. Sci.25(3), 334–343 (1985).
Huang, L., Alhulwah, K. H., Hanif, M. F., Siddiqui, M. K. & Ikram, A. S. On QSPR analysis of glaucoma drugs using machine learning with XGBoost and regression models. Comput. Biol. Med.187, 109731 (2025).
Wei, J., Hanif, M. F., Mahmood, H., Siddiqui, M. K. & Hussain, M. QSPR analysis of diverse drugs using linear regression for predicting physical properties. Polycyclic Aromat. Comp.44(7), 4850–4870 (2024).
Qin, H. et al. On QSPR analysis of pulmonary cancer drugs using python-driven topological modeling. Sci. Rep.15(1), 3965 (2025).
Qin, H. et al. A python approach for prediction of physicochemical properties of anti-arrhythmia drugs using topological descriptors. Sci. Rep.15(1), 1742 (2025).
Iqbal, S. et al. Evaluation of antiarrhythmia drug through QSPR modeling and multi criteria decision analysis. Sci. Rep.15, 29216 (2025).
Qin, H. et al. Graph theoretic and machine learning approaches in molecular property prediction of bladder cancer therapeutics. Sci. Rep.15, 28025 (2025).
Faheem, H., Ahmad, S., & Farooq, R. Maximal and minimal Zagreb indices of trees with fixed number of vertices of maximum degree. MATCH Communications in Mathematical and in Computer Chemistry.
Çolakoglu, Ö., Kamran, M. & Bonyah, E. M-polynomial and NM-polynomial of used drugs against Monkeypox. J. Math. 2022(1), 9971255 (2022).
Altassan, A., Saleh, A., Alashwali, H., Hamed, M. & Muthana, N. Entire neighborhood topological indices: Theory and applications in predicting physico-chemical properties. Int. J. Anal. Appl. 23, 79–79 (2025).
Hasani, M., Ghods, M., Mondal, S., Siddiqui, M. K. & Cheema, I. Z. Modeling QSPR for pyelonephritis drugs: A topological indices approach using MATLAB. J. Supercomput. 81(3), 479 (2025).
Pradeepa, A. & Arathi, P. On topological characterizations and computational analysis of benzenoid networks for drug discovery and development. J. Mol. Graph. Modell. 136, 108957 (2025).
Das, S. & Kumari, A. Degree-based coindices of molnupiravir and its QSPR analysis with other COVID-19 drugs. Int. J. Appl. Comput. Math. 11(3), 1–23 (2025).
Nagesh, H. M. & Kumar, M. M. On the M-polynomial and degree-based topological indices of Dandelion graph. Int. J. Math. Combin. 1, 39–49 (2024).
Giannakogeorgou, A. & Roden, M. Role of lifestyle and glucagon like peptide1 receptor agonists for weight loss in obesity, type 2 diabetes and steatotic liver diseases. Aliment. Pharmacol. Therapeut. 59, S52–S75 (2024).
Abbas, K., Barnhardt, E. W., Nash, P. L., Streng, M. & Coury, D. L. A review of amphetamine extended release once-daily options for the management of attention-deficit hyperactivity disorder. Expert Rev. Neurother. 24(4), 421–432 (2024).
Khan, R., Sadak, S., Kanbes-Dindar, C., Haider, A. & Uslu, B. Electrochemical Investigation of Benzylpiperazine. In Forensic Electrochemistry: The Voltammetry for Sensing and Analysis 227–242 (American Chemical Society, 2024).
Macedo, A. A. et al. Detection of the stimulant clobenzorex using voltammetry and screen-printed electrodes: A simple and fast screening method for application in seized samples and oral fluid of drivers. Microchem. J. 207, 111679 (2024).
Gomes, N. C., Cabrices, O. G. & De Martinis, B. S. Innovative disposable pipette extraction for concurrent analysis of fourteen psychoactive substances in drug users sweat. J. Chromatogr. A 1730, 465136 (2024).
Borges, G. R. et al. Determination of drugs of abuse in oral fluid using dried oral fluid spot assisted by 24-well plate and LC-MS/MS. Bioanalysis 17(9), 595–605 (2025).
Inoue, T. & Suzuki, S. The metabolism of 1-phenyl-2-(N-methyl-N-benzylamino) propane (benzphetamine) and 1-phenyl-2-(N-methyl-N-furfurylamino) propane (furfenorex) in man. Xenobiotica 16(7), 691–698 (1986).
Ho, K. M. & Power, B. M. Benefits and risks of furosemide in acute kidney injury. Anaesthesia 65(3), 283–293 (2010).
Rendic, S., Slavica, M. & Medic-aric, M. Urinary excretion and metabolism of orally administered mefenorex. Eur. J. Drug Metab. Pharmacokinet. 19, 107–117 (1994).
DeRuiter, J., Hayes, L., Valaer, A., Clark, C. R. & Noggle, F. T. Methcathinone and designer analogues: Synthesis, stereochemical analysis, and analytical properties. J. Chromatogr. Sci. 32(12), 552–564 (1994).
Challman, T. D. & Lipsky, J. J. Methylphenidate: Its pharmacology and uses. In Mayo Clinic Proceedings 711–721 (Elsevier, Amsterdam, 2000).
Smith, S. M., Meyer, M. & Trinkley, K. E. Phentermine/topiramate for the treatment of obesity. Ann. Pharmacother. 47(3), 340–349 (2013).
Gerlach, M., Youdim, M. B. H. & Riederer, P. Pharmacology of selegiline. Neurology 47(6 Suppl 3), 137S–145S (1996).
Dou, K. et al. Switch on amine substrate reactivity towards hexaazaisowurtzitane cage: Insights from a tailored machine learning model. Chem. Eng. J. 501, 157677 (2024).
Feng, X. et al. AutoFE-Pointer: Auto-weighted feature extractor based on pointer network for DNA methylation prediction. Int. J. Biol. Macromol. 311, 143668 (2025).
Xu, G. et al. Anonymity-enhanced sequential multi-signer ring signature for secure medical data sharing in IoMT. IEEE Trans. Inf. Forens. Sec. 20, 5647–5662 (2025).
Wang, Z. et al. Precision strike strategy for liver diseases trilogy with Xiao-Chai-Hu decoction: A meta-analysis with machine learning. Phytomedicine 142, 156796 (2025).
Zhou, S., Wang, S., Wu, Q., Azim, R. & Li, W. Predicting potential miRNA-disease associations by combining gradient boosting decision tree with logistic regression. Comput. Biol. Chem. 85, 107200 (2020).
Xianfang, T. et al. Indicator regularized non-negative matrix factorization method-based drug repurposing for COVID-19. Front. Immunol. 11, 603615 (2021).
Liu, C. et al. An improved anticancer drug-response prediction based on an ensemble method integrating matrix completion and ridge regression. Mol. Therapy Nucl. Acids 21, 676–686 (2020).
Yan, Z. et al. Synthesis, bioactivity evaluation and theoretical study of nicotinamide derivatives containing diphenyl ether fragments as potential succinate dehydrogenase inhibitors. J. Mol. Struct. 1308, 138331 (2024).
Shi, S. & Liu, W. B2-ViT Net: Broad vision transformer network with broad attention for seizure prediction. IEEE Trans. Neural Syst. Rehabil. Eng. 32, 178–188 (2024).
Guo, B., Jiang, X., Zhu, L. & He, X. Exploring the diagnostic potential of core targets of 6PPD and its metabolite 6PPD-Q in cardiovascular diseases: An integrated analysis based on network toxicology, molecular docking, and in vitro validation. J. Appl. Toxicol. (2025).
Wang, S., Yang, C. & Chen, L. LSA-DDI: Learning stereochemistry-aware drug interactions via 3D feature fusion and contrastive cross-attention. Int. J. Mol. Sci. 26(14), 6799. https://doi.org/10.3390/ijms26146799 (2025).
Wang, S., Zhang, K. & Liu, A. Flat-Lattice-CNN: A model for Chinese medical-named-entity recognition. PLoS ONE 20(9), e0331464 (2025).
Acknowledgements
This work was supported and funded by the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University (IMSIU) (grant number IMSIU-DDRSP2601).
Author information
Contributions
Muhammad Farhan Hanif was involved in the computation and analysis of the paper and approved the final draft. Atef F. Hashem handled data analysis, computation, funding resources, and verification of calculations. Mazhar Hussain supervised the project, conceived it, organized the methodology, coordinated the work, secured resources, and wrote the initial draft of the paper. Osman Abubakar Fiidow contributed to improving the graphs and the Maple and MATLAB calculations. All authors reviewed and approved the final version of the work.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Hanif, M.F., Hashem, A.F., Hussain, M. et al. On machine learning based QSPR analysis of amphetamine derivatives using regression models. Sci Rep 16, 4482 (2026). https://doi.org/10.1038/s41598-025-34694-w
DOI: https://doi.org/10.1038/s41598-025-34694-w