Abstract
In the present work, the physicochemical characteristics of important anti-tuberculosis (TB) drugs, namely isoniazid, pyrazinamide, ethambutol, ethionamide, linezolid, and levofloxacin, are explored using extended energy-based topological indices. From the molecular graphs of the drugs, we calculate the extended energies of several widely recognized indices: the second Zagreb index, Harmonic index, Randic index, Sombor index, reduced Sombor index, and average Sombor index. All calculations are done in Python, and the algorithmic implementation, covering matrix formulation and eigenvalue computation, is given for reproducibility. We use linear, quadratic, and logarithmic regression models to predict nine important physicochemical parameters: the boiling point, the melting point, the flash point, the molar refractivity, the polarizability, the molar volume, the molecular weight, the logarithm of the partition coefficient, and the surface area. Among the three models, the quadratic regression consistently yields the best predictability, as reflected in the largest coefficient of determination (\(R^2\)) and the smallest root mean square error (RMSE). Visual analyses such as heatmaps, scatter plot matrices, bar charts, and regression plots complement the numerical findings. Model validity, model significance, and limitations are also discussed. The entire source code and dataset are made available through GitHub to allow verification and transparency. In addition to elucidating the high correlation of the topological descriptors with the properties of the drugs, the Python-based QSPR methodology offers an efficient aid to drug design and optimization in pharmaceutical research.
Introduction
Compounds are represented as molecular graphs, with atoms as vertices and bonds as edges. Structural descriptors expressed as topological indices serve as a mechanism for predicting behavior, reactivity, and stability. All these factors contribute to enhancing therapeutic effectiveness1.
TB is an infectious disease caused by Mycobacterium tuberculosis and continues to be a worldwide medical problem2. It is most often a pulmonary disease but can spread to other organs. Successful treatment of TB relies on a combination of antibiotics such as isoniazid3, pyrazinamide4, ethambutol, ethionamide, linezolid, and levofloxacin. These drugs target different phases of infection, and their activity depends almost wholly on physicochemical factors. Isoniazid stops the reproduction of TB bacilli, and pyrazinamide4 suppresses bacterial development. Ethambutol inhibits the growth of the bacterial cell wall, and ethionamide is used in therapy for multidrug-resistant TB. Linezolid and levofloxacin play a key role in overcoming resistant strains, with levofloxacin being preferred for its enhanced in-vitro activity against Mycobacterium tuberculosis.
The physicochemical characteristics of drugs play an important role in determining their behavior, stability, and compatibility in the organism. Boiling point (BP), melting point (MP), flash point (FP), molecular refractivity (MR), polarity (P), molecular volume (MV), molecular weight (MW), log partition coefficient \((\log P)\), and surface area (SA) are important in characterizing pharmacokinetics and pharmacodynamics5,6. BP and MP affect a drug’s solubility and stability and hence its formulation and route of administration. FP is important for safety, since it characterizes the flammability of a compound. MR and polarity convey information about molecule-molecule interactions, which affect absorption and receptor binding. MV and MW convey information about a drug’s size and transport behavior7. Log P conveys information about lipophilicity and so helps predict how a drug crosses membranes, while SA is relevant to drug-receptor interaction. Together, these properties inform optimized drug design and delivery for therapeutic efficacy8.
In this study, the physicochemical properties of these TB drugs were analyzed through the extended energies of several topological indices, including the Zagreb second index, Harmonic index, Randic index, Sombor index, reduced Sombor index, and average Sombor index. Linear, quadratic, and logarithmic regression models were applied to investigate the relationship between these indices and the drugs’ physicochemical properties. The quadratic regression model emerged as the best fit, showing the highest \(R_v\) values and the lowest RMSE values. The correlation analysis revealed significant relationships between the extended energies of the indices and the physicochemical descriptors of the drugs. Heatmaps, scatter plot matrices, bar plots, and regression plots were used to visualize these relationships. These findings indicate that the quadratic model is the most reliable of the three for predicting the physicochemical properties of TB drugs and that it provides significant information about their molecular descriptors. It can thus contribute to drug design and optimization and to the formulation of effective drugs for treating TB.
Topological descriptors are numerical quantities derived from the graph representation of a molecule, in which atoms are treated as vertices and bonds as edges9. These indices act as a bridge between chemical structure and molecular properties, and they provide useful information about the reactivity, stability, and bioactivity of a compound. The most prevalent types of topological indices are degree-based, distance-based, and connectivity indices, each describing a specific feature of molecular structure. With these indices, one can estimate properties such as boiling point, melting point, solubility, and toxicity, which makes them important for the design and optimization of drugs and other chemical compounds. Most degree-based topological descriptors10 take the form \(TI\left( \Im \right) =\sum _{{{\varsigma }_{i}}{{\varsigma }_{j}}\in E\left( \Im \right) }\phi \left( \gimel \left( {{\varsigma }_{i}} \right) ,\gimel \left( {{\varsigma }_{j}} \right) \right)\), where the sum runs over the edges \(E\left( \Im \right)\) of the molecular graph \(\Im\).
Here, \(\phi \left( y,z \right)\) is a function of y and z with the symmetry property \(\phi \left( z,y \right) =\phi \left( y,z \right)\), and \(\gimel \left( {{\varsigma }} \right)\) is the degree of the vertex \({\varsigma }\). Some well-known topological indices of this group are as follows:
- Zagreb second descriptor \(\phi \left( \gimel \left( {{\varsigma }_{i}} \right) ,\gimel \left( {{\varsigma }_{j}} \right) \right) =\gimel \left( {{\varsigma }_{i}} \right) \times \gimel \left( {{\varsigma }_{j}} \right)\),
- Harmonic descriptor \(\phi \left( \gimel \left( {{\varsigma }_{i}} \right) ,\gimel \left( {{\varsigma }_{j}} \right) \right) =\frac{2}{\gimel \left( {{\varsigma }_{i}} \right) +\gimel \left( {{\varsigma }_{j}} \right) }\),
- Randic descriptor \(\phi \left( \gimel \left( {{\varsigma }_{i}} \right) ,\gimel \left( {{\varsigma }_{j}} \right) \right) =\frac{1}{\sqrt{\gimel \left( {{\varsigma }_{i}} \right) \times \gimel \left( {{\varsigma }_{j}} \right) }}\),
- Sombor descriptor \(\phi \left( \gimel \left( {{\varsigma }_{i}} \right) ,\gimel \left( {{\varsigma }_{j}} \right) \right) =\sqrt{\gimel {{\left( {{\varsigma }_{i}} \right) }^{2}}+\gimel {{\left( {{\varsigma }_{j}} \right) }^{2}}}\),
- Reduced Sombor descriptor \(\phi \left( \gimel \left( {{\varsigma }_{i}} \right) ,\gimel \left( {{\varsigma }_{j}} \right) \right) =\sqrt{{{\left( \gimel \left( {{\varsigma }_{i}} \right) -1 \right) }^{2}}+{{\left( \gimel \left( {{\varsigma }_{j}} \right) -1 \right) }^{2}}}\),
- Average Sombor descriptor \(\phi \left( \gimel \left( {{\varsigma }_{i}} \right) ,\gimel \left( {{\varsigma }_{j}} \right) \right) =\sqrt{{{\left( \gimel \left( {{\varsigma }_{i}} \right) -\frac{2m}{n} \right) }^{2}}+{{\left( \gimel \left( {{\varsigma }_{j}} \right) -\frac{2m}{n} \right) }^{2}}}\), where n, m are the total number of nodes and arcs.
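The six edge-weight functions listed above translate directly into code. The following is a minimal sketch using only the Python standard library; the function names and the example degrees are illustrative assumptions, not taken from the authors' repository.

```python
import math

# Edge-weight functions phi(d_i, d_j) for the six degree-based descriptors.
# d_i and d_j are the degrees of the two end vertices of an edge;
# n and m are the numbers of vertices and edges of the molecular graph.

def zagreb2(di, dj):
    return di * dj

def harmonic(di, dj):
    return 2.0 / (di + dj)

def randic(di, dj):
    return 1.0 / math.sqrt(di * dj)

def sombor(di, dj):
    return math.hypot(di, dj)

def reduced_sombor(di, dj):
    return math.hypot(di - 1, dj - 1)

def average_sombor(di, dj, n, m):
    avg = 2.0 * m / n  # average vertex degree 2m/n
    return math.hypot(di - avg, dj - avg)

# Hypothetical example: an edge joining a degree-2 and a degree-3 vertex
# in a graph with n = 6 vertices and m = 6 edges.
print(zagreb2(2, 3), harmonic(2, 3), randic(2, 3))
print(sombor(2, 3), reduced_sombor(2, 3), average_sombor(2, 3, 6, 6))
```

Each function is symmetric in its two degree arguments, which matches the requirement \(\phi \left( z,y \right) =\phi \left( y,z \right)\).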
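A pendant vertex is a vertex of degree 1; it is adjacent to exactly one other vertex. Suppose the pendant vertex is \({{\varsigma }_{i}}\), so that \(\gimel \left( {{\varsigma }_{i}} \right) =1\), and let its neighbor \({{\varsigma }_{j}}\) have degree \(\gimel \left( {{\varsigma }_{j}} \right) =c\). Substituting these degrees, the edge weights above reduce to \(c\) for the second Zagreb descriptor, \(\frac{2}{1+c}\) for the Harmonic descriptor, \(\frac{1}{\sqrt{c}}\) for the Randic descriptor, \(\sqrt{1+{{c}^{2}}}\) for the Sombor descriptor, \(c-1\) for the reduced Sombor descriptor, and \(\sqrt{{{\left( 1-\frac{2m}{n} \right) }^{2}}+{{\left( c-\frac{2m}{n} \right) }^{2}}}\) for the average Sombor descriptor.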
These mathematical expressions not only provide computational efficiency but also encapsulate fundamental structural features that influence key physicochemical properties11. Extended energies derived from indices such as the Zagreb second index, Harmonic index, Randic index, Sombor index, reduced Sombor index, and average Sombor index encode critical information about molecular symmetry, bond connectivity, and atomic distribution. These structural attributes exhibit strong correlations with physicochemical characteristics such as boiling point, melting point, molecular refractivity, polarity, and molecular weight12. By analyzing these indices, valuable insights into molecular behavior can be obtained, aiding in the prediction and optimization of drug properties for improved therapeutic applications.
The study focuses on a set of chosen anti-tuberculosis drugs that are essential in the control and treatment of Mycobacterium tuberculosis. Widely used substances such as isoniazid, rifampicin, ethambutol, and pyrazinamide are of essential importance in first-line anti-TB chemotherapy. Their molecular structures possess diverse chemical characteristics, which affect physicochemical properties such as the boiling point, entropy, molar refractivity, and lipophilicity. In this work we use a Quantitative Structure-Property Relationship (QSPR) model, which establishes mathematical relations between the molecular structure of the drugs and their experimentally established properties on the basis of extended energy-based topological indices. QSPR modeling is a widely recognized method in cheminformatics that can provide predictions without expensive experimental protocols. Using graph-theoretical descriptors such as extended energy, the objective of the work is to study the effect of the structural characteristics of the TB drugs and to assist the rational design and optimization of anti-tuberculosis drugs.
Motivation
The advent of global drug-resistant tuberculosis is a major public health concern, prompting researchers to seek low-cost, yet efficient ways of comprehending and maximizing the physicochemical properties of anti-TB drugs. Conventional experimental methods of drug physicochemical property determination may be costly, time-consuming, and labor-intensive. Such a hurdle necessitates accurate and interpretable computational methods. Graph-theoretical modeling, particularly the utilization of extended energy-based topological indices, offers a potential alternative. Based on Quantitative Structure-Property Relationship (QSPR) models, the analysis of the structural features of molecules in this study proposes a low-cost yet efficient tool of assessing drug properties, a potential catalyst for the discovery of better TB drugs.
Methodology
In this part, we introduce the mathematical expressions of various graph-based descriptors, including the extended energies of indices such as the Zagreb second index, Harmonic index, Randic index, Sombor index, reduced Sombor index, and average Sombor index13. These descriptors establish relationships between atomic structure and molecular properties, which are essential for predicting physicochemical characteristics. Several types of matrices have been defined in the literature to represent molecular structures. Among these, the adjacency matrix14, denoted as Z, plays a fundamental role. For a molecular graph \(\Im\) with n vertices, the adjacency matrix Z is an \(n\times n\) matrix whose entry \(z_{ij}\) equals 1 if the vertices \({{\varsigma }_{i}}\) and \({{\varsigma }_{j}}\) are adjacent and 0 otherwise.
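Sarkar et al.15 introduced extended energy matrices for graph structures and studied their correlations with molecular characteristics. The general extended matrix \(Z_{TI}\) of order n has the entry \(\phi \left( \gimel \left( {{\varsigma }_{i}} \right) ,\gimel \left( {{\varsigma }_{j}} \right) \right)\) in position (i, j) when the vertices \({{\varsigma }_{i}}\) and \({{\varsigma }_{j}}\) are adjacent, and 0 otherwise.
The extended energy of the graph is the sum of the absolute values of the eigenvalues of this matrix, \(EE_{TI}=\sum _{i=1}^{n}\left| {{\chi }_{i}} \right|\),
where \({{\chi }_{1}},{{\chi }_{2}},\ldots ,{{\chi }_{n}}\) are the eigenvalues of the matrix \(Z_{TI}\). The extended adjacency matrices15 of the second Zagreb, Harmonic and Randic descriptors are obtained from \(Z_{TI}\) by taking \(\phi\) to be the corresponding edge weight defined in the Introduction.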
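The construction of an extended matrix and its energy can be sketched with NumPy as follows. This is an illustrative sketch, not the authors' released code: the helper names `extended_matrix` and `extended_energy` and the three-vertex path used as input are assumptions made for the example. The energy is taken as the sum of absolute eigenvalues of the weighted matrix.

```python
import numpy as np

def extended_matrix(n, edges, phi):
    """Symmetric n x n matrix with phi(deg_i, deg_j) on each edge and 0 elsewhere."""
    degree = [0] * n
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    Z = np.zeros((n, n))
    for i, j in edges:
        Z[i, j] = Z[j, i] = phi(degree[i], degree[j])
    return Z

def extended_energy(n, edges, phi):
    """Sum of the absolute values of the eigenvalues of the extended matrix."""
    eigenvalues = np.linalg.eigvalsh(extended_matrix(n, edges, phi))
    return float(np.sum(np.abs(eigenvalues)))

# Hypothetical input: a path on three vertices (degrees 1, 2, 1).
path3 = [(0, 1), (1, 2)]
adjacency_energy = extended_energy(3, path3, lambda di, dj: 1.0)   # ordinary graph energy
zagreb2_energy = extended_energy(3, path3, lambda di, dj: di * dj)  # second Zagreb weights
print(adjacency_energy, zagreb2_energy)
```

`numpy.linalg.eigvalsh` is used because the extended matrix is real and symmetric, so its eigenvalues are real. With the constant weight 1 the function reduces to the ordinary adjacency-matrix energy.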
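Assume that \(\tau _{1}^{\left( 1 \right) },\tau _{2}^{\left( 1 \right) },\ldots ,\tau _{n}^{\left( 1 \right) }\), \(\tau _{1}^{\left( 2 \right) },\tau _{2}^{\left( 2 \right) },\ldots ,\tau _{n}^{\left( 2 \right) }\) and \(\tau _{1}^{\left( 3 \right) },\tau _{2}^{\left( 3 \right) },\ldots ,\tau _{n}^{\left( 3 \right) }\) are the eigenvalues of the extended matrices of the second Zagreb, Harmonic and Randic descriptors, respectively. The second Zagreb, Harmonic and Randic energies are then \(EE_{M_2}=\sum _{i=1}^{n}\left| \tau _{i}^{\left( 1 \right) } \right|\), \(EE_{H}=\sum _{i=1}^{n}\left| \tau _{i}^{\left( 2 \right) } \right|\) and \(EE_{R}=\sum _{i=1}^{n}\left| \tau _{i}^{\left( 3 \right) } \right|\).
The extended matrices of the Sombor, reduced Sombor and average Sombor descriptors are defined in the same way from their respective edge weights.
Now, assume \(\gamma _{1}^{\left( 1 \right) },\gamma _{2}^{\left( 1 \right) },\ldots ,\gamma _{n}^{\left( 1 \right) }\), \(\gamma _{1}^{\left( 2 \right) },\gamma _{2}^{\left( 2 \right) },\ldots ,\gamma _{n}^{\left( 2 \right) }\) and \(\gamma _{1}^{\left( 3 \right) },\gamma _{2}^{\left( 3 \right) },\ldots ,\gamma _{n}^{\left( 3 \right) }\) are the eigenvalues of the extended matrices of the Sombor, reduced Sombor and average Sombor descriptors, respectively. Then, the Sombor energies16 are \(EE_{SO}=\sum _{i=1}^{n}\left| \gamma _{i}^{\left( 1 \right) } \right|\), \(EE_{SO_{red}}=\sum _{i=1}^{n}\left| \gamma _{i}^{\left( 2 \right) } \right|\) and \(EE_{SO_{avg}}=\sum _{i=1}^{n}\left| \gamma _{i}^{\left( 3 \right) } \right|\).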
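As a worked illustration of the six energies, the sketch below evaluates all of them for one small graph. The star graph with one central vertex and three pendant vertices is a hypothetical test case chosen because its energies have simple closed forms; it is not one of the drug molecules, and the dictionary keys and helper name are assumptions.

```python
import math
import numpy as np

# Edge weights phi(d_i, d_j, n, m) for the six descriptors (n vertices, m edges).
WEIGHTS = {
    "EE_M2": lambda di, dj, n, m: di * dj,
    "EE_H": lambda di, dj, n, m: 2.0 / (di + dj),
    "EE_R": lambda di, dj, n, m: 1.0 / math.sqrt(di * dj),
    "EE_SO": lambda di, dj, n, m: math.hypot(di, dj),
    "EE_SO_red": lambda di, dj, n, m: math.hypot(di - 1, dj - 1),
    "EE_SO_avg": lambda di, dj, n, m: math.hypot(di - 2.0 * m / n, dj - 2.0 * m / n),
}

def all_extended_energies(n, edges):
    """Return the six extended energies of a graph given as an edge list."""
    m = len(edges)
    degree = [0] * n
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    energies = {}
    for name, phi in WEIGHTS.items():
        Z = np.zeros((n, n))
        for i, j in edges:
            Z[i, j] = Z[j, i] = phi(degree[i], degree[j], n, m)
        energies[name] = float(np.sum(np.abs(np.linalg.eigvalsh(Z))))
    return energies

# Hypothetical test graph: star with center 0 and pendant vertices 1, 2, 3.
energies = all_extended_energies(4, [(0, 1), (0, 2), (0, 3)])
for name, value in energies.items():
    print(f"{name}: {value:.4f}")
```

For this star every edge carries the same weight w, so each energy equals \(2\sqrt{3}\,w\); for example, the Randic weight \(1/\sqrt{3}\) gives an energy of exactly 2.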
The mathematical descriptors and definitions in this section present a consistent scheme for quantifying molecular properties via graph-based indices17. Through energy matrices and eigenvalue calculation, such indices expose molecular connectivity and structural variation at a deeper level. By combining such descriptors, a more complete analysis of molecular characteristics can be conducted, which improves the analysis of their physicochemical properties18. Chemical graph theory of this kind contributes significantly to predictive modeling in a variety of fields, including chemistry, pharmacy, and materials science19.
TB drug molecular descriptors were computed with RDKit, a widely used open-source cheminformatics package. PubChem-derived molecule structures were used for computation of the descriptors. Linear, quadratic, and logarithmic regressions for statistical modeling were conducted with Python and Scikit-Learn20. Standard R-squared (\(R^2\)) and Root Mean Squared Error (RMSE) measures were used for training and model evaluation to find the best-fit model. Preprocessing, visualization, and correlation analysis were carried out with Pandas, NumPy, Matplotlib, and Seaborn. For reproducibility, all code and data have been released publicly on GitHub and archived with a DOI on Zenodo. Instructions for data access and repository links are provided in the ‘Code Availability’.
Energy-based topological indices have been investigated extensively because of their value in the analysis of molecular structure and in the prediction of molecular properties. Graph energy based on the eigenvalues of the adjacency matrix was first conceptualized by Gutman, and it has been the cornerstone of energy-based indices21. A number of extended versions of energy, such as Laplacian energy, Seidel energy, and Randić energy, have subsequently been investigated for their predictive ability. Researchers such as Ilić and Stevanović22, Das and Gutman23, and Cavers et al.24 contributed in particular to the establishment and generalization of these indices. More recently, contributions by Chellali et al.25 and Dehmer et al.26 further illustrate the suitability of spectral descriptors for QSPR and QSAR model-building. These studies form the basis of the current work, which applies extended energy-based descriptors to the molecules of tuberculosis drugs using a Python-based approach.
Dataset selection and justification
This data set consists of six FDA-approved tuberculosis (TB) medicines selected from PubChem based on their well-documented pharmacological relevance and their previous use in quantitative structure-property relationship (QSPR) studies. The data set has previously been employed in ref. 29, where it performed well for predictive modeling. The drugs selected here represent structural and physicochemical variability relevant to TB drug design, allowing for meaningful inference regarding their behavior at a molecular level.
While a larger data set would make for greater generalizability, one should bear in mind that what is most important for this research is correlation with drug properties via extended energy-based topological indices. Expanding the data set would mean additional experimental validation, which is beyond what this theoretical research can accommodate. Similar numbers of samples have been used for previous QSPR studies, which is a testament that a small data set can provide valid data if paired with rigorous statistical validation.
To establish our models’ reliability, internal validation measures, including adjusted R-squared values and root mean square error (RMSE), were employed. These are effective measures of model predictability and accuracy. Although external validation on a second data set would further substantiate our results, it is currently restricted by the lack of TB drugs with experimentally validated physicochemical properties. However, the trends found in our research are consistent with published data, which supports our methodology.
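The adjusted R-squared mentioned above penalizes each additional predictor term, which matters when only six drugs are available. The following minimal sketch uses the standard textbook formula; the function name and the sample \(R^2\) value of 0.90 are hypothetical and not results from this study.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more samples than fitted parameters")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof

# Hypothetical example with n = 6 drugs and the same raw R^2 of 0.90:
linear_adj = adjusted_r2(0.90, 6, 1)     # one predictor term (X)
quadratic_adj = adjusted_r2(0.90, 6, 2)  # two predictor terms (X and X^2)
print(round(linear_adj, 4), round(quadratic_adj, 4))
```

For equal raw \(R^2\), the model with more terms receives the lower adjusted value, so a quadratic fit has to improve the raw \(R^2\) enough to offset its extra term.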
Main results and analysis for tuberculosis treatment drugs
In this section, Table 1 presents in detail the extended energies of the topological indices considered, including the Zagreb second index, Harmonic index, Randic index, Sombor index, reduced Sombor index, and average Sombor index. These indices serve as basic descriptors from which a quantitative relation between molecular structure and physicochemical properties is derived. Comparing their values gives a deeper understanding of the structural features of TB drugs and of their role in altering thermodynamic properties, and supports further analysis of molecular behavior. The molecular structures of the selected anti-tuberculosis drugs isoniazid, pyrazinamide, ethambutol, ethionamide, linezolid, and levofloxacin are illustrated in Figs. 1 to 6. These structures were sketched using ChemSketch and served as the basis for calculating the extended energy-based topological indices used in this study.
To further explore the relations among the extended energies of the topological indices, a scatter plot matrix is presented in Fig. 7. This pairwise view of the extended energies of the TB drugs can reveal trends and relations that are not obvious from the table, including how variation in molecular structure may affect physicochemical property values. It also shows how the extended topological indices behave together in characterizing TB drugs. For example, in the case of isoniazid, the eigenvalues are calculated using the extended matrix in MATLAB. Their absolute values are 26.7296, 16.0189, 10.4357, 9.0000, 4.6809, 1.0515, 0.0000, 1.0515, 4.6809, 9.0000, 10.4357, 16.0189, and 26.7296, and the sum of these absolute values is 135.8332. The extended energies for the remaining cases can be calculated in the same way.
Table 2 presents the six TB drugs, i.e., isoniazid, pyrazinamide, ethambutol, ethionamide, linezolid, and levofloxacin, together with their physicochemical characteristics, including boiling point (BP), melting point (MP), flash point (FP), molecular refractivity (MR), polarity (P), molecular volume (MV), molecular weight (MW), log partition coefficient \((\log P)\), and surface area (SA). These characteristics play a significant role in describing the molecular behavior of the drugs and affect their solubility, bioavailability, and compatibility with biological processes. Comparing them across drugs provides information about each drug’s character and efficacy in treating tuberculosis.
The box plot in Fig. 8 shows the distribution and variation of the physicochemical characteristics of the TB drugs, including boiling point (BP), melting point (MP), flash point (FP), molecular refractivity (MR), polarity (P), molecular volume (MV), molecular weight (MW), log partition coefficient \((\log P)\), and surface area (SA). In each plot, the central box spans the interquartile range, which contains \(50\%\) of the data, and the horizontal line inside the box marks the median. The whiskers extend to the minimum and maximum values within the acceptable range, and values outside this range are plotted separately as outliers. This graphical view aids the comparative analysis of the physicochemical properties of TB drugs and brings out variation, trends, and possible relations among them. The presence of outliers in certain properties indicates significant deviations in specific drugs, which may influence their pharmacokinetic behavior and therapeutic effectiveness.
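The isoniazid figure quoted above can be checked with a few lines of arithmetic. The sketch below simply sums the thirteen values as printed in the text; the variable names are illustrative.

```python
# Values reported in the text for isoniazid.
reported = [
    26.7296, 16.0189, 10.4357, 9.0000, 4.6809, 1.0515, 0.0000,
    1.0515, 4.6809, 9.0000, 10.4357, 16.0189, 26.7296,
]

# Extended energy = sum of the absolute values of the eigenvalues.
isoniazid_energy = sum(abs(value) for value in reported)
print(round(isoniazid_energy, 4))  # 135.8332
```

The list is symmetric about its zero entry, so the total is twice the sum of the six distinct nonzero values, 2 × 67.9166 = 135.8332.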
Significance of physicochemical properties in tuberculosis drug analysis
In the following part, we examined a dataset consisting of various tuberculosis (TB) treatment drugs to investigate the relationships between their physicochemical properties. These properties include BP, MP, FP, MR, P, MV, MW, \(\log P\), and SA. The boiling point and melting point are measured in degrees Celsius, while the flash point is expressed in degrees Fahrenheit. Molecular refractivity and polarity are dimensionless, molecular volume is in cubic angstroms, molecular weight is in atomic mass units, and the log partition coefficient is also dimensionless. To analyze these properties, we applied three statistical models: linear, quadratic, and logarithmic regression. Linear regression predicts the value of a dependent variable based on an independent variable using a straight-line relationship. Quadratic regression builds on this by adding a squared term, which captures nonlinear trends in the data. Logarithmic regression models relationships where the rate of change of the dependent variable decreases as the independent variable increases, making it particularly useful for datasets with diminishing returns. These models were used to identify trends, correlations, and predictive relationships among the physicochemical properties of TB drugs, offering valuable insights into their pharmacokinetic behavior and potential therapeutic effectiveness. These models27 are defined as:
In this study, X represents the independent variable, while Y denotes the dependent variable. We analyzed the physicochemical properties of TB treatment drugs, including BP, MP, FP, MR, P, MV, MW, \(\log P\), and SA, to develop predictive models. Using the least squares fitting procedure, we constructed regression models incorporating linear, quadratic, and logarithmic approaches to examine correlations and trends among these properties.
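The three least-squares fits can be sketched with NumPy as follows. This sketch assumes the usual parameterizations \(Y=a+bX\), \(Y=a+bX+cX^2\), and \(Y=a+b\ln X\), which the text describes in words; the coefficient names, the function name `fit_models`, and the six data points are hypothetical and are not the values in Tables 1 and 2.

```python
import math
import numpy as np

def fit_models(x, y):
    """Fit linear, quadratic, and logarithmic models by least squares.

    Returns {model_name: {"r2": ..., "rmse": ...}} evaluated on the fitted data.
    The logarithmic model requires strictly positive x.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    log_x = np.log(x)
    predictions = {
        "linear": np.polyval(np.polyfit(x, y, 1), x),
        "quadratic": np.polyval(np.polyfit(x, y, 2), x),
        "logarithmic": np.polyval(np.polyfit(log_x, y, 1), log_x),
    }
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    results = {}
    for name, y_hat in predictions.items():
        ss_res = float(np.sum((y - y_hat) ** 2))
        results[name] = {
            "r2": 1.0 - ss_res / ss_tot,
            "rmse": math.sqrt(ss_res / len(y)),
        }
    return results

# Hypothetical extended energies (X) and property values (Y) for six compounds.
energy = [40.2, 55.1, 61.7, 78.4, 120.9, 135.8]
prop = [251.0, 273.5, 280.2, 310.8, 402.3, 455.6]
for model, scores in fit_models(energy, prop).items():
    print(f"{model:12s} R2={scores['r2']:.4f} RMSE={scores['rmse']:.3f}")
```

Because the linear model is a special case of the quadratic one, the quadratic fit cannot have a lower \(R^2\) on the data it was fitted to; this is one reason to also compare RMSE and adjusted measures.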
In our analysis, we employed \(R_v\) to measure the strength and direction of relationships between variables, while \(\zeta _e\) was used as the standard error of estimation to assess the accuracy of predictions. The F-value determined the overall significance of the regression model, and \(\nabla\) represented the significance of F, indicating the reliability of the model in explaining variations in the data. Using a single regression-based predictive model for all the physicochemical properties of the TB drugs keeps the computation efficient and consistent and captures the interdependencies among the properties. If the performance differences are large, or if a property depends strongly on one particular model, several models can be considered instead. In such cases, statistical validation is needed to improve predictive accuracy and confidence.
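For a simple linear fit, the standard error of estimation and the F-value can be computed from the residual and regression sums of squares. The sketch below uses the standard textbook definitions with pure Python; the function name and the sample data are hypothetical. The significance of F would additionally require the F distribution (for example, from a statistics library) and is not computed here.

```python
import math

def linear_fit_statistics(x, y):
    """Return R^2, standard error of estimate, and F-value for Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_reg = ss_tot - ss_res
    dof = n - 2  # residual degrees of freedom with one predictor
    return {
        "r2": ss_reg / ss_tot,
        "std_error": math.sqrt(ss_res / dof),
        "f_value": ss_reg / (ss_res / dof),
    }

# Hypothetical data for six compounds.
stats = linear_fit_statistics([1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
print(stats)
```

With one predictor, the F-value and \(R^2\) are linked by \(F=R^2(n-2)/(1-R^2)\), which provides a quick consistency check on reported tables.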
Linear regression models for physicochemical characteristics of TB treatment drugs using \(EE_{M_2}\)
In this section, we identified the models for BP, MP, FP, MR, P, MV, MW, \(\log P\), and SA associated with \(EE_{M_2}\).
Linear regression models for physicochemical characteristics of TB treatment drugs using \(EE_{H}\)
In this section, we identified the models for BP, MP, FP, MR, P, MV, MW, \(\log P\), and SA associated with \(EE_{H}\).
Linear regression models for physicochemical characteristics of TB treatment drugs using \(EE_{R}\)
In this section, we identified the models for BP, MP, FP, MR, P, MV, MW, \(\log P\), and SA associated with \(EE_{R}\).
Linear regression models for physicochemical characteristics of TB treatment drugs using \(EE_{SO}\)
In this section, we identified the models for BP, MP, FP, MR, P, MV, MW, \(\log P\), and SA associated with \(EE_{SO}\).
Linear regression models for physicochemical characteristics of TB treatment drugs using \(EE_{SO_{red}}\)
In this section, we identified the models for BP, MP, FP, MR, P, MV, MW, \(\log P\), and SA associated with \(EE_{SO_{red}}\).
Linear regression models for physicochemical characteristics of TB treatment drugs using \(EE_{SO_{avg}}\)
In this section, we determined the models for BP, MP, FP, MR, P, MV, MW, \(\log P\), and SA associated with \(EE_{SO_{avg}}\).
The heatmap for the linear regression model as shown in Fig. 9 represents the correlation between the extended energy matrix and the physicochemical properties of TB treatment drugs. In this heatmap, the \(R_v\) values are shown in color, where darker shades indicate stronger correlations, suggesting a direct or inverse linear relationship between the energy matrix and the property. Lighter shades reflect weaker correlations, indicating minimal linear dependence. This heatmap is useful for identifying properties that can be effectively predicted using a simple linear regression model, with higher \(R_v\) values suggesting a good fit. Properties with weak correlations in this heatmap indicate that a linear approach may not be the best model for those attributes.
Quadratic models related to \(EE_{M_2}\)
In this portion, we determined the quadratic models for BP, MP, FP, MR, P, MV, MW, \(\log P\), and SA associated with \(EE_{M_2}\).
Quadratic models related to \(EE_{H}\)
In this part, we determined the quadratic models for BP, MP, FP, MR, P, MV, MW, \(\log P\), and SA associated with \(EE_{H}\).
Quadratic models related to \(EE_{R}\)
In this portion, we determined the quadratic models for BP, MP, FP, MR, P, MV, MW, \(\log P\), and SA associated with \(EE_{R}\).
Quadratic models related to \(EE_{SO}\)
In this part, we determined the quadratic models for BP, MP, FP, MR, P, MV, MW, \(\log P\), and SA associated with \(EE_{SO}\).
Quadratic models related to \(EE_{SO_{red}}\)
In this portion, we determined the quadratic models for BP, MP, FP, MR, P, MV, MW, \(\log P\), and SA associated with \(EE_{SO_{red}}\).
Quadratic models related to \(EE_{SO_{avg}}\)
In this portion, we determined the quadratic models for BP, MP, FP, MR, P, MV, MW, \(\log P\), and SA associated with \(EE_{SO_{avg}}\).
The quadratic regression model heatmap in Fig. 10 shows how the relationships between the physicochemical properties and the extended energies change when a squared term is introduced. A higher \(R_v\) value in the quadratic model heatmap, compared to the linear model, indicates that the property follows a nonlinear trend and benefits from the inclusion of the squared term. This heatmap helps in identifying properties with a parabolic relationship, where the impact of extended energies on the property either increases or decreases at an accelerating rate, highlighting properties that require a more complex regression model for accurate prediction.
Logarithm models related to \(EE_{M_2}\)
In this portion, we determined the logarithm models for BP, MP, FP, MR, P, MV, MW, \(\log P\), and SA associated with \(EE_{M_2}\).
Logarithm models related to \({EE_{H}}\)
In this portion, we determined the logarithm models for BP, MP, FP, MR, P, MV, MW, \(\log P\), and SA associated with \(EE_{H}\).
Logarithm models related to \(EE_{R}\)
In this portion, we determined the logarithm models for BP, MP, FP, MR, P, MV, MW, \(\log P\), and SA associated with \(EE_{R}\).
Logarithm models related to \(EE_{SO}\)
In this portion, we determined the logarithm models for BP, MP, FP, MR, P, MV, MW, \(\log P\), and SA associated with \(EE_{SO}\).
Logarithm models related to \(EE_{SO_{red}}\)
In this portion, we determined the logarithm models for BP, MP, FP, MR, P, MV, MW, \(\log P\), and SA associated with \(EE_{SO_{red}}\).
Logarithm models related to \(EE_{SO_{avg}}\)
In this portion, we determined the logarithm models for BP, MP, FP, MR, P, MV, MW, \(\log P\), and SA associated with \(EE_{SO_{avg}}\).
The logarithmic regression model heatmap in Fig. 11 examines relationships in which the rate of change of the dependent variable decreases as the independent variable increases. This heatmap reveals whether the logarithmic transformation improves the fit compared to the linear and quadratic models. Strong correlations in the logarithmic model suggest that the relationship between extended energies and certain properties follows a pattern of diminishing returns, where the effect of the extended energy on the property diminishes as its value increases. By comparing the heatmaps of the three models, it is possible to determine which transformation best captures the behavior of each physicochemical property.
Statistical validation of predictive model consistency
In statistical analysis, correlation is a fundamental measure used to assess the strength and direction of the relationship between two variables. It provides insights into how variations in one variable correspond to changes in another, making it a crucial tool in predictive modeling. The correlation coefficient (r) ranges from \(-1\) to 1: positive values indicate a direct relation, negative values an inverse relation, and values near zero a weak or absent relation28. The larger the absolute value, the stronger the association between the two variables. In chemical graph theory, correlation analysis plays a significant role in checking the predictive capability of topological indices for molecular property characterization.
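The correlation coefficient described above can be computed directly from its definition. The following is a minimal pure-Python sketch of the Pearson coefficient; the function name and the sample values are hypothetical and are not entries of Table 3.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical extended energies and one property for six compounds.
energy = [40.2, 55.1, 61.7, 78.4, 120.9, 135.8]
weight = [123.1, 137.1, 166.2, 204.3, 337.3, 361.4]
print(round(pearson_r(energy, weight), 3))
```

A value near 1 or \(-1\) indicates a nearly linear relation between the two lists, while the sign gives its direction.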
For further evidence of the quadratic model effectiveness, the actual values of three critical physicochemical characteristics, boiling point, melting point, and flash point, are graphically exhibited alongside their respective predicted values. These are included as Figs. 12, 13, and 14, respectively, within the manuscript.
The quadratic regression model was found to provide the highest prediction accuracy for these properties, with the actual versus predicted values lying very close to the regression line, which reflects a strong correlation and good predictive performance. The \(R^2\) and RMSE values for each property further support the accuracy of the model. Similar graphs for molar refractivity, polarizability, molar volume, molecular weight, log of partition coefficient, and surface area can be drawn by following the same procedure. These plots give an overall view of the model’s stability across varied drug properties and strengthen the case for the quadratic model in QSPR analysis. Table 3 shows the correlations between the extended energies of the topological descriptors and the physicochemical descriptors of the antitubercular drugs. The indices analyzed include \(EE_{M_{2}}\), \(EE_{H}\), \(EE_{R}\), \(EE_{SO}\), \(EE_{SO_{red}}\), \(EE_{SO_{avg}}\), while the molecular properties considered are BP, MP, FP, MR, P, MV, MW, \(\log (P)\), and SA. The correlation values denote the intensity of association between each molecular property and topological index, with high values signifying strong relations. Notably, the Sombor energy \((EE_{SO})\) shows strong relations with several significant properties, such as MW \((r=0.965)\) and BP \((r=0.941)\), which indicates its usefulness in explaining molecular behavior. In contrast, the Randic energy \(EE_{R}\) shows a negative relationship with properties such as MR \((r=-0.303)\) and \(\log (P)\) \((r=-0.736)\), indicating an inverse relationship. The variation in correlation values across different indices emphasizes the importance of selecting appropriate descriptors for predictive modeling, as certain indices consistently exhibit stronger associations with molecular properties.
This analysis reinforces the reliability of predictive models and provides valuable insights into the most influential topological indices for understanding the physicochemical behavior of TB drugs, contributing to the development of more accurate pharmaceutical property predictions.
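The correlation values discussed above are Pearson coefficients between one extended-energy descriptor and one property across the drugs. The following is a minimal sketch of that calculation, assuming NumPy is available. The function name `pearson_r` and all numerical values are illustrative placeholders, not the data from Table 3 and not the authors' code.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centre both variables
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


# Placeholder descriptor values for six drugs (hypothetical, not from Table 3).
ee_index = [10.2, 8.7, 14.1, 11.5, 22.3, 25.0]

# Two synthetic properties: one rises with the descriptor, one falls.
prop_up = [2.0 * v + 1.0 for v in ee_index]
prop_down = [50.0 - 3.0 * v for v in ee_index]

r_up = pearson_r(ee_index, prop_up)
r_down = pearson_r(ee_index, prop_down)
print(round(r_up, 3), round(r_down, 3))
```

A positive value corresponds to the kind of association reported for \(EE_{SO}\), and a negative value to the inverse relationship reported for \(EE_{R}\). The same number can be obtained from `np.corrcoef(x, y)[0, 1]`.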
In addition to the analysis presented in Tables 4, 5, and 6, where \(EE_{SO_{avg}}\) consistently yielded the lowest RMSE values across the linear, quadratic, and logarithmic models, a comparison using Python and R revealed significant differences in the predictive accuracy of each model. This is further illustrated by the bar charts of RMSE values in Figs. 15, 16, and 17, one for each model, which show visually how the different modeling approaches perform in predicting drug properties. The findings underscore the importance of model selection and optimal descriptor choice for accurate molecular property predictions.
The code, as shown in Fig. 18, provides an algorithm to compare the RMSE values and \(R_v\) values for different models (Linear, Quadratic, and Logarithmic) in predicting drug properties. It first loads the RMSE and \(R_v\) value data from separate Excel files and ensures that the drug properties match across both datasets. The script then extracts the relevant data for each model and property and determines the best model for each drug property based on the minimum RMSE and maximum \(R_v\) value. The result is summarized in a new Excel file, listing the best model for each property along with its corresponding \(R_v\) value and RMSE. According to the comparison, the quadratic model emerges as the best for predicting the drug properties. The corresponding chart and summary are captioned as the “Algorithm” for visualization.
Algorithm
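The selection step described above can be sketched in a few lines. This is a hedged illustration only: the script in Fig. 18 reads its inputs from Excel files, whereas this sketch uses in-memory dictionaries. The function name `best_model_per_property`, the tie-breaking rule (lowest RMSE first, then highest \(R_v\)), and all numbers are assumptions made for the example.

```python
MODELS = ("Linear", "Quadratic", "Logarithmic")


def best_model_per_property(rmse, r_values):
    """Return {property: (model, r, rmse)} for the best model per property.

    Assumed rule: pick the lowest RMSE; break ties by the highest r value.
    """
    # Both tables must describe the same set of drug properties.
    if set(rmse) != set(r_values):
        raise ValueError("property names differ between the two tables")
    summary = {}
    for prop in rmse:
        best = min(MODELS, key=lambda m: (rmse[prop][m], -r_values[prop][m]))
        summary[prop] = (best, r_values[prop][best], rmse[prop][best])
    return summary


# Placeholder numbers (hypothetical, not the values from Tables 4-6).
rmse_table = {
    "BP": {"Linear": 30.0, "Quadratic": 12.0, "Logarithmic": 25.0},
    "MV": {"Linear": 9.0, "Quadratic": 9.0, "Logarithmic": 14.0},
}
r_table = {
    "BP": {"Linear": 0.90, "Quadratic": 0.98, "Logarithmic": 0.93},
    "MV": {"Linear": 0.91, "Quadratic": 0.95, "Logarithmic": 0.88},
}

summary = best_model_per_property(rmse_table, r_table)
for prop, (model, r, err) in summary.items():
    print(prop, model, r, err)
```

For "MV" the linear and quadratic RMSE values tie in this toy input, so the higher r value decides the outcome. If RMSE and r disagree more generally, a different rule may be preferable; the source does not state how such conflicts are resolved.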
Following the comparison of RMSE and \(R_v\) values across different models, a “Quadratic Scatter Plot between Extended Energy of \(M_2\) and Drug Properties” was generated as shown in Fig. 19. This plot illustrates the relationship between the extended energy of \(M_2\) and the drug properties, with the quadratic model effectively capturing the correlation. The scatter plot demonstrates the predictive accuracy of the quadratic model in depicting the influence of \(M_2\) on the drug properties. A similar approach can be applied to other extended energies, allowing for comparative analysis of their effects on drug properties. These plots provide a clear visualization of the performance and predictive potential of each extended energy descriptor when used within a quadratic framework.
Model significance and validation criteria
In the current study, the topological indices based on extended energy were assessed by applying several regression models (linear, quadratic, and logarithmic) to nine essential physicochemical properties: boiling point, melting point, flash point, molar refractivity, polarizability (P), molar volume, molecular weight, logarithm of the partition coefficient, and surface area. Standard statistical measures, namely the coefficient of determination (\(R^2\)), the root mean square error (RMSE), and the adjusted \(R^2\), were used to evaluate model performance. In this study, a model is considered to perform well when \(R^2\) is high and RMSE is low.
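The three model forms and the evaluation measures named above can be illustrated with ordinary least squares. The sketch below assumes NumPy, assumes the logarithmic model has the form \(y = a + b\ln x\) with \(x > 0\), and assumes RMSE is computed as the square root of the mean squared residual. The function name `fit_and_score` and the data are hypothetical; this is not the authors' implementation.

```python
import numpy as np


def fit_and_score(x, y, kind):
    """Fit one regression model by least squares; return R^2, adjusted R^2, RMSE."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    ones = np.ones_like(x)
    if kind == "linear":
        design = np.column_stack([ones, x])
    elif kind == "quadratic":
        design = np.column_stack([ones, x, x ** 2])
    elif kind == "logarithmic":
        design = np.column_stack([ones, np.log(x)])  # requires x > 0
    else:
        raise ValueError(f"unknown model kind: {kind}")
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    n = design.shape[0]      # number of observations (drugs)
    p = design.shape[1] - 1  # number of predictors, excluding the intercept
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    rmse = float(np.sqrt(ss_res / n))
    return {"r2": r2, "adj_r2": adj_r2, "rmse": rmse}


# Placeholder data: six descriptor values and an exactly quadratic property.
x = np.array([10.2, 8.7, 14.1, 11.5, 22.3, 25.0])
y = 1.0 + 2.0 * x + 0.5 * x ** 2

scores = {k: fit_and_score(x, y, k) for k in ("linear", "quadratic", "logarithmic")}
for kind, s in scores.items():
    print(kind, s)
```

Because the linear model is a special case of the quadratic one, the quadratic fit can never have a lower \(R^2\) on the data it was fitted to. The adjusted \(R^2\) penalizes the extra parameter, which is why it is useful alongside \(R^2\) when only a handful of drugs are available.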
Among the tested models, the quadratic model outperformed the others in predicting the majority of drug properties, such as BP, FP, MR, MV, and \(\log P\), as demonstrated by higher \(R^2\) and lower RMSE values. The model, however, had low predictive power for MP, suggesting that the topological descriptors employed may be insufficient to account for the underlying structural or energetic factors that affect the melting point. This points toward investigating more complex models, or including additional descriptors that are more specific to MP, in subsequent work.
Limitations and future work
The research has some limitations. The dataset contains only a few tuberculosis drugs, which may limit the generalizability of the findings. The QSPR models employed (linear, quadratic, and logarithmic) are simplifications and do not necessarily capture complex molecular interactions. No external validation on independent datasets, which would strengthen the robustness of the model, was performed. Future research will extend the dataset, employ more sophisticated machine learning methods such as graph neural networks and random forests, and incorporate hybrid indices to improve the accuracy of the predictions. More detailed, open-source code implementations can also facilitate reproducibility and stimulate further studies on the topic.
Conclusion
In this study, we analyzed the physicochemical properties of six tuberculosis (TB) drugs using extended energies of topological indices, including the Zagreb second index, Harmonic index, Randic index, Sombor index, and others. Linear, quadratic, and logarithmic regression models were applied to explore the relationships between the indices and drug properties. The quadratic regression model provided the best fit, showing the highest \(R_v\) values and lowest RMSE of the three models. A comparison algorithm was added to validate the results, further supporting the superiority of the quadratic model. Various visualizations, including heatmaps, scatter plots, and a bar plot matrix, were created to better understand the correlations.
The results of this study offer valuable insights for drug design and optimization, particularly for tuberculosis treatments. By identifying the most accurate models for predicting physicochemical properties, this work can guide the development of more effective TB drugs with better therapeutic outcomes. Additionally, combining topological indices with regression modeling allows for a deeper understanding of drug properties at the molecular level, enhancing the potential for novel drug discovery and optimization in the fight against TB.
Data availability
All data generated or analysed during this study are included in this published article.
Code availability
The custom Python code used in this study for data analysis and modeling is publicly available in a GitHub repository: github.com/kirannaz145/Linear-Quadratic-Logarithmic. To ensure long-term accessibility and reproducibility, the version of the code referenced in this publication has been archived on Zenodo and can be accessed via the following DOI: https://doi.org/10.5281/zenodo.15240618. This ensures that the code remains accessible even if modifications are made to the GitHub repository in the future. The archived version can be cited and used by other researchers to replicate and extend our findings. No restrictions apply to access or use of the provided code.
References
Leite, L. S., Banerjee, S., Wei, Y., Elowitt, J. & Clark, A. E. Modern chemical graph theory. Wiley Interdiscipl. Rev. 14(5), e1729 (2024).
Bommahalli Jayaraman, B. & Siddiqui, M. K. Exploring the properties of antituberculosis drugs through QSPR graph models and domination-based topological descriptors. Sci. Rep. 14(1), 24387 (2024).
Fernandes, G. F. D. S., Salgado, H. R. N. & Santos, J. L. D. Isoniazid: A review of characteristics, properties and analytical methods. Crit. Rev. Anal. Chem. 47(4), 298–308 (2017).
Njire, M. et al. Pyrazinamide resistance in Mycobacterium tuberculosis: Review and update. Adv. Med. Sci. 61(1), 63–71 (2016).
Feng, X., Ma, Z., Yu, C. & Xin, R. MRNDR: Multihead attention-based recommendation network for drug repurposing. J. Chem. Inf. Model. 64(7), 2654–2669 (2024).
Zhou, Y. et al. Dermatophagoides pteronyssinus allergen Der p 22: Cloning, expression, IgE-binding in asthmatic children, and immunogenicity. Pediatr. Allergy Immunol. 33(8), e13835 (2022).
Hu, S. et al. Races of small molecule clinical trials for the treatment of COVID-19: An up-to-date comprehensive review. Drug Dev. Res. 83(1), 16–54 (2022).
Pu, X., Sheng, S., Fu, Y., Yang, Y. & Xu, G. Construction of circRNA-miRNA-mRNA ceRNA regulatory network and screening of diagnostic targets for tuberculosis. Ann. Med. 56(1), 2416604 (2024).
Naz, K., Ahmad, S., Bilal, H. M. & Siddiqui, M. K. Computing degree based topological indices for bulky and normal polymers. Int. J. Quant. Chem. 124(12), e27435 (2024).
Ismail, R. et al. Investigating Seidel energies and thermodynamic properties of benzenoid hydrocarbons through regression models. Sci. Rep. 15(1), 867 (2025).
Wu, Z., Shangguan, D., Huang, Q. & Wang, Y. Drug metabolism and transport mediated the hepatotoxicity of Pleuropterus multiflorus root: A review. Drug Metab. Rev. 56(4), 349–358 (2024).
Wang, H. et al. NIR-II AIE luminogen-based erythrocyte-like nanoparticles with granuloma-targeting and self-oxygenation characteristics for combined phototherapy of tuberculosis. Adv. Mater. 36(38), 2406143 (2024).
Liu, H., You, L., Tang, Z. & Liu, J. B. On the reduced Sombor index and its applications. MATCH Commun. Math. Comput. Chem 86, 729–753 (2021).
Liu, J. B. & Pan, X. F. Asymptotic incidence energy of lattices. Physica A 422, 193–202 (2015).
Sarkar, P., Dey, A., Kumar, S. & Pal, A. On some extended energy of graphs and their applications. Yugoslav J. Oper. Res. 00, 40–50 (2024).
Milovanovic, I. Z., Milovanovic, E. I. & Zakic, A. A short note on graph energy. MATCH Commun. Math. Comput. Chem 72(1), 179–182 (2014).
Li, W. et al. Puerarin-loaded PEG-PE micelles with enhanced anti-apoptotic effect and better pharmacokinetic profile. Drug Deliv. 25(1), 827–837 (2018).
Li, H. et al. The effects of ferulic acid on the pharmacokinetics of warfarin in rats after biliary drainage. Drug Des. Dev. Ther. 10, 2173–2180 (2016).
Zeng, M. et al. The integration of nanomedicine with traditional Chinese medicine: Drug delivery of natural products and other opportunities. Mol. Pharm. 20(2), 886–904 (2023).
Li, H. et al. The effects of warfarin on the pharmacokinetics of Senkyunolide I in a rat model of biliary drainage after administration of Chuanxiong. Front. Pharmacol. 9(1461), d25-35 (2018).
Gutman, I. The energy of a graph. Ber. Math.-Stat. Sekt. Forschungszent. Graz 103, 1–22 (1978).
Ilić, A. & Stevanović, D. The energy of graphs and matrices. Linear Algebra Appl. 431, 2195–2203 (2010).
Das, K. C. & Gutman, I. Some properties of the Laplacian energy of a graph. MATCH Commun. Math. Comput. Chem. 52, 103–112 (2004).
Cavers, M., Fallat, S. M. & Kirkland, S. J. On the normalized Laplacian energy and general Randić index. Linear Algebra Appl. 433, 172–190 (2010).
Chellali, M., Kiani, D. & Gutman, I. Recent developments in energy-like graph invariants. MATCH Commun. Math. Comput. Chem. 82, 5–28 (2019).
Dehmer, M., Emmert-Streib, F. & Mehler, A. Graph entropy and information functionals for the analysis of complex networks. Appl. Math. Comput. 201, 82–94 (2009).
Huang, J. C., Ko, K. M., Shu, M. H. & Hsu, B. M. Application and comparison of several machine learning algorithms and their integration models in regression problems. Neural Comput. Appl. 32(10), 5461–5469 (2020).
Asuero, A. G., Sayago, A. & González, A. G. The correlation coefficient: An overview. Crit. Rev. Anal. Chem. 36(1), 41–59 (2006).
Siddiqui, M. K. Exploring the properties of antituberculosis drugs through QSPR graph models and domination-based topological. Sci. Rep. 14, 24387 (2024).
Author information
Contributions
Kiran Naz was responsible for data analysis, computation, and verification of calculations. Hafiz Muhammad Bilal contributed to enhancing the graphical representations using Python and MATLAB. Muhammad Kamran Siddiqui supervised the project, conceptualized and structured the methodology, coordinated the research, secured resources, and drafted the initial version of the paper. Sarfraz Ahmad assisted with computation, data analysis, and reviewing the final draft of the paper. Mustafa Ahmed Ali contributed to validation, formal analysis of experiments, funding acquisition, and software development. Each author reviewed and approved the final version of the work.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Naz, K., Bilal, H.M., Siddiqui, M.K. et al. Predicting tuberculosis drug properties using extended energy based topological indices via a python driven QSPR approach. Sci Rep 15, 15642 (2025). https://doi.org/10.1038/s41598-025-00579-1
DOI: https://doi.org/10.1038/s41598-025-00579-1