Predictive modeling of asthma drug properties using machine learning and topological indices in a MATLAB based QSPR study

Bayati, Jalal Hatem Hussein; Mahboob, Abid; Amin, Laiba; Rasheed, Muhammad Waheed; Alameri, Abdu

doi:10.1038/s41598-025-07022-5


Article
Open access
Published: 19 August 2025

Jalal Hatem Hussein Bayati¹,
Abid Mahboob²,
Laiba Amin³,
Muhammad Waheed Rasheed³ &
…
Abdu Alameri⁴

Scientific Reports volume 15, Article number: 30373 (2025)

Abstract

Machine learning is a vital tool in drug development because it can accurately predict the physical, chemical, and biological properties of compounds. This study uses MATLAB-based algorithms to calculate topological indices and machine learning algorithms to assess how well those indices predict the physicochemical properties of asthma drugs. Combining machine learning with topological indices allows faster and more precise analysis of drug structures. As understanding of the relationship between molecular structure and performance deepens, the integration of machine learning with QSPR research highlights the potential of computational strategies in pharmaceutical discovery. Machine learning algorithms such as random forest and extreme gradient boosting are central to this process. These algorithms learn from labeled data to predict complex molecular behavior, aiding the discovery of new medication options and the improvement of their properties. They improve the accuracy of physical and chemical property predictions, streamline the drug discovery process, and allow large datasets to be evaluated efficiently. Ultimately, these advances facilitate the development of innovative and effective treatments.

Introduction

Asthma is a chronic respiratory condition that affects millions of people worldwide. Many views on the definition and classification of asthma have been put forward over the years¹. Its primary feature is fluctuating difficulty in breathing. Asthma is a chronic inflammatory disease of the respiratory system that involves several cell types and components, such as mast cells, eosinophils, lymphocytes, macrophages, neutrophils, and epithelial cells². The condition is characterized by inflammation and constriction of the airways, with symptoms that include coughing, wheezing, chest tightness, and dyspnea³,⁴. Asthma can be brought on by a number of factors, including allergies, environmental irritants, physical exertion, and respiratory infections, and its severity varies from person to person. Understanding the underlying mechanisms is crucial for management and treatment. In response to certain triggers, people with asthma frequently develop airway inflammation, which causes excess mucus production and constriction of the smooth muscles surrounding the airways. This produces the common symptoms of the disease. There are two types of asthma: allergic asthma is caused by allergens such as dust mites, pollen, or animal dander, while non-allergic asthma is triggered by factors such as cold air, smoke, exercise, or respiratory disorders. Identifying individual triggers is crucial for reducing exposure and preventing symptoms from worsening. Asthma management is multi-modal, combining pharmacological and non-pharmacological treatments. Bronchodilators, leukotriene modifiers, and inhaled corticosteroids are frequently used to reduce inflammation and widen the airways. In addition to medication, people with asthma are often advised to avoid triggers, maintain a healthy lifestyle, and undergo regular evaluation and monitoring by medical specialists.

Asthma symptoms most often first appear in early childhood. Although pre-school children frequently wheeze as a result of viral infections, only about half of them go on to develop typical asthma by school age. Children with frequent or persistent wheezing are more likely to show signs of airway inflammation and remodeling, reduced lung function, and troublesome symptoms that persist well into adulthood⁵. Community studies conducted at multiple times between the 1960s and the early 1990s revealed an increased prevalence of asthma; however, each study used its own methodology, and very few were conducted outside high-income countries⁶. Asthma in children and young adults was the subject of repeated cross-sectional surveys from 1983 to 1996. In total, 178,215 adults aged 18 to 45 from 70 countries answered questions about asthma and its associated symptoms⁷. A critical evaluation of these surveys identified sixteen studies of interest. Only the studies conducted in the United Kingdom, Australia, and New Zealand documented trends in current wheeze. The remaining studies relied on diagnoses of asthma, which can be affected by the condition's prevalence and by trends in diagnosis or labeling⁸,⁹.

In mathematical chemistry, degree-based topological indices are useful tools that capture information about the structure and physical properties of chemical compounds. They are computed from the molecular graph of a compound, in which each atom is a node and each chemical bond is an edge, which makes it possible to study molecular structure quantitatively. A basic topological measure is the degree of a vertex, the number of edges incident to it. Building on this idea, degree-based topological indices combine the degrees of the nodes in a molecular graph into numerical descriptors linked to various chemical properties and reactions. Topological indices are employed in many QSPR/QSAR investigations, and strong correlations have been established between them and a number of physicochemical characteristics of molecules. To create QSAR models, topological indices with strong predictive power should be selected. Readers are referred to¹⁰,¹¹,¹²,¹³,¹⁴ for further information on the applications of topological indices, and to¹⁵,¹⁶ for related graph-theoretic work.

Nowadays, QSPR has become a significant factor in drug development. In 2023, D. Balasubramaniyan et al.¹⁷ carried out a QSPR study of asthma drugs using neighborhood-degree-based TIs. In 2024, Micheal Arockiaraj et al.¹⁸ conducted a QSPR analysis employing distance-based structural indices for drug compounds used against tuberculosis. They reported a robust correlation between the Wiener index and boiling point, enthalpy, and flash point, whereas the Padmakar-Ivan index correlated notably with molar refraction, polarization, and molar volume. In 2024, Abid Mehboob et al.¹⁹ conducted a QSPR analysis of hepatitis drugs, using eleven physical properties and 14 molecular descriptors obtained by the degree method. They reported that eight of the eleven properties, including boiling point, enthalpy, flash point, molar refractivity, LogP, molar volume, and molecular weight, correlate well with all 14 indices, with coefficients in the range of 0.7 to 0.9. In 2024, Mehri Hasani and Masoud Ghods²⁰ conducted a QSPR analysis of different beta-blocker medications for heart disease, focusing on the degree-based topological indices obtained from the M-polynomial. The relationship between the indices and eight properties was determined using both linear and quadratic models. The harmonic index proved to be the most accurate predictor of boiling point, flash point, and enthalpy, whereas the modified third Zagreb index was effective in determining polarizability, molar refractivity, and molar volume through linear analysis. Moreover, in the quadratic analysis the redefined third Zagreb index proved to be the best-fit predictor for polarizability and molar refractivity, while the second modified Zagreb index showed strong predictive power for molar volume. In 2024, B. Kirana et al.²¹ applied QSPR analysis and curvilinear regression to eleven TIs and four physical properties of quinoline antibiotics. Their results indicate that the harmonic index showed a very good correlation with all considered indices for all three regression models.

Machine learning approaches have been shown to enhance the prediction of physicochemical and structural properties in drug discovery and material science applications²²,²³,²⁴. XGBoost has also been successfully applied in QSPR/QSAR studies for its ability to handle non-linear relationships and feature interactions²⁵,²⁶, although performance may vary in small datasets.

Some basic definitions

A graph G is represented by the pair $G = (V, E)$, where V is the set of vertices and E is the set of edges. Two adjacent vertices u and v in G are written $u \sim v$. A line drawn between two vertices is an edge, written $e = uv$. The degree of a vertex v in G is the number of edges incident to it, usually denoted $d_{G}(v)$ or $d_{v}$. In chemical graph theory, a molecular structure is interpreted as a graph whose vertices are atoms and whose edges are bonds. Hydrogen atoms are typically omitted from a chemical graph. In this article, all chemical graphs are connected, finite, and simple.
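These definitions map directly onto a small data structure. The following is a minimal sketch, not the authors' implementation, that stores a hydrogen-suppressed molecular graph as an edge list and counts vertex degrees. The function name `vertex_degrees` and the example molecule (the carbon skeleton of 2-methylbutane) are assumptions chosen for illustration.

```python
from collections import Counter


def vertex_degrees(edges):
    """Return {vertex: degree} for a simple undirected graph given as an edge list."""
    degree = Counter()
    for u, v in edges:
        if u == v:
            raise ValueError("a simple graph has no self-loops")
        # Each edge uv adds one to the degree of both end vertices.
        degree[u] += 1
        degree[v] += 1
    return dict(degree)


# Carbon skeleton of 2-methylbutane (hydrogens omitted):
# the chain C1-C2-C3-C4 with the methyl carbon C5 attached to C2.
edges = [(1, 2), (2, 3), (3, 4), (2, 5)]
degrees = vertex_degrees(edges)
print(degrees)
```

An edge list is enough here because every index in this study is a sum over edges that depends only on the degrees of the two end vertices.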

Reducible first and second Zagreb index

The first and second Zagreb indices are two graph invariants first introduced by Gutman and Trinajstic²⁷. They are among the oldest graph invariants and were used to examine the total pi-electron energy and the branching of the carbon-atom skeleton in molecular structures. These indices have been used extensively in chemical graph theory. In 2011, Kexiang Xu²⁸ studied these two indices for n-vertex graphs with clique number k. In 2022, S.R. Islam²⁹ calculated the second Zagreb index for fuzzy graphs and conducted QSPR research using a linear fitting model. In 2023, Abid Mehboob et al.³⁰, inspired by these indices, introduced an extension known as the reducible first and second Zagreb indices and carried out a QSPR analysis employing degree-based structural indices for drug compounds used in blood cancer. These indices are defined as:

$$\begin{aligned} RM_{1}(G)=\sum \limits _{uv\in E(G)} \left( \frac{n}{d_{u}}+\frac{n}{d_{v}}\right) , \end{aligned}$$

(2.1)

$$\begin{aligned} RM_{2}(G)=\sum \limits _{uv\in E(G)} \left( \frac{n}{d_{u}}\times \frac{n}{d_{v}}\right) . \end{aligned}$$

(2.2)

Here $n$ is the total number of vertices in the graph G, and $d_{u}$ and $d_{v}$ denote the degrees of the vertices u and v, respectively.
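As an illustrative check that is not part of the original study, consider the hydrogen-suppressed graph of n-butane, which is the path $P_{4}$ on $n=4$ vertices with degrees 1, 2, 2, 1. The ratio $n/d$ equals 4 at the two end vertices and 2 at the two inner vertices, so the three edges contribute as follows:

$$\begin{aligned} RM_{1}(P_{4})&=(4+2)+(2+2)+(2+4)=16,\\ RM_{2}(P_{4})&=(4\times 2)+(2\times 2)+(2\times 4)=20. \end{aligned}$$

Low-degree (terminal) atoms carry the largest ratio $n/d$, so in these reducible indices they contribute more than branched atoms do.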

Reducible reciprocal Randic index

The reciprocal Randic index is named after Milan Randic, a Croatian mathematician and chemist who introduced the concept in 1975³¹. The reciprocal Randic index is a topological descriptor that quantifies the complexity of a chemical compound by taking into account the connectivity of its atoms. It is defined as the sum, over all edges $uv$ of the molecular graph, of the square root of the product of the degrees of the end vertices, $\sqrt{d_{u}d_{v}}$. In 2021, Z. Du et al.³² studied the relationship between the Randic index and various topological descriptors such as the Zagreb indices, the ABC index, the GA index, and the augmented Zagreb index. In 2022, C.T. Martinez-Martinez et al.³³ computed the Randic index using the vertex-degree method in Erdos-Renyi graphs and other random graphs. Suleyman Ediz et al.³⁴ carried out a QSPR analysis of the total Zagreb indices and total Randic indices of octanes. The new extension of this index, known as the reducible reciprocal Randic index, is defined as:

$$\begin{aligned} RR(G)=\sum \limits _{uv\in E(G)} \sqrt{\frac{n}{d_{u}}\times \frac{n}{d_{v}}}. \end{aligned}$$

(2.3)

Reducible first and second hyper Zagreb index

Shirdel et al.³⁵ proposed a molecular descriptor called the hyper Zagreb index, a degree-based variant of the Zagreb index. The first hyper Zagreb index measures the molecular complexity of a chemical compound. It has also been used in QSPR/QSAR studies, where it has shown a very good correlation with the biological activity of molecules. M. Suresh and G. Sharmila Devi calculated the hyper Zagreb indices of graph operations related to the lexicographic product³⁶. Hao Zhou et al.³⁷ carried out a QSPR analysis of topological descriptors and biological properties of narcotic drugs, in which this index showed a high correlation, around 0.9, with BP, VP, and EV. A new version of these indices, known as the reducible first and second hyper Zagreb indices, is written as:

$$\begin{aligned} RHM_{1}(G)=\sum \limits _{uv\in E(G)} \left( \frac{n}{d_{u}}+\frac{n}{d_{v}}\right) ^{2}, \end{aligned}$$

(2.4)

$$\begin{aligned} RHM_{2}(G)=\sum \limits _{uv\in E(G)} \left( \frac{n}{d_{u}}\times \frac{n}{d_{v}}\right) ^{2}. \end{aligned}$$

(2.5)

Reducible sigma index

Gutman proposed the Sigma index³⁸, which was inspired by the Albertson index. In that article, he investigated the inverse problem for the sigma index and established that the index is even for every graph. Reti³⁹ compared the sigma index with several well-known irregularity measures and pointed out a number of its interesting features. The latest version of this index, the reducible Sigma index, is defined as:

$$\begin{aligned} RS(G)=\sum \limits _{uv\in E(G)} \left( \frac{n}{d_{u}}-\frac{n}{d_{v}}\right) ^{2}. \end{aligned}$$

(2.6)

Reducible forgotten index

Furtula and Gutman introduced a new variant of the Zagreb indices called the forgotten index⁴⁰. This index is also a measure of branching, and it has been shown to have predictive ability similar to that of $M_{1}(G)$. However, for unknown reasons, it attracted little interest until 2015, when it was reintroduced and began to receive significant attention. For entropy and acentric factor, both $M_{1}(G)$ and $F(G)$ yield correlation coefficients higher than 0.95⁴¹. The recently developed extension of this index, known as the reducible forgotten index, is defined as:

$$\begin{aligned} RF(G)=\sum \limits _{uv\in E(G)} \left( \left( \frac{n}{d_{u}}\right) ^{2}+\left( \frac{n}{d_{v}}\right) ^{2}\right) . \end{aligned}$$

(2.7)

Reducible $1^{st}$ and $2^{nd}$ Gourava Index

The definition and popularity of the Zagreb indices inspired V. R. Kulli⁴² to introduce two new indices, the first and second Gourava indices. He calculated these indices for a few common types of graphs and applied them to armchair and zigzag edge polyhex nanotubes. He also derived exact formulas for the friendship graph and the wheel graph using these indices⁴³. A new form of these indices, known as the reducible first and second Gourava indices, is defined as:

$$\begin{aligned} RG_{1}(G)=\sum \limits _{uv\in E(G)} \left( \frac{n}{d_{u}}+\frac{n}{d_{v}}+\left( \frac{n}{d_{u}}\times \frac{n}{d_{v}}\right) \right) , \end{aligned}$$

(2.8)

$$\begin{aligned} RG_{2}(G)=\sum \limits _{uv\in E(G)} \left( \frac{n}{d_{u}}+\frac{n}{d_{v}}\right) \left( \frac{n}{d_{u}}\times \frac{n}{d_{v}}\right) . \end{aligned}$$

(2.9)

These degree-based reducible topological indices provide a straightforward yet effective way of describing molecular structures. Their broad applicability and capacity to predict a wide range of physical and chemical properties make them valuable tools in chemistry and beyond, and their application is likely to keep expanding across scientific fields. Their primary benefit is simplicity: they are easy to compute and to interpret.
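Because every index in Eqs. (2.1) to (2.9) is a sum over edges of a function of $n/d_{u}$ and $n/d_{v}$, all nine can be accumulated in a single pass over the edge list. The sketch below illustrates this under stated assumptions: the function name `reducible_indices`, the dictionary keys, and the n-butane test graph are ours and are not taken from the authors' MATLAB script.

```python
import math


def reducible_indices(edges):
    """Return the nine reducible indices of Eqs. (2.1)-(2.9) for a simple graph.

    `edges` is a list of (u, v) pairs describing a connected graph without
    self-loops or repeated edges, e.g. a hydrogen-suppressed molecular graph.
    """
    # Vertex degree = number of incident edges.
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    n = len(degree)  # in a connected graph every vertex lies on some edge

    names = ("RM1", "RM2", "RR", "RHM1", "RHM2", "RS", "RF", "RG1", "RG2")
    totals = dict.fromkeys(names, 0.0)
    for u, v in edges:
        a, b = n / degree[u], n / degree[v]  # the ratios n/d_u and n/d_v
        totals["RM1"] += a + b                # Eq. (2.1)
        totals["RM2"] += a * b                # Eq. (2.2)
        totals["RR"] += math.sqrt(a * b)      # Eq. (2.3)
        totals["RHM1"] += (a + b) ** 2        # Eq. (2.4)
        totals["RHM2"] += (a * b) ** 2        # Eq. (2.5)
        totals["RS"] += (a - b) ** 2          # Eq. (2.6)
        totals["RF"] += a ** 2 + b ** 2       # Eq. (2.7)
        totals["RG1"] += a + b + a * b        # Eq. (2.8)
        totals["RG2"] += (a + b) * (a * b)    # Eq. (2.9)
    return totals


# Carbon skeleton of n-butane: a path on four vertices.
butane = [(1, 2), (2, 3), (3, 4)]
indices = reducible_indices(butane)
for name, value in indices.items():
    print(f"{name}: {value:.4f}")
```

For this path the ratios are 4 at the end vertices and 2 at the inner ones, so the results can be checked by hand, for example RM1 = 6 + 4 + 6 = 16 and RS = 4 + 0 + 4 = 8.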

Material and method

Our initial step was to establish the edge partition and degree counting approach, which relies on graph connectivity to define molecular graphs and plays a crucial role in identifying structural characteristics. Degree-based topological indices (TIs) were then determined by analyzing the node degrees within each molecular graph. To simplify this step, we developed a custom MATLAB script to compute the proposed edge-based topological indices efficiently. Python code was then used to construct machine learning models for analyzing physicochemical properties. Additionally, we used SPSS (Version 26.0, https://www.ibm.com/products/spss-statistics) to investigate the relationships between the derived indices and experimental variables. To further validate our findings, we compared actual and computed drug properties graphically to check the accuracy and reliability of our results.

To reduce the risk of overfitting and to support the generalizability of our models, we adopted the following validation strategy. The full dataset was partitioned into 80% training and 20% testing sets using train_test_split() from the scikit-learn library with a fixed random state of 42. The compounds assigned to the training and testing sets are listed in Table 1. In addition, 10-fold cross-validation was applied to the training dataset to evaluate model performance across multiple subsets: the training data are divided into 10 equal parts, and in each iteration a new model is trained on 9 parts (90%) and validated on the remaining part (10%). To prevent data leakage, all machine learning models were trained from scratch for each analysis step. For the 80:20 split, the models were trained exclusively on the training data, and the test data were used only for final performance evaluation. At no point was a model trained on the entire dataset used for testing or validation. The performance metrics averaged over the folds provided a stable assessment of each model's predictive capability. Model performance was assessed using standard metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and the coefficient of determination ($R^2$).
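The fold construction and the four metrics can be written out in a few lines. The sketch below uses plain Python so that the formulas are visible; it is an illustration rather than the study's code, which relied on scikit-learn. The helper names `kfold_indices` and `regression_metrics` are hypothetical, the folds are contiguous and unshuffled, and the sample values are made up and are not data from the study.

```python
import math


def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) index lists for k contiguous, unshuffled folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)  # the first `extra` folds get one more sample
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for two equally long sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "R2": 1 - ss_res / ss_tot}


# Made-up numbers, used only to exercise the functions.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
folds = list(kfold_indices(10, 5))
print(metrics)
print(folds[0])
```

Each sample appears in exactly one validation fold, so averaging the per-fold metrics uses every training sample once for validation without ever scoring a model on data it was fitted to.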

Dataset acquisition strategy

We employed Python version 3.8 for determining topological indices and gathered the physicochemical properties from online databases such as ChemSpider (http://www.chemspider.com/Default.aspx) and PubChem (https://pubchem.ncbi.nlm.nih.gov/). The treatment of asthma involves various medications, including Montelukast, Prednisone, Methylprednisolone, Dexamethasone, Terbutaline, etc., which are labeled a, b, c,..., m in Fig. 1. In the analysis, topological descriptors served as feature variables and physicochemical properties as target variables.

After labeling the dataset, we applied supervised machine learning methods, namely Random Forest (RF) and XGBoost (XGB), to analyze the data and obtain predictive insights. Random Forest was selected for its robustness and its ability to limit overfitting through ensemble learning, whereas XGBoost applies gradient boosting to iteratively correct previous errors, offering high performance on complex data. Both models were trained using cross-validation to ensure the reliability of the results.
The main Python libraries used in model implementation included:
- Jupyter notebook for an interactive environment,
- pandas for data manipulation,
- numpy for numerical computations,
- scikit-learn for machine learning models including Random Forest and utility functions like train_test_split and cross_val_score,
- xgboost for implementing the XGB algorithm,
- matplotlib and seaborn for data visualization.

Model configuration and hyperparameters

All machine learning models were implemented in Python using the scikit-learn and xgboost libraries. The following hyperparameters were used for reproducibility:

Random Forest Regressor (RF):
- Number of trees: n_estimators = 100
- Maximum tree depth: max_depth = 10
- Feature sampling: max_features = 'auto'
- Minimum samples per leaf: min_samples_leaf = 1

XGBoost Regressor (XGB):
- Number of trees: n_estimators = 100
- Learning rate: learning_rate = 0.1
- Maximum depth: max_depth = 6
- Regularization parameters: reg_alpha = 0, reg_lambda = 1
- Subsample ratio: subsample = 0.8
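
The settings listed above can be collected as keyword-argument dictionaries, as in the hedged sketch below. The names `RF_PARAMS`, `XGB_PARAMS` and `build_models` are hypothetical. One assumption is flagged in the code: recent scikit-learn releases no longer accept max_features = 'auto', and for regressors that setting corresponded to using all features, so the sketch substitutes 1.0.

```python
# Hyperparameters as listed in the text, written as keyword arguments.
RF_PARAMS = {
    "n_estimators": 100,
    "max_depth": 10,
    # Assumption: the text gives max_features='auto'. Newer scikit-learn
    # versions reject that string; for regressors it meant "all features",
    # which 1.0 expresses.
    "max_features": 1.0,
    "min_samples_leaf": 1,
}

XGB_PARAMS = {
    "n_estimators": 100,
    "learning_rate": 0.1,
    "max_depth": 6,
    "reg_alpha": 0,
    "reg_lambda": 1,
    "subsample": 0.8,
}


def build_models():
    """Instantiate both regressors; requires scikit-learn and xgboost to be installed."""
    # Imported lazily so the dictionaries above can be inspected without the libraries.
    from sklearn.ensemble import RandomForestRegressor
    from xgboost import XGBRegressor

    return RandomForestRegressor(**RF_PARAMS), XGBRegressor(**XGB_PARAMS)


print(RF_PARAMS)
print(XGB_PARAMS)
```

Calling `build_models()` returns untrained estimators, so a fresh pair can be created for every fold or split, in line with training each model from scratch.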

Table 1 List of drugs assigned to the training and testing sets.
