Introduction

The degree to which a medicinal compound is hazardous to living things is known as its toxicity1. Predicting toxicity is extremely difficult, and numerous medicinal compounds are created worldwide each year. Toxicity is related to the amount of a chemical that is inhaled, applied, or injected, and it can result in death, allergies, or other negative consequences for living organisms2. A drug's toxicity can also differ from person to person according to their characteristics; a dose that is curative for one patient may therefore be poisonous for another3. Drugs are necessary for living beings to treat illness, diagnose disease, or prevent it4. A new medication or chemical molecule must go through a lengthy, expensive development process. Every medicine contains two types of chemicals, namely active and inactive ingredients. The term "active ingredient" refers to the substance that constitutes the therapeutic essence of the medicine5. The inactive ingredient has no direct therapeutic benefit but is used to balance a drug's potency; inactive ingredients are often used to bind, coat, flavor, or even speed up the breakdown of the active pharmaceutical. Maintaining a balance between active and inactive ingredients is therefore crucial, as an imbalance between them results in toxicity6. Thus, predicting drug toxicity is vital, and over the last few decades toxicity has been a subject of ongoing research7. In the past, drug testing was performed on animals followed by human trials, but computational intelligence now makes it possible to forecast and assess a drug's toxicity8. Machine learning approaches can be used to predict drug toxicity9, reducing the cost and duration of the development process.

A critical phase in the machine learning pipeline is feature selection (FS), in which pertinent characteristics of the dataset are selected and redundant attributes are excluded10. Proper feature selection can shorten training times, prevent overfitting, and enhance model performance. There are numerous ways to select features, ranging from straightforward to sophisticated11; the best feature subset for a particular machine learning task is usually found by combining approaches and experimenting carefully. Additionally, to prevent data leakage and estimate model performance precisely, feature selection must be carried out inside a validated framework12. Principal component analysis (PCA) is a common dimensionality reduction method in statistics and machine learning13. Its main applications are feature selection and data visualization, with the aim of decreasing a dataset's dimensionality while retaining as much crucial information as feasible. PCA produces the principal components as linear combinations of the original features14. The goal of PCA is to identify a set of orthogonal (i.e., uncorrelated) linear combinations of the initial characteristics that best account for the variation in the data. These linear combinations, known as the principal components, are formed by combining the original attributes with particular weights or coefficients. In brief, PCA reduces the dimensionality of the data while keeping as much information as feasible, and the dimension reduction is applied by combining the actual features into principal components15.
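To make the point about leakage-free feature selection concrete, the following minimal scikit-learn sketch nests a feature selector inside a cross-validated pipeline; the synthetic data, the choice of SelectKBest with five features, and the linear model are illustrative assumptions rather than the configuration used in this work.

```python
# Minimal sketch: feature selection nested inside cross-validation so that the
# selector is fitted only on each training fold, preventing data leakage.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a toxicity feature matrix (8 descriptors, 1 target).
X, y = make_regression(n_samples=546, n_features=8, noise=0.1, random_state=42)

pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_regression, k=5)),  # keep 5 most relevant features
    ("model", LinearRegression()),
])

# The whole pipeline is refitted inside each fold, so the feature scores
# never see the held-out data of that fold.
scores = cross_val_score(pipeline, X, y, cv=10, scoring="r2")
print(f"Mean R^2 over 10 folds: {scores.mean():.3f} +/- {scores.std():.3f}")
```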
In PCA, these new combinations are determined by assigning weights to the original features that are essential to the data's structure, thereby reducing the dimensionality of the data. Resampling is a method of changing a dataset by adding, deleting, or modifying data points; it is used to overcome class imbalance and fitting problems in a dataset16. Oversampling and undersampling can introduce bias or lower the quantity of data available for training17, so resampling must be done with caution. The choice of resampling approach and its parameters is driven by the dataset, the specific challenge, and the desired result. Appropriate model selection, hyperparameter tuning, and cross-validation must be used in conjunction with resampling to obtain a balanced and robust machine learning model18.

The process of splitting a dataset into subsets for training, testing, and validation is termed the percentage split in machine learning19; the subsets are expressed as percentages of the dataset. The choice of percentage split is based on the size of the dataset and data availability. In a typical percentage split, for instance, 80%, 10%, and 10% of the data can be used for training, testing, and validation20. Depending on the data size, a robust model can be built21, and the percentage split can be adjusted to the needs of a project to obtain the best machine learning model. To avoid bias, it is important to split the data at random or by a method that ensures representative subsets of the entire dataset. K-fold cross-validation is a well-known technique for this purpose22. It helps to estimate performance when the dataset is small. In this technique, the dataset is split into K equal-sized folds; the model is trained on K−1 folds, tested on the remaining fold, and the performance statistics or metrics are recorded for each iteration23. The average and standard deviation of the performance metric(s) are then calculated over all K iterations. Compared with a single train-test split, these statistics offer a more reliable estimate of a model's performance. The choice of K depends on many variables, of which the dataset size and the available computational resources are only two examples; K frequently takes values between 5 and 10, and 10-fold cross-validation is often a suitable starting point. Several values of K can be tried to discover which one gives the most accurate performance estimate for the model24. K-fold cross-validation thus offers a more thorough evaluation than a single train-test split, which can be affected by the randomness of the split, and aids in assessing generalization.

A number of machine learning algorithms have been applied in the literature with quite satisfactory performance, but the challenges that still need to be addressed are overfitting, limited generalization, and dependency on a single factor, namely accuracy. The presented paper uses the k-fold cross-validation method to deal with overfitting, ensembles a lazy and an eager learner to address generalization, and uses saw scores, which are composite scores over all performance parameters, to strengthen the optimized model.
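As an illustration of the percentage split and k-fold cross-validation described above, the hedged sketch below uses scikit-learn with an 80/10/10 split and 10 folds; the synthetic data and the Random Forest learner are placeholders, not the exact experimental setup of this paper.

```python
# Sketch: 80/10/10 percentage split and 10-fold cross-validation.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_regression(n_samples=546, n_features=8, noise=0.1, random_state=0)

# 80% for training, then split the remaining 20% equally into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("Validation R^2:", model.score(X_val, y_val))
print("Hold-out test R^2:", model.score(X_test, y_test))

# 10-fold cross-validation: the mean and standard deviation over the folds give
# a more reliable performance estimate than a single split.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=cv, scoring="r2")
print(f"10-fold R^2: {scores.mean():.3f} +/- {scores.std():.3f}")
```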

The main contributions of the presented paper are:

  • To apply a number of machine learning approaches to improve the performance of the model.

  • To optimize and strengthen the model using the multidisciplinary domain of operational research, in which W-saw and L-saw scores are calculated and used to validate the performance of the optimized model before deployment.

Literature review

In this section, different techniques used by researchers in machine learning are discussed along with their findings. Sukumaran et al. created a hybrid method based on particle swarm optimisation and support vector machines to autonomously analyse computed tomography images, offering a high likelihood of detecting Covid-19-related pneumonia25. The trained model identifies the presence of the disease in patients, which saves time for physicians. Sarwar et al. presented an ensembled model for diagnosing type II diabetes26. The authors considered a total of 15 models but used five main approaches, employing MATLAB and the WEKA tool to achieve the desired results. A voting technique is used to ensemble the classifiers, and a medical dataset of 400 people from around the globe is considered in the research. Verma et al. provided an analysis of supervised and unsupervised machine learning methods for identifying suspicious behaviour27, studying the behaviour of a single person in a crowd with artificial intelligence techniques. Bojamma et al. studied the importance of plant identification in balancing nature and preserving the geodiversity of a zone28. The authors assessed the need to explore the latest approaches for the systematic identification of flora; the combined efforts of artificial intelligence researchers and botanists are important to automate the complete process of plant recognition, with leaves as crucial characteristics that help distinguish between different plants. Shidnal et al. studied nutrient deficiencies in paddy crops and used a neural network implemented in TensorFlow to categorise the shortcomings29. The k-means clustering technique is applied to build clusters30, the state of the deficiencies is estimated on a measurable basis, and a rule-based matrix is also used to estimate cropland yield. Table 1 presents the literature based on the algorithms used in each study.

Table 1 Literature based on the algorithms studied.

Tables 2 and 3 present the evaluation parameters and the approaches used in different studies, respectively.

Table 2 Literature based on the evaluation parameters.
Table 3 Literature based on the approaches used.

Renowned researchers are applying machine learning algorithms to understand highly complex problems, and pre-processing is always needed to better understand complex data. New techniques are required for pre-processing steps such as feature selection and clustering. Prominent work has been done in the field, and ensembling within feature selection is now necessary to build a robust model. Feature selection methods such as the Gini index, principal component analysis, recursive feature selection, Fisher filtering, Lasso regression, and the correlation attribute evaluator, among many others, are used by researchers, who are now directing their work toward hybrid or ensembled algorithms.

The paper is structured as follows:

  I) Section "Proposed methodology" describes the methodology adopted.

  II) Section "Results and discussions" presents detailed discussions of the dataset and the results achieved in the three scenarios described.

  III) Section "Conclusion" presents the conclusion and future scope.

Proposed methodology

In this research, seven computer-aided machine learning models are evaluated and their performance is compared. Gaussian Process (GP), Linear Regression (LR), Sequential Minimal Optimization (SMO), Kstar, Bagging, Decision Tree (DT), and Random Forest (RF) are taken into consideration to predict toxicity. By ensembling the Random Forest and Kstar algorithms, an optimized ensembled model (OEKRF) is developed. Three scenarios are introduced for the preprocessing and training of the data, and the seven machine learning algorithms and the optimized KRF are evaluated and compared in all three. Results are compared in terms of state-of-the-art parameters. To further strengthen the model, W-saw and L-saw scores are also evaluated, and the framework of a robust model is deployed. Figure 1 represents the proposed methodology.

Fig. 1 Methodology.

Pseudocode is also presented below to elaborate the process in detail:

Pseudocode (figure a).
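Because the pseudocode figure itself is not reproduced here, the following hedged Python sketch outlines the same overall flow under stated assumptions: the scikit-learn estimators stand in for the WEKA learners named above (Kstar has no scikit-learn counterpart, so a nearest-neighbour regressor is used as an illustrative lazy-learning substitute), and the synthetic data and PCA settings are placeholders.

```python
# Hedged sketch of the overall pipeline (not the paper's original pseudocode):
# load data, reduce features with PCA (scenarios II/III), train the base
# learners and the lazy+eager ensemble, and score them with 10-fold CV.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, VotingRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor  # lazy-learning stand-in for Kstar
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR                          # stand-in for the SMO-based regressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=546, n_features=8, noise=0.1, random_state=1)

models = {
    "GP": GaussianProcessRegressor(),
    "LR": LinearRegression(),
    "SMO-like": SVR(),
    "Kstar-like": KNeighborsRegressor(n_neighbors=5),
    "Bagging": BaggingRegressor(random_state=1),
    "DT": DecisionTreeRegressor(random_state=1),
    "RF": RandomForestRegressor(random_state=1),
    # Optimized ensemble: average of a lazy and an eager learner (cf. OEKRF).
    "OEKRF-like": VotingRegressor([
        ("lazy", KNeighborsRegressor(n_neighbors=5)),
        ("eager", RandomForestRegressor(random_state=1)),
    ]),
}

for name, model in models.items():
    # Scenarios II/III additionally apply PCA-based feature reduction before training.
    estimator = make_pipeline(PCA(n_components=5), model)
    r2 = cross_val_score(estimator, X, y, cv=10, scoring="r2").mean()
    print(f"{name:<11} mean R^2 = {r2:.3f}")
```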

Results and discussions

This section is divided into two subsections. The first subsection presents the dataset description, in which the different attributes of the dataset are described, and the second presents the result analysis with discussions.

Dataset description

The toxicity dataset used in this implementation is obtained from the UC Irvine Machine Learning Repository. There are 546 instances in the dataset and eight predictive attributes. The attributes are listed in Table 4 along with a detailed description.

Table 4 Attribute description.

Table 5 shows some tuples of the dataset that are representative of the entire dataset.

Table 5 Tuples of dataset.

In this paper, the principal component analysis technique is employed to obtain the prime combinations of the features. Principal component analysis is performed in conjunction with the Ranker search method. Dimensionality reduction is done by choosing eigenvectors that account for a given percentage of the variance in the original data. Five new combinations (NC) have been introduced for the optimized toxicity prediction model. Table 6 presents the description of the new features produced by this technique.

Table 6 New combinations by principal component analysis.
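A minimal sketch of this step is given below, assuming standardized inputs and five retained components as in Table 6; the synthetic data are placeholders, and the column names NC1 to NC5 follow the naming of the new combinations only for illustration.

```python
# Sketch: PCA-based dimensionality reduction keeping five principal components
# (new combinations NC1-NC5). Data and scaling choices are illustrative.
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_regression(n_samples=546, n_features=8, noise=0.1, random_state=7)
X_std = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=5)                   # keep five principal components
components = pca.fit_transform(X_std)

new_features = pd.DataFrame(
    components, columns=[f"NC{i + 1}" for i in range(components.shape[1])]
)
print("Explained variance ratio per component:", np.round(pca.explained_variance_ratio_, 3))
print(new_features.head())
```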

The correlation among the various combinations is shown in Table 7.

Table 7 Coefficient of correlation among different combinations.

A heat map is a data visualization technique that represents the numerical values in a dataset using different colors, depicting the correlation coefficients as color gradients. In Fig. 2, red depicts the highest values of the correlation coefficient, yellow shows the intermediate values, and green shows the lowest values.

Fig. 2 Heat map.
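A heat map of this kind can be produced, for example, with seaborn; the sketch below uses synthetic data and a reversed red-yellow-green colormap so that red marks the highest and green the lowest coefficients, mirroring the colour scheme described for Fig. 2.

```python
# Sketch: correlation heat map of the principal-component combinations,
# with red for the highest and green for the lowest coefficients.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(546, 5)), columns=[f"NC{i + 1}" for i in range(5)])

corr = data.corr()  # pairwise correlation coefficients
sns.heatmap(corr, annot=True, fmt=".2f", cmap="RdYlGn_r", vmin=-1, vmax=1)
plt.title("Correlation heat map")
plt.tight_layout()
plt.show()
```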

By ensembling the Random Forest and Kstar algorithms, a better regression model (OEKRF) is created. Figure 3 represents the methodology of the ensembled model, in which Classifier-1 and Classifier-2 apply a lazy and an eager algorithm, respectively, for prediction. Further ensembling is performed using different algorithms.

Fig. 3 Ensembled model.

Algorithm (figure b): Prediction and ensembling.
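As a hedged illustration of the prediction-and-ensembling step, the sketch below averages the predictions of a lazy and an eager learner; Kstar is WEKA-specific, so a nearest-neighbour regressor is used as a stand-in, and simple averaging is an assumed combination rule rather than necessarily the exact scheme of the algorithm above.

```python
# Sketch: combine a lazy learner and an eager learner by averaging their
# predictions. KNeighborsRegressor stands in for Kstar; the averaging rule
# is an illustrative assumption.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=546, n_features=8, noise=0.1, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)

lazy = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)       # classifier-1 (lazy)
eager = RandomForestRegressor(random_state=3).fit(X_train, y_train)   # classifier-2 (eager)

ensemble_pred = (lazy.predict(X_test) + eager.predict(X_test)) / 2.0  # simple average

print("Lazy R^2:    ", round(r2_score(y_test, lazy.predict(X_test)), 3))
print("Eager R^2:   ", round(r2_score(y_test, eager.predict(X_test)), 3))
print("Ensemble R^2:", round(r2_score(y_test, ensemble_pred), 3))
```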

Result analysis

Three different scenarios have been considered for the evaluation and comparison of seven machine-learning algorithms and optimized KRF as follows:

  I) Evaluation and comparison with original features.

  II) Evaluation and comparison with feature selection, resampling, and percentage split method.

  III) Evaluation and comparison with feature selection, resampling, and 10-fold cross-validation method.

The coefficient of correlation is denoted by the R value. It represents how strongly one variable is correlated with another; the value may be positive or negative and varies from −1 to 1. The coefficient of determination (COD), also referred to as the R2 score, is used to evaluate how effective a regression model is: it indicates the degree to which change in the output dependent characteristic can be predicted from the input independent variables. When the COD score is 1, the regression predicts the data perfectly; it ranges from 0 to 1. MAE and RMSE are statistical indicators used to assess the efficacy of a machine learning algorithm on a particular dataset; they quantify the differences between the actual data and the predictions and summarize the model's evaluation error. Q represents the accuracy of the models as a percentage. The state-of-the-art parameters for scenario I, i.e., with the original features, are presented in Table 8. The optimized ensembled KRF is evaluated as best, with an R value of 0.9, a COD value of 0.81, an MAE value of 0.23, and an RMSE value of 0.3. Accuracy is also highest for the ensembled model, with an observed value of 77% in scenario I.
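For reference, the evaluation parameters described above can be computed as in the short sketch below; the prediction vectors are illustrative placeholders, and Q is omitted because its exact definition for this regression setting is not reproduced here.

```python
# Sketch: computing R, COD (R^2), MAE, and RMSE for a vector of predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([0.32, 0.85, 1.20, 0.10, 0.56, 0.95])  # illustrative targets
y_pred = np.array([0.30, 0.90, 1.10, 0.15, 0.50, 1.00])  # illustrative predictions

r = np.corrcoef(y_true, y_pred)[0, 1]                # coefficient of correlation (R)
cod = r2_score(y_true, y_pred)                       # coefficient of determination (R^2)
mae = mean_absolute_error(y_true, y_pred)            # mean absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # root mean squared error

print(f"R = {r:.3f}, COD = {cod:.3f}, MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```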

Table 8 State-of-the-art parameters in scenario I.

The state-of-the-art parameters for scenario II, i.e., with feature selection, resampling, and the percentage split method, are depicted in Table 9. The optimized ensembled KRF again performs well in comparison with the other machine learning algorithms, with an R value of 0.91, a COD value of 0.83, an MAE value of 0.11, and an RMSE value of 0.28, and it achieves an accuracy of 89% in scenario II.

Table 9 State-of-the-art parameters in scenario II.

The state-of-the-art parameters for scenario III, i.e., with feature selection, resampling, and the 10-fold cross-validation method, are compared in Table 10. The optimized ensembled model in scenario III outperforms all other models, achieving an accuracy of 93%, an R value of 0.93, a COD value of 0.86, an MAE value of 0.07, and an RMSE value of 0.25.

Table 10 State-of-the-art parameters in scenario III.

Figures 4 and 5 present the comparison of the coefficient of correlation (R) and the coefficient of determination (COD) across all three scenarios, respectively. Figures 6 and 7 depict the MAE and RMSE across all three scenarios, respectively.

Fig. 4 Coefficient of correlation comparison in all three scenarios.

Fig. 5 Coefficient of determination comparison in all three scenarios.

Fig. 6 Mean absolute error comparison in all three scenarios.

Fig. 7 Root mean squared error comparison in all three scenarios.

The optimized ensembled toxicity prediction model KRF performs well in all scenarios in comparison with the seven machine learning algorithms, and when the optimized model is compared across the three scenarios, it performs best in scenario III. The accuracy comparison is shown separately in Table 11 and Fig. 8 to present the performance of all models together. The highest accuracy, 93%, is achieved by the optimized KRF in scenario III; for scenario I and scenario II, the accuracies achieved by the optimized model are 77% and 89%, respectively.

Table 11 Accuracy comparison for three scenarios.
Fig. 8 Accuracy comparison in all three scenarios.

Further, the concepts of the W-saw and L-saw scores are introduced to strengthen the optimized toxicity prediction model. The operational research terms W-saw and L-saw denote composite scores that combine multiple performance factors into a single value. The W-saw score of a model should be as high as possible, and the L-saw score as low as possible. Both scores show that the performance of the model does not depend on a single factor, and tracking them makes it possible to monitor changes in model performance. Tables 12 and 13 present the W-saw and L-saw scores, respectively, for the different machine learning algorithms, and the score comparisons are shown separately in Figs. 9 and 10 to present the performance of all models together. The W-saw score for the optimized model is 0.83 in scenario I and 0.88 in scenario II, and it is best in scenario III with a score of 0.91.

Table 12 W-saw score comparison for three scenarios.
Fig. 9 W-saw comparison in all three scenarios.

The L-saw score for the optimized model is 0.27 in scenario I and 0.20 in scenario II, and it is best in scenario III with the lowest value of 0.16.

Table 13 L-saw score comparison for three scenarios.
Fig. 10 L-saw comparison in all three scenarios.
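The exact W-saw and L-saw formulas are not reproduced here; the hedged sketch below shows a simple additive weighting interpretation with equal weights over the benefit metrics (R, COD, accuracy) for W-saw and the cost metrics (MAE, RMSE) for L-saw, which is consistent with the OEKRF values reported above (for example, scenario III gives approximately 0.91 and 0.16).

```python
# Hedged sketch of the W-saw / L-saw composite scores as a simple additive
# weighting over the reported metrics. Equal weights over (R, COD, accuracy)
# for W-saw and (MAE, RMSE) for L-saw are an assumption for illustration.
import pandas as pd

# OEKRF metrics reported for the three scenarios (accuracy as a fraction).
metrics = pd.DataFrame(
    {
        "R":    [0.90, 0.91, 0.93],
        "COD":  [0.81, 0.83, 0.86],
        "Q":    [0.77, 0.89, 0.93],
        "MAE":  [0.23, 0.11, 0.07],
        "RMSE": [0.30, 0.28, 0.25],
    },
    index=["Scenario I", "Scenario II", "Scenario III"],
)

benefit = ["R", "COD", "Q"]   # higher is better -> W-saw should be high
cost = ["MAE", "RMSE"]        # lower is better  -> L-saw should be low

w_saw = metrics[benefit].mean(axis=1)   # equal-weight additive score over benefit criteria
l_saw = metrics[cost].mean(axis=1)      # equal-weight additive score over cost criteria

print(pd.DataFrame({"W-saw": w_saw, "L-saw": l_saw}))
```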

Recent deep learning based models have also been introduced and compared with the proposed OEKRF model. AIPs-DeepEnC-GA is a deep learning model that combines the strengths of deep EnC and a genetic algorithm to find the nonlinear relation between molecular structure and toxicity; the DeepAIPs-Pred model learns toxic patterns by monitoring the sequence activities of features; and the Deepstacked-AVPs model embeds the features and finds all possible patterns to extend the model's generalization to unknown data42,43. Table 14 depicts the performance of the deep learning models. The AIPs-DeepEnC-GA model performs better than the other deep learning models, with an R value of 0.82, a COD value of 0.807, an MAE of 0.24, an RMSE of 0.30, and an accuracy of 72%.

Table 14 Evaluation parameters for deep learning models.

Our results therefore show that the optimized ensembled KRF is the best in comparison with the seven machine learning algorithms and the three deep learning models in all aspects. Table 15 depicts the performance of the optimized ensembled KRF in all three scenarios. The OEKRF model achieved 93% accuracy in scenario III, which demonstrates its strong predictive ability for the assessment of toxicity. The higher values of the coefficient of correlation and the coefficient of determination make the model reliable for assessing toxic and non-toxic compounds, while the low MAE and RMSE values make it suitable for real-world applications, where toxicity prediction plays an important role in drug development and early-stage predictions reduce the need for extensive testing, saving resources and time.

Table 15 OEKRF comparison for three scenarios.

The state-of-the-art parameters for all three scenarios are presented. Figure 11 presents the comparison of the R and COD values, Fig. 12 compares the MAE and RMSE values, Fig. 13 represents accuracy, and Fig. 14 presents the saw scores with the highest and lowest values.

Fig. 11 R and COD values comparison in all three scenarios for OEKRF.

Fig. 12 MAE and RMSE comparison in all three scenarios for OEKRF.

Fig. 13 Accuracy comparison in all three scenarios for OEKRF.

Fig. 14 Saw score comparison in all three scenarios for OEKRF.

When the optimized ensembled model KRF is compared with itself across the three scenarios, it performs exceptionally well in scenario III, with feature selection, resampling, and 10-fold cross-validation (10 F-CV).

Conclusion

Prediction of toxicity has been a challenging and crucial task since the start of the medical era, but AI and ML have brought a revolution in the healthcare industry, and it is now possible to optimize this challenging task. In this research we developed an optimized ensembled toxicity prediction model, KRF, and evaluated and compared state-of-the-art parameters for seven machine learning algorithms along with the optimized model. The optimized ensembled model performs well in all three scenarios described in the presented work, and scenario-wise results with the evaluated values of the state-of-the-art parameters are reported in the paper. The optimized model performs exceptionally well in all scenarios, but when the model is compared with itself across the three scenarios, scenario III performs best in all aspects. Deep learning algorithms are also introduced for comparison with the optimized model. The optimized ensembled KRF achieves its highest accuracy of 93% in scenario III, compared with 77% and 89% in scenario I and scenario II, respectively. The R, COD, MAE, and RMSE values for scenario III are 0.93, 0.86, 0.07, and 0.25, respectively, and the W-saw and L-saw values for scenario III are 0.91 and 0.16, respectively. The results are thus established and validated for all scenarios, but they are best and most optimized when feature selection, resampling, and the 10 F-CV technique are applied. The future prospect of the proposed model (OEKRF) is to adapt easily to sudden changes, work in a dynamic environment, and learn from large volumes of historical data to identify patterns, which helps to extend the model's generalization to unseen data. In the coming era, results can be further optimized by using new machine learning algorithms, optimized combinations of features can be introduced through new feature selection methods, and more algorithms can be ensembled and analysed with new parameters. Although the performance of the model is quite satisfactory, there is still scope for improvement, for example by using computational science such as automata theory to reduce computational overhead, provide recommendations at every level, and explore all the possibilities with corresponding solutions.