Abstract
The development of continuous pharmaceutical manufacturing is crucial and can be analyzed via advanced computational models. Machine learning is a strong computational paradigm that can be integrated into a continuous process to enhance the drugs’ solubility and efficacy. In this research, a simulation method for estimating pharmaceutical solubility was considered in green solvents to develop the idea of continuous pharmaceutical manufacturing. Artificial intelligence strategies were utilized to apply models for fitting several solubility datasets. Using machine learning techniques, the solubility of Clobetasol Propionate (CP) was modeled at temperature values between 308 K and 348 K, and pressures in the range of 12.2 MPa to 35.5 MPa. In this research, two models—a neural network-based model called MLP (Multilayer Perceptron) and a probabilistic model called GPR (Gaussian Process Regression)—along with an ensemble voting model based on these two, were considered for modeling. A GWO (Grey Wolf Optimization) method was also used to tune their hyperparameters. All three models have significant performances on estimation of CP solubility. But the voting model, which is a combination of the other two models, is better than the other two models in terms of accuracy. The ensemble voting model, integrating MLP and GPR with GWO optimization, offers superior accuracy for predicting CP solubility, advancing continuous pharmaceutical manufacturing.
Similar content being viewed by others
Introduction
The development of green manufacturing is of great interest owing to the advantages of these processes for the environmental as well as cost of the processes. In process engineering, sustainability plays a crucial role, and extensive research is being carried out for the development of green and sustainable processes1,2. In biotech and pharmaceutical manufacturing, some green processing has been developed such as supercritical processing which utilizes a gas solvent at supercritical conditions for processing3,4. Since harmful organic materials such as solvents are not employed in these processes, they are considered green and sustainable processing. These processes offer unique characteristics in the preparation of pharmaceuticals5,6,7. Additionally, these processes can operate as continuous mode to address one of the challenges in the pharmaceutical industry.
Some works have been carried out in supercritical green processing for preparation of nanomedicines by this technique such as the method of rapid expansion by which nanoparticles of drug can be obtained by proper control of process parameters8,9. Particle size between 200 and 300 nm has been obtained by supercritical processing as reported by Sakabe and Uchida8. Despite the experimental development of green processing, most of studies in this area are devoted to correlation of solubility data, as experimental measurement of solubility of drugs in supercritical solvents is tedious and costly10,11,12. In this context, gravimetric method is usually preferred for measuring solubility, and the data are collected. Then some correlations (e.g., data-driven models) are fitted to the measured data to train an AI (Artificial Intelligence) algorithm for the measured data. Once the model is trained and tested, that is then utilized for estimation of the drug solubility in a wide range of conditions. Furthermore, the tested models can be employed for evaluation of influence of pressure and temperature as the main inputs on the variations of drug solubility in the supercritical solvents13.
The methods based on machine learning (ML) for solubility estimation have attracted much attention as it is facile to be implemented and it is fast and accurate enough for fitting small dataset of pharmaceutical solubility5,14,15. Therefore, ML modeling approach has been suggested for determination of drug solubility in supercritical solvents by which one can avoid extensive cost and time of laboratory measurements16. The goal of ML is to design computer algorithms that are able to examine data and use those data to construct models17,18. In this study, we choose to use three regression models: the Gaussian Process Regression (GPR), the Multi-layer Perceptron (MLP), and the voting regression.
The GPR model, a non-parametric Bayesian approach, is useful for both exploration and exploitation tasks. It can represent a wide variety of input-output relationships by considering an infinite number of input features. Bayesian inference helps determine the model’s complexity19,20,21,22. Also, neural networks, such as MLP estimate models, have been utilized for research objectives. As a consequence of a layered feed-forward architecture composed of multiple layers of feed-forward units triggered by an input-output transfer function. Feed Forward Neural Network is the term used for this23,24.
The voting ensemble learning system combines different base learners by assigning weights during training. New learners help balance and improve overall performance. The effectiveness of the ensemble depends on the diversity of base model outputs and the integration strategy used25,26. GWO (Grey Wolf Optimization)27 simulates grey wolf leadership and hunting via population-based meta-heuristic optimization. Initially, wolves are randomly positioned. During optimization, they update their positions based on the alpha wolf, which represents the best solution28. As a result, the second and third best solutions are designated beta and delta, and those wolves will guide them through the quest. Omega will be the fourth best answer29.
In this study, GPR and MLP were selected for calculating the solubility of Clobetasol Propionate (CP) in supercritical solvent, with their strengths combined in a voting ensemble. GPR, a probabilistic model, was chosen for its ability to quantify uncertainty and perform well with small datasets, which aligns with the limited experimental data available. MLP, a neural network, was selected for its capacity to capture complex, non-linear relationships influenced by factors like temperature and pressure, common in solubility data. Their combination enhances predictive accuracy by leveraging GPR’s uncertainty handling and MLP’s pattern recognition. While other robust methods like XGBoost or Support Vector Regression (SVR) are effective, GPR and MLP were preferred due to their proven success in pharmaceutical applications, with GPR offering probabilistic outputs and MLP excelling in intricate data modeling. This approach ensures reliable predictions for green pharmaceutical manufacturing.
The main novelty and contribution of the current study is in model development and integration with GWO optimizer to enhance the accuracy of the models, which was used for the first time to estimate the solubility of CP in supercritical CO2 (SC-CO2). The integration of GPR, MLP, and GWO has been employed in previous studies for predictive modeling tasks. However, this work introduces a novel approach by combining these models into a voting ensemble framework explicitly optimized for predicting the solubility of CP in SC-CO2. Our method enhances prediction accuracy and addresses key phenomena, such as the cross-over pressure point, offering improvements over existing literature, as detailed in subsequent sections.
Dataset of green processing
The dataset applied in this study was obtained from paper30, and it comprises two inputs, namely temperature (K) and pressure (MPa). The process can be investigated as continuous processing of medicines with enhanced aqueous solubility. The single response of this dataset is s, which represents the drug’s solubility, and its unit is g/L as indicated in Table 1. Each of the 45 rows in this data set was recorded at a different one of the five temperature levels and nine pressure levels31. The range of T and P for modeling and measurements are selected to keep the solvent at the supercritical state for CO2. Given that the supercritical condition for CO2 is 7.38 MPa and 304 K32, the selected range ensures the solvent operates at the supercritical condition throughout the experiments and measurements. Each data point is displayed in Table 1. The same dataset was used by Liang et al.33 for developing regressive models.
Figure 1 shows the pairwise distribution of parameters for CP solubility in SC-CO2. Diagonal elements (s vs. s, P vs. P, T vs. T) display the distributions of solubility (0.0003–0.3 g/L), pressure (12.2–35.5 MPa), and temperature (308–348 K), highlighting the range and frequency of each variable in the dataset.
Modeling methodology
The method of GPR is a frequently used probabilistic ML method for various scientific tasks34. Gaussian processes are characterized by their mean and covariance functions (kernel function). For declaration of GPR, the data is separated into train dataset (X) with known targets (y) and test dataset \(\:\left({X}_{t}\right)\) with uncertain output\(\:\:\left({y}_{t}\right)\). The covariance \(\:K(.,.)\) and prior mean \(\:m(.)\) are given as the following joint distribution35:
\(\:K\left(X,{X}_{t}\right)\) signifies the matrix of covariances calculated for entire pairings of train and test points. Also, the variable y stands for the observed values under input X, and \(\:\:\left({y}_{t}\right)\:\) stands for the output of test input \(\:\:\left({X}_{t}\right)\). When the Gaussian Process prior has been established, the Gaussian Process posterior can be used to forecast the testing output. The conditional distribution is36:
\(\:{m}_{*}\:\) represents the posterior mean and \(\:{K}_{*}\:\) stands for the corresponding covariance, respectively.
The MLP is a widely used network architecture in academic fields. In a feed-forward neural network, inputs and biases are accumulated by weight, and then transferred to the activation level, where they are processed by a transfer function to produce an output. The network has numerous layers of feed-forward units23,24,37. When the network’s output units are activated, the output is calculated as follows38:
Activating nodes in the hidden layer follows a process that is analogous to the one described above. Compared the estimated output to the desired output and accounting for the difference, the error definition is37:
L output nodes and N patterns make up a data set. Our research aims to decrease errors by adjusting the connections between layers39.
The development of ML models and the visualization of the results were carried out using Python (version 3.8), accessible from: https://www.python.org.
Results and discussions
The models presented earlier were optimized and evaluated using the GWO method. This section discusses the results of that evaluation.
Single models
The single GPR model was optimized, and its alpha value was considered equal to 6.68 × 10−7. Also, for MLP, the number of hidden layers selected to be equal to 40 and the tolerance to 0.00013 and ‘lbfgs’ as solver function. In terms of RMSE metric, GPR has value of 1.46 × 10−2 and MLP shows an estimated error of 1.08 × 10−2. The R2 is 0.973 for MLP and 0.951 for GPR model. Those are significant results for both. Also, in Figs. 2 and 3 the residuals of models are displayed. The 3D estimation curves of them are in Figs. 4 and 5.
Figure 3 reveals a systematic underestimation of CP solubility by the models at higher pressure values (above 25 MPa), offering that the MLP may not fully learn complex solubility dynamics under these conditions.
Voting model
A voting regression model was also developed using MLP and GPR models. As it was predictable, this model is more accurate than both single models with RMSE error rate of 1.03 × 10−2. Also, the R2 is equal to 0.977 with voting model. So, we can introduce this model as the most accurate one amongst other models evaluated in this work. The residual for this model is displayed in Fig. 6 and the 3D estimation surface is shown in Fig. 7. Also, as the primary model of this research, the trends of both inputs are shown in Figs. 8 and 9 using the voting model. Taking into account Fig. 8, it is clearly visible that the increase in solubility increases with increasing pressure. By looking at Fig. 9, we can also conclude that at higher pressures (more than 20 MPa), the solubility of drug increases with increasing temperature. This relationship is slightly inverse at pressures less than 20 MPa, meaning that the solubility of drug is decreased with enhancing the temperature. This behavior has been already reported for solubility of medicines in SC-CO231,33,40,41. Liang et al.33 reported similar solubility trend for CP drug modeling via ML. This point can be considered as the cross-over pressure point where the solubility variations are reversed above this point. Both the effects of solvent density and sublimation pressure create this phenomenon in supercritical solvents and has been already reported in previous studies on drug solubility in SC-CO24. The same trend has been observed in previous studies on the cross-over pressure point which confirms the validity of the models developed via ML in this study16.
In summary, the models have been shown to be reliable in estimation of drug solubility in a wide range of T and P, without encountering overprediction which confirms that the models are not biased for the range of input features. Indeed, the ML models developed and optimized in this research can be adopted for estimation of other drugs solubility in SC-CO2 by learning the solubility data as well as the density of solvent because the density has significant effects on the solubility variations. Thus, the models are able to be used for a wide range of drugs, and the proper optimization method is described in this work which can be used also for outside this scope such as supercritical extraction and nanonization of drugs.
The reliability of voting model in estimating CP solubility across a wide range of temperatures (308–348 K) and pressures (12.2–35.5 MPa) was evaluated using RMSE and R² metrics, with values of 1.08 × 10⁻² and 0.973 for MLP, 1.46 × 10⁻² and 0.951 for GPR, and 1.03 × 10⁻² and 0.977 for the ensemble model, respectively. A 5-fold cross-validation approach ensured robust performance across the dataset, minimizing overfitting. Residual plots (e.g., Fig. 6) revealed no systematic overprediction, with residuals evenly distributed around zero across the input feature range, confirming the absence of bias in the models’ predictions.
The voting regression model achieved an RMSE of 1.03 × 10⁻² and R² of 0.977, surpassing the RMSE of 5.02 × 10⁻1 and R² of 0.967 reported by Obaidullah et al.31 for drug solubility modeling in SC-CO2 via regressive models (Support vector machine models). This can confirm the validity of the model developed in this study and proper methodology to estimate drug solubility at different conditions.
This study is limited by the relatively small dataset used for training and testing the MLP, GPR, and ensemble voting models, which may restrict the models’ ability to generalize to a broader range of conditions or compounds. The focus on CP in SC-CO2 limits the applicability of the findings to other solvents or drugs without further validation. Furthermore, the computational complexity of the GWO optimization process may pose challenges for real-time applications42. Experimental uncertainties in solubility measurements, such as potential variations in temperature or pressure control, could also affect model accuracy and reliability.
Future research could explore the application of the proposed ensemble model to other pharmaceutical compounds to assess its generalizability across diverse solubility profiles in supercritical solvents. Incorporating additional ML techniques, such as deep learning or hybrid models, could further improve predictive accuracy and capture more complex solubility behaviors. Moreover, incorporating real-time experimental data into the model has the potential to facilitate dynamic optimization of process conditions within a continuous manufacturing framework. Investigating the scalability of the model for industrial applications and its integration with process control systems would also advance the practical implementation of green pharmaceutical manufacturing.
Conclusion
The solubility of CP in SC-CO2 was modeled via machine learning approaches. In this study, the MLP neural network model, the GPR probabilistic model, and an ensemble voting model combining these two were investigated for predictive modeling. GWO was employed to optimize the hyperparameters of these models. All three models demonstrated strong performance in estimating CP solubility across a range of temperatures (308–348 K) and pressures (12.2–35.5 MPa). The ensemble voting model, which integrates the strengths of MLP and GPR, obtained the highest accuracy, with a RMSE of 1.03 × 10⁻² and a coefficient of determination (R²) of 0.977.
The optimized models were used to evaluate the impact of process variables, specifically temperature and pressure, on CP solubility. The analysis revealed that solubility increases with pressure, while at higher pressures (> 20 MPa), solubility increases with temperature, and at lower pressures (< 20 MPa), an inverse relationship is observed. This behavior, consistent with the cross-over pressure phenomenon reported in prior studies, validates the reliability of the models. The absence of systematic overprediction in residual plots further confirms that the models are unbiased across the input feature range.
This modeling framework provides a robust approach for predicting drug solubility in SC-CO2, offering potential for application to other pharmaceutical compounds with similar solubility datasets. By leveraging limited experimental solubility data and optimizing predictive models, this approach can support the development of green pharmaceutical manufacturing processes. Future research could extend this framework to other drugs and explore additional machine learning techniques to further enhance predictive accuracy.
Data availability
The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.
References
Bennett, J. A., Campbell, Z. S. & Abolhasani, M. Role of continuous flow processes in green manufacturing of pharmaceuticals and specialty chemicals. Curr. Opin. Chem. Eng. 26, 9–19 (2019).
Viveiros, R. et al. Enzyme-inspired dry-powder polymeric catalyst for green and fast pharmaceutical manufacturing processes. Catal Commun. 172, 106537 (2022).
Askarizadeh, M. et al. Binary and ternary approach of solubility of Rivaroxaban for Preparation of developed nano drug using supercritical fluid. Arab. J. Chem. 17 (4), 105707 (2024).
Ardestani, N. S. et al. Experimental and modeling of solubility of sitagliptin phosphate, in supercritical carbon dioxide: proposing a new association model. Sci. Rep. 13 (1), 17506 (2023).
Xiang, S. T. et al. Solubility measurement and RESOLV-assisted Nanonization of gambogic acid in supercritical carbon dioxide for cancer therapy. J. Supercrit. Fluids. 150, 147–155 (2019).
Zeinolabedini Hezave, A. & Esmaeilzadeh, F. Solubility measurement of diclofenac acid in the supercritical CO2. J. Chem. Eng. Data. 57 (6), 1659–1664 (2012).
Zhao, Z. et al. Multi support vector models to estimate solubility of Busulfan drug in supercritical carbon dioxide. J. Mol. Liq. 350, 118573 (2022).
Sakabe, J. & Uchida, H. Nanoparticle size control of Theophylline using rapid expansion of supercritical solutions (RESS) technique. Adv. Powder Technol. 33 (1), 103413 (2022).
Uchida, H. et al. Production of Theophylline nanoparticles using rapid expansion of supercritical solutions with a solid cosolvent (RESS-SC) technique. J. Supercrit. Fluids. 105, 128–135 (2015).
Yamini, Y. et al. Solubility of capecitabine and docetaxel in supercritical carbon dioxide: data and the best correlation. Thermochim. Acta. 549, 95–101 (2012).
Yin, X. et al. Itraconazole solid dispersion prepared by a supercritical fluid technique: preparation, in vitro characterization, and bioavailability in beagle dogs. Drug Des. Devel Ther. 9, 2801–2810 (2015).
Zhu, H. et al. Machine learning based simulation of an anti-cancer drug (busulfan) solubility in supercritical carbon dioxide: ANFIS model and experimental validation. J. Mol. Liq. 338, 116731 (2021).
Ghazwani, M., Yasmin, M. & Begum Machine learning aided drug development: assessing improvement of drug efficiency by correlation of solubility in supercritical solvent for nanomedicine Preparation. J. Mol. Liq. 387, 122511 (2023).
Wang, T. & Su, C. H. Medium Gaussian SVM, wide neural network and Stepwise linear method in Estimation of lornoxicam pharmaceutical solubility in supercritical solvent. J. Mol. Liq. 349, 118120 (2022).
Zhao, Z. et al. Multi support vector models to estimate solubility of Busulfan drug in supercritical carbon dioxide. J. Mol. Liq. 350, 118573 (2022).
Abouzied, A. S. et al. Advanced modeling and intelligence-based evaluation of pharmaceutical nanoparticle Preparation using green supercritical processing: theoretical assessment of solubility. Case Stud. Therm. Eng. 48, 103150 (2023).
El Naqa, I. & Murphy, M. J. What is machine learning?, in Machine Learning in Radiation Oncology. Editors: Issam El Naqa, Ruijiang Li, Martin J. Murphy, Springer. 3–11. (2015).
Goodfellow, I., Bengio, Y. & Courville, A. Machine Learn. Basics Deep Learn., 1(7): 98–164. (2016).
Gershman, S. J. & Blei, D. M. A tutorial on bayesian nonparametric models. J. Math. Psychol. 56 (1), 1–12 (2012).
Williams, C. K. Prediction with Gaussian Processes: from Linear Regression To Linear Prediction and Beyond, in Learning in Graphical Modelsp. 599–621 (Springer, 1998).
Schulz, E., Speekenbrink, M. & Krause, A. A tutorial on Gaussian process regression: modelling, exploring, and exploiting functions. J. Math. Psychol. 85, 1–16 (2018).
Hoang, N. D. et al. Estimating compressive strength of high performance concrete with Gaussian process regression model. Advances in Civil Engineering, (2016).
Rafiq, M., Bugmann, G. & Easterbrook, D. Neural network design for engineering applications. Comput. Struct. 79 (17), 1541–1552 (2001).
Mielniczuk, J. & Tyrcha, J. Consistency of multilayer perceptron regression estimators. Neural Netw. 6 (7), 1019–1022 (1993).
Chen, Y., Wong, M. L. & Li, H. Applying ant colony optimization to configuring stacking ensembles for data mining. Expert Syst. Appl. 41 (6), 2688–2702 (2014).
Chen, G. et al. VAERHNN: Voting-averaged ensemble regression and hybrid neural network to investigate potent leads against colorectal cancer. Knowl. Based Syst. 257, 109925 (2022).
Mirjalili, S., Mirjalili, S. M. & Lewis, A. Grey Wolf optimizer. Adv. Eng. Softw. 69, 46–61 (2014).
Sodeifian, G. et al. Application of supercritical carbon dioxide to extract essential oil from Cleome coluteoides boiss: experimental, response surface and grey Wolf optimization methodology. J. Supercrit. Fluids. 114, 55–63 (2016).
Alomoush, A. A. et al. Hybrid harmony search algorithm with grey Wolf optimizer and modified opposition-based learning. IEEE Access. 7, 68764–68785 (2019).
Asiabi, H. et al. Measurement and correlation of the solubility of two steroid drugs in supercritical carbon dioxide using semi empirical models. J. Supercrit. Fluids. 78, 28–33 (2013).
Obaidullah, A. J. et al. Computational intelligence modeling using artificial intelligence and optimization of processing of small-molecule API solubility in supercritical solvent. Case Stud. Therm. Eng. 49, 103321 (2023).
Li, S. et al. Supercritical carbon dioxide as solvent for manufacturing of ibuprofen loaded gelatine sponges with enhanced performance. Chem. Eng. Process. - Process. Intensif. 205, 110038 (2024).
Liang, J. et al. Computational intelligence modeling and optimization of small molecule API solubility in supercritical solvent for production of drug nanoparticles. Sci. Rep. 15 (1), 14774 (2025).
Wang, J. An intuitive tutorial to Gaussian processes regression. arXiv preprint arXiv:2009.10862, (2020).
Rasmussen, C. E. Gaussian Processes in Machine Learning (Springer, 2004).
Tabatabaeipour, M. et al. Ultrasonic guided wave Estimation of minimum remaining wall thickness using Gaussian process regression. Mater. Design. 221, 110990 (2022).
Venkatesan, P. & Anitha, S. Application of a radial basis function neural network for diagnosis of diabetes mellitus. Curr. Sci. 91 (9), 1195–1199 (2006).
Yilmaz, I. & Kaynar, O. Multiple regression, ANN (RBF, MLP) and ANFIS models for prediction of swell potential of clayey soils. Expert Syst. Appl. 38 (5), 5958–5966 (2011).
Altun, H. & Gelen, G. Enhancing performance of MLP/RBF neural classifiers via an multivariate data distribution scheme. in International conference on computational intelligence (ICCI2004), Nicosia, North Cyprus. (2004).
Hao, C. et al. Computational study and experimental validation on the solubility of drugs in supercritical solvent for assessment of nanomedicine production via green technology for enhanced drug bioavailability. J. Mol. Liq. 382, 121835 (2023).
Ghazwani, M. et al. Development of advanced model for Understanding the behavior of drug solubility in green solvents: machine learning modeling for small-molecule API solubility prediction. J. Mol. Liq. 386, 122446 (2023).
Lachure, J. & Doriya, R. ESML: A Hyperparameter-Tuned stacking machine learning approach for anomaly attack detection in water distribution systems. SN Comput. Sci. 6 (5), 505 (2025).
Acknowledgements
The authors extend their appreciation to Taif University, Saudi Arabia, for supporting this work through project number (TU-DSPP-2025-24).
Funding
This research was funded by Taif University, Saudi Arabia, Project No. (TU-DSPP-2025-24).
Author information
Authors and Affiliations
Contributions
Ahmed A. Lahiq: Writing, Methodology, Investigation, Formal analysis, Supervision.Abdullah A. Alshehri: Resources, Investigation, Writing, Software, Visualization, Funding. Shaker T. Alsharif: Conceptualization, Validation, Writing, Methodology.All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Lahiq, A.A., Alshehri, A.A. & Alsharif, S.T. Machine learning analysis of drug solubility via green approach to enhance drug solubility for poor soluble medications in continuous manufacturing. Sci Rep 15, 26007 (2025). https://doi.org/10.1038/s41598-025-11823-z
Received:
Accepted:
Published:
Version of record:
DOI: https://doi.org/10.1038/s41598-025-11823-z











