Introduction

Drug solubility can be estimated with computational models, either thermodynamic or data-driven. In the pharmaceutical industry, it is important to determine the solubility of medicines in different solvents, because solubility predictions are used to design formulations with enhanced solubility and to analyze crystallization processes that separate the solid (drug) from the solvent. Solubility plays a crucial role in the crystallization of pharmaceutical compounds, as it is needed to locate the metastable zone and the supersaturation conditions that drive the process1,2. Equations of state (EoS) are one of the major approaches to thermodynamic modeling of drug solubility in supercritical solvents. Zhang et al.3 developed a PC-SAFT equation-of-state model to correlate solubility data of anticancer drugs in supercritical CO2. They used measured data to optimize the temperature-independent parameters of the model, which showed high prediction accuracy. Other EoS-based thermodynamic models have also been proposed for estimating solubility data4,5. Thermodynamic models of drug solubility are usually built on the assumption of a solid-liquid equilibrium.

Correlative models such as thermodynamic models require measured solubility data, yet such data are difficult to obtain for a large number of active pharmaceutical ingredients (APIs). Moreover, building and optimizing thermodynamic models for a large number of APIs is time consuming, so other modeling techniques should be explored for correlating drug solubility datasets. Data-driven models are strong alternatives to thermodynamic models for correlating solubility data and have been used reliably in drug solubility estimation6. The data are generated via solubility measurements and then used to fit the data-driven models. The trained and tested models can then estimate drug solubility over a wide range of conditions, including pressure, temperature, and composition7,8. Data-driven models have achieved better accuracy than thermodynamic models, which shows their usefulness in pharmaceutical modeling9.

Machine learning (ML) models are a data-driven approach that can be used to correlate drug solubility datasets10. This approach has applications in a wide range of fields and industries. Regression models are an integral part of ML for the quantitative analysis of large datasets. In regression, the goal is to learn a mapping function that predicts the target variable from the features11,12,13. Elastic Net Regression (ENR), Orthogonal Matching Pursuit (OMP), and Gaussian Process Regression (GPR) are three powerful regression methods for correlating API solubility data.

The ML models used in this study were selected for their robustness in correlating drug datasets. ENR combines the benefits of Lasso and Ridge regression, offering a comprehensive framework for variable selection and regularization. Its flexibility, interpretability, and ability to handle multicollinearity contribute to its wide applicability in diverse domains14. GPR is a Bayesian approach that provides not only point predictions but also a measure of uncertainty by estimating the conditional probability distribution, enabling robust and flexible modelling in a wide range of applications15. In OMP regression, the algorithm starts with an empty set of selected inputs and incrementally adds the feature that shows the greatest correlation with the residual16,17.

In this study, Elastic Net Regression, Orthogonal Matching Pursuit, and Gaussian Process Regression models are used with Grey Wolf Optimization (GWO) as the hyperparameter tuner to determine the solubility of raloxifene and the density of CO2 in the supercritical state. This modeling strategy is applied for the first time to correlating raloxifene solubility, with the aim of increasing fitting precision compared to previous ML models. The models are then compared to assess their accuracy on the dataset.

Dataset

We analyzed supercritical processing using machine learning. The dataset, taken from ref. 18, consists of measurements of raloxifene solubility, temperature, pressure, and supercritical CO2 density. Raloxifene was chosen as a case study because its water solubility is poor, so supercritical processing can be assessed as a route to nanonizing the drug particles for higher aqueous solubility. All parameters were considered for building the ML models in two steps, i.e., training and validation. The operating conditions were selected so that the solvent (CO2) is in the supercritical state, i.e., above its critical pressure of 7.38 MPa and critical temperature of 304.1 K. Here is a breakdown of the different columns in the dataset:

  • Temperature (T): This column represents the temperature values measured in Kelvin (K). The dataset includes temperature values of 313 K, 318 K, 328 K, and 343 K.

  • Pressure (P): This column represents the pressure values measured in bar. The dataset includes pressure values of 100 bar, 120 bar, 140 bar, 165 bar, 185 bar, 205 bar, and 240 bar. All values are above the solvent’s critical pressure (7.38 MPa, i.e., 73.8 bar).

  • Solubility of Raloxifene (y): This column represents the solubility of raloxifene. The solubility values are given as numerical values for each combination of temperature and pressure.

  • Supercritical CO2 Density: This column represents the density of carbon dioxide (CO2). The density values are also provided as numerical values corresponding to each combination of temperature and pressure. Density is selected as the output because the solvent is compressible, and density changes can have major influence on the variations of raloxifene solubility.
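As a quick sanity check on the operating window described above, the following minimal sketch confirms that every listed temperature and pressure level lies above the CO2 critical point quoted in the text. It assumes a full factorial grid of the listed levels; the names `is_supercritical` and `grid` are illustrative and not taken from the original work.

```python
from itertools import product

# Critical point of CO2 as quoted in the text
T_CRIT_K = 304.1
P_CRIT_MPA = 7.38

# Temperature and pressure levels listed in the dataset description
temperatures_k = [313, 318, 328, 343]
pressures_bar = [100, 120, 140, 165, 185, 205, 240]

BAR_TO_MPA = 0.1  # 1 bar = 0.1 MPa


def is_supercritical(temp_k, pressure_bar):
    """Return True when both T and P exceed the CO2 critical point."""
    return temp_k > T_CRIT_K and pressure_bar * BAR_TO_MPA > P_CRIT_MPA


# Assumed full factorial grid of operating conditions (illustrative only)
grid = list(product(temperatures_k, pressures_bar))
all_supercritical = all(is_supercritical(t, p) for t, p in grid)
print(len(grid), all_supercritical)
```

Converting bar to MPa before the comparison keeps the check in the same units as the critical pressure given in the text.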

Figure 1 illustrates the frequency distributions of the input and output variables as histograms of the raw data. As seen, the density distribution is more skewed than that of the drug solubility. While low solubility values occur most frequently, the solvent density distribution shows higher frequencies at large values, because increasing pressure strongly raises the density of this compressible solvent.

Fig. 1

Histograms of frequencies for raloxifene solubility variables.

Methodology

Grey Wolf optimization (GWO)

GWO is a metaheuristic technique taking cues from the hunting strategies exhibited by grey wolves. The algorithm simulates the hunting dynamics of a pack consisting of four different categories of wolves, namely alpha, beta, delta, and omega. Each wolf in the pack is associated with a vector of decision variables representing its position within the search space. The search process in GWO is guided by the positions of these wolves, which are progressively optimized according to their own positions and the positions of other wolves in the pack19,20.

The updating equation for the position of each wolf is given by21:

$$\:\overrightarrow{{x}_{i}}\left(t+1\right)=\overrightarrow{A}-\overrightarrow{C}\cdot\:\overrightarrow{{r}_{1}}\cdot\:\left|\overrightarrow{D}\cdot\:\overrightarrow{{X}_{i}}\left(t\right)-\overrightarrow{{x}_{i}}\left(t\right)\right|$$

In the above equation, \(\:\overrightarrow{C}\) and \(\:\overrightarrow{A}\) denote coefficient vectors, \(\:\overrightarrow{{r}_{1}}\) stands for a random vector, \(\:\overrightarrow{D}\) indicates the distance vector, and \(\:\overrightarrow{{x}_{i}}\left(t\right)\) stands for the position of the i-th wolf at t-th iteration.
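To make the position update more concrete, here is a minimal NumPy sketch of a grey wolf optimizer. It follows the widely used formulation (a control parameter `a` decreasing linearly from 2 to 0, `A = 2*a*r1 - a`, `C = 2*r2`, and each wolf moving to the average of the positions suggested by the three leaders), which is an assumption here and may differ in notation from the equation above. As a simplification, the leaders are re-selected from the current pack at every iteration. The function names and the sphere test objective are illustrative only.

```python
import numpy as np


def grey_wolf_optimize(objective, bounds, n_wolves=20, n_iter=200, seed=0):
    """Minimize `objective` inside box `bounds` with a basic GWO loop."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(bounds, dtype=float).T
    dim = lower.size
    wolves = rng.uniform(lower, upper, size=(n_wolves, dim))

    for t in range(n_iter):
        fitness = np.array([objective(w) for w in wolves])
        # The three best wolves (alpha, beta, delta) lead the pack
        leaders = wolves[np.argsort(fitness)[:3]].copy()
        a = 2.0 * (1.0 - t / n_iter)  # decreases linearly from 2 toward 0

        candidates = np.zeros_like(wolves)
        for leader in leaders:
            r1 = rng.random((n_wolves, dim))
            r2 = rng.random((n_wolves, dim))
            A = 2.0 * a * r1 - a
            C = 2.0 * r2
            D = np.abs(C * leader - wolves)  # distance to this leader
            candidates += leader - A * D
        # Average the three leader-guided moves and stay inside the bounds
        wolves = np.clip(candidates / 3.0, lower, upper)

    fitness = np.array([objective(w) for w in wolves])
    best = int(np.argmin(fitness))
    return wolves[best], float(fitness[best])


def sphere(x):
    """Simple convex test objective with its minimum at the origin."""
    return float(np.sum(x ** 2))


best_x, best_f = grey_wolf_optimize(sphere, bounds=[(-5, 5)] * 3)
print(best_x, best_f)
```

For hyperparameter tuning, `objective` would be replaced by a validation error computed for a given hyperparameter vector, with `bounds` describing the allowed range of each hyperparameter.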

Elastic net regression (ENR)

Given the training samples \(\:X=\left[{x}_{1},{x}_{2},\dots\:,{x}_{N}\right]\in\:{R}^{N\times\:D}\), where \(\:N\) denotes the number of samples and \(\:D\) is the dimensionality, the Elastic Net (EN) regression algorithm aims to find the optimal coefficients that minimize the sum of squared errors while simultaneously promoting sparsity. The EN optimization problem is formulated as22:

$$\:\underset{{\upbeta\:}}{\text{min}}\left(\frac{1}{2N}\sum\limits_{i=1}^{N}|{y}_{i}-{x}_{i}^{T}{\upbeta\:}{|}_{2}^{2}\right)+{{\uplambda\:}}_{1}|{\upbeta\:}{|}_{1}+\frac{{{\uplambda\:}}_{2}}{2}|{\upbeta\:}{|}_{2}^{2}$$

where \(\:{\upbeta\:}\) is the coefficient vector, \(\:{y}_{i}\) is the target output associated with the i-th sample, \(\:{x}_{i}\) is the corresponding feature vector, \(\:{\left|\cdot\:\right|}_{1}\) denotes the \(\:{\text{l}}_{1}\) norm (L1 norm), \(\:{\left|\cdot\:\right|}_{2}\) denotes the \(\:{\text{l}}_{2}\) norm (Euclidean norm), \(\:{{\uplambda\:}}_{1}\) controls the amount of L1 regularization, and \(\:{{\uplambda\:}}_{2}\) controls the amount of L2 regularization23.
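The objective above can be minimized in several ways. The sketch below uses cyclic coordinate descent with soft-thresholding, which is one common choice and not necessarily the solver used in this study. It omits an intercept, and the synthetic data and the function names (`elastic_net_fit`, `elastic_net_objective`) are illustrative assumptions.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold` (handles the L1 term)."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def elastic_net_objective(X, y, beta, lam1, lam2):
    """Squared-error loss plus L1 and L2 penalties, as in the equation above."""
    n = X.shape[0]
    residual = y - X @ beta
    return (
        (residual @ residual) / (2 * n)
        + lam1 * np.abs(beta).sum()
        + 0.5 * lam2 * (beta @ beta)
    )


def elastic_net_fit(X, y, lam1, lam2, n_sweeps=200):
    """Cyclic coordinate descent for the elastic net (no intercept)."""
    n, d = X.shape
    beta = np.zeros(d)
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(d):
            # Residual with the contribution of feature j added back
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual / n
            beta[j] = soft_threshold(rho, lam1) / (col_scale[j] + lam2)
    return beta


# Synthetic illustration: the second feature carries no signal
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_beta = np.array([2.0, 0.0, -1.0])
y = X @ true_beta + 0.1 * rng.normal(size=50)

beta_hat = elastic_net_fit(X, y, lam1=0.05, lam2=0.01)
print(beta_hat, elastic_net_objective(X, y, beta_hat, 0.05, 0.01))
```

The L1 term enters through the soft-threshold step, which can set coefficients exactly to zero, while the L2 term only enlarges the denominator and shrinks the coefficients smoothly.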

Orthogonal matching pursuit (OMP)

OMP regression combines the greedy OMP algorithm with an additional regularization term, resulting in enhanced performance and interpretability24. Given a set of training samples \(\:X=\left[{x}_{1},{x}_{2},\dots\:,{x}_{N}\right]\in\:{R}^{N\times\:D}\), where \(\:N\) is the total number of samples and \(\:D\) is the dimensionality, the OMP regression algorithm aims to find the sparsest representation of each input sample in terms of a learned dictionary25. The dictionary matrix \(\:D\in\:{R}^{D\times\:K}\), with \(\:K\) being the number of dictionary atoms, is constructed by selecting a subset of \(\:K\) atoms from a larger candidate pool26.

To estimate the coefficients, OMP regression solves the following optimization problem for each sample27:

$$\:\underset{{{\upalpha\:}}_{i}}{\text{min}}|{y}_{i}-D{{\upalpha\:}}_{i}{|}_{2}^{2}+{\uplambda\:}|{{\upalpha\:}}_{i}{|}_{0}$$

where \(\:{{\upalpha\:}}_{i}\) is the coefficient vector for the i-th sample, \(\:{y}_{i}\) is the target output associated with \(\:{x}_{i}\), \(\:{\left|\cdot\:\right|}_{2}^{2}\) denotes the squared Euclidean norm, \(\:{\left|\cdot\:\right|}_{0}\) stands for the \(\:{\text{l}}_{0}\) “pseudo-norm” that counts the number of nonzero entries in \(\:{{\upalpha\:}}_{i}\), and \(\:{\uplambda\:}\) controls the trade-off between fitting the data and promoting sparsity25.
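The greedy selection behind OMP can be sketched in a few lines of NumPy. This is the classical variant that stops after a fixed number of nonzero coefficients, which approximates the penalized problem above rather than solving it exactly; the function name, the synthetic data, and the sparsity level are illustrative assumptions.

```python
import numpy as np


def orthogonal_matching_pursuit(X, y, n_nonzero):
    """Greedy OMP: add the column most correlated with the residual, then refit."""
    n_features = X.shape[1]
    support = []
    coef = np.zeros(n_features)
    residual = y.copy()
    for _ in range(n_nonzero):
        correlations = np.abs(X.T @ residual)
        # Never pick the same column twice
        correlations[np.array(support, dtype=int)] = -np.inf
        support.append(int(np.argmax(correlations)))
        # Orthogonal step: least-squares refit on all selected columns
        sub_coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ sub_coef
    coef[support] = sub_coef
    return coef, sorted(support)


# Synthetic illustration: only columns 1 and 5 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X /= np.linalg.norm(X, axis=0)  # unit-norm columns
true_coef = np.zeros(8)
true_coef[[1, 5]] = [3.0, -2.0]
y = X @ true_coef

coef, support = orthogonal_matching_pursuit(X, y, n_nonzero=2)
print(support, coef)
```

The least-squares refit after each selection is what distinguishes OMP from plain matching pursuit: the residual stays orthogonal to every column chosen so far.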

Gaussian process regression (GPR)

In ML, GPR is applied for estimating the conditional probability distribution of a continuous response variable, denoted as y, given a set of predictor variables, denoted as x. The key concept underlying GPR is the Gaussian process (GP), a collection of random variables, any finite subset of which follows a joint Gaussian distribution. In the context of GPR, GPs are employed to model the unknown function that relates the input variables to the output targets15,28.

In the GPR model, the response variable y is modeled as a Gaussian process with a mean function m(x) and a covariance function k(x, x’). This is expressed as \(\:y\left(x\right)\sim\:\mathcal{G}\mathcal{P}\left(m\left(x\right),k\left(x,{x}^{{\prime\:}}\right)\right)\), in which m(x) denotes the expected value of the response given the predictors, and k(x, x’) captures the similarity between responses at different predictor values. The choice of covariance function, such as the squared exponential or Matérn function, determines the level of similarity between responses; it is typically selected from a family of parametric or non-parametric functions29,30,31.

The GPR algorithm involves the following steps32,33:

  • Choose a mean and a covariance function for the Gaussian process.

  • Maximize the marginal likelihood of the training data to estimate the covariance function hyperparameters.

  • Given the observed training data, compute the posterior variance and mean of the Gaussian process at the predictor values in the test data.

  • Predict the response of the new measurement as the posterior mean of the Gaussian process.
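The steps above can be sketched with plain NumPy. The example assumes a zero mean function, a squared exponential kernel with unit signal variance, and a small fixed noise term; the hyperparameter step is illustrated by picking the length scale with the largest log marginal likelihood from a short candidate list, rather than by a full optimizer such as GWO. All names and the toy sine data are illustrative assumptions.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared exponential covariance between the rows of a and b."""
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale ** 2)


def gp_fit_predict(x_train, y_train, x_test, length_scale=1.0, noise_var=1e-6):
    """Zero-mean GP: posterior mean, posterior variance, log marginal likelihood."""
    n = len(x_train)
    k_train = rbf_kernel(x_train, x_train, length_scale) + noise_var * np.eye(n)
    k_cross = rbf_kernel(x_train, x_test, length_scale)
    chol = np.linalg.cholesky(k_train)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    v = np.linalg.solve(chol, k_cross)
    mean = k_cross.T @ alpha
    # The prior variance of this kernel is 1.0 at every point
    var = np.maximum(1.0 - np.sum(v ** 2, axis=0), 0.0)
    log_ml = (
        -0.5 * y_train @ alpha
        - np.sum(np.log(np.diag(chol)))
        - 0.5 * n * np.log(2.0 * np.pi)
    )
    return mean, var, float(log_ml)


# Toy one-dimensional data (not the raloxifene dataset)
x_train = np.linspace(0.0, 5.0, 8)[:, None]
y_train = np.sin(x_train).ravel()
x_test = np.array([[2.5]])

# Hyperparameter step: keep the candidate with the highest marginal likelihood
candidates = [0.5, 1.0, 2.0]
best_ls = max(candidates, key=lambda ls: gp_fit_predict(x_train, y_train, x_test, ls)[2])
mean, var, log_ml = gp_fit_predict(x_train, y_train, x_test, length_scale=best_ls)
print(best_ls, mean, var)
```

The Cholesky factorization is used instead of an explicit matrix inverse because it is cheaper and numerically more stable for the symmetric positive definite covariance matrix.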

Results and discussion

Solvent density

The models described in this research were developed and fitted in Python 3.8, open-source software available from https://www.python.org.

Table 1 shows the statistical evaluation of the supercritical solvent density predictions from the three methods, compared using the statistical parameters listed in the table. The GPR model achieved the highest R2 score (0.98578), indicating a strong correlation. This accuracy is attributed to the proper selection of features and to hyperparameter optimization using GWO33. GPR also gave the lowest RMSE (26.255) and AARD% (4.83286), demonstrating accurate and reliable predictions. Therefore, GPR was the best model for CO2 density prediction. Figure 2 displays the residuals of GPR for the CO2 density correlation. Figs. 3 and 4 show the direct relationship between density and pressure and the inverse relationship between density and temperature. The predictive function of the GPR model for this output is shown in 3D in Fig. 5, and the corresponding contour plot of CO2 density is shown in Fig. 6. Because density increases with pressure (see Fig. 3), solubility is expected to increase with pressure as well, which is a major advantage of supercritical CO2 as a solvent. This has also been reported in other works on drug solubility estimation via machine learning6,33,34,35,36,37.
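For reference, the three metrics quoted above can be computed as in the following sketch. The text does not spell out the formulas, so the commonly used definitions are assumed here, in particular AARD% as the mean absolute relative deviation multiplied by 100. The sample values are hypothetical and are not taken from the dataset.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def aard_percent(y_true, y_pred):
    """Average absolute relative deviation in percent (assumed definition)."""
    return float(100.0 * np.mean(np.abs((y_pred - y_true) / y_true)))


# Hypothetical observed and predicted values, for illustration only
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 190.0, 400.0])

print(r2_score(y_true, y_pred), rmse(y_true, y_pred), aard_percent(y_true, y_pred))
```

RMSE carries the units of the target, so its magnitude for density and for solubility cannot be compared directly, whereas R2 and AARD% are dimensionless.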

Table 1 Performance metrics for CO2 density prediction.
Fig. 2

Residuals of GPR for correlation of supercritical CO2 density.

Fig. 3

Solvent density vs. pressure (GPR model).

Fig. 4

Solvent density vs. T (GPR model).

Fig. 5

3D surface of supercritical CO2 density simulated using the GPR model. Drawn with Python 3.8 (https://www.python.org).

Fig. 6

Contour plot of supercritical CO2 density simulated using GPR model.

Solubility analysis

Table 2 lists the solubility prediction results for the three models. As with the density correlation, four major criteria are applied for the comparative evaluation of the ML models33. The ENR and OMP models exhibited similar performance, achieving high R2 and relatively small RMSE. However, GPR outperformed them with the highest R2 of 0.97755. The GPR model also gave the lowest RMSE (0.33221) and AARD% (7.08009). Consequently, the GPR model is chosen as the most robust model for solubility prediction.

Table 2 Metrics for solubility prediction via ML models.

The results of the ML model (GPR) in this study are more accurate than those of the thermodynamic models developed for raloxifene in ref. 18. Table 3 compares the performance of the GPR model developed in this research with those models, namely the Peng-Robinson equation of state and the semi-empirical Mendez-Santiago-Teja (MST) correlation18.

Table 3 Comparative analysis for drug solubility.

Figure 7 shows the residuals of GPR for the drug solubility correlation. Figs. 8 and 9 show that solubility increases with both input parameters. The final predictive function of the GPR model for this output is shown in 3D in Fig. 10, and the contour plot of drug solubility is shown in Fig. 11. The dual effect of T on raloxifene solubility is related to the solvent compressibility. At higher T, the density of supercritical CO2 is reduced; however, the solubility of raloxifene increases because of stronger interactions between the drug molecules and the solvent, which outweigh the density reduction with increasing temperature10. Similar solubility trends were also reported in other studies6,33,35,36,37.

Fig. 7

Residuals of GPR for raloxifene solubility correlation.

Fig. 8

Variations of raloxifene solubility with P (GPR model).

Fig. 9

Variations of raloxifene solubility with T (GPR model).

Fig. 10

3D surface of raloxifene solubility constructed via the GPR model. Drawn with Python 3.8 (https://www.python.org).

Fig. 11

Contour plot of raloxifene solubility constructed via GPR model.

To validate the generalizability of the Gaussian Process Regression model trained on raloxifene data, an external dataset consisting of 15 additional drugs with diverse molecular structures and physicochemical properties was analyzed. For each compound, solubility predictions were compared with experimental data, and model performance was evaluated using R², RMSE, and AARD%. Data were collected from published sources for the different drugs38,39,40,41,42. As shown in Table 4, the GPR model consistently achieved R² values above 0.91 and AARD% below 10% for all drugs, demonstrating high predictive accuracy and robustness. The strong performance across compounds such as sunitinib malate, lansoprazole, and buprenorphine HCl highlights the versatility of the model and supports its suitability as a generalized solubility prediction tool for pharmaceutical process design under supercritical CO₂ conditions.

Table 4 Performance metrics for solubility prediction of 15 drugs using GPR model.

Conclusion

Three regression models, ENR, OMP, and GPR, were tuned and fitted to predict raloxifene solubility and CO2 density from P and T. The models were optimized using the Grey Wolf Optimization algorithm to obtain their optimal hyperparameters. Based on the evaluation metrics, all three models performed well in predicting the solubility and CO2 density. The GPR model showed the highest accuracy for both CO2 density and solubility, with the lowest RMSE and AARD% values among the models. The ENR and OMP models also yielded satisfactory results, with decent R-squared values and reasonably low errors. Overall, the results confirmed the validity of machine learning regression models for predicting raloxifene solubility and CO2 density. The accurate predictions obtained from these models can contribute to deeper knowledge of the drug’s behavior and aid in the optimization of pharmaceutical processes. Further research can focus on exploring additional features and applying these models to larger datasets to enhance their predictive capabilities in pharmaceutical applications.