Introduction

The development of green manufacturing is of great interest owing to its environmental and cost advantages. In process engineering, sustainability plays a crucial role, and extensive research is being carried out on the development of green and sustainable processes1,2. In biotech and pharmaceutical manufacturing, green processes have been developed such as supercritical processing, which uses a gas at supercritical conditions as the solvent3,4. Since harmful organic solvents are not employed, these processes are considered green and sustainable. They also offer unique characteristics in the preparation of pharmaceuticals5,6,7. Additionally, these processes can operate in continuous mode, which addresses one of the challenges in the pharmaceutical industry.

Supercritical green processing has been applied to the preparation of nanomedicines, for example through the rapid expansion method, by which drug nanoparticles can be obtained with proper control of the process parameters8,9. Particle sizes between 200 and 300 nm obtained by supercritical processing were reported by Sakabe and Uchida8. Despite this experimental development, most studies in the area are devoted to the correlation of solubility data, because experimental measurement of drug solubility in supercritical solvents is tedious and costly10,11,12. In this context, the gravimetric method is usually preferred for measuring solubility. Data-driven models, such as AI (Artificial Intelligence) algorithms, are then fitted to the measured data. Once a model has been trained and tested, it is used to estimate the drug solubility over a wide range of conditions. Furthermore, the tested models can be employed to evaluate the influence of pressure and temperature, the main inputs, on drug solubility in supercritical solvents13.

Machine learning (ML) methods for solubility estimation have attracted much attention because they are easy to implement, fast, and accurate enough for fitting small pharmaceutical solubility datasets5,14,15. An ML modeling approach has therefore been suggested for determining drug solubility in supercritical solvents, avoiding much of the cost and time of laboratory measurements16. The goal of ML is to design computer algorithms that examine data and use those data to construct models17,18. In this study, we use three regression models: Gaussian Process Regression (GPR), the Multi-layer Perceptron (MLP), and voting regression.

The GPR model, a non-parametric Bayesian approach, is useful for both exploration and exploitation tasks. It can represent a wide variety of input-output relationships by considering an infinite number of input features, and Bayesian inference helps determine the model’s complexity19,20,21,22. Neural networks such as the MLP have also been widely used for estimation tasks. An MLP has a layered feed-forward architecture, composed of multiple layers of units that are each activated through an input-output transfer function, and is therefore termed a feed-forward neural network23,24.

The voting ensemble learning system combines different base learners by assigning weights during training. New learners help balance and improve overall performance. The effectiveness of the ensemble depends on the diversity of base model outputs and the integration strategy used25,26. GWO (Grey Wolf Optimization)27 is a population-based meta-heuristic that simulates the leadership hierarchy and hunting behavior of grey wolves. Initially, the wolves are randomly positioned. During optimization, they update their positions based on the alpha wolf, which represents the best solution28. The second and third best solutions are designated beta and delta, and together with alpha they guide the other wolves through the search. The remaining solutions are designated omega29.
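The position update described above can be illustrated with a minimal sketch. It assumes the commonly used GWO update rule, in which each wolf moves to the average of three positions suggested by alpha, beta, and delta, and a control parameter decreases linearly from 2 towards 0. The function name `grey_wolf_optimize`, the toy `sphere` objective, and all default settings are illustrative assumptions, not the configuration used in this study.

```python
import random

def grey_wolf_optimize(objective, bounds, n_wolves=12, n_iter=60, seed=0):
    """Minimise `objective` inside the box `bounds` (needs n_wolves >= 3)."""
    rng = random.Random(seed)
    # Wolves start at random positions inside the search box.
    wolves = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_wolves)]
    best_pos, best_val, history = None, float("inf"), []
    for it in range(n_iter):
        # Rank the pack: the three best wolves are alpha, beta and delta.
        ranked = sorted(wolves, key=objective)
        alpha, beta, delta = ranked[0], ranked[1], ranked[2]
        alpha_val = objective(alpha)
        if alpha_val < best_val:
            best_pos, best_val = list(alpha), alpha_val
        history.append(best_val)  # best value found so far
        a = 2.0 * (1.0 - it / n_iter)  # control parameter, shrinks towards 0
        updated = []
        for wolf in wolves:
            new_pos = []
            for d, (lo, hi) in enumerate(bounds):
                pulls = []
                for leader in (alpha, beta, delta):
                    A = 2.0 * a * rng.random() - a
                    C = 2.0 * rng.random()
                    distance = abs(C * leader[d] - wolf[d])
                    pulls.append(leader[d] - A * distance)
                # Average the three suggestions and clip to the box.
                new_pos.append(min(max(sum(pulls) / 3.0, lo), hi))
            updated.append(new_pos)
        wolves = updated
    # The pack produced by the last update is not re-evaluated in this sketch.
    return best_pos, best_val, history

def sphere(x):
    """Toy objective with its minimum at the origin (illustration only)."""
    return sum(v * v for v in x)

best_pos, best_val, history = grey_wolf_optimize(sphere, [(-5.0, 5.0), (-5.0, 5.0)])
print(best_pos, best_val)
```

For hyperparameter tuning, the objective would be a validation error as a function of the hyperparameters, which is one plausible way to couple GWO to the regression models.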

In this study, GPR and MLP were selected for calculating the solubility of Clobetasol Propionate (CP) in supercritical solvent, with their strengths combined in a voting ensemble. GPR, a probabilistic model, was chosen for its ability to quantify uncertainty and perform well with small datasets, which aligns with the limited experimental data available. MLP, a neural network, was selected for its capacity to capture complex, non-linear relationships influenced by factors like temperature and pressure, common in solubility data. Their combination enhances predictive accuracy by leveraging GPR’s uncertainty handling and MLP’s pattern recognition. While other robust methods like XGBoost or Support Vector Regression (SVR) are effective, GPR and MLP were preferred due to their proven success in pharmaceutical applications, with GPR offering probabilistic outputs and MLP excelling in intricate data modeling. This approach ensures reliable predictions for green pharmaceutical manufacturing.

The main novelty and contribution of the current study lies in the model development and its integration with the GWO optimizer to enhance model accuracy, applied here for the first time to estimate the solubility of CP in supercritical CO2 (SC-CO2). The integration of GPR, MLP, and GWO has been employed in previous studies for predictive modeling tasks. This work, however, combines these models into a voting ensemble framework explicitly optimized for predicting the solubility of CP in SC-CO2. Our method enhances prediction accuracy and captures key phenomena, such as the cross-over pressure point, offering improvements over the existing literature, as detailed in subsequent sections.

Dataset of green processing

The dataset applied in this study was obtained from a previous study30, and it comprises two inputs, namely temperature (K) and pressure (MPa). The process can be investigated as continuous processing of medicines with enhanced aqueous solubility. The single response of this dataset is s, the drug’s solubility in g/L. The 45 rows of the dataset cover every combination of five temperature levels and nine pressure levels31, and each data point is listed in Table 1. The ranges of T and P used for the measurements and modeling were selected to keep the CO2 solvent in the supercritical state. Given the critical point of CO2 (7.38 MPa and 304 K)32, the selected ranges ensure supercritical conditions throughout the experiments. The same dataset was used by Liang et al.33 for developing regression models.
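As a simple illustration of the supercritical-state check, the sketch below compares the corners of the operating window with the critical point of CO2 quoted in the text. The window limits (308–348 K, 12.2–35.5 MPa) are the ranges reported for this dataset, and the function name `is_supercritical` is an illustrative assumption.

```python
# Critical point of CO2 as quoted in the text.
CO2_CRITICAL_T_K = 304.0
CO2_CRITICAL_P_MPA = 7.38

def is_supercritical(temperature_k, pressure_mpa):
    """True when both T and P exceed the critical values of CO2."""
    return temperature_k > CO2_CRITICAL_T_K and pressure_mpa > CO2_CRITICAL_P_MPA

# Reported ranges of the dataset; checking the corners is sufficient
# because the condition is monotonic in both T and P.
T_RANGE_K = (308.0, 348.0)
P_RANGE_MPA = (12.2, 35.5)
corners = [(t, p) for t in T_RANGE_K for p in P_RANGE_MPA]
all_supercritical = all(is_supercritical(t, p) for t, p in corners)
print(all_supercritical)
```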

Figure 1 shows the pairwise distribution of parameters for CP solubility in SC-CO2. Diagonal elements (s vs. s, P vs. P, T vs. T) display the distributions of solubility (0.0003–0.3 g/L), pressure (12.2–35.5 MPa), and temperature (308–348 K), highlighting the range and frequency of each variable in the dataset.

Table 1 Dataset of CP solubility30.
Fig. 1 Pairwise relative distribution of parameters for CP solubility in SC-CO2.

Modeling methodology

GPR is a frequently used probabilistic ML method for various scientific tasks34. A Gaussian process is characterized by its mean function and its covariance (kernel) function. To formulate GPR, the data are separated into a training set (X) with known targets (y) and a test set \(\left({X}_{t}\right)\) with unknown outputs \(\left({y}_{t}\right)\). With the covariance function \(K(\cdot,\cdot)\) and the prior mean \(m(\cdot)\), the training and test outputs follow the joint distribution35:

$$\left[\begin{array}{c}y\\ {y}_{t}\end{array}\right]\sim N\left(\left[\begin{array}{c}m\left(X\right)\\ m\left({X}_{t}\right)\end{array}\right],\left[\begin{array}{cc}K\left(X,X\right) & K\left(X,{X}_{t}\right)\\ K\left({X}_{t},X\right) & K\left({X}_{t},{X}_{t}\right)\end{array}\right]\right)$$

\(K\left(X,{X}_{t}\right)\) denotes the matrix of covariances evaluated for all pairs of training and test points. The variable y denotes the observed values for the inputs X, and \({y}_{t}\) denotes the output for the test inputs \({X}_{t}\). Once the Gaussian process prior has been established, the posterior can be used to predict the test output. The conditional distribution is36:

$$p\left({y}_{t}\mid {X}_{t},X,y\right)=N\left({m}_{*}\left({y}_{t}\right),{K}_{*}\left({y}_{t}\right)\right)$$
$${m}_{*}\left({y}_{t}\right)=m\left({X}_{t}\right)+K\left({X}_{t},X\right){K\left(X,X\right)}^{-1}\left(y-m\left(X\right)\right)$$
$${K}_{*}\left({y}_{t}\right)=K\left({X}_{t},{X}_{t}\right)-K\left({X}_{t},X\right){K\left(X,X\right)}^{-1}K\left(X,{X}_{t}\right)$$

\({m}_{*}\) represents the posterior mean and \({K}_{*}\) the corresponding posterior covariance.
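The posterior mean and covariance above can be evaluated directly with a few lines of NumPy. The sketch below is a minimal illustration under stated assumptions: a zero prior mean, a squared-exponential (RBF) kernel, and a small diagonal term (called `alpha` here) added to \(K(X,X)\) for numerical stability. The kernel choice, the function names, and the toy data are assumptions and are not taken from this study.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential covariance between the rows of A and the rows of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale ** 2)

def gp_posterior(X, y, X_t, alpha=1e-8, length_scale=1.0):
    """Posterior mean and covariance at X_t, assuming a zero prior mean."""
    K = rbf_kernel(X, X, length_scale) + alpha * np.eye(len(X))  # K(X, X) + jitter
    K_tx = rbf_kernel(X_t, X, length_scale)                      # K(X_t, X)
    K_tt = rbf_kernel(X_t, X_t, length_scale)                    # K(X_t, X_t)
    # Solve linear systems instead of forming the explicit inverse of K(X, X).
    mean = K_tx @ np.linalg.solve(K, y)
    cov = K_tt - K_tx @ np.linalg.solve(K, K_tx.T)
    return mean, cov

# Toy one-dimensional data (illustration only, not the CP solubility data).
X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.sin(X_train).ravel()
X_test = np.array([[0.5], [1.5]])

mean, cov = gp_posterior(X_train, y_train, X_test)
print(mean)
print(np.diag(cov))  # predictive variances at the test points
```

The diagonal of the posterior covariance gives the predictive variance at each test point, which is the uncertainty estimate that motivates the use of GPR for small datasets.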

The MLP is a widely used network architecture. In a feed-forward neural network, the inputs to each unit are multiplied by weights and summed together with a bias, and the result is passed through a transfer (activation) function to produce the unit’s output. The network has several layers of such feed-forward units23,24,37. The activation of an output unit is calculated as follows38:

$${x}_{o}=f\left(\sum_{h}{x}_{h}{w}_{ho}\right)$$

Here \({x}_{h}\) is the output of hidden unit h, \({w}_{ho}\) is the weight connecting hidden unit h to output unit o, and f is the transfer function. Nodes in the hidden layer are activated in an analogous way. Comparing the estimated output with the desired output, the error is defined as37:

$$E=\frac{1}{2}\sum_{s=1}^{N}\sum_{o=1}^{L}{\left({t}_{o}^{\left(s\right)}-{x}_{o}^{\left(s\right)}\right)}^{2}$$

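Here, L is the number of output nodes, N is the number of patterns in the dataset, and \({t}_{o}^{\left(s\right)}\) and \({x}_{o}^{\left(s\right)}\) are the target and estimated outputs of node o for pattern s. Training reduces this error by adjusting the connection weights between layers39.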
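The forward pass and the error definition can be sketched in plain Python. The example assumes a single hidden layer with a tanh transfer function and an identity transfer function at the output, which is a common choice for regression. The layer sizes, weights, and function names are illustrative assumptions, not the trained network of this study.

```python
import math

def layer_output(inputs, weights, biases, transfer):
    """Weighted sum plus bias for each unit, passed through the transfer function.

    weights[i][j] connects input i to unit j of the layer.
    """
    return [
        transfer(sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j])
        for j in range(len(biases))
    ]

def mlp_forward(x, w_ih, b_h, w_ho, b_o):
    """One hidden layer (tanh) followed by a linear output layer."""
    hidden = layer_output(x, w_ih, b_h, math.tanh)
    return layer_output(hidden, w_ho, b_o, lambda z: z)

def half_sum_squared_error(targets, outputs):
    """E = 1/2 * sum over patterns s and output nodes o of (t - x)^2."""
    return 0.5 * sum(
        (t - x) ** 2
        for t_row, x_row in zip(targets, outputs)
        for t, x in zip(t_row, x_row)
    )

# Hypothetical weights for a network with 2 inputs, 2 hidden units and 1 output.
w_ih = [[0.5, -0.3], [0.1, 0.8]]
b_h = [0.0, 0.1]
w_ho = [[1.0], [-1.0]]
b_o = [0.2]

patterns = [[0.2, 0.7], [0.9, 0.4]]   # scaled inputs (illustration only)
targets = [[0.1], [0.3]]
outputs = [mlp_forward(p, w_ih, b_h, w_ho, b_o) for p in patterns]
print(half_sum_squared_error(targets, outputs))
```

A training algorithm then adjusts `w_ih`, `b_h`, `w_ho`, and `b_o` to reduce this error.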

The development of ML models and the visualization of the results were carried out using Python (version 3.8), accessible from: https://www.python.org.

Results and discussions

The models presented earlier were optimized using the GWO method and then evaluated. This section discusses the results of that evaluation.

Single models

The single GPR model was optimized, giving an alpha value of 6.68 × 10⁻⁷. For the MLP, the number of hidden layers was set to 40, the tolerance to 0.00013, and the solver to ‘lbfgs’. In terms of RMSE, GPR has a value of 1.46 × 10⁻² and MLP has a value of 1.08 × 10⁻². The R² is 0.973 for the MLP and 0.951 for the GPR model, so both models fit the data well. The residuals of the models are displayed in Figs. 2 and 3, and their 3D estimation surfaces are shown in Figs. 4 and 5.
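For reference, the two metrics quoted above follow the standard definitions of the root mean squared error and the coefficient of determination. The sketch below implements them in plain Python. The function names and the toy numbers are illustrative and unrelated to the CP data.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between measured and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant, otherwise SS_tot is zero.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values (illustration only).
measured = [0.10, 0.20, 0.30, 0.40]
predicted = [0.12, 0.18, 0.33, 0.39]
print(rmse(measured, predicted), r2_score(measured, predicted))
```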

Figure 3 reveals a systematic underestimation of CP solubility by the MLP model at higher pressures (above 25 MPa), suggesting that the MLP may not fully learn the complex solubility dynamics under these conditions.

Fig. 2 Residuals for GPR model.

Fig. 3 Residuals for MLP model.

Fig. 4 Estimation curve of GPR model including experimental data points.

Fig. 5 Estimation curve of MLP model including experimental data points.

Voting model

A voting regression model was also developed using the MLP and GPR models. As expected, this model is more accurate than both single models, with an RMSE of 1.03 × 10⁻² and an R² of 0.977. It is therefore the most accurate of the models evaluated in this work. The residuals for this model are displayed in Fig. 6 and the 3D estimation surface is shown in Fig. 7. As this is the primary model of this research, the trends for both inputs are shown in Figs. 8 and 9 using the voting model. Figure 8 clearly shows that solubility increases with increasing pressure. Figure 9 shows that at higher pressures (above 20 MPa), the solubility of the drug increases with increasing temperature. This relationship is slightly inverted at pressures below 20 MPa, where the solubility decreases as the temperature rises. Such behavior in SC-CO2 has already been reported for the solubility of medicines31,33,40,41, and Liang et al.33 reported a similar solubility trend for CP modeled via ML. This pressure can be considered the cross-over pressure point, across which the effect of temperature on solubility is reversed. The phenomenon arises from the competing effects of solvent density and sublimation pressure, as already reported for drug solubility in SC-CO2 in a previous study4. The same trend has been observed in previous studies on the cross-over pressure point, which supports the validity of the ML models developed in this study16.
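The combination step of the voting model described at the start of this subsection can be sketched as a weighted average of the base-model predictions. The sketch assumes equal weights by default. The two stand-in base models, their coefficients, and the function name `voting_predict` are hypothetical and do not reproduce the fitted GPR and MLP models of this study.

```python
def voting_predict(base_models, x, weights=None):
    """Weighted average of the predictions of the base models at input x."""
    if weights is None:
        weights = [1.0] * len(base_models)  # equal weights (assumption)
    total = sum(weights)
    return sum(w * model(x) for w, model in zip(weights, base_models)) / total

# Hypothetical stand-ins for two fitted regressors; x = (T in K, P in MPa).
def toy_model_a(x):
    temperature, pressure = x
    return 0.004 * pressure + 0.0001 * (temperature - 308.0)

def toy_model_b(x):
    temperature, pressure = x
    return 0.005 * pressure - 0.0002 * (temperature - 308.0)

prediction = voting_predict([toy_model_a, toy_model_b], (328.0, 20.0))
print(prediction)
```

Averaging tends to help when the base models make partly independent errors, which is why the diversity of the base learners matters for the ensemble.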

In summary, the models have been shown to be reliable in estimating drug solubility over a wide range of T and P without overprediction, which indicates that the models are not biased over the range of input features. The ML models developed and optimized in this research can be adopted for estimating the solubility of other drugs in SC-CO2 by learning the solubility data together with the solvent density, because density has a significant effect on solubility variations. Thus, the models can be used for a wide range of drugs, and the optimization method described in this work can also be applied outside this scope, for example to supercritical extraction and nanonization of drugs.

The reliability of the models in estimating CP solubility across a wide range of temperatures (308–348 K) and pressures (12.2–35.5 MPa) was evaluated using the RMSE and R² metrics, with values of 1.08 × 10⁻² and 0.973 for MLP, 1.46 × 10⁻² and 0.951 for GPR, and 1.03 × 10⁻² and 0.977 for the ensemble model, respectively. A 5-fold cross-validation approach was used to obtain robust performance estimates across the dataset and to limit overfitting. Residual plots (e.g., Fig. 6) revealed no systematic overprediction, with residuals evenly distributed around zero across the input feature range, indicating the absence of bias in the models’ predictions.
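The 5-fold cross-validation mentioned above can be illustrated by the index split it relies on. The sketch assumes shuffled, equally sized folds over the 45 rows of the dataset. The shuffling, the seed, and the function name `k_fold_indices` are assumptions, since the exact splitting procedure is not specified here.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# With 45 rows and 5 folds, each split holds out 9 rows and trains on 36.
splits = list(k_fold_indices(45, k=5))
for train, test in splits:
    print(len(train), len(test))
```

Each row is held out exactly once, so the error averaged over the five folds uses every measurement for testing without ever testing on a row that was used for fitting in the same fold.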

The voting regression model achieved an RMSE of 1.03 × 10⁻² and an R² of 0.977, improving on the RMSE of 5.02 × 10⁻¹ and R² of 0.967 reported by Obaidullah et al.31 for drug solubility modeling in SC-CO2 via support vector machine regression models. This supports the validity of the model developed in this study and of the methodology for estimating drug solubility at different conditions.

Fig. 6 Residuals for voting model.

Fig. 7 Estimation curve of voting model including experimental data points.

Fig. 8 Variations of CP solubility with pressure described by the ML model.

Fig. 9 Variations of CP solubility with temperature as predicted by the model.

This study is limited by the relatively small dataset used for training and testing the MLP, GPR, and ensemble voting models, which may restrict the models’ ability to generalize to a broader range of conditions or compounds. The focus on CP in SC-CO2 limits the applicability of the findings to other solvents or drugs without further validation. Furthermore, the computational complexity of the GWO optimization process may pose challenges for real-time applications42. Experimental uncertainties in solubility measurements, such as potential variations in temperature or pressure control, could also affect model accuracy and reliability.

Future research could explore the application of the proposed ensemble model to other pharmaceutical compounds to assess its generalizability across diverse solubility profiles in supercritical solvents. Incorporating additional ML techniques, such as deep learning or hybrid models, could further improve predictive accuracy and capture more complex solubility behaviors. Moreover, incorporating real-time experimental data into the model has the potential to facilitate dynamic optimization of process conditions within a continuous manufacturing framework. Investigating the scalability of the model for industrial applications and its integration with process control systems would also advance the practical implementation of green pharmaceutical manufacturing.

Conclusion

The solubility of CP in SC-CO2 was modeled via machine learning approaches. In this study, the MLP neural network model, the GPR probabilistic model, and an ensemble voting model combining these two were investigated for predictive modeling. GWO was employed to optimize the hyperparameters of these models. All three models demonstrated strong performance in estimating CP solubility across a range of temperatures (308–348 K) and pressures (12.2–35.5 MPa). The ensemble voting model, which integrates the strengths of MLP and GPR, obtained the highest accuracy, with an RMSE of 1.03 × 10⁻² and a coefficient of determination (R²) of 0.977.

The optimized models were used to evaluate the impact of process variables, specifically temperature and pressure, on CP solubility. The analysis revealed that solubility increases with pressure, while at higher pressures (> 20 MPa), solubility increases with temperature, and at lower pressures (< 20 MPa), an inverse relationship is observed. This behavior, consistent with the cross-over pressure phenomenon reported in prior studies, validates the reliability of the models. The absence of systematic overprediction in residual plots further confirms that the models are unbiased across the input feature range.

This modeling framework provides a robust approach for predicting drug solubility in SC-CO2, offering potential for application to other pharmaceutical compounds with similar solubility datasets. By leveraging limited experimental solubility data and optimizing predictive models, this approach can support the development of green pharmaceutical manufacturing processes. Future research could extend this framework to other drugs and explore additional machine learning techniques to further enhance predictive accuracy.