Abstract
This study investigates the use of machine learning models to predict the solubility of rivaroxaban in binary solvents based on temperature (T), mass fraction (w), and solvent type. Using a dataset of over 250 data points, with solvent types one-hot encoded, four models were compared: Gradient Boosting (GB), Light Gradient Boosting (LGB), Extra Trees (ET), and Random Forest (RF). The Jellyfish Optimizer (JO) algorithm was applied to tune hyperparameters, enhancing model performance. The LGB model achieved the best results, with an R2 of 0.988 on the test set and low error rates (RMSE of 9.1284E-05 and MAE of 5.85322E-05), surpassing the other models in predictive accuracy and generalizability. Parity plots confirmed the LGB model’s close alignment between predicted and actual solubility values, highlighting its robust performance. Furthermore, 3D surface plots and partial effect plots demonstrated LGB’s capacity to model solubility across different solvent systems, capturing complex interactions between T, w, and solvent effects. Finally, the LGB model predicted maximum solubility at a temperature of 305.76 K and a mass fraction of 0.753 in a dichloromethane + methanol mixture, providing valuable insights for solubility optimization in solvent selection. This work underscores the effectiveness of the LGB model for solubility prediction, with potential applications in formulation and experimental planning.
Introduction
Pharmaceutical crystallization is one of the major unit operations in solid-dosage manufacturing and plays an important role in the quality of the obtained products. Indeed, the properties of finished products depend on drug crystal properties such as size, habit, and shape1,2. Given that nanosized particles possess higher solubility, control of particle size is a crucial step in drug crystallization to enhance drug solubility. As the crystallization process is driven by changes in drug solubility, the variation of solubility with the underlying parameters should be well understood in order to control the rate of nucleation as well as crystal growth3,4.
The analysis and understanding of drug solubility in solvents for crystallization development can be carried out via either experimental measurement or computational evaluation. In most cases, both approaches are needed to provide a predictive tool for the analysis of crystallization5,6. As such, the relationship between solubility and process parameters is the key aspect of pharmaceutical crystallization modeling. Various techniques have been developed for the prediction of pharmaceutical solubility, of which correlative models are the most widely used. These methods rely on collecting measured solubility data and fitting a robust model to the dataset. Several thermodynamic and machine learning models have been developed for solubility analysis7,8,9.
In machine learning (ML), computers learn from data, using a collection of techniques, without being explicitly programmed. The objective of ML is to develop meta-programs that can analyze experimental data and utilize it for model training. ML models are great tools for correlating drug solubility datasets: their learning nature can provide high accuracy in estimating drug solubility10,11. Regression analysis is essential for predicting numerical outcomes from input data, making it a fundamental element of predictive modeling. The fundamental role of regression models in machine learning is to enable the development of predictive models that drive innovation in various sectors. Tree-based ensemble regression models, built on Decision Trees, offer precise and reliable solutions for regression problems. They exploit the diversity of multiple trees to enhance the precision and dependability of predictions12.
The selection of models in this study—Gradient Boosting (GB), Light Gradient Boosting (LGB), Extra Trees (ET), and Random Forest (RF)—was guided by their strong performance and interpretability in regression tasks involving multivariate data with complex interactions. Each model presents distinct advantages tailored to the needs of predictive modeling. For example, GB and LGB are gradient-boosting models recognized for iteratively enhancing forecasts by concentrating on the residual errors of prior rounds, making them adept at capturing complex data patterns. LGB, specifically, integrates optimization methods such as gradient-based one-side sampling and exclusive feature bundling, boosting computational efficiency while maintaining precision, rendering it ideal for managing datasets rich in categorical attributes.
The ET and RF models, both ensemble tree-based methods, contribute robustness and stability to the predictions. ET enhances generalizability by introducing randomness in its tree-splitting strategy, thereby increasing diversity among trees, while RF uses bootstrapping to reduce overfitting, maintaining reliability in predictions. To further refine these models, the Jellyfish Optimizer (JO) was employed for hyperparameter tuning. Inspired by jellyfish behavior, JO effectively balances exploration and exploitation in complex parameter spaces, enabling the identification of optimal configurations that enhance each model’s performance.
This paper makes several key contributions to the correlation of pharmaceutical solubility by developing a holistic modeling approach. First, it introduces a systematic approach to optimizing machine learning models for solubility prediction of a drug, namely rivaroxaban, by incorporating temperature, solvent type, and mass fraction data. The application of JO for tuning model hyperparameters represents a novel approach in this domain, significantly improving model performance. Second, the study provides a comprehensive comparison of multiple ensemble-based regression models, with LGB emerging as the top-performing model based on its high predictive accuracy and low error rates. Third, through 3D surface and partial effect plots, the study visually examines the interactions between temperature, solvent type, and mass fraction, offering insights into how these variables collectively influence solubility. Finally, the model identifies optimal conditions for maximum solubility, pinpointing specific temperature and mass fraction values in particular solvents, providing valuable guidance for future experimental setups and formulation strategies.
Data set description
The dataset, comprising over 250 data points, captures the relationship between temperature (T), the mass fraction of dichloromethane (w), solvent type, and the target variable x, the compound solubility. The solute is rivaroxaban, whose solubility was investigated in binary mixtures of dichloromethane and alcohols at various compositions, sourced from13. The dataset incorporates solvent types as categorical variables, encoded using one-hot encoding to capture each solvent type independently and without implied ordinality. The dataset is the same as that used in our previous study14.
Figure 1 shows the correlation heatmap of the numeric inputs and the output variable, solubility (x), in the dataset. This visualization illustrates the degree of association among temperature (T), mass fraction of dichloromethane (w), and solubility (x). Darker or lighter colors in the heatmap indicate stronger correlations, with annotations providing precise correlation coefficients.
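As an aside for reproducibility, a heatmap of this kind can be generated in a few lines; the sketch below is a minimal example, where the file name solubility.csv and the bare column names T, w, and x are assumptions standing in for the actual dataset.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# "solubility.csv" is a placeholder for the dataset of ref. 13, with columns
# T (temperature, K), w (mass fraction of dichloromethane), and x (solubility).
df = pd.read_csv("solubility.csv")

corr = df[["T", "w", "x"]].corr()  # Pearson correlation matrix

# Annotated heatmap analogous to Fig. 1.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation of T, w, and x")
plt.tight_layout()
plt.show()
```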
Methodology
This study’s methodology integrates data preprocessing, machine learning model training, and hyperparameter optimization to accurately predict solubility based on temperature, mass fraction, and solvent type. The data were initially preprocessed through one-hot encoding of solvent types to eliminate ordinal bias and z-score normalization to standardize numerical features, ensuring uniform scaling for effective model training. Outliers were identified and removed using the Mahalanobis Distance metric, thereby enhancing data quality for model training.
Four tree-based ensemble models—GB, LGB, ET, and RF—were selected for their performance and ability to capture complex interactions within multivariate data. The JO algorithm was applied for hyperparameter tuning, efficiently navigating the parameter space to maximize model accuracy through five-fold cross-validation. This process established an optimized configuration for each model, which was then used for predictive performance assessment.
Pre-processing
- Categorical input encoding: Categorical data encoding is a crucial preprocessing step for handling categorical data in machine learning, particularly for models that operate on numerical inputs. In this study, one-hot encoding transforms the categorical solvent data into a binary matrix where each unique category of solvent is represented as a distinct column. For a given solvent, a “1” is placed in the column corresponding to that category, while all other columns receive a “0.” This encoding process effectively eliminates any ordinal relationship between categories, ensuring that the model interprets each solvent as an independent, discrete attribute. By converting categorical solvent data into this binary format, one-hot encoding enables models to process categorical information accurately, supporting enhanced predictive performance and model interpretability15.
- Normalization: Z-score normalization, also known as standardization, is a vital preprocessing technique applied to numerical data, ensuring consistent scaling across features. In this study, z-score normalization standardizes each feature by centering it around zero and scaling it based on its standard deviation. This procedure modifies each data point by deducting the feature’s mean and dividing by its standard deviation, resulting in a distribution with a mean of zero and a standard deviation of one. By normalizing in this manner, z-score normalization mitigates the impact of varying feature scales, allowing the model to learn more effectively and avoid bias toward features with larger numerical ranges. This standardized data contributes to improved convergence and model stability during training16.
- Outlier detection: In this study, Mahalanobis Distance was used to identify and remove outliers. Mahalanobis Distance is an effective metric for detecting outliers, particularly in multidimensional data spaces, as it measures how far a data point lies from the mean while taking the correlations between features into account. By incorporating the covariance matrix, the method accounts for feature interdependencies, assigning larger distances to points that deviate significantly from the data’s general pattern. Data points with a Mahalanobis Distance exceeding a defined threshold are flagged as outliers, enhancing data quality by isolating anomalous instances that might otherwise skew model training17. A combined sketch of these three preprocessing steps follows this list.
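The following minimal Python sketch chains the three preprocessing steps described above. The file name, the raw column names, and the 97.5th-percentile chi-square cutoff for the Mahalanobis threshold are illustrative assumptions rather than details reported here.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

df = pd.read_csv("solubility.csv")  # placeholder file; columns T, w, solvent, x

# 1) One-hot encoding: each solvent becomes an independent 0/1 column,
#    removing any implied ordering between categories.
df = pd.get_dummies(df, columns=["solvent"], prefix="solv")

# 2) Z-score normalization: center each numeric feature at 0 with unit
#    standard deviation so that scales are comparable.
for col in ["T", "w"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()

# 3) Mahalanobis-distance outlier removal: squared distances above a
#    chi-square cutoff (97.5th percentile here, an assumed choice) are dropped.
Xn = df[["T", "w", "x"]].to_numpy()
mu = Xn.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(Xn, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", Xn - mu, cov_inv, Xn - mu)  # squared distances
df = df[d2 <= chi2.ppf(0.975, df=Xn.shape[1])].reset_index(drop=True)

X = df.drop(columns="x")  # model inputs
y = df["x"]               # solubility target
```

In a production pipeline the normalization statistics would be fit on the training split only, to avoid leaking test-set information; the sketch condenses the steps for brevity.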
Tree-based ensemble models
GB
GB is an ensemble of sequentially organized base models, wherein each subsequent model learns from the errors of its predecessor18. This machine learning model produces responses by refining an ensemble of less reliable predictive models to formulate a more precise model. A GB with M trees can be defined as19:

$$F_{M}\left(x\right)={\sum }_{m=1}^{M}{\gamma }_{m}{h}_{m}\left(x\right)$$

where \({h}_{m}\) is a base model that may perform poorly individually, and \({\gamma }_{m}\) stands for a scaling factor that incorporates the contribution of a tree to the model. By employing the gradient descent loss function, GB minimizes errors by modifying the initial estimate with the revised estimate20. Thus, by integrating all initial predictions with appropriate weights, an ensemble is developed21.
Tunable hyperparameters in GB models optimize performance by balancing complexity and generalization. Key parameters include the learning rate, which adjusts each tree’s contribution; the number of trees, affecting accuracy and the risk of overfitting; and max depth, controlling tree complexity. The subsample parameter introduces randomness to reduce overfitting, while min samples split and min samples leaf adjust the minimum number of data points needed for splits or leaves, fine-tuning model complexity and robustness. Adjusting these enhances accuracy, reduces overfitting, and improves computational efficiency.
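As a concrete illustration of these knobs, here is a minimal scikit-learn sketch; the numeric values are placeholders rather than the JO-tuned settings of Table 1, and X and y are assumed to come from the preprocessing sketch above.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# 80-20 split as used in this study; the seed is an arbitrary assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

gb = GradientBoostingRegressor(
    n_estimators=300,      # number of sequential trees
    learning_rate=0.05,    # scales each tree's contribution
    max_depth=4,           # limits individual tree complexity
    subsample=0.8,         # row subsampling adds randomness against overfitting
    min_samples_split=4,   # minimum points required to split a node
    min_samples_leaf=2,    # minimum points required in a leaf
)
gb.fit(X_train, y_train)
y_pred = gb.predict(X_test)
```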
LGB
LGB is a variant of the GB model that optimizes the loss function through gradient boosting, leveraging a tree-based learner known as LightGBM, which demonstrates greater efficiency and speed compared to conventional decision trees22.
LGB employs gradient-based one-side sampling (GOSS) to reduce the number of samples necessary for tree fitting, thereby enhancing efficiency compared to conventional gradient boosting methods. Additionally, it introduces exclusive feature bundling (EFB) to decrease the required features for each tree, further improving efficiency.
Moreover, LGB adopts a “leaf-wise” splitting strategy, wherein the tree expands by splitting the leaf with the most significant loss change, rather than the node with the highest loss change. This approach not only enhances the accuracy of the tree but also increases its complexity.
Mathematically, the efficiency of LGB can be demonstrated by considering the reduction in sample and feature requirements. Let N represent the total number of samples and F denote the total number of features. Through GOSS, LGB reduces the number of samples needed for tree fitting to \(\upalpha N\), where \(0 <\upalpha < 1\). Similarly, EFB reduces the required features to \(\upbeta F\), where \(0 <\upbeta < 1\). Thus, the overall efficiency improvement can be quantified as the reduction in computational resources required for tree fitting, leading to faster training and inference times.
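To make this concrete (a back-of-envelope illustration, with \(\upalpha\) and \(\upbeta\) values chosen purely for the example): if the cost of fitting a tree scales with the product of the samples scanned and the features considered, the relative cost is

$$\frac{{\text{cost}}_{\text{LGB}}}{{\text{cost}}_{\text{GB}}}\approx \frac{\left(\upalpha N\right)\left(\upbeta F\right)}{NF}=\upalpha \upbeta$$

so that, for instance, \(\upalpha = 0.3\) and \(\upbeta = 0.5\) would cut the work per tree to roughly 15% of the conventional cost.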
In LGB, tunable hyperparameters enhance model efficiency and accuracy. Key parameters include the learning rate, which controls each tree’s contribution, and the number of leaves, which balances model complexity and overfitting. The max depth limits tree expansion, while min data in leaf and feature fraction reduce overfitting by restricting the sample size and feature usage in splits. LGB’s exclusive feature bundling and gradient-based one-side sampling further improve speed and efficiency by reducing features and samples. Adjusting these parameters optimizes accuracy, training time, and generalization.
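A corresponding minimal sketch with the LightGBM scikit-learn API is shown below; again, the values are placeholders, not the tuned settings of Table 1, and the train/test split from the GB sketch is reused.

```python
from lightgbm import LGBMRegressor

lgb = LGBMRegressor(
    n_estimators=500,
    learning_rate=0.05,    # scales each tree's contribution
    num_leaves=31,         # leaf-wise growth: accuracy vs. overfitting
    max_depth=-1,          # -1 leaves tree depth unconstrained
    min_child_samples=5,   # "min data in leaf"
    colsample_bytree=0.9,  # fraction of features per tree ("feature fraction")
)
lgb.fit(X_train, y_train)  # split from the GB sketch above
y_pred = lgb.predict(X_test)
```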
RF and ET
RF is another widely used ensemble method for regression that builds on decision tree concepts. This approach constructs an ensemble of trees, each trained on a different subset of the data via bootstrapping. At each tree split, a random subset of features is taken into account, helping to avert overfitting and enhancing the model’s generalization capacity.
The RF algorithm combines outputs from numerous decision trees to produce a prediction. The final estimate is typically the average or median of the predictions made by each tree. For a given input x, the RF model’s prediction can be represented as23:

$$\widehat{{y}_{\text{RF}}}\left(x\right)=\frac{1}{N}{\sum }_{i=1}^{N}{y}_{i}\left(x\right)$$

In this notation, \(\widehat{{y}_{\text{RF}}}\left(x\right)\) represents the RF model’s prediction for input x, N stands for the total number of trees in the forest, and \({y}_{i}\left(x\right)\) denotes the prediction from the i-th tree.
RF models are well-regarded for their robustness, ability to manage high-dimensional data efficiently, and strong resistance to overfitting. They offer substantial flexibility and effectiveness when applied to regression tasks.
ET algorithm is another ensemble method similar to RF. Like RF, ET utilizes bootstrapping and random feature selection when constructing trees. However, ET increases the level of randomness by further randomizing the tree-splitting process, enhancing model diversity24.
In the ET algorithm, node splits are made with heightened randomness. Rather than determining the optimal split point for each feature, ET randomly chooses threshold values for node splits. This extra layer of randomness creates a more varied set of decision trees, which can enhance generalization and further mitigate overfitting risks25.
The prediction for an ET regression model, given an input x, is computed in a manner similar to that of an RF model26:

$$\widehat{{y}_{\text{ET}}}\left(x\right)=\frac{1}{N}{\sum }_{i=1}^{N}{y}_{i}\left(x\right)$$

In this notation, \(\widehat{{y}_{\text{ET}}}\left(x\right)\) denotes the ET model’s prediction for the input x, N stands for the number of trees, and \({y}_{i}\left(x\right)\) stands for the prediction from the i-th individual tree.
ET models are valued for their computational efficiency, as the randomized splitting process reduces the processing demands. This makes them an ideal choice when time is constrained27.
In RF and ET, tunable hyperparameters refine model robustness and computational efficiency. Key RF parameters include the number of trees, balancing accuracy and risk of overfitting, and max depth, controlling tree complexity. RF also uses max features to randomly limit features at each split, enhancing generalization. ET shares similar hyperparameters but introduces further randomness by selecting thresholds at splits, increasing diversity in trees. Both models use min samples split and min samples leaf to adjust data requirements for splits and leaves, which helps reduce overfitting and improve predictive stability. Adjusting these parameters enhances accuracy, robustness, and computational efficiency27.
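The two models can be instantiated analogously; the sketch below uses placeholder values rather than the tuned settings of Table 1.

```python
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=400,     # number of bootstrapped trees
    max_depth=None,       # grow until leaves are pure
    max_features="sqrt",  # random feature subset at each split
    min_samples_split=2,
    min_samples_leaf=1,
)
et = ExtraTreesRegressor(
    n_estimators=400,
    max_features="sqrt",
    bootstrap=False,      # ET default: full sample, fully random thresholds
)
rf.fit(X_train, y_train)
et.fit(X_train, y_train)
```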
JO for hyper-parameter tuning
From observations of jellyfish behavior in their natural habitat, researchers devised an algorithm known as the JO. Jellyfish move through the ocean under the influence of currents28, and the transition between their distinct movement patterns is regulated by a temporal control system.
Jellyfish are attracted to ocean currents due to the abundance of food they provide. The ocean current direction (\(\overrightarrow{co}\)) is computed as the average of the vectors representing all jellyfish in the ocean28:

$$\overrightarrow{co}=\frac{1}{n}{\sum }_{i=1}^{n}\left({X}_{\text{best}}-{e}_{c}{X}_{i}\right)={X}_{\text{best}}-{e}_{c}\upmu ={X}_{\text{best}}-df,\qquad {e}_{c}=\upbeta \times r$$

Here, n represents the population size, \({X}_{\text{best}}\) signifies the best position attained, \(\upmu\) denotes the mean of the population positions, \({e}_{c}\) stands for the convergence factor, and \(df={e}_{c}\upmu\) indicates the disparity between the mean position of the jellyfish and the best position. Consequently, the new position is determined as28:

$${X}_{i}\left(t+1\right)={X}_{i}\left(t\right)+r\times \overrightarrow{co}$$
where r stands for a randomly chosen value within the range (0, 1), and \(\upbeta\) denotes a distribution coefficient greater than 0. The jellyfish exhibit two types of movement: passive movement (type A) and active movement (type B), beginning the process with type A movement29.
Subsequently, with time, they gradually transition to displaying type B motions. The following gives the type A (passive) movement28:

$${X}_{i}\left(t+1\right)={X}_{i}\left(t\right)+\upgamma \times r\times \left(Ub-Lb\right)$$

In this context, \(Lb\) and \(Ub\) symbolize the lower and upper limits of the exploration space, respectively, while \(\upgamma\) represents a movement factor relative to the distance covered around the jellyfish’s location. The succeeding expression gives the type B (active) movement30,29:

$${X}_{i}\left(t+1\right)={X}_{i}\left(t\right)+r\times \overrightarrow{D},\qquad \overrightarrow{D}=\left\{\begin{array}{ll}{X}_{j}\left(t\right)-{X}_{i}\left(t\right), & fit\left({X}_{i}\right)\ge fit\left({X}_{j}\right)\\ {X}_{i}\left(t\right)-{X}_{j}\left(t\right), & \text{otherwise}\end{array}\right.$$
Here, j is a randomly selected jellyfish and \(fit\) represents the fitness function. The temporal control parameter C switches between ocean current movement and jellyfish swarm movement. Its calculation is delineated as30:

$$C\left(t\right)=\left|\left(1-\frac{t}{{T}_{\text{max}}}\right)\times \left(2r-1\right)\right|$$
where \({T}_{\text{max}}\) denotes the maximum iteration count. The flowchart depicted in Fig. 2 outlines the JO workflow.
JO is well-suited for hyperparameter tuning due to its effective exploration–exploitation balance, inspired by jellyfish movement patterns in ocean currents. This characteristic allows JO to navigate complex search spaces efficiently, which is essential for identifying optimal hyperparameters that enhance model performance.
In this study, the objective function for JO is based on a five-fold cross-validation (CV) score for each candidate solution. Each model configuration undergoes five-fold CV, where the average validation score guides the optimization process. This approach ensures robust performance by reducing variance and avoiding overfitting, allowing JO to identify hyperparameters that generalize well across different data folds.
For JO’s configuration, the population size is set to 100 jellyfish, with a maximum of 30 iterations to allow thorough exploration of the search space. The time control mechanism (C) alternates movement patterns, with a convergence factor of 0.0005 guiding exploitation in later stages. The upper and lower bounds of the search space are set based on each model’s parameter ranges, ensuring that the search is constrained to practical values for each hyperparameter. These configurations enable the JO algorithm to converge effectively on the optimal hyperparameter set for each model.
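To make the procedure concrete, the following is a minimal Python sketch of a jellyfish-style search wrapped around the five-fold CV objective, following the equations above. The parameter ranges, the choice of LGB hyperparameters to tune, and the greedy replacement step are illustrative assumptions, not the exact implementation used here; X_train and y_train are assumed from the earlier sketches.

```python
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Assumed search ranges for three LGB hyperparameters:
# learning_rate, num_leaves, n_estimators.
lb = np.array([0.01, 8.0, 100.0])
ub = np.array([0.30, 64.0, 800.0])

def fitness(pos):
    """Negative mean five-fold CV R2, so that JO minimizes."""
    model = LGBMRegressor(learning_rate=float(pos[0]),
                          num_leaves=int(round(pos[1])),
                          n_estimators=int(round(pos[2])))
    return -cross_val_score(model, X_train, y_train, cv=5, scoring="r2").mean()

n_pop, t_max, beta, gamma = 100, 30, 3.0, 0.1
pop = lb + rng.random((n_pop, lb.size)) * (ub - lb)  # random initial swarm
fit = np.array([fitness(p) for p in pop])

for t in range(t_max):
    c = abs((1 - t / t_max) * (2 * rng.random() - 1))  # time control C(t)
    best = pop[fit.argmin()]
    for i in range(n_pop):
        if c >= 0.5:
            # Follow the ocean current: co = X_best - beta * r * mean
            trend = best - beta * rng.random() * pop.mean(axis=0)
            cand = pop[i] + rng.random(lb.size) * trend
        elif rng.random() > 1 - c:
            # Passive (type A) motion around the current position
            cand = pop[i] + gamma * rng.random(lb.size) * (ub - lb)
        else:
            # Active (type B) motion toward/away from a random jellyfish j
            j = rng.integers(n_pop)
            d = pop[j] - pop[i] if fit[i] >= fit[j] else pop[i] - pop[j]
            cand = pop[i] + rng.random(lb.size) * d
        cand = np.clip(cand, lb, ub)  # keep candidates within the bounds
        f = fitness(cand)
        if f < fit[i]:                # greedy replacement (a common variant)
            pop[i], fit[i] = cand, f

print("JO-selected hyperparameters:", pop[fit.argmin()])
```

The greedy replacement keeps only improving moves, which is a pragmatic simplification of the published update rule; with 100 jellyfish and 30 iterations, the dominant cost is the repeated five-fold CV evaluations.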
Results and discussion
The dataset was partitioned into training and testing subsets using an 80–20% split, providing a substantial portion of the data for model training while reserving a subset for independent assessment of model performance. To enhance the predictive accuracy of the regression models, hyperparameters were optimized using the JO algorithm, which effectively balanced exploration and exploitation in the search space. Table 1 summarizes the best hyperparameters identified through JO for each model, including settings such as learning rate, number of trees, and loss function. These optimal configurations were subsequently used to train the models and assess their predictive capabilities across the dataset.
Tables 2 and 3 offer a detailed view of model performance across training, cross-validation, and test datasets. Table 2 presents R2 scores, indicating the proportion of variance each model explains. All models demonstrate relatively high R2 values on training data, reflecting good data fit. LGB emerges as a top performer, with an R2 score of 0.998726 on training, a cross-validation mean of 0.980350 with a low standard deviation of 0.006004334, and a test score of 0.988146. This performance suggests LGB’s high accuracy and generalizability, with minimal variance across cross-validation folds. GB and ET, however, show lower stability. GB has a cross-validation R2 mean of 0.926217 with a higher standard deviation (0.019463014), indicating variability in performance across folds. ET shows even less stability, with a lower cross-validation mean of 0.803459 and a higher deviation of 0.070517859, which suggests inconsistency in its predictions. RF has moderate performance, with an R2 test score of 0.945560, but its cross-validation mean of 0.826692 and standard deviation indicate it may not generalize as well as LGB across different data subsets.
Table 3 further analyzes model accuracy using RMSE and MAE across training, test, and total datasets. LGB again demonstrates the lowest RMSE and MAE values across all sets, confirming its high prediction accuracy. The total RMSE and MAE for LGB, at 4.8015E-05 and 2.58020E-05, are notably lower than those of other models. ET and RF, on the other hand, display higher RMSE and MAE values across test and total datasets, indicating less precise predictions and a higher likelihood of errors in practical applications.
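For reference, the metrics in Tables 2 and 3 correspond to quantities that can be computed as in the sketch below; the random seed for the 80–20 split is an arbitrary assumption, and model stands for any of the JO-tuned regressors.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# 80-20 partition, mirroring the split described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model.fit(X_train, y_train)

for name, Xs, ys in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(Xs)
    print(f"{name}: R2={r2_score(ys, pred):.6f}  "
          f"RMSE={np.sqrt(mean_squared_error(ys, pred)):.4E}  "
          f"MAE={mean_absolute_error(ys, pred):.4E}")

# Five-fold CV mean and spread on the training portion, as in Table 2.
cv = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
print(f"CV R2: {cv.mean():.6f} +/- {cv.std():.6f}")
```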
Figures 3, 4, 5 and 6 visually compare actual versus predicted values through parity plots for the GB, LGB, ET, and RF models, respectively. These plots reveal the predictive accuracy of each model by illustrating alignment with the ideal diagonal line. In Fig. 3, the GB model shows some deviation from the ideal line, indicating it captures general data trends but lacks precise predictive accuracy. This pattern aligns with GB’s higher cross-validation variance in Table 2, suggesting potential instability in its performance. Figure 4, which represents LGB, shows a tighter fit to the diagonal, with points clustering closely around the line. This tight fit reflects LGB’s strong predictive power, as observed in its high cross-validation R2 mean and low error rates in Table 3. The close alignment across both training and test points highlights LGB’s consistent performance on both seen and unseen data. In Fig. 5, the ET model demonstrates a noticeable scatter away from the ideal line, indicating more significant prediction errors, particularly on test data. This visual pattern aligns with ET’s relatively lower R2 scores and higher RMSE and MAE values, which suggest less reliable predictive performance. The same pattern is evident in Fig. 6 for the RF model.
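A parity plot of this kind takes only a few lines; the sketch below assumes the model and split variables from the earlier snippets.

```python
import matplotlib.pyplot as plt

# Parity plot in the style of Figs. 3-6: predicted vs. actual solubility,
# with the y = x diagonal marking perfect agreement.
fig, ax = plt.subplots()
ax.scatter(y_train, model.predict(X_train), s=15, alpha=0.7, label="train")
ax.scatter(y_test, model.predict(X_test), s=15, alpha=0.7, label="test")
lo, hi = float(y.min()), float(y.max())
ax.plot([lo, hi], [lo, hi], "k--", label="ideal")
ax.set_xlabel("Actual solubility x")
ax.set_ylabel("Predicted solubility x")
ax.legend()
plt.tight_layout()
plt.show()
```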
The analysis indicates that the LGB model is the most reliable choice for further analysis. With high R2 scores across all datasets, low RMSE and MAE values, and strong parity plot alignment, LGB demonstrates excellent predictive accuracy and generalizability. Additionally, its unique features make it computationally efficient, allowing it to handle complex datasets with large feature spaces effectively. These characteristics make LGB particularly suited to this study, where multiple inputs and one-hot-encoded solvents demand efficient handling of feature interactions. LGB’s overall performance supports its selection as the most robust and accurate model for analyzing solubility prediction in this dataset, ensuring reliable insights into the interactions between temperature, solvent type, and mass fraction on solubility.
Figures 7, 8, 9, and 10 show the 3D surface plots generated by the LGB model, illustrating the estimated solubility (x) as a function of temperature (T) and mass fraction (w) for different solvent systems: dichloromethane combined with ethanol, methanol, n-butanol, and n-propanol, respectively14.
In Fig. 7 (dichloromethane + ethanol), the solubility surface shows a trend where solubility increases with both temperature and mass fraction. The increase is non-linear, with solubility rising more sharply at higher values of T and w, indicating a synergistic effect between temperature and mass fraction on solubility in this solvent. Similar observations were reported in our previous study on the drug solubility in mixed solvents14.
Figure 8 (dichloromethane + methanol) shows a similar trend, with solubility increasing as both temperature and w increase, although the effect of mass fraction appears somewhat more moderate than in ethanol. This difference suggests that methanol has a distinct thermal and concentration-dependent solubility profile.
In Fig. 9 (dichloromethane + n-butanol), the solubility response to temperature and mass fraction is more gradual, with a smoother increase in solubility across the range of T and w. This figure suggests that solubility in n-butanol is less sensitive to temperature changes compared to ethanol and methanol, indicating a more stable interaction with the solute under varying conditions.
Finally, Fig. 10 (dichloromethane + n-propanol) exhibits a solubility surface similar to that of n-butanol, with a relatively linear increase in solubility with both temperature and mass fraction. The slope indicates a moderate response, with solubility values increasing consistently but without sharp transitions, highlighting n-propanol’s relatively balanced interaction with temperature and concentration.
Figures 11 and 12 present partial effect plots generated by the LGB model, illustrating how T and w individually influence solubility (x) for the dichloromethane + ethanol solvent system at fixed levels of the other variable. These plots help isolate the effects of temperature and concentration on solubility in this specific solvent.
Figure 11 shows the impact of temperature on solubility at various fixed levels of mass fraction (w). As temperature increases, solubility generally rises across all levels of w, with a more pronounced increase observed at higher concentrations. This trend suggests that temperature has a stronger impact on solubility when the solute concentration is high, indicating an interaction between thermal conditions and solute mass fraction.
Figure 12 illustrates the effect of mass fraction on solubility at different fixed temperature levels. Here, solubility increases with rising w, but the rate of increase varies with temperature. At higher temperatures, the effect of increasing mass fraction is amplified, leading to steeper increases in solubility. This indicates that temperature and concentration jointly contribute to solubility, with elevated temperatures enhancing the solubility response to mass fraction.
Table 4 provides the optimal conditions for achieving maximum solubility as predicted by the final LGB model. The table identifies the precise values of temperature (T), mass fraction (w), and solvent type that yield the highest predicted solubility (x). According to the model, a temperature of 305.76306 K and a mass fraction of 0.75376 in the dichloromethane + methanol solvent system yield the maximum solubility, with a predicted value of 4.212032E-03. This table highlights the effectiveness of the LGB model in pinpointing the optimal combination of conditions for enhancing solubility, providing valuable insights for experimental planning and formulation in solvent systems14.
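One way to recover such an optimum from a trained model is a dense grid search over T and w with the solvent indicator fixed, as in the hedged sketch below; the temperature range and one-hot column names are assumptions matching the earlier preprocessing sketch, and any feature scaling applied during training must also be applied to the grid.

```python
import numpy as np
import pandas as pd

# Dense T-w grid for the dichloromethane + methanol system. The range
# 278.15-318.15 K is an assumed experimental window, not a reported value.
T_grid = np.linspace(278.15, 318.15, 200)
w_grid = np.linspace(0.0, 1.0, 200)
TT, WW = np.meshgrid(T_grid, w_grid)

grid = pd.DataFrame({"T": TT.ravel(), "w": WW.ravel()})
for col in ["solv_ethanol", "solv_methanol", "solv_n-butanol", "solv_n-propanol"]:
    grid[col] = 0
grid["solv_methanol"] = 1  # select the methanol co-solvent system

pred = model.predict(grid)  # model: the trained LGB regressor
i = int(pred.argmax())
print(f"max x = {pred[i]:.6E} at T = {grid['T'].iloc[i]:.2f} K, "
      f"w = {grid['w'].iloc[i]:.3f}")
```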
Conclusion
This study successfully demonstrates the potential of machine learning models for predicting the solubility of rivaroxaban in binary solvents based on key variables, including temperature, mass fraction, and solvent type. By employing advanced tree-based ensemble models—specifically, GB, LGB, ET, and RF—and optimizing their hyperparameters using the Jellyfish Optimizer (JO), this research highlights the power of ML in addressing complex solubility prediction tasks. Among the models tested, LGB emerged as the best-performing model, achieving high predictive accuracy and generalizability. Its efficiency and capability to capture intricate patterns in multivariate data make it especially suited for this application.
The analysis through parity plots, 3D surface plots, and partial effect plots confirmed LGB’s robustness in reflecting the relationships among temperature, mass fraction, and solvent effects on solubility. Furthermore, the model’s prediction of optimal conditions for maximum solubility offers practical guidance for experimental setups in solvent selection and formulation.
In conclusion, this work contributes a valuable framework for solubility prediction using machine learning, with the potential to streamline experimental planning and formulation development in chemical and materials science. Future research may extend this approach by incorporating larger datasets, exploring alternative modeling techniques, or applying this framework to other chemical properties, enhancing its applicability in predictive modeling for complex systems.
Data availability
The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.
References
Pu, S. & Hadinoto, K. Yield enhancement and phase behaviors of cyclic peptide drug crystallization in the presence of polyethylene glycol additive. Chem. Eng. Res. Des. 205, 354–363 (2024).
Wang, H. et al. Crystallization and intermolecular hydrogen bonding in carbamazepine-polyvinyl pyrrolidone solid dispersions: An experiment and molecular simulation study on drug content variation. Int. J. Pharm. 666, 124769 (2024).
Soto, R. et al. Ketoprofen solubility in pure organic solvents using in situ FTIR and UV–vis and analysis of solution thermodynamics. Org. Process Res. Dev. 25(11), 2403–2414 (2021).
Zeng, L., Rasmuson, Å. C. & Svärd, M. Solubility of two polymorphs of tolbutamide in n-propanol: Comparison of methods. J. Pharm. Sci. 109(10), 3021–3026 (2020).
Quilló, G. L. et al. Dynamic optimization of active pharmaceutical ingredient (semi-)batch crystallization using population balance modelling. In Computer Aided Chemical Engineering (eds Kokossis, A. C. et al.) 1495–1500 (Elsevier, 2023).
Trampuž, M., Teslić, D. & Likozar, B. Crystal-size distribution-based dynamic process modelling, optimization, and scaling for seeded batch cooling crystallization of active pharmaceutical ingredients (API). Chem. Eng. Res. Des. 165, 254–269 (2021).
Orosz, Á. et al. Dynamic modeling and optimal design space determination of pharmaceutical crystallization processes: realizing the synergy between off-the-shelf laboratory and industrial scale data. Ind. Eng. Chem. Res. 63(9), 4068–4082 (2024).
Sodeifian, G. et al. Determination of Regorafenib monohydrate (colorectal anticancer drug) solubility in supercritical CO2: Experimental and thermodynamic modeling. Heliyon 10(8), e29049 (2024).
Sodeifian, G. et al. Solubility measurement of Ceftriaxone sodium in SC-CO2 and thermodynamic modeling using PR-KM EoS and vdW mixing rules with semi-empirical models. Case Stud. Therm. Eng. 61, 105074 (2024).
Liu, Y. et al. Machine learning based modeling for estimation of drug solubility in supercritical fluid by adjusting important parameters. Chemometr. Intell. Lab. Syst. 254, 105241 (2024).
Yu, Y., Sun, C. & Jiang, W. A comprehensive study of pharmaceutics solubility in supercritical solvent through diverse thermodynamic and hybrid Machine learning approaches. Int. J. Pharm. 664, 124579 (2024).
Dutta, S. et al. Robust counterfactual explanations for tree-based ensembles. In International Conference on Machine Learning (PMLR, 2022).
Jeong, J.-S. et al. Measurement and correlation of solubility of rivaroxaban in dichloromethane and primary alcohol binary solvent mixtures at different temperatures. J. Mol. Liq. 357, 119064 (2022).
Alqarni, M. & Alqarni, A. Computational intelligence investigations on the correlation of pharmaceutical solubility in mixtures of binary solvents: Effect of composition and temperature. Chin. J. Phys. 93, 503–514. https://doi.org/10.1016/j.cjph.2024.12.021 (2025).
Seger, C. An investigation of categorical variable encoding techniques in machine learning: Binary versus one-hot and feature hashing. Student thesis, KTH Royal Institute of Technology (2018).
Patro, S. Normalization: A preprocessing stage. arXiv preprint arXiv:1503.06462 (2015).
Oyeyemi, G. M., Bukoye, A. & Akeyede, I. Comparison of outlier detection procedures in multiple linear regressions. Am. J. Math. Stat. 5(1), 37–41 (2015).
Otchere, D. A. et al. Application of gradient boosting regression model for the evaluation of feature selection techniques in improving reservoir characterisation predictions. J. Pet. Sci. Eng. 208, 109244 (2022).
Rao, H. et al. Feature selection based on artificial bee colony and gradient boosting decision tree. Appl. Soft Comput. 74, 634–642 (2019).
Sun, P., Huo, S. & He, T. Multiple machine learning models in estimating viscosity of crude oil: Comparisons and optimization for reservoir simulation. J. Mol. Liq. 384, 122251. https://doi.org/10.1016/j.molliq.2023.122251 (2023).
Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Ke, G. et al. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 30 (2017).
Pavlov, Y. L. Random forests. In Random Forests (ed. Kani, B.) 1–2 (De Gruyter, 2019).
Heddam, S. Extremely randomized trees versus random forest, group method of data handling, and artificial neural network. In Handbook of Hydroinformatics 291–304 (Elsevier, 2023).
Geurts, P., Ernst, D. & Wehenkel, L. Extremely randomized trees. Mach. Learn. 63(1), 3–42 (2006).
Kocev, D. & Ceci, M. Ensembles of extremely randomized trees for multi-target regression. In Discovery Science: 18th International Conference, DS 2015, Banff, AB, Canada, October 4–6, 2015, Proceedings (Springer, 2015).
Majrashi, M. A. A. et al. Nonsteroidal anti-inflammatory drug solubility optimization through green chemistry solvent: Artificial intelligence technique. Case Stud. Therm. Eng. 53, 103767. https://doi.org/10.1016/j.csite.2023.103767 (2024).
Chou, J.-S. & Truong, D.-N. A novel metaheuristic optimizer inspired by behavior of jellyfish in ocean. Appl. Math. Comput. 389, 125535 (2021).
Xia, S. & Wang, Y. Preparation of solid-dosage nanomedicine via green chemistry route: Advanced computational simulation of nanodrug solubility prediction using machine learning models. J. Mol. Liq. 375, 121319. https://doi.org/10.1016/j.molliq.2023.121319 (2023).
Olabi, A. et al. Boosting carbon dioxide adsorption capacity applying Jellyfish optimization and ANFIS-based modelling. Ain Shams Eng. J. 14, 101931 (2022).
Acknowledgments
The authors extend their appreciation to Taif University, Saudi Arabia, for supporting this work through project number (TU-DSPP-2024-153).
Funding
This research was funded by Taif University, Saudi Arabia, Project No. (TU-DSPP-2024-153).
Author information
Contributions
Mohammed Alqarni: Writing, Methodology, Investigation, Software, Visualization. Ali Alqarni: Validation, Conceptualization, Formal analysis, Resources.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Alqarni, M., Alqarni, A. Machine learning analysis of rivaroxaban solubility in mixed solvents for application in pharmaceutical crystallization. Sci Rep 15, 2241 (2025). https://doi.org/10.1038/s41598-024-84741-1