Introduction

In the pharmaceutical, petroleum, chemical and polymer industries, scientists and engineers prioritize identifying the most suitable solvents for specific industrial applications. The solubility parameter provides a critical measure of a solvent's tendency to dissolve a given solute. The concept was formalized in the 1930s by Hildebrand, building on earlier work by Scatchard on cohesive energy density, and has since been extensively studied and applied across various chemical processes1. Subsequent contributions by Hildebrand2,3 and Hansen4 further refined the concept by introducing additional parameters and expanding its applicability. The solubility parameter has proven instrumental in polymer processing, surface treatments and coating technologies5.

The theoretical foundation of the solubility parameter lies in molecular interactions and the energies associated with chemical structure5. When the solubility parameters of two substances are similar, the miscibility between one (a solvent or liquid) and the other (a solute, often a solid) tends to improve6. This principle, known as "like dissolves like", underpins many solvent selection strategies6. In condensed phases, intermolecular forces, primarily cohesive forces and van der Waals attractions, govern molecular behavior6. Hildebrand5 expanded on earlier cohesive energy theories by introducing the solubility parameter, which is related to the cohesive energy density of a material. The solubility parameter is generally defined as the square root of the cohesive energy density and is often called the Hildebrand solubility parameter5. However, the Hildebrand parameter's inability to account for hydrogen bonding limits its application to slightly polar or nonpolar materials that lack hydrogen bonding5. To overcome this limitation, Hansen introduced a three-component solubility parameter that accounts for dispersive (nonpolar), dipolar and hydrogen-bonding interactions7. Hansen solubility parameters (HSP) have found widespread use in industries such as coatings and paints, where solvent-polymer interactions must be taken into account4. Additionally, HSP have been extended to materials such as nanoparticles, pigments, DNA, and others, enabling the analysis of interactions with plasticizers, solvents, foodstuffs, fragrance chemicals and more8,9.

Recognizing the importance of the solubility parameter, numerous approaches have been developed to calculate or estimate it. Equation-of-state (EOS)-based methods have been extensively considered for this purpose in recent years10,11,12,13. Group contribution (GC) methods, pioneered by van Krevelen14, enable the estimation of partial solubility parameters for pure organic and polymer compounds.

Quantitative structure-property/activity relationship (QSPR/QSAR) methods, which leverage classification and regression techniques, are broadly employed in engineering and chemical science to predict physical properties15,16. Although QSPR/QSAR models deliver accurate results, determining molecular descriptors involves time-intensive steps such as structure drawing, geometry optimization and descriptor calculation, which may pose challenges for researchers lacking advanced computational chemistry expertise.

This study employs machine learning (ML), a multidisciplinary field combining ideas from probability theory, statistics, approximation theory, and other domains17, to develop predictive models for the solubility parameter of diverse polymer solutions using a QSPR strategy. Advanced ML algorithms are used, including linear regression, convolutional and artificial neural networks, lasso and ridge regression, support vector machines, elastic net, random forests, gradient boosting, k-nearest neighbors, decision trees, extreme gradient boosting, categorical boosting, light gradient boosting, and Gaussian processes. The Monte Carlo outlier detection algorithm ensures dataset suitability for model training. Model performance is assessed using multiple metrics and visualization tools, while SHAP values provide insights into the impact of essential characteristics on solubility estimations. A detailed description of the methodology is provided in Fig. 1.

Fig. 1 Modeling the solubility parameter of polymers via various data-driven methods.

Machine learning backgrounds

Convolutional neural network

A Convolutional Neural Network (CNN) is a type of deep learning algorithm tailored to handle structured data, including images and time-series data. These networks excel in visual computing applications such as object detection, image classification, and intricate pattern recognition18. A CNN is composed of several layers, with convolutional layers, pooling layers, and fully connected layers playing crucial roles18,19. Convolutional layers employ filters (kernels) to detect fundamental characteristics such as lines and edges, which are subsequently processed by deeper layers to identify more complex patterns, such as shapes or objects. Pooling layers reduce data dimensionality, allowing the network to focus on the most significant features19. CNNs are usually trained with the backpropagation algorithm, which enables the network to fine-tune its weights by comparing predictions with actual outcomes20. These networks are highly effective in fields such as facial recognition, medical imaging, and general image analysis, and can autonomously learn hierarchical features with little human input. A key advantage of CNNs is their ability to generalize well to new, previously unseen data, making them particularly efficient for tasks involving large datasets21,22. Recent improvements in hardware, especially the adoption of Graphics Processing Units (GPUs), have greatly accelerated the training of these networks, further boosting their widespread use21,22.
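As an illustration of this architecture, the minimal sketch below builds a small 1D CNN for tabular regression with Keras, treating the polymer descriptors as a short one-dimensional sequence; the layer sizes, optimizer, and the variable names X_train and y_train are illustrative assumptions, not the configuration used in this study.

```python
# Hedged sketch: a minimal 1D CNN for tabular regression, assuming the polymer
# descriptors are arranged as a length-11 "sequence" with one channel.
from tensorflow import keras
from tensorflow.keras import layers

n_features = 11  # e.g. molecular weight, melting point, ..., parachor

model = keras.Sequential([
    layers.Input(shape=(n_features, 1)),
    layers.Conv1D(filters=32, kernel_size=3, activation="relu"),  # local feature filters
    layers.MaxPooling1D(pool_size=2),                             # dimensionality reduction
    layers.Flatten(),
    layers.Dense(64, activation="relu"),                          # fully connected layer
    layers.Dense(1),                                              # solubility parameter output
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X_train.reshape(-1, n_features, 1), y_train, epochs=100, batch_size=32)
```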

K-nearest neighbors

The K-Nearest Neighbors (KNN) method is among the most commonly used supervised learning algorithms in machine learning. It operates by evaluating distances between data points to make predictions or classifications23,24. When estimating the value or class of a new data point, KNN identifies the K closest points in the training dataset and bases its prediction on the majority label among these K neighbors (or, for regression, the average of their values)23. This process usually relies on distance metrics such as the Euclidean distance, meaning that closer data points have a stronger influence on the outcome. A key advantage of KNN is its non-parametric nature: it does not require an explicit training or learning stage24. Rather than constructing a complex predictive model, KNN uses the training data directly to make predictions at runtime. However, this also leads to one of its main drawbacks: the algorithm can be computationally expensive during the prediction phase, especially with large datasets, since it must compute distances to every data point in the training set25.

Additionally, the performance of KNN is highly sensitive to the choice of K and the distance metric used. Selecting an inappropriate K value or distance metric can greatly affect the accuracy of the results26. Despite these limitations, KNN is extensively applied in fields such as text classification and spam filtering, owing to its simplicity and effectiveness in many scenarios26.
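Because the choice of K and the distance metric is so influential, a common remedy is to tune them by cross-validation. The sketch below shows one way to do this with scikit-learn; the parameter grid and the X_train/y_train names are hypothetical.

```python
# Hedged sketch: selecting K and the distance metric by cross-validation.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())
grid = GridSearchCV(
    pipe,
    param_grid={
        "kneighborsregressor__n_neighbors": [3, 5, 7, 11],
        "kneighborsregressor__metric": ["euclidean", "manhattan"],
    },
    cv=5, scoring="neg_mean_squared_error",
)
# grid.fit(X_train, y_train); grid.best_params_ gives the chosen K and metric
```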

Lasso regression

Lasso Regression is a linear regression algorithm designed for feature selection and for mitigating overfitting in regression models. The name "Lasso" stands for "Least Absolute Shrinkage and Selection Operator"27. The technique applies an L1 regularization penalty that shrinks feature coefficients, driving some of them to zero and effectively removing the corresponding features from the model27. This enables automatic feature selection, as less important or irrelevant features are discarded, leading to a model that is easier to understand and interpret.

A major benefit of Lasso Regression is its capability to automatically identify relevant features and simplify the model, particularly when dealing with datasets that include numerous irrelevant or highly correlated features28,29. It is especially effective in high-dimensional settings with a large number of features, some of which may be redundant or non-informative. By simplifying the model, Lasso Regression improves its ability to generalize to new, unseen data30.

Despite its advantages, Lasso Regression has some drawbacks. It may struggle when features are highly correlated, as it tends to select one feature from a group of correlated predictors more or less arbitrarily and disregard the others29. In such situations, alternative approaches like Ridge Regression might be more suitable. Nevertheless, Lasso Regression remains an influential tool in machine learning and data analysis, providing a reliable method for constructing parsimonious models in cases with intricate feature relationships31.
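The following sketch illustrates the L1 shrinkage behavior with scikit-learn's LassoCV, where features whose coefficients are driven to zero drop out of the model; the alpha grid and the feature_names/X_train/y_train identifiers are illustrative placeholders.

```python
# Hedged sketch: L1 shrinkage with LassoCV; irrelevant descriptors receive zero
# coefficients and effectively leave the model.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lasso = make_pipeline(StandardScaler(), LassoCV(alphas=np.logspace(-3, 1, 50), cv=5))
# lasso.fit(X_train, y_train)
# dropped = [name for name, c in zip(feature_names, lasso[-1].coef_) if c == 0.0]
```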

Decision tree

The Decision Tree algorithm is a frequently used technique in machine learning for solving regression and classification problems. It constructs a model as a hierarchical tree structure, with each node symbolizing a decision or condition that partitions the data into subsets32. The tree is built using the input features, with the process continuing until it reaches leaf nodes, which provide the final predictions or outcomes33. While building the tree, metrics such as Gini Index or Entropy are employed as splitting criteria to minimize impurity and enhance the accuracy of classifications at each step34.

One of the primary strengths of decision trees is their interpretability. The logic behind the model's decisions can be easily traced and understood, making them appropriate for applications where transparency is important35. Moreover, decision trees can handle both numerical and categorical data without the need for feature scaling, adding to their versatility29,36.

However, decision trees have some limitations. They are susceptible to overfitting, particularly when handling noisy data, since the model may learn irrelevant patterns or anomalies found in the training dataset36,37. To mitigate this issue, techniques such as ensemble methods or pruning are often utilized. These approaches help simplify the tree and improve its capacity to generalize to unseen data, thereby enhancing the overall robustness and efficiency of the model38.
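As a concrete example of such pruning, the sketch below constrains a regression tree through depth limits, minimum leaf sizes, and cost-complexity pruning; the specific values are illustrative rather than the settings used here.

```python
# Hedged sketch: limiting tree growth to curb overfitting.
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(
    max_depth=8,          # cap the hierarchy of splits
    min_samples_leaf=5,   # require enough points in each leaf node
    ccp_alpha=1e-3,       # prune branches that add little impurity reduction
    random_state=0,
)
# tree.fit(X_train, y_train)
```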

Linear regression

Linear Regression is among the most frequently utilized machine learning methods for representing the relationship between a dependent variable and one or more independent variables. The method operates under the assumption of a linear relationship between the input features and the output labels39. The primary objective of linear regression is to identify the line or hyperplane that best fits the data. Mathematically, this is represented by a linear equation whose coefficients are optimized using techniques such as the least squares approach40.
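The least squares solution referred to above can be written in closed form as beta = (XᵀX)⁻¹Xᵀy. The short sketch below solves it with a numerically stable routine; it is a generic illustration, not the fitting code used in this work.

```python
# Hedged sketch: ordinary least squares solved in closed form.
import numpy as np

def fit_least_squares(X, y):
    """Return intercept and coefficients of the best-fit hyperplane."""
    X1 = np.column_stack([np.ones(len(X)), X])      # prepend a bias column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)   # minimizes ||X1 @ beta - y||^2
    return beta[0], beta[1:]
```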

Linear regression finds applications in various domains, including predicting real estate prices, forecasting product sales, and analyzing economic trends. One of its key advantages is its simplicity and interpretability, as the results are straightforward to understand and explain39,40. Additionally, when the connection between parameters is genuinely linear, linear regression can generate highly accurate predictions41.

However, linear regression may struggle with datasets that exhibit complex or nonlinear relationships, leading to issues such as underfitting or overfitting. In such cases, more advanced models, such as polynomial regression or nonlinear methods, might be more appropriate42. In spite of these constraints, linear regression remains a powerful tool for understanding and forecasting outcomes in scenarios where the relationships between variables are approximately linear43.

Gradient boosting machine

The Gradient Boosting Machine (GBM) is a potent machine learning algorithm designed to tackle regression and classification tasks. It leverages the boosting technique, which enhances predictive performance by combining several weak models into a more robust ensemble44. In GBM, models are constructed sequentially, with each subsequent model aimed at correcting the errors of those before it. This iterative procedure uses gradient descent to minimize prediction errors by targeting the residuals (differences between actual and predicted values) at each step45.

At its core, during each training phase, GBM ensures that the new model specifically addresses the weaknesses of the previous one, progressively improving overall accuracy. A primary benefit of GBM is its capacity to manage intricate, nonlinear datasets effectively, resulting in high predictive precision46. By relying on weak models and gradually refining them, GBM achieves strong generalization capabilities, rendering it appropriate for various uses, such as fraud detection, sales forecasting, and disease diagnosis47.
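The residual-fitting idea can be made concrete with a few lines of code. The sketch below implements a bare-bones boosting loop for squared error using shallow regression trees; the learning rate, tree depth, and number of rounds are illustrative choices.

```python
# Hedged sketch: the residual-fitting loop behind gradient boosting for squared error.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_estimators=200, learning_rate=0.05, max_depth=3):
    pred = np.full(len(y), y.mean())      # start from a constant prediction
    trees = []
    for _ in range(n_estimators):
        residuals = y - pred              # negative gradient of the squared-error loss
        t = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += learning_rate * t.predict(X)   # small corrective step
        trees.append(t)
    return y.mean(), trees
```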

Despite its strengths, GBM does have some drawbacks. It can be sensitive to noisy data, which may lead to overfitting if not properly managed. Additionally, the algorithm typically requires more training time compared to other methods due to its sequential nature. Another challenge lies in parameter tuning, as selecting optimal hyperparameters for GBM can be complex and computationally demanding47,48. Nevertheless, when properly configured, GBM remains a highly effective tool to address different real-life issues48.

Support vector machine

Support Vector Machine (SVM) is a robust machine learning method primarily utilized for classification tasks, although it can also be adapted for regression. The core objective of SVM is to construct an optimal decision boundary, referred to as a hyperplane, that efficiently separates data points belonging to different classes49. In classification, this hyperplane is designed to maximize the margin between classes, ensuring the most effective separation possible. For datasets that are not linearly separable, SVM utilizes kernel functions to transform the data into a higher-dimensional space, enabling linear separation in this transformed space50.

A major advantage of SVM is its ability to manage intricate, nonlinear relationships within data by leveraging various kernel functions, such as polynomial or radial basis function (RBF) kernels. This makes SVM particularly effective in high-dimensional spaces, where it has demonstrated strong predictive accuracy51. As a result, SVM has been effectively utilized in various fields, such as facial recognition, illness diagnosis, and text categorization51.

However, SVM does have notable limitations. It is highly sensitive to parameter settings, especially the choice of kernel function and the tuning of the regularization parameter C, which manages the balance between obtaining a low error on the training dataset and maintaining a smooth decision boundary52. Additionally, training an SVM model may require significant computational resources, especially for extensive datasets, as it involves solving a quadratic optimization problem. This requirement for significant computational resources and time can make SVM less practical for very large-scale applications. In spite of these difficulties, SVM continues to be an effective instrument for numerous machine learning applications, especially when dealing with complex, high-dimensional data53.
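In practice, the kernel parameters and C are usually tuned by cross-validation. The sketch below pairs an RBF-kernel SVR with a small, purely illustrative parameter grid; the X_train/y_train names are placeholders.

```python
# Hedged sketch: RBF-kernel support vector regression with C and gamma tuned by CV.
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

svr = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
search = GridSearchCV(
    svr,
    param_grid={"svr__C": [1, 10, 100], "svr__gamma": ["scale", 0.1, 0.01]},
    cv=5, scoring="neg_mean_squared_error",
)
# search.fit(X_train, y_train)
```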

Categorical boosting

Categorical Boosting, commonly known as CatBoost, is an innovative machine learning method aimed at enhancing the efficacy of classification models, particularly when working with datasets that include categorical attributes54. In contrast to conventional boosting algorithms like XGBoost and LightGBM, which mainly target numerical data and necessitate preprocessing steps such as One-Hot Encoding for categorical variables, Categorical Boosting directly manages categorical features without extensive preprocessing requirements. This is accomplished via cutting-edge encoding techniques that are incorporated into the model’s structure55.

CatBoost integrates Gradient Boosting concepts with unique feature engineering methods specifically designed for categorical data. This approach removes the necessity for manual encoding or changing of categorical variables, thereby decreasing preprocessing time and complexity56. The algorithm utilizes advanced techniques like target-oriented encoding and mixtures of categorical attributes to efficiently grasp the connections between categories and the target variable57.
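A minimal usage sketch is given below: CatBoostRegressor can be pointed at categorical columns directly through cat_features, avoiding one-hot encoding. The hyperparameters and the hypothetical polymer_class column are illustrative assumptions.

```python
# Hedged sketch: CatBoost consuming categorical columns directly via cat_features.
from catboost import CatBoostRegressor

model = CatBoostRegressor(
    iterations=1000, learning_rate=0.05, depth=6,
    loss_function="RMSE", verbose=False,
)
# cat_cols = ["polymer_class"]  # hypothetical categorical column, if present
# model.fit(X_train, y_train, cat_features=cat_cols, eval_set=(X_val, y_val))
```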

A major benefit of Categorical Boosting is its capacity to improve model precision while streamlining the workflow. By training directly on categorical data, the model can utilize information from various classes at once, resulting in enhanced classification effectiveness58. This is especially beneficial when the dataset includes many categorical variables. Moreover, CatBoost's handling of categorical features ensures that the model remains robust and efficient, even when dealing with high-cardinality data56.

In conclusion, Categorical Boosting provides an effective approach for classification tasks by overcoming the challenges of conventional boosting algorithms when working with categorical data. Its capacity to enhance preprocessing efficiency, along with its robust predictive capabilities, renders it a useful resource in multiple areas, such as customer segmentation, fraud identification, and natural language processing. Nonetheless, similar to other intricate models, it might need precise adjustment of hyperparameters for optimal performance59.

Artificial neural network

An Artificial Neural Network (ANN) is a computing framework intended to mimic the information-processing abilities of the human brain. It is made up of linked processing units known as neurons, which function similarly to biological neurons. Each neuron receives inputs from other neurons, processes them using a mathematical operation such as an activation function, and then sends an output to subsequent neurons60. ANNs are especially proficient at addressing intricate tasks such as pattern recognition, forecasting, and data classification, as they can learn from data and progressively improve their performance61.

Typically, ANNs are structured in multiple layers, forming what is commonly referred to as a Multilayer Perceptron (MLP). These layers include an input layer, several hidden layers, and an output layer62. The input layer takes in data, the hidden layers process it through various transformations, and the output layer provides the final result. During training, algorithms such as backpropagation are employed to adjust the weights between neurons. This adjustment minimizes prediction errors and gradually enhances the network's accuracy by fine-tuning its internal parameters63.
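A compact way to realize such a multilayer perceptron is scikit-learn's MLPRegressor, as sketched below; the two hidden layers of 64 and 32 neurons and the other settings are illustrative, not the architecture reported in this study.

```python
# Hedged sketch: a multilayer perceptron trained by backpropagation.
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

ann = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 32),  # two hidden layers
                 activation="relu",
                 solver="adam",                # gradient-based weight updates
                 max_iter=2000, random_state=0),
)
# ann.fit(X_train, y_train)
```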

ANNs are especially well-suited for handling intricate, nonlinear problems that demand substantial computational power. As a result, they are widely used in areas such as language processing, image and speech recognition, and many other areas within data science63,64. Their capacity to represent intricate connections and adjust to new data makes them an effective instrument for resolving real-world problems across a variety of industries64.

Extreme gradient boosting

Extreme Gradient Boosting (XGBoost) is an exceptionally powerful and robust machine learning method celebrated for its skill in addressing intricate regression and classification challenges, particularly in extensive datasets. As an enhanced version of the gradient boosting framework, XGBoost builds upon conventional boosting methods by combining multiple weak models, usually decision trees, to build a robust predictive model65. In this algorithm, each new tree is formed to address the errors of the earlier one, gradually enhancing the predictions and moving closer to the best solution.

A key attribute of XGBoost is its incorporation of regularization methods, which aid in avoiding overfitting and improving the model’s generalization abilities. This renders it especially efficient in managing noisy or high-dimensional data. XGBoost is renowned for its speed and efficiency, capable of providing very accurate predictions even when managing large data sets65. Utilizing parallel processing, the algorithm greatly shortens computation duration, making it highly appropriate for practical applications that deal with extensive datasets66.
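The sketch below shows how these regularization terms are exposed in the XGBoost scikit-learn interface; the parameter values and the X_train/X_val names are illustrative.

```python
# Hedged sketch: XGBRegressor with explicit L1/L2 penalties on tree complexity.
from xgboost import XGBRegressor

xgb = XGBRegressor(
    n_estimators=500, learning_rate=0.05, max_depth=6,
    reg_lambda=1.0,   # L2 penalty on leaf weights
    reg_alpha=0.1,    # L1 penalty on leaf weights
    subsample=0.8, colsample_bytree=0.8, n_jobs=-1,
)
# xgb.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
```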

XGBoost is highly effective in handling complex and unstructured data, providing strong solutions for different tasks, such as image recognition and natural language processing. Its extensive customizability enables users to adjust multiple parameters, including learning rate, tree depth, and regularization strength, to enhance performance for particular applications. These features have resulted in its extensive use in both scholarly research and commercial applications, including data mining contests such as those on Kaggle, where it has repeatedly shown exceptional performance67.

To summarize, XGBoost is acknowledged as one of the most preferred algorithms in machine learning because of its effectiveness, precision, and versatility. Its capacity to manage intricate datasets while ensuring high performance has established its status as a preferred approach for addressing prediction and classification problems across various fields68.

Light gradient boosting machine

LightGBM is a cutting-edge machine learning algorithm specifically designed to handle large and complex datasets efficiently. It represents an optimized variant of the gradient boosting method, offering significant benefits such as quicker training speed, reduced memory usage, and improved accuracy compared to similar algorithms like XGBoost45. One of the key innovations of LightGBM is its unique approach to building decision trees using "leaf-wise" growth rather than the conventional "level-wise" method. This strategy allows the algorithm to create deeper trees more effectively, focusing on the splits that contribute the most to model performance, thereby enhancing accuracy69.
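In LightGBM's API, the leaf-wise growth described above is governed mainly by num_leaves rather than a fixed depth, as in the illustrative sketch below; all values shown are assumptions.

```python
# Hedged sketch: LGBMRegressor with leaf-wise complexity controlled by num_leaves.
from lightgbm import LGBMRegressor

lgbm = LGBMRegressor(
    n_estimators=500, learning_rate=0.05,
    num_leaves=31,        # governs leaf-wise tree complexity
    min_child_samples=20, # guards against overly specific leaves
)
# lgbm.fit(X_train, y_train, eval_set=[(X_val, y_val)])
```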

A major strength of LightGBM is its ability to train models rapidly, even on massive datasets. By leveraging techniques such as parallel processing and batch sampling, LightGBM achieves superior performance while minimizing computational resources. Additionally, it is highly efficient in managing both discrete and continuous data, all while consuming significantly less memory than other gradient boosting frameworks. These features make LightGBM particularly well-suited for applications involving large-scale data modeling, such as fraud detection, sales forecasting, and improving search engine rankings70.

Due to its efficiency, scalability, and high performance, LightGBM has gained widespread adoption across various industries and research domains. Its ability to deliver accurate predictions with reduced resource requirements makes it an excellent option for addressing practical problems where data size and complexity are significant challenges. Overall, LightGBM stands out as a powerful tool for building robust predictive models in large-scale applications71.

Elastic net

Elastic Net is a regression approach that merges the benefits of two well-known regularization techniques: Lasso regression and Ridge regression. This combined strategy is especially useful for datasets that exhibit high dimensionality or significant correlations between features. Lasso regression prioritizes feature selection by forcing the coefficients of less significant features to zero, whereas Ridge regression combats overfitting by imposing penalties on large coefficients and lowering model variance30. Elastic Net combines these two approaches to find a balance between selecting features and applying regularization, which makes it particularly advantageous in situations where the feature count is greater than the observation count72.

A significant benefit of Elastic Net is its capacity to manage highly correlated features more efficiently than using Lasso or Ridge separately. In cases where several features are highly correlated, Lasso typically chooses one feature at random and disregards the rest, while Elastic Net can incorporate clusters of correlated features into the model. This renders Elastic Net ideal for datasets that exhibit intricate relationships among variables, as seen in fields like genomics, finance, or image processing73.

The algorithm attains this equilibrium through a weighted mix of the L1 and L2 penalties. By adjusting the mixing parameter, which governs the balance of each regularization technique, Elastic Net can enhance the compromise between selecting features and minimizing variance. This adaptability enables Elastic Net to enhance the precision of forecasts and generalization of models, particularly when handling numerous features or multicollinearity74.
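The mixing parameter mentioned above corresponds to l1_ratio in scikit-learn, and both it and the penalty strength can be chosen by cross-validation, as in the illustrative sketch below.

```python
# Hedged sketch: ElasticNetCV searching jointly over penalty strength and L1/L2 mixing.
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

enet = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 1.0],   # balance between L1 and L2
                 alphas=np.logspace(-3, 1, 50), cv=5),
)
# enet.fit(X_train, y_train)
```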

In conclusion, Elastic Net is an effective method for regression issues that involve high-dimensional data or related attributes. Its capacity to integrate the advantages of Lasso and Ridge regression renders it a perfect option for improving model performance in intricate datasets, guaranteeing strength and dependability in numerous practical applications75.

Ridge regression

Ridge Regression is a regularization method employed in machine learning to avoid overfitting in regression models, especially when working with high-dimensional datasets. It is particularly useful in circumstances where the number of features surpasses the number of observations or when there are significant correlations between the features75. In Ridge Regression, a penalty term is included in the loss function to prevent the model from giving overly large weights to any individual feature. This penalty is determined by summing the squares of the weight values, encouraging the model to allocate importance more uniformly among all features by diminishing the size of the coefficients76.

The hyperparameter, λ (lambda), referred to as the regularization coefficient, regulates the regularization strength in Ridge Regression. An increased value of λ enforces a more significant penalty on the model, resulting in reduced weights and a more straightforward model that performs better on new data. On the other hand, a smaller λ enables the model to align more precisely with the training data, potentially leading to overfitting if not adequately adjusted. By modifying λ, a balance between bias and variance can be achieved, allowing the model to stay both adaptable and strong77.
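A common way to choose λ (called alpha in scikit-learn) is to sweep it over several orders of magnitude with cross-validation, as in the brief, illustrative sketch below.

```python
# Hedged sketch: RidgeCV sweeping the regularization coefficient to balance bias and variance.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 25)))
# ridge.fit(X_train, y_train)
```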

Ridge Regression is highly valued for its simplicity and efficiency, making it a favored option in data analysis tasks that involve numerous features and intricate relationships among them. It is especially beneficial when dealing with noisy data, since it minimizes the influence of superfluous or redundant features by driving their coefficients closer to zero. This helps improve the model's forecasting accuracy and consistency, particularly in cases where more intricate models could overfit the data. In general, Ridge Regression offers an effective method for enhancing the generalization ability of regression models when dealing with difficult datasets78.

Random forest

Random Forest is a robust machine learning algorithm that leverages an ensemble of decision trees to predict and analyze data. As an ensemble learning technique, it integrates the forecasts from various weak models (individual decision trees) to create a more accurate and stable overall model. In Random Forest, instead of constructing a single decision tree, numerous trees are built, each using a random subset of the data79. The final prediction is determined via a voting process for classification tasks (majority vote) or by averaging for regression tasks, based on the outputs of all the individual trees80,81.

One of the main benefits of Random Forest is its capability to significantly reduce overfitting, a common issue with individual decision trees. By introducing randomness in both the data samples and the characteristics employed to construct each tree, the algorithm ensures diversity among the trees, which enhances the model’s generalization capability. This randomness prevents any single tree from becoming too complex or overly tailored to training data, thereby enhancing the model’s effectiveness on unfamiliar data82.
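The sketch below shows how this double randomization, bootstrap samples of the data and random feature subsets at each split, is expressed in scikit-learn; the settings are illustrative.

```python
# Hedged sketch: RandomForestRegressor with bootstrap sampling and feature subsampling.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=500,
    max_features="sqrt",   # random subset of descriptors at each split
    bootstrap=True,        # each tree sees a random sample of the data
    oob_score=True,        # out-of-bag estimate of generalization error
    n_jobs=-1, random_state=0,
)
# rf.fit(X_train, y_train); rf.oob_score_
```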

Another notable feature of Random Forest is its capacity to manage high-dimensional datasets with noisy and correlated features. The algorithm's inherent ability to work with subsets of features during the tree-building process ensures that it can effectively manage intricate relationships within the data. This makes Random Forest particularly well-suited for a broad variety of applications, including pattern recognition, image processing, and financial forecasting83.

In conclusion, Random Forest is an adaptable and robust tool for tasks involving both classification and regression. Its ability to handle complex, noisy datasets while reducing overfitting and improving accuracy has made it a popular choice across various domains, from data science to real-world problem-solving scenarios84.

Data description and analysis

The dataset utilized in this study originates from the DIPPR 801 database, maintained by the Design Institute for Physical Properties (DIPPR) of the American Institute of Chemical Engineers (AIChE)85. Constructed to address industry demands for reliable physical property data, the DIPPR 801 database is renowned for its rigorous curation and validation processes, which ensure the consistency and precision of the information85. For this study, the independent parameters were extracted from the physical constants of this database, which encompasses solubility parameters of 1,889 polymeric compounds. The input variables include molecular weight, melting point, boiling point, liquid molar volume, dielectric constant, radius of gyration, refractive index, van der Waals area, reduced volume, and parachor, while the output variable is the solubility parameter of the polymers, predicted using the QSPR approach developed through machine learning models. This comprehensive dataset enables the creation of sophisticated models that can accurately predict solubility parameters, thereby facilitating advancements in polymer science and engineering.

It is crucial to note that the solubility parameters of the polymers were experimentally determined, as reported in the literature, based on a range of physicochemical properties and structural descriptors. The resulting experimental dataset consists of 1,799 records, which have been systematically divided into three subsets to facilitate model development and evaluation. Specifically, approximately 70% of the data (1,259 data points) is designated for training the predictive models, while 15% (270 data points) is reserved for testing and another 15% (270 data points) for validation. This strategic partitioning of the dataset ensures a rigorous and reliable framework for developing and assessing models aimed at accurately predicting the solubility parameters of polymers. Such an approach improves the generalizability and robustness of the models, making them more relevant to real-world polymer processing scenarios.
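The approximate 70/15/15 partition described above can be reproduced with two successive random splits, as in the illustrative sketch below; the random seed and the X/y names are assumptions.

```python
# Hedged sketch: an approximate 70/15/15 train/validation/test split.
from sklearn.model_selection import train_test_split

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)
# roughly 1,259 / 270 / 270 records for training / validation / testing
```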

The input parameters for the predictive model were carefully selected by identifying the key factors that significantly influence the solubility parameter of polymers. These parameters comprise a comprehensive set of physicochemical properties: molecular weight, melting point, boiling point, liquid molar volume, dipole moment, radius of gyration, dielectric constant, van der Waals area, refractive index, reduced volume, and parachor. Each of these properties plays a crucial role in determining the behavior and interactions of polymers, thereby directly or indirectly affecting their solubility characteristics. For example, molecular weight has been shown to influence polymer solubility by affecting chain entanglement and free volume, with higher molecular weights typically reducing solubility due to increased chain stiffness and lower segmental mobility14. Similarly, the dielectric constant reflects the polarity of the polymer, which directly affects interactions with polar or nonpolar solvents86. Boiling and melting points are indicators of intermolecular forces, such as van der Waals interactions or hydrogen bonding, which also influence solubility behavior6.

In this study, the solubility parameter of polymers serves as the primary output variable, which the model aims to predict accurately based on aforementioned input parameters. To provide insights into the range, distribution, and relationships between the input parameters and the solubility parameter, scatter matrix plots have been generated and are presented in Fig. 2. These visualizations offer a valuable overview of the data, highlighting patterns, correlations, and potential outliers, which are crucial for grasping the foundational framework of the dataset and guiding the development of an effective predictive model.

Fig. 2 Scatter matrix diagrams: relationships between variables.

Figure 3 presents the Pearson correlation coefficients for all parameter pairs analyzed in this study. The Pearson correlation coefficient r is calculated using the following formula:

\(r_j=\frac{\sum_{i=1}^{n}\left(I_{i,j}-\bar{I}_j\right)\left(Z_i-\bar{Z}\right)}{\sqrt{\sum_{i=1}^{n}\left(I_{i,j}-\bar{I}_j\right)^2\,\sum_{i=1}^{n}\left(Z_i-\bar{Z}\right)^2}}\)

In this correlation analysis, I_{i,j} denotes the i-th value of input variable j and I̅_j its mean, while Z_i denotes the i-th value of the target variable (the solubility parameter) and Z̄ its mean.
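For reference, the same pairwise coefficients can be obtained directly from the dataset with pandas, as in the sketch below; the DataFrame df and the column name solubility_parameter are hypothetical placeholders.

```python
# Hedged sketch: Pearson correlation of each descriptor with the solubility parameter.
import pandas as pd

# df holds the input descriptors plus a "solubility_parameter" column
r = df.corr(method="pearson")["solubility_parameter"].drop("solubility_parameter")
print(r.sort_values())   # negative values -> adverse correlation, positive -> beneficial
```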

According to Fig. 3, every input variable shows a correlation with the polymer solubility parameter. Notably, boiling point, melting point, dipole moment, dielectric constant and refractive index exhibit positive correlations, while molecular weight, liquid molar volume, van der Waals area, radius of gyration, reduced volume, and parachor present negative correlations with the solubility parameter.

Fig. 3 Correlation coefficient matrix: two-way relationships among variables.

Prior to developing machine learning models to predict the solubility parameter of polymers, it is essential to ensure data reliability by managing outliers. This study applies the Monte Carlo Outlier Detection (MCOD) algorithm, which combines random sampling with local density estimation to identify and exclude outliers in large datasets. This technique reduces computational complexity while retaining the representativeness of the dataset for model training. A boxplot summarizing the data distribution after applying MCOD is shown in Fig. 4.

Figure 4 presents a boxplot of the database employed in this research, showcasing the data distribution and defining the range considered appropriate for model development. As illustrated, most data points lie within the allowable range, indicating excellent data quality. In this study, all data points from the collected dataset were used in developing the machine learning models. This ensures that the models are built on a comprehensive database, improving their ability to generalize efficiently to new, unseen data. By covering the full spectrum of the data, the algorithms are better able to recognize the underlying patterns and variations, which leads to more dependable and precise predictions of the solubility parameter of polymers in different solvents.
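Since the exact MCOD implementation is not listed here, the sketch below gives one plausible reconstruction of the idea, repeated random subsampling combined with a k-nearest-neighbour density score, and should be read as an illustration rather than the authors' algorithm.

```python
# Hedged sketch: a Monte Carlo outlier screen combining random subsampling with a
# local-density (k-nearest-neighbour distance) score. Illustrative reconstruction only.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mcod_outliers(X, n_iter=50, subsample=0.5, k=10, quantile=0.99, seed=0):
    """X is a numeric numpy array; returns a boolean mask of flagged points."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X))
    for _ in range(n_iter):
        idx = rng.choice(len(X), size=int(subsample * len(X)), replace=False)
        nn = NearestNeighbors(n_neighbors=k).fit(X[idx])
        dist, _ = nn.kneighbors(X)            # distances of every point to the random subsample
        score = dist.mean(axis=1)             # low local density -> large score
        votes += score > np.quantile(score, quantile)
    return votes > n_iter / 2                 # flagged in most random draws
```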

Fig. 4 (A) Identifying outliers using the Monte Carlo algorithm; (B) boxplot of the data distribution.

Results and discussions

In this section, we perform a thorough evaluation of the dependability and efficacy of the developed data-driven models for predicting the solubility parameter of various polymers. To this end, we utilize several machine learning methods, including ridge regression, Convolutional Neural Networks (CNN), linear regression, Artificial Neural Networks (ANN), elastic net, Support Vector Regression (SVR), lasso regression, Random Forest (RF), Gradient Boosting Machines (GBM), K-Nearest Neighbors (KNN), Decision Trees (DT), LightGBM, XGBoost, and CatBoost.

To evaluate the effectiveness of these models, a range of metrics is employed, including the coefficient of determination (R²), mean squared error (MSE), mean relative deviation percentage (MRD%), and the standard deviation of residuals (σ). These metrics are defined below:

\(R^2=1-\frac{\sum_{i=1}^{N}\left(y_i^{real}-y_i^{predicted}\right)^2}{\sum_{i=1}^{N}\left(y_i^{real}-\overline{y^{real}}\right)^2}\)  (2)

\(MSE=\frac{1}{N}\sum_{i=1}^{N}\left(y_i^{real}-y_i^{predicted}\right)^2\)  (3)

\(MRD\%=\frac{100}{N}\sum_{i=1}^{N}\left(\frac{y_i^{real}-y_i^{predicted}}{y_i^{real}}\right)\)  (4)

\(\sigma=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i^{real}-y_i^{predicted}\right)^2}\)  (5)

Here, \(y_i^{predicted}\) and \(y_i^{real}\) represent the predicted and actual target values, respectively, while N denotes the total number of data points in the dataset.
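For completeness, the four metrics can be computed directly from Eqs. (2)-(5) as in the short sketch below; the function name and argument names are arbitrary.

```python
# Hedged sketch: the four evaluation metrics implemented directly from Eqs. (2)-(5).
import numpy as np

def evaluate(y_real, y_pred):
    y_real, y_pred = np.asarray(y_real), np.asarray(y_pred)
    resid = y_real - y_pred
    r2 = 1.0 - np.sum(resid**2) / np.sum((y_real - y_real.mean())**2)
    mse = np.mean(resid**2)
    mrd = 100.0 * np.mean(resid / y_real)
    sigma = np.sqrt(np.mean(resid**2))
    return {"R2": r2, "MSE": mse, "MRD%": mrd, "sigma": sigma}
```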

As per Table 1, the highest-performing models are ANN, CNN, and CatBoost. Table 1 confirms their exceptional performance: CNN records the highest R² values (training: 0.941, validation: 0.838, testing: 0.914), the lowest RMSE (training: 1079, validation: 1415, testing: 1329), and the lowest MRD% (training: 3.8%, validation: 4.3%, testing: 4.6%). Likewise, ANN and CatBoost show outstanding performance, exhibiting high R² values, minimal MSE, and relatively low MRD%, which affirms their reliability in predictive tasks.

In contrast, simpler models such as Linear Regression, Decision Tree, and XGBoost demonstrate significantly lower performance than the more advanced models. These models achieve R² values ranging from 0.70 to 0.83, notably lower than those of the top-performing models. Additionally, they exhibit substantially higher Root Mean Squared Error (RMSE) values, varying between 1910 and 2540, indicating larger deviations from the actual values.

The Decision Tree and XGBoost models, in particular, display perfect performance on the training data (R² = 1.0000, MSE = 0), but this is indicative of severe overfitting. As a result, their performance deteriorates markedly on the validation and test sets, revealing poor generalization. Similarly, the KNN model achieves relatively acceptable R² values (0.88 for training, 0.80 for validation, and 0.79 for testing). However, its higher MRD% values on the validation (5.6%) and test (4%) sets suggest that it incurs greater relative errors than the leading models. These results highlight the limitations of these models in capturing complex patterns within the data and emphasize the importance of selecting more sophisticated algorithms for improved predictive performance.

In summary, CNN, ANN, and CatBoost emerge as the most precise and reliable frameworks for predicting the solubility parameter of diverse polymers, as evidenced by their robust performance across all evaluation metrics. Their superior performance is consistently demonstrated across all phases, showcasing their ability to handle the intricacies and fluctuations present in polymer datasets. These algorithms not only achieve high predictive accuracy but also exhibit strong generalization, making them well-suited for this challenging task. Their robustness with respect to metrics such as R², Root Mean Squared Error (RMSE), and Mean Relative Deviation (MRD%) further solidifies their position as leading choices for solubility parameter prediction in polymer science applications.

Table 1 Assessment metrics for machine learning models: performance across all phases.

This study utilizes graphical techniques, including relative deviation values and crossplots, to evaluate the efficiency of machine learning models designed for forecasting the solubility parameter of various polymers. These visualization methods play an essential role in enhancing the evaluation process by providing clear and intuitive comparisons of model performance. By plotting predicted solubility parameter values against actual values, these graphical tools facilitate a deeper understanding of the models' accuracy and reliability. They also help identify any discrepancies or error patterns that may not be immediately apparent from numerical metrics alone.

Relative deviation values offer a quantitative measure of the disparities between forecasted and real outcomes, enabling a more nuanced evaluation of model precision. Crossplots, on the other hand, visually represent these deviations, allowing for easy identification of trends, outliers, or systematic biases in the predictions. Together, these techniques offer valuable perspectives on the strength and consistency of models, reinforcing overall assessment of their ability to accurately predict the solubility parameter of polymers across diverse datasets. This graphical approach ensures a comprehensive and reliable evaluation framework, supporting the development of effective predictive models in polymer science.

Figure 5 provides a comparative visualization of actual data points versus modeled data points based on the data point index for all developed models, covering all phases for the solubility parameters of various polymers. As depicted, the actual and predicted values for CatBoost, ANN, and CNN closely overlap, highlighting that these machine learning algorithms outperform all other methods evaluated in this study. This near-perfect alignment underscores their superior ability to generalize across different datasets.

Furthermore, Fig. 6 presents cross-plots comparing actual versus estimated values for all machine learning algorithms. A dense clustering of data points around the line y = x is clearly visible for CatBoost, ANN, and CNN, reinforcing their exceptional predictive accuracy. The fitted lines derived from these cross-plots for the three algorithms closely follow the line y = x, indicating a strong linear correlation between the real solubility values and the estimates generated by these models.

These visual analyses collectively demonstrate the robustness and reliability of CatBoost, ANN, and CNN in predicting the solubility parameters of diverse polymers. The high degree of correspondence between actual and predicted values not only validates the exceptional performance of these algorithms but also confirms their suitability for real-world applications in polymer science and engineering.

Figure 7 presents scatter plots that show relative errors of developed algorithms in predicting the solubility parameters of various polymers. The distribution of these error values around the x-axis is particularly evident for CatBoost, ANN, and CNN models. This illustration emphasizes the slight difference between their predictions and the real solubility values, reinforcing strong alignment between the predicted and true values. Such a tight clustering of errors around zero indicates that these models exhibit high precision and accuracy in their predictions.

Figure 8 provides an additional layer of insight by showcasing the prediction distributions of all developed machine learning models across all phases. Notably, the estimation profiles for CatBoost, ANN, and CNN show greater consistency across these phases than those of the other methods. This consistency suggests that these models not only excel on the training dataset but also generalize efficiently to the unseen testing and validation datasets.

The ability of CatBoost, ANN, and CNN to maintain consistent performance across different phases further validates their robustness and reliability. These findings collectively confirm that these three models are the most effective and dependable choices for estimating the solubility parameters of diverse polymers within the context of this study. Their superior performance makes them highly suitable for practical applications in polymer science and related fields.

Fig. 5 Comparison between actual and forecasted values: all phases for all data-driven models.

Fig. 6 Cross plots: modeled versus real points for all data across all data-driven algorithms.

Fig. 7 Percentage of relative deviation: training, testing, and validation phases for all data-driven models.

Fig. 8 Distribution of frequencies: all phases for all data-driven algorithms.

Evaluating the importance of input features is a critical step in comprehending how different factors affect the prediction of solubility parameters for diverse polymers using machine learning models. In this research, the SHAP (Shapley Additive Explanations) method is employed to assess feature importance, offering a robust framework for interpreting the complexities of high-performance models. Rooted in game theory, the SHAP method assigns contribution values to every data point based on its input features, thereby enhancing algorithm interpretability at both the aggregate and individual levels. This approach offers important perspectives on the role of each input variable in shaping the target output, in this case the solubility parameters of polymers.
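As an illustration, the sketch below shows how SHAP values might be obtained for a fitted regressor with the shap library; the model, X_train, X_test, and feature_names objects are placeholders rather than the exact pipeline used here.

```python
# Hedged sketch: computing SHAP values for a fitted regressor on the polymer descriptors.
import shap

explainer = shap.Explainer(model.predict, X_train)   # model-agnostic explainer
shap_values = explainer(X_test)
shap.summary_plot(shap_values, X_test, feature_names=feature_names)
```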

Figure 9 presents the SHAP values for the input parameters and their corresponding importance, as established by the CNN model. The features are ranked from highest to lowest according to their mean SHAP values, with higher ranks indicating a greater influence on the solubility parameter estimations. The SHAP analysis identifies the dielectric constant as the most influential feature in determining the solubility parameter of polymers. This observation is consistent with the physicochemical interpretation that the dielectric constant reflects the polarity of a material. Polymers with higher dielectric constants tend to have stronger dipole–dipole interactions and enhanced compatibility with polar solvents, leading to higher solubility parameters. This relationship has been previously reported in polymer-solvent compatibility studies86,87, where higher polarity generally correlates with greater cohesive energy density. Conversely, materials with low dielectric constants exhibit weaker intermolecular interactions and thus lower solubility parameters, which the model accurately captures.

In addition to dielectric constant, other top-ranking features such as molecular weight, liquid molar volume, and melting point also contribute significantly. Molecular weight typically exhibits a negative correlation with solubility due to the increased rigidity and reduced free volume in higher molecular weight polymers, which hinders solvation14. Similarly, higher liquid molar volumes often indicate bulkier molecules, resulting in lower cohesive energy density and reduced solubility. Melting point, as an indicator of crystalline strength and intermolecular forces, tends to have a positive correlation, suggesting that polymers with higher melting points often possess stronger intermolecular cohesion, which influences their solubility characteristics12. These insights reinforce the interpretability of the model and its alignment with established physicochemical principles.

By providing a detailed breakdown of feature contributions, the SHAP method not only enhances the transparency of the CNN model but also offers actionable insights for improving predictive accuracy and understanding the underlying mechanisms governing polymer solubility. This analysis serves as a foundation for optimizing polymer design and processing strategies, making it an invaluable tool in the field of polymer science and engineering.

Fig. 9 SHAP analysis based on the CNN model.

While several recent studies have employed machine learning to predict solubility parameters of polymers88,89, many of these efforts suffer from limitations such as the use of relatively small or domain-specific datasets, limited model diversity (e.g., reliance on linear models or single-layer ANNs), and a lack of interpretability analysis regarding feature importance. In contrast, the present study utilizes a significantly larger and more diverse dataset (1,799 polymers) extracted from the DIPPR 801 database, covering a broad range of physicochemical descriptors. Moreover, we implement and compare a wide variety of both classical and advanced machine learning models including CNN, CatBoost, and ANN and evaluate them comprehensively using multiple performance metrics and visualization techniques. Importantly, the incorporation of SHAP analysis allows us to go beyond predictive accuracy and offer insights into the relative importance and physical relevance of each input feature, thus improving model transparency and scientific interpretability. These advancements differentiate this work from prior studies and address common weaknesses in earlier approaches.

Conclusion

This paper centers on the development of data-driven models leveraging a wide array of machine learning algorithms to estimate the solubility parameters of diverse polymers. The algorithms employed include Random Forests (RFs), Linear Regression, K-Nearest Neighbors (KNN), Convolutional Neural Networks (CNNs), Artificial Neural Networks (ANNs), Ridge Regression, Lasso Regression, Elastic Net, Support Vector Machines (SVMs), Extreme Gradient Boosting (XGBoost), Decision Trees (DTs), Light Gradient Boosting Machines (LightGBM), Gradient Boosting Machines (GBMs), and Categorical Boosting (CatBoost). These models were trained using key physicochemical descriptors such as molecular weight, melting point, liquid molar volume, radius of gyration, boiling point, dielectric constant, dipole moment, van der Waals area, refractive index, reduced volume, and parachor. To assess the effectiveness of the developed models, a combination of quantitative metrics and visualization tools was utilized, and the Monte Carlo outlier detection method was applied to guarantee the reliability of the databank, confirming that most of the data points were suitable for training and validation purposes.

The superior performance of CatBoost, ANN, and CNN can be attributed to their inherent ability to capture complex, nonlinear relationships and intricate feature interactions within the dataset. CatBoost, a gradient boosting method tailored for handling categorical and numerical features, efficiently models nonlinearities while mitigating overfitting through ordered boosting and advanced regularization. ANN and CNN, on the other hand, are deep learning architectures capable of learning hierarchical patterns and internal representations without explicit feature engineering. In particular, CNNs leverage local connectivity and weight-sharing mechanisms that help identify spatial correlations and subtle dependencies among features. These capabilities allow the models to generalize better across the diverse physicochemical properties present in the polymer dataset, resulting in more accurate and robust predictions.

Further analysis revealed that specific physicochemical properties, including boiling point, melting point, dipole moment, dielectric constant, and refractive index, significantly influence the solubility parameters of polymers. Some of these factors exhibited positive relationships with solubility, while others showed negative correlations. This finding underscores the models' capability to capture the intricate and nonlinear connections present in the dataset, reinforcing their suitability for predictive applications in this domain. To enhance interpretability, SHAP (Shapley Additive Explanations) values were employed to clarify the contributions of individual features to the predictions. The analysis identified the dielectric constant as the most influential factor affecting the solubility parameters of diverse polymers. This insight not only validates the robustness of the selected models but also provides a deeper understanding of the critical elements influencing polymer solubility.

In conclusion, this comprehensive approach not only demonstrates the effectiveness of CatBoost, ANN, and CNN in accurately predicting polymer solubility parameters but also offers valuable understanding of the fundamental processes driving these predictions.
Through the integration of sophisticated machine learning methods with interpretable feature importance analysis, this research opens the door for enhanced precision and transparency in future research, enabling more informed decisions in polymer science and engineering applications.