Introduction

Conventional petroleum derivatives, including gasoline and aviation fuels, are composed of hundreds of hydrocarbons1. Their primary constituents are normal alkanes, isoparaffins, naphthenes, and aromatics, with normal alkanes and naphthenes bearing alkyl side chains being especially important in aviation fuels2,3,4,5. Jet fuel, for example, contains four main hydrocarbon groups: linear alkanes, branched alkanes, cycloalkanes, and aromatics. The relative amounts of these components strongly affect the performance of the jet fuel in combustion, atomization, aircraft range, metering, and thermal stability. Fuel quality also has a direct effect on global warming6. Accurately characterizing the fundamental properties of these fuels is a significant challenge because of the variety of hydrocarbons and their intricate interactions. This complexity in turn impedes the precise simulation of fluid dynamics and the design of heat exchange systems6,7,8,9,10.

In studies of aviation fuel composition, straight-chain alkanes such as n-dodecane, n-tetradecane, and n-hexadecane are frequently selected as representative model compounds. For the cycloalkane category, shorter alkyl cycloalkanes are used, with methylcyclohexane and ethylcyclohexane being common choices8,11,12,13. For pure methylcyclohexane or ethylcyclohexane14,15, a wealth of density data has been documented by various researchers16,17,18,19,20. However, the physical properties, such as density, of binary mixtures of these cycloalkanes with straight-chain alkanes have received relatively scant attention.

Studying fuels10, ignition21, and the properties that affect their performance is clearly important22. Because cycloalkanes are used in surrogate mixtures for aviation fuels, the properties relevant to fuel transportation and combustion must be determined for their combinations with n-alkanes23. Acquiring basic physical property data across broad pressure and temperature ranges remains challenging, even for binary mixtures of simple composition. Measuring the physical characteristics of such binary blends, including thermal expansion, compressibility, and density, therefore forms the groundwork for the advanced evaluation and modeling of complex aviation fuels. Consequently, many researchers have investigated these properties experimentally. For instance, Prak et al.23 studied the physical properties of mixtures of methylcyclohexane or ethylcyclohexane with n-dodecane or n-hexadecane at temperatures from 293.15 to 333.15 K and a pressure of 0.1 MPa. By comparing these properties with those of traditional petroleum-based fuels, they concluded that mixtures of ethylcyclohexane and n-hexadecane most closely resemble jet fuel. Baragi et al.24 explored the densities and predicted the excess molar volume of methylcyclohexane and n-dodecane mixtures at temperatures between 298.15 and 308.15 K and a pressure of 0.1 MPa. Calvar et al.25 assessed the densities of blends of aromatic hydrocarbons with methylcyclohexane at a temperature of 313.15 K and a pressure of 0.1 MPa. Van Hecke et al.26 experimentally determined densities of ethylcyclohexane blends with organic compounds at temperatures of 288.15 and 318.15 K and a constant pressure of 0.1 MPa, noting an uncertainty of 0.0005 g/cm3. However, the influence of purity variations on this uncertainty was not discussed.
Prak et al.27 presented density measurements for n-alkylcyclohexane/n-tetradecane mixtures from 288.15 to 333.15 K at 0.1 MPa, specifying an expanded uncertainty of 0.3 kg/m3. Wang et al.28 selected n-dodecane, n-tetradecane, and n-hexadecane as representative straight-chain alkanes and methylcyclohexane and ethylcyclohexane as representative alkyl cycloalkanes, and measured the densities of six mixtures over a pressure range of 0.1 to 9 MPa and a temperature range of 280–423.15 K, thus covering the operating conditions of most engineering applications. Chum-in et al. developed correlations based on Gibbs energy to estimate the density of binary biofuel mixtures29. Cano-Gómez et al. proposed a non-linear relationship based on fractional ratios to determine the viscosity of a binary system30. Krisnangkura et al. used an approach based on molecular similarities to determine the viscosity of diesel and biodiesel mixtures31. Yoon et al. proposed a method that predicts the density of soybean oil and diesel mixtures as a function of temperature and proportional ratio32.

The laboratory research outlined above is labor-intensive and intricate, requiring sophisticated analytical techniques and costly laboratory apparatus. Meanwhile, artificial intelligence (AI) has proven effective in many applications33, spanning interpretation and prediction tasks across diverse fields34,35,36,37,38,39,40,41,42. Notably, despite the acute demand for density data on hydrocarbon mixtures at elevated pressures and temperatures, a noticeable gap remains in developing models with advanced intelligent methods. The present study therefore applies machine learning techniques, namely Random Forest (RF), Adaptive Boosting (AdaBoost), Decision Tree (DT), Ensemble Learning, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Multi-Layer Perceptron (MLP) Artificial Neural Network, and Convolutional Neural Network (CNN). These methods are used to predict the density of binary blends of ethylcyclohexane or methylcyclohexane with n-hexadecane, n-dodecane, or n-tetradecane over a wide range of operating conditions (pressure and temperature) and cycloalkane mole fractions, using laboratory data extracted from the literature. The reliability of the data is checked with an outlier detection algorithm, and a relevancy factor is used to determine the influence of each input variable on the mixture density. In addition, the accuracy of the developed models is assessed using statistical indices and graphical techniques.

Methodology

Data gathering

This investigation uses a dataset of 1461 data points collected from published articles23,24,28 on the experimental determination of the density of binary blends of ethylcyclohexane or methylcyclohexane with n-hexadecane, n-dodecane, or n-tetradecane. The dataset spans a broad range of mole fractions, temperatures, and pressures. The statistical attributes of all experimental data used in the modeling are listed in Table 1. For model development, 1156 data points are used for training, while 158 and 147 data points are used to validate and test the constructed models, respectively.

Table 1 Statistical data pertinent to the experimental data.
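
The subset sizes given for training, validation, and testing can be reproduced with a simple index shuffle. The following is a minimal sketch only: the text does not state how the points were assigned to each subset, so the random shuffle, the seed, and the function name `split_indices` are assumptions made for illustration.

```python
import random


def split_indices(n_total, n_train, n_val, n_test, seed=0):
    """Shuffle indices 0..n_total-1 and cut them into three disjoint subsets.

    The random assignment and the seed are assumptions; the source only
    reports the subset sizes.
    """
    if n_train + n_val + n_test != n_total:
        raise ValueError("subset sizes must add up to the dataset size")
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# Sizes reported in the text: 1461 = 1156 + 158 + 147.
train_idx, val_idx, test_idx = split_indices(1461, 1156, 158, 147)
print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed keeps the three subsets identical between runs, which makes the comparison of different algorithms on the same test points straightforward.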

Machine learning approaches

This section concerns the mathematical foundations of the machine learning algorithms used to develop the intelligent models in this study. The theory of these models is detailed in the Appendix.

Results and discussion

In this part, the density of binary blends of ethylcyclohexane or methylcyclohexane with n-hexadecane, n-dodecane, or n-tetradecane is estimated using eight machine learning methods. First, the hyper-parameters of each model are determined. For example, the max-depth of the DT algorithm is determined to be 15, as shown in Fig. 1. Figure 2 shows that the KNN algorithm performs better at K = 1 than at other K values.

Fig. 1 The estimation of max-depth in the DT model.

Fig. 2 The estimation of K in the KNN model.
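
A hyper-parameter scan of the kind summarized in Figs. 1 and 2 can be sketched as follows. This is an illustrative example on synthetic data, not the study's code: the helper names (`knn_predict`, `r_squared`, `sweep_k`), the linear toy "density", and the use of validation R-squared as the selection criterion are all assumptions.

```python
import math


def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k training points nearest to `query`."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k


def r_squared(y_real, y_pred):
    """Coefficient of determination between real and predicted values."""
    mean_real = sum(y_real) / len(y_real)
    ss_res = sum((r - p) ** 2 for r, p in zip(y_real, y_pred))
    ss_tot = sum((r - mean_real) ** 2 for r in y_real)
    return 1.0 - ss_res / ss_tot


def sweep_k(train_x, train_y, val_x, val_y, k_values):
    """Return {k: validation R-squared} for each candidate k."""
    scores = {}
    for k in k_values:
        preds = [knn_predict(train_x, train_y, q, k) for q in val_x]
        scores[k] = r_squared(val_y, preds)
    return scores


# Synthetic, made-up data: a density-like quantity that falls linearly
# with temperature. These numbers are NOT taken from the databank.
train_x = [(float(t),) for t in range(280, 430, 10)]
train_y = [900.0 - 0.7 * x[0] for x in train_x]
val_x = [(float(t),) for t in range(285, 425, 20)]
val_y = [900.0 - 0.7 * x[0] for x in val_x]

scores = sweep_k(train_x, train_y, val_x, val_y, [1, 2, 3, 5])
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The best K depends entirely on the data, so the value selected on this toy set carries no meaning for the density databank. The same loop structure applies to other single hyper-parameters such as the tree depth.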

Figure 3 shows the R-squared values for different c values in the SVM algorithm; the optimum value of c is about 1. The accuracy of AdaBoost as a function of the number of estimators is shown in Fig. 4. As illustrated, 63 estimators give the best structure for this model. Finally, the RF model with 9 estimators is most accurate at a max-depth of 9 (see Fig. 5).

Fig. 3 SVM algorithm performance for disparate c values.

Fig. 4 Adaptive Boosting algorithm performance for disparate numbers of estimators.

Fig. 5 Hyper-parameter estimation for the RF algorithm.

The performance of CNN and MLP methods during the training process is shown in Figs. 6 and 7, respectively.

Fig. 6 The MSE values in different iterations for the CNN model.

Fig. 7 The MSE values in different iterations for the MLP model.

The chosen hyper-parameters for each algorithm are reported briefly in Table 2.

Table 2 The chosen hyper-parameters for each algorithm.

Some statistical parameters are employed to assess the models mentioned above. They are defined as below:

$$\text{Mean absolute percentage error (MAPE)} = \frac{100}{N}\sum_{i=1}^{N}\left| \frac{y_{i}^{\text{real}} - y_{i}^{\text{predicted}}}{y_{i}^{\text{real}}} \right|$$
(1)
$$\text{Mean squared error (MSE)} = \frac{1}{N}\sum_{i=1}^{N}\left( y_{i}^{\text{real}} - y_{i}^{\text{predicted}} \right)^{2}$$
(2)
$$R\text{-squared}\;(R^{2}) = 1 - \frac{\sum_{i=1}^{N}\left( y_{i}^{\text{real}} - y_{i}^{\text{predicted}} \right)^{2}}{\sum_{i=1}^{N}\left( y_{i}^{\text{real}} - \overline{y^{\text{real}}} \right)^{2}}$$
(3)
$$\text{Mean relative error (MRE)} = \frac{100}{N}\sum_{i=1}^{N}\left( \frac{y_{i}^{\text{real}} - y_{i}^{\text{predicted}}}{y_{i}^{\text{real}}} \right)$$
(4)
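
Equations (1) to (4) translate directly into a few lines of code. The sketch below is an assumed, minimal implementation in plain Python; the function names are illustrative. It also includes the mean absolute error (MAE), which appears among the reported results but is not defined in the text, so its standard definition, (1/N) Σ |y_real − y_predicted|, is assumed here.

```python
def mape(y_real, y_pred):
    """Eq. (1): mean absolute percentage error, in percent."""
    n = len(y_real)
    return 100.0 / n * sum(abs((r - p) / r) for r, p in zip(y_real, y_pred))


def mse(y_real, y_pred):
    """Eq. (2): mean squared error."""
    n = len(y_real)
    return sum((r - p) ** 2 for r, p in zip(y_real, y_pred)) / n


def r_squared(y_real, y_pred):
    """Eq. (3): coefficient of determination."""
    mean_real = sum(y_real) / len(y_real)
    ss_res = sum((r - p) ** 2 for r, p in zip(y_real, y_pred))
    ss_tot = sum((r - mean_real) ** 2 for r in y_real)
    return 1.0 - ss_res / ss_tot


def mre(y_real, y_pred):
    """Eq. (4): mean relative error, in percent (signed)."""
    n = len(y_real)
    return 100.0 / n * sum((r - p) / r for r, p in zip(y_real, y_pred))


def mae(y_real, y_pred):
    """Mean absolute error; standard definition, assumed (not in the text)."""
    n = len(y_real)
    return sum(abs(r - p) for r, p in zip(y_real, y_pred)) / n


# Made-up values chosen so the results are easy to check by hand.
y_real = [100.0, 200.0]
y_pred = [110.0, 190.0]
print(mape(y_real, y_pred), mse(y_real, y_pred), r_squared(y_real, y_pred),
      mre(y_real, y_pred), mae(y_real, y_pred))
```

Because MRE keeps the sign of each error, positive and negative deviations can cancel; MAPE does not have this property, which is why the two are usually reported together.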

The statistical analysis is summarized in Table 3. The MAPE values are also plotted in Fig. 8, which shows that the RF and DT models are the most accurate in calculating density, with MAPE values of 4.527 and 4.6935, respectively.

Table 3 The summary of statistical analysis.

Fig. 8 Calculation of MAPE values for different algorithms.

Figure 9 confirms that the DT and RF algorithms perform well in calculating density, with R2 values of 0.9985 and 0.9982, respectively. In contrast, the MLP and AdaBoost models exhibit the weakest performance, with R2 values of 0.9455 and 0.9477, respectively. The MAE and MSE values are summarized in Figs. 10 and 11, respectively.

Fig. 9 Calculation of R2 values for different algorithms.

Fig. 10 Calculation of MAE values for different algorithms.

Fig. 11 Calculation of MSE values for different algorithms.

The predicted density is plotted against the actual density in Fig. 12, in which the tightness of the points around the diagonal reflects the precision of each model. In addition, lines fitted to the different subsets of data points lie close to the bisector line. The relative error between the predicted and real density values is shown for each model in Fig. 13. The closer the relative errors lie to the x-axis, the more accurate the model.

Fig. 12 The predicted density versus the actual density.

Fig. 13 The relative error between the predicted and actual density values.

Training time is another important criterion in selecting the best estimator. The time spent training each model is therefore reported in Fig. 14, which indicates that the neural network models require more time than the others.

Fig. 14 Calculation of run time for different algorithms.

It is worth mentioning the advantages and disadvantages of these models. The KNN algorithm is simple to implement but can be computationally expensive for large databanks. The SVM is also computationally expensive but works well with small databanks. DT is robust to outliers but prone to overfitting. RF, on the other hand, is robust to overfitting and achieves high accuracy because of ensemble learning. ANN-based algorithms can handle complex tasks, but they need long processing times.

The accuracy of the dataset used to train the algorithms is vitally important. Hence, the reliability of the ethylcyclohexane/methylcyclohexane with n-dodecane/n-tetradecane/n-hexadecane density databank must be assessed. The leverage method, a mathematical technique, is employed in this work. It is based on the Hat matrix, H, which is determined as follows43,44,45:

$$H = X\left( {X^{T} X} \right)^{ - 1} X^{T}$$
(5)

In this definition, X is an \(m\times n\) matrix, where m is the number of data points and n is the number of model parameters. Next, the reliable and outlier zones should be identified. To this end, the critical leverage value H* is defined as follows:

$$H^{*} = 3\left( {n + 1} \right)/m$$
(6)

The Williams plot is then used to present the results of this analysis. In this plot, the normalized residuals are plotted against the Hat values, which are the elements of the main diagonal of H. Figure 15 shows where the data points fall in this plot, and it indicates that all data points are reliable. Therefore, they are accurate enough to be used in developing the algorithms. In this work, the best model, DT, is used to generate this graph and analysis.

Fig. 15 Outlier detection.
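
The Hat values of Eq. (5) and the critical leverage of Eq. (6) can be computed as in the sketch below. It is an assumed illustration in plain Python, not the authors' implementation: the names `solve_linear`, `hat_values`, and `critical_leverage` are hypothetical, and the small example matrix is made up. Instead of forming the inverse explicitly, each diagonal element is obtained as h_ii = x_i^T (X^T X)^-1 x_i by solving a linear system, which gives the same diagonal.

```python
def solve_linear(a, b):
    """Solve a.z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_val = aug[col][col]
        aug[col] = [v / pivot_val for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc
                          for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def hat_values(x):
    """Diagonal of H = X (X^T X)^-1 X^T, Eq. (5); x is a list of m rows."""
    n = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(n)]
           for i in range(n)]
    return [sum(xi * zi for xi, zi in zip(row, solve_linear(xtx, row)))
            for row in x]


def critical_leverage(m, n):
    """Eq. (6): H* = 3 (n + 1) / m."""
    return 3 * (n + 1) / m


# Made-up 4 x 2 example; the column of ones is an illustrative assumption.
x = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
h = hat_values(x)
h_star = critical_leverage(len(x), len(x[0]))
suspect = [i for i, hi in enumerate(h) if hi > h_star]
print(h, h_star, suspect)
```

A Williams plot then places each Hat value on the horizontal axis and the corresponding normalized residual on the vertical axis; points to the right of H* are the high-leverage candidates.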

The effect of pressure, temperature, and cycloalkane mole fraction on the density of ethylcyclohexane/methylcyclohexane with n-dodecane/n-tetradecane/n-hexadecane is determined by the relevancy factor (\({r}_{j}\)). This parameter quantifies the effect of a particular input variable (\(x_{j}\)) on the density (y). It is calculated as follows46,47:

$$r_{j} = \frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {x_{j,i} - \overline{x}_{j} } \right)\left( {y_{i} - \overline{y}} \right)}}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{n} \left( {x_{j,i} - \overline{x}_{j} } \right)^{2} \mathop \sum \nolimits_{i = 1}^{n} \left( {y_{i} - \overline{y}} \right)^{2} } }}, \left( {j = 1,2,3} \right)$$
(7)

The relevancy factor lies between –1 and 1; a negative sign indicates that the density decreases as that input increases. Furthermore, a larger absolute value of r indicates a stronger relationship between that input and the density. As shown in Fig. 16, temperature has the greatest impact on the density, with an r value of –0.9619, where the negative sign expresses an inverse relationship between temperature and density. Pressure, on the other hand, has the least effect, with an r value of 0.041, whose positive sign indicates a direct relationship between pressure and density.

Fig. 16 Relevancy factor for density of ethylcyclohexane/methylcyclohexane with n-dodecane/n-tetradecane/n-hexadecane databank.
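
Equation (7) has the form of the Pearson correlation coefficient between one input and the density, so it can be evaluated as in the following minimal sketch. The function name `relevancy_factor` and the temperature and density values are hypothetical and serve only to illustrate the sign convention; they are not taken from the databank.

```python
import math


def relevancy_factor(x, y):
    """Eq. (7): relevancy factor between one input x and the output y."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant variable has no defined relevancy factor")
    return cov / math.sqrt(var_x * var_y)


# Made-up values: density falls as temperature rises.
temperature = [293.15, 313.15, 333.15, 353.15]  # K, hypothetical
density = [760.0, 745.0, 731.0, 716.0]          # kg/m3, hypothetical

r_temperature = relevancy_factor(temperature, density)
print(r_temperature)
```

A value close to –1, as in this toy case, corresponds to a strong inverse relationship, whereas a value near zero indicates a weak linear dependence on that input.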

Conclusions

In this study, several artificial intelligence methods are applied to predict the density of ethylcyclohexane/methylcyclohexane blended with n-dodecane/n-hexadecane/n-tetradecane as a function of operating conditions (pressure and temperature) and the cycloalkane mole fraction. A total of 1461 data points covering a wide range of conditions are used to develop the models. These data points are assessed with the leverage method, and all of them fall in the reliable region. Hence, they can be used in all steps of model development. The sensitivity analysis based on the relevancy factor shows that temperature is the most influential parameter on the density, with an r value of –0.9619; the negative sign expresses an inverse relationship between temperature and density. The statistical and graphical comparisons between the developed models show that the DT and RF algorithms perform best in calculating density, with R2 values of 0.9985 and 0.9982, respectively. According to these results, this paper provides several robust tools for calculating the density of ethylcyclohexane/methylcyclohexane with n-dodecane/n-tetradecane/n-hexadecane, which are useful for chemical engineers and chemists.