Introduction

Today, the problems caused by non-renewable energy sources are widespread around the world. The high consumption of fossil fuels produces greenhouse gases, which make the Earth warmer1,2,3. Many measures have been taken to control the production of greenhouse gases4. Biodiesel is one of the most popular fuel sources that cause less damage to nature5. Biodiesel is a mixture of fatty acid alkyl esters produced by the catalytic transesterification of fats with different alcohols; the catalyst increases the reaction rate6. Biodiesel is known as a clean fuel because of the small amount of sulfur in its composition, which reduces the production of greenhouse gases7. Another benefit of using biodiesel in diesel engines is a longer engine life due to its high fluidity8. However, compared to petroleum-based fuels, biodiesel has disadvantages such as poor oxidation stability, higher viscosity, and higher production cost9. Owing to the importance of biofuels, attention to their properties and applications has increased, and many empirical correlations and models have been developed to determine these properties10. One important property of biodiesel is surface tension, which governs atomization: atomization quality improves as surface tension decreases10. Surface tension in diesel fuels therefore has both economic and environmental consequences.

In the following, an overview of thermodynamic models, artificial intelligence, and experimental studies conducted for forecasting biodiesel and fossil fuels surface tension is presented:

Queimada et al.11 first established a model to estimate the viscosity and surface tension of fuels. Barati-Harooni et al.12 then presented a smart model for predicting brine interfacial tension using the least-squares support vector machine (LSSVM) method, with water salinity, temperature, and pressure as inputs. Rostami et al.13 presented a genetic programming model for estimating water and hydrocarbon surface tension and reported an R2 of 0.91. Pratas et al.14 presented a model for predicting biodiesel density with an error of 0.25–2.96%. Gahek et al.15 developed a smart model to approximate alkane density with an average absolute error of 0.6%. Miraboutalebi et al.16 presented a model based on artificial neural network (ANN) methods, with a reported R2 of 0.95 and a root mean square error (RMSE) of 2.53. Hosseinpour et al.17 estimated the cetane number of biodiesel using SVM, with reported R2 and RMSE values of 0.99 and 0.72, respectively. Mostafaei18 predicted the cetane number using the logical phase neural system. Bemani et al. developed models for estimating the cetane number of biodiesel from the fatty acid methyl ester (FAME) properties of experimental data. The LSSVM algorithm was coupled with three optimizers: genetic algorithm (GA), particle swarm optimization (PSO), and a hybrid of GA and PSO (HGAPSO). The R2 values for LSSVM-GA, LSSVM-PSO, and LSSVM-HGAPSO were reported as 0.965, 0.966, and 0.978, respectively19,20,21. Razavi et al. developed a precise LSSVM-PSO model to predict biodiesel properties such as pour point, cloud point, iodine value, and kinematic viscosity from the fatty acid composition; the reported test-data accuracies for these properties are 0.99995, 0.99981, 0.99848, and 0.99930, respectively22,23,24,25. Baghban et al. developed TLBO-NN and PSO-NN models to improve the prediction of the cetane number of biodiesel based on its FAMEs. The study showed that TLBO-NN was more accurate than PSO-NN, with R-squared and mean square error values of 0.973 and 3.538 for TLBO-NN and 0.951 and 6.324 for PSO-NN26. Nabipour et al. presented four advanced models for forecasting biofuel density: Least Square Support Vector Machine (LSSVM), Radial Basis Function Artificial Neural Network (RBF-ANN), Multi-layer Perceptron Artificial Neural Network (MLP-ANN), and Adaptive Network-based Fuzzy Inference System (ANFIS). These models use intermolecular interactions and the van der Waals radii of atoms in their predictions. The LSSVM model was more accurate than the other models, with an R-squared of 0.847. The investigation shows that the LSSVM model can estimate biofuel density efficiently and is a viable alternative to conventional thermodynamic modeling approaches27.

In the following, studies on approximating the surface tension of biodiesel are reviewed. To predict the surface tension of pure FAME and biodiesel, Phankosol et al.28 presented two relations in terms of Gibbs free energy; the reported errors of these models were 1.84% and 1.21% for 10 and 8 distinct biodiesel FAME. Thangaraja29 proposed a relationship for approximating biodiesel and vegetable oil surface tension in the temperature range of 306–353 K with 7% absolute error. An et al.30 re-examined the relationships presented by Miller and Macleod-Sugden and concluded that the Miller relationship performs better than the Macleod-Sugden relationship. To forecast the surface tension of fatty acid ethyl esters, Valk31 used the Brock and Rari/Olivier models and reported accuracies of 7.5% and 2.4%, respectively. Melo-Espinosa et al.32 presented models based on intelligent methods to predict the surface tension of different oils in different temperature ranges; their results show that the artificial neural network (ANN) is more accurate than multiple linear regression (MLR) in predicting surface tension. Hosseini et al.33 developed ANN and thermodynamic models for approximating the surface tension of 3 biodiesel and FAME at different temperatures, with accuracies of 0.44 and 1.82%, respectively. Salehi et al.34 used machine learning methods to model the interfacial tension of N2/CO2 mixture + n-alkanes of oils; their model reproduced laboratory data with an average absolute relative error of 0.77%34. Biodiesel surface tension was also predicted using the models of Ceriani et al., Ferrando et al., and Marrero et al.35,36,37. Oliveira38 presented a model for predicting the surface tension of esters over a distinct temperature range by combining the gradient theory with the cubic-plus-association (CPA) equation of state. The reported accuracy of the model was 5.44% with temperature-independent parameters and 1.5% with temperature-dependent parameters. Some of the properties of biodiesel that have been investigated experimentally are given below:

Aitbelale et al.39 experimentally measured the density of soybean oil biodiesel in the temperature range of 298.15–393.15 K and at pressures up to 140 MPa. Chehtri40 measured the surface tension of three different types of biodiesel at a pressure of 7 MPa and a temperature of 473 K. Finally, Blangino et al.41 measured the surface tension, viscosity, and density of biodiesel over an extensive temperature range and used these data to validate their proposed models. The models presented above require accurate thermophysical properties, involve long calculations, and are not accurate enough in predicting the desired parameter. Experimental studies conducted in the laboratory, in turn, require a lot of time and money. Given the great importance of biodiesel, an accurate method for predicting its properties is needed.

Other investigations have explored biodiesel production utilizing supercritical methanol (SCM), employing the LSSVM model and ANFIS model42,43,44,45,46.

The novelty of this research is the use of two white-box models based on artificial intelligence, the group method of data handling (GMDH) and gene expression programming (GEP). Using these models, two simple and highly accurate mathematical equations are presented to predict the surface tension of biodiesel; these equations can be used over a range of temperatures and molecular weights. The data used in this research are 78 laboratory surface tension measurements collected from the literature. The input parameters are temperature and the mass fractions of fatty acid ethyl esters. The effect of the input parameters on the surface tension is evaluated using sensitivity analysis. Finally, suspicious laboratory data and outliers are identified by the leverage technique.

Theory and methods

Data gathering

To approximate biodiesel surface tension, 78 laboratory data points were collected from the literature41. The statistical parameters of the input data are given in Table 1. The inputs are temperature and the mass fractions of fatty acid ethyl esters. To reduce the dimension of the input data, the esters are divided into three groups according to their molecular weight: less than 200 (Mw1), between 200 and 300 (Mw2), and greater than 300 (Mw3). In the presented models and correlations, the input parameters are denoted by T, Mw1, Mw2, and Mw3. The data were split into training and testing subsets in an 80/20 ratio.
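The grouping and splitting described above can be illustrated with a minimal Python sketch. The function names, the treatment of the boundary values 200 and 300 (assigned here to Mw2), the random seed, and the example ester composition are assumptions made for illustration and are not taken from the original study.

```python
import random

def group_mass_fractions(esters):
    """Sum ester mass fractions into the three molecular-weight groups.

    esters: list of (molecular_weight, mass_fraction) pairs for one sample.
    Boundary values 200 and 300 are placed in Mw2 (an assumption).
    """
    mw1 = sum(w for mw, w in esters if mw < 200)
    mw2 = sum(w for mw, w in esters if 200 <= mw <= 300)
    mw3 = sum(w for mw, w in esters if mw > 300)
    return mw1, mw2, mw3

def split_indices(n_points, test_fraction=0.2, seed=0):
    """Randomly split point indices into training and testing subsets."""
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)
    n_test = int(n_points * test_fraction)  # rounded down
    return indices[n_test:], indices[:n_test]

# Hypothetical sample: three esters with illustrative molecular weights.
sample = [(186.3, 0.10), (284.5, 0.60), (312.5, 0.30)]
print(group_mass_fractions(sample))

train_idx, test_idx = split_indices(78)
print(len(train_idx), len(test_idx))
```

With 78 points and a 20% test fraction rounded down, this sketch yields 63 training and 15 testing points.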

Table 1 Statistical parameters of the data utilized in the research to approximate biodiesel surface tension.

Gene expression programming (GEP)

GEP is a well-known evolutionary algorithm (EA) that evolves computer programs to address user-defined problems47. GEP has been shown to be efficient in the search for accurate and concise programs. Research on GEP spans several distinct areas, which for simplicity are organized into eight groups in this survey: encoding design, design of the evolutionary mechanism, adaptation design, cooperative coevolution design, constant creation design, parallel design, theoretical research, and, last but not least, the design of GEP applications. The encoding design has a significant impact on GEP performance, as it determines the search space of genotypes and phenotypes. In the traditional evolutionary mechanism, GEP adopts multiple operators from the genetic algorithm (GA), such as random mutation and one-point crossover, to make chromosomes evolve48. Adaptation design refers to adaptive control mechanisms for GEP parameters; GEP incorporates a number of control parameters, such as population size, chromosome length, and mutation rate. EAs are frequently enhanced with cooperative coevolutionary (CC) design when dealing with complex optimization problems. Constant creation is an optional GEP operator that searches for numerical constants to build precise GEP solutions. Parallel design further reduces GEP processing time. Among theoretical studies of GEP, the estimation of convergence speed and the proof of convergence have received the most attention49.

In the GEP strategy, an evolutionary algorithm is used to determine the most effective mathematical format47,50. The GEP approach was therefore used in this investigation to relate the inputs to the output, the surface tension of biodiesel. The evolutionary algorithm searches for the optimum solution of an optimization problem in a way comparable to natural evolution. GEP is regarded as an improved form of genetic programming (GP), which was created by Koza50,51. It addresses problems with GP, such as the use of just a few regression techniques47,50. Like other evolutionary algorithms, GEP searches for the optimum expression by encoding candidate solutions as chromosomes. In particular, GEP introduces the expression tree (ET) as a crucial element: the chromosomes are translated into ETs, which are the actual candidate solutions. Each GEP gene consists of a head, which may contain functions and terminals, and a tail, which contains terminals only. Each gene has a fixed number of symbols drawn from a function set, such as +, /, and log, and a terminal set, such as x, y, and z50.
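The translation of a gene into an expression tree can be sketched as follows. The sketch assumes the breadth-first (level-order) reading of the gene that is commonly used in GEP; the one-letter code "L" for log, the example gene, and the function name are hypothetical choices for illustration only.

```python
import math

# Arity of each function symbol; "L" is a hypothetical one-letter code for log.
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2, "L": 1}

def evaluate_gene(gene, variables):
    """Decode a gene breadth-first into an expression tree and evaluate it."""
    # 1) Find the coding region: read symbols until every argument slot is filled.
    open_slots, length = 1, 0
    while open_slots > 0:
        open_slots += ARITY.get(gene[length], 0) - 1
        length += 1
    symbols = gene[:length]

    # 2) The children of each node are the next unused symbols, level by level.
    children, next_free = [], 1
    for sym in symbols:
        k = ARITY.get(sym, 0)
        children.append(range(next_free, next_free + k))
        next_free += k

    # 3) Evaluate the tree recursively, starting from the root (index 0).
    def value(i):
        sym = symbols[i]
        args = [value(c) for c in children[i]]
        if sym == "+":
            return args[0] + args[1]
        if sym == "-":
            return args[0] - args[1]
        if sym == "*":
            return args[0] * args[1]
        if sym == "/":
            return args[0] / args[1]
        if sym == "L":
            return math.log(args[0])
        return variables[sym]  # terminal symbol

    return value(0)

# "+*xyzxx" decodes to (y * z) + x; the trailing "xx" is a non-coding region.
print(evaluate_gene("+*xyzxx", {"x": 1.0, "y": 2.0, "z": 3.0}))  # 7.0
```

Because the tail holds only terminals, any gene with a sufficiently long tail decodes to a valid tree, which is why the genetic operators can modify genes freely.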

Algorithm framework of GEP has many steps, and each step is explained separately in the next paragraph. The flowchart of the algorithm framework of GEP is shown in Fig. 1.

Figure 1 The flowchart of the algorithm framework of GEP.

The initialization step creates the initial population, a set of randomly generated chromosomes. Each chromosome is a fixed-length string, and each position is randomly assigned an element of the type allowed at that position. In fitness evaluation, the fitness of every chromosome in the population is assessed; the performance of the algorithm depends strongly on the problem-specific fitness function. The selection and replication step picks the superior chromosomes of the population to create the new population for the following generation. Different selection techniques can be employed, such as tournament selection and roulette wheel selection, which perform well on difficult problems49,52. In the mutation step, every element of each chromosome is randomly altered with a preset mutation rate (pm)53. The transposition step replaces a segment of consecutive elements of a chromosome with a copy of another segment of consecutive elements of the same chromosome. It consists of three sub-steps, IS-transposition, RIS-transposition, and gene transposition, carried out with probabilities pis, pris, and pg, respectively. An insertion sequence (IS) is a segment of consecutive elements in the chromosome. In IS-transposition, an IS is chosen at random54, and a copy of it is inserted at a random position in the head of a gene. A root insertion sequence (RIS) is a segment of consecutive elements that begins with a function55, so RISs are selected from the heads of genes. In RIS-transposition, the chromosome, the gene to be modified, the start location of the RIS, and the length of the RIS are all chosen at random56, and a copy of the chosen RIS is inserted at the root of the chosen gene. In gene transposition, the chromosome to be modified is picked at random; a randomly chosen gene other than the first is then moved to the start of that chromosome57. The recombination step creates two offspring by exchanging genetic information between two parent chromosomes. It consists of three sub-steps, one-point recombination, two-point recombination, and gene recombination, carried out with probabilities pc1, pc2, and pcg, respectively. In gene recombination, a gene is randomly selected from one parent58 and swapped with its counterpart from the other parent, producing two children. Recombination produces a new population similar in size to the parent population. If the termination conditions (such as producing a good result or reaching the maximum generation) are met, the evolutionary process stops; otherwise, the algorithm returns to the fitness evaluation step49.
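Three of the operators described above (mutation, IS-transposition, and one-point recombination) can be sketched for a single-gene chromosome as follows. The function and terminal sets, the head length, the maximum IS length, and all function names are illustrative assumptions; the tail length follows the usual GEP rule t = h(n_max − 1) + 1, where n_max is the largest function arity.

```python
import random

FUNCTIONS = "+-*/"          # illustrative function set (all arity 2)
TERMINALS = "xyz"           # illustrative terminal set
HEAD = 4                    # head length h (assumed)
TAIL = HEAD * (2 - 1) + 1   # tail length t = h * (n_max - 1) + 1

def random_gene(rng):
    """Head may hold functions or terminals; tail holds terminals only."""
    head = [rng.choice(FUNCTIONS + TERMINALS) for _ in range(HEAD)]
    tail = [rng.choice(TERMINALS) for _ in range(TAIL)]
    return head + tail

def mutate(gene, pm, rng):
    """Alter each position with probability pm, respecting head/tail rules."""
    out = []
    for pos, sym in enumerate(gene):
        if rng.random() < pm:
            pool = FUNCTIONS + TERMINALS if pos < HEAD else TERMINALS
            sym = rng.choice(pool)
        out.append(sym)
    return out

def is_transpose(gene, rng, max_len=3):
    """Copy a random segment (IS) into the head at a non-root position.

    Head symbols after the insertion point shift right and the head is
    truncated to its original length; the tail is left unchanged.
    """
    length = rng.randint(1, max_len)
    start = rng.randrange(0, len(gene) - length + 1)
    seq = gene[start:start + length]
    target = rng.randrange(1, HEAD)  # position 0 (the root) is excluded
    new_head = (gene[:target] + seq + gene[target:HEAD])[:HEAD]
    return new_head + gene[HEAD:]

def one_point_recombine(a, b, rng):
    """Swap everything after a random cut point between two parents."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

rng = random.Random(42)
parent_a, parent_b = random_gene(rng), random_gene(rng)
child_a, child_b = one_point_recombine(
    is_transpose(mutate(parent_a, 0.1, rng), rng), parent_b, rng)
print("".join(child_a), "".join(child_b))
```

All three operators keep the gene length fixed and never place a function in the tail, so every offspring remains a decodable gene.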

GEP mechanism is briefly described as follows:

In the GEP algorithm, predictive models are generated through the use of genetic ideas. First, an initial population of predictive models is randomly generated as a set of genetic members. Then, these models are evaluated based on their performance in predicting the training data. Models that perform better are more likely to survive and reproduce in the next generation, while models with poorer performance are less likely to survive59. This iterative process continues to arrive at new generations of models that perform better in predicting new data. Also, in each generation, genetic operators such as mutation and combination are used to increase population diversity and generate new models with different combinations of features. This process ensures improved performance and accuracy of models in predicting new data49.

Advantages and disadvantages have been reported for the GEP model, which are described as follows:

Advantages of GEP model:

  • The ability to generate prediction models with complex structures and the ability to explore the space of different models.

  • The possibility of using genetic operators to improve and adapt models to input data.

  • Ability to quickly adapt and change models in response to changes in data or issues under investigation.

Disadvantages of the GEP model:

  • Complexity in interpreting the resulting models, especially when using more complex structures.

  • The need to adjust genetic parameters appropriate to the problem in order to improve the performance of the model.

  • The possibility of encountering problems related to a lack of training data or an incorrect selection of genetic parameters, which can lead to inferior models60.

Group method of data handling (GMDH)

Basically, Volterra-Kolmogorov-Gabor (VKG) polynomials (Eq. (1)) are used to model complex systems61.

$$y = a_{0} + \sum\limits_{i = 1}^{n} {a_{i} x_{i} } + \sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {a_{ij} x_{i} x_{j} } } + \sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {\sum\limits_{k = 1}^{n} {a_{ijk} x_{i} x_{j} x_{k} } } } + \ldots$$
(1)

where x = (x1, x2, …, xn) is the input vector, y is the output of the model, and a0, ai, aij, … are the polynomial coefficients. VKG polynomials are approximated by means of quadratic polynomials built from pairwise combinations of the network inputs. The GMDH algorithm was introduced as a learning technique to model complex systems61,62.

The GMDH neural network is a multi-layer feed-forward network whose neurons are formed by connecting different pairs of inputs through a second-degree polynomial. Every layer in the network contains one or more processing units, each of which has two inputs and one output. These units are the building blocks of the model and take the form of the second-degree polynomial in Eq. (2)63.

$$\widehat{{y_{n} }} = a_{0} + a_{1} x_{1} + a_{2} x_{2} + a_{3} x_{1} x_{2} + a_{4} x_{1}^{2} + a_{5} x_{2}^{2}$$
(2)

The unknown parameters of the GMDH algorithm are the polynomial coefficients of Eq. (2). To estimate the output value \(\widehat{{y_{i} }}\) for each input vector x = (xi1, xi2, …, xin) based on Eq. (2), the sum of squared errors in Eq. (3) must be minimized60.

$$e = \sum\limits_{i = 1}^{n} {\left( {\widehat{{y_{i} }} - y_{i} } \right)^{2} }$$
(3)

To find the minimum error, the partial derivatives of Eq. (3) with respect to the coefficients are set to zero. Substituting Eq. (2) into these derivatives gives the matrix equation Aa = y, where a = (a0, a1, a2, a3, a4, a5)T, y = (y1, …, yn)T, and the matrix A is given by Eq. (4), in which the subscripts p and q denote the two inputs of the neuron50.

$$A = \left[ {\begin{array}{*{20}c} 1 & {x_{1p} } & {x_{1q} } & {x_{1p} x_{1q} } & {x_{1p}^{2} } & {x_{1q}^{2} } \\ 1 & {x_{2p} } & {x_{2q} } & {x_{2p} x_{2q} } & {x_{2p}^{2} } & {x_{2q}^{2} } \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & {x_{np} } & {x_{nq} } & {x_{np} x_{nq} } & {x_{np}^{2} } & {x_{nq}^{2} } \\ \end{array} } \right]$$
(4)

One way to solve the matrix equation Aa = y is the singular value decomposition (SVD) method. The least-squares estimate of the unknown coefficient vector a is given by Eq. (5).

$$a = \left( {A^{T} A} \right)^{ - 1} A^{T} y$$
(5)

In Eq. (5), AT is the transpose of matrix A. With the SVD method, the unknowns can be computed in any case. If the matrix ATA is not invertible, the Tikhonov method is used to solve the equation64. In the design of a GMDH neural network, the goal is to prevent the network from diverging and to relate the shape and structure of the network to one or more numerical parameters, so that the network structure changes as these parameters change. To generalize GMDH neural networks, the restriction that only the adjacent layer is used to build the next layer is removed. This form of neural network is called the generalized structure (GS), and it uses all the previous layers (including the input layer) to build a new layer65.
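As an illustration of Eqs. (2) to (5), the following standard-library Python sketch fits the six coefficients of one quadratic neuron by solving the normal equations, with an optional Tikhonov term lam added to the diagonal of ATA. The function names, the parameter lam, and the synthetic data are assumptions for illustration; in practice an SVD-based routine such as numpy.linalg.lstsq would usually be preferred to the explicit normal equations.

```python
def solve(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - s) / aug[r][r]
    return z

def design_row(xp, xq):
    """One row of A, in the term order of Eq. (2)."""
    return [1.0, xp, xq, xp * xq, xp * xp, xq * xq]

def fit_neuron(xp, xq, y, lam=0.0):
    """Coefficients a = (A^T A + lam I)^-1 A^T y; lam = 0 gives Eq. (5)."""
    A = [design_row(p, q) for p, q in zip(xp, xq)]
    k = 6
    AtA = [[sum(r[i] * r[j] for r in A) + (lam if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    Aty = [sum(r[i] * t for r, t in zip(A, y)) for i in range(k)]
    return solve(AtA, Aty)

# Synthetic check: recover known coefficients from noise-free data.
points = [(p, q) for p in (0.0, 1.0, 2.0, 3.0) for q in (0.0, 1.0, 2.0, 3.0)]
true_a = [1.0, 2.0, -3.0, 0.5, 0.25, -0.1]
xp = [p for p, _ in points]
xq = [q for _, q in points]
y = [sum(c * f for c, f in zip(true_a, design_row(p, q))) for p, q in points]
print([round(c, 6) for c in fit_neuron(xp, xq, y)])
```

A small positive lam keeps the system solvable when ATA is singular or nearly singular, at the cost of a slight bias in the coefficients.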

The structure of the GMDH model is shown in Fig. 2.

Figure 2 GMDH framework to approximate biodiesel surface tension.

The GMDH mechanism is briefly described as follows:

In the GMDH algorithm, simple mathematical models are automatically created by the algorithm when the process starts. These models include linear combinations of input variables. Then, by evaluating the performance of each of these models on the training data, models that show better performance than other models are selected50. The selected models are then combined with each other to create more complex models with better predictive ability. This process continues iteratively and models with better performance are added to the new models. Finally, the model with the best performance on the test data is selected to predict the new data more accurately. This process continues to improve the performance and prediction accuracy of the models to provide an optimal final model65.
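The selection and combination loop described above can be sketched for a single layer as follows. This is a minimal illustration under stated assumptions, not the configuration used in this research: one quadratic neuron is fitted for every pair of inputs, the neurons are ranked by their error on a held-out subset, and the outputs of the best ones become the inputs of the next layer. The names gmdh_layer, keep, and lam, the small ridge term, and the synthetic data are all hypothetical.

```python
from itertools import combinations

def solve(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (aug[r][n] - sum(aug[r][c] * z[c] for c in range(r + 1, n))) / aug[r][r]
    return z

def features(p, q):
    return [1.0, p, q, p * q, p * p, q * q]

def fit_pair(X, y, i, j, lam=1e-8):
    """Least-squares fit of one quadratic neuron on inputs i and j."""
    A = [features(r[i], r[j]) for r in X]
    AtA = [[sum(a[u] * a[v] for a in A) + (lam if u == v else 0.0)
            for v in range(6)] for u in range(6)]
    Aty = [sum(a[u] * t for a, t in zip(A, y)) for u in range(6)]
    return solve(AtA, Aty)

def predict_pair(coef, X, i, j):
    return [sum(c * f for c, f in zip(coef, features(r[i], r[j]))) for r in X]

def mse(pred, y):
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)

def gmdh_layer(X_tr, y_tr, X_va, y_va, keep=2):
    """Fit a neuron per input pair, rank by held-out MSE, keep the best."""
    candidates = []
    for i, j in combinations(range(len(X_tr[0])), 2):
        coef = fit_pair(X_tr, y_tr, i, j)
        err = mse(predict_pair(coef, X_va, i, j), y_va)
        candidates.append((err, i, j, coef))
    candidates.sort(key=lambda c: c[0])
    best = candidates[:keep]
    # Outputs of the surviving neurons are the inputs of the next layer.
    next_tr = [list(row) for row in zip(*(predict_pair(c, X_tr, i, j) for _, i, j, c in best))]
    next_va = [list(row) for row in zip(*(predict_pair(c, X_va, i, j) for _, i, j, c in best))]
    return best, next_tr, next_va

# Synthetic data: the target depends only on inputs 0 and 2.
X = [[((i * 7) % 11) / 5 - 1, ((i * 3) % 13) / 6 - 1, ((i * 5) % 17) / 8 - 1]
     for i in range(60)]
y = [1.0 + r[0] * r[2] for r in X]
best, next_tr, next_va = gmdh_layer(X[:40], y[:40], X[40:], y[40:])
print([(i, j) for _, i, j, _ in best])
```

Repeating gmdh_layer on next_tr and next_va until the best held-out error stops improving gives the usual layer-by-layer growth of the network.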

The GMDH model has advantages and disadvantages, including the following:

Advantages of GMDH model:

  • The ability to create predictive models with variable complexity and the ability to adapt to different input data.

  • Ability to automate the process of selecting and combining models based on their performance.

  • Good performance in cases where there are more complex relationships between variables66.

Disadvantages of the GMDH model:

  • The need for larger training data volumes in order to create more accurate models.

  • High computational processing to combine and upgrade models, which may be time-consuming and complex.

  • The complexity of the resulting models may be difficult to interpret for non-expert users67.

Results and discussion

In this research, two models were developed using GMDH and GEP to approximate biodiesel surface tension with high accuracy. The proposed correlations are presented in Table 2. The details of each model, including the execution time and the hyper-parameters set to achieve the desired accuracy, are listed in Table 3. As mentioned previously, the dataset contains 78 laboratory data points, and the model inputs are temperature and the mass fractions of fatty acid ethyl esters, with the esters divided into three groups according to their molecular weight: less than 200 (Mw1), between 200 and 300 (Mw2), and greater than 300 (Mw3). Grouping the mass fractions reduces the dimension of the input parameters: inputs with similar characteristics are placed in one category, and molecular weight was used here as the measure of similarity. Of the 78 laboratory data points, 63 were assigned to the training subset, and 15 were randomly selected as test data to check the precision of the presented models.

Table 2 The presented correlations in the research to approximate biodiesel surface tension using GEP, and GMDH networks.
Table 3 Hyper-parameters of the established models in the research to approximate biodiesel surface tension.

Determinant error parameters

The precision of the presented models was assessed utilizing the statistical parameters introduced below68:

Average percent relative error:

$$APRE = \frac{100}{N}\mathop \sum \limits_{i = 1}^{N} \left( {\frac{{ST^{act} - ST^{cal} }}{{ST^{act} }}} \right)$$
(6)

Root mean square error

$$RMSE = \left( {\frac{{\mathop \sum \nolimits_{i = 1}^{N} \left( {ST^{act} - ST^{cal} } \right)^{2} }}{N}} \right)^{\frac{1}{2}}$$
(7)

Average absolute percent relative error

$$AAPRE = \frac{100}{N}\mathop \sum \limits_{i = 1}^{N} \left| {\frac{{ST^{act} - ST^{cal} }}{{ST^{act} }}} \right|$$
(8)

Standard deviation

$$SD = \left( {\frac{1}{N - 1}\mathop \sum \limits_{i = 1}^{N} \left( {\frac{{ST^{act} - ST^{cal} }}{{ST^{act} }}} \right)^{2} } \right)^{\frac{1}{2}}$$
(9)

R-squared

$$R - squared\left( {R^{2} } \right) = 1 - \frac{{\sum\limits_{i = 1}^{N} {\left( {ST^{act} - ST^{cal} } \right)^{2} } }}{{\sum\limits_{i = 1}^{N} {\left( {ST^{act} - \overline{{ST^{act} }} } \right)^{2} } }}$$
(10)

In the correlations that were presented above, ST, \(\overline{ST}\) and N represent the surface tension, average surface tension and the number of data, respectively, and the predicted and experimental surface tension are shown with superscript cal and act.
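Eqs. (6) to (10) translate directly into code. The sketch below is a plain Python transcription of these definitions; the function name and the three-point example are hypothetical and serve only to show the calculation.

```python
import math

def error_metrics(actual, predicted):
    """APRE, AAPRE, RMSE, SD and R^2 as defined in Eqs. (6) to (10)."""
    n = len(actual)
    rel = [(a - p) / a for a, p in zip(actual, predicted)]
    sq_err = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    mean_actual = sum(actual) / n
    return {
        "APRE": 100.0 / n * sum(rel),
        "AAPRE": 100.0 / n * sum(abs(r) for r in rel),
        "RMSE": math.sqrt(sq_err / n),
        "SD": math.sqrt(sum(r * r for r in rel) / (n - 1)),
        "R2": 1.0 - sq_err / sum((a - mean_actual) ** 2 for a in actual),
    }

# Hypothetical values, not data from this research.
metrics = error_metrics([10.0, 20.0, 40.0], [11.0, 18.0, 40.0])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this example the positive and negative relative errors cancel, so APRE is zero while AAPRE is not, which is why both are reported together.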

Determinant error diagrams

Another way of evaluating the presented models is through error diagrams. The error diagrams in this research are the relative error distribution diagram, the cross-plot, and bar charts. The relative error distribution diagram shows the deviation of the data from the zero-error line, and the cross-plot shows the deviation of the data from the X = Y line. Both diagrams check how well the values predicted by the model agree with the experimental data.

Precisions and validities of the models

To check the accuracy of the models developed in this research, their statistical parameters are presented in Table 4. The table gives the training, test, and total error values for both proposed models. As shown in the table, the GMDH model has the lowest AAPRE, 0.97%, which indicates the high accuracy of this model. The other error parameters for this model are as follows:

$${\text{APRE}} = - 0.07,\quad {\text{RMSE}} = 0.44093,\quad {\text{SD}} = 0.000233,\quad {\text{R}}^{2} = 0.9233$$
Table 4 Statistical error parameters to measure the precision of the presented models in this research to approximate biodiesel surface tension.

Of the two models, GEP has the higher error, with an AAPRE of 1.89%. Given the AAPRE of 0.97% for the GMDH model, it can be concluded that this model has a high ability to forecast biodiesel surface tension.

The SD is small for both models, which shows their robustness and accuracy. The APRE values for GMDH and GEP are −0.07 and 0.13, respectively; since both are close to zero, neither model systematically overestimates or underestimates the surface tension.

Next, the accuracy of the established models is checked graphically. Figure 3 displays the cross-plots of the developed models for the training and testing data sets. Both models show high accuracy, with R2 values close to 1. In the cross-plots, the laboratory data match and overlap the predicted data well, and the data are densely concentrated around the line with a slope of 1, which indicates the high accuracy of the presented models. Figure 4 shows the relative error distribution diagram, another check of model accuracy. In this diagram, the dispersion of the data around the zero-error line is low and the density of the data around this line is high. The highest density around this line belongs to the GMDH model, which has high accuracy. It is also worth mentioning that neither model shows systematic overestimation or underestimation: when the data accumulate on one side of the zero-error line, the model consistently predicts values above or below the experimental data, and no such accumulation is observed here.

Figure 3 The cross-plot diagrams of the presented models in this research for estimating the surface tension of biodiesel.

Figure 4 The relative error distribution diagram of the presented models in this research.

Also, in order to compare the models presented in this research with the existing models in the literature to measure biodiesel surface tension, Table 5 was presented69,70. The statistical parameter used to compare these models was considered R2. It is clear that both presented models in this research are more precise than the models in the literature and their R2 value is close to 1.

Table 5 Comparing the precision of the developed models in the research with the presented models in literature to approximate biodiesel surface tension.

The agreement between the laboratory data and the data predicted by the model is of great importance. To check this in detail, Fig. 5a,b is presented. In these diagrams, the horizontal axis is the index of the data points and the vertical axis is the experimental and GMDH-predicted surface tension. Fig. 5a covers the training data and Fig. 5b covers the test data. The data predicted by the GMDH model follow the same trend as the laboratory data.

Figure 5 Comparison between the laboratory data of surface tension of biodiesel with the data predicted by the GMDH model for (a) training and (b) testing sets.

To show how the absolute error is distributed over the data, a cumulative frequency diagram was used. Figure 6 shows the cumulative chart for the models developed in this research to compare their efficiency and accuracy. The X-axis shows the absolute error of each model and the Y-axis shows the cumulative frequency. The steeper the curve and the closer it lies to the Y-axis, the lower the error of the model. According to this graph, the GMDH model has an error of at most about 4% for 95% of the data, while the GEP model has an error of at most 4% for 80% of the data.

Figure 6 Cumulative chart to compare the precision of the models offered in this research.
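The cumulative frequency curve can be computed as in the following sketch, which assumes that the error on the X-axis is the absolute percent relative error of each data point. The function name, the thresholds, and the four example points are hypothetical.

```python
def cumulative_frequency(actual, predicted, thresholds):
    """Percentage of points whose absolute relative error (%) is within each threshold."""
    errors = [abs((a - p) / a) * 100.0 for a, p in zip(actual, predicted)]
    return [100.0 * sum(e <= t for e in errors) / len(errors) for t in thresholds]

# Hypothetical points with absolute relative errors of about 1, 2, 5 and 10%.
actual = [100.0, 100.0, 100.0, 100.0]
predicted = [99.0, 98.0, 95.0, 90.0]
print(cumulative_frequency(actual, predicted, [1.5, 4.0, 6.0, 12.0]))
# [25.0, 50.0, 75.0, 100.0]
```

Reading the curve at a given error threshold gives the share of the data predicted within that error, which is how statements such as "an error of at most 4% for 95% of the data" are obtained.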

Trend analysis

In general, the surface tension of liquids decreases with increasing temperature and reaches zero at the critical temperature. The reason is that as the temperature rises, the kinetic energy of the molecules increases, which weakens the attraction between the molecules71. As shown in Fig. 7, the surface tension decreases as the temperature increases, and the data predicted by the model follow the same trend as the laboratory data with high overlap and accuracy.

Figure 7 Investigating changes in biodiesel surface tension at different temperatures using laboratory data and predicted data by the GMDH method.

Sensitivity analysis

Sensitivity analysis is used to examine how the output of the most accurate model in this research depends on the input parameters. The method is based on the relevancy factor64. The relevancy factor quantifies the effect of each input on the output and takes values between −1 and 1: a positive value indicates that the output increases with the input, while a negative value indicates that the output decreases as the input increases67. The relevancy factor is calculated from Eq. (11)59.

$$r\left( {Inp_{k} ,\,ST} \right) = \frac{{\sum\limits_{i = 1}^{n} {\left( {Inp_{k,i} - \overline{{Inp_{k} }} } \right)\left( {ST_{i} - \overline{ST} } \right)} }}{{\sqrt {\sum\limits_{i = 1}^{n} {\left( {Inp_{k,i} - \overline{{Inp_{k} }} } \right)^{2} } \sum\limits_{i = 1}^{n} {\left( {ST_{i} - \overline{ST} } \right)^{2} } } }}$$
(11)

Here, \(Inp_{k,i}\) is the ith value of the kth input and \(\overline{Inp_{k}}\) is the average value of the kth input. \(ST_{i}\) represents the ith predicted value of surface tension and \(\overline{ST}\) represents the average value of surface tension. The index k can refer to any of the input parameters, that is, temperature or the mass fractions. The results of this method are given in Fig. 8. According to the diagram, temperature has the relevancy factor of largest magnitude, so surface tension is affected more by temperature than by the other input parameters, and the negative sign indicates that surface tension decreases as temperature increases. Among the ester mass fractions, esters with a molecular weight of less than 200 have the greatest effect, and esters with a molecular weight of more than 300 have the least effect on surface tension.

Figure 8
Investigating the impact of input parameters on the surface tension of biodiesel obtained using the GMDH method.
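The relevancy factor of Eq. 11 can be evaluated with a few lines of code. The sketch below is only an illustration under stated assumptions: the function name `relevancy_factor` and the sample temperatures and surface tensions are hypothetical and do not come from the dataset used in this research.

```python
import numpy as np


def relevancy_factor(inp_k, st):
    """Relevancy factor between one input parameter and the predicted surface tension.

    Follows Eq. 11: the sum of products of deviations from the mean, divided by
    the square root of the product of the summed squared deviations.
    """
    inp_k = np.asarray(inp_k, dtype=float)
    st = np.asarray(st, dtype=float)
    d_inp = inp_k - inp_k.mean()
    d_st = st - st.mean()
    return float(np.sum(d_inp * d_st) / np.sqrt(np.sum(d_inp**2) * np.sum(d_st**2)))


# Hypothetical values: surface tension falling linearly with temperature.
temperature = [293.15, 313.15, 333.15, 353.15]
surface_tension = [31.0, 29.4, 27.8, 26.2]
print(relevancy_factor(temperature, surface_tension))
```

For perfectly linear, decreasing data such as this hypothetical example, the factor approaches −1, which corresponds to the inverse effect of temperature described above. Applying the same function to each mass fraction in turn gives one bar of a chart like Fig. 8 per input.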

Detection of outliers and suspected data

A Williams plot was used to find outliers and suspicious experimental data. In this plot, the horizontal axis shows the hat values and the vertical axis shows the standardized residuals. The hat values and standardized residuals are calculated as follows60,67:

$$H = input \times inv\left( {Transpose\left( {input} \right) \times input} \right) \times Transpose\left( {input} \right)$$
(12)
$$hat(h) = diag(H)$$
(13)
$$Standardized \, Residuals(SR) = \frac{{\left( {Outputs - Targets} \right)}}{{\left( {1 - h} \right) \times RMSE}}$$
(14)

In Fig. 9, the vertical line drawn in the middle of the graph represents the limiting value Hat*, above which a data point is treated as an outlier. The figure shows that only three data points have hat values greater than Hat* and therefore lie outside the applicable range of the model. This indicates the uniformity and validity of the dataset used, as well as the reliability of the models built on it. Suspicious laboratory data are points whose standardized residuals fall outside the range of −3 to 3. According to the graph, only three data points from the dataset have been identified as suspicious laboratory data. The large majority of the data lie within the validity and reliability region of the model, with hat values below Hat* and standardized residuals between −3 and 3.

Figure 9
Determining outlier data points and suspicious laboratory data by the Leverage technique.
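Equations 12 to 14 can be sketched in code as follows. This is an illustration, not the implementation used in this research. The function name `williams_statistics` is hypothetical, and the limiting value Hat* is computed with the commonly used expression 3(p + 1)/n, where p is the number of inputs and n the number of data points; the text above does not state which expression was used, so this is an assumption.

```python
import numpy as np


def williams_statistics(inputs, outputs, targets):
    """Hat values, standardized residuals, and an assumed Hat* for a Williams plot."""
    X = np.asarray(inputs, dtype=float)        # shape (n, p): one row per data point
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)

    # Eq. 12: hat matrix. pinv equals inv when X^T X is invertible
    # and is more robust when it is close to singular.
    H = X @ np.linalg.pinv(X.T @ X) @ X.T
    h = np.diag(H)                             # Eq. 13: hat (leverage) values

    residuals = outputs - targets
    rmse = np.sqrt(np.mean(residuals**2))
    sr = residuals / ((1.0 - h) * rmse)        # Eq. 14 as written in the text

    n, p = X.shape
    hat_star = 3.0 * (p + 1) / n               # ASSUMPTION: common leverage limit
    return h, sr, hat_star


def flag_points(h, sr, hat_star, sr_limit=3.0):
    """Boolean masks: outliers (h > Hat*) and suspicious data (|SR| > 3)."""
    return h > hat_star, np.abs(sr) > sr_limit
```

A Williams plot is then a scatter of `sr` against `h` with a vertical line at `hat_star` and horizontal lines at 3 and −3; points to the right of the vertical line are outliers, and points outside the horizontal band are suspicious laboratory data.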

Conclusions

Biodiesel is one of the sources of clean fuel for energy production, so measuring its properties is of considerable importance. In this research, the surface tension of biodiesel was estimated by the GMDH and GEP methods. The input parameters are temperature (T) and the mass fractions of fatty acid ethyl esters, with the esters divided into three groups according to their molecular weight: less than 200 (Mw1), between 200 and 300 (Mw2), and greater than 300 (Mw3). The advantage of these models over those presented in the literature is their higher accuracy and ease of use. The models presented in this research are white boxes and are available for use, while the models presented in the literature are all black boxes that require special software and codes. The accuracy calculations showed that the GMDH model, with AAPRE = 0.97% and R2 = 0.9233, is more accurate than the GEP model. The accuracy of the models was also checked with error diagrams, including the cross-plot and the relative error distribution diagram, and satisfactory results were observed. The behavior of biodiesel surface tension was then investigated at different temperatures, and it was concluded that the surface tension of biodiesel decreases with increasing temperature, which was well predicted by the model. In addition, the effect of the input parameters on the surface tension obtained from the GMDH method was investigated, and temperature was found to have the greatest effect on the surface tension of biodiesel. Finally, only five data points were identified as outliers and suspicious laboratory data using the Leverage technique.