Introduction

Granular segregation is a phenomenon in which flowing particles with similar properties (such as size, density, or shape) accumulate in specific regions. Segregation is regarded as unfavourable in most applications because it can reduce the homogeneity of granular mixtures; consequently, it generally needs to be controlled or minimised1. To achieve this goal, a thorough understanding of segregation and the factors influencing it is required.

Numerous experimental studies have aimed to unravel the segregation phenomenon since the 1970s2,3,4,5. While these studies provided useful insights into segregation, the experimental approaches to studying segregation generally suffer from several limitations6. These include the difficulty in collecting the required samples for segregation measurements, limitations in obtaining particle-scale data, as well as being expensive and time-consuming.

Recent advancements in computational power have led to widespread usage of the discrete element method (DEM), initially introduced by Cundall and Strack7, as a useful alternative to experiments for studying granular materials. Especially for segregation, DEM has major advantages over experiments, as it allows for the modelling of granular mixtures with any combinations of size, density, and shape while providing comprehensive particle-level information that is difficult or impossible to obtain in physical experiments6. While DEM is widely used, achieving a balance between model accuracy and computational efficiency remains a challenge8. The accuracy of the DEM model heavily depends on the proper determination of its parameters through a process called calibration. However, the calibration process can be time-consuming, particularly for multi-component mixtures, where the number of DEM parameters significantly increases.

Trial and error is still extensively employed for calibrating DEM models9,10,11,12,13. However, it is not only inefficient but also depends on the user’s expertise and rarely results in an optimal parameter set14. To systematically calibrate the DEM model, several approaches have been proposed. Typically, these approaches use optimisation techniques to update the parameters and determine the calibrated parameter set. Examples include using advanced design of experiments (DoE) in combination with simple optimisation algorithms15, particle swarm optimisation16, and genetic algorithms17,18. However, these methods are still not computationally efficient due to the high number of simulations required14.

Richter et al.14 conducted a thorough literature review on various optimisation techniques, concluding that surrogate-based optimisation is the most suitable approach for DEM calibration. A surrogate model (SM) is an approximation of a more complex and computationally expensive model (such as DEM) aimed at mapping the relationship between the model’s input(s) and output(s)19. SMs can be built using advanced mathematics or machine learning (ML). Surrogate-based optimisation is effective at finding a global optimum, is computationally efficient, and can handle parameter limitations and multi-objective problems14. Additionally, ML-based surrogates can take advantage of the rapid progress of machine learning in other fields13,20. Several studies have used surrogate-based optimisation for DEM calibration. These include Gaussian process regression (GPR) and Kriging14,21,22,23,24, multi-objective reinforcement learning25, Bayesian filtering26,27, multi-variate regression analysis28, neural networks29,30,31, and random forest (RF)13.

Despite the advancement of surrogate-based DEM calibration, several challenges remain to be addressed. Firstly, the vast array of available algorithms can make it challenging to choose the most suitable approach, often leading to subjective decision-making. Secondly, while using the SM reduces the computational cost of the DEM calibration, training the SMs themselves, especially when employing sampling techniques such as Latin Hypercube Sampling (LHS), requires a substantial number of simulations. Thirdly, most of the studies consider only a limited number of DEM parameters to construct the SM, potentially overlooking significant DEM parameters. Lastly, most studies focus on single granular materials, and to the best of the authors’ knowledge, no study has yet explored surrogate modelling for multi-component granular mixtures. This study attempts to address these challenges by developing SMs that effectively link particle-particle and particle-wall DEM interaction parameters to segregation. We demonstrate this on the basis of a case study for reliable estimation of radial segregation of a multi-component mixture in a heap.

The objective of this paper is twofold:

  1. We evaluate several ML models to develop surrogate models for DEM simulations involving a two-component mixture (i.e., pellet-sinter) that flows from a hopper through a chute into a receiving bin. Our goal is to develop SMs that capture the relationship between all particle-particle and particle-wall DEM interaction parameters and radial segregation in the heap. To investigate the effect of the initial configuration (IC) of the mixture within the hopper on heap segregation, we vary the mixing degree, pellet-to-sinter mass ratio, and layering order within the hopper. For each individual IC, we use the definitive screening design (DSD), a cost-effective three-level DoE technique, to efficiently create our dataset. To construct effective SMs, we encode ICs, which consist of a combination of categorical and numerical variables, to prepare them as input features for the SMs.

  2. Following the identification of the most effective SMs, we implement a transfer learning (TL) approach to transform the surrogate into an adaptive SM tailored for new, unseen ICs, named the ‘transfer learning-based surrogate model (TL-SM)’. To this end, we systematically exclude one IC from the training-validation dataset, designated as the ‘unseen IC’, which serves as the target domain for TL. Subsequently, we train and cross-validate the SM coupled with Bayesian optimisation (BO) using the remaining dataset as the source domain for TL. Utilising the TL methodology, we deploy the pre-trained ML model as the surrogate for the unseen target IC. We then update and retrain the SM by integrating new data points from the unseen IC while monitoring performance enhancements. The effectiveness of the proposed data-driven SM is assessed through nested cross-validation (NCV), which involves iteratively excluding each IC. Additionally, the stability of the TL-SM is evaluated using distinct random seed numbers for weight and bias initialisation.

Achieving these two objectives will pave the way for efficiently building generalised SMs for various scenarios. These SMs, in turn, will facilitate and speed up the DEM calibration process, contributing to the development of more robust and reliable DEM models in a significantly shorter time.

Simulation method and established dataset

Discrete element method

We used the Hertz-Mindlin (no-slip)32 contact model with an elastic-plastic spring-dashpot rolling friction model (referred to as “type C” in ref. 33) in our DEM model. This contact model has been successfully employed in past studies on pellets and sinter34,35. Detailed equations and more information on the contact model can be found in the relevant literature32,33,34,36. We developed the DEM model using the commercial software EDEM version 2022.3, and all simulations were performed on the DelftBlue high-performance cluster37.

We simulated a mixture of sinter and iron ore pellets as an example of a multi-component mixture used in blast furnaces. The intrinsic material properties were kept fixed, whereas the interaction parameters were varied; they are listed in Tables 1 and 2, respectively.

Table 1 Intrinsic material properties used in DEM simulations.
Table 2 Investigated DEM parameters with their low, middle and high values (\(\mu_{s}=\) coefficient of sliding friction, \(\mu_{r}=\) coefficient of rolling friction, \(C_{r}=\) coefficient of restitution). The underlined values for pellet-pellet and sinter-sinter parameters were used for pellet-sinter interactions.

System and geometry

A system of geometries composed of a hopper, a chute and a receiving bin was used, as shown in Fig. 1a. The sequence of simulation steps is as follows: First, a mixture of pellets and sinter was generated in the hopper. Next, the outlet of the hopper was opened, allowing the materials to discharge from the hopper under the influence of gravity. Finally, the materials accumulated in the receiving bin after flowing through the chute (see Fig. 1b).

Fig. 1

(a) The geometry employed in the simulations and their dimensions, (b) the flow of the mixture of pellets and sinter from the hopper to the receiving bin through the chute.

Since the IC of the mixture within the hopper significantly influences the final segregation in the heap, we used various initialisations in the hopper, as illustrated in Fig. 2.

Fig. 2

Various initial configurations (ICs) of pellets and sinter in the simulations (copper and black particles represent pellets and sinter, respectively).

Quantifying segregation in heap

At the end of the simulations, a heap of the mixture of pellets and sinter was formed, as illustrated in Fig. 3a. Segregation in the heap can be measured in different directions, namely radial, vertical, and circumferential. This study specifically targets radial segregation. To measure the radial segregation, the heap was first divided into a number (\(\:m\)) of radial bins (see Fig. 3b). Next, the mass ratio of one component such as pellets within each bin (\(\:{C}_{{p}_{m}}\)) was determined. Then, the segregation was quantified using the relative standard deviation (RSD):

$$\:RSD=\frac{\sigma\:}{\mu\:}$$
(1)

where \(\sigma\) and \(\mu\) are the standard deviation and the mean of the \(C_{p_{m}}\) values over all radial bins, respectively.
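As a minimal sketch of Eq. (1), the RSD can be computed from the pellet mass and the total mass collected in each radial bin. The function name, the example masses, and the use of the population standard deviation are assumptions made for illustration; the text does not state which form of the standard deviation was used.

```python
import statistics

def radial_rsd(pellet_mass_per_bin, total_mass_per_bin):
    """Relative standard deviation (Eq. 1) of the pellet mass ratio over radial bins."""
    ratios = [p / t for p, t in zip(pellet_mass_per_bin, total_mass_per_bin)]
    mean = statistics.fmean(ratios)
    # Assumption: population standard deviation (divide by n, not n - 1).
    sigma = statistics.pstdev(ratios)
    return sigma / mean

# Hypothetical masses (kg) in four radial bins, for illustration only.
pellets = [6.0, 5.0, 4.0, 5.0]
totals = [10.0, 10.0, 10.0, 10.0]
rsd = radial_rsd(pellets, totals)
```

A perfectly mixed heap, with the same pellet mass ratio in every bin, gives an RSD of zero; larger values indicate stronger radial segregation.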

Fig. 3

(a) Side view of the heap formed within the receiving bin, (b) radial bins used to quantify radial segregation.

Established dataset and feature engineering

Because there is a high number of DEM parameters (15, as listed in Table 2) to vary, we aimed to use a sampling strategy that minimises the number of DEM simulations required. To achieve this, we employed the definitive screening design (DSD), a three-level design that was first presented by Nachtsheim and Jones48. The power of DSD is that, in addition to the main effects, it can identify two-factor interaction and quadratic terms. For an odd number of variables \(k\) (as in this study), \(2k+3\) runs are needed. Furthermore, to increase the power of the DSD, Jones and Nachtsheim49 suggested adding four extra runs. Therefore, only 37 simulations (\(2\times 15+3+4\)) were required to establish a DSD for 15 DEM interaction parameters. The DSD is presented in Table A.1 in the Appendix.
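The run count quoted above can be checked with a few lines of arithmetic. The helper below is a hypothetical illustration that encodes only the rule stated in the text (odd number of factors, plus four extra runs); it does not generate the design matrix itself.

```python
def dsd_runs(k, extra_runs=4):
    """Number of runs in a definitive screening design for an odd number of factors k."""
    if k % 2 == 0:
        # Only the odd-k rule (2k + 3) is stated in the text, so even k is not handled.
        raise ValueError("this sketch covers an odd number of factors only")
    return 2 * k + 3 + extra_runs

runs_per_ic = dsd_runs(15)   # 2 * 15 + 3 + 4
```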

As we conducted the DSD for five different ICs in the hopper (see Fig. 2), the generated dataset comprised a total of 185 simulation samples. To effectively distinguish between these ICs, it was essential to perform feature engineering to create additional features that describe these configurations. This feature engineering could potentially enhance the performance of ML models50. We specifically selected three features: segregation index, pellets mass ratio, and layering mode, which can distinguish between the five ICs used. Since the layering mode is a categorical variable, we employed the label encoding technique, which is known for its computational simplicity, to convert it into a numerical format51. Table 3 presents these three features along with their values for all ICs.
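The encoding step described above can be sketched as follows. The category names and numerical values are hypothetical placeholders (the actual values are those of Table 3), and the helper names are assumptions made for illustration.

```python
# Hypothetical IC descriptors; the real values are listed in Table 3.
ics = [
    {"segregation_index": 0.0, "pellet_mass_ratio": 0.5, "layering_mode": "mixed"},
    {"segregation_index": 1.0, "pellet_mass_ratio": 0.5, "layering_mode": "pellets_on_top"},
    {"segregation_index": 1.0, "pellet_mass_ratio": 0.5, "layering_mode": "sinter_on_top"},
]

def label_encode(values):
    """Map each distinct category to an integer, in order of first appearance."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return codes

codes = label_encode(ic["layering_mode"] for ic in ics)

def build_row(dem_parameters, ic):
    """Concatenate the 15 DEM parameters with the three IC features."""
    ic_features = [ic["segregation_index"], ic["pellet_mass_ratio"], codes[ic["layering_mode"]]]
    return list(dem_parameters) + ic_features

row = build_row([0.5] * 15, ics[1])   # placeholder DEM parameter values
n_samples = 5 * 37                    # five ICs, 37 DSD runs each
```

Label encoding assigns arbitrary integers to the categories, which imposes an ordering that has no physical meaning; that is the price of its simplicity.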

Table 3 Three features used to describe the initial configurations (see Fig. 2) along with their values.

Data-driven surrogate models

The overall proposed framework for designing data-driven SMs is illustrated in Fig. 4. The dataset underwent nested cross-validation (NCV), one of the most rigorous validation approaches, which involves two nested cross-validation (CV) loops. In this section, the ML models employed in this study to build surrogates for DEM are briefly described. These models include linear regression, support vector machine (SVM), regression tree, ensemble learning, Gaussian process regression (GPR), and artificial neural network (ANN). Subsequent sections elaborate on NCV and hyperparameter optimisation for the aforementioned ML models.

Fig. 4

The overall proposed framework to design data-driven SM leveraged by TL.

Linear regression

Linear regression is a widely used statistical tool for modelling the linear relationship between independent (inputs) and dependent (output) variables. In the case of only one independent variable, it is called a “simple linear regression model”, and when there is more than one independent variable, it is referred to as a “multiple linear regression model”52, which is the case in the current work. Considering the original dataset composed of n data points as \(\:D=\:\left\{\left({x}_{i},\:{y}_{i}\right),i=1,\:2,\:\dots\:,\:n\right\}\), the linear regression model is expressed as:

$$y_{i}=\beta_{0}+\sum_{j=1}^{p}\beta_{j}x_{ij}+\epsilon_{i},\quad i=1,2,\dots,n$$
(2)

where \(y_{i}\) is the observed response (dependent variable) for the \(i\)-th data point, \(x_{ij}\) is the value of the \(j\)-th of the \(p\) independent variables for that data point, \(\beta_{j}\) are the regression coefficients, \(\beta_{0}\) is the intercept (or bias in machine learning), and \(\epsilon_{i}\) is the random error term. The objective in linear regression is to minimise the sum of squared errors (SSE) between the predicted (\(\widehat{y}_{i}\)) and actual (\(y_{i}\)) values, which is calculated using the following equation:

$$\:SSE=\:\sum\:_{i=1}^{n}{({y}_{i}-{\widehat{y}}_{i})}^{2}$$
(3)
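A minimal sketch of fitting a multiple linear regression by minimising the SSE of Eq. (3) is given below. It uses NumPy's least-squares solver on synthetic, noise-free data generated purely for illustration; the function names are assumptions and this is not the implementation used in the study.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares: returns (intercept, coefficients) minimising the SSE."""
    A = np.column_stack([np.ones(len(X)), X])      # leading column of ones for the intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]

def sse(X, y, intercept, coef):
    """Sum of squared errors between observed and predicted responses (Eq. 3)."""
    residuals = y - (intercept + X @ coef)
    return float(residuals @ residuals)

# Synthetic data for illustration only: y = 1 + 2*x1 - 3*x2, without noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]
b0, b = fit_linear(X, y)
```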

There are several techniques to improve the performance and interpretability of linear regression models, including linear regression with interactions, robust linear regression, and stepwise regression. Linear regression with interactions takes the interactions between the independent variables into account, allowing for modelling of complex relationships among them. Robust linear regression helps mitigate the impacts of outliers, leading to more reliable estimates53. Stepwise regression aids in refining the model by iteratively adding or removing independent variables based on statistical criteria, ensuring that the most significant variables are included in the model. All these techniques contribute to the development of more accurate linear regression models across various scenarios54.

Support vector machine (SVM)

Support vector machines (SVMs) are efficient statistical learning models for classification and regression tasks55. SVMs are known for finding the optimal decision boundary, known as the maximum-margin, which can effectively separate various classes in the data. Because of this feature, SVMs are very effective at handling complicated datasets.

In training data, where \(\:{x}_{i}\) is the multivariate set of \(\:n\) observations, the goal in support vector regression is to determine the estimating function \(\:f\left(x\right)\), which takes the form56:

$$\:f\left(x\right)={w}^{T}G\left(x\right)+b$$
(4)

where \(w\) is the weight vector, \(b\) denotes the bias term and \(G\left(x\right)\) is a set of linear or non-linear kernel functions (e.g., quadratic, cubic, etc.). To determine \(w\) and \(b\), the following objective function is minimised57:

$$\:\frac{1}{2}{w}^{T}w+C\sum\:_{i=1}^{n}({\xi\:}_{i}+{\xi\:}_{i}^{*})$$
(5)

subject to:

$$\:\left\{\begin{array}{c}{w}^{T}G\left({x}_{i}\right)+b-{y}_{i}\le\:\epsilon\:+{\xi\:}_{i}^{*}\\\:-({w}^{T}G\left({x}_{i}\right)+b-{y}_{i})\le\:\epsilon\:+{\xi\:}_{i}\\\:{\xi\:}_{i},{\xi\:}_{i}^{*}\ge\:0,\:\:i=1,\:2,\:\dots\:,\:n\:\end{array}\right.$$
(6)

where \(\:C\) is the box constraint, \(\:{x}_{i}\) and \(\:{y}_{i}\) are the input and output vectors, respectively, and \(\:{\xi\:}_{i}\) and \(\:{\xi\:}_{i}^{*}\) are positive slack variables. SVMs utilise kernel functions (\(\:k\left(x,{x}^{{\prime\:}}\right)\), where \(\:x\) and \(\:{x}^{{\prime\:}}\) are two data points) to handle non-linear relationships between the input and output vectors. Even if the original input space is not linearly separable, kernel functions enable SVMs to implicitly transfer input vectors into a higher-dimensional space where the data may be more separable. In this case, the decision function (Eq. (4)) takes the form:

$$f\left(x\right)=\sum_{i=1}^{n}\left(\alpha_{i}-\alpha_{i}^{*}\right)k\left(x_{i},x\right)+b$$
(7)

where \(\alpha_{i}\) and \(\alpha_{i}^{*}\) are Lagrange multipliers and \(x_{i}\) are the training inputs. For example, the decision function for the radial basis function (RBF) kernel is as follows:

$$f\left(x\right)=\sum_{i=1}^{n}\left(\alpha_{i}-\alpha_{i}^{*}\right)\exp\left(-\gamma{\parallel x_{i}-x\parallel}^{2}\right)+b$$
(8)

where \(\:\gamma\:\) is a parameter for the RBF kernel.
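To make the RBF decision function of Eq. (8) concrete, the sketch below evaluates it for given support vectors and dual coefficients. All numerical values are hypothetical; in practice the coefficients and the bias result from solving the optimisation problem in Eqs. (5) and (6), which is not reproduced here.

```python
import math

def rbf_kernel(x_i, x, gamma):
    """RBF kernel between a training input x_i and a query point x."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x))
    return math.exp(-gamma * sq_dist)

def svr_predict(x, support_vectors, dual_coefs, bias, gamma):
    """Evaluate Eq. (8); dual_coefs[i] stands for (alpha_i - alpha_i^*)."""
    weighted = sum(c * rbf_kernel(sv, x, gamma)
                   for sv, c in zip(support_vectors, dual_coefs))
    return weighted + bias

# Hypothetical support vectors and coefficients, for illustration only.
svs = [[0.0, 0.0], [1.0, 0.0]]
coefs = [0.5, -0.25]
pred = svr_predict([0.0, 0.0], svs, coefs, bias=0.1, gamma=1.0)
```

A larger \(\gamma\) makes each kernel term decay faster with distance, so the prediction is dominated by the nearest support vectors.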

Regression tree

Decision trees are predictive models that partition the feature space to predict the label associated with an instance by traversing from the tree’s root to a leaf58. Regression trees are a specific type of decision trees designed for predicting numerical values. They recursively divide the input space (\(\:{x}_{i}\)) into \(\:J\) number of disjoint regions (\(\:{R}_{1},\:{R}_{2},\:\dots\:,\:{R}_{J}\)) using splitting rules. Regression tree splitting rules are derived from the minimisation of the sum of squared errors inside each division:

$$\left(j,s\right)=\underset{j,s}{\text{argmin}}\left[\sum_{x_{i}\in R_{m}}{\left(y_{i}-c_{m}\right)}^{2}+\sum_{x_{i}\in R_{m}^{\prime}}{\left(y_{i}-c_{m}^{\prime}\right)}^{2}\right]$$
(9)

where \(\:j\) and \(\:s\) are the index and the threshold value of the feature used for splitting, respectively. \(\:{c}_{m}\) and \(\:{c}_{m}^{{\prime\:}}\) are the constant predictions for regions \(\:{R}_{m}\) and \(\:{R}_{m}^{{\prime\:}}\). The main parameter in regression trees is the minimum leaf size, which represents the minimum number of samples required to create a terminal node (leaf) in the process of building the tree.
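The splitting rule of Eq. (9) can be illustrated with an exhaustive search over features and thresholds for a single split. This is a sketch under the assumptions that candidate thresholds are midpoints between consecutive distinct feature values and that each region is predicted by its mean; the example data are hypothetical.

```python
def best_split(X, y, min_leaf_size=1):
    """Search for the (cost, feature index j, threshold s) minimising Eq. (9)."""
    def region_sse(values):
        c = sum(values) / len(values)          # constant prediction = region mean
        return sum((v - c) ** 2 for v in values)

    best = None
    for j in range(len(X[0])):
        levels = sorted({row[j] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(levels, levels[1:]):
            s = (lo + hi) / 2
            left = [t for row, t in zip(X, y) if row[j] <= s]
            right = [t for row, t in zip(X, y) if row[j] > s]
            if min(len(left), len(right)) < min_leaf_size:
                continue                        # would violate the minimum leaf size
            cost = region_sse(left) + region_sse(right)
            if best is None or cost < best[0]:
                best = (cost, j, s)
    return best

# Hypothetical data: the response depends on the first feature only.
X = [[0.1, 5.0], [0.2, 1.0], [0.8, 4.0], [0.9, 2.0]]
y = [1.0, 1.0, 3.0, 3.0]
split = best_split(X, y)
```

Applying the same search recursively to each resulting region, until the minimum leaf size prevents further splitting, yields the full tree.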

Ensemble learning

While regression trees are easy to interpret and fast for fitting and prediction, like other weak learners, they are susceptible to overfitting and have sensitivity to training data. Integrating several weak learners makes the model more resilient and less prone to overfitting, as each learner depends on a different set of data points. Ensemble learning in ML is the process of combining several weak learners. Given regression trees, one way to overcome this issue is to construct a weighted collection of multiple regression trees to build models called ensembles of trees. Combining many regression trees generally improves the prediction capability and accuracy. Several ensemble learning methods exist, including bagging and boosting.

Bagging (bootstrap aggregation) involves training many weak (base) learners in parallel and integrating them using averaging techniques59. Considering the original data set as \(D=\left\{\left(x_{1},y_{1}\right),\left(x_{2},y_{2}\right),\dots,\left(x_{n},y_{n}\right)\right\}\), first, a number of bootstrap samples \(\left(D_{i},\:i=1,2,\dots,B\right)\) is created by randomly choosing \(n\) samples from \(D\) with replacement. Then, a base learner \(f_{i}\) is trained on \(D_{i}\) to minimise the error between \(y\) and \(f_{i}\left(x\right)\). Finally, the aggregated prediction model \(f\left(x\right)\) is obtained by averaging the predictions:

$$\:f\left(x\right)=\frac{1}{B}\sum\:_{i=1}^{B}{f}_{i}\left(x\right)$$
(10)
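The bagging procedure leading to Eq. (10) can be sketched in a few lines. To keep the example short, the base learner is a one-dimensional nearest-neighbour predictor, which is a stand-in chosen for illustration rather than the regression trees used in the study; the data are hypothetical.

```python
import random

def fit_nearest_neighbour(sample):
    """Stand-in base learner (assumption): returns the target of the closest training point."""
    def predict(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return predict

def bagging(data, n_learners, seed=0):
    """Train n_learners base learners on bootstrap samples and average them (Eq. 10)."""
    rng = random.Random(seed)
    n = len(data)
    learners = []
    for _ in range(n_learners):
        bootstrap = [data[rng.randrange(n)] for _ in range(n)]   # n draws with replacement
        learners.append(fit_nearest_neighbour(bootstrap))

    def aggregated(x):
        return sum(f(x) for f in learners) / len(learners)
    return aggregated

# Hypothetical (x, y) pairs, for illustration only.
data = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
f = bagging(data, n_learners=25)
y_hat = f(1.4)
```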

By training each learner with the output of the preceding learner, the boosting approach progressively boosts the model’s overall performance. One boosting technique used for building regression ensembles is least-squares boosting (LSBoost)60. This technique successively fits a set of weak learners (e.g., decision trees), with each new learner trained to reduce residual errors from the ensemble’s total predictions. The approach iteratively improves the ensemble’s predictions by including fresh weak learners. First, the ensemble prediction is initialised as the mean of the target values (\(\:{y}_{i},\:i=1,\:2,\:\dots\:,\:n\)):

$$\:{f}_{0}\left(x\right)=\frac{1}{n}\sum\:_{i=1}^{n}{y}_{i}$$
(11)

Then, for iteration \(m\:(m=1,2,\dots,M)\), the residuals between the target values and the accumulated prediction (\(f_{m-1}\left(x_{i}\right)\)) for each observation are calculated as:

$$\:{r}_{{i}_{m}}={y}_{i}-{f}_{m-1}\left({x}_{i}\right)\:\:\:\:\:\:\:\:\:\:\:1\:\le\:i\le\:n$$
(12)

Next, a new weak learner (\(\:{h}_{m}\)) is trained by fitting it to the residuals:

$$h_{m}=\underset{h}{\text{argmin}}\left(\frac{1}{2n}\sum_{i=1}^{n}{\left[r_{i_{m}}-h\left(x_{i}\right)\right]}^{2}\right)$$
(13)

Finally, the ensemble model is updated:

$$f_{m}\left(x\right)=f_{m-1}\left(x\right)+\eta\,h_{m}\left(x\right)$$
(14)

where \(\:\eta\:\) is the learning rate (the shrinkage parameter), which controls the contribution of each weak learner and ranges from 0 to 1.
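Eqs. (11) to (14) can be put together in a short sketch of least-squares boosting. The weak learner here is a single-split stump on one feature, and a constant learning rate is used in every round; both are assumptions made to keep the example compact, and the data are hypothetical.

```python
def fit_stump(x, r):
    """Least-squares stump on one feature: plays the role of h_m in Eq. (13)."""
    best = None
    levels = sorted(set(x))
    for lo, hi in zip(levels, levels[1:]):
        s = (lo + hi) / 2
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        cl, cr = sum(left) / len(left), sum(right) / len(right)
        cost = sum((ri - cl) ** 2 for ri in left) + sum((ri - cr) ** 2 for ri in right)
        if best is None or cost < best[0]:
            best = (cost, s, cl, cr)
    _, s, cl, cr = best
    return lambda xi: cl if xi <= s else cr

def lsboost(x, y, n_rounds, eta):
    """Return a predictor built by successively fitting stumps to the residuals."""
    f0 = sum(y) / len(y)                                         # Eq. (11)
    learners = []

    def predict(xi):
        return f0 + eta * sum(h(xi) for h in learners)           # Eq. (14), unrolled

    for _ in range(n_rounds):
        residuals = [yi - predict(xi) for xi, yi in zip(x, y)]   # Eq. (12)
        learners.append(fit_stump(x, residuals))                 # Eq. (13)
    return predict

# Hypothetical one-dimensional data with a step in the response.
x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 0.0, 1.0, 1.0]
model = lsboost(x, y, n_rounds=20, eta=0.5)
```

With a learning rate of 0.5, each round removes half of the remaining residual in this example, which shows how a smaller learning rate trades more rounds for a more gradual fit.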

Gaussian process regression (GPR)

Gaussian process regression (GPR) is a probabilistic and non-parametric kernel-based machine learning regression model61 rooted in Bayesian principles. Due to its simplicity of use and flexibility in obtaining hyperparameters, GPR is well-suited to handle small-sized datasets and nonlinear problems62. A Gaussian process (GP) is a collection of random variables having Gaussian distribution and is fully defined by its mean function \(\:\mu\:\left(x\right)\) and covariance kernel function \(\:k\left(x,{x}^{{\prime\:}}\right)\).

Considering \(\:{x}_{i}\) and \(\:{y}_{i}\) as the input and corresponding output vectors, respectively, the GPR model with Gaussian noise is formulated as:

$$\:{y}_{i}=f\left({x}_{i}\right)+{\epsilon\:}_{i}$$
(15)

where \(\epsilon_{i}\) denotes an additive noise term assumed to follow a Gaussian distribution with a mean of 0 and a constant standard deviation of \(\sigma\) (i.e., \(\epsilon_{i}\sim\mathcal{N}(0,\sigma^{2})\)). The objective of GPR is to infer the function \(f\) in a non-parametric and Bayesian manner, utilising the provided training dataset \(\{\left(x_{i},y_{i}\right);\:i=1,2,\dots,n\}\). A prior distribution on \(f\) needs to be established in order to learn this function. Typically, this prior is utilised to encapsulate qualitative attributes of the function such as continuity, differentiability, or periodicity. In GPR, the prior distribution for \(f\) as the regression function is represented by:

$$f\left(x\right)\sim\mathcal{G}\mathcal{P}\left(\mu\left(x\right),k\left(x,x^{\prime}\right)\right)$$
(16)

In this formulation, while the mean function \(\:\mu\:\left(x\right)\) is often set constant, the covariance kernel \(\:k\left(x,{x}^{{\prime\:}}\right)\) varies. When the values of the function \(\:f\left({x}_{i}\right)\) have a joint Gaussian distribution defined by \(\:\mu\:\left(x\right)\) and \(\:k\left(x,{x}^{{\prime\:}}\right)\) for every finite set of inputs \(\:{x}_{i}\), then the function \(\:f\left(x\right)\) is a GP, implying:

$$\:\left[\begin{array}{c}f\left({x}_{1}\right)\\\:\vdots\\\:f\left({x}_{n}\right)\end{array}\right]\sim\:\mathcal{N}\left(\left[\begin{array}{c}\mu\:\left({x}_{1}\right)\\ \vdots\\ \mu\:\left({x}_{n}\right)\end{array}\right],\left[\begin{array}{ccc}k\left({x}_{1},{x}_{1}\right) & \cdots & k\left({x}_{1},{x}_{n}\right)\\ \vdots & \ddots & \vdots \\ k\left({x}_{n},{x}_{1}\right)\: &\cdots & k\left({x}_{n},{x}_{n}\right)\end{array}\right]\right)$$
(17)

Using the notation below:

$$\varvec{\upmu}\triangleq\left[\begin{array}{c}\mu\left(x_{1}\right)\\ \vdots\\ \mu\left(x_{n}\right)\end{array}\right];\quad\mathcal{K}\triangleq\left[\begin{array}{ccc}k\left(x_{1},x_{1}\right) & \cdots & k\left(x_{1},x_{n}\right)\\ \vdots & \ddots & \vdots \\ k\left(x_{n},x_{1}\right) & \cdots & k\left(x_{n},x_{n}\right)\end{array}\right];\quad\mathcal{k}\left(x\right)\triangleq\left[\begin{array}{c}k\left(x_{1},x\right)\\ \vdots\\ k\left(x_{n},x\right)\end{array}\right]$$
(18)

the equations can be simplified. When learning functions through GPR, Eq. (17) is expanded by including a new data point \(x_{*}\), separate from the training data. The objective is to predict the value of the function at this particular location, i.e., \(f\left(x_{*}\right)\). To do so, given the already observed values \(\mathcal{Y}={[y_{1}\dots y_{n}]}^{T}\), the relationship can be expressed by incorporating Eqs. (15)–(18):

$$\:\left[\begin{array}{c}\mathcal{Y}\\\:f\left({x}_{*}\right)\end{array}\right]\sim\:\mathcal{N}\left(\left[\begin{array}{c}\varvec{\upmu\:}\\\:\mu\:\left({x}_{*}\right)\end{array}\right],\left[\begin{array}{cc}\mathcal{K}+{\sigma\:}^{2}{\mathbf{I}}_{n}&\:\mathcal{k}\left({x}_{*}\right)\\\:{\mathcal{k}\left({x}_{*}\right)}^{T}&\:k\left({x}_{*},{x}_{*}\right)\end{array}\right]\right)$$
(19)

Here, \(\mathbf{I}_{n}\) is the \(n\times n\) identity matrix. Conditioning on the observations, the posterior probability distribution for \(f\left(x_{*}\right)\) at the new data point can be estimated as:

$$\:f\left({x}_{*}\right)|\mathcal{Y}\mathcal{\:}\sim\:\mathcal{N}\left({\mu\:}_{*},{\sigma\:}_{*}^{2}\right)\:$$
(20)

where:

$$\:{\mu\:}_{*}=\:\mu\:\left({x}_{*}\right)+{\mathcal{k}\left({x}_{*}\right)}^{T}{\left(\mathcal{K}+{\sigma\:}^{2}{\mathbf{I}}_{n}\right)}^{-1}\left(\mathcal{Y}-\varvec{\upmu\:}\right)$$
(21)
$$\:{\sigma\:}_{*}^{2}=\:k\left({x}_{*},{x}_{*}\right)-{\mathcal{k}\left({x}_{*}\right)}^{T}{\left(\mathcal{K}+{\sigma\:}^{2}{\mathbf{I}}_{n}\right)}^{-1}\mathcal{k}\left({x}_{*}\right)$$
(22)

The posterior probability distribution is Gaussian once more, allowing for Bayesian reasoning on the function \(f\). One noteworthy aspect of these formulations is that the function’s posterior expected value, \(\mathbb{E}\left(f\left(x_{*}\right)|\mathcal{Y}\right)\), can be stated as the prior mean plus a weighted sum of kernel functions:

$$\mathbb{E}\left(f\left(x_{*}\right)|\mathcal{Y}\right)=\mu\left(x_{*}\right)+{\mathcal{k}\left(x_{*}\right)}^{T}{\left(\mathcal{K}+\sigma^{2}\mathbf{I}_{n}\right)}^{-1}\left(\mathcal{Y}-\varvec{\upmu}\right)=\mu\left(x_{*}\right)+\sum_{i=1}^{n}k\left(x_{*},x_{i}\right)\alpha_{i}$$
(23)

where:

$$\:\left[\begin{array}{c}{\alpha\:}_{1}\\\:\vdots\\\:{\alpha\:}_{n}\end{array}\right]\triangleq\:{\left(\mathcal{K}+{\sigma\:}^{2}{\mathbf{I}}_{n}\right)}^{-1}\left(\mathcal{Y}-\varvec{\upmu\:}\right)$$
(24)

To delve into the GPR model, the weighted-sum form in Eq. (23), with the weights defined in Eq. (24), is advantageous as it facilitates computations that would otherwise be challenging.
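The posterior mean and variance of Eqs. (21), (22) and (24) translate almost directly into NumPy. The sketch below assumes a constant prior mean and a squared-exponential kernel with fixed hyperparameters; these choices, the function names, and the data are illustrative assumptions, and no hyperparameter optimisation is performed.

```python
import numpy as np

def rbf(A, B, length_scale=1.0, signal_var=1.0):
    """Squared-exponential kernel matrix between row-wise inputs A and B (assumed kernel)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)

def gpr_posterior(X, y, X_star, noise_var=1e-6, prior_mean=0.0):
    """Posterior mean and variance of f at each row of X_star."""
    K = rbf(X, X) + noise_var * np.eye(len(X))        # K + sigma^2 I_n
    k_star = rbf(X, X_star)                           # one column k(x_*) per query point
    alpha = np.linalg.solve(K, y - prior_mean)        # weights of Eq. (24)
    mu_star = prior_mean + k_star.T @ alpha           # Eq. (21)
    v = np.linalg.solve(K, k_star)
    prior_var = rbf(X_star, X_star).diagonal()
    var_star = prior_var - np.einsum("ij,ij->j", k_star, v)   # Eq. (22)
    return mu_star, var_star

# Hypothetical one-dimensional training data, for illustration only.
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 0.5])
mu, var = gpr_posterior(X, y, X_star=np.array([[1.0], [5.0]]))
```

At a training input the posterior variance collapses to roughly the noise level, whereas far from the data it returns to the prior variance; this behaviour is what makes GPR attractive for small datasets.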

The details on the kernel functions used in this study are presented in Table B.1 in the Appendix. It is possible to use either isotropic or non-isotropic kernel functions with GPR. In contrast to isotropic kernels, non-isotropic ones give each predictor variable a distinct correlation length scale. This results in an improved accuracy at the cost of slowing down the fitting process.

Artificial neural network (ANN)

An artificial neural network (ANN) is an ML model inspired by neuronal organisation in animal brains. ANN is composed of interconnected nodes which are arranged into several layers. These layers are typically organised into three groups: the input layer, hidden layers, and the output layer. Each node (or neuron) conducts a basic computation, and the connections between nodes transport weighted signals from one layer to the next63,64.

The output of the NN is computed through feedforward propagation65. Considering the input vector as \(\:x\), the activation of each neuron in layer \(\:l\) as \(\:{a}^{\left(l\right)}\), the weight matrix linking layer \(\:l\) to layer \(\:l+1\) as \(\:{W}^{\left(l\right)}\), and the bias term for layer \(\:l\) as \(\:{b}^{\left(l\right)}\), the feedforward computation is as follows:

$$\:{z}^{(l+1)}={W}^{\left(l\right)}{a}^{\left(l\right)}+{b}^{\left(l\right)}$$
(25)
$$\:{a}^{(l+1)}=g\left({z}^{(l+1)}\right)$$
(26)

where \(\:{z}^{(l+1)}\) is the input to layer \(\:l+1\) and \(\:g\left(z\right)\) is the activation function which is applied element-wise to the input. In this study, we used the rectified linear unit (ReLU), Tanh, and sigmoid activation functions, whose formulas are provided in Table B.2 in the Appendix.
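The feedforward computation of Eqs. (25) and (26) can be sketched for a small network as follows. The weights, the 2-2-1 architecture, and the linear output layer are hypothetical choices made for illustration only.

```python
def relu(z):
    """Rectified linear unit, applied element-wise."""
    return [max(0.0, v) for v in z]

def identity(z):
    """Linear output activation (an assumption, common for regression outputs)."""
    return list(z)

def forward(x, layers):
    """layers: list of (W, b, g), with W given as a list of rows; Eqs. (25)-(26)."""
    a = list(x)
    for W, b, g in layers:
        z = [sum(w * ai for w, ai in zip(row, a)) + bi for row, bi in zip(W, b)]
        a = g(z)
    return a

# Hypothetical weights for a network with 2 inputs, 2 hidden neurons and 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu),
    ([[2.0, 1.0]], [0.1], identity),
]
out = forward([1.0, 2.0], layers)
```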

Hyperparameters optimisation and model validation

Overfitting is a common challenge in ML models, requiring the use of cross-validation (CV) methods to validate the effectiveness of the model. As illustrated in Fig. 5, we employed nested cross-validation (NCV), which consists of two CV steps:

  I.

    An external loop conducts CV on the dataset based on the number of ICs, excluding one IC at each iteration to create a distinct test set of “unseen ICs”.

  II.

    An internal loop performs CV on the remaining dataset after the execution of the external loop to tune hyperparameters and mitigate overfitting.

This NCV technique properly estimates model performance by combining 5-fold outer loops based on the number of ICs with 10-fold inner loops. In the outer loop, model performance assessment occurs through the partitioning of the dataset into a training-validation set (comprising four ICs) and a distinct test set (comprising one IC). The outer loop is also referred to as leave-one-out cross-validation (LOOCV)66 on ICs. The training-validation set undergoes further subdivision into diverse folds using a 10-fold CV to estimate the generalisation error and fine-tune hyperparameters. To this end, Bayesian Optimisation (BO) algorithms67 were employed to adjust hyperparameters. To determine the minimum or maximum of a function, BO combines Bayesian inference with optimisation techniques. BO constructs a prior distribution over parameters, updates it with data, and selects promising parameters using an acquisition function. This iterative process efficiently explores the function space until it converges on the optimal parameters. Compared to exhaustive search techniques such as grid search or random search, this method effectively explores the hyperparameter space and frequently requires fewer trials.
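The index bookkeeping behind this nested scheme can be sketched as follows. The function below only generates the outer leave-one-IC-out split and the inner 10-fold partitions; the model fitting and the BO-based hyperparameter search that would run inside the inner loop are omitted, and all names are assumptions made for illustration.

```python
import random

def nested_cv_splits(ic_labels, n_inner_folds=10, seed=0):
    """Yield (held_out_ic, inner_folds, test_idx) for leave-one-IC-out nested CV."""
    rng = random.Random(seed)
    for held_out in sorted(set(ic_labels)):
        # Outer loop: all samples of one IC form the test set.
        test_idx = [i for i, ic in enumerate(ic_labels) if ic == held_out]
        rest = [i for i, ic in enumerate(ic_labels) if ic != held_out]
        rng.shuffle(rest)
        # Inner loop: split the remaining samples into folds for validation.
        folds = [rest[f::n_inner_folds] for f in range(n_inner_folds)]
        inner = [
            ([i for g, fold in enumerate(folds) if g != f for i in fold], folds[f])
            for f in range(n_inner_folds)
        ]
        yield held_out, inner, test_idx

# Five ICs with 37 DSD runs each, as in the dataset described in the text.
ic_labels = [ic for ic in range(5) for _ in range(37)]
splits = list(nested_cv_splits(ic_labels))
```

Because the held-out IC never enters the inner folds, the hyperparameters are tuned without any information from the IC on which the model is finally tested.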

Fig. 5

The process of nested cross-validation (NCV), given LOOCV in the outer loop and 10 folds in the inner loop.

Transfer learning (TL) for unseen ICs

In this section, we explore the application of transfer learning (TL) to enhance the performance of the SM when confronting a new, previously unseen IC. TL is a powerful technique that leverages knowledge gained from related tasks or domains to improve performance on a target task68,69,70. TL is particularly advantageous when obtaining a sufficient number of training samples is costly.

In the context of granular material segregation, the IC is crucial as it can significantly influence segregation outcomes6. Consequently, if the IC changes, the DEM model must be recalibrated, which is very time-consuming. To address this challenge, we can leverage prior knowledge gained from previously encountered ICs as the source domain to pretrain the SMs. Subsequently, we can transfer the pretrained SMs as base learners for new, unseen ICs, treating them as the target domain in TL. In the next stage, the pretrained SMs are updated by incorporating a small number of samples from the target domain through model retraining while retaining the prior information from the source domain. This approach allows TL to expedite the learning process, as the SMs require fewer samples from the unseen ICs to achieve effective learning. Additionally, it eliminates the need for full retraining and cross-validation, coupled with BO, using data from both the source and target domains, thereby reducing computational demands.

With \({x}_{i}\) and \({y}_{i}\) denoting the input and output vectors, TL involves extracting knowledge from the source domain (\({\mathcal{D}}_{s}={\left\{\left({x}_{i}^{s},{y}_{i}^{s}\right)\right\}}_{i=1}^{{N}_{s}}\)) and using it to pretrain the model for the target domain (\({\mathcal{D}}_{t}={\left\{\left({x}_{i}^{t},{y}_{i}^{t}\right)\right\}}_{i=1}^{{N}_{t}}\)). Here, we investigate the effectiveness of TL in updating the model by varying the number of samples from the new, unseen IC. This includes the following steps:

  • Initial model training: The SM is first trained using cross-validation on data from four of the five ICs, yielding a baseline model \({f}_{s}\). This baseline model provides the starting point for the TL approach.

For example, the training of the source model for GPR can be expressed as:

$$\left\{{f}_{s}\left(x,{\varvec{w}}_{\varvec{s}},{\varvec{\theta}}_{0}\right)\sim\mathcal{GP}\left(\mu\left(x\right),k\left(x,{x}^{\prime}\right)\mid\left({\varvec{w}}_{0},{\varvec{\theta}}_{0}\right)\right)\,:\,x\in{\mathcal{D}}_{s}\right\}$$
(27)

where \({\varvec{w}}_{0}\) denotes the initial learnable parameters and \({\varvec{w}}_{\varvec{s}}\) the trained learnable parameters. Denoting the initial hyperparameters of \(f\left(x\right)\) by \({\varvec{\theta}}_{0}\), the BO algorithm is applied, given \(x\in{\mathcal{D}}_{s}\), to tune the hyperparameters of \({f}_{s}\left(x\right)\), which are subsequently denoted by \({\varvec{\theta}}_{s}\).

  • Testing the unseen IC: Subsequently, the performance of the pretrained baseline model is evaluated on the data from the new unseen IC.

$$\left\{{f}_{s}\left(x,{\varvec{w}}_{\varvec{s}},{\varvec{\theta}}_{\varvec{s}}\right)\,:\,x\in{\mathcal{D}}_{t}\right\}$$
(28)

However, the results indicated suboptimal performance, highlighting the need for further model refinement.

  • TL with limited samples: To address the limitations in performance, the transferred model is updated using limited samples from the target domain \(\:{\mathcal{D}}_{t}={\left\{\left({x}_{i}^{t},{y}_{i}^{t}\right)\right\}}_{i=1}^{{N}_{t}}\) alongside the source domain data \(\:{\mathcal{D}}_{s}\). For example, the updating of the source model for GPR can be expressed as:

$$\left\{{f}_{t}\left(x,{\varvec{w}}_{\varvec{t}},{\varvec{\theta}}_{\varvec{s}}\right)\sim\mathcal{GP}\left(\mu\left(x\right),k\left(x,{x}^{\prime}\right)\mid\left({\varvec{w}}_{\varvec{s}},{\varvec{\theta}}_{\varvec{s}}\right)\right)\,:\,x\in\left({\mathcal{D}}_{s}\cup{\mathcal{D}}_{t}\right)\right\}$$
(29)

Specifically, we experimented with \(\:{N}_{t}\) values of 1, 5, 10, and 20 samples available from the new unseen IC to investigate the impact of varying sample sizes of the target domain on model improvement.
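The three steps above can be sketched on toy data. The following is a minimal pure-Python illustration and not the MATLAB implementation used in this study: the squared-exponential kernel, the zero mean function, the toy response `truth`, and the fixed values in `theta_s` are all assumptions. "Updating" is interpreted here as re-conditioning the GP on \({\mathcal{D}}_{s}\cup{\mathcal{D}}_{t}\) while the hyperparameters stay at their source-domain values, in the spirit of Eq. (29).

```python
import math


def rbf(a, b, length_scale):
    """Squared-exponential kernel between two input tuples."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-0.5 * sq / length_scale ** 2)


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][j] * z[j] for j in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def gp_fit_predict(X, y, X_new, theta):
    """Posterior mean of a zero-mean GP at X_new with fixed hyperparameters."""
    ell, noise = theta
    K = [[rbf(a, b, ell) + (noise if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    alpha = solve(K, y)
    return [sum(rbf(x, a, ell) * w for a, w in zip(X, alpha)) for x in X_new]


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def truth(x, ic):
    """Toy response whose level shifts in the unseen IC (illustrative only)."""
    return math.sin(x) + 0.5 * ic


# Stand-in for hyperparameters tuned on the source domain: (length scale, noise).
theta_s = (1.0, 1e-3)

X_s = [(float(x), 0.0) for x in range(6)]         # source domain D_s
X_t = [(x + 0.5, 1.0) for x in range(5)]          # N_t = 5 samples from the unseen IC
X_test = [(float(x), 1.0) for x in range(1, 5)]   # held-out points of the unseen IC
y_s = [truth(*p) for p in X_s]
y_t = [truth(*p) for p in X_t]
y_test = [truth(*p) for p in X_test]

# Step 2: test the pretrained model on the unseen IC (no update).
rmse_before = rmse(y_test, gp_fit_predict(X_s, y_s, X_test, theta_s))
# Step 3: update with a few target samples, keeping theta_s fixed.
rmse_after = rmse(y_test, gp_fit_predict(X_s + X_t, y_s + y_t, X_test, theta_s))
```

Because the hyperparameters are not re-tuned, the update costs a single linear solve on the enlarged dataset, which is what makes this form of retraining cheap compared with repeating cross-validation and BO.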

Evaluation metrics

We used several metrics to compare the performance of the trained ML models. These metrics can be categorised into two groups: metrics that evaluate the accuracy of the models and those that assess the speed of the training and prediction processes. Regarding the first group, we used root-mean-square error (RMSE), coefficient of determination (R-squared or \(\:{R}^{2}\)), and mean-absolute-error (MAE), with the following equations:

$$RMSE=\sqrt{\frac{{\sum}_{i=1}^{n}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}}{n}}$$
(30)
$${R}^{2}=1-\frac{{\sum}_{i=1}^{n}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}}{{\sum}_{i=1}^{n}{\left({y}_{i}-\bar{y}\right)}^{2}}$$
(31)
$$MAE=\frac{{\sum}_{i=1}^{n}\left|{y}_{i}-{\widehat{y}}_{i}\right|}{n}$$
(32)

where \(n\) is the number of data points, \(y\) is the vector of actual values, \(\widehat{y}\) is the vector of predicted values, and \(\bar{y}\) is the mean of the actual values.
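For reference, the three accuracy metrics can be computed as in the following minimal Python sketch. The function names and the small example vectors are illustrative assumptions; MAE is computed from absolute residuals, as is standard.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error."""
    n = len(y_true)
    return math.sqrt(sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - yh) for y, yh in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only: a single prediction is off by 2.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
scores = (rmse(y_true, y_pred), r_squared(y_true, y_pred), mae(y_true, y_pred))
```

In this example the single large residual gives an RMSE of 1.0 but an MAE of only 0.5, which illustrates why RMSE is the more outlier-sensitive of the two error measures.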

In addition to the accuracy metrics, we employed two further metrics: training time and prediction speed. The former is the time required to train the model (in seconds), while the latter is the number of predictions the model can make per second. Therefore, models with low training times and high prediction speeds are preferred.

Results and discussion

In this section, the performance of various ML models (mentioned in Section 3), considered as the SM, is first compared in the subsection “surrogate model selection”. The influence of including or excluding initial configurations (ICs) as complementary input features through label encoding is investigated to select the best models and determine the optimal approach for shaping the feature input. This step is crucial for further evaluations and subsequent steps toward nested cross-validation (NCV) on unseen ICs and updating the SM using TL, as will be discussed in the subsection “Transfer Learning (TL) for Unseen ICs”. We used MATLAB 2022a on a laptop with an Intel Core i7-8665U CPU and 16 GB of RAM to train and evaluate ML models.

Surrogate model selection

We trained various ML models to compare and select those showing the best performance for the next steps. The models were trained under two distinct scenarios with respect to ML inputs: (1) using only DEM interaction parameters, and (2) adding extra inputs that characterise the ICs of the mixture within the hopper (see Fig. 2). Figure 6 illustrates an example of the regression tree model’s performance for both scenarios. The model’s performance is clearly enhanced when the extra IC-related inputs are included. This improvement is expected, as the segregation (or degree of mixing) of multiple materials depends heavily on their ICs6. The results for the first scenario (i.e., excluding ICs as ML inputs) are given in Table B.3 of the Appendix, where it is evident that the model’s performance is unsatisfactory even with optimised hyperparameters. We therefore proceed with the second scenario to train the models.

Fig. 6

Comparison of the impact of input features with (w/) and without (w/o) initial configurations (ICs) on fine-tuned regression tree outcomes using Bayesian optimisation (BO), assessed via 5-fold cross-validation.

The training results of various ML models are presented in Table 4, where ICs were included as complementary features. The training was performed with 5-fold cross-validation on 185 samples and 18 features. Optimisable models were fine-tuned using the Bayesian optimisation (BO) algorithm with 50 iterations. Notably, the performance of the models significantly improved when the hyperparameters were optimised using BO, underscoring the importance of fine-tuning hyperparameters in the ML training process. It is also worth mentioning that despite increasing the number of hidden layers and neurons, the performance of ANN did not improve, possibly due to overfitting or vanishing gradient problems71.

Based on the evaluation metrics employed in this study, the optimal model is characterised by minimal error (i.e., RMSE and MAE), maximal \(\:{R}^{2}\), low training time, and high prediction speed. According to Table 4, we identify two models that fulfil the majority of these criteria simultaneously: Gaussian process regression (GPR) and ensemble of trees, with GPR being superior across all metrics.

Table 4 Performance comparison of regression models, including 3 features representing initial configurations (ICs). Optimisable models’ hyperparameters were fine-tuned using the BO algorithm.

Transfer learning (TL) for unseen ICs

In this section, we present the outcomes of the TL method and updating process utilised in this study, as explained in Section 5. Results for TL-based ensemble learning (TL-Ensemble) and GPR (TL-GPR) are given in Tables 5 and 6, respectively. To ensure stability and repeatability, we used five different random seed numbers to initialise the ML model’s parameters during training. This strategy ensures a reliable assessment of the model’s performance. The results in Tables 5 and 6 are reported as (mean ± standard deviation) resulting from these five repetitions. All training procedures were performed with 10-fold cross-validation and 18 features. Additionally, hyperparameters were fine-tuned using the BO algorithm. The hyperparameters of these models together with their search space are provided in Table B.4 and Table B.5 in the Appendix.

Initially, we cross-validated the models (i.e., Ensemble learning and GPR) using all 185 available samples given ten folds to establish a benchmark for comparison. The results are displayed in the first row of Tables 5 and 6, labelled as “All ICs were seen”. Then, we iteratively applied the transfer learning approach across all ICs. For each iteration, one IC was excluded, and the model was pretrained on the remaining four ICs (i.e., on 4 × 37 = 148 samples) via a 10-fold cross-validation coupled with BO fine-tuning of hyperparameters. The results of the validation phase are presented under “Validation”, excluding the training phase’s results. Next, we tested the model’s performance on the “unseen IC” as the target domain of TL that was previously excluded. We ran tests where the pretrained model was retrained with 0, 1, 5, 10, and 20 samples from the unseen IC, monitoring its performance each time. These test results are shown under “Unseen IC”. For an overall assessment of the model’s performance across all unseen ICs, we calculated the average of the TL-based outcomes across all unseen ICs for retraining with different available samples of 0, 1, 5, 10, and 20 from the target domain. These averages are displayed at the bottom of Tables 5 and 6, indicated as “Mean”.
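The bookkeeping of this evaluation protocol can be sketched as follows. This is a minimal Python illustration under stated assumptions: `summarise_tl_runs` is a hypothetical helper, `toy_rmse` is a placeholder for the actual pretrain, update, and test procedure (it does not reproduce any result of this study), and the sample standard deviation is assumed for the reported spread.

```python
import statistics


def summarise_tl_runs(evaluate, ics, n_updates, seeds):
    """Aggregate TL test scores.

    evaluate(ic, n, seed) is expected to pretrain on all ICs except `ic`,
    update with `n` samples from `ic`, and return a test score.
    Returns per-IC (mean, std) over seeds and the mean over ICs for each n.
    """
    per_ic = {}
    for ic in ics:
        for n in n_updates:
            scores = [evaluate(ic, n, seed) for seed in seeds]
            per_ic[(ic, n)] = (statistics.mean(scores), statistics.stdev(scores))
    overall = {
        n: statistics.mean([per_ic[(ic, n)][0] for ic in ics]) for n in n_updates
    }
    return per_ic, overall


def toy_rmse(ic, n, seed):
    """Placeholder score for illustration only, not a result of this study."""
    return 1.0 / (1 + n) + 0.01 * seed


ics = ["IC1", "IC2", "IC3", "IC4", "IC5"]
per_ic, overall = summarise_tl_runs(toy_rmse, ics, (0, 1, 5, 10, 20), range(5))
```

Here `per_ic` corresponds to the "Unseen IC" entries reported as mean ± standard deviation, and `overall` to the "Mean" rows of Tables 5 and 6.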

Table 5 Performance of TL-Ensemble averaged over 5 different random seed numbers for initialisation.
Table 6 Performance of TL-GPR averaged over 5 different random seed numbers for initialisation.

To visually illustrate the impact of the updating process on the pretrained TL-SMs, Fig. 7 shows predicted versus true responses for two different unseen initial configurations, IC1 and IC3, where they were updated with 0 (no update), 1, and 5 samples from the target domain. The models’ predictions were initialised using the first random seed number in this figure. As illustrated, providing even a few samples from the unseen IC—the target domain—for updating leads to a significant improvement. Comprehensive results for various unseen ICs, random seed numbers, and the model updated with different numbers of samples (0, 1, 5, 10, 20) from the unseen IC under test are provided in two videos in the supplementary material for both TL-Ensemble and TL-GPR. These videos allow easy tracking of the updating process and the effect on targeting different unseen ICs.

Fig. 7

Predicted and true responses for unseen initial configurations IC1 and IC3 with TL-Ensemble and TL-GPR, updated with varying numbers of samples: (a) 0 (no update), (b) 1, and (c) 5 from the unseen IC.

Figure 8 compares the “mean” performance of TL-Ensemble and TL-GPR across different numbers of available samples from the unseen IC. As shown, TL-GPR consistently outperforms TL-Ensemble across all available sample sizes. Consequently, we focus on TL-GPR for further analysis of the results.

Fig. 8

Comparison between the mean performances of TL-Ensemble and TL-GPR during validation and test phases for different numbers of available samples from unseen IC to update in terms of (a) RMSE, (b) \(\:{R}^{2}\), and (c) MAE.

Figure 9 illustrates the percentage reduction in RMSE for different sample sizes used to update the TL-GPR model. The reduction was calculated as \(\left(\left|{RMSE}_{ICi}-{RMSE}_{IC0}\right|/{RMSE}_{IC0}\right)\times 100\), where \(i=1, 5, 10, 20\) corresponds to the number of samples used to retrain the SM, and IC0 denotes the case where the pretrained model is not updated. The bar graph demonstrates that updating the TL-GPR model with just one sample from the unseen IC results in a significant reduction (~50%) in the RMSE. Figure 8a also shows that by updating the model with only a few new samples (e.g., 5 samples) from the unseen IC, its accuracy for the new IC can approach that of the validation set. This finding underscores the efficiency of updating the SM with a minimal number of new DEM simulations to improve its accuracy for an unseen IC.
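As a worked illustration of the reduction formula, consider the following minimal Python sketch. The function name and the two RMSE values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def rmse_reduction_percent(rmse_updated, rmse_no_update):
    """Relative change in RMSE (in percent) with respect to the non-updated model."""
    return abs(rmse_updated - rmse_no_update) / rmse_no_update * 100.0


# Hypothetical values: RMSE drops from 0.10 (no update) to 0.05 (after update).
reduction = rmse_reduction_percent(0.05, 0.10)  # approximately 50
```

Because of the absolute value, the expression reports only the magnitude of the change, so it should be read together with the RMSE values themselves to confirm that the error decreased.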

Fig. 9

Reduction in RMSE of TL-GPR model following updating with different numbers of samples from unseen ICs.

It is important to note that the results and analyses discussed above are based on the average performance of the TL-based model across all ICs. However, according to Table 6, the model’s performance varies depending on which IC is considered as unseen. To facilitate a clearer comparison, Fig. 10 presents the RMSE and \({R}^{2}\) of TL-GPR across all ICs and for different numbers of samples used to update the pretrained model. It reveals that, when the model is not updated (i.e., 0 samples), the TL-GPR model exhibits the poorest performance for IC3 and IC2, characterised by the highest RMSE and lowest \({R}^{2}\).

The observed performance for IC3, where materials within the hopper are fully mixed (see Fig. 2), was anticipated. This is because there is no comparable data in the training dataset; the other four initial configurations (IC1, IC2, IC4, and IC5) are fully segregated, so the data distribution for IC3 differs markedly from theirs. Similarly, the comparatively poor performance of the model for IC2 can be attributed to its unique feature, i.e., a reversed layering order, opposite to that of IC1, IC4, and IC5. Nevertheless, in both the IC2 and IC3 cases, updating the model with only a few data points leads to a significant improvement in performance. For instance, in IC3, updating the model with only 1 and 5 samples reduces the RMSE by 68% and 85%, respectively.

Fig. 10

Accuracy of TL-GPR across various unseen ICs and for different numbers of samples to update the model in terms of (a) RMSE and (b) R-squared.

Conclusion

In this study, we successfully demonstrated a framework for developing surrogate models (SMs) that effectively link particle-particle and particle-wall DEM interaction parameters to the segregation of a multi-component mixture. We first examined various ML models to develop SMs capable of estimating radial segregation in the heap based on DEM parameters and the initial configuration (IC) of the mixture. We found that developing accurate SMs requires consideration of features describing IC through feature engineering. Moreover, we emphasise that fine-tuning hyperparameters is crucial to obtaining the optimal performance of ML-based SMs. Among the six ML models tested, Ensemble learning and Gaussian process regression (GPR) demonstrated the best performance.

Next, we developed an adaptive SM leveraging a transfer learning (TL)-based approach using Ensemble learning and GPR. Cross-validation, coupled with Bayesian optimisation for fine-tuning the SM’s hyperparameters, was conducted using four ICs as the pretraining phase of TL, while predictions and a retraining phase were carried out for a fifth “unseen IC”. Model performance was monitored after updating and retraining the TL-SMs with access to different numbers of samples from the unseen IC. We observed that TL-GPR consistently outperformed TL-Ensemble.

Our findings indicate that the performance of TL-SMs varied depending on the specific unseen IC. When the pretrained TL-SMs were tested on new ICs with characteristics not represented in the source dataset, their performance was relatively lower. For instance, IC3 is the only configuration out of five in which the materials are fully mixed, and the TL-SMs’ performance for IC3 was inferior to that for the other ICs. However, the performance significantly improved after updating the model with a few samples from the “unseen IC”. For instance, the RMSE of TL-GPR was reduced by an average of 50% by retraining the model with just one additional sample, highlighting the adaptability and effectiveness of our proposed adaptive TL approach.

To overcome the challenge encountered by the TL model for IC3 (i.e., fully-mixed configuration), additional intermediate initial configurations between fully-segregated and fully-mixed ones can be incorporated. Additionally, our surrogate model is currently constrained to predicting segregation for pellets and sinter with specified material properties and certain geometrical properties of the system. To improve the generalisability of the model, it is essential to vary and include these material- and geometry-related properties in the training phase. Furthermore, our surrogate model is developed for only one response variable. However, in DEM model calibration problems, multiple responses are typically considered simultaneously. Future research endeavours could focus on addressing these challenges, including incorporating multiple response variables and exploring a broader range of material and geometric properties to improve model performance and applicability.

The authors would like to acknowledge dr.ir. Jan van der Stel and ir. Allert Adema from Tata Steel IJmuiden for the insightful discussions in the context of blast furnaces.