Introduction

Accurate temperature prediction plays a pivotal role in many engineering applications, where understanding thermal behavior is essential for optimal decision-making and system performance. For instance, in chemical reactors the temperature distribution must be estimated accurately in order to enhance the yield of chemical reactions1. The temperature field can be obtained via computational fluid dynamics (CFD) simulation of the process, in which the governing equations are solved numerically to yield the temperature distribution inside the reactor at various times2. However, implementing CFD is challenging for industrial chemical reactors, and computational methods based on machine learning (ML) can be applied in combination with CFD to better understand the behavior of the process3,4.

In this context, machine learning models have gained considerable attention for their ability to capture complex temperature patterns and provide precise predictions5. ML models can be trained on computed CFD data, which can reduce computational time. This research explores and compares the performance of several prominent ML models, namely Bayesian Ridge Regression (BRR), Deep Neural Network (DNN), Attention-based Deep Neural Network, and Support Vector Machine (SVM), in the specific context of temperature prediction within a radial-axial coordinate system for data collected from a CFD simulation of heat transfer in a chemical reactor.

DNN regression is a flexible model that utilizes deep neural networks to estimate and forecast continuous target variables based on input features. Its capacity to capture intricate patterns and interconnections makes it a valuable tool in regression analysis6. BRR is a regression methodology characterized by its blend of norm regularization and data-driven adaptability. Rooted in Bayesian principles, it offers a versatile approach to modeling relationships between variables while mitigating the risk of overfitting, making it useful in predictive modeling and data analysis7. SVM regression is a highly effective and adaptable machine learning methodology for addressing regression challenges. Like SVM classification, it operates on the fundamental principle of identifying the optimal hyperplane that most effectively fits the given dataset8.

As a novel aspect of this paper, we added an attention mechanism to the DNN model to further enhance its performance for the simulation of liquid-phase chemical reactors using the CFD dataset. This mechanism allows the model to dynamically focus on the most informative input features and thus exploit spatial dependencies more effectively. The outcomes of this enhancement were carefully evaluated and showed notable gains in model robustness and predictive accuracy.

The paper makes significant contributions to the field of temperature prediction in a spatial context. The study not only demonstrates the strong predictive power of the Deep Neural Network and the Attention-based Deep Neural Network, with high R2 scores and low MAE and RMSE, but also highlights the practical applicability of these models in real-world scenarios. The findings are made more reliable by employing the Jellyfish Swarm Optimizer for hyper-parameter optimization. These results highlight the potential of deep neural methods for spatial temperature prediction, providing a useful tool for practitioners in the field as well as for the scientific community.

Data analytics

A liquid-phase chemical reactor was studied in this work, which operates based on a temperature difference in the process and the conversion of raw materials into products. The dataset comprises three variables, namely r (the radial coordinate), z (the axial coordinate), and T (the temperature in Kelvin). The study designates r and z as the independent variables, whereas the dependent variable is T (the reactor temperature distribution). The dataset contains over 54,000 data points, and Figs. 1 and 2 depict the distributions of the input and output variables. The simulations were carried out using CFD for the chemical reactor, and the resulting temperature data (T) were used for the machine learning analysis. The simulations accounted for conduction and convection in heat transfer as well as diffusion and convection in mass transfer. The CFD simulations were executed in the COMSOL package, considering laminar flow in a tubular chemical reactor. More details regarding the CFD simulations can be found in9.

Fig. 1. Distributions of input parameters for ML modeling.

Fig. 2. T distributions obtained from the CFD model.

In this work, the z-score approach was applied to identify and remove outliers from the dataset. Expressed in standard deviations, the z-score is a statistical measure that quantifies the distance between each data point and the mean of the dataset. By establishing a z-score threshold, we identified data points that deviated substantially from the mean. Outliers were removed in order to enhance the robustness of the analysis and mitigate any potential negative effects on the performance and accuracy of the machine learning models10. This rigorous procedure for detecting and eliminating outliers improves the dependability and credibility of the findings and allows more accurate and insightful conclusions to be drawn from the refined dataset.
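
As an illustration of this preprocessing step, a minimal sketch of z-score-based outlier filtering is given below; the column names, file name, and threshold of 3 are assumptions, since the exact values are not reported here:

```python
import numpy as np
import pandas as pd

def remove_outliers_zscore(df: pd.DataFrame, column: str, threshold: float = 3.0) -> pd.DataFrame:
    """Drop rows whose value in `column` lies more than `threshold`
    standard deviations away from the column mean."""
    z = (df[column] - df[column].mean()) / df[column].std()
    return df[np.abs(z) <= threshold]

# Hypothetical usage on the CFD dataset with columns r, z and T (Kelvin)
# data = pd.read_csv("cfd_reactor_data.csv")
# data = remove_outliers_zscore(data, column="T", threshold=3.0)
```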

Methodology

Jellyfish swarm optimizer (JSO)

Optimizing hyper-parameters is a crucial stage in the ML model development process. The selection of hyper-parameters plays a pivotal role in shaping the model’s performance, and the quest for the ideal combination of hyper-parameters is frequently a laborious and computationally demanding undertaking. To address this challenge, a novel optimization technique called the JSO has emerged as a promising solution for hyper-parameter tuning. JSO draws inspiration from the remarkable swarming behavior of jellyfish in the ocean. Just as jellyfish coordinate their movements to optimize their location and resources in the vast sea, JSO leverages the collective intelligence of a swarm of agents to explore the hyper-parameter space efficiently.

Let \(\:X\) represent the population of jellyfish agents, where each agent \(\:i\) is associated with a set of hyper-parameters denoted as \(\:{x}_{i}\). The objective function to be optimized is defined as \(\:f\left(x\right)\). The movement of each agent in the JSO population is governed by the following equations11:

$$\:\begin{array}{c}{v}_{i}\left(t+1\right)={v}_{i}\left(t\right)+\alpha\:\cdot\:\left({p}_{i}\left(t\right)-{x}_{i}\left(t\right)\right)+\beta\:\cdot\:\left(g\left(t\right)-{x}_{i}\left(t\right)\right)\end{array}$$
(1)
$$\:\begin{array}{c}{x}_{i}\left(t+1\right)={x}_{i}\left(t\right)+{v}_{i}\left(t+1\right)\end{array}$$
(2)

Here, \(\:{v}_{i}\left(t\right)\) stands for the velocity of agent \(\:i\) at time \(\:t\), \(\:{\upalpha\:}\) and \(\:{\upbeta\:}\) are acceleration coefficients, \(\:{p}_{i}\left(t\right)\) denotes the personal best solution found by agent \(\:i\) up to time \(\:t\), and \(\:g\left(t\right)\) is the global best solution discovered by any agent in the population12.

The exploration and exploitation abilities of JSO are balanced through these equations. Agents adjust their velocities based on their personal experiences (\(\:{p}_{i}\left(t\right)\)) and the global best solution (\(\:g\left(t\right)\)). This allows the swarm to investigate the hyper-parameter space effectively and converge toward optimal configurations. Figure 3 provides a comprehensive depiction of the JSO algorithm. The JSO algorithm iteratively refines the hyper-parameter settings by updating the agents' positions and velocities. The iterative process continues until a specified stopping condition is met, such as reaching the maximum allowed number of iterations or convergence of the swarm to an acceptable solution.
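
A minimal sketch of the velocity and position updates of Eqs. (1) and (2) is shown below; the acceleration coefficients, bounds, and usage are placeholders, and the full JSO additionally includes ocean-current and passive/active motion phases not reproduced here:

```python
import numpy as np

def jso_step(x, v, p_best, g_best, alpha=0.5, beta=0.5, lb=None, ub=None):
    """One swarm update following Eqs. (1)-(2):
    v_i(t+1) = v_i(t) + alpha*(p_i(t) - x_i(t)) + beta*(g(t) - x_i(t))
    x_i(t+1) = x_i(t) + v_i(t+1)"""
    v_new = v + alpha * (p_best - x) + beta * (g_best - x)
    x_new = x + v_new
    if lb is not None and ub is not None:
        x_new = np.clip(x_new, lb, ub)  # keep hyper-parameters inside their bounds
    return x_new, v_new

# Hypothetical usage: 20 agents exploring a 3-dimensional hyper-parameter space
# x = np.random.uniform(lb, ub, size=(20, 3)); v = np.zeros_like(x)
# x, v = jso_step(x, v, p_best, g_best)
```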

Fig. 3. Flowchart of the JSO algorithm utilized in this study.

Deep neural network regression

The DNN regression model is a highly effective machine learning approach for addressing regression tasks. This class of neural networks is capable of learning and approximating intricate non-linear relationships between input features and continuous target variables. DNN regression excels in scenarios where the connection between input and output defies conventional linear regression methods.

The architecture of a DNN regression model consists of multiple layers of artificial neurons, known as nodes or units. Each layer comprises a set of interconnected nodes, and the layers are organized into input, hidden, and output layers. The input layer takes the feature vector, denoted as \(\:X\), as its input, and the final output layer produces the predicted continuous values, denoted as \(\:\widehat{y}\), based on the learned relationships13. Mathematically, a DNN regression model can be represented by the following equations14:

$$\begin{aligned} h^{(1)} & = \sigma \left( W^{(1)} X + b^{(1)} \right) \\ h^{(2)} & = \sigma \left( W^{(2)} h^{(1)} + b^{(2)} \right) \\ & \;\;\;\vdots \\ \hat{y} & = W^{(L)} h^{(L-1)} + b^{(L)} \\ \end{aligned}$$
(3)

In this equation, X represents the input feature vector, \(\:{h}^{\left(l\right)}\) corresponds to the activations of the l-th hidden layer, \(\:{W}^{\left(l\right)}\) and \(\:{b}^{\left(l\right)}\) are the weights and biases of layer l, and \(\:\sigma\:\) is a non-linear activation function15.
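
A minimal Keras-style sketch of such a network is given below; the layer widths, activation function, and training settings are illustrative assumptions rather than the exact configuration used in this study:

```python
import tensorflow as tf

def build_dnn(n_features: int = 2) -> tf.keras.Model:
    """Fully connected regression network mapping [r, z] -> T, following Eq. (3)."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(n_features,)),  # h(1)
        tf.keras.layers.Dense(64, activation="relu"),                             # h(2)
        tf.keras.layers.Dense(1),                                                 # y_hat = W(L) h(L-1) + b(L)
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model

# Hypothetical usage with features X = [[r, z], ...] and targets y = T
# model = build_dnn()
# model.fit(X_train, y_train, epochs=100, batch_size=256, validation_split=0.1)
```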

Attention-based deep neural network for regression

To boost the predictive capability of the DNN in fitting the T data, we introduce an attention mechanism, which allows the DNN to dynamically focus on the most relevant features for temperature prediction. This approach leverages the spatial dependencies between the radial and axial coordinates, enabling the model to prioritize input regions that significantly influence the target variable16,17.

The attention-based DNN integrates an attention mechanism as an intermediary layer within the network. The overall architecture is composed of the following components:

1. Input Layer: Accepts the feature vector \(\:\left[r,z\right]\), where r represents the radial coordinate and z the axial coordinate.

2. Hidden Layers: Consist of multiple dense layers with non-linear activation functions (e.g., ReLU) that extract high-level features from the input data.

3. Attention Layer: This layer computes attention weights \(\:{{\upalpha\:}}_{i}\) for each feature or feature interaction. The weights are calculated as follows:

    • A compatibility score \(\:{e}_{i}\) is computed for each input feature:

      $$\:\begin{array}{c}{e}_{i}=\text{t}\text{a}\text{n}\text{h}\left({W}_{a}{h}_{i}+{b}_{a}\right)\end{array}$$
      (4)

      where \(\:{h}_{i}\) is the feature representation from the preceding hidden layer, and \(\:{W}_{a}\) and \(\:{b}_{a}\) are learnable parameters.

    • The attention weights \(\:{{\upalpha\:}}_{i}\) are derived using a softmax function:

      $$\:\begin{array}{c}{{\upalpha\:}}_{i}=\frac{\text{exp}\left({e}_{i}\right)}{{\sum\:}_{j}\text{exp}\left({e}_{j}\right)}\end{array}$$
      (5)
    • The weighted feature representations are computed as:

      $$\:\begin{array}{c}{h}_{i}^{{\prime\:}}={{\upalpha\:}}_{i}{h}_{i}\end{array}$$
      (6)
4. Output Layer: Aggregates the attention-weighted feature representations and produces the final temperature prediction T.

The model is trained using the mean squared error (MSE) loss function, defined as:

$$\:\begin{array}{c}\text{L}\text{o}\text{s}\text{s}=\frac{1}{n}{\sum\:}_{i=1}^{n}{\left({T}_{i}-\widehat{{T}_{i}}\right)}^{2}\end{array}$$
(7)

where \(\:{T}_{i}\) is the actual temperature, and \(\:\widehat{{T}_{i}}\) is the predicted temperature. The network parameters, including the attention weights, are updated via backpropagation using an optimizer like Adam.
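
A minimal sketch of the attention layer described by Eqs. (4)-(6), embedded in a small regression network, is shown below as a custom Keras layer; the layer dimensions and training settings are illustrative assumptions:

```python
import tensorflow as tf

class FeatureAttention(tf.keras.layers.Layer):
    """Computes e_i = tanh(W_a h_i + b_a), alpha = softmax(e), and returns alpha_i * h_i."""
    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.W_a = self.add_weight(name="W_a", shape=(dim, dim), initializer="glorot_uniform")
        self.b_a = self.add_weight(name="b_a", shape=(dim,), initializer="zeros")

    def call(self, h):
        e = tf.tanh(tf.matmul(h, self.W_a) + self.b_a)  # compatibility scores, Eq. (4)
        alpha = tf.nn.softmax(e, axis=-1)               # attention weights, Eq. (5)
        return alpha * h                                # weighted representations, Eq. (6)

def build_attention_dnn(n_features: int = 2) -> tf.keras.Model:
    inputs = tf.keras.Input(shape=(n_features,))
    h = tf.keras.layers.Dense(64, activation="relu")(inputs)
    h = FeatureAttention()(h)                           # attention-weighted features
    h = tf.keras.layers.Dense(64, activation="relu")(h)
    outputs = tf.keras.layers.Dense(1)(h)               # predicted temperature T
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")         # MSE loss of Eq. (7), Adam optimizer
    return model
```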

Support vector machine (SVM)

SVM was specifically developed to handle classification and regression tasks. It stands as one of the prominent supervised learning models, adept at analyzing data to classify samples18. SVM regression operates on the same fundamental principles as SVM classification, aiming to maximize the margin between the data points and the regression hyperplane while minimizing prediction errors. To achieve this, SVM regression uses a loss function and a set of constraints. The core components of SVM regression include19:

1. Kernel Functions (\(\:K\left(x,{x}^{{\prime\:}}\right)\)): SVM regression can be enhanced using kernel functions, which allow it to model non-linear relationships between the input variables and the target parameter. The most common kernels used in SVM regression include the linear, polynomial, sigmoid, and radial basis function (RBF) kernels.

2. Objective Function: The objective of SVM regression is to optimize the loss function subject to margin constraints. The loss function in SVM regression is often defined as20:

$$L\left( y,f\left( x \right) \right) = \frac{1}{2}\left\| w \right\|^{2} + C\sum\nolimits_{i = 1}^{N} \left( \max \left( 0,\left| y_{i} - f\left( x_{i} \right) \right| - \epsilon \right) \right)^{2}$$
(8)

where \(\:L\left(y,f\left(x\right)\right)\) denotes the loss function, w is the weight vector, and C is the regularization parameter. N indicates the number of training data points, \(\:{y}_{i}\) is the true target value for the i-th example, \(\:f\left({x}_{i}\right)\) is the predicted value for the i-th data point, and \(\:\epsilon\) defines the ε-insensitive tube, allowing for a certain amount of error.

3. Hyperparameters: The hyperparameters of SVM regression include the kernel function parameters and the regularization parameter C. Achieving the best performance requires careful optimization of these parameters.

4. Support Vectors: Support vectors constitute the data instances positioned in closest proximity to the regression hyperplane, exerting a significant influence on its location and orientation. These data points are pivotal in determining the margin and the overall model performance.

By applying suitable kernel functions, SVM regression can capture non-linear relationships in the data and handle both simple and complex regression tasks. The performance of the model is strongly influenced by the choice of hyperparameters and kernel function. SVM regression is especially advantageous when the dataset includes a considerable number of features or contains outliers that require effective handling21,22.
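
As a minimal illustration using scikit-learn, an RBF-kernel SVM regressor for the [r, z] → T mapping could be set up as follows; the hyper-parameter values shown are placeholders that would normally be tuned (e.g., by grid search or the JSO), not the settings used in this study:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# RBF-kernel SVR with placeholder hyper-parameters C, epsilon and gamma
svr_model = make_pipeline(
    StandardScaler(),
    SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma="scale"),
)

# Hypothetical usage: X_train holds [r, z] pairs, y_train holds temperatures T
# svr_model.fit(X_train, y_train)
# T_pred = svr_model.predict(X_test)
```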

Bayesian ridge regression

Bayesian Ridge Regression (BRR) is a regression methodology that combines norm regularization with the ability to adapt to data. In BRR, the regularization parameter, denoted as \(\:{\upalpha\:}\), is not fixed but instead dynamically determined from the data. The primary objective of BRR can be expressed as7:

$$\:\begin{array}{c}\underset{\varvec{\omega\:}}{\text{m}\text{i}\text{n}}{\left\|\varvec{X}\varvec{\omega\:}-\varvec{y} \right\|}_{2}^{2}+\alpha\:{\left\|\varvec{\omega\:}\right\|}_{2}^{2}\end{array}$$
(9)

In this equation, \(\:{\upomega\:}\) represents the coefficient vector estimate, \(\:X\) is the feature matrix, and \(\:y\) is the target variable.

The regularization parameter, \(\:{\upalpha\:}\), takes on a distinct role in BRR. It is treated as a random variable that is calculated from the available data. The distribution of \(\:{\upomega\:}\) in BRR is modeled as a spherical Gaussian distribution, described as7:

$$\:\begin{array}{c}p\left(\varvec{\omega\:}|\lambda\:\right)=N\left(\varvec{\omega\:}|0,{\lambda\:}^{-1}{\mathbf{I}}_{\varvec{p}}\right)\end{array}$$
(10)

Here, \(\:{\uplambda\:}\) represents a precision parameter, and \(\:{I}_{p}\) signifies an identity matrix of appropriate dimensions. To enhance the versatility of BRR, we derive the optimal values for both λ and α by maximizing the logarithm of the marginal likelihood. Additionally, the prior distribution for λ and α can be characterized using four unique parameters: \(\:{\alpha\:}_{1}\), \(\:{\alpha\:}_{2}\), \(\:{\lambda\:}_{1}\), and \(\:{\lambda\:}_{2}\).
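
A minimal scikit-learn sketch of this formulation is shown below; in scikit-learn's BayesianRidge the four prior parameters correspond to \(\:{\alpha\:}_{1}\), \(\:{\alpha\:}_{2}\), \(\:{\lambda\:}_{1}\), and \(\:{\lambda\:}_{2}\), and the values shown are the library defaults, given here only as an assumption rather than the tuned settings of this study:

```python
from sklearn.linear_model import BayesianRidge

# Gamma-prior hyper-parameters alpha_1, alpha_2, lambda_1, lambda_2 (library defaults)
brr_model = BayesianRidge(alpha_1=1e-6, alpha_2=1e-6, lambda_1=1e-6, lambda_2=1e-6)

# Hypothetical usage: X_train holds [r, z] pairs, y_train holds temperatures T
# brr_model.fit(X_train, y_train)
# T_pred = brr_model.predict(X_test)
# After fitting, brr_model.alpha_ and brr_model.lambda_ hold the noise and weight
# precisions estimated by maximizing the log marginal likelihood.
```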

Results and discussion

The fitting accuracy of the four introduced models was evaluated on the test dataset (20% of the entire data). The models were implemented in the Python programming language. The metrics used for evaluation are listed below:

1. R² Score: evaluates the extent to which the model accounts for the variability observed in the test dataset23:

    $$\:\begin{array}{c}{R}^{2}=1-\frac{{\sum\:}_{i=1}^{n}{\left({y}_{i}-\widehat{{y}_{i}}\right)}^{2}}{{\sum\:}_{i=1}^{n}{\left({y}_{i}-\stackrel{-}{y}\right)}^{2}}\end{array}$$
    (11)
2. Mean Absolute Error (MAE): quantifies the mean absolute deviation between estimated and observed values24:

    $$\:\begin{array}{c}MAE=\frac{1}{n}{\sum\:}_{i=1}^{n}\left|{y}_{i}-\widehat{{y}_{i}}\right|\end{array}$$
    (12)
3. Root Mean Squared Error (RMSE): quantifies the square root of the mean squared difference between values estimated by the model and measured/reference values25:

    $$\:\begin{array}{c}RMSE=\sqrt{\frac{1}{n}{\sum\:}_{i=1}^{n}{\left({y}_{i}-\widehat{{y}_{i}}\right)}^{2}}\end{array}$$
    (13)
4. Average Absolute Relative Deviation (AARD%): measures the average percentage deviation between actual and predicted values24:

    $$\:\begin{array}{c}AARD\%=\frac{1}{n}{\sum\:}_{i=1}^{n}\frac{\left|{y}_{i}-\widehat{{y}_{i}}\right|}{{y}_{i}}\times\:100\end{array}$$
    (14)

where n is the size of the dataset, \(\:{y}_{i}\) denotes the actual values, \(\:\widehat{{y}_{i}}\) is the value estimated by the model, and \(\:\stackrel{-}{y}\) is the mean of the actual values. These metrics were used to evaluate and compare the performance of the models on the test dataset, and the numerical outcomes are summarized in Table 1.
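
For reference, these four metrics can be computed directly from the predictions, for example as in the NumPy-based sketch below (the variable names are assumptions):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """R2, MAE, RMSE and AARD% as defined in Eqs. (11)-(14)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    mae = np.mean(np.abs(y_true - y_pred))
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    aard = np.mean(np.abs(y_true - y_pred) / y_true) * 100.0
    return {"R2": r2, "MAE": mae, "RMSE": rmse, "AARD%": aard}

# Hypothetical usage: metrics = regression_metrics(T_test, model.predict(X_test))
```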

The training process for the machine learning models in this study was designed to ensure robust and reliable performance. For the BRR model, training involved fitting the data into a probabilistic framework with regularization to prevent overfitting. The model was optimized by maximizing the marginal likelihood, ensuring accurate parameter estimation.

For the DNN and Attention-Based DNN models, training was carried out using the backpropagation algorithm. The Mean Squared Error (MSE) loss function was minimized using the Adam optimizer, which combines momentum and adaptive learning rates to accelerate convergence. The models were trained over multiple epochs, and batch normalization was employed to stabilize learning and improve generalization. The attention-based DNN further included an attention mechanism layer that dynamically prioritized relevant input features.

The SVM model was trained with a radial basis function (RBF) kernel, enabling it to capture non-linear relationships in the data. The margin of tolerance for error was established by the ε-insensitive loss function, and grid search was used to fine-tune the hyperparameters for the best outcomes.

To further enhance model performance, the JSO was employed for hyperparameter tuning, ensuring efficient exploration of the parameter space and improved model accuracy. The dataset was divided into training and test sets (80%-20%), and 5-fold cross-validation was performed during training to prevent overfitting and validate the models’ robustness.
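
A minimal sketch of this evaluation protocol (80/20 hold-out split plus 5-fold cross-validation on the training portion) is shown below, reusing the svr_model sketch above as a stand-in estimator; the random seeds and scoring choice are assumptions:

```python
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# 80/20 split of the [r, z] -> T dataset (fixed random_state for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5-fold cross-validation on the training set to check robustness of a candidate model
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(svr_model, X_train, y_train, cv=cv, scoring="r2")
print("Mean CV R2:", scores.mean())

# Final evaluation on the held-out 20% test set
# svr_model.fit(X_train, y_train)
# print("Test R2:", svr_model.score(X_test, y_test))
```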

Table 1 Summary of the results for fitting the dataset.

According to Table 1, the attention-based Deep Neural Network achieved the highest R² score of 0.99432, outperforming all other models in predictive accuracy. Its superiority is evident in Figs. 4, 5, 6 and 7, which compare actual and predicted values across all models. The attention-based DNN consistently exhibited closer alignment with the actual values, as evidenced by the reduced MAE, RMSE, and AARD%. This indicates its ability to dynamically prioritize significant features in the input data, exploiting spatial dependencies more effectively than the traditional DNN, BRR, and SVM models. These findings establish the attention-based DNN as the optimal model for temperature prediction of the chemical reactor in the radial-axial coordinate system.

Fig. 4. Bayesian Ridge Regression model: predicted vs. actual values.

Fig. 5. Deep Neural Network model: predicted vs. actual values.

Fig. 6. Attention-based Deep Neural Network model: predicted vs. actual values.

Fig. 7. Support Vector Machine model: predicted vs. actual values.

Figure 8 presents a 3D visualization of the temperature distribution as a function of the radial (r) and axial (z) coordinates for the liquid phase in the chemical reactor. This figure illustrates the spatial temperature variations within the reactor, highlighting the combined effects of radial and axial positions on the thermal gradient. Such a representation aids in understanding the complex thermal behavior across the reactor, which is critical for optimizing the process and improving heat transfer efficiency. Both conduction and convection contribute to the temperature gradient; however, thermal convection is the more important mechanism because of the fluid flow (mixing) inside the reactor.

Fig. 8. 3D representation of the temperature as a function of coordinates.

Figure 9 explores the effect of the radial coordinate (r) on temperature, showing how temperature varies radially from the center of the reactor towards its edges. The sharp decrease in temperature can be attributed to the heat transfer path, along which less heat reaches the reactor wall. Indeed, the highest T value is observed at the center of the reactor, which creates the thermal gradient that drives heat transfer in the fluid. Figure 10 depicts the relationship between the axial coordinate (z) and the fluid temperature, revealing the longitudinal thermal behavior along the tubular reactor.

Fig. 9. The impact of r (m) on the temperature profile.

Fig. 10. The impact of z (m) on the temperature profile.

Conclusion

In this study, we conducted a comprehensive analysis of four distinct machine learning models, BRR, DNN, attention-based DNN, and SVM, for the task of temperature prediction in a radial-axial coordinate system for a chemical reactor. Our findings reveal valuable insights into the performance of these models, their strengths, and their areas of applicability.

BRR demonstrated solid predictive capabilities, making it a reliable choice for temperature forecasting in scenarios where a balance between model complexity and accuracy is critical. The results showed commendable R², MAE, RMSE, and AARD% values.

SVM, on the other hand, revealed strong fitting performance, striking a balance between accuracy and computational efficiency. It offers a practical solution for scenarios where real-time or near-real-time temperature predictions are required.

DNN, with its remarkable accuracy and generalization abilities, excelled in capturing complex temperature patterns within the radial-axial coordinate system. This model is particularly suitable for applications where precision is of utmost importance.

The application of the attention mechanism in the DNN model represents a novel contribution of this paper. This enhancement enabled the model to dynamically prioritize relevant input features, resulting in the attention-based DNN achieving the highest predictive accuracy among all models. It recorded the best performance across all metrics, with an R² score of 0.99432, along with reduced MAE, RMSE, and AARD% values, confirming its superior capability in capturing complex spatial dependencies.

Furthermore, the implementation of the Jellyfish Swarm Optimizer (JSO) for hyper-parameter optimization proved instrumental in enhancing the performance of each model, underscoring the importance of fine-tuning in machine learning applications.

This research provides a valuable resource for practitioners and researchers in fields such as thermal engineering, materials science, and physics, offering guidance on selecting the most appropriate machine learning model based on the specific requirements of their temperature prediction tasks. The insights gained from this comparative analysis contribute to the advancement of temperature modeling and underscore the potential of machine learning in addressing complex spatial temperature dependencies. As technology continues to evolve, these findings will exert a significant impact in achieving greater accuracy and efficiency in temperature prediction, thereby facilitating progress in a wide range of scientific and industrial applications.