Introduction

Concrete is the most widely used man-made material globally and accounts for ~7.5% of total anthropogenic CO₂ emissions1. To meet the demands of infrastructure construction and renewal, concrete consumption will increase significantly in the coming decades, further intensifying the burden on the environment2. Although low-carbon concretes can mitigate the growth in anthropogenic CO₂ emissions3, ensuring concrete durability remains crucial for sustainable infrastructure4. In cold regions, freeze-thaw damage (FTD) is the most common cause of concrete durability degradation5. Accurate FTD prediction of concrete is therefore essential for achieving sustainable infrastructure in these regions.

Traditional approaches for predicting concrete’s frost resistance include empirical regression methods6 and physics-based models7,8. However, empirical methods require extensive experimentation, incurring high labor costs and often yielding formulas applicable only to specific concrete types9. Physics-based models, which involve developing multi-scale, multi-physics numerical simulations based on water transport, phase transitions, and damage evolution, can elucidate FTD mechanisms at the mesoscale or microscale10. However, they frequently encounter convergence issues11.

In contrast, data-driven machine learning (ML) methods, with their capacity to capture high-dimensional nonlinear relationships, have recently emerged as a promising alternative12. Consequently, ML has been widely applied in the durability prediction of concrete in recent years13,14. Early studies demonstrated the potential of artificial neural networks (ANN) for predicting freeze–thaw damage in concrete15,16. Subsequently, Huang et al.17 enhanced ANN performance using a hybrid sparrow search algorithm, improving prediction accuracy by nearly 30%. More advanced ensemble learning models, such as random forest18 and extreme gradient boosting19,20, have since been employed to achieve greater prediction accuracy and robustness by integrating individual model outputs21.

Despite these advances, the performance of ML models depends heavily on the precision of manually engineered features22. The prevailing databases primarily comprise numerical data (e.g., cement content, aggregate gradation, and number of freeze-thaw cycles), which restricts both prediction accuracy and model interpretability. In fact, FTD in concrete results from changes in the microstructure driven by complex physicochemical processes23. Textual information describing the FTD process of concrete is therefore particularly important.

Although manual extraction of textual features such as those related to raw materials, corrosive environments, and microstructural variations is feasible, it is often constrained by limited domain expertise, subjective bias, and fatigue, thereby reducing data reliability24,25. In recent years, significant advancements have been made in natural language processing (NLP)26,27. NLP has been widely applied in various fields, such as alloy material design28, medical diagnosis29,30, microbial genomics31, and drug design32. In civil engineering, scholars have proposed lightweight large language model (LLM)–based frameworks for automated literature mining to systematically identify research hotspots and trends in sustainable concrete substitute materials33. One study developed an artificial language to represent concrete mixtures and their physicochemical properties, which was subsequently applied to predict compressive strength, evaluate variable importance, and identify chemical reactions34. Another study introduced a multi-agent LLM framework that automates code-compliant reinforced concrete design, ensuring interpretability and verifiability while enabling efficient, accurate, and transparent structural analysis through natural language interaction35. Moreover, researchers have employed text mining and NLP techniques to analyze and standardize unstructured inspection narratives, transforming qualitative comments into structured data that improve the predictive capacity of bridge condition assessment models36. Nevertheless, the potential of NLP for predicting concrete durability remains largely underexplored.

In this study, we constructed a specialized dataset by compiling published investigations, in which the parameters of concrete raw materials are represented numerically, while the freeze-thaw media, test methods, morphological changes, and FTD mechanisms are represented textually. Four multimodal deep learning (MDL) models were developed to predict FTD of concrete, i.e., the evolution of the mass loss rate (MLR) and the relative dynamic modulus of elasticity (RDME). These MDL models integrate NLP with deep neural networks (DNN), employing architectures such as long short-term memory (LSTM), gated recurrent units (GRU), LSTM with self-attention (LSTM-SA), and multi-head self-attention (MSA). To achieve interpretability of the MDL models, a visualization method was developed to reveal how the models identify key textual information related to the FTD process. It is worth noting that the novelty of the developed MDL models lies in their ability to simultaneously consider the numerical characteristics of concrete raw materials and the key textual information describing the FTD process. More importantly, the developed MDL framework exhibits significant scalability and can also be applied to predict durability damage when concrete is subjected to other corrosive environments, such as chloride ingress, sulfate attack, and atmospheric carbonation.

Results and discussion

Statistical analysis of dataset

The statistical measures of numerical and categorical features in the dataset are summarized in Table 1, including variable type, unit, minimum, maximum, mean, and standard deviation (SD). Variables with relatively low standard deviations generally exhibit values clustered around the mean (e.g., C, W/B, NA). In contrast, larger standard deviations indicate a broader spread of values across the distribution.

Table 1 Descriptive statistics of numerical and categorical variables in the dataset

Figure 1 presents the Pearson correlation matrix for the input and output variables, along with the corresponding significance levels for each feature. The correlation coefficient between NOC and MLR is 0.43, and that between NOC and RDME is -0.46, demonstrating statistically significant correlations between NOC and these output variables. This finding aligns with the general empirical understanding that NOC significantly affects the frost resistance of concrete. Furthermore, all pairwise correlations among the variables are below 0.8, suggesting the absence of severe multicollinearity. Given the dataset’s complexity, conventional linear regression methods may fail to capture its intricate relationships37. Therefore, deep learning models are recommended for effective modeling and prediction38.

Fig. 1: Correlation matrix of the input parameters and output variable.
figure 1

A flatter ellipse denotes a stronger correlation, while an ellipse approaching a circular shape implies a weaker correlation.

Implementation process analysis of DNN and MDL models

To demonstrate the critical role of textual information in predicting FTD in concrete, we initially designed a conventional DNN model, whose input layer consists solely of standardized numerical inputs. Building upon this baseline model, we developed a fully automated NLP framework to convert textual information into a format amenable to DNN input. This framework encompasses three stages: tokenization, word embedding, and feature extraction. This method, which integrates word embedding with feature extraction, has become a prevalent approach in modern NLP tasks39. It surpasses traditional techniques, such as the bag-of-words model, by more effectively capturing inter-word relationships, comprehending text sequences, and preserving contextual information40. Notably, the NLP processes for the four distinct text inputs operate independently. In this study, we adopt an early fusion strategy: the 18-dimensional standardized numerical vector is concatenated directly with the 32-dimensional textual feature vector extracted by the NLP module. The resulting fused representation serves as the input to the first hidden layer of the DNN. This direct concatenation not only simplifies the integration process by avoiding additional alignment procedures, but also enables the network to learn joint representations of textual and numerical modalities from the very beginning of training. Figure 2 illustrates the architecture of the MDL model designed and constructed in this study. To clearly distinguish the MDL model from the basic DNN model and simplify model names, all MDL models are named after the feature extraction model used in the NLP framework; for example, LSTM-DNN is abbreviated to LSTM. Detailed descriptions of the NLP framework’s implementation within the MDL model will follow.

Fig. 2
figure 2

Model architecture for multimodal deep learning models.
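To make the early-fusion design concrete, the sketch below assembles a minimal version of the architecture in Fig. 2 using the Keras functional API. It shows a single text branch for brevity (the actual framework processes four text inputs independently); the vocabulary size, sequence length, and two-output head are illustrative assumptions, while the 18-dimensional numerical input, 64-dimensional embedding, 32-dimensional text feature vector, and direct concatenation follow the description above.

```python
from tensorflow.keras import layers, Model

MAX_LEN = 60        # assumed alignment length for one text category
VOCAB_SIZE = 2000   # assumed vocabulary size from tokenization

num_in = layers.Input(shape=(18,), name="numerical_features")
txt_in = layers.Input(shape=(MAX_LEN,), name="text_tokens")

# NLP branch: embedding -> sequence model -> 32-dimensional text feature vector
emb = layers.Embedding(VOCAB_SIZE + 1, 64, mask_zero=True)(txt_in)
txt_vec = layers.LSTM(32)(emb)          # swap in GRU/attention variants as needed

# Early fusion: concatenate the numerical and textual representations
fused = layers.Concatenate()([num_in, txt_vec])

# DNN head (layer sizes follow the tuned DNN reported in Methods)
x = layers.Dense(128, activation="relu")(fused)
x = layers.Dropout(0.2)(x)
x = layers.Dense(64, activation="relu")(x)
x = layers.Dense(32, activation="relu")(x)
out = layers.Dense(2, name="MLR_RDME")(x)   # illustrative two-output head

model = Model(inputs=[num_in, txt_in], outputs=out)
model.compile(optimizer="adam", loss="mae")
```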

In the tokenization phase, all texts in the dataset are split into individual tokens, and unique tokens are counted to construct a vocabulary. Each token is assigned a unique index, with an extra index reserved for padding symbols to facilitate subsequent text processing. Using the constructed vocabulary, all texts are converted into integer vectors. However, variations in text length across samples hinder their combination into a unified tensor shape, potentially reducing training efficiency. To address this, zero-padding was applied to align the text sequences. To minimize unnecessary padding while preserving as much original information as possible, the maximum token count for each of the four text categories was determined post-tokenization and set as the alignment length for that category. Additionally, a masking mechanism was employed to obscure the padding positions, thereby preventing them from interfering with the model’s training process.
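A minimal sketch of this tokenization and padding step, assuming the Keras text preprocessing utilities (the exact implementation may differ); the example texts are placeholders:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["rapid freezing and thawing in water",
         "surface scaling with exposed coarse aggregate"]   # placeholder examples

tokenizer = Tokenizer()                          # index 0 is reserved for padding
tokenizer.fit_on_texts(texts)                    # build the vocabulary
sequences = tokenizer.texts_to_sequences(texts)  # convert texts to integer vectors

# Align all sequences of this text category to its maximum token count
max_len = max(len(s) for s in sequences)
padded = pad_sequences(sequences, maxlen=max_len, padding="post")
```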

However, although tokens have been converted into unique numerical identifiers, these one-dimensional representations cannot capture the complex relationships and semantic nuances between words. Consequently, word embedding techniques are employed. Word embedding maps discrete variables, such as words or phrases, into a continuous vector space. Each token is then assigned a dense vector of fixed dimensions. During model training, these vectors are iteratively optimized to better capture the complex associations and semantic meanings between tokens, thereby enhancing the model’s capacity to process and understand text content41. In the word embedding phase, the embedding dimension is a critical hyperparameter. After manual trial and error, the embedding dimension was determined to be 64.
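In Keras, this step corresponds to an Embedding layer with an output dimension of 64; setting mask_zero=True provides the masking mechanism described above (the vocabulary size below is a placeholder):

```python
from tensorflow.keras import layers

vocab_size = 2000   # placeholder: number of unique tokens in the vocabulary

# 64-dimensional embedding (set by manual trial and error); mask_zero=True
# propagates a mask so that downstream layers ignore zero-padded positions.
embedding = layers.Embedding(input_dim=vocab_size + 1,   # +1 reserves index 0 for padding
                             output_dim=64,
                             mask_zero=True)
```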

Finally, to effectively capture the key information from the text, we employed four models—LSTM, GRU, LSTM-SA, and MSA—to process the output from the word embedding layer. Numerous studies have demonstrated the effectiveness of LSTM and GRU in handling sequential data42,43,44. Through manual trial and error, we set the output dimensions of both LSTM and GRU to 32. The self-attention mechanism computes attention weights across all input tokens, thereby quantifying each token’s contribution to the final output, which is particularly beneficial for processing long sequences and preserving global contextual information45. Notably, both the LSTM-SA and MSA models produce second-order tensor outputs, which we subsequently converted into vectors via global max pooling to satisfy the DNN’s input requirements.
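The sketch below illustrates how the attention-based branches could be assembled in Keras; the number of attention heads and the key dimension are assumptions, whereas the 32-dimensional recurrent output and the global max pooling step follow the description above:

```python
from tensorflow.keras import layers

def attention_text_branch(embedded_tokens, use_multi_head=True):
    """Sketch of the attention-based text branches (LSTM-SA and MSA)."""
    if use_multi_head:
        # MSA: multi-head self-attention over the embedded tokens
        x = layers.MultiHeadAttention(num_heads=4, key_dim=16)(embedded_tokens,
                                                               embedded_tokens)
    else:
        # LSTM-SA: recurrent encoding followed by dot-product self-attention
        x = layers.LSTM(32, return_sequences=True)(embedded_tokens)
        x = layers.Attention()([x, x])
    # Both variants return a second-order tensor (tokens x features);
    # global max pooling collapses it into a vector for the DNN input.
    return layers.GlobalMaxPooling1D()(x)
```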

Prediction results of DNN and MDL models

The prediction results of the DNN and MDL models are illustrated in Fig. 3. The mean absolute error on the training and validation datasets in Fig. 3a–e is referred to as the loss and validation loss, respectively. The training process of the MDL models clearly differs from that of the DNN. After 700 epochs of training, the convergence values of the validation loss for the LSTM, GRU, LSTM-SA, and MSA models are all lower than that of the DNN model. Because the attention-based models do not rely on sequential time-step dependencies, the MSA model has the shortest training time among the four MDL models (Fig. 3f).

Fig. 3: Prediction results of the DNN/MDL model.
figure 3

a–e represent the training processes of the simple DNN, LSTM, GRU, LSTM-SA, and MSA models, respectively. f Training time for a single epoch and total training time for the different MDL models. g–i represent the coefficient of determination (R²), mean absolute error (MAE), and root mean square error (RMSE), respectively, for the different models in predicting MLR. j–l represent the R², MAE, and RMSE, respectively, for the different models in predicting RDME. The model performance metrics are the mean values from 5-fold cross-validation.

Figure 3g–l present the average performance metrics of these models on the test set, i.e., the coefficient of determination (R²), mean absolute error (MAE), and root mean square error (RMSE). These metrics reflect the models’ prediction accuracy and stability. The results indicate that the four MDL models achieve significantly higher R² for predicting MLR and RDME than the DNN model, suggesting that incorporating textual information describing the freeze-thaw deterioration process of concrete plays an important role in improving the accuracy of FTD predictions. Furthermore, the R² values for the two output variables (i.e., MLR and RDME) differ markedly in the DNN model, whereas they are more consistent across the MDL models. This difference may stem from the distinct physical characteristics of RDME and MLR. RDME is typically influenced by the development of microcracks in concrete and the variation in the pore structure as a function of freeze-thaw cycles46,47. In contrast, MLR directly reflects the mass loss of concrete under freeze-thaw cycles at the macro level48. Thus, RDME prediction depends more on the textual data than MLR prediction, since the text supplies the model with specific information regarding microcracks and pore structure.

Figure 4a–e and g–k present the prediction results of the various models as parity plots. The results indicate that the prediction performance of the four MDL models exceeds that of the DNN model. In particular, the MSA model enhances the prediction accuracy of MLR and RDME by 8% and 21%, respectively. This improvement is attributed to NLP’s capability to extract and convey key information from textual data, thereby enabling the MDL models to yield more scientifically robust and accurate predictions. Additionally, Fig. 4f, l use Taylor diagrams to compare the predictive performance of all models. In the Taylor diagram, polar coordinates centered at the origin represent the standard deviation and correlation coefficient, while polar coordinates centered on the actual data represent the centered root-mean-square error. The results further confirm that the DNN model produces the poorest predictions, while the four MDL models exhibit similar predictive performance, consistent with the conclusions drawn from the parity plots. It is worth noting that the accuracy of concrete FTD predictions is also influenced by sample category and damage level.

Fig. 4: Parity plot and Taylor diagram of the DNN/MDL model on the test set.
figure 4

a–e represent the parity plots of the MLR prediction results on the test set for the DNN, LSTM, GRU, LSTM-SA, and MSA models, respectively. f Taylor diagram of MLR predictions for the different models. g–k represent the parity plots of the RDME prediction results on the test set for the DNN, LSTM, GRU, LSTM-SA, and MSA models, respectively. l Taylor diagram of RDME predictions for the different models.

Figure 5a–f shows the FTD prediction results for samples of different concrete types. The results show that the DNN model exhibits reduced accuracy for NC samples in MLR predictions compared with FRC and RAC specimens (Fig. 5a–c), while demonstrating particular limitations in RDME predictions for RAC samples (Fig. 5d–f). This phenomenon can be attributed to the presence of multiple baseline groups across various experiments in the NC samples, where variations in experimental methods increase data heterogeneity. Additionally, the unique pore structure of recycled aggregates and the mortar–aggregate interfacial transition zone in RAC samples significantly affect RDME predictions49. Because these factors are typically conveyed in textual form, the DNN model struggles to capture a consistent FTD pattern. Conversely, the MDL model not only improves prediction accuracy across different sample types, but also effectively mitigates discrepancies in FTD predictions. It is noteworthy that although manually incorporating the fiber category feature into the dataset allowed the DNN model to achieve higher prediction accuracy for FRC samples than for other sample types, this improvement relied on extensive manual effort. In contrast, the NLP approach integrated into the MDL model proves more efficient and cost-effective.

Fig. 5: Prediction results of the DNN/MDL model for samples of different categories and freeze-thaw damage levels.
figure 5

a–c represent the absolute errors, root mean square errors (RMSE), and coefficients of determination (R²) for mass loss rate (MLR) prediction by the different models on normal concrete (NC), fiber-reinforced concrete (FRC), and recycled aggregate concrete (RAC) samples. d–f represent the absolute errors, RMSE, and R² for RDME prediction by the different models on NC, FRC, and RAC samples. g–i represent the absolute errors, RMSE, and R² for MLR prediction by the different models for samples with absolute mass change rates <2.5% and >2.5%. j–l represent the absolute errors, RMSE, and R² for RDME prediction by the different models for samples with RDME <0.8 and >0.8.

Figure 5g–l shows the prediction results of samples with different degrees of FTD. As the severity of FTD in concrete intensifies (manifested as an increase in MLR or a decrease in RDME), the prediction accuracy of the DNN model declines significantly. Particularly for samples with RDME below 0.8, the predictive performance of the DNN model is even inferior to that of simple mean or median prediction methods. A possible explanation is that severely freeze–thaw damaged samples constitute a relatively small fraction of the training dataset, and the degradation mechanism in high-damage states exhibits pronounced nonlinear characteristics, making it challenging for the DNN model to capture the complex evolution of damage effectively. Compared to the DNN model, the MDL model demonstrates superior performance in predicting severely freeze–thaw damaged samples, primarily because pore structure parameters are key determinants of FTD. Moreover, the morphological features in the textual data record the evolution patterns of the structure, and the freeze-thaw mechanism analysis section discusses in detail the influence of mineral admixtures, fibers, and nanomaterials on the pore structure. By integrating multi-source data features, the MDL model gains essential physical insights, thereby enhancing its ability to predict samples with severe FTD.

Visual analysis of MDL models

The above research results indicate that text input plays a universally important role in predicting concrete FTD. However, the internal mechanism by which the MDL model processes textual data remains largely a “black box.” To address this, we propose a visualization method that quantifies the weight of each token within the MDL model, thereby unveiling its internal operations. Given that both LSTM and GRU are sequential models in which each token induces changes in the hidden state (h), we adopt variations in Manhattan distance and cosine similarity between consecutive hidden states as quantitative indicators. A larger value indicates a more drastic change in the hidden state, implying a greater influence of that token on the model’s output. For the LSTM-SA and MSA models, differences in token importance are uncovered by calculating the attention weights assigned to each token.

Figure 6 presents the visualization results of the LSTM model for text processing. The dimensions of the concrete specimen and the term “temperature sensor” are assigned high weights (Fig. 6d), indicating that the model effectively captures key experimental details. “recycled aggregate” and its abbreviation “rca” also exhibit high weights, reflecting the model’s ability to discern category differences between this sample and others. Despite differences in the formatting of key phrases, the model can still identify their similarity and significance. Real-world data often contain non-standardized inputs, and formatting issues may persist even after cleaning50. This fault tolerance greatly enhances the model’s robustness in handling imperfect or noisy text.

Fig. 6: Visualization results of text processing using the LSTM model.
figure 6

a Weight heatmap of the experimental method text. b Weight heatmap of the morphological description text. c Weight heatmap of the freeze-thaw mechanism text. d Color-mapped word cloud of the experimental method. e Color-mapped word cloud of the morphological description. f Color-mapped word cloud of the freeze-thaw mechanism.

Additionally, “c-s-h” and “fine particles” are assigned high weights (Fig. 6e), as both play a crucial role in improving the pore structure of recycled aggregate. The model also emphasizes words such as “slag”, “resistance”, “beneficial”, and “denser” (Fig. 6f), suggesting that it recognizes the positive effect of slag on the freeze–thaw resistance of RAC. It is worth noting that the weight assigned to “fly ash” is significantly lower than that of “slag” (Fig. 6f), and only slag was used in this sample’s mixing ratio. This indicates that, although text and numerical data are processed independently, the model can still identify potential connections between them. As previous studies have shown, RCA tends to accelerate mass loss during freeze–thaw cycling due to its high water absorption and microcrack susceptibility, whereas slag contributes to pore refinement and improved frost resistance51,52. The emphasis of these mechanisms by the MDL model indicates its ability to capture degradation pathways highly relevant to material performance.

Figure 7 presents the visualization results of the MSA model for text processing. The results indicate that the model not only captures microscopic features such as the reduction of internal C-S-H, the decrease in cohesion between mortar and coarse aggregates, and the deterioration of the interfacial transition zone (Fig. 7c) but also identifies the specific impact of waste bricks on concrete FTD from textual data. Specifically, the high attention weights assigned to “waste brick” and “unsuitable” suggest that the model recognizes a potential decline in concrete’s frost resistance due to the presence of waste bricks. Previous studies have shown that the interior of waste brick coarse aggregate (WBCA) is loose and porous, with interconnected capillary pores that promote water ingress and intensify freeze–thaw damage, thereby reducing frost resistance53. Conversely, WBCA also contains closed pores that function like air voids, mitigating ice crystallization and enhancing frost durability54. In particular, the model attends to the term “pre-wetting,” a factor known to reduce the closed-pore content in WBCA and consequently lessen its positive contribution to frost resistance (Fig. 7d). This demonstrates that textual data convey critical information that is difficult to capture through other means; by integrating textual and numerical data, the MDL model facilitates more scientific and precise predictions of concrete FTD. Additional evidence is provided in the supplementary materials.

Fig. 7: Visualization results of text processing using the MSA model.
figure 7

a Weight heatmap of the morphological description text. b Weight heatmap of the freeze-thaw mechanism text. c Color-mapped word cloud of the morphological description. d Color-mapped word cloud of the freeze-thaw mechanism.

MDL framework for FTD prediction: contributions, challenges, and future directions

Despite the promising results, several challenges and limitations remain, warranting discussion to contextualize our findings and clarify our contributions. Future work will aim to address these open questions.

The construction of a multimodal dataset that integrates numerical mix-design variables with textual descriptions of freeze–thaw conditions, testing methods, and degradation mechanisms constitutes a key contribution of this study. This integration enables the combined use of structured and unstructured information for durability prediction. However, the dataset remains constrained by limitations in literature collection and data heterogeneity. Restricted access to certain sources and reliance on keyword-based retrieval may have resulted in incomplete coverage of relevant studies, while variations in terminology and reporting standards across publications reduce consistency. Moreover, manual curation of textual information inevitably introduces subjectivity. Future research should adopt more comprehensive literature-mining strategies, such as crawler- or embedding-based retrieval combined with ontology-driven annotation, to broaden coverage and improve reproducibility. Additionally, enriching the dataset with field monitoring data, simulation outputs, multi-scale characterizations, and image-based information (e.g., microstructural or morphological data) would further enhance its representativeness and utility.

We also developed a novel multimodal deep learning framework that integrates numerical parameters with NLP-encoded textual information to predict freeze–thaw damage in concrete. This framework demonstrates the potential of combining heterogeneous data sources to improve both predictive accuracy and interpretability. Nonetheless, the current architecture remains relatively simple, which constrains its ability to capture complex feature interactions, reduces generalizability, and precludes transfer learning. In addition, the visualization methods provide only indirect proxies of the decision-making process and cannot fully reveal the internal reasoning of the models. Future work will therefore focus on strengthening the structural design of the framework, incorporating physics-informed constraints to ensure consistency with established degradation mechanisms, and extending its applicability to other durability challenges such as chloride ingress, sulphate attack, and carbonation.

This study established a comprehensive dataset encompassing concrete mix ratios, freeze–thaw media, experimental methods, morphological descriptions, and freeze–thaw mechanisms. A novel MDL model integrating NLP and DNN was then proposed to predict the MLR and RDME of concrete under freeze-thaw conditions. A visualization method was developed to explain how the MDL model identifies key information in the text. This innovative framework offers a new solution for predicting the durability damage of concrete exposed to freeze–thaw environments. The main conclusions of the study are as follows:

(1) By integrating fully automated NLP techniques with DNN, this study overcame the limitations of manual text data processing. Textual information, such as freeze-thaw environments, experimental methods, morphological descriptions, and freeze-thaw mechanisms, has been fully utilized. With the incorporation of this prior knowledge of the FTD process, the MDL models demonstrated a marked improvement in prediction accuracy for samples with severe FTD compared to the conventional DNN model. Among the four MDL models, the MSA model improved prediction accuracy for MLR by 8% and for RDME by 21%.

(2) Visual analyses indicate that the MDL model can effectively capture critical textual information, such as the various types of aggregates and the microscopic morphological changes occurring during the freeze–thaw process. Moreover, it reveals latent relationships between this textual information and the numerical inputs, which to some extent elucidates the decision mechanism of the MDL model and offers a novel solution for the scientific prediction of concrete FTD.

Methods

Establishment of the dataset

The dataset used in this study was compiled from 44 peer-reviewed articles published between 2011 and 2024 (refs. 22,55–97), primarily in leading journals such as Construction and Building Materials and Case Studies in Construction Materials. The inclusion criteria required that each article reported both MLR and RDME under freeze–thaw testing and provided microscopic morphology images to support mechanistic interpretation. This ensured the consistency and comparability of the dataset. The final dataset consists of 18 numerical variables, four textual variables, and two output variables. The textual data were extracted from relevant sections of the papers, and a typical example of the textual features is presented in Table 2.

Table 2 Examples of textual data types and content in the dataset

In certain contexts, replacing different fiber types with numerical codes rather than applying one-hot encoding may be more efficient, as it avoids the high-dimensional sparsity problem typically encountered when using one-hot encoding for 11 distinct fiber types. To improve scientific rigor and analytical efficiency, the dataset was segmented into five categories based on variations in material composition: normal concrete (NC), fiber-reinforced natural aggregate concrete (FRC), recycled aggregate concrete (RAC), nano-reinforced concrete (NRC), and fiber-reinforced recycled aggregate concrete (FRRAC). Consequently, a new categorical feature column was incorporated into the dataset to represent these groupings. Notably, this column will maintain its original non-numeric format, as it is intended exclusively for use as an identification label in subsequent stratified sampling and other targeted analyses.

Due to the variability in raw material composition among samples, the dataset includes instances of missing data. For instance, the VF feature column in RAC-class samples has missing values. In this study, all missing data are uniformly imputed with zeros. Additionally, some samples lack values for the output variables, indicating that these two critical outputs were not always fully available in the source publications. To ensure data consistency and completeness, samples missing these essential output variables are removed from the dataset. After these adjustments, the dataset comprises 1851 samples.
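A brief pandas sketch of this cleaning step, assuming hypothetical file and column names (e.g., "MLR", "RDME"):

```python
import pandas as pd

df = pd.read_csv("ftd_dataset.csv")          # hypothetical file name

# Remove samples that lack either output variable (MLR or RDME) ...
df = df.dropna(subset=["MLR", "RDME"])

# ... and impute the remaining missing raw-material values (e.g., VF) with zeros
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(0)
```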

For the textual data, the initial step is to standardize all text by converting it to lowercase, thereby enabling the model to disregard case differences and accurately recognize words and their synonyms. Subsequently, all punctuation marks, such as commas, colons, and periods, are removed, since they generally do not contribute meaningful information to the model’s understanding.
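A minimal sketch of this text-cleaning step; the exact set of punctuation marks removed is an assumption:

```python
def clean_text(text: str) -> str:
    """Lowercase the text and strip punctuation such as commas, colons, and periods."""
    return text.lower().translate(str.maketrans("", "", ",.:;!?\"'()"))

clean_text("Surface scaling increased; coarse aggregate was exposed.")
# -> 'surface scaling increased coarse aggregate was exposed'
```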

Deep Neural Network (DNN) and Multimodal deep learning (MDL) models

The DNN model is a type of machine learning model typically composed of a multi-layer architecture in which each layer functions as a nonlinear information processing unit. By simulating the behavior of biological neurons, the model effectively handles complex nonlinear input features and produces accurate predictions of output variables. In this study, we utilized the Keras98 library to construct and train a basic DNN model, along with four MDL models.

In deep learning, a model’s generalization ability is commonly evaluated by partitioning the dataset into a training set and a test set. Parameter optimization is performed on the training set, while the test set is used to assess the model’s generalization capability. We employed the StratifiedShuffleSplit method from the Scikit-learn library to divide the dataset into an 80:20 training-to-test ratio. Notably, stratification was based on manually created concrete category features to ensure that the distribution of different concrete categories in both sets reflected that of the original dataset. Under conditions of limited data, 5-fold cross-validation is critical for stable and reliable model evaluation.
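A sketch of the stratified 80:20 split, assuming the dataset is held in a DataFrame df with a "category" column for the concrete type labels (names and random seed are illustrative):

```python
from sklearn.model_selection import StratifiedShuffleSplit

# Stratify on the concrete-category column so the 80:20 split preserves the
# proportions of NC, FRC, RAC, NRC, and FRRAC in both subsets.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, df["category"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
```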

Normalization eliminates the dimensional differences among input variables, thereby accelerating the convergence of machine learning models. In this study, the input variables in the training set were scaled using the following formula:

$$Y=\frac{X-\mu }{\sigma }$$
(1)

where Y and X represent the scaled and original data value, respectively; μ is the mean of the original dataset; and σ is the standard deviation of the original dataset. After standardizing the training set, a trained scaler was obtained and subsequently applied to standardize the test set. This approach ensures that the distribution information of the test data is not prematurely exposed during standardization, thereby maintaining fairness and scientific rigor in evaluating the model’s generalization ability.
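Equivalently, with scikit-learn's StandardScaler (a sketch; X_train and X_test denote the numerical input matrices):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()                        # implements Y = (X - mu) / sigma
X_train_scaled = scaler.fit_transform(X_train)   # mu and sigma learned from the training set only
X_test_scaled = scaler.transform(X_test)         # the same fitted scaler is reused on the test set
```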

Selecting appropriate hyperparameter combinations is crucial for constructing a high-performance DNN. In this study, we defined a hyperparameter search space that included the number of hidden layers (three or four), the number of nodes per layer (16, 32, 64, or 128), dropout rate (0.2, 0.3, or 0.4), batch size (64 or 128), number of training epochs (500, 700, or 900), and learning rate (0.01, 0.001, or 0.0005). Although conventional hyperparameter optimization methods—such as grid search, random search, and Bayesian optimization—offer certain advantages in specific applications, the Tree-structured Parzen Estimator (TPE) algorithm has proven more efficient for optimizing large-scale parameter spaces99. Therefore, we employed the TPE method from the Optuna100 library for hyperparameter optimization, with the objective of iteratively minimizing the MAE. The final DNN hyperparameter configuration consisted of three hidden layers with 128, 64, and 32 nodes, a dropout rate of 0.2, the Adam101 optimizer for training, a learning rate of 0.001, a batch size of 64, and 700 training epochs.
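The search could be set up in Optuna roughly as follows; the trial budget and the helper that trains and scores one configuration are assumptions, while the search space and the MAE objective match the description above:

```python
import optuna

def build_and_evaluate(params):
    """Hypothetical helper: build the model with `params`, train it on the
    training folds, and return the validation MAE."""
    ...

def objective(trial):
    params = {
        "n_layers":   trial.suggest_categorical("n_layers", [3, 4]),
        "units":      trial.suggest_categorical("units", [16, 32, 64, 128]),
        "dropout":    trial.suggest_categorical("dropout", [0.2, 0.3, 0.4]),
        "batch_size": trial.suggest_categorical("batch_size", [64, 128]),
        "epochs":     trial.suggest_categorical("epochs", [500, 700, 900]),
        "lr":         trial.suggest_categorical("lr", [0.01, 0.001, 0.0005]),
    }
    return build_and_evaluate(params)   # MAE to be minimized

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=100)   # trial budget is an assumption
print(study.best_params)
```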

In addition, this study uses MAE, RMSE, and R² as evaluation metrics to assess the prediction accuracy of the DNN model and the four MDL models on both the training and testing datasets, in order to detect underfitting or overfitting. The MAE, RMSE, and R² are obtained as follows:

$$MAE=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}|{y}_{i}-{\hat{y}}_{i}|$$
(2)
$$RMSE=\sqrt{\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}{({y}_{i}-{\hat{y}}_{i})}^{2}}$$
(3)
$${R}^{2}=1-\frac{{\sum }_{i=1}^{n}{({y}_{i}-{\hat{y}}_{i})}^{2}}{{\sum }_{i=1}^{n}{({y}_{i}-\bar{y})}^{2}}$$
(4)

where yi represents the true value of the output variable in the dataset, \({\hat{y}}_{i}\) is the predicted value from the model, and \(\bar{y}\) is the mean of the output variable in the dataset.
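These three metrics map directly onto standard scikit-learn/NumPy calls, for example:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    """Return MAE, RMSE, and R² as defined in Eqs. (2)-(4)."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    return mae, rmse, r2
```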

Visualization method

For both LSTM and GRU architectures, the hidden state h serves as the final output. By quantifying the changes in h induced by each token, we can pinpoint the model’s focus on the text. We measure these variations per word using two metrics. Specifically, the Manhattan distance (MD) captures the cumulative absolute differences across dimensions between consecutive hidden states, reflecting overall activation fluctuations, while the cosine distance (CD) measures changes in the orientation of hidden state h within the semantic space. The formulas for MD and CD are as follows:

$$M{D}_{j}=\mathop{\sum }\limits_{i=1}^{n}|{h}_{j}^{(i)}-{h}_{j-1}^{(i)}|$$
(5)
$$C{D}_{j}=1-\frac{{h}_{j}\cdot {h}_{j-1}}{\Vert {h}_{j}\Vert \Vert {h}_{j-1}\Vert }$$
(6)

where MDj represents the overall absolute change in the hidden state vector at the jth token, and CDj represents the angular difference (direction change) between the hidden state vectors at the jth and (j − 1)th tokens. Here, hj denotes the hidden state vector at the jth token, hj−1 denotes the hidden state vector at the (j − 1)th token, hj(i) represents the ith component of the hidden state vector at the jth token, and n is the total number of dimensions of the hidden state vector.

To eliminate scale differences among these metrics, the Manhattan and cosine distance (MD and CD) values are first standardized and then normalized to a fixed range [0,1] to facilitate the generation of a color-mapped word cloud. In this visualization, the sum of the normalized MD and CD—computed separately for the forward and backward directions of the bidirectional LSTM—determines the color mapping, effectively highlighting the tokens that the model focuses on.
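A compact NumPy sketch of the token-importance computation (Eqs. (5) and (6)) and the subsequent [0, 1] normalization; hidden_states is assumed to be the per-token hidden-state sequence of one direction of the recurrent layer (obtained, e.g., with return_sequences=True):

```python
import numpy as np

def token_importance(hidden_states):
    """Per-token Manhattan distance (Eq. 5) and cosine distance (Eq. 6)
    between consecutive hidden states.

    hidden_states: array of shape (num_tokens, hidden_dim).
    """
    md, cd = [], []
    for j in range(1, len(hidden_states)):
        h_j, h_prev = hidden_states[j], hidden_states[j - 1]
        md.append(np.sum(np.abs(h_j - h_prev)))                                # Eq. (5)
        cos_sim = np.dot(h_j, h_prev) / (np.linalg.norm(h_j) * np.linalg.norm(h_prev))
        cd.append(1.0 - cos_sim)                                               # Eq. (6)
    return np.array(md), np.array(cd)

def min_max(values):
    """Normalize token scores to [0, 1] for the color-mapped word cloud."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min() + 1e-12)
```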