Introduction

The continuous increase in computing resources has encouraged the development and application of high-performance computing (HPC) and artificial intelligence (AI) in a variety of scientific disciplines, ranging from medicine and economics to social sciences and engineering. Artificial intelligence is becoming increasingly important in seismology, where it is widely employed, for example, to analyze seismic signals, build early warning systems, and locate earthquake sources. Interested readers can refer to the numerous review papers available in the literature (see, for example, 1,2,3).

Physics-based simulations (PBS) represent a valuable alternative to empirical ground motion equations for computing the seismic wavefield and, thus, for calculating any ground motion intensity measure (IM) of interest, such as peak ground velocity (PGV) and peak ground acceleration (PGA). IMs are essential parameters in seismic hazard analysis and are often used for seismic risk assessment and prevention. Recently, in the same spirit as this work, an unconventional approach to defining fragility curves starting from IMs obtained from PBS was proposed in 4. The present paper extends the framework of 4, but relies predominantly on machine learning techniques rather than traditional fragility curve formulations, with the aim of enhancing the accuracy of building damage estimation. After a strong earthquake, once the emergency phase has passed, it is necessary to assess the impact of the earthquake on the building heritage and quickly identify buildings that are habitable or only lightly damaged, which can be made available to the population. On the other hand, another fundamental aspect of seismic risk prevention and assessment is the identification of the geographical, geological, and constructional characteristics that make a building type more vulnerable to ground shaking. Currently, the assessment of the possible impact of earthquakes on building damage is carried out via the above-mentioned fragility curves, which assign the probability that a building will exceed a certain level of damage. The "classical" methods for defining fragility functions are usually based on IM values obtained using ground motion models (GMMs) 5 or on ShakeMap 6. More recently, an alternative approach based on machine learning has been proposed. In 7, the authors used ML-based supervised classifiers to assess the damage level of 2276 buildings after the 2014 South Napa 6.1 Mw earthquake. Their dataset contains many different features, such as the number of stories, building-fault distance, Vs\(_{30}\), spectral acceleration (SA) at 0.3 s, and the age of construction, value, cost, size, and regularity of each building. The target variable is the level of damage, assigned as low, medium, or high. Other interesting applications can be found in 8,9 and 10.

Concerning the L'Aquila 2009 earthquake, the authors of 11,12 propose a comparative study using the same dataset as in this paper. In 13, a random forest-based classifier was employed to analyze the dataset introduced and described in 14.

In this paper, we employ a combination of machine learning tools and physics-based simulations to assess the damage level in the built-up area after an earthquake, aiming to support both hazard mitigation efforts and the post-earthquake emergency response. As a case study, we consider the L'Aquila earthquake (6\(^{th}\) April 2009), the first of a long and destructive series of seismic events that hit central Italy between 2009 and 2017 15,16,17,18,19,20,21,22,23,24,25,26.

There are at least two reasons why the L'Aquila earthquake is particularly suitable for this purpose. First, the main shock was significantly stronger than the largest aftershock, which had a magnitude of 5.4 Mw; it is therefore reasonable to assume that the recorded damage is mostly due to the 6.1 Mw event. The second reason relates to the availability of geological data. Thanks to the extensive microzonation work carried out by the Italian Civil Protection Agency, in-depth knowledge of the territory's geology is available, which is essential for constructing an accurate computational domain. Furthermore, during the L'Aquila Opendata project (https://www.opendatalaquila.it/), data were collected and made available, allowing for the construction of a dataset containing approximately 3,000 buildings, already used and validated in 11. We also observe that the building typologies found in the historic centre of L'Aquila are representative of many other Apennine locations; therefore, the methodology presented in this study could be generalized to these areas, provided that the necessary data are available. However, extending this approach to seismological regimes or reconstruction frameworks other than those analyzed in this work remains more complex and would require further assessment. This paper consists of two main parts. In the first, we use the numerical code SPEED (https://speed.mox.polimi.it/) 27 to simulate the 2009 earthquake and obtain synthetic values of PGV and PGA. In the second, we use the simulated PGA and PGV as predictive variables for an AI-based model that assesses the level of damage to buildings. In the absence of PBS, IMs can be obtained by interpolating recorded values (with a level of uncertainty that increases with distance from the recording station) or by using simplified models such as ground motion models 28. The ShakeMap repository (https://shakemap.ingv.it) combines observed ground motion values and predictive relations to provide regional and local shake maps 29,30. Here, both the simulated IMs and those inferred from ShakeMaps are used as predictive variables, and their impact on determining the level of damage is assessed with the aid of AI-based techniques.

The paper is organized as follows. In the next section, we describe the computational domain and validate the numerical results by comparing recorded and simulated waveforms. In Section 3, we present the working dataset and introduce the dataset preparation and the ML tools used for our analysis. In Section 4, we present and discuss our results. Finally, in the last section, we summarize our findings and outline further developments of this work.

Numerical model and sensitivity analysis

In this work, we consider a three-dimensional (3D) computational domain that extends 59.5 km in width, 57.5 km in length, and 19.8 km in depth, centered around the city of L’Aquila.

The domain was built using a Python tool introduced in 31 and the TINITALY database for topographic data 32,33,34. The CUBIT software (https://coreform.com/coreform-cubit/) was used to create a mesh containing 776,426 elements, with sizes ranging from 130 m (top layer) to 1 km (bottom layer). The maximum achieved frequency resolution is around 2–2.5 Hz, with a polynomial degree equal to 3. The fault plane extends 28 km in length and 20.9 km in width, with a strike of 133\(^\circ\), a dip of 54\(^\circ\), and a rake of -102\(^\circ\). The hypocenter position was obtained from the epicenter position (42.34 Lat–13.38 Lon) and the fault geometry. The adopted seismic source, reconstructed from seismographic data after the earthquake 19, has already been successfully employed in 35,36,37.

In addition to the Plio-Quaternary sedimentary basin named Media Valle dell'Aterno, already included in 36, we added nine smaller basins with a maximum depth of approximately 150 m. Figure 1 shows the computational domain (top view), where the main basin and the nine additional basins are visible (all highlighted using a color scale ranging from purple to white, representing the thickness). Specifically, we refer to the main basin as the area that contains the seismic stations AQG, AQV, AQK, AQA, AQU, and GSA, whereas the nine additional basins are the highlighted areas disjoint from the main basin.

Fig. 1

This figure shows the computational domain, including the main basin, named Media Valle dell'Aterno, and nine smaller sedimentary basins with a maximum depth of 150 m. We refer to the main basin as the highlighted area (color scale ranging from white to peach) that contains the seismic stations AQG, AQV, AQK, AQA, AQU, and GSA, whereas the nine additional basins are the highlighted areas (in purple) disjoint from the main basin. The epicenter of the 6\(^{th}\) April 2009 L'Aquila earthquake is marked with a red star. (This figure was created using QGIS (https://qgis.org/) version 3.40.3, license GPL v.2+.)

The shallow structures of the secondary basins are geologically similar to the first 150 m of the main basin. Therefore, it is reasonable to assume that the velocity and density profiles of all considered basins are the same. For the numerical simulations, as in 36, we considered a four-layer computational domain. However, in the present study, we employed two distinct models for the mechanical properties of the topmost layer. Specifically, in the first model, as in 35, we assumed that within the basins both \(V_S\) and \(V_P\) depend only on the depth z as follows:

$$\begin{aligned} V_S = 300 + 36 \cdot z^{0.43} (\textrm{m}/\textrm{s}), \; V_P = 2.14 V_S, \; Q_S = 0.10 V_S, \; \rho = 1.9 (\textrm{g}/\textrm{cm}^3). \end{aligned}$$
(1)

Here \(Q_S\) is the S-wave quality factor and \(\rho\) the soil mass density. In contrast, for the outcropping bedrock we considered constant \(V_S\) values, as reported in 35. Hereafter, we refer to this model as the non-improved bedrock case. For the second topmost-layer model, as described in 37, we assume a depth-dependent shear-wave velocity in the outcropping bedrock, in order to obtain more realistic Vs\(_{30}\) values. Namely, we set:

$$\begin{aligned} V_S = 800 + 28.4\cdot z^{0.5} (\textrm{m}/\textrm{s}), \; V_P = 1.86 V_S, \; Q_S = 0.10 V_S, \; \rho = 2.2+9.5\cdot z^{0.5} (\textrm{g}/\textrm{cm}^3). \end{aligned}$$
(2)

Hereafter, we refer to this second model as the improved bedrock case. For the other three layers, the mechanical properties were taken from 35 and are the same as those already used in 36.
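For concreteness, the two topmost-layer models of Eqs. (1) and (2) can be evaluated with a few lines of Python; the sketch below simply transcribes the formulas as written (depth z in metres; the function names are ours):

```python
import numpy as np

def basin_profile(z):
    """Eq. (1): properties inside the sedimentary basins (z in m)."""
    vs = 300.0 + 36.0 * np.asarray(z) ** 0.43   # shear-wave velocity (m/s)
    return vs, 2.14 * vs, 0.10 * vs, 1.9        # V_P (m/s), Q_S, rho (g/cm^3)

def improved_bedrock_profile(z):
    """Eq. (2): improved outcropping-bedrock properties (z in m)."""
    z = np.asarray(z)
    vs = 800.0 + 28.4 * z ** 0.5                # shear-wave velocity (m/s)
    rho = 2.2 + 9.5 * z ** 0.5                  # density, as written in Eq. (2)
    return vs, 1.86 * vs, 0.10 * vs, rho

# Example: basin V_S at 0, 30, and 150 m depth
print(basin_profile(np.array([0.0, 30.0, 150.0]))[0])
```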

To evaluate the contribution of the nine small basins and of the improved bedrock in the top layer, we considered four different computational domains. The first and second domains, named T1 and T2 respectively, contain only the main basin, already included in 36. In T1, we assume constant values of \(V_S\), \(V_P\), and \(\rho\) as in 36, while in T2 we use the improved bedrock mechanical properties as in 37. Domains T3 and T4 contain the nine minor basins in addition to the main basin. In T3, we assume constant values of \(V_S\), \(V_P\), and \(\rho\) as in 36, while in T4 we use the improved bedrock model. The above is summarised in Table 1 for the reader's convenience:

Table 1 Computational domains used for the numerical simulations.

In the following, we compare the simulated waveforms obtained for the four computational domains with the available recorded data. Among the several seismic stations that recorded the 2009 event, only seven fall within our computational domain, as shown in Figure 1. Three stations, AQK, AQU, and AQG, located in areas of high population density, are used to validate our computational models. In particular, AQK and AQU are situated close to the centre of L'Aquila (1.8 and 2.2 km from the epicentre, respectively), while AQG is located on the western outskirts of the city, 5 km from the epicentre.

In Figure 2, the north-south (NS), east-west (EW), and up-down (UP) components of the synthetic seismograms are compared with the corresponding recorded data.

To estimate the agreement between the recorded and simulated waveforms, we computed the normalized cross-correlation (NCC) values reported in Table 2 for stations AQK, AQU, and AQG and components EW, NS, and UP. The analysis of the NCC values reveals specific trends across the monitored stations. For AQK and AQU, the highest NCC values are achieved using domains \(T_2\) and \(T_4\) for the EW and NS components, while the best agreement for the vertical component (UP) is observed for domains \(T_1\) and \(T_3\). In contrast, station AQG consistently shows the highest NCC values for domains \(T_1\) and \(T_3\) across all three components (EW, NS, and UP).
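As an illustration, one common way to compute such NCC values between a recorded and a simulated trace (sampled on the same time grid) is sketched below; since the text does not state whether the NCC is taken at zero lag or at the best-matching lag, taking the peak over lags is our assumption:

```python
import numpy as np

def ncc(rec, sim):
    """Peak zero-normalized cross-correlation between two traces."""
    rec = (rec - rec.mean()) / (rec.std() * len(rec))
    sim = (sim - sim.mean()) / sim.std()
    return float(np.max(np.correlate(rec, sim, mode="full")))

# Identical traces give NCC = 1 (up to floating-point error).
t = np.linspace(0.0, 20.0, 2001)                          # 20 s window
print(ncc(np.sin(2 * np.pi * t), np.sin(2 * np.pi * t)))  # ~1.0
```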

Since no simulation scenario is clearly better than the others, we observe that domains \(T_2\) and \(T_4\) yield the highest NCC values at stations AQK and AQU in two out of the three components (EW and NS). Given that the numerical differences between these values are minimal, domain \(T_4\) was selected for all subsequent analyses, as it represents the most comprehensive model.

Finally, it should be emphasized that the agreement between recorded data and simulations is quite satisfactory, with the exception of the NS component at the AQG station. In line with previous findings 38, the mismatch in the AQG NS component may be due to inaccuracies in the local geological model which, although detailed, is not able to capture some localised site amplification effects.

Fig. 2

Comparison between the recorded waveforms (in black) at AQK, AQU, and AQG and the simulated ones. The four tests are indicated as follows: T1-red, T2-blue, T3-magenta, T4-green. The horizontal axis reports the simulation time (20 s); the vertical axis reports the displacement (DIS-EW, DIS-NS, DIS-UP) in cm for the three spatial components of the simulated seismograms.

As mentioned in the introduction, the PBS model can be useful for studying the behavior of seismic waves and simulating ground motion; however, its maximum frequency resolution is quite low, typically ranging from 1 to 3 Hz. This frequency limit leads to low-quality simulated data for high-frequency ground motion parameters, such as PGA. To obtain the high-frequency component of the spectrum, a hybrid approach can be used that combines PBS with empirical or data-driven methods, such as Green's functions, stochastic models, or deep learning techniques 38,39,40,41. In this work, we employ the ANN2BB tool to generate broadband ground motions with a suitable frequency content (see 42) and, thus, to compute the PGV and PGA datasets used as input for the AI-based tool. This approach, based on artificial neural networks (ANNs), learns a correlation between long- and short-period spectral ordinates, trained on strong motion records. With this technique, starting from PBS outputs, one can produce synthetic signals with broadband content.
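Once broadband signals are available, extracting the two IMs is straightforward; a minimal post-processing sketch (our own, not the ANN2BB internals) is:

```python
import numpy as np

def peak_ground_motion(acc, dt):
    """PGA and PGV from an acceleration time history (m/s^2, time step in s)."""
    pga = float(np.max(np.abs(acc)))
    # Trapezoidal integration of acceleration to obtain velocity
    vel = np.concatenate(([0.0], np.cumsum(0.5 * (acc[1:] + acc[:-1]) * dt)))
    pgv = float(np.max(np.abs(vel)))
    return pga, pgv

dt = 0.01
t = np.arange(0.0, 20.0, dt)
acc = 0.5 * np.sin(2 * np.pi * 1.5 * t) * np.exp(-0.2 * t)  # toy accelerogram
print(peak_ground_motion(acc, dt))
```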

Table 2 Normalized cross correlation (NCC) computed between the simulated and recorded displacements for the stations AQK, AQU and AQG in the three components (EW, NS and UP) reported in Figure 2.

Dataset preparation

In this work, we used supervised machine learning (ML) techniques to train, validate, and test a model that can assess the damage levels of buildings after a major seismic event. The working dataset is the same as that used in 11 (see also 43, where similar datasets are presented), composed of 3060 buildings located in the area of the L'Aquila 2009 earthquake, enriched with the PGA and PGV values provided by the physics-based simulations on domain T4, as described in the previous section. In the following, each building is described by 20 predictive variables, divided into:

  • Building features: construction techniques (C), aggregation type (C), position (with respect to the aggregate) (C), number of units (in the aggregate) (N), height (N), surface aggregate area (N), mean area (N), number of vertices (in the aggregate) (N), and age (C).

  • Geophysical features: geographical coordinates (WGS84/UTM 33 N) (N), distance from the epicenter (N), distance from the depocenter (N), peak ground velocity (N), peak ground acceleration (N), time-averaged shear-wave velocity to 30 m of depth (Vs30) (N), coefficient of stratigraphic amplification (N), coefficient of topographic amplification (N), slope (N), and maximum design acceleration value (N).

Labels C and N indicate categorical and numerical variables, respectively.

For the peak ground velocity and peak ground acceleration, we considered both the values provided by the ShakeMap platform 29, named PGV and PGA, and those provided by the PBS, from now on PGV\(_{PBS}\) and PGA\(_{PBS}\), respectively.

To ensure compatibility with the Random Forest algorithm, ordinal categorical variables (such as building age, soil morphology, and damage level) were transformed into numerical values. For the remaining categorical features, a one-hot encoding scheme was applied. The dataset is divided into training, validation, and test sets, and the numerical variables are normalized. As usual, the normalization parameters are fitted on the training set only and then applied to the test set to prevent data leakage.

The dataset must also be free of missing values (NaN). In our case, the only variable containing NaNs is the age of construction ('Age' in the list of features), with 528 missing values, a significant number compared to the size of the dataset. In 11, the buildings with missing values were removed, significantly reducing the dataset size. In this study, instead, the missing or incorrect values were estimated and included in the dataset using a technique known as data imputation. If the fraction of missing values is less than 5\(\%\), it is usually sufficient to substitute them with the mean or mode of the variable. If the fraction is between 5\(\%\) and 20\(\%\), as in our case, a more complex approach should be adopted, one that takes the entire dataset into account rather than just the columns containing missing values.

On the other hand, when the percentage of missing data exceeds 20\(\%\), it is recommended to use more advanced techniques, such as those based on artificial neural networks. In this work, the KNNImputer, as implemented in Scikit-learn (https://scikit-learn.org/stable/), is used for data imputation. We emphasize that data imputation is carried out only on the training and validation sets, while the test set has not been modified, to guarantee the reliability of the case study presented at the end of this section. Furthermore, only a restricted subset of predictive variables is included in the imputation, excluding the target variable and all features related to the seismic event, such as the epicenter distance and all the intensity measures. The seismic features are clearly not correlated with the construction era of a building, while the target variable is excluded to avoid generating artificial correlations that could compromise the proper training of the RF-based models. The complete list of the features included in the data imputation procedure is reported in the Appendix, together with the distribution of the imputed variable before and after imputation. After imputing the missing values, the dataset used for training and validating the models contains 2754 data points (dp).
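A minimal sketch of the imputation step with Scikit-learn's KNNImputer is given below; the toy matrix stands in for the restricted, event-independent feature subset described above:

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 6))               # toy building-feature matrix
mask = rng.random(X_train.shape[0]) < 0.17        # ~17% missing 'Age' values,
X_train[mask, 0] = np.nan                         # roughly as in the real data

imputer = KNNImputer(n_neighbors=5)               # default neighbourhood size
X_train_imputed = imputer.fit_transform(X_train)  # training/validation only;
                                                  # the test set is left untouched
```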

The damage classification was originally based on six levels (from D0, no damage, to D5, heavy damage or collapse). However, given the size of the dataset, we define three different groupings of the damage levels, as listed below:

DS1: D0-D1 light damage (710 dp), D2-D3 moderate damage (1108 dp), D4-D5 heavy damage (936 dp)

DS2: D0-D1 light damage (710 dp), D2-D3-D4-D5 from moderate to heavy damage (2044 dp)

DS3: D0-D1-D2-D3 from light to moderate damage (1818 dp), D4-D5 heavy damage (936 dp)

The first case is similar to the one considered in 11. In the other two cases, instead, we aim to identify buildings with no or minor damage (DS2) and those with serious damage or collapse (DS3), with the aim of supporting the pre- and post-emergency phases.
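The three groupings amount to simple relabelings of the original six grades; an illustrative mapping (the integer codes are ours) is:

```python
# Illustrative relabeling of the six original damage grades D0-D5.
DS1 = {"D0": 0, "D1": 0, "D2": 1, "D3": 1, "D4": 2, "D5": 2}  # light / moderate / heavy
DS2 = {"D0": 0, "D1": 0, "D2": 1, "D3": 1, "D4": 1, "D5": 1}  # light vs. moderate-to-heavy
DS3 = {"D0": 0, "D1": 0, "D2": 0, "D3": 0, "D4": 1, "D5": 1}  # light-to-moderate vs. heavy

damage = ["D1", "D4", "D2"]          # hypothetical raw labels
print([DS3[d] for d in damage])      # -> [0, 1, 0]
```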

To properly evaluate the classifier's performance, various metrics can be employed. The one most often used for a supervised learning algorithm is the accuracy, defined as:

$$\begin{aligned} \text{ Accuracy }=\frac{TP+TN}{TP+TN+FP+FN}. \end{aligned}$$
(3)

Here, TP, FP, TN, and FN stand for true positives, false positives, true negatives, and false negatives, respectively.

Other possible metrics are recall, precision, and \(F_1\) score, which can be written as:

$$\begin{aligned} \text{ Recall }=\frac{TP}{TP+FN}, \quad \text{ Precision }=\frac{TP}{TP+FP},\quad F_1=\frac{TP}{TP+(FN+FP)/2}. \end{aligned}$$
(4)

Accuracy is a suitable metric for well-balanced datasets, whereas for strongly or moderately unbalanced datasets, the F1-score is more appropriate for evaluating the classifier's performance. As mentioned above, in this work we used the popular Random Forest (RF) algorithm, which has already been used for similar purposes in 7,8,12. RF, first introduced by Ho 44 in 1995, is a robust algorithm based on an ensemble of decision trees (DTs) that performs well on tabular data. An accurate description of the algorithm is beyond the scope of this paper; the interested reader may refer to introductory texts on machine learning available in the literature, such as 45.
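For reference, training a default RF and computing the two metrics with Scikit-learn takes only a few lines; synthetic data stands in here for the real feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for DS1: 2754 points, 11 features, 3 damage classes
X, y = make_classification(n_samples=2754, n_features=11, n_informative=6,
                           n_classes=3, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
y_hat = clf.predict(X_val)
print(accuracy_score(y_val, y_hat), f1_score(y_val, y_hat, average="weighted"))
```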

In order to reduce the size of the datasets by removing unnecessary variables and saving computational resources, we performed a preliminary analysis to evaluate the impact of each variable on determining the damage level. Figure 3 shows the importance score (obtained using a balanced random forest (BRF) model with default hyperparameter values) of the top 12 features for the three different distributions of the target variables. We immediately notice that eleven features, namely mean area, depocenter distance, epicenter distance, PGV\(_{PBS}\), PGA\(_{PBS}\), NS and EW coordinates, age, height, total area, and Vs\(_{30}\), are common to all three datasets. In all cases, the building (or aggregate) average surface area is the variable that most influences the level of damage, as stated in 11. We also emphasize that, of the two metrics characterizing the site-source distance, the depocentre distance always registers a higher score than the epicenter distance. Finally, we note that only the IMs calculated via PBS appear to have a significant effect on determining the damage level, at least for this dataset.
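The importance scores of Fig. 3 come directly from the fitted BRF model; a sketch (with placeholder feature names and default hyperparameters, as in the text) is:

```python
import pandas as pd
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in for the 20-feature dataset; column names are placeholders
X, y = make_classification(n_samples=2754, n_features=20, random_state=42)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(20)])

brf = BalancedRandomForestClassifier(random_state=42).fit(X, y)
scores = pd.Series(brf.feature_importances_, index=X.columns)
print(scores.sort_values(ascending=False).head(12))   # top 12, as in Fig. 3
```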

Fig. 3

Feature importance scoring for the three datasets DS1(a), DS2(b) and DS3(c). To improve the clarity and readability of the figures, only the top 12 scores are shown.

From now on, for the sake of simplicity, the working datasets will include only the eleven common features.

Dataset analysis and model validation

Our working datasets are moderately unbalanced, particularly DS2 and DS3, which could affect the performance of the classifier. The simplest approaches to managing an unbalanced dataset are undersampling and oversampling. In the first case, the size of the majority class is reduced until it matches that of the minority class; this approach is practicable only if the dataset is large enough and not strongly unbalanced. In this spirit, a modified version of the RF algorithm specifically designed for unbalanced datasets (BRF) will be trained and validated. Compared to traditional RF, BRF trains each tree on a balanced subset obtained by undersampling the majority classes. Alternatively, one can increase the number of elements in the minority class, i.e., oversampling. Among the most popular algorithms developed for this purpose, we cite the Synthetic Minority Over-sampling Technique (SMOTE), introduced in 46. In recent years, a hybrid approach combining oversampling techniques with specific algorithms has emerged; it is particularly useful for strongly unbalanced datasets where the minority class contains fewer than 5\(\%\) of the total elements.

In this work, we apply the SMOTE algorithm to create a balanced version of our dataset by oversampling the minority classes. A classical RF algorithm is then trained and tested on all three datasets, and the three models (RF, BRF, and RF+SMOTE) are compared. A repeated stratified K-fold cross-validation approach (with n\({\_}\)splits=10 and n\({\_}\)repeats=5) is applied to increase the robustness of the analysis and reduce overfitting. For each dataset, we report in Tables 3 and 4 the mean value and the standard deviation of the accuracy and F1-score obtained with the three proposed models. In this preliminary phase, the hyperparameters are not optimized; default values are used for each test, covering both the training/validation procedure and the data augmentation via SMOTE. All machine learning procedures, including pre-processing, balancing, and model evaluation, were implemented in Python using the Scikit-learn library, while the SMOTE algorithm was applied via the Imbalanced-learn toolbox. For the sake of completeness, it should be noted that the workflow is computationally light enough to be handled on standard workstations: it does not rely on high-performance computing (HPC) or complex optimization, making it accessible for a variety of applications.
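A minimal sketch of this validation loop for the RF+SMOTE variant is given below. The text does not state whether SMOTE is re-applied inside each fold; the pipeline form used here does so, which avoids leaking synthetic points into the validation folds:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# Synthetic stand-in with roughly the DS2 imbalance (710 vs 2044 points)
X, y = make_classification(n_samples=2754, n_features=11, weights=[0.26],
                           random_state=42)

model = Pipeline([("smote", SMOTE(random_state=42)),
                  ("rf", RandomForestClassifier(random_state=42))])
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=42)
res = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "f1_weighted"])
print(res["test_accuracy"].mean(), res["test_accuracy"].std())
```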

Table 3 Accuracy for three datasets and different models, standard RF, balanced RF, and RF+SMOTE on the augmented dataset.
Table 4 F1-score for three datasets and different models, standard RF, balanced RF, and RF+SMOTE on the augmented dataset.

We observe that introducing a balanced dataset significantly improves the accuracy of our results in all case studies. However, the F1-score does not show the same trend for the DS2 dataset. The F1-score is sensitive to the number of false negatives and false positives. The basic RF model achieved an artificially high F1-score due to over-specialisation in the dominant class (labelled 1), essentially "ignoring" the minority class (labelled 0) while maintaining high precision. By balancing both the training and validation sets, the RF+SMOTE model establishes a fairer decision boundary, effectively trading distorted majority-class accuracy for robust overall classification performance. For the sake of completeness, in Tables 5 and 6 we also report the 95% confidence intervals (CI) for the accuracy and F1-score, calculated according to

$$\begin{aligned} CI = \mu \pm \left( t_{\alpha /2, n-1} \times \frac{\sigma }{\sqrt{n}} \right) \end{aligned}$$
(5)

where:

  • \(\mu\) is the mean performance metric across the 10 folds as reported in Table 3-4;

  • \(t_{\alpha /2, n-1}\) is the critical value from the Student’s t-distribution for a 95% confidence level (\(\alpha = 0.05\)) and \(n-1 = 9\) degrees of freedom (df), which corresponds to 2.262 in our case;

  • \(\sigma\) is the standard deviation of the metric across the folds;

  • \(n = 10\) is the number of folds in the cross-validation.

We would like to point out that in calculating the confidence intervals, we used the number of folds (\(n=10\)) rather than the total number of repetitions to ensure a more conservative estimate.
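Equation (5) translates directly into code; a sketch using SciPy's Student's t quantile (with the conservative n = 10 choice described above) is:

```python
import numpy as np
from scipy import stats

def t_confidence_interval(scores, n=10, alpha=0.05):
    """95% CI of Eq. (5); n is the fold count (conservative choice)."""
    mu = np.mean(scores)
    sigma = np.std(scores, ddof=1)
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, df=n - 1)   # 2.262 for df = 9
    half = t_crit * sigma / np.sqrt(n)
    return mu - half, mu + half

# Illustrative fold scores (not the actual values from Tables 3-4)
print(t_confidence_interval([0.79, 0.81, 0.80, 0.78, 0.82,
                             0.80, 0.79, 0.81, 0.80, 0.80]))
```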

Table 5 95% Confidence Intervals for Accuracy across different datasets and models (calculated with \(t=2.262, n=10, df=9\)).
Table 6 95% Confidence Intervals for F1-score across different datasets and models (calculated with \(t=2.262\), \(df=9\)), according to Eq. (5).

We observe that the confidence intervals achieved with RF+SMOTE are higher than those of the other models and do not overlap with them, except for the F1-score on dataset DS2, as previously discussed. These results confirm the reliability and robustness of the performance of the proposed approach.

Fig. 4

Confusion matrices referring to the validation dataset are reported for all the models and target variables considered. In particular, sub-figures (a) to (c) show the results of a classical RF model for classification. In (d)–(f), however, a balanced RF algorithm has been employed. Finally, in sub-figures (g)–(i), the augmented dataset and the RF+SMOTE algorithm have been used.

Figure 4 shows the confusion matrices associated with a single fold (corresponding to 10\(\%\) of the entire training set) for the three datasets and the three models under consideration. As usual, the main diagonal contains the correctly classified buildings. We note that the balanced algorithm slightly improves the ability to correctly assign the target to minority classes, especially for the binary target distributions of cases DS2 and DS3. For example, in DS2, the classical algorithm incorrectly assigns a moderate-to-heavy damage level to 38 buildings that instead suffered light damage. Using the balanced algorithm, this number drops to 24. Similarly, in DS3 (Figure 4-(b)), RF incorrectly classifies 52 buildings with heavy damage. Although the balanced algorithm improves performance slightly, it still misclassifies 32 out of 94 buildings in the minority class. Despite its benefits for minority classes, BRF does not achieve acceptable levels of accuracy or F1-score, and the false positive rate remains unacceptably high.

However, significant performance improvements were observed in terms of both accuracy and F1-score, particularly for DS2 and DS3, using the augmented dataset. In fact, for both DS2 and DS3, the number of misclassified elements is significantly lower than the number classified correctly. This yields the significant improvement in the accuracy and F1-score values reported in Tables 3 and 4.
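Confusion matrices like those in Fig. 4 can be produced for any of the fitted models; reusing the y_val and y_hat arrays from the earlier RF sketch:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

cm = confusion_matrix(y_val, y_hat)   # rows: true labels, columns: predictions
ConfusionMatrixDisplay(cm).plot()
plt.show()
```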

Case study

In the previous section, we trained and validated three different classifiers. A comparative analysis of their performance revealed the need for specific data augmentation techniques to achieve high accuracy, particularly for minority classes. We now test the model on a previously unseen dataset comprising 306 data points, corresponding to \(10\%\) of the initial (non-imputed and non-augmented) dataset. To this end, we consider the following case study:

  • suppose we have completed the inspection of some of the buildings damaged by the earthquake; we then want to use the collected data to identify severely damaged buildings in a previously unseen dataset.

This case study corresponds to the target variable configuration of DS3.

Before proceeding with the analysis of the test dataset, we removed all data points lacking information on the construction period (Age). The test dataset then contains 157 buildings with minor/moderate damage and 94 with heavy damage. The RF model is trained on the entire (imputed and augmented) dataset and tested on the test set. The accuracy obtained on the test set, using the optimized hyperparameter values reported in the Appendix, is 0.76. This result is comparable to those obtained on the fully balanced dataset reported in Table 3 and, in any case, higher than the accuracy achieved using an unbalanced training dataset. A fixed \(random\_state=42\) is used to ensure reproducibility. A similar argument can be made for the F1-score, for which we obtain 0.76 (weighted average); this result is significantly better than those obtained with unbalanced datasets (0.571 (std 0.030) and 0.607 (std 0.030) using RF and BRF, respectively). In Fig. 5, the results are displayed in terms of confusion matrices for both the training and the test set.

Fig. 5

Confusion matrix for the training (imputed and augmented) set and the previously unseen test set using RF.

The performance obtained on the training dataset is better than that obtained on the unseen dataset. This was expected, given the relatively small size of the test dataset and the complexity of the problem under analysis. A larger dataset would certainly improve performance, and more advanced data augmentation techniques could help mitigate the problem; this will be the subject of a future study. Finally, we report the precision-recall curve for the test set (Fig. 6).

Fig. 6

Precision–recall curve for the Random Forest model with balanced training set and previously unseen test set.

The model achieved an average precision (AP) of 0.74, demonstrating a solid balance between precision and recall across different thresholds. Unlike in the model evaluation phase, the hyperparameters for testing were tuned using a random search strategy. With this approach, different combinations of hyperparameters, selected randomly within user-defined ranges, are evaluated and compared. This approach is less exhaustive than grid search, since it does not evaluate every possible combination, but it is far more computationally efficient and typically yields a satisfactory, though possibly suboptimal, solution. All the information concerning the optimization procedure is reported in the Appendix.
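A sketch of this random search step with Scikit-learn's RandomizedSearchCV follows; the parameter ranges below are illustrative (the actual ones are listed in the Appendix), and synthetic data stands in for the balanced training set:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the imputed and augmented (balanced) training set
X_bal, y_bal = make_classification(n_samples=3636, n_features=11, random_state=42)

param_dist = {"n_estimators": randint(100, 1000),   # illustrative ranges only
              "max_depth": randint(3, 30),
              "min_samples_leaf": randint(1, 10)}

search = RandomizedSearchCV(RandomForestClassifier(random_state=42), param_dist,
                            n_iter=50, scoring="f1_weighted", cv=5,
                            random_state=42)
search.fit(X_bal, y_bal)
print(search.best_params_, search.best_score_)
```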

Conclusions

In this work, we developed a tool for assigning damage levels to buildings following a strong earthquake. The PGA and PGV values derived from physics-based numerical simulations were used as predictive variables in an artificial intelligence-based model to determine the level of damage suffered by buildings. A preliminary analysis conducted with a random forest feature scoring algorithm showed that the IMs calculated using PBS have a significant impact on damage values. Using these variables, rather than those derived from ShakeMaps, we trained a random forest-based classifier. The use of appropriate data imputation and data augmentation techniques allowed us to significantly improve the performance of the classification algorithm on the validation dataset, reaching about 80\(\%\) for the binary classification problems.

The classifier also performed well on a previously unseen test dataset, especially considering the small size of the test dataset. This tool, once properly validated and developed, can contribute to improving post-emergency procedures and implementing effective preventive measures.

However, it is necessary to carefully evaluate the model's generality, in the same spirit as 12. While the model performed satisfactorily in the case study described, this may not be the case for other test sets or distributions of the target variables. To this end, we are conducting a more in-depth study of the influence of the predictive features. Finally, we note that further developments of this work could include the use of larger datasets related to different earthquake events and/or a greater number of IMs obtained from PBS 4. This outcome highlights the importance of properly constraining the input data. In this regard, the integration of satellite-derived imagery 47 offers a reliable and efficient means to quantify building damage at scale, providing essential constraints for identifying the most appropriate datasets.

Furthermore, it should be noted that the model’s current generalizability is limited to structural typologies and site conditions similar to those in the training dataset (i.e., historic centers located in central Italy such as L’Aquila). Future research will focus on extending this methodology to diverse geological and urban contexts, as well as to multi-hazard frameworks, in the spirit of recent resilience modeling approaches 48.