Abstract
Fire door defects in residential buildings negatively impact construction management by reducing fire safety effectiveness, increasing the likelihood of smoke and fire spreading, and consequently putting occupant safety at greater risk. To address this critical safety issue, this study proposes and evaluates five transformer-based text classification methods—BERT, RoBERTa, ALBERT, DistilBERT, and XLNet—for automated defect detection. These methods are optimized using both common and method-specific hyperparameters, resulting in 1,458 model variants evaluated through multiple metrics. Among these, the optimized RoBERTa achieves the highest performance, demonstrating F1 scores of 92.13% (frame gap), 87.29% (door closer adjustment), 78.17% (contamination), 82.89% (dent), 80.17% (scratch), 96.66% (sealing components), 98.43% (mechanical components), and 67.81% (others), yielding an average F1 score of 85.44%. Furthermore, RoBERTa significantly outperforms the 535 other optimized text classification models (ANN, SVM, DT, LR, 1D CNN, and LSTM). These results underscore the potential and effectiveness of transformer-based methods for safety management in real-world construction scenarios.
Introduction
In residential buildings, fire doors are typically installed at stairwell entrances and front doors of individual units to ensure residents have safe evacuation routes during fires1. Additionally, fire doors are strategically engineered to compartmentalize buildings, effectively limiting the spread of fire and smoke between different sections2,3,4. However, various factors during construction can introduce defects in fire doors. Physical damage often occurs when doors serve as access points during material transportation or interior finishing processes. Furthermore, cost-saving practices, such as using lower-quality materials or employing insufficiently trained workers who do not strictly follow installation guidelines, can significantly undermine fire door integrity5.
To verify their proper operation, fire doors undergo visual inspections before the completion of construction. Inspection findings are documented as unstructured textual records, which include details like inspection dates, responsible contractors, and defect locations. Defects recorded are classified into categories, including operational issues (such as doors failing to close properly) and physical damages (such as dents or scratches). This classification is crucial for maintenance teams, enabling them to prioritize safety-critical repairs and improve efficiency by grouping similar repair tasks to minimize downtime and travel between units6.
Nevertheless, manually classifying fire door defects from extensive textual descriptions is labor-intensive, time-consuming, and susceptible to errors. Previous research indicates low accuracy in manual defect classification at many construction sites, underscoring the need for automated classification solutions. Earlier studies in the construction field have frequently utilized conventional text classification techniques like Naive Bayes (NB) and Support Vector Machines (SVM). These methods typically depend on simpler mathematical structures with fewer parameters7,8,9. Although they can perform adequately for basic classification tasks, they generally fall short in capturing complex patterns and subtle linguistic details present in natural language datasets, such as those encountered in inspection reports and defect records10,11.
In contrast, recent advancements in deep learning, specifically supervised learning methods based on Bidirectional Encoder Representations from Transformers (BERT), offer sophisticated architectures that incorporate multiple sequential layers. These layers introduce non-linear transformations during training, enabling BERT-based models to generate rich and contextually meaningful representations. Consequently, these models effectively address linguistic complexities, including diverse vocabularies, varied syntax structures, and stylistic variations commonly observed in construction-related textual data12.
However, numerous variations of BERT-based methods exist, and successfully deploying these models requires careful customization, extensive training, and systematic evaluation tailored specifically to the targeted classification task. These factors significantly impact key aspects of model performance, including accuracy and detection speed. Given these considerations, this study aims to develop and rigorously evaluate various BERT-based approaches for the automated classification of textual data, with a particular emphasis on detecting steel fire door defects.
The main contributions of this study are summarized as follows: (1) Real-world text data were collected from apartment complexes encompassing a total of 8,786 household units. (2) A robust classification framework was developed, identifying eight defect categories—seven common defects and one minor defect. (3) A comprehensive set of 1,458 predictive models was developed using five BERT-based methodologies, each extensively optimized through hyperparameter tuning. (4) 535 models based on other machine learning methods were optimized, and a detailed comparative analysis between these traditional machine learning models and the BERT-based models was conducted. (5) Based on the resulting findings, the practical applications and implications for real-world construction management were thoroughly discussed.
Literature review
Typical process of BERT-based text classification
The typical workflow for employing supervised deep learning models, particularly those based on the BERT family, involves several critical steps: target class definition, annotation, data cleaning, text vectorization, model training, and hyperparameter optimization13.
The process begins with clearly defining target classes from the raw textual data, a foundational step for effective model training14. For example, defects in fire doors may be broadly categorized into hardware faults and aesthetic faults, with potential for further subdivision into more detailed categories based on specific inspection requirements and intended use. After defining these target classes, an annotation step assigns precise labels to corresponding textual descriptions, creating structured datasets for model learning. Next, data cleaning is conducted to ensure textual data is consistent, relevant, and free from extraneous noise.
Following cleaning, text vectorization methods transform the textual data into numerical representations suitable for computational processing. BERT-based models play a crucial role here, producing context-aware vector representations that effectively capture complex linguistic patterns, significantly enhancing model performance15. Hyperparameter optimization systematically tunes model parameters such as learning rate, batch size, and the number of epochs to achieve optimal model performance. Each of these steps significantly influences overall model effectiveness16. This study specifically emphasizes evaluating the effectiveness of BERT-based vectorization methods for detecting defects in fire doors.
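For illustration, the snippet below sketches how a pre-trained BERT encoder converts a defect description into context-aware token vectors via the Hugging Face transformers library; the checkpoint name and input sentence are placeholders rather than the study's exact configuration.

```python
# Illustrative vectorization sketch: a pre-trained BERT encoder maps a defect
# description to one context-aware vector per token. "bert-base-uncased" is an
# example checkpoint, not necessarily the model used in this study.
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = TFAutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Gap between the door frame and the wall", return_tensors="tf")
outputs = encoder(**inputs)
print(outputs.last_hidden_state.shape)  # (1, num_tokens, 768)
```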
Existing text classification methods
Various comparative studies have evaluated the performance of supervised learning methodologies within the construction industry, using both traditional machine learning and deep learning techniques. The methods employed in these studies, their achieved accuracies, and respective application contexts are summarized in Table 1.
Salama and El-Gohary17 compared machine learning methods, including SVM, NB, and Maximum Entropy (ME), for extracting rules from contractual texts, achieving the highest accuracy of 82% with ME. Goh and Ubeynarayana18 assessed methods such as SVM, Linear Regression (LR), and NB for construction accident classification, where SVM showed superior performance. Zhang et al.19 employed Decision Trees (DT), LR, and ensemble methods for classifying construction accident types, identifying Decision Trees as effective in their context.
Ul Hassan et al.20 developed supervised machine learning approaches to categorize textual descriptions into three project phases: design, construction, and operation and maintenance. Among the tested models—NB, SVM, LR, K-Nearest Neighbor (KNN), DT, and Artificial Neural Networks (ANN)—LR performed best with an accuracy of 94.12%. Luo et al.21 also evaluated traditional models such as SVM, NB, and LR alongside Convolutional Neural Networks (CNN) for accident type classification, with CNN delivering the highest accuracy of 76%. Wang et al.22 proposed both machine learning and deep learning approaches, including DT, Random Forest (RF), NB, SVM, and CNN with Attention (CNN-AT), to categorize construction defects from daily supervisory reports. Among these, NB demonstrated notable performance with an accuracy of 98%.
Moreover, recent studies highlighted transformer-based methods’ superior capability in text classification tasks. Fang et al.23 employed deep learning models, including FastText, TextCNN, TextCNN combined with Bidirectional Gated Recurrent Units (BiGRU), Text Recurrent Convolutional Neural Networks (TextRCNN), and BERT, to classify near-miss incidents, with BERT obtaining the highest accuracy of 86.91%. Yang et al.24 applied deep learning methods such as CNN and Generative Pre-trained Transformer 2 (GPT-2) to classify facility defects, achieving the highest accuracy of 90.72% using a CNN-based model. Additionally, Wang et al.25 evaluated transformer architectures such as BERT-Bidirectional Long Short-Term Memory (BiLSTM) and BERT-LSTM-CRF for explicit safety knowledge extraction from regulatory texts, obtaining the highest accuracy of 91.74% with the BERT-BiLSTM-CRF method.
Tian et al.26 compared transformer-based approaches such as BERT-Graph Convolutional Network (GCN) and BiLSTM for safety hazard classification, achieving accuracy up to 86.56%. Jianan et al.27 evaluated transformer architectures, including RoBERTa and Longformer-RoBERTa, for classifying knowledge types from construction consulting standards, with the Longformer-Robustly Optimized BERT Approach (RoBERTa) variant reaching the highest accuracy of 91.65%. In addition, a systematic assessment by Zhong et al.28 contrasted classical classifiers (SVM, NB, LR, DT, KNN) with neural architectures (TextCNN, TextCNN–LSTM, RCNN, Transformer) for construction-dispute classification; TextCNN achieved the best result at 65.09% accuracy.
These findings indicate that no single algorithm consistently outperforms others across all scenarios, underscoring that the optimal model choice heavily depends on the specific classification task and its context. To date, however, no research has addressed automated monitoring of fire door defects in buildings: despite the proven effectiveness of transformer-based methods, particularly BERT variants, their application to classifying fire door defects from inspection reports remains unexplored. To address this research gap, this study proposes and evaluates multiple transformer-based models (BERT, RoBERTa, A Lite BERT (ALBERT), Distilled BERT (DistilBERT), and XLNet). Additionally, traditional machine learning algorithms (ANN, SVM, DT, RF, LR) and other deep learning approaches (1D CNN and LSTM) are comprehensively assessed and compared.
Proposed approach
As illustrated in Fig. 1, the proposed method consists of the following steps: First, reports describing fire door defects are collected from real households and categorized based on various defect types, such as frame gap and contamination. Next, each instance within the collected dataset is systematically annotated according to the corresponding defect categories. The dataset is then preprocessed through sequential cleaning steps, including lowercasing and lemmatization. Subsequently, five different transformer-based models (BERT, RoBERTa, ALBERT, DistilBERT, XLNet) are developed by fine-tuning their architectures and systematically optimizing hyperparameters, resulting in a total of 1,458 generated models. These transformer-based models are carefully evaluated to identify underfitting and overfitting issues, and their performance is then compared against other established text classification methods, comprising an additional 535 models. Multiple evaluation metrics are utilized for a comprehensive comparative analysis. Each of these steps is thoroughly detailed in the subsequent sections.
Target class definition
In this research, excluding the category labeled ‘others’, seven distinct defect types were determined based on fire door inspection records. These include gaps in the frame, defects related to door closer adjustments, contamination issues, dents, scratches, missing sealing elements, and missing mechanical operation components.
Each defect type is visually demonstrated in Fig. 2. Frame gap defects occur as horizontal or vertical separations between the door and its frame. Defects in door closer adjustments involve incorrect calibration of the closer’s two distinct speed zones, each controlled via an adjustment screw. Rotating this screw clockwise decreases the door’s closing speed, whereas rotating it counter-clockwise accelerates the closing. Therefore, precise adjustment is essential for effective operation.
Contamination, dents, and scratches typically affect door or frame surfaces. Deficiencies classified under missing sealing elements usually pertain to the absence of critical gaskets required for containing smoke and fire effectively. Lastly, defects classified as missing mechanical operation components include absent essential hardware such as door closers, digital locks, hinges, or door stoppers.
Annotation
After visually inspecting fire doors for defects, inspectors first document their findings manually in handwritten reports. Sample statements extracted from these handwritten records are presented in Table 2. Each fire door can have several distinct defects, and each defect is individually recorded. Subsequently, the handwritten observations are converted into digital form, as shown in Fig. 3. Finally, these digital sentences are annotated with appropriate labels that identify the specific categories of defects.
Data cleaning methods
In this study, five data preprocessing methods were employed to refine textual information. The impact of these preprocessing techniques on raw text data transformation is demonstrated with an example in Fig. 3, and a minimal code sketch follows the list below:
1) Lowercasing: Converts all text characters to lowercase. This standardizes words such as “Fire door” and “fire door”, eliminating unnecessary distinctions and simplifying the dataset.

2) Removing punctuation: Excludes punctuation marks (e.g., commas, question marks, periods) from the text. For many classification tasks, punctuation adds little semantic value; eliminating it reduces noise and yields a cleaner, more consistent input.

3) Tokenization: Breaks down text into smaller components known as tokens, typically words. Tokenization converts raw text into structured units that can then be represented numerically, making them suitable for processing by algorithms used in text classification.

4) Removing stop words: Excludes frequently occurring words such as “and,” “the,” “is,” and “to,” which usually provide minimal informational value. Removing these stop words considerably reduces vocabulary size and data complexity, facilitating more efficient processing.

5) Lemmatization: Reduces words to their fundamental or dictionary forms, known as lemmas, based on grammatical usage and context. For example, the words “pushed” and “pulled” become “push” and “pull,” respectively. Unlike basic truncation, lemmatization identifies accurate root forms, ensuring consistent data representation.
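A minimal sketch of these five steps is shown below, assuming NLTK for the stop-word list and lemmatizer (the paper does not name its preprocessing libraries); whitespace splitting stands in for a full tokenizer.

```python
# Sketch of the five cleaning steps, assuming NLTK (an illustrative choice;
# the study's actual preprocessing libraries are not specified).
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

def clean_text(text: str) -> list[str]:
    text = text.lower()                                               # 1) lowercasing
    text = text.translate(str.maketrans("", "", string.punctuation))  # 2) remove punctuation
    tokens = text.split()                                             # 3) tokenization (whitespace, for brevity)
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stops]                    # 4) remove stop words
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t, pos="v") for t in tokens]         # 5) lemmatization (verb lemmas)

print(clean_text("The fire door was pushed and does not close."))
# ['fire', 'door', 'push', 'close']
```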
BERT-based methods
The transformer architecture, proposed by Vaswani et al.29, employs stacked self-attention layers and fully connected layers organized into encoder and decoder modules. Unlike traditional sequence models, transformers leverage self-attention mechanisms instead of recurrent or convolutional operations, enabling the efficient modeling of long-range dependencies within textual data30,31. The overall transformer architecture, which consists of multiple layers of self-attention and feed-forward neural networks in both encoder and decoder blocks, is illustrated in Fig. 4.
To incorporate positional information, positional encodings are combined directly with input embeddings at the base of both encoder and decoder modules. These positional encodings match the dimensionality (\(d_{model}\)) of the embeddings, allowing seamless integration. Positional encodings use sinusoidal functions at different frequencies, as defined by Eqs. (1) and (2):

$$PE_{(pos,\,2i)}=\sin\left(pos/10000^{2i/d_{model}}\right)\quad(1)$$

$$PE_{(pos,\,2i+1)}=\cos\left(pos/10000^{2i/d_{model}}\right)\quad(2)$$
Here, pos represents the position in the sequence, and i indicates the embedding dimension. Each dimension generates a distinct sinusoidal waveform. The Transformer encoder and decoder each consist of identical layers. Each encoder layer includes a multi-head self-attention mechanism followed by a fully connected feed-forward network, while each decoder layer contains these components plus an additional multi-head attention layer focused on encoder outputs32.
Scaled Dot-Product Attention (Fig. 6) processes queries (Q), keys (K), and values (V), with attention weights calculated by scaling dot products between queries and keys by the square root of the key dimension (\(d_k\)), then applying a softmax function, as expressed in Eq. (3):

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V\quad(3)$$
In multi-head attention, instead of utilizing a single attention operation with \(d_{model}\)-dimensional queries, keys, and values, multiple parallel attention heads (h) are employed, as shown in Fig. 5, each applying unique linear projections to produce queries, keys, and values of dimensions \(d_q\), \(d_k\), and \(d_v\), respectively33. Parallelizing attention computations across these heads enhances performance by allowing the model to jointly attend to information from different representation subspaces34.
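A minimal TensorFlow sketch of scaled dot-product attention (Eq. 3) is given below; the tensor shapes and random inputs are illustrative only.

```python
# Scaled dot-product attention (Eq. 3); shapes and values are illustrative.
import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, d_k)
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d_k)  # (…, seq_q, seq_k)
    weights = tf.nn.softmax(scores, axis=-1)                   # each row sums to 1 over the keys
    return tf.matmul(weights, v), weights

x = tf.random.normal((1, 8, 16, 64))                 # batch=1, h=8 heads, 16 tokens, d_k=64
out, w = scaled_dot_product_attention(x, x, x)       # self-attention: Q = K = V
print(out.shape, w.shape)                            # (1, 8, 16, 64) (1, 8, 16, 16)
```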
Building upon the transformer architecture, numerous variants have emerged, each tailored to specific tasks and use cases. These variants primarily differ in their pre-training strategies, parameter efficiency, and internal architectural enhancements. The following subsections provide detailed discussions of prominent BERT variants, with their key distinctions clearly summarized in Table 3.
BERT
BERT is an encoder-only Transformer trained with Masked-Language Modeling (MLM) in an autoencoding regime35. A small subset of tokens is replaced by a mask token, and the network recovers the originals using bidirectional self-attention, which integrates left and right context within each encoder block. The model employs subword tokenization (e.g., WordPiece), learned token and positional embeddings, and a stack of multi-head self-attention and position-wise feed-forward layers with residual connections and layer normalization. For sequence classification, a task-specific affine transformation with softmax is applied to the sequence-level [CLS] representation (classification token), and all parameters are fine-tuned jointly. Due to its parameter count and the quadratic cost of self-attention in sequence length, BERT is generally more demanding computationally than lighter variants.
RoBERTa
RoBERTa preserves BERT’s MLM objective and autoencoding pre-training while emphasizing optimization and scale through dynamic masking, larger mini-batches, longer training schedules, and larger corpora, alongside implementation refinements that stabilize learning36. Downstream use follows the same encoder-only architecture and fine-tuning protocol as BERT; a task-specific affine transformation with softmax is applied to the [CLS] representation, and all parameters are fine-tuned jointly. Given its reliance on aggressive scaling, RoBERTa typically matches or exceeds BERT’s accuracy at similar model sizes while being equally or more demanding to pre-train.
ALBERT
ALBERT targets parameter efficiency within the same MLM/autoencoding paradigm by introducing factorized embedding parameterization, which decouples the vocabulary embedding dimension from the hidden size used in Transformer layers, and by applying cross-layer parameter sharing to reduce the number of unique weights37. These design choices substantially lower memory usage and improve throughput with limited accuracy loss. Fine-tuning mirrors BERT; a task-specific affine transformation with softmax is applied to the [CLS] representation, and all parameters are fine-tuned jointly. With far fewer parameters, ALBERT is less demanding in both memory and inference latency, making it suitable for resource-constrained settings.
DistilBERT
DistilBERT compresses BERT via knowledge distillation while retaining the MLM/autoencoding objective. A smaller student model with fewer encoder layers is trained to match a larger teacher’s softened outputs (and, in some implementations, intermediate representations), typically combining a distillation term with the MLM loss38. Architectural simplifications further reduce complexity without altering the encoder-only design. In downstream tasks, a task-specific affine transformation with softmax is applied to the [CLS] representation, and all parameters are fine-tuned jointly. This compression yields markedly lower latency and memory usage with modest accuracy trade-offs, rendering DistilBERT less demanding than full-size BERT at inference.
XLNet
XLNet replaces MLM with Permutation Language Modeling (PLM)—an autoregressive objective that maximizes the expected likelihood over all factorization orders of a sequence39. Built on Transformer-XL, it incorporates segment-level recurrence and relative positional encodings to capture long contexts efficiently, and employs two-stream attention to condition on permuted “future” tokens without information leakage. For sequence classification, implementations use a sequence-level summary token (analogous to [CLS]); a task-specific affine transformation with softmax is applied to this sequence-level representation, and all parameters are fine-tuned jointly. Owing to the PLM objective and long-context machinery, XLNet typically incurs a higher computational burden, comparable to or above BERT/RoBERTa, while providing strong language-understanding performance.
Fine tuning of architectures
To utilize BERT-based models for classification tasks, a fully connected (FC) classification layer is typically added on top of the transformer architecture. The embeddings generated by the BERT model are passed into an FC classification head, which generally consists of one dense hidden layer with 128 units, followed by a dropout layer with a dropout rate of 0.3. Finally, a dense output layer with softmax activation classifies these embeddings into distinct categories, ensuring reliable accuracy and efficient computational performance40,41,42. In this research, the classifier head is specifically configured to classify fire-door conditions into eight distinct categories, as described in Table 2.
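A sketch of this classifier head in TensorFlow/Keras follows; the encoder checkpoint, maximum sequence length, and optimizer settings are assumptions, while the 128-unit hidden layer, 0.3 dropout rate, and eight-way softmax follow the description above.

```python
# Sketch of the classification head (dense 128 -> dropout 0.3 -> softmax over
# 8 classes) on top of a Hugging Face BERT encoder. Checkpoint name, max
# length, and learning rate are assumptions, not the paper's exact settings.
import tensorflow as tf
from transformers import TFBertModel

MAX_LEN, NUM_CLASSES = 128, 8

encoder = TFBertModel.from_pretrained("bert-base-uncased")
input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# Sequence-level [CLS] representation feeds the classification head.
cls = encoder(input_ids, attention_mask=attention_mask).pooler_output
x = tf.keras.layers.Dense(128, activation="relu")(cls)        # hidden layer, 128 units
x = tf.keras.layers.Dropout(0.3)(x)                           # dropout rate 0.3
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model([input_ids, attention_mask], outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```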
Optimization of hyperparameters
Selecting optimal hyperparameters is essential for achieving high model performance; however, exhaustive hyperparameter optimization can be computationally expensive43,44,45. A practical strategy to address this is to define hyperparameter ranges based on empirical results and previous studies46,47,48. Accordingly, hyperparameter ranges and specific values were established, resulting in the generation of 1,458 model configurations. Detailed information regarding the selected hyperparameters for each method and the corresponding number of generated models is clearly summarized in Table 4.
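As a simple illustration, common hyperparameters can be enumerated as a Cartesian grid; the ranges below are assumptions for exposition only, whereas the actual per-method grids are those listed in Table 4.

```python
# Illustrative Cartesian grid of common fine-tuning hyperparameters.
# These ranges are assumptions; the study's grids are given in Table 4.
from itertools import product

grid = {
    "learning_rate": [1e-5, 2e-5, 3e-5],
    "batch_size": [16, 32],
    "epochs": [3, 4, 5],
    "max_seq_length": [128, 256, 512],
}
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs), "configurations, e.g.,", configs[0])
```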
Model evaluation
F1 score and accuracy
In classification tasks, both the F1 score and accuracy are widely used metrics to evaluate model correctness. The F1 score integrates precision and recall into a single metric, balancing the model’s ability to correctly detect relevant cases (recall) and its effectiveness in avoiding false alarms (precision). A detailed explanation of the F1 score is provided by49,50. A high F1 score indicates that the model effectively identifies faults while minimizing false positives and false negatives, crucial for reliable system operations. Accuracy measures the proportion of correctly classified instances over the total number of instances, providing a straightforward assessment of overall model correctness51,52. Both metrics are important for evaluating model performance, especially in scenarios where misclassification can lead to significant operational consequences.
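Both metrics can be computed per class and macro-averaged, for example with scikit-learn (the labels below are toy values, not the study's data):

```python
# Per-class and macro-averaged F1 plus overall accuracy with scikit-learn.
# Toy labels for three classes; not the study's data.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2]
print("accuracy:", accuracy_score(y_true, y_pred))
print("per-class F1:", f1_score(y_true, y_pred, average=None))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```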
Detection speed
The detection speed refers to the computational time a model takes to process a single input instance53. In this study, the detection speed is reported in terms of instances processed per second, reflecting the efficiency of each transformer-based method when analyzing textual descriptions of fire door defects.
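A simple per-instance timing harness of the kind implied here is sketched below; the warm-up loop and dummy predictor are illustrative assumptions, not the study's measurement code.

```python
# Illustrative latency harness: seconds per instance and instances per second.
import time

def measure_latency(predict_fn, inputs, warmup=5):
    for x in inputs[:warmup]:
        predict_fn(x)                    # warm-up excludes one-off setup costs
    start = time.perf_counter()
    for x in inputs:
        predict_fn(x)
    per_instance = (time.perf_counter() - start) / len(inputs)
    return per_instance, 1.0 / per_instance

# A dummy predictor stands in for the model's single-report inference call.
latency, throughput = measure_latency(lambda text: text.lower(), ["sample defect text"] * 100)
print(f"{latency:.6f} s/instance, {throughput:.0f} instances/s")
```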
Experiment
Dataset preparation
Raw data collection
The dataset analyzed in this research regarding fire door defects was gathered from three residential apartment complexes in South Korea, each constructed by a different developer. All complexes utilized standardized steel fire-rated doors. Inspectors systematically checked a total of 8,786 households, recording identified defects in fire doors. Through these inspections, 4,212 separate defects were documented. Further details about the households inspected and the distribution of defects among the three companies are summarized in Table 5. Initially, inspection personnel manually documented the defects in Korean, subsequently converting these handwritten reports into digital form. For the purpose of this study, the original Korean descriptions were translated into English for clear presentation.
The degree of specificity in defect descriptions differed significantly across various construction firms, resulting in diverse classification methods. Several key factors influencing textual differences among companies were evident from the collected data. A primary factor impacting these variations was the training level of the inspectors; individuals with more specialized training often described defects using technical vocabulary. Experienced inspectors, in particular, utilized precise and technically advanced terminology. Figure 6 demonstrates this contrast by comparing descriptions from Company A and Company B. Specifically, in Fig. 6(a), a defect is articulated explicitly using the technical term “stack phenomenon,” reflecting a detailed, technical perspective. Conversely, Fig. 6(b) illustrates the same defect described by Company B using simpler, less technical wording.
Furthermore, in Apartment A, residents’ feedback regarding dissatisfaction was collected via a mobile app and subsequently integrated with defect descriptions to improve the overall accuracy and comprehensiveness of defect assessments. An illustrative example of this combined approach appears in Fig. 7, where the residents’ dissatisfaction is emphasized in bold text for clarity. In addition to inspector training and resident input, other elements influencing the detail level of defect descriptions included the duration of inspections, availability of inspection staff, and budgetary limitations.
Due to the inherent flexibility and variability of linguistic expressions in Korean, translating textual descriptions into English required careful attention to maintain the original nuances. Figure 8 illustrates this challenge by showcasing varied expressions related to defects in window and door operation mechanisms. Therefore, the distinct variations originally present in the Korean descriptions were intentionally retained in the English translations to authentically represent this linguistic diversity.
Annotation
In this study, annotation of the raw textual data was carried out according to criteria specified in Table 3, with the purpose of classifying defects explicitly associated with fire doors. Three annotators, each with over a decade of experience dealing with fire door defects and extensive expertise in construction management, manually labeled the data by recording defect details into an Excel spreadsheet. The completed labeled dataset can be obtained by contacting the corresponding author.
Data split
Following annotation, the entire dataset, consisting of 4,212 instances, was randomly divided into three distinct subsets: a training set comprising 2,527 instances (approximately 60% of the data) for model development, a validation set containing 842 instances (approximately 20% of the data) for selecting the optimal model, and a test set with 843 instances (approximately 20% of the data) for evaluating the performance of the final model on unseen data. To address potential biases or underrepresentation of smaller classes, such as contamination, stratified sampling was applied. Stratified sampling ensures that each subset maintains class proportions consistent with those found in the overall dataset54. For example, if a particular class represents 5% of the complete dataset, it will similarly constitute about 5% within each subset. Detailed distributions for each subset are provided in Table 6.
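The stratified 60/20/20 split can be reproduced with scikit-learn as sketched below; the toy DataFrame and column names are placeholders for the annotated spreadsheet.

```python
# Stratified 60/20/20 split sketch; the DataFrame and column names are
# placeholders for the annotated dataset, not the study's actual records.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "text": [f"defect record {i}" for i in range(100)],
    "label": [i % 8 for i in range(100)],   # eight classes, balanced here for brevity
})
train_df, temp_df = train_test_split(df, test_size=0.4,
                                     stratify=df["label"], random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5,
                                   stratify=temp_df["label"], random_state=42)
print(len(train_df), len(val_df), len(test_df))  # 60 / 20 / 20
```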
Experimental settings
All experiments were conducted on a system operating on Windows 10, equipped with an Intel Core i7-7700HQ CPU (2.80 GHz, 8 threads), an NVIDIA GeForce RTX 3080 Ti GPU, and sufficient memory to handle computational tasks effectively. The implementations utilized the TensorFlow framework for developing and executing the deep learning models.
Results and discussion
In this research, the performance of classification models was evaluated across eight defect categories. For clarity, these defect categories are defined as follows: Class1 – Frame gap, Class2 – Door closer adjustment, Class3 – Contamination, Class4 – Dent, Class5 – Scratch, Class6 – Sealing components, Class7 – Mechanical operation components, and Class8 – Others.
Underfitting and overfitting
In this study, the training and validation loss curves were analyzed to identify potential issues related to underfitting or overfitting during model training. Multiple hyperparameter combinations involving different batch sizes and epochs were explored, affecting the total number of training iterations. The scenario with the maximum number of iterations (batch size of 16, epochs 4, and a dataset of 2,527 training instances, resulting in 4,992 iterations) was specifically chosen, as it provided the richest insights into model convergence behavior compared to scenarios with fewer iterations (e.g., batch size of 32, epochs 4, with only 624 iterations).
Figure 9 illustrates the training and validation loss curves for each model under this maximum iteration scenario. All models exhibited a clear and consistent decrease in both training and validation losses as training progressed, indicating effective and steady learning across the iterations. The consistent decline in training loss confirms that none of the models experienced significant underfitting. Furthermore, the close alignment and concurrent downward trends observed in both training and validation loss curves suggest that overfitting was minimal or non-existent.
Analysis of performance variation by methods
Figure 10 presents boxplots illustrating the distribution of F1-score and accuracy metrics for the five transformer-based models across hyperparameter combinations. A preliminary visual analysis shows that RoBERTa consistently achieves the highest accuracy and F1-score, with its boxplots positioned distinctly higher and demonstrating narrower interquartile ranges, indicating strong and stable performance. In contrast, DistilBERT exhibits lower overall performance, as indicated by boxplots located at lower positions, reflecting weaker predictive capability. ALBERT and XLNet demonstrate moderate performance, with tightly grouped distributions suggesting stable yet moderate performance across parameter combinations. BERT shows intermediate performance but has slightly wider distribution ranges, indicating variability in response to hyperparameters.
Detailed statistical analysis as shown in Table 7 further substantiates these observations. RoBERTa achieves the highest mean accuracy (84.98%) and F1-score (85.51%), confirming superior and consistent performance. DistilBERT displays the lowest mean accuracy (81.12%) and F1-score (81.73%), reinforcing its lower overall performance despite minimal variance (std = 0.37). ALBERT and XLNet yield moderate results, with stable but lower averages, highlighting less sensitivity to hyperparameter variations. BERT exhibits moderate performance but higher variability, suggesting sensitivity to specific hyperparameters.
As shown in Table 8, per-class performance is reported for the best model under each method. RoBERTa shows the strongest overall results, with the highest mean accuracy (85.01%) and mean F1 score (85.47%) among the evaluated models. Specifically, RoBERTa achieved superior precision (up to 99.78%) and recall (up to 99.13%) in certain classes, highlighting its strong predictive capability and consistent reliability. BERT exhibited moderate performance, with an average accuracy of 83.36% and mean F1-score of 83.83%, although it notably struggled with Class 8 (F1-score: 65.87%). ALBERT had stable but lower average performance (accuracy: 81.79%, F1-score: 82.31%), suggesting reliability but somewhat limited predictive accuracy.
DistilBERT showed the lowest performance across all models, with a mean accuracy of 81.12% and mean F1-score of 81.70%, indicating limited effectiveness compared to other methods. XLNet exhibited moderate performance (accuracy: 82.29%, F1-score: 82.86%), placing it between ALBERT and BERT in overall capability.
Final model selection
As shown in Table 8, among these models, RoBERTa consistently achieved the highest overall performance. Specifically, RoBERTa recorded the best average F1 score (85.55%) and accuracy (85.01%), clearly surpassing other models. DistilBERT, on the other hand, performed poorest, with the lowest average F1 score (81.70%) and accuracy (81.09%). ALBERT and XLNet showed moderate performance, while BERT exhibited intermediate results with slightly better performance than ALBERT and DistilBERT.
Regarding the detection speed (Table 9), DistilBERT was the fastest, averaging 0.016 s per instance. Although RoBERTa did not achieve the fastest detection speed, its average inference time of 0.023 s per instance remained fast enough for practical deployment.
Considering both performance and computational efficiency, the optimized RoBERTa model—with a sequence length of 512, learning rate of 1e-5, warm-up proportion of 0.1, batch size of 32, and trained for 5 epochs—was selected as the final model. This choice was based primarily on its superior accuracy and F1 score, coupled with a sufficiently fast inference time suitable for real-world deployment scenarios.
Model evaluation in test data
Test data performance
Figure 11 illustrates RoBERTa’s validation and test performance across different classes. RoBERTa demonstrated robust and consistent predictive performance, with validation F1 scores ranging from 67.65% (Class8) to 99.26% (Class7), averaging 85.55%, and validation accuracies ranging from 66.72% (Class8) to 99.13% (Class7), averaging 85.01%.
When assessing generalization on the test dataset, RoBERTa showed minimal variation, with F1 scores fluctuating slightly (between −0.73% and +0.78%) and accuracies varying between −1.00% and +1.02%. This indicates stable and reliable generalization to unseen data, reflecting neither significant overfitting nor underfitting.
Figure 12 demonstrates the relative performance of transformer-based models (BERT, ALBERT, DistilBERT, XLNet) compared to RoBERTa as a baseline. Overall, RoBERTa exhibited superior predictive capability, consistently outperforming other models. Specifically, DistilBERT displayed the largest negative variations (up to −4.74% in F1 score and −4.80% in accuracy), while BERT and XLNet showed moderate negative differences. ALBERT also lagged behind RoBERTa but maintained relatively smaller and more consistent negative differences.
Comparison with other detectors
Selection of other detectors
To demonstrate the superiority of the proposed RoBERTa-based approach, comparisons were conducted with several widely used text classification methods, including ANN, SVM, DT, LR, 1D CNN, and LSTM. Additionally, various vectorization techniques such as Term Frequency-Inverse Document Frequency (TF-IDF), Bag-of-Words (BOW), 3-gram, Word Embedding, Word2Vec, and FastText were utilized. Hyperparameters for each method were empirically optimized through preliminary experimentation, resulting in a total of 535 evaluated models across all methods. Table 10 summarizes the classification methods along with their optimized hyperparameters and presents the best-performing configurations identified from the test dataset.
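For reference, one such baseline of the kind compared here (TF-IDF features with a linear SVM) can be assembled as below; the toy data, n-gram range, and regularization setting are illustrative assumptions.

```python
# Sketch of a traditional baseline: TF-IDF features + linear SVM.
# Texts, labels, and hyperparameter values are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = ["gap between frame and door", "scratch on door surface",
         "door closer speed not adjusted", "dent on lower panel"]
labels = [0, 4, 1, 3]   # toy class indices loosely following the eight classes

baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),  # unigrams up to 3-grams
    ("svm", LinearSVC(C=1.0)),
])
baseline.fit(texts, labels)
print(baseline.predict(["small dent near the hinge"]))
```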
Comparison results with other detectors
Tables 11 and 12 compare the performance of the proposed RoBERTa-based method with other machine learning and deep learning models on the test dataset. Table 11 illustrates that the proposed RoBERTa-based classifier achieved the highest overall performance on the test set, with an average F1-score of 85.44% and accuracy of 84.76% across eight classes. Relative to the strongest traditional machine-learning baselines—SVM (F1-score = 79.15%) and RF (accuracy = 78.73%)—RoBERTa yields absolute gains of +6.29 percentage points in F1-score and +6.03 percentage points in accuracy. Compared with the best non-transformer deep-learning model (LSTM; F1-score = 80.89%, accuracy = 79.65%), the improvements are +4.55 and +5.11 percentage points, respectively, with the largest margins in the more challenging categories (Classes 3 and 8).
Additionally, Table 12 highlights the detection speed advantage of the RoBERTa-based method, which achieved an average inference time of 0.0221 s per instance, significantly faster than the other evaluated models. Although RoBERTa has more parameters, its superior detection speed can be primarily attributed to its transformer-based architecture, optimized embedding techniques, and effective GPU acceleration, which enable efficient parallel processing. In contrast, traditional machine learning models and certain deep learning architectures typically rely on CPU-based computations, which restricts their inference speed.
SHAP analysis
Figure 13 summarizes SHapley Additive exPlanations (SHAP)-based behavior for “Contamination” and “Others”—the two classes with the lowest F1 score and accuracy in both the validation and test sets. For “Contamination”, global SHAP indicates strong reliance on defect cues: mean |SHAP| for oil (~0.22), stain (~0.20), dust (~0.17), and gap/closer/gasket (~0.16–0.19) is roughly 3–4× higher than for administrative terms (TM/request/completion, ~0.03–0.06). Δ(FN − TP) identifies administrative markers as error drivers: request (+0.06–0.09), completion (+0.07–0.10), TM (+0.04–0.08), and date/ID codes (+0.04–0.07) gain attribution in false negatives, diluting contamination cues.
For “Others”, the pattern reverses: administrative/generic tokens dominate (mean |SHAP| for request/completion/ticket/photo ~0.05–0.07), while class-generic words (misc/note/observation) are weaker (~0.03–0.04). Δ(FN − TP) again flags administrative terms as principal error sources (+0.05–0.09), consistent with this class’s catch-all nature.
Notably, a small set of administrative tokens accounts for substantial misclassification pressure in both classes (attribution shifts of +0.04 to +0.10), motivating abbreviation/administrative-text normalization, refined “Others” labeling, and negation-aware preprocessing to improve class-specific F1 score.
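Token-level attributions of this kind can be produced with the SHAP library on a fine-tuned text-classification pipeline, as sketched below; the public checkpoint is a placeholder, not the study's model.

```python
# Token-attribution sketch with SHAP over a Hugging Face text-classification
# pipeline. The public sentiment checkpoint is purely a placeholder classifier.
import shap
from transformers import pipeline

clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english",
               top_k=None)                  # return scores for every class
explainer = shap.Explainer(clf)             # SHAP wraps transformers pipelines
shap_values = explainer(["Oil stain and dust on the fire door surface."])
print(shap_values.values[0].shape)          # per-token attributions per class
```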
Potential applications of research findings
The proposed transformer-based classifiers—particularly RoBERTa—can automate triage of fire-door defect reports in computer-aided maintenance systems, enabling rapid prioritization of safety-critical repairs and efficient work-order bundling. Integrated with digital-twin and Building Information Modeling (BIM) environments, they can surface defect locations, track remediation, and support audit readiness for regulatory compliance55,56. Near-real-time inference (~0.02–0.03 s per instance) enables monitoring dashboards and contractor-performance benchmarking. Beyond fire doors, the workflow extends to other textual quality-assurance records and multilingual logs, facilitating preventive-maintenance planning and evidence-based resource allocation. Moreover, the outputs can support cost-effectiveness and Return-On-Investment (ROI) analyses by quantifying accuracy-driven reductions in reinspection effort, response time, and safety risk57,58,59,60.
Conclusions
This study proposed and evaluated five transformer-based models (BERT, RoBERTa, ALBERT, DistilBERT, and XLNet) for detecting fire door defects, utilizing optimized hyperparameters such as sequence length, epochs, and batch size. In total, 1,458 model variants were developed and evaluated. The dataset comprised 4,212 real-world fire door defect reports collected from apartment complexes, covering 8,786 household units. Eight defect categories were identified, including seven common defects and one minor defect.
Among the evaluated methods, the optimized RoBERTa model demonstrated the highest performance. Specifically, on the test dataset, RoBERTa achieved the following F1 scores per defect category: frame gap (92.13%), door closer adjustment (87.29%), contamination (78.17%), dent (82.89%), scratch (80.17%), sealing components (96.66%), mechanical operation components (98.43%), and others (67.81%), resulting in an average F1 score of 85.44%. Regarding accuracy, RoBERTa attained the following results: frame gap (91.55%), door closer adjustment (87.22%), contamination (75.84%), dent (81.78%), scratch (79.49%), sealing components (96.16%), mechanical operation components (98.27%), and others (67.74%), with an overall average accuracy of 84.76%.
Furthermore, the RoBERTa-based model significantly outperformed six conventional classifiers—ANN, SVM, DT, LR, 1D CNN, and LSTM—across 535 tuned variants. On the test set it achieved an average F1-score of 85.44% and accuracy of 84.76%, exceeding the best traditional baselines (SVM: F1-score 79.15%; Random Forest: accuracy 78.73%) by +6.29 and +6.03 percentage points, respectively. Relative to the strongest non-transformer deep-learning baseline (LSTM: F1-score 80.89%, accuracy 79.65%), the improvements are +4.55 and +5.11 percentage points, underscoring the method’s robustness and practical value.
Despite promising results, several limitations warrant further study. The corpus was assembled from a small number of companies within a fixed period, and inspector training levels varied; consequently, writing style and label quality may not reflect broader practice. As with any data-driven approach, performance depends on the quality and representativeness of the training texts, so external validity will hinge on how closely new inspection narratives resemble those used here. The study also did not quantify economic benefit; life-cycle cost and ROI analyses tailored to this triage workflow and to alternative deployment contexts remain outstanding.
Future work should expand evaluation to multi-source datasets spanning additional contractors, time windows, and inspector backgrounds to strengthen generalizability. It should also assess cross-lingual robustness by training and testing on the original Korean texts and other languages, using multilingual encoders, back-translation consistency checks, and lightweight adapters where appropriate. Finally, the impact of translation should be examined systematically by comparing human and machine translations and measuring their effects on class-wise performance and error patterns. These steps will clarify how domain shift, language choice, and translation quality influence reliability and guide practical deployment at scale.
Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
References
Alshboul, O. & Shehadeh, A. Enhancing infrastructure project outcomes through optimized contractual structures and long-term warranties. Eng. Constr. Archit. Manag. https://doi.org/10.1108/ECAM-07-2024-0954 (2024).
Shehadeh, A., Alshboul, O. & Saleh, E. Enhancing safety and reliability in multistory construction: A multi-state system assessment of shoring/reshoring operations using interval-valued belief functions. Reliab. Eng. Syst. Saf. 252, 110458 (2024).
Liu, Z. & Zhuang, Y. An investigation using resampling techniques and explainable machine learning to minimize fire losses in residential buildings. J. Build. Eng. 95, 110080 (2024).
Liu, Z. et al. Towards evidence-based fire prevention policy: Uncovering drivers of urban residential fire spread via explainable machine learning. Dev. Built Environ. 24, 100761 (2025).
Wang, S., Moon, S., Eum, I., Hwang, D. & Kim, J. A text dataset of fire door defects for pre-delivery inspections of apartments during the construction stage. Data Br. 60, 111536 (2025).
Zhou, P. & El-Gohary, N. Ontology-Based multilabel text classification of construction regulatory documents. J. Comput. Civ. Eng. https://doi.org/10.1061/(asce)cp.1943-5487.0000530 (2016).
Zhang, J. & El-Gohary, N. M. Extending building information models semiautomatically using semantic natural language processing techniques. J. Comput. Civ. Eng. https://doi.org/10.1061/(asce)cp.1943-5487.0000536 (2016).
Alshboul, O., Shehadeh, A. & Tamimi, M. Sustainability-Focused pavement management under climate variability. J. Constr. Eng. Manag. 151, 4025076 (2025).
Shehadeh, A. & Alshboul, O. Enhancing Engineering and Architectural Design Through Virtual Reality and Machine Learning Integration. Buildings 15, (2025).
Shariatfar, M. & Lee, Y. C. Urban-level data interoperability for management of building and civil infrastructure systems during the disaster phases using model view definitions. J. Comput. Civ. Eng. https://doi.org/10.1061/(asce)cp.1943-5487.0001040 (2023).
Alshboul, O., Al-Shboul, K., Shehadeh, A. & Tatari, O. Advancing equipment management for construction: introducing a new model for cost, time and quality optimization. Constr. Innov. https://doi.org/10.1108/CI-04-2024-0129 (2025).
Wang, S. & Han, J. Automated detection of exterior cladding material in urban area from street view images using deep learning. J. Build. Eng. 96, 110466 (2024).
Shuang, Q., Liu, X., Wang, Z. & Xu, X. Automatically categorizing construction accident narratives using the Deep-Learning model with a Class-Imbalance treatment technique. J. Constr. Eng. Manag. 150, 4024107 (2024).
Shehadeh, A. & Alshboul, O. Enhancing occupational safety in construction: predictive analytics using advanced ensemble machine learning algorithms. Eng. Appl. Artif. Intell. 159, 111761 (2025).
Xue, J., Shen, G. Q., Li, Y., Han, S. & Chu, X. Dynamic analysis on public concerns in Hong Kong–Zhuhai–Macao bridge: integrated topic and sentiment modeling approach. J. Constr. Eng. Manag. 147, 4021049 (2021).
Wang, S., Kim, M., Hae, H., Cao, M. & Kim, J. The development of a Rebar-Counting model for reinforced concrete columns: using an unmanned aerial vehicle and Deep-Learning approach. J. Constr. Eng. Manag. 149, 1–13 (2023).
Salama, D. M. & El-Gohary, N. M. Semantic text classification for supporting automated compliance checking in construction. J. Comput. Civ. Eng. https://doi.org/10.1061/(asce)cp.1943-5487.0000301 (2016).
Goh, Y. M. & Ubeynarayana, C. U. Construction accident narrative classification: an evaluation of text mining techniques. Accid. Anal. Prev. https://doi.org/10.1016/j.aap.2017.08.026 (2017).
Zhang, F., Fleyeh, H., Wang, X. & Lu, M. Construction site accident analysis using text mining and natural language processing techniques. Autom. Constr. https://doi.org/10.1016/j.autcon.2018.12.016 (2019).
Ul Hassan, F., Le, T. & Tran, D. H. Multi-Class Categorization of Design-Build Contract Requirements Using Text Mining and Natural Language Processing Techniques. In Construction Research Congress 2020: Project Management and Controls, Materials, and Contracts - Selected Papers from the Construction Research Congress 2020 (2020). https://doi.org/10.1061/9780784482889.135
Luo, X., Li, X., Song, X. & Liu, Q. Convolutional neural network Algorithm–Based novel automatic text classification framework for construction accident reports. J. Constr. Eng. Manag. https://doi.org/10.1061/jcemd4.coeng-13523 (2023).
Wang, Y., Zhang, Z., Wang, Z., Wang, C. & Wu, C. Interpretable machine learning-based text classification method for construction quality defect reports. J. Build. Eng. 89, 109330 (2024).
Fang, W. et al. Automated text classification of near-misses from safety reports: an improved deep learning approach. Adv. Eng. Inf. https://doi.org/10.1016/j.aei.2020.101060 (2020).
Yang, D. et al. Defect text classification in residential buildings using a multi-task channel attention network. Sustain. Cities Soc. https://doi.org/10.1016/j.scs.2022.103803 (2022).
Wang, H., Xu, S., Cui, D., Xu, H. & Luo, H. Information integration of regulation texts and tables for automated construction safety knowledge mapping. J. Constr. Eng. Manag. https://doi.org/10.1061/jcemd4.coeng-14436 (2024).
Tian, D., Li, M., Han, S. & Shen, Y. A novel and intelligent safety-hazard classification method with syntactic and semantic features for large-scale construction projects. J. Constr. Eng. Manag. 148, 4022109 (2022).
Jianan, G., Kehao, R. & Binwei, G. Deep learning-based text knowledge classification for whole-process engineering consulting standards. J. Eng. Res. https://doi.org/10.1016/j.jer.2023.07.011 (2024).
Zhong, B., Shen, L., Pan, X., Zhong, X. & He, W. Dispute classification and analysis: deep Learning–Based text mining for construction contract management. J. Constr. Eng. Manag. https://doi.org/10.1061/jcemd4.coeng-14080 (2024).
Vaswani, A. et al. Attention is all you need. In Adv. Neural Inf. Process. Syst. (2017).
Wang, S., Kim, J., Park, S. & Kim, J. Fault diagnosis of air handling units in an auditorium using real operational labeled data across different operation modes. J. Comput. Civ. Eng. 39 (2025).
Wang, S. Automated fault diagnosis detection of air handling units using real operational labelled data and Transformer-based methods at 24-hour operation hospital. Build. Environ. 113257 https://doi.org/10.1016/j.buildenv.2025.113257 (2025).
Wang, S., Park, S., Kim, J. & Kim, J. Safety helmet monitoring on construction sites using YOLOv10 and advanced transformer architectures with surveillance and Body-Worn cameras. J. Constr. Eng. Manag. https://doi.org/10.1061/JCEMD4/COENG-16760 (2025).
Wang, S. Automated non-PPE detection on construction sites using YOLOv10 and transformer architectures for surveillance and body worn cameras with benchmark datasets. Sci. Rep. 15, 27043 (2025).
Park, S., Kim, J., Wang, S. & Kim, J. Effectiveness of image augmentation techniques on Non-Protective personal equipment detection using YOLOv8. Appl Sci 15, 2631 (2025).
Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL HLT 2019–2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference (2019).
Liu, Y. et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
Lan, Z. et al. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019).
Sanh, V., Debut, L., Chaumond, J. & Wolf, T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019).
Yang, Z. et al. XLNet: Generalized autoregressive pretraining for language understanding. Adv. Neural Inf. Process. Syst. 32 (2019).
Zhou, X. Sentiment analysis of the consumer review text based on BERT-BiLSTM in a social media environment. Int. J. Inf. Technol. Syst. Approach. https://doi.org/10.4018/IJITSA.325618 (2023).
Wang, S. Effectiveness of traditional augmentation methods for rebar counting using UAV imagery with faster R-CNN and YOLOv10-based transformer architectures. Sci. Rep. 15, 33702 (2025).
Wang, S., Eum, I., Park, S. & Kim, J. A semi-labelled dataset for fault detection in air handling units from a large-scale office. Data Br. 57, 110956 (2024).
Wang, S., Korolija, I. & Rovas, D. Impact of Traditional Augmentation Methods on Window State Detection. CLIMA 2022 Conf. 1–8 (2022). https://doi.org/10.34641/clima.2022.375
Liu, G. et al. Dual-agent intelligent fire detection method for large commercial spaces based on numerical databases and artificial intelligence. Process. Saf. Environ. Prot. 191, 2485–2499 (2024).
Alshboul, O., Shehadeh, A. & Almasabha, G. Reliability of information-theoretic displacement detection and risk classification for enhanced slope stability and safety at highway construction sites. Reliab. Eng. Syst. Saf. 256, 110813 (2025).
Hwang, D., Kim, J. J., Moon, S. & Wang, S. Image augmentation approaches for building dimension estimation in street view images using object detection and instance segmentation based on deep learning. Appl. Sci. 15, 2525 (2025).
Wang, S., Eum, I., Park, S. & Kim, J. A labelled dataset for rebar counting inspection on construction sites using unmanned aerial vehicles. Data Br. 110720 https://doi.org/10.1016/j.dib.2024.110720 (2024).
Wang, S., Park, S., Park, S. & Kim, J. Building façade datasets for analyzing building characteristics using deep learning. Data Br. 57, 110885 (2024).
Wang, S. Development of approach to an automated acquisition of static street view images using transformer architecture for analysis of building characteristics. Sci. Rep. 15, 29062 (2025).
Wang, S. Evaluating cross-building transferability of attention-based automated fault detection and diagnosis for air handling units: auditorium and hospital case study. Build. Environ. 113889 https://doi.org/10.1016/j.buildenv.2025.113889 (2025).
Han, J., Kim, J., Kim, S. & Wang, S. Effectiveness of image augmentation techniques on detection of building characteristics from street view images using deep learning. J. Constr. Eng. Manag. 150, 1–18 (2024).
Wang, S. A hybrid SMOTE and Trans-CWGAN for data imbalance in real operational AHU AFDD: A case study of an auditorium building. Energy Build. 348, 116447 (2025).
Shehadeh, A., Alshboul, O., Taamneh, M. M., Jaradat, A. Q. & Alomari, A. H. Enhanced clash detection in Building information modeling: leveraging modified extreme gradient boosting for predictive analytics. Results Eng. 24, 103439 (2024).
Wang, S. Real operational labeled data of air handling units from office, auditorium, and hospital buildings. Sci. Data. https://doi.org/10.1038/s41597-025-05825-9 (2025).
Shehadeh, A., Alshboul, O. & Arar, M. Enhancing Urban Sustainability and Resilience: Employing Digital Twin Technologies for Integrated WEFE Nexus Management to Achieve SDGs. Sustainability 16, (2024).
Shehadeh, A. et al. Advanced integration of BIM and VR in the built environment: enhancing sustainability and resilience in urban development. Heliyon 11, e42558 (2025).
Shehadeh, A. & Alshboul, O. Game theory integration in construction management: A comprehensive approach to Cost, Risk, and coordination under uncertainty. J. Constr. Eng. Manag. 151, 4025039 (2025).
Shehadeh, A., Alshboul, O. & Tamimi, M. Quantitative analysis of Climate-Adaptive pavement technologies: mitigating environmental impact and enhancing economic viability in urban infrastructure. J. Constr. Eng. Manag. 151, 4025064 (2025).
Alshboul, O. & Shehadeh, A. Integrating labor and insurance regulations for enhanced safety and security of construction workers. J. Leg. Aff Disput Resolut Eng. Constr. 17, 4525021 (2025).
Alshboul, O., Shehadeh, A. & Saleh, E. Advancing construction quality management: an integrated evidential reasoning and belief functions framework. J. Constr. Eng. Manag. 151, 4025093 (2025).
Author information
Contributions
Conceptualization, Methodology, Software, Validation, Formal Analysis, Investigation, Data Curation, Resources, Writing—Original Draft Preparation, Writing—Review and Editing: S. Wang.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical approval
There are no studies by any of the authors in this article that used humans or animals as subjects.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, S. Development of an automated transformer-based text analysis framework for monitoring fire door defects in buildings. Sci Rep 15, 43910 (2025). https://doi.org/10.1038/s41598-025-27648-9