Introduction

In residential buildings, fire doors are typically installed at stairwell entrances and front doors of individual units to ensure residents have safe evacuation routes during fires1. Additionally, fire doors are strategically engineered to compartmentalize buildings, effectively limiting the spread of fire and smoke between different sections2,3,4. However, various factors during construction can introduce defects in fire doors. Physical damage often occurs when doors serve as access points during material transportation or interior finishing processes. Furthermore, cost-saving practices, such as using lower-quality materials or employing insufficiently trained workers who do not strictly follow installation guidelines, can significantly undermine fire door integrity5.

To verify their proper operation, fire doors undergo visual inspections before the completion of construction. Inspection findings are documented as unstructured textual records, which include details like inspection dates, responsible contractors, and defect locations. Defects recorded are classified into categories, including operational issues (such as doors failing to close properly) and physical damages (such as dents or scratches). This classification is crucial for maintenance teams, enabling them to prioritize safety-critical repairs and improve efficiency by grouping similar repair tasks to minimize downtime and travel between units6.

Nevertheless, manually classifying fire door defects from extensive textual descriptions is labor-intensive, time-consuming, and susceptible to errors. Previous research indicates low accuracy in manual defect classification at many construction sites, underscoring the need for automated classification solutions. Earlier studies in the construction field have frequently utilized conventional text classification techniques like Naive Bayes (NB) and Support Vector Machines (SVM). These methods typically depend on simpler mathematical structures with fewer parameters7,8,9. Although they can perform adequately for basic classification tasks, they generally fall short in capturing complex patterns and subtle linguistic details present in natural language datasets, such as those encountered in inspection reports and defect records10,11.

In contrast, recent advancements in deep learning, specifically supervised learning methods based on Bidirectional Encoder Representations from Transformers (BERT), offer sophisticated architectures that incorporate multiple sequential layers. These layers introduce non-linear transformations during training, enabling BERT-based models to generate rich and contextually meaningful representations. Consequently, these models effectively address linguistic complexities, including diverse vocabularies, varied syntax structures, and stylistic variations commonly observed in construction-related textual data12.

However, numerous variations of BERT-based methods exist, and successfully deploying these models requires careful customization, extensive training, and systematic evaluation tailored specifically to the targeted classification task. These factors significantly impact key aspects of model performance, including accuracy and detection speed. Given these considerations, this study aims to develop and rigorously evaluate various BERT-based approaches for the automated classification of textual data, with a particular emphasis on detecting steel fire door defects.

The main contributions of this study are summarized as follows: (1) Real-world text data were collected from apartment complexes encompassing a total of 8,786 household units. (2) A robust classification framework was developed, covering eight defect categories: seven specific defect types and an 'others' category. (3) A comprehensive set of 1,458 predictive models was developed using five BERT-based methodologies, each extensively optimized through hyperparameter tuning. (4) A further 535 models based on other machine learning methods were optimized, and a detailed comparative analysis between these traditional machine learning models and the BERT-based models was conducted. (5) Based on the resulting findings, the practical applications and implications for real-world construction management were thoroughly discussed.

Literature review

Typical process of BERT-based text classification

The typical workflow for employing supervised deep learning models, particularly those based on the BERT family, involves several critical steps: target class definition, annotation, data cleaning, text vectorization, model training, and hyperparameter optimization13.

The process begins with clearly defining target classes from the raw textual data, a foundational step for effective model training14. For example, defects in fire doors may be broadly categorized into hardware faults and aesthetic faults, with potential for further subdivision into more detailed categories based on specific inspection requirements and intended use. After defining these target classes, an annotation step assigns precise labels to corresponding textual descriptions, creating structured datasets for model learning. Next, data cleaning is conducted to ensure textual data is consistent, relevant, and free from extraneous noise.

Following cleaning, text vectorization methods transform the textual data into numerical representations suitable for computational processing. BERT-based models play a crucial role here, producing context-aware vector representations that effectively capture complex linguistic patterns, significantly enhancing model performance15. Hyperparameter optimization systematically tunes model parameters such as learning rate, batch size, and the number of epochs to achieve optimal model performance. Each of these steps significantly influences overall model effectiveness16. This study specifically emphasizes evaluating the effectiveness of BERT-based vectorization methods for detecting defects in fire doors.

Existing text classification methods

Various comparative studies have evaluated the performance of supervised learning methodologies within the construction industry, using both traditional machine learning and deep learning techniques. The methods employed in these studies, their achieved accuracies, and respective application contexts are summarized in Table 1.

Salama and El-Gohary17 compared machine learning methods, including SVM, NB, and Maximum Entropy (ME), for extracting rules from contractual texts, achieving the highest accuracy of 82% with ME. Goh and Ubeynarayana18 assessed methods such as SVM, Logistic Regression (LR), and NB for construction accident classification, where SVM showed superior performance. Zhang et al.19 employed Decision Trees (DT), LR, and ensemble methods for classifying construction accident types, identifying Decision Trees as effective in their context.

Ul Hassan et al.20 developed supervised machine learning approaches to categorize textual descriptions into three project phases: design, construction, and operation and maintenance. Among the tested models (NB, SVM, LR, K-Nearest Neighbor (KNN), DT, and Artificial Neural Networks (ANN)), LR performed best with an accuracy of 94.12%. Luo et al.21 also evaluated traditional models such as SVM, NB, and LR alongside Convolutional Neural Networks (CNN) for accident type classification, with CNN delivering the highest accuracy of 76%. Wang et al.22 proposed both machine learning and deep learning approaches, including DT, Random Forest (RF), NB, SVM, and CNN with Attention (CNN-AT), to categorize construction defects from daily supervisory reports. Among these, NB demonstrated notable performance with an accuracy of 98%.

Moreover, recent studies highlighted transformer-based methods’ superior capability in text classification tasks. Fang et al.23 employed deep learning models, including FastText, TextCNN, TextCNN combined with Bidirectional Gated Recurrent Units (BiGRU), Text Recurrent Convolutional Neural Networks (TextRCNN), and BERT, to classify near-miss incidents, with BERT obtaining the highest accuracy of 86.91%. Yang et al.24 applied deep learning methods such as CNN and Generative Pre-trained Transformer 2 (GPT-2) to classify facility defects, achieving the highest accuracy of 90.72% using a CNN-based model. Additionally, Wang et al.25 evaluated transformer architectures such as BERT-Bidirectional Long Short-Term Memory (BiLSTM) and BERT-LSTM-CRF for explicit safety knowledge extraction from regulatory texts, obtaining the highest accuracy of 91.74% with the BERT-BiLSTM-CRF method.

Tian et al.26 compared transformer-based approaches such as BERT-Graph Convolutional Network (GCN) and BiLSTM for safety hazard classification, achieving accuracy up to 86.56%. Jianan et al.27 evaluated transformer architectures, including the Robustly Optimized BERT Approach (RoBERTa) and Longformer-RoBERTa, for classifying knowledge types from construction consulting standards, with the Longformer-RoBERTa variant reaching the highest accuracy of 91.65%. In addition, a systematic assessment by Zhong et al.28 contrasted classical classifiers (SVM, NB, LR, DT, KNN) with neural architectures (TextCNN, TextCNN–LSTM, RCNN, Transformer) for construction-dispute classification; TextCNN achieved the best result at 65.09% accuracy.

These findings indicate that no single algorithm consistently outperforms others across all scenarios, underscoring that the optimal model choice heavily depends on the specific classification task and its context. Currently, no research has been conducted on automated monitoring of fire door defects in buildings. Despite the proven effectiveness of transformer-based methods, particularly BERT variants, their application to classifying fire door defects from inspection reports remains underexplored. To address this research gap, this study proposes and evaluates multiple transformer-based models (BERT, RoBERTa, A Lite BERT (ALBERT), Distilled BERT (DistilBERT), and XLNet). Additionally, traditional machine learning algorithms (ANN, SVM, DT, RF, LR) and other deep learning approaches (1D CNN and LSTM) are comprehensively assessed and compared.

Table 1 Summary of reviewed studies.

Proposed approach

As illustrated in Fig. 1, the proposed method consists of the following steps: First, reports describing fire door defects are collected from real households and categorized based on various defect types, such as frame gap and contamination. Next, each instance within the collected dataset is systematically annotated according to the corresponding defect categories. The dataset is then preprocessed through sequential cleaning steps, including lowercasing and lemmatization. Subsequently, five different transformer-based models (BERT, RoBERTa, ALBERT, DistilBERT, XLNet) are developed by fine-tuning their architectures and systematically optimizing hyperparameters, resulting in a total of 1,458 generated models. These transformer-based models are carefully evaluated to identify underfitting and overfitting issues, and their performance is then compared against other established text classification methods, comprising an additional 535 models. Multiple evaluation metrics are utilized for a comprehensive comparative analysis. Each of these steps is thoroughly detailed in the subsequent sections.

Fig. 1. Development workflow of the proposed method.

Target class definition

In this research, excluding the category labeled ‘others’, seven distinct defect types were determined based on fire door inspection records. These include gaps in the frame, defects related to door closer adjustments, contamination issues, dents, scratches, missing sealing elements, and missing mechanical operation components.

Each defect type is visually demonstrated in Fig. 2. Frame gap defects occur as horizontal or vertical separations between the door and its frame. Defects in door closer adjustments involve incorrect calibration of the closer’s two distinct speed zones, each controlled via an adjustment screw. Rotating this screw clockwise decreases the door’s closing speed, whereas rotating it counter-clockwise accelerates the closing. Therefore, precise adjustment is essential for effective operation.

Fig. 2. Visual examples of the seven defect types in fire doors.

Contamination, dents, and scratches typically affect door or frame surfaces. Deficiencies classified under missing sealing elements usually pertain to the absence of critical gaskets required for containing smoke and fire effectively. Lastly, defects classified as missing mechanical operation components include absent essential hardware such as door closers, digital locks, hinges, or door stoppers.

Annotation

After visually inspecting fire doors for defects, inspectors first document their findings manually in handwritten reports. Sample statements extracted from these handwritten records are presented in Table 2. Each fire door can have several distinct defects, and each defect is individually recorded. Subsequently, the handwritten observations are converted into digital form, as shown in Fig. 3. Finally, these digital sentences are annotated with appropriate labels that identify the specific categories of defects.

Fig. 3. Data cleaning methods with examples.

Table 2 Labeling basis on seven types of classification.

Data cleaning methods

In this study, five data preprocessing methods were employed to refine the textual data. The impact of these preprocessing techniques on the raw text is demonstrated with an example in Fig. 3, and a minimal code sketch of the full pipeline follows the list:

1) Lowercasing: Converts all text characters to lowercase. This standardizes words such as “Fire door” and “fire door”, eliminating unnecessary distinctions and simplifying the dataset.

2) Removing punctuation: Excludes punctuation marks (e.g., commas, question marks, periods) from the text. For many classification tasks, punctuation adds little semantic value; eliminating it reduces noise and yields a cleaner, more consistent input.

3) Tokenization: Breaks down text into smaller components known as tokens, typically words. Tokenization converts raw text into structured units that can then be represented numerically, making them suitable for processing by text classification algorithms.

4) Removing stop words: Excludes frequently occurring words such as “and,” “the,” “is,” and “to,” which usually provide minimal informational value. Removing these stop words considerably reduces vocabulary size and data complexity, facilitating more efficient processing.

5) Lemmatization: Reduces words to their fundamental or dictionary forms, known as lemmas, based on grammatical usage and context. For example, the words “pushed” and “pulled” become “push” and “pull,” respectively. Unlike basic truncation, lemmatization identifies accurate root forms, ensuring consistent data representation.
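To make the pipeline concrete, the sketch below chains the five steps using NLTK on an illustrative sentence; the example text, the clean_text helper, and the specific NLTK resources are assumptions for demonstration and are not drawn from the study's dataset or code.

```python
# Minimal sketch of the five cleaning steps using NLTK; the example sentence and
# resource names are illustrative (required NLTK resources may vary by version).
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

def clean_text(text: str) -> list[str]:
    # 1) Lowercasing
    text = text.lower()
    # 2) Removing punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3) Tokenization
    tokens = word_tokenize(text)
    # 4) Removing stop words
    stop_words = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop_words]
    # 5) Lemmatization (verbs reduced to their dictionary form)
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t, pos="v") for t in tokens]

print(clean_text("The Fire door was pushed, but it is not closing properly."))
# ['fire', 'door', 'push', 'close', 'properly']
```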

BERT-based methods

The transformer architecture, proposed by Vaswani et al.29, employs stacked self-attention layers and fully connected layers organized into encoder and decoder modules. Unlike traditional sequence models, transformers leverage self-attention mechanisms instead of recurrent or convolutional operations, enabling the efficient modeling of long-range dependencies within textual data30,31. The overall transformer architecture, which consists of multiple layers of self-attention and feed-forward neural networks in both encoder and decoder blocks, is illustrated in Fig. 4.

Fig. 4. Workflow of transformer architecture.

To incorporate positional information, positional encodings are combined directly with the input embeddings at the base of both the encoder and decoder modules. These positional encodings match the dimensionality (\(d_{model}\)) of the embeddings, allowing seamless integration. Positional encodings use sinusoidal functions at different frequencies, as defined by Eqs. (1) and (2):

$$PE(pos,\,2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
(1)
$$PE(pos,\,2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
(2)

Here, pos represents the position in the sequence, and i indicates the embedding dimension index; each dimension corresponds to a distinct sinusoidal waveform. The Transformer encoder and decoder each consist of a stack of identical layers. Each encoder layer includes a multi-head self-attention mechanism followed by a fully connected feed-forward network, while each decoder layer contains these components plus an additional multi-head attention layer that attends to the encoder outputs32.
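For illustration, the following NumPy sketch implements Eqs. (1) and (2); the function name, sequence length, and embedding size are illustrative choices rather than values taken from the study.

```python
# Minimal NumPy sketch of the sinusoidal positional encoding in Eqs. (1)-(2).
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(seq_len)[:, np.newaxis]          # positions 0..seq_len-1
    i = np.arange(d_model // 2)[np.newaxis, :]       # dimension index i
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                      # even dimensions, Eq. (1)
    pe[:, 1::2] = np.cos(angle)                      # odd dimensions, Eq. (2)
    return pe

pe = positional_encoding(seq_len=128, d_model=768)   # same shape as the input embeddings
print(pe.shape)  # (128, 768)
```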

Scaled Dot-Product Attention (Fig. 5) processes queries (Q), keys (K), and values (V), with attention weights calculated by scaling the dot products between queries and keys by the square root of the key dimension (\(d_{k}\)) and then applying a softmax function, as expressed in Eq. (3):

$$Attention(Q,\,K,\,V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$$
(3)
Fig. 5. Scaled dot-product and multi-head attention.

In multi-head attention, instead of performing a single attention operation with \(d_{model}\)-dimensional queries, keys, and values, multiple parallel attention heads (h) are employed, as shown in Fig. 5, each applying its own learned linear projections to produce queries, keys, and values of dimensions \(d_{q}\), \(d_{k}\), and \(d_{v}\), respectively33. Parallelizing the attention computation across these heads enhances performance by allowing each head to capture information from a different representation subspace34.
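A minimal NumPy sketch of the scaled dot-product attention in Eq. (3) is shown below; the single-head shapes and random inputs are illustrative assumptions, not the study's configuration.

```python
# Minimal NumPy sketch of scaled dot-product attention (Eq. 3);
# one head, 4 tokens, d_k = d_v = 64 are illustrative shapes.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # (tokens, tokens)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                        # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 64))
K = rng.normal(size=(4, 64))
V = rng.normal(size=(4, 64))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 64)
```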

Building upon the transformer architecture, numerous variants have emerged, each tailored to specific tasks and use cases. These variants primarily differ in their pre-training strategies, parameter efficiency, and internal architectural enhancements. The following subsections provide detailed discussions of prominent BERT variants, with their key distinctions clearly summarized in Table 3.

Table 3 Summary of distinct characteristics in each method.

BERT

BERT is an encoder-only Transformer trained with Masked-Language Modeling (MLM) in an autoencoding regime35. A small subset of tokens is replaced by a mask token, and the network recovers the originals using bidirectional self-attention, which integrates left and right context within each encoder block. The model employs subword tokenization (e.g., WordPiece), learned token and positional embeddings, and a stack of multi-head self-attention and position-wise feed-forward layers with residual connections and layer normalization. For sequence classification, a task-specific affine transformation with softmax is applied to the sequence-level [CLS] representation (classification token), and all parameters are fine-tuned jointly. Due to its parameter count and the quadratic cost of self-attention in sequence length, BERT is generally more demanding computationally than lighter variants.

RoBERTa

RoBERTa preserves BERT’s MLM objective and autoencoding pre-training while emphasizing optimization and scale through dynamic masking, larger mini-batches, longer training schedules, and larger corpora, alongside implementation refinements that stabilize learning36. Downstream use follows the same encoder-only architecture and fine-tuning protocol as BERT; a task-specific affine transformation with softmax is applied to the [CLS] representation, and all parameters are fine-tuned jointly. Given its reliance on aggressive scaling, RoBERTa typically matches or exceeds BERT’s accuracy at similar model sizes while being equally or more demanding to pre-train.

ALBERT

ALBERT targets parameter efficiency within the same MLM/autoencoding paradigm by introducing factorized embedding parameterization, which decouples the vocabulary embedding dimension from the hidden size used in Transformer layers, and by applying cross-layer parameter sharing to reduce the number of unique weights37. These design choices substantially lower memory usage and improve throughput with limited accuracy loss. Fine-tuning mirrors BERT; a task-specific affine transformation with softmax is applied to the [CLS] representation, and all parameters are fine-tuned jointly. With far fewer parameters, ALBERT is less demanding in both memory and inference latency, making it suitable for resource-constrained settings.

DistilBERT

DistilBERT compresses BERT via knowledge distillation while retaining the MLM/autoencoding objective. A smaller student model with fewer encoder layers is trained to match a larger teacher’s softened outputs (and, in some implementations, intermediate representations), typically combining a distillation term with the MLM loss38. Architectural simplifications further reduce complexity without altering the encoder-only design. In downstream tasks, a task-specific affine transformation with softmax is applied to the [CLS] representation, and all parameters are fine-tuned jointly. This compression yields markedly lower latency and memory usage with modest accuracy trade-offs, rendering DistilBERT less demanding than full-size BERT at inference.

XLNet

XLNet replaces MLM with Permutation Language Modeling (PLM)—an autoregressive objective that maximizes the expected likelihood over all factorization orders of a sequence39. Built on Transformer-XL, it incorporates segment-level recurrence and relative positional encodings to capture long contexts efficiently, and employs two-stream attention to condition on permuted “future” tokens without information leakage. For sequence classification, implementations use a sequence-level summary token (analogous to [CLS]); a task-specific affine transformation with softmax is applied to this sequence-level representation, and all parameters are fine-tuned jointly. Owing to the PLM objective and long-context machinery, XLNet typically incurs a higher computational burden, comparable to or above BERT/RoBERTa, while providing strong language-understanding performance.

Fine-tuning of architectures

To utilize BERT-based models for classification tasks, a fully connected (FC) classification layer is typically added on top of the transformer architecture. The embeddings generated by the BERT model are passed into an FC classification head, which generally consists of one dense hidden layer with 128 units, followed by a dropout layer with a dropout rate of 0.3. Finally, a dense output layer with softmax activation classifies these embeddings into distinct categories, ensuring reliable accuracy and efficient computational performance40,41,42. In this research, the classifier head is specifically configured to classify fire-door conditions into eight distinct categories, as described in Table 2.
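As an illustration of this classifier head, the hedged sketch below builds the dense 128-unit layer, 0.3 dropout, and eight-way softmax on top of a pretrained encoder, assuming TensorFlow 2.x (Keras 2) and the Hugging Face transformers library; the checkpoint name, maximum sequence length, optimizer settings, and build_classifier helper are illustrative assumptions rather than the study's exact implementation.

```python
# Hedged TensorFlow/Keras sketch of the FC classification head described above.
import tensorflow as tf
from transformers import TFBertModel

def build_classifier(num_classes: int = 8, max_len: int = 128) -> tf.keras.Model:
    input_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")

    encoder = TFBertModel.from_pretrained("bert-base-uncased")
    # Sequence-level [CLS] representation of the input sentence
    cls = encoder(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0, :]

    x = tf.keras.layers.Dense(128, activation="relu")(cls)   # hidden layer, 128 units
    x = tf.keras.layers.Dropout(0.3)(x)                      # dropout rate 0.3
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model([input_ids, attention_mask], outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
                  loss="sparse_categorical_crossentropy",   # assumes integer class labels
                  metrics=["accuracy"])
    return model
```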

Optimization of hyperparameters

Selecting optimal hyperparameters is essential for achieving high model performance; however, exhaustive hyperparameter optimization can be computationally expensive43,44,45. A practical strategy to address this is to define hyperparameter ranges based on empirical results and previous studies46,47,48. Accordingly, hyperparameter ranges and specific values were established, resulting in the generation of 1,458 model configurations. Detailed information regarding the selected hyperparameters for each method and the corresponding number of generated models is clearly summarized in Table 4.
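The sketch below illustrates how such a predefined hyperparameter grid can be enumerated; the value ranges and the train_and_evaluate stub are assumptions for demonstration, not the study's actual grid in Table 4 or its training code.

```python
# Illustrative grid search over the kinds of hyperparameters listed in Table 4.
import itertools

def train_and_evaluate(config: dict) -> float:
    # Placeholder: fine-tune one transformer with `config` and return its
    # validation F1 score. Replace with the actual training routine.
    return 0.0

grid = {
    "learning_rate": [1e-5, 2e-5, 3e-5],
    "batch_size": [16, 32],
    "epochs": [3, 4, 5],
    "max_seq_length": [128, 256, 512],
}

best_f1, best_config = -1.0, None
for values in itertools.product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    f1 = train_and_evaluate(config)
    if f1 > best_f1:
        best_f1, best_config = f1, config

print(best_config, best_f1)
```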

Table 4 Detail of used hyperparameters in each method.

Model evaluation

F1 score and accuracy

In classification tasks, both the F1 score and accuracy are widely used metrics to evaluate model correctness. The F1 score integrates precision and recall into a single metric, balancing the model’s ability to correctly detect relevant cases (recall) and its effectiveness in avoiding false alarms (precision). A detailed explanation of the F1 score is provided by49,50. A high F1 score indicates that the model effectively identifies faults while minimizing false positives and false negatives, crucial for reliable system operations. Accuracy measures the proportion of correctly classified instances over the total number of instances, providing a straightforward assessment of overall model correctness51,52. Both metrics are important for evaluating model performance, especially in scenarios where misclassification can lead to significant operational consequences.
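A minimal scikit-learn sketch of computing these two metrics is given below; the label vectors are illustrative, and macro averaging is shown as one common way to aggregate per-class F1 scores.

```python
# Minimal sketch of computing accuracy and macro-averaged F1 with scikit-learn;
# y_true/y_pred are illustrative label vectors for the eight classes.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 2, 8, 7, 3, 3, 5]
y_pred = [1, 2, 8, 7, 3, 4, 5]

print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```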

Detection speed

The detection speed refers to the computational time a model takes to process a single input instance53. In this study, the detection speed is reported in terms of instances processed per second, reflecting the efficiency of each transformer-based method when analyzing textual descriptions of fire door defects.
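A simple way to measure this quantity is sketched below; the classify stub and the synthetic test_texts list are placeholders, not the study's inference code.

```python
# Illustrative measurement of detection speed in seconds per instance.
import time

def classify(text: str) -> str:
    # Placeholder for model inference on a single defect description.
    return "Frame gap"

test_texts = ["gap between door and frame on the hinge side"] * 100

start = time.perf_counter()
predictions = [classify(text) for text in test_texts]
elapsed = time.perf_counter() - start
print(f"{elapsed / len(test_texts):.4f} s per instance")
```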

Experiment

Dataset preparation

Raw data collection

The dataset analyzed in this research regarding fire door defects was gathered from three residential apartment complexes in South Korea, each constructed by a different developer. All complexes utilized standardized steel fire-rated doors. Inspectors systematically checked a total of 8,786 households, recording identified defects in fire doors. Through these inspections, 4,212 separate defects were documented. Further details about the households inspected and the distribution of defects among the three companies are summarized in Table 5. Initially, inspection personnel manually documented the defects in Korean, subsequently converting these handwritten reports into digital form. For the purpose of this study, the original Korean descriptions were translated into English for clear presentation.

Table 5 Characteristics of defect datasets recorded by companies.

The degree of specificity in defect descriptions differed significantly across various construction firms, resulting in diverse classification methods. Several key factors influencing textual differences among companies were evident from the collected data. A primary factor impacting these variations was the training level of the inspectors; individuals with more specialized training often described defects using technical vocabulary. Experienced inspectors, in particular, utilized precise and technically advanced terminology. Figure 6 demonstrates this contrast by comparing descriptions from Company A and Company B. Specifically, in Fig. 6(a), a defect is articulated explicitly using the technical term “stack phenomenon,” reflecting a detailed, technical perspective. Conversely, Fig. 6(b) illustrates the same defect described by Company B using simpler, less technical wording.

Fig. 6. Examples of defect descriptions by work experience of inspectors.

Furthermore, in Apartment A, residents’ feedback regarding dissatisfaction was collected via a mobile app and subsequently integrated with defect descriptions to improve the overall accuracy and comprehensiveness of defect assessments. An illustrative example of this combined approach appears in Fig. 7, where the residents’ dissatisfaction is emphasized in bold text for clarity. In addition to inspector training and resident input, other elements influencing the detail level of defect descriptions included the duration of inspections, availability of inspection staff, and budgetary limitations.

Fig. 7. Example of description including dissatisfaction by household.

Due to the inherent flexibility and variability of linguistic expressions in Korean, translating textual descriptions into English required careful attention to maintain the original nuances. Figure 8 illustrates this challenge by showcasing varied expressions related to defects in window and door operation mechanisms. Therefore, the distinct variations originally present in the Korean descriptions were intentionally retained in the English translations to authentically represent this linguistic diversity.

Fig. 8. Example of variability of language.

Annotation

In this study, annotation of the raw textual data was carried out according to the criteria specified in Table 2, with the purpose of classifying defects explicitly associated with fire doors. Three annotators, each with over a decade of experience dealing with fire door defects and extensive expertise in construction management, manually labeled the data by recording defect details into an Excel spreadsheet. The completed labeled dataset can be obtained by contacting the corresponding author.

Data split

Following annotation, the entire dataset, consisting of 4,212 instances, was randomly divided into three distinct subsets: a training set comprising 2,527 instances (approximately 60% of the data) for model development, a validation set containing 842 instances (approximately 20% of the data) for selecting the optimal model, and a test set with 843 instances (approximately 20% of the data) for evaluating the performance of the final model on unseen data. To address potential biases or underrepresentation of smaller classes, such as contamination, stratified sampling was applied. Stratified sampling ensures that each subset maintains class proportions consistent with those found in the overall dataset54. For example, if a particular class represents 5% of the complete dataset, it will similarly constitute about 5% within each subset. Detailed distributions for each subset are provided in Table 6.
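The split described above can be reproduced with a stratified procedure such as the hedged scikit-learn sketch below; the toy texts and labels merely stand in for the 4,212 annotated records, and the random seed of 42 is an arbitrary illustrative choice.

```python
# Hedged sketch of the 60/20/20 stratified split with scikit-learn.
from sklearn.model_selection import train_test_split

# Toy stand-ins for the annotated defect descriptions and their class labels.
texts = ["frame gap on the left side", "oil stain on door surface",
         "door closer too fast", "missing gasket"] * 25
labels = [1, 3, 2, 6] * 25

# 60% training, keeping class proportions with stratify=labels.
X_train, X_rest, y_train, y_rest = train_test_split(
    texts, labels, train_size=0.6, stratify=labels, random_state=42)

# Split the remaining 40% evenly into validation (20%) and test (20%) sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60, 20, 20
```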

Table 6 Detailed distribution of data sets.

Experimental settings

All experiments were conducted on a system operating on Windows 10, equipped with an Intel Core i7-7700HQ CPU (2.80 GHz, 8 threads), an NVIDIA GeForce RTX 3080 Ti GPU, and sufficient memory to handle computational tasks effectively. The implementations utilized the TensorFlow framework for developing and executing the deep learning models.

Results and discussion

In this research, the performance of classification models was evaluated across eight defect categories. For clarity, these defect categories are defined as follows: Class1 – Frame gap, Class2 – Door closer adjustment, Class3 – Contamination, Class4 – Dent, Class5 – Scratch, Class6 – Sealing components, Class7 – Mechanical operation components, and Class8 – Others.

Underfitting and overfitting

In this study, the training and validation loss curves were analyzed to identify potential issues related to underfitting or overfitting during model training. Multiple hyperparameter combinations involving different batch sizes and epochs were explored, affecting the total number of training iterations. The scenario with the maximum number of iterations (batch size of 16, epochs 4, and a dataset of 2,527 training instances, resulting in 4,992 iterations) was specifically chosen, as it provided the richest insights into model convergence behavior compared to scenarios with fewer iterations (e.g., batch size of 32, epochs 4, with only 624 iterations).

Figure 9 illustrates the training and validation loss curves for each model under this maximum-iteration scenario. All models exhibited a clear and consistent decrease in both training and validation losses as training progressed, indicating effective and steady learning across the iterations. The consistent decline in training loss confirms that none of the models experienced significant underfitting. Furthermore, the close alignment and concurrent downward trends observed in both training and validation loss curves suggest that overfitting was minimal or non-existent.

Fig. 9. Training and validation loss curves for each transformer-based method over 5 epochs (4,992 iterations).

Analysis of performance variation by methods

Figure 10 presents boxplots illustrating the distributions of F1 score and accuracy for the five transformer-based models across hyperparameter combinations. A preliminary visual analysis shows that RoBERTa consistently achieves the highest accuracy and F1 score, with its boxplots positioned distinctly higher and demonstrating narrower interquartile ranges, indicating strong and stable performance. In contrast, DistilBERT exhibits lower overall performance, as indicated by boxplots located at lower positions, reflecting weaker predictive capability. ALBERT and XLNet demonstrate moderate performance, with tightly grouped distributions suggesting stable yet moderate performance across parameter combinations. BERT shows intermediate performance but has slightly wider distribution ranges, indicating variability in response to hyperparameters.

Fig. 10. Boxplots of F1 score and accuracy across methods for each hyperparameter.

Detailed statistical analysis as shown in Table 7 further substantiates these observations. RoBERTa achieves the highest mean accuracy (84.98%) and F1-score (85.51%), confirming superior and consistent performance. DistilBERT displays the lowest mean accuracy (81.12%) and F1-score (81.73%), reinforcing its lower overall performance despite minimal variance (std = 0.37). ALBERT and XLNet yield moderate results, with stable but lower averages, highlighting less sensitivity to hyperparameter variations. BERT exhibits moderate performance but higher variability, suggesting sensitivity to specific hyperparameters.

Table 7 Statistics of F1 score and accuracy.

As shown in Table 8, per-class performance is reported for the best model under each method. RoBERTa shows the strongest overall results, with the highest mean accuracy (85.01%) and mean F1 score (85.47%) among the evaluated models. Specifically, RoBERTa achieved superior precision (up to 99.78%) and recall (up to 99.13%) in certain classes, highlighting its strong predictive capability and consistent reliability. BERT exhibited moderate performance, with an average accuracy of 83.36% and mean F1-score of 83.83%, although it notably struggled with Class 8 (F1-score: 65.87%). ALBERT had stable but lower average performance (accuracy: 81.79%, F1-score: 82.31%), suggesting reliability but somewhat limited predictive accuracy.

DistilBERT showed the lowest performance across all models, with a mean accuracy of 81.12% and mean F1-score of 81.70%, indicating limited effectiveness compared to other methods. XLNet exhibited moderate performance (accuracy: 82.29%, F1-score: 82.86%), placing it between ALBERT and BERT in overall capability.

Table 8 Best model in each method on validation data.

Final model selection

As shown in Table 8, among these models, RoBERTa consistently achieved the highest overall performance. Specifically, RoBERTa recorded the best average F1 score (85.55%) and accuracy (85.01%), clearly surpassing other models. DistilBERT, on the other hand, performed poorest, with the lowest average F1 score (81.70%) and accuracy (81.09%). ALBERT and XLNet showed moderate performance, while BERT exhibited intermediate results with slightly better performance than ALBERT and DistilBERT.

Regarding detection speed (Table 9), DistilBERT was the fastest, averaging 0.016 s per instance. Although RoBERTa did not achieve the fastest detection speed, its average inference time of 0.023 s per instance remained fast enough for practical use.

Considering both performance and computational efficiency, the optimized RoBERTa model—with a sequence length of 512, learning rate of 1e-5, warm-up proportion of 0.1, batch size of 32, and trained for 5 epochs—was selected as the final model. This choice was based primarily on its superior accuracy and F1 score, coupled with a sufficiently fast inference time suitable for real-world deployment scenarios.

Table 9 Detection speed by BERT based methods.

Model evaluation in test data

Test data performance

Figure 11 illustrates RoBERTa's validation and test performance across different classes. RoBERTa demonstrated robust and consistent predictive performance, with validation F1 scores ranging from 67.65% (Class8) to 99.26% (Class7), averaging 85.55%, and validation accuracies ranging from 66.72% (Class8) to 99.13% (Class7), averaging 85.01%.

Fig. 11. Validation and test results of the best-performing model (RoBERTa-based method).

When assessing generalization on the test dataset, RoBERTa showed minimal variation, with F1 scores fluctuating slightly (between −0.73% and +0.78%) and accuracies varying between −1.00% and +1.02%. This indicates stable and reliable generalization to unseen data, reflecting neither significant overfitting nor underfitting.

Figure 12 demonstrates the relative performance of the other transformer-based models (BERT, ALBERT, DistilBERT, XLNet) compared to RoBERTa as a baseline. Overall, RoBERTa exhibited superior predictive capability, consistently outperforming the other models. Specifically, DistilBERT displayed the largest negative variations (up to −4.74% in F1 score and −4.80% in accuracy), while BERT and XLNet showed moderate negative differences. ALBERT also lagged behind RoBERTa but maintained relatively smaller and more consistent negative differences.

Fig. 12. Comparison results of best model with other models.

Comparison with other detectors

Selection of other detectors

To demonstrate the superiority of the proposed RoBERTa-based approach, comparisons were conducted with several widely used text classification methods, including ANN, SVM, DT, RF, LR, 1D CNN, and LSTM. Additionally, various vectorization techniques such as Term Frequency-Inverse Document Frequency (TF-IDF), Bag-of-Words (BOW), 3-gram, Word Embedding, Word2Vec, and FastText were utilized. Hyperparameters for each method were empirically optimized through preliminary experimentation, resulting in a total of 535 evaluated models across all methods. Table 10 summarizes the classification methods along with their optimized hyperparameters and presents the best-performing configurations identified from the test dataset.

Table 10 Comparison methods with hyperparameters.
Comparison results with other detectors

Tables 11 and 12 compare the performance of the proposed RoBERTa-based method with other machine learning and deep learning models on the test dataset. Table 11 shows that the proposed RoBERTa-based classifier achieved the highest overall performance on the test set, with an average F1 score of 85.44% and accuracy of 84.76% across the eight classes. Relative to the strongest traditional machine-learning baselines, SVM (F1 score = 79.15%) and RF (accuracy = 78.73%), RoBERTa yields absolute gains of +6.29 percentage points in F1 score and +6.03 percentage points in accuracy. Compared with the best non-transformer deep-learning model (LSTM; F1 score = 80.89%, accuracy = 79.65%), the improvements are +4.55 and +5.11 percentage points, respectively, with the largest margins in the more challenging categories (Classes 3 and 8).

Additionally, Table 12 highlights the detection speed advantage of the RoBERTa-based method, achieving an average inference time of 0.0221 s per instance, significantly faster than other evaluated models. Despite having more parameters, the superior detection speed of the RoBERTa-based method can be primarily attributed to its transformer-based architecture, optimized embedding techniques, and effective GPU acceleration, enabling efficient parallel processing. In contrast, traditional machine learning models and certain deep learning architectures typically rely on CPU-based computations, which restricts their inference speed.

Table 11 Comparison results of best model on test set.
Table 12 Detection speed by each method (seconds per instance).

SHAP analysis

Figure 13 summarizes SHapley Additive exPlanations (SHAP)-based behavior for “Contamination” and “Others”, the two classes with the lowest F1 score and accuracy in both the validation and test sets. For “Contamination”, global SHAP indicates strong reliance on defect cues: mean |SHAP| for oil (~0.22), stain (~0.20), dust (~0.17), and gap/closer/gasket (~0.16–0.19) is roughly 3–4× higher than for administrative terms (TM/request/completion, ~0.03–0.06). Δ(FN − TP) identifies administrative markers as error drivers: request (+0.06–0.09), completion (+0.07–0.10), TM (+0.04–0.08), and date/ID codes (+0.04–0.07) gain attribution in false negatives, diluting contamination cues.

Fig. 13. Results of SHAP analysis.

For “Others”, the pattern reverses: administrative/generic tokens dominate (mean |SHAP| for request/completion/ticket/photo ~0.05–0.07), while class-generic words (misc/note/observation) are weaker (~0.03–0.04). Δ(FN − TP) again flags administrative terms as the principal error sources (+0.05–0.09), consistent with this class's catch-all nature.

Notably, a small set of administrative tokens accounts for substantial misclassification pressure in both classes (attribution shifts of +0.04 to +0.10), motivating abbreviation/administrative-text normalization, refined “Others” labeling, and negation-aware preprocessing to improve class-specific F1 scores.
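For reference, the hedged sketch below shows one common way to obtain such token-level SHAP attributions for a transformer text classifier; the roberta-base checkpoint, pipeline settings, and example sentence are illustrative assumptions, not the study's fine-tuned model or analysis code.

```python
# Hedged sketch of token-level SHAP attribution for a transformer text classifier,
# assuming the shap and transformers libraries are installed.
import shap
from transformers import pipeline

clf = pipeline("text-classification", model="roberta-base", top_k=None)
explainer = shap.Explainer(clf)  # SHAP wraps the pipeline with a text masker
shap_values = explainer(["oil stain on the door surface near the closer"])
print(shap_values[0])  # per-token attributions for each output class
```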

Potential applications of research findings

The proposed transformer-based classifiers, particularly RoBERTa, can automate triage of fire-door defect reports in computer-aided maintenance systems, enabling rapid prioritization of safety-critical repairs and efficient work-order bundling. Integrated with digital-twin and Building Information Modeling (BIM) environments, they can surface defect locations, track remediation, and support audit readiness for regulatory compliance55,56. Near-real-time inference (~0.02–0.03 s per instance) enables monitoring dashboards and contractor-performance benchmarking. Beyond fire doors, the workflow extends to other textual quality-assurance records and multilingual logs, facilitating preventive-maintenance planning and evidence-based resource allocation. Moreover, the outputs can support cost-effectiveness and Return-On-Investment (ROI) analyses by quantifying accuracy-driven reductions in reinspection effort, response time, and safety risk57,58,59,60.

Conclusions

This study proposed and evaluated five transformer-based models (BERT, RoBERTa, ALBERT, DistilBERT, and XLNet) for detecting fire door defects, using optimized hyperparameters such as sequence length, number of epochs, and batch size. In total, 1,458 model variants were developed and evaluated. The dataset comprised 4,212 real-world fire door defect reports collected from apartment complexes covering 8,786 household units. Eight defect categories were identified: seven specific defect types and an 'others' category.

Among the evaluated methods, the optimized RoBERTa model demonstrated the highest performance. Specifically, on the test dataset, RoBERTa achieved the following F1 scores per defect category: frame gap (92.13%), door closer adjustment (87.29%), contamination (78.17%), dent (82.89%), scratch (80.17%), sealing components (96.66%), mechanical operation components (98.43%), and others (67.81%), resulting in an average F1 score of 85.44%. Regarding accuracy, RoBERTa attained the following results: frame gap (91.55%), door closer adjustment (87.22%), contamination (75.84%), dent (81.78%), scratch (79.49%), sealing components (96.16%), mechanical operation components (98.27%), and others (67.74%), with an overall average accuracy of 84.76%.

Furthermore, the RoBERTa-based model significantly outperformed seven conventional classifiers (ANN, SVM, DT, RF, LR, 1D CNN, and LSTM) across 535 tuned variants. On the test set it achieved an average F1 score of 85.44% and accuracy of 84.76%, exceeding the best traditional baselines (SVM: F1 score 79.15%; RF: accuracy 78.73%) by +6.29 and +6.03 percentage points, respectively. Relative to the strongest non-transformer deep-learning baseline (LSTM: F1 score 80.89%, accuracy 79.65%), the improvements are +4.55 and +5.11 percentage points, underscoring the method's robustness and practical value.

Despite promising results, several limitations warrant further study. The corpus was assembled from a small number of companies within a fixed period, and inspector training levels varied; consequently, writing style and label quality may not reflect broader practice. As with any data-driven approach, performance depends on the quality and representativeness of the training texts, so external validity will hinge on how closely new inspection narratives resemble those used here. The study also did not quantify economic benefit; life-cycle cost and ROI analyses tailored to this triage workflow and to alternative deployment contexts remain outstanding.

Future work should expand evaluation to multi-source datasets spanning additional contractors, time windows, and inspector backgrounds to strengthen generalizability. It should also assess cross-lingual robustness by training and testing on the original Korean texts and other languages, using multilingual encoders, back-translation consistency checks, and lightweight adapters where appropriate. Finally, the impact of translation should be examined systematically by comparing human and machine translations and measuring their effects on class-wise performance and error patterns. These steps will clarify how domain shift, language choice, and translation quality influence reliability and guide practical deployment at scale.