Introduction

Air Handling Units (AHUs) in Heating, Ventilation, and Air Conditioning (HVAC) systems play a key role in regulating and distributing conditioned air to sustain desirable indoor environments, including appropriate temperature, humidity, ventilation, and air quality levels1,2,3. Automated Fault Detection and Diagnosis (AFDD) for AHUs involves the ongoing monitoring of system components—such as sensors, actuators, and controllers—to rapidly identify abnormal operating conditions or equipment faults4,5.

Data-driven AFDD approaches, particularly those based on supervised machine learning, depend heavily on the availability of labeled datasets6. Lab- or simulation-based data are easier to obtain but miss real operational complexity; real building data capture authentic fault patterns yet are costly to collect and label, requiring substantial domain expertise7,8,9. However, many existing AFDD models suffer from poor transferability across buildings, with performance dropping sharply when models trained on one site are applied to others. This is largely due to the limited diversity of single-source datasets, compounded by variations in operational schedules, environmental conditions, and sensor configurations10,11,12.

Expanding training to unified datasets that integrate information from multiple buildings can help address this issue. Such datasets expose models to a wider variety of operating conditions and fault scenarios, enabling them to develop more generalized, building-independent fault representations that improve performance in unfamiliar environments13. However, unified datasets bring their own challenge: sensor configurations frequently vary in both type and number across buildings. Conventional supervised models, which typically require fixed-length inputs, are not well suited to handle this variability.

Attention-based neural networks offer a promising solution, as they are inherently capable of processing variable-length inputs. This allows them to flexibly adapt to differing sensor setups without extensive manual preprocessing or feature alignment14. As a result, attention mechanisms are a strong fit for unified training across buildings with heterogeneous sensors. To examine the role of coverage alignment of classes in multi-building AFDD, this study evaluated attention-based architectures trained and tested on real operational datasets from three building types—an auditorium, a hospital, and an office. Coverage alignment of classes was defined as the overlap between the class distributions of the training sources and the target building.
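
For illustration, coverage alignment can be operationalized as the overlap between the training and target fault-class sets. The Python sketch below uses the Jaccard index over hypothetical class lists; this is one reasonable formalization for illustration, not the paper's exact computation.

    # Hedged sketch: coverage alignment as the Jaccard overlap between
    # the fault-class sets of the training sources and a target building.
    # Class lists below are illustrative, not the paper's data.
    def coverage_alignment(train_classes, target_classes):
        union = set(train_classes) | set(target_classes)
        inter = set(train_classes) & set(target_classes)
        return len(inter) / len(union) if union else 1.0

    unified = {"Normal", "RATSF", "SATSF", "SFF", "VPF", "CPF", "HPF"}
    office = {"Normal", "RATSF", "SATSF", "SFF"}     # partial coverage
    print(coverage_alignment(unified, office))       # ~0.57: mismatched
    print(coverage_alignment(unified, unified))      # 1.0: fully aligned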

The main contributions of this paper are summarized as follows:

  1. Real operational datasets were collected over one year from 13, 8, and 20 AHUs installed in an auditorium, a hospital, and an office building, respectively, each with distinct sensor configurations.

  2. Operating-condition labels were defined and annotated—seven in the auditorium and four in both the hospital and office—each set comprising one normal state and multiple fault states.

  3. Attention-based models (TabTransformer, TabNet) were optimized via systematic hyperparameter sweeps, and 3240 configurations were evaluated to select the best models.

  4. Analyses included checks for underfitting and overfitting, assessment of between-method variability, best-model selection, attention heatmaps, and comparative evaluation against 2592 single-building baseline models.

  5. Findings showed that performance under unified training depended on coverage alignment of classes: accuracy was high when target-building fault coverage matched the training set and declined under mismatched coverage.

Literature review

Supervised learning using real operational data

Previous supervised learning studies can be divided into three groups according to their data source: simulation programs, laboratory experiments, and real operational data. Simulation data generated using non-proprietary, physical-model-based software have been widely employed in HVAC research15,16,17,18,19. Although simulation-based datasets provide convenient and controlled environments for developing and validating FDD models, the resulting models often exhibit reduced performance in real-world settings due to inherent discrepancies between simulated and actual operational conditions6. Controlled simulations generally fail to fully represent real-world complexity and variability, limiting the effectiveness of models when deployed in practice.

Laboratory datasets are collected from controlled experimental setups that mimic actual HVAC operation20,21,22,23,24. While laboratory data offer greater realism than purely simulated datasets, controlled laboratory settings still cannot fully capture the variability and unpredictability of real operational environments12. Factors such as weather fluctuations and occupant behavior, which significantly influence HVAC performance, are typically absent or controlled in laboratory conditions, thereby limiting the real-world generalizability of laboratory-based models6.

Real operational datasets, on the other hand, provide higher realism as they capture the actual complexities and operational variability present in HVAC systems6. Nonetheless, generating precise labels for these datasets is challenging due to subtle indications of faults, the infrequent nature of specific fault conditions causing class imbalance, and the considerable domain expertise and effort necessary for accurate data annotation.

Research trends in supervised learning

Although real operational datasets provide notable advantages, previous AFDD research has primarily relied upon simulated or laboratory-generated data25,26,27. To overcome labeling difficulties inherent to real-world data, researchers have pursued alternative methods, such as semi-supervised learning, which significantly reduces the required amount of labeled data, and fully unsupervised learning approaches that operate without any labeled examples.

Key semi-supervised techniques include AutoEncoders (AEs)28,29 and Generative Adversarial Networks (GANs)30,31,32. Additionally, unsupervised approaches involving Principal Component Analysis (PCA)17,33,34 and transformer encoder architectures35 have been explored, alongside specialized noise reduction methods employing GANs36, PCA37, and AEs38,39.

However, the validation of these alternative methods has typically been limited to datasets that fail to represent the intricate complexities and realistic scenarios of operational HVAC systems. This limitation restricts their practical utility. Consequently, there is a pressing requirement to develop robust supervised learning techniques explicitly trained and validated on real-world operational datasets sourced directly from active HVAC systems. Such methods could establish reliable benchmarks, facilitating accurate evaluations and improvements of models derived from simulations or laboratory experiments.

Most prior AFDD studies rely on operational data from individual buildings, limiting transfer across sites. Recent work has begun to cover diverse facilities—e.g., data centers40, hospitals11, auditoriums13, and university campuses41. A unified, multi-building dataset enhances generalization by exposing models to diverse operating regimes, sensor schemas, and fault manifestations, thereby reducing site-specific overfitting and improving robustness on unseen buildings. By broadening the data distribution (seasonal loads, schedules, climates, maintenance practices) and spanning varied sensor types, counts, and calibrations, it drives the model to learn building-agnostic fault signatures rather than site-specific artifacts.

This study contributes such a unified, real-operation dataset and shows that unified training yields high accuracy—establishing a practical path toward building-agnostic AFDD. To preserve these gains, unified training should be paired with schema-tolerant encoders (e.g., attention/set/graph with sensor-type embeddings and masking) and coverage-aware evaluation to manage remaining mismatches in sensors and fault labels.

Attention-based method

Attention-based neural networks have gained popularity due to their flexibility and efficiency in processing variable-length input sequences. Attention mechanisms enable models to dynamically identify and emphasize crucial portions or features within input data, which is especially advantageous for modeling complex temporal interactions and long-term dependencies inherent in time-series data42,43. Unlike traditional neural networks, attention models adaptively adjust the significance assigned to individual data points, significantly enhancing predictive accuracy and interpretability by highlighting data features indicative of specific faults44.

In AFDD applications, attention mechanisms are particularly beneficial because they enable detection of subtle anomalies and operational variations in AHUs even in noisy, real-world environments. The inherent flexibility of attention-based models makes them suitable for robust fault detection across diverse settings, including hospitals, auditoriums, and large-scale office buildings, where prompt and accurate fault diagnosis is essential for operational reliability and efficiency.

To the best of the authors’ knowledge, this is the first study to train attention-based AFDD models exclusively on operational AHU data collected across multiple environments—an auditorium, a hospital, and a large office building. By consolidating these datasets, the study demonstrates the effectiveness of attention mechanisms for handling variable-length sensor inputs and examines how performance depends on the coverage alignment of fault classes between training and target buildings; in practice, gains are largest under strong alignment.

Proposed methodology

As depicted in Fig. 1, the attention-based approach proposed in this research is systematically structured into six sequential steps, starting from the acquisition of raw operational data and culminating in model performance evaluation. The following subsections detail the selected discretization strategies for fault classification, outline the comparative methods adopted, and justify the chosen evaluation criteria.

Fig. 1. Workflow of proposed methodology.

Acquisition of raw operational data

Each building category employed multiple AHUs, each differing in quantity, specifications, and sensor arrangements specifically designed for effective AFDD across distinct rooms or spaces. Originally, sensor readings were captured at one-minute intervals. However, to facilitate practical analysis at an hourly scale, these minute-level readings were consolidated into hourly averages. This approach significantly streamlined data storage and alleviated the management complexities commonly encountered with extensive, high-frequency datasets in operational scenarios. Each sensor type was calibrated as outlined below. Table 1 provides sensor descriptions, units, and calibration ranges with accuracy. While the sensor types remained uniform across different buildings, the exact number of sensors varied slightly due to each building’s distinct characteristics. Further dataset specifics for each building type are presented below.
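
As an illustration of this consolidation step, the following sketch resamples minute-level readings to hourly means with pandas. The file and column names are assumptions for illustration, not the project's actual pipeline.

    # Hedged sketch: consolidate one-minute sensor readings into hourly
    # averages. File and column names are illustrative.
    import pandas as pd

    raw = pd.read_csv("ahu_minute_readings.csv",
                      parse_dates=["timestamp"], index_col="timestamp")

    # Average all numeric sensor channels within each clock hour.
    hourly = raw.resample("1h").mean(numeric_only=True)
    hourly.to_csv("ahu_hourly_averages.csv")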

Table 1 Description of measurement points for AHU datasets.
  • Auditorium

Data collection took place at the Sejong Arts Center (located at 21 Gungnipbangmulgwan-ro, Sejong-si), which encompasses 16,186 m² spanning from the first basement level (B1) up to the sixth floor (6F). The data were systematically gathered over the full 2022 calendar year, from January 1 through December 31. The center houses 13 AHUs, each equipped with specialized cooling and heating functionalities tailored to distinct building zones, as presented in Table S.1. Each AHU was integrated with a network of 15 different sensors. These AHUs continuously collected data at hourly intervals throughout the year, culminating in an extensive dataset totaling 113,880 hourly data points.

  • Hospital

Data were collected from the New Wing of the National Cancer Center Hospital (323 Ilsan-ro, Ilsandong-gu, Goyang-si, Gyeonggi-do), which has a total area of 18,900 m², ranging from the second basement level up to the fifth floor. The facility, completed in October 2020, underwent continuous data collection throughout the year 2022, specifically from January 1 to December 31. The building features eight AHUs, each fitted with dedicated cooling and heating coils, as described in Table S.2. Nine distinct sensor variables were integrated into the AHU system. The data collected hourly over the entire year generated an extensive dataset consisting of 70,080 hourly observations.

  • Office

Data collection was carried out in a large office building situated at 48 Gwacheon-daero 7na-gil, Gwacheon-si, Gyeonggi-do. The facility spans 50,966 m², extending from basement level 4 up to the sixth floor. Continuous data collection occurred from December 1, 2023, through November 30, 2024. The building houses 20 AHUs, each equipped with dedicated heating and cooling coils, further detailed in Table S.3. Eighteen distinct sensor variables were integrated. The hourly dataset collected over this period contains 175,532 entries, slightly surpassing the expected theoretical total of 175,200 hourly observations (computed as 20 AHUs × 365 days × 24 h). This discrepancy likely results from supplementary logging or operational monitoring activities that occurred during the data collection period.

Elimination of missing and duplicate data

Initially, data collection involved hourly measurements from multiple AHUs across three distinct buildings over an entire year (365 days). The expected theoretical data points were computed by considering the number of AHUs, the days in a year, and the hourly recording frequency. However, upon comparing actual collected datasets with these theoretical counts, some discrepancies emerged, as illustrated in Table 2.

Specifically, the datasets from the auditorium and hospital contained fewer records than expected, suggesting the presence of missing values across certain independent variables. The auditorium data lacked 505 hourly records, whereas the hospital data was missing 4032 entries. These missing records likely stemmed from various issues such as sensor malfunctions, temporary interruptions in data logging, or communication failures during the data collection process45.

In contrast, the office building dataset had 332 more records than theoretically predicted. Further analysis revealed these surplus entries were duplicate measurements resulting from logging errors or synchronization issues. After eliminating these duplicates, the office dataset aligned precisely with the theoretically anticipated data count.

Table 2 Comparison between expected and collected data points, and identified data quality issues.

In this study, both missing and duplicate records were removed. Although imputation is an option when data are sparse, exclusion was chosen because the dataset was sufficiently large to support robust analysis without imputation. This choice is consistent with prior studies11,13, which report strong performance when training datasets are sufficiently large.
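
A minimal sketch of this cleaning step is shown below, assuming a pandas DataFrame of hourly records. The file and key column names are illustrative; the point is that incomplete rows are dropped rather than imputed and duplicate hourly records are removed.

    # Hedged sketch: drop rows with missing sensor values and remove
    # duplicate hourly records. Column names are illustrative.
    import pandas as pd

    df = pd.read_csv("office_hourly.csv", parse_dates=["timestamp"])

    before = len(df)
    df = df.dropna()                                         # missing records
    df = df.drop_duplicates(subset=["ahu_id", "timestamp"])  # logging duplicates
    print(f"removed {before - len(df)} of {before} rows")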

Development of annotation criteria

Operational categories and thresholds. We defined seven operational categories for annotation—one normal condition and six fault conditions: supply fan fault, cooling pump fault, heating pump fault, return-air temperature sensor fault, supply-air temperature sensor fault, and valve-position fault. Although general guidance exists to distinguish normal from faulty behavior, not all faults have well-established numerical thresholds. As noted by Wang et al.12, sensor-related faults such as RATSF and SATSF often have broadly accepted numeric criteria, whereas actuator-related faults like VPF are typically identified using more qualitative rules rather than precise ranges.

In practice, numeric criteria must be tailored to each AHU to reflect differences in configuration, manufacturer specifications, and operating environment. In this study, six HVAC specialists (each with more than 20 years of experience) derived building-appropriate thresholds following ASHRAE-recommended practices, calibrated for typical AHUs. Table S.4 reports the condition-specific criteria and thresholds, which follow and refine those of Wang et al.12. Table S.5 illustrates the labeling procedure with an auditorium example.
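
To make the rule-based labeling procedure concrete, the sketch below applies threshold rules to a single hourly record. The thresholds and field names are placeholders for illustration, not the expert-derived values in Table S.4, which are tailored per AHU.

    # Hedged sketch of rule-based annotation. Thresholds are placeholders,
    # NOT the expert-derived criteria in Table S.4.
    def label_record(rec):
        # Supply fan commanded on but measured speed near zero -> SFF
        if rec["fan_cmd"] == 1 and rec["fan_speed"] < 5.0:
            return "SFF"
        # Valve fully shut yet supply air deviates from setpoint -> VPF
        if rec["valve_pos"] < 2.0 and abs(rec["supply_temp"] - rec["setpoint"]) > 3.0:
            return "VPF"
        # Physically implausible return-air temperature -> RATSF
        if not (0.0 <= rec["return_temp"] <= 45.0):
            return "RATSF"
        return "Normal"

    print(label_record({"fan_cmd": 1, "fan_speed": 0.0, "valve_pos": 50.0,
                        "supply_temp": 18.0, "setpoint": 18.0,
                        "return_temp": 24.0}))  # -> "SFF"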

Identification of target classes

Four annotators, each with more than 20 years of expertise in mechanical engineering and HVAC AFDD, manually annotated the dataset using Excel. Table S.6 presents the distribution of fault conditions within each dataset, emphasizing variations resulting from the differing frequencies of specific faults. Interestingly, despite their inclusion in the annotation criteria, certain faults were not detected in particular buildings:

  • Cooling pump and heating pump faults were not observed in the hospital and office buildings.

  • Cooling supply temperature fault was identified only once within the hospital building.

Certain fault conditions either were not observed or appeared very rarely during the data collection period, complicating the creation of robust datasets for model training, validation, and testing. As a result, operational classes with fewer than 300 occurrences were omitted from the analysis. Figure 2 provides a visual representation of the data distribution across the target operational conditions.
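
The rare-class filter can be sketched as follows, assuming an annotated DataFrame with a label column; the column and file names are illustrative.

    # Hedged sketch: drop operational classes with fewer than 300
    # annotated occurrences. Column names are illustrative.
    import pandas as pd

    df = pd.read_csv("annotated_hourly.csv")
    counts = df["label"].value_counts()
    kept = counts[counts >= 300].index
    df = df[df["label"].isin(kept)].reset_index(drop=True)
    print(counts.to_dict(), "->", sorted(kept))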

Fig. 2. Distribution of target classes, with sample counts, for each building type.

Design of attention-based methods

The inherent variability across datasets and the differing strengths of various algorithms highlight the importance of exploring and comparing multiple tabular-based methods46,47. Such comparative assessments facilitate the selection of optimal models that provide robust performance and enhanced ability to generalize.

In this study, two sophisticated tabular-focused methods specifically tailored for structured datasets are investigated: TabTransformer48 and TabNet49. These methods were chosen for their demonstrated efficacy and robustness across complex and dynamic applications, including demand prediction, anomaly detection, and fault diagnosis50. Although both approaches incorporate attention mechanisms, they differ significantly in their internal attention designs, feature-decomposition techniques, and ability to process long data sequences. Detailed descriptions and critical comparisons of these transformer-based techniques are elaborated in the following sections, with their principal distinctions summarized in Table 3.

Table 3 Summary of distinct characteristics of tabular transformer-based models.

TabTransformer

As depicted in Fig. 3, the TabTransformer is specifically tailored to handle tabular data by embedding categorical variables using transformer-based layers, effectively capturing their interrelationships. It integrates a complete self-attention mechanism, enabling extensive modeling of interactions among categorical features. Continuous variables, in contrast, undergo normalization independently before being merged with categorical embeddings through concatenation51. Although its robust attention architecture excels at identifying intricate feature interactions, it leads to moderately increased computational complexity12,52. Furthermore, TabTransformer does not explicitly perform feature selection, thereby inferring feature importance implicitly rather than directly. As a result, this method is particularly effective for datasets with rich categorical interactions but exhibits moderate levels of interpretability and computational efficiency.

Fig. 3. Workflow of TabTransformer.

TabNet

As illustrated in Fig. 4, TabNet utilizes a sparse attention mechanism known as Sparsemax, specifically optimized to facilitate explicit feature selection. This attention strategy dynamically emphasizes the most critical features, thereby significantly enhancing interpretability through clear identification of feature importance. Both continuous and categorical variables undergo unified processing via iterative decision steps that incorporate feature and attentive transformers, progressively refining feature relevance and prominence43,49. The explicit feature selection capability inherent to TabNet reduces computational load relative to full self-attention models, resulting in greater computational efficiency. Additionally, TabNet’s iterative design effectively captures intricate feature interactions, delivering robust performance alongside high interpretability53. Consequently, TabNet is highly suitable for real-world AFDD applications involving sensor-based datasets.
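
For reference, Sparsemax (Martins & Astudillo, 2016) projects logits onto the probability simplex while allowing exact zeros, which is what enables TabNet's explicit feature selection. The NumPy sketch below renders the published projection; it is an illustration, not TabNet's internal implementation.

    # Hedged sketch of Sparsemax: unlike softmax, it can assign exactly
    # zero weight to uninformative features.
    import numpy as np

    def sparsemax(z):
        z_sorted = np.sort(z)[::-1]                  # descending
        k = np.arange(1, z.size + 1)
        cssv = np.cumsum(z_sorted)
        support = 1 + k * z_sorted > cssv            # coordinates in support
        k_z = k[support][-1]
        tau = (cssv[support][-1] - 1) / k_z          # threshold
        return np.maximum(z - tau, 0.0)

    logits = np.array([2.0, 1.0, 0.1, -1.0])
    print(sparsemax(logits))        # sparse: trailing entries exactly 0
    print(sparsemax(logits).sum())  # sums to 1, like a distribution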

Fig. 4. Workflow of TabNet.

Fine-tuning of architectures

For effective implementation of attention-based approaches (TabTransformer and TabNet) in AFDD tasks, embeddings derived from these models are directed into a Fully Connected (FC) classification module. Typically, this module is structured with an initial dense layer consisting of 128 neurons, succeeded by a dropout layer set at a dropout rate of 0.3. The final classification phase involves an additional dense layer equipped with a softmax activation function, which classifies embeddings into the distinct operational categories of AHUs illustrated in Fig. 2—specifically, seven operational categories.
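
A minimal Keras sketch of this classification head is given below, matching the described structure (Dense 128, dropout 0.3, softmax output). The embedding dimension of 64 and the ReLU activation are assumptions not stated in the text.

    # Hedged sketch of the FC classification module. Embedding size (64)
    # and ReLU activation are assumptions; num_classes is 7 for the
    # unified label set.
    import tensorflow as tf

    num_classes = 7
    head = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    head.compile(optimizer="adam",
                 loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])
    head.summary()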

Optimization of hyperparameters

Selecting optimal hyperparameters is essential for achieving the highest performance from transformer-based models, yet exhaustive hyperparameter tuning often demands significant computational resources. A pragmatic alternative is to establish hyperparameter ranges guided by existing research findings and empirical insights54,55,56. Consequently, careful consideration was given to setting hyperparameter intervals and selecting specific parameter values, resulting in various model configurations for TabTransformer and TabNet. Table S.7 provides a comprehensive summary of these hyperparameters, distinguishing between those common to both methods and those specific to each, along with the total count of model configurations generated.
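
Enumerating such a grid is straightforward; the sketch below uses itertools to expand a small example grid. The values shown are illustrative, not the actual ranges in Table S.7.

    # Hedged sketch: expand a hyperparameter grid into concrete
    # configurations. Grid values are illustrative.
    from itertools import product

    grid = {
        "learning_rate": [1e-3, 5e-4, 1e-4],
        "batch_size": [256, 512],
        "n_steps": [3, 5, 7],          # e.g., TabNet decision steps
        "feature_dim": [32, 64],
    }
    configs = [dict(zip(grid, values)) for values in product(*grid.values())]
    print(len(configs), "configurations")   # 3 * 2 * 3 * 2 = 36 here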

Model performance evaluation

F1 score

In AFDD classification tasks, the F1 score and accuracy are commonly adopted to assess model performance. The F1 score combines precision and recall into a unified metric, balancing the model's capacity to identify relevant occurrences (recall) against its precision in minimizing false detections. A detailed definition and discussion are provided in57. High F1 scores reflect the model's proficiency in accurately detecting faults while reducing both false positives and false negatives, thereby enhancing system reliability.

Accuracy

In this study, two forms of accuracy are reported: average accuracy and overall accuracy. Average accuracy (macro accuracy) is calculated as the unweighted mean of per-class accuracies, giving equal importance to each fault category regardless of its prevalence. Overall accuracy (micro accuracy) is computed as the total number of correct predictions divided by the total number of samples, thereby weighting results according to class frequency. The detailed definitions of these metrics are provided in58.

This distinction is particularly important in HVAC AFDD applications, where datasets are often imbalanced due to the low occurrence of certain fault types. Relying solely on overall accuracy can obscure poor performance in rare but operationally critical faults, potentially leading to undetected failures. In contrast, average accuracy ensures that all fault categories, including infrequent yet high-impact faults, are equally represented in the performance evaluation, enabling a more balanced and reliable assessment of diagnostic capability.
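
To make these definitions concrete, the sketch below computes all three metrics with scikit-learn on toy labels. Here balanced_accuracy_score corresponds to the unweighted per-class (macro) accuracy described above; the label values are illustrative only.

    # Hedged sketch: overall (micro) accuracy, average (macro) accuracy,
    # and macro F1 on illustrative labels.
    from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                                 f1_score)

    y_true = ["Normal", "Normal", "SFF", "RATSF", "SFF", "Normal"]
    y_pred = ["Normal", "SFF",    "SFF", "RATSF", "SFF", "Normal"]

    print("overall (micro) accuracy:", accuracy_score(y_true, y_pred))
    print("average (macro) accuracy:", balanced_accuracy_score(y_true, y_pred))
    print("macro F1:", f1_score(y_true, y_pred, average="macro"))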

Detection speed

Detection speed refers to the duration required by a model to evaluate an individual data instance. In this study, detection speed is quantified as the count of data instances analyzed per second, serving as an indicator of computational efficiency across various models in processing sensor data for determining AHU operational states.
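
Measured this way, detection speed reduces to a simple timing loop, as in the sketch below; the dummy model stands in for any trained classifier exposing a predict() method and is purely illustrative.

    # Hedged sketch: detection speed as instances processed per second.
    import time
    import numpy as np

    def detection_speed(model, X):
        start = time.perf_counter()
        model.predict(X)
        elapsed = time.perf_counter() - start
        return len(X) / elapsed          # instances per second

    class DummyModel:                    # stand-in for a trained model
        def predict(self, X):
            return np.zeros(len(X), dtype=int)

    print(f"{detection_speed(DummyModel(), np.random.rand(10_000, 15)):.0f} inst/s")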

Experimental design

Dataset preparation

Upon completing dataset annotation, the collected data were randomly divided into training (60%), validation (20%), and test (20%) subsets. This division ensures an objective and accurate evaluation of the model’s performance using unseen data. However, traditional random splitting methods can unintentionally lead to inadequate representation of minority classes, resulting in biased performance assessments59,60. To address this concern, stratified sampling was employed to preserve proportional representation across all classes within each subset.
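
A minimal sketch of the 60/20/20 stratified split with scikit-learn is shown below; two calls are needed because train_test_split produces one split at a time. The data, labels, and random seed are illustrative.

    # Hedged sketch: stratified 60/20/20 train/validation/test split.
    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.rand(1000, 15)                  # illustrative features
    y = np.random.choice(["Normal", "SFF", "RATSF"], size=1000)

    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=42)        # 60% train
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)  # 20/20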

As previously illustrated in Fig. 2, certain fault conditions either did not occur or occurred infrequently during data collection, leading to varying target classes across different building types. Consequently, some fault categories were excluded for particular buildings due to insufficient data instances, resulting in differences in target classes among buildings. For simplicity and readability, the following abbreviations are utilized throughout this study:

  • Normal condition: “Normal”.

  • Return Air Temperature Sensor Fault: “RATSF”.

  • Supply Air Temperature Sensor Fault: “SATSF”.

  • Supply Fan Fault: “SFF”.

  • Valve Position Fault: “VPF”.

  • Cooling Pump Fault: “CPF”.

  • Heating Pump Fault: “HPF”.

By applying stratified sampling, class distributions remain consistent across training, validation, and test subsets. This approach ensures balanced representation, particularly for less frequently occurring classes, reducing potential biases, facilitating more stable model training, and enhancing the reliability and accuracy of subsequent evaluations. The detailed breakdown of class distributions for each subset is provided in Table 4.

Table 4 Distribution of training, validation, and test datasets by Building type and fault condition.

Experimental environments

All experiments were executed in a computing environment running Windows 10, equipped with an Intel Core i7-7700HQ processor (2.80 GHz, 8 threads), an NVIDIA GeForce RTX 3080 Ti GPU, and 32 GB of memory. Python was used for all software implementations, with TensorFlow as the main deep learning framework for model development and execution.

Results and discussions

Training and validation results

Underfitting and overfitting

In this research, the training and validation loss curves were examined to identify potential underfitting or overfitting during model training. Figure 5 shows illustrative loss curves for each transformer-based model over the full number of training epochs.

All the models exhibited a steady and consistent reduction in both training and validation losses over the training duration, reflecting stable and effective learning. The persistent downward trend in the training losses indicates that significant underfitting was not present in any model. Moreover, the close similarity and parallel reduction of both training and validation loss curves suggest there was minimal to no overfitting.

Fig. 5. Example of training and validation loss curves for attention-based methods.

Analysis of performance variation by methods

Analysis of overall performance

Two attention-based methods—TabTransformer and TabNet—trained on a unified dataset are compared for AFDD across three target buildings (auditorium, hospital, and office); Fig. 6 presents the score distributions, and Table S.8 summarizes the descriptive statistics (macro-F1 over present classes and overall/micro accuracy).

Across all target buildings, TabNet achieved slightly higher mean values than TabTransformer, with gains of +0.26% F1 score and +0.17% accuracy for the auditorium, +0.69% F1 score and +0.62% accuracy for the hospital, and the largest improvements of +0.96% F1 score and +0.75% accuracy for the office. Although TabNet consistently outperformed TabTransformer in mean values, the min–max ranges reveal performance overlap between the two models; for example, in the auditorium case, TabTransformer's maximum accuracy (97.83%) exceeded TabNet's minimum (96.48%), indicating that under certain hyperparameter configurations TabTransformer can match or exceed TabNet's lower-bound outcomes. Given that both models were trained on the same unified dataset, these performance variations are attributable not to dataset mismatch but to differences in fault coverage, class distribution, and sensor data characteristics between target buildings.

Fig. 6. Boxplot distributions of F1 score and accuracy.

Standard deviations were small across all cases (< 0.53 for F1, < 0.46 for accuracy), indicating stable performance across hyperparameter settings, with boxplots confirming narrower interquartile ranges in buildings with more comprehensive fault coverage. These findings demonstrate that even when trained on a unified dataset, model performance varies notably across target buildings and is strongly influenced by the breadth and balance of fault coverage, supporting the hypothesis that increasing fault coverage in training data can lead to more consistent generalization across diverse operational contexts.

Class-wise performance of best model in each method

Table 5 presents the precision, recall, F1 score, and accuracy at the class level for the best-performing configuration of each method. For the auditorium, both TabTransformer and TabNet achieved high and consistent results across most classes, with precision and recall frequently at or near 100%. The only notable exception was the SFF class, where TabNet’s higher precision (84.97% vs. 83.04%) and F1 score (90.99% vs. 89.87%) indicate an advantage in correctly identifying this class while maintaining low false positives. Class-wise averages for the auditorium were similar (TabNet: 97.38% precision, 97.76% recall, 97.50% F1; TabTransformer: 97.30%, 97.63%, 97.37%), and overall accuracy was marginally higher for TabNet (98.02% vs. 97.83%).

In the hospital, performance gaps were more pronounced, largely due to zero detection in the SATSF, CPF, and HPF classes for both models, reflecting missing or extremely sparse examples in fault coverage. For the remaining classes, TabNet slightly outperformed TabTransformer in most metrics. Notably, for VPF, TabNet’s precision improved from 75.88% to 78.33% and F1 from 83.09% to 84.81%. Overall accuracy was higher for TabNet (92.74% vs. 92.07%), with class-wise average F1 scores also showing improvement (92.38% vs. 91.64%).

In the office, differences between methods were more substantial for certain classes. TabNet achieved higher precision, recall, and F1 scores for Normal, RATSF, SATSF, and SFF, with particularly strong gains in SATSF (F1: 81.72% vs. 78.33%) and SFF (95.85% vs. 95.15%). Similar to the hospital case, some classes (VPF, CPF, HPF) recorded zero recall and F1 scores for both methods due to absent examples in the training dataset for those categories, again pointing to incomplete fault coverage. Class-wise averages in the office were higher for TabNet (92.46% vs. 91.12% F1), and overall accuracy was notably improved (92.61% vs. 91.57%).

Across all three buildings, the largest class-wise differences were driven by how well each method handled classes with moderate-to-low prevalence. TabNet consistently delivered small to moderate gains in these cases, particularly for faults with more complex feature distributions (SFF, SATSF, VPF). By contrast, when a class had zero or near-zero coverage in the training data, neither method generalized: precision was trivially perfect (no false positives) but recall and F1 were zero. These patterns underscore that the completeness of fault-class coverage governs performance even under unified training and indicate that enriching under-represented classes is essential for improving reliability across buildings.

Table 5 Precision, recall, F1-score, and accuracy in the class-wise level.

Best model selection

A large-scale hyperparameter search was conducted for both TabTransformer and TabNet, generating 1,944 models per target building (auditorium, hospital, office), for a total of 5,832 models. Computational efficiency results (Table 6) indicate that TabTransformer exhibited marginally faster inference, processing an average of 45.25 instances per second (mean inference time: 0.0221 s per instance) compared to TabNet’s 43.48 instances per second (0.0230 s per instance).

Table 6 Detection speed by attention-based methods.

Performance variations between the two methods are visualized in Fig. 7, which presents class-wise differences (TabNet − TabTransformer) in F1 score for existing classes and in per-class accuracy for all classes, supplemented by macro averages and overall accuracy. Across all three buildings, TabNet achieved positive mean differences in macro F1 score (+ 0.26% in auditorium, + 0.74% in hospital, + 1.32% in office) and macro per-class accuracy (+ 0.20%, + 0.26%, + 0.30%, respectively). Gains were most notable in classes with moderate to high complexity or imbalanced representation, such as SFF (auditorium: +1.12% F1 score; office: +0.70% F1 score) and SATSF (office: +3.39% F1 score), while improvements for classes with already near-perfect performance were marginal. Accuracy improvements were generally smaller than F1 gains, reflecting that TabNet’s advantage lies more in precision–recall balance for fault categories rather than in the aggregate correct prediction rate.

Although TabTransformer maintained a slight advantage in inference speed, TabNet consistently delivered superior diagnostic performance in terms of both F1 score and accuracy across all buildings and most fault categories. These gains were especially valuable in cases where class imbalance and incomplete fault coverage posed challenges, as seen in the SATSF and VPF classes in the hospital and office datasets. Therefore, considering the trade-off between detection speed and predictive performance, and prioritizing F1 score and accuracy as the primary evaluation metrics, the TabNet-based method was selected as the final model for subsequent analysis and deployment.

Fig. 7. Class-wise performance variation between TabNet and TabTransformer.

Test results

Best model analysis for validation and test sets

The best-performing TabNet models selected on the validation set were evaluated on an independent test set to assess generalization and potential overfitting. Figure 8 visualizes the F1 score (top row) and per-class accuracy (bottom row) for all target buildings, and class-wise differences between validation and test performance are summarized in Table 7.

Fig. 8. Class-wise F1 score and accuracy comparison of validation and test results.

Overall, validation–test differences were small, indicating minimal overfitting. In the auditorium, average F1 decreased by only 0.07% and per-class accuracy by 0.01%, with the largest class-specific drop of − 0.19% F1 for SFF. In the hospital, mean F1 declined by 0.37% and accuracy by 0.03%, with the largest reductions for VPF (− 0.70% F1) and SFF (− 0.56% F1). In the office, average F1 decreased by 0.21% and accuracy by 0.18%, with the greatest single drop in SATSF (− 0.51% F1). Across all buildings, changes in overall accuracy were < 0.25%, reinforcing model stability.

The dataset exhibits building-specific fault coverage: the auditorium includes six faults (RATSF, SATSF, SFF, VPF, CPF, HPF), the hospital three (RATSF, SFF, VPF), and the office three (RATSF, SATSF, SFF). This mismatch in label sets means some faults occur in only one building, producing limited and non-uniform supervision for a unified model across domains. Consequently, performance is lower for SFF in the auditorium, VPF in the hospital, and SATSF in the office.

Table 7 Class-wise performance variation.

Attention heat map analysis

Feature level temporal analysis

Figure 9 presents annotated temporal attention heatmaps—columns normalized to 100% per hour—for (a) the auditorium, (b) the hospital, and (c) the office. Table S.9 provides the corresponding numeric values. Using the same unified TabNet model across buildings, attention redistributes according to each dataset's characteristics. In the auditorium, which has full coverage of all seven operational classes, attention is comparatively balanced but increases toward later hours for Valve Position and Supply Fan Speed (means 13.33% and 15.75%; late–early changes of +4.44 and +2.69 percentage points), indicating stronger reliance on actuator cues when distinguishing SFF.

In the hospital, where SATSF, CPF, and HPF are absent, attention concentrates on Valve Position and Supply Air Temperature (means 15.01% and 13.09%), with the largest late-day increases in Supply Air Temperature (+5.58 percentage points) and Valve Position (+3.84 percentage points), consistent with the building's emphasis on VPF. In the office, where VPF, CPF, and HPF are absent, Supply Air Temperature dominates (mean 20.64%; +4.76 percentage points), with a secondary rise in Cooling Supply Temperature (+3.16 percentage points), aligning with SATSF as the most challenging class. Overall, full fault coverage (auditorium) yields a more balanced temporal profile across sensors, whereas reduced coverage (hospital and office) drives attention to converge on a smaller set of discriminative channels.
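
The per-hour normalization used for these heatmaps can be sketched as follows: each column (hour) of a feature-by-hour attention matrix is scaled so it sums to 100%. The matrix here is random and purely illustrative.

    # Hedged sketch: normalize each column of an attention matrix to 100%.
    import numpy as np

    att = np.random.rand(15, 24)                  # features x hours
    att_pct = 100.0 * att / att.sum(axis=0, keepdims=True)
    print(att_pct.sum(axis=0))                    # each hour sums to 100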

Fig. 9. Heat-map results of feature-level temporal analysis for each building dataset.

Feature-class importance analysis

Class–feature attention heatmaps for (a) auditorium, (b) hospital, and (c) office are presented in Fig. 10 (rows normalized to 100% per class), with the numerical values reported in Table S.10. Per-class attention patterns are physically consistent across buildings while reflecting differences in available fault coverage.

In the auditorium, SFF is driven primarily by Supply Fan Speed (68.12%), with Supply Air Temperature (15.17%) and Valve Position (11.58%) as secondary cues (Top1 + Top2 = 83.29%), underscoring the pivotal role of actuator signals. In the hospital, VPF is dominated by Valve Position (64.88%), followed by Supply Air Temperature (18.90%) and Return Air Temperature (13.11%) (Top1 + Top2 = 83.79%), consistent with valve-centric behavior. In the office, SATSF places most weight on Supply Air Temperature (59.85%) and Return Air Temperature (22.03%) (Top1 + Top2 = 81.88%); RATSF emphasizes Return Air Temperature (67.96%) with Supply Air Temperature (18.99%) secondary, and SFF emphasizes Supply Fan Speed (74.39%) with Supply Air Temperature (14.56%).

These heatmaps show attention that is both physics-consistent and coverage-dependent: actuator faults weight actuator channels (SFF→Fan Speed; VPF→Valve Position), and sensor faults weight the corresponding temperatures (SATSF/RATSF→Supply/Return Air). With complete coverage in the auditorium, attention is concentrated (Top1 + Top2 ≈ 80%+) on the decisive channel(s), yielding clearer class separation and higher averages; in the hospital/office, where classes are missing, attention spreads to proxy cues (e.g., temperatures for VPF/SATSF), which explains the lower macro-F1 score.

Fig. 10. Heat-map results of feature–class analysis for each building.

Comparative analysis with other classifiers

Selection of comparative methods

To assess the effectiveness of the proposed TabNet-based approach, comparative analyses were performed against three non-attention classifiers: Artificial Neural Networks (ANN), Recurrent Neural Networks with Long Short-Term Memory (RNN-LSTM), and Graph Convolutional Networks (GCN). TabNet was trained on the unified multi-building dataset, whereas the baselines were trained separately for each building. Hyperparameters for all baselines were tuned via grid search. Per building, ANN, GCN, and RNN-LSTM evaluated 162, 216, and 486 configurations, respectively; across the auditorium, hospital, and office datasets this resulted in 2592 trained models.

Table S.7 summarizes, for all methods, the representative fixed settings, the adjustable hyperparameter grids, the number of configurations, and the best configurations selected based on validation-set performance. For completeness, the GCN row also records the fixed graph schema used across all experiments; this schema was specified a priori and was not treated as a hyperparameter. The basic operation of each method’s components—ANN, RNN-LSTM13, and GCN61—is described in the cited references.

Comparative results

Table 8 compares TabNet (transformer-based attention) with conventional baselines (ANN, RNN-LSTM, and GCN) under two regimes: unified multi-building vs. single-building training. On the test set, the unified TabNet improves only when the target building has full fault coverage (auditorium: macro-F1 97.43% vs. 96.82%, + 0.61 pp; accuracy 97.91% vs. 97.37%, + 0.54 pp). With partial coverage, unified training underperforms the single-building TabNet (hospital: macro-F1 92.01% vs. 95.40%; office: 92.25% vs. 96.27%), with the largest drops in VPF (hospital) and SATSF (office).

Relative to single-building non-attention baselines, the single-building TabNet yields the highest macro-F1 in the hospital and office, while overall accuracy is comparable across methods and slightly higher for GCN in those two buildings (hospital 97.81%, office 97.67%). In the auditorium, all single-building methods are tightly clustered (macro-F1 ≈ 96.5–96.9%, accuracy within ≈ 0.2 pp). Overall, unified training is advantageous when fault coverage is complete; with incomplete coverage, single-building training is more reliable.

Table 8 Comparison results of best model on test set.

Conclusions

Deep learning models trained on real operating data have shown strong AFDD performance for AHUs, yet much of the literature trains and evaluates on single buildings, limiting transfer to new sites. Many prior approaches also assume a fixed feature schema (same sensors and counts), so accuracy degrades when buildings instrument different variables. To address this, two attention-based tabular models—TabTransformer and TabNet—were trained on a unified multi-building dataset pooling operational data from an auditorium, a hospital, and an office. Hyperparameters were tuned extensively: 3,240 configurations for the attention models (1,296 TabTransformer; 1,944 TabNet), and 2,592 single-building baseline configurations (ANN, RNN-LSTM, and GCN across the three sites). Analyses covered under/overfitting checks, variance across methods, best-model selection, attention heat maps, and classifier comparisons.

The optimized TabNet trained on the unified dataset achieved a macro-F1 of 97.43% and an accuracy of 97.91% for the auditorium, 92.01% and 92.50% for the hospital, and 92.25% and 92.46% for the office. Single-building TabNet baselines reached 96.82% macro-F1 and 97.37% accuracy for the auditorium, 95.40% and 97.21% for the hospital, and 96.27% and 97.29% for the office. These results indicate that unified training is most beneficial when fault coverage is complete (auditorium), whereas single-building training performs better under incomplete coverage (hospital and office). Overall, TabNet ranked highest across settings, with ANN, RNN-LSTM, and GCN close behind. All models showed reliable performance; however, generalization to new buildings depends on fault-class coverage alignment and requires further validation, given potential confounders such as environmental conditions, operating schedules, and differences in sensor configuration and calibration across buildings.

Despite promising results, external validity is limited by the short collection period and the single geographic region. Differences in sensor inventories, fault prevalence, and operating regimes not represented here may yield different outcomes elsewhere. Future studies should broaden temporal and regional coverage to capture seasonal and geographic variation. The unified model offers a clear advantage when the target site’s fault coverage matches the training set, but this advantage narrows under incomplete coverage. Mitigations include coverage-aware training (e.g., presence-gated heads to mask absent classes, class-wise reweighting or post-hoc calibration), targeted augmentation of missing faults, and light per-building fine-tuning; advanced directions include graph attention networks and self-supervised/contrastive pretraining.