Introduction

Exterior cladding materials—such as brick, concrete, and paneling—form the building’s outermost protective layer. Their selection often depends on cost, local climate, personal preferences (e.g., texture and color palette), and overall durability with respect to maintenance1.

Because of these wide-ranging variables, numerous studies have examined the distribution of cladding materials at both urban and national scales2,3. Understanding how these materials are employed is vital for multiple applications. For instance, urban heat island analyses depend on recognizing that each cladding type exhibits distinct thermal properties affecting heat retention4. Similarly, in post-earthquake scenarios, differences in cladding choices can significantly influence both the extent and nature of observed damage5.

Among current methods for examining exterior cladding, street view imagery (SVI) stands out as an efficient and cost-effective alternative to on-site inspections, providing a direct view of building façades for focused observations6. In particular, Google Street View (GSV)—a widely adopted SVI platform—spans over 16 million kilometers across 83 countries. By simply entering geographic coordinates, users can readily access these images, making GSV indispensable for large-scale endeavors such as urban planning or nationwide surveys7.

Despite its accessibility, analyzing SVI to extract exterior cladding materials can be both labor-intensive and time-consuming, especially over large geographical areas. Conventional deep learning can automate much of this manual image processing by training a model on large sets of labeled images that capture varied conditions (e.g., differing illumination or scale), thereby achieving robust generalization8. However, two primary challenges arise when applying exterior cladding material classification to SVI data: (1) acquiring a sufficiently comprehensive set of images (e.g., abundant examples of brick but fewer examples of stone) under diverse real-world conditions, and (2) labeling this large volume of images, which is both tedious and resource-intensive.

A potential solution involves leveraging existing, well-labeled source datasets (i.e., from regions where data is readily available) and applying them to a different target domain (i.e., where labeled data are scarce). However, these source datasets may differ substantially from the target domain in terms of their underlying data distributions—a phenomenon known as domain shift. For instance, architectural styles vary among London, Scotland, and South Korea due to distinct cultural contexts and regional building practices. While London and Scotland share certain architectural similarities, South Korea’s style diverges more noticeably6,9. This contrast becomes evident when visualizing their distributions using t-distributed Stochastic Neighbor Embedding (t-SNE): London and Scotland’s data points cluster more closely together, whereas South Korea’s points lie farther away, as illustrated in Fig. 1. Such differences—including materials, design principles, and aesthetic nuances—can degrade model performance when a model trained on one region is applied to another.

Fig. 1 Visualization of domain shift in three datasets.

In this paper, domain adaptation (DA) techniques are introduced to achieve reliable accuracy with deep learning models that extract exterior cladding material information from SVI. The main contributions are as follows:

  1. A DA strategy employing a Domain-Adversarial Neural Network (DANN) is proposed for robust feature alignment between source and target domains.

  2. Five distinct transformer architectures—including Swin Transformer and Axial Transformer—are assessed to determine their suitability as feature extractors.

  3. Various hyperparameters are systematically investigated to understand their effect on model performance.

  4. An in-depth examination identifies the top-performing model within each architecture–dataset configuration.

  5. Both supervised learning and DA methods are evaluated under domain shift conditions.

Literature review

Existing studies that apply supervised learning to building façade image analysis, as well as work on domain adaptation in related contexts, are examined here to clarify the research gap.

Supervised learning

Supervised learning—often termed general deep learning—is widely applied to labeled datasets, enabling models to learn from annotated examples. Within building façade image analysis, numerous studies have utilized SVI to extract useful information, including exterior cladding materials. For instance, Ilic et al.12 employed Convolutional Neural Networks (CNNs) to detect gentrification-related changes in sequential GSV images, achieving 95.6% accuracy in Ottawa. Hu et al.13 focused on classifying urban geometry using GSV and Densely Connected Convolutional Networks (DenseNets), reporting 89.3%, 86.6%, and 86.1% accuracies across three Hong Kong regions. Meanwhile, Campbell et al.14 explored deep learning for autonomous traffic sign detection in GSV images from the City of Greater Geelong, attaining 95.63% accuracy using Single Shot Detection (SSD) with MobileNet. Similarly, Yan and Ryu15 utilized GSV images and a CNN-based approach to categorize nine crop types (e.g., corn, soybean), reaching 92% accuracy in California’s Central Valley and 97% in Illinois.

In another study, Zou and Wang16 proposed a CNN-VGG16-based technique to identify abandoned houses from GSV data, recording 84% accuracy in five Rust Belt regions. Kim et al.17 introduced a ResNet-based method to differentiate wooden and metal poles, achieving 97.5% accuracy in Texas. Likewise, Kalfarisi et al.18 developed a large-scale solution for detecting soft-story buildings via Faster R-CNN with ResNet-50 and Inception-V2, recording 88% accuracy in both San Bruno and Seattle. Finally, Wang and Han6 proposed an automated system to classify exterior cladding materials using deep learning and GSV images; for London and Scotland, their MobileNetV3 model attained average accuracies of 75.65% and 73.45%, respectively.

Fig. 2 Visualization of class distributions of source and target domains.

Application of domain adaptation in other contexts

The distinction between supervised learning and DA is illustrated in Fig. 2 using an example of two classes—brick and concrete—sourced from Scotland (source domain) and London (target domain). In the left panel, where supervised learning is applied, source and target samples of the same class form separate clusters, and the decision boundary fitted on the source domain does not generalize well to the target domain because of domain shift.

In the right panel, where DA is applied, the framework aligns the feature distributions of the two domains so that brick and concrete samples from Scotland and London become more intermixed within each class, and the decision boundary is adjusted to separate both classes consistently across domains. As noted, DA helps address data-availability challenges for both raw and labeled images when classifying exterior cladding materials across different regions.

For DA to function effectively, several conditions generally need to be fulfilled. First, the source and target domains should share the same task and label space, since the labeled data resides only in the source domain, and both domains’ representations must map to these labels. Second, the domains should be sufficiently related—if no direct correspondence exists, knowledge transfer is likely to fail19, and may even degrade performance compared to training without transfer20,21. Finally, both domains should have enough data to produce a robust shared representation; heavily skewed or insufficient datasets in either domain often result in low accuracy in both training and testing22.

Despite these requirements, DA methods can rival or exceed supervised learning while demanding fewer labeled samples. Hong et al.23 tackled a face detection challenge—using only one passport photo—by employing facial landmark detection to expand the dataset, achieving 97.91% accuracy. Hu et al.24 bridged the gap between synthetic point clouds from Building Information Modeling (BIM) and real point clouds for semantic segmentation, achieving an average accuracy of 91.03%. Hong et al.25 tested data from one construction site (source) against three different sites (targets), with DA boosting accuracy from 40.43 to 82.76%. Duan et al.26 introduced an unsupervised, feature-level DA approach to enhance a reinforcement learning (RL) strategy for cable-in-duct installation, raising a 98% success rate in simulation to 95.8% in real-world conditions. Similarly, Tran et al.27 demonstrated that adopting DA in worker detection via object detection models at construction sites led to a 93% success rate in real-world applications.

Such work demonstrates that domain adaptation can substantially enhance performance without relying on large amounts of labeled data, particularly in fields requiring robust transfer from one domain to another. Consequently, if DA techniques are applied to exterior cladding material classification with appropriately designed configurations, high accuracy can be achieved. By leveraging more plentiful datasets from a different source region, domain adaptation boosts model performance while reducing the labeling effort for the target domain to the collection of raw images.

Proposed approach

To accurately classify exterior cladding materials, this study presents a DA methodology based on DANN, organized into the workflow shown in Fig. 3. The approach comprises six main components—(1) input domain, (2) image augmentation, (3) feature extractor, (4) domain alignment, (5) domain classifier, and (6) label classifier. The subsequent sections provide a detailed explanation of how each component is built and evaluated within the deep learning framework.

Fig. 3 Overall framework of proposed method based on DA.

Input domain

The input domain consists of all images feeding into the pipeline—whether from the source or target distribution. While both domains may appear generally similar, they can differ significantly in attributes such as visual style and architectural design. The following subsections describe each domain module’s function.

Source domain

The source domain module supplies a labeled dataset that underpins supervised training. It provides both images and corresponding class labels (e.g., “brick,” “concrete,” “stone”), enabling the model to learn how specific features map to each material category. By exposing the feature extractor and label classifier to a sufficiently diverse range of labeled samples, this module ensures a robust foundation for classification. Moreover, the knowledge the model acquires here—such as recognizing common textures or color patterns—is transferred or adapted when encountering the unlabeled target domain.

Target domain

The target domain module contains images that lack reliable labels. Although these images may appear similar to those in the source dataset, they often differ in architecture, context, or lighting, leading to a domain shift28. The model, initially trained on the source domain’s labeled data, uses DA techniques to accommodate these unlabeled target images. By realigning feature distributions between source and target, DA greatly reduces the need for extensive new labeling in the target environment. Consequently, the model continues to exhibit strong classification performance, even in large-scale or dynamically changing conditions where assembling a fully labeled dataset would be impractical.

Image augmentation

Constructing comprehensive datasets for deep learning typically demands substantial time and effort. To alleviate these challenges, image augmentation is employed to artificially expand the training set. In this study, six different augmentation techniques are utilized, each addressing specific issues as outlined in Table 1. A more detailed, mathematical discussion of these methods is available in prior work29,30.

Since SVI data are captured outdoors, lighting conditions often fluctuate with factors like weather or time of day. Brightness augmentation replicates such variations by adjusting the light intensity in images, producing outputs that range from brighter to darker31. Contrast augmentation, on the other hand, increases luminance contrast, making object features more prominent—particularly useful for emphasizing contours and edges in architectural elements32.

Since SVI images are taken from diverse angles, perspective transformation modifies the homography matrix so that buildings can be recognized regardless of the viewing angle. Scale augmentation, an affine transformation, replicates varying distances and accommodates actual size differences among buildings. Likewise, building façades may appear misaligned if the camera is not level; rotation augmentation compensates for this, supporting accurate detection of features irrespective of orientation. Finally, shear augmentation addresses distortions caused by an off-perpendicular camera angle, simulating a skew along a chosen axis33.

Table 1 Augmentation techniques with their parameters and value ranges.

Feature extractor

The feature extractor, also known as the encoder or backbone model, converts raw inputs into higher-level, more abstract representations. Rather than relying on raw pixel intensities, the model exploits these extracted features—patterns or attributes that are more stable indicators of content34,35,36. In domain adaptation, a single feature extractor is typically shared by both source and target data, enabling the model to learn a representation space that, ideally, generalizes across domains.

Building on prior work, this study integrates several transformer-based architectures within the domain adaptation framework—namely ViT, Swin Transformer, PVT, MobileViT, and Axial Transformer—chosen for their potential to deliver reliable accuracy in many studies37,38,39,40. Figures for each architecture are presented to illustrate their unique structures. Readers seeking deeper methodological insights into the components of each model are referred to the comprehensive description in41. The subsequent subsections offer a concise overview of the fundamental concepts behind each architecture.

ViT

Vision Transformer (ViT) employs a pure transformer design for image recognition by partitioning images into fixed-size patches and treating each patch as a token (Fig. 4). A key advantage lies in applying self-attention directly to these patch sequences, which allows the model to capture global dependencies across the entire image. This contrasts with traditional CNNs that focus on local receptive fields and often require multiple layers to aggregate global context. However, ViT generally demands large-scale datasets and significant computational resources, owing to the quadratic complexity of self-attention with respect to the number of patches.
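To make the tokenization step concrete, the following minimal PyTorch sketch (not the exact backbone used in this study) shows how an image can be split into 16 × 16 patches, projected to embeddings, and processed with global self-attention; the patch size, embedding width, and layer count are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Patch tokenization: a 224x224 image becomes 14x14 = 196 tokens of width 768.
patch_size, embed_dim = 16, 768
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

x = torch.randn(1, 3, 224, 224)                        # one RGB facade image
tokens = patch_embed(x).flatten(2).transpose(1, 2)     # (1, 196, 768)
cls_token = torch.zeros(1, 1, embed_dim)               # prepended classification token
tokens = torch.cat([cls_token, tokens], dim=1)         # (1, 197, 768)

# Global self-attention over all patch tokens (two layers for brevity).
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True),
    num_layers=2,
)
features = encoder(tokens)
print(features.shape)                                  # torch.Size([1, 197, 768])
```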

Fig. 4 The structure of ViT.

Swin transformer

Swin Transformer introduces a hierarchical architecture featuring shifted windows for self-attention (Fig. 5). The image is divided into non-overlapping local windows, and these window partitions shift between layers to capture cross-window interactions. This design yields a linear scaling in attention complexity with increasing image size, making it suitable for high-resolution data. The hierarchical nature also fosters multi-scale feature extraction, proving effective for tasks that require both local details and global context42.

Fig. 5 The structure of Swin Transformer.

PVT

PVT (Pyramid Vision Transformer) integrates pyramidal feature extraction—common in CNN-based models—into a transformer framework (Fig. 6). It leverages spatial-reduction attention to downsample image dimensions in the attention layers, thus reducing the cost of processing large inputs43. By producing multi-scale feature maps via a pyramid structure, PVT effectively captures features at multiple resolutions, lending itself well to dense prediction tasks such as classification or detection. This structure preserves the global attention benefits of transformers while offering computational efficiency akin to CNNs44.

Fig. 6 The structure of PVT.

Mobilevit

MobileViT merges transformer-based self-attention with lightweight convolutional layers specifically designed for mobile and edge scenarios (Fig. 7). Its defining characteristic is its balanced approach: it harnesses the power of global context provided by self-attention, while remaining resource-efficient by adopting mobile-friendly architectural blocks45. Local features are captured through convolutions, whereas self-attention handles long-range dependencies without substantially increasing parameters or computational load46. Consequently, MobileViT suits applications requiring both high performance and limited hardware resources.

Fig. 7 The structure of MobileViT.

Axial transformer

Axial transformer decomposes the standard 2D self-attention operation into two sequential 1D attentions along an image’s height and width dimensions (Fig. 8). This axial split reduces the complexity of self-attention from quadratic to linear in terms of spatial size, enabling efficient processing of high-resolution data47,48. By applying attention along each axis independently, axial transformer preserves the ability to capture long-range dependencies while maintaining manageable computational demands.
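As a rough illustration of this axial split, the sketch below applies standard 1D multi-head attention first along each row and then along each column of a feature map; it is a simplified stand-in for the actual axial transformer blocks, and the tensor dimensions are assumed.

```python
import torch
import torch.nn as nn

class AxialAttention2D(nn.Module):
    """Two sequential 1D attentions: along image width (rows), then height (columns)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                               # x: (B, H, W, C)
        b, h, w, c = x.shape
        rows = x.reshape(b * h, w, c)                   # each row is a sequence of W tokens
        rows, _ = self.row_attn(rows, rows, rows)
        x = rows.reshape(b, h, w, c)
        cols = x.permute(0, 2, 1, 3).reshape(b * w, h, c)   # each column: H tokens
        cols, _ = self.col_attn(cols, cols, cols)
        return cols.reshape(b, w, h, c).permute(0, 2, 1, 3)

feat = torch.randn(2, 32, 32, 64)                       # (batch, height, width, channels)
print(AxialAttention2D(dim=64)(feat).shape)             # torch.Size([2, 32, 32, 64])
```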

Fig. 8 The structure of axial transformer.

Domain alignment

Even after feature extraction, a domain gap can persist between source and target data distributions. Domain adaptation addresses this issue by introducing alignment mechanisms. In adversarial methods, a domain discriminator attempts to distinguish whether extracted features originate from the source or target domain, while the feature extractor aims to confuse the discriminator—forcing the two distributions to appear more similar.

One way to implement adversarial training is the gradient reversal layer (GRL) in DANN49. During forward propagation, the GRL passes features to the domain discriminator as usual. However, in backpropagation, it reverses the gradients before they reach the feature extractor. This effectively compels the feature extractor to minimize the discriminator’s accuracy, encouraging domain-invariant representations. Over time, this adversarial interplay causes source and target features to converge within the latent space, ultimately enhancing generalization without requiring extensive labeling in the target domain50.
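A minimal PyTorch sketch of a gradient reversal layer is given below: the forward pass is an identity mapping, while the backward pass negates (and optionally scales) the gradient before it reaches the feature extractor. The scaling coefficient `lambd` is an assumed hyperparameter, not a value reported in this study.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; negates and scales the gradient in backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None   # no gradient for lambd itself

def grad_reverse(features, lambd=1.0):
    """Insert between the feature extractor and the domain classifier."""
    return GradReverse.apply(features, lambd)

# Toy check: the gradient reaching `feats` is the negative of the usual one.
feats = torch.ones(4, 8, requires_grad=True)
grad_reverse(feats, lambd=1.0).sum().backward()
print(feats.grad[0, 0])                          # tensor(-1.)
```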

Domain classifier

After feature extraction, a Fully Connected (FC) network is used as the domain classifier. The dimensionality of its input layer matches that of the extracted features to allow seamless data flow. The output layer contains two neurons—corresponding to source and target domains—to reflect the model’s confidence in each domain category. The number and size of hidden layers can vary substantially, a choice typically treated as hyperparameter tuning51. In this study, the domain classifier is configured with two hidden layers of 32 and 16 neurons, respectively.

Label classifier

In parallel, a label classifier is employed to predict material categories of interest (e.g., “brick,” “concrete,” “stone”) from the aligned features. Like the domain classifier, it also relies on FC layers, whose configuration is determined as part of hyperparameter tuning. In this work, the label classifier has three hidden layers consisting of 32, 16, and 8 neurons, respectively, plus an output layer corresponding to the number of cladding material classes. By making use of the newly aligned features, this classifier more effectively handles unlabeled data from the target domain.
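Assuming a backbone feature width of 768 and the six cladding classes used in this study, the two heads could be sketched as follows; the activation functions are illustrative assumptions, since the text specifies only the layer sizes.

```python
import torch.nn as nn

feat_dim = 768        # assumed backbone feature width; depends on the chosen transformer
num_classes = 6       # Brick, Concrete, Glass, Stone, Mixed, Others

# Domain classifier: hidden layers of 32 and 16 neurons, two-way domain output.
domain_classifier = nn.Sequential(
    nn.Linear(feat_dim, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 2),
)

# Label classifier: hidden layers of 32, 16, and 8 neurons, six-way material output.
label_classifier = nn.Sequential(
    nn.Linear(feat_dim, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 8), nn.ReLU(),
    nn.Linear(8, num_classes),
)
```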

Optimization of the hyperparameters

Although numerous hyperparameters can be tuned to optimize model performance, examining every possible configuration is nearly impossible due to time and computational constraints52. Consequently, this study focuses on a select subset of hyperparameters. Stochastic Gradient Descent (SGD) serves as the optimizer, using a batch size of either 1 or 2, which yields highly stochastic (per-sample or near-per-sample) weight updates. Within SGD, the current gradient is combined with scaled gradients from previous iterations using a momentum term, which in this study is set to 0.7 or 0.9.

To mitigate overfitting, a weight decay term (L2 regularization) is added to the loss function for all model weights, tested at 0.0005 or 0.001. The learning rate (step size for weight updates) is assigned either 0.00025 or 0.0001. Because GPU memory limits the batch size, it remains 1 or 2, preserving the intensely stochastic nature of updates. Finally, the number of training iterations—which significantly impacts both precision and training time—is set at 3000 or 5000. Altogether, these five hyperparameters yield 32 unique configurations, as summarized in Table 2.
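A short sketch of how these 32 configurations can be enumerated and paired with an SGD optimizer is shown below; the dictionary keys and the `make_optimizer` helper are hypothetical names used only for illustration.

```python
import itertools
import torch
import torch.nn as nn

# The five hyperparameters and their two candidate values each (2^5 = 32 runs).
grid = {
    "batch_size": [1, 2],
    "learning_rate": [0.00025, 0.0001],
    "weight_decay": [0.0005, 0.001],
    "momentum": [0.7, 0.9],
    "iterations": [3000, 5000],
}
configs = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
print(len(configs))   # 32

def make_optimizer(model: nn.Module, cfg: dict) -> torch.optim.SGD:
    """Build the SGD optimizer for one hyperparameter configuration."""
    return torch.optim.SGD(
        model.parameters(),
        lr=cfg["learning_rate"],
        momentum=cfg["momentum"],
        weight_decay=cfg["weight_decay"],
    )
```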

Table 2 Combinations of hyperparameters.

Evaluation of model performance

Various metrics can be used to assess a model’s performance in image classification. In this study, three are selected to evaluate the models: loss-curve behavior (to detect underfitting and overfitting), the F1-score, and detection speed.

Underfitting and overfitting

Evaluating the progression of loss throughout iterative training helps identify two common issues: underfitting and overfitting. Underfitting is characterized by persistently high training and validation losses. Models that underfit, having failed to sufficiently learn from the training data, are likely to deliver poor performance on both training and unseen data. In contrast, overfitting manifests as a significant discrepancy between a low training loss and a relatively high validation loss. Models that overfit, despite potentially demonstrating high performance on training data, are prone to failure when exposed to new, unseen data53. In this study, models exhibiting signs of underfitting or overfitting were disregarded, as they are unlikely to yield satisfactory performance on training or unseen data, respectively.

F1-score

In image classification, the F1-score is commonly used to evaluate accuracy because it incorporates both precision and recall. Precision represents the proportion of positive predictions that are correct, while recall reflects the proportion of actual positives that are correctly identified. These metrics are defined mathematically in Eqs. (1) and (2):

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$
(1)
$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$
(2)

where true positives (TP) denotes the number of correctly identified positive samples, false positives (FP) refers to the number of samples incorrectly classified as positive, and false negatives (FN) represents the number of positive samples the model failed to detect.

The F1-score merges precision and recall using their harmonic mean, making it highly sensitive to low values in either metric. As a result, a model needs sufficiently high precision and recall to achieve an elevated F1-score. Its formulation is provided by Eq. (3):

$$\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
(3)
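In practice, the macro-averaged versions of these metrics can be computed directly, for example with scikit-learn; the class indices below are toy values rather than results from this study.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy predictions over the six cladding classes (0-5); real values come from the
# validation or test set.
y_true = [0, 0, 1, 2, 3, 4, 5, 1]
y_pred = [0, 1, 1, 2, 3, 5, 5, 1]

print(precision_score(y_true, y_pred, average="macro", zero_division=0))
print(recall_score(y_true, y_pred, average="macro", zero_division=0))
print(f1_score(y_true, y_pred, average="macro", zero_division=0))
```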

Detection speed

Detection speed represents the time a model needs to process each frame, typically reported as Frames Per Second (FPS). Measuring FPS helps determine how swiftly the classification model can handle incoming images, providing insight into its practical efficiency—especially in scenarios requiring real-time or high-throughput processing.
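A simple way to estimate detection speed is to time inference over a set of frames and divide the frame count by the elapsed time; the model and inputs in this sketch are placeholders, not the trained classifiers evaluated in this study.

```python
import time
import torch
import torch.nn as nn

# Placeholder model and frames; in practice these are the trained classifier and SVI test images.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 6)).eval()
frames = torch.randn(100, 3, 224, 224)

start = time.time()
with torch.no_grad():
    for frame in frames:
        _ = model(frame.unsqueeze(0))    # one frame at a time, as in deployment
fps = len(frames) / (time.time() - start)
print(f"Detection speed: {fps:.1f} FPS")
```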

Experiment

Dataset preparation

In this research, the dataset collected in our previous work6 using supervised learning was utilized to demonstrate the proposed method. The data collection methodology is briefly described here, with more emphasis placed on showcasing the potential of transformer-based domain adaptation.

Original dataset

For this case study, the United Kingdom was selected because its legal framework permits collecting SVI. In practice, SVI platforms provide large numbers of façade images across many cities, but the distribution of cladding types is uneven and manual annotation is expensive. Rather than repeatedly labeling a full dataset for every new city, an existing labeled dataset from one region can be reused as a source domain and its knowledge transferred to a new, sparsely labeled target region. The experiments therefore employ comparable source and target dataset sizes to control for sample-size effects and to focus on how domain adaptation can mitigate regional shifts in image appearance, while conceptually reflecting the practical need to reduce labeling effort in future deployments.

A random sample of 3000 buildings was taken from each of two UK regions—the North-West (NW) and the Eastern (E)—as shown in Fig. 9. The sampling process was carried out via the national mapping agency’s website (https://osdatahub.os.uk/). The corresponding addresses were then used to query the GSV Application Programming Interface (API), allowing the retrieval and download of building images.

Fig. 9 The geographic scope of the dataset.

Various GSV parameters were manually tuned to improve the visual identification of exterior cladding materials. The field of view (FoV), which controls the zoom level or scope of the scene, was set between 10 and 50. The pitch, representing the camera’s vertical angle relative to the street-view vehicle, ranged from 25 to 30. Meanwhile, the heading, which specifies the camera’s horizontal orientation, varied between 33 and 55. All images were captured at 640 × 640 pixels—the maximum resolution currently provided by GSV. Figures 10 and 11 present representative examples of images taken in London and Scotland, respectively, illustrating typical exterior cladding.
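For illustration, a request to the Street View Static API with these parameters might look like the following; the address and API key are placeholders, and the parameter values sit within the ranges reported above.

```python
import requests

# Street View Static API request; the address and API key are placeholders.
params = {
    "size": "640x640",                        # maximum resolution used in this study
    "location": "10 Downing Street, London",  # hypothetical building address
    "fov": 30,                                # zoom/scope, tuned between 10 and 50
    "pitch": 27,                              # vertical angle, tuned between 25 and 30
    "heading": 40,                            # horizontal orientation, tuned between 33 and 55
    "key": "YOUR_API_KEY",
}
resp = requests.get("https://maps.googleapis.com/maps/api/streetview",
                    params=params, timeout=30)
with open("facade.jpg", "wb") as f:
    f.write(resp.content)
```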

However, many GSV images did not offer a sufficiently clear view for accurate identification. Some showed no building at all, while others were partially obscured by trees or fences, and certain images displayed interiors rather than façades. In some cases, no GSV images were available at the address, only an error message. Figure 12 shows examples of images deemed unusable for this study. These images were discarded through manual review. Of the original 3000 addresses, 1,604 images (53.47%) were deemed usable in London and 1,017 images (53.02%) in Scotland.

Fig. 10 Example images in London.

Fig. 11 Example images in Scotland.

Fig. 12 Examples of unusable images.

Application of image augmentation techniques

Image augmentation parameters directly affect image quality and, consequently, model performance55,56. These parameters were selected through a trial-and-error process aimed at generating realistic augmented images, as summarized in Table 1. Brightness was adjusted by randomly shifting pixel intensities (β) between −30 and +30, whereas contrast was adjusted by randomly varying α between 0.5 and 2.0. For scale augmentation, images were independently resized along the x- and y-axes to 80–120% of their original size to account for potential differences in building features. Perspective transformations were applied by modifying the homography matrix elements \(a_{31}\) and \(a_{32}\), randomly assigning values between 0.01 and 0.15. Rotation was performed within a range of −25° to +25° using the rotation parameter θ, and shearing ranged from −15° to +15°, introducing skew via \(sh_{x}\) and \(sh_{y}\).

These geometric operations—scaling, rotation, perspective shifting, translation, and shearing—can move parts of an image beyond its original boundaries, creating voids or overlapping pixels. Because deep learning models typically require inputs of fixed dimensions, any empty areas were filled with a pixel value of 255 (white), ensuring that all augmented images had uniform size57. The original training sets of 928 images for London and 608 images for Scotland were expanded through these augmentation techniques to 5568 and 3648 images, respectively. Including the original (non-augmented) images, the final training sets comprised 6496 images for London and 4256 images for Scotland. To demonstrate the effectiveness of the augmented datasets, model accuracy was compared between training with only the original data and training with the augmented data.
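The snippet below sketches how a few of these augmentations could be implemented with OpenCV under the stated parameter ranges, filling exposed regions with white (255); it is an illustrative approximation rather than the exact pipeline used in this study, and the input filename is assumed.

```python
import cv2
import numpy as np

img = cv2.imread("facade.jpg")                       # a 640 x 640 GSV facade image
h, w = img.shape[:2]
WHITE = (255, 255, 255)                              # fill for regions exposed by warping

# Brightness (beta) and contrast (alpha): out = alpha * img + beta
out = cv2.convertScaleAbs(img,
                          alpha=np.random.uniform(0.5, 2.0),
                          beta=np.random.uniform(-30, 30))

# Scale: resize independently along x and y to 80-120% of the original size,
# then paste onto a white canvas so the output keeps fixed dimensions.
fx, fy = np.random.uniform(0.8, 1.2, size=2)
scaled = cv2.resize(out, None, fx=fx, fy=fy)
canvas = np.full((h, w, 3), 255, dtype=np.uint8)
ch, cw = min(h, scaled.shape[0]), min(w, scaled.shape[1])
canvas[:ch, :cw] = scaled[:ch, :cw]

# Rotation within +/- 25 degrees about the image center.
M = cv2.getRotationMatrix2D((w / 2, h / 2), np.random.uniform(-25, 25), 1.0)
rotated = cv2.warpAffine(canvas, M, (w, h), borderValue=WHITE)

# Shear within +/- 15 degrees along the x axis.
sh_x = np.tan(np.radians(np.random.uniform(-15, 15)))
M = np.float32([[1, sh_x, 0], [0, 1, 0]])
sheared = cv2.warpAffine(rotated, M, (w, h), borderValue=WHITE)
cv2.imwrite("facade_aug.jpg", sheared)
```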

Annotation

In image classification, “annotation” (or labeling) refers to assigning each image in the ground truth dataset to a specific category. In this work, images were examined for visual cues indicating the cladding material, and each image was placed into a separate folder—Brick, Concrete, Glass, Stone, Mixed, or Others—corresponding to the relevant cladding class.

Synthesis of final dataset

To validate the proposed methods, four distinct datasets were generated and divided into training, validation, and test sets. Each image was then labeled based on the cladding material category it represented. Table 3 summarizes the annotated distributions for each of these datasets.

Table 3 Comprehensive dataset distribution in London and Scotland.

Experimental settings

All experiments were performed on a Windows 10 (64-bit) machine equipped with an Intel® Core™ i9-12900K CPU (5.20 GHz, 16 cores), 64 GB of DDR5 RAM, and an NVIDIA GeForce RTX 3090 GPU featuring 24 GB of GDDR6X VRAM.

Results and discussion

In this study, 640 models were trained, spanning five transformer architectures, four dataset configurations, and 32 hyperparameter settings. For each architecture (ViT, Swin Transformer, PVT, MobileViT, and Axial Transformer) and each dataset configuration (raw source + raw target, augmented source + raw target, raw source + augmented target, and augmented source + augmented target), all 32 combinations of batch size, learning rate, weight decay, momentum, and number of iterations listed in Table 2 were explored, yielding 5 × 4 × 32 = 640 models.

Underfitting and overfitting

Although the training and validation loss graphs for all these models are omitted here due to space constraints, they can be accessed via a link provided in the “Data Availability” section. Figure 13 shows representative training and validation losses for models trained using augmented source and target datasets. All models exhibited a steady reduction in both training and validation losses as training progressed, with occasional fluctuations between epochs rather than a perfectly smooth decline. This behavior indicates that the models effectively learned from the training data, improving validation performance and converging toward an optimal loss value. The consistent drop in training loss across iterations rules out underfitting concerns. Similarly, the parallel decrease and close alignment of training and validation losses suggest that overfitting was not an issue, so no models were excluded on those grounds.

Fig. 13 Training and validation loss graphs.

Effectiveness of hyperparameter on model performance

The impact of various hyperparameters on model performance was assessed by clustering results according to each hyperparameter configuration. The datasets considered were Raw Source (RS), Augmented Source (AS), Raw Target (RT), and Augmented Target (AT).

Analysis of the mean variation values in Table 4 shows that the number of training iterations is the most influential hyperparameter in most scenarios. For example, for the Swin Transformer under the “RS + AT” setting, increasing iterations improves the average F1 score by 1.91 points, clearly exceeding the effects of learning rate (0.18), weight decay (0.30), batch size (0.52), and momentum (0.22). Similar behavior appears for PVT and MobileViT in the “AS + AT” configuration (1.92 and 2.13 points, respectively) and for the axial transformer in “AS + RT” (2.62 points). ViT shows a more nuanced pattern. Although iterations dominate in “AS + RT” (1.96) and “AS + AT” (1.83), weight decay and learning rate become more important in “RS + AT” and “RS + RT,” where their variations exceed those of iterations.

Table 4 Mean variations of F1-score by each dataset and architecture.

Analysis of best performing model in each architecture and dataset

Table 5 reports, for each architecture (ViT, Swin Transformer, PVT, MobileViT, and Axial Transformer) and each source/target configuration, the best model selected from the 32 hyperparameter settings. A clear pattern emerges when comparing configurations that use only raw data with those that include augmented images. Across all architectures, incorporating augmented source and target data consistently increases the average F1-score. For example, the ViT architecture attains an average F1-score of 74.18 in the raw-only setting (608 raw source images and 928 raw target images), but this improves to 81.89 when both source and target sets are augmented (928 raw + 3648 augmented source images and 928 raw + 5568 augmented target images). Swin Transformer, PVT, MobileViT, and Axial Transformer exhibit similar gains when moving from raw-only to augmented conditions.

Beyond the overall performance boost, the per-class results indicate that augmentation particularly benefits more challenging cladding types such as Glass and Stone, which show larger improvements than Brick and Concrete. These findings support the view that domain-adaptation techniques, when combined with augmented training data, help mitigate shifts in data distributions and enhance model robustness and generalization.

Table 5 Model performance by each dataset and architecture.

Table 6 summarizes the training times for two iteration counts (3000 and 5000) and the detection speeds of the five architectures. As expected, increasing the number of iterations leads to longer training times, while detection speeds remain similar across iteration counts. MobileViT achieves the highest detection speed (approximately 33 FPS), indicating strong potential for real-time or low-latency applications. ViT, Swin Transformer, PVT, and Axial Transformer operate at around 20–25 FPS, which is still adequate for many façade-analysis scenarios.

Table 6 Training times by iteration count and detection speeds by architecture.

Model test in unseen dataset

Among the 640 trained models, selection of the final model for evaluation on the unseen London dataset followed a two-stage procedure. First, for each architecture and each of the four dataset configurations, all 32 hyperparameter combinations were trained and evaluated on the validation set. The configuration with the highest validation macro F1-score was chosen as the best model for that architecture–dataset pair.

Second, these 20 best models were compared in terms of both validation macro F1-score and detection speed. Priority was given to higher validation F1-score, while detection speed was used to ensure that the final model remained suitable for practical deployment (i.e., at least 20 FPS).

Fig. 14 Best model performance on test data.

Based on this procedure, the axial transformer architecture with the configuration corresponding to Case 20—batch size 1, learning rate 0.00025, weight decay 0.0005, momentum 0.9, and 5000 training iterations, trained on augmented source and target datasets—was selected as the final model. As shown in Fig. 14, this model achieved an average F1-score of 82.04% on the unseen London test set, comparable to its validation performance. With a detection speed of approximately 20 FPS (Table 6), this configuration offers a favorable balance between accuracy and efficiency.

Figure 15 presents a subset of images from the London dataset, illustrating both successful and erroneous detections. A more extensive collection of detection outcomes is available at the following link: https://figshare.com/articles/dataset/Dataset_of_building_characteristics_from_building_fa_ade_images/25931941.

Fig. 15 Examples of correct and incorrect results in test data.

Feature-space analysis of source–target alignment

To provide mechanistic evidence of how the proposed framework aligns source and target representations, the penultimate-layer features of validation images from Scotland (source) and London (target) were visualized using t-SNE (Fig. 16). Each point represents a single validation image, with color indicating one of the six cladding classes and the legend identifying the domain. Under raw training without DA, source and target samples of the same class form clearly separated clusters, indicating a pronounced domain shift. When DA is applied to models trained only on raw data, the distance between these clusters is reduced, although the feature space remains relatively diffuse.
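A sketch of this visualization with scikit-learn and matplotlib is given below; the feature array is a random placeholder standing in for the penultimate-layer features, and only the domain labels are plotted for brevity.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder features standing in for penultimate-layer activations of the
# Scotland (source) and London (target) validation images.
feats = np.random.randn(400, 768)
domains = np.array(["source"] * 200 + ["target"] * 200)

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(feats)
for name, marker in [("source", "o"), ("target", "x")]:
    idx = domains == name
    plt.scatter(emb[idx, 0], emb[idx, 1], marker=marker, s=10, label=name)
plt.legend()
plt.savefig("tsne_alignment.png", dpi=200)
```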

Fig. 16 Visualization of t-SNE using the best model.

Comparison with different baseline model

To examine the effectiveness of both supervised learning and domain adaptation under domain shift, experiments were carried out using an axial transformer-based architecture across various training and testing scenarios involving data from London and Scotland. Table 7 presents a comparison of the results for each dataset under both methods.

Under supervised learning, models trained and tested within the same domain performed reasonably well (e.g., London→London: 72.75%, Scotland→Scotland: 77.10%), indicating that the model effectively learned domain-specific features. However, in cross-domain scenarios without adaptation (e.g., London→Scotland or Scotland→London), the average scores consistently dropped to the mid- to high-60% range. Even when combining training data from both London and Scotland, the improvements remained modest, suggesting that simple data diversification is insufficient to fully overcome the domain shift.

In contrast, applying domain adaptation techniques substantially improved cross-domain generalization. For instance, configurations such as AS (Scotland) + AT (London) and AS (London) + AT (Scotland) yielded average scores surpassing those achieved in the best supervised scenarios, frequently exceeding 80%. These domain-adapted models also demonstrated more balanced performance across challenging classes, such as Glass and Stone, which are prone to variability in appearance.

Table 7 Comparison results of supervised learning and domain adaptation in each dataset.

Conclusions

This study introduced a DA framework designed to address the challenges of exterior cladding material classification in SVI. The proposed method leverages a DANN alongside advanced transformer-based backbones (ViT, Swin Transformer, PVT, MobileViT, and Axial Transformer) and several image augmentation techniques. The following key conclusions can be drawn:

  1. Experiments confirmed that DA significantly enhances classification performance in cross-domain scenarios (e.g., London→Scotland or Scotland→London). In many cases, domain-adapted models achieved average scores exceeding 80%, surpassing those trained using conventional supervised learning alone.

  2. Evaluations of five transformer architectures underlined the versatility and potential of vision transformers for building façade analysis. Among these, the axial transformer often displayed a competitive balance of accuracy and computational efficiency.

  3. Data augmentation substantially improved classification accuracy, particularly when addressing material classes with limited examples in raw training sets (e.g., “Glass” or “Stone”). By producing synthetic samples under diverse brightness, scale, and perspective parameters, the model more effectively generalized to real-world conditions.

  4. Tuning of hyperparameters showed that the number of training iterations often exerted the largest influence on model performance—sometimes improving the average score by more than two percentage points. However, depending on the architecture, other hyperparameters (like weight decay in ViT) could also introduce notable gains.

  5. Despite the computational overhead typically associated with transformers, the tested models maintained practical detection speeds, with MobileViT in particular achieving real-time or near-real-time inference.

However, this research is limited by geographic scope, labeling costs, and the complexity of transformer-based architectures. Future studies could broaden datasets across more regions, explore advanced or multi-step domain adaptation strategies, and investigate lightweight solutions for on-device deployment. Integrating synthetic data may further enhance performance and generalization across a wider variety of building facades.