Introduction

Static Street View Images (SSVIs) capture detailed visual information of building façades, including architectural features such as window types, exterior cladding materials, and the overall façade design. Each SSVI is georeferenced, linking images directly with precise latitude and longitude coordinates1. Additionally, these images include camera parameters such as heading, pitch, and zoom, allowing users to view and analyze buildings from various angles. One prominent example of an SSVI platform widely used for building façade analysis is Google Street View (GSV)2.

Utilizing SSVIs for building façade analysis offers an efficient and cost-effective alternative to traditional, labor-intensive on-site surveys. For instance, GSV covers more than 16 million kilometers and is accessible in 83 countries3. Such extensive coverage facilitates large-scale building analyses relevant to urban planning, nationwide building surveys, and architectural studies.

Many studies have extensively leveraged SSVIs for various tasks, employing diverse deep-learning techniques ranging from basic classification methods to segmentation and object detection approaches. Specific applications include building usage classification4,5, façade material recognition6,7,8, building type identification3,9, floor-level and story estimation10,11,12, building age determination13,14,15, detection of façade deterioration16, identification of façade elements such as windows and walls17, and segmentation of tile peeling on building façades18. Additionally, urban feature mapping tasks, such as pole and streetlight detection19 and urban form analysis20,21,22, have significantly benefited from the application of SSVIs.

Despite the advantages of SSVIs, not all images of building façades are suitable for analysis. From a practical standpoint, the acquired SSVIs of buildings can be categorised into the following three types:

(1) Usable images: These are images in which the visual cues for a specific task are clearly visible. For instance, exterior cladding materials (e.g. wood and brick) can be interpreted from evenly magnified images of a building façade, as shown in Fig. 1a, whereas images that display the entire façade of a building are essential for estimating the number of storeys, as illustrated in Fig. 1b.

(2) Potential images: The visual cues in these images may not be immediately apparent. Nevertheless, by adjusting the camera parameters at the same address, they can potentially become usable images. For instance, Fig. 1c and d were obtained from the same address but with different camera parameters. Figure 1c is more zoomed in than Fig. 1d, making tasks that require a view of the entire building façade, such as estimating the number of storeys, impossible. However, by zooming out with different camera settings, the clarity needed to determine the number of storeys can be obtained, as shown in Fig. 1d.

(3) Non-usable images: These images are unsuitable for analysis, even after adjusting camera parameters. A frequent issue is the obstruction of substantial portions of the building façades by external objects, such as trees or fences, as illustrated in Fig. 1e. Additionally, certain addresses do not have available SSVIs, resulting in error messages as shown in Fig. 1f.

Fig. 1 Examples of the three types of SSVIs.

Identifying and differentiating between usable, potential, and non-usable images currently relies heavily on manual interpretation. Furthermore, as no universal camera parameter settings are suitable for every building10, manual adjustments are often required, especially for converting potential images into usable ones. This manual approach is labor-intensive and inefficient, particularly when surveying extensive geographic regions.

Despite the broad utilization and proven effectiveness of SSVIs across various fields, the reliance on manual adjustments highlights a significant research gap in automating the acquisition and classification of suitable SSVIs. Advanced supervised deep learning methods, notably convolutional neural networks (CNNs), have shown strong performance in tasks utilizing SSVIs23,24,25. Recently, transformer-based models have emerged as powerful deep learning architectures due to their superior ability to capture long-range dependencies and contextual information within images, often outperforming traditional CNN-based methods26,27,28,29,30.

This study addresses the identified research gap by developing automated transformer-based methods specifically designed for acquiring and classifying SSVIs. It aims to automatically identify images suitable for two key analysis tasks: assessing entire building façades and detailed inspection of first-story façades. The primary contributions of this research include:

1) A total of 1,026 models were developed by combining five transformer-based architectures with various hyperparameter settings, representing the first automated approach for SSVI acquisition and classification.

2) Two specific image analysis tasks were targeted: whole-building façade images and first-story façade images.

3) A comprehensive comparative analysis was performed to evaluate the effectiveness of advanced transformer-based architectures against 810 traditional CNN-based models in classifying SSVIs.

4) Different model evaluation metrics were employed, including analyses of performance variation, Grad-CAM visualization, impacts of varying image conditions, and paired bootstrap statistical testing.

Proposed approach

To enable usable SSVIs to be collected quickly and accurately, the proposed workflow and its main procedures are illustrated in Fig. 2. The process involves the following five key steps:

1) List of geoinformation: based on a specified area of interest, a list of geoinformation, such as building addresses and spatial coordinates (e.g. latitude and longitude), is created.

2) Pre-definition of camera parameters: the ranges of the camera parameters used, such as pitch and heading, are defined in advance, and parameter sets are randomly assigned from these ranges.

3) Image retrieval: based on the list of remaining addresses and one of the predefined parameter sets, images are retrieved from the SSVI platform.

4) Transformer-based classification: a transformer-based model classifies the retrieved images into usable, potential, and non-usable images. Addresses whose images are classified as usable or non-usable are then removed from the list, so that only addresses with potential images are revisited.

5) Iterative refinement: following the initial steps outlined above, image retrieval and classification are iteratively conducted on the remaining candidate addresses. This iterative approach continues until all predefined sets of camera parameters have been examined. Consequently, the resulting dataset comprises exclusively high-quality images that clearly depict specific attributes of buildings, each accurately associated with relevant geographic metadata. Finally, multiple analytical approaches and diverse performance metrics are employed to rigorously evaluate the effectiveness of the implemented models.
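A minimal sketch of this acquisition loop is given below. The helper functions retrieve_image and classify_image are hypothetical placeholders for the image-retrieval step and the trained transformer classifier; only the control flow reflects the workflow described above.

```python
def acquire_usable_images(addresses, parameter_sets, retrieve_image, classify_image):
    """Iterate over predefined camera-parameter sets; addresses whose images are
    classified as usable or non-usable are dropped, and only 'potential'
    addresses are revisited with the next parameter set."""
    usable = {}
    remaining = list(addresses)
    for params in parameter_sets:                     # step 2: predefined parameter sets
        still_potential = []
        for address in remaining:
            image = retrieve_image(address, params)   # step 3: image retrieval
            label = classify_image(image)             # step 4: transformer-based classification
            if label == "usable":
                usable[address] = (image, params)     # keep the image and the camera settings used
            elif label == "potential":
                still_potential.append(address)       # retry with another parameter set
            # 'non-usable' addresses are simply discarded
        remaining = still_potential                   # step 5: iterate until parameter sets are exhausted
        if not remaining:
            break
    return usable
```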

Fig. 2 Workflow for the demonstration of the proposed method.

Visual cues for usable images

Fig. 3 Examples of target building façade images for each task.

The selection of usable SSVIs varies based on the specific requirements of different analytical tasks. An image that is ideal for one type of building analysis might be unsuitable for another. For instance, images that clearly capture an entire building’s façade are essential when evaluating attributes such as the window-to-wall ratio or determining the number of stories in a building, as illustrated in Fig. 3a. In contrast, for tasks such as identifying the presence of parking lots, images that primarily show the first story of the building may be sufficient, as depicted in Fig. 3b. This research focuses on two types of images: those that capture the entire building façade and those that mainly show the first story of the building.

Transformer-based image classification

Transformer-based image classification generally involves two primary components: a feature extractor and a classifier. The feature extractor generates pertinent feature maps from an input image, while the classifier utilizes these features to categorize the image into predefined classes.

Robust backbone architectures are frequently adopted due to their demonstrated effectiveness across diverse image classification tasks31. These architectures exhibit distinct trade-offs between accuracy and computational efficiency, prompting exploratory analyses to identify the most suitable model for accurate yet efficient classification.

Transformer-based models often surpass CNN architectures in accuracy by capturing intricate image relationships. However, this improvement commonly comes with increased computational demands, posing challenges for real-time applications28,32. Tasks like safety helmet monitoring—characterized by small objects, occlusions, and varying lighting—highlight the importance of selecting models that balance accuracy and efficiency33. Therefore, this study evaluates five advanced transformer-based architectures, summarized succinctly in Table 1.

Table 1 Summary of key characteristics for each backbone architecture.

Classifier

Following the feature extraction process, a fully connected (FC) classifier is attached. The dimensionality of its input layer matches the size of the extracted feature vector so that the features can be received directly. The number of neurons in the output layer corresponds to the number of classes; in this case, three: usable, potential, and non-usable images. The architecture of the hidden layers can vary considerably, and optimizing their design is typically treated as a form of hyperparameter tuning34,35. In this study, the classifier consists of two hidden layers comprising 32 and 16 nodes, respectively.
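For illustration, a minimal Keras sketch of this classification head is shown below. The feature dimensionality (768) is an assumed placeholder, since the actual size depends on the chosen backbone; the two hidden layers (32 and 16 nodes) and the three-class softmax output follow the description above.

```python
import tensorflow as tf

NUM_CLASSES = 3          # usable, potential, non-usable
FEATURE_DIM = 768        # assumed size of the backbone's pooled feature vector

features = tf.keras.Input(shape=(FEATURE_DIM,), name="backbone_features")
x = tf.keras.layers.Dense(32, activation="relu")(features)   # first hidden layer (32 nodes)
x = tf.keras.layers.Dense(16, activation="relu")(x)          # second hidden layer (16 nodes)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

classifier_head = tf.keras.Model(features, outputs, name="fc_classifier")
classifier_head.summary()
```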

Model optimization process

Transfer learning

Training deep learning architectures from scratch often requires significant computational resources, considerable training time, and large amounts of labeled data. To overcome these challenges, pre-trained models—previously trained on extensive datasets—can serve as efficient alternatives36. Such models utilize features learned from prior training to effectively accomplish diverse tasks. In this research, pre-training was performed using ImageNet, a prominent benchmark dataset extensively adopted in computer vision research. ImageNet37 is a comprehensive repository comprising over 14 million labeled images across a broad spectrum of categories, including general object classes such as buildings and people. Therefore, the feature extraction components of the transformer architectures employed in this study were initialized using pre-trained weights derived from the ImageNet dataset.
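The sketch below illustrates the transfer-learning setup under simplifying assumptions: EfficientNetB0 from tf.keras.applications stands in for the transformer backbones (which would typically be loaded with ImageNet weights from an external library), the backbone is frozen for brevity although it could equally be fine-tuned, and the input resolution of 224 × 224 is assumed.

```python
import tensorflow as tf

# Stand-in backbone with ImageNet weights; the transformer backbones evaluated in this
# study would be loaded analogously from a library that provides pre-trained weights.
backbone = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", pooling="avg", input_shape=(224, 224, 3))
backbone.trainable = False          # freeze the pre-trained feature extractor (transfer learning)

inputs = tf.keras.Input(shape=(224, 224, 3))
features = backbone(inputs, training=False)
x = tf.keras.layers.Dense(32, activation="relu")(features)
x = tf.keras.layers.Dense(16, activation="relu")(x)
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)   # usable / potential / non-usable

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```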

Image augmentation techniques

The development of datasets for deep learning purposes often requires significant effort and time. To mitigate this, image augmentation is employed to artificially increase the size of the training dataset. This study incorporates six distinct augmentation techniques tailored to each task. The rationale behind each technique is summarized here; a full explanation, including mathematical details, is available in prior research29,38.

Given that SSVIs are captured outdoors, variations in lighting due to environmental factors such as the weather or time of day are common. Brightness augmentation simulates these conditions by altering the light intensity in the images, producing variations from brighter to darker representations39. Contrast augmentation can improve the visibility of images by enhancing luminance contrast, making the defining features of objects more pronounced, which is especially useful for highlighting the contours and edges of architectural features40,41. To accommodate the different angles from which images are taken, perspective transformation augmentation adjusts the image’s homography matrix, ensuring that the model can recognise buildings regardless of the angle from which they were photographed. Similarly, scale augmentation, a type of affine transformation, replicates the different perspectives that result from varying distances to the buildings while accounting for genuine differences in building size.

Images of building façades may sometimes lack vertical alignment because the camera was not level during capture, which can make buildings appear slanted. Rotation augmentation simulates this effect so that building characteristics, such as typology and story count, can be determined accurately irrespective of the angle at which the building appears in the image. Finally, shear augmentation addresses the visual distortion that occurs when the camera is not perpendicular to the building façade, causing buildings to appear tilted to one side; this geometric transformation skews the image along a particular axis to simulate that effect42.
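As an illustration of how these six techniques can be combined, the sketch below uses the albumentations library as a stand-in pipeline (the study itself was implemented with TensorFlow/Keras); the parameter ranges and the image path "facade.jpg" are assumptions for demonstration, not the values actually used, which are reported in Table 3.

```python
import albumentations as A
import cv2

# Illustrative ranges only; the study's actual augmentation ranges are given in Table 3.
augment = A.Compose([
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.7),  # brightness & contrast
    A.Perspective(scale=(0.02, 0.05), p=0.5),                                     # perspective transform
    A.Affine(scale=(0.9, 1.1), shear=(-8, 8), p=0.5),                             # scale and shear
    A.Rotate(limit=5, border_mode=cv2.BORDER_REFLECT, p=0.5),                     # small rotations
])

image = cv2.cvtColor(cv2.imread("facade.jpg"), cv2.COLOR_BGR2RGB)  # placeholder training image
augmented = augment(image=image)["image"]   # HxWx3 uint8 array with the same size as the input
```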

Optimization of the hyperparameters

Selecting suitable hyperparameters is crucial for maximizing the performance of transformer-based models; however, performing exhaustive hyperparameter optimization typically requires substantial computational effort6,43. A practical solution is to define hyperparameter ranges informed by prior research and empirical knowledge. Therefore, hyperparameter ranges and specific values were thoughtfully determined, enabling the exploration of diverse configurations for transformer-based approaches. Table S.1 outlines these hyperparameters clearly, differentiating between common parameters applicable to all methods and those unique to individual methods, along with the total number of model configurations evaluated.
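The hyperparameter search can be organised as a simple grid over the predefined ranges, as sketched below. The search space values and the train_and_validate routine are hypothetical placeholders; the ranges actually explored are listed in Table S.1.

```python
from itertools import product

def train_and_validate(config):
    """Placeholder for the full training routine; returns the validation F1 score."""
    raise NotImplementedError

# Hypothetical search space for illustration only.
search_space = {
    "backbone": ["swin", "vit", "pvt", "mobilevit", "axial"],
    "learning_rate": [1e-4, 5e-4, 1e-3],
    "batch_size": [16, 32],
    "optimizer": ["adam", "sgd"],
}
configs = [dict(zip(search_space, values)) for values in product(*search_space.values())]
print(f"{len(configs)} configurations to evaluate")  # 5 x 3 x 2 x 2 = 60 in this sketch

best_config, best_f1 = None, 0.0
for config in configs:
    f1 = train_and_validate(config)          # train one configuration, keep the best by validation F1
    if f1 > best_f1:
        best_config, best_f1 = config, f1
```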

Evaluation of model performance

F1 score and accuracy

In classification tasks, metrics such as the F1 score and accuracy are commonly used to evaluate model performance. The F1 score combines precision and recall into a single metric, balancing the model’s ability to identify actual positives (recall) with its ability to avoid false positives (precision). A detailed explanation of the F1 score is provided in prior work44. A high F1 score signifies that the model identifies the target classes with few false alarms and missed detections, which is critical for reliable operation. Accuracy, on the other hand, is the proportion of correctly classified instances relative to the total number of evaluated instances, offering a direct measure of the model’s overall correctness45.
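For reference, these metrics can be computed with scikit-learn as shown below. The class-to-index mapping and the averaging scheme (macro averaging over the three classes) are assumptions for illustration; the paper does not specify which averaging was used.

```python
from sklearn.metrics import accuracy_score, f1_score

# Assumed mapping: 0 = usable, 1 = potential, 2 = non-usable.
y_true = [0, 1, 2, 0, 2, 1, 0]
y_pred = [0, 1, 2, 1, 2, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")   # unweighted mean of per-class F1
per_class_f1 = f1_score(y_true, y_pred, average=None)  # one F1 score per class
print(f"Accuracy: {accuracy:.4f}, macro F1: {macro_f1:.4f}, per-class F1: {per_class_f1}")
```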

Detection speed

Detection speed is the time a model requires to process a single frame, commonly expressed in seconds per instance or Frames Per Second (FPS). A key goal of assessing detection speed in this research is to identify whether certain deep learning models can, while maintaining equivalent accuracy, achieve faster inference due to different architectural designs, thereby making them more appealing for deployment.
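A simple way to measure this quantity is sketched below; the function and its warm-up scheme are illustrative assumptions rather than the exact timing protocol used in this study.

```python
import time
import numpy as np

def measure_detection_speed(model, images, warmup=5):
    """Return mean seconds per image and the corresponding FPS for a Keras model.
    `images` is a float array of shape (N, H, W, 3)."""
    for image in images[:warmup]:                     # warm-up runs to exclude start-up overhead
        model.predict(image[np.newaxis, ...], verbose=0)
    start = time.perf_counter()
    for image in images:                              # time single-image inference, one frame at a time
        model.predict(image[np.newaxis, ...], verbose=0)
    seconds_per_image = (time.perf_counter() - start) / len(images)
    return seconds_per_image, 1.0 / seconds_per_image
```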

Experiment

Dataset preparation

Original dataset

Fig. 4 Area of NW London used for data collection.

For the case study, the city of London in the UK was selected because collecting SSVIs there is legally permissible. A random selection of 1,000 building addresses was made from the North-West (NW) area of London, as depicted in Fig. 4, using the OS Data Hub, the data platform of Ordnance Survey, Great Britain’s national mapping agency (https://osdatahub.os.uk/). These addresses were then used to retrieve and download corresponding building images via the GSV Static API (https://developers.google.com/maps/documentation/streetview/intro).

Within the API, three critical camera parameters were adjusted because they substantially affect the visual cues necessary for the clear identification of building characteristics: the Field of View (FOV), which sets the zoom level or the breadth of the visible scene; pitch, which defines the camera’s vertical tilt relative to the street view vehicle; and heading, which determines the camera’s lateral orientation. These parameters have been shown to be significant in prior research10,11. The optimal ranges for these parameters were established through preliminary trials and adjustments.
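A minimal retrieval sketch using the documented parameters of the Street View Static API (location, size, fov, heading, pitch, key) is given below. The API key, the example address, and the specific parameter values are placeholders for illustration; the ranges actually used are given in Table 2.

```python
import requests

GSV_URL = "https://maps.googleapis.com/maps/api/streetview"
API_KEY = "YOUR_API_KEY"   # placeholder

def fetch_streetview(address, fov, heading, pitch, size="640x640"):
    """Request a single SSVI for one address with explicit camera parameters."""
    params = {
        "location": address,   # an address or "lat,lng" string
        "size": size,          # 640x640 is the maximum resolution of the static API
        "fov": fov,            # field of view (zoom level)
        "heading": heading,    # lateral camera orientation in degrees
        "pitch": pitch,        # vertical tilt in degrees
        "key": API_KEY,
    }
    response = requests.get(GSV_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.content    # JPEG bytes; an error image is returned if no imagery exists

# Example call for the whole-building façade task (illustrative values only).
image_bytes = fetch_streetview("221B Baker Street, London", fov=120, heading=90, pitch=10)
```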

While the ranges for pitch and heading are consistent across tasks, the FOV differs; images capturing the entire building façade require a wider field of view than those focusing only on the first story. The specific ranges for each parameter applied in this study are detailed in Table 2. The images were obtained at an output size of 640 × 640 pixels, the maximum resolution available on the GSV platform. In Fig. 5, the top row displays examples of images categorised as usable, potential, and non-usable that provide a view of the entire building façade. Conversely, the bottom row presents images that show only the first story of the building, classified under the same categories.

Table 2 The camera parameters used in both tasks.

Although 1,000 distinct building addresses were initially selected, multiple images per address were retrieved by systematically adjusting the critical camera parameters. In this research, images of the same address acquired with different camera parameters were incorporated into the training, validation, and testing subsets. This methodological choice explicitly aims to evaluate the model’s ability to identify and recommend suitable camera settings for the practical acquisition of building façade images, thus accurately reflecting real-world application scenarios.

Modifying these parameters created multiple unique perspectives for each building façade, thereby increasing the dataset size beyond the original number of addresses. Specifically, applying the camera parameter ranges resulted in 2,138 images for the whole-building façade task and 2,290 images for the first story building task.

The dataset was then divided into three subsets: a training set for model development, a validation set for selecting the optimal trained model, and a test set for evaluating the finalized model’s performance on unseen data. Images were randomly allocated into training (60%), validation (20%), and test (20%) subsets, resulting in 2,138, 712, and 712 images for the whole-building façade task, and 2,290, 763, and 763 images for the first-story building task, respectively. This random distribution ensured representativeness across all subsets.
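A random 60/20/20 allocation of this kind can be produced with two successive splits, as sketched below; image_paths and labels are assumed parallel lists built during retrieval, and the fixed random seed is an illustrative choice, not necessarily the one used in the study.

```python
from sklearn.model_selection import train_test_split

# First split off 60% for training, then divide the remainder equally (20% / 20%).
train_paths, rest_paths, train_labels, rest_labels = train_test_split(
    image_paths, labels, test_size=0.4, random_state=42)        # 60% training
val_paths, test_paths, val_labels, test_labels = train_test_split(
    rest_paths, rest_labels, test_size=0.5, random_state=42)    # 20% validation, 20% test
```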

Fig. 5 Examples of the original dataset used in both tasks.

Augmenting the dataset using training image augmentation methods

The augmentation parameters were carefully selected through a trial-and-error approach to generate images closely resembling real-world conditions. As summarized in Table 3, the augmented images were created exclusively from the training datasets for each respective task. Figures 6 and 7 illustrate examples of the original and the corresponding augmented images for each task.

Table 3 Datasets created with ranges of parameters.
Fig. 6 Examples of augmented images in the whole-building façade dataset.

Fig. 7 Examples of augmented images in the first-story dataset.

Annotation

In image classification tasks, the annotation process involves assigning specific classes to ground truth data. In this research, annotation was carried out by analysing visual cues to determine whether the entire building façade or just the first story was observable in the images. Owing to the unique requirements of each task, the datasets were annotated individually. For both tasks, images were sorted into designated folders based on their classification as usable images, potential images, or non-usable images. To ensure the quality of the annotations, a separate team conducted a cross-check of a sample from the annotated images.

Synthesis of final dataset

To demonstrate the effectiveness of the proposed methods, two different datasets were constructed, each divided into three subsets: training, validation, and test data. Each image was then annotated with one of the corresponding categories: usable, potential, or non-usable. Table 4 shows the detailed distribution of annotated information for each dataset.

Table 4 Detailed distribution of datasets.

Experimental settings

All experiments were conducted on a system running Windows 10, equipped with an Intel Core i7-7700HQ processor (2.80 GHz, 8 threads), an NVIDIA GeForce RTX 3080 Ti GPU, and 32 GB of RAM. Python, along with the TensorFlow and Keras frameworks, was employed to implement and execute the deep learning algorithms. The dataset, along with detailed statistical information and the relevant code, including hyperparameters, is publicly accessible via the Figshare repository2.

Results and discussion

Results of training and validation

Assessment of underfitting and overfitting

In this research, the analysis focused on the training and validation loss curves to identify possible underfitting or overfitting issues occurring throughout model training. Figure 8 illustrates representative examples of these curves for each transformer-based model at the point of maximum training epochs, clearly demonstrating their evolution during training.

All the models exhibited a consistent and gradual reduction in training and validation losses throughout the entire training duration, reflecting stable and effective learning processes. The continuous decrease observed in training losses confirmed that underfitting was not a significant concern. Furthermore, the similarity and concurrent trends observed in both training and validation losses implied that any overfitting was minimal or insignificant.

Fig. 8 Examples of loss curves during training and validation for attention-based methods.

Overall performance

The performance of five transformer-based methods—Swin Transformer, ViT, PVT, MobileViT, and Axial Transformer—was systematically evaluated for the classification of whole building façades and first-story images. The statistical distribution of these performance metrics (F1 score and accuracy) is visualized in the boxplots presented in Fig. 9, while detailed statistical measures are summarized in Table 5.

Swin Transformer outperformed other models across both tasks and performance metrics, achieving higher median and mean values for F1 scores and accuracy. In the whole building façade task, Swin Transformer recorded the highest mean F1 score (89.66%) and accuracy (90.49%). Conversely, Axial Transformer exhibited the lowest performance in this scenario, with a mean F1 score of 85.97% and mean accuracy of 87.19%, highlighting considerable variability among the evaluated transformer architectures.

Fig. 9 Boxplots of F1 score and accuracy across methods for each hyperparameter.

For the first-story classification task, Swin Transformer again demonstrated superior performance, with the highest mean accuracy (91.23%) and a strong F1 score (90.29%). Axial Transformer, however, displayed the lowest mean accuracy (88.13%) and mean F1 score (86.74%).

The boxplots further illustrate minimal variation within models, indicating stable performance across multiple hyperparameter configurations. Swin Transformer’s narrower interquartile range and higher median values suggest robust and reliable classification capability, emphasizing its effectiveness and generalizability in comparison to the other evaluated transformer-based methods.

Table 5 Statistics of model performance based on F1 score and accuracy.

Analysis of class-wise performance

Table 6 presents class-wise performance results of the best-performing models for each transformer-based method evaluated across two classification tasks: whole-building façade classification and first-story façade classification. The results include F1 scores and accuracy for three distinct classes—Usable, Potential, and Non-usable—as well as their overall averages.

In the whole-building façade classification task, Swin Transformer demonstrated the highest class-wise performance, achieving F1 scores of 91.08% (Usable), 90.47% (Potential), and 89.12% (Non-usable), resulting in an overall average F1 score of 90.22%. Accuracy scores followed a similar trend, with the highest accuracy of 92.85% recorded for the Usable class, indicating robust performance across all classes. Conversely, Axial Transformer yielded the lowest performance among the evaluated methods, particularly for the Non-usable class, where it achieved an F1 score of 85.68% and accuracy of 87.42%.

For the first-story façade classification task, Swin Transformer again outperformed the other methods, achieving notably high accuracy of 93.87% for the Usable class and an overall average accuracy of 92.29%. Its class-wise F1 scores remained consistently high, demonstrating balanced and reliable classification performance. In contrast, MobileViT and Axial Transformer recorded comparatively lower class-wise performance. Axial Transformer, in particular, achieved the lowest F1 score (84.87%) and accuracy (87.37%) for the Non-usable class, indicating challenges in accurately classifying these images.

Overall, Swin Transformer exhibited superior and balanced performance across all evaluated classes. Models such as Axial Transformer and MobileViT showed noticeable performance variability and reduced effectiveness, particularly for challenging classes. These findings highlight Swin Transformer’s robust generalization capabilities and suitability for reliable use in both classification scenarios.

Table 6 Class-wise performance of best models in each method.

Selection of best method

Table 7 summarizes the 5-fold cross-validation performance results of the best-performing transformer-based models (Swin Transformer, ViT, PVT, MobileViT, and Axial Transformer) for two key classification tasks: whole-building façade classification and first-story façade classification. Swin Transformer achieved the highest average performance across both tasks, demonstrating robust generalization capability with minimal variability among folds. Specifically, Swin Transformer recorded the highest average F1 scores (90.15% for whole-building façades and 89.66% for first-story façades) and the highest average accuracy scores (91.69% and 92.21%, respectively).

Conversely, Axial Transformer exhibited the lowest overall performance, with average F1 scores of 86.87% for whole-building façades and 86.03% for first-story façades, indicating comparatively limited effectiveness. Performance variations across folds were modest, signifying consistent and reliable stability.

Regarding computational efficiency, Tables S.2 (GPU inference times) and S.3 (CPU inference times) summarize the detection speeds for each transformer-based method. Swin Transformer consistently demonstrated the fastest detection speeds, averaging 0.0221 s per instance on GPU and 0.3315 s per instance on CPU. ViT and Axial Transformer showed slightly slower performances, with mean inference times of 0.0235 and 0.0239 s per instance on GPU, and 0.3525 and 0.3585 s per instance on CPU, respectively. MobileViT had moderate detection speeds, averaging 0.0248 s on GPU and 0.3720 s on CPU. The slowest inference times were observed for PVT, averaging 0.0272 s per instance on GPU and 0.4080 s per instance on CPU.

Considering predictive accuracy and computational efficiency together, Swin Transformer emerges as the most suitable model for practical deployment in façade and first-story image classification tasks. Its superior combination of high accuracy, consistency, and rapid inference speed highlights its strong potential for reliable real-world applications.

Table 7 5-fold cross-validation results of best-performing models in each method.

Performance analysis of best-performing model

Validation and test dataset analysis

Figure 10 compares the Swin Transformer’s performance on the validation and test datasets for the two tasks: whole-building façade and first-story classification. In the whole-building façade task, only minor variations appeared, with validation and test averages closely matching (F1: 90.22% vs. 90.15%; accuracy: 91.78% vs. 91.72%), indicating stable performance. Similarly, the first-story task displayed negligible differences (F1: 89.74% vs. 89.72%; accuracy: 92.29% vs. 92.27%). Class-wise analysis also revealed minimal fluctuations, with minor increases and decreases in specific classes, suggesting strong generalization rather than overfitting. The minimal differences (within ± 0.15%) between datasets highlight the model’s robustness and consistent predictive ability across diverse, unseen data, confirming its suitability for practical applications.

Fig. 10 Performance variations of the best-performing model between validation and test sets.

Impact of image conditions

Table 8 presents model performance across varying image conditions, highlighting the impact of complexity factors such as illumination, visual range, architectural style, building density, and occlusion. Performance was higher under favorable conditions. In both tasks, the model was less effective under low illumination than under high illumination, with F1 scores decreasing notably from 91.23% to 90.25% (whole façade) and from 90.72% to 89.80% (first story). Similarly, accuracy dropped from 92.74% to 91.82% and from 93.56% to 92.78%, respectively. Near-field views significantly outperformed far-field views, indicating a clear challenge with distant images.

Moderate differences appeared concerning architectural styles and building densities, with traditional buildings and single-density scenarios yielding superior results. Occlusions by cars or other obstacles markedly impacted performance, reducing accuracy in the whole façade task from 93.11% (no occlusion) to 91.45% (occluded), and from 93.89% to 92.45% in the first-story scenario. Overall, the model exhibited robust generalization but was sensitive to challenging conditions involving low illumination, greater distances, and partial occlusions.

Table 8 Results of the best model, grouped by different image conditions.

Application for street view maps

The trained models for each task—Task 1 (whole-building façade images) and Task 2 (first-story façade images)—were applied to the complete experimental dataset, comprising 1,000 images per task. A total of 10 iterations were conducted to adjust camera parameters. Detection results were geocoded, and each class (usable, potential, and non-usable) was visualized on Google Maps using color-coded markers: green for usable, blue for potential, and red for non-usable, as illustrated in Fig. 11a and b.

For Task 1, the discrepancies between ground truth labels and model predictions were as follows:

  • Ground Truth: Usable: 392, Potential: 328, Non-usable: 280.

  • Model Predictions: Usable: 378, Potential: 355, Non-usable: 267.

For Task 2, the discrepancies were:

  • Ground Truth: Usable: 377, Potential: 332, Non-usable: 291.

  • Model Predictions: Usable: 352, Potential: 336, Non-usable: 312.
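The color-coded map in Fig. 11 can be reproduced in spirit with the sketch below, which uses folium (an OpenStreetMap-based library) as a stand-in for the Google Maps visualization used in the study; the coordinates and results list are illustrative assumptions.

```python
import folium

# Assumed list of (latitude, longitude, predicted_class) tuples produced by the classifier.
results = [(51.56, -0.28, "usable"), (51.57, -0.27, "potential"), (51.55, -0.29, "non-usable")]
colors = {"usable": "green", "potential": "blue", "non-usable": "red"}

fmap = folium.Map(location=[51.56, -0.28], zoom_start=13)   # roughly centred on NW London
for lat, lon, label in results:
    folium.CircleMarker(
        location=[lat, lon], radius=5,
        color=colors[label], fill=True, fill_color=colors[label],
        popup=label,
    ).add_to(fmap)
fmap.save("classification_map.html")   # interactive map with one marker per classified address
```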

Fig. 11 Application examples of the best-performing model (Map data © 2025 Google).

Grad-CAM analysis

To further examine these findings, a Gradient-weighted Class Activation Mapping (Grad-CAM) analysis was carried out. Figure 12 shows illustrative Grad-CAM heatmaps obtained from the top-performing Swin Transformer-based model.
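A generic Grad-CAM computation is sketched below for reference. It assumes the named feature layer yields a spatial map of shape (1, h, w, c); for transformer backbones such as the Swin Transformer, the token sequence must first be reshaped to such a grid, and the layer name itself is model-specific.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, feature_layer_name):
    """Return a [0, 1] heatmap for the predicted class of a single image (H, W, 3)."""
    feature_layer = model.get_layer(feature_layer_name)
    grad_model = tf.keras.Model(model.inputs, [feature_layer.output, model.output])
    with tf.GradientTape() as tape:
        features, preds = grad_model(image[np.newaxis, ...])
        class_index = int(tf.argmax(preds[0]))
        score = preds[:, class_index]                 # score of the predicted class
    grads = tape.gradient(score, features)            # gradient of the score w.r.t. the feature map
    weights = tf.reduce_mean(grads, axis=(1, 2))      # global-average-pool the gradients per channel
    cam = tf.reduce_sum(features * weights[:, None, None, :], axis=-1)[0]
    cam = tf.nn.relu(cam)                             # keep only positive contributions
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```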

For correctly classified images in both tasks, the “usable” class exhibited focused activations primarily on building façades. In contrast, the “potential” class displayed activations that were generally directed towards façade regions but were often scattered or inconsistently aligned. The “non-usable” class consistently showed strong activations on irrelevant or obstructive features, such as trees and vehicles.

For incorrectly classified images, activations in the “usable” class appeared dispersed across façade areas, likely due to factors such as uncommon architectural designs, lighting variations, or reflections. In the “potential” class, activation patterns were unclear or variable, emphasizing the model’s challenges with ambiguous cases. Lastly, in the “non-usable” class, there was a noticeable lack of emphasis on obstructions or distracting features, reflecting misinterpretations by the model.

Comparative analysis

Selection of comparative methods

To further illustrate the effectiveness of the proposed Swin Transformer-based approach, comparative analyses were performed using various established non-attention-based classification algorithms, including ResNet-101, ResNet-152, MobileNetV3, CSPNet, ConvNeXt, and EfficientNet. Hyperparameters for each of these algorithms were carefully tuned through a series of preliminary experiments, ultimately leading to the evaluation of 198 different model configurations. Table S.1 provides a concise overview of the classification methods used, their selected optimal hyperparameters, and the best-performing model configurations as determined from testing results.

Fig. 12 Examples of Grad-CAM analysis for correctly and incorrectly detected results.

Comparative results

F1 score and accuracy with detection speed

Table 9 presents a comparative evaluation of the proposed method and several popular classifiers across two image-classification tasks: whole-building façade and first-story analysis. The proposed transformer-based model achieved superior performance compared to alternative architectures such as ResNet-101, ResNet-152, MobileNetV3, CSPNet, ConvNeXt, and EfficientNet.

Specifically, for the whole building façade task, the proposed method attained the highest average F1 score (90.15%) and accuracy (91.72%), surpassing the second-best model, ConvNeXt, by approximately 0.5 percentage points in both F1 score and accuracy. In the first-story task, the proposed method again delivered the best results, with an average F1 score of 89.72% and accuracy of 92.27%, outperforming ConvNeXt by approximately 0.44 percentage points in F1 score and 0.33 percentage points in accuracy.

Tables S.2 and S.3 report detection speeds. While achieving the highest accuracy and F1 scores, exceeding the second-best model, ConvNeXt, by approximately 0.5 percentage points in the whole building façade task and around 0.4 percentage points in the first-story task, the proposed method exhibited slightly slower detection speeds (GPU: 0.0221 s/instance; CPU: 0.3315 s/instance) than architectures such as MobileNetV3 and ResNet-101; these differences were, however, minimal.

Table 9 Comparative results of proposed method and other classifiers.
Bootstrap analysis

A paired bootstrap analysis was conducted to verify whether small performance differences (≤ 1 percentage point) between models were statistically significant. Table 10 summarizes the results of this analysis. For the whole-building façade classification task, statistically significant differences were observed between the best-performing model and MobileNetV3 (95% CI: [0.0070, 0.0660]) as well as EfficientNet (95% CI: [0.0028, 0.0618]). Conversely, comparisons with ResNet-101, ResNet-152, CSPNet, and ConvNeXt did not show statistically significant differences, as their 95% confidence intervals included zero. For the first-story façade task, none of the model comparisons revealed statistically significant differences, with all corresponding confidence intervals encompassing zero. Overall, these findings indicate that, despite minor numerical differences (≤ 1 percentage point) between models, most were not statistically significant, emphasizing the importance of statistical validation rather than reliance on numerical differences alone.

Table 10 Paired bootstrap analysis results.
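A minimal paired bootstrap sketch is given below. It assumes the compared quantity is per-image accuracy on a shared test set (the paper does not state whether accuracy or F1 was bootstrapped), and the number of resamples and seed are illustrative choices.

```python
import numpy as np

def paired_bootstrap_ci(correct_a, correct_b, n_resamples=10_000, alpha=0.05, seed=0):
    """95% CI for the accuracy difference (model A minus model B) on the same test set.
    `correct_a` and `correct_b` are 0/1 arrays indicating per-image correctness."""
    rng = np.random.default_rng(seed)
    correct_a, correct_b = np.asarray(correct_a), np.asarray(correct_b)
    n = len(correct_a)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)              # resample test images with replacement
        diffs[i] = correct_a[idx].mean() - correct_b[idx].mean()
    lower, upper = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper                                # significant if the interval excludes zero

# Usage (hypothetical inputs): low, high = paired_bootstrap_ci(correct_best, correct_mobilenet)
```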

Conclusions

This study developed transformer-based deep learning models designed for automatically classifying SSVIs into three categories—usable, potential, and non-usable—for building characteristic analyses. Specifically, two critical tasks were targeted: the analysis of entire building façades and detailed inspections of first-story façades.

Five transformer architectures (Swin Transformer, ViT, PVT, MobileViT, and Axial Transformer) were combined with extensive hyperparameter tuning and six data augmentation techniques, yielding a total of 1,026 models. Among these, the Swin Transformer-based approach demonstrated the highest overall performance. Specifically, for the whole building façade task, the proposed model achieved class-wise F1 scores of 90.95% (usable), 90.52% (potential), and 88.97% (non-usable), with corresponding accuracies of 92.70%, 91.85%, and 90.60%, respectively, resulting in an average F1 score of 90.15% and accuracy of 91.72%. For the first-story façade task, the model similarly exhibited superior performance, obtaining class-wise F1 scores of 90.58% (usable), 89.94% (potential), and 88.65% (non-usable), along with accuracies of 93.92%, 92.41%, and 90.49%, respectively, yielding an average F1 score of 89.72% and accuracy of 92.27%.

Comparative analysis showed that transformer-based models consistently outperformed 810 traditional CNN-based architectures (including ResNet, MobileNet, CSPNet, ConvNeXt, and EfficientNet) in both accuracy and F1 score. Moreover, the best-performing transformer model demonstrated rapid detection capabilities, with an average inference time of 0.022 s per image, underscoring its practical suitability for real-time analysis. Finally, paired bootstrap analysis indicated statistically significant performance differences between the proposed transformer-based model and MobileNetV3 and EfficientNet for the whole building façade task; however, no significant differences were observed for the first-story façade task, as all corresponding confidence intervals included zero.

This study highlights the significant practical benefits of the proposed transformer-based solution, including enhanced efficiency and accuracy in classifying Static Street View Images (SSVIs) for urban building analysis. Additionally, future research should prioritize integrating this method with Geographic Information System (GIS) platforms. Such integration could facilitate comprehensive spatial analyses, enhance real-time urban planning capabilities, and support scalability across diverse geographic contexts.

Although promising, the developed models are limited by the dataset’s geographical specificity and temporal scope, potentially affecting their generalizability. Furthermore, this research employed only traditional image augmentation techniques, highlighting the need to investigate advanced augmentation methods—such as GAN-based augmentation or sophisticated geometric transformations—in future studies. Therefore, subsequent research should focus on expanding dataset diversity, employing advanced augmentation strategies, and leveraging robust pre-trained transformer models to enable more efficient and broadly applicable façade analyses.