Introduction

Breast cancer poses a significant global health threat. It affects millions of women annually and contributes to high mortality rates. Developing robust image recognition technologies to identify abnormal breast tissue regions is crucial. Early detection of these anomalies can improve treatment outcomes by enabling timely intervention. Delayed detection, however, can worsen the disease and complicate treatment strategies. Despite advancements in computer vision and artificial intelligence, fully automating breast cancer detection from X-ray images presents several challenges:

  1. High-resolution X-ray images contain critical visual and semantic details, including both normal tissue structures and potential abnormalities. Effectively extracting and analyzing these details to replicate the human gaze sequence is computationally complex. A metric capable of pinpointing abnormal regions, mirroring human visual perception, is needed.

  2. Deep learning models show promise in object representation but often overlook the nuanced sequences and geometries of gaze movements essential for precise breast cancer diagnosis. Developing a model that integrates these aspects into a unified visual recognition framework remains challenging.

  3. Breast cancer images include various attributes such as cancer stage, tissue types, and patient data, which are crucial for human analysis. Integrating these subtle cues into gaze behavior models, while capturing the spatial relationships of regional features, is essential for identifying meaningful object patches. The fusion of these elements into an actionable gaze behavior model is difficult.

Fig. 1

The workflow for the proposed rank-based framework to capture the gaze shift sequence (GSS) from large-scale breast cancer images. The process begins by extracting potential abnormalities using BING to generate object patches. A weakly-supervised ranking algorithm selects prominent patches, which are then linked to form the GSS. These are processed through a hierarchical aggregation network and classified using an SVM to identify different types of breast cancer.

To address these challenges, we introduce a rank-based framework designed to emulate the human gaze shift sequence (GSS), as shown in Fig. 1. This algorithm forms the foundation of a framework tailored to capture the GSS from large-scale breast images. Our approach uses BING (Binarized Normed Gradients)1 to extract potential abnormalities and generate object patches from each image. Given the large number of patches and computational constraints, analyzing each one individually is impractical. We therefore propose a weakly-supervised ranking algorithm that identifies prominent patches by combining multiple subtle features. The selected patches are linked sequentially to form the GSS. A hierarchical aggregation network then creates a multi-layer hierarchical representation for each GSS. Using this representation, an SVM is trained to classify different types of breast cancer in the images.

Our primary innovations are as follows: 1) Automated simulation of human visual perception and cognition to enhance breast cancer detection; 2) A weak-labeled learning framework to select salient image regions that mirror human gaze patterns; 3) The establishment of an expansive, regularly updated breast cancer image dataset for comprehensive testing and validation of our methodology.

Review of prior work

Our study intersects with two crucial domains in image understanding: multi-layer image modeling using deep learning and image quality-related feature learning. This review highlights key contributions within these areas.

Deep Convolutional Neural Networks (CNNs) are key to developing visual recognition models for extensive image datasets like ImageNet2. Krizhevsky et al.3 demonstrated excellent image classification accuracy using a subset of ImageNet4. The selective search technique11 refines exhaustive search with semantic segmentation to generate targeted image regions, enhancing category-specific detection. R-CNN, introduced by Girshick et al.11, uses an intelligent mechanism to sample image regions and boost object detection efficiency. Zhou et al.12 and Wu et al. optimized high-quality training sample selection and preprocessing to improve CNN-based models. He et al.15 developed ResNet, which uses residual learning to train deeper networks. Wu et al.'s BlockDrop selectively activates network layers during inference to save computational resources without affecting performance. In aerial image analysis, semantic models have been developed for both image-level and region-level tasks. Zhang et al.16 used graphlets to geometrically represent aerial photos, improving categorization with a discriminative kernel machine. Xia et al.17 and Akar et al.18 improved aerial image categorization with weakly-supervised methods and object-level features. Specialized CNN models have been applied to high-resolution urban analysis by Sameen et al. and to aircraft localization in aerial photos9. Wang et al.17 and Yang et al. focused on visual attention and skip connections for more effective aerial photo analysis. While effective, these methods struggle to balance efficiency with model complexity, which limits their scalability and real-time applicability on large-scale datasets. Moreover, they deeply encode the global image or multiple patches within each image but cannot select the visually or semantically salient patches for deeper analysis.

In image quality assessment, deep learning architectures have made significant advancements. Karayev et al.3 showed that deep features linked to object-level labels can rival traditional quality metrics. Lu et al.18 introduced a dual-column architecture for multi-source inputs and a deep model to assess visual quality, using randomly selected image regions during subnet training. Tang et al.19 developed a semi-supervised method, trained on unlabeled images and fine-tuned with human-rated images. Mai et al.22 focused on preserving global composition by learning quality features directly from images at their original sizes. Kao et al.20 and Ma et al.21 expanded visual quality assessment with multi-task models that evaluate style and quality, using a layout-aware strategy to handle various image sizes and thereby bridging visual recognition and quality evaluation. However, these methods face challenges: Karayev et al. show limited generalization for subjective quality; Lu et al. neglect global coherence; Mai et al. face increased computational demands; and Kao et al. and Ma et al. bridge style and quality evaluation but struggle with optimization and diverse image distributions.

Recent works explore machine learning and transformer-based models to improve visual recognition. Yan et al.42 introduced a method for learning compact and informative attributes for visual recognition, with an emphasis on semantic understanding. Han et al.43 introduced the Dynamic Perceiver, which adapts to different inputs for efficient visual recognition. Hou et al.44 presented Conv2Former, a transformer-style convolutional network that blends convolutional networks and transformers for better performance. Jiao et al.45 introduced DilateFormer, a multi-scale dilated transformer with better receptive-field coverage. On the application side, Botlagunta et al.46 addressed predicting breast cancer metastasis using machine learning on clinical data to aid diagnosis. These works highlight the evolution of visual recognition, focusing on efficiency, interpretability, and healthcare applications. However, models like Conv2Former and DilateFormer may increase computational complexity, limiting real-time applicability, and Botlagunta et al.'s approach faces challenges in generalizing across diverse clinical datasets due to medical data variability. Umirzakova et al.14 introduced a deeply-learned residual feature that enhances the super-resolution of medical images.

Our method

Ranking through weak-label quality modeling

Every breast image contains a multitude of detailed object parts that collectively indicate the presence and stage of cancer. To discern these elements, we employ the Binarized Normed Gradients (BING)13 algorithm to generate object proposals. In the context of breast cancer detection, BING offers several key advantages when detecting object-aware patches for analysis. First, its efficiency accelerates detection, enabling quick processing of large volumes of medical images, such as mammograms, to identify potential regions of interest without compromising accuracy; this allows faster diagnoses and real-time feedback, which is invaluable in clinical settings. Second, BING produces a concise, non-redundant set of object proposals focused on the regions most likely to exhibit malignancy, improving computational efficiency and enhancing downstream detection accuracy. Third, BING generalizes robustly across object categories, so it can handle the variety of tumor sizes, shapes, and stages in breast cancer images, from early-stage tumors to advanced malignancies. Finally, by aligning with the gaze shift sequence (GSS), BING mimics human visual attention, prioritizing crucial areas of the image much as a radiologist examines specific regions first. This human-like attention mechanism, coupled with BING's speed and generalization, enhances precision in detecting breast cancer across stages, improving diagnostic accuracy and patient outcomes.
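To make this step concrete, the sketch below generates BING proposals with the ObjectnessBING implementation shipped in opencv-contrib-python; the image path, the pretrained-model directory, and the cap of 500 proposals are illustrative assumptions rather than our released configuration.

```python
# Sketch: BING object proposals for a breast X-ray via OpenCV's contrib
# saliency module (pip install opencv-contrib-python). Paths are placeholders.
import cv2

image = cv2.imread("mammogram.png")
bing = cv2.saliency.ObjectnessBING_create()
bing.setTrainingPath("bing_model")            # directory holding pretrained BING weights

ok, boxes = bing.computeSaliency(image)       # boxes: N x 1 x 4, (minX, minY, maxX, maxY)
if ok:
    # BING returns proposals roughly ordered by objectness; keep the top ones.
    patches = []
    for box in boxes[:500].reshape(-1, 4):
        x1, y1, x2, y2 = box.astype(int)
        patches.append(image[y1:y2, x1:x2])   # candidate object-aware patch
```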

Weakly-Supervised Ranking: In a typical breast image, hundreds of object patches can be identified. However, mirroring how humans perceive a scene, only a succinct set of patches, those that are visually or semantically prominent, actually draws human focus, while the rest receive minimal attention. To exploit this property, we introduce a weak-label ranking technique that encodes: 1) weakly-supervised labels integrated from other domains, such as aerial imagery, and 2) the natural distribution of object patches on a manifold.

For each breast image, several weak attributes are identified, including patient age, gender, and profession. Given \(K\) distinct attribute types, we construct a \(K\)-dimensional feature vector to represent these attributes. Each attribute is assigned a value between 0 and 1, reflecting its intensity or prevalence. This feature vector is then integrated into the optimization objective of our ranking model, influencing the prioritization of object patches in each breast image. The objective is formulated as follows:

$$\begin{aligned} \epsilon ({\textbf {D}}, {\textbf {S}}) = \alpha \Vert {\textbf {D}}\Vert _{21} + \beta \sum _i \psi (y_i, k_i) = \alpha \Vert {\textbf {D}}\Vert _{21} + \beta \Vert {\textbf {D}}^T {\textbf {X}} - {\textbf {S}}\Vert , \end{aligned}$$
(1)

This formulation lays the groundwork for the model, in which a transformation matrix \({\textbf {D}}\) of dimensions \(U \times V\) is optimized across all object patches. This matrix linearly transforms each object patch's feature vector into a corresponding ranking score. Here, \(x_i\) denotes the deep/shallow feature of each patch and \(s_i\) the calculated ranking score. We let \(M\) denote the number of patches at the training stage and \(R\) the corresponding feature dimension. Matrix \({\textbf {X}}\) collects the feature vectors \(x_i\), and matrix \({\textbf {S}}\) collects the score vectors \(s^i\); that is, \(s_c^i = 1\) indicates that the \(i\)-th patch is ranked at the \(c\)-th level, and zero otherwise. The regularizer \(\alpha \Vert {\textbf {D}}\Vert _{21}\) promotes row sparsity in \({\textbf {D}}\) through the \(l_{21}\) norm, reducing redundancy and suppressing noise. The term \(\beta \Vert {\textbf {D}}^T {\textbf {X}} - {\textbf {S}}\Vert\) serves as a loss that enhances ranking accuracy.

The function \(\psi (y_i, k_i)\) represents a ranking consistency penalty that quantifies the discrepancy between the predicted ranking and the expected ranking derived from weak labels. Specifically, \(y_i\) denotes the weakly supervised label assigned to the \(i\)-th patch, which may be obtained from auxiliary datasets or domain knowledge, while \(k_i\) represents the assigned ranking level for the \(i\)-th patch within the given image. This term enforces consistency between the weak labels and the ranking output, ensuring that patches with stronger indications of malignancy receive higher ranking scores.
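For illustration, a minimal NumPy sketch of evaluating Eq. (1) follows, with random data standing in for patch features; the Frobenius norm for the loss term and the one-hot column layout of \(\textbf{S}\) are assumptions where the text leaves details open.

```python
# Sketch: evaluating the ranking objective in Eq. (1).
# X holds one feature column per patch; S holds the matching score columns.
import numpy as np

def l21_norm(D):
    """l_{2,1} norm: sum of the l2 norms of the rows of D (row sparsity)."""
    return np.sum(np.linalg.norm(D, axis=1))

def ranking_objective(D, X, S, alpha=0.1, beta=0.1):
    loss = np.linalg.norm(D.T @ X - S)        # ranking loss ||D^T X - S||
    return alpha * l21_norm(D) + beta * loss

rng = np.random.default_rng(0)
U, V, M = 64, 4, 200                          # feature dim, ranking levels, patches
D = rng.normal(size=(U, V))                   # transformation matrix
X = rng.normal(size=(U, M))                   # patch features
S = np.eye(V)[rng.integers(V, size=M)].T      # one-hot ranking levels, V x M
print(ranking_objective(D, X, S))
```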

Object patches within the same breast image frequently overlap, indicating strong correlations in feature space. Drawing on manifold learning, we exploit the relationships among spatially neighboring object patches in our ranking paradigm. The approach integrates the manifold's geometric structure into the ranking methodology through two primary steps:

Step 1: We build a \(T \times T\) similarity graph over the \(T\) patches obtained from each breast image by the BING algorithm. Rather than constructing a traditional \(k\)-nearest neighbor (kNN) graph, which demands \(O(T^2)\) comparisons, we connect only those patches that share regions. This strategy has a further advantage: it characterizes each object patch in relation to its neighbors, implicitly preserving the global spatial configuration.

Step 2: Our method for assembling the kNN graph is markedly more efficient than traditional approaches because it avoids exhaustive comparisons among all possible object patches. Denoting by \(P\) the number of spatially adjacent object patches, constructing the kNN graph costs \(O(PT)\), significantly less than the \(O(PT^2)\) required by conventional methods.
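The sketch below illustrates the connectivity rule under assumed \((x_1, y_1, x_2, y_2)\) bounding boxes and a Gaussian bandwidth of 1; it scans patch pairs for clarity, whereas a spatial index over the \(P\) adjacent patches would realize the \(O(PT)\) cost stated above.

```python
# Sketch: sparse similarity graph linking only patches whose bounding
# boxes overlap, instead of comparing all patch pairs by appearance.
import numpy as np

def overlap(a, b):
    """Intersection area of two (x1, y1, x2, y2) boxes; > 0 means shared region."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def similarity_graph(boxes, features, sigma=1.0):
    T = len(boxes)
    N = np.zeros((T, T))
    for i in range(T):
        for j in range(i + 1, T):
            if overlap(boxes[i], boxes[j]) > 0:        # spatial neighbors only
                d2 = np.sum((features[i] - features[j]) ** 2)
                N[i, j] = N[j, i] = np.exp(-d2 / (2 * sigma**2))
    return N
```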

This kNN graph forms the basis of our manifold ranking algorithm, which is further described as follows: The regularization function \(\Omega (\textbf{S})\) is expressed as:

$$\begin{aligned} \Omega (\textbf{S}) & = \sum _{j,k}^T \left( \textbf{N}_{jk} \left\| \frac{1}{q_{jj}} s_j - \frac{1}{p_{kk}} s_k \right\| ^2 \right) + \phi \sum _i \left\| s_i - b_1 \right\| \nonumber \\ & = \text {tr}(\textbf{S}^T\textbf{KS}) + \phi \left\| \textbf{S} - \textbf{B} \right\| , \end{aligned}$$
(2)

In this formulation, \(\textbf{N}\) is the Gaussian kernel matrix measuring the closeness of each patch to its adjacent patches, and \(\textbf{K}\) is the symmetrically normalized Laplacian matrix47 defined by \(\textbf{N}\), that is, \(\textbf{K} = \textbf{M}^{-1/2} (\textbf{M}-\textbf{N}) \textbf{M}^{-1/2}\), where \(\textbf{M}\) is the diagonal degree matrix with \(\textbf{M}_{ii}=\sum \nolimits _j \textbf{N}_{ij}\). Matrix \(\textbf{B}\) initializes our ranking process and is set to the identity matrix in our implementation. The weight \(\phi\) controls the importance of the second term. \(q_{ii}\) is a normalization factor that scales the ranking score \(s_i\) of the \(i\)-th patch and is computed from the feature vector associated with the patch. Similarly, \(p_{ii}\) is a normalization factor related to the distribution of rankings across all patches, adjusting for the global ranking structure. \(b_1\) is a baseline value for the ranking scores: it constrains the rankings, preventing the scores from drifting too far, and can be viewed as a reference or central tendency around which the rankings revolve. It is set as a constant or as a summary statistic of the ranking scores. In our implementation, \(q_{ii}\), \(p_{ii}\), and \(b_1\) are calculated as follows.

  • \(q_{ii} = \Vert x_i \Vert _2^2\): The squared Euclidean norm of the feature vector \(x_i\) for patch \(i\).

  • \(p_{ii} = \sum _j \textbf{N}_{ij}\): The sum of kernel matrix values \(\textbf{N}_{ij}\), which represents the total similarity between patch \(i\) and all other patches.

  • \(b_1 = \frac{1}{M} \sum _{i=1}^M s_i\): The mean of the ranking scores \(s_i\) over all patches (or another constant value if desired).
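In code, these quantities and the Laplacian \(\textbf{K}\) can be assembled as follows; a minimal NumPy sketch, where the small clipping constant guarding the inverse square root is a numerical-stability assumption.

```python
# Sketch: the normalization terms and the symmetrically normalized graph
# Laplacian K, built from the kernel matrix N, patch features X (one
# column per patch), and the current ranking scores.
import numpy as np

def laplacian_and_norms(N, X, scores):
    q = np.sum(X**2, axis=0)                   # q_ii = ||x_i||_2^2
    p = np.sum(N, axis=1)                      # p_ii = sum_j N_ij (total similarity)
    M = np.diag(p)                             # degree matrix, M_ii = sum_j N_ij
    M_inv_sqrt = np.diag(1.0 / np.sqrt(np.clip(p, 1e-12, None)))
    K = M_inv_sqrt @ (M - N) @ M_inv_sqrt      # K = M^{-1/2} (M - N) M^{-1/2}
    b1 = np.mean(scores)                       # baseline: mean ranking score
    return K, q, p, b1
```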

The regularizer in Eq. (2) serves two major purposes: local ranking consistency (the first term) and global stability (the second term). The first term ensures that similar patches receive similar ranking scores, penalizing score differences according to the similarity structure encoded in the kNN graph; through the Laplacian matrix \(\textbf{K}\), it enforces a smooth ranking function aligned with the intrinsic data manifold. The second term keeps ranking scores from drifting far from an expected distribution, preventing extreme fluctuations, reducing overfitting, and enhancing robustness in ranking predictions.

The practical and theoretical motivations are as follows. Preserving local ranking consistency is widely used in semi-supervised learning to maintain the intrinsic structure of the data, and it improves generalization by ensuring that similar data points (image patches) keep close ranking values. The regularization also improves the stability of the ranking model, making it more robust against noise and variations in feature representations, and it enhances ranking accuracy by preserving meaningful local relationships.

By incorporating the regularization term into our existing objective functions, we establish a detailed optimization function in our designed ranking model:

$$\begin{aligned} & \min _{\textbf{D}} \left( \alpha \Vert \textbf{D}\Vert _{21} + \beta \Vert \textbf{D}^T \textbf{X} - \textbf{S}\Vert + \text {tr}(\textbf{S}^T \textbf{KS}) + \phi \Vert \textbf{S} -\textbf{B}\Vert \right) , \nonumber \\ & \text {s.t.,}~~\textbf{S}\in \{0,1\}^{U\times V}, ~\textbf{S}^T \textbf{S} = \textbf{I}, \end{aligned}$$
(3)

where the constraint \(\textbf{S}^T \textbf{S} = \textbf{I}\) ensures the orthogonality of the matrix \(\textbf{S}\). The constraint that \(\textbf{S}\) is binary (i.e., \(s_c^i = 1\) if the \(i\)-th patch belongs to the \(c\)-th ranking level, otherwise 0) is enforced during training through a combination of initialization, optimization constraints, and thresholding. First, during initialization, \(\textbf{S}\) is constructed from weakly supervised labels or ranking priors, ensuring that each patch is assigned a discrete ranking level. Throughout training, the objective minimizes the discrepancy between the transformed feature representations \(\textbf{D}^T \textbf{X}\) and \(\textbf{S}\), guiding updates to the transformation matrix \(\textbf{D}\) while maintaining the binary nature of \(\textbf{S}\). To preserve this binary structure, projected gradient descent or a similar constraint-aware optimization method can be applied, where the updated values of \(\textbf{S}\) are discretized after each iteration by thresholding: the highest-ranked entry in each row is set to 1 and the others to 0. This keeps \(\textbf{S}\) a valid binary matrix throughout training while still allowing optimization-driven refinements based on ranking consistency and feature transformation.
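A minimal sketch of the re-discretization step follows, with patches as rows and ranking levels as columns, matching the row-wise thresholding described above.

```python
# Sketch: projecting the continuous score matrix back onto valid binary
# assignments after a gradient step (one-hot per row).
import numpy as np

def project_to_binary(S_cont):
    """Keep the highest-scoring ranking level in each row; zero the rest."""
    S_bin = np.zeros_like(S_cont)
    S_bin[np.arange(S_cont.shape[0]), np.argmax(S_cont, axis=1)] = 1.0
    return S_bin

S = project_to_binary(np.random.default_rng(0).normal(size=(6, 3)))  # 6 patches, 3 levels
```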

Given the complexity of equation (3), an analytical solution is infeasible. Instead, we deploy the concave-convex procedure (CCCP) to iteratively compute the parameters, navigating the optimization process effectively.
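To convey the flavor of the procedure, here is a toy one-dimensional CCCP iteration on \(f(x) = x^4 - (x-2)^2\), a convex term plus a concave term; it is unrelated to our actual objective and only illustrates the linearize-then-minimize loop.

```python
# Toy sketch of the concave-convex procedure (CCCP): each iteration
# linearizes the concave part of f at the current point and minimizes
# the resulting convex surrogate in closed form.
import numpy as np

def cccp(x, iters=100):
    for _ in range(iters):
        g = -2.0 * (x - 2.0)       # gradient of the concave part -(x - 2)^2
        x = np.cbrt(-g / 4.0)      # argmin of x^4 + g*x, i.e. solve 4x^3 + g = 0
    return x

print(cccp(5.0))                   # converges to a stationary point of f
```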

Having established the ranking framework for object patches in breast images, we now turn our attention to leveraging this ranked information to model how human attention shifts through the most salient regions of an image. This leads us to the construction of the Gaze Shifting Sequence (GSS), where the identified top-ranked patches are used to generate deep representations that emulate human visual behavior.

Calculating deep representation for GSS

Our newly devised algorithm, built on the weakly-supervised ranking framework, reveals a compelling pattern: the highest-ranked object patches, notable for their high quality, exhibit both visual and semantic prominence that naturally captures the gaze. This phenomenon is illustrated in Fig. 2. Typically, the human gaze is drawn first to the most conspicuous object patch before migrating to the next in order of salience. Leveraging this behavioral insight, we crafted the gaze shift sequence (GSS) to emulate how humans sequentially engage with multiple areas of interest in an image. In practice, once the GSS is established and assuming it comprises multiple salient object patches, we first compute a 128-dimensional feature vector for each salient patch with a standard CNN, providing a representative descriptor of the patch's attributes; these 128-dimensional vectors jointly capture each salient patch's color, texture, and spatial layout. Herein, the Patch-CNN is an MLP fine-tuned on our million-scale breast cancer images. These descriptors are then combined to form a comprehensive (128A)-dimensional vector, encapsulating the essence of the full GSS with A object patches.
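As a minimal sketch, the aggregation reduces to concatenating the per-patch descriptors in gaze order; `patch_cnn` below is a stand-in for the fine-tuned Patch-CNN, not our released implementation.

```python
# Sketch: forming the (128*A)-dimensional GSS descriptor from the A most
# salient patches, ordered from most to least conspicuous.
import numpy as np

def gss_feature(ranked_patches, patch_cnn, A=5):
    descriptors = [patch_cnn(p) for p in ranked_patches[:A]]  # A vectors, 128-D each
    return np.concatenate(descriptors)                        # shape: (128 * A,)
```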

Fig. 2

The figure illustrates the Gaze Shift Sequence (GSS) construction process, where top-ranking object patches are linked to reflect human gaze behavior. A 128-dimensional feature vector for each salient patch is generated using Patch-CNN, which is then combined to form a comprehensive vector representing the full GSS.

Once the Gaze Shift Sequence (GSS) is established and its deep representations computed, the next step in our approach involves leveraging these features for cancer stage classification. By utilizing the GSS features, we construct a multiple-category Support Vector Machine (SVM) model to categorize breast cancer images into distinct stages, ensuring a reliable and efficient classification process.

Categorizing cancer into multiple classes

Leveraging the detailed GSS features described above, we construct a multi-category SVM for categorizing breast cancers into distinct stages. Specifically, for breast images associated with the i-th and j-th stages of breast cancer, the binary SVM optimization is established as follows:

$$\begin{aligned} & \max _{\omega \in R^M} Z(\omega ) = \sum \nolimits _{i=1}^M \omega _i - \frac{1}{2} \sum \nolimits _{i=1}^M \sum \nolimits _{j=1}^M \omega _i \omega _j a_i a_j \theta (F_i, F_j),\nonumber \\ & s.t.~~0 \le \omega _i \le H~~ and~~\sum \nolimits _{i=1}^M \omega _i a_i = 0, \end{aligned}$$
(4)

Herein, \(F_i\) represents the deeply-learned GSS representation of each training breast image, \(a_i\) is the category label of the corresponding image, \(H\) is the upper bound (soft-margin parameter) that lets the SVM manage cases where classes are not easily separable, and \(M\) is the total count of images across both stages. The detailed SVM optimization is shown in Table 1.

Table 1 Detailed Steps in the SVM Optimization Process for Breast Cancer Classification.

After the SVM classifier has been trained, the cancer stage of a new test image is determined using its deep GSS feature \(F^*\) as follows:

$$\begin{aligned} \text {sgn}(\sum _{i=1}^M \omega _i a_i \theta (F_i, F^*) + Q), \end{aligned}$$
(5)

In this process, we calculate the offset \(Q\) via \(Q = 1 - \sum _{i=1}^M \omega _i a_i \theta (F_i, F^s)\), where \(F^s\) is a support vector derived from a breast image at the i-th stage. In the feature categorization phase, for \(B\) different stages of cancer, the method conducts \(B(B-1)/2\) binary classifications. The outcomes are integrated through a voting mechanism: each binary classifier contributes a vote, and the final cancer stage is the one that accumulates the majority of votes. A summary of our proposed breast cancer recognition pipeline is presented in Table 2.
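For reference, scikit-learn's SVC reproduces this one-vs-one scheme: it trains the \(B(B-1)/2\) pairwise classifiers internally and predicts by majority vote. The sketch below uses toy random features; the RBF kernel and \(C\) value are assumptions.

```python
# Sketch: one-vs-one multi-stage classification with majority voting,
# mirroring Eqs. (4)-(5) on toy GSS features.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
F = rng.normal(size=(300, 128 * 5))           # GSS features (A = 5 patches)
stages = rng.integers(0, 4, size=300)         # B = 4 cancer stages (toy labels)

clf = SVC(kernel="rbf", C=1.0, decision_function_shape="ovo")
clf.fit(F, stages)                            # fits all B(B-1)/2 pairwise SVMs
predicted_stage = clf.predict(F[:1])          # majority vote over pairwise outputs
```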

Table 2 Pipeline of Our Multi-layer Quality-guided Breast Cancer Classification.

Code availability

The Python code for our entire breast cancer identification pipeline has been submitted for review. The code will be fully accessible after the publication of this article.

Empirical comparison and evaluation

In this section, we examine the effectiveness of the proposed approach. We begin with an overview of the extensive breast image dataset we compiled, then compare the proposed technique against a range of established deep and flat recognition models, and finally analyze the computational efficiency of the cancer recognition pipeline.

Dataset description

As far as we know, no fully-annotated breast cancer dataset is publicly available. Our project required significant effort to collect approximately 2.73 million breast X-ray images from 125 hospitals across our country. These X-rays, particularly from urban hospital centers, show greater clarity and detail and are properly white-balanced. We designed a specialized crawler to automate the downloading and cropping of images from these hospitals. Unlike the crawling process, the sorting and annotating process was semi-automated. This process involved recording data such as age, education, and occupation for each image. Example images are shown in Fig. 3.

Fig. 3

Example breast cancer pictures from our dataset.

Comparative study

We first tested our model against traditional, shallow recognition models such as Fixed Length Walk and Tree Kernel (FWK26 and FTK), Multi-Resolution Feature Histogram (MRH)29, and Standard Pyramid Matching Kernel (SPMK)27 with its variants SP-LC, SP-SC, and SPM-OB. These models were fine-tuned to optimize performance, with kernels adjusted from one to fifteen in length and processed using a Gaussian kernel with a sigma of 1 across 15 grey levels. For SPM-based methods, training involved processing images to extract ten million SIFT points, arranged in a 20x20 pixel grid with a 10-pixel spacing. An 800-code entry book was created via hierarchical clustering.

We also compared our method with cutting-edge deep learning models like CNN-SPM31, CleNet7, and other transferable, semantically-enhanced models. The availability of source code for most of these models allowed us to implement them directly, using default settings. For one model, we selected between 256 and 1024 region proposals generated by MCG32. Other models involved training sub-architectures on the CIFAR-10 dataset8, followed by a controller RNN trained to manage multiple RNN-controlled networks. We also compared our method with fine-grained visual classifiers (FGVC)6 and self-knowledge distillation (SKD)10.

Our approach uniquely combines high-definition and corresponding low-definition breast cancer images. Low-definition images are resized to match the high-definition ones for comprehensive analysis. This dual-image processing method surpasses traditional single-image approaches by utilizing a broader data range for improved diagnostic performance. We also reviewed recent scene categorization efforts33,34,41, emphasizing the dynamic and competitive nature of contemporary scene classification techniques. Additionally, we included the five visual recognition models42,43,44,45,46 for comparison. The results in Table 3 reflect the effectiveness of our method.

Table 3 Comparative Categorization Accuracy With Multiple Deep/Flat Models. The numbers represent the categorization accuracy of each model in identifying different stages of cancer; higher values indicate better performance. The best results are bolded.


Parameter analysis

Our model uses two main sets of adjustable parameters: three scaling weights (\(\alpha , \beta , \phi\)) and the number of object patches per Gaze Shifting Sequence (GSS), denoted as \(A\). We applied cross-validation to establish baseline settings for these parameters. Initially, we set \(\alpha = \beta = \phi = 0.1\) and \(A = 5\).

We first adjusted \(\alpha , \beta , \phi\) from 0 to 1 in increments of 0.05, keeping the other parameters fixed. As Fig. 4 shows, classification performance increased as the parameter values grew, peaked, and then declined, suggesting a critical range for these parameters; values outside this range can impair system performance.
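The sweep itself is straightforward; a sketch follows, with a toy surrogate standing in for the cross-validated accuracy we actually measure.

```python
# Sketch: one-at-a-time grid sweep over a scaling weight (repeat for
# beta and phi). `evaluate` is a stub for cross-validated accuracy.
import numpy as np

def evaluate(alpha, beta, phi):
    """Placeholder for cross-validated classification accuracy."""
    return -((alpha - 0.3) ** 2)              # toy surrogate peaking at 0.3

grid = np.round(np.arange(0.0, 1.0001, 0.05), 2)
best_alpha = max(grid, key=lambda a: evaluate(a, beta=0.1, phi=0.1))
print(best_alpha)
```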

Fig. 4

Performance by varying \(\alpha , \beta , \phi\).

Fig. 5

Performance by varying A.

Next, we explored variations in \(A\), adjusting it from 1 to 10. The results are shown in Fig. 5. Accuracy increased as \(A\) rose from 1 to 5. Beyond this, improvements stabilized. This indicates that \(A = 5\) captures the complexities of our dataset effectively.

Based on these results, we set \(A = 5\) for our GSS configuration. This setting optimizes the system’s operational efficiency and diagnostic accuracy. It ensures that our model is competitive, especially for large-scale imaging applications, enhancing breast cancer detection.

Scene classification

We evaluated our method in the context of scene categorization by comparing its performance with several established shallow classification techniques: 1) FWK and FTK26; 2) MRH29; 3) SPM and its variants LLC-SPM27, SC-SPM35, and OB-SPM36; and 4) SVC37 and SSC38, which provide advanced image representation. We standardized the configurations of each algorithm; experimental setups of all baseline methods follow48.

As shown in Table 4 and Table 5, BING-guided patches consistently outperformed superpixels in terms of their descriptiveness and contribution to classification accuracy. The performance metrics derived from these experiments demonstrate that our method not only competes favorably with existing techniques but also surpasses many in terms of both robustness and accuracy in categorizing complex scenes. This highlights the effectiveness of integrating various feature representations and selecting the most informative regions for classification tasks.

Table 4 Comparative Classification Performance Across Standard Benchmarks (Mean Accuracy)
Table 5 Performance Variations Across Benchmark Datasets and Methods (Standard Deviation Metrics)

Upon reviewing the data in Table 4 and Table 5, we conducted a statistical analysis comparing our method with both deep and traditional (flat) visual recognition models. To ensure reliability, each model was evaluated across 20 iterations, and we report standard deviations to capture the variability of each approach's performance. The analysis shows that our method excels in both classification accuracy and stability, consistently outperforming the alternatives across datasets with minimal fluctuation across runs. Notably, on our collected MSEI, our method achieved a categorization precision exceeding its nearest competitor by more than 8%. This margin highlights the method's effectiveness on complex scene classification tasks where fine-grained accuracy is crucial, and its combination of high accuracy and stability underscores its potential for large-scale visual recognition applications requiring precise, consistent performance.

Extension to ball sport classification

To further validate the generality of our weakly-supervised ranking framework, we conduct experiments on ball sport image classification, where identifying salient object patches (e.g., balls, players, nets) is critical for accurate recognition. As shown in Table 7, we adapt our method using sport-specific weak labels and feature configurations.

The results in Table 8 reveal three key findings:

  1. GSS Superiority: Our method outperforms ResNet-50 by 7.6% accuracy, demonstrating that explicit patch ranking (via GSS) captures finer-grained discriminative features than end-to-end deep learning.

  2. Dynamic Context Modeling: The 4.4% improvement over BING+kNN (Table 8, row 2) confirms that temporal gaze sequencing (ball\(\rightarrow\)player\(\rightarrow\)net) better represents sport dynamics than static patch similarity.

  3. Patch Saliency Hierarchy: Table 9 quantitatively validates human intuition: balls consistently rank highest (0.92 saliency), followed by key players (0.85), establishing a measurable attention hierarchy.

The framework’s effectiveness stems from:

  • Cross-Domain Weak Labels: Player/ball detection (Table 7, row 2) provides free supervision, reducing reliance on manual annotations.

  • Manifold-Aware Ranking: The \(l_{21}\) norm in Eq. (3) eliminates noisy patches (e.g., audience), focusing computation on relevant regions.

  • SVM-Kernel Synergy: The RBF kernel (\(\sigma =0.1\)) in Table 7 optimally separates GSS features by exploiting non-linear patch relationships.

Table 6 Computational Efficiency Comparison (Seconds/Image).
Table 7 Dataset and Configuration for Ball Sport Classification.
Table 8 Classification Accuracy on Sports-10 Dataset.
Table 9 Patch Ranking Statistics in Sports Classification.

Table 6 reveals our method’s practical advantage: despite added functionality, it runs 13% faster than ResNet-50 due to:

  • Selective patch processing (only top 20% ranked patches undergo SVM classification)

  • Parallelizable GSS construction (Eq. (4) optimizations)

Two observed challenges motivate future research:

  • Small Ball Detection: Suboptimal ranking for balls <20 pixels (failed in 12% of cases)

  • Occlusion Handling: Player-ball overlaps reduce saliency scores by \(\sim\)15% (Table 9, rank 1)

Potential solutions include:

  • Multi-scale BING proposals

  • 3D gaze modeling for occluded objects

This experiment demonstrates our framework’s adaptability beyond medical imaging while revealing universal principles for object-aware attention modeling.

Conclusions

This research presents an innovative perceptual deep learning framework designed to derive complex visual representations from X-ray breast images, integrating elements of human scene perception. We developed a distinctive framework that constructs gaze shift sequences (GSS) with a weakly-supervised ranking algorithm, analyzes each GSS in depth to gather features, and uses these features to train a multi-category SVM that classifies cancer images into different stages. Extensive evaluations on the large breast image dataset we assembled validate the efficacy of our recognition system, demonstrating its robustness and accuracy.