Introduction

Breast cancer poses a significant global health threat. It affects millions of women annually and contributes to high mortality rates. Developing robust image recognition technologies to identify abnormal breast tissue regions is crucial. Early detection of these anomalies can improve treatment outcomes by enabling timely intervention. Delayed detection, however, can worsen the disease and complicate treatment strategies. Despite advancements in computer vision and artificial intelligence, fully automating breast cancer detection from X-ray images presents several challenges:

  1. High-resolution X-ray images contain critical visual and semantic details, including both normal tissue structures and potential abnormalities. Effectively extracting and analyzing these details to replicate the human gaze sequence is computationally complex. A metric capable of pinpointing abnormal regions, mirroring human visual perception, is needed.

  2. Deep learning models show promise in object representation but often overlook the nuanced sequences and geometries of gaze movements essential for precise breast cancer diagnosis. Developing a model that integrates these aspects into a unified visual recognition framework remains challenging.

  3. Breast cancer images include various attributes such as cancer stage, tissue types, and patient data, which are crucial for human analysis. Integrating these subtle cues into gaze behavior models, while capturing the spatial relationships of regional features, is essential for identifying meaningful object patches. The fusion of these elements into an actionable gaze behavior model is difficult.

Fig. 1

The workflow for the proposed rank-based framework to capture the gaze shift sequence (GSS) from large-scale breast cancer images. The process begins by extracting potential abnormalities using BING to generate object patches. A weakly-supervised ranking algorithm selects prominent patches, which are then linked to form the GSS. These are processed through a hierarchical aggregation network and classified using an SVM to identify different types of breast cancer.

To address these challenges, we introduce a rank-based framework designed to emulate the human gaze shift sequence (GSS), as shown in Fig. 1. This algorithm forms the foundation of a framework tailored to capture the GSS from large-scale breast images. Our approach uses BING (Binarized Normed Gradients)1 to extract potential abnormalities and generate object patches from each image. Given the large number of patches and computational constraints, analyzing each one individually is impractical. We therefore propose a weakly-supervised ranking algorithm that identifies prominent patches by combining multiple subtle features. The selected patches are linked sequentially to form the GSS. A hierarchical aggregation network then creates a multi-layer hierarchical representation for each GSS. Using this representation, an SVM is trained to classify different types of breast cancer in the images.

Our primary innovations are as follows: 1) Automated simulation of human visual perception and cognition to enhance breast cancer detection; 2) A weak-labeled learning framework to select salient image regions that mirror human gaze patterns; 3) The establishment of an expansive, regularly updated breast cancer image dataset for comprehensive testing and validation of our methodology.

Review of prior work

Our study intersects with two crucial domains in image understanding: multi-layer image modeling using deep learning and image quality-related feature learning. This review highlights key contributions within these areas.

Deep Convolutional Neural Networks (CNNs) are key to developing visual recognition models for extensive image datasets like ImageNet2. Krizhevsky et al.3 demonstrated excellent image classification accuracy using a subset of ImageNet4. The selective search technique11 refines exhaustive search with semantic segmentation to generate targeted image regions, enhancing category-specific detection. R-CNN, introduced by Girshick et al.11, uses an intelligent mechanism to sample image regions and boost object detection efficiency. Zhou et al.12 and Wu et al. optimized high-quality training sample selection and preprocessing to improve CNN-based models. He et al.15 developed ResNet, which uses residual learning to train deeper networks. Wu et al.'s BlockDrop selectively activates network layers during inference to save computational resources without affecting performance. In aerial image analysis, semantic models have been developed for both image-level and region-level tasks. Zhang et al.16 used graphlets to geometrically represent aerial photos, improving categorization with a discriminative kernel machine. Xia et al.17 and Akar et al.18 improved aerial image categorization with weakly-supervised methods and object-level features. Specialized CNN models have been applied to high-resolution urban analysis by Sameen et al. and to aircraft localization in aerial photos9. Wang et al.17 and Yang et al. focused on visual attention and skip connections for more effective aerial photo analysis. While effective, these methods struggle to balance efficiency with model complexity, which limits their scalability and real-time applicability on large-scale datasets. Moreover, they deeply encode the global image or multiple patches within each image but cannot select the visually or semantically salient patches for deeper analysis.

In image quality assessment, deep learning architectures have made significant advancements. Karayev et al.3 showed that deep features linked to object-level labels can rival traditional quality metrics. Lu et al.18 introduced a dual-column architecture for multi-source inputs and a deep model to assess visual quality, using randomly selected image regions during subnet training. Tang et al.19 developed a semi-supervised method, trained on unlabeled images and fine-tuned with human-rated images. Mai et al.22 focused on preserving global composition by learning quality features directly from images at their original sizes. Kao et al.20 and Ma et al.21 expanded visual quality assessment with multi-task models that evaluate style and quality, using a layout-aware strategy to handle various image sizes and thereby bridging visual recognition and quality evaluation. However, these methods face challenges: Karayev et al. show limited generalization for subjective quality; Lu et al. neglect global coherence; Mai et al. face increased computational demands; and Kao et al. and Ma et al. bridge style and quality evaluation but struggle with optimization and diverse image distributions.

Recent works explore machine learning and transformer-based models to improve visual recognition. Yan et al.42 introduced a method for learning compact and informative attributes for visual recognition, with an emphasis on semantic understanding. Han et al.43 introduced the Dynamic Perceiver, which adapts to different inputs for efficient visual recognition. Hou et al.44 presented Conv2Former, a transformer-style convolutional network that blends convolutional networks and transformers for better performance. Jiao et al.45 introduced DilateFormer, a multi-scale dilated transformer with better receptive-field coverage. On the application side, Botlagunta et al.46 addressed predicting breast cancer metastasis using machine learning on clinical data to aid diagnosis. These works highlight the evolution of visual recognition, focusing on efficiency, interpretability, and healthcare applications. However, models like Conv2Former and DilateFormer may increase computational complexity, limiting real-time applicability, and Botlagunta et al.'s approach faces challenges in generalizing across diverse clinical datasets due to medical data variability. Umirzakova et al.14 introduced a deeply-learned residual feature that enhances the super-resolution of medical images.

Our method

Ranking through weak-label quality modeling

Every breast image contains a multitude of detailed object parts that collectively indicate the presence and stage of cancer. To discern these elements, we employ the Binarized Normed Gradients (BING)13 algorithm to generate object proposals. In the context of breast cancer detection, BING offers several key advantages when detecting object-aware patches for analysis. First, its efficiency accelerates detection, enabling quick processing of large volumes of medical images, such as mammograms, to identify potential regions of interest without compromising accuracy; this allows faster diagnoses and real-time feedback, which is invaluable in clinical settings. Second, BING produces a concise, non-redundant set of object proposals focused on the regions most likely to exhibit malignancy, improving computational efficiency and enhancing downstream detection accuracy. Third, BING generalizes robustly across object categories, so it can handle the variety of tumor sizes, shapes, and stages in breast cancer images, from early-stage tumors to advanced malignancies. Finally, by aligning with the gaze shift sequence (GSS), BING mimics human visual attention, prioritizing crucial areas of the image much as a radiologist examines specific regions first. This human-like attention mechanism, coupled with BING's speed and generalization, enhances precision in detecting breast cancer across stages, improving diagnostic accuracy and patient outcomes.
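To make this step concrete, the sketch below generates BING proposals with the ObjectnessBING implementation shipped in opencv-contrib-python; the image path, the pretrained-model directory, and the cap of 500 proposals are illustrative assumptions rather than our released configuration.

```python
# Sketch: BING object proposals for a breast X-ray via OpenCV's contrib
# saliency module (pip install opencv-contrib-python). Paths are placeholders.
import cv2

image = cv2.imread("mammogram.png")
bing = cv2.saliency.ObjectnessBING_create()
bing.setTrainingPath("bing_model")            # directory holding pretrained BING weights

ok, boxes = bing.computeSaliency(image)       # boxes: N x 1 x 4, (minX, minY, maxX, maxY)
if ok:
    # BING returns proposals roughly ordered by objectness; keep the top ones.
    patches = []
    for box in boxes[:500].reshape(-1, 4):
        x1, y1, x2, y2 = box.astype(int)
        patches.append(image[y1:y2, x1:x2])   # candidate object-aware patch
```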

Weakly-Supervised Ranking: In a typical breast image, hundreds of object patches can be identified. However, mirroring how humans perceive a scene, only a succinct set of patches, those that are visually or semantically prominent, actually draws human focus, while the rest receive minimal attention. To exploit this property, we introduce a weak-label ranking technique that encodes: 1) weakly-supervised labels integrated from other domains, such as aerial imagery, and 2) the natural distribution of object patches on a manifold.

For each breast image, several weak attributes are identified, including patient age, gender, and profession. Given \(K\) distinct attribute types, we construct a \(K\)-dimensional feature vector to represent these attributes. Each attribute is assigned a value between 0 and 1, reflecting its intensity or prevalence. This feature vector is then integrated into the optimization objective of our ranking model, influencing the prioritization of object patches in each breast image. The objective is formulated as follows:

$$\begin{aligned} \epsilon ({\textbf {D}}, {\textbf {S}}) = \alpha \Vert {\textbf {D}}\Vert _{21} + \beta \sum _i \psi (y_i, k_i) = \alpha \Vert {\textbf {D}}\Vert _{21} + \beta \Vert {\textbf {D}}^T {\textbf {X}} - {\textbf {S}}\Vert , \end{aligned}$$
(1)

This formulation lays the groundwork for the model, in which a transformation matrix \({\textbf {D}}\) of dimensions \(U \times V\) is optimized across all object patches. This matrix linearly transforms each object patch's feature vector into a corresponding ranking score. Here, \(x_i\) denotes the deep/shallow feature of each patch and \(s_i\) the calculated ranking score. We let \(M\) denote the number of patches at the training stage and \(R\) the corresponding feature dimension. Matrix \({\textbf {X}}\) collects the feature vectors \(x_i\), and matrix \({\textbf {S}}\) collects the score vectors \(s^i\); that is, \(s_c^i = 1\) indicates that the \(i\)-th patch is ranked at the \(c\)-th level, and zero otherwise. The regularizer \(\alpha \Vert {\textbf {D}}\Vert _{21}\) promotes row sparsity in \({\textbf {D}}\) through the \(l_{21}\) norm, reducing redundancy and suppressing noise. The term \(\beta \Vert {\textbf {D}}^T {\textbf {X}} - {\textbf {S}}\Vert\) serves as a loss that enhances ranking accuracy.

The function \(\psi (y_i, k_i)\) represents a ranking consistency penalty that quantifies the discrepancy between the predicted ranking and the expected ranking derived from weak labels. Specifically, \(y_i\) denotes the weakly supervised label assigned to the \(i\)-th patch, which may be obtained from auxiliary datasets or domain knowledge, while \(k_i\) represents the assigned ranking level for the \(i\)-th patch within the given image. This term enforces consistency between the weak labels and the ranking output, ensuring that patches with stronger indications of malignancy receive higher ranking scores.
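For illustration, a minimal NumPy sketch of evaluating Eq. (1) follows, with random data standing in for patch features; the Frobenius norm for the loss term and the one-hot column layout of \(\textbf{S}\) are assumptions where the text leaves details open.

```python
# Sketch: evaluating the ranking objective in Eq. (1).
# X holds one feature column per patch; S holds the matching score columns.
import numpy as np

def l21_norm(D):
    """l_{2,1} norm: sum of the l2 norms of the rows of D (row sparsity)."""
    return np.sum(np.linalg.norm(D, axis=1))

def ranking_objective(D, X, S, alpha=0.1, beta=0.1):
    loss = np.linalg.norm(D.T @ X - S)        # ranking loss ||D^T X - S||
    return alpha * l21_norm(D) + beta * loss

rng = np.random.default_rng(0)
U, V, M = 64, 4, 200                          # feature dim, ranking levels, patches
D = rng.normal(size=(U, V))                   # transformation matrix
X = rng.normal(size=(U, M))                   # patch features
S = np.eye(V)[rng.integers(V, size=M)].T      # one-hot ranking levels, V x M
print(ranking_objective(D, X, S))
```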

Object patches within the same breast image frequently overlap, indicating strong correlations in feature space. Drawing on manifold learning, we exploit the relationships among spatially neighboring object patches in our ranking paradigm. The approach integrates the manifold's geometric structure into the ranking methodology through two primary steps:

Step 1: We build a \(T \times T\) similarity graph over the \(T\) patches obtained from each breast image by the BING algorithm. Rather than constructing a traditional \(k\)-nearest neighbor (kNN) graph, which demands \(O(T^2)\) comparisons, we connect only those patches that share regions. This strategy has a further advantage: it characterizes each object patch in relation to its neighbors, implicitly preserving the global spatial configuration.

Step 2: Our method for assembling the kNN graph is markedly more efficient than traditional approaches because it avoids exhaustive comparisons among all possible object patches. Denoting by \(P\) the number of spatially adjacent object patches, constructing the kNN graph costs \(O(PT)\), significantly less than the \(O(PT^2)\) required by conventional methods.
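The sketch below illustrates the connectivity rule under assumed \((x_1, y_1, x_2, y_2)\) bounding boxes and a Gaussian bandwidth of 1; it scans patch pairs for clarity, whereas a spatial index over the \(P\) adjacent patches would realize the \(O(PT)\) cost stated above.

```python
# Sketch: sparse similarity graph linking only patches whose bounding
# boxes overlap, instead of comparing all patch pairs by appearance.
import numpy as np

def overlap(a, b):
    """Intersection area of two (x1, y1, x2, y2) boxes; > 0 means shared region."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def similarity_graph(boxes, features, sigma=1.0):
    T = len(boxes)
    N = np.zeros((T, T))
    for i in range(T):
        for j in range(i + 1, T):
            if overlap(boxes[i], boxes[j]) > 0:        # spatial neighbors only
                d2 = np.sum((features[i] - features[j]) ** 2)
                N[i, j] = N[j, i] = np.exp(-d2 / (2 * sigma**2))
    return N
```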

This kNN graph forms the basis of our manifold ranking algorithm, which is further described as follows: The regularization function \(\Omega (\textbf{S})\) is expressed as:

$$\begin{aligned} \Omega (\textbf{S}) & = \sum _{j,k}^T \left( \textbf{N}_{jk} \left\| \frac{1}{q_{jj}} s_j - \frac{1}{p_{kk}} s_k \right\| ^2 \right) + \phi \sum _i \left\| s_i - b_1 \right\| \nonumber \\ & = \text {tr}(\textbf{S}^T\textbf{KS}) + \phi \left\| \textbf{S} - \textbf{B} \right\| , \end{aligned}$$
(2)

In this formulation, \(\textbf{N}\) is the Gaussian kernel matrix measuring the closeness of each patch to its adjacent patches, and \(\textbf{K}\) is the symmetrically normalized Laplacian matrix47 defined by \(\textbf{N}\), that is, \(\textbf{K} = \textbf{M}^{-1/2} (\textbf{M}-\textbf{N}) \textbf{M}^{-1/2}\), where \(\textbf{M}\) is the diagonal degree matrix with \(\textbf{M}_{ii}=\sum \nolimits _j \textbf{N}_{ij}\). Matrix \(\textbf{B}\) initializes our ranking process and is set to the identity matrix in our implementation. The weight \(\phi\) controls the importance of the second term. \(q_{ii}\) is a normalization factor that scales the ranking score \(s_i\) of the \(i\)-th patch and is computed from the feature vector associated with the patch. Similarly, \(p_{ii}\) is a normalization factor related to the distribution of rankings across all patches, adjusting for the global ranking structure. \(b_1\) is a baseline value for the ranking scores: it constrains the rankings, preventing the scores from drifting too far, and can be viewed as a reference or central tendency around which the rankings revolve. It is set as a constant or as a summary statistic of the ranking scores. In our implementation, \(q_{ii}\), \(p_{ii}\), and \(b_1\) are calculated as follows.

  • \(q_{ii} = \Vert x_i \Vert _2^2\): The squared Euclidean norm of the feature vector \(x_i\) for patch \(i\).

  • \(p_{ii} = \sum _j \textbf{N}_{ij}\): The sum of kernel matrix values \(\textbf{N}_{ij}\), which represents the total similarity between patch \(i\) and all other patches.

  • \(b_1 = \frac{1}{M} \sum _{i=1}^M s_i\): The mean of the ranking scores \(s_i\) over all patches (or another constant value if desired).
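In code, these quantities and the Laplacian \(\textbf{K}\) can be assembled as follows; a minimal NumPy sketch, where the small clipping constant guarding the inverse square root is a numerical-stability assumption.

```python
# Sketch: the normalization terms and the symmetrically normalized graph
# Laplacian K, built from the kernel matrix N, patch features X (one
# column per patch), and the current ranking scores.
import numpy as np

def laplacian_and_norms(N, X, scores):
    q = np.sum(X**2, axis=0)                   # q_ii = ||x_i||_2^2
    p = np.sum(N, axis=1)                      # p_ii = sum_j N_ij (total similarity)
    M = np.diag(p)                             # degree matrix, M_ii = sum_j N_ij
    M_inv_sqrt = np.diag(1.0 / np.sqrt(np.clip(p, 1e-12, None)))
    K = M_inv_sqrt @ (M - N) @ M_inv_sqrt      # K = M^{-1/2} (M - N) M^{-1/2}
    b1 = np.mean(scores)                       # baseline: mean ranking score
    return K, q, p, b1
```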

The regularizer in Eq. (2) serves two major purposes: local ranking consistency (the first term) and global stability (the second term). The first term ensures that similar patches receive similar ranking scores, penalizing score differences according to the similarity structure encoded in the kNN graph; through the Laplacian matrix \(\textbf{K}\), it enforces a smooth ranking function aligned with the intrinsic data manifold. The second term keeps ranking scores from drifting far from an expected distribution, preventing extreme fluctuations, reducing overfitting, and enhancing robustness in ranking predictions.

The practical and theoretical motivations are as follows. Preserving local ranking consistency is widely used in semi-supervised learning to maintain the intrinsic structure of the data, and it improves generalization by ensuring that similar data points (image patches) keep close ranking values. The regularization also improves the stability of the ranking model, making it more robust against noise and variations in feature representations, and it enhances ranking accuracy by preserving meaningful local relationships.

By incorporating the regularization term into our existing objective functions, we establish a detailed optimization function in our designed ranking model:

$$\begin{aligned} & \min _{\textbf{D}} \left( \alpha \Vert \textbf{D}\Vert _{21} + \beta \Vert \textbf{D}^T \textbf{X} - \textbf{S}\Vert + \text {tr}(\textbf{S}^T \textbf{KS}) + \phi \Vert \textbf{S} -\textbf{B}\Vert \right) , \nonumber \\ & \text {s.t.,}~~\textbf{S}\in \{0,1\}^{U\times V}, ~\textbf{S}^T \textbf{S} = \textbf{I}, \end{aligned}$$
(3)

where the constraint \(\textbf{S}^T \textbf{S} = \textbf{I}\) ensures the orthogonality of the matrix \(\textbf{S}\). The constraint that \(\textbf{S}\) is binary (i.e., \(s_c^i = 1\) if the \(i\)-th patch belongs to the \(c\)-th ranking level, otherwise 0) is enforced during training through a combination of initialization, optimization constraints, and thresholding. First, during initialization, \(\textbf{S}\) is constructed from weakly supervised labels or ranking priors, ensuring that each patch is assigned a discrete ranking level. Throughout training, the objective minimizes the discrepancy between the transformed feature representations \(\textbf{D}^T \textbf{X}\) and \(\textbf{S}\), guiding updates to the transformation matrix \(\textbf{D}\) while maintaining the binary nature of \(\textbf{S}\). To preserve this binary structure, projected gradient descent or a similar constraint-aware optimization method can be applied, where the updated values of \(\textbf{S}\) are discretized after each iteration by thresholding: the highest-ranked entry in each row is set to 1 and the others to 0. This keeps \(\textbf{S}\) a valid binary matrix throughout training while still allowing optimization-driven refinements based on ranking consistency and feature transformation.
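A minimal sketch of the re-discretization step follows, with patches as rows and ranking levels as columns, matching the row-wise thresholding described above.

```python
# Sketch: projecting the continuous score matrix back onto valid binary
# assignments after a gradient step (one-hot per row).
import numpy as np

def project_to_binary(S_cont):
    """Keep the highest-scoring ranking level in each row; zero the rest."""
    S_bin = np.zeros_like(S_cont)
    S_bin[np.arange(S_cont.shape[0]), np.argmax(S_cont, axis=1)] = 1.0
    return S_bin

S = project_to_binary(np.random.default_rng(0).normal(size=(6, 3)))  # 6 patches, 3 levels
```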

Given the complexity of equation (3), an analytical solution is infeasible. Instead, we deploy the concave-convex procedure (CCCP) to iteratively compute the parameters, navigating the optimization process effectively.
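To convey the flavor of the procedure, here is a toy one-dimensional CCCP iteration on \(f(x) = x^4 - (x-2)^2\), a convex term plus a concave term; it is unrelated to our actual objective and only illustrates the linearize-then-minimize loop.

```python
# Toy sketch of the concave-convex procedure (CCCP): each iteration
# linearizes the concave part of f at the current point and minimizes
# the resulting convex surrogate in closed form.
import numpy as np

def cccp(x, iters=100):
    for _ in range(iters):
        g = -2.0 * (x - 2.0)       # gradient of the concave part -(x - 2)^2
        x = np.cbrt(-g / 4.0)      # argmin of x^4 + g*x, i.e. solve 4x^3 + g = 0
    return x

print(cccp(5.0))                   # converges to a stationary point of f
```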

Having established the ranking framework for object patches in breast images, we now turn our attention to leveraging this ranked information to model how human attention shifts through the most salient regions of an image. This leads us to the construction of the Gaze Shifting Sequence (GSS), where the identified top-ranked patches are used to generate deep representations that emulate human visual behavior.

Calculating deep representation for GSS

Our newly devised algorithm, built on the weakly-supervised ranking framework, reveals a compelling pattern: the highest-ranked object patches, notable for their high quality, exhibit both visual and semantic prominence that naturally captures the gaze. This phenomenon is illustrated in Fig. 2. Typically, the human gaze is drawn first to the most conspicuous object patch before migrating to the next in order of salience. Leveraging this behavioral insight, we crafted the gaze shift sequence (GSS) to emulate how humans sequentially engage with multiple areas of interest in an image. In practice, once the GSS is established and assuming it comprises multiple salient object patches, we first compute a 128-dimensional feature vector for each salient patch with a standard CNN, providing a representative descriptor of the patch's attributes; these 128-dimensional vectors jointly capture each salient patch's color, texture, and spatial layout. Herein, the Patch-CNN is an MLP fine-tuned on our million-scale breast cancer images. These descriptors are then combined to form a comprehensive (128A)-dimensional vector, encapsulating the essence of the full GSS with A object patches.
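As a minimal sketch, the aggregation reduces to concatenating the per-patch descriptors in gaze order; `patch_cnn` below is a stand-in for the fine-tuned Patch-CNN, not our released implementation.

```python
# Sketch: forming the (128*A)-dimensional GSS descriptor from the A most
# salient patches, ordered from most to least conspicuous.
import numpy as np

def gss_feature(ranked_patches, patch_cnn, A=5):
    descriptors = [patch_cnn(p) for p in ranked_patches[:A]]  # A vectors, 128-D each
    return np.concatenate(descriptors)                        # shape: (128 * A,)
```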

Fig. 2

The figure illustrates the Gaze Shift Sequence (GSS) construction process, where top-ranking object patches are linked to reflect human gaze behavior. A 128-dimensional feature vector for each salient patch is generated using Patch-CNN, which is then combined to form a comprehensive vector representing the full GSS.

Once the Gaze Shift Sequence (GSS) is established and its deep representations computed, the next step in our approach involves leveraging these features for cancer stage classification. By utilizing the GSS features, we construct a multiple-category Support Vector Machine (SVM) model to categorize breast cancer images into distinct stages, ensuring a reliable and efficient classification process.

Categorizing cancer into multiple classes

Leveraging the detailed GSS features described above, we construct a multi-category SVM for categorizing breast cancers into distinct stages. Specifically, for breast images associated with the i-th and j-th stages of breast cancer, the binary SVM optimization is established as follows:

$$\begin{aligned} & \max _{\omega \in R^M} Z(\omega ) = \sum \nolimits _{i=1}^M \omega _i - \frac{1}{2} \sum \nolimits _{i=1}^M \sum \nolimits _{j=1}^M \omega _i \omega _j a_i a_j \theta (F_i, F_j),\nonumber \\ & s.t.~~0 \le \omega _i \le H~~ and~~\sum \nolimits _{i=1}^M \omega _i a_i = 0, \end{aligned}$$
(4)

Herein, \(F_i\) represents the deeply-learned GSS representation of each training breast image, \(a_i\) is the category label of the corresponding image, \(H\) is the upper bound (soft-margin parameter) that lets the SVM manage cases where classes are not easily separable, and \(M\) is the total count of images across both stages. The detailed SVM optimization is shown in Table 1.

Table 1 Detailed Steps in the SVM Optimization Process for Breast Cancer Classification.

After the SVM classifier has been trained, the cancer stage of a new test image is determined using its deep GSS feature \(F^*\) as follows:

$$\begin{aligned} \text {sgn}(\sum _{i=1}^M \omega _i a_i \theta (F_i, F^*) + Q), \end{aligned}$$
(5)

In this process, we calculate the offset \(Q\) via \(Q = 1 - \sum _{i=1}^M \omega _i a_i \theta (F_i, F^s)\), where \(F^s\) is a support vector derived from a breast image at the i-th stage. In the feature categorization phase, for \(B\) different stages of cancer, the method conducts \(B(B-1)/2\) binary classifications. The outcomes are integrated through a voting mechanism: each binary classifier contributes a vote, and the final cancer stage is the one that accumulates the majority of votes. A summary of our proposed breast cancer recognition pipeline is presented in Table 2.
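For reference, scikit-learn's SVC reproduces this one-vs-one scheme: it trains the \(B(B-1)/2\) pairwise classifiers internally and predicts by majority vote. The sketch below uses toy random features; the RBF kernel and \(C\) value are assumptions.

```python
# Sketch: one-vs-one multi-stage classification with majority voting,
# mirroring Eqs. (4)-(5) on toy GSS features.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
F = rng.normal(size=(300, 128 * 5))           # GSS features (A = 5 patches)
stages = rng.integers(0, 4, size=300)         # B = 4 cancer stages (toy labels)

clf = SVC(kernel="rbf", C=1.0, decision_function_shape="ovo")
clf.fit(F, stages)                            # fits all B(B-1)/2 pairwise SVMs
predicted_stage = clf.predict(F[:1])          # majority vote over pairwise outputs
```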

Table 2 Pipeline of Our Multi-layer Quality-guided Breast Cancer Classification.

Code availability

The Python code for our entire breast cancer identification pipeline has been submitted for review. The code will be fully accessible after the publication of this article.

Empirical comparison and evaluation

In this section, we examine the effectiveness of the proposed approach. We begin with an overview of the extensive breast image dataset we compiled, then compare the proposed technique against a range of established deep and flat recognition models, and finally analyze the computational efficiency of the cancer recognition pipeline.

Dataset description

As far as we know, no fully-annotated breast cancer dataset is publicly available. Our project required significant effort to collect approximately 2.73 million breast X-ray images from 125 hospitals across our country. These X-rays, particularly from urban hospital centers, show greater clarity and detail and are properly white-balanced. We designed a specialized crawler to automate the downloading and cropping of images from these hospitals. Unlike the crawling process, the sorting and annotating process was semi-automated. This process involved recording data such as age, education, and occupation for each image. Example images are shown in Fig. 3.

Fig. 3

Example breast cancer pictures from our dataset.

Comparative study

We first tested our model against traditional, shallow recognition models such as Fixed Length Walk and Tree Kernel (FWK26 and FTK), Multi-Resolution Feature Histogram (MRH)29, and Standard Pyramid Matching Kernel (SPMK)27 with its variants SP-LC, SP-SC, and SPM-OB. These models were fine-tuned to optimize performance, with kernels adjusted from one to fifteen in length and processed using a Gaussian kernel with a sigma of 1 across 15 grey levels. For SPM-based methods, training involved processing images to extract ten million SIFT points, arranged in a 20x20 pixel grid with a 10-pixel spacing. An 800-code entry book was created via hierarchical clustering.

We also compared our method with cutting-edge deep learning models like CNN-SPM31, CleNet7, and other transferable, semantically-enhanced models. The availability of source code for most of these models allowed us to implement them directly, using default settings. For one model, we selected between 256 and 1024 region proposals generated by MCG32. Other models involved training sub-architectures on the CIFAR-10 dataset8, followed by a controller RNN trained to manage multiple RNN-controlled networks. We also compared our method with fine-grained visual classifiers (FGVC)6 and self-knowledge distillation (SKD)10.

Our approach uniquely combines high-definition and corresponding low-definition breast cancer images. Low-definition images are resized to match the high-definition ones for comprehensive analysis. This dual-image processing method surpasses traditional single-image approaches by utilizing a broader data range for improved diagnostic performance. We also reviewed recent scene categorization efforts33,34,41, emphasizing the dynamic and competitive nature of contemporary scene classification techniques. Additionally, we included the five visual recognition models42,43,44,45,46 for comparison. The results in Table 3 reflect the effectiveness of our method.

Table 3 Comparative Categorization Accuracy With Multiple Deep/Flat Models. The numbers represent the categorization accuracy of each model in identifying different stages of cancer; higher values indicate better performance. The best results are bolded.


Parameter analysis

Our model uses two main sets of adjustable parameters: three scaling weights (\(\alpha , \beta , \phi\)) and the number of object patches per Gaze Shifting Sequence (GSS), denoted as \(A\). We applied cross-validation to establish baseline settings for these parameters. Initially, we set \(\alpha = \beta = \phi = 0.1\) and \(A = 5\).

We first adjusted \(\alpha , \beta , \phi\) from 0 to 1 in increments of 0.05, keeping the other parameters fixed. As Fig. 4 shows, classification performance increased as the parameter values grew, peaked, and then declined, suggesting a critical range for these parameters; values outside this range can impair system performance.
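The sweep itself is straightforward; a sketch follows, with a toy surrogate standing in for the cross-validated accuracy we actually measure.

```python
# Sketch: one-at-a-time grid sweep over a scaling weight (repeat for
# beta and phi). `evaluate` is a stub for cross-validated accuracy.
import numpy as np

def evaluate(alpha, beta, phi):
    """Placeholder for cross-validated classification accuracy."""
    return -((alpha - 0.3) ** 2)              # toy surrogate peaking at 0.3

grid = np.round(np.arange(0.0, 1.0001, 0.05), 2)
best_alpha = max(grid, key=lambda a: evaluate(a, beta=0.1, phi=0.1))
print(best_alpha)
```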

Fig. 4

Performance by varying \(\alpha , \beta , \phi\).

Fig. 5

Performance by varying A.

Next, we explored variations in \(A\), adjusting it from 1 to 10. The results are shown in Fig. 5. Accuracy increased as \(A\) rose from 1 to 5. Beyond this, improvements stabilized. This indicates that \(A = 5\) captures the complexities of our dataset effectively.

Based on these results, we set \(A = 5\) for our GSS configuration. This setting optimizes the system’s operational efficiency and diagnostic accuracy. It ensures that our model is competitive, especially for large-scale imaging applications, enhancing breast cancer detection.

Scene classification

We evaluated our method in the context of scene categorization by comparing its performance with several established shallow classification techniques: 1) FWK and FTK26; 2) MRH29; 3) SPM and its variants LLC-SPM27, SC-SPM35, and OB-SPM36; and 4) SVC37 and SSC38, which provide advanced image representation. We standardized the configurations of each algorithm; experimental setups of all baseline methods follow48.

As shown in Table 4 and Table 5, BING-guided patches consistently outperformed superpixels in terms of their descriptiveness and contribution to classification accuracy. The performance metrics derived from these experiments demonstrate that our method not only competes favorably with existing techniques but also surpasses many in terms of both robustness and accuracy in categorizing complex scenes. This highlights the effectiveness of integrating various feature representations and selecting the most informative regions for classification tasks.

Table 4 Comparative Classification Performance Across Standard Benchmarks (Mean Accuracy)
Table 5 Performance Variations Across Benchmark Datasets and Methods (Standard Deviation Metrics)

Upon reviewing the data in Table 4 and Table 5, we conducted a statistical analysis comparing our method with both deep and traditional (flat) visual recognition models. To ensure reliability, each model was evaluated across 20 iterations, and we report standard deviations to capture the variability of each approach's performance. The analysis shows that our method excels in both classification accuracy and stability, consistently outperforming the alternatives across datasets with minimal fluctuation across runs. Notably, on our collected MSEI, our method achieved a categorization precision exceeding its nearest competitor by more than 8%. This margin highlights the method's effectiveness on complex scene classification tasks where fine-grained accuracy is crucial, and its combination of high accuracy and stability underscores its potential for large-scale visual recognition applications requiring precise, consistent performance.

Extension to ball sport classification

To further validate the generality of our weakly-supervised ranking framework, we conduct experiments on ball sport image classification, where identifying salient object patches (e.g., balls, players, nets) is critical for accurate recognition. As shown in Table 7, we adapt our method using sport-specific weak labels and feature configurations.

The results in Table 8 reveal three key findings:

  1. GSS Superiority: Our method outperforms ResNet-50 by 7.6% accuracy, demonstrating that explicit patch ranking (via GSS) captures finer-grained discriminative features than end-to-end deep learning.

  2. Dynamic Context Modeling: The 4.4% improvement over BING+kNN (Table 8, row 2) confirms that temporal gaze sequencing (ball\(\rightarrow\)player\(\rightarrow\)net) better represents sport dynamics than static patch similarity.

  3. Patch Saliency Hierarchy: Table 9 quantitatively validates human intuition: balls consistently rank highest (0.92 saliency), followed by key players (0.85), establishing a measurable attention hierarchy.

The framework’s effectiveness stems from:

  • Cross-Domain Weak Labels: Player/ball detection (Table 7, row 2) provides free supervision, reducing reliance on manual annotations.

  • Manifold-Aware Ranking: The \(l_{21}\) norm in Eq. (3) eliminates noisy patches (e.g., audience), focusing computation on relevant regions.

  • SVM-Kernel Synergy: The RBF kernel (\(\sigma =0.1\)) in Table 7 optimally separates GSS features by exploiting non-linear patch relationships.

Table 6 Computational Efficiency Comparison (Seconds/Image).
Table 7 Dataset and Configuration for Ball Sport Classification.
Table 8 Classification Accuracy on Sports-10 Dataset.
Table 9 Patch Ranking Statistics in Sports Classification.

Table 6 reveals our method’s practical advantage: despite added functionality, it runs 13% faster than ResNet-50 due to:

  • Selective patch processing (only top 20% ranked patches undergo SVM classification)

  • Parallelizable GSS construction (Eq. (4) optimizations)

Two observed challenges motivate future research:

  • Small Ball Detection: Suboptimal ranking for balls <20 pixels (failed in 12% of cases)

  • Occlusion Handling: Player-ball overlaps reduce saliency scores by \(\sim\)15% (Table 9, rank 1)

Potential solutions include:

  • Multi-scale BING proposals

  • 3D gaze modeling for occluded objects

This experiment demonstrates our framework’s adaptability beyond medical imaging while revealing universal principles for object-aware attention modeling.

Conclusions

This research presents an innovative perceptual deep learning framework designed to derive complex visual representations from X-ray breast images, integrating elements of human scene perception. We developed a distinctive framework that constructs gaze shift sequences (GSS) with a weakly-supervised ranking algorithm, analyzes each GSS in depth to gather features, and uses these features to train a multi-category SVM that classifies cancer images into different stages. Extensive evaluations on the large breast image dataset we assembled validate the efficacy of our recognition system, demonstrating its robustness and accuracy.