Introduction

Advances in computerized analysis of medical images include deriving computational features toward building machine learning models which can accurately predict or prognosticate disease outcomes1,2. However, variations in image acquisition parameters, scanner types, and institutional practices can significantly affect the appearance of medical images3,4,5, such that the same tissue region may be represented differently across clinically acquired scans. Even minor differences in scanner hardware, reconstruction algorithms, acquisition protocols, or patient positioning can substantially alter feature values across sites or sessions, independent of underlying biology6,7. Similar instability can arise from differences in annotations or identification of regions of interest (ROI), further impacting reproducibility of extracted features8,9. The resulting fluctuations in computationally extracted medical image features are largely unrelated to the underlying disease conditions, and yield classifier models that do not generalize accurately between medical institutions simply due to slight variations in scanner calibration and acquisition protocols. It has thus become increasingly crucial to determine the variability10 of computerized features from medical images, both within and between institutions, across different acquisition protocols11,12, as well as in test-retest settings7. The key challenge is to identify medical image features that are simultaneously robust13 to cross-domain variability and discriminable, in order to ensure the generalizability14 and clinical utility of associated classifier models, especially when evaluated on new, unseen medical imaging cohorts.

The most popularly used feature selection methods in medical imaging studies include minimum redundancy maximum relevance (mRMR)15, Wilcoxon rank-sum testing (WLCX)16, and least absolute shrinkage and selection operator (LASSO)17. A survey of over 50 recent studies (summarized in Supplementary Table 2) found that more than 80% utilized one or more of these three techniques, as illustrated in Fig. 1(a). Notably, these methods primarily focus on identifying a minimal set of discriminable features, potentially overlooking the reproducibility or variability of these features. Figure 1(b) highlights that only about 30% of medical imaging studies have explicitly incorporated variability-based screening (e.g., hard thresholds), while over 70% did not assess feature variability at all.

Fig. 1: Overview of popularly utilized feature selection practices in medical imaging studies.
figure 1

Distribution of (a) feature selection methods and (b) feature screening strategies employed. (c) Lack of consensus across medical imaging studies for “screening” features based on variability (using IS, ICC, or CV measures) or discriminability (based on AUC), illustrated via distribution of widely varying thresholds used.

Multiple approaches have been proposed for quantifying feature variability in the context of medical imaging. Statistical measures including the intra-class correlation coefficient (ICC)18, instability score (IS)19, or the coefficient of variation (CV)20 have been utilized as feature reproducibility measures. Correspondingly, measures such as classifier AUC have been used to quantify feature discriminability21,22. These measures are used to “screen” features by determining a cut-off threshold at which features are considered discriminable or reproducible23. However, threshold criteria for both reproducibility and discriminability measures are often empirically determined for a given study and indeed can vary significantly between studies. Figure 1(c) illustrates the distribution of threshold values reported across 40 recent studies which have utilized different variability measures (IS, ICC, and CV) and a discriminability measure (AUC). It can be observed that a wide spectrum of threshold values has been utilized, with no clear consensus on the optimal thresholds to define a discriminable and reproducible feature set.

Utilizing empirically determined thresholds to screen medical image features may be further complicated when attempting to optimize for multiple sources of variability. For instance, different variability measures are often used to quantify batch effects (such as IS19) versus differences due to annotation sources (such as ICC24,25). The interplay between multiple sources of variability is likely not appropriately accounted for if each measure is independently used to filter out medical image features. Similarly, if feature discriminability and reproducibility are evaluated independently of each other, features that have only marginal variability but may still be highly discriminable could be filtered out.

To illustrate this relationship between discriminability and variability, Fig. 2 presents 2D scatter plots of feature discriminability (Y-axis) vs two different measures of feature variability (X-axes), where each point corresponds to a computerized image feature. Identifying features that are both highly discriminatory and highly reproducible would involve determining which features meet pre-defined thresholds (TH1, …, TH8, horizontal and vertical dashed lines) for each measure being considered. This would then yield the optimal feature sets highlighted via the blue boxes on each plot. Notably, this set of optimal features can be seen to comprise different feature families in each plot (primarily F5 in Fig. 2(a) but primarily F4 in Fig. 2(b)), due to differences in feature trends between the two variability measures. There are also differences in the sub-optimal feature sets identified in each plot (low in both discriminability and reproducibility), highlighted via red boxes. This suggests a significant challenge in optimally determining a trade-off between discriminability and reproducibility for computerized features, in terms of not only identifying the appropriate variability measure (which depends on how many imaging modalities, scanners, imaging protocols, or institutions are being considered) but also determining the best threshold value toward identifying the most discriminatory and reproducible feature set for disease characterization.

Fig. 2: 2D scatterplots illustrating the relationship between variability (in terms of instability score (IS) and coefficient of variation (CV)) and discriminability (via Area Under the ROC Curve (AUC)).
figure 2

Each point represents a computerized image feature and colors represent different feature families. Blue boxes highlight desirable feature groups, while the red boxes show undesirable ones, based on different trade-offs between variability and discriminability.

In this work, we present a novel Variability Regularized Feature Selection (VaRFS) approach, which simultaneously attempts to ensure feature discriminability while also optimizing for feature variability across institutions, scanners, or acquisition settings in the context of medical imaging data. An initial limited implementation of VaRFS was discussed in26, beyond which the current work incorporates multiple sources of variability, analytical evaluation of the convergence properties, as well as a comprehensive comparison across larger multi-institutional data cohorts. Our novel optimization framework directly integrates feature reproducibility into the selection process through a variability-based soft penalty term. Unlike traditional methods that apply reproducibility screening as a hard pre-filter (e.g., removing features whose variability “score” does not meet a pre-specified threshold), our approach maintains a unified formulation that jointly accounts for variability and predictive power. This not only allows for finer control over feature variability but also minimizes the chance of prematurely excluding marginally variable but highly informative features. The specific novel contributions of our current work are as follows:

1.

    VaRFS integrates feature variability screening and feature selection into a single optimization function; overcoming the need for empirical selection of an appropriate threshold value per variability measure or cohort. VaRFS is also designed to ensure a better tradeoff between the three essential properties of a computerized medical image feature set: discriminability, sparsity, and reproducibility. By comparison, separating feature variability screening and feature selection could result in sub-optimal features being identified due to the elimination of highly discriminatory features that do not meet empirically determined variability threshold criteria. As VaRFS integrates these two processes into a single optimization function, the feature selection process can be made more efficient, reliable, and flexible.

2.

The objective function of VaRFS aims to maximize the discriminability of the selected features, with additional regularization terms imposing constraints on sparsity and variability to ensure that the selected features are also sparse and reproducible. Towards this, VaRFS leverages the least absolute shrinkage and selection operator (LASSO) framework17, as it can easily assimilate supplementary regularization conditions27,28. However, as this extension may converge slowly for large-valued regularization parameters and requires careful tuning of the step size29,30, popular approaches such as coordinate descent31 and the iterative shrinkage-thresholding algorithm32 may be sub-optimal. To address these limitations, we present a comprehensive analytical framework that leverages a novel class of proximal algorithms33 which are computationally efficient, easy to implement, and able to handle non-smooth objective functions34. We analytically demonstrate how the incorporation of proximal algorithms into our unique extension of the LASSO framework can further be accelerated35 for faster convergence, which represents a significant advancement in the field of optimization.

3.

VaRFS provides a significant contribution to enabling clinical usage of machine learning models, specifically in addressing the challenging issue of optimally accounting for variability and reproducibility of computerized radiology image (or radiomic36) features. Toward this, VaRFS is comprehensively compared against three routinely utilized feature selection approaches across five multi-institutional radiographic imaging cohorts involving challenging clinical problems including differentiating healthy and diseased samples, characterizing response to treatment, as well as risk stratification; in both oncological (prostate, rectal cancer) and non-oncological (Crohn’s disease) settings.

Results

Experimental evaluation of VaRFS and alternative feature selection (FS) strategies was conducted using five different, multi-institutional, retrospectively accrued cohorts, which were segregated into independent discovery and validation sets (see Table 1). The overall experimental workflow is illustrated in Fig. 3.

Table 1 Multi-institutional data cohorts, splits, and classification tasks considered in this study
Fig. 3: Overall experimental workflow evaluating VaRFS against alternative feature selection strategies.
figure 3

Panels depict the data curation, feature extraction, feature selection, and downstream analysis stages used in the study.

Experiment 1: comparing VaRFS against variability-screened feature selection methods

VaRFS was found to result in statistically significantly higher AUC values in all five cohorts for both discovery and hold-out validation, compared to any alternative FS approach. This suggests that integrating feature variability directly into the selection scheme can improve overall model performance, including in multi-institutional validation. Table 2 summarizes classifier performance for the top-ranked radiomic features identified via each FS scheme (VaRFS, and variability-screened mRMR, LASSO, and WLCX) across all five cohorts. Note these results are based on considering a single measure of variability at a time, e.g., results are presented for each of ICCdose, ISbatch, and ICCannot for C5. Supplementary Table 3 presents averaged performance across all cross-validation runs for each FS method, which further confirms the superior performance trends of VaRFS compared to other FS methods. This can also be noted when utilizing an LDA model to evaluate VaRFS against alternative FS methods, as summarized in Supplementary Table 4.

Table 2 Performance of VaRFS feature set vs variability-screened mRMR, LASSO, and WLCX-based feature sets in terms of AUC for distinguishing the two classes in each of C1-C5 using a single variability measure

The results presented in Table 3 similarly summarize classifier performance for different FS strategies for all five cohorts, but when considering multiple measures of variability simultaneously. Radiomic features identified via VaRFS yielded statistically significant improvements in AUC values in all five cohorts (both in the discovery and hold-out validation sets) compared to any alternative FS strategy. Notably, accounting for multiple sources of variability via VaRFS can be seen to yield a further improvement in classifier AUC beyond using individual variability measures (compare Table 2 vs Table 3); corresponding to an overall 8–10% improvement for VaRFS over variability-screened FS approaches.

Table 3 Classifier AUC of VaRFS feature set vs alternatives in distinguishing the 2 classes in each cohort when considering multiple variability measures

Figure 4 illustrates the chord diagram of the five top-ranked radiomic features selected via VaRFS as well as each of variability-screened mRMR, LASSO, and WLCX, together with their respective ranks, feature family (indicated via colors), and feature importance in terms of SHAP values (indicated via size). Chord connections highlight instances where a feature is common to two or more different methods, based on which the VaRFS feature set can be seen to include a majority of reproducible features (some of which had also been identified by other FS methods). Typically, VaRFS can be seen to have the most features in common with LASSO; which aligns with the commonality in their objective functions. The top-ranked VaRFS features identified here can be seen to correspond to radiomic descriptors from Laws, Gradient, and Haralick feature families. This resonates with previous findings from our group37,38,39,40 as well as others41,42 where these patterns have shown associations with specific disease biology or physiological characteristics (additional details in the Supplementary Materials).

Fig. 4: Chord diagram of the five top-ranked features identified by VaRFS and other alternative FS methods when considering multiple variability measures, for each of C1-C5.
figure 4

Numeric labels (1–5) indicate the rank of each feature as determined by the corresponding feature selection method, colors correspond to feature family, while the size of each section represents the feature importance based on its SHAP value. Chord connections highlight instances where a feature is common to two or more different methods, while distinct symbols inside the chord sections indicate whether a feature is variable based on defined thresholds. The indices of these selected features are summarized in the Supplementary Table 1, which can be cross-referenced against Supplementary Data1.

Figure 5 depicts PCPs of discriminability/variability trends in these five top-ranked radiomic features selected via VaRFS within each cohort, each of which corresponds to a polyline that connects vertices between different parallel axes (representing the specific discriminability or variability value of that feature). It can be observed that many of the marginally variable features selected via VaRFS are not only highly discriminatory but also located in close proximity to the threshold (indicated via horizontal dashed lines). This suggests that a slight adjustment to this cutoff value would include or exclude critically useful features from consideration by different FS strategies (due to not meeting ad hoc variability criteria). The overall improved classifier performance achieved by incorporating these marginally variable features suggests they significantly augment the overall discriminability of the VaRFS model while not compromising on its generalizability to unseen data (consistently improved performance across discovery and validation).

Fig. 5: Parallel Coordinate Plot (PCP) based on five top-ranked features selected via VaRFS.
figure 5

Polylines correspond to individual features f1, …, f5 selected per cohort (colors indicate feature family, identical to Fig. 4), which in turn are composed of unbroken line segments that connect vertices between different parallel axes (representing the specific discriminability or variability value of that feature). Horizontal dashed lines indicate the threshold cutoff value for specific variability measures.

Figure 6 depicts an UpSet-style error decomposition for each cohort (C1-C5), partitioning the false positives (FP) and false negatives (FN) of each of the 5 FS strategies evaluated into error sets, thereby revealing both shared and method-specific failure modes. Both VaRFS variants can be seen to consistently produce fewer unique errors, with most misclassifications overlapping with those made by other methods. Across all cohorts, comparator FS methods (mRMR, WLCX) also demonstrate a markedly higher error rate compared to both VaRFS approaches, with the smallest proportion of FPs/FNs associated with VaRFS when considering multiple sources of variability. This pattern demonstrates that VaRFS does not appear to introduce new or unstable error modes compared to other FS schemes; instead, it reduces method-specific errors while preserving discriminability.

Fig. 6: UpSet-style error decomposition for each cohort C1-C5, where stacked bars show the distribution of false positives (FP, brown) and false negatives (FN, red) associated with each FS approach (mRMR, WLCX, LASSO, VaRFS-Single, VaRFS-Multi).
figure 6

Below each bar, the filled circles indicate which FS approach was associated with a specific group of errors. Values in parentheses next to each method represent the total percentage of classification errors attributable to that method within the corresponding cohort.

Experiment 2: evaluating parameter sensitivity of VaRFS

Optimal classifier performance for VaRFS (highlighted in pink) in all five cohorts is observed when equally weighting β (variability) and λ (sparsity), though stable performance can be noted across a broad range of regularization parameters (see Supplementary Materials for a more detailed description). This can be seen in Fig. 7 via a 3D barplot of AUC values for the VaRFS feature set selected for each parameter combination of β and λ, when the corresponding RF model is evaluated in hold-out validation. The best overall AUC value in three cohorts corresponds to β = λ = 0.5 (and is very close to these values for the remaining two cohorts). Intuitively, classifier performance is seen to decline markedly for extreme parameter combinations (β ≪ λ or β ≫ λ), which indicates that both sparsity and variability terms are equally critical in the VaRFS cost function. This allows for more intelligent and reliable identification of feature sets which are simultaneously discriminable, sparse, and reproducible, thus reducing the chance of model overfitting while improving its generalizability.

Fig. 7: 3D bar plot of classifier AUC values (Z-axis) in hold-out validation for each cohort, when considering VaRFS feature sets selected for each combination of regularization parameters for variability (β, X-axis) and sparsity (λ, Y-axis).
figure 7

Color shading of the bar plots is based on AUC values such that yellow indicates higher performance while blue corresponds to lower performance. Highest overall AUC performance in each cohort is highlighted in pink, with corresponding regularization parameters summarized.

Experiment 3: comparing regular vs accelerated versions of VaRFS

Accelerated VaRFS was found to converge at a faster rate and reach a lower value of the objective function compared to the regular implementation, in all five cohorts. Figure 8 presents optimization trends for the objective function J(θ) in (3), as computed by the regular (red lines) and accelerated (blue lines) versions of VaRFS over 100 iterations. When considering multiple measures of variability, the initialization of J(θ) is intuitively higher compared to using a single variability measure, across all five cohorts. The visualization of C5 with three different sources of variability emphasizes this point, as the initialization of J(θ) here has the highest value across all cohorts. VaRFS was also found to be more computationally efficient (average runtime of 124 seconds for regular, 79 seconds for accelerated) in comparison to both mRMR (614 secs runtime) and WLCX (412 secs runtime), while being comparable in runtime to LASSO (132 secs runtime) in all five cohorts. These results suggest that the use of proximal algorithms, rather than primal-dual methods43 or projection onto convex sets44, is an appropriate choice for VaRFS; this family also includes the alternating direction method of multipliers45, which has been shown to be an efficient and computationally cheaper approach.

Fig. 8: Trends in the objective function J(θ) in Equation (3) for regular (red) vs accelerated (blue) versions of VaRFS, computed via Algorithm 1 and Algorithm 2, respectively.
figure 8

In each plot, solid lines correspond to using a single variability measure while dashed lines indicate trends when considering multiple measures of variability simultaneously within VaRFS.

Building on these advantages, the accelerated proximal algorithm for optimizing the VaRFS objective function allows for more efficient solving of a convex but non-smooth optimization problem via the use of a momentum term that helps it converge faster46. These results are also in line with previous studies47,48,49; demonstrated here for the first time in the context of radiomics and medical image analysis.

Discussion

In this study, we presented a novel radiomic feature selection scheme, Variability Regularized Feature Selection (VaRFS), which represents a first effort at integrating feature variability as a generalizable regularization term directly into the optimization function used to select a sparse and discriminable set of features. Radiomic features selected via VaRFS achieved significantly higher classification performance compared to three routinely utilized feature selection approaches across five multi-institutional radiographic imaging cohorts involving challenging clinical problems including differentiating healthy and diseased samples, characterizing response to treatment, as well as risk stratification. We were additionally able to demonstrate the computational efficiency of the VaRFS approach as well as examine how exploiting the trade-offs in feature discriminability and variability can ensure improved model performance.

To enhance the robustness and generalizability of machine learning models in medical image analysis, there has been increasing recognition of the need to consider the reproducibility of radiomic features given their sensitivity to acquisition parameters11 and batch effects9,50. Neglecting feature reproducibility can lead to an increased risk of false positive associations and type I errors7. Recent efforts in this regard have largely adopted an independent feature screening approach prior to feature selection21,23,51. These approaches typically utilize thresholding of variability measures to omit any features which do not meet prespecified criteria. This is because popular feature selection approaches (LASSO, mRMR, and WLCX) have not been explicitly designed to account for variability, but rather only for sparsity and discriminability. Feature screening can also be seen to suffer from issues similar to those of dichotomizing continuous variables52, such as loss of information, reduced statistical power, and increased risk of false positives. It is also worth noting that blindly removing unstable features (without regard to their discriminability) or simply retaining all features (without regard to their reproducibility) based on pre-specified thresholds may not result in an optimal, generalizable feature set.

In order to account for these issues, VaRFS simultaneously optimizes for feature contributions in terms of discriminability, sparsity, and reproducibility. Unlike traditional methods which rely on threshold-based variability screening, VaRFS directly optimizes for feature variability together with sparsity and discriminability, offering a more principled alternative to exhaustive threshold parameter tuning. Furthermore, the features selected by VaRFS can be seen to represent an optimal trade-off between different variability measures, while not compromising on its ability to identify a complementary suite of features and patterns in the data. This is borne out in our experimental results, where radiomic features selected via VaRFS yielded significantly higher classification performance compared to features selected after variability screening; suggesting the significant advantages enabled by developing an approach which can simultaneously optimize for sparsity, discriminability, and reproducibility rather than considering each of these factors independently. VaRFS thus offers a more efficient and effective method for feature selection which could ultimately improve the clinical translation and practical utility of radiomics-based models.

We do acknowledge some limitations to our study. While our five multi-institutional cohorts totaled over 700 patient datasets from 12 different institutions, we primarily considered binary classification problems in specific disease use-cases using MRI or CT scans. These results will require further confirmation in other diseases, when analyzing other radiomic feature families, as well as when considering other imaging modalities (e.g., PET, digital pathology). Our experiments did not incorporate any prior knowledge about feature variability, which was instead empirically determined on the fly within our specific cohorts. This was done primarily to ensure an even playing field when comparing VaRFS with alternative feature selection and screening approaches. Our selection of comparators in the current study was based on their wide usage in the medical imaging and radiomics literature, wherein WLCX, LASSO, and mRMR remain the most widely utilized for feature selection. Additional comparator FS techniques that could have been considered include tree-based importance measures53, ensemble-based strategies54, or ElasticNet55; these will be a subject for future work. Prior studies23,25 have linked robust radiomic features to clinical and biological endpoints. Understanding the relationship between biological interpretability and the variability characteristics of radiomic features will be a key direction for future research, building on the methodological development of VaRFS undertaken in the current study.

In the future, we plan to extend the VaRFS framework in order to incorporate the concept of reproducibility into deep learning approaches. We will also examine how to incorporate priors in terms of which feature families to utilize, as well as extend VaRFS for use in multi-class and continuous regression problems.

Methods

Overview of VaRFS

All data are assumed to be real-valued. Vectors and matrices are denoted by boldface lower-case and boldface upper-case letters, respectively. Additional notation used in this work is summarized in Table 4.

Table 4 Common notation utilized in Section IV

Consider S data sources (e.g. institutions, batches, scanners), each of which is associated with the feature matrix \({{\bf{X}}}_{i}=[{{\bf{x}}}_{i}^{1}\ldots {{\bf{x}}}_{i}^{j}\ldots {{\bf{x}}}_{i}^{m}]\in {{\mathbb{R}}}^{{n}_{i}\times m}\), where \({{\bf{x}}}_{i}^{j}\) is the feature vector for the ith data source and the jth feature. m represents the total number of features (assumed to be the same for all S data sources), while ni corresponds to the number of samples from the ith source. Let \({{\bf{y}}}_{i}\in {{\mathbb{R}}}^{{n}_{i}\times 1}\) be the corresponding label vector for the ni samples. Accumulated feature values over all samples and all sources can be denoted via \({\bf{X}}={[{{\bf{X}}}_{1}\ldots {{\bf{X}}}_{S}]}^{T}\in {{\mathbb{R}}}^{n\times m}\) where \(n=\mathop{\sum }\nolimits_{i = 1}^{S}{n}_{i}\).

Problem statement for feature selection via LASSO

Finding a user-specified number of discriminative features (denoted via the level of sparsity, c) can be cast as a constrained optimization problem as follows56:

$$\mathop{\min }\limits_{{\boldsymbol{\theta }}\in {{\mathbb{R}}}^{m}}\frac{1}{2}\mathop{\sum }\limits_{i=1}^{S}{\left\Vert {{\bf{y}}}_{i}-{{\bf{X}}}_{i}{\boldsymbol{\theta }}\right\Vert }_{2},\quad \,\text{subject}\,\,\text{to}\,\quad {\left\Vert {\boldsymbol{\theta }}\right\Vert }_{0}\le c,$$
(1)

where \({\boldsymbol{\theta }}\in {{\mathbb{R}}}^{m}\) is the coefficient vector reflecting the contribution of each feature. This non-smooth combinatorial optimization problem is NP-hard57. A common alternative to (1) is to consider the convex relaxation based on the \({\ell }_{1}\) norm, which corresponds to the LASSO formulation17, written as:

$$\mathop{\min }\limits_{{\boldsymbol{\theta }}\in {{\mathbb{R}}}^{m}}\left\{\frac{1}{2}\mathop{\sum }\limits_{i=1}^{S}{\left\Vert {{\bf{y}}}_{i}-{{\bf{X}}}_{i}{\boldsymbol{\theta }}\right\Vert }_{2}^{2}+\lambda {\left\Vert {\boldsymbol{\theta }}\right\Vert }_{1}\right\},$$
(2)

where λ is the regularization parameter.
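For reference, the standard LASSO problem in (2) can be solved with off-the-shelf tools. The following is a minimal illustrative sketch using scikit-learn on randomly generated data (not the cohorts used in this study); note that scikit-learn’s alpha corresponds to λ only up to the 1/n scaling applied to the loss in that implementation.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 405))            # n samples x m features, stacked over sources
    y = rng.normal(size=200)                   # continuous surrogate for the label vector
    model = Lasso(alpha=0.05, max_iter=10000)  # alpha plays the role of lambda (up to 1/n scaling)
    model.fit(X, y)
    selected = np.flatnonzero(model.coef_)     # features with non-zero coefficients
    print(f"{selected.size} features retained by LASSO")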

Development of VaRFS

Radiomic features are known to vary between data sources due to intra-site, inter-site, or test/retest differences including changes in the device, modality, sequence, compartment, patient, or laboratory settings36. This variability is typically quantified via different statistical measures (e.g. IS, ICC, CV).

Based on the types of variability being considered, denoted via \(v\in \{1,\ldots ,V\}\) (e.g., batch effects, annotation differences), we define the feature variability vector as \({{\bf{u}}}_{v}={\left[{u}_{v}^{1}\ldots {u}_{v}^{j}\ldots {u}_{v}^{m}\right]}^{T}\), based on computing a measure of variability on a per-feature basis via statistical comparisons of bootstrapped subsets generated from the original feature space. In matrix form, this is represented via the feature variability matrix, \({\bf{P}}=[{{\bf{u}}}_{1}\ldots {{\bf{u}}}_{V}]\in {{\mathbb{R}}}^{m\times V}\).
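As a minimal illustration (not the exact implementation used in this study), the columns of P could be assembled from simple per-feature variability surrogates, e.g., a bootstrap-based CV and a between-batch CV; the batch_ids variable and both helper functions below are hypothetical.

    import numpy as np

    def bootstrap_cv(X, n_boot=100, seed=0):
        # Variability of each feature's mean across bootstrapped subsets of the samples.
        rng = np.random.default_rng(seed)
        n, m = X.shape
        means = np.empty((n_boot, m))
        for b in range(n_boot):
            idx = rng.integers(0, n, size=n)
            means[b] = X[idx].mean(axis=0)
        return means.std(axis=0, ddof=1) / (np.abs(means.mean(axis=0)) + 1e-12)

    def batch_cv(X, batch_ids):
        # Variability of per-batch feature means, a simple surrogate for batch effects.
        batch_means = np.stack([X[batch_ids == b].mean(axis=0) for b in np.unique(batch_ids)])
        return batch_means.std(axis=0, ddof=1) / (np.abs(batch_means.mean(axis=0)) + 1e-12)

    # P is m x V, with one column per source of variability considered:
    # P = np.column_stack([bootstrap_cv(X), batch_cv(X, batch_ids)])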

We incorporate feature variability into the LASSO formulation by adding an additional penalty term to the objective function J(θ) in (2). Note that this penalty term is quadratic to ensure that it is convex, which is a key requirement of the optimization approach described below.

This represents the objective function for VaRFS, written as:

$$\mathop{\text{argmin}}\limits_{{{\boldsymbol{\theta}}\in {\mathbb{R}}^m}} \left\{ J({{\boldsymbol{\theta}}}) = \mathop{\overbrace{\mathop{\underbrace{\frac{1}{2}\mathop{\sum}\limits_{i=1}^{S}{\|{{{\bf{y}}}_i - {{\bf{X}}}_i{{\boldsymbol{\theta}}}}\|}_2^2}}\limits_{{\rm{discriminability}}} + \mathop{\underbrace{\beta {{\boldsymbol{\theta}}}^T{{\bf{R}}}{{\boldsymbol{\theta}}}}}\limits_{{\rm{variability}}}}}\limits^{f({{\boldsymbol{\theta}}})} + \mathop{\overbrace{\mathop{\underbrace{\lambda{\|{{{\boldsymbol{\theta}}}}\|}_1}}\limits_{{\rm{sparsity}}}}}\limits^{g({{\boldsymbol{\theta}}})}\right\}$$
(3)

where β is the regularization parameter, used to differentially weight variability measures (and thus, different sources of variability). Note that R is the symmetric form of the feature variability matrix \({\bf{R}}\triangleq {\bf{P}}{{\bf{P}}}^{T}=\mathop{\sum }\nolimits_{v = 1}^{V}{{\bf{u}}}_{v}{{\bf{u}}}_{v}^{T}\).
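Given R, the three terms of (3) can be evaluated directly; a short sketch (variable names are illustrative) is:

    import numpy as np

    def varfs_objective(theta, X, y, R, beta, lam):
        # J(theta) = discriminability (squared loss) + variability penalty + sparsity penalty
        resid = y - X @ theta
        discriminability = 0.5 * resid @ resid
        variability = beta * theta @ R @ theta
        sparsity = lam * np.abs(theta).sum()
        return discriminability + variability + sparsity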

Optimization of VaRFS

While f + g in (3) is a convex objective function, it is non-smooth (due to the sparsity term g(θ)) and thus cannot be solved by standard optimization methods such as gradient descent. Rather than computationally expensive and complex alternatives such as the alternating direction method of multipliers58, we utilize proximal algorithms59 as they work under extremely general conditions, are much faster for challenging optimization problems, and are scalable and amenable to distributed optimization46. Based on Lemma 1 (see Supplementary Information Section A), in order to minimize f + g in our convex optimization problem, we can replace the smooth function f with its upper bound (denoted \(\bar{f}\)), which results in the following iterative solution algorithm for (3),

$${{\boldsymbol{\theta }}}_{k+1}=\mathop{\,\text{argmin}\,}\limits_{{\boldsymbol{\theta }}\in {{\mathbb{R}}}^{m}}\left\{\bar{f}({\boldsymbol{\theta }},{{\boldsymbol{\theta }}}_{k})+g({\boldsymbol{\theta }})\right\},$$
(4)

where, for the kth iteration,

$$\bar{f}({\boldsymbol{\theta }},{{\boldsymbol{\theta }}}_{k})=f({{\boldsymbol{\theta }}}_{k})+{\nabla }^{T}f({{\boldsymbol{\theta }}}_{k})({\boldsymbol{\theta }}-{{\boldsymbol{\theta }}}_{k})+\frac{1}{2\gamma }{\left\Vert {\boldsymbol{\theta }}-{{\boldsymbol{\theta }}}_{k}\right\Vert }_{2}^{2}.$$
(5)

This in turn is equivalent to

$${{\boldsymbol{\theta }}}_{k+1}=\mathop{\,\text{argmin}\,}\limits_{{\boldsymbol{\theta }}\in {{\mathbb{R}}}^{m}}\left\{\frac{1}{2}{\left\Vert {\boldsymbol{\theta }}-{\bar{{\boldsymbol{\theta }}}}_{k}\right\Vert }_{2}^{2}+\gamma g({\boldsymbol{\theta }})\right\}={{\rm{prox}}}_{\gamma g}({\bar{{\boldsymbol{\theta }}}}_{k}),$$
(6)

where, \({\bar{{\boldsymbol{\theta }}}}_{k}={{\boldsymbol{\theta }}}_{k}-\gamma \nabla f({{\boldsymbol{\theta }}}_{k})\) and proxγg is the proximal operator of the convex function γg (See Definition 3 in Section A of the Supplementary Information). This base mapping of the proximal algorithm is a standard tool for solving non-smooth optimization problems60. Proof that f in (3) is a Lipschitz continuous gradient function is presented as Lemma 2 in Section B of the Supplementary Information, based on which our problem can be seen to meet the requirements for using general proximal algorithms59. The final Algorithm 1 summarizes the overall approach to solve (6) within VaRFS.

Algorithm 1

Proximal Algorithm for VaRFS

Input: y, X, P, β, λ, K (number of inner-loop iterations), γ (step-size)

initialization : \({{\boldsymbol{\theta }}}_{0}\in {{\mathbb{R}}}^{m}\), R = PPT

1: for k = 1, 2, …, K do

2: \(f({{\boldsymbol{\theta }}}_{k})=\frac{1}{2}{\left\Vert {\bf{y}}-{\bf{X}}{{\boldsymbol{\theta }}}_{k}\right\Vert }_{2}^{2}+\beta {{\boldsymbol{\theta }}}_{k}^{T}{\bf{R}}{{\boldsymbol{\theta }}}_{k}\)

3: \(g({{\boldsymbol{\theta }}}_{k})=\lambda {\left\Vert {{\boldsymbol{\theta }}}_{k}\right\Vert }_{1}\)

4: \({\bar{{\boldsymbol{\theta }}}}_{k}={{\boldsymbol{\theta }}}_{k}-\gamma \nabla f({{\boldsymbol{\theta }}}_{k})\)

5: \({{\boldsymbol{\theta }}}_{k+1}={{\rm{prox}}}_{\gamma g}({\bar{{\boldsymbol{\theta }}}}_{k})\)

6: end for

Output: θ = θk+1
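A compact Python rendering of Algorithm 1 is sketched below for illustration (the experiments in this study were run in MATLAB); it uses the standard fact that the proximal operator of the ℓ1 penalty is element-wise soft-thresholding.

    import numpy as np

    def soft_threshold(v, t):
        # Proximal operator of t * ||.||_1 (element-wise soft-thresholding).
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def varfs_proximal(X, y, P, beta, lam, K=100, gamma=None):
        n, m = X.shape
        R = P @ P.T                                  # symmetric m x m variability matrix
        theta = np.zeros(m)
        if gamma is None:
            # Step size chosen within the admissible interval (see convergence analysis below).
            Q = X.T @ X + 2.0 * beta * R
            gamma = 1.0 / np.linalg.norm(Q, 2)
        for _ in range(K):
            grad = X.T @ (X @ theta - y) + 2.0 * beta * (R @ theta)  # gradient of f
            theta_bar = theta - gamma * grad                         # gradient step
            theta = soft_threshold(theta_bar, gamma * lam)           # prox of gamma * g
        return theta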

Remark 1

Since, in (5), \(\bar{f}({\boldsymbol{\theta }},{{\boldsymbol{\theta }}}_{k})\ge f({\boldsymbol{\theta }})\) and \(\bar{f}({{\boldsymbol{\theta }}}_{k},{{\boldsymbol{\theta }}}_{k})=f({{\boldsymbol{\theta }}}_{k})\), \(\bar{f}({\boldsymbol{\theta }},{{\boldsymbol{\theta }}}_{k})\) is a so-called majorization function of f(θ)61. Therefore, our algorithm is a type of majorization-minimization algorithm62.

Convergence analysis of VaRFS

We examine the restrictions on the learning rate parameter γ that ensure convergence of the iterations outlined in (6), as stated in Theorem 1.

Theorem 1

The sequence {θk} in (6) converges to a stationary point of f + g. To guarantee convergence, the parameter γ must adhere to

$$0 < \gamma \le \frac{1}{{\left\Vert {\bf{Q}}\right\Vert }_{2}},$$
(7)

in which,

$${\bf{Q}}\triangleq {{\bf{D}}}^{T}{\bf{D}},\,\,\,\,\,{\bf{D}}\triangleq {\left[{{\bf{X}}}^{T}\sqrt{2\beta }P\right]}^{T}.$$
(8)

The proof may be found in Supplementary Information Section B.
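As a practical note, the admissible step-size range in (7) and (8) can be computed directly from X, P, and β; a minimal illustrative sketch:

    import numpy as np

    def max_step_size(X, P, beta):
        # D stacks the data matrix and the scaled variability matrix, so that
        # Q = D^T D = X^T X + 2*beta*P P^T is the Hessian of the smooth term f.
        D = np.vstack([X, np.sqrt(2.0 * beta) * P.T])
        Q = D.T @ D
        return 1.0 / np.linalg.norm(Q, 2)   # gamma must lie in (0, 1 / ||Q||_2]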

Theorem 2

Let Q in (8) be a positive-definite (PD) matrix with singular values sorted as \({\sigma }_{\min }\le \ldots \le {\sigma }_{\max }\). Given that Algorithm 1 reaches the optimal solution θ* with a generic learning rate γ, the iterations of this algorithm demonstrate a linear convergence rate. Moreover, we have

$${\left\Vert {{\boldsymbol{\theta }}}_{k+1}-{{\boldsymbol{\theta }}}^{* }\right\Vert }_{2}\le z(\gamma ){\left\Vert {{\boldsymbol{\theta }}}_{k}-{{\boldsymbol{\theta }}}^{* }\right\Vert }_{2},$$
(9)

where \(z(\gamma )=\max \left\{\left\vert 1-\gamma {\sigma }_{\min }\right\vert ,\left\vert 1-\gamma {\sigma }_{\max }\right\vert \right\}\) is the convergence rate.

The proof is provided in Section B of the Supplementary Information.

Figure 9 depicts the convergence rate of the regular VaRFS proximal algorithm for different learning rates γ. The convergence rate is illustrated in terms of the condition number (\(\kappa \triangleq \frac{{\sigma }_{\max }}{{\sigma }_{\min }}\)) of the matrix Q. As expected, well-conditioned matrices with κ(Q) ≈ 1 converge faster than ill-conditioned ones with κ(Q) ≫ 1. It can also be seen that larger step sizes within the convergence interval (\(0 < \gamma \le \frac{1}{{\left\Vert {\bf{Q}}\right\Vert }_{2}}=\frac{1}{{\sigma }_{\max }}\)) correspond to faster convergence.

Fig. 9: Convergence ratio of the VaRFS proximal algorithm for different step sizes of γ.
figure 9

For better interpretation, the convergence rates are shown via the condition number κ.
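For completeness, the convergence factor z(γ) in (9) can be evaluated from the extreme singular values of Q; a short illustrative sketch:

    import numpy as np

    def convergence_factor(Q, gamma):
        # z(gamma) = max(|1 - gamma*sigma_min|, |1 - gamma*sigma_max|); smaller is faster.
        sigma = np.linalg.svd(Q, compute_uv=False)
        s_min, s_max = sigma.min(), sigma.max()
        return max(abs(1.0 - gamma * s_min), abs(1.0 - gamma * s_max))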

Acceleration of VaRFS

Following30,63, the basic proximal gradient algorithm can be further accelerated through the use of weighted combinations of current and previous gradient directions via an extrapolation step; thus ensuring each iteration does not require more than one gradient evaluation. This is implemented by incorporating a new sequence, \({\{{{\boldsymbol{\eta }}}_{k}\}}_{k = 0}^{\infty }\), which is initialized as η0 = θ0. Recursively updating {ηk} and thus {θk} at each iteration \(k\in \{0,1,\ldots ,K\}\) allows for significantly faster convergence. Algorithm 2 summarizes this accelerated approach to solving (6) within VaRFS.

Algorithm 2

Accelerated Proximal Algorithm for VaRFS

Input: y, X, P, β, λ, K (number of inner-loop iterations), γ (step-size)

initialization : \({{\boldsymbol{\theta }}}_{0}\in {{\mathbb{R}}}^{m}\), η0 = θ0, R = PPT

1: for k = 1, 2, …, K do

2: \(f({{\boldsymbol{\eta }}}_{k})=\frac{1}{2}{\left\Vert {\bf{y}}-{\bf{X}}{{\boldsymbol{\eta }}}_{k}\right\Vert }_{2}^{2}+\beta {{\boldsymbol{\eta }}}_{k}^{T}{\bf{R}}{{\boldsymbol{\eta }}}_{k}\)

3: \(g({{\boldsymbol{\eta }}}_{k})=\lambda {\left\Vert {{\boldsymbol{\eta }}}_{k}\right\Vert }_{1}\)

4: ηk = ηk − γ∇f(ηk)

5: θk+1 = proxγg(ηk)

6: \({{\boldsymbol{\eta }}}_{k+1}={{\boldsymbol{\theta }}}_{k+1}+w\left({{\boldsymbol{\theta }}}_{k+1}-{{\boldsymbol{\theta }}}_{k}\right)\)

7: end for

Output: θ = θk+1

Remark 2

Parameter w must be chosen in specific ways to achieve convergence acceleration. One simple choice takes \(w=\frac{k}{k+3}\)46.
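Relative to the earlier sketch of Algorithm 1, the acceleration amounts to iterating on the extrapolated sequence {ηk} and adding the momentum update of step 6 (with w = k/(k+3) as in Remark 2); an illustrative Python version is:

    import numpy as np

    def varfs_proximal_accelerated(X, y, P, beta, lam, K=100):
        n, m = X.shape
        R = P @ P.T
        Q = X.T @ X + 2.0 * beta * R
        gamma = 1.0 / np.linalg.norm(Q, 2)

        def soft_threshold(v, t):
            return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

        theta = np.zeros(m)
        eta = theta.copy()                         # extrapolated sequence, eta_0 = theta_0
        for k in range(K):
            grad = X.T @ (X @ eta - y) + 2.0 * beta * (R @ eta)
            theta_next = soft_threshold(eta - gamma * grad, gamma * lam)  # proximal step at eta_k
            w = k / (k + 3.0)                                             # momentum weight (Remark 2)
            eta = theta_next + w * (theta_next - theta)                   # extrapolation step
            theta = theta_next
        return theta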

Section C of the Supplementary Information provides a detailed computational complexity analysis of both Algorithm 1 and Algorithm 2, highlighting convergence rates of \({\mathcal{O}}(1/k)\) and \({\mathcal{O}}(1/{k}^{2})\) for the regular and accelerated versions of VaRFS, respectively.

Data description

C1 (Prostate Cancer MRI) comprised 147 diagnostic T2-weighted (T2w) prostate MRIs from 4 institutions, with the goal of distinguishing benign from malignant lesions in the peripheral zone (discovery: 3 sites, validation: 1 site). More details of this dataset are available in64,65.

C2 (Rectal Cancer MRI, pre-CRT) comprised 197 pre-treatment T2w rectal MRIs from 3 institutions, from patients who later underwent standard-of-care chemoradiation (nCRT) and surgery. Histopathologic tumor regression grade (TRG) assessment of the excised surgical specimen was used to define pathologic complete response (pCR) to nCRT. The goal was to distinguish patients who will achieve pCR (i.e., ypTRG0 or 0% viable tumor cells remaining) from those who will not, based on annotated tumor regions on pre-nCRT MRI. For more dataset details refer to38.

C3 (Rectal Cancer MRI, post-CRT) comprised 119 T2w post-treatment rectal MRI scans from 3 institutions, from patients after they had undergone standard-of-care nCRT but prior to undergoing surgery. Histopathologic tumor stage (ypT) assessment of the excised surgical specimen was used to define the pathologic response to nCRT. The goal was to distinguish patients who achieved tumor regression (i.e., ypT0-2 or tumor that has regressed to within the rectal wall) from those who did not, based on annotated rectal wall regions on post-nCRT MRI. Additional details are in37.

C4 (Crohn’s Disease MRE) comprised 73 T2w bowel MR enterography (MRE) scans from patients who had been endoscopically confirmed with Crohn’s disease. The goal was to distinguish high-risk patients who needed surgery within one year of MRI and initiation of aggressive immunosuppressive therapy, from low-risk patients (stable for up to 5 years in follow-up); using annotated terminal ileum regions on baseline MRIs. This single-institutional cohort harbored large batch effects as a result of adjustments to acquisition parameters, including scanner type and magnetic field strength. More details of this dataset are available in66.

C5 (Crohn’s Disease CTE) comprised 165 CT enterography (CTE) scans from patients being screened for Crohn’s disease with endoscopic confirmation of disease presence. The goal was to distinguish between healthy and diseased terminal ileum regions within this single institutional cohort harboring significant batch effects67, as well as dose/reconstruction changes.

Radiomic feature extraction

As summarized in Fig. 3, after data acquisition, pre-processing included linear resampling of all scans to an isotropic resolution of 1 × 1 × 1 mm to ensure consistent resolution within each cohort. Additionally, N4ITK bias field correction68 in 3D Slicer was used to correct inhomogeneity artifacts in MRI scans in C1-4. 405 3D radiomic features were then extracted on a voxel-wise basis from all scans. A complete list of all extracted features is provided in Supplementary Data1 in the Supplementary Materials. These features included 20 Histogram, 152 Laws69, 13 Gradient70, 160 Gabor71, and 60 Haralick72 responses. The mean value of each feature was then computed within specified regions-of-interest (ROIs), and feature normalization was applied on a cohort basis to ensure each feature had a mean of 0 and a standard deviation of 1. Based on the sources of variability present (summarized in Table 5), corresponding variability measures were computed on a per-feature basis for each of the five cohorts C1-5.
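As a minimal illustrative sketch of the final normalization step, the ROI-mean feature values can be z-scored on a per-cohort basis as follows (feature_matrix and cohort_ids are hypothetical variable names):

    import numpy as np

    def zscore_per_cohort(feature_matrix, cohort_ids):
        # feature_matrix: (n_patients x n_features) ROI-mean radiomic feature values
        # cohort_ids: length-n vector identifying the cohort of each patient
        X = feature_matrix.astype(float).copy()
        for c in np.unique(cohort_ids):
            rows = cohort_ids == c
            mu = X[rows].mean(axis=0)
            sd = X[rows].std(axis=0, ddof=1)
            X[rows] = (X[rows] - mu) / np.where(sd > 0, sd, 1.0)  # guard against constant features
        return X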

Table 5 Sources of variability and corresponding variability measures for each cohort considered in this work

VaRFS implementation and sensitivity analysis

VaRFS was implemented as Algorithm 1 (regular) and Algorithm 2 (accelerated), with K = 100 (number of iterations) and \(\gamma =\frac{1}{2{\sigma }_{\max }({\bf{Q}})}\) (mid-point of the convergence interval, see Fig. 9). Analysis of convergence differences for J(θ) in (3) between both algorithms was conducted for all five cohorts. To evaluate the effect of the regularization parameters in VaRFS, these were varied as β, λ ∈ {0, 0.1, …, 1}, corresponding to variability and sparsity, respectively. These 100 possible β–λ parameter combinations were evaluated for each of C1-5, resulting in a total of 500 possible parameter variations of VaRFS being evaluated. Since each cohort had at least two sources of variability considered, VaRFS was evaluated when considering each individual measure as well as the combination of multiple variability measures (e.g. P = [u1 u2 u3] for C5).
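The parameter sensitivity analysis can be viewed as a simple grid search over the β–λ pairs; an illustrative sketch is shown below, where fit_fn and score_fn are placeholders for a VaRFS solver (e.g., the Algorithm 1 sketch above) and an AUC-based evaluation routine, respectively.

    import numpy as np

    def parameter_grid_search(X_tr, y_tr, X_va, y_va, P, fit_fn, score_fn,
                              betas=np.arange(0.0, 1.01, 0.1),
                              lams=np.arange(0.0, 1.01, 0.1)):
        # fit_fn(X, y, P, beta, lam) -> coefficient vector theta
        # score_fn(selected_indices, X_tr, y_tr, X_va, y_va) -> validation AUC
        results = {}
        for beta in betas:
            for lam in lams:
                theta = fit_fn(X_tr, y_tr, P, beta, lam)
                selected = np.flatnonzero(theta)
                results[(round(beta, 1), round(lam, 1))] = score_fn(selected, X_tr, y_tr, X_va, y_va)
        best = max(results, key=results.get)
        return best, results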

Comparative evaluation of common FS approaches

As an alternative strategy, conventional feature selection (FS) approaches including minimum redundancy maximum relevance (mRMR)15, Wilcoxon rank-sum testing (WLCX)16, and least absolute shrinkage and selection operator (LASSO)17 were implemented. All three FS methods were utilized in conjunction with feature variability screening, where radiomic features that did not meet a pre-defined threshold for their feature variability measure were not utilized in downstream analysis. Threshold values for different variability measures were selected based on the literature; specifically, radiomic features with IS > 0.2525, CV > 0.565, or ICC < 0.8573 were excluded prior to FS. When considering multiple sources of variability, a sequential elimination process was employed where only those radiomic features were retained that met all relevant thresholds for the corresponding variability measures.
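A minimal sketch of this sequential elimination is given below, reading the thresholds above as IS > 0.25, CV > 0.5, and ICC < 0.85 once the citation superscripts are separated out (all variable names are illustrative):

    import numpy as np

    def variability_screen(feature_ids, IS=None, CV=None, ICC=None,
                           is_max=0.25, cv_max=0.5, icc_min=0.85):
        # Sequentially drop features failing any applicable variability criterion.
        keep = np.ones(len(feature_ids), dtype=bool)
        if IS is not None:
            keep &= IS <= is_max       # exclude features with IS above the threshold
        if CV is not None:
            keep &= CV <= cv_max       # exclude features with CV above the threshold
        if ICC is not None:
            keep &= ICC >= icc_min     # exclude features with ICC below the threshold
        return np.asarray(feature_ids)[keep]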

Experimental evaluation

All five cohorts were partitioned into discovery and validation sets, as summarized in Table 1. The evaluation of feature sets, selected via each of VaRFS, mRMR, LASSO, and WLCX, was carried out by building a Random Forests classifier (RF) for the binary classification tasks in each cohort. The RF classifier was chosen due to its well-documented proficiency in handling high-dimensional, potentially correlated features, and its robustness against overfitting74. Moreover, RF can estimate the importance of features, which provides additional insight into the data75. In this study, the RF classifier was configured with 50 trees, a maximum depth of 50, and 100 leaf samples.
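For illustration, an approximately equivalent classifier configuration in scikit-learn is shown below; the original models were built in MATLAB, so the mapping of “100 leaf samples” to min_samples_leaf is an assumption.

    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(
        n_estimators=50,        # 50 trees
        max_depth=50,           # maximum tree depth of 50
        min_samples_leaf=100,   # assumed interpretation of "100 leaf samples"
        random_state=0,
    )
    # rf.fit(X_train[:, selected_features], y_train)  # selected_features from the chosen FS scheme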

In all experiments, the RF classifier was first trained and optimized on the discovery cohort using 100 runs of nested 10-fold cross-validation. All model selection and thresholding steps were confined to the training set, within which the average classifier performance was estimated. Based on their formulation, distinct methodologies were employed to determine the top-ranked features and construct a final optimized RF model for hold-out validation when considering statistical FS (mRMR and WLCX) vs optimization-based FS (LASSO and VaRFS). For mRMR and WLCX, the most frequently selected features were identified based on their average rank value across all cross-validation runs. This top-ranked feature set was then utilized to construct a single RF classifier that was evaluated in a holdout fashion on the validation cohort. For LASSO and VaRFS, the best-performing RF model (and a corresponding set of selected features) was identified across all cross-validation runs. This model was then directly evaluated in a holdout fashion on the validation cohort. While these strategies aligned with the operational design of each FS method, an averaging-based approach across all cross-validation runs for each method was additionally implemented to confirm performance trends. Additionally, experimental evaluation was repeated using a Linear Discriminant Analysis (LDA) classifier76 for evaluating performance differences between VaRFS and comparator methods. All experiments were conducted in MATLAB 9.9 on a 64-bit Windows 10 PC with an Intel(R) Core(TM) i7 CPU 930 (3.60 GHz) and 32 GB RAM.

In all cases, the area under the receiver operating characteristic curve (AUC) was used as a measure of classifier performance. Statistical comparisons were conducted to assess differences in AUC values between VaRFS and baseline methods. For the training set, a two-tailed Wilcoxon signed-rank test (significance level p < 0.005) was employed using repeated cross-validation results, consistent with prior studies. For the validation set, the DeLong test77 was applied to evaluate statistical differences between the ROC curves of VaRFS and baseline methods (since no cross-validation was involved).

A color-coded chord diagram was generated to visualize relationships and connections between top-selected features identified via different FS schemes. Additionally, feature importance was computed via their Shapley (SHAP) values78 rather than feature rank (in mRMR or WLCX) or feature coefficient (in LASSO or VaRFS). The Shapley value is the average marginal contribution of a feature over all possible coalitions79, providing a natural way to compute how much each feature contributes to predictive performance. A parallel coordinate plot (PCP) was constructed80 to analyze trends of the top-ranked VaRFS features in terms of multiple variability measures as well as their discriminability. Finally, a model-level error analysis was conducted to quantify trends in false-positive and false-negative instances across mRMR, WLCX, LASSO, and the two VaRFS variants (single and multiple variability measures), to identify which errors were unique and which were in common between different approaches. This analysis was used to generate an UpSet-style visualization81, enabling direct comparison of error rates as well as distinctiveness in erroneous samples between approaches.

Ethics approval and informed consent

All datasets used in this study comprised de-identified imaging data with appropriate institutional approvals. For the C1 cohort, data and expert annotations of tumor extent were provided under the Institutional Review Board (IRB) protocol #02-13-42C, approved by the University Hospitals of Cleveland IRB. For the C2 and C3 cohorts, this HIPAA-compliant, retrospective study was approved by IRBs at three institutions: University Hospitals Cleveland Medical Center (UHCMC, #07-16-40), Cleveland Clinic Foundation (CCF, #18-427), and Case Western Reserve University (STUDY20240128). A waiver for the requirement of informed consent was granted, as only de-identified patient data were utilized. For the C4 and C5 cohorts, approval was obtained from the University Hospitals Cleveland Medical Center IRB under protocol #11-15-24.