Introduction

In 2022, breast cancer was the most frequently diagnosed malignancy and the leading cause of cancer-related death among women worldwide, accounting for over 2.3 million new cases and approximately 670,000 deaths annually1,2. In Spain, it remains the most common cancer among women, with more than 37,682 new cases in 2024, and it is the third leading cause of cancer mortality among Spanish women3. Despite advances in early detection and treatment, prognosis varies considerably depending on tumor subtype, stage at diagnosis, and other clinical and pathological characteristics4.

Breast cancer is a biologically heterogeneous disease4. Current molecular classification systems categorize it into four major intrinsic subtypes—Luminal A, Luminal B, HER2-positive, and triple-negative breast cancer (Basal-like)—each with distinct clinical, prognostic, and therapeutic profiles4,5,6. Differences also exist in patterns of detection: luminal tumors are more often screen-detected, while HER2-positive and Basal-like tumors tend to present symptomatically, at younger ages and with more aggressive features4,7. This biological and clinical diversity underscores the need for complementary biomarkers that can improve risk stratification and support individualized clinical decision-making.

Early detection plays a crucial role in reducing breast cancer mortality and improving long-term outcomes. By identifying cancer at earlier, more treatable stages, the likelihood of successful treatment and survival increases significantly8,9.

Mammography remains the cornerstone of population-based screening programs. While it has contributed significantly to mortality reduction, its sensitivity is limited, especially in women with dense breast tissue9,10. Consequently, it is often complemented by ultrasound, while magnetic resonance imaging (MRI), with higher sensitivity, is reserved for high-risk patients or cases with inconclusive mammographic findings9.

Despite these diagnostic tools, contemporary challenges in breast cancer diagnosis and treatment persist. These include incomplete implementation of screening programs, limited methods for real-time monitoring of tumor progression, and the lack of robust, subtype-specific biomarkers to guide therapeutic decisions9.

MicroRNAs (miRNAs) are small (~22 nucleotides), non-coding RNAs that regulate gene expression post-transcriptionally and are implicated in several hallmarks of cancer, including proliferation, apoptosis, invasion, metastasis, and immune evasion5,11,12,13. It is estimated that they regulate up to 60% of protein-coding genes14,15, and depending on the cellular context, they can function as tumor suppressors or oncogenes11,13.

Importantly, miRNAs can be detected in extracellular fluids such as serum and plasma, where they are highly stable due to encapsulation in extracellular vesicles or binding to protein complexes11,16. These properties make circulating miRNAs attractive candidates for non-invasive cancer biomarkers, with potential applications in diagnosis, prognosis, and treatment monitoring13,16.

Several studies have proposed circulating miRNA panels associated with breast cancer risk or progression, but many have limitations such as small sample sizes, inconsistent methodologies, or lack of molecular subtype-specific analyses17,18. In our previous work using the MCC-Spain cohort, we identified serum-based miRNA signatures capable of distinguishing breast cancer cases from healthy controls, as well as across different clinical states19. However, the potential of circulating miRNAs to differentiate intrinsic molecular subtypes of breast cancer remains underexplored, particularly in multicenter case–control studies with harmonized clinical annotation and robust statistical modelling.

Recent studies have explored the diagnostic and prognostic value of circulating miRNAs in breast cancer, including early efforts to classify molecular subtypes based on serum or plasma profiles20,21. Some subtype-specific patterns have been suggested, such as downregulation of miR-101-3p in Basal-like22 or upregulation of miR-423-3p in hormone receptor–positive tumors23, but findings remain inconsistent due to differences in patient selection, sample processing, and statistical methodology19.

To date, few population-based studies have systematically evaluated circulating miRNAs across the major breast cancer subtypes using multivariable penalized regression models with internal validation. Evidence is particularly scarce from case–control designs with matched controls, harmonized pre-analytical procedures, and clinically confirmed molecular subtypes.

In this context, the primary objective of the present study was to evaluate the association between a panel of 46 circulating miRNAs and breast cancer molecular subtype (Luminal A, Luminal B, HER2-positive, and Basal-like), using serum samples from a multicenter case–control study. Supplementary Table 1 provides a detailed overview of the selected miRNAs, including their putative biological functions and previously reported associations with breast cancer biology.

Methods

Study population

The MCC-Spain study is a multicenter, population-based case–control study conducted across ten Spanish provinces (Asturias, Barcelona, Cantabria, Girona, Guipúzcoa, Huelva, León, Madrid, Navarra, and Valencia). For the present analysis, a total of 444 women were included: 317 incident breast cancer cases and 127 population-based controls. Participants from the pilot Screening Phase (n = 40) were excluded to ensure independence of the validation cohort.

Breast cancer cases included women aged 27–85 years with newly diagnosed, histologically confirmed invasive breast cancer between 2008 and 2013. Molecular subtypes, Luminal A-like, Luminal B-like, HER2-enriched-like, and Basal-like, were defined based on immunohistochemical (IHC) profiling, including estrogen receptor (ER), progesterone receptor (PR), HER2 status, and Ki-67 proliferation index, in accordance with the St. Gallen International Consensus Guidelines24. In this study, the Basal-like subtype corresponds to triple-negative breast cancer (ER-negative, PR-negative, HER2-negative) as defined by IHC.

Controls were women with no previous history of cancer, residing in the same catchment areas as cases. They were frequency-matched to cases by age (in 5-year intervals) and geographic region, and were randomly selected from general practitioner lists at participating primary healthcare centers.

Recruitment protocols and data collection procedures have been described in detail elsewhere25,26. All participants provided written informed consent. The study protocol was approved by the Ethics Committees of all participating institutions and was registered with the Spanish Data Protection Agency (Registration No. 2102672171). Further details are available at http://www.mccspain.org. All procedures were performed in accordance with relevant ethical guidelines and regulations.

Data collection

Face-to-face interviews conducted by trained personnel gathered information on demographic and epidemiological covariates, including age, body mass index (BMI), educational level, menopausal status, family socioeconomic status, smoking habits, alcohol consumption, use of hormonal contraceptives, history of hormone replacement therapy (HRT), and leisure-time physical activity. Dietary intake was assessed using a validated food frequency questionnaire and expressed as grams per day (g/day) and total energy intake (kcal/day).

Clinical-pathological information was obtained from medical records, including tumor receptor status (estrogen receptor [ER], progesterone receptor [PR], and HER2), Ki-67 proliferation index, and histological subtype. Breast cancer cases were stratified into molecular subtypes (Luminal A, Luminal B, HER2-positive, and Basal-like) according to the St. Gallen International Consensus Guidelines24.

Biological samples

Blood samples were collected from both cases and controls in the early morning after an overnight fast, following a standardized protocol. Samples were centrifuged at 3000 × g for 20 min at 10 °C, followed by a second centrifugation at 15,000 × g for 10 min at the same temperature to remove residual cellular debris. The resulting serum aliquots were stored at –80 °C until analysis.

Screening phase

The selection of candidate miRNAs followed a two-phase design (screening and validation) consistent with our previous work19, combining a structured literature review with an exploratory small RNA sequencing phase conducted in 40 serum samples. For the present study, we expanded this framework to specifically evaluate molecular subtypes of breast cancer (Luminal A, Luminal B, HER2-positive, and Basal-like), incorporating a larger and clinically diverse set of cases, including more advanced tumors. An updated review of the literature (2020–2024) focusing on subtype-specific circulating miRNAs was also performed to refine biomarker selection and develop subtype-specific predictive models.

Ten control women and 30 breast cancer cases were selected. As this pilot set was originally recruited for a previous analysis of detection modes, the distribution of molecular subtypes was unbalanced (13 Luminal A, 11 Luminal B, 2 Basal-like, 0 HER2-positive, and 4 unclassified/other; Supplementary Table 5). Consequently, this screening set was not powered to identify subtype-specific markers for HER2 or Basal-like tumors, necessitating the integration of literature-based candidates for these subgroups. These 40 participants were excluded from the validation phase to avoid overlap.

RNA isolation and library preparation

RNA was isolated from serum samples using the miRNeasy Serum/Plasma Kit (QIAGEN) at QIAGEN Genomic Services, following the manufacturer’s instructions. Libraries were prepared with the QIAseq miRNA Library Kit (QIAGEN), and quality control was performed with a Bioanalyzer 2100 (Agilent) or TapeStation 4200 (Agilent). A total of 200 µL of RNA were used to generate miRNA NGS libraries. Adapters containing Unique Molecular Indexes (UMIs) were ligated to eliminate amplification bias. The RNA was reverse-transcribed into cDNA, amplified by PCR, and purified. Library pools were quantified by qPCR and sequenced on a NextSeq 500 (Illumina) platform.

Bioinformatics processing

Raw FASTQ files were generated for each sample. Reads were processed using Cutadapt for UMI correction, Bowtie2 for alignment to Homo sapiens miRNA sequences from miRBase v22.1, and edgeR (Bioconductor) for differential expression analysis. Counts were normalized using the Trimmed Mean of M-values (TMM) method and log₂-transformed for downstream statistical analyses. Sequence annotation was performed against the GRCh37 human reference genome and miRBase v22.1.

Quality control

Rigorous quality control procedures were applied, including:

  (i) UMI collapsing (unique sequence/UMI combination required),
  (ii) Phred quality score thresholds,
  (iii) Read length > 15 nucleotides,
  (iv) Mapping to GRCh37 reference genome, and
  (v) Background filtering to remove low-abundance reads.

Samples not meeting these criteria were excluded from the dataset.
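As an illustration only, criteria (i) to (iii) and (v) above can be sketched in a few lines of Python. This is not the pipeline run by the sequencing provider: the `Read` record, the mean-Phred cutoff of 30, and the minimum molecule count of 2 are hypothetical placeholders, because the text does not state the actual thresholds. Reference mapping (iv) is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    sequence: str  # insert sequence after adapter trimming
    umi: str       # unique molecular index ligated to the molecule
    phred: tuple   # per-base Phred quality scores


MIN_LENGTH = 16      # "read length > 15 nucleotides"
MIN_MEAN_PHRED = 30  # hypothetical cutoff, not stated in the text
MIN_MOLECULES = 2    # hypothetical background threshold, not stated in the text


def passes_read_qc(read):
    """Apply the length and quality filters to a single read."""
    if len(read.sequence) < MIN_LENGTH:
        return False
    return sum(read.phred) / len(read.phred) >= MIN_MEAN_PHRED


def collapse_umis(reads):
    """Keep one read per unique (sequence, UMI) combination."""
    seen = set()
    kept = []
    for read in reads:
        key = (read.sequence, read.umi)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


def count_molecules(reads):
    """Count UMI-collapsed molecules per sequence and drop low-abundance ones."""
    counts = {}
    for read in collapse_umis([r for r in reads if passes_read_qc(r)]):
        counts[read.sequence] = counts.get(read.sequence, 0) + 1
    return {seq: n for seq, n in counts.items() if n >= MIN_MOLECULES}
```

Collapsing on the (sequence, UMI) pair means PCR duplicates of one original molecule are counted once, which is the purpose of the UMI step described above.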

The most promising miRNAs identified in this exploratory screening were considered for inclusion in the validation panel, along with additional candidates selected from external sources.

Validation of miRNAs based on previous studies

This phase included 444 participants (127 cancer-free controls and 317 breast cancer cases) recruited from the MCC-Spain study across ten Spanish provinces. Blood samples were collected prior to any oncological treatment (surgery, chemotherapy, or radiotherapy). Cases were stratified into molecular subtypes (Luminal A, Luminal B, HER2-positive, and Basal-like) according to the St. Gallen International Consensus Guidelines24.

A total of 46 miRNAs were selected for validation based on two complementary sources: (1) updated literature review on subtype-specific circulating signatures (2020–2024)27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119, and (2) differentially expressed miRNAs identified in our screening phase.

Two endogenous control miRNAs (let-7d-5p and let-7i-5p) were monitored for technical stability and were not included in differential expression or predictive modeling. Expression values were normalized using TMM. Differential expression analysis was conducted for 44 miRNAs, and 12 were retained using LASSO regression models. Candidate miRNAs were chosen according to prior evidence and biological plausibility as circulating biomarkers of breast cancer (see Supplementary Table 1).

RNA extraction and RT-qPCR

The miRNA expression analysis was performed at QIAGEN Genomic Services using standardized protocols. Serum samples were thawed on ice and centrifuged at 3,000 × g for 5 min at 4 °C. From each sample, 200 µL of serum were transferred into FluidX tubes, and 60 µL of Buffer RPL (containing 1 µg of carrier RNA and an RNA spike-in template mix) were added. After vortexing (1 min) and incubation (7 min at room temperature), 20 µL of Buffer RPP was added. Total RNA was extracted with the miRNeasy Serum/Plasma Advanced Kit (QIAGEN) and eluted into 50 µL.

For reverse transcription, 2 µL of extracted RNA were used in 10 µL reactions with the miRCURY LNA RT Kit (QIAGEN). The cDNA was diluted 50× and analyzed via quantitative real-time PCR (RT-qPCR) using a custom miRCURY LNA PCR panel and SYBR Green master mix (QIAGEN). Amplifications were performed on a LightCycler® 480 Real-Time PCR System (Roche) in 384-well plates. Negative controls without RNA templates were included in each batch.

Data normalization and quality control

Amplification curves were analyzed with Roche LC software using the second derivative method to determine Cq values and assess melting curve profiles. Endogenous miRNAs let-7d-5p and let-7i-5p, consistently detected across all samples, were used for normalization. Data were normalized using the TMM method120, and normalized values were subsequently log₂-transformed for statistical analyses.

Additional QC criteria included:

  (i) exclusion of reactions with > 1 melting temperature,
  (ii) melting temperatures deviating from reference values, and
  (iii) low amplification efficiency.

RT-qPCR assays were performed in duplicate (or triplicate when feasible) to ensure reliability and minimize measurement bias.
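To illustrate how reference miRNAs can anchor Cq values, the sketch below applies a conventional ΔCq calculation using the two endogenous controls named in this subsection. This is an assumption for illustration, not the TMM procedure cited above. The sign convention (reference minus target, so larger values mean higher expression on a log₂ scale) and the function names are also assumptions.

```python
from statistics import mean

# Endogenous controls named in the text.
REFERENCE_MIRNAS = ("let-7d-5p", "let-7i-5p")


def average_replicates(replicate_cqs):
    """Average technical replicates (duplicates or triplicates) per miRNA."""
    return {name: mean(values) for name, values in replicate_cqs.items()}


def delta_cq(sample_cq):
    """Return reference-normalized values: mean reference Cq minus target Cq.

    A lower Cq means earlier amplification, so a larger result means
    higher relative expression on an approximately log2 scale.
    """
    reference = mean(sample_cq[name] for name in REFERENCE_MIRNAS)
    return {
        name: reference - cq
        for name, cq in sample_cq.items()
        if name not in REFERENCE_MIRNAS
    }
```

For example, a sample with reference Cq values of 24 and 26 and a target Cq of 23 yields a normalized value of 2 for that target.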

Statistical analysis

Analyses were performed for the overall model (all breast cancer cases vs. controls) and for models stratified by molecular subtype (Luminal A, Luminal B, HER2-positive, and Basal-like).

Normalized expression values (log₂-transformed) from controls were compared with each case group using Student’s t-test without further adjustment. Results are reported as log fold change (logFC), p values, and false discovery rate (FDR), calculated using the Benjamini–Hochberg method121. All reported p values are two-sided and unadjusted for multiple testing; the FDR is reported alongside them as the multiplicity-adjusted measure. Consequently, associations with an unadjusted p value < 0.05 are considered nominal and should be interpreted as exploratory. A positive logFC indicates overexpression in cases relative to controls, whereas a negative logFC reflects underexpression. The greater the absolute logFC value, the larger the difference between groups.
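The two reported quantities that do not require a t-distribution can be sketched directly: the logFC as a difference of group means on the log₂ scale, and the Benjamini–Hochberg adjustment of a vector of p values. The function names are illustrative, and the p values are assumed to come from the t-tests described above.

```python
from statistics import mean


def log_fold_change(cases_log2, controls_log2):
    """Difference of mean log2 expression: positive means higher in cases."""
    return mean(cases_log2) - mean(controls_log2)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, so that each adjusted value is
    # the minimum of p * m / rank over its own rank and all higher ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With 44 miRNAs tested per comparison, a nominal p value of 0.05 at rank 1 would correspond to an adjusted value of up to 1 (capped), which is why nominal associations are treated as exploratory.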

To assess the predictive performance of circulating miRNAs, we applied LASSO (least absolute shrinkage and selection operator) penalized logistic regression122. This approach enables the inclusion of multiple predictors while shrinking coefficients of less informative variables toward zero, thus generating parsimonious models. The analysis began with an initial model that included all candidate miRNAs selected during the validation phase. The regularization parameter (λ) was chosen through tenfold cross-validation.
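The shrinkage behavior described above can be illustrated with a minimal L1-penalized logistic regression fitted by proximal gradient descent. This is a teaching sketch in plain Python and not the Stata routine used in the study. The step size, iteration count, and function names are assumptions, and the cross-validated choice of λ is not shown.

```python
import math


def soft_threshold(value, threshold):
    """Shrink value toward zero; return exactly 0.0 if |value| <= threshold."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_lasso_logistic(X, y, lam, step=0.1, n_iter=2000):
    """Minimize mean log-loss + lam * sum(|beta_j|); intercept unpenalized."""
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residuals: predicted probability minus observed label.
        residuals = [
            sigmoid(intercept + sum(b * x for b, x in zip(beta, row))) - yi
            for row, yi in zip(X, y)
        ]
        grad_intercept = sum(residuals) / n
        grad = [sum(r * row[j] for r, row in zip(residuals, X)) / n
                for j in range(p)]
        intercept -= step * grad_intercept
        # Gradient step followed by the L1 proximal (soft-threshold) step.
        beta = [soft_threshold(b - step * g, step * lam)
                for b, g in zip(beta, grad)]
    return intercept, beta
```

The soft-threshold step is what sets uninformative coefficients to exactly zero, so variable selection and estimation happen in the same fit. A larger λ removes more predictors.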

Odds ratios (ORs) with 95% confidence intervals (CIs) were obtained by refitting standard logistic regression models that included only the variables selected by LASSO. Model discrimination, reflecting the ability to distinguish cases from controls, was evaluated by calculating the area under the receiver operating characteristic curve (AUC) with bootstrap bias-corrected 95% CIs. To obtain out-of-sample estimates of performance, AUC values were additionally estimated using repeated stratified fivefold cross-validation.
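For readers less familiar with these metrics, the sketch below computes the AUC as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control, and builds stratified folds that preserve the case/control ratio. It is illustrative only: the function names and the seed are assumptions, and the bootstrap confidence intervals are not shown.

```python
import random


def auc(scores, labels):
    """Rank-based AUC: P(case score > control score), ties counted as 0.5."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def stratified_folds(labels, k=5, seed=0):
    """Split indices into k folds, dealing each class out evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds
```

In a repeated cross-validation, the model would be refitted on four folds and scored on the held-out fold, with the whole procedure repeated under different seeds and the resulting AUCs averaged.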

Each model was re-estimated after adjusting for demographic and epidemiological covariates (age, body mass index [BMI], menopausal status, family history of breast cancer, smoking status, alcohol intake, use of hormonal contraceptives, history of hormone replacement therapy, educational level, and recruitment area). Variables that showed notable differences between cases and controls (Supplementary Table 2) were considered potential confounders and included as covariates, together with additional covariates supported by prior epidemiological evidence. This strategy ensured appropriate control for confounding.

All statistical analyses were performed using Stata 18/SE (StataCorp, College Station, TX, USA).

Results

Characteristics of the study population

Supplementary Table 2 summarizes the baseline characteristics of the 444 women included in the analysis (317 breast cancer cases and 127 controls). Several variables showed different distributions between cases and controls, including menopausal status (premenopausal women: 38.2% vs. 26.8%; p = 0.023), family history of breast cancer (29.4% vs. 15.6%; p = 0.003), hormonal contraceptive use (50.3% vs. 37.0%; p = 0.009), and hormone replacement therapy (less frequent in cases; p = 0.056). The distribution of recruitment areas also showed differences between groups (p < 0.001).

In contrast, no notable differences were observed in age, body mass index (BMI), educational level, alcohol intake, smoking status, parity, age at first childbirth, total energy intake, or adherence scores to the World Cancer Research Fund (WCRF) lifestyle guidelines.

Table 1 further details the distribution of breast cancer molecular subtypes. Among the 317 cases, Luminal A was the most frequent (n = 142; 44.8%), followed by Basal-like (n = 86; 27.1%), HER2-positive (n = 47; 14.8%), and Luminal B (n = 42; 13.2%).

Table 1 Distribution of study participants by molecular subtype of breast cancer and control status.

Clinical and pathological characteristics of breast cancer cases

Supplementary Table 3 describes the clinical and pathological features of the 317 women diagnosed with breast cancer, including tumor histology, stage, receptor status, and grade. This description illustrates the clinical heterogeneity of the study population, which was considered in the subtype-stratified analyses and is essential to interpreting miRNA-based model performance.

The most common histological subtype was ductal carcinoma (83.9%), followed by lobular carcinoma (4.4%) and other subtypes (10.1%).

Tumor staging data revealed that just over half of the tumors were classified as T1 (51.7%), with T2 tumors accounting for 29.0%. In situ tumors (Tis) and T0 lesions each represented 2.5% of cases. Regional lymph node involvement was present in 45.1% of patients (N1–N3), while 2.8% had distant metastases at diagnosis (M1). AJCC clinical staging indicated that most patients were diagnosed at early stages (Stage I–II: 70.3%), with 14.8% at stage III and only 2.8% at stage IV. Based on diagnosis timing, 79.3% were classified as early-stage cases.

Regarding receptor status, 57.7% of tumors were estrogen receptor (ER)–positive, 51.1% progesterone receptor (PR)–positive, and 24.0% HER2-positive. Based on surrogate definitions, molecular subtypes were distributed as follows: Luminal A (44.8%), Luminal B (13.2%), HER2-positive (14.8%), and Basal-like (27.1%).

Histological grade was available in 64% of cases. Expressed as a proportion of all cases, 27.4% of tumors were poorly differentiated (grade III), 22.7% moderately differentiated (grade II), and 14.2% well differentiated (grade I).

miRNA signatures in breast cancer

Table 2 presents the multivariate analysis comparing miRNA expression between breast cancer cases and controls. Among the miRNAs associated with an increased risk of breast cancer, miR-21-5p showed the highest odds ratio in the fully adjusted model (OR = 27.11, 95% CI 6.98–105.33), followed by miR-423-3p (OR = 4.65, 95% CI 1.99–10.82), miR-331-3p (OR = 2.19, 95% CI 1.14–4.24), miR-139-5p (OR = 1.81, 95% CI 1.16–2.83), and miR-141-3p (OR = 1.44, 95% CI 1.03–2.01). These miRNAs tended to be overexpressed in cases compared to controls.
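As a reading aid, a reported OR and its 95% CI can be mapped back to the log-odds scale on which logistic regression coefficients are estimated. The sketch below assumes that the intervals are Wald intervals, symmetric on the log scale. That assumption is not stated in the text, although the miR-21-5p estimate above is numerically consistent with it. Function names are illustrative.

```python
import math


def log_odds_from_ci(odds_ratio, ci_low, ci_high, z=1.96):
    """Recover the coefficient and its standard error from an OR and 95% CI.

    Assumes a Wald interval: exp(beta - z * se) to exp(beta + z * se).
    """
    beta = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    return beta, se


def wald_ci(beta, se, z=1.96):
    """Rebuild the OR confidence limits from a coefficient and its SE."""
    return math.exp(beta - z * se), math.exp(beta + z * se)


# Reported estimate for miR-21-5p in the fully adjusted model.
beta, se = log_odds_from_ci(27.11, 6.98, 105.33)
low, high = wald_ci(beta, se)
```

For miR-21-5p this gives a coefficient of about 3.30 with a standard error of about 0.69. Because the OR is the exponential of the coefficient, even moderately large coefficients translate into very large ORs, and wide intervals on the log scale become very wide on the OR scale.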

Table 2 Differential expression of candidate circulating miRNAs in breast cancer cases versus controls.

Conversely, several miRNAs were associated with a decreased risk of breast cancer. Notably, miR-146a-5p demonstrated a strong inverse association (OR = 0.23, 95% CI 0.10–0.52), as did miR-29a-3p (OR = 0.31, 95% CI 0.17–0.57), miR-101-3p (OR = 0.34, 95% CI 0.16–0.75), miR-142-3p (OR = 0.37, 95% CI 0.17–0.81), and miR-186-5p (OR = 0.36, 95% CI 0.19–0.70). These results are consistent with lower expression levels in cases across different models.

In addition to the model adjusted for all miRNAs included in the final panel, a more comprehensive model was developed that further adjusted for demographic and epidemiological variables (age, menopausal status, BMI, family history of breast cancer, smoking status, alcohol intake, use of hormonal contraceptives, history of HRT, education level, and region of recruitment). These additional adjustments did not substantially alter the direction or magnitude of the associations, reinforcing the robustness of the identified profiles. For instance, miR-21-5p remained the miRNA most strongly associated with increased breast cancer risk, while miR-101-3p, miR-146a-5p, and miR-29a-3p retained inverse associations, consistent with lower expression levels in cases.

These consistent associations across models suggest that a subset of circulating miRNAs may serve as candidate biomarkers warranting further evaluation for breast cancer risk assessment.

Subtype-specific associations

In Luminal A tumors (Table 3), miR-423-3p and miR-21-5p showed marked associations (OR = 3.81 and OR = 14.16, respectively). In contrast, miR-186-5p (OR = 0.19), miR-29a-3p (OR = 0.30), and miR-146a-5p (OR = 0.32) showed consistent inverse associations. miR-141-3p was also retained in the adjusted model (OR = 1.61), suggesting that its association became more evident after accounting for demographic and epidemiological variables.

Table 3 Differential expression of candidate circulating miRNAs in Luminal A breast cancer cases versus controls.

The Luminal B subgroup was characterized by an inverse association for miR-186-5p (OR = 0.27). miR-101-3p showed an OR of 0.39 in crude analysis and 0.45 after full adjustment, indicating a stable direction of association with modest attenuation (Table 4).

Table 4 Differential expression of candidate circulating miRNAs in Luminal B breast cancer cases versus controls.

Among HER2-positive tumors, miR-21-5p showed an extremely elevated OR (approximately 1.3 × 10⁶) with an extremely wide confidence interval, reflecting limited stability of the estimate, likely influenced by adjustment for recruitment region and the small sample size. miR-101-3p and miR-146a-5p showed strong inverse associations (both OR ≈ 0.01). Other miRNAs such as miR-29a-3p and miR-331-3p showed extreme ORs with wide intervals, again reflecting instability (Table 5).

Table 5 Circulating miRNA profiles in HER2-positive breast cancer versus controls.

In Basal-like tumors, the largest OR was again observed for miR-21-5p (OR = 4.6 × 10⁶), although the very wide confidence intervals suggest model instability, likely exacerbated by adjustment for recruitment region. miR-331-3p and miR-423-3p also showed strong positive associations. In contrast, miR-101-3p, miR-142-3p, miR-146a-5p, and miR-29a-3p consistently showed ORs close to zero, reflecting inverse associations that remained stable in direction after adjustment (Table 6).

Table 6 Circulating miRNA profiles in Basal-like breast cancer versus controls.

Subtype-specific circulating miRNA profiles

Table 7 synthesizes the patterns across subtypes. miR-21-5p and miR-423-3p tended to show higher odds across most subtypes. Conversely, miR-146a-5p, miR-29a-3p, and miR-101-3p were consistently retained with ORs below 1, suggesting potential inverse associations. miR-186-5p showed divergent behavior: inverse in Luminal A and B but positive in HER2-positive tumors.

Table 7 Subtype-specific adjusted odds ratios (OR) for selected circulating miRNAs in breast cancer.

As expected, models for HER2-positive and Basal-like tumors yielded extreme OR values with wide confidence intervals, likely reflecting smaller sample sizes and model instability, potentially exacerbated by adjustment for recruitment area.

Predictive performance of miRNA-based models

We first assessed the discriminatory performance of miRNA-based panels selected through LASSO regression for the detection of breast cancer and its molecular subtypes. Initial models included only circulating miRNA expression levels (unadjusted models), followed by models incorporating demographic and epidemiological covariates. Performance was evaluated using both LASSO-based AUCs and repeated cross-validated AUCs, with results summarized in Fig. 1 and Supplementary Table 4.

Fig. 1

Receiver operating characteristic (ROC) curves for miRNA-based models with and without clinical adjustment across breast cancer subtypes.

Figure 1 displays ROC curves for models distinguishing (A) all breast cancer cases from controls and stratified by molecular subtypes: (B) Luminal A, (C) Luminal B, (D) HER2-positive, and (E) Basal-like. Dashed lines represent models based solely on circulating miRNAs, while solid lines correspond to models additionally adjusted for age, BMI, menopausal status, family history of breast cancer, smoking, alcohol use, hormonal therapy (contraceptives/HRT), education level, and recruitment region.

In the overall analysis (Fig. 1A), adjustment substantially improved discrimination, increasing the AUC from 0.802 to 0.908. The largest gain was observed in Luminal A (Fig. 1B), with the AUC rising from 0.751 to 0.896. A similar improvement was seen in Luminal B tumors (Fig. 1C), where the AUC rose from 0.768 to 0.894. In HER2-positive tumors (Fig. 1D), adjustment led to a modest improvement in discrimination (AUC from 0.914 to 0.965). In Basal-like tumors (Fig. 1E), where the unadjusted model already showed excellent performance, covariates still increased the AUC from 0.950 to 0.989.

Supplementary Table 4 further details these comparisons, presenting both LASSO-derived AUCs and mean AUCs from repeated fivefold cross-validation. In the general model, cross-validated AUC increased from 0.778 (95% CI 0.765–0.792) to 0.869 (95% CI 0.849–0.889) after adjustment, supporting the consistency of the observed improvements.

The effect was most pronounced in Luminal A and Luminal B. In Luminal A, the cross-validated AUC rose from 0.719 (95% CI 0.701–0.736) to 0.829 (95% CI 0.797–0.861). In Luminal B, the adjusted AUC reached 0.845 (95% CI 0.820–0.870) compared with 0.749 (95% CI 0.713–0.785) in the unadjusted model. These results reinforce the added predictive value of integrating demographic and lifestyle variables in hormone receptor–positive subtypes, which are typically more heterogeneous.

For HER2-positive and Basal-like tumors, the unadjusted models already showed strong discrimination, and adjustment did not materially change cross-validated performance. The adjusted cross-validated AUC remained within a similar range, showing a slight decrease to 0.911 (95% CI 0.876–0.946) for HER2-positive tumors. Similarly, for Basal-like tumors, the adjusted cross-validated AUC remained consistent at 0.928 (95% CI 0.911–0.946). Taken together, these findings suggest that the integration of molecular biomarkers with demographic and lifestyle information provides a robust and consistent framework for breast cancer subtype classification and supports high predictive accuracy across different model specifications.

Discussion

In our multicenter study, we found that circulating miRNA panels selected by LASSO achieved strong discriminatory performance for breast cancer detection, particularly in aggressive subtypes. The Basal-like and HER2-positive models reached AUCs of 0.989 and 0.965, respectively, while Luminal A and B subtypes showed more moderate classification performance (AUCs of 0.894–0.896), with model performance improving after adjustment for epidemiological covariates. This highlights the potential contribution of miRNAs as predictive markers in aggressive tumors and as complementary components when combined with patient characteristics in more heterogeneous luminal subtypes.

Our findings are consistent with prior studies reporting comparable predictive performance using circulating miRNA signatures. For instance, Shimomura et al. identified a serum miRNA combination with AUC > 0.85 for early breast cancer detection, and Zhao et al. demonstrated the utility of plasma-based profiles in case–control models. More recently, Gómez-Acebo et al. confirmed in prostate cancer cohorts that LASSO-derived miRNA panels achieve high predictive accuracy (AUC = 0.824), supporting the consistency of this modelling approach across cancer types123,124,125. Biologically, the higher predictive accuracy observed in Basal-like and HER2-positive tumors in our data is consistent with the presence of highly active oncogenic programs that drive distinctive circulating miRNA signals; for example, oncomiRs such as miR-21-5p—implicated in epithelial–mesenchymal transition (EMT) and metastatic behavior—tend to be overexpressed in aggressive phenotypes, which may increase the detectability of tumor-derived signals in liquid biopsy108,126. Moreover, evidence that exosomal miR-21-5p attains high diagnostic accuracy127 supports the notion that vesicle-packaged miRNAs may concentrate tumor-derived molecules in circulation, contributing to the observed AUCs in aggressive subtypes, while pre-analytical and compartment differences (plasma vs. EVs) likely explain part of the variability reported across studies and subtypes37,67,70,128.

At the individual biomarker level, miR-21-5p was the most consistently upregulated in our data across subtypes, with higher expression levels in Basal-like and HER2-positive tumors, while miR-423-3p was also consistently upregulated across subtypes. In prior studies, miR-21-5p overexpression has been associated with advanced stage, nodal involvement, and poor outcomes, and with EMT/metastatic behavior in breast cancer stem cells108,126, whereas other works have reported associations with early disease or improved survival37,67,128. Exosomal miR-21-5p has shown high diagnostic performance, but the optimal biological source and pre-analytical conditions remain debated70,127.

For miR-423-3p, the evidence is also mixed: silencing reduced proliferation and was linked to PCNA/mutant p53 pathways in cell models129; overexpression was reported in ER+/PR+ tumors87 and in serum from cases19, while underexpression was described in HR+/HER2– serum cohorts49. Taken together, the positive associations we observed for miR-21-5p and miR-423-3p are consistent with their involvement in proliferative and EMT-related pathways and with the possibility that extracellular-vesicle packaging enriches tumor-derived miRNAs in circulation, potentially explaining the diagnostic signals noted in aggressive subtypes70,108,126,127.

miR-141-3p in our analyses showed a positive association that was observed only after full adjustment (e.g., an OR of 1.61; 95% CI, 1.04–2.48 in Luminal A), suggesting a potential relationship that was not apparent in crude models. Evidence is mixed: reduced expression and a tumor-suppressive role via HMGB1/HIF-1α have been reported, with lower levels in aggressive phenotypes and higher levels in HR-positive tumors52,130, whereas other studies found overexpression in high-grade disease and poorer survival, including Basal-like and HER2-amplified contexts28,131. A systematic review emphasizes pleiotropy modulated by subtype and microenvironment80. Our adjustment-dependent observation is therefore compatible with a context-sensitive role, potentially hypoxia-responsive, and with the possibility that patient-level factors (age, menopausal status, BMI) confound or influence its circulating association, aligning with the heterogeneity highlighted in these reports.
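
To make the precision of this estimate concrete, the following minimal sketch back-calculates the standard error implied by the reported interval for Luminal A. It assumes the interval is a Wald-type 95% CI built on the log-odds scale, which is an assumption made here for illustration and not a restatement of the fitted model; the function names are likewise illustrative.

```python
import math

def se_from_wald_ci(lower, upper, z=1.96):
    """Standard error of log(OR) implied by a Wald CI (assumed construction)."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

def or_with_wald_ci(beta, se, z=1.96):
    """Odds ratio and Wald CI from a log-odds coefficient and its standard error."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Values reported in the text for miR-141-3p in Luminal A
or_reported, ci_lower, ci_upper = 1.61, 1.04, 2.48

beta = math.log(or_reported)            # log-odds coefficient
se = se_from_wald_ci(ci_lower, ci_upper)
z_stat = beta / se                      # Wald z statistic
# Two-sided p value from the standard normal distribution
p_value = math.erfc(abs(z_stat) / math.sqrt(2))

print(f"beta={beta:.3f}, SE={se:.3f}, z={z_stat:.2f}, p={p_value:.3f}")
print("round trip:", [round(v, 2) for v in or_with_wald_ci(beta, se)])
```

Under this assumption the lower bound sits only slightly above 1, which is in keeping with treating the association as nominal.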

miR-331-3p showed nominal risk-oriented associations within the HER2-positive and Basal-like subgroups, albeit with very wide confidence intervals that reflect limited precision. Most prior reports indicate downregulation in triple-negative tissues with adverse clinicopathologic features and direct suppression of NRP2, lower levels in serum cohorts, and functional inhibition of migration/invasion in MDA-MB-231 models49,91,132. A potential explanation for this observed discrepancy could involve the combination of compartment and subgroup effects, including tissue versus serum differences, the small size of HER2/Basal-like strata in our data (reflected by wide CIs), and platform variability. These factors can potentially invert the direction observed in liquid biopsy while the intratumoral profile remains suppressive in tissue-based assays, as described by these studies.

For miR-139-5p, our data showed a nominal risk association with moderate overexpression in circulation, while multiple tissue-based studies describe downregulation and tumor-suppressive behavior, including in HER2-enriched, luminal B, and Basal-like/triple-negative contexts23,53,133,134,135. Other works, however, report oncogenic roles depending on regulatory interactions with lncRNAs and signaling pathways19,64. Mechanistically, miR-139-5p interfaces with DNA repair and cell-cycle genes (MAT2A, POLQ, TOP1/2A, XRCC5, NOTCH1, RAD54L) and with lncRNA networks, such as TTN-AS1, and participates in TGFβ, Wnt, Rho, and MAPK/PI3K signaling30,107,129,136,137. These pathways may provide a rationale for divergent roles: depending on pathway activation, subtype composition, and the circulating compartment, miR-139-5p may appear increased in blood while being suppressed in tumor tissue. This interpretation is consistent with observed differences between circulating and tissue-based profiles.

In contrast, miR-101-3p, miR-29a-3p, and miR-146a-5p were frequently underexpressed across multiple subtypes, including HER2-positive and Basal-like, yielding nominal inverse associations after adjustment and suggesting a potential role as circulating markers across subtypes. Prior work is largely aligned: miR-101-3p downregulation is linked to poor prognosis, with restoration after therapy and repression of oncogenic programs; mechanistically it targets JAK2, CXCR7, STMN1, AMPK, and EZH2, and can be sequestered by MALAT1 to activate mTOR/PKM2 signaling19,22,42,55,138,139,140. miR-29a-3p is underexpressed in tissue/serum studies with regulation of COL5A1 and AGR2 and associations with better prognosis at higher levels, though EMT-linked invasiveness has been described under specific experimental conditions35,77,95,141,142. For miR-146a-5p, human cohort data are mixed: higher levels have been related to resistance (trastuzumab/NAC) and worse outcomes in HER2+ disease, whereas other studies include it in signatures of lower relapse risk or better survival36,46,75,143,144. Biologically, our observed inverse associations are compatible with the repression of key anti-oncogenic checkpoints: sustained loss of miR-101-3p permits EZH2/NT5E and JAK2/AMPK axis activation; reduced miR-29a-3p de-represses ECM-remodeling nodes (COL5A1/AGR2); and the variability in miR-146a-5p likely reflects differential engagement of treatment-response pathways in HER2-enriched tumors22,36,42,46,55,75,95,139,141,142.

We also observed nominal subtype-dependent patterns, most notably for miR-186-5p, which was suppressed in Luminal A/B but elevated in HER2-positive tumors, and for miR-142-3p, which showed inverse associations in Basal-like while appearing elevated in some luminal cohorts. The literature mirrors this duality: downregulation of miR-186-5p in luminal tumors and Basal-like tissues/lines is linked to poorer survival and invasion113,145,146, whereas overexpression and FOXM1-related pro-proliferative activity have been described in triple-negative models59. For miR-142-3p, underexpression and suppressor roles are described in tissues and cell lines, targeting the HMG family and EMT axes, though serum studies in luminal disease have shown overexpression versus controls or between luminal subtypes19,39,45,100,147,148. These divergent behaviors may reflect context-specific regulation shaped by subtype biology and compartment (tissue vs. serum vs. EVs), which our stratified framework aims to characterize.

Finally, miR-142-3p showed nominal inverse associations in our Basal-like analyses, whereas serum studies in luminal disease have noted higher circulating levels versus controls or between luminal subtypes19,148,149. In contrast, tissue and cell-line work frequently shows underexpression linked to worse prognosis and metastasis, targeting the HMG family members and EMT-related axes, with supporting observations in clinical samples and Luminal A lines39,45,100,147,150,151. Together with our stratified findings, this is compatible with a potential suppressive role for miR-142-3p in Basal-like biology, while emphasizing how subtype and sample type may shape the observed circulating profile.

We observed substantial concordance between nominal differential expression and LASSO-based feature selection: 11 of the 12 miRNAs retained by LASSO showed observable differences in expression between cases and controls. This overlap supports the internal consistency of the pipeline and is compatible with the suggestion that penalized selection may capture potentially relevant associations. While LASSO does not optimize p values by design, we deliberately emphasized changes in odds ratios and their stability after adjustment, which is aligned with the predictive aim of the study. The agreement between LASSO-based performance and repeated cross-validated AUCs further reinforces our results, and is consistent with our prior experience in prostate cancer cohorts showing high discrimination with LASSO-derived serum miRNA panels (AUC = 0.824)123. Together, these findings suggest that the identified panels could reflect the underlying tumor biology while retaining practical predictive utility in a liquid-biopsy setting.

From a methodological standpoint, the close correspondence between LASSO AUCs and the mean AUCs obtained via repeated stratified fivefold cross-validation with bootstrap correction points toward limited optimism in predictive performance estimates. The observation that adjusted models showed improved discrimination, especially in Luminal A and B, suggests that demographic and epidemiological covariates may capture residual confounding or clinically relevant heterogeneity that pure molecular profiles do not fully encode. This “data fusion” approach is aligned with prior serum/plasma studies that reported notable diagnostic performance124,125 and with our own experience using penalized models across cancer sites123, compatible with the translatability of the analytic framework.
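
The logic of comparing an apparent AUC with a repeated stratified cross-validated AUC can be illustrated with a small, self-contained sketch. It is not the study pipeline: the data are simulated, the penalty strength is fixed instead of tuned, the L1-penalized logistic model is fitted with a simple proximal-gradient loop, and the bootstrap correction step is omitted. All function names and parameter values are assumptions chosen for illustration.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def linear_score(w, b, x):
    return b + sum(wj * xj for wj, xj in zip(w, x))

def fit_l1_logistic(X, y, lam=0.05, lr=0.5, n_iter=200):
    """L1-penalized logistic regression via proximal gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(linear_score(w, b, xi)) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n  # the intercept is not penalized
        for j in range(p):
            step = w[j] - lr * grad_w[j] / n
            # Soft-thresholding shrinks coefficients and sets small ones to zero
            w[j] = math.copysign(max(abs(step) - lr * lam, 0.0), step)
    return w, b

def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count 1/2)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(y, k, rng):
    """Assign samples to k folds while preserving the case/control ratio."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, yi in enumerate(y) if yi == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def repeated_cv_auc(X, y, k=5, repeats=2, seed=0):
    """Mean held-out AUC over repeated stratified k-fold cross-validation."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(repeats):
        for test_idx in stratified_folds(y, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            w, b = fit_l1_logistic([X[i] for i in train_idx],
                                   [y[i] for i in train_idx])
            scores = [linear_score(w, b, X[i]) for i in test_idx]
            aucs.append(auc(scores, [y[i] for i in test_idx]))
    return sum(aucs) / len(aucs)

def simulate(n_per_class=60, n_features=6, n_informative=2, shift=1.0, seed=1):
    """Simulated 'expression' data: only the first features differ in cases."""
    rng = random.Random(seed)
    X, y = [], []
    for cls in (0, 1):
        for _ in range(n_per_class):
            X.append([rng.gauss(shift * cls if j < n_informative else 0.0, 1.0)
                      for j in range(n_features)])
            y.append(cls)
    return X, y

X, y = simulate()
w_full, b_full = fit_l1_logistic(X, y)
apparent_auc = auc([linear_score(w_full, b_full, x) for x in X], y)
cv_auc = repeated_cv_auc(X, y)
print(f"apparent AUC = {apparent_auc:.3f}, repeated 5-fold CV AUC = {cv_auc:.3f}")
```

A small gap between the two printed values is the pattern described above as limited optimism; a large gap would point to overfitting in the apparent estimate.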

This study builds upon our previous work on circulating miRNAs and breast cancer detection in the MCC-Spain cohort, where previous analyses focused on overall case–control differences, without stratification by subtype. In the current study, we expanded the analytical scope by incorporating a larger and more diverse set of breast cancer cases, including a higher number of clinically aggressive subtypes such as HER2-positive and Basal-like tumors. Additionally, we conducted an updated literature review specifically focused on miRNAs associated with breast cancer molecular subtypes, providing a biologically driven selection of candidate miRNAs. This approach allowed for the development of exploratory subtype-specific models and the investigation of potential heterogeneity in miRNA expression and diagnostic potential across distinct tumor profiles.

This work has several strengths that enhance the potential utility and clinical relevance of its findings. First, the multicenter recruitment across Spanish hospitals provides geographic and clinical diversity, enhancing generalizability and mitigating single-site biases. Second, the use of penalized regression (LASSO) enables parsimonious, data-driven feature selection in a high-dimensional setting, reducing overfitting relative to stepwise procedures and prioritizing predictive potential rather than nominal p value thresholds. Third, internal validation through repeated fivefold cross-validation with bootstrap resampling (10 repetitions) addresses sampling variability and selection uncertainty, yielding more conservative AUC estimates. Fourth, explicit stratification by molecular subtype accounts for breast cancer heterogeneity and uncovers subtype-specific miRNA patterns with differential predictive performance. Fifth, integrating demographic and epidemiological covariates with circulating miRNAs meaningfully improved discrimination, particularly in Luminal A/B, supporting multimodal risk stratification that reflects both biological and patient-level context. Finally, the convergence of our observed circulating profiles with mechanistic pathways reported in the literature, such as EMT, DNA repair/cell cycle, ECM remodeling, and exosomal transport, provides biological plausibility for the predictive panels30,107,108,126,127.

Several limitations should be acknowledged. First, the sample sizes of certain subgroups, particularly Luminal B and HER2-positive tumors, were limited, which reduced the statistical power to detect modest associations and likely contributed to wide confidence intervals (CIs) and occasional extreme odds ratios (ORs) in the HER2/Basal-like analyses. Second, we did not correct for multiple comparisons across the large number of statistical tests performed. Therefore, all reported associations are nominal and exploratory; while this approach maintains sensitivity for discovery, it increases the risk of false positives (Type I errors). Third, some subtype-specific models (notably HER2 and Basal-like) relied on unadjusted or minimally adjusted ORs due to model instability, and adjustment for recruitment site may have further increased variance, as reflected in the wide CIs. Fourth, although internal validation (repeated cross-validation with bootstrap resampling) was conducted, external validation in independent cohorts is required to assess the generalizability of the predictive panels. Fifth, pre-analytical variability and biological compartment differences, including plasma versus extracellular vesicles, may influence measured miRNA levels, as previously highlighted in translational studies of miR-21-5p70, and could explain some discrepancies across studies despite the promising diagnostic performance of exosomal assays127. Sixth, the single-country recruitment may limit generalizability to populations with different genetic backgrounds or environmental exposures. It is also worth noting that although penalized regression techniques such as LASSO can be affected by multicollinearity, particularly in datasets with highly correlated biomarkers like miRNAs, the overlap between selected features and those identified as differentially expressed suggests that the model captured biologically relevant signals. While this internal consistency does not replace external validation, it supports the robustness of the selected predictors. Finally, although NGS-based screening guided the initial selection of candidate miRNAs, the limited sample size and underrepresentation of key molecular subtypes, especially HER2-positive and Basal-like, precluded robust subtype-specific differential expression analysis. Future research should expand NGS profiling across all major breast cancer subtypes to enable subtype-specific biomarker discovery.
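
To illustrate what the absence of multiplicity correction means in practice, the sketch below applies the Benjamini-Hochberg false discovery rate procedure to a set of p values. The p values are hypothetical and are not taken from our analyses; the procedure is shown as one common option that confirmatory work could use, not as a method applied in this study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical nominal p values for five miRNA-subtype associations
nominal = [0.001, 0.008, 0.039, 0.041, 0.20]
adjusted = benjamini_hochberg(nominal)

n_nominal = sum(p < 0.05 for p in nominal)
n_adjusted = sum(q < 0.05 for q in adjusted)
print("adjusted:", [round(q, 5) for q in adjusted])
print(f"significant at 0.05: {n_nominal} nominal vs {n_adjusted} after adjustment")
```

In this toy example, four associations pass the nominal 0.05 threshold but only two remain below it after adjustment, which is why nominal findings are best read as hypothesis-generating.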

Clinically, our findings indicate a pragmatic path toward precision screening and subtype-informed triage using liquid biopsy. The strong discriminatory performance observed in Basal-like and HER2-positive tumors suggests that circulating miRNAs may facilitate early detection or expedite diagnostic workups in aggressive disease, while the improved performance observed in Luminal A/B after integrating demographic and epidemiological covariates supports multimodal models in more heterogeneous subtypes. In practice, standardized pre-analytical procedures, harmonized assay platforms, and direct comparisons between plasma- and EV-based detection methods (as discussed for miR-21-5p) will be essential to ensure reproducibility and scalability70,127. Prospective cohort studies with serial sampling should assess temporal stability and prognostic value (monitoring, minimal residual disease, relapse), and external validation efforts should evaluate generalizability across different healthcare settings.

In conclusion, this study suggests that several circulating miRNAs may be consistently dysregulated across breast cancer subtypes and may serve as non-invasive biomarkers for early detection and molecular classification. In line with our results, miR-21-5p and miR-423-3p emerged as reproducible oncomiRs in circulation, while miR-101-3p, miR-29a-3p, and miR-146a-5p exhibited consistent tumor-suppressive patterns across models. Predictive performance was highest in Basal-like and HER2-positive tumors even with miRNA-only models, and improved further in all settings when demographic and epidemiological covariates were incorporated, particularly in Luminal A and B tumors, underscoring the value of integrated clinical and molecular approaches. These findings support further research toward standardized and multimodal liquid-biopsy strategies that combine circulating miRNA panels with patient-level information to enhance early detection, molecular subtyping, and equitable access to precision diagnostics.