Introduction

Colorectal cancer (CRC) is a major lethal health problem and the third most common cancer worldwide, with a 19.5% prevalence1,2. According to GLOBOCAN 2020, the number of new cases in 2020 for both sexes and all ages was 1,931,590 (10% of all cancers), and the number of deaths was 935,173 (9.4% of all cancers)2. Regarding the genetic basis of cancer malignancy, microarray technology has recently been one of the most widely used tools to evaluate the functions of genes in cancer cases3. MicroRNAs (miRNAs) are small single-stranded non-coding RNA molecules with an average length of 22 nucleotides. miRNAs appear to regulate more than 50% of human genes, and abnormal expression of miRNAs has been implicated in many human cancers4. miRNAs are also abundant as extracellular circulating molecules released into circulation by tumor cells either through cell death or by exosome-mediated signaling5,6. Given their remarkable stability in blood and other body fluids, circulating cell-free miRNAs have the potential to serve as non-invasive biomarkers for cancer screening and diagnosis7. Because CRC is a complex disease, introducing more accurate, fast, and specific diagnostic and prognostic biomarkers remains a challenge. The role of biomarkers in CRC is becoming more important for improving early diagnosis and treatment8,9. In addition, transcription factors (TFs) and miRNAs are two of the most well-studied elements in coregulatory modules. Different modes of regulation add another layer of complexity to post-transcriptional regulation. Hence, the study of complex diseases at this level will reshape our understanding of pathogenesis and treatment approaches in heterogeneous disorders10.

The field of miRNA-disease association (MDA) prediction has advanced in recent years owing to large accessible miRNA expression datasets11 and innovative methods such as similarity-based prediction12 and network-based inference13, yet there is no consensus on a globally accepted strategy14. Computational models have become a functional tool for predicting miRNA-disease pairs and substantially reduce the number of targets for experimental validation, which in turn reduces laboratory costs and saves time15. The efficacy of computational models in introducing prospective miRNAs has been significant, as most of the suggested miRNAs were eventually validated by experiments, either in the short term or in the long run16.

As such, there have been numerous efforts to introduce new miRNAs in CRC as regulatory elements17,18,19, biomarkers17,18,20,21,22 and therapeutic targets17,23, all of which have contributed to refining our understanding of CRC development and its underlying molecular mechanisms and to enhancing diagnostic efficiency. However, the majority of studies have focused on solid tumors or blood as the source of miRNA expression data. Therefore, the current literature lacks an in-depth analysis of the potential of miRNA expression datasets derived from the serum of CRC patients.

The scale and complexity of microarray datasets are increasing exponentially, and machine learning (ML) is one of the essential and effective tools for analyzing highly complex data. This study applied ML techniques, namely a wrapper method in the feature selection step and robust supervised learning models, to miRNA expression datasets derived from serum. The aim is to identify promising miRNAs and introduce them as non-invasive biomarkers of CRC. Subsequently, a comprehensive functional annotation of the candidate miRNAs was carried out to provide additional insights into their underlying regulatory mechanisms. Overall, we performed an integrated analysis of miRNA expression datasets to discover a robust set of miRNAs and highlight their significance for further experimental validation.

Materials and methods

Microarray data collection

Three publicly available microarray datasets from the Gene Expression Omnibus (GEO) database, GSE106817, GSE113486, and GSE113740, were selected for analysis; all are available at https://www.ncbi.nlm.nih.gov/geo/. Detailed information on the three datasets is shown in Table 1. The serum samples of cancer cases and non-cancer controls in these datasets had been analyzed by microarray to obtain miRNA expression profiles. We used GSE106817 as the training set and GSE113486 and GSE113740, which also contain miRNA expression profiles from serum samples, as validation sets (Table 1). This study was approved by the Ethics Committee of Tabriz University of Medical Sciences (No: IR.TBZMED.VCR.REC.1401.270).

Table 1 Information of datasets.

Differential expression analysis using GEO2R

To identify differentially expressed miRNAs, we also used GEO2R, an interactive web-based tool available through the Gene Expression Omnibus (GEO). The analysis was conducted on the GSE113486 and GSE113740 datasets, comparing the cancer group with the control group. The limma package in R was employed to calculate fold changes and p-values, with a significance threshold of [p-value and adjusted p-value]. The significant miRNAs were listed for further analysis.
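As a minimal sketch of how an adjusted p-value can be derived from raw p-values, the following Python snippet applies the Benjamini-Hochberg procedure. Treating this as the adjustment method is an assumption made here for illustration, since the text does not name the method; the function name `bh_adjust` and the example p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = pvalues[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four miRNAs (not taken from the study).
raw = [0.01, 0.04, 0.03, 0.005]
adj = bh_adjust(raw)
```

Each raw p-value is scaled by the number of tests divided by its rank, so a miRNA stays significant only if its p-value is small relative to how many miRNAs were tested.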

Feature selection techniques

In microarray datasets, the number of miRNAs can exceed the number of samples, which leads to faulty classification and makes it difficult to train classifiers on such high-dimensional data24,25. Preprocessing is an essential step to address this dimensionality problem before applying the classification algorithm and controlling model complexity26. A critical step in preprocessing is feature selection, which helps overcome the curse of dimensionality27. There are three families of feature selection techniques in classification: filter, wrapper, and embedded methods. In wrapper-based methods, feature selection is carried out using the machine learning method itself, with cross-validation used to score each feature subset28.

Boruta

The Boruta algorithm, introduced by Miron B. Kursa and Witold R. Rudnicki in 201029,30, is a wrapper method built around the random forest classification algorithm. Its primary objective is to determine the significance of each feature in the context of the full feature set, identifying which features are truly significant for predicting the target feature. Boruta initiates the process by creating shadow features. These are copies of the original features with their values randomly shuffled, effectively acting as noise features. This step is crucial as it provides a baseline against which the significance of the original features can be compared. The algorithm trains a random forest classifier on the extended dataset, which includes both the original and shadow features. The significance of each feature is measured using the mean decrease in the Gini index or any other suitable metric provided by the random forest. The significance scores of the original features are compared against the highest significance score of the shadow features. If an original feature has a significantly higher significance score than the best shadow feature, it is considered significant. Conversely, if an original feature has a lower significance score, it is deemed nonsignificant. Features that are identified as nonsignificant are subsequently excluded from the dataset. The process is repeated iteratively until a predefined stopping condition is met, such as a maximum number of iterations or stability in feature selection. Once the iterations are finished, the features are categorized into three groups: confirmed significant, confirmed nonsignificant, and tentative. Tentative features require further analysis to determine their significance.
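The shadow-feature comparison described above can be sketched in a few lines of Python. This is a simplified illustration, not the Boruta implementation: absolute Pearson correlation with the label is swapped in for random forest importance, and hits are simply counted rather than passed to a statistical test. The function names and the toy data are hypothetical.

```python
import random
from statistics import mean


def abs_correlation(values, labels):
    """Stand-in importance: absolute Pearson correlation with the label."""
    mx, my = mean(values), mean(labels)
    cov = sum((x - mx) * (y - my) for x, y in zip(values, labels))
    var_x = sum((x - mx) ** 2 for x in values)
    var_y = sum((y - my) ** 2 for y in labels)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def shadow_hits(features, labels, n_rounds=50, seed=0):
    """Count how often each feature beats the best shuffled (shadow) copy."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_rounds):
        # Shadow features: each column shuffled to break its link to the label.
        shadow_scores = []
        for values in features.values():
            shuffled = values[:]
            rng.shuffle(shuffled)
            shadow_scores.append(abs_correlation(shuffled, labels))
        best_shadow = max(shadow_scores)
        # A feature scores a hit when it beats the best shadow in this round.
        for name, values in features.items():
            if abs_correlation(values, labels) > best_shadow:
                hits[name] += 1
    return hits


# Hypothetical toy data: one informative feature and one noise feature.
labels = [0] * 10 + [1] * 10
features = {
    "informative": [0.1 * i for i in range(10)] + [5 + 0.1 * i for i in range(10)],
    "noise": [((7 * i) % 11) / 10 for i in range(20)],
}
hits = shadow_hits(features, labels)
```

A feature that beats the best shadow in nearly every round would be a candidate for the confirmed group, while one that rarely does would be rejected; Boruta itself makes this decision with a statistical test on the hit counts.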

Boruta offers significant advantages in the domain of classification. It is highly robust to overfitting as it leverages the strength of the random forest algorithm and uses shadow features as a baseline for comparison. Unlike many feature selection methods that focus on finding a minimal optimal feature subset, Boruta aims to find all features that carry information about the target feature. By providing a clear distinction between significant and nonsignificant features, Boruta enhances the interpretability of the model, simplifying the identification of features that meaningfully contribute to predictions25,31.

Random forest

Breiman introduced random forest (RF)32. Notably, the RF algorithm can be applied to both classification and regression tasks. The Random Forest algorithm relies on the principles of bagging (Bootstrap Aggregating) and random feature selection, which help reduce the variance of the model and avoid overfitting. During the construction of each tree, Random Forest selects a random subset of features at each split point. This process, known as feature bagging, ensures that the trees are decorrelated, further diminishing the risk of overfitting33,34. The key hyperparameter to set when implementing RF is the number of variables available for splitting at each tree node (mtry).
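The two sources of randomness named above, bootstrap sampling of rows and random selection of mtry candidate features at a split, can be illustrated with the short Python sketch below. The helper names, the sample count of 100, and the square-root rule for mtry are assumptions made for illustration and are not the tuned values of this study; 146 is the number of miRNAs retained by Boruta.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def candidate_features(n_features, mtry, rng):
    """Pick mtry distinct feature indices to consider at one split."""
    return rng.sample(range(n_features), mtry)


rng = random.Random(42)
n_samples, n_features = 100, 146
# Assumed rule of thumb for classification: square root of the feature count.
mtry = int(n_features ** 0.5)

# Rows used to grow one tree, and the rows left out of that bootstrap sample.
in_bag = bootstrap_indices(n_samples, rng)
out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
# Features that a single split in that tree would be allowed to consider.
split_candidates = candidate_features(n_features, mtry, rng)
```

Because every tree sees a different bootstrap sample and every split a different feature subset, the trees disagree in their errors, which is what lets averaging reduce variance.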

Extreme gradient boosting

Extreme Gradient Boosting (XGBoost) is a powerful and efficient implementation of the gradient boosting framework, designed to enhance the performance and speed of tree-based ensemble methods35. The objective function in XGBoost combines a loss function and a regularization term. The loss function measures the model’s prediction error, while the regularization term penalizes the complexity of the model, preventing overfitting. The objective function can be represented as:

$$\mathcal{L}\left(\theta\right)=\sum_{i=1}^{n} l\left(y_{i},\hat{y}_{i}\right)+\sum_{k=1}^{K}\Omega\left(f_{k}\right)$$

where \(l\) denotes the loss function (e.g., mean squared error for regression), \(\Omega\) is the regularization term, \(y_{i}\) is the actual target, \(\hat{y}_{i}\) is the predicted target, and \(f_{k}\) represents the \(k\)-th tree in the ensemble. XGBoost employs gradient descent to minimize the objective function. In each iteration, it fits a new tree to the negative gradient of the loss function for the current model’s predictions. This stepwise approach iteratively reduces the residual errors. XGBoost introduces both L1 (Lasso) and L2 (Ridge) regularization to control the complexity of the model, as expressed in the regularization term \(\Omega\left(f_{k}\right)\). This helps prevent overfitting and improves generalization. To handle missing data, XGBoost incorporates a built-in mechanism that learns optimal default directions within its decision trees. This mechanism allows the algorithm to handle incomplete datasets effectively without the need for explicit imputation36. According to Chen et al.37, the XGBoost algorithm’s parameters can be separated into three groups: general parameters, booster parameters, and learning parameters. In this study, the XGBoost algorithm’s booster parameters were: (1) nrounds (the maximum number of boosting iterations); (2) max_depth (used to control overfitting, as a higher depth allows the model to learn relations very specific to a particular sample); (3) gamma (the minimum loss reduction required to make a split; a node is split only when the resulting split gives a positive reduction in the loss function, which makes the algorithm conservative; suitable values vary with the loss function and should be tuned); (4) colsample_bytree (the fraction of columns randomly sampled for each tree); (5) min_child_weight (used to control overfitting); and (6) subsample (lower values make the algorithm more conservative and prevent overfitting, but values that are too small might lead to underfitting).
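For reference, the regularization term is commonly written in the following form. This is the standard XGBoost formulation rather than something stated in this study; the symbols \(T\) (number of leaves in a tree), \(w_{j}\) (weight of leaf \(j\)), and the penalty coefficients \(\lambda\) (L2) and \(\alpha\) (L1) are introduced here for illustration, while \(\gamma\) corresponds to the gamma booster parameter.

$$\Omega\left(f_{k}\right)=\gamma T+\frac{1}{2}\lambda \sum_{j=1}^{T} w_{j}^{2}+\alpha \sum_{j=1}^{T}\left|w_{j}\right|$$

Under this form, each additional leaf costs \(\gamma\), which is why a split is kept only when its loss reduction exceeds that amount.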

Machine learning model evaluation

The analysis used three GEO datasets (GSE106817, GSE113486 and GSE113740) as training and validation data to compare the performance of two machine learning models, RF and XGBoost. Each model was evaluated with several metrics: accuracy, area under the ROC curve (AUC), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). To ensure robust performance evaluation of the models, we incorporated multiple validation strategies. The widely used k-fold cross-validation technique was applied in our experimental work. In k-fold cross-validation (here, k = 10), the dataset is randomly split into k subsets; k-1 subsets are used for training, and the remaining subset serves as the test set. This process is repeated k times so that each subset serves once as the test set.
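The threshold-based metrics and the fold construction described above can be sketched as follows. This is an illustrative Python snippet under stated assumptions: the function names and the confusion-matrix counts are hypothetical, and AUC is omitted because it requires predicted scores rather than counts.

```python
import random


def confusion_metrics(tp, fp, tn, fn):
    """Compute the threshold-based metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and yield (train, test) index lists per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical counts: 90 true positives, 5 false positives,
# 95 true negatives, 10 false negatives.
metrics = confusion_metrics(tp=90, fp=5, tn=95, fn=10)
splits = list(k_fold_indices(n_samples=50, k=10))
```

With these counts, sensitivity is 90/100 and specificity is 95/100, and every sample appears in exactly one test fold across the ten splits.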

Functional annotation of the identified miRNAs

Based on the mean of scores (MS), the identified miRNAs with an MS higher than 50% were subjected to ontology analysis. To identify miRNA-associated pathways, we used miRAnno (https://ophid.utoronto.ca/mirDIP/miRAnno.jsp#r), a network-based program for identifying miRNA-associated pathways38. Additionally, miRNet (https://www.mirnet.ca/miRNet/), a miRNA-centric network visual analytics platform, was used to identify the network between the identified miRNAs, target genes, and associated diseases39. Furthermore, the target genes found by miRNet were subjected to enrichment analysis using ToppFun (https://toppgene.cchmc.org/enrichment.jsp), which detects functional enrichment of a gene list based on transcriptome, proteome, regulome, ontologies and other features40. The TransmiR41 database contains manually curated regulatory interactions between miRNAs and TFs. Our final set of miRNAs (MS > 50%) was submitted as input to the TransmiR database. In parallel, the list of genes targeted by our top miRNAs was submitted to the ChEA42 web server, which retrieves overrepresented TFs based on ChIP-X experiments. The top 5% of the most overrepresented TFs were considered for network assembly. The miRNA-target gene network generated by miRNet was also merged. In this way, a TF-miRNA-Gene regulatory network around our candidate miRNAs was constructed from three different sources using Cytoscape software (Version 3.10.2). As the final step, the top 5 TFs based on the degree of connectivity were isolated from the network using the “Analyze Network” function within Cytoscape.
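The degree-based ranking of TFs can be illustrated outside Cytoscape with a short Python sketch. It assumes an undirected network in which duplicate edges are counted once; the function name `top_by_degree` is hypothetical, and the toy edges are invented for illustration and do not come from the study's network.

```python
from collections import Counter


def top_by_degree(edges, candidates, top_n=5):
    """Rank candidate nodes by degree in an undirected edge list."""
    # Treat edges as undirected and drop duplicates before counting.
    unique_edges = {tuple(sorted(edge)) for edge in edges}
    degree = Counter()
    for a, b in unique_edges:
        degree[a] += 1
        degree[b] += 1
    # Sort by descending degree, breaking ties alphabetically.
    ranked = sorted(candidates, key=lambda node: (-degree[node], node))
    return [(node, degree[node]) for node in ranked[:top_n]]


# Hypothetical toy network; node names mirror entities in the text,
# but the edges themselves are made up.
edges = [
    ("E2F1", "GENE_A"), ("E2F1", "GENE_B"), ("E2F1", "GENE_C"),
    ("E2F4", "GENE_A"), ("E2F4", "GENE_B"),
    ("hsa-miR-1343-3p", "GENE_A"), ("GENE_A", "E2F1"),
]
top_tfs = top_by_degree(edges, candidates=["E2F1", "E2F4"], top_n=2)
```

Restricting the ranking to the TF nodes keeps highly connected target genes or miRNAs from crowding out the regulators of interest.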

Results

Differentially expressed miRNAs in models

The workflow of our study is illustrated in Fig. 1, which describes how the various methods were merged into an integrative pipeline. Of 2,568 miRNAs in GSE106817, the Boruta algorithm initially selected 122 miRNAs using the Gini index. After the tentative features were resolved, Boruta identified 146 miRNAs for the analysis. As shown in Table 2, the Random Forest and XGBoost classifiers identified 20 and 16 miRNAs, respectively, from these 146 miRNAs in the internal dataset. Ten of the identified miRNAs were common to both models (26 unique miRNAs in total). Based on the mean of scores, hsa-miR-6787-5p, hsa-miR-1246, hsa-miR-8073, hsa-miR-5100, hsa-miR-6717-5p, hsa-miR-1228-5p, hsa-miR-4706, hsa-miR-3184-5p, and hsa-miR-1343-3p had significance values above 50% in the random forest model across all three datasets. In the XGBoost model, hsa-miR-6787-5p and hsa-miR-1246 had significance values above 40%. The results of GEO2R are presented in Table 3. Note that the adjusted p-value is generally recommended as the primary statistic for interpreting the results; the miRNAs with the smallest p-values are the most reliable. As shown in Table 3, all miRNAs had an adjusted p-value < 0.0001. hsa-miR-1228-5p, hsa-miR-3184-5p, hsa-miR-6765-5p, and hsa-miR-1268a were downregulated in both external validation datasets.

Fig. 1

A workflow of the steps performed in this study for the identification and functional annotation of miRNA biomarkers in CRC. CRC colorectal cancer, ROSE random over sampling, SMOTE synthetic minority oversampling technique, CV cross validation, TFs transcription factors.

Table 2 The significance value of miRNAs in the RF and XGBoost models in the internal and external validation datasets.
Table 3 miRNA statuses in external datasets.

Performance evaluation

Table 4 presents the optimal hyperparameters for the machine learning models, which were identified using the selected miRNAs. The results of the different performance metrics for each classifier are presented in Table 4 and Fig. 2. In Fig. 2a, the heatmap shows the differences between samples in each group. In GSE106817, the samples on the right side of the figure show a significantly low miRNA expression level (red color) on the heatmap compared with that in the non-cancerous group. However, the miRNA expression levels of the samples on the right side of the figure that are red are closer to those of the cancerous group. In addition, the in silico validation in Fig. 2b–d, for GSE106817 as internal data and GSE113486 and GSE113740 as external datasets, showed that the random forest model achieved better performance, with an accuracy of 99.88% and an AUC of 100%. The XGBoost model achieved an accuracy of 99.71% and an AUC of 99.9%. In the external validation datasets, the models using the selected miRNAs also achieved strong performance, as shown in the ROC curves: the AUC of random forest in GSE113486 and GSE113740 was 97.8% (CI: 92.8–100) and 96.7% (CI: 89.1–100), respectively, and the AUC of the XGBoost model was 98.9% (CI: 96.6–100) and 95.8% (CI: 87.5.1–100), respectively.

Table 4 Optimal hyperparameters and performance metrics for the final models.
Fig. 2

Heatmap and ROC curves. Heatmap showing a promising result of the analysis using the 146 identified miRNAs to distinguish samples in GSE106817 between cancer patients and non-cancer controls (a). ROC curves of the RF and XGBoost models with selected miRNAs identified through the Boruta feature selection algorithm in GSE106817 (b), and in external validation datasets (c,d).

Functional annotation of the identified miRNAs

The random forest approach identified the following miRNAs with a mean score (MS) exceeding 50%: hsa-miR-1228-5p (MS: 83.46%), hsa-miR-6787-5p (MS: 82.31%), hsa-miR-1343-3p (MS: 81.59%), hsa-miR-6717-5p (MS: 77.57%), hsa-miR-3184-5p (MS: 75.86%), hsa-miR-1246 (MS: 64.67%), hsa-miR-4706 (MS: 62.47%), hsa-miR-8073 (MS: 55.93%), and hsa-miR-5100 (MS: 53.45%). The XGBoost method identified hsa-miR-6787-5p (MS: 50.57%) as the sole miRNA with an MS exceeding 50%; this miRNA was also identified by random forest.

Common miRNAs identified by both models with an MS higher than 50% were subjected to pathway analysis, in which the miRAnno tool identified a list of pathways associated (P < 0.01) with these miRNAs (Supplementary file 1). Table 5 summarizes the cancer/tumor pathways and the top five molecular pathways associated with the selected miRNAs. The miRNet tool identified 106 diseases (Supplementary file 2) and 815 genes (Supplementary file 3) in association with the network of nine selected miRNAs. Among the identified diseases, 59 were cancer-related malignancies, including colorectal carcinoma, colonic neoplasms, and colorectal adenocarcinoma (see Supplementary file 2). Figure 3 was generated by miRNet and depicts the network between the selected miRNAs, their identified target genes, and the diseases with which they are associated. The ToppFun gene list enrichment analysis revealed that the target genes of the selected miRNAs play a significant role in 20 molecular functions and 42 biological processes (Table 6).

Table 5 miRNA-associated pathways listed by miRAnno.
Fig. 3

miRNA-centric network generated by miRNet. The figure depicts how the nine selected miRNAs, their target genes, and associated diseases are connected. The blue squares represent miRNAs identified by the random forest and XGBoost approaches as being related to colorectal cancer, with a mean score higher than 50%. Every red circle represents a disease (Supplementary file 2) and every light green circle represents a target gene (Supplementary file 3). miRNet reported 106 diseases and 815 genes associated with the selected miRNAs.

Table 6 Gene ontology reported by ToppFun for target genes of the selected miRNAs.

A total of 7 TFs (MECP2, BPTF, NFRKB, ZNF614, GMEB2, ZSCAN29, and HMBOX1) were found to interact with hsa-mir-3184-5p (Supplementary file 4). According to the TransmiR analysis, there was no known experimentally validated TF-miRNA interaction for the other members of our selected list. TF-overrepresentation analysis introduced 1632 TFs interacting with the 815 target genes of the candidate miRNAs. We focused on the top 5% (81 TFs) with the most overlapping genes and included them in the construction of the final network (Supplementary file 4). The resulting network consists of the 9 miRNAs and their target genes, plus the reported TFs that interact with them, supported by experimental data. The final TFs-miRNAs-Genes regulatory network contains 902 nodes and 12,331 edges and is depicted in Fig. 4a. Because the network is dense, we sought to identify the most connected TFs based on the degree of connectivity. The top 5 TFs in the TFs-miRNAs-Genes regulatory network are as follows: E2F1 (degree of 338), E2F4 (degree of 316), CREB1 (degree of 294), REST (degree of 287), and JUND (degree of 280). The full result of the network analysis is available in Supplementary file 4. Additionally, GMEB2 was the only TF common to the TF regulators of the miRNAs and those of their target genes (Fig. 4a). A zoomed-in view of this axis is shown in Fig. 4b.

Fig. 4

(a) TFs-miRNAs-Genes regulatory network. Red triangles are TFs, green ellipses are genes and miRNAs are shown in blue rectangles. (b) The interactome around hsa-mir-3184-5p/GMEB2 axis. All their first neighbors were extracted from the main network.

Discussion

Three GEO datasets were employed in our study, using a robust and widely recognized feature selection method in machine learning and two distinct machine learning classifier models. We validated the models across datasets, and their performance was assessed using metrics such as accuracy, sensitivity, specificity, PPV, NPV, and AUC. The random forest method demonstrated superior performance with the GSE106817 and GSE113486 datasets, in line with other studies43,44,45,46,47,48 that reported its robustness against overfitting compared to other methods. These two models and analysis approaches have also been effective in other diseases such as hepatocellular carcinoma6, gastric cancer49, and ovarian cancer45,46. Additionally, certain studies50,51,52,53,54 influenced our decision to use this method for selecting significant features from the GSE113486 and GSE113740 datasets. The GSE113486 and GSE113740 datasets, which contain fewer CRC samples, exhibited lower performance, potentially due to overfitting caused by the limited sample size relative to the model’s complexity. To address this issue and obtain a more reliable performance estimate, we employed 10-fold cross-validation. This technique evaluates the model’s generalization ability by averaging performance metrics across multiple folds, reducing the risk of overfitting. Additionally, to handle class imbalance, we applied SMOTE (Synthetic Minority Over-sampling Technique)55 within each fold of the cross-validation process. This ensures that synthetic samples are generated only from the training data, preventing data leakage and providing a more realistic estimate of model performance. To further validate the stability of our performance estimates, we conducted 100 iterations of the bootstrap method, averaging the evaluation metrics over these iterations. This approach provides a robust assessment of model performance and reduces the impact of variability in the dataset.
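The point about generating synthetic samples only from the training data can be made concrete with a small Python sketch. It is a simplified, SMOTE-like illustration and not the procedure used in this study: each synthetic point is interpolated between a minority sample and its single nearest minority neighbor, whereas SMOTE proper picks among several nearest neighbors. The function names and the toy fold are hypothetical, and at least two minority samples are assumed.

```python
import random


def smote_like(minority, n_new, rng):
    """Create synthetic points between a minority sample and its nearest neighbor."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest other minority sample by squared Euclidean distance.
        neighbor = min(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        t = rng.random()
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbor)])
    return synthetic


def balance_training_fold(rows, labels, minority_label, rng):
    """Oversample the minority class using the training rows only."""
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    n_majority = len(rows) - len(minority)
    extra = smote_like(minority, max(0, n_majority - len(minority)), rng)
    return rows + extra, labels + [minority_label] * len(extra)


rng = random.Random(0)
# Hypothetical training fold: four controls (0) and two cancer cases (1).
train_rows = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [2.0, 2.1], [2.4, 2.2]]
train_labels = [0, 0, 0, 0, 1, 1]
bal_rows, bal_labels = balance_training_fold(train_rows, train_labels, 1, rng)
```

Calling `balance_training_fold` on the training part of each fold, and never on the held-out part, keeps information from the test samples out of the synthetic data.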

It is widely acknowledged that applying in silico network analysis to the findings of experimental or theoretical molecular studies facilitates comprehension of the outcomes, particularly in noncommunicable disease research, including cancer56,57. In this study, we employed the miRAnno, miRNet, and ToppFun tools to gain insight into the function and potential molecular contributions of the selected top-ranked miRNAs. In its analysis, miRAnno links several cancer pathways and molecular pathways to the identified miRNAs (Table 5). In this section, we examine the involvement in CRC of the first-ranked miRAnno pathway for each of the selected miRNAs.

The hsa-miR-1228-5p-associated pathway, “Metabolism of ingested SeMet, Sec, MeSec into H2Se,” is implicated in the transformation of inorganic and organic forms of selenium into the intermediate selenide through the trans-selenation pathway, selenocysteine lyase, and cystathionine gamma-lyase58,59. It is well established that selenocysteine pathways are involved in some molecular phenomena associated with colon cancer, including autophagy60, sporadic colorectal carcinogenesis, and WNT signaling activity61. Moreover, they are involved in maintaining the integrity of the intestinal barrier62 and influencing a range of other molecular processes related to CRC63.

The miRAnno analysis indicates that the primary pathway associated with hsa-miR-6787-5p is FGFR1 signaling. FGFR1 amplification has been proposed as a prognostic factor in CRC64, and its inhibition has been suggested as a means of suppressing the proliferation of CRC65. The PTK6/STAT3 pathway, which is the most highly ranked in relation to hsa-miR-1343-3p, plays a role in the proliferation, migration, and impaired apoptosis of colon cancer cells66 and in the chemoresistance of CRC67. Benzo(a)pyrene (associated with hsa-miR-6717-5p), which has its origins in dietary habits, has been demonstrated to accelerate colon carcinogenesis68. The primary pathway associated with hsa-miR-3184-5p is NGF/proNGF/p75NTR, and there is crosstalk between androgens and NGF in regulating apoptosis of colon cancer cells69.

Moreover, the overexpression of p75NTR in colon cancer cells resulted in a G1 phase arrest, attenuation of invasion and colony formation, and induced apoptosis70. The saccharide sequence of dermatan sulfate (associated with hsa-miR-1246) chains from human colon cancer is altered from that in normal colon tissue71. hsa-miR-4706 is associated with the WNT ligand secretion/PORCN inhibitor LGK974 pathway. It has been demonstrated that LGK974 is an effective inhibitor of the WNT and MAPK signaling pathways, capable of arresting the cell cycle and inducing apoptosis in CRC cell lines72. Quinol/quinone metabolism (associated with hsa-miR-8073) has been demonstrated to play a role in colon tumor growth73. (S)-3-hydroxy-3-methylglutaryl-CoA degradation, which is linked to hsa-miR-5100, has been shown to be important in favorable clinicopathological characteristics74 and the outcome of statin use in colon cancer cases75,76.

The Gene Ontology (GO) list of biological processes and molecular functions (Table 6) reported by ToppFun for the target genes identified by miRNet (Supplementary file 3, Fig. 3) aligns with the findings of miRAnno (Supplementary file 1, Table 5).

In the TFs-miRNAs-Genes regulatory network, GMEB2 was identified as the only TF common to the TF regulators of the miRNAs and those of their target genes. Figure 4b shows the regulatory axis between GMEB2 and hsa-miR-3184-5p. The elevated expression of GMEB2 and its contribution to CRC progression have been explained by stimulation of the NF-κB signaling pathway77. As shown in Table 3, the expression of hsa-miR-3184-5p is downregulated in CRC samples. This reciprocal expression of GMEB2 and hsa-miR-3184-5p further highlights the importance of studying this axis in CRC. The E2F family of TFs is one of the most studied classes of genes in CRC, contributing to different aspects of CRC pathogenesis, from tumorigenesis and progression to drug resistance and apoptosis. E2F1, which has been implicated in CRC development and progression via different axes, was the most connected TF in our TFs-miRNAs-Genes regulatory network. To the best of our knowledge, none of our 9 candidate miRNAs has been reported to be directly involved in E2F-mediated signaling. Given that a unique miRNA expression pattern has been observed during CRC progression, an experimental approach to investigate the possible interactions between E2F-mediated signaling and the proposed miRNAs is of great importance78. Our results emphasize the need to study the possible interaction of the E2F family of TFs, especially E2F1 and E2F4, and their corresponding miRNAs in the context of CRC. Unlike E2F1, E2F4 is a canonical repressor TF whose interaction with miRNAs in CRC has not yet been determined. E2F4 was one of the overrepresented TFs for the target genes of our 9 candidate miRNAs, with 316 overlapping genes according to the ChEA libraries. CREB1 is reported to be involved in CRC cell plasticity by modulating the NF-κB signaling pathway via the CCAT1/MYC regulatory axis79. Recently, inhibition of JUND was proposed as a therapeutic option, as it is involved in the stemness of cancer stem cells in CRC80.
Interestingly, hsa-miR-1343-3p, which had the best classification performance in the XGBoost model and was among the top 3 classifiers in the Random Forest model, has the highest degree (223) among our 9 candidate miRNAs. Most of these biological processes and molecular functions are fundamental phenomena of cellular life that are involved in normal and malignant growth and its regulation and control. Any sustained disruption to the regulation of these fundamental processes and functions, which is the downstream action of the identified miRNAs, may result in the malignant transformation of colon cells. The provided justifications indicate that the identified miRNAs present a valuable opportunity to expand the study to introduce reliable biomarkers for colon cancer diagnosis and/or prognosis.

The potential contributions mentioned above are also supported by other studies, some of which we review here. Yaghoubi et al. identified hsa-miR-1228-5p as a potential biomarker in ovarian cancer81. Additionally, previous studies have suggested that hsa-miR-1228-5p exhibits high diagnostic accuracy for hepatocellular carcinoma82. Furthermore, interactions between hsa-miR-1228-5p and TRIM26 (Tripartite Motif 26) as well as SNRPB (Small Nuclear Ribonucleoprotein Polypeptides B and B1) have been proposed as a potential mechanistic axis influencing the progression of kidney clear cell carcinoma83. It is already established that TRIM26 promotes colorectal cancer growth by inactivating p5384. Moreover, in a pan-cancer analysis that included colon adenocarcinoma, Wu et al. found that SNRPB expression was significantly elevated across nearly all tumor types. They further reported that its upregulation may facilitate tumor progression, impact Tumor-Node-Metastasis (TNM) staging, and serve as a risk factor for poor prognosis across various cancers85. In a separate study, Zhong et al. identified hsa-miR-1228-5p as one of the differentially expressed miRNAs in dermatomyositis-associated interstitial lung disease, specifically in patients with anti-melanoma differentiation-associated protein 5 (MDA5) antibody-positive subsets86. Their target analysis further revealed that ZBTB22 (Zinc Finger and BTB Domain Containing 22) and MDM2 had the strongest evidence for interaction with hsa-miR-1228-5p in a miRNA-mRNA regulatory circuit86. Interestingly, Douglas et al. discovered ZBTB22 mutations in rectal cancer patients with poor response to chemoradiation and proctectomy, whereas these mutations were absent in complete responders87. The role of MDM2 in colon cancer is well established. Notably, its oncogene overexpression in colon adenocarcinoma has been shown to directly influence p53 oncoprotein levels88.

Expanding on miRNA-based biomarkers, Kamkar et al. employed weighted miRNA co-expression network analysis on 972 serum miRNA profiles across thirteen cancer types and healthy individuals. They identified hsa-miR-1228-5p, hsa-miR-1343-3p, hsa-miR-6765-5p, and hsa-miR-6787-5p as promising biomarkers for gastric cancer detection, achieving an accuracy of 87%, specificity of 90%, and sensitivity of 89%89. Similarly, Mitsunaga et al. demonstrated that hsa-miR-1343-5p, in combination with four other miRNAs, serves as a valuable biomarker for the early diagnosis of pancreatobiliary cancer90. Further supporting its role in gastrointestinal cancers, Cao et al. reported a possible interaction between hsa-miR-1343-3p and the DUOX2 (Dual Oxidase 2) gene in pancreatic cancer91. Additionally, DUOX2 has been shown to promote colorectal cancer progression by regulating the AKT pathway and interacting with the RPL3 protein92.

Beyond cancer, Cho et al. identified hsa-miR-6717-5p as one of ten significantly downregulated miRNAs in pseudoexfoliation glaucoma patients compared to controls in a Korean population93. In a large-scale microarray analysis, Chen and Dhahbi examined datasets from 13 cancer types, including colorectal and gastric cancers, using 100 random forest models. Their analysis highlighted hsa-miR-3184-5p as a key diagnostic marker, and a combined model incorporating hsa-miR-3184-5p alongside three other miRNAs achieved an exceptional AUC of 0.9815, underscoring its potential for cancer screening94. Moreover, hsa-miR-3184-5p has been validated as a reliable biomarker for early bladder cancer detection and has a regulatory role in breast cancer95,96. In functional studies, Rajarajan et al. discovered through in vitro assays that miR-3184-5p was the most upregulated miRNA in adipocyte-induced breast cancer cells. They further identified FOXP4 as a direct target of miR-3184-5p, linking it to increased cell proliferation and invasive capacity in breast cancer96. Additionally, elevated FOXP4 expression levels have been associated with advanced pathological stages in colorectal cancer patients97. Meanwhile, the contribution of hsa-miR-1246 to laryngeal squamous cell carcinoma has also been previously established (X19). Finally, hsa-miR-8073 has been introduced by Yaghoubi et al. as a potential biomarker in ovarian cancer98.

Integrating Boruta’s robust feature selection with tree-based classifiers (Random Forest and XGBoost) offers significant potential for advancing colorectal cancer research and clinical practice. The interpretability of tree-based models enables clinicians and biologists to prioritize miRNAs for mechanistic studies, such as exploring their roles in regulating oncogenic pathways and epigenetic modifications. In addition, the framework’s adaptability to multi-omics data (e.g., integrating miRNA expression with mRNA or methylation profiles) could refine CRC subtyping or predict therapeutic responses, supporting personalized treatment strategies. Finally, the method’s generalizability makes it applicable to biomarker discovery in other cancers or complex diseases where high-dimensional data and small sample sizes remain a challenge.
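
To make the idea of pairing Boruta-type feature selection with a tree-based classifier more concrete, a minimal sketch is given below. It is not the pipeline used in this study: it is a simplified shadow-feature procedure in the spirit of Boruta, written with NumPy and scikit-learn and run on synthetic data. The function name, the fixed hit-fraction threshold (the original Boruta algorithm uses a statistical test with iterative rejection), and all parameter values are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def shadow_feature_selection(X, y, n_rounds=15, keep_fraction=0.8, seed=0):
    """Simplified Boruta-style selection (illustrative, not the full algorithm).

    Each round appends a column-wise shuffled copy of X ("shadow" features),
    fits a random forest, and credits a hit to every real feature whose
    importance exceeds the best shadow importance. Features that score a hit
    in at least keep_fraction of the rounds are returned.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    for round_idx in range(n_rounds):
        # Shuffle every column independently to break any link with y.
        shadows = rng.permuted(X, axis=0)
        forest = RandomForestClassifier(
            n_estimators=50, random_state=seed + round_idx
        )
        forest.fit(np.hstack([X, shadows]), y)
        importances = forest.feature_importances_
        best_shadow = importances[n_features:].max()
        hits += importances[:n_features] > best_shadow
    return np.flatnonzero(hits >= keep_fraction * n_rounds)


# Synthetic stand-in for an expression matrix: 200 samples, 10 features,
# of which only features 0 and 1 carry class information (assumed data).
data_rng = np.random.default_rng(42)
y = data_rng.integers(0, 2, size=200)
X = data_rng.normal(size=(200, 10))
X[:, 0] += 2.0 * y
X[:, 1] -= 2.0 * y

selected = shadow_feature_selection(X, y)
print("Selected feature indices:", selected.tolist())

# Train the final tree-based classifier on the retained features only.
final_model = RandomForestClassifier(n_estimators=200, random_state=0)
final_model.fit(X[:, selected], y)
```

Comparing each real feature against the best shuffled feature gives a data-driven importance threshold, which is the property that makes this family of wrapper methods attractive when the number of features far exceeds the number of samples. Another tree-based classifier could take the place of the final Random Forest under the same scheme.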

It is important to note that this study was subject to certain limitations. The C group sample size was relatively limited. A further limitation was the lack of clinicopathological information, such as tumor stage, patient age, and other factors, which was not available in our datasets. Furthermore, no experimental data are currently available to substantiate our theoretical findings. Future work could validate these miRNAs in prospective cohorts or integrate them with clinical variables to build risk-stratification tools for clinical deployment.

Conclusion

In this paper, we applied decision tree-based machine learning algorithms together with wrapper-based feature selection methods to model colorectal cancer using serum miRNA expression data. Our integrated bioinformatics analysis selected 20 significant miRNAs that could be potential biomarkers for the diagnosis of CRC, achieving an AUC of over 90%. Based on our model’s results and additional filtering using MS > 50%, we further narrowed the final candidates down to 9 miRNAs (hsa-miR-1228-5p, hsa-miR-6787-5p, hsa-miR-1343-3p, hsa-miR-6717-5p, hsa-miR-3184-5p, hsa-miR-1246, hsa-miR-4706, hsa-miR-8073, and hsa-miR-5100), which are proposed as promising diagnostic biomarkers. Although experimental validation to further corroborate our results is still missing, the accuracy of our model is sufficiently high (100% AUC) to justify further examination of potential clinical applications.