Abstract
MicroRNAs (miRNAs), which are produced from precursor microRNAs (pre-miRNAs), regulate various biological processes. Because such microRNAs are short, homology-based searching is not very useful. Hence, various machine learning based tools have been designed to predict the precursor hairpin loops using thermodynamic and sequence features. In this research, we present a comparative statistical analysis of the features used in the development of machine learning based predictive tools. The sequence features of insect pre-miRNAs were compared with those of pre-miRNAs from other available organisms. We first established, using the Kolmogorov-Smirnov (KS) test, that features such as length, GC content and minimum free energy (MFE) of folding differ in insects compared with other organisms. We then trained predictive models for one-vs-rest binary classification using XGBoost for insects, human, monocots, aves, ruminants, sauria, dogs and rodents. We performed PCA and retained 14 principal components for classification based on cumulative explained variance. The XGBoost parameters were tuned with 5-fold cross-validation (CV), and the parameter values with the highest CV score were selected. We used independent held-out data to test the models. The accuracy of insect, monocots, rodents, human, ruminants, sauria, aves and dogs was found to be 0.8549, 0.8626, 0.6835, 0.7005, 0.8875, 0.6972, 0.7591 and 0.6588 respectively. This shows that ancestral lineage-specific ML models can be developed for detection of pre-miRNAs in different classes of organisms.
Background
Precursor microRNAs (pre-miRNAs) are non-coding RNA hairpin loops, excised from primary transcripts by Drosha, that are further cleaved to produce microRNAs (miRNAs)1,2. Multiple miRNAs can be produced from a single pre-miRNA, which is why the characterization and identification of pre-miRNAs have been of great importance. miRNAs have been found to regulate gene expression in various biological processes such as development, cell proliferation, cell differentiation, apoptosis, transposon silencing, and antiviral defense3,4,5,6. In insects, changes in the miRNA expression profile have been observed in various biological processes such as metamorphosis, reproduction, immune response, etc.7,8,9,10,11,12,13. miRNAs are believed to be conserved, and similar across species, although they target diverse genes14,15,16.
Various tools have been designed to predict pre-miRNAs, as they give rise to mature miRNAs. The data for these tools are downloaded from miRBase, which contains a collection of pre-miRNAs and their corresponding miRNAs from various organisms17. It currently holds miRNAs from 271 organisms. Features such as nucleotide bases, sequence length and GC content of pre-miRNAs are used to train machine learning classifiers to predict true pre-miRNAs18,19,20,21,22,23,24,25,26,27,28,29,30. Deep learning methods31 have also been used to detect pre-miRNA hairpin loops in the context of COVID-19.
However, most existing tools are either general-purpose or tailored to a single organism or taxonomic group, and they often assume that pre-miRNA features are conserved across species. This assumption may not hold for phylogenetically distant groups such as insects, which are known to have unique regulatory networks and ecological specializations. There is a growing need to develop organism-aware or lineage-specific models to improve the accuracy and biological relevance of miRNA prediction32,33.
In this work, we analyzed insect pre-miRNA sequences from miRBase and performed a comparative statistical analysis against other available organisms. We first established that features such as length, GC content and MFE differ in insects compared with other organisms. We then trained predictive models for classification using XGBoost for insects, human, monocots, aves, ruminants, sauria, dogs and rodents.
Methods
Data collection and pre-processing
We collected pre-miRNA sequences of insects, human, monocots, aves, ruminants, sauria, dogs and rodents from miRBase17 and labelled them for comparison. The secondary structure was calculated using the RNAfold program from the ViennaRNA package. The FASTA header, nucleotide sequence, MFE score and secondary structure of each pre-miRNA sequence were converted into tabular format using an in-house Python script.
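The in-house script itself is not reproduced here. As a rough illustration of the conversion step, the following minimal sketch assumes RNAfold's usual three-line text output (FASTA header, sequence, dot-bracket structure followed by the MFE in parentheses). The function names `parse_rnafold` and `to_csv`, the column names, and the example record are hypothetical and are not taken from the authors' code; the example MFE value is invented for illustration.

```python
import csv
import io
import re

# Structure line as printed by RNAfold: dot-bracket string, then the MFE in parentheses.
STRUCT_RE = re.compile(r"^([().]+)\s+\(\s*(-?\d+(?:\.\d+)?)\)\s*$")


def parse_rnafold(text):
    """Parse three-line records (header, sequence, structure + MFE) into dicts."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("expected three non-empty lines per record")
    records = []
    for i in range(0, len(lines), 3):
        header, seq, struct_line = lines[i:i + 3]
        match = STRUCT_RE.match(struct_line)
        if not header.startswith(">") or match is None:
            raise ValueError(f"malformed record starting at non-empty line {i + 1}")
        seq = seq.upper()
        length = len(seq)
        mfe = float(match.group(2))
        records.append({
            "id": header[1:].split()[0],
            "sequence": seq,
            "structure": match.group(1),
            "length": length,
            "gc_percent": 100.0 * sum(seq.count(b) for b in "GC") / length,
            "mfe": mfe,
            "dG": mfe / length,  # length-normalised MFE (MFE/Length)
        })
    return records


def to_csv(records):
    """Serialise the parsed records as CSV text (one row per pre-miRNA)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical record; the MFE value is made up for illustration only.
EXAMPLE = """>hypothetical-mir-1 example
GCGCAAAAGCGC
((((....)))) ( -3.40)
"""
rows = parse_rnafold(EXAMPLE)
print(to_csv(rows))
```

Only three of the features discussed in this work (length, %G + C and MFE/Length) are computed in this sketch; the remaining features would be added as further columns in the same way.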
Hypothesis testing
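The null hypothesis was that pre-miRNAs of all organism groups are physically and compositionally similar, i.e. each feature follows the same distribution in insects as in the other groups.
Our alternative hypothesis states that insect pre-miRNAs differ in many of the features that are routinely used in machine learning (ML) tools.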
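To assess whether the features were normally distributed, the Shapiro-Wilk test32 was performed, as given in Eq. 1:
\[W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2} \tag{1}\]
where \(x_{(i)}\) are the ordered sample values, \(\bar{x}\) is the sample mean, and \(a_i\) are coefficients that depend on the sample size \(n\).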
The results suggested that the data were not normally distributed; hence, we performed two-sample Kolmogorov-Smirnov (KS) tests, given in Eq. 2, to compare the distributions of 57 pre-miRNA features between insects and each of seven other organisms (aves, human, mammalia, monocots, rodent, rumin, sauria), resulting in 399 comparisons:
\[D_{n,m} = \sup_{x} \left| F_n(x) - G_m(x) \right| \tag{2}\]
where \(F_n(x)\) and \(G_m(x)\) are the empirical CDFs of the two samples being compared, of sizes \(n\) and \(m\) respectively.
The significance level for all statistical tests was set at α = 0.05. Additionally, to account for multiple comparisons, a Bonferroni correction was applied to adjust the p-values and minimize the risk of type I errors.
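The two-sample KS comparison with a Bonferroni-adjusted threshold can be sketched in plain Python as follows. This is an illustrative sketch rather than the code used in the study: the function names are hypothetical, the p-value uses the standard asymptotic Kolmogorov series (an approximation that is reasonable for large samples), and in practice a library routine such as `scipy.stats.ks_2samp` would normally be preferred.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic D = sup_x |F_n(x) - G_m(x)| over empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    d = 0.0
    # The supremum of a difference of step functions is attained at a data point.
    for x in a_sorted + b_sorted:
        f = bisect_right(a_sorted, x) / n
        g = bisect_right(b_sorted, x) / m
        d = max(d, abs(f - g))
    return d


def ks_pvalue_asymptotic(d, n, m, terms=100):
    """Asymptotic two-sided p-value: 2 * sum_k (-1)^(k-1) * exp(-2 k^2 lambda^2)."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:  # the series converges slowly here and the p-value is ~1
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold after Bonferroni correction."""
    return alpha / n_comparisons


# 57 features x 7 organism groups = 399 comparisons, as described in the text.
threshold = bonferroni_alpha(0.05, 57 * 7)

# Toy data standing in for one feature measured in two groups.
group_a = [float(v) for v in range(50)]
group_b = [float(v) + 100.0 for v in range(50)]
d = ks_statistic(group_a, group_b)
p = ks_pvalue_asymptotic(d, len(group_a), len(group_b))
print(f"D={d:.3f}, p={p:.3g}, significant={p < threshold}")
```

Comparing each raw p-value against `alpha / n_comparisons` is equivalent to multiplying the p-values by the number of comparisons and comparing against the unadjusted α.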
Before applying ML algorithms, we also performed Mann-Whitney U tests to assess differences in medians and Levene’s tests to evaluate equality of variances.
Feature engineering
We used dimensionality reduction techniques to identify the most informative features for training. Initially, we calculated Pearson’s correlation coefficients between all feature pairs to detect multicollinearity. For each pair of highly correlated features, one representative feature was retained while the other was removed to eliminate redundancy and reduce the risk of overfitting. Following this filtering step, we applied Principal Component Analysis (PCA) and computed the sum of squared loadings for each feature to assess its contribution to the principal components. We then selected the minimal set of features that collectively accounted for 95% of the total variance in the dataset, also known as the cumulative explained variance33,34, ensuring both compactness and informativeness of the feature set.
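A minimal NumPy sketch of the two steps (correlation filtering, then choosing the number of principal components from the cumulative explained variance) is given below. Several details are assumptions made for illustration and are not stated in the text: the correlation cut-off of 0.9, the greedy keep-first rule, the standardisation of features before PCA, and the function names.

```python
import numpy as np


def drop_correlated(X, names, threshold=0.9):
    """Greedily keep a feature only if |Pearson r| with every kept feature is below threshold."""
    corr = np.corrcoef(X, rowvar=False)
    keep = []
    for j in range(X.shape[1]):
        if all(abs(corr[j, k]) < threshold for k in keep):
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]


def n_components_for_variance(X, target=0.95):
    """Smallest number of principal components whose cumulative explained variance >= target."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)      # standardise (assumption)
    eigvals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]  # descending order
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(eigvals)), cumulative


# Synthetic data: feature "b" is almost a copy of "a" and should be filtered out.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.column_stack([
    base[:, 0],
    2.0 * base[:, 0] + 0.01 * rng.normal(size=200),
    base[:, 1],
    base[:, 2],
])
X_filtered, kept = drop_correlated(X, ["a", "b", "c", "d"])
k, cumulative = n_components_for_variance(X_filtered)
print(kept, k, np.round(cumulative, 3))
```

The cumulative curve returned by `n_components_for_variance` is the quantity plotted in a cumulative explained variance (scree-style) plot.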
Training with XGBoost
A one-vs-rest binary classification approach was adopted to distinguish each organism from the rest. For each target organism, all available sequences were treated as positive samples. An equal number of negative samples were randomly drawn without replacement from the pool of remaining organisms to maintain class balance. This procedure ensured that each binary classifier was trained on a balanced dataset of equal positive and negative examples.
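The balanced one-vs-rest sampling described above can be sketched as follows. The function names, the fixed random seed and the simple shuffled 80/20 split are illustrative assumptions; the text does not say whether the split was stratified.

```python
import random


def balanced_one_vs_rest(labels, target, seed=0):
    """Return sample indices and 0/1 labels with equal positives and negatives."""
    rng = random.Random(seed)
    positives = [i for i, lab in enumerate(labels) if lab == target]
    rest = [i for i, lab in enumerate(labels) if lab != target]
    if len(rest) < len(positives):
        raise ValueError("not enough negatives to balance the positive class")
    # random.sample draws without replacement.
    negatives = rng.sample(rest, len(positives))
    indices = positives + negatives
    y = [1] * len(positives) + [0] * len(negatives)
    return indices, y


def split_80_20(indices, y, seed=0):
    """Shuffle and split (index, label) pairs into 80% train and 20% test."""
    rng = random.Random(seed)
    order = list(range(len(indices)))
    rng.shuffle(order)
    cut = int(0.8 * len(order))
    train = [(indices[k], y[k]) for k in order[:cut]]
    test = [(indices[k], y[k]) for k in order[cut:]]
    return train, test


# Toy label vector standing in for the organism label of each sequence.
labels = ["insect"] * 10 + ["human"] * 30 + ["aves"] * 20
idx, y = balanced_one_vs_rest(labels, "insect")
train, test = split_80_20(idx, y)
print(len(train), len(test), sum(y), len(y) - sum(y))
```

The same routine would be called once per target organism, giving one balanced dataset per binary classifier.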
We kept 80% of the data for training (X_train) and used 20% of the data for testing (X_test). We used the XGBoost classifier for classification. XGBoost builds an ensemble of CART models and parallelizes their construction, which improves the computation speed. A second-order Taylor expansion of the loss, which measures the error between the predicted and true values, is used to optimize the model35. It can also handle missing feature values internally and, being tree-based, does not require feature standardization36. It has hence been used in the estimation and classification of biological data37,38. It is based on minimising a regularized loss function, \(L^{(t)}\), which can be written as:
\[L^{(t)} = \sum_{i=1}^{n} l\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)\]
where \(l\) measures the difference between the prediction \(\hat{y}_i\) and the target \(y_i\) for the \(i\)th instance at iteration \(t\), \(f_t\) is an independent tree for a given input \(x_i\), and \(\Omega(f_t)\) works as a penalty function. We used 5-fold CV while optimising the following parameters of the XGBoost package through its Scikit-learn interface:
- lambda: L2 regularization, ranging from 1e-3 to 10.
- alpha: L1 regularization, ranging from 1e-3 to 10.
- colsample_bytree: Subsample ratio of columns during construction of each tree, ranging from 0.3 to 1.0.
- subsample: Ratio of training instances, ranging from 0.4 to 1.
- learning_rate: Step size at each iteration while moving towards the minimum of the loss function, ranging from 0.001 to 0.2.
- n_estimators: Number of trees, ranging from 50 to 400.
- max_depth: Maximum depth of a tree, ranging from 5 to 17.
- min_child_weight: Minimum number of instances needed in each node, ranging from 1 to 300.
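One way to draw candidate configurations from the ranges listed above is sketched below. The sampling distributions (log-uniform for the regularization terms and the learning rate, uniform or integer-uniform otherwise), the number of iterations and the helper names `sample_params` and `tune` are assumptions for illustration. The keyword names `reg_lambda` and `reg_alpha` are the Scikit-learn-style spellings of lambda and alpha in `XGBClassifier`. The `tune` function is only defined, not run, and needs the third-party packages xgboost and scikit-learn.

```python
import math
import random

# (kind, low, high) per hyperparameter, using the ranges listed in the text.
SEARCH_SPACE = {
    "reg_lambda": ("log", 1e-3, 10.0),
    "reg_alpha": ("log", 1e-3, 10.0),
    "colsample_bytree": ("uniform", 0.3, 1.0),
    "subsample": ("uniform", 0.4, 1.0),
    "learning_rate": ("log", 0.001, 0.2),
    "n_estimators": ("int", 50, 400),
    "max_depth": ("int", 5, 17),
    "min_child_weight": ("int", 1, 300),
}


def sample_params(rng):
    """Draw one random configuration from SEARCH_SPACE."""
    params = {}
    for name, (kind, low, high) in SEARCH_SPACE.items():
        if kind == "log":
            value = math.exp(rng.uniform(math.log(low), math.log(high)))
            params[name] = min(max(value, low), high)  # guard against rounding
        elif kind == "uniform":
            params[name] = rng.uniform(low, high)
        else:
            params[name] = rng.randint(low, high)  # inclusive bounds
    return params


def tune(X_train, y_train, n_iter=50, seed=0):
    """Random search with 5-fold CV on the training data only (requires xgboost, scikit-learn)."""
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    rng = random.Random(seed)
    best_score, best_params = -1.0, None
    for _ in range(n_iter):
        params = sample_params(rng)
        scores = cross_val_score(XGBClassifier(**params), X_train, y_train,
                                 cv=5, scoring="accuracy")
        if scores.mean() > best_score:
            best_score, best_params = scores.mean(), params
    return best_score, best_params


print(sample_params(random.Random(0)))
```

Scikit-learn's `RandomizedSearchCV` offers the same search as a single estimator object; the explicit loop is used here only to keep the sampling step visible.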
Performance analysis
To assess the generalizability of each classifier, the hyperparameter tuning and cross-validation were carried out exclusively on the training data and the final model was evaluated on the held-out test set.
The best parameters were chosen and the models were evaluated on the X_test dataset to assess their performance. During performance evaluation, we considered the True dataset to be the group being evaluated and the False dataset to be the collection of all other groups in each case. The performance was calculated based on the following classical classification measures: sensitivity (SN): \(SN = \frac{TP}{TP+FN}\), specificity (SP): \(SP = \frac{TN}{TN+FP}\), accuracy (Acc): \(Acc = \frac{TN+TP}{TN+FP+TP+FN}\), precision (p): \(p = \frac{TP}{TP+FP}\), harmonic mean of sensitivity and precision (F1): \(F_1 = 2\frac{SN \cdot p}{SN + p}\) and Matthews correlation coefficient (MCC): \(MCC = \frac{(TP \cdot TN) - (FP \cdot FN)}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}\), where TP, TN, FP and FN are the numbers of true-positive, true-negative, false-positive and false-negative classifications, respectively. For a given false positive rate (α) and true positive rate (1 − β) at different threshold values, the AUC-ROC was computed as: \(AUC = \sum_{i=1}^{m}\left\{(1-\beta_{i-1})\Delta\alpha + \frac{1}{2}\left[\Delta(1-\beta)\Delta\alpha\right]\right\}\), where Δ(1 − β) = (1 − βi) − (1 − βi−1), Δα = αi − αi−1 and i = 1, 2, …, m (number of test data points)31. The workflow of the methods is given in Fig. 1.
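For concreteness, the measures above can be computed from confusion-matrix counts and classifier scores as in the sketch below. The function names and toy inputs are illustrative; the MCC uses the standard definition with TP·TN − FP·FN in the numerator, and the AUC is accumulated with the trapezoidal rule over successive ROC points.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, precision, F1 and MCC from counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * sn * precision / (sn + precision)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "Acc": acc, "p": precision, "F1": f1, "MCC": mcc}


def roc_auc(scores, labels):
    """Trapezoidal AUC: sum of d_alpha * (TPR_prev + 0.5 * d_TPR) over thresholds."""
    pairs = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    prev_fpr = prev_tpr = 0.0
    auc = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Process tied scores together so they form a single ROC point.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tp += pairs[j][1]
            fp += 1 - pairs[j][1]
            j += 1
        tpr, fpr = tp / n_pos, fp / n_neg
        auc += (fpr - prev_fpr) * (prev_tpr + 0.5 * (tpr - prev_tpr))
        prev_fpr, prev_tpr = fpr, tpr
        i = j
    return auc


# Toy inputs, not results from the study.
metrics = classification_metrics(tp=8, tn=7, fp=3, fn=2)
auc = roc_auc([0.9, 0.8, 0.7, 0.6, 0.55, 0.4], [1, 1, 0, 1, 0, 0])
print({k: round(v, 4) for k, v in metrics.items()}, round(auc, 4))
```

With balanced test sets, as used here, accuracy equals the mean of sensitivity and specificity, which gives a quick consistency check on reported values.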
Results
Data pre-processing
A total of 5541 sequences were collected for the analysis as given in Table 1.
Parameter calculation
The parameters used for classification are given in Table 2. A total of 57 parameters were calculated, including 16 dinucleotide counts and their percentages, 4 nucleotide counts and their percentages, 2 bp counts, base propensity, Shannon entropy, etc.
Hypothesis testing
The Shapiro-Wilk test results indicate that the parameters are not normally distributed, suggesting the use of the Kolmogorov–Smirnov (KS) test. A Bonferroni-corrected alpha threshold of 0.000125 was applied to the KS test, which identified 335 combinations as significantly different across various organism classes. In the remaining 64 combinations, no single parameter was common to all organisms, suggesting that the null hypothesis does not hold. Furthermore, the Mann–Whitney U test and Levene’s test showed that all 57 parameters exhibited significant differences in median and variance between at least one pair of organisms. The test results are provided in Supplementary Material 1.
Figure 2 illustrates the comparison of selected parameters—including length, %G + C content, and minimum free energy (dG)—across 500 randomly sampled pre-miRNA sequences from each organism for visual clarity.
Comparison of insect pre-miRNA with other classes of organisms. The comparison of various features of 500 randomly sampled pre-miRNAs from insects, human, monocots, aves, ruminants, sauria, dogs and rodents is shown in the pair-plot scatter diagrams. Features such as GC percentage (%G + C), length (Len) and dG (MFE/Length) were considered out of the 57 features mentioned below. Multivariate Gaussian distribution plots are given on the diagonals.
Feature engineering
To reduce redundancy, only one feature from each group of highly correlated variables was retained. As a result, the following features were removed due to high correlation: ‘AU’, ‘ND’, ‘A + U’, ‘G’, ‘G + C’, ‘A’, ‘%GC’, ‘%GG’, ‘%CC’, ‘%UU’, ‘mfe’, ‘D’, ‘AA’, ‘CC’, ‘U’, ‘%UA’, ‘NQ’, ‘pb’, ‘MFE3’, ‘%CG’, ‘%A + U’, ‘UU’ and ‘%G + C’. The results of the PCA are presented in Table 3. Based on the cumulative explained variance, 14 principal components were selected, capturing 95% of the total variance.
The cumulative explained variance by each principal component is also illustrated in Fig. 3. Based on this plot, the first 14 components were selected, as they collectively account for approximately 95% of the total variance.
Training with XGBoost
The classification model for each organism was trained and tuned using random search with cross-validation on the training set (80% of the data), followed by testing on the independent hold-out set (the remaining 20%). Table 4 summarizes the best cross-validation (CV) accuracy along with test accuracy, precision, recall, and F1-score. Detailed results of the hyperparameter tuning process (fit times, fold-wise test scores, selected hyperparameter values, and model ranking) for each organism are provided in Supplementary Material 2.
The best hyperparameter values identified during model tuning for each organism are summarized in Table 5. These include values for tree-specific parameters (e.g., max_depth, min_child_weight), learning rate, regularization parameters (reg_alpha, reg_lambda), and sampling parameters (colsample_bytree, subsample).
Performance evaluation
Various performance measures for each group of organisms are given in Table 6. The accuracy of insect, monocots, rodents, human, ruminants, sauria, aves and dogs was found to be 0.8549, 0.8626, 0.6835, 0.7005, 0.8875, 0.6972, 0.7591 and 0.6588 respectively. Specificity was found to be 0.8704 for insects, 0.8956 for monocots, 0.6424 for rodents, 0.7057 for human, 0.7042 for ruminants, 0.7705 for sauria, 0.6318 for aves and 0.7426 for dogs. The F1 score of insect, monocots, rodents, human, ruminants, sauria, aves and dogs was found to be 0.8572, 0.858, 0.696, 0.699, 0.695, 0.8, 0.6678 and 0.6415 respectively. Sensitivity was found to be 0.8395 for insects, 0.8297 for monocots, 0.7247 for rodents, 0.6953 for human, 0.6901 for ruminants, 0.8197 for sauria, 0.6859 for aves and 0.5941 for dogs. The MCC score of insect, monocots, rodents, human, ruminants, sauria, aves and dogs was found to be 0.7102, 0.7269, 0.3683, 0.4011, 0.3944, 0.5909, 0.3182 and 0.3404 respectively. These results are also shown graphically in Fig. 4. The radar plot provides a comparative overview of model performance across organisms, highlighting strengths and weaknesses in different metrics. Notably, organisms such as Monocots and Insects exhibit consistently higher scores, while Aves and Rodents show comparatively lower performance across several metrics.
Radar plot illustrating the performance of classification models across different organisms using six evaluation metrics: Accuracy, Precision, Sensitivity, F1 Score, Matthews Correlation Coefficient (MCC), and Specificity. Each line represents a different organism, allowing for comparative visualization of model strengths and weaknesses across taxa.
The ROC-AUC is given in Fig. 5. The AUC was found to be 0.92 for Insects, 0.93 for Monocots, 0.87 for Sauria, 0.74 for Dog, 0.77 for Ruminant, 0.76 for Human, 0.76 for Rodent, and 0.72 for Aves. These values indicate strong classifier performance for Insects, Monocots, and Sauria, suggesting the model can distinguish positive and negative cases effectively in these taxa.
Discussion
Data collection and pre-processing
We collected all available pre-miRNA sequences of insects, as our initial focus was to distinguish them from other organisms. We also collected data from rodents, monocots, aves, dogs and sauria (reptiles). The highest number of pre-miRNA sequences from a single species came from humans, followed by mouse and cattle. All the sequences formed characteristic hairpin loops, as inferred from the secondary structures calculated by RNAfold.
Hypothesis testing
Our null hypothesis was that all pre-miRNAs are physically and compositionally similar; we therefore performed KS tests with Bonferroni adjustment, comparing various parameters between insects and the other groups. These parameters are used in various machine learning based pre-miRNA prediction tools19,21,24,39. As the p-values of parameters such as Length, GC content, MFE1, etc. were < 0.05, we rejected the null hypothesis and accepted the alternative hypothesis that the pre-miRNA sequences of these groups differ from those of insects. This indicated the possibility of supervised training for classification of pre-miRNAs based on their ancestral origin. We therefore also performed Mann-Whitney U and Levene’s tests to compare the medians and variances of the features among the groups. We found that all 57 features differ significantly in at least one pair of organisms. Hence, we moved forward with the state-of-the-art XGBoost algorithm, which can efficiently learn to classify groups of data based on given labels.
Feature engineering and model training
Estimating which features contribute most to building the model is essential40,41. Our approach initially yielded 34 features after removing highly correlated ones. PCA is another widely used dimension reduction technique42,43. Using PCA, we selected 14 principal components for the classification model, as shown in the scree plot, capturing 95% of the cumulative explained variance.
Performance evaluation
The performance of XGBoost varied among the groups. Each group had fairly good accuracy; however, accuracy is often misleading and cannot be considered the best indicator of performance44,45. The predictive model had good specificity for each group. However, only insects, monocots and sauria had sensitivity above 80%, with insects having the highest at 83.95%. The MCC, which takes all four confusion-matrix categories into account, is a crucial indicator45. The F1 score for all the organisms was above 60%, but the MCC of only monocots and insects was above 60%.
The AUC values of 0.92, 0.93 and 0.87 for insects, monocots and reptiles (sauria), respectively, suggest that the parameters we used in XGBoost can classify these organisms.
Conclusion
In this work, we demonstrated the distinct nature of insect pre-miRNA by comparing it against that of other organisms and established that insect pre-miRNAs are significantly different from those of monocots, humans, rodents, ruminants, sauria, dogs, and aves. We further developed a predictive model using the XGBoost classifier, which effectively learned to differentiate pre-miRNA sequences across these organism classes based on a range of sequence and structural features.
In the future, this model can be implemented as a web server or standalone software tool, enabling researchers to rapidly classify unknown pre-miRNA sequences based on their likely taxonomic origin. Such a tool would be particularly valuable for annotating novel or poorly characterized genomes, assisting in evolutionary studies, and guiding experimental validation in non-model organisms. Additionally, expanding the dataset to include more taxa and incorporating deep learning-based feature extraction could further improve prediction accuracy and broaden the model’s applicability.
Data availability
All the data used in the analysis can be found at: https://github.com/adhiraj141092/mir_comp/tree/main/raw_data. The files created during the analysis are available at: https://github.com/adhiraj141092/mir_comp/tree/main/processes.
References
O’Brien, J., Hayder, H., Zayed, Y. & Peng, C. Overview of MicroRNA Biogenesis, mechanisms of Actions, and circulation. Front. Endocrinol. (Lausanne). 9, 402 (2018).
Han, J. et al. The Drosha-DGCR8 complex in primary MicroRNA processing. Genes Dev. 18, 3016–3027 (2004).
Ambros, V. The functions of animal MicroRNAs. Nature 431, 350–355 (2004).
Ruvkun, G. B. The tiny RNA world. Harvey Lect. 99, 1–21 (2003). (PMID: 15984549).
Cullen, B. R. Viral and cellular messenger RNA targets of viral MicroRNAs. Nature 457, 421–425 (2009).
Ventura, A. & Jacks, T. MicroRNAs and cancer: short RNAs go a long way. Cell 136, 586–591 (2009).
Zhang, Y. et al. microRNA-309 targets the homeobox gene SIX4 and controls ovarian development in the mosquito Aedes aegypti. Proc. Natl. Acad. Sci. U S A. 113, E4828–E4836 (2016).
Etebari, K. & Asgari, S. Conserved microRNA miR-8 blocks activation of the Toll pathway by upregulating Serpin 27 transcripts. RNA Biol. 10, 1356–1364 (2013).
Zhang, Q. et al. Genome-Wide analysis of MicroRNAs in relation to pupariation in Oriental fruit fly. Front. Physiol. 10, 301 (2019).
Sun, X. H. et al. A novel miRNA, miR-13664, targets CpCYP314A1 to regulate deltamethrin resistance in culex pipiens pallens. Parasitology 146, 197–205 (2019).
Tariq, K., Metzendorf, C., Peng, W., Sohail, S. & Zhang, H. miR-8-3p regulates mitoferrin in the testes of bactrocera dorsalis to ensure normal spermatogenesis. Sci. Rep. 6, 22565 (2016).
Gulhane, P., Nimsarkar, P., Kharat, K. & Singh, S. Deciphering miR-520c-3p as a probable target for immunometabolism in non-small cell lung cancer using systems biology approach. Oncotarget 13, 725 (2022).
Song, J. et al. The microRNAs let 7 and mir 278 regulate insect metamorphosis and oogenesis by targeting the juvenile hormone early response gene Krüppel homolog. Development 145 (24), dev170670 (2018).
Lee, C. T., Risom, T. & Strauss, W. M. Evolutionary Conservation of MicroRNA Regulatory Circuits: An Examination of MicroRNA Gene Complexity and Conserved MicroRNA-Target Interactions through Metazoan Phylogeny. DNA Cell Biol 26, 209–218. https://home.liebertpub.com/dna (2007)
Friedman, R. C., Farh, K. K. H., Burge, C. B. & Bartel, D. P. Most mammalian mRNAs are conserved targets of MicroRNAs. Genome Res. 19, 92–105 (2009).
Willmann, M. R. & Poethig, R. S. Conservation and evolution of MiRNA regulatory programs in plant development. Curr. Opin. Plant. Biol. 10, 503–511 (2007).
Kozomara, A., Birgaoanu, M. & Griffiths-Jones, S. MiRBase: from MicroRNA sequences to function. Nucleic Acids Res. 47, D155–D162 (2019).
Ng, K. L. S. & Mishra, S. K. De Novo SVM classification of precursor MicroRNAs from genomic Pseudo hairpins using global and intrinsic folding measures. Bioinformatics 23, 1321–1330 (2007).
Batuwita, R. & Palade, V. MicroPred: effective classification of pre-miRNAs for human MiRNA gene prediction. Bioinformatics 25, 989–995 (2009).
Tran, V. D. T., Tempel, S., Zerath, B., Zehraoui, F. & Tahi, F. miRBoost: boosting support vector machines for microRNA precursor classification. RNA 21, 775–785 (2015).
Gudyś, A., Szcześniak, M. W., Sikora, M. & Makałowska, I. HuntMi: an efficient and taxon-specific approach in pre-miRNA identification. BMC Bioinform. 14, 83 (2013).
Stegmayer, G., Yones, C., Kamenetzky, L. & Milone, D. H. High class-imbalance in pre-miRNA prediction: A novel approach based on deepSOM. IEEE/ACM Trans. Comput. Biol. Bioinform. 14, 1316–1326 (2017).
Xue, C. et al. Classification of real and Pseudo MicroRNA precursors using local structure-sequence features and support vector machine. BMC Bioinform. 6, 310 (2005).
Jiang, P. et al. MiPred: classification of real and Pseudo MicroRNA precursors using random forest prediction model with combined features. Nucleic Acids Res. 35, W339–W344 (2007).
Ye, J., Xu, M., Tian, X., Cai, S. & Zeng, S. Research advances in the detection of MiRNA. J. Pharm. Anal. 9, 217–226 (2019).
Condrat, C. E. et al. MiRNAs as biomarkers in disease: latest findings regarding their role in diagnosis and prognosis. Cells 9, 276 (2020).
Huang, K. Y., Lee, T. Y., Teng, Y. C. & Chang, T. H. ViralmiR: a support-vector-machine-based method for predicting viral MicroRNA precursors. BMC Bioinform. 16, S9 (2015).
Stegmayer, G. et al. Predicting novel microrna: a comprehensive comparison of machine learning approaches. Brief. Bioinform. 20, 1607–1620 (2019).
Nath, A. & Bora, U. RNAinsecta: A tool for prediction of precursor MicroRNA in insects and search for their target in the model organism Drosophila melanogaster. PLoS One. 18, e0287323 (2023).
Bugnon, L. A. et al. Deep learning for the discovery of new pre-miRNAs: helping the fight against COVID-19. Mach. Learn. Appl. 6, 100150 (2021).
Jaiswal, S. et al. Development of species specific putative MiRNA and its target prediction tool in wheat (Triticum aestivum L). Sci. Rep. 9, 3790 (2019).
Fávero, L. P. & Belfiore, P. Hypotheses Tests. Data Sci. Bus. Decis. Mak. 199–248. https://doi.org/10.1016/B978-0-12-811216-8.00009-4 (2019).
Vanhatalo, E., Kulahci, M. & Bergquist, B. On the structure of dynamic principal component analysis used in statistical process monitoring. Chemometr. Intell. Lab. Syst. 167, 1–11 (2017).
Shaharudin, S. M. & Ahmad, N. Choice of cumulative percentage in principal component analysis for regionalization of Peninsular Malaysia based on the rainfall amount. in 216–224 (2017). https://doi.org/10.1007/978-981-10-6502-6_19
Ahmad, F., Farooq, A. & Khan, M. U. G. Deep learning model for pathogen classification using feature fusion and data augmentation. Curr. Bioinform. 16, 466–483 (2020).
Yang, H. et al. A comparison and assessment of computational method for identifying recombination hotspots in Saccharomyces cerevisiae. Brief. Bioinform. 21, 1568–1580 (2020).
Li, H. et al. dPromoter-XGBoost: detecting promoters and strength by combining multiple descriptors and feature selection using XGBoost. Methods 204, 215–222 (2022).
Bi, Y. et al. An interpretable prediction model for identifying N7-Methylguanosine sites based on XGBoost and SHAP. Mol. Ther. Nucleic Acids. 22, 362–372 (2020).
Kadri, S., Hinman, V. & Benos, P. V. HHMMiR: efficient de Novo prediction of MicroRNAs using hierarchical hidden Markov models. BMC Bioinform. 10, S35 (2009).
Biesiada, J. & Duch, W. Feature selection for high-dimensional data - A pearson redundancy based filter. Adv. Soft Comput. 45, 242–249 (2007).
Saidi, R., Bouaguel, W. & Essoussi, N. Hybrid feature selection method based on the genetic algorithm and pearson correlation coefficient. Stud. Comput. Intell. 801, 3–24 (2019).
Kambhatla, N. & Leen, T. K. Dimension reduction by local principal component analysis. Neural Comput. 9, 1493–1516 (1997).
Zhang, T. & Yang, B. Big Data Dimension Reduction Using PCA. Proc.–2016 IEEE Int. Conf. Smart Cloud SmartCloud 152–157. https://doi.org/10.1109/SMARTCLOUD.2016.33 (2016).
Sokolova, M., Japkowicz, N. & Szpakowicz, S. Beyond accuracy, F-score and ROC: A family of discriminant measures for performance evaluation. AAAI Workshop - Technical Report WS-06-06, 24–29 (2006).
Chicco, D. & Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21, 6 (2020).
Acknowledgements
The authors thank the Param-Ishan HPC facility of IIT Guwahati for providing the computational resources needed to carry out the experiments. The authors thank the confidential reviewers for their insightful remarks that helped to enhance the quality of this manuscript.
Funding
There was no funding for this work.
Author information
Contributions
AN: Methodology, Project administration, Resources, Validation, Writing – original draft. UB: Conceptualization, Supervision, Writing – review & editing.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Nath, A., Bora, U. Comparative analysis of sequential and thermodynamic features of pre-miRNA in insects with various organisms and applying XGBoost for one-vs-rest binary classification. Sci Rep 15, 39407 (2025). https://doi.org/10.1038/s41598-025-22138-4
DOI: https://doi.org/10.1038/s41598-025-22138-4







