Background

Precursor microRNAs (pre-miRNAs) are the non-coding RNA hairpin loops generated by Drosha cleavage that are subsequently processed into mature microRNAs (miRNAs)1,2. Because multiple miRNAs can be produced from a single pre-miRNA, the characterization and identification of pre-miRNAs is of great importance. miRNAs regulate gene expression in various biological processes such as development, cell proliferation, cell differentiation, apoptosis, transposon silencing, and antiviral defense3,4,5,6. In insects, changes in miRNA expression profiles have been observed in processes such as metamorphosis, reproduction, and the immune response7,8,9,10,11,12,13. miRNAs are believed to be conserved and broadly similar across species, even though they target diverse genes14,15,16.

Various tools have been designed to predict pre-miRNAs, since they give rise to mature miRNAs. The training data for such tools are typically downloaded from miRBase, which contains a collection of pre-miRNAs and their corresponding miRNAs from various organisms17; it currently holds miRNAs from 271 organisms. Features such as nucleotide composition, sequence length, and GC content of pre-miRNAs are used to train machine learning classifiers to predict true pre-miRNAs18,19,20,21,22,23,24,25,26,27,28,29,30. Deep learning methods have also been applied to detect pre-miRNA hairpin loops in the context of COVID-1931.

However, most existing tools are either general-purpose or tailored to a single organism or taxonomic group, and they often assume that pre-miRNA features are conserved across species. This assumption may not hold for phylogenetically distant groups such as insects, which are known to have unique regulatory networks and ecological specializations. There is a growing need to develop organism-aware or lineage-specific models to improve the accuracy and biological relevance of miRNA prediction32,33.

In this work, we analyzed insect pre-miRNA sequences from miRBase and performed a comparative statistical analysis against other available organism groups. We first established that features such as length, GC content, and MFE differ in insects compared with other organisms. We then trained an XGBoost classification model to distinguish insects, humans, monocots, aves, ruminants, sauria, dogs, and rodents.

Methods

Data collection and pre-processing

We collected pre-miRNA sequences of insects, humans, monocots, aves, ruminants, sauria, dogs, and rodents from miRBase17 and labelled them for comparison. The secondary structure was calculated using the RNAfold software from the ViennaRNA package. The FASTA header, nucleotide sequence, MFE score, and secondary structure of each pre-miRNA were converted into tabular format using an in-house Python script.
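The in-house script is not part of the published record; the following minimal sketch illustrates how RNAfold output might be converted into a table, assuming RNAfold was run with default text output (header, sequence, and structure/MFE lines) and that the file names are placeholders.

```python
# Minimal sketch (not the authors' in-house script) of converting RNAfold output into
# a table. Assumes RNAfold was run as `RNAfold --noPS < pre_mirna.fa > rnafold.out`,
# so each record spans three lines: FASTA header, sequence, structure with MFE.
import re
import pandas as pd

records = []
with open("rnafold.out") as fh:                      # hypothetical output file
    lines = [ln.rstrip("\n") for ln in fh if ln.strip()]

for i in range(0, len(lines), 3):
    header = lines[i].lstrip(">")
    seq = lines[i + 1].upper().replace("T", "U")
    match = re.match(r"^([.()]+)\s+\(\s*(-?\d+\.\d+)\)", lines[i + 2])
    structure, mfe = match.group(1), float(match.group(2))
    records.append({
        "id": header,
        "sequence": seq,
        "length": len(seq),
        "gc_percent": 100 * (seq.count("G") + seq.count("C")) / len(seq),
        "mfe": mfe,
        "structure": structure,
    })

pd.DataFrame(records).to_csv("premirna_features.csv", index=False)
```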

Hypothesis testing

The null hypothesis was:

$$H_{0}: \text{All precursor miRNA features are similar among all organisms}$$

Our alternative hypothesis states that insect pre-miRNAs differ in many of the features routinely used by machine learning (ML) tools, i.e.

$$H_{A}: \text{Features of precursor miRNA differ significantly between insects and other organisms}$$

To assess whether the features were normally distributed, the Shapiro-Wilk test32 was performed, as given in Eq. 1.

$$W=\frac{\left(\sum_{i=1}^{n} a_{i} x_{(i)}\right)^{2}}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}$$
(1)

where \(x_{(i)}\) are the ordered sample values, \(\bar{x}\) is the sample mean, and \(a_{i}\) are coefficients that depend on the sample size \(n\).

The results suggested that the data were not normally distributed; hence, we performed two-sample Kolmogorov-Smirnov (KS) tests, given in Eq. 2, to compare the distributions of 57 pre-miRNA features between insects and each of the seven other organism groups (aves, dogs, humans, monocots, rodents, ruminants, sauria), resulting in 399 comparisons.

$$D_{n,m}=\max_{x}\left|F_{n}(x)-G_{m}(x)\right|$$
(2)

where \(F_{n}(x)\) and \(G_{m}(x)\) are the empirical CDFs of the two samples being compared.

The significance level for all statistical tests was set at α = 0.05. Additionally, to account for multiple comparisons, a Bonferroni correction was applied to adjust the p-values and minimize the risk of type I errors.

Because these features were to be used in machine learning (ML) models, we also performed Mann-Whitney U tests to assess differences in medians and Levene’s tests to evaluate the equality of variances.
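As an illustration, the per-feature tests could be run with SciPy as sketched below; the file name, column names, and group labels (e.g. "insect") are assumptions, not the authors' actual code.

```python
# Hedged sketch of the per-feature hypothesis tests (Shapiro-Wilk, two-sample KS,
# Mann-Whitney U, Levene) with a Bonferroni-adjusted threshold.
import pandas as pd
from scipy.stats import ks_2samp, levene, mannwhitneyu, shapiro

df = pd.read_csv("premirna_features.csv")            # hypothetical table with a 'group' column
features = [c for c in df.columns if c not in ("id", "sequence", "structure", "group")]
other_groups = [g for g in df["group"].unique() if g != "insect"]

alpha = 0.05
bonferroni_alpha = alpha / (len(features) * len(other_groups))   # e.g. 0.05 / 399

results = []
for feat in features:
    insect_vals = df.loc[df["group"] == "insect", feat]
    for grp in other_groups:
        other_vals = df.loc[df["group"] == grp, feat]
        results.append({
            "feature": feat,
            "group": grp,
            "shapiro_p": shapiro(insect_vals).pvalue,               # normality of insect values
            "ks_p": ks_2samp(insect_vals, other_vals).pvalue,       # difference in distribution
            "mwu_p": mannwhitneyu(insect_vals, other_vals).pvalue,  # difference in median
            "levene_p": levene(insect_vals, other_vals).pvalue,     # equality of variance
        })

tests = pd.DataFrame(results)
tests["ks_significant"] = tests["ks_p"] < bonferroni_alpha
```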

Feature engineering

We used dimensionality reduction techniques to identify the most informative features for training. Initially, we calculated Pearson’s correlation coefficients between all feature pairs to detect multicollinearity. For each pair of highly correlated features, one representative feature was retained while the other was removed to eliminate redundancy and reduce overfitting risk. Following this filtering step, we applied Principal Component Analysis (PCA) and computed the sum of squared loadings for each feature to assess its contribution to the principal components. We then selected the minimal set of features that collectively accounted for 95% of the total variance in the dataset, also known as the cumulative explained variance33,34, ensuring both compactness and informativeness of the feature set.
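A minimal sketch of this two-step feature selection is given below, assuming a pandas feature table; the |r| > 0.9 correlation cutoff, the standardization step before PCA, and the file name are assumptions, and the per-feature sum-of-squared-loadings calculation is omitted for brevity.

```python
# Illustrative sketch of the feature-selection steps: Pearson-correlation filtering
# followed by PCA retaining 95% cumulative explained variance.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("premirna_features.csv")                       # hypothetical feature table
X = df.drop(columns=["id", "sequence", "structure", "group"], errors="ignore")

# Drop one feature from every highly correlated pair
corr = X.corr(method="pearson").abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_filtered = X.drop(columns=to_drop)

# PCA keeping the smallest number of components explaining 95% of the variance
X_scaled = StandardScaler().fit_transform(X_filtered)
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(f"Retained {pca.n_components_} components explaining "
      f"{pca.explained_variance_ratio_.sum():.3f} of the total variance")
```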

Training with XGBoost

A one-vs-rest binary classification approach was adopted to distinguish each organism from the rest. For each target organism, all available sequences were treated as positive samples. An equal number of negative samples were randomly drawn without replacement from the pool of remaining organisms to maintain class balance. This procedure ensured that each binary classifier was trained on a balanced dataset of equal positive and negative examples.
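A sketch of this balanced one-vs-rest sampling, assuming a pandas DataFrame with a 'group' column and an arbitrary random seed, could look as follows.

```python
# Sketch of the balanced one-vs-rest dataset construction described above.
# The DataFrame layout (a 'group' column plus feature columns) and the seed are assumptions.
import pandas as pd

def build_binary_dataset(df: pd.DataFrame, target_group: str, seed: int = 42):
    """Return features and binary labels balanced between one organism group and the rest."""
    positives = df[df["group"] == target_group]
    # Draw an equal number of negatives without replacement from all other groups
    negatives = df[df["group"] != target_group].sample(
        n=len(positives), replace=False, random_state=seed
    )
    data = pd.concat([positives, negatives]).reset_index(drop=True)
    labels = (data["group"] == target_group).astype(int)
    return data.drop(columns=["group"]), labels

# Example: balanced dataset for the insect-vs-rest classifier
# X_insect, y_insect = build_binary_dataset(features_df, "insect")
```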

We kept 80% of the data for training (X_train) and used 20% for testing (X_test). We used an XGBoost classifier to train the classification models. XGBoost builds an ensemble of CART models, parallelizing the split search within each tree, which effectively improves computation speed. A second-order Taylor expansion of the loss is used to optimize the model by approximating the error between the predicted and true values35. It also handles missing feature values internally (by learning a default split direction) and therefore does not require feature standardization36. It has hence been used for the estimation and classification of biological data37,38. The model is based on minimising a regularized loss function \(L^{(t)}\), which can be written mathematically as:

$$L^{(t)}=\sum_{i=1}^{n} l\left(y_{i},\ \hat{y}_{i}^{(t-1)}+f_{t}(x_{i})\right)+\Omega(f_{t})$$
(3)

where \(l\) measures the difference between the prediction \(\hat{y}_{i}\) and the target \(y_{i}\) for the \(i\)th instance at iteration \(t\), \(f_{t}\) is an independent tree for a given input \(x_{i}\), and \(\Omega(f_{t})\) acts as a penalty (regularization) term. We used 5-fold CV while optimising the following parameters of the XGBoost classifier via its scikit-learn-compatible API (a sketch of this search is given after the list):

  • lambda: L2 regularization range from 1e-3 to 10.

  • alpha: L1 regularization range from 1e-3 to 10.

  • colsample_bytree: Subsample ratio of columns during construction of each tree, ranges from 0.3 to 1.0.

  • subsample: Ratio of training instances, ranges from 0.4 to 1.

  • learning_rate: Step size at each iteration while moving towards minimum of loss function, ranges from 0.001 to 0.2.

  • n_estimators: Number of trees, ranges from 50 to 400.

  • max_depth: Max depth of a tree, ranges from 5 to 17.

  • min_child_weight: Minimum instances needed to be in each node, ranges from 1 to 300.
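A sketch of the randomized 5-fold search over these ranges, using scikit-learn's RandomizedSearchCV with the XGBoost classifier, is shown below; the number of sampled configurations, the scoring metric, and the random seed are assumptions.

```python
# Sketch of a randomized 5-fold cross-validated search over the parameter ranges above.
from scipy.stats import loguniform, randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_distributions = {
    "reg_lambda": loguniform(1e-3, 10),      # L2 regularization, 1e-3 to 10
    "reg_alpha": loguniform(1e-3, 10),       # L1 regularization, 1e-3 to 10
    "colsample_bytree": uniform(0.3, 0.7),   # 0.3 to 1.0
    "subsample": uniform(0.4, 0.6),          # 0.4 to 1.0
    "learning_rate": uniform(0.001, 0.199),  # 0.001 to 0.2
    "n_estimators": randint(50, 401),        # 50 to 400 trees
    "max_depth": randint(5, 18),             # 5 to 17
    "min_child_weight": randint(1, 301),     # 1 to 300
}

search = RandomizedSearchCV(
    estimator=XGBClassifier(eval_metric="logloss"),
    param_distributions=param_distributions,
    n_iter=100,          # assumed number of sampled configurations
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",
    n_jobs=-1,
    random_state=42,
)
# Fit on the 80% training split, e.g. search.fit(X_train, y_train)
```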

Performance analysis

To assess the generalizability of each classifier, the hyperparameter tuning and cross-validation were carried out exclusively on the training data and the final model was evaluated on the held-out test set.

The best parameters were chosen and the models were evaluated on the X_test dataset to check their efficiency. During performance evaluation, the positive class was the group being evaluated and the negative class was the pooled set of all other groups in each case. Performance was calculated using the following classical classification measures: sensitivity (SN): \(SN=\frac{TP}{TP+FN}\), specificity (SP): \(SP=\frac{TN}{TN+FP}\), accuracy (Acc): \(Acc=\frac{TN+TP}{TN+FP+TP+FN}\), precision (p): \(p=\frac{TP}{TP+FP}\), the harmonic mean of sensitivity and precision (F1): \(F_{1}=2\,\frac{SN\cdot p}{SN+p}\), and Matthews correlation coefficient (MCC): \(MCC=\frac{(TP\cdot TN)-(FP\cdot FN)}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}\), where TP, TN, FP, and FN are the numbers of true-positive, true-negative, false-positive, and false-negative classifications, respectively. For a given false positive rate (α) and true positive rate (1 − β) at different threshold values, the AUC-ROC was computed as \(AUC=\sum_{i=1}^{m}\left\{(1-\beta_{i})\,\Delta\alpha+\frac{1}{2}\,\Delta(1-\beta)\,\Delta\alpha\right\}\), where \(\Delta(1-\beta)=(1-\beta_{i})-(1-\beta_{i-1})\), \(\Delta\alpha=\alpha_{i}-\alpha_{i-1}\), and \(i=1,2,\ldots,m\) (the number of test data points)31. The workflow of the methods is given in Fig. 1.
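For reference, these measures could be computed on the held-out test set as in the sketch below; the helper name is illustrative, and MCC and AUC-ROC are taken from scikit-learn rather than computed from the formulas above.

```python
# Sketch of the held-out test-set evaluation using the classical measures defined above.
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score

def evaluate(model, X_test, y_test):
    y_pred = model.predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": matthews_corrcoef(y_test, y_pred),
        "auc_roc": roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
    }

# e.g. metrics = evaluate(search.best_estimator_, X_test, y_test)
```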

Fig. 1
figure 1

Workflow for XGBoost training.

Results

Data pre-processing

A total of 5541 sequences were collected for the analysis as given in Table 1.

Table 1 Total sequences collected for the analysis.

Parameter calculation

The parameters used for classification are given in Table 2. A total of 57 parameters were calculated, of which 16 were dinucleotide counts (with their percentage counts), 4 were nucleotide counts (with their percentage counts), and 2 were base-pair counts; the remainder included base propensity, Shannon entropy, etc.

Table 2 Parameters calculated for feature extraction.

Hypothesis testing

The Shapiro-Wilk test results indicated that the parameters were not normally distributed, supporting the use of the Kolmogorov–Smirnov (KS) test. A Bonferroni-corrected alpha threshold of 0.000125 was applied to the KS tests, which identified 335 of the 399 feature–organism combinations as significantly different across the organism classes. Among the remaining 64 non-significant combinations, no single parameter was common to all organism comparisons, suggesting that the null hypothesis does not hold. Furthermore, the Mann–Whitney U test and Levene’s test showed that all 57 parameters exhibited significant differences in median and variance between at least one pair of organisms. The test results are provided in Supplementary Material 1.
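This threshold corresponds to the standard Bonferroni adjustment of the significance level over all 57 × 7 = 399 KS comparisons:

$$\alpha_{\mathrm{adj}}=\frac{\alpha}{k}=\frac{0.05}{57\times 7}=\frac{0.05}{399}\approx 1.25\times 10^{-4}$$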

Figure 2 illustrates the comparison of selected parameters—including length, %G + C content, and minimum free energy (dG)—across 500 randomly sampled pre-miRNA sequences from each organism for visual clarity.

Fig. 2
figure 2

Comparison of insect pre-miRNAs with other classes of organisms. The comparison of selected features of 500 randomly sampled pre-miRNAs from insects, humans, monocots, aves, ruminants, sauria, dogs, and rodents is shown in the pair-plot scatter diagrams. Features such as GC percentage (%G + C), length (Len), and dG (MFE/length) were considered out of the 57 calculated features. Gaussian distribution plots are shown along the diagonal.

Feature engineering

To reduce redundancy, only one feature from each group of highly correlated variables was retained. As a result, the following features were removed due to high correlation: ‘AU’, ‘ND’, ‘A + U’, ‘G’, ‘G + C’, ‘A’, ‘%GC’, ‘%GG’, ‘%CC’, ‘%UU’, ‘mfe’, ‘D’, ‘AA’, ‘CC’, ‘U’, ‘%UA’, ‘NQ’, ‘pb’, ‘MFE3’, ‘%CG’, ‘%A + U’, ‘UU’ and ‘%G + C’. The results of the PCA are presented in Table 3. Based on the cumulative explained variance, 14 principal components were selected, capturing 95% of the total variance.

Table 3 Principal component analysis (PCA) results showing the explained variance and cumulative variance for each component. Based on the criterion of capturing 95% of the cumulative variance, the first 14 principal components were retained for further analysis, while the remaining components were discarded.

The cumulative explained variance by each principal component is also illustrated in Fig. 3. Based on this plot, the first 14 components were selected, as they collectively account for approximately 95% of the total variance.

Fig. 3
figure 3

Cumulative explained variance of principal components derived from PCA. The first 14 components, accounting for 95% of the total variance, were retained for downstream analysis. The elbow in the curve indicates the point of diminishing returns for additional components.

Training with XGBoost

The classification model for each organism was trained using a random-search cross-validation strategy on the training set (80% of the data), followed by testing on the independent hold-out set (20%). Table 4 summarizes the best cross-validation (CV) accuracy along with test accuracy, precision, recall, and F1-score. Detailed results of the hyperparameter tuning process, including fit times, fold-wise test scores, selected hyperparameter values, and model ranking, for each organism are provided in Supplementary Material 2.

Table 4 Performance metrics of the classification model for each organism. The table includes the best cross-validation (CV) accuracy and corresponding test set metrics: accuracy, precision, recall, and F1-score. Higher scores for monocots, insects, and Sauria suggest more distinct or learnable features, while lower scores in other groups indicate potential classification challenges.

The best hyperparameter values identified during model tuning for each organism are summarized in Table 5. These include values for tree-specific parameters (e.g., max_depth, min_child_weight), learning rate, regularization parameters (reg_alpha, reg_lambda), and sampling parameters (colsample_bytree, subsample).

Table 5 Best-performing hyperparameter values for each organism as identified through cross-validation. Parameters include colsample_bytree, learning_rate, max_depth, min_child_weight, n_estimators, reg_alpha, reg_lambda, and subsample.

Performance evaluation

Various performance measures for each group of organisms are given in Table 6. The accuracy of insects, monocots, rodents, human, ruminants, sauria, aves and dogs was found to be 0.8549, 0.8626, 0.6835, 0.7005, 0.8875, 0.6972, 0.7591 and 0.6588 respectively. Specificity was found to be 0.8704 for insects, 0.8956 for monocots, 0.6424 for rodents, 0.7057 for human, 0.7042 for ruminants, 0.7705 for sauria, 0.6318 for aves and 0.7426 for dogs. The F1 score of insects, monocots, rodents, human, ruminants, sauria, aves and dogs was found to be 0.8572, 0.858, 0.696, 0.699, 0.695, 0.8, 0.6678 and 0.6415 respectively. Sensitivity was found to be 0.8395 for insects, 0.8297 for monocots, 0.7247 for rodents, 0.6953 for human, 0.6901 for ruminants, 0.8197 for sauria, 0.6859 for aves and 0.5941 for dogs. The MCC score of insects, monocots, rodents, human, ruminants, sauria, aves and dogs was found to be 0.7102, 0.7269, 0.3683, 0.4011, 0.3944, 0.5909, 0.3182 and 0.3404 respectively. This is also shown graphically in Fig. 4. The radar plot provides a comparative overview of model performance across organisms, highlighting strengths and weaknesses in different metrics. Notably, organisms such as monocots and insects exhibit consistently higher scores, while aves and rodents show comparatively lower performance across several metrics.

Table 6 Performance metrics of classification models evaluated across eight different organism groups. Metrics include Accuracy, Precision, Sensitivity, F1 Score, Matthews correlation coefficient (MCC), and Specificity. These values reflect the models’ ability to generalize and perform consistently across diverse biological taxa.
Fig. 4
figure 4

Radar plot illustrating the performance of classification models across different organisms using six evaluation metrics: Accuracy, Precision, Sensitivity, F1 Score, Matthews Correlation Coefficient (MCC), and Specificity. Each line represents a different organism, allowing for comparative visualization of model strengths and weaknesses across taxa.

The ROC-AUC is given in Fig. 5. The AUC was found to be 0.92 for Insects, 0.93 for Monocots, 0.87 for Sauria, 0.74 for Dog, 0.77 for Ruminant, 0.76 for Human, 0.76 for Rodent, and 0.72 for Aves. These values indicate strong classifier performance for Insects, Monocots, and Sauria, suggesting the model can distinguish positive and negative cases effectively in these taxa.

Fig. 5
figure 5

ROC-AUC of the XGBoost classifier across different organism groups. It shows how the traditional features contribute to the overall learning in different classes of organisms. Higher AUC values for Insects, Monocots, and Sauria indicate strong discriminatory performance.

Discussion

Data collection and pre-processing

We collected all available pre-miRNA sequences of insects, as our initial focus was to distinguish them from other organisms. We also collected data from humans, rodents, monocots, aves, ruminants, dogs, and sauria (reptiles). The highest number of pre-miRNA sequences from a single species came from humans, followed by mouse and cattle. All sequences formed the characteristic hairpin loop, as inferred from the secondary structures calculated by RNAfold.

Hypothesis testing

Our null hypothesis was that all pre-miRNAs are physically and compositionally similar; we therefore performed the KS test with Bonferroni adjustment on various parameters, comparing insects with the other groups. These parameters are used in various machine learning based pre-miRNA prediction tools19,21,24,39. As parameters such as length, GC content, and MFE1 fell below the significance threshold, we rejected the null hypothesis and accepted the alternative hypothesis that the pre-miRNA sequences of these groups differ from those of insects. This indicated the possibility of supervised classification of pre-miRNAs based on their ancestral origin. We therefore also performed Mann-Whitney U and Levene’s tests to check the medians and variances of the features among the groups, and found that all 57 features were significant in at least one pair of organisms. Hence, we moved forward with the state-of-the-art XGBoost algorithm, which can efficiently learn to classify different groups based on the given labels.

Feature engineering and model training

Estimating the features that contribute most to building the model is essential40,41. Our approach initially yielded 34 features after removing highly correlated ones. PCA is another widely used dimensionality reduction technique42,43; using PCA, we selected 14 principal components for the classification model, as shown in the scree plot, capturing 95% of the cumulative explained variance.

Performance evaluation

The performance of XGBoost varied among the groups. Each group had fairly good accuracy; however, accuracy alone can be misleading and should not be considered the best indicator of performance44,45. The predictive model had good specificity for each group, but only insects, monocots, and sauria reached sensitivity above 80%, with insects having the highest at 83.95%. The F1 score for all organisms was above 60%, whereas the MCC, which takes all four confusion-matrix counts into account and is therefore a crucial indicator45, exceeded 0.6 only for monocots and insects.

The AUC values of 0.92, 0.93, and 0.87 for insects, monocots, and reptiles (sauria), respectively, suggest that the features used by XGBoost can effectively classify these organism groups.

Conclusion

In this work, we demonstrated the distinct nature of insect pre-miRNA by comparing it against that of other organisms and established that insect pre-miRNAs are significantly different from those of monocots, humans, rodents, ruminants, sauria, dogs, and aves. We further developed a predictive model using the XGBoost classifier, which effectively learned to differentiate pre-miRNA sequences across these organism classes based on a range of sequence and structural features.

In the future, this model can be implemented as a web server or standalone software tool, enabling researchers to rapidly classify unknown pre-miRNA sequences based on their likely taxonomic origin. Such a tool would be particularly valuable for annotating novel or poorly characterized genomes, assisting in evolutionary studies, and guiding experimental validation in non-model organisms. Additionally, expanding the dataset to include more taxa and incorporating deep learning-based feature extraction could further improve prediction accuracy and broaden the model’s applicability.