Background

Precursor microRNAs (pre-miRNAs) are the non-coding RNA hairpin loops generated by Drosha cleavage that are subsequently processed into mature microRNAs (miRNAs)1,2. Because multiple miRNAs can be produced from a single pre-miRNA, the characterization and identification of pre-miRNAs is of great importance. miRNAs regulate gene expression in various biological processes such as development, cell proliferation, cell differentiation, apoptosis, transposon silencing, and antiviral defense3,4,5,6. In insects, changes in miRNA expression profiles have been observed in processes such as metamorphosis, reproduction, and the immune response7,8,9,10,11,12,13. miRNAs are believed to be conserved and broadly similar across species, even though they target diverse genes14,15,16.

Various tools have been designed to predict pre-miRNAs, since they give rise to mature miRNAs. The training data for such tools are typically downloaded from miRBase, which contains a collection of pre-miRNAs and their corresponding miRNAs from various organisms17; it currently holds miRNAs from 271 organisms. Features such as nucleotide composition, sequence length, and GC content of pre-miRNAs are used to train machine learning classifiers to predict true pre-miRNAs18,19,20,21,22,23,24,25,26,27,28,29,30. Deep learning methods have also been applied to detect pre-miRNA hairpin loops in the context of COVID-1931.

However, most existing tools are either general-purpose or tailored to a single organism or taxonomic group, and they often assume that pre-miRNA features are conserved across species. This assumption may not hold for phylogenetically distant groups such as insects, which are known to have unique regulatory networks and ecological specializations. There is a growing need to develop organism-aware or lineage-specific models to improve the accuracy and biological relevance of miRNA prediction32,33.

In this work, we analyzed insect pre-miRNA sequences from miRBase and performed a comparative statistical analysis against other available organism groups. We first established that features such as length, GC content, and MFE differ in insects compared with other organisms. We then trained an XGBoost classification model to distinguish insects, humans, monocots, aves, ruminants, sauria, dogs, and rodents.

Methods

Data collection and pre-processing

We collected pre-miRNA sequences of insects, humans, monocots, aves, ruminants, sauria, dogs, and rodents from miRBase17 and labelled them for comparison. The secondary structure was calculated using the RNAfold software from the ViennaRNA package. The FASTA header, nucleotide sequence, MFE score, and secondary structure of each pre-miRNA were converted into tabular format using an in-house Python script.
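The in-house script is not part of the published record; the following minimal sketch illustrates how RNAfold output might be converted into a table, assuming RNAfold was run with default text output (header, sequence, and structure/MFE lines) and that the file names are placeholders.

```python
# Minimal sketch (not the authors' in-house script) of converting RNAfold output into
# a table. Assumes RNAfold was run as `RNAfold --noPS < pre_mirna.fa > rnafold.out`,
# so each record spans three lines: FASTA header, sequence, structure with MFE.
import re
import pandas as pd

records = []
with open("rnafold.out") as fh:                      # hypothetical output file
    lines = [ln.rstrip("\n") for ln in fh if ln.strip()]

for i in range(0, len(lines), 3):
    header = lines[i].lstrip(">")
    seq = lines[i + 1].upper().replace("T", "U")
    match = re.match(r"^([.()]+)\s+\(\s*(-?\d+\.\d+)\)", lines[i + 2])
    structure, mfe = match.group(1), float(match.group(2))
    records.append({
        "id": header,
        "sequence": seq,
        "length": len(seq),
        "gc_percent": 100 * (seq.count("G") + seq.count("C")) / len(seq),
        "mfe": mfe,
        "structure": structure,
    })

pd.DataFrame(records).to_csv("premirna_features.csv", index=False)
```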

Hypothesis testing

The null hypothesis was:

$$H_{0}: \text{All precursor miRNA features are similar among all organisms}$$

Our alternative hypothesis states that insect pre-miRNAs differ in many of the features routinely used by machine learning (ML) tools, i.e.

$$H_{A}: \text{Features of precursor miRNA differ significantly between insects and other organisms}$$

To assess whether the features were normally distributed, the Shapiro-Wilk test32 was performed, as given in Eq. 1.

$$W=\frac{\left(\sum_{i=1}^{n} a_{i} x_{(i)}\right)^{2}}{\sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}$$
(1)

where \(x_{(i)}\) are the ordered sample values, \(\bar{x}\) is the sample mean, and \(a_{i}\) are coefficients that depend on the sample size \(n\).

The results suggested that the data were not normally distributed; hence, we performed two-sample Kolmogorov-Smirnov (KS) tests, given in Eq. 2, to compare the distributions of 57 pre-miRNA features between insects and each of the seven other organism groups (aves, dogs, humans, monocots, rodents, ruminants, sauria), resulting in 399 comparisons.

$$D_{n,m}=\max_{x}\left|F_{n}(x)-G_{m}(x)\right|$$
(2)

where \(F_{n}(x)\) and \(G_{m}(x)\) are the empirical CDFs of the two samples being compared.

The significance level for all statistical tests was set at α = 0.05. Additionally, to account for multiple comparisons, a Bonferroni correction was applied to adjust the p-values and minimize the risk of type I errors.

Because these features were to be used in machine learning (ML) models, we also performed Mann-Whitney U tests to assess differences in medians and Levene’s tests to evaluate the equality of variances.
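As an illustration, the per-feature tests could be run with SciPy as sketched below; the file name, column names, and group labels (e.g. "insect") are assumptions, not the authors' actual code.

```python
# Hedged sketch of the per-feature hypothesis tests (Shapiro-Wilk, two-sample KS,
# Mann-Whitney U, Levene) with a Bonferroni-adjusted threshold.
import pandas as pd
from scipy.stats import ks_2samp, levene, mannwhitneyu, shapiro

df = pd.read_csv("premirna_features.csv")            # hypothetical table with a 'group' column
features = [c for c in df.columns if c not in ("id", "sequence", "structure", "group")]
other_groups = [g for g in df["group"].unique() if g != "insect"]

alpha = 0.05
bonferroni_alpha = alpha / (len(features) * len(other_groups))   # e.g. 0.05 / 399

results = []
for feat in features:
    insect_vals = df.loc[df["group"] == "insect", feat]
    for grp in other_groups:
        other_vals = df.loc[df["group"] == grp, feat]
        results.append({
            "feature": feat,
            "group": grp,
            "shapiro_p": shapiro(insect_vals).pvalue,               # normality of insect values
            "ks_p": ks_2samp(insect_vals, other_vals).pvalue,       # difference in distribution
            "mwu_p": mannwhitneyu(insect_vals, other_vals).pvalue,  # difference in median
            "levene_p": levene(insect_vals, other_vals).pvalue,     # equality of variance
        })

tests = pd.DataFrame(results)
tests["ks_significant"] = tests["ks_p"] < bonferroni_alpha
```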

Feature engineering

We used dimensionality reduction techniques to identify the most informative features for training. Initially, we calculated Pearson’s correlation coefficients between all feature pairs to detect multicollinearity. For each pair of highly correlated features, one representative feature was retained while the other was removed to eliminate redundancy and reduce overfitting risk. Following this filtering step, we applied Principal Component Analysis (PCA) and computed the sum of squared loadings for each feature to assess its contribution to the principal components. We then selected the minimal set of features that collectively accounted for 95% of the total variance in the dataset, also known as the cumulative explained variance33,34, ensuring both compactness and informativeness of the feature set.
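A minimal sketch of this two-step feature selection is given below, assuming a pandas feature table; the |r| > 0.9 correlation cutoff, the standardization step before PCA, and the file name are assumptions, and the per-feature sum-of-squared-loadings calculation is omitted for brevity.

```python
# Illustrative sketch of the feature-selection steps: Pearson-correlation filtering
# followed by PCA retaining 95% cumulative explained variance.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("premirna_features.csv")                       # hypothetical feature table
X = df.drop(columns=["id", "sequence", "structure", "group"], errors="ignore")

# Drop one feature from every highly correlated pair
corr = X.corr(method="pearson").abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_filtered = X.drop(columns=to_drop)

# PCA keeping the smallest number of components explaining 95% of the variance
X_scaled = StandardScaler().fit_transform(X_filtered)
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(f"Retained {pca.n_components_} components explaining "
      f"{pca.explained_variance_ratio_.sum():.3f} of the total variance")
```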

Training with XGBoost

A one-vs-rest binary classification approach was adopted to distinguish each organism from the rest. For each target organism, all available sequences were treated as positive samples. An equal number of negative samples were randomly drawn without replacement from the pool of remaining organisms to maintain class balance. This procedure ensured that each binary classifier was trained on a balanced dataset of equal positive and negative examples.
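A sketch of this balanced one-vs-rest sampling, assuming a pandas DataFrame with a 'group' column and an arbitrary random seed, could look as follows.

```python
# Sketch of the balanced one-vs-rest dataset construction described above.
# The DataFrame layout (a 'group' column plus feature columns) and the seed are assumptions.
import pandas as pd

def build_binary_dataset(df: pd.DataFrame, target_group: str, seed: int = 42):
    """Return features and binary labels balanced between one organism group and the rest."""
    positives = df[df["group"] == target_group]
    # Draw an equal number of negatives without replacement from all other groups
    negatives = df[df["group"] != target_group].sample(
        n=len(positives), replace=False, random_state=seed
    )
    data = pd.concat([positives, negatives]).reset_index(drop=True)
    labels = (data["group"] == target_group).astype(int)
    return data.drop(columns=["group"]), labels

# Example: balanced dataset for the insect-vs-rest classifier
# X_insect, y_insect = build_binary_dataset(features_df, "insect")
```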

We kept 80% of the data for training (X_train) and used 20% for testing (X_test). We used an XGBoost classifier to train the classification models. XGBoost builds an ensemble of CART models, parallelizing the split search within each tree, which effectively improves computation speed. A second-order Taylor expansion of the loss is used to optimize the model by approximating the error between the predicted and true values35. It also handles missing feature values internally (by learning a default split direction) and therefore does not require feature standardization36. It has hence been used for the estimation and classification of biological data37,38. The model is based on minimising a regularized loss function \(L^{(t)}\), which can be written mathematically as:

$$L^{(t)}=\sum_{i=1}^{n} l\left(y_{i},\ \hat{y}_{i}^{(t-1)}+f_{t}(x_{i})\right)+\Omega(f_{t})$$
(3)

where \(l\) measures the difference between the prediction \(\hat{y}_{i}\) and the target \(y_{i}\) for the \(i\)th instance at iteration \(t\), \(f_{t}\) is an independent tree for a given input \(x_{i}\), and \(\Omega(f_{t})\) acts as a penalty (regularization) term. We used 5-fold CV while optimising the following parameters of the XGBoost classifier via its scikit-learn-compatible API (a sketch of this search is given after the list):

  • lambda: L2 regularization range from 1e-3 to 10.

  • alpha: L1 regularization range from 1e-3 to 10.

  • colsample_bytree: Subsample ratio of columns during construction of each tree, ranges from 0.3 to 1.0.

  • subsample: Ratio of training instances, ranges from 0.4 to 1.

  • learning_rate: Step size at each iteration while moving towards minimum of loss function, ranges from 0.001 to 0.2.

  • n_estimators: Number of trees, ranges from 50 to 400.

  • max_depth: Max depth of a tree, ranges from 5 to 17.

  • min_child_weight: Minimum instances needed to be in each node, ranges from 1 to 300.
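A sketch of the randomized 5-fold search over these ranges, using scikit-learn's RandomizedSearchCV with the XGBoost classifier, is shown below; the number of sampled configurations, the scoring metric, and the random seed are assumptions.

```python
# Sketch of a randomized 5-fold cross-validated search over the parameter ranges above.
from scipy.stats import loguniform, randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_distributions = {
    "reg_lambda": loguniform(1e-3, 10),      # L2 regularization, 1e-3 to 10
    "reg_alpha": loguniform(1e-3, 10),       # L1 regularization, 1e-3 to 10
    "colsample_bytree": uniform(0.3, 0.7),   # 0.3 to 1.0
    "subsample": uniform(0.4, 0.6),          # 0.4 to 1.0
    "learning_rate": uniform(0.001, 0.199),  # 0.001 to 0.2
    "n_estimators": randint(50, 401),        # 50 to 400 trees
    "max_depth": randint(5, 18),             # 5 to 17
    "min_child_weight": randint(1, 301),     # 1 to 300
}

search = RandomizedSearchCV(
    estimator=XGBClassifier(eval_metric="logloss"),
    param_distributions=param_distributions,
    n_iter=100,          # assumed number of sampled configurations
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",
    n_jobs=-1,
    random_state=42,
)
# Fit on the 80% training split, e.g. search.fit(X_train, y_train)
```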

Performance analysis

To assess the generalizability of each classifier, the hyperparameter tuning and cross-validation were carried out exclusively on the training data and the final model was evaluated on the held-out test set.

The best parameters were chosen and the models were evaluated on the X_test dataset to check their efficiency. During performance evaluation, the positive class was the group being evaluated and the negative class was the pooled set of all other groups in each case. Performance was calculated using the following classical classification measures: sensitivity (SN): \(SN=\frac{TP}{TP+FN}\), specificity (SP): \(SP=\frac{TN}{TN+FP}\), accuracy (Acc): \(Acc=\frac{TN+TP}{TN+FP+TP+FN}\), precision (p): \(p=\frac{TP}{TP+FP}\), the harmonic mean of sensitivity and precision (F1): \(F_{1}=2\,\frac{SN\cdot p}{SN+p}\), and Matthews correlation coefficient (MCC): \(MCC=\frac{(TP\cdot TN)-(FP\cdot FN)}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}\), where TP, TN, FP, and FN are the numbers of true-positive, true-negative, false-positive, and false-negative classifications, respectively. For a given false positive rate (α) and true positive rate (1 − β) at different threshold values, the AUC-ROC was computed as \(AUC=\sum_{i=1}^{m}\left\{(1-\beta_{i})\,\Delta\alpha+\frac{1}{2}\,\Delta(1-\beta)\,\Delta\alpha\right\}\), where \(\Delta(1-\beta)=(1-\beta_{i})-(1-\beta_{i-1})\), \(\Delta\alpha=\alpha_{i}-\alpha_{i-1}\), and \(i=1,2,\ldots,m\) (the number of test data points)31. The workflow of the methods is given in Fig. 1.
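For reference, these measures could be computed on the held-out test set as in the sketch below; the helper name is illustrative, and MCC and AUC-ROC are taken from scikit-learn rather than computed from the formulas above.

```python
# Sketch of the held-out test-set evaluation using the classical measures defined above.
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score

def evaluate(model, X_test, y_test):
    y_pred = model.predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": matthews_corrcoef(y_test, y_pred),
        "auc_roc": roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
    }

# e.g. metrics = evaluate(search.best_estimator_, X_test, y_test)
```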

Fig. 1
figure 1

Workflow for XGBoost training.

Results

Data pre-processing

A total of 5541 sequences were collected for the analysis as given in Table 1.

Table 1 Total sequences collected for the analysis.

Parameter calculation

The parameters used for classification are given in Table 2. A total of 57 parameters were calculated, of which 16 were dinucleotide counts (with their percentage counts), 4 were nucleotide counts (with their percentage counts), and 2 were base-pair counts; the remainder included base propensity, Shannon entropy, etc.

Table 2 Parameters calculated for feature extraction.

Hypothesis testing

The Shapiro-Wilk test results indicated that the parameters were not normally distributed, supporting the use of the Kolmogorov–Smirnov (KS) test. A Bonferroni-corrected alpha threshold of 0.000125 was applied to the KS tests, which identified 335 of the 399 feature–organism combinations as significantly different across the organism classes. Among the remaining 64 non-significant combinations, no single parameter was common to all organism comparisons, suggesting that the null hypothesis does not hold. Furthermore, the Mann–Whitney U test and Levene’s test showed that all 57 parameters exhibited significant differences in median and variance between at least one pair of organisms. The test results are provided in Supplementary Material 1.
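This threshold corresponds to the standard Bonferroni adjustment of the significance level over all 57 × 7 = 399 KS comparisons:

$$\alpha_{\mathrm{adj}}=\frac{\alpha}{k}=\frac{0.05}{57\times 7}=\frac{0.05}{399}\approx 1.25\times 10^{-4}$$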

Figure 2 illustrates the comparison of selected parameters—including length, %G + C content, and minimum free energy (dG)—across 500 randomly sampled pre-miRNA sequences from each organism for visual clarity.

Fig. 2
figure 2

Comparison of insect pre-miRNAs with other classes of organisms. The comparison of selected features of 500 randomly sampled pre-miRNAs from insects, humans, monocots, aves, ruminants, sauria, dogs, and rodents is shown in the pair-plot scatter diagrams. Features such as GC percentage (%G + C), length (Len), and dG (MFE/length) were considered out of the 57 calculated features. Gaussian distribution plots are shown along the diagonal.

Feature engineering

To reduce redundancy, only one feature from each group of highly correlated variables was retained. As a result, the following features were removed due to high correlation: ‘AU’, ‘ND’, ‘A + U’, ‘G’, ‘G + C’, ‘A’, ‘%GC’, ‘%GG’, ‘%CC’, ‘%UU’, ‘mfe’, ‘D’, ‘AA’, ‘CC’, ‘U’, ‘%UA’, ‘NQ’, ‘pb’, ‘MFE3’, ‘%CG’, ‘%A + U’, ‘UU’ and ‘%G + C’. The results of the PCA are presented in Table 3. Based on the cumulative explained variance, 14 principal components were selected, capturing 95% of the total variance.

Table 3 Principal component analysis (PCA) results showing the explained variance and cumulative variance for each component. Based on the criterion of capturing 95% of the cumulative variance, the first 14 principal components were retained for further analysis, while the remaining components were discarded.

The cumulative explained variance by each principal component is also illustrated in Fig. 3. Based on this plot, the first 14 components were selected, as they collectively account for approximately 95% of the total variance.

Fig. 3
figure 3

Cumulative explained variance of principal components derived from PCA. The first 14 components, accounting for 95% of the total variance, were retained for downstream analysis. The elbow in the curve indicates the point of diminishing returns for additional components.

Training with XGBoost

The classification model for each organism was trained using a random-search cross-validation strategy on the training set (80% of the data), followed by testing on the independent hold-out set (20%). Table 4 summarizes the best cross-validation (CV) accuracy along with test accuracy, precision, recall, and F1-score. Detailed results of the hyperparameter tuning process, including fit times, fold-wise test scores, selected hyperparameter values, and model ranking, for each organism are provided in Supplementary Material 2.

Table 4 Performance metrics of the classification model for each organism. The table includes the best cross-validation (CV) accuracy and corresponding test set metrics: accuracy, precision, recall, and F1-score. Higher scores for monocots, insects, and Sauria suggest more distinct or learnable features, while lower scores in other groups indicate potential classification challenges.

The best hyperparameter values identified during model tuning for each organism are summarized in Table 5. These include values for tree-specific parameters (e.g., max_depth, min_child_weight), learning rate, regularization parameters (reg_alpha, reg_lambda), and sampling parameters (colsample_bytree, subsample).

Table 5 Best-performing hyperparameter values for each organism as identified through cross-validation. Parameters include colsample_bytree, learning_rate, max_depth, min_child_weight, n_estimators, reg_alpha, reg_lambda, and subsample.

Performance evaluation

Various performance measures for each group of organisms are given in Table 6. The accuracy of insects, monocots, rodents, human, ruminants, sauria, aves and dogs was found to be 0.8549, 0.8626, 0.6835, 0.7005, 0.8875, 0.6972, 0.7591 and 0.6588 respectively. Specificity was found to be 0.8704 for insects, 0.8956 for monocots, 0.6424 for rodents, 0.7057 for human, 0.7042 for ruminants, 0.7705 for sauria, 0.6318 for aves and 0.7426 for dogs. The F1 score of insects, monocots, rodents, human, ruminants, sauria, aves and dogs was found to be 0.8572, 0.858, 0.696, 0.699, 0.695, 0.8, 0.6678 and 0.6415 respectively. Sensitivity was found to be 0.8395 for insects, 0.8297 for monocots, 0.7247 for rodents, 0.6953 for human, 0.6901 for ruminants, 0.8197 for sauria, 0.6859 for aves and 0.5941 for dogs. The MCC score of insects, monocots, rodents, human, ruminants, sauria, aves and dogs was found to be 0.7102, 0.7269, 0.3683, 0.4011, 0.3944, 0.5909, 0.3182 and 0.3404 respectively. This is also shown graphically in Fig. 4. The radar plot provides a comparative overview of model performance across organisms, highlighting strengths and weaknesses in different metrics. Notably, organisms such as monocots and insects exhibit consistently higher scores, while aves and rodents show comparatively lower performance across several metrics.

Table 6 Performance metrics of classification models evaluated across eight different organism groups. Metrics include Accuracy, Precision, Sensitivity, F1 Score, Matthews correlation coefficient (MCC), and Specificity. These values reflect the models’ ability to generalize and perform consistently across diverse biological taxa.
Fig. 4
figure 4

Radar plot illustrating the performance of classification models across different organisms using six evaluation metrics: Accuracy, Precision, Sensitivity, F1 Score, Matthews Correlation Coefficient (MCC), and Specificity. Each line represents a different organism, allowing for comparative visualization of model strengths and weaknesses across taxa.

The ROC-AUC is given in Fig. 5. The AUC was found to be 0.92 for Insects, 0.93 for Monocots, 0.87 for Sauria, 0.74 for Dog, 0.77 for Ruminant, 0.76 for Human, 0.76 for Rodent, and 0.72 for Aves. These values indicate strong classifier performance for Insects, Monocots, and Sauria, suggesting the model can distinguish positive and negative cases effectively in these taxa.

Fig. 5
figure 5

ROC-AUC of the XGBoost classifier across different organism groups. It shows how the traditional features contribute to the overall learning in different classes of organisms. Higher AUC values for Insects, Monocots, and Sauria indicate strong discriminatory performance.

Discussion

Data collection and pre-processing

We collected all available pre-miRNA sequences of insects, as our initial focus was to distinguish them from other organisms. We also collected data from humans, rodents, monocots, aves, ruminants, dogs, and sauria (reptiles). The highest number of pre-miRNA sequences from a single species came from humans, followed by mouse and cattle. All sequences formed the characteristic hairpin loop, as inferred from the secondary structures calculated by RNAfold.

Hypothesis testing

Our null hypothesis was that all pre-miRNAs are physically and compositionally similar; we therefore performed the KS test with Bonferroni adjustment on various parameters, comparing insects with the other groups. These parameters are used in various machine learning based pre-miRNA prediction tools19,21,24,39. As parameters such as length, GC content, and MFE1 fell below the significance threshold, we rejected the null hypothesis and accepted the alternative hypothesis that the pre-miRNA sequences of these groups differ from those of insects. This indicated the possibility of supervised classification of pre-miRNAs based on their ancestral origin. We therefore also performed Mann-Whitney U and Levene’s tests to check the medians and variances of the features among the groups, and found that all 57 features were significant in at least one pair of organisms. Hence, we moved forward with the state-of-the-art XGBoost algorithm, which can efficiently learn to classify different groups based on the given labels.

Feature engineering and model training

Estimating the features that contribute most to building the model is essential40,41. Our approach initially yielded 34 features after removing highly correlated ones. PCA is another widely used dimensionality reduction technique42,43; using PCA, we selected 14 principal components for the classification model, as shown in the scree plot, capturing 95% of the cumulative explained variance.

Performance evaluation

The performance of XGBoost varied among the groups. Each group had fairly good accuracy; however, accuracy alone can be misleading and should not be considered the best indicator of performance44,45. The predictive model had good specificity for each group, but only insects, monocots, and sauria reached sensitivity above 80%, with insects having the highest at 83.95%. The F1 score for all organisms was above 60%, whereas the MCC, which takes all four confusion-matrix counts into account and is therefore a crucial indicator45, exceeded 0.6 only for monocots and insects.

The AUC values of 0.92, 0.93, and 0.87 for insects, monocots, and reptiles (sauria), respectively, suggest that the features used by XGBoost can effectively classify these organism groups.

Conclusion

In this work, we demonstrated the distinct nature of insect pre-miRNA by comparing it against that of other organisms and established that insect pre-miRNAs are significantly different from those of monocots, humans, rodents, ruminants, sauria, dogs, and aves. We further developed a predictive model using the XGBoost classifier, which effectively learned to differentiate pre-miRNA sequences across these organism classes based on a range of sequence and structural features.

In the future, this model can be implemented as a web server or standalone software tool, enabling researchers to rapidly classify unknown pre-miRNA sequences based on their likely taxonomic origin. Such a tool would be particularly valuable for annotating novel or poorly characterized genomes, assisting in evolutionary studies, and guiding experimental validation in non-model organisms. Additionally, expanding the dataset to include more taxa and incorporating deep learning-based feature extraction could further improve prediction accuracy and broaden the model’s applicability.