Introduction

MicroRNAs (miRNAs) are a class of small noncoding RNAs that act as expression regulators, directing the cleavage, degradation, and/or translational inhibition of their target transcripts1. They play vital roles in a broad spectrum of physiological processes in multicellular organisms, including the cell cycle, proliferation, differentiation, development, and apoptosis2. A substantial proportion of miRNAs show highly tissue- or cell-type-specific expression patterns that are predictive of cell identity3,4,5,6. Altered miRNA expression levels are widely observed in tissues or body fluids between healthy and disease conditions, or across disease stages, and often correlate with clinical phenotypes and outcomes, indicating their potential as predictive, diagnostic or prognostic biomarkers7,8. Therefore, comprehensive and accurate cataloguing of miRNA-based biomarkers across tissues and cell types is indispensable for identifying the functional status of cells, as well as for understanding the gene regulation mechanisms underlying physiological and pathological conditions in complex tissues.

With the advent of high-throughput techniques, large amounts of miRNA profiling data derived from different tissues in normal and diseased states have accumulated in public repositories, such as The Cancer Genome Atlas (TCGA)9 and the Encyclopedia of DNA Elements (ENCODE) project10. The number of miRNA biomarkers has been increasing at an unprecedented rate with intensive exploration of these troves of publicly available data. However, the vast majority of these molecular profiles are derived from bulk tissue sequencing of cell mixtures, where cellular heterogeneity usually acts as a confounding factor. This intra-sample heterogeneity is altered by physiological or functional activity changes of the samples, which imposes great challenges on omics studies11. For example, in case-control or longitudinal studies, observed changes in a particular miRNA could result either from altered expression in a particular cell type or from alterations in cell-type composition12. This uncertainty about the source of data variation could lead to decreased statistical power and an increased false positive rate. On the other hand, shifts in relative cell composition play an important role in various physiological processes, such as embryogenesis, morphogenesis and immune activities13,14,15. Cell type quantification can help explain the etiology of many diseases and may serve as a pathological indicator16,17,18. Notably, in cancer research, the tumor microenvironment, including the density of infiltrating lymphocytes, is critical for cancer progression, patient prognosis and therapy response19,20,21. Therefore, resolving cell types and compositions from these miRNA expression profiles is extremely valuable for identifying miRNAs as disease markers and for enhancing our understanding of their underlying biological mechanisms.

Over the past decade, single-cell RNA sequencing (scRNA-seq) technologies have been developed that allow characterization and quantification of cellular heterogeneity by providing an overview of global gene expression within a population of cells22. Unfortunately, current scRNA-seq protocols still lag behind in quantifying miRNA expression, and only a very limited number of studies have been published in this area23,24,25. As an alternative, in-silico cell type deconvolution is attractive for identifying cell-type compositions. Computational deconvolution has the advantage of being cheap and fast, whilst also offering a degree of robustness to noise and confounding variables26. Currently, gene expression and DNA methylation profiles are widely used for reference-based cell type deconvolution to resolve cell mixtures of complex tissues27. These approaches rely on a priori reference profiles of specific markers from the constituent cell types, which are usually generated from sorted cell types or from more elaborate single-cell profiling data28. The well-known deconvolution method CIBERSORT and its successor CIBERSORTx, for instance, use a support vector regression approach with references generated from purified cell populations or single-cell data for cell composition estimation29,30. Although cell type deconvolution can be achieved for samples with matched RNA-seq or DNA methylation data, this is rarely the case, and most miRNA data in the public domain do not have such matched multi-omics profiles available. In the Gene Expression Omnibus (GEO) database, for instance, there are already 116,400 miRNA expression profiling samples available, whereas only about 5% have corresponding mRNA profiling data. This situation seriously hinders in-depth analysis of miRNA data. Therefore, there is an urgent need to develop deconvolution tools based on miRNA expression profiles for analyzing cell heterogeneity in complex tissues and for identifying miRNA biomarkers.

In this study, we constructed miRNA reference matrices from sorted cells and developed DeconmiR, a deconvolution method that resolves the relative proportions of different cell types from miRNA profiles of mixture samples. We evaluated our method on a variety of experimental and clinical samples, where it showed high prediction accuracy across many evaluation metrics. We further applied DeconmiR to different scenarios, such as deconvolution of samples from acute myeloid leukemia and solid tumors, and found that the cell proportions estimated by DeconmiR are informative for patient prognosis. In addition, the method shows application value for screening age-associated miRNAs, as well as for monitoring the immune landscape in infectious diseases such as COVID-19. By adjusting for the estimated cell type composition, we identified smoking-associated differentially expressed miRNAs occurring in the epithelial compartment, which were validated in smoking-related lung cancer. These results indicate that our method is an effective tool for inferring cell type fractions from miRNA profiling data, and also demonstrate that miRNAome data are of great significance for accurate cellular heterogeneity analysis of complex tissues.

Results

Building the miRNA expression reference matrix for blood sample

Reference-dependent deconvolution methods rely entirely on a library of cell-type-specific markers. Ideally, a reference matrix should consist of miRNAs whose expression variation is maximized across distinct cell types but minimized within each cell type31,32,33. We first considered the case of peripheral blood, for which the main constituent cell types are well known. We collected miRNA-seq data for a total of 257 samples encompassing six purified leukocyte subtypes (CD4+ T cell, CD8+ T cell, B cell, NK cell, Monocyte and Neutrophil) obtained from 47 individuals from Juzenas et al. (Supplementary Data 1-2)34. The average purity reported by FACS sorting was 90.2% across all samples (Fig. S1). We selected a total of 197 samples with purity higher than 85% for reference construction and in-silico mixture sample generation by a leave-one-out strategy. We used the limma approach to identify differentially expressed miRNAs (DEMs) by comparing each cell type against all others (see Methods)35, which led to the selection of 145 non-redundant miRNAs. As a sanity check, we used t-distributed Stochastic Neighbor Embedding (t-SNE) to cluster the samples based on the signature miRNAs and reveal the structure of the data. We verified that samples segregated distinctly according to cell type (Fig. 1A). The reference matrix was then constructed by calculating the average expression of each miRNA within each individual cell type (Supplementary Data 3). Fig. 1B depicts the clustering heatmap generated from the reference library, which adequately discriminates lymphoid-derived from myeloid-derived cells; this was also validated by a cell-type similarity heatmap computed from pair-wise Kendall rank correlation coefficients (Fig. S2A, B). Many of the signature miRNAs have previously been identified as specifically expressed in immune cells, reflecting their roles in the growth and differentiation of these cells. For instance, miR-155 was found to be highly expressed in lymphocytes (both B and T cells), where it mediates important functions36,37. The miR-181 family has a profound impact on the development of human NK cells from hematopoietic progenitor cells38. Gene Ontology (GO) enrichment analysis of the miRNA targets revealed a strong and specific enrichment of terms representing the functional phenotype of the corresponding cell types (Fig. 1C). For example, “Regulation of T cell differentiation” and “T cell receptor signaling” are enriched in T-lymphocytes, whereas “Regulation of myeloid cell differentiation” and “Myeloid cell development” are mainly enriched in monocytes and neutrophils. This suggests that our approach identifies miRNA markers that discriminate different cell types and reflect the growth and differentiation functions of those cells.
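
The averaging step that turns purified-cell profiles into a reference matrix can be sketched as follows. This is a minimal illustration, not the DeconmiR implementation: the function name `build_reference`, the miRNA labels `miR-A` and `miR-B`, and all expression values are hypothetical, and the signature list is assumed to have been selected beforehand (for example by a differential expression analysis).

```python
from collections import defaultdict

def build_reference(profiles, labels, signature):
    """Average each signature miRNA within each cell type.

    profiles:  list of {miRNA: expression} dicts, one per purified sample
    labels:    cell type of each sample, in the same order
    signature: list of signature miRNAs (assumed to be selected beforehand)
    Returns {miRNA: {cell_type: mean expression}}.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for profile, cell_type in zip(profiles, labels):
        counts[cell_type] += 1
        for mirna in signature:
            sums[cell_type][mirna] += profile[mirna]
    return {
        mirna: {ct: sums[ct][mirna] / counts[ct] for ct in counts}
        for mirna in signature
    }

# Toy data with hypothetical miRNA names and expression values.
profiles = [
    {"miR-A": 8.0, "miR-B": 1.0},
    {"miR-A": 6.0, "miR-B": 3.0},
    {"miR-A": 2.0, "miR-B": 9.0},
]
labels = ["B cell", "B cell", "Monocyte"]
reference = build_reference(profiles, labels, ["miR-A", "miR-B"])
```

Each row of the resulting matrix corresponds to one signature miRNA and each column to one cell type, which is the layout assumed by reference-based deconvolution.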

Fig. 1: Reference library construction for accurate cell type composition estimation of real and simulated datasets.

A t-SNE scatterplot based on the identified cell type-specific miRNAs. The plot indicates distinct clustering of the six cell types, and most of the data variance can be attributed to cell type differences. B Hierarchical clustering heatmap for isolated leukocyte subtypes based on the mean expression of the miRNA signature. C Gene Ontology (GO) analysis of target genes associated with cell-specific miRNAs. miRNA target genes were screened and used to identify enriched biological process terms at an FDR-adjusted P-value of 0.05. Enrichment FDR values are size encoded as indicated. D Stacked bar charts of estimated cell fractions, indicating that CD4+ T cell constitutes the major cell type for the purified samples. E Boxplots of the estimated cell fractions, indicating very low fractions of other cell types in the pure CD4+ T cell samples (n = 39 biological replicates). The boxes are bounded by the 25th and 75th percentiles and the center represents the median. The whiskers extend from each edge of the box to indicate the 1.5× interquartile range (IQR). F Barplots of the average cell proportions of each major cell type according to FACS and DeconmiR in 257 samples from Juzenas et al. Data are presented as mean values ± SE. G Scatterplots comparing the estimated cell fractions to the known fractions of the in-silico reconstructed mixture samples, where the mixing proportions are known. One scatterplot is shown for each blood cell subtype used in the reconstructions. In each case, the R2 and RMSE values, rounded to two significant digits, are indicated. P-values are derived from the Pearson correlation test. Source data are provided as a Source Data file.

Concordance between true and estimated cell proportions

The identified miRNA-based cell markers are intended for cell type deconvolution. We designed DeconmiR, a method for miRNA-profile-based cell type deconvolution that uses Robust Partial Correlation (RPC), whose validity we demonstrated for DNA methylation-based deconvolution in our previous work39,40. We first assessed our method using 257 miRNA profiles from immune cells with known proportions, as indicated by the purity of the corresponding cell types. In this benchmark, the cell type proportions predicted by our method identified the major cell component of the pure-cell samples. For example, for the pure-cell samples consisting only of CD4+ T cells, our method consistently predicted CD4+ T cell as the dominant cell type among all immune cells, with an average predicted cell proportion of 0.91 (Fig. 1D), whereas the predicted proportions of the other cell types were generally very low (Fig. 1E). The predictions also agreed well with the ground truth for the other cell types, as illustrated in Fig. S3. Overall, we observed high agreement between the predicted and FACS-estimated cell proportions in this benchmark, with an average root-mean-square error (RMSE) of 0.11 for the six cell types (Fig. 1F).
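
To make the deconvolution step concrete, the sketch below assumes that RPC amounts to a robust (Huber-weighted) linear regression of the mixture profile on the reference columns with an intercept, followed by setting negative coefficients to zero and renormalizing to sum to one. That reading, the helper names (`solve`, `huber_regression`, `rpc_fractions`), the tuning constant and the toy reference values are all assumptions made for illustration; this is not the DeconmiR code.

```python
import statistics

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def huber_regression(X, y, k=1.345, n_iter=20):
    """Iteratively reweighted least squares with Huber weights."""
    p = len(X[0])
    w = [1.0] * len(y)
    coef = [0.0] * p
    for _ in range(n_iter):
        A = [[sum(wi * r[i] * r[j] for wi, r in zip(w, X)) for j in range(p)]
             for i in range(p)]
        b = [sum(wi * r[i] * yi for wi, r, yi in zip(w, X, y)) for i in range(p)]
        coef = solve(A, b)
        resid = [yi - sum(c * x for c, x in zip(coef, r)) for r, yi in zip(X, y)]
        # Robust scale from the median absolute residual (guarded against zero).
        scale = statistics.median(abs(e) for e in resid) / 0.6745 or 1e-12
        w = [1.0 if abs(e) <= k * scale else k * scale / abs(e) for e in resid]
    return coef

def rpc_fractions(reference, mixture):
    """reference: rows = signature miRNAs, columns = cell types."""
    X = [[1.0] + list(row) for row in reference]   # intercept column
    coef = huber_regression(X, mixture)[1:]        # drop the intercept
    est = [max(c, 0.0) for c in coef]              # no negative fractions
    total = sum(est)
    return [e / total for e in est]

# Toy reference (6 hypothetical miRNAs x 3 cell types) and a noise-free mixture.
reference = [[10, 1, 1], [1, 10, 1], [1, 1, 10], [8, 2, 1], [2, 9, 3], [1, 3, 7]]
true_fractions = [0.5, 0.3, 0.2]
mixture = [sum(f * v for f, v in zip(true_fractions, row)) for row in reference]
estimated = rpc_fractions(reference, mixture)
```

On this noise-free toy mixture the estimate recovers the mixing fractions; the robust weighting only matters when some miRNAs deviate strongly from the linear mixing model.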

As flow-cytometric estimates of the cell type proportions are also subject to error, to further evaluate whether our method could capture the cell proportion by taking the miRNA profiling data of cell mixtures, we prepared simulated bulk tissues by generating in-silico mixtures of immune cells where the exact mixing proportions are known. One hundred mixtures were generated with a randomly assigned cell weight for each mixture and the proportions exhibited a reasonable dynamic range. For all six cell types, estimated cell fractions displayed high agreement with the true proportions. DeconmiR obtained near-perfect R2 values across all cell types, with RMSEs ranging from 1% to 3.2% (Fig. 1G), confirming our method can accurately quantify blood cell type proportions by using the reference library.
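
A minimal sketch of how such in-silico mixtures and the two accuracy metrics could be computed is given below. The weight-sampling scheme (uniform draws normalized to sum to one), the toy profiles, and the definition of R2 as the squared Pearson correlation are assumptions for illustration; the source does not specify these details in this section.

```python
import math
import random

def random_weights(n_types, rng):
    """Draw random positive weights and normalize them to sum to 1."""
    raw = [rng.random() for _ in range(n_types)]
    total = sum(raw)
    return [r / total for r in raw]

def mix(profiles, weights):
    """Weighted sum of per-cell-type expression profiles (one list per type)."""
    n = len(profiles[0])
    return [sum(w * p[i] for w, p in zip(weights, profiles)) for i in range(n)]

def rmse(true, est):
    """Root-mean-square error between true and estimated fractions."""
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(true, est)) / len(true))

def r_squared(true, est):
    """Squared Pearson correlation (assumed definition of R2)."""
    mt, me = sum(true) / len(true), sum(est) / len(est)
    cov = sum((t - mt) * (e - me) for t, e in zip(true, est))
    vt = sum((t - mt) ** 2 for t in true)
    ve = sum((e - me) ** 2 for e in est)
    return cov ** 2 / (vt * ve)

rng = random.Random(0)                          # fixed seed for reproducibility
profiles = [[10.0, 1.0, 2.0], [1.0, 9.0, 3.0]]  # toy profiles, 2 cell types
weights = [random_weights(len(profiles), rng) for _ in range(100)]
mixtures = [mix(profiles, w) for w in weights]
```

The known `weights` then serve as ground truth against which the fractions estimated from `mixtures` are scored, one cell type at a time, with `rmse` and `r_squared`.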

DeconmiR could reasonably capture rare cell components

To further evaluate the capability of DeconmiR to deal with rare cell components, we also performed deconvolution on mixture data in which individual cell types were present at varying low proportions. We generated 100 in-silico mixture samples at each level by setting the proportion of one cell type to 1%, 3%, 5%, 7% or 9%, with the proportions of the other components randomly assigned. The results indicated that the DeconmiR estimates are close to the ground truth, with small variance across the different proportions for the six cell types (Fig. 2A). The method achieved a small average RMSE of 0.05 and R2 values above 0.5 for the six cell types (Fig. 2B), which illustrates the robustness of DeconmiR in inferring the fraction of rare components.
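
The mixing weights for such a rare-component experiment could be generated as sketched below. This is an assumed scheme in which the rare cell type is pinned at the chosen fraction and the remaining weights are drawn uniformly and rescaled to fill the rest; the function name `rare_component_weights` is hypothetical.

```python
import random

def rare_component_weights(n_types, rare_index, rare_fraction, rng):
    """Fix one cell type at rare_fraction; randomly assign the remainder."""
    raw = [rng.random() for _ in range(n_types - 1)]
    scale = (1.0 - rare_fraction) / sum(raw)
    others = [r * scale for r in raw]
    return others[:rare_index] + [rare_fraction] + others[rare_index:]

rng = random.Random(0)   # fixed seed for reproducibility
levels = [0.01, 0.03, 0.05, 0.07, 0.09]
# 100 weight vectors per level, with the first of six cell types kept rare.
designs = {
    level: [rare_component_weights(6, 0, level, rng) for _ in range(100)]
    for level in levels
}
```

Repeating this with each of the six cell types in the rare position gives the full set of mixtures for the evaluation.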

Fig. 2: Accuracy evaluation of DeconmiR and comparison with other methods.

A Deconvolution results of simulated rare components at varying proportions; the changes in RMSE and R2 values agree with the different rare component proportions (n = 100 replicates at each proportion level). The boxes are bounded by the 25th and 75th percentiles and the center represents the median. The whiskers extend from each edge of the box to indicate the 1.5× interquartile range (IQR). B Averaged RMSE and R2 values for deconvolution results of simulated rare components at varying proportions. C DeconmiR predictions of relative cell fractions for 14 adult human whole blood samples with flow cytometry fractions; the corresponding RMSE and R2 values for the comparison are presented. P-value is derived from the Pearson correlation test. D Comparison of DeconmiR with other deconvolution methods: barplots of RMSE and R2 values for the estimated cellular proportions for the four different algorithms applied to the reconstructed mixture samples. E Average RMSE and R2 values for different algorithms across different cell types. Source data are provided as a Source Data file.

Independent validation of the miRNA profiling based cell type deconvolution

To further validate our estimation using independent real samples, we next applied our method to an in-house miRNA profiling dataset generated from whole blood of 14 adults with known proportions of the six immune cell types. Again, our method achieved highly accurate cell-type fraction estimates when compared to the ground truth cell-type composition (Fig. 2C). The validation dataset showed low RMSE and high R2 values across the different cell types, particularly for CD4+ T cell, B cell, NK cell and monocyte (Fig. 2C). Therefore, our method retained the ability to accurately deconvolute cell fractions for samples from independent datasets.

DeconmiR compares favorably to other deconvolution methods

Having validated our method, we next performed a detailed comparison with other reference-based deconvolution methods that have been applied to gene expression data. Specifically, we compared our method to an approach based on solving linear equations (SLE), to NNLS (used by methods such as GEDIT41), to state-of-the-art deconvolution algorithms such as CIBERSORT29, and to the quadratic programming (QP) used by DeconRNAseq42. For each algorithm, we used the reference library to estimate the proportions of the in-silico mixture samples, and compared the algorithms in terms of RMSE and R2 values. For RMSE, DeconmiR outperformed all other methods in five cell types; the exception was monocyte, where QP performed slightly better than DeconmiR (Fig. 2D). DeconmiR also outperformed all other methods in terms of the average RMSE across all cell types (Fig. 2E). In terms of R2, all methods attained relatively high values, with DeconmiR achieving marginally higher ones (Fig. 2D, E). These results indicate the high prediction accuracy of DeconmiR, which compares favorably with the other methods.

DeconmiR is robust to external noise in different cell type contexts

To survey miRNA-based deconvolution in the context of other cell types, we constructed references using miRNA profiles of epithelial and non-epithelial cell types obtained from FANTOM543, and applied our method to mixture scenarios of three epithelial cell types, consisting of human mammary epithelial cell (HMEC), renal proximal tubular epithelial cell (RPTEC) and bronchial epithelial cell (BEC) from RNA_atlas44, as well as three non-epithelial cell types, consisting of aortic endothelial cell (AEC), cardiac fibroblast (CF) and prostate smooth muscle cell (PSMC) from microRNAome45. Once again, DeconmiR achieved accurate predictions in both epithelial and non-epithelial contexts (Fig. S4A and B), with an average R2 of 0.96 and RMSE of 0.11 for epithelial cell types, and an average R2 of 0.93 and RMSE of 0.10 for non-epithelial cell types. Because this analysis was performed on in-silico mixtures reconstructed from miRNA profiles that are independent of those used to build the reference signature matrices, it already implicitly assesses the robustness of the algorithm to natural levels of noise, as encountered for instance between different sequencing platforms or experimental protocols. However, to further improve our understanding of the robustness of the method, we next investigated its performance under increasingly higher levels of noise (see Methods). When adding increasing levels of noise to the reconstructed epithelial mixtures, we observed that our method remained relatively stable in terms of RMSE, whereas R2 decreased as the noise level increased, indicating that the method performs best at low noise levels but is sensitive to high noise levels (Fig. S4A). Similar observations were made for the non-epithelial cell types (Fig. S4B). These data indicate the robustness of the miRNA-based approach for estimating cell proportions and confirm its value in cell heterogeneity analysis of complex solid tissues.

Besides deep sequencing, microarrays are also widely used for parallel expression analysis of defined miRNAs. The accuracy of deconvolution methods is therefore likely to be subject to technical variation when applied to different platforms. Thus, we applied our method to miRNA profiling data derived from microarrays to assess its cross-platform effectiveness. We collected miRNA profiling data from the Affymetrix Multispecies miRNA-1 Array platform for six purified leukocyte subtypes and prepared in-silico mixture samples. Deconvolution of the simulated datasets comprising six cell types indicated that our method can predict cell proportions for microarray-profiled samples with a fair degree of accuracy, with an average R2 of 0.74 and RMSE of 0.09 for the six cell types (Fig. S4C). A relatively higher level of prediction accuracy was observed for four cell types (B cell, NK cell, CD4+ T cell and CD8+ T cell), whereas the results for Monocyte and Neutrophil deviated somewhat from the real data. Here we also assessed the robustness of our method under increasingly higher levels of noise. This time we found a rapid decrease of R2 and RMSE values with increasing levels of added noise, which indicates that high-level noise may affect robustness across platforms (Fig. S4C). In summary, our analysis indicates that the relative performance of deconvolution in heterogeneous mixtures depends on the cell types and on the noise level within the data.

Potential diagnostic value of DeconmiR for acute myeloid leukemia

Having demonstrated the accuracy and robustness of our method in cellular heterogeneity deconvolution, we investigated changes in cell fractions in acute myeloid leukemia (AML), using samples obtained from TCGA (Supplementary Data 4). We first estimated the cell type composition of these samples by applying our method to the miRNA profiling data. As depicted in Fig. 3A, AML samples present a heterogeneous composition of the six immune cell types. We then applied CIBERSORT to transcriptome data, and EpiDISH to DNA methylome data, from the matched samples for comparison. The CIBERSORT method based on the LM22 library distinguishes 22 hematopoietic cell types, including seven T cell subtypes, two B cell subtypes, plasma cells, NK cells, and myeloid subsets (Fig. S5A). We aggregated the subdivided cell types into the 6 major immune cell types to calculate their proportions. For instance, the cell proportions of CD4 naïve, CD4 memory resting, CD4 memory activated, follicular helper and Tregs were summed to obtain the total proportion of CD4+ T cell. For EpiDISH, proportions of seven different cell types were given (Fig. S5B). The results showed a moderate but significant concordance between the deconvolution results obtained from the different methods. A Pearson correlation coefficient (PCC) of 0.24 with a p-value of 1.4e-10 was observed between the cell compositions estimated by DeconmiR and CIBERSORT (Fig. 3B), whereas a PCC of 0.26 with a p-value of 4.3e-12 was obtained between DeconmiR and EpiDISH (Fig. 3B). Relatively better concordance was observed for B cell and NK cell according to the estimated RMSE values (Fig. 3B). One possible reason for the relatively low consistency between results obtained from different platforms is the limited comparability across multi-omics data (given that miRNA and mRNA levels are not highly correlated). A second explanation is that CIBERSORT yields cell fractions for 22 cell types in a given sample, which lack an exact correspondence to the cell types provided by DeconmiR. Nevertheless, these results still indicate that the estimated cell type proportions are consistent across different types of omics data.
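
The aggregation of fine-grained subtypes into major cell types is a simple sum over a mapping, as sketched below. Only the CD4+ T cell grouping named in the text is shown; the subtype label strings, the function name `aggregate_fractions` and the toy fractions are hypothetical, and the groupings for the other five major cell types would have to be defined in the same way.

```python
# Grouping of fine-grained subtypes into a major cell type. Only the CD4+ T cell
# group described in the text is listed; label strings are illustrative.
GROUPS = {
    "CD4+ T cell": [
        "CD4 naive",
        "CD4 memory resting",
        "CD4 memory activated",
        "Follicular helper",
        "Tregs",
    ],
}

def aggregate_fractions(fractions, groups):
    """Sum subtype fractions within each major cell type.

    Subtypes missing from `fractions` contribute zero.
    """
    return {
        major: sum(fractions.get(sub, 0.0) for sub in subtypes)
        for major, subtypes in groups.items()
    }

# Toy per-sample output with hypothetical values.
sample = {
    "CD4 naive": 0.10,
    "CD4 memory resting": 0.05,
    "CD4 memory activated": 0.02,
    "Follicular helper": 0.01,
    "Tregs": 0.02,
    "NK cell": 0.08,
}
major_fractions = aggregate_fractions(sample, GROUPS)
```

Because the summed groups do not map one-to-one onto the six reference cell types, some disagreement between methods is expected from the aggregation alone.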

Fig. 3: Deconvolution of AML and healthy samples using DeconmiR.

A Stacked bar charts of the cellular fractions for the 112 AML samples from TCGA. B Scatter plots of the cellular fractions predicted by DeconmiR and CIBERSORT (left panel), as well as by DeconmiR and EpiDISH (middle panel), for AML samples. Pearson correlation coefficients and p-values were calculated between the cellular fractions estimated by the different methods. The corresponding RMSE values for the comparison are also presented (right panel). C Boxplots of cell fraction distributions in AML (n = 112) and healthy (n = 77) samples across the six blood cell types. ****p < 0.0001 for two-sided Wilcoxon rank-sum test (p = 7.61e−16, 1.89e−17, and 5.19e−4 for CD8+ T cell, NK cell and Neutrophil, respectively). The boxes are bounded by the 25th and 75th percentiles and the center represents the median. The whiskers extend from each edge of the box to indicate the 1.5× interquartile range (IQR). D Clustering heatmap of the cell type fractions in AML samples. Two sample groups with different cell type fraction distributions can be observed. E Kaplan–Meier survival plot indicating different prognoses for these two groups of patients. The survival difference between groups is calculated by the log-rank test. Source data are provided as a Source Data file.

We further compared the cellular fractions of AML samples with those of healthy controls. To do this, we estimated the cell proportions of 77 additional whole blood samples from Juzenas et al. (Fig. S5C). We found significant differences in the proportions of CD8+ T cell, NK cell and neutrophil between AML and healthy samples. Specifically, CD8+ T cells presented significantly higher fractions in AML samples, while the fractions of NK cells and neutrophils were significantly lower in AML. Notably, we observed a higher level of heterogeneity in AML samples, which showed a much wider range of cellular composition for all six cell types (Fig. 3C). Considering the important role of cell composition in cancer prognosis, we performed hierarchical clustering of the AML samples based on the estimated cellular fractions to define subgroups. Two groups (Group A vs. Group B) could be distinguished by apparent differences in the fractions of four cell types: CD4+ T cell, CD8+ T cell, NK cell and neutrophil. Group A was enriched for higher fractions of CD8+ T cell and NK cell, whereas Group B presented higher fractions of CD4+ T cell and neutrophil (Fig. 3D). Kaplan–Meier analysis indicated that the cellular fractions characterize the differences in overall survival between the two groups (Fig. 3E). These results indicate a role for miRNA-based cell type deconvolution in cancer prognosis and the potential clinical use of our method for patient stratification.

DeconmiR improves sensitivity for aging related miRNA selection

Age-related changes in miRNA expression have been previously reported and functionally described46. To illustrate the application of our method to the identification of aging-associated miRNAs, we applied DeconmiR to miRNA profiling data from a cohort of 38 individuals, aged 24 to 79, for whom whole blood samples were taken and miRNA profiles were generated using the Affymetrix Multispecies miRNA-1 Array (Supplementary Data 5)47. We obtained the cell proportions of these samples using our method. The results indicated that lymphocytes, including CD4+ T cell and B cell, constitute the majority of the samples and neutrophils make up the rest (Fig. 4A). We then obtained the top components of data variation using singular value decomposition (SVD). In a principal component regression analysis, the top components were correlated against sample age and other potential confounders, including the inferred cell weights, with the aim of assessing which factors contribute to the inter-sample variability. We found that the top components of data variance were explained more by cell type heterogeneity than by age (Fig. 4B). The first principal component is significantly associated with the proportions of B cell, NK cell and monocyte, and the third principal component is significantly associated with the proportion of neutrophil, whereas age and gender are only weakly associated with the top principal components.

Fig. 4: DeconmiR improves detection sensitivity of aging-related miRNAs in blood and reveals aberrant cell type composition with SARS-CoV-2 infection.

A Boxplot of cellular proportions of 38 age-related blood samples as inferred by DeconmiR. The boxes are bounded by the 25th and 75th percentiles and the center represents the median. The whiskers extend from each edge of the box to indicate the 1.5× interquartile range (IQR). B Heatmap of associations between the top principal components and potential confounders of the miRNA profiling data. P-values are derived from the Pearson correlation test for numerical variables, and from the Kruskal-Wallis test for categorical variables. C QQ-plot for all miRNAs passing quality control from supervised analysis against age, adjusted only for gender (“No adjustment”). P-values are derived from simple regression and were q-value adjusted. D QQ-plot for all miRNAs passing quality control from supervised analysis against age, adjusted for gender and cell proportions as estimated by DeconmiR (“Cell type adjusted”). P-values are derived from multiple regression and were q-value adjusted. E Venn diagram indicating the number of specific and shared age-related miRNAs with and without cell type proportion adjustment. F Boxplot of cell type composition across six cell types in COVID-19 patients (n = 31) and healthy control (n = 16) samples. See (A) for boxplot definition. P-values are derived from two-sided Wilcoxon rank-sum test. G Boxplots of cell type composition for CD4+ T cell and neutrophil in COVID-19 patients with different clinical features (n = 15, 19 and 13 for mild, moderate and serious, respectively). See (A) for boxplot definition. P-values are derived from two-sided Wilcoxon rank-sum test. Source data are provided as a Source Data file.

To determine the effect of the observed confounding between age and cell composition at the level of individual miRNAs, we examined age-associated miRNAs with and without adjustment for cell type composition. Using a linear regression model, a total of 14 miRNAs were identified as associated with age, of which 8 were positively and 6 negatively correlated (q-value < 0.05, Fig. 4C and Supplementary Data 6). After adjusting for the cell type composition estimated by DeconmiR, we observed 27 age-associated miRNAs, nearly double the number found without adjustment. Of these miRNAs, 17 are positively and 10 are negatively correlated with age (q-value < 0.05, Fig. 4D and Supplementary Data 6). Importantly, among the 14 age-associated miRNAs identified without cell type adjustment, 13 are also included among those identified with adjustment (Fig. 4D and Fig. S6). This analysis indicates that DeconmiR greatly improves the sensitivity of the age-related miRNA screen over the unadjusted analysis.
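
The effect of the adjustment can be illustrated with a small worked example: ordinary least squares of one miRNA on age, with and without an estimated cell fraction as covariate. Everything below is a toy construction with hypothetical, noise-free values (a true age slope of 0.05 and a cell fraction that declines with age); it does not reproduce the study's model, covariates or q-value procedure. In practice at least one cell fraction has to be left out of the model, because the fractions sum to one and would otherwise be collinear with the intercept.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ols(X, y):
    """Ordinary least squares coefficients via the normal equations."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)

# Toy cohort (hypothetical values): one cell fraction tends to fall with age.
ages = [20, 30, 40, 50, 60, 70]
fraction = [0.5, 0.3, 0.45, 0.25, 0.4, 0.2]
# Simulated miRNA: true age slope 0.05 plus a cell-composition effect.
expression = [0.05 * a + 10 * f for a, f in zip(ages, fraction)]

# Age coefficient without and with the cell fraction as covariate.
unadjusted_slope = ols([[1.0, a] for a in ages], expression)[1]
adjusted_slope = ols([[1.0, a, f] for a, f in zip(ages, fraction)], expression)[1]
```

Here the unadjusted age slope is attenuated to 0.01 because the composition effect partly masks the age effect, while the adjusted model recovers 0.05. This is one way in which adjusting for cell composition can increase the number of detectable age associations.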

Cell type composition change in the blood samples of COVID-19 patients

The pandemic of COVID-19 caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has led to enormous infections and loss of life since its origin in 2019. The susceptibility and clinical manifestations to the coronavirus vary greatly among individuals. Many studies have indicated that the immune landscape aberration could be the key mechanisms causing the clinical manifestation differences of COVID-19 patients48,49. To test if our method could capture the immune properties of COVID-19 patients, we obtained miRNA-seq data for whole blood from a cohort of 47 individuals (Supplementary Data 7)50. By applying DeconmiR to miRNA-seq data and CIBERSORT to RNA-seq data from matched samples, we observed concordant results for results from DeconmiR and CIBERSORT as before (Fig. S7A). Then we checked the composition difference between COVID-19 patients and control samples for immune cell types and found a significant difference between the two groups. The proportions of the lymphoid lineage including CD4+ T cell, CD8+ T cell and NK cell decreased significantly in COVID-19 patients, whereas the myeloid lineage increased significantly. Notable differences can be observed for the CD4+ T cell and neutrophil (Fig. 4F). To further explore whether these cellular compositions could characterize clinical features of COVID-19 patients, we also collected another cohort data for 47 PBMC samples from 18 patients with different clinical features (Supplementary Data 8)51. Here again, concordant estimations are obtained by using DeconmiR and CIBERSORT, respectively (Fig. S7B). When checking the cellular compositions among groups of patients with different clinical features, we found the proportion of monocytes were gradually elevated, notably in serious COVID-19 patients, while the CD4+ T cell were decreased significantly (Fig. 4G). This observation was verified by some other studies using the elaborate single cell methods52. 
Collectively, this analysis further confirmed the accuracy of our deconvolution approach in dissecting pathogenesis and revealed dynamic changes in the intrinsic immune landscape in the inflamed state of COVID-19 patients.

DeconmiR reveals immune cell infiltration in breast cancer

Having demonstrated our method in blood, we next assessed the effectiveness of miRNA-based deconvolution on real mixtures of epithelial and immune subtypes. To do this, we applied our method to breast tissue and constructed a reference matrix containing 10 cell types from four categories: epithelial, fibroblast, adipocyte and immune cells. We used miRNA profiling data for the 7 immune cell subtypes from FANTOM543, and further collected profiles for human breast epithelial cells (HMEC), fibroblasts and adipocytes. We then repeated the same procedure as before to obtain 271 cell-type-specific miRNAs, which accurately discriminated samples from different cell types (Fig. S8A, Supplementary Data 9). The reference matrix constructed from these miRNAs also discriminated immune cells from the other cell types (Fig. 5A and Fig. S8B). Comparison of these miRNAs with the blood references indicated that a substantial proportion of them are specific to the breast reference (Fig. S8C). We applied this reference to 1000 in silico mixture samples and observed high consistency between predicted and true cell-type fractions, with R2 values above 0.95 and RMSE below 0.06 (Fig. S8D). We then applied our method to an independent dataset of four samples of a purified breast epithelium cell line and 36 whole blood samples from the microRNAome project45. For the breast epithelium samples, epithelial cells were estimated as the dominant component in all four samples, in line with the high epithelial purity expected (Fig. 5B). For the 36 whole blood samples, in contrast, immune cells constituted the major categories (Fig. 5C). All these results demonstrate the reliability of our reference matrix for cell proportion estimation in solid tissues such as breast.

Fig. 5: Reference construction and cell type deconvolution of human breast tissues.

A Hierarchical clustering heatmap for the 10-cell-type reference matrix constructed for breast tissue. B Stacked bar charts of the cell fractions inferred by DeconmiR for 4 samples of a breast epithelium cell line from the microRNAome project. C Boxplot of cellular proportions inferred by DeconmiR for 36 whole blood samples from the microRNAome project. The boxes are bounded by the 25th and 75th percentiles and the center line represents the median. The whiskers extend from each edge of the box to 1.5× the interquartile range (IQR). D Boxplot of cell type composition across the six major cell categories for breast cancer (n = 1085) and healthy control (n = 104) samples from TCGA. See (C) for boxplot definition. P-values are derived from a two-sided Wilcoxon rank sum test. E Kaplan–Meier survival plot for two groups of patients classified by macrophage level. The survival difference between groups is assessed by the log-rank test. Source data are provided as a Source Data file.

Next, we assessed immune cell infiltration in breast cancer samples. To do this, we estimated the epithelial and immune cell contents for 1182 normal and breast cancer samples from TCGA (Supplementary Data 10). Both normal and tumor samples exhibited substantial immune cell infiltration, and all major cell categories changed significantly between normal and tumor samples. In general, the epithelial fraction increased in tumor samples, whereas the fibroblast and adipocyte fractions decreased. For the immune content, both the lymphoid and myeloid lineages decreased, whereas the macrophage fraction increased dramatically (Fig. 5D). An in-depth analysis revealed more heterogeneous changes among individual immune cell types: the CD8+ T cell, B cell and NK cell fractions increased, whereas the CD4+ T cell and neutrophil fractions decreased significantly. Notably, monocytes were absent from both normal and tumor samples, as expected (Fig. S9A). As tumor-associated macrophages (TAMs) have been reported to be associated with poor outcomes in cancer patients, we examined whether the macrophage infiltration level estimated by our method can be used for patient prognosis53,54. By classifying all tumor samples into high and low groups based on macrophage fraction, we confirmed that a higher proportion of macrophages was associated with worse prognosis (Fig. 5E). Thus, the immune cell fractions obtained using DeconmiR can serve as an effective prognostic indicator for breast cancer.

Finally, we explored the characteristics of the immune cell landscape across breast cancer subtypes defined by immunohistochemistry (IHC) or PAM50. Fig. S9B shows that ER+ tumors present a greater proportion of total immune cells, but with opposite trends in the lymphoid and myeloid lineages: the lymphoid fraction is higher and the myeloid fraction is lower in ER+ tumors. For the intrinsic subtypes defined by PAM50 clustering, both the total immune cell and lymphoid lineage fractions were highest in the Luminal A subtype (Fig. S9C). In contrast, the myeloid fraction was highest in Luminal B tumors, and was also high in the Her2+ and Basal subtypes, which indicates significant heterogeneity of the immune infiltrate among subgroups. We further investigated immune infiltration patterns among different tumor samples. As depicted in Fig. S9D, the ER+ or PR+ subtypes and the Luminal A subtype were found to have higher fractions of CD4+ T cells and NK cells, but a lower fraction of neutrophils. These results suggest that our method can provide important insights into dynamic immune cell infiltration in solid tumors and offer valuable indicators for molecular classification and patient prognosis.

Cell type deconvolution across tumor types to identify cancer biomarkers

To further demonstrate the suitability of our method for other solid epithelial tissues (e.g. lung, liver and kidney), we constructed tissue-type-specific reference matrices using the corresponding epithelial and fibroblast cell lines and the purified immune cell subtypes used earlier. In addition, we constructed a generic matrix for a hierarchical two-step deconvolution mode, a strategy adopted by the hEpiDISH method, whose effectiveness has been validated in complex epithelial tissues40. In this case, we used different epithelial and fibroblast cell lines with reasonable sample numbers (n = 10), as well as immune cell types, to construct the miRNA reference database for the three main cell types (Fig. S10 and Supplementary Data 11). We performed deconvolution on 16 epithelial cancer types for which miRNA profiling data are available from TCGA. We would expect consistency with independent cell-type fraction estimates obtained using a different molecular assay, such as RNA-seq. To test this, we applied the ESTIMATE algorithm to the matched RNA-seq profiles of normal-adjacent and cancer samples profiled as part of TCGA55. Comparison of the total immune fraction estimated by DeconmiR from the miRNA profile with the corresponding immune score obtained by ESTIMATE from the matched RNA-seq profile revealed remarkably good agreement for both normal and cancer tissue and across all cancer types (Fig. S11). These findings attest to the general pan-cancer applicability of DeconmiR.

We next demonstrate the effectiveness of our method for inferring cell-type-specific DEMs in solid epithelial tissue as cancer biomarkers. We first considered DEMs between normal and cancer samples, using breast cancer as an example because miRNA profiles from breast cancer cell lines are available. We constructed an approximate gold standard set of breast cancer DEMs by intersecting the lists of DEMs derived from comparisons of normal to breast cancer cell lines in the FANTOM5 and microRNAome projects, in which 372 and 408 DEMs were identified, respectively (Supplementary Data 12). The overlap of DEMs from the two comparisons defines a high-confidence set of 160 true positives occurring within the epithelial compartment of breast cancers, as well as a list of 1504 true negatives, which are not differentially expressed in either comparison (Fig. 6A and Fig. S12). Treating the proportions of the three cell types (epithelial, fibroblast and immune) as covariates in the differential expression analysis of the TCGA breast cancer cohort, we obtained a sensitivity of 30% and a specificity of 74% (Fig. 6B). An unadjusted analysis achieved a higher sensitivity (41%) but a lower specificity (62%), and the state-of-the-art reference-free approach, surrogate variable analysis (SVA)56, achieved similar values to the unadjusted analysis for sensitivity (42%) and specificity (61%). Thus, the increased specificity supports the view that DeconmiR is able to remove false positives attributable to changes in cell-type composition between normal and cancer samples. This approach provides reasonably accurate estimates of cell-type fractions while also controlling the false-positive rate for cancer biomarker identification in complex epithelial tissues.
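The sensitivity and specificity reported above follow from comparing a list of called DEMs against the gold standard true positives and true negatives. A minimal sketch of that bookkeeping is given below; the function name and the toy miRNA identifiers are hypothetical and are not taken from the DeconmiR package, and miRNAs outside the gold standard are assumed to be ignored.

```python
def sensitivity_specificity(called, true_positives, true_negatives):
    """Score a set of called DEMs against a gold standard.

    called: miRNAs reported as differentially expressed by some analysis.
    true_positives / true_negatives: gold standard sets (assumed disjoint).
    miRNAs absent from both gold standard sets do not affect the scores.
    """
    called = set(called)
    positives = set(true_positives)
    negatives = set(true_negatives)
    tp = len(called & positives)   # gold standard DEMs that were called
    fn = len(positives - called)   # gold standard DEMs that were missed
    fp = len(called & negatives)   # non-DEMs that were wrongly called
    tn = len(negatives - called)   # non-DEMs correctly left uncalled
    return tp / (tp + fn), tn / (tn + fp)


# Toy example with hypothetical miRNA names.
gold_pos = {"miR-a", "miR-b", "miR-c", "miR-d"}
gold_neg = {"miR-e", "miR-f", "miR-g", "miR-h", "miR-i"}
calls = {"miR-a", "miR-b", "miR-e", "miR-z"}  # miR-z is not in the gold standard
sens, spec = sensitivity_specificity(calls, gold_pos, gold_neg)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this toy case two of four true positives are recovered and one of five true negatives is wrongly called, so adjusting for cell-type fractions would be judged by how it shifts these two numbers.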

Fig. 6: Applications of DeconmiR in solid tumor to identify cancer biomarkers.

A Strategy used to construct a gold standard list of breast cancer differentially expressed miRNAs. Briefly, two separate lists of DEMs are derived by comparing breast cancer cell lines to normal breast epithelial cells from the FANTOM5 and microRNAome projects, respectively. By taking the overlap of these two lists, a high-confidence set of breast cancer DEMs occurring in the epithelial compartment of the breast was obtained. B The sensitivity and specificity of three different DEM identification methods: unadjusted analysis, analysis adjusted using DeconmiR-estimated cell-type fractions as independent covariates, and SVA. C Identification of epithelial-specific smoking differentially expressed miRNAs and their relevance in lung cancer. Smoothed scatterplot of CellDMC t-statistics in epithelial cells (x-axis) versus those in immune cells (y-axis) derived from a cohort of 48 pairs of lung adenocarcinoma and non-malignant lung tissue. Pearson correlation coefficients and p-values were calculated. Green dashed lines indicate the level of statistical significance (FDR < 0.05). Orange and green points mark the up- and downregulated smoking miRNAs derived from bronchial airway with smoking as the phenotype. Those that passed the statistical significance threshold are labeled. D Boxplots indicating that the top-ranked smoking-related miRNAs in bronchial airway, miR-183-3p and miR-139-5p, are also differentially expressed between normal and cancer samples (n = 48). The boxes are bounded by the 25th and 75th percentiles and the center line represents the median. The whiskers extend from each edge of the box to 1.5× the interquartile range (IQR). P-values are derived from a two-sided Wilcoxon rank sum test. E Scatterplots of the average expression of smoking up- and downregulated miRNAs against the epithelial and immune cell fractions. Pearson correlation coefficients and p-values were calculated. 
F Scatterplots of expression levels of the top-rank smoking-associated epithelial-specific miRNA miR-183-3p against the epithelial cell fraction (left panel) and total immune cell fraction (right panel) across the 559 LUAD samples from the TCGA. Pearson correlation coefficients and p-values were calculated. G Scatterplot of Spearman correlation coefficients (y-axis) between expression and epithelial fraction, as computed over the 559 TCGA LUAD samples, against the CellDMC t-statistics for predicted epithelial-specific miRNAs derived from GSE110907. Dashed lines indicate level of statistical significance (FDR < 0.05). P-value is from Fisher’s exact test on the miRNAs passing significance in each quadrant. H Average expression levels of cell-specific miRNAs derived with CellDMC from the GSE110907, in the normal-adjacent (n = 46) and LUAD (n = 513) samples from the TCGA. See (D) for boxplot definition. P-values are derived from two-sided Wilcoxon rank sum test. Source data are provided as a Source Data file.

Finally, to demonstrate how miRNA-based cell type deconvolution may provide new biological insights, for instance by revealing smoking-associated miRNA changes in epithelial cells, we collected miRNA profiles from a cohort of 347 bronchial airway samples with different smoking exposure statuses (GSE93284)57. A set of 175 miRNAs significantly differentially expressed between current and former smokers was obtained as those with smoking-related expression changes in bronchial epithelium, of which 89 were upregulated and 86 downregulated (Supplementary Data 13). We examined whether these smoking-associated miRNAs can be observed in the epithelial compartment of lung cancer, so that they could be used as early diagnostic markers with bronchial epithelium as a surrogate tissue. For this purpose, we collected miRNA profiles from another cohort of 48 pairs of lung adenocarcinoma and non-malignant lung tissue (GSE110907)58 and applied our method to infer cell type fractions. As expected, lung cancer samples mainly contained epithelial and immune cells, and an elevated epithelial fraction was observed in tumor samples (Fig. S13A). We then identified cell-type-specific DEMs in these two cell types by applying the CellDMC method59, which was originally designed for identifying cell-type-specific differentially methylated positions from DNA methylome data, but whose effectiveness for transcriptome data has also been validated60. We verified that many top-ranked smoking-associated miRNAs in bronchial epithelium were predicted to be expressed specifically in the epithelial or immune compartments of the lung tissue (Fig. 6C, D). The epithelial-specific miR-183-3p, for instance, has been found to be significantly upregulated in lung cancer tissues, where it is associated with lymph node metastasis and poor prognosis of lung cancer patients61,62. The immune-specific miR-139-5p has been identified as a tumor suppressor that is downregulated in lung adenocarcinoma63. 
The miR-139 family members have been reported as regulators of PD-1 and other immunocytokines, and are therefore thought to play an important role in the immune evasion of cancer64,65. The up- and downregulated smoking miRNAs exhibited a highly significant trend toward positive correlations between their expression levels and the estimated epithelial and immune cell-type fractions, respectively (Fig. 6E). To further validate these epithelial-specific smoking miRNAs, we also estimated the fractions of different cell types for 559 lung adenocarcinoma samples from TCGA and observed similar patterns (Fig. S13B). We confirmed this correlation pattern between expression and epithelial cell fraction for the top-ranked upregulated epithelial smoking miRNAs (Fig. 6F), and also for immune miRNAs (Fig. S13C, D). Moreover, the variance in their expression levels increases with the fraction of epithelial cells in cancer, with a concomitant decrease in variance in samples with a higher immune cell content. In addition, the top-ranked epithelial-specific miRNAs with respect to lung cancer exhibited highly significant trends toward positive and negative correlations in the independent validation (Fig. 6G). Up- and downregulated epithelial-specific miRNAs exhibited increased and decreased levels of expression in lung cancer, respectively (Fig. 6H). Thus, the smoking-dysregulated bronchial miRNAs identified here represent alterations occurring in the epithelial cell of origin of smoking-related lung cancer and may therefore mark cells that are selected during lung carcinogenesis.

Discussion

Computational cell type deconvolution has been demonstrated to be a useful strategy for dissecting the heterogeneity of complex tissues66,67,68. However, in silico deconvolution methods applicable to miRNA profiling data have been unavailable. Although cell type fractions can be obtained for samples with matched multi-omics data, such as mRNA profiling, only about 5% of samples have such data. We therefore developed DeconmiR, a deconvolution predictor that infers cell type composition from miRNA expression profiling data. Using both simulated and experimental data, we demonstrate that our method produces highly accurate predictions for a diverse range of cell types and is robust to data noise. Its precision is comparable to that of other gene expression or DNA methylation-based deconvolution tools. Its success indicates that noncoding RNA has great potential for exploring the cellular composition of complex tissues.

Genome-wide miRNA profiling data have been accumulating rapidly in the past few years69. However, a key challenge for the biological interpretation of these data is that expression measurements are generally performed on heterogeneous specimens. Ensemble measurements on bulk samples can be severely affected by cell-type composition, which can vary substantially between individuals70. Surprisingly, studies continue to ignore this important confounding factor when investigating miRNA profiling data71,72,73. Although the recently emerged single-cell small RNA sequencing promises to facilitate the understanding of sample heterogeneity, the application of such methods to various tissues is still limited. Many of these single-cell miRNA profiles were generated from purified cell lines, with no real tissue samples available; in addition, most of these datasets profile very limited numbers of miRNAs or cells23,24,25. Hence, we still lack a systematic survey of cellular heterogeneity and its influence on RNA omics studies from the perspective of miRNA profiling. Our tool is therefore needed to capture the spatiotemporal complexity behind the diversity of RNA biology.

To ensure accurate cell type quantification in complex tissues, DeconmiR uses several approaches to obtain highly cell-type-specific miRNAs and build the reference signature matrix that is most informative for deconvolution. First, we collected a large cohort dataset that profiled a sufficient number of samples for the different immune cell types in our study. Including more replicates for each cell type gives rise to a more robust signature matrix for deconvolution. Second, we selected the most informative miRNAs that discriminate different cell types by applying pairwise comparisons with t-test-like approaches. Our results indicated that good performance in discriminating cell types can be achieved with a library of fewer than 200 miRNAs, far fewer than the number of genes used by transcriptome-based deconvolution methods, such as the LM22 library created and used by CIBERSORT29, or the number of CpGs used by DNA methylome-based methods, which generally include hundreds or even thousands of sites31,39,74,75. Finally, we verified that the performance of miRNA profiling-based deconvolution can be improved by logarithmic transformation of the data. Data scaling has been reported to be important for reducing noise in omics data. There is, however, still debate on whether it improves accuracy for transcriptome-based deconvolution methods. Several tools require log-transformation for RNA-seq-based computational deconvolution76,77, whereas other studies argue that log-transformed data violate the linearity assumption of regression and lead to worse estimates66,78. The latter show the necessity of data scaling prior to deconvolution but conclude that non-log-transformed TPM, RSEM or FPKM values are more appropriate for accurate estimation of cell type proportions. 
In our experience, more accurate deconvolution of miRNA expression profiles is achieved on the logarithmic scale. We speculate that this is because the inflated nature of RPM values leads to extremely high noise levels, and because outliers and the non-normal distribution of the profiling data greatly affect model performance, sometimes even making correct estimation impossible. In addition, we note that many newly developed methods use weighted least squares approaches combined with scRNA-seq-derived references when analyzing RNA-seq samples with highly skewed distributions79,80,81. These methods achieve high accuracy for the estimation of rare cell types. However, they are prone to errors when applied to noisy data, which is exactly the case for miRNA profiling data. For this reason, the RPC method was selected, as it is more robust for miRNA data with high noise levels.

We assessed the performance of DeconmiR on a set of benchmark datasets, where it produced accurate results and was robust against noise when tested on mixtures of human immune cells. We also showcased the effectiveness of DeconmiR in different application contexts. When applied to the human malignancy AML, DeconmiR revealed differences in cell type composition between disease and control samples. Cellular fractions predicted by our method were able to discriminate samples into subgroups, which suggests potential value for the clinical diagnosis and prognosis of AML. When applied to an age-related miRNA profiling study, it identified twice as many age-related miRNAs as the analysis unadjusted for cell type, confirming that DeconmiR can substantially increase screening sensitivity. In addition, it successfully captured the dynamic changes of major cell types in the blood of COVID-19 patients, which shows its value for exploring the immune response in infectious diseases.

We also extended the reference-based framework to complex tissues other than blood. In doing so, we provide not only detailed miRNA reference databases for individual tissue types built from several independent datasets, but also a generic reference that can reliably estimate the fractions of epithelial, immune and stromal (mainly fibroblast) cells in complex tissues through a hierarchical deconvolution strategy40. Using our novel reference matrix, we demonstrate that complex tissues exhibit widespread contamination by immune cells. Using breast tissue as an example, our analysis shows that the reference matrix can be used not only to estimate the cell content of samples, but also to infer cancer-specific miRNA biomarkers occurring in the epithelial compartment while controlling sensitivity and specificity. In the case of lung cancer, we show that many of the top-ranked smoking-associated miRNAs detected by profiling of bronchial airway can be detected specifically in the epithelial compartment of lung cancer samples, while many others also occur in the immune compartment. These cell-type-specific miRNA changes in the bronchial and lung epithelium may affect or reflect deregulation of regulatory networks that contribute causally to lung cancer development. In addition, these results have profound implications for future miRNA profiling studies that aim to use easily accessible surrogate tissues to identify miRNA alterations representative of those that originate in the epithelial or immune compartments of solid tissues. We believe that further work demonstrating the value of miRNA profiling data in this area could help improve biological interpretation, prioritize candidates that require functional validation, and elucidate causal pathways to different diseases.

We also note limitations to the broader application of this method. One is the collinearity observed for some closely related cell types in the miRNA-seq data, which may lead to confounded deconvolution82. In addition, we observe that some miRNAs in the reference library belong to the same family and share common seed regions or clusters, for example the miR-181 family members miR-181a-5p, miR-181b-5p, miR-181c-5p and miR-181d-5p in the blood reference. In total, ten miRNAs were identified as originating from the same clusters or sharing identical seed sequences, representing 7% of the miRNAs in the library. These highly correlated miRNAs may introduce redundant information into the reference library, potentially leading to higher weights for specific cell types. Thus, the deconvolution results should be interpreted carefully to minimize potential bias toward particular cell types. Nevertheless, we foresee that our tool will be of wide benefit to scientific and clinical fields.

In summary, we present a novel reference-based deconvolution algorithm, DeconmiR, for in silico dissection of the cell type composition of complex tissues from miRNA profiling data. It offers researchers flexibility in terms of cell type and platform, and constitutes a useful addition to the existing tools for tissue decomposition that use transcriptome or DNA methylome data. We expect the proposed method to be broadly applicable to different scenarios, increasing our understanding of the roles of cellular heterogeneity in different biological processes and facilitating miRNA biomarker identification in physiological and disease samples.

Methods

Data collection and reference profile generation from circulating cells

The first critical step involves the selection and assembly of a library that incorporates miRNA signatures characteristic of the component cell types, using pure cell-type profiling data. We therefore collected miRNA-seq data from FANTOM5, TCGA and other miRNA profiling studies for individual cell types and tissues. We first tested our method on peripheral blood, as it is the most commonly used biospecimen for biological analyses and is easily accessible. We used publicly available miRNA-seq profiling data from Juzenas et al., which were generated from FACS-sorted peripheral blood cells from 43 individuals (GEO Accession: GSE100467)34. Specifically, we used the purified leukocyte subtypes of the lymphoid lineage, including CD4+ T cell (T helper cell), CD8+ T cell (cytotoxic T cell), B cell (CD19+) and Natural Killer cell (NK cell, CD56+), and of the myeloid lineage, including Monocyte (CD14+) and Neutrophil (CD15+). In total, 257 samples were collected for these six cell types. The genome-wide miRNA profiling data, quantified as RPM, were log-transformed after adding a pseudo count of 1 in order to regularize the data for downstream analysis. We used a leave-one-out strategy to construct the reference profile for immune cell types. We first picked samples with purity higher than 85% for reference construction and in silico mixture sample generation. Given n samples for a particular cell type within the dataset, n-1 samples were used to derive the signature matrix, and the remaining sample was used to synthesize the in silico mixture samples. We used the well-known empirical Bayes framework of moderated t-statistics (limma) to derive DEMs between each of the six cell types and the remaining cell types. This tool has been widely used to identify differential features between sample groups and provides reliable inference even with small sample sizes35. 
miRNAs were considered differentially expressed at a false discovery rate (FDR) threshold of 0.05, with the additional requirement that the expression fold change between the two groups be larger than 2. For each cell type, we picked the top 50 DEMs, and the non-redundant set of miRNAs across all cell types was kept. The expression centroids were then calculated for each miRNA to generate the final signature matrix.
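The last two steps of this procedure, log-transforming RPM values with a pseudo count of 1 and computing per-cell-type expression centroids for the selected miRNAs, can be sketched as follows. This is an illustration only and not the DeconmiR code: the function names and toy data are hypothetical, base-2 logarithms are assumed, and the centroid is assumed to be the arithmetic mean of the log-transformed values within each cell type. Marker selection with limma is not reproduced here; the marker list is taken as given.

```python
import math
from collections import defaultdict


def log_transform(rpm):
    """log2(RPM + 1); the base of the logarithm is an assumption."""
    return math.log2(rpm + 1.0)


def build_signature(samples, labels, markers):
    """Build a signature matrix of per-cell-type centroids.

    samples: list of dicts mapping miRNA name -> RPM for one purified sample.
    labels:  cell type of each sample, in the same order.
    markers: pre-selected cell-type-specific miRNAs (selection not shown).
    Returns {miRNA: {cell_type: mean log2(RPM + 1)}}.
    """
    sums = {m: defaultdict(float) for m in markers}
    counts = defaultdict(int)
    for profile, cell_type in zip(samples, labels):
        counts[cell_type] += 1
        for m in markers:
            # miRNAs missing from a profile are treated as RPM 0
            sums[m][cell_type] += log_transform(profile.get(m, 0.0))
    return {m: {ct: sums[m][ct] / counts[ct] for ct in counts} for m in markers}


# Toy data with hypothetical miRNA names and two cell types.
samples = [
    {"miR-1": 3.0, "miR-2": 0.0},
    {"miR-1": 15.0, "miR-2": 0.0},
    {"miR-1": 0.0, "miR-2": 7.0},
]
labels = ["B", "B", "NK"]
signature = build_signature(samples, labels, ["miR-1", "miR-2"])
print(signature["miR-1"]["B"], signature["miR-2"]["NK"])
```

In a leave-one-out setting, the held-out sample of each cell type would simply be excluded from `samples` before calling `build_signature`.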

To generate reference matrices for solid tissues, we gathered miRNA profiling data for 28 purified cell types or cell lines from a separate publicly available study, FANTOM543, which mainly covers epithelial, endothelial and fibroblast cells from ten different tissues, including the breast, lung, liver, esophagus, kidney, prostate, respiratory tract, skin and vessel, as well as purified samples representing seven major immune cell types (CD4+ and CD8+ T cells, B cell, NK cell, neutrophil, monocyte and macrophage) and adipocytes from breast tissue. We required at least three samples for each cell type. To derive signature matrices, a custom limma-based wrapper function was used for all pairwise comparisons between candidate cell types within each tissue. In addition, we constructed a generic matrix that is used for a hierarchical deconvolution strategy and for cell-type deconvolution of general epithelial tissues. To do this, the miRNA profiles were categorized into the major cell groups of epithelial, fibroblast and immune, and differentially expressed miRNAs were selected by comparisons among the three groups, each comparing one group to the other two. The reference was then obtained by averaging the expression values of these miRNAs within each of the three main cell groups. Finally, a second reference matrix consisting of only the seven immune cell types was constructed to estimate their relative proportions. We required that the second matrix share no miRNAs with the first matrix covering the three cell groups. This strategy ensures that the inference of immune subtype fractions is not significantly confounded by variation in the epithelial and fibroblast proportions, and has been validated by a previous study for DNA methylation-based deconvolution.

Deconvolution method development for bulk tissue data

As mentioned above, one application of the identified marker miRNAs and the associated reference profiles is cellular deconvolution, in which we use the reference profiles generated in the first stage to estimate the cellular fractions of bulk samples. We adopted the statistical framework of Robust Partial Correlation (RPC), which has been used for DNA methylation and gene expression data, as the benchmark method. This method has shown high accuracy and robustness to noise in many scenarios. Four other widely used statistical methods, namely solving linear equations (SLE), non-negative least squares (NNLS), support vector regression (SVR) and quadratic programming (QP), are also included for comparison. In addition, we use a tree-based strategy to estimate cell-type fractions for all nine cell types (epithelial, fibroblast and the seven immune subtypes listed above) in any given tissue. Briefly, we apply DeconmiR with RPC and the generic reference to obtain cell-type fractions for the epithelial, fibroblast and total immune components. Then, we reapply DeconmiR with RPC and the immune reference database to estimate relative proportions for all seven immune cell subtypes. Finally, these relative proportions are rescaled by multiplying them by the total immune fraction estimated in the first step to obtain the absolute proportions of all seven immune subtypes in the sample of interest. This hierarchical approach has shown advantages over constructing a single reference that encompasses all cell types40. All these functions and references were wrapped into the R package DeconmiR (https://github.com/zyangx/DeconmiR). We have also developed a comprehensive and interactive web resource (https://www.compgenelab.info/DeconmiR/). With this platform, users can easily obtain cell fraction estimates from miRNA profiling data for a specific tissue of interest. 
In order to evaluate the concordance between deconvolution results across multi-omics data for matched AML samples from TCGA, we obtained the RNA-seq data to calculate cell fractions for the six immune cell types by applying the well-known deconvolution method of CIBERSORT and the associated LM22 reference library29, as well as the DNA methylome data by applying EpiDISH and the reference library, which is based on DNase Hypersensitive Site (DHS) screened cell-type-specific CpGs39. Furthermore, the total immune cell fractions obtained by DeconmiR for solid tumors from TCGA were compared to RNA-seq data-based estimates from ESTIMATE algorithm55 in the absence of flow-cytometry-based estimates of the different cell types. To evaluate the deconvolution performance, the Root-mean-square error (RMSE) and Pearson correlation coefficient (PCC) were calculated for the simulated or real-world datasets.
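The final rescaling step of the hierarchical (two-step) procedure described above amounts to multiplying each relative immune subtype proportion by the total immune fraction from the first step. A minimal sketch is shown below, assuming both deconvolution steps have already been run; the function name, dictionary keys and numbers are hypothetical and chosen for illustration only.

```python
def hierarchical_fractions(top_level, immune_relative, immune_key="Immune"):
    """Combine two-step deconvolution estimates into absolute fractions.

    top_level: fractions from step 1, e.g. epithelial, fibroblast and total
               immune (assumed to sum to 1).
    immune_relative: relative proportions of immune subtypes from step 2
               (assumed to sum to 1 within the immune compartment).
    """
    total_immune = top_level[immune_key]
    # Keep the non-immune components unchanged.
    absolute = {k: v for k, v in top_level.items() if k != immune_key}
    # Rescale each immune subtype by the total immune fraction.
    for subtype, rel in immune_relative.items():
        absolute[subtype] = rel * total_immune
    return absolute


# Toy estimates (hypothetical values).
step1 = {"Epithelial": 0.6, "Fibroblast": 0.1, "Immune": 0.3}
step2 = {"CD4T": 0.5, "Macrophage": 0.3, "Neutrophil": 0.2}
fractions = hierarchical_fractions(step1, step2)
for cell_type, value in fractions.items():
    print(f"{cell_type}: {value:.3f}")
```

Because the immune subtypes are rescaled by the step 1 immune fraction, the combined fractions still sum to 1 whenever both inputs do.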

Generation of pseudo bulk mixtures for model evaluation

To evaluate whether our method could capture cell composition from the miRNA expression profiles of cell mixtures, we generated in-silico mixtures of immune and non-immune cells to simulate bulk tissues. First, we prepared 100 different in-silico mixtures of the purified immune cell miRNA profiles, with weights randomly drawn from a uniform (0,1) distribution and subject to the constraint that the weights sum to 1. The RMSE between the estimated and known weights of the corresponding cell types was calculated to assess the performance of each algorithm. R² values between estimated and true weights were also calculated.
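The mixture construction can be sketched in Python as follows. This is one plausible reading of the procedure, not the authors' code: it assumes the sum-to-one constraint is imposed by dividing the uniform draws by their total, and that profiles are mixed as a weighted sum on whatever scale they are stored. The toy profile values and the names `random_weights` and `mix_profiles` are hypothetical.

```python
import random


def random_weights(n_cell_types, rng):
    """Draw uniform(0, 1) weights and normalize them so they sum to 1."""
    raw = [rng.uniform(0.0, 1.0) for _ in range(n_cell_types)]
    total = sum(raw)
    return [w / total for w in raw]


def mix_profiles(profiles, weights):
    """Weighted sum of purified profiles, giving one value per miRNA."""
    n_mirnas = len(profiles[0])
    return [
        sum(w * profile[i] for w, profile in zip(weights, profiles))
        for i in range(n_mirnas)
    ]


# Toy purified profiles (rows: cell types, columns: miRNAs); values are made up.
purified = [
    [10.0, 2.0, 0.5, 7.0],
    [1.0, 9.0, 3.0, 7.5],
    [0.2, 1.5, 8.0, 6.0],
]

rng = random.Random(42)  # fixed seed so the simulation is reproducible
all_weights = [random_weights(len(purified), rng) for _ in range(100)]
mixtures = [mix_profiles(purified, w) for w in all_weights]
```

Each entry of `all_weights` is the known ground truth against which the fractions estimated from the corresponding entry of `mixtures` would be compared.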

Flow cytometry and miRNA expression analysis of adult whole blood samples

We further validated our deconvolution method in an independent cohort of in-house collected samples. To do this, we collected whole blood samples from 14 adult donors (12 males and 2 females; average age 51.9 years) from the First Affiliated Hospital, Wannan Medical College (Wuhu, Anhui, P. R. China), with informed consent obtained. Each of the 14 whole blood samples was divided into two equal aliquots, each containing a different anticoagulant: heparin, citrate or EDTA. For each sample, portions of the aliquot in heparin were used for conventional immune profiling, including flow cytometry, manual 5-part white blood cell differential and CBC with automated 5-part white blood cell differential, so that cell proportions were acquired for eight different cell types (CD4+ T cell, CD8+ T cell, B cell, NK cell, monocyte, neutrophil, eosinophil and basophil). In parallel, another aliquot of each sample was subjected to high-throughput miRNA sequencing. Details of the RNA extraction and post-processing of the miRNA profiling data are provided in Supplementary Methods. All surveys and experiments were approved by the Ethics Committee for Clinical Research of the First Affiliated Hospital of Wannan Medical College. miRNA expression profiles quantified as RPM were obtained and log2 transformed. miRNA sequencing data from this experiment have been deposited in GEO at NCBI and are accessible through accession number GSE268516.

Public external datasets for in-silico mixture construction and validation

  • Dataset 1 was obtained from the RNAatlas project44. This corresponds to miRNA-seq profiles of HMEC, RPTEC and BEC, which were used for in-silico mixture sample generation.

  • Dataset 2 was obtained from the microRNAome project45. This corresponds to miRNA-seq profiles of AEC, CF and PSMC, which were also used for in-silico mixture sample generation, together with 4 HMEC samples and whole blood samples from 36 individuals, which were used for purity estimation.

  • Dataset 3 was obtained from Florence et al. (GEO Accession: GSE28489)83. This dataset includes miRNA expression profiling for six sorted immune cells (CD4+ T cell, CD8+ T cell, B cell, NK cell, monocyte, neutrophil) based on microarray.

  • Dataset 4 was obtained from the TCGA acute myeloid leukemia (AML) cohort and 16 other solid cancer cohorts, including BLCA, BRCA, CESC, COAD, ESCA, HNSC, KIRC, KIRP, LIHC, LUAD, LUSC, PAAD, PRAD, READ, THCA and UCEC9. For the AML cohort, miRNA expression profiling data from whole blood samples were obtained. For the solid cancer cohorts, miRNA expression profiling data from both normal and tumor samples were used.

  • Dataset 5 was obtained from Maider et al. (GEO Accession: GSE89042)47. This dataset includes miRNA microarray data profiled on whole blood from 38 healthy individuals aged 24 to 79 years.

  • Dataset 6 was obtained from Togami et al. (GEO Accession: GSE182152)50. This corresponds to miRNA profiling data from whole blood of 62 coronavirus disease 2019 (COVID-19) patients and healthy controls, profiled on two different sequencing platforms.

  • Dataset 7 was obtained from Zheng et al. (GEO Accession: GSE157859)51. This corresponds to miRNA profiling data from peripheral blood mononuclear cells (PBMC) of 18 COVID-19 patients with different clinical manifestations at different clinical stages.

  • Dataset 8 was obtained from Pavel et al. (GEO Accession: GSE93284)57. This corresponds to miRNA profiling data from a cohort of 347 bronchial airway samples with different status of smoking exposure.

  • Dataset 9 was obtained from Yu et al. (GEO Accession: GSE110907)58. This corresponds to miRNA profiles from a cohort of 48 pairs of lung adenocarcinoma and non-malignant lung tissue.

For the TCGA AML cohort of Dataset 4, we also obtained the genome-wide RNA-seq and DNA methylome 450K data for the matched samples. For the other cancer types from TCGA, as well as Datasets 6 and 7, the RNA-seq data for matched samples were also obtained. Detailed descriptions, data processing and reference construction procedures for all of the above datasets can be found in Supplementary Methods.

Data quality control and cell type adjustment for biomarker identification

To determine the associations between inter-sample variability and cell-type heterogeneity in each dataset, the top components of data variation were obtained by singular value decomposition (SVD). The number of significant singular vectors was determined by Random Matrix Theory (RMT). The significant components of variation were then correlated with phenotypic and technical factors, as well as with inferred cell-type fractions, to assess the relative contributions of these variables to data variability, and the results were represented as a P-value heatmap of components against factors. To identify miRNAs that are truly associated with biological factors, such as aging or smoking, P-values of association were determined by multiple linear regression in each dataset, with the estimated cell-type fractions included as covariates for adjustment.
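The effect of including cell-type fractions as covariates can be illustrated with a small simulated example in Python (NumPy). This is a sketch under stated assumptions rather than the analysis pipeline used here: the data are simulated so that a miRNA depends only on a cell-type fraction that happens to covary with age, the helper `fit_ols` is hypothetical, and only regression coefficients are compared (the P-values reported in the study would additionally require the coefficient standard errors).

```python
import numpy as np


def fit_ols(y, covariates):
    """Return OLS coefficients for y ~ intercept + covariates."""
    X = np.column_stack([np.ones(len(y))] + list(covariates))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


rng = np.random.default_rng(0)
n_samples = 200

# Simulated cell-type fraction that shifts with age (the confounder).
cell_fraction = rng.uniform(0.2, 0.8, n_samples)
age = 50 + 40 * (cell_fraction - 0.5) + rng.normal(0, 3, n_samples)

# Simulated miRNA expression driven by the cell fraction only, not by age.
expression = 5 + 4 * cell_fraction + rng.normal(0, 0.2, n_samples)

# Unadjusted model: expression ~ age.
unadjusted_age_coef = fit_ols(expression, [age])[1]
# Adjusted model: expression ~ age + cell_fraction.
adjusted_age_coef = fit_ols(expression, [age, cell_fraction])[1]

print(unadjusted_age_coef, adjusted_age_coef)
```

In this simulation the unadjusted age coefficient is clearly non-zero, whereas the adjusted coefficient shrinks towards zero, which is the kind of composition-driven association the adjustment is intended to remove.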

miRNA target prediction and functional enrichment analysis

For each immune cell-specific miRNA set, we collected experimentally validated targets from miRTarBase84. We then performed Gene Ontology enrichment analysis of the miRNA targets using the clusterProfiler package to facilitate functional annotation of the cell-type-dependent miRNAs85. Enriched GO biological process terms with an FDR-adjusted P-value < 0.05 were selected.
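The logic of this kind of over-representation analysis can be sketched in Python as a hypergeometric test followed by Benjamini-Hochberg adjustment. The sketch assumes that "FDR-adjusted" refers to the Benjamini-Hochberg procedure; it does not call clusterProfiler, and the term names, gene counts and function names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(overlap, n_targets, n_term, n_background):
    """P(X >= overlap) when drawing n_targets genes from the background,
    of which n_term are annotated to the GO term."""
    upper = min(n_targets, n_term)
    total = comb(n_background, n_targets)
    tail = sum(
        comb(n_term, i) * comb(n_background - n_term, n_targets - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical counts: term -> (targets annotated to term, term size).
n_background, n_targets = 2000, 100
terms = {"term_A": (15, 60), "term_B": (4, 80), "term_C": (9, 50)}

pvalues = [hypergeom_pvalue(k, n_targets, size, n_background)
           for k, size in terms.values()]
adjusted = benjamini_hochberg(pvalues)
enriched = [name for name, q in zip(terms, adjusted) if q < 0.05]
print(enriched)
```

Terms retained in `enriched` correspond to those passing the adjusted P-value < 0.05 threshold.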

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.