Main

Over the past two decades, the use of functional genomics in large-scale, many-sample studies has been instrumental in advancing our understanding of how clinical, genetic and environmental properties manifest at the cellular and molecular levels1,2. These studies now benefit from a potentially transformative increase in quality and resolution, thanks to the maturation of large-scale single-cell genomics, which provides access to detailed information about the cellular and molecular composition of hundreds of samples3,4,5,6,7,8,9. Realizing the potential of large-scale single-cell genomics, however, requires rethinking the analysis strategy. Early on, most studies relied on small numbers of samples and focused on variation between cells; however, the emergence of large-scale single-cell genomics now opens the way for a more in-depth understanding of variation between samples.

There are at least two fundamental tasks in sample-level analysis. The first, which we refer to as exploratory analysis, is to divide the samples into groups on the basis of their cellular and molecular properties. The idea of de novo grouping has seen powerful applications in clinical studies that use functional genomics to enable more precise prognoses and treatment planning10,11. As a prominent example, pan-cancer analysis with functional genomics has revealed that, in many cases, people with cancer are more effectively classified using their molecular data, rather than histopathology12. The second task is to conduct comparative analysis, that is to identify cellular and molecular features that differ between predefined groups of samples (for example, cases versus controls). In bulk-level studies, differential expression (DE) has typically been used to detect gene expression programs that are associated with conditions of interest13. The advent of single-cell genomics also popularized differential abundance (DA) as another form of comparison to discover cell states that are disproportionately abundant in a given group of samples14.

Current approaches for these two closely related problems suffer from limitations that preclude them from taking full advantage of the resolution afforded by single-cell genomics. Starting with exploratory analysis, a common approach for quantifying the distances between samples is to first organize the cells into groups (representing types or states) and then evaluate the differences in the frequency of each group5,7,15,16,17. This approach, however, might oversimplify the task by discarding much of the rich information available about each sample. Furthermore, it hinges on the effective clustering of the cells (to represent distinct cell states), which is often complicated by the need for high resolution to distinguish between closely related states and harmonize samples or datasets. Finally, this approach can miss critical effects that might manifest in only particular subsets of cells (which we later demonstrate, using cohorts of people with inflammatory bowel disease (IBD) or COVID-19). Similar issues also emerge in comparative analyses. Most current applications of DE and DA rely on a priori clustering of cells. It is possible, however, that DE programs span a few, or only parts, of the a-priori-defined cell subsets and thus are less likely to be detected. Similarly, differentially abundant subpopulations might not clearly correspond to any annotated subset, again limiting the ability to detect them. Finally, even with access to high-quality cell annotation, comparative analyses of different partitions (for example, comparisons between sexes or age groups) might be best reflected by different cell-clustering schemes18.

To mitigate these issues, a recent line of work focused on quantifying DE or DA without relying on predefined cell clusters19,20,21. These methods typically embed cells into a low-dimensional space and then consider small neighborhoods in that space to identify ‘local’ DE or DA effects. A caveat of this approach, however, is that it does not account for the uncertainty in the embeddings (for example, as inferred with variational autoencoders (VAEs)22), which can be substantial23. Another line of work uses VAEs to learn the effect of sample covariates on the latent embedding of cells24,25,26. The primary limitations of this approach are that it assumes the evaluated effects are constant, meaning they are identical for all cells irrespective of their state, and that it does not account for the uncertainty in estimating these effects.

To address these challenges, we introduce MrVI, a probabilistic framework for large-scale (multi-sample) single-cell genomics. For exploratory analysis, MrVI identifies sample groups without requiring a priori cell clustering. Instead, it allows for different sample groupings to be conferred by different cell subsets that are detected automatically. For comparative analysis, MrVI enables both DE and DA in an annotation-free manner and at high resolution, while accounting for uncertainty and controlling for undesired covariates, such as experimental batch. The notion at the basis of MrVI is that of counterfactual analysis, which aims to infer what the gene expression profile of a cell would have been had it come from a certain sample. This approach provides a principled methodology for estimating the effects of sample-level covariates on gene expression at the level of an individual cell. It relies on a hierarchical deep generative model architecture, powered by modern techniques in deep learning, such as cross-attention, to model the effects of sample covariates while providing state-of-the-art performance in sample-integration quality. On the software side, MrVI leverages optimization procedures included in scvi-tools27, allowing it to scale to multi-sample studies with millions of cells.

In this work, we demonstrate that MrVI compares favorably to common approaches for integration, exploratory and comparative analyses and then showcase its utility in several multi-sample studies. In a PBMC dataset from a COVID-19 study, MrVI identifies a monocyte-specific response to the disease that more naive approaches cannot directly identify. In a dataset of drug-perturbation screens, MrVI reveals both expected and non-trivial relationships between compounds. Finally, using MrVI to study a cohort of people with IBD, we find a previously unappreciated subset of pericytes with strong transcriptional changes in people with stenosis.

Results

Multi-resolution variational inference

MrVI is a hierarchical Bayesian model for integrative, exploratory and comparative analysis of single-cell RNA-sequencing data from multiple samples (for example, corresponding to human participants) or experimental conditions (for example, perturbations in a screen; Fig. 1a). The model utilizes two levels of hierarchy to distinguish between two types of sample-level covariates. The first covariate type captures properties in either exploratory or comparative settings—we refer to these as target covariates.

Fig. 1: Overview of MrVI.

a, We consider multi-batch, multi-sample experimental designs. In the canonical case, we gather single-cell measurements from several samples, which are collected across several batches. In this case, the relevant nuisance covariate is the batch. b, Left, MrVI model illustration. Right, graphical model plate diagram. MrVI relies on two cell representations, u and z. A sample-unaware cell representation (u) captures shared type information (colored by cell type in the diagram). From this quantity and the sample-of-origin of the cell, we construct a sample-aware representation (z) of the cell. Last, we model gene expression as a function of this latent variable and of observed nuisance factors. Each point in the diagram corresponds to an individual cell. c,d, Use cases of MrVI for exploratory and comparative analyses. c, For exploratory analysis, MrVI computes local sample stratifications. MrVI can compute counterfactual representations, characterizing what would have been the representation of a cell had it originated from a different sample. By computing the distances between counterfactual representations of all samples, MrVI can identify sample-level effects on cell states. d, For comparative analysis, MrVI quantifies differences in abundance across cell states (top right), and identifies sample metadata effects on gene expressions (bottom). Both the sample stratification and differential expression procedures use counterfactual z representations to compare local sample effects. The differential abundance procedure involves an approximation of the posterior density for each sample in the u latent space.

Typically, an identifier for each sample (for example, human donor ID or experimental perturbation) is a natural choice for the target covariate to be provided as input to MrVI, because it is entirely nested in other sample-level target attributes (for example, treatment type), thus enabling their analysis. The second covariate type is considered ‘nuisance,’ and typically corresponds to technical factors (sample processing site, library-preparation technology or the study of origin in cross-studies).

In MrVI, each cell (n) is associated with two low-dimensional latent variables, \({u}_{n}\) and \({z}_{n}\) (Fig. 1b): \({u}_{n}\) is designed to capture the variation between cell states while being disentangled from sample covariates; \({z}_{n}\) reflects the variation between cell states, in addition to the variation induced by target covariates, while remaining unaffected by the nuisance covariates. Finally, we model the observed gene expression (\({x}_{n}\)) as samples from negative binomial distributions whose parameters are predicted by decoding \({z}_{n}\) conditioned on nuisance covariates.

MrVI employs a mixture of Gaussians as a prior for \({u}_{n}\) instead of a unimodal Gaussian. We demonstrate that this more versatile prior provides state-of-the-art performance in the integration of large datasets and in facilitating annotations of cell types and states. \({z}_{n}\) is learned as a function of the respective \({u}_{n}\) and the sample ID, \({s}_{n}\) (Methods). We used neural networks for all mapping functions in the model. The parameters characterizing these functions are learned through maximization of the evidence lower bound (Methods)28.
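Taken together, the description above can be summarized schematically as the following generative process. This is only a sketch: the symbols \(K\), \({\pi }_{k}\), \({\mu }_{k}\), \({\Sigma }_{k}\), \({p}_{\theta }\) and \({g}_{\theta }\) are placeholders introduced here for illustration, \({b}_{n}\) denotes the nuisance covariate of cell n, and the exact parameterization of each distribution is the one given in the Methods.

\[
\begin{aligned}
{u}_{n} &\sim \sum_{k=1}^{K}{\pi }_{k}\,{\mathcal{N}}({\mu }_{k},{\Sigma }_{k}) && \text{(mixture-of-Gaussians prior)}\\
{z}_{n}\mid {u}_{n},{s}_{n} &\sim {p}_{\theta }({z}_{n}| {u}_{n},{s}_{n}) && \text{(sample-aware representation)}\\
{x}_{n}\mid {z}_{n},{b}_{n} &\sim {\rm{NegativeBinomial}}\left({g}_{\theta }({z}_{n},{b}_{n})\right) && \text{(observed counts)}
\end{aligned}
\]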

The trained model performs two types of analyses at single-cell resolution: exploratory (de novo grouping of samples) and comparative (evaluating the effects of target covariates). For exploratory analysis, MrVI computes a sample-by-sample distance matrix, or sample distance matrix in short, for each cell n by evaluating how the sample of origin (\({s}_{n}\)) affects the representation of this cell in z space (Fig. 1c). To this end, for each cell n, we compute \(p({z}_{n}| {u}_{n},{s}^{{\prime} })\), its hypothetical state had it originated from sample \({s}^{{\prime} }\ne {s}_{n}\). We then define the distance between each pair of samples on cell n as the Euclidean distance between their respective hypothetical states. Then, hierarchical clustering can be used over the sample distance matrices for each cell to highlight the target covariates most likely to explain the major axes of sample-level variation. This analysis helps capture, in an annotation-free manner, cellular populations that are influenced distinctly by target covariates (for example, disease or tissue of origin).
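The distance computation described above can be illustrated with a minimal Python sketch. It assumes that a point estimate of each counterfactual state is already available for a cell; the function names, the dictionary layout and the toy coordinates are hypothetical and are not part of MrVI or scvi-tools.

```python
import math


def sample_distance_matrix(counterfactual_z):
    """Pairwise Euclidean distances between the counterfactual states of ONE cell.

    `counterfactual_z` maps a sample ID s' to the latent vector z that the cell
    would hypothetically have had if it had originated from sample s'.
    """
    samples = sorted(counterfactual_z)
    distances = {
        (a, b): math.dist(counterfactual_z[a], counterfactual_z[b])
        for a in samples
        for b in samples
    }
    return samples, distances


def average_distance_matrices(matrices):
    """Element-wise mean of several per-cell distance matrices (same keys)."""
    keys = matrices[0].keys()
    return {key: sum(m[key] for m in matrices) / len(matrices) for key in keys}


# Toy counterfactual states (made up for illustration): two cells, three
# samples, two latent dimensions.
cell_1 = {"s1": (0.0, 0.0), "s2": (3.0, 4.0), "s3": (0.0, 1.0)}
cell_2 = {"s1": (1.0, 1.0), "s2": (1.0, 1.0), "s3": (1.0, 3.0)}

samples, d1 = sample_distance_matrix(cell_1)
_, d2 = sample_distance_matrix(cell_2)

# Averaging over a group of cells gives one matrix per cell population.
mean_d = average_distance_matrices([d1, d2])
```

Each cell yields its own symmetric matrix with a zero diagonal; averaging over cells, as in the last line, gives one matrix per cell population, which can then be passed to a hierarchical clustering routine.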

In comparative analysis, MrVI identifies both DE and DA at single-cell resolution (Fig. 1d) using counterfactuals. Consider the case of differential expression between two sets of samples (S1, S2). To evaluate the group-level effects in cell n, we evaluate the extent to which the expectation of \(p({z}_{n}| {u}_{n},{s}^{{\prime} })\) depends on whether \({s}^{{\prime} }\) is in S1 or S2 using a linear model. We then use the decoder network (that is, mapping from z to x) to detect which genes are affected and evaluate their effect size (fold change). Contrary to traditional DE methods, MrVI does not require pre-existing grouping of the data upstream of model fitting. Meanwhile, for local differential abundance, we estimate the posteriors (\(p({u}_{n}| {s}^{{\prime} })\)) and compare the aggregate values of samples \({s}^{{\prime} }\) in S1 versus S2. An in-depth description of MrVI and its post-training analysis procedures is provided in the Methods.
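As a rough illustration of the latent-space step for DE, consider the simplest case of a single binary covariate (membership in S1 or S2) with an intercept. In that case the least-squares slope is exactly the difference between the two group means, so it can be written without any matrix algebra. The sketch below makes this simplifying assumption; the function names and toy values are hypothetical, and the model actually used (including additional covariates and the decoding step that yields fold changes) is the one described in the Methods.

```python
def group_effect(expected_z, group_1, group_2):
    """Slope of the linear model z ~ intercept + 1[s' in group_2] for one cell.

    `expected_z` maps each sample ID s' to the expected counterfactual state
    of the cell under s'. With one binary covariate, the least-squares slope
    in every latent dimension equals (mean over group_2) - (mean over group_1).
    """

    def mean_vector(sample_ids):
        vectors = [expected_z[s] for s in sample_ids]
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    mean_1 = mean_vector(group_1)
    mean_2 = mean_vector(group_2)
    return [b - a for a, b in zip(mean_1, mean_2)]


def squared_norm(vector):
    """Scalar summary of the overall effect of the covariate on this cell."""
    return sum(v * v for v in vector)


# Toy expected counterfactual states for one cell (made up for illustration).
expected_z = {
    "s1": (0.0, 1.0),
    "s2": (0.0, 3.0),
    "s3": (2.0, 2.0),
    "s4": (4.0, 2.0),
}

beta = group_effect(expected_z, group_1=["s1", "s2"], group_2=["s3", "s4"])
effect_size = squared_norm(beta)
```

A cell whose counterfactual states do not depend on group membership yields a slope close to zero, which is how cells without DE effects would be distinguished from affected ones in this simplified picture.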

Retrieving known sample effects on a semi-synthetic dataset

We used a semi-synthetic dataset to evaluate how accurately MrVI captures differences between samples (through exploratory and comparative analysis) when different cell subsets are influenced by different sample-level effects. We started from a published dataset of 68,000 peripheral blood mononuclear cells (PBMCs)29 profiled with 10x, consisting of 3,000 highly variable genes and five main cell clusters, which we refer to as subsets A–E, and assigned each cell in this dataset to 1 of 32 synthetic study subjects. These study subjects are characterized by two distinct sample-level covariates. Our strategy for assigning cells to the simulated subjects varied between the cell subsets to simulate different covariate effects. For subset A, the assignment of cells resulted in DE across categories of covariate 1, reflecting a hierarchical grouping of the samples. For subsets B and C, our cell assignment reflected DA across categories of covariate 2 (Fig. 2a). Cells in the remaining subsets were randomly assigned to samples and hence did not contain any DE or DA effects (Methods).

Fig. 2: Semi-synthetic experiment.

a, Experimental design. We created a semi-synthetic dataset with 5 subsets of cells and 32 study subjects (subj.), containing sample-specific differential-expression (exp.) and differential-abundance effects. In cell subset A, cells have differences in gene expression, on the basis of the value of a study subject covariate, covariate 1. These DE effects stratify synthetic samples according to a known hierarchy. In subsets B and C, cells have differences in abundance between study subjects, on the basis of the value of a second study subject covariate, covariate 2 (four categories (cat.) in total). Cells in subset B or C were over- or under-sampled, respectively, according to fixed rates in each category, such that the sum of cells from B and C remained constant. Stars indicate categories with strong resampling. There are no DE or DA effects across study subjects in other cell subsets. b, Minimum distortion embeddings (MDEs) of MrVI u and z latent spaces, colored by cell subset assignments and covariate 1 categories. c, MrVI’s distance matrices aggregated over cell subsets A and B. d,e, DA analysis using MrVI and Milo for the comparison of synthetic samples based on covariate 2 categories (categories with strong DA in cell subsets B and C versus rest (see a)). d, A u latent space MDE, colored by log density ratios comparing the subset population (star) with the remaining (rest) population (Equation (4)). Enr., enriched. e, Precision–recall curves with areas under the curve (AUCs, higher is better) for identifying DA cells. Cells in subset B or C are true positives; other cells are true negatives. We used the absolute value of the log density ratio for MrVI and the absolute value of the LFC produced by Milo as scores to estimate precision–recall. f,g, DE analysis using MrVI and miloDE comparing group 1 of the simulation (4 blue figures) against all other samples. 
We inferred which cells showed DE effects for the comparison of synthetic samples based on their covariate 1 assignment (blue versus rest). f, u latent space MDE colored by the squared norm of \({\beta }_{n}\), appearing in Equation (3), which quantifies the overall sample covariate effect on gene expression. g, Comparison of miloDE and MrVI LFCs versus DESeq2 reference, reporting Pearson’s r for each method.

We applied MrVI using the simulated subject identifiers as the modeled target covariate (\({s}_{n}\)) and leaving the nuisance covariate (\({b}_{n}\)) empty. The resulting u space clearly reflected the differences between the cell subsets (Fig. 2b). In the z space, we observed distinct subject-specific effects in cells of subset A, whereas cells in the remaining clusters were mixed, aligning with the expectations that subset A alone contained DE effects. For exploratory analysis, we used the mapping from u to z to estimate sample distances for each cell (Fig. 2c). In cell subset A, the sample distance matrix (averaged over cells) produced a hierarchical structure similar to the simulated (ground truth) dendrogram. As expected, MrVI estimated much smaller distances between samples when considering the other cell subsets, with no discernible structure. We compared this result with the standard approach for stratifying subjects using clustering obtained either from PCA or scVI (Methods). The resulting compositional analyses were less effective in capturing sample stratification in subset A (Supplementary Fig. 1a,c) and introduced non-negligible distances in subsets in which no differences were expected (Supplementary Fig. 1b).

For DA analysis, we partitioned the subjects into two groups, according to covariate 2 (presence or absence of a star in Fig. 2a). We used the estimated posteriors (\(p(u| s)\)) around each cell to evaluate the extent to which its state was over-represented in one group of study subjects versus another. The resulting log ratios accurately reflected the DA effects that were simulated in cell subsets B and C (Fig. 2d). Furthermore, the inferred ratios significantly diverged from zero in only subsets B and C (Supplementary Fig. 1d). We compared MrVI with Milo19, a popular framework for DA, and found that MrVI more accurately identified DA effects and associated them with the correct cell subsets (Fig. 2e and Methods).
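The log-ratio idea can be sketched in a few lines of Python. Note the substitution: instead of the model's approximate posterior for each sample, this illustration uses a plain Gaussian kernel density over each sample's cell embeddings, and aggregates a group by averaging the per-sample densities. The function names, the bandwidth, the small constant `eps` and the toy coordinates are all assumptions made for the example.

```python
import math


def kernel_density(point, cells, bandwidth=1.0):
    """Average isotropic Gaussian kernel centred on each cell of one sample.

    The normalising constant is omitted because it cancels in the ratio
    (same bandwidth and dimension for every sample).
    """
    total = 0.0
    for cell in cells:
        squared_distance = sum((p - c) ** 2 for p, c in zip(point, cell))
        total += math.exp(-squared_distance / (2.0 * bandwidth**2))
    return total / len(cells)


def log_density_ratio(point, cells_by_sample, group_1, group_2, eps=1e-12):
    """log(mean density over group_1 samples / mean density over group_2 samples)."""

    def group_density(sample_ids):
        densities = [kernel_density(point, cells_by_sample[s]) for s in sample_ids]
        return sum(densities) / len(densities)

    return math.log((group_density(group_1) + eps) / (group_density(group_2) + eps))


# Toy u-space embeddings (made up): group 1 samples sit mostly near (0, 0),
# group 2 samples mostly near (5, 5).
cells_by_sample = {
    "a": [(0.0, 0.0), (0.5, 0.0)],
    "b": [(0.0, 0.5), (5.0, 5.0)],
    "c": [(5.0, 5.0), (4.5, 5.0)],
    "d": [(5.0, 4.5), (0.0, 0.0)],
}
group_1, group_2 = ["a", "b"], ["c", "d"]

ratio_at_origin = log_density_ratio((0.0, 0.0), cells_by_sample, group_1, group_2)
ratio_at_far = log_density_ratio((5.0, 5.0), cells_by_sample, group_1, group_2)
```

A positive value marks a state enriched in the first group and a negative value a state enriched in the second. Averaging densities per sample before comparing groups keeps a single large sample from dominating its group.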

For DE analysis, we compared the subjects in one category of covariate 1 (blue in Fig. 2a) with all other subjects. In this comparison, only cell subset A was expected to contain DE effects. We used the estimated posteriors (\(p(z| u,s)\)) around each cell to evaluate the extent to which its gene expression profile depends on its sample-of-origin category, using the linear model (in latent space) to obtain effect sizes (Methods).

These quantities reached much higher values in subset A cells than in cells in other subsets (Fig. 2f), indicating that MrVI captured the particular groups of cells exhibiting DE effects. Next, we evaluated each gene’s effect size (log fold change; LFC) in each cell belonging to subset A using the MrVI model. We compared these values with those obtained through pseudo-bulk DE analysis of subset A (representing an annotation-dependent analysis in the ‘perfect’ scenario, in which the annotations completely align with the DE signal). The results from these strategies were highly correlated, with a substantial improvement over miloDE—a recent cluster-free method for DE analysis (Fig. 2g).

These results demonstrate that MrVI can identify different sample groupings for different cell subsets without requiring an a priori annotation of cell states. Similarly, it can accurately retrieve shifts in cell-state composition (DA) and gene expression (DE) and identify the respective cellular populations.

Highlighting variation in myeloid responses of people with COVID-19

We next used MrVI to analyze 419,000 PBMCs obtained from a cohort of people with COVID-19 and healthy controls7. We used the sample identifier, corresponding to unique study participants, as our modeled target covariate (s). As anticipated, the resulting u space is not affected by the sample of origin, instead showing marked mixing between study participants (Fig. 3a). At the same time, the u space clearly stratified the cells into immune subsets in a manner consistent with their annotation in the original study. Considering a standard evaluation of integration performance30, we found that the MrVI u space embedding outperformed PCA and scVI in terms of mixing the samples while retaining their biological signal30 (Supplementary Fig. 2). However, a useful model should also capture the effects of viral infection. The two-level structure of MrVI allowed us to derive both a representation that is cell-type centric (u) and one that is affected by the respective sample (z). Indeed, the z space showed clear sample-specific variation, separating COVID-19-positive individuals from the control population inside each cell type (Fig. 3b).

Fig. 3: Analysis of a COVID-19 cohort with MrVI.

a,b, MDEs of u and z latent spaces in MrVI, computed on the full dataset and colored by the original cell-type annotations and COVID-19 status. pDC, plasmacytoid DC. c, Sankey plot mapping cell-type annotations to clusters obtained by clustering cell-specific distance matrices using the Leiden algorithm. This clustering identified three cell subpopulations, A, B and C. Cluster A contained monocytes and DCs, cluster B contained T cells and NK cells and cluster C contained B cells. Cell-type or cluster pairs with less than 1% of the total cells are not shown. d, Sample distance matrices averaged over cells from two of the three subpopulations in c. For each matrix, we computed the associated affinity dendrogram between samples obtained through hierarchical clustering and colored each row (sample) according to participant age, DSS, infection status and the most severe stage of disease that a participant has experienced. e, Differential abundance analysis using MrVI log density ratios for the myeloid cells identified as cluster A in c. Left, comparison of COVID-19-positive individuals with healthy controls. Right, comparison between COVID-19-positive individuals with high or low DSS. f, Differential expression analysis using MrVI between COVID-19-positive individuals with high or low DSS. MrVI identified three DE modules of genes. Each plot shows the activity of the module in the u latent space. Displayed are the LFCs averaged over all genes in the module. In these figures, the individuals with low DSS and those with high DSS, respectively, correspond to donor clusters 1 and 3 in d. Upreg., upregulated; expr., expression.

We used MrVI to answer two fundamental questions: how do samples in this cohort stratify into groups? Do they stratify differently when considering different immune populations? To address these topics, we used counterfactual embeddings to estimate a sample distance matrix for each cell (n). We then clustered the cells according to these values, thus detecting groups of cells inducing similar sample stratifications (Fig. 3c). This analysis produced three groups of cells, one containing T cells and natural killer (NK) cells, another consisting primarily of monocytes along with a smaller population of dendritic cells (DCs), and a third containing B cells.

The resulting distance matrices (averaged across all cells in each respective group) separated groups of people with disease and controls, indicating that MrVI can identify clinically relevant groups (Fig. 3d and Supplementary Fig. 3). However, the distance matrix conferred by monocytes and DCs highlighted an additional stratification of the study participants. In this cluster, the COVID-19 group was further stratified into two groups, corresponding to groups 1 and 3 in Fig. 3d. Group 1 was enriched in individuals for whom the number of days since first symptoms (DSS) was low, whereas individuals in group 3 showed longer duration of symptoms (Extended Data Fig. 1; Mann–Whitney U test, P < 0.05). The association between monocyte activity and the time elapsed since infection has been established31. MrVI identified this association without prior knowledge of the DSS or any other information about the participants.

To interpret this data-driven stratification of people with COVID-19 and its association with monocytes, we performed two analyses. We first performed a DA analysis of the myeloid population, comparing those in the COVID-19 and control groups. We found a marked decrease in non-classical CD16+ monocytes and DCs (Fig. 3e and Supplementary Fig. 4a) in people with the disease. The comparison of the two COVID-19 groups similarly showed a shift toward non-classical monocytes in the group with higher DSS (Fig. 3e and Supplementary Fig. 4b). These results are consistent with independent studies31, which reported that CD14+ monocytes are highly pro-inflammatory and contribute to the cytokine release in early COVID-19, thereby contributing to symptoms.

Next, we applied our DE analysis to compare the two COVID-19 groups. Using MrVI counterfactuals, we estimated the respective LFC for each gene in every myeloid cell. We then clustered the genes on the basis of their estimated LFC profiles (Methods). This analysis uncovered three modules, each containing genes with a similar DE pattern, implicating different subsets of myeloid cells (Fig. 3f and Supplementary Fig. 5). The first module, upregulated in the group with higher DSS, was enriched in genes identified in myeloid cells of healthy individuals (compared with those with the disease), again supporting the notion of a return to baseline with long-standing infection. Specifically, we see a lower CSF3R expression in the recently infected individuals, aligning with less-mature monocytes that are released earlier from bone marrow during infection32. Similarly, we found that, early in the infection, the number of MHC-II-expressing monocytes declines but later returns to normal levels31. This accounts for the observed elevation in LGALS2 and HLA-DR2, both linked to MHC-II. The second module, over-expressed in individuals with lower DSS, is enriched in interferon-related genes. This module includes GBP1 and IFITM3, interferon-response genes, and IFI27, reported as an early predictor of COVID-19 severity33. These results agree with strong interferon signaling during early infection, especially in myeloid cells31. The third module, over-expressed by the higher DSS group, contained TNF and NFKBIZ. It has been demonstrated that tumor necrosis factor (TNF) release is reduced during acute COVID-19, whereas NFKBIZ expression is drastically reduced during acute infection34. Our analysis suggests that both molecules are markers of acute infection more so than mortality.

Grouping molecular effects on expression from perturbations

To demonstrate the flexibility of the target covariate used by MrVI, we analyzed a chemical perturbation screen with a single-cell RNA-seq readout generated with the sci-Plex assay6. The sci-RNA-seq3 dataset includes three cell lines, 188 small-molecule drugs and vehicle controls. Each small molecule was delivered at four doses, and the entire study was conducted with two biological replicates. In this assay, each cell receives a single perturbation (or negative control vehicle) that can be identified in addition to its transcriptome. MrVI can serve several fundamental analyses for this type of study, namely integrating all replicates into a shared embedding, stratifying the screened compounds into groups with similar effects, mapping these effects at the gene-expression level and mapping cell-state composition.

To achieve this, we used the concatenation of the drug name and the dose level as the target covariate modeled by MrVI, resulting in 752 ‘samples’ per cell line. Because the study was conducted with 96-well plates, with each plate containing one of two biological replicates, we chose the plate identifier as our nuisance covariate. As found in the original study, many drug–dose combinations had minimal effect on transcription across the assayed cell lines6. Consequently, we applied a simple filter, retaining only drugs that had a minimum number of DE genes for at least one concentration and in at least one cell line (Methods). This resulted in 368 perturbation samples (92 drugs with four concentrations each) that we used to train MrVI (Methods). Here, we focus on the epithelial A549 lung adenocarcinoma line and provide the results from the other two cell lines in the Supplementary Information.
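The construction of the target covariate and the drug filter can be pictured with the following Python sketch. Everything named here is a placeholder: the drug and dose labels, the label separator, the threshold `min_de_genes`, the DE-gene counts and the name `other_line` are invented for illustration, and the actual filtering criterion is the one given in the Methods.

```python
from itertools import product


def build_target_labels(drugs, doses):
    """One target-covariate label per drug-dose combination."""
    return [f"{drug}_{dose}" for drug, dose in product(drugs, doses)]


def keep_drug(de_gene_counts, drug, min_de_genes):
    """True if `drug` reaches the threshold at any dose in any cell line.

    `de_gene_counts` maps (cell_line, drug, dose) to a number of DE genes.
    """
    return any(
        count >= min_de_genes
        for (_cell_line, d, _dose), count in de_gene_counts.items()
        if d == drug
    )


# Placeholder names: 188 drugs, each delivered at four doses.
drugs = [f"drug_{i:03d}" for i in range(188)]
doses = ["dose_1", "dose_2", "dose_3", "dose_4"]
all_labels = build_target_labels(drugs, doses)  # 188 x 4 labels per cell line

# Toy DE-gene counts and a hypothetical threshold.
de_gene_counts = {
    ("A549", "drug_000", "dose_4"): 120,
    ("A549", "drug_001", "dose_4"): 3,
    ("other_line", "drug_001", "dose_1"): 0,
}
kept_drugs = [
    d for d in ["drug_000", "drug_001"] if keep_drug(de_gene_counts, d, min_de_genes=50)
]
kept_labels = build_target_labels(kept_drugs, doses)
```

Filtering at the level of drugs rather than individual drug-dose pairs keeps all four doses of a retained drug, which is consistent with the 92 x 4 = 368 perturbation samples reported above.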

The resulting u space (Fig. 4a) appears as a single cluster with no apparent sub-clusters specific to a given class of drugs, indicating a successful integration that reflects the drug-independent states of the cells. We rather observed that positioning in the u space carries information about the cell cycle. With respect to the perturbation-affected (z) space (Fig. 4b), we observed several sub-clusters of cells originating from distinct classes of drugs. In particular, populations of cells treated with HDAC inhibitors (expected to target epigenetic regulation) and trametinib (to block MEK-mediated tyrosine kinase signaling) formed clear clusters, highlighting their distinct drug-induced shifts in gene expression. The distinction of HDAC inhibitors is in agreement with the original study, in which the authors additionally identified acetyl-CoA deprivation as a common mechanism for this drug class, captured by drug-induced shifts in gene expression. The response to trametinib is also expected to greatly impact the Ras-driven A549 cells because the drug inhibits the downstream MEK pathway6.

Fig. 4: sci-Plex experiment for the A549 cell line.

MrVI was fit over 92 drugs each at four doses that passed our simple DE-gene filter. a,b, MDEs of the u and z latent spaces, colored by the pathway of the drug used to treat each cell (left) and the cell-cycle stage of each cell (right). For the MDEs colored by pathway, only the top 20% of samples, on the basis of distance from the vehicle, are shown in full opacity. c, PCA of sample distance matrices. Left, scatterplot of all local sample distance matrices projected onto the top two principal components, colored by cell-cycle stage (displays no visual subclusters). Right, barplot of the proportion of variance explained against the number of principal components. d, Comparison of MrVI against the benchmark methods, on the basis of performance metrics assessing alignment with prior knowledge. Each bar represents the metric for one model fit, except for ‘Random,’ which reports the 95% confidence interval over 100 permutations of the inferred distance matrix from MrVI. Left, average percentile of intra-drug distances, measuring how much closer samples with the same drug and different doses are to each other relative to other samples (lower is better). Right, silhouette score of sample clusters with similarities inferred from DEG sets in the connectivity map dataset, assessing cluster consistency (higher is better). e, Hierarchically clustered sample distance matrix. Rows are annotated by the pathway, dose and cluster of each sample (clusters inferred from the distance matrix). f, Heatmap of gene set enrichment analysis (GSEA) scores for the Human MSigDB Hallmark gene set collection for DE genes identified for each cluster in e. Each tile’s upper-right and bottom-left triangles, respectively, represent scores for the set of upregulated (Upreg.) and downregulated (Downreg.) DE genes. For e and f, the analysis is performed over the top 20% of drug–dose combinations (74/368), on the basis of their distance from the vehicle (see Extended Data Fig. 2 for the full matrix and Supplementary Fig. 7 for the matrix here with drug–dose labels).

For a more quantitative comparison, we computed MrVI sample distance matrices. We found that the cells in this cell-line-based assay are homogeneous in terms of the distance matrices that they induce (Fig. 4c), with no evident subclusters in PC space. By contrast, other datasets analyzed here explored primary cells with diverse cell types, featuring distinct distance matrices (Figs. 3 and 5). Therefore, we operated on one sample distance matrix, averaging over all cells.

Fig. 5: Characterization of stenosis in Crohn’s disease.

a, Uniform manifold approximation and projection (UMAP) embedding of u latent space, colored by different sample-level covariates. From left to right: coarse labels, identified by us to stratify cell types for unguided analysis (circled is the subset analyzed in b–e); the tissue-collection method, which highlights a bias for stromal cells in surgical specimens; the inferred effect size of inflamed versus non-inflamed; and the inferred effect size of B2 disease behavior (highlighted are the small subpopulations of pericytes with the strongest effect; Supplementary Fig. 20). For these last two plots, this effect size corresponds to the squared norm of β from Equation (3); it characterizes the overall effect of the covariate on gene expression. b–e, Analysis of the cell population circled in a, using the same UMAP embeddings as in a. b, Cell-type labels provided in the original study. Displayed are fine annotations for stromal cells and coarse labels for the other cells. c, DA analysis of stromal cells. MrVI was used to compare the B2 disease phenotype and the B1 phenotype (red denotes higher in B2 disease behavior). Displayed is the log (density ratio) between B2 and B1 disease behavior. The number of cells of each type is reported in Supplementary Table 4. d, Raw expression of CDH11 inside all stromal cells. Values are library-size-normalized and log1p-transformed. Cells are sorted for display on the basis of higher expression. e, Inferred LFCs from MrVI for the comparison of B2 and B1 disease behavior (beh.) on the basis of multivariate analysis in MrVI, correcting for inflammation status, sex, tissue location and chemistry in the different cell types. The violin plots display the distribution of estimated LFCs per cell type. Hierarchical clustering was used to determine the order of cell types in the violin plots. f, UMAP of two genes, highlighting intra-cell-type DE variation. Cells are colored on the basis of the genes’ LFC for the comparison of B1 and B2 classifications.

To test whether the resulting distance matrix captures a priori known relationships between the samples, we formulated two performance metrics and compared MrVI with the two standard (composition-based) methods (Fig. 4d and Supplementary Information B.1). First, we used the transcriptomic-based Connectivity Map resource35, providing a measure of similarity between drugs, and compared these similarities with the distances we estimated using MrVI for the maximum tested dose (10,000 nM). Second, we evaluated the extent to which treatments with the same drug but at different concentrations tend to be more similar to each other than expected by chance. MrVI achieved better performance on both metrics—showing higher concordance with the Connectivity Map stratification, and lower distances between treatments with the same compound. These metrics were also used for fine-tuning the hyperparameters of the MrVI model (u and z dimensions). This strategy reflects real-world cases in which prior knowledge of the similarity between samples can be utilized for more effective modeling and downstream analysis (Supplementary Fig. 6).

We then analyzed a hierarchical clustering of the sample distance matrix (Fig. 4e and Supplementary Fig. 7). Drug-dose combinations that had little to no effect in the A549 context mostly clustered with the vehicle treatment (Extended Data Figure 2). In particular, a disproportionately high number of samples with low effect sizes also had low dosages, reflecting expected dose–response relationships. Of the remaining samples, those that were most distinct from the vehicle sample were organized into several clusters, each with a different effect on gene expression (Supplementary Figs. 8 and 9). Clusters 1, 3 and 4 consist mostly of HDAC inhibitors. These three clusters span a wide range of effect sizes that are correlated with the dosages. Cluster 1 contains the samples with the highest dosage levels and the largest effects, and cluster 3 consists of samples with lower dosages and weaker effects (Supplementary Fig. 8). These groupings highlight the ability of MrVI to uncover dose-dependent effects on gene expression that are apparent across multiple drugs in the HDAC inhibitor class. Clusters 2 and 8 corresponded to all doses of trametinib and YM155, respectively. In these cases, MrVI therefore suggests that the effects of these drugs were less dependent on the dose, at least when considering the range of concentrations tested.

MrVI also uncovered relationships between drugs, on the basis of their effects on transcription, that are supported by recent literature and the original sci-Plex study. For instance, cluster 8 includes rigosertib, which is labeled as a tyrosine kinase inhibitor but has been found to directly affect microtubule function36, as well as epothilone A and patupilone, two drugs that interfere with microtubule function. Moreover, MrVI revealed non-trivial similarities that were not captured in the original study. In cluster 5, two JAK2 inhibitors, fedratinib and TG101209, were grouped with JQ1, a drug labeled as a BRD inhibitor. Notably, recent work has shown that JQ1 inhibits the JAK–STAT signaling pathway in addition to being a BRD inhibitor, which supports the plausibility of this grouping37.

Finally, we investigated the clusters by performing Gene Set Enrichment Analysis (GSEA38) on the DE gene sets identified by MrVI (comparing each cluster of samples to the vehicle controls; Fig. 4f and Methods). As a reference, we used the hallmark collection of MSigDB39 that records sets of genes that contribute to major cellular processes. This analysis shed additional light on the effects of each cluster of drugs. Specifically, we found that clusters 1, 3 and 4 associate most strongly with downregulation in metabolic pathways, agreeing with the effect of HDAC inhibitors on carbon metabolism6. Furthermore, cluster 6 was enriched in the p53 pathway, consistent with the categorization of its respective drugs as cell-cycle regulators. Similarly, the effects of cluster 2 were enriched in genes downstream of KRAS signaling, in agreement with its categorization as targeting tyrosine kinases. We provide a heatmap of the LFCs for the top DE genes across all clusters in Supplementary Figure 10. On the basis of this analysis, we highlight that MrVI not only provides an interpretable grouping of clusters, but additionally helps highlight the genes underlying this grouping.

For the other two cell lines used in the sci-Plex experiment, the results of MrVI were consistent with known biology (Extended Data Figs. 3 and 4 and Supplementary Figs. 11–18)40,41.

Profiling stromal cell dynamics in Crohn’s disease stenosis

To provide another example of the applicability of MrVI to human cohorts, we utilized a recent study conducted in 46 people with Crohn’s disease and 25 controls across 463,000 cells using single-cell RNA-sequencing42. The dataset includes metadata describing the anatomical location of sampling (in terms of tissue, colon or ileum; in terms of the tissue layer, lamina propria or epithelial), the method of extraction (surgical or biopsy) and sample-preparation detail (10X chemistry). It also includes information on individual study participants, such as disease state and the presence of stenosis in their history. We used these metadata to evaluate the ability of MrVI to recognize meaningful subgroups of people with Crohn’s disease and to highlight cell populations that are affected by stenosis.

We trained MrVI to integrate all the samples in this dataset, using the sample identifier as the modeled target covariate and the combination of library preparation protocol and tissue layers (lamina propria and mucosa) as the nuisance covariate. Comparing the resulting u space with the embedding obtained with scVI, we found that the default settings of MrVI yielded better mixing between the study participants but had slightly lower performance in terms of distinguishing between cell states (using cell annotations assigned in the original study; Extended Data Fig. 5). Indeed, integration is challenging in this dataset owing to significant differences in cell-type composition in the colon and in the ileum, as well as between the mucosal and lamina propria layers. To enhance the alignment of u with known cell states, we developed a variant of MrVI that incorporates cell-type labels (Methods). This variant exhibited overall improved performance compared with scVI, managing sample mixing while preserving cell-type information (Extended Data Figure 5).

We next used MrVI to explore how the different samples are stratified, and how these strata change between cell types. Using a coarse definition of cell types (Fig. 5a and Methods), we again find that different types are associated with different groupings of people with the disease. For instance, considering a subset of immature enterocytes (referred to herein as enterocytes-stem), the sample distance matrix clustered solely by their tissue-of-origin (colon or ileum) (Supplementary Fig. 19). We applied MrVI’s DE function on the enterocytes-stem subset to investigate the differences between these clusters. In line with their biological functions, we observed higher expression of AQP8, CA2 and CA8 in the colon, all of which encode proteins that absorb water, and higher FABP6 and FABP2 expression in the ileum, which encode proteins that absorb fatty acids. Considering a population of mature enterocytes provides a slightly different view, highlighting a specific cluster of eight people with Crohn’s disease who are distinguished from all others with the disease and from controls. Using MrVI DE to compare mature enterocytes in this cluster, which mainly contained colon samples, with other colon samples, we detected an upregulation of genes encoding mucins (MUC1, MUC2, MUC12), which is a well-described pattern in Crohn’s disease43, and upregulation of CXCL1 and CXCL3, which encode chemokines that attract neutrophils44. The expression of these chemokines was associated with stimulation of epithelial cells with IL22 and IL17A, key cytokines in Crohn’s disease. Furthermore, neutrophil infiltration is a key feature of inflamed gut regions45. Depending on the cell subset under consideration, the exploratory analysis of MrVI reflected known differences between the tissues sampled (here, ileum versus colon) and revealed differences in a subpopulation of people with the disease (non-inflamed versus inflamed).

Next, we used MrVI for comparative analysis with respect to a known covariate. We considered the distinction between the two most common complications of Crohn’s disease: stenosis (Vienna classification B2, 11 participants) and fistula or abscesses (penetrating; Vienna classification B3, 7 participants). The remaining are healthy controls or individuals without either complication (Vienna classification B1). Finding reliable biomarkers to distinguish between individuals experiencing the two types of complications is critical, as they might require different treatment strategies. Using our multivariate DE procedure, we studied the presence of stenosis and the inflammation status (inflamed versus non-inflamed) of individual cells, while accounting for the effects of nuisance covariates such as biological sex and tissue location (Methods). We excluded surgical samples from this analysis owing to the marked differences in cell-type composition compared with biopsies (which are the source of most cells in the dataset; Fig. 5a and Methods). Inflammation status had a marked effect on several cell lineages and a strong effect on the stromal compartment. As expected, the presence of B2 disease had a more mild effect, mostly restricted to a few stromal subsets, with its highest impact in a small subset of pericytes (Fig. 5a, Supplementary Fig. 20 and Extended Data Fig. 6b). Therefore, the remainder of the analysis focused on stromal populations consisting of fibroblasts, pericytes, glial cells and endothelial cells (Fig. 5b and Supplementary Fig. 21).

We first compared B2 with B1 samples using DA analysis, controlling for inflammation status and other covariates (Methods). We find a decrease in the abundance of several endothelial populations in B2 samples (for example, lymphatic and LTC4S+ endothelial cells) and an increase in fibroblast populations (for example, ADAMDEC1+; Fig. 5b,c and Extended Data Fig. 7b). This result is in accordance with the prevalence of microvascular rarefaction (that is, loss of endothelial cells) in tissue fibrosis46. Using the same settings for DE, we find that in individuals classified as B2, CDH11, a biomarker of stenosis47, as well as classical markers of tissue fibrosis (ADAMDEC1 and COL1A1) and activation (S100A6, MT2A, and JUN), were upregulated in a subset of HIGD1B+STEAP4+ pericytes (Fig. 5d–f and Extended Data Fig. 6d). Although these genes are upregulated in B2 disease by other stromal subsets, we find several genes that are affected uniquely in cells of the HIGD1B+STEAP4+ pericyte subset. This includes LUM, which is upregulated after fibroblast stimulation in lung fibrosis and promotes fibrocyte differentiation48, and PDGFRB and TGFBI, which are strongly upregulated in lung fibrosis and whose pharmacological inhibition reduces fibrosis20,49. Furthermore, LUM and TGFBI have been reported to be upregulated in the chronic phase of a mouse model of colitis50, which is associated with marked intestinal fibrosis. Notably, although MrVI bases its DE estimations on its generative model, targeted analysis of the same molecules using the raw data shows consistent results (Extended Data Fig. 7a). Together, this analysis demonstrates the potential of MrVI for delineating a population of cells that is associated with a disease phenotype and could facilitate a more nuanced discovery of markers for diagnosis and treatment.

MrVI additionally predicts marked upregulation of PDGFRB by CD36+ endothelial cells in B2 samples (Fig. 5e,f). This is unexpected because PDGFRB is a common marker of pericytes and is not normally expressed by endothelial cells. We further characterized gene expression in the CD36+ subset and found coexpression of endothelial markers (like PLVAP, VAMP5, VWA1) and pericyte markers (like NOTCH3, RGS5 and MYL9) (Extended Data Figs. 6d and 7b). We also found upregulation of markers of tissue fibrosis, such as COL1A1 and TGFBI, in this subset (Extended Data Fig. 6e). Therefore, these results highlight a cell population with a mixed phenotype between the endothelial and pericyte lineages, which upregulates markers of tissue fibrosis in B2 samples. This hints at the endothelial-to-mesenchymal transition in IBD and the presence of a pericyte-like state in the gut endothelium of B2 disease. This transition has been described as occurring in human IBD51. However, the phenomenon was not explored in the original study of this cohort, and has not to date been studied with single-cell genomics51.

Discussion

In this paper, we introduce MrVI, a comprehensive solution for large-scale (multi-sample) single-cell RNA-seq studies. MrVI provides a unified probabilistic framework for integration of samples, sample stratification and analysis of the effects of sample covariates at both the cell-subset and gene levels. Based on a hierarchical latent variable model and counterfactual predictions, MrVI addresses these tasks while accounting for nuisance sources of variation, without requiring cell-type annotations.

The latter point is of particular importance owing to the difficulty of defining cluster boundaries and their resolution. For instance, ref. 10 categorized the human brain into 17 cell types for studying autism, whereas ref. 52 identified 3,313 clusters to characterize cellular heterogeneity in the same tissue. Both strategies proved useful, and it is therefore not generally clear which resolution is appropriate for a given analysis. Agnostic to such clustering strategies, MrVI facilitates a ‘bottom up’ approach that divides the cells into groups in a manner that reflects the task at hand. Specifically, by estimating sample distance matrices around each cell, MrVI aggregates cells into subsets that confer similar groupings of samples. Similarly, estimating DE or DA effects in every cell allows for the aggregation of genes or cells in a way that reflects a coherent response to the covariate of interest. For ease of interpretation, these aggregations can also make use of cell annotations into subsets (by averaging the cell-wise DE or DA effects), as long as the cells in a subset are consistently affected (which was the case in many of our analyses).

MrVI’s architecture uses multi-head attention to model heterogeneous covariate effects, a distinction from traditional models that typically rely on multi-layered perceptrons. We validated this design using ablation studies motivating these choices (Supplementary Fig. 22). We also guided the selection of default parameters that are relevant for most use cases through hyperparameter sensitivity studies (Supplementary Fig. 22, Supplementary Tables 2 and 3 and Supplementary Note B).

We demonstrated MrVI’s performance in a few case studies. Considering a COVID-19 cohort, MrVI identified clinically relevant groupings of people with the disease and highlighted subsets of myeloid cells in which these groupings manifested. Post hoc analysis of the resulting strata revealed a marked agreement with the elapsed time since infection, information that was unavailable to the algorithm. Notably, MrVI’s grouping did not perfectly mirror the infection timelines. This observation underscores MrVI’s potential to produce data-driven sample strata that might not be trivially obtained from the recorded metadata alone and instead could lead to different diagnoses or identification of new disease subtypes12. MrVI is particularly relevant for studies in which samples are collected from numerous individuals and across different anatomical locations or experimental protocols. We demonstrated this using an IBD study, in which MrVI effectively integrated samples from diverse tissue locations and highlighted changes associated with stenosis. Beyond clinical or cross-study settings, MrVI applies to any discrete cell-level metadata by designating it as the target covariate. We demonstrated this using a perturbation screen with the sci-Plex assay, in which each cell is associated with a particular perturbagen, facilitating de novo identification of compound groups and characterization of their effects.

When considering patient cohorts, we applied MrVI using the sample identifier as our modeled target covariate (s). MrVI’s ability to stratify samples without relying on explicit covariate information makes it robust to scenarios in which metadata might be incomplete or inaccurate. There are, however, scenarios in which it is useful to explicitly model other target or nuisance covariates, which could constitute natural extensions of MrVI.

Another natural extension of MrVI is to handle information from other measurement modalities, both separately and in parallel to RNA expression. This extension could pinpoint, for instance, different strata (and their inducing cell subsets) when considering chromatin properties versus RNA53.

MrVI is implemented using state-of-the-art software tools for deep probabilistic modeling and can thus scale to multi-sample studies with millions of cells. Beyond that, the expected increase in scale and complexity of single-cell omics raises new challenges and opportunities for which MrVI can provide a powerful framework for analysis and a solid foundation for further developments.

Methods

The MrVI model

Generative model overview

We consider two-stage scRNA-seq experimental designs in which cells are collected from multiple samples (Fig. 1a). Each sample is associated with target covariates (for example, treated versus untreated, donor or specimen age and sex) or nuisance covariates (for example, the sample collection site or the study ID in cross-study analyses).

Typically, multiple target covariates can induce variation in expression across samples, but it is unknown which of these can affect cells and by what mechanism. For instance, in drug-response studies, both the type of administered drug and its dosage are crucial to assessing drug impact on cell states, but the nature of the interaction between these two factors might not be known. In disease studies, cases can induce specific shifts in gene expression in specific donor subpopulations that might not be fully encoded in the available metadata.

Instead of attempting to model the effects of these covariates directly, we adopt an approach that initially requires only knowledge of sample IDs s ∈ {1, …, S} and the nuisance covariates b ∈ {1, …, B}. This strategy allows us, at a later stage, to highlight which target covariates drive sample variations of interest. The resulting gene expression profiles are denoted as {x1, …, xN}, in which \({x}_{n}\in {{\mathbb{N}}}^{G}\) is the vector of RNA transcript counts for cell n over the G observed genes. We denote the count for cell n and gene g as xng. For any cell (n), sn is the sample ID (for example, the donor from which cell n originates) and bn is the nuisance covariate.

In cases with multiple nuisance covariates, we recommend using the covariate with the coarsest resolution that is still nested in any covariates expected to confound the analysis. This might require concatenating multiple nuisance covariates (that is, the concatenation of the study ID with the batch ID used in each study as bn).
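As a minimal illustration of this concatenation, the sketch below builds one combined nuisance label per cell from a study ID and a within-study batch ID, then encodes it as an integer index. The helper names and the separator are hypothetical and are not part of MrVI; the integer codes start at 0 rather than 1.

```python
def combine_nuisance_covariates(study_ids, batch_ids, sep="_"):
    """Return one combined nuisance label b_n per cell.

    Hypothetical helper: joins the study ID and the within-study batch ID so
    that batches sharing a name across studies remain distinct.
    """
    if len(study_ids) != len(batch_ids):
        raise ValueError("study_ids and batch_ids must have one entry per cell")
    return [f"{study}{sep}{batch}" for study, batch in zip(study_ids, batch_ids)]


def encode_labels(labels):
    """Map labels to integer codes 0..B-1, using sorted label order."""
    categories = sorted(set(labels))
    lookup = {label: code for code, label in enumerate(categories)}
    return [lookup[label] for label in labels], categories


# Toy metadata for four cells (illustrative values only).
study = ["studyA", "studyA", "studyB", "studyB"]
batch = ["batch1", "batch2", "batch1", "batch1"]
combined = combine_nuisance_covariates(study, batch)
codes, categories = encode_labels(combined)
```

Here "batch1" in studyA and "batch1" in studyB receive different codes, which is the point of concatenating the two covariates.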

Isolating sample-specific effects on cell states with MrVI

The generative model of MrVI is written as:

$$\begin{array}{rcl}{u}_{n}& \sim &{\rm{MixtureOfGaussians}}({\mu }_{1},\ldots ,{\mu }_{K},{{\rm{\Sigma }}}_{1},\ldots ,{{\rm{\Sigma }}}_{K},{\pi }_{1},\ldots {\pi }_{K})\\ {z}_{n}| {u}_{n},{s}_{n}&=&{g}_{\theta }^{u\to z}({u}_{n},{s}_{n})\\ {h}_{n}&=&{\rm{softmax}}\left({A}_{zh}\times \left[{z}_{n}+{g}_{\theta }^{z\to h}({z}_{n},{b}_{n})\right]+{\gamma }_{zh}\right)\\ {x}_{ng}| {h}_{ng}& \sim &{\rm{NegativeBinomial}}\left({l}_{n}{h}_{ng},{r}_{ng}\right).\end{array}$$
(1)

Here, un and zn are the latent (unobserved) representations of cell n, both of dimension L. Azh is a matrix of dimension G × L, and γzh is a bias vector of dimension G. \({g}_{\theta }^{u\to z},{g}_{\theta }^{z\to h}\) are multi-head attention layers, with disjoint sets of parameters both denoted as θ for brevity. The size factor ln is fixed as the total sum of counts of cell n, hng denotes the normalized gene expression levels and rng ≥ 0 denotes the inverse dispersion of the distribution for cell n and gene g. μi, Σi and πi, respectively, denote the mean, covariance matrix, and weight of component i ∈ {1, …, K}, where K represents the total number of mixture components. More details about this prior are provided in ‘Additional model details’. All these parameters, other than ln, are learned during training.

A priori, un captures broad variations assumed to characterize cell types and more granular cell states, but is disentangled from both target and nuisance covariates. As such, un harmonizes cells from all samples into a shared latent space. We assume a mixture of Gaussians (MoG) prior or a unimodal Gaussian prior (K = 1) on un, depending on the application and available prior knowledge about cell-state variation. When we expect cells to belong to one of several groups, a MoG prior could be more appropriate than a unimodal Gaussian prior, to avoid posterior collapse and prior overregularization, two issues with variational inference that have been reported in the field54,55.

When reliable cell-type annotations are available, MrVI can also rely on a prior that is weakly informed about cell-type annotations. We found the cell-type-supervised model separated existing cell-type annotations much better than did its unsupervised counterpart (Supplementary Fig. 21). More details about the prior of un are given in ‘Additional model details’.

zn is an augmented representation of the cell that is aware of sample effects but is disentangled from other nuisance covariates. This latent variable is constrained to be close to un by a term in the objective function that penalizes the L2 distance between zn and un.

These assumptions on un and zn hold a priori, but not necessarily a posteriori. For instance, the posterior distribution of un could exhibit conditional dependence on sample-specific factors, allowing MrVI to capture compositional differences across samples while harmonizing cells into a shared latent space.

Given that zn is expected to capture more variability in cells than is un, allowing it to lie in a higher-dimensional space is natural. Additionally, a low-dimensional bottleneck on un could improve sample harmonization. In such a case, we allow zn to take a higher dimension than un by modeling \({z}_{n}| {u}_{n},{s}_{n}={g}_{\theta }^{u\to z}({A}_{uz}{u}_{n}+{\gamma }_{uz},{s}_{n})\), in which Auz is a learned matrix of dimension Lz × L, and γuz is a bias vector of dimension Lz, where Lz is the dimension of zn. Without loss of generality, the remainder of the manuscript focuses on the case in which zn and un have the same dimension (L).

Modeling gene expression under technical effects

MrVI models the normalized expression of gene g, denoted hng, as a function of both zn and the nuisance covariate. This relationship is parameterized with multi-head attention (gθ above) to capture non-linear, nuisance-covariate-specific effects on gene expression. More information regarding this parameterization is available in ‘Additional model details’. Finally, we model the observed transcript counts with negative binomial distributions and account for the technical effects of the sequencing depth using the same approach as scVI22.
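To make the observation model concrete, the sketch below evaluates the log-probability of a count under a negative binomial parameterized by its mean and inverse dispersion. It assumes that the two arguments of the negative binomial in Equation (1) are the mean (ln hng) and the inverse dispersion (rng), which is the convention we understand scVI to use; the function name and toy values are illustrative only.

```python
import math


def nb_log_pmf(x, mean, inv_dispersion):
    """Log-probability of count x under a negative binomial.

    Assumed parameterization: mean (here l_n * h_ng) and inverse dispersion
    (here r_ng), so that the variance is mean + mean**2 / inv_dispersion.
    Both parameters must be strictly positive.
    """
    r = inv_dispersion
    log_norm = math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
    return (
        log_norm
        + r * math.log(r / (r + mean))
        + x * math.log(mean / (r + mean))
    )


# Toy values (illustrative only).
library_size = 2000.0  # l_n, total counts of the cell
h_ng = 0.001           # normalized expression of gene g in cell n
r_ng = 5.0             # inverse dispersion
mean = library_size * h_ng

# Sanity check of the parameterization over a long range of counts.
probs = [math.exp(nb_log_pmf(x, mean, r_ng)) for x in range(200)]
total_mass = sum(probs)
empirical_mean = sum(x * p for x, p in enumerate(probs))
```

Scaling the normalized expression by the library size is what lets the same hng describe cells sequenced at different depths.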

Variational approximation and training procedure

The generative model described by Equation (1) can be used to generate synthetic data, but it does not directly provide the posterior distribution of the latent variables un and zn given the observed gene expression and sample ID of a given cell n, which is required for analysis. Because we model zn as deterministic given un and sn, it suffices to approximate the posterior pθ(un | xn). Because this posterior is intractable, we rely on variational inference to learn an approximation qϕ(un | xn) to the posterior, in which ϕ denotes all parameters used to construct the variational approximation.

Modeling qϕ(un | xn)

We model qϕ(un | xn) as a Gaussian distribution whose mean and covariance (assumed diagonal) are outputs of multi-layer perceptrons (MLPs), fϕ, taking xn as inputs.

With this variational approximation, a straightforward approach to obtain posterior samples of un and zn for a given cell consists of sampling un from qϕ(un | xn), then computing zn as described in Equation (1).

Training procedure

We optimize the evidence lower bound (ELBO), which we maximize over the generative model parameters θ and variational parameters ϕ using mini-batch stochastic gradient descent methods56,57. In this problem, the ELBO is written as:

$${\rm{ELBO}}(\theta ,\phi ):= {{\mathbb{E}}}_{x,s,b}{{\mathbb{E}}}_{{q}_{\phi }(u| x)}\left[\log \frac{{p}_{\theta }(x| z,b){p}_{\theta }(z| u,s){p}_{\theta }(u)}{{q}_{\phi }(u| x)}\right]$$

As mentioned previously, we also add a scaled L2 penalty to this objective on the distance between z and u. Thus, the final objective is written as

$${\mathcal{L}}(\theta ,\phi ):= {\rm{ELBO}}(\theta ,\phi )-c\,{{\mathbb{E}}}_{{q}_{\phi }(u| x){p}_{\theta }(z| u,s)}\left[\parallel z-u{\parallel }_{2}^{2}\right]$$

where c is a scalar coefficient set such that the penalty term matches the log probability of z for an isotropic Gaussian distribution centered around u.

Model architecture

Overall, MrVI uses a variety of MLPs, multi-head attention layers and learnable embeddings to estimate the parameters of the different variational and likelihood distributions described above (Extended Data Figure 8).

Exploratory analysis of sample effects on cell states

A common scenario in large-scale studies is that sample-level covariates are incomplete, noisy, or inconsistent across datasets. It may also be the case that the most relevant sample characteristics affecting gene expression are unobserved. Here, MrVI can identify the most relevant sources of heterogeneity between samples without assuming access to relevant target covariates (Fig. 1c); the most salient axes of variation across samples can then be related back to observed target covariates. This type of analysis relies on cell-state counterfactuals, which are used to quantify sample distances at the cellular level.

Predicting counterfactual cell states

After model fitting, MrVI can be used to predict the effect of a given sample on any cell. We aim to predict the counterfactual state of the cell, that is, its state had it been collected from another sample. We achieve this by substituting the sample-of-interest \({s}^{{\prime} }\ne {s}_{n}\) for the true sample-of-origin sn in Equation (1) to obtain

$${z}_{n}^{{s}^{{\prime} }}:= {u}_{n}+{g}_{\theta }^{u\to z}({u}_{n},{s}^{{\prime} }),$$
(2)

where un is the inferred cell state for cell n obtained through variational inference. \({z}_{n}^{{s}^{{\prime} }}\) captures the counterfactual state of cell n had it been collected from sample \({s}^{{\prime} }\).
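The substitution in Equation (2) can be sketched as follows. The function g_u_to_z below is a hypothetical stand-in (one small linear map per sample) for the learned multi-head attention module, chosen only so that the example runs; it is not MrVI's architecture, and all sizes and values are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, latent_dim = 4, 3

# Hypothetical stand-in for the learned g^{u->z}: one linear map per sample.
# MrVI uses multi-head attention here; this toy only fixes the shapes.
sample_weights = rng.normal(scale=0.1, size=(n_samples, latent_dim, latent_dim))
sample_offsets = rng.normal(scale=0.5, size=(n_samples, latent_dim))


def g_u_to_z(u, sample):
    """Toy sample-specific shift applied to a cell state u."""
    return sample_weights[sample] @ u + sample_offsets[sample]


def counterfactual_states(u):
    """Return an (S, L) array whose row s' is z^{s'} = u + g(u, s')."""
    return np.stack([u + g_u_to_z(u, s) for s in range(n_samples)])


u_n = rng.normal(size=latent_dim)  # inferred cell state for one cell
z_counterfactuals = counterfactual_states(u_n)
```

Each row answers the question "what would this cell look like had it come from sample s'?", while the cell state un is held fixed.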

Estimating sample distances at the cellular level

MrVI allows for unsupervised sample stratification by comparing distances between counterfactuals from Equation (2). Specifically, we can assess the differences between samples sa and sb on a cell n by computing the distance between their respective counterfactual cell states \({z}_{n}^{{s}_{a}}\) and \({z}_{n}^{{s}_{b}}\). In particular, low distances between counterfactuals indicate that the two samples have similar effects on the cell according to the model.

On the basis of this observation, we summarize the sample stratification for a given cell as a sample distance matrix by computing the distance between counterfactuals for all pairs of samples. More precisely, for any cell n, we let D(n) denote its sample distance matrix between counterfactuals. In this matrix, the element at the position indexed by (sa, sb), where sa and sb are indices representing different samples, corresponds to the Euclidean distance between the counterfactual cell states \({z}_{n}^{{s}_{a}}\) and \({z}_{n}^{{s}_{b}}\).
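A minimal sketch of this computation for a single cell is given below, assuming its counterfactual states are already available as an array with one row per sample. The function name and the toy values are illustrative.

```python
import numpy as np


def sample_distance_matrix(z_counterfactuals):
    """Pairwise Euclidean distances between the rows of an (S, L) array.

    Row s holds the counterfactual state of one cell under sample s, so entry
    (a, b) of the result is ||z^{s_a} - z^{s_b}||_2.
    """
    diffs = z_counterfactuals[:, None, :] - z_counterfactuals[None, :, :]
    return np.linalg.norm(diffs, axis=-1)


# Toy counterfactuals for one cell under three samples (illustrative values).
z = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
D = sample_distance_matrix(z)
```

In this toy case, samples 0 and 2 have similar effects on the cell (distance 1), whereas sample 1 is far from both.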

These matrices inform sample stratification at single-cell resolution. They can first be used to identify cell populations with homogeneous sample stratifications, by clustering cells using their distance matrices as feature vectors. To do so, we embed each flattened distance matrix using PCA before clustering cells using the Leiden algorithm58. In any resulting cluster, we then assess sample stratification in aggregate: we compute the average sample distance matrix of the cluster, which we then use to group samples by hierarchical clustering.
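The two ends of this procedure can be sketched on synthetic data as follows. The sketch embeds flattened distance matrices with PCA (via an SVD) and applies average-linkage hierarchical clustering from SciPy to the mean matrix of each cell population. The Leiden step is not reproduced: the toy population labels are known by construction, and the linkage method and number of groups are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def pca_embed(distance_matrices, n_components=2):
    """Flatten each (S, S) matrix and project onto the top principal components."""
    flat = distance_matrices.reshape(len(distance_matrices), -1)
    centered = flat - flat.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def cluster_samples(mean_matrix, n_groups=2):
    """Average-linkage hierarchical clustering of samples from an (S, S) matrix."""
    symmetric = (mean_matrix + mean_matrix.T) / 2.0
    np.fill_diagonal(symmetric, 0.0)
    tree = linkage(squareform(symmetric, checks=False), method="average")
    return fcluster(tree, t=n_groups, criterion="maxclust")


def block_matrix(groups, noise, rng):
    """Toy (S, S) matrix: small distances within a group, large across groups."""
    groups = np.asarray(groups)
    base = np.where(groups[:, None] == groups[None, :], 0.2, 2.0)
    jitter = np.abs(rng.normal(scale=noise, size=base.shape))
    matrix = base + (jitter + jitter.T) / 2.0
    np.fill_diagonal(matrix, 0.0)
    return matrix


rng = np.random.default_rng(0)
# Two synthetic cell populations that stratify four samples differently.
pop_a = np.stack([block_matrix([0, 0, 1, 1], 0.05, rng) for _ in range(20)])
pop_b = np.stack([block_matrix([0, 1, 0, 1], 0.05, rng) for _ in range(20)])
all_matrices = np.concatenate([pop_a, pop_b])

# PCA embedding of cells; a cell-clustering step (Leiden in the paper) would
# take this as input. Here the population labels are known by construction.
embedding = pca_embed(all_matrices)
sample_groups_a = cluster_samples(pop_a.mean(axis=0))
sample_groups_b = cluster_samples(pop_b.mean(axis=0))
```

In this toy setting, the first principal component separates the two cell populations, and each population recovers its own grouping of the four samples.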

Owing to the uncertainty in un, even two samples with identical underlying distributions will have non-zero distances between their estimated counterfactual cell states. To account for this uncertainty, MrVI optionally computes Monte Carlo estimates of the distribution of distances between two counterfactual cell states derived from the same sample. These distribution estimates can then be used to determine the z score for the original distance matrix values. In detail, for a cell n with expression xn, we sample \({u}_{n}^{1},{u}_{n}^{2} \sim \hat{p}(u| {x}_{n})\) and compute the L2 distance between them, \(\parallel {u}_{n}^{1}-{u}_{n}^{2}{\parallel }_{2}\). Repeating this over many Monte Carlo samples yields estimates of the mean (\(\hat{\mu }\)) and the s.d. (\(\hat{\sigma }\)) of this distance, which are used to compute normalized distances for D(n) as \(\frac{D(n)-\hat{\mu }}{\hat{\sigma }}\).
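The normalization can be sketched as below, assuming the approximate posterior of a cell is a diagonal Gaussian with known mean and s.d. The function names, the number of Monte Carlo draws and the toy values are assumptions for illustration.

```python
import numpy as np


def same_sample_distance_stats(post_mean, post_std, n_draws=1000, seed=0):
    """Monte Carlo mean and s.d. of ||u1 - u2||_2 for u1, u2 drawn from q(u | x_n).

    Assumes a diagonal Gaussian approximate posterior.
    """
    rng = np.random.default_rng(seed)
    shape = (n_draws, post_mean.shape[0])
    u1 = post_mean + post_std * rng.standard_normal(shape)
    u2 = post_mean + post_std * rng.standard_normal(shape)
    distances = np.linalg.norm(u1 - u2, axis=1)
    return distances.mean(), distances.std(ddof=1)


def normalize_distances(D, mu_hat, sigma_hat):
    """Convert raw distances to z scores: (D - mu_hat) / sigma_hat."""
    return (D - mu_hat) / sigma_hat


# Toy posterior for one cell and a toy 2 x 2 sample distance matrix.
post_mean = np.zeros(5)
post_std = np.full(5, 0.1)
mu_hat, sigma_hat = same_sample_distance_stats(post_mean, post_std)
D = np.array([[0.0, 1.5], [1.5, 0.0]])
D_normalized = normalize_distances(D, mu_hat, sigma_hat)
```

A large positive normalized distance indicates that two samples differ by more than would be expected from posterior uncertainty alone.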

Assessing compositional and expressional sample differences

With observed sample covariates at hand, MrVI can also highlight which cells have different abundances or expression levels across groups (Fig. 1d). Such characteristics can, for instance, correspond to age, sex or disease status when samples correspond to different donors. The target covariate can also be derived from a stratification of samples based on the procedure described in the previous section. This section outlines how MrVI can be used for both DE and DA analyses assessing sample differences in gene expression and cell composition.

Cluster-free assessment of differences in expression

First, MrVI can characterize differential expression patterns across samples. Suppose we observe C target covariates in the form of a vector \({c}^{s}\in {{\mathbb{R}}}^{C}\) for each sample (s). To identify the cells and genes affected by each target covariate, we fit the following linear model for each cell (n):

$${z}_{n}^{{s}^{{\prime} }}={{c}^{{s}^{{\prime} }}}^{T}{\beta }_{n}+{u}_{n},\quad \forall {s}^{{\prime} },$$
(3)

where \({z}_{n}^{1},\ldots ,{z}_{n}^{S}\) are S counterfactuals for cell n obtained from Equation (2). Here, \({\beta }_{n}\in {{\mathbb{R}}}^{C\times L}\) is the matrix of regression coefficients obtained via least-squares regression.
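
A minimal sketch of the per-cell fit in Equation (3) is given below, using ordinary least squares from NumPy. The function name, the argument names and the toy design matrix are assumptions made for illustration; the per-covariate L2 norm at the end is one simple summary of effect strength.

```python
import numpy as np

def fit_covariate_effects(z_counterfactual, u_n, covariates):
    """Least-squares estimate of beta_n in z_n^s = c_s^T beta_n + u_n.

    z_counterfactual: (S, L) counterfactual states of one cell.
    u_n: (L,) sample-independent state of that cell.
    covariates: (S, C) design matrix whose rows are c^s.
    Returns beta_n with shape (C, L).
    """
    residual = np.asarray(z_counterfactual, dtype=float) - np.asarray(u_n)[None, :]
    design = np.asarray(covariates, dtype=float)
    beta, *_ = np.linalg.lstsq(design, residual, rcond=None)
    return beta

# Toy check: four samples, one binary covariate, known shift of (1, -2) in z.
u_n = np.array([0.5, 0.5])
c = np.array([[0.0], [0.0], [1.0], [1.0]])
true_beta = np.array([[1.0, -2.0]])
z = c @ true_beta + u_n
beta_n = fit_covariate_effects(z, u_n, c)
effect_strength = np.linalg.norm(beta_n, axis=1)   # L2 norm per covariate
```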

Identifying the effect of covariates on cells

This linear model can first quantify the overall effect of an observed covariate on any cell. We compute, for any cell n and covariate index \(j\le C\), the Chi-squared statistic of \({\beta }_{n}^{j}\). This statistic quantifies the extent to which the observed covariate j explains the variation in the counterfactual cell states.

Detecting cells strongly affected by covariate

The results of the linear regression can help identify cells strongly affected by a covariate. For any cell n, we compute the L2 norm of the coefficient vector \({\beta }_{n}^{i}\) for a covariate i. This yields the effect strength of that covariate in the z representation and can be used to compare effect strengths across multiple cell types.

Detecting differentially expressed genes

The results of the linear regression can also identify DE genes associated with a given covariate in any cell. For simplicity, assume that the covariate of interest is binary. Let \({\beta }_{n}^{1}\in {{\mathbb{R}}}^{L}\) denote the regression coefficients of the covariate of interest for cell n. To identify the associated DE genes, we decode the counterfactual cell state \({z}_{n}^{1}={\beta }_{n}^{1}+{u}_{n}\) and the reference cell state \({z}_{n}^{0}={u}_{n}\). This computation yields two vectors of decoded gene expressions, denoted as \({h}_{n}^{1}\) and \({h}_{n}^{0}\). We then compute the log (fold change) between these two vectors, measuring the effect of the covariate on each of the observed genes g in the cell.
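
The decoding step can be illustrated with a toy stand-in for the trained decoder. In the sketch below, `toy_decoder` is a hypothetical softmax-linear map and not the MrVI decoder, the weight matrix `W` is random, and the base-2 logarithm is an arbitrary choice made for the example.

```python
import numpy as np

def toy_decoder(z, W):
    """Hypothetical stand-in for the trained decoder: softmax over genes."""
    logits = z @ W
    expr = np.exp(logits - logits.max())   # subtract max for numerical stability
    return expr / expr.sum()

def log_fold_change(beta_1, u_n, W, eps=1e-8):
    """Log2 fold change between decoded counterfactual and reference states."""
    h1 = toy_decoder(u_n + beta_1, W)   # counterfactual state z_n^1
    h0 = toy_decoder(u_n, W)            # reference state z_n^0
    return np.log2(h1 + eps) - np.log2(h0 + eps)

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 6))             # L = 2 latent dimensions, G = 6 genes
u_n = np.array([0.2, -0.1])
lfc_null = log_fold_change(np.zeros(2), u_n, W)          # no covariate effect
lfc = log_fold_change(np.array([1.0, 0.0]), u_n, W)      # non-zero effect
```

With a zero coefficient vector the two decoded profiles coincide and every log fold change is zero, which is a useful sanity check on any implementation.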

Accounting for out-of-distribution samples

Before conducting the described procedure, we first identify and discard samples that are out of distribution for any given cell. Samples will be out of distribution for a cell if no cell from that sample was collected in a similar cell state in the un space. For these samples, the model has insufficient information to accurately infer realistic counterfactual cell states in Equation (2). Thus, we conservatively discard the sample s for cell state u if the maximum density reached at u with respect to the approximate variational posterior distributions falls below a given threshold τ. More details on how the densities are computed and how τ is chosen are given in ‘Additional model details’.

Assessing cluster-free differences in composition

Last, our approach can identify differentially abundant cell populations over groups of samples using log ratios of aggregated posterior densities. For this purpose, we introduce qs, the aggregated posterior distribution for a given sample s, which corresponds to \({q}_{s}(u):= \frac{1}{{n}_{s}}{\sum }_{n:{s}_{n} = s}q_\phi(u| {x}_{n})\), where ns denotes the number of cells in the considered sample and \(q_\phi(u| {x}_{n})\) is the variational approximation to the posterior distribution over the u space for cell n. We can then quantify the density of any set of samples \(A\subset \{1,\ldots ,S\}\) in the u space as \({q}_{A}(u):= \frac{1}{| A| }{\sum }_{s\in A}{q}_{s}(u).\)

For two disjoint sets of samples A and B, we quantify the relative overabundance of cells from A compared with B at any cell state u by computing the log (density ratio) of the aggregated posterior densities of the two groups, that is:

$${r}_{AB}(u):= \log \frac{{q}_{A}(u)}{{q}_{B}(u)}.$$
(4)

We can then identify enriched or depleted regions of u in A compared with B by inspecting the log ratio rAB(u). This approach has several benefits. Given that MrVI assesses differential abundance in the u latent space, the captured differential-abundance effects are orthogonal to the differential-expression effects quantified in the previous section. Furthermore, this approach allows us to identify enriched cell states without cell-type or neighborhood assignments.
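
The log ratio in Equation (4) can be sketched as follows, assuming that each variational posterior is a diagonal Gaussian described by a mean and an s.d. per cell. All function and variable names are illustrative, and the computation is carried out in log space for numerical stability.

```python
import numpy as np

def log_gaussian_density(u, mean, std):
    """Log density of a diagonal Gaussian q(u | x_n), evaluated at u."""
    var = std ** 2
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (u - mean) ** 2 / var, axis=-1)

def log_group_density(u, means, stds, sample_ids, group):
    """log q_A(u): average over samples in `group` of the aggregated posteriors q_s(u)."""
    log_q = log_gaussian_density(u, means, stds)   # one value per cell
    per_sample = []
    for s in group:
        in_s = sample_ids == s
        # log q_s(u) = log of the mean of q(u | x_n) over cells of sample s
        per_sample.append(np.logaddexp.reduce(log_q[in_s]) - np.log(in_s.sum()))
    return np.logaddexp.reduce(per_sample) - np.log(len(group))

def log_density_ratio(u, means, stds, sample_ids, group_a, group_b):
    return (log_group_density(u, means, stds, sample_ids, group_a)
            - log_group_density(u, means, stds, sample_ids, group_b))

# Toy example: sample 0 sits at u = 0, sample 1 at u = 3 (one latent dimension).
means = np.array([[0.0], [0.0], [3.0], [3.0]])
stds = np.ones_like(means)
sample_ids = np.array([0, 0, 1, 1])
r_at_0 = log_density_ratio(np.array([0.0]), means, stds, sample_ids, [0], [1])
r_at_3 = log_density_ratio(np.array([3.0]), means, stds, sample_ids, [0], [1])
r_mid = log_density_ratio(np.array([1.5]), means, stds, sample_ids, [0], [1])
```

Positive values mark states enriched in A, negative values mark states enriched in B, and the midpoint between the two toy samples gives a ratio of zero.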

At the cluster level, we also devise a strategy to identify enriched or depleted cell subpopulations with statistical confidence. For this purpose, letting A denote a cluster of interest, we collect log ratios for (1) all cells in cluster A, and (2) all cells not in A. We then test for a difference in the means of these two sets of log ratios using a two-sided, two-sample t-test. To avoid declaring negligible differences significant merely because of large sample sizes, the t-test is performed against the composite null hypothesis that the mean difference is, in absolute value, less than a given threshold δ. Throughout the experiments, we set δ = 0.1.
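
One common way to form a test statistic for such a minimum-effect null is sketched below. It is an assumption made for illustration and may differ from the exact procedure used here: the statistic is a Welch-type t statistic in which the observed difference is shrunk toward zero by δ, and the conversion to a P value is omitted.

```python
import numpy as np

def min_effect_t_statistic(x, y, delta=0.1):
    """Welch-type statistic for H0: |mean(x) - mean(y)| <= delta.

    Returns 0 when the observed difference lies inside [-delta, delta];
    otherwise the distance of the difference from the nearest boundary,
    in standard-error units, carrying the sign of the difference.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    diff = x.mean() - y.mean()
    se = np.sqrt(x.var(ddof=1) / x.size + y.var(ddof=1) / y.size)
    excess = max(abs(diff) - delta, 0.0)
    return np.sign(diff) * excess / se

in_cluster = np.array([1.0, 1.2, 0.8, 1.0])      # log ratios for cells in A
out_cluster = np.array([0.0, 0.2, -0.2, 0.0])    # log ratios for cells not in A
t_large = min_effect_t_statistic(in_cluster, out_cluster)
t_small = min_effect_t_statistic(out_cluster + 0.05, out_cluster)
```

A mean difference smaller than δ yields a statistic of exactly zero regardless of how many cells are available, which is the behavior the threshold is meant to enforce.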

Additional model details

Mixture of Gaussians prior for un

MrVI posits a mixture of Gaussians (MoG) prior on un, which is written as:

$$\begin{array}{ll}&{c}_{n} \sim {\rm{Categorical}}({\pi }_{1},{\pi }_{2},\ldots ,{\pi }_{K}),\\ &{u}_{n}| {c}_{n}=c \sim {\mathcal{N}}({\mu }_{c},{{\rm{\Sigma }}}_{c}).\end{array}$$
(5)

Here, cn denotes the mixture component assigned to cell n. In practice, we assume the covariance matrices to be diagonal and learn μ1, …, μK, Σ1, …, ΣK, and π1, …, πK during training using maximum likelihood estimation.

When cell-type annotations are available, MrVI can weakly encourage the mixture of Gaussians to align with these annotations. In this case, we set K to be the number of unique cell-type annotations and reparameterize the mode of the Gaussian distribution as \({c}_{n} \sim {\rm{Categorical}}({\pi }_{1,n}^{{\prime} },{\pi }_{2,n}^{{\prime} },\ldots ,{\pi }_{K,n}^{{\prime} })\), in which \(\log {\pi }_{k,n}^{{\prime} }=\log {\pi }_{k}+\epsilon {\mathbb{I}}\left({y}_{n}=k\right)\). Here, ϵ is a positive constant (with a default of 10), yn is the cell-type annotation of cell n, and \({\mathbb{I}}(.)\) is the indicator function.
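
The effect of the annotation bonus on the mixture weights can be sketched as follows. The function name is hypothetical, and the explicit normalization of the weights is an assumption made so that the output can be read as probabilities.

```python
import numpy as np

def annotation_biased_weights(log_pi, y_n, eps=10.0):
    """Normalized weights pi'_{k,n} with log pi'_{k,n} = log pi_k + eps * 1[y_n = k]."""
    logits = np.array(log_pi, dtype=float)    # copy, so the input is left untouched
    logits[y_n] += eps                        # bonus for the annotated component
    logits -= np.logaddexp.reduce(logits)     # normalize in log space
    return np.exp(logits)

pi = np.array([0.25, 0.25, 0.5])
weights = annotation_biased_weights(np.log(pi), y_n=0)
```

With the default ϵ = 10, almost all of the prior mass for the cell moves to the component matching its annotation, while the other components keep a small non-zero weight.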

Parameterization of multi-head attention layers

Two components of MrVI rely on multi-head attention layers, corresponding to the mappings \({g}_{\theta }^{u\to z}\) and \({g}_{\theta }^{z\to h}\) from Equation (1). We now provide details on how these layers are parameterized, illustrated in Extended Data Figure 9.

Parameterization of \({g}_{\theta }^{u\to z}\)

\({g}_{\theta }^{u\to z}\) takes as inputs the cell state (un) and the sample from which the cell was collected (sn). We associate each sample ID \(s\in \{1,\ldots ,S\}\) with an embedding \({e}_{s}\in {{\mathbb{R}}}^{L}\), learned during training. We then rely on a multi-head attention mechanism to capture the effect of sn on un in a non-linear fashion, considering un as queries and \({e}_{{s}_{n}}\) as keys and values. The attention output is then passed through a series of two fully connected layers, with ReLU activations, to obtain the actual output of \({g}_{\theta }^{u\to z}\).

Parameterization of \({g}_{\theta }^{z\to h}\)

\({g}_{\theta }^{z\to h}\) relies on the exact same parameterization as \({g}_{\theta }^{u\to z}\), but takes zn and bn as inputs instead of un and sn.

Out-of-distribution checks

The MrVI DE module actively filters out out-of-distribution cell–sample pairs. It might, for instance, be the case that a given sample contains no cells of a given type; in this case, this sample should be discarded for the DE analysis in Equation (3). We identify these out-of-distribution samples by defining the following admissibility score:

$${A}_{s}(u):= \mathop{\max }\limits_{n:{s}_{n}=s}q_\phi(u| {x}_{n}).$$
(6)

For a given cell n, we filter out sample s whenever As(u)≤τs, where τs is set to the 5% quantile of reference admissibility scores \({R}_{s}^{{\prime} }(n)\) over all of the cells collected in s, in which:

$${R}_{s}^{{\prime} }(n):= \mathop{\max }\limits_{\{i:{s}_{i}=s\}\setminus n}q_\phi(u| {x}_{i}).$$

Notably, when computing the baseline admissibility scores over in-sample observations, we exclude the cell itself. After computing the set of admissible samples for every cell, additional filtering can be performed on the level of cells to eschew those with very few admissible samples (for example, a rare cell type observed in one sample), for which counterfactual estimates could be generally unreliable (see Supplementary Fig. 23).
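
The admissibility check can be sketched as follows for a single sample s. The sketch assumes diagonal Gaussian posteriors, evaluates the leave-one-out reference scores at each cell's posterior mean, and works with log densities (thresholding is unaffected because the logarithm is monotone). All names are illustrative.

```python
import numpy as np

def log_q(u, mean, std):
    """Log density of a diagonal Gaussian q(u | x), with broadcasting over cells."""
    var = std ** 2
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (u - mean) ** 2 / var, axis=-1)

def admissibility_threshold(means, stds, quantile=0.05):
    """tau_s: quantile of the leave-one-out scores R'_s(n) over cells of sample s."""
    # pairwise[i, j] = log q(u_i | x_j), evaluated at the posterior mean u_i
    pairwise = log_q(means[:, None, :], means[None, :, :], stds[None, :, :])
    np.fill_diagonal(pairwise, -np.inf)   # exclude the cell itself
    return np.quantile(pairwise.max(axis=1), quantile)

def is_admissible(u, means, stds, tau):
    """Keep sample s for state u when A_s(u) = max_n q(u | x_n) exceeds tau_s."""
    return bool(log_q(u, means, stds).max() > tau)

# Toy sample with four cells on a line (one latent dimension).
means = np.array([[0.0], [0.1], [0.2], [0.3]])
stds = np.full_like(means, 0.5)
tau = admissibility_threshold(means, stds)
inside = is_admissible(np.array([0.15]), means, stds, tau)
outside = is_admissible(np.array([10.0]), means, stds, tau)
```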

Benchmark

Baselines

Exploratory analyses

We considered two approaches that stratify samples on the basis of differences in cell cluster abundance. Both approaches compare subcluster proportions between samples to yield distance matrices. More particularly, they subcluster each predefined cell group with the Leiden algorithm58 using low-dimensional cell representations, that is PCA for Composition (PCA) or scVI for Composition (SCVI). The scVI model was given the batch ID numbers to correct for batch effects in the latent representation. The PCA algorithm, however, does not have an explicit way to handle batch IDs. The distance between two arbitrary samples is then defined as the Euclidean distance between their subcluster proportions.
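
The distance used by both composition baselines can be sketched as below, starting from per-cell sample and subcluster labels. The subclustering itself (Leiden on PCA or scVI representations) is assumed to have been run already, and the function and variable names are illustrative.

```python
import numpy as np

def composition_distances(sample_ids, subcluster_ids):
    """Euclidean distances between per-sample subcluster proportion vectors."""
    samples, s_idx = np.unique(sample_ids, return_inverse=True)
    clusters, c_idx = np.unique(subcluster_ids, return_inverse=True)
    counts = np.zeros((len(samples), len(clusters)))
    np.add.at(counts, (s_idx, c_idx), 1)                  # sample-by-subcluster counts
    props = counts / counts.sum(axis=1, keepdims=True)    # each row sums to 1
    diff = props[:, None, :] - props[None, :, :]
    return samples, np.linalg.norm(diff, axis=-1)

sample_ids = np.array(["s1", "s1", "s2", "s2", "s3", "s3"])
subcluster_ids = np.array([0, 1, 0, 1, 0, 0])
samples, dist = composition_distances(sample_ids, subcluster_ids)
```

In this toy example, samples s1 and s2 have identical proportions and are at distance zero, whereas s3 contains only subcluster 0 and is separated from both.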

Guided analyses

We also considered Milo19 and miloDE61, which leverage estimates for DA and DE, respectively, in guided analyses. Milo is a statistical framework that aims to detect cell neighborhoods enriched in certain sample groups based on a nearest-neighbor graph of cells. Built on top of Milo, miloDE59 performs differential expression tests for each neighborhood identified by Milo by comparing each neighborhood against adjacent ones. These approaches, however, do not provide effect sizes for DA and DE at the cell level and instead group cells into neighborhoods, which could obscure effect sizes at single-cell resolution18. To compare these approaches to MrVI, we defined the cell-level effect size for Milo and miloDE as the average effect size of the neighborhoods to which each cell belonged.
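
The neighborhood-to-cell averaging can be sketched as follows. The boolean membership matrix and the treatment of cells that belong to no neighborhood (assigned NaN) are assumptions made for illustration.

```python
import numpy as np

def cell_level_effects(membership, nhood_effects):
    """Average neighborhood effect sizes over the neighborhoods containing each cell.

    membership: (n_cells, n_nhoods) boolean matrix.
    nhood_effects: (n_nhoods,) effect size per neighborhood.
    Cells that belong to no neighborhood receive NaN.
    """
    membership = np.asarray(membership, dtype=float)
    n_memberships = membership.sum(axis=1)
    totals = membership @ np.asarray(nhood_effects, dtype=float)
    averaged = totals / np.maximum(n_memberships, 1)   # avoid division by zero
    return np.where(n_memberships > 0, averaged, np.nan)

membership = np.array([[1, 0], [1, 1], [0, 0]], dtype=bool)
effects = cell_level_effects(membership, np.array([2.0, -1.0]))
```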

Metrics

Cell-type silhouette scores

We consider averaged silhouette width scores, computed as in ref. 30, to assess the relevance and the proper mixing of the latent representation u under the assumption that the same cell types appear across the considered samples. To do so, we first compute the silhouette score with respect to author-provided cell-type annotations. For any cell n with cell representation r(n), belonging to annotation Co, let d(n, C) denote the mean distance of r(n) to representations of annotation C, excluding n if C = Co. Then \(a(n)=d(n,{C}_{o})\) is the average distance of r(n) to cells of the same annotation, and \(b(n)={\min }_{C,C\ne {C}_{o}}d(n,C)\) is the smallest mean distance of r(n) to the cells of any other annotation. The silhouette score for cell n is computed as:

$$s(n)=\frac{{\min }_{C,C\ne {C}_{o}}d(n,C)-d(n,{C}_{o})}{\max \left\{{\min }_{C,C\ne {C}_{o}}d(n,C),d(n,{C}_{o})\right\}},$$
(7)

and the overall dataset silhouette score is the average of rescaled silhouette scores across all cells in the data. The rescaling, \(\tilde{s}(n)=\frac{1}{2}(s(n)+1)\), puts the dataset score in the range [0, 1]. This score assesses to what extent the data representations cluster according to the annotations. When the dataset score is equal to 1, representations with the same annotation perfectly cluster together.
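
A direct, unoptimized sketch of Equation (7) and of the rescaling is given below. It assumes Euclidean distances, at least two annotations and at least two cells per annotation; the benchmark itself relies on the scib-metrics package, so this code is only an illustration of the definition.

```python
import numpy as np

def rescaled_silhouette(reps, labels):
    """Mean of (s(n) + 1) / 2 over cells, with s(n) as in Equation (7)."""
    reps, labels = np.asarray(reps, dtype=float), np.asarray(labels)
    dist = np.linalg.norm(reps[:, None, :] - reps[None, :, :], axis=-1)
    scores = []
    for n in range(len(reps)):
        own = labels == labels[n]
        own[n] = False                        # exclude n from its own annotation
        a = dist[n, own].mean()               # d(n, C_o)
        b = min(dist[n, labels == c].mean()   # min over C != C_o of d(n, C)
                for c in np.unique(labels) if c != labels[n])
        scores.append((b - a) / max(a, b))
    return float(np.mean([(s + 1) / 2 for s in scores]))

reps = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
score_good = rescaled_silhouette(reps, np.array([0, 0, 1, 1]))   # labels match geometry
score_bad = rescaled_silhouette(reps, np.array([0, 1, 0, 1]))    # labels cut across it
```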

Batch silhouette scores

We also used the silhouette to measure the extent to which batch IDs mix together in the latent space. To do so, we follow the procedure described in ref. 30, which consists of, for each previously annotated cell type: (1) computing cell silhouette scores with respect to the batch assignments, (2) rescaling these scores, such that \(\hat{s}(n)=1-| s(n)|\), and (3) computing an overall silhouette score as a weighted average of \(\hat{s}(n)\), with weights chosen so that each cell type contributes equally.
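
The three steps can be sketched as follows. Equal contribution per cell type is implemented here as a mean of per-cell-type means, which is one way to realize the weighting and is an assumption of this sketch; Euclidean distances and at least two cells per batch within each cell type are also assumed.

```python
import numpy as np

def silhouette_per_cell(reps, labels):
    """Per-cell silhouette s(n) with respect to `labels`, using Euclidean distances."""
    dist = np.linalg.norm(reps[:, None, :] - reps[None, :, :], axis=-1)
    scores = np.empty(len(reps))
    for n in range(len(reps)):
        own = labels == labels[n]
        own[n] = False
        a = dist[n, own].mean()
        b = min(dist[n, labels == c].mean()
                for c in np.unique(labels) if c != labels[n])
        scores[n] = (b - a) / max(a, b)
    return scores

def batch_silhouette(reps, batches, cell_types):
    """Average of 1 - |s(n)| within each cell type, then across cell types."""
    per_type = []
    for ct in np.unique(cell_types):
        mask = cell_types == ct
        s = silhouette_per_cell(reps[mask], batches[mask])   # step (1)
        per_type.append(np.mean(1 - np.abs(s)))              # step (2)
    return float(np.mean(per_type))                          # step (3)

# Cell type A: batches interleaved. Cell type B: batches well separated.
reps = np.array([[0.0], [1.0], [2.0], [3.0], [0.0], [1.0], [10.0], [11.0]])
batches = np.array([0, 1, 0, 1, 0, 0, 1, 1])
cell_types = np.array(["A"] * 4 + ["B"] * 4)
score_all = batch_silhouette(reps, batches, cell_types)
score_a_only = batch_silhouette(reps[:4], batches[:4], cell_types[:4])
```

The well-mixed cell type A scores higher on its own than the combined dataset, because the separated batches in cell type B pull the average down.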

Data and preprocessing

We here provide a brief description of the datasets and of the preprocessing steps applied to them. More details can be found in Supplementary Note A.

Semi-synthetic experiment

We constructed a semi-synthetic dataset containing controlled DE and DA effects, on the basis of an original PBMC dataset of 68,000 cells29 (Extended Data Fig. 10). The synthetic dataset contains 32 synthetic study subjects, as well as five cell subsets, each of which can be seen as a cell type. In the first subset, denoted as A, there are differences in expression across subject groups characterized by covariate 1. Although we have no exact ground truth for the subject–subject distance in subset A, our ground truth consisted of a dendrogram, or tree, over study subjects, characterizing the similarities between subjects in terms of gene expression in the cell subset A. Specifically, all subjects sharing the same covariate 1 value shared similar gene expression values for cells in subset A. In two other cell subsets, denoted as B and C, cells had no differences in expression over subjects but exhibited differences in abundance, either corresponding to enrichment or depletion of these cell subsets in a specific group of samples. In particular, all subjects sharing the same covariate 2 value shared similar proportions of cell subsets B and C, but different proportions with the other subjects.

COVID-19 experiment

The original dataset7 contained a total of 650,000 PBMCs sequenced across three sites: Cambridge, Sanger and Newcastle. We discarded cells coming from Cambridge and Sanger, and focused on data points sequenced in Newcastle. We retained the 10,000 most variable genes using Seurat v3. The resulting dataset contained 418,768 cells, originating from 55 individuals. No additional cell or gene filtering was performed.

sci-Plex experiment

The original sci-Plex dataset6 contains gene expression over three cell lines, exposed to 188 small-molecule drugs at four different doses. We filtered out drugs with no effects, resulting in a final dataset containing 251,088 cells with 92 drugs at all four doses and the vehicle cells. We applied highly variable gene selection using Seurat v3, retaining the top 5,000 genes.

Inflammatory bowel disease experiment

We downloaded the dataset42 from the Broad Institute Single Cell Portal. The underlying study was conducted in 46 individuals with the disease and 25 controls, across tissue regions and sample preparation protocols, for a total of 463,000 cells. We retained the 10,000 most variable genes using Seurat v3.

Code dependencies

MrVI was implemented using the scvi-tools framework with JAX as the machine learning backend. The single-cell data were primarily stored as anndata structures, and xarray and pandas were used to store downstream model outputs. For preprocessing and visualization, we used scanpy, leidenalg, pymde, matplotlib, plotnine and seaborn. We used the scib-metrics package for computing various integration metrics.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.