Introduction

Developments in single-cell RNA sequencing (scRNA-seq) and Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) screening have accelerated the discovery of associations between genes and various biological processes such as immune responses, cell proliferation and drug resistance1,2,3,4,5,6. In particular, technologies such as CROP sequencing (CROP-seq)7 and Perturb sequencing (Perturb-seq)1 have made high-throughput, large-scale cellular perturbation screens possible. Such screens allow practitioners to investigate complex biological mechanisms, such as regulatory dependencies and drug responses, at single-cell resolution using the comprehensive, fine-grained readouts of the target perturbations within single cells, and have found applications in studies of combinatorial therapy8,9, drug discovery10 and regulatory elements11,12.

The growing granularity of measurements provided by single-cell CRISPR screening technologies motivates the need for novel computational methods that help extract interpretable biological insights from the generated data, particularly in relation to perturbation effects. However, this is a challenging task due to the high dimensionality, complex structure and sparse nature of single-cell screening measurements. Analytically, the problem is to produce a prediction model that estimates the effect of a perturbation on expression for any particular cell type or cell. The model can be developed using a training dataset that consists of a series of expression measurements on single cells, in which each cell belongs to one of a finite number of cell types and has been subject to one of a finite number of perturbations (including unperturbed controls).

A common approach has been to apply deep learning-based techniques to learn the relationships between cell type, perturbation and expression output flexibly from sufficiently large datasets. To do this, expression data and cell type information are typically transformed into embeddings via deep neural networks (DNNs), which are learnt low-dimensional projections of the original measurements, and then the effects of the perturbations are described in this embedding space. Perturbed embeddings can then be remapped back to expression measurements. If there are a large number of perturbations, embeddings may also be formed for the perturbations themselves. Models are trained against an objective which seeks to minimise the discrepancy between the observed perturbation effect and that predicted via the model. A bottlenecking effect in the design of the model architectures, which perform the various embedding and output transformations, leads to the creation of compressed representations that are optimised towards the maximal retention of information.

The Compositional Perturbation Autoencoder13 (CPA) is an example of such an approach. Given the measured unperturbed and perturbed expression of a cell, CPA predicts the counterfactual distribution of the expression of that cell had it been subjected to a different perturbation. CPA adopts an autoencoder learning framework and uses additive latent embeddings of the cell and perturbation states. SAMS-VAE14 uses a sparse additive mechanism shift variational autoencoder to characterise perturbation effects as sparse latent representations. In SAMS-VAE, the latent representation of a perturbed expression vector is obtained by adding a sparse representation of the perturbation to a dense perturbation-independent basal state, and the decoder is trained to reconstruct the perturbed expression vectors from latent representations. Other approaches have sought to embed additional external information about the expression features to improve predictions. GEARS15 uses a knowledge graph of gene-gene relationships to inform the prediction, allowing it to simulate the outcomes of perturbing unseen single genes or combinations of genes. Although not deep learning-based, CellOT16 leverages DNNs for function estimation in a neural optimal transport framework17 to map between unperturbed and perturbed single-cell responses.

More recently, single cell foundation models18,19,20,21 have emerged, which promise to provide a multi-functional basis for many analytical applications22. However, the benefits of current foundation model approaches are not yet clear23,24,25,26 and fair evaluation is complicated by emerging applications that integrate direct empirical data with knowledge extracted from scientific literature or pre-trained foundation models25,27.

Non-deep learning approaches have also been developed. Guided Sparse Factor Analysis28 (GSFA) models continuous observations and adopts a linearity assumption in its multivariate latent factor regression approach. Another method12 uses a variant of the “factorize-recover” algorithm to infer perturbation effects from composite sample phenotypes in compressed Perturb-seq experimental data, using a combination of sparse principal components analysis and LASSO regression. The attractiveness of such approaches is their comparative simplicity due to the use of linearity assumptions.

The plethora of computational perturbation modelling methods29,30 disguises many practical issues that are only apparent at usage time. For instance, assumptions about experimental setup and data preprocessing can be implicitly built into models. CPA assumes categorical cell-level information and continuous gene expression inputs, whereas SAMS-VAE is not able to incorporate additional cell-level information, such as batch information or cell type, and can only handle binary perturbation and count-based expression inputs. GSFA utilises its own particular approach to input data transformation and normalisation. While CPA is able to process continuous perturbation levels (e.g., dosage), GEARS applies only to discrete perturbations as it uses perturbation embeddings and relational graphs between perturbations. Foundation models often require their own specific approaches for tokenisation and input data embedding. On a practical level, these model design differences mean that direct and intuitive comparisons between methods may not be possible, both in terms of their predictions and in terms of the explanations underlying them.

In this work, we propose a more conceptually classical approach for perturbation modelling, called GPerturb, which utilises hierarchical Bayesian modelling31 and Gaussian Process regression32. We demonstrate that GPerturb achieves predictive performance comparable to current state-of-the-art perturbation models, even though it uses a sparse, additive modelling structure and no latent embeddings or external information. Further, the modularity of the hierarchical construction allows us to examine the effect of swapping an observational data model based on count-based expression data for one which uses continuous transformed values instead. Despite the abundance of perturbation modelling methods available, GPerturb offers a novel and scalable generative modelling approach with classical features that make its predictions and their interpretation more readily understandable than those of methods based on black-box learning.

Results

Overview

We first provide an overview of GPerturb (a more detailed mathematical description is provided in “Methods”), which is a generative model that aims to directly identify and estimate sparse, interpretable gene-level perturbation effects, for analysing single-cell CRISPR screening data. In GPerturb, we assume that each expression feature measured for each cell can be explained as a sample from a distribution. In the case where expression data is continuous, a normal distribution is used (zero-inflated Poisson for count-based data), where the mean expression level is given by the combination of two components. The first is a feature-specific basal expression level which is determined by the cell-specific parameters (e.g., cell type or cell-specific sequencing information). The second component is a feature-specific perturbation effect which depends on the type of perturbation applied to the cell (which can be null). To make it explicit that some perturbations will only affect certain features, the perturbation component for each feature is controlled by a binary on/off switch. The relationships that map cell-specific parameters and perturbation type to the observed expression levels are governed by nonlinear Gaussian processes.

GPerturb adopts a supervised learning approach to learn and disentangle the basal (unperturbed) expression distribution associated with a given cell type and the additive effect of perturbations given observed gene expression measurements (Fig. 1A). Gaussian processes32 are used to model expression functions, and sparsity constraints regularise the model with the aim of improving generalisation and robustness. The generative properties of GPerturb allow perturbed expression levels to be predicted (Fig. 1B), and the sparsity in perturbation effects helps users to identify details of complex perturbation-gene dependencies.

Fig. 1: Overview of GPerturb.

A For each cell-feature, GPerturb models the distribution over observed feature expression as the sum of a basal (unperturbed) expression and a perturbation effect, using feature-specific Gaussian Processes to transform the cell-specific information and the perturbation applied. B Schematic of the mathematical construction of GPerturb showing the incorporation of sparsity models and the ability to provide observation models for continuous and count-based expression data. The red blood cells and gene therapy icons: Flaticon.com. This cover has been designed using resources from Flaticon.com.

Compared with existing methods, GPerturb does not require a latent variable construction and incorporates uncertainty propagation in an intuitive way due to the Bayesian framework. It can be applied to either raw count (GPerturb-ZIP, for zero-inflated Poisson) or continuous transformed expression measurements (GPerturb-Gaussian). Further and detailed information about the model development and relationships to existing methods can be found in “Methods”.

Single-gene perturbation analysis

We first compared the predictive performance of GPerturb, CPA, GEARS and SAMS-VAE on a subset of the genome-wide CRISPR interference Perturb-seq dataset33. For all methods, the recommended settings were used. Since SAMS-VAE takes count-based data as inputs while CPA and GEARS require continuous expression inputs, we compared SAMS-VAE against GPerturb-ZIP, and CPA and GEARS against GPerturb-Gaussian. Similar to previous studies, we randomly selected 20% of the dataset as the test set and used the rest to train GPerturb.

We compared the averaged predictions and averaged observations for each of the unique perturbations. For CPA, SAMS-VAE and GEARS, we computed and stored the average of 1000 samples of reconstructed/predicted expressions drawn from the fitted model for each of the unique perturbations. Similarly, for GPerturb, we computed and stored the averaged predicted mean expressions for each of the unique perturbations (i.e., averaged over all samples associated with a common perturbation). Table 1 shows the Pearson correlations between the predicted and observed expression levels for the perturbations, which are illustrated in Fig. 2A, C. GPerturb-ZIP attained better correlation than SAMS-VAE (rGPerturb = 0.972, rSAMS-VAE = 0.944) for count-based inputs, while CPA-mlp achieved the best performance ahead of GPerturb-Gaussian and GEARS on continuous inputs (rCPA-mlp = 0.984, rGPerturb = 0.981, rGEARS = 0.977).
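
As an illustration of the evaluation described above, the following minimal Python sketch averages observed and predicted expression within each perturbation and then computes a single Pearson correlation over the resulting perturbation-by-gene means. The function names, the toy data and the choice to pool all perturbation-gene pairs into one correlation are assumptions made for illustration; they are not taken from the GPerturb code.

```python
import math
from collections import defaultdict

def mean_profiles(labels, rows):
    """Average expression rows (lists of floats) within each perturbation label."""
    sums, counts = {}, defaultdict(int)
    for label, row in zip(labels, rows):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {k: [s / counts[k] for s in v] for k, v in sums.items()}

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def averaged_correlation(labels, observed, predicted):
    """Correlate perturbation-averaged predictions with averaged observations."""
    obs = mean_profiles(labels, observed)
    pred = mean_profiles(labels, predicted)
    keys = sorted(obs)  # fixed perturbation order for both profiles
    flat_obs = [v for k in keys for v in obs[k]]
    flat_pred = [v for k in keys for v in pred[k]]
    return pearson(flat_obs, flat_pred)

# Toy example (hypothetical values): four cells, two genes, two perturbations.
labels = ["ctrl", "ctrl", "pertA", "pertA"]
observed = [[1.0, 2.0], [3.0, 2.0], [0.0, 5.0], [2.0, 7.0]]
predicted = [[2.1, 2.0], [2.1, 2.0], [0.9, 6.2], [0.9, 6.2]]
r = averaged_correlation(labels, observed, predicted)
```

Averaging before correlating removes cell-to-cell noise, so correlations computed this way are expected to be much higher than those computed on individual cells.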

Table 1 Comparison of predictive performance
Fig. 2: Single-gene perturbation analysis.

A Comparison of predicted versus observed expression from GPerturb and GEARS using continuous expression inputs and B comparison of predicted perturbation effects. C Comparison of predictions from GPerturb and SAMS-VAE on count-based expression inputs and D comparison of predicted perturbation effects. Source data of the figure are provided as a Source Data file.

While the overall correlation between predicted and observed expression was high, Fig. 2B, D shows that the directionality of the perturbation effects given by different models did not always agree, with instances where one method might report that a perturbation gives increased gene expression while another method indicates that the perturbation leads to decreased expression. We quantified this observation in Table 2 by examining the directionality agreement over all gene-perturbation pairs. Figure 3A shows the discrepancies in the exosome-related perturbation effects between GPerturb-Gaussian, CPA and GEARS for continuous expression input. In contrast, using count-based expression input, GPerturb-ZIP and SAMS-VAE showed greater consistency suggesting that the choice of pre-processing could have a considerable impact on perturbation modelling (Fig. 3B). In order to further examine this, we were able to compare the outputs of GPerturb-Gaussian and GPerturb-ZIP on 345 perturbations grouped by pathways (Fig. 4). This showed that given the same data set, the conversion from count-based to continuous-based expression (and the necessary changes in likelihood model in GPerturb) considerably changes the predicted perturbation effects.
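
A directionality comparison of this kind can be sketched in a few lines of Python. The sketch below computes the proportion of gene-perturbation pairs on which two methods report effects of the same sign. The function names and toy values are hypothetical, and skipping pairs where either method reports an effect of exactly zero is an assumption, since the handling of zero effects is not specified in the text.

```python
def sign(x, tol=0.0):
    """Return +1, -1 or 0 depending on the direction of an effect."""
    if x > tol:
        return 1
    if x < -tol:
        return -1
    return 0

def directionality_agreement(effects_a, effects_b, tol=0.0):
    """Proportion of shared (perturbation, gene) pairs with matching sign.

    Pairs where either method reports a zero effect (within `tol`) are
    skipped; this is an illustrative choice, not the paper's definition.
    """
    agree = total = 0
    for pair, a in effects_a.items():
        if pair not in effects_b:
            continue
        sa, sb = sign(a, tol), sign(effects_b[pair], tol)
        if sa == 0 or sb == 0:
            continue
        total += 1
        agree += int(sa == sb)
    return agree / total if total else float("nan")

# Toy effects keyed by (perturbation, gene); all values are made up.
method_a = {("p1", "g1"): 0.8, ("p1", "g2"): -0.3, ("p2", "g1"): 0.1, ("p2", "g2"): 0.0}
method_b = {("p1", "g1"): 0.5, ("p1", "g2"): 0.2, ("p2", "g1"): 0.4, ("p2", "g2"): -0.6}
agreement = directionality_agreement(method_a, method_b)
```

Here two of the three non-zero pairs agree in sign, so high overall correlation and imperfect directional agreement can coexist when many effects are small.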

Table 2 Proportion of gene-perturbation pairs with agreement on directionality between methods
Fig. 3: Differences in exosome-related perturbation effects associated with model and pre-processing selection.

Top 25 most differentially expressed genes identified by (A) Gaussian and (B) Zero-Inflated Poisson versions of GPerturb and comparisons to perturbation effects inferred by GEARS, CPA and SAMS-VAE. Note that for the continuous case, we only include perturbations that are present in the GO graph of the current implementation of GEARS for sake of comparison. The difference in the scales of perturbation effects are due to the different internal data-preprocessing and normalising steps. Source data of the figure are provided as a Source Data file.

Fig. 4: Perturbations by pathway annotations.

Visualisation of 345 perturbations with pathway annotations, grouped by pathway, highlights differences between (A) Gaussian- and (B) ZIP-based GPerturb. Each row corresponds to the perturbation effects of a unique perturbation. The perturbation effect for a gene is included only if the associated posterior inclusion probability is greater than 0.95. Source data of the figure are provided as a Source Data file.

Multi-gene perturbation analysis

We next considered a Perturb-seq dataset34 consisting of 89,357 cells and 5045 genes and containing 131 two-gene perturbations. We computed the averaged predicted responses of each method for each of the two-gene perturbations and compared them to the corresponding averaged observations (Table 1 and Fig. 5A, C). Note that unlike the other methods, GEARS is able to predict perturbation outcomes of previously unseen multi-gene perturbations by using biological knowledge encoded in its knowledge graph. Although GPerturb does not use additional prior information as GEARS does, it attains comparable correlation on predictions for two-gene perturbations and outperforms CPA and SAMS-VAE (Table 1). Interestingly, as in the previous experiments, the directionality of the perturbation effects was not always consistent between methods, as illustrated in Fig. 5B, D and quantified in Table 2.

Fig. 5: Multi-gene perturbation analysis.

A Comparison of predicted and observed expression from GPerturb and GEARS using continuous expression inputs and B corresponding comparison of perturbation effects. C Comparison of predicted and observed expression from GPerturb-ZIP and SAMS-VAE using count-based expression inputs and D corresponding comparison of perturbation effects. E Comparison of predicted and observed expression from GPerturb and GEARS using continuous expression input and F corresponding comparison of perturbation effects. Source data of the figure are provided as a Source Data file.

We further compared GEARS and GPerturb using a highly multiplexed Perturb-seq dataset12 under the same setup. CPA and SAMS-VAE could not be applied to this dataset due to the large number of perturbations. We report the averaged predictions against the corresponding averaged observations for both methods in Fig. 5E. GPerturb and GEARS attained comparable predictive correlation (rGPerturb = 0.798, rGEARS = 0.802), but the predictions had high variance. Consequently, the estimated perturbation effects for each perturbation-gene pair given by GPerturb and GEARS showed only weak correlation (Fig. 5F). Unlike the previous examples, even though the two methods have similar prediction accuracy, the scale of the estimated perturbation effects given by GEARS is much smaller than that given by GPerturb. The much more conservative perturbation estimates given by GEARS are likely due to the fact that less than 30% of the genetic perturbations are present in the Gene Ontology (GO) knowledge graph in the current implementation of GEARS12.

Dosage-based perturbations

We next considered the SciPlex2 dataset2, where we examined a subset of A549 cells treated with one of four compounds, dexamethasone (Dex), Nutlin-3a (Nutlin), BMS-345541 (BMS) or vorinostat (SAHA), across seven different doses. As a benchmark we conducted an analysis using CPA13, which requires four inputs for each prediction: the cell property, a perturbation type, the expression profile of the cell corresponding to that perturbation and the perturbation type for which we want to predict the expression profile. We recorded the averaged counterfactual predictions of the negative control samples (no perturbation) under each of the 28 unique perturbations (4 compounds × 7 dosages) as counterfactual treatments. For GPerturb we recorded the averaged predictions (i.e., prediction values averaged over all cells associated with a common perturbation) for each of the 28 unique perturbations. We then compared the averaged predictions associated with all unique perturbations to the averaged observations (Table 1) and found that GPerturb outperformed two variants of CPA: CPA-mlp and CPA-logsig. The latter enforces monotonicity of the dose-response relationship in its latent space. GPerturb achieved this superior performance without requiring a basal expression profile as input, which CPA needs. Note that comparison with GEARS and SAMS-VAE was not possible since neither accounts for non-binary perturbations. We then further investigated the ability of GPerturb to model the dosage relationships. Figure 6 illustrates that the predicted dosage-dependent expression levels given by GPerturb are more closely aligned to the measured expression values than those of both CPA variants, particularly for non-monotonic dependencies between drug doses and expression levels. In particular, for PDE4D, CDKN1A and MDM2, expression varies non-monotonically with BMS dose, which is not captured by the monotonicity-constrained CPA-logsig.

Fig. 6: Analysis of continuous dosage-based perturbations.

Predicted dosage-linked expression levels given by (A) GPerturb, (B) monotonically-constrained CPA and (C) unrestricted CPA on selected genes from the Sciplex2 dataset2. Different colours are assigned to the four drugs (Dex, Nutlin, BMS, SAHA). Dotted lines correspond to the estimated means given by different methods. Shaded regions are the corresponding 95% credible band given by GPerturb. Source data of the figure are provided as a Source Data file.

A feature of the semiparametric model specification of GPerturb is that it can be used to identify distinct dosage response patterns by examining the gradient information of the estimated perturbation effects. We computed the integrated squared derivatives of the perturbation effects with respect to the dosage level exactly and efficiently using automatic differentiation within GPerturb. Large values of this metric identify the genes that are most sensitive to the dosage of a perturbation, while values near zero indicate little or no response. Examples are illustrated in Fig. 7A. Note that this derivative-based metric captures both monotonic and non-monotonic dosage-dependent behaviours. Figure 7B shows the distribution of the derivative metric for each drug on the log scale. Only a fraction of genes show high sensitivity to each drug, which makes the metric useful for discovery. Figure 7C illustrates example genes which exhibited sensitivity to multiple drugs.
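
To make the derivative-based metric concrete, the following Python sketch approximates the integral of the squared derivative of a dose-response function over a dosage interval. It deliberately swaps in central finite differences and the trapezoidal rule for the automatic differentiation used in GPerturb, and the three response curves are hypothetical stand-ins for fitted perturbation effects.

```python
import math

def integrated_squared_derivative(f, lo, hi, n=2000, h=1e-5):
    """Approximate the integral over [lo, hi] of (f'(d))**2.

    The derivative is taken by central finite differences (step h) and the
    integral by the trapezoidal rule on n equal sub-intervals. Note that f is
    evaluated slightly outside [lo, hi] at the end points.
    """
    step = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        d = lo + i * step
        deriv = (f(d + h) - f(d - h)) / (2.0 * h)
        weight = 0.5 if i in (0, n) else 1.0  # trapezoid end-point weights
        total += weight * deriv ** 2
    return total * step

# Hypothetical dose-response curves on a dosage scale rescaled to [0, 1].
def flat(d):
    return 0.3                    # no response to dosage

def monotone(d):
    return 2.0 * d                # steadily increasing response

def bump(d):
    return math.sin(math.pi * d)  # rises, then falls (non-monotonic)

flat_score = integrated_squared_derivative(flat, 0.0, 1.0)
monotone_score = integrated_squared_derivative(monotone, 0.0, 1.0)
bump_score = integrated_squared_derivative(bump, 0.0, 1.0)
```

Because the derivative is squared before integration, the non-monotonic curve receives a large score even though its net change over the dosage range is zero, which is why the metric captures both kinds of behaviour.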

Fig. 7: Sensitivity to perturbations.

A Dosage versus estimated perturbed expression. Each row corresponds to one of the four drugs (Vorinostat, Nutlin-3a, dexamethasone, BMS-345541). The left two columns show the most sensitive genes as measured by the integral of the squared derivative, the middle two columns show genes with medium sensitivity (derivative metric around 1), and the right two columns show genes with low sensitivity (derivative metric around 0). Dots correspond to the observed expression for different gene-perturbation pairs averaged over the test set. Dotted lines correspond to the estimated means. Shaded regions are the corresponding 95% credible bands given by GPerturb. B Violin plot of the log of integrals of squared derivatives for all 5000 genes in the Sciplex2 dataset. Different colours correspond to different drugs. C Bar plots of the sum of derivative metrics of the top 5 genes most sensitive to each of the four drugs. Note that AKR1B10 and ALDH3A1 are sensitive to more than one drug. Source data of the figure are provided as a Source Data file.

Comparisons to linear models

In this section, we apply GPerturb to a LUHMES neural progenitor cell CROP-seq dataset28, and compare its performance to GSFA. This study targets 14 neurodevelopmental genes, including 13 autism risk genes, in LUHMES human neural progenitor cells. The resulting dataset consists of N = 8708 samples and P = 6000 selected genes. The perturbations were encoded as one-hot vectors of length 14, each element corresponding to one of the 14 targeted neurodevelopmental genes (i.e., 14 distinct perturbations). The cell information is a real vector of length 4 (lib_size: number of total UMI counts, n_features: number of genes with non-zero UMI readings, mt_percent: percentage of mitochondrial gene expression and batch: batch ID). In addition to the one-hot perturbations, the dataset also contains negative control gRNAs, whose perturbations are encoded as zero vectors. For our proposed method, we randomly selected 20% of the dataset as the test set and used the rest to train GPerturb. For GSFA, the results were obtained using the recommended settings28.
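
The perturbation encoding described above can be sketched as follows. The function name, the placeholder target names and the "control" label are assumptions for illustration; the real dataset uses the 14 targeted gene names and its own gRNA labels.

```python
def one_hot_perturbations(assignments, targets, control_label="control"):
    """Return one row per cell: a one-hot vector over `targets`,
    or an all-zero vector for negative controls."""
    column = {gene: j for j, gene in enumerate(targets)}
    rows = []
    for label in assignments:
        row = [0] * len(targets)
        if label != control_label:
            row[column[label]] = 1  # raises KeyError for unknown labels
        rows.append(row)
    return rows

targets = ["T1", "T2", "T3"]                 # placeholder target names
assignments = ["T2", "control", "T3", "T1"]  # one gRNA label per cell
C = one_hot_perturbations(assignments, targets)
```

With 14 targets, each row would have length 14, and cells carrying a negative control gRNA would map to the zero vector, so the basal expression is recovered when the perturbation input is null.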

We further note that in the original study28, the authors first removed cell-level information from the continuous expression inputs by regressing the expression data on the cell information, and then applied GSFA to the corresponding standardised residual matrix. In contrast, GPerturb disentangles and estimates cell-level and perturbation-induced variations simultaneously, and does not require any additional standardisation. We ran GPerturb with and without this standardisation, but note that the analysis is more interpretable on the original scale, as additional transformations can affect predictive power and quality.

Figure 8A illustrates the predictions with GPerturb on the original data scale and after standardisation. While GPerturb shows good predictive performance without the GSFA standardisation applied to the data, it achieves similar correlative performance to GSFA when the data standardisation is used (Pearson correlation rGPerturb = 0.248, rGSFA = 0.182).

Fig. 8: Comparison to GSFA.

A Comparison of predicted and observed expression for LUHMES neural progenitor cell CROP-seq using GPerturb with and without GSFA-based standardisation and B GSFA reported predicted and observed standardised expression. C Comparison of predicted and observed expression for primary human CD8+ T cells datasets28 given by GPerturb with and without GSFA-based standardisation and D GSFA reported predicted and observed standardised expression. Source data of the figure are provided as a Source Data file.

We also applied GPerturb-Gaussian to a primary human CD8+ T cells dataset28 in a similar fashion. This study targets 20 genes associated with the T cell response, in both stimulated and unstimulated T cells. The processed dataset consists of N = 24,955 samples and P = 6000 genes. The perturbations were encoded as one-hot vectors of length 20, which correspond to the 20 targeted genes in the study, and cell information was provided as a real vector of length 5 (lib_size: number of total UMI counts, n_features: number of genes with non-zero UMI readings, mt_percent: percentage of mitochondrial gene expression, donor: T Cell donor ID and stimulated: whether or not the T Cell is stimulated).

In GSFA, a modification is used to capture differences in perturbation effects between stimulated and unstimulated cells. We replicated this modification in the GPerturb model to accommodate potentially different perturbation effects for stimulated and unstimulated T cells. Similar to the previous example, we randomly selected 20% of the dataset as the test set and used the rest to train GPerturb. We report the fitted results in Fig. 8D. When comparing the predictive performance, GSFA showed greater correlation than GPerturb on the standardised data (rGPerturb = 0.271, rGSFA = 0.335). However, Fig. 8C shows that the standardisation applied by GSFA was likely to be disadvantageous and unnecessary with GPerturb, since GPerturb can be applied without the initial cell information regression step.

Discussion

There are a number of state-of-the-art single-cell perturbation modelling methods currently available (including many not directly considered here), but a detailed analysis of the pre-processing, training and inference requirements of each method highlights significant differences in approach between them. While there has been considerable interest in deep learning-based approaches, GPerturb adopts a more classical non-linear regression-based modelling strategy, which supports model training and prediction by directly modelling individual genes rather than using the latent representations found in many other recent methods. Our analysis shows that GPerturb is capable of attaining state-of-the-art performance despite these significant design differences and is highly versatile and computationally efficient (see Supplementary Table 1). A feature shown in our experimental results (Table 1) is that GPerturb, in both forms, could be applied to all four examples, while other methods could only be used for a subset of these. This highlights the versatility of GPerturb, as it is able to handle single-gene, multi-gene, categorical and continuous perturbation inputs. While predictive performance derived from retrospective analysis of existing datasets is an extremely important metric, validation on independent experiments remains vital.

Our experiments show that direct performance comparisons between methods must be interpreted carefully and may not always be applicable. For example, comparisons between GSFA and other methods were not shown since GSFA operates and returns results in terms of standardised input data residuals. In addition, we also tried to compare the estimated sparse perturbation effects given by GPerturb with SAMS-VAE and CPA, but found no straightforward way to do so due to the fact that GPerturb directly estimates sparse perturbation effects on the gene level, while CPA and SAMS-VAE focus on finding sparse low-dimensional embeddings of them. Furthermore, since our proposed GPerturb framework allows handling of both continuous normalised and count-based data using Gaussian and zero-inflated Poisson based likelihoods, we have observed that while the alternate versions of GPerturb attain comparable prediction accuracy with methods using comparable input data, the perturbation effects captured by the Gaussian and ZIP versions of GPerturb could be different. This highlights that variations in data processing and modelling could affect the conclusions drawn from the same raw data and adds a further source of uncertainty on the true validity of any biological insights drawn from perturbation prediction methods.

Our experiments are consistent with other recent more extensive evaluation studies25,26,35, which have also found that prediction performance is highly context-dependent and that no single method excels across all scenarios. These evaluations include recently developed single-cell foundation models, which can also be applied for perturbation effect prediction. In some cases, performance of these foundation models may be no better than simple linear models26. The scalable Gaussian Process regression models we have introduced in GPerturb provide a highly-effective and complementary approach for single-cell perturbation modelling. These models can be of utility for direct prediction tasks or as a methodologically distinct benchmark for the development of new methods. Future work could examine extensions of this Gaussian Process framework as a credible non-deep learning based approach for handling multi-omics or spatially resolved molecular data.

Methods

GPerturb-Gaussian

We first discuss the model with X being a matrix of pre-processed continuous responses. A count data version of the model, along with a schematic illustration of the additive modelling structure, is given later.

Let \({k}_{{\nu }_{\mu }},{k}_{{\nu }_{\gamma }},{k}_{{\nu }_{\eta }}:{{\mathbb{R}}}^{L}\times {{\mathbb{R}}}^{L}\to {\mathbb{R}}\) be Gaussian process kernels governed by kernel parameters \({\nu }_{\mu },{\nu }_{\gamma },{\nu }_{\eta }\), respectively. Let \({g}_{\mu },{g}_{\gamma },{g}_{\eta }:{{\mathbb{R}}}^{L}\to {\mathbb{R}}\) be the mean functions of the corresponding Gaussian processes.

We first define the gene-level additive perturbation model as follows:

$${m}_{p}:{{\mathbb{R}}}^{D}\to {\mathbb{R}};\quad {\lambda }_{p}\in {\mathbb{R}};$$
(1)
$${\mu }_{p} \sim {{\mathcal{GP}}}({g}_{\mu },{k}_{{\nu }_{\mu }});\quad {\gamma }_{p} \sim {{\mathcal{GP}}}({g}_{\gamma },{k}_{{\nu }_{\gamma }});$$
(2)
$${\eta }_{p} \sim {{\mathcal{GP}}}({g}_{\eta },{k}_{{\nu }_{\eta }});\quad {z}_{ip} \sim {{\rm{Bernoulli}}}\left(\sigma ({\eta }_{p}({{{\bf{C}}}}_{i}))\right);$$
(3)
$${X}_{ip} \sim {{\mathcal{N}}}\left({m}_{p}({{{\bf{K}}}}_{i})+{z}_{ip}{\mu }_{p}({{{\bf{C}}}}_{i}),\log (\exp ({\lambda }_{p}+{z}_{ip}{\gamma }_{p}({{{\bf{C}}}}_{i}))+1)\right),$$
(4)

where \(\sigma (x)=\frac{1}{1+\exp (-x)}\). In this model setup, \({m}_{p}\) is a fixed but unknown function that takes the cell-level information vector \({{{\bf{K}}}}_{i}\) associated with the ith sample as input, and returns \({m}_{p}({{{\bf{K}}}}_{i})\) as the expected basal expression level of the pth gene in the ith sample. \({\lambda }_{p}\) is the basal variability parameter of the expression level of the pth gene, shared across all samples i = 1, …, N. \({z}_{ip}\) is a binary toggle controlling whether or not the expression level of the pth gene in the ith sample is perturbed by the perturbation vector \({{{\bf{C}}}}_{i}\). The success probability of \({z}_{ip}\) depends on \({{{\bf{C}}}}_{i}\) through \({\eta }_{p}({{{\bf{C}}}}_{i})\), a random function \({\eta }_{p}\) evaluated at \({{{\bf{C}}}}_{i}\) (note that under our additive modelling setup, the binary toggles \({z}_{ip}\) are the same for all cells receiving the same perturbation). \({\mu }_{p}\) and \({\gamma }_{p}\) are also random functions that take \({{{\bf{C}}}}_{i}\) associated with the ith sample as input, and return \({\mu }_{p}({{{\bf{C}}}}_{i})\) and \({\gamma }_{p}({{{\bf{C}}}}_{i})\) as the potential mean- and variability-level perturbation effects on the expression level of the pth gene in the ith sample. A schematic illustration and graphical representation of the Gaussian model is given in Supplementary Fig. 1.
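
The generative structure in Eqs. (1)–(4) can be sketched for a single gene in Python. This is an illustrative sketch under strong simplifying assumptions: the Gaussian process draws for the mean effect, the variability effect and the inclusion logit are replaced by fixed, hypothetical toy functions, and all names and numerical values are invented for the example rather than taken from the GPerturb implementation.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    return math.log(math.exp(x) + 1.0)

# Hypothetical stand-ins for one gene p. In the model, mu_p, gamma_p and eta_p
# are Gaussian process draws; here they are fixed toy functions.
def m_p(K):      # basal mean from cell-level information K
    return 1.0 + 0.5 * K[0]

def mu_p(C):     # mean-level perturbation effect
    return 2.0 * C[0] - 1.0 * C[1]

def gamma_p(C):  # variability-level perturbation effect
    return 0.5 * sum(C)

def eta_p(C):    # logit of the inclusion probability
    return 3.0 if any(C) else -3.0

LAMBDA_P = 0.0   # basal variability parameter

def sample_toggle(C, rng):
    """Draw z ~ Bernoulli(sigmoid(eta_p(C))), once per perturbation."""
    return 1 if rng.random() < sigmoid(eta_p(C)) else 0

def mean_and_variance(K, C, z):
    """Mean and variance of the Gaussian in Eq. (4) for a given toggle z."""
    mean = m_p(K) + z * mu_p(C)
    variance = softplus(LAMBDA_P + z * gamma_p(C))
    return mean, variance

def sample_expression(K, C, z, rng):
    mean, variance = mean_and_variance(K, C, z)
    return rng.gauss(mean, math.sqrt(variance))  # gauss takes a std deviation

rng = random.Random(0)
C = [1, 0]                     # one-hot perturbation vector
z = sample_toggle(C, rng)      # shared by all cells receiving C
cells = [[0.0], [2.0], [4.0]]  # toy cell-level information vectors
samples = [sample_expression(K, C, z, rng) for K in cells]
```

When the toggle is off, the mean and variance reduce to the basal quantities alone, which is how the model expresses that a perturbation leaves a given gene unaffected.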

We then assume that the observed perturbed expression level Xip associated with perturbation Ci and cell-level information Ki follows a Gaussian distribution whose mean is the sum of the basal mean mp(Ki) and the mean-level perturbation effect zipμp(Ci), and whose variance is a positive function of the sum of the common basal variability λp and the variability-level perturbation effect zipγp(Ci). We use the function \(\log (\exp (\cdot )+1)\) to map the unconstrained variability parameters to positive variance parameters because it is approximately linear when its input is large and positive. Compared with e.g., \(\exp (\cdot )\), it therefore partially retains the additive structure between the basal states and perturbation effects, in a similar fashion to the mean parameters.
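As a rough illustration of the likelihood in Eq. (4), the NumPy sketch below computes the mean and variance of Xip from already-evaluated function values and draws one sample. The function names (`softplus`, `gaussian_moments`) and all numeric inputs are hypothetical and are not taken from the GPerturb implementation.

```python
import numpy as np

def softplus(x):
    """Numerically stable log(exp(x) + 1)."""
    return np.logaddexp(0.0, x)

def gaussian_moments(m, lam, mu, gamma, z):
    """Mean and variance of X_ip in Eq. (4), elementwise over genes."""
    mean = m + z * mu                  # basal mean plus toggled mean-level effect
    var = softplus(lam + z * gamma)    # positive variance from unconstrained inputs
    return mean, var

# Hypothetical values for three genes in a single cell.
m = np.array([1.0, 2.0, 0.5])       # basal means m_p(K_i)
lam = np.array([0.0, 0.0, 0.0])     # basal variability lambda_p
mu = np.array([0.8, -1.2, 0.3])     # mean-level effects mu_p(C_i)
gamma = np.array([0.5, 0.7, -0.4])  # variability-level effects gamma_p(C_i)
z = np.array([1.0, 0.0, 1.0])       # binary toggles z_ip

mean, var = gaussian_moments(m, lam, mu, gamma, z)
rng = np.random.default_rng(0)
x = mean + np.sqrt(var) * rng.standard_normal(mean.shape)  # one draw of X_ip
```

For the second gene the toggle is off, so its mean and variance reduce to the basal values mp(Ki) and log(exp(λp) + 1).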

Under our modelling setup, the parameters naturally partition into two groups: the random perturbation-specific parameters \({\{{\mu }_{p},{\gamma }_{p},{\eta }_{p},{\{{z}_{ip}\}}_{i=1}^{N}\}}_{p=1}^{P}\), and the unknown but fixed basal level parameters \({\{{m}_{p},{\lambda }_{p}\}}_{p=1}^{P}\). We do not treat the basal level parameters as random variables since the primary objective of the model is to learn how the gene expression levels Xip respond to different perturbations. In the proposed model, the basal states only play the role of an “intercept” or “offset”, and are not of primary interest to us. In addition, treating the basal level parameters as unknown but fixed model parameters simplifies the inference procedure and reduces the computational cost of the proposed model.

GPerturb-ZIP

We now discuss the model for expression count data. Similar to the continuous model, we define the count-data-based gene-level additive perturbation model using a zero-inflated Poisson likelihood as follows:

$${m}_{p}:{{\mathbb{R}}}^{D}\to {\mathbb{R}};\quad {\mu }_{p} \sim {{\mathcal{GP}}}({g}_{\mu },{k}_{{\nu }_{\mu }});\quad {\eta }_{p} \sim {{\mathcal{GP}}}({g}_{\eta },{k}_{{\nu }_{\eta }});$$
(5)
$${\pi }_{p}\in (0,1);\quad {z}_{ip} \sim {{\rm{Bernoulli}}}\left(\sigma ({\eta }_{p}({{{\bf{C}}}}_{i}))\right);$$
(6)
$${X}_{ip} \sim {{\rm{ZIP}}}\left(\log (\exp ({m}_{p}({{{\bf{K}}}}_{i})+{z}_{ip}{\mu }_{p}({{{\bf{C}}}}_{i}))+1),{\pi }_{p}\right),$$
(7)

where μp, mp, ηp and zip have the same interpretation as in the deviance-based model, πp is the proportion of excessive zeros on the pth gene shared across all samples i = 1, …, N, and ZIP(μ, π) denotes a zero-inflated Poisson distribution with expected Poisson rate μ and probability of excessive zeros π. Note that our ZIP model does not aim to estimate the pattern of excessive zeros of the dataset. Hence the quantity zipμp(Ci) should be interpreted as “the conditional perturbation effect given that the corresponding observation Xip is not an excessive zero”. A schematic illustration and a graphical representation of the zero-inflated Poisson model are given in Supplementary Fig. 2.
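To make the likelihood in Eq. (7) concrete, the following sketch evaluates the p.m.f. of a zero-inflated Poisson distribution with the rate obtained from mp(Ki) + zipμp(Ci). It uses only the Python standard library; the helper names and the numeric values are hypothetical.

```python
import math

def softplus(x):
    """Numerically stable log(exp(x) + 1) for a scalar."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def zip_pmf(x, rate, pi):
    """P(X = x) for a zero-inflated Poisson with Poisson rate `rate`
    and probability of excessive zeros `pi`."""
    poisson = math.exp(-rate) * rate ** x / math.factorial(x)
    # An excessive zero contributes mass pi at x = 0 only.
    return (pi if x == 0 else 0.0) + (1.0 - pi) * poisson

# Hypothetical values for one gene in one cell.
m_val, mu_val, z_val, pi_val = 1.0, 0.5, 1, 0.3
rate = softplus(m_val + z_val * mu_val)   # Poisson rate in Eq. (7)
probs = [zip_pmf(x, rate, pi_val) for x in range(60)]
expected_count = sum(x * p for x, p in enumerate(probs))
```

The expected count equals (1 − πp) times the Poisson rate, which is why zipμp(Ci) is read as an effect conditional on the observation not being an excessive zero.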

We also considered handling potential over-dispersion by modelling X with a zero-inflated Gamma-Poisson likelihood (a different parameterisation of the negative binomial). However, we found that the Gamma-Poisson and Poisson models achieved similar levels of prediction performance on real datasets, and that the majority of estimated dispersion parameters were far less than 1, showing no strong sign of over-dispersion. Hence we focus on the Poisson model in this section for the sake of simplicity. The details of the zero-inflated Gamma-Poisson model can be found in the Supplementary Information.

For both the deviance-based Gaussian and the zero-inflated Poisson model, we recommend setting \({k}_{{\nu }_{\mu }},{k}_{{\nu }_{\gamma }},{k}_{{\nu }_{\eta }}\) to be RBF kernels \(k({x}_{1},{x}_{2})={\nu }^{(1)}\exp (-{\nu }^{(2)}| | {x}_{1}-{x}_{2}| {| }_{2}^{2})\) with kernel parameters \({\nu }_{\mu }^{(1)},{\nu }_{\gamma }^{(1)},{\nu }_{\eta }^{(1)}=1\) and \({\nu }_{\mu }^{(2)},{\nu }_{\gamma }^{(2)},{\nu }_{\eta }^{(2)}=0.1\), and mean functions gμ = gγ = 0 and gη = −3, as these prior specifications gave satisfactory results in all of our numerical experiments. These choices of priors reflect our belief that all μp(Ci) and γp(Ci) have the same marginal prior \({{\mathcal{N}}}(0,1)\), and that the prior on σ(ηp(Ci)), the inclusion probability of the perturbation effect of Ci on the pth gene, is concentrated around 0.05. Alternative choices are also discussed in the following sections.
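The recommended prior settings can be checked numerically. The sketch below implements the RBF kernel with ν(1) = 1 and ν(2) = 0.1 and evaluates the inclusion probability at the prior mean gη = −3; the function names and the two example perturbation vectors are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(x1, x2, nu1=1.0, nu2=0.1):
    """k(x1, x2) = nu1 * exp(-nu2 * ||x1 - x2||_2^2)."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return nu1 * np.exp(-nu2 * np.dot(diff, diff))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Two hypothetical one-hot perturbation vectors.
c_a = np.array([1.0, 0.0, 0.0])
c_b = np.array([0.0, 1.0, 0.0])

prior_var = rbf_kernel(c_a, c_a)   # marginal prior variance of mu_p(C) and gamma_p(C)
prior_cov = rbf_kernel(c_a, c_b)   # prior covariance between two distinct one-hot perturbations
inclusion_at_prior_mean = sigmoid(-3.0)  # sigma(g_eta), roughly 0.047
```

With one-hot inputs the squared distance between any two distinct perturbations is 2, so their prior covariance is exp(−0.2) under these settings.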

Posterior inference

In this section, we discuss the posterior inference strategy of the proposed models. We first give the posterior inference procedure of the deviance-based model. Let \({{\boldsymbol{\lambda }}}={\{{\lambda }_{p}\}}_{p=1}^{P}\). Let \(p\left({\{{\mu }_{p}({{{\bf{C}}}}_{i}),{\gamma }_{p}({{{\bf{C}}}}_{i}),{\eta }_{p}({{{\bf{C}}}}_{i}),{z}_{ip}\}}_{i,p=1}^{N,P}\right)\) be the prior of the associated perturbation-specific parameters. Let

$$ p\left({{\bf{X}}}| {{\bf{C}}},{{\bf{K}}},{\{{\mu }_{p}({{{\bf{C}}}}_{i}),{\gamma }_{p}({{{\bf{C}}}}_{i}),{\eta }_{p}({{{\bf{C}}}}_{i}),{z}_{ip},{m}_{p}({{{\bf{K}}}}_{i})\}}_{i,p=1}^{N,P},{{\boldsymbol{\lambda }}}\right)\\ ={\prod }_{i,p=1}^{N,P}{{\mathcal{N}}}\left({X}_{ip};{m}_{p}({{{\bf{K}}}}_{i})+{z}_{ip}{\mu }_{p}({{{\bf{C}}}}_{i}),\log (\exp ({\lambda }_{p}+{z}_{ip}{\gamma }_{p}({{{\bf{C}}}}_{i}))+1)\right)$$
(8)

be the likelihood of the observed gene-expression level matrix X given the perturbation matrix C, the cell-level information matrix K, and all model parameters. Since the number of samples N and the number of genes P are usually large, jointly estimating the posterior \(p\left({\{{\mu }_{p}({{{\bf{C}}}}_{i}),{\gamma }_{p}({{{\bf{C}}}}_{i}),{\eta }_{p}({{{\bf{C}}}}_{i}),{z}_{ip}\}}_{i,p=1}^{N,P}| {{\bf{X}}},{{\bf{C}}},{{\bf{K}}},{\{{m}_{p}({K}_{i})\}}_{i,p=1}^{N,P},{{\boldsymbol{\lambda }}}\right)\) is computationally infeasible. We therefore use amortised variational inference36 to address this issue: Let \({f}_{{{\boldsymbol{\xi }}}}:{{\mathbb{R}}}^{L}\to {{\mathbb{R}}}^{6P}\) be a neural network parameterized by a real vector ξ. We approximate the full posterior \(p\left({\{{\mu }_{p}({{{\bf{C}}}}_{i}),{\gamma }_{p}({{{\bf{C}}}}_{i}),{\eta }_{p}({{{\bf{C}}}}_{i}),{z}_{ip}\}}_{i,p=1}^{N,P}| {{\bf{X}}},{{\bf{C}}},{{\bf{K}}},{\{{m}_{p}({K}_{i})\}}_{i,p=1}^{N,P},{{\boldsymbol{\lambda }}}\right)\) using the following variational posterior:

$${q}_{{{\boldsymbol{\xi }}}}\left({\{{\mu }_{p}({{{\bf{C}}}}_{i}),{\gamma }_{p}({{{\bf{C}}}}_{i}),{\eta }_{p}({{{\bf{C}}}}_{i}),{z}_{ip}\}}_{i,p=1}^{N,P}\right)= {\prod }_{i,p=1}^{N,P}\left({{\mathcal{N}}}\left({\mu }_{p}({{{\bf{C}}}}_{i});{f}_{{{\boldsymbol{\xi }}}}^{(p)}({{{\bf{C}}}}_{i}),\exp ({f}_{{{\boldsymbol{\xi }}}}^{(p+P)}({{{\bf{C}}}}_{i}))\right)\right.\\ \times {{\mathcal{N}}}\left({\gamma }_{p}({{{\bf{C}}}}_{i});{f}_{{{\boldsymbol{\xi }}}}^{(p+2P)}({{{\bf{C}}}}_{i}),\exp ({f}_{{{\boldsymbol{\xi }}}}^{(p+3P)}({{{\bf{C}}}}_{i}))\right) \\ \times {{\mathcal{N}}}\left({\eta }_{p}({{{\bf{C}}}}_{i});{f}_{{{\boldsymbol{\xi }}}}^{(p+4P)}({{{\bf{C}}}}_{i}),\exp ({f}_{{{\boldsymbol{\xi }}}}^{(p+5P)}({{{\bf{C}}}}_{i}))\right)\\ \left.\times {{\rm{Bernoulli}}}\left({z}_{ip};\sigma ({\eta }_{p}({{{\bf{C}}}}_{i}))\right)\right),$$
(9)

where \({f}_{{{\boldsymbol{\xi }}}}^{(p)}({{{\bf{C}}}}_{i})\) denotes the pth entry of fξ(Ci), \({{\mathcal{N}}}(\cdot ;\mu,{s}^{2})\) denotes a Gaussian p.d.f. with mean μ and variance \({s}^{2}\), and \({{\rm{Bernoulli}}}(\cdot ;\pi )\) denotes a Bernoulli p.m.f. with success probability π. Similarly, for the fixed but unknown basal level functions \({\{{m}_{p}\}}_{p=1}^{P}\), we let \({f}_{{{\boldsymbol{\phi }}}}:{{\mathbb{R}}}^{D}\to {{\mathbb{R}}}^{P}\) be a neural network parameterized by a real vector ϕ, and use \({f}_{{{\boldsymbol{\phi }}}}^{(p)}({{{\bf{K}}}}_{i})\) to parameterize mp(Ki) for all i = 1, …, N and p = 1, …, P. The evidence lower bound (ELBO) of the deviance-based model then takes the form

$${{{\rm{ELBO}}}}_{G}({{\boldsymbol{\xi }}},{{\boldsymbol{\phi }}},{{\boldsymbol{\lambda }}};{{\bf{X}}},{{\bf{C}}},{{\bf{K}}})= {E}_{{q}_{{{\boldsymbol{\xi }}}}}\left(\log p({{\bf{X}}}| {{\bf{C}}},{{\bf{K}}},{\{{\mu }_{p}({{{\bf{C}}}}_{i}),{\gamma }_{p}({{{\bf{C}}}}_{i}),{\eta }_{p}({{{\bf{C}}}}_{i}),{z}_{ip},{f}_{{{\boldsymbol{\phi }}}}^{(p)}({K}_{i})\}}_{i,p=1}^{N,P},{{\boldsymbol{\lambda }}})\right)\\ -KL\left({q}_{{{\boldsymbol{\xi }}}}\left({\{{\mu }_{p}({{{\bf{C}}}}_{i}),{\gamma }_{p}({{{\bf{C}}}}_{i}),{\eta }_{p}({{{\bf{C}}}}_{i}),{z}_{ip}\}}_{i,p=1}^{N,P}\right), \right.\\ \left. p\left({\{{\mu }_{p}({{{\bf{C}}}}_{i}),{\gamma }_{p}({{{\bf{C}}}}_{i}),{\eta }_{p}({{{\bf{C}}}}_{i}),{z}_{ip}\}}_{i,p=1}^{N,P}\right)\right),$$
(10)

where \({E}_{{q}_{{{\boldsymbol{\xi }}}}}\) denotes expectation with respect to the variational posterior qξ. We estimate the variational posterior qξ and all other model parameters by maximizing (an empirical version of) \({{{\rm{ELBO}}}}_{G}({{\boldsymbol{\xi }}},{{\boldsymbol{\phi }}},{{\boldsymbol{\lambda }}};{{\bf{X}}},{{\bf{C}}},{{\bf{K}}})\). The Bernoulli random variables in Eq. (9) are approximated using the Gumbel-softmax relaxation37.
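The Gumbel-softmax relaxation replaces each discrete zip by a continuous value in (0, 1) that can be differentiated with respect to ηp(Ci). A minimal NumPy sketch of the two-class case is given below; the temperature value, the function name `relaxed_bernoulli` and the example logit are assumptions chosen for illustration, not settings reported in the paper.

```python
import numpy as np

def relaxed_bernoulli(logits, temperature, rng):
    """Two-class Gumbel-softmax sample for Bernoulli(sigmoid(logits)).

    Adds independent Gumbel(0, 1) noise to the two class logits
    (logits, 0) and returns the softmax weight of the first class.
    """
    g1 = rng.gumbel(size=np.shape(logits))
    g0 = rng.gumbel(size=np.shape(logits))
    return 1.0 / (1.0 + np.exp(-(logits + g1 - g0) / temperature))

rng = np.random.default_rng(0)
eta = np.full(20000, -1.0)                 # hypothetical values of eta_p(C_i)
soft_z = relaxed_bernoulli(eta, temperature=0.5, rng=rng)
hard_z = soft_z > 0.5                      # thresholding recovers a Bernoulli draw
frac_on = hard_z.mean()
```

Thresholding the relaxed sample at 0.5 gives an exact Bernoulli(σ(η)) draw regardless of the temperature, while lower temperatures push the relaxed values themselves closer to 0 or 1.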

Let \(\{{{{\boldsymbol{\xi }}}}^{*},{{{\boldsymbol{\phi }}}}^{*},{{{\boldsymbol{\lambda }}}}^{*}\}=\arg {\max }_{{{\boldsymbol{\xi }}},{{\boldsymbol{\phi }}},{{\boldsymbol{\lambda }}}}{{{\rm{ELBO}}}}_{G}({{\boldsymbol{\xi }}},{{\boldsymbol{\phi }}},{{\boldsymbol{\lambda }}};{{\bf{X}}},{{\bf{C}}},{{\bf{K}}})\). Once the model has been fitted, we can construct approximate point and interval estimates of the parameters of interest. For example, let \({{{\bf{C}}}}^{{\prime} }\) be a generic perturbation vector. One can form approximate point or interval estimates of the posterior inclusion probability \(\sigma ({\eta }_{p}({{{\bf{C}}}}^{{\prime} }))\), which controls whether the expression level of the pth gene is perturbed by \({{{\bf{C}}}}^{{\prime} }\), using the variational posterior \({q}_{{{{\boldsymbol{\xi }}}}^{*}}({\eta }_{p}({{{\bf{C}}}}^{{\prime} }))={{\mathcal{N}}}\left({f}_{{{{\boldsymbol{\xi }}}}^{*}}^{(p+4P)}({{{\bf{C}}}}^{{\prime} }),\exp ({f}_{{{{\boldsymbol{\xi }}}}^{*}}^{(p+5P)}({{{\bf{C}}}}^{{\prime} }))\right)\). Compared with the local false sign rate (LFSR)38 used in GSFA28, identifying perturbation effects using the posterior inclusion probability is more intuitive and interpretable thanks to the fully Bayesian framework of the proposed model.
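Because the sigmoid is monotone, quantiles of σ(ηp(C′)) under the Gaussian variational posterior are simply the sigmoid of the corresponding quantiles of ηp(C′). The sketch below uses this to form a median and an equal-tailed interval; the function name and the example mean and variance are hypothetical stand-ins for the fitted network outputs.

```python
import math
from statistics import NormalDist

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def inclusion_probability_summary(eta_mean, eta_var, level=0.95):
    """Median and equal-tailed interval of sigmoid(eta), eta ~ N(eta_mean, eta_var)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # standard normal quantile
    sd = math.sqrt(eta_var)
    lower = sigmoid(eta_mean - z * sd)
    median = sigmoid(eta_mean)
    upper = sigmoid(eta_mean + z * sd)
    return lower, median, upper

# Hypothetical variational mean and variance for one gene and one perturbation.
lower, median, upper = inclusion_probability_summary(eta_mean=-3.0, eta_var=0.25)
```

A gene could then be flagged as perturbed when, for instance, the lower bound exceeds a chosen threshold; the threshold itself is a modelling decision not fixed by this sketch.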

We now discuss the posterior inference of the Zero-inflated Poisson model. It can be parameterised and estimated in a similar fashion to the deviance-based model: Let \({{\boldsymbol{\pi }}}={\{{\pi }_{p}\}}_{p=1}^{P}\). Let \(p\left({\{{\mu }_{p}({{{\bf{C}}}}_{i}),{\eta }_{p}({{{\bf{C}}}}_{i}),{z}_{ip}\}}_{i,p=1}^{N,P}\right)\) be the prior of the associated perturbation-specific parameters in the Zero-inflated Poisson model. Let

$$\begin{array}{rcl}&&p\left({{\bf{X}}}| {{\bf{C}}},{{\bf{K}}},{\{{\mu }_{p}({{{\bf{C}}}}_{i}),{\eta }_{p}({{{\bf{C}}}}_{i}),{z}_{ip},{m}_{p}({{{\bf{K}}}}_{i})\}}_{i,p=1}^{N,P},{{\boldsymbol{\pi }}}\right)\\ &&=\mathop{\prod }_{i,p=1}^{N,P}{{\rm{ZIP}}}\left({X}_{ip};\log (\exp ({m}_{p}({{{\bf{K}}}}_{i})+{z}_{ip}{\mu }_{p}({{{\bf{C}}}}_{i}))+1),{\pi }_{p}\right)\end{array}$$
(11)

be the likelihood of the raw count data X, where \({{\rm{ZIP}}}(\cdot ;\mu,\pi )\) is the p.m.f. of a zero-inflated Poisson distribution with Poisson rate μ and probability of excessive zeros π. Similar to the deviance-based model, let \({f}_{{{\boldsymbol{\theta }}}}:{{\mathbb{R}}}^{L}\to {{\mathbb{R}}}^{4P}\) be a neural network parameterised by a real vector θ. Let

$${q}_{{{\boldsymbol{\theta }}}}\left({\{{\mu }_{p}({{{\bf{C}}}}_{i}),{\eta }_{p}({{{\bf{C}}}}_{i}),{z}_{ip}\}}_{i,p=1}^{N,P}\right)= {\prod }_{i,p=1}^{N,P}\left({{\mathcal{N}}}\left({\mu }_{p}({{{\bf{C}}}}_{i});{f}_{{{\boldsymbol{\theta }}}}^{(p)}({{{\bf{C}}}}_{i}),\exp ({f}_{{{\boldsymbol{\theta }}}}^{(p+P)}({{{\bf{C}}}}_{i}))\right)\right.\\ \left.\times {{\mathcal{N}}}\left({\eta }_{p}({{{\bf{C}}}}_{i});{f}_{{{\boldsymbol{\theta }}}}^{(p+2P)}({{{\bf{C}}}}_{i}),\exp ({f}_{{{\boldsymbol{\theta }}}}^{(p+3P)}({{{\bf{C}}}}_{i}))\right) \right. \\ \left. \times {{\rm{Bernoulli}}}\left({z}_{ip};\sigma ({\eta }_{p}({{{\bf{C}}}}_{i}))\right)\right),$$
(12)

be the variational posterior of the perturbation-specific parameters \({\{{\mu }_{p}({{{\bf{C}}}}_{i}),{\eta }_{p}({{{\bf{C}}}}_{i}),{z}_{ip}\}}_{i,p=1}^{N,P}\). As in the deviance-based model, we use \({f}_{{{\boldsymbol{\phi }}}}^{(p)}({{{\bf{K}}}}_{i})\) to parameterize the basal level parameter mp(Ki) for all i = 1, …, N and p = 1, …, P. The ELBO of the zero-inflated Poisson model, \({{{\rm{ELBO}}}}_{P}({{\boldsymbol{\theta }}},{{\boldsymbol{\phi }}},{{\boldsymbol{\pi }}};{{\bf{X}}},{{\bf{C}}},{{\bf{K}}})\), is then defined in a similar fashion to Eq. (10), and the model parameters are estimated by maximizing (an empirical version of) this ELBO with respect to \(\{{{\boldsymbol{\theta }}},{{\boldsymbol{\phi }}},{{\boldsymbol{\pi }}}\}\).
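The index layout of the network outputs in Eqs. (9) and (12) can be sketched as follows: the output vector is cut into consecutive blocks of length P, alternating between a variational mean and a log-variance. The function name `split_variational_params` and the toy output vector are assumptions for illustration; the network itself is not reproduced here.

```python
import numpy as np

def split_variational_params(f_out, n_genes, names=("mu", "eta")):
    """Split a network output of length 2 * len(names) * P into
    (mean, variance) pairs, one pair per named random function.

    Block 2k holds the means and block 2k + 1 the log-variances of names[k],
    matching the entries p + 2kP and p + (2k + 1)P in Eqs. (9) and (12).
    """
    f_out = np.asarray(f_out, dtype=float)
    n_blocks = 2 * len(names)
    if f_out.shape[-1] != n_blocks * n_genes:
        raise ValueError("output length must equal 2 * len(names) * n_genes")
    blocks = np.split(f_out, n_blocks, axis=-1)
    return {
        name: (blocks[2 * k], np.exp(blocks[2 * k + 1]))
        for k, name in enumerate(names)
    }

# Toy output for P = 2 genes in the zero-inflated Poisson model (length 4P = 8).
zip_params = split_variational_params(np.arange(8.0) / 10.0, n_genes=2)

# The Gaussian model uses three random functions, so the output has length 6P.
gauss_params = split_variational_params(
    np.zeros(12), n_genes=2, names=("mu", "gamma", "eta")
)
```

Exponentiating the log-variance blocks keeps every variational variance positive without constraining the network output.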

Magnitudes of perturbation vectors

In our proposed methods, the perturbation vector Ci can be either binary (indicating the presence of a perturbation) or continuous (representing e.g., dosage). When Ci represents the continuous dosage of a perturbation, we expect the potential perturbation effects to be positively correlated with the dosage (at least within a sensible range before ceiling effects set in). Similarly, when Ci = 0 (i.e., no perturbation at all), we expect no perturbation effects. To impose these physical constraints, we recommend modifying the model and inference procedure as follows (here we focus on the deviance-based model; the zero-inflated Poisson model can be modified in a similar fashion). We first replace the standard RBF kernels \({k}_{{\nu }_{\mu }},{k}_{{\nu }_{\gamma }}\) by a modified “zero-passing” RBF kernel39. This modification ensures that samples \({\{{\mu }_{p},{\gamma }_{p}\}}_{p=1}^{P}\) drawn from the modified Gaussian process prior satisfy \({\{{\mu }_{p}({{\bf{0}}})={\gamma }_{p}({{\bf{0}}})=0\}}_{p=1}^{P}\). We also replace the generative process of zip with \({z}_{ip} \sim {{\rm{Bernoulli}}}\left({\mathbb{1}}(| | {{{\bf{C}}}}_{i}| {| }_{2}^{2}\ne 0)\sigma ({\eta }_{p}({{{\bf{C}}}}_{i}))\right)\), which ensures that no zip is triggered when the input Ci = 0. This choice also reflects our prior belief that even though the potential effects of a perturbation Ci depend on its magnitude, whether or not a gene is perturbed depends only on the presence of the perturbation (i.e., \(| | {{{\bf{C}}}}_{i}| {| }_{2}^{2}\ne 0\)), and not on its scale. Similar constraints have also been used previously13,14.
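The exact form of the zero-passing kernel is given in ref. 39 and is not reproduced in this section. As an assumed stand-in, the sketch below uses one standard construction with the required property: conditioning an RBF Gaussian process on f(0) = 0, which gives k0(x, y) = k(x, y) − k(x, 0)k(0, y)/k(0, 0). It also shows the masked inclusion probability. All function names are hypothetical.

```python
import numpy as np

def rbf(x, y, nu1=1.0, nu2=0.1):
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return nu1 * np.exp(-nu2 * np.dot(d, d))

def zero_passing_rbf(x, y, nu1=1.0, nu2=0.1):
    """RBF kernel conditioned on f(0) = 0 (an assumed construction,
    not necessarily the kernel of ref. 39)."""
    origin = np.zeros_like(np.asarray(x, dtype=float))
    return rbf(x, y, nu1, nu2) - (
        rbf(x, origin, nu1, nu2) * rbf(origin, y, nu1, nu2)
        / rbf(origin, origin, nu1, nu2)
    )

def inclusion_probability(c, eta):
    """1(||c||_2^2 != 0) * sigmoid(eta): zero whenever c is the zero vector."""
    c = np.asarray(c, dtype=float)
    active = float(np.dot(c, c) != 0.0)
    return active / (1.0 + np.exp(-eta))

# Gram matrix over the origin and two hypothetical dosage vectors.
points = [np.zeros(2), np.array([0.5, 0.0]), np.array([0.0, 2.0])]
gram = np.array([[zero_passing_rbf(a, b) for b in points] for a in points])
```

The first row and column of `gram` vanish, so any function drawn from this prior is zero at C = 0 with probability one.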

In addition to the generative process, we also adjust the inference procedure accordingly. We modify the variational posterior in Eq. (9) as follows:

$$ {q}_{{{\boldsymbol{\xi }}}}^{{\prime} }\left({\{{\mu }_{p}({{{\bf{C}}}}_{i}),{\gamma }_{p}({{{\bf{C}}}}_{i}),{\eta }_{p}({{{\bf{C}}}}_{i}),{z}_{ip}\}}_{i,p=1}^{N,P}\right)={\prod }_{i,p=1}^{N,P}\left({{\mathcal{N}}}\left({\mu }_{p}({{{\bf{C}}}}_{i});| | {{{\bf{C}}}}_{i}| {| }_{2}{f}_{{{\boldsymbol{\xi }}}}^{(p)}({{{\bf{C}}}}_{i}),| | {{{\bf{C}}}}_{i}| {| }_{2}^{2}\exp ({f}_{{{\boldsymbol{\xi }}}}^{(p+P)}({{{\bf{C}}}}_{i}))\right)\right.\\ \times {{\mathcal{N}}}\left({\gamma }_{p}({{{\bf{C}}}}_{i});| | {{{\bf{C}}}}_{i}| {| }_{2}{f}_{{{\boldsymbol{\xi }}}}^{(p+2P)}({{{\bf{C}}}}_{i}),| | {{{\bf{C}}}}_{i}| {| }_{2}^{2}\exp ({f}_{{{\boldsymbol{\xi }}}}^{(p+3P)}({{{\bf{C}}}}_{i}))\right)\times {{\mathcal{N}}}\left({\eta }_{p}({{{\bf{C}}}}_{i});{f}_{{{\boldsymbol{\xi }}}}^{(p+4P)}({{{\bf{C}}}}_{i}),\exp ({f}_{{{\boldsymbol{\xi }}}}^{(p+5P)}({{{\bf{C}}}}_{i}))\right)\\ \left.\times {{\rm{Bernoulli}}}\left({z}_{ip};{\mathbb{1}}(| | {{{\bf{C}}}}_{i}| {| }_{2}^{2}\ne 0)\sigma ({\eta }_{p}({{{\bf{C}}}}_{i}))\right)\right).$$
(13)

Here we rescale the variational posterior of the mean- and variability-level perturbation effects by a factor of \(| | {{{\bf{C}}}}_{i}| {| }_{2}\), so that the variational means scale with \(| | {{{\bf{C}}}}_{i}| {| }_{2}\) and the variational variances with \(| | {{{\bf{C}}}}_{i}| {| }_{2}^{2}\). This ensures that both terms are zero when there is no perturbation, and depend explicitly on the size of Ci otherwise. We also modify the variational distribution of zip in the same way as in the generative process. These modifications ensure that both the generative process and the posterior inference are in line with the physical constraints discussed above. We use this modified generative process and inference procedure as our default model setup for the rest of the paper.
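The rescaling in Eq. (13) can be written as a small helper that maps raw network outputs to the variational mean and variance of μp(Ci) (and analogously γp(Ci)). The function name and the example inputs below are hypothetical.

```python
import numpy as np

def rescaled_variational_params(c, f_mean, f_logvar):
    """Variational mean and variance of mu_p(C_i) under Eq. (13).

    The mean is scaled by ||c||_2 and the variance by ||c||_2^2, which is
    equivalent to multiplying the unscaled Gaussian variable by ||c||_2.
    """
    norm = np.linalg.norm(np.asarray(c, dtype=float))
    mean = norm * np.asarray(f_mean, dtype=float)
    var = norm ** 2 * np.exp(np.asarray(f_logvar, dtype=float))
    return mean, var

f_mean = np.array([0.4, -0.2])     # hypothetical network outputs for P = 2 genes
f_logvar = np.array([0.0, -1.0])

mean_ctrl, var_ctrl = rescaled_variational_params(np.zeros(4), f_mean, f_logvar)
mean_dose, var_dose = rescaled_variational_params(
    np.array([0.0, 2.0, 0.0, 0.0]), f_mean, f_logvar
)
```

For the control vector both outputs are exactly zero, so the variational posterior collapses to a point mass at zero effect, in agreement with the zero-passing prior.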

Single gene perturbation

The Perturb-seq dataset33 was pre-processed and filtered using previously described pre-processing steps14,40. The resulting dataset \({{\bf{X}}}\in {{\mathbb{N}}}^{N\times P}\) consists of count data for N = 118,461 cells and P = 1187 genes. For i = 1, …, N, the perturbation \({{{\bf{C}}}}_{i}\in {{\mathbb{R}}}^{L}\) is either a length L = 722 one-hot vector, representing one of the 722 unique CRISPR guides (perturbations), or a zero vector, representing the perturbation associated with negative controls (non-targeting CRISPR guides). The non-targeting negative controls are treated as the baseline level. The cell information \({{{\bf{K}}}}_{i}\in {{\mathbb{R}}}^{D}\) is a length D = 4 real vector (lib_size: total number of UMI counts, n_features: number of genes with non-zero UMI readings, mt_percent: percentage of mitochondrial gene expression, scale_factor: core scale factor).

Multigene perturbation prediction

We compared GPerturb’s performance on predicting multi-gene perturbation outcomes with that of the knowledge-graph-informed GEARS using the Perturb-seq dataset15,34. We followed a previously described data-preprocessing process15. The resulting dataset \({{\bf{X}}}\in {{\mathbb{R}}}^{N\times P}\) consists of N = 89,357 cells and P = 5045 genes. For i = 1, …, N, the perturbation \({{{\bf{C}}}}_{i}\in {{\mathbb{R}}}^{L}\) is a length L = 103 binary vector where the positions of ones encode the perturbed genes. (The dataset contains 131 two-gene perturbations.) The cell information \({{{\bf{K}}}}_{i}\in {{\mathbb{R}}}^{D}\) is a length D = 2 real vector (lib_size: total number of UMI counts, n_features: number of genes with non-zero UMI readings). We randomly selected 20% of the dataset as the test set, and used the rest as the training set. For both GPerturb and GEARS, the recommended settings were used to fit the models.
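The binary encoding of multi-gene perturbations and the random 80/20 split can be sketched as follows. The gene names, the helper names and the random seed are placeholders; the actual preprocessing follows ref. 15 and is not reproduced here.

```python
import numpy as np

def encode_perturbation(perturbed_genes, gene_index):
    """Binary vector with ones at the positions of the perturbed genes."""
    c = np.zeros(len(gene_index))
    for gene in perturbed_genes:
        c[gene_index[gene]] = 1.0
    return c

def train_test_indices(n_cells, test_fraction, rng):
    """Random split of cell indices into training and test sets."""
    perm = rng.permutation(n_cells)
    n_test = int(round(test_fraction * n_cells))
    return perm[n_test:], perm[:n_test]

# Hypothetical gene vocabulary with L = 4 perturbable genes.
gene_index = {"GENE_A": 0, "GENE_B": 1, "GENE_C": 2, "GENE_D": 3}
c_single = encode_perturbation(["GENE_B"], gene_index)
c_double = encode_perturbation(["GENE_A", "GENE_D"], gene_index)
c_control = encode_perturbation([], gene_index)   # zero vector for controls

rng = np.random.default_rng(0)
train_idx, test_idx = train_test_indices(n_cells=10, test_fraction=0.2, rng=rng)
```

A two-gene perturbation is simply the sum of the two corresponding one-hot vectors, which is what allows the model to be queried at gene combinations absent from the training set.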

We further compared GPerturb with GEARS on the multiplexed Perturb-seq dataset12, following the same procedure as described above. We followed the data-preprocessing process given by the authors. The resulting dataset \({{\bf{X}}}\in {{\mathbb{R}}}^{N\times P}\) consists of N = 24,192 cells and P = 15,668 genes. For i = 1, …, N, the perturbation \({{{\bf{C}}}}_{i}\in {{\mathbb{R}}}^{L}\) is a length L = 600 binary vector where the positions of ones encode the perturbed genes. The cell information \({{{\bf{K}}}}_{i}\in {{\mathbb{R}}}^{D}\) is a length D = 3 real vector (lib_size: total number of UMI counts, n_features: number of genes with non-zero UMI readings, and mt_percent: percentage of mitochondrial gene expression). We compared the performance of GEARS and GPerturb under the same settings as described above.

SciPlex2

The SciPlex2 dataset2 (GSM4150377) consists of A549 cells treated with one of four compounds, dexamethasone (Dex), Nutlin-3a (Nutlin), BMS-345541 (BMS) or vorinostat (SAHA), across seven different doses. We followed previously described data pre-processing steps13. The resulting dataset \({{\bf{X}}}\in {{\mathbb{R}}}^{N\times P}\) consists of N = 20,643 cells and P = 5000 genes. For i = 1, …, N, the perturbation \({{{\bf{C}}}}_{i}\in {{\mathbb{R}}}^{L}\) is a length L = 4 vector with only one non-zero entry whose position and value encode the compound type and dosage, respectively. As in the previous sections, the perturbations associated with negative controls are encoded as Ci = 0 and treated as the baseline level. The cell information \({{{\bf{K}}}}_{i}\in {{\mathbb{R}}}^{D}\) is a length D = 2 real vector (lib_size: total number of UMI counts, n_features: number of genes with non-zero UMI readings). We randomly selected 20% of the dataset as the test set, and used the rest to train GPerturb. For both Gaussian GPerturb and CPA, the recommended settings were used to fit the models.
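A possible encoding of the dosage perturbation vectors is sketched below. The ordering of the compounds within the vector and the dose scale are assumptions made for illustration; the paper follows the preprocessing of ref. 13, which may normalise doses differently.

```python
import numpy as np

# Assumed ordering of the four compounds within the length-4 vector.
COMPOUNDS = ("Dex", "Nutlin", "BMS", "SAHA")

def encode_dose(compound, dose):
    """Length-4 vector whose single non-zero entry holds the dose of
    `compound`; negative controls (compound=None) map to the zero vector."""
    c = np.zeros(len(COMPOUNDS))
    if compound is not None:
        c[COMPOUNDS.index(compound)] = float(dose)
    return c

c_control = encode_dose(None, 0.0)
c_nutlin = encode_dose("Nutlin", 0.5)   # hypothetical normalised dose
c_saha = encode_dose("SAHA", 1.0)
```

With this encoding the norm of Ci equals the dose itself, so the rescaling of the variational posterior by the norm of Ci ties the size of the estimated effect directly to the dosage.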

Derivative metrics for identification of dosage patterns

We denote \({\hat{D}}_{i,j}(x)\) as the estimated perturbation effect of perturbation j on gene i at dosage level x. Due to the semiparametric specification of GPerturb, we can compute \(\frac{d}{dx}{\hat{D}}_{i,j}(x)\), the derivative of the perturbation effects with respect to the dosage level x, exactly and efficiently (thanks to automatic differentiation) and use the derivative information to capture interesting perturbation patterns.

We identify genes that are the most sensitive to the dosage of perturbation j by investigating the integral of the squared derivative \({\hat{D}}_{i}^{j}=\int_{{A}_{\min }^{j}}^{{A}_{\max }^{j}}{\left(\frac{d}{dx}{\hat{D}}_{i,j}(x)\right)}^{2}dx\) for each gene i in the data set, where \({A}_{\min }^{j},{A}_{\max }^{j}\) are the minimum and maximum dosage of perturbation j respectively.

We choose the integral of the squared derivative as a measure of sensitivity since this quantity equals zero if and only if \({\hat{D}}_{i,j}(x)\) equals some constant, indicating that the perturbation effect does not depend on dosage at all, and is large only if the magnitude of rate of change in the perturbation effect is large over the interval \([{A}_{\min }^{j},{A}_{\max }^{j}]\).
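The sensitivity metric can be approximated numerically as in the sketch below. It departs from the text in two labelled ways: the derivative is approximated by central finite differences rather than automatic differentiation, and the integral by the trapezoidal rule on a uniform grid. The function name and the example effect curves are hypothetical.

```python
import numpy as np

def dosage_sensitivity(effect_fn, a_min, a_max, n_grid=201, eps=1e-5):
    """Approximate the integral over [a_min, a_max] of (d/dx effect_fn(x))^2.

    effect_fn maps an array of dosage levels to estimated perturbation effects.
    Finite differences stand in for automatic differentiation here.
    """
    x = np.linspace(a_min, a_max, n_grid)
    deriv = (effect_fn(x + eps) - effect_fn(x - eps)) / (2.0 * eps)
    sq = deriv ** 2
    # Trapezoidal rule written out explicitly.
    return float(np.sum(0.5 * (sq[1:] + sq[:-1]) * np.diff(x)))

# Hypothetical effect curves for three genes under one compound.
flat = dosage_sensitivity(lambda x: np.full_like(x, 2.0), 0.0, 1.0)
linear = dosage_sensitivity(lambda x: 3.0 * x, 0.0, 1.0)
saturating = dosage_sensitivity(lambda x: np.tanh(4.0 * x), 0.0, 1.0)
```

Ranking genes by this quantity places dosage-independent genes (such as `flat`) at the bottom and genes with steep dose responses at the top.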

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.