Introduction

Single-cell ATAC-seq (scATAC-seq) is a powerful assay for profiling the epigenetic status of open chromatin in individual cells1,2. It has been applied to study gene regulation across tissues and under various conditions, including homeostasis3,4,5, development6,7, and disease8,9. Cis-regulatory elements (CREs), modulated by nucleosome turnover and occupancy10, display variable accessibility across cells. Within a cell, the dynamics of CRE activity depend on physiological factors such as cell type1,3, developmental state6,7, and spatial location within the tissue11,12, as well as the interaction of these factors with genetic variation13,14. Identifying the sets of elements whose accessibility is governed by genetic, developmental, and physiological factors is essential in understanding the cis-regulatory codes of biological processes15,16. Moreover, prioritizing regulatory elements is instrumental in identifying relevant cell types associated with complex traits5,17,18,19,20. Dissecting genetic, epigenetic, and environmental factors requires analysis methods that allow compound hypothesis testing. However, the discrete and sparse nature of ATAC-seq data presents technical challenges for existing approaches. Here, we establish a modeling framework and a new statistical method for complex hypothesis testing in single-cell ATAC-seq data.

Among all the factors that drive the accessibility of CREs, only some factors are experimentally controlled. In a typical single-cell experiment, the collection of cells is a random sample of a cell’s variable states over the unknown factors (e.g., cell cycle stage, metabolic stage) while controlling for the known factors (e.g., tissue, location, batch). We note that sometimes, the values of these factors are estimated from the data, such as unsupervised inference of cell type labels or time sequences. Nevertheless, as the data are sampled over unknown microstates and stochastic molecular processes, the latent accessibility of a CRE should be considered as a random variable, even without experimental variability.

A central theme in scATAC-seq data analysis is understanding the dynamics of CREs across cell types and biological conditions. This is typically achieved by conducting a statistical test for Differentially Accessible Regions (DARs) between different conditions of hypothetical causal factors that might govern the chromatin structure (e.g., treatment vs control). However, most existing approaches for identifying DARs rely on pairwise testing of one putative causal factor at a time (e.g., ArchR21, snapATAC22, or contingency table-based approaches) and therefore do not allow simultaneous testing of multiple factors or testing of continuous variables (such as temporal effects). A common practice in the literature is to ignore other covariates while testing the factor of interest. However, as shown schematically in Fig. 1, testing one factor at a time can create false negatives and false positives in the DARs of primary factors and cannot test for interactions. Although Seurat/Signac23 uses the logistic regression framework, which can in principle conduct multi-factor testing, its current implementation creates spurious dependency problems that result in mis-calibrated tests (Supplementary Note 1).

Fig. 1: PACS (Probability model of Accessible Chromatin in Single Cells) framework.

a Upper panel: Illustration of the latent accessibility of cells. Multiple factors such as cell types, developmental stages, and spatial locations determine the chromatin structure and configurations within respective cell groups. These different chromatin structures result in variable Tn5 insertion rates in the ATAC-seq experiments. The readout of ATAC assays is paired insertion counts (PIC), which are crude measures of latent insertion rates. Middle panel: Schematic of the read-capturing process in scATAC-seq. During PCR and sequencing, fragments in each single cell are partially captured, and after data processing, variable capturing probability should be accounted for in data modeling. Lower panel: Example downstream applications of PACS, including compound hypothesis testing, cell type label transfer, and feature-level batch effect correction. b Example chromatin configurations in which ignoring factors in a differentially accessible region (DAR) test may lead to false interpretations.

Besides the key problem of the inability to conduct hypothesis testing for multiple causal factors together, there are two additional unresolved technical problems: (1) scATAC-seq data displays heterogeneous sequencing coverage in each cell, so the cells are not directly comparable; (2) scATAC-seq data is sparse but also contains quantitative information at each peak24. To address these challenges, we present a new statistical framework, missing-corrected cumulative logistic regression25 (mcCLR, or missing corrected ordinal logistic regression, mcOLS), for the analysis of scATAC-seq data with modified Firth regularization26,27 to account for data sparsity.

With this statistical framework, we present our Probability model of Accessible Chromatin of Single cells (PACS), a toolkit for scATAC-seq analysis. PACS allows complex compound analysis tasks in scATAC-seq data, including cell type classification, feature-level batch effect correction, and spatiotemporal data analysis. With simulated and real data, we show that PACS effectively controls false positives while maintaining sensitivity for model testing. We apply PACS to a mouse kidney dataset, a developing human brain dataset, and a time-series PBMC treatment dataset, all of which have complex study designs, to demonstrate its capability to model multiple sources of variations for hypothesis-driven biological inference.

Results

A probabilistic model of accessible peaks and statistical test framework

In the PACS framework, we model the accessibility state of CREs in a single cell as a function of predictive factors such as cell type, physiological/developmental time, spatial region, etc. We use a matrix \({F}_{C\times J}\) to represent these variables in each cell, where C is the number of cells and J is the number of predictive variables (including dummy variables). Let \({Y}_{C\times M}\) represent an integer-valued scATAC-seq count matrix across C cells and M genomic regions. For empirical ATAC-seq data, these regions M are determined by data-dependent peak calling, where peaks are regarded as the set of candidate CREs28,29. As scATAC-seq can recover quantitative information on the density and distribution of nucleosomes24,30, we use integer values \({Y}_{{cm}}\in \left\{{\mathrm{0,1,2}},\ldots \right\}\) to represent the level of accessibility. Existing pipelines diverge in quantifying scATAC-seq counts, and we propose using the paired insertion count (PIC) matrix as a uniform input for downstream analyses24. The integer-valued PIC counts are observed measurements of the latent chromatin accessibility (Fig. 1, upper panel). This latent variable perspective motivated us to use the cumulative logit model to decompose the cumulative distribution of \({Y}_{cm}\) by its predictive variables \({F}_{c*}\).

Due to cell-specific nucleosome preparation and sequencing depth, the scATAC-seq output may miss sequence information from certain accessible chromatin regions (Fig. 1, middle panel). As with other genomics data, these missing data points result in zero counts rather than NA counts. Therefore, it is crucial to distinguish between true zeros (closed chromatin) and technical zeros (missing data). Here, we use \({R}_{C\times M}\), with binary values, to represent the read recovery/capturing status for each cell and region. This matrix encapsulates all the experimental factors (Tn5 activities, sequencing depth, etc.) that result in a disparity of recovered reads among cells. The observed chromatin states, denoted by \({Z}_{{CM}}\), are specified by the element-wise product between the latent accessibility \({Y}_{{CM}}\) and the capturing status \({R}_{{CM}}\). Since various experimental factors are cell-specific, we further assume the capturing probability \({{\rm{P}}}({R}_{{cm}}=1)\) to be unique to each cell but common to all peaks in that cell, and thus we use \({q}_{c}\) to denote this conditional read capturing probability in cell c.
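The relationship between latent accessibility, capturing status, and observed counts can be illustrated with a small simulation. The sketch below is not part of PACS: the function name `simulate_observed_counts`, the uniform choice of accessibility level among open peaks, and the parameter values are assumptions made only for illustration. It draws the latent matrix Y and the capturing matrix R independently and forms the observed matrix Z as their element-wise product.

```python
import numpy as np

def simulate_observed_counts(p_open, q_capture, max_level=2, seed=0):
    """Simulate Z = Y * R (element-wise) for C cells and M peaks.

    p_open    : (M,) probability that a peak is accessible, P(Y >= 1)
    q_capture : (C,) per-cell capturing probability q_c
    max_level : largest latent accessibility level (illustrative assumption)
    """
    rng = np.random.default_rng(seed)
    p_open = np.asarray(p_open, dtype=float)
    q_capture = np.asarray(q_capture, dtype=float)
    C, M = q_capture.size, p_open.size

    # Latent accessibility: 0 if closed, otherwise a level in 1..max_level.
    # Choosing the level uniformly is an arbitrary simplification.
    is_open = rng.random((C, M)) < p_open[None, :]
    level = rng.integers(1, max_level + 1, size=(C, M))
    Y = is_open * level

    # Capturing status: one probability per cell, shared by all peaks.
    R = (rng.random((C, M)) < q_capture[:, None]).astype(int)

    # Observed counts: uncaptured entries become zeros, not NAs.
    Z = Y * R
    return Y, R, Z
```

Because R is drawn independently of Y, the fraction of nonzero entries in row c of Z is close to \({q}_{c}\) times the mean open probability, so a cell with low capturing probability looks less accessible even when its latent state is identical.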

Motivated by the latent variable model and to account for cell-specific missing data, we extended the cumulative logit model to decompose accessibility as:

$${{\rm{logit}}}\left({{\rm{P}}}({Y}_{{cm}}\ge t)\right) ={\alpha }^{(t)}+{\sum }_{j=1}^{J}{\beta }_{j}{F}_{{cj}},\\ {{\rm{where}}\; {\rm{P}}}\left({Z}_{{cm}}\ge t\right) ={{\rm{P}}}\left({Y}_{{cm}}\ge t\right){q}_{c}{;t}\in \{1,2,\ldots,T\}$$
(1)

where \({q}_{c}\) is the capturing probability for cell c, and \({{\rm{P}}}({Y}_{{cm}}\ge t)\) is the sampling probability of cells with accessibility level greater than or equal to t. The maximum value of accessibility we account for, \(T\), is feature-specific. For a feature \(m\), \(T\) is the largest integer \(t\) such that \({\sum }_{c}1({Z}_{{cm}}\ge t)/{\sum }_{c}1({Z}_{{cm}}\ge 1)\ge h\), where \(h\) is a hyperparameter with a default value of 0.25. \({\alpha }^{(t)}\) is the intercept term in the \({t}^{{{\rm{th}}}}\) cumulative logit, and \({\beta }_{j}\) is the coefficient for the \({j}^{{{\rm{th}}}}\) column of the design matrix. Eq. 1 assumes a proportional odds model, where we have a common set of coefficients \({\beta }_{j}\) for all levels of the cumulative distribution while allowing for a unique constant term \({\alpha }^{(t)}\) for each level. Hereafter, we refer to our method as the mcCLR model, which stands for the missing-corrected Cumulative Logit Regression model. We note that under the assumption of complete data capturing (i.e., \({q}_{c}=1,\,\forall c\)), \({Z}_{{cm}}={Y}_{{cm}}\) and our model reduces to a generalized linear model (GLM) with sparsity regularization. However, the divergent cell-specific read capturing in scATAC-seq data cannot be accounted for by a GLM, necessitating the development of the mcCLR model.
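The rule for choosing the feature-specific maximum level \(T\) can be written out directly. The following is a minimal sketch of that rule as stated in the text, not the PACS implementation; the function name `max_level_T` and the convention of returning 0 for a peak with no nonzero counts are assumptions.

```python
import numpy as np

def max_level_T(z, h=0.25):
    """Largest t such that #(z >= t) / #(z >= 1) >= h for one peak.

    z : observed counts of a single peak across cells
    h : threshold hyperparameter (0.25 is the default quoted in the text)
    """
    if h <= 0:
        raise ValueError("h must be positive")
    z = np.asarray(z)
    n_nonzero = np.count_nonzero(z >= 1)
    if n_nonzero == 0:
        return 0  # assumed convention: no informative level for an all-zero peak

    # The ratio is non-increasing in t, so we can stop at the first failure.
    T = 1
    t = 2
    while t <= z.max() and np.count_nonzero(z >= t) / n_nonzero >= h:
        T = t
        t += 1
    return T
```

For example, with counts `[0, 0, 1, 1, 1, 2, 3]` two of the five nonzero cells have a count of at least 2 (ratio 0.4) and one has a count of at least 3 (ratio 0.2), so the rule gives T = 2 at h = 0.25.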

With the formulation above, the effect of a complex set of predictive variables (and their interactions) on accessibility can be tested by the null hypothesis \({\beta }_{j}=0\) with a likelihood ratio test (Fig. 1, lower panel). One statistical challenge is to estimate \({q}_{c}\) for each cell. Because we assume the same capturing probability for all peaks within a cell, the problem is tractable and can be computed efficiently. Operationally, we first group the cells by their combination of the treatments and then utilize a coordinate descent algorithm to obtain estimates of \({{\rm{P}}}({Y}_{{cm}}\ge 1|{f}_{c})\) and \({q}_{c}\) (Methods).

Another statistical challenge of scATAC-seq is that the data are very sparse, creating a so-called “perfect separation” problem (see ref. 25). Here, we developed a regularized model to resolve this issue by generalizing the Firth logistic regression model26,31 (Methods). Essentially, a Firth penalty is introduced in the regression model:

$$\log {L}^{ \!*}\left({{\boldsymbol{\beta }}}|{{\boldsymbol{Z}}}\right)=\log L\left({{\boldsymbol{\beta }}}|{{\boldsymbol{Z}}}\right)+\frac{1}{2}\log {|I}({{\boldsymbol{\beta }}})|$$
(2)

where \({L}^{ \!*}\) represents the penalized likelihood, \(L\) is the likelihood of the regression model, and \(I({{\boldsymbol{\beta }}})\) is the information matrix. Derivations of the parameter estimation framework are described in the Methods section. With the proposed methods, we aim to more accurately control type I errors and account for technical zeros (due to uneven data capturing) and sparse data. This regression-based model enables testing multiple covariates that jointly determine accessibility while controlling for other covariates or confounders.
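To make the effect of the penalty in Eq. 2 concrete, the sketch below fits an ordinary binary logistic regression with the standard Firth correction. It is only the binary special case with complete capturing (\({q}_{c}=1\)); it is not the generalized, missing-corrected estimator described in the Methods. The function name `firth_logistic`, the plain Newton-type update without step-halving, and the convergence settings are assumptions.

```python
import numpy as np

def firth_logistic(X, y, n_iter=100, tol=1e-8):
    """Firth-penalized logistic regression (binary outcome, no missingness).

    Maximizes log L(beta) + 0.5 * log|I(beta)| by solving the modified
    score equation X^T (y - pi + h * (0.5 - pi)) = 0, where h holds the
    diagonal of the weighted hat matrix.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = pi * (1.0 - pi)
        info = X.T @ (w[:, None] * X)          # Fisher information I(beta)
        info_inv = np.linalg.inv(info)
        # h_i = w_i * x_i^T I(beta)^{-1} x_i
        hat = w * np.einsum("ij,jk,ik->i", X, info_inv, X)
        score = X.T @ (y - pi + hat * (0.5 - pi))
        step = info_inv @ score
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

On perfectly separated data, such as a peak that is open in every cell of one group and closed in every cell of the other, the unpenalized maximum likelihood estimate diverges, whereas this penalized fit returns finite coefficients.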

Application of PACS to cell type identification

To demonstrate the effectiveness of our model for separating the latent chromatin accessibility from the capturing probability, we evaluated three model assumptions using the task of (supervised) cell type prediction, where the goal is to predict cell types in a new scATAC-seq dataset given an annotated (labeled) dataset.

We first evaluated the accuracy of the estimation procedure of PACS. We simulated groups of cells with a spectrum of both the underlying probability of accessibility (\({{\rm{P}}}\left({Y}_{{cm}}\ge 1\right)\), or \(p\) in short) across peaks and the capturing probabilities (\(q\)) across cells (Methods). We then utilized PACS to jointly estimate \(p\) and \(q\), with n = 1000, 500, or 250 cells. The simulation results show that our estimator can determine both the capturing probabilities and open-chromatin probabilities accurately, with root mean squared errors (RMSE) for the underlying probability of accessibility from 0.028 (n = 1000) to 0.027 (n = 250) and RMSE for capturing probability from 0.0067 (n = 1000) to 0.012 (n = 250, Supplementary Fig. 1a–f, and Supplementary Data 1).

We next tested PACS by applying it to a cell-type label transfer task, comparing it with the Naïve Bayes model. For both models, we started with an estimated \({{{\boldsymbol{p}}}}_{g}\) for each known cell type group label \(g\), and then applied the Bayes discriminative model to infer the most probable cell type labels for novel unidentified cells. Naïve Bayes does not assume missing data; thus, it ignores the cell-specific capturing probability. The prediction performances were evaluated with ten-fold cross-validation and holdout methods, where the original cell type labels are regarded as ground truth (Methods). We tested the methods on five datasets, including two human cell line datasets21, two mouse kidney datasets6, and one marmoset brain dataset32. In the two human cell line datasets, the labels are annotated by their SNP information21, so they are regarded as gold standards. For the remaining datasets, the original cell-type labels are generated by clustering and marker-based annotation, so the labels may have errors.
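The difference between the two classifiers can be sketched as follows. This is an illustrative simplification under stated assumptions, not the PACS procedure: it uses binarized counts, a uniform prior over cell types, and a simple moment estimate of the capturing probability (observed open peaks divided by the expected number under each candidate type). The function name `transfer_label` and the `correct_capture` switch are hypothetical; setting the switch to `False` fixes q at 1, which corresponds to the Naïve Bayes behaviour described above.

```python
import numpy as np

def transfer_label(z, profiles, correct_capture=True, eps=1e-6):
    """Pick the cell type with the highest log-likelihood for one cell.

    z        : (M,) binary observed accessibility of the new cell
    profiles : dict mapping label -> (M,) open probabilities p_g
    """
    z = np.asarray(z, dtype=float)
    best_label, best_ll = None, -np.inf
    for label, p in profiles.items():
        p = np.asarray(p, dtype=float)
        if correct_capture:
            # Moment estimate of q for this cell under candidate type g.
            q = float(np.clip(z.sum() / p.sum(), eps, 1.0))
        else:
            q = 1.0  # no missing data assumed
        prob = np.clip(p * q, eps, 1.0 - eps)   # P(Z >= 1) = p_g * q
        ll = np.sum(z * np.log(prob) + (1.0 - z) * np.log1p(-prob))
        if ll > best_ll:
            best_label, best_ll = label, ll
    return best_label
```

In this toy setting, a shallowly sequenced cell tends to be assigned to whichever type has the sparsest profile when q is fixed at 1, while rescaling each profile by the estimated q compares the pattern of open peaks rather than their total number.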

PACS consistently outperforms the Naïve Bayes model with an average 0.31 increase in Adjusted Rand Index (ARI, Supplementary Fig. 2a), suggesting the importance of considering the cell-to-cell variability in capturing rate. For the gold-standard cell line mixture data, we achieved almost perfect label prediction (ARI > 0.99), while Naïve Bayes had much lower accuracy with an average ARI = 0.54 (Supplementary Fig. 2b, c). For the kidney data6 and the marmoset brain data32, PACS still achieved high performance, with average ARI equal to 0.92, 0.90, and 0.88 for the adult kidney, P0 kidney, and marmoset brain data, respectively. The Naïve Bayes model, on the other hand, again produced lower ARI scores, equal to 0.59, 0.65, and 0.69 for the three datasets, respectively (Supplementary Fig. 1d–k).

For the holdout experiment, where training and testing are done on different datasets, our method again predicts cell labels more accurately than Naïve Bayes, consistent with the above results. Our cell type label prediction approach is also efficient: training and prediction together take <5 min for large datasets (>70,000 cells).

PACS enables parametric multi-factor model testing for accessibility

Identifying the CREs regulated by certain physiological cues is essential in understanding functional regulation. For example, differentially accessible region (DAR) analysis tries to determine cell-type-specific chromosomal accessibility differences. Most scATAC-seq pipelines adopt RNA-seq differential expression methods to ask whether a peak belongs to a DAR. These approaches generally lack calibration for sparse ATAC data, and pairwise DAR tests do not allow testing more complex models that might determine peak accessibility (e.g., a combination of spatial location and batch effects). With existing methods for DAR detection, commonly adopted approaches are to ignore other factors or stratify by other factors to test the factor of interest if the predictive variables are nominal (e.g., cell types). However, such tests involve ad hoc partition into levels of the nominal factor and cannot test more complex models, including possible metric variables (e.g., developmental time).

Here, we first assess the model design and capabilities of PACS and six established tools/methods: ArchR21, Seurat/Signac23, snapATAC22, edgeR33, snapATAC2 (in ref. 34), and Fisher’s exact test. ArchR conducts a Wilcoxon rank-sum test on the subsampled cells from the initial groups, while ensuring parity in the number of sequencing reads across any two samples being tested. Seurat employs the standard logistic regression model35 by setting the cell type as the dependent variable and ATAC counts and total reads in the cells as predictive variables. SnapATAC conducts a test on the pseudo-bulk data of two groups and utilizes the edgeR33 negative binomial test on the pseudo-bulk data with a pre-defined ad hoc variance measure (biological coefficient of variation, bcv = 0.4 for human and 0.1 for mouse data). Since snapATAC operates on the pseudo-bulk level, we also included the edgeR33 method applied to single-cell data in our comparison. snapATAC2 employs a logistic regression model with ATAC counts as the dependent variable and cell type and total reads in cells as predictive variables5,34. A comprehensive comparison is summarized in Table 1.

Table 1 Comparison of model design and capability among existing scATAC-seq testing tools

To quantitatively compare the performance of the parametric test framework, we first used simulated data to test a single-factor model (cell types as a factor). To resemble real data, simulated samples were generated by parameterizing the model with the accessibility and capturing probability estimated directly from a human cell line dataset21 (Methods). We randomly sampled varying numbers of cells in each group, ranging from 250 to 1000. Supplementary Fig. 3 shows that Seurat failed to control the type I error rate at the specified significance level. Among the methods demonstrating type I error control, PACS has, on average, 17%, 19%, and 122% greater power than Fisher’s exact test, ArchR and snapATAC, respectively (Supplementary Data 2). The reduced power of ArchR is likely due to the subsampling process, and the ad hoc “bcv” choice in snapATAC may result in a miscalibrated test with a low type I error and power. The Q-Q plots of the five methods are shown in Supplementary Fig. 4a–e.

To evaluate the performance under a multi-factor model, we conducted another set of simulations with a second factor, two spatial locations (S1 and S2), and the first factor of two cell types (T1 and T2). We evaluated the performance in testing for the main effects or their interactions (Methods). Two strategies were used for the methods that cannot directly test effects for multiple factors. The first is called the “naïve test”, where the other factor is ignored in testing the factor of interest. The second strategy is called the “stratified test”, where we stratified the dataset by the second factor and conducted a pairwise test between the factor of interest on each stratum, followed by a p-value combination test (Methods). Across all methods and test strategies, only PACS, snapATAC (naïve and stratified), Fisher-stratified, and ArchR-stratified controlled type I error at the specified level (Fig. 2a–i); PACS remained the most powerful test and detected up to 66% more true differential peaks compared with the second most powerful methods (Supplementary Data 3). Notably, Seurat could not determine peaks with spatial or interaction effects because the cell type factor is treated as the dependent variable in the regression model. A more detailed explanation can be found in Supplementary Note 1. For testing cell type effects, snapATAC2 with ATAC count as a dependent variable performs similarly to Seurat.
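The “stratified test” strategy ends with a p-value combination step. The specific combination test is given in the Methods and is not reproduced here; as one common choice, the sketch below uses Fisher's method, which is an assumption for illustration. The function name `fisher_combine` is hypothetical. Under Fisher's method, \(-2\sum \log {p}_{i}\) follows a chi-squared distribution with 2k degrees of freedom when the k per-stratum tests are independent and all nulls hold, and for even degrees of freedom the tail probability has a closed form.

```python
import math

def fisher_combine(pvalues):
    """Combine independent per-stratum p-values with Fisher's method.

    Returns P(chi2_{2k} >= -2 * sum(log p_i)), using the closed-form
    survival function of a chi-squared variable with even degrees of freedom.
    """
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    half_x = -sum(math.log(p) for p in pvalues)   # X / 2
    # sf(X; 2k) = exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total
```

With a single stratum the function returns the input p-value unchanged, and two strata that each give p = 0.05 combine to roughly 0.0175.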

Fig. 2: Compound hypothesis testing with PACS is sensitive and specific.

a–i Comparison of Type I error rates and statistical power across analytical methods using two-factor simulation data. Error bars represent the standard deviation from five independent simulations. All tests conducted are two-sided. The best-performing method in each scenario is highlighted with a color-matched circle. It is important to note that Seurat is limited to testing only cell-type effects. LR denotes logistic regression. j Illustration of linear and quadratic effects of treatment on accessibility across five time points. The effect sizes are defined as the log fold change (logFC) between the highest and the lowest accessibility values. k Assessment of statistical power for identifying linear and quadratic temporal dynamics across varying effect sizes in simulated datasets. Error bars denote the standard deviation from five independent simulation runs. Superimposed points represent individual simulation power estimates. A sample of N = 1000 cells was analyzed for each time point. l The false positive and false negative rates from ignoring batch covariates in testing cell type effect with ArchR for the adult mouse kidney data.

We then simulated a time-series dataset with five time points to evaluate our model performance for ordinal covariates. We assumed two temporal trends of accessibility: linear and quadratic. To put this in a biological setting, the quadratic trend may represent the presence of an acute spike response, and the linear trend may represent temporally accumulating chronic responses. The PACS framework could detect both linear and quadratic signals, and its power depends on the “effect size”, defined as the log fold change between the highest and lowest accessibility (Fig. 2j, k).
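One standard way to let a regression model separate linear from quadratic temporal trends is to code the ordered time points with orthogonal polynomial contrasts. Whether PACS uses this particular coding is not stated in this section, so the sketch below is an assumption for illustration, and the function name `polynomial_time_design` is hypothetical. It builds the contrasts from a QR decomposition of the Vandermonde matrix of the distinct time points and maps each cell to its row.

```python
import numpy as np

def polynomial_time_design(time_index, degree=2):
    """Orthogonal polynomial columns (linear, quadratic, ...) for each cell.

    time_index : (C,) time point of each cell (e.g. 0, 1, 2, 3, 4)
    degree     : highest polynomial degree; needs > degree distinct time points
    """
    t = np.asarray(time_index, dtype=float)
    levels = np.unique(t)
    if levels.size <= degree:
        raise ValueError("need more distinct time points than the degree")
    # Columns 1, t, t^2, ... evaluated at the distinct time points.
    V = np.vander(levels, degree + 1, increasing=True)
    Q, _ = np.linalg.qr(V)
    # QR is unique only up to column signs; make the last row positive.
    Q = Q * np.sign(Q[-1, :])
    contrasts = Q[:, 1:]                 # drop the constant column
    idx = np.searchsorted(levels, t)     # row of each cell's time point
    return contrasts[idx]
```

For five equally spaced time points the two columns are proportional to (-2, -1, 0, 1, 2) and (2, -1, -2, -1, 2), so a coefficient on the first column captures a monotone trend and a coefficient on the second captures a response that peaks or dips in the middle.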

We also evaluated the PACS model in real datasets. As the ground truth is unknown, we utilized a sampling-based approach. We used randomly permuted cell type labels to estimate the type I error. To evaluate power, we treated the consensus DAR set from all methods as “true DARs” (after type I error calibration, see Methods). For the standard two-group DAR test, our method consistently controlled type I error and achieved high power across different datasets (Supplementary Fig. 5a–f). We further showed that ignoring confounders such as batch effect could result in substantial false discoveries, up to 47% in the adult mouse kidney data (Fig. 2l). Recent advancements in sequencing technologies have significantly enhanced the ability to quantitatively measure chromatin states, as evidenced by a substantial proportion of counts equal to or exceeding two in a compilation of datasets (see Supplementary Data 4 from ref. 24). Our assessment indicates that the full PACS model, which leverages a cumulative logistic regression approach, demonstrates greater power in datasets with high read counts than the binary model. For example, for the 10X PBMC multiome dataset, the power increases from 0.86 with the binary model to 0.92 with the full PACS model (Supplementary Fig. 6).

PACS identifies kidney cell type-specific regulatory motifs and allows direct batch correction

One important feature of PACS is its ability to handle complex datasets with multiple confounding factors. To test the performance of PACS, we analyzed an adult kidney dataset with strong batch effects6. This dataset contains three samples generated independently (in three batches), and the authors identified a strong batch effect. Existing methods for batch correction map the ATAC-seq features to a latent vector space to subtract the batch effects. For example, the original study6 relies on Harmony36 to remove the batch effect in latent space for visualization and clustering, but the batch effect is still present in the peak feature sets because the correction is carried out in latent space without mapping to original features. Batch effect correction in latent space will be useful for secondary analyses that operate from latent space inputs, e.g., clustering, but it will not help with secondary analyses that directly operate on features such as peak-associated motif analysis.

To remove the batch effect at the feature level, we assume that the batch effect will affect (increase or decrease) the accessibility of certain peaks, and these effects are orthogonal to the biological effects. This assumption is necessary for most of the existing batch-effect correction methods (e.g., MNN37, Seurat38, and Harmony36) as a matter of experimental design. With this assumption, we applied PACS on the adult kidney data, detected significant DAR peaks among batches (p value < 0.05 with or without FDR correction), and removed batch-effect peaks from the feature set. We next used Signac to process the original data as well as the batch effect-corrected data without any other batch correction steps. Dimension reductions with UMAP suggested that the original data contained a strong batch effect, where almost all cell types are separated by batch (Fig. 3a, b). After removing the peaks with strong batch effects, the cells are better mixed among batches (Fig. 3c, d and Supplementary Fig. 7a, b). Note that different cell types are still separated, suggesting the biological differences are (at least partially) maintained. Since UMAP visualization may not fully preserve the actual batch mixing structure, we adopted a batch mixing score from ref. 39 to quantify the batch effect in the PCA space. The batch mixing score is defined as the average proportion of nearest neighbor cells with different batch identities, where a higher score indicates better mixing between batches and, thus, a smaller batch effect (Methods). We normalized the mean batch mixing score by dividing it by the expected score under the random mixing scenario. After batch effect correction with PACS, the normalized mean batch mixing score is 0.358 (with FDR correction) or 0.417 (without FDR correction) compared with 0.122 before batch correction. Example peaks with strong batch effects detected by PACS are shown in Supplementary Fig. 8a–l.
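The normalized batch mixing score described above can be sketched in a few lines. The exact settings used in the study are in the Methods; here the number of neighbors `k`, the use of Euclidean distance on the supplied embedding, and the formula for the expected score under random mixing (the probability that a randomly chosen other cell comes from a different batch) are assumptions, and the function name `normalized_batch_mixing` is hypothetical. The brute-force distance matrix needs memory quadratic in the number of cells, so this form suits only small examples.

```python
import numpy as np

def normalized_batch_mixing(embedding, batch, k=10):
    """Mean fraction of k nearest neighbors from a different batch,
    divided by its expectation under random mixing.

    embedding : (n, d) coordinates, e.g. top principal components
    batch     : (n,) batch label of each cell (at least two batches)
    """
    X = np.asarray(embedding, dtype=float)
    batch = np.asarray(batch)
    n = X.shape[0]
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of cells")

    # Squared Euclidean distances between all pairs of cells.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)          # a cell is not its own neighbor

    # Indices of the k nearest neighbors of every cell (unordered).
    nn = np.argpartition(d2, k - 1, axis=1)[:, :k]
    observed = np.mean(batch[nn] != batch[:, None])

    # Chance that a random other cell belongs to a different batch.
    _, counts = np.unique(batch, return_counts=True)
    expected = np.sum(counts * (n - counts)) / (n * (n - 1))
    return observed / expected
```

Under this normalization a value near 1 means neighbors are drawn from other batches about as often as chance would predict, and a value near 0 means each cell is surrounded almost entirely by cells from its own batch.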

Fig. 3: Application of PACS to the mouse kidney dataset.

a–d UMAP dimension reduction plots of the kidney adult dataset. Panels (a, c) are constructed using all features, whereas panels (b, d) are constructed after excluding features significantly affected by batch effects. Panels (a, b) are colored by batch labels, while panels (c, d) are colored by cell types. Features impacted by batch effects are identified using PACS (two-sided test) with FDR correction for multiple testing. Normalized PCA mixing represents the normalized mixing score calculated in the PCA space, with 1 indicating no batch effect and 0 indicating the strongest batch effect. e IGV plots of peak summits near genes specific to PCT (proximal convoluted tubule) and PST (proximal straight tubule) cell types identified using PACS. Gene specificity is determined through GREAT enrichment analysis of differentially accessible peaks. f Heatmap of normalized gene expression z-scores for the scRNA-seq data from the male (-m) and female (-f) mouse kidneys. The gene list corresponds to those identified in (e), emphasizing the consistency across datasets in identifying sex-specific expression patterns.

Fig. 4: Application of PACS to the developing human brain data.

a Overview of the developing human brain dataset. The subset of data we analyzed comprises samples from three donors across six brain anatomical regions. The analysis specifically targets the excitatory neuron lineage. UMAP visualizations of the data’s complexity, with points colored by cell type (b) or anatomical regions (c). Key to abbreviations: RG radial glia, IPC intermediate (neuro-) progenitor cells, earlyEN early excitatory neurons, dlEN deep layer excitatory neurons, ulEN upper layer excitatory neurons, M1 primary motor cortex, Parietal dorsolateral parietal cortex, PFC dorsolateral prefrontal cortex, Somato primary somatosensory cortex, Temporal temporal cortex, V1 primary visual cortex. d Motif enrichment results for PFC- and V1-specific peaks identified using PACS. The p values and adjusted p values are calculated by the Homer enrichment test (one-sided). PWM, position weight matrix. The full list of motif enrichment results can be found in Supplementary Data 12, 13. e Plot of accessibility z-scores for peaks specific to PFC and V1 across five identified cell types.

As noted by other studies40,41,42, there is always a tradeoff between batch correction and conservation of biological effects. The decision must be made with careful consideration. In our study, we aimed to correct the batch effect only for clustering major cell types and visualization. For these tasks, information loss is inevitable and can be accepted, and thus, the approach of excluding peaks with significant batch effect can be used, provided that the tradeoffs discussed above are understood. However, we emphasize that for tasks like detecting DAR, a better approach is to keep all peaks and regress out the batch effect to obtain an unbiased estimation of the cell type effect. To demonstrate this approach, we applied our method to identify cell type-specific features while adjusting for batch effect. We focused on the two proximal tubule subtypes, proximal convoluted tubules (PCT) and proximal straight tubules (PST). By fitting our mcCLR model with cell type and batch effect, we identified 19,888 and 62,368 significant peaks for PCT and PST, respectively (FDR-corrected p value < 0.05, Supplementary Data 4, 5). The original study utilized snapATAC, which reported 23,712 and 36,078 significant peaks for PCT and PST, respectively. With the batch-corrected differential peaks, we conducted GREAT enrichment analysis43,44 to identify candidate PCT- and PST-specific genes (Supplementary Data 6, 7). We identified Gc, Nox4, Slc4a4, Bnc2, Slc5a12, and Ndrg1 genes as top PCT-enriched genes, and Ghr, Gramd1b, Etv6, Atp11a, Gse1, and Sik1 as top PST-enriched genes. The associated genomic pile-up figures for the CREs of these genes are shown in Fig. 3e, and these findings were supported by a public scRNA-seq dataset45 (Fig. 3f).

PACS dissects complex accessibility-regulating factors in the developing human brain

We applied our method to the human brain dataset11, which is more challenging due to the complex study design with cells collected from six donors across eight spatial locations. Substantial sequencing depth variations among samples have also been noticed, further complicating the analysis (Supplementary Fig. 9a–c). To study how spatial locations affect chromatin structure, the original reference focused on the prefrontal cortex (PFC) and primary visual cortex (V1) regions, as they were the extremes of the rostral-caudal axis11. With the multi-factor analysis capacity of PACS, we conducted analyses to (1) identify the region effect while adjusting for the donor effect and (2) identify the cell type-specific region effect.

We first examined how different brain regions affect chromatin accessibility across all cell types. To accurately capture this relationship, we incorporated other influential factors, such as donor variability, as covariates in our regression model. This approach allows us to assess the specific effects of brain regions while accounting for potential confounding factors (Methods). For this, we focused on a subset of three donors where spatial information is retained during data collection (Fig. 4a–c and Supplementary Data 8). In total, we identified 146,676 brain region-specific peaks (FDR corrected p value < 0.05). Between PFC and V1 regions, we identified 30,455 DAR peaks, ~20% more compared with the original study (Supplementary Data 9, 10). With the region-specific DARs, we conducted motif enrichment analysis to identify region-specific TFs. For the PFC and V1 regions, we found several signals that were consistent with the original article11, including PFC-specific motifs MEIS1, TBX21, and TBR1, and V1-specific motifs MEF2B, MEF2C, MEF2A, and MEF2D. Moreover, we identified additional V1-specific motifs ETS and ZIC2 (Fig. 4d), supported by the scRNA-seq data collected from the same regions46. We also noticed that some neuron development-associated TFs, including OLIG2 and NEUROG2, are enriched in both brain regions but with different binding sites, likely due to different co-factors that open different DNA regions. We next conducted TF enrichment analysis for the DAR set identified by PACS but not by snapATAC. The top enriched motifs are highly concordant with the output using the full set of peaks, suggesting that the additional peaks identified by PACS are consistent with, and extend, the information in the PACS-snapATAC common set, which supports the reliability of the PACS method. Motif enrichment results for the brain data are reported in Supplementary Data 11–14.

Next, we used PACS to examine the location effect across different cell types along excitatory neurogenesis. This corresponds to testing the interaction terms between spatial location and cell types while adjusting for donor effect (Fig. 4e). The previous study reported that the chromatin status of the intermediate progenitor cells (IPC) population started to diverge between the PFC and V1 regions. Consistent with the article, we identified 2773 significant differential peaks between PFC and V1 at the IPC stage, 52% more than snapATAC (Supplementary Data 15). Example peaks with cell type-specific region effect are shown in Supplementary Fig. 10.

In sum, we show the implementation of PACS for data with three levels of factors: donor, spatial region, and cell type. PACS can be applied to study one factor or the interaction between factors while adjusting for other confounding factors, and the resulting tests have higher power.

PACS identifies time-dependent immune responses after stimulation

The existing methods for DAR detection rely on pairwise comparisons and, thus, are not applicable to ordinal or continuous factors. One such example is the scATAC-seq data collected at multiple time points. Here, we apply PACS to a peripheral blood mononuclear cell (PBMC) dataset collected at three time points (0 h control, 1 h, and 6 h) after drug treatment47. Multiple treatments were applied separately to cells collected from four human donors. While PACS can simultaneously model all drugs and conditions, we focus on the ionomycin plus phorbol myristate acetate (PMA) treatment to demonstrate the PACS workflow. The factors included in the PACS model are shown in Fig. 5a, where cell type and donor effects are categorical, and the time effect is coded as an ordinal variable. Note that time can alternatively be coded as a continuous variable.

Fig. 5: Application of PACS to time-series human PBMC treatment data.

a Factor landscape of the PBMC treatment dataset. Here, another layer of factor is the four different treatments, which can also be jointly considered in the model. For demonstration purposes, we only focus on the effect of PMA treatment. The baseline (control) is designated as time 0, with subsequent time points at one hour (time 1) and six hours (time 2) post-treatment. b, c Summaries of chromatin accessibility peaks significantly upregulated or downregulated in response to PMA treatment across different cell types. The full list of differentially accessible peaks can be found in Supplementary Data 16, 17. d, e Heatmaps display significant chromatin accessibility peaks, categorized by upregulation or downregulation, organized by cell type and time point. The color scale (scaled_acc) quantifies the z-score of accessibility.

We tested the treatment effect by identifying open chromatin regions that gradually increase or decrease in accessibility after treatment. In total, we detected 35,356 peaks with a strong treatment effect across five broad cell types (B cell, CD4 T cell, CD8 T cell, Monocyte, and NK cell, Supplementary Data 16–18). Across the cell types, CD4 and CD8 T cells show the most significant changes in chromatin landscape after treatment (Fig. 5b, c). This is expected, as PMA can induce T cell activation and proliferation48. Among the peaks with significant PMA treatment effect, most become more accessible after treatment, consistent with the activation function of the treatment. We then conducted gene enrichment analysis with GREAT44, where we identified several GO pathways associated with T cell activation, such as “regulation of T cell differentiation” and “regulation of interleukin-2 production” (Supplementary Data 19). We also identified enriched genes, including DUSP5, IL1RL1, TBX21, and CXCR3 (Supplementary Data 20), the expression of which has been previously reported to be up-regulated upon PMA treatment49,50,51,52. Notably, DUSP5 is known to play an essential role in the immune response through regulation of NF-κB as well as ERK1/2 signal transduction53, and TBX21 is an immune cell TF that also directs T-cell homing to pro-inflammatory sites via regulation of CXCR3 expression54. Figure 5d, e shows the cell type-specific dynamics of the open chromatin landscape after the PMA treatment. We noticed that some CREs respond to the treatment across all cell types, and some CREs become activated only in certain cell types.

Discussion

Single-cell sequencing data, characterized by uneven data capturing and sparsity, present significant challenges in data analysis. For scRNA-seq data, data normalization has been an essential step for adjusting for uneven data capturing; however, this approach is not directly applicable to scATAC-seq data, creating a unique challenge. Our method, PACS, addresses this by jointly modeling the group-level underlying accessibility and cell-level sequencing reads capturing. In addition, PACS controls for artifacts of sparse single-cell data that tend to have many missing values by a regularization approach. When applied to cell-type annotation tasks, PACS showed improved performance compared with the Naïve Bayes model, which does not consider cell-specific capturing probabilities.

The increasing volume of data across various tissue conditions necessitates atlas-level data integration to comprehend tissue dynamics. Our cell type annotation framework facilitates the transfer of cell type annotation from a reference dataset to another dataset, thereby overcoming a major hurdle in integrative data analysis. A further challenge of data integration is jointly modeling various factors (e.g., genotype, cell type, spatial locations) that govern cellular CRE activities. The standard GLM framework could not address the uneven data capturing in scATAC-seq data, so we developed a statistical model that extends the GLM framework to account for cell-specific missing data. Our model, the missing-corrected cumulative logistic regression (mcCLR) with regularization, enables PACS to perform multi-covariate hypothesis tests, including spatial and temporal data analysis. Here, we analyzed three empirical datasets from brain, kidney, and blood samples to show the utility and flexibility of our framework in large, complex datasets. Additionally, PACS shows promise in genotype-based chromatin accessibility studies, such as allele imbalance analysis or chromatin accessibility quantitative trait locus (caQTL) studies. Compared with existing models, PACS offers a more controlled approach for handling multiple covariates in the caQTL analysis.

The main characteristics of the PACS model are (1) a probability model of read depth, (2) sparsity control, and (3) an explicit treatment of ordinal distribution in a GLM framework. Given this model, there are some limitations and considerations. First, treating scATAC-seq data as ordinal data requires a certain amount of read depth to reveal quantitative data from each peak. In ref. 24, we showed that typical scATAC-seq data have such quantitative information, but if the read depth is shallow, the advantage of the cumulative model over a binary logistic model will be reduced. Second, treating sparsity with modified Firth regularization can help with artifacts caused by missing data and imbalanced sample sizes, but like any regularization procedure, it cannot fully compensate for arbitrarily sparse data or shallow read depth. Experimental protocols should also be optimized to reduce missing data. Third, we have previously derived a parametric model of the scATAC-seq read count, known as the size-filtered signed Poisson distribution (ssPoisson)24. In our current work, we treat the insertion rate as a latent variable and directly model the paired insertion counts (PIC) of the data with a regression-based model without directly trying to estimate the parameters of the ssPoisson model. This approach greatly enhances computational efficiency, but future work might consider a full parametric model. Lastly, we treat all the putative factors that affect chromatin accessibility in a linear model. The linear model allows for interaction terms, but if the effects are non-linear, our approach will yield incorrect tests. Nevertheless, given that the values in scATAC-seq datasets are generally small, we believe a linear model will be a good approximation to more complex relationships.

In summary, PACS allows versatile hypothesis testing to analyze scATAC-seq data. Its capability of jointly accounting for multiple factors that govern the chromosomal landscape will help investigators dissect multi-factorial chromatin regulation.

Methods

A probabilistic model of underlying open chromatin status

Here we model the activity of regulatory elements in each cell type group by the cumulative distribution of the accessibility. The underlying accessibility for a CRE is a function of nucleosome density and turnover rate. As we discuss in the main text, the chromatin state should be regarded as a random variable for a particular cell group, as the cells are sampled from mixtures of hidden microstates. Here, we expanded the model of accessible chromatin from ref. 24. Briefly, let \({F}_{C\times J}\) be a design matrix that summarizes known independent variables (e.g., cell type, developmental time, sample locations, etc.) across \(C\) cells, and let \({Y}_{C\times M}\) be the underlying (latent) chromatin status across \(C\) cells and \(M\) regions, where each element represents the accessibility of a genomic region. The goal of PACS is to decompose the (complementary) cumulative distribution of \({Y}_{{cm}}\), i.e., the series of distributions:

$${{\rm{P}}}\left\{{Y}_{{cm}}\ge t\right\}={\sum}_{i=t}^{T}{\pi }_{i} \, {{\rm{for}}} \, t=1,2,\ldots,T$$
(3)

by predictive independent variables in \({F}_{c*}\). Here the maximum value of accessibility we account for, \(T\), is feature-specific. To be precise, for a feature \(m\), \(T\) is the largest integer \(t\) such that \({\sum }_{c}1({Z}_{{cm}}\ge t)/{\sum }_{c}1({Z}_{{cm}}\ge 1)\ge h\), where \(h\) is a hyperparameter. In our study, \(h\) is set to \(0.25\); based on our evaluation, our model is not sensitive to the choice of \(h\).
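The rule for the feature-specific maximum level can be illustrated with a minimal sketch. The function name `max_level` and the convention of returning 0 for a feature with no reads are assumptions made for illustration; they are not taken from the PACS implementation.

```python
import numpy as np

def max_level(z_m, h=0.25):
    """Largest t such that #{c: z_cm >= t} / #{c: z_cm >= 1} >= h for one feature.

    z_m : observed counts of one peak across cells.
    h   : the hyperparameter from the text (0.25 in the study).
    """
    z_m = np.asarray(z_m)
    n_open = np.sum(z_m >= 1)
    if n_open == 0:
        return 0  # assumption: a peak with no reads gets T = 0
    T = 1
    # The ratio is non-increasing in t, so we can stop at the first failure.
    while np.sum(z_m >= T + 1) / n_open >= h:
        T += 1
    return T

counts = [0, 0, 1, 1, 1, 2, 2, 3]
print(max_level(counts))         # uses h = 0.25
print(max_level(counts, h=0.1))  # a smaller h admits a higher level
```

With the toy counts above, half of the non-zero cells have at least two counts but only one in six has three, so the two calls differ in the level they retain.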

Model for capturing probability of cell

Because experimental factors such as enzyme activity and sequencing depth differ across cells, we introduce \({R}_{C\times M}\) as a matrix representing the capturing status of each cell and region. Let \({Z}_{C\times M}\) be the (observed) scATAC dataset; we then have \(Z=Y\otimes R\), where \(\otimes\) denotes the element-wise product. We consider \({R}_{{cm}}\) to be sampled from a Bernoulli distribution parameterized by \({q}_{c}\), the cell-specific capturing probability:

$${R}_{{cm}}\sim {{\rm{Bernoulli}}}({q}_{c})$$
(4)

We note that under the assumption of complete data capturing (i.e., \({R}_{{cm}}=1\) for all \(c\) and \(m\)), \({Z}_{{cm}}={Y}_{{cm}}\) and our model reduces to a standard cumulative logistic regression model (also known as ordinal logistic regression). The cell-specific capturing probability in scATAC-seq data complicates the model and necessitates the development of an extended model, missing-corrected cumulative logistic regression (mcCLR). Example applications of ordinal logistic regression in biomedical research are reviewed in refs. 55,56.
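The relationship between the latent matrix, the capturing status, and the observed data can be sketched as follows. The function name `simulate_observed` and the toy values are hypothetical; only the element-wise product and the Bernoulli capturing step come from the text.

```python
import numpy as np

def simulate_observed(y, q, rng):
    """Return Z = Y * R with R_cm ~ Bernoulli(q_c).

    y   : latent accessibility, shape (cells, regions).
    q   : cell-specific capturing probabilities, shape (cells,).
    rng : a numpy random Generator.
    """
    y = np.asarray(y)
    q = np.asarray(q, dtype=float)
    # One capturing indicator per cell and region, sharing q_c within a cell.
    r = rng.random(y.shape) < q[:, None]
    return y * r, r

rng = np.random.default_rng(0)
y = np.array([[1, 0, 2], [0, 1, 1]])      # toy latent counts (illustrative)
z, r = simulate_observed(y, [0.9, 0.3], rng)
print(z)
```

Setting every \({q}_{c}\) to 1 returns the latent matrix unchanged, which corresponds to the complete-capturing case in which the model reduces to ordinary cumulative logistic regression.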

Joint parameter estimation for single-factor scenario

Given a class of data that corresponds to a combination of levels of independent variables, we follow the same parameter estimation framework as described in ref. 24. Briefly, assume we have a genomic region-by-cell (i.e., peak-by-cell) matrix \({Z}_{{C}_{f}\times M}\) with \({C}_{f}\) denoting the subset of cells corresponding to some combination of the independent prediction factors. The observed values in \({Z}_{{C}_{f}\times M}\) are ordinal values, but as most of the non-zero scATAC-seq counts are one (typically >70%), we focus on \({{\rm{P}}}({Y}_{{C}_{f}m}\ge 1)\) for purposes of \({q}_{c}\) estimation. Hereafter, we use the notation \({p}_{{C}_{f}m}^{(1)}\) to represent the (non-zero) open probability of group \({C}_{f}\) and feature \(m\). We have further assumed \({q}_{c}\) to be identical across different levels of accessibility for a given cell. Given the data sparsity and the predominance of counts of one, this assumption is mild, and it greatly accelerates the estimation process. We use a moment estimator with a coordinate descent algorithm to iteratively update \({p}_{{C}_{f}m}^{(1)}\) given \({q}_{c}\), and update \({q}_{c}\) given \({p}_{{C}_{f}m}^{(1)}\). Briefly, we execute the following iteration until convergence:

  1. Start with an initial estimate of \({p}_{m}^{[0]}\).

  2. For t = 1, 2, …

    a. Compute \({q}_{c}^{[t]}\) by:

      $${q}_{c}^{[t]}=\frac{{\sum }_{m=1}^{M}I({z}_{{cm}}\ge 1)}{{\sum }_{m=1}^{M}{p}_{m}^{[t-1]}}\quad {{\rm{for}}} \; c\in {C}_{f}$$

    b. Update \({p}_{m}^{[t]}\) by moment estimator:

      $${p}_{m}^{[t]}=\frac{{\sum }_{c\in {C}_{f}}I({z}_{{cm}}\ge 1)}{{\sum }_{c\in {C}_{f}}{q}_{c}^{[t]}}\quad {{\rm{for}}} \; m\in \{1,2,\ldots,M\}$$

where we use superscript \([t]\) to represent the \({t}^{{{\rm{th}}}}\) iteration, and we omit the subscript \({C}_{f}\) and superscript (1) for \({p}_{{C}_{f}m}^{(1)}\).
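The iteration above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the PACS implementation: the function name `estimate_p_q`, the choice of the per-peak detection frequency as the initial estimate, the stopping tolerance, and the absence of any clipping to [0, 1] are all assumptions.

```python
import numpy as np

def estimate_p_q(z, n_iter=50, tol=1e-8):
    """Alternate the two moment updates for one group of cells C_f.

    z : observed counts, shape (cells, peaks); assumed to contain some reads.
    Returns (p, q): open probabilities per peak, capturing probabilities per cell.
    """
    b = (np.asarray(z) >= 1).astype(float)   # I(z_cm >= 1)
    row_hits = b.sum(axis=1)                 # non-zero peaks per cell
    col_hits = b.sum(axis=0)                 # non-zero cells per peak
    p = b.mean(axis=0)                       # assumed initial estimate p^[0]
    q = np.ones(b.shape[0])
    for _ in range(n_iter):
        q = row_hits / p.sum()               # step (a)
        p_new = col_hits / q.sum()           # step (b)
        converged = np.max(np.abs(p_new - p)) < tol
        p = p_new
        if converged:
            break
    return p, q

z = np.array([[1, 0, 2], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
p_hat, q_hat = estimate_p_q(z)
print(p_hat, q_hat)
```

In this unconstrained form, the fitted products \({p}_{m}{q}_{c}\) equal the row total times the column total divided by the grand total of non-zero entries, while the separate scales of \(p\) and \(q\) are inherited from the initial estimate. This is why an additional condition on the scale is needed for the parameters to be uniquely defined.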

Uniqueness of parameter estimation

In order for the above joint parameter estimation framework to converge and for the estimated parameters to be uniquely defined, there should be \({q}_{c}=1\) for some cells and \({p}_{{C}_{f}m}^{(1)}=1\) for some features. In PACS, we conduct a convergence check by requiring a certain proportion of cells (default 10%) to have an estimated capturing probability greater than 0.9. In the case of a cluster of cells being rare or not sufficiently deeply sequenced, the estimates may be unstable, and we recalibrate the estimates for this rare cluster to its most similar cluster to prevent potential false positives. Specifically, let \({C}_{f1}\) index the rare group of cells; then, to identify the cell groups with the most similar open chromatin profile, we compute the correlation between \({p}_{{C}_{f1}*}^{(1)}\) and \({p}_{{C}_{{fj}}*}^{(1)}\) for all other clusters \(j=1,\ldots,J\), across all regions. Assuming \({C}_{{fn}}\) has the most similar chromatin profile, we rescale the current estimation of \({p}_{{C}_{f1}m}^{(1)}\) by the following formula:

$$S={\sum}_{m}{p}_{{C}_{{fn}}m}^{\left(1\right)}\,\big/\,{\sum}_{m}{p}_{{C}_{f1}m}^{\left(1\right)},\;\;\;{p}_{{C}_{f1}*}^{(1){\prime} }={p}_{{C}_{f1}*}^{(1)}\times S$$
(5)

where S is the scale factor, \({p}_{{C}_{f1}m}^{(1){\prime} }\) is the rescaled open probability estimate for the cluster \({C}_{f1}\) and feature \(m\). Through rescaling, we assume that most peaks are not differentially accessible between these two cell types.
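As an illustration of Eq. 5, the sketch below picks the most correlated group and rescales the rare group's estimates. The function name `rescale_rare_group` and the use of Pearson correlation via `np.corrcoef` are assumptions (the text does not state which correlation is used), and no clipping of rescaled values is applied.

```python
import numpy as np

def rescale_rare_group(p, rare):
    """Rescale the open-probability profile of a rare group (Eq. 5).

    p    : open probabilities, shape (groups, peaks).
    rare : row index of the rare or shallow group.
    Returns (rescaled profile, index of most similar group, scale factor S).
    """
    p = np.asarray(p, dtype=float)
    corr = np.corrcoef(p)[rare].copy()   # correlation of each group with the rare one
    corr[rare] = -np.inf                 # exclude the rare group itself
    nearest = int(np.argmax(corr))
    s = p[nearest].sum() / p[rare].sum()
    return p[rare] * s, nearest, s

p = np.array([[0.1, 0.2, 0.3],
              [0.2, 0.4, 0.6],
              [0.3, 0.2, 0.1]])
rescaled, nearest, s = rescale_rare_group(p, rare=0)
print(nearest, s, rescaled)
```

After rescaling, the rare group's total open probability matches that of its most similar group, in line with the stated assumption that most peaks are not differentially accessible between the two.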

DAR framework in Seurat and snapATAC2

Both Seurat/Signac and snapATAC2 employ the standard logistic regression framework for DAR detection but with different model specifications. To be precise, Seurat used the following model:

$${{\rm{logit}}}\left({{\rm{P}}}\left(C{T}_{c}=1\right)\right)=\alpha+{\beta }_{0}{D}_{c}+{\beta }_{1}{Z}_{{cm}}+{\sum }_{j=2}^{J}{\beta }_{j}{F}_{{cj}}$$
(6)

where \(CT\) represents the binary indicator of the cell type label, \({D}_{c}\) represents the total number of reads of a cell, \({Z}_{{cm}}\) represents the observed read count in peak m of cell \(c\), and \({F}_{{cj}}\) represents the other biological factors to be controlled for. We note that this model should only be used to test the relationship between cell type and read count rather than other factors, such as spatial effects.

In snapATAC2, the following model is used:

$${{\rm{logit}}}\left({{\rm{P}}}\left({Z}_{{cm}}=1\right)\right)=\alpha+{\beta }_{0}{D}_{c}+{\beta }_{1}C{T}_{c}+{\sum }_{j=2}^{J}{\beta }_{j}{F}_{{cj}}$$
(7)

The major difference is that in snapATAC2, the observed read count becomes the random component (dependent variable). In contrast, the cell type label becomes part of the systematic component (a covariate) in the regression. The model proposed in snapATAC2 is more logically consistent with the biological inference problem, i.e., whether the cell type status governs peak-reads due to differential chromatin state. Despite the difference, both models correct for sequencing depth by incorporating total read count (\({D}_{c}\)) as an explanatory variable in regression. Such treatment will result in potential collider bias, as discussed in Supplementary Note 1.

Cell type label prediction framework

Given a reference dataset, we estimate the probability of open chromatin \({p}_{{C}_{g}m}^{(1)}\) for each cell type \(g\in \{1,\ldots,G\}\), using the formula above. With a new set of binarized observations \({Z}_{{C}^{{\prime} }\times M}^{{\prime} }\), we apply the Bayes discriminative model to predict the corresponding cell type labels, \(h\left({Z}_{c*}^{{\prime} }\right).\)

$${{\rm{P}}}\left(h\left({Z}_{c*}^{{\prime} }\right)=g\,|\,{Z}_{c*}^{{\prime} }\right) \propto {{\rm{P}}}\left({Z}_{c*}^{{\prime} }\,|\,h\left({Z}_{c*}^{{\prime} }\right)=g\right){{\rm{P}}}\left(h\left({Z}_{c*}^{{\prime} }\right)=g\right) \\ \propto {{\rm{P}}}\left(h\left({Z}_{c*}^{{\prime} }\right)=g\right){\prod }_{m=1}^{M}{\left({p}_{{C}_{g}m}^{(1)}{q}_{c}\right)}^{{Z}_{{cm}}^{{\prime} }}{\left(1-{p}_{{C}_{g}m}^{(1)}{q}_{c}\right)}^{1-{Z}_{{cm}}^{{\prime} }}$$
(8)

where \({{\rm{P}}}\left(h\left({Z}_{c*}^{{\prime} }\right)=g\,|\,{Z}_{c*}^{{\prime} }\right)\) represents the posterior probability of cell \(c\) being sampled from cell group \(g\), \({{\rm{P}}}\left({Z}_{c*}^{{\prime} }\,|\,h\left({Z}_{c*}^{{\prime} }\right)=g\right)\) represents the conditional probability of observing \({Z}_{c*}^{{\prime} }\) given that the cell \(c\) is sampled from cell type \(g\), and \({{\rm{P}}}\left(h\left({Z}_{c*}^{{\prime} }\right)=g\right)\) is the prior probability of a new observation belonging to cell group \(g\), which can either be assumed to be a non-informative Dirichlet prior \({{\rm{Dirich}}}(\delta )\) or estimated based on the cell type composition in reference data. Note that we have a large feature space, so this choice will not make a big difference.

The Naive Bayes model is a probabilistic classifier based on Bayes’ theorem, assuming independence between features given the class label. Specifically, in our context, it assumes that the accessibility of each peak is independent of the accessibility of other peaks, given the cell type. Using the same notation as above, the parameter estimation procedure is

$${p}_{{C}_{g}m}^{(1)}=\frac{{\sum }_{c\in {C}_{g}}I({z}_{{cm}}\ge 1)}{|{C}_{g}|}$$
(9)

where \(|{C}_{g}|\) represents the group size of (i.e., number of cells in) cell type \(g\). With a new set of observations \({Z}_{{C}^{{\prime} }\times M}^{{\prime} }\), the probability of their corresponding cell type labels \(h\left({Z}_{c*}^{{\prime} }\right)\) can be estimated using the following equation

$${{\rm{P}}}\left(h\left({Z}_{c*}^{{\prime} }\right)={g|}{Z}_{c*}^{{\prime} }\right) \propto {{\rm{P}}}\left({Z}_{c*}^{{\prime} } | h\left({Z}_{c*}^{{\prime} }\right)=g\right){{\rm{P}}}\left(h\left({Z}_{c*}^{{\prime} }\right)=g\right) \\ \propto {{\rm{P}}}\left(h\left({Z}_{c*}^{{\prime} }\right)=g\right){\prod }_{m=1}^{M}{\left({p}_{{C}_{g}m}^{(1)}\right)}^{{Z}_{{cm}}^{{\prime} }}{\left(1-{p}_{{C}_{g}m}^{(1)}\right)}^{1-{Z}_{{cm}}^{{\prime} }}$$
(10)
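The two classifiers in Eqs. 8 and 10 differ only in whether the open probability is multiplied by the capturing probability of the query cell, which the sketch below makes explicit. The function name `log_posterior`, the small clipping constant `eps` that guards the logarithms, and the treatment of \({q}_{c}\) as a given input are assumptions for illustration.

```python
import numpy as np

def log_posterior(z_new, p_ref, q_c=1.0, log_prior=None, eps=1e-12):
    """Normalized log posterior over cell types for one binarized query cell.

    z_new     : binarized observations for the cell, shape (peaks,).
    p_ref     : reference open probabilities, shape (cell types, peaks).
    q_c       : capturing probability of the query cell; q_c = 1 gives Eq. 10.
    log_prior : log prior per cell type (uniform if None).
    """
    z = np.asarray(z_new, dtype=float)
    p = np.asarray(p_ref, dtype=float)
    pq = np.clip(p * q_c, eps, 1 - eps)          # assumed numerical guard
    loglik = (z * np.log(pq) + (1 - z) * np.log1p(-pq)).sum(axis=1)
    if log_prior is None:
        log_prior = np.zeros(p.shape[0])
    unnorm = loglik + log_prior
    return unnorm - np.logaddexp.reduce(unnorm)  # normalize in log space

p_ref = np.array([[0.9, 0.1], [0.1, 0.9]])
print(np.exp(log_posterior([1, 0], p_ref, q_c=0.5)))  # missing-corrected (Eq. 8)
print(np.exp(log_posterior([1, 0], p_ref)))           # Naive Bayes (Eq. 10)
```

Working in log space avoids underflow when the product runs over a large number of peaks.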

Missing-corrected cumulative logistic regression (mcCLR)

Due to the high sparsity of scATAC-seq data, perfect separability is common, hindering the parameter estimation in (Eq. 1). To address this issue, we incorporated Firth regularization (Eq. 2). Here we summarize the (unregularized) log-likelihood function and information matrix for the cumulative response model and derive the analytical expression for the binary model. The loss function, when considering the cumulative response, is

$$\log L({{\mathbf{\Pi }}},{{\boldsymbol{z}}}|{{\boldsymbol{q}}})={\sum }_{c=1}^{C}{\sum }_{t=0}^{T}\log ({\widetilde{\pi }}_{{ct}})1({z}_{c}=t)$$
(11)

where C represents the total number of cells, \({\pi }_{{ct}}\) and \({\widetilde{\pi }}_{{ct}}\) represent the probability of \(t\) PIC counts in cell \(c\) before and after accounting for cell-specific capturing probability, respectively. Specifically, \({\pi }_{{ct}}=P\left({y}_{c}\ge t\right)-P({y}_{c}\ge t+1)\), \({\Pi }_{c}={\left({\pi }_{c0},{\pi }_{c1},{\pi }_{c2},\ldots,{\pi }_{{cT}}\right)}^{{{\rm{Trans}}}}\) and \({\widetilde{\Pi }}_{c}={Q}_{c}{\Pi }_{c}\), where \({Q}_{c}\) is the capturing probability matrix of dimension\(\,(T+1)\times (T+1)\) specified as

$${Q}_{c}=\left[\begin{array}{ccccc}1 & 1-{q}_{c} & 1-{q}_{c} & \ldots & 1-{q}_{c}\\ 0 & {q}_{c} & 0 & & \\ 0 & 0 & {q}_{c} & & \vdots \\ \vdots & & & \ddots & \\ 0 & & \ldots & & {q}_{c}\end{array}\right]$$
(12)
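To make the structure of Eq. 12 concrete, the following sketch builds \({Q}_{c}\) and applies it to a vector of latent level probabilities. The function name `capture_matrix` and the toy numbers are illustrative only.

```python
import numpy as np

def capture_matrix(q_c, T):
    """Build the (T+1) x (T+1) capturing matrix Q_c of Eq. 12."""
    Q = np.zeros((T + 1, T + 1))
    Q[0, 0] = 1.0              # a latent zero is always observed as zero
    Q[0, 1:] = 1.0 - q_c       # a latent t >= 1 is missed with probability 1 - q_c
    idx = np.arange(1, T + 1)
    Q[idx, idx] = q_c          # and observed as t with probability q_c
    return Q

pi_c = np.array([0.5, 0.3, 0.2])      # latent P(y = 0), P(y = 1), P(y = 2)
Q = capture_matrix(0.5, T=2)
pi_tilde = Q @ pi_c                   # observed-scale probabilities
print(pi_tilde)
```

Each column of \({Q}_{c}\) sums to one, so \({\widetilde{\Pi }}_{c}\) remains a probability vector, with the mass lost from non-zero levels moved to the zero level.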

In our PACS model, an approximated estimation of parameters in the cumulative logit model was obtained using a method described in a previous set of studies57,58 that was based on stacking the data and optimizing with binary logistic regression specified by

$$\log L({{\boldsymbol{p}}},{{\boldsymbol{z}}}|{{\boldsymbol{q}}})={\sum }_{c=1}^{C}\left[{z}_{c}\log ({p}_{c}{q}_{c})+(1-{z}_{c})\log (1-{p}_{c}{q}_{c})\right]$$
(13)
$$I\left({{\boldsymbol{\beta }}}\right)={F}^{T}{WF},\;\,{{\mathrm{where}}}\, W={{\rm{diag}}}\left\{\frac{{p}_{c}{q}_{c}(1-{p}_{c})^2}{1-{p}_{c}{q}_{c}}\right\}$$
(14)

where \({p}_{c}=P\left({y}_{c}=1\right)\) is the latent open probability, so that \(P\left({z}_{c}=1\right)={p}_{c}{q}_{c}\).

Parameter estimation for mcCLR

We implemented Newton’s and Iterative Reweighted Least Squares (IRLS) methods for parameter estimation. Briefly, for Newton’s method, β is estimated through the following iteration

$${{{\boldsymbol{\beta }}}}^{(s+1)}={{{\boldsymbol{\beta }}}}^{(s)}+{{I^{\prime}} }^{-1}\left({{{\boldsymbol{\beta }}}}^{\left(s\right)}\right){U}^{*}\left({{{\boldsymbol{\beta }}}}^{\left(s\right)}\right)$$
(15)

where the superscript \(s\) represents the iteration, \({I}^{{\prime} }=I\) for the full model, and \({I}^{{\prime} }={I}_{-\{d\}}\) for the null model of \({\beta }_{\left\{d\right\}}=0\). The score function \({U}^{*}\left(\beta \right)\) is given by:

$${U}^{*}\left({\beta }_{r}\right) =U\left({\beta }_{r}\right)+\frac{1}{2}{{\rm{trace}}}\left[I{\left({{\boldsymbol{\beta }}}\right)}^{-1}\frac{\partial I\left({{\boldsymbol{\beta }}}\right)}{\partial {\beta }_{r}}\right] \\ ={\sum }_{c=1}^{C}\frac{{f}_{{cr}}({z}_{c}-{p}_{c}{q}_{c})(1-{p}_{c})}{1-{p}_{c}{q}_{c}}+\frac{1}{2}{\sum }_{c=1}^{C}{f}_{{cr}}{k}_{c}{h}_{c},\;(r=1,\ldots,p)$$
(16)

where \({h}_{c}\) is the \({c}^{{{\rm{th}}}}\) diagonal element of the “hat” matrix, \(H={W}^{1/2}F{\left({F}^{T}{WF}\right)}^{-1}{F}^{T}{W}^{1/2}\), and \({k}_{c}=(2{p}_{c}^{2}{q}_{c}-3{p}_{c}+1)/(1-{p}_{c}{q}_{c})\).

For the IRLS method, the information matrix \(I\) is replaced with an estimate of the information matrix, \(\widetilde{I}\),

$$\widetilde{I}\left({{\boldsymbol{\beta }}}\right)={F}^{T}\widetilde{W}F,\;{{\rm{where}}}\; \widetilde{W}={{\rm{diag}}}\left\{-\frac{\left[-{p}_{c}^{2}{q}_{c}^{2}+{q}_{c}\left(2{p}_{c}+{z}_{c}-1\right)-{z}_{c}\right]{p}_{c}(1-{p}_{c})}{{\left(1-{p}_{c}{q}_{c}\right)}^{2}}\right\}$$
(17)
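The quantities entering the Newton update for the binary model (Eqs. 13–16) can be sketched as below. This is an illustrative sketch under a logit link, \({p}_{c}={{\rm{expit}}}({F}_{c*}{{\boldsymbol{\beta }}})\), not the PACS code: all function names are hypothetical, and step-size control, convergence checks, and the null-model restriction are omitted.

```python
import numpy as np

def _open_prob(beta, F):
    return 1.0 / (1.0 + np.exp(-(F @ beta)))       # p_c under the logit link

def mc_loglik(beta, F, z, q):
    """Unregularized log-likelihood of Eq. 13."""
    p = _open_prob(beta, F)
    return np.sum(z * np.log(p * q) + (1 - z) * np.log(1 - p * q))

def mc_weights(beta, F, q):
    """Diagonal of W in Eq. 14."""
    p = _open_prob(beta, F)
    return p * q * (1 - p) ** 2 / (1 - p * q)

def mc_score(beta, F, z, q):
    """Unpenalized score U(beta), the first sum in Eq. 16."""
    p = _open_prob(beta, F)
    return F.T @ ((z - p * q) * (1 - p) / (1 - p * q))

def hat_diag(beta, F, q):
    """Diagonal of H = W^(1/2) F (F^T W F)^(-1) F^T W^(1/2)."""
    w = mc_weights(beta, F, q)
    info_inv = np.linalg.inv(F.T @ (w[:, None] * F))
    return w * np.einsum("ci,ij,cj->c", F, info_inv, F)

def firth_score(beta, F, z, q):
    """Penalized score U*(beta) of Eq. 16."""
    p = _open_prob(beta, F)
    k = (2 * p ** 2 * q - 3 * p + 1) / (1 - p * q)
    return mc_score(beta, F, z, q) + 0.5 * F.T @ (k * hat_diag(beta, F, q))

def newton_step(beta, F, z, q):
    """One update of Eq. 15 for the full model."""
    w = mc_weights(beta, F, q)
    info = F.T @ (w[:, None] * F)
    return beta + np.linalg.solve(info, firth_score(beta, F, z, q))

# Toy data: intercept plus one binary covariate (values are illustrative).
F = np.array([[1, 0], [1, 0], [1, 1], [1, 1], [1, 0], [1, 1]], dtype=float)
z = np.array([0, 1, 1, 1, 0, 0], dtype=float)
q = np.array([0.5, 0.9, 0.8, 0.6, 0.7, 0.9])
beta = np.array([0.1, -0.2])
print(newton_step(beta, F, z, q))
```

When every \({q}_{c}=1\), the weights reduce to \({p}_{c}(1-{p}_{c})\) and \({k}_{c}\) reduces to \(1-2{p}_{c}\), which recovers the familiar Firth-penalized logistic regression.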

Hypothesis testing framework of mcCLR

We utilized a generalized likelihood ratio test framework for hypothesis testing with the mcCLR model, although a Wald-type test can also be derived. As the model contains Firth regularization, we used the profile penalized likelihood approach to obtain p values31,59. Specifically, in the null model, the coefficients of interest are set to zero but still left in the model, so that the regularization accounts for the presence of these parameters during optimization.

Data simulation for single factor differential test

To mimic real data, we estimated insertion rates (\({\lambda }_{{C}_{f}m}\)) and \({q}_{c}\) from the human cell line data and used these values to construct simulated data. Because viable scATAC-seq reads come from two adjacent Tn5 insertion events with the right primer configuration (reviewed in ref. 60), we derived the size-filtered signed Poisson (ssPoisson) distribution from modeling this data generation process24. With the observed counts, we estimated the insertion rate parameters for two cell types. Regions with a true insertion rate difference greater than 0.1 were set as truly differential (Ha), and the open probabilities of the remaining regions were set equal (by taking the mean) and therefore non-differential (H0). Based on the parametric model of latent and observed accessibility, we first sampled the latent ATAC reads by \({{\rm{ssPoisson}}}({\hat{\lambda }}_{{C}_{f}m})\) for \(f=1,2\), and then sampled the observation status from a Bernoulli distribution parameterized by \({q}_{c}\). The observed data were generated by the element-wise product of these two matrices. We randomly sampled 10,000 non-differential features to assess the type I error and 10,000 differential features to evaluate power. This simulation was conducted under varying numbers of cells in each group (from 250 to 1000), and each scenario was repeated 5 times.

Data simulation for multi-factor differential test

Building upon the single-factor setting, we assumed the data contained two cell types (T1 and T2) sampled from two spatial locations (S1 and S2). We evaluated marginal effects and their interactions through separate simulations. To simulate data with marginal effect, the cell type effect is first introduced using the same method as in single-factor simulation, so the effect sizes vary. The spatial effect was then considered to affect features with and without a cell-type effect. Specifically, a third of the features with (and without) a cell type effect showed an accessibility difference across batches, with a log fold change of \(\pm 0.5\). We introduced sample imbalance as frequently seen in real datasets. Specifically, we considered that S1 contained 1600 T1 and 800 T2 cells, while S2 contained 400 T1 and 1200 T2 cells. The peak-by-cell count data generation procedure is the same as for the single-factor setting. Two strategies were used to evaluate the performance of methods that do not support multi-factor testing: the naïve test and the stratified test, as reported in the main text. Following the stratified test, we use Edgington’s p value combination approach as it assumes consistent effect size across strata.
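For readers unfamiliar with the combination step, the sketch below shows Edgington's additive method in its textbook form: the sum of the stratum p values is referred to the distribution of a sum of independent uniform variables (the Irwin–Hall distribution). The function name `edgington` is illustrative, and this is not necessarily the implementation used in the study.

```python
import math

def edgington(pvals):
    """Combine independent p values by Edgington's additive method.

    Returns P(S <= s), where S is a sum of k independent Uniform(0, 1)
    variables and s is the observed sum of the p values.
    """
    k = len(pvals)
    s = sum(pvals)
    total = 0.0
    for j in range(int(math.floor(s)) + 1):
        total += (-1) ** j * math.comb(k, j) * (s - j) ** k
    return total / math.factorial(k)

print(edgington([0.1, 0.2]))   # two strata
print(edgington([0.3]))        # a single stratum returns the p value itself
```

The alternating sum can lose precision when many p values are combined, so this direct formula is best suited to a small number of strata, such as the two spatial locations considered here.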

To evaluate the interaction effect, we considered two configurations of interaction. In the first configuration, cells of T1 in S1 are highly accessible while other groups are lowly accessible, with effect sizes also estimated from the cell line data. In the second configuration, cells of T1 in S1 and cells of T2 in S2 are highly accessible while other groups are lowly accessible. The second configuration may not be common in real biological data, but a method capable of testing the interaction effect should be able to identify it.

Data simulation for time-series differential test

To evaluate model performance in situations where the design matrix contains ordinal covariates, we simulated time-series scATAC-seq data across five time points. We assumed linear and quadratic temporal effects on accessibility and set the effect size (log fold change) to be 0.3 or 0.5 between the two groups. The baseline accessibility was generated from the cell line data, and the peak-by-cell count data generation procedure is the same as for the single-factor setting. N = 1000 cells were sampled for each time point.

Evaluating type I error and power in real datasets

To estimate type I error in real data where the ground truth is unknown, we used a label permutation approach, where the data in one cell type were divided randomly into two groups and a differential test was conducted between these groups. As this is randomly assigned, all features were considered non-DAR, so the proportion of P values smaller than 0.05 is the empirical type I error using real data. Then, we set the fifth rank percentile as the correct critical value for those methods with type I errors greater than 0.05. We next conducted a test with two different cell types using the calibrated critical values for each method. Since we do not know the true DAR set, we defined the pseudo-true DAR peaks as the union DAR set of all tested methods, using their corresponding new critical values. Power for each method was then calculated by the number of DARs detected divided by the number of pseudo-true DARs. This approach is adopted from ref. 24.
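The calibration procedure described above can be sketched as follows. The function names, the use of `np.quantile` with its default interpolation for the fifth percentile, and the representation of DAR calls as Python sets are assumptions made for illustration.

```python
import numpy as np

def calibrate_critical_value(null_pvals, alpha=0.05):
    """Empirical type I error and calibrated cutoff from permutation p values.

    null_pvals : p values from testing two random halves of one cell type.
    Returns (empirical type I error, critical value to use for this method).
    """
    null_pvals = np.asarray(null_pvals, dtype=float)
    type1 = float(np.mean(null_pvals < alpha))
    # Only methods with inflated type I error are recalibrated.
    crit = float(np.quantile(null_pvals, alpha)) if type1 > alpha else alpha
    return type1, crit

def pseudo_power(dar_sets):
    """Power of each method relative to the union of all methods' DAR calls.

    dar_sets : dict mapping method name -> set of peaks called at the
               calibrated cutoff (assumed non-empty overall).
    """
    union = set().union(*dar_sets.values())
    return {name: len(calls) / len(union) for name, calls in dar_sets.items()}

inflated = (np.arange(1, 101) / 100) ** 2        # toy, anti-conservative p values
print(calibrate_critical_value(inflated))
print(pseudo_power({"A": {1, 2}, "B": {2, 3, 4}}))
```

Because the pseudo-true set is the union of calls across methods, the resulting power values are relative to the set of methods compared and are not absolute.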

Estimating effect size (fold change and accessibility change)

A common practice to determine differential features in single-cell data is setting a cutoff for p-value and fold change. In scRNA-seq data analysis, one way to estimate the effect size of a particular variable (predictor) is by calculating the fold change (FC) for the normalized data obtained by dividing the normalized mean expression of one group by the other group. However, with scATAC-seq data, no direct normalization method is available, and computing the fold change on raw read counts may lead to inaccuracies due to disparities in data capture. Here, we propose to use the capturing probability-adjusted count to compute fold change (FC) or the arithmetic difference between accessibility (accessibility change, AC) of two cell types. To be precise:

$${{\rm{FC}}}=\frac{{\sum }_{c\in {C}_{1}}{Z}_{{cm}}/{q}_{c}}{{\sum }_{c\in {C}_{2}}{Z}_{{cm}}/{q}_{c}},\quad {{\rm{AC}}}={\sum }_{c\in {C}_{1}}{Z}_{{cm}}/{q}_{c}-{\sum }_{c\in {C}_{2}}{Z}_{{cm}}/{q}_{c}$$
(18)

where \(m\) is the feature of interest and \({C}_{1}\) and \({C}_{2}\) are the lists of cells that contain foreground and background cell types.
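A direct transcription of Eq. 18 is sketched below. The function name `adjusted_fc_ac`, the boolean-mask interface, and returning infinity when the background sum is zero are illustrative assumptions.

```python
import numpy as np

def adjusted_fc_ac(z_m, q, in_c1, in_c2):
    """Capturing probability-adjusted fold change and accessibility change (Eq. 18).

    z_m   : observed counts of one feature across cells.
    q     : capturing probability of each cell.
    in_c1 : boolean mask of foreground cells (C_1).
    in_c2 : boolean mask of background cells (C_2).
    """
    adj = np.asarray(z_m, dtype=float) / np.asarray(q, dtype=float)
    s1 = adj[np.asarray(in_c1)].sum()
    s2 = adj[np.asarray(in_c2)].sum()
    fc = s1 / s2 if s2 > 0 else np.inf   # assumption: undefined ratio reported as inf
    return fc, s1 - s2

z_m = [1, 0, 2, 1]
q = [0.5, 0.5, 1.0, 0.5]
fc, ac = adjusted_fc_ac(z_m, q, [True, True, False, False], [False, False, True, True])
print(fc, ac)
```

The sketch follows Eq. 18 in using sums over cells. When the two groups differ in size, comparing per-cell means instead may be preferable, although that variant goes beyond what is stated in the text.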

Processing kidney adult data with Signac

We used Signac23 to evaluate the effectiveness of our method in correcting for batch effect at the feature level. We follow the standard workflow as recommended in the Signac vignette (https://stuartlab.org/signac/articles/pbmc_vignette.html). Briefly, we used the TF-IDF approach without feature selection (min.cutoff = ‘q0’), followed by SVD to reduce dimensionality. We then conduct clustering and UMAP visualization using the dimensions 2–30 (as the first LSI dimension usually reflects sequencing depth, per the Seurat tutorial). The sample and cell type labels are retrieved from the annotations in the initial publication.

Batch mixing score calculation

We calculated the batch mixing scores in the PCA space to measure the batch effect. At the cell level, the batch mixing score is adapted from ref. 39 and is defined as the proportion of nearest neighbor cells with different batch identities, where a higher score indicates better mixing between batches and, thus, a smaller batch effect. At the whole data level, the batch mixing score is defined as the mean batch mixing score across all cells. To calculate the expected batch mixing score for a given dataset when no batch effect is present, let \(M\) denote a cell type-by-batch matrix, with each element \({m}_{{ij}}\) representing the number of cells in the cell type \(i\) and batch \(j\). Then, the expected data-level batch mixing score in the setting of no batch effect is given by

$${{\rm{E}}}\left[{{\rm{batch}}} \, {{\rm{mixing}}} \, {{\rm{score}}}\right]=\frac{1}{{\sum }_{i,j}{m}_{{ij}}}\left\{{\sum }_{i}\left[{\sum }_{j}{m}_{{ij}}\left(\frac{{\sum }_{k\ne j}{m}_{{ik}}}{{\sum }_{k}{m}_{{ik}}}\right)\right]\right\}$$
(19)

The normalized batch mixing score is the batch mixing score divided by the expected score under random mixing, and thus a higher normalized batch mixing score indicates better mixing across samples.
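The expected score in Eq. 19 depends only on the cell type-by-batch count table, as the sketch below illustrates. The function name `expected_batch_mixing` is hypothetical, and each cell type is assumed to contain at least one cell.

```python
import numpy as np

def expected_batch_mixing(m):
    """Expected data-level batch mixing score with no batch effect (Eq. 19).

    m : cell type-by-batch count matrix; m[i, j] is the number of cells of
        cell type i in batch j.
    """
    m = np.asarray(m, dtype=float)
    type_total = m.sum(axis=1, keepdims=True)
    # For a cell of type i in batch j: fraction of type-i cells from other batches.
    other_frac = (type_total - m) / type_total
    return (m * other_frac).sum() / m.sum()

counts = np.array([[5, 5], [10, 0]])
expected = expected_batch_mixing(counts)
observed = 0.2                        # illustrative observed data-level score
print(expected, observed / expected)  # expected score and normalized score
```

A cell type found in only one batch contributes zero to the expectation, so the normalization accounts for imbalance in cell type composition across batches.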

Processing developing human brain data

This dataset contains 18 specimens collected from human donors. Our study excluded samples with unknown spatial locations (GW17, GW18, GW21) or samples not from the cortex (MGE_GW20 and MGE_twin34). Here, we focused on the excitatory neuron lineage, including radial glia (RG), intermediate progenitor cells (IPCs), early excitatory neurons (earlyEN), deep layer excitatory neurons (dlENs), and upper layer excitatory neurons (ulENs). We further excluded the insular region for having too few cell counts (645 cells across five cell types). Since the data matrix was saved as a binary matrix, we implemented the missing-corrected logistic regression model to analyze this data.

DAR identification in the developing human brain data

We constructed two models to identify significant region effects in the excitatory neuron lineage. Specifically, to identify the region effect, the systematic component of the PACS model is specified as:

$$\alpha+{\sum }_{k=2}^{K}{\gamma }_{k}1({G}_{c}={g}_{k})+{\sum }_{l=2}^{L}{\zeta }_{l}1({S}_{c}={S}_{l})+{\sum }_{m=2}^{M}{\tau }_{m}1({D}_{c}={D}_{m})$$
(20)

where \(G\) is the index of cell type, \(S\) is the index of spatial location, and \(D\) is the index of the donor. The null hypothesis for the test is \({H}_{0}:\zeta=0\), where \(\zeta=({\zeta }_{2},\ldots,{\zeta }_{L})\) collects the spatial location coefficients. To identify the cell type-specific region effect, we included the interaction terms between each cell type and spatial location, and the test was conducted for each interaction term.
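To make the indicator coding in Eq. (20) concrete, the following sketch builds a design matrix in which the first level of each factor serves as the reference and records which columns carry the spatial location coefficients under test. The function names and the example labels are hypothetical, and the sketch covers only the systematic component; it does not implement the PACS likelihood or test.

```python
import numpy as np


def dummy_code(labels):
    """Indicator columns for every level except the first (reference) level."""
    labels = np.asarray(labels)
    levels = sorted(set(labels.tolist()))
    non_reference = levels[1:]
    if not non_reference:
        return np.empty((len(labels), 0)), []
    columns = [(labels == level).astype(float) for level in non_reference]
    return np.column_stack(columns), [str(level) for level in non_reference]


def build_design(cell_type, region, donor):
    """Design matrix for intercept + cell type + spatial location + donor.

    Returns the matrix, its column names, and the indices of the
    spatial location columns, i.e. the coefficients set to zero under H0.
    """
    n = len(cell_type)
    blocks, names, tested = [np.ones((n, 1))], ["intercept"], []
    for prefix, labels in (("celltype", cell_type),
                           ("region", region),
                           ("donor", donor)):
        mat, levels = dummy_code(labels)
        start = len(names)
        blocks.append(mat)
        names += [f"{prefix}:{level}" for level in levels]
        if prefix == "region":
            tested = list(range(start, start + mat.shape[1]))
    return np.hstack(blocks), names, tested
```

Dropping the `tested` columns from the matrix gives the reduced design corresponding to the null hypothesis.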

Motif enrichment analysis

The motif enrichment analysis was conducted with Homer61. The list of significant DAR peaks was used as input for the analysis, with the size of the search region specified as 300 bp around the peak center. The reported motif enrichment scores are FDR-corrected P values from the known motif results.

DAR identification in the human PBMC treatment data

To identify the cell type-specific temporal effect in the PBMC treatment data, the systematic component of the PACS model is specified as:

$$\alpha+{\sum }_{k=2}^{K}{\gamma }_{k}1({G}_{c}={g}_{k})+\kappa E+{\sum }_{t=2}^{M}{\omega }_{t}1({D}_{c}={D}_{t})$$
(21)

where \(G\) is the index of cell type, \(E\) is the experimental time index (0, 1, and 2 correspond to control, 1 h, and 6 h after treatment, respectively), and \(D\) is the donor index. The null hypothesis for the test is \({H}_{0}:\kappa=0\).

Gene and pathway enrichment with GREAT

We used the GREAT method (v. 4.0.4) for gene and pathway enrichment analysis43, with DARs as input and default parameter settings. The output from GREAT for the human PBMC data can be found in Supplementary Data 17 and 18.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.