Depth-corrected multi-factor dissection of chromatin accessibility for scATAC-seq data with PACS

Miao, Zhen; Wang, Jianqiao; Park, Kernyu; Kuang, Da; Kim, Junhyong

doi:10.1038/s41467-024-55580-5

Published: 05 January 2025

Nature Communications volume 16, Article number: 401 (2025)

Abstract

Single cell ATAC-seq (scATAC-seq) experimental designs have become increasingly complex, with multiple factors that might affect chromatin accessibility, including genotype, cell type, tissue of origin, sample location, batch, etc., whose compound effects are difficult to test by existing methods. In addition, current scATAC-seq data present statistical difficulties due to their sparsity and variations in individual sequence capture. To address these problems, we present a zero-adjusted statistical model, Probability model of Accessible Chromatin of Single cells (PACS), that allows complex hypothesis testing of accessibility-modulating factors while accounting for sparse and incomplete data. For differential accessibility analysis, PACS controls the false positive rate and achieves a 17% to 122% higher power on average than existing tools. We demonstrate the effectiveness of PACS through several analysis tasks, including supervised cell type annotation, compound hypothesis testing, batch effect correction, and spatiotemporal modeling. We apply PACS to datasets from various tissues and show its ability to reveal previously undiscovered insights in scATAC-seq data.

Introduction

Single-cell ATAC-seq (scATAC-seq) is a powerful assay for profiling the epigenetic status of open chromatin in individual cells^1,2. It has been applied to study gene regulation across tissues and under various conditions, including homeostasis^3,4,5, development^6,7, or disease^8,9. The cis-regulatory elements (CREs), modulated by nucleosome turnover and occupancy¹⁰, display variable accessibility across cells. In a cell, the dynamics of CRE activity depend on various physiological factors such as cell type^1,3, developmental state^6,7, spatial location of the tissue^11,12, as well as the interaction of these factors with genetic variation^13,14. Identifying the sets of elements whose accessibility is governed by genetic, developmental, and physiological factors is essential in understanding the cis-regulatory codes of biological processes^15,16. Moreover, prioritizing regulatory elements is instrumental in identifying relevant cell types associated with complex traits^5,17,18,19,20. Dissecting genetic, epigenetic, and environmental factors requires analysis methods that allow compound hypothesis testing. However, the discrete and sparse nature of ATAC-seq data presents technical challenges for existing approaches. Here, we establish a model framework and a new statistical method for complex hypothesis testing for single-cell ATAC-seq data.

Among all the factors that drive the accessibility of CREs, only some factors are experimentally controlled. In a typical single-cell experiment, the collection of cells is a random sample of a cell’s variable states over the unknown factors (e.g., cell cycle stage, metabolic stage) while controlling for the known factors (e.g., tissue, location, batch). We note that sometimes, the values of these factors are estimated from the data, such as unsupervised inference of cell type labels or time sequences. Nevertheless, as the data are sampled over unknown microstates and stochastic molecular processes, the latent accessibility of a CRE should be considered as a random variable, even without experimental variability.

A central theme in scATAC-seq data analyses is understanding the dynamics of CREs across cell types and biological conditions. This is typically achieved by conducting a statistical test for Differentially Accessible Regions (DARs) between different conditions of hypothetical causal factors that might govern the chromatin structure (e.g., treatment vs control). However, most existing approaches for identifying DARs rely on pairwise testing of one putative causal factor at a time (e.g., ArchR²¹, snapATAC²², or contingency table-based approaches) and therefore do not allow for simultaneously testing multiple factors or testing continuous variables (such as temporal effects). A common practice in the literature is to ignore other covariates while testing the factor of interest. However, as schematically diagrammed in Fig. 1, testing one factor at a time can create false negatives and false positives in DARs of primary factors and cannot test for interactions. Although Seurat/Signac²³ uses the logistic regression framework, which can conduct multi-factor testing theoretically, its current implementation creates spurious dependency problems that result in mis-calibrated tests (Supplemental Note 1).

**Fig. 1: PACS (Probability model of Accessible Chromatin in Single Cells) framework.**

Besides the key problem of the inability to conduct hypothesis testing for multiple causal factors together, there are two additional unresolved technical problems: (1) scATAC-seq data displays heterogeneous sequencing coverage in each cell, so the cells are not directly comparable; (2) scATAC-seq data is sparse but also contains quantitative information at each peak²⁴. To address these challenges, we present a new statistical framework, missing-corrected cumulative logistic regression²⁵ (mcCLR, or missing corrected ordinal logistic regression, mcOLS), for the analysis of scATAC-seq data with modified Firth regularization^26,27 to account for data sparsity.

With this statistical framework, we present our Probability model of Accessible Chromatin of Single cells (PACS), a toolkit for scATAC-seq analysis. PACS allows complex compound analysis tasks in scATAC-seq data, including cell type classification, feature-level batch effect correction, and spatiotemporal data analysis. With simulated and real data, we show that PACS effectively controls false positives while maintaining sensitivity for model testing. We apply PACS to a mouse kidney dataset, a developing human brain dataset, and a time-series PBMC treatment dataset, all of which have complex study designs, to demonstrate its capability to model multiple sources of variations for hypothesis-driven biological inference.

Results

A probabilistic model of accessible peaks and statistical test framework

In the PACS framework, we model the accessibility state of CREs in a single cell as a function of predictive factors such as cell type, physiological/developmental time, spatial region, etc. We use a matrix ${F}_{C\times J}$ to represent these variables in each cell, where C is the number of cells and J is the number of predictive variables (including dummy variables). Let ${Y}_{C\times M}$ represent an integer-valued scATAC-seq count matrix across C cells and M genomic regions. For empirical ATAC-seq data, these M regions are determined by data-dependent peak calling, where peaks are regarded as the set of candidate CREs^28,29. As scATAC-seq can recover quantitative information on the density and distribution of nucleosomes^24,30, we use integer values ${Y}_{cm}\in \{0,1,2,\ldots \}$ to represent the level of accessibility. Existing pipelines diverge in quantifying scATAC-seq counts, and we propose using the paired insertion count (PIC) matrix as a uniform input for downstream analyses²⁴. The integer-valued PIC counts are observed measurements of the latent chromatin accessibility (Fig. 1, upper panel). This latent variable perspective motivated us to use the cumulative logit model to decompose the cumulative distribution of ${Y}_{cm}$ by its predictive variables ${F}_{c*}$.
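As a minimal sketch of how a design matrix such as $F$ might be assembled, the snippet below dummy-codes two categorical factors and adds their interaction columns. The helper name `build_design_matrix`, the choice of the first sorted level as the reference, and the toy labels are illustrative assumptions and are not taken from the PACS software. No intercept column is included because the model carries level-specific intercepts separately.

```python
def build_design_matrix(cell_types, locations):
    """Dummy-code two categorical factors plus their interaction.

    Hypothetical helper for illustration only. The first sorted level of
    each factor is treated as the reference and receives no column.
    """
    type_levels = sorted(set(cell_types))[1:]
    loc_levels = sorted(set(locations))[1:]
    columns = (
        [f"type_{t}" for t in type_levels]
        + [f"loc_{s}" for s in loc_levels]
        + [f"type_{t}:loc_{s}" for t in type_levels for s in loc_levels]
    )
    rows = []
    for ct, loc in zip(cell_types, locations):
        type_part = [int(ct == t) for t in type_levels]
        loc_part = [int(loc == s) for s in loc_levels]
        # Interaction indicator is 1 only when both main-effect dummies are 1.
        inter_part = [a * b for a in type_part for b in loc_part]
        rows.append(type_part + loc_part + inter_part)
    return columns, rows


# Four toy cells covering every combination of two cell types and two locations.
columns, F = build_design_matrix(
    ["T1", "T1", "T2", "T2"],
    ["S1", "S2", "S1", "S2"],
)
```

Here each row of `F` corresponds to one cell and J equals the number of returned columns (three in this toy case).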

Due to cell-specific nucleosome preparation and sequencing depth, the scATAC-seq output may miss sequence information from certain accessible chromatin regions (Fig. 1, middle panel). As with other genomics data, these missing data points result in zero counts rather than NA counts. Therefore, it is crucial to distinguish between true zeros (closed chromatin) and technical zeros (missing data). Here, we use ${R}_{C\times M}$, with binary values, to represent the read recovery/capturing status for each cell and region. This matrix encapsulates all the experimental factors (Tn5 activities, sequencing depth, etc.) that result in a disparity of recovered reads among cells. The observed chromatin states, denoted by ${Z}_{C\times M}$, are specified by the element-wise product between the latent accessibility ${Y}_{C\times M}$ and the capturing status ${R}_{C\times M}$. Since various experimental factors are cell-specific, we further assume the capturing probability ${\rm{P}}({R}_{cm}=1)$ to be unique to each cell but common to all peaks in that cell, and thus we use ${q}_{c}$ to denote this conditional read capturing probability in cell c.
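The relationship between latent accessibility, capturing status, and observed counts can be illustrated with a small simulation. This is a sketch under the stated assumption that capture events are independent Bernoulli draws with a cell-specific probability; the function name and the toy values are hypothetical.

```python
import random


def simulate_observed_counts(latent, capture_prob, rng):
    """Return Z = Y * R element-wise, with R[c][m] ~ Bernoulli(q_c)."""
    observed = []
    for y_row, q_c in zip(latent, capture_prob):
        r_row = [1 if rng.random() < q_c else 0 for _ in y_row]
        observed.append([y * r for y, r in zip(y_row, r_row)])
    return observed


rng = random.Random(0)
# Three cells with identical latent accessibility across four peaks.
Y = [[2, 0, 1, 1], [2, 0, 1, 1], [2, 0, 1, 1]]
# Capturing probabilities: fully captured, partly captured, never captured.
q = [1.0, 0.5, 0.0]
Z = simulate_observed_counts(Y, q, rng)
```

The second peak is a true zero in every cell, whereas zeros elsewhere in `Z` are technical zeros caused by failed capture, even though the three cells share the same latent state.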

Motivated by the latent variable model and to account for cell-specific missing data, we extended the cumulative logit model to decompose accessibility as:

$$\begin{aligned}{\rm{logit}}\left({\rm{P}}({Y}_{cm}\ge t)\right)&={\alpha }^{(t)}+\sum_{j=1}^{J}{\beta }_{j}{F}_{cj},\\ {\rm{where}}\;{\rm{P}}\left({Z}_{cm}\ge t\right)&={\rm{P}}\left({Y}_{cm}\ge t\right){q}_{c};\quad t\in \{1,2,\ldots,T\}\end{aligned}\tag{1}$$

where ${q}_{c}$ is the capturing probability for a cell c, and ${\rm{P}}({Y}_{cm}\ge t)$ is the sampling probability of cells with accessibility level greater than or equal to t. The maximum value of accessibility we account for, $T$, is feature-specific. For a feature $m$, $T$ is the largest integer $t$ such that ${\sum }_{c}1({Z}_{cm}\ge t)/{\sum }_{c}1({Z}_{cm}\ge 1)\ge h$, where $h$ is a hyperparameter with a default value of 0.25. ${\alpha }^{(t)}$ is the intercept term in the ${t}^{\rm{th}}$ cumulative logit, and ${\beta }_{j}$ is the coefficient for the ${j}^{\rm{th}}$ column of the design matrix. Eq. 1 assumes a proportional odds model, where we have a common set of coefficients ${\beta }_{j}$ for all levels of the cumulative distribution while allowing for a unique constant term ${\alpha }^{(t)}$ for each level. Hereafter, we refer to our method as the mcCLR model, which stands for the missing-corrected Cumulative Logit Regression model. We note that under the assumption of complete data capturing (i.e., ${q}_{c}=1,\,\forall c$), ${Z}_{cm}={Y}_{cm}$ and our model reduces to a generalized linear model (GLM) with sparsity regularization. However, the divergent cell-specific read capturing in scATAC-seq data cannot be accounted for by GLM, necessitating the development of the mcCLR model.
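The two quantities defined above can be sketched in a few lines of Python: the observed tail probabilities ${\rm{P}}({Z}_{cm}\ge t)$ implied by Eq. 1 for one cell and one peak, and the rule for choosing the feature-specific maximum level $T$. The function names and toy inputs are assumptions made for illustration, and the threshold $h$ is assumed to lie in (0, 1].

```python
import math


def observed_tail_probs(alphas, betas, covariates, q_c):
    """P(Z_cm >= t) for t = 1..T under Eq. 1 (one cell, one peak)."""
    linear = sum(b * f for b, f in zip(betas, covariates))
    # Inverse logit gives P(Y >= t); multiplying by q_c gives P(Z >= t).
    return [q_c / (1.0 + math.exp(-(a + linear))) for a in alphas]


def choose_max_level(observed_column, h=0.25):
    """Largest T with #(Z >= T) / #(Z >= 1) >= h for one peak."""
    if not 0.0 < h <= 1.0:
        raise ValueError("h must lie in (0, 1]")
    n_open = sum(1 for z in observed_column if z >= 1)
    if n_open == 0:
        return 0  # peak never observed as accessible
    T = 1
    while sum(1 for z in observed_column if z >= T + 1) / n_open >= h:
        T += 1
    return T


# Intercepts must decrease with t so that the tail probabilities are ordered.
probs = observed_tail_probs(alphas=[0.0, -2.0], betas=[1.0], covariates=[0], q_c=0.5)
max_level = choose_max_level([0, 0, 1, 1, 1, 2, 2, 3])
```

In the toy column, three of the six non-zero cells have counts of at least 2 (ratio 0.5) but only one has a count of at least 3 (ratio about 0.17), so the rule selects $T=2$ at the default $h=0.25$.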

With the formulation above, the effect of a complex set of predictive variables (and their interactions) on accessibility can be tested by the null hypothesis of ${\beta }_{j}=0$ with a likelihood ratio test (Fig. 1, lower panel). One statistical challenge is to estimate ${q}_{c}$ for each cell. Because we assume the same capturing probability within a cell, the problem is tractable and can be computed efficiently. Operationally, we first group the cells by their combination of the treatments and then utilize a coordinate descent algorithm to obtain estimates of ${\rm{P}}({Y}_{cm}\ge 1|{f}_{c})$ and ${q}_{c}$ (Methods).
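To make the likelihood ratio test concrete, the following sketch handles the simplest special case: a single peak, binary observations, two groups, and capturing probabilities that are assumed to be already known. It is not the PACS estimation procedure (which estimates ${q}_{c}$ jointly and applies a penalty); the function names are hypothetical, and each ${q}_{c}$ is assumed to lie in (0, 1].

```python
import math


def loglik(p, z, q):
    """Binary-level log-likelihood with P(Z_c = 1) = p * q_c."""
    return sum(
        math.log(p * qc) if zc else math.log(1.0 - p * qc)
        for zc, qc in zip(z, q)
    )


def max_loglik(z, q, lo=1e-9, hi=1.0 - 1e-9, iters=200):
    """Maximize the concave log-likelihood over p by ternary search."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if loglik(m1, z, q) < loglik(m2, z, q):
            lo = m1
        else:
            hi = m2
    return loglik((lo + hi) / 2.0, z, q)


def lrt_pvalue(z1, q1, z2, q2):
    """Test a shared accessibility p (null) against group-specific p."""
    ll_null = max_loglik(z1 + z2, q1 + q2)
    ll_alt = max_loglik(z1, q1) + max_loglik(z2, q2)
    stat = max(0.0, 2.0 * (ll_alt - ll_null))
    # Chi-square survival function with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


# Toy data: 50 cells per group, all with capturing probability 0.8.
z_a, q_a = [1] * 30 + [0] * 20, [0.8] * 50
z_b, q_b = [1] * 10 + [0] * 40, [0.8] * 50
p_value = lrt_pvalue(z_a, q_a, z_b, q_b)
```

The null model has one free accessibility parameter and the alternative has two, so the statistic is compared against a chi-square distribution with one degree of freedom.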

Another statistical challenge of scATAC-seq is that the data are very sparse, creating a so-called “perfect separation” problem (see ref. ²⁵). Here, we developed a regularized model to resolve this issue by generalizing the Firth logistic regression model^26,31 (Methods). Essentially, a Firth penalty is introduced in the regression model:

$$\log {L}^{*}\left({\boldsymbol{\beta }}|{\boldsymbol{Z}}\right)=\log L\left({\boldsymbol{\beta }}|{\boldsymbol{Z}}\right)+\frac{1}{2}\log \left|I({\boldsymbol{\beta }})\right|\tag{2}$$

where ${L}^{*}$ represents the penalized likelihood, $L$ is the likelihood of the regression model, and $I({\boldsymbol{\beta }})$ is the information matrix. Derivations of the parameter estimation framework are described in the Methods section. With the proposed methods, we aim to more accurately control type I errors and account for technical zeros (due to uneven data capturing) and sparse data. This regression-based model enables testing multiple covariates that jointly determine accessibility while controlling for other covariates or confounders.
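The effect of a Firth-type penalty on perfect separation can be illustrated with the ordinary (not missing-corrected) logistic model for a single group, where the information on the logit scale is $np(1-p)$. The sketch below is an assumption-laden toy, not the generalized penalty derived in the Methods: it maximizes the penalized log-likelihood numerically for a group in which no accessible cell was observed.

```python
import math


def firth_penalized_loglik(p, y, n):
    """Binomial log-likelihood plus 0.5 * log I, with I = n * p * (1 - p)
    the Fisher information for the logit-scale parameter."""
    return (
        y * math.log(p)
        + (n - y) * math.log(1.0 - p)
        + 0.5 * math.log(n * p * (1.0 - p))
    )


def firth_estimate(y, n, iters=200):
    """Maximize the concave penalized log-likelihood by ternary search."""
    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if firth_penalized_loglik(m1, y, n) < firth_penalized_loglik(m2, y, n):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0


p_a = firth_estimate(0, 50)  # group A: 0 of 50 cells accessible
p_b = firth_estimate(7, 50)  # group B: 7 of 50 cells accessible
log_odds_ratio = math.log(p_a / (1.0 - p_a)) - math.log(p_b / (1.0 - p_b))
```

Without the penalty, the maximum likelihood estimate for group A is exactly zero and the log odds ratio diverges. With the penalty, the maximizer in this simple case is $(y+0.5)/(n+1)$, so the estimated effect stays finite.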

Application of PACS to cell type identification

To demonstrate the effectiveness of our model for separating the latent chromatin accessibility from the capturing probability, we evaluated three model assumptions using the task of (supervised) cell type prediction, where the goal is to predict cell types in a new scATAC-seq dataset given an annotated (labeled) dataset.

We first evaluated the accuracy of the estimation procedure of PACS. We simulated groups of cells with a spectrum of both the underlying probability of accessibility (${{\rm{P}}}\left({Y}_{{cm}}\ge 1\right)$, or $p$ in short) across peaks and the capturing probabilities ($q$) across cells (Methods). We then utilized PACS to jointly estimate $p$ and $q$, with n = 1000, 500, or 250 cells. The simulation results show that our estimator can determine both the capturing probabilities and open-chromatin probabilities accurately, with root mean squared errors (RMSE) for the underlying probability of accessibility from 0.028 (n = 1000) to 0.027 (n = 250) and RMSE for capturing probability from 0.0067 (n = 1000) to 0.012 (n = 250, Supplementary Fig. 1a–f, and Supplementary Data 1).

We next tested PACS by applying it to a cell-type label transfer task, comparing it with the Naïve Bayes model. For both models, we started with an estimated ${{{\boldsymbol{p}}}}_{g}$ for each known cell type group label $g$, and then applied the Bayes discriminative model to infer the most probable cell type labels for novel unidentified cells. Naïve Bayes does not assume missing data; thus, it ignores the cell-specific capturing probability. The prediction performances were evaluated with ten-fold cross-validation and holdout methods, where the original cell type labels are regarded as ground truth (Methods). We tested the methods on five datasets, including two human cell line datasets²¹, two mouse kidney datasets⁶, and one marmoset brain dataset³². In the two human cell line datasets, the labels are annotated by their SNP information²¹, so they are regarded as gold standards. For the remaining datasets, the original cell-type labels are generated by clustering and marker-based annotation, so the labels may have errors.
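The difference between the two classifiers can be sketched on binarized toy data. In this illustration, both classifiers use equal priors, and the depth-aware variant plugs in a simple moment estimate of the capturing probability for each candidate cell type. That estimator, the function names, and the accessibility profiles are assumptions for demonstration, not the estimation procedure described in the Methods.

```python
import math


def log_lik(z, p, q=1.0):
    """Bernoulli log-likelihood of a binarized cell with P(Z_m = 1) = p_m * q."""
    return sum(
        math.log(pm * q) if zm else math.log(1.0 - pm * q)
        for zm, pm in zip(z, p)
    )


def classify(z, profiles, depth_aware):
    """Return the label whose accessibility profile best explains cell z."""
    scores = {}
    for label, p in profiles.items():
        # Assumed moment estimate of the cell's capturing probability:
        # observed open peaks divided by expected open peaks, capped at 1.
        q = min(1.0, sum(z) / sum(p)) if depth_aware else 1.0
        scores[label] = log_lik(z, p, q)
    return max(scores, key=scores.get)


profiles = {
    "A": [0.8, 0.8, 0.8, 0.8, 0.05, 0.05],  # broadly accessible cell type
    "B": [0.1, 0.1, 0.1, 0.1, 0.30, 0.30],  # mostly closed cell type
}
shallow_cell = [1, 0, 0, 0, 0, 0]  # a type-A-like cell with few captured reads

naive_label = classify(shallow_cell, profiles, depth_aware=False)
depth_aware_label = classify(shallow_cell, profiles, depth_aware=True)
```

In this toy case the classifier that ignores capturing probability assigns the shallow cell to the mostly closed type B because it reads every zero as closed chromatin, while the depth-aware variant attributes the zeros to low capture and assigns type A.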

PACS consistently outperforms the Naïve Bayes model with an average 0.31 increase in Adjusted Rand Index (ARI, Supplementary Fig. 2a), suggesting the importance of considering the cell-to-cell variability in capturing rate. For the gold-standard cell line mixture data, we achieved almost perfect label prediction (ARI > 0.99), while Naïve Bayes had much lower accuracy with an average ARI = 0.54 (Supplementary Fig. 2b, c). For the kidney data⁶ and the marmoset brain data³², PACS still achieved high performance, with average ARI equal to 0.92, 0.90, and 0.88 for the adult kidney, P0 kidney, and marmoset brain data, respectively. The Naïve Bayes model, on the other hand, again produced lower ARI scores, equal to 0.59, 0.65, and 0.69 for the three datasets, respectively (Supplementary Fig. 2d–k).

For the holdout experiment, where training and testing are done on different datasets, our method again shows more accurate cell label prediction than Naïve Bayes, consistent with the above results. Our cell type label prediction approach is also efficient: training and prediction together take <5 min for large datasets (>70,000 cells).

PACS enables parametric multi-factor model testing for accessibility

Identifying the CREs regulated by certain physiological cues is essential in understanding functional regulation. For example, differentially accessible region (DAR) analysis tries to determine cell-type-specific chromosomal accessibility differences. Most scATAC-seq pipelines adopt RNA-seq differential expression methods to ask whether a peak belongs to a DAR. These approaches generally lack calibration for sparse ATAC data, and pairwise DAR tests do not allow testing more complex models that might determine peak accessibility (e.g., a combination of spatial location and batch effects). With existing methods for DAR detection, commonly adopted approaches are to ignore other factors or stratify by other factors to test the factor of interest if the predictive variables are nominal (e.g., cell types). However, such tests involve ad hoc partition into levels of the nominal factor and cannot test more complex models, including possible metric variables (e.g., developmental time).

Here, we first assess the model design and capabilities of PACS and six established tools/methods: ArchR²¹, Seurat/Signac²³, snapATAC²², edgeR³³, snapATAC2³⁴, and Fisher’s exact test. ArchR conducts a Wilcoxon rank-sum test on the subsampled cells from the initial groups, while ensuring parity in the number of sequencing reads across any two samples being tested. Seurat employs the standard logistic regression model³⁵ by setting the cell type as the dependent variable and ATAC counts and total reads in the cells as predictive variables. SnapATAC conducts a test on the pseudo-bulk data of two groups and utilizes the edgeR³³ negative binomial test on the pseudo-bulk data with a pre-defined ad hoc variance measure (biological coefficient of variation, bcv = 0.4 for human and 0.1 for mouse data). Since snapATAC operates on the pseudo-bulk level, we also included the edgeR³³ method applied to single-cell data in our comparison. snapATAC2 employs a logistic regression model with ATAC counts as the dependent variable and cell type and total reads in cells as predictive variables^5,34. A comprehensive comparison is summarized in Table 1.

Table 1 Comparison of model design and capability among existing scATAC-seq testing tools

To quantitatively compare the performance of the parametric test framework, we first used simulated data to test a single-factor model (cell types as a factor). To resemble real data, simulated samples were generated by parameterizing the model with the accessibility and capturing probability estimated directly from a human cell line dataset²¹ (Methods). We randomly sampled varying numbers of cells in each group, ranging from 250 to 1000. Supplementary Fig. 3 shows that Seurat failed to control the type I error rate at the specified significance level. Among the methods demonstrating type I error control, PACS has, on average, 17%, 19%, and 122% greater power than Fisher’s exact test, ArchR, and snapATAC, respectively (Supplementary Data 2). The reduced power of ArchR is likely due to the subsampling process, and the ad hoc “bcv” choice in snapATAC may result in a miscalibrated test with a low type I error and power. The Q-Q plots of the five methods are shown in Supplementary Fig. 4a–e.

To evaluate the performance under a multi-factor model, we conducted another set of simulations with a second factor, two spatial locations (S1 and S2), and the first factor of two cell types (T1 and T2). We evaluated the performance in testing for the main effects or their interactions (Methods). Two strategies were used for the methods that cannot directly test effects for multiple factors. The first is called the “naïve test”, where the other factor is ignored in testing the factor of interest. The second strategy is called the “stratified test”, where we stratified the dataset by the second factor and conducted a pairwise test between the factor of interest on each stratum, followed by a p-value combination test (Methods). Across all methods and test strategies, only PACS, snapATAC (naïve and stratified), Fisher-stratified, and ArchR-stratified controlled type I error at the specified level (Fig. 2a–i); PACS remained the most powerful test and detected up to 66% more true differential peaks compared with the second most powerful methods (Supplementary Data 3). Notably, Seurat could not determine peaks with spatial or interaction effects because the cell type factor is treated as the dependent variable in the regression model. A more detailed explanation can be found in Supplementary Note 1. For testing cell type effects, snapATAC2 with ATAC count as a dependent variable performs similarly to Seurat.
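The "stratified test" strategy ends with a p-value combination step. The passage above does not state which combination rule was used, so the sketch below uses Fisher's method purely as an illustrative assumption. It relies on the closed-form survival function of a chi-square distribution with an even number of degrees of freedom, and it assumes independent p-values that are strictly positive.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic -2 * sum(log p) follows a chi-square distribution with
    2k degrees of freedom under the null, whose survival function is
    exp(-x / 2) * sum_{i < k} (x / 2)^i / i!.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # statistic divided by 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# One p-value per stratum, e.g. the cell type test run within S1 and within S2.
combined = fisher_combine([0.05, 0.05])
```

Two strata that are each borderline at 0.05 combine to a p-value of roughly 0.017, and a single stratum returns its own p-value unchanged.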

**Fig. 2: Compound hypothesis testing with PACS is sensitive and specific.**

We then simulated a time-series dataset with five time points to evaluate our model performance for ordinal covariates. We assumed two temporal trends of accessibility: linear and quadratic. To put this in a biological setting, the quadratic trend may represent the presence of an acute spike response, and the linear trend may represent temporally accumulating chronic responses. The PACS framework could detect both linear and quadratic signals, and its power depends on the “effect size”, defined as the log fold change of accessibility between the highest and lowest accessibility (Fig. 2j, k).
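One common way to encode linear and quadratic trends over equally spaced ordinal levels is with centered polynomial codes, sketched below for five time points. The exact coding used in the simulations is described in the Methods and is not reproduced here; the helper name and the centering choice are assumptions.

```python
def trend_contrasts(n_points):
    """Centered linear and quadratic codes for equally spaced ordinal levels."""
    mean = (n_points - 1) / 2.0
    linear = [t - mean for t in range(n_points)]
    # Subtract the mean square so the quadratic code also sums to zero.
    mean_sq = sum(x * x for x in linear) / n_points
    quadratic = [x * x - mean_sq for x in linear]
    return linear, quadratic


linear, quadratic = trend_contrasts(5)
```

For five time points this gives the linear code (-2, -1, 0, 1, 2) and the quadratic code (2, -1, -2, -1, 2). The two columns are orthogonal, so a steadily accumulating response and a transient spike or dip load on separate coefficients.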

We also evaluated the PACS model in real datasets. As the ground truth is unknown, we utilized a sampling-based approach. We used randomly permuted cell type labels to estimate the type I error. To evaluate power, we treated the consensus DAR set from all methods as “true DARs” (after type I error calibration, see Methods). For the standard two-group DAR test, our method consistently controlled type I error and achieved high power across different datasets (Supplementary Fig. 5a–f). We further showed that ignoring confounders such as batch effect could result in substantial false discoveries, up to 47% in the adult mouse kidney data (Fig. 2l). Recent advancements in sequencing technologies have significantly enhanced the ability to quantitatively measure chromatin states, as evidenced by a substantial proportion of counts equal to or exceeding two in a compilation of datasets (see Supplementary Data 4 from ref. ²⁴). Our assessment indicates that the full PACS model, which leverages a cumulative logistic regression approach, demonstrates greater power than the binary model in datasets with high read counts. For example, for the 10X PBMC multiome dataset, the power increases from 0.86 with the binary model to 0.92 with the full PACS model (Supplementary Fig. 6).

PACS identifies kidney cell type-specific regulatory motifs and allows direct batch correction

One important feature of PACS is its ability to handle complex datasets with multiple confounding factors. To test the performance of PACS, we analyzed an adult kidney dataset with strong batch effects⁶. This dataset contains three samples generated independently (in three batches), and the authors identified a strong batch effect. Existing methods for batch correction map the ATAC-seq features to a latent vector space to subtract the batch effects. For example, the original study⁶ relies on Harmony³⁶ to remove the batch effect in latent space for visualization and clustering, but the batch effect is still present in the peak feature sets because the correction is carried out in latent space without mapping to original features. Batch effect correction in latent space will be useful for secondary analyses that operate from latent space inputs, e.g., clustering, but it will not help with secondary analyses that directly operate on features such as peak-associated motif analysis.

To remove the batch effect at the feature level, we assume that the batch effect will affect (increase or decrease) the accessibility of certain peaks, and these effects are orthogonal to the biological effects. This assumption is necessary for most of the existing batch-effect correction methods (e.g., MNN³⁷, Seurat³⁸, and Harmony³⁶) as a matter of experimental design. With this assumption, we applied PACS to the adult kidney data, detected significant DAR peaks among batches (p value < 0.05 with or without FDR correction), and removed batch-effect peaks from the feature set. We next used Signac to process the original data as well as the batch effect-corrected data without any other batch correction steps. Dimension reduction with UMAP suggested that the original data contained a strong batch effect, where almost all cell types are separated by batch (Fig. 3a, b). After removing the peaks with strong batch effects, the cells are better mixed among batches (Fig. 3c, d and Supplementary Fig. 7a, b). Note that different cell types are still separated, suggesting the biological differences are (at least partially) maintained. Since UMAP visualization may not fully preserve the actual batch mixing structure, we adopted a batch mixing score from ref. ³⁹ to quantify the batch effect in the PCA space. The batch mixing score is defined as the average proportion of nearest neighbor cells with different batch identities, where a higher score indicates better mixing between batches and, thus, a smaller batch effect (Methods). We normalized the mean batch mixing score by dividing it by the expected score under the random mixing scenario. After batch effect correction with PACS, the normalized mean batch mixing score is 0.358 (with FDR correction) or 0.417 (without FDR correction) compared with 0.122 before batch correction. Example peaks with strong batch effects detected by PACS are shown in Supplementary Fig. 8a–l.
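A normalized batch mixing score of the kind described above can be sketched as follows. The brute-force neighbor search, the Euclidean metric, the value of k, and the form of the random-mixing expectation (neighbors drawn uniformly from all other cells) are assumptions for illustration; the exact definition is given in the Methods and ref. ³⁹. The sketch also assumes at least two batches are present.

```python
import math


def batch_mixing_score(points, batches, k):
    """Normalized mean fraction of k nearest neighbors from another batch."""
    n = len(points)
    raw = 0.0
    expected = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Brute-force k nearest neighbors by Euclidean distance.
        neighbors = sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]
        raw += sum(batches[j] != batches[i] for j in neighbors) / k
        # Assumed null: neighbors drawn at random from the other n - 1 cells.
        expected += sum(batches[j] != batches[i] for j in others) / (n - 1)
    return (raw / n) / (expected / n)


# Toy embeddings: two batches far apart versus two batches interleaved.
separated_points = [(float(x), 0.0) for x in (0, 1, 2, 3, 100, 101, 102, 103)]
separated_batches = ["b1"] * 4 + ["b2"] * 4
mixed_points = [(float(x), 0.0) for x in range(8)]
mixed_batches = ["b1", "b2"] * 4

score_separated = batch_mixing_score(separated_points, separated_batches, k=2)
score_mixed = batch_mixing_score(mixed_points, mixed_batches, k=2)
```

Fully separated batches score 0, whereas the interleaved toy layout scores above 1 because its neighbors alternate batches more regularly than random mixing would.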

**Fig. 3: Application of PACS to the mouse kidney dataset.**

**Fig. 4: Application of PACS to the developing human brain data.**

As noted by other studies^40,41,42, there is always a tradeoff between batch correction and conservation of biological effects. The decision must be made with careful consideration. In our study, we aimed to correct the batch effect only for clustering major cell types and visualization. For these tasks, information loss is inevitable and can be accepted, and thus, the approach of excluding peaks with significant batch effect can be used, provided that the tradeoffs discussed above are understood. However, we emphasize that for tasks like detecting DAR, a better approach is to keep all peaks and regress out the batch effect to obtain an unbiased estimation of the cell type effect. To demonstrate this approach, we applied our method to identify cell type-specific features while adjusting for batch effect. We focused on the two proximal tubule subtypes, proximal convoluted tubules (PCT) and proximal straight tubules (PST). By fitting our mcCLR model with cell type and batch effect, we identified 19,888 and 62,368 significant peaks for PCT and PST, respectively (FDR-corrected p value < 0.05, Supplementary Data 4, 5). The original study utilized snapATAC, which reported 23,712 and 36,078 significant peaks for PCT and PST, respectively. With the batch-corrected differential peaks, we conducted GREAT enrichment analysis^43,44 to identify candidate PCT- and PST-specific genes (Supplementary Data 6, 7). We identified Gc, Nox4, Slc4a4, Bnc2, Slc5a12, and Ndrg1 genes as top PCT-enriched genes, and Ghr, Gramd1b, Etv6, Atp11a, Gse1, and Sik1 as top PST-enriched genes. The associated genomic pile-up figures for the CREs of these genes are shown in Fig. 3e, and these findings were supported by a public scRNA-seq dataset⁴⁵ (Fig. 3f).

PACS dissects complex accessibility-regulating factors in the developing human brain

We applied our method to the human brain dataset¹¹, which is more challenging due to the complex study design with cells collected from six donors across eight spatial locations. Substantial sequencing depth variations among samples have also been noticed, further complicating the analysis (Supplementary Fig. 9a–c). To study how spatial locations affect chromatin structure, the original reference focused on the prefrontal cortex (PFC) and primary visual cortex (V1) regions, as they were the extremes of the rostral-caudal axis¹¹. With the multi-factor analysis capacity of PACS, we conducted analyses to (1) identify the region effect while adjusting for the donor effect and (2) identify the cell type-specific region effect.

We first examined how different brain regions affect chromatin accessibility across all cell types. To accurately capture this relationship, we incorporated other influential factors, such as donor variability, as covariates in our regression model. This methodological approach allows us to assess the specific effects of brain regions while accounting for potential confounding factors (Methods). For this, we focused on a subset of three donors where spatial information is retained during data collection (Fig. 4a–c and Supplementary Data 8). In total, we identified 146,676 brain region-specific peaks (FDR corrected p value < 0.05). Between PFC and V1 regions, we identified 30,455 DAR peaks, ~20% more compared with the original study (Supplementary Data 9, 10). With the region-specific DARs, we conducted motif enrichment analysis to identify region-specific TFs. For the PFC and V1 regions, we found several signals that were consistent with the original article¹¹, including PFC-specific motifs MEIS1, TBX21, and TBR1, and V1-specific motifs MEF2B, MEF2C, MEF2A, and MEF2D. Moreover, we identified additional V1-specific motifs ETS and ZIC2 (Fig. 4d), supported by the scRNA-seq data collected from the same regions⁴⁶. We also noticed that some neuron development-associated TFs, including OLIG2 and NEUROG2, are enriched in both brain regions but with different binding sites, likely due to different co-factors that open different DNA regions. We next conducted TF enrichment analysis for the DAR set identified by PACS but not by snapATAC. The top enriched motifs are highly concordant with the output using the full set of peaks, suggesting that the additional peaks identified by PACS are consistent with but increase the information set found in the PACS-snapATAC common set and support the reliability of the PACS method. Motif enrichment results for the brain data are reported in Supplementary Data 11–14.

Next, we used PACS to examine the location effect across different cell types along excitatory neurogenesis. This corresponds to testing the interaction terms between spatial location and cell types while adjusting for donor effect (Fig. 4e). The previous study reported that the chromatin status of the intermediate progenitor cells (IPC) population started to diverge between the PFC and V1 regions. Consistent with the article, we identified 2773 significant differential peaks between PFC and V1 at the IPC stage, 52% more than snapATAC (Supplementary Data 15). Example peaks with cell type-specific region effect are shown in Supplementary Fig. 10.

In sum, we show the application of PACS to data with three factors: donor, spatial region, and cell type. PACS can be applied to study one factor or the interaction between factors while adjusting for other confounding factors, and its tests achieve higher power than the pairwise alternatives.

PACS identifies time-dependent immune responses after stimulation

The existing methods for DAR detection rely on pairwise comparisons and, thus, are not applicable to ordinal or continuous factors. One such example is scATAC-seq data collected at multiple time points. Here, we apply PACS to a peripheral blood mononuclear cell (PBMC) dataset collected at three time points (0 h control, 1 h, and 6 h) after drug treatment⁴⁷. Multiple treatments have been applied separately to cells collected from four human donors. While PACS can simultaneously model all drugs and conditions, we focus on the ionomycin plus phorbol myristate acetate (PMA) treatment to demonstrate the PACS workflow. The factors included in the PACS model are shown in Fig. 5a, where cell type and donor effects are categorical, and the time effect is coded as an ordinal variable. Note that time can be alternatively coded as a continuous variable.

**Fig. 5: Application of PACS to time-series human PBMC treatment data.**

We tested the treatment effect by identifying open chromatin regions that gradually increase or decrease in accessibility after treatment. In total, we detected 35,356 peaks with a strong treatment effect across five broad cell types (B cell, CD4 T cell, CD8 T cell, Monocyte, and NK cell, Supplementary Data 16–18). Across the cell types, CD4 and CD8 T cells show the most significant changes in chromatin landscape after treatment (Fig. 5b, c). This is expected, as PMA can induce T cell activation and proliferation⁴⁸. Among the peaks with significant PMA treatment effect, most become more accessible after treatment, consistent with the activation function of the treatment. We then conducted gene enrichment analysis with GREAT⁴⁴, where we identified several GO pathways associated with T cell activation, such as “regulation of T cell differentiation” and “regulation of interleukin-2 production” (Supplementary Data 19). We also identified enriched genes, including DUSP5, IL1RL1, TBX21, and CXCR3 (Supplementary Data 20), the expression of which has been previously reported to be up-regulated in PMA treatment^49,50,51,52. Notably, DUSP5 is known to play an essential role in the immune response through regulation of NF-κB as well as ERK1/2 signal transduction⁵³, and TBX21 is an immune cell TF that also directs T-cell homing to pro-inflammatory sites via regulation of CXCR3 expression⁵⁴. Figure 5d, e shows the cell type-specific open chromatin landscape dynamic after the PMA treatment. We noticed that some CREs respond to the treatment effect across all cell types, and some CREs become activated only in certain cell types.

Discussion

Single-cell sequencing data, characterized by uneven data capturing and sparsity, present significant challenges in data analysis. For scRNA-seq data, data normalization has been an essential step for adjusting for uneven data capturing; however, this approach is not directly applicable to scATAC-seq data, creating a unique challenge. Our method, PACS, addresses this by jointly modeling the group-level underlying accessibility and cell-level sequencing reads capturing. In addition, PACS controls for artifacts of sparse single-cell data that tend to have many missing values by a regularization approach. When applied to cell-type annotation tasks, PACS showed improved performance compared with the Naïve Bayes model, which does not consider cell-specific capturing probabilities.

The increasing volume of data across various tissue conditions necessitates atlas-level data integration to comprehend tissue dynamics. Our cell type annotation framework facilitates the transfer of cell type annotation from a reference dataset to another dataset, thereby overcoming a major hurdle in integrative data analysis. A further challenge of data integration is jointly modeling various factors (e.g., genotype, cell type, spatial locations) that govern cellular CRE activities. The standard GLM framework could not address the uneven data capturing in scATAC-seq data, so we developed a statistical model that extends the GLM framework to account for cell-specific missing data. Our model, the missing-corrected cumulative logistic regression (mcCLR) with regularization, enables PACS to perform multi-covariate hypothesis tests, including spatial and temporal data analysis. Here, we analyzed three empirical datasets from brain, kidney, and blood samples to show the utility and flexibility of our framework in large, complex datasets. Additionally, PACS shows promise in genotype-based chromatin accessibility studies, such as allele imbalance analysis or chromatin accessibility quantitative trait locus (caQTL) studies. Compared with existing models, PACS offers a more controlled approach for handling multiple covariates in the caQTL analysis.

The main characteristics of the PACS model are (1) a probability model of read depth, (2) sparsity control, and (3) an explicit treatment of ordinal distribution in a GLM framework. Given this model, there are some limitations and considerations. First, treating scATAC-seq data as ordinal data requires a certain amount of read depth to reveal quantitative data from each peak. In ref. ²⁴ we show that typical scATAC-seq data have such quantitative information, but if the read depth is shallow, the advantage of the cumulative model over a binary logistic model will be reduced. Second, treating sparsity with modified Firth regularization can help with artifacts caused by missing data and imbalanced sample sizes, but like any regularization procedure, it cannot fully compensate for arbitrarily sparse data or shallow read depth. Experimental protocols should also be optimized to reduce missing data. Third, we have previously derived a parametric model of the scATAC-seq read count, known as the size-filtered signed Poisson distribution (ssPoisson)²⁴. In our current work, we treat the insertion rate as a latent variable and directly model the paired insertion counts (PIC) of the data with a regression-based model without directly trying to estimate the parameters of the ssPoisson model. This approach greatly enhanced computational efficiency, but future work might consider a full parametric model. Lastly, we treat all the putative factors that affect chromatin accessibility in a linear model. The linear model allows for interaction terms, but if the effects are non-linear, our approach will yield incorrect tests. Nevertheless, because counts in scATAC-seq datasets are generally small, we believe a linear model will be a good approximation to more complex relationships.

In summary, PACS allows versatile hypothesis testing to analyze scATAC-seq data. Its capability of jointly accounting for multiple factors that govern the chromosomal landscape will help investigators dissect multi-factorial chromatin regulation.

Methods

A probabilistic model of underlying open chromatin status

Here we model the activity of regulatory elements in each cell type group by the cumulative distribution of the accessibility. The underlying accessibility for a CRE is a function of nucleosome density and turnover rate. As we discuss in the main text, the chromatin state should be regarded as a random variable for a particular cell group, because cells are sampled from mixtures of hidden microstates. Here, we expanded the model of accessible chromatin from ref. ²⁴. Briefly, let ${F}_{C\times J}$ be a design matrix that summarizes known independent variables (e.g., cell type, developmental time, sample locations, etc.) across $C$ cells, and let ${Y}_{C\times M}$ be the underlying (latent) chromatin status across $C$ cells and $M$ regions, where each element represents the accessibility of a genomic region in a cell. The goal of PACS is to decompose the (complementary) cumulative distribution of ${Y}_{{cm}}$, i.e., the series of distributions:

$${{\rm{P}}}\left\{{Y}_{{cm}}\ge t\right\}={\sum}_{i=t}^{T}{\pi }_{i} \, {{\rm{for}}} \, t=1,2,\ldots,T$$

(3)

by predictive independent variables in ${F}_{c*}$. Here the maximum value of accessibility we account for, $T$, is feature-specific. To be precise, for a feature $m$, $T$ is the largest integer $t$ such that ${\sum }_{c}1({Z}_{{cm}}\ge t)/{\sum }_{c}1({Z}_{{cm}}\ge 1)\ge h$, where $h$ is a hyperparameter. In our study, $h$ is set to $0.25$; based on our evaluation, the model is not sensitive to the choice of $h$.
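The rule for choosing $T$ can be illustrated with a small sketch. It reads the definition as "the largest level $t$ whose share among non-zero cells is at least $h$"; the function name `max_level` and the handling of an all-zero peak are assumptions made for illustration, not part of PACS.

```python
import numpy as np

def max_level(z_m, h=0.25):
    """Largest t with #(z >= t) / #(z >= 1) >= h for one feature (peak)."""
    if h <= 0:
        raise ValueError("h must be positive")
    z_m = np.asarray(z_m)
    n_open = np.sum(z_m >= 1)
    if n_open == 0:
        return 0  # assumption: a peak with no reads has no usable level
    T = 1
    # The ratio is non-increasing in t, so stop at the first level that fails.
    while np.sum(z_m >= T + 1) / n_open >= h:
        T += 1
    return T

# Six non-zero cells: 3 of 6 reach level 2 (0.5 >= h), 1 of 6 reaches level 3 (< h).
counts = np.array([0, 0, 1, 1, 1, 2, 2, 3])
T = max_level(counts)
```

Because the ratio can only decrease as $t$ grows, scanning upward and stopping at the first failure returns the same value as searching over all levels.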

Model for capturing probability of cell

Due to various experimental factors, such as enzyme activity and sequencing depth disparities across cells, we introduce ${R}_{C\times M}$ as a matrix representing the capturing status of each cell and region. Let ${Z}_{C\times M}$ be the (observed) scATAC dataset; then $Z=Y\otimes R$, where $\otimes$ denotes the element-wise product. We consider ${R}_{{cm}}$ to be sampled from a Bernoulli distribution parameterized by ${q}_{c}$, the cell-specific capturing probability:

$${R}_{{cm}}\sim {{\rm{Bernoulli}}}({q}_{c})$$

(4)

We note that under the assumption of complete data capturing (i.e., ${R}_{{cm}}=1$ for all $c$), ${Z}_{{cm}}={Y}_{{cm}}$ and our model reduces to a standard cumulative logistic regression model (also known as ordinal logistic regression). The cell-specific capturing probability in scATAC-seq data complicates the model and necessitates the development of an extended model, missing-corrected cumulative logistic regression (mcCLR). Example applications of ordinal logistic regression in biomedical research are reviewed in refs. ^55,56.

Joint parameter estimation for single-factor scenario

Given a class of data that corresponds to a combination of levels of independent variables, we follow the same parameter estimation framework as described in ref. ²⁴. Briefly, assume we have a genomic region-by-cell (i.e., peak-by-cell) matrix ${Z}_{{C}_{f}\times M}$ with ${C}_{f}$ denoting the subset of cells corresponding to some combination of the independent prediction factors. The observed values in ${Z}_{{C}_{f}\times M}$ are ordinal values, but as most of the non-zero scATAC-seq counts are one (typically >70%), we focus on ${{\rm{P}}}({Y}_{{C}_{f}m}\ge 1)$ for purposes of ${q}_{c}$ estimation. Hereafter, we use the notation ${p}_{{C}_{f}m}^{(1)}$ to represent the (non-zero) open probability of group ${C}_{f}$ and feature $m$. We further assume ${q}_{c}$ to be identical across different levels of accessibility for a given cell. Given the data sparsity and the predominance of counts of one, this assumption is mild, and it greatly accelerates the estimation process. We use a moment estimator with a coordinate descent algorithm to iteratively update ${p}_{{C}_{f}m}^{(1)}$ given ${q}_{c}$, and update ${q}_{c}$ given ${p}_{{C}_{f}m}^{(1)}$. Briefly, we execute the following iteration until convergence:

1. Start with an initial estimate of ${p}_{m}^{[0]}$.
2. For $t = 1, 2, \ldots$
  a. Compute ${q}_{c}^{[t]}$ by:

$${q}_{c}^{[t]}=\frac{{\sum }_{m=1}^{M}I({z}_{{cm}}\ge 1)}{{\sum }_{m=1}^{M}{p}_{m}^{[t-1]}}\quad {{\rm{for}}} \, {c}\in {C}_{f}$$

  b. Update ${p}_{m}^{[t]}$ by the moment estimator:

$${p}_{m}^{[t]}=\frac{{\sum }_{c\in {C}_{f}}I({z}_{{cm}}\ge 1)}{{\sum }_{c\in {C}_{f}}{q}_{c}^{[t]}}\quad {{\rm{for}}} \, {m}\in \{1,2,\,\ldots,M\}$$

where we use superscript $[t]$ to represent the ${t}^{{{\rm{th}}}}$ iteration, and we omit the subscript ${C}_{f}$ and superscript (1) for ${p}_{{C}_{f}m}^{(1)}$.
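The two alternating updates can be sketched in a few lines of NumPy. This is a minimal illustration of the updates exactly as written above, not the PACS implementation: the function name `estimate_p_q`, the default starting value (the raw open fraction per peak), and the stopping rule are assumptions.

```python
import numpy as np

def estimate_p_q(Z, p_init=None, n_iter=100, tol=1e-8):
    """Alternate the moment updates for q_c (cells) and p_m (peaks) in one cell group."""
    B = (np.asarray(Z) >= 1).astype(float)      # I(z_cm >= 1), shape (cells, peaks)
    # Assumed starting point: fraction of cells with a non-zero count per peak.
    p = B.mean(axis=0) if p_init is None else np.asarray(p_init, dtype=float)
    q = None
    for _ in range(n_iter):
        q = B.sum(axis=1) / p.sum()             # q_c = sum_m I(z_cm >= 1) / sum_m p_m
        p_new = B.sum(axis=0) / q.sum()         # p_m = sum_c I(z_cm >= 1) / sum_c q_c
        converged = np.max(np.abs(p_new - p)) < tol
        p = p_new
        if converged:
            break
    return p, q

Z = np.array([[1, 0, 2],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])
p_hat, q_hat = estimate_p_q(Z)
```

These two updates only constrain the product of the $p$ and $q$ scales: multiplying every ${p}_{m}$ by a constant and dividing every ${q}_{c}$ by the same constant leaves both equations satisfied. The sketch therefore inherits its overall scale from the starting value, and the raw ${q}_{c}$ it returns can exceed 1. It does not attempt the scale and convergence handling described under "Uniqueness of parameter estimation".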

Uniqueness of parameter estimation

In order for the above joint parameter estimation framework to converge and for the estimated parameters to be uniquely defined, there should be ${q}_{c}=1$ for some cells and ${p}_{{C}_{f}m}^{(1)}=1$ for some features. In PACS, we conduct a convergence check by requiring a certain proportion of cells (default 10%) to have an estimated capturing probability greater than 0.9. In the case of a cluster of cells being rare or not sufficiently deeply sequenced, the estimates may be unstable, and we recalibrate the estimates for this rare cluster to its most similar cluster to prevent potential false positives. Specifically, let ${C}_{f1}$ index the rare group of cells; then, to identify the cell groups with the most similar open chromatin profile, we compute the correlation between ${p}_{{C}_{f1}*}^{(1)}$ and ${p}_{{C}_{{fj}}*}^{(1)}$ for all other clusters $j=1,\ldots,J$, across all regions. Assuming ${C}_{{fn}}$ has the most similar chromatin profile, we rescale the current estimation of ${p}_{{C}_{f1}m}^{(1)}$ by the following formula:

$$S={\sum}_{m}{p}_{{C}_{{fn}}m}^{\left(1\right)}\,\big/\,{\sum}_{m}{p}_{{C}_{f1}m}^{\left(1\right)},\;\;\;{p}_{{C}_{f1}*}^{(1){\prime} }={p}_{{C}_{f1}*}^{(1)}\times S$$

(5)

where S is the scale factor, ${p}_{{C}_{f1}m}^{(1){\prime} }$ is the rescaled open probability estimate for the cluster ${C}_{f1}$ and feature $m$. Through rescaling, we assume that most peaks are not differentially accessible between these two cell types.
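The recalibration step in Eq. 5 can be sketched as follows. The use of Pearson correlation (via `np.corrcoef`) is an assumption, since the text only says "correlation", and the function name `rescale_rare_cluster` is hypothetical.

```python
import numpy as np

def rescale_rare_cluster(P, rare):
    """P: (n_clusters, n_peaks) open-probability estimates; rare: row index of the rare cluster."""
    P = np.asarray(P, dtype=float)
    others = [j for j in range(P.shape[0]) if j != rare]
    # Assumed similarity measure: Pearson correlation across all peaks.
    corr = [np.corrcoef(P[rare], P[j])[0, 1] for j in others]
    nearest = others[int(np.argmax(corr))]
    S = P[nearest].sum() / P[rare].sum()        # scale factor S of Eq. 5
    return P[rare] * S, nearest, S

P = np.array([[0.1, 0.2, 0.3],    # rare cluster
              [0.2, 0.4, 0.6],    # same profile, twice the overall level
              [0.5, 0.1, 0.1]])
rescaled, nearest, S = rescale_rare_cluster(P, rare=0)
```

In this toy input the rare cluster has the same shape as cluster 1 at half the level, so the sketch selects cluster 1 and doubles the rare cluster's estimates.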

DAR framework in Seurat and snapATAC2

Both Seurat/Signac and snapATAC2 employ the standard logistic regression framework for DAR detection but with different model specifications. To be precise, Seurat used the following model:

$${{\rm{logit}}}\left({{\rm{P}}}\left(C{T}_{c}=1\right)\right)=\alpha+{\beta }_{0}{D}_{c}+{\beta }_{1}{Z}_{{cm}}+{\sum }_{j=2}^{J}{\beta }_{j}{F}_{{cj}}$$

(6)

where $CT$ represents the binary indicator of the cell type label, ${D}_{c}$ represents the total number of reads of a cell, ${Z}_{{cm}}$ represents the observed read count in peak m of cell $c$, and ${F}_{{cj}}$ represents the other biological factors to be controlled for. We note that this model should only be used to test the relationship between cell type and read count rather than other factors, such as spatial effects.

In snapATAC2, the following model is used:

$${{\rm{logit}}}\left({{\rm{P}}}\left({Z}_{{cm}}=1\right)\right)=\alpha+{\beta }_{0}{D}_{c}+{\beta }_{1}C{T}_{c}+{\sum }_{j=2}^{J}{\beta }_{j}{F}_{{cj}}$$

(7)

The major difference is that in snapATAC2, the observed read count becomes the random component (dependent variable). In contrast, the cell type label becomes part of the systematic component (a covariate) in the regression. The model proposed in snapATAC2 is more logically consistent with the biological inference problem, i.e., whether the cell type status governs peak-reads due to differential chromatin state. Despite the difference, both models correct for sequencing depth by incorporating total read count (${D}_{c}$) as an explanatory variable in regression. Such treatment will result in potential collider bias, as discussed in Supplementary Note 1.

Cell type label prediction framework

Given a reference dataset, we estimate the probability of open chromatin ${p}_{{C}_{g}m}^{(1)}$ for each cell type $g\in \{1,\ldots,G\}$, using the formula above. With a new set of binarized observations ${Z}_{{C}^{{\prime} }\times M}^{{\prime} }$, we apply the Bayes discriminative model to predict the corresponding cell type labels, $h\left({Z}_{c*}^{{\prime} }\right).$

$$\begin{aligned}{{\rm{P}}}\left(h\left({Z}_{c*}^{{\prime} }\right)=g\,|\,{Z}_{c*}^{{\prime} }\right) &\propto {{\rm{P}}}\left({Z}_{c*}^{{\prime} }\,|\,h\left({Z}_{c*}^{{\prime} }\right)=g\right){{\rm{P}}}\left(h\left({Z}_{c*}^{{\prime} }\right)=g\right) \\ &\propto {{\rm{P}}}\left(h\left({Z}_{c*}^{{\prime} }\right)=g\right){\prod }_{m=1}^{M}{\left({p}_{{C}_{g}m}^{(1)}{q}_{c}\right)}^{{Z}_{{cm}}^{{\prime} }}{\left(1-{p}_{{C}_{g}m}^{(1)}{q}_{c}\right)}^{1-{Z}_{{cm}}^{{\prime} }}\end{aligned}$$

(8)

where ${{\rm{P}}}\left(h\left({Z}_{c*}^{{\prime} }\right)=g\,|\,{Z}_{c*}^{{\prime} }\right)$ represents the posterior probability of cell $c$ being sampled from cell group $g$, ${{\rm{P}}}\left({Z}_{c*}^{{\prime} }\,|\,h\left({Z}_{c*}^{{\prime} }\right)=g\right)$ represents the conditional probability of observing ${Z}_{c*}^{{\prime} }$ given that the cell $c$ is sampled from cell type $g$, and ${{\rm{P}}}\left(h\left({Z}_{c*}^{{\prime} }\right)=g\right)$ is the prior probability of a new observation belonging to cell group $g$, which can either be assumed to be a non-informative Dirichlet prior ${{\rm{Dirich}}}(\delta )$ or estimated based on the cell type composition in reference data. Note that because the feature space is large, this choice makes little difference.

The Naive Bayes model is a probabilistic classifier based on Bayes’ theorem, assuming independence between features given the class label. Specifically, in our context, it assumes that the accessibility of each peak is independent of the accessibility of other peaks, given the cell type. Using the same notation as above, the parameter estimation procedure is

$${p}_{{C}_{g}m}^{(1)}=\frac{{\sum }_{c\in {C}_{g}}I({z}_{{cm}}\ge 1)}{|{C}_{g}|}$$

(9)

where $|{C}_{g}|$ represents the group size of (i.e., number of cells in) cell type $g$. With a new set of observations ${Z}_{{C}^{{\prime} }\times M}^{{\prime} }$, the probability of their corresponding cell type labels $h\left({Z}_{c*}^{{\prime} }\right)$ can be estimated using the following equation:

$$\begin{aligned}{{\rm{P}}}\left(h\left({Z}_{c*}^{{\prime} }\right)=g\,|\,{Z}_{c*}^{{\prime} }\right) &\propto {{\rm{P}}}\left({Z}_{c*}^{{\prime} }\,|\,h\left({Z}_{c*}^{{\prime} }\right)=g\right){{\rm{P}}}\left(h\left({Z}_{c*}^{{\prime} }\right)=g\right) \\ &\propto {{\rm{P}}}\left(h\left({Z}_{c*}^{{\prime} }\right)=g\right){\prod }_{m=1}^{M}{\left({p}_{{C}_{g}m}^{(1)}\right)}^{{Z}_{{cm}}^{{\prime} }}{\left(1-{p}_{{C}_{g}m}^{(1)}\right)}^{1-{Z}_{{cm}}^{{\prime} }}\end{aligned}$$

(10)
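The difference between Eq. 8 and Eq. 10 is only the factor ${q}_{c}$ inside the Bernoulli likelihood, which a short sketch makes concrete. The function name `predict_cell_type` is hypothetical, and the capturing probability of the new cell is taken as a given input because this section does not specify how it is obtained for unlabeled cells. Setting `q_c=1.0` gives the Naive Bayes rule of Eq. 10.

```python
import numpy as np

def predict_cell_type(z_new, P_ref, q_c=1.0, log_prior=None, eps=1e-12):
    """
    z_new:  (M,) binarized accessibility of one new cell
    P_ref:  (G, M) open probabilities p^(1) for each reference cell type
    q_c:    capturing probability of the new cell (assumed known here)
    Returns the posterior probability over the G cell types.
    """
    z = np.asarray(z_new, dtype=float)
    pq = np.clip(np.asarray(P_ref, dtype=float) * q_c, eps, 1 - eps)
    # Log-likelihood per cell type: sum_m z log(p q) + (1 - z) log(1 - p q)
    loglik = np.log(pq) @ z + np.log1p(-pq) @ (1 - z)
    if log_prior is None:
        log_prior = np.zeros(pq.shape[0])       # flat prior over cell types
    logpost = loglik + log_prior
    logpost -= logpost.max()                    # stabilize before exponentiating
    post = np.exp(logpost)
    return post / post.sum()

P_ref = np.array([[0.9, 0.1, 0.8],
                  [0.1, 0.9, 0.2]])
posterior = predict_cell_type([1, 0, 1], P_ref, q_c=0.5)
```

Working in log space avoids underflow when the product runs over many thousands of peaks.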

Missing-corrected cumulative logistic regression (mcCLR)

Due to the high sparsity of scATAC-seq data, perfect separability is common, hindering the parameter estimation in (Eq. 1). To address this issue, we incorporated Firth regularization (Eq. 2). Here we summarize the (unregularized) log-likelihood function and information matrix for the cumulative response model and derive the analytical expression for the binary model. The log-likelihood function, when considering the cumulative response, is

$$\log L({{\mathbf{\Pi }}},{{\boldsymbol{z}}}|{{\boldsymbol{q}}})={\sum }_{c=1}^{C}{\sum }_{t=0}^{T}\log ({\widetilde{\pi }}_{{ct}})1({z}_{c}=t)$$

(11)

where $C$ represents the total number of cells, and ${\pi }_{{ct}}$ and ${\widetilde{\pi }}_{{ct}}$ represent the probability of $t$ PIC counts in cell $c$ before and after accounting for the cell-specific capturing probability, respectively. Specifically, ${\pi }_{{ct}}=P\left({y}_{c}\ge t\right)-P({y}_{c}\ge t+1)$, ${\Pi }_{c}={\left({\pi }_{c0},{\pi }_{c1},{\pi }_{c2},\ldots,{\pi }_{{cT}}\right)}^{{{\rm{Trans}}}}$, and ${\widetilde{\Pi }}_{c}={Q}_{c}{\Pi }_{c}$, where ${Q}_{c}$ is the capturing probability matrix of dimension $(T+1)\times (T+1)$ specified as

$${Q}_{c}=\left[\begin{array}{ccccc}1 & 1-{q}_{c} & 1-{q}_{c} & \ldots & 1-{q}_{c}\\ 0 & {q}_{c} & 0 & & \\ 0 & 0 & {q}_{c} & & \vdots \\ \vdots & & & \ddots & \\ 0 & & \ldots & & {q}_{c}\end{array}\right]$$

(12)
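To see what ${Q}_{c}$ does, the sketch below builds the matrix of Eq. 12 and applies it to a latent count distribution. The helper name `capture_matrix` and the example numbers are illustrative assumptions.

```python
import numpy as np

def capture_matrix(q_c, T):
    """(T+1) x (T+1) matrix Q_c of Eq. 12 for capturing probability q_c."""
    Q = q_c * np.eye(T + 1)      # a latent count t >= 1 is observed as t with probability q_c
    Q[0, 0] = 1.0                # a latent zero is always observed as zero
    Q[0, 1:] = 1.0 - q_c         # a latent non-zero count is missed with probability 1 - q_c
    return Q

pi = np.array([0.6, 0.3, 0.1])                 # latent P(y = 0), P(y = 1), P(y = 2)
pi_tilde = capture_matrix(0.5, T=2) @ pi       # observed distribution of z
```

With ${q}_{c}=0.5$, half of the non-zero mass moves to the zero category, giving the observed distribution (0.8, 0.15, 0.05), which still sums to one.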

In our PACS model, an approximated estimation of parameters in the cumulative logit model was obtained using a method described in a previous set of studies^57,58 that was based on stacking the data and optimizing with binary logistic regression specified by

$$\log L({{\boldsymbol{p}}},{{\boldsymbol{z}}}|{{\boldsymbol{q}}})={\sum }_{c=1}^{C}\left[{z}_{c}\log ({p}_{c}{q}_{c})+(1-{z}_{c})\log (1-{p}_{c}{q}_{c})\right]$$

(13)

$$I\left({{\boldsymbol{\beta }}}\right)={F}^{T}{WF},\;\,{{\mathrm{where}}}\, W={{\rm{diag}}}\left\{\frac{{p}_{c}{q}_{c}(1-{p}_{c})^2}{1-{p}_{c}{q}_{c}}\right\}$$

(14)

where ${p}_{c}=P\left({y}_{c}=1\right)$ is the latent open probability, so that $P\left({z}_{c}=1\right)={p}_{c}{q}_{c}$.

Parameter estimation for mcCLR

We implemented Newton’s method and the Iteratively Reweighted Least Squares (IRLS) method for parameter estimation. Briefly, for Newton’s method, ${{\boldsymbol{\beta }}}$ is estimated through the following iteration

$${{{\boldsymbol{\beta }}}}^{(s+1)}={{{\boldsymbol{\beta }}}}^{(s)}+{{I^{\prime}} }^{-1}\left({{{\boldsymbol{\beta }}}}^{\left(s\right)}\right){U}^{*}\left({{{\boldsymbol{\beta }}}}^{\left(s\right)}\right)$$

(15)

where the superscript $s$ represents the iteration, ${I}^{{\prime} }=I$ for the full model, and ${I}^{{\prime} }={I}_{-\{d\}}$ for the null model of ${\beta }_{\left\{d\right\}}=0$. The score function ${U}^{*}\left(\beta \right)$ is given by:

$$\begin{aligned}{U}^{*}\left({\beta }_{r}\right) &=U\left({\beta }_{r}\right)+\frac{1}{2}{{\rm{trace}}}\left[I{\left({{\boldsymbol{\beta }}}\right)}^{-1}\frac{\partial I\left({{\boldsymbol{\beta }}}\right)}{\partial {\beta }_{r}}\right] \\ &={\sum }_{c=1}^{C}\frac{{f}_{{cr}}({z}_{c}-{p}_{c}{q}_{c})(1-{p}_{c})}{1-{p}_{c}{q}_{c}}+\frac{1}{2}{\sum }_{c=1}^{C}{f}_{{cr}}{k}_{c}{h}_{c},\;(r=1,\ldots,p)\end{aligned}$$

(16)

where ${h}_{c}$ is the ${c}^{{{\rm{th}}}}$ diagonal element of the “hat” matrix, $H={W}^{1/2}F{\left({F}^{T}{WF}\right)}^{-1}{F}^{T}{W}^{1/2}$, and ${k}_{c}=(2{p}_{c}^{2}{q}_{c}-3{p}_{c}+1)/(1-{p}_{c}{q}_{c})$.
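The Newton iteration for the binary (missing-corrected logistic) case can be sketched directly from Eqs. 14 to 16, using the definitions of ${h}_{c}$ and ${k}_{c}$ given above. This is an illustrative reading of the equations for the full model only (${I}^{{\prime} }=I$), with a logit link; the function name `firth_mclr_fit` is hypothetical, and the sketch omits step-halving, the null-model fit, and any safeguards PACS itself may use.

```python
import numpy as np

def firth_mclr_fit(F, z, q, n_iter=50, tol=1e-8):
    """Firth-penalized fit of P(z_c = 1) = p_c * q_c with logit(p_c) = F_c . beta."""
    F = np.asarray(F, dtype=float)
    z = np.asarray(z, dtype=float)
    q = np.asarray(q, dtype=float)
    beta = np.zeros(F.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(F @ beta)))
        pq = p * q
        w = pq * (1 - p) ** 2 / (1 - pq)                      # diagonal of W, Eq. 14
        info = F.T @ (w[:, None] * F)                         # I(beta) = F^T W F
        info_inv = np.linalg.inv(info)
        h = w * np.einsum("ij,jk,ik->i", F, info_inv, F)      # diagonal of the hat matrix
        k = (2 * p ** 2 * q - 3 * p + 1) / (1 - pq)
        # Penalized score U*, Eq. 16: likelihood score plus Firth correction.
        score = F.T @ ((z - pq) * (1 - p) / (1 - pq) + 0.5 * k * h)
        step = info_inv @ score                               # update of Eq. 15
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Intercept-only toy example with complete capture (q_c = 1 for all cells).
F = np.ones((10, 1))
z = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=float)
beta_hat = firth_mclr_fit(F, z, q=np.ones(10))
```

With ${q}_{c}=1$ and an intercept only, the penalized score equation reduces to ${\sum }_{c}{z}_{c}-Cp+\tfrac{1}{2}-p=0$, so the fitted probability is $({\sum }_{c}{z}_{c}+0.5)/(C+1)$. The estimate therefore stays finite even when every observation is zero, which is the separation problem the regularization is meant to handle.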

For the IRLS method, the information matrix $I$ is replaced with an estimate of the information matrix, $\widetilde{I}$,

$$\widetilde{I}\left({{\boldsymbol{\beta }}}\right)={F}^{T}\widetilde{W}F,\;{{\rm{where}}}\; \widetilde{W}={{\rm{diag}}}\left\{-\frac{\left[-{p}_{c}^{2}{q}_{c}^{2}+{q}_{c}\left(2{p}_{c}+{z}_{c}-1\right)-{z}_{c}\right]{p}_{c}(1-{p}_{c})}{{\left(1-{p}_{c}{q}_{c}\right)}^{2}}\right\}$$

(17)

Hypothesis testing framework of mcCLR

We utilized a generalized likelihood ratio test framework for hypothesis testing with the mcCLR model, although a Wald-type test can also be derived. As the model contains Firth regularization, we used the profile penalized likelihood approach to obtain p values^31,59. Specifically, in the null model, the coefficients of interest are set to zero but still left in the model, so that the regularization accounts for the presence of these parameters during optimization.

Data simulation for single factor differential test

To mimic real data, we estimated insertion rates (${\lambda }_{{C}_{f}m}$) and ${q}_{c}$ from the human cell line data and used these values to construct simulated data. Because viable scATAC-seq reads come from two adjacent Tn5 insertion events with the right primer configuration (reviewed in ref. ⁶⁰), we derived the size-filtered signed Poisson (ssPoisson) distribution from modeling this data generation process²⁴. With the observed counts, we estimated the insertion rate parameters for two cell types. Regions with a true insertion rate difference greater than 0.1 were set as true differential ($H_a$), and the remaining regions’ open probabilities were set equal (by taking the mean) and therefore non-differential ($H_0$). Based on the parametric model of latent and observed accessibility, we first sampled the latent ATAC reads by ${{\rm{ssPoisson}}}({\hat{\lambda }}_{{C}_{f}m})$ for $f=1,2$, and then sampled the observing status from a Bernoulli distribution parameterized by ${q}_{c}$. The observed data were generated by the element-wise product of these two matrices. We randomly sampled 10,000 non-differential features to assess the type I error and 10,000 differential features to evaluate power. This simulation was conducted under varying numbers of cells in each group (from 250 to 1000), and each scenario was repeated 5 times.
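The two-stage sampling scheme (latent counts, then capture) can be sketched as below. The ssPoisson distribution is defined in ref. ²⁴ and is not reproduced here, so this sketch swaps in an ordinary Poisson draw as a loudly labeled stand-in; the function name `simulate_observed` and the example rates are likewise assumptions.

```python
import numpy as np

def simulate_observed(lam, q, rng):
    """
    lam: (M,) per-peak insertion rates for one cell group
    q:   (C,) per-cell capturing probabilities
    Returns latent counts Y, capture indicators R, and observed counts Z = Y * R.
    """
    lam = np.asarray(lam, dtype=float)
    q = np.asarray(q, dtype=float)
    C, M = q.size, lam.size
    # Stand-in: plain Poisson instead of the ssPoisson distribution used in the paper.
    Y = rng.poisson(lam, size=(C, M))
    R = rng.random((C, M)) < q[:, None]          # R_cm ~ Bernoulli(q_c)
    return Y, R, Y * R                           # element-wise product

rng = np.random.default_rng(0)
Y, R, Z = simulate_observed(lam=[0.5, 2.0, 0.1], q=[0.0, 1.0], rng=rng)
```

In the toy call, the cell with ${q}_{c}=0$ yields an all-zero observed row and the cell with ${q}_{c}=1$ reproduces its latent counts, the two limiting cases of the capture model.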

Data simulation for multi-factor differential test

Building upon the single-factor setting, we assumed the data contained two cell types (T1 and T2) sampled from two spatial locations (S1 and S2). We evaluated marginal effects and their interactions through separate simulations. To simulate data with marginal effect, the cell type effect is first introduced using the same method as in single-factor simulation, so the effect sizes vary. The spatial effect was then considered to affect features with and without a cell-type effect. Specifically, a third of the features with (and without) a cell type effect showed an accessibility difference across batches, with a log fold change of $\pm 0.5$. We introduced sample imbalance as frequently seen in real datasets. Specifically, we considered that S1 contained 1600 T1 and 800 T2 cells, while S2 contained 400 T1 and 1200 T2 cells. The peak by cell count data generation procedure is the same as for the single factor setting. Two strategies were used to evaluate the performance of methods that do not support multi-factor testing: the naïve test and the stratified test, as reported in the main text. Following the stratified test, we used Edgington’s p value combination approach, as it assumes consistent effect size across strata.
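For readers unfamiliar with it, Edgington’s method combines $k$ independent p values through their sum $S$, whose null distribution is the Irwin–Hall distribution: $P=\frac{1}{k!}{\sum }_{j=0}^{\lfloor S\rfloor }{(-1)}^{j}\binom{k}{j}{(S-j)}^{k}$. The sketch below implements this textbook formula; the function name `edgington` is hypothetical and nothing here is taken from the PACS code.

```python
from math import comb, factorial, floor

def edgington(pvals):
    """Combine independent p values via the null CDF of their sum (Irwin-Hall)."""
    k = len(pvals)
    s = sum(pvals)
    total = sum((-1) ** j * comb(k, j) * (s - j) ** k for j in range(floor(s) + 1))
    return total / factorial(k)

# Two strata, each with p = 0.1: S = 0.2, combined p = 0.2**2 / 2! = 0.02.
combined = edgington([0.1, 0.1])
```

The alternating sum becomes numerically unstable for a large number of strata, so this direct form is best suited to the small number of strata used here.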

To evaluate the interaction effect, we considered two configurations of interaction. In the first configuration, cells of T1 in S1 are highly accessible while other groups are lowly accessible, with effect sizes also estimated from the cell line data. In the second configuration, cells of T1 in S1 and cells of T2 in S2 are highly accessible while other groups are lowly accessible. The second configuration may not be common in real biological data, but a method capable of testing the interaction effect should be able to identify it.

Data simulation for time-series differential test

To evaluate model performance in situations where the design matrix contains ordinal covariates, we simulated time-series scATAC-seq data across five time points. We assumed linear and quadratic temporal effects on accessibility and set the effect size (log fold change) to be 0.3 or 0.5 between the two groups. The baseline accessibility was generated from the cell line data, and the peak-by-cell-count data generation procedure is the same as for the single-factor setting. N = 1000 cells were sampled for each time point.

Evaluating type I error and power in real datasets

To estimate type I error in real data where the ground truth is unknown, we used a label permutation approach, where the data in one cell type were divided randomly into two groups and a differential test was conducted between these groups. As this is randomly assigned, all features were considered non-DAR, so the proportion of P values smaller than 0.05 is the empirical type I error using real data. Then, we set the fifth rank percentile as the correct critical value for those methods with type I errors greater than 0.05. We next conducted a test with two different cell types using the calibrated critical values for each method. Since we do not know the true DAR set, we defined the pseudo-true DAR peaks as the union DAR set of all tested methods, using their corresponding new critical values. Power for each method was then calculated by the number of DARs detected divided by the number of pseudo-true DARs. This approach is adopted from ref. ²⁴.
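The calibration and pseudo-power steps can be sketched as two small helpers. The names `calibrate` and `pseudo_power`, the use of `np.quantile` with its default interpolation for the fifth percentile, and the toy p values are all assumptions for illustration.

```python
import numpy as np

def calibrate(null_p, alpha=0.05):
    """Empirical type I error on permuted labels and the calibrated critical value."""
    null_p = np.asarray(null_p, dtype=float)
    type1 = np.mean(null_p < alpha)
    # Only methods that exceed the nominal level get a stricter cutoff.
    crit = np.quantile(null_p, alpha) if type1 > alpha else alpha
    return type1, crit

def pseudo_power(pvals_by_method, crit_by_method):
    """Power of each method relative to the union of DARs called by all methods."""
    detected = {name: np.asarray(p) < crit_by_method[name]
                for name, p in pvals_by_method.items()}
    union = np.logical_or.reduce(list(detected.values()))
    return {name: d.sum() / union.sum() for name, d in detected.items()}

null_p = np.arange(1, 101) / 1000                      # an anti-conservative toy method
type1, crit = calibrate(null_p)
power = pseudo_power({"a": [0.01, 0.2, 0.01, 0.9], "b": [0.01, 0.01, 0.5, 0.9]},
                     {"a": 0.05, "b": 0.05})
```

Because the reference set is the union across methods, the resulting numbers are relative: they rank methods against each other rather than measure absolute power.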

Estimating effect size (fold change and accessibility change)

A common practice to determine differential features in single-cell data is setting a cutoff for p-value and fold change. In scRNA-seq data analysis, one way to estimate the effect size of a particular variable (predictor) is by calculating the fold change (FC) for the normalized data obtained by dividing the normalized mean expression of one group by the other group. However, with scATAC-seq data, no direct normalization method is available, and computing the fold change on raw read counts may lead to inaccuracies due to disparities in data capture. Here, we propose to use the capturing probability-adjusted count to compute fold change (FC) or the arithmetic difference between accessibility (accessibility change, AC) of two cell types. To be precise:

$${{\rm{FC}}}=\frac{{\sum }_{c\in {C}_{1}}{Z}_{{cm}}/{q}_{c}}{{\sum }_{c\in {C}_{2}}{Z}_{{cm}}/{q}_{c}},\quad {{\rm{AC}}}={\sum }_{c\in {C}_{1}}{Z}_{{cm}}/{q}_{c}-{\sum }_{c\in {C}_{2}}{Z}_{{cm}}/{q}_{c}$$

(18)

where $m$ is the feature of interest and ${C}_{1}$ and ${C}_{2}$ are the lists of cells that contain foreground and background cell types.
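Eq. 18 amounts to dividing each cell's count by its capturing probability before aggregating. A minimal sketch, following the sums exactly as written in the equation; the function name `depth_adjusted_effect` and the toy values are assumptions.

```python
import numpy as np

def depth_adjusted_effect(z_m, q, in_group1, in_group2):
    """Capture-adjusted fold change (FC) and accessibility change (AC) for one peak."""
    adj = np.asarray(z_m, dtype=float) / np.asarray(q, dtype=float)   # Z_cm / q_c
    s1 = adj[np.asarray(in_group1, dtype=bool)].sum()
    s2 = adj[np.asarray(in_group2, dtype=bool)].sum()
    return s1 / s2, s1 - s2

z_m = [1, 0, 2, 1]                 # observed counts of peak m in four cells
q = [0.5, 0.5, 1.0, 0.5]           # estimated capturing probabilities
fc, ac = depth_adjusted_effect(z_m, q,
                               in_group1=[True, True, False, False],
                               in_group2=[False, False, True, True])
```

Here the adjusted sums are 2 for the first group and 4 for the second, giving FC = 0.5 and AC = -2; a raw-count comparison (1 versus 3) would overstate the difference because the first group was captured less efficiently.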

Processing kidney adult data with Signac

We used Signac²³ to evaluate the effectiveness of our method in correcting for batch effect at the feature level. We follow the standard workflow as recommended in the Signac vignette (https://stuartlab.org/signac/articles/pbmc_vignette.html). Briefly, we used the TF-IDF approach without feature selection (min.cutoff = ‘q0’), followed by SVD to reduce dimensionality. We then conduct clustering and UMAP visualization using the dimensions 2–30 (as the first LSI dimension usually reflects sequencing depth, per the Seurat tutorial). The sample and cell type labels are retrieved from the annotations in the initial publication.

Batch mixing score calculation

We calculated the batch mixing scores in the PCA space to measure the batch effect. At the cell level, the batch mixing score is adapted from ref. ³⁹ and is defined as the proportion of nearest neighbor cells with different batch identities, where a higher score indicates better mixing between batches and, thus, a smaller batch effect. At the whole data level, the batch mixing score is defined as the mean batch mixing score across all cells. To calculate the expected batch mixing score for a given dataset when no batch effect is present, let $M$ denote a cell type-by-batch matrix, with each element ${m}_{{ij}}$ representing the number of cells in the cell type $i$ and batch $j$. Then, the expected data-level batch mixing score in the setting of no batch effect is given by

$${{\rm{E}}}\left[{{\rm{batch}}} \, {{\rm{mixing}}} \, {{\rm{score}}}\right]=\frac{1}{{\sum }_{i,j}{m}_{{ij}}}\left\{{\sum }_{i}\left[{\sum }_{j}{m}_{{ij}}\left(\frac{{\sum }_{k\ne j}{m}_{{ik}}}{{\sum }_{k}{m}_{{ik}}}\right)\right]\right\}$$

(19)

The normalized batch mixing score is the batch mixing score divided by the expected score under random mixing, and thus a higher normalized batch mixing score indicates better mixing across samples.
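The expectation in Eq. 19 depends only on the cell type-by-batch count table, so it is easy to compute directly. A sketch, assuming every cell type has at least one cell; the function name `expected_batch_mixing` is hypothetical.

```python
import numpy as np

def expected_batch_mixing(m):
    """Expected data-level batch mixing score under no batch effect (Eq. 19)."""
    m = np.asarray(m, dtype=float)               # rows: cell types, columns: batches
    row_total = m.sum(axis=1, keepdims=True)     # sum_k m_ik
    other = row_total - m                        # sum_{k != j} m_ik
    return (m * other / row_total).sum() / m.sum()

balanced = expected_batch_mixing([[50, 50]])         # 0.5
confounded = expected_batch_mixing([[10, 0], [0, 10]])  # 0.0
```

When each cell type occurs in only one batch, the expected score is zero, so the normalized score is undefined in that fully confounded design.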

Processing developing human brain data

This dataset contains 18 specimens collected from human donors. Our study excluded samples with unknown spatial locations (GW17, GW18, GW21) or samples not from the cortex (MGE_GW20 and MGE_twin34). Here, we focused on the excitatory neuron lineage, including radial glia (RG), intermediate progenitor cells (IPCs), early excitatory neurons (earlyEN), deep layer excitatory neurons (dlENs), and upper layer excitatory neurons (ulENs). We further excluded the insular region for having too few cell counts (645 cells across five cell types). Since the data matrix was saved as a binary matrix, we implemented the missing-corrected logistic regression model to analyze this data.

DAR identification in the developing human brain data

We constructed two models to identify the significant region effect of the excitatory neuron lineage. Specifically, to identify the region effect, the systematic component of the PACS model is specified as:

$$\alpha+{\sum }_{k=2}^{K}{\gamma }_{k}1({G}_{c}={g}_{k})+{\sum}_{l=2}^{L}{\zeta }_{l}1({S}_{c}={S}_{l})+{\sum }_{m=2}^{M}{\tau }_{m}1({D}_{c}={D}_{m})$$

(20)

where $G$ is the index of cell type, $S$ is the index of spatial location, and $D$ is the index of the donor. The null hypothesis for the test is ${H}_{0}:\zeta=0$. To identify the cell type-specific region effect, we included the interaction terms between each cell type and spatial location, and the test was conducted for each interaction term.

Motif enrichment analysis

The motif enrichment analysis was conducted with Homer⁶¹. The list of significant DAR peaks is used as input for the analysis, with the size of the search region specified as 300 bp around the peak center. The reported motif enrichment scores are FDR-corrected P values from the known motif results.

DAR identification in the human PBMC treatment data

To identify the cell type-specific temporal effect in the PBMC treatment data, the systematic component of the PACS model is specified as:

$$\alpha+{\sum }_{k=2}^{K}{\gamma }_{k}1({G}_{c}={g}_{k})+\kappa E+{\sum }_{t=2}^{M}{\omega }_{t}1({D}_{c}={D}_{t})$$

(21)

where $G$ is the index of cell type, $E$ is the experimental time index (0, 1, 2 corresponds to control, 1 h, and 6 h after treatment, respectively), and $D$ is the donor index. The null hypothesis for the test is ${H}_{0}:\kappa=0$.

Gene and pathway enrichment with GREAT

We used the GREAT method (v. 4.0.4) for gene and pathway enrichment analysis⁴³, with DARs as input and default parameter settings. The output from GREAT for the human PBMC data can be found in Supplementary Data 17, 18.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

We downloaded the following scATAC-seq datasets from public repositories: mouse kidney data⁶ (GEO GSE157079, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE157079), human cell line data²¹ (GEO GSE162690, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE162690), developing human brain data¹¹ (GEO GSE163018, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE163018), marmoset brain data³² (the Brain Cell Data Center RRID SCR_017266; https://biccn.org/data), human PBMC time-series stimulation data⁴⁷ (GEO GSE178431, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE178431).

Code availability

PACS is open-source software available at the GitHub repository https://github.com/Zhen-Miao/PACS (ref. ⁶²). Code for reproducing the analyses is also available in the same repository.

References

Buenrostro, J. D. et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).
Mezger, A. et al. High-throughput chromatin accessibility profiling at single-cell resolution. Nat. Commun. 9, 3647 (2018).
Cusanovich, D. A. et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 174, 1309–1324.e18 (2018).
Arda, H. E. et al. A chromatin basis for cell lineage and disease risk in the human pancreas. Cell Syst. 7, 310–322.e4 (2018).
Zhang, K. et al. A single-cell atlas of chromatin accessibility in the human genome. Cell 184, 5985–6001.e19 (2021).
Miao, Z. et al. Single cell regulatory landscape of the mouse kidney highlights cellular differentiation programs and disease targets. Nat. Commun. 12, 2277 (2021).
Cusanovich, D. A. et al. The cis-regulatory dynamics of embryonic development at single-cell resolution. Nature 555, 538–542 (2018).
Satpathy, A. T. et al. Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat. Biotechnol. 37, 925–936 (2019).
Corces, M. R. et al. Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution. Nat. Genet. 48, 1193–1203 (2016).
Klemm, S. L., Shipony, Z. & Greenleaf, W. J. Chromatin accessibility and the regulatory epigenome. Nat. Rev. Genet. 20, 207–220 (2019).
Ziffra, R. S. et al. Single-cell epigenomics reveals mechanisms of human cortical development. Nature 598, 205–213 (2021).
Deng, Y. et al. Spatial profiling of chromatin accessibility in mouse and human tissues. Nature 609, 375–383 (2022).
Turner, A. W. et al. Single-nucleus chromatin accessibility profiling highlights regulatory mechanisms of coronary artery disease risk. Nat. Genet. 54, 804–816 (2022).
Benaglio, P. et al. Mapping genetic effects on cell type-specific chromatin accessibility and annotating complex immune trait variants using single nucleus ATAC-seq in peripheral blood. PLoS Genet. 19, e1010759 (2023).
Kim, S. & Wysocka, J. Deciphering the multi-scale, quantitative cis-regulatory code. Mol. Cell 83, 373–392 (2023).
Miao, Z., Humphreys, B. D., McMahon, A. P. & Kim, J. Multi-omics integration in the age of million single-cell data. Nat. Rev. Nephrol. 17, 710–724 (2021).
Ulirsch, J. C. et al. Interrogation of human hematopoiesis at single-cell and single-variant resolution. Nat. Genet. 51, 683–693 (2019).
Sullivan, K. M. & Susztak, K. Unravelling the complex genetics of common kidney diseases: from variants to mechanisms. Nat. Rev. Nephrol. 16, 628–640 (2020).
Sheng, X. et al. Mapping the genetic architecture of human traits to cell types in the kidney identifies mechanisms of disease and potential treatments. Nat. Genet. 53, 1322–1333 (2021).
Yu, F. et al. Variant to function mapping at single-cell resolution through network propagation. Nat. Biotechnol. 40, 1644–1653 (2022).
Granja, J. M. et al. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat. Genet. 53, 403–411 (2021).
Fang, R. et al. Comprehensive analysis of single-cell ATAC-seq data with SnapATAC. Nat. Commun. 12, 1337 (2021).
Stuart, T., Srivastava, A., Madad, S., Lareau, C. A. & Satija, R. Single-cell chromatin state analysis with Signac. Nat. Methods 18, 1333–1341 (2021).
Miao, Z. & Kim, J. Uniform quantification of single-nucleus ATAC-seq data with Paired-Insertion Counting (PIC) and a model-based insertion rate estimator. Nat. Methods 21, 32–36 (2024).
Agresti, A. Categorical Data Analysis. vol. 792 (John Wiley & Sons, 2012).
Firth, D. Bias reduction of maximum likelihood estimates. Biometrika 80, 27–38 (1993).
Heinze, G. A comparative investigation of methods for logistic regression with separated or nearly separated data. Stat. Med. 25, 4216–4226 (2006).
Moore, J. E. et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710 (2020).
Li, Y. E. et al. An atlas of gene regulatory elements in adult mouse cerebrum. Nature 598, 129–136 (2021).
Martens, L. D., Fischer, D. S., Theis, F. J. & Gagneur, J. Modeling fragment counts improves single-cell ATAC-seq analysis. Nat. Methods 21, 28–31 (2024).
Heinze, G. & Schemper, M. A solution to the problem of separation in logistic regression. Stat. Med. 21, 2409–2419 (2002).
Bakken, T. E. et al. Comparative cellular analysis of motor cortex in human, marmoset and mouse. Nature 598, 111–119 (2021).
Chen, Y., Lun, A. T. L. & Smyth, G. K. From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline. F1000Research 5, 1438 (2016).
Zhang, K., Zemke, N. R., Armand, E. J. & Ren, B. A fast, scalable, and versatile tool for analysis of single-cell omics data. Nat. Methods 21, 217–227 (2024).
Ntranos, V., Yi, L., Melsted, P. & Pachter, L. A discriminative learning approach to differential expression analysis for single-cell RNA-seq. Nat. Methods 16, 163–166 (2019).
Korsunsky, I. et al. Fast, sensitive, and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).
Chari, T. & Pachter, L. The specious art of single-cell genomics. PLoS Comput. Biol. 19, 8 (2023).
Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 12 (2020).
Li, H., McCarthy, D. J., Shim, H. & Wei, S. Trade-off between conservation of biological variation and batch effect removal in deep generative modeling for single-cell transcriptomics. BMC Bioinform. 23, 460 (2022).
Luecken, M. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).
McLean, C. Y. et al. GREAT improves functional interpretation of cis-regulatory regions. Nat. Biotechnol. 28, 495–501 (2010).
Tanigawa, Y., Dyer, E. S. & Bejerano, G. WhichTF is functionally important in your open chromatin data? PLoS Comput. Biol. 18, e1010378 (2022).
Ransick, A. et al. Single-cell profiling reveals sex, lineage, and regional diversity in the mouse kidney. Dev. Cell 51, 399–413.e7 (2019).
Nowakowski, T. J. et al. Spatiotemporal gene expression trajectories reveal developmental hierarchies of the human cortex. Science 358, 1318–1323 (2017).
Kartha, V. K. et al. Functional inference of gene regulation using single-cell multi-omics. Cell Genom. 2, 100166 (2022).
Ai, W., Li, H., Song, N., Li, L. & Chen, H. Optimal method to stimulate cytokine production and its use in immunotoxicity assessment. Int. J. Environ. Res. Public. Health 10, 3834–3842 (2013).
Brignall, R. et al. Integration of kinase and calcium signaling at the level of chromatin underlies inducible gene activation in T cells. J. Immunol. 199, 2652–2667 (2017).
Shin, H.-J., Lee, J.-B., Park, S.-H., Chang, J. & Lee, C.-W. T-bet expression is regulated by EGR1-mediated signaling in activated T cells. Clin. Immunol. 131, 385–394 (2009).
Dagan-Berger, M. et al. Role of CXCR3 carboxyl terminus and third intracellular loop in receptor-mediated migration, adhesion and internalization in response to CXCL11. Blood 107, 3821–3831 (2006).
Cooke, M. et al. Differential regulation of gene expression in lung cancer cells by diacyglycerol-lactones and a phorbol ester via selective activation of protein kinase C isozymes. Sci. Rep. 9, 6041 (2019).
Seo, H. et al. Dual-specificity phosphatase 5 acts as an anti-inflammatory regulator by inhibiting the ERK and NF-κB signaling pathways. Sci. Rep. 7, 17348 (2017).
Stolarczyk, E., Lord, G. M. & Howard, J. K. The immune cell transcription factor T-bet. Adipocyte 3, 58–62 (2014).
Lee, J. Cumulative logit modelling for ordinal response variables: applications to biomedical research. Comput. Appl. Biosci. 8, 555–562 (1992).
Bender, R. & Grouven, U. Ordinal logistic regression in medical research. J. R. Coll. Physicians Lond. 31, 546–551 (1997).
Winship, C. & Mare, R. D. Regression models with ordinal variables. Am. Sociol. Rev. 49, 512 (1984).
Christensen, R. H. B. Sensometrics: Thurstonian and Statistical Models. (Technical University of Denmark, Kgs. Lyngby, 2012).
Venzon, D. J. & Moolgavkar, S. H. A method for computing profile-likelihood-based confidence intervals. Appl. Stat. 37, 87 (1988).
Adey, A. C. Tagmentation-based single-cell genomics. Genome Res. 31, 1693–1705 (2021).
Duttke, S. H., Chang, M. W., Heinz, S. & Benner, C. Identification and dynamic quantification of regulatory elements using total RNA. Genome Res. 29, (2019).
Miao, Z. Depth-corrected multi-factor dissection of chromatin accessibility for scATAC-seq data with PACS, Zhen-Miao/PACS. Zenodo https://doi.org/10.5281/ZENODO.14004648 (2024).


Acknowledgements

This work was supported in part by grants UC2DK126024 and R01 DK126925-05 to J.K. and by the Health Research Formula Fund of the Commonwealth of Pennsylvania, which did not play a direct role in the work. We thank the Blavatnik Family Fellowship for supporting the work of Z.M. We thank Dr. Pablo Camara, Dr. Nancy Zhang, Dr. Kui Wang, Dr. Xiangjie Li, Dr. Yinan Lin, Dr. Mengying You, Dr. Yaqi Cao, and members of Junhyong Kim’s lab, especially Dr. Erik Nordgren, for their constructive suggestions that improved this work. We thank Dr. Kun Zhang and Dr. Jason Buenrostro for sharing the metadata.

Author information

Jianqiao Wang
Present address: Department of Statistics and Data Science, Tsinghua University, Beijing, China

Authors and Affiliations

Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
Zhen Miao & Junhyong Kim
Department of Biology, University of Pennsylvania, Philadelphia, PA, USA
Zhen Miao, Kernyu Park & Junhyong Kim
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
Jianqiao Wang
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA
Da Kuang & Junhyong Kim

Authors

Zhen Miao
Jianqiao Wang
Kernyu Park
Da Kuang
Junhyong Kim

Contributions

J.K. and Z.M. conceived the study. Z.M., J.W., and J.K. designed the statistical model. J.W. formulated the missing data model for sequencing depth and derived the analytical expression for the missing-corrected logistic regression estimation procedure. Z.M. implemented the model and constructed the software package with feedback from J.W., D.K., and J.K. Z.M. conducted the simulation and real data analysis with help from K.P. and D.K. J.K. supervised the work. J.K. and Z.M. wrote the manuscript with feedback from J.W.

Corresponding author

Correspondence to Junhyong Kim.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Kai Zhang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Reporting Summary

Description of Additional Supplementary Files

Supplementary Datasets 1-20

Transparent Peer Review file

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article

Cite this article

Miao, Z., Wang, J., Park, K. et al. Depth-corrected multi-factor dissection of chromatin accessibility for scATAC-seq data with PACS. Nat Commun 16, 401 (2025). https://doi.org/10.1038/s41467-024-55580-5


Received: 28 February 2024
Accepted: 10 December 2024
Published: 05 January 2025
Version of record: 05 January 2025
DOI: https://doi.org/10.1038/s41467-024-55580-5
