Introduction

Oligodendrocytes play key functional roles in the central nervous system (CNS), including myelination1,2. Myelination is a complex neurodevelopmental process that begins during brain development in the third trimester of pregnancy and increases steadily during childhood, but it can also be dynamically regulated in the context of learning and diseases affecting the mature CNS3,4. Oligodendrocyte dysfunction and myelin abnormalities have also been reported in CNS disorders2,5,6. Multidirectional interactions between neuronal and glial cells are required for CNS function7, including interactions between oligodendrocytes and neurons through myelination8. Therefore, it is critical to better understand the functions and roles of oligodendrocytes and myelin.

Gene expression during oligodendrocyte development from oligodendrocyte progenitor cells (OPCs) is governed by complex gene regulatory mechanisms involving transcription factors (TFs)3,4. TFs often work in a combinatorial fashion to regulate gene expression from regulatory elements9,10. For example, some TFs such as SOX10 and OLIG2 cooperate during the induction of genes for differentiation and myelin formation11,12,13,14. Enhancers can increase transcription levels from promoters and transcription start sites (TSSs), and much of the regulatory code that drives cell type-specific gene expression resides in these distal regulatory elements. In particular, some active enhancers are associated with the gene expression that characterizes cell identity and function15. Thus, it is important to identify active oligodendrocyte-specific enhancers and promoters, as well as the co-binding TFs that are responsible for their activity.

Next-generation sequencing technologies, including single-cell RNA sequencing (scRNA-seq) and the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq), have provided important insights into cell-type-specific gene regulation. Recent functional genomic resources such as PsychENCODE216 and GTEx17, together with emerging tools for integrating multi-omics data, enable the construction of cell-type-level gene regulatory networks (GRNs) that link TFs, their binding sites (TFBSs), and regulatory elements to target genes (TGs). These networks can reveal the cell-type-specific regulatory roles of TFs via regulatory elements. Moreover, additional bioinformatic tools such as SCENIC+18, Signac19, and scGRNom20 predict cell-type-specific gene regulatory networks to explain potential TF-TG relationships. However, most of these studies and tools focus on relationships between individual TFs and TGs instead of TF-TF interactions and their effects on TG expression. Consequently, due to the lack of tools, the mechanistic roles of cooperative TFs in establishing cell type-specific gene regulation remain uncharacterized.

To tackle these challenges, we introduce coTF-reg, an analytical framework that integrates scRNA-seq and scATAC-seq data to identify cooperative TFs that co-regulate TG expression. coTF-reg identifies cooperative co-binding TFs along with active regulatory elements for gene regulation as hallmarks of active oligodendrocyte-specific regulatory elements. First, it identifies co-binding TF pairs in these regulatory regions. Second, a deep learning model is trained to predict TG expression based on the expression profiles of co-binding TFs. Third, Shapley interaction scores are computed to evaluate the interactions between TF pairs. Our findings reveal high interactions between co-binding TF pairs, such as SOX10-TCF12. Validation using oligodendrocyte eQTLs and their eGenes that are regulated by these cooperative TFs showed potential regulatory roles for genetic variants. Experimental validation using ChIP-seq data confirmed some cooperative TF pairs, such as SOX10-OLIG2 and SOX10-NKX2.2. The prediction performance of our models was evaluated using holdout data and additional datasets, and an ablation study was also conducted. The results demonstrated stable and consistent performance. Overall, our results establish an analytic framework in which co-binding TF pairs cooperatively activate TG expression through oligodendrocyte-specific regulatory elements.

Results

Deep learning and single-cell multi-omics for identifying cooperative transcription factors in oligodendrocytes

To predict cooperative TFs involved in oligodendrocyte gene regulation, we designed coTF-reg, which integrates scRNA-seq and scATAC-seq data to identify the cooperative TFs that co-regulate target gene (TG) expression in oligodendrocytes (Fig. 1, Methods and Materials). Briefly, we first used scATAC-seq data with peak-to-gene links21. Second, among the regulatory regions for various cell types, we focused on those specific to oligodendrocytes. We then identified transcription factor binding sites (TFBSs) and co-binding TF pairs through motif co-occurrence and co-enrichment analyses. Third, we trained deep neural networks (DNNs) to predict the expression levels of the TGs using gene expression from scRNA-seq data22. To measure interaction effects between co-binding TFs on TG expression, we computed Shapley interaction (SI)23,24 scores for co-binding TF pairs and used them to identify cooperative TF pairs. Fourth, we built a gene regulatory network based on SI scores for co-binding TF pairs. Lastly, to independently validate the cooperative TF pairs we found, we mapped oligodendrocyte eQTLs onto the regulatory regions where cooperative TF pairs exist, performed Liftover analysis and co-enrichment analysis using ChIP-seq data, and applied Boolean rules to characterize the cooperativity of regulatory factors. To evaluate the prediction performance of our models, we used other publicly available datasets and conducted an ablation study by generating random TF sets to predict TG expression.

Fig. 1: Deep learning and single-cell multi-omics for identifying cooperative transcription factors in oligodendrocytes.
Inputs for the coTF-reg pipeline are scATAC-seq peak-gene links and scRNA-seq. It infers transcription factor binding sites (TFBSs) in regulatory regions and identifies co-binding TF pairs. Then, it measures cooperativity of co-binding TFs by predicting TF-TG relationships for the levels of expression using deep learning models and Shapley interaction scores. It outputs a gene regulatory network linking co-binding TF pairs with their TGs and regulatory variants on the regulatory regions where co-TFs have their binding sites.

Identification of the co-binding transcription factors in oligodendrocyte-specific regulatory regions

First, we identified a set of 787 oligodendrocyte-specific, differentially accessible regulatory regions by comparing oligodendrocyte scATAC-seq data to other brain cell types. In this set, we identified 958 motifs for inferred TFBSs using the JASPAR database. Second, we used co-occurrence analysis and co-enrichment analysis to evaluate the 458,403 possible TF pairs (see ‘Methods and Materials: Co-enrichment analysis’ for more details). After removing TF pairs from the same families and applying a false discovery rate (FDR) cutoff of <0.1, 8101 co-binding TF pairs remained. In total, 206 TFs with co-binding TFs were linked to 445 oligodendrocyte-specific TGs (Supplementary Data) in 643 regulatory regions (Fig. 2a). We annotated the regulatory regions to categorize them into promoters (32.5%) and enhancers (67.5%) (Fig. 2b).
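The pair counts above can be checked with a few lines of arithmetic. The sketch below assumes that the 958 motifs are the 949 JASPAR2022 motifs plus the nine additional PFMs described in the Methods, and that "possible TF pairs" means unordered pairs of distinct motifs; the variable names are illustrative only.

```python
from math import comb

# Counts taken from the text; how they combine is an assumption of this sketch.
n_jaspar_motifs = 949   # JASPAR2022 motifs (Methods, Step 1)
n_custom_motifs = 9     # additional modified PFMs (Methods, Step 1)
n_motifs = n_jaspar_motifs + n_custom_motifs

# Unordered pairs of distinct motifs: C(n, 2) = n * (n - 1) / 2
n_possible_pairs = comb(n_motifs, 2)

# Pairs retained after the same-family filter and the FDR < 0.1 cutoff
n_cobinding_pairs = 8101
percent_retained = 100 * n_cobinding_pairs / n_possible_pairs

print(n_motifs, n_possible_pairs, round(percent_retained, 2))
```

Under these assumptions the counts reproduce the 458,403 possible pairs reported above, of which fewer than 2% are retained as co-binding pairs.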

Fig. 2: Distribution and correlation of the numbers of co-binding transcription factors, target genes, and peaks for individual transcription factors, peak annotation, and summary statistics for transcription factor-target gene links.
a Summary statistics of transcription factor (TF)-target gene (TG) links. b Peak annotation. c Distributions of the numbers of co-binding TFs, TGs, and peaks for individual transcription factors. d Correlations between the numbers of co-binding transcription factors and target genes, as well as the numbers of co-binding TFs and peaks. e Distribution of the numbers of co-binding TF pairs, TGs, and peaks for transcription factors by families.

The density plots show the distributions of the number of co-binding TF pairs, the number of TGs, and the number of peaks for individual TFs that are co-bound to other TFs. Most of the TFs have 50 to 103 co-binding TFs (median = 78). The distribution of the number of TGs for TFs is right-skewed, and many TFs have 76 to 172 TGs linked. The distribution of the number of peaks for TFs is also right-skewed, and the most frequent intervals were between 75 and 180 peaks (Fig. 2c). The distributions of the numbers of TGs and peaks for co-binding TF pairs are approximately normal. On average, co-binding TF pairs have 60 TGs linked and 59 peaks (Supplementary Fig. 1). Additionally, other density plots show the distributions of the number of TGs and the number of peaks for co-binding TF pairs, and bar plots display the numbers of co-binding TFs, TGs, and peaks for individual TFs by their family categories (Fig. 2d). Co-binding TFs have 4 to 115 TGs (median = 59) and 4 to 123 peaks (median = 56), and the most frequent motifs are associated with TF families with C2H2 zinc finger (ZF), bZip, and bHLH DNA-binding domains (Fig. 2e). C2H2 ZF proteins are a large family, and C2H2 ZF TFs (e.g., ZNF2425 and KLF9/1326) are known to play significant roles in the development and function of oligodendrocytes, which are the myelinating cells of the CNS. These TFs can regulate the expression of genes essential for oligodendrocyte differentiation, survival, and myelination processes25,27,28.

We computed the Pearson correlation coefficient (r) between the number of co-binding TFs and the number of TGs, and between the number of co-binding TFs and the number of peaks, for individual TFs (Fig. 2d). The number of co-binding TFs and the number of TGs for individual TFs are strongly positively correlated (r = 0.70). This suggests that TFs that are co-bound to other TFs tend to have more TGs linked to them. The number of co-binding TFs and the number of peaks for individual TFs are also strongly positively correlated (r = 0.67). This shows that co-binding TFs may exist in many different peaks.
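As a minimal sketch of the correlation computed here, the Pearson coefficient can be written directly from its definition. The per-TF counts below are made-up toy values, not data from this study, and the function name is illustrative.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Toy per-TF counts (hypothetical): number of co-binding partners and
# number of linked target genes for five TFs.
n_cobinding_tfs = [50, 62, 78, 90, 103]
n_target_genes = [80, 76, 120, 150, 172]

r = pearson_r(n_cobinding_tfs, n_target_genes)
print(round(r, 2))
```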

In the following sections, we incorporate RNA-seq data to explore gene expression relationships between TFs and TGs, train deep learning models to predict TG expression using co-binding TFs, and compute TF interaction scores using the trained models.

Oligodendrocyte gene expression relationships between transcription factors and target genes

A single-cell study identified the unique gene expression profile of oligodendrocytes compared to other brain cell types22, as shown by the two-dimensional Uniform Manifold Approximation and Projection (UMAP) space after computing latent representations of the neighborhood graph (Fig. 3a). The UMAP embeddings reveal that oligodendrocytes exhibit a distinct expression profile compared to other cell types. This separation suggests that oligodendrocytes have unique transcriptional programs that differentiate them from neighboring cell types. The distinct clustering of oligodendrocytes in the UMAP space indicates specialized functional roles and may reflect their involvement in myelination and maintenance of neural integrity. To focus on oligodendrocyte-specific mechanisms of gene regulation, we conducted differential expression testing using 17,946 genes and 20,191 metacells and identified 4387 differentially expressed genes (DEGs) for oligodendrocytes. We found that 445 of the 507 TGs of oligodendrocyte-specific regulatory elements (88%) were DEGs for oligodendrocytes. Subsequently, we conducted enrichment analysis for these 445 TGs, which revealed their involvement in crucial biological processes for oligodendrocytes, such as oligodendrocyte development, oligodendrocyte differentiation, and myelination (Fig. 3b).

Fig. 3: Oligodendrocyte gene expression relationships between transcription factors and target genes.
a UMAP for eighteen cell types in the middle temporal gyrus region, b Enrichment analysis for target genes that are oligodendrocyte-specific, c Pairwise two-sided t-tests for the correlation (between the expression of the TFs and their TGs) comparison for three TF groups: 10 oligodendrocyte key transcription factors, 83 oligodendrocyte-specific non-key transcription factors, and 103 non-oligodendrocyte-specific transcription factors, and d Boxplots for the expression levels of the three categories in (c) (each column is an example for each category).

We categorized TFs into oligodendrocyte key TFs, oligodendrocyte-specific non-key TFs, and non-oligodendrocyte-specific TFs using oligodendrocyte expression level and the list of key TFs (see ‘Methods and Materials: Key TFs’ for more details). ‘Oligodendrocyte-specific key TFs’ are oligodendrocyte differentially expressed TFs and key TFs, ‘oligodendrocyte-specific non-key TFs’ are oligodendrocyte differentially expressed TFs but not key TFs, and ‘non-oligodendrocyte-specific TFs’ are neither oligodendrocyte differentially expressed TFs nor key TFs. The key oligodendrocyte TFs were defined based on mouse loss-of-function studies that have shown that specific TFs are critical for oligodendrocyte differentiation. The key TFs include SOX1029, SOX230,31, SOX832, MYRF33, OLIG134, OLIG235, TCF7L236,37, ZNF2425, NKX2.238, and NKX6.239.

Each of the 206 TFs that have co-binding TFs regulates a different set of TGs, and we computed correlations between the expression of the TFs and their TGs in the three categories. Pairwise two-sided t-tests show that correlations between TFs and TGs in oligodendrocyte key TF pairs and those in non-oligodendrocyte-specific TF pairs are significantly different (p < 0.001). They also indicate that correlations between TFs and TGs in oligodendrocyte-specific non-key TF pairs and those in non-oligodendrocyte-specific TF pairs are significantly different (p < 0.001) (Fig. 3c). The results of differential expression testing show that the six TFs in the two categories, oligodendrocyte key TF pairs and oligodendrocyte-specific non-key TF pairs, are all significant and up-regulated (Fig. S8).

We color-coded the UMAP embeddings based on the expression level of the TFs (Fig. 3a) and selected three TFs as examples for each category. Oligodendrocyte-specific key TFs such as SOX10, MYRF, and OLIG2 are specifically highly expressed in oligodendrocytes. Oligodendrocyte-enriched non-key TFs, including RBPJ, JUND, and KLF7, are expressed in multiple cell types but are more highly expressed in oligodendrocytes. Non-oligodendrocyte-specific TFs, such as RUNX1, HLF, and CREB1, are not specifically expressed in oligodendrocytes (Fig. 3d).

Deep learning and Shapley interaction scores to measure cooperativity of co-binding transcription factors

To understand the complex relationships between TFs for predicting TGs, we built deep learning models. We trained a deep learning model for each of the 445 TGs. Each model used the expression levels of the 206 TFs that have co-binding TFs to predict a TG expression level. We used seven hidden layers in each DNN (Fig. 4a). We excluded co-binding TF-TG pairs that exhibited high variability in their SI scores (coefficient of variation > 0.5). Using a trained model and a hold-out test dataset, we computed SI scores for TFs in each DNN. Additionally, we determined the percentile SI score for all co-binding TF pairs. Then, a two-sided t-test comparing the mean percentile SI scores of key co-binding TF pairs and non-key co-binding TF pairs revealed a significant difference between the two groups (p < 0.0001) (Fig. 4b).
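To illustrate what a Shapley interaction score measures, the sketch below computes the pairwise Shapley interaction index exactly for a toy three-feature function. It is an illustration under stated assumptions rather than the coTF-reg implementation: the toy model stands in for a trained DNN, absent features are replaced by a baseline value, and the weighting |S|!(n-|S|-2)!/(n-1)! is one standard definition (some implementations halve it). All function and variable names are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_interaction(value_fn, n_features, i, j):
    """Exact pairwise Shapley interaction index for features i and j.

    value_fn(S) returns the model output when only the features in the
    set S are 'present' (the others are held at a baseline).
    """
    others = [k for k in range(n_features) if k not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        weight = (factorial(size) * factorial(n_features - size - 2)
                  / factorial(n_features - 1))
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Discrete second difference: joint effect minus individual effects
            delta = (value_fn(s | {i, j}) - value_fn(s | {i})
                     - value_fn(s | {j}) + value_fn(s))
            total += weight * delta
    return total

def make_value_fn(model, x, baseline):
    """Mask absent features with baseline values (an assumed masking scheme)."""
    def value_fn(present):
        masked = [x[k] if k in present else baseline[k] for k in range(len(x))]
        return model(masked)
    return value_fn

def toy_model(x):
    # Stand-in for a trained DNN: features 0 and 1 interact, feature 2 is additive.
    return 2.0 * x[0] + 1.0 * x[1] + 3.0 * x[0] * x[1] + 0.5 * x[2]

v = make_value_fn(toy_model, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
si_01 = shapley_interaction(v, 3, 0, 1)  # interacting pair
si_02 = shapley_interaction(v, 3, 0, 2)  # purely additive pair
print(si_01, si_02)
```

In this toy case the interacting pair recovers the coefficient of the product term, while the additive pair scores zero. Exact enumeration is only feasible for a handful of features; with 206 TF inputs the number of subsets is astronomically large, so an approximation would be required in practice.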

Fig. 4: Cooperative transcription factor pairs by Shapley interaction scores.
a Deep learning architecture, b Percentile distribution of interaction scores for 577 key-transcription factor pairs and 7029 non-key transcription factor pairs, c Top forty-eight SI interaction scores for key transcription factor pairs, and d Top forty-eight SI interaction scores for non-key transcription factor pairs.

To highlight several important key co-binding TF pairs, we selected the top forty-eight interacting pairs involving key TFs, such as SOX10, MYRF, OLIG1, OLIG2, NKX6.2, and TCF7L2, and generated a heatmap of their SI scores scaled from 0 to 1 (Fig. 4c). Similarly, we chose the top forty-eight interacting co-binding TF pairs for non-key TFs and created another heatmap of their SI scores scaled from 0 to 1 (Fig. 4d). We noticed that the SI scores for key-TF co-binding pairs are higher than those for non-key co-binding TF pairs.

We also validated our model prediction performance for one TG, myelin basic protein (MBP), using additional data (Supplementary Fig. 2)40. We regressed the scaled actual values on the scaled predicted values. For our primary dataset, we obtained an R-squared of 0.81 and an r of 0.90 (Supplementary Fig. 2a). Furthermore, when analyzing another dataset, we observed an R-squared of 0.69 and an r of 0.83, affirming the predictive capability of our model architecture (Supplementary Fig. 2b).
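The two statistics reported here are linked: for a simple linear regression with an intercept, R-squared equals the square of the Pearson r, which is why 0.90 pairs with 0.81 and 0.83 with 0.69. The sketch below shows this on made-up toy values (not study data); the function names are illustrative.

```python
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def r_squared_simple_regression(x, y):
    """R-squared of the least-squares fit y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot

# Toy scaled values (hypothetical): x = predicted, y = actual expression
predicted = [0.1, 0.4, 0.5, 0.9, 1.2]
actual = [0.0, 0.5, 0.4, 1.0, 1.1]

r = pearson_r(predicted, actual)
r2 = r_squared_simple_regression(predicted, actual)
print(round(r, 3), round(r2, 3))

# Consistency check of the values reported in the text
print(round(0.90 ** 2, 2), round(0.83 ** 2, 2))
```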

Oligodendrocyte gene regulatory network analysis for cooperative TF pairs and transcription factor hierarchy

From the key co-binding TF pairs involving each of six key TFs (SOX10, MYRF, OLIG1, OLIG2, NKX6.2, and TCF7L2), we chose the pair with the highest interaction score. We built a gene regulatory network (GRN) for these cooperative TF pairs and the TGs that they co-regulate (Fig. 5a). We found that one TG, CALD1, is co-regulated by three key cooperative TF pairs, SOX10-TCF12, RORA-OLIG2, and FOXP1-NKX6.2, and another TG, PPP1R16B, is co-regulated by three key cooperative TF pairs, RORA-OLIG2, FOXP1-NKX6.2, and FOXP1-OLIG1. There are other TGs, such as AMOTL2, BOK, CALD1, FA2H, and CPM, in the GRN that are co-regulated by two pairs of cooperative TFs.

Fig. 5: Gene regulatory network and transcription factor hierarchy.
a Gene regulatory network for six key cooperative transcription factor pairs with the highest interaction scores, b A plot of in-degree (I) vs out-degree (O) for the transcription factors that have I and O in the gene regulatory network, and c Transcription factor hierarchy. Each node depicts a transcription factor. In (b, c), the edges colored in orange are the top-level (master) regulators, the edges colored in yellow are the middle-level regulators, and the edges colored in white are the bottom-level regulators.

We computed in-degree and out-degree for eighteen TFs that can also be TGs at the same time, since TF feed-forward and feedback loops are common (Fig. 5b). Then, we conducted a TF hierarchy analysis and found eight top-level regulators (Fig. 5c), called ‘Master regulators’, including SOX10, SOX2, and SOX8, which are key TFs that are known to play critical roles in oligodendrocyte differentiation30,31,32. The other five master regulators, MEIS1, MEIS2, RBPJ, JUND, and ZNF281, are categorized as oligodendrocyte-specific non-key TFs in Fig. 3c. All TFs that are middle-level regulators and bottom-level regulators, except for MYRF, are categorized as oligodendrocyte-specific non-key TFs. MYRF is one of the key TFs and is specifically activated in myelinating oligodendrocytes. PROX1 has been identified as being important for oligodendrocyte differentiation41,42. Most of these eighteen TFs are expressed in both oligodendrocytes and OPCs (Supplementary Fig. 4). This provides evidence that oligodendrocyte differentiation is pre-set in OPCs43.
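As a minimal sketch of the degree computation, the snippet below counts in-degree (I) and out-degree (O) on a toy directed TF-to-TF network and derives a simple hierarchy score h = (O - I) / (O + I). The edges and node names are placeholders, and the score is one common heuristic assumed here for illustration; the text does not specify the exact hierarchy method, so this should not be read as the procedure used in the study.

```python
from collections import Counter

# Toy directed edges (regulator, target); placeholder names, not study results.
edges = [
    ("TF_A", "TF_B"),
    ("TF_A", "TF_C"),
    ("TF_B", "TF_C"),
    ("TF_D", "TF_A"),
]

out_degree = Counter(src for src, _ in edges)  # O: number of TFs regulated
in_degree = Counter(dst for _, dst in edges)   # I: number of regulating TFs
nodes = sorted({node for edge in edges for node in edge})

def hierarchy_height(node):
    """(O - I) / (O + I): +1 means pure regulator, -1 means pure target."""
    o, i = out_degree[node], in_degree[node]
    return (o - i) / (o + i)

heights = {node: hierarchy_height(node) for node in nodes}
for node in sorted(nodes, key=heights.get, reverse=True):
    print(node, in_degree[node], out_degree[node], round(heights[node], 2))
```

Nodes near +1 would sit at the top of such a hierarchy and nodes near -1 at the bottom; thresholds for the three levels would be a further modeling choice.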

Independent validation for cooperative TFs

eQTL mapping

As an independent assessment of the regulatory regions, we mapped oligodendrocyte eQTLs44 onto oligodendrocyte-specific regulatory regions to explain the causal relationships between the expression levels of the co-binding TF pairs we identified and their target genes (TGs). Using the chromosome and position of eQTL SNPs (eSNPs) from oligodendrocyte eQTLs, we integrated eSNPs with a total of 643 oligodendrocyte-specific regulatory regions (Fig. 2a). This integration facilitates the identification of potential regulatory connections between the eSNPs and the co-binding TFs in these regions, enhancing our understanding of how genetic variations influence the expression levels of the identified co-binding TF pairs and their corresponding TGs. Notably, when the eQTL genes and TGs are identical in regions where co-binding TF pairs occur, this supports a causal interpretation in which these co-binding TF pairs co-regulate TG expression.

First, among 4.8 million oligodendrocyte eQTLs, we retained 2 million significant (FDR < 0.05) eQTLs. Second, we mapped these significant eSNPs onto oligodendrocyte-specific regulatory regions (Fig. 6a). In total, 383 eSNPs and 159 eGenes were mapped onto 188 regulatory regions. Among these, 373 eSNPs and 153 eGenes (and TGs) were found in 179 regulatory regions associated with key TF pairs. Enrichment analysis for TGs indicates their strong involvement in biological processes such as oligodendrocyte development, myelination, and oligodendrocyte differentiation (Fig. 6b).
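The mapping step can be sketched as a filter followed by an interval lookup. The example below is a toy illustration under assumptions: coordinates are treated as closed intervals on a shared genome build, a hit additionally requires the eGene to match the region's linked TG, and all records and names are invented placeholders rather than study data.

```python
from collections import defaultdict

FDR_CUTOFF = 0.05  # significance threshold stated in the text

# Toy regulatory regions: (chromosome, start, end, linked target gene)
regions = [
    ("chr1", 1000, 1500, "GENE1"),
    ("chr1", 5000, 5600, "GENE2"),
    ("chr2", 200, 900, "GENE3"),
]

# Toy eQTL records: (chromosome, SNP position, eGene, FDR)
esnps = [
    ("chr1", 1200, "GENE1", 0.001),  # inside region, eGene matches TG
    ("chr1", 5100, "GENE9", 0.01),   # inside region, eGene differs from TG
    ("chr2", 950, "GENE3", 0.02),    # outside every region
    ("chr1", 1400, "GENE1", 0.2),    # fails the FDR filter
]

# Index regions by chromosome so each eSNP is only compared to local regions
regions_by_chrom = defaultdict(list)
for chrom, start, end, target_gene in regions:
    regions_by_chrom[chrom].append((start, end, target_gene))

hits = []
for chrom, pos, egene, fdr in esnps:
    if fdr >= FDR_CUTOFF:
        continue
    for start, end, target_gene in regions_by_chrom[chrom]:
        if start <= pos <= end and egene == target_gene:
            hits.append((chrom, pos, egene, start, end))

print(hits)
```

For millions of eSNPs a sorted-interval or interval-tree lookup would be preferable to the nested loop used in this sketch.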

Fig. 6: Independent validation using eQTLs and ChIP-seq data.

a eQTL mapping onto oligodendrocyte regulatory regions, b ChIP-seq peaks for cooperative TF pairs that co-regulate MBP. The ChIP-seq profile is for SOX10, and solid blocks indicate called peaks for the specified transcription factors. c Gene ontology enrichment analysis of target genes associated with oligodendrocyte eQTLs, and d,e Heatmaps showing the distribution of SOX10 ChIP-seq reads centered on the previously defined OLIG2 and NKX2.2 sites.

Validation of cooperative TF pairs

The model generated from human epigenome and expression data predicted a number of enriched TF pairs within oligodendrocyte-specific TF regulatory elements. To test whether the coordination occurs as predicted, we utilized rat oligodendrocyte ChIP-seq data that were available for selected transcription factors. One predicted pair was OLIG2/SOX10, which had previously been shown to be extensively colocalized in analyses of rat oligodendrocytes45. To visualize the preferential binding of SOX10 on a global scale, a read density plot for SOX10 ChIP-seq reads11 was generated centered on the previously defined OLIG2 peaks45 in oligodendrocytes (Fig. 6c). In line with previous analysis, the average read density of SOX10 is highly enriched over OLIG2-bound sites. A newly found pair predicted by the model was NKX2.2 and SOX10. We generated a similar plot of SOX10 ChIP-seq reads over a defined set of NKX2.2 ChIP-seq peaks in oligodendrocytes46 and found a similarly high enrichment of SOX10 binding on ~40% of NKX2.2 binding sites (Fig. 6d). An example of the colocalization is shown for the MBP gene. MBP is a crucial TG in oligodendrocytes as a key component of the myelin sheath47,48. Expression of MBP is essential for the differentiation and maturation of oligodendrocytes49,50, and MBP maintains the structure and integrity of the myelin sheath51. As shown in Fig. 6e, there are at least 2 sites upstream of MBP where there is colocalization of SOX10 with NKX2.2 and OLIG2.

Boolean cooperativity of TF pairs

We characterized the Boolean cooperativity of TFs with logic-circuit models using Loregic52. A total of 206 TFs that form 8101 co-binding TF pairs were input. Of the 8101 co-binding TF pairs, 6660 (82.2%) have consistent triplets, meaning that they match the same logic gate across all targets, which demonstrates strong cooperation between the activities of the two TFs on the TGs. More than half of the TF1-TF2-TG triplets are categorized as “AND”, indicating a positive correlation between TG expression and the expression of both TF1 and TF2 (Fig. S6a). We also computed permutation scores to remove logic gates selected by chance. After this filtering, 6092 TF pairs still have consistent triplets and 64% of triplets are categorized as “AND” (Fig. S6b).
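The idea of assigning a logic gate to a TF1-TF2-TG triplet can be sketched as follows. This is not Loregic's implementation or API: it is a simplified illustration that assumes expression has already been binarized, scores each of the 16 two-input gates by the fraction of samples it reproduces, and calls a TF pair "consistent" when every target receives the same best gate. All data and names are hypothetical.

```python
from itertools import product

INPUT_ROWS = [(0, 0), (0, 1), (1, 0), (1, 1)]

# All 16 two-input gates, keyed by their outputs for the four input rows.
GATES = {outputs: dict(zip(INPUT_ROWS, outputs))
         for outputs in product((0, 1), repeat=4)}
AND_GATE = (0, 0, 0, 1)
OR_GATE = (0, 1, 1, 1)

def best_gate(tf1, tf2, tg):
    """Return the gate whose truth table agrees with the most samples."""
    def agreement(outputs):
        table = GATES[outputs]
        matches = sum(table[(a, b)] == t for a, b, t in zip(tf1, tf2, tg))
        return matches / len(tg)
    return max(GATES, key=agreement)

def is_consistent(tf1, tf2, targets):
    """True if every target of the TF pair is assigned the same gate."""
    return len({best_gate(tf1, tf2, tg) for tg in targets}) == 1

# Toy binarized expression across six samples (hypothetical)
tf1 = [0, 0, 1, 1, 1, 0]
tf2 = [0, 1, 0, 1, 1, 1]
tg_and = [a & b for a, b in zip(tf1, tf2)]  # behaves like AND
tg_or = [a | b for a, b in zip(tf1, tf2)]   # behaves like OR

print(best_gate(tf1, tf2, tg_and) == AND_GATE)
print(is_consistent(tf1, tf2, [tg_and, tg_and]))
print(is_consistent(tf1, tf2, [tg_and, tg_or]))
```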

Independent validation for the prediction performance of the models

Model prediction validation and ablation study

Using Multi-omics scRNA-seq data21 from the same cells as the scATAC-seq data in the main analysis, we trained deep learning models and computed SI scores. The forty-eight SI scores for key-TF pairs in Fig. 4c were selected. The correlation between the SI scores computed from the main data and the Multi-omics data is shown in Fig. S5a. We also ran a two-sided t-test to compare the mean values for the percentile SI scores of key co-binding TF pairs and non-key co-binding TF pairs, as we did for the main data (Fig. 4b). There was a significant difference between the two groups (p < 0.0001) (Fig. S5b).

Model performance was evaluated using the holdout data. Additionally, we included three more publicly available scRNA-seq datasets: Multi-omics, ROSMAP40, and Cross-disorder53, and validated the prediction performance of our model for each TG. Here, TGs were predicted using the trained models and the entire datasets. The holdout data, Multi-omics, and ROSMAP show consistently low normalized root mean squared error (NRMSE), while more than 75% of predictions in Cross-disorder also have low NRMSE. NRMSE can be compared across genes (Fig. S7a).
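For reference, a normalized RMSE can be computed as below. The text does not state which normalization was used, so this sketch assumes division by the range of the observed values (division by the mean or standard deviation are other common choices); the expression values are toy numbers.

```python
from math import sqrt

def nrmse(actual, predicted):
    """RMSE divided by the range of the observed values (assumed normalization)."""
    n = len(actual)
    rmse = sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
    value_range = max(actual) - min(actual)
    if value_range == 0:
        raise ValueError("NRMSE is undefined when all observed values are equal")
    return rmse / value_range

# Toy expression values for one target gene (hypothetical)
observed = [0.0, 1.0, 2.0, 3.0, 4.0]
predicted = [0.2, 0.9, 2.1, 2.7, 4.3]
print(round(nrmse(observed, predicted), 3))
```

Because the error is scaled by each gene's own spread, the resulting values are unitless, which is what allows comparison across genes with different expression ranges.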

We also conducted an ablation study to compare the prediction performance of our models. Another dataset for 206 random TFs that are neither co-binding nor cooperative was created and their prediction performance was compared to that of 206 co-binding TFs (Fig. S7b). The model prediction performance is much better overall when 206 TFs used for predicting TGs are either co-binding, cooperative, or both.

Discussion

With resources provided by advances in single-cell sequencing, some studies54,55,56,57 have elucidated the roles of several TFs, enabling the construction of cell type-specific gene regulatory networks to explain potential TF-TG relationships using bioinformatic tools. However, most of these studies and tools primarily focus on relationships between independent TFs and TGs.

This study introduces an analytical framework, coTF-reg, which identifies co-binding TFs and their TGs in oligodendrocyte-specific regulatory regions. Deep learning models predicted TG expression levels using the expression levels of co-binding TF pairs, and we computed TF SI scores to define highly interacting co-binding TF pairs as ‘cooperative’ TFs that co-regulate TG expression levels. We found that the key co-binding TF pairs tend to highly interact with each other compared to non-key co-binding TF pairs for predicting TG expression levels. Independent validation, such as mapping eQTLs onto the regulatory regions, provides evidence for causal relationships between co-binding TF pairs and TGs. Additionally, converting these regions to the rat genome assembly coordinates and measuring the density of ChIP-seq signals for key cooperative TFs show that many of these TF pairs are enriched in the regulatory regions, indicating their collaborative role in co-regulating TG expression levels. We defined specific key TFs and examined co-binding TF pairs containing them, along with their interactions in predicting TG expression levels. We then compared these results with those of non-key TF pairs. Overall, co-binding TF pairs with known regulators of oligodendrocyte development exhibit higher SI scores, suggesting that they not only regulate TG expression individually but also cooperatively. We identified several highly cooperative TF pairs, such as SOX10 and OLIG212,58, which are already known. Additionally, we discovered previously unreported cooperative pairs, such as SOX10 and NKX2.2.

Our study demonstrates several strengths. First, we concentrate on interactions between co-binding TF pairs and their impact on TG expression using deep learning approaches. Deep learning can elucidate complex TF relationships and their effects on TG expression levels. Second, the coTF-reg pipeline can be used by general users with any scATAC-seq and scRNA-seq data. The code for coTF-reg is openly available on GitHub, allowing users to input their scATAC-seq and scRNA-seq data for specific purposes. Third, we provide a comprehensive analytical framework that incorporates analyses utilizing co-binding by motif and expression levels. We define ‘cooperative’ TF pairs as TF pairs that are significantly co-enriched across regulatory regions and exhibit high SI scores in terms of expression when predicting TG expression. The term cooperativity has often been applied to co-binding of TFs to nearby sites that facilitates stabilized binding due to protein-protein interactions, but in our model, we use TF pairs that can bind to sites in the same regulatory regions, since TFs can coordinately activate enhancers without direct interactions.

Nevertheless, there are some limitations to our study. To begin with, it is important to note that more than two TFs can co-regulate TG expression59,60. However, our current tool is limited to analyzing interactions between two co-binding TFs. In future research, developing or applying more sophisticated methods capable of handling clusters of TFs that co-regulate the same TG expression will be informative. Moreover, our method for identifying binding sites relies on the position frequency matrices in the motif database. While both SOX10 and MYRF are key TFs for oligodendrocytes, we encountered difficulty in obtaining sufficient binding sites for MYRF. Consequently, we had to supplement with a different motif for MYRF based on our prior knowledge. More generally, the definition of TF motifs relies on disparate methods, and limitations of motif generation and analysis have been noted previously. Nonetheless, our analysis provided TF-TF coordination that we could validate using data from previous studies. We predict that future analysis can be used to determine whether the predicted TF pairing plays a role in oligodendrocyte differentiation, since reliance on single-factor studies cannot recapitulate the important combinatorial functions of TFs in generating cell type-specific gene expression patterns. Lastly, there can be alternative methods for establishing cooperative relationships between TFs, such as Boolean rules61,62,63,64,65. Logic-based models are also powerful tools for understanding the complex interactions among regulatory TFs in gene regulation. Developing new tools that incorporate Boolean rules and machine learning approaches will help us effectively infer more intricate TF relationships, paving the way for future research aimed at unraveling the complexities of gene regulation.

Methods

coTF-reg pipeline workflow

First, published scATAC-seq data with peak-to-gene links21 are input into the coTF-reg pipeline. Second, transcription factor binding sites (TFBSs) and co-binding TF pairs in the oligodendrocyte regulatory regions are identified through motif co-occurrence and co-enrichment analyses. Third, deep neural networks (DNNs) are trained to predict the expression levels of the TGs using gene expression from scRNA-seq data22, and the interaction effects between co-binding TFs on TG expression are measured by computing Shapley interaction (SI)23,24 scores. Fourth, a gene regulatory network is built based on SI scores for co-binding TF pairs. Fifth, a TF hierarchy analysis is used to define TFs as regulators in three categories. Lastly, to independently validate the cooperative TF pairs, (1) the oligodendrocyte eQTLs are mapped onto the regulatory regions where cooperative TF pairs exist, (2) Liftover analysis and co-enrichment analysis using ChIP-seq data are conducted, and (3) Boolean rules are applied to characterize the cooperativity of regulatory factors. To evaluate the prediction performance of our models, (1) other publicly available datasets are used as validation data to predict TG expression, and (2) an ablation study is implemented by generating random TF sets to predict TG expression.

Step 1: Infer transcription factor binding sites

We inferred transcription factor binding sites (TFBSs) in 787 scATAC-seq peak regions that have linkages with TGs.

  1. a)

    The R package GenomicRanges was used to format the ATAC-seq peaks into genomic ranges.

  2. b)

    Position frequency matrices (PFMs) for the 949 motifs in the JASPAR2022 database66 were loaded in R, along with nine additional PFMs for important modified motifs based on our prior knowledge.

  3. c)

    TFBSs in the scATAC-seq peak regions were inferred using the R package motifmatchr67.

Step 2: Identify co-binding transcription factor pairs

We identified co-binding TF pairs using the inferred TFBSs in Step 1.

  1. a)

    All possible TF-TF pairs with binding sites in the scATAC-seq peak regions were considered.

  2. b)

    TF pairs from the same families were excluded.

  3. c)

    Co-enrichment analysis: Co-occurrence analysis was conducted to find TF pairs with overlapping binding regions. We then conducted hypergeometric tests to find TF pairs significantly enriched in the same regions. We corrected for multiple testing via FDR and applied a cutoff of FDR < 0.1. We define the co-enriched TF pairs (FDR < 0.1) as ‘co-binding’ TF pairs.

  4. d)

    Gene regulatory networks (GRNs) were constructed from TG-co-TF pair-peak links, and the TGs and co-TF pairs were matched to the scRNA-seq data.

  5. e)

    Lowly expressed TGs and TFs were removed from the GRNs by applying a cutoff of median expression level > 1 (that is, expressed in more than half of the cells).

  6. f)

    Differential expression testing was implemented using Seurat68, and oligodendrocyte-specific TGs in the GRNs were selected.

  7. g)

    Peaks were annotated as promoters or enhancers using annotatr69.

Step 3: Measure cooperativity of co-binding transcription factors

Gene expression levels of the co-binding TF pairs from scRNA-seq data were incorporated into deep learning models to predict the expression levels of the TGs and measure interaction effects between co-binding TFs on the expression levels of TGs using Shapley interaction (SI) scores.

  1. a)

    Metacells for the cells in scRNA-seq data were projected using a Python package, metacells70.

  2. b)

    Expression levels of TFs that have co-binding TFs and TGs were used to construct deep learning models for each TG using PyTorch71 in Python.

  3. c)

    SI scores for TF pairs were computed in each deep learning model.

  4. d)

    Interaction matrices for the SI scores were generated in deep learning models and the mean interaction scores for co-binding TF pairs were calculated.

  5. e)

    Coefficients of variation (CV)72 of the interaction scores for each co-binding TF pair were computed and the pairs with CV values higher than 0.5 were removed.

Step 4: Gene regulatory network

A gene regulatory network was built for six key cooperative TFs.

  1. a)

    One cooperative TF pair for each of the six key TFs was selected based on the top interaction scores.

  2. b)

    A gene regulatory network was built linking cooperative TF pairs to TGs.

  3. c)

    TGs co-regulated by cooperative TF pairs were selected.

  4. d)

    A network plot was generated using Cytoscape73.

Step 5: TF hierarchy analysis

TFs that can be TGs were chosen, and we implemented hierarchy analysis74 for those TFs.

  1. a)

    In-degree (I) and out-degree (O) for the TFs were calculated.

  2. b)

    Hierarchy height metrics for the TFs were computed.

  3. c)

    TFs were classified as top-regulator, middle-regulator, or bottom-regulator.

Step 6: Independent validation

To validate the cooperative TF pairs, we implemented eQTL mapping, ChIP-seq enrichment analysis, and Boolean cooperativity analysis. To validate the prediction performance of the models, we implemented model prediction validation and an ablation study.

Validation of cooperative TF pairs

eQTL mapping

We mapped the significant (FDR < 0.05) oligodendrocyte eQTLs onto the scATAC-seq peak regions.

  1. a)

    Publicly available oligodendrocyte eQTL data44 were downloaded and the significant (FDR < 0.05) eQTLs were extracted.

  2. b)

    The significant eQTLs were mapped to the scATAC-seq peak regions in the GRNs.

  3. c)

    The results were verified by comparing the number of eQTLs mapped onto the peak regions for key-TF pairs and non-key-TF pairs.
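The mapping in steps a) to c) amounts to an interval-overlap query. The following is a minimal sketch, assuming eQTLs are stored as (chromosome, position, FDR) tuples and peaks as (chromosome, start, end) tuples; these layouts, the function name, and the toy coordinates are illustrative assumptions and not the formats of the published data.

```python
from collections import defaultdict

def map_eqtls_to_peaks(eqtls, peaks, fdr_cutoff=0.05):
    """Count significant eQTLs falling inside each peak.

    eqtls: iterable of (chrom, pos, fdr); peaks: iterable of (chrom, start, end).
    Both layouts are hypothetical; intervals are treated as closed [start, end].
    """
    by_chrom = defaultdict(list)
    for chrom, pos, fdr in eqtls:
        if fdr < fdr_cutoff:  # keep significant eQTLs only
            by_chrom[chrom].append(pos)
    counts = {}
    for chrom, start, end in peaks:
        counts[(chrom, start, end)] = sum(start <= p <= end for p in by_chrom[chrom])
    return counts

# Toy data, made up for illustration only
eqtls = [("chr1", 150, 0.01), ("chr1", 180, 0.20), ("chr1", 950, 0.03), ("chr2", 150, 0.001)]
peaks = [("chr1", 100, 200), ("chr1", 900, 1000), ("chr2", 500, 600)]
counts = map_eqtls_to_peaks(eqtls, peaks)
```

Per-peak counts of this kind could then be summed separately over peaks bound by key-TF pairs and by non-key-TF pairs for the comparison in step c).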

ChIP-seq enrichment analysis

We performed LiftOver analysis to convert genome coordinates between the human hg38 and rat rn5 assemblies using the UCSC Genome Browser75.

  1. a)

    Genome coordinates for human hg38 assembly were converted to the rn5 rat genome coordinates for human (hg19) assembly.

  2. b)

    Overlapping genome coordinates between conserved (from hg38 to rn5) assembly and the regulatory regions in the GRN were identified.

  3. c)

    Cooperative TF pairs in the overlapping regions were identified, along with the TGs they co-regulate.

Using the results from the LiftOver analysis, we searched for signals at co-enriched binding sites for cooperative key TF pairs in rat oligodendrocyte ChIP-seq data. Heatmaps were created with EAseq76. ChIP-seq tracks were visualized using the UCSC Genome Browser. Previous ChIP-seq datasets for SOX10, OLIG2, and NKX2.2 are available at GEO accession numbers GSE64703, GSE42447, and GSM1906296.

Boolean cooperativity of TF pairs

We applied a logic-circuit model to characterize the Boolean cooperativity of TFs using Loregic52. Loregic is a computational tool that integrates gene expression and regulatory network data to characterize the cooperativity of regulatory factors. It uses the 16 possible two-input-one-output logic gates (e.g., AND) to describe triplets of two factors regulating a common target. The GRN, including co-binding TF-TG links, was used as input. Then, we binarized the gene expression levels to Boolean values 1 and 0, representing high and low gene expression, respectively, using BoolNet77. BoolNet assigns Boolean values to expression data on the basis of modular co-expression patterns by K-means clustering across the input samples and therefore accounts for differences in the dynamic ranges of expression among genes in the input data. The triplet gene expression data were extracted and matched to all possible logic gates. We selected consistent logic gates and ran 100 permutation tests to find significant logic gates.
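To illustrate the gate-matching idea, the sketch below enumerates the 16 two-input-one-output truth tables and scores each one by the fraction of samples in which it reproduces the binarized TG. It is a simplified stand-in and not Loregic's actual scoring or permutation procedure; the function name and the toy expression vectors are assumptions.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
# Each gate is a truth table: one output bit per input combination.
GATES = [dict(zip(INPUTS, outputs)) for outputs in product((0, 1), repeat=4)]

def gate_consistency(tf1, tf2, tg):
    """Fraction of samples whose binarized TG matches each gate's output."""
    n = len(tg)
    scores = []
    for gate in GATES:
        hits = sum(gate[(a, b)] == y for a, b, y in zip(tf1, tf2, tg))
        scores.append(hits / n)
    return scores

# Toy binarized expression: the TG is high only when both TFs are high.
tf1 = [0, 0, 1, 1, 1, 0]
tf2 = [0, 1, 0, 1, 1, 1]
tg = [a & b for a, b in zip(tf1, tf2)]
scores = gate_consistency(tf1, tf2, tg)
best = GATES[scores.index(max(scores))]  # truth table of the best-matching gate
```

In this toy triplet all four input combinations are observed, so the AND truth table is the only gate that matches every sample.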

Validation of the prediction performance

Model prediction validation

To verify the performance of the deep learning model architecture, we trained a deep learning model to predict one TG, MBP, using another dataset40. The trained model was used to predict the expression level of MBP, and the results were compared with those of the MBP model trained on the main data.

Using Multi-omics scRNA-seq data21 from the same cells as the scATAC-seq data in the main analysis, we trained deep learning models and computed SI scores, following the same procedure used in the coTF-reg pipeline for the main scRNA-seq data (‘Step 3: Measure cooperativity of co-binding transcription factors’).

Model performance was evaluated using the SEA-AD22 holdout data. We also included three more publicly available scRNA-seq datasets, Multi-omics21, ROSMAP40, and Cross-disorder53, and validated the prediction performance of our model for each TG. Here, TGs were predicted using the trained models and the entire datasets. Normalized root mean squared error (NRMSE) was used to compare performance across datasets.
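As an illustration of the metric, the following sketch computes an NRMSE for one TG. The normalization constant is not specified in the text; dividing the RMSE by the range of the observed values is an assumption made here (other common choices are the mean or the standard deviation), and the values are toy numbers.

```python
import math

def nrmse(y_true, y_pred):
    """RMSE divided by the range of the observed values (an assumed convention)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    value_range = max(y_true) - min(y_true)
    if value_range == 0:
        raise ValueError("observed values are constant; range is zero")
    return math.sqrt(mse) / value_range

y_true = [1.0, 2.0, 3.0, 5.0]  # toy observed TG expression
y_pred = [1.0, 2.0, 4.0, 5.0]  # toy predicted TG expression
score = nrmse(y_true, y_pred)
```

Because the error is scaled by the spread of each dataset, scores of this kind can be compared across datasets with different expression ranges.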

Ablation study

It is important to assess whether the 206 co-binding TFs effectively predict their TGs. Another dataset with 206 random TFs that are neither co-binding nor cooperative was generated to evaluate the prediction performance of our models. We used our trained models to predict holdout data for random TFs and compared their prediction performance to that of 206 co-binding TFs.

Single-cell ATAC-seq data

Chromatin accessibility data21 was used for the main analyses. Brain samples were selected and eight thousand nuclei from each sample were subjected to the Chromium Next GEM Single-Cell Multiome ATAC-seq. We filtered oligodendrocyte-specific peak-gene links for our analyses. 930 peaks and 606 genes were initially chosen.

Single-cell RNA-seq data

SEA-AD (Main analysis)

The data for the whole taxonomy collected from dorsolateral prefrontal cortex (1,395,601 cells) were downloaded through the Open Data Registry on AWS as AnnData objects (h5ad format)22. The cells for disease were excluded and only the controls were retained. Then, we projected metacells for the whole taxonomy and found 2004 metacells and 17,946 genes for oligodendrocytes.

Multi-omics

The normalized and quality-controlled data were obtained from the CELLxGENE (RRID:SCR_021059) portal. Brain samples were selected and eight thousand nuclei from each sample were subjected to the Gene Expression protocol (10x Genomics). We selected 5459 oligodendrocyte cells.

ROSMAP

The processed count matrix for oligodendrocyte was downloaded from a supplementary website for ‘Single-cell atlas reveals correlates of high cognitive function, dementia, and resilience to Alzheimer’s disease pathology’40. We projected metacells for the controls only and found 7072 metacells and 16,707 genes.

Cross-disorder

Post quality control filtered data were obtained from the CELLxGENE portal. We projected metacells for control oligodendrocytes and found 1004 metacells and 21,248 genes.

Uniform manifold approximation and projection for dimension reduction

We obtained scRNA-seq data for the whole taxonomy collected from dorsolateral prefrontal cortex through the Open Data Registry on AWS as AnnData objects (h5ad format)22. There were 1,395,601 cells across 18 sub-cell types. A total of 18,431 hg38 protein-coding genes, obtained via BioMart78, were selected from 36,517 genes. We normalized the data to a depth of 10,000 and log1p-transformed it using Scanpy79 in Python. Then, the highly variable genes (HVGs) were identified using a dispersion-based method80, in which the normalized dispersion is obtained by scaling with the mean and standard deviation of the dispersions for genes falling into a given bin of mean expression. The cutoffs were a minimum of 0.0125 and a maximum of 3 for mean expression, and a minimum of 0.5 for normalized dispersion. We identified 3032 HVGs, scaled each gene to unit variance, and clipped values exceeding a standard deviation of 10. To reduce the dimensionality of the data, we ran principal component analysis and used the top 30 PCs to compute the neighborhood graph of the cells. Finally, we embedded the neighborhood graph with 20 neighbors in two dimensions using Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP)81.
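The depth normalization and log transform at the start of this workflow can be written out for a single cell as follows. This is a minimal sketch, assuming the log transform is log(1 + x), as in Scanpy's `log1p`; the function name and the toy counts are illustrative.

```python
import math

def normalize_log1p(counts, target_sum=10_000):
    """Scale one cell's counts to a fixed total, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]  # an empty cell stays all-zero
    return [math.log1p(c * target_sum / total) for c in counts]

cell = [0, 5, 15, 80]  # toy raw counts for four genes in one cell
normalized = normalize_log1p(cell)
```

Zero counts remain zero after the transform, which is one reason log(1 + x) is commonly preferred over a plain logarithm for sparse count data.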

Differential expression testing

We used metacells for all cell types as input to identify oligodendrocyte-specific genes using Seurat v4 in R. We used the Poisson likelihood ratio test in the FindMarkers function, assuming that gene expression follows the negative binomial distribution. Oligodendrocytes, oligodendrocyte precursor cells (OPCs), and astrocytes, which are major glial cell types in the CNS, were grouped and compared with the other fifteen cell types. We used a cutoff of FDR < 0.05 to select differentially expressed genes in the oligodendrocyte group.

Position Frequency Matrices

Position frequency matrices (PFMs) for the 949 motifs in JASPAR2022 were used to infer TF binding sites. We added PFMs for MYRF, SP7, and OLIG2, which are among the key TFs, from another study82, the Mus musculus collection in JASPAR202266, and HOCOMOCO v1283, respectively. We also included shorter motifs for other key TFs, such as SOX10, MYRF, ZNF24, NKX2.2, and SP7, considering their importance in oligodendrocytes (Supplementary Fig. 3).

Co-enrichment analysis

We used a hypergeometric test to assess whether the number of overlaps between the binding sites of two TFs is larger than expected by chance. Specifically, given a random variable \(X\) that follows a hypergeometric distribution, the probability of observing \(k\) or more overlapping binding sites between two TFs is

$$\Pr \left(X\ge k\mid n;N;m\right)={\sum }_{x=k}^{\min (n,m)}\frac{\left(\begin{array}{c}m\\ x\end{array}\right)\left(\begin{array}{c}N-m\\ n-x\end{array}\right)}{\left(\begin{array}{c}N\\ n\end{array}\right)}$$
(1)

where \(N\) is the total number of TF binding sites for all TFs, \(m\) is the number of binding sites for TF1, \(n\) is the number of binding sites for TF2, \(k\) is the observed number of overlapping (co-occurrence) binding sites between TF1 and TF2, and \(x\) runs over the possible overlap counts from \(k\) to \(\min (n,m)\). We applied an FDR-adjusted p-value cutoff (<0.1) to all possible TF pairs to choose co-binding TF pairs.
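The tail probability in Eq. (1) and the FDR cutoff can be sketched with the standard library alone. The FDR procedure is not named in the text; the Benjamini-Hochberg adjustment used here is an assumption, and the function names and toy numbers are illustrative.

```python
from math import comb

def hypergeom_tail(k, N, m, n):
    """P(X >= k) for X ~ Hypergeometric(N total sites, m for TF1, n for TF2)."""
    upper = min(n, m)
    total = comb(N, n)
    return sum(comb(m, x) * comb(N - m, n - x) for x in range(k, upper + 1)) / total

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    adjusted = [0.0] * len(pvalues)
    running_min = 1.0
    for rank in range(len(order), 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * len(pvalues) / rank)
        adjusted[idx] = running_min
    return adjusted

# Toy example: 20 sites in total, 5 per TF, 3 observed overlaps
p = hypergeom_tail(k=3, N=20, m=5, n=5)
# Toy p-values for four TF pairs; pairs with adjusted p < 0.1 count as co-binding
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
co_binding = [q < 0.1 for q in adjusted]
```

Exact integer binomial coefficients avoid floating-point underflow in the individual terms; only the final ratio is converted to a float.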

Key transcription factors

We defined ten key TFs that are oligodendrocyte marker genes based on mouse loss-of-function studies showing that specific TFs are critical for oligodendrocyte differentiation: SOX1029, SOX230,31, SOX832, MYRF33, OLIG134, OLIG235, TCF7L236,37, ZNF2425, NKX2.238, and NKX6.239. The ten ‘oligodendrocyte-specific key TFs’ are oligodendrocyte differentially expressed TFs and key TFs, the eighty-three ‘oligodendrocyte-specific non-key TFs’ are oligodendrocyte differentially expressed TFs but not key TFs, and the one hundred thirteen ‘non-oligodendrocyte-specific TFs’ are neither oligodendrocyte differentially expressed TFs nor key TFs. In particular, the ten ‘oligodendrocyte-specific key TFs’ play crucial roles in the development and differentiation of oligodendrocytes. They regulate various stages of oligodendrocyte maturation and promote the expression of myelin genes; essentially, they are key players in the process of myelination within the CNS.

Deep learning models

We used expression levels of TFs that have co-binding pairs as input to the deep neural network (DNN) models to predict TG expression levels. The DNN models used 2004 metacells (samples), 206 TFs (features), and one TG expression level (label). A separate DNN was built for each TG. The mean squared error (MSE) between predicted and actual TG expression was used as the loss function. We applied 5-fold cross-validation to the training dataset (80% of the input samples) and validated the best trained model on the remaining 20% hold-out validation dataset to make the best use of the data and achieve reliable model performance. We used early stopping with a patience of 10 to determine the number of epochs and set the batch size to 32. The Adam optimizer with a learning rate of 0.001 was used for training. The structure of our neural network model can be written as

$${Z}_{i}=f\left({W}_{i}\cdot X+{b}_{i}\right)$$
(2)

where \(X\) denotes the input data and \(f\) represents the activation function, specifically the LeakyReLU function. TF expression levels serve as the input data, while \({Z}_{i}\) represents the output of the \(i\)th hidden layer. The final output of the model is the predicted TG expression level, and \({W}_{i}\) and \({b}_{i}\) are the weight matrix and bias vector for the \(i\)th layer, respectively.

To evaluate the performance of our neural network model, we use the mean squared error (MSE) loss function. The MSE quantifies the average squared difference between the predicted outputs of the model, \(Z\), and the true labels in our regression task. Mathematically, we can express the MSE as follows:

$${MSE}=\quad \frac{1}{N}{\sum }_{i=1}^{N}{({Z}_{i}-{Y}_{{{true}}_{i}})}^{2},$$
(3)

where \({Z}_{i}\) represents the predicted output for the ith sample, and \({Y}_{{{true}}_{i}}\) denotes the true label corresponding to the ith sample.
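Eqs. (2) and (3) can be illustrated with a small forward pass written in plain Python. This is a sketch under stated assumptions: each hidden layer feeds its output to the next layer in the usual feed-forward manner, the output layer is linear, the LeakyReLU slope is 0.01 (the PyTorch default), and the layer sizes and weights are toy values rather than those of the trained models.

```python
def leaky_relu(v, slope=0.01):
    """Element-wise LeakyReLU; the slope value is an assumed default."""
    return [x if x > 0 else slope * x for x in v]

def dense(W, b, x):
    """Compute W . x + b for a weight matrix W (list of rows) and bias b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def forward(layers, x):
    """Hidden layers apply f(W . input + b); the last layer is linear."""
    *hidden, (W_out, b_out) = layers
    for W, b in hidden:
        x = leaky_relu(dense(W, b, x))
    return dense(W_out, b_out, x)[0]  # single predicted TG expression level

def mse(pred, true):
    """Mean squared error between predictions and labels (Eq. 3)."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

# Toy network: 2 TF inputs -> 2 hidden units -> 1 output
layers = [([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]), ([[1.0, 1.0]], [0.0])]
prediction = forward(layers, [1.0, 2.0])
loss = mse([1.0, 3.0], [2.0, 1.0])
```

In the study the models were built with PyTorch; the hand-written version above only makes the arithmetic of a single prediction explicit.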

Shapley interaction scores

We denote the set of all TFs by \(F\), a feature by \(i\in F\), and a feature set by \(S\subseteq F\). We define the interaction effect between TFs \(i\) and \(j\), given feature set \(S\), of a neural network \(f\) at a data point \({X}_{k}\) to be

$${\delta }_{ij}^{f}\left({X}_{k};S\right)=f\left({X}_{k};S\cup \{i,j\}\right)-f\left({X}_{k};S\cup \{i\}\right)-f\left({X}_{k};S\cup \{j\}\right)+f\left({X}_{k};S\right),$$
(4)

where \(f({X}_{k};S)\) is the prediction at \({X}_{k}\) when only TFs in \(S\) are used, which often requires retraining the NN multiple times. A common approximation is to replace the absent features (i.e., \(F\backslash S\)) by the corresponding values in a baseline \({C}_{F\backslash S}\), such that

$$f\left({X}_{k};S\right)\approx f\left({X}_{k,S};{C}_{F\backslash S}\right)$$
(5)

The baseline is set as the empirical mean of each feature. The Shapley interaction score \({{SI}}_{{ij}}^{f}({X}_{k})\) is the expectation of \({\delta }_{{ij}}^{f}({X}_{k}{;S})\),

$${{SI}}_{{ij}}^{f}({X}_{k})={E}_{p\left(S\right)}\left[{\delta } \, _{{ij}}^{f}\left({X}_{k}{;S}\right)\right],$$
(6)

over a feature set \(S\) chosen uniformly at random from \(F\). We use a Monte Carlo procedure84 to approximate \({{SI}}_{{ij}}^{f}({X}_{k})\) with a small number of samples of \(S\). To aggregate the local interaction effects at different data points into a global interaction effect, we take the expectation of \(\left|{{SI}}_{{ij}}^{f}({X}_{k})\right|\) with respect to the empirical data distribution \(p(X)\), such that

$${{SI}}_{{ij}}^{f}={E}_{p\left(X\right)}\left[\left|{{SI}}_{{ij}}^{f}\left(X\right)\right|\right]$$
(7)
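The Monte Carlo estimate described in Eqs. (4) to (7) can be sketched as follows. This is an illustrative implementation under stated assumptions: \(S\) is drawn from the features other than \(i\) and \(j\) by including each with probability 1/2 (one way to sample a subset uniformly), absent features are replaced by the empirical feature means, and the model is a toy function rather than a trained DNN. All names are hypothetical.

```python
import random

def shapley_interaction(f, x, baseline, i, j, n_samples=200, seed=0):
    """Monte Carlo estimate of the interaction between features i and j at x.

    Absent features are replaced by their baseline values (Eq. 5).
    """
    rng = random.Random(seed)
    others = [k for k in range(len(x)) if k not in (i, j)]

    def predict(present):
        masked = [x[k] if k in present else baseline[k] for k in range(len(x))]
        return f(masked)

    total = 0.0
    for _ in range(n_samples):
        # Draw S by including each remaining feature with probability 1/2.
        S = {k for k in others if rng.random() < 0.5}
        total += (predict(S | {i, j}) - predict(S | {i})
                  - predict(S | {j}) + predict(S))
    return total / n_samples

def global_interaction(f, data, baseline, i, j, **kwargs):
    """Average |SI_ij(x)| over the data points (Eq. 7)."""
    return sum(abs(shapley_interaction(f, x, baseline, i, j, **kwargs))
               for x in data) / len(data)

# Toy model with one multiplicative interaction between features 0 and 1
toy_model = lambda v: v[0] * v[1] + v[2]
data = [[2.0, 3.0, 5.0], [0.0, -1.0, 1.0]]
baseline = [sum(col) / len(data) for col in zip(*data)]  # empirical feature means
si_01 = global_interaction(toy_model, data, baseline, 0, 1)
si_02 = global_interaction(toy_model, data, baseline, 0, 2)
```

For this toy model the interaction between features 0 and 1 reduces to the product of their deviations from the baseline, while the additive feature 2 has no interaction with feature 0.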

For our deep ensemble of deep learning models, we utilize a posterior distribution of functions \(q( \, f)\) induced by the ensemble distribution of the weights \(q(w)\), as outlined in Eq. (2). This ensemble approach involves training multiple instances of the model, each initialized with different random weights to promote diverse learning paths.

The weights \(w\) are drawn from a Gaussian prior, reflecting our initial uncertainty about their values. After training, we apply Bayesian inference techniques to update our beliefs about these weights and compute the posterior distribution \(q(w)\). This posterior captures the uncertainty in the model parameters, providing a more comprehensive understanding of the model’s behavior.

The function \(q( \, f)\) represents the expected output of the model across this ensemble of weights. To compute the interaction score, we take the expectation of the interaction score \({{SI}}_{{ij}}\) with respect to \(q(f)\). This is estimated by averaging \({N}_{f}\) samples drawn from the ensemble:

$${{SI}}_{{ij}}={E}_{q\left(f\right)}\left[{{SI}}_{{ij}}^{f}\right]\approx \frac{1}{{N}_{f}}{\sum }_{k=1}^{{N}_{f}}{{SI}}_{{ij}}^{{f}_{k}}.$$
(8)

We computed Shapley interaction scores23,24 for the co-binding TF pairs, TF \(i\) and TF \(j\), using the trained DNN models and validation datasets. We calculated mean values for co-binding TF pairs using the interaction matrices, ranked them by percentile, and scaled them to between 0 and 1 for easier interpretation.

Coefficient of variation

The coefficient of variation (CV) is a statistical measure of the dispersion of data points in a data series around the mean. It is useful for comparing the degree of variation between data series, even if their means differ drastically. The CV is defined as the ratio of the standard deviation \(\sigma\) to the mean \(\mu\):

$${CV}=\quad \frac{\sigma }{\mu }$$
(9)
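The CV filter applied to the interaction scores can be sketched as below. The use of the population standard deviation, the dictionary layout, and the placeholder pair names and scores are assumptions for illustration; the 0.5 cutoff follows the pipeline description.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (Eq. 9)."""
    mu = mean(values)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / mu

def filter_stable_pairs(scores_by_pair, max_cv=0.5):
    """Keep TF pairs whose interaction scores have CV <= max_cv."""
    return {pair: s for pair, s in scores_by_pair.items()
            if coefficient_of_variation(s) <= max_cv}

# Toy interaction scores for two placeholder TF pairs
scores_by_pair = {
    ("TF_A", "TF_B"): [0.8, 1.0, 1.2],  # low relative spread, kept
    ("TF_C", "TF_D"): [0.1, 1.0, 2.5],  # high relative spread, removed
}
kept = filter_stable_pairs(scores_by_pair)
```

Dividing by the mean makes the spread comparable between pairs whose average interaction scores differ in magnitude.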

Hierarchy analysis

We computed connectivity statistics, out-degree (O) and in-degree (I), for individual TFs to get a ‘hierarchy height’ metric (h), a normalized value of the difference between O and I for each TF. The \(h\) is calculated as

$$h=\quad \frac{O-I}{O+I}$$
(10)

We defined TFs as top-regulator (h > 0.33), middle-regulator (−0.33 < h < 0.33), and bottom-regulator (h < −0.33) by their h values.
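As a worked illustration of Eq. (10) and the three categories, the sketch below derives degrees from a directed edge list and classifies each TF. The edge-list representation, the function names, the placeholder TF names, and the handling of boundary values (assigned to the middle category) are assumptions.

```python
def degrees(edges):
    """Out- and in-degree per node from directed (regulator, target) edges."""
    out_deg, in_deg = {}, {}
    for src, dst in edges:
        out_deg[src] = out_deg.get(src, 0) + 1
        in_deg[dst] = in_deg.get(dst, 0) + 1
    nodes = set(out_deg) | set(in_deg)
    return {n: (out_deg.get(n, 0), in_deg.get(n, 0)) for n in nodes}

def hierarchy_height(out_degree, in_degree):
    """h = (O - I) / (O + I), Eq. 10."""
    total = out_degree + in_degree
    if total == 0:
        raise ValueError("h is undefined for a TF with no edges")
    return (out_degree - in_degree) / total

def classify_regulator(h, cutoff=0.33):
    if h > cutoff:
        return "top-regulator"
    if h < -cutoff:
        return "bottom-regulator"
    return "middle-regulator"  # boundary values fall here by assumption

# Toy network with placeholder TF names
edges = [("TF_A", "TF_B"), ("TF_A", "TF_C"), ("TF_B", "TF_C")]
labels = {tf: classify_regulator(hierarchy_height(o, i))
          for tf, (o, i) in degrees(edges).items()}
```

A TF that only regulates others has h = 1, a TF that is only regulated has h = −1, and a TF with equal in- and out-degree has h = 0.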

Statistics and reproducibility

Data manipulation and analyses were performed using Python 3.10.14 and R 4.3.1. All relevant information, including the sample sizes of the groups used in statistical tests, is included in the figure legends. The plots in this study were generated using Scanpy79 (v1.10.3) and seaborn (v0.13.2) in Python and ggplot2 (v3.5.1) in R.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.