Abstract
The CCCTC-binding factor (CTCF) is pivotal in orchestrating diverse biological functions across the human genome, yet the mechanisms driving its cell type-active DNA binding affinity remain underexplored. Here, we collected ChIP-seq data from 67 cell lines in ENCODE, constructed a unique dataset of cell type-active CTCF binding sites (CBSs), and trained convolutional neural networks (CNNs) to dissect the patterns of CTCF binding activity. Our analysis reveals that transcription factors RAD21/SMC3 and chromatin accessibility are more predictive than sequence motifs and histone modifications. Integrating them achieved AUPRC values consistently above 0.868, highlighting their utility in deciphering CTCF transcription factor binding dynamics. This study provides a deeper understanding of the regulatory functions of CTCF via a machine learning framework.
Introduction
CCCTC-binding factor (CTCF) is a versatile zinc finger protein that plays crucial roles in various genomic contexts and biological processes, including transcriptional regulation, genomic imprinting, and chromatin interactions. CTCF functions both as a transcriptional activator and repressor, depending on the specific genomic context. For instance, Vostrov et al. demonstrated CTCF’s role as a transcriptional activator in the APP gene promoter1, while Filippova et al. showed its repressive function in the human C-MYC gene2. Additionally, CTCF is integral to genome organization, particularly in the formation of topologically associated domains (TADs). According to the loop extrusion model, cohesin facilitates the folding of the genome into loops anchored by CTCF, which maintains TAD boundaries3. Barrington et al. and Guo et al. highlighted that CTCF binding sites are often located at these TAD boundaries, emphasizing their role in higher-order chromatin structure4,5. Furthermore, CTCF influences alternative splicing through mechanisms involving transcriptional elongation, DNA methylation, and chromatin architecture, underscoring its multifaceted regulatory functions6. These diverse roles of CTCF in gene regulation and chromatin organization underscore the importance of understanding its binding mechanisms.
The identification of CTCF binding sites (CBSs) is primarily achieved through chromatin immunoprecipitation followed by sequencing (ChIP-seq), a technique that provides high-resolution, genome-wide localization of transcription factors (TFs)7. ChIP-seq analyses have uncovered numerous CBSs across the human genome; approximately 15,000 potential CBSs within human conserved noncoding elements have been identified8. Using ChIP-chip, Kim et al. identified 13,804 CBSs in IMR90 human fibroblasts, and further research in IMR90 and U937 revealed consistent CBS localization across cell types9. Although CTCF binding patterns are largely invariant across cell types, variations exist in regulatory mechanisms at specific loci. Therefore, a comprehensive and systematic understanding of CTCF binding mechanisms remains elusive, in particular of how CTCF selectively binds to specific genomic sites in diverse cellular contexts.
Several factors influence CTCF binding, including DNA sequence, epigenetic modifications, and co-factors. Mutations within CBSs can alter CTCF binding affinity, potentially leading to aberrant gene expression and disease outcomes. For instance, Katainen et al. found that mutations in CBSs are prevalent in various cancer types, suggesting that such alterations can drive tumorigenesis by disrupting normal gene regulation10. DNA methylation is another critical factor affecting CTCF binding. Fang et al. identified cancer-specific CTCF binding patterns across six cancer types, revealing that changes in DNA methylation can hinder CTCF binding at certain loci11. Moreover, the functional diversity of CTCF binding motifs is influenced by epigenetic features particularly at the 12th position12. Cohesin, a key mediator of chromosomal contacts during mitosis, co-localizes with CTCF, underscoring its role in chromosome pairing and stability. The cohesin complex, composed of SMC1A (structural maintenance of chromosome protein 1A), SMC3 (structural maintenance of chromosome protein 3), RAD21/SCC1 (sister-chromatid cohesion protein 1), and SCC3 (sister-chromatid cohesion protein 3), collaborates with CTCF to facilitate chromosomal interactions. These findings suggest that a combination of sequence and epigenetic information is necessary to accurately identify cell type-active CTCF binding sites.
Numerous computational approaches have been developed to predict TF binding sites (TFBSs), employing methods that integrate chromatin accessibility, motif scores, and other genomic features13,14,15. For instance, Chen et al. developed Mocap, which combines chromatin accessibility, motif scores, TF footprints, and other factors for the classification of TF cell type-active binding16. Li et al. introduced Anchor, an algorithm that improves in vivo TFBS prediction by considering both nucleotide position interactions with neighboring nucleotides and nonlinear interactions between TF motifs17. Similarly, Deng et al. presented DeepD2V, a hybrid deep learning framework integrating convolutional and recurrent neural networks to predict TFBSs18, while Qin et al. developed Lisa to predict transcriptional regulators (TRs) using public chromatin accessibility and ChIP-seq data19. Despite significant advancements in computational methods for predicting TFBSs, most existing methods are designed to accommodate a wide range of TFs. These generalized models often fail to accurately decipher the precise binding modes of individual TFs like CTCF. The expansive data generated by the ENCODE project provide a robust foundation for developing a specialized prediction model. Such a model would leverage the extensive CTCF-active data to enhance the accuracy of binding site predictions and provide deeper insights into the diverse regulatory functions of CTCF across different cell types.
In this paper, we tackle this challenge by focusing on the identification of cell type-active CBSs in the human genome. Based on ChIP-seq data across 67 different cell lines provided by the ENCODE project, we constructed datasets that represent both cell type-active CBSs and non-binding sites for each cell line. By integrating DNase-seq signals, 12 histone modifications (HMs), and the binding profiles of RAD21 and SMC3, we employed convolutional neural networks (CNNs) to identify cell type-active CBSs in different cell lines. The results showed that DNase-seq signals and the binding of RAD21 and SMC3 contribute more to predicting cell type-active CBSs, while sequence motifs and histone modification signals appear to contribute less to the recognition of these sites. This evidence will enhance our understanding of the diverse regulatory functions of CTCF.
Results
Overview of the cell type-active CBSs prediction model
We propose a prediction model that integrates epigenetic features to determine cell type-active CTCF binding in three steps: (i) For each cell line, we constructed positive and negative sets of cell type-active CBSs. The positive set included CBSs supported by ChIP-seq data, while the negative set comprised the remaining CBSs from the Complete_peak dataset, excluding those overlapping with the positive set (Fig. 1A). (ii) We collected features for predicting CTCF binding sites, categorized into single and combined features. Single features included chromatin accessibility (DNase I hypersensitivity), histone modifications, RAD21 binding, and SMC3 binding data. Combined features integrated chromatin accessibility with histone modifications, RAD21, and SMC3 data (Fig. 1B). (iii) We derived motif and epigenetic information across 15 windows to create signal matrices for classification. Using convolutional neural networks (CNNs), we trained the models to identify cell type-active CBSs across several cell lines. The models were trained on the constructed signal matrices and evaluated by the area under the precision-recall curve (AUPRC) (Fig. 1C).
The overview of cell type-active CBSs prediction model. (A) The construction of the positive and negative sets of CTCF cell type-active binding site. (B) The feature extraction process, detailing both single and combined features used for model training. (C) The classifier is constructed based on motif and epigenetic signals applying the CNN methodology.
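To make the evaluation metric concrete, the following minimal sketch computes AUPRC as step-wise average precision from binary labels and predicted scores. The function name `average_precision`, the choice of the step-wise estimator, and the assumption that scores are untied are illustrative; the text does not state which AUPRC estimator was used.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    labels: iterable of 0/1 (1 = active CBS), scores: predicted scores.
    Assumption: scores are untied; ties are broken by input order.
    """
    labels = list(labels)
    scores = list(scores)
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank sites from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_pos = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            # Each recovered positive raises recall by 1/n_pos;
            # weight that step by the precision at this rank.
            area += (true_pos / rank) / n_pos
    return area
```

A classifier that ranks every active site above every inactive site scores 1.0 under this definition, whereas a random ranking scores close to the fraction of positives in the set.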
Chromatin accessibility is predictive for CBS
Techniques such as DNase-seq and ATAC-seq provide insight into open chromatin regions. These regions are accessible to TFs and other regulatory proteins. For the 59 cell lines with DNase-seq data, we analyzed the distribution of chromatin accessibility signals using bwtool20. Figure 2A depicts the distribution patterns of 21 representative cell lines, while the distributions of the remaining 38 are provided in the Supplementary file (Additional file 2, Fig. S1). Our findings from the 59 cell lines reveal that the DNase I signal is significantly enriched in active CBSs of the positive set compared to inactive CBSs of the negative set. Notably, within the core binding zone, a signal peak is observed at the center, flanked by a gradual decline in read depth towards the edges. In contrast, the distribution in HA-sp follows a unique pattern, with diminished signals at the midpoint and elevated signals in the two regions adjacent to the center. Specifically, the DNase I signal peaks within a 100 bp range, suggesting that the principal locus of signal enrichment is confined to 100 bp within the CBSs.
Chromatin accessibility signals are predictive for CBSs. (A) Chromatin accessibility profiles (DNase-seq) in 21 cell lines, showing the distribution of DNase I hypersensitivity signals around CTCF binding sites (CBSs). (B) AUPRC values for predicting CBSs using DNase I signal features across 59 cell lines. The bar graph displays the performance of the predictive model, with the black line indicating the proportion of CTCF binding sites located within open chromatin regions.
We then used DNase-seq data to compute the chromatin accessibility signal coverages across fifteen windows within each potential CBS region, employing the Multicov utility of Bedtools21. This approach generated a feature matrix of dimension 15 × 20. We subsequently employed these matrices as input for CNN to predict cell type-active CBSs across different cell lines.
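The construction of such a matrix can be sketched as follows. This is a minimal illustration, not the pipeline used here: it assumes that the 15 × 20 layout corresponds to 15 windows of 20 equal-width bins each, and it takes a per-base depth vector as a stand-in for the read counts reported by Bedtools Multicov. The function name `signal_matrix` and both assumptions are hypothetical.

```python
def signal_matrix(coverage, n_windows=15, n_bins=20):
    """Summarise per-base coverage of one CBS region as a windows x bins matrix.

    coverage: list of per-base read depths across the region.
    Assumption: the region is cut into n_windows * n_bins consecutive cells
    of (nearly) equal width, and each cell stores its mean depth.
    """
    cells = n_windows * n_bins
    length = len(coverage)
    if length < cells:
        raise ValueError("region is shorter than the number of cells")
    matrix = []
    for w in range(n_windows):
        row = []
        for b in range(n_bins):
            k = w * n_bins + b
            # Integer arithmetic keeps cell boundaries contiguous and non-empty.
            start = k * length // cells
            end = (k + 1) * length // cells
            chunk = coverage[start:end]
            row.append(sum(chunk) / len(chunk))
        matrix.append(row)
    return matrix
```

Each region then yields one fixed-size input regardless of peak length, which is what a CNN with a fixed input shape requires.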
The predictive capability of our models, assessed over 59 cell lines, is depicted in Fig. 2B. Notably, the models that incorporated only DNase-seq data as input demonstrated remarkable predictive efficacy. Specifically, in 58 out of 59 cell lines, the CNN models achieved AUPRC values greater than 0.8, indicating strong predictive performance. Among these, 43 cell lines exhibited AUPRC values exceeding 0.9, with HL-60 achieving the highest AUPRC value of 0.991. These results underscore the substantial predictive capability of DNase-seq signals, affirming the crucial role of chromatin accessibility in deciphering the cell type-active binding of CTCF. Indeed, for a significant majority (98.3%) of the evaluated cell lines, DNase I proved to be an efficient signal source for identifying cell type-active CBSs (refer to Table S4 in Additional file 1 for detailed results). Further analysis of the proportion of CTCF binding sites in open chromatin regions across these cell lines (Fig. 2B) revealed a correlation: as the proportion of CTCF binding sites in open chromatin regions decreases, there is a corresponding decline in prediction accuracy. This trend highlights the potential impact of chromatin openness on the efficacy of CBS prediction.
Key HMs are predictive for CBSs
At the cellular level, TFs co-localize with HMs at regulatory elements (promoters, enhancers, or insulators) within the human genome. HMs participate in transcriptional initiation and elongation via interactions with TFs22,23. For three cell lines (HUVEC, K562 and NHEK) with 12 HM data and 10 cell lines (GM12878, H1-hESC, HeLa-S3, HepG2, HMEC, HSMM, HSMMtube, NH-A, NHDF-Ad and NHLF) with 11 HM data, we analyzed the difference in the distribution of HM signals between the positive and negative sets. For illustration, the distribution of K562’s 12 HMs in the two CBS sets is depicted in Fig. 3A and the distribution of H2AFZ across 12 cell lines in Fig. 3B. While HM signal distributions vary between the positive and negative sets, no universal trend emerges. All signal distribution results for the 12 cell lines can be found in Fig. S2 of Additional file 2. However, of all the HMs analyzed, differences in H2AFZ, H3K4me1 and H3K4me2 were prominent. In some cell lines, these three signals exhibit minor differences at the center of the CTCF binding site, but the differences are more pronounced in regions further from the center. This suggests that they may play a role in regulating CTCF binding.
Analysis of histone modification signals in CBS prediction. (A) Distribution of 12 histone modification signals in the K562 cell line. The profiles show the enrichment of histone modifications around CBSs. (B) Distribution of H2AFZ signals in 12 different cell lines. (C) AUPRC values for predicting CBSs using 12 histone modifications across 13 cell lines. The bar graph depicts the performance of the predictive model for each histone modification, demonstrating the effectiveness of different histone marks in identifying CBSs.
As with the DNase I signal, we produced a matrix with a dimension of 15 × 20 for each CBS. Using the HM matrices, we trained our CNN model to predict cell type-active CBSs in 13 cell lines. As shown in Fig. 3C, prediction accuracy varies with the HM considered. When the H2AFZ signal was used, the AUPRC reached 0.8 in 5 cell lines. However, when the signal matrices derived from other HMs were used as input, the AUPRC hovered around 0.75 for the aforementioned 13 cell lines. In general, the top three HMs in terms of predictive power are H2AFZ, H3K4me1, and H3K4me2. Our findings indicate that H2AFZ, H3K4me1, and H3K4me2 have moderate predictive capability, while other HMs appear to be weakly discriminative between cell type-active and unbound CBSs in these cell lines. This highlights the more significant role of these three HMs in determining active CBSs.
RAD21 and SMC3’s cofactor roles in CBSs prediction
We have found that CTCF consistently interacts with SMC3 and RAD21 to regulate gene expression in cell lines such as GM12878 and K56224,25. This interaction underscores the critical role of CTCF in shaping 3D chromatin structure, thereby highlighting its cellular adaptability and dynamics8,26,27. Building upon prior knowledge that CTCF collaborates with RAD21 and SMC3 to bind DNA24, we further explored this relationship by analyzing ChIP-seq data for RAD21 and SMC3 across four cell lines: GM12878, HeLa-S3, HepG2 and K562. We examined their binding differences between the positive and negative sets, as shown in Fig. 4A,B. Remarkably, like the DNase I signals, the binding signals of the two TFs are predominantly enriched at the CBS center. The divergence in the distribution of RAD21 and SMC3 between the two sets is striking: read depth is considerably higher in the positive set than in the negative set. This suggests that RAD21 and SMC3 play a pivotal role in influencing CTCF’s cell type-active binding.
Subsequently, we developed CNN models based on the RAD21 and SMC3 binding signal matrices. For 9 cell lines, the AUPRC values obtained with RAD21 ranged from 0.923 to 0.965, and the values obtained with SMC3 for 4 cell lines ranged from 0.905 to 0.964. This demonstrates their considerable predictive power and further hints at RAD21’s discriminative ability for identifying cell type-active CBSs. Additionally, we inferred that a model built on the binding signals of both RAD21 and SMC3 could enhance predictive power. Indeed, when the signal matrices derived from both TFs were used as inputs, our model’s AUPRC varied between 0.949 and 0.971, with the best predictive performance obtained in K562 (AUPRC of 0.971), as depicted in Fig. 4C. This not only confirms the leading role of RAD21 and SMC3 in predicting CTCF’s active binding sites, but also underscores the complementarity between them in the action of CTCF.
Predictive performance of motif-based model
We sourced CTCF motifs from two reputable databases, JASPAR and HOCOMOCO, as shown in Fig. 5A. Through overlap analysis of CTCF binding peaks, we identified 2,241 peaks shared across 67 cell lines. Applying the MEME tool28, we searched for motifs in sequences extracted from these shared peaks. This yielded a DNA motif, designated Com_motif, whose position weight matrix (PWM) is consistent with the PWMs of the motifs in the aforementioned databases, as illustrated in Fig. 5B.
Analysis of CTCF motif in CBS prediction. (A) The motifs sourced from JASPAR and HOCOMOCO. (B) The common motif derived from peaks shared across 67 cell lines (abbreviated as Com_motif). (C) The motifs derived from peaks unique to one cell line (abbreviated as Uni_motif). (D) The AUPRC values of prediction with the motif scores in 33 cell lines.
Given CTCF’s cell type-active binding nature, we hypothesized that some CTCF binding motifs are exclusive to particular cell lines. To test this, we searched for motifs in sequences from peaks that were unique to one cell line. Because MEME-ChIP requires a substantial sequence pool for motif searching, we narrowed our analysis to 38 cell lines containing at least 200 unique peaks. We constrained the motif length to the range of 3 to 15 bp. In instances where multiple motifs were discovered by MEME-ChIP29, the most recurrent motif was retained as that cell line’s unique motif, termed Uni_motif. Figure 5C illustrates a striking consistency in the nucleotide composition of these unique motifs across 38 cell lines. The 5’-terminal sequence of the motifs predominantly consists of cytosine (C), while the 3’ end prefers guanine (G). However, cell lines including Progfib, Ag04449, GM19239, Fibrobl, GM12892, GM12891, GM19240, and GM19238 diverged from this pattern, presenting motifs with a high level of cytosine (C) or thymine (T).
After determining motif positions using FIMO, we generated the enrichment matrices (of dimension 15 × 20) for Com_motif and Uni_motif across the 15 windows by counting the number of motifs in each bin. For the 33 cell lines with a Uni_motif, a CNN model based on motif scores was deployed. When we input the Com_motif score matrix, the AUPRC values largely settled around 0.8, while the Uni_motif score matrix led to AUPRC values of around 0.7, as shown in Fig. 5D. Interestingly, we obtained better predictions in HMF, with an AUPRC value of 0.844 (Additional file 1, Table S4). This confirmed that in most cell lines, motif information alone is insufficient to accurately identify cell type-active CBSs; however, it exhibits higher predictive power in certain individual cell lines. These findings suggest that motifs, while informative for sequence conservation, are not robust features for distinguishing binding from non-binding sites.
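The bin-counting step can be sketched as follows. This is an illustrative sketch under stated assumptions: motif hits are represented only by their start coordinates (as could be read from a FIMO result table), the region is half-open, and the 15 × 20 layout is taken to mean 15 windows of 20 bins each. The function name `motif_count_matrix` is hypothetical.

```python
def motif_count_matrix(region_start, region_end, hit_positions,
                       n_windows=15, n_bins=20):
    """Count motif hits per bin for one CBS region [region_start, region_end).

    hit_positions: genomic start coordinates of motif occurrences.
    Assumption: a hit is assigned to the single bin containing its start.
    """
    cells = n_windows * n_bins
    length = region_end - region_start
    if length < 1:
        raise ValueError("region must have positive length")
    counts = [0] * cells
    for pos in hit_positions:
        if region_start <= pos < region_end:
            # Map the offset within the region to a cell index in 0..cells-1.
            k = (pos - region_start) * cells // length
            counts[k] += 1
    # Reshape the flat list of cells into n_windows rows of n_bins columns.
    return [counts[w * n_bins:(w + 1) * n_bins] for w in range(n_windows)]
```

Hits outside the region are ignored, so the same list of genome-wide motif positions can be passed for every region on a chromosome.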
Enhancing active CBSs predictions through combined signals
Our initial analysis indicated that motif information alone has insufficient capability to discern cell type-active CBSs. Hence, our focus shifted to the integration of chromatin accessibility, HMs and the binding of RAD21 and SMC3 in the following model construction.
We applied CNN and SVM methods to the combination of DNase-seq and HM signals in 13 cell lines (Table 1). Considering the redundancy between HMs, one HM at a time was combined with the DNase-seq profile to predict cell type-active CBSs. With the CNN model, the observed AUPRC values range from 0.885 to 0.968. In contrast, the SVM model has lower predictive performance, with AUPRC values between 0.671 and 0.840. Overall, the CNN model consistently outperforms the SVM model, with AUPRC increases ranging from 0.134 to 0.408. For the CNN model, the inclusion of HM signals increased AUPRC values by approximately 0.003 in the 13 cell lines. Consistent with the limited predictive ability of HMs as single inputs shown above, integrating them into models based on DNase-seq signals did not consistently enhance prediction accuracy.
Our previous studies indicate that the specific binding of RAD21 and SMC3 is associated with the dynamic binding of CTCF under different conditions. Therefore, considering their interaction with CTCF, we combined the DNase-seq signals with the binding signals of the two TFs to delineate active CBSs across 9 distinct cell types, as shown in Table 2. Among them, only nine cell lines have RAD21 data. The predictions confirmed that adding TF binding information significantly improves prediction accuracy. The CNN model achieved AUPRC values ranging from 0.949 to 0.970, while the SVM model’s AUPRC values fell between 0.773 and 0.898. The difference in AUPRC values ranged from 0.058 to 0.116, indicating that the CNN model consistently outperformed the SVM model. Furthermore, compared with adding one TF, adding both TF signals improves prediction accuracy by about 2 percentage points. This result shows that the DNase I signal combined with the binding of RAD21 or SMC3 captures the specific behavior of CTCF under different conditions more accurately, and supports the cooperativity between the three TFs. In terms of mechanism, we inferred that the specificity of CTCF binding is affected by RAD21 or SMC3 because they regulate changes in the 3D structure of the genome. Moreover, changes in the binding affinity of CTCF further influence gene expression and regulation.
Finally, we tested whether prediction performance could be further improved by adding HM signals to the model based on the combination of DNase I and the two TFs. For the CNN and SVM models, the AUPRC values fluctuate only slightly in 4 cell lines, within a range of 0.001–0.011, as shown in Table 2. Moreover, the AUPRC value of the CNN model decreased by 0.002 when the H2AFZ signal was added in GM12878. In comparison, the fluctuation of the SVM model is smaller. These results indicate that adding HM signals does not improve the prediction of cell type-active CBSs. Therefore, for subsequent analysis and comparison, we focus on the models based on the signal matrices derived from DNase-seq and the two TFs.
Performance of CNN and XGBoost in predicting active CBSs
In the prediction of cell type-active TF binding profiles, significant advances have been demonstrated by models such as Anchor, which shared the top spot with J-Team13 in the ENCODE-DREAM challenge17. Using the XGBoost method and a total of 556 features, the Anchor pipeline set a new benchmark in predictive performance at the time. To validate the predictive ability of our approach, we compared our CNN model with the XGBoost method downloaded from the Anchor webpage (https://github.com/GuanLab/Anchor). We carried out comparisons in 9 cell line-signal pairs (Table 3). The results indicated that the XGBoost model outperforms the CNN model, with an AUPRC increase ranging from 0.001 to 0.056. The greatest AUPRC increase of 0.056 occurs in HAc, where the model takes the binding signals of SMC3 as inputs. Thus, although the XGBoost models show an advantage in these predictions, the performance gap is small, which confirms the competitive predictive ability of our CNN model and underscores its potential as a viable alternative for predicting active CBSs.
To further evaluate the robustness of our model across different cell lines, we conducted cross-cell type predictions. Specifically, we trained the model on the cell line exhibiting the best prediction and tested it on another cell line with matched data. Owing to data availability, we performed cross predictions for 15 pairs of cell lines derived from 5 groups, as shown in Table 4. The AUPRC values of our CNN models range from 0.770 to 0.973, and the variation depends on the cell line or signal type. The AUPRC values of the XGBoost models range from 0.729 to 0.973. The performance differences between the two methods further show that our CNN method has competitive predictive ability and is robust across cell lines. More significantly, among the 15 cross-cell type predictions, 5 outperform the XGBoost model, with up to a 0.041 increase in AUPRC, and one has the same AUPRC value as the XGBoost model. This shows that, compared with Anchor, our CNN model performs comparably in cross-cell type prediction, and it supports the potential robustness and versatility of our model in diverse cellular environments.
This comprehensive analysis highlights the competitive performance and robust versatility of our CNN model across diverse cellular environments. The results provide compelling evidence of our model’s potential in reliably predicting TF binding profiles, demonstrating its capacity to serve as a formidable alternative to established methods like XGBoost.
Broad applicability and generalizability of the CNN model in CBSs prediction
To rigorously assess the broad applicability and generalizability of our CNN model in predicting CBSs, we ran predictions in 8 cell lines outside the set of cell lines forming our Complete_peak set: D721Med, GM10248, GM10266, GM23338, H54, LNCaP clone FGC, IMR90 and SK-N-SH. The data were downloaded from the ENCODE project. This allows us to examine the model’s potential for predicting cell line-active CBSs in independent cell lines. For each new cell line, we selected cell lines that had matched data and optimal predictive performance to train the CNN model. Then, from the peaks in the Complete_peak set, we determined the cell type-active CBSs for the above 8 cell lines. Based on the overlap between the identified active CBSs and the peaks annotated by the ChIP-seq data, we found that around 67–79% of the peaks in 6 of the cell lines can be identified (Fig. 6). This highlights that DNase I signals account for the vast majority of cell type-active CBSs. Then, by integrating the binding information of RAD21 and SMC3 into the CNN models, a significant improvement was achieved in the other two cell lines, where about 70–84% of the peaks can be identified (Fig. 6). The analysis also showed that 0.16–11.21% of the peaks are not included in the Complete_peak set; the recognition rate is therefore underestimated (Table 5). This expanded application further emphasizes that chromatin accessibility and the interaction between CTCF, RAD21 and SMC3 play a decisive role in the cell type-active binding of CTCF. The changes in prediction accuracy also provide quantitative evidence of the contribution of different signals.
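The recognition rate discussed above can be illustrated with a short sketch: the fraction of ChIP-seq peaks in a new cell line that overlap at least one predicted active CBS. The function name `recovery_rate`, the `(chrom, start, end)` tuple representation with half-open coordinates, and the criterion that any shared base counts as an overlap are assumptions made for illustration; the overlap threshold actually used is not stated in the text.

```python
def recovery_rate(chip_peaks, predicted_cbss):
    """Fraction of ChIP-seq peaks overlapped by at least one predicted CBS.

    Both inputs are lists of (chrom, start, end) tuples, half-open.
    Assumption: sharing a single base pair counts as an overlap.
    """
    if not chip_peaks:
        raise ValueError("chip_peaks must not be empty")

    def overlaps(a, b):
        # Same chromosome and intersecting half-open intervals.
        return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

    recovered = sum(
        1 for peak in chip_peaks
        if any(overlaps(peak, cbs) for cbs in predicted_cbss)
    )
    return recovered / len(chip_peaks)
```

Peaks that are absent from the Complete_peak set can never be recovered by a model that only scores Complete_peak entries, which is why a rate computed this way is a lower bound.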
Discussion
In this study, we developed a genome-wide set of CBSs (the Complete_peak set) using ChIP-seq data from 67 cell lines in the ENCODE database, identifying 137,471 non-overlapping CTCF binding peaks. Only 11.98% to 44.27% of peaks were bound in a single cell line, highlighting the cell type-active nature of CTCF binding. By analyzing overlaps between cell type-active peaks and the Complete_peak set, we constructed positive and negative sets of CBSs for each cell line. DNase-seq, RAD21, and SMC3 emerged as key factors in regulating CTCF binding specificity, with sequence conservation variations observed in unique CBSs across cell lines.
A central goal of TFBS prediction is to robustly identify regulatory element activity under various conditions. We focused on predicting CBSs to better understand CTCF binding specificity across cell types. The DNase-seq profiles under different physiological conditions provide cell type-active information on the active binding events of CTCF, which supports an important role for chromatin accessibility in regulating TF binding. Researchers have noted that RAD21, as a subunit of cohesin, can regulate gene expression and activate targets when co-occurring with CTCF30. In addition, Gu et al. found a positive correlation between SMC3 and CTCF at the single-cell level31. This explains why the signals of RAD21 and SMC3 can effectively identify cell type-active binding sites of CTCF. However, the mechanism by which histone modifications mediate CTCF binding is still unclear32. Our study predicted cell type-active CBSs based on 12 histone modifications in 13 cell lines, with H2AFZ, H3K4me1, and H3K4me2 showing better predictive performance, suggesting their regulatory role in CTCF binding. Additionally, using MEME-ChIP and database information, we identified CBS motifs, confirming that motif scores alone are insufficient for determining regulatory element activity in different cell lines.
In previous prediction works33,34,35,36, deep learning has been widely used for its good prediction accuracy. Other prediction algorithms like random forest (RF), XGBoost, and support vector machines (SVM)37,38,39,40 are also common. Our comparison of CNN and SVM methods revealed that CNN performed better due to its ability to extract local features from various signals and regions. The performance gap between CNN and XGBoost was minimal, highlighting CNN's competitive predictive capability for active TFBSs. Cross cell type predictions were generally less accurate than within-cell type predictions41. However, by using fewer signals (DNase-seq, RAD21, SMC3), our model outperformed XGBoost in several cases, demonstrating competitive predictive ability and robustness across diverse cellular environments. AUPRC values ranged from 0.62 to 0.991 depending on the cell line or signal type. In 15 cross cell type predictions, our model outperformed XGBoost in five instances, with up to a 0.045 increase in AUPRC values, highlighting the robustness and versatility of our CNN model.
Materials and methods
Data
The experimental data on chromatin accessibility, histone modifications, and transcription factor binding sites were obtained from the ENCODE project. The human reference genome (Human Build GRCh37) was used to extract the sequence segments of CTCF binding sites. Details of the data downloads are as follows:
- The ChIP-seq files of CTCF binding sites in 67 cell lines were downloaded from the ENCODE project.
- For 59 of the 67 cell lines, the DNase-seq data in bam format were downloaded from the ENCODE project.
- For HUVEC, K562 and NHEK, data for 12 histone modifications are available. For the other 10 cell lines, GM12878, H1-hESC, HeLa-S3, HepG2, HMEC, HSMM, HSMMtube, NH-A, NHDF-Ad and NHLF, data for 11 histone modifications in bam format are available. We also downloaded these data from the ENCODE project.
- The ChIP-seq data of RAD21 in bam format for 12 cell lines, and the ChIP-seq data of RAD21 and SMC3 in bam format for 5 cell lines, were also downloaded from the ENCODE project. Details of the above epigenetic data can be found in Additional file 1, Table S1.
- Additionally, we obtained the motif information of CTCF binding sites in two ways. First, we downloaded the PWMs from the databases JASPAR42 and HOCOMOCO43. Second, using the MEME-ChIP tool, we obtained the position-frequency matrix (PFM) by motif searching in the sequences of CTCF binding sites.
Constructing dataset of CTCF cell line-active binding sites
To comprehensively understand the binding profile of CTCF across various cell lines, we began by analyzing the CTCF binding peaks in ChIP-seq data from 67 cell lines. Because peaks within a dataset can overlap, when one peak was entirely contained within another, only the longer peak was retained to avoid redundancy. If the distance between the midpoints of two adjacent peaks was less than half the length of either peak, the two peaks were considered indistinguishable and only one of them was retained. These filtering criteria ensured that only distinct and representative peaks were kept for further analysis.
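The two filtering rules can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `filter_peaks`, the tuple layout, the single left-to-right sweep, the reading of "either peak" as "at least one of the two peaks", and the choice to keep the longer of two indistinguishable peaks are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical peak representation: (chromosome, start, end), end exclusive.
Peak = Tuple[str, int, int]


def filter_peaks(peaks: List[Peak]) -> List[Peak]:
    """Greedy single-sweep filter for contained and indistinguishable peaks."""
    kept: List[Peak] = []
    # Sort by chromosome, then start; for equal starts the longer peak comes first.
    for peak in sorted(peaks, key=lambda p: (p[0], p[1], -p[2])):
        if not kept or kept[-1][0] != peak[0]:
            kept.append(peak)  # first peak on this chromosome
            continue
        prev = kept[-1]
        # Rule 1: peak starts at or after prev, so it is contained if it ends no later.
        if peak[2] <= prev[2]:
            continue
        # Rule 2: midpoints closer than half the length of at least one peak
        # (assumed interpretation) means the peaks are indistinguishable.
        len_prev, len_peak = prev[2] - prev[1], peak[2] - peak[1]
        mid_prev, mid_peak = (prev[1] + prev[2]) / 2, (peak[1] + peak[2]) / 2
        if abs(mid_peak - mid_prev) < 0.5 * max(len_prev, len_peak):
            if len_peak > len_prev:
                kept[-1] = peak  # assumed tie-break: keep the longer peak
            continue
        kept.append(peak)
    return kept
```

Because the sweep compares each peak only with the most recently kept one, it is a simplification; a production pipeline might instead iterate until no pair of peaks satisfies either rule.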
After merging and eliminating overlapping peaks across the 67 cell lines (excluding peaks on the sex chromosomes), we obtained a genome-wide collection of CBSs comprising 137,471 distinct binding peaks. On average, each cell line has about 44,331 peaks, although the number varies widely, from 2,878 in GM12801 to 60,862 in Gliobla (Additional file 1, Table S2). This variability underscores the diverse CTCF occupancy across cell types. Interestingly, 2,251 peaks were present in all 67 cell lines. Excluding GM12801, every cell line exhibited at least one unique binding peak; these peaks constitute the Uni_peak dataset (Additional file 2, Fig. S3)44.
Our comprehensive collection of peaks, termed the Complete_peak dataset, represents an exhaustive inventory of CBSs in the human genome. Using the Complete_peak dataset as a reference, we established positive and negative sets of cell type-active CBSs for each cell line (Additional file 1, Table S3).
- For each cell line, we defined the peaks in its ChIP-seq data as the positive set of cell type-active CBSs.
- The negative set for each cell line consists of the peaks that remain after removing from the Complete_peak dataset all peaks overlapping the positive set. These peaks are authenticated CBSs that are unbound in that specific cell line.
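The construction of the two sets can be sketched as follows. This is an illustrative sketch only: the function names, the tuple layout, and the definition of overlap as sharing at least one base are assumptions, and the quadratic comparison is chosen for readability rather than speed.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical peak representation: (chromosome, start, end), end exclusive.
Peak = Tuple[str, int, int]


def overlaps(a: Peak, b: Peak) -> bool:
    """True if two peaks on the same chromosome share at least one base (assumed criterion)."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def build_sets(complete_peaks: List[Peak],
               cell_line_peaks: List[Peak]) -> Tuple[List[Peak], List[Peak]]:
    """Return (positive, negative) sets of cell type-active CBSs for one cell line."""
    # Group the cell line's peaks by chromosome to limit the comparisons.
    by_chrom: Dict[str, List[Peak]] = defaultdict(list)
    for peak in cell_line_peaks:
        by_chrom[peak[0]].append(peak)

    positive = list(cell_line_peaks)
    # Negative set: Complete_peak entries that overlap no positive peak.
    negative = [
        peak for peak in complete_peaks
        if not any(overlaps(peak, other) for other in by_chrom.get(peak[0], []))
    ]
    return positive, negative
```

For genome-scale inputs, an interval tree or a sorted sweep would be a more practical replacement for the nested comparison.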
Extraction of features
We analyzed the widths of the 137,471 distinct, non-overlapping CTCF binding peaks. Widths ranged from 56 bp to 1,132 bp, with a median of 330 bp, and 80.6% of the peaks had widths within 400 bp (Additional file 2, Fig. S4A). Given this distribution, we designated a 400 bp region centered on the midpoint of each peak as the CTCF core binding zone. To capture the surrounding context, we extended each core zone with seven additional 400 bp windows on each side, upstream and downstream, stepped by 100 bp. This resulted in a total of 15 windows per CBS, together spanning 1,800 bp, for subsequent analysis (Additional file 2, Fig. S4B).
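The window layout can be made concrete with a short sketch. The function name `cbs_windows` and its parameter names are hypothetical, and the integer midpoint is an assumption; the window width, step, and count follow the description above.

```python
from typing import List, Tuple


def cbs_windows(peak_start: int, peak_end: int,
                width: int = 400, step: int = 100,
                n_flank: int = 7) -> List[Tuple[int, int]]:
    """Return the 15 overlapping windows (start, end) around one peak, end exclusive."""
    midpoint = (peak_start + peak_end) // 2      # assumed: integer midpoint
    core_start = midpoint - width // 2           # core zone is centered on the midpoint
    # k = 0 is the core zone; negative k is upstream, positive k is downstream.
    return [
        (core_start + k * step, core_start + k * step + width)
        for k in range(-n_flank, n_flank + 1)
    ]


windows = cbs_windows(10_000, 10_330)
print(len(windows), windows[7], windows[-1][1] - windows[0][0])
```

The first window starts 700 bp upstream of the core zone and the last ends 700 bp downstream of it, so the 15 windows together cover 400 + 2 × 700 = 1,800 bp. The sketch does not clip coordinates at chromosome ends.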
To improve the prediction of CBSs and account for variation across cell lines, we integrated both DNA sequence motifs and epigenetic markers into our analysis framework. We first scanned the 1,800 bp candidate region using the FIMO tool45. For a motif of length Xmotif, we broadened the scanning window to 1800 + (Xmotif − 1) bp. Positions of CTCF binding motifs were determined using FIMO's default criteria. To quantify motif density, the 1,800 bp region was divided into 90 bins of 20 bp each, and the motifs in each bin were counted. Assigning these bin counts to the 15 windows (20 bins per window) yielded a 15 × 20 motif enrichment matrix for every potential CBS. We also evaluated the distributions of epigenetic signals, including DNase-seq, histone modifications (HMs), and the binding of RAD21 and SMC3. We extracted the fifteen 400 bp windows from the target region and divided each window into 20 bins. The signal in each bin was counted using the multicov utility of BEDTools21. This generates a 15 × 20 signal matrix that represents one epigenetic marker around a potential CTCF binding site.
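The mapping from 90 bins to a 15 × 20 matrix can be sketched as below. The function names are hypothetical, and assigning each motif to a bin by its start offset within the region is an assumption; the source does not state which motif coordinate was used. Because consecutive windows are shifted by 100 bp (5 bins), neighboring rows of the matrix share 15 bins.

```python
from typing import List


def bin_counts(positions: List[int], region_len: int = 1800,
               bin_size: int = 20) -> List[int]:
    """Count motif hits per 20 bp bin; positions are offsets from the region start."""
    counts = [0] * (region_len // bin_size)      # 90 bins for an 1800 bp region
    for pos in positions:
        if 0 <= pos < region_len:                # ignore hits outside the region
            counts[pos // bin_size] += 1
    return counts


def window_matrix(bins: List[int], n_windows: int = 15,
                  bins_per_window: int = 20,
                  bins_per_step: int = 5) -> List[List[int]]:
    """Slice the 90 bin counts into 15 overlapping rows of 20 bins each."""
    return [
        bins[i * bins_per_step: i * bins_per_step + bins_per_window]
        for i in range(n_windows)
    ]


# Hypothetical motif start offsets within one 1800 bp candidate region.
matrix = window_matrix(bin_counts([0, 19, 20, 905, 1799]))
print(len(matrix), len(matrix[0]))
```

The same slicing applies to the epigenetic signals if the per-bin read counts are supplied in place of motif counts.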
CNN and SVM models
We employed two distinct machine learning frameworks, convolutional neural networks (CNN)46 and support vector machines (SVM)47, to predict CTCF binding sites. CNNs can extract information from adjacent features more accurately than other deep learning structures. We constructed the CNN model using the Keras library in Python48. The model consisted of two convolutional layers with hyperbolic tangent activation functions (act_type = “tanh”). The kernel size for both layers was 1 × 1 (kernel = [1, 1]), and the numbers of convolutional filters in the first and second layers were 20 and 50, respectively (num_filter = [20, 50]). These were followed by two fully connected layers with 500 and 40 hidden nodes (num_hidden = [500, 40]). SVM is a well-established machine learning algorithm. To build our SVM-based prediction model, we used the SVM_light software package with its default settings49. The prediction model developed in this study is available at our GitHub repository: https://github.com/22246013-LuChai/Predicting-CTCF-cell-type-active-binding-sites-in-human-genome.git.
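To give a sense of the size of this architecture, the following sketch tallies output shapes and trainable parameters layer by layer. Several details are assumptions not stated in the text: each signal is treated as one input channel of a 15 × 20 matrix, convolutions use stride 1 with no pooling, the feature maps are flattened before the dense layers, and a single output unit follows the last hidden layer. The function name `cnn_summary` is hypothetical.

```python
from typing import List, Tuple

Layer = Tuple[str, Tuple[int, ...], int]  # (name, output shape, parameter count)


def cnn_summary(n_channels: int, height: int = 15, width: int = 20,
                num_filter: Tuple[int, ...] = (20, 50),
                num_hidden: Tuple[int, ...] = (500, 40)) -> List[Layer]:
    """Tally shapes and parameters for 1 x 1 convolutions followed by dense layers."""
    layers: List[Layer] = []
    in_ch = n_channels
    for out_ch in num_filter:
        # A 1 x 1 kernel has in_ch weights per filter, plus one bias per filter.
        params = in_ch * out_ch + out_ch
        layers.append((f"conv1x1_{out_ch}", (height, width, out_ch), params))
        in_ch = out_ch
    in_units = height * width * in_ch            # flatten (assumed: no pooling)
    for units in num_hidden:
        layers.append((f"dense_{units}", (units,), in_units * units + units))
        in_units = units
    layers.append(("output_1", (1,), in_units + 1))  # assumed single output unit
    return layers


# Example with three input signals (e.g. DNase-seq, RAD21, SMC3) as channels.
for name, shape, params in cnn_summary(n_channels=3):
    print(f"{name:14s} {str(shape):16s} {params}")
```

Under these assumptions, a 1 × 1 kernel mixes signals within each bin but does not look across neighboring bins, and almost all parameters sit in the first fully connected layer.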
Partition of training and validation sets
To robustly assess our models, we adopted a threefold cross-validation scheme. Both the positive and negative datasets were split into three subsets: one subset for testing and the other two combined for training. Prediction performance was evaluated as the area under the precision-recall curve (AUPRC), averaged over the three predictions. To assess the model's generalization capability and its independence from chromosome-specific biases, we held out the data from chromosomes 1, 8 and 21. The remaining 19 autosomes (chromosomes X and Y were not included) were used to build the prediction model with threefold cross-validation. Finally, the model's performance was assessed on the held-out data from chromosomes 1, 8 and 21 (Additional file 1, Table S3).
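The partitioning and the evaluation metric can be sketched as follows. All names are hypothetical, the random shuffling and seed are assumptions, and AUPRC is approximated here by average precision (the mean of the precision values at each correctly ranked positive), which may differ slightly from the interpolation used in the study.

```python
import random
from typing import Iterator, List, Sequence, Tuple

HELD_OUT = {"chr1", "chr8", "chr21"}   # chromosomes reserved for the final test
SEX_CHROMS = {"chrX", "chrY"}          # excluded from the dataset

Site = Tuple[str, int, int, int]       # (chromosome, start, end, label)


def split_by_chromosome(sites: List[Site]) -> Tuple[List[Site], List[Site]]:
    """Separate sites on the 19 training autosomes from the held-out chromosomes."""
    train = [s for s in sites if s[0] not in HELD_OUT and s[0] not in SEX_CHROMS]
    held_out = [s for s in sites if s[0] in HELD_OUT]
    return train, held_out


def threefold_splits(positives: Sequence, negatives: Sequence,
                     seed: int = 0) -> Iterator[Tuple[list, list]]:
    """Yield (train, test) pairs; positives and negatives are split separately."""
    rng = random.Random(seed)

    def folds(items: Sequence) -> List[list]:
        shuffled = list(items)
        rng.shuffle(shuffled)
        return [shuffled[i::3] for i in range(3)]

    pos_folds, neg_folds = folds(positives), folds(negatives)
    for i in range(3):
        test = pos_folds[i] + neg_folds[i]
        train = [x for j in range(3) if j != i for x in pos_folds[j] + neg_folds[j]]
        yield train, test


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: mean precision at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0
```

In use, a model would be trained on each `train` split, scored on the matching `test` split with `average_precision`, and the three values averaged before the final check on the held-out chromosomes.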
Data availability
The ChIP-seq data of CTCF binding in 67 cell lines and the epigenetic data (DNase-seq, histone modifications, and the two TFs RAD21 and SMC3) are available from ENCODE (http://genome.ucsc.edu/ENCODE/index.html), and the CTCF motif data are available from JASPAR (https://jaspar.genereg.net/) and HOCOMOCO (https://hocomoco11.autosome.org/). Specific data download links are provided within the manuscript or the supplementary information files.
References
Vostrov, A. A. & Quitschke, W. W. The zinc finger protein CTCF binds to the APBbeta domain of the amyloid beta-protein precursor promoter. Evidence for a role in transcriptional activation. J. Biol. Chem. 272, 33353–33359. https://doi.org/10.1074/jbc.272.52.33353 (1997).
Filippova, G. N. et al. An exceptionally conserved transcriptional repressor, CTCF, employs different combinations of zinc fingers to bind diverged promoter sequences of avian and mammalian c-myc oncogenes. Mol. Cell. Biol. 16, 2802–2813. https://doi.org/10.1128/MCB.16.6.2802 (1996).
Dekker, J. & Mirny, L. The 3D genome as moderator of chromosomal communication. Cell 164, 1110–1121. https://doi.org/10.1016/j.cell.2016.02.007 (2016).
Barrington, C. et al. Enhancer accessibility and CTCF occupancy underlie asymmetric TAD architecture and cell type specific genome topology. Nat. Commun. 10, 2908. https://doi.org/10.1038/s41467-019-10725-9 (2019).
Guo, Y. et al. CRISPR inversion of CTCF sites alters genome topology and enhancer/promoter function. Cell 162, 900–910. https://doi.org/10.1016/j.cell.2015.07.038 (2015).
Alharbi, A. B., Schmitz, U., Bailey, C. G. & Rasko, J. E. CTCF as a regulator of alternative splicing: New tricks for an old player. Nucleic Acids Res 49, 7825–7838. https://doi.org/10.1093/nar/gkab520 (2021).
Xu, C. & Corces, V. G. Genome-wide mapping of protein–DNA interactions on nascent chromatin. Methods Mol. Biol. https://doi.org/10.1007/978-1-4939-7768-0_13 (2018).
Xie, X. et al. Systematic discovery of regulatory motifs in conserved regions of the human genome, including thousands of CTCF insulator sites. Proc. Natl. Acad. Sci. U. S. A. 104, 7145–7150. https://doi.org/10.1073/pnas.0701811104 (2007).
Kim, T. H. et al. Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome. Cell 128, 1231–1245. https://doi.org/10.1016/j.cell.2006.12.048 (2007).
Katainen, R. et al. CTCF/cohesin-binding sites are frequently mutated in cancer. Nat. Genet. 47, 818–821. https://doi.org/10.1038/ng.3335 (2015).
Fang, C. et al. Cancer-specific CTCF binding facilitates oncogenic transcriptional dysregulation. Genome Biol. 21, 1–30. https://doi.org/10.1186/s13059-020-02152-7 (2020).
Fang, R., Wang, C., Skogerbo, G. & Zhang, Z. Functional diversity of CTCFs is encoded in their binding motifs. BMC Genomics 16, 649. https://doi.org/10.1186/s12864-015-1824-6 (2015).
Keilwagen, J., Posch, S. & Grau, J. Accurate prediction of cell type-specific transcription factor binding. Genome Biol. 20, 1–17. https://doi.org/10.1186/s13059-018-1614-y (2019).
Tompa, M. et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol. 23, 137–144. https://doi.org/10.1038/nbt1053 (2005).
Weirauch, M. T. et al. Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol. 31, 126–134. https://doi.org/10.1038/nbt.2486 (2013).
Chen, X., Yu, B., Carriero, N., Silva, C. & Bonneau, R. Mocap: Large-scale inference of transcription factor binding sites from chromatin accessibility. Nucleic Acids Res. https://doi.org/10.1093/nar/gkx174 (2017).
Li, H. Y., Quang, D. & Guan, Y. F. Anchor: Trans-cell type prediction of transcription factor binding sites. Genome Res. 29, 281–292. https://doi.org/10.1101/gr.237156.118 (2019).
Deng, L., Wu, H., Liu, X. & Liu, H. DeepD2V: A novel deep learning-based framework for predicting transcription factor binding sites from combined DNA sequence. Int. J. Mol. Sci. 22, 5521. https://doi.org/10.3390/ijms22115521 (2021).
Qin, Q., Fan, J., Zheng, R., Wan, C. & Liu, X. S. Lisa: Inferring transcriptional regulators through integrative modeling of public chromatin accessibility and ChIP-seq data. Genome Biol. https://doi.org/10.1186/s13059-020-1934-6 (2020).
Pohl, A. & Beato, M. bwtool: A tool for bigWig files. Bioinformatics (Oxford, England) 30, 1618–1619. https://doi.org/10.1093/bioinformatics/btu056 (2014).
Quinlan, A. R. & Hall, I. M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics (Oxford, England) 26, 841. https://doi.org/10.1093/bioinformatics/btq033 (2010).
Berger, S. L. The complex language of chromatin regulation during transcription. Nature 447, 407–412. https://doi.org/10.1038/nature05915 (2007).
Schmidt, F. et al. Combining transcription factor binding affinities with open-chromatin data for accurate gene expression prediction. Nucleic Acids Res. https://doi.org/10.1093/nar/gkw1061 (2017).
Zhang, L. et al. A deep learning model to identify gene expression level using cobinding transcription factor signals. Brief. Bioinform. https://doi.org/10.1093/bib/bbab501 (2021).
Zhang, L., Xue, G., Liu, J., Li, Q. & Wang, Y. Revealing transcription factor and histone modification co-localization and dynamics across cell lines by integrating ChIP-seq and RNA-seq data. BMC Genomics https://doi.org/10.1186/s12864-018-5278-5 (2018).
Cuddapah, S. et al. Global analysis of the insulator binding protein CTCF in chromatin barrier regions reveals demarcation of active and repressive domains. Genome Res. 19, 24–32. https://doi.org/10.1101/gr.082800.108 (2009).
Dixon, J. R. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376. https://doi.org/10.1038/nature11082 (2012).
Bailey, T. L. et al. MEME SUITE: Tools for motif discovery and searching. Nucleic Acids Res. 37, W202–W208. https://doi.org/10.1093/nar/gkp335 (2009).
Machanick, P. & Bailey, T. L. MEME-ChIP: Motif analysis of large DNA datasets. Bioinformatics (Oxford, England) 27, 1696–1697. https://doi.org/10.1093/bioinformatics/btr189 (2011).
Ong, C. T. & Corces, V. G. CTCF: An architectural protein bridging genome topology and function. Nat. Rev. Genet. 15, 234–246. https://doi.org/10.1038/nrg3663 (2014).
Gu, B. et al. Opposing effects of cohesin and transcription on CTCF organization revealed by super-resolution imaging. Mol. Cell 80, 699-+. https://doi.org/10.1016/j.molcel.2020.10.001 (2020).
Wu, J. et al. Systematical identification of cell-specificity of CTCF-gene binding based on epigenetic modifications. Brief. Bioinform. 22, 589–600. https://doi.org/10.1093/bib/bbaa004 (2021).
Chen, C. et al. DeepGRN: Prediction of transcription factor binding site across cell-types using attention-based deep neural networks. BMC Bioinformatics https://doi.org/10.1186/s12859-020-03952-1 (2021).
Koo, P. K. & Ploenzke, M. Deep learning for inferring transcription factor binding sites. Curr. Opin. Syst. Biol. 19, 16–23. https://doi.org/10.1016/j.coisb.2020.04.001 (2020).
Zeng, H. Y., Edwards, M. D., Liu, G. & Gifford, D. K. Convolutional neural network architectures for predicting DNA-protein binding. Bioinformatics (Oxford, England) 32, 121–127. https://doi.org/10.1093/bioinformatics/btw255 (2016).
Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods 12, 931–934. https://doi.org/10.1038/nmeth.3547 (2015).
Ardakani, F. B., Schmidt, F. & Schulz, M. H. Predicting transcription factor binding using ensemble random forest models. F1000Research https://doi.org/10.12688/f1000research.16200.2 (2018).
Srivastava, A. & Kumar, M. Prediction of zinc binding sites in proteins using sequence derived information. J. Biomol. Struct. Dyn. 36, 4413–4423. https://doi.org/10.1080/07391102.2017.1417910 (2018).
Wang, W. et al. SmoPSI: Analysis and prediction of small molecule binding sites based on protein sequence information. Comput. Math. Method M. https://doi.org/10.1155/2019/1926156 (2019).
Zhu, Y. H., Hu, J., Qi, Y., Song, X. N. & Yu, D. J. Boosting granular support vector machines for the accurate prediction of protein-nucleotide binding sites. Comb. Chem. High T. Scr. 22, 455–469. https://doi.org/10.2174/1386207322666190925125524 (2019).
Quach, B. & Furey, T. S. DeFCoM: Analysis and modeling of transcription factor binding sites using a motif-centric genomic footprinter. Bioinformatics (Oxford, England) https://doi.org/10.1093/bioinformatics/btw740 (2016).
Mathelier, A. et al. JASPAR 2016: A major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 44, D110–D115. https://doi.org/10.1093/nar/gkv1176 (2016).
Kulakovskiy, I. V. et al. HOCOMOCO: Expansion and enhancement of the collection of transcription factor binding sites models. Nucleic Acids Res. https://doi.org/10.1093/nar/gkv1249 (2016).
Wang, H. et al. Widespread plasticity in CTCF occupancy linked to DNA methylation. Genome Res. 22, 1680–1688. https://doi.org/10.1101/gr.136101.111 (2012).
Grant, C. E., Bailey, T. L. & Noble, W. S. FIMO: Scanning for occurrences of a given motif. Bioinformatics (Oxford, England) 27, 1017–1018. https://doi.org/10.1093/bioinformatics/btr064 (2011).
Cheng, J., Wang, P. S., Gang, L. I., Qing-Hao, H. U. & Han-Qing, L. U. Recent advances in efficient computation of deep convolutional neural networks. Front. Inf. Technol. Electron. Eng. 19, 64–77. https://doi.org/10.1631/fitee.1700789 (2018).
Platt, J. C. Sequential minimal optimization: A fast algorithm for training support vector machines. Microsoft Res. (1998).
Ketkar, N. Introduction to Keras. Apress https://doi.org/10.1007/978-1-4842-2766-4_7 (2017).
Joachims, T. Making large-scale SVM learning practical. RePEc: Res. Papers. Econ. 28 (1998).
Acknowledgements
We would like to express our sincere gratitude to Dr. Zhang for her invaluable guidance and support throughout this research. We also thank our colleagues in the Zhang Lab for their helpful discussions and assistance with data analysis. This work was supported by funding from the National Natural Science Foundation of China (Grant Numbers: 61962041, 62062053 and 12025107) and Inner Mongolia Autonomous Region Natural Science Foundation (Grant Number: 2024MS03023), which facilitated our research efforts.
Author information
Contributions
L.C. led the data collection, data analysis, and study implementation, organized the data, and wrote the manuscript. J.G. created the figures for the manuscript, facilitating a clear presentation of the research data and findings. Z.L. assisted in experimental data collection and participated in discussions about methodological details. H.S. conducted the literature review and identified relevant studies. J.L. provided financial and project support. Y.W. participated in work discussions, organized the project workflow, and contributed financial and project support. L.Z. conceived the study, designed the experimental plan, proposed the research topic, reviewed and edited the manuscript, supervised the work, and provided financial and project support. All authors reviewed and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Chai, L., Gao, J., Li, Z. et al. Predicting CTCF cell type active binding sites in human genome. Sci Rep 14, 31744 (2024). https://doi.org/10.1038/s41598-024-82238-5
DOI: https://doi.org/10.1038/s41598-024-82238-5