EmbedTAD Using Graph Embedding and Unsupervised Learning to Identify TADs from High-Resolution Hi-C Data

Chowdhury, H. M. A. Mohit; Oluwadare, Oluwatosin

doi:10.1038/s42003-025-09224-z

Article
Open access
Published: 09 December 2025

Communications Biology volume 9, Article number: 7 (2026)

Abstract

Topologically Associating Domains (TADs) serve a functional purpose as self-interacting regions whose boundaries are enriched with various proteins. Identifying these TAD regions is essential for examining several biological characteristics, including immune system function and chromosome organization. In this study, we propose EmbedTAD for identifying TAD regions from high-resolution Hi-C data. To achieve this, we utilize NetMF, a graph embedding technique that employs low computational resources, and cluster the embeddings into TAD regions using the HDBSCAN algorithm. We demonstrate that, during T-cell differentiation, EmbedTAD detects TAD rearrangements and can differentiate between active and inactive cells. Furthermore, we show that EmbedTAD recovers a significant number of TADs also present in PLAC-seq data, demonstrating its reproducibility. We confirm that EmbedTAD detects TADs with distinct ChIP-seq signals surrounding their boundaries, including CTCF, RAD21, and SMC3. Overall, EmbedTAD reliably and efficiently identifies TADs with minimal computational resources, outperforming many state-of-the-art methods.

Introduction

Chromosomes are a key component of living organisms, containing the genetic and epigenetic information that controls functional traits¹. The genome’s three-dimensional (3D) structure is essential for gene regulation, including chromatin interactions that influence phenotypes and contribute to understanding gene expression, as well as the organization of active and inactive territories^2,3,4,5. 3D chromosome conformation capture methods, such as 5C, Hi-C, and Micro-C, enable the mapping of spatial interactions between genomic regions that are distant in linear sequence, providing insights into chromatin structure, gene regulation, and cellular function⁶. These techniques have enabled the study of the 3D spatial structure of chromosomes and genomes³. Specifically, Hi-C has revealed that chromosomes are segregated into kilobase to megabase-sized regions, creating physical domains known as Topologically Associating Domains (TADs)^7,8. According to Dixon et al.⁸, self-interacting regions observed at a bin size of less than 100 Kb form a triangle-like shape and are bounded by segments at which the interaction abruptly terminates. These regions are known as TADs, and the abrupt shift marks the boundary regions that divide TADs⁸. TADs are enriched in genes that interact with regulatory elements, and their boundaries are enriched with various epigenetic proteins, including insulator proteins and others^2,8,9. TADs play a crucial role in defining interacting domains and genomic functions, such as the formation of chromatin loops¹⁰. They are also important for higher-order chromatin folding and proper long-range transcriptional control⁶. Within a TAD, loci share common cis-regulatory elements and form interaction networks. During cell development, gene expression patterns are influenced by the physical clustering of TADs, which form a modular framework in chromatin structure and nuclear positioning⁷. In mammalian genomes, TAD regions align with histone markers such as H3K27me3 and H3K9me2, as well as CTCF and cohesin-binding sites^7,8.

TAD detection has been a prominent field of study, and several computational methods have been developed to identify TADs from Hi-C data^11,12,13,14,15,16. To benchmark these methods and highlight the importance of TAD detection research, researchers have conducted comparative studies and classified the tools into several categories, including linear score-based, statistical model-based, network feature-based, graph partitioning, and clustering-based approaches^17,18,19,20. Within these categories, linear score-based methods assign a score to each genomic bin’s contact frequency and apply statistical testing to identify TADs. This category includes most TAD detection tools, such as Armatus²¹ and TopDom¹⁵. For instance, TopDom computes bin signals using a sliding window and identifies statistically significant domains. The statistical modeling category divides data and variables based on their interaction distributions or relationships, applying statistical tests to filter false-positive TADs. HiCseg¹⁴, for example, uses dynamic programming and log-likelihood computations to detect TADs. This approach can also be described as feature-based modeling, since it leverages interaction features to infer domain boundaries. Tools in the graph modeling category transform the Hi-C contact matrix into a graph data structure, where edges represent interaction frequencies between loci. Spectral¹⁶ identifies TADs using spectral graph theory and the Fiedler vector, while scKTLD²² detects TADs from single-cell Hi-C data through graph embedding and dynamic programming. We refer to graph modeling and network modeling interchangeably, given their conceptual similarity. Lastly, there are clustering-based approaches^17,18,19,20, which group genomic regions based on interaction similarity to infer hierarchical or non-overlapping domain structures. Several TAD detection tools rely on clustering algorithms, such as IC-Finder¹³, which uses hierarchical clustering to identify TAD domains, ClusterTAD¹², which uses K-means clustering²³, and CASPIAN¹¹, which uses HDBSCAN²⁴.

Despite advancements in TAD detection, the use of spatial feature representation of bulk Hi-C data through graph embedding remains limited. HiC-GNN²⁵ introduced the use of a graph embedding-based approach for 3D genome structure reconstruction from bulk Hi-C data, and HiCEGNN²⁶ recently proposed a similar approach for single-cell Hi-C data 3D chromosome structure reconstruction. However, the application of graph embedding techniques to TAD identification using bulk Hi-C data has not yet been explored, representing a significant opportunity for methodological innovation. Graph-based representation of Hi-C data offers a powerful means of understanding the complex interactions within TADs. By structuring the data as a graph, we can better capture the relationships and connectivity among genomic features, providing a more nuanced view of spatial organization. Graph embedding techniques retain critical attributes of the data while simplifying computation, making it easier to analyze large datasets.

In this study, we present EmbedTAD, an algorithm that integrates graph embedding and clustering to identify TADs from high-resolution Hi-C contact matrices. EmbedTAD employs an efficient embedding strategy that maximizes feature representation in a low-dimensional space while minimizing computational cost. The accuracy and biological relevance of the TADs detected by EmbedTAD were validated using multiple evaluation metrics and biological datasets. We first compared EmbedTAD’s performance with existing TAD callers using both in-silico and in-situ Hi-C datasets. We then conducted a systematic evaluation of the TAD regions identified by EmbedTAD and demonstrated its practical applications in understanding cellular functional organization. The identified TADs facilitate the exploration of key biological features, including CTCF binding sites, T-cell activation during development, and histone modification markers such as H3K27ac and H3K27me3. Overall, EmbedTAD represents Hi-C interaction data through network embedding in a low-dimensional space, achieving low memory usage, reduced runtime, and superior performance compared to other state-of-the-art TAD callers. These characteristics make EmbedTAD an efficient and powerful tool for TAD detection and genomic structure analysis.

Results

EmbedTAD overview

To predict TADs, EmbedTAD starts with an n × n Hi-C contact matrix as input (Fig. 1A). We observed that the high-resolution Hi-C contact matrix of a single chromosome is often too large, particularly for human and mouse data, for most computing systems to process at once. To address this issue, we divided the n × n contact matrix into equal-sized p × p sub-matrices following Equation (2). EmbedTAD determines ns (the number of sub-matrices) dynamically at runtime according to the size of n × n and a threshold t = 5000, resulting in sub-matrices of size ≤5000 × 5000. When dividing the n × n Hi-C contact matrix into equal-sized p × p sub-matrices, we observed that TADs spanning the cut points were missed, degrading performance by 2–3%. To account for this loss, we extended every p_{i+1} × p_{i+1} sub-matrix by a q = 3 Mb region from the previous p_i × p_i sub-matrix (Fig. 1B). In general, we represent each sub-matrix as p × p and apply a Gaussian filter to remove noise and strengthen the diagonal interaction frequencies.

Next, we convert each p × p sub-matrix into graph data to feed into the NetMF embedding algorithm²⁷. We used an embedding size of e = 455, as it achieved optimal results during our hyperparameter search (see Hyperparameter Search in supplementary information). The embedding reduces the feature size to e while preserving the information present in p features. The embedded data are then fed into the HDBSCAN²⁴ clustering algorithm to produce clusters, which represent TADs. Because we extended p_{i+1} × p_{i+1} by q, some overlapping or redundant TADs appear in the extended (Q) region, which originates from the previous p_i × p_i sub-matrix (Fig. 1B). To remove these overlapping or redundant TADs, we measure the TAD Quality TQ_i for p_i × p_i and TQ_{i+1} for p_{i+1} × p_{i+1} in the Q region. We retain the set with the highest TAD quality.
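For orientation, the small-window form of NetMF, as we understand it from the NetMF authors' description²⁷, can be written as follows. Here A is the weighted adjacency matrix of the graph built from one p × p sub-matrix, D is its diagonal degree matrix, vol(G) is the sum of all entries of A, T is the context window size, and b is the number of negative samples. The values of T and b used by EmbedTAD are not stated in this section, so they remain free symbols, and the logarithm and maximum are applied element-wise.

$$M=\frac{{\rm{vol}}(G)}{bT}\left(\mathop{\sum }_{r = 1}^{T}{\left({D}^{-1}A\right)}^{r}\right){D}^{-1},\qquad \log \left(\max (M,1)\right)\approx {U}_{e}{\Sigma }_{e}{V}_{e}^{\top },\qquad E={U}_{e}\sqrt{{\Sigma }_{e}}$$

Under this reading, each row of E is the e-dimensional embedding of one genomic bin (e = 455 in the text above), and these rows are what the clustering step receives.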

Iteratively, EmbedTAD applies a Gaussian filter to each sub-matrix (p × p), creates a graph, feeds it into the NetMF and HDBSCAN algorithms, and removes overlapping or redundant TADs to produce TADs for the entire matrix.

EmbedTAD shows consistency across different noise levels using In-silico Hi-C data

We observed the consistency of EmbedTAD across different noise levels (4, 8, 12, 16, and 20) using the In-silico Hi-C dataset²⁸. We computed the Silhouette Index (SI), Davies-Bouldin Index (DBI), and Calinski-Harabasz Index (CHI) to evaluate cluster quality and consistency across different noise levels (Fig. 2A, B, and C). EmbedTAD produced a consistent SI (≈ 0.35) and DBI (≈ 1.50) across different noise levels. The CHI score fluctuated between 55 and 70 across these noise levels. In addition, we computed the TAD Quality (TQ)¹² for each dataset and found that the scores were consistent among the five datasets at each noise level. All noise levels maintained TQ scores between 8 and 12 (Fig. 2D). We also observed the number of detected TADs at each noise level and compared the results with the ground-truth. EmbedTAD’s detected TAD count was slightly lower than the true TADs, with the median value closely matching the true count of 171 TADs across all noise levels (Fig. 2E). To further validate EmbedTAD’s consistency across different noise levels using the In-silico Hi-C dataset, we measured the TAD size distribution. We observed that EmbedTAD’s kernel density estimation (KDE) curve was almost consistent with the ground-truth curve, with most TAD sizes falling between 250 Kb and 1.75 Mb (Fig. 2F and Supplementary Fig. 1). Overall, EmbedTAD showed consistent results across all metrics using the In-silico Hi-C dataset.

**Fig. 2: EmbedTAD’s consistency analysis using In-silico Hi-C data.**

EmbedTAD outperforms other TAD callers across all noise levels using In-silico Hi-C data

We compared EmbedTAD with seven state-of-the-art methods to assess how well our pipeline performs relative to existing algorithms. Comprehensive reviews^17,18,19,20 have categorized TAD detection algorithms into different groups. To ensure a fair comparison, we randomly selected at least one tool from each category. Since our algorithm falls under the clustering-based category, we included three clustering-based algorithms in our comparison. Specifically, we selected IC-Finder¹³, ClusterTAD¹², TopDom¹⁵, Armatus²¹, HiCseg¹⁴, CASPIAN¹¹, and Spectral¹⁶, categorizing them into Clustering, Linear, Statistical, and Network Feature methods based on their underlying approaches (Supplementary Table 1). We compared EmbedTAD with these seven TAD callers using in-silico Hi-C data across five noise levels (4, 8, 12, 16, and 20) and recorded the Measure of Concordance (MoC)²⁹ to evaluate performance (Fig. 2G and Supplementary Fig. 2). EmbedTAD consistently achieved the highest MoC scores across all noise levels compared to the other TAD callers. We further evaluated its performance using traditional clustering metrics, including Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI) (see Performance Comparison using AMI and ARI in Supplementary Information), following prior studies in TAD detection^22,30. EmbedTAD achieved competitive AMI and ARI scores, showing stable and reproducible performance with nearly identical mean and median values and no outliers (Supplementary Fig. 3, Supplementary Table 2). The results indicate that the network embedding and graph-based data representation reduce noise sensitivity and enhance overall stability. Collectively, these findings demonstrate that EmbedTAD reliably and accurately identifies TADs across varying noise conditions, establishing it as a robust and effective TAD caller.
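The MoC formula is not restated in this section. As a rough illustration, the sketch below follows the formulation commonly used in TAD benchmarking, where the score is built from the squared overlap between every pair of domains in two partitions, normalized by the two domain sizes. The function name and the `(start, end)` interval representation (end exclusive, in base pairs or bins) are assumptions made for this example, not part of EmbedTAD.

```python
import math


def measure_of_concordance(tads_p, tads_q):
    """MoC between two TAD sets given as lists of (start, end), end exclusive.

    Assumed formulation: (sum_ij |F_ij|^2 / (|P_i| |Q_j|) - 1) / (sqrt(N_P N_Q) - 1),
    with the score defined as 1 when both sets contain a single domain.
    """
    if not tads_p or not tads_q:
        raise ValueError("both TAD sets must contain at least one domain")
    if len(tads_p) == 1 and len(tads_q) == 1:
        return 1.0

    total = 0.0
    for p_start, p_end in tads_p:
        for q_start, q_end in tads_q:
            # Length of the region shared by the two domains (0 if disjoint).
            overlap = min(p_end, q_end) - max(p_start, q_start)
            if overlap > 0:
                total += overlap ** 2 / ((p_end - p_start) * (q_end - q_start))

    return (total - 1.0) / (math.sqrt(len(tads_p) * len(tads_q)) - 1.0)


if __name__ == "__main__":
    truth = [(0, 100), (100, 200)]
    called = [(0, 50), (50, 200)]
    print(measure_of_concordance(truth, called))
```

Identical partitions score 1, and the score drops as boundaries drift apart, which is why the metric is convenient for comparing a caller against simulated ground truth.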

EmbedTAD demonstrates robust ChIP-seq signal enrichment at TAD boundaries using In-situ Hi-C data

We evaluated TAD boundaries on GM12878 chromosome 19 and CH12LX chromosome 18 at 5 Kb and 10 Kb resolutions and compared EmbedTAD with other TAD callers. CCCTC-binding protein (CTCF), RAD21, and SMC3 are known to be abundant at TAD boundaries^8,31. Cohesin proteins, such as RAD21 and SMC3, aid in DNA loop formation and chromosomal structure maintenance, while CTCF functions as an insulator protein. These proteins are often used as markers to evaluate a TAD caller’s accuracy and to validate the detected TAD boundaries. We computed the average ChIP-seq signal for all TAD callers listed in Supplementary Table 1 and compared the results with those of EmbedTAD. Specifically, we measured the average ChIP-seq signal at the boundary positions and their ± 250 Kb neighboring regions. In Fig. 3A, EmbedTAD showed CTCF enrichment around boundaries on GM12878 chromosome 19 at 10 Kb resolution. Similarly, when evaluating RAD21 (Fig. 3B) and SMC3 (Fig. 3C), EmbedTAD showed enrichment around TAD boundaries similar to other TAD callers.
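The boundary-centered averaging described above can be sketched as follows. This is a minimal illustration, assuming the ChIP-seq track has already been binned to the Hi-C resolution and that boundaries are given as bin indices; the function name and argument layout are hypothetical. At 10 Kb resolution, a ± 250 Kb window corresponds to `flank_bins = 25`.

```python
def average_boundary_profile(signal, boundaries, flank_bins):
    """Average a per-bin signal in a window of +/- flank_bins around each boundary.

    Returns a list of length 2 * flank_bins + 1; index flank_bins is the boundary itself.
    Window positions that fall outside the chromosome are skipped for that boundary.
    """
    width = 2 * flank_bins + 1
    sums = [0.0] * width
    counts = [0] * width
    for boundary in boundaries:
        for k in range(width):
            pos = boundary - flank_bins + k
            if 0 <= pos < len(signal):
                sums[k] += signal[pos]
                counts[k] += 1
    # Offsets never covered by any boundary are reported as NaN.
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


if __name__ == "__main__":
    track = [0, 0, 1, 5, 1, 0, 0, 1, 5, 1, 0]
    print(average_boundary_profile(track, boundaries=[3, 8], flank_bins=2))
```

A profile that peaks at the central index indicates enrichment at the called boundaries, which is the pattern the figures in this section report for CTCF, RAD21, and SMC3.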

**Fig. 3: Comparison of TAD callers using GM12878 chromosome 19 at 10Kb resolution.**

When comparing CH12LX chromosome 18 at both 5 Kb and 10 Kb resolutions, as well as GM12878 chromosome 19 at 5 Kb resolution, we observed a consistent pattern where EmbedTAD showed enrichment signals at boundary positions (Supplementary Figs. 4, 5, and 6). EmbedTAD consistently showed enrichment around boundaries, similar to other TAD callers across analyses. We also note that methods detecting a larger number of TADs (Supplementary Fig. 7) naturally have a higher likelihood of capturing additional boundary variations, which partly explains why the other methods showed slightly higher enrichment of CTCF, RAD21, and SMC3. Taken together, these results underscore the robustness of our algorithm and provide a clear explanation of ChIP-seq signal enrichment around TAD boundaries, which is essential for accurately identifying TADs and maintaining biological information for future analysis.

EmbedTAD demonstrates impressive TAD detection accuracy implied by TADadjR²

TADs are characterized by higher interaction frequencies within the region compared to regions outside, making them significant in cell biology⁸. In contrast, regions with low interaction frequencies are typically identified as non-TAD regions. Interaction frequency generally decays with increasing genomic distance; within a TAD, local interaction frequencies should remain high and gradually decay as the distance increases. The TADadjR² score (Equation (1)) is an R²-based statistical metric, adjusted for TAD analysis, that explains the variance of Hi-C interaction frequencies over a given genomic distance³². An et al.³² described TADadjR² as

$${R}_{adj}^{2}=1-\frac{\frac{1}{N-{N}_{t}-1}{\sum }_{i = 1}^{N}{\left({X}_{i}-{\hat{X}}_{i}\right)}^{2}}{\frac{1}{N-1}\mathop{\sum }_{i = 1}^{N}{\left({X}_{i}-\overline{X}\right)}^{2}}$$

(1)

where X_i is the contact frequency of the i^th bin pair, N is the number of bin pairs, N_t is the number of TADs whose size is greater than or equal to this genomic distance, ${\hat{X}}_{i}$ is the average contact frequency within the TAD or gap region containing the pair, and $\overline{X}$ is the mean contact frequency. This metric explains the interaction frequency variance at a given genomic distance. The numerator is the variance of interaction frequency that remains within TADs and gap regions, and the denominator is the overall variance. Because the numerator is normalized by N − N_t − 1, adding TADs that do not substantially reduce the within-region variance does not improve the score. As TADs have a high local interaction frequency that decays with distance toward the boundary region, the metric quantifies how well the detected TADs explain the interaction frequencies. Scores range from 0 to 1: a score close to 1 means the classified TADs almost perfectly explain the interaction frequency variance, and 0 means they explain none of it.
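As a worked illustration of Equation (1), the sketch below evaluates the score at a single genomic distance d, measured in bins. It makes several simplifying assumptions that are not taken from the source: TADs are non-overlapping `(start, end)` bin ranges with the end exclusive, all bin pairs outside any TAD are pooled into one gap group, and a TAD counts toward N_t only if it is large enough to contain a pair d bins apart. The function name is hypothetical.

```python
def tad_adj_r2(matrix, tads, d):
    """Adjusted R^2 of contact frequencies at distance d (in bins), per Equation (1).

    matrix: square contact matrix as a list of lists.
    tads:   non-overlapping (start, end) bin ranges, end exclusive.
    """
    n = len(matrix)
    values = [matrix[i][i + d] for i in range(n - d)]  # X_i for every bin pair
    n_pairs = len(values)                              # N

    # Assign each bin pair to the TAD that contains both bins, else to the gap group.
    groups = {}
    for i, x in enumerate(values):
        label = "gap"
        for k, (start, end) in enumerate(tads):
            if start <= i and i + d < end:
                label = k
                break
        groups.setdefault(label, []).append(x)

    # N_t: TADs wide enough to hold a bin pair at this distance (assumed convention).
    n_t = sum(1 for start, end in tads if end - start > d)

    overall_mean = sum(values) / n_pairs
    total_ss = sum((x - overall_mean) ** 2 for x in values)
    residual_ss = 0.0
    for members in groups.values():
        group_mean = sum(members) / len(members)       # X-hat for this region
        residual_ss += sum((x - group_mean) ** 2 for x in members)

    return 1.0 - (residual_ss / (n_pairs - n_t - 1)) / (total_ss / (n_pairs - 1))
```

In a toy matrix where every within-TAD pair has the same contact frequency and only the pair straddling the boundary is weak, the residual term vanishes and the score is 1. With no TADs at all, the residual equals the total variance and the score is 0.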

To assess TAD detection accuracy, we measured the TADadjR² score over the 0 to 1.5 Mb range of genomic distances for each TAD caller and compared it with EmbedTAD. We plotted the TADadjR² scores in Fig. 3D to K and Supplementary Figs. 8, 9, and 10 to visualize the decay pattern over this range. EmbedTAD showed high interaction frequency that depleted with increasing distance, like other state-of-the-art TAD callers. Along with these callers, EmbedTAD obtained a score close to 1 that decreased with increasing genomic distance, as indicated in Supplementary Table 3. This high score and its depletion over distance are essential for differentiating between TAD and non-TAD regions and account well for the interaction frequency variation. EmbedTAD’s scores differ only slightly from those of other state-of-the-art TAD callers, which indicates strong agreement with them. This analysis of Hi-C interaction frequency variance demonstrates EmbedTAD’s capability to distinguish between TAD and non-TAD regions.

EmbedTAD accurately detects TADs with strong interaction signals, consistent sizes, and insulation scores

We statistically analyzed the TADs detected by EmbedTAD on GM12878 and CH12LX at 5 Kb and 10 Kb resolutions. In Fig. 4A, we plotted the total number of TADs detected in each dataset and observed that EmbedTAD detected more TADs at 10 Kb resolution compared to 5 Kb. We also examined the TAD size distribution per bin and found that the average size ranged between 200 Kb and 280 Kb, indicating that the typical TAD size detected by EmbedTAD lies within this range (Fig. 4B, C, D, E, and F). To further evaluate the detected TADs, we visualized^33,34 GM12878 chromosome 19 (Fig. 4G) and CH12LX chromosome 18 (Fig. 4H) at 10 Kb resolution, focusing on the 40 Mb to 44 Mb region. The TADs are shown with blue lines at the top, and the insulation score below these TAD regions indicates the Hi-C interaction frequency strength. The TAD cutoff (red line) signifies regions with strong Hi-C interaction frequency, a fundamental feature of TAD boundaries³⁴. We further examined the insulation scores on GM12878 chromosome 3 (5 Kb and 10 Kb), chromosome 19 (5 Kb), and CH12LX chromosome 2 (5 Kb and 10 Kb) and chromosome 18 (5 Kb), and observed that the detected TAD regions had insulation scores above the TAD cutoff in most cases (Supplementary Fig. 11 and 12). Overall, EmbedTAD detected the majority of TADs with strong Hi-C interaction frequencies and insulation scores exceeding the TAD cutoff.

**Fig. 4: Statistical analysis of EmbedTAD using in-situ Hi-C data.**

EmbedTAD achieves efficient TAD detection with low memory and fast execution

Running time and memory consumption are crucial factors for any algorithm, especially when processing high-volume genomic data. We performed a detailed performance comparison of CPU and GPU implementations of EmbedTAD (see Performance Comparison of CPU and GPU Implementations of EmbedTAD in supplementary information). While both versions are available, we adopted the GPU implementation of EmbedTAD as the default. To evaluate the performance of EmbedTAD against other TAD callers, we recorded the running time (in seconds) and memory consumption (in MB) of each TAD caller on GM12878 chromosome 19 and CH12LX chromosome 18 at 5 Kb and 10 Kb resolutions. As shown in Supplementary Table 4, EmbedTAD ranked as the third fastest algorithm in terms of execution time. In terms of memory consumption (Supplementary Table 5), EmbedTAD used the least memory among all the evaluated methods. While running time and memory usage are important for usability, EmbedTAD also maintains high accuracy in detecting TAD regions, offering an efficient and accurate solution.

CTCF binding and histone modifications validate TAD detection in EmbedTAD

The CTCF binding protein is known to be enriched at TAD boundaries. Dixon et al.⁸ investigated factors contributing to TAD formation and discovered that 15% of TAD boundaries contain CTCF binding sites. TADs and their boundary regions should also be assessed for other elements, such as histone modifications and transcription factors⁸. To confirm the biological relevance of the TADs detected by EmbedTAD, we analyzed TAD regions using markers such as H3K27ac, H3K27me3, H3K4me1, H3K4me3, and H3K9me3. Visualization³⁵ of Hi-C interactions alongside ChIP-seq signals validated that the detected TADs are enriched with these marks. We observed that active enhancer marks, such as H3K27ac, and gene body markers, like H3K4me3, are enriched within TAD regions in GM12878 chromosome 19 (Fig. 5A) and CH12LX chromosome 18 (Fig. 5B) at 10 Kb resolution, while repressive marks such as H3K9me3 are not enriched around TADs.

**Fig. 5: Biological validation of EmbedTAD using different ChIP-seq signal data.**

To further validate EmbedTAD’s results, we performed this analysis on GM12878 chromosomes 3 and 19, and CH12LX chromosomes 2 and 18 at both 5 Kb and 10 Kb resolutions (Supplementary Figs. 13-18). We consistently observed that TAD regions detected by EmbedTAD are enriched with active enhancers and depleted of repressive markers. Overall, this analysis demonstrates EmbedTAD’s ability to detect TADs accurately while preserving key biological features.

EmbedTAD recovers TAD regions from mESC H3K4me3 PLAC-seq data

A major challenge for TAD callers is the lack of accessible ground truth, leading to variations in TAD identification across different methods. While the majority of TAD regions should overlap, even when window size is taken into account, these regions often differ between TAD callers. This overlap can be described as the reproducibility among TAD callers. Another significant challenge is that most TAD callers are based on bulk Hi-C data, where genes are enriched with proteins, including the active enhancer mark H3K4me3.

In this study, we identified TADs using PLAC-seq data, as researchers have demonstrated its application for 3D genome analysis^36,37. PLAC-seq generates a higher proportion of long-range intra-chromosomal pairs (67%) and fewer inter-chromosomal pairs (11%)³⁸. In their study, Fang et al. (2016)³⁸ demonstrated that PLAC-seq improves both efficiency and accuracy over ChIA-PET in detecting long-range chromatin interactions in mammalian cells, generating reproducible contact maps across biological replicates. In mouse embryonic stem (ES) cells, PLAC-seq successfully captured promoter-centered interactions, and H3K4me3 PLAC-seq proved useful for identifying chromatin interactions at active or poised promoters³⁸. Together, these findings established PLAC-seq as a reliable method for mapping long-range chromatin interactions. Subsequent studies have further validated the use of PLAC-seq for 3D genome analysis. For example, Lee et al.³⁷ used H3K27ac PLAC-seq data to demonstrate that TADs and sub-TADs constrain enhancer-promoter interactions, highlighting the utility of PLAC-seq for structural domain analysis. Similarly, Rosen et al.³⁶ developed the HPTAD method, specialized for TAD detection using PLAC-seq data. They benchmarked TADs detected from PLAC-seq against Hi-C-derived TADs and showed that PLAC-seq can support domain-level analysis.

Since Rosen et al.³⁶ developed cutting-edge techniques for identifying TADs using mESC H3K4me3 PLAC-seq data, we utilized their TADs as the ground truth to evaluate our method. We calculated EmbedTAD’s recovery rate using its detected TADs from mESC bulk Hi-C data. As shown in Fig. 6A, EmbedTAD recovered approximately 67% of the TAD regions identified in the PLAC-seq data, with the exception of chromosomes 15 and 18 at 40 Kb resolution. While our method successfully recovered TADs on most chromosomes, it struggled on chromosomes 15 and 18 where PLAC-seq coverage was particularly sparse, leading to weaker domain signals. Because PLAC-seq targets specific proteins such as H3K4me3, the resulting data are inherently sparser than Hi-C and better suited for detecting chromatin loops (e.g., promoter-enhancer and promoter-promoter interactions) rather than large-scale 3D domains³⁸. This reflects a limitation of the input PLAC-seq data rather than the algorithm itself, as bulk Hi-C provides the broader coverage needed for robust TAD detection and chromosome structure analysis. We further compared detected TAD regions from chromosome 1 (Fig. 6B), chromosome 3 (Fig. 6C), chromosome 17 (Fig. 6D), and chromosome 19 (Fig. 6E) between 20 Mb and 26 Mb, along with the H3K4me3 ChIP-seq signals. We found that EmbedTAD’s detected TADs generally agreed with the mESC H3K4me3-detected TADs, often showing either multiple smaller TADs or similar TAD regions.

**Fig. 6: TADs recovery rate using mESC H3K4me3 PLAC-seq data and TAD rearrangement during T cell differentiation.**

This analysis confirms the agreement between PLAC-seq and bulk Hi-C TADs, demonstrating the reproducibility of EmbedTAD and its ability to retain biologically relevant information consistent with PLAC-seq data.

EmbedTAD identifies TAD rearrangement during T-cell differentiation in mouse cells

T-cell activation significantly impacts both the immune system response and the dynamics of chromatin structure³⁹. These cells play key roles in energy production, cell cycle progression, biosynthesis, and other biochemical processes. Naïve CD4+ T-cells undergo differentiation into various helper T-cell subsets, including Th17, Th1, and others, which are involved in autoimmunity, tissue inflammation, and other cellular processes. Zhang et al.⁴⁰ demonstrated that TAD rearrangement is one of the significant organizational changes occurring during T-cell differentiation.

In this study, we examined Mus musculus Naïve CD4+ (non-active) and Th17 and Th1 (active) T-cells on chromosome 2 at 10 Kb resolution, focusing on properties such as TAD size distribution and structural changes to validate the differences between active and non-active T-cells. We analyzed overlapping TADs among the three cell types and observed that most TADs overlapped between the active cells (Fig. 6F). We also examined TAD size distribution across the three cell types using KDE curves, revealing size changes from inactive to active T-cells (Fig. 6G). Additionally, we identified and visualized Merge and Split events in TAD domains, finding 21 Split events, and 31 and 28 Merge events in Th17 and Th1 cells, respectively. We highlighted a Merge event in the Th17 cell from the 28 Mb to 32 Mb region (Fig. 6H) and a Split event in the Th1 cell from the 36 Mb to 40 Mb region (Fig. 6I).

Overall, EmbedTAD demonstrates the capability to detect TADs in both active and non-active T-cells, providing insights into their functions and behavior during differentiation.

Discussion

TADs are important biological features involved in cell development and gene regulation. It is well established that more interactions are observed within TAD regions, and these interactions decay proportionally with increasing distance between loci. TAD boundaries are enriched with different proteins, such as CTCF, which contributes to DNA looping, active enhancer-promoter interactions, and cohesin proteins that help maintain chromosome structure during cell development.

In this study, we developed a pipeline, EmbedTAD, which employs graph representation, embedding, and clustering techniques while using minimal memory and providing fast, robust performance across different organisms and resolutions. EmbedTAD demonstrates competitive performance in various analyses, including MoC scores, ChIP-seq signal enrichment, and reduced memory usage and running time compared to other TAD callers. In addition to achieving strong performance, TADs detected by EmbedTAD were validated through multiple statistical analyses, including agreement in average TAD size distributions and insulation scores. Although EmbedTAD was primarily developed for bulk Hi-C data, it also shows good agreement with TADs detected from PLAC-seq data, further establishing its reproducibility. Moreover, EmbedTAD effectively detects TAD rearrangements during T-cell differentiation, providing valuable insights into the distinction between active and non-active T-cells. By leveraging graph embedding techniques, EmbedTAD enables the preservation of important genomic features in lower-dimensional space, highlighting its potential for future research in graph-based genomic data analysis. Overall, EmbedTAD utilizes network embedding to represent bulk Hi-C data in a low-dimensional space. To the best of our knowledge, it is the first method of its kind. It achieves competitive running time while consuming the least memory among the evaluated methods, and it performs strongly against other state-of-the-art TAD callers on Hi-C data across different organisms and resolutions, while preserving key biological features for downstream analyses.

Methods

In this work, we formulate the TAD detection problem as a graph-based problem. A graph data structure preserves relational information between neighboring nodes, where edges define the spatial proximity between two nodes. The distance between nodes indicates their degree of closeness. Additionally, a node, denoted as the central node (n_c), is defined with respect to other nodes by establishing a threshold distance value (magnitude), d_c, for a specific set of nodes relative to the central node^11,41. All nodes that satisfy the condition d < d_c (with n_c nodes in total) form a neighborhood, where each node is connected based on spatial proximity. Using this concept, we can derive multiple neighborhoods, with each neighborhood sharing certain properties and collectively forming a cluster. We applied this neighborhood-based graph clustering approach by treating each bin as a node, with the interaction frequency between two bins representing the weight of the edge between them. The Hi-C interaction frequency defines the proximity between bins, where a higher interaction frequency indicates a greater likelihood of interaction. This interaction frequency reveals hidden clusters, with closely related bins forming neighborhoods, which we define as TAD regions. To discover these domains, we developed EmbedTAD (Fig. 1A), which consists of three modules: i) Data Preprocessing, ii) Clustering, and iii) Output. The following subsections provide a detailed description of EmbedTAD.
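The bin-as-node, frequency-as-weight representation described above can be illustrated with a short sketch. It assumes a symmetric contact matrix stored as a list of lists and makes two choices that are ours rather than the source's: zero-frequency pairs produce no edge, and diagonal entries (self-interactions) are skipped. The function name is hypothetical.

```python
def contact_matrix_to_edges(matrix):
    """Turn a symmetric Hi-C contact matrix into a weighted, undirected edge list.

    Each bin is a node; an edge (i, j, w) carries the interaction frequency w
    between bins i and j. Only the upper triangle is read, so each pair appears once.
    """
    n = len(matrix)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            weight = matrix[i][j]
            if weight > 0:  # assumption: no contact means no edge
                edges.append((i, j, float(weight)))
    return edges


if __name__ == "__main__":
    contacts = [
        [5, 3, 0],
        [3, 4, 1],
        [0, 1, 6],
    ]
    print(contact_matrix_to_edges(contacts))
```

An edge list of this form is one common input for graph libraries and embedding routines; heavier edges connect bins that are more likely to fall in the same neighborhood.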

Data Preprocessing

As the initial step of EmbedTAD’s pipeline, we represented the Hi-C interaction matrix as a graph data structure to feed into the Clustering module. We used in-silico (simulated) Hi-C contact matrices²⁸ at different noise levels (4, 8, 12, 16, and 20) to determine the hyperparameters (see Hyperparameter Search in supplementary information) for our proposed pipeline and compared EmbedTAD with seven state-of-the-art TAD callers. To evaluate robustness, we also used in-situ (real) Hi-C data^3,31.

Initially, we converted the Hi-C data into an n × n contact matrix. This contact matrix contains the interaction frequencies between bins (i, j). To achieve memory efficiency, we divided the n × n matrix into smaller sub-matrices of size p × p. EmbedTAD compares the total number of bins with a threshold of 5000 bins. If the total bin count exceeds this threshold, the algorithm determines the optimal number of bins, which is used to divide the entire matrix into smaller sub-matrices. These sub-matrices make the data easier to process without overwhelming memory.

To determine the optimal number of sub-matrices, we used the following equation:

$$ns=\left\lceil \frac{{t}_{bins}}{t}\right\rceil$$

(2)

where ns = number of sub-matrices, t_bins = total number of bins, and t = threshold. We then determined the sub-matrix shape as $(p\times p)=\left(\lceil \frac{{t}_{bins}}{ns}\rceil \times \lceil \frac{{t}_{bins}}{ns}\rceil \right)$.
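
As a worked illustration of Equation (2), the following minimal sketch computes the number of sub-matrices and the sub-matrix side length. The function name `submatrix_plan` is hypothetical and is not taken from the EmbedTAD source code; only the 5000-bin threshold comes from the text above.

```python
import math


def submatrix_plan(total_bins, threshold=5000):
    """Return (ns, p): the number of sub-matrices and the side length of each.

    Hypothetical helper illustrating Equation (2); not the EmbedTAD implementation.
    """
    if total_bins <= threshold:
        # The whole matrix fits under the threshold, so it is not divided.
        return 1, total_bins
    ns = math.ceil(total_bins / threshold)  # Equation (2)
    p = math.ceil(total_bins / ns)          # side length of a p x p sub-matrix
    return ns, p
```

For example, a chromosome with 12,000 bins gives ns = ⌈12000 / 5000⌉ = 3 and p = ⌈12000 / 3⌉ = 4000.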

Since dividing the entire matrix into smaller sub-matrices may disregard boundary regions, we expanded each p_{i+1} × p_{i+1} sub-matrix on the top and left by q = 3 Mb to accommodate border cases. This overlapping region, denoted Q, was added from the previous p_i × p_i sub-matrix (Fig. 1B). This resolves the issue of potentially losing TADs at the boundaries. In general, we denote each sub-matrix as p_i × p_i.
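
The overlap scheme can be sketched as follows. This is an illustrative reading of the description above, under the assumption that the 3 Mb extension is converted to bins by integer division with the map resolution. The name `submatrix_windows` and its parameters are hypothetical.

```python
def submatrix_windows(total_bins, p, resolution_bp, q_bp=3_000_000):
    """Return half-open (start_bin, end_bin) windows along the diagonal.

    Every window after the first is extended backwards by q_bp (the Q region)
    so that it overlaps the end of the previous window.
    Hypothetical sketch; not the EmbedTAD implementation.
    """
    q_bins = q_bp // resolution_bp  # assumption: Q is expressed in whole bins
    windows = []
    for start in range(0, total_bins, p):
        end = min(start + p, total_bins)
        extended_start = max(0, start - q_bins)  # the first window has nothing to borrow
        windows.append((extended_start, end))
    return windows
```

At 10 kb resolution, 12,000 bins with p = 4000 give the windows (0, 4000), (3700, 8000), and (7700, 12000), so each later window shares 300 bins (3 Mb) with its predecessor.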

After this step, we normalized the contact matrix using a Gaussian filter to remove potential noise. Applying the Gaussian filter improved the performance of our pipeline by 15%–30%. The Gaussian filter normalizes the interaction frequencies, with stronger interactions concentrated along the central diagonal.

Finally, we created the graph structure to feed into the Clustering module to determine the clusters.
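
To make the graph representation concrete, the sketch below converts a symmetric contact matrix into a weighted edge list in which bins are nodes and interaction frequencies are edge weights. Skipping the diagonal and zero-count pairs is an assumption made for illustration, and `contact_matrix_to_edges` is a hypothetical name.

```python
def contact_matrix_to_edges(matrix):
    """Return (bin_i, bin_j, weight) triples from a symmetric contact matrix.

    Only the upper triangle is read. Self-contacts and zero counts are
    skipped, which is an assumption of this sketch.
    """
    n = len(matrix)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            weight = matrix[i][j]
            if weight > 0:
                edges.append((i, j, float(weight)))
    return edges
```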

Clustering

The Clustering module is a key component of EmbedTAD’s pipeline, where we apply graph embedding and clustering to identify TADs (Fig. 1A). Each cluster is defined as one TAD. In the data preprocessing step, we represent the Hi-C contact matrix as a graph. Proper representation is crucial, as the graph structure contains all the necessary information; without it, processing the data becomes computationally expensive. Embedding addresses this by representing the interaction frequencies of each node as a vector. For instance, a node i may have varying contact frequencies with neighboring nodes x (left) and y (right), and these nodes may have different connections with other nodes, each with distinct contact frequencies. To address this complexity, we applied the NetMF algorithm²⁷ with varying embedding sizes; an embedding size of e = 455 produced the optimal result (see Determining Optimal Embedding Size and Validation of Optimal Embedding Size Using TAD Quality Metric in supplementary information). NetMF uses network matrix factorization to embed the graph into a lower-dimensional space while capturing the most important features.

Once the data embedding is completed, we utilize HDBSCAN²⁴ to perform clustering. HDBSCAN is an advanced clustering algorithm capable of handling variable densities. Since TAD regions do not have a fixed cluster size and exhibit varying densities with hierarchical properties, HDBSCAN is well-suited for this task compared to other clustering algorithms. We input the embedded data from the previous step into the HDBSCAN algorithm to identify potential TAD regions. The output from HDBSCAN consists of clusters of varying sizes, and not all clusters are considered TADs. To determine TAD size, we focus on regions ranging from 100 Kb to 5 Mb, which corresponds to the typical TAD size in mammalian organisms⁸. Clusters within this size range are labeled as TADs, while clusters outside this range are discarded.
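
The size filter can be sketched as below. Two assumptions are made for illustration: each cluster is read as a contiguous run of bins with the same label, and noise points carry the label -1, as is conventional in common HDBSCAN implementations. The name `labels_to_tads` is hypothetical.

```python
def labels_to_tads(labels, resolution_bp, min_bp=100_000, max_bp=5_000_000):
    """Turn per-bin cluster labels into (start_bp, end_bp) TAD intervals.

    Runs of identical labels are treated as candidate domains. Noise runs
    (label -1) and runs outside [min_bp, max_bp] are discarded.
    Hypothetical sketch; not the EmbedTAD implementation.
    """
    tads = []
    run_start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the list or when the label changes.
        if i == len(labels) or labels[i] != labels[run_start]:
            size_bp = (i - run_start) * resolution_bp
            if labels[run_start] != -1 and min_bp <= size_bp <= max_bp:
                tads.append((run_start * resolution_bp, i * resolution_bp))
            run_start = i
    return tads
```

At 50 kb resolution, a run of 5 bins (250 kb) is kept, a single-bin cluster (50 kb) is dropped as too small, and a run of 30 bins (1.5 Mb) is kept.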

NetMF

We utilized NetMF²⁷ in our pipeline, a closed-form matrix factorization embedding framework that demonstrated improvement over other embedding frameworks such as DeepWalk⁴², LINE⁴³, and node2vec⁴⁴. We directly incorporated the NetMF implementation from Rozemberczki et al.⁴⁵ into EmbedTAD, and we implemented its GPU-compatible version with PyTorch and cuGraph. Qiu et al.²⁷ expressed the objective functions of LINE, PTE⁴⁶, DeepWalk, and node2vec as matrix factorizations in closed form, and proposed a network embedding, NetMF, based on the DeepWalk matrix, as it is both computationally efficient and a more generalized formulation. Consider an undirected graph, G = (V, E, A), with the following properties: $A\in {{\mathbb{R}}}_{+}^{| V| \times | V| }$ is the adjacency matrix; Δ_row = diag(Ae), ${\Delta }_{col}=diag({A}^{T}e)$, and Δ = diag(a₁, …, a_∣V∣) are diagonal matrices, where e is the all-ones vector and Δ_row = Δ_col for an undirected graph; vol(G) = ∑_i∑_jA_ij = ∑_ia_i is the volume of the graph; a_i is the generalized degree of vertex i; W_s is the window size; and n_s is the number of negative samples. Based on these properties, LINE expresses its objective function as a maximization problem,

$${\mathbb{L}}=\mathop{\sum }_{i}^{| V| }{\sum }_{j}^{| V| }{A}_{ij}\left(\log \sigma \left({x}_{i}^{T}{y}_{j}\right)+{n}_{s}{E}_{{j}^{{\prime} } \sim {N}_{d}}\left[\log \sigma \left(-{x}_{i}^{T}{y}_{{j}^{{\prime} }}\right)\right]\right)$$

(3)

where $X,Y\in {{\mathbb{R}}}^{| V| \times a}$ are the embedding matrices whose rows are x_i and y_i (i = 1, …, ∣V∣), σ is the sigmoid function, and ${N}_{d}(j)\propto {a}_{j}^{\frac{3}{4}}$ is the noise distribution. Qiu et al.²⁷ expressed this objective function, ${\mathbb{L}}$ (Equation (3)), in matrix factorization form,

$$\log \left(vol(G){\Delta }^{-1}A{\Delta }^{-1}\right)-\log {n}_{s}=X{Y}^{T}$$

(4)

PTE is a variant of the LINE algorithm that treats the graph as multiple networks and factorizes their Laplacians jointly. It considers the graph, G = (V₁ ∪ V₂, E, A), as a heterogeneous network, specifically a bipartite network. Its objective function is described as a maximization problem similar to that of LINE, where E ⊆ V₁ × V₂, $A\in {{\mathbb{R}}}_{+}^{| {V}_{1}| \times | {V}_{2}| }$, and $vol(G)={\sum }_{i}^{| {V}_{1}| }\mathop{\sum }_{j}^{| {V}_{2}| }{A}_{ij}$. Similar to LINE (Equation (3)), PTE’s objective function is also expressed as a matrix factorization in closed form,

$$\log \left(vol(G){\Delta }_{row}^{-1}A{\Delta }_{col}^{-1}\right)-\log {n}_{s}=X{Y}^{T}$$

(5)

DeepWalk performs an implicit matrix factorization and is primarily based on skip-gram with negative sampling (SGNS). Following Levy and Goldberg⁴⁷, SGNS implicitly factorizes the matrix

$$\log \left(\frac{\#({w}_{c},{c}_{w})| {{\mathcal{C}}}| }{\#({w}_{c}).\#({c}_{w})}\right)-\log {n}_{s}$$

(6)

where C is the corpus, w_c is a word in the corpus, and c_w is a context. Qiu et al.²⁷ showed that DeepWalk produces a low-rank transformation of the normalized Laplacian matrix in closed form, and that LINE (second-order proximity) is a special case of DeepWalk in which the window size is W_s = 1. As the random-walk length R_w → ∞, Equation (6) yields the DeepWalk objective function as a closed-form matrix factorization,

$$\log \left(\frac{vol(G)}{{W}_{s}}\left(\mathop{\sum }_{r}^{{W}_{s}}{\rho }^{r}\right){\Delta }^{-1}\right)-\log {n}_{s}$$

(7)

where ρ = Δ⁻¹A. Considering transition probability tensor, ${\underline{\tau }}^{r}$, $\mu ={w}_{{c}_{j-1}}^{n}$ and stationary distribution, π, Qiu et al.²⁷ expressed node2vec in closed form

$$\frac{\#({w}_{c},{c}_{w})| {{\mathcal{C}}}| }{\#({w}_{c}).\#({c}_{w})}{\to }^{param}\frac{\frac{1}{2{W}_{s}}{\sum }_{r}^{{W}_{s}}\left({\sum }_{\mu }{\pi }_{{w}_{c}\mu }{\underline{\tau }}_{{c}_{w}{w}_{c}\mu }^{r}+{\sum }_{\mu }{\pi }_{{c}_{w}\mu }{\underline{\tau }}_{{w}_{c}{c}_{w}\mu }^{r}\right)}{\left({\sum }_{\mu }{\pi }_{{w}_{c}\mu }\right)\left({\sum }_{\mu }{\pi }_{{c}_{w}\mu }\right)}$$

(8)

Based on the closed-form matrix factorization analyses of other embedding algorithms (Equations (4), (5), (7), (8)), Qiu et al.²⁷ developed a network embedding algorithm, NetMF, which unified these embedding approaches under a single baseline. Among the matrices considered, the DeepWalk matrix is more general and computationally efficient, and NetMF was designed with careful consideration of the skip-gram theoretical foundation, negative sampling, and the DeepWalk matrix. Qiu et al.²⁷ proposed NetMF for both small and large context window sizes, where the key difference lies in matrix definition. For small window sizes, they directly used the DeepWalk matrix,

$$\chi =\frac{vol(G)}{{n}_{s}{W}_{s}}\left(\mathop{\sum }_{r}^{{W}_{s}}{\rho }^{r}\right){\Delta }^{-1}$$

(9)

while for large window sizes, they approximated a matrix for calculation efficiency. For small window size, W_s, NetMF computes DeepWalk matrix, χ and then defines,

$${\chi }^{{\prime} }=max(\chi ,1)$$

(10)

inspired by the Shifted PPMI approach⁴⁷. Since the direct computation of $\log \chi$ is difficult and expensive, they proposed ${\chi }^{{\prime} }$, calculated $\log {\chi }^{{\prime} }$ element-wise, and applied Singular Value Decomposition (SVD) to factorize it:

$$\log {\chi }^{{\prime} }={U}_{a}{\Sigma }_{a}{V}_{a}^{T}$$

(11)

Finally, the rank-a network embedding is generated as ${U}_{a}\sqrt{{\Sigma }_{a}}$. Qiu et al.²⁷ also showed that the matrix χ has a close relationship with the normalized graph Laplacian, which simplifies the computation for a large window, W_s. Specifically, they approximated the normalized adjacency matrix by its top-h eigenpairs as

$${\Delta }^{-\frac{1}{2}}A{\Delta }^{-\frac{1}{2}}\approx {U}_{h}{\Lambda }_{h}{U}_{h}^{T}$$

(12)

Then, they computed the approximate matrix as:

$$\hat{\chi }=\frac{vol(G)}{{n}_{s}}{\Delta }^{-\frac{1}{2}}{U}_{h}\left(\frac{1}{{W}_{s}}{\sum }_{r}^{{W}_{s}}{\Lambda }_{h}^{r}\right){U}_{h}^{T}{\Delta }^{-\frac{1}{2}}$$

(13)

The remaining steps follow the same procedure as in the small-window case: Equations (10) and (11) are applied to $\hat{\chi }$ to generate embeddings of the input graph for a large window size. Detailed proofs of each equation can be found in Qiu et al.²⁷.
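
The two NetMF variants can be illustrated with a small dense NumPy sketch of Equations (9) to (13). It is not the Karate Club implementation used in EmbedTAD. The function names, the dense linear algebra, and the default window size and negative-sample count are assumptions chosen for readability, and the graph is assumed to be undirected with no isolated nodes.

```python
import numpy as np


def deepwalk_matrix(A, window=2, neg=1):
    """Equation (9): chi = vol(G) / (n_s W_s) * (sum_{r=1..W_s} rho^r) Delta^{-1}."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)                # generalized degrees a_i
    vol = deg.sum()                    # vol(G)
    rho = A / deg[:, None]             # rho = Delta^{-1} A
    power = np.eye(len(A))
    total = np.zeros_like(rho)
    for _ in range(window):
        power = power @ rho            # rho^r
        total += power
    # Right-multiplying by Delta^{-1} divides column j by a_j.
    return vol / (neg * window) * total / deg[None, :]


def deepwalk_matrix_spectral(A, h, window=2, neg=1):
    """Equations (12)-(13): approximate chi from the top-h eigenpairs."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    vol = deg.sum()
    inv_sqrt = 1.0 / np.sqrt(deg)
    norm_adj = inv_sqrt[:, None] * A * inv_sqrt[None, :]
    evals, evecs = np.linalg.eigh(norm_adj)
    top = np.argsort(evals)[-h:]       # assumption: keep the h largest eigenvalues
    lam, U = evals[top], evecs[:, top]
    filt = sum(lam ** r for r in range(1, window + 1)) / window
    core = (U * filt) @ U.T            # U_h diag(filt) U_h^T
    return vol / neg * inv_sqrt[:, None] * core * inv_sqrt[None, :]


def netmf_embedding(chi, dim):
    """Equations (10)-(11): element-wise log of max(chi, 1), then truncated SVD."""
    log_chi = np.log(np.maximum(chi, 1.0))
    U, s, _ = np.linalg.svd(log_chi)
    return U[:, :dim] * np.sqrt(s[:dim])  # U_a sqrt(Sigma_a)
```

When h equals the number of nodes, the spectral form reproduces the exact DeepWalk matrix, which gives a convenient sanity check. In practice h is chosen much smaller than ∣V∣, and sparse eigensolvers would replace the dense `eigh` call.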

HDBSCAN

HDBSCAN²⁴ is a hierarchical density-based clustering method that addresses several issues with well-known density-based and hierarchical clustering algorithms such as DBSCAN⁴¹, DENCLUE⁴⁸, and OPTICS⁴⁹. Most density-based algorithms, including DBSCAN, cluster data non-hierarchically using a global density threshold. Because this threshold is defined globally, they are unable to accurately characterize nested or variable densities²⁴. Additionally, other density-based or hierarchical clustering algorithms depend on sensitive parameters; for example, DBSCAN requires two inputs: Eps, the maximum distance between two points, and minPts, the minimum number of points in a neighborhood. It can be challenging to determine the ideal values for these parameters, particularly when dealing with big datasets like genomic data. HDBSCAN addresses these limitations and also offers greater flexibility in identifying clusters across a range of datasets²⁴. Moreover, HDBSCAN yields more stable clusters from the tree of discovered clusters, which is another advantage over other hierarchical clustering algorithms. Campello et al.²⁴ approached this challenge as a maximization problem and optimized it by taking into account the smallest number of points that a cluster may contain.

HDBSCAN²⁴ extends the DBSCAN⁴¹ and OPTICS⁴⁹ algorithms. It identifies clusters of different densities by varying the epsilon value, generates a tree data structure to find the significant hierarchical clusters, and introduces a stability mechanism for clusters. Initially, HDBSCAN computes the core distance Δ_c(x_p) of every point x_p ∈ X with respect to μ, the minimum number of points, such that Δ_c(x_p) ≤ ε, where ε is the maximum distance between two points. From the derived set of core distances, it calculates the mutual reachability distance, in which the two-argument term Δ_c(x_p, x_q) denotes the distance between x_p and x_q:

$$\nabla ({x}_{p},{x}_{q})=\max \left\{{\Delta }_{c}({x}_{p}),\ {\Delta }_{c}({x}_{q}),\ {\Delta }_{c}({x}_{p},{x}_{q})\right\}$$

(14)

Next, it builds a Minimum Spanning Tree (MST) over the mutual reachability distances, starting from disjoint single-node trees and iteratively adding the lowest-weighted edges. It then extends the MST by adding a self-loop to each node, weighted by the core distance of the corresponding object. After creating the MST, it builds a dendrogram from which the clusters are extracted. It starts from the root, where all points form a single cluster, and iteratively removes edges in descending order of weight, assigning the resulting clusters to the descendant nodes.
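
The first two HDBSCAN stages can be illustrated on one-dimensional points. The sketch below computes core distances, the mutual reachability distance of Equation (14), and an MST with Prim's algorithm. It is a toy illustration rather than the HDBSCAN library code: the function names are hypothetical, and the core distance is assumed to be the distance to the μ-th nearest neighbor counting the point itself.

```python
import math


def core_distances(points, min_pts):
    """Distance from each point to its min_pts-th nearest neighbor (self included)."""
    cores = []
    for p in points:
        dists = sorted(abs(p - q) for q in points)  # dists[0] == 0 is the point itself
        cores.append(dists[min_pts - 1])
    return cores


def mutual_reachability(points, cores):
    """Equation (14): max of both core distances and the pairwise distance."""
    n = len(points)
    return [
        [0 if i == j else max(cores[i], cores[j], abs(points[i] - points[j]))
         for j in range(n)]
        for i in range(n)
    ]


def minimum_spanning_tree(weights):
    """Prim's algorithm on a dense weight matrix; returns (u, v, weight) edges."""
    n = len(weights)
    in_tree = [False] * n
    best = [math.inf] * n
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest node not yet attached to the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, weights[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and weights[u][v] < best[v]:
                best[v] = weights[u][v]
                parent[v] = u
    return edges
```

For the points 0, 1, 2, 10, 11, 12 with μ = 2, every core distance is 1, and the MST contains four edges of weight 1 and one edge of weight 8 that bridges the two groups. Removing that heaviest edge first is what separates the two dense regions in the dendrogram.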

Output

EmbedTAD’s Output module integrates all TAD regions generated by the Clustering module. During the preprocessing step, we divided the Hi-C contact matrix into sub-matrices to accelerate graph generation and processing. For each sub-matrix, EmbedTAD’s Clustering module identifies TAD regions corresponding to a specific portion of the Hi-C contact matrix. However, to obtain continuous and distinct TADs across the entire matrix, we remove overlapping or redundant TADs in the Q regions (Fig. 1B).

As described earlier, in the preprocessing step, we extended each sub-matrix p_{i+1} × p_{i+1} by q = 3 Mb from the previous sub-matrix p_i × p_i. This extension can result in overlapping or duplicate TADs in the Q region shared between the two sub-matrices. To resolve this, we calculate the TAD Quality (TQ) score¹² for each overlapping Q region in both p_i × p_i and p_{i+1} × p_{i+1}. The TQ score favors high intra-TAD interactions and low inter-TAD interactions, ensuring that TADs are well-defined. We then retain the TADs from the sub-matrix with the higher TQ score. This selection process is defined by the following equations:

$${Q}^{TAD}={({p}_{i}\times {p}_{i})}^{Q}\cup {({p}_{i+1}\times {p}_{i+1})}^{Q}$$

(15)

$${Q}^{TAD}\in {\left\{{({p}_{i}\times {p}_{i})}^{Q}\right\}}^{TAD}={p}_{i}^{TAD}$$

(16)

$${Q}^{TAD}\in {\left\{{({p}_{i+1}\times {p}_{i+1})}^{Q}\right\}}^{TAD}={p}_{i+1}^{TAD}$$

(17)

$$f\left({p}_{i}^{TAD}\right)=T{Q}_{i}$$

(18)

$$f\left({p}_{i+1}^{TAD}\right)=T{Q}_{i+1}$$

(19)

$${Q}_{f}^{TAD}=\left\{\begin{array}{ll}{p}_{i}^{TAD},\quad &{\mbox{if}}\ T{Q}_{i} > T{Q}_{i+1}\\ {p}_{i+1}^{TAD},\quad &{\mbox{otherwise}}\end{array}\right.$$

(20)

Equation (15) defines the union of TADs from both overlapping sub-matrices. Equations (16) and (17) assign TADs from each respective sub-matrix. Equations (18) and (19) calculate the TAD Quality scores for each sub-matrix, and finally, Equation (20) selects the TADs from the sub-matrix with the higher TQ score to ensure the best representation in the overlapping Q region.

After resolving overlapping or redundant TADs, we merge the retained TADs from both the previous and current sub-matrices to produce continuous, distinct TAD regions. This ensures comprehensive coverage of the entire contact matrix, including boundary regions, while maintaining high TAD Quality scores.
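
The selection rule in Equation (20) can be sketched as follows. The scoring function here is a simplified stand-in for the TQ score (mean intra-TAD contact minus mean contact between a TAD and all bins outside it) and is not the definition from ClusterTAD¹². The names `tq_proxy` and `select_overlap_tads` are hypothetical.

```python
def _mean(values):
    return sum(values) / len(values) if values else 0.0


def tq_proxy(matrix, tads):
    """Simplified TAD-quality proxy for a list of half-open (start, end) bin intervals."""
    n = len(matrix)
    scores = []
    for start, end in tads:
        intra = [matrix[i][j] for i in range(start, end) for j in range(i + 1, end)]
        outside = [k for k in range(n) if k < start or k >= end]
        inter = [matrix[i][k] for i in range(start, end) for k in outside]
        scores.append(_mean(intra) - _mean(inter))
    return _mean(scores)


def select_overlap_tads(matrix, tads_prev, tads_next):
    """Equation (20): keep the previous sub-matrix's TADs only if they score strictly higher."""
    tq_prev = tq_proxy(matrix, tads_prev)
    tq_next = tq_proxy(matrix, tads_next)
    return tads_prev if tq_prev > tq_next else tads_next
```

On a toy Q region made of two clean three-bin blocks, the partition that follows the blocks scores higher than one whose boundary is shifted by a bin, so it is the one retained.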

The final output of EmbedTAD is provided in a BED-like format, representing the start and end positions of each detected TAD region.

Statistics and Reproducibility

The Results and Methods sections describe the statistics and reproducibility of our experiments. EmbedTAD was assessed using the in-silico Hi-C dataset, and the pipeline was then verified using real Hi-C datasets. We used in-silico Hi-C data at various noise levels to assess EmbedTAD with SI, DBI, and CHI, and we evaluated its performance using MoC and the TAD Quality score. On the in-silico Hi-C dataset, we compared the true TADs with the EmbedTAD-detected TADs. Using the real Hi-C datasets, we examined the TADadjR² score and the average ChIP-seq signal across several organisms and resolutions. We also assessed the number of TADs, the distribution of TAD sizes, and the IS of EmbedTAD-detected TADs at various resolutions and organisms. To biologically validate the TADs that EmbedTAD detected from real Hi-C datasets, we used distinct ChIP-seq signals, including CTCF, RAD21, and H3K27ac, across a variety of resolutions and organisms. All of our analyses are contained in the data, tables, and figures of this manuscript and its supplementary information file.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

In-silico Hi-C data were downloaded from HiCToolsCompare. Hi-C contact maps of human lymphoblastoid cells (GM12878) and mouse lymphoma cells (CH12LX) were downloaded from NCBI GEO GSE63525³. ChIP-seq signal data were downloaded from https://www.encodeproject.org/, https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/, and https://hgdownload.soe.ucsc.edu/goldenPath/mm9/database/. mESC data were downloaded from GSE35156⁸. Mus musculus data were downloaded from GSE210418⁴⁰. We used https://hicexplorer.readthedocs.io/en/latest/index.html, https://github.com/kmiles18/TAD-callers-comparison, https://github.com/vaquerizaslab/tadtool, and https://github.com/XiaoTaoWang/TADLib?tab=readme-ov-file to plot some of our analysis results. The source data for the experimental results and analysis data are available at https://github.com/OluwadareLab/EmbedTAD/tree/main/ra_data.

Code availability

The EmbedTAD source code is freely available at https://github.com/OluwadareLab/EmbedTAD. The EmbedTAD documentation is available at: https://github.com/OluwadareLab/EmbedTAD/wiki.

References

1. Kilpinen, H. & Dermitzakis, E. T. Genetic and epigenetic contribution to complex traits. Hum. Mol. Genet. 21, R24–R28 (2012).
2. Sexton, T. et al. Three-dimensional folding and functional organization principles of the drosophila genome. Cell 148, 458–472 (2012).
3. Rao, S. S. et al. A 3d map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
4. Pombo, A. & Dillon, N. Three-dimensional genome architecture: players and mechanisms. Nat. Rev. Mol. Cell Biol. 16, 245–257 (2015).
5. Oluwadare, O., Highsmith, M. & Cheng, J. An overview of methods for reconstructing 3-d chromosome and genome structures from hi-c data. Biol. Proced. Online 21, 1–20 (2019).
6. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).
7. Nora, E. P. et al. Spatial partitioning of the regulatory landscape of the x-inactivation centre. Nature 485, 381–385 (2012).
8. Dixon, J. R. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012).
9. Hong, S. & Kim, D. Computational characterization of chromatin domain boundary-associated genomic elements. Nucleic acids Res. 45, 10403–10414 (2017).
10. Dekker, J. & Mirny, L. The 3d genome as moderator of chromosomal communication. Cell 164, 1110–1121 (2016).
11. Gong, H. et al. Caspian: A method to identify chromatin topological associated domains based on spatial density cluster. Computational Struct. Biotechnol. J. 20, 4816–4824 (2022).
12. Oluwadare, O. & Cheng, J. Clustertad: an unsupervised machine learning approach to detecting topologically associated domains of chromosomes from hi-c data. BMC Bioinforma. 18, 1–14 (2017).
13. Haddad, N., Vaillant, C. & Jost, D. Ic-finder: inferring robustly the hierarchical organization of chromatin folding. Nucleic acids Res. 45, e81–e81 (2017).
14. Lévy-Leduc, C., Delattre, M., Mary-Huard, T. & Robin, S. Two-dimensional segmentation for analyzing hi-c data. Bioinformatics 30, i386–i392 (2014).
15. Shin, H. et al. Topdom: an efficient and deterministic method for identifying topological domains in genomes. Nucleic acids Res. 44, e70–e70 (2016).
16. Chen, J., Hero III, A. O. & Rajapakse, I. Spectral identification of topological domains. Bioinformatics 32, 2151–2158 (2016).
17. Zufferey, M., Tavernari, D., Oricchio, E. & Ciriello, G. Comparison of computational methods for the identification of topologically associating domains. Genome Biol. 19, 217 (2018).
18. Sefer, E. A comparison of topologically associating domain callers over mammals at high resolution. BMC Bioinforma. 23, 127 (2022).
19. Liu, K., Li, H.-D., Li, Y., Wang, J. & Wang, J. A comparison of topologically associating domain callers based on hi-c data. IEEE/ACM Trans. Computational Biol. Bioinforma. 20, 15–29 (2022).
20. Xu, J. et al. A comprehensive benchmarking with interpretation and operational guidance for the hierarchy of topologically associating domains. Nat. Commun. 15, 4376 (2024).
21. Filippova, D., Patro, R., Duggal, G. & Kingsford, C. Identification of alternative topological domains in chromatin. Algorithms Mol. Biol. 9, 1–11 (2014).
22. Liu, E. et al. Identifying tad-like domains on single-cell hi-c data by graph embedding and changepoint detection. Bioinformatics 40, btae138 (2024).
23. Lloyd, S. Least squares quantization in pcm. IEEE Trans. Inf. theory 28, 129–137 (1982).
24. Campello, R. J., Moulavi, D. & Sander, J. Density-based clustering based on hierarchical density estimates. In Pacific-Asia conference on knowledge discovery and data mining, 160–172 (Springer, 2013).
25. Hovenga, V., Kalita, J. & Oluwadare, O. Hic-gnn: A generalizable model for 3d chromosome reconstruction using graph convolutional neural networks. Computational Struct. Biotechnol. J. 21, 812–836 (2023).
26. Wang, Y. & Cheng, J. Reconstructing 3d chromosome structures from single-cell hi-c data with so (3)-equivariant graph neural networks. NAR Genomics Bioinforma. 7, lqaf027 (2025).
27. Qiu, J. et al. Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec. In Proceedings of the eleventh ACM international conference on web search and data mining, 459–467 (2018).
28. Forcato, M. et al. Comparison of computational methods for hi-c data analysis. Nat. methods 14, 679–685 (2017).
29. Pfitzner, D., Leibbrandt, R. & Powers, D. Characterization and evaluation of similarity measures for pairs of clusterings. Knowl. Inf. Syst. 19, 361–394 (2009).
30. Li, X., Zeng, G., Li, A. & Zhang, Z. Detoki identifies and characterizes the dynamics of chromatin tad-like domains in a single cell. Genome Biol. 22, 217 (2021).
31. Sanborn, A. L. et al. Chromatin extrusion explains key features of loop and domain formation in wild-type and engineered genomes. Proc. Natl Acad. Sci. USA 112, E6456–E6465 (2015).
32. An, L. et al. Ontad: hierarchical domain structure reveals the divergence of activity among tads and boundaries. Genome Biol. 20, 1–16 (2019).
33. Wang, X.-T., Cui, W. & Peng, C. Hitad: detecting the structural and functional hierarchies of topologically associating domains from chromatin interactions. Nucleic Acids Res. 45, e163–e163 (2017).
34. Kruse, K., Hug, C. B., Hernández-Rodríguez, B. & Vaquerizas, J. M. Tadtool: visual parameter identification for tad-calling algorithms. Bioinformatics 32, 3190–3192 (2016).
35. Wolff, J. et al. Galaxy hicexplorer 3: a web server for reproducible hi-c, capture hi-c and single-cell hi-c data analysis, quality control and visualization. Nucleic Acids Res. 48, W177–W184 (2020).
36. Rosen, J. et al. Hptad: A computational method to identify topologically associating domains from hichip and plac-seq datasets. Comput. Struct. Biotechnol. J. 21, 931–939 (2023).
37. Lee, D., Kang, J. & Kim, A. Tad-dependent sub-tad is required for enhancer–promoter interaction enabling the β-globin transcription. FASEB J. 38, e70181 (2024).
38. Fang, R. et al. Mapping of long-range chromatin interactions by proximity ligation-assisted chip-seq. Cell Res. 26, 1345–1348 (2016).
39. Zhu, J. & Paul, W. E. Cd4 t cells: fates, functions, and faults. Blood J. Am. Soc. Hematol. 112, 1557–1569 (2008).
40. Zhang, G., Li, Y. & Wei, G. Multi-omic analysis reveals dynamic changes of three-dimensional chromatin architecture during t cell differentiation. Commun. Biol. 6, 773 (2023).
41. Ester, M. et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In kdd, vol. 96, 226–231 (1996).
42. Perozzi, B., Al-Rfou, R. & Skiena, S. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 701–710 (2014).
43. Tang, J. et al. Line: Large-scale information network embedding. In Proceedings of the 24th international conference on world wide web, 1067–1077 (2015).
44. Grover, A. & Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 855–864 (2016).
45. Rozemberczki, B., Kiss, O. & Sarkar, R. Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM ’20), 3125–3132 (ACM, 2020).
46. Tang, J., Qu, M. & Mei, Q. Pte: Predictive text embedding through large-scale heterogeneous text networks. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 1165–1174 (2015).
47. Levy, O. & Goldberg, Y. Neural word embedding as implicit matrix factorization. Advances in neural information processing systems 27 (2014).
48. Hinneburg, A. & Keim, D. A. A general approach to clustering in large databases with noise. Knowl. Inf. Syst. 5, 387–415 (2003).
49. Ankerst, M., Breunig, M. M., Kriegel, H.-P. & Sander, J. Optics: Ordering points to identify the clustering structure. ACM Sigmod Rec. 28, 49–60 (1999).

Acknowledgements

This work was supported by the National Institutes of General Medical Sciences of the National Institutes of Health under award number R35GM150402 to O.O.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of North Texas, Denton, TX, USA
H. M. A. Mohit Chowdhury & Oluwatosin Oluwadare
Center for Computational Life Sciences, University of North Texas, Denton, TX, USA
H. M. A. Mohit Chowdhury & Oluwatosin Oluwadare

Authors

H. M. A. Mohit Chowdhury
Oluwatosin Oluwadare

Contributions

H.M.A.M.C. conducted the analysis, wrote, and revised the manuscript and O.O. conceived, wrote, revised the manuscript, and supervised this project. All authors reviewed the manuscript.

Corresponding author

Correspondence to Oluwatosin Oluwadare.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Communications Biology thanks Emre Sefer and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Leelavati Narlikar and Kaliya Georgieva. [A peer review file is available].

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Transparent Peer Review file

Supplementary Material

Description of Additional Supplementary Files

Supplementary Data 1

Supplementary Data 2

Supplementary Data 3

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Chowdhury, H.M.A.M., Oluwadare, O. EmbedTAD Using Graph Embedding and Unsupervised Learning to Identify TADs from High-Resolution Hi-C Data. Commun Biol 9, 7 (2026). https://doi.org/10.1038/s42003-025-09224-z

Received: 11 April 2025
Accepted: 11 November 2025
Published: 09 December 2025
Version of record: 03 January 2026
DOI: https://doi.org/10.1038/s42003-025-09224-z