SnapFISH-IMPUTE: an imputation method for multiplexed DNA FISH data

Yu, Hongyu; Wu, Daiqing; Mishra, Shreya; Shen, Guning; Sun, Huaigu; Hu, Ming; Li, Yun

doi:10.1038/s42003-024-06428-7

Download PDF

Article
Open access
Published: 09 July 2024

SnapFISH-IMPUTE: an imputation method for multiplexed DNA FISH data

Hongyu Yu^1,2,
Daiqing Wu³,
Shreya Mishra⁴,
Guning Shen^5,6,
Huaigu Sun⁷,
Ming Hu ORCID: orcid.org/0000-0003-0987-2916⁴ &
…
Yun Li ORCID: orcid.org/0000-0002-9275-4189^5,7,8

Communications Biology volume 7, Article number: 834 (2024) Cite this article

1958 Accesses
1 Citations
1 Altmetric
Metrics details

Subjects

Abstract

Chromatin spatial organization plays a crucial role in gene regulation. Recently developed and prospering multiplexed DNA FISH technologies enable direct visualization of chromatin conformation in the nucleus. However, incomplete data caused by limited detection efficiency can substantially complicate and impair downstream analysis. Here, we present SnapFISH-IMPUTE that imputes missing values in multiplexed DNA FISH data. Analysis on multiple published datasets shows that the proposed method preserves the distribution of pairwise distances between imaging loci, and the imputed chromatin conformations are indistinguishable from the observed conformations. Additionally, imputation greatly improves downstream analyses such as identifying enhancer-promoter loops and clustering cells into distinct cell types. SnapFISH-IMPUTE is freely available at https://github.com/hyuyu104/SnapFISH-IMPUTE.

SnapFISH: a computational pipeline to identify chromatin loops from multiplexed DNA FISH data

Article Open access 12 August 2023

Simultaneous visualization of DNA loci in single cells by combinatorial multi-color iFISH

Article Open access 10 February 2022

Preparation of long single-strand DNA concatemers for high-level fluorescence in situ hybridization

Article Open access 25 October 2021

Introduction

Chromatin conformation capture (3C) methods and their derivatives have been widely used for studying three-dimensional (3D) chromatin conformations¹. Sub-nuclear structures such as chromosome territories, topologically associating domains (TADs), and enhancer-promoter loops can all be identified from data generated by 3C-based methods^2,3. However, since 3C-based methods depend on crosslinking DNA fragments, downstream analyses often start from the pairwise contact matrices, and reconstructing 3D models from these matrices is challenging^4,5,6. Recent developments in fluorescence in situ hybridization (FISH)-based technologies, such as DNA multiplexed error-robust FISH (MERFISH)⁷ and DNA seqFISH+⁸, in contrast, do not rely on proximity ligation. Instead, a set of DNA probes is designed to cross-hybridize the genomic region of interest. By sequentially binding DNA probes with different fluorescence signals, the identity of each probe can be resolved, thus enabling direct visualization of DNA fragments in each single cell⁹. However, the quality of FISH-based data decreases as the number of hybridization rounds increases. As a result, the average detection efficiency is below 80% for most chromatin conformation datasets with a resolution higher than 30 Kb^{7,8,10,11,12,13}.

Compared to 3C-based data, which only have dropouts represented by zeros in their contact matrices, FISH-based data have unobserved 3D coordinates that cannot be simply filled by a single value such as zero. This missing data problem makes many existing algorithms inapplicable. For example, none of the classical dimension-reduction algorithms for visualizations, such as principal component analysis (PCA) or uniform manifold approximation and projection (UMAP) can handle missing values. To account for such missingness, two types of imputation strategies are adopted in published literature^8,14. The first approach imputes the pairwise distances between unobserved imaging loci. A missing entry in the distance matrix is imputed either by the average of neighboring entries⁸ or the average of that entry across all cells (mean imputation)¹⁵. This approach is appropriate if downstream analyses only depend on pairwise distances; otherwise, the problem of missing data pertains since the 3D coordinates of the unobserved loci are still missing. The second approach imputes 3D coordinates directly. The most common strategy is to perform linear interpolation by calculating a weighted average of two neighboring loci¹¹. Other methods, such as nearest neighbors¹⁶ are also reported, but all existing strategies that impute 3D coordinates of missing loci rely on neighboring loci only and lack the ability to infer global structures. In practice, multiplexed DNA FISH data are typically filtered before analysis, leaving only the cells with detection efficiency higher than some predetermined threshold^8,16, which further reduces the number of usable cells and complicates downstream analyses.

In contrast to the lack of imputation methods developed specifically for multiplexed DNA FISH data, computational methods tailored for imputing single-cell Hi-C (scHi-C) data, such as scHiCluster¹⁷ and Higashi¹⁸, have been proposed recently. However, these methods are not directly applicable to multiplexed DNA FISH data because they all build on assumptions made on 3C data. For instance, the data are sparse at the single-cell level, measurements are discrete, and only dropouts are recorded instead of unavailable observations. To fill in such a methodological gap, we present an imputation method tailored for multiplexed DNA FISH data in this paper. The imputation method relies on a dissimilarity measure to find chromosomes with similar conformations. Note by default, we perform analyses separately of the two chromosomes in each cell. Then a target pairwise distance matrix is constructed for each chromosome based on the selected structures. To recover missing 3D coordinates, we minimize the difference between the target matrix and the spatial distances calculated from 3D coordinates. Compared to existing methods, our method is able to integrate information from both the neighboring loci of the missing locus and other cells. We have shown that the imputed data resemble real imaging data and can improve the results of various downstream analyses, including loop-calling and clustering.

Results

SnapFISH-IMPUTE

Multiplexed DNA FISH data measure the three-dimensional Euclidean coordinates of consecutive loci on chromosomes. Depending on the ploidy and the cell cycle, it is possible to observe multiple signals of the same locus within each cell. Here we assume that the identity of each locus, that is, to which allele it belongs, is already resolved by clustering or more advanced methods tailored for spatial genome alignment^8,19. Then the goal of imputation is to fill in the 3D coordinates of the missing loci on each chromosome.

Although chromosome conformations often demonstrate large cell-to-cell variability, common structures such as TADs and enhancer-promoter loops are conserved in subpopulations of cells from the same cell type^8,10,11. We, therefore, reason that for each missing locus, its position is not only related to loci that immediately upstream or downstream loci on the same chromosome but also can be determined by the relative positioning of the same locus on other cells. To find cells with similar structures, we first convert 3D coordinates to pairwise Euclidean distances between each locus pair. The pairwise distances are then normalized by their one-dimensional (1D) genomic distances, so the difference between the two cells is not dominated by pairwise distance entries with large variations. Specifically, for locus pairs separated by the same 1D genomic distance, we model the difference in each coordinate as a centered normal distribution with variance proportional to the 1D distance. The squared pairwise distance, which is the sum of squares of the difference in each axis, thus follows a chi-squared distribution with three degrees of freedom multiplied by the variance. In imaging experiments, however, factors such as the resolution of the microscope and the imaging procedure used might lead to violations of the assumptions. To account for potential deviations, an additional Box-Cox transformation is performed, ensuring that the underlying distributions are approximately normal (Fig. S1). Last but not least, we applied the z score normalization to correct for the difference in 1D genomic distances (Fig. 1).

**Fig. 1: Overview of the imputation algorithm.**

To find cells with similar conformations, we define a dissimilarity measure by computing the root mean square deviation between the normalized distances. Because of the presence of missing loci, when calculating the dissimilarity score between two cells, normalized distances that are unavailable in at least one of the two cells are skipped, and the final score is rescaled by the number of shared available entries, thus ensuring that all scores are within the same range and comparable with each other. We further filter the scores by removing the ones with shared entries <80% of the available entries in both cells. However, since the Euclidean distances are highly dynamic at the locus pair level, the dissimilarity defined in this way cannot capture larger structural features, such as TADs, a phenomenon also observed in Hi-C data²⁰. A widely adopted solution for Hi-C data is to first smooth the contact matrix. Here, we resize the normalized distance matrices to 20 by 20 to achieve the same purpose. An additional advantage is that resizing will simultaneously reduce the missing ratio in the processed pairwise distances (Fig. S2), hence allowing more scores to pass the 80% filter.

The next step of the imputation workflow involves constructing target pairwise distances. Specifically, for each cell, all the other cells are ranked by their dissimilarity, and each missing pairwise distance of the cell is replaced by the first non-missing entry in the sorted cell list. This replacing process is repeated for all cells until no missing pairwise distances exist. To recover the underlying 3D coordinates from pairwise distances, we minimize the difference between the target pairwise distances and the pairwise distances calculated from 3D coordinates. Similarly, the difference is calculated after pairwise distances are normalized by 1D genomic distances, which ensures that each entry contributes equally. The difference and its derivatives are then passed to a minimizer and optimized with the limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS)²¹, a method designed for solving large-scale optimization problems.

Imputation preserves the distribution of pairwise spatial distances

We applied our imputation algorithm to a published multiplexed DNA FISH dataset from mouse embryonic stem cells (mESCs)¹⁰. The authors performed a whole-genome DNA seqFISH+ on 446 cells from two biological replicates. Two groups of probes were designed, where the first group consisted of 1200 probes separated by 25 kb, and the second group consisted of 2460 probes separated by about 1 Mb. We started with the 25 kb resolution subset, where one imaging region is selected from each chromosome, and 60 consecutive loci separated by 25 kb are imaged for each imaging region. Keeping only haploid chromatins with at least one locus observed, the average detection efficiency of the 25 kb subset is 67.9% (range: 58.7% [chr2] 73.0% [chr3]). To benchmark our method SnapFISH-IMPUTE, we included three alternative imputation strategies in the following analyses: linear imputation, cubic spline imputation, and mean imputation (see Methods). The first two methods can recover the 3D coordinates, while the mean imputation method can only fill in missing pairwise distances.

As expected, the average pairwise distances from the raw data and the mean imputed data are identical since in mean imputation, missing values are replaced by the average (Fig. 2a). Linear imputation also gives an average distance matrix similar to the original one, though the distances between locus pairs with larger 1D genomic distance are more likely to be overestimated (Fig. 2a, S3a). Interestingly, cubic spline, a generalization of linear imputation that uses a cubic polynomial instead of a linear function, completely removes the original patterns in the data (Fig. 2a, S3b). This suggests that filling in missing 3D coordinates while preserving population-level features is challenging. The main reason that linear imputation works reasonably well is likely that, arithmetically, it is equivalent to averaging neighboring two loci, so it implicitly captures the stochasticity in chromatin conformation and replaces those missing coordinates by their expectations. If we try to capture the variations explicitly by increasing the order of the fitted function, the predicted value is no longer computed from the neighboring two loci only but will also be influenced by other loci further apart. In contrast, by adopting a two-stage imputation workflow, SnapFISH-IMPUTE preserves the original population-level patterns faithfully and yields almost identical distance matrices (Fig. 2a, S3c). The difference between the imputed matrix and the raw matrix is also substantially smaller than the difference computed from both the linear imputation result and the cubic spline imputation result (Fig. 2b).

**Fig. 2: Imputation results of the 25 kb subset of the DNA seqFISH+ mESCs dataset.**

The Pearson correlations between the mean distance matrices from the imputed data and the raw data show that SnapFISH-IMPUTE consistently outperforms linear imputation and cubic spline across all imaging regions. Indeed, the correlations are around 0.99, almost the same as the ones from mean imputation (Fig. 2c). Such high correlations are achieved without gathering population-level information beforehand, which is completely different from mean imputation, where the population average is pre-computed and directly used to fill in missing pairwise distances. The results indicate that the information from other cells is referenced sufficiently in the first part of our method, and the 3D coordinates recovered resemble true distributions. In addition to the mean distance matrix, we calculated the standard deviation of the distance between each locus pair across the dataset. Both linear imputation and spline imputation lead to unwanted variations, which can be more than twice as high as the observed variations (Fig. 2d). The mean imputation, on the other hand, reduces the original standard deviations because all missing distances from the same locus pair are replaced by the same value. Among all methods tested, although our proposed imputation method also slightly decreases the variation, it keeps the overall distribution at a similar level (Figs. 2d, S3d). Taken together, our method preserves population-level features and surpasses other benchmark methods in multiple aspects.

Imputed conformations resemble observed conformations

We next evaluated the effect of imputation on single-cell level characteristics. A few cells without missing loci are selected, and we found that chromatin conformations demonstrate large cell-to-cell variability, coherent with previous studies (Fig. S4a)¹⁰. Nevertheless, there are some common patterns shared by most imaged regions. For example, the loci are often arranged sinuously instead of following a smooth curve. As a result, neighboring entries in the pairwise distance matrix can have distinctive values, despite their being close in 1D genomic distance (Fig. S4a). Such features are not always preserved by other imputation methods considered. When ~50% of loci are missing, both linear imputation and cubic spline imputation fail to recover the sharper changes observed in real imaging data (Fig. 2e). This pattern is more obvious as the detection efficiency drops below 50%, where loci start to pack together, and finer structural variations are entirely lost, as reflected by the distance matrices (Fig. S4e, f). SnapFISH-IMPUTE, on the opposite, is robust under various detection efficiencies, and the imputed loci distribute randomly as in real imaging data even when two-thirds of the loci are missing (Fig. S4b–f). Remarkably, although SnapFISH-IMPTUE does not model the conformation parametrically, it captures the overall trend of non-missing loci, and the imputed pairwise distances are indistinguishable from the real pairwise distances, in contrast to mean imputation (Fig. 2e).

If the imputed loci emulate real imaging data at the single-cell level, it would be difficult to discriminate imputed cells with low detection efficiency from those with high detection efficiency. For each imaging region in the 25 kb subset, we binned the cells by their detection efficiencies into three groups, and we trained a soft-margin support vector machine on the two groups with the highest and the lowest detection efficiencies (see Methods). If the imputed result closely resembles true data patterns, the classifier would not be able to distinguish cells from these two groups easily. Indeed, the accuracy is ~50% for SnapFISH-IMPTUE across different imaging regions, which is about the same as a random classifier. In comparison, the classification accuracies are considerably higher for all three competing methods, with mean imputation having the highest score of around 90% (Fig. 2f). We performed a PCA using the mean imputed data and the data generated by SnapFISH-IMPUTE. The result shows that cells with low and high detection efficiencies from the mean imputed data occupy different sub-spaces in the low-dimensional space, consistent with the high classification accuracies. In contrast, cells with different detection efficiencies are mixed together in the PCA plot of SnapFISH-IMPUTE, confirming that the predicted conformations are similar to the observed ones (Fig. 2g).

SnapFISH-IMPUTE is robust under different resolutions and imaging protocols

Next, we applied SnapFISH-IMPUTE to the 1 Mb subset of the mESCs data. We imputed missing data in output from Jie¹⁹, a spatial genome alignment method. The overall detection efficiency is only 36.3%, with a range of 30.2% to 46.9%. Notably, although nearly two-thirds of the 3D coordinates are missing, SnapFISH-IMPUTE yields average distance matrices almost identical to the original ones, and the Pearson correlations are close to one as before (Fig. S5a, b). Additionally, we have analyzed a DNA FISH dataset of mESCs generated by a different imaging protocol¹¹. A total of 41 probes are designed to image two alleles at 5 kb resolution, and the average detection efficiency is ~70% for both alleles. Similar to previous results, SnapFISH-IMPUTE recovers missing 3D coordinates effectively, as reflected by both the mean distance matrices and the distance deviations (Fig. S5c–f). The correlations are 0.98 and 0.98 for the two alleles, in contrast to the 0.88 and 0.86 achieved by linear imputation and the 0.58 and 0.57 by cubic spline imputation. In summary, the performance of the SnapFISH-IMPUTE is robust under different imaging resolutions and protocols.

Imputation improves downstream analysis

Enhancer-promoter loop is a critical structural feature in chromatin conformation data and plays a key role in transcription regulation²². In our recent work, we have developed a loop caller, SnapFISH²³, for multiplexed DNA FISH data. We applied SnapFISH to the 25 kb imputed DNA seqFISH+ data and benchmarked the loop-calling performance with both the loop set from the raw data and the HiCCUPS²⁴ output. With the default threshold, SnapFISH identified 30 loops from the imputed data, of which 14 loops overlapped with the HiCCUPS output (Fig. 3a). A large number of false positives is not surprising since imputation increases the effective sample size and thus allows more loops to pass the default threshold. The t test results from SnapFISH show that false positive loops have smaller t-statistics than true positive loops, suggesting that the relative loop strength is not affected by imputation (Fig. 3c). This motivates us to optimize the default FDR cutoff in the SnapFISH algorithm to achieve a similar precision to the raw data. Indeed, we found that by setting the cutoff to 0.001, SnapFISH reports 4 false loops while all the true loops are not affected (Fig. 3b). It is worth noting that not all imputation strategies will enhance loop-calling. Both linear imputation and cubic spline imputation drastically lower the sensitivity, with only two loops called at the optimal threshold. We reason that enhancer-promoter loop is a more intricate feature in 3D chromatin conformation, thus requiring careful processing of the raw data. To assess whether and to what extent the performance depends on the loop caller used, we performed the same analysis using loops called FitHiC2²⁵ as the ground truth. We observed similar patterns (Fig. S6) when using HiCCUPS loops as the ground truth.

**Fig. 3: Imputation improves loop-calling and cell-type clustering.**

Next, we applied SnapFISH to the 5 kb chromatin tracing data and tested whether it could identify the Sox2 enhancer-promoter loop with different numbers of cells. Specifically, we generated 100 random samples for each number of cells and calculated the F1 score on these 100 samples. The result shows that imputation boosts loop-calling efficiencies and leads to a higher F1 score across both alleles (Fig. 3d, e).

In addition to loop-calling, we test whether the imputed data can be used for cell-type clustering. We re-analyzed a previously published DNA seqFISH+ dataset of mouse brain cells⁸. The authors also conducted mRNA seqFISH+ on each cell in the dataset and identified nine major cell types. Here we asked whether cells have distance 3D conformations in different cell types. We applied SnapFISH-IMPUTE and linear imputation to the 1 Mb resolution subset of the data. After obtaining the complete 3D coordinates, we computed the pairwise distances of each cell and concatenated all pairwise distances from the 19 autosomes in each cell. Since not all cells have both alleles observed and all 19 autosomes recorded, we kept only cells with at least one haploid observable and randomly selected one allele for cells with more than one allele to perform clustering. The concatenated distances are normalized and then embedded with PCA followed by UMAP. Although not all cell types are distinguishable from each other in the embedding space, some of them, such as microglia (Micro) and neurons expressing vasoactive intestinal polypeptide (Vip) (Fig. 3f) or neurons expressing parvalbumin (Pvalb) and endothelial cells (Fig. 3g), occupy distinct regions. We also noticed that SnapFISH-IMPUTE often separates these classes more clearly than linear imputation and, thus, is more appropriate for downstream analyses.

To quantitatively evaluate the clustering efficiency, we calculated the adjusted mutual information scores (AMI) between the imputed data and the ground truth. Specifically, we first embedded the distances with PCA and then applied hierarchical clustering to obtain the predicted cluster assignments. We then computed the AMI score using the predicted and the true cluster assignments. The AMI scores of SnapFISH-IMPUTE are almost always higher than the ones from linear imputation (Fig. S7). For example, for pairs Pvalb versus Endo, Micro versus Vip, and Sst versus Micro, linear imputation yields an AMI close to zero, indicating that the clusters found are almost random. In contrast, SnapFISH-IMPUTE gives an AMI score close to 0.5 in these cases, which is considerably higher and much closer to 1, showing that SnapFISH-IMPUTE preserves original chromatin conformation characteristics.

Discussion

In this paper, we presented SnapFISH-IMPUTE, a unified nonparametric method to impute missing values in multiplexed DNA FISH data. In contrast to previous attempts, which focused on either imputing 3D coordinates with simple parametric models or developing ad-hoc methods for specific purposes, our method takes into account that cells imaged together have common structures. By allowing similar cells to borrow information from each other, we improved the quality of the imputed data substantially. We systematically evaluated the performance of SnapFISH-IMPUTE on three published multiplexed DNA FISH data of mice under different resolutions and experiment protocols. The results showed that the spatial relationships between DNA loci are preserved in the imputed data, and a similar level of cell-to-cell variation is observed after imputation. By dividing the cells by their missing ratios and training a classifier on the imputed data, we have also demonstrated that SnapFISH-IMPUTE performs consistently regardless of missing ratios.

In addition, we have benchmarked our method with three imputers used in previous studies. For the mean imputer¹⁵, though the average pairwise distance remains the same after imputation, it also eliminates the cell-to-cell variability typically observed in real imaging data. The reason is that for each cell, the same values are used to fill in the missing entries in the pairwise distance matrix. Moreover, only the distance matrices (not the 3D conformations) will be available after imputation, which further limits the application of the method. On the other hand, the linear and cubic spline imputers can retain the single-cell level variations but often introduce outliers at the two ends of the imaging region, affecting the average pairwise distances. In contrast, SnapFISH-IMPUTE overcomes this challenge and produces reliable conformation. Specifically, at the single-cell level, imputed chromatin conformations do not have outliers as in linear and cubic spline imputed data. At the population level, imputed average pairwise distances are nearly identical to the original ones.

More importantly, the imputed data outputted by SnapFISH-IMPUTE can be used for downstream analyses with minimal extra processing. For example, we showed that the imputed mouse brain cell data can be directly used for cell type clustering, with no additional adjustments needed. Furthermore, biological features became more detectable in the imputed data. As demonstrated with the 5 kb chromatin tracing data, by increasing the effective sample size, imputation improves the F1 score of loop calling and thus makes it possible to identify enhancer-promoter interactions with fewer cells. We also found that the normalization procedure proposed in this work effectively removes biases and transforms the pairwise distances between DNA loci with a fixed 1D genomic distance to approximately normal (Fig. S1f). In comparison, normalization by z score only or by taking the Pearson residuals from generalized linear models cannot standardize the distributions effectively (Fig. S1d, e).

We acknowledge that the performance of SnapFISH-IMPUTE depends on the number of cells imaged and the number of loci on each chromosome. When the sample size is large, and each chromosome is imaged at <1000 loci, SnapFISH-IMPUTE is able to find cells with similar conformations and effectively impute the missing portions. However, if each chromosome is imaged at more than thousands of loci and only a limited number of cells is available, it might be difficult to find cells with similar chromatin conformations, impairing the imputation performance. In addition, when the number of loci is large, it will be difficult to recover the 3D structure given the pairwise distance matrices, as the complexity of the optimization problem will grow quadratically. Therefore, as imaging technologies continue to evolve and image resolution continues to increase, more efficient imputation methods may be needed in the future.

In summary, SnapFISH-IMPUTE is an effective algorithm for imputing missing 3D coordinates in multiplexed DNA FISH data. The improvement in data quality through imputation will enable more computational and statistical tools to be developed, bringing new insights to the understanding of gene regulation in cells.

Methods

Pairwise distance normalization

Let (x_in, y_in, z_in) denote the x, y, and z-coordinates of the ith locus in the nth cell. The Euclidean distance between the ith and the jth loci is defined by

$${D}_{ij}^{(n)}=\sqrt{{({x}_{in}-{x}_{jn})}^{2}+{({y}_{in}-{y}_{jn})}^{2}+{({z}_{in}-{z}_{n})}^{2}}$$

(1)

If we assume that the measurement error in each axis is the same, then their differences should be similar. Additionally, the distribution of x_i−x_j, y_i−y_j, and z_i−z_j should have a mean of 0; otherwise, it means that cells from different cells are all aligned in the same direction, which is unlikely. For locus pairs with the same 1D genomic distances, x_i−x_j, y_i−y_j, and z_i−z_j can be viewed as three independent centered normal distributions with the same variance. Therefore, after grouped by 1D distances, D² divided by the common variance is the sum of squares of three standard normal distributions, thus following a ${\chi }_{3}^{2}$ distribution. However, depending on the experiment design, the imaging data might be acquired in a slice-by-slice manner, leading to an unequal variance in each axis¹⁰. Additionally, the resolution limit of microscopes might lead to different measurement errors. As a result, although the distances are distributed closely as chi-squared distributions, they may possess distinctive skewness and tailedness.

Since D is non-negative and roughly follows a chi-squared distribution, we can apply Box-Cox transformation to standardize it²⁶. We determine the optimal parameter λ by maximizing the likelihood of the transformed distance:

$${D}_{\lambda }=\left\{\begin{array}{ll}\frac{{D}^{\lambda }-1}{\lambda }\quad &{{{{{\rm{if}}}}}}\,\lambda \, \ne \, 0\\ \log D\quad &{{{{{\rm{if}}}}}}\,\lambda =0\end{array}\right.$$

(2)

Then D_λ is normalized by (D_λ−μ)/σ, where μ and σ are the mean and the standard deviation of the transformed distances with the same 1D genomic distance. After normalization, we expect pairwise distances to distribute identically as standard normal distributions and contribute equally in the following analyses.

Dissimilarities between cells

As noted in the HiCRep paper²⁰, domain structures in contact maps are more stable features than individual interactions. Therefore, comparing larger structures instead of highly dynamic single interactions should be the main goal of a dissimilarity measure. Although HiCRep was developed for bulk Hi-C data comparison, the distance matrix calculated from multiplexed DNA FISH data is negatively correlated with Hi-C contact matrix, so similar issue might arise. Moreover, we are comparing single-cell level structures instead of aggregated means in this case, so each pair of conformations is more variable. To remove the noise, we resize the normalized distance matrix by a factor of k such that the resulting matrix has a side length just above 20. Let ${A}_{ij}^{(n)}$ be the normalized pairwise distance between the ith and the jth locus of the nth cell. Then the ijth entry of the resized matrix H⁽ⁿ⁾ is the mean of the available entries in A⁽ⁿ⁾ over the window of (k(i−1), ki] × (k(j−1), kj]. If the side length of A⁽ⁿ⁾ is not divisible by k, a zero padding is applied, which corresponds to the mean of a standard normal distribution.

The dissimilarity between the nth and the mth cell is then defined by

$${S}_{nm}=s| h(n,m){| }^{-1/2}$$

(3)

where ∣h(n, m)∣ is the number of entries where both H⁽ⁿ⁾ and H^(m) have values, and s is defined by

$${s}^{2}=\sum\limits_{(i,j)\in h(n,m)}{\left({H}_{ij}^{(n)}-{H}_{ij}^{(m)}\right)}^{2}\,{{{{{\rm{if}}}}}}\,| h(n,m)| \, > \, 0\,\,{{\mbox{otherwise undefined}}}\,$$

(4)

This is the Euclidean distance between the available entries weighted by the number of shared entries. If only a small proportion of the observed pairwise distances are shared between the two cells, the value defined might not reflect the true structural relation. Therefore, we further filtered the dissimilarities by the number of shared entries. Specifically, we calculate

$${{{\mbox{shared proportion}}}}_{nm}=\frac{| h(n,m)| }{\min \{| a(n)| ,| a(m)| \}}$$

(5)

for each cell pair, where a(n) and a(m) are the sets of non-missing pairwise distances in cell n and cell m. S_nm’s with ratio below 0.8 are left undefined.

Impute missing pairwise distances

For the nth cell, let p(n) be the set of indices where S_nm, m ∈ p(n) is defined. For the missing locus pair (i, j) of cell n, define u(i, j) be the indices of all cells where ${H}_{ij}^{(m)},m\in u(i,j)$ is available. Thus, the set p(n) ∩ u(i, j) is the indices of all cells with the (i, j)th entry available and has a dissimilarity score with cell n. We replace the missing pairwise distance (i, j) of cell n with

$${H}_{ij}^{(n)}\leftarrow {H}_{ij}^{(k)},k={\arg \min }_{m\in p(n)\cap u(i,j)}{S}_{nm}$$

(6)

It is possible that p(n) ∩ u(i, j) is empty as no S_nm is defined or no m ∈ p(n) has entry (i, j). In such cases, we impute as many pairwise distances as possible and use the updated distances to recalculate the dissimilarities. Then another round of imputation is performed to fill in more missing pairwise distances. Since the missingness decreases after one iteration, both p(n) and u(i, j) contain more indices. By repeating the replacement procedure, all previously unavailable pairwise distances will eventually be imputed.

Recover 3D coordinates from pairwise distances

To recover the underlying 3D coordinates from pairwise distances, we first initialize the missing (x_in, y_in, z_in) by linear imputation (see linear imputation and cubic spline imputation). Let ${W}_{ij}^{(n)}$ be the pairwise distance computed from the lineraly imputed data. The same Box-Cox transformation and z score normalization are used to transform ${W}_{ij}^{(n)}$ to ${f}_{ij}({W}_{ij}^{(n)})$, where

$${f}_{ij}\left({W}_{ij}^{(n)}\right)=\frac{1}{{\sigma }_{ij}}\left[\frac{{[{W}_{ij}^{(n)}]}^{{\lambda }_{ij}}-1}{{\lambda }_{ij}}-{\mu }_{ij}\right]$$

(7)

The difference between the target pairwise distance H⁽ⁿ⁾ and f(W⁽ⁿ⁾) is then defined by

$${L}^{(n)}=\sum\limits_{i\in m(n)}\sum\limits_{j=1}^{K}{\left({f}_{ij}\left({W}_{ij}^{(n)}\right)-{H}_{ij}^{(n)}\right)}^{2}$$

(8)

where m(n) is the set of missing locus on cell n and K is the total number of loci in the imaging region. The partial derivatives with respect to x_in, y_in, and z_in are

$$\frac{\partial {L}^{(n)}}{\partial {x}_{in}}=\sum\limits_{i\in m(n)}\sum\limits_{j=1}^{K}\frac{2}{{\sigma }_{ij}}({x}_{in}-{x}_{jn}){\left[{W}_{ij}^{(n)}\right]}^{{\lambda }_{ij}-2}\left({f}_{ij}\left({W}_{ij}^{(n)}\right)-{H}_{ij}^{(n)}\right)$$

(9)

$$\frac{\partial {L}^{(n)}}{\partial {y}_{in}}=\sum\limits_{i\in m(n)}\sum\limits_{j=1}^{K}\frac{2}{{\sigma }_{ij}}({y}_{in}-{y}_{jn}){\left[{W}_{ij}^{(n)}\right]}^{{\lambda }_{ij}-2}\left({f}_{ij}\left({W}_{ij}^{(n)}\right)-{H}_{ij}^{(n)}\right)$$

(10)

$$\frac{\partial {L}^{(n)}}{\partial {z}_{in}}=\sum\limits_{i\in m(n)}\sum\limits_{j=1}^{K}\frac{2}{{\sigma }_{ij}}({z}_{in}-{z}_{jn}){\left[{W}_{ij}^{(n)}\right]}^{{\lambda }_{ij}-2}\left({f}_{ij}\left({W}_{ij}^{(n)}\right)-{H}_{ij}^{(n)}\right)$$

(11)

Both L⁽ⁿ⁾ and its partial derivatives are passed to the minimize function in SciPy²⁷ and optimized with the L-BFGS algorithm, a quasi-Newton method that stores only a limited number of iterations to accelerate its convergence²¹. The optimized 3D coordinates are the imputation results.

Time cost of SnapFISH-IMPUTE

SnapFISH-IMPUTE is implemented with mpi4py²⁸ to allow multiprocessing. With 40 processes, SnapFISH-IMPUTE takes about 5 minutes to impute the 1298 haploid chromosomes in the 5 kb chromatin tracing data from Huang et al.¹¹. The time cost grows roughly linearly with the number of haploid chromosomes in the dataset.

Linear imputation and cubic spline imputation

For each chromosome, a parametrized continuous spline is fitted for each axis separately:

$${x}_{i}={f}_{x}({t}_{i});{y}_{i}={f}_{y}({t}_{i});{z}_{i}={f}_{z}({t}_{i})$$

(12)

where t_i is the 1D genomic location of the ith observed locus on the chromosome. For the jth missing locus, its 3D coordinates are imputed by evaluating the fitted splines at genomic location t_j (i.e., $({\hat{f}}_{x}({t}_{j}),{\hat{f}}_{y}({t}_{j}),{\hat{f}}_{z}({t}_{j}))$). If f is a linear function, then this corresponds to linear imputation and can be interpreted as an weighted average of two neighboring loci^11,14. If f is a cubic polynomial, then this is the cubic spline imputation. Both methods are implemented with SciPy’s interp1d function from the interpolate module²⁷.

Mean imputation

As defined above, let ${D}_{ij}^{(n)}$ denote the Euclidean distance between locus i and locus j in chromosome n, and let u(i, j) denote the set of indices that ${D}_{ij}^{(m)},m\in u(i,j)$ is defined. We calculate the mean of each locus pair by

$${\bar{D}}_{ij}=\frac{1}{| u(i,j)| }\sum\limits_{n\in u(i,j)}{D}_{ij}^{(n)}$$

(13)

and replace each missing pairwise distance by

$${D}_{ij}^{(n)}={\bar{D}}_{ij},\forall n \, \notin \, u(i,j)$$

(14)

Classify haploid chromosomes by detection efficiencies

For each imaging region, we first calculated the detection efficiency of each chromosome using the raw data. The chromosomes are then ranked and binned by the 33.3% and the 66.7% quantiles of detection efficiencies. We then calculated the pairwise distance matrices and normalize each entry of the distance matrix by its average across the dataset. We next divide the first and the third tertiles of the binned data by a fivefold cross validation. A soft-margin support vector machine with a radial kernel (implemented with the SVC class from the scikit-learn²⁹ package) is then fitted to the training set, and the classification accuracy is calculated on the validation set. This training and validation procedure is carried out on each possible partition of the data.

FitHiC2 output processing

We analyzed the bulk Hi-C data³⁰ used as the benchmark dataset in the original SnapFISH paper²³. Specifically, we ran FitHiC2²⁵ on the mESC replicate with G0 and G1 phase cells at 10 kb resolution and kept interactions with a q value <0.01. We further filtered these interactions based on the coverage of the multiplexed DNA FISH data. Only loops with both anchors falling within the range of the 25 kb subset of DNA seqFISH+ data were kept. Finally, we merged loops that are within 50 kb of each other, leaving 102 loops in total. These 102 loops were used to benchmark the performance of SnapFISH.

Statistics and reproducibility

No statistical tests were conducted to draw conclusions. No statistical methods were used to determine sample sizes. Unless otherwise noted, error bars correspond to 95% confidence intervals. No representative results were selected from repeated measurements. The proposed method was tested using three publically available datasets, and no specific observations were excluded.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The DNA seqFISH+ data of mESCs are downloaded from Zenodo (https://zenodo.org/records/3735329). The 5 kb chromatin tracing data are downloaded from 4DN Data Portal with ID 4DNESC5PKTQ9. The DNA seqFISH+ data of mouse brain cells are downloaded from Zenodo (https://zenodo.org/records/4708112). The aligned results from Jie are downloaded from 4DN Data Portal with IDs 4DNFIS6MLXGA and 4DNFIU73OR5W. The code for all data preprocessing is available at https://github.com/hyuyu104/SnapFISH-IMPUTE/blob/main/jupyter/preprocess.ipynb. The imputed data used in this study are available from Zenodo with the link https://zenodo.org/records/10088109³¹. In addition, the source data for figures in this paper can be found in Supplementary Data 1.

Code availability

The SnapFISH-IMPUTE software is freely available as a Python package at https://pypi.org/project/sfimpute. The source code³² are available at https://zenodo.org/records/10525166.

References

Kempfer, R. & Pombo, A. Methods for mapping 3D chromosome architecture. Nat. Rev. Genet. 21, 207–226 (2020).
Article CAS PubMed Google Scholar
Rowley, M. J. & Corces, V. G. Organizational principles of 3D genome architecture. Nat. Rev. Genet. 19, 789–800 (2018).
Article CAS PubMed Google Scholar
Liu, W. et al. Understanding regulatory mechanisms of brain function and disease through 3D genome organization. Genes 13, 586 (2022).
Article CAS PubMed PubMed Central Google Scholar
Hafner, A. & Boettiger, A. The spatial organization of transcriptional control. Nat. Rev. Genet. 24, 53–68 (2023).
Article CAS PubMed Google Scholar
Wang, S. et al. HiNT: a computational method for detecting copy number variations and translocations from Hi-C data. Genome Biol. 21, 73 (2020).
Article PubMed PubMed Central Google Scholar
Burton, J. N. et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat. Biotechnol. 31, 1119–1125 (2013).
Article CAS PubMed PubMed Central Google Scholar
Su, J.-H., Zheng, P., Kinrot, S. S., Bintu, B. & Zhuang, X. Genome-scale imaging of the 3D organization and transcriptional activity of chromatin. Cell 182, 1641–1659.e26 (2020).
Article PubMed PubMed Central Google Scholar
Takei, Y. et al. Single-cell nuclear architecture across cell types in the mouse brain. Science 374, 586–594 (2021).
Article CAS PubMed Google Scholar
Zhuang, X. Spatially resolved transcriptomics adds a new dimension to genomics. Nat. Methods 18, 15–18 (2021).
Article Google Scholar
Takei, Y. et al. Integrated spatial genomics reveals global architecture of single nuclei. Nature 590, 344–350 (2021).
Article CAS PubMed PubMed Central Google Scholar
Huang, H. et al. CTCF mediates dosage- and sequence-context-dependent transcriptional insulation by forming local chromatin domains. Nat. Genet. 53, 1064–1074 (2021).
Article CAS PubMed PubMed Central Google Scholar
Mateo, L. J. et al. Visualizing DNA folding and RNA in embryos at single-cell resolution. Nature 568, 49–54 (2019).
Article CAS PubMed PubMed Central Google Scholar
Cardozo Gizzi, A. M. et al. Microscopy-based chromosome conformation capture enables simultaneous visualization of genome organization and transcription in intact organisms. Mol. Cell 74, 212–222.e5 (2019).
Article PubMed Google Scholar
Rajpurkar, A. R., Mateo, L. J., Murphy, S. E. & Boettiger, A. N. Deep learning connects DNA traces to transcription to reveal predictive features beyond enhancer-promoter contact. Nat. Commun. 12, 3423 (2021).
Article CAS PubMed PubMed Central Google Scholar
Sawh, A. N. et al. Lamina-dependent stretching and unconventional chromosome compartments in early C. elegans embryos. Mol. Cell 78, 96–111.e6 (2020).
Article PubMed PubMed Central Google Scholar
Das, P., Shen, T. & McCord, R. P. Characterizing the variation in chromosome structure ensembles in the context of the nuclear microenvironment. PLoS Comput. Biol. 18, e1010392 (2022).
Article CAS PubMed PubMed Central Google Scholar
Zhou, J. et al. Robust single-cell Hi-C clustering by convolution- and random-walk-based imputation. Proc. Natl Acad. Sci. 116, 14011–14018 (2019).
Article CAS PubMed PubMed Central Google Scholar
Zhang, R., Zhou, T. & Ma, J. Multiscale and integrative single-cell Hi-C analysis with Higashi. Nat. Biotechnol. 40, 254–261 (2022).
Article CAS PubMed Google Scholar
Jia, B. B., Jussila, A., Kern, C., Zhu, Q. & Ren, B. A spatial genome aligner for resolving chromatin architectures from multiplexed DNA FISH. Nat. Biotechnol. 41, 1004–1017 (2023).
Article CAS PubMed PubMed Central Google Scholar
Yang, T. et al. HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Res. 27, 1939–1949 (2017).
Article CAS PubMed PubMed Central Google Scholar
Liu, D. C. & Nocedal, J. On the limited memory BFGS method for large scale optimization. Math. Program. 45, 503–528 (1989).
Article Google Scholar
Furlong, E. E. M. & Levine, M. Developmental enhancers and chromosome topology. Science 361, 1341–1345 (2018).
Article CAS PubMed PubMed Central Google Scholar
Lee, L. et al. SnapFISH: a computational pipeline to identify chromatin loops from multiplexed DNA FISH data. Nat. Commun. 14, 4873 (2023).
Article CAS PubMed PubMed Central Google Scholar
Wolff, J., Backofen, R. & Grüning, B. Loop detection using Hi-C data with HiCExplorer. Gigascience 11, giac061 (2022).
Kaul, A., Bhattacharyya, S. & Ay, F. Identifying statistically significant chromatin contacts from Hi-C data with FitHiC2. Nat. Protoc. 15, 991–1012 (2020).
Article CAS PubMed PubMed Central Google Scholar
Hawkins, D. M. & Wixley, R. A. J. A note on the transformation of chi-squared variables to normality. Am. Stat. 40, 296–298 (1986).
Article Google Scholar
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Article CAS PubMed PubMed Central Google Scholar
Dalcin, L. & Fang, Y.-L. L. mpi4py: status update after 12 years of development. Comput. Sci. Eng. 23, 47–54 (2021).
Article Google Scholar
Pedregosa, F. et al. Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Google Scholar
Bonev, B. et al. Multiscale 3D genome rewiring during mouse neural development. Cell 171, 557–572.e24 (2017).
Article PubMed PubMed Central Google Scholar
Yu, H. Imputed multiplexed DNA FISH data from mouse cells. Zenodo https://zenodo.org/records/10088109 (2023).
Yu, H. SnapFISH-IMPUTE Software. Zenodo https://zenodo.org/records/10525166 (2024).

Download references

Acknowledgements

This study was funded by the NIH grants R35HG011922 (to M.H.) and U01DA052713 (to Y.L.). Y.L. was also partially funded by the NIH grants U01HG011720, and U24AR076730. M.H. was also partially funded by the NIH grants UM1HG011585.

Author information

Authors and Affiliations

Department of Statistics, University of Wisconsin-Madison, Madison, WI, USA
Hongyu Yu
Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
Hongyu Yu
Department of Statistics, University of Toronto, Ontario, Canada
Daiqing Wu
Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic Foundation, Cleveland, OH, USA
Shreya Mishra & Ming Hu
Department of Computer Science, University of North Carolina, Chapel Hill, NC, USA
Guning Shen & Yun Li
Department of Biology, University of North Carolina, Chapel Hill, NC, USA
Guning Shen
Department of Genetics, University of North Carolina, Chapel Hill, NC, USA
Huaigu Sun & Yun Li
Department of Biostatistics, University of North Carolina, Chapel Hill, NC, USA
Yun Li

Authors

Hongyu Yu
View author publications
Search author on:PubMed Google Scholar
Daiqing Wu
View author publications
Search author on:PubMed Google Scholar
Shreya Mishra
View author publications
Search author on:PubMed Google Scholar
Guning Shen
View author publications
Search author on:PubMed Google Scholar
Huaigu Sun
View author publications
Search author on:PubMed Google Scholar
Ming Hu
View author publications
Search author on:PubMed Google Scholar
Yun Li
View author publications
Search author on:PubMed Google Scholar

Contributions

H.Y. implemented the SnapFISH-IMPUTE software. Y.L. and M.H. supervised the project. H.Y., D.W., G.S., S.M., and H.S. performed data analysis and evaluated the method. H.Y., M.H., and Y.L. wrote the manuscript with input from all the authors. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Ming Hu or Yun Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Communications Biology thanks Yaqiang Cao and Li Li for their contribution to the peer review of this work. Primary Handling Editors: Laura Rodríguez Pérez and Gene Chong.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Description of Additional Supplementary Materials

Supplementary Data 1

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Yu, H., Wu, D., Mishra, S. et al. SnapFISH-IMPUTE: an imputation method for multiplexed DNA FISH data. Commun Biol 7, 834 (2024). https://doi.org/10.1038/s42003-024-06428-7

Download citation

Received: 26 January 2024
Accepted: 07 June 2024
Published: 09 July 2024
DOI: https://doi.org/10.1038/s42003-024-06428-7