A cohomology-based Gromov–Hausdorff metric approach for quantifying molecular similarity

Wee, JunJie; Gong, Xue; Tuschmann, Wilderich; Xia, Kelin

doi:10.1038/s41598-025-93381-y

Download PDF

Article
Open access
Published: 26 March 2025

A cohomology-based Gromov–Hausdorff metric approach for quantifying molecular similarity

JunJie Wee¹,
Xue Gong²,
Wilderich Tuschmann³ &
…
Kelin Xia²

Scientific Reports volume 15, Article number: 10458 (2025) Cite this article

2881 Accesses
Metrics details

Subjects

Abstract

We introduce a cohomology-based Gromov–Hausdorff ultrametric method to analyze 1-dimensional and higher-dimensional (co)homology groups, focusing on loops, voids, and higher-dimensional cavity structures in simplicial complexes, to address typical clustering questions arising in molecular data analysis. The Gromov–Hausdorff distance quantifies the dissimilarity between two metric spaces. In this framework, molecules are represented as simplicial complexes, and their cohomology vector spaces are computed to capture intrinsic topological invariants encoding loop and cavity structures. These vector spaces are equipped with a suitable distance measure, enabling the computation of the Gromov–Hausdorff ultrametric to evaluate structural dissimilarities. We demonstrate the methodology using organic–inorganic halide perovskite (OIHP) structures. The results highlight the effectiveness of this approach in clustering various molecular structures. By incorporating geometric information, our method provides deeper insights compared to traditional persistent homology techniques.

Persistent homology elucidates hierarchical structures responsible for mechanical properties in covalent amorphous solids

Article Open access 25 September 2025

Pressure driven rotational isomerism in 2D hybrid perovskites

Article Open access 25 January 2023

Tracking the time evolution of soft matter systems via topological structural heterogeneity

Article Open access 12 January 2022

Introduction

In the last decade, concepts, methods, and techniques from topological and geometric data analysis^1,2,3,4 have found profound applications, becoming indispensable tools in many areas of applied mathematics, such as image processing⁵, pattern recognition⁶, sensor network⁷, robotics⁸, computer graphics and vision⁹, cosmology¹⁰, medicine¹¹, computational and molecular biology¹². A central technique in Topological Data Analysis (TDA) is persistent homology, which provides a robust framework for capturing multi-scale topological features in data^13,14. However, while persistent homology is a powerful tool, it generally provides only limited information about the homology groups of the objects under study, often restricted to basic descriptors such as Betti numbers. In addition to topological tools, geometric theories and methods, such as discrete exterior calculus¹⁵, Laplace operators¹⁶, discrete Ricci curvatures^17,18, optimal transport¹⁹, parametric surfaces²⁰, and geometric flows²¹, have been widely applied in data analysis. These geometric approaches have often been shown to provide deeper insights than those achieved through purely topological methods.

A promising application of advanced geometric and topological techniques is molecular similarity analysis²², which is a fundamental concept in cheminformatics. Molecular similarity measures the resemblance between molecules based on their structure, properties, or function, enabling applications in drug design²³, virtual screening²⁴, and reaction prediction²⁵. By assessing similarity, we can predict properties such as activity and toxicity, thereby accelerating the identification of promising compounds^26,27.

Existing methods for evaluating molecular similarity often rely on structural fingerprints, property-based metrics, or topological and 3D shape representations. For instance, structural fingerprints, such as Extended Connectivity Fingerprint (ECFP)²⁸ and MACCS keys²⁹, encode molecular features as binary sequences, which are typically compared using Tanimoto similarity³⁰. Property-based similarity involves comparing physicochemical attributes, such as molecular weight and polarity, using Euclidean or cosine distances³¹. Topological similarity, often evaluated through graph neural networks (GNNs)³², models molecular connectivity, while 3D similarity captures spatial geometry through metrics such as RMSD or Shape Tanimoto scores^33,34.

Recent advances in applied and computational topology and geometry offer powerful methods to assess molecular similarity. Persistent homology and its various refinements and extensions have proven invaluable for capturing molecular topological features in diverse chemical and biological applications^35,36. However, while effective in tracking the number of connected components, loops, and voids, persistent homology lacks the geometric information necessary to distinguish or identify individual features that arise during filtration. Previous studies have leveraged the connection between persistent homology and dendrograms^37,38 to distinguish structures based on 0-dimensional homology groups, which correspond to connected components. These works employ geometric methods, such as the Gromov–Hausdorff distance, to measure the distance between two brain networks^39,40,41. However, there is still a need for methodologies that enable a more detailed and localized characterization of loops, voids, and higher-dimensional (co)homologies.

We apply a cohomology-based Gromov–Hausdorff ultrametric method to study 1-dimensional and higher-dimensional (co)homology groups associated with loops, voids, and higher-dimensional cavity structures in simplicial complexes. The Gromov–Hausdorff distance ($d_{GH}$)^42,43 quantifies the dissimilarity between two metric spaces and possesses many desirable mathematical properties, making it widely used in differential geometry and applied algebraic topology for tasks such as shape matching, structural comparison, and characterization^44,45,46. In the context of molecular structures A and B, modeled as simplicial complexes $K_A$ and $K_B$, respectively, their cohomology groups give rise to the metric spaces $H^p(K_A)$ and $H^p(K_B)$. In theory, we can then compare their Gromov–Hausdorff distances. Although many lower bounds can be achieved in polynomial time^46,47, computing $d_{GH}$ between arbitrary finite metric spaces is known to be NP-hard, as it leads to quadratic assignment problems⁴⁷. To address this challenge, we introduce a crucial second step: transforming $H^p(K_A)$ and $H^p(K_B)$ into ultrametric spaces, thus obtaining their respective dendrogram representations^37,40,48. This enables us to compute a more tractable ultrametric variation of the Gromov–Hausdorff distance, referred to as the Gromov–Hausdorff ultrametric ($u_{GH}$)^48,49, which can be used to quantify molecular similarity.

The main contributions of this work are as follows:

1.
We present a workflow to utilize a Gromov–Hausdorff ultrametric approach based on 1-dimensional and higher-dimensional cohomology for assessing structural similarity between molecules, effectively capturing local topological features such as loops and voids that influence molecular properties.
2.
We demonstrate the effectiveness of this workflow in clustering tasks using a dataset of organic–inorganic halide perovskites (OIHP).

The remainder of this paper is structured as follows: The next section introduces foundational concepts in algebraic topology, including simplicial complexes and the Hodge Laplacian, which are essential to our methodology. We then present our method for measuring structural similarity, followed by numerical experiments to validate our approach using OIHP material datasets. Finally, we discuss potential future research directions.

Background

Simplicial complexes

In many applications, molecular data from biology, chemistry, and material science can be represented as graphs^50,51. In this framework, the vertices correspond to atoms, while the edges represent affinities between pairs of atoms. Simplicial complexes generalize graphs by capturing higher-order interactions through simplices, such as triangles and tetrahedra. In this and the next section, we aim to provide a brief introduction to the theoretical background necessary for performing computations on simplicial complexes. For a more comprehensive review, we direct readers to^52,53,54,55.

If we let V be a finite set of vertices, then a p-simplex $\sigma ^p$ is a subset of V with $p+1$ vertices. For example, a simplex of dimensions 0, 1, 2, and 3 can be viewed as a point, an edge, a triangle, and a tetrahedron, respectively. More precisely, a p-simplex $\sigma ^p = \{v_0, v_1, v_2, \ldots , v_p\}$ is defined as a convex hull formed by its $p+1$ affinely independent points $v_0, v_1, v_2, \ldots , v_p$ as follows:

$$\begin{aligned} \sigma ^p = \bigg \{\lambda _0v_0+\lambda _1v_1+\cdots +\lambda _pv_p \bigg |\sum _{i=0}^p \lambda _i = 0;\forall i, 0\le \lambda _i \le 1\bigg \}. \end{aligned}$$

A m-face of a p-simplex $\sigma ^p$ is defined as a convex hull formed by $m+1$ vertices of $\sigma ^p$, where $m < p$. If $\sigma ^{m}$ is a face of $\sigma ^p$, denoted by $\sigma ^{m} \subset \sigma ^p$, then $\sigma ^p$ is also referred to as a coface of $\sigma ^{m}$. The upper degree of a p-simplex $\sigma ^p$, denoted by $deg(\sigma ^p)$, is the number of $(p+1)$-simplices for which $\sigma ^p$ is a face. Two simplices $\sigma _1^p$ and $\sigma _2^p$ are upper adjacent and denoted by $\sigma _1^p \frown \sigma _2^p$ if they have a common coface. They are lower adjacent and denoted by $\sigma _1^p \smile \sigma _2^p$ if they share a common face. To facilitate computation, each simplex is assigned an orientation associated with the ordering of its vertices. If two p-simplices $\sigma _1^p$ and $\sigma _2^p$ are oriented similarly, then we write $\sigma _1^p \sim \sigma _2^p$. Conversely, we write $\sigma _1^p \not \sim \sigma _2^p$ if they have opposite orientations.

A p-dimensional simplicial complex K contains up to p-dimensional simplices and satisfies two conditions. First, any face of a simplex from K is also in K. Second, the intersection of any two simplices in K is either empty or a shared face. There are many common types of simplicial complexes such as Vietoris-Rips complex, ${\check{C}}ech$ complex, Alpha complex, and clique complex.

The combinatorial Hodge Laplacian

Computationally, combinatorial Hodge Laplacian or discrete Hodge Laplacian has been proposed^{52,54,55,56,57,58,59,60}. Essentially, this discrete version can be viewed as part of exterior calculus and discrete differential geometry. The concept of the combinatorial Hodge Laplacian is a part of Hodge theory, which has recently been applied to biomolecular data analysis⁶¹. Hodge Laplacian matrices of different dimensions can be constructed on a simplicial complex. A p-th dimensional Hodge Laplacian matrix characterizes topological connections between p-th simplices within a simplicial complex^52,53,62. Note that the graph Laplacian characterizes relations between vertices (0-simplices).

The p-th chain group $C_p(K)$ of a simplicial complex K over some field $\mathbb {F}$ is a vector space over $\mathbb {F}$ whose basis is the set of p-simplices of the simplicial complex K. Elements of $C_p(K)$ are called p-chains. The dual of $C_p(K),$ denoted by $C^p(K),$ is the set of all linear functionals on $C_p(K)$:

$$C^p(K)=\big \{\phi : C_p(K) \rightarrow \mathbb {F}\, : \, \phi \text { is linear} \big \}.$$

$C^p(K)$ is called the p-th cochain group and its elements are called p-cochains. Boundary operators are defined on both the chain and cochain groups. The boundary map $\partial _p$ is a linear transformation which acts on a p-simplex $\sigma ^p=[u_0,u_1,\ldots ,u_p]$ as follows

$$\partial _p([u_0,u_1,\ldots ,u_p])=\sum _{i=0}^p(-1)^i[u_0,\ldots ,u_{i-1},u_{i+1},\ldots ,u_p].$$

The coboundary map $\delta _p:C^p(K) \rightarrow C^{p+1}(K)$ is a linear transformation defined as follows: for a linear functional $\phi \in C^p(K)$ and a $p+1$-simplex $\sigma ^{p+1}=[u_0,u_1,\ldots ,u_{p+1}]$,

$$\delta _p(\phi )(\sigma ^{p+1})=\sum _{i=0}^{p+1}(-1)^i\phi ([u_0,\ldots ,u_{i-1},u_{i+1},\ldots ,u_{p+1}]).$$

The boundary map gives rise to a chain complex, which is a sequence of chain groups connected by boundary maps as follows:

$$0 \rightarrow C_{n}(K)\rightarrow \cdots \xrightarrow {\partial _{p+1}}C_p(K)\xrightarrow {\partial _p} C_{p-1}(K)\cdots \xrightarrow {\partial _2}C_1(K)\xrightarrow {\partial _1}C_0(K)\rightarrow 0.$$

Similar to the boundary map giving rise to the chain complex, the coboundary operator gives rise to a cochain complex:

$$0 \leftarrow C^{n}(K)\leftarrow \cdots \xleftarrow {\delta _{p}}C^p(K)\xleftarrow {\delta _{p-1}} C^{p-1}(K)\cdots \xleftarrow {\delta _1}C^1(K)\xleftarrow {\delta _0}C^0(K)\leftarrow 0.$$

Since $C_p(K)$ and $C^p(K)$ are finite-dimensional, there exists unique matrix representations for $\partial _p$ and $\delta _p$. We have some useful relations regarding matrix representations of $\partial _p$ and $\delta _p$ ($A^T$ represents the transpose of a matrix A):

For all $p \ge 0$, $\partial _{p+1}^T=\delta _p$,
$\partial _p^T=\partial _p^*$,
$\delta _p^T=\delta _p^*.$

Here, $\delta _p^*:C^{p+1}(K)\rightarrow C^p(K)$ is the adjoint/transpose map of $\delta _p$ where

$$\langle \delta _p(f), g \rangle =\langle f, \delta _p^*(g) \rangle ,$$

for every $f \in C^p(K)$, $g \in C^{p+1}(K)$ and a suitable inner product $\langle \cdot , \cdot \rangle$ for $C^p(K)$ and $C^{p+1}(K).$ The adjoint of the boundary operator $\partial _p$, $\partial _p^*$ is also defined analogously. Using the cochain group $C^p(K)$, we can define the p-th cocycle group $Z^p$ and p-th coboundary group $B^p$ as follows:

$$\begin{aligned} Z^p = \text {ker}(\delta _p) = \{c \in C^p | \delta _p(c) = 0\}, \\ B^p = \text {im}(\delta _{p-1}) = \{c\in C^p | \exists d \in C^{p-1}: c = \delta _{p-1}(d) \}. \end{aligned}$$

Then we have the p-th cohomology group $H^p = Z^p/B^p$.

The p-dimensional combinatorial Hodge Laplacian is the linear operator $\Delta _p:C^p(K) \rightarrow C^p(K)$ is defined as follows:

$$\Delta _p= {\left\{ \begin{array}{ll} \delta _p^*\circ \delta _p+\delta _{p-1}\circ \delta _{p-1}^* & \text {if } p \ge 1, \\ \delta _p^*\circ \delta _p & \text {if } p=0. \end{array}\right. }$$

The case where $p=0$ gives rise to the expression of the well-known graph Laplacian. Alternatively, the combinatorial Hodge Laplacian matrix can be expressed using the boundary matrix. The boundary operator $\partial _p$ has a unique matrix representation. Given a simplicial complex K, the p-th boundary matrix ${\mathbf{B}}_p$ is defined as,

$$({\mathbf{B}}_p)_{ij}= {\left\{ \begin{array}{ll} 1 & \text {if } \sigma _i^{p-1} \subset \sigma _j^{p} ~\text {and }~ \sigma _i^{p-1} \sim \sigma _j^{p}, \\ -1 & \text {if } \sigma _i^{p-1} \subset \sigma _j^{p} ~\text {and }~ \sigma _i^{p-1} \not \sim \sigma _j^{p}, \\ 0 & \text {if } \sigma _i^{p-1} \not \subset \sigma _j^{p}, \end{array}\right. }$$

where $\sigma _i^{p-1}$ is the i-th $(p-1)$-simplex and $\sigma _j^{p}$ is the j-th p-simplex.

Given that the highest order of the simplicial complex K is n, the p-th Hodge Laplacian (or combinatorial Laplacian) matrix ${\mathbf{L}}_p$ of K is

$${\mathbf{L}}_p= {\left\{ \begin{array}{ll} {\mathbf{B}}_n^T{\mathbf{B}}_n & \text {if } p=n, \\ {\mathbf{B}}_p^T{\mathbf{B}}_p+{\mathbf{B}}_{p+1}{\mathbf{B}}_{p+1}^T & \text {if } 1 \le p < n, \\ {\mathbf{B}}_1{\mathbf{B}}_1^T & \text {if } p=0. \end{array}\right. }$$

Another way to define the Hodge Laplacian is through simplex relations. When $p=0$,

$$({\mathbf{L}}_0)_{ij}= {\left\{ \begin{array}{ll} \text {deg}(\sigma _i^{0}) & \text {if } i=j, \\ -1 & \text {if } i \ne j~\text {and } \sigma _i^{0} \frown \sigma _j^{0}, \\ 0 & \text {if } i \ne j~\text {and } \sigma _i^{0} \not \frown \sigma _j^{0}, \end{array}\right. }$$

and ${\mathbf{L}}_0$ is exactly the same as the graph Laplacian matrix. When $p>0$,

$$({\mathbf{L}}_p)_{ij}= {\left\{ \begin{array}{ll} \text {deg}(\sigma _i^{p})+ p+1 \quad & \text {if } i=j, \\ 1 \qquad & \text {if } i \ne j, \, \sigma _i^{p} \not \frown \sigma _j^{p}, \sigma _i^{p} \smile \sigma _j^{p}~\text {and } \sigma _i^k \sim \sigma _j^p, \\ -1 \quad & \text {if } i \ne j, \, \sigma _i^{p} \not \frown \sigma _j^{p}, \sigma _i^{p} \smile \sigma _j^{p}~\text {and } \sigma _i^k \not \sim \sigma _j^p, \\ 0 \qquad & \text {if } i \ne j, \, \sigma _i^{p} \frown \sigma _j^{p} ~\text {or }~ \sigma _i^{p} \not \smile \sigma _j^{p}. \end{array}\right. }$$

Mathematically, the eigenvalues of Hodge Laplacian matrices are independent of the choice of the orientation⁵⁷. The eigenspectrum of Hodge Laplacian matrices reveals topological information within simplicial complexes. For instance, the multiplicity of zero eigenvalues of ${\mathbf{L}}_p$ corresponds to the Betti numbers $\beta _p$, reflecting the number of connected components ($\beta _0$), the number of cycles ($\beta _1$), the number of cavities ($\beta _2$), etc.

Furthermore, the eigenvectors corresponding to the zero eigenvalues of ${\mathbf{L}}_p$, also referred to as cohomology generators, can be used to represent the p-dimensional cohomologies. At the algebraic level, choosing a representative for an element in cohomology is inherently problematic due to its quotient structure. However, by leveraging the Hodge decomposition theorem with the inner product structure on the cochain complex, we obtain $\operatorname {ker}(\delta _p) = \operatorname {im}(\delta _{p-1}) \oplus \operatorname {ker}({\mathbf {L}}_p)$, where $\operatorname {ker}({\mathbf {L}}_p)$ is orthogonal to $\operatorname {im}(\delta _{p-1})$. This identification induces the canonical isomorphism $H^p(K) \simeq (\operatorname {im}(\delta _{p-1}) \oplus \operatorname {ker}({\mathbf {L}}_p))/\operatorname {im}(\delta _{p-1}) \simeq \operatorname {ker}({\mathbf {L}}_p)$. In particular, every cohomology class in $H^p(K)$ can be uniquely represented by an element in $\operatorname {ker}({\mathbf {L}}_p)$^53,63,64. For any cohomology generator v of ${\mathbf{L}}_p$, we can arrange the entries of v to match the order of simplices in K. Specifically, the i-th entry $v_i$ corresponds to the p-simplex $\sigma _i^p$ in K. This allows us to visualize v on a simplicial complex.

The Gromov–Hausdorff distance

The Gromov–Hausdorff distance measures the distance between two compact metric spaces^42,43. It measures the smallest distance at which two compact metric spaces can be considered “close”, which is particularly useful when the overall shape matching is more relevant than exact pointwise alignment. Given two metric spaces X and Y, the Gromov–Hausdorff distance looks for all possible isometric embeddings of X and Y into a common metric space Z and then calculates the Hausdorff distance⁶⁵ between these embeddings within Z. Formally, the Gromov–Hausdorff distance $d_{GH}$^42,43 between two compact metric spaces X and Y is defined as

$$\begin{aligned} d_{GH}(X,Y) = \inf d_H^Z(\varphi (X),\psi (Y)), \end{aligned}$$

where the infimum is taken over all possible $Z\in \mathcal {M}$ and isometric embeddings $\varphi :X\hookrightarrow Z$ and $\psi :Y\hookrightarrow Z$, and $d^Z_H$ denotes Hausdorff distance in Z.

Method

Cohomology generator-based distance metrics

For a given simplicial complex K, we denote the space consisting of all the p-dimensional cohomology generators arising from its Hodge Laplacian ${\mathbf{L}}_p$ by $H^p(K)$. Subsequently, we want to construct a metric space from $H^p(K)$ by assigning distances between cohomology generators in $H^p(K)$. We will introduce three types of distance measures to construct the metric space: $l^1$ distance, cocycle distance, and Wasserstein distance.

Throughout this section, we assume that any two cohomology generators v and w from a Hodge Laplacian matrix ${\mathbf{L}}_p$ must have a consistent order of entries. In other words, we can write

$$v = \left( {\begin{array}{*{20}c} {v_{1} } \hfill \\ {v_{2} } \hfill \\ \vdots \hfill \\ {v_{{n_{p} - 1}} } \hfill \\ {v_{{n_{p} }} } \hfill \\ \end{array} } \right)\begin{array}{*{20}c} {\sigma _{1}^{p} } \hfill \\ {\sigma _{2}^{p} } \hfill \\ \vdots \hfill \\ {\sigma _{{n_{p} - 1}}^{p} } \hfill \\ {\sigma _{{n_{p} }}^{p} } \hfill \\ \end{array} \quad {\text{and}}\quad w = \left( {\begin{array}{*{20}c} {w_{1} } \hfill \\ {w_{2} } \hfill \\ \vdots \hfill \\ {w_{{n_{p} - 1}} } \hfill \\ {w_{{n_{p} }} } \hfill \\ \end{array} } \right)\begin{array}{*{20}c} {\sigma _{1}^{p} } \hfill \\ {\sigma _{2}^{p} } \hfill \\ \vdots \hfill \\ {\sigma _{{n_{p} - 1}}^{p} } \hfill \\ {\sigma _{{n_{p} }}^{p} } \hfill \\ \end{array} ,$$

where $v_i$ and $w_i$ are entries in v and w, respectively, corresponding to the simplex $\sigma _i^p$ for all $1\le i \le n_p$. Furthermore, we assume that any cohomology generator in $H^p(K)$ is normalized.

$l^1$ distance

First, we define the $l^1$ distance between two cohomology generators as follows.

Definition 1

($l^1$ distance) Let v and w be two cohomology generators from $H^p(K)$ where K is a simplicial complex. Then the $l^1$ distance between v and w is

$$\begin{aligned} \Vert v-w\Vert _1 = \sum _{i=1}^{n_p} |v_{i}-w_{i}|. \end{aligned}$$

Note that in computation, the eigenvectors may have arbitrary signs. To eliminate the sign ambiguity introduced by the solver, we enforce that the first element of each cohomology generator is non-negative.

Cocycle distance

As the $l^1$ norm of a cohomology generator tends to be larger when the cocycle contains more edges, it serves as a rough indicator of the size of the cocycle. Therefore, it is natural to consider another type of distance that measures the absolute difference between the $l^1$ norms of two cohomology generators. We refer to this as the cocycle distance.

Definition 2

(Cocycle distance) Let v and w be two cohomology generators in $H^p(K)$ where K is a simplicial complex. Their cocycle distance $d_s(v, w)$ is defined as the absolute difference in their $l^1$ norms:

$$\begin{aligned} d_s(v, w) = |\Vert v\Vert _{1}-\Vert w\Vert _{1}|. \end{aligned}$$

For illustration, consider the BINOL-phosphoramide 1_iv, a BINOL (1,1’-bi-2-naphthol)-based phosphoric acid catalyst. We construct the atomic level simplicial complex representation of the molecule, where each vertex represents an atom, and generate the corresponding Hodge Laplacian matrices ${\mathbf{L}}_0$ and ${\mathbf{L}}_1$. The cohomology generators from ${\mathbf{L}}_0$ and ${\mathbf{L}}_1$ are depicted in Fig. 1, where the color of each simplex $\sigma _i^p$ is darker in blue if its associated $|v_i|$ in the cohomology generator is larger. From ${\mathbf{L}}_0$, there is only one cohomology generator, representing a single connected component of the catalyst structure. Therefore, the cohomology generator in ${\mathbf{L}}_0$ assigns equal values to all vertices. On the other hand, ${\mathbf{L}}_1$ produces 7 cohomology generators in dark blue color in Fig. 1 corresponding to unique cocycles. In addition, the non-cohomology generators associated with non-zero eigenvalues reveal local and global clustering patterns. The Fiedler vector effectively clusters the vertices into two groups, colored in blue and red. For example, Fig. 1 shows the Fiedler vector corresponding to the smallest non-zero eigenvalue of the Hodge Laplacian matrix, which clusters the vertices into two groups colored in blue and red.

Wasserstein distance

Recall from the beginning of this section that any cohomology generator v in $H^p(K)$ is assumed to be normalized. Hence, given any two cohomology generators v and w in $H^p(K)$, we can square all the entries of v and w to obtain vectors $v'$ and $w'$, whose entries sum to 1,

$$v = \left( {\begin{array}{*{20}c} {v_{1}^{2} } \\ {v_{2}^{2} } \\ \vdots \\ {v_{{n_{p} - 1}}^{2} } \\ {v_{{n_{p} }}^{2} } \\ \end{array} } \right)\begin{array}{*{20}c} {\sigma _{1}^{p} } \\ {\sigma _{2}^{p} } \\ \vdots \\ {\sigma _{{n_{p} - 1}}^{p} } \\ {\sigma _{{n_{p} }}^{p} } \\ \end{array} \quad {\text{and}}\quad w = \left( {\begin{array}{*{20}c} {w_{1}^{2} } \\ {w_{2}^{2} } \\ \vdots \\ {w_{{n_{p} - 1}}^{2} } \\ {w_{{n_{p} }}^{2} } \\ \end{array} } \right)\begin{array}{*{20}c} {\sigma _{1}^{p} } \\ {\sigma _{2}^{p} } \\ \vdots \\ {\sigma _{{n_{p} - 1}}^{p} } \\ {\sigma _{{n_{p} }}^{p} } \\ \end{array} .$$

(1)

Since the vectors $v'$ and $w'$ have entries that sums to 1, this allows us to treat the values in $v'$ and $w'$ as two probability measures $m_1$ and $m_2$ respectively. Now, we can define the pairwise distance between any two cohomology generators v and w to be the Wasserstein distance^66,67 between $m_1$ and $m_2$.

Definition 3

(Wasserstein distance) The probability measures $m_1$ and $m_2$ are the probability distributions obtained from the entries of $v'$ and $w'$ as defined in (2). Let $\sigma _i^p$ and $\sigma _j^p$ be p-simplices in K. Then, $\xi (\sigma _i^p, \sigma _j^p)$ represents the amount of mass traveling from $\sigma _i^p$ to $\sigma _j^p$. We require that the transportation from $m_1$ to $m_2$ is mass-preserving, i.e., $\sum _{\sigma _j^p \in K} \xi (\sigma _i^p, \sigma _j^p) = m_1(\sigma _i^p)$ and $\sum _{\sigma _i^p\in K} \xi (\sigma _i^p, \sigma _j^p) = m_2(\sigma _j^p)$. The Wasserstein distance between $m_1$ and $m_2$, denoted by $W_1(m_1, m_2)$, is the minimum traveling distance that can be achieved, given by

$$\begin{aligned} W_1(m_1, m_2) = \inf _\xi \sum _{\sigma _i^p\in K} \sum _{\sigma _j^p \in K} d(\sigma _i^p,\sigma _j^p)\xi (\sigma _i^p, \sigma _j^p). \end{aligned}$$

(3)

The distance between two simplices, denoted by $d(\sigma _i^p, \sigma _j^p)$, is defined as the minimum $\ell ^2$ distance between their corresponding vertices. Representing the sets of vertices in the simplices as $\sigma _i^p=\{x_1, x_2, \ldots , x_{p}\}$ and $\sigma _j^p=\{u_1, u_2, \ldots , u_{p}\}$, the distance is then given by

$$\begin{aligned} d(\sigma _i^p, \sigma _j^p) = \min _{\begin{array}{c} 1\le i \le p\\ 1\le j\le p \end{array}}||{\mathbf{x}}_i-{\mathbf{u}}_j||_2, \end{aligned}$$

(4)

where ${\mathbf{x}}_i$ and ${\mathbf{u}}_j$ refer to the coordinates of the vertices $x_i$ and $u_j$ in $\mathbb {R}^3$.

Cohomology-based Gromov–Hausdorff ultrametric approach for structural similarity measurement

Given two molecular structures, A and B, we construct simplicial complexes, denoted by $K_A$ and $K_B$ respectively, where vertices represent atoms. There are various approaches for constructing the simplicial complex⁶⁸. For example, we can build a Vietoris-Rips complex such that a set of atoms is a simplex in K if every atom pair in the set has a $\ell ^2$ distance smaller than a specified filtration threshold. An alternative approach is to generate an Alpha complex, where a set of vertices forms a simplex when its filtration value is smaller than the threshold. For example, the filtration value of an atom pair is one-half of their $\ell ^2$ distance, and the filtration value for a triplet of atoms is the radius of the circle that passes through it.

After constructing the simplicial complex representation of a molecular structure, cohomology generators can be computed from its 1-dimensional Hodge Laplacian matrix ${\mathbf{L}}_1$. Subsequently, we can compute a distance matrix comprised of pairwise distances between all cohomology generators from ${\mathbf{L}}_1$. Three types of distance matrices can be derived for each molecular structure, corresponding to Definition 1, 2, or 3. These types of distance measurements provide us ways to construct cohomology-based ultrametric space and the associated Gromov–Hausdorff ultrametric.

An ultrametric space $(X, d_X)$ is a metric space that satisfies the strong triangle inequality:

$$\forall x, x', x''\in X, \text { one has } d_X(x,x')\le \max (d_X(x,x''),d_X(x'', x')).$$

The Gromov–Hausdorff ultrametric⁴⁹, denoted by $u_{\text {GH}}$, measures distances between compact ultrametric spaces X and Y as follows.

Definition 4

(The Gromov–Hausdorff ultrametric⁴⁹) The Gromov–Hausdorff ultrametric $u_{\text {GH}}$ between compact ultrametric spaces X and Y is

$$\begin{aligned} u_{\text {GH}}(X,Y):= \inf d_H^Z(\varphi _X(X), \varphi _Y(Y)), \end{aligned}$$

where the infimum is taken over all ultrametric spaces Z and isometric embeddings $\varphi _X:X\hookrightarrow Z$ and $\varphi _Y:Y\hookrightarrow Z$.

In our methodology, we transform metric spaces formed by the cohomology generators into an ultrametric space denoted by $H^1(K_A)$ and $H^1(K_B)$ using Algorithm 1 in⁴⁸. This transformation allows us to compute the Gromov–Hausdorff ultrametric $u_{\text {GH}}(H^1(K_A), H^1(K_B))$ between two ultrametric spaces associated with molecular structures A and B, thereby quantifying their structural similarity or facilitating clustering tasks. An illustrative figure for the aforementioned workflow is presented in Fig. 2.

Results

Characterization of OIHP material structures

Organic–inorganic hybrid perovskite (OIHP) materials are favorable candidates for developing efficient and cost-effective solar cells. The stabilization of the OIHP structure is reliant on Van der Waals interactions and hydrogen bonding effects, which are closely associated with the distances between organic molecules and inorganic ions⁶⁹. Our $u_{\text {GH}}$-based features can be employed for characterizing OIHP material structures. To demonstrate this, we examine the clustering of Methylammonium lead halides (MAPbX$_3$, X$=$Cl, Br, I). Three possible phases of MAPbX$_3$ are considered for each X-site atom, including the orthorhombic, tetragonal, and cubic phases as illustrated in Fig. 3a,b.

We analyze 100 configurations from the molecular dynamics (MD) trajectories for each of the 9 OIHP structures, leading to 900 trajectories. Each MD trajectory arises from a molecular dynamics simulation to stabilize the initial configuration and consists of the 3D coordinates of atoms at a specific time. For each configuration, we construct an Alpha complex with 4 filtration thresholds of 3.5 Å, 4 Å, 5 Å, and 6 Å as shown in Fig. 3c and calculate its 1-dimensional co-homology generators and associated pairwise distances using $L_1$ distance as defined in Definition 1. We chose to construct alpha complexes, which are based on the Delaunay triangulation of the points, because they better reflect the underlying geometry of atomic arrangement.

We consider the tasks of clustering 3 types of atoms for X-sites for each possible phase of MAPbX$_3$, by calculating the $u_{GH}$ between the 300 configurations of interest with the same phase for each of the 4 filtration values. This process generates a GH-based statistical feature vector with a length of $300 \times 4 = 1200$ for each configuration, allowing for differentiation among the three X-sites. We compare the K-means clustering by X-site atoms obtained using $u_{\text {GH}}$-based features with the ones achieved using the 3D coordinates, and two commonly used structural fingerprints: ECFP²⁸ and MACC keys²⁹. ECFP is a circular fingerprint that encodes atomic neighborhood patterns, while MACCS Keys captures the presence of specific chemical substructures. We chose to compare these features using only structural information without including additional information such as atomic number and weight, which could potentially bias the clustering algorithm towards identifying atom types.

In Table 1, we show the Adjusted Rand Index (ARI) of K-means clustering for the above methods, where 0 indicates a random assignment of clusters and 1 represents a perfect clustering. We also plotted low-dimensional visualization of these features using UMAP⁷⁰ in Fig. 4, with the x-axis and y-axis representing the two dimensions after dimension reduction by UMAP. Directly using the 3D coordinates proves ineffective in distinguishing between various X-site atoms. Using the 3D coordinates alone proves ineffective in distinguishing between various X-site atoms as shown in low ARI. Incorporating some neighborhood information using ECFP improves the ARI but is still not ideal since the structures have similar neighborhoods pattern. MACC achieves decent clustering, although the ARI remains below 0.9 for the tetragonal structures. On the other hand, our Gromov–Hausdorff ultrametric-based methods exhibit almost perfect clustering structures by X-site atoms. The UMAP plot in Fig. 4 provides additional visual confirmation that our method achieves superior clustering in the low-dimensional projection of the data.

Table 1 Adjusted Rand Index of clustering based on different features.

Full size table

Discussion

In this work, we propose a new framework that employs the cohomology-based Gromov–Hausdorff ultrametric ($u_{GH}$) for quantifying structural similarity between molecular structures. In our workflow, we build simplicial complex representations of molecular structures and compute the $u_{GH}$ between their respective cohomology generator spaces. The cohomology generators effectively encode the local topological invariants, revealing cyclic patterns formed by edges. We illustrate the application of the cohomology-based Gromov–Hausdorff distance using organic–inorganic halide perovskites (OIHP) data. In our numerical experiments, the $u_{GH}$-based approach demonstrated effectiveness in clustering various structures, achieving the highest Adjusted Rand Index compared to other structural features tested including the 3D coordinates, ECFP, and MACC.

We demonstrate the application of the cohomology-based $u_{GH}$ approach in quantifying structural similarities. However, there are some limitations to our current work. For instance, we focused solely on the cohomology generators of the first-order Hodge Laplacian, which capture loop patterns. In the future, exploring non-cohomology generators may reveal clustering information, and higher-order Hodge Laplacians could unveil more complex structures such as cavities. Another limitation is that the choice of kernel vectors for the Hodge Laplacian is not unique. Therefore, for the same structure, we may have different choices for cohomology generators. It remains an open question for future research to determine how to find the optimal cohomology generators that minimize the $u_{GH}$ between two atoms. Other potential future directions involve incorporating cohomology-based $u_{GH}$ features into machine learning models for structure design and prediction. Topological deep learning models have also demonstrated great potential in molecular property prediction and analysis⁷¹. Additionally, our numerical experiments are limited to small molecules from chemistry and physics with hundreds of atoms. A promising avenue involves applying the proposed method to larger biological molecules with tens of thousands of molecules. For instance, it could be employed to quantify structural similarities between protein structures in drug design⁷².

Data availability

Code and data are available in GitHub repository https://github.com/XueGong-git/pyGH.

References

Carlsson, G. Topology and data. Bull. Am. Math. Soc. 46, 255–308 (2009).
Article MathSciNet MATH Google Scholar
Wasserman, L. Topological data analysis. Annu. Rev. Stat. Appl. 5, 501–532 (2018).
Article MathSciNet MATH Google Scholar
Le Roux, B. & Rouanet, H. Geometric Data Analysis: From Correspondence Analysis to Structured Data Analysis (Springer Science & Business Media, 2004).
Kirby, M. Geometric Data Analysis: An Empirical Approach to Dimensionality Reduction and the Study of Patterns (Wiley, 2000).
Klette, R. & Rosenfeld, A. Digital Geometry: Geometric Methods for Digital Picture Analysis (Elsevier, 2004).
Carlsson, G. Topological pattern recognition for point cloud data. Acta Numer. 23, 289–368 (2014).
Article MathSciNet MATH Google Scholar
Gao, J. & Guibas, L. Geometric algorithms for sensor networks. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 370, 27–51 (2012).
Article ADS MathSciNet MATH Google Scholar
Bhattacharya, S. Search-based path planning with homotopy class constraints. Proc. AAAI Conf. Artif. Intell. 24, 1230–1237 (2010).
MATH Google Scholar
Skraba, P., Ovsjanikov, M., Chazal, F. & Guibas, L. Persistence-based segmentation of deformable shapes. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, 45–52 (IEEE, 2010).
Sousbie, T. The persistent cosmic web and its filamentary structure-I. Theory and implementation. Mon. Not. R. Astron. Soc. 414, 350–383 (2011).
Article ADS MATH Google Scholar
Skaf, Y. & Laubenbacher, R. Topological data analysis in biomedicine: A review. J. Biomed. Inform. 130, 104082 (2022).
Article PubMed MATH Google Scholar
Yao, Y. et al. Topological methods for exploring low-density states in biomolecular folding pathways. J. Chem. Phys. 130 (2009).
Edelsbrunner, Letscher & Zomorodian,. Topological persistence and simplification. Discrete Comput. Geom. 28, 511–533 (2002).
Article MathSciNet MATH Google Scholar
Zomorodian, A. & Carlsson, G. Computing persistent homology. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, 347–356 (2004).
Hirani, A. N. Discrete Exterior Calculus (California Institute of Technology, 2003).
Belkin, M. & Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput. 15, 1373–1396 (2003).
Article MATH Google Scholar
Ollivier, Y. Ricci curvature of Markov chains on metric spaces. J. Funct. Anal. 256, 810–864 (2009).
Article MathSciNet MATH Google Scholar
Forman,. Bochner’s method for cell complexes and combinatorial Ricci curvature. Discrete Comput. Geom. 29, 323–374 (2003).
Article MathSciNet MATH Google Scholar
Villani, C. et al. Optimal Transport: Old and New, vol. 338 (Springer, 2009).
Farin, G. Curves and Surfaces for Computer-Aided Geometric Design: A Practical Guide (Elsevier, 2014).
Jin, M., Kim, J., Luo, F. & Gu, X. Discrete surface Ricci flow. IEEE Trans. Vis. Comput. Graph. 14, 1030–1043 (2008).
Article PubMed MATH Google Scholar
Bender, A. & Glen, R. C. Molecular similarity: a key technique in molecular informatics. Organ. Biomol. Chem. 2, 3204–3218 (2004).
Article CAS MATH Google Scholar
Vilar, S. et al. Drug-drug interaction through molecular structure similarity analysis. J. Am. Med. Inform. Assoc. 19, 1066–1074 (2012).
Article PubMed PubMed Central MATH Google Scholar
Eckert, H. & Bajorath, J. Molecular similarity analysis in virtual screening: foundations, limitations and novel approaches. Drug Discov. Today 12, 225–233 (2007).
Article CAS PubMed MATH Google Scholar
Arteca, G. A., Jammal, V. B., Mezey, P. G. & Mezey, P. G. Shape group studies of molecular similarity and regioselectivity in chemical reactions. J. Comput. Chem. 9, 608–619 (1988).
Article CAS MATH Google Scholar
Petrone, P. M. et al. Rethinking molecular similarity: comparing compounds on the basis of biological activity. ACS Chem. Biol. 7, 1399–1409 (2012).
Article CAS PubMed MATH Google Scholar
Basak, S. C., Grunwald, G. D., Host, G. E., Niemi, G. J. & Bradbury, S. P. A comparative study of molecular similarity, statistical, and neural methods for predicting toxic modes of action. Environ. Toxicol. Chem. 17, 1056–1064 (1998).
Article CAS Google Scholar
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
Article CAS PubMed MATH Google Scholar
Polton, D. Installation and operational experiences with MACCS (molecular access system). Online Rev. 6, 235–242 (1982).
Article Google Scholar
Bajusz, D., Rácz, A. & Héberger, K. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations?. J. Cheminform. 7, 1–13 (2015).
Article CAS MATH Google Scholar
Todeschini, R. & Consonni, V. Handbook of Molecular Descriptors (Wiley, 2008).
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. Neural message passing for quantum chemistry. In International Conference on Machine Learning, 1263–1272 (PMLR, 2017).
Grant, J. A., Gallardo, M. A. & Pickup, B. T. A fast method of molecular shape comparison: A simple application of a gaussian description of molecular shape. J. Comput. Chem. 17, 1653–1666 (1996).
Article CAS MATH Google Scholar
Good, A. C. & Richards, W. G. Rapid evaluation of shape similarity using gaussian functions. J. Chem. Inf. Comput. Sci. 33, 112–116 (1993).
Article CAS MATH Google Scholar
Townsend, J., Micucci, C. P., Hymel, J. H., Maroulas, V. & Vogiatzis, K. D. Representation of molecular structures with persistent homology for machine learning applications in chemistry. Nat. Commun. 11, 3230 (2020).
Article ADS CAS PubMed PubMed Central MATH Google Scholar
Xia, K. & Wei, G.-W. Persistent homology analysis of protein structure, flexibility, and folding. Int. J. Numer. Methods Biomed. Eng. 30, 814–844 (2014).
Article MathSciNet MATH Google Scholar
Carlsson, G. E. et al. Characterization, stability and convergence of hierarchical clustering methods. J. Mach. Learn. Res. 11, 1425–1470 (2010).
MathSciNet MATH Google Scholar
Carlsson, G. & Mémoli, F. Persistent clustering and a theorem of J. Kleinberg. arXiv preprint arXiv:0808.2241 (2008).
Lee, H., Chung, M. K., Kang, H., Kim, B.-N. & Lee, D. S. Computing the shape of brain networks using graph filtration and Gromov-Hausdorff metric. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2011: 14th International Conference, Toronto, Canada, September 18–22, 2011, Proceedings, Part II 14, 302–309 (Springer, 2011).
Lee, H., Kang, H., Chung, M. K., Kim, B.-N. & Lee, D. S. Persistent brain network homology from the perspective of dendrogram. IEEE Trans. Med. Imaging 31, 2267–2277 (2012).
Article PubMed MATH Google Scholar
Chung, M. K., Lee, H., Solo, V., Davidson, R. J. & Pollak, S. D. Topological distances between brain networks. In Connectomics in NeuroImaging: First International Workshop, CNI 2017, Held in Conjunction with MICCAI 2017, Quebec City, QC, Canada, September 14, 2017, Proceedings 1, 161–170 (Springer, 2017).
Edwards, D. A. The structure of superspace. In Studies in Topology, 121–133 (Elsevier, 1975).
Gromov, M. Groups of polynomial growth and expanding maps (with an appendix by Jacques Tits). Publ. Math. l’IHÉS 53, 53–78 (1981).
Article MATH Google Scholar
Memoli, F. On the use of Gromov-Hausdorff distances for shape comparison. In Eurographics Symposium on Point-based Graphics (eds. Botsch, M., Pajarola, R., Chen, B. & Zwicker, M.). https://doi.org/10.2312/SPBG/SPBG07/081-090 (The Eurographics Association, 2007).
Bronstein, A. M., Bronstein, M. M., Kimmel, R., Mahmoudi, M. & Sapiro, G. A Gromov-Hausdorff framework with diffusion geometry for topologically-robust non-rigid shape matching. Int. J. Comput. Vis. 89, 266–286 (2010).
Article MATH Google Scholar
Chazal, F., Cohen-Steiner, D., Guibas, L. J., Mémoli, F. & Oudot, S. Y. Gromov-Hausdorff stable signatures for shapes using persistence. Comput. Graph. Forum 28, 1393–1403 (2009).
Article MATH Google Scholar
Mémoli, F. Some properties of Gromov-Hausdorff distances. Discrete Comput. Geom. 48, 416–440 (2012).
Article MathSciNet MATH Google Scholar
Mémoli, F., Smith, Z. & Wan, Z. The Gromov–Hausdorff distance between ultrametric spaces: its structure and computation. J. Comput. Geom. 14, 78–143 (2023).
MathSciNet MATH Google Scholar
Zarichnyi, I. Gromov–Hausdorff ultrametric. arXiv preprint arXiv:math/0511437 (2005).
Bajorath, J. Chemoinformatics: Concepts, Methods, and Tools for Drug Discovery, vol. 275 (Springer Science & Business Media, 2004).
Lo, Y. C., Rensi, S. E., Torng, W. & Altman, R. B. Machine learning in chemoinformatics and drug discovery. Drug Discov. Today 23, 1538–1546 (2018).
Article CAS PubMed PubMed Central Google Scholar
Eckmann, B. Harmonische funktionen und randwertaufgaben in einem komplex. Comment. Math. Helv. 17, 240–255 (1944).
Article MathSciNet MATH Google Scholar
Lim, L.-H. Hodge Laplacians on graphs. SIAM Rev. 62, 685–715. https://doi.org/10.1137/18M1223101 (2020).
Article MathSciNet MATH Google Scholar
Shukla, S. & Yogeshwaran, D. Spectral gap bounds for the simplicial Laplacian and an application to random complexes. J. Combin. Theory Ser. A 169, 105134 (2020).
Article MathSciNet MATH Google Scholar
Torres, J. J. & Bianconi, G. Simplicial complexes: higher-order spectral dimension and dynamics. J. Phys. Complex. 1, 015002 (2020).
Article ADS MATH Google Scholar
Muhammad, A. & Egerstedt, M. Control using higher order Laplacians in network topologies. In Proc. of 17th International Symposium on Mathematical Theory of Networks and Systems, 1024–1038 (Citeseer, 2006).
Horak, D. & Jost, J. Spectra of combinatorial Laplace operators on simplicial complexes. Adv. Math. 244, 303–336 (2013).
Article MathSciNet MATH Google Scholar
Barbarossa, S. & Sardellitti, S. Topological signal processing over simplicial complexes. IEEE Trans. Signal Process. 68, 2992–3007 (2020).
Article ADS MathSciNet MATH Google Scholar
Mukherjee, S. & Steenbergen, J. Random walks on simplicial complexes and harmonics. Random Struct. Algorithms 49, 379–405 (2016).
Article MathSciNet PubMed PubMed Central MATH Google Scholar
Parzanchevski, O. & Rosenthal, R. Simplicial complexes: spectrum, homology and random walks. Random Struct. Algorithms 50, 225–261 (2017).
Article MathSciNet MATH Google Scholar
Wei, R. K. J., Wee, J., Laurent, V. E. & Xia, K. Hodge theory-based biomolecular data analysis. Sci. Rep. 12, 9699 (2022).
Article ADS CAS PubMed PubMed Central Google Scholar
Schaub, M. T., Benson, A. R., Horn, P., Lippner, G. & Jadbabaie, A. Random walks on simplicial complexes and the normalized Hodge 1-Laplacian. SIAM Rev. 62, 353–391. https://doi.org/10.1137/18M1201019 (2020).
Article MathSciNet MATH Google Scholar
Gurnari, D. et al. Probing omics data via harmonic persistent homology. arXiv preprint arXiv:2311.06357 (2023).
Basu, S. & Cox, N. Harmonic persistent homology. SIAM J. Appl. Algebra Geom. 8, 189–224 (2024).
Article MathSciNet MATH Google Scholar
Hausdorff, F. Grundzüge der mengenlehre, vol. 7 (von Veit, 1914).
Kolouri, S., Park, S. R., Thorpe, M., Slepcev, D. & Rohde, G. K. Optimal mass transport: Signal processing and machine-learning applications. IEEE Signal Process. Mag. 34, 43–59 (2017).
Article ADS PubMed PubMed Central MATH Google Scholar
Villani, C. Topics in Optimal Transportation, vol. 58 (American Mathematical Soc., 2021).
Carlsson, E., Carlsson, G. & De Silva, V. An algebraic topological method for feature identification. Int. J. Comput. Geom. Appl. 16, 291–314 (2006).
Article MathSciNet MATH Google Scholar
Anand, D. V., Xu, Q., Wee, J., Xia, K. & Sum, T. C. Topological feature engineering for machine learning based halide perovskite materials design. npj Comput. Mater. 8, 203 (2022).
Article ADS Google Scholar
McInnes, L., Healy, J. & Melville, J. Umap: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018).
Ballester, R. et al. Attending to topological spaces: The cellular transformer. arXiv preprint arXiv:2405.14094 (2024).
Dean, P. M. Molecular Similarity in Drug Design (Springer Science & Business Media, 2012).

Download references

Acknowledgements

This work was supported in part by the Singapore Ministry of Education Academic Research fund Tier 1 grant RG16/23, Tier 2 grants MOE-T2EP20120-0010 and MOE-T2EP20221-0003. X.G. acknowledges support from NTU Presidential Postdoctoral Fellowship 023545-00001. The authors would like to thank Yipeng Zhang for his contributions to the revision of this manuscript, and Chuan-Shen Hu for his helpful suggestions.

Author information

Authors and Affiliations

Department of Mathematics, Michigan State University, East Lansing, MI, 48824, USA
JunJie Wee
Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore City, 637371, Singapore
Xue Gong & Kelin Xia
Institut für Algebra und Geometrie, Arbeitsgruppe Differentialgeometrie, Karlsruher Institut für Technologie, 76131, Karlsruhe, Germany
Wilderich Tuschmann

Authors

JunJie Wee
View author publications
Search author on:PubMed Google Scholar
Xue Gong
View author publications
Search author on:PubMed Google Scholar
Wilderich Tuschmann
View author publications
Search author on:PubMed Google Scholar
Kelin Xia
View author publications
Search author on:PubMed Google Scholar

Contributions

W.T. and K.X. designed the experiments. J.W. and X.G. conducted the experiments and analyzed the results. J.W. drafted the initial manuscript. X.G. revised the manuscript. All authors reviewed and contributed to the final manuscript.

Corresponding authors

Correspondence to Xue Gong, Wilderich Tuschmann or Kelin Xia.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Wee, J., Gong, X., Tuschmann, W. et al. A cohomology-based Gromov–Hausdorff metric approach for quantifying molecular similarity. Sci Rep 15, 10458 (2025). https://doi.org/10.1038/s41598-025-93381-y

Download citation

Received: 21 November 2024
Accepted: 06 March 2025
Published: 26 March 2025
Version of record: 26 March 2025
DOI: https://doi.org/10.1038/s41598-025-93381-y