Introduction

In the last decade, concepts, methods, and techniques from topological and geometric data analysis1,2,3,4 have found profound applications, becoming indispensable tools in many areas of applied mathematics, such as image processing5, pattern recognition6, sensor networks7, robotics8, computer graphics and vision9, cosmology10, medicine11, and computational and molecular biology12. A central technique in Topological Data Analysis (TDA) is persistent homology, which provides a robust framework for capturing multi-scale topological features in data13,14. However, while persistent homology is a powerful tool, it generally provides only limited information about the homology groups of the objects under study, often restricted to basic descriptors such as Betti numbers. In addition to topological tools, geometric theories and methods, such as discrete exterior calculus15, Laplace operators16, discrete Ricci curvatures17,18, optimal transport19, parametric surfaces20, and geometric flows21, have been widely applied in data analysis. These geometric approaches have often been shown to provide deeper insights than those achieved through purely topological methods.

A promising application of advanced geometric and topological techniques is molecular similarity analysis22, which is a fundamental concept in cheminformatics. Molecular similarity measures the resemblance between molecules based on their structure, properties, or function, enabling applications in drug design23, virtual screening24, and reaction prediction25. By assessing similarity, we can predict properties such as activity and toxicity, thereby accelerating the identification of promising compounds26,27.

Existing methods for evaluating molecular similarity often rely on structural fingerprints, property-based metrics, or topological and 3D shape representations. For instance, structural fingerprints, such as Extended Connectivity Fingerprint (ECFP)28 and MACCS keys29, encode molecular features as binary sequences, which are typically compared using Tanimoto similarity30. Property-based similarity involves comparing physicochemical attributes, such as molecular weight and polarity, using Euclidean or cosine distances31. Topological similarity, often evaluated through graph neural networks (GNNs)32, models molecular connectivity, while 3D similarity captures spatial geometry through metrics such as RMSD or Shape Tanimoto scores33,34.

Recent advances in applied and computational topology and geometry offer powerful methods to assess molecular similarity. Persistent homology and its various refinements and extensions have proven invaluable for capturing molecular topological features in diverse chemical and biological applications35,36. However, while effective in tracking the number of connected components, loops, and voids, persistent homology lacks the geometric information necessary to distinguish or identify individual features that arise during filtration. Previous studies have leveraged the connection between persistent homology and dendrograms37,38 to distinguish structures based on 0-dimensional homology groups, which correspond to connected components. These works employ geometric methods, such as the Gromov–Hausdorff distance, to measure the distance between two brain networks39,40,41. However, there is still a need for methodologies that enable a more detailed and localized characterization of loops, voids, and higher-dimensional (co)homologies.

We apply a cohomology-based Gromov–Hausdorff ultrametric method to study 1-dimensional and higher-dimensional (co)homology groups associated with loops, voids, and higher-dimensional cavity structures in simplicial complexes. The Gromov–Hausdorff distance (\(d_{GH}\))42,43 quantifies the dissimilarity between two metric spaces and possesses many desirable mathematical properties, making it widely used in differential geometry and applied algebraic topology for tasks such as shape matching, structural comparison, and characterization44,45,46. In the context of molecular structures A and B, modeled as simplicial complexes \(K_A\) and \(K_B\), respectively, their cohomology groups give rise to the metric spaces \(H^p(K_A)\) and \(H^p(K_B)\). In theory, we can then compare their Gromov–Hausdorff distances. Although many lower bounds can be achieved in polynomial time46,47, computing \(d_{GH}\) between arbitrary finite metric spaces is known to be NP-hard, as it leads to quadratic assignment problems47. To address this challenge, we introduce a crucial second step: transforming \(H^p(K_A)\) and \(H^p(K_B)\) into ultrametric spaces, thus obtaining their respective dendrogram representations37,40,48. This enables us to compute a more tractable ultrametric variation of the Gromov–Hausdorff distance, referred to as the Gromov–Hausdorff ultrametric (\(u_{GH}\))48,49, which can be used to quantify molecular similarity.

The main contributions of this work are as follows:

  1. We present a workflow to utilize a Gromov–Hausdorff ultrametric approach based on 1-dimensional and higher-dimensional cohomology for assessing structural similarity between molecules, effectively capturing local topological features such as loops and voids that influence molecular properties.

  2. We demonstrate the effectiveness of this workflow in clustering tasks using a dataset of organic–inorganic halide perovskites (OIHP).

The remainder of this paper is structured as follows: The next section introduces foundational concepts in algebraic topology, including simplicial complexes and the Hodge Laplacian, which are essential to our methodology. We then present our method for measuring structural similarity, followed by numerical experiments to validate our approach using OIHP material datasets. Finally, we discuss potential future research directions.

Background

Simplicial complexes

In many applications, molecular data from biology, chemistry, and materials science can be represented as graphs50,51. In this framework, the vertices correspond to atoms, while the edges represent affinities between pairs of atoms. Simplicial complexes generalize graphs by capturing higher-order interactions through simplices, such as triangles and tetrahedra. In this and the next section, we provide a brief introduction to the theoretical background necessary for performing computations on simplicial complexes. For a more comprehensive review, we direct readers to refs. 52,53,54,55.

If we let V be a finite set of vertices, then a p-simplex \(\sigma ^p\) is a subset of V with \(p+1\) vertices. For example, simplices of dimension 0, 1, 2, and 3 can be viewed as a point, an edge, a triangle, and a tetrahedron, respectively. More precisely, a p-simplex \(\sigma ^p = \{v_0, v_1, v_2, \ldots , v_p\}\) is defined as the convex hull of its \(p+1\) affinely independent points \(v_0, v_1, v_2, \ldots , v_p\) as follows:

$$\begin{aligned} \sigma ^p = \bigg \{\lambda _0v_0+\lambda _1v_1+\cdots +\lambda _pv_p \bigg |\sum _{i=0}^p \lambda _i = 1;\forall i, 0\le \lambda _i \le 1\bigg \}. \end{aligned}$$

An m-face of a p-simplex \(\sigma ^p\) is defined as the convex hull formed by \(m+1\) vertices of \(\sigma ^p\), where \(m < p\). If \(\sigma ^{m}\) is a face of \(\sigma ^p\), denoted by \(\sigma ^{m} \subset \sigma ^p\), then \(\sigma ^p\) is also referred to as a coface of \(\sigma ^{m}\). The upper degree of a p-simplex \(\sigma ^p\), denoted by \(deg(\sigma ^p)\), is the number of \((p+1)\)-simplices for which \(\sigma ^p\) is a face. Two simplices \(\sigma _1^p\) and \(\sigma _2^p\) are upper adjacent, denoted by \(\sigma _1^p \frown \sigma _2^p\), if they have a common coface. They are lower adjacent, denoted by \(\sigma _1^p \smile \sigma _2^p\), if they share a common face. To facilitate computation, each simplex is assigned an orientation associated with the ordering of its vertices. If two p-simplices \(\sigma _1^p\) and \(\sigma _2^p\) are oriented similarly, then we write \(\sigma _1^p \sim \sigma _2^p\). Conversely, we write \(\sigma _1^p \not \sim \sigma _2^p\) if they have opposite orientations.

A p-dimensional simplicial complex K contains up to p-dimensional simplices and satisfies two conditions. First, any face of a simplex from K is also in K. Second, the intersection of any two simplices in K is either empty or a shared face. There are many common types of simplicial complexes, such as the Vietoris-Rips complex, Čech complex, Alpha complex, and clique complex.

The combinatorial Hodge Laplacian

For computational purposes, a discrete analogue of the Hodge Laplacian, known as the combinatorial (or discrete) Hodge Laplacian, has been proposed52,54,55,56,57,58,59,60. This discrete version can be viewed as part of discrete exterior calculus and discrete differential geometry. The combinatorial Hodge Laplacian is a part of Hodge theory, which has recently been applied to biomolecular data analysis61. Hodge Laplacian matrices of different dimensions can be constructed on a simplicial complex. The p-th dimensional Hodge Laplacian matrix characterizes topological connections between p-simplices within a simplicial complex52,53,62. In particular, the graph Laplacian characterizes relations between vertices (0-simplices).

The p-th chain group \(C_p(K)\) of a simplicial complex K over some field \(\mathbb {F}\) is a vector space over \(\mathbb {F}\) whose basis is the set of p-simplices of the simplicial complex K. Elements of \(C_p(K)\) are called p-chains. The dual of \(C_p(K),\) denoted by \(C^p(K),\) is the set of all linear functionals on \(C_p(K)\):

$$C^p(K)=\big \{\phi : C_p(K) \rightarrow \mathbb {F}\, : \, \phi \text { is linear} \big \}.$$

\(C^p(K)\) is called the p-th cochain group and its elements are called p-cochains. Boundary operators are defined on both the chain and cochain groups. The boundary map \(\partial _p\) is a linear transformation which acts on a p-simplex \(\sigma ^p=[u_0,u_1,\ldots ,u_p]\) as follows

$$\partial _p([u_0,u_1,\ldots ,u_p])=\sum _{i=0}^p(-1)^i[u_0,\ldots ,u_{i-1},u_{i+1},\ldots ,u_p].$$
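As a short worked example of the boundary formula (added for illustration, using only the definition above), take a single oriented triangle \([u_0,u_1,u_2]\). Applying \(\partial _2\) and then \(\partial _1\) shows that every vertex cancels, which is the identity \(\partial _{p-1}\circ \partial _p = 0\) that makes the chain complex below well defined:

$$\begin{aligned} \partial _2([u_0,u_1,u_2])&= [u_1,u_2]-[u_0,u_2]+[u_0,u_1], \\ \partial _1(\partial _2([u_0,u_1,u_2]))&= ([u_2]-[u_1])-([u_2]-[u_0])+([u_1]-[u_0]) = 0. \end{aligned}$$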

The coboundary map \(\delta _p:C^p(K) \rightarrow C^{p+1}(K)\) is a linear transformation defined as follows: for a linear functional \(\phi \in C^p(K)\) and a \((p+1)\)-simplex \(\sigma ^{p+1}=[u_0,u_1,\ldots ,u_{p+1}]\),

$$\delta _p(\phi )(\sigma ^{p+1})=\sum _{i=0}^{p+1}(-1)^i\phi ([u_0,\ldots ,u_{i-1},u_{i+1},\ldots ,u_{p+1}]).$$

The boundary map gives rise to a chain complex, which is a sequence of chain groups connected by boundary maps as follows:

$$0 \rightarrow C_{n}(K)\rightarrow \cdots \xrightarrow {\partial _{p+1}}C_p(K)\xrightarrow {\partial _p} C_{p-1}(K)\cdots \xrightarrow {\partial _2}C_1(K)\xrightarrow {\partial _1}C_0(K)\rightarrow 0.$$

Similar to the boundary map giving rise to the chain complex, the coboundary operator gives rise to a cochain complex:

$$0 \leftarrow C^{n}(K)\leftarrow \cdots \xleftarrow {\delta _{p}}C^p(K)\xleftarrow {\delta _{p-1}} C^{p-1}(K)\cdots \xleftarrow {\delta _1}C^1(K)\xleftarrow {\delta _0}C^0(K)\leftarrow 0.$$

Since \(C_p(K)\) and \(C^p(K)\) are finite-dimensional, once bases are fixed there exist unique matrix representations for \(\partial _p\) and \(\delta _p\). We have some useful relations regarding matrix representations of \(\partial _p\) and \(\delta _p\) (\(A^T\) represents the transpose of a matrix A):

  • For all \(p \ge 0\), \(\partial _{p+1}^T=\delta _p\),

  • \(\partial _p^T=\partial _p^*\),

  • \(\delta _p^T=\delta _p^*.\)

Here, \(\delta _p^*:C^{p+1}(K)\rightarrow C^p(K)\) is the adjoint/transpose map of \(\delta _p\) where

$$\langle \delta _p(f), g \rangle =\langle f, \delta _p^*(g) \rangle ,$$

for every \(f \in C^p(K)\), \(g \in C^{p+1}(K)\) and a suitable inner product \(\langle \cdot , \cdot \rangle\) for \(C^p(K)\) and \(C^{p+1}(K).\) The adjoint of the boundary operator \(\partial _p\), \(\partial _p^*\) is also defined analogously. Using the cochain group \(C^p(K)\), we can define the p-th cocycle group \(Z^p\) and p-th coboundary group \(B^p\) as follows:

$$\begin{aligned} Z^p = \text {ker}(\delta _p) = \{c \in C^p | \delta _p(c) = 0\}, \\ B^p = \text {im}(\delta _{p-1}) = \{c\in C^p | \exists d \in C^{p-1}: c = \delta _{p-1}(d) \}. \end{aligned}$$

Then we have the p-th cohomology group \(H^p = Z^p/B^p\).

The p-dimensional combinatorial Hodge Laplacian is the linear operator \(\Delta _p:C^p(K) \rightarrow C^p(K)\) defined as follows:

$$\Delta _p= {\left\{ \begin{array}{ll} \delta _p^*\circ \delta _p+\delta _{p-1}\circ \delta _{p-1}^* & \text {if } p \ge 1, \\ \delta _p^*\circ \delta _p & \text {if } p=0. \end{array}\right. }$$

The case where \(p=0\) gives rise to the expression of the well-known graph Laplacian. Alternatively, the combinatorial Hodge Laplacian matrix can be expressed using the boundary matrix. The boundary operator \(\partial _p\) has a unique matrix representation. Given a simplicial complex K, the p-th boundary matrix \({\mathbf{B}}_p\) is defined as,

$$({\mathbf{B}}_p)_{ij}= {\left\{ \begin{array}{ll} 1 & \text {if } \sigma _i^{p-1} \subset \sigma _j^{p} ~\text {and }~ \sigma _i^{p-1} \sim \sigma _j^{p}, \\ -1 & \text {if } \sigma _i^{p-1} \subset \sigma _j^{p} ~\text {and }~ \sigma _i^{p-1} \not \sim \sigma _j^{p}, \\ 0 & \text {if } \sigma _i^{p-1} \not \subset \sigma _j^{p}, \end{array}\right. }$$

where \(\sigma _i^{p-1}\) is the i-th \((p-1)\)-simplex and \(\sigma _j^{p}\) is the j-th p-simplex.

Given that the highest order of the simplicial complex K is n, the p-th Hodge Laplacian (or combinatorial Laplacian) matrix \({\mathbf{L}}_p\) of K is

$${\mathbf{L}}_p= {\left\{ \begin{array}{ll} {\mathbf{B}}_n^T{\mathbf{B}}_n & \text {if } p=n, \\ {\mathbf{B}}_p^T{\mathbf{B}}_p+{\mathbf{B}}_{p+1}{\mathbf{B}}_{p+1}^T & \text {if } 1 \le p < n, \\ {\mathbf{B}}_1{\mathbf{B}}_1^T & \text {if } p=0. \end{array}\right. }$$
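The matrix formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names (`boundary_matrix`, `hodge_laplacian`) and the representation of a complex as a list `K` with `K[q]` holding the q-simplices as sorted vertex tuples are assumptions made here. Orientation is taken from the sorted vertex order, so the sign of each entry of \({\mathbf{B}}_p\) is the \((-1)^i\) factor of the boundary map.

```python
def boundary_matrix(faces, simplices):
    """B_p: rows index (p-1)-simplices, columns index p-simplices."""
    row = {f: i for i, f in enumerate(faces)}
    B = [[0] * len(simplices) for _ in faces]
    for j, s in enumerate(simplices):
        for i in range(len(s)):
            # Dropping the i-th vertex gives a face with sign (-1)^i.
            B[row[s[:i] + s[i + 1:]]][j] = (-1) ** i
    return B


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(r, c)) for c in zip(*B)] for r in A]


def hodge_laplacian(p, K):
    """L_p = B_p^T B_p (if p >= 1) + B_{p+1} B_{p+1}^T (if p < n)."""
    n = len(K) - 1  # highest simplex dimension present in K
    size = len(K[p])
    L = [[0] * size for _ in range(size)]
    terms = []
    if p >= 1:
        Bp = boundary_matrix(K[p - 1], K[p])
        terms.append(matmul(transpose(Bp), Bp))
    if p < n:
        Bq = boundary_matrix(K[p], K[p + 1])
        terms.append(matmul(Bq, transpose(Bq)))
    for T in terms:
        L = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(L, T)]
    return L


# Hypothetical toy complexes: a hollow triangle and a filled triangle.
hollow = [[(0,), (1,), (2,)], [(0, 1), (0, 2), (1, 2)]]
filled = hollow + [[(0, 1, 2)]]

L0 = hodge_laplacian(0, hollow)         # graph Laplacian of the 3-cycle
L1_hollow = hodge_laplacian(1, hollow)  # no 2-simplices: only B_1^T B_1
L1_filled = hodge_laplacian(1, filled)  # adding the triangle changes L_1
```

For the hollow triangle the diagonal entries of `L1_hollow` equal 2, matching \(\text{deg}(\sigma _i^p)+p+1\) with upper degree 0 and \(p=1\) in the simplex-relation form given next.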

Another way to define the Hodge Laplacian is through simplex relations. When \(p=0\),

$$({\mathbf{L}}_0)_{ij}= {\left\{ \begin{array}{ll} \text {deg}(\sigma _i^{0}) & \text {if } i=j, \\ -1 & \text {if } i \ne j~\text {and } \sigma _i^{0} \frown \sigma _j^{0}, \\ 0 & \text {if } i \ne j~\text {and } \sigma _i^{0} \not \frown \sigma _j^{0}, \end{array}\right. }$$

and \({\mathbf{L}}_0\) is exactly the same as the graph Laplacian matrix. When \(p>0\),

$$({\mathbf{L}}_p)_{ij}= {\left\{ \begin{array}{ll} \text {deg}(\sigma _i^{p})+ p+1 \quad & \text {if } i=j, \\ 1 \qquad & \text {if } i \ne j, \, \sigma _i^{p} \not \frown \sigma _j^{p}, \sigma _i^{p} \smile \sigma _j^{p}~\text {and } \sigma _i^p \sim \sigma _j^p, \\ -1 \quad & \text {if } i \ne j, \, \sigma _i^{p} \not \frown \sigma _j^{p}, \sigma _i^{p} \smile \sigma _j^{p}~\text {and } \sigma _i^p \not \sim \sigma _j^p, \\ 0 \qquad & \text {if } i \ne j, \, \sigma _i^{p} \frown \sigma _j^{p} ~\text {or }~ \sigma _i^{p} \not \smile \sigma _j^{p}. \end{array}\right. }$$

Mathematically, the eigenvalues of Hodge Laplacian matrices are independent of the choice of the orientation57. The eigenspectrum of Hodge Laplacian matrices reveals topological information within simplicial complexes. For instance, the multiplicity of zero eigenvalues of \({\mathbf{L}}_p\) corresponds to the Betti numbers \(\beta _p\), reflecting the number of connected components (\(\beta _0\)), the number of cycles (\(\beta _1\)), the number of cavities (\(\beta _2\)), etc.
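To make the link between zero eigenvalues and Betti numbers concrete, the sketch below counts them without an eigensolver. Because \({\mathbf{L}}_p\) is symmetric, the multiplicity of the eigenvalue 0 equals the dimension of its kernel, i.e. the matrix size minus its rank, and the rank can be computed exactly with rational Gaussian elimination. The function names and the hard-coded example matrices (a hollow and a filled triangle) are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction


def rank(M):
    """Exact rank of an integer matrix via forward Gaussian elimination."""
    A = [[Fraction(x) for x in row] for row in M]
    r = 0
    ncols = len(A[0]) if A else 0
    for c in range(ncols):
        # Find a pivot in column c at or below row r.
        piv = next((i for i in range(r, len(A)) if A[i][c] != 0), None)
        if piv is None:
            continue
        A[r], A[piv] = A[piv], A[r]
        for i in range(r + 1, len(A)):
            f = A[i][c] / A[r][c]
            if f != 0:
                A[i] = [a - f * b for a, b in zip(A[i], A[r])]
        r += 1
    return r


def betti_from_laplacian(L):
    """Multiplicity of the zero eigenvalue of a symmetric matrix L."""
    return len(L) - rank(L)


# L_0 and L_1 of a hollow triangle (vertices 0,1,2; edges 01, 02, 12).
L0_hollow = [[2, -1, -1], [-1, 2, -1], [-1, -1, 2]]
L1_hollow = [[2, 1, -1], [1, 2, 1], [-1, 1, 2]]
# L_1 after filling in the 2-simplex (0,1,2): the loop disappears.
L1_filled = [[3, 0, 0], [0, 3, 0], [0, 0, 3]]

beta0 = betti_from_laplacian(L0_hollow)
beta1_hollow = betti_from_laplacian(L1_hollow)
beta1_filled = betti_from_laplacian(L1_filled)
```

In floating-point practice one would instead threshold small eigenvalues, and the threshold becomes a modelling choice; the exact-rank route avoids that for small integer examples.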

Furthermore, the eigenvectors corresponding to the zero eigenvalues of \({\mathbf{L}}_p\), also referred to as cohomology generators, can be used to represent the p-dimensional cohomologies. At the algebraic level, choosing a representative for an element in cohomology is inherently problematic due to its quotient structure. However, by leveraging the Hodge decomposition theorem with the inner product structure on the cochain complex, we obtain \(\operatorname {ker}(\delta _p) = \operatorname {im}(\delta _{p-1}) \oplus \operatorname {ker}({\mathbf {L}}_p)\), where \(\operatorname {ker}({\mathbf {L}}_p)\) is orthogonal to \(\operatorname {im}(\delta _{p-1})\). This identification induces the canonical isomorphism \(H^p(K) \simeq (\operatorname {im}(\delta _{p-1}) \oplus \operatorname {ker}({\mathbf {L}}_p))/\operatorname {im}(\delta _{p-1}) \simeq \operatorname {ker}({\mathbf {L}}_p)\). In particular, every cohomology class in \(H^p(K)\) can be uniquely represented by an element in \(\operatorname {ker}({\mathbf {L}}_p)\)53,63,64. For any cohomology generator v of \({\mathbf{L}}_p\), we can arrange the entries of v to match the order of simplices in K. Specifically, the i-th entry \(v_i\) corresponds to the p-simplex \(\sigma _i^p\) in K. This allows us to visualize v on a simplicial complex.

The Gromov–Hausdorff distance

The Gromov–Hausdorff distance measures the distance between two compact metric spaces42,43. It measures the smallest distance at which two compact metric spaces can be considered “close”, which is particularly useful when the overall shape matching is more relevant than exact pointwise alignment. Given two metric spaces X and Y, the Gromov–Hausdorff distance looks for all possible isometric embeddings of X and Y into a common metric space Z and then calculates the Hausdorff distance65 between these embeddings within Z. Formally, the Gromov–Hausdorff distance \(d_{GH}\)42,43 between two compact metric spaces X and Y is defined as

$$\begin{aligned} d_{GH}(X,Y) = \inf d_H^Z(\varphi (X),\psi (Y)), \end{aligned}$$

where the infimum is taken over all metric spaces \(Z\in \mathcal {M}\) (with \(\mathcal {M}\) denoting the collection of all metric spaces) and all isometric embeddings \(\varphi :X\hookrightarrow Z\) and \(\psi :Y\hookrightarrow Z\), and \(d^Z_H\) denotes the Hausdorff distance in Z.

Method

Cohomology generator-based distance metrics

For a given simplicial complex K, we denote the space consisting of all the p-dimensional cohomology generators arising from its Hodge Laplacian \({\mathbf{L}}_p\) by \(H^p(K)\). Subsequently, we want to construct a metric space from \(H^p(K)\) by assigning distances between cohomology generators in \(H^p(K)\). We will introduce three types of distance measures to construct the metric space: \(l^1\) distance, cocycle distance, and Wasserstein distance.

Throughout this section, we assume that any two cohomology generators v and w from a Hodge Laplacian matrix \({\mathbf{L}}_p\) must have a consistent order of entries. In other words, we can write

$$v = \left( \begin{array}{c} v_{1} \\ v_{2} \\ \vdots \\ v_{n_{p}-1} \\ v_{n_{p}} \end{array} \right) \begin{array}{c} \sigma _{1}^{p} \\ \sigma _{2}^{p} \\ \vdots \\ \sigma _{n_{p}-1}^{p} \\ \sigma _{n_{p}}^{p} \end{array} \quad \text {and}\quad w = \left( \begin{array}{c} w_{1} \\ w_{2} \\ \vdots \\ w_{n_{p}-1} \\ w_{n_{p}} \end{array} \right) \begin{array}{c} \sigma _{1}^{p} \\ \sigma _{2}^{p} \\ \vdots \\ \sigma _{n_{p}-1}^{p} \\ \sigma _{n_{p}}^{p} \end{array} ,$$

where \(v_i\) and \(w_i\) are entries in v and w, respectively, corresponding to the simplex \(\sigma _i^p\) for all \(1\le i \le n_p\). Furthermore, we assume that any cohomology generator in \(H^p(K)\) is normalized.

\(l^1\) distance

First, we define the \(l^1\) distance between two cohomology generators as follows.

Definition 1

(\(l^1\) distance) Let v and w be two cohomology generators from \(H^p(K)\) where K is a simplicial complex. Then the \(l^1\) distance between v and w is

$$\begin{aligned} \Vert v-w\Vert _1 = \sum _{i=1}^{n_p} |v_{i}-w_{i}|. \end{aligned}$$

Note that in computation, the eigenvectors may have arbitrary signs. To eliminate the sign ambiguity introduced by the solver, we enforce that the first element of each cohomology generator is non-negative.
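A minimal sketch of Definition 1 together with the sign convention just described is given below. The helper names `fix_sign` and `l1_distance` and the tolerance `tol` are assumptions introduced here for illustration. Note that if the first entry of a generator is (numerically) zero, this convention does not resolve the sign, and a further tie-breaking rule would be needed.

```python
def fix_sign(v, tol=1e-12):
    """Flip v if needed so that its first entry is non-negative."""
    return [-x for x in v] if v[0] < -tol else list(v)


def l1_distance(v, w):
    """l^1 distance between two generators indexed by the same simplices."""
    if len(v) != len(w):
        raise ValueError("generators must share the same simplex ordering")
    v, w = fix_sign(v), fix_sign(w)
    return sum(abs(a - b) for a, b in zip(v, w))


# A generator and its negation describe the same cocycle up to sign.
same = l1_distance([0.6, 0.8], [-0.6, -0.8])
# Two generators supported on different edges.
different = l1_distance([1.0, 0.0], [0.0, 1.0])
```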

Cocycle distance

As the \(l^1\) norm of a cohomology generator tends to be larger when the cocycle contains more edges, it serves as a rough indicator of the size of the cocycle. Therefore, it is natural to consider another type of distance that measures the absolute difference between the \(l^1\) norms of two cohomology generators. We refer to this as the cocycle distance.

Fig. 1

Illustration of the catalyst 1_iv with its \({\mathbf{L}}_0\) and \({\mathbf{L}}_1\) cohomology generators. The \({\mathbf{L}}_0\) cohomology generator assigns equal values to all the 0-simplices, as it corresponds to \(\beta _0\), which counts the connected components of the catalyst. The \({\mathbf{L}}_1\) matrix has 7 cohomology generators in total, each representing a unique cocycle within the catalyst.

Definition 2

(Cocycle distance) Let v and w be two cohomology generators in \(H^p(K)\) where K is a simplicial complex. Their cocycle distance \(d_s(v, w)\) is defined as the absolute difference in their \(l^1\) norms:

$$\begin{aligned} d_s(v, w) = |\Vert v\Vert _{1}-\Vert w\Vert _{1}|. \end{aligned}$$
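Definition 2 translates directly into code. The sketch below (function name `cocycle_distance` is an assumption) also makes one property visible: two different generators with equal \(l^1\) norms are at cocycle distance zero, so \(d_s\) behaves as a pseudometric on the generators rather than separating all of them.

```python
def l1_norm(v):
    return sum(abs(x) for x in v)


def cocycle_distance(v, w):
    """Absolute difference of the l^1 norms of two generators."""
    return abs(l1_norm(v) - l1_norm(w))


# A unit vector concentrated on one edge versus one spread over two edges.
concentrated = [1.0, 0.0]
spread = [0.6, 0.8]
d = cocycle_distance(concentrated, spread)

# Distinct generators with the same l^1 norm are not separated by d_s.
d_zero = cocycle_distance([1.0, 0.0], [0.0, -1.0])
```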

For illustration, consider the BINOL-phosphoramide 1_iv, a BINOL (1,1’-bi-2-naphthol)-based phosphoric acid catalyst. We construct the atomic level simplicial complex representation of the molecule, where each vertex represents an atom, and generate the corresponding Hodge Laplacian matrices \({\mathbf{L}}_0\) and \({\mathbf{L}}_1\). The cohomology generators from \({\mathbf{L}}_0\) and \({\mathbf{L}}_1\) are depicted in Fig. 1, where a simplex \(\sigma _i^p\) is colored a darker blue the larger its associated \(|v_i|\) in the cohomology generator. \({\mathbf{L}}_0\) has only one cohomology generator, representing the single connected component of the catalyst structure; this generator therefore assigns equal values to all vertices. On the other hand, \({\mathbf{L}}_1\) produces 7 cohomology generators, shown in dark blue in Fig. 1, each corresponding to a unique cocycle. In addition, the eigenvectors associated with non-zero eigenvalues (non-cohomology generators) reveal local and global clustering patterns. For example, Fig. 1 shows the Fiedler vector, i.e., the eigenvector corresponding to the smallest non-zero eigenvalue of the Hodge Laplacian matrix, which clusters the vertices into two groups colored in blue and red.

Wasserstein distance

Recall from the beginning of this section that any cohomology generator v in \(H^p(K)\) is assumed to be normalized. Hence, given any two cohomology generators v and w in \(H^p(K)\), we can square all the entries of v and w to obtain vectors \(v'\) and \(w'\), whose entries sum to 1,

$$v' = \left( \begin{array}{c} v_{1}^{2} \\ v_{2}^{2} \\ \vdots \\ v_{n_{p}-1}^{2} \\ v_{n_{p}}^{2} \end{array} \right) \begin{array}{c} \sigma _{1}^{p} \\ \sigma _{2}^{p} \\ \vdots \\ \sigma _{n_{p}-1}^{p} \\ \sigma _{n_{p}}^{p} \end{array} \quad \text {and}\quad w' = \left( \begin{array}{c} w_{1}^{2} \\ w_{2}^{2} \\ \vdots \\ w_{n_{p}-1}^{2} \\ w_{n_{p}}^{2} \end{array} \right) \begin{array}{c} \sigma _{1}^{p} \\ \sigma _{2}^{p} \\ \vdots \\ \sigma _{n_{p}-1}^{p} \\ \sigma _{n_{p}}^{p} \end{array} .$$
(2)

Since the vectors \(v'\) and \(w'\) have non-negative entries that sum to 1, we can treat the values in \(v'\) and \(w'\) as two probability measures \(m_1\) and \(m_2\), respectively. Now, we can define the pairwise distance between any two cohomology generators v and w to be the Wasserstein distance66,67 between \(m_1\) and \(m_2\).

Definition 3

(Wasserstein distance) The probability measures \(m_1\) and \(m_2\) are the probability distributions obtained from the entries of \(v'\) and \(w'\) as defined in (2). Let \(\sigma _i^p\) and \(\sigma _j^p\) be p-simplices in K. Then, \(\xi (\sigma _i^p, \sigma _j^p)\) represents the amount of mass traveling from \(\sigma _i^p\) to \(\sigma _j^p\). We require that the transportation from \(m_1\) to \(m_2\) is mass-preserving, i.e., \(\sum _{\sigma _j^p \in K} \xi (\sigma _i^p, \sigma _j^p) = m_1(\sigma _i^p)\) and \(\sum _{\sigma _i^p\in K} \xi (\sigma _i^p, \sigma _j^p) = m_2(\sigma _j^p)\). The Wasserstein distance between \(m_1\) and \(m_2\), denoted by \(W_1(m_1, m_2)\), is the minimum traveling distance that can be achieved, given by

$$\begin{aligned} W_1(m_1, m_2) = \inf _\xi \sum _{\sigma _i^p\in K} \sum _{\sigma _j^p \in K} d(\sigma _i^p,\sigma _j^p)\xi (\sigma _i^p, \sigma _j^p). \end{aligned}$$
(3)

The distance between two simplices, denoted by \(d(\sigma _i^p, \sigma _j^p)\), is defined as the minimum \(\ell ^2\) distance between their vertices. Since a p-simplex has \(p+1\) vertices, we represent the vertex sets of the simplices as \(\sigma _i^p=\{x_1, x_2, \ldots , x_{p+1}\}\) and \(\sigma _j^p=\{u_1, u_2, \ldots , u_{p+1}\}\), and the distance is then given by

$$\begin{aligned} d(\sigma _i^p, \sigma _j^p) = \min _{\begin{array}{c} 1\le k \le p+1\\ 1\le l\le p+1 \end{array}}||{\mathbf{x}}_k-{\mathbf{u}}_l||_2, \end{aligned}$$
(4)

where \({\mathbf{x}}_k\) and \({\mathbf{u}}_l\) refer to the coordinates of the vertices \(x_k\) and \(u_l\) in \(\mathbb {R}^3\).
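The two ingredients of Definition 3, the probability measures obtained by squaring the entries of a generator and the ground distance between simplices, can be sketched as follows. The names `to_measure`, `simplex_distance`, and `ground_cost`, as well as the toy coordinates, are assumptions made for illustration. Solving the transport problem itself is left to an optimal transport or linear programming solver, which is not shown here.

```python
import math
from itertools import product


def to_measure(v):
    """Square the entries of a generator and renormalise to sum to 1."""
    squared = [x * x for x in v]
    total = sum(squared)
    return [s / total for s in squared]


def simplex_distance(s, t, coords):
    """Minimum Euclidean distance between a vertex of s and a vertex of t."""
    return min(math.dist(coords[a], coords[b]) for a, b in product(s, t))


def ground_cost(simplices, coords):
    """Matrix of pairwise simplex distances used as transport cost."""
    return [[simplex_distance(s, t, coords) for t in simplices]
            for s in simplices]


# Hypothetical atom coordinates (in angstroms) and three edges.
coords = {0: (0.0, 0.0, 0.0), 1: (1.0, 0.0, 0.0),
          2: (4.0, 0.0, 0.0), 3: (4.0, 3.0, 0.0)}
edges = [(0, 1), (1, 2), (2, 3)]

m1 = to_measure([0.6, -0.8, 0.0])
cost = ground_cost(edges, coords)
```

With this definition, two simplices that share a vertex are at distance zero, so moving mass between adjacent edges costs nothing; only transport between vertex-disjoint simplices contributes to \(W_1\).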

Cohomology-based Gromov–Hausdorff ultrametric approach for structural similarity measurement

Given two molecular structures, A and B, we construct simplicial complexes, denoted by \(K_A\) and \(K_B\) respectively, where vertices represent atoms. There are various approaches for constructing the simplicial complex68. For example, we can build a Vietoris-Rips complex such that a set of atoms is a simplex in K if every atom pair in the set has an \(\ell ^2\) distance smaller than a specified filtration threshold. An alternative approach is to generate an Alpha complex, where a set of vertices forms a simplex when its filtration value is smaller than the threshold. For example, the filtration value of an atom pair is one-half of their \(\ell ^2\) distance, and the filtration value for a triplet of atoms is the radius of the circle that passes through them.
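The Vietoris-Rips rule described above can be sketched with a brute-force enumeration. This is only suitable for small point sets and is not how a production library would do it; the function name `vietoris_rips`, the `max_dim` cutoff, and the toy coordinates are assumptions made here. The Alpha complex used in the experiments requires a Delaunay triangulation and is not covered by this sketch.

```python
import math
from itertools import combinations


def vietoris_rips(points, threshold, max_dim=2):
    """Return {p: list of p-simplices} as sorted tuples of point indices.

    A set of points forms a simplex when every pair in the set is
    closer than `threshold` (strict inequality, as in the text).
    """
    n = len(points)
    close = {(i, j) for i, j in combinations(range(n), 2)
             if math.dist(points[i], points[j]) < threshold}
    complex_ = {0: [(i,) for i in range(n)]}
    for p in range(1, max_dim + 1):
        complex_[p] = [s for s in combinations(range(n), p + 1)
                       if all(pair in close for pair in combinations(s, 2))]
    return complex_


# Hypothetical coordinates: three nearby atoms and one far away.
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 5.0, 5.0)]

K_small = vietoris_rips(atoms, threshold=1.2)  # two edges, no triangle
K_large = vietoris_rips(atoms, threshold=1.5)  # the triangle is filled in
```

Raising the threshold from 1.2 to 1.5 adds the edge of length \(\sqrt{2}\) and, with it, the triangle, which illustrates how the filtration value controls which loops are present.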

After constructing the simplicial complex representation of a molecular structure, cohomology generators can be computed from its 1-dimensional Hodge Laplacian matrix \({\mathbf{L}}_1\). Subsequently, we can compute a distance matrix comprised of pairwise distances between all cohomology generators from \({\mathbf{L}}_1\). Three types of distance matrices can be derived for each molecular structure, corresponding to Definitions 1, 2, and 3. These distance measurements provide ways to construct a cohomology-based ultrametric space and the associated Gromov–Hausdorff ultrametric.

An ultrametric space \((X, d_X)\) is a metric space that satisfies the strong triangle inequality:

$$\forall x, x', x''\in X, \text { one has } d_X(x,x')\le \max (d_X(x,x''),d_X(x'', x')).$$

The Gromov–Hausdorff ultrametric49, denoted by \(u_{\text {GH}}\), measures distances between compact ultrametric spaces X and Y as follows.

Definition 4

(The Gromov–Hausdorff ultrametric49) The Gromov–Hausdorff ultrametric \(u_{\text {GH}}\) between compact ultrametric spaces X and Y is

$$\begin{aligned} u_{\text {GH}}(X,Y):= \inf d_H^Z(\varphi _X(X), \varphi _Y(Y)), \end{aligned}$$

where the infimum is taken over all ultrametric spaces Z and isometric embeddings \(\varphi _X:X\hookrightarrow Z\) and \(\varphi _Y:Y\hookrightarrow Z\).
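For very small finite ultrametric spaces, \(u_{\text {GH}}\) can be computed by brute force. The sketch below does not use the definition directly; it relies on a structural characterization from the literature on the Gromov–Hausdorff ultrametric, which is an assumption imported here and not stated in this paper: \(u_{\text {GH}}(X,Y)\) is the smallest \(t\ge 0\) such that the quotients of X and Y obtained by merging all points at distance at most t are isometric. All function names are illustrative, the isometry test enumerates permutations (factorial cost), and distances are compared with exact equality, so it is meant for toy inputs only.

```python
from itertools import permutations


def quotient(U, t):
    """Merge points of the ultrametric matrix U that are within t."""
    reps = []
    for i in range(len(U)):
        # In an ultrametric space "distance <= t" is an equivalence
        # relation, so one representative per class suffices.
        if all(U[i][r] > t for r in reps):
            reps.append(i)
    return [[U[a][b] for b in reps] for a in reps]


def isometric(A, B):
    """Brute-force check whether two finite metric matrices are isometric."""
    n = len(A)
    if n != len(B):
        return False
    return any(all(A[i][j] == B[p[i]][p[j]]
                   for i in range(n) for j in range(n))
               for p in permutations(range(n)))


def u_gh(UX, UY):
    """Smallest t at which the t-quotients of UX and UY are isometric."""
    candidates = sorted({0} | {d for U in (UX, UY) for row in U for d in row})
    for t in candidates:
        if isometric(quotient(UX, t), quotient(UY, t)):
            return t


# Toy ultrametric spaces (dendrograms with two or three leaves).
X = [[0, 1], [1, 0]]
Y = [[0, 3], [3, 0]]
W = [[0, 1, 2], [1, 0, 2], [2, 2, 0]]
V = [[0, 2], [2, 0]]

d_XY = u_gh(X, Y)
d_WV = u_gh(W, V)
```

The first example agrees with a direct argument from Definition 4: if the Hausdorff distance in an ultrametric Z were below 3, the strong triangle inequality would force the two points of Y to be closer than 3.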

In our methodology, we transform the metric spaces formed by the cohomology generators into ultrametric spaces, which we continue to denote by \(H^1(K_A)\) and \(H^1(K_B)\), using Algorithm 1 in48. This transformation allows us to compute the Gromov–Hausdorff ultrametric \(u_{\text {GH}}(H^1(K_A), H^1(K_B))\) between the two ultrametric spaces associated with molecular structures A and B, thereby quantifying their structural similarity or facilitating clustering tasks. An illustrative figure for the aforementioned workflow is presented in Fig. 2.
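One standard way to turn a finite metric space into an ultrametric one is the single-linkage (subdominant) ultrametric, where the new distance between two points is the smallest possible "largest hop" over all paths joining them; this is the ultrametric read off a single-linkage dendrogram. Whether this coincides with Algorithm 1 of the cited reference is an assumption made here, and the sketch below should be read as an illustration of the idea rather than as the authors' procedure. The function names are hypothetical.

```python
def subdominant_ultrametric(D):
    """Minimax-path distances of a symmetric distance matrix D.

    Uses a Floyd-Warshall style update with (min, max) in place of
    (min, +); the result satisfies the strong triangle inequality.
    """
    n = len(D)
    U = [row[:] for row in D]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                U[i][j] = min(U[i][j], max(U[i][k], U[k][j]))
    return U


def is_ultrametric(U):
    """Check d(x, x') <= max(d(x, x''), d(x'', x')) for all triples."""
    n = len(U)
    return all(U[i][j] <= max(U[i][k], U[k][j])
               for i in range(n) for j in range(n) for k in range(n))


# Hypothetical pairwise distances between three cohomology generators.
D = [[0, 1, 4],
     [1, 0, 2],
     [4, 2, 0]]
U = subdominant_ultrametric(D)
```

In the example, the distance between the first and third generators drops from 4 to 2 because they are linked through the second one by hops of size 1 and 2; in the dendrogram they merge at height 2.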

Fig. 2

Schematic chart of the cohomology-based Gromov–Hausdorff approach for quantifying structural similarity. (a) Input coordinate data of structure A. (b) Construct the associated simplicial complex \(K_A\). (c) Compute the cohomology generators. (d) Construct the ultrametric space \(H^p(K_A)\) from the cohomology generators, represented as a dendrogram. Nodes at the bottom correspond to cohomology generators, and they merge at heights equal to their distances. (e) Compute the Gromov–Hausdorff ultrametric \(u_{\text {GH}}\) between each pair of structures. Varying filtration values during the construction of the simplicial complex will result in multiple \(u_{\text {GH}}\) matrices. (f) Use the \(u_{\text {GH}}\) matrices as input to clustering algorithms.

Results

Characterization of OIHP material structures

Organic–inorganic hybrid perovskite (OIHP) materials are favorable candidates for developing efficient and cost-effective solar cells. The stabilization of the OIHP structure relies on van der Waals interactions and hydrogen bonding effects, which are closely associated with the distances between organic molecules and inorganic ions69. Our \(u_{\text {GH}}\)-based features can be employed for characterizing OIHP material structures. To demonstrate this, we examine the clustering of methylammonium lead halides (MAPbX\(_3\), X\(=\)Cl, Br, I). Three possible phases of MAPbX\(_3\) are considered for each X-site atom, namely the orthorhombic, tetragonal, and cubic phases, as illustrated in Fig. 3a,b.

Fig. 3

Illustration of 9 types of OIHP structures with formula MAPbX\(_3\), where MA refers to Methylammonium, Pb is lead and X is bromine (Br), chlorine (Cl) or iodine (I). (a) Three OIHP structures with bromine, chlorine, and iodine. (b) Three phases of OIHP structures, i.e. cubic, orthorhombic, and tetragonal. In total, we consider all 9 possible combinations. (c) Illustration of the construction of the Alpha complex from MAPbBr\(_3\) structures in the cubic phase (left) with a filtration value of 3.5 Å (middle) and 4 Å (right), where the edges are colored by the absolute value of a cohomology generator, with red corresponding to a larger absolute value.

We analyze 100 configurations from the molecular dynamics (MD) trajectory of each of the 9 OIHP structures, giving 900 configurations in total. Each configuration arises from an MD simulation that stabilizes the initial structure and consists of the 3D coordinates of the atoms at a specific time. For each configuration, we construct an Alpha complex at 4 filtration thresholds (3.5 Å, 4 Å, 5 Å, and 6 Å), as shown in Fig. 3c, and calculate its 1-dimensional cohomology generators and their pairwise distances using the \(l^1\) distance of Definition 1. We chose to construct Alpha complexes, which are based on the Delaunay triangulation of the points, because they better reflect the underlying geometry of the atomic arrangement.

We consider the task of clustering the 3 types of X-site atoms for each possible phase of MAPbX\(_3\), by calculating the \(u_{GH}\) between the 300 configurations of interest with the same phase for each of the 4 filtration values. This process generates a \(u_{GH}\)-based feature vector with a length of \(300 \times 4 = 1200\) for each configuration, allowing for differentiation among the three X-sites. We compare the K-means clustering by X-site atoms obtained using \(u_{\text {GH}}\)-based features with the ones achieved using the 3D coordinates and two commonly used structural fingerprints: ECFP28 and MACCS keys29. ECFP is a circular fingerprint that encodes atomic neighborhood patterns, while MACCS keys capture the presence of specific chemical substructures. We chose to compare these features using only structural information without including additional information such as atomic number and weight, which could potentially bias the clustering algorithm towards identifying atom types.

In Table 1, we show the Adjusted Rand Index (ARI) of K-means clustering for the above methods, where 0 indicates a random assignment of clusters and 1 represents a perfect clustering. We also plot a low-dimensional visualization of these features using UMAP70 in Fig. 4, with the x-axis and y-axis representing the two dimensions after dimension reduction by UMAP. Using the 3D coordinates alone proves ineffective in distinguishing between the various X-site atoms, as shown by the low ARI. Incorporating some neighborhood information using ECFP improves the ARI but is still not ideal, since the structures have similar neighborhood patterns. MACCS achieves decent clustering, although the ARI remains below 0.9 for the tetragonal structures. On the other hand, our Gromov–Hausdorff ultrametric-based method exhibits almost perfect clustering by X-site atoms. The UMAP plot in Fig. 4 provides additional visual confirmation that our method achieves superior clustering in the low-dimensional projection of the data.
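For readers unfamiliar with the ARI, the sketch below computes it from two label lists using the standard pair-counting formula, with no external dependencies. It is an illustration of the evaluation metric only (function name and example labels are assumptions), not a reproduction of the reported numbers. The score is invariant to renaming clusters, and it can be negative when the agreement is worse than expected by chance.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index via the pair-counting contingency formula."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in rows.values())
    sum_b = sum(comb(c, 2) for c in cols.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (sum_ij - expected) / (max_index - expected)


# Hypothetical X-site labels for six configurations.
truth = ["Br", "Br", "Cl", "Cl", "I", "I"]
perfect = [2, 2, 0, 0, 1, 1]  # same partition, different names

ari_perfect = adjusted_rand_index(truth, perfect)
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```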

Table 1 Adjusted Rand Index of clustering based on different features.
Fig. 4

Visualization of OIHP molecular configurations using UMAP. The x-axis and y-axis represent the two dimensions after dimensionality reduction by UMAP.

Discussion

In this work, we propose a new framework that employs the cohomology-based Gromov–Hausdorff ultrametric (\(u_{GH}\)) for quantifying structural similarity between molecular structures. In our workflow, we build simplicial complex representations of molecular structures and compute the \(u_{GH}\) between their respective cohomology generator spaces. The cohomology generators effectively encode the local topological invariants, revealing cyclic patterns formed by edges. We illustrate the application of the cohomology-based Gromov–Hausdorff ultrametric using organic–inorganic halide perovskites (OIHP) data. In our numerical experiments, the \(u_{GH}\)-based approach demonstrated effectiveness in clustering various structures, achieving the highest Adjusted Rand Index among the structural features tested, including the 3D coordinates, ECFP, and MACCS.

We demonstrate the application of the cohomology-based \(u_{GH}\) approach in quantifying structural similarities. However, there are some limitations to our current work. For instance, we focused solely on the cohomology generators of the first-order Hodge Laplacian, which capture loop patterns. In the future, exploring non-cohomology generators may reveal clustering information, and higher-order Hodge Laplacians could unveil more complex structures such as cavities. Another limitation is that the choice of kernel vectors for the Hodge Laplacian is not unique. Therefore, for the same structure, we may have different choices of cohomology generators. It remains an open question for future research to determine how to find the optimal cohomology generators that minimize the \(u_{GH}\) between two structures. Other potential future directions involve incorporating cohomology-based \(u_{GH}\) features into machine learning models for structure design and prediction. Topological deep learning models have also demonstrated great potential in molecular property prediction and analysis71. Additionally, our numerical experiments are limited to small molecules from chemistry and physics with hundreds of atoms. A promising avenue involves applying the proposed method to larger biological molecules with tens of thousands of atoms. For instance, it could be employed to quantify structural similarities between protein structures in drug design72.