Abstract
Crystal structure similarity is useful for the chemical analysis of today's large materials databases and for data mining new materials. Here we propose using the two-dimensional Wasserstein distance (earth mover’s distance) to measure the compositional similarity between compounds, based on the periodic table representation of compositions. To demonstrate the effectiveness of our approach, 1568 Cu-S based compounds are taken from the inorganic crystal structure database (ICSD) to form a validation dataset. Using local structure order parameters as a geometrical similarity metric, a similarity matrix that includes both compositional and geometrical similarities is calculated. All the Cu-S compounds are then clustered into 86 groups using the similarity matrix and the “density-based spatial clustering of applications with noise” (DBSCAN) algorithm. Selected groups are analyzed by visualizing the crystal structures of hundreds of compounds, which provides chemical insight into the similarity metrics and shows the effectiveness of the clustering. A group of rare-earth-containing layered Cu-S compounds is proposed for further experimental investigation as potential thermoelectric materials, based on the structure-property consideration that similar structures tend to have similar properties. The unsupervised clustering approach in this work can be easily applied to other datasets, where it can aid chemical understanding of materials datasets and help discover new materials with similar properties based on the similarity metrics.
Introduction
Crystal structure similarity is useful for materials database analysis and new materials discovery, based on the consideration that materials with similar structures are likely to have similar physical and chemical properties1,2. Several tasks in materials informatics rely on a structure similarity measure, for example: systematic search for similar compounds when a promising compound with certain properties is known1; generation of dissimilar structures to facilitate potential energy surface search in evolutionary algorithms3,4; development of kernels in machine learning algorithms for materials property prediction5,6; and clustering of compounds to identify a small region or a specific group for further detailed investigation7. The key step in all of these applications is the definition of a crystal structure similarity measure. Crystal structure similarity can be divided into geometrical and compositional similarity.
The main contribution of this work is a compositional similarity measure using the Wasserstein distance and the periodic table. The Wasserstein distance measures the optimal (minimum) transport cost between two distributions, i.e. how much work is needed to turn one distribution into the other (see Fig. 1b for an illustration). It is worth noting that the earth mover’s distance (EMD), a term introduced and then widely used by image processing researchers8, is equivalent to the 1st Wasserstein distance9. The name reflects the picture of moving “dirt” around until one distribution matches the other, which makes the method useful for comparing distributions such as histograms or pixel values in images. Recently, the Wasserstein distance has been adopted as a measure of compositional similarity7. In contrast to prior methods such as max matching10, it is simple yet effective, and can cluster compounds in a chemically understandable way7. In former investigations, the compositions are typically represented by a one-dimensional vector7,10.
In this work, we propose to represent the compositions using a two-dimensional matrix, called the “periodic table representation (PTR)” (Fig. 1b). Using the PTR as the input of a convolutional neural network has been found to improve prediction accuracy compared with a one-dimensional vector input11,12. The layout of the periodic table contains chemical information in two dimensions, i.e. the elemental properties have trends in both the vertical and horizontal directions. Therefore, the PTR provides more information than a one-dimensional representation such as the Pettifor scale13,14. In this work, the compositional similarity measure is the Wasserstein distance computed on the PTR. As for the geometrical similarity, there are many options such as XTALCOMP15, CMPZ16, SOAP17, COMPSTRU18 and pymatgen19. A detailed discussion of all these methods is beyond the scope of this paper. Considering code accessibility and a recent benchmark study20, we choose “local structure order parameters”21,22, implemented in pymatgen19 and Matminer23, as the geometrical similarity measure. The compositional and geometrical similarity measures together give the crystal structure similarity.
To further demonstrate how crystal structure similarity can be used for materials dataset analysis and new materials discovery, a dataset of Cu-S based compounds from the inorganic crystal structure database (ICSD) is taken as an example. These compounds are low-cost and eco-friendly materials, and some of them are earth-abundant minerals. They are well studied as functional materials, including thermoelectrics, commonly featuring a moderate power factor (approximately 10⁻³ W m⁻¹ K⁻²) and very low lattice thermal conductivity (approximately 1 W m⁻¹ K⁻¹)24,25. It is worth noting that one purpose of this work is to identify new thermoelectric materials for potential experimental verification, and Cu-S based compounds form a suitable subset for this purpose, as demonstrated in our previous work25.
Methods
Compositional periodic table representation (PTR)
The composition is represented using a 7 × 32 matrix (Fig. 1a). Unlike the common layout of the periodic table, the rare earth elements are placed inside the main table rather than appended below it. The value at the position of each element in the matrix is set to the atomic fraction of that element in the formula. Take Cu2ZnSnS4 as an example (Fig. 1b): the PTR has values of 0.25, 0.125, 0.125 and 0.5 at the positions of Cu, Zn, Sn and S, respectively. The values for all other elements are set to zero, so only the part of the PTR with non-zero values is shown in Fig. 1b.
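The construction above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in this work: the helper names (`ptr_position`, `ptr_matrix`) are hypothetical, and the element positions assume a 32-column layout in which the f-block sits between the s-block and the d-block, which reproduces the zero-based positions quoted for Cu, Sn and S in the next subsection.

```python
# Sketch only: positions assume a 32-column table with the f-block placed
# between the s-block and the d-block (an assumption, see the lead-in).
PERIOD_STARTS = [1, 3, 11, 19, 37, 55, 87]    # atomic number of the first element of each period
PERIOD_LENGTHS = [2, 8, 8, 18, 18, 32, 32]    # number of elements in each period

# Only the elements needed for the example; extend as required.
ATOMIC_NUMBER = {"S": 16, "Cu": 29, "Zn": 30, "Sn": 50}


def ptr_position(z):
    """Return the zero-based (row, column) of atomic number z in a 7 x 32 table."""
    for row, (start, length) in enumerate(zip(PERIOD_STARTS, PERIOD_LENGTHS)):
        if start <= z < start + length:
            offset = z - start
            n_left = 1 if length == 2 else 2      # elements kept at the left edge
            if offset < n_left:
                return row, offset
            return row, 32 - (length - offset)    # right-align the rest of the period
    raise ValueError(f"atomic number {z} is outside the table")


def ptr_matrix(formula_counts):
    """Build the 7 x 32 PTR (nested lists) from {element symbol: atom count}."""
    total = sum(formula_counts.values())
    ptr = [[0.0] * 32 for _ in range(7)]
    for symbol, count in formula_counts.items():
        row, col = ptr_position(ATOMIC_NUMBER[symbol])
        ptr[row][col] = count / total             # atomic fraction of this element
    return ptr


# Cu2ZnSnS4: 8 atoms per formula unit
ptr_czts = ptr_matrix({"Cu": 2, "Zn": 1, "Sn": 1, "S": 4})
```

Nested lists keep the sketch free of dependencies; in practice a NumPy array of the same shape would be a natural choice.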
Compositional Wasserstein distance
All pairwise Wasserstein distances in the dataset are calculated using the EMD function implemented in OpenCV-Python26. For an intuitive understanding of how the Wasserstein distance is calculated, Fig. 1b gives an example for the PTRs of Cu2ZnSnS4 and Cu3SnS4. The PTR of Cu2ZnSnS4 is 0.25 Cu, 0.125 Zn, 0.125 Sn and 0.5 S; the PTR of Cu3SnS4 is 0.375 Cu, 0.125 Sn and 0.5 S. Therefore, moving an amount of 0.125 from the Zn position to the Cu position (yellow arrow in Fig. 1b) transforms PTR-Cu2ZnSnS4 into PTR-Cu3SnS4. The amount moved (here 0.125) times the distance moved (here 1, from Zn to the adjacent Cu position) gives the Wasserstein distance between the two PTRs.
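The worked example can be written as a small transport problem. The notation below is introduced here for illustration only: p_i and q_j are the atomic fractions at positions i and j of the two PTRs, f_ij is the amount moved from i to j, and d_ij is the distance between the two positions in the table.

```latex
W(P, Q) \;=\; \min_{f_{ij} \ge 0} \sum_{i,j} f_{ij}\, d_{ij}
\quad \text{subject to} \quad
\sum_{j} f_{ij} = p_i , \qquad \sum_{i} f_{ij} = q_j .

% Cu2ZnSnS4 -> Cu3SnS4: only the Zn fraction has to move, and it moves to Cu
W \;=\; f_{\mathrm{Zn}\to\mathrm{Cu}} \, d_{\mathrm{Zn},\mathrm{Cu}}
  \;=\; 0.125 \times 1 \;=\; 0.125 .
```

All other fractions (0.25 Cu, 0.125 Sn, 0.5 S) are common to both PTRs and stay in place at zero cost.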
The implementation of the Wasserstein distance in OpenCV uses the following algorithm: Firstly, each distribution is represented as a “signature”, i.e. a matrix where each row corresponds to a pixel and contains its value and coordinates, in the form of [composition fraction, periodic table column number, periodic table row number]. For example, the matrix of Cu3SnS4 is (note that starting column/row number is 0 instead of 1): [[0.375, 24, 3], [0.125, 27, 4], [0.5, 29, 2]]. Then the cost of moving a cube from one point to another is calculated using Euclidean distance (cv2.DIST_L2). The algorithm computes the optimal flow of the cubes from one distribution to the other, minimizing the total cost of moving. This is done using a transportation algorithm. Finally, the Wasserstein distance is the sum of the costs for the optimal flow.
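To make the steps above concrete, the following sketch solves the same transport problem on the signature format described here. It is a stand-in written in pure Python (a successive-shortest-path minimum-cost flow), not OpenCV's implementation; the function name `emd` and its arguments are hypothetical, and the two signatures are assumed to have equal total weight, as is the case for atomic fractions.

```python
import math


def emd(sig_a, sig_b, eps=1e-12):
    """Minimum transport cost between two signatures of equal total weight.

    Each signature is a list of (weight, column, row) tuples.
    """
    n, m = len(sig_a), len(sig_b)
    source, sink = n + m, n + m + 1
    # graph[u] holds edges as [target, residual capacity, cost, index of reverse edge]
    graph = [[] for _ in range(n + m + 2)]

    def add_edge(u, v, capacity, cost):
        graph[u].append([v, capacity, cost, len(graph[v])])
        graph[v].append([u, 0.0, -cost, len(graph[u]) - 1])

    for i, (weight, _, _) in enumerate(sig_a):
        add_edge(source, i, weight, 0.0)
    for j, (weight, _, _) in enumerate(sig_b):
        add_edge(n + j, sink, weight, 0.0)
    for i, (_, col_a, row_a) in enumerate(sig_a):
        for j, (_, col_b, row_b) in enumerate(sig_b):
            # Euclidean ground distance between the two table positions
            add_edge(i, n + j, math.inf, math.hypot(col_a - col_b, row_a - row_b))

    total_cost = 0.0
    while True:
        # Bellman-Ford: cheapest residual path from source to sink.
        dist = [math.inf] * len(graph)
        parent = [None] * len(graph)
        dist[source] = 0.0
        for _round in range(len(graph) - 1):
            changed = False
            for u, edges in enumerate(graph):
                if dist[u] == math.inf:
                    continue
                for k, (v, capacity, cost, _rev) in enumerate(edges):
                    if capacity > eps and dist[u] + cost < dist[v] - eps:
                        dist[v], parent[v], changed = dist[u] + cost, (u, k), True
            if not changed:
                break
        if dist[sink] == math.inf:
            return total_cost                     # nothing left to move
        # Push as much weight along the path as its tightest edge allows.
        amount, v = math.inf, sink
        while v != source:
            u, k = parent[v]
            amount = min(amount, graph[u][k][1])
            v = u
        v = sink
        while v != source:
            u, k = parent[v]
            graph[u][k][1] -= amount
            graph[v][graph[u][k][3]][1] += amount
            v = u
        total_cost += amount * dist[sink]


# Signatures as [fraction, column, row], zero-based; Zn is taken to sit next to Cu.
sig_cu2znsns4 = [(0.25, 24, 3), (0.125, 25, 3), (0.125, 27, 4), (0.5, 29, 2)]
sig_cu3sns4 = [(0.375, 24, 3), (0.125, 27, 4), (0.5, 29, 2)]
distance = emd(sig_cu2znsns4, sig_cu3sns4)
```

For the pair shown, the only weight that has to move is the 0.125 at the Zn position, which goes to the neighbouring Cu position. A general-purpose solver is needed because real pairs of compositions usually require several such moves.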
Fig. 1. (a) The periodic table used for representation of the compositions. (b) Example of the periodic table representations (PTRs) and the Wasserstein distance (also known as earth mover’s distance, EMD). Only the part of the PTR with non-zero values is shown. The number of cubes represents the atomic fraction of the corresponding element. The yellow arrow indicates how PTR-Cu2ZnSnS4 can be transformed into PTR-Cu3SnS4 by moving an amount of 0.125 from Zn to Cu; the EMD between the two compositions is therefore 0.125 × 1, where 1 is the distance between Cu and Zn.
Local structure order parameters (LoStOPs) and geometrical similarity
LoStOPs are designed to rapidly detect local coordination environments21. They determine to what extent the angles of a given coordination environment agree with those of the ideal coordination environment. LoStOPs are implemented in the CrystalNNFingerprint module in matminer23. Here we briefly describe the procedure; more details can be found in the original publication21. There are 20 pre-defined LoStOPs describing the ideal coordination environments up to a coordination number of 12. Using these LoStOPs, a site fingerprint feature vector is computed for each atom in a structure. The structure fingerprint, which is representative of the coordination patterns of the whole structure, is computed from statistics over all the site fingerprints. In this work, the statistics are the mean, standard deviation, minimum and maximum, giving one vector of length 144 for each crystal structure. The geometrical similarity is obtained by calculating the Euclidean distance between each pair of these vectors.
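The pooling step from site fingerprints to a structure fingerprint can be illustrated as follows. This is only a sketch of the statistics described above: the site fingerprints are made-up three-component vectors rather than real LoStOP output, the function names are hypothetical, and the population standard deviation is assumed.

```python
import math
import statistics


def structure_fingerprint(site_fingerprints):
    """Concatenate the mean, standard deviation, minimum and maximum of each
    site-fingerprint component over all sites of a structure."""
    columns = list(zip(*site_fingerprints))       # one tuple per fingerprint component
    stats = (statistics.fmean, statistics.pstdev, min, max)
    return [stat(column) for stat in stats for column in columns]


def geometrical_distance(fp_a, fp_b):
    """Euclidean distance between two structure fingerprints."""
    return math.dist(fp_a, fp_b)


# Toy site fingerprints (one row per atom); the values are illustrative only.
sites_a = [[1.0, 0.0, 0.2], [0.8, 0.2, 0.0]]
sites_b = [[1.0, 0.0, 0.2], [1.0, 0.0, 0.2]]

fp_a = structure_fingerprint(sites_a)
fp_b = structure_fingerprint(sites_b)
d_ab = geometrical_distance(fp_a, fp_b)
```

The fingerprint length is four times the length of a site fingerprint, independent of the number of atoms, so structures with different cell sizes can be compared directly.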
Clustering using crystal structure similarity
The compositional and geometrical distance matrices are normalized and added together to give the crystal structure distance matrix. Based on these pre-calculated distance values, the density-based spatial clustering of applications with noise (DBSCAN) method implemented in Scikit-learn27 is used to cluster the compounds. For comparison, DBSCAN clustering is also computed using only the compositional or only the geometrical similarity matrix. The DBSCAN parameters are tuned so that the number of groups is less than 100.
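The procedure can be sketched as below. Two points are assumptions made for illustration: the normalization is taken to be division by the largest entry of each matrix (the scheme is not specified above), and a compact pure-Python DBSCAN on a precomputed distance matrix stands in for the Scikit-learn implementation, which accepts such a matrix through `metric="precomputed"`. The function names and the toy distances are hypothetical.

```python
def normalize(matrix):
    """Scale a distance matrix to [0, 1] by its largest entry (assumed scheme)."""
    largest = max(max(row) for row in matrix)
    return [[value / largest for value in row] for row in matrix]


def combine(comp_dist, geom_dist):
    """Add the normalized compositional and geometrical distance matrices."""
    comp, geom = normalize(comp_dist), normalize(geom_dist)
    return [[c + g for c, g in zip(row_c, row_g)] for row_c, row_g in zip(comp, geom)]


def dbscan_precomputed(dist, eps, min_samples):
    """Label each point with a group index, or -1 if it belongs to no group."""
    n = len(dist)
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    labels = [None] * n                           # None means not visited yet
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:       # not a core point
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:                   # former noise becomes a border point
                labels[j] = cluster
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:  # expand only from core points
                queue.extend(neighbors[j])
    return labels


# Toy data: five "compounds" placed on a line in each similarity space.
comp_pos = [0.0, 1.0, 2.0, 10.0, 20.0]
geom_pos = [0.0, 0.5, 1.0, 5.0, 10.0]
comp_dist = [[abs(a - b) for b in comp_pos] for a in comp_pos]
geom_dist = [[abs(a - b) for b in geom_pos] for a in geom_pos]

total_dist = combine(comp_dist, geom_dist)
labels = dbscan_precomputed(total_dist, eps=0.15, min_samples=2)
```

In this toy case the first three points form one group and the last two are labeled -1. Increasing `eps` merges groups and reduces the number of unassigned points, which is the knob referred to above when the number of groups is kept below 100.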
Dataset of Cu-S based compounds
The inorganic crystal structure database (ICSD, 2020 version) has 2328 entries that contain both copper and sulfur. After removing duplicate entries using pymatgen19, 1568 distinct compounds remain. Two entries are considered duplicates when (1) they have the same composition and (2) the StructureMatcher function returns ‘true’ after comparing the structures. All the parameters of StructureMatcher were set to their default values except the fractional length tolerance and the site tolerance, both of which were set to 0.1. Figure S1 shows the elemental distribution of these 1568 Cu-S based compounds. The dataset covers 76 elements, which makes it a good representation of the whole ICSD. This is not surprising considering that Cu-S compounds, especially as minerals, have been well studied for centuries.
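The two-step duplicate test can be sketched generically. In the sketch below the structure comparison is passed in as a callable, so that it runs without pymatgen; with pymatgen, the callable would presumably be `StructureMatcher(ltol=0.1, stol=0.1).fit`. The helper names are hypothetical, "same composition" is assumed to mean the same reduced formula, and integer atom counts are assumed for simplicity.

```python
from functools import reduce
from math import gcd


def reduced_composition(counts):
    """Divide integer atom counts by their greatest common divisor."""
    divisor = reduce(gcd, counts.values())
    return tuple(sorted((element, n // divisor) for element, n in counts.items()))


def remove_duplicates(entries, same_structure):
    """Keep the first entry of every set that matches in composition and structure.

    entries: list of (atom counts, structure) pairs.
    same_structure: callable(structure_a, structure_b) returning True or False.
    """
    kept = {}
    for counts, structure in entries:
        bucket = kept.setdefault(reduced_composition(counts), [])
        # Structures are compared only within the same reduced composition.
        if not any(same_structure(structure, other) for other in bucket):
            bucket.append(structure)
    return [(dict(comp), s) for comp, bucket in kept.items() for s in bucket]


# Toy entries: strings act as placeholders for structure objects.
entries = [
    ({"Cu": 2, "S": 1}, "structure A"),
    ({"Cu": 4, "S": 2}, "structure A"),   # same reduced formula, same structure
    ({"Cu": 2, "S": 1}, "structure B"),   # same formula, different structure
    ({"Cu": 1, "S": 1}, "structure A"),   # different formula
]
unique = remove_duplicates(entries, same_structure=lambda a, b: a == b)
```

Grouping by composition first keeps the number of structure comparisons small, since the comparison is the expensive step.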
Crystal structure visualization
To test the effectiveness of the similarity measure and the DBSCAN clustering, all the compounds are visualized using VESTA28 and inspected one by one by the authors. The purpose of this manual inspection is to find out whether any compound is very different from the others in the same group, using the topological bonding patterns automatically generated by VESTA. This is done specifically for groups with a large number of compounds. It is worth noting that using a Cu-S subset (instead of the whole ICSD) makes this manual check feasible.
Results and discussion
The DBSCAN clustering algorithm identified 86 groups, plus 406 compounds that do not belong to any group. The number of compounds in each group is plotted in Fig. 2a. Some selected groups, which are of interest for their functional properties, are labeled. These groups will be discussed in detail later. Figure 2b shows the t-distributed stochastic neighbor embedding (t-SNE) plot obtained from the total similarity matrix, i.e. the compositional and geometrical similarity matrices added together. The dots are colored according to the selected groups, and the clustering patterns are clear, which indicates the effectiveness of the clustering. We also compared the visualization outcomes of t-SNE, Isometric Mapping (IsoMap) and Multidimensional Scaling (MDS), shown in Figure S2, using the DBSCAN results from the Wasserstein distance. t-SNE provided the most distinct visual differentiation among the methods evaluated.
To demonstrate the effectiveness of the Wasserstein distance as a similarity measure, we compared the results with those from the Cosine and Euclidean distances using the PTR of the compositions. The Cosine and Euclidean compositional similarity matrices were also added to the LoStOPs matrix, using the same method as described in the Methods section. The numbers of DBSCAN groups for Wasserstein, Cosine and Euclidean are 86, 30 and 52, respectively. This result indicates that the Wasserstein distance resolves more distinct groups of compounds. Take group 55 from the Wasserstein results as an example; the comparison is shown in Table S1. Wasserstein group 55 corresponds to Euclidean group 10, but Euclidean group 10 contains several dissimilar materials, for example Na4Zr2Cu4S8. This shows that the Euclidean distance is less accurate in classifying compounds. The Cosine distance performs even worse and gives the fewest groups of the three similarity measures.
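One plausible reason for this difference, offered here as an illustration rather than as an analysis from this work, is that the Euclidean and Cosine distances compare the PTR cell by cell and therefore ignore where in the table a fraction has moved, whereas the Wasserstein cost scales with the distance moved. The sketch below compares Cu2ZnSnS4 with Cu3SnS4 (Zn replaced by its neighbour Cu) and with a Ba-substituted composition chosen only for illustration. The table positions of Zn and Ba assume a 32-column layout with the f-block between the s-block and the d-block.

```python
import math


def euclidean(comp_a, comp_b):
    """Euclidean distance between two compositions given as {element: fraction}.

    Each element occupies one PTR cell, so this equals the Euclidean
    distance between the flattened PTR matrices.
    """
    elements = set(comp_a) | set(comp_b)
    return math.sqrt(sum((comp_a.get(el, 0.0) - comp_b.get(el, 0.0)) ** 2 for el in elements))


cu2znsns4 = {"Cu": 0.25, "Zn": 0.125, "Sn": 0.125, "S": 0.5}
cu3sns4 = {"Cu": 0.375, "Sn": 0.125, "S": 0.5}
cu2basns4 = {"Cu": 0.25, "Ba": 0.125, "Sn": 0.125, "S": 0.5}   # illustrative composition

# Cell-by-cell comparison: both substitutions look equally far away.
euclid_near = euclidean(cu2znsns4, cu3sns4)
euclid_far = euclidean(cu2znsns4, cu2basns4)

# Transport cost: 0.125 has to move from Zn to Cu, or from Zn to Ba.
# Assumed zero-based (column, row) positions: Zn (25, 3), Cu (24, 3), Ba (1, 5).
wasserstein_near = 0.125 * math.hypot(25 - 24, 3 - 3)
wasserstein_far = 0.125 * math.hypot(25 - 1, 3 - 5)
```

Here the two Euclidean distances are identical, while the transport cost of the Ba substitution is roughly 24 times that of the Cu substitution, which matches the chemical intuition that Cu is a much closer replacement for Zn than Ba is.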
Fig. 2. (a) Histogram of the 86 groups after DBSCAN clustering of the 1568 Cu-S based compounds. Some selected groups are indicated. (b) t-SNE plot of the 1568 compounds. Compounds that do not belong to any group are shown in light blue, and compounds that belong to groups other than the selected ones are shown in light green.
Figure 3 shows some representative crystal structures from the selected groups in Fig. 2. As only a subset (i.e. 1568 Cu-S compounds) is considered in this work, it is feasible for us to check the groups manually. As described in the Methods section, we opened the structures in VESTA and checked whether they are similar in composition and crystal structure, and whether they should be put into the same group. After manual checking we did not find any misclassified materials, indicating the effectiveness of our approach. Generally, the structures in sub-figures a-c all have a 3D tetrahedral network and can be considered superstructures of zinc blende. In other words, these structures can be obtained by cation mutation29 or by adding/removing atoms25,30 from the zinc blende parent phase. These structures, as well as others in the same groups, are well studied in the literature, so they are analyzed below to demonstrate the accuracy of the clustering results. The structures in sub-figures d-g are layered, and all have 2D tetrahedral layers intercalated by metal atoms or oxides. These are not well studied and are analyzed below to show how the clustering results can be used for new materials discovery.
Fig. 3. Representative crystal structures of the selected groups. The structures are visualized using VESTA28. In all the sub-figures, the small yellow spheres are sulfur atoms and the copper atoms are located inside the blue tetrahedra. The black lines indicate the unit cell.
Table 1 lists relevant information on the selected groups. Group 0 is the largest group, with 231 diamond-like Cu-S based compounds. These compounds are of general interest for thermoelectrics31,32 and photovoltaics29. They have corner-sharing tetrahedra, and the coordination number of all the cations and anions is four (see the stannite structure in Fig. 3 as an example). Therefore, clustering of these compounds is straightforward. It is worth noting that, because the geometrical similarity only accounts for local structure, some wurtzite-related structures are also put into this group. Topologically they are very similar to the zinc blende related structures in terms of coordination and polyhedral connection. They can be easily distinguished by their space groups33. Group 6 contains colusite-like compounds, which have recently been well studied as thermoelectric materials34. This group serves as a good example of the benefit of including both compositional and geometrical similarities. Mawsonite, stannoidite and colusite have a common structural feature: a cation-sulfur tetrahedron that is oriented differently from the other tetrahedra (the red tetrahedron in Fig. 3b), with the number of cations exceeding the number of anions by 1 because of this extra mis-oriented tetrahedron. Linus Pauling proposed that such a structural feature can be attributed to the large residual charge on the tetrahedrally coordinated cation35. Therefore, from a chemical point of view, these compounds should be put into the same group, which demonstrates the effectiveness of the similarity matrix. In addition, one compound (Cu16Zn2In0.12Fe3.88Ge4S24, ICSD 259855) is falsely labeled as -1, i.e. as not belonging to any group, when only the geometrical similarity is used.
Groups 40–44 are tetrahedrite-like compounds, which have recently been well studied as thermoelectric materials36,37,38,39. There are several variations from the ideal tetrahedrite (Cu12Sb4S13, ICSD 25707), such as differences in sulfur stoichiometry or cation mutations. Therefore, DBSCAN puts them into several subgroups. If we adjust the DBSCAN parameters, e.g. by increasing the cutoff distance, these subgroups merge into one. It is worth noting that the spinel structures, which form the second largest group shown in Fig. 2a, are not discussed here, as we are only interested in structures with a tetrahedral network.
Fig. 4. Formation enthalpy and band gap of group 55 rare-earth intercalated compounds (orange dots) and group 62 KCu4S3-like compounds (green dots). The data are from density functional theory calculations in the Materials Project40. In general, similar compounds in the same group have similar values for these two properties.
The remaining groups in Table 1 all have layered structures. The layered oxysulfides in group 61 and the BiCuSO-type compounds in group 65 were visualized and inspected. No outlier compound was found in these two groups, i.e. DBSCAN does put similar compounds in the same group. These compounds are of general interest as functional materials41,42. They usually have complex structures, which makes further analysis difficult. Therefore, we choose two other types of layered sulfides, the group 55 rare-earth intercalated layered sulfides and the group 62 KCu4S3-like sulfides, to demonstrate that materials with similar structures are likely to have similar physical and chemical properties. We also discuss how this can be used for new materials discovery.
Figure 4 shows the formation enthalpy and band gap of some compounds in groups 55 and 62. These values are from density functional theory (DFT) calculations in the Materials Project; compounds in groups 55 and 62 that have not been calculated yet are not shown. Patterns can be found: for example, the KCu4S3-like sulfides are all metallic, and the rare-earth intercalated layered sulfides have very close formation enthalpy values. This indicates that similar compounds tend to have similar properties, and hence the similarity measure and clustering can facilitate systematic search: if a compound in a group has a desired property, then the whole group can be investigated, and materials with superior properties may be found. Consider group 55 as an example43: recently, the compounds YCu3Te3 (equivalent to Y0.67Cu2Te2) and DyCu3Te3 have been reported as effective thermoelectric materials with zT values around 0.9 at 900 K. These ternary compounds have a trigonal structure (R3̅) and are semiconductors with similar band structures and moderate band gaps ranging from 0.69 to 0.82 eV. Based on similarity considerations, Y0.67Cu2S2 and the whole of group 55 may be considered for further investigation as thermoelectric materials.
To further demonstrate the approach, a comparison of more groups is shown in Figure S3, where clear clustering patterns are evident in both the band gap and the formation enthalpy DFT values. For example, in Figure S3a the materials in group 8 have band gap values of around 1.5 to 2.0 eV, while group 17 shows band gap values of around 1.0 to 1.5 eV, slightly lower than group 8. Another example is that the materials in groups 8 and 76 have formation enthalpy values of around −1.7 eV and −2.0 eV, respectively. Therefore, materials with similar properties are grouped together by our approach, allowing for easy comparison of material families with shared electronic and thermodynamic properties.
It is worth noting that there are some outlier values in Figure S3. The possible reasons are: (1) The chemical similarity measure based on compositional information and crystal structure is only a rough proxy for physical properties, especially for band gap values, which are determined in reciprocal space. In the groups shown in Figure S3a, most materials are insulators, but several outliers are metallic. This might arise from complex interactions between the atoms that a simple chemical measure cannot capture. (2) Some DFT calculations may need more accurate pseudopotentials or parameter settings. As the Materials Project continues to improve the accuracy of its DFT data, our method may help with the detection of abnormal values in the DFT calculations.
Conclusion
A periodic table representation of compositions and the two-dimensional Wasserstein distance (earth mover’s distance) are proposed as a compositional similarity measure. A dataset of 1568 Cu-S based compounds is used as an example to test this approach. Together with a geometrical similarity measure and DBSCAN, the Cu-S based compounds are clustered into 86 groups. Manual structure visualization of several selected groups shows that the approach correctly puts chemically similar compounds into the same group. DFT-calculated formation enthalpies and band gaps of several compounds support the idea that similar structures tend to have similar properties. As a demonstration of how similarity can be used for new materials discovery, a group of rare-earth-containing layered Cu-S compounds is proposed for further experimental investigation as potential thermoelectric materials.
Data availability
The research data supporting the results of this manuscript can be accessed at https://github.com/zhangrz1983/PeriodicTableWasserstein.
References
Isayev, O. et al. Chem. Mater. 27 (3), 735–743 (2015).
Bender, A. & Glen, R. C. Org. Biomol. Chem. 2 (22), 3204–3218 (2004).
Oganov, A. R. & Valle, M. J. Chem. Phys. 130 (10), 104504 (2009).
Oganov, A. R., Pickard, C. J., Zhu, Q. & Needs, R. J. Nat. Rev. Mater. 4 (5), 331–348 (2019).
Bartók, A. P., Kondor, R. & Csányi, G. Phys. Rev. B 87 (18), 184115 (2013).
Bartók, A. P. et al. Sci. Adv. 3 (12), e1701816 (2017).
Hargreaves, C. J., Dyer, M. S., Gaultois, M. W., Kurlin, V. A. & Rosseinsky, M. J. Chem. Mater. 32 (24), 10610–10620 (2020).
Rubner, Y., Tomasi, C. & Guibas, L. J. Int. J. Comput. Vision 40 (2), 99–121 (2000).
Villani, C. Optimal transport -- Old and new (Springer, 2008).
Yang, L. & Ceder, G. Phys. Rev. B 88 (22), 224107 (2013).
Zheng, X., Zheng, P. & Zhang, R. Z. Chem. Sci. 9 (44), 8426–8432 (2018).
Zheng, X., Zheng, P., Zheng, L., Zhang, Y. & Zhang, R. Z. Comput. Mater. Sci. 173, 109436 (2020).
Pettifor, D. G. J. Chem. Soc., Faraday Trans. 86 (8), 1209 (1990).
Glawe, H., Sanna, A., Gross, E. K. U. & Marques, M. A. L. New J. Phys. 18 (9), 093011 (2016).
Lonie, D. C. & Zurek, E. Comput. Phys. Commun. 183 (3), 690–697 (2012).
Hundt, R., Schön, J. C. & Jansen, M. J. Appl. Crystallogr. 39 (1), 6–16 (2006).
De, S., Bartok, A. P., Csanyi, G. & Ceriotti, M. Phys. Chem. Chem. Phys. 18 (20), 13754–13769 (2016).
de la Flor, G., Orobengoa, D., Tasci, E., Perez-Mato, J. M. & Aroyo, M. I. J. Appl. Crystallogr. 49 (2), 653–664 (2016).
Ong, S. P. et al. Comput. Mater. Sci. 68, 314–319 (2013).
Pan, H. et al. Inorg. Chem. 60 (3), 1590–1603 (2021).
Zimmermann, N. E. R. & Jain, A. RSC Adv. 10 (10), 6063–6081 (2020).
Zimmermann, N. E. R. & Jain, A. Acta Crystallogr. Sect. A 74 (a1), a209 (2018).
Ward, L. et al. Comput. Mater. Sci. 152, 60–69 (2018).
Suekuni, K. & Takabatake, T. APL Mater. 4 (10), 104503 (2016).
Zhang, R., Chen, K., Du, B. & Reece, M. J. J. Mater. Chem. A 5 (10), 5013–5019 (2017).
Bradski, G. in Dr. Dobb’s Journal of Software Tools (2000).
Pedregosa, F. et al. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Momma, K. & Izumi, F. J. Appl. Crystallogr. 41, 653–658 (2008).
Wang, C. et al. Chem. Mater. 26 (11), 3411–3417 (2014).
Di Paola, C., Macheda, F., Laricchia, S., Weber, C. & Bonini, N. Phys. Rev. Res. 2 (3), 033055 (2020).
Zhang, J. et al. Adv. Mater. 26 (23), 3848–3853 (2014).
Zhang, R. Z., Gucci, F., Zhu, H., Chen, K. & Reece, M. J. Inorg. Chem. 57 (20), 13027–13033 (2018).
Hicks, D. et al. Npj Comput. Mater. 7 (1), 30 (2021).
Guélou, G., Lemoine, P., Raveau, B. & Guilmeau, E. J. Mater. Chem. C 9 (3), 773–795 (2021).
Pauling, L. Tschermaks Mineralogische und Petrographische Mitteilungen 10 (1), 379–384 (1965).
Lu, X. et al. Adv. Energy Mater. 3 (3), 342–348 (2013).
Chetty, R., Bali, A. & Mallik, R. C. J. Mater. Chem. C 3 (48), 12364–12378 (2015).
Du, B. et al. J. Mater. Chem. C 7 (2), 394–404 (2019).
Du, B., Zhang, R., Chen, K., Mahajan, A. & Reece, M. J. J. Mater. Chem. A 5 (7), 3249–3259 (2017).
Jain, A. et al. APL Mater. 1 (1), 011002 (2013).
Luu, S. D. N. & Vaqueiro, P. J. Materiomics 2 (2), 131–140 (2016).
Larquet, C. & Carenco, S. Front. Chem. 8, 179 (2020).
Wang, T. et al. ACS Appl. Mater. Interfaces 12 (36), 40486–40494 (2020).
Acknowledgements
This work is financially supported by National Key Research and Development Program (2023YFB3003004), Shandong Provincial Key Research and Development Program (2022CXGC020106), Taishan Youth Scholar Project of Shandong Province (tsqn202312313) and Pilot Project for Integrated Innovation of Science, Education and Industry of Qilu University of Technology (Shandong Academy of Sciences) (2022JBZ01-01).
Author information
Contributions
S.H.: Formal analysis (equal), Investigation (equal), Writing - original draft (equal). T.X.: Formal analysis (equal), Investigation (equal). R.Z.: Conceptualization (equal), Formal analysis (equal), Writing - original draft (equal), Writing - review & editing (equal). Meng Guo: Funding acquisition (equal), Supervision (equal).
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Hao, S., Xia, T., Zhang, R. et al. Clustering Cu-S based compounds using periodic table representation and compositional Wasserstein distance. Sci Rep 14, 31602 (2024). https://doi.org/10.1038/s41598-024-79126-3
DOI: https://doi.org/10.1038/s41598-024-79126-3






